Loading & Cleaning Real Data
read_csv/read_excel to load, then the cleaning loop: inspect → fix types → handle missing values → drop duplicates → standardise text. Trust the data, then analyse.Loading a file
One line pulls a CSV or Excel file straight into a DataFrame:
import pandas as pd
df = pd.read_csv("sales.csv")
# Excel (needs openpyxl, which you installed in Chapter 1):
df = pd.read_excel("sales.xlsx", sheet_name="Q1")
# real-world files are rarely pristine — these arguments earn their keep:
df = pd.read_csv(
"sales.csv",
parse_dates=["order_date"], # read that column as real dates
thousands=",", # "1,200" -> 1200 (numeric)
na_values=["", "N/A", "NULL"] # treat these strings as missing
)
No file handy? Make one to practise on: df.to_csv("sales.csv", index=False) from any DataFrame, then read it back. (And if you ever need data straight from a database, pd.read_sql() takes a query and a connection — your PDO instincts, in pandas.)
🐘 PHP: Remember hand-rolling CSV parsing with fgetcsv() line by line? read_csv() does all of it — splitting, type inference, headers, encodings — in one call, and hands you a queryable table instead of a raw array of rows.
The cleaning loop
1. Inspect honestly
Same opening move as always, but now you're hunting for problems: wrong dtypes, suspicious null counts, weird min/max values.
df.info() # which columns have nulls? are dates actually 'object' (text)?
df.describe() # any impossible values? negative revenue? a max of 9999999?
df.head(10) # eyeball the actual contents
2. Fix the types
A "revenue" column read as text can't be summed. Convert it. errors="coerce" turns anything unparseable into a missing value instead of crashing — which is what you usually want.
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["region"] = df["region"].astype("category") # saves memory on repeats
3. Handle missing values
The honest first step is to measure the gaps, then decide — fill them or drop them. There's no universal right answer; it's a judgement call you document.
df.isna().sum() # count missing per column
df["revenue"] = df["revenue"].fillna(0) # fill with a constant
df["region"] = df["region"].fillna("Unknown") # fill text
df["score"] = df["score"].fillna(df["score"].median()) # fill with the median
df = df.dropna(subset=["order_date"]) # or drop rows missing a critical field
4. Drop duplicates
df.duplicated().sum() # how many exact-duplicate rows?
df = df.drop_duplicates() # keep the first of each
# duplicates that only matter on certain columns:
df = df.drop_duplicates(subset=["order_id"], keep="last")
5. Standardise text
"North", "north ", and " NORTH" are three different groups to a computer until you normalise them — which is why a groupby sometimes shows the same category twice. The .str accessor runs string methods down a whole column.
df["region"] = df["region"].str.strip().str.title() # trim + "North"
df["email"] = df["email"].str.lower()
df["sku"] = df["sku"].str.replace("-", "", regex=False)
Rescue a Messy File
Goal: take a deliberately dirty dataset and clean it into something analysable — the real first hour of almost any analytics task.
- Create the mess (note the bad types, blanks, dupes, ragged text):
import pandas as pd import numpy as np raw = pd.DataFrame({ "order_id": [1, 2, 2, 3, 4], "region": [" north", "South ", "South ", "EAST", None], "revenue": ["1,200", "800", "800", "N/A", "1500"], }) - Fix revenue:
raw["revenue"] = pd.to_numeric(raw["revenue"].str.replace(",", ""), errors="coerce") - Standardise region:
raw["region"] = raw["region"].str.strip().str.title().fillna("Unknown") - De-dupe on the real key:
clean = raw.drop_duplicates(subset=["order_id"]) - Prove it worked:
clean.info()(revenue now numeric, no dupes) andclean.groupby("region")["revenue"].sum()(groups are clean)
Watch the groupby at the end: because you standardised the text first, "South" is one group, not two. That's the entire reason cleaning comes before analysis — dirty categories silently split your numbers.
clean(df) function in your metrics.py module that applies your standard cleaning steps and returns the tidy DataFrame. Reusable cleaning is a quiet superpower — every new file from the same source becomes a single function call, and your pipeline stays consistent instead of re-improvised each time.