Loading & Cleaning Real Data

Here's the open secret of the job: analysts spend most of their time cleaning data, not modelling it. Real files have missing values, wrong types, duplicate rows, and inconsistent text. This chapter is the unglamorous, high-leverage skill — getting a messy CSV or Excel file into a tidy DataFrame you can actually trust.

read_csv/read_excel to load, then the cleaning loop: inspect → fix types → handle missing values → drop duplicates → standardise text. Trust the data, then analyse.

Loading a file

One line pulls a CSV or Excel file straight into a DataFrame:

import pandas as pd

df = pd.read_csv("sales.csv")
# Excel (needs openpyxl, which you installed in Chapter 1):
df = pd.read_excel("sales.xlsx", sheet_name="Q1")

# real-world files are rarely pristine — these arguments earn their keep:
df = pd.read_csv(
    "sales.csv",
    parse_dates=["order_date"],   # read that column as real dates
    thousands=",",                # "1,200" -> 1200 (numeric)
    na_values=["", "N/A", "NULL"] # treat these strings as missing
)

No file handy? Make one to practise on: df.to_csv("sales.csv", index=False) from any DataFrame, then read it back. (And if you ever need data straight from a database, pd.read_sql() takes a query and a connection — your PDO instincts, in pandas.)

🐘 PHP: Remember hand-rolling CSV parsing with fgetcsv() line by line? read_csv() does all of it — splitting, type inference, headers, encodings — in one call, and hands you a queryable table instead of a raw array of rows.

The cleaning loop

1. Inspect honestly

Same opening move as always, but now you're hunting for problems: wrong dtypes, suspicious null counts, weird min/max values.

df.info()        # which columns have nulls? are dates actually 'object' (text)?
df.describe()    # any impossible values? negative revenue? a max of 9999999?
df.head(10)      # eyeball the actual contents

2. Fix the types

A "revenue" column read as text can't be summed. Convert it. errors="coerce" turns anything unparseable into a missing value instead of crashing — which is what you usually want.

df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["region"] = df["region"].astype("category")   # saves memory on repeats

3. Handle missing values

The honest first step is to measure the gaps, then decide — fill them or drop them. There's no universal right answer; it's a judgement call you document.

df.isna().sum()                      # count missing per column

df["revenue"] = df["revenue"].fillna(0)              # fill with a constant
df["region"]  = df["region"].fillna("Unknown")       # fill text
df["score"]   = df["score"].fillna(df["score"].median())  # fill with the median

df = df.dropna(subset=["order_date"])  # or drop rows missing a critical field

Fill or drop? Filling keeps rows but invents data — fine for a sensible default, risky if it distorts an average. Dropping is cleaner but loses information. State which you did and why; that note is part of trustworthy analysis.

4. Drop duplicates

df.duplicated().sum()      # how many exact-duplicate rows?
df = df.drop_duplicates()  # keep the first of each

# duplicates that only matter on certain columns:
df = df.drop_duplicates(subset=["order_id"], keep="last")

5. Standardise text

"North", "north ", and " NORTH" are three different groups to a computer until you normalise them — which is why a groupby sometimes shows the same category twice. The .str accessor runs string methods down a whole column.

df["region"] = df["region"].str.strip().str.title()   # trim + "North"
df["email"]  = df["email"].str.lower()
df["sku"]    = df["sku"].str.replace("-", "", regex=False)

Rescue a Messy File

Goal: take a deliberately dirty dataset and clean it into something analysable — the real first hour of almost any analytics task.

Create the mess (note the bad types, blanks, dupes, ragged text):

import pandas as pd
import numpy as np
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "region":   [" north", "South ", "South ", "EAST", None],
    "revenue":  ["1,200", "800", "800", "N/A", "1500"],
})

Fix revenue: raw["revenue"] = pd.to_numeric(raw["revenue"].str.replace(",", ""), errors="coerce")
Standardise region: raw["region"] = raw["region"].str.strip().str.title().fillna("Unknown")
De-dupe on the real key: clean = raw.drop_duplicates(subset=["order_id"])
Prove it worked: clean.info() (revenue now numeric, no dupes) and clean.groupby("region")["revenue"].sum() (groups are clean)

Watch the groupby at the end: because you standardised the text first, "South" is one group, not two. That's the entire reason cleaning comes before analysis — dirty categories silently split your numbers.

You can load CSV and Excel files with sane parsing options and run the five-step cleaning loop: inspect, fix types, handle nulls, drop duplicates, standardise text. This is the skill that makes every later number believable — and it's most of the actual job.

Build a one-shot clean(df) function in your metrics.py module that applies your standard cleaning steps and returns the tidy DataFrame. Reusable cleaning is a quiet superpower — every new file from the same source becomes a single function call, and your pipeline stays consistent instead of re-improvised each time.