pandas — The Analyst's Workhorse
[], filter rows with boolean masks, and answer business questions with .groupby() + an aggregation.The two core objects
A Series is a single column with an index. A DataFrame is a dict of Series sharing one index — i.e. a table. Remember the "list of dicts" shape from Chapter 4? A DataFrame is exactly that, optimised and given superpowers.
import pandas as pd
df = pd.DataFrame({
"month": ["Jan", "Feb", "Mar", "Jan", "Feb"],
"region": ["North", "North", "North", "South", "South"],
"revenue": [1200, 1800, 1500, 900, 1100],
})
df
In a notebook, typing df renders a clean, styled table. That instant visual feedback is why analysts work in Jupyter.
Look before you leap — inspection
The first thing you do with any new dataset, every single time, is look at it. These five calls are your standard opening move:
df.head() # first 5 rows
df.tail(3) # last 3 rows
df.shape # (rows, columns) -> (5, 3)
df.columns # the column names
df.info() # types + non-null counts per column
df.describe() # count/mean/std/min/quartiles/max for numeric columns
.describe() is the one to fall in love with — it's an instant statistical summary of every numeric column, the automated version of that summary() function you wrote by hand in Chapter 5. One call profiles your whole dataset.
Selecting columns and rows
df["revenue"] # one column (a Series)
df[["month", "revenue"]] # several columns (a DataFrame) — note double brackets
# filter rows with a boolean mask — straight from NumPy:
df[df["revenue"] > 1300] # rows above 1300
df[df["region"] == "North"] # North only
df[(df["region"] == "North") & (df["revenue"] > 1300)] # combine with &
This is the NumPy mask idea, now labelled. Read df[df["revenue"] > 1300] as "the rows where revenue exceeds 1300." Same parenthesise-each-condition rule applies when you combine them with & and |.
🐘 PHP: Think of a DataFrame as the result of a SELECT you can keep transforming. df[df["region"]=="North"] is WHERE region = 'North' — but instead of writing SQL and round-tripping to MariaDB, the table lives in memory and you reshape it line by line. The PDO chapter taught you to query a database; pandas lets you query a file (or a query result) with the same WHERE/GROUP BY/ORDER BY mindset.
New columns from old
Assign to a column that doesn't exist yet and pandas creates it — vectorised across every row, NumPy-style:
df["with_tax"] = df["revenue"] * 1.15
df["tier"] = df["revenue"].apply(lambda r: "High" if r >= 1500 else "Low")
The first line is pure vectorised maths. The second uses .apply() with a lambda for row-by-row logic that's too custom for a simple formula — your Chapter 5 lambdas, working on a whole column.
groupby — the question-answering machine
This is the single most valuable move in analytics: split the data into groups, apply an aggregation to each, combine the results. "Revenue per region?" "Average order by month?" "Count of customers per tier?" — all one line.
df.groupby("region")["revenue"].sum()
# region
# North 4500
# South 2000
df.groupby("region")["revenue"].mean() # average per region
df.groupby("region").size() # row count per group
# group by two keys, aggregate several ways at once:
df.groupby(["region", "month"])["revenue"].agg(["sum", "mean", "count"])
Remember rolling up regions by hand with a dict accumulator in Chapter 4? This is that — split/apply/combine — but correct, fast, and one line. When you catch yourself about to loop-and-accumulate, the answer is almost always groupby.
Sorting and quick exports
df.sort_values("revenue", ascending=False).head(3) # top 3 rows
summary = df.groupby("region")["revenue"].sum()
summary.to_csv("region_totals.csv") # back out to a file
summary.to_excel("region_totals.xlsx") # real Excel (needs openpyxl)
From Raw Table to Boardroom Answer
Goal: run a complete analyst workflow — load, inspect, enrich, filter, group, rank — on a small sales table, the same sequence you'll repeat on real data next chapter.
- Build the data:
import pandas as pd df = pd.DataFrame({ "rep": ["Ana","Bo","Ana","Cy","Bo","Ana"], "region": ["N","S","N","E","S","E"], "deal": [1200, 800, 1500, 600, 2100, 900], }) - Inspect: run
df.describe()anddf.info()and read them - Enrich:
df["big"] = df["deal"] >= 1000 - Answer "total sales per rep, best first":
by_rep = df.groupby("rep")["deal"].sum().sort_values(ascending=False) print(by_rep) - Answer "how many big deals per region?":
print(df[df["big"]].groupby("region").size())
Inspect → enrich → group → rank. That four-step rhythm is the daily reality of business analytics, and you just ran it end to end.
groupby + aggregation. This is the literacy that the rest of the field assumes. Big moment.
commission column at 8% of deal, then produce a per-rep table showing total deals, total commission, and average deal size in a single .agg() call: df.groupby("rep").agg(total=("deal","sum"), avg=("deal","mean")). That "named aggregation" syntax is how you build a polished summary table in one expression — exactly what you'd paste into a report.