pandas — The Analyst's Workhorse

This is the chapter the whole module has been walking toward. pandas turns Python into a spreadsheet on steroids: load a CSV, inspect it, filter it, group it, and summarise it — in a handful of expressive lines. If you learn one library for Business Analytics, it's this one. Everything you practised — dicts, lists, comprehensions, NumPy masks — now pays off at once.

A Series is one labelled column; a DataFrame is a whole table. Select columns with [], filter rows with boolean masks, and answer business questions with .groupby() + an aggregation.

The two core objects

A Series is a single column with an index. A DataFrame is a dict of Series sharing one index — i.e. a table. Remember the "list of dicts" shape from Chapter 4? A DataFrame is exactly that, optimised and given superpowers.

import pandas as pd

df = pd.DataFrame({
    "month":   ["Jan", "Feb", "Mar", "Jan", "Feb"],
    "region":  ["North", "North", "North", "South", "South"],
    "revenue": [1200, 1800, 1500, 900, 1100],
})
df

In a notebook, typing df renders a clean, styled table. That instant visual feedback is why analysts work in Jupyter.

Look before you leap — inspection

The first thing you do with any new dataset, every single time, is look at it. These five calls are your standard opening move:

df.head()        # first 5 rows
df.tail(3)       # last 3 rows
df.shape         # (rows, columns) -> (5, 3)
df.columns       # the column names
df.info()        # types + non-null counts per column
df.describe()    # count/mean/std/min/quartiles/max for numeric columns

.describe() is the one to fall in love with — it's an instant statistical summary of every numeric column, the automated version of that summary() function you wrote by hand in Chapter 5. One call profiles your whole dataset.

Selecting columns and rows

df["revenue"]                 # one column (a Series)
df[["month", "revenue"]]      # several columns (a DataFrame) — note double brackets

# filter rows with a boolean mask — straight from NumPy:
df[df["revenue"] > 1300]                       # rows above 1300
df[df["region"] == "North"]                    # North only
df[(df["region"] == "North") & (df["revenue"] > 1300)]   # combine with &

This is the NumPy mask idea, now labelled. Read df[df["revenue"] > 1300] as "the rows where revenue exceeds 1300." Same parenthesise-each-condition rule applies when you combine them with & and |.

🐘 PHP: Think of a DataFrame as the result of a SELECT you can keep transforming. df[df["region"]=="North"] is WHERE region = 'North' — but instead of writing SQL and round-tripping to MariaDB, the table lives in memory and you reshape it line by line. The PDO chapter taught you to query a database; pandas lets you query a file (or a query result) with the same WHERE/GROUP BY/ORDER BY mindset.

New columns from old

Assign to a column that doesn't exist yet and pandas creates it — vectorised across every row, NumPy-style:

df["with_tax"] = df["revenue"] * 1.15
df["tier"] = df["revenue"].apply(lambda r: "High" if r >= 1500 else "Low")

The first line is pure vectorised maths. The second uses .apply() with a lambda for row-by-row logic that's too custom for a simple formula — your Chapter 5 lambdas, working on a whole column.

groupby — the question-answering machine

This is the single most valuable move in analytics: split the data into groups, apply an aggregation to each, combine the results. "Revenue per region?" "Average order by month?" "Count of customers per tier?" — all one line.

df.groupby("region")["revenue"].sum()
# region
# North    4500
# South    2000

df.groupby("region")["revenue"].mean()       # average per region
df.groupby("region").size()                  # row count per group

# group by two keys, aggregate several ways at once:
df.groupby(["region", "month"])["revenue"].agg(["sum", "mean", "count"])

Remember rolling up regions by hand with a dict accumulator in Chapter 4? This is that — split/apply/combine — but correct, fast, and one line. When you catch yourself about to loop-and-accumulate, the answer is almost always groupby.

Sorting and quick exports

df.sort_values("revenue", ascending=False).head(3)   # top 3 rows

summary = df.groupby("region")["revenue"].sum()
summary.to_csv("region_totals.csv")                  # back out to a file
summary.to_excel("region_totals.xlsx")               # real Excel (needs openpyxl)

From Raw Table to Boardroom Answer

Goal: run a complete analyst workflow — load, inspect, enrich, filter, group, rank — on a small sales table, the same sequence you'll repeat on real data next chapter.

Build the data:

import pandas as pd
df = pd.DataFrame({
    "rep":     ["Ana","Bo","Ana","Cy","Bo","Ana"],
    "region":  ["N","S","N","E","S","E"],
    "deal":    [1200, 800, 1500, 600, 2100, 900],
})

Inspect: run df.describe() and df.info() and read them
Enrich: df["big"] = df["deal"] >= 1000

Answer "total sales per rep, best first":

by_rep = df.groupby("rep")["deal"].sum().sort_values(ascending=False)
print(by_rep)

Answer "how many big deals per region?":

print(df[df["big"]].groupby("region").size())

Inspect → enrich → group → rank. That four-step rhythm is the daily reality of business analytics, and you just ran it end to end.

You can build a DataFrame, inspect it five ways, select and filter rows, add computed columns, and — the headline skill — answer business questions with groupby + aggregation. This is the literacy that the rest of the field assumes. Big moment.

Add a commission column at 8% of deal, then produce a per-rep table showing total deals, total commission, and average deal size in a single .agg() call: df.groupby("rep").agg(total=("deal","sum"), avg=("deal","mean")). That "named aggregation" syntax is how you build a polished summary table in one expression — exactly what you'd paste into a report.