Collections — Where Your Data Lives
List = ordered, changeable [...]. Dict = labelled key→value {...}. Tuple = fixed, unchangeable (...). Set = unique, unordered. Pick the shape that matches the job.
Lists — the ordered workhorse
A list is an ordered, mutable sequence. Square brackets, comma-separated, mixed types allowed (though in data work you'll usually keep them uniform).
revenue = [1200, 1800, 1500, 2100]
revenue.append(2400) # add to the end
revenue[0] = 1250 # change in place
first = revenue[0] # 1250 (zero-indexed)
last = revenue[-1] # 2400 (negative = from the end!)
chunk = revenue[1:3] # [1800, 1500] (slice: start:stop)
print(len(revenue)) # 5
Two things to highlight. Negative indexing (revenue[-1] for the last item) is a constant convenience. And slicing (list[start:stop]) grabs a sub-range, with stop excluded — the same half-open convention as range(). You'll slice rows out of datasets constantly.
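To make the half-open convention concrete, here is a small sketch. The figures are the same illustrative revenue numbers as above, and the extra variable names are made up for the example.

```python
# Illustrative data only: five monthly revenue figures.
revenue = [1250, 1800, 1500, 2100, 2400]

# stop is excluded, so [1:3] returns the items at indexes 1 and 2.
middle = revenue[1:3]

# range(1, 3) produces exactly the same indexes: 1 and 2.
same_items = [revenue[i] for i in range(1, 3)]

# Omitting start or stop means "from the beginning" or "to the end".
first_two = revenue[:2]
last_two = revenue[-2:]

# Because stop is excluded, two adjacent slices never overlap or skip an item.
rejoined = revenue[:2] + revenue[2:]

print(middle)               # [1800, 1500]
print(first_two)            # [1250, 1800]
print(last_two)             # [2100, 2400]
print(rejoined == revenue)  # True
```

The last check is the practical payoff: splitting at any index n with [:n] and [n:] always gives you two pieces that add back up to the original.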
🐘 PHP: A Python list is a PHP indexed array, but cleaner: real negative indexing, real slicing, and a tidy set of methods (.append(), .sort(), .pop()) instead of a sprawl of array_* functions. No array_push($a, $x) — just a.append(x).
Dictionaries — labelled data
A dict maps keys to values. This is the most important collection for analytics because a "record" — one customer, one transaction, one row — is naturally a dict.
customer = {
    "name": "Acme Co",
    "tier": "VIP",
    "spend": 5400,
}
print(customer["name"]) # Acme Co
customer["region"] = "North" # add a new field
customer["spend"] += 600 # update a field
# safe lookup that won't crash if the key is missing:
phone = customer.get("phone", "n/a") # "n/a"
Loop over a dict's pairs with .items() — this is the dict counterpart to PHP's foreach ($a as $k => $v):
for key, value in customer.items():
    print(f"{key}: {value}")
Use .get(key, default) whenever a key might be absent: it returns the default instead of raising a KeyError. Real-world data is full of missing fields, so this habit prevents a lot of crashes.
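A quick side-by-side may help here. This sketch reuses the customer record from above and shows what each kind of lookup does when the "phone" key is missing. The variable names are illustrative.

```python
customer = {"name": "Acme Co", "tier": "VIP", "spend": 5400}

# Square-bracket access raises KeyError when the key is absent.
try:
    phone = customer["phone"]
except KeyError:
    phone = "lookup failed"

# .get() returns the supplied default instead of raising.
safe_phone = customer.get("phone", "n/a")

# With no default given, .get() returns None for a missing key.
no_default = customer.get("phone")

# .get() only reads; it never adds the key to the dict.
still_missing = "phone" not in customer

print(phone)          # lookup failed
print(safe_phone)     # n/a
print(no_default)     # None
print(still_missing)  # True
```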
🐘 PHP: A dict is a PHP associative array. $a['name'] → a["name"]. The big difference: PHP smushes indexed and associative arrays into one type; Python keeps lists and dicts separate, which actually makes your intent clearer. And .get() is the clean version of the $a['x'] ?? 'default' null-coalescing trick.
A list of dicts — the shape of a dataset
Put these two together and you get the universal "table in plain Python": a list, where each item is a dict (a row). Burn this pattern into memory — it's what a CSV becomes when you read it, and what a DataFrame is built from.
sales = [
    {"month": "Jan", "revenue": 1200, "region": "North"},
    {"month": "Feb", "revenue": 1800, "region": "South"},
    {"month": "Mar", "revenue": 1500, "region": "North"},
]
# total revenue with a comprehension + sum()
total = sum(row["revenue"] for row in sales) # 4500
# just the North rows
north = [row for row in sales if row["region"] == "North"]
That's an entire mini-analysis in pure Python. pandas will make it terser and faster, but this is the mental model underneath it.
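To back up the claim that a CSV turns into a list of dicts, here is a minimal sketch using the standard library's csv.DictReader. The CSV text is an in-memory stand-in for a real file (an assumption for the example), with columns that mirror the sales data above.

```python
import csv
import io

# Stand-in for a file on disk; the columns mirror the sales example.
csv_text = """month,revenue,region
Jan,1200,North
Feb,1800,South
Mar,1500,North
"""

# DictReader yields one dict per row, keyed by the header line.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Every value arrives as a string, so convert numbers before doing maths.
for row in rows:
    row["revenue"] = int(row["revenue"])

total = sum(row["revenue"] for row in rows)

print(dict(rows[0]))  # {'month': 'Jan', 'revenue': 1200, 'region': 'North'}
print(total)          # 4500
```

The one thing to watch is the type conversion: skip the int() step and sum() fails, because the revenue values are still text.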
Tuples — fixed, unchangeable groupings
A tuple is like a list that can't be changed after creation. Round brackets. Use it for things that belong together and shouldn't drift — coordinates, a (min, max) pair, a row that must stay fixed.
point = (40.7, -74.0) # lat, lon
low, high = (10, 99) # "unpacking" — two vars at once!
print(low, high) # 10 99
That second line, tuple unpacking, is everywhere in Python. It's how for i, region in enumerate(...) and for k, v in dict.items() work under the hood — each item is a tuple being split into named variables. It also gives you the slickest swap in any language: a, b = b, a.
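The sketch below shows those three ideas in one place: unpacking inside a for loop, the swap, and what happens if you try to change a tuple. The sample values are illustrative only.

```python
regions = ["North", "South", "East"]

# enumerate() yields (index, item) tuples; the loop unpacks each one.
pairs = list(enumerate(regions))
labels = []
for i, region in enumerate(regions):
    labels.append(f"{i}: {region}")

# Swap: the right-hand side is packed into a tuple, then unpacked.
a, b = 1, 2
a, b = b, a

# Tuples reject in-place changes with a TypeError.
point = (40.7, -74.0)
try:
    point[0] = 41.0
    changed = True
except TypeError:
    changed = False

print(pairs)    # [(0, 'North'), (1, 'South'), (2, 'East')]
print(labels)   # ['0: North', '1: South', '2: East']
print(a, b)     # 2 1
print(changed)  # False
```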
Sets — unique values, fast membership
A set holds unique items with no order. Two killer uses in analytics: de-duplicating, and lightning-fast "is this in here?" checks.
regions = ["North", "South", "North", "East", "South"]
unique = set(regions) # {'North', 'South', 'East'} (display order may vary)
print(len(unique)) # 3 -> "how many distinct regions?"
vip_ids = {101, 102, 103}
print(102 in vip_ids) # True, and very fast even on millions
Sets even do real maths — a & b (in both), a | b (in either), a - b (in a but not b) — which is perfect for questions like "which customers are in last month's list but not this month's?"
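Here is how that month-over-month question might look in practice. The customer IDs are hypothetical, chosen only to make the overlaps easy to follow.

```python
# Hypothetical customer IDs, for illustration only.
last_month = {101, 102, 103, 104}
this_month = {103, 104, 105}

in_both = last_month & this_month    # bought in both months
in_either = last_month | this_month  # bought in at least one month
churned = last_month - this_month    # last month only
new = this_month - last_month        # this month only

# Sets have no order, so sort before displaying.
print(sorted(in_both))    # [103, 104]
print(sorted(in_either))  # [101, 102, 103, 104, 105]
print(sorted(churned))    # [101, 102]
print(sorted(new))        # [105]
```

Note that a - b is not symmetric: swapping the operands turns "customers we lost" into "customers we gained".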
Roll Up the Regions
Goal: take a list-of-dicts dataset and produce a per-region revenue summary using a dict as an accumulator — the manual version of a pandas groupby, so the magic later isn't mysterious.
- Use this data:
sales = [
    {"region": "North", "rev": 1200},
    {"region": "South", "rev": 1800},
    {"region": "North", "rev": 1500},
    {"region": "East", "rev": 900},
    {"region": "South", "rev": 400},
]
- Build the rollup:
totals = {}
for row in sales:
    r = row["region"]
    totals[r] = totals.get(r, 0) + row["rev"]
- Print it sorted high-to-low:
for region, rev in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{region:<6} ${rev:,}")
You used .get(r, 0) to start each region at zero, and a lambda to sort by value (more on lambdas next chapter). This exact "group and sum" is the single most common analytics operation — you just built it by hand.
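If you want to reuse the pattern, one option is to wrap the accumulator in a small function. The sketch below does that with a hypothetical helper named rollup (not part of Python or pandas) and shows the output the exercise data should produce.

```python
def rollup(rows, group_key, value_key):
    """Sum value_key for each distinct group_key (hypothetical helper)."""
    totals = {}
    for row in rows:
        group = row[group_key]
        # .get(group, 0) starts each new group at zero.
        totals[group] = totals.get(group, 0) + row[value_key]
    return totals


sales = [
    {"region": "North", "rev": 1200},
    {"region": "South", "rev": 1800},
    {"region": "North", "rev": 1500},
    {"region": "East", "rev": 900},
    {"region": "South", "rev": 400},
]

totals = rollup(sales, "region", "rev")

# Sort the (region, total) pairs by total, highest first.
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
lines = [f"{region:<6} ${rev:,}" for region, rev in ranked]
print("\n".join(lines))
# North  $2,700
# South  $2,200
# East   $900
```

Passing the key names as arguments means the same function can group by any field, for example month instead of region.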
Next, take the same sales list and answer three questions with one-liners: (1) the set of distinct regions, (2) the single highest rev value using max(row["rev"] for row in sales), and (3) the average using sum(...) / len(sales). Three questions, three lines. That ratio of one question to one line is the goal you're training toward.
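Try it yourself first. If you want something to check against, here is one possible set of answers, assuming the same sales data as the exercise above. The variable names are illustrative.

```python
sales = [
    {"region": "North", "rev": 1200},
    {"region": "South", "rev": 1800},
    {"region": "North", "rev": 1500},
    {"region": "East", "rev": 900},
    {"region": "South", "rev": 400},
]

# (1) distinct regions, via a set comprehension
distinct_regions = {row["region"] for row in sales}

# (2) highest single rev value
highest = max(row["rev"] for row in sales)

# (3) average rev per row
average = sum(row["rev"] for row in sales) / len(sales)

print(sorted(distinct_regions))  # ['East', 'North', 'South']
print(highest)                   # 1800
print(average)                   # 1160.0
```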