Collections — Where Your Data Lives

Before pandas, you need the four shapes data takes in plain Python: lists, dictionaries, tuples, and sets. Conceptually, a DataFrame is a dictionary of lists wearing a suit (each column name maps to a list of values), so the better you know these, the less pandas will feel like magic and the more it'll feel like leverage.

List = ordered, changeable [...]. Dict = labelled key→value {key: value}. Tuple = fixed, unchangeable (...). Set = unique, unordered {...}. Pick the shape that matches the job.

Lists — the ordered workhorse

A list is an ordered, mutable sequence. Square brackets, comma-separated, mixed types allowed (though in data work you'll usually keep them uniform).

revenue = [1200, 1800, 1500, 2100]
revenue.append(2400)        # add to the end
revenue[0] = 1250           # change in place
first = revenue[0]          # 1250  (zero-indexed)
last  = revenue[-1]         # 2400  (negative = from the end!)
chunk = revenue[1:3]        # [1800, 1500]  (slice: start:stop)
print(len(revenue))         # 5

Two things to highlight. Negative indexing (revenue[-1] for the last item) is a constant convenience. And slicing (list[start:stop]) grabs a sub-range, with stop excluded — the same half-open convention as range(). You'll slice rows out of datasets constantly.
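
To make the slicing rules concrete, here is a small sketch of the variations you'll meet most often. It reuses the revenue list from above; the variable names (first_two, from_third, and so on) are only illustrative.

```python
revenue = [1250, 1800, 1500, 2100, 2400]

first_two   = revenue[:2]    # start omitted = from the beginning
from_third  = revenue[2:]    # stop omitted  = through to the end
last_two    = revenue[-2:]   # negative indexes work inside slices too
every_other = revenue[::2]   # optional third part is the step
copy        = revenue[:]     # a new list holding the same items

print(first_two)     # [1250, 1800]
print(from_third)    # [1500, 2100, 2400]
print(last_two)      # [2100, 2400]
print(every_other)   # [1250, 1500, 2400]
```

A slice always produces a new list, so changing copy afterwards leaves revenue untouched.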

🐘 PHP: A Python list is a PHP indexed array, but cleaner: real negative indexing, real slicing, and a tidy set of methods (.append(), .sort(), .pop()) instead of a sprawl of array_* functions. No array_push($a, $x) — just a.append(x).

Dictionaries — labelled data

A dict maps keys to values. This is the most important collection for analytics because a "record" — one customer, one transaction, one row — is naturally a dict.

customer = {
    "name": "Acme Co",
    "tier": "VIP",
    "spend": 5400,
}
print(customer["name"])        # Acme Co
customer["region"] = "North"   # add a new field
customer["spend"] += 600       # update a field

# safe lookup that won't crash if the key is missing:
phone = customer.get("phone", "n/a")   # "n/a"

Loop over a dict's pairs with .items() — this is the dict counterpart to PHP's foreach ($a as $k => $v):

for key, value in customer.items():
    print(f"{key}: {value}")

Use .get(key, default) whenever a key might be absent: it returns the default instead of raising a KeyError. Real-world data is full of missing fields, so this habit prevents a lot of crashes.
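
A quick side-by-side may help show what .get() is saving you from. This is only a sketch using the customer record from above; the "phone" and "fax" keys are deliberately missing, and the variable names are illustrative.

```python
customer = {"name": "Acme Co", "tier": "VIP", "spend": 5400}

# Square brackets raise KeyError when the key is absent,
# so the defensive version needs a try/except.
try:
    phone = customer["phone"]
except KeyError:
    phone = "n/a"

# .get() with a default does the same job in one line.
phone_via_get = customer.get("phone", "n/a")

# With no default given, .get() returns None instead of raising.
fax = customer.get("fax")

# `in` checks whether the key exists at all.
has_tier = "tier" in customer

print(phone, phone_via_get, fax, has_tier)   # n/a n/a None True
```

Reach for square brackets when a missing key really is a bug you want to hear about, and for .get() when absence is a normal part of the data.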

🐘 PHP: A dict is a PHP associative array. $a['name'] → a["name"]. The big difference: PHP smushes indexed and associative arrays into one type; Python keeps lists and dicts separate, which makes your intent clearer. And .get() plays roughly the same role as the $a['x'] ?? 'default' null-coalescing trick, with one difference: .get() falls back only when the key is missing, not when the key exists with a value of None.

A list of dicts — the shape of a dataset

Put these two together and you get the universal "table in plain Python": a list, where each item is a dict (a row). Burn this pattern into memory: it's what a CSV becomes when you read it with the standard library's csv.DictReader, and it's one of the shapes a DataFrame can be built from.

sales = [
    {"month": "Jan", "revenue": 1200, "region": "North"},
    {"month": "Feb", "revenue": 1800, "region": "South"},
    {"month": "Mar", "revenue": 1500, "region": "North"},
]

# total revenue with a generator expression + sum()
total = sum(row["revenue"] for row in sales)   # 4500

# just the North rows
north = [row for row in sales if row["region"] == "North"]

That's an entire mini-analysis in pure Python. pandas will make it terser and faster, but this is the mental model underneath it.
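
To see where a list of dicts comes from in practice, here is a minimal sketch using the standard library's csv.DictReader. The CSV text is made up to mirror the sales data above, and io.StringIO stands in for a real file (an assumption, so the example runs without anything on disk). The last step pivots the rows into a dict of lists, which is the column-oriented view pandas takes.

```python
import csv
import io

# Hypothetical CSV content; a real script would use open("sales.csv", newline="").
csv_text = """month,revenue,region
Jan,1200,North
Feb,1800,South
Mar,1500,North
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))  # one dict per row

# csv hands every value back as a string, so convert numbers yourself.
for row in rows:
    row["revenue"] = int(row["revenue"])

total = sum(row["revenue"] for row in rows)
print(total)   # 4500

# Pivot rows into columns: one key per column, one list per key.
columns = {key: [row[key] for row in rows] for key in rows[0]}
print(columns["revenue"])   # [1200, 1800, 1500]
```

Same data, two shapes: rows is handy for looping record by record, columns is handy for working on one field at a time.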

Tuples — fixed, unchangeable groupings

A tuple is like a list that can't be changed after creation. Round brackets. Use it for things that belong together and shouldn't drift — coordinates, a (min, max) pair, a row that must stay fixed.

point = (40.7, -74.0)        # lat, lon
low, high = (10, 99)        # "unpacking" — two vars at once!
print(low, high)            # 10 99

That second line, tuple unpacking, is everywhere in Python. It's how for i, region in enumerate(...) and for k, v in dict.items() work under the hood — each item is a tuple being split into named variables. It also gives you the slickest swap in any language: a, b = b, a.
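
The three ideas in that paragraph can be sketched in a few lines: unpacking inside a loop header, the swap, and what "unchangeable" means in practice. The names (regions, labels, changed) are only for illustration.

```python
regions = ["North", "South", "East"]

# enumerate() yields (index, item) tuples...
pairs = list(enumerate(regions))
print(pairs)   # [(0, 'North'), (1, 'South'), (2, 'East')]

# ...and the loop header unpacks each tuple into two names.
labels = []
for i, region in enumerate(regions):
    labels.append(f"{i}: {region}")

# The swap: the right side is packed into a tuple, then unpacked.
a, b = 1, 2
a, b = b, a
print(a, b)    # 2 1

# Tuples reject in-place changes with a TypeError.
point = (40.7, -74.0)
try:
    point[0] = 41.0
    changed = True
except TypeError:
    changed = False
print(changed)  # False
```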

Sets — unique values, fast membership

A set holds unique items with no order. Two killer uses in analytics: de-duplicating, and lightning-fast "is this in here?" checks.

regions = ["North", "South", "North", "East", "South"]
unique = set(regions)            # {'North', 'South', 'East'}  (display order may vary)
print(len(unique))               # 3  -> "how many distinct regions?"

vip_ids = {101, 102, 103}
print(102 in vip_ids)            # True, and very fast even on millions

Sets even do real maths — a & b (in both), a | b (in either), a - b (in a but not b) — which is perfect for questions like "which customers are in last month's list but not this month's?"
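
Here is a small sketch of those three operators answering the "last month vs this month" question. The customer IDs are invented for the example.

```python
# Hypothetical customer IDs seen in each month
last_month = {101, 102, 103, 104}
this_month = {103, 104, 105}

retained = last_month & this_month   # in both months
everyone = last_month | this_month   # in either month
churned  = last_month - this_month   # last month only
new      = this_month - last_month   # this month only

# sorted() turns a set into an ordered list for stable printing
print(sorted(retained))   # [103, 104]
print(sorted(churned))    # [101, 102]
print(sorted(new))        # [105]
print(len(everyone))      # 5
```

Note that a - b and b - a answer different questions, so the order of the operands matters for subtraction.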

Roll Up the Regions

Goal: take a list-of-dicts dataset and produce a per-region revenue summary using a dict as an accumulator — the manual version of a pandas groupby, so the magic later isn't mysterious.

Use this data:

sales = [
    {"region": "North", "rev": 1200},
    {"region": "South", "rev": 1800},
    {"region": "North", "rev": 1500},
    {"region": "East",  "rev": 900},
    {"region": "South", "rev": 400},
]

Build the rollup:

totals = {}
for row in sales:
    r = row["region"]
    totals[r] = totals.get(r, 0) + row["rev"]

Print it sorted high-to-low:

for region, rev in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{region:<6} ${rev:,}")

You used .get(r, 0) to start each region at zero, and a lambda to sort by value (more on lambdas next chapter). This exact "group and sum" is the single most common analytics operation — you just built it by hand.
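
For comparison, here is one common alternative to the .get(r, 0) pattern: collections.defaultdict from the standard library. This is a swapped-in variation rather than the method used above, shown on the same data so you can check that the totals agree.

```python
from collections import defaultdict

sales = [
    {"region": "North", "rev": 1200},
    {"region": "South", "rev": 1800},
    {"region": "North", "rev": 1500},
    {"region": "East",  "rev": 900},
    {"region": "South", "rev": 400},
]

# defaultdict(int) creates a 0 for any region it hasn't seen yet,
# so the loop body no longer needs .get(r, 0).
totals = defaultdict(int)
for row in sales:
    totals[row["region"]] += row["rev"]

ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
lines = [f"{region:<6} ${rev:,}" for region, rev in ranked]
print("\n".join(lines))
# North  $2,700
# South  $2,200
# East   $900
```

Both versions produce the same totals. The .get() form works on a plain dict, while defaultdict trades an import for a slightly shorter loop.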

You know the four containers and, crucially, the list-of-dicts shape that a dataset takes. You can total, filter, de-duplicate, and roll up data in pure Python. pandas now has a foundation to stand on.

Take the sales list and answer three questions with one-liners: (1) the set of distinct regions, (2) the single highest rev value using max(row["rev"] for row in sales), and (3) the average using sum(...) / len(sales). Three questions, three lines. That ratio — one question, one line — is the goal you're training toward.