NumPy — Thinking in Arrays

NumPy is the engine room of the Python data world: pandas and scikit-learn are built on it, and most ML libraries either build on it or exchange data with it. Its big idea is the vectorised operation: do maths on a whole array at once, with no explicit loop in your code. Once your brain switches from "loop over numbers" to "operate on the whole column," analytics code gets shorter, faster, and clearer.

An ndarray is a fast, fixed-type grid of numbers. Maths applies element-wise to the whole array. Boolean masks filter it. This is the mental model pandas inherits.

From list to array

import numpy as np

prices = np.array([10, 25, 40, 8, 100])
print(prices)          # [ 10  25  40   8 100]
print(prices.mean())   # 36.6
print(prices.sum())    # 183
print(prices.max())    # 100

Looks like a list, behaves like a spreadsheet column. The summary methods (.mean(), .sum(), .std(), .min(), .max()) are built in and run in fast compiled code under the hood.
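To make "fixed-type" concrete, here is a small sketch. The array named mixed is made up for illustration; the rest reuses the prices from above.

```python
import numpy as np

prices = np.array([10, 25, 40, 8, 100])

# Every element shares one data type; for whole numbers that is an integer type
# (the exact width, e.g. int64, depends on your platform).
print(prices.dtype.kind)     # i  (integer)

# A single float in the input promotes the whole array to float.
mixed = np.array([10, 25.5, 40])
print(mixed.dtype)           # float64

# The other summary methods are called the same way as .mean() and .sum().
print(prices.min())          # 8
print(prices.std())          # population standard deviation (ddof=0 by default)
```

One type for the whole array is part of why the compiled code can move so quickly: it never has to check what each element is.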

Vectorised maths — the whole point

Here's the shift. To add 15% tax to every price, you do not write a loop. You operate on the array directly:

with_tax = prices * 1.15
# [ 11.5  28.75  46.   9.2  115. ]

discounted = prices - 5          # subtract 5 from every element

quantities = np.array([3, 1, 2, 10, 1])   # example quantities, one per price
totals     = prices * quantities # element-wise across two arrays

One expression touches every element. This isn't just shorter than a loop: on large arrays it is often an order of magnitude or more faster, because NumPy pushes the work down into optimised C instead of stepping through Python one item at a time. "Don't write the loop; express the operation" is the core data-science instinct, and NumPy is where you learn it.
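If you want to convince yourself the two styles really compute the same thing, a minimal side-by-side sketch might look like this (the names looped and vectorised are just for illustration):

```python
import numpy as np

prices = np.array([10, 25, 40, 8, 100])

# Loop version: one Python-level step per element.
looped = []
for p in prices:
    looped.append(p * 1.15)

# Vectorised version: one expression, and the loop runs inside NumPy.
vectorised = prices * 1.15

# Same numbers either way (compared with a tolerance, since these are floats).
print(np.allclose(looped, vectorised))   # True
```

To see the speed gap yourself, time both versions on an array with a few million elements using the standard-library timeit module. The exact ratio depends on array size and hardware.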

🐘 PHP: There's no real PHP equivalent. In PHP you'd foreach over the array applying tax to each element, or reach for array_map. NumPy makes prices * 1.15 mean "every price, taxed": the loop is implied and runs in compiled code rather than in the interpreter. This is a genuinely new way of thinking, and it's the foundation of everything that follows.

Boolean masks — filtering by condition

Compare an array to a value and you get back an array of True/False — a "mask." Feed that mask back in as an index and you get only the matching elements. This is the ancestor of pandas filtering, so it's worth a careful look.

prices = np.array([10, 25, 40, 8, 100])

mask = prices >= 40          # [False False  True False  True]
big  = prices[mask]          # [ 40 100]

# usually written in one line:
big = prices[prices >= 40]   # [ 40 100]

# combine conditions with & and | (wrap each in parentheses!)
mid = prices[(prices >= 10) & (prices < 100)]   # [10 25 40]

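Read prices[prices >= 40] as "the prices where price is at least 40." Note the operators are & and | (not and/or) for array masks, and each condition must be parenthesised. The reason is operator precedence: & and | bind more tightly than comparisons like >= and <, so without parentheses Python tries to evaluate 10 & prices first. Forgetting those parentheses is the classic NumPy/pandas gotcha you'll hit exactly once.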
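Here is the parentheses gotcha in action, as a small sketch. The ~ operator ("not") isn't mentioned above but belongs to the same family as & and |. The variable names are illustrative.

```python
import numpy as np

prices = np.array([10, 25, 40, 8, 100])

# Without parentheses, & binds tighter than >= and <, so Python evaluates
# 10 & prices first and then chains the two comparisons, which fails.
try:
    bad = prices[prices >= 10 & prices < 100]
except ValueError as err:
    print("ValueError:", err)

# With parentheses, each comparison produces its own mask first.
mid = prices[(prices >= 10) & (prices < 100)]        # [10 25 40]

# | means "or", ~ means "not".
extremes = prices[(prices < 10) | (prices >= 100)]   # [  8 100]
not_big = prices[~(prices >= 40)]                    # [10 25  8]
```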

2-D arrays and axes

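Real data is a grid of rows and columns, so arrays go 2-D. The new idea is the axis. An operation along axis=0 moves down the rows and returns one result per column; an operation along axis=1 moves across the columns and returns one result per row.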

grid = np.array([
    [1200, 1800, 1500],   # North: Jan, Feb, Mar
    [ 900, 1100, 1300],   # South
])

grid.sum()              # 7800  -> everything
grid.sum(axis=0)        # [2100 2900 2800]  -> total per month (down columns)
grid.sum(axis=1)        # [4500 3300]       -> total per region (across rows)
grid[0]                 # first row  -> North's three months
grid[:, 1]              # all rows, column 1  -> every region's Feb

That axis concept carries straight into pandas, where axis=0 means "down the rows" and axis=1 means "across the columns." Getting comfortable with it now pays off immediately.
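A handy way to keep axes straight is to watch the shape: the axis you name is the one that disappears. A short sketch using the same grid:

```python
import numpy as np

grid = np.array([
    [1200, 1800, 1500],   # North: Jan, Feb, Mar
    [ 900, 1100, 1300],   # South
])

print(grid.shape)               # (2, 3) -> 2 rows (regions), 3 columns (months)

# axis=0 collapses the 2 rows, leaving one value per column (month).
print(grid.sum(axis=0).shape)   # (3,)
print(grid.max(axis=0))         # [1200 1800 1500]

# axis=1 collapses the 3 columns, leaving one value per row (region).
print(grid.sum(axis=1).shape)   # (2,)
print(grid.mean(axis=1))        # [1500. 1100.]
```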

Quarter in Arrays

Goal: do a full mini-analysis of a region×month sales grid using nothing but NumPy — so you feel the power before pandas hands you the convenience.

Build the data and labels:

import numpy as np
regions = np.array(["North", "South", "East"])
sales = np.array([
    [1200, 1800, 1500],
    [ 900, 1100, 1300],
    [2000, 1700, 2200],
])

Per-region totals and the best region:

region_totals = sales.sum(axis=1)        # [4500 3300 5900]
best = regions[region_totals.argmax()]   # 'East'

Apply a 10% target uplift to the whole grid at once:

target = sales * 1.10

Flag the standout months (above 1800) with a mask:

hot = sales[sales > 1800]
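Indexing with a mask gives you the hot values but drops where they came from. If you also want the region and month, one option is np.nonzero, sketched below. The months array is an assumption added here for labelling; the walkthrough itself only defines regions.

```python
import numpy as np

regions = np.array(["North", "South", "East"])
months = np.array(["Jan", "Feb", "Mar"])   # assumed labels, not defined in the text
sales = np.array([
    [1200, 1800, 1500],
    [ 900, 1100, 1300],
    [2000, 1700, 2200],
])

hot_mask = sales > 1800          # 2-D True/False grid, same shape as sales
print(sales[hot_mask])           # [2000 2200] -> values only, positions lost

# np.nonzero returns the row indices and column indices of the True cells.
rows, cols = np.nonzero(hot_mask)
hot_regions = regions[rows]      # ['East' 'East']
hot_months = months[cols]        # ['Jan' 'Mar']
```

Note that 1800 itself is not flagged, because the condition is strictly greater than.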

argmax() gave you the position of the biggest total, which you used to index back into regions — a pattern you'll reuse forever ("which label had the highest value?"). No loops anywhere.
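The same "position, then label" pattern stretches a little further. The sketch below uses argmin, argmax along an axis, and argsort, none of which appear in the walkthrough, so treat it as an optional extension rather than part of the exercise.

```python
import numpy as np

regions = np.array(["North", "South", "East"])
sales = np.array([
    [1200, 1800, 1500],
    [ 900, 1100, 1300],
    [2000, 1700, 2200],
])

region_totals = sales.sum(axis=1)                 # [4500 3300 5900]

# argmin is the mirror image of argmax: position of the smallest total.
worst = regions[region_totals.argmin()]           # 'South'

# argmax along axis=0 gives, for each month (column), the row of the top region.
top_per_month = regions[sales.argmax(axis=0)]     # ['East' 'North' 'East']

# argsort gives the order that would sort the totals; reverse it for a ranking.
ranking = regions[region_totals.argsort()[::-1]]  # ['East' 'North' 'South']
```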

You can create arrays, do vectorised maths on whole columns, filter with boolean masks, and reduce 2-D data along an axis. That's the NumPy mental model — and it's 80% of what makes pandas click in the next chapter.

Compute each region's share of total sales as a percentage, in one vectorised line: share = region_totals / region_totals.sum() * 100. No loop, no intermediate variables beyond what you have. Then round it with np.round(share, 1). Feel how you just described a calculation over the entire dataset as a single sentence — that's the array mindset locking in.
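For checking your answer, here is one way the challenge might play out, assuming the region_totals from the walkthrough above (hard-coded here so the snippet runs on its own):

```python
import numpy as np

# Same values as sales.sum(axis=1) in the walkthrough.
region_totals = np.array([4500, 3300, 5900])

# Each total divided by the grand total, scaled to a percentage.
share = region_totals / region_totals.sum() * 100
rounded = np.round(share, 1)

print(rounded)          # [32.8 24.1 43.1]
```

The division turns the integer totals into floats automatically, so no explicit conversion is needed.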