NumPy — Thinking in Arrays
ndarray is a fast, fixed-type grid of numbers. Maths applies element-wise to the whole array. Boolean masks filter it. This is the mental model pandas inherits.From list to array
import numpy as np
prices = np.array([10, 25, 40, 8, 100])
print(prices) # [ 10 25 40 8 100]
print(prices.mean()) # 36.6
print(prices.sum()) # 183
print(prices.max()) # 100
Looks like a list, behaves like a spreadsheet column. The summary methods (.mean(), .sum(), .std(), .min(), .max()) are built in and run in fast compiled code under the hood.
Vectorised maths — the whole point
Here's the shift. To add 15% tax to every price, you do not write a loop. You operate on the array directly:
with_tax = prices * 1.15
# [ 11.5 28.75 46. 9.2 115. ]
discounted = prices - 5 # subtract 5 from every element
totals = prices * quantities # element-wise across two arrays
One expression touches every element. This isn't just shorter than a loop — for big data it's dramatically faster, because NumPy pushes the work down into optimised C instead of stepping through Python one item at a time. "Don't write the loop; express the operation" is the core data-science instinct, and NumPy is where you learn it.
🐘 PHP: There's no real PHP equivalent. In PHP you'd foreach over the array applying tax to each element, or reach for array_map. NumPy makes prices * 1.15 mean "every price, taxed" — the loop is implied and runs at machine speed. This is a genuinely new way of thinking, and it's the foundation of everything that follows.
Boolean masks — filtering by condition
Compare an array to a value and you get back an array of True/False — a "mask." Feed that mask back in as an index and you get only the matching elements. This is the ancestor of pandas filtering, so it's worth a careful look.
prices = np.array([10, 25, 40, 8, 100])
mask = prices >= 40 # [False False True False True]
big = prices[mask] # [ 40 100]
# usually written in one line:
big = prices[prices >= 40] # [ 40 100]
# combine conditions with & and | (wrap each in parentheses!)
mid = prices[(prices >= 10) & (prices < 100)] # [10 25 40]
Read prices[prices >= 40] as "the prices where price is at least 40." Note the operators are & and | (not and/or) for array masks, and each condition must be parenthesised — forgetting those parentheses is the classic NumPy/pandas gotcha you'll hit exactly once.
2-D arrays and axes
Real data is a grid — rows and columns — so arrays go 2-D. The new idea is the axis: axis=0 runs down columns, axis=1 runs across rows.
grid = np.array([
[1200, 1800, 1500], # North: Jan, Feb, Mar
[ 900, 1100, 1300], # South
])
grid.sum() # 7800 -> everything
grid.sum(axis=0) # [2100 2900 2800] -> total per month (down columns)
grid.sum(axis=1) # [4500 3300] -> total per region (across rows)
grid[0] # first row -> North's three months
grid[:, 1] # all rows, column 1 -> every region's Feb
That axis concept carries straight into pandas, where axis=0 means "down the rows" and axis=1 means "across the columns." Getting comfortable with it now pays off immediately.
Quarter in Arrays
Goal: do a full mini-analysis of a region×month sales grid using nothing but NumPy — so you feel the power before pandas hands you the convenience.
- Build the data and labels:
import numpy as np regions = np.array(["North", "South", "East"]) sales = np.array([ [1200, 1800, 1500], [ 900, 1100, 1300], [2000, 1700, 2200], ]) - Per-region totals and the best region:
region_totals = sales.sum(axis=1) # [4500 3300 5900] best = regions[region_totals.argmax()] # 'East' - Apply a 10% target uplift to the whole grid at once:
target = sales * 1.10 - Flag the standout months (above 1800) with a mask:
hot = sales[sales > 1800]
argmax() gave you the position of the biggest total, which you used to index back into regions — a pattern you'll reuse forever ("which label had the highest value?"). No loops anywhere.
share = region_totals / region_totals.sum() * 100. No loop, no intermediate variables beyond what you have. Then round it with np.round(share, 1). Feel how you just described a calculation over the entire dataset as a single sentence — that's the array mindset locking in.