First Models — Machine Learning & AI
What "machine learning" actually means
Strip away the mystique and it's this: instead of you writing the rule (if spend > 400 then...), you show an algorithm lots of examples and it infers the rule from the data. You did the human version in Chapter 3 with hand-written tiers; ML learns the thresholds itself. Two big families:
- Regression — predict a number (next month's revenue, a house price, expected spend).
- Classification — predict a category (will this customer churn? is this email spam? which tier?).
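To make the "you write the rule" versus "the algorithm infers the rule" contrast concrete, here is a minimal sketch. The six-row dataset, the column name premium, and the function hand_written_tier are all made up for illustration; the only real library call is scikit-learn's decision tree, limited to a single split so that it learns exactly one threshold.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up illustration data (not from the chapter): spend and a known label.
data = pd.DataFrame({
    "spend": [100, 150, 200, 450, 500, 600],
    "premium": [0, 0, 0, 1, 1, 1],
})

# The hand-written way: you pick the threshold yourself.
def hand_written_tier(spend):
    return 1 if spend > 400 else 0

# The ML way: a one-split decision tree chooses the threshold from the examples.
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(data[["spend"]], data["premium"])

# The root node of the fitted tree stores the cut-off it settled on.
learned_threshold = stump.tree_.threshold[0]
print("Hand-written threshold: 400")
print("Learned threshold:", learned_threshold)
```

The learned cut-off lands somewhere between the last 0 example and the first 1 example. Nobody typed that number in; it came from the data.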
The scikit-learn rhythm — split, fit, predict, evaluate
Every model in scikit-learn follows the same four beats. Learn the rhythm once and you can swap the model out freely. Here's a regression that turns your scatter-plus-trend-line into a predictor:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
df = pd.DataFrame({
    "spend": [200, 150, 400, 250, 180, 420, 300, 160, 350, 220],
    "revenue": [1200, 800, 2100, 1500, 900, 2300, 1700, 850, 1900, 1300],
})
X = df[["spend"]] # features (2-D: rows × columns)
y = df["revenue"] # target (what we predict)
# 1. SPLIT — hold back 25% to test honestly on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# 2. FIT — learn the pattern from the training data
model = LinearRegression()
model.fit(X_train, y_train)
# 3. PREDICT — on data the model never saw
preds = model.predict(X_test)
# 4. EVALUATE — how good is it?
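print("R² score:", model.score(X_test, y_test))  # 1.0 = perfect, 0 = no better than always guessing the mean (it can go negative)
new_spend = pd.DataFrame({"spend": [500]})  # same column name as in training, which avoids a feature-name warning
print("Predicted revenue for $500 spend:", model.predict(new_spend)[0])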
The single most important idea here is the train/test split. You never judge a model on the data it learned from — of course it does well there. You hold back a chunk, predict on it, and see if the pattern generalises. A model that aces training data but flops on the test set is "overfitting" — memorising instead of learning. That instinct (always test on unseen data) is what separates a real result from a fooled one.
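One way to see overfitting for yourself is to score the same model twice, once on the rows it trained on and once on the held-back rows. The sketch below reuses the chapter's ten rows and swaps in an unrestricted decision tree (my choice for illustration, not something the chapter prescribes) because it can memorise its training rows outright.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({
    "spend": [200, 150, 400, 250, 180, 420, 300, 160, 350, 220],
    "revenue": [1200, 800, 2100, 1500, 900, 2300, 1700, 850, 1900, 1300],
})
X = df[["spend"]]
y = df["revenue"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# No depth limit: the tree may grow until every training row has its own leaf.
tree = DecisionTreeRegressor(random_state=0)
tree.fit(X_train, y_train)

train_r2 = tree.score(X_train, y_train)  # judged on data it has already seen
test_r2 = tree.score(X_test, y_test)     # judged on data it has never seen

print("R² on training rows:", train_r2)
print("R² on held-back rows:", test_r2)
```

The training score is perfect because the tree has stored each training row. Only the second number says anything about generalisation, so that is the one to report.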
🐘 PHP: Nothing in everyday PHP work maps to this; it's a genuinely different kind of programming. In PHP you write every rule explicitly. Here you define the shape of a solution and let the data fill in the parameters. Same Python you already know (a DataFrame, a few method calls), pointed at a completely new kind of problem.
Classification — predicting a category
Swap the model, keep the rhythm. Predicting a yes/no like "high-value customer?" looks almost identical:
from sklearn.tree import DecisionTreeClassifier
df["high_value"] = (df["revenue"] >= 1500).astype(int) # the label: 1 or 0
X = df[["spend"]]
y = df["high_value"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)
clf = DecisionTreeClassifier(max_depth=3).fit(X_tr, y_tr)
print("Accuracy:", clf.score(X_te, y_te))
Same four beats: split, fit, predict, evaluate. That consistency is scikit-learn's gift — once you know the pattern, trying a new algorithm is a one-line change, and you'll genuinely use that to compare models in your coursework.
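As a rough illustration of the "one-line change", the sketch below runs three scikit-learn classifiers through an identical split, fit, and score loop. The choice of logistic regression and k-nearest neighbours as the alternatives is mine, not the chapter's. On ten rows the accuracies mean very little, so treat this as a demonstration of the pattern rather than a result.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "spend": [200, 150, 400, 250, 180, 420, 300, 160, 350, 220],
    "revenue": [1200, 800, 2100, 1500, 900, 2300, 1700, 850, 1900, 1300],
})
df["high_value"] = (df["revenue"] >= 1500).astype(int)

X = df[["spend"]]
y = df["high_value"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# Only this dictionary changes when you want to try another algorithm.
candidates = {
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
    "3 nearest neighbours": KNeighborsClassifier(n_neighbors=3),
}

accuracy = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)                     # FIT
    accuracy[name] = clf.score(X_te, y_te)  # PREDICT + EVALUATE in one call
    print(f"{name}: accuracy {accuracy[name]:.2f}")
```

Every candidate exposes the same fit and score methods, which is why the loop body never needs to know which algorithm it is holding.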
The honest caveats
AI & LLMs — the tool reshaping the field
Large Language Models (the tech behind modern AI assistants) are a different beast from the models above. They aren't trained by you on your spreadsheet; they're enormous pattern-completers trained on vast amounts of text, and you call them like a service. For Business Analytics, they're becoming a standard tool for summarising, classifying, and extracting structure from messy text data. The shape of using one is refreshingly familiar: it's just an API call.
# conceptual shape — you'd install a provider's SDK and use a key
from some_ai_provider import Client
client = Client(api_key="...")
response = client.complete(
    "Classify this review as Positive, Neutral, or Negative: "
    "'Shipping was slow but the product is excellent.'"
)
print(response.text)  # might print e.g. "Mixed / leaning Positive"; the reply is free text, not guaranteed to be one of your three labels
The analytics superpower is pairing the two worlds: use pandas to wrangle thousands of customer comments into a column, loop them through an LLM to tag sentiment or pull themes, then groupby and chart the results like any other data. Structured analysis over unstructured text — that's a genuinely new capability, and Python is where the two meet.
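The wrangle, tag, and group workflow can be sketched without committing to any particular provider. In the example below the comments are invented, and tag_sentiment is a hypothetical stand-in that uses crude keyword matching; in real use you would replace its body with a call to whichever LLM SDK you have chosen. The pandas part on either side is what stays the same.

```python
import pandas as pd

# Invented sample comments, purely for illustration.
comments = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "comment": [
        "Shipping was slow but the product is excellent.",
        "Terrible support, very slow refund.",
        "Excellent value, arrived early.",
        "Product broke after a week, terrible.",
        "Does what it says.",
    ],
})

def tag_sentiment(text: str) -> str:
    """Hypothetical placeholder for an LLM call; keyword matching keeps it runnable."""
    lowered = text.lower()
    if "excellent" in lowered:
        return "Positive"
    if "terrible" in lowered:
        return "Negative"
    return "Neutral"

# Tag each row, then treat the tags like any other categorical column.
comments["sentiment"] = comments["comment"].apply(tag_sentiment)
summary = comments.groupby(["region", "sentiment"]).size().unstack(fill_value=0)
print(summary)
```

Once the sentiment column exists, everything downstream (groupby, pivot, charts) is ordinary pandas. With a real model behind tag_sentiment you would also want to check that each reply is one of the labels you expect before grouping.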
🐍 Where this goes: Your degree will go deeper on the statistics (proper validation, feature engineering, model selection) and the ethics (bias, fairness, interpretability). What you've got now is the engineering literacy underneath all of it — enough to read the code, run the rhythm, and ask sharp questions instead of nodding along.
Build, Predict, and Question
Goal: run a complete model end to end, visualise its predictions against reality, and practise the skepticism that makes the result trustworthy.
- Use a slightly bigger sample (12+ rows of spend and revenue; extend the data above)
- Run the full regression rhythm: split, fit, predict, and print the R² score
- Plot predictions vs. actuals to see the fit with your own eyes:
import matplotlib.pyplot as plt
plt.scatter(X_test["spend"], y_test, label="actual")
plt.scatter(X_test["spend"], preds, label="predicted", marker="x")
plt.legend(); plt.xlabel("spend"); plt.ylabel("revenue")
plt.savefig("model_fit.png", dpi=150); plt.show()
- Use it: predict revenue for a $500 spend and write down whether you'd actually trust that number, and why or why not
- Now stress-test your skepticism: re-run with a different random_state in the split. Does the R² wobble a lot? On a tiny dataset it will, which is the lesson
That last step is the whole point. A result that swings wildly when you reshuffle the data isn't a finding, it's noise. Learning to distrust a too-good number is the most valuable habit in the entire field.
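If you would rather see the wobble in one go than re-run by hand, a loop over several seeds does it. This is a sketch using the chapter's original ten rows (swap in your extended dataset); the choice of ten seeds is arbitrary.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "spend": [200, 150, 400, 250, 180, 420, 300, 160, 350, 220],
    "revenue": [1200, 800, 2100, 1500, 900, 2300, 1700, 850, 1900, 1300],
})
X = df[["spend"]]
y = df["revenue"]

scores = []
for seed in range(10):  # ten different shuffles of the same data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

r2 = pd.Series(scores)
print("R² per seed:", r2.round(3).tolist())
print("min:", round(r2.min(), 3), "max:", round(r2.max(), 3))
print("spread:", round(r2.max() - r2.min(), 3))
```

The model and the data are identical on every pass; only the rows that happen to land in the test set change. The spread between the best and worst score is a rough measure of how much luck is in any single number.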