First Models — Machine Learning & AI

The finale. You'll fit your first machine-learning model with scikit-learn — turning the trend line from the last chapter into something that actually predicts — and then get an honest, grounded taste of AI / LLMs, the tools reshaping Business Analytics right now. The goal isn't to make you a data scientist overnight; it's to demystify the words and show you the shape of the work.

ML = learn a pattern from data, then predict on new data. The scikit-learn rhythm is always the same: split → fit → predict → evaluate. LLMs are a different tool — pattern-completing language models you call like an API.

What "machine learning" actually means

Strip away the mystique and it's this: instead of you writing the rule (if spend > 400 then...), you show an algorithm lots of examples and it infers the rule from the data. You did the human version in Chapter 3 with hand-written tiers; ML learns the thresholds itself. Two big families:

Regression — predict a number (next month's revenue, a house price, expected spend).
Classification — predict a category (will this customer churn? is this email spam? which tier?).
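To make "the algorithm infers the rule" concrete, here is a minimal sketch that puts a hand-written rule next to a learned one, using the same toy numbers as the rest of this chapter. The function name hand_written_rule and the 400 cut-off are illustrative (borrowed from the "if spend > 400" example above), and the one-split decision tree is just one convenient way to show a threshold being learned. Both rules are scored on the very rows the tree learned from, which is fine for seeing the mechanism but not a fair evaluation.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Same toy numbers used in this chapter; the label marks revenue >= 1500.
df = pd.DataFrame({
    "spend":   [200, 150, 400, 250, 180, 420, 300, 160, 350, 220],
    "revenue": [1200, 800, 2100, 1500, 900, 2300, 1700, 850, 1900, 1300],
})
df["high_value"] = (df["revenue"] >= 1500).astype(int)

# Human version: you pick the threshold yourself (400 is a guess).
def hand_written_rule(spend):
    return 1 if spend > 400 else 0

# ML version: a one-split decision tree picks the threshold from the examples.
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(df[["spend"]], df["high_value"])
learned_threshold = stump.tree_.threshold[0]   # split point at the root node

# Share of rows each rule labels correctly (on the training rows themselves).
hand_accuracy = (df["spend"].apply(hand_written_rule) == df["high_value"]).mean()
learned_accuracy = stump.score(df[["spend"]], df["high_value"])

print("Learned threshold:", learned_threshold)
print("Hand-written rule accuracy:", hand_accuracy)
print("Learned rule accuracy:", learned_accuracy)
```

The guessed cut-off misses most of the high-value rows, while the tree places its split between the two groups because that is where the examples separate.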

The scikit-learn rhythm — split, fit, predict, evaluate

Every model in scikit-learn follows the same four beats. Learn the rhythm once and you can swap the model out freely. Here's a regression that turns your scatter-plus-trend-line into a predictor:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "spend":   [200, 150, 400, 250, 180, 420, 300, 160, 350, 220],
    "revenue": [1200, 800, 2100, 1500, 900, 2300, 1700, 850, 1900, 1300],
})

X = df[["spend"]]     # features (2-D: rows × columns)
y = df["revenue"]     # target (what we predict)

# 1. SPLIT — hold back 25% to test honestly on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# 2. FIT — learn the pattern from the training data
model = LinearRegression()
model.fit(X_train, y_train)

# 3. PREDICT — on data the model never saw
preds = model.predict(X_test)

# 4. EVALUATE — how good is it?
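print("R² score:", model.score(X_test, y_test))   # 1.0 = perfect, 0 = no better than always guessing the mean (can go negative)
# pass a DataFrame with the same column name used in fit, so scikit-learn doesn't warn about missing feature names
print("Predicted revenue for $500 spend:", model.predict(pd.DataFrame({"spend": [500]}))[0])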

The single most important idea here is the train/test split. You never judge a model on the data it learned from — of course it does well there. You hold back a chunk, predict on it, and see if the pattern generalises. A model that aces training data but flops on the test set is "overfitting" — memorising instead of learning. That instinct (always test on unseen data) is what separates a real result from a fooled one.
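A quick way to see overfitting is to score the same model twice: once on the rows it trained on and once on the held-back rows. The sketch below is an illustration under assumptions, not part of the chapter's dataset: the data is synthetic (revenue is roughly 5 times spend plus random noise), and it swaps in scikit-learn's DecisionTreeRegressor with no depth limit because that model can memorise training rows very easily.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): revenue is roughly
# 5 x spend plus random noise that no model could predict.
rng = np.random.default_rng(0)
spend = rng.uniform(100, 500, size=200)
revenue = 5 * spend + rng.normal(0, 150, size=200)

X = spend.reshape(-1, 1)   # scikit-learn expects 2-D features
y = revenue

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# With no depth limit, the tree can carve out one leaf per training row.
deep_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_r2 = deep_tree.score(X_train, y_train)   # judged on rows it memorised
test_r2 = deep_tree.score(X_test, y_test)      # judged on rows it never saw

print("R² on training data:", train_r2)
print("R² on held-back test data:", test_r2)
```

The gap between the two numbers is the thing to look at. Only the second score tells you how the model behaves on data it has not seen.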

🐘 PHP: Nothing in the PHP world maps to this — it's a genuinely different kind of programming. In PHP you write every rule explicitly. Here you define the shape of a solution and let the data fill in the parameters. Same Python you already know (a DataFrame, a few method calls), pointed at a completely new kind of problem.
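If "the data fills in the parameters" sounds abstract, you can look at them directly. For a linear regression the parameters are just the slope and intercept of the trend line, which scikit-learn stores in coef_ and intercept_ after fitting. This sketch fits on all ten rows purely to inspect those values (no evaluation), and the variable names by_hand and by_model are illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "spend":   [200, 150, 400, 250, 180, 420, 300, 160, 350, 220],
    "revenue": [1200, 800, 2100, 1500, 900, 2300, 1700, 850, 1900, 1300],
})

# Fit on all ten rows here, only to inspect the learned parameters.
model = LinearRegression().fit(df[["spend"]], df["revenue"])

slope = model.coef_[0]          # extra revenue per extra unit of spend
intercept = model.intercept_    # predicted revenue at zero spend

# A prediction is just the line: intercept + slope * spend.
by_hand = intercept + slope * 500
by_model = model.predict(pd.DataFrame({"spend": [500]}))[0]

print("slope:", slope)
print("intercept:", intercept)
print("by hand:", by_hand, "| by model:", by_model)
```

The two predictions match because model.predict does nothing more mysterious than apply the line it learned.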

Classification — predicting a category

Swap the model, keep the rhythm. Predicting a yes/no like "high-value customer?" looks almost identical:

from sklearn.tree import DecisionTreeClassifier

df["high_value"] = (df["revenue"] >= 1500).astype(int)   # the label: 1 or 0
X = df[["spend"]]
y = df["high_value"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)
clf = DecisionTreeClassifier(max_depth=3).fit(X_tr, y_tr)
print("Accuracy:", clf.score(X_te, y_te))

Same four beats: split, fit, predict, evaluate (clf.score runs the predict step for you before comparing against y_te). That consistency is scikit-learn's gift: once you know the pattern, trying a new algorithm is a one-line change, and you'll genuinely use that to compare models in your coursework.
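Here is one way the "one-line change" can look when you compare several classifiers at once. Treat it as a sketch under assumptions: the customers are synthetic, the candidates dictionary and results names are illustrative, and the three algorithms are simply common scikit-learn classifiers picked for the demonstration. The predict step is written out explicitly this time so all four beats are visible.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic customers (an assumption for illustration, not real data).
rng = np.random.default_rng(1)
spend = rng.uniform(100, 500, size=300)
revenue = 5 * spend + rng.normal(0, 200, size=300)
df = pd.DataFrame({"spend": spend, "high_value": (revenue >= 1500).astype(int)})

X = df[["spend"]]
y = df["high_value"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# Only this dictionary changes when you want to try another algorithm.
candidates = {
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbours": KNeighborsClassifier(n_neighbors=5),
}

results = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)                       # fit
    preds = clf.predict(X_te)                 # predict
    results[name] = (preds == y_te).mean()    # evaluate: share of correct labels
    print(f"{name}: accuracy {results[name]:.2f}")
```

Every model sees the same split, so the accuracies are comparable. The hand-computed accuracy here is the same number clf.score(X_te, y_te) would return.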

The honest caveats

A model is only as good as its data and its honesty. Three things to hold onto as you go deeper: (1) Correlation isn't causation — spend predicting revenue doesn't prove spend causes it. (2) Garbage in, garbage out — which is exactly why Chapter 8's cleaning matters more than the model choice. (3) Tiny datasets lie — ten rows can't support real conclusions; these examples are for learning the mechanics, not for decisions. Good analysts are professionally skeptical of their own results.
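Caveat (1) is easy to demonstrate with made-up numbers. In the sketch below, a hidden driver (a hypothetical "season" variable) pushes both spend and revenue up and down, while revenue is generated without any spend term at all. Everything here is simulated for illustration; the point is only that a strong correlation can appear even when one variable has no effect on the other.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hidden driver (hypothetical): seasonal demand moves BOTH numbers.
season = rng.normal(0, 1, size=n)
spend = 300 + 50 * season + rng.normal(0, 10, size=n)
revenue = 1500 + 400 * season + rng.normal(0, 80, size=n)   # note: no spend term

# Pearson correlation between the two observed columns.
corr = np.corrcoef(spend, revenue)[0, 1]
print("Correlation between spend and revenue:", round(corr, 2))
```

A regression fitted on these two columns would predict revenue from spend quite well, yet raising spend in this simulated world would change nothing, because season is doing all the work.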

AI & LLMs — the tool reshaping the field

Large Language Models (the tech behind modern AI assistants) are a different beast from the models above. They aren't trained by you on your spreadsheet; they're enormous pattern-completers trained on vast text, and you call them like a service. For Business Analytics, they're becoming a standard tool for summarising, classifying, and extracting structure from messy text data. The shape of using one is refreshingly familiar — it's just an API call:

# conceptual shape only: some_ai_provider, Client, and complete() are placeholders, not a real SDK
# in practice you'd install a provider's SDK and authenticate with an API key
from some_ai_provider import Client

client = Client(api_key="...")
response = client.complete(
    "Classify this review as Positive, Neutral, or Negative: "
    "'Shipping was slow but the product is excellent.'"
)
print(response.text)   # e.g. "Positive"; the exact wording varies by model and prompt

The analytics superpower is pairing the two worlds: use pandas to wrangle thousands of customer comments into a column, loop them through an LLM to tag sentiment or pull themes, then groupby and chart the results like any other data. Structured analysis over unstructured text — that's a genuinely new capability, and Python is where the two meet.
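The pipeline described above can be sketched end to end without a real provider. In this version tag_sentiment is a hypothetical stand-in that uses simple keyword matching so the example runs offline; in real use that one function would send the comment to an LLM and return its label. The sample comments are invented for the demonstration.

```python
import pandas as pd

def tag_sentiment(comment: str) -> str:
    """Hypothetical stand-in for an LLM call; a real version would send the
    comment to a provider's API and return the label it replies with."""
    text = comment.lower()
    if any(word in text for word in ("excellent", "love", "great")):
        return "Positive"
    if any(word in text for word in ("slow", "broken", "refund")):
        return "Negative"
    return "Neutral"

# Invented sample comments, standing in for thousands of real ones.
comments = pd.DataFrame({
    "product": ["kettle", "kettle", "toaster", "toaster", "toaster"],
    "comment": [
        "Love it, boils fast.",
        "Arrived broken, want a refund.",
        "Great value for the price.",
        "Shipping was slow.",
        "It toasts bread.",
    ],
})

# 1. Tag every comment (this is the step an LLM would do).
comments["sentiment"] = comments["comment"].apply(tag_sentiment)

# 2. From here on it is ordinary pandas: count labels per product.
summary = comments.groupby(["product", "sentiment"]).size().unstack(fill_value=0)
print(summary)
```

Keeping the model call behind a single function means the wrangling and the groupby stay the same whichever provider you end up using.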

🐍 Where this goes: Your degree will go deeper on the statistics (proper validation, feature engineering, model selection) and the ethics (bias, fairness, interpretability). What you've got now is the engineering literacy underneath all of it — enough to read the code, run the rhythm, and ask sharp questions instead of nodding along.

Build, Predict, and Question

Goal: run a complete model end to end, visualise its predictions against reality, and practise the skepticism that makes the result trustworthy.

Use a slightly bigger sample (12+ rows of spend and revenue — extend the data above)
Run the full regression rhythm: split, fit, predict, and print the R² score

Plot predictions vs. actuals to see the fit with your own eyes:

import matplotlib.pyplot as plt

plt.scatter(X_test["spend"], y_test, label="actual")
plt.scatter(X_test["spend"], preds, label="predicted", marker="x")
plt.legend()
plt.xlabel("spend")
plt.ylabel("revenue")
plt.savefig("model_fit.png", dpi=150)   # save first; the figure may be empty after show()
plt.show()

Use it: predict revenue for a $500 spend and write down whether you'd actually trust that number — and why or why not
Now stress-test your skepticism: re-run with a different random_state in the split. Does the R² wobble a lot? On a tiny dataset it will — which is the lesson

That last step is the whole point. A result that swings wildly when you reshuffle the data isn't a finding, it's noise. Learning to distrust a too-good number is the most valuable habit in the entire field.
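One possible way to run the reshuffle test is to loop over several random_state values and collect the scores, rather than editing the number by hand. This sketch uses the chapter's original ten rows; the choice of ten seeds and the scores list are illustrative. How far the numbers move depends on the data, so read the spread yourself rather than assuming it.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "spend":   [200, 150, 400, 250, 180, 420, 300, 160, 350, 220],
    "revenue": [1200, 800, 2100, 1500, 900, 2300, 1700, 850, 1900, 1300],
})
X = df[["spend"]]
y = df["revenue"]

scores = []
for seed in range(10):   # ten different shuffles of the same ten rows
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))   # R² on the held-back rows

print("R² per split:", [round(s, 3) for s in scores])
print("lowest:", round(min(scores), 3), "highest:", round(max(scores), 3))
```

If the lowest and highest scores are far apart, a single R² from one split says more about the shuffle than about the model.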

You fit both a regression and a classifier, you understand the split/fit/predict/evaluate rhythm and why the train/test split keeps you honest, and you've seen how LLMs slot into an analytics pipeline. That's a real, grounded foundation for Data Science and AI coursework — not buzzwords, but the actual shape of the work.

You started this module installing Python and end it predicting the future with it — that's a genuine arc. For a capstone, take the cleaning function and metrics module you built earlier and connect the whole pipeline: load a CSV → clean it → engineer a feature → fit a model → save a chart of the result. One script, start to finish. That end-to-end flow — raw file to defensible insight — is business analytics. Go build something with it. 🐍