Understand AI,
from nothing
to agents.
Fifteen days. No coding. Every idea arrives as a real-world picture first, then a thing you can play with. Built for a curious mind that's strong in maths and physics.
Foundations
What does it even mean for a machine to learn?
- Teach by example, not rules
- Features & boundaries
- Rolling downhill
Neural Networks
Stacking simple decisions into something powerful.
- The neuron
- How networks learn
- Why depth matters
Transformers
How machines read, write, and pay attention.
- Words as vectors
- Attention
- How an LLM writes
Agents
A model that uses tools, remembers, and acts.
- Prompting & tools
- Memory
- The agent loop
The words,
in plain English
Every term in the course, defined simply — with the everyday picture that goes with it. Start typing to filter.
- That classifier box can be a simple KNN (compare to the nearest examples — your Day 1 demo) or a logistic regression (one straight boundary — Day 2).
- The error forms a landscape with valleys: a shallow local minimum you can get stuck in, vs. the true global minimum.
- One full sweep through all the data is an epoch; training repeats the loop for many of them.
The learning loop
One model: data → prediction → error → gradient descent nudges the weights.
Neural network
Stack many of those units in layers; backpropagation sends the error back through all of them.
Transformer
Add embeddings + attention so it reads language and predicts the next token — an LLM.
Agent
Wrap the LLM in a loop with tools and memory so it can act toward a goal.
Every term above (and the rest of the course) is defined below — search to jump to any one.
Classifier
A model whose whole job is to sort things into categories — cat or dog, spam or not-spam, ripe or unripe.
a sorting hat that drops each thing into the right bin
Model
The pattern a machine has learned from examples. It's what the machine actually uses to make a guess.
the "sense" a child builds for which mangoes are ripe
Training (learning)
The process of showing the model examples and adjusting it again and again until it gets good.
practising free-throws until it just clicks
Data / dataset
The collection of examples a model learns from. More good examples usually means a better model.
the stack of flashcards you study from
Feature
A measurable clue describing one example — a fruit's colour, a person's height, a pixel's darkness.
the symptoms a doctor checks
Label
The correct answer attached to a training example, telling the model "this one is a cat."
the answer written on the back of a flashcard
Weight
A number the model tunes that says how much a feature matters to the decision.
how much a judge trusts each piece of evidence
Bias (the number)
An extra adjustable number that nudges the result up or down. (Different from "bias" meaning unfairness.)
a thumb resting gently on one side of the scale
Decision boundary
The line or surface a classifier draws to separate one category from another.
the doctor's mental line between "fine" and "hospital"
Error / loss
A single number for how wrong the model is right now. The whole goal of learning is to make it small.
your golf score — lower is better
Gradient / slope
The direction in which the error rises fastest; step the opposite way and the error drops fastest. With a single knob it's exactly the derivative from calculus.
which way the ground tilts downhill under your feet
Gradient descent
The method of learning: feel the slope, take a small step downhill on the error, repeat thousands of times.
a hiker feeling their way down a mountain in fog
Learning rate
How big each downhill step is. Too small is painfully slow; too big overshoots and flies off.
the length of each stride downhill
Minimum (local vs global)
A low point in the error. A shallow dip you can get stuck in (local) vs. the true lowest point (global = best answer).
a roadside pond vs. the actual sea
Prediction / inference
Using an already-trained model to make a guess on something new. (Training = learning; inference = using.)
sitting the real exam after all the studying
k-Nearest Neighbours (KNN)
A dead-simple classifier: to label something new, look at the most similar examples you've already seen and copy theirs.
"this looks most like the ones we called smileys" — the Day 1 demo
Logistic regression
A simple model that learns one straight decision boundary between two classes.
the single line in the Day 2 demo
Overfitting
When a model memorises the exact training examples instead of the general pattern — great on practice, poor on anything new.
memorising past papers without understanding the subject
Neuron
One tiny decision: multiply each input by a weight, add them up, and pass the total through a "switch."
one judge casting a weighted vote
Neural network
Many neurons wired together in layers, so together they can learn patterns a single line never could.
a whole panel of judges, in rounds
Activation function
The "switch" inside a neuron that decides how strongly it fires for a given input.
a dimmer dial, not just on/off
Forward pass
Data flowing in through the network, layer by layer, until an answer pops out the end.
an order moving down an assembly line
Backpropagation
Sending the error backwards through the network so every weight learns which way to adjust.
tracing a mistake back to find who to coach
Layer
A row of neurons. Early layers spot simple things (edges); deeper ones spot meaning (faces).
stations on an assembly line, simple to complex
Parameters
All the weights and biases a model has learned. Modern models have billions of them.
every knob on a giant mixing board
Epoch
One complete pass through the entire training dataset.
going through the whole deck of flashcards once
Token
A small chunk of text — roughly a word or part of a word — that the model reads and writes one at a time.
a single Lego brick of language
Vector
Just a list of numbers, which you can also picture as a point or arrow in space.
map coordinates, but with many directions
Embedding
Turning a word into a vector so that related words sit close together and maths can capture meaning.
a map where "king" and "queen" are neighbours
Attention
A mechanism that lets each word look at the other words that matter to it, to understand context.
a spotlight you aim at the relevant words
Transformer
The architecture — built mostly from attention — behind today's AI like ChatGPT and Claude.
the engine design under the hood
LLM (large language model)
A huge transformer trained to predict the next token. From that one skill it can write, answer, and summarise.
autocomplete that has read most of the internet
Next-token prediction
How an LLM writes: guess the most fitting next word, add it, then guess again — on repeat.
an improv actor adding one word at a time
Temperature
A dial for how adventurous vs. safe the model's word choices are. High = creative, low = predictable.
how much a cook improvises off the recipe
Context window
How much text the model can hold "in mind" at one time. Beyond it, earlier text is forgotten.
how much fits on your desk before things fall off
Hallucination
When a model states something false but confidently. It predicts plausible-sounding text, not guaranteed truth.
a confident student bluffing an exam answer
Prompt
The instructions you give a model in plain language. Clearer prompts give better results.
the brief you hand a brilliant but literal intern
Tool / function calling
Letting a model use a calculator, a search, or a database so it can actually do things, not just talk.
handing the chef the keys to the kitchen
Agent
A model placed in a loop so it can plan, act with tools, see the result, and keep going toward a goal.
a person working a project with a to-do list
Memory (agentic)
Short-term (the context window) plus long-term — a store the agent writes to and later searches by meaning.
working memory plus a diary you can look things up in
RAG (retrieval-augmented generation)
Fetching the most relevant saved facts and handing them to the model, so it answers from real information.
checking your notes before answering
Fine-tuning
Training an existing model a bit more on your own examples to make it a specialist.
sending a graduate to do a focused apprenticeship
What even is AI?
Before we teach a machine anything — what are we actually talking about? And why does today's AI feel so different?
Artificial Intelligence is any technique that lets a machine do something we'd normally call "intelligent" — recognising a face, understanding a sentence, planning a route, beating you at chess.
That's a huge umbrella. A chess program from 1997 and ChatGPT are both AI — but they work in completely different ways. The first big idea to get straight is that "AI" is a family, not a single thing.
Each box sits inside the one before it. All generative AI is deep learning; all deep learning is machine learning; all of it is AI — but plenty of older AI (like rule-based systems) isn't machine learning at all.
Traditional AI
- What it does: Sorts, scores, predicts, or picks — it analyses existing things.
- Typical job: "Is this spam?" · "Will this customer churn?" · "What's the best chess move?"
- Output: A label, a number, or a choice from fixed options.
- How: Hand-written rules, or simpler models like decision trees and classifiers.
- Example: Your bank's fraud alert; Netflix's "you might like…"; a thermostat's logic.
Generative AI
- What it does: Produces brand-new content that didn't exist before — it generates.
- Typical job: "Write this email" · "Draw a cat astronaut" · "Explain attention to my son."
- Output: Open-ended: paragraphs, images, code, audio.
- How: Huge deep neural networks (transformers) trained on enormous data.
- Example: ChatGPT, Claude, Midjourney — the tools that started the recent boom.
The simplest way to feel the difference: traditional AI chooses from what exists; generative AI makes something that didn't. A spam filter picks one of two boxes. An LLM writes a sentence no one has written before. This course starts with the traditional ideas (Days 1–3) because generative AI is built directly on top of them.
A useful distinction you'll hear: supervised learning (you give labelled examples — classification & regression) vs. unsupervised learning (the machine finds structure itself — clustering). Day 1 is supervised.
ChatGPT is one recent kind (generative). AI has existed since the 1950s and runs quietly in maps, banks, and email today.
No — traditional AI still powers most real systems. Generative is an addition, built on the same foundations.
Today's AI has no understanding or intent. It's maths finding patterns — astonishingly useful, but not conscious.
Next → Now that AI has a shape, we zoom into the idea that powers all the modern stuff: instead of writing rules, we let a machine learn from examples. That's Day 1.
Don't program it. Teach it.
The single biggest idea in AI: machines learn from examples, not from rules we write.
Try to write exact step-by-step rules to tell a cat from a dog — without using the words "cat" or "dog." Pointy ears? Some dogs have them. Bigger? Not a Chihuahua. Every rule breaks.
Nobody hands a child a rulebook for ripe mangoes.
Machine learning works exactly like the child at the fruit stall: examples go in, a pattern comes out. The machine writes its own rulebook — one it usually can't even put into words.
The old way
A human writes the rules; the computer obeys. Great for taxes. Hopeless for "is this a cat?"
rules → answers
The new way
Show it examples; it finds the pattern itself. The pattern it keeps is the model.
examples → pattern
Teach a few of each (try smiley vs. frowny), then draw a new one and hit guess.
There are no rules inside that box. It shrinks your drawing to a 16×16 sketch and compares it to every example you taught — picking the label of the ones it looks most like. More examples → better guesses.
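For anyone curious what that matching could look like in code, here is a minimal sketch. It assumes each drawing has already been shrunk to a small grid of 0s and 1s. The 3×3 grids, the labels, and the helper names `distance` and `guess` are made up for illustration, and the demo itself may differ in its details.

```python
# Hypothetical 3x3 "drawings" flattened to lists of 0/1 pixels (the demo uses 16x16).
examples = [
    ([0, 0, 0, 1, 0, 1, 0, 1, 0], "smiley"),
    ([0, 0, 0, 1, 0, 1, 1, 1, 1], "smiley"),
    ([0, 0, 0, 0, 1, 0, 1, 0, 1], "frowny"),
    ([0, 0, 0, 1, 1, 1, 1, 0, 1], "frowny"),
]

def distance(a, b):
    """Count the pixels where two drawings disagree."""
    return sum(1 for p, q in zip(a, b) if p != q)

def guess(drawing, k=1):
    """Label a new drawing by majority vote among its k most similar examples."""
    nearest = sorted(examples, key=lambda ex: distance(drawing, ex[0]))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

new_drawing = [0, 0, 0, 1, 0, 1, 0, 1, 1]
print(guess(new_drawing))
```

Notice there is no rule about mouths or curves anywhere in it: the only "knowledge" is the list of examples.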
It doesn't. It's matching shapes to remembered examples — nothing more.
Only good data. Biased or sloppy examples teach a biased, sloppy pattern.
Tomorrow → Today it compared whole drawings. But what is it really comparing? Everything boils down to features — a few numbers — and learning becomes finding a line between them.
Two numbers,
one dividing line.
Data becomes features; a model is just the boundary that separates them.
Describe every classmate with two numbers — height and hours of sleep — and guess who plays basketball. Could you draw one line that mostly separates the players? That line is today's whole lesson.
A doctor doesn't memorise every patient.
Two things to notice: she chose which features matter (not shoe size), and she stores a boundary, not the patients. A machine-learned model is exactly the same.
Features
The measurable clues. Each example becomes a dot in a space.
The boundary
Learning = finding the line that best splits the groups.
The model
For a straight line, the whole "intelligence" is just three numbers.
w₁·x + w₂·y + b = 0
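To see how little a straight-line model stores, here is a small sketch that classifies with just those three numbers. The values of `w1`, `w2` and `b` are invented for the basketball example, not learned from any data, and the function name `side` is only illustrative.

```python
# Hypothetical boundary: these three numbers are made up, not learned.
w1, w2, b = 1.0, 0.5, -185.0   # weight for height (cm), weight for sleep (hours), bias

def side(height_cm, sleep_hours):
    """Report which side of the line w1*x + w2*y + b = 0 a point falls on."""
    score = w1 * height_cm + w2 * sleep_hours + b
    return "player" if score > 0 else "non-player"

print(side(190, 8))   # 190 + 4.0 - 185 = 9.0, above the line
print(side(165, 9))   # 165 + 4.5 - 185 = -15.5, below the line
```

The classmates themselves appear nowhere in the model. Only the boundary survives.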
Now hit "tricky one": no straight line can separate it. Hold that thought: fixing it is exactly why Phase 2 builds neural networks.
No — it keeps only the boundary. Delete the dots and it still classifies.
The tricky example just disproved that. Real data often isn't linearly separable.
Tomorrow → You watched the line "settle" — but how did it know which way to move? It's rolling downhill on a landscape of its own mistakes. Your calculus becomes the hero.
Learning is
rolling downhill.
Measure the error, feel the slope, take a step. Repeat. That's the engine of all of it.
You're blindfolded, shooting arrows. After each shot I tell you only "too high" or "too low." Can you eventually hit the target? Of course — you nudge, and shrink the nudge as you close in.
Pretend the AI has just one knob it can turn.
The "valley" in the demo below is not a real place — it's a picture of that one idea. Before you touch it, here's exactly how to read the picture:
Left → right
The knob — the single number the AI is allowed to change (its "weight"). Left is one setting, right is another.
Up ↑ down ↓
How wrong the AI is at that knob setting. High up = lots of mistakes. Low down = few mistakes.
The dip ↓
The knob setting that makes the fewest mistakes. That low point is the answer the AI is hunting for.
The orange ball is the AI's current guess for the knob. Like a hiker in thick fog (above), it can't see the whole curve — it only feels the slope right under it, and steps in the downhill direction (toward fewer mistakes). The size of each step is the learning rate. Roll downhill enough times and the ball settles in the dip — the AI has found its best setting.
In short: downhill = less wrong. Watch the ball, the dashed slope line, and the live equation move together as you press Step.
Three experiments: a tiny rate (slow), a big rate past ~1.6 (it flies off — divergence is real!), and dropping the ball left vs. right to land in a deep valley vs. a shallow trap.
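The same experiments can be sketched in a few lines. This assumes a hypothetical one-knob error curve, (w − 3)², chosen only because its slope is easy to write down. The demo uses its own curve, so its tipping point (~1.6) differs from this one's (1.0).

```python
def loss(w):
    """A hypothetical one-knob error curve with its lowest point at w = 3."""
    return (w - 3.0) ** 2

def slope(w):
    """Derivative of the loss: positive means the ground rises to the right."""
    return 2.0 * (w - 3.0)

def descend(learning_rate, start=0.0, steps=50):
    """Repeat: feel the slope, then step the opposite way, scaled by the learning rate."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * slope(w)
    return w

print(descend(0.01))  # tiny rate: still well short of 3 after 50 steps
print(descend(0.1))   # sensible rate: settles very close to 3
print(descend(1.1))   # too big: each step overshoots further, so it diverges
```

This curve has a single valley, so it cannot show the shallow-trap experiment. That one needs a bumpier landscape.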
Not guaranteed — local minima and bad step sizes can trap or destabilise it.
Only up to a point. Past it you get worse, not faster — overshoot.
Phase 2 → One straight line couldn't split the tricky data. The fix: stack many simple weighted decisions into a network, and train the whole thing by rolling downhill — in millions of dimensions. That unit is the neuron.
The neuron:
a weighted vote.
The whole of deep learning is built from one tiny, simple unit. Here it is.
Picture a panel of judges scoring a dish.
A neuron takes its inputs, multiplies each by a weight (how much it trusts that input), adds them up with a bias, and passes the total through a switch (the activation) that decides how strongly it fires.
Weights
How much each input matters. Big weight = "I trust this a lot."
Sum + bias
Add the weighted inputs, nudge with a bias.
w₁x₁ + w₂x₂ + b
Activation
A switch that turns the sum into an output between 0 and 1.
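Put together, the three pieces fit in a handful of lines. This sketch assumes the switch is a sigmoid, one common choice that produces outputs between 0 and 1. The weights and inputs are invented for illustration.

```python
import math

def sigmoid(z):
    """Squash any number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the switch."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(total)

# Hypothetical weights: this neuron trusts the first input twice as much.
print(neuron([1.0, 0.0], weights=[2.0, 1.0], bias=-1.0))  # weighted sum is 1.0
print(neuron([0.0, 0.0], weights=[2.0, 1.0], bias=0.0))   # weighted sum is 0.0, fires at 50%
```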
Blue and red dots are two classes. The line is where the neuron "fires at 50%." Tune the weights so the line separates them.
Loosely inspired by one, but it's just multiply-add-and-switch. Pure arithmetic.
Alone it's just Day 2's straight line again. The power comes from stacking many — next lesson.
Next → One neuron draws one line. Wire many together in layers and they can trace any curve — and learn it automatically by sending the error backwards.
How a network
learns.
Stack neurons, push data through, then send the mistake backwards. Watch it learn a curve live.
A relay team passing a baton — and passing back the blame.
Forward pass = make a prediction. Backpropagation = trace the error back through every weight so each one knows which way to nudge. It's the chain rule from calculus, applied layer by layer — then it's just gradient descent (Day 3) on all of them at once.
Forward pass
Inputs flow through the layers; an answer pops out.
Backprop
The error flows back, assigning blame to each weight.
Repeat
Do it over the data many times (epochs). The fit improves.
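Here is the smallest version of that loop: a chain of just two weights, with the blame passed back by the chain rule. Everything specific in it is an assumption for illustration: the tanh switch, the squared-error loss, the single training example, and the starting weights.

```python
import math

def forward(x, w1, w2):
    """Two-step chain: hidden = tanh(w1 * x), prediction = w2 * hidden."""
    hidden = math.tanh(w1 * x)
    return hidden, w2 * hidden

def loss(x, target, w1, w2):
    """Squared error between the prediction and the target."""
    return (forward(x, w1, w2)[1] - target) ** 2

def backward(x, target, w1, w2):
    """Chain rule, applied from the error back towards the input."""
    hidden, pred = forward(x, w1, w2)
    d_pred = 2.0 * (pred - target)              # slope of the loss w.r.t. the prediction
    d_w2 = d_pred * hidden                      # blame for the output weight
    d_hidden = d_pred * w2                      # error passed back one layer
    d_w1 = d_hidden * (1.0 - hidden ** 2) * x   # tanh'(z) = 1 - tanh(z)^2
    return d_w1, d_w2

# Hypothetical single training example and starting weights.
x, target, w1, w2 = 1.5, 0.8, 0.3, -0.5
start = loss(x, target, w1, w2)
for _ in range(200):                            # gradient descent on both weights at once
    g1, g2 = backward(x, target, w1, w2)
    w1, w2 = w1 - 0.1 * g1, w2 - 0.1 * g2
final = loss(x, target, w1, w2)
print(start, final)                             # the loss should shrink
```

A real network does exactly this bookkeeping, only for every weight in every layer.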
The grey curve is the target. The coloured line is the network's current guess — random at first, then bending to fit as the error (loss) drops.
It's the chain rule — a tidy way to compute every weight's slope at once.
Too many and it memorises noise instead of the pattern — overfitting.
Next → Our network has one hidden layer. Stack many layers — go "deep" — and something remarkable happens: the layers start inventing their own concepts, simple to complex.
Going deep.
Why stacking layers lets a network build meaning — from edges, to shapes, to "that's a face."
An assembly line, or building with Lego.
"Deep" just means many layers. Each layer composes the previous one's output into something more abstract. Nobody tells it what an edge or an eye is — it invents these useful features itself. That's representation learning.
Draw a shape or letter above.
This is a real edge-detector — exactly the kind of filter a first layer learns. It lights up where your drawing has an edge in the chosen direction. Deeper layers combine these into shapes and objects.
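A rough sketch of what such a filter does, on a tiny image small enough to check by hand. The 3×3 numbers below are a hand-written vertical-edge filter chosen for illustration. A trained first layer would learn its own values.

```python
# A tiny 4x5 image: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]

# Hand-written 3x3 filter that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

def apply_filter(img, k):
    """Slide the 3x3 filter over the image and record how strongly it matches."""
    rows, cols = len(img) - 2, len(img[0]) - 2
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            out[r][c] = sum(
                k[i][j] * img[r + i][c + j] for i in range(3) for j in range(3)
            )
    return out

for row in apply_filter(image, kernel):
    print(row)   # large values where the window straddles the edge, 0 in flat areas
```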
We don't. We only set up the layers and the goal; the features emerge from training.
Depth adds power but also cost and trickier training. It's a balance.
Phase 3 → So far, inputs were numbers and dots. But how does a network read words? The trick is to turn every word into a list of numbers — a vector — arranged so meaning becomes geometry.
Words become
points in space.
Turn every word into numbers, and meaning turns into geometry you can do maths on.
A map — but of meaning instead of places.
Each word becomes a vector, a point in space. Direction and distance encode meaning. The astonishing part: this lets you do arithmetic on meaning. king − man + woman lands right next to queen.
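The arithmetic is easy to try on a toy map. The 2D vectors below are hand-built so that the first number loosely means "royalty" and the second "gender". That is an assumption purely for illustration, since real embeddings are learned and their dimensions carry no such names.

```python
import math

# Hand-built 2D vectors: [royalty, gender]. Invented for this example.
words = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def nearest(point, exclude=()):
    """Return the known word whose vector lies closest to the given point."""
    candidates = {w: v for w, v in words.items() if w not in exclude}
    return min(candidates, key=lambda w: math.dist(point, candidates[w]))

# king - man + woman, computed one coordinate at a time.
result = [k - m + w for k, m, w in zip(words["king"], words["man"], words["woman"])]
print(result, nearest(result, exclude=("king", "man", "woman")))
```

On this toy map the result lands exactly on queen. With learned embeddings it usually lands nearby instead.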
Real embeddings have hundreds of dimensions. We flatten to 2D just to see it.
No — the positions are learned from billions of sentences about which words appear together.
Next → If words live as points in space, how does a machine actually find the closest ones? That's the trick behind every "search by meaning" — and how AIs remember things in a vector database.
Searching
by meaning.
Two ideas — cosine similarity and the vector database — turn "find related stuff" into pure geometry.
Ask ChatGPT "what did we talk about last week?" and it doesn't remember in the human sense. Systems with long-term memory typically search, by meaning, through a big pile of stored vectors. It's the same kind of trick a search engine can use to find a page that doesn't even contain your exact words.
A library where books aren't shelved by title — they're shelved by what they're about.
On Day 7 we drew words as dots on a 2D map. That was a lie of convenience. A real embedding is a list of hundreds of numbers — often 768 or 1536. Each number is a learned "feature direction" that no human ever named.
No one wrote those numbers. The model learned them from billions of sentences — by nudging the values until words that appear in similar contexts ended up at similar coordinates. That's all "training an embedding" is.
If a word is a point, the line from origin to that point is an arrow. To ask "are these two words similar?" we ask: do their arrows point the same way?
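Same direction → similarity = 1.
Perpendicular → similarity = 0.
Opposite directions → similarity = −1.
The formula looks scary but it's just the cosine of the angle between the two arrows: cos θ = (a · b) / (‖a‖ ‖b‖), the dot product of the arrows divided by the product of their lengths.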
Why angle and not plain distance? Because the length of an embedding vector doesn't carry meaning — only its direction does. Two arrows pointing the same way are "about the same thing", even if one is twice as long.
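A minimal sketch of that measurement, using 2D arrows small enough to check by hand. The function name `cosine_similarity` and the example arrows are illustrative choices.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

print(cosine_similarity([1, 2], [2, 4]))    # same direction, one arrow twice as long
print(cosine_similarity([1, 0], [0, 3]))    # perpendicular
print(cosine_similarity([1, 2], [-1, -2]))  # opposite directions
```

Up to floating-point rounding, the three results are 1, 0 and −1. The doubled arrow scores the same as an identical one because length cancels out of the formula.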
Now glue the two ideas together. A vector database is a big table where each row is some piece of text (a sentence, a doc, a memory) and its embedding. When you ask it a question:
- Turn your question into an embedding (one arrow).
- Compute the cosine similarity against every stored embedding.
- Return the top few highest scores.
That's it. No keyword matching. No clever indexing magic at the concept level — just "who points the most in my direction?"
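Those three steps can be sketched as a toy "database" of four rows. The texts and their 3-number embeddings are invented here. A real system would get its vectors from an embedding model and would usually add indexing so it need not scan every row.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product over the product of lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Each row pairs a piece of text with its (made-up) embedding.
rows = [
    ("how to bake bread",      [0.9, 0.1, 0.0]),
    ("sourdough starter tips", [0.8, 0.3, 0.1]),
    ("fixing a bicycle tyre",  [0.0, 0.2, 0.9]),
    ("marathon training plan", [0.1, 0.9, 0.3]),
]

def search(query_vector, top_k=2):
    """Score every row against the query and return the best-matching texts."""
    scored = [(cosine(query_vector, vec), text) for text, vec in rows]
    scored.sort(reverse=True)
    return [text for _, text in scored[:top_k]]

# Pretend this is the embedding of the question "what's a good loaf recipe?"
print(search([1.0, 0.2, 0.0]))
```

The query shares no keyword with "how to bake bread", yet that row scores highest because the arrows point almost the same way.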
The DB knows these 12 words. Tap one to use it as your query — the rest get ranked by cosine similarity to it:
It's not. Two arrows can be far apart in length but point the same way (cosine = 1). Cosine only cares about direction.
It doesn't read anything. It just does arithmetic on numbers an embedding model handed it.
Individual dimensions usually mean nothing a human can name. Only the overall direction has meaning.
Next → A word's embedding is fixed the moment it's looked up — but the same word means different things in different sentences. So before any searching, the model needs to reshape each word by its neighbours. That's context.
Context changes
everything.
The same word means different things depending on its neighbours. Watch meaning shift.
Say "I never said she stole the money" out loud, stressing a different word each time. Seven stresses, seven meanings — same words. Context is everything.
The word "bank" alone tells you nothing.
The word in focus: bank. Tap a context word to add it to the sentence:
In a transformer, a word's representation is reshaped by its context every time.
Next → If every word must look at the others to settle its meaning, we need a mechanism for that looking. That mechanism — attention — is the heart of the transformer.
Attention:
the spotlight.
The single idea behind every modern AI. Each word asks: which other words should I look at?
A spotlight in a dark theatre.
Each word sends out a question, every word offers a key, and the match scores decide how much attention to pay. High match → look here. It's dot products → softmax → a weighted blend.
Tap a word above. Before the lines appear — guess which word it reaches for.
Keeping it honest: a real transformer learns its own meaning-vectors. Here each word has a small hand-built vector so the mechanism is visible — but the engine (match → soften → blend) is the genuine article.
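The match → soften → blend engine, as a short sketch. It uses hand-built 2D vectors and lets each word's vector act as its own question, key and value. That is a simplification: a real transformer learns separate projections for the three roles and also rescales the scores.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Match the query against every key, soften, then blend the values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]  # dot products
    weights = softmax(scores)
    blended = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, blended

# Hand-built vectors for three words (invented for illustration).
words = ["river", "bank", "money"]
vectors = [[2.0, 0.0], [1.0, 1.0], [0.0, 2.0]]

weights, blended = attend(query=[2.0, 0.0], keys=vectors, values=vectors)
print(dict(zip(words, weights)), blended)
```

The largest weight goes to the word whose key best matches the question, and the output is a blend that leans towards that word's vector.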
It's a weighted average steered by learned matches. Powerful, but still arithmetic.
Next → Stack many attention layers, train on the whole internet, and you get a machine that predicts the next word brilliantly — a large language model.
How an LLM
writes.
It's autocomplete that read the whole library: predict the next word, add it, repeat.
A very well-read improv actor.
A transformer stacks attention layers to read all the context, then predicts the next token. Temperature controls the choice: low plays it safe and predictable; high gets adventurous and creative.
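One common way the temperature dial is implemented is to divide the model's raw scores by the temperature before the softmax. The sketch below assumes that scheme, with made-up scores for three candidate words.

```python
import math

def next_word_probs(scores, temperature):
    """Softmax over the scores after dividing each by the temperature."""
    scaled = [s / temperature for s in scores.values()]
    exps = [math.exp(s - max(scaled)) for s in scaled]
    total = sum(exps)
    return {word: e / total for word, e in zip(scores, exps)}

# Hypothetical raw scores for the word after "The cat sat on the".
scores = {"mat": 3.0, "sofa": 2.0, "moon": 0.5}

for t in (0.2, 1.0, 2.0):
    probs = next_word_probs(scores, t)
    print(t, {w: round(p, 3) for w, p in probs.items()})
```

At a low temperature nearly all the probability piles onto the top word. At a high temperature it spreads out, so unusual words get picked more often.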
It only ever predicts the next word, then re-reads everything and predicts again.
It predicts plausible text. That's why it can sound confident yet be wrong — a hallucination.
Phase 4 → A model that writes is amazing — but it only talks. Next we give it tools, memory, and a loop, turning the talker into a doer: an agent.
Words → vectors →
attention → next word.
A single live wiring diagram of everything Phase 3 taught. Change the sentence — watch every stage update.
Real LLMs split text into "tokens" — often word-pieces. Our toy model uses whole words. The order matters: each token gets a position.
Each token is looked up in a giant table — the embedding matrix. This is the vector DB: row = token, columns = the 8 numbers that define its meaning. (Real models use 768 or more.) Each strip below is one row of the DB.
Now the model asks every token: "which other tokens are you most similar to?" Cosine similarity between every pair. The hotter the cell, the more aligned the meaning.
Cosines become attention weights by passing through a softmax: the top scores get amplified, the rest get squashed. The query token then gathers a weighted mix of every other token's vector.
A transformer doesn't run attention once — it stacks layers. Each layer takes the previous layer's vectors and re-mixes them through attention. The same word's vector changes at every layer, getting richer with context. That's why "bank" near "river" ends up very different from "bank" near "money".
To produce the next word, the model takes the last token's vector from the top layer and compares it against every word in the vocabulary — cosine again. Softmax over those scores gives a probability distribution. Sample from it: that's your next word.
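That last step of the toy model, sketched with an invented three-word vocabulary and a pretend top-layer vector. It follows the toy's recipe of cosine scores fed into a softmax. Production LLMs typically score with unnormalised dot products, which gives much sharper probabilities.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity: dot product over the product of lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Invented 2-number vocabulary vectors and a pretend top-layer vector.
vocabulary = {"water": [1.0, 0.1], "loan": [0.1, 1.0], "shore": [0.9, 0.3]}
last_token_vector = [0.8, 0.2]

# Compare the last token's vector against every word in the vocabulary.
scores = {word: cosine(last_token_vector, vec) for word, vec in vocabulary.items()}

# Softmax: exponentiate, then divide by the total so the values sum to 1.
exps = {word: math.exp(s) for word, s in scores.items()}
total = sum(exps.values())
probs = {word: e / total for word, e in exps.items()}

# Sample one word in proportion to its probability (fixed seed for repeatability).
rng = random.Random(0)
next_word = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
print(probs, next_word)
```

Because the word is sampled and not simply the top scorer, running with a different seed can produce a different continuation from the same probabilities.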
Storing embeddings is literally rows of numbers, one per token, plus a similarity function.
Cosine measures alignment; softmax turns raw scores into clean weights that sum to 1.
One attention pass doesn't see enough. Stacking 12, 24, 96 of them is what makes modern LLMs powerful.