A visual, hands-on course

Understand AI,
from nothing
to agents.

Fifteen days. No coding. Every idea arrives as a real-world picture first, then a thing you can play with. Built for a curious mind that's strong in maths and physics.

PHASE 1

Foundations

What does it even mean for a machine to learn?

· Teach by example, not rules
· Features & boundaries
· Rolling downhill

PHASE 2

Neural Networks

Stacking simple decisions into something powerful.

· The neuron
· How networks learn
· Why depth matters

PHASE 3

Transformers

How machines read, write, and pay attention.

· Words as vectors
· Attention
· How an LLM writes

PHASE 4

Agents

A model that uses tools, remembers, and acts.

· Prompting & tools
· Memory
· The agent loop

Reference

The words,
in plain English

Every term in the course, defined simply — with the everyday picture that goes with it. Start typing to filter.

How it all fits together

Worked example · a spam filter Every email is an example. Its features might be: number of links, whether it says “free”, and if the subject is ALL CAPS. Its label is what you once called it — spam or not. The classifier (a kind of model) gives each feature a weight, adds a bias, and makes a prediction. At first it's wrong often — that gap is the error. We find the gradient (which weights to blame), and step by the learning rate to nudge them: gradient descent. After many epochs the error is small and the model has carved a decision boundary between spam and not. Run it on a brand-new email → that's inference. (If it merely memorised your old emails → overfitting.)

The learning loop — the heart of it

That classifier box can be a simple KNN (compare to the nearest examples — your Day 1 demo) or a logistic regression (one straight boundary — Day 2).
The error forms a landscape with valleys: a shallow local minimum you can get stuck in, vs. the true global minimum.
One full sweep through all the data is an epoch; training repeats the loop for many of them.

…and it scales up to the whole course

PHASE 1

The learning loop

One model: data → prediction → error → gradient descent nudges the weights.

→

PHASE 2

Neural network

Stack many of those units in layers; backpropagation sends the error back through all of them.

→

PHASE 3

Transformer

Add embeddings + attention so it reads language and predicts the next token — an LLM.

→

PHASE 4

Agent

Wrap the LLM in a loop with tools and memory so it can act toward a goal.

Every term above (and the rest of the course) is defined below — search to jump to any one.

Foundations Phase 1 · you're here

Classifier

A model whose whole job is to sort things into categories — cat or dog, spam or not-spam, ripe or unripe.

a sorting hat that drops each thing into the right bin

Model

The pattern a machine has learned from examples. It's what the machine actually uses to make a guess.

the "sense" a child builds for which mangoes are ripe

Training (learning)

The process of showing the model examples and adjusting it again and again until it gets good.

practising free-throws until it just clicks

Data / dataset

The collection of examples a model learns from. More good examples usually means a better model.

the stack of flashcards you study from

Feature

A measurable clue describing one example — a fruit's colour, a person's height, a pixel's darkness.

the symptoms a doctor checks

Label

The correct answer attached to a training example, telling the model "this one is a cat."

the answer written on the back of a flashcard

Weight

A number the model tunes that says how much a feature matters to the decision.

how much a judge trusts each piece of evidence

Bias (the number)

An extra adjustable number that nudges the result up or down. (Different from "bias" meaning unfairness.)

a thumb resting gently on one side of the scale

Decision boundary

The line or surface a classifier draws to separate one category from another.

the doctor's mental line between "fine" and "hospital"

Error / loss

A single number for how wrong the model is right now. The whole goal of learning is to make it small.

your golf score — lower is better

Gradient / slope

The direction in which the error drops fastest. It's exactly the derivative from calculus.

which way the ground tilts downhill under your feet

Gradient descent

The method of learning: feel the slope, take a small step downhill on the error, repeat thousands of times.

a hiker feeling their way down a mountain in fog

Learning rate

How big each downhill step is. Too small is painfully slow; too big overshoots and flies off.

the length of each stride downhill

Minimum (local vs global)

A low point in the error. A shallow dip you can get stuck in (local) vs. the true lowest point (global = best answer).

a roadside pond vs. the actual sea

Prediction / inference

Using an already-trained model to make a guess on something new. (Training = learning; inference = using.)

sitting the real exam after all the studying

k-Nearest Neighbours (KNN)

A dead-simple classifier: to label something new, look at the most similar examples you've already seen and copy theirs.

"this looks most like the ones we called smileys" — the Day 1 demo

Logistic regression

A simple model that learns one straight decision boundary between two classes.

the single line in the Day 2 demo

Overfitting

When a model memorises the exact training examples instead of the general pattern — great on practice, poor on anything new.

memorising past papers without understanding the subject

Neural Networks Phase 2 · coming up

Neuron

One tiny decision: multiply each input by a weight, add them up, and pass the total through a "switch."

one judge casting a weighted vote

Neural network

Many neurons wired together in layers, so together they can learn patterns a single line never could.

a whole panel of judges, in rounds

Activation function

The "switch" inside a neuron that decides how strongly it fires for a given input.

a dimmer dial, not just on/off

Forward pass

Data flowing in through the network, layer by layer, until an answer pops out the end.

an order moving down an assembly line

Backpropagation

Sending the error backwards through the network so every weight learns which way to adjust.

tracing a mistake back to find who to coach

Layer

A row of neurons. Early layers spot simple things (edges); deeper ones spot meaning (faces).

stations on an assembly line, simple to complex

Parameters

All the weights and biases a model has learned. Modern models have billions of them.

every knob on a giant mixing board

Epoch

One complete pass through the entire training dataset.

going through the whole deck of flashcards once

Language & Transformers Phase 3 · coming up

Token

A small chunk of text — roughly a word or part of a word — that the model reads and writes one at a time.

a single Lego brick of language

Vector

Just a list of numbers, which you can also picture as a point or arrow in space.

map coordinates, but with many directions

Embedding

Turning a word into a vector so that related words sit close together and maths can capture meaning.

a map where "king" and "queen" are neighbours

Attention

A mechanism that lets each word look at the other words that matter to it, to understand context.

a spotlight you aim at the relevant words

Transformer

The architecture — built mostly from attention — behind today's AI like ChatGPT and Claude.

the engine design under the hood

LLM (large language model)

A huge transformer trained to predict the next token. From that one skill it can write, answer, and summarise.

autocomplete that has read most of the internet

Next-token prediction

How an LLM writes: guess the most fitting next word, add it, then guess again — on repeat.

an improv actor adding one word at a time

Temperature

A dial for how adventurous vs. safe the model's word choices are. High = creative, low = predictable.

how much a cook improvises off the recipe

Context window

How much text the model can hold "in mind" at one time. Beyond it, earlier text is forgotten.

how much fits on your desk before things fall off

Hallucination

When a model states something false but confidently. It predicts plausible-sounding text, not guaranteed truth.

a confident student bluffing an exam answer

Agents Phase 4 · coming up

Prompt

The instructions you give a model in plain language. Clearer prompts give better results.

the brief you hand a brilliant but literal intern

Tool / function calling

Letting a model use a calculator, a search, or a database so it can actually do things, not just talk.

handing the chef the keys to the kitchen

Agent

A model placed in a loop so it can plan, act with tools, see the result, and keep going toward a goal.

a person working a project with a to-do list

Memory (agentic)

Short-term (the context window) plus long-term — a store the agent writes to and later searches by meaning.

working memory plus a diary you can look things up in

RAG (retrieval-augmented generation)

Fetching the most relevant saved facts and handing them to the model, so it answers from real information.

checking your notes before answering

Fine-tuning

Training an existing model a bit more on your own examples to make it a specialist.

sending a graduate to do a focused apprenticeship

No term matches that — try another word.

Day 00 · Start here

What even is AI?

Before we teach a machine anything — what are we actually talking about? And why does today's AI feel so different?

The one-sentence answer

Artificial Intelligence is any technique that lets a machine do something we'd normally call "intelligent" — recognising a face, understanding a sentence, planning a route, beating you at chess.

That's a huge umbrella. A chess program from 1997 and ChatGPT are both AI — but they work in completely different ways. The first big idea to get straight is that "AI" is a family, not a single thing.

The family tree

Each box sits inside the one before it. All generative AI is deep learning; all deep learning is machine learning; all of it is AI — but plenty of older AI (like rule-based systems) isn't machine learning at all.

The big split — two eras of AI

Traditional AI · recognise & decide

What it doesSorts, scores, predicts, or picks — it analyses existing things.
Typical job"Is this spam?" · "Will this customer churn?" · "What's the best chess move?"
OutputA label, a number, or a choice from fixed options.
HowHand-written rules, or simpler models like decision trees and classifiers.
ExampleYour bank's fraud alert; Netflix's "you might like…"; a thermostat's logic.

Generative AI · create something new

What it doesProduces brand-new content that didn't exist before — it generates.
Typical job"Write this email" · "Draw a cat astronaut" · "Explain attention to my son."
OutputOpen-ended: paragraphs, images, code, audio.
HowHuge deep neural networks (transformers) trained on enormous data.
ExampleChatGPT, Claude, Midjourney — the tools that started the recent boom.

The simplest way to feel the difference: traditional AI chooses from what exists; generative AI makes something that didn't. A spam filter picks one of two boxes. An LLM writes a sentence no one has written before. This course starts with the traditional ideas (Days 1–3) because generative AI is built directly on top of them.

Core ideas in traditional AI

Rule-based / expert systems

The earliest AI: humans write thousands of explicit if-this-then-that rules. Powerful where rules are clear (tax software), brittle where they aren't ("is this a cat?").

Search & planning

Exploring many possible moves or routes to find a good one. How chess engines and GPS navigation "think ahead."

Classification

Sorting things into categories — spam/not-spam, ripe/unripe. The workhorse of traditional AI, and exactly where Day 1 begins.

Regression

Predicting a number rather than a category — a house price, tomorrow's temperature.

Clustering

Finding natural groups in data without being told the answers — e.g. grouping customers by behaviour.

Decision trees

A flowchart of yes/no questions the machine learns from data. Simple, fast, and easy to read.

A useful distinction you'll hear: supervised learning (you give labelled examples — classification & regression) vs. unsupervised learning (the machine finds structure itself — clustering). Day 1 is supervised.

Watch out for

✕

"AI means ChatGPT."

ChatGPT is one recent kind (generative). AI has existed since the 1950s and runs quietly in maps, banks, and email today.

✕

"Generative AI replaced traditional AI."

No — traditional AI still powers most real systems. Generative is an addition, built on the same foundations.

✕

"AI = a thinking mind."

Today's AI has no understanding or intent. It's maths finding patterns — astonishingly useful, but not conscious.

Explain it back

In one sentence, what is AI?

What's the core difference between traditional and generative AI?

Is a spam filter generative AI? Why not?

Where does "machine learning" sit inside the AI family?

Next → Now that AI has a shape, we zoom into the idea that powers all the modern stuff: instead of writing rules, we let a machine learn from examples. That's Day 1.

Day 01 · Foundations

Don't program it. Teach it.

The single biggest idea in AI: machines learn from examples, not from rules we write.

The hook

?

Try to write exact step-by-step rules to tell a cat from a dog — without using the words "cat" or "dog." Pointy ears? Some dogs have them. Bigger? Not a Chihuahua. Every rule breaks.

The analogy

Nobody hands a child a rulebook for ripe mangoes.

You just point — "ripe… not that one… ripe" — and after a dozen examples, something clicks. That clicked-into-place pattern is a model.

Machine learning works exactly like the child at the fruit stall: examples go in, a pattern comes out. The machine writes its own rulebook — one it usually can't even put into words.

The idea

The old way

A human writes the rules; the computer obeys. Great for taxes. Hopeless for "is this a cat?"

rules → answers

The new way

Show it examples; it finds the pattern itself. The pattern it keeps is the model.

examples → pattern

Try it — teach a machine

Workbench · KNN classifierdraw, teach, then let it guess

0

Teach a few of each (try smiley vs. frowny), then draw a new one and hit guess.

There are no rules inside that box. It shrinks your drawing to a 16×16 sketch and compares it to every example you taught — picking the label of the ones it looks most like. More examples → better guesses.

Watch out for

✕

"It understands faces."

It doesn't. It's matching shapes to remembered examples — nothing more.

✕

"More data is always better."

Only good data. Biased or sloppy examples teach a biased, sloppy pattern.

Explain it back

What's the difference between programming and machine learning?

Why couldn't we just write rules to recognise faces?

The machine made a mistake — whose fault, and how do you fix it?

Tomorrow → Today it compared whole drawings. But what is it really comparing? Everything boils down to features — a few numbers — and learning becomes finding a line between them.

Day 02 · Foundations

Two numbers,
one dividing line.

Data becomes features; a model is just the boundary that separates them.

The hook

?

Describe every classmate with two numbers — height and hours of sleep — and guess who plays basketball. Could you draw one line that mostly separates the players? That line is today's whole lesson.

The analogy

A doctor doesn't memorise every patient.

She learns which signals matter — temperature, blood pressure — and roughly where the line sits between "fine" and "hospital." That boundary is her model.

Two things to notice: she chose which features matter (not shoe size), and she stores a boundary, not the patients. A machine-learned model is exactly the same.

The idea

Features

The measurable clues. Each example becomes a dot in a space.

The boundary

Learning = finding the line that best splits the groups.

The model

For a straight line, the whole "intelligence" is just three numbers.

w₁·x + w₂·y + b = 0

Try it — find the boundary

Workbench · logistic regressiondrop dots, watch it learn the line

accuracy

—

The model is just three numbers:
w₁ = —
w₂ = —
bias = —

Now hit tricky one: no straight line can separate it. Hold that thought — fixing it is exactly why Phase 2 builds neural networks.

Watch out for

✕

"The model remembers the dots."

No — it keeps only the boundary. Delete the dots and it still classifies.

✕

"A line can separate anything."

The tricky example just disproved that. Real data often isn't linearly separable.

Explain it back

What's a "feature"? Give two for telling apples from oranges.

Why is the model the boundary, not the data?

The tricky example failed — what does that tell us we'll need next?

Tomorrow → You watched the line "settle" — but how did it know which way to move? It's rolling downhill on a landscape of its own mistakes. Your calculus becomes the hero.

Day 03 · Foundations

Learning is
rolling downhill.

Measure the error, feel the slope, take a step. Repeat. That's the engine of all of it.

The hook

?

You're blindfolded, shooting arrows. After each shot I tell you only "too high" or "too low." Can you eventually hit the target? Of course — you nudge, and shrink the nudge as you close in.

The analogy

Pretend the AI has just one knob it can turn.

Every position of that knob makes the AI a little better or worse. "Rolling downhill" just means turning the knob toward fewer mistakes.

The "valley" in the demo below is not a real place — it's a picture of that one idea. Before you touch it, here's exactly how to read the picture:

How to read the picture

Left → right

The knob — the single number the AI is allowed to change (its "weight"). Left is one setting, right is another.

Up ↑ down ↓

How wrong the AI is at that knob setting. High up = lots of mistakes. Low down = few mistakes.

The dip ↓

The knob setting that makes the fewest mistakes. That low point is the answer the AI is hunting for.

The orange ball is the AI's current guess for the knob. Like a hiker in thick fog (above), it can't see the whole curve — it only feels the slope right under it, and steps in the downhill direction (toward fewer mistakes). The size of each step is the learning rate. Roll downhill enough times and the ball settles in the dip — the AI has found its best setting.

In short: downhill = less wrong. Watch the ball, the dashed slope line, and the live equation move together as you press Step.

Try it — roll downhill

Workbench · gradient descentdrop the ball, set the step, run

Learning rate (step size) 0.25

A sensible step — the ball should glide into a valley.

step

0

weight w

2.60

error

—

slope

—

new w = w − rate × slope

Three experiments: a tiny rate (slow), a big rate past ~1.6 (it flies off — divergence is real!), and dropping the ball left vs. right to land in a deep valley vs. a shallow trap.

Watch out for

✕

"It always finds the best answer."

Not guaranteed — local minima and bad step sizes can trap or destabilise it.

✕

"Bigger rate = faster learning."

Only up to a point. Past it you get worse, not faster — overshoot.

Explain it back — Phase 1 capstone

Walk through one learning step using error, slope, and learning rate.

Why can too large a learning rate make things worse?

Local vs. global minimum — why does the starting point matter?

Tie it together: how does a pile of examples become a trained model?

Phase 2 → One straight line couldn't split the tricky data. The fix: stack many simple weighted decisions into a network, and train the whole thing by rolling downhill — in millions of dimensions. That unit is the neuron.

Day 04 · Neural Networks

The neuron:
a weighted vote.

The whole of deep learning is built from one tiny, simple unit. Here it is.

The analogy

Picture a panel of judges scoring a dish.

Each judge cares about different things — one weighs taste heavily, another barely cares. You add up their weighted opinions and decide. A neuron is exactly one such judge.

A neuron takes its inputs, multiplies each by a weight (how much it trusts that input), adds them up with a bias, and passes the total through a switch (the activation) that decides how strongly it fires.

The idea

Weights

How much each input matters. Big weight = "I trust this a lot."

Sum + bias

Add the weighted inputs, nudge with a bias.

w₁x₁ + w₂x₂ + b

Activation

A switch that turns the sum into an output between 0 and 1.

Try it — tune a neuron

Workbench · one neurondrag the weights, watch the boundary turn

Blue and red dots are two classes. The line is where the neuron "fires at 50%." Tune the weights so the line separates them.

weight 1 (left–right) 1.0

weight 2 (up–down) 1.0

bias 0.0

Watch out for

✕

"A neuron is like a brain cell."

Loosely inspired by one, but it's just multiply-add-and-switch. Pure arithmetic.

✕

"One neuron is powerful."

Alone it's just Day 2's straight line again. The power comes from stacking many — next lesson.

Explain it back

What are the three steps inside a neuron?

What does a weight represent?

Why is one neuron no stronger than the Day 2 line?

Next → One neuron draws one line. Wire many together in layers and they can trace any curve — and learn it automatically by sending the error backwards.

Day 05 · Neural Networks

How a network
learns.

Stack neurons, push data through, then send the mistake backwards. Watch it learn a curve live.

The analogy

A relay team passing a baton — and passing back the blame.

Data runs forward through the layers to make a guess. Then the error runs backward, telling each runner how they contributed to the miss, so everyone adjusts.

Forward pass = make a prediction. Backpropagation = trace the error back through every weight so each one knows which way to nudge. It's the chain rule from calculus, applied layer by layer — then it's just gradient descent (Day 3) on all of them at once.

The idea

Forward pass

Inputs flow through the layers; an answer pops out.

Backprop

The error flows back, assigning blame to each weight.

Repeat

Do it over the data many times (epochs). The fit improves.

Try it — watch a network learn a curve

Workbench · tiny neural network8 hidden neurons learning to trace the curve

The grey curve is the target. The coloured line is the network's current guess — random at first, then bending to fit as the error (loss) drops.

Watch out for

✕

"Backprop is magic."

It's the chain rule — a tidy way to compute every weight's slope at once.

✕

"More neurons = always better."

Too many and it memorises noise instead of the pattern — overfitting.

Explain it back

What happens in the forward pass vs. the backward pass?

How does backprop connect to Day 3's "rolling downhill"?

What is an epoch?

Next → Our network has one hidden layer. Stack many layers — go "deep" — and something remarkable happens: the layers start inventing their own concepts, simple to complex.

Day 06 · Neural Networks

Going deep.

Why stacking layers lets a network build meaning — from edges, to shapes, to "that's a face."

The analogy

An assembly line, or building with Lego.

The first station sees only edges. The next combines edges into shapes. The next, shapes into parts — an eye, a wheel. The last says "that's a cat."

"Deep" just means many layers. Each layer composes the previous one's output into something more abstract. Nobody tells it what an edge or an eye is — it invents these useful features itself. That's representation learning.

Try it — see what the layers see

Workbench · layer feature viewerdraw, then watch the first layer detect edges

Draw a shape or letter above.

This is a real edge-detector — exactly the kind of filter a first layer learns. It lights up where your drawing has an edge in the chosen direction. Deeper layers combine these into shapes and objects.

Watch out for

✕

"We program it to find eyes."

We don't. We only set up the layers and the goal; the features emerge from training.

✕

"Deeper is always better."

Depth adds power but also cost and trickier training. It's a balance.

Explain it back — Phase 2 capstone

What does each successive layer do to the previous layer's output?

Who decides what an "edge" or "eye" detector is?

Tie Phase 2 together: neuron → network → deep network.

Phase 3 → So far, inputs were numbers and dots. But how does a network read words? The trick is to turn every word into a list of numbers — a vector — arranged so meaning becomes geometry.

Day 07 · Transformers

Words become
points in space.

Turn every word into numbers, and meaning turns into geometry you can do maths on.

The analogy

A map — but of meaning instead of places.

On a real map, related cities sit close together. On a meaning map, "king" and "queen" are neighbours, and the arrow from "man → woman" is parallel to "king → queen".

Each word becomes a vector — a point in space. Direction and distance encode meaning. The astonishing part: this lets you do arithmetic on meaning. king − man + woman lands right on queen.

Try it — explore the meaning map

Workbench · embedding explorertap words; try the analogy

Tap two related words to see the arrow between them — or run the analogy below.

Watch out for

✕

"The map is really 2D."

Real embeddings have hundreds of dimensions. We flatten to 2D just to see it.

✕

"Someone places the words."

No — the positions are learned from billions of sentences about which words appear together.

Explain it back

What is a word embedding, in one sentence?

Why can "king − man + woman ≈ queen" possibly work?

What does distance on the map mean?

Next → If words live as points in space, how does a machine actually find the closest ones? That's the trick behind every "search by meaning" — and how AIs remember things in a vector database.

Day 07½ · Transformers

Searching
by meaning.

Two ideas — cosine similarity and the vector database — turn "find related stuff" into pure geometry.

The hook

?

Ask ChatGPT "what did we talk about last week?" — it doesn't remember in the human sense. It searches, by meaning, through a big pile of stored vectors. Same trick that Google uses to find a page that doesn't even contain your exact words.

The analogy

A library where books aren't shelved by title — they're shelved by what they're about.

You walk in holding a book on "how puppies learn tricks". The librarian doesn't look at the words on the cover. She walks to the spot where that book belongs, and hands you every neighbour — a book on dog training, one on animal behaviour, one on rewards. None of them share your title. All of them share your meaning.

Idea 1 — An embedding is just a list of numbers

On Day 7 we drew words as dots on a 2D map. That was a lie of convenience. A real embedding is a list of hundreds of numbers — often 768 or 1536. Each number is a learned "feature direction" that no human ever named.

Workbench · what a real embedding looks likeeach row is one word

king   →  [  … 765 more numbers ]
queen  →  [  … 765 more numbers ]
apple  →  [  … 765 more numbers ]

No one wrote those numbers. The model learned them from billions of sentences — by nudging the values until words that appear in similar contexts ended up at similar coordinates. That's all "training an embedding" is.

Idea 2 — Cosine similarity: the angle between two arrows

If a word is a point, the line from origin to that point is an arrow. To ask "are these two words similar?" we ask: do their arrows point the same way?

Same direction → similarity = 1.0.
Perpendicular → similarity = 0.
Opposite directions → similarity = −1.

The formula looks scary but it's just the cosine of the angle between the two arrows:

          cos(θ)  =  (a · b) / (|a| · |b|)
        

Why angle and not plain distance? Because the length of an embedding vector doesn't carry meaning — only its direction does. Two arrows pointing the same way are "about the same thing", even if one is twice as long.

Workbench · feel the angledrag either arrow

Drag the purple arrow or the teal arrow. Watch the angle and the cosine update.

Idea 3 — A vector database is just "sort by cosine"

Now glue the two ideas together. A vector database is a big table where each row is some piece of text (a sentence, a doc, a memory) and its embedding. When you ask it a question:

Turn your question into an embedding (one arrow).
Compute the cosine similarity against every stored embedding.
Return the top few highest scores.

That's it. No keyword matching. No clever indexing magic at the concept level — just "who points the most in my direction?"

Workbench · a tiny vector DBtap a query word

The DB knows these 12 words. Tap one to use it as your query — the rest get ranked by cosine similarity to it:

Pick a query word to see the top matches and the actual cosine numbers.

Watch out for

✕

"Cosine is the same as distance."

It's not. Two arrows can be far apart in length but point the same way (cosine = 1). Cosine only cares about direction.

✕

"A vector DB understands the words."

It doesn't read anything. It just does arithmetic on numbers an embedding model handed it.

✕

"The numbers in an embedding mean something specific."

Individual dimensions usually mean nothing a human can name. Only the overall direction has meaning.

Explain it back

Why use the angle between arrows instead of the straight-line distance?

In one sentence, what does a vector database do when you give it a query?

If cosine(a, b) = 1, what does that say about words a and b?

Next → A word's embedding is fixed the moment it's looked up — but the same word means different things in different sentences. So before any searching, the model needs to reshape each word by its neighbours. That's context.

Day 08 · Transformers

Context changes
everything.

The same word means different things depending on its neighbours. Watch meaning shift.

The hook

?

Say "I never said she stole the money" out loud, stressing a different word each time. Seven stresses, seven meanings — same words. Context is everything.

The analogy

The word "bank" alone tells you nothing.

Add "river" and it's a riverbank. Add "money" and it's a vault. The surrounding words decide which meaning wins.

Try it — flip the meaning

Workbench · context disambiguatoradd a context word, watch "bank" move

The word in focus: bank. Tap a context word to add it to the sentence:

Watch out for

✕

"Each word has one fixed meaning."

In a transformer, a word's representation is reshaped by its context every time.

Explain it back

Give a word whose meaning flips with context.

Why isn't a fixed embedding (Day 7) enough on its own?

Next → If every word must look at the others to settle its meaning, we need a mechanism for that looking. That mechanism — attention — is the heart of the transformer.

Day 09 · Transformers · ★

Attention:
the spotlight.

The single idea behind every modern AI. Each word asks: which other words should I look at?

The analogy

A spotlight in a dark theatre.

Reading "…because it was too heavy," your eyes dart back to find what "it" means. You shine a spotlight on the few words that matter right now. The machine learns where to aim it.

Each word sends out a question, every word offers a key, and the match scores decide how much attention to pay. High match → look here. It's dot products → softmax → a weighted blend.

Try it — watch the spotlight

Workbench · attention visualizertap a word; predict where it looks first

Tap a word above. Before the lines appear — guess which word it reaches for.

Keeping it honest: a real transformer learns its own meaning-vectors. Here each word has a small hand-built vector so the mechanism is visible — but the engine (match → soften → blend) is the genuine article.

Watch out for

✕

"Attention = understanding."

It's a weighted average steered by learned matches. Powerful, but still arithmetic.

Explain it back

What does attention let each word do?

What are the "question" and "key" doing?

Why do the percentages add up to 100%?

Next → Stack many attention layers, train on the whole internet, and you get a machine that predicts the next word brilliantly — a large language model.

Day 10 · Transformers

How an LLM
writes.

It's autocomplete that read the whole library: predict the next word, add it, repeat.

The analogy

A very well-read improv actor.

Given everything said so far, it picks the most fitting next word — then reads that back in and picks the next. One word at a time, on a loop.

A transformer stacks attention layers to read all the context, then predicts the next token. Temperature controls the choice: low plays it safe and predictable; high gets adventurous and creative.

Try it — predict the next word

Workbench · next-token predictorpick a start, set the temperature, generate

temperature 0.7

the model's ranked guesses for the next word:

Watch out for

✕

"It plans the whole sentence."

It only ever predicts the next word, then re-reads everything and predicts again.

✕

"It looks up true facts."

It predicts plausible text. That's why it can sound confident yet be wrong — a hallucination.

Explain it back — Phase 3 capstone

How does an LLM generate a sentence?

What does temperature change?

Tie Phase 3 together: embeddings → context → attention → next-token.

Phase 4 → A model that writes is amazing — but it only talks. Next we give it tools, memory, and a loop, turning the talker into a doer: an agent.

Sandbox · End to End

Words → vectors →
attention → next word.

A single live wiring diagram of everything Phase 3 taught. Change the sentence — watch every stage update.

Pick a sentence

Stage 0 · inputor type your own using the vocab below

vocab the toy model knows:

current sentence:

Stage 1 · Tokenize

Real LLMs split text into "tokens" — often word-pieces. Our toy model uses whole words. The order matters: each token gets a position.

tokens with positions

Stage 2 · Embedding lookup (the vector DB)

Each token is looked up in a giant table — the embedding matrix. This is the vector DB: row = token, columns = the 8 numbers that define its meaning. (Real models use 768 or more.) Each strip below is one row of the DB.

Stage 2 · embedding rows (Layer 0)colour = number value

Each token = a row of 8 numbers. That's all a "stored token" really is in the vector DB. Real models use 768+ numbers, but the idea is identical.

Each number is a learned "feature direction". teal = positive, purple = negative, white = near zero. The number inside each cell is the actual stored value.

Stage 2½ · Inspect one token (all 8 dimensions)

A single number-row is dry. Spread it across 8 axes and the token becomes a shape. Words that mean similar things have similar shapes. That's the whole secret of the vector DB.

Stage 2½ · multi-dimensional view— pick a token

↔ drag the cluster to rotate in 3D · scroll to zoom

Line thickness = cosine similarity. Teal lines = related meaning. The whole vocabulary projected from 8D down to 3D — drag to see it from any angle.

side-by-side: how this token compares to its closest neighbours, dimension by dimension

Stage 3 · Cosine similarity matrix

Now the model asks every token: "which other tokens are you most similar to?" Cosine similarity between every pair. The hotter the cell, the more aligned the meaning.

Stage 3 · pairwise cosinehover a cell for the number

Click a row's label to use that token as the attention query below.

Stage 4 · Attention (softmax of cosines)

Cosines become attention weights by passing through a softmax: the top scores get amplified, the rest get squashed. The query token then gathers a weighted mix of every other token's vector.

Stage 4 · attention from one token— pick a query token

Click any token's row label in the matrix above to make it the query.

Stage 5 · Multi-layer vector DB

A transformer doesn't run attention once — it stacks layers. Each layer takes the previous layer's vectors and re-mixes them through attention. The same word's vector changes at every layer, getting richer with context. That's why "bank" near "river" ends up very different from "bank" near "money".

Stage 5 · the same tokens, three layers deepwatch vectors morph downward

Layer 0 = pure dictionary lookup. Layer 1 & 2 = each token's vector is now a context-aware blend. Notice how identical words drift apart when they sit in different contexts.

Stage 6 · Generation (next-word)

To produce the next word, the model takes the last token's vector from the top layer and compares it against every word in the vocabulary — cosine again. Softmax over those scores gives a probability distribution. Sample from it: that's your next word.

Stage 6 · next-word probabilitiestop of the distribution

temperature 0.8

What you just watched

✓

Vector DB = lookup table.

Storing embeddings is literally rows of numbers, one per token, plus a similarity function.

✓

Cosine ranks meaning. Softmax picks winners.

Cosine measures alignment; softmax turns raw scores into clean weights that sum to 1.

✓

Layers stack the same trick.

One attention pass doesn't see enough. Stacking 12, 24, 96 of them is what makes modern LLMs powerful.

Day 11 · Agents

How you ask
shapes what you get.

A prompt isn't just a question — it's a specification. Four parts: role, task, constraints, format.

The hook

?

Ask the same model "explain photosynthesis" two ways: once as "explain photosynthesis" and once as "You're a biology teacher for 8-year-olds. Explain photosynthesis in 3 short sentences. Use a kitchen analogy." Same model. Wildly different answers. The prompt did all the work.

The analogy

Ordering food.

"Get me food" → you'll get something, but who knows what.
"Two veg sandwiches, no onion, no spice, packed to go, in 10 minutes" → exactly what you wanted. A good prompt is the second sentence.

A great prompt almost always has four moves:

Role — who is the model pretending to be?
Task — what exactly do you want done?
Constraints — what must it stick to (length, style, age, audience)?
Format — how should the answer come back (bullets? JSON? a poem?)

Try it — build a prompt by clicking parts

Workbench · prompt buildertap chips, watch the assembled prompt & mock reply

role

task

constraints

format

your assembled prompt

pick chips above to assemble your prompt

mock reply

Watch out for

✕

"More words = better prompt."

Not always. Clarity beats length. A 1-line precise prompt often beats a vague paragraph.

✕

"The model reads my mind."

It can only see your text. Unstated constraints are invisible to it.

Explain it back

Name the four moves of a good prompt.

Why does adding a role ("you are a maths tutor for 10-year-olds") change the answer?

Next → Even with the perfect prompt, the model can't check the weather or open your calendar. For that it needs tools.

Day 12 · Agents

Give the model
hands.

The LLM only knows words. To act in the real world it has to ask another program to do the work — a tool call.

The hook

?

Ask an LLM "what's the weather in Chennai right now?" It cannot know. Its knowledge is frozen at training time. Unless you give it a weather tool, the honest answer is "I don't know." A tool turns a talker into a doer.

The analogy

A brilliant friend, locked in a library with no internet.

She can answer anything in books. But to know today's stock price, today's weather, or today's date — she has to call out through a phone. The phone is the tool. The number she dials is the function. What you hear back is the result.

A tool call has three parts: which function, what arguments, and the result that comes back. The model writes the first two as text (usually JSON). Your code runs the function. The result is sent back into the conversation, and the model continues.

Try it — watch the model pick the right tool

Workbench · tool routerpick a question, see the JSON call

available tools

user question

model's tool call

click a question above

tool returns

—

model's final reply (after seeing the tool result)

—

Watch out for

✕

"The model runs the tool."

It doesn't. It writes a request for a tool. Your code (or the agent framework) actually runs it.

✕

"Tools make the model smarter."

They make it more capable. The reasoning is still the model's job.

Explain it back

What are the three parts of a tool call?

Why can't the model just "look up" today's weather without a tool?

Next → A tool gives the model hands. But it still forgets everything between conversations. Next: memory.

Day 13 · Agents

Remembering
across time.

The model forgets every conversation. Memory means writing important things into a vector DB so later it can search and recall.

The hook

?

Tell the model "my dog's name is Rex" on Monday. Come back Tuesday and ask "what's my dog's name?" — it has no idea. Same model, but a new conversation. To survive across time, an agent has to save what matters and look it up later.

The analogy

Short-term memory vs. a notebook.

The context window is short-term memory — like the last few minutes of a conversation you can still hear in your head. The vector DB is the notebook — write the important bits down, search later by meaning (same cosine trick from Day 7½).

An agent with memory does three things on every turn:

Recall — turn the user's question into an embedding, search the memory DB, pull top matches.
Reply — answer using both the context window and the recalled memories.
Remember — decide what's worth saving, write it back as a new vector.

Try it — chat with a remembering agent

Workbench · agent with memorysend a turn, watch the memory bank fill

conversation

memory bank (vector DB)

Watch out for

✕

"Memory = a database of facts."

It's a database of vectors. You look it up by meaning, not by exact words.

✕

"Save everything."

Memory bloats fast. Good agents are choosy — only save what's likely to matter later.

Explain it back

What's the difference between the context window and a memory DB?

How does the agent find the right memory when you ask a question?

Next → Prompts ✓ Tools ✓ Memory ✓. The last piece is the loop that ties them together: think, act, observe, repeat.

Day 14 · Agents · ★

Think → Act →
Observe → repeat.

The single loop behind every agent. Watch one step through it, live.

The analogy

A detective on a case.

Look at the scene (observe), form a theory (think), pull a clue (act), look at what you found (observe), think again. The loop ends when she knows.

An agent is just an LLM in a while loop. Each pass:

Think — the model writes its reasoning ("I need to know the weather first").
Act — the model picks a tool and arguments.
Observe — the tool's result is added to the conversation.
Repeat — until the model decides it's done.

Try it — step through a real task

Workbench · the agent loop, one step at a time— pick a task

Watch out for

✕

"Agents are a special kind of model."

They're the same model — just called in a loop with tools wired up.

✕

"The loop always ends."

Not always — agents can loop forever if their stopping condition is bad. Real agents have a max step count.

Explain it back

What are the three repeating moves of the agent loop?

When does the loop stop?

Capstone → Time to wire everything together. Pick tools, set memory, send a task, and watch your own agent run.

Day 15 · Capstone

Build your
agent.

Assemble all four pieces — prompt, tools, memory, loop — and run a real task end to end.

Step 1 — pick your agent's brain

Capstone · agent builderconfigure, then send a task

system prompt (role)

give it tools (pick any)

seed its memory

user task

agent trace

configure the agent above, then press run

What you've built

✓

An LLM

The brain. Takes text in, writes text out (Phase 3).

✓

A prompt

Tells the brain who it is and what's expected (Day 11).

✓

Tools

Lets the brain do things in the world (Day 12).

✓

Memory

Survives across time via a vector DB (Day 13).

✓

A loop

Think → Act → Observe → repeat (Day 14).

✓

Vectors everywhere

Words, memory, search — all the same trick (Day 7½).

You did it. Everything from "what is AI?" on Day 0 to a working agent. The real world's agents have more tools and far more steps — but the wiring is exactly what you just built.

Understand AI,from nothingto agents.

Foundations

Neural Networks

Transformers

Agents

The words,in plain English

The learning loop

Neural network

Transformer

Agent

Classifier

Model

Training (learning)

Data / dataset

Feature

Label

Weight

Bias (the number)

Decision boundary

Error / loss

Gradient / slope

Gradient descent

Learning rate

Minimum (local vs global)

Prediction / inference

k-Nearest Neighbours (KNN)

Logistic regression

Overfitting

Neuron

Neural network

Activation function

Forward pass

Backpropagation

Layer

Parameters

Epoch

Token

Vector

Embedding

Attention

Transformer

LLM (large language model)

Next-token prediction

Temperature

Context window

Hallucination

Prompt

Tool / function calling

Agent

Memory (agentic)

RAG (retrieval-augmented generation)

Fine-tuning

What even is AI?

Don't program it. Teach it.

The old way

The new way

0

0

Two numbers,one dividing line.

Features

The boundary

The model

Learning isrolling downhill.

Left → right

Up ↑ down ↓

The dip ↓

The neuron:a weighted vote.

Weights

Sum + bias

Activation

How a networklearns.

Forward pass

Backprop

Repeat

Going deep.

Words becomepoints in space.

Searchingby meaning.

Context changeseverything.

Attention:the spotlight.

How an LLMwrites.

Understand AI,
from nothing
to agents.

The words,
in plain English

Two numbers,
one dividing line.

Learning is
rolling downhill.

The neuron:
a weighted vote.

How a network
learns.

Words become
points in space.

Searching
by meaning.

Context changes
everything.

Attention:
the spotlight.

How an LLM
writes.

Words → vectors →
attention → next word.

How you ask
shapes what you get.

Give the model
hands.

Remembering
across time.

Think → Act →
Observe → repeat.

Build your
agent.