Chapter 0

Softmax and Cross-Entropy Loss

Introduction

Before we can train a Transformer, we need to understand two fundamental mathematical functions that power every language model: Softmax and Cross-Entropy Loss. These are the bridge between raw neural network outputs and meaningful predictions.

Why This Matters: Without softmax, we can't convert model outputs to probabilities. Without cross-entropy loss, we can't tell the model how wrong its predictions are. Together, they make training possible.
| Function | Purpose | Input → Output |
|---|---|---|
| Softmax | Convert raw scores to probabilities | Logits → Probability distribution |
| Cross-Entropy | Measure prediction quality | Predictions + labels → Single loss number |

Let's understand each one deeply, with intuition, formulas, and hands-on examples.


What is Softmax?

Imagine you're building a language model that needs to predict the next word. After processing through all the neural network layers, you get a list of scores (called logits) - one for each word in your vocabulary.

The problem? These logits can be any real number: positive, negative, large, small. They don't tell you the actual probability of each word being correct.

The Challenge: Raw logits like [2.0, 1.0, 0.5, -1.0] don't tell us much. Is 2.0 good? Is -1.0 bad? We need a way to convert these to probabilities that sum to 1.

The Intuition Behind Softmax

Softmax solves this problem elegantly with two key ideas:

  1. Exponentiation (e^x): Makes all values positive and amplifies differences. Larger logits get exponentially larger values.
  2. Normalization: Divide each value by the sum of all values, ensuring everything adds up to 1.0 (a valid probability distribution).
| Property | What It Means | Why It's Useful |
|---|---|---|
| All values are positive | e^x is always > 0 | Probabilities can't be negative |
| Sum equals 1 | We divide by the total | Valid probability distribution |
| Preserves order | Larger logit → larger probability | Respects the model's preferences |
| Amplifies differences | Exponential scaling | The winner takes a larger share |

The Softmax Formula

The softmax function converts a vector of K real numbers (logits) into a probability distribution:

\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}

Let's break down what each part means:

  • z_i - The logit (raw score) for class i
  • e^{z_i} - Exponential of the logit, making it positive and amplified
  • \sum_{j=1}^{K} e^{z_j} - Sum of all exponentials (the normalizing factor)
  • Result - A probability between 0 and 1

Interactive Softmax Explorer

Use the interactive visualization below to see how softmax transforms logits into probabilities. Adjust the sliders to see how changing one logit affects all probabilities:

Interactive Softmax Visualizer: adjust the logits (e.g., cat: z = 2.00, dog: z = 1.00, bird: z = 0.50, fish: z = -1.00) and watch the probabilities change in real time. Note that logits can be any real number, positive or negative; they don't sum to 1 and aren't interpretable as probabilities yet.

Try This: Set one logit much higher than the others and watch how it dominates the probability distribution. Then try making all logits equal and see how the probabilities become uniform.

Step-by-Step Softmax Example

Let's work through a concrete example. Suppose our model outputs these logits for predicting the next word:

Scenario: The model is predicting the next word after "The cat sat on the"

  • mat: z_1 = 2.0
  • dog: z_2 = 1.0
  • floor: z_3 = 0.5
  • car: z_4 = -1.0

Computing Softmax Probabilities

Converting logits [2.0, 1.0, 0.5, -1.0] to probabilities:

e^{2.0} = 7.389, \quad e^{1.0} = 2.718, \quad e^{0.5} = 1.649, \quad e^{-1.0} = 0.368

[7.389, \; 2.718, \; 1.649, \; 0.368]

💡 The exponential makes all values positive and amplifies the differences. Notice how 2.0 (the highest logit) becomes 7.389, much larger than the others.

7.389 + 2.718 + 1.649 + 0.368 = 12.124

💡 This sum will be our denominator, ensuring probabilities sum to 1.

\frac{7.389}{12.124} = 0.609, \quad \frac{2.718}{12.124} = 0.224, \quad \frac{1.649}{12.124} = 0.136, \quad \frac{0.368}{12.124} = 0.030

[0.609, \; 0.224, \; 0.136, \; 0.030]

💡 Now we have valid probabilities! Notice they sum to approximately 1.0

0.609 + 0.224 + 0.136 + 0.030 = 0.999 \approx 1.0 \; \checkmark

💡 The tiny difference from 1.0 comes from rounding the intermediate values; with full precision the probabilities sum to exactly 1.

Final Probability Distribution
mat: 60.9%, \quad dog: 22.4%, \quad floor: 13.6%, \quad car: 3.0%
Interpretation: The model predicts "mat" with 60.9% confidence as the next word after "The cat sat on the". Makes sense!
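The three steps above can be sketched in a few lines of plain Python (a minimal illustration, not how production libraries implement it; real implementations work on tensors and often stay in log space):

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max logit for numerical stability (result unchanged)
    exps = [math.exp(z - m) for z in logits]   # step 1: exponentiate
    total = sum(exps)                          # step 2: sum for the denominator
    return [e / total for e in exps]           # step 3: normalize

logits = [2.0, 1.0, 0.5, -1.0]  # mat, dog, floor, car
probs = softmax(logits)
print([round(p, 3) for p in probs])  # [0.609, 0.224, 0.136, 0.03]
print(sum(probs))                    # 1.0 (up to floating-point rounding)
```

Subtracting the maximum logit before exponentiating avoids overflow for large logits while leaving the output unchanged, since the factor e^{-m} cancels in the division.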

What is Cross-Entropy Loss?

Now that we can get probabilities from our model, we need a way to tell it how wrong those probabilities are. This is where Cross-Entropy Loss comes in.

The Goal: We want a single number that says "your prediction was this bad." The higher the number, the worse the prediction. During training, we minimize this number.

The Intuition Behind Cross-Entropy

Cross-entropy measures the "surprise" or "unexpectedness" of the model's prediction relative to the true answer. Think of it this way:

| Scenario | Model Says | True Answer Is | Loss |
|---|---|---|---|
| Confident & correct | 90% cat | cat | Low (≈ 0.1) |
| Uncertain | 25% each | cat | Medium (≈ 1.4) |
| Confident & wrong | 90% dog | cat | Very high (≈ 2.3) |

The key insight is that cross-entropy heavily penalizes confident wrong predictions. This pushes the model to be both accurate and appropriately confident.

  • Confident and correct? Low loss. The model is rewarded.
  • Uncertain? Moderate loss. The model should be more decisive.
  • Confident and wrong? Very high loss! The model is severely penalized.

The Cross-Entropy Formula

For classification problems (like next-word prediction), the cross-entropy loss is:

L = -\sum_{i=1}^{K} y_i \cdot \log(p_i)

Where:

  • y_i - The true label (one-hot encoded: 1 for the correct class, 0 otherwise)
  • p_i - The predicted probability for class i
  • log - The natural logarithm (base e)

Since y is one-hot encoded (only one class has y=1, others are 0), this simplifies beautifully to:

L = -\log(p_{\text{correct}})
The Simple Rule: Cross-entropy loss is just the negative log of the probability assigned to the correct answer. Higher probability for the correct answer → lower loss.

Interactive Cross-Entropy Explorer

Use this interactive visualization to see how cross-entropy loss changes based on the model's predictions:

Interactive Cross-Entropy Loss

See how the loss penalizes wrong predictions: set the model's probability distribution (for example cat: 90%, dog: 5%, bird: 3%, fish: 2%, with "cat" as the true label, one-hot encoded as [1, 0, 0, 0]) and watch the loss update. With P(correct) = 90%, log(P) = -0.1054, so the loss is 0.1054: low loss for a confident, correct prediction. The loss scale runs from near 0 (good) up past 2.5 (bad).
💡 Key Insight: Why -log(p)?

The negative log function penalizes confident wrong predictions severely. If the model says 1% for the correct answer, loss = -log(0.01) = 4.6. But if it says 90%, loss = -log(0.9) = 0.1. This gradient pushes the model to be both correct and confident.

Cross-Entropy Loss Formula:
L = -Σ y_i · log(p_i) = -log(p_correct)

Since y is one-hot encoded, only the probability of the correct class matters

Try This: Compare "Confident & Correct" vs "Confident & Wrong" scenarios. Notice how the loss explodes when the model is confidently wrong!
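The "loss explodes" behavior is easy to see by sweeping the probability the model assigns to the correct answer (a standalone snippet, nothing model-specific):

```python
import math

# Cross-entropy loss is -log(p_correct); watch it blow up as p shrinks
for p in (0.99, 0.9, 0.5, 0.1, 0.01):
    print(f"P(correct) = {p:<5} -> loss = {-math.log(p):.3f}")
```

The loss climbs gently near p = 1 (0.010 at 99%) but explodes as p approaches 0 (4.605 at 1%), which is exactly the asymmetry that punishes confident wrong answers.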

Step-by-Step Cross-Entropy Example

Let's calculate the cross-entropy loss for our next-word prediction example:

Scenario: The correct next word is "mat" and our model predicted:

  • mat (correct): p = 0.609
  • dog: p = 0.224
  • floor: p = 0.136
  • car: p = 0.030

True Label (One-Hot Encoded):

  • mat: y = 1
  • dog: y = 0
  • floor: y = 0
  • car: y = 0

Computing Cross-Entropy Loss

L = -\sum_{i=1}^{K} y_i \cdot \log(p_i)

The correct answer is "mat", with probability p_{\text{correct}} = 0.609.

💡 Since y is one-hot, only the correct class contributes to the loss.

-\log(0.609) = -(-0.496) = 0.496

💡 The natural log of 0.609 is about -0.496. The negative sign flips it to positive.

L = -(1 \times \log(0.609) + 0 \times \log(0.224) + 0 \times \log(0.136) + 0 \times \log(0.030)) = -(1 \times (-0.496)) = 0.496

💡 All terms with y=0 contribute nothing. Only the correct class matters.

Cross-Entropy Loss
L = 0.496
Interpretation: A loss of 0.496 is quite low! It means the model was reasonably confident (60.9%) in the correct answer. Perfect prediction (100% confidence) would give loss = 0.

Scenario: What if the model had been 90% confident in "dog" instead?

  • mat (correct): p = 0.03
  • dog (wrong): p = 0.90
  • floor: p = 0.05
  • car: p = 0.02

Cross-Entropy for Wrong Prediction

Confident but wrong: a high penalty!

The correct answer "mat" only has probability p_{\text{correct}} = 0.03.

💡 Very low confidence in the right answer!

-\log(0.03) = -(-3.507) = 3.507

💡 The loss is much higher because log of a small number is very negative!

Cross-Entropy Loss
L = 3.507 (about 7× higher!)
Key Insight: The loss jumped from 0.496 to 3.507 - about 7 times higher! This steep penalty for confident wrong predictions is what makes cross-entropy so effective for training.
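Both scenarios are quick to verify in plain Python (the probabilities are taken from the examples above):

```python
import math

def cross_entropy(probs, correct_index):
    # With one-hot labels, only the correct class's probability matters
    return -math.log(probs[correct_index])

# Scenario 1: 60.9% confidence in the correct word "mat"
good = cross_entropy([0.609, 0.224, 0.136, 0.030], correct_index=0)

# Scenario 2: 90% confidence in "dog", only 3% in "mat"
bad = cross_entropy([0.03, 0.90, 0.05, 0.02], correct_index=0)

print(round(good, 3))  # 0.496
print(round(bad, 3))   # 3.507
```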

How They Work Together

In training, softmax and cross-entropy work as a team:

  1. Forward Pass: Model outputs logits → Softmax converts to probabilities
  2. Loss Calculation: Cross-entropy measures how far predictions are from truth
  3. Backpropagation: Gradients flow back to improve the model
Beautiful Gradient: The gradient of softmax + cross-entropy has a remarkably simple form: for the correct class, it's (p - 1), and for wrong classes, it's just p. This elegant simplicity makes training efficient!
| Class | Gradient | Effect |
|---|---|---|
| Correct class | p - 1 (negative) | Push probability up toward 1 |
| Wrong classes | p (positive) | Push probability down toward 0 |
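This gradient claim is easy to check numerically: nudge one logit by a tiny ε, recompute the loss, and compare the finite-difference slope against p - y (an illustrative sketch, not training code):

```python
import math

def softmax(z):
    m = max(z)  # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def loss(z, correct):
    # Cross-entropy of softmax(z) against a one-hot label at index `correct`
    return -math.log(softmax(z)[correct])

logits, correct = [2.0, 1.0, 0.5, -1.0], 0  # "mat" is the true class
p = softmax(logits)
eps = 1e-6

for i in range(len(logits)):
    bumped = list(logits)
    bumped[i] += eps
    numeric = (loss(bumped, correct) - loss(logits, correct)) / eps
    analytic = p[i] - (1.0 if i == correct else 0.0)  # the claimed p - y
    print(i, round(numeric, 4), round(analytic, 4))
```

For every class the finite-difference slope matches p - y to several decimal places: negative for the correct class, positive for all the others.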

Temperature Scaling

Sometimes we want to control how "sharp" or "flat" our probability distribution is. This is done using a temperature parameter T:

\text{softmax}(z_i, T) = \frac{e^{z_i / T}}{\sum_{j=1}^{K} e^{z_j / T}}
| Temperature | Effect | Use Case |
|---|---|---|
| T < 1 | Sharper distribution, more confident | When you want decisive predictions |
| T = 1 | Standard softmax | Normal training and inference |
| T > 1 | Flatter distribution, more uncertain | Sampling for creative text generation |

Try adjusting the temperature in the interactive softmax visualizer above to see this effect!
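Temperature is a one-line change: divide the logits by T before applying softmax. A quick sketch (example logits only):

```python
import math

def softmax_t(logits, T=1.0):
    scaled = [z / T for z in logits]          # the only change: divide by T
    m = max(scaled)                           # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]
for T in (0.5, 1.0, 2.0):
    print(T, [round(p, 3) for p in softmax_t(logits, T)])
```

Lower T sharpens the distribution (the top class takes almost everything), while higher T flattens it toward uniform.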


Summary

Softmax: Converts raw logits to a valid probability distribution using exponentiation and normalization.
Cross-Entropy: Measures prediction quality as -log(p_correct). Heavily penalizes confident wrong predictions.
  • Softmax Properties: All outputs positive, sum to 1, preserves ranking, amplifies differences
  • Cross-Entropy Properties: 0 when perfect, increases as confidence in wrong answer grows
  • Together: They create clean gradients (p - y) that efficiently train neural networks
  • Temperature: Controls distribution sharpness - lower = more confident, higher = more uncertain

With this understanding of softmax and cross-entropy, you now have the mathematical foundation needed to understand how Transformers learn from data. In the next chapters, we'll see these functions in action as we build attention mechanisms and train our model.