Chapter 0

Softmax and Cross-Entropy Loss

Introduction

Before we can train a Transformer, we need to understand two fundamental mathematical functions that power every language model: Softmax and Cross-Entropy Loss. These are the bridge between raw neural network outputs and meaningful predictions.

Why This Matters: Without softmax, we can't convert model outputs to probabilities. Without cross-entropy loss, we can't tell the model how wrong its predictions are. Together, they make training possible.
| Function | Purpose | Input → Output |
|---|---|---|
| Softmax | Convert raw scores to probabilities | Logits → Probability distribution |
| Cross-Entropy | Measure prediction quality | Predictions + labels → Single loss number |

Let's understand each one deeply, with intuition, formulas, and hands-on examples.


What is Softmax?

Imagine you're building a language model that needs to predict the next word. After processing through all the neural network layers, you get a list of scores (called logits) - one for each word in your vocabulary.

The problem? These logits can be any real number: positive, negative, large, small. They don't tell you the actual probability of each word being correct.

The Challenge: Raw logits like [2.0, 1.0, 0.5, -1.0] don't tell us much. Is 2.0 good? Is -1.0 bad? We need a way to convert these to probabilities that sum to 1.

The Intuition Behind Softmax

Softmax solves this problem elegantly with two key ideas:

  1. Exponentiation (e^x): Makes all values positive and amplifies differences. Larger logits get exponentially larger values.
  2. Normalization: Divide each value by the sum of all values, ensuring everything adds up to 1.0 (a valid probability distribution).
| Property | What It Means | Why It's Useful |
|---|---|---|
| All values are positive | e^x is always > 0 | Probabilities can't be negative |
| Sum equals 1 | We divide by the total | Valid probability distribution |
| Preserves order | Larger logit → larger probability | Respects the model's preferences |
| Amplifies differences | Exponential scaling | The winner takes a larger share |

The Softmax Formula

The softmax function converts a vector of K real numbers (logits) into a probability distribution:

\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}

Let's break down what each part means:

  • z_i - The logit (raw score) for class i
  • e^{z_i} - Exponential of the logit, making it positive and amplified
  • \sum_{j=1}^{K} e^{z_j} - Sum of all exponentials (the normalizing factor)
  • Result - A probability between 0 and 1

Interactive Softmax Explorer

Use the interactive visualization below to see how softmax transforms logits into probabilities. Adjust the sliders to see how changing one logit affects all probabilities:

Interactive Softmax Visualizer: adjust the logits (e.g., cat: z = 2.00, dog: z = 1.00, bird: z = 0.50, fish: z = -1.00) and watch the probabilities change in real time. Note that logits can be any real number, positive or negative; they don't sum to 1 and aren't interpretable as probabilities yet.

Try This: Set one logit much higher than the others and watch how it dominates the probability distribution. Then try making all logits equal and see how the probabilities become uniform.

Step-by-Step Softmax Example

Let's work through a concrete example. Suppose our model outputs these logits for predicting the next word:

Scenario: The model is predicting the next word after "The cat sat on the"

  • mat: z_1 = 2.0
  • dog: z_2 = 1.0
  • floor: z_3 = 0.5
  • car: z_4 = -1.0

Computing Softmax Probabilities

Converting logits [2.0, 1.0, 0.5, -1.0] to probabilities:

e^{2.0} = 7.389, \quad e^{1.0} = 2.718, \quad e^{0.5} = 1.649, \quad e^{-1.0} = 0.368

[7.389, \; 2.718, \; 1.649, \; 0.368]

💡 The exponential makes all values positive and amplifies the differences. Notice how 2.0 (the highest logit) becomes 7.389, much larger than the others.

7.389 + 2.718 + 1.649 + 0.368 = 12.124

💡 This sum will be our denominator, ensuring probabilities sum to 1.

\frac{7.389}{12.124} = 0.609, \quad \frac{2.718}{12.124} = 0.224, \quad \frac{1.649}{12.124} = 0.136, \quad \frac{0.368}{12.124} = 0.030

[0.609, \; 0.224, \; 0.136, \; 0.030]

💡 Now we have valid probabilities! Notice they sum to approximately 1.0

0.609 + 0.224 + 0.136 + 0.030 = 0.999 \approx 1.0 \; \checkmark

💡 The tiny difference from 1.0 comes from rounding the intermediate values; with full precision the probabilities sum to exactly 1.

Final Probability Distribution
mat: 60.9%, \quad dog: 22.4%, \quad floor: 13.6%, \quad car: 3.0%
Interpretation: The model predicts "mat" with 60.9% confidence as the next word after "The cat sat on the". Makes sense!
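The three steps above can be sketched in a few lines of plain Python (a minimal illustration, not how production libraries implement it; real implementations work on tensors and often stay in log space):

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max logit for numerical stability (result unchanged)
    exps = [math.exp(z - m) for z in logits]   # step 1: exponentiate
    total = sum(exps)                          # step 2: sum for the denominator
    return [e / total for e in exps]           # step 3: normalize

logits = [2.0, 1.0, 0.5, -1.0]  # mat, dog, floor, car
probs = softmax(logits)
print([round(p, 3) for p in probs])  # [0.609, 0.224, 0.136, 0.03]
print(sum(probs))                    # 1.0 (up to floating-point rounding)
```

Subtracting the maximum logit before exponentiating avoids overflow for large logits while leaving the output unchanged, since the factor e^{-m} cancels in the division.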

What is Cross-Entropy Loss?

Now that we can get probabilities from our model, we need a way to tell it how wrong those probabilities are. This is where Cross-Entropy Loss comes in.

The Goal: We want a single number that says "your prediction was this bad." The higher the number, the worse the prediction. During training, we minimize this number.

The Intuition Behind Cross-Entropy

Cross-entropy measures the "surprise" or "unexpectedness" of the model's prediction relative to the true answer. Think of it this way:

| Scenario | Model Says | True Answer Is | Loss |
|---|---|---|---|
| Confident & correct | 90% cat | cat | Low (≈ 0.1) |
| Uncertain | 25% each | cat | Medium (≈ 1.4) |
| Confident & wrong | 90% dog | cat | Very high (≈ 2.3) |

The key insight is that cross-entropy heavily penalizes confident wrong predictions. This pushes the model to be both accurate and appropriately confident.

  • Confident and correct? Low loss. The model is rewarded.
  • Uncertain? Moderate loss. The model should be more decisive.
  • Confident and wrong? Very high loss! The model is severely penalized.

The Cross-Entropy Formula

For classification problems (like next-word prediction), the cross-entropy loss is:

L = -\sum_{i=1}^{K} y_i \cdot \log(p_i)

Where:

  • y_i - The true label (one-hot encoded: 1 for the correct class, 0 otherwise)
  • p_i - The predicted probability for class i
  • log - The natural logarithm (base e)

Since y is one-hot encoded (only one class has y=1, others are 0), this simplifies beautifully to:

L = -\log(p_{\text{correct}})
The Simple Rule: Cross-entropy loss is just the negative log of the probability assigned to the correct answer. Higher probability for the correct answer → lower loss.

Interactive Cross-Entropy Explorer

Use this interactive visualization to see how cross-entropy loss changes based on the model's predictions:

Interactive Cross-Entropy Loss

See how the loss penalizes wrong predictions: set the model's probability distribution (for example cat: 90%, dog: 5%, bird: 3%, fish: 2%, with "cat" as the true label, one-hot encoded as [1, 0, 0, 0]) and watch the loss update. With P(correct) = 90%, log(P) = -0.1054, so the loss is 0.1054: low loss for a confident, correct prediction. The loss scale runs from near 0 (good) up past 2.5 (bad).
💡 Key Insight: Why -log(p)?

The negative log function penalizes confident wrong predictions severely. If the model says 1% for the correct answer, loss = -log(0.01) = 4.6. But if it says 90%, loss = -log(0.9) = 0.1. This gradient pushes the model to be both correct and confident.

Cross-Entropy Loss Formula:
L = -Σ y_i · log(p_i) = -log(p_correct)

Since y is one-hot encoded, only the probability of the correct class matters

Try This: Compare "Confident & Correct" vs "Confident & Wrong" scenarios. Notice how the loss explodes when the model is confidently wrong!
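The "loss explodes" behavior is easy to see by sweeping the probability the model assigns to the correct answer (a standalone snippet, nothing model-specific):

```python
import math

# Cross-entropy loss is -log(p_correct); watch it blow up as p shrinks
for p in (0.99, 0.9, 0.5, 0.1, 0.01):
    print(f"P(correct) = {p:<5} -> loss = {-math.log(p):.3f}")
```

The loss climbs gently near p = 1 (0.010 at 99%) but explodes as p approaches 0 (4.605 at 1%), which is exactly the asymmetry that punishes confident wrong answers.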

Step-by-Step Cross-Entropy Example

Let's calculate the cross-entropy loss for our next-word prediction example:

Scenario: The correct next word is "mat" and our model predicted:

  • mat (correct): p = 0.609
  • dog: p = 0.224
  • floor: p = 0.136
  • car: p = 0.030

True Label (One-Hot Encoded):

  • mat: y = 1
  • dog: y = 0
  • floor: y = 0
  • car: y = 0

Computing Cross-Entropy Loss

L = -\sum_{i=1}^{K} y_i \cdot \log(p_i)

The correct answer is "mat", with probability p_{\text{correct}} = 0.609.

💡 Since y is one-hot, only the correct class contributes to the loss.

-\log(0.609) = -(-0.496) = 0.496

💡 The natural log of 0.609 is about -0.496. The negative sign flips it to positive.

L = -(1 \times \log(0.609) + 0 \times \log(0.224) + 0 \times \log(0.136) + 0 \times \log(0.030)) = -(1 \times (-0.496)) = 0.496

💡 All terms with y=0 contribute nothing. Only the correct class matters.

Cross-Entropy Loss
L = 0.496
Interpretation: A loss of 0.496 is quite low! It means the model was reasonably confident (60.9%) in the correct answer. Perfect prediction (100% confidence) would give loss = 0.

Scenario: What if the model had been 90% confident in "dog" instead?

  • mat (correct): p = 0.03
  • dog (wrong): p = 0.90
  • floor: p = 0.05
  • car: p = 0.02

Cross-Entropy for Wrong Prediction

Confident but wrong: a high penalty!

The correct answer "mat" only has probability p_{\text{correct}} = 0.03.

💡 Very low confidence in the right answer!

-\log(0.03) = -(-3.507) = 3.507

💡 The loss is much higher because log of a small number is very negative!

Cross-Entropy Loss
L = 3.507 (about 7× higher!)
Key Insight: The loss jumped from 0.496 to 3.507 - about 7 times higher! This steep penalty for confident wrong predictions is what makes cross-entropy so effective for training.
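Both scenarios are quick to verify in plain Python (the probabilities are taken from the examples above):

```python
import math

def cross_entropy(probs, correct_index):
    # With one-hot labels, only the correct class's probability matters
    return -math.log(probs[correct_index])

# Scenario 1: 60.9% confidence in the correct word "mat"
good = cross_entropy([0.609, 0.224, 0.136, 0.030], correct_index=0)

# Scenario 2: 90% confidence in "dog", only 3% in "mat"
bad = cross_entropy([0.03, 0.90, 0.05, 0.02], correct_index=0)

print(round(good, 3))  # 0.496
print(round(bad, 3))   # 3.507
```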

How They Work Together

In training, softmax and cross-entropy work as a team:

  1. Forward Pass: Model outputs logits → Softmax converts to probabilities
  2. Loss Calculation: Cross-entropy measures how far predictions are from truth
  3. Backpropagation: Gradients flow back to improve the model
Beautiful Gradient: The gradient of softmax + cross-entropy has a remarkably simple form: for the correct class, it's (p - 1), and for wrong classes, it's just p. This elegant simplicity makes training efficient!
| Class | Gradient | Effect |
|---|---|---|
| Correct class | p - 1 (negative) | Push probability up toward 1 |
| Wrong classes | p (positive) | Push probability down toward 0 |
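This gradient claim is easy to check numerically: nudge one logit by a tiny ε, recompute the loss, and compare the finite-difference slope against p - y (an illustrative sketch, not training code):

```python
import math

def softmax(z):
    m = max(z)  # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def loss(z, correct):
    # Cross-entropy of softmax(z) against a one-hot label at index `correct`
    return -math.log(softmax(z)[correct])

logits, correct = [2.0, 1.0, 0.5, -1.0], 0  # "mat" is the true class
p = softmax(logits)
eps = 1e-6

for i in range(len(logits)):
    bumped = list(logits)
    bumped[i] += eps
    numeric = (loss(bumped, correct) - loss(logits, correct)) / eps
    analytic = p[i] - (1.0 if i == correct else 0.0)  # the claimed p - y
    print(i, round(numeric, 4), round(analytic, 4))
```

For every class the finite-difference slope matches p - y to several decimal places: negative for the correct class, positive for all the others.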

Temperature Scaling

Sometimes we want to control how "sharp" or "flat" our probability distribution is. This is done using a temperature parameter T:

\text{softmax}(z_i, T) = \frac{e^{z_i / T}}{\sum_{j=1}^{K} e^{z_j / T}}
| Temperature | Effect | Use Case |
|---|---|---|
| T < 1 | Sharper distribution, more confident | When you want decisive predictions |
| T = 1 | Standard softmax | Normal training and inference |
| T > 1 | Flatter distribution, more uncertain | Sampling for creative text generation |

Try adjusting the temperature in the interactive softmax visualizer above to see this effect!
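Temperature is a one-line change: divide the logits by T before applying softmax. A quick sketch (example logits only):

```python
import math

def softmax_t(logits, T=1.0):
    scaled = [z / T for z in logits]          # the only change: divide by T
    m = max(scaled)                           # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]
for T in (0.5, 1.0, 2.0):
    print(T, [round(p, 3) for p in softmax_t(logits, T)])
```

Lower T sharpens the distribution (the top class takes almost everything), while higher T flattens it toward uniform.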


Summary

Softmax: Converts raw logits to a valid probability distribution using exponentiation and normalization.
Cross-Entropy: Measures prediction quality as -log(p_correct). Heavily penalizes confident wrong predictions.
  • Softmax Properties: All outputs positive, sum to 1, preserves ranking, amplifies differences
  • Cross-Entropy Properties: 0 when perfect, increases as confidence in wrong answer grows
  • Together: They create clean gradients (p - y) that efficiently train neural networks
  • Temperature: Controls distribution sharpness - lower = more confident, higher = more uncertain

With this understanding of softmax and cross-entropy, you now have the mathematical foundation needed to understand how Transformers learn from data. In the next chapters, we'll see these functions in action as we build attention mechanisms and train our model.