LLMs: How They Work - Graves Light Consulting

LLMs

How They Work: For Leaders and Luddites Alike

I spent my career as a Field CTO focused on financial services. Key to that role was understanding the technology at a nuts-and-bolts level so I could communicate the value with both technical and non-technical C-level audiences.

It’s the same approach I use at Graves Light Consulting, and the same tack I’m taking with this GenAI article, although, GenAI being what it is, the material here is necessarily more technical. The end goal is the same: make something complex understandable and useful.

This paper explains how modern GenAI models work under the hood — clearly, practically, and without assuming a background in math. It also sets the foundation for future papers. My coursework at Harvard and MIT helped enormously in this effort.

LLM Architecture

The diagram below shows the core components of a modern Large Language Model (LLM), beginning with the user’s input and flowing through tokenization, embeddings, multiple transformer layers, and finally the Output Head where the next-token prediction is generated. Each Transformer Block contains two sublayers, Self-Attention and Feed-Forward, each followed by a normalization step. Modern LLMs stack dozens of these Transformer Blocks, and the largest stack more than a hundred, to build deep understanding and generate predictions.

Figure 1: LLM Architecture

LLM Training

LLMs are trained on massive datasets — the internet, books, and curated documents — up to a fixed point in time. Training the LLM is foundational and the most resource-intensive aspect of GenAI systems — fueling rapid growth for chip companies like Nvidia.

To train the model, a way is needed to measure how wrong it is — that’s what the error function does.

Despite all the hype, the math behind the error function — and GenAI in general — isn’t that complicated.

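In plain English, the squared-error formula, E(w) = Σ (y − f(x, w))², says that the total loss E(w) is the sum of all the squared differences between the correct answers (y) and the model’s current predictions f(x, w). Here ‘w’ is the weight that we are attempting to train. (Production LLMs are typically trained with a cross-entropy loss on next-token probabilities; squared error is used here because it is the simplest way to show the idea.)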

LLMs rely on billions of weights that form high-dimensional vectors and matrices, each one emphasizing a direction or feature in high-dimensional space. This lets the model nudge a word like “fund” toward the correct meaning—finance, charity, education—based on context. Because weights match the dimensions of token vectors, the math works cleanly. A weight vector acts like a compass setting guiding a token’s interpretation.

Squaring the differences prevents positive and negative errors from canceling out, gives more weight to bigger mistakes, and creates a smooth surface that gradient descent, guided by backpropagation, can follow down to the lowest error.

In a nutshell, you can visualize the errors as points on a curved surface — the goal is to move downhill to the lowest point, where error is minimized.

Once the errors are plotted, gradient descent is applied to adjust the weights:

w ← w − η · ∂E/∂w

The gradient descent formula above computes the partial derivative (∂) of the error with respect to the weight to answer one question:

If I tweak this weight slightly — with η controlling how big that tweak is — does the error get better or worse, and by how much?

Not unlike a light dimmer:

If turning the knob slightly makes the light worse, turn the other direction.
Keep adjusting until the brightness is just right.

The model does this automatically — trillions of tiny adjustments — to minimize error.
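The error function and the weight update described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how a production LLM is trained: the model has a single weight, the data (xs, ys) is made up to follow y = 3x, and the names predict, error, gradient, and train are hypothetical helpers chosen for this example.

```python
# Toy training data, assumed to follow the rule y = 3 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]


def predict(x, w):
    """The model f(x, w): here, one weight multiplied by the input."""
    return w * x


def error(w):
    """E(w): sum of squared differences between answers and predictions."""
    return sum((y - predict(x, w)) ** 2 for x, y in zip(xs, ys))


def gradient(w):
    """Derivative of E with respect to w for this one-weight model."""
    return sum(-2.0 * x * (y - predict(x, w)) for x, y in zip(xs, ys))


def train(w=0.0, eta=0.01, steps=200):
    """Repeat the update w = w - eta * dE/dw, moving downhill on the error."""
    for _ in range(steps):
        w = w - eta * gradient(w)
    return w


w_trained = train()
print("trained weight:", round(w_trained, 4))
print("remaining error:", round(error(w_trained), 8))
```

Starting from w = 0, each step moves the weight a little closer to 3, the value that makes the error zero. The learning rate eta plays the role of η: too small and training crawls, too large and the weight overshoots the bottom of the curve.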

Embedding

During training, the model builds an embedding matrix: a library of vectors, one per text-based token, stored in a high-dimensional space (think of a graph with many axes).

The embedding steps are as follows:

Tokenization: The text prompt from a user is broken into smaller units called tokens.
Index mapping: Each token is mapped to an index number in the model’s vocabulary (library of tokens).
Vector lookup: Each index number points to a learned vector (from the training step above) in the embedding matrix.
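The three embedding steps can be sketched as follows. Everything here is an assumption made for illustration: the five-token vocabulary, the 4-dimensional vectors, the random placeholder values standing in for learned ones, and the whitespace tokenizer (real models typically split text into subword pieces instead).

```python
import random

# Hypothetical five-token vocabulary; real models use tens of thousands.
vocab = {"the": 0, "fund": 1, "grows": 2, "fast": 3, "[UNK]": 4}
EMBED_DIM = 4  # real models use thousands of dimensions

# Stand-in embedding matrix with one row per token. In a trained model
# these values are learned; here they are random placeholders.
rng = random.Random(0)
embedding_matrix = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
                    for _ in range(len(vocab))]


def tokenize(text):
    """Step 1: break the text into tokens (whitespace split, a simplification)."""
    return text.lower().split()


def to_indices(tokens):
    """Step 2: map each token to its index in the vocabulary."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]


def lookup(indices):
    """Step 3: fetch the row of the embedding matrix for each index."""
    return [embedding_matrix[i] for i in indices]


tokens = tokenize("The fund grows")
indices = to_indices(tokens)
vectors = lookup(indices)
print(tokens, indices, len(vectors), len(vectors[0]))
```

The lookup is nothing more than picking rows out of a table, which is why the embedding matrix is often described as a library: the index is the call number and the vector is the book.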

Coordinates in High-Dimensional Space

The numbers in a vector are coordinates, one per axis, that together locate the vector’s endpoint in a high-dimensional space.
The closer the vectors’ endpoints are in the high-dimensional space, the closer they are in meaning. The direction between endpoints reflects the relationship between them.

Picture a 2-D embedding example in which king and queen sit close together on the X-axis (similar meaning) and are separated on the Y-axis by their gender direction.

It’s the same with an LLM — but with thousands of axes.

Modern models use extremely high-dimensional vectors:

GPT-4: not publicly disclosed; estimates range from ~12,000 to 16,000 dimensions (GPT-3 used 12,288)
BERT Large: 1,024 dimensions

Dimensionality has generally increased as models have grown larger. All vectors together form an embedding space where:

Distance ≈ similarity in meaning
Direction ≈ type of relationship
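The distance and direction ideas can be checked with a small 2-D sketch. The coordinates below are hand-picked for illustration and do not come from any real model; axis 0 loosely stands for "royalty" and axis 1 for "gender", and the helper names distance and direction are hypothetical.

```python
import math

# Hand-picked 2-D vectors (illustrative values, not from a real model).
# Axis 0 loosely stands for "royalty", axis 1 for "gender".
embeddings = {
    "king":  [0.9,  0.2],
    "queen": [0.9, -0.2],
    "man":   [0.2,  0.2],
    "woman": [0.2, -0.2],
}


def distance(a, b):
    """Euclidean distance between two endpoints: smaller means more similar."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def direction(a, b):
    """Vector pointing from endpoint a to endpoint b."""
    return [q - p for p, q in zip(a, b)]


king, queen = embeddings["king"], embeddings["queen"]
man, woman = embeddings["man"], embeddings["woman"]

print("king-queen distance:", round(distance(king, queen), 3))
print("king-woman distance:", round(distance(king, woman), 3))
print("king to queen direction:", direction(king, queen))
print("man to woman direction:", direction(man, woman))
```

With these values, king sits closer to queen than to woman, and the step from king to queen points the same way as the step from man to woman. That shared direction is what "direction ≈ type of relationship" means in practice.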

Transformer

Next is the Transformer layer, which has two main parts, Self-Attention and Feed-Forward, each followed by a supporting step called Residual Connections and Normalization.

Transformers are where a vector’s meaning and context get enriched. You will hear experts mention Transformer Architectures — this is what they are talking about.

Self-attention

In self-attention, context is added to the original meaning of the token output from the embedding space.

For example, tokens like fund and grows are initially independent. After self-attention, fund becomes fund that grows: the token’s vector now reflects that the two tokens are related. To gain this contextual awareness, each token pulls in a weighted share of the other tokens in the same prompt, computed as follows:

Attention(Q, K, V) = SoftMax(QK^T / √(d_k)) · V

Figure 9: Self-Attention Formula

Create Q, K and V Vectors

In the Attention formula above, each token’s vector (x) output from the embedding step is fed into self-attention and multiplied by three learned weight matrices: W_Q, W_K, and W_V. These matrices are learned during training and fixed afterward, and their sole purpose is to produce three new vectors, Q, K, and V, used to enhance a token’s meaning as follows:

Query (Q): what this token is looking for
Key (K): what this token offers to others
Value (V): the information to be passed along
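Producing Q, K, and V is one matrix multiplication per vector. The sketch below assumes a 3-dimensional token vector and 3x2 weight matrices with made-up values; in a real model the matrices are learned and far larger, and project is a hypothetical helper name.

```python
def project(x, W):
    """Multiply a token vector x (length n) by a weight matrix W (n x m)."""
    return [sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(W[0]))]


# Assumed toy values: one 3-dimensional token vector and three 3x2 matrices.
x_fund = [1.0, 0.5, -1.0]
W_Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
W_V = [[1.0, 1.0], [2.0, 0.0], [0.0, 1.0]]

q = project(x_fund, W_Q)  # Query: what "fund" is looking for
k = project(x_fund, W_K)  # Key: what "fund" offers to others
v = project(x_fund, W_V)  # Value: the information "fund" passes along
print("Q:", q, "K:", k, "V:", v)
```

The same token vector yields three different vectors because each matrix emphasizes different features of it. Every token in the prompt goes through the same three matrices.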

Establish Attention Score

For each token’s new Q vector, the model computes dot products with the K vectors of every token in the prompt, including its own (QK^T). Each dot product is a single number, the attention score, that measures how strongly two token vectors relate to each other: the more closely the vectors align, the higher the score.

You’ve likely seen dot products in action before: portfolio alignment to a market factor (finance), customer similarity scoring (marketing), candidate cultural fit (HR analytics).

The denominator √(d_k), the square root of the key vectors’ dimension, prevents the attention scores from getting too large, keeping SoftMax balanced and ensuring the model distributes attention effectively.

Apply SoftMax

Next, SoftMax, a normalization function, converts the attention scores from the previous step into attention weights (probabilities) between 0 and 1 that sum to 1.

SoftMax is used widely to turn raw scores into probabilities.
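SoftMax itself is short enough to write out. The sketch below uses three made-up attention scores; subtracting the largest score before exponentiating is a common numerical-stability step and does not change the result.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


attention_scores = [2.0, 1.0, 0.1]  # assumed example scores
attention_weights = softmax(attention_scores)
print([round(w, 3) for w in attention_weights])
```

Higher scores receive larger weights, and because exponentials are used, a modest lead in score becomes a larger lead in weight.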

Finally, we take the attention weights and compute a weighted combination of the tokens’ Value (V) vectors, then add that to the original token vector.

Each token’s updated meaning comes from attending to itself and all preceding tokens in the prompt, but never to tokens that follow it, which ensures the model only uses past context to predict the next token.

The result is a new vector containing the contextualized meaning — what was originally fund is now fund that grows.
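Putting the pieces together, the sketch below runs single-head self-attention over three tokens, with each token attending only to itself and the tokens before it. The Q, K, and V values are made up, the function name causal_self_attention is hypothetical, and the final step of adding the result back to the original token vector is left out to keep the example focused on attention.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def causal_self_attention(Q, K, V):
    """For each token i, attend to tokens 0..i and blend their V vectors."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for i, q in enumerate(Q):
        # Scores against itself and preceding tokens only (never later ones).
        scores = [dot(q, K[j]) / math.sqrt(d_k) for j in range(i + 1)]
        weights = softmax(scores)
        # Weighted combination of the visible tokens' value vectors.
        blended = [sum(w * V[j][c] for j, w in enumerate(weights))
                   for c in range(len(V[0]))]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights


# Assumed toy Q, K, V vectors for three tokens, e.g. "the", "fund", "grows".
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = causal_self_attention(Q, K, V)
for i, (out, w) in enumerate(zip(outputs, weights)):
    print(i, [round(a, 3) for a in w], [round(a, 3) for a in out])
```

The first token can only see itself, so its attention weight is 1 and its output equals its own V vector. The third token spreads its attention across all three positions. Production implementations do the same work with batched matrix operations and a mask rather than Python loops.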

Modern LLMs use multi-head attention, running self-attention in parallel across many heads (roughly 12 to 128, depending on the model), with each head learning to focus on different types of relationships between tokens.

Feed-Forward

Feed-forward multiplies the vector output from self-attention by weights and adds biases (both learned in training) to draw out richer and more abstract meaning. In formula form, FFN(x) = ReLU(xW₁ + b₁)W₂ + b₂, which breaks down as follows:

(xW₁ + b₁): Multiply the context-aware token vector x (from self-attention) by a learned matrix W₁ designed to elicit abstract meaning—allowing the model to discover deeper patterns (e.g., fund that grows → performance).
Bias (b₁): Added to ensure a baseline. For example, if the fund that grows has only grown slightly, performance is still relevant, so let’s keep it.
ReLU activation: The ReLU activation function sets negative values to zero, removing signals the model scores as irrelevant, such as ‘Bankruptcy’. ‘Bankruptcy’ is related to “fund that grows” but not likely relevant, so let’s not muddy the waters with it.
Final step (W₂ + b₂): Refines and compresses the new meaning back to the model’s original dimension so the token can move forward—now “smarter”—to the next layer.
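The four feed-forward steps above can be sketched as one small function. The sizes (2 dimensions expanded to 4, then compressed back to 2) and all weight and bias values are assumptions chosen so the arithmetic is easy to follow; real layers are thousands of dimensions wide.

```python
def matvec(x, W):
    """Multiply a vector x (length n) by a matrix W (n x m)."""
    return [sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(W[0]))]


def relu(v):
    """Set negative values to zero and keep positive values unchanged."""
    return [max(0.0, a) for a in v]


def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = ReLU(x W1 + b1) W2 + b2."""
    hidden = relu([h + b for h, b in zip(matvec(x, W1), b1)])
    return [o + b for o, b in zip(matvec(hidden, W2), b2)]


# Assumed toy values: expand from 2 to 4 dimensions, then back to 2.
x = [1.0, -1.0]
W1 = [[1.0, 0.0, -1.0, 0.5],
      [0.0, 1.0, 1.0, 0.5]]
b1 = [0.0, 0.5, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 2.0]]
b2 = [0.1, 0.0]

result = feed_forward(x, W1, b1, W2, b2)
print(result)
```

In this run, three of the four hidden values are negative or zero after the bias is added, so ReLU passes only one signal through to the second matrix. That filtering is the "don't muddy the waters" step in action.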

Residual Connections and Normalization

After each major transformer step — Self-Attention and Feed-Forward — the model adds back the original information so nothing important gets lost during processing.

Figure 11: Residual Connection and Normalization

Residual Connection: x + F(x) adds the original token vector x (e.g., “fund”) back to the sublayer’s output F(x) to preserve its meaning.
Normalization: LayerNorm normalizes the result by keeping token values consistent in scale and preventing any dimension from growing too large or too small.
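The add-then-normalize step can be sketched as below. This is a simplified version under stated assumptions: the input values are made up, add_and_norm is a hypothetical helper name, and the learned scale and shift parameters that LayerNorm normally carries are omitted.

```python
import math


def layer_norm(v, eps=1e-5):
    """Rescale a vector to mean 0 and (approximately) variance 1."""
    mean = sum(v) / len(v)
    variance = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(variance + eps) for a in v]


def add_and_norm(x, sublayer_output):
    """Residual connection x + F(x), followed by layer normalization."""
    residual = [a + b for a, b in zip(x, sublayer_output)]
    return layer_norm(residual)


x = [1.0, 2.0, 3.0, 4.0]          # original token vector (assumed values)
f_of_x = [0.5, -0.5, 1.0, 0.0]    # output of self-attention or feed-forward
normalized = add_and_norm(x, f_of_x)
print([round(a, 4) for a in normalized])
```

The small eps term guards against division by zero when all values in a vector happen to be identical.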

The End Result from Transformer

Original token vector: “fund”
Context from self-attention: “fund that grows”
Deeper interpretation from feed-forward: “growth → performance”
Final hidden vector: “fund with growth implying performance”

The hidden (invisible to the user) vector is passed to the next transformer layer. After the final transformer layer, the hidden vector gets passed to the Output Head which converts it into a word prediction.

Output Head

The hidden vector h (for the last token in the sequence) is multiplied by a learned matrix W to produce a raw score for every token in the vocabulary. SoftMax normalizes these scores into probabilities between 0 and 1. In the simplest case, the highest-scoring token is selected and mapped from the vocabulary back to text. It is then either appended to the input and run back through the model to predict the next token, or returned to the user once the response is complete.
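A minimal sketch of the Output Head follows, assuming a four-token vocabulary, a 2-dimensional hidden vector, and made-up values for W. It picks the single most likely token, which is the simplest selection rule; deployed systems often sample from the probabilities instead.

```python
import math

# Hypothetical four-token vocabulary; real vocabularies are far larger.
vocab = ["fund", "grows", "falls", "steadily"]


def softmax(scores):
    """Turn raw scores into positive probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def output_head(h, W):
    """Score every vocabulary token, normalize, and pick the most likely."""
    raw_scores = [sum(h[i] * W[i][j] for i in range(len(h)))
                  for j in range(len(W[0]))]
    probabilities = softmax(raw_scores)
    best = max(range(len(probabilities)), key=lambda j: probabilities[j])
    return vocab[best], probabilities


h = [1.0, 2.0]                      # final hidden vector (assumed values)
W = [[0.1, 0.5, 0.2, 0.3],          # learned matrix: one column per token
     [0.0, 0.2, 0.1, 1.0]]

next_token, probabilities = output_head(h, W)
print(next_token, [round(p, 3) for p in probabilities])
```

Generating a full response means repeating this step: append the chosen token to the sequence, run the model again, and stop when an end-of-text token is produced or a length limit is reached.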

Why LLMs Are Like the Human Brain

In your brain:

A neuron receives inputs from other neurons through dendrites.
It combines them as weighted sums of signals.
It fires an output if the total signal is strong enough (activation).

In an LLM:

The artificial neuron receives inputs (numbers, x).
It multiplies each input by a weight and adds a bias (xW₁ + b₁).
It applies a nonlinear activation function (ReLU).
It outputs a new value (FFN(x)).
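The artificial neuron in the list above is only a few lines of code. The inputs, weights, and bias here are made-up values, and neuron is a hypothetical function name.

```python
def neuron(inputs, weights, bias):
    """Weighted sum of inputs plus a bias, passed through ReLU."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, total)  # "fires" only if the combined signal is positive


# Assumed example values.
print(neuron([1.0, 2.0], [0.5, 0.25], -0.5))   # signal strong enough to fire
print(neuron([1.0, 2.0], [-1.0, -1.0], 0.0))   # negative signal is silenced
```

A feed-forward layer is many of these neurons evaluated side by side, which is why it can be written as a single matrix multiplication.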

Figure 13: Human and Artificial Neuron

Each artificial “neuron” loosely mimics the math of a biological one: combining, activating, and passing signals forward. Together, embedding layers, transformer blocks, and the Output Head form networks containing millions of artificial neurons working with billions, and in the largest models reportedly trillions, of parameters.

Why Do LLMs Work So Well? No One Knows for Sure

As MIT Technology Review noted in 2024, “for all its runaway success, nobody knows exactly how—or why—it works.”

The mechanics (embedding, attention, normalization) are understood, but how scaling these simple formulas produces reasoning, abstraction, and understanding that far exceed expectations is not. Many now believe LLMs are more a discovery than an invention, which raises the question: How do we fully predict and govern LLMs if we don’t understand their most powerful and emergent behavior?

Call to action

At Graves Light Consulting, we simplify complex technologies into compelling narratives—turning C-level prospects into champions for your product. Ready to elevate your messaging and win more deals? Let’s talk!