How to Build LLM Architectures From Scratch

Source: @shabnam_774 on X · 2026-05-24

A Deep Dive Into the Systems Behind Models Like OpenAI ChatGPT and Anthropic Claude


Most people use AI models every day.

Very few understand how they're actually built.

Under the hood, Large Language Models (LLMs) are not magic. They are massive prediction systems trained on huge amounts of text using carefully designed neural network architectures.

But building one from scratch is far more complex than simply "training a chatbot."

It involves:

This article breaks down the full architecture of modern LLMs step-by-step in a practical and understandable way.


1. What Is an LLM?

A Large Language Model is a neural network trained to predict the next token in a sequence.

Example:

Input:

"The future of AI is"

The model predicts:

"transformative"

Then continues predicting one token at a time.

That's the foundation of systems like ChatGPT and Claude.

At scale, this simple prediction process becomes incredibly powerful.
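
The prediction loop described above can be sketched in a few lines. This is a toy illustration only: `NEXT_TOKEN_PROBS` is a hypothetical hand-written lookup table with made-up probabilities, and it conditions on just the last token, whereas a real LLM conditions on the whole context and computes probabilities with a neural network.

```python
# Toy next-token table (hypothetical, not from any real model):
# maps the current token to candidate next tokens and their probabilities.
NEXT_TOKEN_PROBS = {
    "The": {"future": 0.9, "past": 0.1},
    "future": {"of": 1.0},
    "of": {"AI": 0.8, "work": 0.2},
    "AI": {"is": 1.0},
    "is": {"transformative": 0.6, "uncertain": 0.4},
    "transformative": {"<eos>": 1.0},
}


def generate(prompt_tokens, max_new_tokens=5):
    """Greedy decoding: repeatedly append the most probable next token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        candidates = NEXT_TOKEN_PROBS.get(tokens[-1])
        if not candidates:
            break  # no known continuation for this token
        next_token = max(candidates, key=candidates.get)
        if next_token == "<eos>":
            break  # end-of-sequence marker stops generation
        tokens.append(next_token)
    return tokens


print(generate(["The", "future", "of", "AI", "is"]))
```

Greedy selection is only one decoding choice; sampling from the distribution instead of taking the maximum is what gives varied outputs.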


2. The Core Pipeline of Building an LLM

The full process looks like this:

Raw Internet Data
        ↓
Cleaning + Filtering
        ↓
Tokenization
        ↓
Transformer Architecture
        ↓
Pretraining
        ↓
Fine-Tuning
        ↓
RLHF / Alignment
        ↓
Inference Optimization
        ↓
Deployment

Every stage matters.

A weak dataset or poor architecture design can ruin the entire model.


3. Step One: Data Collection

LLMs need enormous datasets.

Modern frontier models train on:

Data sources may include:

The goal is diversity + scale.

A smaller model trained on carefully curated data can match or beat a larger model trained on noisy data.


4. Data Cleaning and Filtering

Raw internet data is messy.

You must remove:

This stage is massively underestimated.

Companies spend enormous resources on data quality because:

Better data > Bigger models

Common filtering methods include:
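
Exact deduplication and simple heuristic quality checks are two filters widely used in practice. The sketch below assumes documents are plain strings; the function names and the thresholds `min_words` and `min_alpha_ratio` are illustrative choices, not values from any specific pipeline.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def clean_corpus(documents, min_words=5, min_alpha_ratio=0.6):
    """Drop very short, symbol-heavy, and exactly duplicated documents."""
    seen_hashes = set()
    kept = []
    for doc in documents:
        norm = normalize(doc)
        if not norm or len(norm.split()) < min_words:
            continue  # too short to be useful
        alpha = sum(ch.isalpha() or ch.isspace() for ch in norm)
        if alpha / len(norm) < min_alpha_ratio:
            continue  # mostly digits or symbols
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        seen_hashes.add(digest)
        kept.append(doc)
    return kept
```

Production pipelines typically add near-duplicate detection and learned quality classifiers on top of rules like these.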


5. Tokenization: Converting Text Into Numbers

Neural networks don't understand words.

They understand numbers.

So text becomes tokens.

Example:

"ChatGPT is powerful"
↓
[1532, 4021, 318, 7821]   (illustrative token IDs)

This process is called tokenization.

Popular tokenization methods:

Tokens can represent:

Efficient tokenization dramatically affects performance and cost.
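
Byte-pair encoding (BPE) is one widely used subword method. The sketch below is a minimal version of its training step, assuming a whitespace-split corpus and character-level starting symbols; the function names are illustrative and real tokenizers add byte-level handling, special tokens, and much faster data structures.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Return the learned merge rules and the corpus words split into subwords."""
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break  # every word is already a single symbol
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


merges, words = learn_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(words)
```

Frequent character sequences become single tokens, which is why common words cost fewer tokens than rare ones.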


6. Embeddings: Giving Tokens Meaning

Tokens are converted into vectors.

A vector is basically a list of numbers representing semantic meaning.

Example:

King → [0.2, -0.8, 1.4, ...]
Queen → [0.3, -0.7, 1.5, ...]

Similar concepts end up close together in vector space.

This is how models learn relationships between words.

Embeddings are the foundation of semantic understanding.
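
"Close together in vector space" is usually measured with cosine similarity. The sketch below uses three-dimensional toy vectors: the first two reuse the leading numbers shown above, and the "banana" vector is a made-up contrast case. Real embeddings have hundreds or thousands of learned dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings (illustrative values, not learned by any model).
embeddings = {
    "king": [0.2, -0.8, 1.4],
    "queen": [0.3, -0.7, 1.5],
    "banana": [-1.1, 0.9, 0.1],
}

print(cosine_similarity(embeddings["king"], embeddings["queen"]))
print(cosine_similarity(embeddings["king"], embeddings["banana"]))
```

The first similarity comes out close to 1 and the second is negative, which is the numeric form of "related concepts sit near each other."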


7. The Transformer Architecture

This changed everything.

The Transformer architecture was introduced in the landmark paper:

"Attention Is All You Need", published in 2017 by researchers at Google Brain, Google Research, and the University of Toronto.

Transformers replaced older recurrent systems such as RNNs and LSTMs because they process tokens in parallel and scale dramatically better.

The Transformer architecture powers nearly every modern LLM today.


8. Self-Attention: The Heart of LLMs

Self-attention allows the model to determine:

Which words matter most in context.

Example:

"The animal didn't cross the street because it was tired."

The model learns that:

"it" refers to "animal"

not "street."

Self-attention dynamically weighs relationships between tokens.

This enables contextual understanding.


9. Understanding Q, K, and V (Query, Key, Value)

Attention works using three vectors computed for each token: a query (Q), a key (K), and a value (V).

Think of it like search.

Each token asks:

"Which other tokens are relevant to me?"

Then attention scores determine importance.

Formula:

Attention(Q,K,V) = softmax(QKᵀ / √dₖ)V

This is one of the most important equations in modern AI.
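
The formula can be written out directly. This is a minimal single-head sketch in plain Python, assuming `Q`, `K`, and `V` are lists of row vectors that have already been projected; `d_k` is the key dimension. It omits the causal mask that decoder-style LLMs add so a token cannot attend to later positions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of numbers."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V row by row."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights


Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, w = attention(Q, K, V)
print(w)
print(out)
```

Each output row is a blend of the value vectors, with more weight on the positions whose keys best match the query.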


10. Multi-Head Attention

Instead of using one attention mechanism:

LLMs use many attention heads simultaneously.

Each head learns different relationships:

This massively improves representation learning.


11. Positional Encoding

Transformers process tokens in parallel.

But language has order.

So models need positional information.

Example:

Dog bites man
Man bites dog

Same words. Completely different meaning.

Positional encoding helps the model understand sequence structure.
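
One concrete scheme is the sinusoidal encoding from the original Transformer paper, sketched below. The function name is illustrative. Many current LLMs use learned or rotary position embeddings instead, so treat this as one example rather than the standard.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # i is the even dimension index, so i / d_model equals 2k / d_model.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


for row in positional_encoding(seq_len=3, d_model=4):
    print([round(v, 3) for v in row])
```

These vectors are added to the token embeddings, so "dog" at position 0 and "dog" at position 2 enter the model as different inputs.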


12. Feed Forward Networks

After attention layers, tokens pass through feed-forward neural networks.

These layers:

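A transformer block (shown here in the original post-norm layout, with the residual connections around each sub-layer omitted) usually contains: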

Attention
↓
Normalization
↓
Feed Forward Network
↓
Normalization

Repeated dozens or hundreds of times.
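
The feed-forward half of a block can be sketched as follows. This assumes the post-norm ordering and a ReLU activation, and it leaves out the learnable scale and shift that layer normalization normally has; all function names and weights are illustrative.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(x, W, b):
    """y = x W + b, with W given as a list of rows (in_dim x out_dim)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]


def feed_forward(x, W1, b1, W2, b2):
    """Two linear layers with a ReLU in between, applied to one token vector."""
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]
    return linear(hidden, W2, b2)


def ffn_sublayer(x, W1, b1, W2, b2):
    """Post-norm sub-layer with a residual connection: LayerNorm(x + FFN(x))."""
    y = feed_forward(x, W1, b1, W2, b2)
    return layer_norm([xi + yi for xi, yi in zip(x, y)])


x = [1.0, -1.0]
W1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
W2, b2 = [[2.0, 0.0], [0.0, 2.0]], [0.5, 0.5]
print(ffn_sublayer(x, W1, b1, W2, b2))
```

The same network is applied to every token position independently; only attention mixes information across positions.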


13. Scaling Laws

One major discovery in AI:

Bigger models trained on more data generally perform better.

Scaling involves:

Examples:

Modern frontier systems may use trillions of parameters (sometimes via Mixture-of-Experts).
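
A common rule of thumb for dense transformers estimates training compute as roughly 6 FLOPs per parameter per training token. The sketch below applies it; the model size, token count, GPU throughput, and 40% utilization figure are hypothetical inputs chosen for illustration, and the rule does not apply directly to Mixture-of-Experts models, where only part of the parameters is active per token.

```python
def training_flops(num_parameters, num_tokens):
    """Rule-of-thumb training compute for a dense transformer: C ~ 6 * N * D."""
    return 6 * num_parameters * num_tokens


def gpu_days(total_flops, peak_flops_per_second, utilization=0.4):
    """Convert total FLOPs into GPU-days at a given fraction of peak throughput."""
    seconds = total_flops / (peak_flops_per_second * utilization)
    return seconds / 86_400


# Hypothetical example: 7B parameters, 1T tokens, 1e15 FLOP/s peak per GPU.
flops = training_flops(7_000_000_000, 1_000_000_000_000)
print(f"{flops:.2e} FLOPs")
print(f"{gpu_days(flops, 1e15):.0f} GPU-days")
```

Back-of-envelope numbers like these are how teams decide whether a planned model fits their compute budget before training starts.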


14. Training the Model

Training means adjusting weights to minimize prediction error.

Process:

Input sentence
↓
Predict next token
↓
Compare prediction vs actual token
↓
Calculate loss
↓
Backpropagation
↓
Update weights

This repeats billions of times.
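
The loop above can be run end to end on a toy problem. The sketch below trains a bigram model (a table of logits indexed by the previous token) with plain gradient descent. It is an illustration under strong simplifications: for a softmax with cross-entropy loss the gradient with respect to the logits has the closed form `probs - one_hot(target)`, so no general backpropagation engine is needed, and the hyperparameters are arbitrary.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bigram(token_ids, vocab_size, epochs=200, lr=0.5):
    """Fit logits[prev][next] by gradient descent on next-token cross-entropy."""
    logits = [[0.0] * vocab_size for _ in range(vocab_size)]
    pairs = list(zip(token_ids, token_ids[1:]))  # (input token, actual next token)
    history = []
    for _ in range(epochs):
        total_loss = 0.0
        for prev, target in pairs:
            probs = softmax(logits[prev])            # predict next token
            total_loss += -math.log(probs[target])   # loss vs. actual token
            for j in range(vocab_size):
                grad = probs[j] - (1.0 if j == target else 0.0)
                logits[prev][j] -= lr * grad         # update weights
        history.append(total_loss / len(pairs))
    return logits, history


logits, history = train_bigram([0, 1, 2, 0, 1, 2, 0, 1, 2], vocab_size=3)
print(f"first epoch loss {history[0]:.3f}, last epoch loss {history[-1]:.3f}")
```

A real LLM replaces the lookup table with a deep transformer and the hand-written gradient with automatic differentiation, but the loop has the same shape.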

Training large models can require:


15. GPUs and Distributed Training

LLMs are computational monsters.

Training requires clusters of GPUs like:

Training methods include:

Frameworks:

Infrastructure becomes as important as model design.


16. Loss Functions and Optimization

The model learns using optimization algorithms like:

Objective:

Minimize prediction loss.

Cross-entropy loss is commonly used for language modeling.

Smaller loss = better predictions.
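
As a worked example, cross-entropy is the average negative log-probability the model assigned to the tokens that actually came next. The probabilities below are made up for illustration.

```python
import math


def cross_entropy(predicted_probs, target_ids):
    """Average of -log p(correct token) over all positions."""
    losses = [-math.log(probs[t]) for probs, t in zip(predicted_probs, target_ids)]
    return sum(losses) / len(losses)


# Each row is a predicted distribution over a 4-token vocabulary (toy values).
confident_and_right = [[0.90, 0.05, 0.03, 0.02]]
uniform_guess = [[0.25, 0.25, 0.25, 0.25]]
confident_and_wrong = [[0.02, 0.90, 0.05, 0.03]]

for name, probs in [
    ("confident and right", confident_and_right),
    ("uniform guess", uniform_guess),
    ("confident and wrong", confident_and_wrong),
]:
    print(name, round(cross_entropy(probs, [0]), 3))
```

A uniform guess over 4 tokens scores ln(4), and the loss grows sharply when the model is confident about the wrong token.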


17. Fine-Tuning

After pretraining, models are specialized.

Examples:

Fine-tuning uses smaller curated datasets.

This adapts the base model to specific tasks.


18. RLHF: Reinforcement Learning From Human Feedback

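This is a large part of what makes ChatGPT-like systems helpful in conversation.

Human labelers rank alternative model outputs.

A reward model learns those preferences, and the language model is then optimized against it.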

Pipeline:

Base Model
↓
Supervised Fine-Tuning
↓
Reward Model
↓
Reinforcement Learning
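
The reward model step is commonly trained with a pairwise loss on human rankings: the reward for the preferred response should exceed the reward for the rejected one. The sketch below shows that loss for a single pair, assuming the two scalar rewards have already been produced by some reward model; the function name is illustrative.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    diff = reward_chosen - reward_rejected
    # Equivalent to log(1 + exp(-diff)) but safe for large |diff|.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


print(pairwise_preference_loss(2.0, 0.0))  # ranking respected: small loss
print(pairwise_preference_loss(0.0, 2.0))  # ranking violated: larger loss
```

Minimizing this loss pushes the reward model to score human-preferred answers higher, and the reinforcement learning stage then tunes the language model to earn high scores.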

RLHF helps models become:


19. Context Windows and Memory

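Context window = the maximum number of tokens (prompt plus generated text) the model can attend to at once during inference. It acts as working memory, not long-term memory.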

Examples:

Longer context requires advanced optimization because the cost of standard self-attention grows quadratically with sequence length.

New techniques include:


20. Inference Optimization

Training is expensive.

Inference must be fast.

Optimization techniques include:

Goal:

Lower latency + lower cost.
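
Weight quantization is one widely used way to cut memory and cost. The sketch below shows symmetric per-tensor int8 quantization on a plain list of floats. It is an assumption-laden illustration: the function names are hypothetical, and production systems typically quantize per channel or per group and run the integer math in optimized kernels.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers."""
    return [q * scale for q in quantized]


weights = [0.4, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)
print([round(r, 4) for r in restored])
```

Each weight now needs 8 bits instead of 16 or 32, at the price of a rounding error of at most half the scale per weight.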


21. Retrieval-Augmented Generation (RAG)

LLMs don't truly "know" everything.

So modern systems retrieve external knowledge dynamically.

Pipeline:

User Query
↓
Search Database
↓
Retrieve Relevant Chunks
↓
Inject Into Prompt
↓
Generate Response
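
The first four steps of this pipeline can be sketched without any external services. The example below is a simplification: it scores chunks with bag-of-words cosine similarity, whereas real systems usually use learned embedding vectors and a vector database, and the prompt template and function names are made up for illustration. The final generation step is left to whatever model receives the prompt.

```python
import math
from collections import Counter


def bag_of_words(text):
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, chunks, top_k=2):
    """Inject the retrieved chunks into the prompt ahead of the question."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, top_k))
    return f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"


chunks = [
    "Paris is the capital of France",
    "The mitochondria is the powerhouse of the cell",
    "Python is a programming language",
]
print(build_prompt("capital of France", chunks, top_k=1))
```

The model never has to have memorized the answer; it only has to read it from the injected context.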

This improves:


22. Mixture-of-Experts (MoE)

Modern frontier models increasingly use MoE architectures.

Instead of activating the entire model:

Only selected expert networks activate per token.

Benefits:

This is believed to be important in many modern systems.
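
Per-token expert selection can be sketched as top-k routing. In the example below the experts are plain Python functions and the router scores are passed in directly; in a real model the scores come from a small learned layer and the experts are feed-forward networks. All names and numbers are illustrative.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_logits, experts, top_k=2):
    """Route one token vector to its top_k experts and mix their outputs."""
    ranked = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    gates = softmax([router_logits[i] for i in chosen])  # renormalize over chosen experts
    out = [0.0] * len(x)
    for gate, idx in zip(gates, chosen):
        y = experts[idx](x)  # only the selected experts are evaluated
        out = [o + gate * v for o, v in zip(out, y)]
    return out, chosen


def make_expert(factor):
    """Toy expert that just scales its input."""
    return lambda x: [factor * v for v in x]


experts = [make_expert(f) for f in (1.0, 2.0, 3.0, 4.0)]
out, chosen = moe_forward([1.0], [0.1, 2.0, -1.0, 2.0], experts, top_k=2)
print(chosen, out)
```

With 4 experts and `top_k=2`, only half of the expert parameters do any work for this token, which is how MoE models grow total parameter count without a matching growth in compute per token.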


23. AI Alignment and Safety

Raw models can produce harmful outputs.

Alignment techniques help enforce:

Techniques include:

Alignment is now one of the hardest problems in AI.


24. The Real Challenge Isn't the Architecture

Most people think the hardest part is building the transformer.

It isn't.

The hardest parts are:

The transformer paper was only the beginning.

The real engineering challenge is making these systems scalable and usable.


25. Final Thought

LLMs are one of the most important technological breakthroughs in modern history.

But they are not magic.

They are the result of:

And we are still extremely early.

The next decade of AI will likely be defined by:

Understanding how LLMs are built is no longer optional for engineers.

It's becoming foundational knowledge for the future of technology.


Original post: https://x.com/shabnam_774/status/2058517919760355729