Thursday, March 12, 2026
Surface Scan

Information Theory: The 1948 Paper That Built the Digital World

science · technology · technical · ai · philosophy

What Is This?

In 1948, Claude Shannon — a 32-year-old mathematician at Bell Telephone Laboratories — published a paper in the Bell System Technical Journal titled "A Mathematical Theory of Communication." In 55 pages, it founded an entirely new scientific discipline, introduced the bit as the fundamental unit of information, proved that any message source can be compressed down to a mathematical minimum and no further, proved that messages can be transmitted reliably over a noisy channel at any rate below that channel's capacity, and established the theoretical limits of compression and transmission that remain the ceiling of every communications system built since.

John von Neumann reportedly told Shannon to use the term "entropy" for his central measure — not because it meant the same thing as thermodynamic entropy, but because "no one knows what entropy really is, so in a debate you will always have the advantage." Shannon was more rigorous than that quip suggests, but the strangeness of information entropy — a measure of uncertainty, of missing knowledge, of the irreducible randomness in a message — is genuinely deep.

The problem Shannon was solving:

Bell Labs needed to know how efficiently telephone cables could carry information. Before Shannon, "information" had no mathematical definition. Engineers had intuitions — more complex signals carry more information, redundancy wastes bandwidth, noise corrupts messages — but no theoretical framework to quantify any of it. Shannon built the framework from scratch.

The core concepts:

Entropy (H): Shannon defined the information content of a message source in terms of surprise. A message you could predict perfectly contains zero information — it tells you nothing you didn't already know. A message that is completely unpredictable contains maximum information. Formally: H = -Σ p(x) log₂ p(x), where the sum runs over all possible symbols x, each weighted by its probability p(x). The result is measured in bits.
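A minimal sketch of the definition in Python (the coin distributions below are illustrative, not from Shannon's paper):

```python
import math

def shannon_entropy(probs):
    """Entropy in bits: H = -sum(p * log2(p)) over outcomes with p > 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin is maximally unpredictable: exactly 1 bit per toss.
print(shannon_entropy([0.5, 0.5]))      # 1.0

# A heavily biased coin is mostly predictable: far less than 1 bit.
print(shannon_entropy([0.99, 0.01]))    # ~0.08

# A certain outcome carries no information at all.
print(shannon_entropy([1.0]))           # 0.0
```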

This has a profound implication: information has nothing to do with meaning. Shannon explicitly stripped meaning from information theory. A random string of characters and a Shakespeare sonnet of the same length can have the same information content — what matters is the probability distribution of the symbols, not what they signify. The theory is about uncertainty reduction, not semantic content.

The Source Coding Theorem: Any message source can be compressed losslessly down to its entropy rate — the minimum number of bits per symbol required to represent the source without loss — but no further. Compressing below the entropy rate is impossible without losing information. Every lossless compression format (ZIP, PNG, FLAC) works against this limit.
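A quick way to feel this limit empirically, with zlib standing in for "a good lossless compressor" (the sample strings are purely illustrative):

```python
import os
import zlib

def compressed_fraction(data: bytes) -> float:
    """Compressed size as a fraction of the original size."""
    return len(zlib.compress(data, level=9)) / len(data)

# English-like text has an entropy rate far below 8 bits per character,
# so a lossless compressor has plenty of room to shrink it.
english = ("the quick brown fox jumps over the lazy dog " * 200).encode("ascii")

# Uniformly random bytes already carry ~8 bits of entropy per byte,
# so no lossless compressor can make them meaningfully smaller.
noise = os.urandom(len(english))

print(f"english text : {compressed_fraction(english):.3f} of original size")
print(f"random bytes : {compressed_fraction(noise):.3f} of original size (about 1.0 or slightly more)")
```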

Channel Capacity: Every communication channel — a telephone wire, a radio spectrum, an optical fibre, a Wi-Fi signal — has a maximum rate at which information can be reliably transmitted, regardless of noise. Shannon called this the channel capacity and, for the important case of a bandlimited channel with Gaussian noise, gave a formula for it: C = B log₂(1 + S/N), where B is bandwidth and S/N is the signal-to-noise ratio. The Noisy Channel Coding Theorem proved that as long as you transmit below channel capacity, you can achieve arbitrarily low error rates by using sufficiently clever encoding — even on a noisy channel. This was shocking. Engineers had assumed noise was a fundamental barrier that could only be reduced, not overcome. Shannon proved it could, in principle, be overcome completely.
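A worked example of the capacity formula (the numbers are illustrative, roughly a voice-grade telephone line, and not taken from the paper):

```python
import math

def channel_capacity_bps(bandwidth_hz: float, snr_linear: float) -> float:
    """Shannon-Hartley capacity in bits per second: C = B * log2(1 + S/N)."""
    return bandwidth_hz * math.log2(1 + snr_linear)

bandwidth = 3000.0               # Hz, roughly a voice-grade phone line
snr_db = 30.0                    # a fairly clean line
snr = 10 ** (snr_db / 10)        # convert decibels to a linear power ratio

print(f"capacity ~ {channel_capacity_bps(bandwidth, snr):.0f} bits/s")  # ~29,900
```

Those illustrative numbers give a ceiling of roughly 30 kbit/s, close to where fully analogue dial-up modems topped out no matter how clever their modulation became.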

Why Does It Matter?

  • Every digital system ever built runs on Shannon's math. The bit is Shannon's invention (he credited John Tukey for coining the word). The compression algorithms that make files small, the error-correcting codes that make data transmission reliable, the modulation schemes that encode data onto radio waves, the JPEG and MP3 formats that compress images and audio — all are applications of Shannon's theoretical framework. The internet, mobile networks, storage systems, and satellite communications are all operating within limits Shannon defined in 1948.^1
  • Shannon entropy is what language model training is actually minimising. LLM training loss — cross-entropy loss — is a direct application of Shannon's entropy measure. When a model predicts the next token, it produces a probability distribution over the vocabulary. Cross-entropy loss measures how poorly that distribution scores the token that actually came next, and averaged over a corpus it cannot fall below the entropy of the training data itself. Minimising this loss pushes the model's predicted distribution toward the data's true distribution (a minimal sketch follows this list). The entire current AI revolution is, at its mathematical core, a Shannon entropy minimisation exercise. This is not a loose analogy — it's the literal loss function.^2
  • The separation of information from meaning is philosophically consequential. Shannon's abstraction — stripping meaning from information to make it mathematically tractable — is the move that made digital communication possible. But it also reveals something strange: the fundamental quantity in information theory is uncertainty, not content. This maps onto the hard problem of consciousness (information processing doesn't explain subjective experience), onto the limits of statistics (correlations don't imply meaning), and onto the fundamental challenge for AI (a model that minimises Shannon entropy on text is not necessarily understanding it in any deeper sense). The gap between information-theoretic information and semantic meaning is one of the deep unresolved questions in philosophy of mind.
  • It defines the absolute limits of what's compressible — which matters for AI training data. The entropy of English text is approximately 1–1.5 bits per character, meaning a perfectly efficient lossless compressor could represent English in roughly that many bits per character rather than the 8 bits per character of ASCII text. LLMs implicitly perform this compression — the model's weights are a compressed representation of the statistical regularities in the training data. The question of how much information is in a training corpus, and how much of it a model has retained, is fundamentally a Shannon entropy question.^3
  • Redundancy is a feature, not a bug — and Shannon explains why. Human languages are highly redundant. English text is roughly 50% redundant — you could remove half the letters and still recover the message (txt mssg r rdbl dspte mssng ltrs). This redundancy makes language robust to noise, error, and partial information loss. Shannon's framework makes redundancy quantifiable and controllable: error-correcting codes deliberately add redundancy in exactly the right amount to allow error recovery (a worked example follows this list). Triple-redundant critical systems, parity bits in storage arrays, forward error correction on DVDs — all are Shannon's framework in application.
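A minimal sketch of the cross-entropy point above, in plain Python rather than any particular training framework (the toy vocabulary and probabilities are invented for illustration; real frameworks use the natural logarithm, which differs from bits only by a constant factor):

```python
import math

# A toy vocabulary and a model's predicted distribution over the next token.
predicted = {"the": 0.1, "cat": 0.2, "sat": 0.6, "mat": 0.1}

# In next-token training the "true" distribution is one-hot: the token that
# actually came next gets probability 1 and everything else gets 0, so the
# cross-entropy for this step collapses to -log p(actual next token).
actual_next = "sat"
loss_bits = -math.log2(predicted[actual_next])

print(f"loss = {loss_bits:.3f} bits")   # ~0.737

# A confident, correct prediction costs ~0 bits; a confident, wrong one costs
# many bits. Averaged over a corpus, this loss is bounded below by the
# entropy rate of the data itself.
```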
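And a worked example of the error-correction point, using the textbook Hamming(7,4) code (this is Hamming's construction, described under Key People below, not something from Shannon's paper): it spends three parity bits to protect four data bits against any single flipped bit.

```python
def hamming74_encode(d1, d2, d3, d4):
    """Encode 4 data bits into a 7-bit codeword (bit positions 1..7)."""
    p1 = d1 ^ d2 ^ d4                     # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4                     # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4                     # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Correct any single-bit error, then return the 4 data bits."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]        # recheck parity group 1
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]        # recheck parity group 2
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]        # recheck parity group 3
    error_pos = s1 + 2 * s2 + 4 * s3      # 0 means "no error detected"
    if error_pos:
        c = list(c)
        c[error_pos - 1] ^= 1             # flip the corrupted bit back
    return [c[2], c[4], c[5], c[6]]

data = [1, 0, 1, 1]
sent = hamming74_encode(*data)
received = list(sent)
received[4] ^= 1                          # noise: the channel flips one bit
print(hamming74_decode(received) == data) # True: the error was repaired
```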

Key People & Players

Claude Shannon (1916–2001) — Bell Labs mathematician and the singular founder of information theory. He also wrote the foundational 1950 paper on programming a computer to play chess, built juggling machines and eccentric unicycles (he was a committed unicyclist and juggler at Bell Labs), created the first wearable computer (a roulette-prediction device built in 1961 with Edward Thorp), and pioneered the mathematical theory of cryptography. His intellectual range was extraordinary. He largely withdrew from public life in the 1960s despite his influence, and spent his later years playing with toys and games.^4

Norbert Wiener (MIT) — Developed cybernetics simultaneously with Shannon's information theory. Wiener's Cybernetics (1948, same year as Shannon's paper) examined feedback, control, and communication in both machines and living systems. The two frameworks are complementary — information theory is about transmitting signals; cybernetics is about using feedback to control systems.

Richard Hamming (Bell Labs) — Shannon's colleague who developed Hamming codes — the first practical error-correcting codes, and a constructive step toward the reliable transmission Shannon's noisy channel theorem had shown was possible. Every memory system, storage device, and communication protocol uses error correction descended from Hamming's work. His lecture "You and Your Research" is one of the most quoted talks in computer science.

John von Neumann — The polymath who reportedly suggested that Shannon call his measure "entropy," and whose architecture for digital computers (the von Neumann architecture that almost all computers still use) was developed in close intellectual proximity to Shannon's information theory. The stored-program computer and information theory were born in the same period, in the same intellectual ecosystem.

David MacKay (Cambridge, 1967–2016) — Author of Information Theory, Inference, and Learning Algorithms (2003), freely available online, which remains the most accessible rigorous textbook connecting Shannon's information theory to modern machine learning, Bayesian inference, and neural networks. His death from cancer at 48 was a significant loss to the field.

The Current State

Information theory is not primarily a research frontier — it is foundational infrastructure, the way Newtonian mechanics is infrastructure. The theorems Shannon proved in 1948 are still the limits. No compression algorithm has beaten the Shannon entropy limit. No communication system has exceeded channel capacity.

Where it's active:

Quantum information theory — How does Shannon's framework extend to quantum systems, where information is encoded in quantum states and governed by quantum mechanics? The field has roots in John Bell's 1964 inequalities and has exploded with quantum computing. Quantum channel capacities, quantum error correction, and quantum key distribution are all Shannon theory extended to quantum mechanics.

Algorithmic information theory (Kolmogorov complexity) — Andrei Kolmogorov, Ray Solomonoff, and Gregory Chaitin independently developed a measure of information that applies to individual objects rather than ensembles: how long is the shortest program that generates a given string? This is the ultimate, though uncomputable, limit of compression, and the basis of a deep connection between information, randomness, and computability.
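A toy illustration of the "shortest program" idea in Python (Kolmogorov complexity itself is uncomputable; this only shows that structure permits a short description):

```python
# A million-character string can have a tiny description: everything below is
# generated by a 14-character Python expression, so its Kolmogorov complexity
# is at most a few dozen bytes despite its length.
program = '"ab" * 500_000'
output = eval(program)

print(len(program))   # 14
print(len(output))    # 1000000

# By contrast, a string of a million genuinely random characters has no
# generator much shorter than itself, which is exactly why random data
# resists compression.
```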

AI training and scaling — The connection between Shannon entropy and LLM loss functions is the most active application currently. Questions about the information content of training data, the entropy of different data sources, and what it means for a model to have "learned" the information in its training set are all fundamentally information-theoretic questions being explored empirically at enormous scale.

Shannon's paper is 55 pages and freely available online. Parts of it are readable by anyone. It is one of the most consequential documents ever written, and almost nobody who uses the systems it underlies has ever read a sentence of it.

Best Resources to Learn More

  • "A Mathematical Theory of Communication" (1948) — The original paper, free online. The introduction and early sections are accessible.^5
  • The Information by James Gleick — The narrative history of information theory, from its precursors to Shannon to the digital age. Beautifully written, no equations.^6
  • Information Theory, Inference, and Learning Algorithms by David MacKay — The textbook connecting Shannon's theory to modern ML, free online at inference.org.uk.^7
  • 3Blue1Brown: "Hamming codes" and information theory videos — The best visual introductions to the concepts. The Hamming code series is exceptional.^8
  • A Mind at Play by Jimmy Soni & Rob Goodman — Biography of Shannon. Excellent for understanding the man and the era.^9
