If you’ve ever written code in Python, CUDA, or TensorFlow, you’ve probably seen terms like float16, float32, or float64. They map directly to the IEEE-754 floating-point standard:

  • Half precision (16 bits)
  • Single precision (32 bits)
  • Double precision (64 bits)
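To make the mapping concrete, here is a quick sketch (assuming NumPy is installed) that asks each dtype for its total bit width and its stored mantissa bits:

```python
# Inspect the IEEE-754 formats behind NumPy's float dtypes.
# np.finfo reports the total width (bits) and the stored mantissa bits (nmant).
import numpy as np

for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    print(f"{dtype.__name__}: {info.bits} bits total, {info.nmant} mantissa bits")

# float16: 16 bits total, 10 mantissa bits
# float32: 32 bits total, 23 mantissa bits
# float64: 64 bits total, 52 mantissa bits
```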

But what do these formats really mean? Why does 16-bit give you only “3 digits of accuracy,” and why is 32-bit called single precision while 64-bit is double? Let’s dive in — and also tackle the common confusions people face when they first see binary scientific notation.


The Core Idea: Scientific Notation in Binary

In decimal, we write large or small numbers in scientific notation: 6.022 × 10²³

  • 6.022 → significant digits
  • 10²³ → scaling factor

Floating point does exactly the same thing in binary:

  • Sign → 1 bit for +/−
  • Exponent → controls scale (powers of 2)
  • Mantissa (fraction/significand) → stores the significant digits
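
As a rough sketch of how those three fields combine into a value (normalized numbers only; the helper name and its single-precision defaults, 23 mantissa bits and an exponent bias of 127, are just for illustration):

```python
# Combine sign, exponent, and mantissa fields into a value.
# Normalized numbers only: the leading 1 of the significand is implicit (the "hidden bit").
def assemble(sign: int, exponent_field: int, mantissa_field: int,
             mantissa_bits: int = 23, bias: int = 127) -> float:
    significand = 1 + mantissa_field / 2**mantissa_bits   # 1.xxxx... in binary
    return (-1) ** sign * significand * 2 ** (exponent_field - bias)

print(assemble(0, 127, 0))  # 1.0: positive sign, significand 1.0, scale 2**(127 - 127) = 1
```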

Example: Representing the Number 5

  • Decimal 5 = binary 101₂.

Now we want to write this in binary scientific notation. Move the “binary point” until only one digit remains before it:

101₂  =  1.01₂ × 2²

At first glance, someone might think:

“But 1.01 × 4 = 4.04, not 5!”

That’s the classic confusion.

Here’s the fix:

  • 1.01₂ is not decimal 1.01.
  • It means 1 + 0/2 + 1/4 = 1.25 in decimal.
  • So: 1.25 × 2² = 1.25 × 4 = 5.

✅ Exactly right.
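
You can confirm this with plain Python, and (using only the standard library) peek at how a 32-bit float actually stores 5:

```python
import struct

# 1.01₂ is 1 + 0/2 + 1/4 = 1.25 in decimal, and scaling by 2² gives back 5.
print((1 + 0/2 + 1/4) * 2**2)          # 5.0
print(int("101", 2))                   # 5, the original binary integer

# The stored bit fields of 5.0 as a 32-bit (single-precision) float:
bits = struct.unpack(">I", struct.pack(">f", 5.0))[0]
print(bits >> 31)                      # sign: 0
print(((bits >> 23) & 0xFF) - 127)     # unbiased exponent: 2
print(1 + (bits & 0x7FFFFF) / 2**23)   # significand: 1.25
```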


Binary Fractions Cheat Sheet

To avoid tripping over this:

Binary fraction   Decimal value
0.1₂              0.5
0.01₂             0.25
0.11₂             0.75
1.01₂             1.25

So whenever you see a binary point, think in terms of powers of 2 (½, ¼, ⅛…), not powers of 10.
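
If you want to check entries like these yourself, here is a small illustrative helper (not a library function) that evaluates a binary fraction string by summing powers of 2:

```python
# Evaluate a binary fraction string such as "1.01" by summing powers of 2.
def binary_fraction_to_decimal(s: str) -> float:
    int_part, _, frac_part = s.partition(".")
    value = int(int_part, 2) if int_part else 0
    for i, bit in enumerate(frac_part, start=1):
        value += int(bit) * 2**-i        # each place is 1/2, 1/4, 1/8, ...
    return value

for s in ["0.1", "0.01", "0.11", "1.01"]:
    print(s, "->", binary_fraction_to_decimal(s))   # 0.5, 0.25, 0.75, 1.25
```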


Precision: Why Half = ~3 Digits, Single = ~7, Double = ~16

The mantissa is where precision lives. Each mantissa bit lets you store one more “digit” in binary. To estimate decimal precision, we convert bits → digits:

decimal digits ≈ mantissa bits × log₁₀(2)

Since log₁₀(2) ≈ 0.301:

  • Half precision (10 mantissa bits + 1 hidden = 11 bits):
    11 × 0.301 ≈ 3.3 → ~3 digits
  • Single precision (23 + 1 = 24 bits):
    24 × 0.301 ≈ 7.2 → ~7 digits
  • Double precision (52 + 1 = 53 bits):
    53 × 0.301 ≈ 15.95 → ~16 digits
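
The same estimates drop out of a few lines of standard-library Python (the bit counts below include the hidden bit):

```python
# Estimate decimal digits of precision from significant (mantissa + hidden) bits.
import math

for name, significant_bits in [("half", 11), ("single", 24), ("double", 53)]:
    digits = significant_bits * math.log10(2)
    print(f"{name:>6}: {significant_bits} bits -> ~{digits:.1f} decimal digits")

#   half: 11 bits -> ~3.3 decimal digits
# single: 24 bits -> ~7.2 decimal digits
# double: 53 bits -> ~16.0 decimal digits
```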

So if you store π (3.14159…):

  • Half → 3.14
  • Single → 3.141593
  • Double → 3.141592653589793
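
A quick check with NumPy (if it is installed) shows the same effect; note that NumPy prints the shortest string that round-trips, so float32 shows 3.1415927 rather than the rounded 3.141593:

```python
import numpy as np

print(np.float16(np.pi))   # 3.14               (~3 reliable digits)
print(np.float32(np.pi))   # 3.1415927          (~7 reliable digits)
print(np.float64(np.pi))   # 3.141592653589793  (~16 reliable digits)
```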

Why “Half,” “Single,” and “Double”?

Historically, 32-bit was the standard float size on most hardware — called single precision. Then:

  • Half precision (16-bit) is literally half of single.
  • Double precision (64-bit) is twice single.

Why Does It Look So Weird?

This scheme was designed in the early 1980s by the IEEE 754 committee, with mathematician William Kahan (“the father of floating point”) as its chief architect. Their goals:

  1. Wide range & predictable accuracy
  2. Efficient hardware implementation
  3. Consistency across platforms

Before this, every vendor had their own floating-point rules, which made numerical code unpredictable. IEEE-754 unified the world.


Takeaways

  • Floating point = scientific notation in binary.
  • Mantissa = significant digits, controls accuracy.
  • Exponent = scale, controls size range.
  • Half, single, double differ only in how many bits they allocate.

👉 That’s why:

  • Half precision = ~3 decimal digits
  • Single precision = ~7 decimal digits
  • Double precision = ~16 decimal digits

Floating-point formats may look strange at the bit level, but they’re nothing more than binary scientific notation with fixed storage budgets. Once you understand the split between exponent (scale) and mantissa (precision), the whole system makes sense: half precision gives you about 3 reliable digits, single precision about 7, and double precision about 16. Whether you’re tuning a deep learning model, running a simulation, or just curious why your decimal “5” becomes 1.01₂ × 2², this knowledge helps you reason about when numbers are trustworthy and when they’re not. In short: floating point isn’t magic — it’s a carefully engineered compromise.

By Anjing
