If you’ve ever written code in Python, CUDA, or TensorFlow, you’ve probably seen terms like float16, float32, or float64. They map directly to the IEEE-754 floating-point standard:

  • Half precision (16 bits)
  • Single precision (32 bits)
  • Double precision (64 bits)
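To make the mapping concrete, here is a quick sketch (assuming NumPy is installed) that asks each dtype for its total bit width and its stored mantissa bits:

```python
# Inspect the IEEE-754 formats behind NumPy's float dtypes.
# np.finfo reports the total width (bits) and the stored mantissa bits (nmant).
import numpy as np

for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    print(f"{dtype.__name__}: {info.bits} bits total, {info.nmant} mantissa bits")

# float16: 16 bits total, 10 mantissa bits
# float32: 32 bits total, 23 mantissa bits
# float64: 64 bits total, 52 mantissa bits
```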

But what do these formats really mean? Why does 16-bit give you only “3 digits of accuracy,” and why is 32-bit called single precision while 64-bit is double? Let’s dive in — and also tackle the common confusions people face when they first see binary scientific notation.


The Core Idea: Scientific Notation in Binary

In decimal, we write large or small numbers in scientific notation: 6.022 × 10²³

  • 6.022 → significant digits
  • 10²³ → scaling factor

Floating point does exactly the same thing in binary:

  • Sign → 1 bit for +/−
  • Exponent → controls scale (powers of 2)
  • Mantissa (fraction/significand) → stores the significant digits
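
As a rough sketch of how those three fields combine into a value (normalized numbers only; the helper name and its single-precision defaults, 23 mantissa bits and an exponent bias of 127, are just for illustration):

```python
# Combine sign, exponent, and mantissa fields into a value.
# Normalized numbers only: the leading 1 of the significand is implicit (the "hidden bit").
def assemble(sign: int, exponent_field: int, mantissa_field: int,
             mantissa_bits: int = 23, bias: int = 127) -> float:
    significand = 1 + mantissa_field / 2**mantissa_bits   # 1.xxxx... in binary
    return (-1) ** sign * significand * 2 ** (exponent_field - bias)

print(assemble(0, 127, 0))  # 1.0: positive sign, significand 1.0, scale 2**(127 - 127) = 1
```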

Example: Representing the Number 5

  • Decimal 5 = binary 101₂.

Now we want to write this in binary scientific notation. Move the “binary point” until only one digit remains before it:

101₂  =  1.01₂ × 2²

At first glance, someone might think:

“But 1.01 × 4 = 4.04, not 5!”

That’s the classic confusion.

Here’s the fix:

  • 1.01₂ is not decimal 1.01.
  • It means 1 + 0/2 + 1/4 = 1.25 in decimal.
  • So: 1.25 × 2² = 1.25 × 4 = 5.

✅ Exactly right.
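
You can confirm this with plain Python, and (using only the standard library) peek at how a 32-bit float actually stores 5:

```python
import struct

# 1.01₂ is 1 + 0/2 + 1/4 = 1.25 in decimal, and scaling by 2² gives back 5.
print((1 + 0/2 + 1/4) * 2**2)          # 5.0
print(int("101", 2))                   # 5, the original binary integer

# The stored bit fields of 5.0 as a 32-bit (single-precision) float:
bits = struct.unpack(">I", struct.pack(">f", 5.0))[0]
print(bits >> 31)                      # sign: 0
print(((bits >> 23) & 0xFF) - 127)     # unbiased exponent: 2
print(1 + (bits & 0x7FFFFF) / 2**23)   # significand: 1.25
```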


Binary Fractions Cheat Sheet

To avoid tripping over this:

Binary fraction   Decimal value
0.1₂              0.5
0.01₂             0.25
0.11₂             0.75
1.01₂             1.25

So whenever you see a binary point, think in terms of powers of 2 (½, ¼, ⅛…), not powers of 10.
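
If you want to check entries like these yourself, here is a small illustrative helper (not a library function) that evaluates a binary fraction string by summing powers of 2:

```python
# Evaluate a binary fraction string such as "1.01" by summing powers of 2.
def binary_fraction_to_decimal(s: str) -> float:
    int_part, _, frac_part = s.partition(".")
    value = int(int_part, 2) if int_part else 0
    for i, bit in enumerate(frac_part, start=1):
        value += int(bit) * 2**-i        # each place is 1/2, 1/4, 1/8, ...
    return value

for s in ["0.1", "0.01", "0.11", "1.01"]:
    print(s, "->", binary_fraction_to_decimal(s))   # 0.5, 0.25, 0.75, 1.25
```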


Precision: Why Half = ~3 Digits, Single = ~7, Double = ~16

The mantissa is where precision lives. Each mantissa bit lets you store one more “digit” in binary. To estimate decimal precision, we convert bits → digits:

decimal digits ≈ mantissa bits × log₁₀(2)

Since log₁₀(2) ≈ 0.301:

  • Half precision (10 mantissa bits + 1 hidden = 11 bits):
    11 × 0.301 ≈ 3.3 → ~3 digits
  • Single precision (23 + 1 = 24 bits):
    24 × 0.301 ≈ 7.2 → ~7 digits
  • Double precision (52 + 1 = 53 bits):
    53 × 0.301 ≈ 15.95 → ~16 digits
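
The same estimates drop out of a few lines of standard-library Python (the bit counts below include the hidden bit):

```python
# Estimate decimal digits of precision from significant (mantissa + hidden) bits.
import math

for name, significant_bits in [("half", 11), ("single", 24), ("double", 53)]:
    digits = significant_bits * math.log10(2)
    print(f"{name:>6}: {significant_bits} bits -> ~{digits:.1f} decimal digits")

#   half: 11 bits -> ~3.3 decimal digits
# single: 24 bits -> ~7.2 decimal digits
# double: 53 bits -> ~16.0 decimal digits
```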

So if you store π (3.14159…):

  • Half → 3.14
  • Single → 3.141593
  • Double → 3.141592653589793
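
A quick check with NumPy (if it is installed) shows the same effect; note that NumPy prints the shortest string that round-trips, so float32 shows 3.1415927 rather than the rounded 3.141593:

```python
import numpy as np

print(np.float16(np.pi))   # 3.14               (~3 reliable digits)
print(np.float32(np.pi))   # 3.1415927          (~7 reliable digits)
print(np.float64(np.pi))   # 3.141592653589793  (~16 reliable digits)
```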

Why “Half,” “Single,” and “Double”?

Historically, 32-bit was the standard float size on most hardware — called single precision. Then:

  • Half precision (16-bit) is literally half of single.
  • Double precision (64-bit) is twice single.

Why Does It Look So Weird?

This scheme was designed in the early 1980s by the IEEE 754 committee, with mathematician William Kahan (“the father of floating point”) as its chief architect. Their goals:

  1. Wide range & predictable accuracy
  2. Efficient hardware implementation
  3. Consistency across platforms

Before this, every vendor had their own floating-point rules, which made numerical code unpredictable. IEEE-754 unified the world.


Takeaways

  • Floating point = scientific notation in binary.
  • Mantissa = significant digits, controls accuracy.
  • Exponent = scale, controls size range.
  • Half, single, double differ only in how many bits they allocate.

👉 That’s why:

  • Half precision = ~3 decimal digits
  • Single precision = ~7 decimal digits
  • Double precision = ~16 decimal digits

Floating-point formats may look strange at the bit level, but they’re nothing more than binary scientific notation with fixed storage budgets. Once you understand the split between exponent (scale) and mantissa (precision), the whole system makes sense: half precision gives you about 3 reliable digits, single precision about 7, and double precision about 16. Whether you’re tuning a deep learning model, running a simulation, or just curious why your decimal “5” becomes 1.01₂ × 2², this knowledge helps you reason about when numbers are trustworthy and when they’re not. In short: floating point isn’t magic — it’s a carefully engineered compromise.

By Anjing
