IEEE 754 Standards - "float" and "double"

This section describes how the IEEE 754 standards represent floating-point values of 'float' and 'double' in binary format.

In order to answer some of the questions raised in the previous sections, let's take a close look at how real numbers are represented in a computer system. In C#, the IEEE 754 single-precision and double-precision standards are used to represent "float" and "double" type values respectively. We will discuss the storage representation of "decimal" later in this book.

Since the IEEE 754 standards are widely used in the computer industry, there are a lot of publications available about them. In this section, I will explain only some basic rules of the standards.

Rule 1: The binary representation is divided into 3 components, with a different number of bits assigned to each component:

                   Sign   Exponent   Fraction   Total
Single-Precision      1          8         23      32
Double-Precision      1         11         52      64
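
To make the 3 components concrete, here is a minimal sketch that splits the 32-bit pattern of a "float" value with shifts and masks. The class and method names (FloatFields, ToBits, SignBit, ExponentBits, FractionBits) are my own hypothetical helpers, not part of the .NET library. Only BitConverter.GetBytes() and BitConverter.ToInt32() are library calls.

```csharp
using System;

public static class FloatFields
{
    // Hypothetical helper: copies the 32 bits of a float into an int.
    public static int ToBits(float value)
    {
        return BitConverter.ToInt32(BitConverter.GetBytes(value), 0);
    }

    // Highest bit: the 1-bit sign component.
    public static int SignBit(float value)
    {
        return (ToBits(value) >> 31) & 0x1;
    }

    // Next 8 bits: the exponent component.
    public static int ExponentBits(float value)
    {
        return (ToBits(value) >> 23) & 0xFF;
    }

    // Lowest 23 bits: the fraction component.
    public static int FractionBits(float value)
    {
        return ToBits(value) & 0x7FFFFF;
    }

    public static void Main()
    {
        float x = -0.75f;
        Console.WriteLine(SignBit(x));                                            // 1
        Console.WriteLine(Convert.ToString(ExponentBits(x), 2).PadLeft(8, '0'));  // 01111110
        Console.WriteLine(Convert.ToString(FractionBits(x), 2).PadLeft(23, '0')); // 10000000000000000000000
    }
}
```

The same approach should work for "double" with BitConverter.DoubleToInt64Bits(), an 11-bit exponent mask and a 52-bit fraction mask.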

Since single-precision and double-precision differ only in the number of bits used in the exponent and fraction components, we will use only single-precision in the discussion from now on.

Rule 2: The single bit in the sign component represents a sign value (s) through the binary sign convention:

          Sign     (s)
   Bit Pattern   Value
             0       1
             1      -1     

Rule 3: The 8-bit combination in the exponent component represents an exponent value (e) through the binary integral number convention:

      Exponent     (e)    
   Bit Pattern   Value    
      00000001       1 <-- 2^0 
      00000010       2 <-- 2^1
      00000011       3 <-- 2^1 + 2^0
           ...     ...
      11111110     254

Note that the bit patterns of all 0s and all 1s are not listed here, because they are reserved for special purposes, which will be discussed later.

Rule 4: The 23-bit combination in the fraction component represents a fraction value through the binary fractional number convention:

                    Fraction   (f)
                 Bit Pattern   Value
   0000000 00000000 00000000   .0
   1000000 00000000 00000000   .5   <-- 2^(-1)
   0100000 00000000 00000000   .25  <-- 2^(-2)
   1100000 00000000 00000000   .75  <-- 2^(-1) + 2^(-2)
                         ...   ...  
   1111111 11111111 11111111   .?
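
The binary fractional number convention can be sketched as a loop that adds 2^(-1), 2^(-2), ... for each 1 bit, reading the 23-bit pattern from left to right. FractionValue and FromBits are hypothetical names I picked for this illustration. The fraction bits are passed in as an int whose lowest 23 bits hold the pattern.

```csharp
using System;

public static class FractionValue
{
    // Hypothetical helper: converts a 23-bit fraction pattern to its value (f).
    public static double FromBits(int fractionBits)
    {
        double f = 0.0;
        double weight = 1.0;
        for (int i = 1; i <= 23; i++)
        {
            weight /= 2.0;                            // weight is now 2^(-i)
            int bit = (fractionBits >> (23 - i)) & 1; // i-th bit from the left
            if (bit == 1) f += weight;
        }
        return f;
    }

    public static void Main()
    {
        Console.WriteLine(FromBits(0x400000) == 0.5);  // 1000000 00000000 00000000 -> True
        Console.WriteLine(FromBits(0x200000) == 0.25); // 0100000 00000000 00000000 -> True
        Console.WriteLine(FromBits(0x600000) == 0.75); // 1100000 00000000 00000000 -> True
        Console.WriteLine(FromBits(0x7FFFFF));         // all 1s: 1 - 2^(-23), just below 1
    }
}
```

With all 23 bits set, the sum is 1 - 2^(-23), which is the largest value the fraction component can represent.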

Rule 5: Putting all 3 components together, the single-precision 32-bit pattern represents a real number (r) through the following expression:

   r = s * 1.f * 2^(e-127)

Note that a number expressed in this form is called a normalized number, because its binary point has been normalized to sit right after the leading 1. Also note that the leading 1 is not stored anywhere.
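
As a rough check of Rule 5, the sketch below rebuilds r = s * 1.f * 2^(e-127) from the bit pattern of a "float" value and compares it with the value itself. NormalizedDecoder, PowerOfTwo and Decode are hypothetical names used only for this illustration. The power of 2 is computed with a simple loop so that every step stays exact in "double".

```csharp
using System;

public static class NormalizedDecoder
{
    // Hypothetical helper: returns 2^n by repeated doubling or halving.
    public static double PowerOfTwo(int n)
    {
        double result = 1.0;
        for (; n > 0; n--) result *= 2.0;
        for (; n < 0; n++) result /= 2.0;
        return result;
    }

    // Applies r = s * 1.f * 2^(e-127) to the bit pattern of a normalized float.
    public static double Decode(float value)
    {
        int bits = BitConverter.ToInt32(BitConverter.GetBytes(value), 0);
        int s = ((bits >> 31) & 1) == 0 ? 1 : -1;         // Rule 2
        int e = (bits >> 23) & 0xFF;                      // Rule 3
        double f = (bits & 0x7FFFFF) / (double)(1 << 23); // Rule 4
        if (e == 0 || e == 255)
            throw new ArgumentException("Not a normalized number.");
        return s * (1.0 + f) * PowerOfTwo(e - 127);       // 1.f means 1 + f
    }

    public static void Main()
    {
        Console.WriteLine(Decode(6.5f) == 6.5);              // True
        Console.WriteLine(Decode(-0.1f) == (double)(-0.1f)); // True
    }
}
```

Note that -0.1f is compared with its own "float" value converted to "double", not with the decimal number -0.1, which has no exact binary representation.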

Now let's look at the range of positive normalized numbers:

                                     s   e  f    r
0 00000001 0000000 00000000 00000000 1   1 .0 -> 1 * 1.0 * 2^(1-127)
0 00000001 1000000 00000000 00000000 1   1 .5 -> 1 * 1.5 * 2^(1-127)
...
0 01111111 0000000 00000000 00000000 1 127 .0 -> 1 * 1.0 * 2^(127-127)
0 01111111 1000000 00000000 00000000 1 127 .5 -> 1 * 1.5 * 2^(127-127)
...
0 11111110 0000000 00000000 00000000 1 254 .0 -> 1 * 1.0 * 2^(254-127)
0 11111110 1000000 00000000 00000000 1 254 .5 -> 1 * 1.5 * 2^(254-127)
...
0 11111110 1111111 11111111 11111111 1 254 .? -> 1 * 1.? * 2^(254-127)
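
The two ends of this range can be reproduced by building "float" values directly from the first and last bit patterns in the list. This is only a sketch: NormalizedRange and FromBits are hypothetical names, and the hexadecimal constants are just the listed bit patterns written in base 16.

```csharp
using System;

public static class NormalizedRange
{
    // Hypothetical helper: reinterprets a 32-bit pattern as a float.
    public static float FromBits(int bits)
    {
        return BitConverter.ToSingle(BitConverter.GetBytes(bits), 0);
    }

    public static void Main()
    {
        float smallest = FromBits(0x00800000); // 0 00000001 0000000 00000000 00000000
        float one      = FromBits(0x3F800000); // 0 01111111 0000000 00000000 00000000
        float largest  = FromBits(0x7F7FFFFF); // 0 11111110 1111111 11111111 11111111

        Console.WriteLine(smallest);                  // 1.0 * 2^(1-127)
        Console.WriteLine(one == 1.0f);               // True
        Console.WriteLine(largest == float.MaxValue); // True
    }
}
```

So the largest positive normalized number is exactly what C# exposes as float.MaxValue.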

Rule 6: When the exponent component stores all 0s, the 3 components are put together to represent a denormalized number through the following expression:

   r = s * 0.f * 2^(1-127)

This expression is designed to extend the range of the normalized numbers a little closer to 0:

                                     s e  f     r
0 00000000 0000000 00000000 00000000 1 0 .0 -> 1 * 0.0 * 2^(1-127)
0 00000000 1000000 00000000 00000000 1 0 .5 -> 1 * 0.5 * 2^(1-127)
...
0 00000000 1111111 11111111 11111111 1 0 .? -> 1 * 0.? * 2^(1-127)
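
Rule 6 can be checked the same way as Rule 5. The sketch below applies r = s * 0.f * 2^(1-127) to a few bit patterns whose exponent component is all 0s and compares the result with the "float" value built from the same bits. DenormalizedDecoder, FromBits and Decode are hypothetical names chosen for this illustration.

```csharp
using System;

public static class DenormalizedDecoder
{
    // Hypothetical helper: reinterprets a 32-bit pattern as a float.
    public static float FromBits(int bits)
    {
        return BitConverter.ToSingle(BitConverter.GetBytes(bits), 0);
    }

    // Applies r = s * 0.f * 2^(1-127); valid only when the exponent bits are all 0s.
    public static double Decode(int bits)
    {
        if (((bits >> 23) & 0xFF) != 0)
            throw new ArgumentException("Exponent component is not all 0s.");
        int s = ((bits >> 31) & 1) == 0 ? 1 : -1;
        double f = (bits & 0x7FFFFF) / (double)(1 << 23);
        double scale = 1.0;
        for (int i = 0; i < 126; i++) scale /= 2.0; // 2^(1-127)
        return s * f * scale;
    }

    public static void Main()
    {
        int[] patterns = { 0x00000001, 0x00400000, 0x007FFFFF };
        foreach (int bits in patterns)
        {
            Console.WriteLine(Decode(bits) == (double)FromBits(bits)); // True
        }
        // The smallest positive denormalized number is what C# calls float.Epsilon.
        Console.WriteLine(FromBits(0x00000001) == float.Epsilon);      // True
    }
}
```

The largest denormalized number, with a fraction of all 1s, is still smaller than the smallest normalized number, 1.0 * 2^(1-127), so the two ranges join without overlapping.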

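Rule 7: When the exponent component stores all 1s, the bit pattern does not represent an ordinary number. It is reserved for the special values used by arithmetic operations: infinity when the fraction component is all 0s, and NaN (Not a Number) when it is not.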
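
The text above does not list the reserved values, but in IEEE 754 an exponent of all 1s with a fraction of all 0s encodes infinity, and any other fraction encodes NaN (Not a Number). Here is a small sketch that builds such patterns and tests them with the standard float.IsPositiveInfinity(), float.IsNegativeInfinity() and float.IsNaN() methods. SpecialValues and FromBits are hypothetical names.

```csharp
using System;

public static class SpecialValues
{
    // Hypothetical helper: reinterprets a 32-bit pattern as a float.
    public static float FromBits(uint bits)
    {
        return BitConverter.ToSingle(BitConverter.GetBytes(bits), 0);
    }

    public static void Main()
    {
        float posInf = FromBits(0x7F800000u); // 0 11111111 0000000 00000000 00000000
        float negInf = FromBits(0xFF800000u); // 1 11111111 0000000 00000000 00000000
        float nan    = FromBits(0x7FC00000u); // 0 11111111 1000000 00000000 00000000

        Console.WriteLine(float.IsPositiveInfinity(posInf)); // True
        Console.WriteLine(float.IsNegativeInfinity(negInf)); // True
        Console.WriteLine(float.IsNaN(nan));                 // True
    }
}
```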

For negative values, everything is identical to the positive numbers, except for the sign bit.

The constant 127 used in the exponent part of the expression is called the bias. For the double-precision standard, the bias is 1023.
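
One simple way to see the bias is to look at the value 1.0, which is 1.0 * 2^0. Its stored exponent value (e) must then be equal to the bias. The sketch below reads the exponent component of 1.0 in both precisions. BiasCheck, FloatExponentBits and DoubleExponentBits are hypothetical names. BitConverter.DoubleToInt64Bits() is a standard library call.

```csharp
using System;

public static class BiasCheck
{
    // 8-bit exponent component of a float.
    public static int FloatExponentBits(float value)
    {
        int bits = BitConverter.ToInt32(BitConverter.GetBytes(value), 0);
        return (bits >> 23) & 0xFF;
    }

    // 11-bit exponent component of a double.
    public static int DoubleExponentBits(double value)
    {
        long bits = BitConverter.DoubleToInt64Bits(value);
        return (int)((bits >> 52) & 0x7FF);
    }

    public static void Main()
    {
        Console.WriteLine(FloatExponentBits(1.0f)); // 127
        Console.WriteLine(DoubleExponentBits(1.0)); // 1023
    }
}
```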

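The IEEE 754 standards are also called binary floating-point number standards, because both the fraction value and the exponent scale are expressed in base 2, as the expressions in Rule 5 and Rule 6 show.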

Last update: 2010.
