Floating point caveats

Every programmer knows (or should know) the drill: Comparing floating point values which have been calculated in different ways (eg. comparing the result of evaluating a mathematical expression with a floating point literal) is almost always slightly imprecise. In other words, even though mathematically the two expressions should be the same, due to limited precision of floating point numbers and the rounding errors this introduces, the two values may in fact differ in the least-significant bits. For this reason comparisons should usually be done using an epsilon value, like:

if ( abs(var1 - var2) < some_epsilon ) ... 
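
As a concrete illustration, here is a minimal C++ sketch (the values and the epsilon are just arbitrary examples) of a case where a direct equality test fails even though the two expressions are mathematically identical, while the epsilon comparison succeeds:

#include <cmath>
#include <iostream>

int main()
{
    double var1 = 0.1 + 0.2;   // mathematically this is exactly 0.3
    double var2 = 0.3;
    double some_epsilon = 1e-12;

    std::cout << (var1 == var2) << "\n";                            // typically prints 0
    std::cout << (std::fabs(var1 - var2) < some_epsilon) << "\n";   // prints 1
}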

This is where most literature and common knowledge stops, though. This is, however, simplistic and naive. The problems with floating point accuracy are much more complex and programmers should know them, especially if they intend to write programs which use floating point variables extensively and where proper behavior even in the face of limited accuracy is essential.

Compare most significant digits, not absolute value

What is the proper value of epsilon when doing such comparisons? The answer to this is actually a bit complicated and reveals a huge problem in the simplistic advice above.

The major problem with "abs(var1 - var2) < some_epsilon" is that it doesn't compare the most significant digits of the two variables, which is what should be compared.

For example, let's assume that you use an epsilon value of 10^-10 in that comparison. If the values of the variables are somewhere around eg. 5, then you will be comparing about 10 most significant digits of both values.

However, if the variables are somewhere around eg. 5000000, you will now be comparing about 16 most significant digits of both values. In other words, now the threshold at which the values are considered different is considerably stricter.

Likewise if the variables were somewhere around 0.000005, you will now be comparing only about 4 most significant digits, which means that the comparison is now significantly more lenient.

Floating point rounding errors always have a magnitude relative to the number of bits in the mantissa of the floating point values. Due to how floating point numbers work, this means in practice that rounding errors always have a magnitude relative to the number of significant digits in the value (not relative to the absolute value). This doesn't change. For this reason, to get a reliable comparison you must compare a certain (fixed) amount of significant digits.

The traditional "abs(var1 - var2) < some_epsilon" comparison doesn't achieve this because it's comparing a dynamic amount of significant digits: The amount of significant digits being compared depends on the values being compared. This means that if the values being compared are very large, the comparison will be a lot stricter (ie. allows for less rounding errors) than if the values are closer to zero, in which case the comparison is a lot more lenient (ie. allows for more rounding errors).

This is often a bad thing, especially if the range of values being used in the program is large.
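
To make this concrete, here is a small sketch (the magnitudes and the simulated relative rounding errors are just illustrative assumptions) showing how the very same epsilon behaves completely differently depending on the magnitude of the values being compared:

#include <cmath>
#include <iostream>

// The naive comparison with a fixed absolute epsilon.
bool naiveEqual(double a, double b) { return std::fabs(a - b) < 1e-10; }

int main()
{
    // Both values carry a simulated relative rounding error of one part in 10^12:
    std::cout << naiveEqual(5.0, 5.0 * (1.0 + 1e-12)) << "\n";      // 1: accepted
    std::cout << naiveEqual(5.0e6, 5.0e6 * (1.0 + 1e-12)) << "\n";  // 0: rejected, the comparison became much stricter

    // Near zero the comparison becomes far too lenient instead:
    std::cout << naiveEqual(5.0e-6, 5.0e-6 * (1.0 + 1e-6)) << "\n"; // 1: accepted despite a million times larger relative error
}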

What you want is to compare a certain amount of most-significant digits in both variables. This comparison is slightly more complicated (and obviously slower, which might be important in extremely time-critical situations where millions of comparisons need to be performed in a very short amount of time), but doable using a relatively simple mathematical formula like this (in C-style pseudocode):

// "Shift" the decimal point of both values so that the most significant
// digit of value1 lands just before the decimal point, then compare.
scale = pow(10.0, floor(log10(abs(value1))));
sv1 = value1 / scale;
sv2 = value2 / scale;
if(abs(sv2 - sv1) < epsilon) ...

What the above code achieves is that if epsilon is, for example, 10^-10, then about the 10 most significant digits of the values will always be compared regardless of the magnitude of those values. The idea is to "shift" the decimal point so that it's located at the most significant digit of the value (which is what dividing by scale is doing), and only then make the comparison.

(Note that there's a possibility of division by zero in the code above, if scale is zero, or extremely close to it. To avoid this possibility it may be necessary to make a pre-check for value1 being extremely close to zero and then check if value2 is also extremely close to zero, in which case they can be considered equal.)
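
Putting the scaling and the zero check together, a complete comparison function could look something like this sketch (the function name and the near-zero threshold are just illustrative choices):

#include <cmath>

// Compares roughly the most significant digits of the two values, as
// described above. Values extremely close to zero are handled as a
// special case to avoid dividing by a (near-)zero scale.
bool compareScaled(double value1, double value2, double epsilon)
{
    const double nearZero = 1e-300;   // illustrative threshold
    if(std::fabs(value1) < nearZero)
        return std::fabs(value2) < nearZero;

    const double scale = std::pow(10.0, std::floor(std::log10(std::fabs(value1))));
    return std::fabs(value2 / scale - value1 / scale) < epsilon;
}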

An alternative method, which is even more accurate and more efficient, is to do it like this:

if(abs((value1 - value2) / value1) < epsilon) ...

(Once again note that the case where value1 is zero has to be taken into account before doing that comparison, in order to avoid a division by zero.)
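
As a sketch, with the zero case handled up front in the same way (again the function name and the near-zero threshold are just illustrative):

#include <cmath>

bool compareRelative(double value1, double value2, double epsilon)
{
    const double nearZero = 1e-300;   // illustrative threshold
    if(std::fabs(value1) < nearZero)
        return std::fabs(value2) < nearZero;

    return std::fabs((value1 - value2) / value1) < epsilon;
}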

So returning to the original question: What is a proper value for epsilon?

As said, the rounding errors are always relative to the amount of mantissa bits. Hence the decision of a good epsilon depends on how many mantissa bits there are in the floating point type being used.

For example a double-precision floating point number has 52 mantissa bits. This is equivalent to approximately 15 decimal digits of accuracy (that is, floor(log10(2^52))). A good epsilon compares slightly less than this amount of significant digits. Hence a good epsilon value could be, for example, 10^-12 or maybe even 10^-13.
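
The same reasoning can be written out in code; here is a small sketch using std::numeric_limits (leaving roughly three digits of slack for accumulated rounding errors is just an example choice):

#include <cmath>
#include <limits>
#include <iostream>

int main()
{
    // A double has 53 significant bits, of which 52 are explicitly stored in the mantissa.
    const int mantissaBits = std::numeric_limits<double>::digits - 1;                        // 52
    const int decimalDigits = static_cast<int>(std::floor(mantissaBits * std::log10(2.0)));  // 15
    const double epsilon = std::pow(10.0, -(decimalDigits - 3));                             // 1e-12

    std::cout << mantissaBits << " " << decimalDigits << " " << epsilon << "\n";
}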

The actual value may need to be fine-tuned depending on the actual operations being performed. Some floating point functions will produce more rounding errors than others, and the amount of operations performed can cause rounding errors to stack up. A good value has to always be found through extensive testing of the program.

The problem with values close to zero

But of course it's not that simple. There are still problematic situations which are not solved by the above comparison.

Let's assume that you are comparing two different ways of calculating the cosine of a value.

You start testing with 0 degrees and comparing that the two methods are returning the same value, within acceptable range, and you slowly increase the degrees and keep comparing. For example at 60 degrees one method might return 0.5 and the other 0.5000000000001, which is still completely within acceptable accuracy range (12 decimal digits of accuracy). If you are using an epsilon of eg. 10^-10, they will nicely compare equal.

However, when you start approaching 90 degrees, bad things start to happen: the closer you get to 90 degrees, the closer the return value gets to zero.

Uh oh! We have a problem here. For example when one of the methods returns the value 0.00001, the other might return a value like 0.0000100000001. This is still completely within an acceptable range because the second method is still returning the correct value with 12 decimal digits of accuracy...

But if we followed the advice of comparing significant digits, not absolute values, we are now making a much stricter comparison: now the relative difference appears in the 9th most significant digit, so our epsilon of 10^-10 is going to report that the values differ too much.

However, in this case they don't differ too much. This is simply an artifact which happens with values that are very close to zero, when testing a function which returns values in the range [-1.0, 1.0]. A difference in the 13th decimal place doesn't matter anywhere else than with values which are around zero; only there does the relative comparison wrongly kick in.
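
The effect can be seen with a small sketch; here the second "method" is simply simulated by adding a fixed error of 10^-13 to the exact result (roughly the level of absolute error the 60-degree example above exhibited):

#include <cmath>
#include <iostream>

int main()
{
    const double pi = 3.14159265358979323846;
    const double degrees = 89.9994;                        // illustrative angle very close to 90
    const double exact = std::cos(degrees * pi / 180.0);   // roughly 1e-5
    const double approx = exact + 1e-13;                   // simulated rounding error of a second method

    std::cout << std::fabs(exact - approx) << "\n";                      // absolute difference: 1e-13, perfectly acceptable
    std::cout << std::fabs((exact - approx) / exact) << "\n";            // relative difference: roughly 1e-8
    std::cout << (std::fabs((exact - approx) / exact) < 1e-10) << "\n";  // 0: wrongly reported as differing too much
}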

So now what? Back to "abs(var1 - var2) < some_epsilon"?

Well, once again, it's not that simple: Imagine that instead of "cos(x)" you are testing "10000*cos(x)". Using the simple comparison with a fixed epsilon is, once again, going to be much stricter on the latter function than the former one.

The only way to solve this problem is to use the proper algorithm and the proper epsilon on a case-by-case basis. Unfortunately there is no one single solution which will work properly in all possible situations.

As you can see, floating point comparison is a much more problematic issue than most programmers are aware of.

Floating point variables need to be initialized properly

How many times have you seen eg. C/C++ code which carelessly does something like this:

long double value = 0.1;

It's actually surprisingly common. So what's the problem here?

The problem is that long double is often a larger type than double. In most systems double is a double-precision floating point type, which has a mantissa of 52 bits, while long double (in Intel-compatible processors) is an extended-precision floating point type with a mantissa of at least 64 bits.

In the above code "0.1" is a literal of type double, while the variable being initialized is of type long double.

Why is this a problem? The problem is that the variable is not being initialized with the value of 0.1 with as much precision as a long double would permit. More precisely, only 52 bits of the mantissa will be used to initialize the variable, instead of the 64 bits that a long double would support (in Intel systems). This means that the variable will already have 12 bits (ie. about 4 decimal digits) of rounding error right from the start, way more than would be expected for a floating point value of that size.

This is not a problem which happens only when initializing floating point variables. It happens anywhere where a floating point literal is used. For example in an expression like:

a = b * 0.1 + c;

there will be a considerable rounding error if a, b and c are of type long double.

Yes, this does indeed happen. The compiler will not "guess" what the programmer really wanted, but instead it will literally take the double type value 0.1, and initialize the long double type variable with it, thus effectively "dropping" the 12 least significant bits of the mantissa, introducing a significant rounding error right from the start.

Naturally the proper way of making the above is:

long double value = 0.1L;

and:

a = b * 0.1L + c;

Here we are telling the compiler that the literal is of type long double, and hence the literal will be interpreted with as many mantissa bits as that type allows.

(Some programmers have the misconception that "(long double)(0.1)" will do the same thing as "0.1L" above, but that's not the case. It's still a literal of type double being converted into a long double, with the ensuing rounding error.)
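
A minimal demonstration of all of the above (assuming a platform such as x86, where long double is the 80-bit extended type; on platforms where long double is the same size as double all three lines print the same value):

#include <cstdio>

int main()
{
    long double a = 0.1;                  // double literal converted to long double
    long double b = (long double)(0.1);   // same thing: the cast doesn't restore the lost bits
    long double c = 0.1L;                 // genuine long double literal

    std::printf("%.25Lf\n", a);   // rounding error visible from about the 17th significant digit
    std::printf("%.25Lf\n", b);   // prints exactly the same value as the line above
    std::printf("%.25Lf\n", c);   // accurate to about 19-20 significant digits
}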

The same is true if we use eg. the C standard library functions atof() or strtod() if the results of these functions are then assigned to a long double variable: The strings will be interpreted as double-precision, rather than as long double precision, and hence the 12 least significant bits will be lost.

This was a real problem before the C99 standard was ratified, which introduced the new strtold() function that can be used to parse a value as a long double. (This isn't really a problem in C++ if C++ streams are used. strtold() itself is not part of the current C++ standard library; it will only be introduced with the next C++ standard.)
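
For illustration, here is a sketch comparing the two (assuming the C library provides the C99 strtold() and that long double is wider than double):

#include <stdio.h>
#include <stdlib.h>

int main()
{
    long double a = strtod("0.1", 0);    // parsed with double precision only
    long double b = strtold("0.1", 0);   // parsed with full long double precision (C99)
    printf("%.25Lf\n", a);
    printf("%.25Lf\n", b);
}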

This problem isn't very grave with long doubles, as "only" 12 least significant bits of accuracy are being lost in the initialization. It does become especially grave if a multiple-precision floating point library is being used, where the mantissa could have an enormous amount of bits (such as eg. 1000 bits or more). Initializing such a huge multiple-precision floating point value with a literal of type double will introduce a humongous rounding error right from the start which can skew results very significantly (even to the point of rendering the program outright erroneous and its results invalid).

Thus when using such libraries it's of utmost importance to always initialize values in such way that literals will be interpreted with the maximum possible number of mantissa bits. (Usually these libraries provide a function to interpret a value from a string. This is usually what should be used, unless you know that a literal you are using doesn't need that many bits of accuracy because it can be accurately represented with less. An example of such value would be 1.0.)
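
As an example of what this looks like in practice, here is a sketch using the GMP library's mpf interface (assuming GMP is available; other multiple-precision libraries provide equivalent string-based initialization functions):

#include <gmp.h>
#include <stdio.h>

int main()
{
    mpf_t bad, good;
    mpf_init2(bad, 1000);    // 1000-bit mantissa
    mpf_init2(good, 1000);

    mpf_set_d(bad, 0.1);             // only the 52 mantissa bits of the double literal are meaningful
    mpf_set_str(good, "0.1", 10);    // the string is interpreted with the full 1000-bit precision

    mpf_out_str(stdout, 10, 40, bad);  printf("\n");
    mpf_out_str(stdout, 10, 40, good); printf("\n");

    mpf_clear(bad);
    mpf_clear(good);
}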

Like the epsilon thing above, this is also something you will seldom see in books and tutorials. The vast majority of programmers out there are completely unaware of this problem. In the worst case scenarios you may even see code like this:

Float_t value = 0.1f;

If Float_t is eg. a double-precision floating point type, the 29 least-significant bits of the mantissa (about 9 decimal digits) will be lost, causing a huge rounding error. (The idea of doing it like that is to avoid a warning from some compilers if Float_t happens to be a smaller type than the literal it's being initialized with.)
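
The effect is easy to see (assuming here that Float_t is currently double; the exact digits depend on the platform's floating point formats):

#include <cstdio>

typedef double Float_t;   // illustrative: pretend the configurable type happens to be double

int main()
{
    Float_t careless = 0.1f;   // float literal: only the 23 stored mantissa bits of a float survive
    Float_t proper   = 0.1;    // double literal
    std::printf("%.17f\n", careless);   // eg. 0.10000000149011612
    std::printf("%.17f\n", proper);     // eg. 0.10000000000000001
}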