Threaded Mode | Linear Mode

Dan · 12-25-2018, 04:06 AM

I've been writing firmware for a hobby calculator project using BCD. Today I was looking at the datasheet for the ARM Cortex M4 (for the TI Tiva C Launchpad microcontroller) and was impressed to see a hardware implementation of +,-,*,/ and square root for floating point numbers in IEEE754 single precision format.

I've seen a few tutorials on how to convert decimal numbers like 8.25 to binary floating point format and I should be able to code the algorithm but I'm not sure about an algorithm for converting larger numbers like 8.25 * 10^84 to binary floating point, i.e. I'm not sure what's the best way to deal with the 10^84 part.

Now 10^84 = 2^84 * 5^84 but then you have to deal with 5^84.

I want to code an algorithm that converts decimal floating point numbers entered by the user to binary floating point numbers in order to exploit the Cortex floating point instructions.

Thomas Klemm · 12-25-2018, 08:32 AM

You may have a look at sun.misc.FloatingDecimal.

Some constants:

Code:

        /**

         * All the positive powers of 10 that can be

         * represented exactly in double/float.

         */

        private static final double[] SMALL_10_POW = {

            1.0e0,

            1.0e1, 1.0e2, 1.0e3, 1.0e4, 1.0e5,

            1.0e6, 1.0e7, 1.0e8, 1.0e9, 1.0e10,

            1.0e11, 1.0e12, 1.0e13, 1.0e14, 1.0e15,

            1.0e16, 1.0e17, 1.0e18, 1.0e19, 1.0e20,

            1.0e21, 1.0e22

        };

        private static final float[] SINGLE_SMALL_10_POW = {

            1.0e0f,

            1.0e1f, 1.0e2f, 1.0e3f, 1.0e4f, 1.0e5f,

            1.0e6f, 1.0e7f, 1.0e8f, 1.0e9f, 1.0e10f

        };

        private static final double[] BIG_10_POW = {

            1e16, 1e32, 1e64, 1e128, 1e256 };

        private static final double[] TINY_10_POW = {

            1e-16, 1e-32, 1e-64, 1e-128, 1e-256 };

And then in ASCIIToBinaryBuffer.doubleValue:

Code:

            //

            // Harder cases:

            // The sum of digits plus exponent is greater than

            // what we think we can do with one error.

            //

            // Start by approximating the right answer by,

            // naively, scaling by powers of 10.

            //

            if (exp > 0) {

                if (decExponent > MAX_DECIMAL_EXPONENT + 1) {

                    //

                    // Lets face it. This is going to be

                    // Infinity. Cut to the chase.

                    //

                    return (isNegative) ? Double.NEGATIVE_INFINITY : Double.POSITIVE_INFINITY;

                }

                if ((exp & 15) != 0) {

                    dValue *= SMALL_10_POW[exp & 15];

                }

                if ((exp >>= 4) != 0) {

                    int j;

                    for (j = 0; exp > 1; j++, exp >>= 1) {

                        if ((exp & 1) != 0) {

                            dValue *= BIG_10_POW[j];

                        }

                    }

                    //

                    // The reason for the weird exp > 1 condition

                    // in the above loop was so that the last multiply

                    // would get unrolled. We handle it here.

                    // It could overflow.

                    //

                    double t = dValue * BIG_10_POW[j];

                    if (Double.isInfinite(t)) {

                        //

                        // It did overflow.

                        // Look more closely at the result.

                        // If the exponent is just one too large,

                        // then use the maximum finite as our estimate

                        // value. Else call the result infinity

                        // and punt it.

                        // ( I presume this could happen because

                        // rounding forces the result here to be

                        // an ULP or two larger than

                        // Double.MAX_VALUE ).

                        //

                        t = dValue / 2.0;

                        t *= BIG_10_POW[j];

                        if (Double.isInfinite(t)) {

                            return (isNegative) ? Double.NEGATIVE_INFINITY : Double.POSITIVE_INFINITY;

                        }

                        t = Double.MAX_VALUE;

                    }

                    dValue = t;

                }

            } else if (exp < 0) {

(…)

Thus in case of 8.25e84 we'd end up with:

8.25 * 1e4 * 1e16 * 1e64

Kind regards
Thomas

Albert Chan · 12-25-2018, 11:37 PM

(12-25-2018 08:32 AM)Thomas Klemm Wrote: ... in case of 8.25e84 we'd end up with:

8.25 * 1e4 * 1e16 * 1e64

Sun code actually do this: 8.25e84 => 825e82 => 825 * 1e2 * 1e16 * 1e64

By starting the mantissa as integer, it can do fast-path trick, say 1.23e-12 => 123 / 1e14
Hardware divide (assuming FE_PC53_ENV and FE_TONEAREST) guaranteed correct conversion.

Above binary conversion is correct, but it is just lucky.
It is possible multiplied result are off, 1 ULP or more.

The slow and messy part is the correction loop, to fix the approximated binary float.

This is my implementation, using normalized 96-bits big float: strtod-fast.c

I also made a lite version that does not use arbitrary precision math to break ties.
The cost is a very slight chance of wrong conversion, ± 1 ULP

https://github.com/achan001/dtoa-fast

Dan · 12-26-2018, 07:10 AM

Thanks Thomas and Albert,

Perhaps a look-up table, storing m * 2^n, where m,n satisfy m * 2^n = 1.25^N for N element [-99,99]. Then

8.25 * 10^84 = 8.25 * (2^3 * 1.25)^84
= 8.25 * 2^252 * 1.25^84
= 8.25 * 2^252 * m * 2^n from look-up table, so fast
= 8.25 * m * 2^(252+n)

Then call assembly subroutine (I'm doing mixed C/assembly) converting 8.25 * m to double precision floating point, giving sign bit and fraction and intermediate exponent and add 252+n to it to obtain final exponent.

Paul Dale · 12-26-2018, 07:51 AM

I've not checked the datasheet, but you mentioned single precision in the opening comment. Double precision won't be fast if it is done in software.

An alternative approach would be to use the floating point hardware to produce initial approximations for the decimal float library. Intel's decimal library does this in places.

Pauli

Dan · 12-26-2018, 08:46 AM

Thanks Pauli,

This is new territory for me, my previous projects used BCD, shift and add and CORDIC algorithms. Yes they are single precision instructions so it would have to be done in software, but I think they'll nevertheless be faster that my previous efforts. The libraries are great, but I like the control and learning I get from writing the routines, even though they won't be as good as the ones written by the pros.

Thomas Klemm · 12-26-2018, 11:14 AM

(12-26-2018 07:10 AM)Dan Wrote: Perhaps a look-up table, storing m * 2^n, where m,n satisfy m * 2^n = 1.25^N for N element [-99,99].

What exactly are you trying to gain by using \(1.25^N\) instead of \(10^N\)?
The mantissa in binary is the same, but now you have to deal additionally with powers of 8.

(12-26-2018 08:46 AM)Dan Wrote: The libraries are great, but I like the control and learning I get from writing the routines, even though they won't be as good as the ones written by the pros.

Reading the source code of the libraries can be inspiring. You may wonder why they chose that mixed approach with two look-up tables in Java. Was it about keeping the size of the classes small? Or was it a premature optimization?

Thanks for leading me down the rabbit hole with your question. I learned a lot of new stuff: the fast path mentioned by Albert, or that casts from integer types to floating-point types are handled by the VM (i2d or l2d) or machine code (cvtsi2sdl or cvtsi2sdq).
Unfortunately Wikipedia does a bad job explaining converting from decimal representation to binary32 format.

Kind regards
Thomas

Dan · 12-28-2018, 03:45 AM

(12-26-2018 11:14 AM)Thomas Klemm Wrote: What exactly are you trying to gain by using \(1.25^N\) instead of \(10^N\)?
The mantissa in binary is the same, but now you have to deal additionally with powers of 8.

It's one way that came to mind to change 10^N to 2^n, because IEEE 754 binary32 format is (p,e,m) where M*10^N = (-1)^p * (1+m) * 2^e.

I thought to store numbers in IEEE 754 binary64 format, i.e. double precision and write algorithms for +.-,*,/ using the single precision hardware instructions, but I may drop that idea.

Paul Dale · 12-28-2018, 09:33 AM

(12-28-2018 03:45 AM)Dan Wrote: I thought to store numbers in IEEE 754 binary64 format, i.e. double precision and write algorithms for +.-,*,/ using the single precision hardware instructions, but I may drop that idea.

I wouldn't be surprised if the library implementation of double precision did this already.

Pauli

BruceH · 12-30-2018, 05:46 PM

(12-26-2018 08:46 AM)Dan Wrote: The libraries are great, but I like the control and learning I get from writing the routines, even though they won't be as good as the ones written by the pros.

Have you read What Every Computer Scientist Should Know About Floating Point Arithmetic?

SlideRule · 12-30-2018, 11:36 PM

(12-26-2018 08:46 AM)Dan Wrote: … I like the control and learning I get from writing the routines, even though they won't be as good as the ones written by the pros …

Might I suggest Numerical Computing with IEEE Floating Point Arithmetic by Michael L. Overton, siam.

Numerical computing is a vital part of the modern scientific infrastructure. Almost all numerical computing uses floating point arithmetic, and almost every modern computer implements the IEEE1 binary floating point standard, published in 1985. This standard is arguably the most important in the computer industry, the result of an unprecedented cooperation between academic computer scientists and the cutting edge of industry. Nonetheless, many years after its publication, the key ideas of the IEEE standard remain poorly understood by many students and computer professionals . Perhaps this is because an easily accessible yet reasonably detailed discussion of the standard has not been available-hence, the evolution of this short book. Although it is intended primarily for computer science or mathematics students, as a supplement to a more traditional textbook for a course in scientific computing, numerical analysis, or computer architecture, it also aims to reach a broader audience. As well as the IEEE standard, topics include the floating point architecture of the Intel microprocessors, a discussion of programming language support for the standard, and an introduction to the key concepts of cancellation, conditioning, and stability. The book should be accessible to any reader with an interest in computers and mathematics . Some basic knowledge of calculus and programming is assumed in the second half. The style is not that of a traditional textbook. There is enough variety of content that all but the most expert readers will find something of interest here.

BEST!
SlideRule

Dan · 01-01-2019, 07:27 AM

(12-28-2018 09:33 AM)Paul Dale Wrote:
(12-28-2018 03:45 AM)Dan Wrote: I thought to store numbers in IEEE 754 binary64 format, i.e. double precision and write algorithms for +.-,*,/ using the single precision hardware instructions, but I may drop that idea.

I wouldn't be surprised if the library implementation of double precision did this already.

Pauli

Keil's ARM compiler uses single precision hardware instructions in the multiplication subroutine for float data types but not for doubles, as far as I can tell from the disassembly.

Dan · (This post was last modified: 01-03-2019 06:59 AM by Dan.)

(12-26-2018 07:51 AM)Paul Dale Wrote: An alternative approach would be to use the floating point hardware to produce initial approximations for the decimal float library. Intel's decimal library does this in places.

Pauli

Intel's library is impressive but I have no idea how to add it to my project.

However I can include the math.h and complex.h headers - I'm pleasantly surprised by the number of mathematical functions in the C libraries. You pretty much have a scientific calculator right away.

Also there is no need to convert from decimal to binary because the float and double data types are supported, you just type

value=0;
.
.
value = value * 10 + key;

and "value" is updated each time the user presses a key.