Random float number generator using x86 ASM code optimized in size

This blog post was originally posted on my previous blog code4k
While I'm working on the next 4k-64k softsynth for FRequency, I need to implement a simple noise generator. In the past, i used to call directly the c rand() function without digging into the internals... Because i have to develop the softsynth using x86 ASM in order to keep it as small as possible, i went through revisiting the problem of generating a random uniform float number in the range [-1,1) or (-1,1], but with size optimization in mind. This is not a common problem as you'll more likely be able to easily find a speed optimized version other the Internet but not a sized one...

I will focus mainly on 3 implementations, so i will not describe an exhaustive list of technique around. I would be glad if you are aware about a smaller implementation! I found quite some interesting results, like a poor performance of the c rand() function, both in terms of speed and uniformity.

Let's recall one of the simplest C random function you can find around :
float Rand() {
return ((float)rand()/RAND_MAX) * 2.0f - 1.0f;
}

The rand() is generating an integer in the range [0, 32767]. RAND_MAX is equal to the max value 32767. The result is then scaled to the range [-1,1]. Let us first implement this in x86 ASM, just as an exercise.

Version 1: Using the c rand() function


A straight implementation of the previous C in x86 ASM could be like this (thanks to AsmHighlighter for the syntax highlighting and the instruction size calculation ;) ) :
_Rand proc
      call dword ptr [_imp__rand]   ;#6 Call c Rand
      push eax                      ;#1 st(0)         st(1)
      fild dword ptr [esp]          ;#3 rand           
      fidiv word ptr [RandMax]      ;#6 (-1,0]          
      fchs                          ;#2 [0,1)          
      fadd st(0),st(0)              ;#2 [0,2)          
      fld1                          ;#2   1           [0,2)
      fsubp                         ;#2 [-1,+1]
      pop eax                       ;#1 clean stack
      ret                           ;#1
_Rand endp                          ;#26 total bytes

There is a slight difference with the previous C version. I want to generate a number in the range [-1,1), so the RandMax is set to a short (or word) 32768, meaning that the integer value is -2^15. Because the c rand function generate a number in the range 0-32767, i have to scale it to the correct [-1,1) range. Dividing by -2^15 generate a number in the range (-1,0]. I just have to negate this result, multiply by 2 and minus 1 to get the correct range [-1,1) (although this in not critical to have this range. The range (-1,1] is fine as well).

The number of bytes for this function is 26 bytes. Although you have to add the cost of referencing the msvcrt runtime (this is not so much, but still...). The crinkler compressed size is around 20.5 bytes. You will find attached to this article the RandAsm Visual C++ Project using MASM and crinkler.

The protocol to test the compressed size is to leave only one type of random function at a time when generating the code. The other implementations are disabled. This will give you a more accurate estimation of the compressed size (but this is not a true final compressed size in an intro exe, as context modeling from crinkler heavily depends on... context, previous code compressed... and so on...). But with this approach, you have at least something close to reality... I have also measure the execution time minus the code to fill the buckets (i have run an empty random function to get the reference time).

I have implemented a basic checker of the uniformity of the random number generator. My friend ulrick advise me to use a simple Kolmogorov-Smirnov (KS) test to check the efficiency of the generator, but i found this too difficult (i didn't have time to implement such a test, with storing, sorting all random numbers...). He then suggest me to use a simpler test called Chi-Square Test (check also this nice article from Dr Dobb's about Testing Random Number Generators by Jerry Dwyer and K.B. Williams, June 1996 ). It's basically working on frequencies (and not on individual random number as for the KS test) using a limited amount of buckets to count the frequency of the random number. For this test, i'm using 100 buckets on the range [-1,1[ (bucket 0 is in the range [-1, -0.98] while the bucket number 99 is in the range [0.98, 1[), and generating 1,000,000,000 random float numbers with it. It means that by bucket, i should expect a perfect : 1,000,000,000 / 100 = 10,000,000 values. The Chi-Square Test computes a factor k that is checked against a Chi-Square distribution table.

Be aware that this test (as well as the KS test) is mainly relevant for testing the uniformity of the random numbers, but they don't ensure that the numbers are truly random. We should implement another algo to test this "random-ability" but i don't have time to do this! ;)

The results for the C-Rand ASM version are :

Rand Type Time(ms) Precision(bits) Code Size(bytes) Compressed Code Size(bytes) k-ChiSq
Rand C ASM 29890 ms 15 26 20.5 2066.67

You may notice that the k-ChiSquare factor is very high. This factor should be below 134.642! (column 0.01, line 99 in the ChiSquare table). A first look at these results tell us that they are quite poor. We are going to see that other x86 ASM implementation are performing much better than this one.

For the curious, what's behind the c rand() function? Under Visual C++, you have the following code:
int __cdecl rand (void)
{
return( ((next = next * 214013L + 2531011L) >> 16) & 0x7fff );
}
We can see that the 0x7FFF truncation is hurting a lot the precision of the rand function. Only 15 bits are then used when converting to the float.

Version 2: Using iq's method


iq wrote a couple of very great article about coding, mainly for graphics. But among them, there is an interesting article about the problem of random float numbers : float, small and random.

Read this article carefully, as it is going a little deeper in the internals of IEEE-754 floating point representation (used by the FPU).
Finally, iq suggest a new efficient algorithm:
float sfrand( int *seed )
{
float res;
seed[0] *= 16807;
*((unsigned int *) &res) = ( ((unsigned int)seed[0])>>9 ) | 0x40000000;
return( res-3.0f );
}

A straight implementation of his algo in asm could be like this one:

_Rand proc
      imul eax, dword ptr [RandSeed], 16807     ;#A eax = RandSeed * 16807
      mov dword ptr [RandSeed], eax             ;#5 RandSeed = eax
      mov al, 080h                              ;#2 eax: rrrrrr80h where r are bits from random seeds
      ror eax, 8                                ;#3 eax: 80rrrrrrh. value = s * 2^(e-127). e is set to 128, thus 040h value ( sEee eeee efff ffff ffff ffff ffff ffff)
      shr eax, 1                                ;#2 eax: 40rrrrrrh
      push eax                                  ;#1 push eax on stack to load it as a float. rand [2,4)
      fld dword ptr [esp]                       ;#3 load rand st0 = rand [2,4)
      push 3                                    ;#2 load 3
      fisub dword ptr [esp]                     ;#3 st(0) = rand [2,4) - 3 = rand[-1, 1)
      pop eax                                   ;#1 free stack
      pop eax                                   ;#1 free stack   
      ret                                       ;#1 return to caller
_Rand endp                                      ;#34 total bytes

This is not probably the smallest x86 ASM implementation of iq's code, but from 1 to 2 bytes, this should be enough close... (hey, i don't claim to be a x86 expert anyway! ;) )

This implementation gives the following results :

Rand Type Time(ms) Precision(bits) Code Size(bytes) Compressed Code Size(bytes) k-ChiSq
iq's ASM 13042 ms 23 34 30 7.36

They are significantly better than the previous one. Both in terms of execution time and uniform distribution. The k-ChiSqr factor is 7.36 much below the 134.642 limit. Although, the result is a function a bit larger, around 30 bytes compressed. Not a lot, but still, we need to have a smaller one.

Version 3: Using simple int min divider (Park Miller-Carta's method)


When there is a demo/intro that is using a cool hack or whatever, i'm always curious to see how they did it. Well, i should post someday an article about the fact that "demomaking... is a bit of hacking". So when i went around the code of the 4klang great synth from gopher, i found an interesting tiny random function. [Edit]After some research, i found the original authors are Payne, Rabung & Bogyo (in 1969), then Park Miller (in 1988) and Carta (in 1990), this is a nice implementation and in fact, very simple.[/Edit]

We are basically using the same random generator that iq's suggest at the beginning (other candidates are also good. 4klang is using 16007. I'm wondering if we should use a prime number?) :
seed *= 16807;
seed is an integer. So the range value is going from [-2^31, 2^31-1]. Following the simple C rand function, we just have to load this seed directly on the FPU as an integer, keeping all the 31 bits precision. Then, we just have to divide by the largest - absolute - value : -2^31 aka 080000000h in hex. In C, this is simple as :
seed *= 16807;
return ((float)seed) / (float)0x80000000;

or in x86 ASM:

_Rand proc
      imul eax, dword ptr [RandSeed], 16807 ;#A eax = RandSeed * 16807
      mov dword ptr [RandSeed], eax         ;#5 RandSeed = eax
      fild dword ptr [RandSeed]             ;#6 load RandSeed as an integer
      fidiv dword ptr [RandDiv]             ;#6 div by max int value (absolute) = eax / (-2^31)
      ret                                   ;#1 return to caller
_Rand endp                                  ;#28 total bytes

We can see that this implementation return a result in the range (-1,1]. This implementation gives the following results :

Rand Type Time(ms) Precision(bits) Code Size(bytes) Compressed Code Size(bytes) k-ChiSq
Int-Min Divider 6006 ms 31 28 19 7.31

Results are of course significantly better than the C version, but they are also better in terms of execution time and size compare to iq's method. Not surprisingly, k-ChiSqr factor is 7.31 much below the 134.642 limit and comparable to iq.

With 31 bits for the precision, This method is even better, as the seed is fully loaded on the fpu stack with it's 31 bits precision : because the FPU has a total of 80 bits to store a floating point number (with extended precision), it means that we have 64 bits available for the mantissa, large enough to store the 31 bits . The compression is working well on this, mainly due to the fact that we are using a simple memory addressing that is compressing well with crinkler. Yep, sometimes, it's better to have something like : mov eax, [address1]; imul eax, dword ptr [address2] than having to preload a base adress like lea ebx, [address1]; mov eax, [ebx]; imul eax, [ebx+4];. So with only 19 bytes compressed, this version is really cool and the execution time is impressive too!

Faster and better floating point generator


Let's recap the results :

Rand Type Time(ms) Precision(bits) Code Size(bytes) Compressed Code Size(bytes) k-ChiSq
Rand C ASM 29890 ms 15 26 20.5 2066.67
iq's ASM 13042 ms 23 34 30 7.36
Int-Min Divider 6006 ms 31 28 19 7.31

We can see that the "Int-Min Divider" technique is by far the most efficient, both in terms of execution time and size. It's also a very good generator, as the distribution is quite uniform.

This should be very useful when you need a fast random floating point generator!

A question of floor?


Just a few words about truncating a float to an int. I needed such think in order to increments the frequecies for the bucket. I'm usually using a poor implementation using the fpu instructions couple fld/fistp but the results depends on the FPU rounding mode (and the default rounding mode is "nearest" and not "toward zero" as expected). You don't see this when you are using the msvcrt library, as they have implemented a true floor() conversion in the back. But when using the /QIfist option in VC++, you can't use anymore this function from the msvcrt.

However, I found a straight function that may do the trick, using following SSE function :
inline long floorSSE(double x) {
__asm cvttsd2si eax,x
}
The cvttsd2si SSE instruction is converting a double to an int (you can find about this an interesting discussion here). This is actually working well if the double is coming originally from a float (moving from the FPU to SSE requires writing to memory). But this instruction seems to work well on float. Although, I should check more extensively the reliability of this function...
Update 29/10/2009 : In fact, there is the FPU instruction fisttp in SSE3 that do also and is less restrictive than the cvttsd2si instruction (notice although they are more considered as "truncate" than "floor" as i mentioned).
________________________________________
Here is attached the VC++ 2008 RandAsm project i used for this article.

Comments