!j development
Programming Article:

Micro-surgical implementation of a 4-pole low pass filter with Inline Assembler


by !jdev
©2003-2006 !j development division audio
This article is copyrighted. All rights reserved. Do not republish it in whole or in part without explicit permission.

Introduction

Well, this is the continuation of our article series about inline Assembly for digital audio processing and other streaming multimedia applications that use single precision floating point routines.

In the last article we had a look at the fantastic possibilities for speed optimisation offered by the new instruction sets of the latest IA (Intel Architecture) microprocessors. Now we will go somewhat deeper and examine how to implement an analogue filter simulation with (inline) Assembler coding, using a concrete example.

As the real-world model we will take the famous "Moog Filter Cascade", that is, a 4-pole low pass filter or, in other words, the legendary 24 dB (per octave) resonant filter known from the equally legendary dinosaur synthesizers and analogue filter hardware modules of the past.

Don't misunderstand us here: we will not invent an entirely new concept for a digital filter implementation.
No. We simply take an already existing example C++ function (available all over the internet) and transform it into pure inline Assembler code. By hand, of course, doing it the hard way...

This article is not intended to deliver free stuff to the "communities". It is intended for developers who want to benefit from our experience when optimising their time critical processes for real time audio and other multimedia applications.

So let's start and see what our example C/C++ function looks like:

#include <math.h> // needed for exp()

inline void MoogFilter(float *buffer, float *paramsCutoff, float *paramsReso, long samples)
{
    static float x1 = 0.f;
    static float x2 = 0.f;
    static float x3 = 0.f;
    static float x4 = 0.f;
    static float y0 = 0.f;
    static float y1 = 0.f;
    static float y2 = 0.f;
    static float y3 = 0.f;

    float F;
    register float K;
    register float P;
    float S;
    float R;
    register float x0;

    while (--samples)
    {
        F = (*paramsCutoff);
        K = ((3.6f * F) - (1.6f * (F * F))) - 1.f;
        P = (K + 1.f) * 0.5f;
        S = (float)exp( (1.f - P) * 1.386249f );
        R = (*paramsReso) * S;
        x0 = (*buffer) - (R * x4);

        x1 = (x0 * P) + (y0 * P) - (K * x1);
        x2 = (x1 * P) + (y1 * P) - (K * x2);
        x3 = (x2 * P) + (y2 * P) - (K * x3);
        x4 = (x3 * P) + (y3 * P) - (K * x4);
        y0 = x0;
        y1 = x1;
        y2 = x2;
        y3 = x3;

        x4 = x4 - (((x4 * x4) * x4) / 6.0f);
        *buffer++ = x4;

        paramsCutoff++;
        paramsReso++;
    }
}

One note for "code rippers": The code in this article has (intentionally) some weaknesses. If you use improper control values for resonance and/or cutoff, or low sample rates, the filter may become unstable. There is also no built-in denormal blocker, so the code may cause massive performance problems on newer generation processors when the internal floating point values become very small.
This filter should NOT be used inside any serious digital audio application or plugin without the modifications necessary to fix that!
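If you want to experiment with the code anyway, here is a minimal sketch (our suggestion, not part of the original function) of the two fixes hinted at above: clamping the per-sample control values into a safe range and adding a tiny offset so the recursive states never reach the denormal range. The helper names are hypothetical.

static inline float ClampTo01(float v)        // hypothetical helper: keep control data in [0, 1]
{
    return (v < 0.0f) ? 0.0f : ((v > 1.0f) ? 1.0f : v);
}

static inline float AddAntiDenormal(float v)  // hypothetical helper: tiny offset, far below audibility
{
    return v + 1.0e-18f;
}

// inside the loop one would then use, for example:
//   F  = ClampTo01(*paramsCutoff);
//   x0 = AddAntiDenormal((*buffer) - (R * x4));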

As you can see, there is really nothing spectacular. The code performs in-place processing of a sample stream buffer filled with single precision floating point values. We assume that this buffer represents an audio signal and that the control data for the cutoff and resonance parameters are also delivered in buffer based (per-sample streamed) format, with the same length as the output buffer. We will not explain how the real filter or its digital clone works, nor will we teach any of the coding languages used here.

However, you may notice the excessive use of brackets inside the function above. This is not necessary in C/C++, but it will help us very much later when we manually design our Assembler code. You have to know the hierarchy of operators very well if you want to write Assembler code successfully.

Unlike other examples on the internet, this filter is intended for (truly) real time processing; that means the control parameters are absolutely sample accurate, not calculated only once per loop but with each new sample. You can use such a filter, for instance, to filter audio streams controlled by fast envelopes or LFOs without any quantization in the control or audio signal. It is even predestined to support sample accurate filter cross modulation (filter frequency modulation). So the parameters will be recalculated on each new sample tick. This also helps to analyse the speed differences of our code with fewer loops.
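A small usage sketch (our illustration, not code from the article) of what such a sample accurate control stream could look like: the control buffers are filled with one value per sample - here a simple sine LFO on the cutoff - and handed to the filter together with the audio block. ProcessBlock and its parameters are hypothetical; the block is assumed to be at most 512 samples long.

#include <math.h>

void ProcessBlock(float *audio, long samples, float baseCutoff, float reso)
{
    static float cutoffBuf[512];              // assumes samples <= 512
    static float resoBuf[512];
    static float lfoPhase = 0.0f;

    for (long i = 0; i < samples; ++i)
    {
        // sample accurate modulation: a new cutoff value for every single sample
        cutoffBuf[i] = baseCutoff + 0.1f * (float)sin(lfoPhase);
        resoBuf[i]   = reso;
        lfoPhase    += 0.001f;
    }

    MoogFilter(audio, cutoffBuf, resoBuf, samples);   // one call per block
}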


Why buffer based?

As we have already seen in the last article, buffer based processing is the fastest way possible. The reason is very simple: if you implement such a relatively complex algorithm (like the digitally modelled analogue filter above) with single sample processing (one function call per tick), the processor has to jump around heavily while executing the code for each sample. The overall speed will be slowed down enormously. We also cannot use the great benefits the processor, and especially the FPU (floating point unit), offers. This means: holding the variables in registers and doing as few jumps as possible while we process as many samples as possible with one function call. This speeds things up enormously!

If you begin to think in the manner of a microprocessor, you will see very quickly how many advantages buffer based processing and the additional use of hand coded inline Assembly have compared to automatic optimisation techniques and/or "one sample tick" based calculations.
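To make the difference concrete, here is a small sketch (our illustration; ProcessOneSample is a hypothetical per-tick function, not from the article) of the two calling styles. The first variant pays the call and register setup overhead once per sample, the second only once per block.

void ProcessOneSample(float *sample, float cutoff, float reso);   // hypothetical per-tick filter

void FilterPerSample(float *buffer, float *cutoff, float *reso, long samples)
{
    for (long i = 0; i < samples; ++i)
        ProcessOneSample(&buffer[i], cutoff[i], reso[i]);         // one call (and setup) per tick
}

void FilterPerBlock(float *buffer, float *cutoff, float *reso, long samples)
{
    MoogFilter(buffer, cutoff, reso, samples);                    // one call per block, loop inside
}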


Register variables

The FPU adds 8 additional 80-bit registers ( st(0) to st(7) ) to the available general purpose registers of the main processor. This so called "FPU stack" is very often not used optimally, even if you compile with highly optimising HLL compilers.
We will see later what such an HLL compiler does with the function above. It is really not what you might expect. Mostly only two or three of the FPU registers will be used. And the jumping around such code finally causes is indeed questionable...

The excessive usage of all available registers inside the CPU is very important for fast execution. Why?
Each time the processor has to access variables in memory, it needs time. To this day, ultra fast memory is very expensive, and so the memory modules you can usually buy are relatively slow in comparison to the high speed of today's CPUs. The result is the so called ( and by manufacturers well hidden ) "wait states" of the CPU; that means waiting until the data from memory is finally transferred into the registers where all the action takes place. The same happens vice versa when storing the processed data back to memory locations outside the CPU registers.

Okay, you may say: there is something called "cache" inside my processor that will help to reduce those wait states. Right.
But this mysterious "cache" memory also needs some time. In some situations it can actually cost more time: the processor has to look into the cache first, and if it can't find the desired memory object there, it will nevertheless be forced to do a lookup in the slower main memory... which, as you can see, takes even more time.

The only logical consequence is to hold as many as possible of the variables a function uses inside the registers of the CPU, especially the 8 FPU registers in the case of our filter code. The access time to internal register variables is approximately zero, so we can speed up our code to a new dimension.

Unfortunately, most HLL compilers have become extremely ignorant regarding our instructions to use "register variables" declared in C/C++ or other HLL source code. The compiler will actually do whatever "he" wants and completely ignore all our well thought out "register" keyword declarations.

Remember our function. We have declared at least 3 register variables:

register float K;
register float P;
register float x0;

Now try to find those variables inside the disassembly below...
The compiler has simply ignored our high level instructions, or removed the variables through its automatic optimizations!

The only way to solve this problem and force the compiler to do exactly what we want is finally ... what a surprise ... coding with (inline) Assembler. The inline assembly will be executed 100% as we have coded it.

The Disassembly

Now take a look at the disassembly the compiler produces if we compile our C++ function with the highest possible (hyper) optimisation (all opt. switches on: /O2 /Og /Ob2 /Ot /G6 ...) on VC++ 7:

...
1000:00401000 loc_00401000:
1000:00401000 51            push ecx
1000:00401001 53            push ebx
1000:00401002 56            push esi
1000:00401003 57            push edi
1000:00401004 6660          pusha
1000:00401006 6661          popa

1000:00401008 8b742420      mov esi, [esp+20h]
1000:0040100c 4e            dec esi
1000:0040100d 0f8448010000  jz loc_0040115b
1000:00401013 8b54241c      mov edx, [esp+1ch]
1000:00401017 8b4c2418      mov ecx, [esp+18h]
1000:0040101b 8b442414      mov eax, [esp+14h]
1000:0040101f 90            nop
1000:00401020               ; XREFS First: 1000:00401155 Number : 1
1000:00401020 loc_00401020:
1000:00401020 d901          fld [ecx]
1000:00401022 83c004        add eax, 04h
1000:00401025 d905e4804000  fld [loc_004080e4]
1000:0040102b 83c104        add ecx, 04h
1000:0040102e d8c9          fmul st(0), st(1)
1000:00401030 83c204        add edx, 04h
1000:00401033 d9c1          fld st(1)
1000:00401035 d8ca          fmul st(0), st(2)
1000:00401037 d80de0804000  fmul [loc_004080e0]
1000:0040103d dee9          fsubp st(1), st(0)
1000:0040103f d825dc804000  fsub [loc_004080dc]
1000:00401045 d95c240c      fstp [esp+0ch]
1000:00401049 ddd8          fstp st(0)
1000:0040104b d944240c      fld [esp+0ch]
1000:0040104f d805dc804000  fadd [loc_004080dc]
1000:00401055 d80dd8804000  fmul [loc_004080d8]
1000:0040105b d95c2420      fstp [esp+20h]
1000:0040105f d905dc804000  fld [loc_004080dc]
1000:00401065 d8642420      fsub [esp+20h]
1000:00401069 d80dd4804000  fmul [loc_004080d4]
1000:0040106f d9ea          fldl2e
1000:00401071 dec9          fmulp st(1), st(0)
1000:00401073 d9c0          fld st(0)
1000:00401075 d9fc          frndint
1000:00401077 d9c9          fxch st(1)
1000:00401079 d8e1          fsub st(0), st(1)
1000:0040107b d9f0          f2xm1
1000:0040107d d9e8          fld1
1000:0040107f dec1          faddp st(1), st(0)
1000:00401081 d9fd          fscale
1000:00401083 ddd9          fstp st(1)
1000:00401085 d84afc        fmul [edx-04h]
1000:00401088 d80dfca94600  fmul [loc_0046a9fc]
1000:0040108e d868fc        fsubr [eax-04h]
1000:00401091 d905f8a94600  fld [loc_0046a9f8]
1000:00401097 d8c1          fadd st(0), st(1)
1000:00401099 d84c2420      fmul [esp+20h]
1000:0040109d d905f4a94600  fld [loc_0046a9f4]
1000:004010a3 d84c240c      fmul [esp+0ch]
1000:004010a7 dee9          fsubp st(1), st(0)
1000:004010a9 d91df4a94600  fstp [loc_0046a9f4]
1000:004010af d905f0a94600  fld [loc_0046a9f0]
1000:004010b5 d805f4a94600  fadd [loc_0046a9f4]
1000:004010bb d84c2420      fmul [esp+20h]
1000:004010bf d905eca94600  fld [loc_0046a9ec]
1000:004010c5 d84c240c      fmul [esp+0ch]
1000:004010c9 dee9          fsubp st(1), st(0)
1000:004010cb d91deca94600  fstp [loc_0046a9ec]
1000:004010d1 d905e8a94600  fld [loc_0046a9e8]
1000:004010d7 d805eca94600  fadd [loc_0046a9ec]
1000:004010dd d84c2420      fmul [esp+20h]
1000:004010e1 d905e4a94600  fld [loc_0046a9e4]
1000:004010e7 d84c240c      fmul [esp+0ch]
1000:004010eb dee9          fsubp st(1), st(0)
1000:004010ed d91de4a94600  fstp [loc_0046a9e4]
1000:004010f3 8b3de4a94600  mov edi, [loc_0046a9e4]
1000:004010f9 d905e0a94600  fld [loc_0046a9e0]
1000:004010ff 893de0a94600  mov [loc_0046a9e0], edi
1000:00401105 d805e4a94600  fadd [loc_0046a9e4]
1000:0040110b 8b3df4a94600  mov edi, [loc_0046a9f4]
1000:00401111 893df0a94600  mov [loc_0046a9f0], edi
1000:00401117 8b3deca94600  mov edi, [loc_0046a9ec]
1000:0040111d d84c2420      fmul [esp+20h]
1000:00401121 893de8a94600  mov [loc_0046a9e8], edi
1000:00401127 d905fca94600  fld [loc_0046a9fc]
1000:0040112d d84c240c      fmul [esp+0ch]
1000:00401131 dee9          fsubp st(1), st(0)
1000:00401133 d9c9          fxch st(1)
1000:00401135 d91df8a94600  fstp [loc_0046a9f8]
1000:0040113b d9c0          fld st(0)
1000:0040113d d8c9          fmul st(0), st(1)
1000:0040113f d8c9          fmul st(0), st(1)
1000:00401141 d80dd0804000  fmul [loc_004080d0]
1000:00401147 d8e9          fsubr st(0), st(1)
1000:00401149 d915fca94600  fst [loc_0046a9fc]
1000:0040114f ddd9          fstp st(1)
1000:00401151 d958fc        fstp [eax-04h]
1000:00401154 4e            dec esi
1000:00401155 0f85c5feffff  jnz loc_00401020
1000:0040115b               ; XREFS First: 1000:0040100d Number : 1
1000:0040115b loc_0040115b:
1000:0040115b 6660          pusha
1000:0040115d 6661          popa

1000:0040115f 5f            pop edi
1000:00401160 5e            pop esi
1000:00401161 5b            pop ebx
1000:00401162 59            pop ecx
1000:00401163 c3            ret
...

Note: The pusha, popa sequences at the beginning and the end are markers inserted by us to identify the function exactly inside this disassembly. We used VC++ 7 to compile the filter example and Borg to disassemble the executable file. Those markers are not used in the real test function.

Isn't that somehow ... strange ???
Not only has the compiler completely ignored our register instructions; about 32 times per loop the code has to fetch variables from, and store them to, external memory locations (loc_00whatever). There are some more strange things that make our filter slower than it needs to be. Please take a look at the FPU register usage. This is not satisfactory, as you can see. Only st(0), st(1) and once st(3) are effectively used.

So we decide to do it from scratch and begin to transform our function manually into Assembler code, following our own logic.


The Coding

We want to do some things somewhat differently. So we take another look at the variables of our function.

static float x1 = 0.f;
static float x2 = 0.f;
static float x3 = 0.f;
static float x4 = 0.f;
static float y0 = 0.f;
static float y1 = 0.f;
static float y2 = 0.f;
static float y3 = 0.f;

float F;
register float K;
register float P;
float S;
float R;
float x0;

Note: The static variables must be preserved from one call (block) to the next. We could also declare them globally in the program or implement them as class members. Some other variables are used excessively, so we want to hold those inside the registers.
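As mentioned, the statics could just as well become class members, so that several filter instances can run independently. A minimal sketch (our illustration, not code from the article) of such a state holder:

struct MoogFilterState
{
    float x1, x2, x3, x4;   // saved per block / held inside registers while looping
    float y0, y1, y2, y3;   // saved per loop iteration

    MoogFilterState() : x1(0.f), x2(0.f), x3(0.f), x4(0.f),
                        y0(0.f), y1(0.f), y2(0.f), y3(0.f) {}
};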

First we will try to hold as many of the variables as possible inside the registers of the FPU, so we can expect an access time of nearly zero for those. As said, there are 8 FPU registers we can actually use for this. The variables x1, x2, x3, x4, K and P are predestined to be held inside the registers while looping, because we use them excessively (multiple times) in the calculations inside the loop. The variables K and P are especially important, because they are needed for each of the 4 "pole calculations" of our filter code. This shows clearly how important loop based processing can be for increasing speed.

Second, we will calculate most of the coefficients internally, without using any additional variable storage. For instance, the variables F, S and R are only helper variables in our C/C++ code, created anew on each access to pre-calculate some values. They do not need to be preserved for the next loop iteration, so we can pre-calculate them internally inside the FPU registers on demand.

Third, we will replace all divisions inside the code, because the division instruction generally needs many cycles to compute. Fortunately, we can replace many division instructions with multiplications by pre-calculated coefficients. So we will, for example, not divide by 6.0f, but multiply by a 1.0f / 6.0f constant (0.16666...f). This may produce slightly different results due to rounding error, but in our case this is insignificant.
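In C/C++ the same trick looks like this (a sketch with a hypothetical function name, just to illustrate the replacement):

inline float CubicStage(float x4)
{
    static const float inv6 = 1.0f / 6.0f;          // 0.16666...f, computed once
    // original:    x4 - (((x4 * x4) * x4) / 6.0f)  ->  one fdiv per sample
    return x4 - (((x4 * x4) * x4) * inv6);          //  ->  one fmul per sample
}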

A digitally simulated analogue filter does not need to be a perfect mathematical construct, because the real (analogue) filter is a highly unstable piece of technology, and exactly that ultimately causes the characteristic sound of a Moog filter. But this is a completely different topic, not discussed in depth here ...


The Flow

This all sounds very simple, but in fact we need a well thought out structure for our function "flow". We want to prevent our code from doing things twice or inefficiently, from jumping around, and so on. Finally, we want to create something better than the machine generated optimizations could ever produce.

A good way is to analyse the machine generated code in detail and see what's going on. We have already discovered some aspects of this. Now we create, step by step, an alternative that does things better. There is more than one possible way to do this; we only want to find one of them.

Unfortunately we cannot use immediate values with the FPU instructions, so we have to create some extra variables for the constants in our code. But the compiler would do exactly the same, so it isn't that bad.

There is another very tricky thing about FPU register usage: st(0) to st(7) ultimately form a LIFO ("last in, first out") stack. So the positions of our register variables change each time a new value is pushed or popped. This makes it somewhat crazy, because it is not so easy to follow this mechanism in your head. Often you have to step through the single instructions in a debugger to find a possible bug. But this is no problem with the new generation of HLL (high level language) compilers, like Microsoft ® VC++: you can easily debug inside inline Assembler code.
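A tiny illustration of the shifting stack positions (our own example, not part of the filter code): every fld pushes the whole stack down by one position.

inline float AddWithFPU(float a, float b)
{
    float result;
    __asm
    {
        fld a               ;st0 = a
        fld b               ;st0 = b, st1 = a   (a has moved down one position!)
        faddp st(1), st(0)  ;st1 = st1 + st0, then pop -> the sum is now in st0
        fstp result         ;store and pop -> the stack is empty again
    }
    return result;
}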

Now let us take a look at our inline Assembler code (it needed some hours to write and debug):

inline void MoogFilterWithIA(float *buffer, float *paramsCutoff, float *paramsReso, long samples)
{
    // some static (remembered) vars
    static float x1 = 0.f; // saved per block / hold inside registers
    static float x2 = 0.f; // saved per block / hold inside registers
    static float x3 = 0.f; // saved per block / hold inside registers
    static float x4 = 0.f; // saved per block / hold inside registers
    static float y0 = 0.f; // saved per loop
    static float y1 = 0.f; // saved per loop
    static float y2 = 0.f; // saved per loop
    static float y3 = 0.f; // saved per loop

    // some coefficients
    float K; // register
    float P; // register
    //float F; // internal calculated
    //float S; // internal calculated
    //float R; // internal calculated
    float x0; // temporarily used

    // some calculation constants
    float c1 = 3.6f;
    float c2 = 1.6f;
    float c3 = 1.0f;
    float c4 = 0.5f;
    float c5 = 1.386249f;
    float c6 = 1.f / 6.f; // used with multiplication
    float c7 = 6.f;       // used with division

    __asm
    {
        // initialize general purpose registers
        mov eax, paramsCutoff
        mov ebx, paramsReso
        mov ecx, samples
        mov edx, buffer

        // preload FPU registers                               // FPU register usage: (st1..8)
        fld x1                                                                          //->1
        fld x2                                                                          //->2
        fld x3                                                                          //->3
        fld x4                                                                          //->4
        fld P                                                                           //->5
        fld K                                                                           //->6

    // loop begin
    repeat:

        // K = ((3.6f * F) - (1.6f * (F * F))) - 1.f;
        fld dword ptr [eax]                 ;st0 = [eax] = (*Cutoff) = (F)              //->7
        fmul st(0), st(0)                   ;st0 = F * F
        fmul c2                             ;st0 = (F * F) * 1.6f
        fld c1                              ;st0 = 3.6f                                 //->8
        fmul dword ptr [eax]                ;st0 = 3.6f * F
        fsubrp st(1), st(0)                 ;st0 = (3.6f * F) - ((F * F) * 1.6f)        //->7
        fld1                                ;st0 = 1.0                                  //->8
        fsubp st(1), st(0)                  ;st0 = ((3.6f * F) - ((F * F) * 1.6f)) - 1.f//->7
        fstp st(1)                          ;st0 (K) -> st1                             //->6

        // P = (K + 1.f) * 0.5f;
        fld1                                ;st0 = 1.f                                  //->7
        fadd st(0), st(1)                   ;st0 = K + 1.f
        fmul c4                             ;st0 = (K + 1) * 0.5f
        fstp st(2)                          ;st0 (P) -> st2                             //->6

        // S = (float)exp( (1.f - P) * 1.386249f );
        fld1                                ;st0 = 1.f                                  //->7
        fsub st(0), st(2)                   ;st0 = 1.f - P
        fmul c5                             ;st0 = (1.f - P) * 1.386249f

        // exp() sub function ;st0 (S) = exp( (1.f - P) * 1.386249f )
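        // (the following sequence uses the identity exp(x) = 2^(x * log2(e)):
        //  fldl2e pushes log2(e), frndint splits off the integer part of the
        //  exponent, f2xm1 computes 2^fraction - 1, and fscale re-applies the
        //  integer part)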
        fldl2e                              ;st0 = (lbe)                                //->8
        fmulp st(1), st(0)                  ;st0 = value * (lbe)                        //->7
        fld st(0)                           ;duplicate st0                              //->8
        frndint                             ;st0 = frndint(st0)
        fxch st(1)                          ;st0 <-> st1
        fsub st(0), st(1)                   ;st0 -= st1
        f2xm1                               ;st0 = f2xm1 (st0)
        fadd c3                             ;st0 += 1.f
        fscale                              ;st0 (S) = fscale (st0, st1)

        // R = (*paramsReso) * S;
        fmul dword ptr [ebx]                ;st0 (R) = (S)* (*Reso)

        // x0 = (*buffer) - (R * x4);
        fmul x4                             ;st0 = R * x4
        fxch                                ;st0 <-> st1
        fstp st(0)                          ;pop                                        //->7
        fld dword ptr [edx]                 ;st0 = (*Buffer)                            //->8
        fsubrp st(1), st(0)                 ;st0 (x0) = (*Buffer) - (R * x4)            //->7
        fst x0

        // x1 = ((x0 * P) + (y0 * P)) - (K * x1);
        fmul st(0), st(2)                   ;st0 = x0 * P
        fld st(2)                           ;st0 = P                                    //->8
        fmul y0                             ;st0 = P * y0
        faddp st(1), st(0)                  ;st0 = (y0 * P) + (x0 * P)                  //->7
        fld st(1)                           ;st0 = K                                    //->8
        fmul st(0), st(7)                   ;st0 = K * x1
        fsubp st(1), st(0)                  ;st0 (x1) = ((y0 * P) + (x0 * P)) - (K * x1)//->7
        fst st(6)                           ;st0 (x1) -> st6
        // x2 = ((x1 * P) + (y1 * P)) - (K * x2);
        fmul st(0), st(2)                   ;st0 = x1 * P
        fld st(2)                           ;st0 = P                                    //->8
        fmul y1                             ;st0 = P * y1
        faddp st(1), st(0)                  ;st0 = (y1 * P) + (x1 * P)                  //->7
        fld st(1)                           ;st0 = K                                    //->8
        fmul st(0), st(6)                   ;st0 = K * x2
        fsubp st(1), st(0)                  ;st0 (x2) = ((y1 * P) + (x1 * P)) - (K * x2)//->7
        fst st(5)                           ;st0 (x2) -> st5
        // x3 = ((x2 * P) + (y2 * P)) - (K * x3);
        fmul st(0), st(2)                   ;st0 = x2 * P
        fld st(2)                           ;st0 = P                                    //->8
        fmul y2                             ;st0 = P * y2
        faddp st(1), st(0)                  ;st0 = (y2 * P) + (x2 * P)                  //->7
        fld st(1)                           ;st0 = K                                    //->8
        fmul st(0), st(5)                   ;st0 = K * x3
        fsubp st(1), st(0)                  ;st0 (x3) = ((y2 * P) + (x2 * P)) - (K * x3)//->7
        fst st(4)                           ;st0 (x3) -> st4
        // x4 = ((x3 * P) + (y3 * P)) - (K * x4);
        fmul st(0), st(2)                   ;st0 = x3 * P
        fld st(2)                           ;st0 = P                                    //->8
        fmul y3                             ;st0 = P * y3
        faddp st(1), st(0)                  ;st0 = (y3 * P) + (x3 * P)                  //->7
        fld st(1)                           ;st0 = K                                    //->8
        fmul st(0), st(4)                   ;st0 = K * x4
        fsubp st(1), st(0)                  ;st0 (x4) = ((y3 * P) + (x3 * P)) - (K * x4)//->7
        fst st(3)                           ;st0 (x4) -> st3

        // save old values
        // y0 = x0;
        fld x0                                                                          //->8
        fstp y0                                                                         //->7
        // y1 = x1;
        fld st(6)                                                                       //->8
        fstp y1                                                                         //->7
        // y2 = x2;
        fld st(5)                                                                       //->8
        fstp y2                                                                         //->7
        // y3 = x3;
        fld st(4)                                                                       //->8
        fstp y3                                                                         //->7

        fstp st(0)                          ;pop                                        //->6

        // x4 = x4 - (((x4 * x4) * x4) / 6.0f);
        fld st(2)                           ;st0 = x4                                   //->7
        fmul st(0), st(3)                   ;st0 = x4 * x4
        fmul st(0), st(3)                   ;st0 = x4 * x4 * x4
        //fdiv c7                           ;st0 = ((x4 * x4) * x4) / 6.f
        fmul c6                             ;st0 = (x4 * x4 * x4) * (1.f / 6.f)
        fld st(3)                           ;st0 = x4                                   //->8
        fsub st(0), st(1)                   ;st0 (x4) = x4 - (((x4 * x4) * x4) / 6.0f)
        fst st(4)                           ;st0 -> st4(x4)

        // *buffer = x4;
        fstp dword ptr [edx]                ;store the sample                           //->7
        //========
        fstp st(0)                          ;pop                                        //->6

        // buffer++
        add edx, 4                          ;inc Buffer pointer
        // paramsCutoff++;
        add eax, 4                          ;inc Cutoff pointer
        // paramsReso++;
        add ebx, 4                          ;inc Reso pointer

        // loop end
        dec ecx                             ;decrease count
        jnz repeat                          ;jump, if not 0

        // finalize
        ;store the vars for next loop
        fstp st(0)                          ;                                           //->5
        fstp st(0)                          ;                                           //->4
        fstp x4                             ;                                           //->3
        fstp x3                             ;                                           //->2
        fstp x2                             ;                                           //->1
        fstp x1                             ;                                           //->0
    }
}

Note: The "dword ptr" operators are merely 4byte size specifiers. We used this, because some Assemblers don't know the REAL32 keyword or macro. Because the FPU works internal always with 80bit precision, the processor must know, how big the size of the loaded/used floating point value is (for getting the right value interpretation on the given address).

At a very first impression one could think: "Oh my god, this actually looks much bigger and is (therefore) surely slower!"

No, it is not slower, as you will see; it is even clearly faster than our highly optimised machine generated assembly above. The reason for the obviously bigger size is that we have written conceptually "clear" code with many comments to demonstrate and describe our construction. It may not be the fastest possible Assembler code for our filter algorithm, but for a start it is not that bad, we think.

This is the new disassembly:

1000:00401031 8b450c       mov eax, [ebp+0ch]
1000:00401034 8b5d10       mov ebx, [ebp+10h]
1000:00401037 8b4d14       mov ecx, [ebp+14h]
1000:0040103a 8b5508       mov edx, [ebp+08h]
1000:0040103d d905fc294200 fld [loc_004229fc]
1000:00401043 d905f8294200 fld [loc_004229f8]
1000:00401049 d905f4294200 fld [loc_004229f4]
1000:0040104f d905f0294200 fld [loc_004229f0]
1000:00401055 d945e0       fld [ebp-20h]
1000:00401058 d945e0       fld [ebp-20h]
1000:0040105b              ; XREFS First: 1000:00401135 Number : 1
1000:0040105b loc_0040105b:
1000:0040105b d900         fld [eax]
1000:0040105d dcc8         fmul st(0), st(0)
1000:0040105f d84de8       fmul [ebp-18h]
1000:00401062 d945ec       fld [ebp-14h]
1000:00401065 d808         fmul [eax]
1000:00401067 dee1         fsubrp st(1), st(0)
1000:00401069 d9e8         fld1
1000:0040106b dee9         fsubp st(1), st(0)
1000:0040106d ddd9         fstp st(1)
1000:0040106f d9e8         fld1
1000:00401071 d8c1         fadd st(0), st(1)
1000:00401073 d84df0       fmul [ebp-10h]
1000:00401076 ddda         fstp st(2)
1000:00401078 d9e8         fld1
1000:0040107a d8e2         fsub st(0), st(2)
1000:0040107c d84df4       fmul [ebp-0ch]
1000:0040107f d9ea         fldl2e
1000:00401081 dec9         fmulp st(1), st(0)
1000:00401083 d9c0         fld st(0)
1000:00401085 d9fc         frndint
1000:00401087 d9c9         fxch st(1)
1000:00401089 d8e1         fsub st(0), st(1)
1000:0040108b d9f0         f2xm1
1000:0040108d d845f8       fadd [ebp-08h]
1000:00401090 d9fd         fscale
1000:00401092 d80b         fmul [ebx]
1000:00401094 d80df0294200 fmul [loc_004229f0]
1000:0040109a d9c9         fxch st(1)
1000:0040109c ddd8         fstp st(0)
1000:0040109e d902         fld [edx]
1000:004010a0 dee1         fsubrp st(1), st(0)
1000:004010a2 d955e4       fst [ebp-1ch]
1000:004010a5 d8ca         fmul st(0), st(2)
1000:004010a7 d9c2         fld st(2)
1000:004010a9 d80dec294200 fmul [loc_004229ec]
1000:004010af dec1         faddp st(1), st(0)
1000:004010b1 d9c1         fld st(1)
1000:004010b3 d8cf         fmul st(0), st(7)
1000:004010b5 dee9         fsubp st(1), st(0)
1000:004010b7 ddd6         fst st(6)
1000:004010b9 d8ca         fmul st(0), st(2)
1000:004010bb d9c2         fld st(2)
1000:004010bd d80de8294200 fmul [loc_004229e8]
1000:004010c3 dec1         faddp st(1), st(0)
1000:004010c5 d9c1         fld st(1)
1000:004010c7 d8ce         fmul st(0), st(6)
1000:004010c9 dee9         fsubp st(1), st(0)
1000:004010cb ddd5         fst st(5)
1000:004010cd d8ca         fmul st(0), st(2)
1000:004010cf d9c2         fld st(2)
1000:004010d1 d80de4294200 fmul [loc_004229e4]
1000:004010d7 dec1         faddp st(1), st(0)
1000:004010d9 d9c1         fld st(1)
1000:004010db d8cd         fmul st(0), st(5)
1000:004010dd dee9         fsubp st(1), st(0)
1000:004010df ddd4         fst st(4)
1000:004010e1 d8ca         fmul st(0), st(2)
1000:004010e3 d9c2         fld st(2)
1000:004010e5 d80de0294200 fmul [loc_004229e0]
1000:004010eb dec1         faddp st(1), st(0)
1000:004010ed d9c1         fld st(1)
1000:004010ef d8cc         fmul st(0), st(4)
1000:004010f1 dee9         fsubp st(1), st(0)
1000:004010f3 ddd3         fst st(3)
1000:004010f5 d945e4       fld [ebp-1ch]
1000:004010f8 d91dec294200 fstp [loc_004229ec]
1000:004010fe d9c6         fld st(6)
1000:00401100 d91de8294200 fstp [loc_004229e8]
1000:00401106 d9c5         fld st(5)
1000:00401108 d91de4294200 fstp [loc_004229e4]
1000:0040110e d9c4         fld st(4)
1000:00401110 d91de0294200 fstp [loc_004229e0]
1000:00401116 ddd8         fstp st(0)
1000:00401118 d9c2         fld st(2)
1000:0040111a d8cb         fmul st(0), st(3)
1000:0040111c d8cb         fmul st(0), st(3)
1000:0040111e d84dfc       fmul [ebp-04h]
1000:00401121 d9c3         fld st(3)
1000:00401123 d8e1         fsub st(0), st(1)
1000:00401125 ddd4         fst st(4)
1000:00401127 d91a         fstp [edx]
1000:00401129 ddd8         fstp st(0)
1000:0040112b 83c204       add edx, 04h
1000:0040112e 83c004       add eax, 04h
1000:00401131 83c304       add ebx, 04h
1000:00401134 49           dec ecx
1000:00401135 0f8520ffffff jnz loc_0040105b
1000:0040113b ddd8         fstp st(0)
1000:0040113d ddd8         fstp st(0)
1000:0040113f d91df0294200 fstp [loc_004229f0]
1000:00401145 d91df4294200 fstp [loc_004229f4]
1000:0040114b d91df8294200 fstp [loc_004229f8]
1000:00401151 d91dfc294200 fstp [loc_004229fc]

This is exactly what we have coded. Everything is implemented in the way we wanted; this is 100% our intended code. The access to variables outside the processor is minimized. We use fewer instructions than the compiler generated code. Most of the action takes place inside the processor registers. The capacity of the 8 FPU registers is used optimally (100%). Also, we don't access external variables 32 times per loop (like the first assembly), but only 10 times. And we have replaced the division instruction per loop with a faster multiplication.

So let's look at the performance graph:

[IMAGE]

As you may discover, this is actually about 10 to 20% faster than our hyper-optimised C++ program. That is not bad for a start. You can also see that the Assembler version needs less time to initialise and finalize its work (see the rounded corners of the machine generated code in the graph). This is typical for inline Assembler coding, because we skip all the (useless) standard initialisations and finalizations the compiler sometimes adds to standard C/C++ functions.
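For readers who want to reproduce such a comparison, here is a minimal timing sketch (our own test scaffold, not the article's benchmark code), measuring both versions on the same buffers with the Windows high resolution counter:

#include <windows.h>
#include <stdio.h>

void TimeFilters(float *buf, float *cut, float *res, long samples)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&t0);
    MoogFilter(buf, cut, res, samples);           // compiler optimised version
    QueryPerformanceCounter(&t1);
    printf("C++ version: %f ms\n", (t1.QuadPart - t0.QuadPart) * 1000.0 / freq.QuadPart);

    QueryPerformanceCounter(&t0);
    MoogFilterWithIA(buf, cut, res, samples);     // hand coded inline Assembler version
    QueryPerformanceCounter(&t1);
    printf("IA  version: %f ms\n", (t1.QuadPart - t0.QuadPart) * 1000.0 / freq.QuadPart);
}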


Summary

Coding in Assembler is a hard job. And in the end it is time consuming.
But you can do many things to speed up time critical processes. There are even further possibilities if you use, for instance, the SIMD instruction sets and their new special registers on today's modern microprocessors ...

We hope we could give you some inspiration.


Perspectives

It is not easy to adapt the new SIMD instruction sets to our filter example.
The problem is: each calculated value depends directly on the previous one. ( The Moog filter, like all multi pole filters, is conceptually a fixed serial connection of filter stages. ) So we cannot calculate 4 packed floating point values at the same time without losing all the advantages of the SIMD technology (which works in parallel).

But another possible way seems to be very interesting:
We can actually design a QUADFILTER, that is, a 4 x 4 pole filter in parallel connection, to process 4 independent audio streams at the same time. With SSE/SSE2.
We could even mix the outputs together in real time and design a really fascinating new morphing state variable filter for a new audio plugin.

And the best part is: this Quad Filter would use (in its kernel) nearly the same amount of processing power as our single FPU based Moog Filter! Not less, not more.
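A minimal sketch (our illustration, not the announced QUADFILTER code) of how one pole stage of such a parallel filter could look with SSE intrinsics: four independent filter states are packed into one __m128 and updated with a single set of instructions.

#include <xmmintrin.h>

struct QuadPoleState
{
    __m128 x1;   // previous outputs of this pole, one per audio stream
    __m128 y0;   // previous inputs of this pole, one per audio stream
};

// x0, P and K each carry four independent per-stream values.
inline __m128 QuadPoleStep(QuadPoleState &s, __m128 x0, __m128 P, __m128 K)
{
    // x1 = (x0 * P) + (y0 * P) - (K * x1), computed for 4 streams at once
    __m128 t = _mm_add_ps(_mm_mul_ps(x0, P), _mm_mul_ps(s.y0, P));
    s.x1 = _mm_sub_ps(t, _mm_mul_ps(K, s.x1));
    s.y0 = x0;
    return s.x1;
}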

...to be continued...