!j development
Programming Article:
Micro surgical implementation of a 4 pole low pass
filter with Inline Assembler
by !jdev
©2003-2006 !j development division audio
This article is copyrighted. All rights reserved. Do not republish it, in whole
or in part, without explicit permission.
Introduction
Well, this is the continuation of our article series about inline Assembly
for digital audio processing and other streaming multimedia applications
using single precision floating point routines.
In the last instalment we had a look at the fantastic possibilities of speed
optimisation with the new instruction sets for the latest IA (Intel Architecture)
microprocessors. Now we will go somewhat deeper and examine how to implement
an analogue filter simulation, by way of example, with (inline)
Assembler coding.
As a real model, we will take the famous "Moog Filter Cascade",
that is, a 4 pole low pass filter or, in other words, the legendary
24 dB/octave resonant filter from the equally legendary dinosaur synthesizers and
analogue filter hardware modules of the past.
Don't misunderstand us here: we will not invent an entirely new concept
of digital filter implementation or anything like that.
No. We simply take an existing exemplary C++ function (available
all over the internet) and transform it into pure inline Assembler code.
By hand, of course, doing it the hard way...
This article is not intended to deliver free stuff to the "communities".
It is aimed at developers who want to benefit from our experience
in optimising time critical processes for real time audio and other
multimedia applications.
So let's start and see what our exemplary C/C++ function looks like:
#include <math.h> // needed for exp()

inline void MoogFilter(float *buffer, float *paramsCutoff, float *paramsReso, long samples)
{
    static float x1 = 0.f;
    static float x2 = 0.f;
    static float x3 = 0.f;
    static float x4 = 0.f;
    static float y0 = 0.f;
    static float y1 = 0.f;
    static float y2 = 0.f;
    static float y3 = 0.f;
    float F;
    register float K;
    register float P;
    float S;
    float R;
    register float x0;
    while (--samples)
    {
        F = (*paramsCutoff);
        K = ((3.6f * F) - (1.6f * (F * F))) - 1.f;
        P = (K + 1.f) * 0.5f;
        S = (float)exp( (1.f - P) * 1.386249f );
        R = (*paramsReso) * S;
        x0 = (*buffer) - (R * x4);
        x1 = (x0 * P) + (y0 * P) - (K * x1);
        x2 = (x1 * P) + (y1 * P) - (K * x2);
        x3 = (x2 * P) + (y2 * P) - (K * x3);
        x4 = (x3 * P) + (y3 * P) - (K * x4);
        y0 = x0;
        y1 = x1;
        y2 = x2;
        y3 = x3;
        x4 = x4 - (((x4 * x4) * x4) / 6.0f);
        *buffer++ = x4;
        paramsCutoff++;
        paramsReso++;
    }
}
One note for "code rippers": The code in this article has
(intentionally) some disadvantages. If you use improper control parameter
values for resonance and/or cutoff, or low sample rates, the filter may become
unstable. Also, there is no built-in denormalization
blocker, so it may cause massive performance problems on new generation
processors when the internal floating point values become very small.
This filter should NOT be used inside any serious digital audio application
or plugin without the necessary modifications to fix that!
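One common countermeasure (our own sketch, not part of the article's filter code) is to add a tiny constant to the recursive path, so that decaying state variables settle at a small but still normal number instead of drifting into the denormal range:

```cpp
#include <cassert>
#include <cfloat>

// Anti-denormal offset: far below audibility, far above the denormal range.
// The value 1e-18f is our assumption; any small normal number works.
static const float kAntiDenormal = 1e-18f;

// Simulate a decaying feedback path (like a filter state after the input
// stops) with the blocker applied on every tick.
inline float decayWithBlocker(float state, float coeff, long ticks)
{
    for (long i = 0; i < ticks; ++i)
        state = (state * coeff) + kAntiDenormal; // settles near kAntiDenormal / (1 - coeff)
    return state;
}
```

Without the added constant, the state would decay into the denormal range after a few thousand ticks, and every multiplication on it could then become dramatically slower on the processors mentioned above.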
As you can see, there is really nothing spectacular. The code performs
in-place processing of a sample stream buffer filled with single precision
floating point values. We assume that this buffer represents an audio
signal and that the control data for the cutoff and resonance parameters is also
in buffer based (per sample streamed) format, equal in length to the
output buffer. We will not explain how the real filter or the digital
clone works, nor will we teach any of the coding languages used here.
However, you may notice the excessive usage of brackets inside the
function above. This is not necessary for C/C++, but it will help us
very much later when we manually design our Assembler code. You have to know the
hierarchy of operators very well if you want to write Assembler code
successfully.
Unlike other examples on the internet, this filter is intended for (real)
real time processing; that means the control parameters are absolutely
sample accurate, not calculated only once per loop, but with each new
sample. You can use such a filter, for instance, to filter audio streams
controlled by fast envelopes or LFOs without any quantization in the control
or audio signal. It is even predestined to support sample accurate filter
cross modulation (filter frequency modulation). So the parameters will
be recalculated on each new sample tick. This also helps to analyse the
speed differences of our opcodes with fewer loop iterations.
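As an illustration of such per sample control data, a cutoff buffer could be filled from an LFO like this (a sketch under our own assumptions; the function and parameter names are ours, not from the article):

```cpp
#include <cassert>
#include <cmath>

// Fill a per-sample cutoff buffer from a sine LFO, so the filter sees a
// new, sample accurate control value on every single tick.
inline void fillCutoffBuffer(float *paramsCutoff, long samples,
                             float lfoRateHz, float sampleRate)
{
    const float twoPi = 6.2831853f;
    for (long i = 0; i < samples; ++i)
    {
        float lfo = sinf(twoPi * lfoRateHz * (float)i / sampleRate); // -1..1
        paramsCutoff[i] = 0.5f + 0.45f * lfo; // keep F inside a safe 0.05..0.95 range
    }
}
```

The resulting buffer has exactly the length of the audio block, as the text above requires.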
Why buffer based?
As we have already seen in the last article, buffer based processing
is the fastest way possible. The reason is very simple: if you implement
such a relatively complex algorithm (like the digitally modelled analogue
filter above) with single sample processing (one function call per tick),
the processor has to jump around heavily while executing the code for
each sample. The overall speed will be slowed down enormously. Also, we
cannot use the great benefits the processor, especially the FPU (floating
point unit), offers. This means: holding the variables in registers and
doing as few jumps as possible, while we process as many samples as
possible with one function call. This speeds things up enormously!
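The difference can be sketched with a much simpler one pole filter (our own illustration, not the Moog code): the per tick variant pays a function call and a state load/store for every sample, while the block variant loads the state once, keeps it in a local (register) variable inside the loop, and stores it once.

```cpp
#include <cassert>
#include <cmath>

struct OnePole { float z; };

// one function call (plus state load/store) per sample tick
inline float tickOnePole(OnePole &f, float in, float c)
{
    f.z += c * (in - f.z);
    return f.z;
}

// one function call per buffer; the state lives in a local variable
inline void processOnePoleBlock(OnePole &f, float *buf, long samples, float c)
{
    float z = f.z;                       // load state once
    for (long i = 0; i < samples; ++i)
    {
        z += c * (buf[i] - z);
        buf[i] = z;
    }
    f.z = z;                             // store state once
}
```

Both variants produce the same output samples; the block version merely removes the per sample overhead.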
If you begin to think in the manner of a microprocessor, you will see
very quickly how many advantages buffer based processing and the
additional use of hand coded inline Assembly have in comparison to automatic
optimisation techniques and/or "one sample tick" based calculations.
Register variables
The FPU adds 8 additional 80 bit registers ( st(0) to st(7) ) to the available
general purpose registers of the main processor. This so called "FPU
stack" is very often not optimally used, even if you compile with highly
optimising HLL compilers.
We will see later what such a HLL compiler does with the function above.
It is really not what you might expect. Mostly only two or three of the
FPU registers will be used. And the jumping around such code finally
causes is indeed questionable...
The extensive usage of all available registers inside the CPU is very
important for fast execution. Why?
Each time the processor has to access variables in memory, it needs
time. Even today, ultra fast memory is very expensive. And so the memory
modules you can usually buy are relatively slow in comparison to the high
speed of today's CPUs. The result is the so called ( and by manufacturers
well hidden ) "wait states" of the CPU; that means it has to wait
until the data from memory is finally transferred into the registers, where
all the action happens. The same applies vice versa when storing the processed
data back to memory locations outside the CPU registers.
Okay, you may say: there is something called a "cache" inside
my processor that will help to reduce those wait states. Right.
But this mysterious "cache" memory needs some time too. It
can actually cost more time in some situations: the processor has
to look into the cache first, and if it can't find the desired memory
object there, it will nevertheless be forced to access the slower main memory...
which needs even more time, as you can see.
The only logical consequence is to hold as many of the variables
a function uses as possible inside the registers of the CPU, with our filter
code especially the 8 FPU registers. The access time to internal register
variables is approximately zero, and we can speed our code up to a new
dimension.
Unfortunately, most HLL compilers have become extremely ignorant regarding
our instructions to use "register variables" declared in
C/C++ or other HLL source code. The compiler will actually do what "he"
wants and completely ignore all our well thought out "register" keyword
declarations.
Remember our function. We declared at least 3 register variables:
register float K;
register float P;
register float x0;
Now try to find those variables inside the disassembly below...
They have simply been ignored, or removed by the automatic optimizations
of the compiler!
The only way to solve this problem and force the compiler to do exactly
what we want is finally ... what a surprise ... coding with (inline) Assembler.
The inline assembly will be executed 100% conforming to our coding.
The Disassembly
Now, take a look at the disassembly the compiler produces if we compile
our C++ function with the highest possible (hyper) optimisation (all opt.
switches on: /O2 /Og /Ob2 /Ot /G6 ...) on VC++ 7:
...
1000:00401000 loc_00401000:
1000:00401000 51           push ecx
1000:00401001 53           push ebx
1000:00401002 56           push esi
1000:00401003 57           push edi
1000:00401004 6660         pusha
1000:00401006 6661         popa
1000:00401008 8b742420     mov esi, [esp+20h]
1000:0040100c 4e           dec esi
1000:0040100d 0f8448010000 jz loc_0040115b
1000:00401013 8b54241c     mov edx, [esp+1ch]
1000:00401017 8b4c2418     mov ecx, [esp+18h]
1000:0040101b 8b442414     mov eax, [esp+14h]
1000:0040101f 90           nop
1000:00401020
; XREFS First: 1000:00401155 Number : 1
1000:00401020 loc_00401020:
1000:00401020 d901         fld [ecx]
1000:00401022 83c004       add eax, 04h
1000:00401025 d905e4804000 fld [loc_004080e4]
1000:0040102b 83c104       add ecx, 04h
1000:0040102e d8c9         fmul st(0), st(1)
1000:00401030 83c204       add edx, 04h
1000:00401033 d9c1         fld st(1)
1000:00401035 d8ca         fmul st(0), st(2)
1000:00401037 d80de0804000 fmul [loc_004080e0]
1000:0040103d dee9         fsubp st(1), st(0)
1000:0040103f d825dc804000 fsub [loc_004080dc]
1000:00401045 d95c240c     fstp [esp+0ch]
1000:00401049 ddd8         fstp st(0)
1000:0040104b d944240c     fld [esp+0ch]
1000:0040104f d805dc804000 fadd [loc_004080dc]
1000:00401055 d80dd8804000 fmul [loc_004080d8]
1000:0040105b d95c2420     fstp [esp+20h]
1000:0040105f d905dc804000 fld [loc_004080dc]
1000:00401065 d8642420     fsub [esp+20h]
1000:00401069 d80dd4804000 fmul [loc_004080d4]
1000:0040106f d9ea         fldl2e
1000:00401071 dec9         fmulp st(1), st(0)
1000:00401073 d9c0         fld st(0)
1000:00401075 d9fc         frndint
1000:00401077 d9c9         fxch st(1)
1000:00401079 d8e1         fsub st(0), st(1)
1000:0040107b d9f0         f2xm1
1000:0040107d d9e8         fld1
1000:0040107f dec1         faddp st(1), st(0)
1000:00401081 d9fd         fscale
1000:00401083 ddd9         fstp st(1)
1000:00401085 d84afc       fmul [edx-04h]
1000:00401088 d80dfca94600 fmul [loc_0046a9fc]
1000:0040108e d868fc       fsubr [eax-04h]
1000:00401091 d905f8a94600 fld [loc_0046a9f8]
1000:00401097 d8c1         fadd st(0), st(1)
1000:00401099 d84c2420     fmul [esp+20h]
1000:0040109d d905f4a94600 fld [loc_0046a9f4]
1000:004010a3 d84c240c     fmul [esp+0ch]
1000:004010a7 dee9         fsubp st(1), st(0)
1000:004010a9 d91df4a94600 fstp [loc_0046a9f4]
1000:004010af d905f0a94600 fld [loc_0046a9f0]
1000:004010b5 d805f4a94600 fadd [loc_0046a9f4]
1000:004010bb d84c2420     fmul [esp+20h]
1000:004010bf d905eca94600 fld [loc_0046a9ec]
1000:004010c5 d84c240c     fmul [esp+0ch]
1000:004010c9 dee9         fsubp st(1), st(0)
1000:004010cb d91deca94600 fstp [loc_0046a9ec]
1000:004010d1 d905e8a94600 fld [loc_0046a9e8]
1000:004010d7 d805eca94600 fadd [loc_0046a9ec]
1000:004010dd d84c2420     fmul [esp+20h]
1000:004010e1 d905e4a94600 fld [loc_0046a9e4]
1000:004010e7 d84c240c     fmul [esp+0ch]
1000:004010eb dee9         fsubp st(1), st(0)
1000:004010ed d91de4a94600 fstp [loc_0046a9e4]
1000:004010f3 8b3de4a94600 mov edi, [loc_0046a9e4]
1000:004010f9 d905e0a94600 fld [loc_0046a9e0]
1000:004010ff 893de0a94600 mov [loc_0046a9e0], edi
1000:00401105 d805e4a94600 fadd [loc_0046a9e4]
1000:0040110b 8b3df4a94600 mov edi, [loc_0046a9f4]
1000:00401111 893df0a94600 mov [loc_0046a9f0], edi
1000:00401117 8b3deca94600 mov edi, [loc_0046a9ec]
1000:0040111d d84c2420     fmul [esp+20h]
1000:00401121 893de8a94600 mov [loc_0046a9e8], edi
1000:00401127 d905fca94600 fld [loc_0046a9fc]
1000:0040112d d84c240c     fmul [esp+0ch]
1000:00401131 dee9         fsubp st(1), st(0)
1000:00401133 d9c9         fxch st(1)
1000:00401135 d91df8a94600 fstp [loc_0046a9f8]
1000:0040113b d9c0         fld st(0)
1000:0040113d d8c9         fmul st(0), st(1)
1000:0040113f d8c9         fmul st(0), st(1)
1000:00401141 d80dd0804000 fmul [loc_004080d0]
1000:00401147 d8e9         fsubr st(0), st(1)
1000:00401149 d915fca94600 fst [loc_0046a9fc]
1000:0040114f ddd9         fstp st(1)
1000:00401151 d958fc       fstp [eax-04h]
1000:00401154 4e           dec esi
1000:00401155 0f85c5feffff jnz loc_00401020
1000:0040115b
; XREFS First: 1000:0040100d Number : 1
1000:0040115b loc_0040115b:
1000:0040115b 6660         pusha
1000:0040115d 6661         popa
1000:0040115f 5f           pop edi
1000:00401160 5e           pop esi
1000:00401161 5b           pop ebx
1000:00401162 59           pop ecx
1000:00401163 c3           ret
...
Note: The pusha/popa sequences at the beginning and the end are
markers inserted by us to identify the function exactly inside this disassembly.
We used VC++ 7 to compile the filter example and Borg to disassemble the
executable file. Those markers are not used in the real test function.
Isn't that somehow ... strange ???
Not only has the compiler completely ignored our register instructions;
about 32 times the code has to get and put variables from and to
external memory locations (loc_00whatever). There are also some more strange
things which make our filter slower than necessary. Please
take a look at the FPU register usage. It is not satisfactory, as you can
see: only st(0), st(1) and once st(3) are effectively used.
So we decided to start from scratch and transform our function
manually into Assembler code with our own logic.
The Coding
We want to do some things somewhat differently. So let's take another look
at the variables of our function.
static float x1 = 0.f;
static float x2 = 0.f;
static float x3 = 0.f;
static float x4 = 0.f;
static float y0 = 0.f;
static float y1 = 0.f;
static float y2 = 0.f;
static float y3 = 0.f;
float F;
register float K;
register float P;
float S;
float R;
float x0;
Note: The static variables must be remembered until the next loop
executes. We could also declare them global to the program or implement
them as class members. Some other variables are used excessively, so we
want to hold those inside the registers.
First, we will try to hold as many of the variables as possible inside the registers
of the FPU, so we can expect an access time of nearly zero for those.
As said, there are 8 FPU registers we can actually use for this. The variables
x1, x2, x3, x4, K and P are predestined for holding inside the registers
while looping, because we use them excessively (multiple times) in the
calculations inside the loop. Especially the variables K and P are important,
because they are needed for each of the 4 "pole calculations" of our
filter code. This shows clearly how important loop based processing
can be for a speed increase.
Second, we will calculate most of the coefficients internally, without
using any additional variable storage. For instance: the variables F,
S and R are only helper variables in our C/C++ code, created anew on each
access to pre-calculate some values. They do not need to be saved for the
next loop, so we can pre-calculate them internally inside the
FPU registers on demand.
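One of those internal calculations is exp(), which appears in the Assembler listing below as the fldl2e / frndint / f2xm1 / fscale sequence. In plain C the same decomposition looks roughly like this (our reconstruction of the idiom, for illustration only):

```cpp
#include <cassert>
#include <cmath>

// exp(v) = 2^(v * log2(e)), split into an integer and a fractional power:
// the FPU computes 2^frac - 1 with f2xm1 and rescales by 2^int with fscale.
inline float expViaPow2(float v)
{
    float t = v * 1.44269504f;       // v * log2(e)        (fldl2e, fmulp)
    float i = nearbyintf(t);         // round to integer   (frndint)
    float f = t - i;                 // fractional part    (fsub)
    return ldexpf(exp2f(f), (int)i); // (2^f) * 2^i        (f2xm1 + fld1, fscale)
}
```

Rounding to the nearest integer keeps the fractional part inside [-0.5, 0.5], which is exactly the safe operand range the f2xm1 instruction expects.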
Third, we will replace all divisions inside the code, because the division
instruction generally needs many cycles to compute. Fortunately, we can
replace many division instructions with multiplications by pre-calculated
coefficients. So we will, for example, not divide by 6.0f, but multiply by a 1.0f
/ 6.0f constant (0.16666...f). This may produce slightly different results
due to rounding error, but in our case this is insignificant.
A digitally simulated analogue filter need not be a perfect mathematical
construct, because the real (analogue) filter it models is a highly unstable piece
of technology, which finally causes the characteristic sound of a Moog filter.
But that is a completely different theme, not discussed more deeply here ...
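In C the replacement looks like this (a minimal sketch of the technique; the function names are ours):

```cpp
#include <cassert>
#include <cmath>

static const float kRecip6 = 1.0f / 6.0f; // pre-calculated once, outside the loop

// the original form: one division per sample
inline float cubicTermDiv(float x4) { return x4 - (((x4 * x4) * x4) / 6.0f); }

// the replacement: one multiplication per sample
inline float cubicTermMul(float x4) { return x4 - (((x4 * x4) * x4) * kRecip6); }
```

For values inside the usual audio range the two results differ at most by a rounding error in the last bits of the mantissa.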
The Flow
This all sounds very simple, but in fact we need a well thought out structure
for our function's "flow". We want to prevent our code from doing things
twice or inefficiently, from jumping around, and so on. Finally, we want to
create something better than the machine generated optimizations could ever
produce.
A good way is to analyse the machine generated code in detail and see
what's going on. We have already discovered some aspects of this. Now
we create, step by step, an alternative that does things better. There is more than
one possibility; we just want to find one of them.
Unfortunately, we cannot use immediate values with the FPU instructions,
so we have to create some extra variables for the constants in our code.
But the compiler would do exactly the same, so it isn't that bad.
There is another very tricky thing about FPU register usage: st(0)
to st(7) finally form a LIFO stack. This means "last in, first out".
So the positions of our register variables change each time a new
value is pushed or popped. This can drive you somewhat crazy, because
it is not so easy to follow this mechanism in your head. Often you have to
debug the single instructions step by step to find a possible bug. But this
is no problem with the new generation of HLL (high level language) compilers,
like Microsoft ® VC++. You can easily debug inside inline Assembler
code.
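To make the LIFO behaviour easier to follow, here is a tiny software model of the FPU register stack (our own illustration only; the real stack is fixed at 8 entries): fld pushes a new st(0) and shifts every older value one slot down, fstp pops st(0).

```cpp
#include <cassert>
#include <vector>

struct FpuStackModel
{
    std::vector<double> st;                           // st[0] plays the role of st(0)
    void fld(double v)  { st.insert(st.begin(), v); } // push: every st(i) becomes st(i+1)
    double fstp()       { double v = st[0];           // pop st(0): st(1) becomes st(0)
                          st.erase(st.begin());
                          return v; }
};
```

After `fld a; fld b;` the value b sits in st(0) and a has moved down to st(1); one fstp later, a is st(0) again. This is exactly the renumbering you have to track in your head while reading the listing below.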
Now let us take a look at our inline Assembler code (it needed some hours
to write and debug):
inline void MoogFilterWithIA(float *buffer, float *paramsCutoff, float *paramsReso, long samples)
{
    // some static (remembered) vars
    static float x1 = 0.f; // saved per block / held inside registers
    static float x2 = 0.f; // saved per block / held inside registers
    static float x3 = 0.f; // saved per block / held inside registers
    static float x4 = 0.f; // saved per block / held inside registers
    static float y0 = 0.f; // saved per loop
    static float y1 = 0.f; // saved per loop
    static float y2 = 0.f; // saved per loop
    static float y3 = 0.f; // saved per loop
    // some coefficients
    float K;  // register
    float P;  // register
    //float F; // calculated internally
    //float S; // calculated internally
    //float R; // calculated internally
    float x0; // temporarily used
    // some calculation constants
    float c1 = 3.6f;
    float c2 = 1.6f;
    float c3 = 1.0f;
    float c4 = 0.5f;
    float c5 = 1.386249f;
    float c6 = 1.f / 6.f; // used with multiplication
    float c7 = 6.f;       // used with division
    __asm
    {
        // initialize general purpose registers
        mov eax, paramsCutoff
        mov ebx, paramsReso
        mov ecx, samples
        mov edx, buffer
        // preload FPU registers
        // FPU register usage: (st1..8)
        fld x1                  //->1
        fld x2                  //->2
        fld x3                  //->3
        fld x4                  //->4
        fld P                   //->5
        fld K                   //->6
        // loop begin
repeat:
        // K = ((3.6f * F) - (1.6f * (F * F))) - 1.f;
        fld dword ptr [eax]     ;st0 = [eax] = (*Cutoff) = (F)              //->7
        fmul st(0), st(0)       ;st0 = F * F
        fmul c2                 ;st0 = (F * F) * 1.6f
        fld c1                  ;st0 = 3.6f                                 //->8
        fmul dword ptr [eax]    ;st0 = 3.6f * F
        fsubrp st(1), st(0)     ;st0 = (3.6f * F) - ((F * F) * 1.6f)        //->7
        fld1                    ;st0 = 1.0                                  //->8
        fsubp st(1), st(0)      ;st0 = ((3.6f * F) - ((F * F) * 1.6f)) - 1.f //->7
        fstp st(1)              ;st0 (K) -> st1                             //->6
        // P = (K + 1.f) * 0.5f;
        fld1                    ;st0 = 1.f                                  //->7
        fadd st(0), st(1)       ;st0 = K + 1.f
        fmul c4                 ;st0 = (K + 1) * 0.5f
        fstp st(2)              ;st0 (P) -> st2                             //->6
        // S = (float)exp( (1.f - P) * 1.386249f );
        fld1                    ;st0 = 1.f                                  //->7
        fsub st(0), st(2)       ;st0 = 1.f - P
        fmul c5                 ;st0 = (1.f - P) * 1.386249f
        // exp() sub function   ;st0 (S) = exp( (1.f - P) * 1.386249f )
        fldl2e                  ;st0 = (lbe)                                //->8
        fmulp st(1), st(0)      ;st0 = value * (lbe)                        //->7
        fld st(0)               ;duplicate st0                              //->8
        frndint                 ;st0 = frndint(st0)
        fxch st(1)              ;st0 <-> st1
        fsub st(0), st(1)       ;st0 -= st1
        f2xm1                   ;st0 = f2xm1(st0)
        fadd c3                 ;st0 += 1.f
        fscale                  ;st0 (S) = fscale(st0, st1)
        // R = (*paramsReso) * S;
        fmul dword ptr [ebx]    ;st0 (R) = (S) * (*Reso)
        // x0 = (*buffer) - (R * x4);
        fmul x4                 ;st0 = R * x4
        fxch                    ;st0 <-> st1
        fstp st(0)              ;pop                                        //->7
        fld dword ptr [edx]     ;st0 = (*Buffer)                            //->8
        fsubrp st(1), st(0)     ;st0 (x0) = (*Buffer) - (R * x4)            //->7
        fst x0
        // x1 = ((x0 * P) + (y0 * P)) - (K * x1);
        fmul st(0), st(2)       ;st0 = x0 * P
        fld st(2)               ;st0 = P                                    //->8
        fmul y0                 ;st0 = P * y0
        faddp st(1), st(0)      ;st0 = (y0 * P) + (x0 * P)                  //->7
        fld st(1)               ;st0 = K                                    //->8
        fmul st(0), st(7)       ;st0 = K * x1
        fsubp st(1), st(0)      ;st0 (x1) = ((y0 * P) + (x0 * P)) - (K * x1) //->7
        fst st(6)               ;st0 (x1) -> st6
        // x2 = ((x1 * P) + (y1 * P)) - (K * x2);
        fmul st(0), st(2)       ;st0 = x1 * P
        fld st(2)               ;st0 = P                                    //->8
        fmul y1                 ;st0 = P * y1
        faddp st(1), st(0)      ;st0 = (y1 * P) + (x1 * P)                  //->7
        fld st(1)               ;st0 = K                                    //->8
        fmul st(0), st(6)       ;st0 = K * x2
        fsubp st(1), st(0)      ;st0 (x2) = ((y1 * P) + (x1 * P)) - (K * x2) //->7
        fst st(5)               ;st0 (x2) -> st5
        // x3 = ((x2 * P) + (y2 * P)) - (K * x3);
        fmul st(0), st(2)       ;st0 = x2 * P
        fld st(2)               ;st0 = P                                    //->8
        fmul y2                 ;st0 = P * y2
        faddp st(1), st(0)      ;st0 = (y2 * P) + (x2 * P)                  //->7
        fld st(1)               ;st0 = K                                    //->8
        fmul st(0), st(5)       ;st0 = K * x3
        fsubp st(1), st(0)      ;st0 (x3) = ((y2 * P) + (x2 * P)) - (K * x3) //->7
        fst st(4)               ;st0 (x3) -> st4
        // x4 = ((x3 * P) + (y3 * P)) - (K * x4);
        fmul st(0), st(2)       ;st0 = x3 * P
        fld st(2)               ;st0 = P                                    //->8
        fmul y3                 ;st0 = P * y3
        faddp st(1), st(0)      ;st0 = (y3 * P) + (x3 * P)                  //->7
        fld st(1)               ;st0 = K                                    //->8
        fmul st(0), st(4)       ;st0 = K * x4
        fsubp st(1), st(0)      ;st0 (x4) = ((y3 * P) + (x3 * P)) - (K * x4) //->7
        fst st(3)               ;st0 (x4) -> st3
        // save old values
        // y0 = x0;
        fld x0                  //->8
        fstp y0                 //->7
        // y1 = x1;
        fld st(6)               //->8
        fstp y1                 //->7
        // y2 = x2;
        fld st(5)               //->8
        fstp y2                 //->7
        // y3 = x3;
        fld st(4)               //->8
        fstp y3                 //->7
        fstp st(0)              ;pop                                        //->6
        // x4 = x4 - (((x4 * x4) * x4) / 6.0f);
        fld st(2)               ;st0 = x4                                   //->7
        fmul st(0), st(3)       ;st0 = x4 * x4
        fmul st(0), st(3)       ;st0 = x4 * x4 * x4
        //fdiv c7               ;st0 = ((x4 * x4) * x4) / 6.f
        fmul c6                 ;st0 = (x4 * x4 * x4) * (1.f / 6.f)
        fld st(3)               ;st0 = x4                                   //->8
        fsub st(0), st(1)       ;st0 (x4) = x4 - (((x4 * x4) * x4) / 6.0f)
        fst st(4)               ;st0 -> st4 (x4)
        // *buffer = x4;
        fstp dword ptr [edx]    ;store the sample                           //->7
        //========
        fstp st(0)              ;pop                                        //->6
        // buffer++;
        add edx, 4              ;inc Buffer pointer
        // paramsCutoff++;
        add eax, 4              ;inc Cutoff pointer
        // paramsReso++;
        add ebx, 4              ;inc Reso pointer
        // loop end
        dec ecx                 ;decrease count
        jnz repeat              ;jump, if not 0
        // finalize             ;store the vars for next loop
        fstp st(0)              ;                                           //->5
        fstp st(0)              ;                                           //->4
        fstp x4                 ;                                           //->3
        fstp x3                 ;                                           //->2
        fstp x2                 ;                                           //->1
        fstp x1                 ;                                           //->0
    }
}
Note: The "dword ptr" operators are merely 4 byte size specifiers.
We used them because some Assemblers don't know the REAL32 keyword or
macro. Because the FPU internally always works with 80 bit precision, the
processor must know how big the loaded/stored floating point
value is (to interpret the value at the given address correctly).
At a very first glance one could think: "Oh my god, this actually
looks much bigger and (therefore) it is surely slower!"
No, this is not slower, as you will see; it is even clearly faster
than our highly optimised machine generated assembly above. The reason
for the obviously bigger size is that we have written conceptually "clear"
code with many comments to demonstrate and describe our construction.
It may not be the fastest Assembler code possible for our filter
algorithm, but for a start it is not that bad - we think.
This is the new disassembly:
1000:00401031 8b450c
mov eax, [ebp+0ch]
1000:00401034 8b5d10 mov ebx, [ebp+10h]
1000:00401037 8b4d14 mov ecx, [ebp+14h]
1000:0040103a 8b5508 mov edx, [ebp+08h]
1000:0040103d d905fc294200 fld [loc_004229fc]
1000:00401043 d905f8294200 fld [loc_004229f8]
1000:00401049 d905f4294200 fld [loc_004229f4]
1000:0040104f d905f0294200 fld [loc_004229f0]
1000:00401055 d945e0 fld [ebp-20h]
1000:00401058 d945e0 fld [ebp-20h]
1000:0040105b
; XREFS First: 1000:00401135 Number : 1
1000:0040105b loc_0040105b:
1000:0040105b d900 fld
[eax]
1000:0040105d dcc8 fmul
st(0), st(0)
1000:0040105f d84de8 fmul [ebp-18h]
1000:00401062 d945ec fld [ebp-14h]
1000:00401065 d808 fmul
[eax]
1000:00401067 dee1 fsubrp
st(1), st(0)
1000:00401069 d9e8 fld1
1000:0040106b dee9 fsubp
st(1), st(0)
1000:0040106d ddd9 fstp
st(1)
1000:0040106f d9e8 fld1
1000:00401071 d8c1 fadd
st(0), st(1)
1000:00401073 d84df0 fmul [ebp-10h]
1000:00401076 ddda fstp
st(2)
1000:00401078 d9e8 fld1
1000:0040107a d8e2 fsub
st(0), st(2)
1000:0040107c d84df4 fmul [ebp-0ch]
1000:0040107f d9ea fldl2e
1000:00401081 dec9 fmulp
st(1), st(0)
1000:00401083 d9c0 fld
st(0)
1000:00401085 d9fc frndint
1000:00401087 d9c9 fxch
st(1)
1000:00401089 d8e1 fsub
st(0), st(1)
1000:0040108b d9f0 f2xm1
1000:0040108d d845f8 fadd [ebp-08h]
1000:00401090 d9fd fscale
1000:00401092 d80b fmul
[ebx]
1000:00401094 d80df0294200 fmul [loc_004229f0]
1000:0040109a d9c9 fxch
st(1)
1000:0040109c ddd8 fstp
st(0)
1000:0040109e d902 fld
[edx]
1000:004010a0 dee1 fsubrp
st(1), st(0)
1000:004010a2 d955e4 fst [ebp-1ch]
1000:004010a5 d8ca fmul
st(0), st(2)
1000:004010a7 d9c2 fld
st(2)
1000:004010a9 d80dec294200 fmul [loc_004229ec]
1000:004010af dec1 faddp
st(1), st(0)
1000:004010b1 d9c1 fld
st(1)
1000:004010b3 d8cf fmul
st(0), st(7)
1000:004010b5 dee9 fsubp
st(1), st(0)
1000:004010b7 ddd6 fst
st(6)
1000:004010b9 d8ca fmul
st(0), st(2)
1000:004010bb d9c2 fld
st(2)
1000:004010bd d80de8294200 fmul [loc_004229e8]
1000:004010c3 dec1 faddp
st(1), st(0)
1000:004010c5 d9c1 fld
st(1)
1000:004010c7 d8ce fmul
st(0), st(6)
1000:004010c9 dee9 fsubp
st(1), st(0)
1000:004010cb ddd5 fst
st(5)
1000:004010cd d8ca fmul
st(0), st(2)
1000:004010cf d9c2 fld
st(2)
1000:004010d1 d80de4294200 fmul [loc_004229e4]
1000:004010d7 dec1 faddp
st(1), st(0)
1000:004010d9 d9c1 fld
st(1)
1000:004010db d8cd fmul
st(0), st(5)
1000:004010dd dee9 fsubp
st(1), st(0)
1000:004010df ddd4 fst
st(4)
1000:004010e1 d8ca fmul
st(0), st(2)
1000:004010e3 d9c2 fld
st(2)
1000:004010e5 d80de0294200 fmul [loc_004229e0]
1000:004010eb dec1 faddp
st(1), st(0)
1000:004010ed d9c1 fld
st(1)
1000:004010ef d8cc fmul
st(0), st(4)
1000:004010f1 dee9 fsubp
st(1), st(0)
1000:004010f3 ddd3 fst
st(3)
1000:004010f5 d945e4 fld [ebp-1ch]
1000:004010f8 d91dec294200 fstp [loc_004229ec]
1000:004010fe d9c6 fld
st(6)
1000:00401100 d91de8294200 fstp [loc_004229e8]
1000:00401106 d9c5 fld
st(5)
1000:00401108 d91de4294200 fstp [loc_004229e4]
1000:0040110e d9c4 fld
st(4)
1000:00401110 d91de0294200 fstp [loc_004229e0]
1000:00401116 ddd8 fstp
st(0)
1000:00401118 d9c2 fld
st(2)
1000:0040111a d8cb fmul
st(0), st(3)
1000:0040111c d8cb fmul
st(0), st(3)
1000:0040111e d84dfc fmul [ebp-04h]
1000:00401121 d9c3 fld
st(3)
1000:00401123 d8e1 fsub
st(0), st(1)
1000:00401125 ddd4 fst
st(4)
1000:00401127 d91a fstp
[edx]
1000:00401129 ddd8 fstp
st(0)
1000:0040112b 83c204 add edx, 04h
1000:0040112e 83c004 add eax, 04h
1000:00401131 83c304 add ebx, 04h
1000:00401134 49
dec ecx
1000:00401135 0f8520ffffff jnz loc_0040105b
1000:0040113b ddd8 fstp
st(0)
1000:0040113d ddd8 fstp
st(0)
1000:0040113f d91df0294200 fstp [loc_004229f0]
1000:00401145 d91df4294200 fstp [loc_004229f4]
1000:0040114b d91df8294200 fstp [loc_004229f8]
1000:00401151 d91dfc294200 fstp [loc_004229fc]
This is exactly what we coded. Everything is implemented the
way we wanted; this is 100% our intended code. Access to variables outside
the processor is minimized. We use fewer instructions than the compiler
generated code. Most of the action takes place inside the processor registers.
The capacity of the 8 FPU registers is optimally (100%) used. Also, we don't
access external variables 32 times inside the loop (like the first assembly),
but only 10 times. And we have replaced the division instruction per loop with
a faster multiplication.
So let's look at the performance graph:
[IMAGE]
As you may discover, this is actually about 10 to 20% faster than our
hyper-optimised C++ program. That is not bad for a start. You can also
see that the Assembler code needs less time to initialise and
finalize its action (see the rounded corners of the machine generated
code). This is typical for inline Assembler coding, because we skip
all the (useless) standard initialisations and finalizations the compiler
sometimes adds to standard C/C++ functions.
Summary
Coding in Assembler is a hard job. And it is time consuming.
But you can do many things to speed up time critical processes. There
are even further possibilities if you use, for instance, the SIMD instruction
sets and their new special registers on today's modern microprocessors ...
We hope we could give you some inspiration.
Perspectives
It is not easy to adapt the new SIMD instruction sets to our filter example.
The problem is: each calculated value depends directly on the previous
one. ( The Moog filter, like all multi pole filters, is conceptually a fixed
serial connection of stages. ) So we cannot calculate 4 packed floating point
values at the same time without losing all the advantages of the SIMD technology
(which performs in parallel).
But another possible way seems very interesting:
We could actually design a QUADFILTER, that is, a 4 x 4 pole filter in
parallel connection, to process 4 independent audio streams at the same
time. With SSE/SSE2.
We could even mix the outputs together in real time and design a really
fascinating new morphing state variable filter for a new audio plugin.
And the best part is: this Quad Filter would use (in its kernel) nearly the
same amount of performance as our single FPU based Moog Filter! Not
less, not more.
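As a first sketch of that idea (our assumption of how such a kernel could start out, not a finished design), four independent one pole low passes can run in parallel, one per SSE lane; the full 4 pole cascade would simply repeat this pattern four times per sample:

```cpp
#include <cassert>
#include <xmmintrin.h> // SSE intrinsics

// One filter stage for 4 independent audio streams at once:
// state += coeff * (input - state), computed lane-wise.
inline void quadOnePoleStage(float out[4], const float in[4],
                             float state[4], float coeff)
{
    __m128 z = _mm_loadu_ps(state);
    __m128 x = _mm_loadu_ps(in);
    __m128 c = _mm_set1_ps(coeff);
    z = _mm_add_ps(z, _mm_mul_ps(c, _mm_sub_ps(x, z))); // 4 filters, one instruction stream
    _mm_storeu_ps(state, z);
    _mm_storeu_ps(out, z);
}
```

Because each lane is a completely independent stream, the serial dependency problem described above simply disappears: the four filters never need each other's intermediate values.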
...to be continued...