
			AMD K6 MPN SUBROUTINES



This directory contains code optimized for AMD K6 CPUs, meaning K6, K6-2 and
K6-3.

The mmx subdirectory has routines using mmx instructions.  All K6s have MMX,
the separate directory is just so that configure can omit it if the
assembler doesn't support MMX.




STATUS

Times for the loops, with all code and data in L1 cache, are as follows.

                                 cycles/limb
	mpn_add_n/sub_n            3.25 normal, 2.75 in-place

	mpn_copyi/copyd            0.56 or 1.0  \
                                                |
	mpn_com_n                  1.0-1.2      | varying with
	mpn_and/andn/ior/xor_n     1.2-1.5      | data alignment
	mpn_iorn/xnor_n            1.5-2.0      |
	mpn_nand/nior_n            1.75-2.0     /

	mpn_l/rshift               1.75

	mpn_mul_1                  6.25
	mpn_add/submul_1           7.65-8.4     (varying with data values)

	mpn_mul_basecase           9.25 cycles/crossproduct (approx)
	mpn_sqr_basecase           6.5  cycles/crossproduct (approx)
                                   or 12 cycles/triangleproduct (approx)

	mpn_divrem_1              20.0
	mpn_mod_1                 20.0

	mpn_divexact_by3          11.0


Prefetching of sources hasn't yet given any joy.  With the 3DNow "prefetch"
instruction, code seems to run slower, and with just "mov" loads it doesn't
seem faster.  Results so far are inconsistent.  The K6 does a hardware
prefetch of the second cache line in a sector, so the penalty for not
prefetching in software is reduced.




NOTES

All K6 family chips have MMX, but only K6-2 and K6-3 have 3DNow.

Branch misprediction penalty is 1 to 4 cycles (Optimization Manual
chapter 6 table 12).

Write-allocate L1 data cache means prefetching of destinations is unnecessary.
Store queue is 7 entries of 64 bits each.

Floating point multiplications can be done in parallel with integer
multiplications, but there doesn't seem to be any way to make use of this.



OPTIMIZATIONS

Unrolled loops are used to reduce looping overhead.  The unrolling is
configurable up to 32 limbs/loop for most routines, up to 64 for some.

Sometimes computed jumps into the unrolling are used to handle sizes not a
multiple of the unrolling.  An attractive feature of this is that times
smoothly increase with operand size, but an indirect jump is about 6 cycles
and the setups about another 6, so it depends on how much the unrolled code
is faster than a simple loop as to whether a computed jump ought to be used.

Position independent code is implemented using a call to get eip for
computed jumps and a ret is always done, rather than an addl $4,%esp or a
popl, so the CPU return address branch prediction stack stays synchronised
with the actual stack in memory.  Such a call however still costs 4 to 7
cycles.

Addressing mode (%esi) causes an instruction to be vector decoded (see
Optimization Manual end of chapter 5, page 62 of rev C).  For this reason
(%esi) is avoided.  0(%esi) doesn't have this problem, and can be used as an
equivalent, or easier is just to use a different register, generally %ebx
has been used.

Branch prediction, in absence of any history, will guess forward jumps are
not taken and backward jumps are taken.  Where possible it's arranged that
the less likely or less important case is under a taken forward jump.



MMX

Putting emms_or_femms as late as possible in a routine seems to be fastest.
Perhaps an emms or femms stalls until all outstanding MMX instructions have
completed, so putting it later gives them a chance to complete on their own,
in parallel with other operations (like register popping).

The Optimization Manual chapter 5 recommends using a femms on K6-2 and K6-3
at the start of a routine, in case it's been preceded by x87 floating point
operations.  This isn't done because in gmp programs it's expected that x87
floating point won't be much used and that chances are an mpn routine won't
have been preceded by any x87 code.



CODING

Instructions in general code are shown paired if they can decode and execute
together, meaning two short decode instructions with the second not
depending on the first, only the first using the shifter, no more than one
load and no more than one store.

K6 does some out of order execution so this isn't essential, it just shows
what slots are available.  When decoding is the limiting factor things can
be scheduled that might not execute until later.



INSTRUCTIONS

adcl	  2 cycles of decode, maybe 2 cycles executing in the X pipe
jecxz	  2 cycles taken, 13 not taken (optimization manual says 7 not taken)
divl	  20 cycles back-to-back
mull	  2 decode, 3 execute (optimization manual decoding sample)
prefetch  2 cycles
rcll/rcrl implicit by one bit: 2 cycles
          immediate or %cl count: 11 + 2 per bit for dword
                                  13 + 4 per bit for byte
xchgl	  %eax,reg 1.5 cycles, back-to-back
          reg,reg 2 cycles, back-to-back
setCC	  2 cycles



REFERENCES

AMD-K6 Processor Code Optimization Application Note, AMD publication number
21924, revision C amendment 0, August 1999.  Available on-line,

	http://www.amd.com/K6/k6docs/pdf/21924.pdf

3DNow Technology Manual, AMD publication number 21928F/0-August 1999.  This
describes the femms and prefetch instructions, but nothing else from 3DNow
has been used.  Available on-line,

	http://www.amd.com/K6/k6docs/pdf/21928.pdf

AMD-K6 MMX Enhanced Processor x86 Code Optimization Application Note, AMD
publication number 21828, revision A amendment 0, August 1997.  This is an
older edition of 21924 above, applying to K6 rather than K6-2 and pretty much
the same except for no 3DNow.  Available on-line,

	http://www.amd.com/K6/k6docs/pdf/21828.pdf



----------------
Local variables:
mode: text
fill-column: 76
End:
