A Tour of NTL

----------------------------------------------------------------------------

Table of Contents

  1. Introduction
  2. Examples
  3. Summary of NTL's Main Modules
  4. NTL Implementation and Portability
  5. Some Performance Data
  6. Obtaining and Installing NTL
  7. Changes between NTL 1.0 and NTL 1.5

Introduction

NTL is a high-performance, portable C++ library providing data structures
and algorithms for manipulating signed, arbitrary length integers, and for
vectors, matrices, and polynomials over the integers and over finite fields.

NTL uses state-of-the-art algorithms. In particular, its code for
polynomial arithmetic is one of the fastest available. Early versions of NTL
have been used to set "world records" for polynomial factorization and point
counting on elliptic curves.

NTL is written entirely in C++, and can be easily installed on just about
any Unix platform, including PCs, and 32- and 64-bit workstations. Despite
the fact that NTL is written in C++ and avoids assembly, NTL's performance
is generally much better than is typical of such portable libraries.

NTL is relatively easy to use, and it provides a good environment for easily
and quickly implementing new number-theoretic algorithms, without
sacrificing performance.

NTL is free software that is intended for research and educational purposes
only.

NTL is by no means a "complete" computer algebra package. Instead of having
NTL grow into a huge package, the aim is that NTL itself will serve as a
stable, portable platform for implementing other algorithms. It is hoped
that interested users with relevant expertise will implement algorithms on
top of NTL and make their software available to other NTL users. Links to
such software will be provided on the NTL page.

----------------------------------------------------------------------------

Examples

Perhaps the best way to introduce the basics of NTL is by way of example.
Please note, however, that there is much more functionality provided by NTL
than is illustrated in these examples.

Example 1

The first example makes use of the class ZZ, which represents signed,
arbitrary length integers. This program reads two numbers and prints their
product:

#include "ZZ.h"

int main()
{
   ZZ a, b, prod;

   cin >> a;
   cin >> b;
   mul(prod, a, b);
   cout << prod << "\n";

   return 0;
}

Note that in NTL, one writes

   mul(prod, a, b);

instead of

   prod = a * b;

as many C++ programmers might expect. The reason is that the latter form
ultimately leads to a lot of unnecessary copying, and allocating and freeing
of memory, which might not be apparent to the programmer. In designing NTL,
efficiency won out over aesthetics in this case.

Example 2

The following routine sums up the numbers in a vector of ZZ's.

#include "vec_ZZ.h"

void sum(ZZ& s, const vector(ZZ)& v)
{
   ZZ acc;

   clear(acc);

   long n = v.length();
   long i;
   for (i = 0; i < n; i++)
      add(acc, acc, v[i]);

   s = acc;
}

The class vector(ZZ) is a dynamic-length array of ZZs; more generally, NTL
provides macros to create a template-like class vector(T) for any type T
that acts as a dynamic-length array. The reason that macros are used instead
of true templates is simple: at the present time, compiler support for
templates is not entirely satisfactory, and their use would make NTL much
more difficult to port. At some point in the future, a template-version of
NTL may be made available.

The routine sum declares a local variable acc of type ZZ, first clears it,
and then sums up the array elements.

One thing to notice is that the routine add works properly even if the
output (its first argument) aliases one of its inputs. With a few
exceptions that are well marked in the documentation, NTL routines allow
their inputs to alias their outputs.

Also notice that by accumulating the sum in a local variable, instead of
directly in the argument s, the routine sum will work correctly even if s
aliases one of the elements of v. Any space required for acc is
automatically freed by acc's destructor when the routine returns.

Vectors in NTL are indexed from 0, but in many situations it is convenient
or more natural to index from 1. The class vector(T) allows for this; the
above example could be written as follows.

#include "vec_ZZ.h"

void sum(ZZ& s, const vector(ZZ)& v)
{
   ZZ acc;

   clear(acc);

   long n = v.length();
   long i;
   for (i = 1; i <= n; i++)
      add(acc, acc, v(i));

   s = acc;
}

Example 3

There is also basic support for matrices in NTL. In general, the class
matrix(T) is a special kind of vector(vector(T)), where each row is a vector
of the same length. Row i of matrix M can be accessed as M[i] (indexing from
0) or as M(i) (indexing from 1). Column j of row i can be accessed as
M[i][j] or M(i)(j); for notational convenience, the latter is equivalent to
M(i,j).

Here is a matrix multiplication routine, which in fact is already provided
by NTL.

#include "mat_ZZ.h"

void mul(matrix(ZZ)& X, const matrix(ZZ)& A, const matrix(ZZ)& B)
{
   long n = A.NumRows();
   long l = A.NumCols();
   long m = B.NumCols();

   if (l != B.NumRows())
      Error("matrix mul: dimension mismatch");

   X.SetDims(n, m);

   long i, j, k;
   ZZ acc, tmp;

   for (i = 1; i <= n; i++) {
      for (j = 1; j <= m; j++) {
         clear(acc);
         for(k = 1; k <= l; k++) {
            mul(tmp, A(i,k), B(k,j));
            add(acc, acc, tmp);
         }
         X(i,j) = acc;
      }
   }
}

Note that in case of a dimension mismatch, the routine calls the Error
function, which is a part of NTL and which simply prints the message and
aborts. That is generally how NTL deals with errors. Currently, NTL makes no
use of exceptions (for the same reason it does not use templates--see
above), but a future version may incorporate them.

Note that this routine will not work properly if X aliases A or B. The
actual matrix multiplication routine in NTL takes care of this.

Another thing to notice is that NTL generally avoids the type int,
preferring instead to use long. This seems to go against what most "style"
books preach, but nevertheless seems to make the most sense in today's
world. Although int was originally meant to represent the "natural" word
size, that seems to no longer be the case. Usually, int and long are the
same; however, when they are not, it seems that int is actually a hack for
backward compatibility with a past when word sizes were smaller. Thus, for
simplicity and safety, NTL uses long for all integer values.

Example 4

NTL provides extensive support for polynomial arithmetic. The class ZZX
represents univariate polynomials with integer coefficients. The following
program reads a polynomial, factors it, and prints the factorization.

#include "ZZXFactoring.h"

int main()
{
   ZZX f;

   cin >> f;

   vector(pair(ZZX,long)) factors;
   ZZ c;

   factor(c, factors, f);

   cout << c << "\n";
   cout << factors << "\n";

   return 0;
}

When this program is compiled and run on input

   [2 10 14 6]

which represents the polynomial 2 + 10*X + 14*X^2 + 6*X^3, the output is

   2
   [[[1 3] 1] [[1 1] 2]]

The first line of output is the content of the polynomial, which is 2 in
this case as each coefficient of the input polynomial is divisible by 2. The
second line is a vector of pairs: the first member of each pair is an
irreducible factor of the input, and the second is the exponent to which it
appears in the factorization. Thus, all of the above simply means that

2 + 10*X + 14*X^2 + 6*X^3 = 2 * (1 + 3*X) * (1 + X)^2

Admittedly, I/O in NTL is not exactly user friendly, but then NTL has no
pretensions about being an interactive computer algebra system: it is a
library for programmers.

Example 5

Here is another example. The following program prints out the first 100
cyclotomic polynomials.

#include "ZZX.h"

int main()
{
   vector(ZZX) phi;

   phi.SetLength(100);

   long i;
   for (i = 1; i <= 100; i++) {
      SetCoeff(phi(i), i);
      add(phi(i), phi(i), -1);

      long j;
      for (j = 1; j <= i-1; j++)
         if (i % j == 0)
            divide(phi(i), phi(i), phi(j));

      cout << phi(i) << "\n";
   }

   return 0;
}

Example 6

NTL also supports modular integer arithmetic. The class ZZ_p represents the
integers mod p. Despite the notation, p need not in general be prime, except
in situations where this is mathematically required. The classes
vector(ZZ_p), matrix(ZZ_p), and ZZ_pX represent vectors, matrices, and
polynomials mod p, and work much the same way as the corresponding classes
for ZZ.

Here is a program that reads a prime number p, and a polynomial f, and
factors it.

#include "ZZ_pXFactoring.h"

int main()
{
   ZZ p;
   cin >> p;
   ZZ_pInit(p);

   ZZ_pX f;
   cin >> f;

   vector(pair(ZZ_pX,long)) factors;

   CanZass(factors, f);

   cout << factors << "\n";

   return 0;
}

As a program is running, NTL keeps track of a "current modulus" for the
class ZZ_p, which can be initialized or changed using ZZ_pInit. This must be
done before any variables are declared or computations are done that depend
on this modulus.

Please note that for efficiency reasons, NTL does not make any attempt to
ensure that variables declared under one modulus are not used under a
different one. If that happens, the behavior of the program is completely
unpredictable.

Example 7

There is a mechanism for saving and restoring a modulus, which the following
example illustrates. This routine takes as input an integer polynomial and a
prime, and tests if the polynomial is irreducible modulo the prime.

#include "ZZX.h"
#include "ZZ_pXFactoring.h"

long IrredTestMod(const ZZX& f, const ZZ& p)
{
   ZZ_pBak bak;
   bak.save();

   ZZ_pInit(p);

   ZZ_pX f1;

   f1 << f;
   long res = DetIrredTest(f1);

   bak.restore();

   return res;
}

The current modulus is saved using the helper class ZZ_pBak and the
operation save(). The old modulus is then later restored using operation
restore(). In theory, the call to restore() could be left out, since the
destructor for bak will call it; however, the paranoid programmer may
prefer to call it explicitly.

The operator << is a conversion operator. NTL provides many conversion
operators between various types, all using the same syntax, and which
generally do the obvious thing. As a rule, NTL avoids the use of implicit
conversion operators, as these can easily lead to gross inefficiencies if
one is not extremely careful.

The routine DetIrredTest is one of several tests for irreducibility provided
by NTL.

Example 8

Suppose in the above example that p is known in advance to be a small,
single-precision prime. In this case, NTL provides a class zz_p, that acts
just like ZZ_p, along with corresponding classes vector(zz_p), matrix(zz_p),
and zz_pX. The interfaces to all of the routines are generally identical to
those for ZZ_p. However, the routines are much more efficient, in both time
and space.

For small primes, the routine in the previous example could be coded as
follows.

#include "ZZX.h"
#include "zz_pXFactoring.h"

long IrredTestMod(const ZZX& f, long p)
{
   zz_pBak bak;
   bak.save();

   zz_pInit(p);

   zz_pX f1;

   f1 << f;
   long res = DetIrredTest(f1);

   bak.restore();

   return res;
}

----------------------------------------------------------------------------

Summary of NTL's Main Modules

NTL consists of a number of software modules. Generally speaking, for each
module foo, there is a header file foo.h, and an implementation file foo.c.
There is also a documentation file foo.txt. This takes the form of a header
file, but stripped of implementation details and declarations of some of the
more esoteric routines and data structures, and it contains more complete
and usually clearer documentation. The following is a summary of the main
NTL modules. The corresponding ".txt" file can be obtained by clicking on
the module name.

 tools          some basic types and utility routines, including a timing
                function

 vector         macros for the template-like class vector(T), providing
                dynamic-size arrays

 matrix         macros for the template-like class matrix(T), providing
                dynamic-size 2-dimensional arrays

 ZZ             class ZZ: arbitrary length integers; includes routines for
                GCDs, Jacobi symbols, modular arithmetic, and primality
                testing; also includes small prime generation routines and
                in-line routines for single-precision modular arithmetic

 ZZ_p           class ZZ_p: integers mod p

 zz_p           class zz_p: integers mod p, where p is single-precision

 xdouble        class xdouble: double-precision floating point numbers with
                extended exponent range.

 RR             class RR: arbitrary-precision floating point numbers.

 ZZX            class ZZX: polynomials over ZZ; includes routines for GCDs,
                minimal and characteristic polynomials, norms and traces

 ZZXFactoring   routines for factoring univariate polynomials over ZZ

 ZZ_pX          class ZZ_pX: polynomials over ZZ_p; includes routines for
                modular polynomial arithmetic, modular composition,
                minimal and characteristic polynomials, and interpolation.

 ZZ_pXFactoring routines for factoring polynomials over ZZ_p; also includes
                routines for testing for and constructing irreducible
                polynomials

 zz_pX          class zz_pX: polynomials over zz_p; provides the same
                functionality as class ZZ_pX, but for single-precision p

 zz_pXFactoring routines for factoring polynomials over zz_p; provides the
                same functionality as ZZ_pXFactoring, but for
                single-precision p

 mat_ZZ         class matrix(ZZ); includes basic matrix arithmetic
                operations, including determinant calculation, matrix
                inversion, and solving nonsingular systems of linear
                equations

 HNF            routines for computing the Hermite Normal Form of a lattice

 LLL            routines for performing lattice basis reduction, including
                implementations of the Schnorr-Euchner LLL and block
                Korkin-Zolotarev reduction algorithms, as well as an
                integer-only reduction algorithm.

 mat_ZZ_p       class matrix(ZZ_p); includes basic matrix arithmetic
                operations, including determinant calculation, matrix
                inversion, solving nonsingular systems of linear equations,
                and Gaussian elimination

 mat_zz_p       class matrix(zz_p); includes basic matrix arithmetic
                operations, including determinant calculation, matrix
                inversion, solving nonsingular systems of linear equations,
                and Gaussian elimination

 mat_RR         class matrix(RR); includes basic matrix arithmetic
                operations, including determinant calculation, matrix
                inversion, and solving nonsingular systems of linear
                equations.

 mat_poly_ZZ    routine for computing the characteristic polynomial of a
                matrix(ZZ)

 mat_poly_ZZ_p  routine for computing the characteristic polynomial of a
                matrix(ZZ_p)

 mat_poly_zz_p  routine for computing the characteristic polynomial of a
                matrix(zz_p)

 vec_ZZ         class vector(ZZ)

 ZZVec          class ZZVec: fixed-length vectors of fixed-length ZZs; less
                flexible, but more efficient than vector(ZZ)

 vec_ZZ_p       class vector(ZZ_p)

 vec_zz_p       class vector(zz_p)

 vec_RR         class vector(RR)

----------------------------------------------------------------------------

NTL Implementation and Portability

NTL is designed to be portable, fast, and relatively easy to use and extend.

To make NTL portable, no assembly code is used. This is highly desirable, as
architectures are constantly changing and evolving, and maintaining assembly
code is quite costly. By avoiding assembly code, NTL should remain usable,
with virtually no maintenance, for many years. The main drawback of this
philosophy is that without assembly code, one cannot use machine
instructions to obtain double-word products, or perform double-word by
single-word division. There are a number of possible strategies for dealing
with this. NTL's basic strategy uses a combination of integer and
floating-point instruction sequences.

To carry out this strategy, NTL makes two requirements of its platform,
neither of which are guaranteed by the C++ language definition, but
nevertheless appear to be essentially universal:

  1. Integers are represented using 2's complement, and integer overflow is
     not trapped, but rather just wraps around.
  2. Double precision floating point conforms to the IEEE standard.

Actually, with some modification, NTL would not need the first requirement,
by exploiting language definitions dealing with unsigned arithmetic. Future
versions of NTL may incorporate this modification, if there is any need for
it.

Relying on floating point may seem prone to errors, but with the guarantees
provided by the IEEE standard, one can prove the correctness of the NTL code
that uses floating point. Actually, NTL is quite conservative, and
substantially weaker conditions are sufficient for correctness. In
particular, NTL works correctly with any rounding mode, and also with any
mix of double precision and extended double precision operations (which
arise, for example, with Intel x86 processors).

With this strategy, NTL represents arbitrary length integers using a 30-bit
radix on 32-bit machines, and a 50-bit radix on 64-bit machines.

This general strategy is used in A. K. Lenstra's LIP library for
arbitrary-length integer arithmetic. Indeed, NTL's integer arithmetic
evolved from LIP, but over time almost all of this code has been rewritten
to enhance performance.

Long integer multiplication is implemented using the classical algorithm,
crossing over to Karatsuba for very big numbers. Polynomial multiplication
and division are carried out using a combination of the classical algorithm,
Karatsuba, the FFT using small primes, and the FFT using the
Schoenhage-Strassen approach.

----------------------------------------------------------------------------

Some Performance Data

NTL is high-performance software, offering high-quality implementations of
the best algorithms. Here are some timing figures from the current version
of NTL. The figures were obtained using an IBM RS6000 Workstation, Model
43P-133, which has a 133 MHz PowerPC Model 604 processor. The operating
system is AIX and the compiler is xlC. The compiler options were -O2
-qarch=ppc -DAVOID_FLOAT -DTBL_REM.

The first problem considered is the factorization of univariate polynomials
modulo a prime p. As test polynomials, we take the family of polynomials
defined in [V. Shoup, J. Symb. Comp. 20:363-397, 1995]. For every n, we
define p to be the first prime greater than 2^{n-2}*PI, and the polynomial
is

\sum_{i=0}^n a_{n-i} X^i,

where a_0 = 1, and a_{i+1} = a_i^2 + 1. Here are some running times:

  n           64    128    256     512      1024

  hh:mm:ss    2     13     1:53    21:01    4:05:25

Also of interest is space usage. The n = 512 case used 4MB main memory, and
the n = 1024 case used 17 MB main memory.

Another test suite, this time using small primes, was used by Kaltofen and
Lobo (Proc. ISSAC '94). One of their polynomials is a degree 10001
polynomial, modulo the prime 127. This polynomial was factored with NTL in
just over 3 hours, using 17MB of memory.

The second problem considered is factoring univariate polynomials over the
integers. We use two test suites. In the first, we factor F_n(X)*F_{n+1}(X),
where

F_n(X) = \sum_{i=0}^n f_{n-i}*X^i,

and f_i is the i-th Fibonacci number (f_0 = 1, f_1 = 1, f_2 = 2, ...). Here
are some running times:

  n           100    200    300     400     500     1000

  hh:mm:ss    11     34     1:44    2:35    3:35    15:20

The space in the n=500 case was under 5MB, and in the n=1000 case, under
13MB.

The second test suite comes from Paul Zimmermann. The polynomial P1(X) has
degree 156, coefficients up to 424 digits, and 36 factors (12 of degree 2,
15 of degree 4, 9 of degree 8). The polynomial P2(X) has degree 196,
coefficients up to 419 digits and 12 factors (2 of degree 2, 4 of degree 12
and 6 of degree 24). The polynomial P3(X) has degree 336, coefficients up to
597 digits and 16 factors (4 of degree 12 and 12 of degree 24). The
polynomial P4(X) has degree 462, coefficients up to 756 digits, and two
factors of degree 66 and 396. More details on this test suite are available.

Our running times (hh:mm:ss) were as follows:

21, 23, 1:16, 1:37:10.

In all cases less than 5MB of main memory was used.

----------------------------------------------------------------------------

Obtaining and Installing NTL

To obtain the source code and documentation for NTL, download ntl-1.5.tar.Z,
placing it in an empty directory, and then, working in this directory,
execute the following command:
the following command:

zcat ntl-1.5.tar.Z | tar xvf -

There is a makefile, which you might need to edit. You need to specify a C++
compiler, and optionally, a compatible C compiler. A few source files are
written in pure C, and will compile under C or C++, but using a C compiler
sometimes yields better code. The default settings in the makefile use the
Gnu compilers g++ and gcc. There are a variety of compiler flags that you
can set to customize the compilation, affecting the quality of the compiled
code.

   * The -O2 flag, which invokes the optimizer, is essential for reasonable
     performance.
   * The -g option causes symbol tables to be generated that are used by a
     debugger; for the Gnu compilers, this is compatible with the optimizer,
     but for some other compilers it is not. If not, leave it out.
   * The -mv8 option should be used on Sparc stations that have an integer
     multiply instruction (Sparc-10 and later); otherwise, the compiler will
     not generate this instruction, causing a serious performance loss.
   * The -qarch=ppc option should be used on the RS6000/PowerPC when using
     the AIX compilers xlc/xlC to get access to the full instruction set.
   * The -+ flag should be added to the CPPFLAGS line in the makefile when
     using the AIX compiler xlC.
   * The -DAVOID_FLOAT option should be used on machines with fast integer
     multiplication, but slow floating point. On such machines, this will
     generally yield faster code. PowerPCs, for example, perform much better
     with this option.
   * The -DTBL_REM flag avoids some divisions in implementing
     multiplication in ZZ_pX. If you use the -DAVOID_FLOAT flag, then you
     should probably use this flag too.
   * The two flags -DAVOID_BRANCHING and -DFFT_PIPELINE should yield a
     performance gain on just about any RISC architecture, such as Sparc
     stations and Alphas. They don't seem to help on Pentiums, however. With
     the first flag, branches are replaced at several key points with
     equivalent code using shifts and masks. The second flag causes the FFT
     routine to use a software pipeline.
   * The -DRANGE_CHECK flag causes vector indices to be checked at run time,
     resulting in slower execution, of course.
   * The -DCPLUSPLUS_ONLY flag may be used if you are using only a C++
     compiler, and no C compiler. This flag eliminates unnecessary "C"
     linkage.
   * On 32-bit platforms that support it, the -DSINGLE_MUL flag
     causes a 26-bit radix to be used, and avoids integer multiplications,
     using floating point multiplications instead. This yields better code
     on machines with slow integer multiplication, and fast floating point.
     On old Sparc stations, this yields a marked performance gain. However,
     the 26-bit radix can be awkward in some applications.

     The -DFAST_INT_MUL flag can be used in conjunction with the
     -DSINGLE_MUL flag. This flag causes integer multiplications to be used
     where possible.

     One technical point to be aware of is that should the library be
     compiled with (resp. without) the -DSINGLE_MUL flag, then any code
     calling the library must also be compiled with (resp. without) this
     flag.

After editing makefile, just execute make. The first thing that the makefile
does is to make mach_desc.h, which defines some machine characteristics such
as word size and machine precision. This is done by compiling and running a
C program that figures out these characteristics on its own, and prints some
diagnostics to the terminal.

After this, the makefile will compile all the source files, and then create
the library ntl.a.

Finally, the makefile compiles and runs a series of test programs. The
output generated should indicate if there are any problems.

Executing make clean will remove unnecessary object files.

Executing make clobber removes everything that was generated by a previous
installation.

When linking a program, you need to include ntl.a and -lm as libraries.

The compilation should run smoothly on just about any UNIX platform. The
only real trouble I've heard of is on the RS6000/PowerPC, where the GNU
compiler seems to have a code-generation bug. The AIX compilers xlc/xlC,
however, work fine.

----------------------------------------------------------------------------

Changes between NTL 1.0 and NTL 1.5

Programs written using NTL 1.0 should still work without change with NTL
1.5. In addition to some minor performance tuning and bug fixes, NTL 1.5
offers the following new features:

   * Implementation of Schnorr-Euchner algorithms for lattice basis
     reduction, including deep insertions and block Korkin-Zolotarev
     reduction. These are significantly faster than the LLL algorithm in
     NTL 1.0.
   * Implementation of arbitrary-precision floating point.
   * Implementation of double precision with extended exponent range, which
     is useful for lattice basis reduction when the coefficients are large.
   * Faster polynomial multiplication over the integers, incorporating the
     Schoenhage-Strassen method.
   * Compilation flags that increase performance on machines with poor
     floating-point performance.

----------------------------------------------------------------------------
