
## Demonstration of parallelizing PDL with Many-core Engine for Perl (MCE)
## using Strassen's divide-and-conquer algorithm for matrix multiplication.
##
##    Requires MCE version 1.4 or later to run.
##    http://code.google.com/p/many-core-engine-perl/
##
## PDL is extremely powerful by itself. However, add MCE to it and be amazed.
##
## Usage:
##    perl matmult_pdl_b.pl   1024  ## Default size is 512:  $c = $a x $b
##    perl matmult_pdl_m.pl   1024  ## Default size is 512:  $c = $a * $b
##    perl matmult_perl_m.pl  1024  ## Default size is 512:  $c = $a * $b
##    perl strassen_pdl_m.pl  1024  ## Default size is 512:  divide-and-conquer
##    perl strassen_perl_m.pl 1024  ## Default size is 512:  divide-and-conquer
##

matmul_pdl_b.pl
      Baseline matrix multiplication using PDL alone (1 worker, no MCE).

      my $a = sequence $size,$size;
      my $b = sequence $size,$size;
      my $c = $a x $b;

matmul_pdl_m.pl
      PDL matrix multiplication + MCE (8 workers).

      The baseline computes $m1 x $m2, whereas this example uses $m1 * $m2
      because the work is divided among parallel workers. Although it cannot
      use the 'x' operator, this example performs quite well. It also has
      low memory utilization.

matmul_perl_m.pl
      Perl matrix multiplication + MCE (8 workers).

      This is a plain Perl implementation showing how one can parallelize the
      classic matrix multiplication without having to copy the matrices to each
      worker. Being 100% Perl, it's quite slow.
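
      What makes the classic algorithm easy to parallelize is that row i of
      the result depends only on row i of the first matrix and on the whole
      second matrix. The sketch below shows that decomposition sequentially
      in plain Perl. It is not the code from matmul_perl_m.pl, and the sub
      names (row_times_matrix, matmul_by_rows) are made up for illustration.

```perl
use strict;
use warnings;

# Compute one row of the product: (one row of M1) x M2.
# $row is an array ref of length n; $m2 is an array ref of n rows.
sub row_times_matrix {
    my ($row, $m2) = @_;
    my $cols  = @{ $m2->[0] };
    my @c_row = (0) x $cols;
    for my $k (0 .. $#{$row}) {
        my $m1_ik = $row->[$k];
        my $m2_k  = $m2->[$k];
        $c_row[$_] += $m1_ik * $m2_k->[$_] for 0 .. $cols - 1;
    }
    return \@c_row;
}

# Sequential driver. Each iteration is independent of the others,
# so the rows could just as well be handed to different workers.
sub matmul_by_rows {
    my ($m1, $m2) = @_;
    return [ map { row_times_matrix($_, $m2) } @$m1 ];
}

my $m1 = [ [1, 2], [3, 4] ];
my $m2 = [ [5, 6], [7, 8] ];
my $c  = matmul_by_rows($m1, $m2);
print join(' ', @$_), "\n" for @$c;
```

      In a parallel version, a worker would only need the second matrix
      plus the one row it is currently working on.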

strassen_pdl_m.pl
      Divide-and-conquer implementation using Strassen's algorithm.

      PDL + MCE (7 workers) proves to be a very powerful combination here.
      This example was created to see how MCE performs when applied to
      a recursive algorithm.

strassen_perl_m.pl
      100% Perl implementation + MCE (7 workers).

      This example was created to see how a pure-Perl implementation performs.
      Execution requires lots of memory due to the way Perl stores scalars
      in memory. Being 100% Perl, it's quite slow.
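
      For reference, here is a minimal pure-Perl sketch of Strassen's
      recursion for square matrices whose size is a power of two. It is not
      the code from strassen_perl_m.pl; the sub names (mat_combine, block,
      strassen) are made up for illustration and nothing runs in parallel.
      Each level produces seven sub-products (m1 .. m7) that do not depend
      on one another, which is presumably why the examples use 7 workers.

```perl
use strict;
use warnings;

# Element-wise X + sign*Y for two equally sized square matrices.
sub mat_combine {
    my ($x, $y, $sign) = @_;
    my @r;
    for my $i (0 .. $#{$x}) {
        for my $j (0 .. $#{$x}) {
            $r[$i][$j] = $x->[$i][$j] + $sign * $y->[$i][$j];
        }
    }
    return \@r;
}
sub mat_add { return mat_combine($_[0], $_[1],  1) }
sub mat_sub { return mat_combine($_[0], $_[1], -1) }

# Copy the h x h block whose top-left corner is at (r0, c0).
sub block {
    my ($m, $r0, $c0, $h) = @_;
    my @rows;
    for my $i ($r0 .. $r0 + $h - 1) {
        push @rows, [ @{ $m->[$i] }[ $c0 .. $c0 + $h - 1 ] ];
    }
    return \@rows;
}

# Multiply two n x n matrices, n being a power of two.
sub strassen {
    my ($x, $y) = @_;
    my $n = @$x;
    return [ [ $x->[0][0] * $y->[0][0] ] ] if $n == 1;

    my $h = $n / 2;
    my ($x11, $x12) = (block($x, 0,  0, $h), block($x, 0,  $h, $h));
    my ($x21, $x22) = (block($x, $h, 0, $h), block($x, $h, $h, $h));
    my ($y11, $y12) = (block($y, 0,  0, $h), block($y, 0,  $h, $h));
    my ($y21, $y22) = (block($y, $h, 0, $h), block($y, $h, $h, $h));

    # The seven sub-products; none of them depends on another.
    my $m1 = strassen(mat_add($x11, $x22), mat_add($y11, $y22));
    my $m2 = strassen(mat_add($x21, $x22), $y11);
    my $m3 = strassen($x11, mat_sub($y12, $y22));
    my $m4 = strassen($x22, mat_sub($y21, $y11));
    my $m5 = strassen(mat_add($x11, $x12), $y22);
    my $m6 = strassen(mat_sub($x21, $x11), mat_add($y11, $y12));
    my $m7 = strassen(mat_sub($x12, $x22), mat_add($y21, $y22));

    # Combine the sub-products into the four quadrants of the result.
    my $c11 = mat_add(mat_sub(mat_add($m1, $m4), $m5), $m7);
    my $c12 = mat_add($m3, $m5);
    my $c21 = mat_add($m2, $m4);
    my $c22 = mat_add(mat_add(mat_sub($m1, $m2), $m3), $m6);

    # Stitch the quadrants back into one n x n matrix.
    my @c;
    for my $i (0 .. $h - 1) {
        $c[$i]      = [ @{ $c11->[$i] }, @{ $c12->[$i] } ];
        $c[$i + $h] = [ @{ $c21->[$i] }, @{ $c22->[$i] } ];
    }
    return \@c;
}

# Small demo: multiply a 4x4 sequence matrix by itself.
my $n = 4;
my @seq;
$seq[$_] = [ $_ * $n .. $_ * $n + $n - 1 ] for 0 .. $n - 1;
my $product = strassen(\@seq, \@seq);
printf "(0,0) %s  (%d,%d) %s\n",
    $product->[0][0], $n - 1, $n - 1, $product->[$n - 1][$n - 1];
```

      The many temporary sub-matrices created at every level give a feel
      for why the Strassen examples need more memory than the row-wise ones.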

## Times below are reported in seconds.
##
## Benchmarked under Linux -- RHEL 6.3, Perl 5.10.1
## System is configured with both Turbo-Boost and Hyper-Threads enabled
## Hardware is an Intel(R) Xeon(R) CPU E5649 @ 2.53GHz x 2 (24 logical procs)
##
## My favorite is matmul_pdl_m.pl running with 24 workers. It has very low
## memory consumption. strassen_pdl_m.pl is quite fast but requires
## additional memory, and strassen_perl_m.pl requires even more.
##
## Below, (n running) indicates the number of instances launched at the
## same time by a shell script such as this one (3 running):
##
##    #!/bin/bash
##
##    perl matmul_pdl_m.pl 1024 &
##    perl matmul_pdl_m.pl 1024 &
##    perl matmul_pdl_m.pl 1024 &
##
##    wait
##

## -- Results for 1024x1024 ---------------------------------------------------
##
## matmul_pdl_b.pl    1024: compute:   2.705 secs   1 worker   ( 1 running)
## matmul_pdl_b.pl    1024: compute:  11.035 secs   1 worker   (24 running)
##
## matmul_pdl_m.pl    1024: compute:   3.809 secs   8 workers  ( 1 running)
## matmul_pdl_m.pl    1024: compute:   7.397 secs   8 workers  ( 3 running)
## matmul_pdl_m.pl    1024: compute:   2.486 secs  24 workers  ( 1 running)
##
## matmul_perl_m.pl   1024: compute:  33.833 secs   8 workers  ( 1 running)
## matmul_perl_m.pl   1024: compute:  69.830 secs   8 workers  ( 3 running)
## matmul_perl_m.pl   1024: compute:  23.995 secs  24 workers  ( 1 running)
##
## strassen_pdl_m.pl  1024: compute:   0.564 secs   7 workers  ( 1 running)
## strassen_perl_m.pl 1024: compute:  45.408 secs   7 workers  ( 1 running)
##
## Output
##    (0,0) 365967179776  (1023,1023) 563314846859776
##
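## The two printed elements can be cross-checked without running the full
## multiplication. Assuming both inputs are the sequence matrix from the
## baseline, with element (r,c) holding r*size + c (an assumption about the
## layout that does not affect the diagonal), element (i,i) of the product
## is a single sum over k. A minimal pure-Perl sketch; the sub name
## diag_element is made up for illustration:

```perl
use strict;
use warnings;

# Diagonal element (i,i) of M x M, where M[r][c] = r * n + c.
sub diag_element {
    my ($n, $i) = @_;
    my $sum = 0;
    $sum += ($i * $n + $_) * ($_ * $n + $i) for 0 .. $n - 1;
    return $sum;
}

my $size = 1024;
printf "(0,0) %s  (%d,%d) %s\n",
    diag_element($size, 0), $size - 1, $size - 1, diag_element($size, $size - 1);
```

## For size 1024 this prints the same two values as the Output lines above.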

## -- Results for 2048x2048 ---------------------------------------------------
##
## matmul_pdl_b.pl    2048: compute:  21.470 secs   1 worker   ( 1 running)
## matmul_pdl_b.pl    2048: compute:  96.217 secs   1 worker   (24 running)
##
## matmul_pdl_m.pl    2048: compute:  18.315 secs   8 workers  ( 1 running)
## matmul_pdl_m.pl    2048: compute:  35.954 secs   8 workers  ( 3 running)
## matmul_pdl_m.pl    2048: compute:  11.987 secs  24 workers  ( 1 running)
##
## matmul_perl_m.pl   2048: compute: 270.556 secs   8 workers  ( 1 running)
## matmul_perl_m.pl   2048: compute: 558.837 secs   8 workers  ( 3 running)
## matmul_perl_m.pl   2048: compute: 190.302 secs  24 workers  ( 1 running)
##
## strassen_pdl_m.pl  2048: compute:   2.734 secs   7 workers  ( 1 running)
## strassen_perl_m.pl 2048: compute: 322.932 secs   1 level  parallelization
## strassen_perl_m.pl 2048: compute: 200.440 secs   2 levels parallelization
##
## Output
##    (0,0) 5859767746560  (2047,2047) 1.80202496872953e+16  matmul examples
##    (0,0) 5859767746560  (2047,2047) 1.8020249687295e+16   strassen examples
##

## -- Results for 4096x4096 ---------------------------------------------------
##
## matmul_pdl_b.pl    4096: compute: 172.220 secs   1 worker   ( 1 running)
## matmul_pdl_m.pl    4096: compute:  68.104 secs  24 workers  ( 1 running)
## strassen_pdl_m.pl  4096: compute:  16.941 secs   7 workers  ( 1 running)
##
## Output
##    (0,0) 93790635294720  (4095,4095) 5.76554474219245e+17  matmul examples
##    (0,0) 93790635294720  (4095,4095) 5.76554474219244e+17  strassen example
##
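## To read these numbers as speedups, divide the baseline time by the time
## of each parallel run. The snippet below does that arithmetic for the
## three 4096x4096 timings reported above; the %secs hash and the speedup
## sub are only there for illustration.

```perl
use strict;
use warnings;

# Compute times in seconds, copied from the 4096x4096 results above.
my %secs = (
    'matmul_pdl_b.pl'   => 172.220,
    'matmul_pdl_m.pl'   =>  68.104,
    'strassen_pdl_m.pl' =>  16.941,
);

# Speedup of a script relative to the single-worker baseline.
sub speedup {
    my ($script) = @_;
    return $secs{'matmul_pdl_b.pl'} / $secs{$script};
}

# List the slowest run first.
printf "%-18s %8.3f secs  %5.2fx\n", $_, $secs{$_}, speedup($_)
    for sort { $secs{$b} <=> $secs{$a} } keys %secs;
```

## That works out to roughly 2.5x for matmul_pdl_m.pl with 24 workers and
## roughly 10x for strassen_pdl_m.pl with 7 workers.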

## -- Results for 8192x8192 ---------------------------------------------------
##
##    Note: At this size, matmul_pdl_m.pl is the only MCE example that can
##    run, due to its low memory consumption. Workers only hold a local copy
##    of the "b" matrix in memory, not the "a" or "c" matrices. MCE's do
##    method is useful for fetching the next row as well as for sending the
##    result.
##
## matmul_pdl_b.pl    8192: compute: 1388.001 secs   1 worker   ( 1 running)
## matmul_pdl_m.pl    8192: compute:  462.206 secs  24 workers  ( 1 running)
##
## Output
##
##    (0,0) 1.50092500906803e+15  (8191,8191) 1.84482444489628e+19
##    (0,0) 1.50092500906803e+15  (8191,8191) 1.84482444489628e+19
##


This has been an exciting exercise for me. MCE enjoys parallelizing PDL :)

Running matmul_pdl_m.pl with max_workers => 24 works quite well. One can set
this to 32 on dual-socket servers with Intel's Xeon E5 processors. If memory
is plentiful, strassen_pdl_m.pl is quite fast. Please note that the Strassen
algorithm introduces rounding errors and may not be a good fit when a high
level of accuracy is required.
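
One way to see where the rounding comes from: Strassen's method adds and
subtracts inputs before multiplying them, so a small value can be absorbed
by a large one and never recovered. The sketch below applies the standard
Strassen formulas for one element of a 2x2 product to plain scalars. The
input values are deliberately ill-scaled to make the effect obvious; they are
not taken from the benchmarks, and the sub names are made up for illustration.

```perl
use strict;
use warnings;

# Element c12 of a 2x2 product, computed the classic way.
sub classic_c12 {
    my ($a11, $a12, $b12, $b22) = @_;
    return $a11 * $b12 + $a12 * $b22;
}

# The same element via Strassen's formulas: c12 = m3 + m5.
sub strassen_c12 {
    my ($a11, $a12, $b12, $b22) = @_;
    my $m3 = $a11 * ($b12 - $b22);     # 0.5 is lost next to 1e40
    my $m5 = ($a11 + $a12) * $b22;
    return $m3 + $m5;
}

# a11, a12, b12, b22
my @in = (1, 0, 0.5, 1e40);
printf "classic:  %.17g\n", classic_c12(@in);
printf "strassen: %.17g\n", strassen_c12(@in);
```

The classic form prints 0.5, while the Strassen form prints 0 because the
0.5 vanishes in the subtraction. With well-scaled inputs the difference is
far smaller, typically showing up only in the last printed digits.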

MCE also likes big files. Look at the egrep.pl and wc.pl examples.
Those examples benefit from MCE's chunking engine.

Regards,
Mario

