
## Note: I'm most thankful for the code sample from David Mertens.
## https://groups.google.com/forum/#!topic/the-quantified-onion/2cSWXogt5Xs
##
## I'm quite new to PDL. David's example pointed me in the right direction.
##
## David has an interesting module called PDL::Parallel::threads to help
## share PDL data across Perl threads. MCE, on the other hand, has a powerful
## "do" method which is used to obtain data from the manager process as well
## as send results back as often as needed. Please note that MCE supports
## both forking and threading.
##

## Demonstration of parallelizing PDL with Many-core Engine for Perl (MCE)
## using Strassen's divide-and-conquer algorithm for matrix multiplication.
##
##    Requires MCE version 1.4 or later to run.
##    http://code.google.com/p/many-core-engine-perl/
##
## PDL is extremely powerful by itself. However, add MCE to it and be amazed.
##
## -- Usage -------------------------------------------------------------------
##
## perl matmult_pdl_b.pl   1024  ## Default size is 512:  $c = $a x $b
## perl matmult_pdl_m.pl   1024  ## Default size is 512:  $c = $a x $b
## perl matmult_pdl_n.pl   1024  ## Default size is 512:  $c = $a x $b
##
## perl matmult_perl_m.pl  1024  ## Default size is 512:  $c = $a * $b
##
## perl strassen_pdl_m.pl  1024  ## Default size is 512:  divide-and-conquer
## perl strassen_perl_m.pl 1024  ## Default size is 512:  divide-and-conquer
##

matmul_pdl_b.pl
      Baseline matrix multiplication using PDL alone

      my $a = sequence $size,$size;
      my $b = sequence $size,$size;
      my $c = $a x $b;

matmul_pdl_m.pl
      PDL matrix multiplication + MCE (8 workers)
      Uses Storable qw(freeze thaw)

matmul_pdl_n.pl
      PDL matrix multiplication + MCE (8 workers)
      Same as matmul_pdl_m.pl but uses PDL::IO::FastRaw to write/read matrix b

matmul_perl_m.pl
      Perl matrix multiplication + MCE (8 workers)

      This is a plain Perl implementation showing how one can parallelize the
      classic matrix multiplication without having to copy the matrices to each
      worker. Being 100% Perl, it's quite slow.

strassen_pdl_m.pl
      Divide-and-conquer implementation using Strassen's algorithm

      PDL + MCE (7 workers) is a very powerful combination, as seen here.
      This example was created to see how MCE performs when applied to
      a recursive algorithm.

strassen_perl_m.pl
      100% Perl implementation + MCE (7 workers)

      This example was created to see how the Perl implementation performs.
      Execution requires lots of memory because of how Perl stores scalars
      in memory. Being 100% Perl, it's quite slow.
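
The two algorithms named above can be sketched in plain Perl, without PDL or
MCE. This is a minimal illustration, not the code from the example scripts:
the helper names (matmul, combine, block, strassen_one_level) are
hypothetical, and only a single level of Strassen's recursion is shown. The
seven products M1..M7 do not depend on one another, which is presumably why
the Strassen scripts use 7 workers; that link is an inference, not something
the notes above state.

```perl
use strict;
use warnings;

# Classic O(n^3) product of two square matrices stored as array-of-arrays.
# Each output row needs one row of $p and all of $q.
sub matmul {
    my ($p, $q) = @_;
    my $n = scalar @$p;
    my @c;
    for my $i (0 .. $n - 1) {
        my @c_row = (0) x $n;
        for my $k (0 .. $n - 1) {
            my $p_ik  = $p->[$i][$k];
            my $q_row = $q->[$k];
            $c_row[$_] += $p_ik * $q_row->[$_] for 0 .. $n - 1;
        }
        push @c, \@c_row;
    }
    return \@c;
}

# Element-wise $p + $sign * $q for equally sized matrices ($sign is 1 or -1).
sub combine {
    my ($p, $q, $sign) = @_;
    return [ map { my $i = $_;
                   [ map { $p->[$i][$_] + $sign * $q->[$i][$_] }
                         0 .. $#{ $p->[$i] } ]
                 } 0 .. $#$p ];
}

# Copy the $h x $h block whose top-left corner is at row $r, column $c.
sub block {
    my ($m, $r, $c, $h) = @_;
    return [ map { [ @{ $m->[$_] }[ $c .. $c + $h - 1 ] ] } $r .. $r + $h - 1 ];
}

# One level of Strassen's algorithm: 7 half-size products instead of 8.
# Assumes square matrices with an even dimension.
sub strassen_one_level {
    my ($p, $q) = @_;
    my $h = scalar(@$p) / 2;
    my @corners = ([0, 0], [0, $h], [$h, 0], [$h, $h]);
    my ($a11, $a12, $a21, $a22) = map { block($p, @$_, $h) } @corners;
    my ($b11, $b12, $b21, $b22) = map { block($q, @$_, $h) } @corners;

    # The seven independent products M1..M7 (stored as $m[0]..$m[6]).
    my @m = (
        matmul(combine($a11, $a22,  1), combine($b11, $b22,  1)),
        matmul(combine($a21, $a22,  1), $b11),
        matmul($a11,                    combine($b12, $b22, -1)),
        matmul($a22,                    combine($b21, $b11, -1)),
        matmul(combine($a11, $a12,  1), $b22),
        matmul(combine($a21, $a11, -1), combine($b11, $b12,  1)),
        matmul(combine($a12, $a22, -1), combine($b21, $b22,  1)),
    );

    my $c11 = combine(combine(combine($m[0], $m[3], 1), $m[4], -1), $m[6], 1);
    my $c12 = combine($m[2], $m[4], 1);
    my $c21 = combine($m[1], $m[3], 1);
    my $c22 = combine(combine(combine($m[0], $m[1], -1), $m[2], 1), $m[5], 1);

    # Stitch the four quadrants back into one matrix.
    return [ (map { [ @{ $c11->[$_] }, @{ $c12->[$_] } ] } 0 .. $h - 1),
             (map { [ @{ $c21->[$_] }, @{ $c22->[$_] } ] } 0 .. $h - 1) ];
}

# Same input shape as "sequence $size,$size": element (row, col) = row*n + col.
my $n   = 4;
my $seq = [ map { my $i = $_; [ map { $i * $n + $_ } 0 .. $n - 1 ] } 0 .. $n - 1 ];

my $classic  = matmul($seq, $seq);
my $strassen = strassen_one_level($seq, $seq);

printf "classic   (0,0) %s  (%d,%d) %s\n",
    $classic->[0][0],  $n - 1, $n - 1, $classic->[$n - 1][$n - 1];
printf "strassen  (0,0) %s  (%d,%d) %s\n",
    $strassen->[0][0], $n - 1, $n - 1, $strassen->[$n - 1][$n - 1];
```

At this small size every value is an exact integer, so both functions return
the same matrix. A real implementation would recurse on the seven products
and fall back to the classic method below some block size.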

## -- Times below are reported in number of seconds ---------------------------
##
## Benchmarked under Linux -- RHEL 6.3, Perl 5.10.1
## System is configured with both Turbo Boost and Hyper-Threading enabled
## Hardware is an Intel(R) Xeon(R) CPU E5649 @ 2.53GHz x 2 (24 logical procs)
##
## My favorite is matmult_pdl_n.pl running with 24 workers. It has very low
## memory consumption. strassen_pdl_m.pl is quite fast but requires
## additional memory, and strassen_perl_m.pl requires even more.
##
## Below, (n running) is the number of instances launched concurrently from
## a shell script such as:
##
##    #!/bin/bash
##
##    perl matmul_pdl_n.pl 1024 &
##    perl matmul_pdl_n.pl 1024 &
##    perl matmul_pdl_n.pl 1024 &
##
##    wait
##

## -- Results for 1024x1024 ---------------------------------------------------
##
## matmul_pdl_b.pl    1024: compute:   2.705 secs   1 worker   ( 1 running)
## matmul_pdl_b.pl    1024: compute:  11.035 secs   1 worker   (24 running)
##
## matmul_pdl_m.pl    1024: compute:   0.697 secs   8 workers  ( 1 running)
## matmul_pdl_m.pl    1024: compute:   1.625 secs   8 workers  ( 3 running)
## matmul_pdl_m.pl    1024: compute:   0.705 secs  24 workers  ( 1 running)
##
## matmul_pdl_n.pl    1024: compute:   0.500 secs   8 workers  ( 1 running)
## matmul_pdl_n.pl    1024: compute:   0.978 secs   8 workers  ( 3 running)
## matmul_pdl_n.pl    1024: compute:   0.368 secs  24 workers  ( 1 running)
##
## matmul_perl_m.pl   1024: compute:  33.833 secs   8 workers  ( 1 running)
## matmul_perl_m.pl   1024: compute:  69.830 secs   8 workers  ( 3 running)
## matmul_perl_m.pl   1024: compute:  23.995 secs  24 workers  ( 1 running)
##
## strassen_pdl_m.pl  1024: compute:   0.564 secs   7 workers  ( 1 running)
## strassen_perl_m.pl 1024: compute:  45.408 secs   7 workers  ( 1 running)
##
## Output
##    (0,0) 365967179776  (1023,1023) 563314846859776
##

## -- Results for 2048x2048 ---------------------------------------------------
##
## matmul_pdl_b.pl    2048: compute:  21.470 secs   1 worker   ( 1 running)
## matmul_pdl_b.pl    2048: compute:  96.217 secs   1 worker   (24 running)
##
## matmul_pdl_m.pl    2048: compute:   4.873 secs   8 workers  ( 1 running)
## matmul_pdl_m.pl    2048: compute:  12.610 secs   8 workers  ( 3 running)
## matmul_pdl_m.pl    2048: compute:   4.715 secs  24 workers  ( 1 running)
##
## matmul_pdl_n.pl    2048: compute:   3.198 secs   8 workers  ( 1 running)
## matmul_pdl_n.pl    2048: compute:   7.453 secs   8 workers  ( 3 running)
## matmul_pdl_n.pl    2048: compute:   2.515 secs  24 workers  ( 1 running)
##
## matmul_perl_m.pl   2048: compute: 270.556 secs   8 workers  ( 1 running)
## matmul_perl_m.pl   2048: compute: 558.837 secs   8 workers  ( 3 running)
## matmul_perl_m.pl   2048: compute: 190.302 secs  24 workers  ( 1 running)
##
## strassen_pdl_m.pl  2048: compute:   2.734 secs   7 workers  ( 1 running)
## strassen_perl_m.pl 2048: compute: 322.932 secs   1 level  parallelization
## strassen_perl_m.pl 2048: compute: 200.440 secs   2 levels parallelization
##
## Output
##    (0,0) 5859767746560  (2047,2047) 1.80202496872953e+16  matmul examples
##    (0,0) 5859767746560  (2047,2047) 1.8020249687295e+16   strassen examples
##

## -- Results for 4096x4096 ---------------------------------------------------
##
## matmul_pdl_b.pl    4096: compute: 172.220 secs   1 worker   ( 1 running)
## matmul_pdl_m.pl    4096: compute:  35.923 secs  24 workers  ( 1 running)
## matmul_pdl_n.pl    4096: compute:  23.580 secs  24 workers  ( 1 running)
## strassen_pdl_m.pl  4096: compute:  16.941 secs   7 workers  ( 1 running)
##
## Output
##    (0,0) 93790635294720  (4095,4095) 5.76554474219245e+17  matmul examples
##    (0,0) 93790635294720  (4095,4095) 5.76554474219244e+17  strassen example
##

## -- Results for 8192x8192 ---------------------------------------------------
##
##    Note: At this size, only matmult_pdl_m/n.pl can run due to their low
##    memory consumption. Workers hold only a local copy of the "b" matrix
##    in memory, not the "a" or "c" matrices. MCE's "do" method is used to
##    fetch the next row of "a" as well as to send the result back.
##
## matmul_pdl_b.pl    8192: compute: 1388.001 secs   1 worker   ( 1 running)
## matmul_pdl_n.pl    8192: compute:  447.485 secs  24 workers  ( 1 running)
##
## Output
##
##    (0,0) 1.50092500906803e+15  (8191,8191) 1.84482444489628e+19
##    (0,0) 1.50092500906803e+15  (8191,8191) 1.84482444489628e+19
##
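
To put the memory note for the 8192x8192 run in perspective, the size of one
dense matrix can be worked out directly. The sketch below assumes 8-byte
double elements, which is PDL's default type for sequence(); the helper name
matrix_mib is hypothetical. It counts raw element storage only and ignores
any per-process or serialization overhead.

```perl
use strict;
use warnings;

# Assumption: 8-byte doubles (PDL's default element type for sequence()).
my $BYTES_PER_ELEMENT = 8;

# Raw storage, in MiB, of one dense $n x $n matrix.
sub matrix_mib {
    my ($n) = @_;
    return $n * $n * $BYTES_PER_ELEMENT / (1024 * 1024);
}

for my $n (1024, 2048, 4096, 8192) {
    printf "%5d x %-5d  %4d MiB per matrix\n", $n, $n, matrix_mib($n);
}
```

At 8192x8192 each matrix takes 512 MiB, so a process holding "a", "b" and "c"
at once needs about 1.5 GiB for the data alone.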


This has been an exciting exercise for me. MCE enjoys parallelizing PDL :)

Running matmult_pdl_n.pl with max_workers => 24 is quite nice. One can set this
to 32 on servers with Intel's Xeon E5 processor (dual socket). If memory is
plentiful, strassen_pdl_m.pl is quite fast. Please note that the Strassen
algorithm introduces rounding errors (compare the last digits of the 2048 and
4096 outputs above) and may not be a good fit when a high level of accuracy
is required.
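
The rounding differences come from floating-point addition not being
associative: Strassen's algorithm adds and subtracts intermediate products in
a different order than the classic method, so the last digits can differ once
values no longer fit exactly in a double (integers above 2**53, about 9.0e15,
are not all representable, and the (2047,2047) entry is about 1.8e16). The
snippet below is a small illustration of the effect only, assuming a perl
built with the usual 64-bit doubles; it does not reproduce the matrix results.

```perl
use strict;
use warnings;

# The same three terms, summed in two different orders.
my $left_first  = (0.1 + 0.2) + 0.3;
my $right_first = 0.1 + (0.2 + 0.3);

printf "(0.1 + 0.2) + 0.3 = %.17g\n", $left_first;
printf "0.1 + (0.2 + 0.3) = %.17g\n", $right_first;
print  "the two sums differ in the last digit\n"
    if $left_first != $right_first;
```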

MCE also likes big files. Look at the egrep.pl and wc.pl examples.
Those examples benefit from MCE's chunking engine.

Regards,
Mario

