#!/usr/bin/perl
   eval 'exec /usr/bin/perl -S $0 ${1+"$@"}'
       if $running_under_some_shell;

use strict;
use warnings;

use vars qw($VERSION);

$VERSION = '0.01';

=head1 NAME

psame - finds similarities between files or versions of files

=head1 SYNOPSIS

  psame [options] file1 file2
  psame [options] file
  psame [options] [-r version] file

The first usage compares the two files.
The second usage compares the given file with the latest version from
Subversion, CVS or RCS.
The third usage will compare against a given version from Subversion, CVS or
RCS.

By default, the output will be a side-by-side view of matching regions with a
few lines of context.

=head1 MOTIVATION

B<psame> allows the user to find lines in one piece of text (generally from a
file) that match some lines in a second piece of text.

=head1 USE CASES

=head2 Code comparison

The B<diff(1)> command is excellent for finding differences between files, but
sometimes similarity is more interesting.  A common case is when a chunk of
code is moved to another part of the same file.  In that case comparing the
old and new versions of the file with B<diff> will tell you that there has
been a deletion of text and an insertion.  B<psame>, on the other hand, will
tell you where moved code is in the new version.  In simple cases, the output
from B<diff> is clear enough but comparison with B<psame> can help in the
cases where there have been many edits.

=head1 DESCRIPTION

=head2 Options

=over 4

=item B<-b>

ignore changes in whitespace

=item B<-i>

ignore case

=item B<-B>

ignore blank lines

=item B<-s> <num>

ignore simple/short lines (i.e. those with fewer than <num> characters)

=item B<-y>

side-by-side match view (default)

=item B<-V>

vertical match view

=item B<-n>

show non-matches instead of matches

=item B<-N>

show matches and non-matches

=item B<-x> <wid>

set terminal width in columns (normally guessed)

=item B<-r> <ver>

compare with version <ver> from SVN, CVS or RCS

=item B<-S> <num>

only show matches with a score of at least <num> (see the SCORE section below)

=item B<-C> <num>

number of lines of context

=item B<-a>

apply (a)ll useful options - sets the following options:
B<-b> B<-i> B<-B> B<-s> 2 B<-N> B<-S> 3 B<-C> 3

=back

=head1 MATCHES

A "match" is some number of consecutive lines in one file (or file version)
that are similar to some number of consecutive lines in a second file (or
file version).  In the simplest case with no options specified, the lines
in each file must be identical.  As an example, consider these two pieces
of text (with added line numbers):

=head2 B<text_1>

 1. The parrot sketch -
 2.  'E's kicked the bucket, 'e's
 3.  shuffled off 'is mortal coil, run
 4.  down the curtain and joined the
 5.  bleedin' choir invisibile!
 6.  THIS IS AN EX-PARROT!


=head2 B<text_2>

 1. 'E's kicked the bucket, 'e's
 2. shuffled off 'is mortal coil, run
 3. down the curtain and joined the
 4. bleedin' choir invisibile!
 5. 
 6. This is an ex-parrot!

Using the default settings, B<psame> will report this:

 match 2..5==1..4
   The parrot sketch -                                                    
    'E's kicked the bucket, 'e's      =  'E's kicked the bucket, 'e's     
    shuffled off 'is mortal coil, run =  shuffled off 'is mortal coil, run
    down the curtain and joined the   =  down the curtain and joined the  
    bleedin' choir invisibile!        =  bleedin' choir invisibile!       
    THIS IS AN EX-PARROT!                                                 
                                         This is an ex-parrot!            

which indicates that there are four lines from B<text_1> (i.e. lines 2 to 5)
that match four lines from B<text_2> (i.e. lines 1 to 4).  Note that B<psame>
is case sensitive by default, so line 6 of B<text_1> doesn't match line 6 of
B<text_2> in this case.

Adding the B<-i> option makes B<psame> ignore case, so the last line of each
file is now reported as a match too:

 match 2..5==1..4
   The parrot sketch -                                                    
    'E's kicked the bucket, 'e's      =  'E's kicked the bucket, 'e's     
    shuffled off 'is mortal coil, run =  shuffled off 'is mortal coil, run
    down the curtain and joined the   =  down the curtain and joined the  
    bleedin' choir invisibile!        =  bleedin' choir invisibile!       
    THIS IS AN EX-PARROT!                                                 
                                         This is an ex-parrot!            
 match 6..6==6..6
    shuffled off 'is mortal coil, run    down the curtain and joined the  
    down the curtain and joined the      bleedin' choir invisibile!       
    bleedin' choir invisibile!                                            
    THIS IS AN EX-PARROT!             =  This is an ex-parrot!            

In this case B<psame> is reporting two distinct matches - one four lines long
and the other one line long.
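
As a rough illustration of what B<-i> and B<-b> mean for a single line, the
sketch below normalises two lines before comparing them.  The
C<normalise_line> helper and its option names are hypothetical and are not
taken from Text::Same; in particular, treating B<-b> as "trim the ends and
collapse runs of whitespace" is an assumption made for this example.

 use strict;
 use warnings;

 # Hypothetical helper (not part of Text::Same): return the form of a
 # line that would be compared under the given options.
 sub normalise_line {
   my ($line, %opt) = @_;
   if ($opt{ignore_space}) {
     $line =~ s/^\s+|\s+$//g;    # assumed: trim both ends
     $line =~ s/\s+/ /g;         # assumed: collapse inner whitespace
   }
   $line = lc $line if $opt{ignore_case};
   return $line;
 }

 my @pairs = (
   ["THIS IS AN EX-PARROT!", "This is an ex-parrot!"],
   ["down  the curtain ",    "down the curtain"],
 );

 for my $pair (@pairs) {
   my ($left, $right) = @$pair;
   for my $opts ({}, { ignore_case => 1 }, { ignore_space => 1 }) {
     my $label = join(",", sort keys %$opts) || "default";
     my $same  = normalise_line($left, %$opts) eq normalise_line($right, %$opts);
     print "[$left] vs [$right] with $label: ",
           ($same ? "same" : "different"), "\n";
   }
 }

In this sketch the first pair only compares as the same with C<ignore_case>
set, and the second pair only with C<ignore_space> set.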

=head1 NON-MATCHES

The B<-n> flag will report lines in each file that don't match any lines in
the other file.  For example, running B<psame -n> on the files above, with no
other options, gives:

 non matches in text_1:
   1..1:
     The parrot sketch -
   6..6:
      THIS IS AN EX-PARROT!
 non matches in text_2:
   5..6:

      This is an ex-parrot!

In this case B<diff(1)> will tell us the same thing, but in other situations we
only want to know about lines in file A that don't appear anywhere in file B.
An example might be when modifying the order of sections in a manuscript - we
would like to check that all sections are still present, even if in a
different place.
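
The manuscript check described above can be pictured as a set difference on
lines.  The sketch below uses only core Perl and a hypothetical
C<lines_missing_from> helper.  It compares single lines, whereas B<psame>
works on runs of consecutive lines, so it is an illustration of the idea
rather than a description of how B<psame> is implemented.

 use strict;
 use warnings;

 # Hypothetical helper: return the lines of @$from that occur nowhere
 # in @$other.  Blank lines are skipped to keep the output short.
 sub lines_missing_from {
   my ($from, $other) = @_;
   my %seen = map { $_ => 1 } @$other;
   return grep { /\S/ && !$seen{$_} } @$from;
 }

 # Section titles before and after reordering (made-up sample data)
 my @old = ("Introduction", "Methods", "Results", "Discussion");
 my @new = ("Introduction", "Results", "Methods", "Conclusions");

 print "missing from new: $_\n" for lines_missing_from(\@old, \@new);
 print "missing from old: $_\n" for lines_missing_from(\@new, \@old);

With this sample data the moved sections are not reported; only
"Discussion" (missing from the new list) and "Conclusions" (missing from the
old list) are printed.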

=head1 SCORE

The score of a match is currently the total number of lines this match covers
in both files.  The B<-S> option for filtering by score is useful for
filtering out small matches so that the larger changes can be seen.
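
As a worked example, the two matches reported in the MATCHES section can be
scored by hand.  The sketch below parses the C<a..b==c..d> ranges printed by
B<psame> and applies the definition above.  C<score_of> is a hypothetical
helper written for this example, and the "greater than or equal to" test
mirrors the comparison that the B<psame> script itself applies for B<-S>.

 use strict;
 use warnings;

 # Hypothetical helper: score a match written as "a..b==c..d", counting
 # the lines covered in the first file plus those covered in the second.
 sub score_of {
   my ($range) = @_;
   my ($s1, $e1, $s2, $e2) = $range =~ /^(\d+)\.\.(\d+)==(\d+)\.\.(\d+)$/
     or die "unrecognised range: $range\n";
   return ($e1 - $s1 + 1) + ($e2 - $s2 + 1);
 }

 my $min_score = 3;    # as if "-S 3" had been given

 for my $range ("2..5==1..4", "6..6==6..6") {
   my $score = score_of($range);    # 4 + 4 = 8, then 1 + 1 = 2
   printf "%-12s score %d %s\n", $range, $score,
     ($score >= $min_score ? "(shown)" : "(filtered out)");
 }

Here the four-line match scores 8 and is shown, while the one-line match
scores 2 and is filtered out.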

=head1 BUGS

None known

=head1 LIMITATIONS

The code works well with small input files (up to 10,000 lines or so), but is
too slow and memory intensive for larger files.

=head1 TO DO

Output formatting should be done with Perl6::Form or some such and the output
needs to be more readable.  Suggestions are very welcome.

=head1 AUTHOR

Kim Rutherford <kmr+same@xenu.org.uk>

=cut

use Text::Same;
use Text::Same::TextUI qw( draw_match draw_non_match );

use Getopt::Std;

my %command_line_options = ();
# stop with the usage message if an unknown option is given
getopts('aviBbVynNC:s:x:S:r:', \%command_line_options) or usage();

# set defaults
my %options = (side_by_side => 1);

$options{show_matches} = 1;

if (exists $command_line_options{a}) {
  $options{ignore_case} = 1;
  $options{ignore_blanks} = 1;
  $options{ignore_space} = 1;
  $options{ignore_simple} = 2;
  $options{side_by_side} = 1;
  $options{min_score} = 3;
  $options{show_matches} = 1;
  $options{show_non_matches} = 1;
  $options{context} = 3;
}

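# Options given on the command line only fill in settings that -a has not
# already turned on; they do not override the values set by -a.
$options{ignore_case} ||= $command_line_options{i};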
$options{ignore_blanks} ||= $command_line_options{B};
$options{ignore_space} ||= $command_line_options{b};
$options{ignore_simple} ||= $command_line_options{s};
$options{side_by_side} = 0 if defined $command_line_options{V};
$options{side_by_side} = 1 if defined $command_line_options{y};
$options{term_width} ||= $command_line_options{x};
$options{min_score} ||= $command_line_options{S};
$options{revision} = $command_line_options{r};
$options{context} ||= $command_line_options{C};

if (exists $command_line_options{n}) {
  $options{show_non_matches} = 1;
  $options{show_matches} = 0;
}

if (exists $command_line_options{N}) {
  $options{show_matches} = 1;
  $options{show_non_matches} = 1;
}

sub usage
{
  die <<"USAGE";
 usage: $0 [options] file1 file2
or
 usage: $0 [options] file
or
 usage: $0 [options] [-r version] file

options:
   -b           ignore changes in whitespace
   -i           ignore case
   -B           ignore blank lines
   -s <num>     ignore simple/short lines (i.e. fewer than <num> chars)
   -y           side-by-side match view (default)
   -V           vertical match view
   -n           show non-matches instead of matches
   -N           show both matches and non-matches
   -x <wid>     set terminal width in columns (normally guessed)
   -r <version> compare with <version> from SVN, CVS or RCS
   -S <num>     only show matches with a score of at least <num>
   -C <num>     the number of lines of context to show around each match
   -a           apply (a)ll useful options - sets the following options:
                -b -i -B -s 2 -N -S 3 -C 3

The first usage compares the two files.
The second usage compares the given file with the latest version from
Subversion, CVS or RCS.
The third usage compares against a given version.

See the manual page for more details.
USAGE
}

if (@ARGV < 1 or @ARGV > 2) {
  usage;
}

if (@ARGV == 1) {
  my $revision = "";

  if (defined $options{revision}) {
    $revision = "-r $options{revision} ";
  }

  if (-d ".svn") {
    push @ARGV, "svn cat $revision$ARGV[0]|";
  } elsif (-d "CVS") {
    push @ARGV, "cvs up $revision-p $ARGV[0]|";
  } elsif (-e "$ARGV[0],v") {
    push @ARGV, "co $revision-p $ARGV[0]|";
  } else {
    usage;
  }
}

if (!defined $options{term_width}) {
  $options{term_width} = eval "require Term::Size; Term::Size::chars()";

  # also covers an undefined width, which "== 0" would warn about
  if ($@ or !$options{term_width}) {
    # pick a default
    $options{term_width} = 80;
  }
}

my $file1 = $ARGV[0];
my $file2 = $ARGV[1];

my $matchmap = compare(\%options, $file1, $file2);

if ($options{show_matches}) {
  my @matches = $matchmap->matches;

  for my $match (@matches) {
    if (!defined $options{min_score} or $match->score >= $options{min_score}) {
      print draw_match(\%options, $match);
    }
  }
}

if ($options{show_non_matches}) {
  my @source1_non_matches = $matchmap->source1_non_matches;
  my @source2_non_matches = $matchmap->source2_non_matches;

  print "non matches in ", $matchmap->source1()->name, ":\n";
  for my $non_match (@source1_non_matches) {
    print draw_non_match(\%options, $matchmap->source1, $non_match);
  }
  print "non matches in ", $matchmap->source2()->name, ":\n";
  for my $non_match (@source2_non_matches) {
    print draw_non_match(\%options, $matchmap->source2, $non_match);
  }
}
