RestrictionDigest

DESCRIPTION

	RestrictionDigest can simulate the reference genome digestion and
generate comprehensive information of the simulation. It can simulate single-enzyme
digestion, double-enzyme digestion and size selection process. It can also analyze
multiple genomes at one run and generates concise comparison of enzyme(s) performance
across the genomes. 


INSTALLATION

To install this module, run the following commands:

	perl Makefile.PL
	make
	make test
	make install



METHODS

	Class 

	RestrictionDigest::SingleItem::Double

	Functions

	new
	add_ref
	add_enzyme_pair
	new_enzyme
	change_range
	change_lengths_distribution_parameters
	add_output_dir
	double_digest
	add_SNPs
	count_SNPs_at_fragments
	add_gff
	all_frags_coverage_ratio
	frags_in_range_covearge_ratio
	
	Script Example:

	use RestrictionDigest;
	my $double_digest=RestrictionDigest::SingleItem::Double->new();
	$double_digest->add_ref(-reference =>’ path to the reference file’);
	$double_digest->add_enzyme_pair(-front_enzyme =>’EcoRI’, -behind_enzyme =>’HinfI’);
	$double_digest->change_range(-start =>301,-end =>500);
	$double_digest->change_lengths_distribution_parameters(-front =>200,-behind =>800,-step =>25);
	$double_digest->add_output_dir(-output_dir=>’ path to the output directory’);
	$double_digest->double_digest();
	$double_digest->add_SNPs(-SNPs =>’ path to the SNPs file’);
	$double_digest->count_SNPs_at_fragments(-sequence_type =>’125SE’, -sequence_end =>’front_enzyme’);
	$double_digest->add_gff(-gff =>’ path to the GFF file’);
	$double_digest->frags_in_range_coverage_ratio();


	Class

	RestrictionDigest::SingleItem::Single

	Functions

	new
	add_ref
	add_single_enzyme
	new_enzyme
	change_range
	change_lengths_distribution_parameters
	add_output_dir
	single_digest
	add_SNPs
	count_SNPs_at_fragments
	add_gff
	all_frags_coverage_ratio
	frags_in_range_covearge_ratio

	Script Example:
	use RestrictionDigest;
	my $single_digest=RestrictionDigest::SingleItem::Single->new();
	$single_digest->add_ref(-reference=>’path to the reference file’);
	$single_digest->add_single_enzyme(-enzyme =>’EcoRI’);
	$single_digest->change_range(-start =>301,-end =>500);
	$single_digest->change_lengths_distribution_parameters(-front =>200,-behind =>800,-step =>25);
	$single_digest->add_output_dir(-output_dir=>’path to the output directory’);
	$single_digest->single_digest();
	$single_digest->add_SNPs(-SNPs =>’ path to the SNPs file’);
	$single_digest->count_SNPs_at_fragments(-sequence_type  =>’150SE’);
	$single_digest->add_gff(-gff =>’ path to the GFF file’);
	$single_digest->frags_in_range_coverage_ratio();




	Class 

	RestrictionDigest::MultipleItems::Double

	Functions

	new
	add_refs
	add_enzyme_pair
	new_enzyme
	change_lengths_distribution_parameters
	add_output_dir
	digests_and_compare
 
	Script Example:

	use RestrictionDigest;
	my $multiple_double_digest=RestrictionDigest::MultipleItems::Double->new();
	$multiple_double_digest->add_refs(-ref1 =>’path to reference 1 file’, -ref2 =>’path to reference 2 file’…);
	$multiple_double_digest->add_enzyme_pair(-front_enzyme=>’EcoRI’, -behind_enzyme => ‘HinfI’);
	$multiple_double_digest->change_lengths_distribution_parameters(-front => 200, -behind => 800, -step => 25);
	$multiple_double_digest->add_output_dir (-output_dir => ‘path to the output directory’);
	$multiple_double_digest->digests_and_compare();



	Class

	RestrictionDigest::MultipleItems::Single

	Functions

	new
	add_refs
	add_single_enzyme
	new_enzyme
	change_lengths_distribution_parameters
	add_output_dir
	digests_and_compare

	Script Example:

	use RestrictionDigest;
	my $multiple_double_digest=RestrictionDigest::MultipleItems::Single->new();
	$multiple_single_digest->add_refs(-ref1 =>’path to reference 1 file’, -ref2 =>’path to reference 2 file’…);
	$multiple_single_digest->add_enzyme_pair(-front_enzyme=>’EcoRI’, -behind_enzyme => ‘HinfI’);
	$multiple_single_digest->change_lengths_distribution_parameters(-front => 200, -behind => 800, -step => 25);
	$multiple_single_digest->add_output_dir (-output_dir => ‘path to the output directory’);
	$multiple_single_digest->digests_and_compare();


FORMATS OF THE INPUT REFERENCE FILE, GFF FILE, AND THE SNP COORDINATES FILE

1) RestrictionDigest is designed to process the reference file in the FASTA
format. It should be in the following format:

>ChromsomeName1
ATCGATCGATCGATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCGATCGATCGATCG
>ChromsomeName2
ATCGATCGATCGATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCGATCGATCGATCG



2) For the "GFF" file, it should be in the standard GFF format, here is an example:

<seqname>	<source>	<feature>	<start>	<end>	<score>	<strand>	<frame>	<attribute> 
      

##############################################################################################
scaffold39120	GLEAN	mRNA    36514	50111	.	+	.	ID=OYG_10003188_10008591;   
scaffold39120	GLEAN	Exon	36514	36534	.	+	.	Parent=OYG_10003188_10008591;   
scaffold39120	GLEAN	Exon	43276	43353	.	+	.	Parent=OYG_10003188_10008591;   
scaffold39120	GLEAN	Exon	43766	43868	.	+	.	Parent=OYG_10003188_10008591;   
scaffold39120	GLEAN	Exon	44710	44741	.	+	.	Parent=OYG_10003188_10008591;   
scaffold39120	GLEAN	Exon	49875	50111	.	+	.	Parent=OYG_10003188_10008591;   
scaffold838	GLEAN	mRNA	36580	65627	.	-	.	ID=OYG_10003281_10026064; 
###############################################################################################

The gff file should contain 9 columns and these columns must be seperated by '\t'. 


3) For the SNP coordinates file, it should contain 3 columns, here is an example:

<chromosome name>	<coordinate>	<SNP type>
scaffold18356		19		R
scaffold18356		20		Y
scaffold18356		55		G
scaffold18356		60		Y
scaffold18356		88		Y
scaffold18356		97		M
scaffold18356		114		M


EXPLANATIONS OF RESULT FILES

1. Files produced by RestrictionDigest::SingleItem::Double.


1) "position_FB_BF_frags(_in_range_)?_REFERENCE_by_FRONT_ENZYME_and_BEHIND_ENZYME" 
   "position_FF_frags(_in_range_)?_REFERENCE_by_FRONT_ENZYME_and_BEHIND_ENZYME"
   "position_BB_frags(_in_range_)?_REFERENCE_by_FRONT_ENZYME_and_BEHIND_ENZYME"
The positions of digested fragments(all fragments or fragments in range) 
are saved in these file. There are 7 columns whose descriptions are as follows:
'>Name of fragments', 'strand','FRONT_ENZYME/behind', 'start position', 
'FRONT_ENZYME/behind', 'end position', 'length of fragment'.
    
    a) >Name of fragments.  The name of this fragment, the format consists 
    3 parts. The 1st part is '>'. The 2nd part is the name of scaffold in 
    the reference file User provides. The 3rd part is the serial num of this
    fragment     in all fragments produced in this scaffold.
    
    b) strand. The location of this fragment in the reference sequence, + or -.
    
    c) FRONT_ENZYME/behind. The corresponding enzyme of the overhang of front
    end of this fragments. If the strand is "+" strand, the enzyme is behind 
    enzyme, else, the enzyme is front enzyme.

    d) start position. The start position of this fragment on the specific
    scaffold. It is the start position of the     overhang, not the recognition
    base position. 

    e) FRONT_ENZYME/behind. Like the item 'c', but the opposite one to it.

    f) end postion. The end position of this fragment on the specific scaffold.
    It is also the end position of the     overhang, not the recognition base
    position.

    g) length of fragment. The base length of this fragment.


2) "seq__FB_BF_frags(in_range_)?_REFERENCE_by_FRONT_ENZYME_and_BEHIND_ENZYME"
   "seq_FF_frags(in_range_)?_REFERENCE_by_FRONT_ENZYME_and_BEHIND_ENZYME"
   "seq_BB_frags(in_range_)?_REFERENCE_by_FRONT_ENZYME_and_BEHIND_ENZYME"
The seqences of digested fragments(all fragments or fragments in range)
and their names are saved in these files. There are many double-line. The
first line of the double_line is the name of the fragment in the 
">chromsomeName-[0-9]+" format. The second line of the double_line is 
the sequence of this fragment.

3) "coverage_ratio_every_chromosome_REFERENCE_by_ENZYME". 
The coverage ratios of every chromosome covered by three kinds of fragments
are saved in this file. 
chromosome_name	all_type_overall_rate	all_type_in_range_rate	FB_BF_overall_rate	FB_BF_in_range_rate	FF_overall_rate	FF_in_range_rate	BB_overall_rate	BB_in_range_rate


4) "digestion_summary_REFERENCE_by_FRONT_ENZYME_and_BEHIND_ENZYME". 

	loci_number_of_FRONT_ENZYME	loci_number_of_BEHIND_ENZYME
	
	all_3types_fragments_number	all_3types_fragments_coverage	all_FB_BF_fragments_number	all_FB_BF_fragments_coverage	all_FF_fragments_number	all_FF_fragments_coverage

	3types_fragments_in_range_number	3types_fragments_in_range_coverage	FB_BF_fragments_in_range_number	FB_BF_fragments_in_range_coverage	FF_fragments_in_range_number	FF_fragments_in_range_coverage	BB_fragments_in_range_number	BB_fragments_in_range_coverage

	Length_range	three_types_number	three_types_ratio	FB_BF_number	FB_BF_ratio	FF_number	FF_ratio	BB_number	BB_ratio
 
	0bp-200bp
	201bp-250b
	251bp-300bp
	301bp-350bp
	351bp-400bp
	401bp-450bp
	451bp-500bp
	501bp-550bp
	551bp-600bp
	601bp-650bp
	651bp-700bp
	701bp-750bp
	751bp-800bp
	801bp-850bp
	851bp-900bp
	901bp-950bp
	951bp-1000bp
	1001bp-longer

	Expected_SNPs_at_all_fragments_via_SEQUNCE_TYPE:	
	Expected_SNPs_at_fragments_in_range_via_SEQUNCE_TYPE:

5) "genome_coverage_by_FF_frags(_in_range)?_REFERENCE_by_FRONT_ENZYME_and_BEHIND_ENZYME"
   "genome_coverage_by_FB_BF_frags(_in_range)?_REFERENCE_by_FRONT_ENZYME_and_BEHIND_ENZYME"
   "genome_coverage_by_BB_frags(_in_range)?_REFERENCE_by_FRONT_ENZYME_and_BEHIND_ENZYME"
The coverage ratios of different genomic structures of every chromosome covered
by these three kinds of fragments(all fragments or fragments in range) are saved in these files.
If User apply the 'all_frags_coverage_ratio' or the 
'frags_in_range_coverage_ratio' method, this file will be produced. 
There are 13 columns in this file. They are: 'IntergenicLength',
'IntergenicMapLength','IntergenicMapRatio', 'GenesLength','GenesMapLength',
'GenesMapRatio', 'ExonLength', 'ExonMapLength', 'ExonMapRatio', 'IntronLength', 
'IntronMapLength','IntronMapRatio'. The first line is header. 
   
    a) GeneLength, ExonLength, IntronLength. The length of these different 
    parts, if some part does not exist, N/A is  given.
    b) GeneMapLength, ExonMapLength, IntronMapLength. The length of mapped
    part by all digested fragments or fragments in range of these parts. 
    If some part does not exists, 0 is given.
    c) GeneMapRatio, ExonMapRatio, IntronMapRatio. The ratio of mapped part
    by all digested fragments or fragments in range of these parts. If some
    part does not exits, 0 is given.


2. Files produced by RestrictionDigest::SingleItem::Single. 


1) "position_frags(_in_range_)?_REFERENCE_by_ENZYME". The positions of
digested fragments(all fragments or fragments in range) are saved in this
file. There are 7 columns whose descriptions are as follows:
'>Name of fragments', 'start position',  'end position', 'length of fragment'.
    
    a) >Name of fragments.  The name of this fragment, the format consists
    3 parts. The 1st part is '>'. The 2nd part is the name of scaffold in
    the reference file User provides. The 3rd part is the serial num of 
    this fragment in all fragments produced in this scaffold.
    

    b) start position. The start position of this fragment on the specific
    scaffold. It is the start position of the overhang, not the recognition 
    base position. 

   
    c) end postion. The end position of this fragment on the specific 
    scaffold. It is also the end position of the overhang, not the 
    recognition base position.

    d) length of fragment. The base length of this fragment.


2) "seq_frags(in_range_)?_REFERENCE_by_ENZYME". The seqences of digested
fragments(all fragments or fragments in range) and their names are saved 
in this file. There are many double-line. The first line of the double_line
is the name of the fragment in the ">chromsomeName-[0-9]+" format. The 
second line of the double_line is the sequence of this fragment.

3) "coverage_ratio_every_chromosome_REFERENCE_by_ENZYME". The length ratio 
of fragments in every scaffold are saved in this file. There are three 
columns in this file. The first column contains the names of scaffolds. 
And the second column contains the length ratio of all fragments in this
scaffold. And the third column contains the length ratio of fragments in 
range in this scaffold.

4) "digestion_summary_REFERENCE_by_FRONT_ENZYME_and_BEHIND_ENZYME". 

	loci_number_of_ENZYME 

	all_fragments_number	all_fragments_coverage	fragments_in_range_number	fragments_in_range_coverage
	
	Length_range	fragments_number	fragments_ratio

	0bp-200bp
	201bp-250b
	251bp-300bp
	301bp-350bp
	351bp-400bp
	401bp-450bp
	451bp-500bp
	501bp-550bp
	551bp-600bp
	601bp-650bp
	651bp-700bp
	701bp-750bp
	751bp-800bp
	801bp-850bp
	851bp-900bp
	901bp-950bp
	951bp-1000bp
	1001bp-longer

	Expected_SNPs_at_all_fragments_via_SEQUNCE_TYPE:	
	Expected_SNPs_at_fragments_in_range_via_SEQUNCE_TYPE:	



5) "genome_coverage_ratio(_in_range)?_REFERENCE_by_ENZYME". If User 
apply the 'all_frags_coverage_ratio' or the 'frags_in_range_coverage_ratio'
method, this file will be produced. 
There are 13 columns in this file. They are: 
'IntergenicLength','IntergenicMapLength','IntergenicMapRatio', 'GenesLength',
'GenesMapLength', 'GenesMapRatio', 'ExonLength', 'ExonMapLength', 'ExonMapRatio',
'IntronLength', 'IntronMapLength','IntronMapRatio'. The first line is header. 

    a)  
    b) GeneLength, ExonLength, IntronLength, UTRLength. The length of 
    these different parts, if some part does not     exist, N/A is given.
    c) GeneMapLength, ExonMapLength, IntronMapLength, UTRMapLength. The 
    length of mapped part by all digested fragments     or fragments in
    range of these parts. If some part does not exists, 0 is given.
    d) GeneMapRatio, ExonMapRatio, IntronMapRatio, UTRMapRatio. The ratio
    of mapped part by all digested fragments or fragments in range of 
    these parts. If some part does not exits, 0 is given.

3. Files produced by RestrictionDigest::MultipleItems::Double.

1) "digestion_summary_multiple_genomes_by_FRONT_ENZYME_and_BEHIND_ENZYME"

	Number_of_enzyme_loci	REFERENCE1	REFERENCE2	....
	Loci_number_of_FRONT_ENZYME
	Loci_number_of_BEHIND_ENZYME	
	Length_range	three_types_number_REFERENCE1	three_types_ratio_REFERENCE1	FB_BF_number_REFERENCE1	FB_BF_ratio_REFERENCE1	FF_number_REFERENCE1	FF_ratio_REFERENCE1	BB_number_REFERENCE1	BB_ratio_REFERENCE1	three_types_number_REFERENCE2	three_types_ratio_REFERENCE2	FB_BF_number_REFERENCE2	FB_BF_ratio_REFERENCE2	FF_number_REFERENCE2	FF_ratio_REFERENCE2	BB_number_REFERENCE2	BB_ratio_REFERENCE2
	0bp-200bp
	201bp-250b
	251bp-300bp
	301bp-350bp
	351bp-400bp
	401bp-450bp
	451bp-500bp
	501bp-550bp
	551bp-600bp
	601bp-650bp
	651bp-700bp
	701bp-750bp
	751bp-800bp
	801bp-850bp
	851bp-900bp
	901bp-950bp
	951bp-1000bp
	1001bp-longer
	

4. Files produced by RestrictionDigest::MultipleItems::Single.

1) "digestion_summary_multiple_genomes_by_ENZYME"
	Number_of_enzyme_loci	REFERENCE1	REFERENCE2	....
	Loci_number_of_ENZYME	
	Length_range	fragments_number_REFERENCE1	fragments_ratio_REFERENCE1	fragments_number_REFERENCE2	fragments_ratio_REFERENCE2
	0bp-200bp
	201bp-250b
	251bp-300bp
	301bp-350bp
	351bp-400bp
	401bp-450bp
	451bp-500bp
	501bp-550bp
	551bp-600bp
	601bp-650bp
	651bp-700bp
	701bp-750bp
	751bp-800bp
	801bp-850bp
	851bp-900bp
	901bp-950bp
	951bp-1000bp
	1001bp-longer




SUPPORT AND DOCUMENTATION

After installing, you can find documentation for this module with the
perldoc command.

    perldoc RestrictionDigest

You can also look for information at:

    RT, CPAN's request tracker (report bugs here)
        http://rt.cpan.org/NoAuth/Bugs.html?Dist=RestrictionDigest

    AnnoCPAN, Annotated CPAN documentation
        http://annocpan.org/dist/RestrictionDigest

    CPAN Ratings
        http://cpanratings.perl.org/d/RestrictionDigest

    Search CPAN
        http://search.cpan.org/dist/RestrictionDigest/


LICENSE AND COPYRIGHT

Copyright (C) 2015 Jinpeng Wang, Li Li, Haigang Qi, Xuedi Du, and Guofan Zhang

This program is free software; you can redistribute it and/or modify it
under the terms of the the Artistic License (2.0). You may obtain a
copy of the full license at:

L<http://www.perlfoundation.org/artistic_license_2_0>

Any use, modification, and distribution of the Standard or Modified
Versions is governed by this Artistic License. By using, modifying or
distributing the Package, you accept this license. Do not use, modify,
or distribute the Package, if you do not accept this license.

If your Modified Version has been derived from a Modified Version made
by someone other than you, you are nevertheless required to ensure that
your Modified Version complies with the requirements of this license.

This license does not grant you the right to use any trademark, service
mark, tradename, or logo of the Copyright Holder.

This license includes the non-exclusive, worldwide, free-of-charge
patent license to make, have made, use, offer to sell, sell, import and
otherwise transfer the Package with respect to any patent claims
licensable by the Copyright Holder that are necessarily infringed by the
Package. If you institute patent litigation (including a cross-claim or
counterclaim) against any party alleging that the Package constitutes
direct or contributory patent infringement, then this Artistic License
to you shall terminate on the date that such litigation is filed.

Disclaimer of Warranty: THE PACKAGE IS PROVIDED BY THE COPYRIGHT HOLDER
AND CONTRIBUTORS "AS IS' AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES.
THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE, OR NON-INFRINGEMENT ARE DISCLAIMED TO THE EXTENT PERMITTED BY
YOUR LOCAL LAW. UNLESS REQUIRED BY LAW, NO COPYRIGHT HOLDER OR
CONTRIBUTOR WILL BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, OR
CONSEQUENTIAL DAMAGES ARISING IN ANY WAY OUT OF THE USE OF THE PACKAGE,
EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

