


User Contributed Perl Documentation           CompBio::CompBio(3)



NNNNAAAAMMMMEEEE
     CompBio - Core library for some basic methods useful in
     computational biology/bioinformatics.

SSSSYYYYNNNNOOOOPPPPSSSSIIIISSSS
     use CompBio;

     my $cbc = new->CompBio;

DDDDEEEESSSSCCCCRRRRIIIIPPPPTTTTIIIIOOOONNNN
     The CompBio module set is being developed as a new
     implementation of the code base originally developed at the
     BioMolecular Engineering Research Center (http://bmerc-
     www.bu.edu). CompBio.pm is intended to take a number of
     small, commonly used methods for supporting bioinformatics
     research, and make them into a single package. Many of the
     utils included with this distribution are just command line
     interfaces to the methods contained herein.

     The CompBio module set is _not_ intended to replace the
     bioperl project (http://www.bioperl.org/). Although I do
     welcome suggestions for improving or adding to the methods
     available in these modules, particularly I would love any
     help with things on the TO DO list, these modules are not
     intended to provide the depth that the bioperl suite can
     provide.

     CompBio has a limited API. It expects it's input to be in
     specific formats, as described in each methods description,
     and it's output is in a format that makes the most sense to
     me for that method. It does no error checking by and large,
     so incorrect input could cause bizzare behavior and/or a
     noisy death. If you want a flexible interface with lots of
     error checking and deep levels of verbosity, use the
     CompBio::Simple manpage - that's its job.

     Thanks!

     Other modules available (or that will be available) in the
     CompBio set are:

     The DB module will only be imediately useful if you import
     the databases as used by us here at the
     BMERC(ftp://mcclintock.bu.edu/BMERC/mysql/) or develop your
     own on the same basic design scheme. Otherwise I hope you
     find it useful as a source of design ideas for rolling your
     own. Please note however that I intend to expand the methods
     in CompBio/Simple.pm to allow seemless access to data
     through the DB module. Although at no time will including
     the DB module be required for Simple to work, I think that
     if you have the space for it, you will find having the data
     locally and adding this module (or adapting it to work with



2001-09-19          Last change: perl v5.6.0                    1






User Contributed Perl Documentation           CompBio::CompBio(3)



     databases already installed) will have an imense and
     imediate benifiacial impact. It did for us at least! :)

     The Profile module was designed to work with our PIMA-II
     sofware and the PIMA modules. The PIMA suite is available
     for license from Boston University for a nominal fee, and
     free for academic use. For examples and more info see
     http://bmerc-www.bu.edu/PIMA/. Unless you have or are
     interested in that package, this module will have no
     functional value.

MMMMeeeetttthhhhooooddddssss
     You may note that the majority of the methods here are for
     converting sequences from one format to another. Mainly this
     is for converting other formats to table format, which is
     used by most of these programs. This is not meant to be a
     comprhensive collection of format guessing and
     transformation methods. If you are looking for a converter
     for a format CompBio doesn't handle, I suggest you look into
     bioperls SeqIO package or the READSEQ program (java), found
     at http://iubio.bio.indiana.edu/soft/molbio/readseq/

     nnnneeeewwww

     Construct an object for invoking methods in CompBio.

     hhhheeeellllpppp

     Quits current application and uses perldoc to display the
     POD.

     cccchhhheeeecccckkkk____ttttyyyyppppeeee

     Checks a given sequence or set of sequences for it's type.
     Currently groks fasta(.fa), table(.tbl), raw genome(.raw),
     intelligenics[?](.ig) and coding dna sequence(.cdna) types.
     Each index of the referenced array should be an entire
     sequence record. * It is however nnnnooootttt recomended that you
     load up an entire raw genome into memory to do this test -
     see the perlfunc read entry elsewhere in this document *

     `$type = check_type(\@seqs,%parameters);'

     Possible return types are CDNA, TBL, FA, IG, RAW and
     UNKNOWN.

     Be warned, this is intended only as a quick check and only
     uses as many records as necisarry in the reference provided,
     stating with the first.  check_type assumes the rest of the
     records look the same and does not do any kind of deep QA on
     the set. If you are not sure, invoke check_type with a few
     random samples from your set, or use the CompBio::Simple



2001-09-19          Last change: perl v5.6.0                    2






User Contributed Perl Documentation           CompBio::CompBio(3)



     manpage, which does that by default.

     ttttbbbbllll____ttttoooo____ffffaaaa

     Converts a sequence record in table (tab delimited, usually
     .tbl file extension) format to fasta format.

     `$aref_faseqs = $cbc-'tbl_to_fa(\@seqdat,%params);>

     Each index in the @seqdat array must contain entire record
     (loci\tsequence) for single sequence. Return is an array
     reference, still one sequence per index.

     Extra data fields in the table format will be added to the
     annotation line (>) in the fasta output.

     ttttbbbbllll____ttttoooo____iiiigggg

     Converts a sequence in table (tab delimited) format to .ig
     format. Accepts sequences in a referenced array, one record
     per index.

     `$aref_igseqs = $cbc-'tbl_to_ig(\@tbl_seqs,%params);>

     Extra data fields in the table format will be placed on a
     single comment line (;) in the ig output.

     ffffaaaa____ttttoooo____ttttbbbbllll

     Accepts fasta format sequences in a referenced array, one
     complete sequence record per index. This method returns the
     _s_e_q_u_e_n_c_e(s) in table(.tbl) format contained in a referenced
     array.

     `$aref_faseqs = $cbc-'fa_to_tbl(\@fa_seq);>

     Extra data fields in the fasta format will be placed as
     extra tab delimited values after the sequence record.

     Options:

     CLEAN: Reduces signifier to the first strech of non white
     space characters found at least 4 characters long with no of
     the characters (| \ ? / ! *)

     iiiigggg____ttttoooo____ttttbbbbllll

     Accepts ig format sequences in a referenced array, one
     complete sequence record per index. This method returns the
     _s_e_q_u_e_n_c_e(s) in table(.tbl) format contained in a referenced
     array.




2001-09-19          Last change: perl v5.6.0                    3






User Contributed Perl Documentation           CompBio::CompBio(3)



     `$aref_igseqs = $cbc-'ig_to_tbl(\@fa_seq);>

     Extra comment lines in the ig format will be placed as extra
     tab delimited values after the sequence record.

     ddddnnnnaaaa____ttttoooo____aaaaaaaa

     Converts a dna sequence, containing no whitespace, submited
     as a scalar reference, to the amino acid residues as coded
     by the standard 'universal' genetic code.  Return is a
     reference to a scalar. dna sequence may contain standard
     special characters (i.e. R,S,B,N, ect.).

     `$aa = dna_to_aa(\$dna_seq,%params);'

     Options:

     C: Set to a true value to indicate dna should be converted
     to it's complement before translation.

     ALTCODE: A reference to a hash containing alternate coding
     keys where the value is the new aa to code for. Stop codons
     are represented by ".". Multiple allowable values for the
     final dna in the codon may be provided in the format
     AT[TCA].

     SEQFIX: Set to true to alter first position, making V or L
     an M, and removing stop in last position.

     ccccoooommmmpppplllleeeemmmmeeeennnntttt

     Converts dna to it's complementary strand. DNA sequence is
     submitted as scalar by reference. There is no return as
     sequence is modified. To maintain original sequence, send a
     reference to a copy.

     `complement(\$dna);'

     ssssiiiixxxx____ffffrrrraaaammmmeeee

     Converts a submitted dna sequence into all 6 frame
     translations to aa. Value submitted may be a reference to a
     scalar containing the dna or a filename.  Either way dna
     must be in RAW(.raw) format (Not whitespace within sequence,
     only one newline allowed at end of record). Output id's have
     start and stop positions encoded; if first value is larger,
     translated from anti-sense.

     `$result = six_frame($raw_file,$id,$seq_len,$out_file);'

     Options:




2001-09-19          Last change: perl v5.6.0                    4






User Contributed Perl Documentation           CompBio::CompBio(3)



     ALTCODE: A reference to a hash containing alternate coding
     keys where the value is the new aa to code for. Stop codons
     are represented by ".". Multiple allowable values for the
     final dna in the codon may be provided in the format
     AT[TCA].

     ID: Prefix for translated segment identifiers, default is
     SixFrame.

     SEQLEN: Minimum length of aa sequence to return, default is
     10.

     OUTPUT: A filename to pipe results to. This is recomended
     for large dna sequences such as large contigs or whole
     chromosomes, otherwise results are stored in memory until
     process complete. If the value of OUTFILE is STDOUT results
     will be sent directly to standard out.

     wwwwuuuu____bbbbllllaaaasssstttt

     A simple interface to the Washington University distribution
     of blast. Currently has only been tested with the 2.0a8MP
     release.

      Takes series of arguments qw($method $database $queryfile
     "parameter list","noise",$cpu_server). Method can be any allowed blast method. Database is the blastable fasta format
     database to be searched. queryfile is the file containg the fasta sequence you wish to use in the
     blast comparison. "parameter list" is a quote inclosed set of parameter arguments to hand blast.
     Optional final arguments are the noise or debig level to use (default is quiet), "quiet" tells blast
     to operate silently, anything else prints some debugging messages and allows any errors from blast
     to be printed. Final argument is for the 'cpu server' where blast is to execute and must follow
     a noise level argument.

             Only blastp and blastx  have default parameters at moment:
                     blastp  B=10  e=1e-3  filter=seg+xnu
                     blastx  B=25  e=1e-3  filter=seg+xnu
             If any parameters are supplied method will not use any default values!

     `$blast_out = blast('blastp',$fa_file,$queryfile,"-B=0
     -matrix=BLOSUM62","quiet"); '

     AAAAAAAA____HHHHAAAASSSSHHHH

     Creates a hash table with amino acid lookups by codon.
     Includes all cases where even an alternate na code (such as
     M for A or C) would return an unambiguous aa.  Also
     consistent with the complement method in this package, ie,
     lower cases in some contexts, for ease of use with
     six_frame.

      C<%aa_hash = aa_hash;>




2001-09-19          Last change: perl v5.6.0                    5






User Contributed Perl Documentation           CompBio::CompBio(3)



EEEEXXXXPPPPOOOORRRRTTTT
     None by default.

HHHHIIIISSSSTTTTOOOORRRRYYYY
     0.01    Original version; created by h2xs 1.20 with options

               -AXC -n CompBio


     0.20    Begin porting core function code for most initial
             methods from various sources at BMERC. Write
             protocode and placeholders for what I want in
             initial 'complete' version.

     0.44    Almost everything eneded up being rewritten
             practically from scratch.  Removed lingering locale
             assumptions and (hopefully) improving interface
             (added use of %params mainly).  Added OOP useability
             so this module should now work either way.

     0.45    Modifications to Simple primarilly.

     0.451   Fixed faulty statement in MANIFEST.SKIP that
             excluded Makefile.PL from distribution (thanks to
             Andreas Riechert for pointing this out to me almost
             imediately).

     0.46    Added some documentation, mostly on options.  Added
             the ALTCODE option to six_frame.  Finished modifying
             tbl_to* converters to make a pass at handeling extra
             data feilds from table format and changing *_to_tbl
             converters to keep extra data besides id in extra
             fields in table format.

TTTTOOOO DDDDOOOO - iiiinnnn nnnnoooo ppppaaaarrrrttttiiiiccccuuuullllaaaarrrr oooorrrrddddeeeerrrr
     I like the basic design so far - except where large sequence
     sets are to be munged.  There this becomes a seriouse memory
     hog. Some faster & more efficient way needs to be provided
     when dealing with files & large data sets.

     **Got it - needs work!** Get the full release of WashU Blast
     and make sure the blast stuff works with it as well.

     OK, rethink blast _m_e_t_h_o_d(s) completely. Old and kludged and
     needs to die.

     Add a method for handeling ncbi blast. CompBio::Simple
     should only have a blast method that will DTRT and use the
     appropriate core methods.

     Add DNA* and GCG format handelers




2001-09-19          Last change: perl v5.6.0                    6






User Contributed Perl Documentation           CompBio::CompBio(3)



     Add handler for the genbank report type format:  1   attgc
     gtgct 11  gtgtg   gacaa Which, though annoying, seems a
     certain candidate for recieving in cut and paste operations,
     particularly through the web.

     Better way to handle CPUSERVER. There must be some way to
     allow the user to define (presumably during ./config?) how
     to submit jobs for more intensive computational tools such
     as Blast and PIMA. Things like rsh, submitting to a batch
     queue, etc should all be definable at install somehow.

     write configure script to populate these globals as part of
     install process.

     modify tests to look for inclusion of Profile.pm or such and
     skip testing where apropriate if the PIMA suite is not
     installed.

     Find out if there is a way using OOP to allow a user to only
     need to include this module and create new objects for the
     submodules, or even better just DTRT. Can this be done just
     through inheritence? Should it be triggered by requested
     export (like use CompBio qw(Simple DB)? Or through an
     AUTOLOAD type interface, returning the correct object ($cbs
     = CompBio->Simple("new") or some such)? I think that would
     be far more desirable than _having_ to use a bunch of
     modules all in the CompBio namespace.

     _error (all packages) needs to correctly report line where
     error occured.  Can this be done through caller or do I need
     to pass manually?

CCCCOOOOPPPPYYYYRRRRIIIIGGGGHHHHTTTT
     Developed at the BioMolecular Engineering Research Center at
     Boston University under the NHLBI's Programs for Genomic
     Applications grant.

     Copyright Sean Quinlan, Trustees of Boston University
     2000-2001.

     All rights reserved. This program is free software; you can
     redistribute it and/or modify it under the same terms as
     Perl itself.

AAAAUUUUTTTTHHHHOOOORRRR
     Sean Quinlan, seanq@darwin.bu.edu

     Please email me with any changes you make or suggestions for
     changes/additions.  Latest version is available under
     ftp://mcclintock.bu.edu/BMERC/perl/. Thank you!





2001-09-19          Last change: perl v5.6.0                    7






User Contributed Perl Documentation           CompBio::CompBio(3)



     I would like to thank the staff at the BMERC for being my
     guinee pigs for the past few years, Jim Freeman who got me
     into this and left me with flint and tinder rather than the
     ember in a clay pot he started with, and my boss and
     (tor)mentor Dr. Temple F. Smith! I would never have gotten
     this far or learned this much without them.

SSSSEEEEEEEE AAAALLLLSSSSOOOO
     _p_e_r_l(1), _C_o_m_p_B_i_o::_S_i_m_p_l_e(1).














































2001-09-19          Last change: perl v5.6.0                    8



