
Bioperl FAQ
-----------
v. 1.0.1

This FAQ maintained by:
* Jason Stajich <jason@bioperl.org>
* Brian Osborne <brian_osborne@cognia.com>
* Heikki Lehvaslaiho <heikki@ebi.ac.uk>


---------------------------------------------------------------------------

Contents

---------------------------------------------------------------------------

0. About this FAQ

  Q0.1: What is this FAQ?
  Q0.2: How is it maintained?

1. Bioperl in general

  Q1.1: What is Bioperl?
  Q1.2: Where do I go to get the latest release?
  Q1.3: What is the difference between 0.9.x and 0.7.x? What do you mean
	by developer release?
  Q1.4: Is it BioPerl, bioperl, bio.perl.org, Bioperl? What's the deal?
  Q1.5: How do I figure out how to use a module?
  Q1.6: I'm interested in the bleeding edge version of the code, where can
	I get it?
  Q1.7: Who uses this toolkit?
  Q1.8: How should I cite Bioperl?
  Q1.9: What are the License terms for Bioperl?
  Q1.10: I want to help, where do I start?
  Q1.11: I've got an idea for a module, how do I contribute it?

2. Sequences

  Q2.1: How do I parse a sequence file?
  Q2.2: I can't get sequences with Bio::DB::GenBank any more, why not?
  Q2.3: How can I get NT_ or NM_ accessions from NCBI (Reference
	Sequences)?
  Q2.4: How can I use SeqIO to parse sequence data from a string?

3. Report parsing

  Q3.1: I want to parse BLAST, how do I do this?
  Q3.2: What's wrong with Bio::Tools::Blast?
  Q3.3: I want to parse FastA or NCBI -m7 (XML) format, how do I do this?
  Q3.4: Let's say I want to do pairwise alignments of 2 sequences, how can
	I do this?
  Q3.5: I'm using BPLite.pm and its frame() to parse Blast but I'm seeing
	0, 1, or 2 instead of the expected -3, -2, -1, +1, +2, +3. Why am I
	seeing these different numbers and how do I get the frame according
	to Blast?

4. Utilities

  Q4.1: How do I find all the ORFs in a nucleotide sequence? Antigenic
	sites in a protein? Calculate nucleotide melting temperature? Find
	repeats?
  Q4.2: How do I do motif searches with Bioperl? Can I do "find all
	sequences that are 75% identical" to a given motif?
  Q4.3: Can I query MEDLINE or other bibliographic repositories using
	Bioperl?


---------------------------------------------------------------------------

0. About this FAQ

---------------------------------------------------------------------------



  Q0.1: What is this FAQ?

	It is the list of Frequently Asked Questions about Bioperl.


  Q0.2: How is it maintained?

	This FAQ was generated using a Perl script and an XML file. All the
	files are in the Bioperl distribution directory doc/faq. So do not
	edit this file! Edit file faq.xml and run:

	% faq.pl -text faq.xml

	The XML structure was originally used by the Perl XML project,
	whose website seems to have vanished. The XML and the scripts that
	process it were copied from Michel Rodriguez's web site
	http://www.xmltwig.com/xmltwig/XML-Twig-FAQ.html and modified to
	our needs.


---------------------------------------------------------------------------

1. Bioperl in general

---------------------------------------------------------------------------



  Q1.1: What is Bioperl?

	Bioperl is a toolkit of Perl modules useful in building
	bioinformatics solutions in Perl.  It is built in an
	object-oriented manner so that many modules depend on each other to
	achieve a task. The collection of modules in the bioperl-live
	repository constitutes the core functionality of Bioperl.
	Additionally, auxiliary modules for creating graphical interfaces
	(bioperl-gui), persistent storage in an RDBMS (bioperl-db), and
	CORBA bridges to the BioCORBA (http://www.biocorba.org)
	specification (bioperl-corba-server and bioperl-corba-client) are
	all available as CVS modules in our repository.


  Q1.2: Where do I go to get the latest release?

	You can always get our releases from ftp://bioperl.org/pub/DIST.
	Official releases will be noted on the website http://bioperl.org.


  Q1.3: What is the difference between 0.9.x and 0.7.x? What do you mean
	by developer release?

	The 0.7.X series (0.7.0, 0.7.2) were all released in 2001 and were
	stable releases on the 0.7 branch. This means they had a set of
	functionality that was maintained throughout (no experimental
	modules) and were guaranteed to pass all tests. Subsequent bug fix
	releases with the 0.7 designation would not have any API changes.

	The 0.9.X series was our first attempt at releasing so-called
	developer releases.  These are snapshots of the actively developed
	code that at a minimum pass all our tests.

	But really, you should be using version 1.*!


  Q1.4: Is it BioPerl, bioperl, bio.perl.org, Bioperl? What's the deal?

	Well, the perl.org guys granted us use of bio.perl.org. We prefer
	to be called Bioperl or BioPerl (unlike our Biopython friends). 
	We're part of the Open Bioinformatics Foundation (OBF) and so as
	part of the Bio{*} toolkits we prefer the Bioperl spelling.  But
	we're not really all that picky so no worries. 


  Q1.5: How do I figure out how to use a module?

	Read the embedded Perl documentation (Plain Old Documentation -
	POD) that is part of every module.  Do:

	% perldoc MODULE

	(careful - spelling and case count!).

	The bioperl tutorial - bptutorial.pl - provided in the root
	directory of the bioperl release will also provide a good
	introduction.  There are links to tutorials off the bioperl website
	that may provide some additional help.

	There are also many scripts in the examples/ and scripts/
	directories that could be useful - see bioperl.pod for a brief
	description of all of them.

	Additionally, we have written many tests for our modules. You can
	see test data and example usage of the modules in these tests -
	look in the test dir (called 't').


  Q1.6: I'm interested in the bleeding edge version of the code, where can
	I get it?

	Go to http://cvs.bioperl.org and you'll see instructions on how to
	get the CVS code.

	Basically:

	% cvs -d :pserver:cvs@cvs.bioperl.org:/home/repository/bioperl login

	Enter 'cvs' for the password, then:

	% cvs -d :pserver:cvs@cvs.bioperl.org:/home/repository/bioperl co bioperl_all


  Q1.7: Who uses this toolkit?

	Lots of people: the Sanger Centre, EBI, many large and small
	academic laboratories, and large and small pharmaceutical
	companies. All the developers on the bioperl list use the toolkit
	in some capacity on a regular basis.

	The Genquire annotation system
	(http://www.bioinformatics.org/Genquire/) and Ensembl
	(http://www.ensembl.org/) use bioperl as the basis for their
	implementation.


  Q1.8: How should I cite Bioperl?

	For now, cite it as "The Bioperl Project, http://www.bioperl.org".


  Q1.9: What are the License terms for Bioperl?

	Bioperl is licensed under the same terms as Perl itself, which is
	the Perl Artistic License. You can see more information on that
	license at http://www.perl.com/pub/a/language/misc/Artistic.html
	and http://www.opensource.org/licenses/artistic-license.html.


  Q1.10: I want to help, where do I start?

	Bioperl is a pretty diverse collection of modules which has grown
	from the direct needs of the developers participating in the
	project.  So if you don't have a need for a specific module in the
	toolkit, it is hard for us to just describe ways it needs to be
	expanded or adapted.  One area, however, is the development of
	stand-alone scripts which use bioperl components for common tasks.
	Some starting points for scripts: find out what people in your
	institution do routinely that a shortcut can be developed for.
	Identify modules in bioperl that need easy interfaces and write
	that wrapper - you'll learn how to use the module inside and out.
	We always need people to help fix bugs - read the jitterbug bug
	tracking system (webpage linked from the bioperl website sidebar
	under "Bugs").


  Q1.11: I've got an idea for a module, how do I contribute it?

	We suggest the following.  Post your idea to the bioperl list. If
	it is a really new idea consider taking us through your thought
	process.  We'll help you tease out the necessary information such
	as what methods you'll want and how it can interact with other
	bioperl modules.  If it is a port of something you've already
	worked on, give us a summary of the current methods.  Make sure
	there is an interface to the module, not just an implementation
	(see biodesign.pod for more info) and make sure there will be a
	set of tests in the t/ directory to ensure that your module is
	tested.


---------------------------------------------------------------------------

2. Sequences

---------------------------------------------------------------------------



  Q2.1: How do I parse a sequence file?

	Use the Bio::SeqIO system.  This will create Bio::Seq objects for
	you.  See the tutorial bptutorial.pl for more information or the
	documentation for Bio::SeqIO (e.g. 'perldoc Bio::SeqIO').
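
	As a minimal sketch of the Bio::SeqIO approach, the script below
	writes a small FASTA file and reads it back. The temporary file,
	the two example records and the read_sequences() helper are
	assumptions made for illustration; only Bio::SeqIO->new, next_seq,
	display_id and length come from Bioperl.

```perl
use strict;
use warnings;
use Bio::SeqIO;
use File::Temp qw(tempfile);

# Write a tiny FASTA file so the example is self-contained
# (the records are made up for illustration).
my ($fh, $fasta_file) = tempfile(SUFFIX => '.fa', UNLINK => 1);
print {$fh} ">seq1 first example\nATGCATGC\n",
            ">seq2 second example\nGGCCTTAA\nAA\n";
close $fh;

# Hypothetical helper: return [display_id, length] for each sequence
# in $file, parsed in the given format ('fasta', 'genbank', ...).
sub read_sequences {
    my ($file, $format) = @_;
    my $in = Bio::SeqIO->new(-file => $file, -format => $format);
    my @summary;
    while (my $seq = $in->next_seq) {    # each $seq is a Bio::Seq object
        push @summary, [ $seq->display_id, $seq->length ];
    }
    return @summary;
}

my @seqs = read_sequences($fasta_file, 'fasta');
printf "%s\t%d\n", @$_ for @seqs;
```

	To read another format, change the -format value; the loop stays
	the same.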


  Q2.2: I can't get sequences with Bio::DB::GenBank any more, why not?

	NCBI changed the web CGI script that provided this access.  If
	retrieval fails, you are most likely using bioperl <= 0.7.2.  The
	developer release 0.9.3 contains the fix, as does the 1.0 release.


  Q2.3: How can I get NT_ or NM_ accessions from NCBI (Reference
	Sequences)?

	Use Bio::DB::RefSeq, not Bio::DB::GenBank, when you are retrieving
	the NM_ accessions. This is still an area of active development
	because the data providers have not provided the best interface for
	us to query.  EBI has provided a mirror with their dbfetch system
	which is accessible through the Bio::DB::RefSeq object. However,
	there are cases where NT_ accessions will not be retrievable.


  Q2.4: How can I use SeqIO to parse sequence data from a string?

	
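	Wrap the string in a filehandle with IO::String and pass that
	handle to Bio::SeqIO through the -fh argument:

	  use IO::String;
	  use Bio::SeqIO;

	  # $string holds the sequence data, here in FASTA format
	  my $stringfh = IO::String->new($string);
	  my $seqio = Bio::SeqIO->new(-fh => $stringfh, -format => 'fasta');

	  while( my $seq = $seqio->next_seq ) {
	      # process each seq
	  }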


---------------------------------------------------------------------------

3. Report parsing

---------------------------------------------------------------------------



  Q3.1: I want to parse BLAST, how do I do this?

	Well, you might notice that there are a lot of choices. Sorry about
	that.  We've been evolving towards a single solution.

	Currently the best way to parse a report is to use the SearchIO
	system.  This supports blast and fasta report parsing. The
	bptutorial provides an example of how to use this system, as does
	the documentation for Bio::SearchIO.
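
	A minimal sketch of the SearchIO loop follows. The
	summarize_report() helper and the command-line handling are
	assumptions for illustration, and you need to supply your own
	report file; the nesting of results, hits and HSPs follows the
	Bio::SearchIO documentation.

```perl
use strict;
use warnings;
use Bio::SearchIO;

# Hypothetical helper: return one [query name, hit name, E-value] row
# per HSP found in the report. $format is 'blast', 'fasta' or 'blastxml'.
sub summarize_report {
    my ($file, $format) = @_;
    my $searchio = Bio::SearchIO->new(-file => $file, -format => $format);
    my @rows;
    while (my $result = $searchio->next_result) {   # one per query
        while (my $hit = $result->next_hit) {       # one per database hit
            while (my $hsp = $hit->next_hsp) {      # one per aligned segment
                push @rows,
                    [ $result->query_name, $hit->name, $hsp->evalue ];
            }
        }
    }
    return @rows;
}

# Runs only when a report file is given on the command line;
# the format defaults to 'blast' when no second argument is supplied.
if (@ARGV) {
    my ($file, $format) = @ARGV;
    print join("\t", @$_), "\n"
        for summarize_report($file, $format || 'blast');
}
```

	Passing 'fasta' or 'blastxml' as the second argument selects a
	different parser while the loop itself stays unchanged.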


  Q3.2: What's wrong with Bio::Tools::Blast?

	Nothing is really wrong with it; it has just been outgrown by a
	more generic approach to reports.  This generic approach allows us
	to just write pluggable modules for fasta and Blast parsing while
	using the same framework.  This is completely analogous to the
	Bio::SeqIO system of parsing sequence files.  However, the objects
	produced are of the Bio::Search rather than Bio::Seq variety.


  Q3.3: I want to parse FastA or NCBI -m7 (XML) format, how do I do this?

	It is as simple as parsing text BLAST results - you simply need to
	specify the format as "fasta" or "blastxml" and the parser will
	load the appropriate module for you.  You can use exactly the same
	logic and code for all of these formats, as we have generalized the
	modules for sequence database searching.


  Q3.4: Let's say I want to do pairwise alignments of 2 sequences, how can
	I do this?

	See Bio::Factory::EMBOSS for how to use the 'water' and 'needle'
	alignment programs that are part of the EMBOSS suite.

	Additionally you can use the pSW module that is part of the
	bioperl-ext package (distributed separately at
	ftp://bioperl.org/pub/DIST). However, note that this only does
	protein alignments and is no longer a supported module.  Instead
	the EMBOSS implementation is the best path ahead unless someone
	else wants to provide an Inline::C implementation.


  Q3.5: I'm using BPLite.pm and its frame() to parse Blast but I'm seeing
	0, 1, or 2 instead of the expected -3, -2, -1, +1, +2, +3. Why am I
	seeing these different numbers and how do I get the frame according
	to Blast?

	These are GFF frames: a BLAST frame of +1 is 0 in GFF, and -3 is
	encoded as a frame of 2 with the strand set to -1 (for more on GFF
	see http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml).

	Frames are relative to the hit or query sequence, so you need to
	query the frame based on the sequence you are interested in:

	
	  $hsp->hit->strand();
	  $hsp->hit->frame();

	or

	
	  $hsp->query->strand();
	  $hsp->query->frame();

	So the frame as printed in a blast report (for example -3) can be
	reconstructed as

	
	  my $blastvalue = ($hsp->query->frame + 1) * $hsp->query->strand;
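
	The same arithmetic can be checked without a report. The sketch
	below uses plain numbers in place of $hsp->query->frame and
	$hsp->query->strand; the blast_frame() name is an assumption for
	illustration.

```perl
use strict;
use warnings;

# Convert a GFF-style frame (0, 1, 2) plus a strand (1 or -1) into the
# signed frame printed in a BLAST report (+1..+3 or -1..-3).
sub blast_frame {
    my ($gff_frame, $strand) = @_;
    return ($gff_frame + 1) * $strand;
}

# Print the mapping for all six combinations.
for my $strand (1, -1) {
    for my $frame (0 .. 2) {
        printf "frame %d, strand %2d -> BLAST frame %+d\n",
            $frame, $strand, blast_frame($frame, $strand);
    }
}
```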


---------------------------------------------------------------------------

4. Utilities

---------------------------------------------------------------------------



  Q4.1: How do I find all the ORFs in a nucleotide sequence? Antigenic
	sites in a protein? Calculate nucleotide melting temperature? Find
	repeats?

	In fact, none of these functions are built into Bioperl but they
	are all available in the EMBOSS package (http://www.emboss.org/),
	as well as many others. The Bioperl developers created a simple
	interface to EMBOSS such that any and all EMBOSS programs can be
	run from within Bioperl. See Bio::Factory::EMBOSS for more
	information.

	If you can't find the functionality you want in Bioperl then make
	sure to look for it in EMBOSS, since the two packages integrate
	quite gracefully. Of course, you will have to install EMBOSS to get
	this access.

	In addition, Bioperl after version 1.0.1 contains the Pise/Bioperl
	modules. The Pise package
	(http://www-alt.pasteur.fr/~letondal/Pise) was designed to provide
	a uniform interface to bioinformatics applications, and currently
	provides wrappers to more than 250 such applications! Included
	amongst these wrapped apps are HMMER, Phylip, BLAST, GENSCAN, even
	the EMBOSS suite. Use of the Pise/Bioperl modules does not require
	installation of the Pise package.


  Q4.2: How do I do motif searches with Bioperl? Can I do "find all
	sequences that are 75% identical" to a given motif?

	There are a number of approaches inside and outside of Bioperl.
	Within Bioperl take a look at Bio::Tools::SeqPattern, but it's also
	conceivable that the combination of Bioperl and Perl's regular
	expressions could do the trick. You might also consider the CPAN
	module String::Approx (this module addresses the percent match
	query). Or, take a look at the TFBS package, at
	http://forkhead.cgb.ki.se/TFBS (Transcription Factor Binding Site).
	This Bioperl-compliant package specializes in pattern searching of
	nucleotide sequence using matrices. Finally, you could use EMBOSS,
	as discussed in the previous question (or you could use Pise to run
	EMBOSS applications). The relevant programs would be fuzzpro or
	fuzznuc.
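
	For short motifs without gaps, the percent match query can also be
	sketched in plain Perl. The code below is an illustration only and
	is not how Bio::Tools::SeqPattern or String::Approx work: it slides
	the motif along the sequence and reports every window whose
	identity reaches a threshold. The find_similar() name and the
	example sequence are assumptions.

```perl
use strict;
use warnings;

# Return [position, identity] for every window of $seq that matches
# $motif with identity >= $min_identity (a fraction between 0 and 1).
# Positions are 1-based; gaps are not considered.
sub find_similar {
    my ($seq, $motif, $min_identity) = @_;
    ($seq, $motif) = (uc $seq, uc $motif);
    my $len = length $motif;
    return if $len == 0;
    my @hits;
    for my $start (0 .. length($seq) - $len) {
        my $window = substr $seq, $start, $len;
        # XOR of two equal-length strings gives "\0" wherever the
        # characters agree, so counting "\0" counts the matches.
        my $matches  = (($window ^ $motif) =~ tr/\0//);
        my $identity = $matches / $len;
        push @hits, [ $start + 1, $identity ] if $identity >= $min_identity;
    }
    return @hits;
}

my @hits = find_similar('ttGATTACAggGATCACAcc', 'GATTACA', 0.75);
printf "pos %d: %.0f%% identical\n", $_->[0], 100 * $_->[1] for @hits;
```

	For anything beyond simple mismatches (gaps, ambiguity codes,
	matrices) the tools named above remain the better choice.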


  Q4.3: Can I query MEDLINE or other bibliographic repositories using
	Bioperl?

	Yes! The solution lies in Bio::Biblio*, a set of modules that
	provide access to MEDLINE and OpenBQS-compliant servers using SOAP.
	See Bio/Biblio.pm or examples/biblio.pl for details and example
	code.

---------------------------------------------------------------------------
Copyright (c)2002 Open Bioinformatics Foundation. You may distribute this
FAQ under the same terms as perl itself.

