\name{corna.test.fun}
\alias{corna.test.fun}

\title{Statistical tests of the association between miRNAs and a sample of genes}
\description{
Suppose there is a population of genes and a sample drawn from it: for example, all genes tested in a microarray experiment 
as the population and the significantly differentially expressed genes as the sample. The associations between these genes 
and miRNAs can be obtained from miRBase. For every miRNA that is associated with at least one sample gene, this function counts 
the number of genes in the sample and in the population that are associated with it. A hypergeometric test is then applied 
to infer whether this miRNA is more likely to be associated with the sample genes than expected by chance. By default, the 
hypergeometric test is for over-representation, so a small p-value for a miRNA indicates that it is strongly associated with 
the sample genes. For comparison, Fisher's exact test and the chi-square test are also provided.
}
\usage{
corna.test.fun(x, y, z, 
               hypergeometric=TRUE, hyper.lower.tail=FALSE,
               fisher=FALSE, fisher.alternative="two.sided",
               chi.square=FALSE, 
               p.adjust.method="none", label=FALSE, sort="hypergeometric",
               min.pop=-1, min.sam=-1, desc=NULL)
}

\arguments{
  \item{x}{ A character vector of sample genes. }
  \item{y}{ A character vector of population genes. }
  \item{z}{ A data frame of links between genes and miRNAs. }
  \item{hypergeometric}{ Logical. If TRUE (default), the hypergeometric test is applied; if FALSE, it is not. }
  \item{hyper.lower.tail}{ Logical, for the hypergeometric test. If FALSE (default), test for over-representation; if TRUE, test for under-representation. }
  \item{fisher}{ Logical. If TRUE, Fisher's exact test is applied; if FALSE (default), it is not. }
  \item{fisher.alternative}{ Alternative hypothesis of Fisher's exact test: "two.sided" (default), "greater" or "less". }
  \item{chi.square}{ Logical. If TRUE, the chi-square test is applied; if FALSE (default), it is not. }
  \item{p.adjust.method}{ Method for adjusting the p-values. Default is "none"; other options include "BH" and "BY". }
  \item{label}{ Logical. If TRUE, p-values are labelled with ** if less than 0.01 and with * if less than 0.05; if FALSE (default), they are not labelled. }
  \item{sort}{ Test whose p-values are used to sort the results. Default is "hypergeometric". }
  \item{min.pop}{ Minimum number of targets in the population that a miRNA must have to be included in the results. The default, -1, outputs everything. }
  \item{min.sam}{ Minimum number of targets in the sample that a miRNA must have to be included in the results. The default, -1, outputs everything. }
  \item{desc}{ A data frame of descriptions of the miRNAs being tested. }
}
\details{
Each test concerns a single miRNA, but the data being tested are gene counts, which can be laid out as a conventional 
2 x 2 contingency table containing the following four numbers.

\enumerate{
  \item number of genes in sample that associate with this miRNA;
  \item number of genes in sample that do not associate with this miRNA;
  \item number of genes remaining in population that associate with this miRNA;
  \item number of genes remaining in population that do not associate with this miRNA.
}

The null hypothesis is that there is no association between the miRNA being tested and the sample genes, that is, the 
proportion of genes associated with this miRNA in the sample is roughly the same as in the population. The default alternative 
hypothesis is that there is a positive association, meaning that the sample genes are more likely to be associated with this 
miRNA than the remaining genes in the population. Negative association can be tested by setting "hyper.lower.tail" to TRUE, 
but the results need careful interpretation: only miRNAs associated with at least one sample gene are tested, so miRNAs 
associated only with the remaining genes in the population, which are all under-represented in the sample, are never tested.

Fisher's exact test and the chi-square test are optional and are switched on or off by setting "fisher" and 
"chi.square" to TRUE or FALSE respectively. The one-tailed Fisher's exact test for over-representation is identical to the 
hypergeometric test and is suitable for small sample sizes. The chi-square test can be used when the expected values of the 
four numbers above are all greater than 10. Unlike the hypergeometric test and the one-tailed Fisher's exact test, the 
alternative hypothesis of the chi-square test is that there is an association, either positive or negative, between the 
miRNA being tested and the sample genes.
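
As a minimal sketch of how the four numbers map onto the base R functions phyper, fisher.test and chisq.test (not necessarily 
the exact calls made inside this function), consider a single miRNA with made-up counts. The variable names and numbers below 
are hypothetical, and subtracting 1 from the observed count is the usual way of obtaining P(X >= observed) from phyper.

\preformatted{
# Hypothetical counts for one miRNA (illustrative numbers only)
sam.hit   <- 30    # 1. sample genes that associate with the miRNA
sam.miss  <- 170   # 2. sample genes that do not
rest.hit  <- 120   # 3. remaining population genes that associate with it
rest.miss <- 1680  # 4. remaining population genes that do not

# Over-representation: P(X >= sam.hit), hence sam.hit - 1 with lower.tail = FALSE
p.hyper <- phyper(sam.hit - 1,
                  sam.hit + rest.hit,     # associated genes in the population
                  sam.miss + rest.miss,   # unassociated genes in the population
                  sam.hit + sam.miss,     # sample size
                  lower.tail = FALSE)

# The same counts as a 2 x 2 table: rows = associated / not, columns = sample / rest
tab <- matrix(c(sam.hit, sam.miss, rest.hit, rest.miss), nrow = 2)

# The one-tailed Fisher exact test gives the same p-value as phyper above
p.fisher <- fisher.test(tab, alternative = "greater")$p.value
isTRUE(all.equal(p.hyper, p.fisher))

# Chi-square test (no direction); all expected counts in this table exceed 10
p.chisq <- chisq.test(tab)$p.value
}

With these counts the expected number of associated genes in the sample is 200 * 150 / 2000 = 15, so the chi-square 
approximation is reasonable; with much smaller expected counts chisq.test issues a warning.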

When performing tests that draw from the Hypergeometric distribution, defining the universe (or population) from
which the sample has been drawn is key.  In the case of CORNA, we are dealing with gene lists, and so the question is:
from which population of genes have we drawn our sample gene list?

The obvious answer is "all the genes in the genome in question".  However, this may not be the case - if you have not
assayed all genes in the genome, then those genes you didn't assay could never have made it into your sample, and therefore
they should not be present in the population either.  For example, if you perform a microarray experiment and your microarray
only represents a subset of the genes in the genome, then only that subset should be used as the population.

The population may be further refined.  For example, with CORNA, we are looking at regulation of gene lists by microRNAs, therefore
the population of genes could be further reduced to only those genes that you have assayed that also have a predicted
miRNA relationship.  If you have assayed a gene that has no miRNA relationship, then that gene can never
be counted against a particular miRNA, and serves only to increase the size of the universe.

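In general, a larger universe makes the resulting over-representation p-values appear more significant, so padding the 
population with genes that can never be counted against a miRNA overstates significance.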
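
To illustrate the point, the following sketch calls phyper directly with made-up counts (all variable names and numbers are 
hypothetical, and this is not a call into CORNA itself). The sample and the number of genes associated with the miRNA are held 
fixed; only the number of population genes with no miRNA relationship changes.

\preformatted{
sam.size <- 100   # genes in the sample
sam.hit  <- 8     # sample genes associated with the miRNA
pop.hit  <- 50    # population genes associated with the miRNA

# Over-representation p-value as a function of the population size
p.over <- function(pop.size)
  phyper(sam.hit - 1, pop.hit, pop.size - pop.hit, sam.size, lower.tail = FALSE)

p.small <- p.over(1000)   # expected hits in the sample: 5
p.large <- p.over(5000)   # expected hits in the sample: 1
p.large < p.small         # the larger universe looks more significant
}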

The population gene list used by CORNA is supplied through the 'y' argument of corna.test.fun. Any list may be supplied, 
so it is the user's responsibility to select the correct population.
}

\value{
A data frame whose row names are the miRNA IDs and which has at least the following four columns.

\item{total}{ total number of genes in population that associate with this miRNA. }
\item{expectation}{ expected number of genes in sample that associate with this miRNA according to the sample size under null hypothesis. }
\item{observation}{ observed number of genes in sample that associate with this miRNA. }
\item{hypergeometric}{ p-values of the hypergeometric test. }

Two extra columns of p-values, "fisher" and "chi.square", are added if Fisher's exact test and the chi-square test are also selected.
By default, the data frame is sorted by the p-values of the hypergeometric test.

A column of descriptions is added if "desc" is supplied.
}

\author{
Xikun Wu and Michael Watson
}

\seealso{ \code{\link{phyper}}, \code{\link{fisher.test}}, \code{\link{chisq.test}} }
\examples{
# links between transcripts and miRNA: miRBase mouse data
tran2mir.df <- miRBase2df.fun(url="ftp://ftp.sanger.ac.uk/pub/mirbase/targets/v5/arch.v5.txt.mus_musculus.zip")

# population: all transcripts
pop.vec <- unique(tran2mir.df[, "tran"])

# sample: randomly selected 1%
sam.vec <- sample(pop.vec, floor(length(pop.vec)/100))

# test
corna.test.df <- corna.test.fun(sam.vec, pop.vec, tran2mir.df)
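
# a further variation on the call above, offered as a sketch that assumes the
# arguments behave as described in the arguments section: also run a one-sided
# Fisher exact test, adjust the p-values with the BH method and keep only
# miRNAs with at least two targets in the sample
corna.fisher.df <- corna.test.fun(sam.vec, pop.vec, tran2mir.df,
                                  fisher=TRUE, fisher.alternative="greater",
                                  p.adjust.method="BH", min.sam=2)
head(corna.fisher.df)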
}

\keyword{ manip }
