RNA-seq / Differential expression analysis using edgeR
Description
Differential expression analysis using the exact test of the edgeR Bioconductor package ("classic edgeR").
Please note that this tool is suitable only for two group comparisons.
For multifactor experiments you can use the tool "Differential expression using edgeR for multivariate experiments",
which uses generalized linear models -based statistical methods ("glm edgeR").
Parameters
- Column describing groups [group]
- Filter out genes which don't have counts in at least this many samples (1-10000) [2]
- P-value cutoff (0-1) [0.05]
- Multiple testing correction (none, Bonferroni, Holm, Hochberg, BH, BY) [BH]
- Dispersion method (common, tagwise) [tagwise]
- Dispersion value used if no replicates are available (0-1) [0.1]
- Apply TMM normalization (yes, no) [yes]
- Plot width (200-3200 [600]
- Plot height (200-3200) [600]
Details
This tool takes as input a table of raw counts from the different samples. The count file has to be associated with a phenodata file describing the experimental groups.
These files are best created by the tool "Utilities / Define NGS experiment", which combines count files for different samples to one table, and creates a phenodata file for it.
You should set the filtering parameter to the number of samples in your smallest experimental group.
Filtering will cause those genes which are not expressed in either experimental group or are expressed in very low levels (less than 5 counts) to be ignored in statistical testing.
These genes have little chance of showing significant evidence for differential expression, and removing them reduces the severity of multiple testing adjustment of p-values.
Trimmed mean of M-values (TMM) normalization is used to calculate normalization factors in order to reduce RNA composition effect,
which can arise for example when a small number of genes are very highly expressed in one experiment condition but not in the other.
Dispersion is estimated using the quantile-adjusted conditional maximum likelyhood method (qCML). Common dispersion assumes that all genes have the same dispersion,
while the tagwise approach calculates gene-wise dispersions. It uses an empirical Bayes strategy to "shrink" the gene-wise dispersions toward a common trend
(obtained by estimating a smooth function prior to the shrinkage using the estimateTrendedDisp function).
You should always have at least two biological replicates for each experiment condition.
If this is not possible, you can still run the analysis by guessing the dispersion value with the 'Dispersion value' parameter.
The default value for this parameter is 0.1, which is somewhere in-between what is usually observed for technical replicates (0.01) and human data (0.4).
Once negative binomial models are fitted and dispersion estimates are obtained,
edgeR proceeds with testing for differential expression using the exact test, which is based on the qCML methods.
Output
The analysis output consists of the following files:
- de-list-edger.tsv: Result table from statistical testing, including fold change estimates and p-values.
- de-list-edger.bed: If you data contained genomic coordinates, the result table is also given as a BED file for genome browser use. The score column contains log fold change values.
- ma-plot-edger.pdf: MA plot where significantly differentially expressed features are highlighted.
- mds-plot-edger.pdf: Multidimensional scaling plot to visualize sample similarities.
- dispersion-edger.pdf: Biological coefficient of variation plot.
- p-value-plot-edger.pdf: Raw and adjusted p-value distribution plot.
- edger-log.txt: Log file if no significantly different expression was found.
References
This tool uses the edgeR package for statistical analysis. Please read the following article for more detailed information:
MD Robinson, DJ McCarthy, and GK Smyth. edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26 (1):139-40, Jan 2010.
.