Contributed by Yujin Hoshida <d35116@h.u-tokyo.ac.jp>
-----------------------------------------------------------


I am sending my Perl scripts for handling DNA microarray data. All scripts
read their data from a comma-delimited text file. They fall into the 3 groups
below. I am sorry for the delay. Preparation took time because my little
daughter cries heavily at night (she is 3 months old).

(1) Discriminative gene selection.

       t_test.pl     :    permutation t-test (Radmacher, NCI report)
       u_test.pl     :    Mann-Whitney U-test
       info.pl       :    Info-score (TNoM) (Ben-Dor, JCB)
       ds.pl         :    discrimination score (Golub, Science)
       cat.pl        :    categorization (eg 3 groups: Cy3/Cy5 >=2,
                          0.5< Cy3/Cy5 <2, Cy3/Cy5 <=0.5)
                          (Tsunoda, Cancer Res)
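
The three-group categorization listed for cat.pl can be sketched as follows.
This is a minimal illustration in Python, not the code of cat.pl itself; the
function name and the category labels 1, 0 and -1 are assumptions made for
the example.

```python
def categorize(cy3, cy5, upper=2.0, lower=0.5):
    """Map a Cy3/Cy5 ratio to one of three categories.

    The labels 1, 0 and -1 are illustrative, not taken from cat.pl.
    """
    ratio = cy3 / cy5
    if ratio >= upper:
        return 1     # Cy3/Cy5 >= 2
    if ratio <= lower:
        return -1    # Cy3/Cy5 <= 0.5
    return 0         # 0.5 < Cy3/Cy5 < 2


# Three hypothetical spots with ratios 4.0, 1.2 and 0.5
print([categorize(c3, c5) for c3, c5 in [(400, 100), (120, 100), (50, 100)]])
# prints [1, 0, -1]
```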

These scripts select genes that discriminate between 2 groups (4 samples in
each group, which I think is the minimum) based on 10,000 random permutations
of the sample labels. The threshold is set to P=.001 (ie more extreme than
the top or bottom 10 permutations). The scripts differ only in their gene
selection algorithms.

They need to be rewritten to match the number of samples in the microarray
data being analyzed.
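
The label-permutation idea described above can be sketched for a single gene
as follows. This is a hedged Python illustration rather than the logic of
t_test.pl: the pooled-variance t statistic, the two-sided comparison and all
function names are assumptions of the sketch.

```python
import math
import random


def t_statistic(a, b):
    """Two-sample t statistic with pooled variance (an assumed variant)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((x - ma) ** 2 for x in a)
    ssb = sum((x - mb) ** 2 for x in b)
    pooled = (ssa + ssb) / (na + nb - 2)
    if pooled == 0:
        # No within-group spread: identical means give 0, otherwise +/- infinity
        return 0.0 if ma == mb else math.copysign(math.inf, ma - mb)
    return (ma - mb) / math.sqrt(pooled * (1 / na + 1 / nb))


def permutation_p(values, labels, n_perm=10000, seed=0):
    """Fraction of random label permutations whose |t| reaches the observed |t|."""
    rng = random.Random(seed)

    def stat(lab):
        a = [v for v, l in zip(values, lab) if l == 0]
        b = [v for v, l in zip(values, lab) if l == 1]
        return abs(t_statistic(a, b))

    observed = stat(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)          # permute the sample labels
        if stat(shuffled) >= observed:
            hits += 1
    return hits / n_perm


# One hypothetical gene measured in 4 + 4 samples
p = permutation_p([1, 2, 3, 4, 11, 12, 13, 14], [0, 0, 0, 0, 1, 1, 1, 1])
print(p)
```

Passing the group labels as a list, instead of fixing 4 samples per group in
the code, is one way to avoid rewriting the function for each data set.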

(2) Leave-one-out cross validation of (1).

        script               gene selection          in-silico genotyping
        loocv_t.pl        :  t-test                  compound covariate (Radmacher)
        loocv_u.pl        :  U-test                  simple rank
        loocv_t_vote.pl   :  t-test                  weighted vote (Golub)
        loocv_u_vote.pl   :  U-test                  weighted vote
        loocv_info.pl     :  Info-score (TNoM)       weighted vote
        loocv_ds.pl       :  discrimination score    weighted vote
        loocv_cat.pl      :  categorization          weighted vote

These scripts evaluate (1). One sample is removed, and discriminative genes
are selected from the remaining samples with each algorithm. The removed
sample is genotyped using the selected gene set, and the genotyping is judged
correct or incorrect. This process is repeated for all samples and the number
of misclassifications is counted. Furthermore, the sample labels are randomly
permuted 1,000 times and the significance of the result (P=.05) is evaluated.
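
The procedure above can be sketched as follows, using the discrimination
score for gene selection and a weighted vote for genotyping. This is a
hedged Python illustration, not the code of loocv_ds.pl: taking the top
n_genes by score (instead of a permutation threshold), the data layout
data[gene][sample] and all function names are assumptions of the sketch.

```python
import math
import random


def mean_sd(xs):
    m = sum(xs) / len(xs)
    return m, math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))


def select_genes(data, labels, n_genes):
    """Rank genes by |(mu0 - mu1) / (sd0 + sd1)| and keep the top n_genes."""
    scored = []
    for g, row in enumerate(data):
        m0, s0 = mean_sd([x for x, l in zip(row, labels) if l == 0])
        m1, s1 = mean_sd([x for x, l in zip(row, labels) if l == 1])
        score = (m0 - m1) / (s0 + s1) if s0 + s1 > 0 else 0.0
        scored.append((abs(score), g, score, (m0 + m1) / 2))
    scored.sort(reverse=True)
    return [(g, score, mid) for _, g, score, mid in scored[:n_genes]]


def predict(sample, genes):
    """Weighted vote: a positive total favours class 0, otherwise class 1."""
    total = sum(score * (sample[g] - mid) for g, score, mid in genes)
    return 0 if total > 0 else 1


def loocv_errors(data, labels, n_genes=2):
    """Count misclassified samples; genes are re-selected for every fold."""
    errors = 0
    for left_out in range(len(labels)):
        keep = [s for s in range(len(labels)) if s != left_out]
        train = [[row[s] for s in keep] for row in data]
        genes = select_genes(train, [labels[s] for s in keep], n_genes)
        sample = [row[left_out] for row in data]
        if predict(sample, genes) != labels[left_out]:
            errors += 1
    return errors


def loocv_p_value(data, labels, n_perm=1000, seed=0):
    """Fraction of label permutations with no more errors than observed."""
    rng = random.Random(seed)
    observed = loocv_errors(data, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if loocv_errors(data, shuffled) <= observed:
            hits += 1
    return observed, hits / n_perm
```

Re-selecting the genes inside every fold, as the text describes, keeps the
removed sample from influencing the gene set used to genotype it.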

Info-score and TNoM are not calculated in these scripts (they are calculated
manually beforehand).
The problem with these scripts is their huge calculation time.
Dr. Jason, would this problem be solved if I used a compiled language instead
of Perl (an interpreted language)?

(3) Relevance network of gene expression (Butte, PNAS)

        entropy.pl    :    select genes with sufficient entropy for
                           calculation of the correlation coefficient.
        relevance.pl  :    calculate the Pearson correlation coefficient
                           among genes, and select genes whose correlation
                           coefficient is higher than the threshold value.
        relrand.pl    :    calculate the threshold value of the Pearson
                           correlation coefficient.
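
The three steps listed above can be sketched together as follows. This is a
hedged Python illustration, not the code of the three scripts: the 10
equal-width bins used for the entropy, the use of the absolute value of r,
the choice of the largest |r| in shuffled data as the threshold, and all
function names are assumptions of the sketch.

```python
import math
import random


def entropy(values, n_bins=10):
    """Shannon entropy (bits) of values placed in equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0
    counts = [0] * n_bins
    for v in values:
        counts[min(int((v - lo) / (hi - lo) * n_bins), n_bins - 1)] += 1
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts if c)


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0                      # constant profile: treat as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def all_pairs(data):
    """Yield (i, j, r) for every gene pair i < j; data[g] is one gene's profile."""
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            yield i, j, pearson(data[i], data[j])


def random_threshold(data, n_perm=100, seed=0):
    """Largest |r| seen after shuffling each gene's values independently.

    Needs at least two genes in data.
    """
    rng = random.Random(seed)
    best = 0.0
    for _ in range(n_perm):
        shuffled = [rng.sample(row, len(row)) for row in data]
        best = max(best, max(abs(r) for _, _, r in all_pairs(shuffled)))
    return best


def relevance_edges(data, min_entropy, threshold):
    """Entropy filter first, then keep gene pairs with |r| >= threshold."""
    keep = [g for g, row in enumerate(data) if entropy(row) >= min_entropy]
    kept = [data[g] for g in keep]
    return [(keep[i], keep[j], r)
            for i, j, r in all_pairs(kept) if abs(r) >= threshold]
```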

I am now developing a Java application that visualizes the relevance network
based on the data sheet produced by the relevance.pl script.
relevance.pl also takes a huge calculation time (probably in the step that
sorts the correlation coefficients: eg, for a DNA array with 5,000 genes,
12,497,500 correlation coefficients are calculated). I think that some
improvement is needed (eg using bubble sort).
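
The pair count and two ways of reducing the sorting work can be sketched as
follows. This is a hedged Python illustration, not a change to relevance.pl;
the function names and the (gene_i, gene_j, r) tuple layout are assumptions.
Note that bubble sort needs on the order of n^2 comparisons, so for 12,497,500
values it would be expected to run slower than a built-in n log n sort.
Filtering by the threshold before sorting, or keeping only the strongest k
pairs, avoids sorting the full list at all.

```python
import heapq
import math

# Number of gene pairs for 5,000 genes: 5000 * 4999 / 2
n_pairs = math.comb(5000, 2)
print(n_pairs)  # prints 12497500


def strong_pairs(pairs, threshold):
    """Filter by |r| first, then sort only the pairs that survive."""
    kept = [(i, j, r) for i, j, r in pairs if abs(r) >= threshold]
    kept.sort(key=lambda p: abs(p[2]), reverse=True)
    return kept


def top_pairs(pairs, k):
    """Keep the k strongest pairs without sorting the whole input."""
    return heapq.nlargest(k, pairs, key=lambda p: abs(p[2]))


# Three hypothetical (gene_i, gene_j, r) tuples
example = [(0, 1, 0.2), (0, 2, -0.95), (1, 2, 0.8)]
print(strong_pairs(example, 0.7))  # prints [(0, 2, -0.95), (1, 2, 0.8)]
print(top_pairs(example, 1))       # prints [(0, 2, -0.95)]
```

Both functions accept any iterable, so the coefficients can be streamed from
the pairwise loop instead of being stored in one large array first.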

My coding is not elegant. Please tell me which points need revision.
In addition, I apologize for my poor English explanation.

Kind regards,

Yujin

