# Directory containing some scripts to post-process and filter discoSnpRad results


## Filtering scripts  

   1. **script** `filter_by_cluster_size_and_rank.py`:
       * removes variants belonging to a cluster (locus) whose size (nb of variants) is outside the given size range (options `-m` and `-M`)
       * removes variants with rank lower than a given threshold given by option `-r`
       * Usage :  
       `python filter_by_cluster_size_and_rank.py -i vcf_file [-o new_vcf_file -m 0 -M 150 -r 0.4]`

   3. **script** `filter_vcf_by_indiv_cov_max_missing_and_maf.py`:
       * replaces individual genotypes that have DP less than the value given by option `-c` by missing genotype `./.`
       * removes variants (vcf lines) that have a fraction of missing genotypes greater than the value given by option `-m` 
       * removes variants (vcf lines) that have a minor allele frequency smaller than the value given by option `-f` 
       * outputs only SNP variants if option `-s`. 
       * Usage :    
       `python filter_vcf_by_indiv_cov_max_missing_and_maf.py -i vcf_file -o new_vcf_file [-c min_cov -m max_missing -f maf -s] `

  3. **script** `filter_paralogs.py`:
        * identifies variants (vcf lines) that have a fraction of heterozygous genotypes greater than `x` (not counting missing genotypes)
        * removes variants (vcf lines) that belong to a cluster having a fraction of such variants greater than `y`
        * Example : `x=0.1` and `y= 0.5` and if we consider a cluster to represent a locus. This filter removes loci that have more than 50% of the SNPs that have each more than 10% of heterozygous genotypes.
        * Usage :     
        `python filter_paralogs.py -i vcf_file -o new_vcf_file [-x 0.1 -y 0.5]`


## Scripts for STRUCTURE analyses :

   4. **script** `1SNP_per_cluster.py`
        * selects one SNP per cluster (the one with less missing genotypes)
        * Usage :   
        `python  1SNP_per_cluster.py -i vcf_file -o new_vcf_file`

   5. **script** `vcf2structure.sh`    
        * changes the vcf format to a Structure format (input of the software Structure)
        * Usage:    
          `sh vcf2structure.sh file.vcf`
          (creates a file `file.str`)


## Mapping to a reference genome, and keeping the cluster information :

When a reference genome is available, even if variants have been called in a reference-free manner, it could be useful to get the positions of variants on this reference genome.    
To get such information, two programs are necessary in the case of a discoSnpRAD result:    

* `VCF_creator` (in dir `[DISCO_DIR]/scripts/`).     

   Note : it uses the mapper bwa, which must to be in the PATH env variable.   
   
* **script**  `add_cluster_info_to_mapped_vcf.py` in this current directory, to append the clustering information (and some minimal filtering on cluster size) in the vcf output by VCF_creator.   

Here is the full pipeline to map the result of discoSnpRAD with name prefix `myDiscoSnpRADResult` to a reference genome `myReferenceGenome.fa`:   
```
# Running VCF_creator with a given reference genome ()
sh [DISCO_DIR]/scripts/run_VCF_creator.sh  -G dm6_masked.fa -p myDiscoSnpRADResult_raw_filtered.fa -e -o temp.vcf

# Adding clustering information (and minimal filtering on cluster size)
python add_cluster_info_to_mapped_vcf.py -m temp.vcf -u myDiscoSnpRADResult_clustered.vcf -o myDiscoSnpRADResult_mapped.vcf
# final vcf is myDiscoSnpRADResult_mapped.vcf
```

Additionnally, in a validation context, if one wants to compare variant positions between two such vcf files, the following command will output recall and precision metrics:
```
python [DISCO_DIR]/scripts/validation_scripts/compare_vcf_disco_pos_allele_only.py truth.vcf myDiscoSnpRADResult_mapped.vcf
```


<!---
== Clustering full process ==
-----------------------------

`ls ${discofile}.fa > ${discofile}.fof  `

`short_read_connector.sh -b ${discofile}.fa -q ${discofile}.fof -s 0 -k ${k} -a
1 -l -p ${discofile}.txt`

`./quick_hierarchical_clustering ${discofile}.txt > ${discofile}.cluster`

`python3 clusters_and_fasta_to_fasta.py ${discofile}.fa ${discofile}.cluster >
${discofile}_with_clusters.fa`

`run_VCF_creator.sh -p ${discofile}_with_clusters.fa
-o ${discofile}_with_clusters.vcf`

**NB**: Then the `${discofile}_with_clusters.vcf` can be simply sorted to put
together variants from each cluster.

**Sources**:

`short_read_connector.sh` from <https://github.com/GATB/short_read_connector>

`run_VCF_creator.sh` from <https://github.com/GATB/DiscoSnp> (scripts directory)




== Old scripts ==
*  `filter_missgeno.py"  : filtering out variants with too many missing genotypes 
* From a .fa file generated by discoSnp-Rad, remove variants with too many missing genotypes 
* Usage: 
`filter_missgeno.py file max_missing` (with max_missing in percent)
-->
