
Universal Protein Resource (UniProt)
====================================


The Universal Protein Resource (UniProt), a collaboration between the European
Bioinformatics Institute (EBI), the SIB Swiss Institute of Bioinformatics, and
the Protein Information Resource (PIR), comprises three databases, each
optimized for different uses. The UniProt Knowledgebase (UniProtKB) is the
central access point for extensively curated protein information, including
function, classification and cross-references. The UniProt Reference Clusters
(UniRef) combine closely related sequences into a single record to speed up
sequence similarity searches. The UniProt Archive (UniParc) is a comprehensive
repository of all protein sequences, consisting only of unique identifiers and
sequences.


This directory contains files of amino acid altering variants imported from
Ensembl Variation databases. Mapped sequence variants are supplied per species
in a tab-delimited text file. Variants that are manually annotated in
UniProtKB/Swiss-Prot, for Homo sapiens only, are available in the
humsavar.txt document.
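
As an illustration of reading one of the per-species files, the following is a
minimal Python sketch, not an official UniProt tool. It relies only on what is
stated above (the files are gzip-compressed, tab-delimited text). The function
name iter_tab_rows and the UTF-8 encoding are assumptions made for this
example, and no assumption is made about which columns or header lines a file
contains.

```python
import gzip
from typing import Iterator, List


def iter_tab_rows(path: str) -> Iterator[List[str]]:
    """Yield each non-blank line of a gzipped, tab-delimited file as a list of fields.

    Hypothetical helper: the name and the UTF-8 encoding are assumptions.
    """
    # "rt" opens the compressed stream in text mode, so lines arrive as str.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line.strip():
                continue  # skip blank lines
            yield line.split("\t")
```

For example, iterating over iter_tab_rows("bos_taurus_variation.txt.gz") would
return one list of strings per line. Any header or comment lines in the file
are returned unchanged, so callers need to inspect the file and filter them as
appropriate.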


This directory contains the following files:

humsavar.txt
Index of manually curated Human polymorphisms and disease mutations from
UniProtKB/Swiss-Prot.

aedes_aegypti_variation.txt.gz
The UniProtKB Aedes aegypti reference proteome strain is LVPib12; this is the
same strain used by the Ensembl Genome. Ensembl Genomes variation data is
derived from two sets of variation data both imported via VectorBase.

bos_taurus_variation.txt.gz
The UniProtKB Bos taurus (Cow) reference proteome breed is Hereford; this is the
same breed used by Ensembl. Variants are sourced from dbSNP, Online Mendelian
Inheritance in Animals (OMIA), the Animal Quantitative Trait Loci (QTL) database
(Animal QTLdb) and Database of Genomic Variants Archive (DGVa).

brachypodium_distachyon_variation.txt.gz
The UniProtKB Brachypodium distachyon (Purple false brome) reference proteome
strain is cv. Bd21; this is the same strain used by the Genome Reference
Consortium for their primary assembly. Ensembl Genomes variation data comes
from variations identified by the alignment of transcriptome assemblies from
three slender false brome (Brachypodium sylvaticum) populations.

canis_familiaris_variation.txt.gz
The UniProtKB Canis lupus (Dog) reference proteome breed is Boxer; this is the
same breed used by Ensembl. Variants are sourced from dbSNP, Online Mendelian
Inheritance in Animals (OMIA) and Database of Genomic Variants Archive (DGVa).

danio_rerio_variation.txt.gz
The UniProtKB Danio rerio (Zebrafish) reference proteome strain is Tuebingen;
this is the same strain used by the Genome Reference Consortium for their
primary assembly. Ensembl Variation sources variants from multiple strains and
maps them to the primary assembly; therefore the zebrafish variants defined in
this file may have been discovered in another strain of zebrafish.

equus_caballus_variation.txt.gz
The UniProtKB Equus caballus (Horse) reference proteome breed is Thoroughbred;
this is the same breed used by Ensembl. Variants are sourced from dbSNP, Online
Mendelian Inheritance in Animals (OMIA), the Animal Quantitative Trait Loci
(QTL) database (Animal QTLdb) and Database of Genomic Variants Archive (DGVa).

fusarium_oxysporum_variation.txt.gz
The UniProtKB Fusarium oxysporum reference proteome strain is 4287 / CBS 123668
/ FGSC 9935 / NRRL 34936; this is the same strain used by the Ensembl Genome.
Ensembl Genomes variation data is derived from comparing 27 different strains of
this species.

gallus_gallus_variation.txt.gz
The UniProtKB Gallus gallus (Chicken) reference proteome breed is Red Jungle
fowl, inbred line UCD001; this is the same breed used by Ensembl. Variants are
sourced from dbSNP, Online Mendelian Inheritance in Animals (OMIA), the Animal
Quantitative Trait Loci (QTL) database (Animal QTLdb) and Database of Genomic
Variants Archive (DGVa).

homo_sapiens_variation.txt.gz
The variants listed are protein altering variants (SO:0001583) from the Ensembl
Variation databases' set of the 1000 Genomes project
(https://www.1000genomes.org/), the Exome Aggregation Consortium (ExAC), the
National Cancer Institute public Cancer Genome Atlas (NCI-TCGA), the Exome
Sequencing Project (ESP) and the Catalogue of Somatic Mutations In Cancer
(COSMIC) v71, imported directly from COSMIC and via Ensembl Variation. COSMIC
v71 variants are the last freely available somatic variants from COSMIC before
their licence change; therefore the accuracy of the information provided for a
COSMIC variant should be verified with COSMIC.

hordeum_vulgare_variation.txt.gz
The UniProtKB Hordeum vulgare reference proteome strain is cv. Morex; this is
the same strain used by the Ensembl Genome. Ensembl Genomes variation data is
derived from WGS survey sequencing of four cultivars (Barke, Bowman, Igri and
Haruna Nijo) and a wild barley (H. spontaneum); SNPs discovered from RNA-Seq
performed on the embryo tissues of 9 spring barley varieties (Barke, Betzes,
Bowman, Derkado, Intro, Optic, Quench, Sergeant and Tocada) and Morex;
population sequencing of 90 Morex x Barke individuals; population sequencing of
84 Oregon Wolfe barley individuals; and SNPs from the Illumina iSelect 9k barley
SNP chip.

ixodes_scapularis_variation.txt.gz
The UniProtKB Ixodes scapularis reference proteome strain is Wikel; this is the
same strain used by the Ensembl Genome. Ensembl Genomes variation data is
derived from ten populations of Ixodes scapularis, imported from VectorBase.

macaca_mulatta_variation.txt.gz
The UniProtKB Macaca mulatta (Macaque) reference proteome strain is 17573; this
is the same strain used by Ensembl. Variants are sourced from dbSNP, Online
Mendelian Inheritance in Animals (OMIA) and Database of Genomic Variants Archive
(DGVa).

meleagris_gallopavo_variation.txt.gz
For Meleagris gallopavo (Turkey), variants are sourced from dbSNP and Online
Mendelian Inheritance in Animals (OMIA).

monodelphis_domestica_variation.txt.gz
The UniProtKB Monodelphis domestica (Opossum) variants are sourced from dbSNP.

mus_musculus_variation.txt.gz
The UniProtKB Mus musculus (mouse) reference proteome strain is C57BL/6J; this
is the same strain used by the Genome Reference Consortium for their primary
assembly. Ensembl Variation sources variants from multiple strains and maps them
to the primary assembly; therefore the mouse variants defined in this file may
have been discovered in another strain of mouse.

nomascus_leucogenys_variation.txt.gz
The UniProtKB Nomascus leucogenys (Gibbon) variants are sourced from Ensembl.

ornithorhynchus_anatinus_variation.txt.gz
The UniProtKB Ornithorhynchus anatinus (Platypus) reference proteome is from an
individual female called Glennie; this is the same individual used by Ensembl.
Variants are sourced from dbSNP and Ensembl.

oryza_glaberrima_variation.txt.gz
The UniProtKB Oryza glaberrima (African rice) reference proteome strain is IRGC
96717; this is the same strain used by the Genome Reference Consortium for their
primary assembly. Ensembl Genomes variation data comes from two (unpublished)
sources: 20 diverse accessions of Oryza glaberrima and 19 accessions of its wild
progenitor, Oryza barthii, collected from geographically distributed regions of
Africa.

oryza_indica_variation.txt.gz
The UniProtKB Oryza sativa (indica) reference proteome strain is cv. 93-11; this
is the same strain used by the Genome Reference Consortium for their primary
assembly. Ensembl Genomes variation data comes from two NCBI dbSNP sources: SNPs
called from the comparison of Oryza sativa Indica and Oryza sativa Japonica and
SNPs resulting from OMAP project alignments between O. glaberrima, O. punctata,
O. nivara, and O. rufipogon against O. sativa Japonica mapped to O. sativa
indica.

oryza_sativa_variation.txt.gz
The UniProtKB Oryza sativa Japonica reference proteome strain is cv. Nipponbare;
this is the same strain used by the Genome Reference Consortium for their
primary assembly. Ensembl Genomes variation data comes from a collection of SNPs
produced by the BGI based on comparison of the Japonica and Indica genomes, SNPs
derived from the OMAP project, a SNP variation study involving 1311 SNPs across
395 accessions, and OryzaSNP, a large scale SNP variation study involving
~160K SNPs in 20 diverse rice accessions.

ovis_aries_variation.txt.gz
The UniProtKB Ovis aries (Sheep) variants are sourced from dbSNP, Online
Mendelian Inheritance in Animals (OMIA), and the Animal Quantitative Trait Loci
(QTL) database (Animal QTLdb).

phytophthora_infestans_variation.txt.gz
The UniProtKB Phytophthora infestans reference proteome strain is T30-4; this
is the same strain used by the Ensembl Genome. Ensembl Genomes variation data
derives from resequencing of 3 different strains: PIC99189 (ERP000341), 90128
(ERP000343) and T30-4 (ERP000344).

plasmodium_falciparum_variation.txt.gz
The UniProtKB Plasmodium falciparum reference proteome strain is Isolate 3D7;
this is the same strain used by the Ensembl Genome. Ensembl Genomes variation
data is a direct import from dbSNP.

pongo_abelii_variation.txt.gz
The UniProtKB Pongo abelii (Orangutan) variants are sourced from dbSNP.

solanum_lycopersicum_variation.txt.gz
The UniProtKB Solanum lycopersicum reference proteome strain is cv. Heinz 1706;
this is the same strain used by the Ensembl Genome. Ensembl Genomes variation
data comprises genetic variation from sequencing of a selection of 84 tomato
accessions and related wild species representative for the Lycopersicon,
Arcanum, Eriopersicon and Neolycopersicon groups. The variation data has been
submitted to the ENA with accession ERP004618.

sorghum_bicolor_variation.txt.gz
The UniProtKB Sorghum bicolor reference proteome strain is cv. BTx623; this is
the same strain used by the Ensembl Genome. Ensembl Genomes variation data is
derived from two studies: Morris et al. 2013. Proc. Natl. Acad. Sci. U.S.A.
110:453-458 and Mace et al. 2013. Nat Commun. 4:2320.

sus_scrofa_variation.txt.gz
The UniProtKB Sus scrofa (Pig) variants are sourced from dbSNP, the Animal
Quantitative Trait Loci (QTL) database (Animal QTLdb), Database of Genomic
Variants Archive (DGVa) and the Pig SNP Consortium.

taeniopygia_guttata_variation.txt.gz
The UniProtKB Taeniopygia guttata (Zebra finch) variants are sourced from dbSNP.

triticum_aestivum_variation.txt.gz
The UniProtKB Triticum aestivum reference proteome strain is cv. Chinese Spring;
this is the same strain used by the Ensembl Genome. Ensembl Genomes variation
data is derived from SNP markers provided by CerealsDB, from the University of
Bristol.

vitis_vinifera_variation.txt.gz
The UniProtKB Vitis vinifera reference proteome strain is cv. Pinot noir 
PN40024; this is the same strain used by the Ensembl Genome. Ensembl Genomes
variation data derives from a collection of grape cultivars and wild Vitis
species from the USDA germplasm collection.


--------------------------------------------------------------------------------
  LICENSE
--------------------------------------------------------------------------------
We have chosen to apply the Creative Commons Attribution 4.0 International
(CC BY 4.0) License (https://creativecommons.org/licenses/by/4.0/) to all
copyrightable parts of our databases.

(c) 2002-2024 UniProt Consortium

--------------------------------------------------------------------------------
  DISCLAIMER
--------------------------------------------------------------------------------
We make no warranties regarding the correctness of the data, and disclaim
liability for damages resulting from its use. We cannot provide unrestricted
permission regarding the use of the data, as some data may be covered by patents
or other rights.

Any medical or genetic information is provided for research, educational and
informational purposes only. It is not in any way intended to be used as a
substitute for professional medical advice, diagnosis, treatment or care.
