
Universal Protein Resource (UniProt)
====================================


The Universal Protein Resource (UniProt), a collaboration between the European
Bioinformatics Institute (EBI), the SIB Swiss Institute of Bioinformatics, and
the Protein Information Resource (PIR), comprises three databases, each
optimized for different uses. The UniProt Knowledgebase (UniProtKB) is the
central access point for extensively curated protein information, including
function, classification and cross-references. The UniProt Reference Clusters
(UniRef) combine closely related sequences into a single record to speed up
sequence similarity searches. The UniProt Archive (UniParc) is a comprehensive
repository of all protein sequences, consisting only of unique identifiers and
sequences.

UniProt Reference Clusters (UniRef)
===================================

The UniProt Reference Clusters (UniRef) provide clustered sets (UniRef100, UniRef90
and UniRef50 clusters) of sequences from the UniProt Knowledgebase and selected UniParc
records, in order to obtain complete coverage of sequence space at several resolutions
(100%, >90% and >50%) while hiding redundant sequences (but not their descriptions)
from view.

UniRef90
========

UniRef90 clusters are generated from the UniRef100 seed sequences with a 90% sequence
identity threshold using the MMseqs2 algorithm. The seed sequences are the longest 
members of the UniRef100 cluster. However, the longest sequence is not always the 
most informative. There is often more biologically relevant information and annotation
(name, function, cross-references) available on other cluster members. All the proteins
in each cluster are ranked to facilitate the selection of a biologically relevant
representative for the cluster. The proteins are ranked as follows:

1. quality of annotation: a member from UniProtKB/Swiss-Prot is preferred, then
   UniProtKB/TrEMBL, and lastly UniParc
2. annotation score: entries with a higher UniProtKB Annotation Score are preferred
3. organism: entries from Reference proteomes and Model Organisms are preferred
4. sequence length: the longest sequence is preferred

As new proteins are added to UniProtKB and UniParc, UniRef cluster memberships and/or
identifiers might change.
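
The ranking criteria listed above can be sketched as a sort key. This is only
an illustration, not the UniProt production code: the Member record, its field
names and the strict lexicographic treatment of the four criteria are all
assumptions made for the example.

```python
from typing import NamedTuple


class Member(NamedTuple):
    """Hypothetical cluster member record; field names are assumptions."""
    accession: str
    source: str            # "Swiss-Prot", "TrEMBL" or "UniParc"
    annotation_score: int  # higher means better annotated
    is_reference: bool     # from a reference proteome or model organism
    length: int            # sequence length in residues


# Criterion 1: Swiss-Prot first, then TrEMBL, then UniParc.
SOURCE_RANK = {"Swiss-Prot": 0, "TrEMBL": 1, "UniParc": 2}


def rank_key(member):
    """Smaller tuples rank higher; criteria are compared in the listed order."""
    return (
        SOURCE_RANK[member.source],  # 1. quality of annotation
        -member.annotation_score,    # 2. higher annotation score preferred
        not member.is_reference,     # 3. reference/model organisms preferred
        -member.length,              # 4. longest sequence preferred
    )


def pick_representative(members):
    """Return the top-ranked member under the assumed ordering."""
    return min(members, key=rank_key)
```

With this key, a shorter Swiss-Prot member outranks a longer TrEMBL member,
which matches the point that the longest sequence is not always chosen.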

UniRef90 cluster titles and identifiers are derived from the representative UniRef100
entry. The UniRef90 identifier is generated by replacing the "UniRef100_" prefix of
the representative's identifier with "UniRef90_".
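
The prefix replacement described above amounts to a simple string operation.
The function name below is hypothetical and the sketch only mirrors the rule
as stated in this document.

```python
def uniref90_id_from_uniref100(uniref100_id):
    """Derive a UniRef90 identifier from a representative UniRef100 identifier."""
    prefix = "UniRef100_"
    if not uniref100_id.startswith(prefix):
        raise ValueError("not a UniRef100 identifier: %r" % uniref100_id)
    # Keep everything after the prefix and attach the UniRef90 prefix.
    return "UniRef90_" + uniref100_id[len(prefix):]
```

For instance, a representative identified as UniRef100_P99999 would give the
cluster identifier UniRef90_P99999.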

FTP access
==========

Currently, UniRef90 is available from the UniProt FTP site:

        ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90

The UniRef90 files and their descriptions are as follows:

File Name       File Description
-------------   -----------------------------------------------------------
uniref90.fasta  This file contains all UniRef90 entries in FASTA format. 
                The definition line in the FASTA format includes cluster-specific
                information such as cluster name, number of members and common
                taxonomy, and also the ID of the representative protein.
                The format is as follows:
                >UniqueIdentifier ClusterName n=Members Tax=Taxon RepID=RepresentativeMember
                where:
                - UniqueIdentifier is the primary accession number of the UniRef cluster.
                - ClusterName is the name of the UniRef cluster.
                - Members is the number of UniRef cluster members.
                - Taxon is the scientific name of the lowest common taxon shared
                  by all UniRef cluster members.
                - RepresentativeMember is the entry name of the representative member
                  of the UniRef cluster.
                For example:
                >UniRef90_P99999 Cytochrome c n=14 Tax=Catarrhini RepID=CYC_HUMAN

uniref90.xml    This file contains all UniRef90 entries in XML format. Each entry is
                identified by the UniRef identifier, and contains:
                - cross-reference to the representative UniProtKB or UniParc entry and
                  its sequence
                - a flag on the cluster member that served as the seed sequence
                - cross-references to member UniProtKB and/or UniParc entries
                - cross-references to UniRef50 and UniRef100 entries
                - member count
                - common taxon
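
The FASTA definition line format given for uniref90.fasta can be split into
its fields with a regular expression. The sketch below assumes the line has
exactly the fields documented above, in that order; the function name and the
dictionary keys are hypothetical, and lines carrying additional fields would
need a different pattern.

```python
import re

# Pattern for: >UniqueIdentifier ClusterName n=Members Tax=Taxon RepID=RepresentativeMember
HEADER_RE = re.compile(
    r"^>(?P<id>\S+) (?P<name>.+?) n=(?P<members>\d+) "
    r"Tax=(?P<taxon>.+?) RepID=(?P<rep_id>\S+)$"
)


def parse_definition_line(line):
    """Split a UniRef FASTA definition line into its documented fields."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognised definition line: %r" % line)
    fields = match.groupdict()
    fields["members"] = int(fields["members"])  # member count as an integer
    return fields
```

Applied to the example line above, this yields the identifier UniRef90_P99999,
the cluster name "Cytochrome c", a member count of 14, the taxon Catarrhini
and the representative CYC_HUMAN.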

Document type definition for uniref90.xml   
------------------------------------------
<?xml version="1.0" encoding="ASCII"?>
<!DOCTYPE UniRef90 [
<!ELEMENT UniRef90 (entry+)>
<!ATTLIST UniRef90 
                    xmlns CDATA #FIXED "http://uniprot.org/uniref"
                    xmlns:xsi CDATA #IMPLIED
                    xsi:schemaLocation CDATA #IMPLIED
                    releaseDate    CDATA #IMPLIED
                    version        CDATA #IMPLIED
>
<!-- entry: UniRef90 entry -->
<!ELEMENT entry (name,property*,representativeMember,member*)> 
<!ATTLIST entry  id             ID    #REQUIRED
                 updated        CDATA #IMPLIED 
>

<!-- name: UniRef90 cluster name derived from representative --> 
<!-- UniRef100 entry  -->
<!ELEMENT name  (#PCDATA)>


<!-- representativeMember: information for representative -->
<!-- UniRef100 entry  -->
<!ELEMENT representativeMember (dbReference,sequence)>

<!-- member: members of UniRef90 cluster other than representative -->
<!ELEMENT member (dbReference)>

<!-- dbReference: cross-reference to member UniRef100 entries  -->
<!-- of the UniRef90 cluster --> 
<!ELEMENT dbReference (property*)>
<!ATTLIST dbReference
    type CDATA #REQUIRED 
    id   CDATA #REQUIRED
> 

<!-- property: properties of cross-references -->
<!ELEMENT property EMPTY>
<!ATTLIST property
    type CDATA #REQUIRED
    value CDATA #REQUIRED
>

<!ELEMENT sequence (#PCDATA ) >
<!ATTLIST sequence
    length CDATA #IMPLIED
    checksum CDATA #IMPLIED
>

]>
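
As an illustration of how the element structure in the DTD above maps onto
code, the following sketch streams entry elements with the Python standard
library. The function name, the dictionary keys and the sample document are
hypothetical; the sample's member identifier, dbReference type values,
property and sequence are made up for the example. Only the element names,
attributes and namespace come from the DTD.

```python
import io
import xml.etree.ElementTree as ET

# Namespace declared as #FIXED on the UniRef90 root element in the DTD.
NS = "{http://uniprot.org/uniref}"


def _ref(parent):
    """Return (type, id) from the dbReference child of the given element."""
    db_ref = parent.find(NS + "dbReference")
    return (db_ref.get("type"), db_ref.get("id"))


def iter_entries(source):
    """Yield one dict per entry element, reading the file incrementally."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != NS + "entry":
            continue
        rep = elem.find(NS + "representativeMember")
        seq = rep.find(NS + "sequence")
        yield {
            "id": elem.get("id"),
            "name": elem.findtext(NS + "name"),
            "properties": {p.get("type"): p.get("value")
                           for p in elem.findall(NS + "property")},
            "representative": _ref(rep),
            "sequence": "".join((seq.text or "").split()),
            "members": [_ref(m) for m in elem.findall(NS + "member")],
        }
        elem.clear()  # drop the children of a processed entry


# Made-up sample document that follows the DTD; values are illustrative only.
SAMPLE = b"""<UniRef90 xmlns="http://uniprot.org/uniref">
<entry id="UniRef90_P99999">
<name>Cytochrome c</name>
<property type="example property" value="example value"/>
<representativeMember>
<dbReference type="example type" id="CYC_HUMAN"/>
<sequence length="9">MGDVEKGKK</sequence>
</representativeMember>
<member><dbReference type="example type" id="HYPOTHETICAL_1"/></member>
</entry>
</UniRef90>"""

for entry in iter_entries(io.BytesIO(SAMPLE)):
    print(entry["id"], entry["representative"][1], len(entry["members"]))
```

Streaming with iterparse is chosen here because uniref90.xml contains all
UniRef90 entries in a single file; for a real file, pass an open binary file
object instead of the in-memory sample.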

--------------------------------------------------------------------------------
  LICENSE
--------------------------------------------------------------------------------
We have chosen to apply the Creative Commons Attribution (CC BY 4.0) License
(https://creativecommons.org/licenses/by/4.0/) to all copyrightable parts of
our databases.

(c) 2002-2024 UniProt Consortium

--------------------------------------------------------------------------------
  DISCLAIMER
--------------------------------------------------------------------------------
We make no warranties regarding the correctness of the data, and disclaim
liability for damages resulting from its use. We cannot provide unrestricted
permission regarding the use of the data, as some data may be covered by patents
or other rights.

Any medical or genetic information is provided for research, educational and
informational purposes only. It is not in any way intended to be used as a
substitute for professional medical advice, diagnosis, treatment or care.

