
Universal Protein Resource (UniProt)
====================================


The Universal Protein Resource (UniProt), a collaboration between the European
Bioinformatics Institute (EBI), the SIB Swiss Institute of Bioinformatics, and
the Protein Information Resource (PIR), comprises three databases, each
optimized for different uses. The UniProt Knowledgebase (UniProtKB) is the
central access point for extensively curated protein information, including
function, classification and cross-references. The UniProt Reference Clusters
(UniRef) combine closely related sequences into a single record to speed up
sequence similarity searches. The UniProt Archive (UniParc) is a comprehensive
repository of all protein sequences, consisting only of unique identifiers and
sequences.


UniParc
=======

The UniProt Archive (UniParc) is a non-redundant protein sequence archive
containing all new and revised protein sequences from all publicly available
sources (http://www.uniprot.org/help/uniparc), so that complete sequence
coverage is available at a single site. To avoid redundancy, all sequences that
are 100% identical over their entire length are merged, regardless of the source
organism. New and updated sequences are cross-referenced to the source database
accession number and given a sequence version that increments whenever the
underlying sequence changes. The basic information stored within each UniParc
entry is the identifier, the sequence, a cyclic redundancy check number, the
source database(s) with accession and version numbers, and a time stamp. In
addition, each source database accession number is tagged with its status in
that database, indicating whether the sequence still exists or has been deleted
there, and with cross-references to NCBI GI and TaxId if appropriate.
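
The merging rule described above can be illustrated with a minimal Python
sketch. It is not the UniParc implementation: the function name, the tuple
layout and the example database names and accessions are all hypothetical, and
the only behaviour taken from the text is that sequences identical over their
entire length collapse into one record that keeps every source cross-reference.

```python
from collections import defaultdict


def merge_identical(records):
    """Group (source_db, accession, sequence) tuples by exact sequence.

    Hypothetical helper: sequences that are identical over their entire
    length share one key, and every source cross-reference is kept.
    """
    merged = defaultdict(list)
    for source_db, accession, sequence in records:
        merged[sequence].append((source_db, accession))
    return dict(merged)


# Made-up example data, not real accessions.
records = [
    ("DB_A", "A0001", "MKTAYIAK"),
    ("DB_B", "B0001", "MKTAYIAK"),   # identical to A0001, so merged with it
    ("DB_A", "A0002", "MKTAYIAKQ"),  # one residue longer, so kept separate
]
archive = merge_identical(records)
```

Here archive holds two sequences, the first of which carries two
cross-references.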


This directory contains the following:

fasta/active/  Directory representing UniParc sequences with at least one active
               cross-reference to a source database, in gzip-compressed FASTA format.
               The data has been split into smaller files for more robust downloads.
               All files from this directory need to be downloaded and combined.


xml/all/       Directory containing all UniParc sequences in XML format, including those
               that have been deleted from the source database, split into smaller files.
               All files from this directory need to be downloaded and combined.
               The XML files include:
               - cross-references to the source databases
               - status of the sequence in each source database
                 (e.g. if the sequence still exists, the status will be "active")
               - source database accession numbers and version
               - cross-references to NCBI GI and TaxID if appropriate

uniparc.xsd    Schema definition for the UniParc XML format. Related files can be found
               in the xml/all directory.
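
Because the XML files are large, it may be preferable to stream them rather
than load them whole. The Python sketch below shows one way to do that with the
standard library. It rests on assumptions that should be checked against
uniparc.xsd: that the files are gzip-compressed, that each record is an "entry"
element, and that it has "accession" and "dbReference" children. The function
names are hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix such as '{http://...}entry'."""
    return tag.rsplit("}", 1)[-1]


def iter_entries(path):
    """Yield (accession, cross_references) for each entry in one file.

    Element names are assumptions; verify them against uniparc.xsd.
    """
    with gzip.open(path, "rb") as handle:
        for _, elem in ET.iterparse(handle, events=("end",)):
            if local(elem.tag) != "entry":
                continue
            accession, xrefs = None, []
            for child in elem:
                name = local(child.tag)
                if name == "accession":
                    accession = child.text
                elif name == "dbReference":
                    # Keep all attributes (database, accession, status, ...).
                    xrefs.append(dict(child.attrib))
            yield accession, xrefs
            elem.clear()  # release the entry's children once it is consumed
```

Each yielded pair holds the UniParc identifier and one dictionary of attributes
per source database cross-reference. The emptied entry elements stay attached
to the root, so very long runs may need more aggressive cleanup.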

This file and these directories are updated with each UniProt release. File names might be
unchanged, but the data is likely to be different or to start at different offsets. We therefore
strongly recommend downloading the full set.

Please note: From UniProt release 2023_02 onwards we no longer provide the full
uniparc_active.fasta.gz and uniparc_all.xml.gz files. These files had grown over the years
to more than 100 and 200 gigabytes, respectively, which made them difficult to download.
We therefore now split them into sets of smaller files.

FTP geographical sites
Switzerland	https://ftp.expasy.org/databases/uniprot/current_release/uniparc
United Kingdom	https://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/uniparc
USA		https://ftp.uniprot.org/pub/databases/uniprot/current_release/uniparc

Example wget command line to get all fasta files using Switzerland's FTP:
wget -r -nd --no-parent -A 'uniparc_active_p*.fasta.gz' https://ftp.expasy.org/databases/uniprot/current_release/uniparc/fasta/active -e robots=off
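
Once the files are downloaded, they can be read as a single stream without
first being concatenated on disk. The Python sketch below is one way to do
this. It reuses the file name pattern from the wget command above and assumes
that every file starts at a record boundary, i.e. that no FASTA record is split
across two files. The function names are hypothetical.

```python
import glob
import gzip


def iter_fasta(paths):
    """Yield (header, sequence) pairs from gzip-compressed FASTA files.

    Assumes no record is split across two files.
    """
    for path in paths:
        header, chunks = None, []
        with gzip.open(path, "rt") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if line.startswith(">"):
                    if header is not None:
                        yield header, "".join(chunks)
                    header, chunks = line[1:], []
                elif line:
                    chunks.append(line)
        # Emit the last record of the file.
        if header is not None:
            yield header, "".join(chunks)


def count_records(directory="."):
    """Count the records across all downloaded files in a directory."""
    paths = sorted(glob.glob(f"{directory}/uniparc_active_p*.fasta.gz"))
    return sum(1 for _ in iter_fasta(paths))
```

Calling count_records() in the download directory gives a quick check that the
full set was retrieved. The files are visited in lexical name order, which does
not matter for counting.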


--------------------------------------------------------------------------------
  LICENSE
--------------------------------------------------------------------------------
We have chosen to apply the Creative Commons Attribution 4.0 International
(CC BY 4.0) License (https://creativecommons.org/licenses/by/4.0/) to all
copyrightable parts of our databases.

(c) 2002-2024 UniProt Consortium

--------------------------------------------------------------------------------
  DISCLAIMER
--------------------------------------------------------------------------------
We make no warranties regarding the correctness of the data, and disclaim
liability for damages resulting from its use. We cannot provide unrestricted
permission regarding the use of the data, as some data may be covered by patents
or other rights.

Any medical or genetic information is provided for research, educational and
informational purposes only. It is not in any way intended to be used as a
substitute for professional medical advice, diagnosis, treatment or care.
