7.5.1. Main Header
- Ident: "XETBLOB" (7 ASCII bytes)¶
- Version: 8-bit unsigned, MUST be 1¶
- Xorb hash: 32-byte Merkle hash from Section 6.2¶
This document specifies XET, a content-addressable storage (CAS) protocol designed for efficient storage and transfer of large files with chunk-level deduplication.¶
XET uses content-defined chunking to split files into variable-sized chunks, aggregates chunks into containers called xorbs, and enables deduplication across files and repositories through cryptographic hashing.¶
This note is to be removed before publishing as an RFC.¶
Source for this draft and an issue tracker can be found at https://github.com/jedisct1/draft-denis-xet.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 18 June 2026.¶
Copyright (c) 2025 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.¶
Large-scale data storage and transfer systems face fundamental challenges in efficiency: storing multiple versions of similar files wastes storage space, and transferring unchanged data wastes bandwidth. Traditional approaches such as file-level deduplication miss opportunities to share common content between different files, while fixed-size chunking fails to handle insertions and deletions gracefully.¶
XET addresses these challenges through a content-addressable storage protocol that operates at the chunk level. By using content-defined chunking with a rolling hash algorithm, XET creates stable chunk boundaries that remain consistent even when files are modified.¶
This enables efficient deduplication not only within a single file across versions, but also across entirely different files that happen to share common content.¶
The protocol is designed around several key principles:¶
Determinism: Given the same input data, any conforming implementation MUST produce identical chunks, hashes, and serialized formats, ensuring interoperability.¶
Content Addressing: All objects (chunks, xorbs, files) are identified by cryptographic hashes of their content, enabling integrity verification and natural deduplication.¶
Efficient Transfer: The reconstruction-based download model allows clients to fetch only the data they need, supporting range queries and parallel downloads.¶
Algorithm Agility: The chunking and hashing algorithms are encapsulated in algorithm suites, enabling future evolution while maintaining compatibility within a deployment.¶
Provider Agnostic: While originally developed for machine learning model and dataset storage, XET is a generic protocol applicable to any large file storage scenario.¶
This specification provides the complete details necessary for implementing interoperable XET clients and servers.
It defines the XET-GEARHASH-BLAKE3 algorithm suite as the default, using Gearhash for content-defined chunking and BLAKE3 for cryptographic hashing.¶
XET is particularly well-suited for scenarios involving:¶
Machine Learning: Model checkpoints often share common layers and parameters across versions, enabling significant storage savings through deduplication.¶
Dataset Management: Large datasets with incremental updates benefit from chunk-level deduplication, where only changed portions need to be transferred.¶
Version Control: Similar to Git LFS but with content-aware chunking that enables sharing across different files, not just versions of the same file.¶
Content Distribution: The reconstruction-based model enables efficient range queries and partial downloads of large files.¶
The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “NOT RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
Throughout this document, the following terms apply:¶
| Term | Definition |
|---|---|
| Algorithm Suite | A specification of the cryptographic hash function and content-defined chunking algorithm used by an XET deployment. All participants in an XET system MUST use the same algorithm suite for interoperability. |
| Chunk | A variable-sized unit of data derived from a file using content-defined chunking. Chunks are the fundamental unit of deduplication in XET. |
| Chunk Hash | A 32-byte cryptographic hash that uniquely identifies a chunk based on its content. |
| Xorb | A container object that aggregates multiple compressed chunks for efficient storage and transfer. The name derives from “XET orb.” |
| Xorb Hash | A 32-byte cryptographic hash computed from the chunk hashes within a xorb using a Merkle tree construction. |
| File Hash | A 32-byte cryptographic hash that uniquely identifies a file based on its chunk composition. |
| Shard | A binary metadata structure that describes file reconstructions and xorb contents, used for registering uploads and enabling deduplication. |
| Term | A reference to a contiguous range of chunks within a specific xorb, used to describe how to reconstruct a file. |
| File Reconstruction | An ordered list of terms that describes how to reassemble a file from chunks stored in xorbs. |
| Content-Defined Chunking (CDC) | An algorithm that determines chunk boundaries based on file content rather than fixed offsets, enabling stable boundaries across file modifications. |
| Content-Addressable Storage (CAS) | A storage system where objects are addressed by cryptographic hashes of their content rather than by location or name. |
| Global Deduplication | The process of identifying chunks that already exist in the storage system to avoid redundant uploads. |
All multi-byte integers in binary formats (xorb headers, shard structures) use little-endian byte order unless otherwise specified.¶
Hash values are 32 bytes (256 bits). When serialized, they are stored as raw bytes. When displayed as strings, they use a specific byte-swapped hexadecimal format (see Section 6.5).¶
Range specifications use different conventions depending on context:¶
| Context | End Semantics | Example |
|---|---|---|
| HTTP Range header | Inclusive | bytes=0-999 means bytes 0 through 999 |
| url_range in fetch_info | Inclusive | {"start": 0, "end": 999} means bytes 0 through 999 |
| Chunk index ranges | Exclusive | {"start": 0, "end": 4} means chunks 0, 1, 2, 3 |
| Shard chunk ranges | Exclusive | chunk_index_end is exclusive |
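The following non-normative Python sketch (the function names are illustrative, not part of the protocol) shows how the two conventions map onto half-open slices:¶
def http_range_to_slice(start, end_inclusive):
    # HTTP Range and url_range use inclusive ends: bytes=0-999 covers 1000 bytes.
    return slice(start, end_inclusive + 1)

def chunk_range_to_slice(start, end_exclusive):
    # Chunk index ranges and shard chunk ranges are already exclusive.
    return slice(start, end_exclusive)

data = bytes(2048)
assert len(data[http_range_to_slice(0, 999)]) == 1000
assert chunk_range_to_slice(0, 4) == slice(0, 4)  # chunks 0, 1, 2, 3
¶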
XET operates as a client-server protocol. Clients perform content-defined chunking locally, query for deduplication opportunities, form xorbs from new chunks, and upload both xorbs and shards to the server. The CAS server provides APIs for reconstruction queries, global deduplication, and persistent storage.¶
The upload process transforms files into content-addressed storage:¶
Chunking: Split files into variable-sized chunks using content-defined chunking (see Section 5).¶
Deduplication: Query for existing chunks to avoid redundant uploads (see Section 10).¶
Xorb Formation: Group new chunks into xorbs, applying compression (see Section 7).¶
Xorb Upload: Upload serialized xorbs to the CAS server.¶
Shard Formation: Create shard metadata describing file reconstructions.¶
Shard Upload: Upload the shard to register files in the system.¶
The download process reconstructs files from stored chunks:¶
Reconstruction Query: Request reconstruction information for a file hash.¶
Term Processing: Parse the ordered list of terms describing the file.¶
Data Fetching: Download required xorb ranges using provided URLs.¶
Chunk Extraction: Deserialize and decompress chunks from xorb data.¶
File Assembly: Concatenate chunks in term order to reconstruct the file.¶
XET is designed as a generic framework where the specific chunking algorithm and cryptographic hash function are parameters defined by an algorithm suite. This enables future algorithm agility while maintaining full backward compatibility within a deployment.¶
An algorithm suite specifies:¶
Content-Defined Chunking Algorithm: The rolling hash function and boundary detection logic used to split files into chunks.¶
Cryptographic Hash Function: The hash algorithm used for all content addressing (chunk hashes, xorb hashes, file hashes, verification hashes).¶
Keying Material: Domain separation keys for the hash function.¶
Algorithm Parameters: Chunk size bounds, mask values, lookup tables, and other constants.¶
Any conforming algorithm suite MUST satisfy:¶
Determinism: Identical inputs MUST produce identical outputs across all implementations.¶
Collision Resistance: The hash function MUST provide at least 128 bits of collision resistance.¶
Preimage Resistance: The hash function MUST provide at least 128 bits of preimage resistance.¶
Keyed Mode: The hash function MUST support keyed operation for domain separation.¶
The algorithm suite used by an XET deployment is determined out-of-band, typically by the CAS server configuration. All clients interacting with a given server MUST use the same suite. Binary formats (xorbs, shards) do not contain suite identifiers; the suite is determined implicitly by the deployment context.¶
This specification defines one algorithm suite:¶
XET-GEARHASH-BLAKE3: Uses Gearhash for content-defined chunking and BLAKE3 for all cryptographic hashing. This is the default and currently only defined suite.¶
Future specifications MAY define additional suites with different algorithms.¶
Content-defined chunking (CDC) splits files into variable-sized chunks based on content rather than fixed offsets. This produces deterministic chunk boundaries that remain stable across file modifications, enabling efficient deduplication.¶
This section describes the chunking algorithm for the XET-GEARHASH-BLAKE3 suite.
Other algorithm suites MAY define different chunking algorithms with different parameters.¶
The XET-GEARHASH-BLAKE3 suite uses a Gearhash-based rolling hash algorithm [GEARHASH].
Gearhash maintains a 64-bit state that is updated with each input byte using a lookup table, providing fast and deterministic boundary detection.¶
The following constants define the chunking behavior for the XET-GEARHASH-BLAKE3 suite:¶
TARGET_CHUNK_SIZE = 65536        # 64 KiB (2^16 bytes)
MIN_CHUNK_SIZE = 8192            # 8 KiB (TARGET / 8)
MAX_CHUNK_SIZE = 131072          # 128 KiB (TARGET * 2)
MASK = 0xFFFF000000000000        # 16 one-bits¶
The Gearhash algorithm uses a lookup table of 256 64-bit constants.
Implementations of the XET-GEARHASH-BLAKE3 suite MUST use the table defined in [GEARHASH] (see Appendix A for the complete lookup table).¶
The algorithm maintains a 64-bit rolling hash value and processes input bytes sequentially:¶
function chunk_file(data):
h = 0 # 64-bit rolling hash
start_offset = 0 # Start of current chunk
chunks = []
for i from 0 to length(data):
b = data[i]
h = ((h << 1) + TABLE[b]) & 0xFFFFFFFFFFFFFFFF # 64-bit wrap
chunk_size = i - start_offset + 1
# Skip boundary checks until minimum size reached
if chunk_size < MIN_CHUNK_SIZE:
continue
# Force boundary at maximum size
if chunk_size >= MAX_CHUNK_SIZE:
chunks.append(data[start_offset : i + 1])
start_offset = i + 1
h = 0
continue
# Check for natural boundary
if (h & MASK) == 0:
chunks.append(data[start_offset : i + 1])
start_offset = i + 1
h = 0
# Emit final chunk if any data remains
if start_offset < length(data):
chunks.append(data[start_offset : length(data)])
return chunks
¶
The following rules govern chunk boundary placement:¶
Boundaries MUST NOT be placed before MIN_CHUNK_SIZE bytes have been processed in the current chunk.¶
Boundaries MUST be forced when MAX_CHUNK_SIZE bytes have been processed, regardless of hash value.¶
Between minimum and maximum sizes, boundaries are placed when (h & MASK) == 0.¶
The final chunk MAY be smaller than MIN_CHUNK_SIZE if it represents the end of the file.¶
Files smaller than MIN_CHUNK_SIZE produce a single chunk.¶
Implementations MUST produce identical chunk boundaries for identical input data.
For the XET-GEARHASH-BLAKE3 suite, this requires:¶
Using the exact lookup table values from Appendix A¶
Using 64-bit wrapping arithmetic for hash updates¶
Processing bytes in sequential order¶
Applying boundary rules consistently¶
Other algorithm suites MUST specify their own determinism requirements.¶
Implementations MAY skip hash computation for the first MIN_CHUNK_SIZE - 64 - 1 bytes of each chunk, as boundary tests are not performed in this region.¶
This optimization does not affect output correctness because the Gearhash window is 64 bytes, ensuring the hash state is fully populated by the time boundary tests begin.¶
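The sketch below illustrates one way to apply this optimization in Python; it assumes TABLE (Appendix A) and the chunking constants defined above are in scope, and it is a non-normative illustration, not a reference implementation:¶
WINDOW = 64  # the Gearhash state depends only on the previous 64 bytes

def next_boundary(data, start):
    # Return the index one past the last byte of the chunk starting at `start`.
    end_cap = min(start + MAX_CHUNK_SIZE, len(data))
    # Boundary tests only begin once MIN_CHUNK_SIZE bytes have been seen, so
    # hashing may start MIN_CHUNK_SIZE - WINDOW - 1 bytes into the chunk.
    i = max(start, start + MIN_CHUNK_SIZE - WINDOW - 1)
    h = 0
    while i < end_cap:
        h = ((h << 1) + TABLE[data[i]]) & 0xFFFFFFFFFFFFFFFF
        if (i - start + 1) >= MIN_CHUNK_SIZE and (h & MASK) == 0:
            return i + 1
        i += 1
    return end_cap  # forced boundary at MAX_CHUNK_SIZE, or end of data
¶
A full chunker would call next_boundary repeatedly, advancing start to the returned index until the end of the input is reached.¶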
XET uses cryptographic hashing for content addressing, integrity verification, and deduplication. The specific hash function is determined by the algorithm suite. All hashes are 32 bytes (256 bits) in length.¶
This section describes the hashing methods for the XET-GEARHASH-BLAKE3 suite, which uses BLAKE3 keyed hashing [BLAKE3] for all cryptographic hash computations.
Different key values provide domain separation between hash types.¶
Chunk hashes uniquely identify individual chunks based on their content. The algorithm suite determines how chunk hashes are computed.¶
For the XET-GEARHASH-BLAKE3 suite, chunk hashes use BLAKE3 keyed hash with DATA_KEY as the key:¶
DATA_KEY = {
0x66, 0x97, 0xf5, 0x77, 0x5b, 0x95, 0x50, 0xde,
0x31, 0x35, 0xcb, 0xac, 0xa5, 0x97, 0x18, 0x1c,
0x9d, 0xe4, 0x21, 0x10, 0x9b, 0xeb, 0x2b, 0x58,
0xb4, 0xd0, 0xb0, 0x4b, 0x93, 0xad, 0xf2, 0x29
}
¶
function compute_chunk_hash(chunk_data):
return blake3_keyed_hash(DATA_KEY, chunk_data)
¶
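As a non-normative sketch, this maps directly onto the keyed mode of the third-party blake3 Python package (the availability of that package is an assumption of this example):¶
import blake3

DATA_KEY = bytes([
    0x66, 0x97, 0xf5, 0x77, 0x5b, 0x95, 0x50, 0xde,
    0x31, 0x35, 0xcb, 0xac, 0xa5, 0x97, 0x18, 0x1c,
    0x9d, 0xe4, 0x21, 0x10, 0x9b, 0xeb, 0x2b, 0x58,
    0xb4, 0xd0, 0xb0, 0x4b, 0x93, 0xad, 0xf2, 0x29,
])

def compute_chunk_hash(chunk_data: bytes) -> bytes:
    # BLAKE3 keyed hash with the 32-byte DATA_KEY; returns 32 raw bytes.
    return blake3.blake3(chunk_data, key=DATA_KEY).digest()
¶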
Xorb hashes identify xorbs based on their constituent chunks. The hash is computed using a Merkle tree construction where leaf nodes are chunk hashes. The Merkle tree construction is defined separately from the hash function.¶
Internal node hashes combine child hashes with their sizes. The hash function is determined by the algorithm suite.¶
For the XET-GEARHASH-BLAKE3 suite, internal node hashes use BLAKE3 keyed hash with INTERNAL_NODE_KEY as the key:¶
INTERNAL_NODE_KEY = {
0x01, 0x7e, 0xc5, 0xc7, 0xa5, 0x47, 0x29, 0x96,
0xfd, 0x94, 0x66, 0x66, 0xb4, 0x8a, 0x02, 0xe6,
0x5d, 0xdd, 0x53, 0x6f, 0x37, 0xc7, 0x6d, 0xd2,
0xf8, 0x63, 0x52, 0xe6, 0x4a, 0x53, 0x71, 0x3f
}
¶
The input to the hash function is a string formed by concatenating lines for each child:¶
{hash_hex} : {size}\n
¶
Where:¶
{hash_hex} is the 64-character lowercase hexadecimal representation of the child hash as defined in Section 6.5¶
{size} is the decimal representation of the child’s byte size¶
Lines are separated by newline characters (\n)¶
function compute_internal_hash(children):
buffer = ""
for (hash, size) in children:
buffer += hash_to_string(hash) + " : " + str(size) + "\n"
return blake3_keyed_hash(INTERNAL_NODE_KEY, buffer.encode("utf-8"))
¶
XET uses an aggregated hash tree construction with variable fan-out, not a traditional binary Merkle tree. This algorithm iteratively collapses a list of (hash, size) pairs until a single root hash remains.¶
MEAN_BRANCHING_FACTOR = 4
MIN_CHILDREN = 2
MAX_CHILDREN = 2 * MEAN_BRANCHING_FACTOR + 1  # 9¶
The tree structure is determined by the hash values themselves. A cut point occurs when:¶
At least 3 children have been accumulated AND the current hash modulo MEAN_BRANCHING_FACTOR equals zero, OR¶
The maximum number of children (9) has been reached, OR¶
The end of the input list is reached¶
Note: When the input has 2 or fewer hashes, all are merged together. This ensures each internal node has at least 2 children.¶
function next_merge_cut(hashes):
# hashes is a list of (hash, size) pairs
# Returns the number of entries to merge (cut point)
if length(hashes) <= 2:
return length(hashes)
end = min(MAX_CHILDREN, length(hashes))
# Check indices [2, end) using 0-based indexing
# Minimum merge is 3 children when input has more than 2 hashes
for i from 2 to end:
h = hashes[i].hash
# Interpret last 8 bytes of hash as little-endian 64-bit unsigned int
hash_value = bytes_to_u64_le(h[24:32])
if hash_value % MEAN_BRANCHING_FACTOR == 0:
return i + 1 # Cut after element i (include i+1 elements)
return end
¶
function merged_hash_of_sequence(hash_pairs):
# hash_pairs is a list of (hash, size) pairs
buffer = ""
total_size = 0
for (h, s) in hash_pairs:
buffer += hash_to_string(h) + " : " + str(s) + "\n"
total_size += s
new_hash = blake3_keyed_hash(INTERNAL_NODE_KEY, buffer.encode("utf-8"))
return (new_hash, total_size)
¶
This produces lines like:¶
cfc5d07f6f03c29bbf424132963fe08d19a37d5757aaf520bf08119f05cd56d6 : 100¶
Each line contains the 64-character hexadecimal hash string (see Section 6.5), the separator " : ", and the decimal byte size, terminated by a newline.¶
function compute_merkle_root(entries):
# entries is a list of (hash, size) pairs
if length(entries) == 0:
return zero_hash() # 32 zero bytes
hv = copy(entries)
while length(hv) > 1:
write_idx = 0
read_idx = 0
while read_idx < length(hv):
# Find the next cut point
next_cut = read_idx + next_merge_cut(hv[read_idx:])
# Merge this slice into one parent node
hv[write_idx] = merged_hash_of_sequence(hv[read_idx:next_cut])
write_idx += 1
read_idx = next_cut
hv = hv[0:write_idx]
return hv[0].hash
¶
The xorb hash is the root of a Merkle tree built from chunk hashes:¶
function compute_xorb_hash(chunk_hashes, chunk_sizes):
# Build leaf entries
entries = []
for i from 0 to length(chunk_hashes):
entries.append((chunk_hashes[i], chunk_sizes[i]))
# Compute root using the aggregated hash tree algorithm
return compute_merkle_root(entries)
¶
File hashes identify files based on their complete chunk composition. The computation is similar to xorb hashes, but with an additional final keyed hash step for domain separation.¶
For the XET-GEARHASH-BLAKE3 suite, file hashes use an all-zero key (ZERO_KEY) for the final hash:¶
ZERO_KEY = {
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
}
¶
function compute_file_hash(chunk_hashes, chunk_sizes):
# Build (hash, size) pairs for Merkle tree
entries = zip(chunk_hashes, chunk_sizes)
merkle_root = compute_merkle_root(entries)
return blake3_keyed_hash(ZERO_KEY, merkle_root)
¶
For empty files (zero bytes), there are no chunks, so compute_merkle_root([]) returns 32 zero bytes.
The file hash is therefore blake3_keyed_hash(ZERO_KEY, zero_hash()), where zero_hash() is 32 zero bytes.¶
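A minimal non-normative sketch of the empty-file case, again assuming the blake3 Python package:¶
import blake3

ZERO_KEY = bytes(32)    # 32 zero bytes
zero_root = bytes(32)   # compute_merkle_root([]) is defined as 32 zero bytes

empty_file_hash = blake3.blake3(zero_root, key=ZERO_KEY).digest()
¶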
Term verification hashes are used in shards to prove that the uploader possesses the actual file data, not just metadata. The hash function is determined by the algorithm suite.¶
For the XET-GEARHASH-BLAKE3 suite, verification hashes use BLAKE3 keyed hash with VERIFICATION_KEY as the key:¶
VERIFICATION_KEY = {
0x7f, 0x18, 0x57, 0xd6, 0xce, 0x56, 0xed, 0x66,
0x12, 0x7f, 0xf9, 0x13, 0xe7, 0xa5, 0xc3, 0xf3,
0xa4, 0xcd, 0x26, 0xd5, 0xb5, 0xdb, 0x49, 0xe6,
0x41, 0x24, 0x98, 0x7f, 0x28, 0xfb, 0x94, 0xc3
}
¶
The input is the raw concatenation of chunk hashes (not hex-encoded) for the term’s chunk range:¶
function compute_verification_hash(chunk_hashes, start_index, end_index):
buffer = bytes()
for i from start_index to end_index: # end_index is exclusive
buffer += chunk_hashes[i] # 32 bytes each
return blake3_keyed_hash(VERIFICATION_KEY, buffer)
¶
When representing hashes as strings (e.g., in API paths), a specific byte reordering is applied before hexadecimal encoding.¶
The 32-byte hash is interpreted as four little-endian 64-bit unsigned values, and each value is printed as 16 hexadecimal digits:¶
Divide the 32-byte hash into four 8-byte segments¶
Interpret each segment as a little-endian 64-bit unsigned value¶
Format each value as a zero-padded 16-character lowercase hexadecimal string¶
Concatenate the four strings (64 characters total)¶
function hash_to_string(hash):
out = ""
for segment in 0..4: # segments 0,1,2,3
offset = segment * 8
value = little_endian_to_u64(hash[offset : offset + 8])
out += format("{:016x}", value) # always 16 hex digits
return out
function string_to_hash(hex_string):
hash = []
for segment in 0..4:
start = segment * 16
value = parse_u64_from_hex(hex_string[start : start + 16])
hash.extend(u64_to_little_endian_bytes(value))
return hash
¶
Original hash bytes (indices 0-31):
  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
   16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
Reordered bytes:
  [7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8,
   23, 22, 21, 20, 19, 18, 17, 16, 31, 30, 29, 28, 27, 26, 25, 24]
String representation:
  07060504030201000f0e0d0c0b0a090817161514131211101f1e1d1c1b1a1918¶
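The pseudocode translates directly into the following runnable Python sketch, checked against the byte pattern above (see also Appendix B.2):¶
import struct

def hash_to_string(h: bytes) -> str:
    # Interpret the 32-byte hash as four little-endian u64 values,
    # each printed as 16 lowercase hexadecimal digits.
    return "".join(f"{v:016x}" for v in struct.unpack("<4Q", h))

def string_to_hash(s: str) -> bytes:
    values = [int(s[i * 16:(i + 1) * 16], 16) for i in range(4)]
    return struct.pack("<4Q", *values)

h = bytes(range(32))
assert hash_to_string(h) == (
    "07060504030201000f0e0d0c0b0a0908"
    "17161514131211101f1e1d1c1b1a1918")
assert string_to_hash(hash_to_string(h)) == h
¶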
A xorb is a container that aggregates multiple compressed chunks for efficient storage and transfer. Xorbs are identified by their xorb hash (see Section 6.2).¶
MAX_XORB_SIZE = 67108864     # 64 MiB maximum serialized size
MAX_XORB_CHUNKS = 8192       # Maximum chunks per xorb¶
Implementations MUST NOT exceed either limit. When collecting chunks, implementations start a new xorb once adding another chunk would exceed either limit.¶
Serialized xorbs have a footer so readers can locate metadata by seeking from the end:¶
+-------------------------------------------------------------+
| Chunk Data Region (variable)                                 |
| [chunk header + compressed bytes repeated per chunk]         |
+-------------------------------------------------------------+
| CasObjectInfo Footer (variable)                              |
+-------------------------------------------------------------+
| Info Length (32-bit unsigned LE, footer length only)         |
+-------------------------------------------------------------+¶
The final 4-byte little-endian integer stores the length of the CasObjectInfo block immediately preceding it (the length does not include the 4-byte length field itself).¶
The chunk data region consists of consecutive chunk entries, each containing an 8-byte header followed by the compressed chunk data.¶
Each chunk header is 8 bytes with the following layout:¶
| Offset | Size | Field |
|---|---|---|
| 0 | 1 | Version (must be 0) |
| 1 | 3 | Compressed Size (little-endian, bytes) |
| 4 | 1 | Compression Type |
| 5 | 3 | Uncompressed Size (little-endian, bytes) |
The version field MUST be 0 for this specification.
Implementations MUST reject chunks with unknown version values.¶
Both size fields use 3-byte little-endian encoding, supporting values up to 16,777,215 bytes. Given the maximum chunk size of 128 KiB, this provides ample range.¶
Implementations MUST validate size fields before allocating buffers or invoking decompression:¶
uncompressed_size MUST be greater than zero and MUST NOT exceed MAX_CHUNK_SIZE (128 KiB). Chunks that declare larger sizes MUST be rejected and the containing xorb considered invalid.¶
compressed_size MUST be greater than zero and MUST NOT exceed the lesser of MAX_CHUNK_SIZE and the remaining bytes in the serialized xorb payload. Oversize or truncated compressed payloads MUST cause the xorb to be rejected.¶
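A non-normative Python sketch of parsing one chunk header and applying these validation rules:¶
MAX_CHUNK_SIZE = 131072  # 128 KiB, from Section 5

def parse_chunk_header(header: bytes, remaining_payload: int):
    # header: the 8 header bytes; remaining_payload: bytes left in the
    # serialized xorb payload after this header.
    version = header[0]
    compressed_size = int.from_bytes(header[1:4], "little")
    compression_type = header[4]
    uncompressed_size = int.from_bytes(header[5:8], "little")
    if version != 0:
        raise ValueError("unknown chunk header version")
    if not 0 < uncompressed_size <= MAX_CHUNK_SIZE:
        raise ValueError("invalid uncompressed size")
    if not 0 < compressed_size <= min(MAX_CHUNK_SIZE, remaining_payload):
        raise ValueError("compressed payload oversize or truncated")
    return compressed_size, compression_type, uncompressed_size
¶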
| Value | Name | Description |
|---|---|---|
| 0 | None | No compression; data stored as-is |
| 1 | LZ4 | LZ4 Frame format compression |
| 2 | ByteGrouping4LZ4 | Byte grouping preprocessing followed by LZ4 |
None (Type 0)
Data is stored without modification. Used when compression would increase size or for already-compressed data.¶
LZ4 (Type 1)
LZ4 Frame format compression [LZ4] (not LZ4 block format).
Each compressed chunk is a complete LZ4 frame.
This is the default compression scheme for most data.¶
ByteGrouping4LZ4 (Type 2)
A two-stage compression optimized for structured data (e.g., floating-point arrays):¶
Byte Grouping Phase: Reorganize bytes by position within 4-byte groups¶
LZ4 Compression: Apply LZ4 to the reorganized data¶
Byte grouping transformation:¶
Original: [A0 A1 A2 A3 | B0 B1 B2 B3 | C0 C1 C2 C3 | ...]
Grouped:  [A0 B0 C0 ... | A1 B1 C1 ... | A2 B2 C2 ... | A3 B3 C3 ...]¶
function byte_group_4(data):
n = length(data)
groups = [[], [], [], []]
for i from 0 to n:
groups[i % 4].append(data[i])
return concatenate(groups[0], groups[1], groups[2], groups[3])
function byte_ungroup_4(grouped_data, original_length):
n = original_length
    base_size = n // 4  # integer division
remainder = n % 4
# Calculate group sizes
sizes = [base_size + (1 if i < remainder else 0) for i in range(4)]
# Extract groups
groups = []
offset = 0
for size in sizes:
groups.append(grouped_data[offset : offset + size])
offset += size
# Interleave back to original order
data = []
for i from 0 to n:
group_idx = i % 4
        pos_in_group = i // 4  # integer division
data.append(groups[group_idx][pos_in_group])
return data
¶
When the data length is not a multiple of 4, the remainder bytes are distributed to the first groups. For example, with 10 bytes the group sizes are 3, 3, 2, 2 (first two groups get the extra bytes).¶
Implementations MAY use any strategy to select compression schemes.
If compression increases size, implementations SHOULD use compression type 0 (None).¶
ByteGrouping4LZ4 (Type 2) is typically beneficial for structured numerical data such as float32 or float16 tensors, where bytes at the same position within 4-byte groups tend to be similar.¶
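As a non-normative illustration of one selection strategy, the sketch below (assuming the third-party lz4 Python package and the byte_group_4 transformation above) picks whichever scheme yields the smallest output and falls back to no compression when neither helps:¶
import lz4.frame

def compress_chunk(chunk: bytes):
    candidates = [
        (2, lz4.frame.compress(bytes(byte_group_4(chunk)))),  # ByteGrouping4LZ4
        (1, lz4.frame.compress(chunk)),                       # plain LZ4 frame
    ]
    ctype, best = min(candidates, key=lambda c: len(c[1]))
    if len(best) >= len(chunk):
        return 0, chunk   # compression did not help; store as-is (None)
    return ctype, best
¶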
A file reconstruction is an ordered list of terms that describes how to reassemble a file from chunks stored in xorbs.¶
Each term specifies:¶
Terms MUST be processed in order.¶
For each term, extract chunks at indices [start, end) from the specified xorb.¶
Decompress chunks according to their compression headers.¶
Concatenate decompressed chunk data in order.¶
For range queries, apply offset_into_first_range to skip initial bytes.¶
Validate that the total reconstructed size matches expectations.¶
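A non-normative sketch of this assembly loop, assuming a hypothetical helper get_decompressed_chunks(xorb_hash, start, end) that returns the decompressed chunks at indices [start, end) of the given xorb:¶
def assemble_file(terms, offset_into_first_range=0):
    out = bytearray()
    for term in terms:
        for chunk in get_decompressed_chunks(
                term["hash"], term["range"]["start"], term["range"]["end"]):
            out += chunk
    # For range queries, offset_into_first_range skips leading bytes of the
    # first term; for whole-file downloads it is zero.
    return bytes(out[offset_into_first_range:])
¶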
When downloading a byte range rather than the complete file:¶
XET supports chunk-level deduplication at multiple levels to minimize storage and transfer overhead.¶
Within a single upload session, implementations SHOULD track chunk hashes to avoid processing identical chunks multiple times.¶
Implementations MAY cache shard metadata locally to enable deduplication against recently uploaded content without network queries.¶
The global deduplication API enables discovering existing chunks across the entire storage system.¶
Not all chunks are eligible for global deduplication queries. A chunk is eligible if:¶
For eligible chunks, query the global deduplication API (see Section 11.4).¶
On a match, the API returns a shard containing CAS info for xorbs containing the chunk.¶
Chunk hashes in the response are protected with a keyed hash; match by computing keyed hashes of local chunk hashes.¶
Record matched xorb references for use in file reconstruction terms.¶
The keyed hash protection ensures that clients can only identify chunks they already possess:¶
Aggressive deduplication can fragment files across many xorbs, harming read performance. Implementations SHOULD:¶
The CAS (Content Addressable Storage) API provides HTTP endpoints for upload and download operations.¶
All API requests require authentication via Bearer token in the Authorization header:¶
Authorization: Bearer <access_token>¶
Tokens have associated scopes:¶
read: Required for reconstruction and global deduplication queries¶
write: Required for xorb and shard uploads (includes read permissions)¶
Token acquisition is provider-specific and outside the scope of this specification.¶
Request headers:¶
Authorization: Bearer token (required)¶
Content-Type: application/octet-stream for binary uploads¶
Range: Byte range for partial requests (optional)¶
Response headers:¶
Content-Type: application/json or application/octet-stream¶
Retrieves reconstruction information for downloading a file.¶
GET /v1/reconstructions/{file_id}
¶
Path Parameters:¶
file_id: File hash as hex string (see Section 6.5)¶
Optional Headers:¶
Range: bytes={start}-{end}: Request specific byte range (end inclusive)¶
Response (200 OK):¶
{
"offset_into_first_range": 0,
"terms": [
{
"hash": "<xorb_hash_hex>",
"unpacked_length": 263873,
"range": {
"start": 0,
"end": 4
}
}
],
"fetch_info": {
"<xorb_hash_hex>": [
{
"range": {
"start": 0,
"end": 4
},
"url": "https://...",
"url_range": {
"start": 0,
"end": 131071
}
}
]
}
}
¶
Response Fields:¶
offset_into_first_range: Bytes to skip in first term (for range queries)¶
terms: Ordered list of reconstruction terms¶
fetch_info: Map from xorb hash to fetch information¶
Fetch Info Fields:¶
range: Chunk index range this entry covers¶
url: Pre-signed URL for downloading xorb data¶
url_range: Byte range within the xorb for HTTP Range header (end inclusive).
The start offset is always aligned to a chunk header boundary, so clients can parse chunk headers sequentially from the start of the fetched data.¶
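A non-normative example of issuing this query with the Python requests library; base_url, token, and file_hash_hex are deployment-specific assumptions of the sketch:¶
import requests

def get_reconstruction(base_url, token, file_hash_hex, byte_range=None):
    headers = {"Authorization": f"Bearer {token}"}
    if byte_range is not None:
        start, end = byte_range  # end is inclusive, per HTTP Range semantics
        headers["Range"] = f"bytes={start}-{end}"
    resp = requests.get(f"{base_url}/v1/reconstructions/{file_hash_hex}",
                        headers=headers)
    resp.raise_for_status()
    return resp.json()  # offset_into_first_range, terms, fetch_info
¶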
Error Responses:¶
Checks if a chunk exists in the system for deduplication.¶
GET /v1/chunks/default-merkledb/{chunk_hash}
¶
Path Parameters:¶
chunk_hash: Chunk hash as hex string (see Section 6.5)¶
Response (200 OK): Shard format binary (see Section 9)¶
Response (404 Not Found): Chunk not tracked by global deduplication¶
Uploads a serialized xorb to storage.¶
POST /v1/xorbs/default/{xorb_hash}
¶
Path Parameters:¶
xorb_hash: Xorb hash as hex string (see Section 6.5)¶
Request Body: Serialized xorb (binary, see Section 7)¶
Response (200 OK):¶
{
"was_inserted": true
}
¶
The was_inserted field is false if the xorb already existed; this is not an error.¶
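A comparable non-normative sketch of the upload call, with the same assumptions about base_url and token:¶
import requests

def upload_xorb(base_url, token, xorb_hash_hex, xorb_bytes):
    resp = requests.post(
        f"{base_url}/v1/xorbs/default/{xorb_hash_hex}",
        data=xorb_bytes,
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/octet-stream"})
    resp.raise_for_status()
    return resp.json()["was_inserted"]  # False if the xorb already existed
¶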
Error Responses:¶
This section describes the complete procedure for uploading files.¶
Split each file into chunks using the algorithm in Section 5.¶
For each chunk:¶
Compute the chunk hash (see Section 6.1)¶
Record the chunk data, hash, and size¶
For each chunk, attempt deduplication in order:¶
Local Session: Check if chunk hash was seen earlier in this session¶
Cached Metadata: Check local shard cache for chunk hash¶
Global API: For eligible chunks, query the global deduplication API¶
Record deduplication results:¶
Group new (non-deduplicated) chunks into xorbs:¶
Collect chunks maintaining their order within files¶
Form xorbs targeting ~64 MiB total size¶
Compute compression for each chunk¶
Compute xorb hash for each xorb (see Section 6.2)¶
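A non-normative sketch of this grouping step; compress_chunk is assumed to return a serialized chunk entry (8-byte header plus compressed payload), and footer overhead is ignored for simplicity:¶
MAX_XORB_SIZE = 67108864   # 64 MiB, from Section 7
MAX_XORB_CHUNKS = 8192

def form_xorbs(new_chunks):
    xorbs, current, current_size = [], [], 0
    for chunk in new_chunks:  # keep chunks in file order
        entry = compress_chunk(chunk)
        if current and (current_size + len(entry) > MAX_XORB_SIZE
                        or len(current) >= MAX_XORB_CHUNKS):
            xorbs.append(current)
            current, current_size = [], 0
        current.append(entry)
        current_size += len(entry)
    if current:
        xorbs.append(current)
    return xorbs
¶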
For each new xorb:¶
All xorbs MUST be uploaded before proceeding to shard upload.¶
Build the shard structure:¶
For each file, construct file reconstruction terms¶
Compute verification hashes for each term (see Section 6.4)¶
Compute file hash (see Section 6.3)¶
Compute SHA-256 of raw file contents¶
Build CAS info blocks for new xorbs¶
The following ordering constraints apply:¶
All xorbs referenced by a shard MUST be uploaded before the shard¶
Chunk computation for a file must complete before xorb formation¶
Xorb hash computation must complete before shard formation¶
Within these constraints, operations MAY be parallelized:¶
This section describes the complete procedure for downloading files.¶
Request reconstruction information:¶
GET /v1/reconstructions/{file_id}
Authorization: Bearer <token>
¶
For range queries, include the Range header:¶
Range: bytes=0-1048575¶
Extract from the response:¶
For each term:¶
Look up fetch_info by xorb hash¶
Find fetch_info entry covering the term’s chunk range¶
Make HTTP GET request to the URL with Range header¶
Download the xorb byte range¶
Multiple terms may share fetch_info entries; implementations SHOULD avoid redundant downloads.¶
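A non-normative sketch of fetching the bytes backing one term via its fetch_info entry (requests library; field names follow the reconstruction response format above):¶
import requests

def fetch_term_bytes(term, fetch_info):
    # Find the fetch_info entry whose chunk range covers the term's range.
    for entry in fetch_info[term["hash"]]:
        r = entry["range"]
        if r["start"] <= term["range"]["start"] and term["range"]["end"] <= r["end"]:
            url_range = entry["url_range"]  # end is inclusive
            resp = requests.get(entry["url"], headers={
                "Range": f"bytes={url_range['start']}-{url_range['end']}"})
            resp.raise_for_status()
            return entry, resp.content  # data starts at a chunk header boundary
    raise KeyError("no fetch_info entry covers this term")
¶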
For each downloaded xorb range:¶
See Section 14 for comprehensive caching guidance. Key recommendations:¶
XET’s content-addressable design enables effective caching at multiple levels. This section provides guidance for implementers on caching strategies and considerations.¶
Objects in XET are identified by cryptographic hashes of their content. This content-addressable design provides a fundamental property: content at a given hash never changes. A xorb with hash H will always contain the same bytes, and a chunk with hash C will always decompress to the same data.¶
This immutability enables aggressive caching:¶
Cached xorb data never becomes stale¶
Cached chunk data can be reused indefinitely¶
Cache invalidation is never required for content objects¶
The only time-sensitive elements are authentication tokens and pre-signed URLs, which are discussed separately below.¶
Implementations SHOULD cache decompressed chunk data to avoid redundant decompression and network requests. The chunk hash provides a natural cache key.¶
Chunk caches SHOULD use the chunk hash (32 bytes or its string representation) as the cache key. Since hashes uniquely identify content, there is no risk of cache collisions or stale data.¶
Implementations MAY cache at different granularities:¶
Individual chunks: Fine-grained, maximizes deduplication benefit¶
Chunk ranges: Coarser-grained, reduces metadata overhead¶
Complete xorbs: Simplest, but may cache unused chunks¶
For most workloads, caching individual chunks by hash provides the best balance of storage efficiency and hit rate.¶
Since all cached content remains valid indefinitely, eviction is based purely on resource constraints:¶
LRU (Least Recently Used): Effective for workloads with temporal locality¶
LFU (Least Frequently Used): Effective for workloads with stable hot sets¶
Size-aware LRU: Prioritizes keeping smaller chunks that are cheaper to re-fetch¶
Implementations SHOULD track cache size and implement eviction when storage limits are reached.¶
Raw xorb data (compressed chunks with headers) MAY be cached by clients or intermediaries.¶
Caching raw xorb byte ranges avoids repeated downloads but requires decompression on each use, trading local storage for reduced bandwidth consumption. Implementations SHOULD prefer caching decompressed chunks unless bandwidth is severely constrained.¶
The reconstruction API returns pre-signed URLs for downloading xorb data. These URLs have short expiration times (typically minutes to hours) and MUST NOT be cached beyond their validity period.¶
Implementations MUST:¶
Use URLs promptly after receiving them¶
Re-query the reconstruction API if URLs have expired¶
Never persist URLs to disk for later sessions¶
Reconstruction responses SHOULD be treated as ephemeral and re-fetched when needed rather than cached.¶
CAS servers SHOULD return appropriate HTTP caching headers for xorb downloads:¶
For xorb content (immutable):¶
Cache-Control: public, immutable, max-age=<url_ttl_seconds>
ETag: "<xorb_hash>"¶
max-age MUST be set to a value no greater than the remaining validity window of the pre-signed URL used to serve the object (e.g., a URL that expires in 900 seconds MUST NOT be served with max-age larger than 900).¶
Servers SHOULD also emit an Expires header aligned to the URL expiry time.¶
Shared caches MUST NOT serve the response after either header indicates expiry, even if the content is immutable.¶
The immutable directive still applies within that bounded window, allowing caches to skip revalidation until the signature expires.¶
For reconstruction API responses (ephemeral):¶
Cache-Control: private, no-store¶
Reconstruction responses contain pre-signed URLs that expire and MUST NOT be cached by intermediaries.¶
For global deduplication responses:¶
Cache-Control: private, max-age=3600
Vary: Authorization¶
Deduplication responses are user-specific and may be cached briefly by the client.¶
Clients SHOULD respect Cache-Control headers from servers.
When downloading xorb data, clients MAY cache responses locally even if no caching headers are present, since content-addressed data is inherently immutable.¶
XET deployments typically serve xorb data through CDNs. The content-addressable design is well-suited for CDN caching:¶
Hash-based URLs enable cache key stability¶
Immutable content eliminates cache invalidation complexity¶
Range requests enable partial caching of large xorbs¶
Effective cache key design determines whether multiple users can share cached xorb data. Since xorb content is immutable and identified by hash, the ideal cache key includes only the xorb hash and byte range—maximizing cache reuse. However, access control requirements constrain this choice.¶
Two URL authorization strategies are applicable to XET deployments:¶
Edge-Authenticated URLs. The URL path contains the xorb hash with no signature parameters. Authorization is enforced at the CDN edge via signed cookies or tokens validated on every request. The cache key is derived from the xorb hash and byte range only, excluding any authorization tokens. This allows all authorized users to share the same cache entries. This pattern requires CDNs capable of per-request authorization; generic shared caches without edge auth MUST NOT be used.¶
Query-Signed URLs. The URL includes signature parameters in the query string (similar to pre-signed cloud storage URLs). Cache keys MUST include all signature-bearing query parameters. Each unique signature produces a separate cache entry, resulting in lower hit rates. This approach works with any CDN but sacrifices cache efficiency for simplicity.¶
For both strategies:¶
Cache keys SHOULD include the byte range when Range headers are present¶
Cache keys SHOULD NOT include Authorization headers, since different users have different tokens but request identical content¶
For deployments with access-controlled content (e.g., gated models requiring user agreement), see Section 15.4 for additional CDN considerations.¶
CDNs SHOULD cache partial responses (206 Partial Content) by byte range.
When a subsequent request covers a cached range, the CDN can serve from cache without contacting the origin.¶
Some CDNs support range coalescing, where multiple partial caches are combined to serve larger requests. This is particularly effective for XET where different users may request different chunk ranges from the same xorb.¶
Corporate proxies and other intermediaries MAY cache XET traffic.¶
Pre-signed URLs include authentication in the URL itself, allowing unauthenticated intermediaries to cache responses.
However, reconstruction API requests include Bearer tokens in the Authorization header and SHOULD NOT be cached by intermediaries (the private directive prevents this).¶
XET provides content integrity through cryptographic hashing:¶
Chunk hashes verify individual chunk integrity¶
Xorb hashes verify complete xorb contents¶
File hashes verify complete file reconstruction¶
Implementations SHOULD verify hashes when possible, particularly for downloaded content.¶
The keyed hash protection in global deduplication prevents enumeration attacks:¶
XET deployments may support access-controlled or “gated” content, where users must be authorized (e.g., by accepting terms of service or requesting access) before downloading certain files. This has several implications for XET implementations.¶
Access control in XET is typically enforced at the repository or file level, not at the xorb or chunk level.
The reconstruction API MUST verify that the requesting user has access to the file before returning pre-signed URLs.
Unauthorized requests MUST return 401 Unauthorized or 403 Forbidden.¶
Since the same xorb may be referenced by both public and access-controlled files, CDN caching requires careful design:¶
Edge-Authenticated Deployments. When using edge authentication (cookies or tokens validated per-request), the CDN enforces access control on every request. Xorbs referenced only by access-controlled files remain protected even when cached. This is the recommended approach for deployments with gated content.¶
Query-Signed URL Deployments. When using query-signed URLs, each authorized user receives unique signatures. Cache efficiency is reduced, but access control is enforced by signature validity. Deployments MAY choose to exclude xorbs from access-controlled repositories from CDN caching entirely.¶
The same chunk may exist in both access-controlled and public repositories. XET’s content-addressable design allows storage deduplication across access boundaries:¶
When a user uploads to a public repository, chunks matching access-controlled content may be deduplicated¶
The user does not gain access to the access-controlled repository; they simply avoid re-uploading data they already possess¶
The keyed hash protection in global deduplication (Section 10.3) ensures users can only match chunks they possess¶
This is a storage optimization, not an access control bypass. Implementations MUST still enforce repository-level access control for all download operations.¶
Deployments with access-controlled content SHOULD consider:¶
This document does not require any IANA actions.¶
The XET-GEARHASH-BLAKE3 content-defined chunking algorithm requires a lookup table of 256 64-bit constants.
Implementations of this suite MUST use the exact values below for determinism.¶
TABLE = [
0xb088d3a9e840f559, 0x5652c7f739ed20d6, 0x45b28969898972ab, 0x6b0a89d5b68ec777,
0x368f573e8b7a31b7, 0x1dc636dce936d94b, 0x207a4c4e5554d5b6, 0xa474b34628239acb,
0x3b06a83e1ca3b912, 0x90e78d6c2f02baf7, 0xe1c92df7150d9a8a, 0x8e95053a1086d3ad,
0x5a2ef4f1b83a0722, 0xa50fac949f807fae, 0x0e7303eb80d8d681, 0x99b07edc1570ad0f,
0x689d2fb555fd3076, 0x00005082119ea468, 0xc4b08306a88fcc28, 0x3eb0678af6374afd,
0xf19f87ab86ad7436, 0xf2129fbfbe6bc736, 0x481149575c98a4ed, 0x0000010695477bc5,
0x1fba37801a9ceacc, 0x3bf06fd663a49b6d, 0x99687e9782e3874b, 0x79a10673aa50d8e3,
0xe4accf9e6211f420, 0x2520e71f87579071, 0x2bd5d3fd781a8a9b, 0x00de4dcddd11c873,
0xeaa9311c5a87392f, 0xdb748eb617bc40ff, 0xaf579a8df620bf6f, 0x86a6e5da1b09c2b1,
0xcc2fc30ac322a12e, 0x355e2afec1f74267, 0x2d99c8f4c021a47b, 0xbade4b4a9404cfc3,
0xf7b518721d707d69, 0x3286b6587bf32c20, 0x0000b68886af270c, 0xa115d6e4db8a9079,
0x484f7e9c97b2e199, 0xccca7bb75713e301, 0xbf2584a62bb0f160, 0xade7e813625dbcc8,
0x000070940d87955a, 0x8ae69108139e626f, 0xbd776ad72fde38a2, 0xfb6b001fc2fcc0cf,
0xc7a474b8e67bc427, 0xbaf6f11610eb5d58, 0x09cb1f5b6de770d1, 0xb0b219e6977d4c47,
0x00ccbc386ea7ad4a, 0xcc849d0adf973f01, 0x73a3ef7d016af770, 0xc807d2d386bdbdfe,
0x7f2ac9966c791730, 0xd037a86bc6c504da, 0xf3f17c661eaa609d, 0xaca626b04daae687,
0x755a99374f4a5b07, 0x90837ee65b2caede, 0x6ee8ad93fd560785, 0x0000d9e11053edd8,
0x9e063bb2d21cdbd7, 0x07ab77f12a01d2b2, 0xec550255e6641b44, 0x78fb94a8449c14c6,
0xc7510e1bc6c0f5f5, 0x0000320b36e4cae3, 0x827c33262c8b1a2d, 0x14675f0b48ea4144,
0x267bd3a6498deceb, 0xf1916ff982f5035e, 0x86221b7ff434fb88, 0x9dbecee7386f49d8,
0xea58f8cac80f8f4a, 0x008d198692fc64d8, 0x6d38704fbabf9a36, 0xe032cb07d1e7be4c,
0x228d21f6ad450890, 0x635cb1bfc02589a5, 0x4620a1739ca2ce71, 0xa7e7dfe3aae5fb58,
0x0c10ca932b3c0deb, 0x2727fee884afed7b, 0xa2df1c6df9e2ab1f, 0x4dcdd1ac0774f523,
0x000070ffad33e24e, 0xa2ace87bc5977816, 0x9892275ab4286049, 0xc2861181ddf18959,
0xbb9972a042483e19, 0xef70cd3766513078, 0x00000513abfc9864, 0xc058b61858c94083,
0x09e850859725e0de, 0x9197fb3bf83e7d94, 0x7e1e626d12b64bce, 0x520c54507f7b57d1,
0xbee1797174e22416, 0x6fd9ac3222e95587, 0x0023957c9adfbf3e, 0xa01c7d7e234bbe15,
0xaba2c758b8a38cbb, 0x0d1fa0ceec3e2b30, 0x0bb6a58b7e60b991, 0x4333dd5b9fa26635,
0xc2fd3b7d4001c1a3, 0xfb41802454731127, 0x65a56185a50d18cb, 0xf67a02bd8784b54f,
0x696f11dd67e65063, 0x00002022fca814ab, 0x8cd6be912db9d852, 0x695189b6e9ae8a57,
0xee9453b50ada0c28, 0xd8fc5ea91a78845e, 0xab86bf191a4aa767, 0x0000c6b5c86415e5,
0x267310178e08a22e, 0xed2d101b078bca25, 0x3b41ed84b226a8fb, 0x13e622120f28dc06,
0xa315f5ebfb706d26, 0x8816c34e3301bace, 0xe9395b9cbb71fdae, 0x002ce9202e721648,
0x4283db1d2bb3c91c, 0xd77d461ad2b1a6a5, 0xe2ec17e46eeb866b, 0xb8e0be4039fbc47c,
0xdea160c4d5299d04, 0x7eec86c8d28c3634, 0x2119ad129f98a399, 0xa6ccf46b61a283ef,
0x2c52cedef658c617, 0x2db4871169acdd83, 0x0000f0d6f39ecbe9, 0x3dd5d8c98d2f9489,
0x8a1872a22b01f584, 0xf282a4c40e7b3cf2, 0x8020ec2ccb1ba196, 0x6693b6e09e59e313,
0x0000ce19cc7c83eb, 0x20cb5735f6479c3b, 0x762ebf3759d75a5b, 0x207bfe823d693975,
0xd77dc112339cd9d5, 0x9ba7834284627d03, 0x217dc513e95f51e9, 0xb27b1a29fc5e7816,
0x00d5cd9831bb662d, 0x71e39b806d75734c, 0x7e572af006fb1a23, 0xa2734f2f6ae91f85,
0xbf82c6b5022cddf2, 0x5c3beac60761a0de, 0xcdc893bb47416998, 0x6d1085615c187e01,
0x77f8ae30ac277c5d, 0x917c6b81122a2c91, 0x5b75b699add16967, 0x0000cf6ae79a069b,
0xf3c40afa60de1104, 0x2063127aa59167c3, 0x621de62269d1894d, 0xd188ac1de62b4726,
0x107036e2154b673c, 0x0000b85f28553a1d, 0xf2ef4e4c18236f3d, 0xd9d6de6611b9f602,
0xa1fc7955fb47911c, 0xeb85fd032f298dbd, 0xbe27502fb3befae1, 0xe3034251c4cd661e,
0x441364d354071836, 0x0082b36c75f2983e, 0xb145910316fa66f0, 0x021c069c9847caf7,
0x2910dfc75a4b5221, 0x735b353e1c57a8b5, 0xce44312ce98ed96c, 0xbc942e4506bdfa65,
0xf05086a71257941b, 0xfec3b215d351cead, 0x00ae1055e0144202, 0xf54b40846f42e454,
0x00007fd9c8bcbcc8, 0xbfbd9ef317de9bfe, 0xa804302ff2854e12, 0x39ce4957a5e5d8d4,
0xffb9e2a45637ba84, 0x55b9ad1d9ea0818b, 0x00008acbf319178a, 0x48e2bfc8d0fbfb38,
0x8be39841e848b5e8, 0x0e2712160696a08b, 0xd51096e84b44242a, 0x1101ba176792e13a,
0xc22e770f4531689d, 0x1689eff272bbc56c, 0x00a92a197f5650ec, 0xbc765990bda1784e,
0xc61441e392fcb8ae, 0x07e13a2ced31e4a0, 0x92cbe984234e9d4d, 0x8f4ff572bb7d8ac5,
0x0b9670c00b963bd0, 0x62955a581a03eb01, 0x645f83e5ea000254, 0x41fce516cd88f299,
0xbbda9748da7a98cf, 0x0000aab2fe4845fa, 0x19761b069bf56555, 0x8b8f5e8343b6ad56,
0x3e5d1cfd144821d9, 0xec5c1e2ca2b0cd8f, 0xfaf7e0fea7fbb57f, 0x000000d3ba12961b,
0xda3f90178401b18e, 0x70ff906de33a5feb, 0x0527d5a7c06970e7, 0x22d8e773607c13e9,
0xc9ab70df643c3bac, 0xeda4c6dc8abe12e3, 0xecef1f410033e78a, 0x0024c2b274ac72cb,
0x06740d954fa900b4, 0x1d7a299b323d6304, 0xb3c37cb298cbead5, 0xc986e3c76178739b,
0x9fabea364b46f58a, 0x6da214c5af85cc56, 0x17a43ed8b7a38f84, 0x6eccec511d9adbeb,
0xf9cab30913335afb, 0x4a5e60c5f415eed2, 0x00006967503672b4, 0x9da51d121454bb87,
0x84321e13b9bbc816, 0xfb3d6fb6ab2fdd8d, 0x60305eed8e160a8d, 0xcbbf4b14e9946ce8,
0x00004f63381b10c3, 0x07d5b7816fcc4e10, 0xe5a536726a6a8155, 0x57afb23447a07fdd,
0x18f346f7abc9d394, 0x636dc655d61ad33d, 0xcc8bab4939f7f3f6, 0x63c7a906c1dd187b
]
¶
The following test vectors are for the XET-GEARHASH-BLAKE3 algorithm suite.¶
Input (ASCII): Hello World!
Input (hex): 48656c6c6f20576f726c6421
Hash (raw hex, bytes 0-31):
  a29cfb08e608d4d8726dd8659a90b9134b3240d5d8e42d5fcb28e2a6e763a3e8
Hash (XET string representation):
  d8d408e608fb9ca213b9909a65d86d725f2de4d8d540324be8a363e7a6e228cb¶
The XET hash string format interprets the 32-byte hash as four little-endian 64-bit unsigned values and prints each as 16 hexadecimal digits.¶
Hash bytes [0..31]:
  00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
  10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f
Expected XET string:
  07060504030201000f0e0d0c0b0a090817161514131211101f1e1d1c1b1a1918¶
The conversion formula:¶
function hash_to_string(h):
# h is a 32-byte array
result = ""
for i from 0 to 4:
start = i * 8
end = (i + 1) * 8
u64_bytes = slice(h, start, end)
u64_val = little_endian_to_u64(u64_bytes)
result += format("{:016x}", u64_val)
return result
¶
Child 1:
  hash (XET string): c28f58387a60d4aa200c311cda7c7f77f686614864f5869eadebf765d0a14a69
  size: 100
Child 2:
  hash (XET string): 6e4e3263e073ce2c0e78cc770c361e2778db3b054b98ab65e277fc084fa70f22
  size: 200
Buffer being hashed (ASCII, with literal \n newlines):
  c28f58387a60d4aa200c311cda7c7f77f686614864f5869eadebf765d0a14a69 : 100\n
  6e4e3263e073ce2c0e78cc770c361e2778db3b054b98ab65e277fc084fa70f22 : 200\n
Result (XET string):
  be64c7003ccd3cf4357364750e04c9592b3c36705dee76a71590c011766b6c14¶
Input: Two chunk hashes from test vector B.3, concatenated as raw bytes (not XET string format).¶
Chunk hash 1 (raw hex):
  aad4607a38588fc2777f7cda1c310c209e86f564486186f6694aa1d065f7ebad
Chunk hash 2 (raw hex):
  2cce73e063324e6e271e360c77cc780e65ab984b053bdb78220fa74f08fc77e2
Concatenated input (64 bytes, raw hex):
  aad4607a38588fc2777f7cda1c310c209e86f564486186f6694aa1d065f7ebad
  2cce73e063324e6e271e360c77cc780e65ab984b053bdb78220fa74f08fc77e2
Verification hash (XET string):
  eb06a8ad81d588ac05d1d9a079232d9c1e7d0b07232fa58091caa7bf333a2768¶
Complete reference files including sample chunks, xorbs, and shards are available at: https://huggingface.co/datasets/xet-team/xet-spec-reference-files¶
The XET protocol was invented by Hailey Johnson and Yucheng Low at Hugging Face. This specification is based on the reference implementation and documentation developed by the Hugging Face team.¶