| Internet-Draft | NTCF | June 2026 |
| Sunnetci | Expires 15 December 2026 |
This document specifies NTCF (Network and Telemetry Compression Format), a self-describing, columnar, append-friendly binary container for cybersecurity and network telemetry such as flow records, honeypot events, and web access logs. Unlike general-purpose byte compressors, NTCF models the semantics of telemetry -- IP addresses, autonomous system numbers, ports, country codes, event types, and timestamps -- as typed columns and applies semantic encodings (dictionary, delta, delta-of-delta, run-length, frame-of-reference bit packing, and variable-length integers) before a conventional entropy compression stage.¶
NTCF embeds per-column zone-map statistics and Bloom filters so that point lookups and analytical predicates can be evaluated by reading only the columns and segments that can possibly match, without decompressing the entire file. This document defines the on-disk octet layout (format version 1), the encoding catalogue, the reading and crash-recovery algorithms, a resource-limit model, security considerations, and an IANA media-type registration.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 15 December 2026.¶
Copyright (c) 2026 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.¶
Security and network telemetry is high in volume, highly repetitive, and almost always interrogated along a small number of dimensions: source and destination IP address, autonomous system number (ASN), country, port, event type, and time. Operators retain large archives and incur two costs: storage of the data at rest, and the time spent decompressing whole archives to answer a single question during an incident.¶
General-purpose compressors (gzip, zstd, lz4, xz) reduce the storage cost but produce an opaque blob: answering "which records involved 203.0.113.5?" requires full decompression, and the format offers no analytics. General-purpose columnar analytics formats such as Apache Parquet [PARQUET] make data queryable but are not specialised for telemetry semantics (for example, IP, ASN, and CIDR types and IP-range pruning) nor for crash-safe streaming append from an edge sensor.¶
NTCF aims to be, simultaneously:¶
Cryptographic authentication and encryption of files, distributed query, joins across files, and a network storage service are out of scope for format version 1. See Section 15.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
Data type conventions used throughout this document:¶
Terminology:¶
NTCF compresses in two cooperating layers.¶
The semantic layer operates on typed columns and removes structural redundancy that a byte compressor cannot perceive: near-monotonic timestamps (delta-of-delta), low-cardinality enumerations (dictionary), repeated values (run-length), small-range integers (frame-of-reference bit packing), and general integers (variable-length plus ZigZag). An encoder SHOULD trial candidate encodings per column chunk and keep the smallest, and MUST always include a baseline (Plain or Raw) so the chosen encoding is never larger than the baseline.¶
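The trial-and-keep-smallest rule can be illustrated with a short, non-normative Go sketch. The names candidate and pickSmallest are hypothetical and are not taken from the reference implementation; the sketch assumes each candidate encoding has already been run over the chunk.

```go
package main

// candidate is a hypothetical pairing of an encodingID with the octets that
// encoding produced for one column chunk.
type candidate struct {
	encodingID uint8
	encoded    []byte
}

// pickSmallest returns the candidate with the fewest encoded octets.
// The baseline (Plain or Raw) always takes part, so the result is never
// larger than the baseline. On a tie the earlier candidate is kept.
func pickSmallest(baseline candidate, trials ...candidate) candidate {
	best := baseline
	for _, c := range trials {
		if len(c.encoded) < len(best.encoded) {
			best = c
		}
	}
	return best
}
```

Because the baseline seeds the comparison, an encoder that trials no other encoding still produces a valid chunk.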
The entropy layer is a conventional byte compressor -- zstd, lz4, or none -- applied per column chunk to the semantically encoded octets.¶
The two layers are complementary: the semantic layer maps a repeated IP column to small dictionary ordinals; the entropy layer removes the residual byte-level redundancy.¶
A file is a fixed header, a sequence of segments (each an opaque concatenation of column chunks and optional index blobs), and a footer. The footer is read first for fast open and for predicate pruning. Crash recovery relies on checkpoint footers and a backward scan (Section 11), not on per-segment framing.
+----------------------------------------------------------+
| Header (fixed 36 octets, Section 5)                      |
+----------------------------------------------------------+
| Segment 0 | Segment 1 | ... | Segment N-1 (Section 6)    |
| each segment = chunk | [index] | chunk | [index] | ...   |
+----------------------------------------------------------+
| [zero or more intermediate checkpoint footers]           |
+----------------------------------------------------------+
| Footer body (Section 10)                                 |
| footerLen u32 | CRC32C u32 | trailer magic "NTCF"        |
+----------------------------------------------------------+
Column and segment octet locations are recorded as absolute file offsets in the footer. Intermediate checkpoint footers, if present, are dead octets that a conforming reader skips: the authoritative footer is the final one, whose offsets account for any preceding checkpoint footers.¶
The header is exactly 36 octets:¶
Field | Type | Description
----------+-----------+------------------------------------------------
magic | bytes[4] | 0x4E 0x54 0x43 0x46 ("NTCF")
version | u16 | format version; this document specifies 1
flags | u16 | reserved; senders set 0, readers ignore unknown
created | u64 | file creation time, Unix nanoseconds
writerID | bytes[16] | opaque producer identifier; MAY be zero
crc32c | u32 | CRC-32C (Castagnoli) over the preceding 32 octets
A reader MUST validate magic, MUST reject a version it does not support (Section 17), and MUST verify crc32c before relying on any other octet.¶
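The following non-normative Go sketch shows one way to apply these checks to the 36-octet header. It assumes that multi-octet integers are little-endian, which this section does not itself state. The names Header, Encode, and ParseHeader are hypothetical.

```go
package main

import (
	"encoding/binary"
	"errors"
	"hash/crc32"
)

const headerLen = 36

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// Header mirrors the fields of the fixed header table (magic and crc32c
// are implied). Hypothetical type, for illustration only.
type Header struct {
	Version  uint16
	Flags    uint16
	Created  uint64 // Unix nanoseconds
	WriterID [16]byte
}

// Encode serialises the header. ASSUMPTION: little-endian integers.
func (h Header) Encode() []byte {
	b := make([]byte, headerLen)
	copy(b[0:4], "NTCF")
	binary.LittleEndian.PutUint16(b[4:6], h.Version)
	binary.LittleEndian.PutUint16(b[6:8], h.Flags)
	binary.LittleEndian.PutUint64(b[8:16], h.Created)
	copy(b[16:32], h.WriterID[:])
	// crc32c covers the 32 octets that precede it.
	binary.LittleEndian.PutUint32(b[32:36], crc32.Checksum(b[:32], castagnoli))
	return b
}

// ParseHeader validates magic and crc32c before trusting any other field,
// then rejects versions other than 1.
func ParseHeader(b []byte) (Header, error) {
	var h Header
	if len(b) < headerLen {
		return h, errors.New("ntcf: short header")
	}
	if string(b[0:4]) != "NTCF" {
		return h, errors.New("ntcf: bad magic")
	}
	if crc32.Checksum(b[:32], castagnoli) != binary.LittleEndian.Uint32(b[32:36]) {
		return h, errors.New("ntcf: header checksum mismatch")
	}
	h.Version = binary.LittleEndian.Uint16(b[4:6])
	if h.Version != 1 {
		return h, errors.New("ntcf: unsupported version")
	}
	h.Flags = binary.LittleEndian.Uint16(b[6:8]) // unknown bits are ignored
	h.Created = binary.LittleEndian.Uint64(b[8:16])
	copy(h.WriterID[:], b[16:32])
	return h, nil
}
```

The checksum is verified before the version is interpreted, so a corrupted version field is reported as corruption rather than as an unsupported file.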
A segment is the concatenation of one column chunk per schema column, in schema column order, with each indexed column's chunk optionally followed by its index blob (Section 9). A segment has no magic and no self-contained header; its extent and the location of every chunk and index within it are given by the footer's segment directory (Section 10). All chunks in a segment encode the same number of rows.
A column chunk is self-validating: it carries its own checksum and the lengths needed to bound decompression.¶
Field            | Type        | Description
-----------------+-------------+----------------------------------------
kind             | u8          | 0 = integer domain, 1 = byte domain
encodingID       | u8          | semantic encoding (Section 7)
compressionID    | u8          | entropy codec (Section 8)
flags            | u8          | bit0 = presence bitmap follows; rest 0
rows             | uvarint     | number of rows
uncompressedLen  | uvarint     | octet length of encoded data (pre-entropy)
storedLen        | uvarint     | octet length of stored (post-entropy) data
bitmapLen        | uvarint     | only if flags bit0; equals ceil(rows/8)
bitmap           | bytes[*]    | only if flags bit0 (Section 6.3)
checksum         | u64         | XXH64 over 'stored'
stored           | bytes[*]    | entropy-compressed semantic octets
To decode a chunk, a reader MUST: (1) verify checksum over stored; (2) enforce the decompression limits of Section 14 against uncompressedLen and against the expansion ratio uncompressedLen / storedLen; (3) entropy-decode stored to exactly uncompressedLen octets; (4) semantic-decode exactly rows values; and (5) if a presence bitmap is present, apply it to mark null rows.
When a column contains null (absent) values, the chunk stores a presence bitmap of ceil(rows/8) octets. The bit for row i is bit (i modulo 8) of octet floor(i/8), counting from the least significant bit of each octet; it is set when row i is present (non-null). The encoded value stream contains one value per row; the value at a null row is a placeholder (zero for integers, empty for byte values) and MUST be ignored by readers when the presence bit is clear.
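The bit addressing described above can be sketched in Go as follows. This is a non-normative illustration; the function names isPresent, bitmapLen, and nullRows are hypothetical.

```go
package main

import "errors"

// isPresent reports whether row i is non-null. The bit for row i lives in
// octet i/8 at bit position i%8, least significant bit first.
func isPresent(bitmap []byte, i uint64) bool {
	return bitmap[i/8]&(1<<(i%8)) != 0
}

// bitmapLen returns ceil(rows/8) without risking overflow on rows+7.
func bitmapLen(rows uint64) uint64 {
	n := rows / 8
	if rows%8 != 0 {
		n++
	}
	return n
}

// nullRows checks the bitmap length against rows and returns the positions
// whose presence bit is clear. Values decoded at those positions are
// placeholders and must be ignored.
func nullRows(bitmap []byte, rows uint64) ([]uint64, error) {
	if uint64(len(bitmap)) != bitmapLen(rows) {
		return nil, errors.New("ntcf: bitmapLen does not equal ceil(rows/8)")
	}
	var nulls []uint64
	for i := uint64(0); i < rows; i++ {
		if !isPresent(bitmap, i) {
			nulls = append(nulls, i)
		}
	}
	return nulls, nil
}
```

For three rows with bitmap octet 0x05 (binary 00000101), rows 0 and 2 are present and row 1 is null.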
Storing a placeholder per null row, rather than only present values, is a deliberate simplification of format version 1; placeholders compress well under run-length and dictionary encodings. A future version MAY define a present-values-only encoding.¶
Columns are mapped to one of two physical domains. The integer domain carries values as u64; the byte domain carries variable-length octet strings. The mapping from logical type to domain is given in Section 13. All integer encodings are exact over the full u64 range because their arithmetic is performed modulo 2^64 identically on encode and decode.¶
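The claim that modular arithmetic makes the integer encodings exact can be illustrated with the Delta encoding (encodingID 2 in the table below). The following Go sketch is non-normative. It assumes that NTCF's uvarint and varint coincide with the base-128 and ZigZag forms implemented by Go's encoding/binary package; the function names are hypothetical, and the Append helpers require Go 1.19 or later.

```go
package main

import (
	"encoding/binary"
	"errors"
)

// encodeDelta writes a uvarint first value, then a varint per difference.
func encodeDelta(vals []uint64) []byte {
	var out []byte
	if len(vals) == 0 {
		return out
	}
	out = binary.AppendUvarint(out, vals[0])
	for i := 1; i < len(vals); i++ {
		// Unsigned subtraction wraps modulo 2^64. Reinterpreting the
		// result as int64 keeps small steps in either direction small.
		d := int64(vals[i] - vals[i-1])
		out = binary.AppendVarint(out, d) // ZigZag plus base-128
	}
	return out
}

// decodeDelta reverses encodeDelta and produces exactly rows values.
func decodeDelta(b []byte, rows int) ([]uint64, error) {
	// Every value occupies at least one octet, so a well-formed chunk
	// never claims more rows than it has octets. This bounds the
	// allocation below by the input size.
	if rows < 0 || rows > len(b) {
		return nil, errors.New("ntcf: rows inconsistent with chunk length")
	}
	out := make([]uint64, 0, rows)
	if rows == 0 {
		return out, nil
	}
	first, n := binary.Uvarint(b)
	if n <= 0 {
		return nil, errors.New("ntcf: malformed first value")
	}
	b = b[n:]
	prev := first
	out = append(out, prev)
	for len(out) < rows {
		d, n := binary.Varint(b)
		if n <= 0 {
			return nil, errors.New("ntcf: truncated or malformed delta")
		}
		b = b[n:]
		prev += uint64(d) // wraps modulo 2^64, mirroring the encoder
		out = append(out, prev)
	}
	return out, nil
}
```

A step from 2^64 - 1 to 0 is encoded as a difference of +1 and decodes back to 0, because both sides wrap identically.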
encodingID values are stable and assigned as follows:¶
ID | Name | Dom | Description
----+--------------+-----+-------------------------------------------------
0 | Plain | int | u64 little-endian per value (baseline)
1 | Varint | int | uvarint per value
2 | Delta | int | uvarint first value, then varint of each diff
3 | DeltaOfDelta | int | uvarint first value; varint first delta; then
| | | varint of each change in delta
4 | RLE | int | repeated (uvarint value, uvarint run-length)
5 | Bitpack | int | uvarint min; u8 width; (value-min) bit-packed
| | | at width bits (Section 7.1)
6 | DictInt | int | dictionary, integer keys (Section 7.2)
64 | Raw | byte| repeated (uvarint length, bytes) per value
65 | DictBytes | byte| dictionary, octet-string keys (Section 7.2)
66 | RLEBytes | byte| repeated (uvarint len, bytes, uvarint run-length)
A reader MUST reject a chunk whose encodingID is unknown for its kind. The number of values produced MUST equal rows.¶
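As a non-normative illustration of the rule that the number of values produced equals rows, the following Go sketch decodes the RLE encoding (encodingID 4). It assumes NTCF's uvarint matches the base-128 form in Go's encoding/binary package, and it borrows the rows-per-segment ceiling from Section 14. The name decodeRLE is hypothetical.

```go
package main

import (
	"encoding/binary"
	"errors"
)

// maxRows mirrors the recommended rows-per-segment ceiling (Section 14).
const maxRows = 16777216

// decodeRLE expands repeated (uvarint value, uvarint run-length) pairs.
func decodeRLE(b []byte, rows uint64) ([]uint64, error) {
	// A few octets of RLE can legitimately expand to many rows, so the
	// allocation is gated by a fixed ceiling rather than by input size.
	if rows > maxRows {
		return nil, errors.New("ntcf: rows exceeds ceiling")
	}
	out := make([]uint64, 0, rows)
	for len(b) > 0 {
		v, n := binary.Uvarint(b)
		if n <= 0 {
			return nil, errors.New("ntcf: malformed run value")
		}
		b = b[n:]
		run, n := binary.Uvarint(b)
		if n <= 0 {
			return nil, errors.New("ntcf: malformed run length")
		}
		b = b[n:]
		// Check the run against the remaining row budget before
		// expanding it, so a forged run length cannot drive the loop.
		if run > rows-uint64(len(out)) {
			return nil, errors.New("ntcf: run exceeds declared rows")
		}
		for ; run > 0; run-- {
			out = append(out, v)
		}
	}
	if uint64(len(out)) != rows {
		return nil, errors.New("ntcf: decoded value count does not equal rows")
	}
	return out, nil
}
```

The octets 07 03 09 02 decode to 7, 7, 7, 9, 9 when rows is 5 and are rejected for any other value of rows.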
Bit packing serialises a sequence of unsigned integers using exactly width bits each, least significant bit first, with no per-value octet alignment. width is in the range 0 to 64. A width of 0 encodes a sequence of zeros and occupies no octets. The total size is ceil(count * width / 8) octets. The Bitpack frame-of-reference encoding subtracts a per-chunk minimum before packing; the dictionary encodings pack ordinal indices at a width equal to the number of bits needed to represent dictLen - 1.
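A non-normative Go sketch of the packing rule follows. The names packBits and unpackBits are hypothetical. The sketch packs one bit at a time for clarity rather than speed, and it borrows the rows-per-segment ceiling from Section 14 to bound the output allocation.

```go
package main

import "errors"

// maxRows mirrors the recommended rows-per-segment ceiling (Section 14).
const maxRows = 16777216

// packBits writes each value using exactly width bits, least significant
// bit first, with no per-value octet alignment. Bits of a value above
// width are dropped, so callers must choose width to fit the largest value.
func packBits(vals []uint64, width uint) []byte {
	out := make([]byte, (len(vals)*int(width)+7)/8)
	var bit uint
	for _, v := range vals {
		for j := uint(0); j < width; j++ {
			if v>>j&1 == 1 {
				out[bit/8] |= 1 << (bit % 8)
			}
			bit++
		}
	}
	return out
}

// unpackBits reverses packBits after checking the declared geometry.
func unpackBits(b []byte, count int, width uint) ([]uint64, error) {
	if width > 64 {
		return nil, errors.New("ntcf: width out of range")
	}
	// A width of 0 occupies no octets, so count must be bounded on its own.
	if count < 0 || count > maxRows {
		return nil, errors.New("ntcf: count exceeds ceiling")
	}
	// Both factors are bounded above, so count*width cannot overflow here.
	if len(b) != (count*int(width)+7)/8 {
		return nil, errors.New("ntcf: packed size does not match count and width")
	}
	out := make([]uint64, count)
	var bit uint
	for i := range out {
		var v uint64
		for j := uint(0); j < width; j++ {
			if b[bit/8]>>(bit%8)&1 == 1 {
				v |= 1 << j
			}
			bit++
		}
		out[i] = v
	}
	return out, nil
}
```

As a worked example, the ports 443, 80, and 8080 have minimum 80, giving offsets 363, 0, and 8000. The largest offset needs 13 bits, so the packed payload is ceil(3 * 13 / 8) = 5 octets, compared with 24 octets under Plain.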
A dictionary chunk has the following layout:¶
Each ordinal MUST be strictly less than dictLen.¶
The entropy layer is applied to the semantically encoded octets of a chunk. A writer SHOULD select none when entropy compression would not reduce size. compressionID values:¶
 ID | Name | Description
----+------+----------------------------------------------------------------
  0 | none | stored octets are the semantic octets verbatim
  1 | zstd | a single zstd frame (RFC 8478) over the semantic octets
  2 | lz4  | a one-octet selector (0=raw, 1=LZ4 block) then the payload
For compressionID 2, selector 0 means the remaining octets are the uncompressed semantic octets (used when the data is incompressible); selector 1 means the remaining octets form an LZ4 block (not an LZ4 frame) that decompresses to exactly uncompressedLen octets. For all codecs, a reader MUST verify that decompression yields exactly uncompressedLen octets and MUST treat any deviation as corruption.¶
The zstd frame format used by compressionID 1 is specified in [RFC8478]; the LZ4 block format used by compressionID 2 is described in [LZ4].¶
For each column marked indexed, a writer MAY emit an index blob immediately after that column's chunk within the segment; its location is recorded in the footer (indexOffset, indexLength). An indexLength of 0 means no index.¶
Field     | Type | Description
----------+------+-------------------------------------------------------
flags     | u8   | bit0 = Bloom filter present; bit1 = inverted present
bloom     | ...  | present if bit0 (Section 9.1)
inverted  | ...  | present if bit1 (Section 9.2)
Field      | Type            | Description
-----------+-----------------+--------------------------------------------
k          | u8              | number of hash probes
wordCount  | uvarint         | number of 64-bit words
words      | u64 x wordCount | bit array
The bit count m equals wordCount * 64. Bit b is located in word floor(b/64) at bit position (b modulo 64). Membership of a value is tested by double hashing its XXH64 digest h (integers are hashed in their 8-octet little-endian form). Let h1 = h and h2 = (h >> 33) | (h << 31), with h2 replaced by the constant 0x9E3779B97F4A7C15 if it would otherwise be 0. For 0 <= i < k, probe i addresses bit (h1 + i * h2) modulo m. A writer SHOULD size the filter to the column's distinct cardinality at a target false-positive rate (the reference implementation uses one percent). A clear bit at any probe is definitive non-membership; a result with all probed bits set indicates only probable membership.
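The probe sequence can be sketched in Go as follows. This is a non-normative illustration with hypothetical names (Bloom, probe, Add, MayContain). XXH64 is not part of the Go standard library, so the sketch takes the digest h as an input computed elsewhere. It also assumes that the sum h1 + i * h2 wraps modulo 2^64 before the reduction modulo m.

```go
package main

// h2Fallback replaces h2 when the derived value would be 0.
const h2Fallback uint64 = 0x9E3779B97F4A7C15

// Bloom mirrors the k and words fields of the Bloom filter blob.
type Bloom struct {
	K     uint8
	Words []uint64
}

// probe returns the bit index addressed by probe i for digest h in a
// filter of m bits. ASSUMPTION: u64 arithmetic wraps before the modulo.
func probe(h, i, m uint64) uint64 {
	h2 := h>>33 | h<<31
	if h2 == 0 {
		h2 = h2Fallback
	}
	return (h + i*h2) % m
}

// Add sets the k bits addressed by digest h.
func (f *Bloom) Add(h uint64) {
	m := uint64(len(f.Words)) * 64
	if m == 0 {
		return
	}
	for i := uint64(0); i < uint64(f.K); i++ {
		b := probe(h, i, m)
		f.Words[b/64] |= 1 << (b % 64)
	}
}

// MayContain returns false only when some probed bit is clear, which is
// definitive non-membership. A true result is probabilistic.
func (f *Bloom) MayContain(h uint64) bool {
	m := uint64(len(f.Words)) * 64
	if m == 0 {
		// A filter with no bits cannot exclude anything. A stricter
		// reader might instead reject wordCount == 0 as malformed.
		return true
	}
	for i := uint64(0); i < uint64(f.K); i++ {
		b := probe(h, i, m)
		if f.Words[b/64]&(1<<(b%64)) == 0 {
			return false
		}
	}
	return true
}
```

A segment whose filter returns false for the queried digest can be skipped without reading or decoding the column chunk.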
Field    | Type    | Description
---------+---------+----------------------------------------------------
kind     | u8      | 0 = integer keys, 1 = byte keys
count    | uvarint | number of distinct keys
entries  | ...     | 'count' entries, in ascending sorted key order
Each entry is a key followed by a posting list. The key is a uvarint for integer keys, or a uvarint length plus that many octets for byte keys. The posting list is a uvarint bitmapLen followed by bitmapLen octets containing a Roaring Bitmap [ROARING], per the Roaring Bitmap serialization specification, of the zero-based row positions within the segment that hold that key. Inverted indexes are OPTIONAL; when absent, equality is resolved by zone-map and Bloom pruning followed by a scan of the decoded column.¶
A writer that appends records over time (streaming ingestion) SHOULD write a checkpoint footer at intervals. A checkpoint footer is a complete footer (Section 10) written at the current end of file; it is appended, never overwritten. Because earlier footers are never modified, the most recently completed footer is always intact even if the process terminates while writing a subsequent segment.¶
On read, if the trailing footer is missing or fails validation, a reader MAY recover by scanning backward from the end of file for an occurrence of the trailer magic and attempting to parse a footer ending there; the first (latest) candidate whose footerLen and CRC validate is the recovered footer. Records written after the last checkpoint but before termination are not recoverable; this is the correct durability boundary. Dead intermediate footers MAY be reclaimed by an out-of-band compaction step.¶
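The append-only checkpoint and the backward scan can be sketched in Go as follows. This is a non-normative illustration that works on an in-memory copy of the file; a production reader would use bounded reads instead. It makes three assumptions not confirmed by this section: u32 fields are little-endian, footerLen is the octet length of the footer body, and the CRC-32C covers exactly the footer body. The names appendCheckpoint and recoverFooter are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"hash/crc32"
)

const (
	headerLen     = 36
	maxFooterBody = 256 << 20 // footer body ceiling from Section 14
)

var (
	trailerMagic = []byte("NTCF")
	castagnoli   = crc32.MakeTable(crc32.Castagnoli)
)

// appendCheckpoint appends body | footerLen u32 | CRC32C u32 | "NTCF" at
// the current end of file. Earlier octets are never modified.
func appendCheckpoint(file, body []byte) []byte {
	file = append(file, body...)
	file = binary.LittleEndian.AppendUint32(file, uint32(len(body)))
	file = binary.LittleEndian.AppendUint32(file, crc32.Checksum(body, castagnoli))
	return append(file, trailerMagic...)
}

// recoverFooter scans backward for the trailer magic and returns the body
// of the latest footer whose footerLen and CRC validate.
func recoverFooter(file []byte) ([]byte, error) {
	end := len(file)
	for {
		i := bytes.LastIndex(file[:end], trailerMagic)
		// A candidate needs room for the header and the two u32 fields.
		// This also stops the scan at the header's own magic.
		if i < headerLen+8 {
			return nil, errors.New("ntcf: no valid footer found")
		}
		footerLen := uint64(binary.LittleEndian.Uint32(file[i-8:]))
		crc := binary.LittleEndian.Uint32(file[i-4:])
		bodyEnd := uint64(i - 8)
		if footerLen <= maxFooterBody && footerLen <= bodyEnd-headerLen {
			body := file[bodyEnd-footerLen : bodyEnd]
			if crc32.Checksum(body, castagnoli) == crc {
				return body, nil
			}
		}
		end = i + 3 // the next candidate must start strictly before i
	}
}
```

If the final footer is cut off mid-write, the scan skips it and returns the previous checkpoint; rows written after that checkpoint are lost, as described above.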
A count of all rows with no predicate is answered from totalRows with no body read.¶
ID | Type | Dom | Normalisation
----+-----------+-----+------------------------------------------------------
0 | timestamp | int | Unix nanoseconds since the epoch
1 | ip | byte| canonical 16-octet form; IPv4 stored as IPv4-mapped
| | | IPv6 so one column holds both families and the
| | | lexicographic order is total
2 | uint | int | unsigned 64-bit
3 | port | int | transport port 0..65535
4 | enum | byte| low-cardinality octet string (country, protocol,
| | | HTTP method, event type, ...)
5 | string | byte| arbitrary octet string
6 | bool | int | 0 or 1
The 16-octet IP normalisation gives correct zone-map ordering within each address family. Implementations storing both families in one column SHOULD note that minimum and maximum bounds spanning families are looser; this does not affect correctness, only pruning effectiveness.¶
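The normalisation and the resulting zone-map comparison can be sketched in Go using the standard net/netip package, whose As16 method returns IPv4 addresses in IPv4-mapped IPv6 form. The sketch is non-normative, and the names normalizeIP and zoneMayContain are hypothetical.

```go
package main

import (
	"bytes"
	"net/netip"
)

// normalizeIP parses a textual address and returns the canonical 16-octet
// form. IPv4 addresses become IPv4-mapped IPv6 (::ffff:a.b.c.d).
func normalizeIP(s string) ([16]byte, error) {
	a, err := netip.ParseAddr(s)
	if err != nil {
		return [16]byte{}, err
	}
	return a.As16(), nil
}

// zoneMayContain reports whether key lies within the zone-map bounds
// [lo, hi] under lexicographic octet order. A false result proves that the
// segment holds no such value; a true result requires a closer look.
func zoneMayContain(lo, hi, key [16]byte) bool {
	return bytes.Compare(lo[:], key[:]) <= 0 && bytes.Compare(key[:], hi[:]) <= 0
}
```

For a segment whose srcip bounds are 203.0.113.0 and 203.0.113.255, a lookup for 203.0.113.5 passes the zone map, while lookups for 198.51.100.7 or 2001:db8::1 are pruned.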
Because every length and offset in a file is attacker-controlled, a reader MUST gate every allocation derived from a file-supplied count by a finite ceiling before allocating, and MUST bound decompression. The reference implementation enforces, and this document RECOMMENDS, at least the following ceilings:¶
Quantity                                   | Ceiling
-------------------------------------------+--------------
columns per schema                         | 4096
rows per segment                           | 16777216
segments per file                          | 1048576
dictionary entries per chunk               | 16777216
stored (post-entropy) octets per chunk     | 1 GiB
uncompressed octets per chunk              | 4 GiB
decompression expansion ratio              | 256:1
footer body                                | 256 MiB
a single byte-domain value                 | 16 MiB
A reader MUST reject any "count times width" computation that would overflow, and MUST reject any offset or length that falls outside the file.¶
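The following non-normative Go sketch shows one way to apply the chunk ceilings, the overflow rule, and the bounds rule before any allocation is made. The function names are hypothetical, and the constants are the recommended ceilings from the table above.

```go
package main

import (
	"errors"
	"math/bits"
)

const (
	maxStoredLen       = 1 << 30 // 1 GiB
	maxUncompressedLen = 4 << 30 // 4 GiB
	maxExpansionRatio  = 256
)

// checkChunkLengths applies the per-chunk ceilings and the expansion-ratio
// cap to the lengths declared in a chunk header.
func checkChunkLengths(uncompressedLen, storedLen uint64) error {
	if storedLen > maxStoredLen {
		return errors.New("ntcf: storedLen exceeds ceiling")
	}
	if uncompressedLen > maxUncompressedLen {
		return errors.New("ntcf: uncompressedLen exceeds ceiling")
	}
	// Compare by multiplication to avoid dividing by a zero storedLen.
	// storedLen is at most 2^30 here, so the product cannot overflow.
	if uncompressedLen > maxExpansionRatio*storedLen {
		return errors.New("ntcf: expansion ratio exceeds 256:1")
	}
	return nil
}

// packedLen returns ceil(count*width/8) and rejects any overflow of the
// 64-bit product or of the rounding step.
func packedLen(count, width uint64) (uint64, error) {
	hi, lo := bits.Mul64(count, width)
	if hi != 0 || lo > ^uint64(0)-7 {
		return 0, errors.New("ntcf: count times width overflows")
	}
	return (lo + 7) / 8, nil
}

// inFile reports whether [offset, offset+length) lies within the file,
// written so that offset+length is never computed and cannot wrap.
func inFile(offset, length, fileSize uint64) bool {
	return offset <= fileSize && length <= fileSize-offset
}
```

Checks of this kind are intended to run on the declared values alone, before the reader allocates a buffer or invokes the entropy decoder.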
NTCF files frequently originate from untrusted parties: partner feeds, tenant sensors, and attacker probes. Conforming readers MUST treat all input as hostile.¶
No panics or unbounded work: for any input octets, a decoder MUST return a value or an error; it MUST NOT crash, allocate without bound, read outside the file, or loop indefinitely. The reference implementation enforces this property with fuzz testing across the header and footer parser, every encoding decoder, the entropy layer, the index parser, and the query parser.¶
Decompression bombs: a reader MUST enforce both an absolute uncompressed ceiling and an expansion-ratio cap (Section 14) before and during entropy decoding, and MUST verify that the decompressed length equals the declared length.¶
Integrity, not authenticity: the CRC-32C checksums on the header and footer and the XXH64 checksums on chunks detect accidental corruption only; they are NOT message authentication codes. An adversary with write access can forge a valid-looking file. Format version 1 provides no confidentiality and no authenticity. Consumers requiring those properties MUST layer an authenticated or encrypted transport or storage mechanism beneath NTCF. An authenticated container is a candidate for a future version.¶
Resource exhaustion: the limits of Section 14 bound memory and CPU per file; operators ingesting many files SHOULD additionally bound concurrency.¶
This document requests registration of the following media type, per [RFC6838], and a file extension.¶
If a registry of NTCF encodingID, compressionID, or logical type values is desired, this document suggests an "NTCF Encodings" registry seeded with the assignments in Sections 7, 8, and 13, under a "Specification Required" policy.¶
The header version field gates on-disk compatibility. A reader MUST refuse a file whose version it does not implement. Within a supported version, a reader MUST reject unknown encodingID, compressionID, and logical type values rather than guess. Additive changes that do not alter the octet layout of existing structures (for example, a new encoding identifier) MAY be made within a version only if existing readers can still reject the new identifier safely; otherwise the version MUST be incremented.¶
A minimal file containing one segment of three rows over the columns timestamp (timestamp), srcip (ip, indexed), and country (enum, indexed, nullable) is laid out as follows.¶
[Header: "NTCF", version=1, flags=0, created, writerID, crc32c] (36 octets)
[Segment 0]
  [chunk: timestamp] kind=0 enc=DeltaOfDelta comp=zstd ... checksum, stored
  [chunk: srcip]     kind=1 enc=DictBytes    comp=zstd ... checksum, stored
  [index: srcip]     flags=bit0 (bloom): k, wordCount, words
  [chunk: country]   kind=1 enc=DictBytes    comp=none flags=bit0 (nulls)
                     bitmap, checksum, stored
  [index: country]   flags=bit0: bloom
[Footer body]
  schema{id, "demo", v1, 3 columns ...}
  sourceType="demo", totalRows=3, minTS, maxTS, segCount=1
  segment0{offset=36, length, rows=3, minTS, maxTS, colCount=3,
    col0{chunkOffset,chunkLength,0,0, flags=0,nonNull=3, minInt,maxInt}
    col1{chunkOffset,chunkLength,indexOffset,indexLength, min/max bytes}
    col2{chunkOffset,chunkLength,indexOffset,indexLength,
         flags=1,nonNull=2, min/max bytes}}
[footerLen u32][crc32c u32]["NTCF"]
A complete, Apache-2.0-licensed reference implementation in Go accompanies this specification. It includes round-trip and fuzz tests for every encoding, the chunk and footer framing, the index blobs, and the query parser, together with a benchmark harness.¶
Measured compression ratios on synthetic but realistically skewed telemetry (flow, honeypot, and web access) exceed those of gzip, zstd, lz4, and xz on the same inputs while preserving in-place search; these results are reproducible from the implementation and are illustrative rather than a conformance requirement. Production deployments SHOULD validate ratios on their own representative data.¶