Independent Submission                                       A. Sünnetci
Internet-Draft                                              NTCF Project
Intended status: Informational                              13 June 2026
Expires: 15 December 2026


           The NTCF Network and Telemetry Compression Format
                     draft-sunnetci-ntcf-format-00

Abstract

   This document specifies NTCF (Network and Telemetry Compression
   Format), a self-describing, columnar, append-friendly binary
   container for cybersecurity and network telemetry such as flow
   records, honeypot events, and web access logs.  Unlike general-
   purpose byte compressors, NTCF models the semantics of telemetry --
   IP addresses, autonomous system numbers, ports, country codes, event
   types, and timestamps -- as typed columns and applies semantic
   encodings (dictionary, delta, delta-of-delta, run-length, frame-of-
   reference bit packing, and variable-length integers) before a
   conventional entropy compression stage.

   NTCF embeds per-column zone-map statistics and Bloom filters so that
   point lookups and analytical predicates can be evaluated by reading
   only the columns and segments that can possibly match, without
   decompressing the entire file.  This document defines the on-disk
   octet layout (format version 1), the encoding catalogue, the reading
   and crash-recovery algorithms, a resource-limit model, security
   considerations, and an IANA media-type registration.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 15 December 2026.


Sunnetci                Expires 15 December 2026                [Page 1]

Internet-Draft                    NTCF                         June 2026


Copyright Notice

   Copyright (c) 2026 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
     1.1.  Problem . . . . . . . . . . . . . . . . . . . . . . . . .   3
     1.2.  Goals . . . . . . . . . . . . . . . . . . . . . . . . . .   3
     1.3.  Non-Goals (This Version)  . . . . . . . . . . . . . . . .   3
   2.  Conventions and Terminology . . . . . . . . . . . . . . . . .   4
   3.  Design Overview . . . . . . . . . . . . . . . . . . . . . . .   5
   4.  File Structure  . . . . . . . . . . . . . . . . . . . . . . .   5
   5.  Header  . . . . . . . . . . . . . . . . . . . . . . . . . . .   6
   6.  Segments and Column Chunks  . . . . . . . . . . . . . . . . .   6
     6.1.  Segments  . . . . . . . . . . . . . . . . . . . . . . . .   6
     6.2.  Column Chunk  . . . . . . . . . . . . . . . . . . . . . .   6
     6.3.  Presence Bitmap (Nullability) . . . . . . . . . . . . . .   7
   7.  Semantic Encodings  . . . . . . . . . . . . . . . . . . . . .   7
     7.1.  Bit Packing . . . . . . . . . . . . . . . . . . . . . . .   8
     7.2.  Dictionary Encoding . . . . . . . . . . . . . . . . . . .   8
   8.  Entropy Compression . . . . . . . . . . . . . . . . . . . . .   8
   9.  Indexes . . . . . . . . . . . . . . . . . . . . . . . . . . .   9
     9.1.  Bloom Filter  . . . . . . . . . . . . . . . . . . . . . .   9
     9.2.  Inverted Index  . . . . . . . . . . . . . . . . . . . . .   9
   10. Footer  . . . . . . . . . . . . . . . . . . . . . . . . . . .  10
     10.1.  Schema Descriptor  . . . . . . . . . . . . . . . . . . .  10
     10.2.  Footer Body  . . . . . . . . . . . . . . . . . . . . . .  10
     10.3.  Footer Trailer . . . . . . . . . . . . . . . . . . . . .  11
   11. Durability and Crash Recovery . . . . . . . . . . . . . . . .  11
   12. Reading Algorithm (Informative) . . . . . . . . . . . . . . .  12
   13. Logical Type System . . . . . . . . . . . . . . . . . . . . .  12
   14. Resource Limits . . . . . . . . . . . . . . . . . . . . . . .  13
   15. Security Considerations . . . . . . . . . . . . . . . . . . .  13
   16. IANA Considerations . . . . . . . . . . . . . . . . . . . . .  14
   17. Versioning and Interoperability . . . . . . . . . . . . . . .  15
   18. Normative References  . . . . . . . . . . . . . . . . . . . .  15
   19. Informative References  . . . . . . . . . . . . . . . . . . .  15
   Appendix A.  Worked Example (Informative) . . . . . . . . . . . .  16
   Appendix B.  Reference Implementation and Results
           (Informative) . . . . . . . . . . . . . . . . . . . . . .  16
   Author's Address  . . . . . . . . . . . . . . . . . . . . . . . .  16

1.  Introduction

1.1.  Problem

   Security and network telemetry is high in volume, highly repetitive,
   and almost always interrogated along a small number of dimensions:
   source and destination IP address, autonomous system number (ASN),
   country, port, event type, and time.  Operators retain large archives
   and incur two costs: storage of the data at rest, and the time spent
   decompressing whole archives to answer a single question during an
   incident.

   General-purpose compressors (gzip, zstd, lz4, xz) reduce the storage
   cost but produce an opaque blob: answering "which records involved
   203.0.113.5?" requires full decompression, and the format offers no
   analytics.  General-purpose columnar analytics formats such as Apache
   Parquet [PARQUET] make data queryable but are not specialised for
   telemetry semantics (for example, IP, ASN, and CIDR types and IP-
   range pruning) nor for crash-safe streaming append from an edge
   sensor.

1.2.  Goals

   NTCF aims to be, simultaneously:

   1.  Compact -- competitive with or better than the best general-
       purpose compressors on representative telemetry, by encoding
       meaning rather than bytes.

   2.  Searchable in place -- equality search and a useful subset of
       analytical queries answerable without decompressing the whole
       file, using embedded zone maps, Bloom filters, and optional
       inverted indexes.

   3.  Streaming- and crash-safe -- appendable from a long-running
       sensor such that a process crash leaves a readable file
       containing all committed records.

   4.  Self-describing and versioned -- a file carries its own schema
       and a format version that gates compatibility.

1.3.  Non-Goals (This Version)

   Cryptographic authentication and encryption of files, distributed
   query, joins across files, and a network storage service are out of
   scope for format version 1.  See Section 15.


2.  Conventions and Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

   Data type conventions used throughout this document:

   *  All multi-octet integers are little-endian unless stated
      otherwise.

   *  u8, u16, u32, and u64 denote unsigned integers of that width in
      bits.

   *  uvarint denotes an unsigned LEB128 variable-length integer: seven
      bits of payload per octet, least significant group first, with the
      most significant bit set on every octet except the last.

   *  varint denotes a signed integer encoded as ZigZag(value) followed
      by uvarint, where ZigZag maps a two's-complement 64-bit value v to
      (v left-shifted by 1) XOR (v arithmetic-right-shifted by 63).

   *  The notation ceil(x) denotes the least integer not less than x.
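
   As an informal illustration of the uvarint and varint conventions
   above, the following Python sketch encodes and decodes both forms.
   The function names are illustrative and are not defined by this
   document; rejecting encodings longer than ten octets is an
   assumption of the sketch rather than a stated requirement.

```python
MASK64 = (1 << 64) - 1


def encode_uvarint(n: int) -> bytes:
    """Unsigned LEB128: seven payload bits per octet, low group first."""
    if not 0 <= n <= MASK64:
        raise ValueError("value out of u64 range")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # continuation bit set
        n >>= 7
    out.append(n)                       # last octet: continuation bit clear
    return bytes(out)


def decode_uvarint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Return (value, next_pos); reject truncated or over-long input."""
    value = shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated uvarint")
        octet = buf[pos]
        pos += 1
        value |= (octet & 0x7F) << shift
        if not octet & 0x80:
            break
        shift += 7
        if shift > 63:                  # assumption: at most ten octets
            raise ValueError("uvarint longer than ten octets")
    if value > MASK64:
        raise ValueError("uvarint exceeds u64")
    return value, pos


def zigzag(v: int) -> int:
    """Map a signed 64-bit value to u64: (v << 1) XOR (v >> 63)."""
    return ((v << 1) ^ (v >> 63)) & MASK64


def unzigzag(u: int) -> int:
    """Inverse of zigzag; returns a signed Python integer."""
    return (u >> 1) ^ -(u & 1)


def encode_varint(v: int) -> bytes:
    return encode_uvarint(zigzag(v))
```

   For example, encode_uvarint(300) yields the two octets 0xAC 0x02,
   and zigzag maps -1 to 1 and 1 to 2, so small magnitudes of either
   sign stay short.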

   Terminology:

   Column  a named, typed sequence of one value per row.

   Column chunk  the encoded, optionally compressed, self-validating
      serialization of one column within one segment.

   Segment  a row group; a contiguous run of rows whose columns are
      stored as adjacent column chunks.  A segment is located by
      absolute offset from the footer; it is not independently framed
      (see Section 6).

   Footer  the trailing metadata block: schema descriptor, segment and
      column directory, zone-map statistics, and file-level totals.

   Zone map  the per-column, per-segment minimum and maximum of present
      values, used to prune segments.

   Checkpoint footer  a footer written mid-stream by an appending writer
      to bound data loss on crash (see Section 11).


3.  Design Overview

   NTCF compresses in two cooperating layers.

   The semantic layer operates on typed columns and removes structural
   redundancy that a byte compressor cannot perceive: near-monotonic
   timestamps (delta-of-delta), low-cardinality enumerations
   (dictionary), repeated values (run-length), small-range integers
   (frame-of-reference bit packing), and general integers (variable-
   length plus ZigZag).  An encoder SHOULD trial candidate encodings per
   column chunk and keep the smallest, and MUST always include a
   baseline (Plain or Raw) so the chosen encoding is never larger than
   the baseline.

   The entropy layer is a conventional byte compressor -- zstd, lz4, or
   none -- applied per column chunk to the semantically encoded octets.

   The two layers are complementary: the semantic layer maps a repeated
   IP column to small dictionary ordinals; the entropy layer removes the
   residual byte-level redundancy.

   A file is a fixed header, a sequence of segments (each an opaque
   concatenation of column chunks and optional index blobs), and a
   footer.  The footer is read first for fast open and for predicate
   pruning.  Crash recovery relies on checkpoint footers and a backward
   scan (Section 11), not on per-segment framing.

4.  File Structure

   +----------------------------------------------------------+
   | Header (fixed 36 octets, Section 5)                      |
   +----------------------------------------------------------+
   | Segment 0 | Segment 1 | ... | Segment N-1  (Section 6)   |
   |   each segment = chunk | [index] | chunk | [index] | ... |
   +----------------------------------------------------------+
   | [zero or more intermediate checkpoint footers]           |
   +----------------------------------------------------------+
   | Footer body (Section 10)                                 |
   | footerLen u32 | CRC32C u32 | trailer magic "NTCF"        |
   +----------------------------------------------------------+

   Column and segment octet locations are recorded as absolute file
   offsets in the footer.  Intermediate checkpoint footers, if present,
   are dead octets that a conforming reader skips: the authoritative
   footer is the final one, whose offsets account for any preceding
   checkpoint footers.


5.  Header

   The header is exactly 36 octets:

 Field    | Type      | Description
----------+-----------+------------------------------------------------
 magic    | bytes[4]  | 0x4E 0x54 0x43 0x46  ("NTCF")
 version  | u16       | format version; this document specifies 1
 flags    | u16       | reserved; writers set 0, readers ignore unknown
 created  | u64       | file creation time, Unix nanoseconds
 writerID | bytes[16] | opaque producer identifier; MAY be zero
 crc32c   | u32       | CRC-32C (Castagnoli) over the preceding 32 octets

   A reader MUST validate magic, MUST reject a version it does not
   support (Section 17), and MUST verify crc32c before relying on any
   other octet.
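
   The header checks above might be sketched in Python as follows.
   This is a minimal illustration, not the reference implementation:
   the function names are hypothetical, the CRC-32C routine is a plain
   bitwise version written here because the Python standard library
   does not provide one, and the order of checks (magic, CRC, then
   version) is one reading of the requirement.

```python
import struct


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for octet in data:
        crc ^= octet
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


# magic, version, flags, created, writerID: 32 octets, little-endian
HEADER_BODY = struct.Struct("<4sHHQ16s")


def build_header(created_ns: int, writer_id: bytes = bytes(16)) -> bytes:
    body = HEADER_BODY.pack(b"NTCF", 1, 0, created_ns, writer_id)
    return body + struct.pack("<I", crc32c(body))


def parse_header(buf: bytes) -> dict:
    if len(buf) < 36:
        raise ValueError("header shorter than 36 octets")
    body = buf[:32]
    magic, version, flags, created, writer_id = HEADER_BODY.unpack(body)
    if magic != b"NTCF":
        raise ValueError("bad magic")
    (stored_crc,) = struct.unpack_from("<I", buf, 32)
    if crc32c(body) != stored_crc:
        raise ValueError("header CRC mismatch")
    if version != 1:
        raise ValueError("unsupported format version")
    # Unknown flag bits are ignored, as the flags field description allows.
    return {"version": version, "created": created, "writerID": writer_id}
```

   A bitwise CRC is slow but adequate for a 32-octet header; a
   production reader would more likely use a table-driven or
   hardware-assisted implementation.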

6.  Segments and Column Chunks

6.1.  Segments

   A segment is the concatenation of one column chunk per schema column,
   in schema column order, with each indexed column's chunk optionally
   followed by its index blob (Section 9).  A segment has no magic and
   no self-contained header; its extent and the location of every chunk
   and index within it are given by the footer's segment directory
   (Section 10).  All chunks in a segment encode the same number of
   rows.

6.2.  Column Chunk

   A column chunk is self-validating: it carries its own checksum and
   the lengths needed to bound decompression.

 Field           | Type        | Description
-----------------+-------------+----------------------------------------
 kind            | u8          | 0 = integer domain, 1 = byte domain
 encodingID      | u8          | semantic encoding (Section 7)
 compressionID   | u8          | entropy codec (Section 8)
 flags           | u8          | bit0 = presence bitmap follows; rest 0
 rows            | uvarint     | number of rows
 uncompressedLen | uvarint     | octet length of encoded data (pre-entropy)
 storedLen       | uvarint     | octet length of stored (post-entropy) data
 bitmapLen       | uvarint     | only if flags bit0; equals ceil(rows/8)
 bitmap          | bytes[*]    | only if flags bit0 (Section 6.3)
 checksum        | u64         | XXH64 over 'stored'
 stored          | bytes[*]    | entropy-compressed semantic octets


   To decode a chunk a reader MUST: (1) verify checksum over stored; (2)
   enforce the decompression limits of Section 14 against
   uncompressedLen and the ratio uncompressedLen divided by storedLen;
   (3) entropy-decode stored to exactly uncompressedLen octets; (4)
   semantic-decode rows values; and (5) if a presence bitmap is present,
   apply it to mark null rows.
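
   The field layout of Section 6.2 and step (1) above can be sketched
   as below.  This is an illustrative parser only: parse_chunk and
   read_uvarint are hypothetical names, xxh64 is a caller-supplied
   callable (XXH64 is not in the Python standard library), and the
   limits of Section 14, entropy decoding, and semantic decoding are
   left out.

```python
import struct


def read_uvarint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode an unsigned LEB128 integer; return (value, next_pos)."""
    value = shift = 0
    while True:
        if pos >= len(buf) or shift > 63:
            raise ValueError("truncated or over-long uvarint")
        octet = buf[pos]
        pos += 1
        value |= (octet & 0x7F) << shift
        if not octet & 0x80:
            return value, pos
        shift += 7


def parse_chunk(buf: bytes, pos: int = 0, xxh64=None) -> dict:
    """Split one column chunk into its fields, validating lengths."""
    if len(buf) - pos < 4:
        raise ValueError("truncated chunk")
    kind, encoding_id, compression_id, flags = buf[pos:pos + 4]
    pos += 4
    if kind > 1:
        raise ValueError("unknown kind")
    rows, pos = read_uvarint(buf, pos)
    uncompressed_len, pos = read_uvarint(buf, pos)
    stored_len, pos = read_uvarint(buf, pos)
    bitmap = None
    if flags & 1:                         # bit0: presence bitmap follows
        bitmap_len, pos = read_uvarint(buf, pos)
        if bitmap_len != (rows + 7) // 8 or pos + bitmap_len > len(buf):
            raise ValueError("bad presence bitmap length")
        bitmap = buf[pos:pos + bitmap_len]
        pos += bitmap_len
    if pos + 8 + stored_len > len(buf):
        raise ValueError("stored data runs past the buffer")
    (checksum,) = struct.unpack_from("<Q", buf, pos)
    pos += 8
    stored = buf[pos:pos + stored_len]
    # Step (1): verify the checksum before touching 'stored'.
    if xxh64 is not None and xxh64(stored) != checksum:
        raise ValueError("chunk checksum mismatch")
    return {"kind": kind, "encodingID": encoding_id,
            "compressionID": compression_id, "rows": rows,
            "uncompressedLen": uncompressed_len, "bitmap": bitmap,
            "stored": stored, "end": pos + stored_len}
```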

6.3.  Presence Bitmap (Nullability)

   When a column contains null (absent) values, the chunk stores a
   presence bitmap of ceil(rows/8) octets.  Row i maps to octet (i
   divided by 8, rounded down), bit (i modulo 8), with bit 0 the least
   significant bit of the octet; the bit is set when row i is present
   (non-null).  The encoded value stream
   contains one value per row; the value at a null row is a placeholder
   (zero for integers, empty for byte values) and MUST be ignored by
   readers when the presence bit is clear.
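
   A short Python sketch of the bit addressing described above.  The
   helper names are illustrative, and representing a null as Python's
   None is a choice of the sketch, not of the format.

```python
def build_presence_bitmap(present: list[bool]) -> bytes:
    """One bit per row, least significant bit first within each octet."""
    bitmap = bytearray((len(present) + 7) // 8)   # ceil(rows / 8) octets
    for i, is_set in enumerate(present):
        if is_set:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def is_present(bitmap: bytes, i: int) -> bool:
    """True when row i is present (non-null)."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


def apply_bitmap(values: list, bitmap: bytes) -> list:
    """Replace the placeholder at every null row with None."""
    return [v if is_present(bitmap, i) else None
            for i, v in enumerate(values)]
```

   With nine rows of which rows 0, 2, 3, and 8 are present, the bitmap
   is the two octets 0x0D 0x01.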

   Storing a placeholder per null row, rather than only present values,
   is a deliberate simplification of format version 1; placeholders
   compress well under run-length and dictionary encodings.  A future
   version MAY define a present-values-only encoding.

7.  Semantic Encodings

   Columns are mapped to one of two physical domains.  The integer
   domain carries values as u64; the byte domain carries variable-length
   octet strings.  The mapping from logical type to domain is given in
   Section 13.  All integer encodings are exact over the full u64 range
   because their arithmetic is performed modulo 2^64 identically on
   encode and decode.

   encodingID values are stable and assigned as follows:

 ID | Name         | Dom | Description
----+--------------+-----+-------------------------------------------------
  0 | Plain        | int | u64 little-endian per value (baseline)
  1 | Varint       | int | uvarint per value
  2 | Delta        | int | uvarint first value, then varint of each diff
  3 | DeltaOfDelta | int | uvarint first value; varint first delta; then
    |              |     | varint of each change in delta
  4 | RLE          | int | repeated (uvarint value, uvarint run-length)
  5 | Bitpack      | int | uvarint min; u8 width; (value-min) bit-packed
    |              |     | at width bits (Section 7.1)
  6 | DictInt      | int | dictionary, integer keys (Section 7.2)
 64 | Raw          | byte| repeated (uvarint length, bytes) per value
 65 | DictBytes    | byte| dictionary, octet-string keys (Section 7.2)
 66 | RLEBytes     | byte| repeated (uvarint len, bytes, uvarint run-length)


   A reader MUST reject a chunk whose encodingID is unknown for its
   kind.  The number of values produced MUST equal rows.
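
   As an illustration of one catalogue entry, the following Python
   sketch implements DeltaOfDelta (encodingID 3).  It assumes that each
   difference is computed modulo 2^64 and then read as a signed 64-bit
   value before ZigZag encoding, which is one reading of the table
   above; the function names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def put_uvarint(out: bytearray, n: int) -> None:
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)


def get_uvarint(buf: bytes, pos: int) -> tuple[int, int]:
    value = shift = 0
    while True:
        if pos >= len(buf) or shift > 63:
            raise ValueError("truncated or over-long uvarint")
        octet = buf[pos]
        pos += 1
        value |= (octet & 0x7F) << shift
        if not octet & 0x80:
            return value, pos
        shift += 7


def put_varint(out: bytearray, u: int) -> None:
    """Read the u64 difference u as signed, ZigZag it, emit a uvarint."""
    signed = u - (1 << 64) if u >> 63 else u
    put_uvarint(out, ((signed << 1) ^ (signed >> 63)) & MASK64)


def encode_delta_of_delta(values: list[int]) -> bytes:
    out = bytearray()
    if not values:
        return bytes(out)
    put_uvarint(out, values[0])               # first value
    prev, prev_delta = values[0], 0
    for v in values[1:]:
        delta = (v - prev) & MASK64
        # The first iteration emits the first delta itself (change from 0).
        put_varint(out, (delta - prev_delta) & MASK64)
        prev, prev_delta = v, delta
    return bytes(out)


def decode_delta_of_delta(buf: bytes, rows: int) -> list[int]:
    if rows == 0:
        return []
    value, pos = get_uvarint(buf, 0)
    out, delta = [value], 0
    for _ in range(rows - 1):
        z, pos = get_uvarint(buf, pos)
        change = (z >> 1) ^ -(z & 1)          # undo ZigZag
        delta = (delta + change) & MASK64
        value = (value + delta) & MASK64
        out.append(value)
    return out
```

   For regularly spaced timestamps almost every change in delta is 0,
   so each row after the second costs a single octet before the entropy
   stage.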

7.1.  Bit Packing

   Bit packing serialises a sequence of unsigned integers using exactly
   width bits each, least significant bit first, with no per-value octet
   alignment.  The width is in the range 0 to 64.  A width of 0 encodes
   a sequence of zeros and occupies no octets.  The total size is
   ceil((count times width) divided by 8) octets.  The Bitpack frame-of-
   reference encoding subtracts a per-chunk minimum before packing; the
   dictionary encodings pack ordinal indices at width equal to the
   number of bits needed to represent dictLen minus 1.
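
   A compact Python sketch of the packing rule follows.  It assumes
   that bit b of the packed stream lives in octet (b divided by 8) at
   bit position (b modulo 8), matching the convention used for the
   presence bitmap; the function names are illustrative.  Building one
   large integer is convenient for exposition but would be too slow for
   large chunks.

```python
def bits_needed(n: int) -> int:
    """Bits needed to represent n; 0 needs no bits."""
    return n.bit_length()


def bitpack(values: list[int], width: int) -> bytes:
    """Pack each value into exactly 'width' bits, LSB first, unaligned."""
    if not 0 <= width <= 64:
        raise ValueError("width out of range")
    acc = 0
    for i, v in enumerate(values):
        if v < 0 or v >> width:
            raise ValueError("value does not fit in width bits")
        acc |= v << (i * width)
    return acc.to_bytes((len(values) * width + 7) // 8, "little")


def bitunpack(buf: bytes, count: int, width: int) -> list[int]:
    nbytes = (count * width + 7) // 8
    if not 0 <= width <= 64 or len(buf) < nbytes:
        raise ValueError("bad width or truncated input")
    acc = int.from_bytes(buf[:nbytes], "little")
    mask = (1 << width) - 1
    return [(acc >> (i * width)) & mask for i in range(count)]


def encode_for(values: list[int]) -> tuple[int, int, bytes]:
    """Frame of reference: return (min, width, packed offsets)."""
    low = min(values)
    width = bits_needed(max(values) - low)
    return low, width, bitpack([v - low for v in values], width)
```

   For instance, the values 1000, 1003, and 1001 have a minimum of 1000
   and offsets 0, 3, and 1, which fit in 2 bits each and pack into a
   single octet.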

7.2.  Dictionary Encoding

   A dictionary chunk has the following layout:

   *  dictLen (uvarint): the number of distinct values.

   *  The value table, in ascending sorted order.  For DictInt: the
      first value as uvarint, then each subsequent value as the uvarint
      non-negative gap from its predecessor.  For DictBytes: per entry,
      a uvarint length followed by that many octets.

   *  width (u8): bits per ordinal, equal to the number of bits needed
      to represent dictLen minus 1.

   *  The per-row ordinals, bit-packed at width bits (Section 7.1).

   Each ordinal MUST be strictly less than dictLen.
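
   The DictInt layout above might be produced and consumed as in the
   following Python sketch.  The names are hypothetical, the bit order
   is assumed to be that of Section 7.1 with bit b in octet (b divided
   by 8), and rejecting a width that disagrees with dictLen is a
   choice of the sketch.

```python
def put_uvarint(out: bytearray, n: int) -> None:
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)


def get_uvarint(buf: bytes, pos: int) -> tuple[int, int]:
    value = shift = 0
    while True:
        if pos >= len(buf) or shift > 63:
            raise ValueError("truncated or over-long uvarint")
        octet = buf[pos]
        pos += 1
        value |= (octet & 0x7F) << shift
        if not octet & 0x80:
            return value, pos
        shift += 7


def encode_dict_int(values: list[int]) -> bytes:
    table = sorted(set(values))
    out = bytearray()
    put_uvarint(out, len(table))                      # dictLen
    prev = 0
    for entry in table:                               # first value, then gaps
        put_uvarint(out, entry - prev)
        prev = entry
    width = (len(table) - 1).bit_length() if table else 0
    out.append(width)
    ordinal = {entry: j for j, entry in enumerate(table)}
    acc = 0
    for i, v in enumerate(values):                    # bit-packed ordinals
        acc |= ordinal[v] << (i * width)
    out += acc.to_bytes((len(values) * width + 7) // 8, "little")
    return bytes(out)


def decode_dict_int(buf: bytes, rows: int) -> list[int]:
    dict_len, pos = get_uvarint(buf, 0)
    table, prev = [], 0
    for _ in range(dict_len):
        gap, pos = get_uvarint(buf, pos)
        prev += gap
        table.append(prev)
    nbytes_pos = pos + 1
    if pos >= len(buf):
        raise ValueError("missing width octet")
    width = buf[pos]
    if width != (max(dict_len, 1) - 1).bit_length():
        raise ValueError("width does not match dictLen")
    nbytes = (rows * width + 7) // 8
    if nbytes_pos + nbytes > len(buf):
        raise ValueError("truncated ordinals")
    acc = int.from_bytes(buf[nbytes_pos:nbytes_pos + nbytes], "little")
    out = []
    for i in range(rows):
        o = (acc >> (i * width)) & ((1 << width) - 1)
        if o >= dict_len:                             # ordinal < dictLen
            raise ValueError("ordinal out of range")
        out.append(table[o])
    return out
```

   Six port values drawn from the set 22, 80, and 443 encode to 8
   octets in this sketch: one for dictLen, four for the value table,
   one for width, and two for the six 2-bit ordinals.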

8.  Entropy Compression

   The entropy layer is applied to the semantically encoded octets of a
   chunk.  A writer SHOULD select none when entropy compression would
   not reduce size.  The compressionID values are:

 ID | Name | Description
----+------+----------------------------------------------------------------
  0 | none | stored octets are the semantic octets verbatim
  1 | zstd | a single zstd frame [RFC8478] over the semantic octets
  2 | lz4  | a one-octet selector (0=raw, 1=LZ4 block) then the payload


   For compressionID 2, selector 0 means the remaining octets are the
   uncompressed semantic octets (used when the data is incompressible);
   selector 1 means the remaining octets form an LZ4 block (not an LZ4
   frame) that decompresses to exactly uncompressedLen octets.  For all
   codecs, a reader MUST verify that decompression yields exactly
   uncompressedLen octets and MUST treat any deviation as corruption.

   The zstd frame format used by compressionID 1 is specified in
   [RFC8478]; the LZ4 block format used by compressionID 2 is described
   in [LZ4].

9.  Indexes

   For each column marked indexed, a writer MAY emit an index blob
   immediately after that column's chunk within the segment; its
   location is recorded in the footer (indexOffset, indexLength).  An
   indexLength of 0 means no index.

 Field    | Type | Description
----------+------+-------------------------------------------------------
 flags    | u8   | bit0 = Bloom filter present; bit1 = inverted present
 bloom    | ...  | present if bit0 (Section 9.1)
 inverted | ...  | present if bit1 (Section 9.2)

9.1.  Bloom Filter

 Field     | Type            | Description
-----------+-----------------+--------------------------------------------
 k         | u8              | number of hash probes
 wordCount | uvarint         | number of 64-bit words
 words     | u64 x wordCount | bit array

   The bit count m equals wordCount times 64.  Bit b is located in word
   (b divided by 64) at bit position (b modulo 64).  A value's
   membership uses double hashing of its XXH64 digest h (integers are
   hashed as their 8-octet little-endian form): with h1 equal to h and
   h2 equal to (h right-shifted by 33) bitwise-OR (h left-shifted by
   31), and with h2 replaced by the constant 0x9E3779B97F4A7C15 if it
   would otherwise be 0, probe i (for 0 <= i < k) addresses bit
   (h1 + i times h2) modulo m.  A writer SHOULD size the filter to the
   column's distinct cardinality at a target false-positive rate (the
   reference uses one percent).  A clear probe is definitive non-
   membership; a fully set result is probabilistic.
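
   The probe sequence can be sketched in Python as below.  The sketch
   takes the 64-bit digest h as an input, because XXH64 is not part of
   the Python standard library, and it assumes that the shifts and the
   sum h1 + i times h2 wrap at 64 bits before the reduction modulo m.
   The sizing helper uses the textbook Bloom filter formulas, which may
   differ from what the reference implementation does; all names are
   illustrative.

```python
import math

MASK64 = (1 << 64) - 1
FALLBACK_H2 = 0x9E3779B97F4A7C15


def probe_bits(h: int, k: int, m: int) -> list[int]:
    """Bit positions addressed by the k probes for digest h."""
    if m <= 0:
        raise ValueError("empty filter")
    h1 = h & MASK64
    h2 = ((h1 >> 33) | (h1 << 31)) & MASK64   # assumption: 64-bit wrap
    if h2 == 0:
        h2 = FALLBACK_H2
    return [((h1 + i * h2) & MASK64) % m for i in range(k)]


def bloom_add(words: list[int], k: int, h: int) -> None:
    for b in probe_bits(h, k, len(words) * 64):
        words[b // 64] |= 1 << (b % 64)


def bloom_may_contain(words: list[int], k: int, h: int) -> bool:
    """False means definitely absent; True means possibly present."""
    return all((words[b // 64] >> (b % 64)) & 1
               for b in probe_bits(h, k, len(words) * 64))


def size_filter(distinct: int, fp_rate: float = 0.01) -> tuple[int, int]:
    """Textbook sizing: return (wordCount, k) for a target rate."""
    bits = math.ceil(-distinct * math.log(fp_rate) / math.log(2) ** 2)
    word_count = max(1, math.ceil(bits / 64))
    k = round(word_count * 64 / max(distinct, 1) * math.log(2))
    return word_count, min(255, max(1, k))    # k is stored as a u8
```

   Under these assumptions, 1000 distinct values at a one percent
   target give 150 words (9600 bits) and 7 probes.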

9.2.  Inverted Index


 Field   | Type    | Description
---------+---------+----------------------------------------------------
 kind    | u8      | 0 = integer keys, 1 = byte keys
 count   | uvarint | number of distinct keys
 entries | ...     | 'count' entries, in ascending sorted key order

   Each entry is a key followed by a posting list.  The key is a uvarint
   for integer keys, or a uvarint length plus that many octets for byte
   keys.  The posting list is a uvarint bitmapLen followed by bitmapLen
   octets containing a Roaring Bitmap [ROARING], per the Roaring Bitmap
   serialization specification, of the zero-based row positions within
   the segment that hold that key.  Inverted indexes are OPTIONAL; when
   absent, equality is resolved by zone-map and Bloom pruning followed
   by a scan of the decoded column.

10.  Footer

10.1.  Schema Descriptor

 Field    | Type     | Description
----------+----------+----------------------------------------------------
 schemaID | u32      | schema identifier
 nameLen  | uvarint  | octet length of name
 name     | bytes[*] | schema name (UTF-8)
 version  | u16      | schema version
 colCount | uvarint  | number of columns (at most 4096)
 columns  | ...      | 'colCount' column descriptors

   Each column descriptor is: nameLen (uvarint), name (octets), type
   (u8, see Section 13), and flags (u8: bit0 = indexed, bit1 =
   nullable).

10.2.  Footer Body

 Field      | Type     | Description
------------+----------+--------------------------------------------------
 schema     | (desc)   | Section 10.1
 sourceLen  | uvarint  | octet length of sourceType
 sourceType | bytes[*] | originating source identifier (e.g. "honeypot")
 totalRows  | u64      | total rows in the file
 minTS      | u64      | minimum timestamp (Unix ns) over the file; 0 if none
 maxTS      | u64      | maximum timestamp
 segCount   | uvarint  | number of segments (at most 1048576)
 segments   | ...      | 'segCount' segment directory entries


   Each segment directory entry is: offset (u64), length (u64), rows
   (uvarint), minTS (u64), maxTS (u64), colCount (uvarint, which MUST
   equal the schema column count), then one column directory entry per
   column:

 Field       | Type    | Description
-------------+---------+-----------------------------------------------------
 chunkOffset | u64     | absolute file offset of the column chunk
 chunkLength | u64     | octet length of the column chunk
 indexOffset | u64     | absolute offset of the index blob, or 0
 indexLength | u64     | 0 if no index
 flags       | u8      | bit0 = column has nulls in this segment
 nonNull     | uvarint | count of non-null values
 zone-map    | ...     | integer columns: minInt (u64), maxInt (u64).
             |         | byte columns: minLen (uvarint) + min octets,
             |         | maxLen (uvarint) + max octets. The domain is
             |         | determined from the schema column type.

10.3.  Footer Trailer

   The footer body is immediately followed by footerLen (u32, the octet
   length of the footer body), crc32c (u32, CRC-32C over the footer
   body), and the trailer magic (bytes[4] equal to "NTCF").

   To open a file, a reader reads the final 4 octets and verifies the
   trailer magic, reads footerLen and crc32c from the 8 octets preceding
   it, slices the footer body of footerLen octets immediately before,
   verifies the CRC, and parses the body.  A reader MUST enforce that
   footerLen is at most 256 mebibytes (Section 14) and MUST verify that
   the body lies wholly between the header and the trailer.
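
   The open procedure above might look as follows in Python.  This is
   a sketch over an in-memory byte string rather than a seekable file;
   locate_footer and append_footer are hypothetical names, and the
   bitwise CRC-32C is included only because the standard library does
   not provide one.

```python
import struct

HEADER_LEN = 36
TRAILER_LEN = 12                      # footerLen u32 + crc32c u32 + magic
MAX_FOOTER = 256 * 1024 * 1024        # 256 MiB (Section 14)


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for octet in data:
        crc ^= octet
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def append_footer(data: bytes, body: bytes) -> bytes:
    """Append a footer body followed by its 12-octet trailer."""
    trailer = struct.pack("<II", len(body), crc32c(body)) + b"NTCF"
    return data + body + trailer


def locate_footer(data: bytes) -> bytes:
    """Return the validated footer body of a complete file."""
    end = len(data)
    if end < HEADER_LEN + TRAILER_LEN or data[end - 4:end] != b"NTCF":
        raise ValueError("missing trailer magic")
    footer_len, stored_crc = struct.unpack_from("<II", data, end - 12)
    if footer_len > MAX_FOOTER:
        raise ValueError("footer body exceeds 256 MiB")
    start = end - TRAILER_LEN - footer_len
    if start < HEADER_LEN:
        raise ValueError("footer body overlaps the header")
    body = data[start:end - TRAILER_LEN]
    if crc32c(body) != stored_crc:
        raise ValueError("footer CRC mismatch")
    return body                        # parsing the body is not shown
```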

11.  Durability and Crash Recovery

   A writer that appends records over time (streaming ingestion) SHOULD
   write a checkpoint footer at intervals.  A checkpoint footer is a
   complete footer (Section 10) written at the current end of file; it
   is appended, never overwritten.  Because earlier footers are never
   modified, the most recently completed footer is always intact even if
   the process terminates while writing a subsequent segment.

   On read, if the trailing footer is missing or fails validation, a
   reader MAY recover by scanning backward from the end of file for an
   occurrence of the trailer magic and attempting to parse a footer
   ending there; the first (latest) candidate whose footerLen and CRC
   validate is the recovered footer.  Records written after the last
   checkpoint but before termination are not recoverable; this is the
   intended durability boundary.  Dead intermediate footers MAY be
   reclaimed by an out-of-band compaction step.
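
   The backward scan described above can be sketched in Python as
   follows, again over an in-memory byte string.  The names are
   hypothetical.  The sketch validates only footerLen and the CRC; a
   real reader would also need to parse the candidate body, since a
   chance occurrence of the magic could in principle pass these two
   checks.

```python
import struct

HEADER_LEN = 36
MAX_FOOTER = 256 * 1024 * 1024


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for octet in data:
        crc ^= octet
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def footer_ending_at(data: bytes, end: int):
    """Return the footer body whose trailer ends at 'end', or None."""
    if end - 12 < HEADER_LEN:
        return None
    footer_len, stored_crc = struct.unpack_from("<II", data, end - 12)
    start = end - 12 - footer_len
    if footer_len > MAX_FOOTER or start < HEADER_LEN:
        return None
    body = data[start:end - 12]
    return body if crc32c(body) == stored_crc else None


def recover_footer(data: bytes):
    """Scan backward for the latest valid footer.

    Returns (valid_length, body), or None if no footer validates.
    """
    limit = len(data)
    while True:
        idx = data.rfind(b"NTCF", HEADER_LEN, limit)
        if idx < 0:
            return None
        body = footer_ending_at(data, idx + 4)
        if body is not None:
            return idx + 4, body
        limit = idx + 3            # next search: occurrences before idx
```

   The returned length marks the end of the recovered footer; octets
   beyond it belong to the interrupted write.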


12.  Reading Algorithm (Informative)

   1.  Validate the header (Section 5).

   2.  Locate and validate the footer (Section 10.3); on failure,
       optionally recover (Section 11).

   3.  Parse the schema and the segment and column directory.

   4.  For a predicate of the form "column OP value": normalise value
       into the column domain (Section 13).  Then, for each segment:

       a.  Consult the column's zone map; if no value between its
           minimum and maximum can satisfy the predicate, skip the
           segment without reading its body.

       b.  For equality on an indexed column, consult the Bloom filter
           and skip the segment if any probed bit is clear.

       c.  If an inverted index is present, take its posting list;
           otherwise decode the single column chunk and scan it.

   5.  Combine per-segment row sets across predicates (intersection for
       AND, union for OR), then aggregate or project.

   A count of all rows with no predicate is answered from totalRows with
   no body read.
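
   The zone-map test in step 4 can be illustrated with a small Python
   sketch.  The operator set shown is an assumption, since this
   document does not enumerate the supported operators, and the class
   and function names are hypothetical.  The comparison works equally
   for integers and for octet strings compared lexicographically.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ZoneMap:
    """Minimum and maximum of the present values in one segment."""
    min: Any
    max: Any


def may_match(zone: ZoneMap, op: str, value: Any) -> bool:
    """False only when no value in [min, max] can satisfy the predicate."""
    if op == "==":
        return zone.min <= value <= zone.max
    if op == "<":
        return zone.min < value
    if op == "<=":
        return zone.min <= value
    if op == ">":
        return zone.max > value
    if op == ">=":
        return zone.max >= value
    if op == "!=":
        # Prunable only if every present value equals 'value'.
        return not (zone.min == zone.max == value)
    raise ValueError("unsupported operator")


def segments_to_read(zones: list[ZoneMap], op: str, value: Any) -> list[int]:
    """Indices of the segments that survive zone-map pruning."""
    return [i for i, z in enumerate(zones) if may_match(z, op, value)]
```

   Pruning is conservative: a surviving segment may still contain no
   matching row, which the Bloom filter, inverted index, or column scan
   then determines.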

13.  Logical Type System

 ID | Type      | Dom | Normalisation
----+-----------+-----+------------------------------------------------------
  0 | timestamp | int | Unix nanoseconds since the epoch
  1 | ip        | byte| canonical 16-octet form; IPv4 stored as IPv4-mapped
    |           |     | IPv6 so one column holds both families and the
    |           |     | lexicographic order is total
  2 | uint      | int | unsigned 64-bit
  3 | port      | int | transport port 0..65535
  4 | enum      | byte| low-cardinality octet string (country, protocol,
    |           |     | HTTP method, event type, ...)
  5 | string    | byte| arbitrary octet string
  6 | bool      | int | 0 or 1

   The 16-octet IP normalisation gives correct zone-map ordering within
   each address family.  Implementations storing both families in one
   column SHOULD note that minimum and maximum bounds spanning families
   are looser; this does not affect correctness, only pruning
   effectiveness.
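
   The ip normalisation can be sketched with Python's standard
   ipaddress module.  The function names are illustrative and not part
   of the format.

```python
import ipaddress

# ::ffff:0:0/96, the IPv4-mapped IPv6 prefix: ten zero octets, then FF FF.
V4_MAPPED_PREFIX = bytes(10) + b"\xff\xff"


def normalise_ip(text: str) -> bytes:
    """Return the canonical 16-octet form of an IPv4 or IPv6 address."""
    addr = ipaddress.ip_address(text)
    if addr.version == 4:
        return V4_MAPPED_PREFIX + addr.packed
    return addr.packed


def denormalise_ip(octets: bytes) -> str:
    """Render a 16-octet value, showing IPv4-mapped values as IPv4."""
    if len(octets) != 16:
        raise ValueError("ip values are exactly 16 octets")
    if octets[:12] == V4_MAPPED_PREFIX:
        return str(ipaddress.IPv4Address(octets[12:]))
    return str(ipaddress.IPv6Address(octets))
```

   Octet-wise comparison of the normalised values orders IPv4
   addresses numerically.  A segment holding both 10.0.0.1 and
   2001:db8::1 has a zone map that spans every IPv4-mapped address
   above 10.0.0.1, which is the looser bound mentioned above.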


14.  Resource Limits

   Because every length and offset in a file is attacker-controlled, a
   reader MUST gate every allocation derived from a file-supplied count
   by a finite ceiling before allocating, and MUST bound decompression.
   The reference implementation enforces, and this document RECOMMENDS,
   at least the following ceilings:

    Quantity                                  | Ceiling
   -------------------------------------------+--------------
    columns per schema                        | 4096
    rows per segment                          | 16777216
    segments per file                         | 1048576
    dictionary entries per chunk              | 16777216
    stored (post-entropy) octets per chunk    | 1 GiB
    uncompressed octets per chunk             | 4 GiB
    decompression expansion ratio             | 256:1
    footer body                               | 256 MiB
    a single byte-domain value                | 16 MiB

   A reader MUST reject any "count times width" computation that would
   overflow, and MUST reject any offset or length that falls outside the
   file.
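
   A few of the checks above are sketched below in Python.  The helper
   names are hypothetical.  Python integers do not overflow, so the
   explicit 64-bit comparison stands in for the checked arithmetic that
   a fixed-width implementation would need.

```python
MAX_STORED = 1 << 30             # 1 GiB stored octets per chunk
MAX_UNCOMPRESSED = 4 << 30       # 4 GiB uncompressed octets per chunk
MAX_RATIO = 256                  # decompression expansion ratio 256:1
MAX_ROWS = 16_777_216            # rows per segment
U64_MAX = (1 << 64) - 1


def check_extent(offset: int, length: int, file_size: int) -> None:
    """Reject an (offset, length) pair that falls outside the file."""
    if offset < 0 or length < 0 or offset + length > file_size:
        raise ValueError("extent falls outside the file")


def check_chunk_lengths(stored_len: int, uncompressed_len: int) -> None:
    """Apply the per-chunk ceilings before any buffer is allocated."""
    if stored_len > MAX_STORED:
        raise ValueError("stored length exceeds ceiling")
    if uncompressed_len > MAX_UNCOMPRESSED:
        raise ValueError("uncompressed length exceeds ceiling")
    # Multiplying avoids a division by zero when stored_len is 0.
    if uncompressed_len > stored_len * MAX_RATIO:
        raise ValueError("expansion ratio exceeds 256:1")


def packed_octets(count: int, width: int) -> int:
    """Octets needed for 'count' values of 'width' bits, with checks."""
    if not 0 <= count <= MAX_ROWS or not 0 <= width <= 64:
        raise ValueError("count or width out of range")
    bits = count * width
    if bits > U64_MAX:           # mirrors a u64 overflow check
        raise ValueError("count times width overflows")
    return (bits + 7) // 8
```

   The checks run on the declared lengths alone, so a hostile file is
   rejected before any allocation or decompression takes place.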

15.  Security Considerations

   NTCF files frequently originate from untrusted parties: partner
   feeds, tenant sensors, and attacker probes.  Conforming readers MUST
   treat all input as hostile.

   No panics or unbounded work: for any input octets, a decoder MUST
   return a value or an error; it MUST NOT crash, allocate without
   bound, read outside the file, or loop indefinitely.  The reference
   implementation enforces this property with fuzz testing across the
   header and footer parser, every encoding decoder, the entropy layer,
   the index parser, and the query parser.

   Decompression bombs: a reader MUST enforce both an absolute
   uncompressed ceiling and an expansion-ratio cap (Section 14) before
   and during entropy decoding, and MUST verify that the decompressed
   length equals the declared length.


   Integrity, not authenticity: the CRC-32C checksums on the header and
   footer and the XXH64 checksums on chunks detect accidental corruption
   only; they are not message authentication codes.  An adversary with
   write access can forge a valid-looking file.  Format version 1
   provides no confidentiality and no authenticity.  Consumers requiring
   those properties MUST layer an authenticated or encrypted transport
   or storage mechanism beneath NTCF.  An authenticated container is a
   candidate for a future version.

   Resource exhaustion: the limits of Section 14 bound memory and CPU
   per file; operators ingesting many files SHOULD additionally bound
   concurrency.

16.  IANA Considerations

   This document requests registration of the following media type, per
   [RFC6838], and a file extension.

   Type name:  application

   Subtype name:  vnd.ntcf

   Required parameters:  none

   Optional parameters:  none

   Encoding considerations:  binary

   Magic number(s):  the four octets 0x4E 0x54 0x43 0x46 ("NTCF") at
      offset 0, and the same four octets as the final four octets of a
      complete file

   File extension(s):  .ntcf

   Security considerations:  see Section 15 of this document

   Interoperability considerations:  the format is versioned
      (Section 17)

   Published specification:  this document

   Intended usage:  COMMON

   Change controller:  The NTCF Authors


   If a registry of NTCF encodingID, compressionID, or logical type
   values is desired, this document suggests an "NTCF Encodings"
   registry seeded with the assignments in Sections 7, 8, and 13, under
   a "Specification Required" policy.

17.  Versioning and Interoperability

   The header version field gates on-disk compatibility.  A reader MUST
   refuse a file whose version it does not implement.  Within a
   supported version, a reader MUST reject unknown encodingID,
   compressionID, and logical type values rather than guess.  Additive
   changes that do not alter the octet layout of existing structures
   (for example, a new encoding identifier) MAY be made within a version
   only if existing readers can still reject the new identifier safely;
   otherwise the version MUST be incremented.
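
   A reader's gating logic might look like the following sketch.  The
   field widths, the identifier values in the two tables, and the names
   checkHeader and checkChunk are assumptions for illustration; the
   actual assignments are those listed in Sections 7, 8, and 13.

```go
package main

import (
	"errors"
	"fmt"
)

// supportedVersion is the only format version this sketch implements.
const supportedVersion = 1

var (
	errVersion   = errors.New("ntcf: unsupported format version")
	errUnknownID = errors.New("ntcf: unknown identifier")
)

// Hypothetical identifier sets; the values are placeholders.
var (
	knownEncodings    = map[uint8]bool{0: true, 1: true, 2: true}
	knownCompressions = map[uint8]bool{0: true, 1: true}
)

// checkHeader refuses any version other than the one implemented.
func checkHeader(version uint16) error {
	if version != supportedVersion {
		return fmt.Errorf("%w: %d", errVersion, version)
	}
	return nil
}

// checkChunk rejects unknown identifiers instead of guessing at them.
func checkChunk(encodingID, compressionID uint8) error {
	if !knownEncodings[encodingID] {
		return fmt.Errorf("%w: encodingID %d", errUnknownID, encodingID)
	}
	if !knownCompressions[compressionID] {
		return fmt.Errorf("%w: compressionID %d", errUnknownID, compressionID)
	}
	return nil
}

func main() {
	fmt.Println(checkHeader(1))
	fmt.Println(checkHeader(2))
	fmt.Println(checkChunk(0, 0))
	fmt.Println(checkChunk(200, 0))
}
```

   Because unknown identifiers fail closed, a writer can introduce a
   new identifier within a version and older readers reject the
   affected file instead of misdecoding it.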

18.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC6838]  Freed, N., Klensin, J., and T. Hansen, "Media Type
              Specifications and Registration Procedures", BCP 13,
              RFC 6838, January 2013,
              <https://www.rfc-editor.org/info/rfc6838>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, May 2017,
              <https://www.rfc-editor.org/info/rfc8174>.

   [RFC8478]  Collet, Y. and M. Kucherawy, Ed., "Zstandard Compression
              and the application/zstd Media Type", RFC 8478, October
              2018, <https://www.rfc-editor.org/info/rfc8478>.

19.  Informative References

   [LZ4]      Collet, Y., "LZ4 Block Format Description", 2011,
              <https://github.com/lz4/lz4/blob/dev/doc/
              lz4_Block_format.md>.

   [PARQUET]  Apache Software Foundation, "Apache Parquet File Format",
              2024, <https://parquet.apache.org/docs/file-format/>.

   [ROARING]  Chambi, S., Lemire, D., Kaser, O., and R. Godin, "Better
              bitmap performance with Roaring bitmaps", Software:
              Practice and Experience 46(5), 2016,
              <https://doi.org/10.1002/spe.2325>.


Appendix A.  Worked Example (Informative)

   A minimal file containing one segment of three rows over the columns
   timestamp (timestamp), srcip (ip, indexed), and country (enum,
   indexed, nullable) is laid out as follows.

[Header: "NTCF", version=1, flags=0, created, writerID, crc32c]  (36 octets)
[Segment 0]
  [chunk: timestamp] kind=0 enc=DeltaOfDelta comp=zstd ... checksum, stored
  [chunk: srcip]     kind=1 enc=DictBytes    comp=zstd ... checksum, stored
  [index: srcip]     flags=bit0 (bloom): k, wordCount, words
  [chunk: country]   kind=1 enc=DictBytes    comp=none flags=bit0 (nulls)
                     bitmap, checksum, stored
  [index: country]   flags=bit0 (bloom): k, wordCount, words
[Footer body]
  schema{id, "demo", v1, 3 columns ...}
  sourceType="demo", totalRows=3, minTS, maxTS, segCount=1
  segment0{offset=36, length, rows=3, minTS, maxTS, colCount=3,
           col0{chunkOffset,chunkLength,0,0, flags=0,nonNull=3, minInt,maxInt}
           col1{chunkOffset,chunkLength,indexOffset,indexLength,
                flags=0,nonNull=3, min/max bytes}
           col2{chunkOffset,chunkLength,indexOffset,indexLength,
                flags=1,nonNull=2, min/max bytes}}
[footerLen u32][crc32c u32]["NTCF"]
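
   Given the trailer shown in the last line of the layout, a reader can
   locate the footer body from the end of the file.  The sketch below
   rests on assumptions not stated in this appendix: that footerLen
   counts the footer body octets immediately preceding the trailer, and
   that the caller supplies the byte order defined by the layout
   sections.  The names locateFooter and demo are hypothetical, and
   checksum verification is omitted.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	headerLen  = 36 // header size given in the worked example
	trailerLen = 12 // footerLen u32 + crc32c u32 + "NTCF"
)

// locateFooter returns the half-open range [start, end) of the footer
// body and the stored checksum, without verifying the checksum.
func locateFooter(file []byte, order binary.ByteOrder) (start, end int, crc uint32, err error) {
	if len(file) < headerLen+trailerLen {
		return 0, 0, 0, errors.New("ntcf: file too short")
	}
	t := file[len(file)-trailerLen:]
	if string(t[8:]) != "NTCF" {
		return 0, 0, 0, errors.New("ntcf: trailing magic missing")
	}
	footerLen := uint64(order.Uint32(t[0:4]))
	crc = order.Uint32(t[4:8])
	end = len(file) - trailerLen
	// The footer body must not reach back into the header.
	if footerLen > uint64(end-headerLen) {
		return 0, 0, 0, errors.New("ntcf: footerLen out of range")
	}
	return end - int(footerLen), end, crc, nil
}

// demo builds a synthetic image (zeroed header, 10-octet footer body,
// trailer) and locates the footer. Little-endian is assumed only here.
func demo() (start, end int, err error) {
	order := binary.LittleEndian
	file := make([]byte, headerLen+10+trailerLen)
	t := file[len(file)-trailerLen:]
	order.PutUint32(t[0:4], 10)
	order.PutUint32(t[4:8], 0) // placeholder checksum, not verified
	copy(t[8:], "NTCF")
	start, end, _, err = locateFooter(file, order)
	return start, end, err
}

func main() {
	fmt.Println(demo())
}
```

   Bounding footerLen against the file size before slicing keeps a
   corrupted or hostile length from causing an out-of-range read.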

Appendix B.  Reference Implementation and Results (Informative)

   A complete, Apache-2.0-licensed reference implementation in Go
   accompanies this specification.  It includes round-trip and fuzz
   tests for every encoding, the chunk and footer framing, the index
   blobs, and the query parser, together with a benchmark harness.

   Compression ratios measured with this implementation on synthetic
   but realistically skewed telemetry (flow, honeypot, and web access)
   exceed those of gzip, zstd, lz4, and xz on the same inputs, and the
   NTCF files remain searchable in place.  These results are
   reproducible from the implementation; they are illustrative and not
   a conformance requirement.  Production deployments should validate
   ratios on their own representative data.

Author's Address

   Alptekin Sünnetci
   NTCF Project
   Email: alptekin@sunnetci.net
   URI:   https://github.com/ntcf/ntcf

