mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 03:50:39 +00:00
⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30 (automated by Makefile)
@@ -0,0 +1,38 @@
# KDI File Format and Writer
The `obikmer` package implements KDI, a compact storage format for sorted sequences of 64-bit k-mers. It combines delta encoding with a sparse index for random access.
## Core Format (`.kdi`)
- **Magic header**: `KDI\x01` (4 bytes), identifies the file type.
- **Count field**: `uint64`, little-endian; the total number of k-mers (patched at close).
- **First value**: `uint64`, little-endian; the first k-mer, stored as an absolute integer.
- **Deltas**: each subsequent value is stored as a *delta-varint*, the varint-encoded difference from the previous k-mer. Because the sequence is sorted, the differences tend to be small and compress well.
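The layout above can be sketched in a few lines of Go. This is a minimal illustration, not the `obikmer` implementation: the names `encodeKdi` and `decodeKdi` are hypothetical, and the sketch assumes the varints are the unsigned varints produced by `encoding/binary`, which the description does not confirm. How an empty file is laid out is also an assumption.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeKdi serializes strictly increasing k-mers as: magic, count,
// first value, then one varint per delta.
// Assumption: varints use encoding/binary's unsigned varint encoding.
func encodeKdi(kmers []uint64) []byte {
	var buf bytes.Buffer
	buf.WriteString("KDI\x01") // 4-byte magic header

	var u64 [8]byte
	binary.LittleEndian.PutUint64(u64[:], uint64(len(kmers)))
	buf.Write(u64[:]) // count field, little-endian
	if len(kmers) == 0 {
		return buf.Bytes() // assumption: no first value when empty
	}

	binary.LittleEndian.PutUint64(u64[:], kmers[0])
	buf.Write(u64[:]) // first k-mer as an absolute value

	var tmp [binary.MaxVarintLen64]byte
	for i := 1; i < len(kmers); i++ {
		n := binary.PutUvarint(tmp[:], kmers[i]-kmers[i-1])
		buf.Write(tmp[:n]) // delta from the previous k-mer
	}
	return buf.Bytes()
}

// decodeKdi reverses encodeKdi by accumulating the deltas.
func decodeKdi(data []byte) ([]uint64, error) {
	if len(data) < 12 || string(data[:4]) != "KDI\x01" {
		return nil, fmt.Errorf("not a KDI stream")
	}
	count := binary.LittleEndian.Uint64(data[4:12])
	if count == 0 {
		return nil, nil
	}
	if len(data) < 20 {
		return nil, fmt.Errorf("truncated first value")
	}
	prev := binary.LittleEndian.Uint64(data[12:20])
	out := []uint64{prev}
	rest := data[20:]
	for uint64(len(out)) < count {
		delta, n := binary.Uvarint(rest)
		if n <= 0 {
			return nil, fmt.Errorf("truncated or invalid varint")
		}
		prev += delta
		out = append(out, prev)
		rest = rest[n:]
	}
	return out, nil
}

func main() {
	data := encodeKdi([]uint64{1000, 1003, 1500, 70000})
	decoded, err := decodeKdi(data)
	fmt.Println(len(data), decoded, err) // 26 [1000 1003 1500 70000] <nil>
}
```

For these four values the stream is 26 bytes: 20 bytes for the magic, count, and first value, then 1, 2, and 3 bytes for the deltas 3, 497, and 68500. Storing every value as a raw `uint64` after the same 12-byte header would take 44 bytes.
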
## Writer (`KdiWriter`)
- **Strict ordering**: K-mers must be written in *strictly increasing order*.
- **Buffering**: output goes through a `bufio.Writer` with a 64 KB buffer.
- **Internal state**: the writer tracks:
  - the current k-mer count,
  - the previous value (for delta computation),
  - the number of bytes written in the data section.
- **Sparse indexing**: Every `defaultKdxStride` k-mers, an entry is recorded in memory for random access.
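The bookkeeping listed above can be illustrated with a small stand-in writer. This is a sketch under stated assumptions, not `KdiWriter` itself: `sketchWriter`, `indexEntry`, `writeAll`, and the `stride` parameter are hypothetical names, the magic header and count patching are left out, and what exactly an index entry records (here, the k-mer and the number of data bytes written once it is stored) is a guess.

```go
package main

import (
	"bufio"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// indexEntry is a hypothetical sparse-index record.
type indexEntry struct {
	kmer   uint64 // k-mer value at the checkpoint
	offset uint64 // data bytes written once that k-mer is stored
}

// sketchWriter mirrors the state described in the text.
// It is an illustration, not obikmer.KdiWriter.
type sketchWriter struct {
	w       *bufio.Writer
	count   uint64 // k-mers written so far
	prev    uint64 // previous k-mer, used to compute the delta
	written uint64 // bytes written in the data section
	stride  uint64 // stand-in for defaultKdxStride
	index   []indexEntry
}

var errNotIncreasing = errors.New("k-mers must be strictly increasing")

func newSketchWriter(dst io.Writer, stride uint64) *sketchWriter {
	if stride == 0 {
		stride = 1 // avoid a division by zero in Write
	}
	return &sketchWriter{w: bufio.NewWriterSize(dst, 64*1024), stride: stride}
}

// Write stores the first k-mer as 8 little-endian bytes and every
// later one as a varint delta, rejecting out-of-order input.
func (s *sketchWriter) Write(kmer uint64) error {
	var buf [binary.MaxVarintLen64]byte
	n := 8
	if s.count == 0 {
		binary.LittleEndian.PutUint64(buf[:8], kmer)
	} else {
		if kmer <= s.prev {
			return errNotIncreasing
		}
		n = binary.PutUvarint(buf[:], kmer-s.prev)
	}
	if _, err := s.w.Write(buf[:n]); err != nil {
		return err
	}
	s.written += uint64(n)
	s.prev = kmer
	s.count++
	if s.count%s.stride == 0 { // record a sparse checkpoint in memory
		s.index = append(s.index, indexEntry{kmer: kmer, offset: s.written})
	}
	return nil
}

// writeAll feeds kmers to a fresh writer that discards its output.
func writeAll(kmers []uint64, stride uint64) (*sketchWriter, error) {
	s := newSketchWriter(io.Discard, stride)
	for _, k := range kmers {
		if err := s.Write(k); err != nil {
			return s, err
		}
	}
	return s, s.w.Flush()
}

func main() {
	s, err := writeAll([]uint64{10, 20, 30, 40, 50}, 2)
	fmt.Println(s.count, s.written, s.index, err) // 5 12 [{20 9} {40 11}] <nil>

	_, err = writeAll([]uint64{10, 10}, 2)
	fmt.Println(err) // k-mers must be strictly increasing
}
```

Rejecting any k-mer that is not greater than the previous one keeps every delta at least 1, so the unsigned subtraction can never wrap around.
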
## Companion Index (`.kdx`)
- Written automatically on `Close()` if any index entries were recorded.
- Stores `(kmer, file_offset)` pairs for fast seek-to-position lookups (e.g., binary search on a k-mer range).
- Enables efficient random access without a full file scan.
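The binary-search lookup mentioned above might look like the following. It is a hedged sketch: `kdxEntry` and `seekStart` are hypothetical names, and the in-memory slice stands in for index entries already loaded from a `.kdx` file, whose on-disk encoding is not described here.

```go
package main

import (
	"fmt"
	"sort"
)

// kdxEntry is a hypothetical in-memory form of one (kmer, file_offset) pair.
type kdxEntry struct {
	kmer   uint64
	offset uint64
}

// seekStart returns the last index entry whose k-mer is <= target,
// i.e. the closest checkpoint from which a reader could start decoding.
// ok is false when target is smaller than every indexed k-mer, in which
// case decoding has to start at the beginning of the data section.
func seekStart(index []kdxEntry, target uint64) (entry kdxEntry, ok bool) {
	// sort.Search finds the first entry with kmer > target.
	i := sort.Search(len(index), func(i int) bool {
		return index[i].kmer > target
	})
	if i == 0 {
		return kdxEntry{}, false
	}
	return index[i-1], true
}

func main() {
	index := []kdxEntry{{20, 9}, {40, 11}, {90, 14}}

	e, ok := seekStart(index, 55)
	fmt.Println(e, ok) // {40 11} true

	e, ok = seekStart(index, 5)
	fmt.Println(e, ok) // {0 0} false
}
```

With such a checkpoint a reader only has to decode the deltas between the returned entry and the target, which is at most one stride of k-mers, instead of scanning the whole file.
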
## Usage Pattern
```go
w, err := obikmer.NewKdiWriter("data.kdi")
if err != nil {
    log.Fatal(err)
}
for _, kmer := range sortedKMers { // must be strictly increasing
    w.Write(kmer)
}
w.Close() // finalizes header, writes .kdx index
```
The format is optimized for memory-efficient storage and fast retrieval of sorted `uint64` k-mers in genomic and other sequence-analysis pipelines.