⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
Eric Coissac
2026-04-07 08:36:50 +02:00
parent 670edc1958
commit 8c7017a99d
392 changed files with 18875 additions and 141 deletions
@@ -0,0 +1,24 @@
# `.skm` File Format and `SkmWriter` Functionality
The Go package `obikmer` provides a binary writer for `.skm` (super-kmer) files, optimized for compact storage of DNA sequences.
- **Purpose**: Efficiently serialize *super-kmers* (runs of consecutive, overlapping k-mers that share the same minimizer) into a binary format.
- **Format per super-kmer**:
  - `len: uint16 LE`: length of the sequence in bases (little-endian, 2 bytes), so a single record holds at most 65,535 bases.
  - `data: ⌈len/4⌉ bytes`: nucleotide sequence encoded as **2 bits per base**, packed tightly (four bases per byte).
- **Encoding scheme**:
  - `A → 00`, `C → 01`, `G → 10`, `T → 11`.
  - Padding: when `len % 4 ≠ 0`, the unused trailing bits of the final byte are set to zero.
- **Implementation details**:
  - Uses buffered I/O (`bufio.Writer` with a 64 KiB buffer) for performance.
  - `NewSkmWriter(path)` opens or creates the file and returns a writer instance.
  - `Write(sk SuperKmer)` writes the sequence length, then packs the bases using a lookup table (`__single_base_code__[seq[pos]&31]`).
  - `Close()` flushes the buffer and closes the file handle.
- **Use case**: Ideal for high-throughput genomic preprocessing (e.g., indexing, sketching), where space and I/O speed matter.
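- **Assumptions**: the `SuperKmer` type exposes a `Sequence []byte` field; bases are ASCII (`A,C,G,T,a,c,g,t`). Masking with `&31` keeps the low 5 bits of each byte, so upper- and lowercase letters map to the same table index (`A`/`a` → 1, `C`/`c` → 3, `G`/`g` → 7, `T`/`t` → 20).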
- **Efficiency**: about 4× smaller than ASCII (1 byte/base → ~0.25 bytes/base), plus a fixed 2-byte length prefix per super-kmer.
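
The record layout described above can be sketched in a few lines of Go. This is a minimal illustration, not the `obikmer` implementation: the names `baseCode`, `packRecord`, and `writeSkm` are hypothetical, the sketch assumes that the first base of each group of four occupies the two most significant bits of its byte (the text above does not state the bit order), and it assumes that any byte other than `A`, `C`, `G`, `T` (either case) falls back to code `00`.

```go
package main

import (
	"bufio"
	"encoding/binary"
	"fmt"
	"os"
)

// baseCode maps (ASCII byte & 31) to a 2-bit code.
// Hypothetical stand-in for the package's own lookup table;
// every index not listed here defaults to 0 (an assumption).
var baseCode = [32]byte{
	'A' & 31: 0, 'C' & 31: 1, 'G' & 31: 2, 'T' & 31: 3,
}

// packRecord encodes one sequence as a uint16 little-endian length
// followed by ceil(len/4) bytes holding 2 bits per base.
// Assumption: the first base of each byte sits in the top two bits.
func packRecord(seq []byte) ([]byte, error) {
	if len(seq) > 0xFFFF {
		return nil, fmt.Errorf("sequence too long for a uint16 length: %d", len(seq))
	}
	out := make([]byte, 2+(len(seq)+3)/4) // unused trailing bits stay zero
	binary.LittleEndian.PutUint16(out, uint16(len(seq)))
	for i, b := range seq {
		shift := uint(6 - 2*(i%4))
		out[2+i/4] |= baseCode[b&31] << shift
	}
	return out, nil
}

// writeSkm writes all sequences to path through a 64 KiB buffered writer,
// then flushes the buffer and closes the file.
func writeSkm(path string, seqs [][]byte) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	w := bufio.NewWriterSize(f, 64*1024)
	for _, s := range seqs {
		rec, err := packRecord(s)
		if err == nil {
			_, err = w.Write(rec)
		}
		if err != nil {
			f.Close()
			return err
		}
	}
	if err := w.Flush(); err != nil {
		f.Close()
		return err
	}
	return f.Close()
}

func main() {
	rec, err := packRecord([]byte("ACGTac"))
	if err != nil {
		panic(err)
	}
	fmt.Printf("% x\n", rec) // prints "06 00 1b 10"
}
```

Under these assumptions, the six bases `ACGTac` occupy four bytes: two for the length (`06 00`), one for `ACGT` (`00 01 10 11` = `0x1b`), and one for `ac` padded with zero bits (`00 01 00 00` = `0x10`).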