Files
obitools4/autodoc/docmd/pkg/obikmer/skm_writer.md
T
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

25 lines
1.3 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# `.skm` File Format and `SkmWriter` Functionality
The Go package `obikmer` provides a binary writer for `.skm` (super-kmer) files, optimized for compact storage of DNA sequences.
- **Purpose**: Efficiently serialize *super-kmers* (long k-mers) into a binary format.
- **Format per super-kmer**:
- `len: uint16 LE` — length of the sequence in bases (little-endian, 2 bytes).
- `data: ⌈len/4⌉ bytes` — nucleotide sequence encoded as **2 bits per base**, packed tightly.
- **Encoding scheme**:
- `A → 00`, `C → 01`, `G → 10`, `T → 11`.
- Padding: trailing bits in the final byte are zeroed if `len % 4 ≠ 0`.
- **Implementation details**:
- Uses buffered I/O (`bufio.Writer` with 64 KiB buffer) for performance.
- `NewSkmWriter(path)` opens/creates the file and returns a writer instance.
- `Write(sk SuperKmer)` encodes sequence length, then packs bases using a lookup (`__single_base_code__[seq[pos]&31]`).
- `Close()` flushes buffers and closes the file handle.
- **Use case**: Ideal for high-throughput genomic preprocessing (e.g., indexing, sketching), where space and I/O speed matter.
- **Assumptions**: `SuperKmer` type exposes a `.Sequence []byte`; bases are ASCII (`A,C,G,T,a,c,g,t`) — `&31` normalizes to lowercase index.
- **Efficiency**: 4× compression vs. ASCII (1 byte/base → ~0.25 bytes/base), minimal overhead.