mirror of
https://github.com/metabarcoding/obitools4.git
synced 2026-04-30 03:50:39 +00:00
⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5" - Update version.txt from 4.29 → .30 (automated by Makefile)
@@ -0,0 +1,36 @@
|
||||
# Semantic Description of `obikmer` Package
|
||||
|
||||
This Go package provides utilities for **k-mer (specifically 4-mer) counting and comparison** of biological sequences.
|
||||
|
||||
## Core Functionalities
|
||||
|
||||
1. **`Count4Mer(seq, buffer, counts)`**
|
||||
Counts occurrences of all 256 possible 4-mers (4-nucleotide subsequences) in a `BioSequence`.
|
||||
- Encodes each 4-mer into an integer (0–255) using `Encode4mer`.
|
||||
- Populates a fixed-size `[256]uint16` table (`Table4mer`) with counts.
|
||||
- Reuses or allocates the `counts` buffer as needed.
|
||||
|
||||
2. **`Common4Mer(count1, count2)`**
|
||||
Computes the *intersection* of two 4-mer frequency profiles: sum over all k-mers of `min(count1[k], count2[k])`.
|
||||
Used to measure shared content between sequences.
|
||||
|
||||
3. **`Sum4Mer(count)`**
|
||||
Returns the total number of 4-mers in a profile (i.e., sum over all entries).
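The three functions above can be sketched in a few lines. This is a minimal illustration, not the package's code: the lower-case names (`count4Mer`, `common4Mer`, `sum4Mer`) and the plain `[]byte` input are assumptions standing in for the real `BioSequence`-based API.

```go
package main

import "fmt"

// encodeNucleotide maps A, C, G, T/U (either case) to 0..3.
// Any other byte falls back to 0 (A).
func encodeNucleotide(b byte) byte {
	switch b {
	case 'C', 'c':
		return 1
	case 'G', 'g':
		return 2
	case 'T', 't', 'U', 'u':
		return 3
	}
	return 0
}

// count4Mer fills a 256-entry table with the occurrences of each 4-mer code.
func count4Mer(seq []byte) [256]uint16 {
	var counts [256]uint16
	var code byte
	for i, b := range seq {
		// Shifting a byte left by 2 drops the oldest base automatically.
		code = code<<2 | encodeNucleotide(b)
		if i >= 3 {
			counts[code]++
		}
	}
	return counts
}

// common4Mer sums min(a[k], b[k]) over all 256 codes.
func common4Mer(a, b *[256]uint16) int {
	sum := 0
	for k := range a {
		if a[k] < b[k] {
			sum += int(a[k])
		} else {
			sum += int(b[k])
		}
	}
	return sum
}

// sum4Mer returns the total number of 4-mers recorded in a profile.
func sum4Mer(a *[256]uint16) int {
	sum := 0
	for _, c := range a {
		sum += int(c)
	}
	return sum
}

func main() {
	c1 := count4Mer([]byte("ACGTACGT"))
	c2 := count4Mer([]byte("ACGTTTTT"))
	fmt.Println(sum4Mer(&c1), sum4Mer(&c2), common4Mer(&c1, &c2)) // 5 5 1
}
```

A sequence of length `n` yields `n − 3` overlapping 4-mers, which is why both 8-base examples sum to 5.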
|
||||
|
||||
## Distance & Similarity Bounds
|
||||
|
||||
4. **`LCS4MerBounds(count1, count2)`**
|
||||
Estimates bounds for the *Longest Common Subsequence* (LCS) length between two sequences based on 4-mer profiles:
|
||||
- **Lower bound**: `common_kmers + (3 if common_kmers > 0 else 0)`
- **Upper bound**: `min_total + 3 − ceil((min_total − common_kmers)/4)`, where `min_total = min(total1, total2)`
|
||||
Leverages the fact that overlapping k-mers constrain possible alignments.
|
||||
|
||||
5. **`Error4MerBounds(count1, count2)`**
|
||||
Estimates bounds for *alignment errors* (e.g., mismatches + indels):
|
||||
- **Upper bound**: `max_total − common_kmers + 2 * floor((common_kmers + 5)/8)`
|
||||
- **Lower bound**: `ceil(upper_bound / 4)`
|
||||
Provides fast, approximate error estimates without full alignment.
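As a worked example, the bound formulas above can be evaluated directly from three integers: the two profile totals and the number of common 4-mers. The sketch below transcribes the formulas as written; the function names are hypothetical, and reading `max_total` as `max(total1, total2)` is an assumption.

```go
package main

import "fmt"

// ceilDiv returns ceil(a/b) for a >= 0 and b > 0.
func ceilDiv(a, b int) int { return (a + b - 1) / b }

// lcsBounds applies the two LCS formulas quoted above.
func lcsBounds(total1, total2, common int) (lower, upper int) {
	minTotal := total1
	if total2 < minTotal {
		minTotal = total2
	}
	lower = common
	if common > 0 {
		lower += 3
	}
	upper = minTotal + 3 - ceilDiv(minTotal-common, 4)
	return lower, upper
}

// errorBounds applies the two error formulas quoted above.
// Assumption: max_total means max(total1, total2).
func errorBounds(total1, total2, common int) (lower, upper int) {
	maxTotal := total1
	if total2 > maxTotal {
		maxTotal = total2
	}
	upper = maxTotal - common + 2*((common+5)/8) // integer division is floor here
	lower = ceilDiv(upper, 4)
	return lower, upper
}

func main() {
	lo, hi := lcsBounds(5, 5, 1)
	fmt.Println("LCS bounds:", lo, hi) // LCS bounds: 4 7
	lo, hi = errorBounds(5, 5, 1)
	fmt.Println("error bounds:", lo, hi) // error bounds: 1 4
}
```

With two 8-base sequences (5 four-mers each) sharing one 4-mer, the LCS is bracketed between 4 and 7, and the error count between 1 and 4.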
|
||||
|
||||
## Use Case
|
||||
|
||||
Designed for **high-performance comparison of NGS reads** (e.g., in metabarcoding), where exact alignment is too costly, and k-mer-based heuristics enable scalable similarity estimation.
|
||||
@@ -0,0 +1,44 @@
|
||||
# Semantic Description of the `obikmer` Package
|
||||
|
||||
This Go package implements a **De Bruijn graph** for efficient k-mer manipulation and sequence assembly, primarily used in bioinformatics (e.g., metagenomic read error correction or consensus building).
|
||||
|
||||
### Core Functionalities
|
||||
|
||||
- **K-mer Encoding**: K-mers are encoded as `uint64` using 2 bits per nucleotide (A=0, C=1, G=2, T=3), supporting IUPAC ambiguity codes via the `iupac` map.
|
||||
- **Reverse Complement Handling**: The `revcompnuc` table enables nucleotide-wise reverse complementation.
|
||||
- **Graph Construction**: The `DeBruijnGraph` struct maintains a map from k-mer hashes to integer weights (e.g., observed counts), with helper masks for bit manipulation (`kmermask`, `prevc/g/t`).
|
||||
|
||||
### Graph Operations
|
||||
|
||||
- **Node Queries**:
|
||||
- `Previouses()` / `Nexts()`: Return predecessor/successor k-mers in the graph.
|
||||
- `MaxNext()` / `MaxHead()`: Find neighbors or heads (sources) with maximum weight.
|
||||
- **Path Exploration**:
|
||||
- `MaxPath()`: Greedily traces the highest-weight path from a head.
|
||||
- `LongestPath()`: Explores all heads to find the path with maximum cumulative weight (optionally bounded in length).
|
||||
- `HaviestPath()`: Uses a Dijkstra-like priority queue to find the *heaviest* path (largest sum of weights); cycles are detected beforehand with a DFS (`HasCycle()`).
|
||||
|
||||
### Consensus & Filtering
|
||||
|
||||
- **Consensus Generation**:
|
||||
- `BestConsensus()` returns a sequence from the greedy max-weight path.
|
||||
- `LongestConsensus(id, min_cov)` trims low-coverage ends using a coverage threshold (mode-based).
|
||||
- **Weight Statistics**:
|
||||
- `MaxWeight()`, `WeightMean()`, `WeightMode()` provide distribution summaries.
|
||||
- `FilterMinWeight(min)` removes low-count nodes.
|
||||
- **Decoding**:
|
||||
- `DecodeNode()` converts a k-mer index to its DNA string.
|
||||
- `DecodePath()` reconstructs the full consensus from a path.
|
||||
|
||||
### I/O & Diagnostics
|
||||
|
||||
- **GML Export**: `WriteGml()` outputs a directed graph in Graph Modelling Language (for visualization), with edge thickness and labels reflecting weights.
|
||||
- **Hamming Distance**: `HammingDistance()` computes the Hamming distance (number of mismatching positions, no indels) between two encoded k-mers using bit operations.
|
||||
- **Sequence Insertion**: `Push()` adds a biosequence (with count weight) to the graph, expanding all IUPAC variants recursively.
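The bit-level Hamming distance mentioned above can be illustrated as follows. This is a sketch under the 2-bit encoding described earlier (A=0, C=1, G=2, T=3); `encode` and `hammingDistance` are illustrative names, not the package's functions.

```go
package main

import (
	"fmt"
	"math/bits"
)

// encode packs an upper-case A/C/G/T string into a uint64, 2 bits per base.
func encode(s string) uint64 {
	var v uint64
	for i := 0; i < len(s); i++ {
		v <<= 2
		switch s[i] {
		case 'C':
			v |= 1
		case 'G':
			v |= 2
		case 'T':
			v |= 3
		}
	}
	return v
}

// hammingDistance counts the positions at which two equal-length
// encoded k-mers carry different bases.
func hammingDistance(a, b uint64) int {
	x := a ^ b
	// Fold every 2-bit pair onto its low bit: any non-zero pair becomes 1.
	x = (x | x>>1) & 0x5555555555555555
	return bits.OnesCount64(x)
}

func main() {
	fmt.Println(hammingDistance(encode("ACGT"), encode("ACGA"))) // 1
	fmt.Println(hammingDistance(encode("AAAA"), encode("CCGT"))) // 4
}
```

XOR alone is not enough, because a single mismatching base can flip one or two bits; the fold step makes each mismatching base count exactly once.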
|
||||
|
||||
### Dependencies & Design
|
||||
|
||||
- Leverages `obiseq` for sequence representation, the third-party `logrus` library for logging, and `slices` and `container/heap` from Go’s standard library.
|
||||
- Designed for scalability and speed, using bit-level operations to minimize memory footprint.
|
||||
|
||||
Overall: a robust k-mer graph engine for *de novo* assembly, error correction, and consensus recovery in high-throughput sequencing data.
|
||||
@@ -0,0 +1,35 @@
|
||||
# Semantic Description of `obikmer` Package
|
||||
|
||||
The `obikmer` package provides efficient k-mer encoding and comparison utilities for biological sequences, optimized for DNA analysis.
|
||||
|
||||
## Core Functionalities
|
||||
|
||||
1. **Nucleotide Encoding**
|
||||
- `EncodeNucleotide(b byte)`: Maps DNA bases (A, C, G, T/U) to 2-bit values:
|
||||
`A→0`, `C→1`, `G→2`, `T/U→3`.
|
||||
Ambiguous or non-standard characters (e.g., N, R, Y) default to `A` (`0`).
|
||||
Uses a lookup table for O(1) performance.
|
||||
|
||||
2. **4-mer Encoding**
|
||||
- `Encode4mer(seq, buffer)`: Converts a biological sequence into overlapping 4-mers.
|
||||
Each k-mer is encoded as an unsigned byte (0–255), where each nucleotide contributes 2 bits.
|
||||
Supports optional buffer reuse for memory efficiency.
|
||||
|
||||
3. **4-mer Indexing**
|
||||
- `Index4mer(seq, index, buffer)`: Builds an inverted index mapping each 4-mer code (0–255) to all its occurrence positions in the sequence.
|
||||
Returns `[][]int`, where rows correspond to k-mer codes and columns list positions.
|
||||
|
||||
4. **Fast Sequence Comparison**
|
||||
- `FastShiftFourMer(...)`: Compares two sequences using a FASTA-like shift-scoring algorithm.
|
||||
- Uses precomputed 4-mer index of a reference sequence and encodes the query.
|
||||
- Counts co-occurring 4-mers across all possible shifts (`refpos − queryPos`).
|
||||
- Computes raw and relative scores (normalized by alignment length).
|
||||
- Returns optimal shift, count of matching 4-mers, and maximum score (raw or relative).
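The shift-voting idea behind the comparison can be sketched as below. This is a simplified illustration that returns only the best shift and its raw count; the relative (length-normalized) score is left out, and all names are hypothetical rather than taken from the package.

```go
package main

import "fmt"

// code maps A, C, G, T to 0..3 (anything else to 0).
func code(b byte) int {
	switch b {
	case 'C':
		return 1
	case 'G':
		return 2
	case 'T':
		return 3
	}
	return 0
}

// fourMers returns the 4-mer code found at each start position of seq.
func fourMers(seq string) []int {
	if len(seq) < 4 {
		return nil
	}
	out := make([]int, 0, len(seq)-3)
	c := 0
	for i := 0; i < len(seq); i++ {
		c = (c<<2 | code(seq[i])) & 0xFF
		if i >= 3 {
			out = append(out, c)
		}
	}
	return out
}

// bestShift indexes the reference 4-mers, then lets every query 4-mer vote
// for each shift (refPos - queryPos) at which the same 4-mer occurs.
// Ties are broken in favour of the smallest shift.
func bestShift(ref, query string) (shift, count int) {
	index := make(map[int][]int)
	for pos, c := range fourMers(ref) {
		index[c] = append(index[c], pos)
	}
	votes := make(map[int]int)
	for qpos, c := range fourMers(query) {
		for _, rpos := range index[c] {
			votes[rpos-qpos]++
		}
	}
	for s, n := range votes {
		if n > count || (n == count && s < shift) {
			shift, count = s, n
		}
	}
	return shift, count
}

func main() {
	shift, count := bestShift("TTTTACGTACGGA", "ACGTACGG")
	fmt.Println(shift, count) // 4 5
}
```

Here the query matches the reference starting at offset 4, so all five query 4-mers vote for shift 4, while a repeated `TACG` adds a single stray vote for shift 0.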
|
||||
|
||||
## Design Highlights
|
||||
|
||||
- **Memory-aware**: Supports buffer reuse to minimize allocations.
|
||||
- **Robustness**: Non-canonical bases handled gracefully (defaulting to A).
|
||||
- **Performance-oriented**: O(n) encoding and indexing; efficient hash-based shift counting.
|
||||
|
||||
Intended for rapid alignment-free sequence comparison in metabarcoding or metagenomic workflows.
|
||||
@@ -0,0 +1,39 @@
|
||||
# Semantic Description of `obikmer` Package
|
||||
|
||||
The `obikmer` package provides high-performance, zero-allocation utilities for **k-mer manipulation** in DNA sequences (A/C/G/T/U), targeting bioinformatics applications like genome indexing, assembly, and error correction.
|
||||
|
||||
## Core Encoding & Decoding
|
||||
|
||||
- **`EncodeKmer`, `DecodeKmer`**: Convert between DNA sequences and compact `uint64` representations (2 bits/base, at most 62 bits of sequence), reserving the top 2 bits for optional error markers.
|
||||
- **`EncodeCanonicalKmer`, `CanonicalKmer`**: Encode or normalize k-mers to their *biological canonical form* — the lexicographically smaller of a k-mer and its reverse complement.
|
||||
|
||||
## Iterators (Memory-Efficient Streaming)
|
||||
|
||||
- **`IterKmers`, `IterCanonicalKmers`**: Stream all overlapping k-mers from a sequence without allocating intermediate slices — ideal for large-scale processing (e.g., inserting into Roaring Bitmaps).
|
||||
- **`IterCanonicalKmersWithErrors`**: Same as above, but detects ambiguous bases (N/R/Y/W/S/K/M/B/D/H/V) and encodes their count in the top 2 bits (error code: 0–3). Only valid for **odd k ≤ 31**.
|
||||
|
||||
## Error Handling & Markers
|
||||
|
||||
- `SetKmerError`, `GetKmerError`, and `ClearKmerError` manipulate the top 2 bits of a uint64 to store error metadata (e.g., ambiguous base count), enabling downstream filtering or correction.
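A minimal sketch of how a 2-bit marker can live in the top bits of a `uint64` is shown below. The lower-case function names and the constants are assumptions made for illustration; only the layout (62 sequence bits, 2 marker bits) comes from the description above.

```go
package main

import "fmt"

const (
	errorShift = 62                      // marker occupies bits 62 and 63
	errorMask  = uint64(3) << errorShift // 0xC000000000000000
)

// setKmerError stores a 2-bit code (0..3) in the top bits, keeping the sequence bits.
func setKmerError(kmer, code uint64) uint64 {
	return kmer&^errorMask | (code&3)<<errorShift
}

// getKmerError extracts the 2-bit code.
func getKmerError(kmer uint64) uint64 { return kmer >> errorShift }

// clearKmerError resets the marker and returns the bare k-mer.
func clearKmerError(kmer uint64) uint64 { return kmer &^ errorMask }

func main() {
	kmer := uint64(0x1B) // ACGT with A=0, C=1, G=2, T=3
	tagged := setKmerError(kmer, 2)
	fmt.Println(getKmerError(tagged), clearKmerError(tagged) == kmer) // 2 true
}
```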
|
||||
|
||||
## Reverse Complement & Circular Normalization
|
||||
|
||||
- **`ReverseComplement`, `CanonicalKmer`**: Compute biological reverse complement and canonical form.
|
||||
- **`NormalizeCircular`, `EncodeCircularCanonicalKmer`**: Compute *circular canonical form* — the lexicographically smallest rotation (used for low-complexity masking).
|
||||
- Distinction: `CanonicalKmer` uses **reverse complement**, while `NormalizeCircular` uses **rotation**.
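The difference between the two normalizations can be made concrete with a short sketch. It assumes the 2-bit encoding A=0, C=1, G=2, T=3 and `k ≤ 31`; the lower-case names are stand-ins, and the loop-based reverse complement is chosen for readability rather than speed.

```go
package main

import "fmt"

// encodeKmer packs an upper-case A/C/G/T string into a uint64.
func encodeKmer(s string) uint64 {
	var v uint64
	for i := 0; i < len(s); i++ {
		v <<= 2
		switch s[i] {
		case 'C':
			v |= 1
		case 'G':
			v |= 2
		case 'T':
			v |= 3
		}
	}
	return v
}

// reverseComplement reads bases from the low end and complements each one
// (with this encoding the complement of base b is 3 - b).
func reverseComplement(kmer uint64, k int) uint64 {
	var rc uint64
	for i := 0; i < k; i++ {
		rc = rc<<2 | (3 - kmer&3)
		kmer >>= 2
	}
	return rc
}

// canonicalKmer returns the smaller of a k-mer and its reverse complement.
func canonicalKmer(kmer uint64, k int) uint64 {
	if rc := reverseComplement(kmer, k); rc < kmer {
		return rc
	}
	return kmer
}

// normalizeCircular returns the smallest of the k rotations of a k-mer.
func normalizeCircular(kmer uint64, k int) uint64 {
	mask := uint64(1)<<(2*k) - 1
	best := kmer
	for i := 1; i < k; i++ {
		// Rotate left by one base: the top base wraps around to the bottom.
		kmer = (kmer<<2 | kmer>>(2*(k-1))) & mask
		if kmer < best {
			best = kmer
		}
	}
	return best
}

func main() {
	fmt.Println(canonicalKmer(encodeKmer("TTTT"), 4))                         // 0 (AAAA)
	fmt.Println(normalizeCircular(encodeKmer("TACG"), 4) == encodeKmer("ACGT")) // true
}
```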
|
||||
|
||||
## Counting & Math Utilities
|
||||
|
||||
- **`CanonicalCircularKmerCount`, `necklaceCount`, etc.**: Compute exact counts of unique circular k-mer equivalence classes using **Moreau’s necklace formula**, with Euler's totient function and divisor enumeration.
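For reference, the necklace formula counts rotation classes of words of length `n` over an alphabet of size `a` as `(1/n) · Σ φ(d) · a^(n/d)`, summed over the divisors `d` of `n`. The sketch below implements only this plain rotation count; whether the package's `CanonicalCircularKmerCount` applies further identifications is not stated here, so treat the names and scope as assumptions.

```go
package main

import "fmt"

// eulerPhi returns Euler's totient of n (n >= 1).
func eulerPhi(n int) int {
	result := n
	for p := 2; p*p <= n; p++ {
		if n%p == 0 {
			for n%p == 0 {
				n /= p
			}
			result -= result / p
		}
	}
	if n > 1 {
		result -= result / n
	}
	return result
}

// pow returns base^exp for small non-negative exponents.
func pow(base, exp int) int {
	r := 1
	for i := 0; i < exp; i++ {
		r *= base
	}
	return r
}

// necklaceCount returns the number of distinct words of length n over an
// alphabet of the given size, counted up to rotation.
func necklaceCount(n, alphabet int) int {
	sum := 0
	for d := 1; d <= n; d++ {
		if n%d == 0 {
			sum += eulerPhi(d) * pow(alphabet, n/d)
		}
	}
	return sum / n
}

func main() {
	for n := 1; n <= 4; n++ {
		fmt.Println(n, necklaceCount(n, 4)) // 1 4, 2 10, 3 24, 4 70
	}
}
```

For DNA words of length 2, for instance, the 16 dinucleotides collapse to 10 classes: 4 homopolymers plus 6 unordered pairs of distinct bases.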
|
||||
|
||||
## Performance & Safety
|
||||
|
||||
- All functions avoid heap allocations where possible (reusing buffers).
|
||||
- Panics on invalid `k` or length mismatches for correctness.
|
||||
- Supports case-insensitive input (A/a, T/t…), and ambiguous bases via `__single_base_code_err__`.
|
||||
|
||||
## Use Cases
|
||||
|
||||
- K-mer counting in assemblers (e.g., with Bloom filters or bitmaps)
|
||||
- Error-aware k-mer filtering in sequencing pipelines
|
||||
- Low-complexity region detection via circular entropy normalization
|
||||
@@ -0,0 +1,36 @@
|
||||
# Obikmer: Efficient K-mer Encoding and Manipulation in Go
|
||||
|
||||
This package provides high-performance utilities for DNA sequence analysis using *k*-mers—contiguous substrings of length `k`. It supports encoding, canonicalization (forward/reverse-complement normalization), minimizer-based super-*k*-mer extraction, and error tagging—all optimized for 64-bit integer arithmetic.
|
||||
|
||||
## Core Functionalities
|
||||
|
||||
### K-mer Encoding (`EncodeKmers`, `IterKmers`)
|
||||
Encodes DNA sequences (A/C/G/T/U, case-insensitive) into `uint64` using 2 bits per nucleotide (A=00, C=01, G=10, T/U=11). Supports sliding-window extraction and streaming via an iterator. Supports k-mers up to `k = 31` (62 bits) and validates `k`.
|
||||
|
||||
### Reverse Complement (`ReverseComplement`)
|
||||
Computes the reverse complement of a *k*-mer in constant time using bit manipulation. Preserves error metadata (see below) and satisfies involution: `RC(RC(x)) = x`.
|
||||
|
||||
### Canonical K-mers (`CanonicalKmer`, `EncodeCanonicalKmers`)
|
||||
Returns the lexicographically smaller of a *k*-mer and its reverse complement—enabling strand-agnostic analysis. Supports both single-kmer normalization (`CanonicalKmer`) and full-sequence canonical encoding.
|
||||
|
||||
### Super *k*-mers Extraction (`ExtractSuperKmers`)
|
||||
Groups overlapping *k*-mers sharing the same minimizer (minimal *m*-mer in sliding window) into contiguous regions ("super *k*-mers"). Output includes start/end positions and minimizer values, all canonicalized.
|
||||
|
||||
### Error Marking (`SetKmerError`, `GetKmerError`, etc.)
|
||||
Uses the top 2 bits of a `uint64` to tag error states (e.g., sequencing errors), leaving 62 bits for sequence data. Error operations preserve the underlying *k*-mer and work seamlessly with canonicalization/RC.
|
||||
|
||||
## Key Features
|
||||
|
||||
- **Memory Efficiency**: Reusable buffers via optional `*[]uint64` or `*[]SuperKmer` parameters.
|
||||
- **Edge Case Handling**: Gracefully handles empty sequences, `k > len(seq)`, invalid parameters (`m ≥ k`), and max-length constraints.
|
||||
- **Performance**: Optimized for speed—benchmarks included for all major functions (e.g., `BenchmarkEncodeKmers`, `BenchmarkExtractSuperKmers`).
|
||||
- **Comprehensive Testing**: Covers basic cases, boundary conditions (e.g., 31-mers), symmetry properties (canonical/RC invariance), and error resilience.
|
||||
|
||||
## Use Cases
|
||||
|
||||
- Genome assembly & de Bruijn graph (DBG) construction
|
||||
- Minimizer-based sketching (related in spirit to MinHash tools such as *Mash* and *sourmash*)
|
||||
- Error-aware k-mer counting & filtering
|
||||
- Strand-unbiased sequence comparison
|
||||
|
||||
All functions operate on `[]byte` DNA sequences and return canonicalized, efficient representations suitable for hashing or indexing.
|
||||
@@ -0,0 +1,31 @@
|
||||
# Semantic Description of `obikmer` Entropy Functions
|
||||
|
||||
The `obikmer` package provides high-performance tools to compute **Shannon entropy** for DNA *k*-mers, with a focus on detecting low-complexity sequences via sub-word repetition analysis.
|
||||
|
||||
## Core Functionality
|
||||
|
||||
- **`KmerEntropy(kmer, k, levelMax)`**:
|
||||
Computes the *minimum normalized Shannon entropy* across all sub-word sizes from `1` to `levelMax`.
|
||||
- Decodes the encoded *k*-mer (2 bits/base) into a DNA string.
|
||||
- For each word size `ws`, extracts all overlapping substrings, normalizes them to their **circular canonical form**, and counts frequencies.
|
||||
- Normalized entropy = `(log(N) − Σ(nᵢ log nᵢ)/N) / emax`, where `emax` is the theoretical max entropy given sequence length and alphabet constraints.
|
||||
- Returns min entropy across `ws ∈ [1, levelMax]`. Values near **0** indicate repeats (e.g., `AAAAA…`); values near **1** suggest high complexity.
|
||||
|
||||
- **`KmerEntropyFilter`**:
|
||||
A reusable, precomputed filter for batch processing millions of *k*-mers efficiently:
|
||||
- Pre-builds normalization tables (for circular canonical forms), entropy lookup values (`emax`, `logNwords`), and frequency tables.
|
||||
- Avoids repeated allocations — critical for performance in pipelines (e.g., read filtering).
|
||||
- **Not goroutine-safe** — each thread must instantiate its own filter.
|
||||
|
||||
- **`NewKmerEntropyFilter(k, levelMax, threshold)`**:
|
||||
Initializes a filter with precomputed tables and sets the entropy rejection `threshold`.
|
||||
|
||||
- **`Accept(kmer)` / `Entropy(kmer)`**:
|
||||
- `Accept()` returns `true` if entropy > threshold (i.e., *k*-mer is complex enough to pass).
|
||||
- `Entropy()` computes entropy using precomputed tables — ~10× faster than standalone calls.
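The entropy computation described above can be sketched on plain strings. Two simplifications are assumed here and differ from what the package may do: words are normalized by taking their smallest rotation as a string, and `emax` is approximated as `log(min(N, 4^ws))`, which ignores the reduction in distinct classes caused by circular normalization. Function names are illustrative.

```go
package main

import (
	"fmt"
	"math"
)

// smallestRotation returns the lexicographically smallest rotation of w.
func smallestRotation(w string) string {
	best := w
	for i := 1; i < len(w); i++ {
		if r := w[i:] + w[:i]; r < best {
			best = r
		}
	}
	return best
}

// wordEntropy computes (log N - sum(n_i log n_i)/N) / emax for words of size ws.
// Assumption: emax = log(min(N, 4^ws)).
func wordEntropy(seq string, ws int) float64 {
	n := len(seq) - ws + 1
	if n < 2 {
		return 0
	}
	counts := make(map[string]int)
	for i := 0; i < n; i++ {
		counts[smallestRotation(seq[i:i+ws])]++
	}
	sum := 0.0
	for _, c := range counts {
		sum += float64(c) * math.Log(float64(c))
	}
	h := math.Log(float64(n)) - sum/float64(n)
	emax := math.Log(math.Min(float64(n), math.Pow(4, float64(ws))))
	return h / emax
}

// minEntropy returns the minimum normalized entropy over word sizes 1..levelMax.
func minEntropy(seq string, levelMax int) float64 {
	minH := 1.0
	for ws := 1; ws <= levelMax && ws < len(seq); ws++ {
		if h := wordEntropy(seq, ws); h < minH {
			minH = h
		}
	}
	return minH
}

func main() {
	fmt.Printf("%.3f\n", minEntropy("AAAAAAAAAAAA", 3))
	fmt.Printf("%.3f\n", minEntropy("ACGTTGCAGATC", 3))
}
```

A homopolymer has a single word class at every size, so its entropy is zero, while a mixed sequence scores well above it.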
|
||||
|
||||
## Design Highlights
|
||||
|
||||
- **Circular canonical normalization** ensures symmetry (e.g., `AT` ≡ `TA`).
|
||||
- **Sub-word-level entropy** captures local repetitiveness better than global *k*-mer uniqueness.
|
||||
- Optimized for **speed and memory reuse**, suitable for large-scale genomic data filtering.
|
||||
@@ -0,0 +1,37 @@
|
||||
# K-Way Merge for Sorted k-mer Streams
|
||||
|
||||
This Go package implements a **k-way merge** over multiple sorted streams of *k*-mer values (`uint64`). It leverages a **min-heap** to efficiently produce the globally sorted sequence while aggregating duplicate counts across input streams.
|
||||
|
||||
## Core Components
|
||||
|
||||
- **`mergeItem`**: Stores a value and its source reader index for heap operations.
|
||||
- **`mergeHeap`** & `heap.Interface`: Implements a min-heap for efficient retrieval of smallest values.
|
||||
- **`KWayMerge`**: Main struct managing the heap and input readers.
|
||||
|
||||
## Key Functionality
|
||||
|
||||
- **Initialization (`NewKWayMerge`)**:
|
||||
- Takes a slice of `*KdiReader`, each expected to yield sorted values.
|
||||
- Preloads the heap with one value from each reader.
|
||||
|
||||
- **Streaming Output (`Next`)**:
|
||||
- Returns the next smallest *k*-mer, its frequency across readers (i.e., how many input streams contained it), and a success flag.
|
||||
- Handles duplicates: pops *all* items equal to the current minimum before advancing readers.
|
||||
|
||||
- **Cleanup (`Close`)**:
|
||||
- Closes all underlying `KdiReader`s and returns the first encountered error.
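The merge logic described above can be illustrated with Go's `container/heap`. In this sketch, sorted in-memory slices stand in for the `KdiReader` streams, and the lower-case `kWayMerge` type is an assumption rather than the package's implementation. Each input is assumed to be strictly increasing.

```go
package main

import (
	"container/heap"
	"fmt"
)

// mergeItem pairs a value with the index of the stream it came from.
type mergeItem struct {
	value uint64
	src   int
}

// mergeHeap is a min-heap of mergeItem ordered by value.
type mergeHeap []mergeItem

func (h mergeHeap) Len() int           { return len(h) }
func (h mergeHeap) Less(i, j int) bool { return h[i].value < h[j].value }
func (h mergeHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *mergeHeap) Push(x any)        { *h = append(*h, x.(mergeItem)) }
func (h *mergeHeap) Pop() any {
	old := *h
	it := old[len(old)-1]
	*h = old[:len(old)-1]
	return it
}

// kWayMerge merges sorted slices (stand-ins for KdiReader streams).
type kWayMerge struct {
	h       mergeHeap
	streams [][]uint64
	pos     []int // next unread position in each stream
}

func newKWayMerge(streams [][]uint64) *kWayMerge {
	m := &kWayMerge{streams: streams, pos: make([]int, len(streams))}
	for i, s := range streams {
		if len(s) > 0 { // preload one value per non-empty stream
			m.h = append(m.h, mergeItem{value: s[0], src: i})
			m.pos[i] = 1
		}
	}
	heap.Init(&m.h)
	return m
}

// Next returns the smallest remaining value, the number of streams that
// contained it, and false once every stream is exhausted.
func (m *kWayMerge) Next() (uint64, int, bool) {
	if m.h.Len() == 0 {
		return 0, 0, false
	}
	minVal, count := m.h[0].value, 0
	for m.h.Len() > 0 && m.h[0].value == minVal {
		it := heap.Pop(&m.h).(mergeItem)
		count++
		if p := m.pos[it.src]; p < len(m.streams[it.src]) {
			heap.Push(&m.h, mergeItem{value: m.streams[it.src][p], src: it.src})
			m.pos[it.src]++
		}
	}
	return minVal, count, true
}

func main() {
	m := newKWayMerge([][]uint64{{1, 3, 5}, {3, 4, 5}, {5}})
	for v, c, ok := m.Next(); ok; v, c, ok = m.Next() {
		fmt.Println(v, c)
	}
}
```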
|
||||
|
||||
## Use Case
|
||||
|
||||
Ideal for merging sorted *k*-mer databases (e.g., from multiple files or processes), enabling:
|
||||
- Efficient deduplication with multiplicity tracking.
|
||||
- Scalable union/intersection operations on large *k*-mer sets.
|
||||
|
||||
## Complexity
|
||||
|
||||
| Operation | Time |
|-----------|------|
| `Next()` | *O(log k)* (heap ops per unique value) |
| Init | *O(k)* |
|
||||
|
||||
Where `k` = number of input readers.
|
||||
@@ -0,0 +1,27 @@
|
||||
# K-Way Merge Functionality in `obikmer`
|
||||
|
||||
This Go package provides utilities for merging sorted k-mer streams stored in `.kdi` files. Its core component is the `KWayMerge`, which performs a k-way merge of multiple sorted input streams, aggregating duplicate k-mers by counting their occurrences.
|
||||
|
||||
## Key Features
|
||||
|
||||
- **Sorted K-Mer Input**: Reads k-mers from `.kdi` files via `KdiReader`, assuming each file contains *sorted* 64-bit unsigned integers (`uint64`).
|
||||
- **K-Way Merge**: Merges multiple sorted streams into a single globally sorted stream using an efficient priority queue (min-heap) internally.
|
||||
- **Count Aggregation**: When identical k-mers appear across multiple streams, the merge counts how many times each unique k-mer occurs.
|
||||
- **Memory-Efficient Streaming**: Processes data incrementally, avoiding full loading of all streams into memory.
|
||||
- **Robust Test Coverage**: Includes unit tests for:
|
||||
- Basic merging with overlapping and non-overlapping values.
|
||||
- Single-stream input (degenerate case).
|
||||
- Empty streams handling.
|
||||
- All identical k-mers across inputs.
|
||||
|
||||
## API Highlights
|
||||
|
||||
- `NewKdiReader(path)` — opens a `.kdi` file for reading.
|
||||
- `writeKdi(...)` (test helper) — writes sorted k-mers to a `.kdi` file.
|
||||
- `NewKWayMerge([]*KdiReader)` — constructs the merger from multiple readers.
|
||||
- `.Next()` → `(kmer uint64, count int, ok bool)` — yields next merged k-mer and its frequency; `ok=false` signals end-of-stream.
|
||||
- `.Close()` — cleans up resources.
|
||||
|
||||
## Use Case
|
||||
|
||||
Ideal for aggregating k-mer counts across multiple sequencing samples (e.g., in bioinformatics), where each sample’s k-mers are pre-sorted and persisted, enabling scalable distributed counting without full in-memory deduplication.
|
||||
@@ -0,0 +1,27 @@
|
||||
# KDI Reader: Streaming Delta-Varint Decoding for k-mers
|
||||
|
||||
The `obikmer` package provides a high-performance, streaming reader for `.kdi` files—binary containers storing *sorted* k-mers (typically DNA substrings encoded as 64-bit integers). It supports both sequential and indexed access.
|
||||
|
||||
## Core Features
|
||||
|
||||
- **Streaming decoding**: K-mers are read incrementally using delta-varint compression to minimize I/O and memory footprint.
|
||||
- **Delta encoding**: After the first absolute `uint64`, subsequent values are stored as *deltas* (difference from previous), encoded via custom `DecodeVarint`.
|
||||
- **Magic & format validation**: A 4-byte magic header identifies the format; a little-endian `uint64` stores the total k-mer count.
|
||||
- **Sparse indexing**: When paired with a `.kdx` index, `SeekTo(target)` enables fast forward-only jumps to positions ≥ target k-mer.
|
||||
- **Graceful fallback**: If `.kdx` is missing or invalid, the reader automatically degrades to sequential mode.
|
||||
|
||||
## Key API
|
||||
|
||||
- `NewKdiReader(path)` → opens `.kdi` for streaming (no index).
|
||||
- `NewKdiIndexedReader(path)` → opens with optional `.kdx` for random access.
|
||||
- `Next()` → returns `(nextKmer, true)` or `(0, false)` when exhausted.
|
||||
- `SeekTo(target uint64) error` → jumps to first k-mer ≥ target using index (no backward seek).
|
||||
- `Count()` / `Remaining()` → total and unread k-mers.
|
||||
- `Close()` → releases file handle.
|
||||
|
||||
## Design Highlights
|
||||
|
||||
- Uses 64 KB buffer for efficient I/O.
|
||||
- Index entries record `(kmer, byteOffset)` at fixed strides (e.g., every 1024 k-mers).
|
||||
- `SeekTo` is idempotent and safe: no-op if target ≤ current position or index unavailable.
|
||||
- Designed for large-scale genomic k-mer catalogs (e.g., from minimizers or de Bruijn graphs).
|
||||
@@ -0,0 +1,34 @@
|
||||
# KDI File Format and API
|
||||
|
||||
The `obikmer` package implements a compact, sorted k-mer storage format (`.kdi`) with delta compression for efficient disk persistence and retrieval.
|
||||
|
||||
## Core Features
|
||||
|
||||
- **Sorted k-mer serialization**: K-mers (as `uint64`) are written in ascending order.
|
||||
- **Delta encoding**: Consecutive differences (deltas) between k-mers are stored using variable-length integers (`varint`), drastically reducing size for dense sequences.
|
||||
- **Round-trip integrity**: Full write/read cycles preserve exact k-mer values and counts.
|
||||
|
||||
## File Structure
|
||||
|
||||
A `.kdi` file contains:
|
||||
1. **Magic header** (4 bytes): Identifies the format.
|
||||
2. **Count field** (8 bytes, `uint64`): Number of stored k-mers.
|
||||
3. **First value** (8 bytes, `uint64`): Initial k-mer.
|
||||
4. **Delta-encoded tail**: `(n−1)` deltas, each encoded as a `varint`.
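The layout above can be reproduced in memory to check the sizes quoted in this document. This sketch makes several assumptions: the magic bytes are taken as `KDI\x01`, integers are little-endian, deltas use the standard unsigned varint from `encoding/binary` (the package mentions its own varint routines), and an empty file stops after the count field. The function names are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

var magic = []byte{'K', 'D', 'I', 1}

// encodeKdi serializes strictly increasing k-mers:
// magic (4) + count (8) + first value (8) + one varint per delta.
func encodeKdi(kmers []uint64) []byte {
	var buf bytes.Buffer
	var u64 [8]byte
	buf.Write(magic)
	binary.LittleEndian.PutUint64(u64[:], uint64(len(kmers)))
	buf.Write(u64[:])
	if len(kmers) == 0 {
		return buf.Bytes()
	}
	binary.LittleEndian.PutUint64(u64[:], kmers[0])
	buf.Write(u64[:])
	var tmp [binary.MaxVarintLen64]byte
	for i := 1; i < len(kmers); i++ {
		n := binary.PutUvarint(tmp[:], kmers[i]-kmers[i-1])
		buf.Write(tmp[:n])
	}
	return buf.Bytes()
}

// decodeKdi reverses encodeKdi.
func decodeKdi(data []byte) ([]uint64, error) {
	if len(data) < 12 || !bytes.Equal(data[:4], magic) {
		return nil, errors.New("bad header")
	}
	count := binary.LittleEndian.Uint64(data[4:12])
	if count == 0 {
		return nil, nil
	}
	if len(data) < 20 {
		return nil, errors.New("truncated first value")
	}
	prev := binary.LittleEndian.Uint64(data[12:20])
	out := []uint64{prev}
	rest := data[20:]
	for uint64(len(out)) < count {
		delta, n := binary.Uvarint(rest)
		if n <= 0 {
			return nil, errors.New("truncated delta")
		}
		prev += delta
		out = append(out, prev)
		rest = rest[n:]
	}
	return out, nil
}

func main() {
	data := encodeKdi([]uint64{10, 12, 500, 1 << 40})
	back, err := decodeKdi(data)
	fmt.Println(len(data), back, err)
}
```

A single k-mer occupies 4 + 8 + 8 = 20 bytes, and every further k-mer costs only as many bytes as its delta needs, one byte for deltas below 128.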
|
||||
|
||||
## API
|
||||
|
||||
- **`NewKdiWriter(path string)`**: Creates a writer; `Write(v uint64)` appends k-mers.
|
||||
- **`Writer.Count()`**: Returns the number of written items before closing.
|
||||
- **`NewKdiReader(path string)`**: Opens a reader; `Next() (uint64, bool)` yields k-mers in order.
|
||||
- **`Reader.Count()`**: Returns total stored count.
|
||||
|
||||
## Tests Validate
|
||||
|
||||
1. Basic round-trip with diverse values (including large `uint64`s).
|
||||
2. Empty and single-k-mer files.
|
||||
3. Exact file size for minimal cases (e.g., 20 bytes for one k-mer).
|
||||
4. Delta compression efficiency on dense sequences (e.g., 10k even numbers → ~9,999 extra bytes).
|
||||
5. Real-world usage: extracting canonical k-mers from DNA sequences, sorting/deduplicating, and persisting them.
|
||||
|
||||
The format is optimized for memory-mapped access or streaming traversal in bioinformatics pipelines.
|
||||
@@ -0,0 +1,38 @@
|
||||
# KDI File Format and Writer
|
||||
|
||||
The `obikmer` package implements a compact, sorted sequence storage format for 64-bit k-mers using delta encoding and sparse indexing.
|
||||
|
||||
## Core Format (`.kdi`)
|
||||
|
||||
- **Magic header**: `KDI\x01` (`4 bytes`) identifies the file type.
|
||||
- **Count field**: `uint64 LE`, total number of k-mers (patched at close).
|
||||
- **First value**: `uint64 LE`, the initial k-mer stored as an absolute integer.
|
||||
- **Deltas**: Subsequent values encoded via *delta-varint* (difference from previous k-mer), enabling high compression for sorted sequences.
|
||||
|
||||
## Writer (`KdiWriter`)
|
||||
|
||||
- **Strict ordering**: K-mers must be written in *strictly increasing order*.
|
||||
- Efficient buffering via `bufio.Writer` (64 KB buffer).
|
||||
- Internally tracks:
|
||||
- Current k-mer count,
|
||||
- Previous value (for delta computation),
|
||||
- Bytes written in data section.
|
||||
- **Sparse indexing**: Every `defaultKdxStride` k-mers, an entry is recorded in memory for random access.
|
||||
|
||||
## Companion Index (`.kdx`)
|
||||
|
||||
- Written automatically on `Close()` if indexing entries exist.
|
||||
- Stores `(kmer, file_offset)` pairs for fast seek-to-position lookups (e.g., binary search on k-mer range).
|
||||
- Enables efficient random access without full file scan.
|
||||
|
||||
## Usage Pattern
|
||||
|
||||
```go
w, _ := obikmer.NewKdiWriter("data.kdi") // error ignored for brevity
for _, kmer := range sortedKMers {
	w.Write(kmer) // k-mers must arrive in strictly increasing order
}
w.Close() // finalizes header, writes .kdx index
```
|
||||
|
||||
The format is optimized for memory-efficient storage and fast retrieval of sorted uint64 k-mers in genomic or sequence analysis pipelines.
|
||||
@@ -0,0 +1,29 @@
|
||||
# KDX Index Format and Functionality
|
||||
|
||||
The `obikmer` package provides a sparse indexing mechanism for `.kdi` files (sorted k-mers stored with delta encoding). The **`.kdx` file** serves as a fast lookup table to accelerate k-mer searches.
|
||||
|
||||
## Core Concepts
|
||||
|
||||
- **Magic bytes**: `KDX\x01` identifies the file as a `.kdx` index and is checked on load.
|
||||
- **Stride-based sparsity**: One index entry every *N* k-mers (default: 4096), balancing memory vs. search speed.
|
||||
- **Entry structure**: Each entry stores:
|
||||
- `kmer`: the k-mer value at that index position.
|
||||
- `offset`: absolute byte offset in the corresponding `.kdi` file.
|
||||
|
||||
## Key Operations
|
||||
|
||||
- **Loading**: `LoadKdxIndex()` reads and validates a `.kdx` file; returns `(nil, nil)` if missing (graceful degradation).
|
||||
- **Searching**: `FindOffset(target uint64)` performs binary search over index entries to find the *best jump point*:
|
||||
- Returns `offset`, `skipCount` (k-mer count already passed), and a boolean success flag.
|
||||
- Enables efficient seeking: after `offset`, only up to *stride* k-mers need linear scanning.
|
||||
- **Writing**: `WriteKdxIndex()` serializes an in-memory index to disk (for building indexes).
|
||||
- **Helper**: `KdxPathForKdi()` derives the `.kdx` path from a given `.kdi` file.
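The jump-point search can be sketched with `sort.Search`. The types and the method name below are hypothetical, and the sketch assumes that entry `j` describes the k-mer at position `j × stride`; the real index may number its entries differently.

```go
package main

import (
	"fmt"
	"sort"
)

// kdxEntry is one sparse index entry: a k-mer and its byte offset in the .kdi file.
type kdxEntry struct {
	kmer   uint64
	offset uint64
}

// kdxIndex holds entries sorted by k-mer, one per stride k-mers.
type kdxIndex struct {
	stride  int
	entries []kdxEntry
}

// findOffset returns the offset of the last entry whose k-mer is <= target,
// the number of k-mers that precede that entry, and whether one exists.
func (ix *kdxIndex) findOffset(target uint64) (offset uint64, skipCount int, ok bool) {
	// Index of the first entry strictly greater than target.
	i := sort.Search(len(ix.entries), func(i int) bool {
		return ix.entries[i].kmer > target
	})
	if i == 0 {
		return 0, 0, false // target precedes every indexed k-mer
	}
	e := ix.entries[i-1]
	return e.offset, (i - 1) * ix.stride, true
}

func main() {
	ix := &kdxIndex{stride: 4, entries: []kdxEntry{{10, 20}, {50, 31}, {90, 44}}}
	fmt.Println(ix.findOffset(60)) // 31 4 true
}
```

After jumping to the returned offset, a reader scans at most `stride` k-mers linearly before reaching the first value greater than or equal to the target.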
|
||||
|
||||
## Performance
|
||||
|
||||
- Search complexity: **O(log M)** for the binary search (where *M* = #index entries), plus ≤ stride linear steps.
|
||||
- Memory footprint: Linear in index size (16 bytes per entry), highly scalable for large k-mer sets.
|
||||
|
||||
## Design Philosophy
|
||||
|
||||
Minimalist, binary-safe format with explicit endianness (little-endian), no external dependencies beyond `encoding/binary`, and robust error handling.
|
||||
@@ -0,0 +1,14 @@
|
||||
# Semantic Description of `obikmer` Package
|
||||
|
||||
The `obikmer` package implements efficient k-mer matching between query sequences and an indexed reference using **canonical k-mers** partitioned by minimizer-based hashing.
|
||||
|
||||
- `QueryEntry` represents a single canonical k‑mer with its origin: sequence index and 1-based position.
|
||||
- `PreparedQueries` groups queries into sorted buckets per partition, enabling batched and parallelized matching.
|
||||
- `PrepareQueries` scans input sequences using *super-kmers* (with window size `m`) to compute minimizers, assigns each k‑mer to a partition via modulo hashing, and sorts buckets by k‑mer value.
|
||||
- `MergeQueries` combines two sets of prepared queries across batches using a merge-sort strategy, correctly offsetting sequence indices to preserve global ordering.
|
||||
- `MatchBatch` performs parallel matching per partition: each goroutine runs a **merge-scan** between sorted queries and the corresponding KDI (K-mer Disk Index) stream.
|
||||
- Efficient seeking is used only when beneficial, avoiding costly syscalls for small skips.
|
||||
- Matches are recorded with thread-safe per-sequence mutexes; final positions within each sequence are sorted post-match.
|
||||
- `matchPartition` implements the core merge-scan: it opens a KDI reader, seeks to relevant regions of the index, and walks both query list and k‑mer stream in lockstep.
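The merge-scan at the heart of the matching step is a two-pointer walk over two sorted sequences. The sketch below uses an in-memory slice in place of the KDI stream and omits seeking, partitioning, and concurrency; the type and function names are assumptions modelled on the description above.

```go
package main

import "fmt"

// queryEntry is a canonical k-mer with its origin.
type queryEntry struct {
	kmer   uint64
	seqIdx int // index of the query sequence
	pos    int // 1-based position in that sequence
}

// mergeScan walks sorted queries and a sorted k-mer stream in lockstep and
// returns every query whose k-mer is present in the stream.
func mergeScan(queries []queryEntry, index []uint64) []queryEntry {
	var hits []queryEntry
	i, j := 0, 0
	for i < len(queries) && j < len(index) {
		switch {
		case queries[i].kmer < index[j]:
			i++ // query k-mer absent from the index
		case queries[i].kmer > index[j]:
			j++ // advance the index stream
		default:
			hits = append(hits, queries[i])
			i++ // keep j: the next query may carry the same k-mer
		}
	}
	return hits
}

func main() {
	queries := []queryEntry{{3, 0, 1}, {7, 0, 2}, {7, 1, 5}, {12, 1, 6}}
	index := []uint64{1, 7, 9, 12, 20}
	for _, h := range mergeScan(queries, index) {
		fmt.Println(h.kmer, h.seqIdx, h.pos)
	}
}
```

Because both sides are sorted, each element is visited once, so a partition is matched in time linear in the number of queries plus the number of index k-mers scanned.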
|
||||
|
||||
The design supports **large-scale batch processing**, incremental query accumulation, and high-performance parallel lookup—ideal for metagenomic or biodiversity sequencing workflows.
|
||||
@@ -0,0 +1,49 @@
|
||||
# `obikmer` K-mer Set Group Builder — Functional Overview
|
||||
|
||||
The `KmerSetGroupBuilder` enables scalable construction of k-mer indexes from biological sequences, supporting both new and incremental (append) workflows. It operates in two phases: **collection** of super-kmers into partitioned temporary files (`.skm`), and **finalization**, where partitions are processed in parallel into final k-mer indexes (`.kdi`).
|
||||
|
||||
## Core Features
|
||||
|
||||
- **K-mer & Minimizer Configuration**:
|
||||
Supports `k ∈ [2,31]`; auto-computes optimal minimizer size (`m ≈ k/2.5`) and partition count (up to `4^m`, capped at 4096).
|
||||
|
||||
- **Functional Options for Filtering**:
|
||||
- `WithMinFrequency(n)`: Keep only k-mers with frequency ≥ *n* (enables deduplication).
|
||||
- `WithMaxFrequency(n)`: Discard k-mers with frequency > *n*.
|
||||
- `WithEntropyFilter(threshold, levelMax)`: Remove low-complexity k-mers (entropy ≤ threshold).
|
||||
- `WithSaveFreqKmers(n)`: Save top-*n* most frequent k-mers per set to `top_kmers.csv`.
|
||||
|
||||
- **Concurrent & Pipeline-Aware Processing**:
|
||||
Uses a two-stage pipeline: *I/O-bound readers* (2–4 goroutines) feed k-mers to *CPU-bound workers*, one per core, maximizing throughput.
|
||||
|
||||
- **Partitioned I/O & Thread Safety**:
|
||||
Super-kmers are written to per-partition `.skm` files using mutex-protected writers, enabling safe concurrent `AddSequence()` calls.
|
||||
|
||||
## Workflow
|
||||
|
||||
1. **Build Phase**:
|
||||
- Input sequences → super-kmers extracted via minimizer-based partitioning.
|
||||
- Super-kmers written to `.build/set_*/part_*.skm`.
|
||||
|
||||
2. **Finalization (`Close()`)**:
|
||||
- `.skm` files loaded → canonical k-mers extracted.
|
||||
- K-mers sorted, counted (frequency spectrum), and filtered per config.
|
||||
- Final `.kdi` files are written, along with `spectrum.bin` and, optionally, `top_kmers.csv`.
|
||||
- Metadata (`metadata.toml`) generated; `.build/` cleaned.
|
||||
|
||||
3. **Append Mode**:
|
||||
`AppendKmerSetGroupBuilder()` extends an existing group, inheriting its parameters and appending new sets.
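The sort, count, and filter step of finalization can be illustrated for a single partition held in memory. This is a sketch only: the function name is hypothetical, `maxFreq = 0` is used here to mean "no upper limit", and building the spectrum before filtering is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// finalize sorts raw canonical k-mers, counts each distinct value, builds the
// frequency spectrum (count -> number of distinct k-mers), and keeps the
// k-mers whose frequency lies within [minFreq, maxFreq].
func finalize(kmers []uint64, minFreq, maxFreq int) (kept []uint64, spectrum map[int]int) {
	sort.Slice(kmers, func(a, b int) bool { return kmers[a] < kmers[b] })
	spectrum = make(map[int]int)
	for i := 0; i < len(kmers); {
		j := i
		for j < len(kmers) && kmers[j] == kmers[i] {
			j++ // extend the run of identical k-mers
		}
		freq := j - i
		spectrum[freq]++
		if freq >= minFreq && (maxFreq == 0 || freq <= maxFreq) {
			kept = append(kept, kmers[i]) // one copy per distinct k-mer
		}
		i = j
	}
	return kept, spectrum
}

func main() {
	kept, spectrum := finalize([]uint64{5, 3, 5, 9, 3, 5, 1}, 2, 0)
	fmt.Println(kept, spectrum) // [3 5] map[1:2 2:1 3:1]
}
```

The kept slice is sorted and free of duplicates, which is the form a `.kdi` writer expects.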
|
||||
|
||||
## Output Artifacts
|
||||
|
||||
- `.kdi`: Sorted, deduplicated (and optionally filtered) k-mers.
|
||||
- `spectrum.bin`: Per-set frequency spectrum (`count → #k-mers`).
|
||||
- `top_kmers.csv` (optional): Top *N* k-mers per set with counts.
|
||||
- `metadata.toml`: Global and per-set metadata (k, m, partitions, counts).
|
||||
|
||||
## Design Highlights
|
||||
|
||||
- **Memory-efficient**: Streams large `.skm` files; reuses slices to minimize GC pressure.
|
||||
- **Scalable**: Parallel finalization scales with CPU cores and I/O bandwidth.
|
||||
- **Robust error handling**: Early termination on first failure; cleanup of partial state.
|
||||
|
||||
@@ -0,0 +1,44 @@
|
||||
# K-mer Set Group Builder — Semantic Description
|
||||
|
||||
This Go module (`obikmer`) provides a **disk-backed builder and accessor** for managing *k-mer sets* across multiple biological sequence datasets. It supports efficient construction, persistence, and querying of canonical *k*-mers (accounting for DNA reverse-complement symmetry), with optional frequency filtering.
|
||||
|
||||
### Core Functionalities
|
||||
|
||||
- **K-mer Set Group Construction**:
|
||||
`NewKmerSetGroupBuilder` creates a builder configured with:
|
||||
- *k* (k-mer length),
|
||||
- *m* (minimizer length used for partitioning),
|
||||
- number of sets (`nSets`),
|
||||
- and optional parameters like `WithMinFrequency`.
|
||||
|
||||
- **Sequence Ingestion**:
|
||||
Sequences are added per set via `AddSequence(setID, bioseq)`. Internally:
|
||||
- Canonical *k*-mers are extracted (using `IterCanonicalKmers`),
|
||||
- Deduplicated and optionally filtered by occurrence frequency.
|
||||
|
||||
- **Persistence & Round-Trip**:
|
||||
`builder.Close()` materializes the *k*-mer sets to disk (in temp or specified directory).
|
||||
`OpenKmerSetGroup(dir)` reloads them — preserving all metadata and structure.
|
||||
|
||||
- **Metadata & Attributes**:
|
||||
Supports custom identifiers (`SetId`) and key-value attributes (e.g., `"organism": "test"`), saved to disk via `SaveMetadata`.
|
||||
|
||||
- **Efficient Iteration**:
|
||||
The iterator (`ksg.Iterator(setID)`) yields *sorted*, deduplicated canonical *k*-mers — using a k-way merge across internal partitions.
|
||||
|
||||
- **Frequency Filtering**:
|
||||
`WithMinFrequency(n)` ensures only *k*-mers appearing ≥*n* times across inputs survive — enabling noise suppression (e.g., in error correction or abundance-based filtering).
|
||||
|
||||
- **Multi-set Support**:
|
||||
Handles multiple independent *k*-mer sets (e.g., per sample or taxonomic group), verified via `Size()` and indexed access (`Len(setID)`).
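The deduplication and `WithMinFrequency` behaviour described above can be pictured with a small in-memory sketch. `filterByMinFrequency` is a hypothetical helper written for illustration only; the builder itself works on disk partitions and its internals are not shown in this document.

```go
package main

import (
	"fmt"
	"sort"
)

// filterByMinFrequency sorts kmers in place and returns the distinct values
// that occur at least minFreq times, in ascending order.
func filterByMinFrequency(kmers []uint64, minFreq int) []uint64 {
	sort.Slice(kmers, func(i, j int) bool { return kmers[i] < kmers[j] })
	var kept []uint64
	for i := 0; i < len(kmers); {
		// Find the end of the run of equal values starting at i.
		j := i
		for j < len(kmers) && kmers[j] == kmers[i] {
			j++
		}
		if j-i >= minFreq {
			kept = append(kept, kmers[i])
		}
		i = j
	}
	return kept
}

func main() {
	kmers := []uint64{7, 3, 7, 9, 3, 7}
	fmt.Println(filterByMinFrequency(kmers, 2)) // prints [3 7]
}
```

With `minFreq = 1` the function reduces to plain deduplication.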
|
||||
|
||||
### Testing Coverage
|
||||
|
||||
Comprehensive unit tests validate:
|
||||
- Basic construction & correctness,
|
||||
- Multi-sequence ingestion and deduplication,
|
||||
- Frequency-based inclusion/exclusion logic,
|
||||
- Cross-set isolation (`nSets > 1`),
|
||||
- Metadata round-trip integrity.
|
||||
|
||||
This module is designed for scalable, reproducible *k*-mer indexing in metagenomic or amplicon analysis pipelines (e.g., OBITools4 ecosystem).
|
||||
@@ -0,0 +1,44 @@
|
||||
# `obikmer` Package: Disk-Based K-mer Set Group Management
|
||||
|
||||
The `obikmer` package provides a streaming, disk-backed implementation for managing collections of *k*-mer sets (called **K-mer Set Groups**), optimized for large-scale metagenomic or genomic analyses.
|
||||
|
||||
### Core Concepts
|
||||
- A **KmerSetGroup** stores *N* independent sets of sorted *k*-mers, partitioned into *P* files per set.
|
||||
- Each group is defined by immutable parameters: `k` (*k*-mer size), `m` (minimizer size), and `P` (number of partitions).
|
||||
- Data is stored on disk as `.kdi` files (sorted k-mers) with optional sparse indices (`.kdx`) for fast lookup.
|
||||
- Metadata is serialized in TOML format (`metadata.toml`), supporting both group-level and per-set attributes.
|
||||
|
||||
### Key Functionalities
|
||||
|
||||
#### 1. **Lifecycle Management**
|
||||
- `OpenKmerSetGroup(directory)` loads an existing index in read-only mode.
|
||||
- `NewFilteredKmerSetGroup(...)` constructs a new group (e.g., after filtering).
|
||||
- `SaveMetadata()` persists metadata changes to disk.
|
||||
|
||||
#### 2. **Accessors & Metadata**
|
||||
- Basic properties: `K()`, `M()`, `Partitions()`, `Size()` (i.e., *N*), and group ID.
|
||||
- Attribute API: get/set/delete user-defined metadata (group-level or per-set).
|
||||
- Supports type coercion (`GetIntAttribute`, `GetStringAttribute`).
|
||||
|
||||
#### 3. **Membership & Iteration**
|
||||
- `Contains(setIndex, kmer)` checks presence using indexed binary search + linear scan across all partitions (parallelized).
|
||||
- `Iterator(setIndex)` yields sorted *k*-mers via k-way merge of partition readers.
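The k-way merge behind `Iterator` can be sketched with the standard `container/heap` package. This is only an illustration on in-memory slices: `cursor`, `cursorHeap` and `mergeSorted` are hypothetical names, and the real iterator reads from partition files rather than slices.

```go
package main

import (
	"container/heap"
	"fmt"
)

// cursor points at the next unread element of one sorted partition.
type cursor struct {
	data []uint64
	pos  int
}

// cursorHeap orders cursors by the value they currently point at.
type cursorHeap []*cursor

func (h cursorHeap) Len() int           { return len(h) }
func (h cursorHeap) Less(i, j int) bool { return h[i].data[h[i].pos] < h[j].data[h[j].pos] }
func (h cursorHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *cursorHeap) Push(x any)        { *h = append(*h, x.(*cursor)) }
func (h *cursorHeap) Pop() any {
	old := *h
	c := old[len(old)-1]
	*h = old[:len(old)-1]
	return c
}

// mergeSorted merges ascending partitions into one ascending slice,
// dropping values that repeat across or within partitions.
func mergeSorted(parts [][]uint64) []uint64 {
	h := &cursorHeap{}
	for _, p := range parts {
		if len(p) > 0 {
			*h = append(*h, &cursor{data: p})
		}
	}
	heap.Init(h)
	var out []uint64
	for h.Len() > 0 {
		c := (*h)[0] // cursor with the smallest current value
		v := c.data[c.pos]
		if len(out) == 0 || out[len(out)-1] != v {
			out = append(out, v)
		}
		c.pos++
		if c.pos < len(c.data) {
			heap.Fix(h, 0) // the root changed value: restore heap order
		} else {
			heap.Pop(h) // this partition is exhausted
		}
	}
	return out
}

func main() {
	parts := [][]uint64{{1, 4, 9}, {2, 4}, {}, {9, 10}}
	fmt.Println(mergeSorted(parts)) // prints [1 2 4 9 10]
}
```

Each output element costs O(log P) heap work for P partitions, and only one value per partition needs to be held in memory at a time.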
|
||||
|
||||
#### 4. **Similarity & Distance Metrics**
|
||||
- `JaccardDistanceMatrix()` and `JaccardSimilarityMatrix()`: compute pairwise metrics in a streaming fashion.
|
||||
- Per-partition processing with parallel goroutines and sorted merge for accurate set intersection/union counts.
|
||||
|
||||
#### 5. **Set Management**
|
||||
- `CopySetsByIDTo(ids, destDir)` copies selected sets (with metadata) to another group.
|
||||
- Supports compatibility checks and optional overwriting (`force`).
|
||||
- `RemoveSetByID(id)` deletes a set, renumbers remaining sets for contiguous indices.
|
||||
- Glob pattern matching: `MatchSetIDs(patterns)` resolves IDs like `"sample_*"`.
|
||||
|
||||
#### 6. **Compatibility & Utility**
|
||||
- `IsCompatibleWith(other)` verifies same `(k, m, partitions)`.
|
||||
- Helper methods: `PartitionPath`, `Spectrum(...)`, and spectrum file I/O.
|
||||
|
||||
### Design Highlights
|
||||
- **Streaming**: Operations avoid loading full datasets into memory.
|
||||
- **Immutability after creation** ensures consistency; modifications require explicit save operations.
|
||||
- Thread-safe for concurrent partition processing (via `sync.Mutex` and `sync.WaitGroup`).
|
||||
@@ -0,0 +1,26 @@
|
||||
# Semantic Description of `obikmer` Set Operations
|
||||
|
||||
This Go package implements scalable set operations over collections of *k*-mers stored in disk-backed, sorted structures (`.kdi` files). A `KmerSetGroup` represents a group of *N* independent sets (e.g., per-sample or per-partition), each containing sorted unique *k*-mers.
|
||||
|
||||
## Core Set Operations
|
||||
|
||||
- **`Union()`**: Computes the union across all *N* sets — a k-mer appears in output if present in ≥1 input set.
|
||||
- **`Intersect()`**: Computes the intersection — a k-mer appears only if present in *all* sets.
|
||||
- **`Difference()`**: Computes `set₀ \ (set₁ ∪ … ∪ setₙ₋₁)` — keeps k-mers unique to the first set.
|
||||
- **`QuorumAtLeast(q)`**: Returns k-mers present in ≥ *q* sets.
|
||||
- **`QuorumExactly(q)`**: Returns k-mers present in exactly *q* sets.
|
||||
- **`QuorumAtMost(q)`**: Returns k-mers present in ≤ *q* sets.
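The quorum operations above all reduce to counting, for each k-mer, how many sorted sets contain it. A minimal in-memory sketch of the "at least *q*" case follows; `quorumAtLeast` is a hypothetical function operating on slices, whereas the package works on `.kdi` streams.

```go
package main

import "fmt"

// quorumAtLeast returns the values present in at least q of the given sets.
// Each set must be sorted ascending and free of duplicates.
func quorumAtLeast(sets [][]uint64, q int) []uint64 {
	pos := make([]int, len(sets)) // read position in each set
	var out []uint64
	for {
		// Find the smallest unread value among all sets.
		found := false
		var lowest uint64
		for i, s := range sets {
			if pos[i] < len(s) && (!found || s[pos[i]] < lowest) {
				lowest, found = s[pos[i]], true
			}
		}
		if !found {
			return out // every set is exhausted
		}
		// Count the sets holding that value and advance past it.
		count := 0
		for i, s := range sets {
			if pos[i] < len(s) && s[pos[i]] == lowest {
				count++
				pos[i]++
			}
		}
		if count >= q {
			out = append(out, lowest)
		}
	}
}

func main() {
	sets := [][]uint64{{1, 2, 5}, {2, 3, 5}, {2, 5, 8}}
	fmt.Println(quorumAtLeast(sets, 1)) // union: prints [1 2 3 5 8]
	fmt.Println(quorumAtLeast(sets, 3)) // intersection: prints [2 5]
}
```

Changing the final test to `count == q` or `count <= q` gives the "exactly" and "at most" variants.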
|
||||
|
||||
## Pairwise Group Operations
|
||||
|
||||
- **`UnionWith(other)` / `IntersectWith(other)`**: Performs *per-set* binary operations between two compatible groups (same k, m, partitions, size). Result has *N* sets: `setᵢ = this.setᵢ ⊕ other.setᵢ`, where ⊕ is union or intersection.
|
||||
|
||||
## Implementation Highlights
|
||||
|
||||
- **Partitioned & Parallelized**: Each operation processes partitions in parallel using `runtime.NumCPU()` workers.
|
||||
- **Streaming K-way Merge**: Uses efficient sorted-stream merging (via `KWayMerge`) to avoid loading full sets into memory.
|
||||
- **Quorum Filtering**: Counts occurrences per k-mer across partitions by merging sorted streams and tallying hits.
|
||||
- **Compatibility Check**: Ensures groups share metadata (k, m, partitions) before pairwise operations.
|
||||
- **Disk Output**: All results materialized as new `KmerSetGroup` in a directory, with per-partition `.kdi` files and metadata.
|
||||
|
||||
All operations preserve sorted order and support large-scale genomic datasets via streaming, partitioning, and minimal memory footprint.
|
||||
@@ -0,0 +1,28 @@
|
||||
# Semantic Description of `obikmer` Package Functionalities
|
||||
|
||||
The `obikmer` package provides disk-backed operations on *k*-mer sets derived from biological sequences. It supports scalable set algebra and similarity computations via the `KmerSetGroup` type.
|
||||
|
||||
## Core Features
|
||||
|
||||
- **Sequence-to-*k*-mer Indexing**: Sequences are converted into *k*-mers (of length `k`) and stored in a group of sets (`KmerSetGroup`), with one set per sequence. Minimizer-based sampling (parameter `m`) reduces redundancy.
|
||||
|
||||
- **Set Operations on Disk**: Efficient disk-resident implementations of standard set operations:
|
||||
- `Union`: Merges all *k*-mers from selected sets.
|
||||
- `Intersect`: Retains only *k*-mers present in all input sets.
|
||||
- `Difference` (`A \ B`): Keeps *k*-mers present in set A but not in B.
|
||||
- `QuorumAtLeast(r)`: Returns *k*-mers appearing in ≥`r` sets (generalizes union (`r=1`) and intersection (`r=n`)).
|
||||
|
||||
- **Consistency Guarantees**: Operations obey mathematical identities (e.g., `|A ∪ B| = |A| + |B| − |A ∩ B|`), validated via unit tests.
|
||||
|
||||
- **Similarity & Distance Metrics**:
|
||||
- `JaccardDistanceMatrix()`: Computes pairwise Jaccard *distances* (1 − similarity) between all sets.
|
||||
- `JaccardSimilarityMatrix()`: Computes pairwise Jaccard *similarities* (`|A ∩ B| / |A ∪ B|`).
|
||||
- Identical sets yield distance = `0.0`, disjoint ones give `1.0`; similarity is complementary.
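The Jaccard metrics and the inclusion-exclusion identity above can be illustrated on two sorted slices. This is a sketch under the assumption that each set is an ascending, duplicate-free list of encoded k-mers; `intersectionSize` and `jaccardDistance` are hypothetical names, not the package API.

```go
package main

import "fmt"

// intersectionSize counts values common to two ascending, duplicate-free slices.
func intersectionSize(a, b []uint64) int {
	i, j, n := 0, 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			n++
			i++
			j++
		}
	}
	return n
}

// jaccardDistance returns 1 - |A ∩ B| / |A ∪ B|, using
// |A ∪ B| = |A| + |B| - |A ∩ B|.
// Two empty sets are given distance 0, a convention chosen for this sketch.
func jaccardDistance(a, b []uint64) float64 {
	inter := intersectionSize(a, b)
	union := len(a) + len(b) - inter
	if union == 0 {
		return 0
	}
	return 1 - float64(inter)/float64(union)
}

func main() {
	a := []uint64{1, 2, 3, 4}
	b := []uint64{3, 4, 5, 6}
	fmt.Printf("%.3f\n", jaccardDistance(a, b)) // 2 shared out of 6: prints 0.667
	fmt.Println(jaccardDistance(a, a))          // identical sets: prints 0
}
```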
|
||||
|
||||
## Design Principles
|
||||
|
||||
- **Temporary Directory Usage**: All operations use OS temp dirs for isolation and cleanup.
|
||||
- **Testing-Focused API**: Helper functions (`buildGroupFromSeqs`, `collectKmers`) simplify test setup.
|
||||
- **Scalability**: Disk-backed design avoids memory overflow for large sequence collections.
|
||||
|
||||
This package enables robust, reproducible *k*-mer set analysis in bioinformatics pipelines—especially useful for metagenomic binning, error correction, or read clustering.
|
||||
@@ -0,0 +1,37 @@
|
||||
# Semantic Description of `KmerMap` Functionality
|
||||
|
||||
The provided Go package implements a **k-mer indexing and matching system** for biological sequences (`BioSequence`). It supports both standard and *sparse* k-mer representations (where one position is masked, typically for handling ambiguous bases or symmetry).
|
||||
|
||||
### Core Data Structures
|
||||
- `KmerMap[T]`: A generic hash map associating *normalized* k-mers (type `T`, e.g., uint64 encoded in 2 bits per base) to lists of sequences containing them.
|
||||
- `KmerMatch`: A map from sequence pointers to k-mer match counts, used for query results.
|
||||
|
||||
### Key Features
|
||||
1. **K-mer Normalization**
|
||||
- Handles both forward and reverse-complement k-mers.
|
||||
- Selects the lexicographically smaller representation (canonical form).
|
||||
- Supports *sparse* k-mers: when `SparseAt ≥ 0`, the central base is ignored (replaced by `#` in string view), and k-mers are symmetrically normalized.
|
||||
|
||||
2. **Efficient Indexing (`Push`)**
|
||||
- Builds an index of all canonical k-mers from a set of sequences.
|
||||
- Optionally limits per-k-mer storage (`maxocc`), useful for filtering high-frequency k-mers (e.g., contaminants).
|
||||
|
||||
3. **Querying (`Query`)**
|
||||
- Given a query sequence, returns all sequences in the index sharing k-mers with it.
|
||||
- Counts per-sequence how many shared k-mers exist (used for similarity estimation or clustering).
|
||||
|
||||
4. **Result Utilities (`KmerMatch`)**
|
||||
- `FilterMinCount`: Remove low-count matches.
|
||||
- `Max()`, `Sequences()`: Retrieve best match or all matched sequences.
|
||||
|
||||
5. **Construction (`NewKmerMap`)**
|
||||
- Automatically adjusts k-mer size: odd for sparse mode, even otherwise.
|
||||
- Precomputes bitmasks for efficient k-mer manipulation (masking, shifting).
|
||||
- Integrates progress bar during indexing.
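The normalization step listed under **K-mer Normalization** can be sketched for the non-sparse case as follows. The 2-bit coding (A=0, C=1, G=2, T=3) and the function names `encode`, `reverseComplement` and `canonical` are assumptions made for this example; with that coding, comparing integers is the same as comparing k-mers lexicographically.

```go
package main

import "fmt"

// encode packs a DNA string into 2 bits per base (A=0, C=1, G=2, T=3),
// first base in the most significant position. ok is false on other letters.
func encode(seq string) (code uint64, ok bool) {
	for i := 0; i < len(seq); i++ {
		var b uint64
		switch seq[i] | 0x20 { // fold ASCII letters to lowercase
		case 'a':
			b = 0
		case 'c':
			b = 1
		case 'g':
			b = 2
		case 't':
			b = 3
		default:
			return 0, false
		}
		code = (code << 2) | b
	}
	return code, true
}

// reverseComplement returns the encoding of the reverse complement of a k-mer.
// With this coding the complement of a base is 3 minus the base.
func reverseComplement(code uint64, k int) uint64 {
	var rc uint64
	for i := 0; i < k; i++ {
		rc = (rc << 2) | (3 - (code & 3))
		code >>= 2
	}
	return rc
}

// canonical returns the smaller of a k-mer and its reverse complement.
func canonical(code uint64, k int) uint64 {
	if rc := reverseComplement(code, k); rc < code {
		return rc
	}
	return code
}

func main() {
	for _, s := range []string{"ACGT", "AAAC", "GTTT", "TTTT"} {
		code, _ := encode(s)
		fmt.Printf("%s -> %d (canonical %d)\n", s, code, canonical(code, len(s)))
	}
}
```

`AAAC` and `GTTT` are reverse complements of each other, so both map to the same canonical value.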
|
||||
|
||||
### Use Cases
|
||||
- Read clustering (e.g., OTU/ASV picking).
|
||||
- Error correction via k-mer abundance.
|
||||
- Sequence similarity search or contamination screening.
|
||||
|
||||
The implementation leverages low-level bit operations for performance and memory efficiency, especially critical in large-scale NGS data processing.
|
||||
@@ -0,0 +1,27 @@
|
||||
# Minimizer Size Utilities in `obikmer`
|
||||
|
||||
This Go package provides helper functions to compute and validate the **minimizer size** `m` in k-mer-based genomic algorithms (e.g., minimizer schemes for sequence comparison or indexing).
|
||||
|
||||
## Core Functions
|
||||
|
||||
- **`DefaultMinimizerSize(k)`**
|
||||
Returns a *recommended* minimizer size: `ceil(k / 2.5)`, clamped to `[1, k−1]`.
|
||||
→ Ensures `m` is reasonably large for uniqueness while keeping window size (`k − m + 1`) manageable.
|
||||
|
||||
- **`MinMinimizerSize(nworkers)`**
|
||||
Computes the *minimum* `m` such that there are ≥ `nworkers` distinct minimizers:
|
||||
solves `4^m ≥ nworkers`, i.e., `m = ceil(log₄(nworkers))`.
|
||||
→ Guarantees enough diversity for parallelization (e.g., hashing-based distribution across workers).
|
||||
|
||||
- **`ValidateMinimizerSize(m, k, nworkers)`**
|
||||
Enforces constraints on `m`:
|
||||
- Lower bound: ≥ `MinMinimizerSize(nworkers)` (warns & adjusts if violated)
|
||||
- Hard bounds: `1 ≤ m < k`
|
||||
→ Prevents invalid or inefficient parameter choices.
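The two formulas above are easy to check with integer arithmetic. The following is a sketch written from the description, not a copy of the package code: the lowercase function names are hypothetical, and the floor of 1 in `minMinimizerSize` is an assumption.

```go
package main

import "fmt"

// defaultMinimizerSize returns ceil(k / 2.5), clamped to [1, k-1].
// ceil(k/2.5) equals ceil(2k/5), computed here without floating point.
func defaultMinimizerSize(k int) int {
	m := (2*k + 4) / 5
	if m > k-1 {
		m = k - 1
	}
	if m < 1 {
		m = 1
	}
	return m
}

// minMinimizerSize returns the smallest m with 4^m >= nworkers.
// The result is never below 1 (an assumption made for this sketch).
func minMinimizerSize(nworkers int) int {
	m, distinct := 1, 4 // 4^1 distinct minimizers of length 1
	for distinct < nworkers {
		distinct *= 4
		m++
	}
	return m
}

func main() {
	fmt.Println(defaultMinimizerSize(31)) // ceil(12.4): prints 13
	fmt.Println(defaultMinimizerSize(21)) // ceil(8.4): prints 9
	fmt.Println(minMinimizerSize(16))     // 4^2 = 16: prints 2
	fmt.Println(minMinimizerSize(17))     // needs 4^3 = 64: prints 3
}
```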
|
||||
|
||||
## Semantic Purpose
|
||||
|
||||
These functions ensure that minimizer-based workflows are:
|
||||
- **Theoretically sound** (sufficient entropy for parallelism),
|
||||
- **Practically viable** (avoiding degenerate cases like `m = 0` or `m ≥ k`),
|
||||
- **User-friendly** (providing sensible defaults + clear warnings on adjustment).
|
||||
@@ -0,0 +1,24 @@
|
||||
# SKM File Reader for Super-Kmers
|
||||
|
||||
This Go package provides a binary file reader (`SkmReader`) for `.skm` files, which store *super-kmers* — compact representations of DNA sequences using 2-bit encoding.
|
||||
|
||||
## Core Functionality
|
||||
|
||||
- **Binary Format Parsing**: Reads structured data from `.skm` files, where each record contains:
|
||||
- A 2-byte little-endian integer specifying the sequence length.
|
||||
- Packed nucleotide data, where every byte encodes up to four bases (2 bits per base).
|
||||
|
||||
- **Decoding Logic**: Converts packed 2-bit codes (`00`, `01`, `10`, `11`) to nucleotide characters using the mapping:
|
||||
`{ 'a', 'c', 'g', 't' }`.
|
||||
|
||||
- **Memory-Efficient Reading**: Uses buffered I/O (64 KiB buffer) for fast sequential access.
|
||||
|
||||
- **Streaming Interface**: `Next()` returns the next super-kmer as a struct with:
|
||||
- `Sequence`: decoded nucleotide byte slice.
|
||||
- `Start`, `End`: positional metadata (currently fixed to full length).
|
||||
|
||||
- **Resource Management**: Provides a clean `.Close()` method for file handle cleanup.
|
||||
|
||||
## Use Case
|
||||
|
||||
Designed for high-performance processing of large genomic datasets (e.g., in k-mer analysis or sequence indexing), where storage size and read speed are critical.
|
||||
@@ -0,0 +1,23 @@
|
||||
# SKM File Format Specification
|
||||
|
||||
This Go package implements a binary format for storing *super-kmers*—compact representations of DNA sequences used in bioinformatics. The tests validate reading/writing, padding behavior, and file size correctness.
|
||||
|
||||
## Core Functionalities
|
||||
|
||||
- **SuperKmer Structure**: Each super-kmer stores a DNA sequence (as bytes), likely padded to 4-base boundaries for efficient storage.
|
||||
- **SkmWriter**: Serializes super-kmers into a binary file. Each entry writes:
|
||||
- A 2-byte little-endian length (number of bases),
|
||||
- Then `ceil(length/4)` bytes encoding nucleotides in 2 bits each (A=0, C=1, G=2, T=3).
|
||||
- **SkmReader**: Parses the binary format back into memory. Returns `(SuperKmer, bool)` via `Next()`, with EOF signaled by `ok = false`.
|
||||
- **Case Handling**: The writer accepts upper- or lowercase bases, but the 2-bit encoding does not store case and the reader returns lowercase; the tests therefore lowercase the expected sequence (via `| 0x20`) before comparing.
|
||||
|
||||
## Test Coverage
|
||||
|
||||
- **Round-trip integrity**: Verifies exact sequence recovery after write/read.
|
||||
- **Empty file handling**: Confirms reader returns `ok = false` immediately on empty files.
|
||||
- **Variable-length padding**: Validates correct encoding/decoding for sequences of length 1–5.
|
||||
- **Size validation**: Confirms file size = `2 + ceil(L/4)` bytes for a sequence of length *L*.
|
||||
|
||||
## Use Case
|
||||
|
||||
Efficient, lossless storage and retrieval of super-kmers for downstream genomic analysis (e.g., assembly or alignment acceleration).
|
||||
@@ -0,0 +1,24 @@
|
||||
# `.skm` File Format and `SkmWriter` Functionality
|
||||
|
||||
The Go package `obikmer` provides a binary writer for `.skm` (super-kmer) files, optimized for compact storage of DNA sequences.
|
||||
|
||||
- **Purpose**: Efficiently serialize *super-kmers* (runs of consecutive k-mers that share a minimizer) into a binary format.
|
||||
- **Format per super-kmer**:
|
||||
- `len: uint16 LE` — length of the sequence in bases (little-endian, 2 bytes).
|
||||
- `data: ⌈len/4⌉ bytes` — nucleotide sequence encoded as **2 bits per base**, packed tightly.
|
||||
|
||||
- **Encoding scheme**:
|
||||
- `A → 00`, `C → 01`, `G → 10`, `T → 11`.
|
||||
- Padding: trailing bits in the final byte are zeroed if `len % 4 ≠ 0`.
|
||||
|
||||
- **Implementation details**:
|
||||
- Uses buffered I/O (`bufio.Writer` with 64 KiB buffer) for performance.
|
||||
- `NewSkmWriter(path)` opens/creates the file and returns a writer instance.
|
||||
- `Write(sk SuperKmer)` encodes sequence length, then packs bases using a lookup (`__single_base_code__[seq[pos]&31]`).
|
||||
- `Close()` flushes buffers and closes the file handle.
|
||||
|
||||
- **Use case**: Ideal for high-throughput genomic preprocessing (e.g., indexing, sketching), where space and I/O speed matter.
|
||||
|
||||
- **Assumptions**: `SuperKmer` type exposes a `.Sequence []byte`; bases are ASCII (`A,C,G,T,a,c,g,t`); `&31` keeps the low five bits of each letter, so upper- and lowercase letters select the same table entry.
|
||||
|
||||
- **Efficiency**: 4× compression vs. ASCII (1 byte/base → ~0.25 bytes/base), minimal overhead.
|
||||
@@ -0,0 +1,35 @@
|
||||
# K-mer Spectrum Analysis Package (`obikmer`)
|
||||
|
||||
This Go package provides tools for analyzing k-mer frequency distributions in biological sequences.
|
||||
|
||||
## Core Data Structures
|
||||
|
||||
- **`SpectrumEntry`**: Represents a bin in the k-mer frequency spectrum:
|
||||
`Frequency`: how often a k-mer was observed; `Count`: number of distinct k-mers with that frequency.
|
||||
|
||||
- **`KmerSpectrum`**: A sorted list of non-zero `SpectrumEntry`s (ascending by frequency), enabling efficient statistics and serialization.
|
||||
|
||||
## Key Functionalities
|
||||
|
||||
### Spectrum Management
|
||||
- `MapToSpectrum()` / `ToMap()`: Convert between map and structured spectrum representations.
|
||||
- `MergeSpectraMaps()` / `MergeTopN()`: Combine spectral or top-k data from multiple sources.
|
||||
- `MaxFrequency()` returns the highest observed k-mer count.
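As an illustration of what a frequency spectrum contains, the sketch below derives one from a table of k-mer counts. The field types of `SpectrumEntry` are assumed here, and `spectrumFromCounts` is a hypothetical helper, not one of the functions listed above.

```go
package main

import (
	"fmt"
	"sort"
)

// SpectrumEntry mirrors the fields described above; the exact types are assumed.
type SpectrumEntry struct {
	Frequency int    // how often a k-mer was observed
	Count     uint64 // number of distinct k-mers with that frequency
}

// spectrumFromCounts builds the frequency spectrum of a k-mer count table,
// sorted by ascending frequency and containing only non-empty bins.
func spectrumFromCounts(counts map[uint64]int) []SpectrumEntry {
	bins := make(map[int]uint64)
	for _, freq := range counts {
		bins[freq]++
	}
	spectrum := make([]SpectrumEntry, 0, len(bins))
	for freq, n := range bins {
		spectrum = append(spectrum, SpectrumEntry{Frequency: freq, Count: n})
	}
	sort.Slice(spectrum, func(i, j int) bool {
		return spectrum[i].Frequency < spectrum[j].Frequency
	})
	return spectrum
}

func main() {
	// Three k-mers seen once, two k-mers seen three times.
	counts := map[uint64]int{10: 1, 11: 1, 12: 3, 13: 1, 14: 3}
	fmt.Println(spectrumFromCounts(counts)) // prints [{1 3} {3 2}]
}
```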
|
||||
|
||||
### I/O & Persistence
|
||||
- Binary format (`KSP\x01` magic header) with varint encoding for compact storage:
|
||||
- `WriteSpectrum()` / `ReadSpectrum()`: Save/load full spectra to disk.
|
||||
- CSV export:
|
||||
- `WriteTopKmersCSV()`: Outputs top-k k-mers with their sequences (decoded from uint64) and frequencies.
|
||||
|
||||
### Top-N K-mer Tracking
|
||||
- Uses a **min-heap** to efficiently maintain the *N most frequent* k-mers in streaming scenarios:
|
||||
- `NewTopNKmers(n)`: Initialize collector.
|
||||
- `Add(kmer, freq)`: Insert/update while respecting capacity *n*.
|
||||
- `Results()`: Return top-kmers sorted descending by frequency.
|
||||
|
||||
## Design Highlights
|
||||
- Memory-efficient: Uses `uint64` for k-mers (suitable up to *k* ≤ 32).
|
||||
- Streaming-friendly: Top-N collector supports incremental updates.
|
||||
- Thread-safety note: External synchronization required for concurrent access.
|
||||
|
||||
@@ -0,0 +1,48 @@
|
||||
# SuperKmer and Minimizer-Based Sliding Window Analysis
|
||||
|
||||
This Go package provides functionality for extracting *super k-mers* from DNA sequences using a minimizer-based sliding window approach.
|
||||
|
||||
## Core Concepts
|
||||
|
||||
- **K-mers**: Substrings of length `k` from a DNA sequence.
|
||||
- **Minimizer**: The lexicographically smallest canonical *m*-mer (substring of length `m`) among all `(k − m + 1)` overlapping *m*-mers in a given k-mer.
|
||||
- **Super K-mer**: A maximal contiguous subsequence where *every* consecutive k-mer shares the **same minimizer**.
|
||||
|
||||
## Data Structures
|
||||
|
||||
### `SuperKmer`
|
||||
Represents a maximal region with uniform minimizer:
|
||||
- `Minimizer`: Canonical 64-bit hash of the shared m-mer.
|
||||
- `Start`, `End`: Slice-style bounds (0-indexed, exclusive end).
|
||||
- `Sequence`: Raw byte slice of the DNA subsequence.
|
||||
|
||||
### `dequeItem`
|
||||
Used internally to maintain a monotone deque:
|
||||
- `position`: Index of the m-mer in the sequence.
|
||||
- `canonical`: Canonical hash value (e.g., lexicographically smallest of forward/reverse-complement).
|
||||
|
||||
## Main Function
|
||||
|
||||
### `ExtractSuperKmers(seq, k, m, buffer)`
|
||||
- Extracts all maximal super k-mers from `seq`.
|
||||
- Parameters validated:
|
||||
- `1 ≤ m < k`,
|
||||
- `2 ≤ k ≤ 31`,
|
||||
- sequence length ≥ `k`.
|
||||
- Uses an efficient **O(n)** time algorithm via internal iteration.
|
||||
- Supports optional preallocation (`buffer`) to reduce memory allocations.
|
||||
|
||||
## Algorithm Highlights
|
||||
|
||||
- Maintains a sliding window of size `k − m + 1` over *m*-mers.
|
||||
- Tracks the current minimizer using a monotone deque, giving amortized O(1) updates per step.
|
||||
- Detects *minimizer transitions* to delimit super k-mer boundaries.
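The monotone-deque idea can be shown in isolation as a sliding-window minimum. In the setting above the values would be canonical *m*-mer codes and the window width `k − m + 1`; `slidingMin` is a hypothetical function and does not reproduce the package's tie-breaking or boundary detection.

```go
package main

import "fmt"

// slidingMin returns the minimum of every window of w consecutive values.
// The deque holds indices whose values increase from front to back, so the
// front is always the minimum of the current window.
func slidingMin(values []uint64, w int) []uint64 {
	if w <= 0 || len(values) < w {
		return nil
	}
	deque := make([]int, 0, w)
	out := make([]uint64, 0, len(values)-w+1)
	for i, v := range values {
		// Drop candidates from the back that can never be the minimum again.
		for len(deque) > 0 && values[deque[len(deque)-1]] >= v {
			deque = deque[:len(deque)-1]
		}
		deque = append(deque, i)
		// Drop the front if it has slid out of the window.
		if deque[0] <= i-w {
			deque = deque[1:]
		}
		if i >= w-1 {
			out = append(out, values[deque[0]])
		}
	}
	return out
}

func main() {
	fmt.Println(slidingMin([]uint64{5, 3, 8, 2, 6, 7}, 3)) // prints [3 2 2 2]
}
```

Every index is pushed once and popped at most once, which is where the linear total cost comes from. A super k-mer boundary corresponds to a position where the reported minimum changes.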
|
||||
|
||||
## Complexity
|
||||
|
||||
| Aspect | Bound                                    |
|--------|------------------------------------------|
| Time   | **O(n)** (linear in sequence length)     |
| Space  | **O(k − m + 1)** for deque + output size |
|
||||
|
||||
Useful in genome compression, read clustering, and minimizer-based alignment acceleration.
|
||||
@@ -0,0 +1,32 @@
|
||||
# Super K-mers Extraction Module (`obikmer`)
|
||||
|
||||
This Go package provides efficient tools for extracting **super k-mers** from DNA sequences using *minimizer-based sliding windows*. A super k-mer is a maximal contiguous subsequence in which every window of size `k` has the same canonical minimizer.
|
||||
|
||||
## Core Functionality
|
||||
|
||||
- **`IterSuperKmers(seq, k, m)`**
|
||||
Returns an iterator over `SuperKmer` structs. Each struct contains:
|
||||
- `Start`, `End`: genomic positions of the super k-mer in the original sequence
|
||||
- `Minimizer`: canonical minimizer value (uint64) for that segment
|
||||
- `Sequence`: the actual DNA subsequence
|
||||
|
||||
- **`SuperKmer.ToBioSequence(...)`**
|
||||
Converts a raw `SuperKmer` into an enriched `obiseq.BioSequence`, embedding metadata:
|
||||
- ID: `{parentID}_superkmer_{start}_{end}`
|
||||
- Attributes: minimizer sequence (`minimizer_seq`), value, `k`, `m`, positions, and parent ID
|
||||
|
||||
- **`SuperKmerWorker(k, m)`**
|
||||
A `SeqWorker` adapter for pipeline integration (e.g., with `obiiter`). Processes a full BioSequence and returns all extracted super k-mers as a slice of `BioSequence`s.
|
||||
|
||||
## Algorithm Highlights
|
||||
|
||||
- Uses **canonical minimizers** (forward/reverse-complement minimum) to ensure strand-invariance
|
||||
- Maintains a monotonic deque for efficient *sliding-window minimizer* tracking (O(n) time complexity)
|
||||
- Supports DNA bases `A/C/G/T/U` case-insensitively via bitmasking (`seq[i] & 31`)
|
||||
- Enforces parameter constraints: `1 ≤ m < k ≤ 31`, sequence length ≥ `k`
|
||||
|
||||
## Use Cases
|
||||
|
||||
- Read partitioning in metagenomics (e.g., for error correction or clustering)
|
||||
- Efficient k-mer space segmentation without storing all individual kmers
|
||||
- Integration into modular bioinformatics pipelines via `SeqWorker` interface
|
||||
@@ -0,0 +1,39 @@
|
||||
# Semantic Description of `obikmer` Package Functionalities
|
||||
|
||||
The `obikmer` package provides tools for **super k-mer extraction and minimizer-based sequence analysis** in bioinformatics.
|
||||
|
||||
## Core Concepts
|
||||
|
||||
A **super k-mer** is a maximal contiguous subsequence of DNA where *all* embedded *k*-mers share the **same minimizer**—a compact representative (typically lexicographically minimal) of *m*-mers, considering both forward and reverse-complement strands.
|
||||
|
||||
## Key Functions & Features
|
||||
|
||||
- **`IterSuperKmers(seq, k, m)`**:
|
||||
An iterator over all super *k*-mers in input sequence `seq`, parameterized by:
|
||||
- `k`: length of embedded *k*-mers,
|
||||
- `m`: minimizer length (`1 ≤ m < k`).
|
||||
Yields structured objects with:
|
||||
- `Sequence`: the super *k*-mer substring,
|
||||
- `Start`/`End`: genomic coordinates (0-based half-open),
|
||||
- `Minimizer`: canonical hash of the shared minimizer.
|
||||
|
||||
- **`ExtractSuperKmers(...)`**:
|
||||
Synchronous counterpart returning a slice of all super *k*-mers.
|
||||
|
||||
## Verified Properties (via Tests)
|
||||
|
||||
1. **Boundary correctness**: Extracted subsequences match `seq[start:end]`.
|
||||
2. **Consistency between iterator and slice versions**: Both APIs produce identical results.
|
||||
3. **Bijection property**:
|
||||
- Each unique super *k*-mer sequence maps to exactly one minimizer.
|
||||
- All embedded *k*-mers within a super *k*-mer share the same minimizer.
|
||||
|
||||
## Implementation Notes
|
||||
|
||||
- Minimizers are computed canonically (min of forward and reverse-complement encodings).
|
||||
- Uses base encoding via `__single_base_code__` (assumed helper mapping A/C/G/T → 0/1/2/3).
|
||||
- Tests cover simple, homopolymer-rich, and complex genomic patterns.
|
||||
|
||||
## Design Rationale
|
||||
|
||||
Super *k*-mers enable efficient compression, indexing (e.g., in minimizer spaces), and alignment-free comparisons—crucial for scalable genomic analysis.
|
||||
@@ -0,0 +1,33 @@
|
||||
# Variable-Length Integer Encoding/Decoding Utility
|
||||
|
||||
This Go package (`obikmer`) provides efficient serialization of `uint64` integers using **protobuf-style variable-length encoding (varint)**.
|
||||
|
||||
## Core Features
|
||||
|
||||
- ✅ `EncodeVarint(io.Writer, uint64) (n int, err error)`
|
||||
Writes a `uint64` as a compact varint to any `io.Writer`. Uses **7 bits per byte**, with the MSB as a continuation flag. Max 10 bytes for `uint64`.
|
||||
|
||||
- ✅ `DecodeVarint(io.Reader) (val uint64, err error)`
|
||||
Reads and decodes a varint from any `io.Reader`. Handles multi-byte sequences safely; returns an error on malformed input or on overflow (an encoding that runs past 10 bytes, i.e. 70 payload bits).
|
||||
|
||||
- ✅ `VarintLen(uint64) int`
|
||||
Computes the exact byte length required to encode a value *without* performing I/O — useful for buffer preallocation or size estimation.
|
||||
|
||||
## Encoding Scheme
|
||||
|
||||
- Each byte holds 7 bits of data; the most significant bit is `1` if more bytes follow, else `0`.
|
||||
- Example:
|
||||
- `0x7F` → `1 byte`: `0111_1111`
|
||||
- `0x80` → `2 bytes`: `1000_0000 0000_0001`
|
||||
|
||||
## Use Cases
|
||||
|
||||
- Network protocols & binary file formats requiring compact integer representation
|
||||
- Serialization frameworks (e.g., custom protobuf-like codecs)
|
||||
- Embedded systems or bandwidth-constrained environments where space efficiency matters
|
||||
|
||||
## Design Notes
|
||||
|
||||
- No external dependencies; uses only `io` from the standard library.
|
||||
- Thread-safe *per call* (no shared state), but `io.Reader`/`Writer` concurrency must be handled externally.
|
||||
- Compatible with standard protobuf varint format (e.g., interoperable with `encoding/binary` or gRPC).
|
||||
@@ -0,0 +1,37 @@
|
||||
# Varint Encoding and Decoding Module (`obikmer`)
|
||||
|
||||
This Go package implements **variable-length integer encoding/decoding**, commonly used in binary protocols (e.g., Protocol Buffers, SQLite) to efficiently store small integers using fewer bytes.
|
||||
|
||||
## Core Features
|
||||
|
||||
- **`EncodeVarint(w io.Writer, v uint64) (n int, err error)`**
|
||||
Encodes a `uint64` value into the minimal number of bytes (1–10) using **LEB128-style varint**, writing the result to a writer. Returns bytes written and any I/O error.
|
||||
|
||||
- **`DecodeVarint(r io.Reader) (uint64, error)`**
|
||||
Reads and decodes a varint from an `io.Reader`, reconstructing the original `uint64`. Fails on malformed or incomplete data.
|
||||
|
||||
- **`VarintLen(v uint64) int`**
|
||||
Computes the exact number of bytes required to encode `v`, without performing I/O.
|
||||
|
||||
## Test Coverage
|
||||
|
||||
- **Round-trip correctness**: All test values (including edge cases like `0`, powers of two, and max `uint64`) encode → decode back identically.
|
||||
- **Length validation**: Encoded length matches `VarintLen` predictions exactly (e.g., 127 → 1 byte; 16384 → 3 bytes).
|
||||
- **Sequence handling**: Multiple varints can be concatenated and decoded in order, preserving data integrity.
|
||||
|
||||
## Efficiency & Design
|
||||
|
||||
- Uses **7-bit groups per byte**, with the MSB as a continuation flag (`1` = more bytes follow).
|
||||
- Minimal memory footprint — no allocations beyond buffer I/O.
|
||||
- Designed for streaming use (e.g., network or file serialization).
|
||||
|
||||
## Edge Cases Verified
|
||||
|
||||
| Value           | Encoded Length |
|-----------------|----------------|
| `0`             | 1 byte         |
| `2⁷−1 = 127`    | 1 byte         |
| `2⁷ = 128`      | 2 bytes        |
| `2¹⁴−1 = 16383` | 2 bytes        |
| `^uint64(0)`    | **10 bytes**   |
|
||||
|
||||