⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
Eric Coissac
2026-04-07 08:36:50 +02:00
parent 670edc1958
commit 8c7017a99d
392 changed files with 18875 additions and 141 deletions
@@ -0,0 +1,36 @@
# Semantic Description of `obikmer` Package
This Go package provides utilities for **k-mer (specifically 4-mer) counting and comparison** of biological sequences.
## Core Functionalities
1. **`Count4Mer(seq, buffer, counts)`**
Counts occurrences of each of the 256 possible 4-mers (4-nucleotide subsequences) in a `BioSequence`.
- Encodes each 4-mer into an integer (0-255) using `Encode4mer`.
- Populates a fixed-size `[256]uint16` table (`Table4mer`) with counts.
- Reuses or allocates the `counts` buffer as needed.
2. **`Common4Mer(count1, count2)`**
Computes the *intersection* of two 4-mer frequency profiles: sum over all k-mers of `min(count1[k], count2[k])`.
Used to measure shared content between sequences.
3. **`Sum4Mer(count)`**
Returns the total number of 4-mers in a profile (i.e., sum over all entries).
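The three functions above can be sketched as follows. This is a minimal illustration, not the package's code: `encodeNuc`, `count4mer`, `common4mer`, and `sum4mer` are hypothetical names, sequences are plain strings rather than `BioSequence` values, and buffer reuse is left out.

```go
package main

import "fmt"

// encodeNuc maps a base to 2 bits; anything unrecognized falls back to A (0).
// Hypothetical helper written for this sketch.
func encodeNuc(b byte) uint8 {
	switch b {
	case 'C', 'c':
		return 1
	case 'G', 'g':
		return 2
	case 'T', 't', 'U', 'u':
		return 3
	}
	return 0
}

// count4mer fills a 256-entry table with the occurrences of each 4-mer.
func count4mer(seq string) [256]uint16 {
	var counts [256]uint16
	var code uint8
	for i := 0; i < len(seq); i++ {
		// Shifting a uint8 left by 2 drops the oldest base automatically.
		code = code<<2 | encodeNuc(seq[i])
		if i >= 3 {
			counts[code]++
		}
	}
	return counts
}

// common4mer sums min(c1[k], c2[k]) over all 256 codes.
func common4mer(c1, c2 [256]uint16) int {
	total := 0
	for k := range c1 {
		if c1[k] < c2[k] {
			total += int(c1[k])
		} else {
			total += int(c2[k])
		}
	}
	return total
}

// sum4mer returns the total number of 4-mers recorded in a table.
func sum4mer(c [256]uint16) int {
	total := 0
	for _, n := range c {
		total += int(n)
	}
	return total
}

func main() {
	a := count4mer("ACGTACGT")
	b := count4mer("ACGTTTTT")
	fmt.Println(sum4mer(a), sum4mer(b), common4mer(a, b)) // prints: 5 5 1
}
```

A sequence of length `n` yields `n - 3` overlapping 4-mers, which is why both 8-base examples sum to 5.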
## Distance & Similarity Bounds
4. **`LCS4MerBounds(count1, count2)`**
Estimates bounds for the *Longest Common Subsequence* (LCS) length between two sequences based on 4-mer profiles:
- **Lower bound**: `common_kmers + (3 if common > 0 else 0)`
- **Upper bound**: `min(total1, total2) + 3 - ceil((min_total - common)/4)`
Leverages the fact that overlapping k-mers constrain possible alignments.
5. **`Error4MerBounds(count1, count2)`**
Estimates bounds for *alignment errors* (e.g., mismatches + indels):
- **Upper bound**: `max_total - common_kmers + 2 * floor((common_kmers + 5)/8)`
- **Lower bound**: `ceil(upper_bound / 4)`
Provides fast, approximate error estimates without full alignment.
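As a worked illustration of the two sets of bounds, the sketch below takes the LCS upper bound as `min_total + 3 - ceil((min_total - common)/4)` and the error upper bound as `max_total - common + 2*floor((common + 5)/8)`. The function names and the use of plain integers instead of 4-mer tables are assumptions made for the example.

```go
package main

import "fmt"

// ceilDiv returns ceil(a/b) for a >= 0 and b > 0.
func ceilDiv(a, b int) int { return (a + b - 1) / b }

// lcsBounds applies the LCS formulas to the two profile totals and the
// number of shared 4-mers (assumed to be <= min(total1, total2)).
func lcsBounds(total1, total2, common int) (lower, upper int) {
	minTotal := total1
	if total2 < minTotal {
		minTotal = total2
	}
	lower = common
	if common > 0 {
		lower += 3 // a shared 4-mer implies at least 4 shared bases
	}
	upper = minTotal + 3 - ceilDiv(minTotal-common, 4)
	return lower, upper
}

// errorBounds applies the alignment-error formulas to the same inputs.
func errorBounds(total1, total2, common int) (lower, upper int) {
	maxTotal := total1
	if total2 > maxTotal {
		maxTotal = total2
	}
	upper = maxTotal - common + 2*((common+5)/8) // integer division is floor here
	lower = ceilDiv(upper, 4)
	return lower, upper
}

func main() {
	// Two 100-base reads (97 four-mers each) sharing 90 four-mers.
	lcsLo, lcsHi := lcsBounds(97, 97, 90)
	errLo, errHi := errorBounds(97, 97, 90)
	fmt.Println(lcsLo, lcsHi, errLo, errHi) // prints: 93 98 8 29
}
```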
## Use Case
Designed for **high-performance comparison of NGS reads** (e.g., in metabarcoding), where exact alignment is too costly, and k-mer-based heuristics enable scalable similarity estimation.
@@ -0,0 +1,44 @@
# Semantic Description of the `obikmer` Package
This Go package implements a **De Bruijn graph** for efficient k-mer manipulation and sequence assembly, primarily used in bioinformatics (e.g., metagenomic read error correction or consensus building).
### Core Functionalities
- **K-mer Encoding**: K-mers are encoded as `uint64` using 2 bits per nucleotide (A=0, C=1, G=2, T=3), supporting IUPAC ambiguity codes via the `iupac` map.
- **Reverse Complement Handling**: The `revcompnuc` table enables nucleotide-wise reverse complementation.
- **Graph Construction**: The `DeBruijnGraph` struct maintains a map from k-mer hashes to integer weights (e.g., observed counts), with helper masks for bit manipulation (`kmermask`, `prevc/g/t`).
### Graph Operations
- **Node Queries**:
- `Previouses()` / `Nexts()`: Return predecessor/successor k-mers in the graph.
- `MaxNext()` / `MaxHead()`: Find neighbors or heads (sources) with maximum weight.
- **Path Exploration**:
- `MaxPath()`: Greedily traces the highest-weight path from a head.
- `LongestPath()`: Explores all heads to find the path with maximum cumulative weight (optionally bounded in length).
- `HaviestPath()`: Uses Dijkstra-like priority queue to find the *heaviest* (sum-weight) path, with cycle detection via DFS (`HasCycle()`).
### Consensus & Filtering
- **Consensus Generation**:
- `BestConsensus()` returns a sequence from the greedy max-weight path.
- `LongestConsensus(id, min_cov)` trims low-coverage ends using a coverage threshold (mode-based).
- **Weight Statistics**:
- `MaxWeight()`, `WeightMean()`, `WeightMode()` provide distribution summaries.
- `FilterMinWeight(min)` removes low-count nodes.
- **Decoding**:
- `DecodeNode()` converts a k-mer index to its DNA string.
- `DecodePath()` reconstructs the full consensus from a path.
### I/O & Diagnostics
- **GML Export**: `WriteGml()` outputs a directed graph in Graph Modelling Language (for visualization), with edge thickness and labels reflecting weights.
- **Hamming Distance**: `HammingDistance()` counts the positions at which two encoded k-mers differ, using bit operations.
- **Sequence Insertion**: `Push()` adds a biosequence (with count weight) to the graph, expanding all IUPAC variants recursively.
### Dependencies & Design
- Leverages `obiseq` for sequence representation, `logrus` for logging, and `slices`/`heap` from Go's standard library.
- Designed for scalability and speed, using bit-level operations to minimize memory footprint.
Overall: a robust k-mer graph engine for *de novo* assembly, error correction, and consensus recovery in high-throughput sequencing data.
@@ -0,0 +1,35 @@
# Semantic Description of `obikmer` Package
The `obikmer` package provides efficient k-mer encoding and comparison utilities for biological sequences, optimized for DNA analysis.
## Core Functionalities
1. **Nucleotide Encoding**
- `EncodeNucleotide(b byte)`: Maps DNA bases (A, C, G, T/U) to 2-bit values:
`0→A`, `1→C`, `2→G`, `3→T/U`.
Ambiguous or non-standard characters (e.g., N, R, Y) default to `A` (`0`).
Uses a lookup table for O(1) performance.
2. **4-mer Encoding**
- `Encode4mer(seq, buffer)`: Converts a biological sequence into overlapping 4-mers.
Each k-mer is encoded as an unsigned byte (0-255), where each nucleotide contributes 2 bits.
Supports optional buffer reuse for memory efficiency.
3. **4-mer Indexing**
- `Index4mer(seq, index, buffer)`: Builds an inverted index mapping each 4-mer code (0-255) to all its occurrence positions in the sequence.
Returns `[][]int`, where rows correspond to k-mer codes and columns list positions.
4. **Fast Sequence Comparison**
- `FastShiftFourMer(...)`: Compares two sequences using a FASTA-like shift-scoring algorithm.
- Uses precomputed 4-mer index of a reference sequence and encodes the query.
- Counts co-occurring 4-mers across all possible shifts (`refPos - queryPos`).
- Computes raw and relative scores (normalized by alignment length).
- Returns optimal shift, count of matching 4-mers, and maximum score (raw or relative).
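The shift-counting idea can be sketched as below. This is an illustration under assumptions: `encode4mers`, `index4mer`, and `bestShift` are hypothetical names, only the raw count is computed (no relative score), and ties are broken in favour of the smallest shift.

```go
package main

import "fmt"

// encodeNuc maps a base to 2 bits; unknown characters default to A (0).
func encodeNuc(b byte) uint8 {
	switch b {
	case 'C', 'c':
		return 1
	case 'G', 'g':
		return 2
	case 'T', 't', 'U', 'u':
		return 3
	}
	return 0
}

// encode4mers returns the code of every overlapping 4-mer of seq.
func encode4mers(seq string) []uint8 {
	if len(seq) < 4 {
		return nil
	}
	codes := make([]uint8, 0, len(seq)-3)
	var code uint8
	for i := 0; i < len(seq); i++ {
		code = code<<2 | encodeNuc(seq[i])
		if i >= 3 {
			codes = append(codes, code)
		}
	}
	return codes
}

// index4mer lists, for each 4-mer code, the positions where it occurs.
func index4mer(seq string) [256][]int {
	var idx [256][]int
	for pos, code := range encode4mers(seq) {
		idx[code] = append(idx[code], pos)
	}
	return idx
}

// bestShift counts shared 4-mers per shift (refPos - queryPos) and returns
// the shift supported by the most 4-mers, with its count.
func bestShift(ref, query string) (shift, count int) {
	idx := index4mer(ref)
	shifts := map[int]int{}
	for qpos, code := range encode4mers(query) {
		for _, rpos := range idx[code] {
			shifts[rpos-qpos]++
		}
	}
	for s, c := range shifts {
		if c > count || (c == count && s < shift) {
			shift, count = s, c
		}
	}
	return shift, count
}

func main() {
	// The query matches the reference starting at reference position 4.
	fmt.Println(bestShift("GGGGACGTTGCA", "ACGTTGCA")) // prints: 4 5
}
```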
## Design Highlights
- **Memory-aware**: Supports buffer reuse to minimize allocations.
- **Robustness**: Non-canonical bases handled gracefully (defaulting to A).
- **Performance-oriented**: O(n) encoding and indexing; efficient hash-based shift counting.
Intended for rapid alignment-free sequence comparison in metabarcoding or metagenomic workflows.
@@ -0,0 +1,39 @@
# Semantic Description of `obikmer` Package
The `obikmer` package provides high-performance, zero-allocation utilities for **k-mer manipulation** in DNA sequences (A/C/G/T/U), targeting bioinformatics applications like genome indexing, assembly, and error correction.
## Core Encoding & Decoding
- **`EncodeKmer`, `DecodeKmer`**: Convert between DNA sequences and compact 62-bit uint64 representations (2 bits/base), preserving top 2 bits for optional error markers.
- **`EncodeCanonicalKmer`, `CanonicalKmer`**: Encode or normalize k-mers to their *biological canonical form* — the lexicographically smaller of a k-mer and its reverse complement.
## Iterators (Memory-Efficient Streaming)
- **`IterKmers`, `IterCanonicalKmers`**: Stream all overlapping k-mers from a sequence without allocating intermediate slices — ideal for large-scale processing (e.g., inserting into Roaring Bitmaps).
- **`IterCanonicalKmersWithErrors`**: Same as above, but detects ambiguous bases (N/R/Y/W/S/K/M/B/D/H/V) and encodes their count in the top 2 bits (error code: 0-3). Only valid for **odd k ≤ 31**.
## Error Handling & Markers
- `SetKmerError`, `GetKmerError`, and `ClearKmerError` manipulate the top 2 bits of a uint64 to store error metadata (e.g., ambiguous base count), enabling downstream filtering or correction.
## Reverse Complement & Circular Normalization
- **`ReverseComplement`, `CanonicalKmer`**: Compute biological reverse complement and canonical form.
- **`NormalizeCircular`, `EncodeCircularCanonicalKmer`**: Compute *circular canonical form* — the lexicographically smallest rotation (used for low-complexity masking).
- Distinction: `CanonicalKmer` uses **reverse complement**, while `NormalizeCircular` uses **rotation**.
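The distinction between the two normal forms can be illustrated with the loop-based sketch below. The lowercase function names are hypothetical, and the straightforward loops stand in for whatever bit tricks the package actually uses.

```go
package main

import "fmt"

// encodeKmer packs a DNA string into 2 bits per base (A=0, C=1, G=2, T=3).
func encodeKmer(seq string) uint64 {
	var kmer uint64
	for i := 0; i < len(seq); i++ {
		var c uint64
		switch seq[i] {
		case 'C', 'c':
			c = 1
		case 'G', 'g':
			c = 2
		case 'T', 't', 'U', 'u':
			c = 3
		}
		kmer = kmer<<2 | c
	}
	return kmer
}

// reverseComplement reverses the k bases and complements each one (3 - code).
func reverseComplement(kmer uint64, k int) uint64 {
	var rc uint64
	for i := 0; i < k; i++ {
		rc = rc<<2 | (3 - kmer&3)
		kmer >>= 2
	}
	return rc
}

// canonicalKmer returns the smaller of a k-mer and its reverse complement.
func canonicalKmer(kmer uint64, k int) uint64 {
	if rc := reverseComplement(kmer, k); rc < kmer {
		return rc
	}
	return kmer
}

// normalizeCircular returns the smallest of the k rotations of a k-mer.
func normalizeCircular(kmer uint64, k int) uint64 {
	mask := uint64(1)<<(2*k) - 1
	best := kmer
	for i := 1; i < k; i++ {
		kmer = (kmer<<2 | kmer>>(2*(k-1))) & mask // rotate left by one base
		if kmer < best {
			best = kmer
		}
	}
	return best
}

func main() {
	kmer := encodeKmer("TTCA")
	fmt.Println(canonicalKmer(kmer, 4) == encodeKmer("TGAA"))     // prints: true
	fmt.Println(normalizeCircular(kmer, 4) == encodeKmer("ATTC")) // prints: true
}
```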
## Counting & Math Utilities
- **`CanonicalCircularKmerCount`, `necklaceCount`, etc.**: Compute exact counts of unique circular k-mer equivalence classes using **Moreau's necklace formula**, with Euler's totient function and divisor enumeration.
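Moreau's formula gives the number of necklaces of length `n` over an alphabet of size `a` as `(1/n) * Σ_{d|n} φ(d) * a^(n/d)`. The sketch below evaluates it directly. The helper names are hypothetical, and it counts rotation classes only; whether the package additionally folds in reverse complements is not stated here.

```go
package main

import "fmt"

// totient computes Euler's φ(n) by trial division.
func totient(n int) int {
	result := n
	for p := 2; p*p <= n; p++ {
		if n%p == 0 {
			for n%p == 0 {
				n /= p
			}
			result -= result / p
		}
	}
	if n > 1 {
		result -= result / n
	}
	return result
}

// pow returns base^exp for small non-negative exponents.
func pow(base, exp int) int {
	r := 1
	for i := 0; i < exp; i++ {
		r *= base
	}
	return r
}

// necklaces counts rotation classes of words of length n over a symbols.
func necklaces(n, a int) int {
	sum := 0
	for d := 1; d <= n; d++ {
		if n%d == 0 {
			sum += totient(d) * pow(a, n/d)
		}
	}
	return sum / n
}

func main() {
	for n := 1; n <= 4; n++ {
		fmt.Println(n, necklaces(n, 4))
	}
	// prints:
	// 1 4
	// 2 10
	// 3 24
	// 4 70
}
```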
## Performance & Safety
- All functions avoid heap allocations where possible (reusing buffers).
- Panics on invalid `k` or length mismatches for correctness.
- Supports case-insensitive input (A/a, T/t…), and ambiguous bases via `__single_base_code_err__`.
## Use Cases
- K-mer counting in assemblers (e.g., with Bloom filters or bitmaps)
- Error-aware k-mer filtering in sequencing pipelines
- Low-complexity region detection via circular entropy normalization
@@ -0,0 +1,36 @@
# Obikmer: Efficient K-mer Encoding and Manipulation in Go
This package provides high-performance utilities for DNA sequence analysis using *k*-mers—contiguous substrings of length `k`. It supports encoding, canonicalization (forward/reverse-complement normalization), minimizer-based super-*k*-mer extraction, and error tagging—all optimized for 64-bit integer arithmetic.
## Core Functionalities
### K-mer Encoding (`EncodeKmers`, `IterKmers`)
Encodes DNA sequences (A/C/G/T/U, case-insensitive) into `uint64` using 2 bits per nucleotide (A=00, C=01, G=10, T/U=11). Supports sliding-window extraction and streaming via an iterator. Handles sequences up to 31-mers (62 bits), with validation for invalid `k` values.
### Reverse Complement (`ReverseComplement`)
Computes the reverse complement of a *k*-mer in constant time using bit manipulation. Preserves error metadata (see below) and satisfies involution: `RC(RC(x)) = x`.
### Canonical K-mers (`CanonicalKmer`, `EncodeCanonicalKmers`)
Returns the lexicographically smaller of a *k*-mer and its reverse complement—enabling strand-agnostic analysis. Supports both single-kmer normalization (`CanonicalKmer`) and full-sequence canonical encoding.
### Super *k*-mers Extraction (`ExtractSuperKmers`)
Groups overlapping *k*-mers sharing the same minimizer (minimal *m*-mer in sliding window) into contiguous regions ("super *k*-mers"). Output includes start/end positions and minimizer values, all canonicalized.
### Error Marking (`SetKmerError`, `GetKmerError`, etc.)
Uses the top 2 bits of a `uint64` to tag error states (e.g., sequencing errors), leaving 62 bits for sequence data. Error operations preserve the underlying *k*-mer and work seamlessly with canonicalization/RC.
## Key Features
- **Memory Efficiency**: Reusable buffers via optional `*[]uint64` or `*[]SuperKmer` parameters.
- **Edge Case Handling**: Gracefully handles empty sequences, `k > len(seq)`, invalid parameters (`m ≥ k`), and max-length constraints.
- **Performance**: Optimized for speed—benchmarks included for all major functions (e.g., `BenchmarkEncodeKmers`, `BenchmarkExtractSuperKmers`).
- **Comprehensive Testing**: Covers basic cases, boundary conditions (e.g., 31-mers), symmetry properties (canonical/RC invariance), and error resilience.
## Use Cases
- Genome assembly & DBG construction
- Minimizer-based sketching (e.g., *Mash*, *Sourmash*)
- Error-aware k-mer counting & filtering
- Strand-unbiased sequence comparison
All functions operate on `[]byte` DNA sequences and return canonicalized, efficient representations suitable for hashing or indexing.
@@ -0,0 +1,31 @@
# Semantic Description of `obikmer` Entropy Functions
The `obikmer` package provides high-performance tools to compute **Shannon entropy** for DNA *k*-mers, with a focus on detecting low-complexity sequences via sub-word repetition analysis.
## Core Functionality
- **`KmerEntropy(kmer, k, levelMax)`**:
Computes the *minimum normalized Shannon entropy* across all sub-word sizes from `1` to `levelMax`.
- Decodes the encoded *k*-mer (2 bits/base) into a DNA string.
- For each word size `ws`, extracts all overlapping substrings, normalizes them to their **circular canonical form**, and counts frequencies.
- Normalized entropy = `(log(N) - Σ(nᵢ log nᵢ)/N) / emax`, where `emax` is the theoretical max entropy given sequence length and alphabet constraints.
- Returns min entropy across `ws ∈ [1, levelMax]`. Values near **0** indicate repeats (e.g., `AAAAA…`); values near **1** suggest high complexity.
- **`KmerEntropyFilter`**:
A reusable, precomputed filter for batch processing millions of *k*-mers efficiently:
- Pre-builds normalization tables (for circular canonical forms), entropy lookup values (`emax`, `logNwords`), and frequency tables.
- Avoids repeated allocations — critical for performance in pipelines (e.g., read filtering).
- **Not goroutine-safe** — each thread must instantiate its own filter.
- **`NewKmerEntropyFilter(k, levelMax, threshold)`**:
Initializes a filter with precomputed tables and sets the entropy rejection `threshold`.
- **`Accept(kmer)` / `Entropy(kmer)`**:
- `Accept()` returns `true` if entropy > threshold (i.e., *k*-mer is complex enough to pass).
- `Entropy()` computes entropy using precomputed tables — ~10× faster than standalone calls.
## Design Highlights
- **Circular canonical normalization** ensures symmetry (e.g., `AT` ≡ `TA`).
- **Sub-word-level entropy** captures local repetitiveness better than global *k*-mer uniqueness.
- Optimized for **speed and memory reuse**, suitable for large-scale genomic data filtering.
@@ -0,0 +1,37 @@
# K-Way Merge for Sorted k-mer Streams
This Go package implements a **k-way merge** over multiple sorted streams of *k*-mer values (`uint64`). It leverages a **min-heap** to efficiently produce the globally sorted sequence while aggregating duplicate counts across input streams.
## Core Components
- **`mergeItem`**: Stores a value and its source reader index for heap operations.
- **`mergeHeap`** & `heap.Interface`: Implements a min-heap for efficient retrieval of smallest values.
- **`KWayMerge`**: Main struct managing the heap and input readers.
## Key Functionality
- **Initialization (`NewKWayMerge`)**:
- Takes a slice of `*KdiReader`, each expected to yield sorted values.
- Preloads the heap with one value from each reader.
- **Streaming Output (`Next`)**:
- Returns the next smallest *k*-mer, its frequency across readers (i.e., how many input streams contained it), and a success flag.
- Handles duplicates: pops *all* items equal to the current minimum before advancing readers.
- **Cleanup (`Close`)**:
- Closes all underlying `KdiReader`s and returns the first encountered error.
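The merge logic described above can be sketched with `container/heap`. In this illustration in-memory slices stand in for `KdiReader` streams, each stream is assumed to be strictly increasing, and the lowercase type and function names are hypothetical.

```go
package main

import (
	"container/heap"
	"fmt"
)

// mergeItem pairs a value with the index of the stream it came from.
type mergeItem struct {
	value uint64
	src   int
}

// mergeHeap is a min-heap of mergeItem ordered by value.
type mergeHeap []mergeItem

func (h mergeHeap) Len() int            { return len(h) }
func (h mergeHeap) Less(i, j int) bool  { return h[i].value < h[j].value }
func (h mergeHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *mergeHeap) Push(x interface{}) { *h = append(*h, x.(mergeItem)) }
func (h *mergeHeap) Pop() interface{} {
	old := *h
	it := old[len(old)-1]
	*h = old[:len(old)-1]
	return it
}

type kWayMerge struct {
	h       mergeHeap
	streams [][]uint64
	pos     []int // next unread index in each stream
}

// newKWayMerge preloads the heap with the first value of each stream.
func newKWayMerge(streams [][]uint64) *kWayMerge {
	m := &kWayMerge{streams: streams, pos: make([]int, len(streams))}
	for i, s := range streams {
		if len(s) > 0 {
			m.h = append(m.h, mergeItem{s[0], i})
			m.pos[i] = 1
		}
	}
	heap.Init(&m.h)
	return m
}

// Next returns the smallest pending value and how many streams held it.
func (m *kWayMerge) Next() (uint64, int, bool) {
	if len(m.h) == 0 {
		return 0, 0, false
	}
	minVal, count := m.h[0].value, 0
	for len(m.h) > 0 && m.h[0].value == minVal {
		it := heap.Pop(&m.h).(mergeItem)
		count++
		if p := m.pos[it.src]; p < len(m.streams[it.src]) {
			heap.Push(&m.h, mergeItem{m.streams[it.src][p], it.src})
			m.pos[it.src] = p + 1
		}
	}
	return minVal, count, true
}

func main() {
	m := newKWayMerge([][]uint64{{1, 3, 5}, {3, 4}, {3, 5, 9}})
	for v, c, ok := m.Next(); ok; v, c, ok = m.Next() {
		fmt.Printf("%d:%d ", v, c) // prints: 1:1 3:3 4:1 5:2 9:1
	}
	fmt.Println()
}
```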
## Use Case
Ideal for merging sorted *k*-mer databases (e.g., from multiple files or processes), enabling:
- Efficient deduplication with multiplicity tracking.
- Scalable union/intersection operations on large *k*-mer sets.
## Complexity
| Operation | Time |
|-----------|------------|
| `Next()` | *O(log k)* (heap ops per unique value) |
| Init | *O(k)* |
Where `k` = number of input readers.
@@ -0,0 +1,27 @@
# K-Way Merge Functionality in `obikmer`
This Go package provides utilities for merging sorted k-mer streams stored in `.kdi` files. Its core component is the `KWayMerge`, which performs a k-way merge of multiple sorted input streams, aggregating duplicate k-mers by counting their occurrences.
## Key Features
- **Sorted K-Mer Input**: Reads k-mers from `.kdi` files via `KdiReader`, assuming each file contains *sorted* 64-bit unsigned integers (`uint64`).
- **K-Way Merge**: Merges multiple sorted streams into a single globally sorted stream using an efficient priority queue (min-heap) internally.
- **Count Aggregation**: When identical k-mers appear across multiple streams, the merge counts how many times each unique k-mer occurs.
- **Memory-Efficient Streaming**: Processes data incrementally, avoiding full loading of all streams into memory.
- **Robust Test Coverage**: Includes unit tests for:
- Basic merging with overlapping and non-overlapping values.
- Single-stream input (degenerate case).
- Empty streams handling.
- All identical k-mers across inputs.
## API Highlights
- `NewKdiReader(path)`: opens a `.kdi` file for reading.
- `writeKdi(...)` (test helper): writes sorted k-mers to a `.kdi` file.
- `NewKWayMerge([]*KdiReader)`: constructs the merger from multiple readers.
- `.Next()` → `(kmer uint64, count int, ok bool)`: yields the next merged k-mer and its frequency; `ok=false` signals end-of-stream.
- `.Close()`: cleans up resources.
## Use Case
Ideal for aggregating k-mer counts across multiple sequencing samples (e.g., in bioinformatics), where each sample's k-mers are pre-sorted and persisted, enabling scalable distributed counting without full in-memory deduplication.
@@ -0,0 +1,27 @@
# KDI Reader: Streaming Delta-Varint Decoding for k-mers
The `obikmer` package provides a high-performance, streaming reader for `.kdi` files—binary containers storing *sorted* k-mers (typically DNA substrings encoded as 64-bit integers). It supports both sequential and indexed access.
## Core Features
- **Streaming decoding**: K-mers are read incrementally using delta-varint compression to minimize I/O and memory footprint.
- **Delta encoding**: After the first absolute `uint64`, subsequent values are stored as *deltas* (difference from previous), encoded via custom `DecodeVarint`.
- **Magic & format validation**: A 4-byte magic header ensures file integrity; Little Endian `uint64` stores total count.
- **Sparse indexing**: When paired with a `.kdx` index, `SeekTo(target)` enables fast forward-only jumps to positions ≥ target k-mer.
- **Graceful fallback**: If `.kdx` is missing or invalid, the reader automatically degrades to sequential mode.
## Key API
- `NewKdiReader(path)` → opens `.kdi` for streaming (no index).
- `NewKdiIndexedReader(path)` → opens with optional `.kdx` for random access.
- `Next()` → returns `(nextKmer, true)` or `(0, false)` when exhausted.
- `SeekTo(target uint64) error` → jumps to first k-mer ≥ target using index (no backward seek).
- `Count()` / `Remaining()` → total and unread k-mers.
- `Close()` → releases file handle.
## Design Highlights
- Uses 64KB buffer for efficient I/O.
- Index entries record `(kmer, byteOffset)` at fixed strides (e.g., every 1024 k-mers).
- `SeekTo` is idempotent and safe: no-op if target ≤ current position or index unavailable.
- Designed for large-scale genomic k-mer catalogs (e.g., from minimizers or de Bruijn graphs).
@@ -0,0 +1,34 @@
# KDI File Format and API
The `obikmer` package implements a compact, sorted k-mer storage format (`.kdi`) with delta compression for efficient disk persistence and retrieval.
## Core Features
- **Sorted k-mer serialization**: K-mers (as `uint64`) are written in ascending order.
- **Delta encoding**: Consecutive differences (deltas) between k-mers are stored using variable-length integers (`varint`), drastically reducing size for dense sequences.
- **Round-trip integrity**: Full write/read cycles preserve exact k-mer values and counts.
## File Structure
A `.kdi` file contains:
1. **Magic header** (4 bytes): Identifies the format.
2. **Count field** (8 bytes, `uint64`): Number of stored k-mers.
3. **First value** (8 bytes, `uint64`): Initial k-mer.
4. **Delta-encoded tail**: `(n-1)` deltas, each encoded as a `varint`.
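The layout above can be reproduced in memory with the sketch below. It rests on several assumptions: the magic bytes are taken to be `KDI\x01`, integers are little-endian, and the standard library's unsigned varint (`binary.PutUvarint`) stands in for the package's own varint codec, which may differ in detail. `encodeKdi` and `decodeKdi` are hypothetical names.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

var magic = []byte{'K', 'D', 'I', 0x01} // assumed magic header

// encodeKdi serializes strictly increasing k-mers: header, count,
// first value, then one varint delta per remaining k-mer.
func encodeKdi(kmers []uint64) []byte {
	var buf bytes.Buffer
	var u64 [8]byte
	buf.Write(magic)
	binary.LittleEndian.PutUint64(u64[:], uint64(len(kmers)))
	buf.Write(u64[:])
	if len(kmers) == 0 {
		return buf.Bytes()
	}
	binary.LittleEndian.PutUint64(u64[:], kmers[0])
	buf.Write(u64[:])
	var tmp [binary.MaxVarintLen64]byte
	for i := 1; i < len(kmers); i++ {
		n := binary.PutUvarint(tmp[:], kmers[i]-kmers[i-1])
		buf.Write(tmp[:n])
	}
	return buf.Bytes()
}

// decodeKdi reverses encodeKdi.
func decodeKdi(data []byte) ([]uint64, error) {
	if len(data) < 12 || !bytes.Equal(data[:4], magic) {
		return nil, errors.New("bad header")
	}
	count := binary.LittleEndian.Uint64(data[4:12])
	var out []uint64
	if count == 0 {
		return out, nil
	}
	if len(data) < 20 {
		return nil, errors.New("missing first value")
	}
	prev := binary.LittleEndian.Uint64(data[12:20])
	out = append(out, prev)
	rest := data[20:]
	for uint64(len(out)) < count {
		delta, n := binary.Uvarint(rest)
		if n <= 0 {
			return nil, errors.New("truncated delta")
		}
		prev += delta
		out = append(out, prev)
		rest = rest[n:]
	}
	return out, nil
}

func main() {
	fmt.Println(len(encodeKdi([]uint64{42}))) // prints: 20
	decoded, err := decodeKdi(encodeKdi([]uint64{10, 12, 300}))
	fmt.Println(decoded, err) // prints: [10 12 300] <nil>
}
```

A single k-mer costs 4 + 8 + 8 = 20 bytes, and every additional k-mer whose delta is below 128 costs one more byte.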
## API
- **`NewKdiWriter(path string)`**: Creates a writer; `Write(v uint64)` appends k-mers.
- **`Writer.Count()`**: Returns the number of written items before closing.
- **`NewKdiReader(path string)`**: Opens a reader; `Next() (uint64, bool)` yields k-mers in order.
- **`Reader.Count()`**: Returns total stored count.
## Tests Validate
1. Basic round-trip with diverse values (including large `uint64`s).
2. Empty and single-k-mer files.
3. Exact file size for minimal cases (e.g., 20 bytes for one k-mer).
4. Delta compression efficiency on dense sequences (e.g., 10k even numbers → ~9,999 extra bytes).
5. Real-world usage: extracting canonical k-mers from DNA sequences, sorting/deduplicating, and persisting them.
The format is optimized for memory-mapped access or streaming traversal in bioinformatics pipelines.
@@ -0,0 +1,38 @@
# KDI File Format and Writer
The `obikmer` package implements a compact, sorted sequence storage format for 64-bit k-mers using delta encoding and sparse indexing.
## Core Format (`.kdi`)
- **Magic header**: `KDI\x01` (`4 bytes`) identifies the file type.
- **Count field**: `uint64 LE`, total number of k-mers (patched at close).
- **First value**: `uint64 LE`, the initial k-mer stored as an absolute integer.
- **Deltas**: Subsequent values encoded via *delta-varint* (difference from previous k-mer), enabling high compression for sorted sequences.
## Writer (`KdiWriter`)
- **Strict ordering**: K-mers must be written in *strictly increasing order*.
- Efficient buffering via `bufio.Writer` (64 KB buffer).
- Internally tracks:
- Current k-mer count,
- Previous value (for delta computation),
- Bytes written in data section.
- **Sparse indexing**: Every `defaultKdxStride` k-mers, an entry is recorded in memory for random access.
## Companion Index (`.kdx`)
- Written automatically on `Close()` if indexing entries exist.
- Stores `(kmer, file_offset)` pairs for fast seek-to-position lookups (e.g., binary search on k-mer range).
- Enables efficient random access without full file scan.
## Usage Pattern
```go
w, _ := obikmer.NewKdiWriter("data.kdi")
for _, kmer := range sortedKMers {
	w.Write(kmer)
}
w.Close() // finalizes header, writes .kdx index
```
The format is optimized for memory-efficient storage and fast retrieval of sorted uint64 k-mers in genomic or sequence analysis pipelines.
@@ -0,0 +1,29 @@
# KDX Index Format and Functionality
The `obikmer` package provides a sparse indexing mechanism for `.kdi` files, which store sorted k-mers with delta encoding. The **`.kdx` file** serves as a fast lookup table to accelerate k-mer searches.
## Core Concepts
- **Magic bytes**: `KDX\x01` validates file integrity.
- **Stride-based sparsity**: One index entry every *N* k-mers (default: 4096), balancing memory vs. search speed.
- **Entry structure**: Each entry stores:
- `kmer`: the k-mer value at that index position.
- `offset`: absolute byte offset in the corresponding `.kdi` file.
## Key Operations
- **Loading**: `LoadKdxIndex()` reads and validates a `.kdx` file; returns `(nil, nil)` if missing (graceful degradation).
- **Searching**: `FindOffset(target uint64)` performs binary search over index entries to find the *best jump point*:
- Returns `offset`, `skipCount` (k-mer count already passed), and a boolean success flag.
- Enables efficient seeking: after `offset`, only up to *stride* k-mers need linear scanning.
- **Writing**: `WriteKdxIndex()` serializes an in-memory index to disk (for building indexes).
- **Helper**: `KdxPathForKdi()` derives the `.kdx` path from a given `.kdi` file.
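The jump-point lookup can be sketched with `sort.Search`. The struct layout, the rule that entry `j` describes k-mer number `j * stride`, and the lowercase names are all assumptions made for this illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// kdxEntry records the k-mer found at a sampled position and the byte
// offset in the .kdi file from which reading can resume.
type kdxEntry struct {
	kmer   uint64
	offset uint64
}

type kdxIndex struct {
	stride  int // one entry every stride k-mers (assumed layout)
	entries []kdxEntry
}

// findOffset returns the last entry whose k-mer is <= target, together
// with the number of k-mers that precede it in the file.
func (idx *kdxIndex) findOffset(target uint64) (offset, skipCount uint64, ok bool) {
	// First entry strictly greater than target; the jump point is just before it.
	i := sort.Search(len(idx.entries), func(i int) bool {
		return idx.entries[i].kmer > target
	})
	if i == 0 {
		return 0, 0, false // target precedes every indexed k-mer
	}
	e := idx.entries[i-1]
	return e.offset, uint64(i-1) * uint64(idx.stride), true
}

func main() {
	idx := &kdxIndex{stride: 4, entries: []kdxEntry{{10, 0}, {50, 7}, {90, 15}}}
	fmt.Println(idx.findOffset(60)) // prints: 7 4 true
	fmt.Println(idx.findOffset(5))  // prints: 0 0 false
}
```

After the jump, at most `stride` k-mers remain to be scanned linearly before reaching the first value at or above the target.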
## Performance
- Search complexity: **O(log M)** for the binary search (where *M* = #index entries), plus ≤ stride linear steps.
- Memory footprint: Linear in index size (16 bytes per entry), highly scalable for large k-mer sets.
## Design Philosophy
Minimalist, binary-safe format with explicit endianness (little-endian), no external dependencies beyond `encoding/binary`, and robust error handling.
@@ -0,0 +1,14 @@
# Semantic Description of `obikmer` Package
The `obikmer` package implements efficient k-mer matching between query sequences and an indexed reference using **canonical k-mers** partitioned by minimizer-based hashing.
- `QueryEntry` represents a single canonical kmer with its origin: sequence index and 1-based position.
- `PreparedQueries` groups queries into sorted buckets per partition, enabling batched and parallelized matching.
- `PrepareQueries` scans input sequences using *super-kmers* (with window size `m`) to compute minimizers, assigns each kmer to a partition via modulo hashing, and sorts buckets by kmer value.
- `MergeQueries` combines two sets of prepared queries across batches using a merge-sort strategy, correctly offsetting sequence indices to preserve global ordering.
- `MatchBatch` performs parallel matching per partition: each goroutine runs a **merge-scan** between sorted queries and the corresponding KDI (K-mer Disk Index) stream.
- Efficient seeking is used only when beneficial, avoiding costly syscalls for small skips.
- Matches are recorded with thread-safe per-sequence mutexes; final positions within each sequence are sorted post-match.
- `matchPartition` implements the core merge-scan: it opens a KDI reader, seeks to relevant regions of the index, and walks both query list and kmer stream in lockstep.
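The lockstep walk at the heart of the merge-scan can be sketched as below, with a sorted slice standing in for the KDI stream. Seeking, parallelism, and locking are left out, and the type and function names are hypothetical.

```go
package main

import "fmt"

// queryEntry is a canonical k-mer with its origin in the query batch.
type queryEntry struct {
	kmer   uint64
	seqIdx int
	pos    int // 1-based position in the sequence
}

// mergeScan walks sorted queries and a sorted k-mer stream together and
// returns every query whose k-mer occurs in the stream.
func mergeScan(queries []queryEntry, stream []uint64) []queryEntry {
	var hits []queryEntry
	qi, si := 0, 0
	for qi < len(queries) && si < len(stream) {
		switch {
		case queries[qi].kmer < stream[si]:
			qi++ // query k-mer absent from the index
		case queries[qi].kmer > stream[si]:
			si++ // advance the index stream
		default:
			hits = append(hits, queries[qi])
			qi++ // keep si: the next query may carry the same k-mer
		}
	}
	return hits
}

func main() {
	queries := []queryEntry{{3, 0, 1}, {7, 1, 4}, {7, 0, 9}, {12, 2, 2}}
	stream := []uint64{1, 7, 8, 12, 20}
	for _, h := range mergeScan(queries, stream) {
		fmt.Printf("kmer=%d seq=%d pos=%d\n", h.kmer, h.seqIdx, h.pos)
	}
	// prints:
	// kmer=7 seq=1 pos=4
	// kmer=7 seq=0 pos=9
	// kmer=12 seq=2 pos=2
}
```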
The design supports **large-scale batch processing**, incremental query accumulation, and high-performance parallel lookup—ideal for metagenomic or biodiversity sequencing workflows.
@@ -0,0 +1,49 @@
# `obikmer` K-mer Set Group Builder — Functional Overview
The `KmerSetGroupBuilder` enables scalable construction of k-mer indexes from biological sequences, supporting both new and incremental (append) workflows. It operates in two phases: **collection** of super-kmers into partitioned temporary files (`.skm`), and **finalization**, where partitions are processed in parallel into final k-mer indexes (`.kdi`).
## Core Features
- **K-mer & Minimizer Configuration**:
Supports `k ∈ [2,31]`; auto-computes optimal minimizer size (`m ≈ k/2.5`) and partition count (up to `4^m`, capped at 4096).
- **Functional Options for Filtering**:
- `WithMinFrequency(n)`: Keep only k-mers with frequency ≥ *n* (enables deduplication).
- `WithMaxFrequency(n)`: Discard k-mers with frequency > *n*.
- `WithEntropyFilter(threshold, levelMax)`: Remove low-complexity k-mers (entropy ≤ threshold).
- `WithSaveFreqKmers(n)`: Save top-*n* most frequent k-mers per set to `top_kmers.csv`.
- **Concurrent & Pipeline-Aware Processing**:
Uses a two-stage pipeline: *I/O-bound readers* (2-4 goroutines) feed k-mers to *CPU-bound workers*, one per core, maximizing throughput.
- **Partitioned I/O & Thread Safety**:
Super-kmers are written to per-partition `.skm` files using mutex-protected writers, enabling safe concurrent `AddSequence()` calls.
## Workflow
1. **Build Phase**:
- Input sequences → super-kmers extracted via minimizer-based partitioning.
- Super-kmers written to `.build/set_*/part_*.skm`.
2. **Finalization (`Close()`)**:
- `.skm` files loaded → canonical k-mers extracted.
- K-mers sorted, counted (frequency spectrum), and filtered per config.
- Final `.kdi` files are written, along with `spectrum.bin` and, optionally, `top_kmers.csv`.
- Metadata (`metadata.toml`) generated; `.build/` cleaned.
3. **Append Mode**:
`AppendKmerSetGroupBuilder()` extends an existing group, inheriting its parameters and appending new sets.
## Output Artifacts
- `.kdi`: Sorted, deduplicated (and optionally filtered) k-mers.
- `spectrum.bin`: Per-set frequency spectrum (`count → #k-mers`).
- `top_kmers.csv` (optional): Top *N* k-mers per set with counts.
- `metadata.toml`: Global and per-set metadata (k, m, partitions, counts).
## Design Highlights
- **Memory-efficient**: Streams large `.skm` files; reuses slices to minimize GC pressure.
- **Scalable**: Parallel finalization scales with CPU cores and I/O bandwidth.
- **Robust error handling**: Early termination on first failure; cleanup of partial state.
@@ -0,0 +1,44 @@
# K-mer Set Group Builder — Semantic Description
This Go module (`obikmer`) provides a **disk-backed builder and accessor** for managing *k-mer sets* across multiple biological sequence datasets. It supports efficient construction, persistence, and querying of canonical *k*-mers (accounting for DNA reverse-complement symmetry), with optional frequency filtering.
### Core Functionalities
- **K-mer Set Group Construction**:
`NewKmerSetGroupBuilder` creates a builder configured with:
- *k* (k-mer length),
- *m* (minimizer length used for partitioning),
- number of sets (`nSets`),
- and optional parameters like `WithMinFrequency`.
- **Sequence Ingestion**:
Sequences are added per set via `AddSequence(setID, bioseq)`. Internally:
- Canonical *k*-mers are extracted (using `IterCanonicalKmers`),
- Deduplicated and optionally filtered by occurrence frequency.
- **Persistence & Round-Trip**:
`builder.Close()` materializes the *k*-mer sets to disk (in temp or specified directory).
`OpenKmerSetGroup(dir)` reloads them — preserving all metadata and structure.
- **Metadata & Attributes**:
Supports custom identifiers (`SetId`) and key-value attributes (e.g., `"organism": "test"`), saved to disk via `SaveMetadata`.
- **Efficient Iteration**:
The iterator (`ksg.Iterator(setID)`) yields *sorted*, deduplicated canonical *k*-mers — using a k-way merge across internal partitions.
- **Frequency Filtering**:
`WithMinFrequency(n)` ensures only *k*-mers appearing ≥*n* times across inputs survive — enabling noise suppression (e.g., in error correction or abundance-based filtering).
- **Multi-set Support**:
Handles multiple independent *k*-mer sets (e.g., per sample or taxonomic group), verified via `Size()` and indexed access (`Len(setID)`).
### Testing Coverage
Comprehensive unit tests validate:
- Basic construction & correctness,
- Multi-sequence ingestion and deduplication,
- Frequency-based inclusion/exclusion logic,
- Cross-set isolation (`nSets > 1`),
- Metadata round-trip integrity.
This module is designed for scalable, reproducible *k*-mer indexing in metagenomic or amplicon analysis pipelines (e.g., OBITools4 ecosystem).
@@ -0,0 +1,44 @@
# `obikmer` Package: Disk-Based K-mer Set Group Management
The `obikmer` package provides a streaming, disk-backed implementation for managing collections of *k*-mer sets (called **K-mer Set Groups**), optimized for large-scale metagenomic or genomic analyses.
### Core Concepts
- A **KmerSetGroup** stores *N* disjoint sets of sorted *k*-mers, partitioned into *P* files per set.
- Each group is defined by immutable parameters: `k` (k-mer size), `m` (minimizer size), and *P* (number of partitions).
- Data is stored on disk as `.kdi` files (sorted k-mers) with optional sparse indices (`.kdx`) for fast lookup.
- Metadata is serialized in TOML format (`metadata.toml`), supporting both group-level and per-set attributes.
### Key Functionalities
#### 1. **Lifecycle Management**
- `OpenKmerSetGroup(directory)` loads an existing index in read-only mode.
- `NewFilteredKmerSetGroup(...)` constructs a new group (e.g., after filtering).
- `SaveMetadata()` persists metadata changes to disk.
#### 2. **Accessors & Metadata**
- Basic properties: `K()`, `M()`, `Partitions()`, `Size()` (i.e., *N*), and group ID.
- Attribute API: get/set/delete user-defined metadata (group-level or per-set).
- Supports type coercion (`GetIntAttribute`, `GetStringAttribute`).
#### 3. **Membership & Iteration**
- `Contains(setIndex, kmer)` checks presence using indexed binary search + linear scan across all partitions (parallelized).
- `Iterator(setIndex)` yields sorted *k*-mers via k-way merge of partition readers.
#### 4. **Similarity & Distance Metrics**
- `JaccardDistanceMatrix()` and `JaccardSimilarityMatrix()`: compute pairwise metrics in a streaming fashion.
- Per-partition processing with parallel goroutines and sorted merge for accurate set intersection/union counts.
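For one pair of sets, the sorted-merge counting of intersection and union can be sketched as below. In-memory slices replace partition readers, the function name is hypothetical, and returning 0 for two empty sets is a convention chosen for the example.

```go
package main

import "fmt"

// jaccard computes |A ∩ B| / |A ∪ B| for two sorted, duplicate-free slices
// in a single pass, without building either set in memory.
func jaccard(a, b []uint64) float64 {
	i, j, inter, union := 0, 0, 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			inter++
			i++
			j++
		}
		union++ // every step consumes exactly one distinct value
	}
	union += (len(a) - i) + (len(b) - j) // leftovers belong to one set only
	if union == 0 {
		return 0 // convention for two empty sets (assumption)
	}
	return float64(inter) / float64(union)
}

func main() {
	a := []uint64{1, 3, 5, 7}
	b := []uint64{3, 4, 5, 9}
	sim := jaccard(a, b)
	fmt.Printf("similarity=%.3f distance=%.3f\n", sim, 1-sim)
	// prints: similarity=0.333 distance=0.667
}
```

Since each partition holds a disjoint range of k-mers, per-partition intersection and union counts can be summed before the final division.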
#### 5. **Set Management**
- `CopySetsByIDTo(ids, destDir)` copies selected sets (with metadata) to another group.
- Supports compatibility checks and optional overwriting (`force`).
- `RemoveSetByID(id)` deletes a set, renumbers remaining sets for contiguous indices.
- Glob pattern matching: `MatchSetIDs(patterns)` resolves IDs like `"sample_*"`.
#### 6. **Compatibility & Utility**
- `IsCompatibleWith(other)` verifies same `(k, m, partitions)`.
- Helper methods: `PartitionPath`, `Spectrum(...)`, and spectrum file I/O.
### Design Highlights
- **Streaming**: Operations avoid loading full datasets into memory.
- **Immutability after creation** ensures consistency; modifications require explicit save operations.
- Thread-safe for concurrent partition processing (via `sync.Mutex`/`WaitGroup`).
@@ -0,0 +1,26 @@
# Semantic Description of `obikmer` Set Operations
This Go package implements scalable set operations over collections of *k*-mers stored in disk-backed, sorted structures (`.kdi` files). A `KmerSetGroup` represents a group of *N* independent sets (e.g., one per sample), each containing sorted unique *k*-mers.
## Core Set Operations
- **`Union()`**: Computes the union across all *N* sets — a k-mer appears in output if present in ≥1 input set.
- **`Intersect()`**: Computes the intersection — a k-mer appears only if present in *all* sets.
- **`Difference()`**: Computes `set₀ \ (set₁ ∪ … ∪ setₙ₋₁)`, keeping only k-mers unique to the first set.
- **`QuorumAtLeast(q)`**: Returns k-mers present in ≥ *q* sets.
- **`QuorumExactly(q)`**: Returns k-mers present in exactly *q* sets.
- **`QuorumAtMost(q)`**: Returns k-mers present in ≤ *q* sets.
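The quorum operations can all be expressed as one pass of a k-way merge that counts how many sets contain each k-mer. The sketch below works on in-memory sorted slices rather than `.kdi` streams, and the names `cursor`, `mergeHeap`, and `quorum` are illustrative, not the package's API.

```go
package main

import (
	"container/heap"
	"fmt"
)

// cursor points at the next unread element of one sorted set.
type cursor struct {
	set []uint64
	pos int
}

// mergeHeap orders cursors by their current k-mer value.
type mergeHeap []*cursor

func (h mergeHeap) Len() int           { return len(h) }
func (h mergeHeap) Less(i, j int) bool { return h[i].set[h[i].pos] < h[j].set[h[j].pos] }
func (h mergeHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *mergeHeap) Push(x any)        { *h = append(*h, x.(*cursor)) }
func (h *mergeHeap) Pop() any {
	old := *h
	c := old[len(old)-1]
	*h = old[:len(old)-1]
	return c
}

// quorum returns, in sorted order, the k-mers present in at least lo and
// at most hi of the input sets. Each set must be sorted and duplicate-free.
func quorum(sets [][]uint64, lo, hi int) []uint64 {
	h := &mergeHeap{}
	for _, s := range sets {
		if len(s) > 0 {
			heap.Push(h, &cursor{set: s})
		}
	}
	var out []uint64
	for h.Len() > 0 {
		kmer := (*h)[0].set[(*h)[0].pos]
		n := 0
		// Consume every cursor currently positioned on this k-mer.
		for h.Len() > 0 && (*h)[0].set[(*h)[0].pos] == kmer {
			c := (*h)[0]
			n++
			c.pos++
			if c.pos < len(c.set) {
				heap.Fix(h, 0)
			} else {
				heap.Pop(h)
			}
		}
		if n >= lo && n <= hi {
			out = append(out, kmer)
		}
	}
	return out
}

func main() {
	sets := [][]uint64{{1, 3, 5}, {3, 5, 7}, {5, 9}}
	fmt.Println("union:    ", quorum(sets, 1, len(sets)))
	fmt.Println("intersect:", quorum(sets, len(sets), len(sets)))
	fmt.Println("exactly 2:", quorum(sets, 2, 2))
}
```

With *N* sets, union corresponds to the range `[1, N]` and intersection to `[N, N]`, which is why a single counting merge covers the whole family.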
## Pairwise Group Operations
- **`UnionWith(other)` / `IntersectWith(other)`**: Performs *per-set* binary operations between two compatible groups (same k, m, partitions, size). Result has *N* sets: `setᵢ = this.setᵢ ⊕ other.setᵢ`, where ⊕ is union or intersection.
## Implementation Highlights
- **Partitioned & Parallelized**: Each operation processes partitions in parallel using `runtime.NumCPU()` workers.
- **Streaming K-way Merge**: Uses efficient sorted-stream merging (via `KWayMerge`) to avoid loading full sets into memory.
- **Quorum Filtering**: Within each partition, counts how many sets contain each k-mer by merging the sorted streams and tallying hits.
- **Compatibility Check**: Ensures groups share metadata (k, m, partitions) before pairwise operations.
- **Disk Output**: All results materialized as new `KmerSetGroup` in a directory, with per-partition `.kdi` files and metadata.
All operations preserve sorted order and support large-scale genomic datasets via streaming, partitioning, and minimal memory footprint.
@@ -0,0 +1,28 @@
# Semantic Description of `obikmer` Package Functionalities
The `obikmer` package provides disk-backed operations on *k*-mer sets derived from biological sequences. It supports scalable set algebra and similarity computations via the `KmerSetGroup` type.
## Core Features
- **Sequence-to-*k*-mer Indexing**: Sequences are converted into *k*-mers (of length `k`) and stored in a group of sets (`KmerSetGroup`), with one set per sequence. Minimizer-based sampling (parameter `m`) reduces redundancy.
- **Set Operations on Disk**: Efficient disk-resident implementations of standard set operations:
- `Union`: Merges all *k*-mers from selected sets.
- `Intersect`: Retains only *k*-mers present in all input sets.
- `Difference` (`A \ B`): Keeps *k*-mers present in set A but not in B.
  - `QuorumAtLeast(r)`: Returns *k*-mers appearing in ≥`r` sets; it generalizes union (`r=1`) and intersection (`r=n`).
- **Consistency Guarantees**: Operations obey mathematical identities (e.g., `|A ∪ B| = |A| + |B| − |A ∩ B|`), validated via unit tests.
- **Similarity & Distance Metrics**:
  - `JaccardDistanceMatrix()`: Computes pairwise Jaccard *distances* (1 − similarity) between all sets.
  - `JaccardSimilarityMatrix()`: Computes pairwise Jaccard *similarities* (`|A ∩ B| / |A ∪ B|`).
- Identical sets yield distance = `0.0`, disjoint ones give `1.0`; similarity is complementary.
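For two sorted, duplicate-free k-mer lists, intersection and union sizes can be counted in a single merge pass. The following is a minimal in-memory sketch; the function name `jaccard` and the convention for two empty sets are assumptions, not taken from the package.

```go
package main

import "fmt"

// jaccard returns |A ∩ B| / |A ∪ B| for two sorted slices of unique values.
func jaccard(a, b []uint64) float64 {
	i, j, inter, union := 0, 0, 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			inter++
			i++
			j++
		case a[i] < b[j]:
			i++
		default:
			j++
		}
		union++ // every step consumes exactly one distinct value
	}
	union += (len(a) - i) + (len(b) - j) // leftovers belong to one set only
	if union == 0 {
		return 1.0 // assumption: two empty sets are treated as identical
	}
	return float64(inter) / float64(union)
}

func main() {
	a := []uint64{1, 2, 3, 4}
	b := []uint64{3, 4, 5, 6}
	sim := jaccard(a, b)
	fmt.Printf("similarity=%.3f distance=%.3f\n", sim, 1-sim)
}
```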
## Design Principles
- **Temporary Directory Usage**: All operations use OS temp dirs for isolation and cleanup.
- **Testing-Focused API**: Helper functions (`buildGroupFromSeqs`, `collectKmers`) simplify test setup.
- **Scalability**: Disk-backed design avoids memory overflow for large sequence collections.
This package enables robust, reproducible *k*-mer set analysis in bioinformatics pipelines—especially useful for metagenomic binning, error correction, or read clustering.
+37
@@ -0,0 +1,37 @@
# Semantic Description of `KmerMap` Functionality
The provided Go package implements a **k-mer indexing and matching system** for biological sequences (`BioSequence`). It supports both standard and *sparse* k-mer representations (where one position is masked, typically for handling ambiguous bases or symmetry).
### Core Data Structures
- `KmerMap[T]`: A generic hash map associating *normalized* k-mers (type `T`, e.g., uint64 encoded in 2 bits per base) to lists of sequences containing them.
- `KmerMatch`: A map from sequence pointers to k-mer match counts, used for query results.
### Key Features
1. **K-mer Normalization**
- Handles both forward and reverse-complement k-mers.
- Selects the lexicographically smaller representation (canonical form).
- Supports *sparse* k-mers: when `SparseAt ≥ 0`, the central base is ignored (replaced by `#` in string view), and k-mers are symmetrically normalized.
2. **Efficient Indexing (`Push`)**
- Builds an index of all canonical k-mers from a set of sequences.
- Optionally limits per-k-mer storage (`maxocc`), useful for filtering high-frequency k-mers (e.g., contaminants).
3. **Querying (`Query`)**
- Given a query sequence, returns all sequences in the index sharing k-mers with it.
- Counts per-sequence how many shared k-mers exist (used for similarity estimation or clustering).
4. **Result Utilities (`KmerMatch`)**
- `FilterMinCount`: Remove low-count matches.
- `Max()`, `Sequences()`: Retrieve best match or all matched sequences.
5. **Construction (`NewKmerMap`)**
- Automatically adjusts k-mer size: odd for sparse mode, even otherwise.
- Precomputes bitmasks for efficient k-mer manipulation (masking, shifting).
- Integrates progress bar during indexing.
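Canonical normalization can be sketched for plain (non-sparse) k-mers as follows. The 2-bit code `A=0, C=1, G=2, T=3` and the helper names `encode`, `revComp`, and `canonical` are assumptions for illustration; sparse k-mers and the generic `KmerMap[T]` type are not modeled.

```go
package main

import "fmt"

// encode packs a DNA string into 2 bits per base (A=0, C=1, G=2, T=3).
// It assumes the input contains only A, C, G or T, in either case.
func encode(s string) uint64 {
	var v uint64
	for i := 0; i < len(s); i++ {
		var code uint64
		switch s[i] | 0x20 { // fold to lowercase
		case 'c':
			code = 1
		case 'g':
			code = 2
		case 't':
			code = 3
		}
		v = v<<2 | code
	}
	return v
}

// revComp returns the reverse complement of a k-mer produced by encode.
func revComp(v uint64, k int) uint64 {
	var rc uint64
	for i := 0; i < k; i++ {
		rc = rc<<2 | (3 - v&3) // complement: A<->T, C<->G
		v >>= 2
	}
	return rc
}

// canonical returns the smaller of a k-mer and its reverse complement.
// With this encoding, numeric order equals lexicographic order.
func canonical(v uint64, k int) uint64 {
	if rc := revComp(v, k); rc < v {
		return rc
	}
	return v
}

func main() {
	fwd := encode("TTTG")
	fmt.Println(fwd, revComp(fwd, 4), canonical(fwd, 4))
}
```

Because both strands map to the same canonical value, a sequence and its reverse complement hit the same index entries.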
### Use Cases
- Read clustering (e.g., OTU/ASV picking).
- Error correction via k-mer abundance.
- Sequence similarity search or contamination screening.
The implementation leverages low-level bit operations for performance and memory efficiency, especially critical in large-scale NGS data processing.
@@ -0,0 +1,27 @@
# Minimizer Size Utilities in `obikmer`
This Go package provides helper functions to compute and validate the **minimizer size** `m` in k-mer-based genomic algorithms (e.g., minimizer schemes for sequence comparison or indexing).
## Core Functions
- **`DefaultMinimizerSize(k)`**
  Returns a *recommended* minimizer size: `ceil(k / 2.5)`, clamped to `[1, k−1]`.
  → Ensures `m` is reasonably large for uniqueness while keeping window size (`k − m + 1`) manageable.
- **`MinMinimizerSize(nworkers)`**
Computes the *minimum* `m` such that there are ≥ `nworkers` distinct minimizers:
  solves `4^m ≥ nworkers`, i.e., `m = ceil(log₄(nworkers))`.
→ Guarantees enough diversity for parallelization (e.g., hashing-based distribution across workers).
- **`ValidateMinimizerSize(m, k, nworkers)`**
Enforces constraints on `m`:
- Lower bound: ≥ `MinMinimizerSize(nworkers)` (warns & adjusts if violated)
- Hard bounds: `1 ≤ m < k`
→ Prevents invalid or inefficient parameter choices.
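The two formulas can be worked through with integer arithmetic only. This sketch uses lowercase, hypothetical function names; it assumes `k ≥ 2` and additionally floors the worker-based minimum at 1 (the formula alone gives 0 for a single worker), which may differ from the package's behavior.

```go
package main

import "fmt"

// defaultMinimizerSize computes ceil(k / 2.5) as ceil(2k / 5), then clamps
// the result to [1, k-1]. Assumes k >= 2.
func defaultMinimizerSize(k int) int {
	m := (2*k + 4) / 5 // integer ceiling of 2k/5
	if m > k-1 {
		m = k - 1
	}
	if m < 1 {
		m = 1
	}
	return m
}

// minMinimizerSize returns the smallest m >= 1 such that 4^m >= nworkers.
func minMinimizerSize(nworkers int) int {
	m := 1
	for pow := 4; pow < nworkers; pow *= 4 {
		m++
	}
	return m
}

func main() {
	for _, k := range []int{5, 21, 31} {
		fmt.Printf("k=%d -> default m=%d\n", k, defaultMinimizerSize(k))
	}
	for _, n := range []int{4, 5, 16, 17} {
		fmt.Printf("nworkers=%d -> min m=%d\n", n, minMinimizerSize(n))
	}
}
```

For example, `k = 31` gives `ceil(12.4) = 13`, and 17 workers need `m = 3` because `4² = 16 < 17 ≤ 4³`.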
## Semantic Purpose
These functions ensure that minimizer-based workflows are:
- **Theoretically sound** (sufficient entropy for parallelism),
- **Practically viable** (avoiding degenerate cases like `m = 0` or `m ≥ k`),
- **User-friendly** (providing sensible defaults + clear warnings on adjustment).
+24
@@ -0,0 +1,24 @@
# SKM File Reader for Super-Kmers
This Go package provides a binary file reader (`SkmReader`) for `.skm` files, which store *super-kmers* — compact representations of DNA sequences using 2-bit encoding.
## Core Functionality
- **Binary Format Parsing**: Reads structured data from `.skm` files, where each record contains:
- A 2-byte little-endian integer specifying the sequence length.
- Packed nucleotide data, where every byte encodes up to four bases (2 bits per base).
- **Decoding Logic**: Converts packed 2-bit codes (`00`, `01`, `10`, `11`) to nucleotide characters using the mapping:
`{ 'a', 'c', 'g', 't' }`.
- **Memory-Efficient Reading**: Uses buffered I/O (64 KiB buffer) for fast sequential access.
- **Streaming Interface**: `Next()` returns the next super-kmer as a struct with:
- `Sequence`: decoded nucleotide byte slice.
- `Start`, `End`: positional metadata (currently fixed to full length).
- **Resource Management**: Provides a clean `.Close()` method for file handle cleanup.
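Reading one record can be sketched as below. The bit order inside a byte is not stated in the text, so this sketch assumes the first base occupies the two highest bits; `readRecord` and `decodeAll` are hypothetical names, not the `SkmReader` API.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// readRecord reads one record: a little-endian uint16 length followed by
// ceil(length/4) bytes of packed bases.
func readRecord(r io.Reader) ([]byte, error) {
	var n uint16
	if err := binary.Read(r, binary.LittleEndian, &n); err != nil {
		return nil, err // io.EOF on a clean end of input
	}
	packed := make([]byte, (int(n)+3)/4)
	if _, err := io.ReadFull(r, packed); err != nil {
		return nil, err
	}
	const bases = "acgt"
	seq := make([]byte, n)
	for i := range seq {
		shift := uint(6 - 2*(i%4)) // assumption: first base in the high bits
		seq[i] = bases[(packed[i/4]>>shift)&3]
	}
	return seq, nil
}

// decodeAll streams every record out of an in-memory buffer.
func decodeAll(data []byte) ([]string, error) {
	r := bufio.NewReaderSize(bytes.NewReader(data), 64*1024)
	var out []string
	for {
		seq, err := readRecord(r)
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		out = append(out, string(seq))
	}
}

func main() {
	// Two records: length 5 ("acgta") and length 2 ("tt").
	data := []byte{0x05, 0x00, 0x1B, 0x00, 0x02, 0x00, 0xF0}
	fmt.Println(decodeAll(data))
}
```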
## Use Case
Designed for high-performance processing of large genomic datasets (e.g., in k-mer analysis or sequence indexing), where storage size and read speed are critical.
+23
@@ -0,0 +1,23 @@
# SKM File Format Specification
This Go package implements a binary format for storing *super-kmers*—compact representations of DNA sequences used in bioinformatics. The tests validate reading/writing, padding behavior, and file size correctness.
## Core Functionalities
- **SuperKmer Structure**: Each super-kmer stores a DNA sequence (as bytes), likely padded to 4-base boundaries for efficient storage.
- **SkmWriter**: Serializes super-kmers into a binary file. Each entry writes:
- A 2-byte little-endian length (number of bases),
- Then `ceil(length/4)` bytes encoding nucleotides in 2 bits each (A=0, C=1, G=2, T=3).
- **SkmReader**: Parses the binary format back into memory. Returns `(SuperKmer, bool)` via `Next()`, with EOF signaled by `ok = false`.
- **Case Handling**: The 2-bit encoding does not store case, so reads return lowercase bases; the tests lowercase the expected sequence (via `| 0x20`) before comparing.
## Test Coverage
- **Round-trip integrity**: Verifies exact sequence recovery after write/read.
- **Empty file handling**: Confirms reader returns `ok = false` immediately on empty files.
- **Variable-length padding**: Validates correct encoding/decoding for sequences of length 15.
- **Size validation**: Confirms file size = `2 + ceil(L/4)` bytes for a sequence of length *L*.
## Use Case
Efficient, lossless storage and retrieval of super-kmers for downstream genomic analysis (e.g., assembly or alignment acceleration).
+24
@@ -0,0 +1,24 @@
# `.skm` File Format and `SkmWriter` Functionality
The Go package `obikmer` provides a binary writer for `.skm` (super-kmer) files, optimized for compact storage of DNA sequences.
- **Purpose**: Efficiently serialize *super-kmers* (runs of consecutive k-mers that share a minimizer) into a binary format.
- **Format per super-kmer**:
- `len: uint16 LE` — length of the sequence in bases (little-endian, 2 bytes).
- `data: ⌈len/4⌉ bytes` — nucleotide sequence encoded as **2 bits per base**, packed tightly.
- **Encoding scheme**:
- `A → 00`, `C → 01`, `G → 10`, `T → 11`.
- Padding: trailing bits in the final byte are zeroed if `len % 4 ≠ 0`.
- **Implementation details**:
- Uses buffered I/O (`bufio.Writer` with 64 KiB buffer) for performance.
- `NewSkmWriter(path)` opens/creates the file and returns a writer instance.
- `Write(sk SuperKmer)` encodes sequence length, then packs bases using a lookup (`__single_base_code__[seq[pos]&31]`).
- `Close()` flushes buffers and closes the file handle.
- **Use case**: Ideal for high-throughput genomic preprocessing (e.g., indexing, sketching), where space and I/O speed matter.
- **Assumptions**: `SuperKmer` type exposes a `.Sequence []byte`; bases are ASCII (`A,C,G,T,a,c,g,t`), and `&31` masks off the case bit so that upper- and lowercase letters select the same lookup entry.
- **Efficiency**: 4× compression vs. ASCII (1 byte/base → ~0.25 bytes/base), minimal overhead.
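The record layout above can be illustrated with a small packer. The table `baseCode` is a stand-in for the package's lookup (its real contents are not shown here), `packRecord` is a hypothetical name, and placing the first base in the high bits of each byte is an assumption.

```go
package main

import "fmt"

// baseCode maps (ASCII base & 31) to a 2-bit code: A=0, C=1, G=2, T=3.
// Masking with 31 drops the case bit, so 'A' and 'a' share one index.
var baseCode = [32]byte{'a' & 31: 0, 'c' & 31: 1, 'g' & 31: 2, 't' & 31: 3}

// packRecord returns the bytes of one record: uint16 LE length, then the
// bases packed four per byte. Assumes len(seq) fits in 16 bits.
func packRecord(seq []byte) []byte {
	n := len(seq)
	out := make([]byte, 2+(n+3)/4)
	out[0] = byte(n) // low byte first (little-endian)
	out[1] = byte(n >> 8)
	for i, b := range seq {
		shift := uint(6 - 2*(i%4)) // assumption: first base in the high bits
		out[2+i/4] |= baseCode[b&31] << shift
	}
	return out // unused trailing bits stay zero
}

func main() {
	rec := packRecord([]byte("ACGTa"))
	fmt.Printf("% x (%d bytes)\n", rec, len(rec))
}
```

A sequence of length 5 takes `2 + ceil(5/4) = 4` bytes, against 5 bytes for the bases alone in ASCII.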
+35
@@ -0,0 +1,35 @@
# K-mer Spectrum Analysis Package (`obikmer`)
This Go package provides tools for analyzing k-mer frequency distributions in biological sequences.
## Core Data Structures
- **`SpectrumEntry`**: Represents a bin in the k-mer frequency spectrum:
`Frequency`: how often a k-mer was observed; `Count`: number of distinct k-mers with that frequency.
- **`KmerSpectrum`**: A sorted list of non-zero `SpectrumEntry`s (ascending by frequency), enabling efficient statistics and serialization.
## Key Functionalities
### Spectrum Management
- `MapToSpectrum()` / `ToMap()`: Convert between map and structured spectrum representations.
- `MergeSpectraMaps()` / `MergeTopN()`: Combine spectral or top-k data from multiple sources.
- `MaxFrequency()` returns the highest frequency observed in the spectrum.
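The map-to-spectrum conversion and merging can be sketched as below. The field types (`int` frequency, `uint64` count) and the lowercase names `spectrumEntry`, `mapToSpectrum`, and `mergeSpectraMaps` are assumptions chosen for illustration; the package's own signatures are not shown in this text.

```go
package main

import (
	"fmt"
	"sort"
)

// spectrumEntry is one bin: Count distinct k-mers were seen Frequency times.
type spectrumEntry struct {
	Frequency int
	Count     uint64
}

// mapToSpectrum drops empty bins and sorts the rest by ascending frequency.
func mapToSpectrum(m map[int]uint64) []spectrumEntry {
	out := make([]spectrumEntry, 0, len(m))
	for f, c := range m {
		if c > 0 {
			out = append(out, spectrumEntry{Frequency: f, Count: c})
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Frequency < out[j].Frequency })
	return out
}

// mergeSpectraMaps adds the per-frequency counts of two spectra.
func mergeSpectraMaps(a, b map[int]uint64) map[int]uint64 {
	out := make(map[int]uint64, len(a)+len(b))
	for f, c := range a {
		out[f] += c
	}
	for f, c := range b {
		out[f] += c
	}
	return out
}

func main() {
	a := map[int]uint64{1: 100, 2: 40, 7: 0}
	b := map[int]uint64{2: 10, 5: 3}
	fmt.Println(mapToSpectrum(mergeSpectraMaps(a, b)))
}
```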
### I/O & Persistence
- Binary format (`KSP\x01` magic header) with varint encoding for compact storage:
- `WriteSpectrum()` / `ReadSpectrum()`: Save/load full spectra to disk.
- CSV export:
- `WriteTopKmersCSV()`: Outputs top-k k-mers with their sequences (decoded from uint64) and frequencies.
### Top-N K-mer Tracking
- Uses a **min-heap** to efficiently maintain the *N most frequent* k-mers in streaming scenarios:
- `NewTopNKmers(n)`: Initialize collector.
- `Add(kmer, freq)`: Insert/update while respecting capacity *n*.
- `Results()`: Return top-kmers sorted descending by frequency.
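A bounded min-heap keeps the least frequent of the retained k-mers at the root, so a new candidate only needs to be compared with that root. The sketch below assumes each k-mer is added once with its final frequency (no in-place updates); the names `topN`, `entry`, and `minHeap` are illustrative.

```go
package main

import (
	"container/heap"
	"fmt"
	"sort"
)

type entry struct {
	Kmer uint64
	Freq int
}

// minHeap keeps the entry with the smallest frequency at index 0.
type minHeap []entry

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i].Freq < h[j].Freq }
func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x any)        { *h = append(*h, x.(entry)) }
func (h *minHeap) Pop() any {
	old := *h
	e := old[len(old)-1]
	*h = old[:len(old)-1]
	return e
}

// topN retains the n most frequent entries seen so far.
type topN struct {
	n int
	h minHeap
}

func (t *topN) Add(kmer uint64, freq int) {
	if t.n <= 0 {
		return
	}
	if len(t.h) < t.n {
		heap.Push(&t.h, entry{Kmer: kmer, Freq: freq})
		return
	}
	if freq > t.h[0].Freq { // beats the weakest retained entry
		t.h[0] = entry{Kmer: kmer, Freq: freq}
		heap.Fix(&t.h, 0)
	}
}

// Results returns a copy sorted by descending frequency.
func (t *topN) Results() []entry {
	out := append([]entry(nil), t.h...)
	sort.Slice(out, func(i, j int) bool { return out[i].Freq > out[j].Freq })
	return out
}

func main() {
	t := &topN{n: 2}
	t.Add(1, 5)
	t.Add(2, 9)
	t.Add(3, 1)
	t.Add(4, 7)
	fmt.Println(t.Results())
}
```

Each `Add` costs O(log n), so memory stays bounded by *n* regardless of how many k-mers are streamed through.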
## Design Highlights
- Memory-efficient: Uses `uint64` for k-mers (suitable up to *k* ≤ 32).
- Streaming-friendly: Top-N collector supports incremental updates.
- Thread-safety note: External synchronization required for concurrent access.
+48
@@ -0,0 +1,48 @@
# SuperKmer and Minimizer-Based Sliding Window Analysis
This Go package provides functionality for extracting *super k-mers* from DNA sequences using a minimizer-based sliding window approach.
## Core Concepts
- **K-mers**: Substrings of length `k` from a DNA sequence.
- **Minimizer**: The lexicographically smallest canonical *m*-mer (substring of length `m`) among all `(k − m + 1)` overlapping *m*-mers in a given k-mer.
- **Super K-mer**: A maximal contiguous subsequence where *every* consecutive k-mer shares the **same minimizer**.
## Data Structures
### `SuperKmer`
Represents a maximal region with uniform minimizer:
- `Minimizer`: Canonical 64-bit hash of the shared m-mer.
- `Start`, `End`: Slice-style bounds (0-indexed, exclusive end).
- `Sequence`: Raw byte slice of the DNA subsequence.
### `dequeItem`
Used internally to maintain a monotone deque:
- `position`: Index of the m-mer in the sequence.
- `canonical`: Canonical hash value (e.g., lexicographically smallest of forward/reverse-complement).
## Main Function
### `ExtractSuperKmers(seq, k, m, buffer)`
- Extracts all maximal super k-mers from `seq`.
- Parameters validated:
- `1 ≤ m < k`,
- `2 ≤ k ≤ 31`,
- sequence length ≥ `k`.
- Uses an efficient **O(n)** time algorithm via internal iteration.
- Supports optional preallocation (`buffer`) to reduce memory allocations.
## Algorithm Highlights
- Maintains a sliding window of size `k − m + 1` over *m*-mers.
- Tracks the current minimizer using a monotone deque for O(1) updates per step.
- Detects *minimizer transitions* to delimit super k-mer boundaries.
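The monotone-deque step can be shown in isolation on a plain slice of values standing in for canonical *m*-mer codes. This sketch is not the package's implementation: `windowMinima` is a hypothetical helper, and breaking ties toward the leftmost position is an assumption.

```go
package main

import "fmt"

// windowMinima returns, for every window of w consecutive values, the index
// of its minimum (leftmost on ties). Each index is pushed and popped at most
// once, so the total cost is O(n).
func windowMinima(vals []uint64, w int) []int {
	if w <= 0 || len(vals) < w {
		return nil
	}
	var deque []int // indices whose values are non-decreasing front to back
	out := make([]int, 0, len(vals)-w+1)
	for i, v := range vals {
		// Larger values behind v can never be a window minimum again.
		for len(deque) > 0 && vals[deque[len(deque)-1]] > v {
			deque = deque[:len(deque)-1]
		}
		deque = append(deque, i)
		// Drop the front once it slides out of the current window.
		if deque[0] <= i-w {
			deque = deque[1:]
		}
		if i >= w-1 {
			out = append(out, deque[0])
		}
	}
	return out
}

func main() {
	mmers := []uint64{5, 2, 4, 1, 3, 6}
	// With k - m + 1 = 3, each window corresponds to one k-mer.
	fmt.Println(windowMinima(mmers, 3))
}
```

A change in the reported index between two consecutive windows marks a minimizer transition, which is where one super k-mer ends and the next begins.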
## Complexity
| Aspect | Bound |
|---------------|-------------------------------|
| Time | **O(n)** (linear in sequence length) |
| Space         | **O(k − m + 1)** for the deque, plus output size |
Useful in genome compression, read clustering, and minimizer-based alignment acceleration.
@@ -0,0 +1,32 @@
# Super K-mers Extraction Module (`obikmer`)
This Go package provides efficient tools for extracting **super k-mers** from DNA sequences using *minimizer-based sliding windows*. Super k-mers are maximal contiguous subsequences in which every k-mer has the same canonical minimizer.
## Core Functionality
- **`IterSuperKmers(seq, k, m)`**
Returns an iterator over `SuperKmer` structs. Each struct contains:
- `Start`, `End`: genomic positions of the super k-mer in the original sequence
- `Minimizer`: canonical minimizer value (uint64) for that segment
- `Sequence`: the actual DNA subsequence
- **`SuperKmer.ToBioSequence(...)`**
Converts a raw `SuperKmer` into an enriched `obiseq.BioSequence`, embedding metadata:
- ID: `{parentID}_superkmer_{start}_{end}`
- Attributes: minimizer sequence (`minimizer_seq`), value, `k`, `m`, positions, and parent ID
- **`SuperKmerWorker(k, m)`**
A `SeqWorker` adapter for pipeline integration (e.g., with `obiiter`). Processes a full BioSequence and returns all extracted super k-mers as a slice of `BioSequence`s.
## Algorithm Highlights
- Uses **canonical minimizers** (forward/reverse-complement minimum) to ensure strand-invariance
- Maintains a monotonic deque for efficient *sliding-window minimizer* tracking (O(n) time complexity)
- Supports DNA bases `A/C/G/T/U` case-insensitively via bitmasking (`seq[i] & 31`)
- Enforces parameter constraints: `1 ≤ m < k ≤ 31`, sequence length ≥ `k`
## Use Cases
- Read partitioning in metagenomics (e.g., for error correction or clustering)
- Efficient k-mer space segmentation without storing all individual kmers
- Integration into modular bioinformatics pipelines via `SeqWorker` interface
@@ -0,0 +1,39 @@
# Semantic Description of `obikmer` Package Functionalities
The `obikmer` package provides tools for **super k-mer extraction and minimizer-based sequence analysis** in bioinformatics.
## Core Concepts
A **super k-mer** is a maximal contiguous subsequence of DNA where *all* embedded *k*-mers share the **same minimizer**—a compact representative (typically lexicographically minimal) of *m*-mers, considering both forward and reverse-complement strands.
## Key Functions & Features
- **`IterSuperKmers(seq, k, m)`**:
An iterator over all super *k*-mers in input sequence `seq`, parameterized by:
- `k`: length of embedded *k*-mers,
  - `m`: minimizer length (`1 ≤ m < k`).
Yields structured objects with:
- `Sequence`: the super *k*-mer substring,
- `Start`/`End`: genomic coordinates (0-based half-open),
- `Minimizer`: canonical hash of the shared minimizer.
- **`ExtractSuperKmers(...)`**:
Synchronous counterpart returning a slice of all super *k*-mers.
## Verified Properties (via Tests)
1. **Boundary correctness**: Extracted subsequences match `seq[start:end]`.
2. **Consistency between iterator and slice versions**: Both APIs produce identical results.
3. **Bijection property**:
- Each unique super *k*-mer sequence maps to exactly one minimizer.
   - All embedded *k*-mers within a super *k*-mer share the same minimizer.
## Implementation Notes
- Minimizers are computed canonically (min of forward and reverse-complement encodings).
- Uses base encoding via `__single_base_code__` (assumed helper mapping A/C/G/T → 0/1/2/3).
- Tests cover simple, homopolymer-rich, and complex genomic patterns.
## Design Rationale
Super *k*-mers enable efficient compression, indexing (e.g., in minimizer spaces), and alignment-free comparisons—crucial for scalable genomic analysis.
+33
@@ -0,0 +1,33 @@
# Variable-Length Integer Encoding/Decoding Utility
This Go package (`obikmer`) provides efficient serialization of `uint64` integers using **protobuf-style variable-length encoding (varint)**.
## Core Features
- `EncodeVarint(io.Writer, uint64) (n int, err error)`
Writes a `uint64` as a compact varint to any `io.Writer`. Uses **7 bits per byte**, with the MSB as a continuation flag. Max 10 bytes for `uint64`.
- `DecodeVarint(io.Reader) (val uint64, err error)`
Reads and decodes a varint from any `io.Reader`. Handles multi-byte sequences safely; returns error on malformed input or overflow (>70 bits).
- `VarintLen(uint64) int`
Computes the exact byte length required to encode a value *without* performing I/O — useful for buffer preallocation or size estimation.
## Encoding Scheme
- Each byte holds 7 bits of data; the eighth bit (MSB) is `1` if more bytes follow, else `0`.
- Example:
  - `0x7F` → 1 byte: `0111_1111`
  - `0x80` → 2 bytes: `1000_0000 0000_0001` (low-order 7-bit group first)
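The scheme can be sketched in a few lines. The lowercase functions below are illustrative and not the package's code: in particular, `decodeVarint` takes an `io.ByteReader` rather than the `io.Reader` named in the documented signature, and its overflow check (rejecting a tenth byte larger than 1) is an assumption.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

var errOverflow = errors.New("varint: value overflows uint64")

// encodeVarint writes v in 7-bit groups, low-order group first.
func encodeVarint(w io.Writer, v uint64) (int, error) {
	var buf [10]byte
	n := 0
	for v >= 0x80 {
		buf[n] = byte(v) | 0x80 // set the continuation flag
		v >>= 7
		n++
	}
	buf[n] = byte(v)
	n++
	return w.Write(buf[:n])
}

// decodeVarint reads one varint, one byte at a time.
func decodeVarint(r io.ByteReader) (uint64, error) {
	var v uint64
	for shift := uint(0); shift < 70; shift += 7 {
		b, err := r.ReadByte()
		if err != nil {
			return 0, err
		}
		if shift == 63 && b > 1 {
			return 0, errOverflow // tenth byte may only carry bit 63
		}
		v |= uint64(b&0x7F) << shift
		if b&0x80 == 0 {
			return v, nil
		}
	}
	return 0, errOverflow
}

// roundTrip encodes v, then decodes it again from the same buffer.
func roundTrip(v uint64) ([]byte, uint64, error) {
	var buf bytes.Buffer
	if _, err := encodeVarint(&buf, v); err != nil {
		return nil, 0, err
	}
	enc := append([]byte(nil), buf.Bytes()...)
	dec, err := decodeVarint(&buf)
	return enc, dec, err
}

func main() {
	for _, v := range []uint64{0x7F, 0x80, 300} {
		enc, dec, err := roundTrip(v)
		fmt.Printf("%d -> % x -> %d (err=%v)\n", v, enc, dec, err)
	}
}
```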
## Use Cases
- Network protocols & binary file formats requiring compact integer representation
- Serialization frameworks (e.g., custom protobuf-like codecs)
- Embedded systems or bandwidth-constrained environments where space efficiency matters
## Design Notes
- No external dependencies; uses only `io` from the standard library.
- Thread-safe *per call* (no shared state), but `io.Reader`/`Writer` concurrency must be handled externally.
- Compatible with the standard protobuf varint format, which is also the encoding produced by `binary.PutUvarint` in Go's `encoding/binary` package.
+37
@@ -0,0 +1,37 @@
# Varint Encoding and Decoding Module (`obikmer`)
This Go package implements **variable-length integer encoding/decoding**, commonly used in binary protocols (e.g., Protocol Buffers, SQLite) to efficiently store small integers using fewer bytes.
## Core Features
- **`EncodeVarint(w io.Writer, v uint64) (n int, err error)`**
  Encodes a `uint64` value into the minimal number of bytes (1–10) using **LEB128-style varint**, writing the result to a writer. Returns bytes written and any I/O error.
- **`DecodeVarint(r io.Reader) (uint64, error)`**
Reads and decodes a varint from an `io.Reader`, reconstructing the original `uint64`. Fails on malformed or incomplete data.
- **`VarintLen(v uint64) int`**
Computes the exact number of bytes required to encode `v`, without performing I/O.
## Test Coverage
- **Round-trip correctness**: All test values (including edge cases like `0`, powers of two, and max `uint64`) encode → decode back identically.
- **Length validation**: Encoded length matches `VarintLen` predictions exactly (e.g., 127 → 1 byte; 16384 → 3 bytes).
- **Sequence handling**: Multiple varints can be concatenated and decoded in order, preserving data integrity.
## Efficiency & Design
- Uses **7-bit groups per byte**, with the MSB as a continuation flag (`1` = more bytes follow).
- Minimal memory footprint — no allocations beyond buffer I/O.
- Designed for streaming use (e.g., network or file serialization).
## Edge Cases Verified
| Value | Encoded Length |
|----------------|---------------|
| `0` | 1 byte |
| `2⁷−1 = 127` | 1 byte |
| `2⁷ = 128` | 2 bytes |
| `2¹⁴−1 = 16383`| 2 bytes |
| `^uint64(0)` | **10 bytes** |
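The lengths in the table follow from the bit length of the value: each byte carries 7 payload bits, so the byte count is `ceil(bitlen / 7)`, with zero as a special case. A minimal sketch, assuming a hypothetical lowercase `varintLen` built on `math/bits`:

```go
package main

import (
	"fmt"
	"math/bits"
)

// varintLen returns the number of bytes a 7-bit-group varint needs for v.
func varintLen(v uint64) int {
	if v == 0 {
		return 1 // zero still occupies one byte
	}
	return (bits.Len64(v) + 6) / 7 // ceil(bit length / 7)
}

func main() {
	for _, v := range []uint64{0, 127, 128, 16383, 16384, ^uint64(0)} {
		fmt.Printf("%d -> %d byte(s)\n", v, varintLen(v))
	}
}
```

The maximum `uint64` has a bit length of 64, and `ceil(64 / 7) = 10`, matching the last row.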