⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30 (automated by Makefile)
2026-04-30 03:50:39 +00:00 · 2026-04-07 08:36:50 +02:00
parent 670edc1958
commit 8c7017a99d
392 changed files with 18875 additions and 141 deletions
@@ -0,0 +1,36 @@
+# Semantic Description of `obikmer` Package
+
+This Go package provides utilities for **k-mer (specifically 4-mer) counting and comparison** of biological sequences.
+
+## Core Functionalities
+
+1. **`Count4Mer(seq, buffer, counts)`**  
+   Counts occurrences of all possible 4-mers (4-nucleotide subsequences) in a `BioSequence`.  
+   - Encodes each 4-mer into an integer (0–255) using `Encode4mer`.  
+   - Populates a fixed-size `[256]uint16` table (`Table4mer`) with counts.  
+   - Reuses or allocates the `counts` buffer as needed.
+
+2. **`Common4Mer(count1, count2)`**  
+   Computes the *intersection* of two 4-mer frequency profiles: sum over all k-mers of `min(count1[k], count2[k])`.  
+   Used to measure shared content between sequences.
+
+3. **`Sum4Mer(count)`**  
+   Returns the total number of 4-mers in a profile (i.e., sum over all entries).
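
The three functions above can be illustrated with a minimal, self-contained sketch. It works on plain strings rather than `BioSequence`, and the lowercase names (`encodeNuc`, `count4mer`, `common4mer`, `sum4mer`) are hypothetical stand-ins, not the package's actual signatures.

```go
package main

import "fmt"

// encodeNuc maps a base to 2 bits (A=0, C=1, G=2, T/U=3); other bytes fall back to 0.
func encodeNuc(b byte) byte {
	switch b {
	case 'C', 'c':
		return 1
	case 'G', 'g':
		return 2
	case 'T', 't', 'U', 'u':
		return 3
	}
	return 0
}

// count4mer tallies every overlapping 4-mer of seq in a 256-entry table.
func count4mer(seq string) [256]uint16 {
	var counts [256]uint16
	var code byte
	for i := 0; i < len(seq); i++ {
		code = code<<2 | encodeNuc(seq[i]) // the byte keeps only the last 4 bases
		if i >= 3 {
			counts[code]++
		}
	}
	return counts
}

// common4mer sums min(c1[k], c2[k]) over all 256 codes.
func common4mer(c1, c2 [256]uint16) int {
	total := 0
	for k := range c1 {
		if c1[k] < c2[k] {
			total += int(c1[k])
		} else {
			total += int(c2[k])
		}
	}
	return total
}

// sum4mer returns the total number of 4-mers recorded in a table.
func sum4mer(c [256]uint16) int {
	total := 0
	for _, n := range c {
		total += int(n)
	}
	return total
}

func main() {
	a := count4mer("ACGTACGT")
	b := count4mer("ACGTTTTT")
	fmt.Println(sum4mer(a), sum4mer(b), common4mer(a, b)) // 5 5 1
}
```

Because a 4-mer occupies exactly 8 bits, shifting a `byte` left by two drops the oldest base without an explicit mask.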
+
+## Distance & Similarity Bounds
+
+4. **`LCS4MerBounds(count1, count2)`**  
+   Estimates bounds for the *Longest Common Subsequence* (LCS) length between two sequences based on 4-mer profiles:  
+   - **Lower bound**: `common_kmers + (3 if common_kmers > 0 else 0)`  
+   - **Upper bound**: `min_total + 3 − ceil((min_total − common_kmers)/4)`, where `min_total = min(total1, total2)`  
+   Leverages the fact that overlapping k-mers constrain possible alignments.
+
+5. **`Error4MerBounds(count1, count2)`**  
+   Estimates bounds for *alignment errors* (e.g., mismatches + indels):  
+   - **Upper bound**: `max_total − common_kmers + 2 * floor((common_kmers + 5)/8)`  
+   - **Lower bound**: `ceil(upper_bound / 4)`  
+   Provides fast, approximate error estimates without full alignment.
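
As a worked example, the sketch below transcribes the bound formulas exactly as stated above. It takes the profile totals and the common count as plain integers; the function names `lcsBounds` and `errorBounds` are illustrative and are not claimed to match the package.

```go
package main

import "fmt"

// ceilDiv computes ceil(a/b) for a >= 0 and b > 0.
func ceilDiv(a, b int) int { return (a + b - 1) / b }

// lcsBounds applies the LCS bound formulas to two profile totals and their common count.
func lcsBounds(total1, total2, common int) (lower, upper int) {
	minTotal := total1
	if total2 < minTotal {
		minTotal = total2
	}
	lower = common
	if common > 0 {
		lower += 3
	}
	upper = minTotal + 3 - ceilDiv(minTotal-common, 4)
	return lower, upper
}

// errorBounds applies the alignment-error bound formulas.
func errorBounds(total1, total2, common int) (lower, upper int) {
	maxTotal := total1
	if total2 > maxTotal {
		maxTotal = total2
	}
	upper = maxTotal - common + 2*((common+5)/8) // integer division acts as floor
	lower = ceilDiv(upper, 4)
	return lower, upper
}

func main() {
	// Two 8-base reads have 5 four-mers each; assume they share one.
	fmt.Println(lcsBounds(5, 5, 1))   // 4 7
	fmt.Println(errorBounds(5, 5, 1)) // 1 4
}
```

The `+ 3` terms convert a 4-mer count back to a base count: `n` consecutive 4-mers cover `n + 3` bases.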
+
+## Use Case
+
+Designed for **high-performance comparison of NGS reads** (e.g., in metabarcoding), where exact alignment is too costly, and k-mer-based heuristics enable scalable similarity estimation.
@@ -0,0 +1,44 @@
+# Semantic Description of the `obikmer` Package
+
+This Go package implements a **De Bruijn graph** for efficient k-mer manipulation and sequence assembly, primarily used in bioinformatics (e.g., metagenomic read error correction or consensus building).
+
+### Core Functionalities
+
+- **K-mer Encoding**: K-mers are encoded as `uint64` using 2 bits per nucleotide (A=0, C=1, G=2, T=3), supporting IUPAC ambiguity codes via the `iupac` map.
+- **Reverse Complement Handling**: The `revcompnuc` table enables nucleotide-wise reverse complementation.
+- **Graph Construction**: The `DeBruijnGraph` struct maintains a map from k-mer hashes to integer weights (e.g., observed counts), with helper masks for bit manipulation (`kmermask`, `prevc/g/t`).
+
+### Graph Operations
+
+- **Node Queries**:  
+  - `Previouses()` / `Nexts()`: Return predecessor/successor k-mers in the graph.  
+  - `MaxNext()` / `MaxHead()`: Find neighbors or heads (sources) with maximum weight.
+- **Path Exploration**:  
+  - `MaxPath()`: Greedily traces the highest-weight path from a head.  
+  - `LongestPath()`: Explores all heads to find the path with maximum cumulative weight (optionally bounded in length).  
+  - `HaviestPath()`: Uses a Dijkstra-like priority queue to find the *heaviest* (sum-weight) path, with cycle detection via DFS (`HasCycle()`).
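
To make the graph operations concrete, here is a minimal sketch of a weighted k-mer map with a greedy highest-weight walk, in the spirit of `MaxNext()` and `MaxPath()`. It is a simplification under stated assumptions: a fixed `k = 4`, no IUPAC expansion, an explicit start k-mer instead of head detection, and hypothetical names (`push`, `maxNext`, `greedyConsensus`).

```go
package main

import "fmt"

const k = 4
const mask uint64 = 1<<(2*k) - 1

var code = map[byte]uint64{'A': 0, 'C': 1, 'G': 2, 'T': 3}

// push adds every k-mer of seq to the graph with the given weight.
func push(g map[uint64]int, seq string, weight int) {
	var node uint64
	for i := 0; i < len(seq); i++ {
		node = (node<<2 | code[seq[i]]) & mask
		if i >= k-1 {
			g[node] += weight
		}
	}
}

// maxNext returns the heaviest unvisited successor of node, if any.
func maxNext(g map[uint64]int, node uint64, seen map[uint64]bool) (uint64, bool) {
	var best uint64
	bestW := 0
	for b := uint64(0); b < 4; b++ {
		next := (node<<2 | b) & mask // drop the first base, append b
		if w := g[next]; w > bestW && !seen[next] {
			best, bestW = next, w
		}
	}
	return best, bestW > 0
}

// greedyConsensus follows the heaviest successors from start and spells the path.
func greedyConsensus(g map[uint64]int, start string) string {
	var node uint64
	for i := 0; i < k; i++ {
		node = node<<2 | code[start[i]]
	}
	seen := map[uint64]bool{node: true}
	out := []byte(start[:k])
	for {
		next, ok := maxNext(g, node, seen)
		if !ok {
			return string(out)
		}
		out = append(out, "ACGT"[next&3]) // the last base of next extends the path
		seen[next] = true
		node = next
	}
}

func main() {
	g := map[uint64]int{}
	push(g, "ACGTTGCA", 3) // majority variant
	push(g, "ACGTTGGA", 1) // minority variant, e.g. a sequencing error
	fmt.Println(greedyConsensus(g, "ACGT")) // ACGTTGCA
}
```

The `seen` set is a crude guard against cycles; the package description mentions a dedicated `HasCycle()` check instead.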
+
+### Consensus & Filtering
+
+- **Consensus Generation**:  
+  - `BestConsensus()` returns a sequence from the greedy max-weight path.  
+  - `LongestConsensus(id, min_cov)` trims low-coverage ends using a coverage threshold (mode-based).
+- **Weight Statistics**:  
+  - `MaxWeight()`, `WeightMean()`, `WeightMode()` provide distribution summaries.  
+  - `FilterMinWeight(min)` removes low-count nodes.
+- **Decoding**:  
+  - `DecodeNode()` converts a k-mer index to its DNA string.  
+  - `DecodePath()` reconstructs the full consensus from a path.
+
+### I/O & Diagnostics
+
+- **GML Export**: `WriteGml()` outputs a directed graph in Graph Modelling Language (for visualization), with edge thickness and labels reflecting weights.
+- **Hamming Distance**: `HammingDistance()` computes the Hamming distance (number of mismatching positions) between two encoded k-mers using bit operations.
+- **Sequence Insertion**: `Push()` adds a biosequence (with count weight) to the graph, expanding all IUPAC variants recursively.
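
A mismatch count between two 2-bit-encoded k-mers can be obtained with a few bit operations. The sketch below is one common way to do it, not necessarily the package's implementation; `encode` and `hamming` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"math/bits"
)

var code = map[byte]uint64{'A': 0, 'C': 1, 'G': 2, 'T': 3}

// encode packs a DNA string into 2 bits per base, first base in the highest bits.
func encode(s string) uint64 {
	var v uint64
	for i := 0; i < len(s); i++ {
		v = v<<2 | code[s[i]]
	}
	return v
}

// hamming counts the positions at which two equal-length encoded k-mers differ.
func hamming(a, b uint64) int {
	x := a ^ b                          // non-zero 2-bit group = mismatching base
	x = (x | x>>1) & 0x5555555555555555 // collapse each group to its low bit
	return bits.OnesCount64(x)
}

func main() {
	fmt.Println(hamming(encode("ACGT"), encode("ACGA"))) // 1
	fmt.Println(hamming(encode("ACGT"), encode("TGCA"))) // 4
}
```

Note that this counts substitutions only; insertions and deletions are outside what a position-wise comparison can see.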
+
+### Dependencies & Design
+
+- Leverages `obiseq` for sequence representation, the third-party `logrus` logging library, and `slices` and `container/heap` from Go’s standard library.
+- Designed for scalability and speed, using bit-level operations to minimize memory footprint.
+
+Overall: a robust k-mer graph engine for *de novo* assembly, error correction, and consensus recovery in high-throughput sequencing data.
@@ -0,0 +1,35 @@
+# Semantic Description of `obikmer` Package
+
+The `obikmer` package provides efficient k-mer encoding and comparison utilities for biological sequences, optimized for DNA analysis.
+
+## Core Functionalities
+
+1. **Nucleotide Encoding**  
+   - `EncodeNucleotide(b byte)`: Maps DNA bases (A, C, G, T/U) to 2-bit values:  
+     `A→0`, `C→1`, `G→2`, `T/U→3`.  
+     Ambiguous or non-standard characters (e.g., N, R, Y) default to `A` (`0`).  
+     Uses a lookup table for O(1) performance.
+
+2. **4-mer Encoding**  
+   - `Encode4mer(seq, buffer)`: Converts a biological sequence into overlapping 4-mers.  
+     Each k-mer is encoded as an unsigned byte (0–255), where each nucleotide contributes 2 bits.  
+     Supports optional buffer reuse for memory efficiency.
+
+3. **4-mer Indexing**  
+   - `Index4mer(seq, index, buffer)`: Builds an inverted index mapping each 4-mer code (0–255) to all its occurrence positions in the sequence.  
+     Returns `[][]int`, where rows correspond to k-mer codes and columns list positions.
+
+4. **Fast Sequence Comparison**  
+   - `FastShiftFourMer(...)`: Compares two sequences using a FASTA-like shift-scoring algorithm.  
+     - Uses precomputed 4-mer index of a reference sequence and encodes the query.  
+     - Counts co-occurring 4-mers across all possible shifts (`refpos − queryPos`).  
+     - Computes raw and relative scores (normalized by alignment length).  
+     - Returns optimal shift, count of matching 4-mers, and maximum score (raw or relative).
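
The shift-counting idea can be sketched as follows. This is a reduced illustration that only returns the best shift and its raw 4-mer count; relative scoring is left out, and `nuc`, `encode4mers`, and `bestShift` are hypothetical names operating on plain strings.

```go
package main

import "fmt"

// nuc maps a base to its 2-bit code; anything other than C, G, T falls back to 0.
func nuc(b byte) byte {
	switch b {
	case 'C':
		return 1
	case 'G':
		return 2
	case 'T':
		return 3
	}
	return 0
}

// encode4mers returns one byte code per overlapping 4-mer of seq.
func encode4mers(seq string) []byte {
	var codes []byte
	var code byte
	for i := 0; i < len(seq); i++ {
		code = code<<2 | nuc(seq[i])
		if i >= 3 {
			codes = append(codes, code)
		}
	}
	return codes
}

// bestShift indexes the reference 4-mers, then counts, for every shift
// (refPos - queryPos), how many query 4-mers land on an identical reference 4-mer.
func bestShift(ref, query string) (shift, matches int) {
	var index [256][]int
	for pos, c := range encode4mers(ref) {
		index[c] = append(index[c], pos)
	}
	counts := map[int]int{}
	for qpos, c := range encode4mers(query) {
		for _, rpos := range index[c] {
			counts[rpos-qpos]++
		}
	}
	for s, n := range counts {
		// Ties are broken towards the smaller shift to keep the result deterministic.
		if n > matches || (n == matches && s < shift) {
			shift, matches = s, n
		}
	}
	return shift, matches
}

func main() {
	fmt.Println(bestShift("GGGGACGTCATG", "ACGTCATG")) // 4 5
}
```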
+
+## Design Highlights
+
+- **Memory-aware**: Supports buffer reuse to minimize allocations.  
+- **Robustness**: Non-canonical bases handled gracefully (defaulting to A).  
+- **Performance-oriented**: O(n) encoding and indexing; efficient hash-based shift counting.  
+
+Intended for rapid alignment-free sequence comparison in metabarcoding or metagenomic workflows.
@@ -0,0 +1,39 @@
+# Semantic Description of `obikmer` Package
+
+The `obikmer` package provides high-performance, zero-allocation utilities for **k-mer manipulation** in DNA sequences (A/C/G/T/U), targeting bioinformatics applications like genome indexing, assembly, and error correction.
+
+## Core Encoding & Decoding
+
+- **`EncodeKmer`, `DecodeKmer`**: Convert between DNA sequences and compact 62-bit uint64 representations (2 bits/base), preserving top 2 bits for optional error markers.
+- **`EncodeCanonicalKmer`, `CanonicalKmer`**: Encode or normalize k-mers to their *biological canonical form* — the lexicographically smaller of a k-mer and its reverse complement.
+
+## Iterators (Memory-Efficient Streaming)
+
+- **`IterKmers`, `IterCanonicalKmers`**: Stream all overlapping k-mers from a sequence without allocating intermediate slices — ideal for large-scale processing (e.g., inserting into Roaring Bitmaps).
+- **`IterCanonicalKmersWithErrors`**: Same as above, but detects ambiguous bases (N/R/Y/W/S/K/M/B/D/H/V) and encodes their count in the top 2 bits (error code: 0–3). Only valid for **odd k ≤ 31**.
+
+## Error Handling & Markers
+
+- `SetKmerError`, `GetKmerError`, and `ClearKmerError` manipulate the top 2 bits of a uint64 to store error metadata (e.g., ambiguous base count), enabling downstream filtering or correction.
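
Storing a 2-bit marker above a 62-bit payload takes only a mask and a shift. The sketch below shows the idea with hypothetical lowercase helpers (`setErr`, `getErr`, `clearErr`); the package's exported functions may differ in signature.

```go
package main

import "fmt"

const errShift = 62
const errMask uint64 = 3 << errShift // the top 2 bits of a uint64

// setErr stores a 2-bit code (0..3) in the top bits, leaving the k-mer payload intact.
func setErr(kmer, code uint64) uint64 {
	return kmer&^errMask | (code&3)<<errShift
}

// getErr extracts the 2-bit code.
func getErr(kmer uint64) uint64 { return kmer >> errShift }

// clearErr resets the marker and returns the bare k-mer.
func clearErr(kmer uint64) uint64 { return kmer &^ errMask }

func main() {
	kmer := uint64(0x1B) // "ACGT" with A=0, C=1, G=2, T=3
	tagged := setErr(kmer, 2)
	fmt.Println(getErr(tagged), clearErr(tagged) == kmer) // 2 true
}
```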
+
+## Reverse Complement & Circular Normalization
+
+- **`ReverseComplement`, `CanonicalKmer`**: Compute biological reverse complement and canonical form.
+- **`NormalizeCircular`, `EncodeCircularCanonicalKmer`**: Compute *circular canonical form* — the lexicographically smallest rotation (used for low-complexity masking).
+- Distinction: `CanonicalKmer` uses **reverse complement**, while `NormalizeCircular` uses **rotation**.
+
+## Counting & Math Utilities
+
+- **`CanonicalCircularKmerCount`, `necklaceCount`, etc.**: Compute exact counts of unique circular k-mer equivalence classes using **Moreau’s necklace formula**, with Euler's totient function and divisor enumeration.
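
For reference, the classic necklace-counting formula for words of length `n` over an alphabet of size `a` is `N(n, a) = (1/n) · Σ_{d | n} φ(d) · a^(n/d)`. The sketch below evaluates it for the DNA alphabet; it covers rotation classes only, and whether the package's `CanonicalCircularKmerCount` applies further adjustments is not assumed here. Function names are illustrative.

```go
package main

import "fmt"

// phi is Euler's totient function, computed by trial division.
func phi(n int) int {
	result := n
	for p := 2; p*p <= n; p++ {
		if n%p == 0 {
			for n%p == 0 {
				n /= p
			}
			result -= result / p
		}
	}
	if n > 1 {
		result -= result / n
	}
	return result
}

// pow computes a^e for small non-negative e.
func pow(a, e int) int {
	r := 1
	for i := 0; i < e; i++ {
		r *= a
	}
	return r
}

// necklaceCount returns the number of rotation classes of length-n words
// over an alphabet of size a.
func necklaceCount(n, a int) int {
	sum := 0
	for d := 1; d <= n; d++ {
		if n%d == 0 {
			sum += phi(d) * pow(a, n/d)
		}
	}
	return sum / n
}

func main() {
	for n := 1; n <= 4; n++ {
		fmt.Print(necklaceCount(n, 4), " ") // 4 10 24 70
	}
	fmt.Println()
}
```

For example, the 16 dinucleotides collapse into 10 classes: four homodimers plus six unordered pairs of distinct bases.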
+
+## Performance & Safety
+
+- All functions avoid heap allocations where possible (reusing buffers).
+- Panics on invalid `k` or length mismatches for correctness.
+- Supports case-insensitive input (A/a, T/t…), and ambiguous bases via `__single_base_code_err__`.
+
+## Use Cases
+
+- K-mer counting in assemblers (e.g., with Bloom filters or bitmaps)
+- Error-aware k-mer filtering in sequencing pipelines
+- Low-complexity region detection via circular entropy normalization
@@ -0,0 +1,36 @@
+# Obikmer: Efficient K-mer Encoding and Manipulation in Go
+
+This package provides high-performance utilities for DNA sequence analysis using *k*-mers—contiguous substrings of length `k`. It supports encoding, canonicalization (forward/reverse-complement normalization), minimizer-based super-*k*-mer extraction, and error tagging—all optimized for 64-bit integer arithmetic.
+
+## Core Functionalities
+
+### K-mer Encoding (`EncodeKmers`, `IterKmers`)
+Encodes DNA sequences (A/C/G/T/U, case-insensitive) into `uint64` using 2 bits per nucleotide (A=00, C=01, G=10, T/U=11). Supports sliding-window extraction and streaming via an iterator. Supports *k*-mers up to `k = 31` (62 bits) and validates `k`, rejecting out-of-range values.
+
+### Reverse Complement (`ReverseComplement`)
+Computes the reverse complement of a *k*-mer in constant time using bit manipulation. Preserves error metadata (see below) and satisfies involution: `RC(RC(x)) = x`.
+
+### Canonical K-mers (`CanonicalKmer`, `EncodeCanonicalKmers`)
+Returns the lexicographically smaller of a *k*-mer and its reverse complement—enabling strand-agnostic analysis. Supports both single-kmer normalization (`CanonicalKmer`) and full-sequence canonical encoding.
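
With the encoding A=00, C=01, G=10, T=11, complementing a base is a bitwise NOT, and reversing the base order is a fixed sequence of swaps, so the whole operation is independent of `k`. The sketch below shows one way to do this; it ignores the error-marker bits and uses hypothetical names (`encode`, `reverseComplement`, `canonical`).

```go
package main

import (
	"fmt"
	"math/bits"
)

var code = map[byte]uint64{'A': 0, 'C': 1, 'G': 2, 'T': 3}

// encode packs a DNA string into 2 bits per base, first base in the highest bits.
func encode(s string) uint64 {
	var v uint64
	for i := 0; i < len(s); i++ {
		v = v<<2 | code[s[i]]
	}
	return v
}

// reverseComplement assumes the k-mer occupies the low 2k bits (k <= 31).
func reverseComplement(kmer uint64, k int) uint64 {
	x := ^kmer                                                // complement every base
	x = (x>>2)&0x3333333333333333 | (x&0x3333333333333333)<<2 // swap adjacent bases
	x = (x>>4)&0x0F0F0F0F0F0F0F0F | (x&0x0F0F0F0F0F0F0F0F)<<4 // swap nibbles
	x = bits.ReverseBytes64(x)                                // reverse byte order
	return x >> (64 - 2*uint(k))                              // realign to the low bits
}

// canonical returns the numerically smaller of a k-mer and its reverse complement,
// which matches lexicographic order under the A<C<G<T encoding.
func canonical(kmer uint64, k int) uint64 {
	if rc := reverseComplement(kmer, k); rc < kmer {
		return rc
	}
	return kmer
}

func main() {
	fmt.Println(reverseComplement(encode("AAAC"), 4) == encode("GTTT")) // true
	fmt.Println(canonical(encode("GTTT"), 4) == encode("AAAC"))         // true
}
```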
+
+### Super *k*-mers Extraction (`ExtractSuperKmers`)
+Groups overlapping *k*-mers sharing the same minimizer (minimal *m*-mer in sliding window) into contiguous regions ("super *k*-mers"). Output includes start/end positions and minimizer values, all canonicalized.
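
The grouping rule can be sketched on plain strings. This is a deliberately simplified version: it uses the lexicographically smallest forward *m*-mer as the minimizer and skips canonicalization, and the `superKmer` type and function names are assumptions made for the example.

```go
package main

import "fmt"

// superKmer is a half-open region [start, end) of the sequence whose k-mers
// all share the same minimizer.
type superKmer struct {
	start, end int
	minimizer  string
}

// minimizerOf returns the lexicographically smallest m-mer inside kmer.
func minimizerOf(kmer string, m int) string {
	best := kmer[:m]
	for i := 1; i+m <= len(kmer); i++ {
		if w := kmer[i : i+m]; w < best {
			best = w
		}
	}
	return best
}

// extractSuperKmers merges consecutive k-mers that share a minimizer.
func extractSuperKmers(seq string, k, m int) []superKmer {
	var out []superKmer
	for i := 0; i+k <= len(seq); i++ {
		mz := minimizerOf(seq[i:i+k], m)
		if n := len(out); n > 0 && out[n-1].minimizer == mz {
			out[n-1].end = i + k // extend the current super k-mer
		} else {
			out = append(out, superKmer{i, i + k, mz})
		}
	}
	return out
}

func main() {
	for _, sk := range extractSuperKmers("GGTACGTTCC", 5, 3) {
		fmt.Printf("[%d,%d) %s\n", sk.start, sk.end, sk.minimizer)
	}
}
```

On this input the six 5-mers fall into four regions with minimizers `GGT`, `ACG`, `CGT`, and `GTT`; the three middle k-mers sharing `ACG` are stored once as the 7-base region `[1,8)`.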
+
+### Error Marking (`SetKmerError`, `GetKmerError`, etc.)
+Uses the top 2 bits of a `uint64` to tag error states (e.g., sequencing errors), leaving 62 bits for sequence data. Error operations preserve the underlying *k*-mer and work seamlessly with canonicalization/RC.
+
+## Key Features
+
+- **Memory Efficiency**: Reusable buffers via optional `*[]uint64` or `*[]SuperKmer` parameters.
+- **Edge Case Handling**: Gracefully handles empty sequences, `k > len(seq)`, invalid parameters (`m ≥ k`), and max-length constraints.
+- **Performance**: Optimized for speed—benchmarks included for all major functions (e.g., `BenchmarkEncodeKmers`, `BenchmarkExtractSuperKmers`).
+- **Comprehensive Testing**: Covers basic cases, boundary conditions (e.g., 31-mers), symmetry properties (canonical/RC invariance), and error resilience.
+
+## Use Cases
+
+- Genome assembly & de Bruijn graph (DBG) construction  
+- Minimizer-based sketching (e.g., *Mash*, *Sourmash*)  
+- Error-aware k-mer counting & filtering  
+- Strand-unbiased sequence comparison  
+
+All functions operate on `[]byte` DNA sequences and return canonicalized, efficient representations suitable for hashing or indexing.
@@ -0,0 +1,31 @@
+# Semantic Description of `obikmer` Entropy Functions
+
+The `obikmer` package provides high-performance tools to compute **Shannon entropy** for DNA *k*-mers, with a focus on detecting low-complexity sequences via sub-word repetition analysis.
+
+## Core Functionality
+
+- **`KmerEntropy(kmer, k, levelMax)`**:  
+  Computes the *minimum normalized Shannon entropy* across all sub-word sizes from `1` to `levelMax`.  
+  - Decodes the encoded *k*-mer (2 bits/base) into a DNA string.  
+  - For each word size `ws`, extracts all overlapping substrings, normalizes them to their **circular canonical form**, and counts frequencies.  
+  - Normalized entropy = `(log(N) − Σ(nᵢ log nᵢ)/N) / emax`, where `emax` is the theoretical max entropy given sequence length and alphabet constraints.  
+  - Returns min entropy across `ws ∈ [1, levelMax]`. Values near **0** indicate repeats (e.g., `AAAAA…`); values near **1** suggest high complexity.
+
+- **`KmerEntropyFilter`**:  
+  A reusable, precomputed filter for batch processing millions of *k*-mers efficiently:  
+  - Pre-builds normalization tables (for circular canonical forms), entropy lookup values (`emax`, `logNwords`), and frequency tables.  
+  - Avoids repeated allocations — critical for performance in pipelines (e.g., read filtering).  
+  - **Not goroutine-safe** — each thread must instantiate its own filter.
+
+- **`NewKmerEntropyFilter(k, levelMax, threshold)`**:  
+  Initializes a filter with precomputed tables and sets the entropy rejection `threshold`.  
+
+- **`Accept(kmer)` / `Entropy(kmer)`**:  
+  - `Accept()` returns `true` if entropy > threshold (i.e., *k*-mer is complex enough to pass).  
+  - `Entropy()` computes entropy using precomputed tables — ~10× faster than standalone calls.
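
The entropy computation described above can be approximated in a few lines. The sketch works directly on a DNA string and makes one loudly stated simplification: `emax` is taken as `log(min(N, 4^ws))`, an upper bound that ignores the reduction in distinct words caused by circular normalization, so the package's exact values will differ. `minRotation` and `kmerEntropy` are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
)

// minRotation returns the lexicographically smallest rotation of w.
func minRotation(w string) string {
	best := w
	for i := 1; i < len(w); i++ {
		if r := w[i:] + w[:i]; r < best {
			best = r
		}
	}
	return best
}

// kmerEntropy returns the minimum normalized entropy over word sizes 1..levelMax.
func kmerEntropy(seq string, levelMax int) float64 {
	best := 1.0
	for ws := 1; ws <= levelMax; ws++ {
		n := len(seq) - ws + 1 // number of overlapping words
		if n < 2 {
			break
		}
		freq := map[string]int{}
		for i := 0; i < n; i++ {
			freq[minRotation(seq[i:i+ws])]++
		}
		sum := 0.0
		for _, c := range freq {
			sum += float64(c) * math.Log(float64(c))
		}
		h := math.Log(float64(n)) - sum/float64(n)
		// Assumption: simplified emax, see the lead-in.
		emax := math.Log(math.Min(float64(n), math.Pow(4, float64(ws))))
		if e := h / emax; e < best {
			best = e
		}
	}
	return best
}

func main() {
	fmt.Printf("%.3f\n", kmerEntropy("AAAAAAAAAAAA", 3)) // homopolymer: essentially 0
	fmt.Printf("%.3f\n", kmerEntropy("ACGTACGTACGT", 3)) // short-period repeat: mid-range
}
```

The second call shows why several word sizes are examined: a period-4 repeat looks perfectly balanced at word size 1 but loses entropy once longer sub-words are counted.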
+
+## Design Highlights
+
+- **Circular canonical normalization** ensures symmetry (e.g., `AT` ≡ `TA`).  
+- **Sub-word-level entropy** captures local repetitiveness better than global *k*-mer uniqueness.  
+- Optimized for **speed and memory reuse**, suitable for large-scale genomic data filtering.
@@ -0,0 +1,37 @@
+# K-Way Merge for Sorted k-mer Streams
+
+This Go package implements a **k-way merge** over multiple sorted streams of *k*-mer values (`uint64`). It leverages a **min-heap** to efficiently produce the globally sorted sequence while aggregating duplicate counts across input streams.
+
+## Core Components
+
+- **`mergeItem`**: Stores a value and its source reader index for heap operations.
+- **`mergeHeap`** & `heap.Interface`: Implements a min-heap for efficient retrieval of smallest values.
+- **`KWayMerge`**: Main struct managing the heap and input readers.
+
+## Key Functionality
+
+- **Initialization (`NewKWayMerge`)**:
+  - Takes a slice of `*KdiReader`, each expected to yield sorted values.
+  - Preloads the heap with one value from each reader.
+
+- **Streaming Output (`Next`)**:
+  - Returns the next smallest *k*-mer, its frequency across readers (i.e., how many input streams contained it), and a success flag.
+  - Handles duplicates: pops *all* items equal to the current minimum before advancing readers.
+
+- **Cleanup (`Close`)**:
+  - Closes all underlying `KdiReader`s and returns the first encountered error.
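
The merge logic can be demonstrated without any file I/O. In the sketch below, in-memory sorted slices stand in for `KdiReader` streams, and the `merger` type is a hypothetical reduction of `KWayMerge`; each stream is assumed to be strictly increasing.

```go
package main

import (
	"container/heap"
	"fmt"
)

type item struct {
	value uint64
	src   int // index of the stream the value came from
}

// minHeap implements heap.Interface ordered by value.
type minHeap []item

func (h minHeap) Len() int            { return len(h) }
func (h minHeap) Less(i, j int) bool  { return h[i].value < h[j].value }
func (h minHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x interface{}) { *h = append(*h, x.(item)) }
func (h *minHeap) Pop() interface{} {
	old := *h
	it := old[len(old)-1]
	*h = old[:len(old)-1]
	return it
}

type merger struct {
	h       minHeap
	streams [][]uint64
	pos     []int // next unread position in each stream
}

// newMerger preloads the heap with the first value of every non-empty stream.
func newMerger(streams [][]uint64) *merger {
	m := &merger{streams: streams, pos: make([]int, len(streams))}
	for i, s := range streams {
		if len(s) > 0 {
			m.h = append(m.h, item{s[0], i})
			m.pos[i] = 1
		}
	}
	heap.Init(&m.h)
	return m
}

// Next returns the smallest pending value and the number of streams containing it.
func (m *merger) Next() (uint64, int, bool) {
	if m.h.Len() == 0 {
		return 0, 0, false
	}
	v, count := m.h[0].value, 0
	for m.h.Len() > 0 && m.h[0].value == v {
		it := heap.Pop(&m.h).(item)
		count++
		if p := m.pos[it.src]; p < len(m.streams[it.src]) {
			heap.Push(&m.h, item{m.streams[it.src][p], it.src}) // refill from the same stream
			m.pos[it.src]++
		}
	}
	return v, count, true
}

func main() {
	m := newMerger([][]uint64{{1, 3, 5}, {3, 5, 7}, {5}})
	for v, c, ok := m.Next(); ok; v, c, ok = m.Next() {
		fmt.Println(v, c)
	}
}
```

The loop prints `1 1`, `3 2`, `5 3`, `7 1`: each distinct value once, together with the number of streams that held it.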
+
+## Use Case
+
+Ideal for merging sorted *k*-mer databases (e.g., from multiple files or processes), enabling:
+- Efficient deduplication with multiplicity tracking.
+- Scalable union/intersection operations on large *k*-mer sets.
+
+## Complexity
+
+| Operation | Time       |
+|-----------|------------|
+| `Next()`  | *O(log k)* (heap ops per unique value) |
+| Init      | *O(k)*     |
+
+Where `k` = number of input readers.
@@ -0,0 +1,27 @@
+# K-Way Merge Functionality in `obikmer`
+
+This Go package provides utilities for merging sorted k-mer streams stored in `.kdi` files. Its core component is the `KWayMerge`, which performs a k-way merge of multiple sorted input streams, aggregating duplicate k-mers by counting their occurrences.
+
+## Key Features
+
+- **Sorted K-Mer Input**: Reads k-mers from `.kdi` files via `KdiReader`, assuming each file contains *sorted* 64-bit unsigned integers (`uint64`).
+- **K-Way Merge**: Merges multiple sorted streams into a single globally sorted stream using an efficient priority queue (min-heap) internally.
+- **Count Aggregation**: When identical k-mers appear across multiple streams, the merge counts how many times each unique k-mer occurs.
+- **Memory-Efficient Streaming**: Processes data incrementally, avoiding full loading of all streams into memory.
+- **Robust Test Coverage**: Includes unit tests for:
+  - Basic merging with overlapping and non-overlapping values.
+  - Single-stream input (degenerate case).
+  - Empty streams handling.
+  - All identical k-mers across inputs.
+
+## API Highlights
+
+- `NewKdiReader(path)` — opens a `.kdi` file for reading.
+- `writeKdi(...)` (test helper) — writes sorted k-mers to a `.kdi` file.
+- `NewKWayMerge([]*KdiReader)` — constructs the merger from multiple readers.
+- `.Next()` → `(kmer uint64, count int, ok bool)` — yields next merged k-mer and its frequency; `ok=false` signals end-of-stream.
+- `.Close()` — cleans up resources.
+
+## Use Case
+
+Ideal for aggregating k-mer counts across multiple sequencing samples (e.g., in bioinformatics), where each sample’s k-mers are pre-sorted and persisted, enabling scalable distributed counting without full in-memory deduplication.
@@ -0,0 +1,27 @@
+# KDI Reader: Streaming Delta-Varint Decoding for k-mers
+
+The `obikmer` package provides a high-performance, streaming reader for `.kdi` files—binary containers storing *sorted* k-mers (typically DNA substrings encoded as 64-bit integers). It supports both sequential and indexed access.
+
+## Core Features
+
+- **Streaming decoding**: K-mers are read incrementally using delta-varint compression to minimize I/O and memory footprint.
+- **Delta encoding**: After the first absolute `uint64`, subsequent values are stored as *deltas* (difference from previous), encoded via custom `DecodeVarint`.
+- **Magic & format validation**: A 4-byte magic header identifies the file format; a little-endian `uint64` stores the total k-mer count.
+- **Sparse indexing**: When paired with a `.kdx` index, `SeekTo(target)` enables fast forward-only jumps to positions ≥ target k-mer.
+- **Graceful fallback**: If `.kdx` is missing or invalid, the reader automatically degrades to sequential mode.
+
+## Key API
+
+- `NewKdiReader(path)` → opens `.kdi` for streaming (no index).
+- `NewKdiIndexedReader(path)` → opens with optional `.kdx` for random access.
+- `Next()` → returns `(nextKmer, true)` or `(0, false)` when exhausted.
+- `SeekTo(target uint64) error` → jumps to first k-mer ≥ target using index (no backward seek).
+- `Count()` / `Remaining()` → total and unread k-mers.
+- `Close()` → releases file handle.
+
+## Design Highlights
+
+- Uses 64 KB buffer for efficient I/O.
+- Index entries record `(kmer, byteOffset)` at fixed strides (e.g., every 1024 k-mers).
+- `SeekTo` is idempotent and safe: no-op if target ≤ current position or index unavailable.
+- Designed for large-scale genomic k-mer catalogs (e.g., from minimizers or de Bruijn graphs).
@@ -0,0 +1,34 @@
+# KDI File Format and API
+
+The `obikmer` package implements a compact, sorted k-mer storage format (`.kdi`) with delta compression for efficient disk persistence and retrieval.
+
+## Core Features
+
+- **Sorted k-mer serialization**: K-mers (as `uint64`) are written in ascending order.
+- **Delta encoding**: Consecutive differences (deltas) between k-mers are stored using variable-length integers (`varint`), drastically reducing size for dense sequences.
+- **Round-trip integrity**: Full write/read cycles preserve exact k-mer values and counts.
+
+## File Structure
+
+A `.kdi` file contains:
+1. **Magic header** (4 bytes): Identifies the format.
+2. **Count field** (8 bytes, `uint64`): Number of stored k-mers.
+3. **First value** (8 bytes, `uint64`): Initial k-mer.
+4. **Delta-encoded tail**: `(n−1)` deltas, each encoded as a `varint`.
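
The layout above can be reproduced in memory to check the size arithmetic. The sketch makes several assumptions that the text does not confirm: the magic bytes are `KDI\x01`, integers are little-endian, the varint is Go's standard unsigned varint (`encoding/binary`), and an empty file holds only the magic and count. `encodeKdi` and `decodeKdi` are hypothetical helpers, not the package API.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const magic = "KDI\x01" // assumed magic bytes

// encodeKdi serializes strictly increasing k-mers: magic, count, first value, deltas.
func encodeKdi(kmers []uint64) []byte {
	buf := []byte(magic)
	var tmp [binary.MaxVarintLen64]byte
	binary.LittleEndian.PutUint64(tmp[:8], uint64(len(kmers)))
	buf = append(buf, tmp[:8]...)
	for i, v := range kmers {
		if i == 0 {
			binary.LittleEndian.PutUint64(tmp[:8], v)
			buf = append(buf, tmp[:8]...)
			continue
		}
		n := binary.PutUvarint(tmp[:], v-kmers[i-1]) // store only the difference
		buf = append(buf, tmp[:n]...)
	}
	return buf
}

// decodeKdi reverses encodeKdi by accumulating deltas onto the first value.
func decodeKdi(buf []byte) ([]uint64, error) {
	if len(buf) < 12 || string(buf[:4]) != magic {
		return nil, errors.New("bad header")
	}
	n := binary.LittleEndian.Uint64(buf[4:12])
	if n == 0 {
		return nil, nil
	}
	if len(buf) < 20 {
		return nil, errors.New("missing first value")
	}
	prev := binary.LittleEndian.Uint64(buf[12:20])
	out := []uint64{prev}
	rest := buf[20:]
	for uint64(len(out)) < n {
		d, size := binary.Uvarint(rest)
		if size <= 0 {
			return nil, errors.New("truncated delta")
		}
		prev += d
		out = append(out, prev)
		rest = rest[size:]
	}
	return out, nil
}

func main() {
	data := encodeKdi([]uint64{10, 12, 300, 1 << 40})
	back, err := decodeKdi(data)
	fmt.Println(len(data), back, err)
}
```

Under these assumptions a single k-mer occupies 4 + 8 + 8 = 20 bytes, and every further k-mer whose delta is below 128 costs one extra byte.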
+
+## API
+
+- **`NewKdiWriter(path string)`**: Creates a writer; `Write(v uint64)` appends k-mers.
+- **`Writer.Count()`**: Returns the number of written items before closing.
+- **`NewKdiReader(path string)`**: Opens a reader; `Next() (uint64, bool)` yields k-mers in order.
+- **`Reader.Count()`**: Returns total stored count.
+
+## Tests Validate
+
+1. Basic round-trip with diverse values (including large `uint64`s).
+2. Empty and single-k-mer files.
+3. Exact file size for minimal cases (e.g., 20 bytes for one k-mer).
+4. Delta compression efficiency on dense sequences (e.g., 10k even numbers → ~9,999 extra bytes).
+5. Real-world usage: extracting canonical k-mers from DNA sequences, sorting/deduplicating, and persisting them.
+
+The format is optimized for memory-mapped access or streaming traversal in bioinformatics pipelines.
@@ -0,0 +1,38 @@
+# KDI File Format and Writer
+
+The `obikmer` package implements a compact, sorted sequence storage format for 64-bit k-mers using delta encoding and sparse indexing.
+
+## Core Format (`.kdi`)
+
+- **Magic header**: `KDI\x01` (`4 bytes`) identifies the file type.
+- **Count field**: `uint64 LE`, total number of k-mers (patched at close).
+- **First value**: `uint64 LE`, the initial k-mer stored as an absolute integer.
+- **Deltas**: Subsequent values encoded via *delta-varint* (difference from previous k-mer), enabling high compression for sorted sequences.
+
+## Writer (`KdiWriter`)
+
+- **Strict ordering**: K-mers must be written in *strictly increasing order*.
+- Efficient buffering via `bufio.Writer` (64 KB buffer).
+- Internally tracks:
+  - Current k-mer count,
+  - Previous value (for delta computation),
+  - Bytes written in data section.
+- **Sparse indexing**: Every `defaultKdxStride` k-mers, an entry is recorded in memory for random access.
+
+## Companion Index (`.kdx`)
+
+- Written automatically on `Close()` if indexing entries exist.
+- Stores `(kmer, file_offset)` pairs for fast seek-to-position lookups (e.g., binary search on k-mer range).
+- Enables efficient random access without full file scan.
+
+## Usage Pattern
+
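+```go
+w, err := obikmer.NewKdiWriter("data.kdi")
+if err != nil {
+    log.Fatal(err) // uses the standard "log" package
+}
+for _, kmer := range sortedKMers { // values must be strictly increasing
+    w.Write(kmer)
+}
+w.Close() // finalizes header, writes .kdx index
+```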
+
+The format is optimized for memory-efficient storage and fast retrieval of sorted uint64 k-mers in genomic or sequence analysis pipelines.
@@ -0,0 +1,29 @@
+# KDX Index Format and Functionality
+
+The `obikmer` package provides a sparse indexing mechanism for `.kdi` files, which store sorted k-mers with delta encoding. The **`.kdx` file** serves as a fast lookup table to accelerate k-mer searches.
+
+## Core Concepts
+
+- **Magic bytes**: `KDX\x01` validates file integrity.
+- **Stride-based sparsity**: One index entry every *N* k-mers (default: 4096), balancing memory vs. search speed.
+- **Entry structure**: Each entry stores:
+  - `kmer`: the k-mer value at that index position.
+  - `offset`: absolute byte offset in the corresponding `.kdi` file.
+
+## Key Operations
+
+- **Loading**: `LoadKdxIndex()` reads and validates a `.kdx` file; returns `(nil, nil)` if missing (graceful degradation).
+- **Searching**: `FindOffset(target uint64)` performs binary search over index entries to find the *best jump point*:
+  - Returns `offset`, `skipCount` (k-mer count already passed), and a boolean success flag.
+  - Enables efficient seeking: after `offset`, only up to *stride* k-mers need linear scanning.
+- **Writing**: `WriteKdxIndex()` serializes an in-memory index to disk (for building indexes).
+- **Helper**: `KdxPathForKdi()` derives the `.kdx` path from a given `.kdi` file.
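
The jump-point lookup is essentially a binary search for the last index entry whose k-mer does not exceed the target. The sketch below illustrates this under the assumption that entry `i` describes k-mer number `i × stride` in the `.kdi` file; the `entry` type and `findOffset` function are hypothetical and only mirror the behaviour described above.

```go
package main

import (
	"fmt"
	"sort"
)

// entry is one sparse index record: the k-mer value and its byte offset in the .kdi file.
type entry struct {
	kmer, offset uint64
}

// findOffset returns the byte offset to jump to, the number of k-mers skipped
// by that jump, and whether a usable entry exists.
func findOffset(entries []entry, stride int, target uint64) (offset uint64, skipCount int, ok bool) {
	// i is the first entry strictly greater than target, so i-1 is the jump point.
	i := sort.Search(len(entries), func(i int) bool { return entries[i].kmer > target })
	if i == 0 {
		return 0, 0, false // target precedes every indexed k-mer
	}
	return entries[i-1].offset, (i - 1) * stride, true
}

func main() {
	idx := []entry{{10, 20}, {50, 80}, {90, 150}}
	fmt.Println(findOffset(idx, 4096, 60)) // 80 4096 true
	fmt.Println(findOffset(idx, 4096, 5))  // 0 0 false
}
```

After jumping to the returned offset, a reader still scans linearly, but over fewer than `stride` k-mers.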
+
+## Performance
+
+- Search complexity: **O(log M)** for the binary search (where *M* = #index entries), plus ≤ stride linear steps.
+- Memory footprint: Linear in index size (16 bytes per entry), highly scalable for large k-mer sets.
+
+## Design Philosophy
+
+Minimalist, binary-safe format with explicit endianness (little-endian), no external dependencies beyond `encoding/binary`, and robust error handling.
@@ -0,0 +1,14 @@
+# Semantic Description of `obikmer` Package
+
+The `obikmer` package implements efficient k-mer matching between query sequences and an indexed reference using **canonical k-mers** partitioned by minimizer-based hashing.
+
+- `QueryEntry` represents a single canonical k‑mer with its origin: sequence index and 1-based position.
+- `PreparedQueries` groups queries into sorted buckets per partition, enabling batched and parallelized matching.
+- `PrepareQueries` scans input sequences using *super-kmers* (with window size `m`) to compute minimizers, assigns each k‑mer to a partition via modulo hashing, and sorts buckets by k‑mer value.
+- `MergeQueries` combines two sets of prepared queries across batches using a merge-sort strategy, correctly offsetting sequence indices to preserve global ordering.
+- `MatchBatch` performs parallel matching per partition: each goroutine runs a **merge-scan** between sorted queries and the corresponding KDI (K-mer Disk Index) stream.
+  - Efficient seeking is used only when beneficial, avoiding costly syscalls for small skips.
+  - Matches are recorded with thread-safe per-sequence mutexes; final positions within each sequence are sorted post-match.
+- `matchPartition` implements the core merge-scan: it opens a KDI reader, seeks to relevant regions of the index, and walks both query list and k‑mer stream in lockstep.
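
The merge-scan at the heart of the matching step is a two-pointer walk over two sorted lists. The sketch below shows it on in-memory data: a sorted slice stands in for the KDI stream, seeking is omitted, and the `query` type and `mergeScan` function are simplified stand-ins for `QueryEntry` and `matchPartition`.

```go
package main

import "fmt"

// query is a canonical k-mer with its origin (sequence index, 1-based position).
type query struct {
	kmer   uint64
	seqIdx int
	pos    int
}

// mergeScan returns the queries whose k-mer occurs in the sorted index.
// Both inputs must be sorted by k-mer value.
func mergeScan(queries []query, index []uint64) []query {
	var hits []query
	i, j := 0, 0
	for i < len(queries) && j < len(index) {
		switch {
		case queries[i].kmer < index[j]:
			i++ // query k-mer is absent from the index
		case queries[i].kmer > index[j]:
			j++ // advance the index stream
		default:
			hits = append(hits, queries[i])
			i++ // keep j: the next query may carry the same k-mer
		}
	}
	return hits
}

func main() {
	queries := []query{{3, 0, 1}, {7, 0, 2}, {7, 1, 5}, {9, 1, 6}}
	index := []uint64{1, 7, 8, 12}
	fmt.Println(mergeScan(queries, index)) // [{7 0 2} {7 1 5}]
}
```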
+
+The design supports **large-scale batch processing**, incremental query accumulation, and high-performance parallel lookup—ideal for metagenomic or biodiversity sequencing workflows.
@@ -0,0 +1,49 @@
+# `obikmer` K-mer Set Group Builder — Functional Overview
+
+The `KmerSetGroupBuilder` enables scalable construction of k-mer indexes from biological sequences, supporting both new and incremental (append) workflows. It operates in two phases: **collection** of super-kmers into partitioned temporary files (`.skm`), and **finalization**, where partitions are processed in parallel into final k-mer indexes (`.kdi`).
+
+## Core Features
+
+- **K-mer & Minimizer Configuration**:  
+  Supports `k ∈ [2,31]`; auto-computes optimal minimizer size (`m ≈ k/2.5`) and partition count (up to `4^m`, capped at 4096).
+
+- **Functional Options for Filtering**:  
+  - `WithMinFrequency(n)`: Keep only k-mers with frequency ≥ *n* (enables deduplication).  
+  - `WithMaxFrequency(n)`: Discard k-mers with frequency > *n*.  
+  - `WithEntropyFilter(threshold, levelMax)`: Remove low-complexity k-mers (entropy ≤ threshold).  
+  - `WithSaveFreqKmers(n)`: Save top-*n* most frequent k-mers per set to `top_kmers.csv`.
+
+- **Concurrent & Pipeline-Aware Processing**:  
+  Uses a two-stage pipeline: *I/O-bound readers* (2–4 goroutines) feed k-mers to *CPU-bound workers*, one per core, maximizing throughput.
+
+- **Partitioned I/O & Thread Safety**:  
+  Super-kmers are written to per-partition `.skm` files using mutex-protected writers, enabling safe concurrent `AddSequence()` calls.
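
The `With…` filtering options listed above follow Go's functional-options idiom. The sketch below shows how such options are typically wired; the `builderConfig` struct, the `BuilderOption` type, `newConfig`, and the convention that a zero `maxFreq` means "no upper limit" are all assumptions made for illustration, not the builder's real internals.

```go
package main

import "fmt"

// builderConfig holds the filter settings (hypothetical layout).
type builderConfig struct {
	minFreq int
	maxFreq int // 0 means "no upper limit" in this sketch
}

// BuilderOption mutates the configuration.
type BuilderOption func(*builderConfig)

// WithMinFrequency keeps only k-mers seen at least n times.
func WithMinFrequency(n int) BuilderOption {
	return func(c *builderConfig) { c.minFreq = n }
}

// WithMaxFrequency discards k-mers seen more than n times.
func WithMaxFrequency(n int) BuilderOption {
	return func(c *builderConfig) { c.maxFreq = n }
}

// newConfig applies the options on top of the defaults.
func newConfig(opts ...BuilderOption) builderConfig {
	cfg := builderConfig{minFreq: 1}
	for _, opt := range opts {
		opt(&cfg)
	}
	return cfg
}

// keep reports whether a k-mer with the given frequency passes the filters.
func (c builderConfig) keep(freq int) bool {
	return freq >= c.minFreq && (c.maxFreq == 0 || freq <= c.maxFreq)
}

func main() {
	cfg := newConfig(WithMinFrequency(2), WithMaxFrequency(10))
	fmt.Println(cfg.keep(1), cfg.keep(5), cfg.keep(11)) // false true false
}
```

The idiom keeps the constructor signature stable as new filters are added: callers pass only the options they care about.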
+
+## Workflow
+
+1. **Build Phase**:  
+   - Input sequences → super-kmers extracted via minimizer-based partitioning.  
+   - Super-kmers written to `.build/set_*/part_*.skm`.
+
+2. **Finalization (`Close()`)**:  
+   - `.skm` files loaded → canonical k-mers extracted.  
+   - K-mers sorted, counted (frequency spectrum), and filtered per config.  
+   - Final `.kdi` files written, along with `spectrum.bin` and, optionally, `top_kmers.csv`.  
+   - Metadata (`metadata.toml`) generated; `.build/` cleaned.
+
+3. **Append Mode**:  
+   `AppendKmerSetGroupBuilder()` extends an existing group, inheriting its parameters and appending new sets.
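
The sort, count, and filter step of finalization can be illustrated on a small in-memory slice. This is only a sketch of the idea for one partition; `countAndFilter` is a hypothetical name, and the real builder streams data from `.skm` files rather than holding it in a slice.

```go
package main

import (
	"fmt"
	"sort"
)

// countAndFilter sorts the k-mers, counts each run of equal values, records the
// frequency spectrum (frequency -> number of distinct k-mers), and keeps the
// distinct k-mers whose frequency is at least minFreq.
func countAndFilter(kmers []uint64, minFreq int) (kept []uint64, spectrum map[int]int) {
	sort.Slice(kmers, func(i, j int) bool { return kmers[i] < kmers[j] })
	spectrum = map[int]int{}
	for i := 0; i < len(kmers); {
		j := i
		for j < len(kmers) && kmers[j] == kmers[i] {
			j++
		}
		freq := j - i
		spectrum[freq]++
		if freq >= minFreq {
			kept = append(kept, kmers[i])
		}
		i = j
	}
	return kept, spectrum
}

func main() {
	kept, spectrum := countAndFilter([]uint64{7, 3, 7, 9, 3, 7}, 2)
	fmt.Println(kept, spectrum)
}
```

Here 3 occurs twice, 7 three times, and 9 once, so with a minimum frequency of 2 only 3 and 7 survive, and the output is already sorted and deduplicated.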
+
+## Output Artifacts
+
+- `.kdi`: Sorted, deduplicated (and optionally filtered) k-mers.  
+- `spectrum.bin`: Per-set frequency spectrum (`count → #k-mers`).  
+- `top_kmers.csv` (optional): Top *N* k-mers per set with counts.  
+- `metadata.toml`: Global and per-set metadata (k, m, partitions, counts).
+
+## Design Highlights
+
+- **Memory-efficient**: Streams large `.skm` files; reuses slices to minimize GC pressure.  
+- **Scalable**: Parallel finalization scales with CPU cores and I/O bandwidth.  
+- **Robust error handling**: Early termination on first failure; cleanup of partial state.
+
@@ -0,0 +1,44 @@
+# K-mer Set Group Builder — Semantic Description
+
+This Go module (`obikmer`) provides a **disk-backed builder and accessor** for managing *k-mer sets* across multiple biological sequence datasets. It supports efficient construction, persistence, and querying of canonical *k*-mers (accounting for DNA reverse-complement symmetry), with optional frequency filtering.
+
+### Core Functionalities
+
+- **K-mer Set Group Construction**:  
+  `NewKmerSetGroupBuilder` creates a builder configured with:
+  - *k* (k-mer length),
+  - *m* (minimizer length used for partitioning),
+  - number of sets (`nSets`),
+  - and optional parameters like `WithMinFrequency`.
+
+- **Sequence Ingestion**:  
+  Sequences are added per set via `AddSequence(setID, bioseq)`. Internally:
+  - Canonical *k*-mers are extracted (using `IterCanonicalKmers`),
+  - Deduplicated and optionally filtered by occurrence frequency.
+
+- **Persistence & Round-Trip**:  
+  `builder.Close()` materializes the *k*-mer sets to disk (in temp or specified directory).  
+  `OpenKmerSetGroup(dir)` reloads them — preserving all metadata and structure.
+
+- **Metadata & Attributes**:  
+  Supports custom identifiers (`SetId`) and key-value attributes (e.g., `"organism": "test"`), saved to disk via `SaveMetadata`.
+
+- **Efficient Iteration**:  
+  The iterator (`ksg.Iterator(setID)`) yields *sorted*, deduplicated canonical *k*-mers — using a k-way merge across internal partitions.
+
+- **Frequency Filtering**:  
+  `WithMinFrequency(n)` ensures only *k*-mers appearing ≥*n* times across inputs survive — enabling noise suppression (e.g., in error correction or abundance-based filtering).
+
+- **Multi-set Support**:  
+  Handles multiple independent *k*-mer sets (e.g., per sample or taxonomic group), verified via `Size()` and indexed access (`Len(setID)`).
+
+### Testing Coverage
+
+Comprehensive unit tests validate:
+- Basic construction & correctness,
+- Multi-sequence ingestion and deduplication,
+- Frequency-based inclusion/exclusion logic,
+- Cross-set isolation (`nSets > 1`),
+- Metadata round-trip integrity.
+
+This module is designed for scalable, reproducible *k*-mer indexing in metagenomic or amplicon analysis pipelines (e.g., OBITools4 ecosystem).
@@ -0,0 +1,44 @@
+# `obikmer` Package: Disk-Based K-mer Set Group Management
+
+The `obikmer` package provides a streaming, disk-backed implementation for managing collections of *k*-mer sets (called **K-mer Set Groups**), optimized for large-scale metagenomic or genomic analyses.
+
+### Core Concepts
+- A **KmerSetGroup** stores *N* independent sets of sorted *k*-mers, partitioned into *P* files per set.
+- Each group is defined by immutable parameters: `k` (*k*-mer size), `m` (minimizer size), and *P* (number of partitions).
+- Data is stored on disk as `.kdi` files (sorted k-mers) with optional sparse indices (`.kdx`) for fast lookup.
+- Metadata is serialized in TOML format (`metadata.toml`), supporting both group-level and per-set attributes.
+
+### Key Functionalities
+
+#### 1. **Lifecycle Management**
+- `OpenKmerSetGroup(directory)` loads an existing index in read-only mode.
+- `NewFilteredKmerSetGroup(...)` constructs a new group (e.g., after filtering).
+- `SaveMetadata()` persists metadata changes to disk.
+
+#### 2. **Accessors & Metadata**
+- Basic properties: `K()`, `M()`, `Partitions()`, `Size()` (i.e., *N*), and group ID.
+- Attribute API: get/set/delete user-defined metadata (group-level or per-set).
+  - Supports type coercion (`GetIntAttribute`, `GetStringAttribute`).
+
+#### 3. **Membership & Iteration**
+- `Contains(setIndex, kmer)` checks presence using indexed binary search + linear scan across all partitions (parallelized).
+- `Iterator(setIndex)` yields sorted *k*-mers via k-way merge of partition readers.
+
+#### 4. **Similarity & Distance Metrics**
+- `JaccardDistanceMatrix()` and `JaccardSimilarityMatrix()`: compute pairwise metrics in a streaming fashion.
+  - Per-partition processing with parallel goroutines and sorted merge for accurate set intersection/union counts.
+
+#### 5. **Set Management**
+- `CopySetsByIDTo(ids, destDir)` copies selected sets (with metadata) to another group.
+  - Supports compatibility checks and optional overwriting (`force`).
+- `RemoveSetByID(id)` deletes a set and renumbers the remaining sets so that indices stay contiguous.
+- Glob pattern matching: `MatchSetIDs(patterns)` resolves IDs like `"sample_*"`.
+
+#### 6. **Compatibility & Utility**
+- `IsCompatibleWith(other)` verifies same `(k, m, partitions)`.
+- Helper methods: `PartitionPath`, `Spectrum(...)`, and spectrum file I/O.
+
+### Design Highlights
+- **Streaming**: Operations avoid loading full datasets into memory.
+- **Immutability after creation** ensures consistency; modifications require explicit save operations.
+- Thread-safe for concurrent partition processing (via `sync.Mutex`/`WaitGroup`).
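The lookup strategy described for `Contains` (binary search on a sparse index, then a short linear scan) can be illustrated in memory. This is only a sketch under stated assumptions: a sorted slice stands in for one `.kdi` partition, every `stride`-th k-mer stands in for the `.kdx` index, and the names `sparseIndex`, `buildSparseIndex` and `contains` are hypothetical rather than the package's API.

```go
package main

import (
	"fmt"
	"sort"
)

// sparseIndex keeps every stride-th k-mer of a sorted list (hypothetical layout).
type sparseIndex struct {
	stride  int
	samples []uint64 // samples[i] == kmers[i*stride]
}

func buildSparseIndex(kmers []uint64, stride int) sparseIndex {
	idx := sparseIndex{stride: stride}
	for i := 0; i < len(kmers); i += stride {
		idx.samples = append(idx.samples, kmers[i])
	}
	return idx
}

// contains binary-searches the sparse samples to find the block that may hold
// the query, then scans that block linearly.
func contains(kmers []uint64, idx sparseIndex, query uint64) bool {
	// First sample strictly greater than query; the candidate block precedes it.
	j := sort.Search(len(idx.samples), func(i int) bool { return idx.samples[i] > query })
	if j == 0 {
		return false // query is smaller than every stored k-mer
	}
	start := (j - 1) * idx.stride
	end := start + idx.stride
	if end > len(kmers) {
		end = len(kmers)
	}
	for _, km := range kmers[start:end] {
		if km == query {
			return true
		}
		if km > query {
			break // sorted order: the query cannot appear later
		}
	}
	return false
}

func main() {
	kmers := []uint64{3, 8, 15, 16, 23, 42, 57, 99}
	idx := buildSparseIndex(kmers, 3)
	fmt.Println(contains(kmers, idx, 23), contains(kmers, idx, 24))
}
```

A larger stride shrinks the index but lengthens the linear scan; the right trade-off depends on how the index is stored and cached.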
@@ -0,0 +1,26 @@
+# Semantic Description of `obikmer` Set Operations
+
+This Go package implements scalable set operations over collections of *k*-mers stored in disk-backed, sorted structures (`.kdi` files). A `KmerSetGroup` represents a group of *N* independent sets (e.g., one per sample), each containing sorted unique *k*-mers.
+
+## Core Set Operations
+
+- **`Union()`**: Computes the union across all *N* sets — a k-mer appears in output if present in ≥1 input set.
+- **`Intersect()`**: Computes the intersection — a k-mer appears only if present in *all* sets.
+- **`Difference()`**: Computes `set₀ \ (set₁ ∪ … ∪ setₙ₋₁)` — keeps k-mers unique to the first set.
+- **`QuorumAtLeast(q)`**: Returns k-mers present in ≥ *q* sets.
+- **`QuorumExactly(q)`**: Returns k-mers present in exactly *q* sets.
+- **`QuorumAtMost(q)`**: Returns k-mers present in ≤ *q* sets.
+
+## Pairwise Group Operations
+
+- **`UnionWith(other)` / `IntersectWith(other)`**: Performs *per-set* binary operations between two compatible groups (same k, m, partitions, size). Result has *N* sets: `setᵢ = this.setᵢ ⊕ other.setᵢ`, where ⊕ is union or intersection.
+
+## Implementation Highlights
+
+- **Partitioned & Parallelized**: Each operation processes partitions in parallel using `runtime.NumCPU()` workers.
+- **Streaming K-way Merge**: Uses efficient sorted-stream merging (via `KWayMerge`) to avoid loading full sets into memory.
+- **Quorum Filtering**: Within each partition, counts in how many sets each k-mer occurs by merging the sorted streams and tallying hits.
+- **Compatibility Check**: Ensures groups share metadata (k, m, partitions) before pairwise operations.
+- **Disk Output**: All results materialized as new `KmerSetGroup` in a directory, with per-partition `.kdi` files and metadata.
+
+All operations preserve sorted order and support large-scale genomic datasets via streaming, partitioning, and minimal memory footprint.
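The quorum operations can be pictured as a k-way merge that counts how many inputs currently sit on the same k-mer. The following is a minimal in-memory sketch, assuming each set is a sorted, duplicate-free `[]uint64`; the names `cursor`, `mergeHeap` and `quorumAtLeast` are made up for this example and are not the package's `KWayMerge`.

```go
package main

import (
	"container/heap"
	"fmt"
)

// cursor points at the next unread k-mer of one sorted input set.
type cursor struct {
	set []uint64
	pos int
}

// mergeHeap orders cursors by their current k-mer (smallest first).
type mergeHeap []*cursor

func (h mergeHeap) Len() int           { return len(h) }
func (h mergeHeap) Less(i, j int) bool { return h[i].set[h[i].pos] < h[j].set[h[j].pos] }
func (h mergeHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *mergeHeap) Push(x any)        { *h = append(*h, x.(*cursor)) }
func (h *mergeHeap) Pop() any {
	old := *h
	c := old[len(old)-1]
	*h = old[:len(old)-1]
	return c
}

// quorumAtLeast merges sorted, duplicate-free sets and keeps each k-mer seen
// in at least q of them. The output is sorted because the merge is.
func quorumAtLeast(sets [][]uint64, q int) []uint64 {
	h := &mergeHeap{}
	for _, s := range sets {
		if len(s) > 0 {
			heap.Push(h, &cursor{set: s})
		}
	}
	var out []uint64
	for h.Len() > 0 {
		current := (*h)[0].set[(*h)[0].pos]
		hits := 0
		// Advance every cursor currently positioned on the same k-mer.
		for h.Len() > 0 && (*h)[0].set[(*h)[0].pos] == current {
			c := (*h)[0]
			hits++
			c.pos++
			if c.pos < len(c.set) {
				heap.Fix(h, 0)
			} else {
				heap.Pop(h)
			}
		}
		if hits >= q {
			out = append(out, current)
		}
	}
	return out
}

func main() {
	sets := [][]uint64{{1, 3, 5, 7}, {3, 4, 5}, {5, 7, 9}}
	fmt.Println(quorumAtLeast(sets, 1)) // union
	fmt.Println(quorumAtLeast(sets, 2))
	fmt.Println(quorumAtLeast(sets, 3)) // intersection
}
```

Replacing the `hits >= q` test with `hits == q` or `hits <= q` gives the "exactly" and "at most" variants of the same merge.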
@@ -0,0 +1,28 @@
+# Semantic Description of `obikmer` Package Functionalities
+
+The `obikmer` package provides disk-backed operations on *k*-mer sets derived from biological sequences. It supports scalable set algebra and similarity computations via the `KmerSetGroup` type.
+
+## Core Features
+
+- **Sequence-to-*k*-mer Indexing**: Sequences are converted into *k*-mers (of length `k`) and stored in a group of sets (`KmerSetGroup`), with one set per sequence. Minimizer-based sampling (parameter `m`) reduces redundancy.
+
+- **Set Operations on Disk**: Efficient disk-resident implementations of standard set operations:
+  - `Union`: Merges all *k*-mers from selected sets.
+  - `Intersect`: Retains only *k*-mers present in all input sets.
+  - `Difference` (`A \ B`): Keeps *k*-mers present in set A but not in B.
+  - `QuorumAtLeast(r)`: Returns *k*-mers appearing in ≥`r` sets (generalizes union (`r=1`) and intersection (`r=n`)).
+
+- **Consistency Guarantees**: Operations obey mathematical identities (e.g., `|A ∪ B| = |A| + |B| − |A ∩ B|`), validated via unit tests.
+
+- **Similarity & Distance Metrics**:
+  - `JaccardDistanceMatrix()`: Computes pairwise Jaccard *distances* (1 − similarity) between all sets.
+  - `JaccardSimilarityMatrix()`: Computes pairwise Jaccard *similarities* (`|A ∩ B| / |A ∪ B|`).
+  - Identical sets yield distance = `0.0`, disjoint ones give `1.0`; similarity is complementary.
+
+## Design Principles
+
+- **Temporary Directory Usage**: All operations use OS temp dirs for isolation and cleanup.
+- **Testing-Focused API**: Helper functions (`buildGroupFromSeqs`, `collectKmers`) simplify test setup.
+- **Scalability**: Disk-backed design avoids memory overflow for large sequence collections.
+
+This package enables robust, reproducible *k*-mer set analysis in bioinformatics pipelines—especially useful for metagenomic binning, error correction, or read clustering.
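To make the Jaccard definitions and the inclusion-exclusion identity concrete, here is a small in-memory sketch. It assumes two sorted, duplicate-free slices rather than disk partitions, and the helper names `intersectUnion` and `jaccardDistance` are illustrative only.

```go
package main

import "fmt"

// intersectUnion walks two sorted, duplicate-free slices in lockstep and
// returns the sizes of their intersection and union.
func intersectUnion(a, b []uint64) (inter, union int) {
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			inter++
			i++
			j++
		case a[i] < b[j]:
			i++
		default:
			j++
		}
		union++
	}
	// Whatever remains in either slice belongs to the union only.
	union += (len(a) - i) + (len(b) - j)
	return inter, union
}

// jaccardDistance returns 1 - |A∩B|/|A∪B|. Two empty sets are treated as
// identical (distance 0), a convention chosen for this sketch.
func jaccardDistance(a, b []uint64) float64 {
	inter, union := intersectUnion(a, b)
	if union == 0 {
		return 0
	}
	return 1 - float64(inter)/float64(union)
}

func main() {
	a := []uint64{1, 2, 3, 4}
	b := []uint64{3, 4, 5, 6}
	inter, union := intersectUnion(a, b)
	fmt.Println(inter, union, jaccardDistance(a, b))
	fmt.Println(union == len(a)+len(b)-inter) // inclusion-exclusion check
}
```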
@@ -0,0 +1,37 @@
+# Semantic Description of `KmerMap` Functionality
+
+The provided Go package implements a **k-mer indexing and matching system** for biological sequences (`BioSequence`). It supports both standard and *sparse* k-mer representations (where one position is masked, typically for handling ambiguous bases or symmetry).
+
+### Core Data Structures
+- `KmerMap[T]`: A generic hash map associating *normalized* k-mers (type `T`, e.g., `uint64`, encoded with 2 bits per base) to lists of sequences containing them.
+- `KmerMatch`: A map from sequence pointers to k-mer match counts, used for query results.
+
+### Key Features
+1. **K-mer Normalization**  
+   - Handles both forward and reverse-complement k-mers.
+   - Selects the lexicographically smaller representation (canonical form).
+   - Supports *sparse* k-mers: when `SparseAt ≥ 0`, the central base is ignored (replaced by `#` in string view), and k-mers are symmetrically normalized.
+
+2. **Efficient Indexing (`Push`)**  
+   - Builds an index of all canonical k-mers from a set of sequences.
+   - Optionally limits per-k-mer storage (`maxocc`), useful for filtering high-frequency k-mers (e.g., contaminants).
+
+3. **Querying (`Query`)**  
+   - Given a query sequence, returns all sequences in the index sharing k-mers with it.
+   - Counts per-sequence how many shared k-mers exist (used for similarity estimation or clustering).
+
+4. **Result Utilities (`KmerMatch`)**  
+   - `FilterMinCount`: Remove low-count matches.
+   - `Max()`, `Sequences()`: Retrieve best match or all matched sequences.
+
+5. **Construction (`NewKmerMap`)**  
+   - Automatically adjusts k-mer size: odd for sparse mode, even otherwise.
+   - Precomputes bitmasks for efficient k-mer manipulation (masking, shifting).
+   - Integrates progress bar during indexing.
+
+### Use Cases
+- Read clustering (e.g., OTU/ASV picking).
+- Error correction via k-mer abundance.
+- Sequence similarity search or contamination screening.
+
+The implementation leverages low-level bit operations for performance and memory efficiency, especially critical in large-scale NGS data processing.
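Canonical normalization of a non-sparse k-mer can be sketched as follows. The encoding A=0, C=1, G=2, T=3 with the first base in the highest-order bits is an assumption of this example, and `encode`, `reverseComplement` and `canonical` are hypothetical helpers, not the `KmerMap` internals. With that encoding, integer order matches lexicographic order for k-mers of equal length.

```go
package main

import "fmt"

// encode packs a DNA string (A/C/G/T, any case) into 2 bits per base, first
// base in the highest-order bits. It returns false on any other character.
func encode(kmer string) (uint64, bool) {
	var code uint64
	for i := 0; i < len(kmer); i++ {
		var b uint64
		switch kmer[i] | 0x20 { // fold to lowercase
		case 'a':
			b = 0
		case 'c':
			b = 1
		case 'g':
			b = 2
		case 't':
			b = 3
		default:
			return 0, false
		}
		code = code<<2 | b
	}
	return code, true
}

// reverseComplement works on the encoded form. With A=0, C=1, G=2, T=3 the
// complement of base b is 3-b.
func reverseComplement(code uint64, k int) uint64 {
	var rc uint64
	for i := 0; i < k; i++ {
		rc = rc<<2 | (3 - code&3)
		code >>= 2
	}
	return rc
}

// canonical returns the smaller of a k-mer and its reverse complement.
func canonical(code uint64, k int) uint64 {
	if rc := reverseComplement(code, k); rc < code {
		return rc
	}
	return code
}

func main() {
	fwd, _ := encode("ACGG")
	rev, _ := encode("CCGT") // reverse complement of ACGG
	fmt.Println(canonical(fwd, 4) == canonical(rev, 4))
}
```

Both strands map to the same key, so a sequence and its reverse complement hit the same index entries.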
@@ -0,0 +1,27 @@
+# Minimizer Size Utilities in `obikmer`
+
+This Go package provides helper functions to compute and validate the **minimizer size** `m` in k-mer-based genomic algorithms (e.g., minimizer schemes for sequence comparison or indexing).
+
+## Core Functions
+
+- **`DefaultMinimizerSize(k)`**  
+  Returns a *recommended* minimizer size: `ceil(k / 2.5)`, clamped to `[1, k−1]`.  
+  → Ensures `m` is reasonably large for uniqueness while keeping window size (`k − m + 1`) manageable.
+
+- **`MinMinimizerSize(nworkers)`**  
+  Computes the *minimum* `m` such that there are ≥ `nworkers` distinct minimizers:  
+  solves `4^m ≥ nworkers`, i.e., `m = ceil(log₄(nworkers))`.  
+  → Guarantees enough diversity for parallelization (e.g., hashing-based distribution across workers).
+
+- **`ValidateMinimizerSize(m, k, nworkers)`**  
+  Enforces constraints on `m`:  
+    - Lower bound: ≥ `MinMinimizerSize(nworkers)` (warns & adjusts if violated)  
+    - Hard bounds: `1 ≤ m < k`  
+  → Prevents invalid or inefficient parameter choices.
+
+## Semantic Purpose
+
+These functions ensure that minimizer-based workflows are:
+- **Theoretically sound** (sufficient entropy for parallelism),
+- **Practically viable** (avoiding degenerate cases like `m = 0` or `m ≥ k`),
+- **User-friendly** (providing sensible defaults + clear warnings on adjustment).
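The two formulas above are easy to check numerically. The sketch below re-derives them under hypothetical names (`defaultM`, `minM`); it is not the package's code, and flooring `minM` at 1 is an assumption made here so that the result always respects `1 ≤ m`.

```go
package main

import (
	"fmt"
	"math"
)

// defaultM follows the documented rule ceil(k / 2.5), clamped to [1, k-1].
func defaultM(k int) int {
	m := int(math.Ceil(float64(k) / 2.5))
	if m > k-1 {
		m = k - 1
	}
	if m < 1 {
		m = 1
	}
	return m
}

// minM returns the smallest m >= 1 with 4^m >= nworkers. Integer arithmetic
// avoids floating-point rounding in the logarithm.
func minM(nworkers int) int {
	m := 1
	for capacity := 4; capacity < nworkers; capacity *= 4 {
		m++
	}
	return m
}

func main() {
	for _, k := range []int{2, 15, 21, 31} {
		fmt.Printf("k=%d default m=%d window=%d\n", k, defaultM(k), k-defaultM(k)+1)
	}
	for _, n := range []int{1, 4, 16, 17, 256} {
		fmt.Printf("nworkers=%d min m=%d\n", n, minM(n))
	}
}
```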
@@ -0,0 +1,24 @@
+# SKM File Reader for Super-Kmers
+
+This Go package provides a binary file reader (`SkmReader`) for `.skm` files, which store *super-kmers* — compact representations of DNA sequences using 2-bit encoding.
+
+## Core Functionality
+
+- **Binary Format Parsing**: Reads structured data from `.skm` files, where each record contains:
+  - A 2-byte little-endian integer specifying the sequence length.
+  - Packed nucleotide data, where every byte encodes up to four bases (2 bits per base).
+
+- **Decoding Logic**: Converts packed 2-bit codes (`00`, `01`, `10`, `11`) to nucleotide characters using the mapping:  
+  `{ 'a', 'c', 'g', 't' }`.
+
+- **Memory-Efficient Reading**: Uses buffered I/O (64 KiB buffer) for fast sequential access.
+
+- **Streaming Interface**: `Next()` returns the next super-kmer as a struct with:
+  - `Sequence`: decoded nucleotide byte slice.
+  - `Start`, `End`: positional metadata (currently fixed to full length).
+
+- **Resource Management**: Provides a clean `.Close()` method for file handle cleanup.
+
+## Use Case
+
+Designed for high-performance processing of large genomic datasets (e.g., in k-mer analysis or sequence indexing), where storage size and read speed are critical.
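The record layout described above (2-byte little-endian length, then packed 2-bit bases) can be decoded with a few lines. This is a sketch, not the `SkmReader` source: the bit order inside each byte (first base in the two highest bits) is an assumption, and `decodeRecord` and `decodeAll` are hypothetical names.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// decodeRecord reads one record: a uint16 little-endian base count followed
// by ceil(n/4) packed bytes. The bit order within a byte is assumed.
func decodeRecord(r io.Reader) ([]byte, error) {
	var n uint16
	if err := binary.Read(r, binary.LittleEndian, &n); err != nil {
		return nil, err // io.EOF here means a clean end of input
	}
	packed := make([]byte, (int(n)+3)/4)
	if _, err := io.ReadFull(r, packed); err != nil {
		return nil, err
	}
	const alphabet = "acgt"
	seq := make([]byte, n)
	for i := range seq {
		shift := uint(6 - 2*(i%4))
		seq[i] = alphabet[(packed[i/4]>>shift)&3]
	}
	return seq, nil
}

// decodeAll decodes every record in data, stopping at the first error or EOF.
func decodeAll(data []byte) []string {
	r := bytes.NewReader(data)
	var out []string
	for {
		seq, err := decodeRecord(r)
		if err != nil {
			return out
		}
		out = append(out, string(seq))
	}
}

func main() {
	// Record 1: length 5, 00011011 (acgt), 00000000 (a + padding).
	// Record 2: length 2, 11100000 (tg + padding).
	data := []byte{0x05, 0x00, 0x1B, 0x00, 0x02, 0x00, 0xE0}
	fmt.Println(decodeAll(data))
}
```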
@@ -0,0 +1,23 @@
+# SKM File Format Specification
+
+This Go package implements a binary format for storing *super-kmers*—compact representations of DNA sequences used in bioinformatics. The tests validate reading/writing, padding behavior, and file size correctness.
+
+## Core Functionalities
+
+- **SuperKmer Structure**: Each super-kmer stores a DNA sequence (as bytes), likely padded to 4-base boundaries for efficient storage.
+- **SkmWriter**: Serializes super-kmers into a binary file. Each entry writes:
+  - A 2-byte little-endian length (number of bases),
+  - Then `ceil(length/4)` bytes encoding nucleotides in 2 bits each (A=0, C=1, G=2, T=3).
+- **SkmReader**: Parses the binary format back into memory. Returns `(SuperKmer, bool)` via `Next()`, with EOF signaled by `ok = false`.
+- **Case Handling**: The 2-bit encoding does not store case, so reads always yield lowercase bases; the tests lowercase the expected sequence (via `| 0x20`) before comparing.
+
+## Test Coverage
+
+- **Round-trip integrity**: Verifies exact sequence recovery after write/read.
+- **Empty file handling**: Confirms reader returns `ok = false` immediately on empty files.
+- **Variable-length padding**: Validates correct encoding/decoding for sequences of length 1–5.
+- **Size validation**: Confirms file size = `2 + ceil(L/4)` bytes for a sequence of length *L*.
+
+## Use Case
+
+Efficient, lossless storage and retrieval of super-kmers for downstream genomic analysis (e.g., assembly or alignment acceleration).
@@ -0,0 +1,24 @@
+# `.skm` File Format and `SkmWriter` Functionality
+
+The Go package `obikmer` provides a binary writer for `.skm` (super-kmer) files, optimized for compact storage of DNA sequences.
+
+- **Purpose**: Efficiently serialize *super-kmers* (maximal runs of consecutive k-mers sharing the same minimizer) into a binary format.
+- **Format per super-kmer**:
+  - `len: uint16 LE` — length of the sequence in bases (little-endian, 2 bytes).
+  - `data: ⌈len/4⌉ bytes` — nucleotide sequence encoded as **2 bits per base**, packed tightly.
+
+- **Encoding scheme**:
+  - `A → 00`, `C → 01`, `G → 10`, `T → 11`.
+  - Padding: trailing bits in the final byte are zeroed if `len % 4 ≠ 0`.
+
+- **Implementation details**:
+  - Uses buffered I/O (`bufio.Writer` with 64 KiB buffer) for performance.
+  - `NewSkmWriter(path)` opens/creates the file and returns a writer instance.
+  - `Write(sk SuperKmer)` encodes sequence length, then packs bases using a lookup (`__single_base_code__[seq[pos]&31]`).
+  - `Close()` flushes buffers and closes the file handle.
+
+- **Use case**: Ideal for high-throughput genomic preprocessing (e.g., indexing, sketching), where space and I/O speed matter.
+
+- **Assumptions**: `SuperKmer` type exposes a `.Sequence []byte`; bases are ASCII (`A,C,G,T,a,c,g,t`). The `&31` mask keeps the low five bits, so an uppercase letter and its lowercase form select the same lookup index.
+
+- **Efficiency**: 4× compression vs. ASCII (1 byte/base → ~0.25 bytes/base), minimal overhead.
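The write side of the format can be sketched in the same spirit. The table `baseCode` below is illustrative and is not the package's `__single_base_code__`; the bit order (first base in the two highest bits of each byte) and the name `encodeRecord` are assumptions. Sequences longer than 65535 bases would not fit the `uint16` length field and are not handled here.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// baseCode maps b&31 to a 2-bit code, so upper- and lowercase letters share
// a slot ('A'&31 == 'a'&31 == 1). Unlisted letters fall back to 0.
var baseCode = [32]byte{'C' & 31: 1, 'G' & 31: 2, 'T' & 31: 3}

// encodeRecord returns the length prefix followed by the packed bases.
// Unused trailing bits of the last byte stay zero.
func encodeRecord(seq []byte) []byte {
	out := make([]byte, 2+(len(seq)+3)/4)
	binary.LittleEndian.PutUint16(out, uint16(len(seq)))
	for i, b := range seq {
		shift := uint(6 - 2*(i%4))
		out[2+i/4] |= baseCode[b&31] << shift
	}
	return out
}

func main() {
	fmt.Printf("% x\n", encodeRecord([]byte("ACGTt")))
	for n := 1; n <= 5; n++ {
		fmt.Println(n, len(encodeRecord(make([]byte, n)))) // 2 + ceil(n/4)
	}
}
```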
@@ -0,0 +1,35 @@
+# K-mer Spectrum Analysis Package (`obikmer`)
+
+This Go package provides tools for analyzing k-mer frequency distributions in biological sequences.
+
+## Core Data Structures
+
+- **`SpectrumEntry`**: Represents a bin in the k-mer frequency spectrum:  
+  `Frequency`: how often a k-mer was observed; `Count`: number of distinct k-mers with that frequency.
+
+- **`KmerSpectrum`**: A sorted list of non-zero `SpectrumEntry`s (ascending by frequency), enabling efficient statistics and serialization.
+
+## Key Functionalities
+
+### Spectrum Management
+- `MapToSpectrum()` / `ToMap()`: Convert between map and structured spectrum representations.
+- `MergeSpectraMaps()` / `MergeTopN()`: Combine spectra or top-N data from multiple sources.
+- `MaxFrequency()` returns the highest frequency present in the spectrum.
+
+### I/O & Persistence
+- Binary format (`KSP\x01` magic header) with varint encoding for compact storage:
+  - `WriteSpectrum()` / `ReadSpectrum()`: Save/load full spectra to disk.
+- CSV export:
+  - `WriteTopKmersCSV()`: Outputs the top-N k-mers with their sequences (decoded from uint64) and frequencies.
+
+### Top-N K-mer Tracking
+- Uses a **min-heap** to efficiently maintain the *N most frequent* k-mers in streaming scenarios:
+  - `NewTopNKmers(n)`: Initialize collector.
+  - `Add(kmer, freq)`: Insert/update while respecting capacity *n*.
+  - `Results()`: Return the top *N* k-mers sorted by descending frequency.
+
+## Design Highlights
+- Memory-efficient: Uses `uint64` for k-mers (suitable up to *k* ≤ 32).
+- Streaming-friendly: Top-N collector supports incremental updates.
+- Thread-safety note: External synchronization required for concurrent access.
+
@@ -0,0 +1,48 @@
+# SuperKmer and Minimizer-Based Sliding Window Analysis
+
+This Go package provides functionality for extracting *super k-mers* from DNA sequences using a minimizer-based sliding window approach.
+
+## Core Concepts
+
+- **K-mers**: Substrings of length `k` from a DNA sequence.
+- **Minimizer**: The lexicographically smallest canonical *m*-mer (substring of length `m`) among all `(k − m + 1)` overlapping *m*-mers in a given k-mer.
+- **Super K-mer**: A maximal contiguous subsequence where *every* consecutive k-mer shares the **same minimizer**.
+
+## Data Structures
+
+### `SuperKmer`
+Represents a maximal region with uniform minimizer:
+- `Minimizer`: Canonical 64-bit hash of the shared m-mer.
+- `Start`, `End`: Slice-style bounds (0-indexed, exclusive end).
+- `Sequence`: Raw byte slice of the DNA subsequence.
+
+### `dequeItem`
+Used internally to maintain a monotone deque:
+- `position`: Index of the m-mer in the sequence.
+- `canonical`: Canonical hash value (e.g., lexicographically smallest of forward/reverse-complement).
+
+## Main Function
+
+### `ExtractSuperKmers(seq, k, m, buffer)`
+- Extracts all maximal super k-mers from `seq`.
+- Parameters validated:  
+  - `1 ≤ m < k`,  
+  - `2 ≤ k ≤ 31`,  
+  - sequence length ≥ `k`.
+- Uses an efficient **O(n)** time algorithm via internal iteration.
+- Supports optional preallocation (`buffer`) to reduce memory allocations.
+
+## Algorithm Highlights
+
+- Maintains a sliding window of size `k − m + 1` over *m*-mers.
+- Tracks the current minimizer using a monotone deque, giving amortized O(1) updates per step.
+- Detects *minimizer transitions* to delimit super k-mer boundaries.
+
+## Complexity
+
+| Aspect        | Bound                         |
+|---------------|-------------------------------|
+| Time          | **O(n)** (linear in sequence length) |
+| Space         | **O(k − m + 1)** for deque + output size |
+
+Useful in genome compression, read clustering, and minimizer-based alignment acceleration.
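The monotone-deque technique mentioned above is the classic sliding-window minimum. Below is a generic sketch over plain `uint64` values, which in this setting would be canonical *m*-mer codes with window size `w = k − m + 1`. The function name `windowMinima` and the leftmost-on-ties rule are choices made for this example.

```go
package main

import "fmt"

// windowMinima returns, for each window of w consecutive values, the position
// of its minimum (leftmost on ties). Each position enters and leaves the
// deque at most once, hence amortized O(1) work per step.
func windowMinima(values []uint64, w int) []int {
	if w <= 0 || len(values) < w {
		return nil
	}
	var deque []int // positions whose values are non-decreasing front to back
	out := make([]int, 0, len(values)-w+1)
	for i, v := range values {
		// Drop positions that can never be a window minimum again.
		for len(deque) > 0 && values[deque[len(deque)-1]] > v {
			deque = deque[:len(deque)-1]
		}
		deque = append(deque, i)
		// Drop the front once it slides out of the window [i-w+1, i].
		if deque[0] <= i-w {
			deque = deque[1:]
		}
		if i >= w-1 {
			out = append(out, deque[0])
		}
	}
	return out
}

func main() {
	values := []uint64{5, 2, 7, 2, 9, 1}
	fmt.Println(windowMinima(values, 3))
}
```

A super k-mer boundary would correspond to a step where the reported minimum changes between consecutive windows.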
@@ -0,0 +1,32 @@
+# Super K-mers Extraction Module (`obikmer`)
+
+This Go package provides efficient tools for extracting **super k-mers** from DNA sequences using *minimizer-based sliding windows*. A super k-mer is a maximal contiguous subsequence in which every k-mer (window of size `k`) shares the same canonical minimizer.
+
+## Core Functionality
+
+- **`IterSuperKmers(seq, k, m)`**  
+  Returns an iterator over `SuperKmer` structs. Each struct contains:
+  - `Start`, `End`: genomic positions of the super k-mer in the original sequence  
+  - `Minimizer`: canonical minimizer value (uint64) for that segment  
+  - `Sequence`: the actual DNA subsequence  
+
+- **`SuperKmer.ToBioSequence(...)`**  
+  Converts a raw `SuperKmer` into an enriched `obiseq.BioSequence`, embedding metadata:
+  - ID: `{parentID}_superkmer_{start}_{end}`  
+  - Attributes: minimizer sequence (`minimizer_seq`), value, `k`, `m`, positions, and parent ID  
+
+- **`SuperKmerWorker(k, m)`**  
+  A `SeqWorker` adapter for pipeline integration (e.g., with `obiiter`). Processes a full BioSequence and returns all extracted super k-mers as a slice of `BioSequence`s.
+
+## Algorithm Highlights
+
+- Uses **canonical minimizers** (forward/reverse-complement minimum) to ensure strand-invariance  
+- Maintains a monotonic deque for efficient *sliding-window minimizer* tracking (O(n) time complexity)  
+- Supports DNA bases `A/C/G/T/U` case-insensitively via bitmasking (`seq[i] & 31`)  
+- Enforces parameter constraints: `1 ≤ m < k ≤ 31`, sequence length ≥ `k`
+
+## Use Cases
+
+- Read partitioning in metagenomics (e.g., for error correction or clustering)  
+- Efficient k-mer space segmentation without storing all individual kmers  
+- Integration into modular bioinformatics pipelines via `SeqWorker` interface
@@ -0,0 +1,39 @@
+# Semantic Description of `obikmer` Package Functionalities
+
+The `obikmer` package provides tools for **super k-mer extraction and minimizer-based sequence analysis** in bioinformatics.
+
+## Core Concepts
+
+A **super k-mer** is a maximal contiguous subsequence of DNA where *all* embedded *k*-mers share the **same minimizer**—a compact representative (typically lexicographically minimal) of *m*-mers, considering both forward and reverse-complement strands.
+
+## Key Functions & Features
+
+- **`IterSuperKmers(seq, k, m)`**:  
+  An iterator over all super *k*-mers in input sequence `seq`, parameterized by:
+  - `k`: length of embedded *k*-mers,
+  - `m`: minimizer size (`1 ≤ m < k`).  
+  Yields structured objects with:
+  - `Sequence`: the super *k*-mer substring,
+  - `Start`/`End`: genomic coordinates (0-based half-open),
+  - `Minimizer`: canonical hash of the shared minimizer.
+
+- **`ExtractSuperKmers(...)`**:  
+  Synchronous counterpart returning a slice of all super *k*-mers.
+
+## Verified Properties (via Tests)
+
+1. **Boundary correctness**: Extracted subsequences match `seq[start:end]`.
+2. **Consistency between iterator and slice versions**: Both APIs produce identical results.
+3. **Bijection property**:
+   - Each unique super *k*-mer sequence maps to exactly one minimizer.
+   - All embedded *k*-mers within a super *k*-mer share the same minimizer.
+
+## Implementation Notes
+
+- Minimizers are computed canonically (min of forward and reverse-complement encodings).
+- Uses base encoding via `__single_base_code__` (assumed helper mapping A/C/G/T → 0/1/2/3).
+- Tests cover simple, homopolymer-rich, and complex genomic patterns.
+
+## Design Rationale
+
+Super *k*-mers enable efficient compression, indexing (e.g., in minimizer spaces), and alignment-free comparisons—crucial for scalable genomic analysis.
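The verified properties can be reproduced against a deliberately naive reference. The sketch below recomputes the minimizer of every *k*-mer from scratch (quadratic, for illustration only), compares *m*-mers as lowercase strings, and starts a new super *k*-mer whenever the minimizer value changes. All names are hypothetical, and the real implementation may delimit boundaries differently, for example on minimizer position.

```go
package main

import "fmt"

// revComp returns the reverse complement of a lowercase acgt string.
func revComp(s string) string {
	comp := map[byte]byte{'a': 't', 'c': 'g', 'g': 'c', 't': 'a'}
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		out[len(s)-1-i] = comp[s[i]]
	}
	return string(out)
}

// minimizer returns the lexicographically smallest canonical m-mer of a k-mer.
func minimizer(kmer string, m int) string {
	best := ""
	for i := 0; i+m <= len(kmer); i++ {
		mm := kmer[i : i+m]
		if rc := revComp(mm); rc < mm {
			mm = rc
		}
		if best == "" || mm < best {
			best = mm
		}
	}
	return best
}

type span struct {
	start, end int // half-open [start, end)
	minimizer  string
}

// superKmers groups consecutive k-mers that share the same minimizer value.
func superKmers(seq string, k, m int) []span {
	var out []span
	for i := 0; i+k <= len(seq); i++ {
		mz := minimizer(seq[i:i+k], m)
		if n := len(out); n > 0 && out[n-1].minimizer == mz {
			out[n-1].end = i + k // extend the current super k-mer
		} else {
			out = append(out, span{i, i + k, mz})
		}
	}
	return out
}

func main() {
	seq := "acgtacgttgcaacgt"
	for _, s := range superKmers(seq, 5, 3) {
		fmt.Println(s.start, s.end, seq[s.start:s.end], s.minimizer)
	}
}
```

Consecutive super *k*-mers produced this way overlap by `k − 1` bases, since the last *k*-mer of one and the first *k*-mer of the next are adjacent in the sequence.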
@@ -0,0 +1,33 @@
+# Variable-Length Integer Encoding/Decoding Utility
+
+This Go package (`obikmer`) provides efficient serialization of `uint64` integers using **protobuf-style variable-length encoding (varint)**.
+
+## Core Features
+
+- ✅ `EncodeVarint(io.Writer, uint64) (n int, err error)`  
+  Writes a `uint64` as a compact varint to any `io.Writer`. Uses **7 bits per byte**, with the MSB as a continuation flag. Max 10 bytes for `uint64`.
+
+- ✅ `DecodeVarint(io.Reader) (val uint64, err error)`  
+  Reads and decodes a varint from any `io.Reader`. Handles multi-byte sequences safely; returns an error on malformed input or on overflow (an encoding that runs past 10 bytes, i.e., 70 payload bits).
+
+- ✅ `VarintLen(uint64) int`  
+  Computes the exact byte length required to encode a value *without* performing I/O — useful for buffer preallocation or size estimation.
+
+## Encoding Scheme
+
+- Each byte holds 7 bits of data, least-significant group first; its most significant bit is `1` if more bytes follow and `0` otherwise.
+- Example:  
+  - `0x7F` → `1 byte`: `0111_1111`  
+  - `0x80` → `2 bytes`: `1000_0000 0000_0001`
+
+## Use Cases
+
+- Network protocols & binary file formats requiring compact integer representation  
+- Serialization frameworks (e.g., custom protobuf-like codecs)  
+- Embedded systems or bandwidth-constrained environments where space efficiency matters
+
+## Design Notes
+
+- No external dependencies; uses only `io` from the standard library.  
+- Thread-safe *per call* (no shared state), but `io.Reader`/`Writer` concurrency must be handled externally.  
+- Compatible with the standard protobuf varint format (the same encoding used by `PutUvarint`/`Uvarint` in Go's `encoding/binary` and by protobuf messages carried over gRPC).
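The encoding scheme can be written out in a few lines and cross-checked against the standard library. The functions below (`putVarint`, `getVarint`, `roundTrip`) are a sketch with hypothetical names, not the package's `EncodeVarint`/`DecodeVarint`; as a simplification, the decoder takes an `io.ByteReader` instead of an `io.Reader`.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// putVarint writes v using 7 payload bits per byte, low-order group first.
func putVarint(w io.Writer, v uint64) (int, error) {
	var buf [10]byte
	n := 0
	for v >= 0x80 {
		buf[n] = byte(v) | 0x80 // continuation bit set: more bytes follow
		v >>= 7
		n++
	}
	buf[n] = byte(v)
	return w.Write(buf[:n+1])
}

// getVarint reads one varint and rejects encodings longer than 10 bytes.
func getVarint(r io.ByteReader) (uint64, error) {
	var v uint64
	for shift := uint(0); shift < 70; shift += 7 {
		b, err := r.ReadByte()
		if err != nil {
			return 0, err
		}
		v |= uint64(b&0x7F) << shift
		if b&0x80 == 0 {
			return v, nil
		}
	}
	return 0, errors.New("varint too long")
}

// roundTrip encodes v, compares the bytes with encoding/binary, then decodes.
func roundTrip(v uint64) (decoded uint64, size int, matchesStdlib bool) {
	var buf bytes.Buffer
	size, _ = putVarint(&buf, v)
	var std [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(std[:], v)
	matchesStdlib = bytes.Equal(buf.Bytes(), std[:n])
	decoded, _ = getVarint(&buf)
	return
}

func main() {
	for _, v := range []uint64{0, 127, 128, 16384, ^uint64(0)} {
		decoded, size, same := roundTrip(v)
		fmt.Println(v, size, decoded == v, same)
	}
}
```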
@@ -0,0 +1,37 @@
+# Varint Encoding and Decoding Module (`obikmer`)
+
+This Go package implements **variable-length integer encoding/decoding**, commonly used in binary protocols (e.g., Protocol Buffers, SQLite) to efficiently store small integers using fewer bytes.
+
+## Core Features
+
+- **`EncodeVarint(w io.Writer, v uint64) (n int, err error)`**  
+  Encodes a `uint64` value into the minimal number of bytes (1–10) using **LEB128-style varint**, writing the result to a writer. Returns bytes written and any I/O error.
+
+- **`DecodeVarint(r io.Reader) (uint64, error)`**  
+  Reads and decodes a varint from an `io.Reader`, reconstructing the original `uint64`. Fails on malformed or incomplete data.
+
+- **`VarintLen(v uint64) int`**  
+  Computes the exact number of bytes required to encode `v`, without performing I/O.
+
+## Test Coverage
+
+- **Round-trip correctness**: All test values (including edge cases like `0`, powers of two, and max `uint64`) encode → decode back identically.
+- **Length validation**: Encoded length matches `VarintLen` predictions exactly (e.g., 127 → 1 byte; 16384 → 3 bytes).
+- **Sequence handling**: Multiple varints can be concatenated and decoded in order, preserving data integrity.
+
+## Efficiency & Design
+
+- Uses **7-bit groups per byte**, with the MSB as a continuation flag (`1` = more bytes follow).
+- Minimal memory footprint — no allocations beyond buffer I/O.
+- Designed for streaming use (e.g., network or file serialization).
+
+## Edge Cases Verified
+
+| Value          | Encoded Length |
+|----------------|---------------|
+| `0`            | 1 byte        |
+| `2⁷−1 = 127`   | 1 byte        |
+| `2⁷ = 128`     | 2 bytes       |
+| `2¹⁴−1 = 16383`| 2 bytes       |
+| `^uint64(0)`   | **10 bytes**  |
+
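The lengths in the table follow from counting 7-bit groups. A minimal sketch of that computation, using `math/bits` and a hypothetical name `varintLen` (not the package's `VarintLen`):

```go
package main

import (
	"fmt"
	"math/bits"
)

// varintLen returns the number of 7-bit groups needed to hold v. Zero still
// occupies one byte.
func varintLen(v uint64) int {
	if v == 0 {
		return 1
	}
	return (bits.Len64(v) + 6) / 7 // ceil(bitLength / 7)
}

func main() {
	for _, v := range []uint64{0, 127, 128, 16383, 16384, ^uint64(0)} {
		fmt.Println(v, varintLen(v))
	}
}
```

Because `bits.Len64` compiles to a single instruction on most platforms, this avoids the loop a byte-by-byte count would need.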