⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30 (automated by Makefile)
2026-04-30 03:50:39 +00:00 · 2026-04-07 08:36:50 +02:00
parent 670edc1958
commit 8c7017a99d
392 changed files with 18875 additions and 141 deletions
@@ -0,0 +1,37 @@
+# Semantic Description of `KmerMap` Functionality
+
+The provided Go package implements a **k-mer indexing and matching system** for biological sequences (`BioSequence`). It supports both standard and *sparse* k-mer representations (where one position is masked, typically for handling ambiguous bases or symmetry).
+
+### Core Data Structures
+- `KmerMap[T]`: A generic hash map associating *normalized* k-mers (type `T`, e.g., a `uint64` holding a 2-bit-per-base encoding) with the list of sequences that contain them.
+- `KmerMatch`: A map from sequence pointers to k-mer match counts, used for query results.
+
+### Key Features
+1. **K-mer Normalization**  
+   - Handles both forward and reverse-complement k-mers.
+   - Selects the lexicographically smaller representation (canonical form).
+   - Supports *sparse* k-mers: when `SparseAt ≥ 0`, the central base is ignored (replaced by `#` in string view), and k-mers are symmetrically normalized.
+
+2. **Efficient Indexing (`Push`)**  
+   - Builds an index of all canonical k-mers from a set of sequences.
+   - Optionally limits per-k-mer storage (`maxocc`), useful for filtering high-frequency k-mers (e.g., contaminants).
+
+3. **Querying (`Query`)**  
+   - Given a query sequence, returns all sequences in the index sharing k-mers with it.
+   - Counts, for each matched sequence, how many k-mers it shares with the query (useful for similarity estimation or clustering).
+
+4. **Result Utilities (`KmerMatch`)**  
+   - `FilterMinCount`: Removes low-count matches.
+   - `Max()`, `Sequences()`: Return the best match or all matched sequences.
+
+5. **Construction (`NewKmerMap`)**  
+   - Automatically adjusts k-mer size: odd for sparse mode, even otherwise.
+   - Precomputes bitmasks for efficient k-mer manipulation (masking, shifting).
+   - Integrates a progress bar during indexing.
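
The indexing and querying steps listed above can be illustrated with a minimal sketch. This is not the package's code: `toyIndex`, `push`, `query`, and `filterMinCount` are hypothetical names, k-mers are kept as plain strings rather than bit-packed integers, sequences are identified by string ids instead of pointers, and input is assumed to be lowercase `acgt`. Counting each distinct k-mer once per sequence is a choice made for this sketch only.

```go
package main

import "fmt"

// complement maps each lowercase DNA base to its Watson-Crick partner.
var complement = map[byte]byte{'a': 't', 'c': 'g', 'g': 'c', 't': 'a'}

// canonical returns the lexicographically smaller of a k-mer and its
// reverse complement.
func canonical(kmer string) string {
	rc := make([]byte, len(kmer))
	for i := 0; i < len(kmer); i++ {
		rc[len(kmer)-1-i] = complement[kmer[i]]
	}
	if string(rc) < kmer {
		return string(rc)
	}
	return kmer
}

// distinctCanonical lists each canonical k-mer of seq once, in order of
// first appearance.
func distinctCanonical(seq string, k int) []string {
	seen := map[string]bool{}
	var out []string
	for i := 0; i+k <= len(seq); i++ {
		km := canonical(seq[i : i+k])
		if !seen[km] {
			seen[km] = true
			out = append(out, km)
		}
	}
	return out
}

// toyIndex (hypothetical) maps canonical k-mers to the ids of the
// sequences that contain them.
type toyIndex struct {
	k      int
	maxocc int // maximum ids stored per k-mer; 0 means unlimited
	kmers  map[string][]string
}

func newToyIndex(k, maxocc int) *toyIndex {
	return &toyIndex{k: k, maxocc: maxocc, kmers: map[string][]string{}}
}

// push records every distinct canonical k-mer of seq under id, skipping
// k-mers that already hold maxocc entries.
func (idx *toyIndex) push(id, seq string) {
	for _, km := range distinctCanonical(seq, idx.k) {
		if idx.maxocc > 0 && len(idx.kmers[km]) >= idx.maxocc {
			continue
		}
		idx.kmers[km] = append(idx.kmers[km], id)
	}
}

// query counts, per indexed sequence id, the canonical k-mers shared with seq.
func (idx *toyIndex) query(seq string) map[string]int {
	counts := map[string]int{}
	for _, km := range distinctCanonical(seq, idx.k) {
		for _, id := range idx.kmers[km] {
			counts[id]++
		}
	}
	return counts
}

// filterMinCount deletes entries whose count is below min.
func filterMinCount(m map[string]int, min int) {
	for id, n := range m {
		if n < min {
			delete(m, id)
		}
	}
}

func main() {
	idx := newToyIndex(4, 0)
	idx.push("ref1", "acgtacca")
	idx.push("ref2", "ttttgggg")

	// The query is the reverse complement of ref1, so every k-mer matches.
	matches := idx.query("tggtacgt")
	filterMinCount(matches, 2)
	fmt.Println(matches) // prints map[ref1:5]
}
```

Because both strands collapse to the same canonical key, the reverse-complemented query still matches all five k-mers of `ref1`, which is the behaviour the normalization step is meant to provide.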
+
+### Use Cases
+- Read clustering (e.g., OTU/ASV picking).
+- Error correction via k-mer abundance.
+- Sequence similarity search or contamination screening.
+
+The implementation relies on low-level bit operations for speed and memory efficiency, which matters most when processing large-scale NGS data.
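
To make the bit-level side concrete, here is a hedged sketch of rolling 2-bit encoding with canonical selection and an optional masked central base. The function names, the `a=0, c=1, g=2, t=3` code assignment, and the way the sparse mask is applied are assumptions made for illustration; the package's actual masks and normalization may differ. The sketch assumes `1 <= k <= 32`, and an odd `k` when `sparse` is set.

```go
package main

import "fmt"

// encodeBase maps a, c, g, t to 0..3, chosen so that the complement of a
// code b is 3 - b. The second result is false for any other symbol.
func encodeBase(b byte) (uint64, bool) {
	switch b {
	case 'a', 'A':
		return 0, true
	case 'c', 'C':
		return 1, true
	case 'g', 'G':
		return 2, true
	case 't', 'T':
		return 3, true
	}
	return 0, false
}

// canonicalKmers returns the canonical 2-bit code of every k-mer in seq.
// Windows containing a non-ACGT symbol are skipped. When sparse is true,
// the two bits of the central base are cleared on both strands before the
// smaller code is chosen.
func canonicalKmers(seq string, k int, sparse bool) []uint64 {
	mask := uint64(1)<<(2*k) - 1 // keeps the low 2k bits
	shift := 2 * (k - 1)         // bit offset of the leftmost base
	sparseMask := ^uint64(0)
	if sparse {
		// For odd k the central base sits at bit offset 2*(k/2) on both strands.
		sparseMask = ^(uint64(3) << (2 * (k / 2)))
	}

	var fwd, rev uint64
	var out []uint64
	valid := 0 // number of consecutive valid bases seen so far
	for i := 0; i < len(seq); i++ {
		code, ok := encodeBase(seq[i])
		if !ok {
			valid, fwd, rev = 0, 0, 0
			continue
		}
		// Forward strand: append the new base on the right.
		fwd = (fwd<<2 | code) & mask
		// Reverse complement: prepend the complemented base on the left.
		rev = rev>>2 | (3-code)<<shift
		valid++
		if valid < k {
			continue
		}
		f, r := fwd&sparseMask, rev&sparseMask
		if r < f {
			f = r
		}
		out = append(out, f)
	}
	return out
}

func main() {
	// "acgt" is its own reverse complement: 00 01 10 11 in binary is 27.
	fmt.Println(canonicalKmers("acgt", 4, false)) // prints [27]
	// With the central base masked, "aca" and "aga" share one code.
	fmt.Println(canonicalKmers("aca", 3, true), canonicalKmers("aga", 3, true))
}
```

Updating both strands incrementally costs a few shifts and masks per base, so no k-mer string is ever materialized; this is the kind of saving the paragraph above refers to.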