autodoc/docmd/pkg/obikmer/kmermap.md

# Semantic Description of `KmerMap` Functionality

The provided Go package implements a **k-mer indexing and matching system** for biological sequences (`BioSequence`). It supports both standard and *sparse* k-mer representations (where one position is masked, typically for handling ambiguous bases or symmetry).

### Core Data Structures
- `KmerMap[T]`: A generic hash map associating *normalized* k-mers (type `T`, e.g., a `uint64` holding 2 bits per base) with the list of sequences that contain them.
- `KmerMatch`: A map from sequence pointers to k-mer match counts, used for query results.
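
As a rough illustration of how these two structures relate, here is a minimal sketch. The names `KmerCode`, `BioSequence` (reduced to a single `ID` field), `Kmersize`, and `index` are assumptions made for the example and may not match the package's actual definitions.

```go
package main

import "fmt"

// BioSequence is a hypothetical, reduced stand-in for the package's sequence type.
type BioSequence struct {
	ID string
}

// KmerCode is an assumed constraint for integer types able to hold an encoded k-mer.
type KmerCode interface {
	~uint32 | ~uint64
}

// KmerMap sketches the index: encoded k-mer -> sequences that contain it.
type KmerMap[T KmerCode] struct {
	Kmersize int
	index    map[T][]*BioSequence
}

// KmerMatch sketches a query result: sequence pointer -> number of shared k-mers.
type KmerMatch map[*BioSequence]int

func main() {
	s1 := &BioSequence{ID: "s1"}
	s2 := &BioSequence{ID: "s2"}

	km := KmerMap[uint64]{Kmersize: 2, index: map[uint64][]*BioSequence{}}

	// "AC" under an assumed coding A=00, C=01.
	const ac = 0b0001
	km.index[ac] = append(km.index[ac], s1, s2)

	// Looking up one k-mer increments the count of every sequence that holds it.
	match := KmerMatch{}
	for _, s := range km.index[ac] {
		match[s]++
	}
	fmt.Println(len(km.index[ac]), match[s1], match[s2]) // prints "2 1 1"
}
```

Keying `KmerMatch` by pointer means a sequence is identified by identity rather than by content, so two sequences with the same bases remain distinct entries.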

### Key Features
1. **K-mer Normalization**  
   - Handles both forward and reverse-complement k-mers.
   - Selects the lexicographically smaller representation (canonical form).
   - Supports *sparse* k-mers: when `SparseAt ≥ 0`, the central base is ignored (replaced by `#` in string view), and k-mers are symmetrically normalized.

2. **Efficient Indexing (`Push`)**  
   - Builds an index of all canonical k-mers from a set of sequences.
   - Optionally caps the number of sequences stored per k-mer (`maxocc`), which is useful for filtering high-frequency k-mers (e.g., contaminants).

3. **Querying (`Query`)**  
   - Given a query sequence, returns all sequences in the index sharing k-mers with it.
   - Counts, for each matched sequence, how many k-mers it shares with the query (used for similarity estimation or clustering).

4. **Result Utilities (`KmerMatch`)**  
   - `FilterMinCount()`: Removes matches whose count falls below a minimum.
   - `Max()`, `Sequences()`: Return the best match or all matched sequences.

5. **Construction (`NewKmerMap`)**  
   - Automatically adjusts k-mer size: odd for sparse mode, even otherwise.
   - Precomputes bitmasks for efficient k-mer manipulation (masking, shifting).
   - Displays a progress bar during indexing.
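
The features listed above can be tied together in a small end-to-end sketch: canonical k-mer extraction, indexing with an occurrence cap, querying, and result filtering. This is an illustration under stated assumptions, not the package's implementation. The type `Seq`, the 2-bit coding (A=0, C=1, G=2, T=3), the convention that `MaxOcc == 0` means no limit, the per-sequence deduplication in `Push`, and all method signatures are hypothetical. The sketch is also non-generic and omits sparse mode.

```go
package main

import (
	"fmt"
	"strings"
)

// Seq is a hypothetical stand-in for BioSequence.
type Seq struct{ ID, Data string }

// KmerMatch maps a sequence to the number of k-mers it shares with the query.
type KmerMatch map[*Seq]int

// KmerMap is a simplified, non-generic index (uint64 k-mers, k <= 32).
type KmerMap struct {
	K      int
	MaxOcc int // assumed convention: 0 means "no limit"
	index  map[uint64][]*Seq
}

// canonicalKmers returns, for each k-mer of s, the smaller of its forward and
// reverse-complement codes. Windows containing a non-ACGT symbol are skipped.
func canonicalKmers(s string, k int) []uint64 {
	mask := uint64(1)<<(2*k) - 1
	var fwd, rev uint64
	var out []uint64
	valid := 0
	for i := 0; i < len(s); i++ {
		idx := strings.IndexByte("ACGT", s[i])
		if idx < 0 {
			valid, fwd, rev = 0, 0, 0
			continue
		}
		c := uint64(idx)
		fwd = (fwd<<2 | c) & mask           // append base on the right
		rev = rev>>2 | (3-c)<<(2*(k-1))     // prepend its complement on the left
		valid++
		if valid >= k {
			if rev < fwd {
				out = append(out, rev)
			} else {
				out = append(out, fwd)
			}
		}
	}
	return out
}

// Push indexes s once per distinct canonical k-mer, honouring MaxOcc.
func (km *KmerMap) Push(s *Seq) {
	seen := map[uint64]bool{}
	for _, kmer := range canonicalKmers(s.Data, km.K) {
		if seen[kmer] {
			continue
		}
		seen[kmer] = true
		if km.MaxOcc > 0 && len(km.index[kmer]) >= km.MaxOcc {
			continue // k-mer already saturated
		}
		km.index[kmer] = append(km.index[kmer], s)
	}
}

// Query counts, per indexed sequence, the k-mers shared with q.
func (km *KmerMap) Query(q *Seq) KmerMatch {
	m := KmerMatch{}
	for _, kmer := range canonicalKmers(q.Data, km.K) {
		for _, s := range km.index[kmer] {
			m[s]++
		}
	}
	return m
}

// FilterMinCount drops matches with fewer than n shared k-mers.
func (m KmerMatch) FilterMinCount(n int) {
	for s, c := range m {
		if c < n {
			delete(m, s)
		}
	}
}

// Max returns the sequence with the highest count, or nil if m is empty.
func (m KmerMatch) Max() *Seq {
	var best *Seq
	bestCount := 0
	for s, c := range m {
		if c > bestCount {
			best, bestCount = s, c
		}
	}
	return best
}

func main() {
	km := &KmerMap{K: 4, index: map[uint64][]*Seq{}}
	a := &Seq{ID: "a", Data: "ACGTTGCA"}
	c := &Seq{ID: "c", Data: "GGACGTGG"}
	km.Push(a)
	km.Push(c)

	// The query is the reverse complement of the first six bases of a.
	m := km.Query(&Seq{ID: "q", Data: "CAACGT"})
	fmt.Println(m[a], m[c]) // prints "3 1"

	m.FilterMinCount(2)
	fmt.Println(m.Max().ID) // prints "a"
}
```

Because both strands collapse to the same canonical code, the query matches `a` even though it was taken from the opposite strand. Comparing the two integer codes is equivalent to a lexicographic comparison here, since both k-mers have the same length and the assumed coding preserves the order A < C < G < T.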

### Use Cases
- Read clustering (e.g., OTU/ASV picking).
- Error correction via k-mer abundance.
- Sequence similarity search or contamination screening.

The implementation relies on low-level bit operations for performance and memory efficiency, which matters most in large-scale NGS data processing.
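
To give a sense of the kind of bit operations involved, the sketch below shows a precomputed k-mer mask, a rolling update that shifts one base in and one out, and a sparse mask that blanks the central base. The helper names (`kmerMask`, `sparseMask`, `roll`, `encode`, `decode`) and the 2-bit coding A=0, C=1, G=2, T=3 are assumptions for illustration, and `encode` expects only uppercase A, C, G, and T.

```go
package main

import (
	"fmt"
	"strings"
)

// kmerMask keeps the low 2*k bits, i.e. the k most recent bases.
func kmerMask(k int) uint64 {
	return uint64(1)<<(2*k) - 1
}

// sparseMask is kmerMask with the two bits of the base at 0-based
// position pos (counted from the left) cleared.
func sparseMask(k, pos int) uint64 {
	return kmerMask(k) &^ (uint64(3) << (2 * (k - 1 - pos)))
}

// roll appends a 2-bit base code on the right and drops the leftmost base.
func roll(kmer, base, mask uint64) uint64 {
	return (kmer<<2 | base) & mask
}

// encode packs an ACGT-only string into 2 bits per base.
func encode(s string) uint64 {
	var v uint64
	for i := 0; i < len(s); i++ {
		v = v<<2 | uint64(strings.IndexByte("ACGT", s[i]))
	}
	return v
}

// decode renders a k-mer as text, printing '#' at sparseAt (use -1 for none).
func decode(kmer uint64, k, sparseAt int) string {
	const bases = "ACGT"
	out := make([]byte, k)
	for i := 0; i < k; i++ {
		if i == sparseAt {
			out[i] = '#'
			continue
		}
		out[i] = bases[(kmer>>(2*(k-1-i)))&3]
	}
	return string(out)
}

func main() {
	// Rolling update: ACGT followed by A becomes CGTA.
	next := roll(encode("ACGT"), 0, kmerMask(4))
	fmt.Println(decode(next, 4, -1)) // prints "CGTA"

	// Sparse 5-mers that differ only at the central base collapse to one key.
	m := sparseMask(5, 2)
	x, y := encode("ACGTA")&m, encode("ACTTA")&m
	fmt.Println(x == y, decode(x, 5, 2)) // prints "true AC#TA"
}
```

An odd k-mer length fits sparse mode naturally: the central position maps onto itself under reverse complementation, so the masked base stays in the middle on both strands.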