obitools4/autodoc/docmd/pkg/obikmer/kmermap.md


Semantic Description of KmerMap Functionality

This Go package implements a k-mer indexing and matching system for biological sequences (BioSequence). It supports both standard k-mers and sparse k-mers, in which one position is masked (typically to tolerate an ambiguous base or to keep the k-mer symmetric).

Core Data Structures

  • KmerMap[T]: A generic hash map from normalized k-mers (type T, e.g., a uint64 holding 2 bits per base) to the list of sequences that contain each k-mer.
  • KmerMatch: A map from sequence pointers to k-mer match counts, used for query results.
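
As a rough illustration of how these two structures relate, the sketch below declares simplified stand-ins and tallies matches from a list of already-encoded k-mers. The `Sequence` type, the `index` field, the type constraint, and the `countMatches` helper are assumptions made for this example; they are not taken from the package's source.

```go
package main

import "fmt"

// Sequence is a hypothetical stand-in for the package's BioSequence type.
type Sequence struct {
	ID    string
	Bases string
}

// KmerMap sketches the index: each encoded k-mer maps to the sequences
// that contain it. The field name and the constraint are assumptions.
type KmerMap[T ~uint32 | ~uint64] struct {
	index map[T][]*Sequence
}

// KmerMatch maps a sequence pointer to the number of k-mers it shares
// with a query.
type KmerMatch map[*Sequence]int

// countMatches (hypothetical helper) tallies, for a list of query k-mer
// codes, how many of them hit each indexed sequence.
func countMatches[T ~uint32 | ~uint64](km *KmerMap[T], codes []T) KmerMatch {
	m := make(KmerMatch)
	for _, c := range codes {
		for _, s := range km.index[c] {
			m[s]++
		}
	}
	return m
}

func main() {
	a := &Sequence{ID: "a", Bases: "ACGT"}
	b := &Sequence{ID: "b", Bases: "ACGA"}
	km := &KmerMap[uint64]{index: map[uint64][]*Sequence{
		1: {a, b}, // k-mer code 1 occurs in both sequences
		2: {a},    // k-mer code 2 occurs only in a
	}}
	m := countMatches(km, []uint64{1, 2, 3}) // code 3 is not indexed
	fmt.Println(m[a], m[b])                  // prints "2 1"
}
```

Keying the result by sequence pointer keeps the tally cheap: no sequence data is copied, and each hit is a single map increment.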

Key Features

  1. K-mer Normalization

    • Handles both forward and reverse-complement k-mers.
    • Selects the lexicographically smaller representation (canonical form).
    • Supports sparse k-mers: when SparseAt ≥ 0, the central base is ignored (replaced by # in string view), and k-mers are symmetrically normalized.
  2. Efficient Indexing (Push)

    • Builds an index of all canonical k-mers from a set of sequences.
    • Optionally caps how many sequences are stored per k-mer (maxocc), which is useful for filtering high-frequency k-mers (e.g., contaminants).
  3. Querying (Query)

    • Given a query sequence, returns all sequences in the index sharing k-mers with it.
    • Counts, for each matched sequence, the number of k-mers it shares with the query (used for similarity estimation or clustering).
  4. Result Utilities (KmerMatch)

    • FilterMinCount: Removes matches whose count falls below a minimum.
    • Max(), Sequences(): Retrieve the best match or all matched sequences.
  5. Construction (NewKmerMap)

    • Automatically adjusts k-mer size: odd for sparse mode, even otherwise.
    • Precomputes bitmasks for efficient k-mer manipulation (masking, shifting).
    • Integrates a progress bar during indexing.
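
The features listed above can be sketched end to end in a few dozen lines. This is a minimal illustration under simplifying assumptions, not the package's implementation: k-mers are kept as plain strings rather than bit-encoded integers, sequences are assumed to contain only upper-case A, C, G, T, each sequence is recorded at most once per k-mer, and the names `Seq`, `Index`, `NewIndex`, `Match`, and the exact `maxocc` semantics are hypothetical.

```go
package main

import "fmt"

// Seq is a hypothetical stand-in for BioSequence.
type Seq struct {
	ID    string
	Bases string
}

// Match mirrors the idea of KmerMatch: sequence pointer -> shared k-mer count.
type Match map[*Seq]int

// Index is a simplified, string-keyed index (assumption of this sketch).
type Index struct {
	K     int
	table map[string][]*Seq
}

func NewIndex(k int) *Index {
	return &Index{K: k, table: make(map[string][]*Seq)}
}

// revComp returns the reverse complement of an upper-case ACGT string.
func revComp(s string) string {
	comp := map[byte]byte{'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		out[len(s)-1-i] = comp[s[i]]
	}
	return string(out)
}

// canonical returns the lexicographically smaller of a k-mer and its
// reverse complement.
func canonical(kmer string) string {
	if rc := revComp(kmer); rc < kmer {
		return rc
	}
	return kmer
}

// kmers lists the distinct canonical k-mers of a sequence.
func (ix *Index) kmers(s *Seq) []string {
	seen := make(map[string]bool)
	var out []string
	for i := 0; i+ix.K <= len(s.Bases); i++ {
		c := canonical(s.Bases[i : i+ix.K])
		if !seen[c] {
			seen[c] = true
			out = append(out, c)
		}
	}
	return out
}

// Push indexes a sequence. When maxocc > 0, a k-mer that already lists
// maxocc sequences is not extended further.
func (ix *Index) Push(s *Seq, maxocc int) {
	for _, c := range ix.kmers(s) {
		if maxocc > 0 && len(ix.table[c]) >= maxocc {
			continue
		}
		ix.table[c] = append(ix.table[c], s)
	}
}

// Query counts, per indexed sequence, the canonical k-mers shared with q.
func (ix *Index) Query(q *Seq) Match {
	m := make(Match)
	for _, c := range ix.kmers(q) {
		for _, s := range ix.table[c] {
			m[s]++
		}
	}
	return m
}

// FilterMinCount drops matches with fewer than min shared k-mers.
func (m Match) FilterMinCount(min int) {
	for s, n := range m {
		if n < min {
			delete(m, s)
		}
	}
}

func main() {
	ix := NewIndex(4)
	a := &Seq{ID: "a", Bases: "ACGTTGCA"}
	b := &Seq{ID: "b", Bases: "TTTTACGT"}
	ix.Push(a, 0)
	ix.Push(b, 0)
	m := ix.Query(&Seq{ID: "q", Bases: "ACGTTG"}) // a shares 3 k-mers, b shares 1
	m.FilterMinCount(2)
	for s, n := range m {
		fmt.Println(s.ID, n) // prints "a 3"
	}
}
```

Because both strands collapse to the same canonical key, a query matches an indexed sequence regardless of the orientation in which either was read.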

Use Cases

  • Read clustering (e.g., OTU/ASV picking).
  • Error correction via k-mer abundance.
  • Sequence similarity search or contamination screening.

The implementation leverages low-level bit operations for performance and memory efficiency, both of which matter most in large-scale NGS data processing.
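
To make the bit-level approach concrete, here is a hedged sketch of a rolling 2-bit encoder that maintains the forward and reverse-complement codes together and optionally masks the central base. The function names, the A=0, C=1, G=2, T=3 coding, and the choice to restart the window after a non-ACGT character are assumptions of this example, not details confirmed by the package.

```go
package main

import "fmt"

// code2bit maps a base to a 2-bit code (A=0, C=1, G=2, T=3); ok is false
// for any other character.
func code2bit(b byte) (uint64, bool) {
	switch b {
	case 'A', 'a':
		return 0, true
	case 'C', 'c':
		return 1, true
	case 'G', 'g':
		return 2, true
	case 'T', 't':
		return 3, true
	}
	return 0, false
}

// canonicalKmers slides a window of k bases (1 <= k <= 32) over seq and
// returns, for each window, the smaller of the forward and
// reverse-complement codes. With sparse set (k should be odd), the central
// base is zeroed on both strands before the comparison.
func canonicalKmers(seq string, k uint, sparse bool) []uint64 {
	mask := ^uint64(0) >> (64 - 2*k) // keeps the low 2k bits
	keep := ^uint64(0)
	if sparse {
		keep = ^(uint64(3) << (2 * (k / 2))) // clears the central base
	}
	var fwd, rev uint64
	var filled uint
	var out []uint64
	for i := 0; i < len(seq); i++ {
		c, ok := code2bit(seq[i])
		if !ok {
			// Sketch-level choice: restart the window after an ambiguous base.
			fwd, rev, filled = 0, 0, 0
			continue
		}
		fwd = ((fwd << 2) | c) & mask                 // append base on the right
		rev = (rev >> 2) | ((3 - c) << (2 * (k - 1))) // prepend its complement
		if filled < k {
			filled++
		}
		if filled == k {
			f, r := fwd&keep, rev&keep
			if r < f {
				f = r
			}
			out = append(out, f)
		}
	}
	return out
}

func main() {
	fmt.Println(canonicalKmers("ACGTT", 4, false)) // prints "[27 6]"
	// Two 5-mers differing only at the central base share one sparse code.
	fmt.Println(canonicalKmers("ACGTT", 5, true), canonicalKmers("ACATT", 5, true))
}
```

With this coding, numeric order on the integers matches lexicographic order on the bases, so taking the smaller integer yields the canonical form. The central position of an odd-length k-mer is the same on both strands, which is why a single mask covers the forward and reverse-complement codes alike.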