mirror of https://github.com/metabarcoding/obitools4.git synced 2026-04-30 12:00:39 +00:00

Eric Coissac 8c7017a99d ⬆️ version bump to v4.5

- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)

2026-04-13 13:34:53 +02:00


Semantic Description of `KmerMap` Functionality

The provided Go package implements a k-mer indexing and matching system for biological sequences (BioSequence). It supports both standard and sparse k-mer representations (where one position is masked, typically for handling ambiguous bases or symmetry).

Core Data Structures

KmerMap[T]: A generic hash map associating each normalized k-mer (of type T, e.g. a uint64 holding 2 bits per base) with the list of sequences that contain it.
KmerMatch: A map from sequence pointers to k-mer match counts, used for query results.
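As a rough illustration of how these two structures could fit together, here is a minimal sketch. The `Seq` type is a hypothetical stand-in for BioSequence, and the field names and method bodies are assumptions made for the example. Only the names `KmerMap`, `KmerMatch`, `FilterMinCount`, `Max` and `Sequences` come from the description itself.

```go
package main

import "fmt"

// Seq is a hypothetical stand-in for the package's BioSequence type.
type Seq struct {
	ID   string
	Data string
}

// KmerMap associates an encoded canonical k-mer with the sequences
// that contain it. Field names are assumptions for this sketch.
type KmerMap[T comparable] struct {
	K     int
	Index map[T][]*Seq
}

// KmerMatch counts, per indexed sequence, the k-mers shared with a query.
type KmerMatch map[*Seq]int

// FilterMinCount drops every match with fewer than minCount shared k-mers.
func (m KmerMatch) FilterMinCount(minCount int) {
	for s, n := range m {
		if n < minCount {
			delete(m, s) // deleting during range is safe in Go
		}
	}
}

// Max returns the sequence with the highest count, or nil if m is empty.
func (m KmerMatch) Max() *Seq {
	var best *Seq
	bestN := -1
	for s, n := range m {
		if n > bestN {
			best, bestN = s, n
		}
	}
	return best
}

// Sequences returns all matched sequences, in unspecified order.
func (m KmerMatch) Sequences() []*Seq {
	out := make([]*Seq, 0, len(m))
	for s := range m {
		out = append(out, s)
	}
	return out
}

func main() {
	a, b := &Seq{ID: "a"}, &Seq{ID: "b"}
	m := KmerMatch{a: 5, b: 1}
	m.FilterMinCount(2) // removes b, which shares only one k-mer
	fmt.Println(len(m.Sequences()), m.Max().ID)
}
```

Keying the match table by sequence pointer keeps a query result cheap to build and to filter, at the cost of an unspecified iteration order.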

Key Features

K-mer Normalization
- Handles both forward and reverse-complement k-mers.
- Selects the lexicographically smaller of the two as the canonical form.
- Supports sparse k-mers: when SparseAt ≥ 0, the central base is ignored (shown as # in the string view), and k-mers are normalized symmetrically.

Efficient Indexing (Push)
- Builds an index of all canonical k-mers from a set of sequences.
- Optionally caps the number of sequences stored per k-mer (maxocc), which is useful for filtering out high-frequency k-mers (e.g., contaminants).

Querying (Query)
- Given a query sequence, returns all sequences in the index that share k-mers with it.
- Counts, for each sequence, the number of shared k-mers (used for similarity estimation or clustering).

Result Utilities (KmerMatch)
- FilterMinCount: Removes low-count matches.
- Max(), Sequences(): Retrieve the best match or all matched sequences.

Construction (NewKmerMap)
- Automatically adjusts the k-mer size: odd for sparse mode, even otherwise.
- Precomputes bitmasks for efficient k-mer manipulation (masking, shifting).
- Integrates a progress bar during indexing.
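The indexing and querying steps can be sketched as follows. This is a simplified illustration, not the package's code: it uses plain strings as k-mers instead of 2-bit encoded integers, the `Index` type and its fields are hypothetical, and treating `MaxOcc == 0` as "no limit" and counting each distinct k-mer once per sequence are assumptions.

```go
package main

import "fmt"

// Seq is a hypothetical stand-in for BioSequence.
type Seq struct {
	ID   string
	Data string
}

// revComp returns the reverse complement of an upper-case ACGT string.
func revComp(s string) string {
	comp := map[byte]byte{'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		out[len(s)-1-i] = comp[s[i]]
	}
	return string(out)
}

// canonical picks the lexicographically smaller of a k-mer and its
// reverse complement.
func canonical(kmer string) string {
	if rc := revComp(kmer); rc < kmer {
		return rc
	}
	return kmer
}

// kmerSet returns the distinct canonical k-mers of s.
func kmerSet(s string, k int) map[string]struct{} {
	set := make(map[string]struct{})
	for i := 0; i+k <= len(s); i++ {
		set[canonical(s[i:i+k])] = struct{}{}
	}
	return set
}

// Index is a hypothetical, string-keyed analogue of KmerMap.
type Index struct {
	K      int
	MaxOcc int // assumption: 0 means no per-k-mer limit
	Kmers  map[string][]*Seq
}

// Push records seq under each of its canonical k-mers, skipping
// k-mers that already hold MaxOcc sequences.
func (ix *Index) Push(seq *Seq) {
	for km := range kmerSet(seq.Data, ix.K) {
		if ix.MaxOcc > 0 && len(ix.Kmers[km]) >= ix.MaxOcc {
			continue
		}
		ix.Kmers[km] = append(ix.Kmers[km], seq)
	}
}

// Query counts, per indexed sequence, the canonical k-mers shared with q.
func (ix *Index) Query(q string) map[*Seq]int {
	match := make(map[*Seq]int)
	for km := range kmerSet(q, ix.K) {
		for _, s := range ix.Kmers[km] {
			match[s]++
		}
	}
	return match
}

func main() {
	ix := &Index{K: 4, Kmers: make(map[string][]*Seq)}
	ix.Push(&Seq{ID: "s1", Data: "ACGTTGCA"})
	ix.Push(&Seq{ID: "s2", Data: "GGGGCCCC"})
	// The query is the reverse complement of s1, so it matches s1 only.
	for s, n := range ix.Query("TGCAACGT") {
		fmt.Println(s.ID, n)
	}
}
```

Because both strands map to the same canonical k-mer, a query finds a sequence regardless of the orientation in which it was indexed.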

Use Cases

Read clustering (e.g., OTU/ASV picking).
Error correction via k-mer abundance.
Sequence similarity search or contamination screening.

The implementation leverages low-level bit operations for performance and memory efficiency, especially critical in large-scale NGS data processing.
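To make the bit-level idea concrete, the sketch below packs a k-mer into a uint64 at 2 bits per base, derives its canonical form, and masks the central base for the sparse variant. The function names and the A=0, C=1, G=2, T=3 encoding are assumptions chosen for the example; the actual package may encode and mask differently.

```go
package main

import "fmt"

// encode packs an upper-case ACGT string (length <= 32) into a uint64,
// 2 bits per base, assuming A=0, C=1, G=2, T=3.
func encode(kmer string) uint64 {
	var code uint64
	for i := 0; i < len(kmer); i++ {
		var b uint64
		switch kmer[i] {
		case 'C':
			b = 1
		case 'G':
			b = 2
		case 'T':
			b = 3
		}
		code = code<<2 | b
	}
	return code
}

// revComp reverses the base order and complements each base.
// With this encoding, the complement of a base b is b XOR 3.
func revComp(code uint64, k int) uint64 {
	var rc uint64
	for i := 0; i < k; i++ {
		rc = rc<<2 | ((code & 3) ^ 3)
		code >>= 2
	}
	return rc
}

// canonical returns the smaller of a k-mer and its reverse complement.
// For a fixed k, numeric order equals lexicographic order here.
func canonical(code uint64, k int) uint64 {
	if rc := revComp(code, k); rc < code {
		return rc
	}
	return code
}

// canonicalSparse ignores the central base of an odd-length k-mer by
// masking it on both strands before picking the smaller value.
func canonicalSparse(code uint64, k int) uint64 {
	mask := ^(uint64(3) << (2 * uint(k/2)))
	fwd := code & mask
	rc := revComp(code, k) & mask
	if rc < fwd {
		return rc
	}
	return fwd
}

func main() {
	fmt.Println(canonical(encode("CGTT"), 4) == encode("AACG"))
	// These two 5-mers differ only at the central base.
	fmt.Println(canonicalSparse(encode("ACGTA"), 5) == canonicalSparse(encode("ACTTA"), 5))
}
```

The central position of an odd-length k-mer stays central under reverse complementation, which is why a single precomputed mask serves both strands.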
