Files
obitools4/autodoc/docmd/pkg/obikmer/kmer_match.md
T
Eric Coissac 8c7017a99d ⬆️ version bump to v4.5
- Update obioptions.Version from "Release 4.4.29" to "/v/ Release v5"
- Update version.txt from 4.29 → .30
(automated by Makefile)
2026-04-13 13:34:53 +02:00

15 lines
1.5 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Semantic Description of `obikmer` Package
The `obikmer` package implements efficient k-mer matching between query sequences and an indexed reference using **canonical k-mers** partitioned by minimizer-based hashing.
- `QueryEntry` represents a single canonical kmer with its origin: sequence index and 1-based position.
- `PreparedQueries` groups queries into sorted buckets per partition, enabling batched and parallelized matching.
- `PrepareQueries` scans input sequences using *super-kmers* (with window size `m`) to compute minimizers, assigns each kmer to a partition via modulo hashing, and sorts buckets by kmer value.
- `MergeQueries` combines two sets of prepared queries across batches using a merge-sort strategy, correctly offsetting sequence indices to preserve global ordering.
- `MatchBatch` performs parallel matching per partition: each goroutine runs a **merge-scan** between sorted queries and the corresponding KDI (K-mer Disk Index) stream.
- Efficient seeking is used only when beneficial, avoiding costly syscalls for small skips.
- Matches are recorded with thread-safe per-sequence mutexes; final positions within each sequence are sorted post-match.
- `matchPartition` implements the core merge-scan: it opens a KDI reader, seeks to relevant regions of the index, and walks both query list and kmer stream in lockstep.
The design supports **large-scale batch processing**, incremental query accumulation, and high-performance parallel lookup—ideal for metagenomic or biodiversity sequencing workflows.