Eric Coissac a1499e6153 feat: add kmer filtering and refactor layer iteration

Introduce a `passes_all` utility to validate kmer rows against multiple filters using short-circuit logic. Integrate a `filters` parameter into the iteration functions to conditionally emit kmers based on filter results. Extract repetitive layer traversal and filtering into an `iter_src_layers` helper, refactoring Pass 1 and Pass 2 to eliminate duplication. Additionally, add a debug conditional to the dump output to include partition and layer metadata alongside kmer sequences.

2026-06-04 21:08:15 +02:00

.kilo/plans

harden obiskio error handling with explicit variants and bounds checking

2026-05-09 18:30:19 +08:00

.zed

refactor: implement RoutableSuperKmer and update k-mer indexing pipeline

2026-05-01 09:33:26 +02:00

data-stress

first implementation but far to be optimal

2026-04-19 12:17:16 +02:00

doc

docs: clarify MPHF indexing, storage layout, and distance traits

2026-05-17 15:59:10 +08:00

docmd

feat: add metadata-driven k-mer filtering for rebuild command

2026-06-04 21:01:58 +02:00

memory

refactor: restructure k-mer partitioning pipeline for memory efficiency

2026-05-17 16:08:47 +08:00

scripts

feat: implement persistent layered index and chunked binary format

2026-05-09 17:38:29 +08:00

src

feat: add kmer filtering and refactor layer iteration

2026-06-04 21:08:15 +02:00

.DS_Store

feat: introduce genome metadata tracking and CSV export

2026-05-22 09:35:20 +02:00

.gitignore

refactor: implement RoutableSuperKmer and update k-mer indexing pipeline

2026-05-01 09:33:26 +02:00

Cargo.lock

first implementation but far to be optimal

2026-04-19 12:17:16 +02:00

CLAUDE.md

refactor: optimize MPHF construction and update legacy guidelines

2026-05-26 10:54:59 +02:00

debug.log

Refactor: Extract utility function for string reversal

2026-04-30 06:58:46 +02:00

kmer_spectrum_raw.json

refactor: implement RoutableSuperKmer and update k-mer indexing pipeline

2026-05-01 09:33:26 +02:00

Makefile

Refactor: Simplify user authentication flow

2026-04-28 08:38:26 +02:00

mkdocs.yml

feat: add metadata-driven k-mer filtering for rebuild command

2026-06-04 21:01:58 +02:00

obikmer 2026-05-20 12.53 profile.json.gz

feat(obikmer): add index subcommand for kmer counting pipeline

2026-05-20 18:25:12 +02:00

profile.json.gz

Refactor: Simplify user authentication flow

2026-04-28 08:38:26 +02:00

README.md

docs: Update README to reflect new indexing workflow

2026-06-01 13:56:48 +02:00

test.sk.fasta

first implementation but far to be optimal

2026-04-19 12:17:16 +02:00

TODO.md

refactor: aggregate query results at sequence level

2026-05-30 07:18:54 +02:00

xxx.HTML

first implementation but far to be optimal

2026-04-19 12:17:16 +02:00

README.md

obikmer

obikmer is a Rust toolkit for indexing, querying, and comparing DNA sequences represented as sets of k-mers. It targets individual genome datasets (tens of Gbases) with maximum efficiency in computation, memory, and disk usage.

Key principles

Compact k-mer encoding. Each k-mer is stored in a u64 at 2 bits/base. k is odd, k ∈ [11, 31], fixed at runtime. The canonical form min(kmer, revcomp(kmer)) halves the effective space by collapsing both strands.
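A minimal sketch of the 2-bit packing and canonical form described above. The A/C/G/T to 0/1/2/3 mapping, the case-insensitive parsing, and the function names are assumptions made for illustration; they are not taken from the obikmer sources.

```rust
/// Encode one nucleotide on 2 bits. ASSUMPTION: A=0, C=1, G=2, T=3,
/// which makes the complement of a code `c` equal to `3 - c`.
fn encode_base(b: u8) -> Option<u64> {
    match b.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None, // any non-ACGT byte cannot be encoded
    }
}

/// Pack a k-mer into the low 2k bits of a u64, first base in the highest bits.
/// Returns None for an empty slice, more than 32 bases, or a non-ACGT byte.
fn encode_kmer(seq: &[u8]) -> Option<u64> {
    if seq.is_empty() || seq.len() > 32 {
        return None;
    }
    let mut packed = 0u64;
    for &b in seq {
        packed = (packed << 2) | encode_base(b)?;
    }
    Some(packed)
}

/// Reverse complement of a packed k-mer of length `k`.
fn revcomp(kmer: u64, k: usize) -> u64 {
    let mut fwd = kmer;
    let mut rc = 0u64;
    for _ in 0..k {
        rc = (rc << 2) | (3 - (fwd & 3)); // complement the last base, append it
        fwd >>= 2;
    }
    rc
}

/// Canonical form: the smaller of the k-mer and its reverse complement.
fn canonical(kmer: u64, k: usize) -> u64 {
    kmer.min(revcomp(kmer, k))
}
```

With this mapping, `ACG` packs to 6 and its reverse complement `CGT` to 27, so both strands share the canonical value 6. An odd k also guarantees that no k-mer equals its own reverse complement, since the middle base would have to be its own complement.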

Superkmer-based partitioning. Sequences are decomposed into superkmers — maximal runs of k-mers sharing the same minimizer. Superkmers route naturally to partitions via the minimizer hash, enabling partition-parallel indexing and querying with no cross-partition communication.
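The decomposition can be illustrated with a deliberately simple sketch. It assumes the minimizer of a k-mer is its lexicographically smallest m-mer on the forward strand only; obikmer may well use a hash ordering or canonical m-mers instead, and the function names here are hypothetical.

```rust
/// Smallest m-mer (lexicographic order) inside one k-mer window.
/// ASSUMPTION: forward strand only, plain byte ordering.
fn minimizer(kmer: &[u8], m: usize) -> &[u8] {
    kmer.windows(m).min().expect("requires 1 <= m <= k")
}

/// Split `seq` into superkmers: maximal runs of consecutive k-mers that
/// share the same minimizer. Each superkmer is returned as a sub-slice.
fn superkmers(seq: &[u8], k: usize, m: usize) -> Vec<&[u8]> {
    let mut out = Vec::new();
    if m == 0 || m > k || seq.len() < k {
        return out;
    }
    let mut start = 0; // index of the first k-mer of the current run
    let mut current = minimizer(&seq[0..k], m);
    for i in 1..=seq.len() - k {
        let mini = minimizer(&seq[i..i + k], m);
        if mini != current {
            // k-mers start..=i-1 share `current`; the last one ends at i-1+k.
            out.push(&seq[start..i - 1 + k]);
            start = i;
            current = mini;
        }
    }
    out.push(&seq[start..]); // close the final run
    out
}
```

For `GTACGTTAAG` with k = 5 and m = 3, the first three k-mers all contain the minimizer `ACG`, so they form one superkmer `GTACGTT`; the remaining three k-mers each get a superkmer of their own. Consecutive superkmers overlap by k−1 bases, so no k-mer is lost.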

Layered MPHF index. Each partition holds a stack of layers. Each layer is a minimal perfect hash function (MPHF) over the k-mers of one input genome, paired with a per-genome presence/count matrix. Queries scatter k-mers to their partition, probe each layer in order, and aggregate results.
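The query path can be pictured with the following toy model. It is only a sketch under loud assumptions: a `HashMap` stands in for the MPHF (a real MPHF returns a slot for any input and needs separate evidence to reject absent k-mers), partitions are chosen by a plain modulo, and lookup stops at the first layer that holds the k-mer. None of these types or names come from the obikmer sources.

```rust
use std::collections::HashMap;

/// One layer: a k-mer -> slot map plus a slot x genome count matrix.
/// ASSUMPTION: `HashMap` is a stand-in for the MPHF and its evidence.
struct Layer {
    slot_of: HashMap<u64, usize>,
    counts: Vec<Vec<u32>>, // counts[slot][genome]
}

/// A partition holds a stack of layers, probed in order.
struct Partition {
    layers: Vec<Layer>,
}

/// Route a minimizer hash to a partition (hypothetical modulo scheme).
fn partition_of(minimizer_hash: u64, n_partitions: usize) -> usize {
    (minimizer_hash % n_partitions as u64) as usize
}

/// Probe each layer in order and return the per-genome counts from the
/// first layer that contains the k-mer, or None if no layer does.
fn lookup(partition: &Partition, kmer: u64) -> Option<&[u32]> {
    partition.layers.iter().find_map(|layer| {
        layer
            .slot_of
            .get(&kmer)
            .map(|&slot| layer.counts[slot].as_slice())
    })
}
```

Because a superkmer's k-mers all share one minimizer, a whole superkmer resolves to a single `partition_of` call, which is what lets partitions be processed independently.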

Approximate indexing (Findere). With -z Z, the index stores s-mers of size s = k − z + 1 instead of full k-mers. At query time, hits are first computed at size s; a per-genome sliding window of size z then aggregates z consecutive s-mer hits into one confirmed k-mer answer. With b evidence (fingerprint) bits per s-mer, this reduces the false-positive rate from 1/2^b per s-mer to 1/2^(b·z) per k-mer, at the cost of z−1 unconfirmed positions at each sequence break. The aggregation window spans the full query sequence, not individual superkmers, to avoid false negatives at superkmer boundaries.
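The window aggregation can be sketched as follows for a single genome. The input is assumed to be one boolean per s-mer position of a contiguous ACGT segment; the function names are hypothetical.

```rust
/// Size of the stored s-mers for a given k and z (s = k - z + 1).
fn smer_size(k: usize, z: usize) -> usize {
    k - z + 1
}

/// Turn per-position s-mer hits into per-position k-mer answers.
/// The k-mer starting at position i is confirmed only if the z s-mers
/// starting at i, i+1, ..., i+z-1 were all found.
fn confirm_kmers(smer_hits: &[bool], z: usize) -> Vec<bool> {
    if z == 0 || smer_hits.len() < z {
        return Vec::new(); // segment too short to hold a single k-mer
    }
    let mut out = Vec::with_capacity(smer_hits.len() - z + 1);
    let mut run = 0usize; // length of the current streak of consecutive hits
    for (i, &hit) in smer_hits.iter().enumerate() {
        run = if hit { run + 1 } else { 0 };
        if i + 1 >= z {
            // The window ending at i covers the k-mer starting at i + 1 - z.
            out.push(run >= z);
        }
    }
    out
}
```

A segment with n s-mer positions yields n − (z − 1) k-mer answers, which is where the z−1 unconfirmed positions per sequence break come from. A single missed s-mer invalidates every k-mer whose window contains it.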

Multi-genome. A single index can hold any number of genomes. Each k-mer slot carries a per-genome count or presence vector. Distance matrices, NJ/UPGMA trees, and classification are derived from these vectors without rebuilding the index.
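As an illustration of how distances follow from the per-genome vectors, here is a sketch using the standard definitions of Jaccard and Bray-Curtis distance. Each slice is assumed to be one genome's column over the same k-mer slots; returning 0.0 for two empty genomes is a convention chosen for this example, not necessarily what obikmer does.

```rust
/// Jaccard distance between two presence vectors: 1 - |A ∩ B| / |A ∪ B|.
fn jaccard_distance(a: &[bool], b: &[bool]) -> f64 {
    let mut inter = 0u64;
    let mut union = 0u64;
    for (&x, &y) in a.iter().zip(b) {
        if x && y {
            inter += 1;
        }
        if x || y {
            union += 1;
        }
    }
    if union == 0 {
        return 0.0; // ASSUMPTION: two empty sets are treated as identical
    }
    1.0 - inter as f64 / union as f64
}

/// Bray-Curtis distance between two count vectors:
/// 1 - 2 * sum(min(a_i, b_i)) / (sum(a_i) + sum(b_i)).
fn bray_curtis_distance(a: &[u32], b: &[u32]) -> f64 {
    let mut shared = 0u64;
    let mut total = 0u64;
    for (&x, &y) in a.iter().zip(b) {
        shared += u64::from(x.min(y));
        total += u64::from(x) + u64::from(y);
    }
    if total == 0 {
        return 0.0; // same convention as above
    }
    1.0 - 2.0 * shared as f64 / total as f64
}
```

Both functions only read the stored vectors, which is why a distance matrix or tree can be recomputed without touching the index structure itself.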

Input formats

Command              Formats accepted
───────────────────  ──────────────────────────────────────────────────────────────
index, superkmer     FASTA (.fa .fasta), FASTQ (.fq .fastq), GenBank flat file
                     (.gb .gbk .gbff), all optionally gzip-compressed.
                     Directories expanded recursively. Streaming stdin via -.
query                FASTA, FASTQ, optionally gzip-compressed. Stdin via -.

Non-ACGT characters act as hard breaks between k-mer segments in all formats.
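A small sketch of the hard-break rule: the sequence is cut at every non-ACGT byte, and segments shorter than k are dropped because they cannot hold a k-mer. Treating lowercase bases as valid is an assumption of this example, as is the function name.

```rust
/// Split `seq` at every non-ACGT byte and keep the segments long enough
/// to contain at least one k-mer.
/// ASSUMPTION: lowercase a/c/g/t are accepted like their uppercase forms.
fn kmer_segments(seq: &[u8], k: usize) -> Vec<&[u8]> {
    seq.split(|b| !matches!(b.to_ascii_uppercase(), b'A' | b'C' | b'G' | b'T'))
        .filter(|segment| !segment.is_empty() && segment.len() >= k)
        .collect()
}
```

For `ACGTNACGNNGGTTAAC` with k = 4, only `ACGT` and `GGTTAAC` survive: `ACG` is too short, and the empty segment between the two `N` bytes is discarded.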

Commands

Command    Role
─────────  ────────────────────────────────────────────────────────────────────
index      Build a genome index from sequence files.
           Runs scatter → dereplicate → count → layered MPHF.
           Resumes automatically if interrupted.
merge      Merge multiple independently built indexes into one.
rebuild    Filter and compact an existing index: apply count thresholds,
           drop layers, rewrite as a single-layer index.
reindex    Convert evidence in-place across all layers:
           exact (evidence.bin) ↔ approximate (fingerprint.bin).
           Does not touch the MPHF or unitigs.
query      Query an index with FASTA/FASTQ sequences.
           Annotates each sequence with per-genome k-mer match counts
           and optional per-position coverage vectors (--detail).
           Parallel over sequence chunks.
distance   Compute a pairwise Bray-Curtis or Jaccard distance matrix
           between all indexed genomes.
           Optionally outputs a Newick NJ or UPGMA tree.
annotate   Add or update genome metadata (taxonomy, etc.) from a CSV
           file; or dump the current metadata as CSV.
estimate   Dry-run: resolve and print approximate-index parameters
           (z, evidence bits b, FP rates) given any two of (b, z, fp).
           Does not touch any index.
dump       Dump all indexed k-mers as CSV with per-genome counts or
           presence flags.
superkmer  Extract superkmers from a sequence file and write to stdout.
           Diagnostic / pipeline use.
unitig     Dump the unitig sequences stored in a built index. Debug use.
utils      Miscellaneous utilities.
           --new-label NEW=OLD  renames a genome label in-place.
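The arithmetic behind estimate can be sketched from the false-positive formula given under Key principles (1/2^(b·z) per k-mer). This is only an illustration of that relation; the rounding choices and function names are assumptions, and the actual command may resolve parameters differently.

```rust
/// Per-k-mer false-positive rate for b evidence bits and window size z.
fn kmer_fp_rate(b: u32, z: u32) -> f64 {
    2f64.powi(-((b * z) as i32))
}

/// Smallest b such that 2^-(b*z) does not exceed the target rate `fp`.
/// ASSUMPTION: rounding up to the next whole bit.
fn bits_for(fp: f64, z: u32) -> u32 {
    (-fp.log2() / z as f64).ceil() as u32
}

/// Smallest z such that 2^-(b*z) does not exceed the target rate `fp`.
fn z_for(fp: f64, b: u32) -> u32 {
    (-fp.log2() / b as f64).ceil() as u32
}
```

For the Quick start values (z = 5, 8 evidence bits), `kmer_fp_rate(8, 5)` is 2^−40, compared with 2^−8 for a single s-mer lookup.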

Quick start

# Build an exact index for each genome independently
obikmer index --kmer-size 31 --label genome_a genome_a.fa --output index_a/
obikmer index --kmer-size 31 --label genome_b genome_b.fa --output index_b/

# Merge into a single multi-genome index
obikmer merge --output index/ index_a/ index_b/

# Convert to approximate index (z=5, 8-bit fingerprints)
obikmer reindex --approx -z 5 --evidence-bits 8 index/

# Query reads
obikmer query index/ reads.fq.gz > annotated.fa

# Pairwise distances
obikmer distance index/ > distances.tsv

Parameter constraints

Parameter               Constraint
──────────────────────  ────────────────
k  (--kmer-size)        odd, 11 ≤ k ≤ 31
m  (--minimizer-size)   odd, 3 ≤ m ≤ k−1
z  (-z, --approx only)  1 ≤ z ≤ k−1
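The constraints in the table translate directly into a validation routine. The sketch below is a hypothetical helper, not obikmer's actual argument checking; `z` is optional because it only applies to approximate indexes.

```rust
/// Check (k, m, z) against the documented constraints.
/// Returns a human-readable message for the first violated constraint.
fn check_params(k: usize, m: usize, z: Option<usize>) -> Result<(), String> {
    if k % 2 == 0 || !(11..=31).contains(&k) {
        return Err(format!("k must be odd with 11 <= k <= 31, got {}", k));
    }
    if m % 2 == 0 || m < 3 || m > k - 1 {
        return Err(format!("m must be odd with 3 <= m <= k-1, got {}", m));
    }
    if let Some(z) = z {
        if z < 1 || z > k - 1 {
            return Err(format!("z must satisfy 1 <= z <= k-1, got {}", z));
        }
    }
    Ok(())
}
```

For example, `check_params(31, 11, Some(5))` succeeds, while an even k such as 30 or a z equal to k is rejected.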

Documentation

Extended architecture and implementation notes are in docmd/. Build with make doc (requires Python + MkDocs Material).

Languages

Rust 93.1%

HTML 3.8%

TeX 1.5%

Python 0.7%

Shell 0.6%

Other 0.2%
