
Eric Coissac 7dd8db1409 docs: document conservative rounding strategy for filtering thresholds

Specifies that minimum bounds use ceiling and maximum bounds use floor to enforce strictness. Clarifies that the implementation avoids explicit rounding by directly comparing integer counts against floating-point fractions, which is mathematically equivalent.

2026-06-09 10:26:21 +02:00

.kilo/plans

harden obiskio error handling with explicit variants and bounds checking

2026-05-09 18:30:19 +08:00

.zed

refactor: implement RoutableSuperKmer and update k-mer indexing pipeline

2026-05-01 09:33:26 +02:00

data-stress

first implementation but far to be optimal

2026-04-19 12:17:16 +02:00

doc

docs: expand kmer indexing, filtering, and merging documentation

2026-06-04 22:59:41 +02:00

docmd

docs: document conservative rounding strategy for filtering thresholds

2026-06-09 10:26:21 +02:00

memory

refactor: restructure k-mer partitioning pipeline for memory efficiency

2026-05-17 16:08:47 +08:00

scripts

feat: implement persistent layered index and chunked binary format

2026-05-09 17:38:29 +08:00

src

refactor: centralize k-mer filtering logic and add validation

2026-06-09 10:22:25 +02:00

.DS_Store

feat: introduce genome metadata tracking and CSV export

2026-05-22 09:35:20 +02:00

.gitignore

refactor: implement RoutableSuperKmer and update k-mer indexing pipeline

2026-05-01 09:33:26 +02:00

Cargo.lock

first implementation but far to be optimal

2026-04-19 12:17:16 +02:00

CLAUDE.md

refactor: optimize MPHF construction and update legacy guidelines

2026-05-26 10:54:59 +02:00

debug.log

Refactor: Extract utility function for string reversal

2026-04-30 06:58:46 +02:00

kmer_spectrum_raw.json

refactor: implement RoutableSuperKmer and update k-mer indexing pipeline

2026-05-01 09:33:26 +02:00

Makefile

Refactor: Simplify user authentication flow

2026-04-28 08:38:26 +02:00

mkdocs.yml

refactor: centralize k-mer filtering logic and add validation

2026-06-09 10:22:25 +02:00

obikmer 2026-05-20 12.53 profile.json.gz

feat(obikmer): add index subcommand for kmer counting pipeline

2026-05-20 18:25:12 +02:00

profile.json.gz

Refactor: Simplify user authentication flow

2026-04-28 08:38:26 +02:00

README.md

docs: Update README to reflect new indexing workflow

2026-06-01 13:56:48 +02:00

test.sk.fasta

first implementation but far to be optimal

2026-04-19 12:17:16 +02:00

TODO.md

refactor: aggregate query results at sequence level

2026-05-30 07:18:54 +02:00

xxx.HTML

first implementation but far to be optimal

2026-04-19 12:17:16 +02:00

README.md

obikmer

obikmer is a Rust toolkit for indexing, querying, and comparing DNA sequences represented as sets of k-mers. It targets individual genome datasets of tens of Gbases and is designed to keep computation time, memory, and disk usage low.

Key principles

Compact k-mer encoding. Each k-mer is stored in a u64 at 2 bits/base. k is odd, which guarantees that no k-mer is its own reverse complement; k ∈ [11, 31] and is fixed at runtime. The canonical form min(kmer, revcomp(kmer)) halves the effective space by collapsing both strands.
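The encoding can be illustrated with a minimal sketch. The base-to-bits mapping (A=0, C=1, G=2, T=3) and the function names `encode_kmer`, `revcomp`, and `canonical` are assumptions made for this example, not necessarily what obikmer uses internally.

```rust
/// Pack a DNA k-mer into a u64 at 2 bits per base.
/// Returns None for an empty input, more than 32 bases, or a non-ACGT byte.
/// Assumed mapping for this sketch: A=0, C=1, G=2, T=3.
fn encode_kmer(seq: &[u8]) -> Option<u64> {
    if seq.is_empty() || seq.len() > 32 {
        return None;
    }
    let mut code = 0u64;
    for &b in seq {
        let bits = match b.to_ascii_uppercase() {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        code = (code << 2) | bits;
    }
    Some(code)
}

/// Reverse complement of a k-mer stored in the low 2k bits.
fn revcomp(kmer: u64, k: usize) -> u64 {
    let mut fwd = kmer;
    let mut rc = 0u64;
    for _ in 0..k {
        // With the assumed mapping, the complement of base x is 3 - x.
        rc = (rc << 2) | (3 - (fwd & 3));
        fwd >>= 2;
    }
    rc
}

/// Canonical form: the smaller of the k-mer and its reverse complement.
fn canonical(kmer: u64, k: usize) -> u64 {
    kmer.min(revcomp(kmer, k))
}
```

Under this mapping, a k-mer and its reverse complement always reduce to the same canonical value, so either strand of a read reaches the same index slot.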

Superkmer-based partitioning. Sequences are decomposed into superkmers — maximal runs of k-mers sharing the same minimizer. Superkmers route naturally to partitions via the minimizer hash, enabling partition-parallel indexing and querying with no cross-partition communication.
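A minimal sketch of the decomposition follows. It assumes the minimizer is the lexicographically smallest m-mer of each k-mer and uses the standard library hasher for routing; the actual minimizer ordering and hash in obikmer may differ, and the names `minimizer`, `superkmers`, and `partition_of` are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Smallest m-mer of a k-mer in lexicographic order (assumed ordering).
fn minimizer(kmer: &[u8], m: usize) -> &[u8] {
    kmer.windows(m).min().expect("k must be at least m")
}

/// Split `seq` into superkmers: maximal runs of consecutive k-mers that
/// share the same minimizer. Returns (superkmer, minimizer) pairs.
fn superkmers(seq: &[u8], k: usize, m: usize) -> Vec<(Vec<u8>, Vec<u8>)> {
    let mut out = Vec::new();
    if m == 0 || m > k || seq.len() < k {
        return out;
    }
    let mut start = 0; // position of the first k-mer in the current run
    let mut current = minimizer(&seq[0..k], m);
    for i in 1..=(seq.len() - k) {
        let mz = minimizer(&seq[i..i + k], m);
        if mz != current {
            // The run holds k-mers start..=i-1, i.e. bases start..i-1+k.
            out.push((seq[start..i - 1 + k].to_vec(), current.to_vec()));
            start = i;
            current = mz;
        }
    }
    out.push((seq[start..].to_vec(), current.to_vec()));
    out
}

/// Route a superkmer to a partition from its minimizer alone.
fn partition_of(mz: &[u8], n_partitions: u64) -> u64 {
    let mut h = DefaultHasher::new();
    mz.hash(&mut h);
    h.finish() % n_partitions
}
```

Because every k-mer inside a superkmer has the same minimizer, all of them land in the same partition, which is what removes the need for cross-partition communication.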

Layered MPHF index. Each partition holds a stack of layers. Each layer is a minimal perfect hash function (MPHF) over the k-mers of one input genome, paired with a per-genome presence/count matrix. Queries scatter k-mers to their partition, probe each layer in order, and aggregate results.
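The probe-and-aggregate step can be sketched as follows. This is only an illustration: a `HashMap` stands in for the MPHF and its evidence check, the `Layer` type and `query_kmer` function are hypothetical, and summing counts across layers is an assumed aggregation rule.

```rust
use std::collections::HashMap;

/// One layer of a partition. The HashMap is a stand-in for an MPHF plus
/// its evidence check; `counts[slot][genome]` is the per-genome matrix.
struct Layer {
    slot_of: HashMap<u64, usize>,
    counts: Vec<Vec<u32>>,
}

/// Probe each layer in order and sum the per-genome counts of every hit.
/// A k-mer absent from all layers yields an all-zero vector.
fn query_kmer(layers: &[Layer], kmer: u64, n_genomes: usize) -> Vec<u32> {
    let mut total = vec![0u32; n_genomes];
    for layer in layers {
        if let Some(&slot) = layer.slot_of.get(&kmer) {
            for (t, c) in total.iter_mut().zip(&layer.counts[slot]) {
                *t += *c;
            }
        }
    }
    total
}
```

A real MPHF maps any input to some slot, including k-mers that were never indexed, which is why each slot needs stored evidence (the exact k-mer or a fingerprint) to reject foreign keys.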

Approximate indexing (Findere). With -z Z, the index stores k-mers of size s = k − z + 1 instead of k. At query time, results are produced at size s, then a per-genome sliding window of size z aggregates z consecutive s-mer hits into one confirmed k-mer answer. This reduces the false-positive rate from 1/2^b per s-mer to 1/2^(b·z) per k-mer, at the cost of z−1 unconfirmed positions at each sequence break. The aggregation window spans the full query sequence, not individual superkmers, to avoid false negatives at superkmer boundaries.
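As a worked illustration, with k = 31 and z = 5 the index stores s-mers of size s = 31 − 5 + 1 = 27, and each k-mer is confirmed only if its 5 overlapping s-mers all hit. The sketch below shows the sliding-window aggregation for a single genome and the resulting false-positive rate; the function names are hypothetical, and the per-position hit vector is assumed to come from the s-mer lookups.

```rust
/// `hits[i]` is true when the s-mer starting at position i was found.
/// Returns one flag per k-mer position: a k-mer is confirmed only when
/// all z consecutive s-mers it contains were found.
fn confirm_kmers(hits: &[bool], z: usize) -> Vec<bool> {
    if z == 0 || hits.len() < z {
        return Vec::new();
    }
    let mut out = Vec::with_capacity(hits.len() - z + 1);
    let mut run = 0usize; // length of the current run of consecutive hits
    for (i, &h) in hits.iter().enumerate() {
        run = if h { run + 1 } else { 0 };
        if i + 1 >= z {
            // The window ending at i covers s-mers i+1-z ..= i.
            out.push(run >= z);
        }
    }
    out
}

/// False-positive rate per k-mer, 1/2^(b*z), for b evidence bits.
fn kmer_fp_rate(b: u32, z: u32) -> f64 {
    0.5f64.powi((b * z) as i32)
}
```

A window of n s-mer positions yields n − z + 1 k-mer answers, which is where the z − 1 unconfirmed positions at each sequence break come from.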

Multi-genome. A single index can hold any number of genomes. Each k-mer slot carries a per-genome count or presence vector. Distance matrices, NJ/UPGMA trees, and classification are derived from these vectors without rebuilding the index.
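To show how distances could be derived from these vectors, here is a sketch using the standard definitions of Jaccard distance (on presence) and Bray-Curtis dissimilarity (on counts). Each slice holds one genome's count for every k-mer slot; the layout and function names are assumptions for the example.

```rust
/// Jaccard distance: 1 - |A ∩ B| / |A ∪ B|, treating count > 0 as present.
fn jaccard_distance(a: &[u32], b: &[u32]) -> f64 {
    let (mut inter, mut uni) = (0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        if x > 0 && y > 0 {
            inter += 1;
        }
        if x > 0 || y > 0 {
            uni += 1;
        }
    }
    if uni == 0 { 0.0 } else { 1.0 - inter as f64 / uni as f64 }
}

/// Bray-Curtis dissimilarity: 1 - 2 * sum(min(a, b)) / (sum(a) + sum(b)).
fn bray_curtis(a: &[u32], b: &[u32]) -> f64 {
    let (mut shared, mut total) = (0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        shared += x.min(y) as u64;
        total += x as u64 + y as u64;
    }
    if total == 0 { 0.0 } else { 1.0 - 2.0 * shared as f64 / total as f64 }
}
```

Both functions read only the stored vectors, which is consistent with computing distance matrices without rebuilding the index.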

Input formats

Command              Formats accepted
───────────────────  ──────────────────────────────────────────────────────────────
index, superkmer     FASTA (.fa .fasta), FASTQ (.fq .fastq), GenBank flat file
                     (.gb .gbk .gbff), all optionally gzip-compressed.
                     Directories expanded recursively. Streaming stdin via -.
query                FASTA, FASTQ, optionally gzip-compressed. Stdin via -.

Non-ACGT characters act as hard breaks between k-mer segments in all formats.
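The hard-break rule can be sketched as a split on any non-ACGT byte, keeping only the segments long enough to contain at least one k-mer. The function name `kmer_segments` is hypothetical, and accepting lowercase bases is an assumption of this example.

```rust
/// Split a sequence at every non-ACGT byte and keep the segments that
/// can hold at least one k-mer. Lowercase acgt is accepted (assumption).
fn kmer_segments(seq: &[u8], k: usize) -> Vec<&[u8]> {
    seq.split(|b| !matches!(b.to_ascii_uppercase(), b'A' | b'C' | b'G' | b'T'))
        .filter(|seg| seg.len() >= k)
        .collect()
}
```

No k-mer spans a break, so an ambiguous base such as N removes every k-mer that would have overlapped it.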

Commands

Command    Role
─────────  ────────────────────────────────────────────────────────────────────
index      Build a genome index from sequence files.
           Runs scatter → dereplicate → count → layered MPHF.
           Resumes automatically if interrupted.
merge      Merge multiple independently built indexes into one.
rebuild    Filter and compact an existing index: apply count thresholds,
           drop layers, rewrite as a single-layer index.
reindex    Convert evidence in-place across all layers:
           exact (evidence.bin) ↔ approximate (fingerprint.bin).
           Does not touch the MPHF or unitigs.
query      Query an index with FASTA/FASTQ sequences.
           Annotates each sequence with per-genome k-mer match counts
           and optional per-position coverage vectors (--detail).
           Parallel over sequence chunks.
distance   Compute a pairwise Bray-Curtis or Jaccard distance matrix
           between all indexed genomes.
           Optionally outputs a Newick NJ or UPGMA tree.
annotate   Add or update genome metadata (taxonomy, etc.) from a CSV
           file; or dump the current metadata as CSV.
estimate   Dry-run: resolve and print approximate-index parameters
           (z, evidence bits b, FP rates) given any two of (b, z, fp).
           Does not touch any index.
dump       Dump all indexed k-mers as CSV with per-genome counts or
           presence flags.
superkmer  Extract superkmers from a sequence file and write to stdout.
           Diagnostic / pipeline use.
unitig     Dump the unitig sequences stored in a built index. Debug use.
utils      Miscellaneous utilities.
           --new-label NEW=OLD  renames a genome label in-place.

Quick start

# Build an exact index for each genome independently
obikmer index --kmer-size 31 --label genome_a genome_a.fa --output index_a/
obikmer index --kmer-size 31 --label genome_b genome_b.fa --output index_b/

# Merge into a single multi-genome index
obikmer merge --output index/ index_a/ index_b/

# Convert to approximate index (z=5, 8-bit fingerprints)
obikmer reindex --approx -z 5 --evidence-bits 8 index/

# Query reads
obikmer query index/ reads.fq.gz > annotated.fa

# Pairwise distances
obikmer distance index/ > distances.tsv

Parameter constraints

Parameter              Constraint
─────────────────────  ──────────────
k  (--kmer-size)       odd, 11 ≤ k ≤ 31
m  (--minimizer-size)  odd, 3 ≤ m ≤ k−1
z  (-z, --approx only) 1 ≤ z ≤ k−1
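
The constraints in the table can be expressed as a small validation routine. This is a sketch only: the function name `validate_params` and the error messages are hypothetical and do not reflect obikmer's actual validation code.

```rust
/// Check the documented constraints on k, m and, when given, z.
fn validate_params(k: usize, m: usize, z: Option<usize>) -> Result<(), String> {
    if k % 2 == 0 || !(11..=31).contains(&k) {
        return Err(format!("k must be odd with 11 <= k <= 31, got {k}"));
    }
    if m % 2 == 0 || m < 3 || m > k - 1 {
        return Err(format!("m must be odd with 3 <= m <= k-1, got {m}"));
    }
    if let Some(z) = z {
        if z < 1 || z > k - 1 {
            return Err(format!("z must satisfy 1 <= z <= k-1, got {z}"));
        }
    }
    Ok(())
}
```

For example, `validate_params(31, 11, Some(5))` succeeds, while an even k such as 30 is rejected.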

Documentation

Extended architecture and implementation notes are in docmd/. Build with make doc (requires Python + MkDocs Material).

Languages

Rust 93.1%

HTML 3.8%

TeX 1.5%

Python 0.7%

Shell 0.6%

Other 0.2%
