docs: clarify query pipeline, Findere trick, and input formats

Fix a stray prefix in the README heading and update documentation to reflect the query pipeline's operation on `s-mers` (`s = k - z + 1`) with post-partition z-window filtering. Clarify the Findere trick, including k-mer size reduction, consecutive match requirements, and false positive rate calculations. Additionally, expand input format documentation to cover supported file extensions, gzip compression, recursive directory handling, and `query` command specifications.
2026-05-30 15:54:13 +02:00
parent 708b0abf9b
commit 8a0b898b4b
4 changed files with 150 additions and 36 deletions
@@ -1 +1,107 @@
-toto
+# obikmer
+
+`obikmer` is a Rust toolkit for indexing, querying, and comparing DNA sequences
+represented as sets of k-mers. It targets individual genome datasets (tens of
+Gbases) and aims to minimize computation time, memory, and disk usage.
+
+## Key principles
+
+**Compact k-mer encoding.** Each k-mer is stored in a `u64` at 2 bits/base.
+k is odd, k ∈ [11, 31], fixed at runtime. The canonical form `min(kmer, revcomp(kmer))`
+halves the effective space by collapsing both strands.
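The encoding described above can be sketched as follows. This is a minimal illustration, not the crate's actual code: the base-to-code mapping (A=0, C=1, G=2, T=3) and the function names `encode_base`, `encode_kmer`, `revcomp`, and `canonical` are assumptions made for the example.

```rust
/// Map a nucleotide to a 2-bit code. The mapping A=0, C=1, G=2, T=3 is an
/// assumption of this sketch; any other byte yields None.
fn encode_base(b: u8) -> Option<u64> {
    match b.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Pack a k-mer (1 <= k <= 31) into the low 2k bits of a u64,
/// with the first base in the most significant position.
fn encode_kmer(seq: &[u8]) -> Option<u64> {
    if seq.is_empty() || seq.len() > 31 {
        return None;
    }
    let mut packed = 0u64;
    for &b in seq {
        packed = (packed << 2) | encode_base(b)?;
    }
    Some(packed)
}

/// Reverse complement of a packed k-mer. With the mapping above,
/// the complement of a 2-bit code c is 3 - c.
fn revcomp(kmer: u64, k: usize) -> u64 {
    let mut fwd = kmer;
    let mut rc = 0u64;
    for _ in 0..k {
        rc = (rc << 2) | (3 - (fwd & 3));
        fwd >>= 2;
    }
    rc
}

/// Canonical form: the smaller of the k-mer and its reverse complement.
fn canonical(kmer: u64, k: usize) -> u64 {
    kmer.min(revcomp(kmer, k))
}
```

With this mapping, `ACG` packs to 6 and its reverse complement `CGT` packs to 27, so both strands share the canonical value 6. Because k is odd, no k-mer can equal its own reverse complement: the middle base would have to be its own complement.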
+
+**Superkmer-based partitioning.** Sequences are decomposed into superkmers:
+maximal runs of consecutive k-mers sharing the same minimizer. Superkmers route naturally to
+partitions via the minimizer hash, enabling partition-parallel indexing and querying
+with no cross-partition communication.
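A rough sketch of superkmer decomposition, under stated assumptions: the minimizer of a k-mer is taken here as its lexicographically smallest canonical m-mer, and the partition is derived from the standard library's `DefaultHasher`. The real tool may order and hash minimizers differently, and the names `minimizer`, `superkmers`, and `partition_of` are hypothetical. This version recomputes each minimizer from scratch for clarity rather than speed.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Reverse complement of an ASCII ACGT slice (other bytes pass through unchanged).
fn revcomp(s: &[u8]) -> Vec<u8> {
    s.iter()
        .rev()
        .map(|&b| match b {
            b'A' => b'T',
            b'C' => b'G',
            b'G' => b'C',
            b'T' => b'A',
            other => other,
        })
        .collect()
}

/// Minimizer of one k-mer: its smallest canonical m-mer (lexicographic order,
/// an assumption of this sketch).
fn minimizer(kmer: &[u8], m: usize) -> Vec<u8> {
    kmer.windows(m)
        .map(|w| {
            let rc = revcomp(w);
            if rc.as_slice() < w { rc } else { w.to_vec() }
        })
        .min()
        .expect("k-mer must be at least m bases long")
}

/// Split an ACGT-only sequence into superkmers.
/// Each entry is (start, end, minimizer) with `end` exclusive, in bases.
fn superkmers(seq: &[u8], k: usize, m: usize) -> Vec<(usize, usize, Vec<u8>)> {
    let mut out = Vec::new();
    if seq.len() < k {
        return out;
    }
    let mut start = 0;
    let mut current = minimizer(&seq[0..k], m);
    for i in 1..=seq.len() - k {
        let mz = minimizer(&seq[i..i + k], m);
        if mz != current {
            // k-mers start..=i-1 share `current`; the last one ends at base i-1+k.
            out.push((start, i - 1 + k, current));
            start = i;
            current = mz;
        }
    }
    out.push((start, seq.len(), current));
    out
}

/// Route a superkmer to a partition from its minimizer alone.
fn partition_of(minimizer: &[u8], n_partitions: u64) -> u64 {
    let mut h = DefaultHasher::new();
    minimizer.hash(&mut h);
    h.finish() % n_partitions
}
```

For `AAACGTTT` with k=5 and m=3, this yields three superkmers covering bases 0..5, 1..7, and 3..8, with minimizers `AAA`, `AAC`, and `AAA`. The first and last share a minimizer and therefore land in the same partition, which is what lets each partition be processed independently.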
+
+**Layered MPHF index.** Each partition holds a stack of layers. Each layer is a
+minimal perfect hash function (MPHF) over the k-mers of one input genome, paired
+with a per-genome presence/count matrix. Queries scatter k-mers to their partition,
+probe each layer in order, and aggregate results.
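The layered lookup can be pictured with the sketch below. It is only a model: a `HashMap` stands in for the MPHF plus its evidence check, and summing counts across layers is an assumption about what "aggregate" means here. `Layer` and `query_partition` are hypothetical names.

```rust
use std::collections::HashMap;

/// One layer of a partition. A real layer would use an MPHF plus stored
/// evidence to decide membership; a HashMap stands in for both here.
struct Layer {
    /// canonical k-mer -> slot index
    slot_of: HashMap<u64, usize>,
    /// counts[slot][genome]
    counts: Vec<Vec<u32>>,
}

/// Probe every layer in order and aggregate per-genome counts.
/// Summation across layers is an assumption of this sketch.
fn query_partition(layers: &[Layer], kmer: u64, n_genomes: usize) -> Vec<u32> {
    let mut total = vec![0u32; n_genomes];
    for layer in layers {
        if let Some(&slot) = layer.slot_of.get(&kmer) {
            for (t, c) in total.iter_mut().zip(&layer.counts[slot]) {
                *t += *c;
            }
        }
    }
    total
}
```

A k-mer absent from every layer comes back as an all-zero vector, so callers can treat "not indexed" and "count 0 in every genome" uniformly.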
+
+**Approximate indexing (Findere).** With `-z Z`, the index stores k-mers of size
+`s = k − z + 1` instead of k. At query time, results are produced at size s, then
+a per-genome sliding window of size z aggregates z consecutive s-mer hits into one
+confirmed k-mer answer. With b evidence bits per fingerprint, this reduces the
+false-positive rate from `1/2^b` per s-mer to about `1/2^(b·z)` per k-mer, at the
+cost of z−1 unconfirmed positions at each sequence
+break. The aggregation window spans the full query sequence, not individual superkmers,
+to avoid false negatives at superkmer boundaries.
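The z-window aggregation can be sketched for a single genome as below. The input is assumed to be one boolean per s-mer position along the whole query sequence; the function names `confirm_kmers` and `kmer_fp_rate` are hypothetical, and the rate formula assumes that fingerprint collisions at consecutive positions are independent.

```rust
/// Turn per-position s-mer hits into per-position k-mer answers.
/// k-mer i is confirmed only if s-mers i, i+1, ..., i+z-1 are all hits.
/// The output has z-1 fewer positions than the input.
fn confirm_kmers(smer_hits: &[bool], z: usize) -> Vec<bool> {
    if z == 0 || smer_hits.len() < z {
        return Vec::new();
    }
    let mut out = Vec::with_capacity(smer_hits.len() - z + 1);
    // Length of the run of consecutive hits ending at the current position.
    let mut run = 0usize;
    for (i, &hit) in smer_hits.iter().enumerate() {
        run = if hit { run + 1 } else { 0 };
        if i + 1 >= z {
            // The window ending at i is the k-mer starting at i + 1 - z.
            out.push(run >= z);
        }
    }
    out
}

/// Approximate per-k-mer false-positive rate, 1 / 2^(b*z),
/// assuming independent collisions across the z s-mers.
fn kmer_fp_rate(b: u32, z: u32) -> f64 {
    0.5f64.powi((b * z) as i32)
}
```

For hits `[true, true, true, false, true, true]` and z=3, only the first k-mer is confirmed: every later window contains the miss at position 3. Running the window over the whole sequence rather than per superkmer matters because a k-mer's z s-mers can fall in different superkmers.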
+
+**Multi-genome.** A single index can hold any number of genomes. Each k-mer slot
+carries a per-genome count or presence vector. Distance matrices, NJ/UPGMA trees,
+and classification are derived from these vectors without rebuilding the index.
+
+## Input formats
+
+    Command              Formats accepted
+    ───────────────────  ──────────────────────────────────────────────────────────────
+    index, superkmer     FASTA (.fa .fasta), FASTQ (.fq .fastq), GenBank flat file
+                         (.gb .gbk .gbff), all optionally gzip-compressed.
+                         Directories expanded recursively. Streaming stdin via -.
+    query                FASTA, FASTQ, optionally gzip-compressed. Stdin via -.
+
+Non-ACGT characters act as hard breaks between k-mer segments in all formats.
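As an illustration of the hard-break rule, the sketch below splits a sequence at every non-ACGT byte and keeps only the segments long enough to hold at least one k-mer. Treating lowercase `acgt` as valid bases is an assumption of this example, and `acgt_segments` is a hypothetical name.

```rust
/// Split a sequence at every byte that is not A, C, G, or T (case-insensitive
/// here, which is an assumption), dropping segments shorter than k.
fn acgt_segments(seq: &[u8], k: usize) -> Vec<&[u8]> {
    seq.split(|b| !matches!(b.to_ascii_uppercase(), b'A' | b'C' | b'G' | b'T'))
        .filter(|seg| seg.len() >= k)
        .collect()
}
```

With k=4, `ACGTNACGNNACGTA` gives the segments `ACGT` and `ACGTA`; the middle `ACG` is too short to contain a k-mer and is dropped. No k-mer ever spans an `N`.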
+
+## Commands
+
+    Command    Role
+    ─────────  ────────────────────────────────────────────────────────────────────
+    index      Build a genome index from sequence files.
+               Runs scatter → dereplicate → count → layered MPHF.
+               Resumes automatically if interrupted.
+    merge      Merge multiple independently built indexes into one.
+    rebuild    Filter and compact an existing index: apply count thresholds,
+               drop layers, rewrite as a single-layer index.
+    reindex    Convert evidence in-place across all layers:
+               exact (evidence.bin) ↔ approximate (fingerprint.bin).
+               Does not touch the MPHF or unitigs.
+    query      Query an index with FASTA/FASTQ sequences.
+               Annotates each sequence with per-genome k-mer match counts
+               and optional per-position coverage vectors (--detail).
+               Parallel over sequence chunks.
+    distance   Compute a pairwise Bray-Curtis or Jaccard distance matrix
+               between all indexed genomes.
+               Optionally outputs a Newick NJ or UPGMA tree.
+    annotate   Add or update genome metadata (taxonomy, etc.) from a CSV
+               file; or dump the current metadata as CSV.
+    estimate   Dry-run: resolve and print approximate-index parameters
+               (z, evidence bits b, FP rates) given any two of (b, z, fp).
+               Does not touch any index.
+    dump       Dump all indexed k-mers as CSV with per-genome counts or
+               presence flags.
+    superkmer  Extract superkmers from a sequence file and write to stdout.
+               Diagnostic / pipeline use.
+    unitig     Dump the unitig sequences stored in a built index. Debug use.
+    utils      Miscellaneous utilities.
+               --new-label NEW=OLD  renames a genome label in-place.
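For reference, the two distances named under `distance` can be computed from per-genome count vectors as in this sketch. It uses the textbook definitions; how the tool itself weights or normalizes counts is not specified here, and the function names are hypothetical.

```rust
/// Jaccard distance on presence/absence: 1 - |A ∩ B| / |A ∪ B|.
/// a[i] and b[i] are the counts of k-mer i in the two genomes.
fn jaccard_distance(a: &[u32], b: &[u32]) -> f64 {
    let (mut both, mut either) = (0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        if x > 0 && y > 0 {
            both += 1;
        }
        if x > 0 || y > 0 {
            either += 1;
        }
    }
    if either == 0 { 0.0 } else { 1.0 - both as f64 / either as f64 }
}

/// Bray-Curtis distance on counts: 1 - 2 * sum(min(a, b)) / (sum(a) + sum(b)).
fn bray_curtis_distance(a: &[u32], b: &[u32]) -> f64 {
    let (mut shared, mut total) = (0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        shared += x.min(y) as u64;
        total += x as u64 + y as u64;
    }
    if total == 0 { 0.0 } else { 1.0 - 2.0 * shared as f64 / total as f64 }
}
```

For counts `[1, 0, 2, 3]` and `[1, 1, 0, 3]`, the Jaccard distance is 0.5 (2 shared k-mers out of 4 present in either genome) and the Bray-Curtis distance is 3/11.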
+
+## Quick start
+
+```sh
+# Build an exact index for two genomes
+obikmer index --kmer-size 31 --label genome_a genome_a.fa --output index/
+obikmer index --kmer-size 31 --label genome_b genome_b.fa --output index/
+
+# Convert to approximate index (z=5, 8-bit fingerprints)
+obikmer reindex --approx -z 5 --evidence-bits 8 index/
+
+# Query reads
+obikmer query index/ reads.fq.gz > annotated.fa
+
+# Pairwise distances
+obikmer distance index/ > distances.tsv
+```
+
+## Parameter constraints
+
+    Parameter              Constraint
+    ─────────────────────  ────────────────
+    k  (--kmer-size)       odd, 11 ≤ k ≤ 31
+    m  (--minimizer-size)  odd, 3 ≤ m ≤ k−1
+    z  (-z, --approx only) 1 ≤ z ≤ k−1
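The constraints in the table translate directly into a check such as the one below. This is a sketch only: `validate_params` is a hypothetical helper, and the real command-line parser may report errors differently or enforce additional rules.

```rust
/// Check k, m and (optionally) z against the documented constraints.
/// `z` is None when no approximate index is requested.
fn validate_params(k: usize, m: usize, z: Option<usize>) -> Result<(), String> {
    if k % 2 == 0 || !(11..=31).contains(&k) {
        return Err(format!("k must be odd with 11 <= k <= 31, got {k}"));
    }
    if m % 2 == 0 || m < 3 || m > k - 1 {
        return Err(format!("m must be odd with 3 <= m <= k-1, got {m}"));
    }
    if let Some(z) = z {
        if z < 1 || z > k - 1 {
            return Err(format!("z must satisfy 1 <= z <= k-1, got {z}"));
        }
    }
    Ok(())
}
```

Since k and m are both odd, the largest m that passes is in practice k−2.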
+
+## Documentation
+
+Extended architecture and implementation notes are in `docmd/`. Build with
+`make doc` (requires Python + MkDocs Material).