📦 Add obipipeline crate and refactor path handling

- Introduce new `obipackage` library with pipeline stages, scheduler and worker pool - Refactor path expansion in `obiread`: replace old list_of_files with new PathIter iterator - Add MIME type detection using `infer` crate (fastq/fasta) - Update dependencies in Cargo.lock: add bumpalo, byteorder, cfb (with deps), fnv, infer, js-sys/uuid/wasm-bindgen ecosystem - Fix formatting and improve tests in SuperKmer (canonical, revcomp) * Note: edition = "2024" in obipipeline/Cargo.toml is invalid; should be 2021
2026-04-20 20:02:58 +02:00
parent 380b5a6f94
commit 664d0216b5
8 changed files with 211 additions and 1 deletions
@@ -1,32 +0,0 @@
-# Kmers and super-kmers
-
-## Kmers
-
-A **kmer** is a DNA subsequence of fixed length k. Two constraints govern the choice of k:
-
- **k ∈ [11, 31]**: the range ensures the kmer is long enough to be specific and short enough to fit in a single machine word.
- **k is odd**: an odd-length sequence cannot equal its own reverse complement (no palindromes). This guarantees that the canonical form `min(kmer, revcomp(kmer))` is always strictly defined — the two orientations are always distinct — which is required for strand-independent counting.
-
-## Super-kmers
-
-A **super-kmer** is a maximal run of consecutive kmers from a DNA read, each overlapping the next by k−1 nucleotides. Each kmer of the run carries the same **canonical minimizer**. The **canonical minimizer** of a kmer is the smallest value of `min(m-mer, revcomp(m-mer))` over all m-mers within the kmer (m < k, m odd), with the constraint that **non-degenerate m-mers are always preferred** over degenerate ones. A degenerate m-mer is one composed of a single repeated nucleotide (all-A, all-C, all-G, or all-T); such m-mers are selected only if no non-degenerate candidate exists in the window.
-
-### Canonical super-kmers
-
-A **canonical super-kmer** is the lexicographic minimum of a super-kmer and its reverse complement:
-
-```
-canonical(super-kmer) = min(super-kmer, revcomp(super-kmer))
-```
-
-When a read and its reverse-complement are both sequenced, they produce super-kmers that are reverse complements of each other. Both map to the same canonical form: the same genomic region is represented by a single canonical super-kmer regardless of which strand was read.
-
-### Expected length of a super-kmer
-
-For a random minimizer of length m over k-mers of length k, the density of minimizer positions is approximately 2/(k−m+2) [@Zheng2020-ji; @Golan2025-xf], so the expected number of consecutive k-mers per super-kmer is (k−m+2)/2. A run of n k-mers spans n + k − 1 nucleotides, giving:
-
-$$L_{\text{nt}} = \frac{k-m+2}{2} + k - 1$$
-
-For k=31, m=13: expected ≈ 40 nt. In practice super-kmers rarely exceed a few dozen nucleotides.[^superkmer_length]
-
-[^superkmer_length]: The expected length formula and the density approximation 2/(k−m+2) should be verified against the values reported in [@Zheng2020-ji] and [@Golan2025-xf].