Push nvyqwlpspwvl #11

Merged
coissac merged 6 commits from push-nvyqwlpspwvl into main 2026-05-29 07:21:58 +00:00
5 changed files with 51 additions and 1 deletions
Showing only changes of commit 0d9be53d1f - Show all commits
+12
View File
@@ -27,6 +27,18 @@
- Canonical form: `min(kmer, revcomp(kmer))` reduces strand-symmetric space by half
- Input formats: FASTA, FASTQ, gzip, streaming stdin; `index` reads from stdin automatically when no input files are provided (`-` can also be passed explicitly among other paths)
## Parameter constraints (enforced at CLI)
All constraints below are checked by `CommonArgs::validate()` at the start of `superkmer` and `index`. Invalid values exit immediately with an error.
| Parameter | Constraint | Reason |
|-----------|-----------|--------|
| k (`--kmer-size`) | odd | even k allows palindromic k-mers: kmer == revcomp(kmer), breaking the canonical form invariant |
| k (`--kmer-size`) | k ∈ [11, 31] | k > 31 overflows u64 at 2 bits/base; k < 11 gives insufficient specificity |
| m (`--minimizer-size`) | odd | same palindrome argument as k |
| m (`--minimizer-size`) | 3 ≤ m ≤ k1 | minimizer must be strictly shorter than the kmer |
| z (`-z`, Findere, `index --approx` only) | z ≤ k1 | effective indexed kmer size is kz+1; z ≥ k would make it ≤ 0 |
## Genome label constraints
Genome labels are arbitrary Unicode strings with the following restrictions:
+3 -1
View File
@@ -4,9 +4,11 @@
A **kmer** is a DNA subsequence of fixed length k. Two constraints govern the choice of k:
- **k ∈ [11, 31]**: the range ensures the kmer is long enough to be specific and short enough to fit in a single machine word.
- **k ∈ [11, 31]**: the range ensures the kmer is long enough to be specific and short enough to fit in a single machine word (u64 at 2 bits/base requires k ≤ 32; k < 11 yields insufficient specificity).
- **k is odd**: an odd-length sequence cannot equal its own reverse complement (no palindromes). This guarantees that the canonical form `min(kmer, revcomp(kmer))` is always strictly defined — the two orientations are always distinct — which is required for strand-independent counting.
Both constraints are **enforced at CLI entry** by `CommonArgs::validate()` in `superkmer` and `index`. Passing an invalid k exits immediately with an error message.
## Super-kmers
A **super-kmer** is a maximal run of consecutive kmers from a DNA read, each overlapping the next by k1 nucleotides, sharing the same **canonical minimizer**. The **canonical minimizer** of a kmer is the m-mer (m < k) whose canonical hash `hash_kmer(min(m-mer, revcomp(m-mer)))` is smallest over all m-mers in the kmer window. The hash function is a `mix64`-based bijection; selection is purely hash-ordered with no degeneracy filter. A super-kmer is capped at 256 nucleotides; a longer run is split at that boundary.
+23
View File
@@ -63,6 +63,29 @@ pub fn block_size_to_bits(n: usize) -> u8 {
}
impl CommonArgs {
/// Validate k and m constraints. Exits on error.
pub fn validate(&self) {
let k = self.kmer_size;
let m = self.minimizer_size;
if k < 11 || k > 31 {
eprintln!("error: --kmer-size must be in [11, 31] (got {k})");
std::process::exit(1);
}
if k % 2 == 0 {
eprintln!("error: --kmer-size must be odd (got {k}); even k allows palindromic k-mers");
std::process::exit(1);
}
if m < 3 || m >= k {
eprintln!("error: --minimizer-size must be in [3, k1] = [3, {}] (got {m})", k - 1);
std::process::exit(1);
}
if m % 2 == 0 {
eprintln!("error: --minimizer-size must be odd (got {m})");
std::process::exit(1);
}
}
pub fn seqfile_paths(&self) -> obiread::PathIter {
let paths: Vec<PathBuf> = if self.inputs.is_empty() {
vec![PathBuf::from("-")]
+11
View File
@@ -152,12 +152,23 @@ pub(crate) fn resolve_approx_params(
}
pub fn run(args: IndexArgs) {
args.common.validate();
let output = args.output.clone();
let mut rep = Reporter::new();
// ── Resolve evidence kind ────────────────────────────────────────────────
let evidence = if args.approx {
let (z, b, fp) = resolve_approx_params(args.findere_z, args.evidence_bits, args.fp);
let k = args.common.kmer_size;
if z as usize >= k {
eprintln!(
"error: Findere z={z} must be < kmer-size={k} \
(effective kmer size kz+1 = {} ≤ 0)",
k as isize - z as isize + 1
);
std::process::exit(1);
}
info!("approximate evidence: b={b}, z={z}, fp={fp:.2e}");
IndexMode::Approx { b, z }
} else {
+2
View File
@@ -33,6 +33,8 @@ fn write_batch(
// ── Entry point ───────────────────────────────────────────────────────────────
pub fn run(args: SuperkmerArgs) {
args.common.validate();
let k = args.common.kmer_size;
let m = args.common.minimizer_size;
let theta = args.common.theta;