7 Commits

Author SHA1 Message Date
coissac 82ec6aa1cf Merge pull request 'Push tklvqnrqtzpo' (#10) from push-tklvqnrqtzpo into main
Reviewed-on: #10
2026-05-26 15:41:05 +00:00
Eric Coissac 694da5208e feat: add Findere z-window filtering and detail mode coverage tracking
Introduces the `--findere-z` CLI flag to override the index's sliding window parameter and implements `apply_findere` to filter k-mer hits using a z-consecutive positions window. Adds structural support for `--detail` mode, including per-sequence k-mer offsets, conditional allocation of per-position coverage vectors, and JSON serialization. Updates architecture documentation, CLI references, and annotation schemas to align with the new implementation, resolving prior discrepancies with `--detail` and `--mismatch` flags.
2026-05-26 15:43:17 +02:00
Eric Coissac 26ab165807 refactor: add rolling buffer methods and document label constraints
Added `is_empty()`, `clear()`, and `iter()` methods to the rolling statistics buffer to enable standard traversal and state reset operations. Documented genome label constraints, specifying forbidden characters, empty label rejection, space quoting requirements, and auto-derived label bypass rules. Additionally, updated doc comments and added `#[allow(dead_code)]` attributes for `kmer_offset` and `n_kmers` fields to suppress compiler warnings while reserving them for future `--detail` coverage vector logic.
2026-05-26 15:40:23 +02:00
Eric Coissac dfa0b2bac2 feat: add utils subcommand for renaming genome labels
Introduces a `utils` CLI subcommand to enable in-place genome label renaming without full reindexing. Adds strict label validation to reject empty strings, filesystem separators, and control characters, ensuring safe CSV serialization. Updates index metadata, renames corresponding spectrum JSON files, and registers the command in the main dispatch logic. CLI reference documentation is also updated.
2026-05-26 15:35:22 +02:00
Eric Coissac 9e60a711bc Enforce minimum input paths and handle stdin sentinel
Update CLI validation to require at least 10 input paths, defaulting to stdin (`-`) when the argument list is empty. Refactor the path iterator to explicitly recognize the stdin sentinel, bypassing extension validation and directory expansion to ensure direct passthrough to the file buffer without triggering `stat()` or recursive traversal.
2026-05-26 15:22:38 +02:00
Eric Coissac 98c14aade9 feat: centralize index configuration and add hybrid mode
Centralizes index configuration by storing a single `IndexMode` (`Exact`, `Approx`, or `Hybrid`) in `PartitionMeta`, eliminating per-layer metadata files. Introduces a `Hybrid` evidence mode and an `--approx` CLI flag to toggle between exact and probabilistic indexing. Refactors the build and query pipelines to dynamically dispatch based on the configured mode, deferring `.idx` generation to Pass 2 and only requiring it for Exact/Hybrid modes. Updates layer opening to load appropriate data structures, enforces strict parameter validation during merges, and clarifies performance trade-offs in documentation.
2026-05-26 15:08:29 +02:00
Eric Coissac 7501b6e854 refactor: switch indexing to IndexMode and update metadata
Replace EvidenceKind with IndexMode (Exact, Approx, Hybrid) across layer construction and query dispatch. Update PartitionMeta and LayerMeta serialization to centralize index-wide configuration. Add flexible push_layer overloads to LayeredMap for dynamic index expansion without full rebuilds. Improve UnitigFileReader to gracefully fallback to sequential scanning when indexes are missing, eliminating panics.
2026-05-26 14:57:17 +02:00
29 changed files with 793 additions and 533 deletions
+79 -38
View File
@@ -26,7 +26,8 @@ for each batch of sequences:
query_partition(p, superkmers_routed_to_p) query_partition(p, superkmers_routed_to_p)
→ load QueryLayer(s) for p → load QueryLayer(s) for p
→ for each kmer in each superkmer: MphfLayer::find(kmer) → for each kmer in each superkmer: MphfLayer::find(kmer)
broadcast results back to each (seq_idx, kmer_offset) that referenced the superkmer apply_findere(sk_kmer_results, effective_z) ← per superkmer
broadcast confirmed results back to each (seq_idx, kmer_offset)
emit annotated sequences emit annotated sequences
``` ```
@@ -36,28 +37,46 @@ Parallelism is **not yet active** in the current implementation: batches are pro
--- ---
## Layer lookup: `MphfLayer::find` ## Findere z-window filter
`MphfLayer::open` reads `layer_meta.json` and loads either exact or approximate evidence. The caller (`QueryLayer::find`) never chooses the dispatch path — it is fixed at open time by `LayerEvidence`: For approximate and hybrid index modes, a raw k-mer hit is a positive from the fingerprint test with false-positive rate 1/2^b. The Findere scheme reduces the effective FP rate to 1/2^(b·z) by requiring **z consecutive k-mers** to all test positive before any of them is counted as a confirmed match.
`apply_findere` implements this as a sliding-window confirmation, independently for each genome:
```rust ```rust
pub fn find(&self, kmer: CanonicalKmer) -> Option<usize> { fn apply_findere(
match &self.ev { results: &[Option<Box<[u32]>>],
LayerEvidence::Exact { .. } => self.find_exact(kmer), z: usize,
LayerEvidence::Approx { .. } => self.find_approx(kmer), n_genomes: usize,
} ) -> Vec<Option<Box<[u32]>>>
}
``` ```
### Exact layers For each genome g, a position i is confirmed if there exists at least one window of z consecutive positions `[j, j+z)` that contains i and where all z positions are present for g (i.e. `results[pos]` is `Some(row)` and `row[g] > 0`). The implementation uses a single O(n) sliding-window scan per genome.
`find_exact` maps the k-mer through the MPHF to a slot, then calls `UnitigFileReader::verify_canonical_kmer(chunk_id, rank, kmer)` to confirm the stored k-mer matches. Zero false positives. Requires `UnitigFileReader::open()` (random-access via `.idx`); `open_sequential()` cannot serve random-access verification. Unconfirmed positions are zeroed in the returned row. If all genomes are zeroed for a position, it is returned as `None`.
### Approximate layers **Short superkmers**: when a superkmer contains fewer than z k-mers, no complete z-window can be formed. Rather than discarding all results, `apply_findere` returns them unchanged (no filter applied). This avoids penalising legitimate short sequences near read ends.
`find_approx` maps the k-mer through the MPHF, then checks a stored `b`-bit fingerprint of the canonical hash. False-positive rate: **1/2^b per k-mer query**. No `.idx` file is needed; the layer carries only `fingerprint.bin`. **Exact indexes**: `z` is effectively 1 (every k-mer is its own window). `apply_findere` is a no-op.
For a query window of z consecutive k-mers (Findere scheme), the false-positive rate per window is **1/2^(b·z)**. The `z` parameter is recorded in `layer_meta.json` at build time but is not enforced during querying — the caller is responsible for interpreting window-level results accordingly. ### Effective z at query time
`effective_z` is resolved at the start of `run()`:
```rust
let effective_z = args.findere_z.unwrap_or_else(|| match idx.meta().config.evidence {
IndexMode::Approx { z, .. } | IndexMode::Hybrid { z, .. } => z as usize,
IndexMode::Exact => 1,
});
```
The `-z` CLI option overrides the index metadata value. A higher z increases stringency (lower FP, some true positives may be discarded at sequence ends); a lower z increases sensitivity.
---
## Layer lookup: `MphfLayer::find`
`MphfLayer::open(dir, mode: &IndexMode)` receives the mode from `PartitionMeta` — no per-layer file is read. The caller (`QueryLayer`) never chooses the dispatch path: it is fixed at open time by `LayerEvidence`. See [obilayeredmap](../implementation/obilayeredmap.md) for the full `find` / `find_strict` API.
### `QueryLayer` variant selection ### `QueryLayer` variant selection
@@ -72,18 +91,6 @@ For a query window of z consecutive k-mers (Findere scheme), the false-positive
--- ---
## `open()` vs `open_sequential()`
`UnitigFileReader::open()` loads the `.idx` block-offset table, enabling random access to individual unitig chunks. It is required whenever `verify_canonical_kmer` is called (exact layers at query time).
`UnitigFileReader::open_sequential()` skips the `.idx` and supports only forward iteration. It is sufficient for:
- build passes that scan all unitigs sequentially (`build_exact_evidence`, `build_approx_evidence`);
- the `unitig` subcommand, which iterates and prints unitigs without random access.
`KmerIndex::open()` (called by `query::run`) triggers `MphfLayer::open` for each layer, which calls `UnitigFileReader::open()` for exact layers. Approximate layers do not open a unitig reader at all.
---
## Presence / count mode at query time ## Presence / count mode at query time
The `--force-presence` flag and `--presence-threshold` control how per-genome values are accumulated, independently of what the index stores: The `--force-presence` flag and `--presence-threshold` control how per-genome values are accumulated, independently of what the index stores:
@@ -96,49 +103,83 @@ genome_totals[g] += if presence { u32::from(v >= threshold) } else { v }
--- ---
## Coverage vectors (`--detail`)
When `--detail` is requested, a 3-D accumulator `cov[seq_idx][genome][kmer_pos]` is allocated before the partition loop, with dimensions derived from `batch.n_kmers`:
```
cov[seq_idx][g][abs_pos] += contribution
where abs_pos = desc.kmer_offset + local_pos (absolute kmer position in the sequence)
```
Coverage reflects confirmed k-mers only (post-Findere). The vectors are emitted in the JSON annotation under the key `"coverage"`.
---
## `kmer_missing` semantics
`kmer_missing` counts k-mers that returned `None` from the index before Findere filtering — i.e. k-mers truly absent from every layer. K-mers that were found in the index but rejected by the z-window filter are not counted as missing.
---
## Output format ## Output format
Output sequences are written in **OBITools4 format**: the original sequence with a JSON annotation map in the title line. Output sequences are written in **OBITools4 format**: the original sequence with a JSON annotation map in the title line.
``` ```
>read_id {"kmer_count":59,"kmer_strict_matches":{"genome_a":42,"genome_b":7,...}} >read_id {"kmer_count":59,"kmer_strict_matches":{"genome_a":42,"genome_b":7}}
ATCGATCG... ATCGATCG...
``` ```
Genome keys in `kmer_strict_matches` are genome labels from `index.meta`. Key order follows iteration order of `meta.genomes`. With `--detail`:
```
>read_id {"kmer_count":59,"kmer_strict_matches":{...},"coverage":{"genome_a":[0,1,2,...],...}}
ATCGATCG...
```
Genome keys follow the iteration order of `meta.genomes`.
--- ---
## Annotation schema (current implementation) ## Annotation schema
| Key | Type | Condition | Semantics | | Key | Type | Condition | Semantics |
|---|---|---|---| |---|---|---|---|
| `kmer_count` | int | always | k-mers with at least one match | | `kmer_count` | int | always | k-mers confirmed (post-Findere) with at least one genome match |
| `kmer_missing` | int | `--count-missing` | k-mers absent from every layer | | `kmer_missing` | int | `--count-missing` | k-mers absent from the index entirely (pre-Findere None) |
| `kmer_strict_matches` | object | always | per-genome accumulated value (label → count or 0/1) | | `kmer_strict_matches` | object | always | per-genome accumulated value (label → count or 0/1) |
| `coverage` | object | `--detail` | per-genome array of per-position contributions (label → [u32]) |
`kmer_count` counts matched k-mer positions (incremented once per `Some(row)` hit regardless of how many genomes are covered). `kmer_missing` counts `None` hits. `kmer_count + kmer_missing` ≤ total k-mers in the sequence. The gap (if any) corresponds to k-mers found in the index but rejected by the Findere z-window filter.
**Note on doc/impl divergence:** the doc previously used keys `kmer_total`, `kmer_found`, and `kmer_match` (list). The implementation uses `kmer_count` (int, matched only) and `kmer_strict_matches` (object keyed by genome label). `--mismatch` and `--detail` are parsed but not yet implemented and emit a warning.
--- ---
## CLI ## CLI
``` ```
obikmer query -i <index> [--detail] [--mismatch] [--count-missing] obikmer query <index> [--detail] [--mismatch] [--count-missing]
[--force-presence] [--presence-threshold <n>] [--force-presence] [--presence-threshold <n>]
[-T <threads>] <query.fa> [<query2.fa> ...] [-z <z>] [-T <threads>]
<query.fa> [<query2.fa> ...]
``` ```
`--mismatch` and `--detail` are accepted but currently ignored with a warning on stderr. | Option | Default | Semantics |
|---|---|---|
| `-z` / `--findere-z` | from index metadata | Override Findere z parameter |
| `--detail` | off | Emit per-position coverage vectors in JSON |
| `--count-missing` | off | Add `kmer_missing` field to JSON |
| `--force-presence` | off | Report 0/1 per genome regardless of index counts |
| `--presence-threshold` | 1 | Minimum count to declare genome present |
| `-T` / `--threads` | all CPUs | Worker threads (parallelism not yet active) |
`--mismatch` is accepted but currently ignored with a warning on stderr.
--- ---
## Future work ## Future work
- **`--mismatch`**: 1-mismatch approximate matching — generate `3·k` single-substitution variants per k-mer, look each up independently. - **`--mismatch`**: 1-mismatch approximate matching — generate `3·k` single-substitution variants per k-mer, look each up independently.
- **`--detail`**: per-position coverage vectors (`cov_<i>`) per genome.
- **Read classification** (`--classify`): assign each read to the genome with the highest match score. - **Read classification** (`--classify`): assign each read to the genome with the highest match score.
- **Parallelism**: activate per-partition or per-sequence worker threads using the already-parsed `--threads` value. - **Parallelism**: activate per-partition or per-sequence worker threads using the already-parsed `--threads` value.
- **Whitelist / blacklist filtering**: threshold-based accept/reject on per-genome match scores. - **Whitelist / blacklist filtering**: threshold-based accept/reject on per-genome match scores.
+17 -17
View File
@@ -27,10 +27,10 @@ part_XXXXX/
After filtering (applying a min-count threshold derived from the spectrum) and building the local De Bruijn graph + unitigs (see [Construction pipeline](pipeline.md)), the exact filtered kmer set is available via `unitigs.bin`. After filtering (applying a min-count threshold derived from the spectrum) and building the local De Bruijn graph + unitigs (see [Construction pipeline](pipeline.md)), the exact filtered kmer set is available via `unitigs.bin`.
`MphfLayer::build(dir, block_bits, fill_slot)` is called on the unitig directory: `MphfLayer::build(dir, block_bits, mode: &IndexMode, fill_slot)` is called on the unitig directory:
1. **Pass 1**: build `.idx` via `build_unitig_idx(unitig_path, block_bits)`, then iterate all canonical kmers in parallel over chunks using `(0..unitigs.len()).into_par_iter()` + `unitigs.unitig(ci).into_canonical_kmers()`. Constructs and stores `mphf.bin` (ptr_hash, `new_from_par_iter`). 1. **Pass 1** (parallel): a `CanonicalKmerIter` — clonable via `Arc<Mmap>`, no file reopening — is passed directly to `new_from_par_iter` via `par_bridge()`. No `.idx` is read or created at this stage; parallelism is at partition/layer level, not within a single MPHF. Produces `mphf.bin`.
2. **Pass 2**: iterate sequentially with `iter_indexed_canonical_kmers`; fill `evidence.bin`; call `fill_slot(slot, kmer)` callback once per kmer for DataStore population. 2. **Pass 2** (sequential): iterate with `iter_indexed_canonical_kmers`; fill evidence files; call `fill_slot(slot, kmer)` callback per kmer. For Exact/Hybrid, `.idx` is written at the end of this pass — never earlier.
`mphf1.bin` and `counts1.bin` are no longer needed after phase 2 and can be deleted. `mphf1.bin` and `counts1.bin` are no longer needed after phase 2 and can be deleted.
@@ -110,7 +110,7 @@ layer_i/
mphf.bin — ptr_hash phase-2 MPHF mphf.bin — ptr_hash phase-2 MPHF
evidence.bin — n × (chunk_id: 25 bits | rank: 7 bits) per slot [exact mode] evidence.bin — n × (chunk_id: 25 bits | rank: 7 bits) per slot [exact mode]
fingerprint.bin — n × b-bit fingerprints per slot [approx mode] fingerprint.bin — n × b-bit fingerprints per slot [approx mode]
layer_meta.json evidence kind, recorded at build time [no layer_meta.json — mode stored once in partition-level meta.json]
``` ```
Layers are **disjoint**: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B: Layers are **disjoint**: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B:
@@ -121,32 +121,32 @@ Layers are **disjoint**: a canonical kmer belongs to exactly one layer. Layer 0
### Evidence modes ### Evidence modes
Two evidence modes are supported, selected at build time via `EvidenceKind` and recorded in `layer_meta.json`. Three evidence modes are supported via `IndexMode`, stored once in `PartitionMeta` at partition root. There is no `layer_meta.json`.
**Exact** (`EvidenceKind::Exact`): `evidence.bin` stores one `(chunk_id, rank)` pair per MPHF slot, encoding the position of the corresponding kmer in `unitigs.bin`. Membership verification reconstructs the kmer from `(chunk_id, rank)` and compares it to the query. Zero false positives. Requires `.idx` for random access. **Exact** (`IndexMode::Exact`): `evidence.bin` stores one `(chunk_id, rank)` pair per MPHF slot. Verification reconstructs the kmer and compares to the query. Zero false positives. `.idx` required at query time.
**Approx** (`EvidenceKind::Approx { b, z }`): `fingerprint.bin` stores a `b`-bit hash of the kmer at each MPHF slot. Membership check compares `kmer.seq_hash()` against the stored fingerprint. False-positive rate: 1/2^b per query. No `.idx` is written or needed. **Approx** (`IndexMode::Approx { b, z }`): `fingerprint.bin` stores a b-bit hash per slot. False-positive rate 1/2^b per query; Findere z-parameter reduces window FP to ≈ 1/2^(b·z). No `.idx` written or needed.
**Hybrid** (`IndexMode::Hybrid { b, z }`): both `fingerprint.bin` and `evidence.bin` + `.idx`. `find()` uses the fingerprint (O(1)); `find_strict()` uses exact evidence (O(1)).
### Build functions ### Build functions
``` ```
MphfLayer::build(dir, block_bits, fill_slot) MphfLayer::build(dir, block_bits, mode: &IndexMode, fill_slot)
Pass 1: par_iter over chunks via .idx → build mphf.bin Pass 1: CanonicalKmerIter + par_bridge() → build mphf.bin (no .idx used)
Pass 2: sequential iter → fill evidence.bin + call fill_slot Pass 2: sequential iter → fill evidence files + call fill_slot
.idx written last for Exact/Hybrid (query-time only)
MphfLayer::build_exact_evidence(dir, block_bits) MphfLayer::build_exact_evidence(dir, block_bits)
Standalone post-hoc: builds evidence.bin + .idx from existing mphf.bin + unitigs.bin Post-hoc: builds evidence.bin + .idx from existing mphf.bin + unitigs.bin
Uses open_sequential(); no .idx required on entry Uses open_sequential(); no .idx required on entry
MphfLayer::build_approx_evidence(dir, b, z) MphfLayer::build_approx_evidence(dir, b, z)
Standalone post-hoc: builds fingerprint.bin from existing mphf.bin + unitigs.bin Post-hoc: builds fingerprint.bin from existing mphf.bin + unitigs.bin
Uses open_sequential(); never writes .idx Uses open_sequential(); never writes .idx
MphfLayer::build_evidence(dir, kind, block_bits)
Dispatch wrapper: routes to build_exact_evidence or build_approx_evidence
``` ```
`build` always produces exact evidence. If approximate evidence is needed (e.g. `EvidenceKind::Approx`), the caller invokes `build_approx_evidence` after `build` to replace the evidence bundle. There is no `build_evidence` dispatch wrapper. Callers choose the appropriate post-hoc build directly.
In `obikpartitionner`, `build_index_layer` receives `block_bits: u8` from `IndexConfig::block_bits` and forwards it directly to `Layer::build` and `Layer::build_approx_evidence`. In `obikpartitionner`, `build_index_layer` receives `block_bits: u8` from `IndexConfig::block_bits` and forwards it directly to `Layer::build` and `Layer::build_approx_evidence`.
@@ -170,7 +170,7 @@ fn query(kmer) → Option<(layer_index, slot)>:
return None return None
``` ```
`MphfLayer::find` dispatches transparently to `find_exact` or `find_approx` based on the evidence loaded at `open` time. Expected probe depth: 1 for kmers in layer 0. Each probe is a ptr_hash lookup (~10 ns) plus one evidence check. `MphfLayer::find` dispatches on `LayerEvidence` at O(1) — no panicking `find_exact`/`find_approx` methods. `find_strict` always performs an exact check: O(1) for Exact/Hybrid, O(n) sequential scan for Approx. Expected probe depth: 1 for kmers in layer 0. Each probe is a ptr_hash lookup (~10 ns) plus one evidence check.
### Merging layers ### Merging layers
+48 -42
View File
@@ -20,21 +20,26 @@ Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obico
--- ---
## Evidence kinds ## Index mode (homogeneity invariant)
Each layer carries one of two evidence bundles, recorded in `layer_meta.json` at build time: A partitioned index is homogeneous: every layer within a partition shares the same mode. The mode is determined once at `LayeredMap::open()` from `PartitionMeta.mode` and passed to each `Layer::open()` — no per-layer file is read.
```rust ```rust
pub enum EvidenceKind { #[derive(Serialize, Deserialize, Default)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum IndexMode {
#[default]
Exact, Exact,
Approx { b: u8, z: u8 }, Approx { b: u8, z: u8 },
Hybrid { b: u8, z: u8 },
} }
``` ```
`EvidenceKind` is stored in `LayerMeta` (one per layer directory). `open()` reads it to decide which evidence files to load. `IndexMode` is stored once in `PartitionMeta` (`meta.json` at partition root). There is no `layer_meta.json`.
- **Exact**: writes `evidence.bin` + `unitigs.bin.idx`. Zero false positives. Requires random-access `.idx` at query time. - **Exact**: writes `evidence.bin` + `unitigs.bin.idx`. Zero false positives.
- **Approx**: writes `fingerprint.bin` only. False-positive rate per kmer query = 1/2^b. `z` is the Findere consecutive-kmer parameter: `z` consecutive kmers must all match, reducing the effective FP rate per read to approximately W / 2^(b·z) where W = L k z + 2 is the number of windows in a read of length L. No `.idx` written or required. - **Approx**: writes `fingerprint.bin` only. FP rate per kmer = 1/2^b; with Findere z-parameter, z consecutive kmers must all match → effective window FP ≈ 1/2^(b·z). No `.idx` written or required.
- **Hybrid**: writes both `fingerprint.bin` and `evidence.bin` + `.idx`. `find()` uses the fingerprint (fast, O(1)); `find_strict()` uses exact evidence.
--- ---
@@ -55,44 +60,44 @@ pub struct MphfLayer {
```rust ```rust
enum LayerEvidence { enum LayerEvidence {
Exact { evidence: Evidence, unitigs: UnitigFileReader }, Exact { evidence: Evidence, unitigs: UnitigFileReader },
Approx { fingerprint: FingerprintVec }, Approx { fingerprint: FingerprintVec, unitigs_path: PathBuf },
Hybrid { evidence: Evidence, unitigs: UnitigFileReader, fingerprint: FingerprintVec },
} }
``` ```
`MphfLayer::open(dir, mode: &IndexMode)` receives the mode from `PartitionMeta` — no per-layer file is read.
### Query API ### Query API
Three public query methods, all returning `Option<usize>` (slot index): Two public query methods, both returning `Option<usize>` (slot index):
```rust ```rust
pub fn find(&self, kmer: CanonicalKmer) -> Option<usize> pub fn find(&self, kmer: CanonicalKmer) -> Option<usize>
pub fn find_exact(&self, kmer: CanonicalKmer) -> Option<usize> pub fn find_strict(&self, kmer: CanonicalKmer) -> Option<usize>
pub fn find_approx(&self, kmer: CanonicalKmer) -> Option<usize>
``` ```
- `find` dispatches transparently to `find_exact` or `find_approx` based on the evidence variant loaded at `open()`. - `find`: O(1) auto-dispatch. Exact/Hybrid → exact evidence check. Approx/Hybrid → fingerprint comparison.
- `find_exact` panics if the layer holds approximate evidence; zero false positives. - `find_strict`: always exact. Exact/Hybrid → O(1) evidence check. Approx → O(n) sequential scan (no `.idx`).
- `find_approx` panics if the layer holds exact evidence; FP rate 1/2^b per kmer.
`open()` requires `unitigs.bin.idx` (random access into unitigs). `open_sequential()` on `UnitigFileReader` does not require the `.idx` and is used during build passes. There are no `find_exact`/`find_approx` methods; panicking dispatch is eliminated.
### Build surface ### Build surface
```rust ```rust
// Full MPHF + exact evidence build (two-pass, parallel) // Full MPHF + evidence build (two-pass)
pub(crate) fn build(dir, block_bits, fill_slot) -> OLMResult<usize> pub(crate) fn build(dir, block_bits, mode: &IndexMode, fill_slot) -> OLMResult<usize>
// Evidence-only builds (MPHF already present in dir) // Evidence-only post-hoc builds (MPHF already present)
pub fn build_exact_evidence(dir, block_bits) -> OLMResult<usize> pub fn build_exact_evidence(dir, block_bits) -> OLMResult<usize>
pub fn build_approx_evidence(dir, b, z) -> OLMResult<usize> pub fn build_approx_evidence(dir, b, z) -> OLMResult<usize>
pub fn build_evidence(dir, kind, block_bits) -> OLMResult<usize> // dispatch
``` ```
`MphfLayer::build` runs two sequential passes over `unitigs.bin`: `MphfLayer::build` runs two passes over `unitigs.bin`:
1. **Pass 1** (parallel via rayon): iterate all canonical kmers, construct and store `mphf.bin`. `new_from_par_iter` avoids materialising a full key `Vec`. 1. **Pass 1** (parallel via rayon): a `CanonicalKmerIter` (clonable, `Arc<Mmap>`, no file reopening) is passed to `new_from_par_iter` via `par_bridge()`. Produces `mphf.bin`. No `.idx` is read or created at this stage.
2. **Pass 2** (sequential): iterate again, fill `evidence.bin`, call `fill_slot(slot, kmer)` once per kmer for payload population. A compact `n/8`-byte seen-bitset verifies MPHF injectivity inline. 2. **Pass 2** (sequential): fill evidence files; call `fill_slot(slot, kmer)` per kmer. `.idx` is written last for Exact/Hybrid modes (query-time only).
`build` always produces exact evidence. For approximate evidence, use `build_approx_evidence` after MPHF construction. There is no `build_evidence` dispatch wrapper — callers invoke `build_exact_evidence` or `build_approx_evidence` directly.
For empty layers (n = 0), all build variants return `Ok(0)` immediately after creating empty output files. For empty layers (n = 0), all build variants return `Ok(0)` immediately after creating empty output files.
@@ -133,38 +138,37 @@ pub struct Hit<T = ()> {
```rust ```rust
// mode 1 // mode 1
impl Layer<()> { impl Layer<()> {
pub fn build(out_dir: &Path, block_bits: u8) -> OLMResult<usize> pub fn build(out_dir: &Path, block_bits: u8, mode: &IndexMode) -> OLMResult<usize>
} }
// mode 2 // mode 2
impl Layer<PersistentCompactIntMatrix> { impl Layer<PersistentCompactIntMatrix> {
pub fn build(out_dir: &Path, block_bits: u8, pub fn build(out_dir: &Path, block_bits: u8, mode: &IndexMode,
count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize> count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
pub fn build_from_map(out_dir: &Path, block_bits: u8, pub fn build_from_map(out_dir: &Path, block_bits: u8, mode: &IndexMode,
counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize> counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
} }
// mode 3 // mode 3
impl Layer<PersistentBitMatrix> { impl Layer<PersistentBitMatrix> {
pub fn build_presence(out_dir: &Path, block_bits: u8, pub fn build_presence(out_dir: &Path, block_bits: u8, mode: &IndexMode,
n_genomes: usize, n_genomes: usize,
present_in: impl Fn(CanonicalKmer, usize) -> bool) -> OLMResult<usize> present_in: impl Fn(CanonicalKmer, usize) -> bool) -> OLMResult<usize>
} }
``` ```
All build impls delegate MPHF + evidence construction to `MphfLayer::build` via a mode-specific `fill_slot` callback. Modes 2 and 3 pre-read `n_kmers` from `unitigs.bin` via `UnitigFileReader::open_sequential` to size the matrix builder before calling `MphfLayer::build`. All build impls delegate to `MphfLayer::build` via a mode-specific `fill_slot` callback. The `mode` parameter is forwarded directly — no `LayerMeta` is written.
### Evidence build helpers on Layer Evidence-only post-hoc builds are accessible directly on `Layer<D>`:
```rust ```rust
impl<D: LayerData> Layer<D> { impl<D: LayerData> Layer<D> {
pub fn build_exact_evidence(layer_dir: &Path, block_bits: u8) -> OLMResult<usize> pub fn build_exact_evidence(layer_dir: &Path, block_bits: u8) -> OLMResult<usize>
pub fn build_approx_evidence(layer_dir: &Path, b: u8, z: u8) -> OLMResult<usize> pub fn build_approx_evidence(layer_dir: &Path, b: u8, z: u8) -> OLMResult<usize>
pub fn build_evidence(layer_dir: &Path, kind: &EvidenceKind, block_bits: u8) -> OLMResult<usize>
} }
``` ```
These delegate directly to the corresponding `MphfLayer` methods and are provided so call sites can remain typed at the `Layer<D>` level. There is no `build_evidence` dispatch wrapper.
--- ---
@@ -212,14 +216,17 @@ pub struct LayeredMap<D: LayerData = ()> {
### Common methods ### Common methods
```rust ```rust
pub fn open(root: &Path) -> OLMResult<Self> pub fn open(root: &Path) -> OLMResult<Self>
pub fn create(root: &Path) -> OLMResult<Self> pub fn create(root: &Path, mode: IndexMode) -> OLMResult<Self>
pub fn n_layers(&self) -> usize pub fn n_layers(&self) -> usize
pub fn layer(&self, i: usize) -> &Layer<D> pub fn layer(&self, i: usize) -> &Layer<D>
pub fn mode(&self) -> &IndexMode
pub fn query(&self, kmer: CanonicalKmer) -> Option<(usize, Hit<D::Item>)> pub fn query(&self, kmer: CanonicalKmer) -> Option<(usize, Hit<D::Item>)>
pub fn next_layer_writer(&self) -> OLMResult<UnitigFileWriter> pub fn next_layer_writer(&self) -> OLMResult<UnitigFileWriter>
``` ```
`open` reads `PartitionMeta` once, extracts `mode`, and passes it to every `Layer::open` — no per-layer file is read. `create` stores the given mode in `PartitionMeta`.
`query` probes layers in order and returns `(layer_index, Hit)` on the first match. Expected probe depth: 1 for kmers in layer 0. `query` probes layers in order and returns `(layer_index, Hit)` on the first match. Expected probe depth: 1 for kmers in layer 0.
### push_layer ### push_layer
@@ -272,14 +279,13 @@ See [Kmer index architecture](../architecture/index_architecture.md) for the ful
``` ```
partition_root/ ← LayeredMap (one partition) partition_root/ ← LayeredMap (one partition)
meta.json — {"n_layers": N} meta.json — {"n_layers": N, "mode": {"type": "exact"|"approx"|"hybrid", ...}}
layer_0/ ← Layer layer_0/ ← Layer
layer_meta.json — {"type": "exact"} or {"type": "approx", "b": B, "z": Z}
mphf.bin — ptr_hash MPHF (epserde format) mphf.bin — ptr_hash MPHF (epserde format)
unitigs.bin — packed 2-bit nucleotide sequences unitigs.bin — packed 2-bit nucleotide sequences
unitigs.bin.idx — UIDX index (exact evidence only) unitigs.bin.idx — UIDX index (Exact/Hybrid only; query-time, never built during MPHF construction)
evidence.bin — [u32; n], LE (exact evidence only) evidence.bin — [u32; n], LE (Exact/Hybrid only)
fingerprint.bin — packed b-bit array (approx evidence only) fingerprint.bin — packed b-bit array (Approx/Hybrid only)
counts/ [mode 2] PersistentCompactIntMatrix counts/ [mode 2] PersistentCompactIntMatrix
meta.json meta.json
col_000000.pciv col_000000.pciv
@@ -290,7 +296,7 @@ partition_root/ ← LayeredMap (one partition)
``` ```
`unitigs.bin.idx` is required by `open()` (random access). `open_sequential()` on `UnitigFileReader` omits it and is used during build passes and approx-evidence construction. There is no `layer_meta.json`. The mode is stored once in `PartitionMeta` and is valid for all layers. `unitigs.bin.idx` is built at the end of `build_exact_evidence` — never during MPHF construction — and is consumed at query time only.
--- ---
@@ -387,4 +393,4 @@ Each partition's new layer is built independently; the operation is fully parall
| `obiskio` | unitig file writer/reader + `.idx` build | | `obiskio` | unitig file writer/reader + `.idx` build |
| `obicompactvec` | payload types + aggregation traits | | `obicompactvec` | payload types + aggregation traits |
| `rayon 1` | parallel MPHF construction pass | | `rayon 1` | parallel MPHF construction pass |
| `serde / serde_json` | `LayerMeta` + `PartitionMeta` serialisation | | `serde / serde_json` | `PartitionMeta` serialisation |
+12 -23
View File
@@ -61,35 +61,24 @@ File size = `n_slots × 4` bytes. `chunk_id` is the 0-based index of the record
Scans `unitigs.bin` sequentially: for each chunk at byte offset `offset`, if `chunk_count & mask == 0` (where `mask = (1 << block_bits) 1`), appends `offset as u32` to `block_offsets`. After the scan, appends a sentinel (= total file size), then writes the `.idx` file. Called after the unitig file is fully written and closed. Scans `unitigs.bin` sequentially: for each chunk at byte offset `offset`, if `chunk_count & mask == 0` (where `mask = (1 << block_bits) 1`), appends `offset as u32` to `block_offsets`. After the scan, appends a sentinel (= total file size), then writes the `.idx` file. Called after the unitig file is fully written and closed.
### `open()` vs `open_sequential()` ### `open()`, `open_sequential()`, `open_direct_access()`
`UnitigFileReader::open(path)` loads the `.idx` file into `block_offsets: Vec<u32>` and memory-maps `unitigs.bin`. Enables random access via `chunk_start(i)`, `unitig(i)`, `raw_kmer(i, j)`, and `verify_canonical_kmer(i, j, q)`. `UnitigFileReader` has three constructors:
`UnitigFileReader::open_sequential(path)` does not read `.idx`. It scans `unitigs.bin` once to count chunks and kmers, then leaves `block_offsets` empty. Only sequential iterators work: `iter_unitigs`, `iter_kmers`, `iter_canonical_kmers`, `iter_indexed_canonical_kmers`. Any call to `chunk_start()` panics with a diagnostic message. - `open(path)` — smart default: if `unitigs.bin.idx` exists, delegates to `open_direct_access`; otherwise delegates to `open_sequential`. Prefer this in call sites that don't require one specific mode.
- `open_sequential(path)` — never reads `.idx`. Sequential iterators only; `chunk_start(i)` falls back to an O(i) mmap scan rather than panicking.
- `open_direct_access(path)` — requires `.idx` to be present. Enables O(1) or O(2^block_bits) `chunk_start(i)`, used by `verify_canonical_kmer` at query time.
### `chunk_start(i)` — random access `CanonicalKmerIter` — a clonable sequential iterator returned by `UnitigFileReader::iter_canonical_kmers()`. It holds an `Arc<Mmap>` so cloning resets the cursor to the start without reopening the file. This makes it usable with `par_bridge()` for parallel MPHF construction without random access.
```rust ### `chunk_start(i)` — access modes
fn chunk_start(&self, i: usize) -> usize {
// block_bits=0: single table lookup, O(1) — hot path
if self.block_bits == 0 {
return self.block_offsets[i] as usize;
}
// block_bits>0: lookup block, then scan at most 2^block_bits 1 records
let block = i >> self.block_bits;
let rem = i & self.mask;
let mut offset = self.block_offsets[block] as usize;
for _ in 0..rem {
let seql_minus_k = self.mmap[offset] as usize;
offset += 1 + (seql_minus_k + self.k + 3) / 4;
}
offset
}
```
With `block_bits = 0` (the default), every chunk has a direct entry in `block_offsets`: lookup is a single array index, O(1), with no sequential scan. The `if self.block_bits == 0` branch is explicit in the code and handles this hot path first. When `.idx` is loaded (`open_direct_access`):
With `block_bits > 0`, one offset covers `2^block_bits` consecutive chunks; access cost is O(`2^block_bits`) sequential mmap reads. - `block_bits = 0`: single array lookup, O(1).
- `block_bits > 0`: lookup block, then scan ≤ 2^block_bits records, O(2^block_bits).
When `.idx` is absent (`open_sequential`): `chunk_start(i)` performs an O(i) sequential mmap scan from offset 0. No panic — the function degrades gracefully. This degraded path is used by `find_strict()` on Approx layers (sequential scan of all canonical kmers).
### Decoding a kmer from slot `s` ### Decoding a kmer from slot `s`
+16 -1
View File
@@ -17,6 +17,7 @@
| `unitig` | Dump unitigs from a built index to stdout (debug) | | `unitig` | Dump unitigs from a built index to stdout (debug) |
| `estimate` | Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing | | `estimate` | Estimate approximate-index parameters (z, evidence bits, FP rates) before indexing |
| `reindex` | Convert an index's evidence in-place: exact ↔ approx | | `reindex` | Convert an index's evidence in-place: exact ↔ approx |
| `utils` | Miscellaneous index utilities: `--new-label NEW=OLD` renames a genome label in-place (NEW gets OLD's identity) |
## Constraints ## Constraints
@@ -24,7 +25,21 @@
- Maximum efficiency in computation, memory, and disk usage - Maximum efficiency in computation, memory, and disk usage
- k odd, k ∈ [11, 31], fixed at runtime; kmer fits in a u64 (2 bits/base) - k odd, k ∈ [11, 31], fixed at runtime; kmer fits in a u64 (2 bits/base)
- Canonical form: `min(kmer, revcomp(kmer))` reduces strand-symmetric space by half - Canonical form: `min(kmer, revcomp(kmer))` reduces strand-symmetric space by half
- Input formats: FASTA, FASTQ, gzip, streaming stdin - Input formats: FASTA, FASTQ, gzip, streaming stdin; `index` reads from stdin automatically when no input files are provided (`-` can also be passed explicitly among other paths)
## Genome label constraints
Genome labels are arbitrary Unicode strings with the following restrictions:
| Character | Forbidden | Reason |
|-----------|-----------|--------|
| `/` | yes | filesystem path separator |
| `=` | yes | `--new-label` parser separator |
| `\0` | yes | null byte |
| `\n` `\r` `\t` | yes | break CSV output |
| spaces | **allowed** | use shell quoting: `--new-label 'new label=old label'` |
Empty labels are also rejected. Labels derived automatically from the index directory name (when `--label` is omitted) are not validated since they come from the filesystem and are already safe.
## Priority operations ## Priority operations
+1 -1
View File
@@ -12,5 +12,5 @@ pub use error::{OKIError, OKIResult};
pub use distance::{DistanceMetric, DistanceOutput}; pub use distance::{DistanceMetric, DistanceOutput};
pub use index::KmerIndex; pub use index::KmerIndex;
pub use merge::MergeMode; pub use merge::MergeMode;
pub use meta::{GenomeInfo, IndexConfig, IndexMeta, META_FILENAME}; pub use meta::{validate_label, GenomeInfo, IndexConfig, IndexMeta, META_FILENAME};
pub use state::{IndexState, SENTINEL_COUNTED, SENTINEL_INDEXED, SENTINEL_SCATTERED}; pub use state::{IndexState, SENTINEL_COUNTED, SENTINEL_INDEXED, SENTINEL_SCATTERED};
+7 -5
View File
@@ -9,7 +9,7 @@ use obisys::{Reporter, Stage};
use rayon::prelude::*; use rayon::prelude::*;
use tracing::info; use tracing::info;
use obilayeredmap::EvidenceKind; use obilayeredmap::IndexMode;
use crate::error::{OKIError, OKIResult}; use crate::error::{OKIError, OKIResult};
use crate::index::KmerIndex; use crate::index::KmerIndex;
@@ -271,14 +271,16 @@ fn partition_bar(n: u64) -> ProgressBar {
/// - all `Exact` → OK, returns `Exact` /// - all `Exact` → OK, returns `Exact`
/// - all `Approx { b, z }` same params → OK, returns `Approx { b, z }` /// - all `Approx { b, z }` same params → OK, returns `Approx { b, z }`
/// - mixed exact/approx or different approx params → `IncompatibleEvidence` /// - mixed exact/approx or different approx params → `IncompatibleEvidence`
fn validate_evidence_compat(sources: &[&KmerIndex]) -> OKIResult<EvidenceKind> { fn validate_evidence_compat(sources: &[&KmerIndex]) -> OKIResult<IndexMode> {
let ref_ev = &sources[0].meta.config.evidence; let ref_ev = &sources[0].meta.config.evidence;
for src in &sources[1..] { for src in &sources[1..] {
let ev = &src.meta.config.evidence; let ev = &src.meta.config.evidence;
let compat = match (ref_ev, ev) { let compat = match (ref_ev, ev) {
(EvidenceKind::Exact, EvidenceKind::Exact) => true, (IndexMode::Exact, IndexMode::Exact) => true,
(EvidenceKind::Approx { b: b1, z: z1 }, (IndexMode::Approx { b: b1, z: z1 },
EvidenceKind::Approx { b: b2, z: z2 }) => b1 == b2 && z1 == z2, IndexMode::Approx { b: b2, z: z2 }) => b1 == b2 && z1 == z2,
(IndexMode::Hybrid { b: b1, z: z1 },
IndexMode::Hybrid { b: b2, z: z2 }) => b1 == b2 && z1 == z2,
_ => false, _ => false,
}; };
if !compat { if !compat {
+25 -2
View File
@@ -3,7 +3,7 @@ use std::fs;
use std::io; use std::io;
use std::path::Path; use std::path::Path;
use obilayeredmap::EvidenceKind; use obilayeredmap::IndexMode;
use serde::{Deserialize, Serialize}; use serde::{Deserialize, Serialize};
pub const META_FILENAME: &str = "index.meta"; pub const META_FILENAME: &str = "index.meta";
@@ -30,7 +30,7 @@ pub struct IndexConfig {
pub n_bits: usize, pub n_bits: usize,
pub with_counts: bool, pub with_counts: bool,
#[serde(default)] #[serde(default)]
pub evidence: EvidenceKind, pub evidence: IndexMode,
/// Block size for the unitig index as a power-of-two exponent. /// Block size for the unitig index as a power-of-two exponent.
/// The `.idx` block covers 2^block_bits consecutive unitigs. /// The `.idx` block covers 2^block_bits consecutive unitigs.
/// 0 = one entry per unitig (O(1) access, largest `.idx`). /// 0 = one entry per unitig (O(1) access, largest `.idx`).
@@ -70,3 +70,26 @@ impl IndexMeta {
self.genomes.iter().map(|g| g.label.as_str()) self.genomes.iter().map(|g| g.label.as_str())
} }
} }
/// Validate a user-supplied genome label.
///
/// Forbidden: `/` (filesystem separator), `=` (--new-label parser separator),
/// `\0` (null byte), `\n`, `\r`, `\t` (break CSV output).
/// Empty labels are also rejected.
pub fn validate_label(label: &str) -> Result<(), String> {
if label.is_empty() {
return Err("genome label must not be empty".into());
}
const FORBIDDEN: &[char] = &['/', '=', '\0', '\n', '\r', '\t'];
if let Some(c) = label.chars().find(|c| FORBIDDEN.contains(c)) {
let display = match c {
'\0' => "\\0 (null)".to_string(),
'\n' => "\\n (newline)".to_string(),
'\r' => "\\r (carriage return)".to_string(),
'\t' => "\\t (tab)".to_string(),
c => format!("'{c}'"),
};
return Err(format!("genome label contains forbidden character {display}"));
}
Ok(())
}
+19 -11
View File
@@ -3,7 +3,7 @@ use std::path::Path;
use std::time::Duration; use std::time::Duration;
use indicatif::{ProgressBar, ProgressStyle}; use indicatif::{ProgressBar, ProgressStyle};
use obilayeredmap::{EvidenceKind, layer::Layer}; use obilayeredmap::{IndexMode, layer::Layer};
use obilayeredmap::meta::PartitionMeta; use obilayeredmap::meta::PartitionMeta;
use obisys::{Reporter, Stage}; use obisys::{Reporter, Stage};
use rayon::prelude::*; use rayon::prelude::*;
@@ -31,7 +31,7 @@ impl KmerIndex {
/// `index.meta` is updated with the new evidence kind on success. /// `index.meta` is updated with the new evidence kind on success.
pub fn reindex( pub fn reindex(
&mut self, &mut self,
target: EvidenceKind, target: IndexMode,
block_bits: u8, block_bits: u8,
rep: &mut Reporter, rep: &mut Reporter,
) -> OKIResult<()> { ) -> OKIResult<()> {
@@ -75,7 +75,7 @@ impl KmerIndex {
} }
self.meta.config.evidence = target; self.meta.config.evidence = target;
if matches!(self.meta.config.evidence, EvidenceKind::Exact) { if matches!(self.meta.config.evidence, IndexMode::Exact) {
self.meta.config.block_bits = block_bits; self.meta.config.block_bits = block_bits;
} }
self.meta.write(&self.root_path)?; self.meta.write(&self.root_path)?;
@@ -85,7 +85,7 @@ impl KmerIndex {
} }
/// Process all layers of one partition's index directory. /// Process all layers of one partition's index directory.
fn reindex_partition(index_dir: &Path, target: &EvidenceKind, block_bits: u8) -> OKIResult<()> { fn reindex_partition(index_dir: &Path, target: &IndexMode, block_bits: u8) -> OKIResult<()> {
if !index_dir.exists() { if !index_dir.exists() {
return Ok(()); return Ok(());
} }
@@ -97,22 +97,30 @@ fn reindex_partition(index_dir: &Path, target: &EvidenceKind, block_bits: u8) ->
Ok(()) Ok(())
} }
fn reindex_layer(layer_dir: &Path, target: &EvidenceKind, block_bits: u8) -> OKIResult<()> { fn reindex_layer(layer_dir: &Path, target: &IndexMode, block_bits: u8) -> OKIResult<()> {
Layer::<()>::build_evidence(layer_dir, target, block_bits).map_err(olm_to_oki)?; match target {
IndexMode::Exact => {
Layer::<()>::build_exact_evidence(layer_dir, block_bits).map_err(olm_to_oki)?;
}
IndexMode::Approx { b, z } | IndexMode::Hybrid { b, z } => {
Layer::<()>::build_approx_evidence(layer_dir, *b, *z).map_err(olm_to_oki)?;
}
}
remove_stale_evidence(layer_dir, target) remove_stale_evidence(layer_dir, target)
} }
fn remove_stale_evidence(layer_dir: &Path, target: &EvidenceKind) -> OKIResult<()> { fn remove_stale_evidence(layer_dir: &Path, target: &IndexMode) -> OKIResult<()> {
match target { match target {
EvidenceKind::Exact => { IndexMode::Exact => {
// fingerprint.bin is no longer valid
remove_if_exists(&layer_dir.join(FINGERPRINT_FILE)); remove_if_exists(&layer_dir.join(FINGERPRINT_FILE));
} }
EvidenceKind::Approx { .. } => { IndexMode::Approx { .. } => {
// exact bundle is no longer valid
remove_if_exists(&layer_dir.join(EVIDENCE_FILE)); remove_if_exists(&layer_dir.join(EVIDENCE_FILE));
remove_if_exists(&layer_dir.join(UNITIG_IDX_FILE)); remove_if_exists(&layer_dir.join(UNITIG_IDX_FILE));
} }
IndexMode::Hybrid { .. } => {
// both bundles kept — nothing to remove
}
} }
Ok(()) Ok(())
} }
+8 -3
View File
@@ -10,8 +10,9 @@ use obikseq::RoutableSuperKmer;
#[derive(Args)] #[derive(Args)]
pub struct CommonArgs { pub struct CommonArgs {
/// Input files or directories (FASTA/FASTQ, optionally gzip-compressed) /// Input files or directories (FASTA/FASTQ, optionally gzip-compressed).
#[arg(num_args = 1..)] /// If omitted, reads from stdin.
#[arg(num_args = 0..)]
pub inputs: Vec<String>, pub inputs: Vec<String>,
/// k-mer size /// k-mer size
@@ -63,7 +64,11 @@ pub fn block_size_to_bits(n: usize) -> u8 {
impl CommonArgs { impl CommonArgs {
pub fn seqfile_paths(&self) -> obiread::PathIter { pub fn seqfile_paths(&self) -> obiread::PathIter {
let paths = self.inputs.iter().map(PathBuf::from).collect(); let paths: Vec<PathBuf> = if self.inputs.is_empty() {
vec![PathBuf::from("-")]
} else {
self.inputs.iter().map(PathBuf::from).collect()
};
obiread::PathIter::new(paths) obiread::PathIter::new(paths)
} }
} }
+8 -4
View File
@@ -1,8 +1,8 @@
use std::path::PathBuf; use std::path::PathBuf;
use clap::Args; use clap::Args;
use obikindex::{GenomeInfo, IndexConfig, IndexState, KmerIndex}; use obikindex::{validate_label, GenomeInfo, IndexConfig, IndexState, KmerIndex};
use obilayeredmap::EvidenceKind; use obilayeredmap::IndexMode;
fn parse_key_value(s: &str) -> Result<(String, String), String> { fn parse_key_value(s: &str) -> Result<(String, String), String> {
let pos = s.find('=').ok_or_else(|| format!("invalid key=value: no '=' in '{s}'"))?; let pos = s.find('=').ok_or_else(|| format!("invalid key=value: no '=' in '{s}'"))?;
@@ -159,9 +159,9 @@ pub fn run(args: IndexArgs) {
let evidence = if args.approx { let evidence = if args.approx {
let (z, b, fp) = resolve_approx_params(args.findere_z, args.evidence_bits, args.fp); let (z, b, fp) = resolve_approx_params(args.findere_z, args.evidence_bits, args.fp);
info!("approximate evidence: b={b}, z={z}, fp={fp:.2e}"); info!("approximate evidence: b={b}, z={z}, fp={fp:.2e}");
EvidenceKind::Approx { b, z } IndexMode::Approx { b, z }
} else { } else {
EvidenceKind::Exact IndexMode::Exact
}; };
// ── Open or create the index ───────────────────────────────────────────── // ── Open or create the index ─────────────────────────────────────────────
@@ -194,6 +194,10 @@ pub fn run(args: IndexArgs) {
block_bits, block_bits,
}; };
let genome_info = args.label.as_ref().map(|label| { let genome_info = args.label.as_ref().map(|label| {
validate_label(label).unwrap_or_else(|e| {
eprintln!("error: --label: {e}");
std::process::exit(1);
});
let mut info = GenomeInfo::new(label.clone()); let mut info = GenomeInfo::new(label.clone());
for (k, v) in &args.meta { for (k, v) in &args.meta {
info.meta.insert(k.clone(), v.clone()); info.meta.insert(k.clone(), v.clone());
+1
View File
@@ -1,4 +1,5 @@
pub mod annotate; pub mod annotate;
pub mod utils;
pub mod distance; pub mod distance;
pub mod dump; pub mod dump;
pub mod estimate; pub mod estimate;
+139 -34
View File
@@ -4,6 +4,7 @@ use std::path::PathBuf;
use clap::Args; use clap::Args;
use obikindex::KmerIndex; use obikindex::KmerIndex;
use obilayeredmap::IndexMode;
use obiread::record::{SeqRecord, parse_chunk}; use obiread::record::{SeqRecord, parse_chunk};
use obiread::chunk::read_sequence_chunks; use obiread::chunk::read_sequence_chunks;
use obikseq::{RoutableSuperKmer, set_k, set_m}; use obikseq::{RoutableSuperKmer, set_k, set_m};
@@ -21,7 +22,7 @@ pub struct QueryArgs {
#[arg(num_args = 1..)] #[arg(num_args = 1..)]
pub inputs: Vec<String>, pub inputs: Vec<String>,
/// Also report per-position coverage vectors (cov_<i> per genome) /// Report per-position coverage vectors per genome (adds "coverage" to JSON)
#[arg(long)] #[arg(long)]
pub detail: bool, pub detail: bool,
@@ -41,6 +42,10 @@ pub struct QueryArgs {
#[arg(long, default_value_t = 1)] #[arg(long, default_value_t = 1)]
pub presence_threshold: u32, pub presence_threshold: u32,
/// Override the Findere z parameter from index metadata
#[arg(short = 'z', long)]
pub findere_z: Option<usize>,
/// Number of worker threads /// Number of worker threads
#[arg( #[arg(
short = 'T', short = 'T',
@@ -59,24 +64,18 @@ pub struct SKDesc {
/// Index of the source sequence within the batch. /// Index of the source sequence within the batch.
pub seq_idx: u32, pub seq_idx: u32,
/// Kmer offset of the first kmer of this superkmer within its sequence. /// Kmer offset of the first kmer of this superkmer within its sequence.
/// Computed as the cumulative number of kmers emitted before this superkmer
/// in the same sequence. Used for `--detail` coverage vectors.
pub kmer_offset: u32, pub kmer_offset: u32,
} }
// ── QueryBatch ──────────────────────────────────────────────────────────────── // ── QueryBatch ────────────────────────────────────────────────────────────────
/// A batch of query sequences with their superkmers deduplicated. /// A batch of query sequences with their superkmers deduplicated.
///
/// Each unique `RoutableSuperKmer` maps to all the (seq_idx, kmer_offset)
/// positions it occupies across the batch. The superkmer is queried once
/// per partition; the result is broadcast to every SKDesc entry.
pub struct QueryBatch { pub struct QueryBatch {
/// Sequence ids in batch order. /// Sequence ids in batch order.
pub ids: Vec<String>, pub ids: Vec<String>,
/// Raw sequence bytes (for output), in batch order. /// Raw sequence bytes (for output), in batch order.
pub seqs: Vec<Vec<u8>>, pub seqs: Vec<Vec<u8>>,
/// Per-sequence total kmer count (kmer_count + kmer_missing). /// Total kmer count per sequence (used for `--detail` coverage allocation).
pub n_kmers: Vec<u32>, pub n_kmers: Vec<u32>,
/// Deduplicated superkmer map. /// Deduplicated superkmer map.
pub map: HashMap<RoutableSuperKmer, Vec<SKDesc>>, pub map: HashMap<RoutableSuperKmer, Vec<SKDesc>>,
@@ -90,8 +89,8 @@ impl QueryBatch {
level_max: usize, level_max: usize,
theta: f64, theta: f64,
) -> Self { ) -> Self {
let mut ids = Vec::with_capacity(records.len()); let mut ids = Vec::with_capacity(records.len());
let mut seqs = Vec::with_capacity(records.len()); let mut seqs = Vec::with_capacity(records.len());
let mut n_kmers = Vec::with_capacity(records.len()); let mut n_kmers = Vec::with_capacity(records.len());
let mut map: HashMap<RoutableSuperKmer, Vec<SKDesc>> = HashMap::new(); let mut map: HashMap<RoutableSuperKmer, Vec<SKDesc>> = HashMap::new();
@@ -115,9 +114,6 @@ impl QueryBatch {
} }
/// Split the superkmer map by partition index. /// Split the superkmer map by partition index.
///
/// Returns a vec of length `n_partitions`; each slot holds the RSK refs
/// whose minimizer routes to that partition.
pub fn split_by_partition(&self, n_partitions: usize) -> Vec<Vec<&RoutableSuperKmer>> { pub fn split_by_partition(&self, n_partitions: usize) -> Vec<Vec<&RoutableSuperKmer>> {
let mask = (n_partitions as u64) - 1; let mask = (n_partitions as u64) - 1;
let mut by_part: Vec<Vec<&RoutableSuperKmer>> = vec![Vec::new(); n_partitions]; let mut by_part: Vec<Vec<&RoutableSuperKmer>> = vec![Vec::new(); n_partitions];
@@ -132,22 +128,83 @@ impl QueryBatch {
// ── Per-sequence accumulator ────────────────────────────────────────────────── // ── Per-sequence accumulator ──────────────────────────────────────────────────
struct SeqAcc { struct SeqAcc {
kmer_count: u32, kmer_count: u32,
kmer_missing: u32, kmer_missing: u32,
/// Per-genome accumulated count (count mode) or presence sum (presence mode).
genome_totals: Vec<u32>, genome_totals: Vec<u32>,
} }
impl SeqAcc { impl SeqAcc {
fn new(n_genomes: usize) -> Self { fn new(n_genomes: usize) -> Self {
Self { Self {
kmer_count: 0, kmer_count: 0,
kmer_missing: 0, kmer_missing: 0,
genome_totals: vec![0u32; n_genomes], genome_totals: vec![0u32; n_genomes],
} }
} }
} }
// ── Findere z-window filter ───────────────────────────────────────────────────
/// Apply the Findere z-window filter to per-kmer query results for one superkmer.
///
/// A k-mer at position i for genome g is confirmed only if it belongs to at least
/// one run of z consecutive positions where all k-mers are present for g.
/// Unconfirmed positions are zeroed; positions whose entire row becomes zero are
/// returned as `None`.
///
/// When z <= 1 or the superkmer is shorter than z k-mers, results are returned
/// unchanged (short superkmers cannot satisfy the z-window constraint).
fn apply_findere(
results: &[Option<Box<[u32]>>],
z: usize,
n_genomes: usize,
) -> Vec<Option<Box<[u32]>>> {
let n = results.len();
if z <= 1 || n < z {
return results.iter().map(|r| r.as_ref().map(|row| row.clone())).collect();
}
let mut confirmed = vec![vec![false; n_genomes]; n];
for g in 0..n_genomes {
let present: Vec<bool> = results
.iter()
.map(|r| r.as_ref().map_or(false, |row| row[g] > 0))
.collect();
let mut window_count = present[..z].iter().filter(|&&p| p).count();
if window_count == z {
for c in confirmed[..z].iter_mut() {
c[g] = true;
}
}
for j in 1..=(n - z) {
if present[j - 1] { window_count -= 1; }
if present[j + z - 1] { window_count += 1; }
if window_count == z {
for c in confirmed[j..j + z].iter_mut() {
c[g] = true;
}
}
}
}
results.iter().enumerate().map(|(i, opt)| {
let row = opt.as_ref()?;
let mut new_row: Box<[u32]> = row.clone();
let mut any = false;
for g in 0..n_genomes {
if !confirmed[i][g] {
new_row[g] = 0;
} else {
any = true;
}
}
if any { Some(new_row) } else { None }
}).collect()
}
// ── Entry point ─────────────────────────────────────────────────────────────── // ── Entry point ───────────────────────────────────────────────────────────────
pub fn run(args: QueryArgs) { pub fn run(args: QueryArgs) {
@@ -164,17 +221,22 @@ pub fn run(args: QueryArgs) {
let n_partitions = idx.n_partitions(); let n_partitions = idx.n_partitions();
let with_counts = idx.meta().config.with_counts; let with_counts = idx.meta().config.with_counts;
let effective_z: usize = args.findere_z.unwrap_or_else(|| {
match idx.meta().config.evidence {
IndexMode::Approx { z, .. } | IndexMode::Hybrid { z, .. } => z as usize,
IndexMode::Exact => 1,
}
});
info!( info!(
"query: k={k}, {} genome(s), with_counts={with_counts}, mismatch={}, detail={}", "query: k={k}, {} genome(s), with_counts={with_counts}, z={effective_z}, \
mismatch={}, detail={}",
n_genomes, args.mismatch, args.detail n_genomes, args.mismatch, args.detail
); );
if args.mismatch { if args.mismatch {
eprintln!("warning: --mismatch not yet implemented, ignored"); eprintln!("warning: --mismatch not yet implemented, ignored");
} }
if args.detail {
eprintln!("warning: --detail not yet implemented, ignored");
}
let paths: Vec<PathBuf> = args.inputs.iter().map(PathBuf::from).collect(); let paths: Vec<PathBuf> = args.inputs.iter().map(PathBuf::from).collect();
let mut out = BufWriter::new(io::stdout()); let mut out = BufWriter::new(io::stdout());
@@ -197,10 +259,20 @@ pub fn run(args: QueryArgs) {
continue; continue;
} }
let batch = QueryBatch::from_records(records, k, 6, 0.7); let batch = QueryBatch::from_records(records, k, 6, 0.7);
let n_seqs = batch.ids.len(); let n_seqs = batch.ids.len();
let mut accs: Vec<SeqAcc> = (0..n_seqs).map(|_| SeqAcc::new(n_genomes)).collect(); let mut accs: Vec<SeqAcc> =
(0..n_seqs).map(|_| SeqAcc::new(n_genomes)).collect();
// [seq_idx][genome_idx][kmer_position] — allocated only with --detail
let mut cov: Vec<Vec<Vec<u32>>> = if args.detail {
batch.n_kmers.iter()
.map(|&n| vec![vec![0u32; n as usize]; n_genomes])
.collect()
} else {
Vec::new()
};
let by_part = batch.split_by_partition(n_partitions); let by_part = batch.split_by_partition(n_partitions);
@@ -217,24 +289,40 @@ pub fn run(args: QueryArgs) {
std::process::exit(1); std::process::exit(1);
}); });
let presence = args.force_presence || !with_counts; let presence = args.force_presence || !with_counts;
let threshold = args.presence_threshold; let threshold = args.presence_threshold;
for (rsk, sk_kmer_results) in part_sks.iter().zip(kmer_results.iter()) { for (rsk, sk_kmer_results) in part_sks.iter().zip(kmer_results.iter()) {
let filtered = apply_findere(sk_kmer_results, effective_z, n_genomes);
let descs = batch.map.get(*rsk).expect("rsk must be in map"); let descs = batch.map.get(*rsk).expect("rsk must be in map");
for desc in descs { for desc in descs {
let acc = &mut accs[desc.seq_idx as usize]; let acc = &mut accs[desc.seq_idx as usize];
for hit in sk_kmer_results.iter() {
for (local_pos, hit) in filtered.iter().enumerate() {
match hit { match hit {
None => acc.kmer_missing += 1, None => {
// Only truly missing if the index also had no entry.
if sk_kmer_results[local_pos].is_none() {
acc.kmer_missing += 1;
}
}
Some(row) => { Some(row) => {
acc.kmer_count += 1; acc.kmer_count += 1;
for (g, &v) in row.iter().enumerate() { for (g, &v) in row.iter().enumerate() {
acc.genome_totals[g] += if presence { if v == 0 {
continue;
}
let contribution = if presence {
u32::from(v >= threshold) u32::from(v >= threshold)
} else { } else {
v v
}; };
acc.genome_totals[g] += contribution;
if args.detail {
let abs_pos = desc.kmer_offset as usize + local_pos;
cov[desc.seq_idx as usize][g][abs_pos] += contribution;
}
} }
} }
} }
@@ -243,7 +331,11 @@ pub fn run(args: QueryArgs) {
} }
} }
emit_batch(&batch, &accs, idx.meta(), args.count_missing, &mut out); emit_batch(
&batch, &accs, idx.meta(),
args.count_missing, args.detail, &cov,
&mut out,
);
} }
} }
} }
@@ -251,11 +343,13 @@ pub fn run(args: QueryArgs) {
// ── Output ──────────────────────────────────────────────────────────────────── // ── Output ────────────────────────────────────────────────────────────────────
fn emit_batch( fn emit_batch(
batch: &QueryBatch, batch: &QueryBatch,
accs: &[SeqAcc], accs: &[SeqAcc],
meta: &obikindex::meta::IndexMeta, meta: &obikindex::meta::IndexMeta,
count_missing: bool, count_missing: bool,
out: &mut impl Write, detail: bool,
cov: &[Vec<Vec<u32>>],
out: &mut impl Write,
) { ) {
for (seq_idx, (id, seq)) in batch.ids.iter().zip(batch.seqs.iter()).enumerate() { for (seq_idx, (id, seq)) in batch.ids.iter().zip(batch.seqs.iter()).enumerate() {
let acc = &accs[seq_idx]; let acc = &accs[seq_idx];
@@ -265,12 +359,23 @@ fn emit_batch(
if count_missing { if count_missing {
ann.insert("kmer_missing".into(), acc.kmer_missing.into()); ann.insert("kmer_missing".into(), acc.kmer_missing.into());
} }
let mut match_map = serde_json::Map::new(); let mut match_map = serde_json::Map::new();
for (g, genome) in meta.genomes.iter().enumerate() { for (g, genome) in meta.genomes.iter().enumerate() {
match_map.insert(genome.label.clone(), acc.genome_totals[g].into()); match_map.insert(genome.label.clone(), acc.genome_totals[g].into());
} }
ann.insert("kmer_strict_matches".into(), match_map.into()); ann.insert("kmer_strict_matches".into(), match_map.into());
if detail && !cov.is_empty() {
let mut cov_map = serde_json::Map::new();
for (g, genome) in meta.genomes.iter().enumerate() {
let v: Vec<serde_json::Value> =
cov[seq_idx][g].iter().map(|&x| x.into()).collect();
cov_map.insert(genome.label.clone(), v.into());
}
ann.insert("coverage".into(), cov_map.into());
}
let ann_str = serde_json::to_string(&ann).unwrap_or_else(|_| "{}".to_string()); let ann_str = serde_json::to_string(&ann).unwrap_or_else(|_| "{}".to_string());
// OBITools4 FASTA format: >id {"key":value,...} // OBITools4 FASTA format: >id {"key":value,...}
+3 -3
View File
@@ -2,7 +2,7 @@ use std::path::PathBuf;
use clap::Args; use clap::Args;
use obikindex::KmerIndex; use obikindex::KmerIndex;
use obilayeredmap::EvidenceKind; use obilayeredmap::IndexMode;
use obisys::Reporter; use obisys::Reporter;
use tracing::info; use tracing::info;
@@ -41,10 +41,10 @@ pub fn run(args: ReindexArgs) {
let target = if args.approx { let target = if args.approx {
let (z, b, fp) = resolve_approx_params(args.findere_z, args.evidence_bits, args.fp); let (z, b, fp) = resolve_approx_params(args.findere_z, args.evidence_bits, args.fp);
info!("target: approximate evidence — b={b}, z={z}, fp={fp:.2e}"); info!("target: approximate evidence — b={b}, z={z}, fp={fp:.2e}");
EvidenceKind::Approx { b, z } IndexMode::Approx { b, z }
} else { } else {
info!("target: exact evidence"); info!("target: exact evidence");
EvidenceKind::Exact IndexMode::Exact
}; };
let mut idx = KmerIndex::open(&args.index).unwrap_or_else(|e| { let mut idx = KmerIndex::open(&args.index).unwrap_or_else(|e| {
+91
View File
@@ -0,0 +1,91 @@
use std::path::PathBuf;
use clap::Args;
use obikindex::{validate_label, KmerIndex};
use tracing::info;
#[derive(Args)]
pub struct UtilsArgs {
/// Index directory to operate on
pub index: PathBuf,
/// Set a new genome label: NEW_LABEL=OLD_LABEL
#[arg(long, value_name = "NEW=OLD")]
pub new_label: Option<String>,
}
pub fn run(args: UtilsArgs) {
let mut any = false;
if let Some(spec) = &args.new_label {
any = true;
run_rename(&args.index, spec);
}
if !any {
eprintln!("utils: no operation specified. Available options: --new-label NEW=OLD");
std::process::exit(1);
}
}
fn run_rename(index_path: &PathBuf, spec: &str) {
let (old_label, new_label) = parse_rename_spec(spec);
let mut idx = KmerIndex::open(index_path).unwrap_or_else(|e| {
eprintln!("error opening index: {e}");
std::process::exit(1);
});
let pos = idx
.meta()
.genomes
.iter()
.position(|g| g.label == old_label)
.unwrap_or_else(|| {
eprintln!("error: genome '{old_label}' not found in index");
std::process::exit(1);
});
validate_label(&new_label).unwrap_or_else(|e| {
eprintln!("error: --new-label: {e}");
std::process::exit(1);
});
// Check the new label is not already taken.
if idx.meta().genomes.iter().any(|g| g.label == new_label) {
eprintln!("error: label '{new_label}' already exists in index");
std::process::exit(1);
}
idx.meta_mut().genomes[pos].label = new_label.clone();
idx.meta_mut().write(index_path).unwrap_or_else(|e| {
eprintln!("error writing index metadata: {e}");
std::process::exit(1);
});
// Rename the spectrum file if it exists.
let spectrums_dir = index_path.join("spectrums");
let old_spectrum = spectrums_dir.join(format!("{old_label}.json"));
let new_spectrum = spectrums_dir.join(format!("{new_label}.json"));
if old_spectrum.exists() {
std::fs::rename(&old_spectrum, &new_spectrum).unwrap_or_else(|e| {
eprintln!("warning: could not rename spectrum file: {e}");
});
}
info!("renamed genome '{old_label}' → '{new_label}'");
}
fn parse_rename_spec(spec: &str) -> (String, String) {
let eq = spec.find('=').unwrap_or_else(|| {
eprintln!("error: --new-label expects NEW_LABEL=OLD_LABEL, got '{spec}'");
std::process::exit(1);
});
let new = spec[..eq].trim().to_string();
let old = spec[eq + 1..].trim().to_string();
if old.is_empty() || new.is_empty() {
eprintln!("error: --new-label: both old and new labels must be non-empty");
std::process::exit(1);
}
(old, new)
}
+3
View File
@@ -36,6 +36,8 @@ enum Commands {
Estimate(cmd::estimate::EstimateArgs), Estimate(cmd::estimate::EstimateArgs),
/// Convert an index's evidence in-place: exact ↔ approx /// Convert an index's evidence in-place: exact ↔ approx
Reindex(cmd::reindex::ReindexArgs), Reindex(cmd::reindex::ReindexArgs),
/// Miscellaneous index utilities (--rename, …)
Utils(cmd::utils::UtilsArgs),
} }
fn main() { fn main() {
@@ -68,6 +70,7 @@ fn main() {
Commands::Unitig(args) => cmd::unitig::run(args), Commands::Unitig(args) => cmd::unitig::run(args),
Commands::Estimate(args) => cmd::estimate::run(args), Commands::Estimate(args) => cmd::estimate::run(args),
Commands::Reindex(args) => cmd::reindex::run(args), Commands::Reindex(args) => cmd::reindex::run(args),
Commands::Utils(args) => cmd::utils::run(args),
} }
#[cfg(feature = "profiling")] #[cfg(feature = "profiling")]
+12 -7
View File
@@ -1,8 +1,8 @@
use obicompactvec::{PersistentBitMatrix, PersistentCompactIntMatrix}; use obicompactvec::{PersistentBitMatrix, PersistentCompactIntMatrix};
use obikseq::CanonicalKmer; use obikseq::CanonicalKmer;
use obiskio::{SKError, SKResult, UnitigFileReader}; use obiskio::{SKError, SKResult, UnitigFileReader};
use obilayeredmap::OLMError; use obilayeredmap::{IndexMode, MphfLayer, OLMError};
use obilayeredmap::MphfLayer; use obilayeredmap::meta::PartitionMeta;
use crate::partition::KmerPartition; use crate::partition::KmerPartition;
@@ -35,15 +35,16 @@ impl KmerPartition {
return Ok(()); return Ok(());
} }
// Discover layers by probing layer_0, layer_1, … until one is absent. let index_mode = PartitionMeta::load(&index_dir)
// PartitionMeta (meta.json) is only created by the merge path, not by .map(|m| m.mode)
// the initial single-genome build, so we cannot rely on it here. .unwrap_or(IndexMode::Exact);
let mut l = 0; let mut l = 0;
loop { loop {
let layer_dir = index_dir.join(format!("layer_{l}")); let layer_dir = index_dir.join(format!("layer_{l}"));
if !layer_dir.exists() { break; } if !layer_dir.exists() { break; }
l += 1; l += 1;
let mphf = MphfLayer::open(&layer_dir).map_err(olm_to_sk)?; let mphf = MphfLayer::open(&layer_dir, &index_mode).map_err(olm_to_sk)?;
let reader = UnitigFileReader::open_sequential(&layer_dir.join("unitigs.bin"))?; let reader = UnitigFileReader::open_sequential(&layer_dir.join("unitigs.bin"))?;
let counts_dir = layer_dir.join("counts"); let counts_dir = layer_dir.join("counts");
@@ -92,11 +93,15 @@ impl KmerPartition {
return Ok(()); return Ok(());
} }
let index_mode = PartitionMeta::load(&index_dir)
.map(|m| m.mode)
.unwrap_or(IndexMode::Exact);
let mut layer = 0; let mut layer = 0;
loop { loop {
let layer_dir = index_dir.join(format!("layer_{layer}")); let layer_dir = index_dir.join(format!("layer_{layer}"));
if !layer_dir.exists() { break; } if !layer_dir.exists() { break; }
let mphf = MphfLayer::open(&layer_dir).map_err(olm_to_sk)?; let mphf = MphfLayer::open(&layer_dir, &index_mode).map_err(olm_to_sk)?;
let reader = UnitigFileReader::open_sequential(&layer_dir.join("unitigs.bin"))?; let reader = UnitigFileReader::open_sequential(&layer_dir.join("unitigs.bin"))?;
let counts_dir = layer_dir.join("counts"); let counts_dir = layer_dir.join("counts");
+5 -7
View File
@@ -5,7 +5,7 @@ use cacheline_ef::{CachelineEf, CachelineEfVec};
use epserde::prelude::*; use epserde::prelude::*;
use obicompactvec::{PersistentCompactIntMatrix, PersistentCompactIntVec}; use obicompactvec::{PersistentCompactIntMatrix, PersistentCompactIntVec};
use obidebruinj::GraphDeBruijn; use obidebruinj::GraphDeBruijn;
use obilayeredmap::{EvidenceKind, OLMError, layer::Layer}; use obilayeredmap::{IndexMode, OLMError, layer::Layer};
use obilayeredmap::meta::PartitionMeta; use obilayeredmap::meta::PartitionMeta;
use obiskio::{SKError, SKFileMeta, SKFileReader}; use obiskio::{SKError, SKFileMeta, SKFileReader};
use ptr_hash::{PtrHash, bucket_fn::CubicEps, hash::Xx64}; use ptr_hash::{PtrHash, bucket_fn::CubicEps, hash::Xx64};
@@ -44,7 +44,7 @@ impl KmerPartition {
min_ab: u32, min_ab: u32,
max_ab: Option<u32>, max_ab: Option<u32>,
with_counts: bool, with_counts: bool,
evidence: &EvidenceKind, mode: &IndexMode,
block_bits: u8, block_bits: u8,
) -> Result<usize, SKError> { ) -> Result<usize, SKError> {
let part_dir = self.part_dir(i); let part_dir = self.part_dir(i);
@@ -110,7 +110,7 @@ impl KmerPartition {
uw.close()?; uw.close()?;
if with_counts { if with_counts {
Layer::<PersistentCompactIntMatrix>::build(&layer_dir, block_bits, evidence, |kmer| { Layer::<PersistentCompactIntMatrix>::build(&layer_dir, block_bits, mode, |kmer| {
match (&mphf1_opt, &counts1_opt) { match (&mphf1_opt, &counts1_opt) {
(Some(mphf), Some(counts)) => counts.get(mphf.index(&kmer.raw())), (Some(mphf), Some(counts)) => counts.get(mphf.index(&kmer.raw())),
_ => 1, _ => 1,
@@ -118,13 +118,11 @@ impl KmerPartition {
}) })
.map_err(olm_to_sk)?; .map_err(olm_to_sk)?;
} else { } else {
Layer::<()>::build(&layer_dir, block_bits, evidence).map_err(olm_to_sk)?; Layer::<()>::build(&layer_dir, block_bits, mode).map_err(olm_to_sk)?;
} }
// Write meta.json in the index/ directory so LayeredMap::open works
// (e.g. for subsequent merge operations).
let index_dir = layer_dir.parent().expect("layer_dir has a parent"); let index_dir = layer_dir.parent().expect("layer_dir has a parent");
PartitionMeta { n_layers: 1 }.save(index_dir).map_err(olm_to_sk)?; PartitionMeta { n_layers: 1, mode: mode.clone() }.save(index_dir).map_err(olm_to_sk)?;
Ok(n_kmers) Ok(n_kmers)
} }
+10 -11
View File
@@ -9,7 +9,7 @@ use obicompactvec::{PersistentBitMatrix, PersistentBitMatrixBuilder,
PersistentCompactIntVecBuilder}; PersistentCompactIntVecBuilder};
use obikseq::CanonicalKmer; use obikseq::CanonicalKmer;
use obiskio::{SKError, SKResult, UnitigFileReader}; use obiskio::{SKError, SKResult, UnitigFileReader};
use obilayeredmap::{EvidenceKind, Layer, LayeredMap, MphfLayer, OLMError}; use obilayeredmap::{IndexMode, Layer, LayeredMap, MphfLayer, OLMError};
use obilayeredmap::meta::PartitionMeta; use obilayeredmap::meta::PartitionMeta;
use crate::partition::KmerPartition; use crate::partition::KmerPartition;
@@ -52,18 +52,17 @@ pub(crate) enum SrcLayerData {
} }
impl SrcLayerData { impl SrcLayerData {
pub(crate) fn open(layer_dir: &Path, mode: MergeMode) -> SKResult<Self> { pub(crate) fn open(layer_dir: &Path, merge_mode: MergeMode, index_mode: &IndexMode) -> SKResult<Self> {
let presence_dir = layer_dir.join("presence"); let presence_dir = layer_dir.join("presence");
let counts_dir = layer_dir.join("counts"); let counts_dir = layer_dir.join("counts");
match mode { match merge_mode {
MergeMode::Presence => { MergeMode::Presence => {
if presence_dir.exists() { if presence_dir.exists() {
let mphf = MphfLayer::open(layer_dir).map_err(olm_to_sk)?; let mphf = MphfLayer::open(layer_dir, index_mode).map_err(olm_to_sk)?;
let mat = PersistentBitMatrix::open(&presence_dir).map_err(SKError::Io)?; let mat = PersistentBitMatrix::open(&presence_dir).map_err(SKError::Io)?;
Ok(SrcLayerData::Presence(mphf, mat)) Ok(SrcLayerData::Presence(mphf, mat))
} else if counts_dir.exists() { } else if counts_dir.exists() {
// Source is a count index; treat count > 0 as present via ColBuilder::Bit. let mphf = MphfLayer::open(layer_dir, index_mode).map_err(olm_to_sk)?;
let mphf = MphfLayer::open(layer_dir).map_err(olm_to_sk)?;
let mat = PersistentCompactIntMatrix::open(&counts_dir).map_err(SKError::Io)?; let mat = PersistentCompactIntMatrix::open(&counts_dir).map_err(SKError::Io)?;
Ok(SrcLayerData::Count(mphf, mat)) Ok(SrcLayerData::Count(mphf, mat))
} else { } else {
@@ -72,7 +71,7 @@ impl SrcLayerData {
} }
MergeMode::Count => { MergeMode::Count => {
if counts_dir.exists() { if counts_dir.exists() {
let mphf = MphfLayer::open(layer_dir).map_err(olm_to_sk)?; let mphf = MphfLayer::open(layer_dir, index_mode).map_err(olm_to_sk)?;
let mat = PersistentCompactIntMatrix::open(&counts_dir).map_err(SKError::Io)?; let mat = PersistentCompactIntMatrix::open(&counts_dir).map_err(SKError::Io)?;
Ok(SrcLayerData::Count(mphf, mat)) Ok(SrcLayerData::Count(mphf, mat))
} else { } else {
@@ -116,7 +115,7 @@ fn load_meta(dir: &Path) -> SKResult<PartitionMeta> {
Err(e) if matches!(e, OLMError::Io(ref io_e) if io_e.kind() == std::io::ErrorKind::NotFound) => { Err(e) if matches!(e, OLMError::Io(ref io_e) if io_e.kind() == std::io::ErrorKind::NotFound) => {
let mut n = 0usize; let mut n = 0usize;
while dir.join(format!("layer_{n}")).exists() { n += 1; } while dir.join(format!("layer_{n}")).exists() { n += 1; }
let m = PartitionMeta { n_layers: n }; let m = PartitionMeta { n_layers: n, mode: IndexMode::default() };
m.save(dir).map_err(olm_to_sk)?; m.save(dir).map_err(olm_to_sk)?;
Ok(m) Ok(m)
} }
@@ -217,12 +216,12 @@ impl KmerPartition {
uw.write(&unitig)?; uw.write(&unitig)?;
} }
uw.close()?; uw.close()?;
Layer::<()>::build(&new_layer_dir, block_bits, &EvidenceKind::Exact).map_err(olm_to_sk)?; Layer::<()>::build(&new_layer_dir, block_bits, &IndexMode::Exact).map_err(olm_to_sk)?;
} }
drop(g); drop(g);
let new_mphf = if any_new { let new_mphf = if any_new {
Some(MphfLayer::open(&new_layer_dir).map_err(olm_to_sk)?) Some(MphfLayer::open(&new_layer_dir, &IndexMode::Exact).map_err(olm_to_sk)?)
} else { } else {
None None
}; };
@@ -304,7 +303,7 @@ impl KmerPartition {
for l in 0..src_meta.n_layers { for l in 0..src_meta.n_layers {
let src_layer_dir = src_index_dir.join(format!("layer_{l}")); let src_layer_dir = src_index_dir.join(format!("layer_{l}"));
let reader = UnitigFileReader::open_sequential(&src_layer_dir.join("unitigs.bin"))?; let reader = UnitigFileReader::open_sequential(&src_layer_dir.join("unitigs.bin"))?;
let src_data = SrcLayerData::open(&src_layer_dir, mode)?; let src_data = SrcLayerData::open(&src_layer_dir, mode, &src_meta.mode)?;
for (kmer, _, _) in reader.iter_indexed_canonical_kmers() { for (kmer, _, _) in reader.iter_indexed_canonical_kmers() {
let values = src_data.lookup(kmer, *src_n); let values = src_data.lookup(kmer, *src_n);
+7 -7
View File
@@ -3,7 +3,7 @@ use std::path::Path;
use obicompactvec::{PersistentBitMatrix, PersistentCompactIntMatrix}; use obicompactvec::{PersistentBitMatrix, PersistentCompactIntMatrix};
use obikseq::{CanonicalKmer, RoutableSuperKmer}; use obikseq::{CanonicalKmer, RoutableSuperKmer};
use obiskio::{SKError, SKResult}; use obiskio::{SKError, SKResult};
use obilayeredmap::{MphfLayer, OLMError}; use obilayeredmap::{IndexMode, MphfLayer, OLMError};
use obilayeredmap::meta::PartitionMeta; use obilayeredmap::meta::PartitionMeta;
use crate::partition::KmerPartition; use crate::partition::KmerPartition;
@@ -27,25 +27,25 @@ enum QueryLayer {
} }
impl QueryLayer { impl QueryLayer {
fn open(layer_dir: &Path, with_counts: bool) -> SKResult<Self> { fn open(layer_dir: &Path, with_counts: bool, mode: &IndexMode) -> SKResult<Self> {
let presence_dir = layer_dir.join("presence"); let presence_dir = layer_dir.join("presence");
let counts_dir = layer_dir.join("counts"); let counts_dir = layer_dir.join("counts");
if with_counts && counts_dir.exists() { if with_counts && counts_dir.exists() {
let mphf = MphfLayer::open(layer_dir).map_err(olm_to_sk)?; let mphf = MphfLayer::open(layer_dir, mode).map_err(olm_to_sk)?;
let mat = PersistentCompactIntMatrix::open(&counts_dir).map_err(SKError::Io)?; let mat = PersistentCompactIntMatrix::open(&counts_dir).map_err(SKError::Io)?;
Ok(QueryLayer::Count(mphf, mat)) Ok(QueryLayer::Count(mphf, mat))
} else if presence_dir.exists() { } else if presence_dir.exists() {
let mphf = MphfLayer::open(layer_dir).map_err(olm_to_sk)?; let mphf = MphfLayer::open(layer_dir, mode).map_err(olm_to_sk)?;
let mat = PersistentBitMatrix::open(&presence_dir).map_err(SKError::Io)?; let mat = PersistentBitMatrix::open(&presence_dir).map_err(SKError::Io)?;
Ok(QueryLayer::Presence(mphf, mat)) Ok(QueryLayer::Presence(mphf, mat))
} else if counts_dir.exists() { } else if counts_dir.exists() {
// presence query on a count index — return counts as-is // presence query on a count index — return counts as-is
let mphf = MphfLayer::open(layer_dir).map_err(olm_to_sk)?; let mphf = MphfLayer::open(layer_dir, mode).map_err(olm_to_sk)?;
let mat = PersistentCompactIntMatrix::open(&counts_dir).map_err(SKError::Io)?; let mat = PersistentCompactIntMatrix::open(&counts_dir).map_err(SKError::Io)?;
Ok(QueryLayer::Count(mphf, mat)) Ok(QueryLayer::Count(mphf, mat))
} else { } else {
let mphf = MphfLayer::open(layer_dir).map_err(olm_to_sk)?; let mphf = MphfLayer::open(layer_dir, mode).map_err(olm_to_sk)?;
Ok(QueryLayer::SetOnly(mphf)) Ok(QueryLayer::SetOnly(mphf))
} }
} }
@@ -102,7 +102,7 @@ impl KmerPartition {
let meta = PartitionMeta::load(&index_dir).map_err(olm_to_sk)?; let meta = PartitionMeta::load(&index_dir).map_err(olm_to_sk)?;
let layers: Vec<QueryLayer> = (0..meta.n_layers) let layers: Vec<QueryLayer> = (0..meta.n_layers)
.map(|i| QueryLayer::open(&index_dir.join(format!("layer_{i}")), with_counts)) .map(|i| QueryLayer::open(&index_dir.join(format!("layer_{i}")), with_counts, &meta.mode))
.collect::<SKResult<_>>()?; .collect::<SKResult<_>>()?;
Ok(superkmers Ok(superkmers
+7 -7
View File
@@ -8,7 +8,7 @@ use obicompactvec::{PersistentBitMatrixBuilder,
PersistentCompactIntVecBuilder}; PersistentCompactIntVecBuilder};
use obidebruinj::GraphDeBruijn; use obidebruinj::GraphDeBruijn;
use obiskio::{SKError, SKResult, UnitigFileReader}; use obiskio::{SKError, SKResult, UnitigFileReader};
use obilayeredmap::{EvidenceKind, Layer, MphfLayer, OLMError}; use obilayeredmap::{IndexMode, Layer, MphfLayer, OLMError};
use obilayeredmap::meta::PartitionMeta; use obilayeredmap::meta::PartitionMeta;
use crate::filter::KmerFilter; use crate::filter::KmerFilter;
@@ -67,7 +67,7 @@ fn load_meta(dir: &Path) -> SKResult<PartitionMeta> {
Err(e) if matches!(e, OLMError::Io(ref io_e) if io_e.kind() == std::io::ErrorKind::NotFound) => { Err(e) if matches!(e, OLMError::Io(ref io_e) if io_e.kind() == std::io::ErrorKind::NotFound) => {
let mut n = 0usize; let mut n = 0usize;
while dir.join(format!("layer_{n}")).exists() { n += 1; } while dir.join(format!("layer_{n}")).exists() { n += 1; }
let m = PartitionMeta { n_layers: n }; let m = PartitionMeta { n_layers: n, mode: IndexMode::default() };
m.save(dir).map_err(olm_to_sk)?; m.save(dir).map_err(olm_to_sk)?;
Ok(m) Ok(m)
} }
@@ -117,7 +117,7 @@ impl KmerPartition {
if !unitigs_path.exists() { continue; } if !unitigs_path.exists() { continue; }
let reader = UnitigFileReader::open_sequential(&unitigs_path)?; let reader = UnitigFileReader::open_sequential(&unitigs_path)?;
let src_data = SrcLayerData::open(&src_layer_dir, mode)?; let src_data = SrcLayerData::open(&src_layer_dir, mode, &src_meta.mode)?;
for (kmer, _, _) in reader.iter_indexed_canonical_kmers() { for (kmer, _, _) in reader.iter_indexed_canonical_kmers() {
let row = src_data.lookup(kmer, n_genomes); let row = src_data.lookup(kmer, n_genomes);
@@ -146,8 +146,8 @@ impl KmerPartition {
uw.close()?; uw.close()?;
drop(g); drop(g);
Layer::<()>::build(&dst_layer_dir, block_bits, &EvidenceKind::Exact).map_err(olm_to_sk)?; Layer::<()>::build(&dst_layer_dir, block_bits, &IndexMode::Exact).map_err(olm_to_sk)?;
let dst_mphf = MphfLayer::open(&dst_layer_dir).map_err(olm_to_sk)?; let dst_mphf = MphfLayer::open(&dst_layer_dir, &IndexMode::Exact).map_err(olm_to_sk)?;
// ── Prepare matrix builders (one column per genome) ─────────────────── // ── Prepare matrix builders (one column per genome) ───────────────────
let data_dir = match mode { let data_dir = match mode {
@@ -182,7 +182,7 @@ impl KmerPartition {
if !unitigs_path.exists() { continue; } if !unitigs_path.exists() { continue; }
let reader = UnitigFileReader::open_sequential(&unitigs_path)?; let reader = UnitigFileReader::open_sequential(&unitigs_path)?;
let src_data = SrcLayerData::open(&src_layer_dir, mode)?; let src_data = SrcLayerData::open(&src_layer_dir, mode, &src_meta.mode)?;
for (kmer, _, _) in reader.iter_indexed_canonical_kmers() { for (kmer, _, _) in reader.iter_indexed_canonical_kmers() {
let row = src_data.lookup(kmer, n_genomes); let row = src_data.lookup(kmer, n_genomes);
@@ -199,7 +199,7 @@ impl KmerPartition {
for b in builders { b.close()?; } for b in builders { b.close()?; }
write_matrix_meta(&data_dir, n_new, n_genomes).map_err(SKError::Io)?; write_matrix_meta(&data_dir, n_new, n_genomes).map_err(SKError::Io)?;
PartitionMeta { n_layers: 1 }.save(&dst_index_dir).map_err(olm_to_sk)?; PartitionMeta { n_layers: 1, mode: IndexMode::Exact }.save(&dst_index_dir).map_err(olm_to_sk)?;
Ok(()) Ok(())
} }
+11 -16
View File
@@ -10,7 +10,7 @@ use obikseq::CanonicalKmer;
use obiskio::{UnitigFileReader, UnitigFileWriter}; use obiskio::{UnitigFileReader, UnitigFileWriter};
use crate::error::{OLMError, OLMResult}; use crate::error::{OLMError, OLMResult};
use crate::meta::EvidenceKind; use crate::meta::IndexMode;
use crate::mphf_layer::MphfLayer; use crate::mphf_layer::MphfLayer;
pub(crate) use crate::mphf_layer::UNITIGS_FILE; pub(crate) use crate::mphf_layer::UNITIGS_FILE;
@@ -62,8 +62,8 @@ pub struct Hit<T = ()> {
// ── Common read path ────────────────────────────────────────────────────────── // ── Common read path ──────────────────────────────────────────────────────────
impl<D: LayerData> Layer<D> { impl<D: LayerData> Layer<D> {
pub fn open(path: &Path) -> OLMResult<Self> { pub fn open(path: &Path, mode: &IndexMode) -> OLMResult<Self> {
let mphf = MphfLayer::open(path)?; let mphf = MphfLayer::open(path, mode)?;
let data = D::open(path)?; let data = D::open(path)?;
Ok(Self { mphf, data }) Ok(Self { mphf, data })
} }
@@ -92,18 +92,13 @@ impl<D: LayerData> Layer<D> {
MphfLayer::build_approx_evidence(layer_dir, b, z) MphfLayer::build_approx_evidence(layer_dir, b, z)
} }
/// Dispatch to `build_exact_evidence` or `build_approx_evidence`.
/// `block_bits` is forwarded to exact evidence only.
pub fn build_evidence(layer_dir: &Path, kind: &EvidenceKind, block_bits: u8) -> OLMResult<usize> {
MphfLayer::build_evidence(layer_dir, kind, block_bits)
}
} }
// ── Mode 1 — set membership ─────────────────────────────────────────────────── // ── Mode 1 — set membership ───────────────────────────────────────────────────
impl Layer<()> { impl Layer<()> {
pub fn build(out_dir: &Path, block_bits: u8, evidence_kind: &EvidenceKind) -> OLMResult<usize> { pub fn build(out_dir: &Path, block_bits: u8, mode: &IndexMode) -> OLMResult<usize> {
MphfLayer::build(out_dir, block_bits, evidence_kind, &mut |_, _| Ok(())) MphfLayer::build(out_dir, block_bits, mode, &mut |_, _| Ok(()))
} }
/// Create a presence matrix for a set-membership layer (first merge). /// Create a presence matrix for a set-membership layer (first merge).
@@ -126,7 +121,7 @@ impl Layer<PersistentCompactIntMatrix> {
pub fn build( pub fn build(
out_dir: &Path, out_dir: &Path,
block_bits: u8, block_bits: u8,
evidence_kind: &EvidenceKind, mode: &IndexMode,
count_of: impl Fn(CanonicalKmer) -> u32, count_of: impl Fn(CanonicalKmer) -> u32,
) -> OLMResult<usize> { ) -> OLMResult<usize> {
let n = UnitigFileReader::open_sequential(&out_dir.join(UNITIGS_FILE))?.n_kmers(); let n = UnitigFileReader::open_sequential(&out_dir.join(UNITIGS_FILE))?.n_kmers();
@@ -134,7 +129,7 @@ impl Layer<PersistentCompactIntMatrix> {
let mut mb = PersistentCompactIntMatrixBuilder::new(n, &counts_dir) let mut mb = PersistentCompactIntMatrixBuilder::new(n, &counts_dir)
.map_err(OLMError::Io)?; .map_err(OLMError::Io)?;
let mut col = mb.add_col().map_err(OLMError::Io)?; let mut col = mb.add_col().map_err(OLMError::Io)?;
let n_built = MphfLayer::build(out_dir, block_bits, evidence_kind, &mut |slot, kmer| { let n_built = MphfLayer::build(out_dir, block_bits, mode, &mut |slot, kmer| {
col.set(slot, count_of(kmer)); col.set(slot, count_of(kmer));
Ok(()) Ok(())
})?; })?;
@@ -146,10 +141,10 @@ impl Layer<PersistentCompactIntMatrix> {
pub fn build_from_map( pub fn build_from_map(
out_dir: &Path, out_dir: &Path,
block_bits: u8, block_bits: u8,
evidence_kind: &EvidenceKind, mode: &IndexMode,
counts: &HashMap<CanonicalKmer, u32>, counts: &HashMap<CanonicalKmer, u32>,
) -> OLMResult<usize> { ) -> OLMResult<usize> {
Self::build(out_dir, block_bits, evidence_kind, |kmer| counts.get(&kmer).copied().unwrap_or(0)) Self::build(out_dir, block_bits, mode, |kmer| counts.get(&kmer).copied().unwrap_or(0))
} }
} }
@@ -179,7 +174,7 @@ impl Layer<PersistentBitMatrix> {
pub fn build_presence( pub fn build_presence(
out_dir: &Path, out_dir: &Path,
block_bits: u8, block_bits: u8,
evidence_kind: &EvidenceKind, mode: &IndexMode,
n_genomes: usize, n_genomes: usize,
present_in: impl Fn(CanonicalKmer, usize) -> bool, present_in: impl Fn(CanonicalKmer, usize) -> bool,
) -> OLMResult<usize> { ) -> OLMResult<usize> {
@@ -189,7 +184,7 @@ impl Layer<PersistentBitMatrix> {
let mut cols: Vec<_> = (0..n_genomes) let mut cols: Vec<_> = (0..n_genomes)
.map(|_| mb.add_col().map_err(OLMError::Io)) .map(|_| mb.add_col().map_err(OLMError::Io))
.collect::<OLMResult<_>>()?; .collect::<OLMResult<_>>()?;
let n_built = MphfLayer::build(out_dir, block_bits, evidence_kind, &mut |slot, kmer| { let n_built = MphfLayer::build(out_dir, block_bits, mode, &mut |slot, kmer| {
for (g, col) in cols.iter_mut().enumerate() { for (g, col) in cols.iter_mut().enumerate() {
col.set(slot, present_in(kmer, g)); col.set(slot, present_in(kmer, g));
} }
+1 -1
View File
@@ -11,5 +11,5 @@ pub use error::{OLMError, OLMResult};
pub use layer::{Hit, Layer, LayerData}; pub use layer::{Hit, Layer, LayerData};
pub use layered_store::LayeredStore; pub use layered_store::LayeredStore;
pub use map::LayeredMap; pub use map::LayeredMap;
pub use meta::{EvidenceKind, LayerMeta}; pub use meta::{IndexMode, PartitionMeta};
pub use mphf_layer::MphfLayer; pub use mphf_layer::MphfLayer;
+17 -36
View File
@@ -5,11 +5,10 @@ use std::path::{Path, PathBuf};
use obicompactvec::PersistentCompactIntMatrix; use obicompactvec::PersistentCompactIntMatrix;
use obikseq::CanonicalKmer; use obikseq::CanonicalKmer;
use obiskio::{UnitigFileWriter, DEFAULT_BLOCK_BITS}; use obiskio::{UnitigFileWriter, DEFAULT_BLOCK_BITS};
use crate::meta::EvidenceKind;
use crate::error::OLMResult; use crate::error::OLMResult;
use crate::layer::{Hit, Layer, LayerData}; use crate::layer::{Hit, Layer, LayerData};
use crate::meta::PartitionMeta; use crate::meta::{IndexMode, PartitionMeta};
/// Layered kmer index for a single partition. /// Layered kmer index for a single partition.
/// ///
@@ -17,8 +16,8 @@ use crate::meta::PartitionMeta;
/// the first match wins. Adding a dataset appends a new layer without /// the first match wins. Adding a dataset appends a new layer without
/// rebuilding existing ones. /// rebuilding existing ones.
pub struct LayeredMap<D: LayerData = ()> { pub struct LayeredMap<D: LayerData = ()> {
root: PathBuf, root: PathBuf,
meta: PartitionMeta, meta: PartitionMeta,
layers: Vec<Layer<D>>, layers: Vec<Layer<D>>,
} }
@@ -26,39 +25,26 @@ pub struct LayeredMap<D: LayerData = ()> {
impl<D: LayerData> LayeredMap<D> { impl<D: LayerData> LayeredMap<D> {
/// Open an existing layered index at `root`. /// Open an existing layered index at `root`.
/// The mode is read once from `PartitionMeta` and applied to all layers.
pub fn open(root: &Path) -> OLMResult<Self> { pub fn open(root: &Path) -> OLMResult<Self> {
let meta = PartitionMeta::load(root)?; let meta = PartitionMeta::load(root)?;
let layers = (0..meta.n_layers) let layers = (0..meta.n_layers)
.map(|i| Layer::<D>::open(&layer_dir(root, i))) .map(|i| Layer::<D>::open(&layer_dir(root, i), &meta.mode))
.collect::<OLMResult<Vec<_>>>()?; .collect::<OLMResult<Vec<_>>>()?;
Ok(Self { Ok(Self { root: root.to_owned(), meta, layers })
root: root.to_owned(),
meta,
layers,
})
} }
/// Create a new, empty layered index at `root`. /// Create a new, empty layered index at `root` with the given mode.
pub fn create(root: &Path) -> OLMResult<Self> { pub fn create(root: &Path, mode: IndexMode) -> OLMResult<Self> {
fs::create_dir_all(root)?; fs::create_dir_all(root)?;
let meta = PartitionMeta::new(); let meta = PartitionMeta::new(mode);
meta.save(root)?; meta.save(root)?;
Ok(Self { Ok(Self { root: root.to_owned(), meta, layers: Vec::new() })
root: root.to_owned(),
meta,
layers: Vec::new(),
})
} }
/// Return the number of layers in this index. pub fn n_layers(&self) -> usize { self.layers.len() }
pub fn n_layers(&self) -> usize { pub fn layer(&self, i: usize) -> &Layer<D> { &self.layers[i] }
self.layers.len() pub fn mode(&self) -> &IndexMode { &self.meta.mode }
}
/// Return a reference to the `i`-th layer.
pub fn layer(&self, i: usize) -> &Layer<D> {
&self.layers[i]
}
/// Query `kmer` across all layers. Returns `(layer_index, Hit)` on match. /// Query `kmer` across all layers. Returns `(layer_index, Hit)` on match.
pub fn query(&self, kmer: CanonicalKmer) -> Option<(usize, Hit<D::Item>)> { pub fn query(&self, kmer: CanonicalKmer) -> Option<(usize, Hit<D::Item>)> {
@@ -68,17 +54,15 @@ impl<D: LayerData> LayeredMap<D> {
.find_map(|(i, layer)| layer.query(kmer).map(|hit| (i, hit))) .find_map(|(i, layer)| layer.query(kmer).map(|hit| (i, hit)))
} }
/// Return a `UnitigFileWriter` for the next layer to be built.
pub fn next_layer_writer(&self) -> OLMResult<UnitigFileWriter> { pub fn next_layer_writer(&self) -> OLMResult<UnitigFileWriter> {
let dir = layer_dir(&self.root, self.layers.len()); let dir = layer_dir(&self.root, self.layers.len());
Layer::<D>::unitig_writer(&dir) Layer::<D>::unitig_writer(&dir)
} }
/// Append a new layer to the index.
fn append_layer(&mut self) -> OLMResult<()> { fn append_layer(&mut self) -> OLMResult<()> {
let i = self.layers.len(); let i = self.layers.len();
let dir = layer_dir(&self.root, i); let dir = layer_dir(&self.root, i);
self.layers.push(Layer::<D>::open(&dir)?); self.layers.push(Layer::<D>::open(&dir, &self.meta.mode)?);
self.meta.n_layers = self.layers.len(); self.meta.n_layers = self.layers.len();
self.meta.save(&self.root)?; self.meta.save(&self.root)?;
Ok(()) Ok(())
@@ -91,7 +75,7 @@ impl LayeredMap<()> {
pub fn push_layer(&mut self) -> OLMResult<usize> { pub fn push_layer(&mut self) -> OLMResult<usize> {
let i = self.layers.len(); let i = self.layers.len();
let dir = layer_dir(&self.root, i); let dir = layer_dir(&self.root, i);
Layer::<()>::build(&dir, DEFAULT_BLOCK_BITS, &EvidenceKind::Exact)?; Layer::<()>::build(&dir, DEFAULT_BLOCK_BITS, &self.meta.mode)?;
self.append_layer()?; self.append_layer()?;
Ok(i) Ok(i)
} }
@@ -103,15 +87,12 @@ impl LayeredMap<PersistentCompactIntMatrix> {
pub fn push_layer(&mut self, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize> { pub fn push_layer(&mut self, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize> {
let i = self.layers.len(); let i = self.layers.len();
let dir = layer_dir(&self.root, i); let dir = layer_dir(&self.root, i);
Layer::<PersistentCompactIntMatrix>::build(&dir, DEFAULT_BLOCK_BITS, &EvidenceKind::Exact, count_of)?; Layer::<PersistentCompactIntMatrix>::build(&dir, DEFAULT_BLOCK_BITS, &self.meta.mode, count_of)?;
self.append_layer()?; self.append_layer()?;
Ok(i) Ok(i)
} }
pub fn push_layer_from_map( pub fn push_layer_from_map(&mut self, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize> {
&mut self,
counts: &HashMap<CanonicalKmer, u32>,
) -> OLMResult<usize> {
self.push_layer(|kmer| counts.get(&kmer).copied().unwrap_or(0)) self.push_layer(|kmer| counts.get(&kmer).copied().unwrap_or(0))
} }
} }
+20 -40
View File
@@ -5,65 +5,45 @@ use serde::{Deserialize, Serialize};
use crate::error::OLMResult; use crate::error::OLMResult;
const META_FILE: &str = "meta.json"; const META_FILE: &str = "meta.json";
const LAYER_META_FILE: &str = "layer_meta.json";
// ── Layer-level metadata ────────────────────────────────────────────────────── // ── IndexMode ─────────────────────────────────────────────────────────────────
/// Describes the evidence bundle stored alongside the MPHF for one layer. /// Evidence mode for an entire partitioned index — homogeneous across all layers.
///
/// Determined once at build time; stored in `PartitionMeta` (`meta.json`).
/// All layers within an index share the same mode.
#[derive(Debug, Clone, Serialize, Deserialize)] #[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")] #[serde(tag = "type", rename_all = "snake_case")]
pub enum EvidenceKind { pub enum IndexMode {
/// Exact evidence: `evidence.bin` + `unitigs.bin.idx`. Zero false positives. /// Exact evidence: `evidence.bin` + `unitigs.bin.idx`. Zero false positives.
Exact, Exact,
/// Approximate evidence: `fingerprint.bin` only. /// Approximate evidence: `fingerprint.bin` only.
/// `b` — fingerprint bits; false-positive rate per k-mer query = 1/2^b. /// `b` — fingerprint bits per slot; false-positive rate ≈ 1/2^b per query.
/// `z` — consecutive k-mers that must all match (Findere trick); /// `z` — Findere consecutive-kmer parameter (build-time only; not used at query time).
/// effective FP rate per sequencing read ≈ W / 2^(b·z)
/// where W = L - k - z + 2 is the number of windows in a read of length L.
Approx { b: u8, z: u8 }, Approx { b: u8, z: u8 },
/// Hybrid: both `fingerprint.bin` and `evidence.bin` + `unitigs.bin.idx`.
/// `find()` uses the fingerprint (O(1), approx); `find_strict()` uses exact evidence.
Hybrid { b: u8, z: u8 },
} }
#[derive(Debug, Clone, Serialize, Deserialize)] impl Default for IndexMode {
pub struct LayerMeta {
pub evidence: EvidenceKind,
}
impl Default for EvidenceKind {
fn default() -> Self { Self::Exact } fn default() -> Self { Self::Exact }
} }
impl LayerMeta { // ── PartitionMeta ─────────────────────────────────────────────────────────────
pub fn exact() -> Self {
Self { evidence: EvidenceKind::Exact }
}
pub fn approx(b: u8, z: u8) -> Self {
Self { evidence: EvidenceKind::Approx { b, z } }
}
pub fn load(layer_dir: &Path) -> OLMResult<Self> {
let f = File::open(layer_dir.join(LAYER_META_FILE))?;
Ok(serde_json::from_reader(f)?)
}
pub fn save(&self, layer_dir: &Path) -> OLMResult<()> {
let f = File::create(layer_dir.join(LAYER_META_FILE))?;
serde_json::to_writer_pretty(f, self)?;
Ok(())
}
}
// ── Partition-level metadata ──────────────────────────────────────────────────
/// Index-level metadata stored in `meta.json` at the root of a partition index.
#[derive(Debug, Clone, Serialize, Deserialize)] #[derive(Debug, Clone, Serialize, Deserialize)]
pub struct PartitionMeta { pub struct PartitionMeta {
pub n_layers: usize, pub n_layers: usize,
#[serde(default)]
pub mode: IndexMode,
} }
impl PartitionMeta { impl PartitionMeta {
pub fn new() -> Self { pub fn new(mode: IndexMode) -> Self {
Self { n_layers: 0 } Self { n_layers: 0, mode }
} }
pub fn load(dir: &Path) -> OLMResult<Self> { pub fn load(dir: &Path) -> OLMResult<Self> {
@@ -79,5 +59,5 @@ impl PartitionMeta {
} }
impl Default for PartitionMeta { impl Default for PartitionMeta {
fn default() -> Self { Self::new() } fn default() -> Self { Self::new(IndexMode::Exact) }
} }
+138 -153
View File
@@ -1,5 +1,5 @@
use std::fs; use std::fs;
use std::path::Path; use std::path::{Path, PathBuf};
use cacheline_ef::{CachelineEf, CachelineEfVec}; use cacheline_ef::{CachelineEf, CachelineEfVec};
use epserde::prelude::*; use epserde::prelude::*;
@@ -10,7 +10,7 @@ use ptr_hash::{PtrHash, PtrHashParams, bucket_fn::CubicEps, hash::Xx64};
use crate::error::{OLMError, OLMResult}; use crate::error::{OLMError, OLMResult};
use crate::evidence::{Evidence, EvidenceWriter}; use crate::evidence::{Evidence, EvidenceWriter};
use crate::fingerprint::{FingerprintVec, FingerprintVecWriter}; use crate::fingerprint::{FingerprintVec, FingerprintVecWriter};
use crate::meta::{EvidenceKind, LayerMeta}; use crate::meta::IndexMode;
pub(crate) const MPHF_FILE: &str = "mphf.bin"; pub(crate) const MPHF_FILE: &str = "mphf.bin";
pub(crate) const UNITIGS_FILE: &str = "unitigs.bin"; pub(crate) const UNITIGS_FILE: &str = "unitigs.bin";
@@ -19,19 +19,22 @@ pub(crate) const FINGERPRINT_FILE: &str = "fingerprint.bin";
pub(crate) type Mphf = PtrHash<u64, CubicEps, CachelineEfVec<Vec<CachelineEf>>, Xx64, Vec<u8>>; pub(crate) type Mphf = PtrHash<u64, CubicEps, CachelineEfVec<Vec<CachelineEf>>, Xx64, Vec<u8>>;
// ── Evidence store ──────────────────────────────────────────────────────────── // ── LayerEvidence ────────────────────────────────────────────────────────────
enum LayerEvidence { enum LayerEvidence {
Exact { evidence: Evidence, unitigs: UnitigFileReader }, Exact { evidence: Evidence, unitigs: UnitigFileReader },
Approx { fingerprint: FingerprintVec }, Approx { fingerprint: FingerprintVec, unitigs_path: PathBuf },
Hybrid { evidence: Evidence, unitigs: UnitigFileReader, fingerprint: FingerprintVec },
} }
// ── MphfLayer ───────────────────────────────────────────────────────────────── // ── MphfLayer ─────────────────────────────────────────────────────────────────
/// Autonomous kmer → slot mapping for one layer. /// Autonomous kmer → slot mapping for one layer.
/// ///
/// Dispatches queries to exact or approximate evidence transparently based on /// Two query methods:
/// the `layer_meta.json` written at build time. /// - [`find`](Self::find) — O(1), uses fingerprint (Approx/Hybrid) or exact evidence (Exact).
/// - [`find_strict`](Self::find_strict) — always exact; O(1) on Exact/Hybrid layers,
/// O(n) sequential scan on Approx layers.
pub struct MphfLayer { pub struct MphfLayer {
mphf: Mphf, mphf: Mphf,
ev: LayerEvidence, ev: LayerEvidence,
@@ -39,21 +42,31 @@ pub struct MphfLayer {
} }
impl MphfLayer { impl MphfLayer {
pub fn open(dir: &Path) -> OLMResult<Self> { /// Open a layer using the index-level `mode` determined at `LayeredMap` open time.
let meta = LayerMeta::load(dir)?; /// No per-layer metadata file is read.
pub fn open(dir: &Path, mode: &IndexMode) -> OLMResult<Self> {
let mphf: Mphf = Mphf::load_full(&dir.join(MPHF_FILE)) let mphf: Mphf = Mphf::load_full(&dir.join(MPHF_FILE))
.map_err(|e| OLMError::InvalidLayer(e.to_string()))?; .map_err(|e| OLMError::InvalidLayer(e.to_string()))?;
let (ev, n) = match meta.evidence { let (ev, n) = match mode {
EvidenceKind::Exact => { IndexMode::Exact => {
let evidence = Evidence::open(&dir.join(EVIDENCE_FILE))?; let evidence = Evidence::open(&dir.join(EVIDENCE_FILE))?;
let n = evidence.len(); let n = evidence.len();
// open() auto-detects: uses direct access since exact layers always have .idx
let unitigs = UnitigFileReader::open(&dir.join(UNITIGS_FILE))?; let unitigs = UnitigFileReader::open(&dir.join(UNITIGS_FILE))?;
(LayerEvidence::Exact { evidence, unitigs }, n) (LayerEvidence::Exact { evidence, unitigs }, n)
} }
EvidenceKind::Approx { .. } => { IndexMode::Approx { .. } => {
let fingerprint = FingerprintVec::open(&dir.join(FINGERPRINT_FILE))?; let fingerprint = FingerprintVec::open(&dir.join(FINGERPRINT_FILE))?;
let n = fingerprint.n(); let n = fingerprint.n();
(LayerEvidence::Approx { fingerprint }, n) let unitigs_path = dir.join(UNITIGS_FILE);
(LayerEvidence::Approx { fingerprint, unitigs_path }, n)
}
IndexMode::Hybrid { .. } => {
let evidence = Evidence::open(&dir.join(EVIDENCE_FILE))?;
let fingerprint = FingerprintVec::open(&dir.join(FINGERPRINT_FILE))?;
let n = evidence.len();
let unitigs = UnitigFileReader::open(&dir.join(UNITIGS_FILE))?;
(LayerEvidence::Hybrid { evidence, unitigs, fingerprint }, n)
} }
}; };
Ok(Self { mphf, ev, n }) Ok(Self { mphf, ev, n })
@@ -61,45 +74,60 @@ impl MphfLayer {
// ── Query API ───────────────────────────────────────────────────────────── // ── Query API ─────────────────────────────────────────────────────────────
/// Transparent dispatch: routes to `find_exact` or `find_approx` based on /// O(1) lookup — dispatches automatically:
/// the evidence loaded at `open` time. /// - Exact: evidence + `verify_canonical_kmer`, zero false positives.
/// - Approx: fingerprint check, false-positive rate ≈ 1/2^b.
/// - Hybrid: fingerprint check (fast path), zero false positives via `find_strict`.
#[inline] #[inline]
pub fn find(&self, kmer: CanonicalKmer) -> Option<usize> { pub fn find(&self, kmer: CanonicalKmer) -> Option<usize> {
let slot = self.mphf.index(&kmer.raw());
if slot >= self.n { return None; }
match &self.ev { match &self.ev {
LayerEvidence::Exact { .. } => self.find_exact(kmer), LayerEvidence::Exact { evidence, unitigs } => {
LayerEvidence::Approx { .. } => self.find_approx(kmer), let (chunk_id, rank) = evidence.decode(slot);
if unitigs.verify_canonical_kmer(chunk_id as usize, rank as usize, kmer) {
Some(slot)
} else {
None
}
}
LayerEvidence::Approx { fingerprint, .. } |
LayerEvidence::Hybrid { fingerprint, .. } => {
if fingerprint.matches(slot, kmer.seq_hash()) { Some(slot) } else { None }
}
} }
} }
/// Exact lookup: zero false positives. Panics if the layer was opened with /// Always-exact lookup zero false positives regardless of mode.
/// approximate evidence. ///
#[inline] /// - Exact/Hybrid: O(1) via evidence + `verify_canonical_kmer`.
pub fn find_exact(&self, kmer: CanonicalKmer) -> Option<usize> { /// - Approx: O(n) sequential scan of `unitigs.bin` to confirm the kmer
let LayerEvidence::Exact { evidence, unitigs } = &self.ev else { /// that owns the slot, then exact comparison.
panic!("find_exact called on an approximate layer"); pub fn find_strict(&self, kmer: CanonicalKmer) -> Option<usize> {
};
let slot = self.mphf.index(&kmer.raw()); let slot = self.mphf.index(&kmer.raw());
if slot >= self.n { return None; } if slot >= self.n { return None; }
let (chunk_id, rank) = evidence.decode(slot); match &self.ev {
if unitigs.verify_canonical_kmer(chunk_id as usize, rank as usize, kmer) { LayerEvidence::Exact { evidence, unitigs } |
Some(slot) LayerEvidence::Hybrid { evidence, unitigs, .. } => {
} else { let (chunk_id, rank) = evidence.decode(slot);
None if unitigs.verify_canonical_kmer(chunk_id as usize, rank as usize, kmer) {
Some(slot)
} else {
None
}
}
LayerEvidence::Approx { unitigs_path, .. } => {
let reader = UnitigFileReader::open_sequential(unitigs_path).ok()?;
for stored in reader.iter_canonical_kmers() {
if self.mphf.index(&stored.raw()) == slot {
return if stored == kmer { Some(slot) } else { None };
}
}
None
}
} }
} }
/// Approximate lookup: false-positive rate 1/2^b per k-mer query. Panics
/// if the layer was opened with exact evidence.
#[inline]
pub fn find_approx(&self, kmer: CanonicalKmer) -> Option<usize> {
let LayerEvidence::Approx { fingerprint } = &self.ev else {
panic!("find_approx called on an exact layer");
};
let slot = self.mphf.index(&kmer.raw());
if slot >= self.n { return None; }
if fingerprint.matches(slot, kmer.seq_hash()) { Some(slot) } else { None }
}
pub fn n(&self) -> usize { self.n } pub fn n(&self) -> usize { self.n }
// ── Build helpers ───────────────────────────────────────────────────────── // ── Build helpers ─────────────────────────────────────────────────────────
@@ -109,19 +137,7 @@ impl MphfLayer {
Ok(UnitigFileWriter::create(&dir.join(UNITIGS_FILE))?) Ok(UnitigFileWriter::create(&dir.join(UNITIGS_FILE))?)
} }
/// Dispatch to `build_exact_evidence` or `build_approx_evidence` based on
/// `kind`. `block_bits` is forwarded to exact evidence only.
pub fn build_evidence(dir: &Path, kind: &EvidenceKind, block_bits: u8) -> OLMResult<usize> {
match kind {
EvidenceKind::Exact => Self::build_exact_evidence(dir, block_bits),
EvidenceKind::Approx { b, z } => Self::build_approx_evidence(dir, *b, *z),
}
}
/// Build `evidence.bin` + `unitigs.bin.idx` from `unitigs.bin` + `mphf.bin`. /// Build `evidence.bin` + `unitigs.bin.idx` from `unitigs.bin` + `mphf.bin`.
///
/// `block_bits` controls the `.idx` block size (2^block_bits chunks per block).
/// Uses sequential iteration — no `.idx` required on entry.
pub fn build_exact_evidence(dir: &Path, block_bits: u8) -> OLMResult<usize> { pub fn build_exact_evidence(dir: &Path, block_bits: u8) -> OLMResult<usize> {
let unitig_path = dir.join(UNITIGS_FILE); let unitig_path = dir.join(UNITIGS_FILE);
let unitigs = UnitigFileReader::open_sequential(&unitig_path)?; let unitigs = UnitigFileReader::open_sequential(&unitig_path)?;
@@ -130,7 +146,6 @@ impl MphfLayer {
if n == 0 { if n == 0 {
fs::File::create(dir.join(EVIDENCE_FILE))?; fs::File::create(dir.join(EVIDENCE_FILE))?;
build_unitig_idx(&unitig_path, block_bits)?; build_unitig_idx(&unitig_path, block_bits)?;
LayerMeta::exact().save(dir)?;
return Ok(0); return Ok(0);
} }
@@ -156,13 +171,10 @@ impl MphfLayer {
ev.write(&dir.join(EVIDENCE_FILE))?; ev.write(&dir.join(EVIDENCE_FILE))?;
build_unitig_idx(&unitig_path, block_bits)?; build_unitig_idx(&unitig_path, block_bits)?;
LayerMeta::exact().save(dir)?;
Ok(n) Ok(n)
} }
/// Build `fingerprint.bin` from `unitigs.bin` + `mphf.bin`. /// Build `fingerprint.bin` from `unitigs.bin` + `mphf.bin`.
/// `b` — fingerprint bits (1..=64); `z` — Findere consecutive k-mer
/// parameter (≥1). No `.idx` is written.
pub fn build_approx_evidence(dir: &Path, b: u8, z: u8) -> OLMResult<usize> { pub fn build_approx_evidence(dir: &Path, b: u8, z: u8) -> OLMResult<usize> {
if b == 0 || b > 64 { if b == 0 || b > 64 {
return Err(OLMError::InvalidLayer("fingerprint width must be 1..=64".into())); return Err(OLMError::InvalidLayer("fingerprint width must be 1..=64".into()));
@@ -176,7 +188,6 @@ impl MphfLayer {
if n == 0 { if n == 0 {
FingerprintVecWriter::new(0, b).write(&dir.join(FINGERPRINT_FILE))?; FingerprintVecWriter::new(0, b).write(&dir.join(FINGERPRINT_FILE))?;
LayerMeta::approx(b, z).save(dir)?;
return Ok(0); return Ok(0);
} }
@@ -194,139 +205,113 @@ impl MphfLayer {
} }
fw.write(&dir.join(FINGERPRINT_FILE))?; fw.write(&dir.join(FINGERPRINT_FILE))?;
LayerMeta::approx(b, z).save(dir)?;
Ok(n) Ok(n)
} }
/// Build MPHF then evidence from the unitigs file already present in `dir`. /// Build MPHF + evidence from `unitigs.bin` already present in `dir`.
/// ///
/// - Exact: `.idx` is built for pass-1 parallel construction and kept for /// `fill_slot(slot, kmer)` is called once per kmer in all modes.
/// query-time kmer verification. `evidence.bin` is written. /// No `layer_meta.json` is written — the mode is an index-level property
/// - Approx: pass-1 uses `open_sequential` + `par_bridge` — no `.idx` is /// stored in `PartitionMeta`.
/// ever created. `fingerprint.bin` is written.
///
/// `fill_slot(slot, kmer)` is called once per kmer in both modes.
pub(crate) fn build( pub(crate) fn build(
dir: &Path, dir: &Path,
block_bits: u8, block_bits: u8,
evidence_kind: &EvidenceKind, mode: &IndexMode,
fill_slot: &mut impl FnMut(usize, CanonicalKmer) -> OLMResult<()>, fill_slot: &mut impl FnMut(usize, CanonicalKmer) -> OLMResult<()>,
) -> OLMResult<usize> { ) -> OLMResult<usize> {
use rayon::prelude::*; use rayon::prelude::*;
let unitig_path = dir.join(UNITIGS_FILE); let unitig_path = dir.join(UNITIGS_FILE);
let n = UnitigFileReader::open_sequential(&unitig_path)?.n_kmers();
match evidence_kind { let sk_to_olm = |e: obiskio::SKError| match e {
// ── Exact path ──────────────────────────────────────────────────── obiskio::SKError::Io(io) => OLMError::Io(io),
// .idx is built LAST, once evidence.bin is written, so it is never e => OLMError::InvalidLayer(e.to_string()),
// present during construction — only at query time. };
EvidenceKind::Exact => {
let n = UnitigFileReader::open_sequential(&unitig_path)?.n_kmers();
let keys = CanonicalKmerIter::new(&unitig_path)
.map_err(|e| match e {
obiskio::SKError::Io(io) => OLMError::Io(io),
e => OLMError::InvalidLayer(e.to_string()),
})?;
if n == 0 { // ── Empty layer ───────────────────────────────────────────────────────
if n == 0 {
let mphf: Mphf =
Mphf::try_new(&[] as &[u64], PtrHashParams::<CubicEps>::default())
.ok_or_else(|| OLMError::Mphf("construction failed".into()))?;
mphf.store(&dir.join(MPHF_FILE))
.map_err(|e| OLMError::InvalidLayer(e.to_string()))?;
match mode {
IndexMode::Exact | IndexMode::Hybrid { .. } => {
fs::File::create(dir.join(EVIDENCE_FILE))?; fs::File::create(dir.join(EVIDENCE_FILE))?;
let mphf: Mphf =
Mphf::try_new(&[] as &[u64], PtrHashParams::<CubicEps>::default())
.ok_or_else(|| OLMError::Mphf("construction failed".into()))?;
mphf.store(&dir.join(MPHF_FILE))
.map_err(|e| OLMError::InvalidLayer(e.to_string()))?;
LayerMeta::exact().save(dir)?;
build_unitig_idx(&unitig_path, block_bits)?; build_unitig_idx(&unitig_path, block_bits)?;
return Ok(0);
} }
IndexMode::Approx { b, .. } => {
FingerprintVecWriter::new(0, *b).write(&dir.join(FINGERPRINT_FILE))?;
}
}
if let IndexMode::Hybrid { b, .. } = mode {
FingerprintVecWriter::new(0, *b).write(&dir.join(FINGERPRINT_FILE))?;
}
return Ok(0);
}
// Pass 1 MPHF construction via clonable mmap iterator // ── Pass 1: MPHF via clonable mmap iterator ───────────────────────────
let mphf: Mphf = let keys = CanonicalKmerIter::new(&unitig_path).map_err(sk_to_olm)?;
Mphf::new_from_par_iter(n, keys.map(|k| k.raw()).par_bridge(), PtrHashParams::<CubicEps>::default()); let mphf: Mphf =
mphf.store(&dir.join(MPHF_FILE)) Mphf::new_from_par_iter(n, keys.map(|k| k.raw()).par_bridge(),
.map_err(|e| OLMError::InvalidLayer(e.to_string()))?; PtrHashParams::<CubicEps>::default());
mphf.store(&dir.join(MPHF_FILE))
.map_err(|e| OLMError::InvalidLayer(e.to_string()))?;
// Pass 2 — sequential: fill evidence.bin + callback // ── Pass 2: fill evidence files + callback ────────────────────────────
let unitigs2 = UnitigFileReader::open_sequential(&unitig_path)?; let unitigs2 = UnitigFileReader::open_sequential(&unitig_path)?;
let mut ev = EvidenceWriter::new(n); let mut seen = vec![0u8; (n + 7) / 8];
let mut seen = vec![0u8; (n + 7) / 8];
match mode {
IndexMode::Exact => {
let mut ev = EvidenceWriter::new(n);
for (kmer, chunk_id, rank) in unitigs2.iter_indexed_canonical_kmers() { for (kmer, chunk_id, rank) in unitigs2.iter_indexed_canonical_kmers() {
let slot = mphf.index(&kmer.raw()); let slot = mphf.index(&kmer.raw());
if slot >= n { if slot >= n { return Err(OLMError::Mphf("slot out of bounds".into())); }
return Err(OLMError::Mphf("slot out of bounds".into())); let byte = slot / 8; let bit = 1u8 << (slot % 8);
} if seen[byte] & bit != 0 { return Err(OLMError::Mphf("duplicate slot".into())); }
let byte = slot / 8;
let bit = 1u8 << (slot % 8);
if seen[byte] & bit != 0 {
return Err(OLMError::Mphf("duplicate slot".into()));
}
seen[byte] |= bit; seen[byte] |= bit;
ev.set(slot, chunk_id as u32, rank as u8); ev.set(slot, chunk_id as u32, rank as u8);
fill_slot(slot, kmer)?; fill_slot(slot, kmer)?;
} }
ev.write(&dir.join(EVIDENCE_FILE))?; ev.write(&dir.join(EVIDENCE_FILE))?;
LayerMeta::exact().save(dir)?;
// .idx built last: strictly for query-time kmer verification
build_unitig_idx(&unitig_path, block_bits)?; build_unitig_idx(&unitig_path, block_bits)?;
Ok(n)
} }
// ── Approx path ─────────────────────────────────────────────────── IndexMode::Approx { b, .. } => {
// No .idx is created at any point. let mut fw = FingerprintVecWriter::new(n, *b);
EvidenceKind::Approx { b, z } => {
let unitigs = UnitigFileReader::open_sequential(&unitig_path)?;
let n = unitigs.n_kmers();
if n == 0 {
FingerprintVecWriter::new(0, *b).write(&dir.join(FINGERPRINT_FILE))?;
let mphf: Mphf =
Mphf::try_new(&[] as &[u64], PtrHashParams::<CubicEps>::default())
.ok_or_else(|| OLMError::Mphf("construction failed".into()))?;
mphf.store(&dir.join(MPHF_FILE))
.map_err(|e| OLMError::InvalidLayer(e.to_string()))?;
LayerMeta::approx(*b, *z).save(dir)?;
return Ok(0);
}
// Pass 1 — MPHF construction via mmap-backed clonable iterator.
// No .idx is created. par_bridge() parallelises the sequential scan;
// Clone on CanonicalKmerRawIter shares the Arc<Mmap> and resets to pos 0.
let keys = CanonicalKmerIter::new(&unitig_path)
.map_err(|e| match e {
obiskio::SKError::Io(io) => OLMError::Io(io),
e => OLMError::InvalidLayer(e.to_string()),
})?;
let mphf: Mphf =
Mphf::new_from_par_iter(n, keys.map(|k| k.raw()).par_bridge(), PtrHashParams::<CubicEps>::default());
mphf.store(&dir.join(MPHF_FILE))
.map_err(|e| OLMError::InvalidLayer(e.to_string()))?;
// Pass 2 — sequential: fill fingerprint.bin + callback
let unitigs2 = UnitigFileReader::open_sequential(&unitig_path)?;
let mut fw = FingerprintVecWriter::new(n, *b);
let mut seen = vec![0u8; (n + 7) / 8];
for kmer in unitigs2.iter_canonical_kmers() { for kmer in unitigs2.iter_canonical_kmers() {
let slot = mphf.index(&kmer.raw()); let slot = mphf.index(&kmer.raw());
if slot >= n { if slot >= n { return Err(OLMError::Mphf("slot out of bounds".into())); }
return Err(OLMError::Mphf("slot out of bounds".into())); let byte = slot / 8; let bit = 1u8 << (slot % 8);
} if seen[byte] & bit != 0 { return Err(OLMError::Mphf("duplicate slot".into())); }
let byte = slot / 8;
let bit = 1u8 << (slot % 8);
if seen[byte] & bit != 0 {
return Err(OLMError::Mphf("duplicate slot".into()));
}
seen[byte] |= bit; seen[byte] |= bit;
fw.set(slot, kmer.seq_hash()); fw.set(slot, kmer.seq_hash());
fill_slot(slot, kmer)?; fill_slot(slot, kmer)?;
} }
fw.write(&dir.join(FINGERPRINT_FILE))?; fw.write(&dir.join(FINGERPRINT_FILE))?;
LayerMeta::approx(*b, *z).save(dir)?; }
Ok(n)
IndexMode::Hybrid { b, .. } => {
let mut ev = EvidenceWriter::new(n);
let mut fw = FingerprintVecWriter::new(n, *b);
for (kmer, chunk_id, rank) in unitigs2.iter_indexed_canonical_kmers() {
let slot = mphf.index(&kmer.raw());
if slot >= n { return Err(OLMError::Mphf("slot out of bounds".into())); }
let byte = slot / 8; let bit = 1u8 << (slot % 8);
if seen[byte] & bit != 0 { return Err(OLMError::Mphf("duplicate slot".into())); }
seen[byte] |= bit;
ev.set(slot, chunk_id as u32, rank as u8);
fw.set(slot, kmer.seq_hash());
fill_slot(slot, kmer)?;
}
ev.write(&dir.join(EVIDENCE_FILE))?;
fw.write(&dir.join(FINGERPRINT_FILE))?;
build_unitig_idx(&unitig_path, block_bits)?;
} }
} }
Ok(n)
} }
} }
+7 -6
View File
@@ -19,12 +19,13 @@ impl PathIter {
file_buffer: Vec::new(), file_buffer: Vec::new(),
}; };
for path in paths { for path in paths {
// Avoid stat() at construction time on network filesystems (Lustre, NFS) // "-" is the stdin sentinel — pass it through without any extension
// where metadata operations can be 100s of milliseconds each. // check or directory expansion.
// Paths that look like sequence files are assumed to be files. if path.as_os_str() == "-" {
// Anything else is treated as a potential directory and expanded lazily iter.file_buffer.push(path);
// in next(); read_dir errors are silently skipped. } else if is_fasta_or_fastq(&path) {
if is_fasta_or_fastq(&path) { // Avoid stat() at construction time on network filesystems (Lustre, NFS)
// where metadata operations can be 100s of milliseconds each.
iter.file_buffer.push(path); iter.file_buffer.push(path);
} else { } else {
iter.dir_stack.push(path); iter.dir_stack.push(path);
-8
View File
@@ -24,10 +24,6 @@ impl<T: Copy + Default, const N: usize> Ring<T, N> {
} }
} }
#[inline] #[inline]
fn is_empty(&self) -> bool {
self.len == 0
}
#[inline]
fn clear(&mut self) { fn clear(&mut self) {
self.len = 0; self.len = 0;
self.head = 0; self.head = 0;
@@ -67,10 +63,6 @@ impl<T: Copy + Default, const N: usize> Ring<T, N> {
} }
} }
/// Iterate over elements front-to-back (copies, since T: Copy).
fn iter(&self) -> impl Iterator<Item = T> + '_ {
(0..self.len).map(move |i| self.buf[(self.head + i) % N])
}
} }
// ── MmerItem ────────────────────────────────────────────────────────────────── // ── MmerItem ──────────────────────────────────────────────────────────────────
+81 -50
View File
@@ -198,11 +198,16 @@ pub fn build_unitig_idx(unitigs_path: &Path, block_bits: u8) -> SKResult<()> {
// ── Reader ──────────────────────────────────────────────────────────────────── // ── Reader ────────────────────────────────────────────────────────────────────
/// Read-only random-access view of a unitig file. /// Memory-mapped view of a unitig file, with optional direct-access index.
/// ///
/// The sequence file is memory-mapped; the block offset table is loaded into RAM /// Three constructors select the operating mode:
/// on open. Random access to chunk `i`: O(1 << block_bits) sequential mmap /// - [`open`](Self::open) — smart default: direct access if `.idx` exists, sequential otherwise.
/// reads. Sequential iteration: O(n) via a running-offset cursor. /// - [`open_sequential`](Self::open_sequential) — always sequential, ignores `.idx`.
/// - [`open_direct_access`](Self::open_direct_access) — requires `.idx`, errors if absent.
///
/// All positional methods (`chunk_start`, `verify_canonical_kmer`, …) work in
/// both modes. Without `.idx` they fall back to an O(i) sequential scan —
/// correct but slower.
pub struct UnitigFileReader { pub struct UnitigFileReader {
mmap: Mmap, mmap: Mmap,
block_offsets: Vec<u32>, block_offsets: Vec<u32>,
@@ -214,8 +219,52 @@ pub struct UnitigFileReader {
} }
impl UnitigFileReader { impl UnitigFileReader {
/// Open with `.idx` — enables both sequential iteration and random access. /// Smart default: opens with direct access if `.idx` is present, sequential otherwise.
pub fn open(path: &Path) -> SKResult<Self> { pub fn open(path: &Path) -> SKResult<Self> {
if idx_path(path).exists() {
Self::open_direct_access(path)
} else {
Self::open_sequential(path)
}
}
/// Always sequential — never reads `.idx` even if present.
///
/// Scans the binary file once to count chunks and k-mers.
/// Positional access (`chunk_start`, `verify_canonical_kmer`) falls back to
/// O(i) sequential scan.
pub fn open_sequential(path: &Path) -> SKResult<Self> {
let file = File::open(path).map_err(SKError::Io)?;
let mmap = unsafe { Mmap::map(&file).map_err(SKError::Io)? };
let k = obikseq::params::k();
let mut offset = 0usize;
let mut n_unitigs = 0usize;
let mut n_kmers = 0usize;
while offset < mmap.len() {
let seql_minus_k = mmap[offset] as usize;
n_kmers += seql_minus_k + 1;
offset += 1 + (seql_minus_k + k + 3) / 4;
n_unitigs += 1;
}
Ok(Self {
mmap,
block_offsets: Vec::new(),
n_unitigs,
n_kmers,
k,
block_bits: DEFAULT_BLOCK_BITS,
mask: (1usize << DEFAULT_BLOCK_BITS) - 1,
})
}
/// Requires `.idx` — errors if the companion index file is absent.
///
/// Enables O(1 << block_bits) positional access to any chunk.
/// Use only when direct access is architecturally required (query-time
/// verification on an exact-evidence layer).
pub fn open_direct_access(path: &Path) -> SKResult<Self> {
let file = File::open(path).map_err(SKError::Io)?; let file = File::open(path).map_err(SKError::Io)?;
let mmap = unsafe { Mmap::map(&file).map_err(SKError::Io)? }; let mmap = unsafe { Mmap::map(&file).map_err(SKError::Io)? };
let (n_unitigs, n_kmers, block_bits, block_offsets) = read_idx(&idx_path(path))?; let (n_unitigs, n_kmers, block_bits, block_offsets) = read_idx(&idx_path(path))?;
@@ -231,58 +280,38 @@ impl UnitigFileReader {
}) })
} }
/// Open without `.idx` — sequential iteration only, no random access.
///
/// Scans the binary file once to count chunks and k-mers. Use when only
/// [`Self::iter_kmers`], [`Self::iter_canonical_kmers`], or
/// [`Self::iter_indexed_canonical_kmers`] are needed.
pub fn open_sequential(path: &Path) -> SKResult<Self> {
let file = File::open(path).map_err(SKError::Io)?;
let mmap = unsafe { Mmap::map(&file).map_err(SKError::Io)? };
let k = obikseq::params::k();
let mut offset = 0usize;
let mut n_unitigs = 0usize;
let mut n_kmers = 0usize;
while offset < mmap.len() {
let seql_minus_k = mmap[offset] as usize;
n_kmers += seql_minus_k + 1;
offset += 1 + (seql_minus_k + k + 3) / 4;
n_unitigs += 1;
}
Ok(Self {
mmap,
block_offsets: Vec::new(), // empty → random access disabled
n_unitigs,
n_kmers,
k,
block_bits: DEFAULT_BLOCK_BITS,
mask: (1usize << DEFAULT_BLOCK_BITS) - 1,
})
}
pub fn len(&self) -> usize { self.n_unitigs } pub fn len(&self) -> usize { self.n_unitigs }
pub fn is_empty(&self) -> bool { self.n_unitigs == 0 } pub fn is_empty(&self) -> bool { self.n_unitigs == 0 }
pub fn n_kmers(&self) -> usize { self.n_kmers } pub fn n_kmers(&self) -> usize { self.n_kmers }
pub fn block_bits(&self) -> u8 { self.block_bits } pub fn block_bits(&self) -> u8 { self.block_bits }
pub fn has_direct_access(&self) -> bool { !self.block_offsets.is_empty() }
/// Byte offset of the START of record `i` (the seql byte) in the mmap. /// Byte offset of record `i` in the mmap.
///
/// Fast path (O(1 << block_bits)) when `.idx` is loaded; degraded O(i)
/// sequential scan otherwise.
#[inline] #[inline]
fn chunk_start(&self, i: usize) -> usize { fn chunk_start(&self, i: usize) -> usize {
assert!(!self.block_offsets.is_empty(), if !self.block_offsets.is_empty() {
"random access requires UnitigFileReader::open(); use open_sequential() for iteration only"); if self.block_bits == 0 {
if self.block_bits == 0 { return self.block_offsets[i] as usize;
return self.block_offsets[i] as usize; }
let block = i >> self.block_bits;
let rem = i & self.mask;
let mut offset = self.block_offsets[block] as usize;
for _ in 0..rem {
let seql_minus_k = self.mmap[offset] as usize;
offset += 1 + (seql_minus_k + self.k + 3) / 4;
}
offset
} else {
let mut offset = 0usize;
for _ in 0..i {
let seql_minus_k = self.mmap[offset] as usize;
offset += 1 + (seql_minus_k + self.k + 3) / 4;
}
offset
} }
let block = i >> self.block_bits;
let rem = i & self.mask;
let mut offset = self.block_offsets[block] as usize;
for _ in 0..rem {
let seql_minus_k = self.mmap[offset] as usize;
offset += 1 + (seql_minus_k + self.k + 3) / 4;
}
offset
} }
/// Nucleotide length of chunk `i`. /// Nucleotide length of chunk `i`.
@@ -307,7 +336,9 @@ impl UnitigFileReader {
extract_kmer_raw(&self.mmap[offset + 1..], j, self.k) extract_kmer_raw(&self.mmap[offset + 1..], j, self.k)
} }
/// `true` iff the k-mer at position `j` of chunk `i` equals `query` (canonical). /// `true` iff the k-mer at position `j` of chunk `i` matches `query`.
///
/// Works in both modes; O(i) scan when `.idx` is absent.
#[inline] #[inline]
pub fn verify_canonical_kmer(&self, i: usize, j: usize, query: CanonicalKmer) -> bool { pub fn verify_canonical_kmer(&self, i: usize, j: usize, query: CanonicalKmer) -> bool {
canonical_raw(self.raw_kmer(i, j), self.k) == query.raw() canonical_raw(self.raw_kmer(i, j), self.k) == query.raw()