feat: centralize index configuration and add hybrid mode
Centralizes index configuration by storing a single `IndexMode` (`Exact`, `Approx`, or `Hybrid`) in `PartitionMeta`, eliminating per-layer metadata files. Introduces a `Hybrid` evidence mode and an `--approx` CLI flag to toggle between exact and probabilistic indexing. Refactors the build and query pipelines to dynamically dispatch based on the configured mode, deferring `.idx` generation to Pass 2 and only requiring it for Exact/Hybrid modes. Updates layer opening to load appropriate data structures, enforces strict parameter validation during merges, and clarifies performance trade-offs in documentation.
@@ -27,10 +27,10 @@ part_XXXXX/
|
|||||||
|
|
||||||
After filtering (applying a min-count threshold derived from the spectrum) and building the local De Bruijn graph + unitigs (see [Construction pipeline](pipeline.md)), the exact filtered kmer set is available via `unitigs.bin`.
|
After filtering (applying a min-count threshold derived from the spectrum) and building the local De Bruijn graph + unitigs (see [Construction pipeline](pipeline.md)), the exact filtered kmer set is available via `unitigs.bin`.
|
||||||
|
|
||||||
`MphfLayer::build(dir, block_bits, fill_slot)` is called on the unitig directory:
|
`MphfLayer::build(dir, block_bits, mode: &IndexMode, fill_slot)` is called on the unitig directory:
|
||||||
|
|
||||||
1. **Pass 1**: build `.idx` via `build_unitig_idx(unitig_path, block_bits)`, then iterate all canonical kmers in parallel over chunks using `(0..unitigs.len()).into_par_iter()` + `unitigs.unitig(ci).into_canonical_kmers()`. Constructs and stores `mphf.bin` (ptr_hash, `new_from_par_iter`).
|
1. **Pass 1** (parallel): a `CanonicalKmerIter` — clonable via `Arc<Mmap>`, no file reopening — is passed directly to `new_from_par_iter` via `par_bridge()`. No `.idx` is read or created at this stage; parallelism is at partition/layer level, not within a single MPHF. Produces `mphf.bin`.
|
||||||
2. **Pass 2**: iterate sequentially with `iter_indexed_canonical_kmers`; fill `evidence.bin`; call `fill_slot(slot, kmer)` callback once per kmer for DataStore population.
|
2. **Pass 2** (sequential): iterate with `iter_indexed_canonical_kmers`; fill evidence files; call `fill_slot(slot, kmer)` callback per kmer. For Exact/Hybrid, `.idx` is written at the end of this pass — never earlier.
|
||||||
|
|
||||||
`mphf1.bin` and `counts1.bin` are no longer needed after phase 2 and can be deleted.
|
`mphf1.bin` and `counts1.bin` are no longer needed after phase 2 and can be deleted.
|
||||||
|
|
||||||
@@ -110,7 +110,7 @@ layer_i/
|
|||||||
mphf.bin — ptr_hash phase-2 MPHF
|
mphf.bin — ptr_hash phase-2 MPHF
|
||||||
evidence.bin — n × (chunk_id: 25 bits | rank: 7 bits) per slot [exact mode]
|
evidence.bin — n × (chunk_id: 25 bits | rank: 7 bits) per slot [exact mode]
|
||||||
fingerprint.bin — n × b-bit fingerprints per slot [approx mode]
|
fingerprint.bin — n × b-bit fingerprints per slot [approx mode]
|
||||||
layer_meta.json — evidence kind, recorded at build time
|
[no layer_meta.json — mode stored once in partition-level meta.json]
|
||||||
```
|
```
|
||||||
|
|
||||||
Layers are **disjoint**: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B:
|
Layers are **disjoint**: a canonical kmer belongs to exactly one layer. Layer 0 is built from dataset A. Adding dataset B:
|
||||||
@@ -121,32 +121,32 @@ Layers are **disjoint**: a canonical kmer belongs to exactly one layer. Layer 0
|
|||||||
|
|
||||||
### Evidence modes
|
### Evidence modes
|
||||||
|
|
||||||
Two evidence modes are supported, selected at build time via `EvidenceKind` and recorded in `layer_meta.json`.
|
Three evidence modes are supported via `IndexMode`, stored once in `PartitionMeta` at partition root. There is no `layer_meta.json`.
|
||||||
|
|
||||||
**Exact** (`EvidenceKind::Exact`): `evidence.bin` stores one `(chunk_id, rank)` pair per MPHF slot, encoding the position of the corresponding kmer in `unitigs.bin`. Membership verification reconstructs the kmer from `(chunk_id, rank)` and compares it to the query. Zero false positives. Requires `.idx` for random access.
|
**Exact** (`IndexMode::Exact`): `evidence.bin` stores one `(chunk_id, rank)` pair per MPHF slot. Verification reconstructs the kmer and compares to the query. Zero false positives. `.idx` required at query time.
|
||||||
|
|
||||||
**Approx** (`EvidenceKind::Approx { b, z }`): `fingerprint.bin` stores a `b`-bit hash of the kmer at each MPHF slot. Membership check compares `kmer.seq_hash()` against the stored fingerprint. False-positive rate: 1/2^b per query. No `.idx` is written or needed.
|
**Approx** (`IndexMode::Approx { b, z }`): `fingerprint.bin` stores a b-bit hash per slot. False-positive rate 1/2^b per query; Findere z-parameter reduces window FP to ≈ 1/2^(b·z). No `.idx` written or needed.
|
||||||
|
|
||||||
|
**Hybrid** (`IndexMode::Hybrid { b, z }`): both `fingerprint.bin` and `evidence.bin` + `.idx`. `find()` uses the fingerprint (O(1)); `find_strict()` uses exact evidence (O(1)).
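The false-positive figures quoted for the Approx and Hybrid modes can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical and not part of the crate, and the window estimate assumes fingerprint matches on consecutive kmers are independent, which is why the text above gives it as an approximation.

```rust
/// Per-kmer false-positive probability for a b-bit fingerprint: 1 / 2^b.
/// (Hypothetical helper, not part of the crate.)
fn kmer_fp_rate(b: u8) -> f64 {
    0.5f64.powi(i32::from(b))
}

/// Approximate false-positive probability for one window in which
/// z consecutive kmers must all match: 1 / 2^(b * z).
/// Assumes the z fingerprint comparisons are independent.
fn window_fp_rate(b: u8, z: u8) -> f64 {
    0.5f64.powi(i32::from(b) * i32::from(z))
}
```

For example, `b = 8` gives a per-kmer rate of 1/256, and adding `z = 3` lowers the per-window estimate to 1/2^24.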
|
||||||
|
|
||||||
### Build functions
|
### Build functions
|
||||||
|
|
||||||
```
|
```
|
||||||
MphfLayer::build(dir, block_bits, fill_slot)
|
MphfLayer::build(dir, block_bits, mode: &IndexMode, fill_slot)
|
||||||
Pass 1: par_iter over chunks via .idx → build mphf.bin
|
Pass 1: CanonicalKmerIter + par_bridge() → build mphf.bin (no .idx used)
|
||||||
Pass 2: sequential iter → fill evidence.bin + call fill_slot
|
Pass 2: sequential iter → fill evidence files + call fill_slot
|
||||||
|
.idx written last for Exact/Hybrid (query-time only)
|
||||||
|
|
||||||
MphfLayer::build_exact_evidence(dir, block_bits)
|
MphfLayer::build_exact_evidence(dir, block_bits)
|
||||||
Standalone post-hoc: builds evidence.bin + .idx from existing mphf.bin + unitigs.bin
|
Post-hoc: builds evidence.bin + .idx from existing mphf.bin + unitigs.bin
|
||||||
Uses open_sequential(); no .idx required on entry
|
Uses open_sequential(); no .idx required on entry
|
||||||
|
|
||||||
MphfLayer::build_approx_evidence(dir, b, z)
|
MphfLayer::build_approx_evidence(dir, b, z)
|
||||||
Standalone post-hoc: builds fingerprint.bin from existing mphf.bin + unitigs.bin
|
Post-hoc: builds fingerprint.bin from existing mphf.bin + unitigs.bin
|
||||||
Uses open_sequential(); never writes .idx
|
Uses open_sequential(); never writes .idx
|
||||||
|
|
||||||
MphfLayer::build_evidence(dir, kind, block_bits)
|
|
||||||
Dispatch wrapper: routes to build_exact_evidence or build_approx_evidence
|
|
||||||
```
|
```
|
||||||
|
|
||||||
`build` always produces exact evidence. If approximate evidence is needed (e.g. `EvidenceKind::Approx`), the caller invokes `build_approx_evidence` after `build` to replace the evidence bundle.
|
There is no `build_evidence` dispatch wrapper. Callers choose the appropriate post-hoc build directly.
|
||||||
|
|
||||||
In `obikpartitionner`, `build_index_layer` receives `block_bits: u8` from `IndexConfig::block_bits` and forwards it directly to `Layer::build` and `Layer::build_approx_evidence`.
|
In `obikpartitionner`, `build_index_layer` receives `block_bits: u8` from `IndexConfig::block_bits` and forwards it directly to `Layer::build` and `Layer::build_approx_evidence`.
|
||||||
|
|
||||||
@@ -170,7 +170,7 @@ fn query(kmer) → Option<(layer_index, slot)>:
|
|||||||
return None
|
return None
|
||||||
```
|
```
|
||||||
|
|
||||||
`MphfLayer::find` dispatches transparently to `find_exact` or `find_approx` based on the evidence loaded at `open` time. Expected probe depth: 1 for kmers in layer 0. Each probe is a ptr_hash lookup (~10 ns) plus one evidence check.
|
`MphfLayer::find` dispatches on the loaded `LayerEvidence` variant in O(1); the panicking `find_exact`/`find_approx` methods no longer exist. `find_strict` always performs an exact check: O(1) for Exact/Hybrid, O(n) sequential scan for Approx. Expected probe depth: 1 for kmers in layer 0. Each probe is a ptr_hash lookup (~10 ns) plus one evidence check.
|
||||||
|
|
||||||
### Merging layers
|
### Merging layers
|
||||||
|
|
||||||
|
|||||||
@@ -20,21 +20,26 @@ Both `PersistentCompactIntMatrix` and `PersistentBitMatrix` come from the `obico
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Evidence kinds
|
## Index mode (homogeneity invariant)
|
||||||
|
|
||||||
Each layer carries one of two evidence bundles, recorded in `layer_meta.json` at build time:
|
A partitioned index is homogeneous: every layer within a partition shares the same mode. The mode is determined once at `LayeredMap::open()` from `PartitionMeta.mode` and passed to each `Layer::open()` — no per-layer file is read.
|
||||||
|
|
||||||
```rust
|
```rust
|
||||||
pub enum EvidenceKind {
|
#[derive(Serialize, Deserialize, Default)]
|
||||||
|
#[serde(tag = "type", rename_all = "snake_case")]
|
||||||
|
pub enum IndexMode {
|
||||||
|
#[default]
|
||||||
Exact,
|
Exact,
|
||||||
Approx { b: u8, z: u8 },
|
Approx { b: u8, z: u8 },
|
||||||
|
Hybrid { b: u8, z: u8 },
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
`EvidenceKind` is stored in `LayerMeta` (one per layer directory). `open()` reads it to decide which evidence files to load.
|
`IndexMode` is stored once in `PartitionMeta` (`meta.json` at partition root). There is no `layer_meta.json`.
|
||||||
|
|
||||||
- **Exact**: writes `evidence.bin` + `unitigs.bin.idx`. Zero false positives. Requires random-access `.idx` at query time.
|
- **Exact**: writes `evidence.bin` + `unitigs.bin.idx`. Zero false positives.
|
||||||
- **Approx**: writes `fingerprint.bin` only. False-positive rate per kmer query = 1/2^b. `z` is the Findere consecutive-kmer parameter: `z` consecutive kmers must all match, reducing the effective FP rate per read to approximately W / 2^(b·z) where W = L − k − z + 2 is the number of windows in a read of length L. No `.idx` written or required.
|
- **Approx**: writes `fingerprint.bin` only. FP rate per kmer = 1/2^b; with Findere z-parameter, z consecutive kmers must all match → effective window FP ≈ 1/2^(b·z). No `.idx` written or required.
|
||||||
|
- **Hybrid**: writes both `fingerprint.bin` and `evidence.bin` + `.idx`. `find()` uses the fingerprint (fast, O(1)); `find_strict()` uses exact evidence.
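The mapping from mode to evidence files described in the bullets above can be summarised as a small lookup. This is a sketch under stated assumptions: `IndexMode` here is a local stand-in without the serde derives, and `evidence_files` is a hypothetical helper that does not exist in the crate.

```rust
/// Local stand-in for the `IndexMode` enum shown in the diff
/// (serde derives and the `Default` impl are omitted).
#[derive(Debug, Clone, PartialEq)]
enum IndexMode {
    Exact,
    Approx { b: u8, z: u8 },
    Hybrid { b: u8, z: u8 },
}

/// Hypothetical helper: evidence files a layer directory is expected
/// to contain for a given mode, following the bullets above.
fn evidence_files(mode: &IndexMode) -> Vec<&'static str> {
    match mode {
        // Exact verification needs the evidence array and random access.
        IndexMode::Exact => vec!["evidence.bin", "unitigs.bin.idx"],
        // Fingerprints only; no `.idx` is written or required.
        IndexMode::Approx { .. } => vec!["fingerprint.bin"],
        // Both bundles are present.
        IndexMode::Hybrid { .. } => {
            vec!["fingerprint.bin", "evidence.bin", "unitigs.bin.idx"]
        }
    }
}
```

Because the mode is stored once per partition, a check like this would apply identically to every layer directory in that partition.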
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -55,44 +60,44 @@ pub struct MphfLayer {
|
|||||||
```rust
|
```rust
|
||||||
enum LayerEvidence {
|
enum LayerEvidence {
|
||||||
Exact { evidence: Evidence, unitigs: UnitigFileReader },
|
Exact { evidence: Evidence, unitigs: UnitigFileReader },
|
||||||
Approx { fingerprint: FingerprintVec },
|
Approx { fingerprint: FingerprintVec, unitigs_path: PathBuf },
|
||||||
|
Hybrid { evidence: Evidence, unitigs: UnitigFileReader, fingerprint: FingerprintVec },
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
|
`MphfLayer::open(dir, mode: &IndexMode)` receives the mode from `PartitionMeta` — no per-layer file is read.
|
||||||
|
|
||||||
### Query API
|
### Query API
|
||||||
|
|
||||||
Three public query methods, all returning `Option<usize>` (slot index):
|
Two public query methods, both returning `Option<usize>` (slot index):
|
||||||
|
|
||||||
```rust
|
```rust
|
||||||
pub fn find(&self, kmer: CanonicalKmer) -> Option<usize>
|
pub fn find(&self, kmer: CanonicalKmer) -> Option<usize>
|
||||||
pub fn find_exact(&self, kmer: CanonicalKmer) -> Option<usize>
|
pub fn find_strict(&self, kmer: CanonicalKmer) -> Option<usize>
|
||||||
pub fn find_approx(&self, kmer: CanonicalKmer) -> Option<usize>
|
|
||||||
```
|
```
|
||||||
|
|
||||||
- `find` dispatches transparently to `find_exact` or `find_approx` based on the evidence variant loaded at `open()`.
|
- `find`: O(1) auto-dispatch. Exact → exact evidence check. Approx/Hybrid → fingerprint comparison.
|
||||||
- `find_exact` panics if the layer holds approximate evidence; zero false positives.
|
- `find_strict`: always exact. Exact/Hybrid → O(1) evidence check. Approx → O(n) sequential scan (no `.idx`).
|
||||||
- `find_approx` panics if the layer holds exact evidence; FP rate 1/2^b per kmer.
|
|
||||||
|
|
||||||
`open()` requires `unitigs.bin.idx` (random access into unitigs). `open_sequential()` on `UnitigFileReader` does not require the `.idx` and is used during build passes.
|
There are no `find_exact`/`find_approx` methods; panicking dispatch is eliminated.
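The difference between `find` and `find_strict` can be illustrated with a toy model. Everything below is a simplified stand-in, not the crate's implementation: kmers are plain `u64` values, the MPHF is replaced by a modulo that happens to be collision-free for the chosen keys, the fingerprint is a trivial 8-bit function, and `ToyLayer`, `Kind`, `toy_slot`, and `toy_fingerprint` are hypothetical names. Only the dispatch shape follows the description above.

```rust
/// Toy stand-in for the MPHF: maps any key to a slot in 0..n.
/// Like a real MPHF, it also returns a slot for keys outside the set.
fn toy_slot(kmer: u64, n: usize) -> usize {
    (kmer % n as u64) as usize
}

/// Toy 8-bit fingerprint (the real code hashes the kmer).
fn toy_fingerprint(kmer: u64) -> u8 {
    (kmer >> 2) as u8
}

enum LayerEvidence {
    Exact { kmers_by_slot: Vec<u64> },
    Approx { fingerprints: Vec<u8>, all_kmers: Vec<u64> },
    Hybrid { kmers_by_slot: Vec<u64>, fingerprints: Vec<u8> },
}

#[derive(Clone, Copy)]
enum Kind { Exact, Approx, Hybrid }

struct ToyLayer { n: usize, evidence: LayerEvidence }

impl ToyLayer {
    /// Keys must be non-empty and distinct modulo `keys.len()` (toy constraint).
    fn build(keys: &[u64], kind: Kind) -> ToyLayer {
        let n = keys.len();
        let mut kmers_by_slot = vec![0u64; n];
        let mut fingerprints = vec![0u8; n];
        for &k in keys {
            let s = toy_slot(k, n);
            kmers_by_slot[s] = k;
            fingerprints[s] = toy_fingerprint(k);
        }
        let evidence = match kind {
            Kind::Exact => LayerEvidence::Exact { kmers_by_slot },
            Kind::Approx => LayerEvidence::Approx { fingerprints, all_kmers: keys.to_vec() },
            Kind::Hybrid => LayerEvidence::Hybrid { kmers_by_slot, fingerprints },
        };
        ToyLayer { n, evidence }
    }

    /// O(1): exact comparison for Exact, fingerprint comparison otherwise.
    fn find(&self, kmer: u64) -> Option<usize> {
        let slot = toy_slot(kmer, self.n);
        let hit = match &self.evidence {
            LayerEvidence::Exact { kmers_by_slot } => kmers_by_slot[slot] == kmer,
            LayerEvidence::Approx { fingerprints, .. }
            | LayerEvidence::Hybrid { fingerprints, .. } => {
                fingerprints[slot] == toy_fingerprint(kmer)
            }
        };
        hit.then_some(slot)
    }

    /// Always exact: O(1) for Exact/Hybrid, O(n) scan for Approx.
    fn find_strict(&self, kmer: u64) -> Option<usize> {
        let slot = toy_slot(kmer, self.n);
        let hit = match &self.evidence {
            LayerEvidence::Exact { kmers_by_slot }
            | LayerEvidence::Hybrid { kmers_by_slot, .. } => kmers_by_slot[slot] == kmer,
            LayerEvidence::Approx { all_kmers, .. } => all_kmers.iter().any(|&k| k == kmer),
        };
        hit.then_some(slot)
    }
}
```

With keys `[10, 21, 32, 43]`, the value `1034` lands on the same slot and toy fingerprint as `10`, so `find` reports a hit on Approx and Hybrid layers while `find_strict` rejects it in every mode.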
|
||||||
|
|
||||||
### Build surface
|
### Build surface
|
||||||
|
|
||||||
```rust
|
```rust
|
||||||
// Full MPHF + exact evidence build (two-pass, parallel)
|
// Full MPHF + evidence build (two-pass)
|
||||||
pub(crate) fn build(dir, block_bits, fill_slot) -> OLMResult<usize>
|
pub(crate) fn build(dir, block_bits, mode: &IndexMode, fill_slot) -> OLMResult<usize>
|
||||||
|
|
||||||
// Evidence-only builds (MPHF already present in dir)
|
// Evidence-only post-hoc builds (MPHF already present)
|
||||||
pub fn build_exact_evidence(dir, block_bits) -> OLMResult<usize>
|
pub fn build_exact_evidence(dir, block_bits) -> OLMResult<usize>
|
||||||
pub fn build_approx_evidence(dir, b, z) -> OLMResult<usize>
|
pub fn build_approx_evidence(dir, b, z) -> OLMResult<usize>
|
||||||
pub fn build_evidence(dir, kind, block_bits) -> OLMResult<usize> // dispatch
|
|
||||||
```
|
```
|
||||||
|
|
||||||
`MphfLayer::build` runs two sequential passes over `unitigs.bin`:
|
`MphfLayer::build` runs two passes over `unitigs.bin`:
|
||||||
|
|
||||||
1. **Pass 1** (parallel via rayon): iterate all canonical kmers, construct and store `mphf.bin`. `new_from_par_iter` avoids materialising a full key `Vec`.
|
1. **Pass 1** (parallel via rayon): a `CanonicalKmerIter` (clonable, `Arc<Mmap>`, no file reopening) is passed to `new_from_par_iter` via `par_bridge()`. Produces `mphf.bin`. No `.idx` is read or created at this stage.
|
||||||
2. **Pass 2** (sequential): iterate again, fill `evidence.bin`, call `fill_slot(slot, kmer)` once per kmer for payload population. A compact `n/8`-byte seen-bitset verifies MPHF injectivity inline.
|
2. **Pass 2** (sequential): fill evidence files; call `fill_slot(slot, kmer)` per kmer. `.idx` is written last for Exact/Hybrid modes (query-time only).
|
||||||
|
|
||||||
`build` always produces exact evidence. For approximate evidence, use `build_approx_evidence` after MPHF construction.
|
There is no `build_evidence` dispatch wrapper — callers invoke `build_exact_evidence` or `build_approx_evidence` directly.
|
||||||
|
|
||||||
For empty layers (n = 0), all build variants return `Ok(0)` immediately after creating empty output files.
|
For empty layers (n = 0), all build variants return `Ok(0)` immediately after creating empty output files.
|
||||||
|
|
||||||
@@ -133,38 +138,37 @@ pub struct Hit<T = ()> {
|
|||||||
```rust
|
```rust
|
||||||
// mode 1
|
// mode 1
|
||||||
impl Layer<()> {
|
impl Layer<()> {
|
||||||
pub fn build(out_dir: &Path, block_bits: u8) -> OLMResult<usize>
|
pub fn build(out_dir: &Path, block_bits: u8, mode: &IndexMode) -> OLMResult<usize>
|
||||||
}
|
}
|
||||||
|
|
||||||
// mode 2
|
// mode 2
|
||||||
impl Layer<PersistentCompactIntMatrix> {
|
impl Layer<PersistentCompactIntMatrix> {
|
||||||
pub fn build(out_dir: &Path, block_bits: u8,
|
pub fn build(out_dir: &Path, block_bits: u8, mode: &IndexMode,
|
||||||
count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
|
count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize>
|
||||||
pub fn build_from_map(out_dir: &Path, block_bits: u8,
|
pub fn build_from_map(out_dir: &Path, block_bits: u8, mode: &IndexMode,
|
||||||
counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
|
counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize>
|
||||||
}
|
}
|
||||||
|
|
||||||
// mode 3
|
// mode 3
|
||||||
impl Layer<PersistentBitMatrix> {
|
impl Layer<PersistentBitMatrix> {
|
||||||
pub fn build_presence(out_dir: &Path, block_bits: u8,
|
pub fn build_presence(out_dir: &Path, block_bits: u8, mode: &IndexMode,
|
||||||
n_genomes: usize,
|
n_genomes: usize,
|
||||||
present_in: impl Fn(CanonicalKmer, usize) -> bool) -> OLMResult<usize>
|
present_in: impl Fn(CanonicalKmer, usize) -> bool) -> OLMResult<usize>
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
All build impls delegate MPHF + evidence construction to `MphfLayer::build` via a mode-specific `fill_slot` callback. Modes 2 and 3 pre-read `n_kmers` from `unitigs.bin` via `UnitigFileReader::open_sequential` to size the matrix builder before calling `MphfLayer::build`.
|
All build impls delegate to `MphfLayer::build` via a mode-specific `fill_slot` callback. The `mode` parameter is forwarded directly — no `LayerMeta` is written.
|
||||||
|
|
||||||
### Evidence build helpers on Layer
|
Evidence-only post-hoc builds are accessible directly on `Layer<D>`:
|
||||||
|
|
||||||
```rust
|
```rust
|
||||||
impl<D: LayerData> Layer<D> {
|
impl<D: LayerData> Layer<D> {
|
||||||
pub fn build_exact_evidence(layer_dir: &Path, block_bits: u8) -> OLMResult<usize>
|
pub fn build_exact_evidence(layer_dir: &Path, block_bits: u8) -> OLMResult<usize>
|
||||||
pub fn build_approx_evidence(layer_dir: &Path, b: u8, z: u8) -> OLMResult<usize>
|
pub fn build_approx_evidence(layer_dir: &Path, b: u8, z: u8) -> OLMResult<usize>
|
||||||
pub fn build_evidence(layer_dir: &Path, kind: &EvidenceKind, block_bits: u8) -> OLMResult<usize>
|
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
These delegate directly to the corresponding `MphfLayer` methods and are provided so call sites can remain typed at the `Layer<D>` level.
|
There is no `build_evidence` dispatch wrapper.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -213,13 +217,16 @@ pub struct LayeredMap<D: LayerData = ()> {
|
|||||||
|
|
||||||
```rust
|
```rust
|
||||||
pub fn open(root: &Path) -> OLMResult<Self>
|
pub fn open(root: &Path) -> OLMResult<Self>
|
||||||
pub fn create(root: &Path) -> OLMResult<Self>
|
pub fn create(root: &Path, mode: IndexMode) -> OLMResult<Self>
|
||||||
pub fn n_layers(&self) -> usize
|
pub fn n_layers(&self) -> usize
|
||||||
pub fn layer(&self, i: usize) -> &Layer<D>
|
pub fn layer(&self, i: usize) -> &Layer<D>
|
||||||
|
pub fn mode(&self) -> &IndexMode
|
||||||
pub fn query(&self, kmer: CanonicalKmer) -> Option<(usize, Hit<D::Item>)>
|
pub fn query(&self, kmer: CanonicalKmer) -> Option<(usize, Hit<D::Item>)>
|
||||||
pub fn next_layer_writer(&self) -> OLMResult<UnitigFileWriter>
|
pub fn next_layer_writer(&self) -> OLMResult<UnitigFileWriter>
|
||||||
```
|
```
|
||||||
|
|
||||||
|
`open` reads `PartitionMeta` once, extracts `mode`, and passes it to every `Layer::open` — no per-layer file is read. `create` stores the given mode in `PartitionMeta`.
|
||||||
|
|
||||||
`query` probes layers in order and returns `(layer_index, Hit)` on the first match. Expected probe depth: 1 for kmers in layer 0.
|
`query` probes layers in order and returns `(layer_index, Hit)` on the first match. Expected probe depth: 1 for kmers in layer 0.
|
||||||
|
|
||||||
### push_layer
|
### push_layer
|
||||||
@@ -272,14 +279,13 @@ See [Kmer index architecture](../architecture/index_architecture.md) for the ful
|
|||||||
|
|
||||||
```
|
```
|
||||||
partition_root/ ← LayeredMap (one partition)
|
partition_root/ ← LayeredMap (one partition)
|
||||||
meta.json — {"n_layers": N}
|
meta.json — {"n_layers": N, "mode": {"type": "exact"|"approx"|"hybrid", ...}}
|
||||||
layer_0/ ← Layer
|
layer_0/ ← Layer
|
||||||
layer_meta.json — {"type": "exact"} or {"type": "approx", "b": B, "z": Z}
|
|
||||||
mphf.bin — ptr_hash MPHF (epserde format)
|
mphf.bin — ptr_hash MPHF (epserde format)
|
||||||
unitigs.bin — packed 2-bit nucleotide sequences
|
unitigs.bin — packed 2-bit nucleotide sequences
|
||||||
unitigs.bin.idx — UIDX index (exact evidence only)
|
unitigs.bin.idx — UIDX index (Exact/Hybrid only; query-time, never built during MPHF construction)
|
||||||
evidence.bin — [u32; n], LE (exact evidence only)
|
evidence.bin — [u32; n], LE (Exact/Hybrid only)
|
||||||
fingerprint.bin — packed b-bit array (approx evidence only)
|
fingerprint.bin — packed b-bit array (Approx/Hybrid only)
|
||||||
counts/ [mode 2] PersistentCompactIntMatrix
|
counts/ [mode 2] PersistentCompactIntMatrix
|
||||||
meta.json
|
meta.json
|
||||||
col_000000.pciv
|
col_000000.pciv
|
||||||
@@ -290,7 +296,7 @@ partition_root/ ← LayeredMap (one partition)
|
|||||||
…
|
…
|
||||||
```
|
```
|
||||||
|
|
||||||
`unitigs.bin.idx` is required by `open()` (random access). `open_sequential()` on `UnitigFileReader` omits it and is used during build passes and approx-evidence construction.
|
There is no `layer_meta.json`. The mode is stored once in `PartitionMeta` and is valid for all layers. `unitigs.bin.idx` is written at the end of Pass 2 of `build` (Exact/Hybrid) or by `build_exact_evidence`, never during MPHF construction, and is consumed at query time only.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -387,4 +393,4 @@ Each partition's new layer is built independently; the operation is fully parall
|
|||||||
| `obiskio` | unitig file writer/reader + `.idx` build |
|
| `obiskio` | unitig file writer/reader + `.idx` build |
|
||||||
| `obicompactvec` | payload types + aggregation traits |
|
| `obicompactvec` | payload types + aggregation traits |
|
||||||
| `rayon 1` | parallel MPHF construction pass |
|
| `rayon 1` | parallel MPHF construction pass |
|
||||||
| `serde / serde_json` | `LayerMeta` + `PartitionMeta` serialisation |
|
| `serde / serde_json` | `PartitionMeta` serialisation |
|
||||||
|
|||||||
@@ -61,35 +61,24 @@ File size = `n_slots × 4` bytes. `chunk_id` is the 0-based index of the record
|
|||||||
|
|
||||||
Scans `unitigs.bin` sequentially: for each chunk at byte offset `offset`, if `chunk_count & mask == 0` (where `mask = (1 << block_bits) − 1`), appends `offset as u32` to `block_offsets`. After the scan, appends a sentinel (= total file size), then writes the `.idx` file. Called after the unitig file is fully written and closed.
|
Scans `unitigs.bin` sequentially: for each chunk at byte offset `offset`, if `chunk_count & mask == 0` (where `mask = (1 << block_bits) − 1`), appends `offset as u32` to `block_offsets`. After the scan, appends a sentinel (= total file size), then writes the `.idx` file. Called after the unitig file is fully written and closed.
|
||||||
|
|
||||||
### `open()` vs `open_sequential()`
|
### `open()`, `open_sequential()`, `open_direct_access()`
|
||||||
|
|
||||||
`UnitigFileReader::open(path)` loads the `.idx` file into `block_offsets: Vec<u32>` and memory-maps `unitigs.bin`. Enables random access via `chunk_start(i)`, `unitig(i)`, `raw_kmer(i, j)`, and `verify_canonical_kmer(i, j, q)`.
|
`UnitigFileReader` has three constructors:
|
||||||
|
|
||||||
`UnitigFileReader::open_sequential(path)` does not read `.idx`. It scans `unitigs.bin` once to count chunks and kmers, then leaves `block_offsets` empty. Only sequential iterators work: `iter_unitigs`, `iter_kmers`, `iter_canonical_kmers`, `iter_indexed_canonical_kmers`. Any call to `chunk_start()` panics with a diagnostic message.
|
- `open(path)` — smart default: if `unitigs.bin.idx` exists, delegates to `open_direct_access`; otherwise delegates to `open_sequential`. Prefer this in call sites that don't require one specific mode.
|
||||||
|
- `open_sequential(path)` — never reads `.idx`. Sequential iterators only; `chunk_start(i)` falls back to an O(i) mmap scan rather than panicking.
|
||||||
|
- `open_direct_access(path)` — requires `.idx` to be present. Enables O(1) or O(2^block_bits) `chunk_start(i)`, used by `verify_canonical_kmer` at query time.
|
||||||
|
|
||||||
### `chunk_start(i)` — random access
|
`CanonicalKmerIter` — a clonable sequential iterator returned by `UnitigFileReader::iter_canonical_kmers()`. It holds an `Arc<Mmap>` so cloning resets the cursor to the start without reopening the file. This makes it usable with `par_bridge()` for parallel MPHF construction without random access.
|
||||||
|
|
||||||
```rust
|
### `chunk_start(i)` — access modes
|
||||||
fn chunk_start(&self, i: usize) -> usize {
|
|
||||||
// block_bits=0: single table lookup, O(1) — hot path
|
|
||||||
if self.block_bits == 0 {
|
|
||||||
return self.block_offsets[i] as usize;
|
|
||||||
}
|
|
||||||
// block_bits>0: lookup block, then scan at most 2^block_bits − 1 records
|
|
||||||
let block = i >> self.block_bits;
|
|
||||||
let rem = i & self.mask;
|
|
||||||
let mut offset = self.block_offsets[block] as usize;
|
|
||||||
for _ in 0..rem {
|
|
||||||
let seql_minus_k = self.mmap[offset] as usize;
|
|
||||||
offset += 1 + (seql_minus_k + self.k + 3) / 4;
|
|
||||||
}
|
|
||||||
offset
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
With `block_bits = 0` (the default), every chunk has a direct entry in `block_offsets`: lookup is a single array index, O(1), with no sequential scan. The `if self.block_bits == 0` branch is explicit in the code and handles this hot path first.
|
When `.idx` is loaded (`open_direct_access`):
|
||||||
|
|
||||||
With `block_bits > 0`, one offset covers `2^block_bits` consecutive chunks; access cost is O(`2^block_bits`) sequential mmap reads.
|
- `block_bits = 0`: single array lookup, O(1).
|
||||||
|
- `block_bits > 0`: lookup block, then scan ≤ 2^block_bits records, O(2^block_bits).
|
||||||
|
|
||||||
|
When `.idx` is absent (`open_sequential`): `chunk_start(i)` performs an O(i) sequential mmap scan from offset 0. No panic — the function degrades gracefully. This degraded path is used by `find_strict()` on Approx layers (sequential scan of all canonical kmers).
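The three access costs listed above can be sketched in one function. This is a minimal model, not the crate's code: `ToyReader` and `build_block_offsets` are hypothetical names, the byte buffer is assumed to contain only records with no file header, and the record length formula (one length byte followed by `(seql_minus_k + k + 3) / 4` packed bytes) is taken from the `chunk_start` listing this diff removes.

```rust
/// Toy reader over an in-memory buffer of unitig records.
/// `block_offsets` is empty when no `.idx` was loaded.
struct ToyReader<'a> {
    data: &'a [u8],
    k: usize,
    block_bits: u32,
    block_offsets: Vec<u32>,
}

impl ToyReader<'_> {
    /// Byte length of the record starting at `offset`.
    fn record_len(&self, offset: usize) -> usize {
        let seql_minus_k = self.data[offset] as usize;
        1 + (seql_minus_k + self.k + 3) / 4
    }

    /// Byte offset of chunk `i`.
    /// With offsets: O(1) for block_bits = 0, O(2^block_bits) otherwise.
    /// Without offsets: O(i) scan from the start of the buffer.
    fn chunk_start(&self, i: usize) -> usize {
        let (mut offset, skip) = if self.block_offsets.is_empty() {
            (0, i)
        } else {
            let block = i >> self.block_bits;
            let rem = i & ((1usize << self.block_bits) - 1);
            (self.block_offsets[block] as usize, rem)
        };
        for _ in 0..skip {
            offset += self.record_len(offset);
        }
        offset
    }
}

/// Records the offset of every 2^block_bits-th chunk, then a sentinel
/// equal to the total size, mirroring the `.idx` build described earlier.
fn build_block_offsets(data: &[u8], k: usize, block_bits: u32) -> Vec<u32> {
    let mask = (1usize << block_bits) - 1;
    let mut offsets = Vec::new();
    let (mut offset, mut chunk_count) = (0usize, 0usize);
    while offset < data.len() {
        if chunk_count & mask == 0 {
            offsets.push(offset as u32);
        }
        offset += 1 + (data[offset] as usize + k + 3) / 4;
        chunk_count += 1;
    }
    offsets.push(data.len() as u32); // sentinel
    offsets
}
```

All three configurations return the same offsets; only the number of records scanned per lookup differs.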
|
||||||
|
|
||||||
### Decoding a kmer from slot `s`
|
### Decoding a kmer from slot `s`
|
||||||
|
|
||||||
|
|||||||
@@ -9,7 +9,7 @@ use obisys::{Reporter, Stage};
|
|||||||
use rayon::prelude::*;
|
use rayon::prelude::*;
|
||||||
use tracing::info;
|
use tracing::info;
|
||||||
|
|
||||||
use obilayeredmap::EvidenceKind;
|
use obilayeredmap::IndexMode;
|
||||||
|
|
||||||
use crate::error::{OKIError, OKIResult};
|
use crate::error::{OKIError, OKIResult};
|
||||||
use crate::index::KmerIndex;
|
use crate::index::KmerIndex;
|
||||||
@@ -271,14 +271,16 @@ fn partition_bar(n: u64) -> ProgressBar {
|
|||||||
/// - all `Exact` → OK, returns `Exact`
|
/// - all `Exact` → OK, returns `Exact`
|
||||||
/// - all `Approx { b, z }` same params → OK, returns `Approx { b, z }`
|
/// - all `Approx { b, z }` same params → OK, returns `Approx { b, z }`
|
||||||
/// - mixed exact/approx or different approx params → `IncompatibleEvidence`
|
/// - mixed exact/approx or different approx params → `IncompatibleEvidence`
|
||||||
fn validate_evidence_compat(sources: &[&KmerIndex]) -> OKIResult<EvidenceKind> {
|
fn validate_evidence_compat(sources: &[&KmerIndex]) -> OKIResult<IndexMode> {
|
||||||
let ref_ev = &sources[0].meta.config.evidence;
|
let ref_ev = &sources[0].meta.config.evidence;
|
||||||
for src in &sources[1..] {
|
for src in &sources[1..] {
|
||||||
let ev = &src.meta.config.evidence;
|
let ev = &src.meta.config.evidence;
|
||||||
let compat = match (ref_ev, ev) {
|
let compat = match (ref_ev, ev) {
|
||||||
(EvidenceKind::Exact, EvidenceKind::Exact) => true,
|
(IndexMode::Exact, IndexMode::Exact) => true,
|
||||||
(EvidenceKind::Approx { b: b1, z: z1 },
|
(IndexMode::Approx { b: b1, z: z1 },
|
||||||
EvidenceKind::Approx { b: b2, z: z2 }) => b1 == b2 && z1 == z2,
|
IndexMode::Approx { b: b2, z: z2 }) => b1 == b2 && z1 == z2,
|
||||||
|
(IndexMode::Hybrid { b: b1, z: z1 },
|
||||||
|
IndexMode::Hybrid { b: b2, z: z2 }) => b1 == b2 && z1 == z2,
|
||||||
_ => false,
|
_ => false,
|
||||||
};
|
};
|
||||||
if !compat {
|
if !compat {
|
||||||
|
|||||||
@@ -3,7 +3,7 @@ use std::fs;
|
|||||||
use std::io;
|
use std::io;
|
||||||
use std::path::Path;
|
use std::path::Path;
|
||||||
|
|
||||||
use obilayeredmap::EvidenceKind;
|
use obilayeredmap::IndexMode;
|
||||||
use serde::{Deserialize, Serialize};
|
use serde::{Deserialize, Serialize};
|
||||||
|
|
||||||
pub const META_FILENAME: &str = "index.meta";
|
pub const META_FILENAME: &str = "index.meta";
|
||||||
@@ -30,7 +30,7 @@ pub struct IndexConfig {
|
|||||||
pub n_bits: usize,
|
pub n_bits: usize,
|
||||||
pub with_counts: bool,
|
pub with_counts: bool,
|
||||||
#[serde(default)]
|
#[serde(default)]
|
||||||
pub evidence: EvidenceKind,
|
pub evidence: IndexMode,
|
||||||
/// Block size for the unitig index as a power-of-two exponent.
|
/// Block size for the unitig index as a power-of-two exponent.
|
||||||
/// The `.idx` block covers 2^block_bits consecutive unitigs.
|
/// The `.idx` block covers 2^block_bits consecutive unitigs.
|
||||||
/// 0 = one entry per unitig (O(1) access, largest `.idx`).
|
/// 0 = one entry per unitig (O(1) access, largest `.idx`).
|
||||||
|
|||||||
@@ -3,7 +3,7 @@ use std::path::Path;
|
|||||||
use std::time::Duration;
|
use std::time::Duration;
|
||||||
|
|
||||||
use indicatif::{ProgressBar, ProgressStyle};
|
use indicatif::{ProgressBar, ProgressStyle};
|
||||||
use obilayeredmap::{EvidenceKind, layer::Layer};
|
use obilayeredmap::{IndexMode, layer::Layer};
|
||||||
use obilayeredmap::meta::PartitionMeta;
|
use obilayeredmap::meta::PartitionMeta;
|
||||||
use obisys::{Reporter, Stage};
|
use obisys::{Reporter, Stage};
|
||||||
use rayon::prelude::*;
|
use rayon::prelude::*;
|
||||||
@@ -31,7 +31,7 @@ impl KmerIndex {
|
|||||||
/// `index.meta` is updated with the new evidence kind on success.
|
/// `index.meta` is updated with the new evidence kind on success.
|
||||||
pub fn reindex(
|
pub fn reindex(
|
||||||
&mut self,
|
&mut self,
|
||||||
target: EvidenceKind,
|
target: IndexMode,
|
||||||
block_bits: u8,
|
block_bits: u8,
|
||||||
rep: &mut Reporter,
|
rep: &mut Reporter,
|
||||||
) -> OKIResult<()> {
|
) -> OKIResult<()> {
|
||||||
@@ -75,7 +75,7 @@ impl KmerIndex {
|
|||||||
}
|
}
|
||||||
|
|
||||||
self.meta.config.evidence = target;
|
self.meta.config.evidence = target;
|
||||||
if matches!(self.meta.config.evidence, EvidenceKind::Exact) {
|
if matches!(self.meta.config.evidence, IndexMode::Exact) {
|
||||||
self.meta.config.block_bits = block_bits;
|
self.meta.config.block_bits = block_bits;
|
||||||
}
|
}
|
||||||
self.meta.write(&self.root_path)?;
|
self.meta.write(&self.root_path)?;
|
||||||
@@ -85,7 +85,7 @@ impl KmerIndex {
|
|||||||
}
|
}
|
||||||
|
|
||||||
/// Process all layers of one partition's index directory.
|
/// Process all layers of one partition's index directory.
|
||||||
fn reindex_partition(index_dir: &Path, target: &EvidenceKind, block_bits: u8) -> OKIResult<()> {
|
fn reindex_partition(index_dir: &Path, target: &IndexMode, block_bits: u8) -> OKIResult<()> {
|
||||||
if !index_dir.exists() {
|
if !index_dir.exists() {
|
||||||
return Ok(());
|
return Ok(());
|
||||||
}
|
}
|
||||||
@@ -97,22 +97,30 @@ fn reindex_partition(index_dir: &Path, target: &EvidenceKind, block_bits: u8) ->
|
|||||||
Ok(())
|
Ok(())
|
||||||
}
|
}
|
||||||
|
|
||||||
fn reindex_layer(layer_dir: &Path, target: &EvidenceKind, block_bits: u8) -> OKIResult<()> {
|
fn reindex_layer(layer_dir: &Path, target: &IndexMode, block_bits: u8) -> OKIResult<()> {
|
||||||
Layer::<()>::build_evidence(layer_dir, target, block_bits).map_err(olm_to_oki)?;
|
match target {
|
||||||
|
IndexMode::Exact => {
|
||||||
|
Layer::<()>::build_exact_evidence(layer_dir, block_bits).map_err(olm_to_oki)?;
|
||||||
|
}
|
||||||
|
IndexMode::Approx { b, z } | IndexMode::Hybrid { b, z } => {
|
||||||
|
Layer::<()>::build_approx_evidence(layer_dir, *b, *z).map_err(olm_to_oki)?;
|
||||||
|
}
|
||||||
|
}
|
||||||
remove_stale_evidence(layer_dir, target)
|
remove_stale_evidence(layer_dir, target)
|
||||||
}
|
}
|
||||||
|
|
||||||
fn remove_stale_evidence(layer_dir: &Path, target: &EvidenceKind) -> OKIResult<()> {
|
fn remove_stale_evidence(layer_dir: &Path, target: &IndexMode) -> OKIResult<()> {
|
||||||
match target {
|
match target {
|
||||||
EvidenceKind::Exact => {
|
IndexMode::Exact => {
|
||||||
// fingerprint.bin is no longer valid
|
|
||||||
remove_if_exists(&layer_dir.join(FINGERPRINT_FILE));
|
remove_if_exists(&layer_dir.join(FINGERPRINT_FILE));
|
||||||
}
|
}
|
||||||
EvidenceKind::Approx { .. } => {
|
IndexMode::Approx { .. } => {
|
||||||
// exact bundle is no longer valid
|
|
||||||
remove_if_exists(&layer_dir.join(EVIDENCE_FILE));
|
remove_if_exists(&layer_dir.join(EVIDENCE_FILE));
|
||||||
remove_if_exists(&layer_dir.join(UNITIG_IDX_FILE));
|
remove_if_exists(&layer_dir.join(UNITIG_IDX_FILE));
|
||||||
}
|
}
|
||||||
|
IndexMode::Hybrid { .. } => {
|
||||||
|
// both bundles kept — nothing to remove
|
||||||
|
}
|
||||||
}
|
}
|
||||||
Ok(())
|
Ok(())
|
||||||
}
|
}
|
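Reviewer note: the cleanup rule in `remove_stale_evidence` can be read as a pure mapping from the target mode to the files that become stale. The sketch below is a stand-in written under assumptions, not the crate's code. The `IndexMode` enum is re-declared locally with `u8` fields (the real field types are not visible in this diff), and the values of `EVIDENCE_FILE` and `UNITIG_IDX_FILE` are placeholders; only `fingerprint.bin` is named in the original comment.

```rust
// Local stand-in for obilayeredmap::IndexMode. Variant names follow the diff;
// the u8 field types are an assumption.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum IndexMode {
    Exact,
    Approx { b: u8, z: u8 },
    Hybrid { b: u8, z: u8 },
}

// File names: "fingerprint.bin" appears in the original comment; the other two
// values are placeholders, since the real constants are not shown in the diff.
const FINGERPRINT_FILE: &str = "fingerprint.bin";
const EVIDENCE_FILE: &str = "evidence.bin";
const UNITIG_IDX_FILE: &str = "unitigs.idx";

/// Files that no longer match the target mode after a reindex.
fn stale_files(target: &IndexMode) -> Vec<&'static str> {
    match target {
        // Switching to exact evidence invalidates the fingerprint bundle.
        IndexMode::Exact => vec![FINGERPRINT_FILE],
        // Switching to approximate evidence invalidates the exact bundle and its index.
        IndexMode::Approx { .. } => vec![EVIDENCE_FILE, UNITIG_IDX_FILE],
        // Hybrid keeps both bundles, so nothing is stale.
        IndexMode::Hybrid { .. } => Vec::new(),
    }
}
```

Note that in `reindex_layer` as shown, `Hybrid` shares the `build_approx_evidence` branch with `Approx`, while the cleanup step removes nothing for it. Whether the exact bundle is expected to exist already in that case is not stated in this hunk.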
||||||
|
|||||||
@@ -2,7 +2,7 @@ use std::path::PathBuf;
|
|||||||
|
|
||||||
use clap::Args;
|
use clap::Args;
|
||||||
use obikindex::{GenomeInfo, IndexConfig, IndexState, KmerIndex};
|
use obikindex::{GenomeInfo, IndexConfig, IndexState, KmerIndex};
|
||||||
use obilayeredmap::EvidenceKind;
|
use obilayeredmap::IndexMode;
|
||||||
|
|
||||||
fn parse_key_value(s: &str) -> Result<(String, String), String> {
|
fn parse_key_value(s: &str) -> Result<(String, String), String> {
|
||||||
let pos = s.find('=').ok_or_else(|| format!("invalid key=value: no '=' in '{s}'"))?;
|
let pos = s.find('=').ok_or_else(|| format!("invalid key=value: no '=' in '{s}'"))?;
|
||||||
@@ -159,9 +159,9 @@ pub fn run(args: IndexArgs) {
|
|||||||
let evidence = if args.approx {
|
let evidence = if args.approx {
|
||||||
let (z, b, fp) = resolve_approx_params(args.findere_z, args.evidence_bits, args.fp);
|
let (z, b, fp) = resolve_approx_params(args.findere_z, args.evidence_bits, args.fp);
|
||||||
info!("approximate evidence: b={b}, z={z}, fp={fp:.2e}");
|
info!("approximate evidence: b={b}, z={z}, fp={fp:.2e}");
|
||||||
EvidenceKind::Approx { b, z }
|
IndexMode::Approx { b, z }
|
||||||
} else {
|
} else {
|
||||||
EvidenceKind::Exact
|
IndexMode::Exact
|
||||||
};
|
};
|
||||||
|
|
||||||
// ── Open or create the index ─────────────────────────────────────────────
|
// ── Open or create the index ─────────────────────────────────────────────
|
||||||
|
|||||||
@@ -2,7 +2,7 @@ use std::path::PathBuf;
|
|||||||
|
|
||||||
use clap::Args;
|
use clap::Args;
|
||||||
use obikindex::KmerIndex;
|
use obikindex::KmerIndex;
|
||||||
use obilayeredmap::EvidenceKind;
|
use obilayeredmap::IndexMode;
|
||||||
use obisys::Reporter;
|
use obisys::Reporter;
|
||||||
use tracing::info;
|
use tracing::info;
|
||||||
|
|
||||||
@@ -41,10 +41,10 @@ pub fn run(args: ReindexArgs) {
|
|||||||
let target = if args.approx {
|
let target = if args.approx {
|
||||||
let (z, b, fp) = resolve_approx_params(args.findere_z, args.evidence_bits, args.fp);
|
let (z, b, fp) = resolve_approx_params(args.findere_z, args.evidence_bits, args.fp);
|
||||||
info!("target: approximate evidence — b={b}, z={z}, fp={fp:.2e}");
|
info!("target: approximate evidence — b={b}, z={z}, fp={fp:.2e}");
|
||||||
EvidenceKind::Approx { b, z }
|
IndexMode::Approx { b, z }
|
||||||
} else {
|
} else {
|
||||||
info!("target: exact evidence");
|
info!("target: exact evidence");
|
||||||
EvidenceKind::Exact
|
IndexMode::Exact
|
||||||
};
|
};
|
||||||
|
|
||||||
let mut idx = KmerIndex::open(&args.index).unwrap_or_else(|e| {
|
let mut idx = KmerIndex::open(&args.index).unwrap_or_else(|e| {
|
||||||
|
|||||||
@@ -1,8 +1,8 @@
|
|||||||
use obicompactvec::{PersistentBitMatrix, PersistentCompactIntMatrix};
|
use obicompactvec::{PersistentBitMatrix, PersistentCompactIntMatrix};
|
||||||
use obikseq::CanonicalKmer;
|
use obikseq::CanonicalKmer;
|
||||||
use obiskio::{SKError, SKResult, UnitigFileReader};
|
use obiskio::{SKError, SKResult, UnitigFileReader};
|
||||||
use obilayeredmap::OLMError;
|
use obilayeredmap::{IndexMode, MphfLayer, OLMError};
|
||||||
use obilayeredmap::MphfLayer;
|
use obilayeredmap::meta::PartitionMeta;
|
||||||
|
|
||||||
use crate::partition::KmerPartition;
|
use crate::partition::KmerPartition;
|
||||||
|
|
||||||
@@ -35,15 +35,16 @@ impl KmerPartition {
|
|||||||
return Ok(());
|
return Ok(());
|
||||||
}
|
}
|
||||||
|
|
||||||
// Discover layers by probing layer_0, layer_1, … until one is absent.
|
let index_mode = PartitionMeta::load(&index_dir)
|
||||||
// PartitionMeta (meta.json) is only created by the merge path, not by
|
.map(|m| m.mode)
|
||||||
// the initial single-genome build, so we cannot rely on it here.
|
.unwrap_or(IndexMode::Exact);
|
||||||
|
|
||||||
let mut l = 0;
|
let mut l = 0;
|
||||||
loop {
|
loop {
|
||||||
let layer_dir = index_dir.join(format!("layer_{l}"));
|
let layer_dir = index_dir.join(format!("layer_{l}"));
|
||||||
if !layer_dir.exists() { break; }
|
if !layer_dir.exists() { break; }
|
||||||
l += 1;
|
l += 1;
|
||||||
let mphf = MphfLayer::open(&layer_dir).map_err(olm_to_sk)?;
|
let mphf = MphfLayer::open(&layer_dir, &index_mode).map_err(olm_to_sk)?;
|
||||||
let reader = UnitigFileReader::open_sequential(&layer_dir.join("unitigs.bin"))?;
|
let reader = UnitigFileReader::open_sequential(&layer_dir.join("unitigs.bin"))?;
|
||||||
|
|
||||||
let counts_dir = layer_dir.join("counts");
|
let counts_dir = layer_dir.join("counts");
|
||||||
@@ -92,11 +93,15 @@ impl KmerPartition {
|
|||||||
return Ok(());
|
return Ok(());
|
||||||
}
|
}
|
||||||
|
|
||||||
|
let index_mode = PartitionMeta::load(&index_dir)
|
||||||
|
.map(|m| m.mode)
|
||||||
|
.unwrap_or(IndexMode::Exact);
|
||||||
|
|
||||||
let mut layer = 0;
|
let mut layer = 0;
|
||||||
loop {
|
loop {
|
||||||
let layer_dir = index_dir.join(format!("layer_{layer}"));
|
let layer_dir = index_dir.join(format!("layer_{layer}"));
|
||||||
if !layer_dir.exists() { break; }
|
if !layer_dir.exists() { break; }
|
||||||
let mphf = MphfLayer::open(&layer_dir).map_err(olm_to_sk)?;
|
let mphf = MphfLayer::open(&layer_dir, &index_mode).map_err(olm_to_sk)?;
|
||||||
let reader = UnitigFileReader::open_sequential(&layer_dir.join("unitigs.bin"))?;
|
let reader = UnitigFileReader::open_sequential(&layer_dir.join("unitigs.bin"))?;
|
||||||
|
|
||||||
let counts_dir = layer_dir.join("counts");
|
let counts_dir = layer_dir.join("counts");
|
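Reviewer note: both hunks above resolve the mode with `PartitionMeta::load(..).map(|m| m.mode).unwrap_or(IndexMode::Exact)`, which falls back to `Exact` on any load error, not only on a missing `meta.json`. The sketch below contrasts that lenient behaviour with a stricter variant. It is an illustration under assumptions: `IndexMode` and `PartitionMeta` are local stand-ins, `std::io::Result` replaces the crate's `OLMResult`, and both function names are hypothetical.

```rust
use std::io;

// Local stand-ins; variant names and struct fields follow the diff, field types are assumed.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum IndexMode {
    Exact,
    Approx { b: u8, z: u8 },
    Hybrid { b: u8, z: u8 },
}

#[allow(dead_code)]
struct PartitionMeta {
    n_layers: usize,
    mode: IndexMode,
}

/// Lenient resolution, mirroring the diff: any load failure yields `Exact`.
fn mode_or_exact(loaded: io::Result<PartitionMeta>) -> IndexMode {
    loaded.map(|m| m.mode).unwrap_or(IndexMode::Exact)
}

/// Stricter alternative (hypothetical): only a missing file falls back to
/// `Exact`; other errors, such as a corrupt file, are propagated.
fn mode_or_exact_strict(loaded: io::Result<PartitionMeta>) -> io::Result<IndexMode> {
    match loaded {
        Ok(m) => Ok(m.mode),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(IndexMode::Exact),
        Err(e) => Err(e),
    }
}
```

The stricter form matches the `NotFound` guard that `load_meta` already uses elsewhere in this commit. Whether the lenient form is intentional here (for indexes built before `meta.json` carried a mode) is a question for the author.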
||||||
|
|||||||
@@ -5,7 +5,7 @@ use cacheline_ef::{CachelineEf, CachelineEfVec};
|
|||||||
use epserde::prelude::*;
|
use epserde::prelude::*;
|
||||||
use obicompactvec::{PersistentCompactIntMatrix, PersistentCompactIntVec};
|
use obicompactvec::{PersistentCompactIntMatrix, PersistentCompactIntVec};
|
||||||
use obidebruinj::GraphDeBruijn;
|
use obidebruinj::GraphDeBruijn;
|
||||||
use obilayeredmap::{EvidenceKind, OLMError, layer::Layer};
|
use obilayeredmap::{IndexMode, OLMError, layer::Layer};
|
||||||
use obilayeredmap::meta::PartitionMeta;
|
use obilayeredmap::meta::PartitionMeta;
|
||||||
use obiskio::{SKError, SKFileMeta, SKFileReader};
|
use obiskio::{SKError, SKFileMeta, SKFileReader};
|
||||||
use ptr_hash::{PtrHash, bucket_fn::CubicEps, hash::Xx64};
|
use ptr_hash::{PtrHash, bucket_fn::CubicEps, hash::Xx64};
|
||||||
@@ -44,7 +44,7 @@ impl KmerPartition {
|
|||||||
min_ab: u32,
|
min_ab: u32,
|
||||||
max_ab: Option<u32>,
|
max_ab: Option<u32>,
|
||||||
with_counts: bool,
|
with_counts: bool,
|
||||||
evidence: &EvidenceKind,
|
mode: &IndexMode,
|
||||||
block_bits: u8,
|
block_bits: u8,
|
||||||
) -> Result<usize, SKError> {
|
) -> Result<usize, SKError> {
|
||||||
let part_dir = self.part_dir(i);
|
let part_dir = self.part_dir(i);
|
||||||
@@ -110,7 +110,7 @@ impl KmerPartition {
|
|||||||
uw.close()?;
|
uw.close()?;
|
||||||
|
|
||||||
if with_counts {
|
if with_counts {
|
||||||
Layer::<PersistentCompactIntMatrix>::build(&layer_dir, block_bits, evidence, |kmer| {
|
Layer::<PersistentCompactIntMatrix>::build(&layer_dir, block_bits, mode, |kmer| {
|
||||||
match (&mphf1_opt, &counts1_opt) {
|
match (&mphf1_opt, &counts1_opt) {
|
||||||
(Some(mphf), Some(counts)) => counts.get(mphf.index(&kmer.raw())),
|
(Some(mphf), Some(counts)) => counts.get(mphf.index(&kmer.raw())),
|
||||||
_ => 1,
|
_ => 1,
|
||||||
@@ -118,13 +118,11 @@ impl KmerPartition {
|
|||||||
})
|
})
|
||||||
.map_err(olm_to_sk)?;
|
.map_err(olm_to_sk)?;
|
||||||
} else {
|
} else {
|
||||||
Layer::<()>::build(&layer_dir, block_bits, evidence).map_err(olm_to_sk)?;
|
Layer::<()>::build(&layer_dir, block_bits, mode).map_err(olm_to_sk)?;
|
||||||
}
|
}
|
||||||
|
|
||||||
// Write meta.json in the index/ directory so LayeredMap::open works
|
|
||||||
// (e.g. for subsequent merge operations).
|
|
||||||
let index_dir = layer_dir.parent().expect("layer_dir has a parent");
|
let index_dir = layer_dir.parent().expect("layer_dir has a parent");
|
||||||
PartitionMeta { n_layers: 1 }.save(index_dir).map_err(olm_to_sk)?;
|
PartitionMeta { n_layers: 1, mode: mode.clone() }.save(index_dir).map_err(olm_to_sk)?;
|
||||||
|
|
||||||
Ok(n_kmers)
|
Ok(n_kmers)
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -9,7 +9,7 @@ use obicompactvec::{PersistentBitMatrix, PersistentBitMatrixBuilder,
|
|||||||
PersistentCompactIntVecBuilder};
|
PersistentCompactIntVecBuilder};
|
||||||
use obikseq::CanonicalKmer;
|
use obikseq::CanonicalKmer;
|
||||||
use obiskio::{SKError, SKResult, UnitigFileReader};
|
use obiskio::{SKError, SKResult, UnitigFileReader};
|
||||||
use obilayeredmap::{EvidenceKind, Layer, LayeredMap, MphfLayer, OLMError};
|
use obilayeredmap::{IndexMode, Layer, LayeredMap, MphfLayer, OLMError};
|
||||||
use obilayeredmap::meta::PartitionMeta;
|
use obilayeredmap::meta::PartitionMeta;
|
||||||
|
|
||||||
use crate::partition::KmerPartition;
|
use crate::partition::KmerPartition;
|
||||||
@@ -52,18 +52,17 @@ pub(crate) enum SrcLayerData {
|
|||||||
}
|
}
|
||||||
|
|
||||||
impl SrcLayerData {
|
impl SrcLayerData {
|
||||||
pub(crate) fn open(layer_dir: &Path, mode: MergeMode) -> SKResult<Self> {
|
pub(crate) fn open(layer_dir: &Path, merge_mode: MergeMode, index_mode: &IndexMode) -> SKResult<Self> {
|
||||||
let presence_dir = layer_dir.join("presence");
|
let presence_dir = layer_dir.join("presence");
|
||||||
let counts_dir = layer_dir.join("counts");
|
let counts_dir = layer_dir.join("counts");
|
||||||
match mode {
|
match merge_mode {
|
||||||
MergeMode::Presence => {
|
MergeMode::Presence => {
|
||||||
if presence_dir.exists() {
|
if presence_dir.exists() {
|
||||||
let mphf = MphfLayer::open(layer_dir).map_err(olm_to_sk)?;
|
let mphf = MphfLayer::open(layer_dir, index_mode).map_err(olm_to_sk)?;
|
||||||
let mat = PersistentBitMatrix::open(&presence_dir).map_err(SKError::Io)?;
|
let mat = PersistentBitMatrix::open(&presence_dir).map_err(SKError::Io)?;
|
||||||
Ok(SrcLayerData::Presence(mphf, mat))
|
Ok(SrcLayerData::Presence(mphf, mat))
|
||||||
} else if counts_dir.exists() {
|
} else if counts_dir.exists() {
|
||||||
// Source is a count index; treat count > 0 as present via ColBuilder::Bit.
|
let mphf = MphfLayer::open(layer_dir, index_mode).map_err(olm_to_sk)?;
|
||||||
let mphf = MphfLayer::open(layer_dir).map_err(olm_to_sk)?;
|
|
||||||
let mat = PersistentCompactIntMatrix::open(&counts_dir).map_err(SKError::Io)?;
|
let mat = PersistentCompactIntMatrix::open(&counts_dir).map_err(SKError::Io)?;
|
||||||
Ok(SrcLayerData::Count(mphf, mat))
|
Ok(SrcLayerData::Count(mphf, mat))
|
||||||
} else {
|
} else {
|
||||||
@@ -72,7 +71,7 @@ impl SrcLayerData {
|
|||||||
}
|
}
|
||||||
MergeMode::Count => {
|
MergeMode::Count => {
|
||||||
if counts_dir.exists() {
|
if counts_dir.exists() {
|
||||||
let mphf = MphfLayer::open(layer_dir).map_err(olm_to_sk)?;
|
let mphf = MphfLayer::open(layer_dir, index_mode).map_err(olm_to_sk)?;
|
||||||
let mat = PersistentCompactIntMatrix::open(&counts_dir).map_err(SKError::Io)?;
|
let mat = PersistentCompactIntMatrix::open(&counts_dir).map_err(SKError::Io)?;
|
||||||
Ok(SrcLayerData::Count(mphf, mat))
|
Ok(SrcLayerData::Count(mphf, mat))
|
||||||
} else {
|
} else {
|
||||||
@@ -116,7 +115,7 @@ fn load_meta(dir: &Path) -> SKResult<PartitionMeta> {
|
|||||||
Err(e) if matches!(e, OLMError::Io(ref io_e) if io_e.kind() == std::io::ErrorKind::NotFound) => {
|
Err(e) if matches!(e, OLMError::Io(ref io_e) if io_e.kind() == std::io::ErrorKind::NotFound) => {
|
||||||
let mut n = 0usize;
|
let mut n = 0usize;
|
||||||
while dir.join(format!("layer_{n}")).exists() { n += 1; }
|
while dir.join(format!("layer_{n}")).exists() { n += 1; }
|
||||||
let m = PartitionMeta { n_layers: n };
|
let m = PartitionMeta { n_layers: n, mode: IndexMode::default() };
|
||||||
m.save(dir).map_err(olm_to_sk)?;
|
m.save(dir).map_err(olm_to_sk)?;
|
||||||
Ok(m)
|
Ok(m)
|
||||||
}
|
}
|
||||||
@@ -217,12 +216,12 @@ impl KmerPartition {
|
|||||||
uw.write(&unitig)?;
|
uw.write(&unitig)?;
|
||||||
}
|
}
|
||||||
uw.close()?;
|
uw.close()?;
|
||||||
Layer::<()>::build(&new_layer_dir, block_bits, &EvidenceKind::Exact).map_err(olm_to_sk)?;
|
Layer::<()>::build(&new_layer_dir, block_bits, &IndexMode::Exact).map_err(olm_to_sk)?;
|
||||||
}
|
}
|
||||||
drop(g);
|
drop(g);
|
||||||
|
|
||||||
let new_mphf = if any_new {
|
let new_mphf = if any_new {
|
||||||
Some(MphfLayer::open(&new_layer_dir).map_err(olm_to_sk)?)
|
Some(MphfLayer::open(&new_layer_dir, &IndexMode::Exact).map_err(olm_to_sk)?)
|
||||||
} else {
|
} else {
|
||||||
None
|
None
|
||||||
};
|
};
|
||||||
@@ -304,7 +303,7 @@ impl KmerPartition {
|
|||||||
for l in 0..src_meta.n_layers {
|
for l in 0..src_meta.n_layers {
|
||||||
let src_layer_dir = src_index_dir.join(format!("layer_{l}"));
|
let src_layer_dir = src_index_dir.join(format!("layer_{l}"));
|
||||||
let reader = UnitigFileReader::open_sequential(&src_layer_dir.join("unitigs.bin"))?;
|
let reader = UnitigFileReader::open_sequential(&src_layer_dir.join("unitigs.bin"))?;
|
||||||
let src_data = SrcLayerData::open(&src_layer_dir, mode)?;
|
let src_data = SrcLayerData::open(&src_layer_dir, mode, &src_meta.mode)?;
|
||||||
|
|
||||||
for (kmer, _, _) in reader.iter_indexed_canonical_kmers() {
|
for (kmer, _, _) in reader.iter_indexed_canonical_kmers() {
|
||||||
let values = src_data.lookup(kmer, *src_n);
|
let values = src_data.lookup(kmer, *src_n);
|
||||||
|
|||||||
@@ -3,7 +3,7 @@ use std::path::Path;
|
|||||||
use obicompactvec::{PersistentBitMatrix, PersistentCompactIntMatrix};
|
use obicompactvec::{PersistentBitMatrix, PersistentCompactIntMatrix};
|
||||||
use obikseq::{CanonicalKmer, RoutableSuperKmer};
|
use obikseq::{CanonicalKmer, RoutableSuperKmer};
|
||||||
use obiskio::{SKError, SKResult};
|
use obiskio::{SKError, SKResult};
|
||||||
use obilayeredmap::{MphfLayer, OLMError};
|
use obilayeredmap::{IndexMode, MphfLayer, OLMError};
|
||||||
use obilayeredmap::meta::PartitionMeta;
|
use obilayeredmap::meta::PartitionMeta;
|
||||||
|
|
||||||
use crate::partition::KmerPartition;
|
use crate::partition::KmerPartition;
|
||||||
@@ -27,25 +27,25 @@ enum QueryLayer {
|
|||||||
}
|
}
|
||||||
|
|
||||||
impl QueryLayer {
|
impl QueryLayer {
|
||||||
fn open(layer_dir: &Path, with_counts: bool) -> SKResult<Self> {
|
fn open(layer_dir: &Path, with_counts: bool, mode: &IndexMode) -> SKResult<Self> {
|
||||||
let presence_dir = layer_dir.join("presence");
|
let presence_dir = layer_dir.join("presence");
|
||||||
let counts_dir = layer_dir.join("counts");
|
let counts_dir = layer_dir.join("counts");
|
||||||
|
|
||||||
if with_counts && counts_dir.exists() {
|
if with_counts && counts_dir.exists() {
|
||||||
let mphf = MphfLayer::open(layer_dir).map_err(olm_to_sk)?;
|
let mphf = MphfLayer::open(layer_dir, mode).map_err(olm_to_sk)?;
|
||||||
let mat = PersistentCompactIntMatrix::open(&counts_dir).map_err(SKError::Io)?;
|
let mat = PersistentCompactIntMatrix::open(&counts_dir).map_err(SKError::Io)?;
|
||||||
Ok(QueryLayer::Count(mphf, mat))
|
Ok(QueryLayer::Count(mphf, mat))
|
||||||
} else if presence_dir.exists() {
|
} else if presence_dir.exists() {
|
||||||
let mphf = MphfLayer::open(layer_dir).map_err(olm_to_sk)?;
|
let mphf = MphfLayer::open(layer_dir, mode).map_err(olm_to_sk)?;
|
||||||
let mat = PersistentBitMatrix::open(&presence_dir).map_err(SKError::Io)?;
|
let mat = PersistentBitMatrix::open(&presence_dir).map_err(SKError::Io)?;
|
||||||
Ok(QueryLayer::Presence(mphf, mat))
|
Ok(QueryLayer::Presence(mphf, mat))
|
||||||
} else if counts_dir.exists() {
|
} else if counts_dir.exists() {
|
||||||
// presence query on a count index — return counts as-is
|
// presence query on a count index — return counts as-is
|
||||||
let mphf = MphfLayer::open(layer_dir).map_err(olm_to_sk)?;
|
let mphf = MphfLayer::open(layer_dir, mode).map_err(olm_to_sk)?;
|
||||||
let mat = PersistentCompactIntMatrix::open(&counts_dir).map_err(SKError::Io)?;
|
let mat = PersistentCompactIntMatrix::open(&counts_dir).map_err(SKError::Io)?;
|
||||||
Ok(QueryLayer::Count(mphf, mat))
|
Ok(QueryLayer::Count(mphf, mat))
|
||||||
} else {
|
} else {
|
||||||
let mphf = MphfLayer::open(layer_dir).map_err(olm_to_sk)?;
|
let mphf = MphfLayer::open(layer_dir, mode).map_err(olm_to_sk)?;
|
||||||
Ok(QueryLayer::SetOnly(mphf))
|
Ok(QueryLayer::SetOnly(mphf))
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
@@ -102,7 +102,7 @@ impl KmerPartition {
|
|||||||
|
|
||||||
let meta = PartitionMeta::load(&index_dir).map_err(olm_to_sk)?;
|
let meta = PartitionMeta::load(&index_dir).map_err(olm_to_sk)?;
|
||||||
let layers: Vec<QueryLayer> = (0..meta.n_layers)
|
let layers: Vec<QueryLayer> = (0..meta.n_layers)
|
||||||
.map(|i| QueryLayer::open(&index_dir.join(format!("layer_{i}")), with_counts))
|
.map(|i| QueryLayer::open(&index_dir.join(format!("layer_{i}")), with_counts, &meta.mode))
|
||||||
.collect::<SKResult<_>>()?;
|
.collect::<SKResult<_>>()?;
|
||||||
|
|
||||||
Ok(superkmers
|
Ok(superkmers
|
||||||
|
|||||||
@@ -8,7 +8,7 @@ use obicompactvec::{PersistentBitMatrixBuilder,
|
|||||||
PersistentCompactIntVecBuilder};
|
PersistentCompactIntVecBuilder};
|
||||||
use obidebruinj::GraphDeBruijn;
|
use obidebruinj::GraphDeBruijn;
|
||||||
use obiskio::{SKError, SKResult, UnitigFileReader};
|
use obiskio::{SKError, SKResult, UnitigFileReader};
|
||||||
use obilayeredmap::{EvidenceKind, Layer, MphfLayer, OLMError};
|
use obilayeredmap::{IndexMode, Layer, MphfLayer, OLMError};
|
||||||
use obilayeredmap::meta::PartitionMeta;
|
use obilayeredmap::meta::PartitionMeta;
|
||||||
|
|
||||||
use crate::filter::KmerFilter;
|
use crate::filter::KmerFilter;
|
||||||
@@ -67,7 +67,7 @@ fn load_meta(dir: &Path) -> SKResult<PartitionMeta> {
|
|||||||
Err(e) if matches!(e, OLMError::Io(ref io_e) if io_e.kind() == std::io::ErrorKind::NotFound) => {
|
Err(e) if matches!(e, OLMError::Io(ref io_e) if io_e.kind() == std::io::ErrorKind::NotFound) => {
|
||||||
let mut n = 0usize;
|
let mut n = 0usize;
|
||||||
while dir.join(format!("layer_{n}")).exists() { n += 1; }
|
while dir.join(format!("layer_{n}")).exists() { n += 1; }
|
||||||
let m = PartitionMeta { n_layers: n };
|
let m = PartitionMeta { n_layers: n, mode: IndexMode::default() };
|
||||||
m.save(dir).map_err(olm_to_sk)?;
|
m.save(dir).map_err(olm_to_sk)?;
|
||||||
Ok(m)
|
Ok(m)
|
||||||
}
|
}
|
||||||
@@ -117,7 +117,7 @@ impl KmerPartition {
|
|||||||
if !unitigs_path.exists() { continue; }
|
if !unitigs_path.exists() { continue; }
|
||||||
|
|
||||||
let reader = UnitigFileReader::open_sequential(&unitigs_path)?;
|
let reader = UnitigFileReader::open_sequential(&unitigs_path)?;
|
||||||
let src_data = SrcLayerData::open(&src_layer_dir, mode)?;
|
let src_data = SrcLayerData::open(&src_layer_dir, mode, &src_meta.mode)?;
|
||||||
|
|
||||||
for (kmer, _, _) in reader.iter_indexed_canonical_kmers() {
|
for (kmer, _, _) in reader.iter_indexed_canonical_kmers() {
|
||||||
let row = src_data.lookup(kmer, n_genomes);
|
let row = src_data.lookup(kmer, n_genomes);
|
||||||
@@ -146,8 +146,8 @@ impl KmerPartition {
|
|||||||
uw.close()?;
|
uw.close()?;
|
||||||
drop(g);
|
drop(g);
|
||||||
|
|
||||||
Layer::<()>::build(&dst_layer_dir, block_bits, &EvidenceKind::Exact).map_err(olm_to_sk)?;
|
Layer::<()>::build(&dst_layer_dir, block_bits, &IndexMode::Exact).map_err(olm_to_sk)?;
|
||||||
let dst_mphf = MphfLayer::open(&dst_layer_dir).map_err(olm_to_sk)?;
|
let dst_mphf = MphfLayer::open(&dst_layer_dir, &IndexMode::Exact).map_err(olm_to_sk)?;
|
||||||
|
|
||||||
// ── Prepare matrix builders (one column per genome) ───────────────────
|
// ── Prepare matrix builders (one column per genome) ───────────────────
|
||||||
let data_dir = match mode {
|
let data_dir = match mode {
|
||||||
@@ -182,7 +182,7 @@ impl KmerPartition {
|
|||||||
if !unitigs_path.exists() { continue; }
|
if !unitigs_path.exists() { continue; }
|
||||||
|
|
||||||
let reader = UnitigFileReader::open_sequential(&unitigs_path)?;
|
let reader = UnitigFileReader::open_sequential(&unitigs_path)?;
|
||||||
let src_data = SrcLayerData::open(&src_layer_dir, mode)?;
|
let src_data = SrcLayerData::open(&src_layer_dir, mode, &src_meta.mode)?;
|
||||||
|
|
||||||
for (kmer, _, _) in reader.iter_indexed_canonical_kmers() {
|
for (kmer, _, _) in reader.iter_indexed_canonical_kmers() {
|
||||||
let row = src_data.lookup(kmer, n_genomes);
|
let row = src_data.lookup(kmer, n_genomes);
|
||||||
@@ -199,7 +199,7 @@ impl KmerPartition {
|
|||||||
for b in builders { b.close()?; }
|
for b in builders { b.close()?; }
|
||||||
write_matrix_meta(&data_dir, n_new, n_genomes).map_err(SKError::Io)?;
|
write_matrix_meta(&data_dir, n_new, n_genomes).map_err(SKError::Io)?;
|
||||||
|
|
||||||
PartitionMeta { n_layers: 1 }.save(&dst_index_dir).map_err(olm_to_sk)?;
|
PartitionMeta { n_layers: 1, mode: IndexMode::Exact }.save(&dst_index_dir).map_err(olm_to_sk)?;
|
||||||
|
|
||||||
Ok(())
|
Ok(())
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -10,7 +10,7 @@ use obikseq::CanonicalKmer;
|
|||||||
use obiskio::{UnitigFileReader, UnitigFileWriter};
|
use obiskio::{UnitigFileReader, UnitigFileWriter};
|
||||||
|
|
||||||
use crate::error::{OLMError, OLMResult};
|
use crate::error::{OLMError, OLMResult};
|
||||||
use crate::meta::EvidenceKind;
|
use crate::meta::IndexMode;
|
||||||
use crate::mphf_layer::MphfLayer;
|
use crate::mphf_layer::MphfLayer;
|
||||||
pub(crate) use crate::mphf_layer::UNITIGS_FILE;
|
pub(crate) use crate::mphf_layer::UNITIGS_FILE;
|
||||||
|
|
||||||
@@ -62,8 +62,8 @@ pub struct Hit<T = ()> {
|
|||||||
// ── Common read path ──────────────────────────────────────────────────────────
|
// ── Common read path ──────────────────────────────────────────────────────────
|
||||||
|
|
||||||
impl<D: LayerData> Layer<D> {
|
impl<D: LayerData> Layer<D> {
|
||||||
pub fn open(path: &Path) -> OLMResult<Self> {
|
pub fn open(path: &Path, mode: &IndexMode) -> OLMResult<Self> {
|
||||||
let mphf = MphfLayer::open(path)?;
|
let mphf = MphfLayer::open(path, mode)?;
|
||||||
let data = D::open(path)?;
|
let data = D::open(path)?;
|
||||||
Ok(Self { mphf, data })
|
Ok(Self { mphf, data })
|
||||||
}
|
}
|
||||||
@@ -92,18 +92,13 @@ impl<D: LayerData> Layer<D> {
|
|||||||
MphfLayer::build_approx_evidence(layer_dir, b, z)
|
MphfLayer::build_approx_evidence(layer_dir, b, z)
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Dispatch to `build_exact_evidence` or `build_approx_evidence`.
|
|
||||||
/// `block_bits` is forwarded to exact evidence only.
|
|
||||||
pub fn build_evidence(layer_dir: &Path, kind: &EvidenceKind, block_bits: u8) -> OLMResult<usize> {
|
|
||||||
MphfLayer::build_evidence(layer_dir, kind, block_bits)
|
|
||||||
}
|
|
||||||
}
|
}
|
||||||
|
|
||||||
// ── Mode 1 — set membership ───────────────────────────────────────────────────
|
// ── Mode 1 — set membership ───────────────────────────────────────────────────
|
||||||
|
|
||||||
impl Layer<()> {
|
impl Layer<()> {
|
||||||
pub fn build(out_dir: &Path, block_bits: u8, evidence_kind: &EvidenceKind) -> OLMResult<usize> {
|
pub fn build(out_dir: &Path, block_bits: u8, mode: &IndexMode) -> OLMResult<usize> {
|
||||||
MphfLayer::build(out_dir, block_bits, evidence_kind, &mut |_, _| Ok(()))
|
MphfLayer::build(out_dir, block_bits, mode, &mut |_, _| Ok(()))
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Create a presence matrix for a set-membership layer (first merge).
|
/// Create a presence matrix for a set-membership layer (first merge).
|
||||||
@@ -126,7 +121,7 @@ impl Layer<PersistentCompactIntMatrix> {
|
|||||||
pub fn build(
|
pub fn build(
|
||||||
out_dir: &Path,
|
out_dir: &Path,
|
||||||
block_bits: u8,
|
block_bits: u8,
|
||||||
evidence_kind: &EvidenceKind,
|
mode: &IndexMode,
|
||||||
count_of: impl Fn(CanonicalKmer) -> u32,
|
count_of: impl Fn(CanonicalKmer) -> u32,
|
||||||
) -> OLMResult<usize> {
|
) -> OLMResult<usize> {
|
||||||
let n = UnitigFileReader::open_sequential(&out_dir.join(UNITIGS_FILE))?.n_kmers();
|
let n = UnitigFileReader::open_sequential(&out_dir.join(UNITIGS_FILE))?.n_kmers();
|
||||||
@@ -134,7 +129,7 @@ impl Layer<PersistentCompactIntMatrix> {
|
|||||||
let mut mb = PersistentCompactIntMatrixBuilder::new(n, &counts_dir)
|
let mut mb = PersistentCompactIntMatrixBuilder::new(n, &counts_dir)
|
||||||
.map_err(OLMError::Io)?;
|
.map_err(OLMError::Io)?;
|
||||||
let mut col = mb.add_col().map_err(OLMError::Io)?;
|
let mut col = mb.add_col().map_err(OLMError::Io)?;
|
||||||
let n_built = MphfLayer::build(out_dir, block_bits, evidence_kind, &mut |slot, kmer| {
|
let n_built = MphfLayer::build(out_dir, block_bits, mode, &mut |slot, kmer| {
|
||||||
col.set(slot, count_of(kmer));
|
col.set(slot, count_of(kmer));
|
||||||
Ok(())
|
Ok(())
|
||||||
})?;
|
})?;
|
||||||
@@ -146,10 +141,10 @@ impl Layer<PersistentCompactIntMatrix> {
|
|||||||
pub fn build_from_map(
|
pub fn build_from_map(
|
||||||
out_dir: &Path,
|
out_dir: &Path,
|
||||||
block_bits: u8,
|
block_bits: u8,
|
||||||
evidence_kind: &EvidenceKind,
|
mode: &IndexMode,
|
||||||
counts: &HashMap<CanonicalKmer, u32>,
|
counts: &HashMap<CanonicalKmer, u32>,
|
||||||
) -> OLMResult<usize> {
|
) -> OLMResult<usize> {
|
||||||
Self::build(out_dir, block_bits, evidence_kind, |kmer| counts.get(&kmer).copied().unwrap_or(0))
|
Self::build(out_dir, block_bits, mode, |kmer| counts.get(&kmer).copied().unwrap_or(0))
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -179,7 +174,7 @@ impl Layer<PersistentBitMatrix> {
|
|||||||
pub fn build_presence(
|
pub fn build_presence(
|
||||||
out_dir: &Path,
|
out_dir: &Path,
|
||||||
block_bits: u8,
|
block_bits: u8,
|
||||||
evidence_kind: &EvidenceKind,
|
mode: &IndexMode,
|
||||||
n_genomes: usize,
|
n_genomes: usize,
|
||||||
present_in: impl Fn(CanonicalKmer, usize) -> bool,
|
present_in: impl Fn(CanonicalKmer, usize) -> bool,
|
||||||
) -> OLMResult<usize> {
|
) -> OLMResult<usize> {
|
||||||
@@ -189,7 +184,7 @@ impl Layer<PersistentBitMatrix> {
|
|||||||
let mut cols: Vec<_> = (0..n_genomes)
|
let mut cols: Vec<_> = (0..n_genomes)
|
||||||
.map(|_| mb.add_col().map_err(OLMError::Io))
|
.map(|_| mb.add_col().map_err(OLMError::Io))
|
||||||
.collect::<OLMResult<_>>()?;
|
.collect::<OLMResult<_>>()?;
|
||||||
let n_built = MphfLayer::build(out_dir, block_bits, evidence_kind, &mut |slot, kmer| {
|
let n_built = MphfLayer::build(out_dir, block_bits, mode, &mut |slot, kmer| {
|
||||||
for (g, col) in cols.iter_mut().enumerate() {
|
for (g, col) in cols.iter_mut().enumerate() {
|
||||||
col.set(slot, present_in(kmer, g));
|
col.set(slot, present_in(kmer, g));
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -11,5 +11,5 @@ pub use error::{OLMError, OLMResult};
|
|||||||
pub use layer::{Hit, Layer, LayerData};
|
pub use layer::{Hit, Layer, LayerData};
|
||||||
pub use layered_store::LayeredStore;
|
pub use layered_store::LayeredStore;
|
||||||
pub use map::LayeredMap;
|
pub use map::LayeredMap;
|
||||||
pub use meta::{EvidenceKind, LayerMeta};
|
pub use meta::{IndexMode, PartitionMeta};
|
||||||
pub use mphf_layer::MphfLayer;
|
pub use mphf_layer::MphfLayer;
|
||||||
|
|||||||
@@ -5,11 +5,10 @@ use std::path::{Path, PathBuf};
|
|||||||
use obicompactvec::PersistentCompactIntMatrix;
|
use obicompactvec::PersistentCompactIntMatrix;
|
||||||
use obikseq::CanonicalKmer;
|
use obikseq::CanonicalKmer;
|
||||||
use obiskio::{UnitigFileWriter, DEFAULT_BLOCK_BITS};
|
use obiskio::{UnitigFileWriter, DEFAULT_BLOCK_BITS};
|
||||||
use crate::meta::EvidenceKind;
|
|
||||||
|
|
||||||
use crate::error::OLMResult;
|
use crate::error::OLMResult;
|
||||||
use crate::layer::{Hit, Layer, LayerData};
|
use crate::layer::{Hit, Layer, LayerData};
|
||||||
use crate::meta::PartitionMeta;
|
use crate::meta::{IndexMode, PartitionMeta};
|
||||||
|
|
||||||
/// Layered kmer index for a single partition.
|
/// Layered kmer index for a single partition.
|
||||||
///
|
///
|
||||||
@@ -26,39 +25,26 @@ pub struct LayeredMap<D: LayerData = ()> {
|
|||||||
|
|
||||||
impl<D: LayerData> LayeredMap<D> {
|
impl<D: LayerData> LayeredMap<D> {
|
||||||
/// Open an existing layered index at `root`.
|
/// Open an existing layered index at `root`.
|
||||||
|
/// The mode is read once from `PartitionMeta` and applied to all layers.
|
||||||
pub fn open(root: &Path) -> OLMResult<Self> {
|
pub fn open(root: &Path) -> OLMResult<Self> {
|
||||||
let meta = PartitionMeta::load(root)?;
|
let meta = PartitionMeta::load(root)?;
|
||||||
let layers = (0..meta.n_layers)
|
let layers = (0..meta.n_layers)
|
||||||
.map(|i| Layer::<D>::open(&layer_dir(root, i)))
|
.map(|i| Layer::<D>::open(&layer_dir(root, i), &meta.mode))
|
||||||
.collect::<OLMResult<Vec<_>>>()?;
|
.collect::<OLMResult<Vec<_>>>()?;
|
||||||
Ok(Self {
|
Ok(Self { root: root.to_owned(), meta, layers })
|
||||||
root: root.to_owned(),
|
|
||||||
meta,
|
|
||||||
layers,
|
|
||||||
})
|
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Create a new, empty layered index at `root`.
|
/// Create a new, empty layered index at `root` with the given mode.
|
||||||
pub fn create(root: &Path) -> OLMResult<Self> {
|
pub fn create(root: &Path, mode: IndexMode) -> OLMResult<Self> {
|
||||||
fs::create_dir_all(root)?;
|
fs::create_dir_all(root)?;
|
||||||
let meta = PartitionMeta::new();
|
let meta = PartitionMeta::new(mode);
|
||||||
meta.save(root)?;
|
meta.save(root)?;
|
||||||
Ok(Self {
|
Ok(Self { root: root.to_owned(), meta, layers: Vec::new() })
|
||||||
root: root.to_owned(),
|
|
||||||
meta,
|
|
||||||
layers: Vec::new(),
|
|
||||||
})
|
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Return the number of layers in this index.
|
pub fn n_layers(&self) -> usize { self.layers.len() }
|
||||||
pub fn n_layers(&self) -> usize {
|
pub fn layer(&self, i: usize) -> &Layer<D> { &self.layers[i] }
|
||||||
self.layers.len()
|
pub fn mode(&self) -> &IndexMode { &self.meta.mode }
|
||||||
}
|
|
||||||
|
|
||||||
/// Return a reference to the `i`-th layer.
|
|
||||||
pub fn layer(&self, i: usize) -> &Layer<D> {
|
|
||||||
&self.layers[i]
|
|
||||||
}
|
|
||||||
|
|
||||||
/// Query `kmer` across all layers. Returns `(layer_index, Hit)` on match.
|
/// Query `kmer` across all layers. Returns `(layer_index, Hit)` on match.
|
||||||
pub fn query(&self, kmer: CanonicalKmer) -> Option<(usize, Hit<D::Item>)> {
|
pub fn query(&self, kmer: CanonicalKmer) -> Option<(usize, Hit<D::Item>)> {
|
||||||
@@ -68,17 +54,15 @@ impl<D: LayerData> LayeredMap<D> {
|
|||||||
.find_map(|(i, layer)| layer.query(kmer).map(|hit| (i, hit)))
|
.find_map(|(i, layer)| layer.query(kmer).map(|hit| (i, hit)))
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Return a `UnitigFileWriter` for the next layer to be built.
|
|
||||||
pub fn next_layer_writer(&self) -> OLMResult<UnitigFileWriter> {
|
pub fn next_layer_writer(&self) -> OLMResult<UnitigFileWriter> {
|
||||||
let dir = layer_dir(&self.root, self.layers.len());
|
let dir = layer_dir(&self.root, self.layers.len());
|
||||||
Layer::<D>::unitig_writer(&dir)
|
Layer::<D>::unitig_writer(&dir)
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Append a new layer to the index.
|
|
||||||
fn append_layer(&mut self) -> OLMResult<()> {
|
fn append_layer(&mut self) -> OLMResult<()> {
|
||||||
let i = self.layers.len();
|
let i = self.layers.len();
|
||||||
let dir = layer_dir(&self.root, i);
|
let dir = layer_dir(&self.root, i);
|
||||||
self.layers.push(Layer::<D>::open(&dir)?);
|
self.layers.push(Layer::<D>::open(&dir, &self.meta.mode)?);
|
||||||
self.meta.n_layers = self.layers.len();
|
self.meta.n_layers = self.layers.len();
|
||||||
self.meta.save(&self.root)?;
|
self.meta.save(&self.root)?;
|
||||||
Ok(())
|
Ok(())
|
||||||
@@ -91,7 +75,7 @@ impl LayeredMap<()> {
|
|||||||
pub fn push_layer(&mut self) -> OLMResult<usize> {
|
pub fn push_layer(&mut self) -> OLMResult<usize> {
|
||||||
let i = self.layers.len();
|
let i = self.layers.len();
|
||||||
let dir = layer_dir(&self.root, i);
|
let dir = layer_dir(&self.root, i);
|
||||||
Layer::<()>::build(&dir, DEFAULT_BLOCK_BITS, &EvidenceKind::Exact)?;
|
Layer::<()>::build(&dir, DEFAULT_BLOCK_BITS, &self.meta.mode)?;
|
||||||
self.append_layer()?;
|
self.append_layer()?;
|
||||||
Ok(i)
|
Ok(i)
|
||||||
}
|
}
|
||||||
@@ -103,15 +87,12 @@ impl LayeredMap<PersistentCompactIntMatrix> {
|
|||||||
pub fn push_layer(&mut self, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize> {
|
pub fn push_layer(&mut self, count_of: impl Fn(CanonicalKmer) -> u32) -> OLMResult<usize> {
|
||||||
let i = self.layers.len();
|
let i = self.layers.len();
|
||||||
let dir = layer_dir(&self.root, i);
|
let dir = layer_dir(&self.root, i);
|
||||||
Layer::<PersistentCompactIntMatrix>::build(&dir, DEFAULT_BLOCK_BITS, &EvidenceKind::Exact, count_of)?;
|
Layer::<PersistentCompactIntMatrix>::build(&dir, DEFAULT_BLOCK_BITS, &self.meta.mode, count_of)?;
|
||||||
self.append_layer()?;
|
self.append_layer()?;
|
||||||
Ok(i)
|
Ok(i)
|
||||||
}
|
}
|
||||||
|
|
||||||
pub fn push_layer_from_map(
|
pub fn push_layer_from_map(&mut self, counts: &HashMap<CanonicalKmer, u32>) -> OLMResult<usize> {
|
||||||
&mut self,
|
|
||||||
counts: &HashMap<CanonicalKmer, u32>,
|
|
||||||
) -> OLMResult<usize> {
|
|
||||||
self.push_layer(|kmer| counts.get(&kmer).copied().unwrap_or(0))
|
self.push_layer(|kmer| counts.get(&kmer).copied().unwrap_or(0))
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -6,64 +6,44 @@ use serde::{Deserialize, Serialize};
|
|||||||
use crate::error::OLMResult;
|
use crate::error::OLMResult;
|
||||||
|
|
||||||
const META_FILE: &str = "meta.json";
|
const META_FILE: &str = "meta.json";
|
||||||
const LAYER_META_FILE: &str = "layer_meta.json";
|
|
||||||
|
|
||||||
// ── Layer-level metadata ──────────────────────────────────────────────────────
|
// ── IndexMode ─────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
/// Describes the evidence bundle stored alongside the MPHF for one layer.
|
/// Evidence mode for an entire partitioned index — homogeneous across all layers.
|
||||||
|
///
|
||||||
|
/// Determined once at build time; stored in `PartitionMeta` (`meta.json`).
|
||||||
|
/// All layers within an index share the same mode.
|
||||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||||
#[serde(tag = "type", rename_all = "snake_case")]
|
#[serde(tag = "type", rename_all = "snake_case")]
|
||||||
pub enum EvidenceKind {
|
pub enum IndexMode {
|
||||||
/// Exact evidence: `evidence.bin` + `unitigs.bin.idx`. Zero false positives.
|
/// Exact evidence: `evidence.bin` + `unitigs.bin.idx`. Zero false positives.
|
||||||
Exact,
|
Exact,
|
||||||
/// Approximate evidence: `fingerprint.bin` only.
|
/// Approximate evidence: `fingerprint.bin` only.
|
||||||
/// `b` — fingerprint bits; false-positive rate per k-mer query = 1/2^b.
|
/// `b` — fingerprint bits per slot; false-positive rate ≈ 1/2^b per query.
|
||||||
/// `z` — consecutive k-mers that must all match (Findere trick);
|
/// `z` — Findere consecutive-kmer parameter (build-time only; not used at query time).
|
||||||
/// effective FP rate per sequencing read ≈ W / 2^(b·z)
|
|
||||||
/// where W = L - k - z + 2 is the number of windows in a read of length L.
|
|
||||||
Approx { b: u8, z: u8 },
|
Approx { b: u8, z: u8 },
|
||||||
|
/// Hybrid: both `fingerprint.bin` and `evidence.bin` + `unitigs.bin.idx`.
|
||||||
|
/// `find()` uses the fingerprint (O(1), approx); `find_strict()` uses exact evidence.
|
||||||
|
Hybrid { b: u8, z: u8 },
|
||||||
}
|
}
|
||||||
|
|
||||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
impl Default for IndexMode {
|
||||||
pub struct LayerMeta {
|
|
||||||
pub evidence: EvidenceKind,
|
|
||||||
}
|
|
||||||
|
|
||||||
impl Default for EvidenceKind {
|
|
||||||
fn default() -> Self { Self::Exact }
|
fn default() -> Self { Self::Exact }
|
||||||
}
|
}
|
||||||
|
|
||||||
impl LayerMeta {
|
// ── PartitionMeta ─────────────────────────────────────────────────────────────
|
||||||
pub fn exact() -> Self {
|
|
||||||
Self { evidence: EvidenceKind::Exact }
|
|
||||||
}
|
|
||||||
|
|
||||||
pub fn approx(b: u8, z: u8) -> Self {
|
|
||||||
Self { evidence: EvidenceKind::Approx { b, z } }
|
|
||||||
}
|
|
||||||
|
|
||||||
pub fn load(layer_dir: &Path) -> OLMResult<Self> {
|
|
||||||
let f = File::open(layer_dir.join(LAYER_META_FILE))?;
|
|
||||||
Ok(serde_json::from_reader(f)?)
|
|
||||||
}
|
|
||||||
|
|
||||||
pub fn save(&self, layer_dir: &Path) -> OLMResult<()> {
|
|
||||||
let f = File::create(layer_dir.join(LAYER_META_FILE))?;
|
|
||||||
serde_json::to_writer_pretty(f, self)?;
|
|
||||||
Ok(())
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// ── Partition-level metadata ──────────────────────────────────────────────────
|
|
||||||
|
|
||||||
|
/// Index-level metadata stored in `meta.json` at the root of a partition index.
|
||||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||||
pub struct PartitionMeta {
|
pub struct PartitionMeta {
|
||||||
pub n_layers: usize,
|
pub n_layers: usize,
|
||||||
|
#[serde(default)]
|
||||||
|
pub mode: IndexMode,
|
||||||
}
|
}
|
||||||
|
|
||||||
impl PartitionMeta {
|
impl PartitionMeta {
|
||||||
pub fn new() -> Self {
|
pub fn new(mode: IndexMode) -> Self {
|
||||||
Self { n_layers: 0 }
|
Self { n_layers: 0, mode }
|
||||||
}
|
}
|
||||||
|
|
||||||
pub fn load(dir: &Path) -> OLMResult<Self> {
|
pub fn load(dir: &Path) -> OLMResult<Self> {
|
||||||
@@ -79,5 +59,5 @@ impl PartitionMeta {
|
|||||||
}
|
}
|
||||||
|
|
||||||
impl Default for PartitionMeta {
|
impl Default for PartitionMeta {
|
||||||
fn default() -> Self { Self::new() }
|
fn default() -> Self { Self::new(IndexMode::Exact) }
|
||||||
}
|
}
|
||||||
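Review note: the mode-to-files relationship that `IndexMode` encodes can be summarised with a small std-only sketch. This is an illustration, not the code in this commit: the serde derives are omitted, and `evidence_files` and `find_fp_rate` are hypothetical helpers that do not appear in the diff. The file names and the ≈ 1/2^b rate are taken from the doc comments above.

```rust
/// Simplified stand-in for the `IndexMode` enum in this diff (no serde here).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum IndexMode {
    Exact,
    Approx { b: u8, z: u8 },
    Hybrid { b: u8, z: u8 },
}

impl IndexMode {
    /// Hypothetical helper: files a layer needs besides `mphf.bin` and `unitigs.bin`.
    fn evidence_files(&self) -> Vec<&'static str> {
        match self {
            IndexMode::Exact => vec!["evidence.bin", "unitigs.bin.idx"],
            IndexMode::Approx { .. } => vec!["fingerprint.bin"],
            IndexMode::Hybrid { .. } => {
                vec!["evidence.bin", "unitigs.bin.idx", "fingerprint.bin"]
            }
        }
    }

    /// Hypothetical helper: approximate false-positive rate of `find()` per query.
    fn find_fp_rate(&self) -> f64 {
        match self {
            IndexMode::Exact => 0.0,
            // A b-bit fingerprint matches a foreign key with probability about 1/2^b.
            IndexMode::Approx { b, .. } | IndexMode::Hybrid { b, .. } => {
                0.5f64.powi(*b as i32)
            }
        }
    }
}

fn main() {
    for mode in [
        IndexMode::Exact,
        IndexMode::Approx { b: 8, z: 3 },
        IndexMode::Hybrid { b: 8, z: 3 },
    ] {
        println!("{:?}: {:?}, fp ≈ {}", mode, mode.evidence_files(), mode.find_fp_rate());
    }
}
```

Because the mode lives once in `PartitionMeta`, a check of this kind would only need to run once per index rather than once per layer.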
|
|||||||
+119
-134
@@ -1,5 +1,5 @@
|
|||||||
use std::fs;
|
use std::fs;
|
||||||
use std::path::Path;
|
use std::path::{Path, PathBuf};
|
||||||
|
|
||||||
use cacheline_ef::{CachelineEf, CachelineEfVec};
|
use cacheline_ef::{CachelineEf, CachelineEfVec};
|
||||||
use epserde::prelude::*;
|
use epserde::prelude::*;
|
||||||
@@ -10,7 +10,7 @@ use ptr_hash::{PtrHash, PtrHashParams, bucket_fn::CubicEps, hash::Xx64};
|
|||||||
use crate::error::{OLMError, OLMResult};
|
use crate::error::{OLMError, OLMResult};
|
||||||
use crate::evidence::{Evidence, EvidenceWriter};
|
use crate::evidence::{Evidence, EvidenceWriter};
|
||||||
use crate::fingerprint::{FingerprintVec, FingerprintVecWriter};
|
use crate::fingerprint::{FingerprintVec, FingerprintVecWriter};
|
||||||
use crate::meta::{EvidenceKind, LayerMeta};
|
use crate::meta::IndexMode;
|
||||||
|
|
||||||
pub(crate) const MPHF_FILE: &str = "mphf.bin";
|
pub(crate) const MPHF_FILE: &str = "mphf.bin";
|
||||||
pub(crate) const UNITIGS_FILE: &str = "unitigs.bin";
|
pub(crate) const UNITIGS_FILE: &str = "unitigs.bin";
|
||||||
@@ -19,19 +19,22 @@ pub(crate) const FINGERPRINT_FILE: &str = "fingerprint.bin";
|
|||||||
|
|
||||||
pub(crate) type Mphf = PtrHash<u64, CubicEps, CachelineEfVec<Vec<CachelineEf>>, Xx64, Vec<u8>>;
|
pub(crate) type Mphf = PtrHash<u64, CubicEps, CachelineEfVec<Vec<CachelineEf>>, Xx64, Vec<u8>>;
|
||||||
|
|
||||||
// ── Evidence store ────────────────────────────────────────────────────────────
|
// ── LayerEvidence ─────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
enum LayerEvidence {
|
enum LayerEvidence {
|
||||||
Exact { evidence: Evidence, unitigs: UnitigFileReader },
|
Exact { evidence: Evidence, unitigs: UnitigFileReader },
|
||||||
Approx { fingerprint: FingerprintVec },
|
Approx { fingerprint: FingerprintVec, unitigs_path: PathBuf },
|
||||||
|
Hybrid { evidence: Evidence, unitigs: UnitigFileReader, fingerprint: FingerprintVec },
|
||||||
}
|
}
|
||||||
|
|
||||||
// ── MphfLayer ─────────────────────────────────────────────────────────────────
|
// ── MphfLayer ─────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
/// Autonomous kmer → slot mapping for one layer.
|
/// Autonomous kmer → slot mapping for one layer.
|
||||||
///
|
///
|
||||||
/// Dispatches queries to exact or approximate evidence transparently based on
|
/// Two query methods:
|
||||||
/// the `layer_meta.json` written at build time.
|
/// - [`find`](Self::find) — O(1), uses fingerprint (Approx/Hybrid) or exact evidence (Exact).
|
||||||
|
/// - [`find_strict`](Self::find_strict) — always exact; O(1) on Exact/Hybrid layers,
|
||||||
|
/// O(n) sequential scan on Approx layers.
|
||||||
pub struct MphfLayer {
|
pub struct MphfLayer {
|
||||||
mphf: Mphf,
|
mphf: Mphf,
|
||||||
ev: LayerEvidence,
|
ev: LayerEvidence,
|
||||||
@@ -39,21 +42,31 @@ pub struct MphfLayer {
|
|||||||
}
|
}
|
||||||
|
|
||||||
impl MphfLayer {
|
impl MphfLayer {
|
||||||
pub fn open(dir: &Path) -> OLMResult<Self> {
|
/// Open a layer using the index-level `mode` determined at `LayeredMap` open time.
|
||||||
let meta = LayerMeta::load(dir)?;
|
/// No per-layer metadata file is read.
|
||||||
|
pub fn open(dir: &Path, mode: &IndexMode) -> OLMResult<Self> {
|
||||||
let mphf: Mphf = Mphf::load_full(&dir.join(MPHF_FILE))
|
let mphf: Mphf = Mphf::load_full(&dir.join(MPHF_FILE))
|
||||||
.map_err(|e| OLMError::InvalidLayer(e.to_string()))?;
|
.map_err(|e| OLMError::InvalidLayer(e.to_string()))?;
|
||||||
let (ev, n) = match meta.evidence {
|
let (ev, n) = match mode {
|
||||||
EvidenceKind::Exact => {
|
IndexMode::Exact => {
|
||||||
let evidence = Evidence::open(&dir.join(EVIDENCE_FILE))?;
|
let evidence = Evidence::open(&dir.join(EVIDENCE_FILE))?;
|
||||||
let n = evidence.len();
|
let n = evidence.len();
|
||||||
|
// open() auto-detects `.idx`; Exact layers always have one, so this takes the direct-access path
|
||||||
let unitigs = UnitigFileReader::open(&dir.join(UNITIGS_FILE))?;
|
let unitigs = UnitigFileReader::open(&dir.join(UNITIGS_FILE))?;
|
||||||
(LayerEvidence::Exact { evidence, unitigs }, n)
|
(LayerEvidence::Exact { evidence, unitigs }, n)
|
||||||
}
|
}
|
||||||
EvidenceKind::Approx { .. } => {
|
IndexMode::Approx { .. } => {
|
||||||
let fingerprint = FingerprintVec::open(&dir.join(FINGERPRINT_FILE))?;
|
let fingerprint = FingerprintVec::open(&dir.join(FINGERPRINT_FILE))?;
|
||||||
let n = fingerprint.n();
|
let n = fingerprint.n();
|
||||||
(LayerEvidence::Approx { fingerprint }, n)
|
let unitigs_path = dir.join(UNITIGS_FILE);
|
||||||
|
(LayerEvidence::Approx { fingerprint, unitigs_path }, n)
|
||||||
|
}
|
||||||
|
IndexMode::Hybrid { .. } => {
|
||||||
|
let evidence = Evidence::open(&dir.join(EVIDENCE_FILE))?;
|
||||||
|
let fingerprint = FingerprintVec::open(&dir.join(FINGERPRINT_FILE))?;
|
||||||
|
let n = evidence.len();
|
||||||
|
let unitigs = UnitigFileReader::open(&dir.join(UNITIGS_FILE))?;
|
||||||
|
(LayerEvidence::Hybrid { evidence, unitigs, fingerprint }, n)
|
||||||
}
|
}
|
||||||
};
|
};
|
||||||
Ok(Self { mphf, ev, n })
|
Ok(Self { mphf, ev, n })
|
||||||
@@ -61,25 +74,16 @@ impl MphfLayer {
|
|||||||
|
|
||||||
// ── Query API ─────────────────────────────────────────────────────────────
|
// ── Query API ─────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
/// Transparent dispatch: routes to `find_exact` or `find_approx` based on
|
/// O(1) lookup — dispatches automatically:
|
||||||
/// the evidence loaded at `open` time.
|
/// - Exact: evidence + `verify_canonical_kmer`, zero false positives.
|
||||||
|
/// - Approx: fingerprint check, false-positive rate ≈ 1/2^b.
|
||||||
|
/// - Hybrid: fingerprint check only (fast path), false-positive rate ≈ 1/2^b; use `find_strict` for zero false positives.
|
||||||
#[inline]
|
#[inline]
|
||||||
pub fn find(&self, kmer: CanonicalKmer) -> Option<usize> {
|
pub fn find(&self, kmer: CanonicalKmer) -> Option<usize> {
|
||||||
match &self.ev {
|
|
||||||
LayerEvidence::Exact { .. } => self.find_exact(kmer),
|
|
||||||
LayerEvidence::Approx { .. } => self.find_approx(kmer),
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
/// Exact lookup: zero false positives. Panics if the layer was opened with
|
|
||||||
/// approximate evidence.
|
|
||||||
#[inline]
|
|
||||||
pub fn find_exact(&self, kmer: CanonicalKmer) -> Option<usize> {
|
|
||||||
let LayerEvidence::Exact { evidence, unitigs } = &self.ev else {
|
|
||||||
panic!("find_exact called on an approximate layer");
|
|
||||||
};
|
|
||||||
let slot = self.mphf.index(&kmer.raw());
|
let slot = self.mphf.index(&kmer.raw());
|
||||||
if slot >= self.n { return None; }
|
if slot >= self.n { return None; }
|
||||||
|
match &self.ev {
|
||||||
|
LayerEvidence::Exact { evidence, unitigs } => {
|
||||||
let (chunk_id, rank) = evidence.decode(slot);
|
let (chunk_id, rank) = evidence.decode(slot);
|
||||||
if unitigs.verify_canonical_kmer(chunk_id as usize, rank as usize, kmer) {
|
if unitigs.verify_canonical_kmer(chunk_id as usize, rank as usize, kmer) {
|
||||||
Some(slot)
|
Some(slot)
|
||||||
@@ -87,17 +91,41 @@ impl MphfLayer {
|
|||||||
None
|
None
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
LayerEvidence::Approx { fingerprint, .. } |
|
||||||
|
LayerEvidence::Hybrid { fingerprint, .. } => {
|
||||||
|
if fingerprint.matches(slot, kmer.seq_hash()) { Some(slot) } else { None }
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
/// Approximate lookup: false-positive rate 1/2^b per k-mer query. Panics
|
/// Always-exact lookup — zero false positives regardless of mode.
|
||||||
/// if the layer was opened with exact evidence.
|
///
|
||||||
#[inline]
|
/// - Exact/Hybrid: O(1) via evidence + `verify_canonical_kmer`.
|
||||||
pub fn find_approx(&self, kmer: CanonicalKmer) -> Option<usize> {
|
/// - Approx: O(n) sequential scan of `unitigs.bin` to confirm the kmer
|
||||||
let LayerEvidence::Approx { fingerprint } = &self.ev else {
|
/// that owns the slot, then exact comparison.
|
||||||
panic!("find_approx called on an exact layer");
|
pub fn find_strict(&self, kmer: CanonicalKmer) -> Option<usize> {
|
||||||
};
|
|
||||||
let slot = self.mphf.index(&kmer.raw());
|
let slot = self.mphf.index(&kmer.raw());
|
||||||
if slot >= self.n { return None; }
|
if slot >= self.n { return None; }
|
||||||
if fingerprint.matches(slot, kmer.seq_hash()) { Some(slot) } else { None }
|
match &self.ev {
|
||||||
|
LayerEvidence::Exact { evidence, unitigs } |
|
||||||
|
LayerEvidence::Hybrid { evidence, unitigs, .. } => {
|
||||||
|
let (chunk_id, rank) = evidence.decode(slot);
|
||||||
|
if unitigs.verify_canonical_kmer(chunk_id as usize, rank as usize, kmer) {
|
||||||
|
Some(slot)
|
||||||
|
} else {
|
||||||
|
None
|
||||||
|
}
|
||||||
|
}
|
||||||
|
LayerEvidence::Approx { unitigs_path, .. } => {
|
||||||
|
let reader = UnitigFileReader::open_sequential(unitigs_path).ok()?;
|
||||||
|
for stored in reader.iter_canonical_kmers() {
|
||||||
|
if self.mphf.index(&stored.raw()) == slot {
|
||||||
|
return if stored == kmer { Some(slot) } else { None };
|
||||||
|
}
|
||||||
|
}
|
||||||
|
None
|
||||||
|
}
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
pub fn n(&self) -> usize { self.n }
|
pub fn n(&self) -> usize { self.n }
|
||||||
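Review note: the difference between `find` and `find_strict` is easiest to see on a toy layer. The sketch below is an assumption-laden illustration, not the `MphfLayer` code: `ToyLayer`, the modulo "MPHF", and the shift-and-mask fingerprint are all made up for the example (the real code uses ptr_hash and `kmer.seq_hash()`), and the input must be non-empty and collision-free under the toy hash.

```rust
/// Toy layer: `slot_of` stands in for the MPHF, `fingerprints` for
/// `fingerprint.bin`, and `keys` for exact evidence plus unitig verification.
struct ToyLayer {
    keys: Vec<u64>,        // key owning each slot (exact-evidence stand-in)
    fingerprints: Vec<u8>, // b-bit fingerprint per slot
    b: u8,
}

impl ToyLayer {
    /// Assumes `input` is non-empty and that `k % n` is distinct for every key.
    fn build(input: &[u64], b: u8) -> Self {
        let n = input.len();
        let mut keys = vec![0u64; n];
        let mut fingerprints = vec![0u8; n];
        for &k in input {
            let slot = (k as usize) % n;
            keys[slot] = k;
            fingerprints[slot] = Self::fp(k, b);
        }
        Self { keys, fingerprints, b }
    }

    /// Toy fingerprint: `b` bits taken just above the two low bits (b <= 8).
    fn fp(k: u64, b: u8) -> u8 {
        ((k >> 2) & ((1u64 << b) - 1)) as u8
    }

    fn slot_of(&self, k: u64) -> usize {
        (k as usize) % self.keys.len()
    }

    /// Approx/Hybrid-style lookup: fingerprint comparison only.
    fn find(&self, k: u64) -> Option<usize> {
        let slot = self.slot_of(k);
        if self.fingerprints[slot] == Self::fp(k, self.b) { Some(slot) } else { None }
    }

    /// Exact lookup: compare against the key that actually owns the slot.
    fn find_strict(&self, k: u64) -> Option<usize> {
        let slot = self.slot_of(k);
        if self.keys[slot] == k { Some(slot) } else { None }
    }
}

fn main() {
    let layer = ToyLayer::build(&[10, 21, 32, 43], 2);
    // 26 is absent but shares slot and fingerprint with 10: a false positive.
    println!("find(26) = {:?}", layer.find(26));
    println!("find_strict(26) = {:?}", layer.find_strict(26));
}
```

With only two fingerprint bits the collision is easy to construct; at realistic widths the same effect occurs at a rate of roughly 1/2^b, which is the trade-off the Hybrid mode lets callers choose per query.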
@@ -109,19 +137,7 @@ impl MphfLayer {
|
|||||||
Ok(UnitigFileWriter::create(&dir.join(UNITIGS_FILE))?)
|
Ok(UnitigFileWriter::create(&dir.join(UNITIGS_FILE))?)
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Dispatch to `build_exact_evidence` or `build_approx_evidence` based on
|
|
||||||
/// `kind`. `block_bits` is forwarded to exact evidence only.
|
|
||||||
pub fn build_evidence(dir: &Path, kind: &EvidenceKind, block_bits: u8) -> OLMResult<usize> {
|
|
||||||
match kind {
|
|
||||||
EvidenceKind::Exact => Self::build_exact_evidence(dir, block_bits),
|
|
||||||
EvidenceKind::Approx { b, z } => Self::build_approx_evidence(dir, *b, *z),
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
/// Build `evidence.bin` + `unitigs.bin.idx` from `unitigs.bin` + `mphf.bin`.
|
/// Build `evidence.bin` + `unitigs.bin.idx` from `unitigs.bin` + `mphf.bin`.
|
||||||
///
|
|
||||||
/// `block_bits` controls the `.idx` block size (2^block_bits chunks per block).
|
|
||||||
/// Uses sequential iteration — no `.idx` required on entry.
|
|
||||||
pub fn build_exact_evidence(dir: &Path, block_bits: u8) -> OLMResult<usize> {
|
pub fn build_exact_evidence(dir: &Path, block_bits: u8) -> OLMResult<usize> {
|
||||||
let unitig_path = dir.join(UNITIGS_FILE);
|
let unitig_path = dir.join(UNITIGS_FILE);
|
||||||
let unitigs = UnitigFileReader::open_sequential(&unitig_path)?;
|
let unitigs = UnitigFileReader::open_sequential(&unitig_path)?;
|
||||||
@@ -130,7 +146,6 @@ impl MphfLayer {
|
|||||||
if n == 0 {
|
if n == 0 {
|
||||||
fs::File::create(dir.join(EVIDENCE_FILE))?;
|
fs::File::create(dir.join(EVIDENCE_FILE))?;
|
||||||
build_unitig_idx(&unitig_path, block_bits)?;
|
build_unitig_idx(&unitig_path, block_bits)?;
|
||||||
LayerMeta::exact().save(dir)?;
|
|
||||||
return Ok(0);
|
return Ok(0);
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -156,13 +171,10 @@ impl MphfLayer {
|
|||||||
|
|
||||||
ev.write(&dir.join(EVIDENCE_FILE))?;
|
ev.write(&dir.join(EVIDENCE_FILE))?;
|
||||||
build_unitig_idx(&unitig_path, block_bits)?;
|
build_unitig_idx(&unitig_path, block_bits)?;
|
||||||
LayerMeta::exact().save(dir)?;
|
|
||||||
Ok(n)
|
Ok(n)
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Build `fingerprint.bin` from `unitigs.bin` + `mphf.bin`.
|
/// Build `fingerprint.bin` from `unitigs.bin` + `mphf.bin`.
|
||||||
/// `b` — fingerprint bits (1..=64); `z` — Findere consecutive k-mer
|
|
||||||
/// parameter (≥1). No `.idx` is written.
|
|
||||||
pub fn build_approx_evidence(dir: &Path, b: u8, z: u8) -> OLMResult<usize> {
|
pub fn build_approx_evidence(dir: &Path, b: u8, z: u8) -> OLMResult<usize> {
|
||||||
if b == 0 || b > 64 {
|
if b == 0 || b > 64 {
|
||||||
return Err(OLMError::InvalidLayer("fingerprint width must be 1..=64".into()));
|
return Err(OLMError::InvalidLayer("fingerprint width must be 1..=64".into()));
|
||||||
@@ -176,7 +188,6 @@ impl MphfLayer {
|
|||||||
|
|
||||||
if n == 0 {
|
if n == 0 {
|
||||||
FingerprintVecWriter::new(0, b).write(&dir.join(FINGERPRINT_FILE))?;
|
FingerprintVecWriter::new(0, b).write(&dir.join(FINGERPRINT_FILE))?;
|
||||||
LayerMeta::approx(b, z).save(dir)?;
|
|
||||||
return Ok(0);
|
return Ok(0);
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -194,139 +205,113 @@ impl MphfLayer {
|
|||||||
}
|
}
|
||||||
|
|
||||||
fw.write(&dir.join(FINGERPRINT_FILE))?;
|
fw.write(&dir.join(FINGERPRINT_FILE))?;
|
||||||
LayerMeta::approx(b, z).save(dir)?;
|
|
||||||
Ok(n)
|
Ok(n)
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Build MPHF then evidence from the unitigs file already present in `dir`.
|
/// Build MPHF + evidence from `unitigs.bin` already present in `dir`.
|
||||||
///
|
///
|
||||||
/// - Exact: `.idx` is built for pass-1 parallel construction and kept for
|
/// `fill_slot(slot, kmer)` is called once per kmer in all modes.
|
||||||
/// query-time kmer verification. `evidence.bin` is written.
|
/// No `layer_meta.json` is written — the mode is an index-level property
|
||||||
/// - Approx: pass-1 uses `open_sequential` + `par_bridge` — no `.idx` is
|
/// stored in `PartitionMeta`.
|
||||||
/// ever created. `fingerprint.bin` is written.
|
|
||||||
///
|
|
||||||
/// `fill_slot(slot, kmer)` is called once per kmer in both modes.
|
|
||||||
pub(crate) fn build(
|
pub(crate) fn build(
|
||||||
dir: &Path,
|
dir: &Path,
|
||||||
block_bits: u8,
|
block_bits: u8,
|
||||||
evidence_kind: &EvidenceKind,
|
mode: &IndexMode,
|
||||||
fill_slot: &mut impl FnMut(usize, CanonicalKmer) -> OLMResult<()>,
|
fill_slot: &mut impl FnMut(usize, CanonicalKmer) -> OLMResult<()>,
|
||||||
) -> OLMResult<usize> {
|
) -> OLMResult<usize> {
|
||||||
use rayon::prelude::*;
|
use rayon::prelude::*;
|
||||||
|
|
||||||
let unitig_path = dir.join(UNITIGS_FILE);
|
let unitig_path = dir.join(UNITIGS_FILE);
|
||||||
|
|
||||||
match evidence_kind {
|
|
||||||
// ── Exact path ────────────────────────────────────────────────────
|
|
||||||
// .idx is built LAST, once evidence.bin is written, so it is never
|
|
||||||
// present during construction — only at query time.
|
|
||||||
EvidenceKind::Exact => {
|
|
||||||
let n = UnitigFileReader::open_sequential(&unitig_path)?.n_kmers();
|
let n = UnitigFileReader::open_sequential(&unitig_path)?.n_kmers();
|
||||||
let keys = CanonicalKmerIter::new(&unitig_path)
|
|
||||||
.map_err(|e| match e {
|
let sk_to_olm = |e: obiskio::SKError| match e {
|
||||||
obiskio::SKError::Io(io) => OLMError::Io(io),
|
obiskio::SKError::Io(io) => OLMError::Io(io),
|
||||||
e => OLMError::InvalidLayer(e.to_string()),
|
e => OLMError::InvalidLayer(e.to_string()),
|
||||||
})?;
|
};
|
||||||
|
|
||||||
|
// ── Empty layer ───────────────────────────────────────────────────────
|
||||||
if n == 0 {
|
if n == 0 {
|
||||||
fs::File::create(dir.join(EVIDENCE_FILE))?;
|
|
||||||
let mphf: Mphf =
|
let mphf: Mphf =
|
||||||
Mphf::try_new(&[] as &[u64], PtrHashParams::<CubicEps>::default())
|
Mphf::try_new(&[] as &[u64], PtrHashParams::<CubicEps>::default())
|
||||||
.ok_or_else(|| OLMError::Mphf("construction failed".into()))?;
|
.ok_or_else(|| OLMError::Mphf("construction failed".into()))?;
|
||||||
mphf.store(&dir.join(MPHF_FILE))
|
mphf.store(&dir.join(MPHF_FILE))
|
||||||
.map_err(|e| OLMError::InvalidLayer(e.to_string()))?;
|
.map_err(|e| OLMError::InvalidLayer(e.to_string()))?;
|
||||||
LayerMeta::exact().save(dir)?;
|
match mode {
|
||||||
|
IndexMode::Exact | IndexMode::Hybrid { .. } => {
|
||||||
|
fs::File::create(dir.join(EVIDENCE_FILE))?;
|
||||||
build_unitig_idx(&unitig_path, block_bits)?;
|
build_unitig_idx(&unitig_path, block_bits)?;
|
||||||
|
}
|
||||||
|
IndexMode::Approx { b, .. } => {
|
||||||
|
FingerprintVecWriter::new(0, *b).write(&dir.join(FINGERPRINT_FILE))?;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if let IndexMode::Hybrid { b, .. } = mode {
|
||||||
|
FingerprintVecWriter::new(0, *b).write(&dir.join(FINGERPRINT_FILE))?;
|
||||||
|
}
|
||||||
return Ok(0);
|
return Ok(0);
|
||||||
}
|
}
|
||||||
|
|
||||||
// Pass 1 — MPHF construction via clonable mmap iterator
|
// ── Pass 1: MPHF via clonable mmap iterator ───────────────────────────
|
||||||
|
let keys = CanonicalKmerIter::new(&unitig_path).map_err(sk_to_olm)?;
|
||||||
let mphf: Mphf =
|
let mphf: Mphf =
|
||||||
Mphf::new_from_par_iter(n, keys.map(|k| k.raw()).par_bridge(), PtrHashParams::<CubicEps>::default());
|
Mphf::new_from_par_iter(n, keys.map(|k| k.raw()).par_bridge(),
|
||||||
|
PtrHashParams::<CubicEps>::default());
|
||||||
mphf.store(&dir.join(MPHF_FILE))
|
mphf.store(&dir.join(MPHF_FILE))
|
||||||
.map_err(|e| OLMError::InvalidLayer(e.to_string()))?;
|
.map_err(|e| OLMError::InvalidLayer(e.to_string()))?;
|
||||||
|
|
||||||
// Pass 2 — sequential: fill evidence.bin + callback
|
// ── Pass 2: fill evidence files + callback ────────────────────────────
|
||||||
let unitigs2 = UnitigFileReader::open_sequential(&unitig_path)?;
|
let unitigs2 = UnitigFileReader::open_sequential(&unitig_path)?;
|
||||||
let mut ev = EvidenceWriter::new(n);
|
|
||||||
let mut seen = vec![0u8; (n + 7) / 8];
|
let mut seen = vec![0u8; (n + 7) / 8];
|
||||||
|
|
||||||
|
match mode {
|
||||||
|
IndexMode::Exact => {
|
||||||
|
let mut ev = EvidenceWriter::new(n);
|
||||||
for (kmer, chunk_id, rank) in unitigs2.iter_indexed_canonical_kmers() {
|
for (kmer, chunk_id, rank) in unitigs2.iter_indexed_canonical_kmers() {
|
||||||
let slot = mphf.index(&kmer.raw());
|
let slot = mphf.index(&kmer.raw());
|
||||||
if slot >= n {
|
if slot >= n { return Err(OLMError::Mphf("slot out of bounds".into())); }
|
||||||
return Err(OLMError::Mphf("slot out of bounds".into()));
|
let byte = slot / 8; let bit = 1u8 << (slot % 8);
|
||||||
}
|
if seen[byte] & bit != 0 { return Err(OLMError::Mphf("duplicate slot".into())); }
|
||||||
let byte = slot / 8;
|
|
||||||
let bit = 1u8 << (slot % 8);
|
|
||||||
if seen[byte] & bit != 0 {
|
|
||||||
return Err(OLMError::Mphf("duplicate slot".into()));
|
|
||||||
}
|
|
||||||
seen[byte] |= bit;
|
seen[byte] |= bit;
|
||||||
ev.set(slot, chunk_id as u32, rank as u8);
|
ev.set(slot, chunk_id as u32, rank as u8);
|
||||||
fill_slot(slot, kmer)?;
|
fill_slot(slot, kmer)?;
|
||||||
}
|
}
|
||||||
|
|
||||||
ev.write(&dir.join(EVIDENCE_FILE))?;
|
ev.write(&dir.join(EVIDENCE_FILE))?;
|
||||||
LayerMeta::exact().save(dir)?;
|
|
||||||
// .idx built last: strictly for query-time kmer verification
|
|
||||||
build_unitig_idx(&unitig_path, block_bits)?;
|
build_unitig_idx(&unitig_path, block_bits)?;
|
||||||
Ok(n)
|
|
||||||
}
|
}
|
||||||
|
|
||||||
// ── Approx path ───────────────────────────────────────────────────
|
IndexMode::Approx { b, .. } => {
|
||||||
// No .idx is created at any point.
|
|
||||||
EvidenceKind::Approx { b, z } => {
|
|
||||||
let unitigs = UnitigFileReader::open_sequential(&unitig_path)?;
|
|
||||||
let n = unitigs.n_kmers();
|
|
||||||
|
|
||||||
if n == 0 {
|
|
||||||
FingerprintVecWriter::new(0, *b).write(&dir.join(FINGERPRINT_FILE))?;
|
|
||||||
let mphf: Mphf =
|
|
||||||
Mphf::try_new(&[] as &[u64], PtrHashParams::<CubicEps>::default())
|
|
||||||
.ok_or_else(|| OLMError::Mphf("construction failed".into()))?;
|
|
||||||
mphf.store(&dir.join(MPHF_FILE))
|
|
||||||
.map_err(|e| OLMError::InvalidLayer(e.to_string()))?;
|
|
||||||
LayerMeta::approx(*b, *z).save(dir)?;
|
|
||||||
return Ok(0);
|
|
||||||
}
|
|
||||||
|
|
||||||
// Pass 1 — MPHF construction via mmap-backed clonable iterator.
|
|
||||||
// No .idx is created. par_bridge() parallelises the sequential scan;
|
|
||||||
// Clone on CanonicalKmerRawIter shares the Arc<Mmap> and resets to pos 0.
|
|
||||||
let keys = CanonicalKmerIter::new(&unitig_path)
|
|
||||||
.map_err(|e| match e {
|
|
||||||
obiskio::SKError::Io(io) => OLMError::Io(io),
|
|
||||||
e => OLMError::InvalidLayer(e.to_string()),
|
|
||||||
})?;
|
|
||||||
let mphf: Mphf =
|
|
||||||
Mphf::new_from_par_iter(n, keys.map(|k| k.raw()).par_bridge(), PtrHashParams::<CubicEps>::default());
|
|
||||||
mphf.store(&dir.join(MPHF_FILE))
|
|
||||||
.map_err(|e| OLMError::InvalidLayer(e.to_string()))?;
|
|
||||||
|
|
||||||
// Pass 2 — sequential: fill fingerprint.bin + callback
|
|
||||||
let unitigs2 = UnitigFileReader::open_sequential(&unitig_path)?;
|
|
||||||
let mut fw = FingerprintVecWriter::new(n, *b);
|
let mut fw = FingerprintVecWriter::new(n, *b);
|
||||||
let mut seen = vec![0u8; (n + 7) / 8];
|
|
||||||
|
|
||||||
for kmer in unitigs2.iter_canonical_kmers() {
|
for kmer in unitigs2.iter_canonical_kmers() {
|
||||||
let slot = mphf.index(&kmer.raw());
|
let slot = mphf.index(&kmer.raw());
|
||||||
if slot >= n {
|
if slot >= n { return Err(OLMError::Mphf("slot out of bounds".into())); }
|
||||||
return Err(OLMError::Mphf("slot out of bounds".into()));
|
let byte = slot / 8; let bit = 1u8 << (slot % 8);
|
||||||
}
|
if seen[byte] & bit != 0 { return Err(OLMError::Mphf("duplicate slot".into())); }
|
||||||
let byte = slot / 8;
|
|
||||||
let bit = 1u8 << (slot % 8);
|
|
||||||
if seen[byte] & bit != 0 {
|
|
||||||
return Err(OLMError::Mphf("duplicate slot".into()));
|
|
||||||
}
|
|
||||||
seen[byte] |= bit;
|
seen[byte] |= bit;
|
||||||
fw.set(slot, kmer.seq_hash());
|
fw.set(slot, kmer.seq_hash());
|
||||||
fill_slot(slot, kmer)?;
|
fill_slot(slot, kmer)?;
|
||||||
}
|
}
|
||||||
|
|
||||||
fw.write(&dir.join(FINGERPRINT_FILE))?;
|
fw.write(&dir.join(FINGERPRINT_FILE))?;
|
||||||
LayerMeta::approx(*b, *z).save(dir)?;
|
}
|
||||||
|
|
||||||
|
IndexMode::Hybrid { b, .. } => {
|
||||||
|
let mut ev = EvidenceWriter::new(n);
|
||||||
|
let mut fw = FingerprintVecWriter::new(n, *b);
|
||||||
|
for (kmer, chunk_id, rank) in unitigs2.iter_indexed_canonical_kmers() {
|
||||||
|
let slot = mphf.index(&kmer.raw());
|
||||||
|
if slot >= n { return Err(OLMError::Mphf("slot out of bounds".into())); }
|
||||||
|
let byte = slot / 8; let bit = 1u8 << (slot % 8);
|
||||||
|
if seen[byte] & bit != 0 { return Err(OLMError::Mphf("duplicate slot".into())); }
|
||||||
|
seen[byte] |= bit;
|
||||||
|
ev.set(slot, chunk_id as u32, rank as u8);
|
||||||
|
fw.set(slot, kmer.seq_hash());
|
||||||
|
fill_slot(slot, kmer)?;
|
||||||
|
}
|
||||||
|
ev.write(&dir.join(EVIDENCE_FILE))?;
|
||||||
|
fw.write(&dir.join(FINGERPRINT_FILE))?;
|
||||||
|
build_unitig_idx(&unitig_path, block_bits)?;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
Ok(n)
|
Ok(n)
|
||||||
}
|
}
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
}
|
||||||
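Review note: all three Pass 2 branches repeat the same bounds check and `seen` bitmap test before writing a slot. A minimal sketch of that check in isolation is shown below; `mark_slot` is a hypothetical helper name, and `String` errors stand in for `OLMError::Mphf`. It is one way the repeated lines could be factored out, not something this commit does.

```rust
/// Record `slot` in a packed bitmap of `n` slots.
/// Fails if the slot is out of range or was already marked, which would mean
/// the hash is not a bijection on the key set.
fn mark_slot(seen: &mut [u8], n: usize, slot: usize) -> Result<(), String> {
    if slot >= n {
        return Err("slot out of bounds".into());
    }
    let byte = slot / 8;
    let bit = 1u8 << (slot % 8);
    if seen[byte] & bit != 0 {
        return Err("duplicate slot".into());
    }
    seen[byte] |= bit;
    Ok(())
}

fn main() {
    let n = 10;
    // One bit per slot, rounded up to whole bytes, as in the diff.
    let mut seen = vec![0u8; (n + 7) / 8];
    for slot in [3, 9, 3, 10] {
        println!("slot {slot}: {:?}", mark_slot(&mut seen, n, slot));
    }
}
```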
|
|||||||
@@ -198,11 +198,16 @@ pub fn build_unitig_idx(unitigs_path: &Path, block_bits: u8) -> SKResult<()> {
|
|||||||
|
|
||||||
// ── Reader ────────────────────────────────────────────────────────────────────
|
// ── Reader ────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
/// Read-only random-access view of a unitig file.
|
/// Memory-mapped view of a unitig file, with optional direct-access index.
|
||||||
///
|
///
|
||||||
/// The sequence file is memory-mapped; the block offset table is loaded into RAM
|
/// Three constructors select between direct-access and sequential mode:
|
||||||
/// on open. Random access to chunk `i`: O(1 << block_bits) sequential mmap
|
/// - [`open`](Self::open) — smart default: direct access if `.idx` exists, sequential otherwise.
|
||||||
/// reads. Sequential iteration: O(n) via a running-offset cursor.
|
/// - [`open_sequential`](Self::open_sequential) — always sequential, ignores `.idx`.
|
||||||
|
/// - [`open_direct_access`](Self::open_direct_access) — requires `.idx`, errors if absent.
|
||||||
|
///
|
||||||
|
/// All positional methods (`chunk_start`, `verify_canonical_kmer`, …) work in
|
||||||
|
/// both modes. Without `.idx` they fall back to an O(i) sequential scan —
|
||||||
|
/// correct but slower.
|
||||||
pub struct UnitigFileReader {
|
pub struct UnitigFileReader {
|
||||||
mmap: Mmap,
|
mmap: Mmap,
|
||||||
block_offsets: Vec<u32>,
|
block_offsets: Vec<u32>,
|
||||||
@@ -214,28 +219,20 @@ pub struct UnitigFileReader {
|
|||||||
}
|
}
|
||||||
|
|
||||||
impl UnitigFileReader {
|
impl UnitigFileReader {
|
||||||
/// Open with `.idx` — enables both sequential iteration and random access.
|
/// Smart default: opens with direct access if `.idx` is present, sequential otherwise.
|
||||||
pub fn open(path: &Path) -> SKResult<Self> {
|
pub fn open(path: &Path) -> SKResult<Self> {
|
||||||
let file = File::open(path).map_err(SKError::Io)?;
|
if idx_path(path).exists() {
|
||||||
let mmap = unsafe { Mmap::map(&file).map_err(SKError::Io)? };
|
Self::open_direct_access(path)
|
||||||
let (n_unitigs, n_kmers, block_bits, block_offsets) = read_idx(&idx_path(path))?;
|
} else {
|
||||||
let k = obikseq::params::k();
|
Self::open_sequential(path)
|
||||||
Ok(Self {
|
}
|
||||||
mmap,
|
|
||||||
block_offsets,
|
|
||||||
n_unitigs,
|
|
||||||
n_kmers,
|
|
||||||
k,
|
|
||||||
block_bits,
|
|
||||||
mask: (1usize << block_bits) - 1,
|
|
||||||
})
|
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Open without `.idx` — sequential iteration only, no random access.
|
/// Always sequential — never reads `.idx` even if present.
|
||||||
///
|
///
|
||||||
/// Scans the binary file once to count chunks and k-mers. Use when only
|
/// Scans the binary file once to count chunks and k-mers.
|
||||||
/// [`Self::iter_kmers`], [`Self::iter_canonical_kmers`], or
|
/// Positional access (`chunk_start`, `verify_canonical_kmer`) falls back to
|
||||||
/// [`Self::iter_indexed_canonical_kmers`] are needed.
|
/// O(i) sequential scan.
|
||||||
pub fn open_sequential(path: &Path) -> SKResult<Self> {
|
pub fn open_sequential(path: &Path) -> SKResult<Self> {
|
||||||
let file = File::open(path).map_err(SKError::Io)?;
|
let file = File::open(path).map_err(SKError::Io)?;
|
||||||
let mmap = unsafe { Mmap::map(&file).map_err(SKError::Io)? };
|
let mmap = unsafe { Mmap::map(&file).map_err(SKError::Io)? };
|
||||||
@@ -253,7 +250,7 @@ impl UnitigFileReader {
|
|||||||
|
|
||||||
Ok(Self {
|
Ok(Self {
|
||||||
mmap,
|
mmap,
|
||||||
block_offsets: Vec::new(), // empty → random access disabled
|
block_offsets: Vec::new(),
|
||||||
n_unitigs,
|
n_unitigs,
|
||||||
n_kmers,
|
n_kmers,
|
||||||
k,
|
k,
|
||||||
@@ -262,16 +259,40 @@ impl UnitigFileReader {
|
|||||||
})
|
})
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// Requires `.idx` — errors if the companion index file is absent.
|
||||||
|
///
|
||||||
|
/// Enables O(1 << block_bits) positional access to any chunk.
|
||||||
|
/// Use only when direct access is architecturally required (query-time
|
||||||
|
/// verification on an exact-evidence layer).
|
||||||
|
pub fn open_direct_access(path: &Path) -> SKResult<Self> {
|
||||||
|
let file = File::open(path).map_err(SKError::Io)?;
|
||||||
|
let mmap = unsafe { Mmap::map(&file).map_err(SKError::Io)? };
|
||||||
|
let (n_unitigs, n_kmers, block_bits, block_offsets) = read_idx(&idx_path(path))?;
|
||||||
|
let k = obikseq::params::k();
|
||||||
|
Ok(Self {
|
||||||
|
mmap,
|
||||||
|
block_offsets,
|
||||||
|
n_unitigs,
|
||||||
|
n_kmers,
|
||||||
|
k,
|
||||||
|
block_bits,
|
||||||
|
mask: (1usize << block_bits) - 1,
|
||||||
|
})
|
||||||
|
}
|
||||||
|
|
||||||
pub fn len(&self) -> usize { self.n_unitigs }
|
pub fn len(&self) -> usize { self.n_unitigs }
|
||||||
pub fn is_empty(&self) -> bool { self.n_unitigs == 0 }
|
pub fn is_empty(&self) -> bool { self.n_unitigs == 0 }
|
||||||
pub fn n_kmers(&self) -> usize { self.n_kmers }
|
pub fn n_kmers(&self) -> usize { self.n_kmers }
|
||||||
pub fn block_bits(&self) -> u8 { self.block_bits }
|
pub fn block_bits(&self) -> u8 { self.block_bits }
|
||||||
|
pub fn has_direct_access(&self) -> bool { !self.block_offsets.is_empty() }
|
||||||
|
|
||||||
/// Byte offset of the START of record `i` (the seql byte) in the mmap.
|
/// Byte offset of record `i` in the mmap.
|
||||||
|
///
|
||||||
|
/// Fast path (O(1 << block_bits)) when `.idx` is loaded; degraded O(i)
|
||||||
|
/// sequential scan otherwise.
|
||||||
#[inline]
|
#[inline]
|
||||||
fn chunk_start(&self, i: usize) -> usize {
|
fn chunk_start(&self, i: usize) -> usize {
|
||||||
assert!(!self.block_offsets.is_empty(),
|
if !self.block_offsets.is_empty() {
|
||||||
"random access requires UnitigFileReader::open(); use open_sequential() for iteration only");
|
|
||||||
if self.block_bits == 0 {
|
if self.block_bits == 0 {
|
||||||
return self.block_offsets[i] as usize;
|
return self.block_offsets[i] as usize;
|
||||||
}
|
}
|
||||||
@@ -283,6 +304,14 @@ impl UnitigFileReader {
|
|||||||
offset += 1 + (seql_minus_k + self.k + 3) / 4;
|
offset += 1 + (seql_minus_k + self.k + 3) / 4;
|
||||||
}
|
}
|
||||||
offset
|
offset
|
||||||
|
} else {
|
||||||
|
let mut offset = 0usize;
|
||||||
|
for _ in 0..i {
|
||||||
|
let seql_minus_k = self.mmap[offset] as usize;
|
||||||
|
offset += 1 + (seql_minus_k + self.k + 3) / 4;
|
||||||
|
}
|
||||||
|
offset
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Nucleotide length of chunk `i`.
|
/// Nucleotide length of chunk `i`.
|
||||||
@@ -307,7 +336,9 @@ impl UnitigFileReader {
|
|||||||
extract_kmer_raw(&self.mmap[offset + 1..], j, self.k)
|
extract_kmer_raw(&self.mmap[offset + 1..], j, self.k)
|
||||||
}
|
}
|
||||||
|
|
||||||
/// `true` iff the k-mer at position `j` of chunk `i` equals `query` (canonical).
|
/// `true` iff the k-mer at position `j` of chunk `i` matches `query`.
|
||||||
|
///
|
||||||
|
/// Works in both modes; O(i) scan when `.idx` is absent.
|
||||||
#[inline]
|
#[inline]
|
||||||
pub fn verify_canonical_kmer(&self, i: usize, j: usize, query: CanonicalKmer) -> bool {
|
pub fn verify_canonical_kmer(&self, i: usize, j: usize, query: CanonicalKmer) -> bool {
|
||||||
canonical_raw(self.raw_kmer(i, j), self.k) == query.raw()
|
canonical_raw(self.raw_kmer(i, j), self.k) == query.raw()
|
||||||
|
|||||||
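Review note: the fallback branch added to `chunk_start` relies on the record layout implied by the diff, namely one length byte holding `seql - k` followed by the nucleotides packed at 2 bits each, so a record occupies `1 + (seql_minus_k + k + 3) / 4` bytes. The sketch below reproduces that O(i) walk on a plain byte slice. The free functions `record_len` and `chunk_start_sequential`, and the bounds-checked `Option` return, are illustrative assumptions and not part of the crate; the committed code indexes the mmap directly.

```rust
/// Size in bytes of one record: 1 length byte plus the 2-bit packed sequence.
/// `seql_minus_k` is the stored length byte; the sequence has `seql_minus_k + k` nucleotides.
/// (Hypothetical helper, named here for illustration only.)
fn record_len(seql_minus_k: usize, k: usize) -> usize {
    1 + (seql_minus_k + k + 3) / 4
}

/// Byte offset of record `i`, found by walking the length bytes from offset 0.
/// Returns `None` if the buffer ends before record `i` starts.
/// (Hypothetical stand-in for the `.idx`-less branch of `chunk_start`.)
fn chunk_start_sequential(data: &[u8], k: usize, i: usize) -> Option<usize> {
    let mut offset = 0usize;
    for _ in 0..i {
        // Read the length byte of the current record, then skip the whole record.
        let seql_minus_k = *data.get(offset)? as usize;
        offset += record_len(seql_minus_k, k);
    }
    if offset < data.len() { Some(offset) } else { None }
}
```

With `k = 5`, a buffer holding a 5-nucleotide record (3 bytes) followed by a 9-nucleotide record (4 bytes) places record 1 at byte 3. Every lookup costs time linear in `i`, which is why the doc comments in the diff reserve `open_direct_access` for query-time verification.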