feat: optimize unitig index and document evidence elimination
Replace the dense per-chunk offset index with a sparse block-sampled structure (64 chunks per block), reducing the index file size by approximately 300× while preserving O(1) k-mer extraction. Introduce a design document for eliminating the `evidence.bin` file, which accounts for ~66% of the lookup layer, by transitioning to fingerprint-based approximate indexing and value-based MPHF lookups. Update MkDocs navigation to include the new documentation and add a file count tracker to the scatter step progress bar for improved observability.
This commit is contained in:
@@ -0,0 +1,321 @@
|
|||||||
|
# Evidence elimination — design discussion
|
||||||
|
|
||||||
|
## Problem statement
|
||||||
|
|
||||||
|
`evidence.bin` maps each MPHF slot to a position in the unitig store so that
|
||||||
|
query verification is possible: given a slot `s` returned by `mphf.index(kmer)`,
|
||||||
|
retrieve the k-mer stored at that position and compare with the query.
|
||||||
|
|
||||||
|
On the bacterial BCT dataset (2048 partitions, k=31, ~33 M k-mers/partition):
|
||||||
|
|
||||||
|
| file | size/partition | total (2048 parts) | fraction of lookup layer |
|
||||||
|
|---|---|---|---|
|
||||||
|
| evidence.bin | 132 MB | ~270 GB | **66 %** |
|
||||||
|
| unitigs.bin | 58 MB | ~118 GB | 29 % |
|
||||||
|
| mphf.bin | 10 MB | ~20 GB | 5 % |
|
||||||
|
|
||||||
|
Evidence dominates. Eliminating or drastically shrinking it is the highest-leverage
|
||||||
|
optimisation available for index size.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Why evidence exists
|
||||||
|
|
||||||
|
PtrHash (like all standard MPHFs) maps **any** input to a valid slot in `[0, n)`.
|
||||||
|
For a query k-mer not in the indexed set, the returned slot is meaningless but
|
||||||
|
indistinguishable from a real hit without external information. Evidence provides
|
||||||
|
that information: `evidence[s]` encodes the location of the k-mer that legitimately
|
||||||
|
occupies slot `s`, allowing the verification:
|
||||||
|
|
||||||
|
```
|
||||||
|
slot = mphf.index(query)
|
||||||
|
(chunk_id, rank) = evidence.decode(slot)
|
||||||
|
stored_kmer = unitigs.kmer_at(chunk_id, rank)
|
||||||
|
return canonical(stored_kmer) == canonical(query)
|
||||||
|
```
|
||||||
|
|
||||||
|
Evidence is a **permutation** from MPHF-space to unitig-position-space.
|
||||||
|
Storing it costs at minimum log₂(n_kmers) bits per slot — irrespective of encoding.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Information-theoretic lower bound
|
||||||
|
|
||||||
|
For a partition with P k-mers and U unitigs of average length m_u k-mers:
|
||||||
|
|
||||||
|
- global k-mer index range: [0, P) → ⌈log₂ P⌉ bits
|
||||||
|
- (chunk_id, rank) pair: ⌈log₂ U⌉ + ⌈log₂ L_max⌉ bits
|
||||||
|
|
||||||
|
Current implementation: 25 + 7 = 32 bits (aligned u32).
|
||||||
|
Theoretical minimum: ⌈log₂ P⌉ ≈ 25 bits for P ≈ 33 M.
|
||||||
|
|
||||||
|
**Packing headroom: ~22 %.** Not a path to elimination.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Two-index architecture
|
||||||
|
|
||||||
|
The exact index is mandatory for set operations (union, intersection, diff) and
|
||||||
|
exact k-mer retrieval. A separate approximate index, built for query operations,
|
||||||
|
can tolerate a controlled false positive rate in exchange for a much smaller
|
||||||
|
footprint.
|
||||||
|
|
||||||
|
| component | exact index | approximate index |
|
||||||
|
|---|---|---|
|
||||||
|
| `mphf.bin` | ✓ | ✓ (same structure) |
|
||||||
|
| `evidence.bin` | ✓ (32 bits/k-mer) | ✗ |
|
||||||
|
| `fingerprint.bin` | ✗ | ✓ (B bits/k-mer) |
|
||||||
|
| `unitigs.bin` | ✓ | ✓ (K-mer enumeration) |
|
||||||
|
| `unitigs.bin.idx` | ✓ | ✗ (random access not needed) |
|
||||||
|
|
||||||
|
The approximate index drops `evidence.bin` and `unitigs.bin.idx`; it keeps
|
||||||
|
`unitigs.bin` for sequential enumeration of K-mers.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## MPHF as a perfect Bloom filter
|
||||||
|
|
||||||
|
A standard Bloom filter with a single hash function and N bits storing M keys has
|
||||||
|
occupancy M/N. For a foreign query k-mer, P(FP) = M/N — the probability of
|
||||||
|
landing on a set bit. The empty space (fraction 1 − M/N of bits at 0) is what
|
||||||
|
rejects foreign k-mers.
|
||||||
|
|
||||||
|
An MPHF is a Bloom filter with **zero internal collisions**: every indexed k-mer
|
||||||
|
occupies its own unique slot. But unlike a Bloom filter, the MPHF maps **any**
|
||||||
|
input to a slot in [0, M) — there is no empty space. Every query lands on an
|
||||||
|
occupied slot. The MPHF alone cannot reject foreign k-mers at all.
|
||||||
|
|
||||||
|
Adding a B-bit fingerprint restores the discrimination:
|
||||||
|
|
||||||
|
```
|
||||||
|
slot = mphf.index(query)
|
||||||
|
fingerprint = hash(query) & mask_B
|
||||||
|
present = fingerprint_table[slot] == fingerprint
|
||||||
|
```
|
||||||
|
|
||||||
|
The fingerprint plays the role of the sparse space in the Bloom filter: it provides
|
||||||
|
the B bits of information needed to reject foreign k-mers.
|
||||||
|
|
||||||
|
Both structures reach the same fundamental cost for a given FP rate. For 1% FP:
|
||||||
|
|
||||||
|
- Bloom filter (optimal, k hash functions): ~9.6 bits/key
|
||||||
|
- MPHF (~3 bits/key) + fingerprint (7 bits/key): ~10 bits/key
|
||||||
|
|
||||||
|
This is a fundamental bound, not an implementation detail.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Approach A — MPHF + fingerprint (approximate index)
|
||||||
|
|
||||||
|
### Size
|
||||||
|
|
||||||
|
| B (bits) | fingerprint.bin/partition | vs evidence.bin (32 bits) |
|
||||||
|
|---|---|---|
|
||||||
|
| 8 | 33 MB | 4× smaller |
|
||||||
|
| 12 | 49 MB | 2.7× smaller |
|
||||||
|
| 16 | 66 MB | 2× smaller |
|
||||||
|
|
||||||
|
Total approximate index per partition at B=8: ~43 MB (vs ~142 MB for exact lookup layer).
|
||||||
|
|
||||||
|
### False positive rate — per k-mer query
|
||||||
|
|
||||||
|
For a specific non-indexed query k-mer q:
|
||||||
|
|
||||||
|
1. MPHF(q) → slot s, some value in [0, M)
|
||||||
|
2. fingerprint_table[s] holds the B-bit fingerprint of the legitimate k-mer at s
|
||||||
|
3. FP event: hash(q) & mask_B == fingerprint_table[s]
|
||||||
|
|
||||||
|
Since q is not the legitimate k-mer at s, its fingerprint is independent of
|
||||||
|
fingerprint_table[s], giving:
|
||||||
|
|
||||||
|
```
|
||||||
|
P(FP per k-mer) = 1 / 2^B
|
||||||
|
```
|
||||||
|
|
||||||
|
This is the probability of error **for one specific query k-mer**. It is not the
|
||||||
|
fraction of the k-mer universe that would be misclassified: querying all 4^k
|
||||||
|
possible k-mers would yield (4^k − M)/2^B false positives in absolute terms, but
|
||||||
|
that is not the relevant quantity for practical use.
|
||||||
|
|
||||||
|
### Equivalence classes
|
||||||
|
|
||||||
|
The MPHF + fingerprint partitions the universe of 4^k k-mers into M·2^B
|
||||||
|
equivalence classes of average size 4^k/(M·2^B). Each class contains 1 true
|
||||||
|
indexed k-mer and 4^k/(M·2^B) − 1 false positives. A larger M (fewer partitions)
|
||||||
|
produces smaller classes — finer discrimination in k-mer space — while P(FP) = 1/2^B
|
||||||
|
remains constant.
|
||||||
|
|
||||||
|
### Read-level use case
|
||||||
|
|
||||||
|
The relevant decision unit is the **read**, not the individual k-mer. For a read
|
||||||
|
of ~100 nucleotides and k=31, there are ~70 k-mers.
|
||||||
|
|
||||||
|
- A bacterial read queried against a bacterial index: nearly all ~70 k-mers are
|
||||||
|
true positives → high coverage fraction.
|
||||||
|
- A plant read queried against a bacterial index: k-mers are foreign; each has
|
||||||
|
P(FP) = 1/2^B independently → expected coverage fraction ≈ 1/2^B.
|
||||||
|
|
||||||
|
A coverage threshold separates the two cases decisively. This is the same
|
||||||
|
principle as Findere: local coverage continuity distinguishes true hits from noise.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Approach B — z-consecutive k-mer matching
|
||||||
|
|
||||||
|
A query for a K-mer of size K = k + z − 1 decomposes into z overlapping k-mers.
|
||||||
|
Declaring a match only when **all z are present** reduces the per-window FP rate:
|
||||||
|
|
||||||
|
```
|
||||||
|
P(FP per window of z) = (1/2^B)^z = 1/2^(B·z)
|
||||||
|
```
|
||||||
|
|
||||||
|
For a read with ~70 k-mers, there are ~70 − z + 1 independent windows of size z.
|
||||||
|
The probability that at least one window is a false positive:
|
||||||
|
|
||||||
|
```
|
||||||
|
P(FP_read) = 1 - (1 - 1/2^(B·z))^(70-z+1) ≈ (70-z+1) / 2^(B·z)
|
||||||
|
```
|
||||||
|
|
||||||
|
For B=8, z=4: P(FP_read) ≈ 67 / 2^32 ≈ 1.6×10⁻⁸.
|
||||||
|
|
||||||
|
A plant read is misclassified as bacterial roughly once in 60 million reads —
|
||||||
|
negligible for any practical dataset.
|
||||||
|
|
||||||
|
### Choosing B from (z, L, P_target)
|
||||||
|
|
||||||
|
z is a query-time parameter and does not affect the index structure. However,
|
||||||
|
knowing z at build time allows computing the minimum B required to reach a target
|
||||||
|
FP rate P_target for reads of length L (giving W = L − k − z + 2 independent
|
||||||
|
windows):
|
||||||
|
|
||||||
|
```
|
||||||
|
P_target ≈ W / 2^(B·z) → B = ceil( (log2(W) - log2(P_target)) / z )
|
||||||
|
```
|
||||||
|
|
||||||
|
Example: L=100, k=31, z=4, P_target=10⁻⁸ → W=67, B = ceil((6.07 + 26.6) / 4) = ceil(8.17) = **9 bits**.
|
||||||
|
|
||||||
|
(B, z) are co-determined at build time to minimise fingerprint size while
|
||||||
|
guaranteeing the target read-level FP rate.
|
||||||
|
|
||||||
|
### Combined sizing
|
||||||
|
|
||||||
|
| B | z | K = k+z−1 | P(FP_read) | fingerprint.bin/partition |
|
||||||
|
|---|---|---|---|---|
|
||||||
|
| 8 | 2 | 32 | ~67/2^16 ≈ 10⁻³ | 33 MB |
|
||||||
|
| 8 | 4 | 34 | ~67/2^32 ≈ 10⁻⁸ | 33 MB |
|
||||||
|
| 4 | 4 | 34 | ~67/2^16 ≈ 10⁻³ | 16 MB |
|
||||||
|
| 4 | 8 | 38 | ~63/2^32 ≈ 10⁻⁸ | 16 MB |
|
||||||
|
|
||||||
|
Smaller B → smaller fingerprint table; larger z → longer minimum match length K
|
||||||
|
and fewer independent windows per read.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Approach 1 — value-based MPHF (eliminates evidence.bin from exact index)
|
||||||
|
|
||||||
|
Build the MPHF to output the global k-mer position directly:
|
||||||
|
|
||||||
|
```
|
||||||
|
mphf: kmer → global_pos ∈ [0, P)
|
||||||
|
```
|
||||||
|
|
||||||
|
Verification becomes:
|
||||||
|
|
||||||
|
```
|
||||||
|
global_pos = mphf.index(query)
|
||||||
|
stored_kmer = unitigs.kmer_at_global_pos(global_pos)
|
||||||
|
return canonical(stored_kmer) == canonical(query)
|
||||||
|
```
|
||||||
|
|
||||||
|
No evidence array. The unitig block index (see below) provides
|
||||||
|
`kmer_at_global_pos` in O(log(n_blocks) + BLOCK_SIZE) time.
|
||||||
|
|
||||||
|
### What is required
|
||||||
|
|
||||||
|
A **retrieval data structure** (also called a value-based or function-based MPHF):
|
||||||
|
given a set of (key, value) pairs with distinct keys and bijective values in `[0, n)`,
|
||||||
|
build a compact structure that maps each key to its assigned value.
|
||||||
|
|
||||||
|
Known constructions:
|
||||||
|
|
||||||
|
- **GOV / GBF (Generalized Bloomier Filter)**: random 3-uniform hypergraph +
|
||||||
|
XOR-based assignment. ~2.3 bits/key overhead over the information-theoretic
|
||||||
|
minimum. Construction: O(n). Query: O(1).
|
||||||
|
- **SSHash approach**: builds the MPHF to map k-mers to their positions in a
|
||||||
|
concatenated unitig string. Achieves elimination of external evidence using a
|
||||||
|
"skew" construction that aligns the MPHF output with the sequential unitig layout.
|
||||||
|
|
||||||
|
### Rust availability
|
||||||
|
|
||||||
|
No Rust crate implements a retrieval data structure suitable for this use case as
|
||||||
|
of 2025. The `ph`, `boomphf`, `fmphf`, and `ptr_hash` crates all build plain
|
||||||
|
MPHFs. **This is the key blocking factor.**
|
||||||
|
|
||||||
|
### SSHash construction (reference)
|
||||||
|
|
||||||
|
SSHash (Pibiri 2022, doi:10.1186/s13015-022-00216-6) constructs the MPHF over
|
||||||
|
(minimizer, position-within-minimizer-bucket) pairs, aligning slots with sequential
|
||||||
|
positions in the concatenated unitig string. A port to obikmer would require:
|
||||||
|
|
||||||
|
1. Concatenating all unitig sequences into a single flat buffer per partition.
|
||||||
|
2. Assigning each k-mer a global position (its offset in that buffer).
|
||||||
|
3. Building the MPHF to output that position directly (retrieval step).
|
||||||
|
4. Replacing `evidence.bin` with a small prefix-sum index for `kmer_at_global_pos`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Approach 2 — block index prefix sums (reduces evidence to negligible)
|
||||||
|
|
||||||
|
A prerequisite already implemented: `unitigs.bin.idx` now uses a **block-sampled
|
||||||
|
offset index** (one `u32` per `BLOCK_SIZE=64` chunks) instead of a per-chunk offset
|
||||||
|
table.
|
||||||
|
|
||||||
|
### Extension: k-mer prefix sums per block
|
||||||
|
|
||||||
|
Add a second array to `unitigs.bin.idx`: `kmer_prefix[b]` = total k-mers before
|
||||||
|
block `b`. For 33 M k-mers: ~73 600 blocks × 4 bytes = **295 KB/partition**.
|
||||||
|
|
||||||
|
This enables `kmer_at_global_pos(p)`:
|
||||||
|
|
||||||
|
1. Binary search in `kmer_prefix[]` to find block `b`.
|
||||||
|
2. Sequential scan from `block_offsets[b]` until cumulative k-mer count reaches `p`.
|
||||||
|
3. Extract the k-mer at the remaining rank within the found chunk.
|
||||||
|
|
||||||
|
Cost: O(log(n_blocks) + BLOCK_SIZE) ≈ O(17 + 64) memory accesses.
|
||||||
|
|
||||||
|
### Combined with Approach 1
|
||||||
|
|
||||||
|
- evidence.bin: **eliminated** (~270 GB saved across 2048 partitions)
|
||||||
|
- kmer_prefix array: ~295 KB/partition × 2048 = ~600 MB total (negligible)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Recommended path
|
||||||
|
|
||||||
|
1. **Short term (approximate index)**: implement MPHF + fingerprint.bin. Choose
|
||||||
|
(B, z) as index parameters. Drop `evidence.bin` and `unitigs.bin.idx`; keep
|
||||||
|
`unitigs.bin` for K-mer enumeration. Expected size: ~43 MB/partition at B=8
|
||||||
|
vs ~142 MB for the exact lookup layer.
|
||||||
|
|
||||||
|
2. **Short term (exact index)**: add `kmer_prefix[]` to `unitigs.bin.idx`.
|
||||||
|
Zero cost if evidence.bin is kept; enables Approach 1 when ready.
|
||||||
|
|
||||||
|
3. **Medium term**: implement GOV retrieval data structure in Rust, or port
|
||||||
|
SSHash construction.
|
||||||
|
|
||||||
|
4. **Long term**: replace `evidence.bin` with the value-based MPHF. Expected
|
||||||
|
index size reduction: ~50 % of the lookup layer, ~270 GB on the BCT dataset.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Open questions
|
||||||
|
|
||||||
|
- Is a GOV construction compatible with the parallel MPHF build currently used
|
||||||
|
(PtrHash's `new_from_par_iter`)? GOV construction is inherently sequential
|
||||||
|
(hypergraph peeling); parallelisation is non-trivial.
|
||||||
|
- Can the SSHash "skew" insight be reused without the minimizer-bucket structure?
|
||||||
|
The obikmer partitioning already uses minimizers — there may be natural alignment.
|
||||||
|
- What is the query latency impact of replacing O(1) evidence lookup with
|
||||||
|
O(log n_blocks + BLOCK_SIZE) scan? Needs benchmarking at realistic BCT scale.
|
||||||
|
- What is the optimal (B, z) trade-off for the approximate index given the target
|
||||||
|
read length and acceptable P(FP_read)?
|
||||||
@@ -268,3 +268,5 @@ The MPHF is built from the **k-mer set**, not from the unitig sequences themselv
|
|||||||
## Open questions
|
## Open questions
|
||||||
|
|
||||||
- **Cross-partition evidence**: for set operations spanning multiple partitions, strategy B allows unitig-level operations (e.g. mark entire unitigs as present/absent) rather than kmer-level, potentially reducing the operation cost by a factor of m_u.
|
- **Cross-partition evidence**: for set operations spanning multiple partitions, strategy B allows unitig-level operations (e.g. mark entire unitigs as present/absent) rather than kmer-level, potentially reducing the operation cost by a factor of m_u.
|
||||||
|
|
||||||
|
- **Eliminating evidence.bin**: at ~66 % of the per-layer lookup footprint (132 MB vs 200 MB total per partition on the bacterial BCT dataset), evidence.bin dominates index size. A dedicated design investigation is open — see [Evidence elimination design discussion](evidence_elimination.md).
|
||||||
|
|||||||
@@ -44,6 +44,7 @@ nav:
|
|||||||
- On-disk storage: implementation/storage.md
|
- On-disk storage: implementation/storage.md
|
||||||
- MPHF selection: implementation/mphf.md
|
- MPHF selection: implementation/mphf.md
|
||||||
- Unitig evidence encoding: implementation/unitig_evidence.md
|
- Unitig evidence encoding: implementation/unitig_evidence.md
|
||||||
|
- Evidence elimination (discussion): implementation/evidence_elimination.md
|
||||||
- obilayeredmap crate: implementation/obilayeredmap.md
|
- obilayeredmap crate: implementation/obilayeredmap.md
|
||||||
- PersistentCompactIntVec: implementation/persistent_compact_int_vec.md
|
- PersistentCompactIntVec: implementation/persistent_compact_int_vec.md
|
||||||
- PersistentBitVec: implementation/persistent_bit_vec.md
|
- PersistentBitVec: implementation/persistent_bit_vec.md
|
||||||
|
|||||||
@@ -1,4 +1,6 @@
|
|||||||
use std::path::PathBuf;
|
use std::path::PathBuf;
|
||||||
|
use std::sync::atomic::{AtomicU64, Ordering};
|
||||||
|
use std::sync::Arc;
|
||||||
use std::time::{Duration, Instant};
|
use std::time::{Duration, Instant};
|
||||||
|
|
||||||
use indicatif::{ProgressBar, ProgressStyle};
|
use indicatif::{ProgressBar, ProgressStyle};
|
||||||
@@ -42,15 +44,21 @@ pub fn scatter(
|
|||||||
// Throttle in the source thread — never in a worker — to prevent deadlock.
|
// Throttle in the source thread — never in a worker — to prevent deadlock.
|
||||||
let throttled = throttle_paths(path_source, max_open);
|
let throttled = throttle_paths(path_source, max_open);
|
||||||
|
|
||||||
|
let file_count = Arc::new(AtomicU64::new(0));
|
||||||
|
|
||||||
let t = Stage::start("scatter");
|
let t = Stage::start("scatter");
|
||||||
let pipe = obipipeline::make_pipe! {
|
let pipe = obipipeline::make_pipe! {
|
||||||
PipelineData : PathWithSlot => Vec<RoutableSuperKmer>,
|
PipelineData : PathWithSlot => Vec<RoutableSuperKmer>,
|
||||||
||? { move |pw: PathWithSlot| {
|
||? {
|
||||||
|
let file_count = Arc::clone(&file_count);
|
||||||
|
move |pw: PathWithSlot| {
|
||||||
let PathWithSlot { path, _guard } = pw;
|
let PathWithSlot { path, _guard } = pw;
|
||||||
info!("indexing: {}", path.display());
|
let n = file_count.fetch_add(1, Ordering::Relaxed) + 1;
|
||||||
|
info!("indexing [{}]: {}", n, path.display());
|
||||||
// _guard travels into GuardedIter; released when all chunks are read
|
// _guard travels into GuardedIter; released when all chunks are read
|
||||||
open_chunks(path).map(|iter| GuardedIter { inner: iter, _guard })
|
open_chunks(path).map(|iter| GuardedIter { inner: iter, _guard })
|
||||||
}} : Path => RawChunk,
|
}
|
||||||
|
} : Path => RawChunk,
|
||||||
|? { move |rope| obiread::normalize_sequence_chunk(rope, k) } : RawChunk => NormChunk,
|
|? { move |rope| obiread::normalize_sequence_chunk(rope, k) } : RawChunk => NormChunk,
|
||||||
| { move |rope| obiskbuilder::build_superkmers(rope, k, level_max, theta) } : NormChunk => Batch,
|
| { move |rope| obiskbuilder::build_superkmers(rope, k, level_max, theta) } : NormChunk => Batch,
|
||||||
};
|
};
|
||||||
@@ -84,7 +92,8 @@ pub fn scatter(
|
|||||||
} else {
|
} else {
|
||||||
(format!("{:.0} Mbp", bp / 1e6), format!("{:.0} Mbp/s", ema_rate / 1e6))
|
(format!("{:.0} Mbp", bp / 1e6), format!("{:.0} Mbp/s", ema_rate / 1e6))
|
||||||
};
|
};
|
||||||
pb.set_message(format!("{count_str} {rate_str}"));
|
let n_files = file_count.load(Ordering::Relaxed);
|
||||||
|
pb.set_message(format!("{count_str} {rate_str} {n_files} files"));
|
||||||
}
|
}
|
||||||
kp.write_batch(batch).unwrap_or_else(|e| {
|
kp.write_batch(batch).unwrap_or_else(|e| {
|
||||||
eprintln!("error: {e}");
|
eprintln!("error: {e}");
|
||||||
|
|||||||
+134
-130
@@ -9,21 +9,20 @@ pub use obikseq::MAX_KMERS_PER_CHUNK;
|
|||||||
|
|
||||||
use crate::error::{SKError, SKResult};
|
use crate::error::{SKError, SKResult};
|
||||||
|
|
||||||
// ── Index file format ─────────────────────────────────────────────────────────
|
// ── Block index parameters ────────────────────────────────────────────────────
|
||||||
//
|
//
|
||||||
// magic: [u8; 4] = b"UIDX"
|
// One offset entry per BLOCK_SIZE chunks. BLOCK_SIZE must be a power of two
|
||||||
// n_unitigs: u32 LE
|
// so that block = i >> LOG2_BLOCK_SIZE and rem = i & (BLOCK_SIZE − 1) are
|
||||||
// n_kmers: u64 LE total kmer count across all chunks
|
// branchless shifts/masks rather than divisions.
|
||||||
// seqls: [u8; n_unitigs] max kmer index per chunk (= n_kmers − 1)
|
|
||||||
// packed_offsets: [u32; n_unitigs + 1] byte offsets to packed bytes in the
|
|
||||||
// sequence file; last entry is sentinel
|
|
||||||
//
|
//
|
||||||
// Each sequence record in the binary file: [u8: n_kmers−1][packed bytes].
|
// With BLOCK_SIZE = 64 and an average chunk size of ~10 bytes, a random lookup
|
||||||
// Offsets point to the first packed byte of each record, past the leading u8.
|
// scans at most 63 × 10 = 630 bytes sequentially — negligible next to the MPHF
|
||||||
// Unitigs with more than MAX_KMERS_PER_CHUNK kmers are transparently split by the
|
// lookup that precedes it. The index file shrinks from ~5 bytes/chunk to
|
||||||
// writer into overlapping chunks (k-1 nucleotide overlap) so no kmer is lost.
|
// ~1/64 bytes/chunk (≈ 300× for typical workloads).
|
||||||
|
|
||||||
const MAGIC: [u8; 4] = *b"UIDX";
|
const MAGIC: [u8; 4] = *b"UIX2";
|
||||||
|
const BLOCK_SIZE: usize = 64;
|
||||||
|
const LOG2_BLOCK_SIZE: u32 = 6; // 2^6 = BLOCK_SIZE
|
||||||
|
|
||||||
fn idx_path(path: &Path) -> PathBuf {
|
fn idx_path(path: &Path) -> PathBuf {
|
||||||
crate::append_path_suffix(path, ".idx")
|
crate::append_path_suffix(path, ".idx")
|
||||||
@@ -32,19 +31,19 @@ fn idx_path(path: &Path) -> PathBuf {
|
|||||||
// ── Writer ────────────────────────────────────────────────────────────────────
|
// ── Writer ────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
/// Writes a sequence of [`Unitig`] to an uncompressed binary file and builds
|
/// Writes a sequence of [`Unitig`] to an uncompressed binary file and builds
|
||||||
/// an offset index at close time.
|
/// a block-sampled offset index at close time.
|
||||||
///
|
///
|
||||||
/// Unitigs with more than [`MAX_KMERS_PER_CHUNK`] kmers are transparently split
|
/// One offset is stored every [`BLOCK_SIZE`] chunks; random access to chunk `i`
|
||||||
/// into overlapping chunks (k-1 nucleotide overlap) so no kmer is lost.
|
/// costs at most `BLOCK_SIZE − 1` sequential chunk scans after the block lookup.
|
||||||
///
|
///
|
||||||
/// The companion index file (`path.idx`) is written on [`close`].
|
/// Unitigs with more than [`MAX_KMERS_PER_CHUNK`] k-mers are transparently split
|
||||||
/// The binary format per record is `[u8: n_kmers−1][packed 2-bit bytes]`.
|
/// into overlapping chunks (k−1 nucleotide overlap) so no k-mer is lost.
|
||||||
pub struct UnitigFileWriter {
|
pub struct UnitigFileWriter {
|
||||||
path: PathBuf,
|
path: PathBuf,
|
||||||
file: BufWriter<File>,
|
file: BufWriter<File>,
|
||||||
seqls: Vec<u8>,
|
block_offsets: Vec<u32>, // byte offset of first record in each block
|
||||||
packed_offsets: Vec<u32>,
|
chunk_count: usize,
|
||||||
next_offset: u32,
|
next_offset: u32, // byte offset of the START of the next record
|
||||||
n_kmers: usize,
|
n_kmers: usize,
|
||||||
k: usize,
|
k: usize,
|
||||||
}
|
}
|
||||||
@@ -55,15 +54,16 @@ impl UnitigFileWriter {
|
|||||||
Ok(Self {
|
Ok(Self {
|
||||||
path: path.to_owned(),
|
path: path.to_owned(),
|
||||||
file: BufWriter::new(file),
|
file: BufWriter::new(file),
|
||||||
seqls: Vec::new(),
|
block_offsets: Vec::new(),
|
||||||
packed_offsets: Vec::new(),
|
chunk_count: 0,
|
||||||
next_offset: 0,
|
next_offset: 0,
|
||||||
n_kmers: 0,
|
n_kmers: 0,
|
||||||
k: obikseq::params::k(),
|
k: obikseq::params::k(),
|
||||||
})
|
})
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Write a unitig, splitting it into chunks if it exceeds [`MAX_KMERS_PER_CHUNK`].
|
/// Write a unitig, splitting into overlapping chunks if it exceeds
|
||||||
|
/// [`MAX_KMERS_PER_CHUNK`].
|
||||||
pub fn write(&mut self, unitig: &Unitig) -> SKResult<()> {
|
pub fn write(&mut self, unitig: &Unitig) -> SKResult<()> {
|
||||||
let seql = unitig.seql();
|
let seql = unitig.seql();
|
||||||
let k = self.k;
|
let k = self.k;
|
||||||
@@ -77,17 +77,13 @@ impl UnitigFileWriter {
|
|||||||
return self.write_chunk(unitig);
|
return self.write_chunk(unitig);
|
||||||
}
|
}
|
||||||
|
|
||||||
// Split into overlapping chunks of MAX_KMERS_PER_CHUNK kmers.
|
|
||||||
// Overlap of k-1 nucleotides ensures no kmer is lost at boundaries.
|
|
||||||
let chunk_nucl = MAX_KMERS_PER_CHUNK + k - 1;
|
let chunk_nucl = MAX_KMERS_PER_CHUNK + k - 1;
|
||||||
let stride = MAX_KMERS_PER_CHUNK;
|
let stride = MAX_KMERS_PER_CHUNK;
|
||||||
let mut start = 0;
|
let mut start = 0;
|
||||||
while start < seql {
|
while start < seql {
|
||||||
let end = (start + chunk_nucl).min(seql);
|
let end = (start + chunk_nucl).min(seql);
|
||||||
self.write_chunk(&unitig.sub(start, end))?;
|
self.write_chunk(&unitig.sub(start, end))?;
|
||||||
if end == seql {
|
if end == seql { break; }
|
||||||
break;
|
|
||||||
}
|
|
||||||
start += stride;
|
start += stride;
|
||||||
}
|
}
|
||||||
Ok(())
|
Ok(())
|
||||||
@@ -97,54 +93,48 @@ impl UnitigFileWriter {
|
|||||||
let seql = unitig.seql();
|
let seql = unitig.seql();
|
||||||
let byte_len = (seql + 3) / 4;
|
let byte_len = (seql + 3) / 4;
|
||||||
|
|
||||||
// Header is 1 byte (u8: n_kmers − 1 = seql − k); packed bytes follow.
|
|
||||||
debug_assert!(seql - self.k <= u8::MAX as usize, "chunk exceeds MAX_KMERS_PER_CHUNK");
|
debug_assert!(seql - self.k <= u8::MAX as usize, "chunk exceeds MAX_KMERS_PER_CHUNK");
|
||||||
self.packed_offsets.push(self.next_offset + 1);
|
|
||||||
self.seqls.push((seql - self.k) as u8);
|
|
||||||
self.n_kmers += seql - self.k + 1;
|
|
||||||
|
|
||||||
unitig
|
// Record a block offset at the start of every BLOCK_SIZE-th chunk.
|
||||||
.write_to_binary(&mut self.file)
|
if self.chunk_count & (BLOCK_SIZE - 1) == 0 {
|
||||||
.map_err(SKError::Io)?;
|
self.block_offsets.push(self.next_offset);
|
||||||
|
}
|
||||||
|
|
||||||
|
self.n_kmers += seql - self.k + 1;
|
||||||
|
self.chunk_count += 1;
|
||||||
|
|
||||||
|
unitig.write_to_binary(&mut self.file).map_err(SKError::Io)?;
|
||||||
|
|
||||||
self.next_offset += 1 + byte_len as u32;
|
self.next_offset += 1 + byte_len as u32;
|
||||||
Ok(())
|
Ok(())
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Flush the sequence file and write the companion `.idx`.
|
|
||||||
pub fn close(mut self) -> SKResult<()> {
|
pub fn close(mut self) -> SKResult<()> {
|
||||||
self.file.flush().map_err(SKError::Io)?;
|
self.file.flush().map_err(SKError::Io)?;
|
||||||
drop(self.file);
|
drop(self.file);
|
||||||
|
|
||||||
// Sentinel: byte offset past the last record's packed bytes.
|
// Sentinel: byte offset past the last record (needed for end-of-file detection).
|
||||||
let sentinel = match (self.packed_offsets.last(), self.seqls.last()) {
|
self.block_offsets.push(self.next_offset);
|
||||||
(Some(&last_off), Some(&last_seql)) => {
|
|
||||||
let seql = last_seql as u32 + self.k as u32;
|
|
||||||
last_off + (seql + 3) / 4
|
|
||||||
}
|
|
||||||
_ => 0,
|
|
||||||
};
|
|
||||||
self.packed_offsets.push(sentinel);
|
|
||||||
|
|
||||||
write_idx(&idx_path(&self.path), &self.seqls, &self.packed_offsets, self.n_kmers)
|
write_idx(
|
||||||
|
&idx_path(&self.path),
|
||||||
|
self.chunk_count as u32,
|
||||||
|
self.n_kmers as u64,
|
||||||
|
&self.block_offsets,
|
||||||
|
)
|
||||||
}
|
}
|
||||||
|
|
||||||
pub fn len(&self) -> usize {
|
pub fn len(&self) -> usize { self.chunk_count }
|
||||||
self.seqls.len()
|
pub fn is_empty(&self) -> bool { self.chunk_count == 0 }
|
||||||
}
|
|
||||||
|
|
||||||
pub fn is_empty(&self) -> bool {
|
|
||||||
self.seqls.is_empty()
|
|
||||||
}
|
|
||||||
}
|
}
|
||||||
|
|
||||||
fn write_idx(path: &Path, seqls: &[u8], packed_offsets: &[u32], n_kmers: usize) -> SKResult<()> {
|
fn write_idx(path: &Path, n_unitigs: u32, n_kmers: u64, block_offsets: &[u32]) -> SKResult<()> {
|
||||||
let mut w = BufWriter::new(File::create(path).map_err(SKError::Io)?);
|
let mut w = BufWriter::new(File::create(path).map_err(SKError::Io)?);
|
||||||
w.write_all(&MAGIC).map_err(SKError::Io)?;
|
w.write_all(&MAGIC).map_err(SKError::Io)?;
|
||||||
w.write_all(&(seqls.len() as u32).to_le_bytes()).map_err(SKError::Io)?;
|
w.write_all(&(BLOCK_SIZE as u32).to_le_bytes()).map_err(SKError::Io)?;
|
||||||
w.write_all(&(n_kmers as u64).to_le_bytes()).map_err(SKError::Io)?;
|
w.write_all(&n_unitigs.to_le_bytes()).map_err(SKError::Io)?;
|
||||||
w.write_all(seqls).map_err(SKError::Io)?;
|
w.write_all(&n_kmers.to_le_bytes()).map_err(SKError::Io)?;
|
||||||
for &off in packed_offsets {
|
for &off in block_offsets {
|
||||||
w.write_all(&off.to_le_bytes()).map_err(SKError::Io)?;
|
w.write_all(&off.to_le_bytes()).map_err(SKError::Io)?;
|
||||||
}
|
}
|
||||||
w.flush().map_err(SKError::Io)
|
w.flush().map_err(SKError::Io)
|
||||||
@@ -154,12 +144,17 @@ fn write_idx(path: &Path, seqls: &[u8], packed_offsets: &[u32], n_kmers: usize)
|
|||||||
|
|
||||||
/// Read-only random-access view of a unitig file.
|
/// Read-only random-access view of a unitig file.
|
||||||
///
|
///
|
||||||
/// The sequence file is memory-mapped; the index is loaded into RAM on open.
|
/// The sequence file is memory-mapped; the block offset table is loaded into RAM
|
||||||
/// All per-kmer operations are O(1) and allocation-free.
|
/// on open (≈ n_chunks / BLOCK_SIZE entries, negligible memory).
|
||||||
|
///
|
||||||
|
/// Random access to chunk `i`: O(BLOCK_SIZE) sequential mmap reads — branchless
|
||||||
|
/// shift/mask arithmetic, cache-friendly, negligible versus the MPHF lookup.
|
||||||
|
///
|
||||||
|
/// Sequential iteration: O(n) via a running-offset cursor (no per-chunk overhead).
|
||||||
pub struct UnitigFileReader {
|
pub struct UnitigFileReader {
|
||||||
mmap: Mmap,
|
mmap: Mmap,
|
||||||
seqls: Vec<u8>,
|
block_offsets: Vec<u32>,
|
||||||
packed_offsets: Vec<u32>,
|
n_unitigs: usize,
|
||||||
n_kmers: usize,
|
n_kmers: usize,
|
||||||
k: usize,
|
k: usize,
|
||||||
}
|
}
|
||||||
@@ -168,91 +163,97 @@ impl UnitigFileReader {
|
|||||||
pub fn open(path: &Path) -> SKResult<Self> {
|
pub fn open(path: &Path) -> SKResult<Self> {
|
||||||
let file = File::open(path).map_err(SKError::Io)?;
|
let file = File::open(path).map_err(SKError::Io)?;
|
||||||
let mmap = unsafe { Mmap::map(&file).map_err(SKError::Io)? };
|
let mmap = unsafe { Mmap::map(&file).map_err(SKError::Io)? };
|
||||||
let (seqls, packed_offsets, n_kmers) = read_idx(&idx_path(path))?;
|
let (n_unitigs, n_kmers, block_offsets) = read_idx(&idx_path(path))?;
|
||||||
let k = obikseq::params::k();
|
let k = obikseq::params::k();
|
||||||
Ok(Self { mmap, seqls, packed_offsets, n_kmers, k })
|
Ok(Self { mmap, block_offsets, n_unitigs, n_kmers, k })
|
||||||
}
|
}
|
||||||
|
|
||||||
pub fn len(&self) -> usize {
|
pub fn len(&self) -> usize { self.n_unitigs }
|
||||||
self.seqls.len()
|
pub fn is_empty(&self) -> bool { self.n_unitigs == 0 }
|
||||||
|
pub fn n_kmers(&self) -> usize { self.n_kmers }
|
||||||
|
|
||||||
|
/// Byte offset of the START of record `i` (the seql byte) in the mmap.
|
||||||
|
/// O(BLOCK_SIZE) sequential scan within the block.
|
||||||
|
#[inline]
|
||||||
|
fn chunk_start(&self, i: usize) -> usize {
|
||||||
|
let block = i >> LOG2_BLOCK_SIZE;
|
||||||
|
let rem = i & (BLOCK_SIZE - 1);
|
||||||
|
let mut offset = self.block_offsets[block] as usize;
|
||||||
|
for _ in 0..rem {
|
||||||
|
let seql_minus_k = self.mmap[offset] as usize;
|
||||||
|
offset += 1 + (seql_minus_k + self.k + 3) / 4;
|
||||||
|
}
|
||||||
|
offset
|
||||||
}
|
}
|
||||||
|
|
||||||
pub fn is_empty(&self) -> bool {
|
/// Nucleotide length of chunk `i`.
|
||||||
self.seqls.is_empty()
|
|
||||||
}
|
|
||||||
|
|
||||||
/// Total number of kmers across all chunks.
|
|
||||||
pub fn n_kmers(&self) -> usize {
|
|
||||||
self.n_kmers
|
|
||||||
}
|
|
||||||
|
|
||||||
/// Return the nucleotide length of chunk `i`.
|
|
||||||
#[inline]
|
#[inline]
|
||||||
pub fn seql(&self, i: usize) -> usize {
|
pub fn seql(&self, i: usize) -> usize {
|
||||||
self.seqls[i] as usize + self.k
|
self.mmap[self.chunk_start(i)] as usize + self.k
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Reconstruct chunk `i` as a [`Unitig`]. Allocates a copy of the packed bytes.
|
/// Reconstruct chunk `i` as a [`Unitig`].
|
||||||
pub fn unitig(&self, i: usize) -> Unitig {
|
pub fn unitig(&self, i: usize) -> Unitig {
|
||||||
let seql = self.seqls[i] as usize + self.k;
|
let offset = self.chunk_start(i);
|
||||||
let start = self.packed_offsets[i] as usize;
|
let seql = self.mmap[offset] as usize + self.k;
|
||||||
let byte_len = (seql + 3) / 4;
|
let byte_len = (seql + 3) / 4;
|
||||||
let tail = (seql % 4) as u8;
|
let bytes = self.mmap[offset + 1..offset + 1 + byte_len].to_vec().into_boxed_slice();
|
||||||
let bytes = self.mmap[start..start + byte_len].to_vec().into_boxed_slice();
|
Unitig::new((seql % 4) as u8, bytes)
|
||||||
Unitig::new(tail, bytes)
|
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Extract the raw left-aligned u64 of the kmer at position `j` within chunk `i`.
|
/// Raw left-aligned u64 of the k-mer at position `j` within chunk `i`.
|
||||||
#[inline]
|
#[inline]
|
||||||
pub fn raw_kmer(&self, i: usize, j: usize) -> u64 {
|
pub fn raw_kmer(&self, i: usize, j: usize) -> u64 {
|
||||||
let start = self.packed_offsets[i] as usize;
|
let offset = self.chunk_start(i);
|
||||||
extract_kmer_raw(&self.mmap[start..], j, self.k)
|
extract_kmer_raw(&self.mmap[offset + 1..], j, self.k)
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Return `true` iff the kmer at position `j` of chunk `i` equals `query`.
|
/// `true` iff the k-mer at position `j` of chunk `i` equals `query` (canonical).
|
||||||
///
|
|
||||||
/// O(1), zero allocation. The chunk may store either orientation of the kmer;
|
|
||||||
/// canonicalization is applied before comparison.
|
|
||||||
#[inline]
|
#[inline]
|
||||||
pub fn verify_canonical_kmer(&self, i: usize, j: usize, query: CanonicalKmer) -> bool {
|
pub fn verify_canonical_kmer(&self, i: usize, j: usize, query: CanonicalKmer) -> bool {
|
||||||
canonical_raw(self.raw_kmer(i, j), self.k) == query.raw()
|
canonical_raw(self.raw_kmer(i, j), self.k) == query.raw()
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Iterate over all kmers in file order (all positions of chunk 0, then chunk 1, …).
|
// ── Sequential iterators (O(n) running-offset cursor) ─────────────────────
|
||||||
///
|
|
||||||
/// Each chunk is copied from the mmap once; iteration within the chunk is
|
/// Iterate all chunks in file order with a running byte offset — O(n) total.
|
||||||
/// zero-allocation (sliding-window via [`OwnedPackedSeqKmerIter`]).
|
fn iter_chunks_sequential(&self) -> impl Iterator<Item = (usize, Unitig)> + '_ {
|
||||||
|
let k = self.k;
|
||||||
|
let mmap = &*self.mmap;
|
||||||
|
let n = self.n_unitigs;
|
||||||
|
let mut offset = 0usize;
|
||||||
|
(0..n).map(move |chunk_id| {
|
||||||
|
let seql = mmap[offset] as usize + k;
|
||||||
|
let byte_len = (seql + 3) / 4;
|
||||||
|
let bytes = mmap[offset + 1..offset + 1 + byte_len].to_vec().into_boxed_slice();
|
||||||
|
offset += 1 + byte_len;
|
||||||
|
(chunk_id, Unitig::new((seql % 4) as u8, bytes))
|
||||||
|
})
|
||||||
|
}
|
||||||
|
|
||||||
pub fn iter_kmers(&self) -> impl Iterator<Item = Kmer> + '_ {
|
pub fn iter_kmers(&self) -> impl Iterator<Item = Kmer> + '_ {
|
||||||
(0..self.len()).flat_map(move |i| self.unitig(i).into_kmers())
|
self.iter_chunks_sequential()
|
||||||
|
.flat_map(|(_, u)| u.into_kmers())
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Iterate over all canonical kmers in file order.
|
|
||||||
///
|
|
||||||
/// Equivalent to `iter_kmers().map(|km| km.canonical())` but uses the
|
|
||||||
/// built-in canonical iterator on each chunk, which avoids a separate
|
|
||||||
/// canonicalization pass.
|
|
||||||
pub fn iter_canonical_kmers(&self) -> impl Iterator<Item = CanonicalKmer> + '_ {
|
pub fn iter_canonical_kmers(&self) -> impl Iterator<Item = CanonicalKmer> + '_ {
|
||||||
(0..self.len()).flat_map(move |i| self.unitig(i).into_canonical_kmers())
|
self.iter_chunks_sequential()
|
||||||
|
.flat_map(|(_, u)| u.into_canonical_kmers())
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Iterate over `(kmer, chunk_id, rank)` for every canonical kmer in the file.
|
|
||||||
///
|
|
||||||
/// `chunk_id` is the index of the chunk within this file; `rank` is the
|
|
||||||
/// 0-based position of the kmer within that chunk. Used to build the
|
|
||||||
/// evidence table in `obilayeredmap`.
|
|
||||||
pub fn iter_indexed_canonical_kmers(
|
pub fn iter_indexed_canonical_kmers(
|
||||||
&self,
|
&self,
|
||||||
) -> impl Iterator<Item = (CanonicalKmer, usize, usize)> + '_ {
|
) -> impl Iterator<Item = (CanonicalKmer, usize, usize)> + '_ {
|
||||||
(0..self.len()).flat_map(move |chunk_id| {
|
self.iter_chunks_sequential()
|
||||||
self.unitig(chunk_id)
|
.flat_map(|(chunk_id, u)| {
|
||||||
.into_canonical_kmers()
|
u.into_canonical_kmers()
|
||||||
.enumerate()
|
.enumerate()
|
||||||
.map(move |(rank, kmer)| (kmer, chunk_id, rank))
|
.map(move |(rank, kmer)| (kmer, chunk_id, rank))
|
||||||
})
|
})
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
fn read_idx(path: &Path) -> SKResult<(Vec<u8>, Vec<u32>, usize)> {
|
fn read_idx(path: &Path) -> SKResult<(usize, usize, Vec<u32>)> {
|
||||||
let data = std::fs::read(path).map_err(SKError::Io)?;
|
let data = std::fs::read(path).map_err(SKError::Io)?;
|
||||||
let mut pos = 0;
|
let mut pos = 0;
|
||||||
|
|
||||||
@@ -260,15 +261,27 @@ fn read_idx(path: &Path) -> SKResult<(Vec<u8>, Vec<u32>, usize)> {
|
|||||||
.ok_or(SKError::Truncated { context: "unitig index: magic" })?;
|
.ok_or(SKError::Truncated { context: "unitig index: magic" })?;
|
||||||
if magic_bytes != &MAGIC {
|
if magic_bytes != &MAGIC {
|
||||||
return Err(SKError::BadMagic {
|
return Err(SKError::BadMagic {
|
||||||
expected: "UIDX",
|
expected: "UIX2",
|
||||||
got: magic_bytes.try_into().unwrap(),
|
got: magic_bytes.try_into().unwrap(),
|
||||||
});
|
});
|
||||||
}
|
}
|
||||||
pos += 4;
|
pos += 4;
|
||||||
|
|
||||||
|
// block_size stored for forward-compatibility verification
|
||||||
|
let bs_bytes = data.get(pos..pos + 4)
|
||||||
|
.ok_or(SKError::Truncated { context: "unitig index: block_size" })?;
|
||||||
|
let stored_bs = u32::from_le_bytes(bs_bytes.try_into().unwrap()) as usize;
|
||||||
|
if stored_bs != BLOCK_SIZE {
|
||||||
|
return Err(SKError::InvalidData {
|
||||||
|
context: "unitig index",
|
||||||
|
detail: format!("block_size mismatch: file={stored_bs} code={BLOCK_SIZE}"),
|
||||||
|
});
|
||||||
|
}
|
||||||
|
pos += 4;
|
||||||
|
|
||||||
let n_bytes = data.get(pos..pos + 4)
|
let n_bytes = data.get(pos..pos + 4)
|
||||||
.ok_or(SKError::Truncated { context: "unitig index: n_unitigs" })?;
|
.ok_or(SKError::Truncated { context: "unitig index: n_unitigs" })?;
|
||||||
let n = u32::from_le_bytes(n_bytes.try_into().unwrap()) as usize;
|
let n_unitigs = u32::from_le_bytes(n_bytes.try_into().unwrap()) as usize;
|
||||||
pos += 4;
|
pos += 4;
|
||||||
|
|
||||||
let nk_bytes = data.get(pos..pos + 8)
|
let nk_bytes = data.get(pos..pos + 8)
|
||||||
@@ -276,25 +289,21 @@ fn read_idx(path: &Path) -> SKResult<(Vec<u8>, Vec<u32>, usize)> {
|
|||||||
let n_kmers = u64::from_le_bytes(nk_bytes.try_into().unwrap()) as usize;
|
let n_kmers = u64::from_le_bytes(nk_bytes.try_into().unwrap()) as usize;
|
||||||
pos += 8;
|
pos += 8;
|
||||||
|
|
||||||
let seqls = data.get(pos..pos + n)
|
let n_blocks = (n_unitigs + BLOCK_SIZE - 1) >> LOG2_BLOCK_SIZE;
|
||||||
.ok_or(SKError::Truncated { context: "unitig index: seqls" })?
|
let n_offsets = n_blocks + 1; // +1 for sentinel
|
||||||
.to_vec();
|
let mut block_offsets = Vec::with_capacity(n_offsets);
|
||||||
pos += n;
|
for _ in 0..n_offsets {
|
||||||
|
|
||||||
let mut packed_offsets = Vec::with_capacity(n + 1);
|
|
||||||
for _ in 0..=n {
|
|
||||||
let off_bytes = data.get(pos..pos + 4)
|
let off_bytes = data.get(pos..pos + 4)
|
||||||
.ok_or(SKError::Truncated { context: "unitig index: packed_offsets" })?;
|
.ok_or(SKError::Truncated { context: "unitig index: block_offsets" })?;
|
||||||
packed_offsets.push(u32::from_le_bytes(off_bytes.try_into().unwrap()));
|
block_offsets.push(u32::from_le_bytes(off_bytes.try_into().unwrap()));
|
||||||
pos += 4;
|
pos += 4;
|
||||||
}
|
}
|
||||||
|
|
||||||
Ok((seqls, packed_offsets, n_kmers))
|
Ok((n_unitigs, n_kmers, block_offsets))
|
||||||
}
|
}
|
||||||
|
|
||||||
// ── Kmer utilities ────────────────────────────────────────────────────────────
|
// ── Kmer utilities ────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
/// Reverse complement of a left-aligned 2-bit kmer (same algorithm as [`KmerOf::revcomp`]).
|
|
||||||
#[inline]
|
#[inline]
|
||||||
fn revcomp_raw(raw: u64, k: usize) -> u64 {
|
fn revcomp_raw(raw: u64, k: usize) -> u64 {
|
||||||
let x = !raw;
|
let x = !raw;
|
||||||
@@ -304,22 +313,17 @@ fn revcomp_raw(raw: u64, k: usize) -> u64 {
|
|||||||
x << (64 - 2 * k)
|
x << (64 - 2 * k)
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Canonical form of a left-aligned 2-bit kmer: `min(kmer, revcomp(kmer))`.
|
|
||||||
#[inline]
|
#[inline]
|
||||||
fn canonical_raw(raw: u64, k: usize) -> u64 {
|
fn canonical_raw(raw: u64, k: usize) -> u64 {
|
||||||
raw.min(revcomp_raw(raw, k))
|
raw.min(revcomp_raw(raw, k))
|
||||||
}
|
}
|
||||||
|
|
||||||
// ── Bit extraction ────────────────────────────────────────────────────────────
|
|
||||||
|
|
||||||
/// Extract the kmer at nucleotide position `j` from MSB-first 2-bit packed `bytes`.
|
|
||||||
/// Returns a left-aligned u64 matching [`KmerOf`]'s internal representation.
|
|
||||||
#[inline]
|
#[inline]
|
||||||
fn extract_kmer_raw(bytes: &[u8], j: usize, k: usize) -> u64 {
|
fn extract_kmer_raw(bytes: &[u8], j: usize, k: usize) -> u64 {
|
||||||
let bit_start = j * 2;
|
let bit_start = j * 2;
|
||||||
let byte_start = bit_start / 8;
|
let byte_start = bit_start / 8;
|
||||||
let bit_offset = bit_start % 8; // always 0, 2, 4, or 6
|
let bit_offset = bit_start % 8;
|
||||||
let bytes_needed = (bit_offset + 2 * k + 7) / 8; // ≤ 9 for k ≤ 32
|
let bytes_needed = (bit_offset + 2 * k + 7) / 8;
|
||||||
|
|
||||||
let mut acc = 0u128;
|
let mut acc = 0u128;
|
||||||
for idx in 0..bytes_needed {
|
for idx in 0..bytes_needed {
|
||||||
|
|||||||
Reference in New Issue
Block a user