Merge pull request 'Push qkpyqurltlpk' (#1) from push-qkpyqurltlpk into main
Reviewed-on: #1
This commit was merged in pull request #1.
This commit is contained in:
@@ -1,28 +1,8 @@
|
|||||||
## Chose à vérifier suite à la commande index
|
|
||||||
|
|
||||||
- partition.meta ne devrait plus exister
|
|
||||||
- les spectrums globaux devrait etre identifier par génome
|
|
||||||
- regrouper dans un sous-dossier spectrums à la racine de l'index avec un nom basé sur le génome
|
|
||||||
- les spectrum patiels ont-ils vocation à être conserver ?
|
|
||||||
- l'étape de déreplication dure quasiment autant de temps que le comptage mais ne laisse aucune trace de progression à l'utilisateur
|
|
||||||
|
|
||||||
## commandes à ajouter
|
## commandes à ajouter
|
||||||
|
|
||||||
- merge : pour construire un index à partir d'index existants
|
|
||||||
- deux modes : count et presence/absence. count exige que tous les index mergés soient déjà en mode count. mode presence/absence par defaut. Si passage de mode count à mode presence/absence, par defaut presence = count >= 1. Possibilité de spécifier un seuil personnalisé.
|
|
||||||
|
|
||||||
- filter : produit un nouvel index filtré à partir d'un index existant en verifiant que les kmer présents dans le nouvel index respectent les critères de filtrage spécifiés
|
|
||||||
- quorum de presence en fraction-(min/max) du nombre de génomes, en nombre-(min/max) de génomes, si mode count la présence peut être défini par un seuil personnalisé minimum et maximum
|
|
||||||
|
|
||||||
- aggregate : aggrege toutes les colonnes d'une matrice d'index en une seule colonne.
|
- aggregate : aggrege toutes les colonnes d'une matrice d'index en une seule colonne.
|
||||||
|
|
||||||
- query : scan un fichier de sequences et retourne pour chaque sequence quels kmer sont présents dans l'index et dans quel genomes
|
- query : scan un fichier de sequences et retourne pour chaque sequence quels kmer sont présents dans l'index et dans quel genomes
|
||||||
|
--detail et --mismatch à implementer
|
||||||
|
|
||||||
- distance : calcule la matrice de distance entre les genomes
|
- status : affiche le statut de l'index
|
||||||
- proposer une option pour chaque distance à calculer
|
|
||||||
- un possibité de récuperer la matrice des kmer communs
|
|
||||||
- un possibité de calculer l'arbre nj
|
|
||||||
- les matrices sont sauvegardées en CSV
|
|
||||||
- les arbres NJ sont sauvegardés en Newick avec les longeurs de branche
|
|
||||||
|
|
||||||
- dump : une table csv de l'index avec les kmer et les genomes associés en mode count ou presence/absence avec une option pour forcer le mode presence/absence meme si l'index est en mode count. Par defaut, le mode count est utilisé pour les index en mode count et le mode presence/absence pour les index en mode presence/absence.
|
|
||||||
|
|||||||
@@ -327,3 +327,30 @@ Other derivations: threshold a count matrix → binary presence matrix; union tw
|
|||||||
2. Replace `LayerData` trait with the `DataStore` / `ColumnWeights` / `CountPartials` / `BitPartials` system.
|
2. Replace `LayerData` trait with the `DataStore` / `ColumnWeights` / `CountPartials` / `BitPartials` system.
|
||||||
3. Rewire `LayeredMap` to hold `LayeredStore<PersistentCompactIntMatrix>` (or bit variant) alongside the MPHF layers.
|
3. Rewire `LayeredMap` to hold `LayeredStore<PersistentCompactIntMatrix>` (or bit variant) alongside the MPHF layers.
|
||||||
4. Implement `PartitionedIndex` using `LayeredStore<LayeredStore<S>>` for data and parallel dispatch for queries.
|
4. Implement `PartitionedIndex` using `LayeredStore<LayeredStore<S>>` for data and parallel dispatch for queries.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Multi-genome column invariant
|
||||||
|
|
||||||
|
### The invariant
|
||||||
|
|
||||||
|
After any merge operation, **every layer in every partition has exactly `n_genomes` columns**, where `n_genomes` is the current total genome count recorded in `index.meta`. This applies to both `PersistentCompactIntMatrix` (Count mode) and `PersistentBitMatrix` (Presence mode).
|
||||||
|
|
||||||
|
### How it is maintained
|
||||||
|
|
||||||
|
The invariant is established and preserved by three coordinated operations:
|
||||||
|
|
||||||
|
**1. Existing layers — column append.**
|
||||||
|
When merging source genome G into an existing index with `n_existing_genomes` genomes, one column is appended to every existing layer via `append_genome_column`. Slots that contain a kmer present in source G receive its count or `true`; all other slots receive 0 or `false`. After this step, every pre-existing layer has `n_existing_genomes + 1` columns.
|
||||||
|
|
||||||
|
**2. New layers — absent columns prepended.**
|
||||||
|
If source G introduces kmers not found in any existing layer, a new layer is created for those kmers. Before appending source G's own column, `n_existing_genomes` absent columns (all-zero or all-false) are prepended — one per genome already in the index. This ensures the new layer starts at the same column count as every other layer in the partition immediately after creation.
|
||||||
|
|
||||||
|
**3. First merge, Presence mode — `init_presence_matrix`.**
|
||||||
|
The initial single-genome index has no `presence/` directory (presence is implicit: every kmer in the index is present in genome 0). On the first merge, before appending any column for source 1, `Layer<()>::init_presence_matrix` creates `presence/col_000000.pbiv` set entirely to `true` for each existing layer. This retroactively materialises genome 0's presence column, bringing the column count from 0 to 1 so that the subsequent append for source 1 raises it to 2.
|
||||||
|
|
||||||
|
### Why the invariant is required
|
||||||
|
|
||||||
|
The `LayeredStore` aggregation traits (`col_weights`, `partial_bray`, `partial_jaccard`, etc.) sum contributions across all `(partition, layer)` pairs without any shape check. If one layer had fewer columns than others, its contribution would silently produce a malformed result — wrong column sums, wrong distance matrices, and incorrect genome-level statistics.
|
||||||
|
|
||||||
|
The invariant is the precondition that makes the progressive aggregation principle correct: each level can blindly sum matrices from the level below because all matrices have the same shape.
|
||||||
|
|||||||
@@ -0,0 +1,111 @@
|
|||||||
|
# Query system
|
||||||
|
|
||||||
|
## Goal
|
||||||
|
|
||||||
|
Given a set of query sequences, determine for each sequence how many of its k-mers are found in the index and, for each indexed genome, how many k-mers match. The query system is the foundation for read classification and sequence-to-genome mapping.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Input
|
||||||
|
|
||||||
|
- Query sequences in FASTA or FASTQ format (gzip supported, streaming stdin supported).
|
||||||
|
- Sequences shorter than k bases are silently skipped.
|
||||||
|
- Non-ACGT characters are handled by the superkmer decomposition layer: they act as hard breaks, producing shorter superkmers (identical to the behaviour at indexing time).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Algorithm
|
||||||
|
|
||||||
|
The query follows the same superkmer-based partitioning strategy used at indexing time.
|
||||||
|
|
||||||
|
```
|
||||||
|
for each query sequence:
|
||||||
|
decompose into superkmers (non-ACGT breaks, same minimiser scheme as indexing)
|
||||||
|
for each superkmer:
|
||||||
|
route to partition p via minimiser hash
|
||||||
|
for each kmer in the superkmer:
|
||||||
|
lookup kmer in partition p (MPHF → evidence check → matrix row)
|
||||||
|
accumulate result into per-sequence accumulators
|
||||||
|
emit annotated sequence
|
||||||
|
```
|
||||||
|
|
||||||
|
Parallelism is **per sequence**: each worker thread handles all partitions of one sequence independently. This avoids cross-thread coordination when merging partial results and keeps memory usage proportional to the number of concurrent sequences rather than to the number of partitions.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Exact vs. approximate matching
|
||||||
|
|
||||||
|
### Exact (default)
|
||||||
|
|
||||||
|
Standard MPHF lookup followed by evidence check. O(1) per k-mer.
|
||||||
|
|
||||||
|
### 1-mismatch (`--mismatch` flag)
|
||||||
|
|
||||||
|
For each k-mer of the query, generate all `3·k` single-substitution variants. Each variant is canonicalised and looked up independently in the index. If one or more variants are found, their per-genome rows are **summed** into the result for that k-mer position.
|
||||||
|
|
||||||
|
- If a k-mer matches exactly AND one of its variants also matches (distinct k-mers in the index), both contributions are accumulated.
|
||||||
|
- Exact and approximate matches are tracked separately in the output (see annotation schema below).
|
||||||
|
- The superkmer routing optimisation is **not** used in 1-mismatch mode: each variant is looked up directly via its own minimiser.
|
||||||
|
- Cost: up to `3·k` MPHF probes per k-mer position vs. 1 in exact mode.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Output format
|
||||||
|
|
||||||
|
Output sequences are written in **OBITools4 format**: the original sequence with a JSON annotation map in the title line.
|
||||||
|
|
||||||
|
```
|
||||||
|
>read_id {"kmer_total":150,"kmer_found":59,...}
|
||||||
|
ATCGATCG...
|
||||||
|
```
|
||||||
|
|
||||||
|
Genome order in all list-valued annotations follows the genome order recorded in `index.meta`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Annotation schema
|
||||||
|
|
||||||
|
### Summary mode (default)
|
||||||
|
|
||||||
|
| Key | Type | Condition | Semantics |
|
||||||
|
|---|---|---|---|
|
||||||
|
| `kmer_total` | int | always | total k-mers in the (masked) sequence |
|
||||||
|
| `kmer_found` | int | always | k-mers with at least one match (exact or approx) |
|
||||||
|
| `kmer_missing` | int | `--count-missing` | k-mers absent from the index |
|
||||||
|
| `kmer_match` | list[int] | always | per-genome matched k-mer count (exact + approx) |
|
||||||
|
| `kmer_match_exact` | list[int] | `--mismatch` | per-genome exact match count |
|
||||||
|
| `kmer_match_approx` | list[int] | `--mismatch` | per-genome approx match count |
|
||||||
|
| `count_match` | list[int] | count index | per-genome sum of index counts for matched k-mers |
|
||||||
|
|
||||||
|
`kmer_match[i]` is the number of k-mer positions in the query that contribute at least one match to genome i. In 1-mismatch mode, a single k-mer position can contribute to multiple genomes if several of its variants are present in the index.
|
||||||
|
|
||||||
|
`count_match[i]` sums raw index counts across all matched k-mer positions for genome i. Only meaningful for count indexes.
|
||||||
|
|
||||||
|
### Detail mode (`--detail`)
|
||||||
|
|
||||||
|
All summary keys, plus per-position coverage vectors — one list per genome, length `len(sequence) − k + 1`:
|
||||||
|
|
||||||
|
| Key | Type | Condition | Semantics |
|
||||||
|
|---|---|---|---|
|
||||||
|
| `cov_<i>` | list[int] | `--detail` | coverage at each k-mer position for genome i; raw count (count index) or 0/1 (presence index); 0 if absent |
|
||||||
|
| `cov_exact_<i>` | list[int] | `--detail` + `--mismatch` | exact-match contribution per position |
|
||||||
|
| `cov_approx_<i>` | list[int] | `--detail` + `--mismatch` | approx-match contribution per position |
|
||||||
|
|
||||||
|
Genome indices in key names are 0-based integers matching the `index.meta` genome order. Genome labels are not used as key names to avoid issues with special characters in long or complex genome identifiers.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## CLI
|
||||||
|
|
||||||
|
```
|
||||||
|
obikmer query -i <index> [--summary | --detail] [--mismatch] [--count-missing] <query.fa>
|
||||||
|
```
|
||||||
|
|
||||||
|
`--summary` is the default; `--detail` implies `--summary` (all summary keys are always present).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Future work
|
||||||
|
|
||||||
|
- **Read classification** (`--classify`): assign each read to the genome with the highest `kmer_match` score; emit as a single annotation key.
|
||||||
|
- **Whitelist / blacklist filtering**: accept or reject sequences based on whether their k-mer match score for a designated set of genomes exceeds a threshold.
|
||||||
@@ -0,0 +1,133 @@
|
|||||||
|
# Merge command
|
||||||
|
|
||||||
|
## Purpose
|
||||||
|
|
||||||
|
`obikmer merge` combines multiple existing kmer indexes into a single index. The result contains all kmers from all sources, with per-genome presence/absence or count data for every genome across every layer.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Modes
|
||||||
|
|
||||||
|
```rust
|
||||||
|
pub enum MergeMode { Presence, Count }
|
||||||
|
```
|
||||||
|
|
||||||
|
Default mode is `Presence`. `Count` mode requires **all** source indexes to have `with_counts=true`; mixing count and non-count sources is rejected at validation.
|
||||||
|
|
||||||
|
| Mode | Column type | Constraint |
|
||||||
|
|---|---|---|
|
||||||
|
| `Presence` | `PersistentBitMatrix` (one bit per genome per slot) | none |
|
||||||
|
| `Count` | `PersistentCompactIntMatrix` (one u32 per genome per slot) | all sources `with_counts=true` |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Input / output constraints
|
||||||
|
|
||||||
|
All source indexes must satisfy:
|
||||||
|
|
||||||
|
- `IndexState::Indexed` (fully built — `index.done` sentinel present)
|
||||||
|
- Same `kmer_size`, `minimizer_size`, `n_bits`
|
||||||
|
- If `Count` mode: all sources must have `with_counts=true`
|
||||||
|
|
||||||
|
`--force`: if the output directory already exists, it is deleted before the merge begins.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Algorithm
|
||||||
|
|
||||||
|
### 1. Validation
|
||||||
|
|
||||||
|
Check all sources against the constraints above. Abort on any mismatch.
|
||||||
|
|
||||||
|
### 2. Bootstrap output from first source
|
||||||
|
|
||||||
|
Recursive file copy of `sources[0]` → `output`. This establishes the partition layout, all existing MPHFs, unitigs, and evidence files. The first source's genome is genome 0 in the output.
|
||||||
|
|
||||||
|
### 3. For each subsequent source (parallel across partitions)
|
||||||
|
|
||||||
|
For each partition, process one source at a time sequentially:
|
||||||
|
|
||||||
|
**a. Classify kmers**
|
||||||
|
|
||||||
|
Iterate all kmers in the source partition (via `UnitigFileReader` + canonical kmer iteration). For each kmer, probe the destination `LayeredMap<()>`:
|
||||||
|
|
||||||
|
- **Hit** `(dst_layer, slot)`: record `(dst_layer, slot, value)` where `value` is the source count (Count mode) or `1` (Presence mode).
|
||||||
|
- **Miss**: add kmer to a `GraphDeBruijn` accumulator; record its value in a `HashMap<CanonicalKmer, Vec<u32>>`.
|
||||||
|
|
||||||
|
**b. Extend existing layers**
|
||||||
|
|
||||||
|
For each existing layer in the destination partition, call `append_genome_column` once per source genome being merged. Slots that received a hit are populated with their recorded value; all other slots receive 0 (absent).
|
||||||
|
|
||||||
|
If this is the **first merge** and operating in Presence mode, call `Layer<()>::init_presence_matrix` before appending any source column. This creates `presence/` with `col_000000.pbiv` set all-true (genome 0 is present in every slot of this layer).
|
||||||
|
|
||||||
|
**c. Build new layer for new kmers**
|
||||||
|
|
||||||
|
If the `GraphDeBruijn` accumulator is non-empty (misses exist), write compact unitigs from the de Bruijn graph, then call the appropriate `Layer::build` variant. Before appending the source's own column, prepend `n_existing_genomes` absent columns (value 0) — one per genome already in the index — so the column count invariant holds immediately after layer creation.
|
||||||
|
|
||||||
|
**d. Update partition metadata**
|
||||||
|
|
||||||
|
Write updated `meta.json` with the incremented `n_layers` if a new layer was added.
|
||||||
|
|
||||||
|
### 4. Update index metadata
|
||||||
|
|
||||||
|
Append the merged source's genome label(s) to `index.meta.genomes`. Write updated `index.meta`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Column count invariant
|
||||||
|
|
||||||
|
After any merge, **every layer in every partition has exactly `n_genomes` columns**, where `n_genomes` is the total genome count in the index at that point.
|
||||||
|
|
||||||
|
This is maintained by three mechanisms:
|
||||||
|
|
||||||
|
1. **Existing layers**: one column appended per source genome (`append_genome_column`).
|
||||||
|
2. **New layers created during merge**: `n_existing_genomes` absent columns prepended before the source's own column.
|
||||||
|
3. **First merge, Presence mode**: `init_presence_matrix` retroactively creates `presence/col_0` all-true for genome 0 before any source column is appended.
|
||||||
|
|
||||||
|
The invariant is a precondition of the `LayeredStore` aggregation traits: `col_weights()` and all partial distance methods assume every inner store has the same column count.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## New layer construction
|
||||||
|
|
||||||
|
All kmers absent from the destination index — across **all** sources being merged — accumulate into a **single** `GraphDeBruijn` per partition. One new layer is built from this graph, not one layer per source. This keeps the layer count bounded and preserves compact unitig representation.
|
||||||
|
|
||||||
|
De Bruijn reconstruction ensures that overlapping k-1 suffixes/prefixes from different source kmers are merged into minimal unitigs before MPHF construction.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## On-disk impact
|
||||||
|
|
||||||
|
After merging `G` genomes (1 existing + G-1 new sources):
|
||||||
|
|
||||||
|
```
|
||||||
|
partitions/
|
||||||
|
part_00000/
|
||||||
|
index/
|
||||||
|
meta.json ← n_layers updated if new layer added
|
||||||
|
layer_0/
|
||||||
|
mphf.bin ← unchanged (MPHF immutable)
|
||||||
|
unitigs.bin ← unchanged
|
||||||
|
evidence.bin ← unchanged
|
||||||
|
presence/ ← created on first merge (Presence mode)
|
||||||
|
meta.json {"n": N, "n_cols": G}
|
||||||
|
col_000000.pbiv ← all-true (genome 0)
|
||||||
|
col_000001.pbiv ← source 1 presence
|
||||||
|
...
|
||||||
|
counts/ ← extended (Count mode)
|
||||||
|
meta.json {"n": N, "n_cols": G}
|
||||||
|
col_000000.pciv ← genome 0 counts (from original build)
|
||||||
|
col_000001.pciv ← source 1 counts
|
||||||
|
...
|
||||||
|
layer_1/ ← new layer (if new kmers found)
|
||||||
|
mphf.bin
|
||||||
|
unitigs.bin
|
||||||
|
evidence.bin
|
||||||
|
presence/ or counts/
|
||||||
|
meta.json {"n": N1, "n_cols": G}
|
||||||
|
col_000000.pbiv ← all-false (genome 0 absent from this layer)
|
||||||
|
...
|
||||||
|
col_000001.pbiv ← source 1 presence in this new layer
|
||||||
|
```
|
||||||
|
|
||||||
|
The `index.meta` genome list grows from 1 entry to G entries after all sources are merged.
|
||||||
@@ -255,3 +255,57 @@ Each partition's new layer is built independently; the operation is fully parall
|
|||||||
| `obicompactvec` | payload types + aggregation traits |
|
| `obicompactvec` | payload types + aggregation traits |
|
||||||
| `rayon 1` | parallel MPHF construction pass |
|
| `rayon 1` | parallel MPHF construction pass |
|
||||||
| `ndarray 0.16` | aggregation output arrays |
|
| `ndarray 0.16` | aggregation output arrays |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Column append and merge support
|
||||||
|
|
||||||
|
These methods extend existing layers with new genome columns without touching the MPHF. They are the building blocks of the `merge` command.
|
||||||
|
|
||||||
|
### Matrix column append
|
||||||
|
|
||||||
|
```rust
|
||||||
|
impl PersistentCompactIntMatrix {
|
||||||
|
pub fn append_column(dir: &Path, value_of: impl Fn(usize) -> u32) -> OLMResult<()>
|
||||||
|
}
|
||||||
|
|
||||||
|
impl PersistentBitMatrix {
|
||||||
|
pub fn append_column(dir: &Path, value_of: impl Fn(usize) -> bool) -> OLMResult<()>
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Both methods write a new column file (`col_NNNNNN.pciv` / `col_NNNNNN.pbiv`) and update `meta.json` to increment `n_cols`. The `value_of` closure is called once per slot (indexed 0..n) to populate the column. The matrix `n` (row count) is read from the existing `meta.json` and must not change.
|
||||||
|
|
||||||
|
### Presence matrix initialisation
|
||||||
|
|
||||||
|
```rust
|
||||||
|
impl Layer<()> {
|
||||||
|
pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()>
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Called on the first merge of a Presence-mode index. Creates the `presence/` subdirectory with `meta.json {"n": n_kmers, "n_cols": 1}` and `col_000000.pbiv` set entirely to `true`. This retroactively records that genome 0 (the original source) is present in every slot of this layer, satisfying the column count invariant before any new-source column is appended.
|
||||||
|
|
||||||
|
### Layer-level genome column append
|
||||||
|
|
||||||
|
```rust
|
||||||
|
impl Layer<PersistentBitMatrix> {
|
||||||
|
pub fn append_genome_column(
|
||||||
|
layer_dir: &Path,
|
||||||
|
value_of: impl Fn(usize) -> bool,
|
||||||
|
) -> OLMResult<()>
|
||||||
|
}
|
||||||
|
|
||||||
|
impl Layer<PersistentCompactIntMatrix> {
|
||||||
|
pub fn append_genome_column(
|
||||||
|
layer_dir: &Path,
|
||||||
|
value_of: impl Fn(usize) -> u32,
|
||||||
|
) -> OLMResult<()>
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
These delegate directly to the corresponding `PersistentBitMatrix::append_column` / `PersistentCompactIntMatrix::append_column`. They are typed at the `Layer` level to make call sites mode-aware without exposing the inner matrix path construction.
|
||||||
|
|
||||||
|
### Why the MPHF is never rebuilt
|
||||||
|
|
||||||
|
The MPHF (`mphf.bin`, `evidence.bin`, `unitigs.bin`) is built once from the kmer set of a layer and is immutable for the lifetime of that layer. Adding a genome column does not change the kmer set — it only adds a new data column indexed by the same slot numbers. Rebuilding the MPHF would require re-running the full construction pipeline (two sequential passes over unitigs, parallel ptr_hash construction) and would invalidate any open memory maps. Column append avoids all of this: the only disk writes are one new `.pciv`/`.pbiv` file and a single `meta.json` update. Kmers absent from a given layer are represented as zero (count) or false (presence) values in the new column — no structural change to the layer is required.
|
||||||
|
|||||||
@@ -47,6 +47,7 @@ nav:
|
|||||||
- obilayeredmap crate: implementation/obilayeredmap.md
|
- obilayeredmap crate: implementation/obilayeredmap.md
|
||||||
- PersistentCompactIntVec: implementation/persistent_compact_int_vec.md
|
- PersistentCompactIntVec: implementation/persistent_compact_int_vec.md
|
||||||
- PersistentBitVec: implementation/persistent_bit_vec.md
|
- PersistentBitVec: implementation/persistent_bit_vec.md
|
||||||
|
- Merge command: implementation/merge.md
|
||||||
- Architecture:
|
- Architecture:
|
||||||
- Sequences: architecture/sequences/invariant.md
|
- Sequences: architecture/sequences/invariant.md
|
||||||
- Kmer index: architecture/index_architecture.md
|
- Kmer index: architecture/index_architecture.md
|
||||||
|
|||||||
Generated
+96
-10
@@ -176,6 +176,21 @@ dependencies = [
|
|||||||
"thiserror 1.0.69",
|
"thiserror 1.0.69",
|
||||||
]
|
]
|
||||||
|
|
||||||
|
[[package]]
|
||||||
|
name = "bit-set"
|
||||||
|
version = "0.5.3"
|
||||||
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||||
|
checksum = "0700ddab506f33b20a03b13996eccd309a48e5ff77d0d95926aa0210fb4e95f1"
|
||||||
|
dependencies = [
|
||||||
|
"bit-vec",
|
||||||
|
]
|
||||||
|
|
||||||
|
[[package]]
|
||||||
|
name = "bit-vec"
|
||||||
|
version = "0.6.3"
|
||||||
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||||
|
checksum = "349f9b6a179ed607305526ca489b34ad0a41aed5f7980fa90eb03160b69598fb"
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
name = "bitflags"
|
name = "bitflags"
|
||||||
version = "1.3.2"
|
version = "1.3.2"
|
||||||
@@ -625,6 +640,12 @@ dependencies = [
|
|||||||
"syn",
|
"syn",
|
||||||
]
|
]
|
||||||
|
|
||||||
|
[[package]]
|
||||||
|
name = "dtoa"
|
||||||
|
version = "1.0.11"
|
||||||
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||||
|
checksum = "4c3cf4824e2d5f025c7b531afcb2325364084a16806f6d47fbc1f5fbd9960590"
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
name = "either"
|
name = "either"
|
||||||
version = "1.15.0"
|
version = "1.15.0"
|
||||||
@@ -1102,6 +1123,15 @@ dependencies = [
|
|||||||
"wasm-bindgen",
|
"wasm-bindgen",
|
||||||
]
|
]
|
||||||
|
|
||||||
|
[[package]]
|
||||||
|
name = "kodama"
|
||||||
|
version = "0.2.3"
|
||||||
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||||
|
checksum = "7a44f3a71a44fbf49ce38152db7dc9adf959d4fe5c29344cd1858bdbda8d9091"
|
||||||
|
dependencies = [
|
||||||
|
"num-traits",
|
||||||
|
]
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
name = "lazy_static"
|
name = "lazy_static"
|
||||||
version = "1.5.0"
|
version = "1.5.0"
|
||||||
@@ -1452,17 +1482,13 @@ dependencies = [
|
|||||||
name = "obikindex"
|
name = "obikindex"
|
||||||
version = "0.1.0"
|
version = "0.1.0"
|
||||||
dependencies = [
|
dependencies = [
|
||||||
"cacheline-ef",
|
|
||||||
"epserde",
|
|
||||||
"indicatif",
|
"indicatif",
|
||||||
|
"ndarray",
|
||||||
"obicompactvec",
|
"obicompactvec",
|
||||||
"obidebruinj",
|
|
||||||
"obikpartitionner",
|
"obikpartitionner",
|
||||||
"obikseq",
|
|
||||||
"obilayeredmap",
|
"obilayeredmap",
|
||||||
"obiskio",
|
"obiskio",
|
||||||
"obisys",
|
"obisys",
|
||||||
"ptr_hash",
|
|
||||||
"rayon",
|
"rayon",
|
||||||
"serde",
|
"serde",
|
||||||
"serde_json",
|
"serde_json",
|
||||||
@@ -1475,6 +1501,7 @@ version = "0.1.0"
|
|||||||
dependencies = [
|
dependencies = [
|
||||||
"clap",
|
"clap",
|
||||||
"indicatif",
|
"indicatif",
|
||||||
|
"kodama",
|
||||||
"obifastwrite",
|
"obifastwrite",
|
||||||
"obikindex",
|
"obikindex",
|
||||||
"obikpartitionner",
|
"obikpartitionner",
|
||||||
@@ -1487,6 +1514,8 @@ dependencies = [
|
|||||||
"obisys",
|
"obisys",
|
||||||
"pprof",
|
"pprof",
|
||||||
"rayon",
|
"rayon",
|
||||||
|
"serde_json",
|
||||||
|
"speedytree",
|
||||||
"tracing",
|
"tracing",
|
||||||
"tracing-subscriber",
|
"tracing-subscriber",
|
||||||
]
|
]
|
||||||
@@ -1501,8 +1530,10 @@ dependencies = [
|
|||||||
"memmap2",
|
"memmap2",
|
||||||
"niffler 3.0.0",
|
"niffler 3.0.0",
|
||||||
"obicompactvec",
|
"obicompactvec",
|
||||||
|
"obidebruinj",
|
||||||
"obikrope",
|
"obikrope",
|
||||||
"obikseq",
|
"obikseq",
|
||||||
|
"obilayeredmap",
|
||||||
"obiread",
|
"obiread",
|
||||||
"obiskbuilder",
|
"obiskbuilder",
|
||||||
"obiskio",
|
"obiskio",
|
||||||
@@ -1891,8 +1922,8 @@ dependencies = [
|
|||||||
"lazy_static",
|
"lazy_static",
|
||||||
"log",
|
"log",
|
||||||
"mem_dbg",
|
"mem_dbg",
|
||||||
"rand",
|
"rand 0.9.4",
|
||||||
"rand_chacha",
|
"rand_chacha 0.9.0",
|
||||||
"rayon",
|
"rayon",
|
||||||
"rdst",
|
"rdst",
|
||||||
"rustc-hash",
|
"rustc-hash",
|
||||||
@@ -1923,14 +1954,35 @@ version = "0.7.0"
|
|||||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||||
checksum = "dc33ff2d4973d518d823d61aa239014831e521c75da58e3df4840d3f47749d09"
|
checksum = "dc33ff2d4973d518d823d61aa239014831e521c75da58e3df4840d3f47749d09"
|
||||||
|
|
||||||
|
[[package]]
|
||||||
|
name = "rand"
|
||||||
|
version = "0.8.6"
|
||||||
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||||
|
checksum = "5ca0ecfa931c29007047d1bc58e623ab12e5590e8c7cc53200d5202b69266d8a"
|
||||||
|
dependencies = [
|
||||||
|
"libc",
|
||||||
|
"rand_chacha 0.3.1",
|
||||||
|
"rand_core 0.6.4",
|
||||||
|
]
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
name = "rand"
|
name = "rand"
|
||||||
version = "0.9.4"
|
version = "0.9.4"
|
||||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||||
checksum = "44c5af06bb1b7d3216d91932aed5265164bf384dc89cd6ba05cf59a35f5f76ea"
|
checksum = "44c5af06bb1b7d3216d91932aed5265164bf384dc89cd6ba05cf59a35f5f76ea"
|
||||||
dependencies = [
|
dependencies = [
|
||||||
"rand_chacha",
|
"rand_chacha 0.9.0",
|
||||||
"rand_core",
|
"rand_core 0.9.5",
|
||||||
|
]
|
||||||
|
|
||||||
|
[[package]]
|
||||||
|
name = "rand_chacha"
|
||||||
|
version = "0.3.1"
|
||||||
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||||
|
checksum = "e6c10a63a0fa32252be49d21e7709d4d4baf8d231c2dbce1eaa8141b9b127d88"
|
||||||
|
dependencies = [
|
||||||
|
"ppv-lite86",
|
||||||
|
"rand_core 0.6.4",
|
||||||
]
|
]
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
@@ -1940,7 +1992,16 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
|
|||||||
checksum = "d3022b5f1df60f26e1ffddd6c66e8aa15de382ae63b3a0c1bfc0e4d3e3f325cb"
|
checksum = "d3022b5f1df60f26e1ffddd6c66e8aa15de382ae63b3a0c1bfc0e4d3e3f325cb"
|
||||||
dependencies = [
|
dependencies = [
|
||||||
"ppv-lite86",
|
"ppv-lite86",
|
||||||
"rand_core",
|
"rand_core 0.9.5",
|
||||||
|
]
|
||||||
|
|
||||||
|
[[package]]
|
||||||
|
name = "rand_core"
|
||||||
|
version = "0.6.4"
|
||||||
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||||
|
checksum = "ec0be4795e2f6a28069bec0b5ff3e2ac9bafc99e6a9a7dc3547996c5c816922c"
|
||||||
|
dependencies = [
|
||||||
|
"getrandom 0.2.17",
|
||||||
]
|
]
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
@@ -1978,6 +2039,12 @@ dependencies = [
|
|||||||
"crossbeam-utils",
|
"crossbeam-utils",
|
||||||
]
|
]
|
||||||
|
|
||||||
|
[[package]]
|
||||||
|
name = "rb_tree"
|
||||||
|
version = "0.5.0"
|
||||||
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||||
|
checksum = "e6724c78e033e1c4155b4d3f76593d09ad384739c6986c1addc2bd2f55b1aefe"
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
name = "rdst"
|
name = "rdst"
|
||||||
version = "0.20.14"
|
version = "0.20.14"
|
||||||
@@ -2232,6 +2299,25 @@ version = "1.15.1"
|
|||||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||||
checksum = "67b1b7a3b5fe4f1376887184045fcf45c69e92af734b7aaddc05fb777b6fbd03"
|
checksum = "67b1b7a3b5fe4f1376887184045fcf45c69e92af734b7aaddc05fb777b6fbd03"
|
||||||
|
|
||||||
|
[[package]]
|
||||||
|
name = "speedytree"
|
||||||
|
version = "0.1.0"
|
||||||
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||||
|
checksum = "7f4522052445ce1b002c095d93e9dc545c326e9e2321c1315f1ab2a381c11666"
|
||||||
|
dependencies = [
|
||||||
|
"bit-set",
|
||||||
|
"bit-vec",
|
||||||
|
"bitvec",
|
||||||
|
"clap",
|
||||||
|
"dtoa",
|
||||||
|
"fixedbitset",
|
||||||
|
"parking_lot",
|
||||||
|
"petgraph",
|
||||||
|
"rand 0.8.6",
|
||||||
|
"rayon",
|
||||||
|
"rb_tree",
|
||||||
|
]
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
name = "stable_deref_trait"
|
name = "stable_deref_trait"
|
||||||
version = "1.2.1"
|
version = "1.2.1"
|
||||||
|
|||||||
@@ -101,6 +101,26 @@ impl PersistentBitMatrix {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// ── Column append ─────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
impl PersistentBitMatrix {
|
||||||
|
/// Append a new column to an existing matrix on disk.
|
||||||
|
///
|
||||||
|
/// Reads `meta.json` to obtain `n` and the current column count, writes
|
||||||
|
/// `col_{n_cols:06}.pbiv` filled by `value_of(slot)`, then increments
|
||||||
|
/// `n_cols` in `meta.json`.
|
||||||
|
pub fn append_column(dir: &Path, value_of: impl Fn(usize) -> bool) -> io::Result<()> {
|
||||||
|
let mut meta = MatrixMeta::load(dir)?;
|
||||||
|
let mut b = PersistentBitVecBuilder::new(meta.n, &col_path(dir, meta.n_cols))?;
|
||||||
|
for slot in 0..meta.n {
|
||||||
|
b.set(slot, value_of(slot));
|
||||||
|
}
|
||||||
|
b.close()?;
|
||||||
|
meta.n_cols += 1;
|
||||||
|
meta.save(dir)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
fn upper_pairs(n: usize) -> Vec<(usize, usize)> {
|
fn upper_pairs(n: usize) -> Vec<(usize, usize)> {
|
||||||
(0..n).flat_map(|i| (i + 1..n).map(move |j| (i, j))).collect()
|
(0..n).flat_map(|i| (i + 1..n).map(move |j| (i, j))).collect()
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -203,6 +203,26 @@ where
|
|||||||
m
|
m
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// ── Column append ─────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
impl PersistentCompactIntMatrix {
|
||||||
|
/// Append a new column to an existing matrix on disk.
|
||||||
|
///
|
||||||
|
/// Reads `meta.json` to obtain `n` and the current column count, writes
|
||||||
|
/// `col_{n_cols:06}.pciv` filled by `value_of(slot)`, then increments
|
||||||
|
/// `n_cols` in `meta.json`.
|
||||||
|
pub fn append_column(dir: &Path, value_of: impl Fn(usize) -> u32) -> io::Result<()> {
|
||||||
|
let mut meta = MatrixMeta::load(dir)?;
|
||||||
|
let mut b = PersistentCompactIntVecBuilder::new(meta.n, &col_path(dir, meta.n_cols))?;
|
||||||
|
for slot in 0..meta.n {
|
||||||
|
b.set(slot, value_of(slot));
|
||||||
|
}
|
||||||
|
b.close()?;
|
||||||
|
meta.n_cols += 1;
|
||||||
|
meta.save(dir)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
// ── Trait impls ───────────────────────────────────────────────────────────────
|
// ── Trait impls ───────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
use crate::traits::{ColumnWeights, CountPartials};
|
use crate::traits::{ColumnWeights, CountPartials};
|
||||||
|
|||||||
@@ -80,6 +80,13 @@ pub trait CountPartials: ColumnWeights {
|
|||||||
let sq2 = std::f64::consts::SQRT_2;
|
let sq2 = std::f64::consts::SQRT_2;
|
||||||
self.partial_hellinger(&global).mapv(|v| v.sqrt() / sq2)
|
self.partial_hellinger(&global).mapv(|v| v.sqrt() / sq2)
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// Euclidean distance in the Hellinger (√relative-frequency) space.
|
||||||
|
/// Equal to √2 × hellinger_dist — unnormalised variant.
|
||||||
|
fn hellinger_euclidean_dist_matrix(&self) -> Array2<f64> {
|
||||||
|
let global = self.col_weights();
|
||||||
|
self.partial_hellinger(&global).mapv(|v| v.sqrt())
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Partial distance matrices for bit-based data (`PersistentBitMatrix`).
|
/// Partial distance matrices for bit-based data (`PersistentBitMatrix`).
|
||||||
|
|||||||
@@ -5,15 +5,11 @@ edition = "2024"
|
|||||||
|
|
||||||
[dependencies]
|
[dependencies]
|
||||||
obikpartitionner = { path = "../obikpartitionner" }
|
obikpartitionner = { path = "../obikpartitionner" }
|
||||||
obikseq = { path = "../obikseq" }
|
|
||||||
obisys = { path = "../obisys" }
|
|
||||||
obiskio = { path = "../obiskio" }
|
obiskio = { path = "../obiskio" }
|
||||||
obidebruinj = { path = "../obidebruinj" }
|
obisys = { path = "../obisys" }
|
||||||
obilayeredmap = { path = "../obilayeredmap" }
|
|
||||||
obicompactvec = { path = "../obicompactvec" }
|
obicompactvec = { path = "../obicompactvec" }
|
||||||
cacheline-ef = "1.1"
|
obilayeredmap = { path = "../obilayeredmap" }
|
||||||
epserde = "0.8"
|
ndarray = "0.16"
|
||||||
ptr_hash = "1.1"
|
|
||||||
rayon = "1"
|
rayon = "1"
|
||||||
serde = { version = "1", features = ["derive"] }
|
serde = { version = "1", features = ["derive"] }
|
||||||
serde_json = "1"
|
serde_json = "1"
|
||||||
|
|||||||
@@ -0,0 +1,132 @@
|
|||||||
|
use ndarray::Array2;
|
||||||
|
use obicompactvec::traits::{BitPartials, CountPartials};
|
||||||
|
use obilayeredmap::LayeredStore;
|
||||||
|
use rayon::prelude::*;
|
||||||
|
|
||||||
|
use crate::error::{OKIError, OKIResult};
|
||||||
|
use crate::index::KmerIndex;
|
||||||
|
|
||||||
|
// ── Public API ────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
|
||||||
|
pub enum DistanceMetric {
|
||||||
|
/// Jaccard distance on presence/absence data.
|
||||||
|
Jaccard,
|
||||||
|
/// Hamming distance (number of differing kmer positions) on presence/absence data.
|
||||||
|
Hamming,
|
||||||
|
/// Bray-Curtis dissimilarity on raw counts.
|
||||||
|
BrayCurtis,
|
||||||
|
/// Bray-Curtis dissimilarity normalised by per-genome total counts.
|
||||||
|
RelfreqBrayCurtis,
|
||||||
|
/// Euclidean distance on raw counts.
|
||||||
|
Euclidean,
|
||||||
|
/// Euclidean distance on relative frequencies.
|
||||||
|
RelfreqEuclidean,
|
||||||
|
/// Hellinger distance on counts.
|
||||||
|
Hellinger,
|
||||||
|
/// Euclidean distance in the Hellinger (√relative-frequency) space (unnormalised variant).
|
||||||
|
HellingerEuclidean,
|
||||||
|
}
|
||||||
|
|
||||||
|
pub struct DistanceOutput {
|
||||||
|
/// n×n pairwise distance matrix (genomes in index order).
|
||||||
|
pub matrix: Array2<f64>,
|
||||||
|
/// n×n shared-kmer count matrix (intersection), if requested.
|
||||||
|
pub shared_kmers: Option<Array2<u64>>,
|
||||||
|
}
|
||||||
|
|
||||||
|
impl DistanceMetric {
|
||||||
|
pub fn requires_counts(self) -> bool {
|
||||||
|
matches!(
|
||||||
|
self,
|
||||||
|
DistanceMetric::BrayCurtis
|
||||||
|
| DistanceMetric::RelfreqBrayCurtis
|
||||||
|
| DistanceMetric::Euclidean
|
||||||
|
| DistanceMetric::RelfreqEuclidean
|
||||||
|
| DistanceMetric::Hellinger
|
||||||
|
| DistanceMetric::HellingerEuclidean
|
||||||
|
)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── KmerIndex::distance ───────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
impl KmerIndex {
|
||||||
|
pub fn distance(&self, metric: DistanceMetric, shared_kmers: bool) -> OKIResult<DistanceOutput> {
|
||||||
|
let n_genomes = self.meta.genomes.len();
|
||||||
|
if n_genomes < 2 {
|
||||||
|
return Err(OKIError::InvalidInput(
|
||||||
|
"distance requires at least 2 genomes in the index".into(),
|
||||||
|
));
|
||||||
|
}
|
||||||
|
|
||||||
|
let use_counts = self.meta.config.with_counts;
|
||||||
|
if metric.requires_counts() && !use_counts {
|
||||||
|
return Err(OKIError::InvalidInput(format!(
|
||||||
|
"{metric:?} requires a count index (with_counts = true)"
|
||||||
|
)));
|
||||||
|
}
|
||||||
|
|
||||||
|
let n_parts = self.n_partitions();
|
||||||
|
|
||||||
|
if use_counts {
|
||||||
|
let stores: Vec<_> = (0..n_parts)
|
||||||
|
.into_par_iter()
|
||||||
|
.map(|i| self.partition.count_store(i).map_err(OKIError::Partition))
|
||||||
|
.collect::<OKIResult<_>>()?;
|
||||||
|
let global = LayeredStore::new(stores);
|
||||||
|
|
||||||
|
let matrix = match metric {
|
||||||
|
DistanceMetric::BrayCurtis => CountPartials::bray_dist_matrix(&global),
|
||||||
|
DistanceMetric::RelfreqBrayCurtis => CountPartials::relfreq_bray_dist_matrix(&global),
|
||||||
|
DistanceMetric::Euclidean => CountPartials::euclidean_dist_matrix(&global),
|
||||||
|
DistanceMetric::RelfreqEuclidean => CountPartials::relfreq_euclidean_dist_matrix(&global),
|
||||||
|
DistanceMetric::Hellinger => CountPartials::hellinger_dist_matrix(&global),
|
||||||
|
DistanceMetric::HellingerEuclidean => CountPartials::hellinger_euclidean_dist_matrix(&global),
|
||||||
|
// Jaccard on count data: threshold at 0 (present if count > 0)
|
||||||
|
DistanceMetric::Jaccard => CountPartials::threshold_jaccard_dist_matrix(&global, 0),
|
||||||
|
DistanceMetric::Hamming => {
|
||||||
|
return Err(OKIError::InvalidInput(
|
||||||
|
"Hamming is only available for presence/absence indexes".into(),
|
||||||
|
));
|
||||||
|
}
|
||||||
|
};
|
||||||
|
|
||||||
|
let shared = if shared_kmers {
|
||||||
|
let (inter, _) = CountPartials::partial_threshold_jaccard(&global, 0);
|
||||||
|
Some(inter)
|
||||||
|
} else {
|
||||||
|
None
|
||||||
|
};
|
||||||
|
|
||||||
|
Ok(DistanceOutput { matrix, shared_kmers: shared })
|
||||||
|
} else {
|
||||||
|
let stores: Vec<_> = (0..n_parts)
|
||||||
|
.into_par_iter()
|
||||||
|
.map(|i| self.partition.presence_store(i).map_err(OKIError::Partition))
|
||||||
|
.collect::<OKIResult<_>>()?;
|
||||||
|
let global = LayeredStore::new(stores);
|
||||||
|
|
||||||
|
let matrix = match metric {
|
||||||
|
DistanceMetric::Jaccard => BitPartials::jaccard_dist_matrix(&global),
|
||||||
|
DistanceMetric::Hamming => {
|
||||||
|
BitPartials::hamming_dist_matrix(&global).mapv(|v| v as f64)
|
||||||
|
}
|
||||||
|
other => {
|
||||||
|
return Err(OKIError::InvalidInput(format!(
|
||||||
|
"{other:?} requires a count index; use --metric jaccard or --metric hamming"
|
||||||
|
)));
|
||||||
|
}
|
||||||
|
};
|
||||||
|
|
||||||
|
let shared = if shared_kmers {
|
||||||
|
let (inter, _) = BitPartials::partial_jaccard(&global);
|
||||||
|
Some(inter)
|
||||||
|
} else {
|
||||||
|
None
|
||||||
|
};
|
||||||
|
|
||||||
|
Ok(DistanceOutput { matrix, shared_kmers: shared })
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -0,0 +1,67 @@
|
|||||||
|
use std::io::Write;
|
||||||
|
|
||||||
|
use crate::error::{OKIError, OKIResult};
|
||||||
|
use crate::index::KmerIndex;
|
||||||
|
|
||||||
|
impl KmerIndex {
|
||||||
|
/// Write a CSV table of all indexed kmers to `out`.
|
||||||
|
///
|
||||||
|
/// Columns: `kmer`, then one column per genome (in index order).
|
||||||
|
/// Values are counts (u32) when `use_counts = true`, otherwise 0/1.
|
||||||
|
///
|
||||||
|
/// `force_presence` overrides `with_counts`: even if the index stores counts,
|
||||||
|
/// the output uses 0/1 presence columns.
|
||||||
|
///
|
||||||
|
/// The caller must have set the global kmer length (`obikseq::set_k`) before
|
||||||
|
/// calling this method.
|
||||||
|
pub fn dump<W: Write>(&self, out: &mut W, force_presence: bool, debug: bool) -> OKIResult<()> {
|
||||||
|
|
||||||
|
let genomes = &self.meta.genomes;
|
||||||
|
let use_counts = self.meta.config.with_counts && !force_presence;
|
||||||
|
let n_genomes = genomes.len().max(1);
|
||||||
|
let kmer_size = self.kmer_size();
|
||||||
|
|
||||||
|
// ── Header ────────────────────────────────────────────────────────────
|
||||||
|
if debug {
|
||||||
|
write!(out, "partition,layer,")?;
|
||||||
|
}
|
||||||
|
write!(out, "kmer")?;
|
||||||
|
for g in genomes {
|
||||||
|
write!(out, ",{g}")?;
|
||||||
|
}
|
||||||
|
writeln!(out)?;
|
||||||
|
|
||||||
|
// ── Rows ──────────────────────────────────────────────────────────────
|
||||||
|
let n = self.n_partitions();
|
||||||
|
for i in 0..n {
|
||||||
|
if debug {
|
||||||
|
self.partition
|
||||||
|
.iter_partition_kmers_located(i, use_counts, n_genomes, |part, layer, kmer, row| {
|
||||||
|
let seq = String::from_utf8(kmer.to_ascii())
|
||||||
|
.unwrap_or_else(|_| "?".repeat(kmer_size));
|
||||||
|
let _ = write!(out, "{part},{layer},{seq}");
|
||||||
|
for &v in row.iter() {
|
||||||
|
let _ = write!(out, ",{v}");
|
||||||
|
}
|
||||||
|
let _ = writeln!(out);
|
||||||
|
})
|
||||||
|
.map_err(OKIError::Partition)?;
|
||||||
|
} else {
|
||||||
|
self.partition
|
||||||
|
.iter_partition_kmers(i, use_counts, n_genomes, |kmer, row| {
|
||||||
|
let seq = String::from_utf8(kmer.to_ascii())
|
||||||
|
.unwrap_or_else(|_| "?".repeat(kmer_size));
|
||||||
|
let _ = write!(out, "{seq}");
|
||||||
|
for &v in row.iter() {
|
||||||
|
let _ = write!(out, ",{v}");
|
||||||
|
}
|
||||||
|
let _ = writeln!(out);
|
||||||
|
})
|
||||||
|
.map_err(OKIError::Partition)?;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
out.flush()?;
|
||||||
|
Ok(())
|
||||||
|
}
|
||||||
|
}
|
||||||
+19
-11
@@ -2,14 +2,22 @@ use std::fmt;
|
|||||||
use std::io;
|
use std::io;
|
||||||
|
|
||||||
use obiskio::SKError;
|
use obiskio::SKError;
|
||||||
use obilayeredmap::OLMError;
|
|
||||||
|
|
||||||
#[derive(Debug)]
|
#[derive(Debug)]
|
||||||
pub enum OKIError {
|
pub enum OKIError {
|
||||||
Io(io::Error),
|
Io(io::Error),
|
||||||
Json(serde_json::Error),
|
Json(serde_json::Error),
|
||||||
Partition(SKError),
|
Partition(SKError),
|
||||||
Layer(OLMError),
|
/// Source index is not in `Indexed` state.
|
||||||
|
NotIndexed(std::path::PathBuf),
|
||||||
|
/// Source indexes have incompatible configurations (k, m, n_bits).
|
||||||
|
IncompatibleConfig,
|
||||||
|
/// Count mode requested but a source index lacks count data.
|
||||||
|
MismatchedMode,
|
||||||
|
/// Two or more sources share the same genome label.
|
||||||
|
DuplicateGenomeLabel(String),
|
||||||
|
/// Operation not valid for this index configuration.
|
||||||
|
InvalidInput(String),
|
||||||
}
|
}
|
||||||
|
|
||||||
pub type OKIResult<T> = Result<T, OKIError>;
|
pub type OKIResult<T> = Result<T, OKIError>;
|
||||||
@@ -17,10 +25,14 @@ pub type OKIResult<T> = Result<T, OKIError>;
|
|||||||
impl fmt::Display for OKIError {
|
impl fmt::Display for OKIError {
|
||||||
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
|
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
|
||||||
match self {
|
match self {
|
||||||
OKIError::Io(e) => write!(f, "I/O error: {e}"),
|
OKIError::Io(e) => write!(f, "I/O error: {e}"),
|
||||||
OKIError::Json(e) => write!(f, "JSON error: {e}"),
|
OKIError::Json(e) => write!(f, "JSON error: {e}"),
|
||||||
OKIError::Partition(e) => write!(f, "partition error: {e}"),
|
OKIError::Partition(e) => write!(f, "partition error: {e}"),
|
||||||
OKIError::Layer(e) => write!(f, "layer error: {e}"),
|
OKIError::NotIndexed(p) => write!(f, "index not fully built: {}", p.display()),
|
||||||
|
OKIError::IncompatibleConfig => write!(f, "incompatible index configurations"),
|
||||||
|
OKIError::MismatchedMode => write!(f, "count mode requires all sources to have with_counts=true"),
|
||||||
|
OKIError::DuplicateGenomeLabel(l) => write!(f, "duplicate genome label across sources: {l}"),
|
||||||
|
OKIError::InvalidInput(m) => write!(f, "invalid input: {m}"),
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
@@ -31,7 +43,7 @@ impl std::error::Error for OKIError {
|
|||||||
OKIError::Io(e) => Some(e),
|
OKIError::Io(e) => Some(e),
|
||||||
OKIError::Json(e) => Some(e),
|
OKIError::Json(e) => Some(e),
|
||||||
OKIError::Partition(e) => Some(e),
|
OKIError::Partition(e) => Some(e),
|
||||||
OKIError::Layer(e) => Some(e),
|
_ => None, // IncompatibleConfig, MismatchedMode, DuplicateGenomeLabel
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
@@ -47,7 +59,3 @@ impl From<serde_json::Error> for OKIError {
|
|||||||
impl From<SKError> for OKIError {
|
impl From<SKError> for OKIError {
|
||||||
fn from(e: SKError) -> Self { OKIError::Partition(e) }
|
fn from(e: SKError) -> Self { OKIError::Partition(e) }
|
||||||
}
|
}
|
||||||
|
|
||||||
impl From<OLMError> for OKIError {
|
|
||||||
fn from(e: OLMError) -> Self { OKIError::Layer(e) }
|
|
||||||
}
|
|
||||||
|
|||||||
+66
-150
@@ -1,31 +1,23 @@
|
|||||||
|
use std::collections::BTreeMap;
|
||||||
use std::fs;
|
use std::fs;
|
||||||
use std::path::{Path, PathBuf};
|
use std::path::{Path, PathBuf};
|
||||||
use std::sync::atomic::{AtomicUsize, Ordering};
|
use std::sync::atomic::{AtomicUsize, Ordering};
|
||||||
use std::sync::{Arc, Mutex};
|
use std::sync::{Arc, Mutex};
|
||||||
|
|
||||||
use cacheline_ef::{CachelineEf, CachelineEfVec};
|
|
||||||
use epserde::prelude::*;
|
|
||||||
use indicatif::{ProgressBar, ProgressStyle};
|
use indicatif::{ProgressBar, ProgressStyle};
|
||||||
use obicompactvec::{PersistentCompactIntMatrix, PersistentCompactIntVec};
|
use obikpartitionner::{KmerPartition, KmerSpectrum};
|
||||||
use obidebruinj::GraphDeBruijn;
|
|
||||||
use obikpartitionner::KmerPartition;
|
|
||||||
use obilayeredmap::layer::Layer;
|
|
||||||
use obiskio::{SKFileMeta, SKFileReader};
|
|
||||||
use obisys::{Reporter, Stage};
|
use obisys::{Reporter, Stage};
|
||||||
use ptr_hash::{PtrHash, bucket_fn::CubicEps, hash::Xx64};
|
|
||||||
use rayon::prelude::*;
|
use rayon::prelude::*;
|
||||||
use tracing::info;
|
use tracing::info;
|
||||||
|
|
||||||
use crate::error::{OKIError, OKIResult};
|
use crate::error::{OKIError, OKIResult};
|
||||||
use crate::meta::{IndexConfig, IndexMeta};
|
use crate::meta::{IndexConfig, IndexMeta};
|
||||||
use crate::state::{IndexState, SENTINEL_INDEXED, SENTINEL_SCATTERED};
|
use crate::state::{IndexState, SENTINEL_COUNTED, SENTINEL_INDEXED, SENTINEL_SCATTERED};
|
||||||
|
|
||||||
type Mphf = PtrHash<u64, CubicEps, CachelineEfVec<Vec<CachelineEf>>, Xx64, Vec<u8>>;
|
|
||||||
|
|
||||||
pub struct KmerIndex {
|
pub struct KmerIndex {
|
||||||
root_path: PathBuf,
|
pub(crate) root_path: PathBuf,
|
||||||
meta: IndexMeta,
|
pub(crate) meta: IndexMeta,
|
||||||
partition: KmerPartition,
|
pub(crate) partition: KmerPartition,
|
||||||
}
|
}
|
||||||
|
|
||||||
impl KmerIndex {
|
impl KmerIndex {
|
||||||
@@ -59,7 +51,12 @@ impl KmerIndex {
|
|||||||
pub fn open<P: AsRef<Path>>(path: P) -> OKIResult<Self> {
|
pub fn open<P: AsRef<Path>>(path: P) -> OKIResult<Self> {
|
||||||
let root_path = path.as_ref().to_owned();
|
let root_path = path.as_ref().to_owned();
|
||||||
let meta = IndexMeta::read(&root_path).map_err(OKIError::Io)?;
|
let meta = IndexMeta::read(&root_path).map_err(OKIError::Io)?;
|
||||||
let partition = KmerPartition::open(&root_path)?;
|
let partition = KmerPartition::open_with_config(
|
||||||
|
&root_path,
|
||||||
|
meta.config.kmer_size,
|
||||||
|
meta.config.minimizer_size,
|
||||||
|
meta.config.n_bits,
|
||||||
|
)?;
|
||||||
Ok(Self { root_path, meta, partition })
|
Ok(Self { root_path, meta, partition })
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -87,13 +84,10 @@ impl KmerIndex {
|
|||||||
/// Mark scatter as complete and write `scatter.done`.
|
/// Mark scatter as complete and write `scatter.done`.
|
||||||
///
|
///
|
||||||
/// If no genome label was set at creation time, one is derived from
|
/// If no genome label was set at creation time, one is derived from
|
||||||
/// `first_scatter_path` (filename stripped of all extensions).
|
/// the index root directory name (stripped of all extensions).
|
||||||
/// If `first_scatter_path` is also `None`, the label defaults to `"unknown"`.
|
pub fn mark_scattered(&mut self) -> OKIResult<()> {
|
||||||
pub fn mark_scattered(&mut self, first_scatter_path: Option<&Path>) -> OKIResult<()> {
|
|
||||||
if self.meta.genomes.is_empty() {
|
if self.meta.genomes.is_empty() {
|
||||||
let label = first_scatter_path
|
let label = label_from_path(&self.root_path);
|
||||||
.map(label_from_path)
|
|
||||||
.unwrap_or_else(|| "unknown".to_string());
|
|
||||||
self.meta.genomes.push(label);
|
self.meta.genomes.push(label);
|
||||||
self.meta.write(&self.root_path)?;
|
self.meta.write(&self.root_path)?;
|
||||||
}
|
}
|
||||||
@@ -103,33 +97,43 @@ impl KmerIndex {
|
|||||||
|
|
||||||
/// Dereplicate all partitions then compute kmer counts.
|
/// Dereplicate all partitions then compute kmer counts.
|
||||||
///
|
///
|
||||||
/// Writes `kmer_spectrum_raw.json` at the index root upon completion
|
/// Writes `spectrums/{label}.json` and touches `count.done` upon completion.
|
||||||
/// (this file doubles as the `Counted` sentinel).
|
/// Per-partition spectrum files are removed unless `keep_intermediate` is true.
|
||||||
pub fn dereplicate_and_count(&self, rep: &mut Reporter) -> OKIResult<()> {
|
pub fn dereplicate_and_count(&self, keep_intermediate: bool, rep: &mut Reporter) -> OKIResult<()> {
|
||||||
let t = Stage::start("dereplicate");
|
let t = Stage::start("dereplicate");
|
||||||
self.partition.dereplicate()?;
|
self.partition.dereplicate()?;
|
||||||
rep.push(t.stop());
|
rep.push(t.stop());
|
||||||
|
|
||||||
let t = Stage::start("count_kmer");
|
let t = Stage::start("count_kmer");
|
||||||
self.partition.count_kmer()?;
|
let spectrum = self.partition.count_kmer(keep_intermediate)?;
|
||||||
rep.push(t.stop());
|
rep.push(t.stop());
|
||||||
|
|
||||||
|
self.write_spectrum(&spectrum)?;
|
||||||
|
touch(&self.root_path.join(SENTINEL_COUNTED))?;
|
||||||
Ok(())
|
Ok(())
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Build the layered MPHF index for all partitions.
|
fn write_spectrum(&self, sp: &KmerSpectrum) -> OKIResult<()> {
|
||||||
///
|
let label = self.meta.genomes.first().map(String::as_str).unwrap_or("unknown");
|
||||||
/// Default mode (`config.with_counts = false`): set membership only.
|
let spectrums_dir = self.root_path.join("spectrums");
|
||||||
/// With counts: count matrix per kmer.
|
fs::create_dir_all(&spectrums_dir)?;
|
||||||
///
|
let path = spectrums_dir.join(format!("{label}.json"));
|
||||||
/// Writes `index.done` upon completion.
|
let spectrum_map: BTreeMap<String, u64> = sp.counts
|
||||||
/// Path to the unitigs file for partition `part`, layer `layer`.
|
.iter()
|
||||||
pub fn layer_unitigs_path(&self, part: usize, layer: usize) -> PathBuf {
|
.map(|(&c, &f)| (format!("{c:010}"), f))
|
||||||
self.partition.part_dir(part)
|
.collect();
|
||||||
.join("index")
|
let f = fs::File::create(&path)?;
|
||||||
.join(format!("layer_{layer}"))
|
serde_json::to_writer_pretty(
|
||||||
.join("unitigs.bin")
|
f,
|
||||||
|
&serde_json::json!({ "f0": sp.f0, "f1": sp.f1, "spectrum": spectrum_map }),
|
||||||
|
)
|
||||||
|
.map_err(OKIError::Json)?;
|
||||||
|
Ok(())
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// Build the layered MPHF index for all partitions in parallel.
|
||||||
|
///
|
||||||
|
/// Writes `index.done` upon completion.
|
||||||
pub fn build_layers(
|
pub fn build_layers(
|
||||||
&self,
|
&self,
|
||||||
min_ab: u32,
|
min_ab: u32,
|
||||||
@@ -140,12 +144,8 @@ impl KmerIndex {
|
|||||||
let n = self.partition.n_partitions();
|
let n = self.partition.n_partitions();
|
||||||
let t = Stage::start("index");
|
let t = Stage::start("index");
|
||||||
let with_counts = self.meta.config.with_counts;
|
let with_counts = self.meta.config.with_counts;
|
||||||
let filter_active = min_ab > 1 || max_ab.is_some();
|
|
||||||
let need_counts = filter_active || with_counts;
|
|
||||||
let total_kmers = AtomicUsize::new(0);
|
let total_kmers = AtomicUsize::new(0);
|
||||||
|
|
||||||
let partition = &self.partition;
|
|
||||||
|
|
||||||
let pb = Arc::new(Mutex::new(
|
let pb = Arc::new(Mutex::new(
|
||||||
ProgressBar::new(n as u64).with_style(
|
ProgressBar::new(n as u64).with_style(
|
||||||
ProgressStyle::with_template("index — [{bar:20}] {pos}/{len} | {msg}").unwrap(),
|
ProgressStyle::with_template("index — [{bar:20}] {pos}/{len} | {msg}").unwrap(),
|
||||||
@@ -153,101 +153,19 @@ impl KmerIndex {
|
|||||||
));
|
));
|
||||||
|
|
||||||
(0..n).into_par_iter().for_each(|i| {
|
(0..n).into_par_iter().for_each(|i| {
|
||||||
let part_dir = partition.part_dir(i);
|
match self.partition.build_index_layer(i, min_ab, max_ab, with_counts) {
|
||||||
let dedup_path = part_dir.join("dereplicated.skmer.zst");
|
Ok(0) => {}
|
||||||
if !dedup_path.exists() {
|
Ok(n_kmers) => {
|
||||||
return;
|
total_kmers.fetch_add(n_kmers, Ordering::Relaxed);
|
||||||
}
|
let pb = pb.lock().unwrap();
|
||||||
|
pb.inc(1);
|
||||||
let layer_dir = part_dir.join("index").join("layer_0");
|
pb.set_message(format!("{i}: {n_kmers} kmers"));
|
||||||
if layer_dir.join("mphf.bin").exists() {
|
}
|
||||||
return;
|
Err(e) => {
|
||||||
}
|
eprintln!("error building layer for partition {i}: {e}");
|
||||||
|
std::process::exit(1);
|
||||||
let mphf1_opt: Option<Mphf> = if need_counts {
|
|
||||||
let p = part_dir.join("mphf1.bin");
|
|
||||||
p.exists().then(|| Mphf::load_full(&p).ok()).flatten()
|
|
||||||
} else {
|
|
||||||
None
|
|
||||||
};
|
|
||||||
|
|
||||||
let counts1_opt: Option<PersistentCompactIntVec> = if need_counts {
|
|
||||||
let p = part_dir.join("counts1.bin");
|
|
||||||
p.exists()
|
|
||||||
.then(|| PersistentCompactIntVec::open(&p).ok())
|
|
||||||
.flatten()
|
|
||||||
} else {
|
|
||||||
None
|
|
||||||
};
|
|
||||||
|
|
||||||
let mut g = GraphDeBruijn::new();
|
|
||||||
let mut reader = SKFileReader::open(&dedup_path).unwrap_or_else(|e| {
|
|
||||||
eprintln!("error opening {}: {e}", dedup_path.display());
|
|
||||||
std::process::exit(1);
|
|
||||||
});
|
|
||||||
for sk in reader.iter() {
|
|
||||||
for kmer in sk.iter_canonical_kmers() {
|
|
||||||
let accept = if filter_active {
|
|
||||||
match (&mphf1_opt, &counts1_opt) {
|
|
||||||
(Some(mphf), Some(counts)) => {
|
|
||||||
let ab = counts.get(mphf.index(&kmer.raw()));
|
|
||||||
ab >= min_ab && max_ab.map_or(true, |max| ab <= max)
|
|
||||||
}
|
|
||||||
_ => true,
|
|
||||||
}
|
|
||||||
} else {
|
|
||||||
true
|
|
||||||
};
|
|
||||||
if accept {
|
|
||||||
g.push(kmer);
|
|
||||||
}
|
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
let n_kmers = g.len();
|
|
||||||
total_kmers.fetch_add(n_kmers, Ordering::Relaxed);
|
|
||||||
g.compute_degrees();
|
|
||||||
|
|
||||||
fs::create_dir_all(&layer_dir).unwrap_or_else(|e| {
|
|
||||||
eprintln!("error creating {}: {e}", layer_dir.display());
|
|
||||||
std::process::exit(1);
|
|
||||||
});
|
|
||||||
let mut uw = Layer::<()>::unitig_writer(&layer_dir).unwrap_or_else(|e| {
|
|
||||||
eprintln!("error creating unitig writer (partition {i}): {e}");
|
|
||||||
std::process::exit(1);
|
|
||||||
});
|
|
||||||
for unitig in g.iter_unitig() {
|
|
||||||
uw.write(&unitig).unwrap_or_else(|e| {
|
|
||||||
eprintln!("error writing unitig (partition {i}): {e}");
|
|
||||||
std::process::exit(1);
|
|
||||||
});
|
|
||||||
}
|
|
||||||
uw.close().unwrap_or_else(|e| {
|
|
||||||
eprintln!("error closing unitig writer (partition {i}): {e}");
|
|
||||||
std::process::exit(1);
|
|
||||||
});
|
|
||||||
|
|
||||||
if with_counts {
|
|
||||||
Layer::<PersistentCompactIntMatrix>::build(&layer_dir, |kmer| {
|
|
||||||
match (&mphf1_opt, &counts1_opt) {
|
|
||||||
(Some(mphf), Some(counts)) => counts.get(mphf.index(&kmer.raw())),
|
|
||||||
_ => 1,
|
|
||||||
}
|
|
||||||
})
|
|
||||||
.unwrap_or_else(|e| {
|
|
||||||
eprintln!("error building count layer (partition {i}): {e}");
|
|
||||||
std::process::exit(1);
|
|
||||||
});
|
|
||||||
} else {
|
|
||||||
Layer::<()>::build(&layer_dir).unwrap_or_else(|e| {
|
|
||||||
eprintln!("error building set layer (partition {i}): {e}");
|
|
||||||
std::process::exit(1);
|
|
||||||
});
|
|
||||||
}
|
|
||||||
|
|
||||||
let pb = pb.lock().unwrap();
|
|
||||||
pb.inc(1);
|
|
||||||
pb.set_message(format!("{i}: {n_kmers} kmers"));
|
|
||||||
});
|
});
|
||||||
|
|
||||||
pb.lock().unwrap().finish_and_clear();
|
pb.lock().unwrap().finish_and_clear();
|
||||||
@@ -258,13 +176,7 @@ impl KmerIndex {
|
|||||||
|
|
||||||
if !keep_intermediate {
|
if !keep_intermediate {
|
||||||
for i in 0..n {
|
for i in 0..n {
|
||||||
let part_dir = partition.part_dir(i);
|
self.partition.remove_build_artifacts(i);
|
||||||
remove_if_exists(&part_dir.join("dereplicated.skmer.zst"));
|
|
||||||
remove_if_exists(&SKFileMeta::sidecar_path(
|
|
||||||
&part_dir.join("dereplicated.skmer.zst"),
|
|
||||||
));
|
|
||||||
remove_if_exists(&part_dir.join("mphf1.bin"));
|
|
||||||
remove_if_exists(&part_dir.join("counts1.bin"));
|
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -272,9 +184,21 @@ impl KmerIndex {
|
|||||||
rep.push(t.stop());
|
rep.push(t.stop());
|
||||||
Ok(())
|
Ok(())
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// Borrow the inner partition for direct superkmer-level queries.
|
||||||
|
pub fn partition(&self) -> &KmerPartition {
|
||||||
|
&self.partition
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Path to the unitigs file for partition `part`, layer `layer`.
|
||||||
|
pub fn layer_unitigs_path(&self, part: usize, layer: usize) -> PathBuf {
|
||||||
|
self.partition.part_dir(part)
|
||||||
|
.join("index")
|
||||||
|
.join(format!("layer_{layer}"))
|
||||||
|
.join("unitigs.bin")
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Derive a genome label from a file path: filename stripped of all extensions.
|
|
||||||
fn label_from_path(path: &Path) -> String {
|
fn label_from_path(path: &Path) -> String {
|
||||||
let name = path
|
let name = path
|
||||||
.file_name()
|
.file_name()
|
||||||
@@ -291,11 +215,3 @@ fn label_from_path(path: &Path) -> String {
|
|||||||
fn touch(path: &Path) -> Result<(), std::io::Error> {
|
fn touch(path: &Path) -> Result<(), std::io::Error> {
|
||||||
fs::File::create(path).map(|_| ())
|
fs::File::create(path).map(|_| ())
|
||||||
}
|
}
|
||||||
|
|
||||||
fn remove_if_exists(path: &Path) {
|
|
||||||
if let Err(e) = fs::remove_file(path) {
|
|
||||||
if e.kind() != std::io::ErrorKind::NotFound {
|
|
||||||
eprintln!("warning: could not remove {}: {e}", path.display());
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|||||||
@@ -1,9 +1,15 @@
|
|||||||
pub mod error;
|
pub mod error;
|
||||||
pub mod meta;
|
pub mod meta;
|
||||||
pub mod state;
|
pub mod state;
|
||||||
|
mod distance;
|
||||||
|
mod dump;
|
||||||
mod index;
|
mod index;
|
||||||
|
mod merge;
|
||||||
|
mod rebuild;
|
||||||
|
|
||||||
pub use error::{OKIError, OKIResult};
|
pub use error::{OKIError, OKIResult};
|
||||||
|
pub use distance::{DistanceMetric, DistanceOutput};
|
||||||
pub use index::KmerIndex;
|
pub use index::KmerIndex;
|
||||||
|
pub use merge::MergeMode;
|
||||||
pub use meta::{IndexConfig, IndexMeta, META_FILENAME};
|
pub use meta::{IndexConfig, IndexMeta, META_FILENAME};
|
||||||
pub use state::{IndexState, SENTINEL_COUNTED, SENTINEL_INDEXED, SENTINEL_SCATTERED};
|
pub use state::{IndexState, SENTINEL_COUNTED, SENTINEL_INDEXED, SENTINEL_SCATTERED};
|
||||||
|
|||||||
@@ -0,0 +1,272 @@
|
|||||||
|
use std::collections::HashMap;
|
||||||
|
use std::fs;
|
||||||
|
use std::io;
|
||||||
|
use std::path::Path;
|
||||||
|
use std::time::Duration;
|
||||||
|
|
||||||
|
use indicatif::{ProgressBar, ProgressStyle};
|
||||||
|
use obisys::{Reporter, Stage};
|
||||||
|
use rayon::prelude::*;
|
||||||
|
use tracing::info;
|
||||||
|
|
||||||
|
use crate::error::{OKIError, OKIResult};
|
||||||
|
use crate::index::KmerIndex;
|
||||||
|
use crate::meta::IndexMeta;
|
||||||
|
use crate::state::IndexState;
|
||||||
|
|
||||||
|
pub use obikpartitionner::MergeMode;
|
||||||
|
|
||||||
|
impl KmerIndex {
|
||||||
|
/// Merge `sources` into a new index at `output`.
|
||||||
|
///
|
||||||
|
/// All sources must be in `Indexed` state and share the same `kmer_size`,
|
||||||
|
/// `minimizer_size`, and `n_partitions`. Count mode additionally requires
|
||||||
|
/// every source to have `with_counts = true`.
|
||||||
|
///
|
||||||
|
/// Genome labels must be unique across all sources. If `rename_duplicates`
|
||||||
|
/// is true, repeated labels are disambiguated by appending `.1`, `.2`, …
|
||||||
|
/// to the second and subsequent occurrences. Otherwise a
|
||||||
|
/// `DuplicateGenomeLabel` error is returned on the first conflict.
|
||||||
|
pub fn merge<P: AsRef<Path>>(
|
||||||
|
output: P,
|
||||||
|
sources: &[&KmerIndex],
|
||||||
|
mode: MergeMode,
|
||||||
|
force: bool,
|
||||||
|
rename_duplicates: bool,
|
||||||
|
rep: &mut Reporter,
|
||||||
|
) -> OKIResult<Self> {
|
||||||
|
let output = output.as_ref();
|
||||||
|
|
||||||
|
if sources.is_empty() {
|
||||||
|
return Err(OKIError::Io(io::Error::new(
|
||||||
|
io::ErrorKind::InvalidInput,
|
||||||
|
"merge requires at least one source index",
|
||||||
|
)));
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Validate config compatibility ─────────────────────────────────────
|
||||||
|
let ref0 = sources[0];
|
||||||
|
for src in sources {
|
||||||
|
if src.state() != IndexState::Indexed {
|
||||||
|
return Err(OKIError::NotIndexed(src.root_path.clone()));
|
||||||
|
}
|
||||||
|
if src.kmer_size() != ref0.kmer_size()
|
||||||
|
|| src.minimizer_size() != ref0.minimizer_size()
|
||||||
|
|| src.n_partitions() != ref0.n_partitions()
|
||||||
|
{
|
||||||
|
return Err(OKIError::IncompatibleConfig);
|
||||||
|
}
|
||||||
|
if mode == MergeMode::Count && !src.meta.config.with_counts {
|
||||||
|
return Err(OKIError::MismatchedMode);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Compute final genome labels (rename duplicates if requested) ───────
|
||||||
|
let (source_labels, all_genomes) = compute_labels(sources, rename_duplicates)?;
|
||||||
|
|
||||||
|
// ── Prepare output directory ──────────────────────────────────────────
|
||||||
|
if output.exists() {
|
||||||
|
if force {
|
||||||
|
fs::remove_dir_all(output)?;
|
||||||
|
} else {
|
||||||
|
return Err(OKIError::Io(io::Error::new(
|
||||||
|
io::ErrorKind::AlreadyExists,
|
||||||
|
format!("{}: output directory already exists", output.display()),
|
||||||
|
)));
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Bootstrap: copy first source to output ────────────────────────────
|
||||||
|
info!(
|
||||||
|
"bootstrap: copying {} → {} ({} genome(s))",
|
||||||
|
sources[0].root_path.display(),
|
||||||
|
output.display(),
|
||||||
|
sources[0].meta.genomes.len(),
|
||||||
|
);
|
||||||
|
let t = Stage::start("bootstrap");
|
||||||
|
let pb = spinner("bootstrap — copying index …");
|
||||||
|
copy_dir_all(&sources[0].root_path, output)?;
|
||||||
|
|
||||||
|
// Rewrite index.meta with final genome labels and the effective mode.
|
||||||
|
let mut meta = IndexMeta::read(output).map_err(OKIError::Io)?;
|
||||||
|
meta.genomes = all_genomes;
|
||||||
|
meta.config.with_counts = mode == MergeMode::Count;
|
||||||
|
meta.write(output)?;
|
||||||
|
|
||||||
|
// In presence/absence mode, purge counts/ directories inherited from
|
||||||
|
// source_0 — they are stale data from the source's count index.
|
||||||
|
if mode == MergeMode::Presence {
|
||||||
|
remove_dirs_named(output, "counts")?;
|
||||||
|
}
|
||||||
|
pb.finish_and_clear();
|
||||||
|
rep.push(t.stop());
|
||||||
|
|
||||||
|
// Rebuild spectrums/ from all sources using the (possibly renamed) labels.
|
||||||
|
// Drop the spectrums/ that were copied from source_0 and rebuild from scratch.
|
||||||
|
info!("rebuilding spectrums for {} source(s)", sources.len());
|
||||||
|
let t = Stage::start("spectrums");
|
||||||
|
let pb = spinner("spectrums — copying …");
|
||||||
|
let spectrums_dir = output.join("spectrums");
|
||||||
|
if spectrums_dir.exists() {
|
||||||
|
fs::remove_dir_all(&spectrums_dir)?;
|
||||||
|
}
|
||||||
|
for (src, new_labels) in sources.iter().zip(&source_labels) {
|
||||||
|
copy_spectrums(&src.root_path, output, &src.meta.genomes, new_labels)?;
|
||||||
|
}
|
||||||
|
pb.finish_and_clear();
|
||||||
|
rep.push(t.stop());
|
||||||
|
|
||||||
|
// Open the destination index.
|
||||||
|
let dst = KmerIndex::open(output)?;
|
||||||
|
let n_partitions = dst.n_partitions();
|
||||||
|
let n_dst_genomes = sources[0].meta.genomes.len();
|
||||||
|
|
||||||
|
// ── Merge each subsequent source partition-by-partition ───────────────
|
||||||
|
let remaining_sources: Vec<&KmerIndex> = sources[1..].to_vec();
|
||||||
|
if !remaining_sources.is_empty() {
|
||||||
|
let n_src_genomes: usize = remaining_sources.iter().map(|s| s.meta.genomes.len()).sum();
|
||||||
|
info!(
|
||||||
|
"merging {} partition(s) × {} additional source genome(s) into {} destination genome(s)",
|
||||||
|
n_partitions, n_src_genomes, n_dst_genomes,
|
||||||
|
);
|
||||||
|
let t = Stage::start("merge_partitions");
|
||||||
|
let pb = partition_bar(n_partitions as u64);
|
||||||
|
|
||||||
|
let dst_partition = &dst.partition;
|
||||||
|
|
||||||
|
let errors: Vec<obiskio::SKError> = (0..n_partitions)
|
||||||
|
.into_par_iter()
|
||||||
|
.filter_map(|i| {
|
||||||
|
let srcs: Vec<(&obikpartitionner::KmerPartition, usize)> =
|
||||||
|
remaining_sources.iter().map(|s| (&s.partition, s.meta.genomes.len())).collect();
|
||||||
|
let result = dst_partition.merge_partition(i, &srcs, mode, n_dst_genomes).err();
|
||||||
|
pb.inc(1);
|
||||||
|
result
|
||||||
|
})
|
||||||
|
.collect();
|
||||||
|
|
||||||
|
pb.finish_and_clear();
|
||||||
|
if let Some(e) = errors.into_iter().next() {
|
||||||
|
return Err(OKIError::Partition(e));
|
||||||
|
}
|
||||||
|
|
||||||
|
rep.push(t.stop());
|
||||||
|
}
|
||||||
|
|
||||||
|
// Re-open to get the updated state.
|
||||||
|
KmerIndex::open(output)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Helpers ───────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
/// Compute the final genome label lists for all sources.
|
||||||
|
///
|
||||||
|
/// Returns `(per_source_labels, all_genomes_flat)`.
|
||||||
|
/// The first occurrence of a label keeps the original name. Subsequent
|
||||||
|
/// occurrences receive `.1`, `.2`, … suffixes when `rename_duplicates` is true,
|
||||||
|
/// or trigger a `DuplicateGenomeLabel` error otherwise.
|
||||||
|
fn compute_labels(
|
||||||
|
sources: &[&KmerIndex],
|
||||||
|
rename_duplicates: bool,
|
||||||
|
) -> OKIResult<(Vec<Vec<String>>, Vec<String>)> {
|
||||||
|
let mut seen: HashMap<String, usize> = HashMap::new();
|
||||||
|
let mut source_labels: Vec<Vec<String>> = Vec::with_capacity(sources.len());
|
||||||
|
let mut all_genomes: Vec<String> = Vec::new();
|
||||||
|
|
||||||
|
for src in sources {
|
||||||
|
let mut labels = Vec::with_capacity(src.meta.genomes.len());
|
||||||
|
for label in &src.meta.genomes {
|
||||||
|
let count = seen.entry(label.clone()).or_insert(0);
|
||||||
|
let new_label = if *count == 0 {
|
||||||
|
label.clone()
|
||||||
|
} else if rename_duplicates {
|
||||||
|
format!("{label}.{count}")
|
||||||
|
} else {
|
||||||
|
return Err(OKIError::DuplicateGenomeLabel(label.clone()));
|
||||||
|
};
|
||||||
|
*count += 1;
|
||||||
|
labels.push(new_label.clone());
|
||||||
|
all_genomes.push(new_label);
|
||||||
|
}
|
||||||
|
source_labels.push(labels);
|
||||||
|
}
|
||||||
|
|
||||||
|
Ok((source_labels, all_genomes))
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Copy spectrum JSON files from `src_root/spectrums/` to `dst_root/spectrums/`,
|
||||||
|
/// mapping each `old_labels[i]` filename to `new_labels[i]`.
|
||||||
|
fn copy_spectrums(
|
||||||
|
src_root: &Path,
|
||||||
|
dst_root: &Path,
|
||||||
|
old_labels: &[String],
|
||||||
|
new_labels: &[String],
|
||||||
|
) -> io::Result<()> {
|
||||||
|
let src_dir = src_root.join("spectrums");
|
||||||
|
let dst_dir = dst_root.join("spectrums");
|
||||||
|
fs::create_dir_all(&dst_dir)?;
|
||||||
|
for (old, new) in old_labels.iter().zip(new_labels.iter()) {
|
||||||
|
let src_file = src_dir.join(format!("{old}.json"));
|
||||||
|
if src_file.exists() {
|
||||||
|
fs::copy(&src_file, dst_dir.join(format!("{new}.json")))?;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
Ok(())
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Recursively remove every directory named `name` under `root`.
|
||||||
|
fn remove_dirs_named(root: &Path, name: &str) -> io::Result<()> {
|
||||||
|
for entry in fs::read_dir(root)? {
|
||||||
|
let entry = entry?;
|
||||||
|
let path = entry.path();
|
||||||
|
if path.is_dir() {
|
||||||
|
if path.file_name().and_then(|n| n.to_str()) == Some(name) {
|
||||||
|
fs::remove_dir_all(&path)?;
|
||||||
|
} else {
|
||||||
|
remove_dirs_named(&path, name)?;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
Ok(())
|
||||||
|
}
|
||||||
|
|
||||||
|
fn spinner(msg: &'static str) -> ProgressBar {
|
||||||
|
let pb = ProgressBar::new_spinner();
|
||||||
|
pb.set_style(
|
||||||
|
ProgressStyle::with_template("{spinner} {msg} {elapsed}")
|
||||||
|
.unwrap()
|
||||||
|
.tick_strings(&["⠋", "⠙", "⠹", "⠸", "⠼", "⠴", "⠦", "⠧", "⠇", "⠏"]),
|
||||||
|
);
|
||||||
|
pb.set_message(msg);
|
||||||
|
pb.enable_steady_tick(Duration::from_millis(100));
|
||||||
|
pb
|
||||||
|
}
|
||||||
|
|
||||||
|
fn partition_bar(n: u64) -> ProgressBar {
|
||||||
|
let pb = ProgressBar::new(n);
|
||||||
|
pb.set_style(
|
||||||
|
ProgressStyle::with_template(
|
||||||
|
"{spinner} merge — {bar:40.cyan/blue} {pos}/{len} partitions {elapsed}",
|
||||||
|
)
|
||||||
|
.unwrap()
|
||||||
|
.tick_strings(&["⠋", "⠙", "⠹", "⠸", "⠼", "⠴", "⠦", "⠧", "⠇", "⠏"]),
|
||||||
|
);
|
||||||
|
pb.enable_steady_tick(Duration::from_millis(100));
|
||||||
|
pb
|
||||||
|
}
|
||||||
|
|
||||||
|
fn copy_dir_all(src: &Path, dst: &Path) -> io::Result<()> {
|
||||||
|
fs::create_dir_all(dst)?;
|
||||||
|
for entry in fs::read_dir(src)? {
|
||||||
|
let entry = entry?;
|
||||||
|
let src_path = entry.path();
|
||||||
|
let dst_path = dst.join(entry.file_name());
|
||||||
|
if src_path.is_dir() {
|
||||||
|
copy_dir_all(&src_path, &dst_path)?;
|
||||||
|
} else {
|
||||||
|
fs::copy(&src_path, &dst_path)?;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
Ok(())
|
||||||
|
}
|
||||||
@@ -0,0 +1,116 @@
|
|||||||
|
use std::fs;
|
||||||
|
use std::io;
|
||||||
|
use std::path::Path;
|
||||||
|
use std::time::Duration;
|
||||||
|
|
||||||
|
use indicatif::{ProgressBar, ProgressStyle};
|
||||||
|
use obikpartitionner::{KmerFilter, KmerPartition, MergeMode};
|
||||||
|
use obisys::{Reporter, Stage};
|
||||||
|
use rayon::prelude::*;
|
||||||
|
use tracing::info;
|
||||||
|
|
||||||
|
use crate::error::{OKIError, OKIResult};
|
||||||
|
use crate::index::KmerIndex;
|
||||||
|
use crate::meta::IndexMeta;
|
||||||
|
use crate::state::{IndexState, SENTINEL_INDEXED};
|
||||||
|
|
||||||
|
impl KmerIndex {
|
||||||
|
/// Rebuild `src` into a new compact single-layer index at `output`.
|
||||||
|
///
|
||||||
|
/// Only k-mers whose per-genome row passes every filter in `filters` are
|
||||||
|
/// written. If `filters` is empty every k-mer is kept (pure compaction).
|
||||||
|
///
|
||||||
|
/// `mode` controls whether the output stores counts or presence/absence.
|
||||||
|
/// A count source may be rebuilt in presence mode; a presence source
|
||||||
|
/// cannot be rebuilt in count mode.
|
||||||
|
pub fn rebuild<P: AsRef<Path>>(
|
||||||
|
output: P,
|
||||||
|
src: &KmerIndex,
|
||||||
|
filters: &[Box<dyn KmerFilter>],
|
||||||
|
mode: MergeMode,
|
||||||
|
force: bool,
|
||||||
|
rep: &mut Reporter,
|
||||||
|
) -> OKIResult<Self> {
|
||||||
|
let output = output.as_ref();
|
||||||
|
|
||||||
|
if src.state() != IndexState::Indexed {
|
||||||
|
return Err(OKIError::NotIndexed(src.root_path.clone()));
|
||||||
|
}
|
||||||
|
|
||||||
|
if mode == MergeMode::Count && !src.meta.config.with_counts {
|
||||||
|
return Err(OKIError::InvalidInput(
|
||||||
|
"cannot rebuild in count mode from a presence-only source index".into(),
|
||||||
|
));
|
||||||
|
}
|
||||||
|
|
||||||
|
if output.exists() {
|
||||||
|
if force {
|
||||||
|
fs::remove_dir_all(output)?;
|
||||||
|
} else {
|
||||||
|
return Err(OKIError::Io(io::Error::new(
|
||||||
|
io::ErrorKind::AlreadyExists,
|
||||||
|
format!("{}: output directory already exists", output.display()),
|
||||||
|
)));
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Create output directory + metadata ────────────────────────────────
|
||||||
|
fs::create_dir_all(output)?;
|
||||||
|
let mut meta = IndexMeta::new(src.meta.config.clone());
|
||||||
|
meta.config.with_counts = mode == MergeMode::Count;
|
||||||
|
meta.genomes = src.meta.genomes.clone();
|
||||||
|
meta.write(output)?;
|
||||||
|
|
||||||
|
let n_genomes = src.meta.genomes.len();
|
||||||
|
let n_partitions = src.partition.n_partitions();
|
||||||
|
|
||||||
|
// ── Create an empty destination KmerPartition ─────────────────────────
|
||||||
|
// Create the partitions/ subdirectory so KmerPartition::open_with_config works.
|
||||||
|
fs::create_dir_all(output.join(obikpartitionner::PARTITIONS_SUBDIR))?;
|
||||||
|
let dst_partition = KmerPartition::open_with_config(
|
||||||
|
output,
|
||||||
|
meta.config.kmer_size,
|
||||||
|
meta.config.minimizer_size,
|
||||||
|
meta.config.n_bits,
|
||||||
|
)?;
|
||||||
|
|
||||||
|
info!(
|
||||||
|
"rebuild: {} partition(s), {} genome(s), mode={:?}",
|
||||||
|
n_partitions, n_genomes, mode,
|
||||||
|
);
|
||||||
|
|
||||||
|
let t = Stage::start("rebuild");
|
||||||
|
let pb = ProgressBar::new(n_partitions as u64).with_style(
|
||||||
|
ProgressStyle::with_template("rebuild — [{bar:20}] {pos}/{len} | {msg}")
|
||||||
|
.unwrap()
|
||||||
|
.progress_chars("=> "),
|
||||||
|
);
|
||||||
|
pb.enable_steady_tick(Duration::from_millis(100));
|
||||||
|
|
||||||
|
let src_partition = &src.partition;
|
||||||
|
|
||||||
|
let errors: Vec<obiskio::SKError> = (0..n_partitions)
|
||||||
|
.into_par_iter()
|
||||||
|
.filter_map(|i| {
|
||||||
|
let result = dst_partition
|
||||||
|
.rebuild_partition(src_partition, i, filters, mode, n_genomes)
|
||||||
|
.err();
|
||||||
|
pb.inc(1);
|
||||||
|
result
|
||||||
|
})
|
||||||
|
.collect();
|
||||||
|
|
||||||
|
pb.finish_and_clear();
|
||||||
|
|
||||||
|
if let Some(e) = errors.into_iter().next() {
|
||||||
|
return Err(OKIError::Partition(e));
|
||||||
|
}
|
||||||
|
|
||||||
|
rep.push(t.stop());
|
||||||
|
|
||||||
|
// Write SENTINEL_INDEXED — output is ready to use.
|
||||||
|
fs::File::create(output.join(SENTINEL_INDEXED))?;
|
||||||
|
|
||||||
|
KmerIndex::open(output)
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -3,7 +3,7 @@ use std::path::Path;
|
|||||||
use crate::meta::META_FILENAME;
|
use crate::meta::META_FILENAME;
|
||||||
|
|
||||||
pub const SENTINEL_SCATTERED: &str = "scatter.done";
|
pub const SENTINEL_SCATTERED: &str = "scatter.done";
|
||||||
pub const SENTINEL_COUNTED: &str = "kmer_spectrum_raw.json";
|
pub const SENTINEL_COUNTED: &str = "count.done";
|
||||||
pub const SENTINEL_INDEXED: &str = "index.done";
|
pub const SENTINEL_INDEXED: &str = "index.done";
|
||||||
|
|
||||||
/// Progression state of a `KmerIndex`.
|
/// Progression state of a `KmerIndex`.
|
||||||
@@ -17,7 +17,7 @@ pub enum IndexState {
|
|||||||
Empty,
|
Empty,
|
||||||
/// `scatter.done` sentinel present — all super-kmers have been routed.
|
/// `scatter.done` sentinel present — all super-kmers have been routed.
|
||||||
Scattered,
|
Scattered,
|
||||||
/// `kmer_spectrum_raw.json` present — dereplicate + count complete.
|
/// `count.done` sentinel present — dereplicate + count complete.
|
||||||
Counted,
|
Counted,
|
||||||
/// `index.done` sentinel present — layered MPHF index fully built.
|
/// `index.done` sentinel present — layered MPHF index fully built.
|
||||||
Indexed,
|
Indexed,
|
||||||
|
|||||||
@@ -19,6 +19,9 @@ obisys = { path = "../obisys" }
|
|||||||
obiskio = { path = "../obiskio" }
|
obiskio = { path = "../obiskio" }
|
||||||
obikindex = { path = "../obikindex" }
|
obikindex = { path = "../obikindex" }
|
||||||
clap = { version = "4", features = ["derive"] }
|
clap = { version = "4", features = ["derive"] }
|
||||||
|
serde_json = "1"
|
||||||
|
kodama = "0.2"
|
||||||
|
speedytree = "0.1"
|
||||||
rayon = "1"
|
rayon = "1"
|
||||||
indicatif = "0.17"
|
indicatif = "0.17"
|
||||||
tracing = "0.1.44"
|
tracing = "0.1.44"
|
||||||
|
|||||||
@@ -29,9 +29,9 @@ pub struct CommonArgs {
|
|||||||
#[arg(long, default_value_t = 6)]
|
#[arg(long, default_value_t = 6)]
|
||||||
pub level_max: usize,
|
pub level_max: usize,
|
||||||
|
|
||||||
/// Number of bits to encode partitions (allows up to 2^partition_bits partitions)
|
/// Number of partitions (rounded up to the next power of 2)
|
||||||
#[arg(short, long, default_value_t = 8)]
|
#[arg(short, long, default_value_t = 256)]
|
||||||
pub partition_bits: usize,
|
pub partitions: usize,
|
||||||
|
|
||||||
/// Number of worker threads
|
/// Number of worker threads
|
||||||
#[arg(
|
#[arg(
|
||||||
@@ -44,6 +44,12 @@ pub struct CommonArgs {
|
|||||||
pub threads: usize,
|
pub threads: usize,
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// Smallest `b` such that `2^b >= n` (i.e. `n.next_power_of_two().ilog2()`).
|
||||||
|
/// Minimum 1 (degenerate n=0 or n=1 → 1 partition).
|
||||||
|
pub fn partitions_to_bits(n: usize) -> usize {
|
||||||
|
n.max(1).next_power_of_two().trailing_zeros() as usize
|
||||||
|
}
|
||||||
|
|
||||||
impl CommonArgs {
|
impl CommonArgs {
|
||||||
pub fn seqfile_paths(&self) -> obiread::PathIter {
|
pub fn seqfile_paths(&self) -> obiread::PathIter {
|
||||||
let paths = self.inputs.iter().map(PathBuf::from).collect();
|
let paths = self.inputs.iter().map(PathBuf::from).collect();
|
||||||
|
|||||||
@@ -0,0 +1,216 @@
|
|||||||
|
use std::io::{self, BufWriter, Write};
|
||||||
|
use std::path::PathBuf;
|
||||||
|
|
||||||
|
use clap::Args;
|
||||||
|
use kodama::{Method, linkage};
|
||||||
|
use obikindex::{DistanceMetric, KmerIndex};
|
||||||
|
use obikseq::{set_k, set_m};
|
||||||
|
use speedytree::{DistanceMatrix, Hybrid, NeighborJoiningSolver, to_newick};
|
||||||
|
use tracing::info;
|
||||||
|
|
||||||
|
#[derive(clap::ValueEnum, Clone, Copy, Debug)]
|
||||||
|
pub enum MetricArg {
|
||||||
|
Jaccard,
|
||||||
|
Hamming,
|
||||||
|
BrayCurtis,
|
||||||
|
#[value(name = "relfreq-bray-curtis")]
|
||||||
|
RelfreqBrayCurtis,
|
||||||
|
Euclidean,
|
||||||
|
#[value(name = "relfreq-euclidean")]
|
||||||
|
RelfreqEuclidean,
|
||||||
|
Hellinger,
|
||||||
|
#[value(name = "hellinger-euclidean")]
|
||||||
|
HellingerEuclidean,
|
||||||
|
}
|
||||||
|
|
||||||
|
impl From<MetricArg> for DistanceMetric {
|
||||||
|
fn from(m: MetricArg) -> Self {
|
||||||
|
match m {
|
||||||
|
MetricArg::Jaccard => DistanceMetric::Jaccard,
|
||||||
|
MetricArg::Hamming => DistanceMetric::Hamming,
|
||||||
|
MetricArg::BrayCurtis => DistanceMetric::BrayCurtis,
|
||||||
|
MetricArg::RelfreqBrayCurtis => DistanceMetric::RelfreqBrayCurtis,
|
||||||
|
MetricArg::Euclidean => DistanceMetric::Euclidean,
|
||||||
|
MetricArg::RelfreqEuclidean => DistanceMetric::RelfreqEuclidean,
|
||||||
|
MetricArg::Hellinger => DistanceMetric::Hellinger,
|
||||||
|
MetricArg::HellingerEuclidean => DistanceMetric::HellingerEuclidean,
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
#[derive(Args)]
|
||||||
|
pub struct DistanceArgs {
|
||||||
|
/// Index directory
|
||||||
|
pub index: PathBuf,
|
||||||
|
|
||||||
|
/// Distance metric to compute
|
||||||
|
#[arg(long, value_enum, default_value = "jaccard")]
|
||||||
|
pub metric: MetricArg,
|
||||||
|
|
||||||
|
/// Also output the shared-kmer count matrix (CSV)
|
||||||
|
#[arg(long)]
|
||||||
|
pub shared_kmers: bool,
|
||||||
|
|
||||||
|
/// Compute and write a Neighbor-Joining tree (Newick)
|
||||||
|
#[arg(long)]
|
||||||
|
pub nj: bool,
|
||||||
|
|
||||||
|
/// Compute and write a UPGMA tree (Newick)
|
||||||
|
#[arg(long)]
|
||||||
|
pub upgma: bool,
|
||||||
|
|
||||||
|
/// Output prefix: <prefix>_dist.csv, <prefix>_shared.csv,
|
||||||
|
/// <prefix>_nj.nwk, <prefix>_upgma.nwk.
|
||||||
|
/// If omitted, the distance matrix is written to stdout.
|
||||||
|
#[arg(short, long)]
|
||||||
|
pub output: Option<PathBuf>,
|
||||||
|
}
|
||||||
|
|
||||||
|
pub fn run(args: DistanceArgs) {
|
||||||
|
let idx = KmerIndex::open(&args.index).unwrap_or_else(|e| {
|
||||||
|
eprintln!("error opening index: {e}");
|
||||||
|
std::process::exit(1);
|
||||||
|
});
|
||||||
|
|
||||||
|
set_k(idx.kmer_size());
|
||||||
|
set_m(idx.minimizer_size());
|
||||||
|
|
||||||
|
let genomes = idx.meta().genomes.clone();
|
||||||
|
let n = genomes.len();
|
||||||
|
info!(
|
||||||
|
"computing {:?} distances for {} genome(s)",
|
||||||
|
args.metric, n
|
||||||
|
);
|
||||||
|
|
||||||
|
let need_shared = args.shared_kmers || args.nj || args.upgma;
|
||||||
|
let result = idx
|
||||||
|
.distance(args.metric.into(), need_shared)
|
||||||
|
.unwrap_or_else(|e| {
|
||||||
|
eprintln!("error computing distances: {e}");
|
||||||
|
std::process::exit(1);
|
||||||
|
});
|
||||||
|
|
||||||
|
// ── Distance matrix → CSV ─────────────────────────────────────────────────
|
||||||
|
let write_dist_csv = |w: &mut dyn Write| {
|
||||||
|
write!(w, "genome").unwrap();
|
||||||
|
for g in &genomes { write!(w, ",{g}").unwrap(); }
|
||||||
|
writeln!(w).unwrap();
|
||||||
|
for (i, g) in genomes.iter().enumerate() {
|
||||||
|
write!(w, "{g}").unwrap();
|
||||||
|
for j in 0..n {
|
||||||
|
write!(w, ",{:.6}", result.matrix[[i, j]]).unwrap();
|
||||||
|
}
|
||||||
|
writeln!(w).unwrap();
|
||||||
|
}
|
||||||
|
};
|
||||||
|
|
||||||
|
match &args.output {
|
||||||
|
Some(prefix) => {
|
||||||
|
let path = format!("{}_dist.csv", prefix.display());
|
||||||
|
let mut f = BufWriter::new(std::fs::File::create(&path).unwrap_or_else(|e| {
|
||||||
|
eprintln!("error creating {path}: {e}");
|
||||||
|
std::process::exit(1);
|
||||||
|
}));
|
||||||
|
write_dist_csv(&mut f);
|
||||||
|
info!("distance matrix → {path}");
|
||||||
|
}
|
||||||
|
None => {
|
||||||
|
let stdout = io::stdout();
|
||||||
|
let mut out = BufWriter::new(stdout.lock());
|
||||||
|
write_dist_csv(&mut out);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Shared-kmer matrix → CSV ──────────────────────────────────────────────
|
||||||
|
if args.shared_kmers {
|
||||||
|
if let Some(shared) = &result.shared_kmers {
|
||||||
|
let path = args.output.as_ref()
|
||||||
|
.map(|p| format!("{}_shared.csv", p.display()))
|
||||||
|
.unwrap_or_else(|| "shared.csv".into());
|
||||||
|
let mut f = BufWriter::new(std::fs::File::create(&path).unwrap_or_else(|e| {
|
||||||
|
eprintln!("error creating {path}: {e}");
|
||||||
|
std::process::exit(1);
|
||||||
|
}));
|
||||||
|
write!(f, "genome").unwrap();
|
||||||
|
for g in &genomes { write!(f, ",{g}").unwrap(); }
|
||||||
|
writeln!(f).unwrap();
|
||||||
|
for (i, g) in genomes.iter().enumerate() {
|
||||||
|
write!(f, "{g}").unwrap();
|
||||||
|
for j in 0..n { write!(f, ",{}", shared[[i, j]]).unwrap(); }
|
||||||
|
writeln!(f).unwrap();
|
||||||
|
}
|
||||||
|
info!("shared-kmer matrix → {path}");
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── NJ tree via speedytree ────────────────────────────────────────────────
|
||||||
|
if args.nj {
|
||||||
|
let rows: Vec<Vec<f64>> = (0..n)
|
||||||
|
.map(|i| (0..n).map(|j| result.matrix[[i, j]]).collect())
|
||||||
|
.collect();
|
||||||
|
let dm = DistanceMatrix::build(rows, genomes.clone()).unwrap_or_else(|e| {
|
||||||
|
eprintln!("error building distance matrix for NJ: {e}");
|
||||||
|
std::process::exit(1);
|
||||||
|
});
|
||||||
|
let tree = NeighborJoiningSolver::<Hybrid>::default(dm).solve().unwrap_or_else(|e| {
|
||||||
|
eprintln!("error computing NJ tree: {e}");
|
||||||
|
std::process::exit(1);
|
||||||
|
});
|
||||||
|
let newick = to_newick(&tree);
|
||||||
|
let path = args.output.as_ref()
|
||||||
|
.map(|p| format!("{}_nj.nwk", p.display()))
|
||||||
|
.unwrap_or_else(|| "nj.nwk".into());
|
||||||
|
std::fs::write(&path, &newick).unwrap_or_else(|e| {
|
||||||
|
eprintln!("error writing {path}: {e}");
|
||||||
|
std::process::exit(1);
|
||||||
|
});
|
||||||
|
info!("NJ tree → {path}");
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── UPGMA tree via kodama ─────────────────────────────────────────────────
|
||||||
|
if args.upgma {
|
||||||
|
let mut condensed: Vec<f64> = Vec::with_capacity(n * (n - 1) / 2);
|
||||||
|
for i in 0..n {
|
||||||
|
for j in (i + 1)..n {
|
||||||
|
condensed.push(result.matrix[[i, j]]);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
let dendro = linkage(&mut condensed, n, Method::Average);
|
||||||
|
let newick = upgma_to_newick(&dendro, &genomes);
|
||||||
|
let path = args.output.as_ref()
|
||||||
|
.map(|p| format!("{}_upgma.nwk", p.display()))
|
||||||
|
.unwrap_or_else(|| "upgma.nwk".into());
|
||||||
|
std::fs::write(&path, &newick).unwrap_or_else(|e| {
|
||||||
|
eprintln!("error writing {path}: {e}");
|
||||||
|
std::process::exit(1);
|
||||||
|
});
|
||||||
|
info!("UPGMA tree → {path}");
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── UPGMA Newick from kodama dendrogram ───────────────────────────────────────
|
||||||
|
|
||||||
|
fn upgma_to_newick(dendro: &kodama::Dendrogram<f64>, names: &[String]) -> String {
|
||||||
|
let n = names.len();
|
||||||
|
// node_labels[i]: Newick subtree string for node i (leaves 0..n, internals n..)
|
||||||
|
let mut labels: Vec<String> = names.to_vec();
|
||||||
|
// height of each node: leaves = 0, internal = dissimilarity/2
|
||||||
|
let mut heights: Vec<f64> = vec![0.0; 2 * n - 1];
|
||||||
|
|
||||||
|
for (k, step) in dendro.steps().iter().enumerate() {
|
||||||
|
let new_node = n + k;
|
||||||
|
let h = step.dissimilarity / 2.0;
|
||||||
|
heights[new_node] = h;
|
||||||
|
let c1 = step.cluster1;
|
||||||
|
let c2 = step.cluster2;
|
||||||
|
let bl1 = (h - heights[c1]).max(0.0);
|
||||||
|
let bl2 = (h - heights[c2]).max(0.0);
|
||||||
|
labels.push(format!(
|
||||||
|
"({label1}:{bl1:.6},{label2}:{bl2:.6})",
|
||||||
|
label1 = labels[c1],
|
||||||
|
label2 = labels[c2],
|
||||||
|
));
|
||||||
|
}
|
||||||
|
|
||||||
|
format!("{};", labels.last().unwrap())
|
||||||
|
}
|
||||||
@@ -0,0 +1,43 @@
|
|||||||
|
use std::io::{self, BufWriter};
|
||||||
|
use std::path::PathBuf;
|
||||||
|
|
||||||
|
use clap::Args;
|
||||||
|
use obikindex::KmerIndex;
|
||||||
|
use obikseq::set_k;
|
||||||
|
use tracing::info;
|
||||||
|
|
||||||
|
#[derive(Args)]
|
||||||
|
pub struct DumpArgs {
|
||||||
|
/// Index directory to dump
|
||||||
|
pub index: PathBuf,
|
||||||
|
|
||||||
|
/// Output presence/absence (0/1) even if the index stores counts
|
||||||
|
#[arg(long, default_value_t = false)]
|
||||||
|
pub force_presence: bool,
|
||||||
|
|
||||||
|
/// Prepend partition and layer columns to each row
|
||||||
|
#[arg(long, default_value_t = false)]
|
||||||
|
pub debug: bool,
|
||||||
|
}
|
||||||
|
|
||||||
|
pub fn run(args: DumpArgs) {
|
||||||
|
let idx = KmerIndex::open(&args.index).unwrap_or_else(|e| {
|
||||||
|
eprintln!("error opening index: {e}");
|
||||||
|
std::process::exit(1);
|
||||||
|
});
|
||||||
|
|
||||||
|
set_k(idx.kmer_size());
|
||||||
|
info!(
|
||||||
|
"dumping {} partitions, {} genome(s)",
|
||||||
|
idx.n_partitions(),
|
||||||
|
idx.meta().genomes.len()
|
||||||
|
);
|
||||||
|
|
||||||
|
let stdout = io::stdout();
|
||||||
|
let mut out = BufWriter::new(stdout.lock());
|
||||||
|
|
||||||
|
idx.dump(&mut out, args.force_presence, args.debug).unwrap_or_else(|e| {
|
||||||
|
eprintln!("dump error: {e}");
|
||||||
|
std::process::exit(1);
|
||||||
|
});
|
||||||
|
}
|
||||||
@@ -6,7 +6,7 @@ use obikseq::{set_k, set_m};
|
|||||||
use obisys::Reporter;
|
use obisys::Reporter;
|
||||||
use tracing::info;
|
use tracing::info;
|
||||||
|
|
||||||
use crate::cli::CommonArgs;
|
use crate::cli::{CommonArgs, partitions_to_bits};
|
||||||
use crate::steps::scatter;
|
use crate::steps::scatter;
|
||||||
|
|
||||||
#[derive(Args)]
|
#[derive(Args)]
|
||||||
@@ -48,20 +48,32 @@ pub fn run(args: IndexArgs) {
|
|||||||
let mut rep = Reporter::new();
|
let mut rep = Reporter::new();
|
||||||
|
|
||||||
// ── Open or create the index ─────────────────────────────────────────────
|
// ── Open or create the index ─────────────────────────────────────────────
|
||||||
let mut idx = if KmerIndex::exists(&output) {
|
let mut idx = if KmerIndex::exists(&output) && !args.force {
|
||||||
info!("resuming from existing index at {}", output.display());
|
info!("resuming from existing index at {}", output.display());
|
||||||
KmerIndex::open(&output).unwrap_or_else(|e| {
|
KmerIndex::open(&output).unwrap_or_else(|e| {
|
||||||
eprintln!("error opening index: {e}");
|
eprintln!("error opening index: {e}");
|
||||||
std::process::exit(1);
|
std::process::exit(1);
|
||||||
})
|
})
|
||||||
} else {
|
} else {
|
||||||
|
if args.force && KmerIndex::exists(&output) {
|
||||||
|
info!("--force: removing existing index at {}", output.display());
|
||||||
|
std::fs::remove_dir_all(&output).unwrap_or_else(|e| {
|
||||||
|
eprintln!("error removing existing index: {e}");
|
||||||
|
std::process::exit(1);
|
||||||
|
});
|
||||||
|
}
|
||||||
|
let n_bits = partitions_to_bits(args.common.partitions);
|
||||||
|
let effective = 1usize << n_bits;
|
||||||
|
if effective != args.common.partitions {
|
||||||
|
info!("partitions: {} → {} (next power of 2)", args.common.partitions, effective);
|
||||||
|
}
|
||||||
let config = IndexConfig {
|
let config = IndexConfig {
|
||||||
kmer_size: args.common.kmer_size,
|
kmer_size: args.common.kmer_size,
|
||||||
minimizer_size: args.common.minimizer_size,
|
minimizer_size: args.common.minimizer_size,
|
||||||
n_bits: args.common.partition_bits,
|
n_bits,
|
||||||
with_counts: args.with_counts,
|
with_counts: args.with_counts,
|
||||||
};
|
};
|
||||||
KmerIndex::create(&output, config, args.label.clone(), args.force).unwrap_or_else(|e| {
|
KmerIndex::create(&output, config, args.label.clone(), false).unwrap_or_else(|e| {
|
||||||
eprintln!("error creating index: {e}");
|
eprintln!("error creating index: {e}");
|
||||||
std::process::exit(1);
|
std::process::exit(1);
|
||||||
})
|
})
|
||||||
@@ -72,7 +84,10 @@ pub fn run(args: IndexArgs) {
|
|||||||
|
|
||||||
// ── Stage 1: scatter ─────────────────────────────────────────────────────
|
// ── Stage 1: scatter ─────────────────────────────────────────────────────
|
||||||
if idx.state() < IndexState::Scattered {
|
if idx.state() < IndexState::Scattered {
|
||||||
let first_path = args.common.inputs.first().map(PathBuf::from);
|
let paths: Vec<_> = args.common.seqfile_paths().collect();
|
||||||
|
for path in &paths {
|
||||||
|
info!("indexing: {}", path.display());
|
||||||
|
}
|
||||||
let k = idx.kmer_size();
|
let k = idx.kmer_size();
|
||||||
let level_max = args.common.level_max;
|
let level_max = args.common.level_max;
|
||||||
let theta = args.common.theta;
|
let theta = args.common.theta;
|
||||||
@@ -80,7 +95,7 @@ pub fn run(args: IndexArgs) {
|
|||||||
|
|
||||||
scatter(idx.partition_mut(), args.common.seqfile_paths(), k, level_max, theta, n_workers, &mut rep);
|
scatter(idx.partition_mut(), args.common.seqfile_paths(), k, level_max, theta, n_workers, &mut rep);
|
||||||
|
|
||||||
idx.mark_scattered(first_path.as_deref()).unwrap_or_else(|e| {
|
idx.mark_scattered().unwrap_or_else(|e| {
|
||||||
eprintln!("error marking scatter done: {e}");
|
eprintln!("error marking scatter done: {e}");
|
||||||
std::process::exit(1);
|
std::process::exit(1);
|
||||||
});
|
});
|
||||||
@@ -90,7 +105,7 @@ pub fn run(args: IndexArgs) {
|
|||||||
|
|
||||||
// ── Stage 2: dereplicate + count ─────────────────────────────────────────
|
// ── Stage 2: dereplicate + count ─────────────────────────────────────────
|
||||||
if idx.state() < IndexState::Counted {
|
if idx.state() < IndexState::Counted {
|
||||||
idx.dereplicate_and_count(&mut rep).unwrap_or_else(|e| {
|
idx.dereplicate_and_count(args.keep_intermediate, &mut rep).unwrap_or_else(|e| {
|
||||||
eprintln!("error: {e}");
|
eprintln!("error: {e}");
|
||||||
std::process::exit(1);
|
std::process::exit(1);
|
||||||
});
|
});
|
||||||
|
|||||||
@@ -0,0 +1,71 @@
|
|||||||
|
use std::path::PathBuf;
|
||||||
|
|
||||||
|
use clap::Args;
|
||||||
|
use obikindex::{KmerIndex, MergeMode};
|
||||||
|
use obikseq::{set_k, set_m};
|
||||||
|
use obisys::Reporter;
|
||||||
|
use tracing::info;
|
||||||
|
|
||||||
|
#[derive(Args)]
|
||||||
|
pub struct MergeArgs {
|
||||||
|
/// Source index directories to merge
|
||||||
|
#[arg(required = true)]
|
||||||
|
pub sources: Vec<PathBuf>,
|
||||||
|
|
||||||
|
/// Output index directory
|
||||||
|
#[arg(short, long)]
|
||||||
|
pub output: PathBuf,
|
||||||
|
|
||||||
|
/// Overwrite output directory if it already exists
|
||||||
|
#[arg(long, default_value_t = false)]
|
||||||
|
pub force: bool,
|
||||||
|
|
||||||
|
/// Force presence/absence mode even if all sources have count data
|
||||||
|
#[arg(long, default_value_t = false)]
|
||||||
|
pub force_presence: bool,
|
||||||
|
|
||||||
|
/// Disambiguate duplicate genome labels by appending .1, .2, … instead of erroring
|
||||||
|
#[arg(long, default_value_t = false)]
|
||||||
|
pub rename_duplicates: bool,
|
||||||
|
}
|
||||||
|
|
||||||
|
pub fn run(args: MergeArgs) {
|
||||||
|
let sources: Vec<KmerIndex> = args.sources.iter().map(|p| {
|
||||||
|
info!("opening source index: {}", p.display());
|
||||||
|
KmerIndex::open(p).unwrap_or_else(|e| {
|
||||||
|
eprintln!("error opening source index {}: {e}", p.display());
|
||||||
|
std::process::exit(1);
|
||||||
|
})
|
||||||
|
}).collect();
|
||||||
|
|
||||||
|
// Auto-detect mode: count if all sources have count data, presence otherwise.
|
||||||
|
// --force-presence overrides to presence regardless.
|
||||||
|
let all_have_counts = sources.iter().all(|s| s.meta().config.with_counts);
|
||||||
|
let mode = if !args.force_presence && all_have_counts {
|
||||||
|
MergeMode::Count
|
||||||
|
} else {
|
||||||
|
MergeMode::Presence
|
||||||
|
};
|
||||||
|
info!(
|
||||||
|
"merge mode: {}",
|
||||||
|
if mode == MergeMode::Count { "count" } else { "presence/absence" }
|
||||||
|
);
|
||||||
|
|
||||||
|
let source_refs: Vec<&KmerIndex> = sources.iter().collect();
|
||||||
|
|
||||||
|
set_k(sources[0].kmer_size());
|
||||||
|
set_m(sources[0].minimizer_size());
|
||||||
|
|
||||||
|
let n_genomes: usize = sources.iter().map(|s| s.meta().genomes.len()).sum();
|
||||||
|
info!(
|
||||||
|
"merging {} index(es), {} genome(s) total → {}",
|
||||||
|
sources.len(), n_genomes, args.output.display()
|
||||||
|
);
|
||||||
|
|
||||||
|
let mut rep = Reporter::new();
|
||||||
|
KmerIndex::merge(&args.output, &source_refs, mode, args.force, args.rename_duplicates, &mut rep).unwrap_or_else(|e| {
|
||||||
|
eprintln!("error merging: {e}");
|
||||||
|
std::process::exit(1);
|
||||||
|
});
|
||||||
|
rep.print();
|
||||||
|
}
|
||||||
@@ -1,3 +1,8 @@
|
|||||||
|
pub mod distance;
|
||||||
|
pub mod dump;
|
||||||
pub mod index;
|
pub mod index;
|
||||||
|
pub mod merge;
|
||||||
|
pub mod query;
|
||||||
|
pub mod rebuild;
|
||||||
pub mod superkmer;
|
pub mod superkmer;
|
||||||
pub mod unitig;
|
pub mod unitig;
|
||||||
|
|||||||
@@ -0,0 +1,281 @@
|
|||||||
|
use std::collections::HashMap;
|
||||||
|
use std::io::{self, BufWriter, Write};
|
||||||
|
use std::path::PathBuf;
|
||||||
|
|
||||||
|
use clap::Args;
|
||||||
|
use obikindex::KmerIndex;
|
||||||
|
use obiread::record::{SeqRecord, parse_chunk};
|
||||||
|
use obiread::chunk::read_sequence_chunks;
|
||||||
|
use obikseq::{RoutableSuperKmer, set_k, set_m};
|
||||||
|
use obiskbuilder::SuperKmerIter;
|
||||||
|
use tracing::info;
|
||||||
|
|
||||||
|
// ── CLI ───────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
#[derive(Args)]
|
||||||
|
pub struct QueryArgs {
|
||||||
|
/// Index directory
|
||||||
|
pub index: PathBuf,
|
||||||
|
|
||||||
|
/// Input sequences (FASTA/FASTQ, optionally gzip-compressed)
|
||||||
|
#[arg(num_args = 1..)]
|
||||||
|
pub inputs: Vec<String>,
|
||||||
|
|
||||||
|
/// Also report per-position coverage vectors (cov_<i> per genome)
|
||||||
|
#[arg(long)]
|
||||||
|
pub detail: bool,
|
||||||
|
|
||||||
|
/// Enable 1-mismatch approximate matching
|
||||||
|
#[arg(long)]
|
||||||
|
pub mismatch: bool,
|
||||||
|
|
||||||
|
/// Count k-mers absent from the index (adds kmer_missing annotation)
|
||||||
|
#[arg(long)]
|
||||||
|
pub count_missing: bool,
|
||||||
|
|
||||||
|
/// Report per-genome presence (0/1) instead of raw counts
|
||||||
|
#[arg(long)]
|
||||||
|
pub force_presence: bool,
|
||||||
|
|
||||||
|
/// Minimum accumulated match count to declare a genome present (implies --force-presence)
|
||||||
|
#[arg(long, default_value_t = 1)]
|
||||||
|
pub presence_threshold: u32,
|
||||||
|
|
||||||
|
/// Number of worker threads
|
||||||
|
#[arg(
|
||||||
|
short = 'T',
|
||||||
|
long,
|
||||||
|
default_value_t = std::thread::available_parallelism()
|
||||||
|
.map(|n| n.get())
|
||||||
|
.unwrap_or(1)
|
||||||
|
)]
|
||||||
|
pub threads: usize,
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── SKDesc — one occurrence of a superkmer in the batch ───────────────────────
|
||||||
|
|
||||||
|
/// Describes one occurrence of a superkmer in the query batch.
|
||||||
|
pub struct SKDesc {
|
||||||
|
/// Index of the source sequence within the batch.
|
||||||
|
pub seq_idx: u32,
|
||||||
|
/// Kmer offset of the first kmer of this superkmer within its sequence.
|
||||||
|
/// Computed as the cumulative number of kmers emitted before this superkmer
|
||||||
|
/// in the same sequence. Used for `--detail` coverage vectors.
|
||||||
|
pub kmer_offset: u32,
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── QueryBatch ────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
/// A batch of query sequences with their superkmers deduplicated.
|
||||||
|
///
|
||||||
|
/// Each unique `RoutableSuperKmer` maps to all the (seq_idx, kmer_offset)
|
||||||
|
/// positions it occupies across the batch. The superkmer is queried once
|
||||||
|
/// per partition; the result is broadcast to every SKDesc entry.
|
||||||
|
pub struct QueryBatch {
|
||||||
|
/// Sequence ids in batch order.
|
||||||
|
pub ids: Vec<String>,
|
||||||
|
/// Raw sequence bytes (for output), in batch order.
|
||||||
|
pub seqs: Vec<Vec<u8>>,
|
||||||
|
/// Per-sequence total kmer count (kmer_count + kmer_missing).
|
||||||
|
pub n_kmers: Vec<u32>,
|
||||||
|
/// Deduplicated superkmer map.
|
||||||
|
pub map: HashMap<RoutableSuperKmer, Vec<SKDesc>>,
|
||||||
|
}
|
||||||
|
|
||||||
|
impl QueryBatch {
|
||||||
|
/// Build a batch from a vec of parsed sequence records.
|
||||||
|
pub fn from_records(
|
||||||
|
records: Vec<SeqRecord>,
|
||||||
|
k: usize,
|
||||||
|
level_max: usize,
|
||||||
|
theta: f64,
|
||||||
|
) -> Self {
|
||||||
|
let mut ids = Vec::with_capacity(records.len());
|
||||||
|
let mut seqs = Vec::with_capacity(records.len());
|
||||||
|
let mut n_kmers = Vec::with_capacity(records.len());
|
||||||
|
let mut map: HashMap<RoutableSuperKmer, Vec<SKDesc>> = HashMap::new();
|
||||||
|
|
||||||
|
for (seq_idx, record) in records.into_iter().enumerate() {
|
||||||
|
let mut kmer_offset = 0u32;
|
||||||
|
|
||||||
|
for rsk in SuperKmerIter::new(&record.normalized, k, level_max, theta) {
|
||||||
|
let n = (rsk.seql() - k + 1) as u32;
|
||||||
|
map.entry(rsk)
|
||||||
|
.or_default()
|
||||||
|
.push(SKDesc { seq_idx: seq_idx as u32, kmer_offset });
|
||||||
|
kmer_offset += n;
|
||||||
|
}
|
||||||
|
|
||||||
|
ids.push(record.id);
|
||||||
|
seqs.push(record.sequence);
|
||||||
|
n_kmers.push(kmer_offset);
|
||||||
|
}
|
||||||
|
|
||||||
|
Self { ids, seqs, n_kmers, map }
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Split the superkmer map by partition index.
|
||||||
|
///
|
||||||
|
/// Returns a vec of length `n_partitions`; each slot holds the RSK refs
|
||||||
|
/// whose minimizer routes to that partition.
|
||||||
|
pub fn split_by_partition(&self, n_partitions: usize) -> Vec<Vec<&RoutableSuperKmer>> {
|
||||||
|
let mask = (n_partitions as u64) - 1;
|
||||||
|
let mut by_part: Vec<Vec<&RoutableSuperKmer>> = vec![Vec::new(); n_partitions];
|
||||||
|
for rsk in self.map.keys() {
|
||||||
|
let part = (rsk.minimizer().seq_hash() & mask) as usize;
|
||||||
|
by_part[part].push(rsk);
|
||||||
|
}
|
||||||
|
by_part
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Per-sequence accumulator ──────────────────────────────────────────────────
|
||||||
|
|
||||||
|
struct SeqAcc {
|
||||||
|
kmer_count: u32,
|
||||||
|
kmer_missing: u32,
|
||||||
|
/// Per-genome accumulated count (count mode) or presence sum (presence mode).
|
||||||
|
genome_totals: Vec<u32>,
|
||||||
|
}
|
||||||
|
|
||||||
|
impl SeqAcc {
|
||||||
|
fn new(n_genomes: usize) -> Self {
|
||||||
|
Self {
|
||||||
|
kmer_count: 0,
|
||||||
|
kmer_missing: 0,
|
||||||
|
genome_totals: vec![0u32; n_genomes],
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Entry point ───────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
pub fn run(args: QueryArgs) {
|
||||||
|
let idx = KmerIndex::open(&args.index).unwrap_or_else(|e| {
|
||||||
|
eprintln!("error opening index: {e}");
|
||||||
|
std::process::exit(1);
|
||||||
|
});
|
||||||
|
|
||||||
|
set_k(idx.kmer_size());
|
||||||
|
set_m(idx.minimizer_size());
|
||||||
|
|
||||||
|
let k = idx.kmer_size();
|
||||||
|
let n_genomes = idx.meta().genomes.len();
|
||||||
|
let n_partitions = idx.n_partitions();
|
||||||
|
let with_counts = idx.meta().config.with_counts;
|
||||||
|
|
||||||
|
info!(
|
||||||
|
"query: k={k}, {} genome(s), with_counts={with_counts}, mismatch={}, detail={}",
|
||||||
|
n_genomes, args.mismatch, args.detail
|
||||||
|
);
|
||||||
|
|
||||||
|
if args.mismatch {
|
||||||
|
eprintln!("warning: --mismatch not yet implemented, ignored");
|
||||||
|
}
|
||||||
|
if args.detail {
|
||||||
|
eprintln!("warning: --detail not yet implemented, ignored");
|
||||||
|
}
|
||||||
|
|
||||||
|
let paths: Vec<PathBuf> = args.inputs.iter().map(PathBuf::from).collect();
|
||||||
|
let mut out = BufWriter::new(io::stdout());
|
||||||
|
|
||||||
|
for path in &paths {
|
||||||
|
let chunks = read_sequence_chunks(path.to_str().unwrap_or(""))
|
||||||
|
.unwrap_or_else(|e| {
|
||||||
|
eprintln!("error opening {}: {e}", path.display());
|
||||||
|
std::process::exit(1);
|
||||||
|
});
|
||||||
|
|
||||||
|
for chunk_result in chunks {
|
||||||
|
let chunk = chunk_result.unwrap_or_else(|e| {
|
||||||
|
eprintln!("read error: {e}");
|
||||||
|
std::process::exit(1);
|
||||||
|
});
|
||||||
|
|
||||||
|
let records = parse_chunk(&chunk, k);
|
||||||
|
if records.is_empty() {
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
let batch = QueryBatch::from_records(records, k, 6, 0.7);
|
||||||
|
let n_seqs = batch.ids.len();
|
||||||
|
|
||||||
|
let mut accs: Vec<SeqAcc> = (0..n_seqs).map(|_| SeqAcc::new(n_genomes)).collect();
|
||||||
|
|
||||||
|
let by_part = batch.split_by_partition(n_partitions);
|
||||||
|
|
||||||
|
for (part_idx, part_sks) in by_part.iter().enumerate() {
|
||||||
|
if part_sks.is_empty() {
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
let kmer_results = idx
|
||||||
|
.partition()
|
||||||
|
.query_partition(part_idx, part_sks, k, n_genomes, with_counts)
|
||||||
|
.unwrap_or_else(|e| {
|
||||||
|
eprintln!("query error on partition {part_idx}: {e}");
|
||||||
|
std::process::exit(1);
|
||||||
|
});
|
||||||
|
|
||||||
|
let presence = args.force_presence || !with_counts;
|
||||||
|
let threshold = args.presence_threshold;
|
||||||
|
|
||||||
|
for (rsk, sk_kmer_results) in part_sks.iter().zip(kmer_results.iter()) {
|
||||||
|
let descs = batch.map.get(*rsk).expect("rsk must be in map");
|
||||||
|
for desc in descs {
|
||||||
|
let acc = &mut accs[desc.seq_idx as usize];
|
||||||
|
for hit in sk_kmer_results.iter() {
|
||||||
|
match hit {
|
||||||
|
None => acc.kmer_missing += 1,
|
||||||
|
Some(row) => {
|
||||||
|
acc.kmer_count += 1;
|
||||||
|
for (g, &v) in row.iter().enumerate() {
|
||||||
|
acc.genome_totals[g] += if presence {
|
||||||
|
u32::from(v >= threshold)
|
||||||
|
} else {
|
||||||
|
v
|
||||||
|
};
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
emit_batch(&batch, &accs, idx.meta(), args.count_missing, &mut out);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Output ────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
fn emit_batch(
|
||||||
|
batch: &QueryBatch,
|
||||||
|
accs: &[SeqAcc],
|
||||||
|
meta: &obikindex::meta::IndexMeta,
|
||||||
|
count_missing: bool,
|
||||||
|
out: &mut impl Write,
|
||||||
|
) {
|
||||||
|
for (seq_idx, (id, seq)) in batch.ids.iter().zip(batch.seqs.iter()).enumerate() {
|
||||||
|
let acc = &accs[seq_idx];
|
||||||
|
let mut ann = serde_json::Map::new();
|
||||||
|
|
||||||
|
ann.insert("kmer_count".into(), acc.kmer_count.into());
|
||||||
|
if count_missing {
|
||||||
|
ann.insert("kmer_missing".into(), acc.kmer_missing.into());
|
||||||
|
}
|
||||||
|
let mut match_map = serde_json::Map::new();
|
||||||
|
for (g, label) in meta.genomes.iter().enumerate() {
|
||||||
|
match_map.insert(label.clone(), acc.genome_totals[g].into());
|
||||||
|
}
|
||||||
|
ann.insert("kmer_strict_matches".into(), match_map.into());
|
||||||
|
|
||||||
|
let ann_str = serde_json::to_string(&ann).unwrap_or_else(|_| "{}".to_string());
|
||||||
|
|
||||||
|
// OBITools4 FASTA format: >id {"key":value,...}
|
||||||
|
let _ = writeln!(out, ">{id} {ann_str}");
|
||||||
|
let _ = out.write_all(seq);
|
||||||
|
let _ = out.write_all(b"\n");
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -0,0 +1,111 @@
|
|||||||
|
use std::path::PathBuf;
|
||||||
|
|
||||||
|
use clap::Args;
|
||||||
|
use obikindex::{KmerIndex, MergeMode};
|
||||||
|
use obikpartitionner::filter::{
|
||||||
|
KmerFilter, MaxGenomeCount, MaxGenomeFraction, MaxTotalCount,
|
||||||
|
MinGenomeCount, MinGenomeFraction, MinTotalCount,
|
||||||
|
};
|
||||||
|
use obisys::Reporter;
|
||||||
|
use obikseq::{set_k, set_m};
|
||||||
|
use tracing::info;
|
||||||
|
|
||||||
|
#[derive(Args)]
|
||||||
|
pub struct RebuildArgs {
|
||||||
|
/// Source index directory
|
||||||
|
pub source: PathBuf,
|
||||||
|
|
||||||
|
/// Output index directory
|
||||||
|
#[arg(short, long)]
|
||||||
|
pub output: PathBuf,
|
||||||
|
|
||||||
|
/// Minimum fraction of genomes containing the k-mer [0.0–1.0]
|
||||||
|
#[arg(long)]
|
||||||
|
pub min_quorum_frac: Option<f64>,
|
||||||
|
|
||||||
|
/// Maximum fraction of genomes containing the k-mer [0.0–1.0]
|
||||||
|
#[arg(long)]
|
||||||
|
pub max_quorum_frac: Option<f64>,
|
||||||
|
|
||||||
|
/// Minimum number of genomes containing the k-mer
|
||||||
|
#[arg(long)]
|
||||||
|
pub min_quorum_count: Option<usize>,
|
||||||
|
|
||||||
|
/// Maximum number of genomes containing the k-mer
|
||||||
|
#[arg(long)]
|
||||||
|
pub max_quorum_count: Option<usize>,
|
||||||
|
|
||||||
|
/// Minimum total count across all genomes (count index only)
|
||||||
|
#[arg(long)]
|
||||||
|
pub min_total_count: Option<u32>,
|
||||||
|
|
||||||
|
/// Maximum total count across all genomes (count index only)
|
||||||
|
#[arg(long)]
|
||||||
|
pub max_total_count: Option<u32>,
|
||||||
|
|
||||||
|
/// Per-genome count threshold for presence in quorum filters (default 0)
|
||||||
|
#[arg(long, default_value = "0")]
|
||||||
|
pub presence_threshold: u32,
|
||||||
|
|
||||||
|
/// Output as presence/absence instead of counts
|
||||||
|
#[arg(long)]
|
||||||
|
pub presence: bool,
|
||||||
|
|
||||||
|
/// Overwrite existing output directory
|
||||||
|
#[arg(short, long)]
|
||||||
|
pub force: bool,
|
||||||
|
}
|
||||||
|
|
||||||
|
pub fn run(args: RebuildArgs) {
|
||||||
|
let src = KmerIndex::open(&args.source).unwrap_or_else(|e| {
|
||||||
|
eprintln!("error opening source index: {e}");
|
||||||
|
std::process::exit(1);
|
||||||
|
});
|
||||||
|
|
||||||
|
set_k(src.kmer_size());
|
||||||
|
set_m(src.minimizer_size());
|
||||||
|
|
||||||
|
let n_genomes = src.meta().genomes.len();
|
||||||
|
let mode = if args.presence || !src.meta().config.with_counts {
|
||||||
|
MergeMode::Presence
|
||||||
|
} else {
|
||||||
|
MergeMode::Count
|
||||||
|
};
|
||||||
|
|
||||||
|
info!(
|
||||||
|
"rebuild: {} genome(s), mode={:?}, source={}",
|
||||||
|
n_genomes, mode, args.source.display()
|
||||||
|
);
|
||||||
|
|
||||||
|
let pt = args.presence_threshold;
|
||||||
|
let mut filters: Vec<Box<dyn KmerFilter>> = Vec::new();
|
||||||
|
|
||||||
|
if let Some(v) = args.min_quorum_frac {
|
||||||
|
filters.push(Box::new(MinGenomeFraction { frac: v, threshold: pt }));
|
||||||
|
}
|
||||||
|
if let Some(v) = args.max_quorum_frac {
|
||||||
|
filters.push(Box::new(MaxGenomeFraction { frac: v, threshold: pt }));
|
||||||
|
}
|
||||||
|
if let Some(v) = args.min_quorum_count {
|
||||||
|
filters.push(Box::new(MinGenomeCount { count: v, threshold: pt }));
|
||||||
|
}
|
||||||
|
if let Some(v) = args.max_quorum_count {
|
||||||
|
filters.push(Box::new(MaxGenomeCount { count: v, threshold: pt }));
|
||||||
|
}
|
||||||
|
if let Some(v) = args.min_total_count {
|
||||||
|
filters.push(Box::new(MinTotalCount { total: v }));
|
||||||
|
}
|
||||||
|
if let Some(v) = args.max_total_count {
|
||||||
|
filters.push(Box::new(MaxTotalCount { total: v }));
|
||||||
|
}
|
||||||
|
|
||||||
|
let mut rep = Reporter::new();
|
||||||
|
KmerIndex::rebuild(&args.output, &src, &filters, mode, args.force, &mut rep)
|
||||||
|
.unwrap_or_else(|e| {
|
||||||
|
eprintln!("error rebuilding index: {e}");
|
||||||
|
std::process::exit(1);
|
||||||
|
});
|
||||||
|
|
||||||
|
rep.print();
|
||||||
|
info!("rebuilt index → {}", args.output.display());
|
||||||
|
}
|
||||||
@@ -5,7 +5,7 @@ use clap::Args;
|
|||||||
use obifastwrite::write_scatter;
|
use obifastwrite::write_scatter;
|
||||||
use obikseq::{RoutableSuperKmer, set_k, set_m};
|
use obikseq::{RoutableSuperKmer, set_k, set_m};
|
||||||
|
|
||||||
use crate::cli::{CommonArgs, PipelineData, open_chunks};
|
use crate::cli::{CommonArgs, PipelineData, open_chunks, partitions_to_bits};
|
||||||
|
|
||||||
#[derive(Args)]
|
#[derive(Args)]
|
||||||
pub struct SuperkmerArgs {
|
pub struct SuperkmerArgs {
|
||||||
@@ -38,7 +38,7 @@ pub fn run(args: SuperkmerArgs) {
|
|||||||
let m = args.common.minimizer_size;
|
let m = args.common.minimizer_size;
|
||||||
let theta = args.common.theta;
|
let theta = args.common.theta;
|
||||||
let level_max = args.common.level_max;
|
let level_max = args.common.level_max;
|
||||||
let partition_bits = args.common.partition_bits;
|
let partition_bits = partitions_to_bits(args.common.partitions);
|
||||||
let n_workers = args.common.threads.max(1);
|
let n_workers = args.common.threads.max(1);
|
||||||
|
|
||||||
set_k(k);
|
set_k(k);
|
||||||
|
|||||||
@@ -18,6 +18,16 @@ enum Commands {
|
|||||||
Superkmer(cmd::superkmer::SuperkmerArgs),
|
Superkmer(cmd::superkmer::SuperkmerArgs),
|
||||||
/// Build the complete genome index (scatter → dereplicate → count → layered MPHF)
|
/// Build the complete genome index (scatter → dereplicate → count → layered MPHF)
|
||||||
Index(cmd::index::IndexArgs),
|
Index(cmd::index::IndexArgs),
|
||||||
|
/// Merge multiple built indexes into one
|
||||||
|
Merge(cmd::merge::MergeArgs),
|
||||||
|
/// Filter and compact an existing index into a new single-layer index
|
||||||
|
Rebuild(cmd::rebuild::RebuildArgs),
|
||||||
|
/// Query an index with sequences and annotate matches
|
||||||
|
Query(cmd::query::QueryArgs),
|
||||||
|
/// Dump all indexed kmers as CSV (kmer + per-genome counts or presence)
|
||||||
|
Dump(cmd::dump::DumpArgs),
|
||||||
|
/// Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees
|
||||||
|
Distance(cmd::distance::DistanceArgs),
|
||||||
/// Dump unitigs from a built index to stdout (debug)
|
/// Dump unitigs from a built index to stdout (debug)
|
||||||
Unitig(cmd::unitig::UnitigArgs),
|
Unitig(cmd::unitig::UnitigArgs),
|
||||||
}
|
}
|
||||||
@@ -43,6 +53,11 @@ fn main() {
|
|||||||
match cli.command {
|
match cli.command {
|
||||||
Commands::Superkmer(args) => cmd::superkmer::run(args),
|
Commands::Superkmer(args) => cmd::superkmer::run(args),
|
||||||
Commands::Index(args) => cmd::index::run(args),
|
Commands::Index(args) => cmd::index::run(args),
|
||||||
|
Commands::Merge(args) => cmd::merge::run(args),
|
||||||
|
Commands::Dump(args) => cmd::dump::run(args),
|
||||||
|
Commands::Rebuild(args) => cmd::rebuild::run(args),
|
||||||
|
Commands::Query(args) => cmd::query::run(args),
|
||||||
|
Commands::Distance(args) => cmd::distance::run(args),
|
||||||
Commands::Unitig(args) => cmd::unitig::run(args),
|
Commands::Unitig(args) => cmd::unitig::run(args),
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|||||||
@@ -13,8 +13,10 @@ obikrope = { path = "../obikrope" }
|
|||||||
[dependencies]
|
[dependencies]
|
||||||
niffler = "3.0.0"
|
niffler = "3.0.0"
|
||||||
remove_dir_all = "0.8"
|
remove_dir_all = "0.8"
|
||||||
obikseq = { path = "../obikseq" }
|
obikseq = { path = "../obikseq" }
|
||||||
obiskio = { path = "../obiskio" }
|
obiskio = { path = "../obiskio" }
|
||||||
|
obidebruinj = { path = "../obidebruinj" }
|
||||||
|
obilayeredmap = { path = "../obilayeredmap" }
|
||||||
rayon = "1"
|
rayon = "1"
|
||||||
sysinfo = "0.33"
|
sysinfo = "0.33"
|
||||||
serde = { version = "1", features = ["derive"] }
|
serde = { version = "1", features = ["derive"] }
|
||||||
|
|||||||
@@ -0,0 +1,47 @@
|
|||||||
|
use obicompactvec::{PersistentBitMatrix, PersistentCompactIntMatrix};
|
||||||
|
use obilayeredmap::LayeredStore;
|
||||||
|
use obiskio::{SKError, SKResult};
|
||||||
|
|
||||||
|
use crate::partition::KmerPartition;
|
||||||
|
|
||||||
|
const INDEX_SUBDIR: &str = "index";
|
||||||
|
|
||||||
|
fn probe_n_layers(index_dir: &std::path::Path) -> usize {
|
||||||
|
let mut n = 0;
|
||||||
|
while index_dir.join(format!("layer_{n}")).exists() { n += 1; }
|
||||||
|
n
|
||||||
|
}
|
||||||
|
|
||||||
|
impl KmerPartition {
|
||||||
|
/// Open all count matrices for partition `part`, one per layer.
|
||||||
|
/// Layers without a `counts/` directory are skipped.
|
||||||
|
pub fn count_store(&self, part: usize) -> SKResult<LayeredStore<PersistentCompactIntMatrix>> {
|
||||||
|
let index_dir = self.part_dir(part).join(INDEX_SUBDIR);
|
||||||
|
if !index_dir.exists() {
|
||||||
|
return Ok(LayeredStore::new(vec![]));
|
||||||
|
}
|
||||||
|
let matrices = (0..probe_n_layers(&index_dir))
|
||||||
|
.filter_map(|l| {
|
||||||
|
let dir = index_dir.join(format!("layer_{l}")).join("counts");
|
||||||
|
dir.exists().then(|| PersistentCompactIntMatrix::open(&dir).map_err(SKError::Io))
|
||||||
|
})
|
||||||
|
.collect::<SKResult<Vec<_>>>()?;
|
||||||
|
Ok(LayeredStore::new(matrices))
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Open all presence matrices for partition `part`, one per layer.
|
||||||
|
/// Layers without a `presence/` directory are skipped.
|
||||||
|
pub fn presence_store(&self, part: usize) -> SKResult<LayeredStore<PersistentBitMatrix>> {
|
||||||
|
let index_dir = self.part_dir(part).join(INDEX_SUBDIR);
|
||||||
|
if !index_dir.exists() {
|
||||||
|
return Ok(LayeredStore::new(vec![]));
|
||||||
|
}
|
||||||
|
let matrices = (0..probe_n_layers(&index_dir))
|
||||||
|
.filter_map(|l| {
|
||||||
|
let dir = index_dir.join(format!("layer_{l}")).join("presence");
|
||||||
|
dir.exists().then(|| PersistentBitMatrix::open(&dir).map_err(SKError::Io))
|
||||||
|
})
|
||||||
|
.collect::<SKResult<Vec<_>>>()?;
|
||||||
|
Ok(LayeredStore::new(matrices))
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -0,0 +1,133 @@
|
|||||||
|
use obicompactvec::{PersistentBitMatrix, PersistentCompactIntMatrix};
|
||||||
|
use obikseq::CanonicalKmer;
|
||||||
|
use obiskio::{SKError, SKResult, UnitigFileReader};
|
||||||
|
use obilayeredmap::OLMError;
|
||||||
|
use obilayeredmap::MphfLayer;
|
||||||
|
|
||||||
|
use crate::partition::KmerPartition;
|
||||||
|
|
||||||
|
const INDEX_SUBDIR: &str = "index";
|
||||||
|
|
||||||
|
fn olm_to_sk(e: OLMError) -> SKError {
|
||||||
|
match e {
|
||||||
|
OLMError::Io(e) => SKError::Io(e),
|
||||||
|
other => SKError::InvalidData { context: "dump", detail: other.to_string() },
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
impl KmerPartition {
|
||||||
|
/// Iterate all indexed kmers in partition `part`, calling `cb(kmer, row)` for each.
|
||||||
|
///
|
||||||
|
/// `use_counts = true` → reads count columns (u32 values per genome).
|
||||||
|
/// `use_counts = false` → reads presence columns, converted to 0/1 u32.
|
||||||
|
///
|
||||||
|
/// If no data matrix exists for a layer (pure set-membership, single genome),
|
||||||
|
/// a row of `n_genomes` ones is emitted for every kmer in that layer.
|
||||||
|
pub fn iter_partition_kmers(
|
||||||
|
&self,
|
||||||
|
part: usize,
|
||||||
|
use_counts: bool,
|
||||||
|
n_genomes: usize,
|
||||||
|
mut cb: impl FnMut(CanonicalKmer, Box<[u32]>),
|
||||||
|
) -> SKResult<()> {
|
||||||
|
let index_dir = self.part_dir(part).join(INDEX_SUBDIR);
|
||||||
|
if !index_dir.exists() {
|
||||||
|
return Ok(());
|
||||||
|
}
|
||||||
|
|
||||||
|
// Discover layers by probing layer_0, layer_1, … until one is absent.
|
||||||
|
// PartitionMeta (meta.json) is only created by the merge path, not by
|
||||||
|
// the initial single-genome build, so we cannot rely on it here.
|
||||||
|
let mut l = 0;
|
||||||
|
loop {
|
||||||
|
let layer_dir = index_dir.join(format!("layer_{l}"));
|
||||||
|
if !layer_dir.exists() { break; }
|
||||||
|
l += 1;
|
||||||
|
let mphf = MphfLayer::open(&layer_dir).map_err(olm_to_sk)?;
|
||||||
|
let reader = UnitigFileReader::open(&layer_dir.join("unitigs.bin"))?;
|
||||||
|
|
||||||
|
let counts_dir = layer_dir.join("counts");
|
||||||
|
let presence_dir = layer_dir.join("presence");
|
||||||
|
|
||||||
|
if use_counts && counts_dir.exists() {
|
||||||
|
let mat = PersistentCompactIntMatrix::open(&counts_dir).map_err(SKError::Io)?;
|
||||||
|
for (kmer, _, _) in reader.iter_indexed_canonical_kmers() {
|
||||||
|
if let Some(slot) = mphf.find(kmer) {
|
||||||
|
cb(kmer, mat.row(slot));
|
||||||
|
}
|
||||||
|
}
|
||||||
|
} else if !use_counts && presence_dir.exists() {
|
||||||
|
let mat = PersistentBitMatrix::open(&presence_dir).map_err(SKError::Io)?;
|
||||||
|
for (kmer, _, _) in reader.iter_indexed_canonical_kmers() {
|
||||||
|
if let Some(slot) = mphf.find(kmer) {
|
||||||
|
let row: Box<[u32]> = mat.row(slot).iter().map(|&b| b as u32).collect();
|
||||||
|
cb(kmer, row);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
} else {
|
||||||
|
// No data matrix: pure set-membership layer, all kmers belong to every genome.
|
||||||
|
let all_present: Box<[u32]> = vec![1u32; n_genomes].into();
|
||||||
|
for (kmer, _, _) in reader.iter_indexed_canonical_kmers() {
|
||||||
|
if mphf.find(kmer).is_some() {
|
||||||
|
cb(kmer, all_present.clone());
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
Ok(())
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Like [`iter_partition_kmers`] but the callback also receives the layer index,
|
||||||
|
/// enabling debug output that identifies where each kmer was stored.
|
||||||
|
pub fn iter_partition_kmers_located(
|
||||||
|
&self,
|
||||||
|
part: usize,
|
||||||
|
use_counts: bool,
|
||||||
|
n_genomes: usize,
|
||||||
|
mut cb: impl FnMut(usize, usize, CanonicalKmer, Box<[u32]>),
|
||||||
|
) -> SKResult<()> {
|
||||||
|
let index_dir = self.part_dir(part).join(INDEX_SUBDIR);
|
||||||
|
if !index_dir.exists() {
|
||||||
|
return Ok(());
|
||||||
|
}
|
||||||
|
|
||||||
|
let mut layer = 0;
|
||||||
|
loop {
|
||||||
|
let layer_dir = index_dir.join(format!("layer_{layer}"));
|
||||||
|
if !layer_dir.exists() { break; }
|
||||||
|
let mphf = MphfLayer::open(&layer_dir).map_err(olm_to_sk)?;
|
||||||
|
let reader = UnitigFileReader::open(&layer_dir.join("unitigs.bin"))?;
|
||||||
|
|
||||||
|
let counts_dir = layer_dir.join("counts");
|
||||||
|
let presence_dir = layer_dir.join("presence");
|
||||||
|
|
||||||
|
if use_counts && counts_dir.exists() {
|
||||||
|
let mat = PersistentCompactIntMatrix::open(&counts_dir).map_err(SKError::Io)?;
|
||||||
|
for (kmer, _, _) in reader.iter_indexed_canonical_kmers() {
|
||||||
|
if let Some(slot) = mphf.find(kmer) {
|
||||||
|
cb(part, layer, kmer, mat.row(slot));
|
||||||
|
}
|
||||||
|
}
|
||||||
|
} else if !use_counts && presence_dir.exists() {
|
||||||
|
let mat = PersistentBitMatrix::open(&presence_dir).map_err(SKError::Io)?;
|
||||||
|
for (kmer, _, _) in reader.iter_indexed_canonical_kmers() {
|
||||||
|
if let Some(slot) = mphf.find(kmer) {
|
||||||
|
let row: Box<[u32]> = mat.row(slot).iter().map(|&b| b as u32).collect();
|
||||||
|
cb(part, layer, kmer, row);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
} else {
|
||||||
|
let all_present: Box<[u32]> = vec![1u32; n_genomes].into();
|
||||||
|
for (kmer, _, _) in reader.iter_indexed_canonical_kmers() {
|
||||||
|
if mphf.find(kmer).is_some() {
|
||||||
|
cb(part, layer, kmer, all_present.clone());
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
layer += 1;
|
||||||
|
}
|
||||||
|
|
||||||
|
Ok(())
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -0,0 +1,87 @@
|
|||||||
|
/// Trait for kmer row filters used in the rebuild command.
|
||||||
|
///
|
||||||
|
/// `row` contains raw per-genome counts (or 0/1 for presence/absence data).
|
||||||
|
/// `n_genomes` equals `row.len()`.
|
||||||
|
pub trait KmerFilter: Send + Sync {
|
||||||
|
fn passes(&self, row: &[u32], n_genomes: usize) -> bool;
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Quorum filters ─────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
fn present_count(row: &[u32], threshold: u32) -> usize {
|
||||||
|
row.iter().filter(|&&v| v > threshold).count()
|
||||||
|
}
|
||||||
|
|
||||||
|
/// At least `frac` fraction of genomes contain this kmer (count > `threshold`).
|
||||||
|
pub struct MinGenomeFraction {
|
||||||
|
pub frac: f64,
|
||||||
|
pub threshold: u32,
|
||||||
|
}
|
||||||
|
|
||||||
|
impl KmerFilter for MinGenomeFraction {
|
||||||
|
fn passes(&self, row: &[u32], n_genomes: usize) -> bool {
|
||||||
|
let p = present_count(row, self.threshold);
|
||||||
|
p as f64 / n_genomes as f64 >= self.frac
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/// At most `frac` fraction of genomes contain this kmer (count > `threshold`).
|
||||||
|
pub struct MaxGenomeFraction {
|
||||||
|
pub frac: f64,
|
||||||
|
pub threshold: u32,
|
||||||
|
}
|
||||||
|
|
||||||
|
impl KmerFilter for MaxGenomeFraction {
|
||||||
|
fn passes(&self, row: &[u32], n_genomes: usize) -> bool {
|
||||||
|
let p = present_count(row, self.threshold);
|
||||||
|
p as f64 / n_genomes as f64 <= self.frac
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/// At least `count` genomes contain this kmer (count > `threshold`).
|
||||||
|
pub struct MinGenomeCount {
|
||||||
|
pub count: usize,
|
||||||
|
pub threshold: u32,
|
||||||
|
}
|
||||||
|
|
||||||
|
impl KmerFilter for MinGenomeCount {
|
||||||
|
fn passes(&self, row: &[u32], _n_genomes: usize) -> bool {
|
||||||
|
present_count(row, self.threshold) >= self.count
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/// At most `count` genomes contain this kmer (count > `threshold`).
|
||||||
|
pub struct MaxGenomeCount {
|
||||||
|
pub count: usize,
|
||||||
|
pub threshold: u32,
|
||||||
|
}
|
||||||
|
|
||||||
|
impl KmerFilter for MaxGenomeCount {
|
||||||
|
fn passes(&self, row: &[u32], _n_genomes: usize) -> bool {
|
||||||
|
present_count(row, self.threshold) <= self.count
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Total-count filters (count indexes only) ───────────────────────────────────
|
||||||
|
|
||||||
|
/// Sum of counts across all genomes >= `total`.
|
||||||
|
pub struct MinTotalCount {
|
||||||
|
pub total: u32,
|
||||||
|
}
|
||||||
|
|
||||||
|
impl KmerFilter for MinTotalCount {
|
||||||
|
fn passes(&self, row: &[u32], _n_genomes: usize) -> bool {
|
||||||
|
row.iter().sum::<u32>() >= self.total
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Sum of counts across all genomes <= `total`.
|
||||||
|
pub struct MaxTotalCount {
|
||||||
|
pub total: u32,
|
||||||
|
}
|
||||||
|
|
||||||
|
impl KmerFilter for MaxTotalCount {
|
||||||
|
fn passes(&self, row: &[u32], _n_genomes: usize) -> bool {
|
||||||
|
row.iter().sum::<u32>() <= self.total
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -0,0 +1,141 @@
|
|||||||
|
use std::fs;
|
||||||
|
use std::io;
|
||||||
|
|
||||||
|
use cacheline_ef::{CachelineEf, CachelineEfVec};
|
||||||
|
use epserde::prelude::*;
|
||||||
|
use obicompactvec::{PersistentCompactIntMatrix, PersistentCompactIntVec};
|
||||||
|
use obidebruinj::GraphDeBruijn;
|
||||||
|
use obilayeredmap::{OLMError, layer::Layer};
|
||||||
|
use obilayeredmap::meta::PartitionMeta;
|
||||||
|
use obiskio::{SKError, SKFileMeta, SKFileReader};
|
||||||
|
use ptr_hash::{PtrHash, bucket_fn::CubicEps, hash::Xx64};
|
||||||
|
|
||||||
|
use crate::partition::KmerPartition;
|
||||||
|
|
||||||
|
type Mphf = PtrHash<u64, CubicEps, CachelineEfVec<Vec<CachelineEf>>, Xx64, Vec<u8>>;
|
||||||
|
|
||||||
|
fn olm_to_sk(e: OLMError) -> SKError {
|
||||||
|
match e {
|
||||||
|
OLMError::Io(io_err) => SKError::Io(io_err),
|
||||||
|
other => SKError::InvalidData { context: "layer build", detail: other.to_string() },
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
fn remove_if_exists(path: &std::path::Path) {
|
||||||
|
if let Err(e) = fs::remove_file(path) {
|
||||||
|
if e.kind() != io::ErrorKind::NotFound {
|
||||||
|
eprintln!("warning: could not remove {}: {e}", path.display());
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
impl KmerPartition {
|
||||||
|
/// Build the layered MPHF index for partition `i`.
|
||||||
|
///
|
||||||
|
/// Returns the number of canonical k-mers indexed, or 0 if the partition
|
||||||
|
/// has no data or its layer was already built (resume-safe).
|
||||||
|
///
|
||||||
|
/// Abundance filtering is applied when `min_ab > 1` or `max_ab.is_some()`,
|
||||||
|
/// using `mphf1.bin` + `counts1.bin` if they exist.
|
||||||
|
/// Count payload is stored iff `with_counts` is true.
|
||||||
|
pub fn build_index_layer(
|
||||||
|
&self,
|
||||||
|
i: usize,
|
||||||
|
min_ab: u32,
|
||||||
|
max_ab: Option<u32>,
|
||||||
|
with_counts: bool,
|
||||||
|
) -> Result<usize, SKError> {
|
||||||
|
let part_dir = self.part_dir(i);
|
||||||
|
let dedup_path = part_dir.join("dereplicated.skmer.zst");
|
||||||
|
if !dedup_path.exists() {
|
||||||
|
return Ok(0);
|
||||||
|
}
|
||||||
|
|
||||||
|
let layer_dir = part_dir.join("index").join("layer_0");
|
||||||
|
if layer_dir.join("mphf.bin").exists() {
|
||||||
|
return Ok(0);
|
||||||
|
}
|
||||||
|
|
||||||
|
let filter_active = min_ab > 1 || max_ab.is_some();
|
||||||
|
let need_counts = filter_active || with_counts;
|
||||||
|
|
||||||
|
let mphf1_opt: Option<Mphf> = if need_counts {
|
||||||
|
let p = part_dir.join("mphf1.bin");
|
||||||
|
p.exists().then(|| Mphf::load_full(&p).ok()).flatten()
|
||||||
|
} else {
|
||||||
|
None
|
||||||
|
};
|
||||||
|
|
||||||
|
let counts1_opt: Option<PersistentCompactIntVec> = if need_counts {
|
||||||
|
let p = part_dir.join("counts1.bin");
|
||||||
|
p.exists()
|
||||||
|
.then(|| PersistentCompactIntVec::open(&p).ok())
|
||||||
|
.flatten()
|
||||||
|
} else {
|
||||||
|
None
|
||||||
|
};
|
||||||
|
|
||||||
|
let mut g = GraphDeBruijn::new();
|
||||||
|
let mut reader = SKFileReader::open(&dedup_path)?;
|
||||||
|
for sk in reader.iter() {
|
||||||
|
for kmer in sk.iter_canonical_kmers() {
|
||||||
|
let accept = if filter_active {
|
||||||
|
match (&mphf1_opt, &counts1_opt) {
|
||||||
|
(Some(mphf), Some(counts)) => {
|
||||||
|
let ab = counts.get(mphf.index(&kmer.raw()));
|
||||||
|
ab >= min_ab && max_ab.map_or(true, |max| ab <= max)
|
||||||
|
}
|
||||||
|
_ => true,
|
||||||
|
}
|
||||||
|
} else {
|
||||||
|
true
|
||||||
|
};
|
||||||
|
if accept {
|
||||||
|
g.push(kmer);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
let n_kmers = g.len();
|
||||||
|
g.compute_degrees();
|
||||||
|
|
||||||
|
fs::create_dir_all(&layer_dir)?;
|
||||||
|
|
||||||
|
let mut uw = Layer::<()>::unitig_writer(&layer_dir).map_err(olm_to_sk)?;
|
||||||
|
for unitig in g.iter_unitig() {
|
||||||
|
uw.write(&unitig)?;
|
||||||
|
}
|
||||||
|
uw.close()?;
|
||||||
|
|
||||||
|
if with_counts {
|
||||||
|
Layer::<PersistentCompactIntMatrix>::build(&layer_dir, |kmer| {
|
||||||
|
match (&mphf1_opt, &counts1_opt) {
|
||||||
|
(Some(mphf), Some(counts)) => counts.get(mphf.index(&kmer.raw())),
|
||||||
|
_ => 1,
|
||||||
|
}
|
||||||
|
})
|
||||||
|
.map_err(olm_to_sk)?;
|
||||||
|
} else {
|
||||||
|
Layer::<()>::build(&layer_dir).map_err(olm_to_sk)?;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Write meta.json in the index/ directory so LayeredMap::open works
|
||||||
|
// (e.g. for subsequent merge operations).
|
||||||
|
let index_dir = layer_dir.parent().expect("layer_dir has a parent");
|
||||||
|
PartitionMeta { n_layers: 1 }.save(index_dir).map_err(olm_to_sk)?;
|
||||||
|
|
||||||
|
Ok(n_kmers)
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Remove intermediate build artifacts for partition `i`.
|
||||||
|
///
|
||||||
|
/// Deletes `dereplicated.skmer.zst` (+ sidecar), `mphf1.bin`, `counts1.bin`.
|
||||||
|
pub fn remove_build_artifacts(&self, i: usize) {
|
||||||
|
let part_dir = self.part_dir(i);
|
||||||
|
let dedup = part_dir.join("dereplicated.skmer.zst");
|
||||||
|
remove_if_exists(&SKFileMeta::sidecar_path(&dedup));
|
||||||
|
remove_if_exists(&dedup);
|
||||||
|
remove_if_exists(&part_dir.join("mphf1.bin"));
|
||||||
|
remove_if_exists(&part_dir.join("counts1.bin"));
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -1,4 +1,13 @@
|
|||||||
|
pub mod filter;
|
||||||
|
mod distance;
|
||||||
|
mod dump_layer;
|
||||||
|
mod index_layer;
|
||||||
mod kmer_sort;
|
mod kmer_sort;
|
||||||
|
mod merge_layer;
|
||||||
mod partition;
|
mod partition;
|
||||||
|
mod query_layer;
|
||||||
|
mod rebuild_layer;
|
||||||
|
|
||||||
pub use partition::{KmerPartition, PARTITIONS_SUBDIR};
|
pub use filter::KmerFilter;
|
||||||
|
pub use merge_layer::MergeMode;
|
||||||
|
pub use partition::{KmerPartition, KmerSpectrum, PARTITIONS_SUBDIR};
|
||||||
|
|||||||
@@ -0,0 +1,352 @@
|
|||||||
|
use std::fs;
|
||||||
|
use std::io;
|
||||||
|
use std::path::{Path, PathBuf};
|
||||||
|
|
||||||
|
use obidebruinj::GraphDeBruijn;
|
||||||
|
use obicompactvec::{PersistentBitMatrix, PersistentBitMatrixBuilder,
|
||||||
|
PersistentBitVecBuilder,
|
||||||
|
PersistentCompactIntMatrix, PersistentCompactIntMatrixBuilder,
|
||||||
|
PersistentCompactIntVecBuilder};
|
||||||
|
use obikseq::CanonicalKmer;
|
||||||
|
use obiskio::{SKError, SKResult, UnitigFileReader};
|
||||||
|
use obilayeredmap::{Layer, LayeredMap, MphfLayer, OLMError};
|
||||||
|
use obilayeredmap::meta::PartitionMeta;
|
||||||
|
|
||||||
|
use crate::partition::KmerPartition;
|
||||||
|
|
||||||
|
// ── MergeMode ─────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
|
||||||
|
pub enum MergeMode { Presence, Count }
|
||||||
|
|
||||||
|
// ── ColBuilder — enum dispatch to avoid trait-object boxing issues ─────────────
|
||||||
|
|
||||||
|
enum ColBuilder {
|
||||||
|
Bit(PersistentBitVecBuilder),
|
||||||
|
Int(PersistentCompactIntVecBuilder),
|
||||||
|
}
|
||||||
|
|
||||||
|
impl ColBuilder {
|
||||||
|
fn set_val(&mut self, slot: usize, value: u32) {
|
||||||
|
match self {
|
||||||
|
ColBuilder::Bit(b) => b.set(slot, value > 0),
|
||||||
|
ColBuilder::Int(b) => b.set(slot, value),
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
fn close(self) -> SKResult<()> {
|
||||||
|
match self {
|
||||||
|
ColBuilder::Bit(b) => b.close().map_err(SKError::Io),
|
||||||
|
ColBuilder::Int(b) => b.close().map_err(SKError::Io),
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── SrcLayerData — opened source matrix for pass-2 lookup ─────────────────────
|
||||||
|
|
||||||
|
pub(crate) enum SrcLayerData {
|
||||||
|
/// Pure set-membership layer (no data matrix): every kmer is present in all genomes.
|
||||||
|
SetMembership,
|
||||||
|
Presence(MphfLayer, PersistentBitMatrix),
|
||||||
|
Count(MphfLayer, PersistentCompactIntMatrix),
|
||||||
|
}
|
||||||
|
|
||||||
|
impl SrcLayerData {
|
||||||
|
pub(crate) fn open(layer_dir: &Path, mode: MergeMode) -> SKResult<Self> {
|
||||||
|
let presence_dir = layer_dir.join("presence");
|
||||||
|
let counts_dir = layer_dir.join("counts");
|
||||||
|
match mode {
|
||||||
|
MergeMode::Presence => {
|
||||||
|
if presence_dir.exists() {
|
||||||
|
let mphf = MphfLayer::open(layer_dir).map_err(olm_to_sk)?;
|
||||||
|
let mat = PersistentBitMatrix::open(&presence_dir).map_err(SKError::Io)?;
|
||||||
|
Ok(SrcLayerData::Presence(mphf, mat))
|
||||||
|
} else if counts_dir.exists() {
|
||||||
|
// Source is a count index; treat count > 0 as present via ColBuilder::Bit.
|
||||||
|
let mphf = MphfLayer::open(layer_dir).map_err(olm_to_sk)?;
|
||||||
|
let mat = PersistentCompactIntMatrix::open(&counts_dir).map_err(SKError::Io)?;
|
||||||
|
Ok(SrcLayerData::Count(mphf, mat))
|
||||||
|
} else {
|
||||||
|
Ok(SrcLayerData::SetMembership)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
MergeMode::Count => {
|
||||||
|
if counts_dir.exists() {
|
||||||
|
let mphf = MphfLayer::open(layer_dir).map_err(olm_to_sk)?;
|
||||||
|
let mat = PersistentCompactIntMatrix::open(&counts_dir).map_err(SKError::Io)?;
|
||||||
|
Ok(SrcLayerData::Count(mphf, mat))
|
||||||
|
} else {
|
||||||
|
Ok(SrcLayerData::SetMembership)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Return one value per source genome for `kmer`.
|
||||||
|
pub(crate) fn lookup(&self, kmer: CanonicalKmer, n_genomes: usize) -> Vec<u32> {
|
||||||
|
match self {
|
||||||
|
SrcLayerData::SetMembership => vec![1u32; n_genomes],
|
||||||
|
SrcLayerData::Presence(mphf, mat) => {
|
||||||
|
if let Some(slot) = mphf.find(kmer) {
|
||||||
|
mat.row(slot).iter().map(|&b| b as u32).collect()
|
||||||
|
} else {
|
||||||
|
vec![0u32; n_genomes]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
SrcLayerData::Count(mphf, mat) => {
|
||||||
|
if let Some(slot) = mphf.find(kmer) {
|
||||||
|
mat.row(slot).iter().copied().collect()
|
||||||
|
} else {
|
||||||
|
vec![0u32; n_genomes]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── helpers ───────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
const INDEX_SUBDIR: &str = "index";
|
||||||
|
|
||||||
|
/// Load PartitionMeta, or recover it by probing layer directories.
|
||||||
|
/// Indexes built before meta.json was introduced lack the file.
|
||||||
|
fn load_meta(dir: &Path) -> SKResult<PartitionMeta> {
|
||||||
|
match PartitionMeta::load(dir) {
|
||||||
|
Ok(m) => Ok(m),
|
||||||
|
Err(e) if matches!(e, OLMError::Io(ref io_e) if io_e.kind() == std::io::ErrorKind::NotFound) => {
|
||||||
|
let mut n = 0usize;
|
||||||
|
while dir.join(format!("layer_{n}")).exists() { n += 1; }
|
||||||
|
let m = PartitionMeta { n_layers: n };
|
||||||
|
m.save(dir).map_err(olm_to_sk)?;
|
||||||
|
Ok(m)
|
||||||
|
}
|
||||||
|
Err(e) => Err(olm_to_sk(e)),
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
fn olm_to_sk(e: OLMError) -> SKError {
|
||||||
|
match e {
|
||||||
|
OLMError::Io(e) => SKError::Io(e),
|
||||||
|
other => SKError::InvalidData { context: "merge", detail: other.to_string() },
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
fn col_path_bit(dir: &Path, col: usize) -> PathBuf {
|
||||||
|
dir.join(format!("col_{col:06}.pbiv"))
|
||||||
|
}
|
||||||
|
|
||||||
|
fn col_path_int(dir: &Path, col: usize) -> PathBuf {
|
||||||
|
dir.join(format!("col_{col:06}.pciv"))
|
||||||
|
}
|
||||||
|
|
||||||
|
fn write_matrix_meta(dir: &Path, n: usize, n_cols: usize) -> io::Result<()> {
|
||||||
|
fs::write(dir.join("meta.json"), format!("{{\"n\":{n},\"n_cols\":{n_cols}}}\n"))
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── KmerPartition::merge_partition ────────────────────────────────────────────
|
||||||
|
|
||||||
|
impl KmerPartition {
|
||||||
|
/// Merge `sources` into destination partition `i`.
|
||||||
|
///
|
||||||
|
/// Each entry in `sources` is `(partition, n_genomes)` where `n_genomes` is
|
||||||
|
/// the number of genome columns that source contributes. A merged index
|
||||||
|
/// contributes more than one. The total new columns added to the destination
|
||||||
|
/// is `sum(n_genomes)`.
|
||||||
|
///
|
||||||
|
/// `n_dst_genomes` is the number of genome columns already in the destination
|
||||||
|
/// matrices (copied from source_0 before this call).
|
||||||
|
pub fn merge_partition(
|
||||||
|
&self,
|
||||||
|
i: usize,
|
||||||
|
sources: &[(&KmerPartition, usize)],
|
||||||
|
mode: MergeMode,
|
||||||
|
n_dst_genomes: usize,
|
||||||
|
) -> SKResult<()> {
|
||||||
|
let dst_index_dir = self.part_dir(i).join(INDEX_SUBDIR);
|
||||||
|
if !dst_index_dir.exists() {
|
||||||
|
return Ok(());
|
||||||
|
}
|
||||||
|
|
||||||
|
load_meta(&dst_index_dir)?; // ensure meta.json exists before LayeredMap::open
|
||||||
|
let dst_map = LayeredMap::<()>::open(&dst_index_dir).map_err(olm_to_sk)?;
|
||||||
|
let n_dst_layers = dst_map.n_layers();
|
||||||
|
let n_src_total: usize = sources.iter().map(|(_, n)| *n).sum();
|
||||||
|
|
||||||
|
// First merge in presence mode: init presence matrices on existing layers
|
||||||
|
// (all slots true — every kmer in those layers belongs to genome_0).
|
||||||
|
if n_dst_genomes == 1 && mode == MergeMode::Presence {
|
||||||
|
for l in 0..n_dst_layers {
|
||||||
|
let layer_dir = dst_index_dir.join(format!("layer_{l}"));
|
||||||
|
Layer::<()>::init_presence_matrix(&layer_dir, dst_map.layer(l).n())
|
||||||
|
.map_err(olm_to_sk)?;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Pass 1: classify kmers, build new-kmer de Bruijn graph ───────────
|
||||||
|
let mut g = GraphDeBruijn::new();
|
||||||
|
let mut any_new = false;
|
||||||
|
|
||||||
|
for (src, _) in sources.iter() {
|
||||||
|
let src_index_dir = src.part_dir(i).join(INDEX_SUBDIR);
|
||||||
|
if !src_index_dir.exists() { continue; }
|
||||||
|
let src_meta = load_meta(&src_index_dir)?;
|
||||||
|
|
||||||
|
for l in 0..src_meta.n_layers {
|
||||||
|
let unitigs_path = src_index_dir
|
||||||
|
.join(format!("layer_{l}")).join("unitigs.bin");
|
||||||
|
let reader = UnitigFileReader::open(&unitigs_path)?;
|
||||||
|
for (kmer, _, _) in reader.iter_indexed_canonical_kmers() {
|
||||||
|
if dst_map.query(kmer).is_none() {
|
||||||
|
g.push(kmer);
|
||||||
|
any_new = true;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// Build new layer from de Bruijn graph if there are new kmers.
|
||||||
|
let new_layer_idx = n_dst_layers;
|
||||||
|
let new_layer_dir = dst_index_dir.join(format!("layer_{new_layer_idx}"));
|
||||||
|
|
||||||
|
if any_new {
|
||||||
|
g.compute_degrees();
|
||||||
|
fs::create_dir_all(&new_layer_dir)?;
|
||||||
|
let mut uw = Layer::<()>::unitig_writer(&new_layer_dir).map_err(olm_to_sk)?;
|
||||||
|
for unitig in g.iter_unitig() {
|
||||||
|
uw.write(&unitig)?;
|
||||||
|
}
|
||||||
|
uw.close()?;
|
||||||
|
Layer::<()>::build(&new_layer_dir).map_err(olm_to_sk)?;
|
||||||
|
}
|
||||||
|
drop(g);
|
||||||
|
|
||||||
|
let new_mphf = if any_new {
|
||||||
|
Some(MphfLayer::open(&new_layer_dir).map_err(olm_to_sk)?)
|
||||||
|
} else {
|
||||||
|
None
|
||||||
|
};
|
||||||
|
let n_new = new_mphf.as_ref().map_or(0, |m| m.n());
|
||||||
|
|
||||||
|
// ── Prepare matrix directories for the new layer ──────────────────────
|
||||||
|
// Absent columns (dst genomes) are written via append_column (all-zero/false).
|
||||||
|
// Source-genome columns are created as mutable builders for pass 2.
|
||||||
|
let mut new_src_builders: Vec<ColBuilder> = if any_new {
|
||||||
|
let data_dir = match mode {
|
||||||
|
MergeMode::Presence => new_layer_dir.join("presence"),
|
||||||
|
MergeMode::Count => new_layer_dir.join("counts"),
|
||||||
|
};
|
||||||
|
fs::create_dir_all(&data_dir)?;
|
||||||
|
match mode {
|
||||||
|
MergeMode::Presence => {
|
||||||
|
PersistentBitMatrixBuilder::new(n_new, &data_dir)
|
||||||
|
.map_err(SKError::Io)?.close().map_err(SKError::Io)?;
|
||||||
|
for _ in 0..n_dst_genomes {
|
||||||
|
PersistentBitMatrix::append_column(&data_dir, |_| false)
|
||||||
|
.map_err(SKError::Io)?;
|
||||||
|
}
|
||||||
|
(0..n_src_total).map(|g| -> SKResult<ColBuilder> {
|
||||||
|
let b = PersistentBitVecBuilder::new(
|
||||||
|
n_new, &col_path_bit(&data_dir, n_dst_genomes + g))?;
|
||||||
|
Ok(ColBuilder::Bit(b))
|
||||||
|
}).collect::<SKResult<_>>()?
|
||||||
|
}
|
||||||
|
MergeMode::Count => {
|
||||||
|
PersistentCompactIntMatrixBuilder::new(n_new, &data_dir)
|
||||||
|
.map_err(SKError::Io)?.close().map_err(SKError::Io)?;
|
||||||
|
for _ in 0..n_dst_genomes {
|
||||||
|
PersistentCompactIntMatrix::append_column(&data_dir, |_| 0)
|
||||||
|
.map_err(SKError::Io)?;
|
||||||
|
}
|
||||||
|
(0..n_src_total).map(|g| -> SKResult<ColBuilder> {
|
||||||
|
let b = PersistentCompactIntVecBuilder::new(
|
||||||
|
n_new, &col_path_int(&data_dir, n_dst_genomes + g))?;
|
||||||
|
Ok(ColBuilder::Int(b))
|
||||||
|
}).collect::<SKResult<_>>()?
|
||||||
|
}
|
||||||
|
}
|
||||||
|
} else {
|
||||||
|
vec![]
|
||||||
|
};
|
||||||
|
|
||||||
|
// Builders for existing layers: n_src_total per layer.
|
||||||
|
// Columns n_dst_genomes .. n_dst_genomes + n_src_total - 1.
|
||||||
|
let mut exist_builders: Vec<Vec<ColBuilder>> = (0..n_dst_layers)
|
||||||
|
.map(|l| {
|
||||||
|
let layer_dir = dst_index_dir.join(format!("layer_{l}"));
|
||||||
|
let n = dst_map.layer(l).n();
|
||||||
|
(0..n_src_total).map(|src_g| -> SKResult<ColBuilder> {
|
||||||
|
match mode {
|
||||||
|
MergeMode::Presence => {
|
||||||
|
let data_dir = layer_dir.join("presence");
|
||||||
|
let b = PersistentBitVecBuilder::new(
|
||||||
|
n, &col_path_bit(&data_dir, n_dst_genomes + src_g))?;
|
||||||
|
Ok(ColBuilder::Bit(b))
|
||||||
|
}
|
||||||
|
MergeMode::Count => {
|
||||||
|
let data_dir = layer_dir.join("counts");
|
||||||
|
let b = PersistentCompactIntVecBuilder::new(
|
||||||
|
n, &col_path_int(&data_dir, n_dst_genomes + src_g))?;
|
||||||
|
Ok(ColBuilder::Int(b))
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}).collect::<SKResult<_>>()
|
||||||
|
})
|
||||||
|
.collect::<SKResult<_>>()?;
|
||||||
|
|
||||||
|
// ── Pass 2: fill builders ─────────────────────────────────────────────
|
||||||
|
let mut col_offset = 0usize;
|
||||||
|
for (src, src_n) in sources.iter() {
|
||||||
|
let src_index_dir = src.part_dir(i).join(INDEX_SUBDIR);
|
||||||
|
if !src_index_dir.exists() { col_offset += src_n; continue; }
|
||||||
|
let src_meta = load_meta(&src_index_dir)?;
|
||||||
|
|
||||||
|
for l in 0..src_meta.n_layers {
|
||||||
|
let src_layer_dir = src_index_dir.join(format!("layer_{l}"));
|
||||||
|
let reader = UnitigFileReader::open(&src_layer_dir.join("unitigs.bin"))?;
|
||||||
|
let src_data = SrcLayerData::open(&src_layer_dir, mode)?;
|
||||||
|
|
||||||
|
for (kmer, _, _) in reader.iter_indexed_canonical_kmers() {
|
||||||
|
let values = src_data.lookup(kmer, *src_n);
|
||||||
|
for (g, &value) in values.iter().enumerate() {
|
||||||
|
let builder_idx = col_offset + g;
|
||||||
|
if let Some((dst_layer, hit)) = dst_map.query(kmer) {
|
||||||
|
exist_builders[dst_layer][builder_idx].set_val(hit.slot, value);
|
||||||
|
} else if let Some(ref mphf) = new_mphf {
|
||||||
|
if let Some(slot) = mphf.find(kmer) {
|
||||||
|
new_src_builders[builder_idx].set_val(slot, value);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
col_offset += src_n;
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Close builders and update metadata ────────────────────────────────
|
||||||
|
for (l, builders) in exist_builders.into_iter().enumerate() {
|
||||||
|
let layer_dir = dst_index_dir.join(format!("layer_{l}"));
|
||||||
|
for b in builders { b.close()?; }
|
||||||
|
let n = dst_map.layer(l).n();
|
||||||
|
let data_dir = match mode {
|
||||||
|
MergeMode::Presence => layer_dir.join("presence"),
|
||||||
|
MergeMode::Count => layer_dir.join("counts"),
|
||||||
|
};
|
||||||
|
write_matrix_meta(&data_dir, n, n_dst_genomes + n_src_total).map_err(SKError::Io)?;
|
||||||
|
}
|
||||||
|
|
||||||
|
for b in new_src_builders { b.close()?; }
|
||||||
|
if any_new {
|
||||||
|
let data_dir = match mode {
|
||||||
|
MergeMode::Presence => new_layer_dir.join("presence"),
|
||||||
|
MergeMode::Count => new_layer_dir.join("counts"),
|
||||||
|
};
|
||||||
|
write_matrix_meta(&data_dir, n_new, n_dst_genomes + n_src_total).map_err(SKError::Io)?;
|
||||||
|
|
||||||
|
let mut part_meta = PartitionMeta::load(&dst_index_dir).map_err(olm_to_sk)?;
|
||||||
|
part_meta.n_layers = new_layer_idx + 1;
|
||||||
|
part_meta.save(&dst_index_dir).map_err(olm_to_sk)?;
|
||||||
|
}
|
||||||
|
|
||||||
|
Ok(())
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -18,7 +18,6 @@ use obiskio::{SKFileMeta, SKFileReader, SKFileWriter, SKResult};
|
|||||||
use ptr_hash::{PtrHash, PtrHashParams, bucket_fn::CubicEps, hash::Xx64};
|
use ptr_hash::{PtrHash, PtrHashParams, bucket_fn::CubicEps, hash::Xx64};
|
||||||
use rayon::prelude::*;
|
use rayon::prelude::*;
|
||||||
use remove_dir_all::remove_dir_all;
|
use remove_dir_all::remove_dir_all;
|
||||||
use serde::{Deserialize, Serialize};
|
|
||||||
use sysinfo::System;
|
use sysinfo::System;
|
||||||
|
|
||||||
use niffler::Level;
|
use niffler::Level;
|
||||||
@@ -28,18 +27,15 @@ use crate::kmer_sort::{chunk_size_from_ram, sort_unique_kmers};
|
|||||||
|
|
||||||
type Mphf = PtrHash<u64, CubicEps, CachelineEfVec<Vec<CachelineEf>>, Xx64, Vec<u8>>;
|
type Mphf = PtrHash<u64, CubicEps, CachelineEfVec<Vec<CachelineEf>>, Xx64, Vec<u8>>;
|
||||||
|
|
||||||
const META_FILENAME: &str = "partition.meta";
|
pub struct KmerSpectrum {
|
||||||
|
pub f0: u64,
|
||||||
|
pub f1: u64,
|
||||||
|
pub counts: BTreeMap<u32, u64>,
|
||||||
|
}
|
||||||
|
|
||||||
const SK_EXT: &str = "skmer.zst";
|
const SK_EXT: &str = "skmer.zst";
|
||||||
pub const PARTITIONS_SUBDIR: &str = "partitions";
|
pub const PARTITIONS_SUBDIR: &str = "partitions";
|
||||||
|
|
||||||
#[derive(Serialize, Deserialize)]
|
|
||||||
struct PartitionMeta {
|
|
||||||
n_bits: usize,
|
|
||||||
kmer_size: usize,
|
|
||||||
minimizer_size: usize,
|
|
||||||
level: u32,
|
|
||||||
}
|
|
||||||
|
|
||||||
pub struct KmerPartition {
|
pub struct KmerPartition {
|
||||||
root_path: PathBuf,
|
root_path: PathBuf,
|
||||||
n_partitions: usize,
|
n_partitions: usize,
|
||||||
@@ -98,11 +94,15 @@ impl KmerPartition {
|
|||||||
level,
|
level,
|
||||||
closed: false,
|
closed: false,
|
||||||
};
|
};
|
||||||
partition.write_meta(n_bits)?;
|
|
||||||
Ok(partition)
|
Ok(partition)
|
||||||
}
|
}
|
||||||
|
|
||||||
pub fn open<P: AsRef<Path>>(path: P) -> SKResult<Self> {
|
pub fn open_with_config<P: AsRef<Path>>(
|
||||||
|
path: P,
|
||||||
|
kmer_size: usize,
|
||||||
|
minimizer_size: usize,
|
||||||
|
n_bits: usize,
|
||||||
|
) -> SKResult<Self> {
|
||||||
let root_path = path.as_ref().to_owned();
|
let root_path = path.as_ref().to_owned();
|
||||||
if !root_path.exists() {
|
if !root_path.exists() {
|
||||||
return Err(io::Error::new(
|
return Err(io::Error::new(
|
||||||
@@ -111,22 +111,17 @@ impl KmerPartition {
|
|||||||
)
|
)
|
||||||
.into());
|
.into());
|
||||||
}
|
}
|
||||||
let meta_path = root_path.join(META_FILENAME);
|
let n_partitions = 1usize << n_bits;
|
||||||
let meta: PartitionMeta =
|
|
||||||
serde_json::from_reader(fs::File::open(&meta_path)?).map_err(io::Error::other)?;
|
|
||||||
|
|
||||||
let level = level_from_u32(meta.level);
|
|
||||||
let n_partitions = 1usize << meta.n_bits;
|
|
||||||
let writers = (0..n_partitions).map(|_| None).collect();
|
let writers = (0..n_partitions).map(|_| None).collect();
|
||||||
Ok(Self {
|
Ok(Self {
|
||||||
root_path,
|
root_path,
|
||||||
n_partitions,
|
n_partitions,
|
||||||
partitions_mask: (1u64 << meta.n_bits) - 1,
|
partitions_mask: (1u64 << n_bits) - 1,
|
||||||
kmer_size: meta.kmer_size,
|
kmer_size,
|
||||||
minimizer_size: meta.minimizer_size,
|
minimizer_size,
|
||||||
writers,
|
writers,
|
||||||
level,
|
level: Level::One,
|
||||||
closed: true, // read-only: writing is not allowed on an opened partition
|
closed: true,
|
||||||
})
|
})
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -224,19 +219,34 @@ impl KmerPartition {
|
|||||||
let n_threads = rayon::current_num_threads().max(1) as u64;
|
let n_threads = rayon::current_num_threads().max(1) as u64;
|
||||||
let available_per_thread = available / n_threads;
|
let available_per_thread = available / n_threads;
|
||||||
|
|
||||||
|
let pb = ProgressBar::new(self.n_partitions as u64);
|
||||||
|
pb.set_style(
|
||||||
|
ProgressStyle::with_template(
|
||||||
|
"dereplicating [{bar:40}] {pos}/{len} ({percent}%) {per_sec} eta {eta} {msg}",
|
||||||
|
)
|
||||||
|
.unwrap()
|
||||||
|
.progress_chars("█▌░"),
|
||||||
|
);
|
||||||
|
|
||||||
let results: Vec<SKResult<()>> = (0..self.n_partitions)
|
let results: Vec<SKResult<()>> = (0..self.n_partitions)
|
||||||
.into_par_iter()
|
.into_par_iter()
|
||||||
.map(|i| {
|
.map(|i| {
|
||||||
let dir = self.part_dir(i);
|
let dir = self.part_dir(i);
|
||||||
if !dir.exists() {
|
if !dir.exists() {
|
||||||
|
pb.inc(1);
|
||||||
return Ok(());
|
return Ok(());
|
||||||
}
|
}
|
||||||
let raw_path = dir.join(format!("raw.{SK_EXT}"));
|
let raw_path = dir.join(format!("raw.{SK_EXT}"));
|
||||||
|
let t = Instant::now();
|
||||||
let n_buckets = optimal_buckets(&raw_path, available_per_thread);
|
let n_buckets = optimal_buckets(&raw_path, available_per_thread);
|
||||||
dereplicate_partition(&dir, level, n_buckets)
|
let result = dereplicate_partition(&dir, level, n_buckets);
|
||||||
|
pb.set_message(format!("last {:.0}ms", t.elapsed().as_millis()));
|
||||||
|
pb.inc(1);
|
||||||
|
result
|
||||||
})
|
})
|
||||||
.collect();
|
.collect();
|
||||||
|
|
||||||
|
pb.finish_and_clear();
|
||||||
for r in results {
|
for r in results {
|
||||||
r?;
|
r?;
|
||||||
}
|
}
|
||||||
@@ -249,11 +259,13 @@ impl KmerPartition {
|
|||||||
/// 3. Writes a flat binary count file (`counts1.bin`, one `u32` per slot,
|
/// 3. Writes a flat binary count file (`counts1.bin`, one `u32` per slot,
|
||||||
/// memory-mapped) accumulating kmer abundances from the superkmer counts.
|
/// memory-mapped) accumulating kmer abundances from the superkmer counts.
|
||||||
/// 4. Persists the MPHF to `mphf1.bin` for downstream use.
|
/// 4. Persists the MPHF to `mphf1.bin` for downstream use.
|
||||||
/// 5. Writes a global `kmer_spectrum_raw.json` at the partition root.
|
///
|
||||||
|
/// Returns the aggregated `KmerSpectrum`. Per-partition spectrum files are
|
||||||
|
/// deleted after aggregation unless `keep_partial` is true.
|
||||||
///
|
///
|
||||||
/// Partitions are processed in parallel via Rayon (one task per thread).
|
/// Partitions are processed in parallel via Rayon (one task per thread).
|
||||||
/// Peak memory per partition is ~80 MB, so n_threads partitions run simultaneously.
|
/// Peak memory per partition is ~80 MB, so n_threads partitions run simultaneously.
|
||||||
pub fn count_kmer(&self) -> SKResult<()> {
|
pub fn count_kmer(&self, keep_partial: bool) -> SKResult<KmerSpectrum> {
|
||||||
let sys = System::new_all();
|
let sys = System::new_all();
|
||||||
let available = match sys.available_memory() {
|
let available = match sys.available_memory() {
|
||||||
0 => sys.total_memory() / 2,
|
0 => sys.total_memory() / 2,
|
||||||
@@ -293,10 +305,10 @@ impl KmerPartition {
|
|||||||
r?;
|
r?;
|
||||||
}
|
}
|
||||||
|
|
||||||
// Aggregate per-partition spectra into a global one at the root.
|
// Aggregate per-partition spectra.
|
||||||
let mut global_spectrum: BTreeMap<u32, u64> = BTreeMap::new();
|
let mut counts: BTreeMap<u32, u64> = BTreeMap::new();
|
||||||
let mut global_f0: u64 = 0;
|
let mut f0: u64 = 0;
|
||||||
let mut global_f1: u64 = 0;
|
let mut f1: u64 = 0;
|
||||||
|
|
||||||
for i in 0..self.n_partitions {
|
for i in 0..self.n_partitions {
|
||||||
let path = self.part_dir(i).join("kmer_spectrum_raw.json");
|
let path = self.part_dir(i).join("kmer_spectrum_raw.json");
|
||||||
@@ -305,28 +317,21 @@ impl KmerPartition {
|
|||||||
}
|
}
|
||||||
let v: serde_json::Value =
|
let v: serde_json::Value =
|
||||||
serde_json::from_str(&fs::read_to_string(&path)?).map_err(io::Error::other)?;
|
serde_json::from_str(&fs::read_to_string(&path)?).map_err(io::Error::other)?;
|
||||||
global_f0 += v["f0"].as_u64().unwrap_or(0);
|
f0 += v["f0"].as_u64().unwrap_or(0);
|
||||||
global_f1 += v["f1"].as_u64().unwrap_or(0);
|
f1 += v["f1"].as_u64().unwrap_or(0);
|
||||||
if let Some(obj) = v["spectrum"].as_object() {
|
if let Some(obj) = v["spectrum"].as_object() {
|
||||||
for (c_str, freq) in obj {
|
for (c_str, freq) in obj {
|
||||||
if let (Ok(c), Some(f)) = (c_str.parse::<u32>(), freq.as_u64()) {
|
if let (Ok(c), Some(f)) = (c_str.parse::<u32>(), freq.as_u64()) {
|
||||||
*global_spectrum.entry(c).or_insert(0) += f;
|
*counts.entry(c).or_insert(0) += f;
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
if !keep_partial {
|
||||||
|
let _ = fs::remove_file(&path);
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
let global_spectrum_map: BTreeMap<String, u64> = global_spectrum
|
Ok(KmerSpectrum { f0, f1, counts })
|
||||||
.iter()
|
|
||||||
.map(|(&c, &f)| (format!("{c:010}"), f))
|
|
||||||
.collect();
|
|
||||||
serde_json::to_writer_pretty(
|
|
||||||
fs::File::create(self.root_path.join("kmer_spectrum_raw.json"))?,
|
|
||||||
&serde_json::json!({ "f0": global_f0, "f1": global_f1, "spectrum": &global_spectrum_map }),
|
|
||||||
)
|
|
||||||
.map_err(io::Error::other)?;
|
|
||||||
|
|
||||||
Ok(())
|
|
||||||
}
|
}
|
||||||
|
|
||||||
// ── private ───────────────────────────────────────────────────────────────
|
// ── private ───────────────────────────────────────────────────────────────
|
||||||
@@ -339,18 +344,6 @@ impl KmerPartition {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
fn write_meta(&self, n_bits: usize) -> SKResult<()> {
|
|
||||||
let meta = PartitionMeta {
|
|
||||||
n_bits,
|
|
||||||
kmer_size: self.kmer_size,
|
|
||||||
minimizer_size: self.minimizer_size,
|
|
||||||
level: u32::from(self.level),
|
|
||||||
};
|
|
||||||
let f = fs::File::create(self.root_path.join(META_FILENAME))?;
|
|
||||||
serde_json::to_writer_pretty(f, &meta).map_err(|e| io::Error::other(e))?;
|
|
||||||
Ok(())
|
|
||||||
}
|
|
||||||
|
|
||||||
fn ensure_writer(&mut self, partition: usize) -> SKResult<&mut SKFileWriter> {
|
fn ensure_writer(&mut self, partition: usize) -> SKResult<&mut SKFileWriter> {
|
||||||
if self.writers[partition].is_none() {
|
if self.writers[partition].is_none() {
|
||||||
let dir = self.root_path.join(PARTITIONS_SUBDIR).join(format!("part_{:05}", partition));
|
let dir = self.root_path.join(PARTITIONS_SUBDIR).join(format!("part_{:05}", partition));
|
||||||
@@ -411,32 +404,6 @@ fn optimal_buckets(raw_path: &Path, available_bytes: u64) -> usize {
|
|||||||
n.next_power_of_two() as usize
|
n.next_power_of_two() as usize
|
||||||
}
|
}
|
||||||
|
|
||||||
fn level_from_u32(n: u32) -> Level {
|
|
||||||
match n {
|
|
||||||
0 => Level::Zero,
|
|
||||||
1 => Level::One,
|
|
||||||
2 => Level::Two,
|
|
||||||
3 => Level::Three,
|
|
||||||
4 => Level::Four,
|
|
||||||
5 => Level::Five,
|
|
||||||
6 => Level::Six,
|
|
||||||
7 => Level::Seven,
|
|
||||||
8 => Level::Eight,
|
|
||||||
9 => Level::Nine,
|
|
||||||
10 => Level::Ten,
|
|
||||||
11 => Level::Eleven,
|
|
||||||
12 => Level::Twelve,
|
|
||||||
13 => Level::Thirteen,
|
|
||||||
14 => Level::Fourteen,
|
|
||||||
15 => Level::Fifteen,
|
|
||||||
16 => Level::Sixteen,
|
|
||||||
17 => Level::Seventeen,
|
|
||||||
18 => Level::Eighteen,
|
|
||||||
19 => Level::Nineteen,
|
|
||||||
20 => Level::Twenty,
|
|
||||||
_ => Level::TwentyOne,
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
/// Maximum value that fits in the 24-bit COUNT field of a SuperKmer header.
|
/// Maximum value that fits in the 24-bit COUNT field of a SuperKmer header.
|
||||||
const MAX_SK_COUNT: u64 = (1 << 24) - 1;
|
const MAX_SK_COUNT: u64 = (1 << 24) - 1;
|
||||||
|
|||||||
@@ -0,0 +1,120 @@
|
|||||||
|
use std::path::Path;
|
||||||
|
|
||||||
|
use obicompactvec::{PersistentBitMatrix, PersistentCompactIntMatrix};
|
||||||
|
use obikseq::{CanonicalKmer, RoutableSuperKmer};
|
||||||
|
use obiskio::{SKError, SKResult};
|
||||||
|
use obilayeredmap::{MphfLayer, OLMError};
|
||||||
|
use obilayeredmap::meta::PartitionMeta;
|
||||||
|
|
||||||
|
use crate::partition::KmerPartition;
|
||||||
|
|
||||||
|
const INDEX_SUBDIR: &str = "index";
|
||||||
|
|
||||||
|
fn olm_to_sk(e: OLMError) -> SKError {
|
||||||
|
match e {
|
||||||
|
OLMError::Io(io_err) => SKError::Io(io_err),
|
||||||
|
other => SKError::InvalidData { context: "query", detail: other.to_string() },
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── per-layer query handle ────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
enum QueryLayer {
|
||||||
|
/// Layer<()> — MPHF-only, no data matrix; all indexed kmers map to 1 per genome.
|
||||||
|
SetOnly(MphfLayer),
|
||||||
|
Presence(MphfLayer, PersistentBitMatrix),
|
||||||
|
Count(MphfLayer, PersistentCompactIntMatrix),
|
||||||
|
}
|
||||||
|
|
||||||
|
impl QueryLayer {
|
||||||
|
fn open(layer_dir: &Path, with_counts: bool) -> SKResult<Self> {
|
||||||
|
let presence_dir = layer_dir.join("presence");
|
||||||
|
let counts_dir = layer_dir.join("counts");
|
||||||
|
|
||||||
|
if with_counts && counts_dir.exists() {
|
||||||
|
let mphf = MphfLayer::open(layer_dir).map_err(olm_to_sk)?;
|
||||||
|
let mat = PersistentCompactIntMatrix::open(&counts_dir).map_err(SKError::Io)?;
|
||||||
|
Ok(QueryLayer::Count(mphf, mat))
|
||||||
|
} else if presence_dir.exists() {
|
||||||
|
let mphf = MphfLayer::open(layer_dir).map_err(olm_to_sk)?;
|
||||||
|
let mat = PersistentBitMatrix::open(&presence_dir).map_err(SKError::Io)?;
|
||||||
|
Ok(QueryLayer::Presence(mphf, mat))
|
||||||
|
} else if counts_dir.exists() {
|
||||||
|
// presence query on a count index — return counts as-is
|
||||||
|
let mphf = MphfLayer::open(layer_dir).map_err(olm_to_sk)?;
|
||||||
|
let mat = PersistentCompactIntMatrix::open(&counts_dir).map_err(SKError::Io)?;
|
||||||
|
Ok(QueryLayer::Count(mphf, mat))
|
||||||
|
} else {
|
||||||
|
let mphf = MphfLayer::open(layer_dir).map_err(olm_to_sk)?;
|
||||||
|
Ok(QueryLayer::SetOnly(mphf))
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Return `Some(per-genome row)` if `kmer` is indexed in this layer, else `None`.
|
||||||
|
fn find(&self, kmer: CanonicalKmer, n_genomes: usize) -> Option<Box<[u32]>> {
|
||||||
|
match self {
|
||||||
|
QueryLayer::SetOnly(mphf) => {
|
||||||
|
mphf.find(kmer)
|
||||||
|
.map(|_| vec![1u32; n_genomes].into_boxed_slice())
|
||||||
|
}
|
||||||
|
QueryLayer::Presence(mphf, mat) => {
|
||||||
|
mphf.find(kmer)
|
||||||
|
.map(|slot| mat.row(slot).iter().map(|&b| b as u32).collect())
|
||||||
|
}
|
||||||
|
QueryLayer::Count(mphf, mat) => {
|
||||||
|
mphf.find(kmer).map(|slot| mat.row(slot))
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── KmerPartition::query_partition ───────────────────────────────────────────
|
||||||
|
|
||||||
|
impl KmerPartition {
|
||||||
|
/// Query a single partition for a slice of (already-routed) super-kmers.
|
||||||
|
///
|
||||||
|
/// Returns one entry per input super-kmer; each entry is a `Vec` with one
|
||||||
|
/// `Option<Box<[u32]>>` per k-mer inside that super-kmer:
|
||||||
|
/// - `None` — k-mer absent from the index
|
||||||
|
/// - `Some(row)` — per-genome count (count index) or 0/1 (presence index)
|
||||||
|
///
|
||||||
|
/// All `superkmers` must belong to this partition (same minimizer bucket).
|
||||||
|
pub fn query_partition(
|
||||||
|
&self,
|
||||||
|
part_idx: usize,
|
||||||
|
superkmers: &[&RoutableSuperKmer],
|
||||||
|
k: usize,
|
||||||
|
n_genomes: usize,
|
||||||
|
with_counts: bool,
|
||||||
|
) -> SKResult<Vec<Vec<Option<Box<[u32]>>>>> {
|
||||||
|
if superkmers.is_empty() {
|
||||||
|
return Ok(Vec::new());
|
||||||
|
}
|
||||||
|
|
||||||
|
let index_dir = self.part_dir(part_idx).join(INDEX_SUBDIR);
|
||||||
|
|
||||||
|
if !index_dir.exists() {
|
||||||
|
return Ok(superkmers
|
||||||
|
.iter()
|
||||||
|
.map(|rsk| vec![None; rsk.seql() - k + 1])
|
||||||
|
.collect());
|
||||||
|
}
|
||||||
|
|
||||||
|
let meta = PartitionMeta::load(&index_dir).map_err(olm_to_sk)?;
|
||||||
|
let layers: Vec<QueryLayer> = (0..meta.n_layers)
|
||||||
|
.map(|i| QueryLayer::open(&index_dir.join(format!("layer_{i}")), with_counts))
|
||||||
|
.collect::<SKResult<_>>()?;
|
||||||
|
|
||||||
|
Ok(superkmers
|
||||||
|
.iter()
|
||||||
|
.map(|rsk| {
|
||||||
|
rsk.superkmer()
|
||||||
|
.iter_canonical_kmers()
|
||||||
|
.map(|kmer| {
|
||||||
|
layers.iter().find_map(|layer| layer.find(kmer, n_genomes))
|
||||||
|
})
|
||||||
|
.collect()
|
||||||
|
})
|
||||||
|
.collect())
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -0,0 +1,205 @@
|
|||||||
|
use std::fs;
|
||||||
|
use std::io;
|
||||||
|
use std::path::{Path, PathBuf};
|
||||||
|
|
||||||
|
use obicompactvec::{PersistentBitMatrixBuilder,
|
||||||
|
PersistentBitVecBuilder,
|
||||||
|
PersistentCompactIntMatrixBuilder,
|
||||||
|
PersistentCompactIntVecBuilder};
|
||||||
|
use obidebruinj::GraphDeBruijn;
|
||||||
|
use obiskio::{SKError, SKResult, UnitigFileReader};
|
||||||
|
use obilayeredmap::{Layer, MphfLayer, OLMError};
|
||||||
|
use obilayeredmap::meta::PartitionMeta;
|
||||||
|
|
||||||
|
use crate::filter::KmerFilter;
|
||||||
|
use crate::merge_layer::{MergeMode, SrcLayerData};
|
||||||
|
use crate::partition::KmerPartition;
|
||||||
|
|
||||||
|
const INDEX_SUBDIR: &str = "index";
|
||||||
|
|
||||||
|
fn olm_to_sk(e: OLMError) -> SKError {
|
||||||
|
match e {
|
||||||
|
OLMError::Io(e) => SKError::Io(e),
|
||||||
|
other => SKError::InvalidData { context: "rebuild", detail: other.to_string() },
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
fn col_path_bit(dir: &Path, col: usize) -> PathBuf {
|
||||||
|
dir.join(format!("col_{col:06}.pbiv"))
|
||||||
|
}
|
||||||
|
|
||||||
|
fn col_path_int(dir: &Path, col: usize) -> PathBuf {
|
||||||
|
dir.join(format!("col_{col:06}.pciv"))
|
||||||
|
}
|
||||||
|
|
||||||
|
fn write_matrix_meta(dir: &Path, n: usize, n_cols: usize) -> io::Result<()> {
|
||||||
|
fs::write(dir.join("meta.json"), format!("{{\"n\":{n},\"n_cols\":{n_cols}}}\n"))
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── ColBuilder ────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
enum ColBuilder {
|
||||||
|
Bit(PersistentBitVecBuilder),
|
||||||
|
Int(PersistentCompactIntVecBuilder),
|
||||||
|
}
|
||||||
|
|
||||||
|
impl ColBuilder {
|
||||||
|
fn set_val(&mut self, slot: usize, value: u32) {
|
||||||
|
match self {
|
||||||
|
ColBuilder::Bit(b) => b.set(slot, value > 0),
|
||||||
|
ColBuilder::Int(b) => b.set(slot, value),
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
fn close(self) -> SKResult<()> {
|
||||||
|
match self {
|
||||||
|
ColBuilder::Bit(b) => b.close().map_err(SKError::Io),
|
||||||
|
ColBuilder::Int(b) => b.close().map_err(SKError::Io),
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Helpers ───────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
fn load_meta(dir: &Path) -> SKResult<PartitionMeta> {
|
||||||
|
match PartitionMeta::load(dir) {
|
||||||
|
Ok(m) => Ok(m),
|
||||||
|
Err(e) if matches!(e, OLMError::Io(ref io_e) if io_e.kind() == std::io::ErrorKind::NotFound) => {
|
||||||
|
let mut n = 0usize;
|
||||||
|
while dir.join(format!("layer_{n}")).exists() { n += 1; }
|
||||||
|
let m = PartitionMeta { n_layers: n };
|
||||||
|
m.save(dir).map_err(olm_to_sk)?;
|
||||||
|
Ok(m)
|
||||||
|
}
|
||||||
|
Err(e) => Err(olm_to_sk(e)),
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
fn passes_all(filters: &[Box<dyn KmerFilter>], row: &[u32], n_genomes: usize) -> bool {
|
||||||
|
filters.iter().all(|f| f.passes(row, n_genomes))
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── KmerPartition::rebuild_partition ─────────────────────────────────────────
|
||||||
|
|
||||||
|
impl KmerPartition {
|
||||||
|
/// Rebuild partition `i` from `src` into `self` (an empty destination partition).
|
||||||
|
///
|
||||||
|
/// Only k-mers whose per-genome row passes all `filters` are written.
|
||||||
|
/// The output is a single-layer index — regardless of how many layers the
|
||||||
|
/// source has.
|
||||||
|
///
|
||||||
|
/// `n_genomes` is the number of genome columns in the source (and destination).
|
||||||
|
pub fn rebuild_partition(
|
||||||
|
&self,
|
||||||
|
src: &KmerPartition,
|
||||||
|
i: usize,
|
||||||
|
filters: &[Box<dyn KmerFilter>],
|
||||||
|
mode: MergeMode,
|
||||||
|
n_genomes: usize,
|
||||||
|
) -> SKResult<()> {
|
||||||
|
let src_index_dir = src.part_dir(i).join(INDEX_SUBDIR);
|
||||||
|
if !src_index_dir.exists() {
|
||||||
|
return Ok(());
|
||||||
|
}
|
||||||
|
|
||||||
|
let src_meta = load_meta(&src_index_dir)?;
|
||||||
|
if src_meta.n_layers == 0 {
|
||||||
|
return Ok(());
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Pass 1: collect filtered kmers into de Bruijn graph ───────────────
|
||||||
|
let mut g = GraphDeBruijn::new();
|
||||||
|
|
||||||
|
for l in 0..src_meta.n_layers {
|
||||||
|
let src_layer_dir = src_index_dir.join(format!("layer_{l}"));
|
||||||
|
let unitigs_path = src_layer_dir.join("unitigs.bin");
|
||||||
|
if !unitigs_path.exists() { continue; }
|
||||||
|
|
||||||
|
let reader = UnitigFileReader::open(&unitigs_path)?;
|
||||||
|
let src_data = SrcLayerData::open(&src_layer_dir, mode)?;
|
||||||
|
|
||||||
|
for (kmer, _, _) in reader.iter_indexed_canonical_kmers() {
|
||||||
|
let row = src_data.lookup(kmer, n_genomes);
|
||||||
|
if passes_all(filters, &row, n_genomes) {
|
||||||
|
g.push(kmer);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if g.len() == 0 {
|
||||||
|
return Ok(());
|
||||||
|
}
|
||||||
|
|
||||||
|
let n_new = g.len();
|
||||||
|
g.compute_degrees();
|
||||||
|
|
||||||
|
// ── Build MPHF in dst layer_0 ─────────────────────────────────────────
|
||||||
|
let dst_index_dir = self.part_dir(i).join(INDEX_SUBDIR);
|
||||||
|
let dst_layer_dir = dst_index_dir.join("layer_0");
|
||||||
|
fs::create_dir_all(&dst_layer_dir)?;
|
||||||
|
|
||||||
|
let mut uw = Layer::<()>::unitig_writer(&dst_layer_dir).map_err(olm_to_sk)?;
|
||||||
|
for unitig in g.iter_unitig() {
|
||||||
|
uw.write(&unitig)?;
|
||||||
|
}
|
||||||
|
uw.close()?;
|
||||||
|
drop(g);
|
||||||
|
|
||||||
|
Layer::<()>::build(&dst_layer_dir).map_err(olm_to_sk)?;
|
||||||
|
let dst_mphf = MphfLayer::open(&dst_layer_dir).map_err(olm_to_sk)?;
|
||||||
|
|
||||||
|
// ── Prepare matrix builders (one column per genome) ───────────────────
|
||||||
|
let data_dir = match mode {
|
||||||
|
MergeMode::Presence => dst_layer_dir.join("presence"),
|
||||||
|
MergeMode::Count => dst_layer_dir.join("counts"),
|
||||||
|
};
|
||||||
|
fs::create_dir_all(&data_dir)?;
|
||||||
|
|
||||||
|
let mut builders: Vec<ColBuilder> = match mode {
|
||||||
|
MergeMode::Presence => {
|
||||||
|
PersistentBitMatrixBuilder::new(n_new, &data_dir)
|
||||||
|
.map_err(SKError::Io)?.close().map_err(SKError::Io)?;
|
||||||
|
(0..n_genomes).map(|g| -> SKResult<ColBuilder> {
|
||||||
|
let b = PersistentBitVecBuilder::new(n_new, &col_path_bit(&data_dir, g))?;
|
||||||
|
Ok(ColBuilder::Bit(b))
|
||||||
|
}).collect::<SKResult<_>>()?
|
||||||
|
}
|
||||||
|
MergeMode::Count => {
|
||||||
|
PersistentCompactIntMatrixBuilder::new(n_new, &data_dir)
|
||||||
|
.map_err(SKError::Io)?.close().map_err(SKError::Io)?;
|
||||||
|
(0..n_genomes).map(|g| -> SKResult<ColBuilder> {
|
||||||
|
let b = PersistentCompactIntVecBuilder::new(n_new, &col_path_int(&data_dir, g))?;
|
||||||
|
Ok(ColBuilder::Int(b))
|
||||||
|
}).collect::<SKResult<_>>()?
|
||||||
|
}
|
||||||
|
};
|
||||||
|
|
||||||
|
// ── Pass 2: fill builders ─────────────────────────────────────────────
|
||||||
|
for l in 0..src_meta.n_layers {
|
||||||
|
let src_layer_dir = src_index_dir.join(format!("layer_{l}"));
|
||||||
|
let unitigs_path = src_layer_dir.join("unitigs.bin");
|
||||||
|
if !unitigs_path.exists() { continue; }
|
||||||
|
|
||||||
|
let reader = UnitigFileReader::open(&unitigs_path)?;
|
||||||
|
let src_data = SrcLayerData::open(&src_layer_dir, mode)?;
|
||||||
|
|
||||||
|
for (kmer, _, _) in reader.iter_indexed_canonical_kmers() {
|
||||||
|
let row = src_data.lookup(kmer, n_genomes);
|
||||||
|
if !passes_all(filters, &row, n_genomes) { continue; }
|
||||||
|
if let Some(slot) = dst_mphf.find(kmer) {
|
||||||
|
for (col, &value) in row.iter().enumerate() {
|
||||||
|
builders[col].set_val(slot, value);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Close builders, write metadata ────────────────────────────────────
|
||||||
|
for b in builders { b.close()?; }
|
||||||
|
write_matrix_meta(&data_dir, n_new, n_genomes).map_err(SKError::Io)?;
|
||||||
|
|
||||||
|
PartitionMeta { n_layers: 1 }.save(&dst_index_dir).map_err(olm_to_sk)?;
|
||||||
|
|
||||||
|
Ok(())
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -71,6 +71,20 @@ impl RoutableSuperKmer {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
impl PartialEq for RoutableSuperKmer {
|
||||||
|
fn eq(&self, other: &Self) -> bool {
|
||||||
|
self.superkmer == other.superkmer
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
impl Eq for RoutableSuperKmer {}
|
||||||
|
|
||||||
|
impl std::hash::Hash for RoutableSuperKmer {
|
||||||
|
fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
|
||||||
|
self.superkmer.hash(state);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
impl Sequence for RoutableSuperKmer {
|
impl Sequence for RoutableSuperKmer {
|
||||||
type Canonical = RoutableSuperKmer;
|
type Canonical = RoutableSuperKmer;
|
||||||
|
|
||||||
|
|||||||
@@ -1,4 +1,5 @@
|
|||||||
use std::collections::HashMap;
|
use std::collections::HashMap;
|
||||||
|
use std::fs;
|
||||||
use std::path::Path;
|
use std::path::Path;
|
||||||
|
|
||||||
use obicompactvec::{
|
use obicompactvec::{
|
||||||
@@ -83,6 +84,22 @@ impl Layer<()> {
|
|||||||
pub fn build(out_dir: &Path) -> OLMResult<usize> {
|
pub fn build(out_dir: &Path) -> OLMResult<usize> {
|
||||||
MphfLayer::build(out_dir, &mut |_, _| Ok(()))
|
MphfLayer::build(out_dir, &mut |_, _| Ok(()))
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// Create a presence matrix for a set-membership layer (first merge).
|
||||||
|
///
|
||||||
|
/// All `n_kmers` slots are set to `true`: every kmer in this layer belongs
|
||||||
|
/// to genome_0, so genome_0 is present at every slot.
|
||||||
|
pub fn init_presence_matrix(layer_dir: &Path, n_kmers: usize) -> OLMResult<()> {
|
||||||
|
let presence_dir = layer_dir.join(PRESENCE_DIR);
|
||||||
|
fs::create_dir_all(&presence_dir).map_err(OLMError::Io)?;
|
||||||
|
let mut mb = PersistentBitMatrixBuilder::new(n_kmers, &presence_dir).map_err(OLMError::Io)?;
|
||||||
|
let mut col = mb.add_col().map_err(OLMError::Io)?;
|
||||||
|
for slot in 0..n_kmers {
|
||||||
|
col.set(slot, true);
|
||||||
|
}
|
||||||
|
col.close().map_err(OLMError::Io)?;
|
||||||
|
mb.close().map_err(OLMError::Io)
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
// ── Mode 2 — count matrix (1 column per layer) ────────────────────────────────
|
// ── Mode 2 — count matrix (1 column per layer) ────────────────────────────────
|
||||||
@@ -111,9 +128,31 @@ impl Layer<PersistentCompactIntMatrix> {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// ── Mode 2 — count matrix column append ──────────────────────────────────────
|
||||||
|
|
||||||
|
impl Layer<PersistentCompactIntMatrix> {
|
||||||
|
/// Append a genome column to an existing count matrix.
|
||||||
|
pub fn append_genome_column(
|
||||||
|
layer_dir: &Path,
|
||||||
|
value_of: impl Fn(usize) -> u32,
|
||||||
|
) -> OLMResult<()> {
|
||||||
|
PersistentCompactIntMatrix::append_column(&layer_dir.join(COUNTS_DIR), value_of)
|
||||||
|
.map_err(OLMError::Io)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
// ── Mode 3 — presence/absence matrix (1 column per genome) ───────────────────
|
// ── Mode 3 — presence/absence matrix (1 column per genome) ───────────────────
|
||||||
|
|
||||||
impl Layer<PersistentBitMatrix> {
|
impl Layer<PersistentBitMatrix> {
|
||||||
|
/// Append a genome column to an existing presence matrix.
|
||||||
|
pub fn append_genome_column(
|
||||||
|
layer_dir: &Path,
|
||||||
|
value_of: impl Fn(usize) -> bool,
|
||||||
|
) -> OLMResult<()> {
|
||||||
|
PersistentBitMatrix::append_column(&layer_dir.join(PRESENCE_DIR), value_of)
|
||||||
|
.map_err(OLMError::Io)
|
||||||
|
}
|
||||||
|
|
||||||
pub fn build_presence(
|
pub fn build_presence(
|
||||||
out_dir: &Path,
|
out_dir: &Path,
|
||||||
n_genomes: usize,
|
n_genomes: usize,
|
||||||
|
|||||||
@@ -41,6 +41,8 @@ impl MphfLayer {
|
|||||||
#[inline]
|
#[inline]
|
||||||
pub fn find(&self, kmer: CanonicalKmer) -> Option<usize> {
|
pub fn find(&self, kmer: CanonicalKmer) -> Option<usize> {
|
||||||
let slot = self.mphf.index(&kmer.raw());
|
let slot = self.mphf.index(&kmer.raw());
|
||||||
|
// PtrHash guarantees slot < n only for its key set; arbitrary queries may exceed bounds.
|
||||||
|
if slot >= self.n { return None; }
|
||||||
let (chunk_id, rank) = self.evidence.decode(slot);
|
let (chunk_id, rank) = self.evidence.decode(slot);
|
||||||
if self.unitigs.verify_canonical_kmer(chunk_id as usize, rank as usize, kmer) {
|
if self.unitigs.verify_canonical_kmer(chunk_id as usize, rank as usize, kmer) {
|
||||||
Some(slot)
|
Some(slot)
|
||||||
|
|||||||
@@ -12,6 +12,7 @@ mod mimetype;
|
|||||||
pub mod normalize;
|
pub mod normalize;
|
||||||
mod path_iterator;
|
mod path_iterator;
|
||||||
pub mod peakreader;
|
pub mod peakreader;
|
||||||
|
pub mod record;
|
||||||
pub mod xopen;
|
pub mod xopen;
|
||||||
|
|
||||||
pub use chunk::{SeqChunkIter, fasta_chunks, fastq_chunks,
|
pub use chunk::{SeqChunkIter, fasta_chunks, fastq_chunks,
|
||||||
|
|||||||
@@ -0,0 +1,222 @@
|
|||||||
|
//! Per-sequence record parser for FASTA and FASTQ chunks.
|
||||||
|
//!
|
||||||
|
//! Same automaton structure as `normalize.rs` — only the actions differ:
|
||||||
|
//! instead of writing into a single flat rope, we accumulate per-sequence
|
||||||
|
//! data (id, raw ASCII, normalised ACGT\x00 rope).
|
||||||
|
|
||||||
|
use obikrope::{ForwardCursor, Rope, RopeCursor};
|
||||||
|
|
||||||
|
/// One sequence record extracted from a FASTA or FASTQ chunk.
|
||||||
|
pub struct SeqRecord {
|
||||||
|
/// Sequence identifier (everything before the first space in the header).
|
||||||
|
pub id: String,
|
||||||
|
/// Raw sequence bytes, newlines stripped, non-ACGT characters preserved.
|
||||||
|
/// Reproduced verbatim in query output.
|
||||||
|
pub sequence: Vec<u8>,
|
||||||
|
/// Per-sequence normalised rope: uppercase ACGT segments of length ≥ k
|
||||||
|
/// separated by `\x00`. Ready for `SuperKmerIter`.
|
||||||
|
pub normalized: Rope,
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Parse all records from a FASTA or FASTQ chunk rope.
|
||||||
|
/// Returns an empty vec if the rope carries no recognised mime type.
|
||||||
|
pub fn parse_chunk(rope: &Rope, k: usize) -> Vec<SeqRecord> {
|
||||||
|
let cursor = rope.fw_cursor();
|
||||||
|
match rope.mime_type() {
|
||||||
|
Some("text/fasta") => parse_fasta(cursor, k),
|
||||||
|
Some("text/fastq") => parse_fastq(cursor, k),
|
||||||
|
_ => vec![],
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Shared state accumulated while scanning one sequence ──────────────────────
|
||||||
|
|
||||||
|
struct RecordBuilder {
|
||||||
|
id: String,
|
||||||
|
sequence: Vec<u8>, // raw ASCII, no newlines
|
||||||
|
norm: Vec<u8>, // ACGT\x00 segments being built
|
||||||
|
seg_start: usize, // index in norm where current segment started
|
||||||
|
k: usize,
|
||||||
|
}
|
||||||
|
|
||||||
|
impl RecordBuilder {
|
||||||
|
fn new(k: usize) -> Self {
|
||||||
|
Self { id: String::new(), sequence: Vec::new(), norm: Vec::new(), seg_start: 0, k }
|
||||||
|
}
|
||||||
|
|
||||||
|
fn reset(&mut self, id: String) {
|
||||||
|
self.id = id;
|
||||||
|
self.sequence.clear();
|
||||||
|
self.norm.clear();
|
||||||
|
self.seg_start = 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Push one accepted ACGT byte.
|
||||||
|
fn push_acgt(&mut self, b: u8) {
|
||||||
|
self.sequence.push(b);
|
||||||
|
self.norm.push(b);
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Push one non-ACGT byte to the raw sequence only (not to the norm buffer).
|
||||||
|
fn push_raw(&mut self, b: u8) {
|
||||||
|
self.sequence.push(b);
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Close the current ACGT segment (same logic as `end_segment` in normalize.rs).
|
||||||
|
fn end_segment(&mut self) {
|
||||||
|
if self.norm.len() - self.seg_start >= self.k {
|
||||||
|
self.norm.push(0x00);
|
||||||
|
self.seg_start = self.norm.len();
|
||||||
|
} else {
|
||||||
|
self.norm.truncate(self.seg_start);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/// Consume into a SeqRecord. Closes any open segment first.
|
||||||
|
fn finish(mut self) -> Option<SeqRecord> {
|
||||||
|
self.end_segment();
|
||||||
|
if self.id.is_empty() {
|
||||||
|
return None;
|
||||||
|
}
|
||||||
|
let mut rope = Rope::new(None);
|
||||||
|
if !self.norm.is_empty() {
|
||||||
|
rope.push(self.norm);
|
||||||
|
}
|
||||||
|
Some(SeqRecord { id: self.id, sequence: self.sequence, normalized: rope })
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── FASTA automaton ───────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
fn parse_fasta(cursor: ForwardCursor<'_>, k: usize) -> Vec<SeqRecord> {
|
||||||
|
let mut records: Vec<SeqRecord> = Vec::new();
|
||||||
|
let mut builder = RecordBuilder::new(k);
|
||||||
|
|
||||||
|
// skip up to (and including) the first '>'
|
||||||
|
loop {
|
||||||
|
match cursor.read_next().ok() {
|
||||||
|
None => return records,
|
||||||
|
Some(b'>') => break,
|
||||||
|
Some(_) => {}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// read first id — read_id already consumes the full header line
|
||||||
|
builder.id = read_id(&cursor);
|
||||||
|
|
||||||
|
loop {
|
||||||
|
match cursor.read_next().ok() {
|
||||||
|
None => {
|
||||||
|
// EOF — close final segment and emit
|
||||||
|
if let Some(rec) = builder.finish() {
|
||||||
|
records.push(rec);
|
||||||
|
}
|
||||||
|
return records;
|
||||||
|
}
|
||||||
|
Some(b'\n') | Some(b'\r') => {
|
||||||
|
// peek: next non-empty char determines if new record starts
|
||||||
|
match cursor.read_ahead(1).ok() {
|
||||||
|
Some(b'>') => {
|
||||||
|
// end of current record
|
||||||
|
builder.end_segment();
|
||||||
|
if let Some(rec) = builder.finish() {
|
||||||
|
records.push(rec);
|
||||||
|
}
|
||||||
|
cursor.read_next().ok(); // consume '>'
|
||||||
|
let id = read_id(&cursor); // already consumes header line
|
||||||
|
builder = RecordBuilder::new(k);
|
||||||
|
builder.reset(id);
|
||||||
|
}
|
||||||
|
None => {
|
||||||
|
builder.end_segment();
|
||||||
|
if let Some(rec) = builder.finish() {
|
||||||
|
records.push(rec);
|
||||||
|
}
|
||||||
|
return records;
|
||||||
|
}
|
||||||
|
Some(_) => {} // continuation line — do nothing
|
||||||
|
}
|
||||||
|
}
|
||||||
|
Some(b) => {
|
||||||
|
let upper = b & !0x20u8;
|
||||||
|
if matches!(upper, b'A' | b'C' | b'G' | b'T') {
|
||||||
|
builder.push_acgt(upper);
|
||||||
|
} else {
|
||||||
|
builder.push_raw(b);
|
||||||
|
builder.end_segment();
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── FASTQ automaton ───────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
fn parse_fastq(cursor: ForwardCursor<'_>, k: usize) -> Vec<SeqRecord> {
|
||||||
|
let mut records: Vec<SeqRecord> = Vec::new();
|
||||||
|
|
||||||
|
loop {
|
||||||
|
// find '@'
|
||||||
|
loop {
|
||||||
|
match cursor.read_next().ok() {
|
||||||
|
None => return records,
|
||||||
|
Some(b'@') => break,
|
||||||
|
Some(_) => {}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
let mut builder = RecordBuilder::new(k);
|
||||||
|
builder.id = read_id(&cursor); // already consumes the full header line
|
||||||
|
|
||||||
|
// sequence line — stop at newline, non-ACGT breaks segment
|
||||||
|
loop {
|
||||||
|
match cursor.read_next().ok() {
|
||||||
|
None | Some(b'\n') | Some(b'\r') => {
|
||||||
|
builder.end_segment();
|
||||||
|
break;
|
||||||
|
}
|
||||||
|
Some(b) => {
|
||||||
|
let upper = b & !0x20u8;
|
||||||
|
if matches!(upper, b'A' | b'C' | b'G' | b'T') {
|
||||||
|
builder.push_acgt(upper);
|
||||||
|
} else {
|
||||||
|
builder.push_raw(b);
|
||||||
|
builder.end_segment();
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
skip_line(&cursor); // '+' line
|
||||||
|
skip_line(&cursor); // quality line
|
||||||
|
|
||||||
|
if let Some(rec) = builder.finish() {
|
||||||
|
records.push(rec);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Helpers ───────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
fn read_id(cursor: &ForwardCursor<'_>) -> String {
|
||||||
|
let mut id = Vec::new();
|
||||||
|
loop {
|
||||||
|
match cursor.read_next().ok() {
|
||||||
|
None | Some(b'\n') | Some(b'\r') => break,
|
||||||
|
Some(b' ') | Some(b'\t') => {
|
||||||
|
skip_line(cursor);
|
||||||
|
break;
|
||||||
|
}
|
||||||
|
Some(b) => id.push(b),
|
||||||
|
}
|
||||||
|
}
|
||||||
|
String::from_utf8_lossy(&id).into_owned()
|
||||||
|
}
|
||||||
|
|
||||||
|
fn skip_line(cursor: &ForwardCursor<'_>) {
|
||||||
|
while let Some(c) = cursor.read_next().ok() {
|
||||||
|
if c == b'\n' {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
Reference in New Issue
Block a user