Merge pull request 'Push rxrxptuxltlp' (#12) from push-rxrxptuxltlp into main
Reviewed-on: #12
This commit was merged in pull request #12.
This commit is contained in:
@@ -1 +1,107 @@
|
|||||||
toto
|
# obikmer
|
||||||
|
|
||||||
|
`obikmer` is a Rust toolkit for indexing, querying, and comparing DNA sequences
|
||||||
|
represented as sets of k-mers. It targets individual genome datasets (tens of
|
||||||
|
Gbases) with maximum efficiency in computation, memory, and disk usage.
|
||||||
|
|
||||||
|
## Key principles
|
||||||
|
|
||||||
|
**Compact k-mer encoding.** Each k-mer is stored in a `u64` at 2 bits/base.
|
||||||
|
k is odd, k ∈ [11, 31], fixed at runtime. The canonical form `min(kmer, revcomp(kmer))`
|
||||||
|
halves the effective space by collapsing both strands.
|
||||||
|
|
||||||
|
**Superkmer-based partitioning.** Sequences are decomposed into superkmers —
|
||||||
|
maximal runs of k-mers sharing the same minimizer. Superkmers route naturally to
|
||||||
|
partitions via the minimizer hash, enabling partition-parallel indexing and querying
|
||||||
|
with no cross-partition communication.
|
||||||
|
|
||||||
|
**Layered MPHF index.** Each partition holds a stack of layers. Each layer is a
|
||||||
|
minimal perfect hash function (MPHF) over the k-mers of one input genome, paired
|
||||||
|
with a per-genome presence/count matrix. Queries scatter k-mers to their partition,
|
||||||
|
probe each layer in order, and aggregate results.
|
||||||
|
|
||||||
|
**Approximate indexing (Findere).** With `-z Z`, the index stores k-mers of size
|
||||||
|
`s = k − z + 1` instead of k. At query time, results are produced at size s, then
|
||||||
|
a per-genome sliding window of size z aggregates z consecutive s-mer hits into one
|
||||||
|
confirmed k-mer answer. This reduces the false-positive rate from `1/2^b` per s-mer
|
||||||
|
to `1/2^(b·z)` per k-mer, at the cost of z−1 unconfirmed positions at each sequence
|
||||||
|
break. The aggregation window spans the full query sequence, not individual superkmers,
|
||||||
|
to avoid false negatives at superkmer boundaries.
|
||||||
|
|
||||||
|
**Multi-genome.** A single index can hold any number of genomes. Each k-mer slot
|
||||||
|
carries a per-genome count or presence vector. Distance matrices, NJ/UPGMA trees,
|
||||||
|
and classification are derived from these vectors without rebuilding the index.
|
||||||
|
|
||||||
|
## Input formats
|
||||||
|
|
||||||
|
Command Formats accepted
|
||||||
|
─────────────────── ──────────────────────────────────────────────────────────────
|
||||||
|
index, superkmer FASTA (.fa .fasta), FASTQ (.fq .fastq), GenBank flat file
|
||||||
|
(.gb .gbk .gbff), all optionally gzip-compressed.
|
||||||
|
Directories expanded recursively. Streaming stdin via -.
|
||||||
|
query FASTA, FASTQ, optionally gzip-compressed. Stdin via -.
|
||||||
|
|
||||||
|
Non-ACGT characters act as hard breaks between k-mer segments in all formats.
|
||||||
|
|
||||||
|
## Commands
|
||||||
|
|
||||||
|
Command Role
|
||||||
|
───────── ────────────────────────────────────────────────────────────────────
|
||||||
|
index Build a genome index from sequence files.
|
||||||
|
Runs scatter → dereplicate → count → layered MPHF.
|
||||||
|
Resumes automatically if interrupted.
|
||||||
|
merge Merge multiple independently built indexes into one.
|
||||||
|
rebuild Filter and compact an existing index: apply count thresholds,
|
||||||
|
drop layers, rewrite as a single-layer index.
|
||||||
|
reindex Convert evidence in-place across all layers:
|
||||||
|
exact (evidence.bin) ↔ approximate (fingerprint.bin).
|
||||||
|
Does not touch the MPHF or unitigs.
|
||||||
|
query Query an index with FASTA/FASTQ sequences.
|
||||||
|
Annotates each sequence with per-genome k-mer match counts
|
||||||
|
and optional per-position coverage vectors (--detail).
|
||||||
|
Parallel over sequence chunks.
|
||||||
|
distance Compute a pairwise Bray-Curtis or Jaccard distance matrix
|
||||||
|
between all indexed genomes.
|
||||||
|
Optionally outputs a Newick NJ or UPGMA tree.
|
||||||
|
annotate Add or update genome metadata (taxonomy, etc.) from a CSV
|
||||||
|
file; or dump the current metadata as CSV.
|
||||||
|
estimate Dry-run: resolve and print approximate-index parameters
|
||||||
|
(z, evidence bits b, FP rates) given any two of (b, z, fp).
|
||||||
|
Does not touch any index.
|
||||||
|
dump Dump all indexed k-mers as CSV with per-genome counts or
|
||||||
|
presence flags.
|
||||||
|
superkmer Extract superkmers from a sequence file and write to stdout.
|
||||||
|
Diagnostic / pipeline use.
|
||||||
|
unitig Dump the unitig sequences stored in a built index. Debug use.
|
||||||
|
utils Miscellaneous utilities.
|
||||||
|
--new-label NEW=OLD renames a genome label in-place.
|
||||||
|
|
||||||
|
## Quick start
|
||||||
|
|
||||||
|
```sh
|
||||||
|
# Build an exact index for two genomes
|
||||||
|
obikmer index --kmer-size 31 --label genome_a genome_a.fa --output index/
|
||||||
|
obikmer index --kmer-size 31 --label genome_b genome_b.fa --output index/
|
||||||
|
|
||||||
|
# Convert to approximate index (z=5, 8-bit fingerprints)
|
||||||
|
obikmer reindex --approx -z 5 --evidence-bits 8 index/
|
||||||
|
|
||||||
|
# Query reads
|
||||||
|
obikmer query index/ reads.fq.gz > annotated.fa
|
||||||
|
|
||||||
|
# Pairwise distances
|
||||||
|
obikmer distance index/ > distances.tsv
|
||||||
|
```
|
||||||
|
|
||||||
|
## Parameter constraints
|
||||||
|
|
||||||
|
Parameter Constraint
|
||||||
|
───────────────────── ──────────────
|
||||||
|
k (--kmer-size) odd, 11 ≤ k ≤ 31
|
||||||
|
m (--minimizer-size) odd, 3 ≤ m ≤ k−1
|
||||||
|
z (-z, --approx only) 1 ≤ z ≤ k−1
|
||||||
|
|
||||||
|
## Documentation
|
||||||
|
|
||||||
|
Extended architecture and implementation notes are in `docmd/`. Build with
|
||||||
|
`make doc` (requires Python + MkDocs Material).
|
||||||
|
|||||||
@@ -24,3 +24,37 @@ Sauf qu'avec un index approximatif, les résultats seront approximatifs.
|
|||||||
--detail et --mismatch à implementer
|
--detail et --mismatch à implementer
|
||||||
|
|
||||||
- status : affiche le statut de l'index
|
- status : affiche le statut de l'index
|
||||||
|
|
||||||
|
## Problème biologique sur l'identification des contaminants
|
||||||
|
|
||||||
|
Exemple de reads problématiques:
|
||||||
|
```
|
||||||
|
>LH00534:161:22WMGWLT4:4:1101:45301:1420 {"coverage":{"gbbct":[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]},"kmer_count":117,"kmer_strict_matches":{"gbbct":117}}
|
||||||
|
GCCCCTACCGTACTCCAGCTTGGTAGTTTCCACCGCCTGTCCAGGGTTGAGCCCTGGGATTTGACGGCGGACTTAAAAAGCCACCTACAGACGCTTTACGCCCAATCATTCCGGATAACGCTTGCATCCTCTGTATTACCGCGGCTGCTGG
|
||||||
|
```
|
||||||
|
|
||||||
|
Par blast match une quantité invréssemblable de genomes chloroplastique avec un match de 100% (6554 hits pour Streptophyta)
|
||||||
|
|
||||||
|
mais aussi une quantité de sequences importantes à des OTU bactériennes (uncutured bacteria 115 hits) aussi avec 100% de similarité.
|
||||||
|
|
||||||
|
```
|
||||||
|
Uncultured bacterium clone Otu01032 16S ribosomal RNA gene, partial sequence
|
||||||
|
Sequence ID: KX996137.1Length: 440Number of Matches: 1
|
||||||
|
Range 1: 153 to 303GenBankGraphics
|
||||||
|
Next Match
|
||||||
|
Previous Match
|
||||||
|
Alignment statistics for match #1 Score Expect Identities Gaps Strand
|
||||||
|
273 bits(302) 2e-69 151/151(100%) 0/151(0%) Plus/Minus
|
||||||
|
|
||||||
|
Query 1 GCCCCTACCGTACTCCAGCTTGGTAGTTTCCACCGCCTGTCCAGGGTTGAGCCCTGGGAT 60
|
||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||
|
Sbjct 303 GCCCCTACCGTACTCCAGCTTGGTAGTTTCCACCGCCTGTCCAGGGTTGAGCCCTGGGAT 244
|
||||||
|
|
||||||
|
Query 61 TTGACGGCGGACTTAAAAAGCCACCTACAGACGCTTTACGCCCAATCATTCCGGATAACG 120
|
||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||
|
Sbjct 243 TTGACGGCGGACTTAAAAAGCCACCTACAGACGCTTTACGCCCAATCATTCCGGATAACG 184
|
||||||
|
|
||||||
|
Query 121 CTTGCATCCTCTGTATTACCGCGGCTGCTGG 151
|
||||||
|
|||||||||||||||||||||||||||||||
|
||||||
|
Sbjct 183 CTTGCATCCTCTGTATTACCGCGGCTGCTGG 153
|
||||||
|
```
|
||||||
|
|||||||
+28
-24
@@ -8,7 +8,7 @@ Given a set of query sequences, determine for each sequence how many of its k-me
|
|||||||
|
|
||||||
## Input
|
## Input
|
||||||
|
|
||||||
- Query sequences in FASTA or FASTQ format (gzip supported, streaming stdin supported).
|
- Query sequences in FASTA or FASTQ format (gzip supported, streaming stdin supported). GenBank flat files are not supported at query time (only at index time).
|
||||||
- Sequences shorter than k bases are silently skipped.
|
- Sequences shorter than k bases are silently skipped.
|
||||||
- Non-ACGT characters are handled by the superkmer decomposition layer: they act as hard breaks, producing shorter superkmers (identical to the behaviour at indexing time).
|
- Non-ACGT characters are handled by the superkmer decomposition layer: they act as hard breaks, producing shorter superkmers (identical to the behaviour at indexing time).
|
||||||
|
|
||||||
@@ -19,45 +19,50 @@ Given a set of query sequences, determine for each sequence how many of its k-me
|
|||||||
The query follows the same superkmer-based partitioning strategy used at indexing time.
|
The query follows the same superkmer-based partitioning strategy used at indexing time.
|
||||||
|
|
||||||
```
|
```
|
||||||
for each batch of sequences:
|
for each chunk of sequences (parallel workers via obipipeline):
|
||||||
build QueryBatch: decompose all sequences into superkmers, deduplicate
|
build QueryBatch: decompose all sequences into s-mers via superkmers, deduplicate
|
||||||
|
allocate seq_results[seq_idx][smer_pos] = None ← per-sequence s-mer result vectors
|
||||||
split superkmers by partition via minimiser hash
|
split superkmers by partition via minimiser hash
|
||||||
for each partition p:
|
for each partition p:
|
||||||
query_partition(p, superkmers_routed_to_p)
|
query_partition(p, superkmers_routed_to_p)
|
||||||
→ load QueryLayer(s) for p
|
→ load QueryLayer(s) for p
|
||||||
→ for each kmer in each superkmer: MphfLayer::find(kmer)
|
→ for each s-mer in each superkmer: MphfLayer::find(smer)
|
||||||
apply_findere(sk_kmer_results, effective_z) ← per superkmer
|
fill seq_results[seq_idx][kmer_offset + j] from partition results
|
||||||
broadcast confirmed results back to each (seq_idx, kmer_offset)
|
for each sequence:
|
||||||
|
apply_findere(seq_results[seq_idx], effective_z) ← per full sequence
|
||||||
|
accumulate confirmed k-mer results into acc and cov
|
||||||
emit annotated sequences
|
emit annotated sequences
|
||||||
```
|
```
|
||||||
|
|
||||||
Superkmers that appear more than once in the batch (same sequence or across sequences) are deduplicated: each unique `RoutableSuperKmer` is queried once per partition, and the result is broadcast to every `SKDesc` entry that references it.
|
Superkmers that appear more than once in the batch (same sequence or across sequences) are deduplicated: each unique `RoutableSuperKmer` is queried once per partition, and the result is broadcast to every `SKDesc` entry that references it.
|
||||||
|
|
||||||
Parallelism is **not yet active** in the current implementation: batches are processed sequentially on a single thread despite the `--threads` flag being parsed. The `QueryBatch` / `split_by_partition` design is structured to support per-partition parallelism in a future iteration.
|
**Findere requires full-sequence aggregation.** `apply_findere` is applied once per sequence on the complete s-mer result vector, after all partitions have contributed. Applying it per superkmer would produce false negatives at superkmer boundaries, where the z-window spans two superkmers.
|
||||||
|
|
||||||
|
Batches are processed in parallel via `obipipeline` workers; the `--threads` flag controls the number of worker threads.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Findere z-window filter
|
## Findere z-window filter
|
||||||
|
|
||||||
For approximate and hybrid index modes, a raw k-mer hit is a positive from the fingerprint test with false-positive rate 1/2^b. The Findere scheme reduces the effective FP rate to 1/2^(b·z) by requiring **z consecutive k-mers** to all test positive before any of them is counted as a confirmed match.
|
For approximate index modes, the index physically stores s-mers of size `s = k_user − z + 1`. At query time, `set_k(s)` is in effect, so queries naturally produce s-mer results. `apply_findere` then aggregates z consecutive s-mer results into one k_user-mer answer:
|
||||||
|
|
||||||
`apply_findere` implements this as a sliding-window confirmation, independently for each genome:
|
|
||||||
|
|
||||||
```rust
|
```rust
|
||||||
fn apply_findere(
|
fn apply_findere(
|
||||||
results: &[Option<Box<[u32]>>],
|
results: &[Option<Box<[u32]>>], // N s-mer results
|
||||||
z: usize,
|
z: usize,
|
||||||
n_genomes: usize,
|
n_genomes: usize,
|
||||||
) -> Vec<Option<Box<[u32]>>>
|
) -> Vec<Option<Box<[u32]>>> // N − z + 1 k_user-mer results
|
||||||
```
|
```
|
||||||
|
|
||||||
For each genome g, a position i is confirmed if there exists at least one window of z consecutive positions `[j, j+z)` that contains i and where all z positions are present for g (i.e. `results[pos]` is `Some(row)` and `row[g] > 0`). The implementation uses a single O(n) sliding-window scan per genome.
|
Input length N (s-mers), output length N − z + 1 (k_user-mers).
|
||||||
|
|
||||||
Unconfirmed positions are zeroed in the returned row. If all genomes are zeroed for a position, it is returned as `None`.
|
For each genome g independently, a sliding window of size z scans the input. Output position i is confirmed for genome g iff all z values `results[i..i+z][g]` are nonzero (`None` counts as zero for all genomes). The scan is O(n) per genome.
|
||||||
|
|
||||||
**Short superkmers**: when a superkmer contains fewer than z k-mers, no complete z-window can be formed. Rather than discarding all results, `apply_findere` returns them unchanged (no filter applied). This avoids penalising legitimate short sequences near read ends.
|
Output values come from `results[i]` (leftmost s-mer of each window); genomes not confirmed are zeroed. If all genomes are zero, the position is returned as `None`.
|
||||||
|
|
||||||
**Exact indexes**: `z` is effectively 1 (every k-mer is its own window). `apply_findere` is a no-op.
|
**Short sequences**: when the s-mer count is less than z, no complete window can form — `apply_findere` returns an empty vector. K-mers from sequences shorter than k_user are not emitted.
|
||||||
|
|
||||||
|
**Exact indexes**: `z = 1`, `apply_findere` is a passthrough (output length = input length).
|
||||||
|
|
||||||
### Effective z at query time
|
### Effective z at query time
|
||||||
|
|
||||||
@@ -105,20 +110,20 @@ genome_totals[g] += if presence { u32::from(v >= threshold) } else { v }
|
|||||||
|
|
||||||
## Coverage vectors (`--detail`)
|
## Coverage vectors (`--detail`)
|
||||||
|
|
||||||
When `--detail` is requested, a 3-D accumulator `cov[seq_idx][genome][kmer_pos]` is allocated before the partition loop, with dimensions derived from `batch.n_kmers`:
|
When `--detail` is requested, a 3-D accumulator `cov[seq_idx][genome][kmer_pos]` is allocated after all partitions are queried, with dimensions derived from `n_kmers_out = n_smers − z + 1` (k_user-mer positions, not s-mer positions):
|
||||||
|
|
||||||
```
|
```
|
||||||
cov[seq_idx][g][abs_pos] += contribution
|
cov[seq_idx][g][pos] += contribution
|
||||||
where abs_pos = desc.kmer_offset + local_pos (absolute kmer position in the sequence)
|
where pos is the k_user-mer index in the filtered (post-Findere) vector
|
||||||
```
|
```
|
||||||
|
|
||||||
Coverage reflects confirmed k-mers only (post-Findere). The vectors are emitted in the JSON annotation under the key `"coverage"`.
|
Coverage reflects confirmed k_user-mers only. The vectors are emitted in the JSON annotation under the key `"coverage"`.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## `kmer_missing` semantics
|
## `kmer_missing` semantics
|
||||||
|
|
||||||
`kmer_missing` counts k-mers that returned `None` from the index before Findere filtering — i.e. k-mers truly absent from every layer. K-mers that were found in the index but rejected by the z-window filter are not counted as missing.
|
`kmer_missing` counts k_user-mer positions where the first s-mer (`seq_results[seq_idx][pos]`) is `None` — i.e. absent from the index entirely. K-mers where the z-window fails because a later s-mer is absent or zero are not counted as missing (the first s-mer being present is used as proxy for index membership).
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -151,7 +156,7 @@ Genome keys follow the iteration order of `meta.genomes`.
|
|||||||
| `kmer_strict_matches` | object | always | per-genome accumulated value (label → count or 0/1) |
|
| `kmer_strict_matches` | object | always | per-genome accumulated value (label → count or 0/1) |
|
||||||
| `coverage` | object | `--detail` | per-genome array of per-position contributions (label → [u32]) |
|
| `coverage` | object | `--detail` | per-genome array of per-position contributions (label → [u32]) |
|
||||||
|
|
||||||
`kmer_count + kmer_missing` ≤ total k-mers in the sequence. The gap (if any) corresponds to k-mers found in the index but rejected by the Findere z-window filter.
|
`kmer_count + kmer_missing` ≤ total k_user-mers in the sequence. The gap corresponds to k_user-mers whose z-window was not fully confirmed (at least one s-mer absent or zero for all genomes) but whose first s-mer was present in the index.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -171,7 +176,7 @@ obikmer query <index> [--detail] [--mismatch] [--count-missing]
|
|||||||
| `--count-missing` | off | Add `kmer_missing` field to JSON |
|
| `--count-missing` | off | Add `kmer_missing` field to JSON |
|
||||||
| `--force-presence` | off | Report 0/1 per genome regardless of index counts |
|
| `--force-presence` | off | Report 0/1 per genome regardless of index counts |
|
||||||
| `--presence-threshold` | 1 | Minimum count to declare genome present |
|
| `--presence-threshold` | 1 | Minimum count to declare genome present |
|
||||||
| `-T` / `--threads` | all CPUs | Worker threads (parallelism not yet active) |
|
| `-T` / `--threads` | all CPUs | Worker threads |
|
||||||
|
|
||||||
`--mismatch` is accepted but currently ignored with a warning on stderr.
|
`--mismatch` is accepted but currently ignored with a warning on stderr.
|
||||||
|
|
||||||
@@ -181,5 +186,4 @@ obikmer query <index> [--detail] [--mismatch] [--count-missing]
|
|||||||
|
|
||||||
- **`--mismatch`**: 1-mismatch approximate matching — generate `3·k` single-substitution variants per k-mer, look each up independently.
|
- **`--mismatch`**: 1-mismatch approximate matching — generate `3·k` single-substitution variants per k-mer, look each up independently.
|
||||||
- **Read classification** (`--classify`): assign each read to the genome with the highest match score.
|
- **Read classification** (`--classify`): assign each read to the genome with the highest match score.
|
||||||
- **Parallelism**: activate per-partition or per-sequence worker threads using the already-parsed `--threads` value.
|
|
||||||
- **Whitelist / blacklist filtering**: threshold-based accept/reject on per-genome match scores.
|
- **Whitelist / blacklist filtering**: threshold-based accept/reject on per-genome match scores.
|
||||||
|
|||||||
@@ -35,15 +35,18 @@ stored at `s` belongs to the legitimate k-mer at that slot. The FP event is:
|
|||||||
P(FP per k-mer) = 1 / 2^b
|
P(FP per k-mer) = 1 / 2^b
|
||||||
```
|
```
|
||||||
|
|
||||||
The Findere trick raises the effective window to z consecutive k-mers. A query
|
The Findere trick reduces the indexed k-mer size. When the user specifies k_user
|
||||||
succeeds only when all z fingerprint checks pass, reducing the per-window FP rate:
|
and z, the index physically stores k-mers of size `s = k_user − z + 1`. At query
|
||||||
|
time, the same s-mer size is used. After collecting per-position s-mer results
|
||||||
|
over the full query sequence, a sliding window of size z aggregates z consecutive
|
||||||
|
s-mer hits into one confirmed k_user-mer hit, reducing the per-window FP rate:
|
||||||
|
|
||||||
```
|
```
|
||||||
P(FP per z-window) = 1 / 2^(b·z)
|
P(FP per k_user-mer) = 1 / 2^(b·z)
|
||||||
```
|
```
|
||||||
|
|
||||||
The effective indexed k-mer length is `k − z + 1`: a query for a (k+z−1)-mer
|
`IndexConfig::kmer_size` stores `s = k_user − z + 1`, not k_user. Both indexing
|
||||||
decomposes into z overlapping k-mers, all of which must match.
|
and querying use this stored size via `set_k(idx.kmer_size())`.
|
||||||
|
|
||||||
Parameters b and z are stored in `layer_meta.json` (`EvidenceKind::Approx { b, z }`).
|
Parameters b and z are stored in `layer_meta.json` (`EvidenceKind::Approx { b, z }`).
|
||||||
|
|
||||||
@@ -167,12 +170,12 @@ any index. It accepts the same `--evidence-bits`, `-z`, and `--fp` flags and
|
|||||||
additionally accepts `-k` to display the effective indexed k-mer length:
|
additionally accepts `-k` to display the effective indexed k-mer length:
|
||||||
|
|
||||||
```
|
```
|
||||||
k (query): 31
|
k (user): 31
|
||||||
k (indexed): 31
|
k (indexed, s=k-z+1): 27
|
||||||
z: 1
|
z: 5
|
||||||
evidence bits (b): 8
|
evidence bits (b): 8
|
||||||
FP per k-mer: 3.906e-3 (1/2^8)
|
FP per s-mer: 3.906e-3 (1/2^8)
|
||||||
FP per z-window: 3.906e-3 (1/2^8)
|
FP per k-mer window: 9.537e-7 (1/2^(8·5))
|
||||||
```
|
```
|
||||||
|
|
||||||
Useful for choosing parameters before committing to an index build.
|
Useful for choosing parameters before committing to an index build.
|
||||||
|
|||||||
+2
-1
@@ -25,7 +25,8 @@
|
|||||||
- Maximum efficiency in computation, memory, and disk usage
|
- Maximum efficiency in computation, memory, and disk usage
|
||||||
- k odd, k ∈ [11, 31], fixed at runtime; kmer fits in a u64 (2 bits/base)
|
- k odd, k ∈ [11, 31], fixed at runtime; kmer fits in a u64 (2 bits/base)
|
||||||
- Canonical form: `min(kmer, revcomp(kmer))` reduces strand-symmetric space by half
|
- Canonical form: `min(kmer, revcomp(kmer))` reduces strand-symmetric space by half
|
||||||
- Input formats: FASTA, FASTQ, gzip, streaming stdin; `index` reads from stdin automatically when no input files are provided (`-` can also be passed explicitly among other paths)
|
- Input formats for `index`/`superkmer`: FASTA (`.fa`, `.fasta`), FASTQ (`.fq`, `.fastq`), GenBank flat file (`.gb`, `.gbk`, `.gbff`), all optionally gzip-compressed; directories expanded recursively; streaming stdin via `-`
|
||||||
|
- Input formats for `query`: FASTA, FASTQ, optionally gzip-compressed; streaming stdin via `-`
|
||||||
|
|
||||||
## Parameter constraints (enforced at CLI)
|
## Parameter constraints (enforced at CLI)
|
||||||
|
|
||||||
|
|||||||
@@ -230,16 +230,11 @@ fn process_chunk(
|
|||||||
let batch = QueryBatch::from_records(records, k, 6, 0.7);
|
let batch = QueryBatch::from_records(records, k, 6, 0.7);
|
||||||
let n_seqs = batch.ids.len();
|
let n_seqs = batch.ids.len();
|
||||||
|
|
||||||
let mut accs: Vec<SeqAcc> =
|
// Per-sequence s-mer result vectors in global sequence position order.
|
||||||
(0..n_seqs).map(|_| SeqAcc::new(n_genomes)).collect();
|
// All partitions fill into this structure before Findere is applied.
|
||||||
|
let mut seq_results: Vec<Vec<Option<Box<[u32]>>>> = batch.n_kmers.iter()
|
||||||
let mut cov: Vec<Vec<Vec<u32>>> = if detail {
|
.map(|&n| vec![None; n as usize])
|
||||||
batch.n_kmers.iter()
|
.collect();
|
||||||
.map(|&n| vec![vec![0u32; n as usize]; n_genomes])
|
|
||||||
.collect()
|
|
||||||
} else {
|
|
||||||
Vec::new()
|
|
||||||
};
|
|
||||||
|
|
||||||
let by_part = batch.split_by_partition(n_partitions);
|
let by_part = batch.split_by_partition(n_partitions);
|
||||||
|
|
||||||
@@ -256,20 +251,45 @@ fn process_chunk(
|
|||||||
std::process::exit(1);
|
std::process::exit(1);
|
||||||
});
|
});
|
||||||
|
|
||||||
|
for (rsk, sk_kmer_results) in part_sks.iter().zip(kmer_results.iter()) {
|
||||||
|
let descs = batch.map.get(*rsk).expect("rsk must be in map");
|
||||||
|
for desc in descs {
|
||||||
|
let offset = desc.kmer_offset as usize;
|
||||||
|
let dst = &mut seq_results[desc.seq_idx as usize];
|
||||||
|
for (j, hit) in sk_kmer_results.iter().enumerate() {
|
||||||
|
dst[offset + j] = hit.as_ref().map(|r| r.clone());
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// Apply Findere on each complete sequence vector, then accumulate.
|
||||||
|
let n_kmers_out: Vec<usize> = batch.n_kmers.iter()
|
||||||
|
.map(|&n| { let n = n as usize; if n >= effective_z { n - effective_z + 1 } else { 0 } })
|
||||||
|
.collect();
|
||||||
|
|
||||||
|
let mut accs: Vec<SeqAcc> =
|
||||||
|
(0..n_seqs).map(|_| SeqAcc::new(n_genomes)).collect();
|
||||||
|
|
||||||
|
let mut cov: Vec<Vec<Vec<u32>>> = if detail {
|
||||||
|
n_kmers_out.iter()
|
||||||
|
.map(|&n| vec![vec![0u32; n]; n_genomes])
|
||||||
|
.collect()
|
||||||
|
} else {
|
||||||
|
Vec::new()
|
||||||
|
};
|
||||||
|
|
||||||
let presence = force_presence || !with_counts;
|
let presence = force_presence || !with_counts;
|
||||||
let threshold = presence_threshold;
|
let threshold = presence_threshold;
|
||||||
|
|
||||||
for (rsk, sk_kmer_results) in part_sks.iter().zip(kmer_results.iter()) {
|
for seq_idx in 0..n_seqs {
|
||||||
let filtered = apply_findere(sk_kmer_results, effective_z, n_genomes);
|
let filtered = apply_findere(&seq_results[seq_idx], effective_z, n_genomes);
|
||||||
let descs = batch.map.get(*rsk).expect("rsk must be in map");
|
let acc = &mut accs[seq_idx];
|
||||||
|
|
||||||
for desc in descs {
|
for (pos, hit) in filtered.iter().enumerate() {
|
||||||
let acc = &mut accs[desc.seq_idx as usize];
|
|
||||||
|
|
||||||
for (local_pos, hit) in filtered.iter().enumerate() {
|
|
||||||
match hit {
|
match hit {
|
||||||
None => {
|
None => {
|
||||||
if sk_kmer_results[local_pos].is_none() {
|
if seq_results[seq_idx][pos].is_none() {
|
||||||
acc.kmer_missing += 1;
|
acc.kmer_missing += 1;
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
@@ -284,10 +304,7 @@ fn process_chunk(
|
|||||||
};
|
};
|
||||||
acc.genome_totals[g] += contribution;
|
acc.genome_totals[g] += contribution;
|
||||||
if detail {
|
if detail {
|
||||||
let abs_pos = desc.kmer_offset as usize + local_pos;
|
cov[seq_idx][g][pos] += contribution;
|
||||||
cov[desc.seq_idx as usize][g][abs_pos] += contribution;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|||||||
Reference in New Issue
Block a user