feat: add benchmark pipeline, expose APIs, and enforce strict paths
Introduces a Make-based orchestration for simulating, indexing, merging, filtering, and verifying k-mer counts and presence. Exposes internal builder and iterator APIs publicly, enforces mandatory leading slashes for predicate patterns, registers the `obitaxonomy` crate, and updates tooling configurations alongside documentation.
@@ -0,0 +1,132 @@
# Benchmark pipeline

Requires **GNU Make ≥ 4.3** (grouped targets `&:`). On macOS use `gmake`.

```
gmake all          # full pipeline
gmake simulate     # simulation only
gmake reference    # reference kmer sets only
```

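Because the version requirement is easy to miss, a small wrapper can pick a suitable binary before anything runs. The following is only a sketch under assumptions, not part of the benchmark: `version_ge`, `make_version`, and `pick_make` are hypothetical helpers. It prefers `gmake` when present, falls back to `make`, and accepts whichever reports GNU Make 4.3 or later in its `--version` banner.

```shell
#!/bin/sh
# Hypothetical helpers; not part of the benchmark scripts.

# version_ge A B: succeed when dotted version A >= B, compared field by field.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0   # missing fields count as 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# make_version CMD: print X.Y from a "GNU Make X.Y" banner, or nothing.
make_version() {
    "$1" --version 2>/dev/null | sed -n '1s/^GNU Make \([0-9][0-9.]*\).*/\1/p'
}

# pick_make: print the first of gmake/make that is GNU Make >= 4.3.
pick_make() {
    for candidate in gmake make; do
        command -v "$candidate" >/dev/null 2>&1 || continue
        found=$(make_version "$candidate")
        if [ -n "$found" ] && version_ge "$found" 4.3; then
            echo "$candidate"
            return 0
        fi
    done
    echo "GNU Make >= 4.3 not found (on macOS, install it and use gmake)" >&2
    return 1
}
```

With these helpers sourced, `MAKE_CMD=$(pick_make) && "$MAKE_CMD" all` would run the full pipeline with whichever binary qualifies.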
## Pipeline overview

```mermaid
flowchart TD
    GENOMES["genomes/*.fna.gz"]
    BIN["obikmer binary"]

    GENOMES --> simulate
    simulate --> simdata[("simulated_data/")]

    simdata --> reference
    reference --> refnpz[("reference_index/*.npz")]

    subgraph presence ["Presence track"]
        simdata --> index_presence
        BIN --> index_presence
        index_presence --> pres_done[("specimen_index_presence/")]
        index_presence --> pres_istats[("stats/indexing_presence/")]
        pres_istats --> aggregate_index_presence

        pres_done --> merge_presence
        BIN --> merge_presence
        merge_presence --> gpres[("global_index_presence/")]

        refnpz --> verify_presence
        pres_done --> verify_presence
        verify_presence --> vpres_stats[("stats/verify_presence/")]
        vpres_stats --> aggregate_verify_presence

        gpres --> filter_presence
        BIN --> filter_presence
        filter_presence --> spec_pres[("specific_index_presence/")]
        filter_presence --> spec_pres_stats[("stats/specific_kmer_presence/")]
        spec_pres_stats --> aggregate_filter_presence

        refnpz --> verify_merge_presence
        gpres --> verify_merge_presence
        verify_merge_presence --> vmp[("stats/verify_merge_presence/")]
    end

    subgraph count ["Count track"]
        simdata --> index_count
        BIN --> index_count
        index_count --> count_done[("specimen_index_count/")]
        index_count --> count_istats[("stats/indexing_count/")]
        count_istats --> aggregate_index_count

        count_done --> merge_count
        BIN --> merge_count
        merge_count --> gcount[("global_index_count/")]

        refnpz --> verify_count
        count_done --> verify_count
        verify_count --> vcount_stats[("stats/verify_count/")]
        vcount_stats --> aggregate_verify_count

        gcount --> filter_count
        BIN --> filter_count
        filter_count --> spec_count[("specific_index_count/")]
        filter_count --> spec_count_stats[("stats/specific_kmer_count/")]
        spec_count_stats --> aggregate_filter_count

        refnpz --> verify_merge_count
        gcount --> verify_merge_count
        verify_merge_count --> vmc[("stats/verify_merge_count/")]
    end

    aggregate_verify_presence --> all
    aggregate_verify_count --> all
    vmp --> all
    vmc --> all
    all -. "$(MAKE) re-eval" .-> aggregate_filter_presence
    all -. "$(MAKE) re-eval" .-> aggregate_filter_count
```

## Steps

| Target | Script | Description |
|---|---|---|
| `simulate` | `simulate.sh` | Simulate sequencing reads from the reference genomes |
| `reference` | `build_reference.sh` | Build reference kmer sets (`.npz`) from simulation truth |
| `index_presence` | `index_one_presence.sh` | Index each specimen (presence mode) |
| `index_count` | `index_one_count.sh` | Index each specimen (count mode) |
| `aggregate_index_presence` | `aggregate_stats.sh` | Aggregate per-specimen indexing stats (presence) |
| `aggregate_index_count` | `aggregate_stats.sh` | Aggregate per-specimen indexing stats (count) |
| `merge_presence` | `merge_presence.sh` | Merge all specimen presence indexes into a global index |
| `merge_count` | `merge_count.sh` | Merge all specimen count indexes into a global index |
| `verify_presence` | `verify_one_presence.sh` | Verify each specimen presence index against reference |
| `verify_count` | `verify_one_count.sh` | Verify each specimen count index against reference |
| `aggregate_verify_presence` | `aggregate_stats.sh` | Aggregate per-specimen verification stats (presence) |
| `aggregate_verify_count` | `aggregate_stats.sh` | Aggregate per-specimen verification stats (count) |
| `filter_presence` | `filter_one_presence.sh` | Extract species-specific presence indexes from global index |
| `filter_count` | `filter_one_count.sh` | Extract species-specific count indexes from global index |
| `aggregate_filter_presence` | `aggregate_stats.sh` | Aggregate species-specific kmer stats (presence) |
| `aggregate_filter_count` | `aggregate_stats.sh` | Aggregate species-specific kmer stats (count) |
| `verify_merge_presence` | `verify_merge_presence.sh` | Verify global presence index against all reference sets |
| `verify_merge_count` | `verify_merge_count.sh` | Verify global count index against all reference sets |

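The targets in the table can also be invoked one at a time, which may help when debugging a single stage. The sketch below is a hypothetical helper: `run_stages`, `PRESENCE_STAGES`, and the `MAKE_CMD` variable are assumptions, not part of the benchmark. The stage order for the presence track is read off the pipeline diagram; the Makefile's own dependencies remain the authority on what actually gets rebuilt.

```shell
#!/bin/sh
# Hypothetical helper; not part of the benchmark scripts.
# MAKE_CMD is an assumed variable naming the GNU Make binary (gmake on macOS).
MAKE_CMD=${MAKE_CMD:-gmake}

# Presence-track targets, in the order suggested by the pipeline diagram.
PRESENCE_STAGES="simulate reference index_presence merge_presence verify_presence filter_presence verify_merge_presence"

# run_stages TARGET...: build each target in turn, stopping at the first failure.
run_stages() {
    for stage in "$@"; do
        echo "==> $stage"
        "$MAKE_CMD" "$stage" || {
            echo "stage '$stage' failed" >&2
            return 1
        }
    done
}
```

Calling `run_stages $PRESENCE_STAGES` (unquoted, so the list splits into separate targets) walks the presence track; the count track would follow the same pattern with the `*_count` targets.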
## Directory layout

```
benchmark/
├── genomes/                    # input reference genomes (.fna.gz)
├── simulated_data/             # generated by simulate
│   └── <species>/<specimen>/
├── reference_index/            # reference kmer sets (.npz)
├── specimen_index_presence/    # per-specimen presence indexes
├── specimen_index_count/       # per-specimen count indexes
├── global_index_presence/      # merged global presence index
├── global_index_count/         # merged global count index
├── specific_index_presence/    # species-specific presence indexes
├── specific_index_count/       # species-specific count indexes
└── stats/                      # all benchmark statistics
    ├── indexing_presence/
    ├── indexing_count/
    ├── verify_presence/
    ├── verify_count/
    ├── specific_kmer_presence/
    ├── specific_kmer_count/
    ├── verify_merge_presence/
    └── verify_merge_count/
```
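After a run, it can be useful to confirm that every output directory in this layout was produced. The following is a minimal sketch, assuming the pipeline was run from `benchmark/` and that a complete run creates all of the listed directories; `EXPECTED_DIRS` and `missing_dirs` are hypothetical names, not part of the benchmark.

```shell
#!/bin/sh
# Hypothetical helper; not part of the benchmark scripts.

# Output directories from the layout, relative to benchmark/.
# genomes/ is an input and is deliberately left out.
EXPECTED_DIRS="
simulated_data
reference_index
specimen_index_presence
specimen_index_count
global_index_presence
global_index_count
specific_index_presence
specific_index_count
stats/indexing_presence
stats/indexing_count
stats/verify_presence
stats/verify_count
stats/specific_kmer_presence
stats/specific_kmer_count
stats/verify_merge_presence
stats/verify_merge_count
"

# missing_dirs ROOT: print each expected directory that does not exist under ROOT.
missing_dirs() {
    for dir in $EXPECTED_DIRS; do
        [ -d "$1/$dir" ] || echo "$dir"
    done
}
```

For example, `missing_dirs .` run inside `benchmark/` prints nothing when every directory is present, and otherwise lists the missing ones, one per line.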