# Benchmark pipeline

Requires **GNU Make ≥ 4.3** (grouped targets `&:`). On macOS, use `gmake`.

```
gmake all        # full pipeline
gmake simulate   # simulation only
gmake reference  # reference kmer sets only
```
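Because the Makefile relies on grouped targets, an older `make` will not work, so it can help to check the version before starting. The following is a minimal sketch, not part of the pipeline itself: the helper names `version_ge` and `pick_make` are hypothetical, and it assumes the first line of `make --version` has the usual `GNU Make X.Y` form.

```shell
# Hypothetical helpers (not shipped with the benchmark).

# Succeed when dotted version $1 is >= dotted version $2.
# Missing fields count as 0, so "4.3" is treated as "4.3.0".
version_ge() {
    awk -v have="$1" -v want="$2" 'BEGIN {
        split(have, h, "[.]"); split(want, w, "[.]")
        for (i = 1; i <= 3; i++) {
            if (h[i] + 0 > w[i] + 0) exit 0
            if (h[i] + 0 < w[i] + 0) exit 1
        }
        exit 0
    }'
}

# Print the first of gmake/make that reports GNU Make >= 4.3.
pick_make() {
    for candidate in gmake make; do
        command -v "$candidate" >/dev/null 2>&1 || continue
        # Assumption: the first line reads "GNU Make X.Y[.Z]".
        version=$("$candidate" --version 2>/dev/null |
            sed -n '1s/^GNU Make \([0-9.]*\).*/\1/p')
        if [ -n "$version" ] && version_ge "$version" 4.3; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}
```

With these defined, `MAKE_BIN=$(pick_make) && "$MAKE_BIN" all` would run the full pipeline with whichever suitable binary is found first.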

## Pipeline overview

```mermaid
flowchart TD
    GENOMES["genomes/*.fna.gz"]
    BIN["obikmer binary"]

    GENOMES --> simulate
    simulate --> simdata[("simulated_data/")]

    simdata --> reference
    reference --> refnpz[("reference_index/*.npz")]

    subgraph presence ["Presence track"]
        simdata --> index_presence
        BIN --> index_presence
        index_presence --> pres_done[("specimen_index_presence/")]
        index_presence --> pres_istats[("stats/indexing_presence/")]
        pres_istats --> aggregate_index_presence

        pres_done --> merge_presence
        BIN --> merge_presence
        merge_presence --> gpres[("global_index_presence/")]

        refnpz --> verify_presence
        pres_done --> verify_presence
        verify_presence --> vpres_stats[("stats/verify_presence/")]
        vpres_stats --> aggregate_verify_presence

        gpres --> filter_presence
        BIN --> filter_presence
        filter_presence --> spec_pres[("specific_index_presence/")]
        filter_presence --> spec_pres_stats[("stats/specific_kmer_presence/")]
        spec_pres_stats --> aggregate_filter_presence

        refnpz --> verify_merge_presence
        gpres --> verify_merge_presence
        verify_merge_presence --> vmp[("stats/verify_merge_presence/")]
    end

    subgraph count ["Count track"]
        simdata --> index_count
        BIN --> index_count
        index_count --> count_done[("specimen_index_count/")]
        index_count --> count_istats[("stats/indexing_count/")]
        count_istats --> aggregate_index_count

        count_done --> merge_count
        BIN --> merge_count
        merge_count --> gcount[("global_index_count/")]

        refnpz --> verify_count
        count_done --> verify_count
        verify_count --> vcount_stats[("stats/verify_count/")]
        vcount_stats --> aggregate_verify_count

        gcount --> filter_count
        BIN --> filter_count
        filter_count --> spec_count[("specific_index_count/")]
        filter_count --> spec_count_stats[("stats/specific_kmer_count/")]
        spec_count_stats --> aggregate_filter_count

        refnpz --> verify_merge_count
        gcount --> verify_merge_count
        verify_merge_count --> vmc[("stats/verify_merge_count/")]
    end

    aggregate_verify_presence --> all
    aggregate_verify_count --> all
    vmp --> all
    vmc --> all
    all -. "$(MAKE) re-eval" .-> aggregate_filter_presence
    all -. "$(MAKE) re-eval" .-> aggregate_filter_count
```
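The edges in the diagram imply an order in which the targets of a single track can be built. As a rough sketch, the hypothetical helper `track_targets` below prints one such order for either track. It assumes each node name is also a Make target, as the Steps table suggests, and has not been checked against the Makefile itself.

```shell
# Hypothetical helper: list the targets of one track ("presence" or
# "count") in an order consistent with the arrows in the flowchart.
track_targets() {
    track=$1
    case "$track" in
        presence|count) ;;
        *) echo "unknown track: $track" >&2; return 2 ;;
    esac
    # Shared stages feed both tracks.
    printf '%s\n' simulate reference
    # Per-track stages; each one appears after everything it depends on.
    for stage in index aggregate_index merge verify aggregate_verify \
                 verify_merge filter aggregate_filter; do
        printf '%s_%s\n' "$stage" "$track"
    done
}

# Example (not executed here): build the presence track step by step.
#   for t in $(track_targets presence); do gmake "$t" || break; done
```

Invoking each target in a separate `gmake` call means every step starts from a fresh evaluation of the Makefile, which is what the dotted `$(MAKE) re-eval` edges request for the filter aggregation.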

## Steps

| Target | Script | Description |
|---|---|---|
| `simulate` | `simulate.sh` | Simulate sequencing reads from the reference genomes |
| `reference` | `build_reference.sh` | Build reference kmer sets (`.npz`) from simulation truth |
| `index_presence` | `index_one_presence.sh` | Index each specimen (presence mode) |
| `index_count` | `index_one_count.sh` | Index each specimen (count mode) |
| `aggregate_index_presence` | `aggregate_stats.sh` | Aggregate per-specimen indexing stats (presence) |
| `aggregate_index_count` | `aggregate_stats.sh` | Aggregate per-specimen indexing stats (count) |
| `merge_presence` | `merge_presence.sh` | Merge all specimen presence indexes into a global index |
| `merge_count` | `merge_count.sh` | Merge all specimen count indexes into a global index |
| `verify_presence` | `verify_one_presence.sh` | Verify each specimen presence index against reference |
| `verify_count` | `verify_one_count.sh` | Verify each specimen count index against reference |
| `aggregate_verify_presence` | `aggregate_stats.sh` | Aggregate per-specimen verification stats (presence) |
| `aggregate_verify_count` | `aggregate_stats.sh` | Aggregate per-specimen verification stats (count) |
| `filter_presence` | `filter_one_presence.sh` | Extract species-specific presence indexes from global index |
| `filter_count` | `filter_one_count.sh` | Extract species-specific count indexes from global index |
| `aggregate_filter_presence` | `aggregate_stats.sh` | Aggregate species-specific kmer stats (presence) |
| `aggregate_filter_count` | `aggregate_stats.sh` | Aggregate species-specific kmer stats (count) |
| `verify_merge_presence` | `verify_merge_presence.sh` | Verify global presence index against all reference sets |
| `verify_merge_count` | `verify_merge_count.sh` | Verify global count index against all reference sets |

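
The table names thirteen distinct scripts, so a quick check that all of them are present can save a failed run. The sketch below is illustrative only: `missing_scripts` is a hypothetical helper, and the directory that holds the scripts is an assumption passed as an argument, since the table does not say where they live.

```shell
# Hypothetical helper: print each script from the Steps table that is
# absent from directory $1 (default: current directory).
# Returns 0 when none are missing, 1 otherwise.
missing_scripts() {
    dir=${1:-.}
    status=0
    for script in \
        simulate.sh build_reference.sh \
        index_one_presence.sh index_one_count.sh \
        merge_presence.sh merge_count.sh \
        verify_one_presence.sh verify_one_count.sh \
        filter_one_presence.sh filter_one_count.sh \
        verify_merge_presence.sh verify_merge_count.sh \
        aggregate_stats.sh
    do
        if [ ! -f "$dir/$script" ]; then
            printf '%s\n' "$script"
            status=1
        fi
    done
    return "$status"
}
```

Calling `missing_scripts path/to/scripts` prints nothing and succeeds when every script is in place.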

## Directory layout

```
benchmark/
├── genomes/                    # input reference genomes (.fna.gz)
├── simulated_data/             # generated by simulate
│   └── <species>/<specimen>/
├── reference_index/            # reference kmer sets (.npz)
├── specimen_index_presence/    # per-specimen presence indexes
├── specimen_index_count/       # per-specimen count indexes
├── global_index_presence/      # merged global presence index
├── global_index_count/         # merged global count index
├── specific_index_presence/    # species-specific presence indexes
├── specific_index_count/       # species-specific count indexes
└── stats/                      # all benchmark statistics
    ├── indexing_presence/
    ├── indexing_count/
    ├── verify_presence/
    ├── verify_count/
    ├── specific_kmer_presence/
    ├── specific_kmer_count/
    ├── verify_merge_presence/
    └── verify_merge_count/
```
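To see how far a run has progressed, it can be useful to compare the directories on disk with the layout above. This is a small sketch under the assumption that the tree is rooted at `benchmark/`. The helpers `layout_dirs` and `layout_status` are hypothetical and only inspect directories; they create nothing.

```shell
# Hypothetical helper: list the directories from the layout,
# relative to the benchmark root.
layout_dirs() {
    printf '%s\n' genomes simulated_data reference_index
    for mode in presence count; do
        printf '%s\n' \
            "specimen_index_$mode" "global_index_$mode" "specific_index_$mode" \
            "stats/indexing_$mode" "stats/verify_$mode" \
            "stats/specific_kmer_$mode" "stats/verify_merge_$mode"
    done
}

# Hypothetical helper: report each expected directory under root $1
# (default: benchmark) as present or missing.
layout_status() {
    root=${1:-benchmark}
    layout_dirs | while IFS= read -r dir; do
        if [ -d "$root/$dir" ]; then
            printf 'present  %s\n' "$dir"
        else
            printf 'missing  %s\n' "$dir"
        fi
    done
}
```

For example, `layout_status benchmark | grep '^missing'` lists the outputs that have not been produced yet.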