# Benchmark pipeline

Requires **GNU Make ≥ 4.3** (grouped targets `&:`). On macOS use `gmake`.

```
gmake all         # full pipeline
gmake simulate    # simulation only
gmake reference   # reference kmer sets only
```

## Pipeline overview

```mermaid
flowchart TD
    GENOMES["genomes/*.fna.gz"]
    BIN["obikmer binary"]

    GENOMES --> simulate
    simulate --> simdata[("simulated_data/")]
    simdata --> reference
    reference --> refnpz[("reference_index/*.npz")]

    subgraph presence ["Presence track"]
        simdata --> index_presence
        BIN --> index_presence
        index_presence --> pres_done[("specimen_index_presence/")]
        index_presence --> pres_istats[("stats/indexing_presence/")]
        pres_istats --> aggregate_index_presence

        pres_done --> merge_presence
        BIN --> merge_presence
        merge_presence --> gpres[("global_index_presence/")]

        refnpz --> verify_presence
        pres_done --> verify_presence
        verify_presence --> vpres_stats[("stats/verify_presence/")]
        vpres_stats --> aggregate_verify_presence

        gpres --> filter_presence
        BIN --> filter_presence
        filter_presence --> spec_pres[("specific_index_presence/")]
        filter_presence --> spec_pres_stats[("stats/specific_kmer_presence/")]
        spec_pres_stats --> aggregate_filter_presence

        refnpz --> verify_merge_presence
        gpres --> verify_merge_presence
        verify_merge_presence --> vmp[("stats/verify_merge_presence/")]
    end

    subgraph count ["Count track"]
        simdata --> index_count
        BIN --> index_count
        index_count --> count_done[("specimen_index_count/")]
        index_count --> count_istats[("stats/indexing_count/")]
        count_istats --> aggregate_index_count

        count_done --> merge_count
        BIN --> merge_count
        merge_count --> gcount[("global_index_count/")]

        refnpz --> verify_count
        count_done --> verify_count
        verify_count --> vcount_stats[("stats/verify_count/")]
        vcount_stats --> aggregate_verify_count

        gcount --> filter_count
        BIN --> filter_count
        filter_count --> spec_count[("specific_index_count/")]
        filter_count --> spec_count_stats[("stats/specific_kmer_count/")]
        spec_count_stats --> aggregate_filter_count

        refnpz --> verify_merge_count
        gcount --> verify_merge_count
        verify_merge_count --> vmc[("stats/verify_merge_count/")]
    end

    aggregate_verify_presence --> all
    aggregate_verify_count --> all
    vmp --> all
    vmc --> all
    all -. "$(MAKE) re-eval" .-> aggregate_filter_presence
    all -. "$(MAKE) re-eval" .-> aggregate_filter_count
```

## Steps

| Target | Script | Description |
|---|---|---|
| `simulate` | `simulate.sh` | Simulate sequencing reads from the reference genomes |
| `reference` | `build_reference.sh` | Build reference kmer sets (`.npz`) from simulation truth |
| `index_presence` | `index_one_presence.sh` | Index each specimen (presence mode) |
| `index_count` | `index_one_count.sh` | Index each specimen (count mode) |
| `aggregate_index_presence` | `aggregate_stats.sh` | Aggregate per-specimen indexing stats (presence) |
| `aggregate_index_count` | `aggregate_stats.sh` | Aggregate per-specimen indexing stats (count) |
| `merge_presence` | `merge_presence.sh` | Merge all specimen presence indexes into a global index |
| `merge_count` | `merge_count.sh` | Merge all specimen count indexes into a global index |
| `verify_presence` | `verify_one_presence.sh` | Verify each specimen presence index against reference |
| `verify_count` | `verify_one_count.sh` | Verify each specimen count index against reference |
| `aggregate_verify_presence` | `aggregate_stats.sh` | Aggregate per-specimen verification stats (presence) |
| `aggregate_verify_count` | `aggregate_stats.sh` | Aggregate per-specimen verification stats (count) |
| `filter_presence` | `filter_one_presence.sh` | Extract species-specific presence indexes from global index |
| `filter_count` | `filter_one_count.sh` | Extract species-specific count indexes from global index |
| `aggregate_filter_presence` | `aggregate_stats.sh` | Aggregate species-specific kmer stats (presence) |
| `aggregate_filter_count` | `aggregate_stats.sh` | Aggregate species-specific kmer stats (count) |
| `verify_merge_presence` | `verify_merge_presence.sh` | Verify global presence index against all reference sets |
| `verify_merge_count` | `verify_merge_count.sh` | Verify global count index against all reference sets |

## Directory layout

```
benchmark/
├── genomes/                    # input reference genomes (.fna.gz)
├── simulated_data/             # generated by simulate
│   └── //
├── reference_index/            # reference kmer sets (.npz)
├── specimen_index_presence/    # per-specimen presence indexes
├── specimen_index_count/       # per-specimen count indexes
├── global_index_presence/      # merged global presence index
├── global_index_count/         # merged global count index
├── specific_index_presence/    # species-specific presence indexes
├── specific_index_count/       # species-specific count indexes
└── stats/                      # all benchmark statistics
    ├── indexing_presence/
    ├── indexing_count/
    ├── verify_presence/
    ├── verify_count/
    ├── specific_kmer_presence/
    ├── specific_kmer_count/
    ├── verify_merge_presence/
    └── verify_merge_count/
```
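
## Checking the Make version

The GNU Make ≥ 4.3 requirement stated at the top of this file can be checked before launching the pipeline. The snippet below is a minimal sketch, not part of the benchmark scripts: the helper names `version_ge` and `find_gnu_make` are hypothetical, and it assumes the first line of `make --version` has the usual `GNU Make X.Y` form. It tries `gmake` first, then `make`, and reports the first one that is recent enough.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with the benchmark) to locate a GNU Make >= 4.3.

# version_ge A B: succeed when dotted version A >= B (major.minor only).
version_ge() {
    a_major=${1%%.*}
    b_major=${2%%.*}
    case $1 in
        *.*) a_minor=${1#*.}; a_minor=${a_minor%%.*} ;;
        *)   a_minor=0 ;;
    esac
    case $2 in
        *.*) b_minor=${2#*.}; b_minor=${b_minor%%.*} ;;
        *)   b_minor=0 ;;
    esac
    [ "$a_major" -gt "$b_major" ] ||
        { [ "$a_major" -eq "$b_major" ] && [ "$a_minor" -ge "$b_minor" ]; }
}

# find_gnu_make: print the first of gmake/make that is GNU Make >= 4.3.
find_gnu_make() {
    for cmd in gmake make; do
        command -v "$cmd" >/dev/null 2>&1 || continue
        # Assumption: the first line looks like "GNU Make 4.4.1".
        version=$("$cmd" --version 2>/dev/null |
            sed -n '1s/^GNU Make \([0-9][0-9.]*\).*/\1/p')
        if [ -n "$version" ] && version_ge "$version" 4.3; then
            echo "$cmd"
            return 0
        fi
    done
    return 1
}

if MAKE_BIN=$(find_gnu_make); then
    echo "Using $MAKE_BIN"
else
    echo "No GNU Make >= 4.3 found (on macOS, install gmake)" >&2
fi
```

When a suitable binary is found, `"$MAKE_BIN" all` runs the full pipeline with it. Only the major and minor components are compared, which is enough to separate 4.3 and later from older releases.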