Files

133 lines
5.6 KiB
Markdown
Raw Permalink Normal View History

# Benchmark pipeline
Requires **GNU Make ≥ 4.3** (grouped targets `&:`). On macOS use `gmake`.
```
gmake all # full pipeline
gmake simulate # simulation only
gmake reference # reference kmer sets only
```
## Pipeline overview
```mermaid
flowchart TD
GENOMES["genomes/*.fna.gz"]
BIN["obikmer binary"]
GENOMES --> simulate
simulate --> simdata[("simulated_data/")]
simdata --> reference
reference --> refnpz[("reference_index/*.npz")]
subgraph presence ["Presence track"]
simdata --> index_presence
BIN --> index_presence
index_presence --> pres_done[("specimen_index_presence/")]
index_presence --> pres_istats[("stats/indexing_presence/")]
pres_istats --> aggregate_index_presence
pres_done --> merge_presence
BIN --> merge_presence
merge_presence --> gpres[("global_index_presence/")]
refnpz --> verify_presence
pres_done --> verify_presence
verify_presence --> vpres_stats[("stats/verify_presence/")]
vpres_stats --> aggregate_verify_presence
gpres --> filter_presence
BIN --> filter_presence
filter_presence --> spec_pres[("specific_index_presence/")]
filter_presence --> spec_pres_stats[("stats/specific_kmer_presence/")]
spec_pres_stats --> aggregate_filter_presence
refnpz --> verify_merge_presence
gpres --> verify_merge_presence
verify_merge_presence --> vmp[("stats/verify_merge_presence/")]
end
subgraph count ["Count track"]
simdata --> index_count
BIN --> index_count
index_count --> count_done[("specimen_index_count/")]
index_count --> count_istats[("stats/indexing_count/")]
count_istats --> aggregate_index_count
count_done --> merge_count
BIN --> merge_count
merge_count --> gcount[("global_index_count/")]
refnpz --> verify_count
count_done --> verify_count
verify_count --> vcount_stats[("stats/verify_count/")]
vcount_stats --> aggregate_verify_count
gcount --> filter_count
BIN --> filter_count
filter_count --> spec_count[("specific_index_count/")]
filter_count --> spec_count_stats[("stats/specific_kmer_count/")]
spec_count_stats --> aggregate_filter_count
refnpz --> verify_merge_count
gcount --> verify_merge_count
verify_merge_count --> vmc[("stats/verify_merge_count/")]
end
aggregate_verify_presence --> all
aggregate_verify_count --> all
vmp --> all
vmc --> all
all -. "$(MAKE) re-eval" .-> aggregate_filter_presence
all -. "$(MAKE) re-eval" .-> aggregate_filter_count
```
## Steps
| Target | Script | Description |
|---|---|---|
| `simulate` | `simulate.sh` | Simulate sequencing reads from the reference genomes |
| `reference` | `build_reference.sh` | Build reference kmer sets (`.npz`) from simulation truth |
| `index_presence` | `index_one_presence.sh` | Index each specimen (presence mode) |
| `index_count` | `index_one_count.sh` | Index each specimen (count mode) |
| `aggregate_index_presence` | `aggregate_stats.sh` | Aggregate per-specimen indexing stats (presence) |
| `aggregate_index_count` | `aggregate_stats.sh` | Aggregate per-specimen indexing stats (count) |
| `merge_presence` | `merge_presence.sh` | Merge all specimen presence indexes into a global index |
| `merge_count` | `merge_count.sh` | Merge all specimen count indexes into a global index |
| `verify_presence` | `verify_one_presence.sh` | Verify each specimen presence index against reference |
| `verify_count` | `verify_one_count.sh` | Verify each specimen count index against reference |
| `aggregate_verify_presence` | `aggregate_stats.sh` | Aggregate per-specimen verification stats (presence) |
| `aggregate_verify_count` | `aggregate_stats.sh` | Aggregate per-specimen verification stats (count) |
| `filter_presence` | `filter_one_presence.sh` | Extract species-specific presence indexes from global index |
| `filter_count` | `filter_one_count.sh` | Extract species-specific count indexes from global index |
| `aggregate_filter_presence` | `aggregate_stats.sh` | Aggregate species-specific kmer stats (presence) |
| `aggregate_filter_count` | `aggregate_stats.sh` | Aggregate species-specific kmer stats (count) |
| `verify_merge_presence` | `verify_merge_presence.sh` | Verify global presence index against all reference sets |
| `verify_merge_count` | `verify_merge_count.sh` | Verify global count index against all reference sets |
## Directory layout
```
benchmark/
├── genomes/ # input reference genomes (.fna.gz)
├── simulated_data/ # generated by simulate
│ └── <species>/<specimen>/
├── reference_index/ # reference kmer sets (.npz)
├── specimen_index_presence/ # per-specimen presence indexes
├── specimen_index_count/ # per-specimen count indexes
├── global_index_presence/ # merged global presence index
├── global_index_count/ # merged global count index
├── specific_index_presence/ # species-specific presence indexes
├── specific_index_count/ # species-specific count indexes
└── stats/ # all benchmark statistics
├── indexing_presence/
├── indexing_count/
├── verify_presence/
├── verify_count/
├── specific_kmer_presence/
├── specific_kmer_count/
├── verify_merge_presence/
└── verify_merge_count/
```