c694e1f2b0
Introduces a Make-based orchestration for simulating, indexing, merging, filtering, and verifying k-mer counts and presence. Exposes internal builder and iterator APIs publicly, enforces mandatory leading slashes for predicate patterns, registers the `obitaxonomy` crate, and updates tooling configurations alongside documentation.
Benchmark pipeline
Requires GNU Make ≥ 4.3, which introduced grouped targets (`&:`). On macOS the system `make` is GNU Make 3.81, so use `gmake` instead.
```sh
gmake all        # full pipeline
gmake simulate   # simulation only
gmake reference  # reference kmer sets only
```
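The version requirement can be checked before launching the pipeline. The following is a minimal sketch, not part of the repository: the helper names `pick_make`, `make_version`, and `version_ge` are hypothetical, and the parsing assumes the first line of `make --version` has the form `GNU Make X.Y`.

```sh
#!/bin/sh
# Sketch only: helper names are illustrative, not taken from the repository.

# Return 0 if version $1 (MAJOR.MINOR[.PATCH]) is >= required version $2.
version_ge() {
    have=$1 want=$2
    have_major=${have%%.*}
    want_major=${want%%.*}
    case $have in
        *.*) have_minor=${have#*.}; have_minor=${have_minor%%.*} ;;
        *)   have_minor=0 ;;
    esac
    case $want in
        *.*) want_minor=${want#*.}; want_minor=${want_minor%%.*} ;;
        *)   want_minor=0 ;;
    esac
    [ "$have_major" -gt "$want_major" ] && return 0
    [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]
}

# Prefer gmake when it is on PATH (the usual name for a newer GNU Make on macOS).
pick_make() {
    if command -v gmake >/dev/null 2>&1; then echo gmake; else echo make; fi
}

# Print the version number from the first line of `<make> --version`,
# or nothing if the binary is missing or is not GNU Make.
make_version() {
    "$1" --version 2>/dev/null | sed -n '1s/^GNU Make \([0-9][0-9.]*\).*/\1/p'
}

MAKE_BIN=$(pick_make)
MAKE_VER=$(make_version "$MAKE_BIN")
if [ -n "$MAKE_VER" ] && version_ge "$MAKE_VER" 4.3; then
    echo "$MAKE_BIN $MAKE_VER supports grouped targets"
else
    echo "GNU Make >= 4.3 not found (got: ${MAKE_VER:-none})" >&2
fi
```

The comparison is done on the major and minor components only, which is enough to separate 3.81 from 4.3 and later.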
Pipeline overview
```mermaid
flowchart TD
    GENOMES["genomes/*.fna.gz"]
    BIN["obikmer binary"]

    GENOMES --> simulate
    simulate --> simdata[("simulated_data/")]
    simdata --> reference
    reference --> refnpz[("reference_index/*.npz")]

    subgraph presence ["Presence track"]
        simdata --> index_presence
        BIN --> index_presence
        index_presence --> pres_done[("specimen_index_presence/")]
        index_presence --> pres_istats[("stats/indexing_presence/")]
        pres_istats --> aggregate_index_presence
        pres_done --> merge_presence
        BIN --> merge_presence
        merge_presence --> gpres[("global_index_presence/")]
        refnpz --> verify_presence
        pres_done --> verify_presence
        verify_presence --> vpres_stats[("stats/verify_presence/")]
        vpres_stats --> aggregate_verify_presence
        gpres --> filter_presence
        BIN --> filter_presence
        filter_presence --> spec_pres[("specific_index_presence/")]
        filter_presence --> spec_pres_stats[("stats/specific_kmer_presence/")]
        spec_pres_stats --> aggregate_filter_presence
        refnpz --> verify_merge_presence
        gpres --> verify_merge_presence
        verify_merge_presence --> vmp[("stats/verify_merge_presence/")]
    end

    subgraph count ["Count track"]
        simdata --> index_count
        BIN --> index_count
        index_count --> count_done[("specimen_index_count/")]
        index_count --> count_istats[("stats/indexing_count/")]
        count_istats --> aggregate_index_count
        count_done --> merge_count
        BIN --> merge_count
        merge_count --> gcount[("global_index_count/")]
        refnpz --> verify_count
        count_done --> verify_count
        verify_count --> vcount_stats[("stats/verify_count/")]
        vcount_stats --> aggregate_verify_count
        gcount --> filter_count
        BIN --> filter_count
        filter_count --> spec_count[("specific_index_count/")]
        filter_count --> spec_count_stats[("stats/specific_kmer_count/")]
        spec_count_stats --> aggregate_filter_count
        refnpz --> verify_merge_count
        gcount --> verify_merge_count
        verify_merge_count --> vmc[("stats/verify_merge_count/")]
    end

    aggregate_verify_presence --> all
    aggregate_verify_count --> all
    vmp --> all
    vmc --> all
    all -. "$(MAKE) re-eval" .-> aggregate_filter_presence
    all -. "$(MAKE) re-eval" .-> aggregate_filter_count
```
Steps
| Target | Script | Description |
|---|---|---|
| simulate | simulate.sh | Simulate sequencing reads from the reference genomes |
| reference | build_reference.sh | Build reference kmer sets (.npz) from simulation truth |
| index_presence | index_one_presence.sh | Index each specimen (presence mode) |
| index_count | index_one_count.sh | Index each specimen (count mode) |
| aggregate_index_presence | aggregate_stats.sh | Aggregate per-specimen indexing stats (presence) |
| aggregate_index_count | aggregate_stats.sh | Aggregate per-specimen indexing stats (count) |
| merge_presence | merge_presence.sh | Merge all specimen presence indexes into a global index |
| merge_count | merge_count.sh | Merge all specimen count indexes into a global index |
| verify_presence | verify_one_presence.sh | Verify each specimen presence index against reference |
| verify_count | verify_one_count.sh | Verify each specimen count index against reference |
| aggregate_verify_presence | aggregate_stats.sh | Aggregate per-specimen verification stats (presence) |
| aggregate_verify_count | aggregate_stats.sh | Aggregate per-specimen verification stats (count) |
| filter_presence | filter_one_presence.sh | Extract species-specific presence indexes from global index |
| filter_count | filter_one_count.sh | Extract species-specific count indexes from global index |
| aggregate_filter_presence | aggregate_stats.sh | Aggregate species-specific kmer stats (presence) |
| aggregate_filter_count | aggregate_stats.sh | Aggregate species-specific kmer stats (count) |
| verify_merge_presence | verify_merge_presence.sh | Verify global presence index against all reference sets |
| verify_merge_count | verify_merge_count.sh | Verify global count index against all reference sets |
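Because the presence and count tracks use the same stage names with a different suffix, a single track can be driven from a small wrapper. The sketch below is an illustration, not a script shipped with the benchmark: `track_targets`, `run_track`, and the `MAKE_BIN` variable are hypothetical names, and the stage order is inferred from the target table and the flowchart. Make resolves prerequisites on its own, so the explicit ordering only makes progress easier to follow.

```sh
#!/bin/sh
# Sketch only: function and variable names are illustrative.

# Assumption: MAKE_BIN points at a GNU Make >= 4.3 (use gmake on macOS).
MAKE_BIN=${MAKE_BIN:-make}

# Print the targets of one track ("presence" or "count") in pipeline order.
track_targets() {
    case $1 in
        presence|count) ;;
        *) echo "unknown track: $1" >&2; return 1 ;;
    esac
    for stage in index aggregate_index merge verify aggregate_verify \
                 filter aggregate_filter verify_merge
    do
        printf '%s_%s\n' "$stage" "$1"
    done
}

# Run every target of a track, stopping at the first failure.
run_track() {
    targets=$(track_targets "$1") || return 1
    for t in $targets; do
        "$MAKE_BIN" "$t" || return 1
    done
}

# Example (run from the benchmark directory):
#   MAKE_BIN=gmake run_track presence
```

Calling `track_targets count` on its own prints the eight count targets, which is handy for checking the list before running anything.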
Directory layout
```
benchmark/
├── genomes/                    # input reference genomes (.fna.gz)
├── simulated_data/             # generated by simulate
│   └── <species>/<specimen>/
├── reference_index/            # reference kmer sets (.npz)
├── specimen_index_presence/    # per-specimen presence indexes
├── specimen_index_count/       # per-specimen count indexes
├── global_index_presence/      # merged global presence index
├── global_index_count/         # merged global count index
├── specific_index_presence/    # species-specific presence indexes
├── specific_index_count/       # species-specific count indexes
└── stats/                      # all benchmark statistics
    ├── indexing_presence/
    ├── indexing_count/
    ├── verify_presence/
    ├── verify_count/
    ├── specific_kmer_presence/
    ├── specific_kmer_count/
    ├── verify_merge_presence/
    └── verify_merge_count/
```
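The layout above also gives a quick way to see how far a run has progressed. The following sketch is not part of the benchmark scripts: `check_layout` is a hypothetical helper that reports, for each generated directory, whether it is populated, empty, or missing. The input directory `genomes/` is deliberately left out because the pipeline does not create it.

```sh
#!/bin/sh
# Sketch only: check_layout is an illustrative helper, not a repository script.

# Report the state of every generated directory under the benchmark root $1.
check_layout() {
    root=$1
    for d in simulated_data reference_index \
             specimen_index_presence specimen_index_count \
             global_index_presence global_index_count \
             specific_index_presence specific_index_count \
             stats/indexing_presence stats/indexing_count \
             stats/verify_presence stats/verify_count \
             stats/specific_kmer_presence stats/specific_kmer_count \
             stats/verify_merge_presence stats/verify_merge_count
    do
        if [ ! -d "$root/$d" ]; then
            state=missing
        elif [ -n "$(ls -A "$root/$d" 2>/dev/null)" ]; then
            state=ok          # directory exists and has at least one entry
        else
            state=empty
        fi
        printf '%-8s %s\n' "$state" "$d"
    done
}
```

Running `check_layout benchmark` from the repository root prints one line per directory. A populated directory only shows that a step produced output, not that the output passed verification.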