469e53b6f5
Introduces scripts to compute and validate pairwise genomic distance matrices across multiple metrics. Updates the Makefile with build and comparison targets, adds .gitignore rules for generated outputs, and includes test CSV matrices and a Newick phylogenetic tree for validating the distance computation pipeline.
Benchmark pipeline
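Requires GNU Make ≥ 4.3, the first release to support grouped targets (`&:`). The `make` bundled with macOS is older (GNU Make 3.81), so on macOS install a current GNU Make and run it as `gmake`.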
```sh
gmake all        # full pipeline
gmake simulate   # simulation only
gmake reference  # reference kmer sets only
```
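Since the pipeline depends on the Make version, a small wrapper can check for a suitable binary before anything runs. The following is a minimal sketch, not part of the repository: the helper names `version_ge` and `pick_make` are hypothetical, and the parsing assumes the first line of `--version` output has the form `GNU Make X.Y[.Z]`.

```sh
# Succeed if dotted version $1 is greater than or equal to dotted version $2.
version_ge() {
    v1=$1
    v2=$2
    while [ -n "$v1" ] || [ -n "$v2" ]; do
        f1=${v1%%.*}
        f2=${v2%%.*}
        # Missing fields count as 0, so "4.3" compares like "4.3.0".
        [ "${f1:-0}" -gt "${f2:-0}" ] && return 0
        [ "${f1:-0}" -lt "${f2:-0}" ] && return 1
        # Drop the field just compared (and its dot, if any).
        case $v1 in *.*) v1=${v1#*.} ;; *) v1= ;; esac
        case $v2 in *.*) v2=${v2#*.} ;; *) v2= ;; esac
    done
    return 0
}

# Print the first of gmake/make that is GNU Make >= 4.3; fail if neither is.
pick_make() {
    for candidate in gmake make; do
        command -v "$candidate" >/dev/null 2>&1 || continue
        # Take the third field of a first line such as "GNU Make 4.4.1".
        version=$("$candidate" --version 2>/dev/null |
            awk 'NR == 1 && /GNU Make/ { print $3 }')
        if [ -n "$version" ] && version_ge "$version" 4.3; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}
```

With these definitions loaded, `"$(pick_make)" all` would run the full pipeline with whichever binary qualifies, and `pick_make` returning non-zero signals that a newer GNU Make needs to be installed.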
Pipeline overview
```mermaid
flowchart TD
    GENOMES["genomes/*.fna.gz"]
    BIN["obikmer binary"]
    GENOMES --> simulate
    simulate --> simdata[("simulated_data/")]
    simdata --> reference
    reference --> refnpz[("reference_index/*.npz")]
    subgraph presence ["Presence track"]
        simdata --> index_presence
        BIN --> index_presence
        index_presence --> pres_done[("specimen_index_presence/")]
        index_presence --> pres_istats[("stats/indexing_presence/")]
        pres_istats --> aggregate_index_presence
        pres_done --> merge_presence
        BIN --> merge_presence
        merge_presence --> gpres[("global_index_presence/")]
        refnpz --> verify_presence
        pres_done --> verify_presence
        verify_presence --> vpres_stats[("stats/verify_presence/")]
        vpres_stats --> aggregate_verify_presence
        gpres --> filter_presence
        BIN --> filter_presence
        filter_presence --> spec_pres[("specific_index_presence/")]
        filter_presence --> spec_pres_stats[("stats/specific_kmer_presence/")]
        spec_pres_stats --> aggregate_filter_presence
        refnpz --> verify_merge_presence
        gpres --> verify_merge_presence
        verify_merge_presence --> vmp[("stats/verify_merge_presence/")]
    end
    subgraph count ["Count track"]
        simdata --> index_count
        BIN --> index_count
        index_count --> count_done[("specimen_index_count/")]
        index_count --> count_istats[("stats/indexing_count/")]
        count_istats --> aggregate_index_count
        count_done --> merge_count
        BIN --> merge_count
        merge_count --> gcount[("global_index_count/")]
        refnpz --> verify_count
        count_done --> verify_count
        verify_count --> vcount_stats[("stats/verify_count/")]
        vcount_stats --> aggregate_verify_count
        gcount --> filter_count
        BIN --> filter_count
        filter_count --> spec_count[("specific_index_count/")]
        filter_count --> spec_count_stats[("stats/specific_kmer_count/")]
        spec_count_stats --> aggregate_filter_count
        refnpz --> verify_merge_count
        gcount --> verify_merge_count
        verify_merge_count --> vmc[("stats/verify_merge_count/")]
    end
    aggregate_verify_presence --> all
    aggregate_verify_count --> all
    vmp --> all
    vmc --> all
    all -. "$(MAKE) re-eval" .-> aggregate_filter_presence
    all -. "$(MAKE) re-eval" .-> aggregate_filter_count
```
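The two tracks in the flowchart are symmetric: each has the same stages with a `_presence` or `_count` suffix. As an illustration, the sketch below prints the target names for one track in an order consistent with the arrows above. The helper name `track_targets` is hypothetical, and whether a single track can be built in isolation depends on how the Makefile declares its prerequisites.

```sh
# Print the per-track target names from the flowchart, one per line,
# in an order that respects the arrows (index first, filter stats last).
track_targets() {
    track=$1
    case $track in
        presence | count) ;;
        *)
            echo "unknown track: $track" >&2
            return 1
            ;;
    esac
    for stage in index aggregate_index verify aggregate_verify \
                 merge verify_merge filter aggregate_filter; do
        printf '%s_%s\n' "$stage" "$track"
    done
}
```

Under those assumptions, `gmake $(track_targets count)` would request only the count-track targets; Make itself still resolves the actual build order from the rules.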
Steps
| Target | Script | Description |
|---|---|---|
| `simulate` | `simulate.sh` | Simulate sequencing reads from the reference genomes |
| `reference` | `build_reference.sh` | Build reference kmer sets (.npz) from simulation truth |
| `index_presence` | `index_one_presence.sh` | Index each specimen (presence mode) |
| `index_count` | `index_one_count.sh` | Index each specimen (count mode) |
| `aggregate_index_presence` | `aggregate_stats.sh` | Aggregate per-specimen indexing stats (presence) |
| `aggregate_index_count` | `aggregate_stats.sh` | Aggregate per-specimen indexing stats (count) |
| `merge_presence` | `merge_presence.sh` | Merge all specimen presence indexes into a global index |
| `merge_count` | `merge_count.sh` | Merge all specimen count indexes into a global index |
| `verify_presence` | `verify_one_presence.sh` | Verify each specimen presence index against reference |
| `verify_count` | `verify_one_count.sh` | Verify each specimen count index against reference |
| `aggregate_verify_presence` | `aggregate_stats.sh` | Aggregate per-specimen verification stats (presence) |
| `aggregate_verify_count` | `aggregate_stats.sh` | Aggregate per-specimen verification stats (count) |
| `filter_presence` | `filter_one_presence.sh` | Extract species-specific presence indexes from global index |
| `filter_count` | `filter_one_count.sh` | Extract species-specific count indexes from global index |
| `aggregate_filter_presence` | `aggregate_stats.sh` | Aggregate species-specific kmer stats (presence) |
| `aggregate_filter_count` | `aggregate_stats.sh` | Aggregate species-specific kmer stats (count) |
| `verify_merge_presence` | `verify_merge_presence.sh` | Verify global presence index against all reference sets |
| `verify_merge_count` | `verify_merge_count.sh` | Verify global count index against all reference sets |
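Several of the steps above run once per specimen. A minimal sketch of how the specimens could be enumerated, assuming the `simulated_data/<species>/<specimen>/` layout used by this benchmark; the helper name `list_specimens` is hypothetical, and the arguments the per-specimen scripts actually take are not shown here.

```sh
# Print "<species>/<specimen>" for every second-level directory under the
# given root (default: simulated_data). Plain files are ignored because the
# trailing slash in the pattern matches directories only.
list_specimens() {
    root=${1:-simulated_data}
    for dir in "$root"/*/*/; do
        # When nothing matches, the pattern is left unexpanded; skip it.
        [ -d "$dir" ] || continue
        dir=${dir%/}
        specimen=${dir##*/}
        species_dir=${dir%/*}
        species=${species_dir##*/}
        printf '%s/%s\n' "$species" "$specimen"
    done
}
```

The output can be piped into a `while IFS= read -r specimen` loop to inspect or process one specimen at a time.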
Directory layout
```
benchmark/
├── genomes/                    # input reference genomes (.fna.gz)
├── simulated_data/             # generated by simulate
│   └── <species>/<specimen>/
├── reference_index/            # reference kmer sets (.npz)
├── specimen_index_presence/    # per-specimen presence indexes
├── specimen_index_count/       # per-specimen count indexes
├── global_index_presence/      # merged global presence index
├── global_index_count/         # merged global count index
├── specific_index_presence/    # species-specific presence indexes
├── specific_index_count/       # species-specific count indexes
└── stats/                      # all benchmark statistics
    ├── indexing_presence/
    ├── indexing_count/
    ├── verify_presence/
    ├── verify_count/
    ├── specific_kmer_presence/
    ├── specific_kmer_count/
    ├── verify_merge_presence/
    └── verify_merge_count/
```
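After a run, it can be useful to confirm that every statistics directory in the layout was produced. The sketch below derives the eight `stats/` subdirectory names from the layout above and reports any that are absent; the helper names `stats_dirs` and `missing_stats_dirs` are hypothetical and not part of the repository.

```sh
# Print the eight stats subdirectories from the layout, relative to benchmark/.
stats_dirs() {
    for kind in indexing verify specific_kmer verify_merge; do
        for track in presence count; do
            printf 'stats/%s_%s\n' "$kind" "$track"
        done
    done
}

# Print each expected stats directory that is missing under the given
# benchmark root (default: current directory). Empty output means all exist.
missing_stats_dirs() {
    root=${1:-.}
    stats_dirs | while IFS= read -r rel; do
        [ -d "$root/$rel" ] || printf '%s\n' "$rel"
    done
}
```

Running `missing_stats_dirs benchmark` from the directory that contains `benchmark/` lists whatever a partial run has not yet generated.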