obikmer/benchmark
Eric Coissac 469e53b6f5 Add genomic distance benchmarking suite and test data
Introduces scripts to compute and validate pairwise genomic distance matrices across multiple metrics. Updates the Makefile with build and comparison targets, adds .gitignore rules for generated outputs, and includes test CSV matrices and a Newick phylogenetic tree for validating the distance computation pipeline.
2026-06-22 18:24:30 +02:00

Benchmark pipeline

Requires GNU Make ≥ 4.3, the first release with grouped targets (&:). macOS ships GNU Make 3.81 as make, so install a newer version and invoke it as gmake.

gmake all          # full pipeline
gmake simulate     # simulation only
gmake reference    # reference kmer sets only
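
The version requirement above can be checked before launching the pipeline. The following is a minimal sketch, not part of the benchmark scripts: the function names version_ge and find_gnu_make are hypothetical, and it assumes the first line of the --version output has the form "GNU Make X.Y".

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with the benchmark).

# version_ge A B: succeed when dotted version A >= B (major.minor only).
version_ge() {
    a_major=${1%%.*}; a_rest=${1#*.}; a_minor=${a_rest%%.*}
    b_major=${2%%.*}; b_rest=${2#*.}; b_minor=${b_rest%%.*}
    [ "$a_major" -gt "$b_major" ] && return 0
    [ "$a_major" -eq "$b_major" ] && [ "$a_minor" -ge "$b_minor" ]
}

# find_gnu_make: print the first of gmake/make that is GNU Make >= 4.3.
find_gnu_make() {
    for candidate in gmake make; do
        command -v "$candidate" >/dev/null 2>&1 || continue
        line=$("$candidate" --version 2>/dev/null | head -n 1)
        case "$line" in
            "GNU Make "*)
                ver=${line#GNU Make }   # drop the "GNU Make " prefix
                ver=${ver%% *}          # keep only the version token
                if version_ge "$ver" 4.3; then
                    echo "$candidate"
                    return 0
                fi
                ;;
        esac
    done
    return 1
}

# Possible usage:
#   MAKE_BIN=$(find_gnu_make) && "$MAKE_BIN" all
```

Trying gmake first matches the macOS advice above, where the system make is too old for grouped targets.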

Pipeline overview

flowchart TD
    GENOMES["genomes/*.fna.gz"]
    BIN["obikmer binary"]

    GENOMES --> simulate
    simulate --> simdata[("simulated_data/")]

    simdata --> reference
    reference --> refnpz[("reference_index/*.npz")]

    subgraph presence ["Presence track"]
        simdata  --> index_presence
        BIN      --> index_presence
        index_presence --> pres_done[("specimen_index_presence/")]
        index_presence --> pres_istats[("stats/indexing_presence/")]
        pres_istats --> aggregate_index_presence

        pres_done --> merge_presence
        BIN       --> merge_presence
        merge_presence --> gpres[("global_index_presence/")]

        refnpz    --> verify_presence
        pres_done --> verify_presence
        verify_presence --> vpres_stats[("stats/verify_presence/")]
        vpres_stats --> aggregate_verify_presence

        gpres --> filter_presence
        BIN   --> filter_presence
        filter_presence --> spec_pres[("specific_index_presence/")]
        filter_presence --> spec_pres_stats[("stats/specific_kmer_presence/")]
        spec_pres_stats --> aggregate_filter_presence

        refnpz --> verify_merge_presence
        gpres  --> verify_merge_presence
        verify_merge_presence --> vmp[("stats/verify_merge_presence/")]
    end

    subgraph count ["Count track"]
        simdata --> index_count
        BIN     --> index_count
        index_count --> count_done[("specimen_index_count/")]
        index_count --> count_istats[("stats/indexing_count/")]
        count_istats --> aggregate_index_count

        count_done --> merge_count
        BIN        --> merge_count
        merge_count --> gcount[("global_index_count/")]

        refnpz     --> verify_count
        count_done --> verify_count
        verify_count --> vcount_stats[("stats/verify_count/")]
        vcount_stats --> aggregate_verify_count

        gcount --> filter_count
        BIN    --> filter_count
        filter_count --> spec_count[("specific_index_count/")]
        filter_count --> spec_count_stats[("stats/specific_kmer_count/")]
        spec_count_stats --> aggregate_filter_count

        refnpz --> verify_merge_count
        gcount --> verify_merge_count
        verify_merge_count --> vmc[("stats/verify_merge_count/")]
    end

    aggregate_verify_presence  --> all
    aggregate_verify_count     --> all
    vmp                        --> all
    vmc                        --> all
    all -. "$(MAKE) re-eval" .-> aggregate_filter_presence
    all -. "$(MAKE) re-eval" .-> aggregate_filter_count

Steps

Target                     Script                    Description
simulate                   simulate.sh               Simulate sequencing reads from the reference genomes
reference                  build_reference.sh        Build reference kmer sets (.npz) from simulation truth
index_presence             index_one_presence.sh     Index each specimen (presence mode)
index_count                index_one_count.sh        Index each specimen (count mode)
aggregate_index_presence   aggregate_stats.sh        Aggregate per-specimen indexing stats (presence)
aggregate_index_count      aggregate_stats.sh        Aggregate per-specimen indexing stats (count)
merge_presence             merge_presence.sh         Merge all specimen presence indexes into a global index
merge_count                merge_count.sh            Merge all specimen count indexes into a global index
verify_presence            verify_one_presence.sh    Verify each specimen presence index against reference
verify_count               verify_one_count.sh       Verify each specimen count index against reference
aggregate_verify_presence  aggregate_stats.sh        Aggregate per-specimen verification stats (presence)
aggregate_verify_count     aggregate_stats.sh        Aggregate per-specimen verification stats (count)
filter_presence            filter_one_presence.sh    Extract species-specific presence indexes from global index
filter_count               filter_one_count.sh       Extract species-specific count indexes from global index
aggregate_filter_presence  aggregate_stats.sh        Aggregate species-specific kmer stats (presence)
aggregate_filter_count     aggregate_stats.sh        Aggregate species-specific kmer stats (count)
verify_merge_presence      verify_merge_presence.sh  Verify global presence index against all reference sets
verify_merge_count         verify_merge_count.sh     Verify global count index against all reference sets
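
The target names in the table follow a regular scheme: each track step is named <step>_<mode>, and three of the steps have an aggregate_<step>_<mode> companion. The sketch below only reproduces that naming. The function track_targets is hypothetical, and whether every target can be requested on its own in a single make invocation is an assumption, since the flowchart shows all re-evaluating the Makefile before the filter aggregation.

```shell
#!/bin/sh
# Hypothetical helper: print the targets of one track, as named in the Steps table.
track_targets() {
    mode=$1
    case "$mode" in
        presence|count) ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    # Per-track steps.
    for step in index merge verify filter verify_merge; do
        echo "${step}_${mode}"
    done
    # Steps that have an aggregation target.
    for step in index verify filter; do
        echo "aggregate_${step}_${mode}"
    done
}

# Possible usage: list the count-track targets.
#   track_targets count
```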

Directory layout

benchmark/
├── genomes/                        # input reference genomes (.fna.gz)
├── simulated_data/                 # generated by simulate
│   └── <species>/<specimen>/
├── reference_index/                # reference kmer sets (.npz)
├── specimen_index_presence/        # per-specimen presence indexes
├── specimen_index_count/           # per-specimen count indexes
├── global_index_presence/          # merged global presence index
├── global_index_count/             # merged global count index
├── specific_index_presence/        # species-specific presence indexes
├── specific_index_count/           # species-specific count indexes
└── stats/                          # all benchmark statistics
    ├── indexing_presence/
    ├── indexing_count/
    ├── verify_presence/
    ├── verify_count/
    ├── specific_kmer_presence/
    ├── specific_kmer_count/
    ├── verify_merge_presence/
    └── verify_merge_count/
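
Because simulated data is laid out as simulated_data/<species>/<specimen>/, the per-specimen steps can discover their inputs by walking two directory levels. This is a minimal sketch under that assumption. The function list_specimens is hypothetical and is not how the Makefile is known to enumerate specimens.

```shell
#!/bin/sh
# Hypothetical helper: print "species<TAB>specimen" for every
# ROOT/<species>/<specimen>/ directory (ROOT defaults to simulated_data).
list_specimens() {
    root=${1:-simulated_data}
    for dir in "$root"/*/*/; do
        [ -d "$dir" ] || continue      # glob matched nothing
        dir=${dir%/}                   # strip trailing slash
        specimen=${dir##*/}            # last path component
        species_dir=${dir%/*}
        species=${species_dir##*/}     # parent directory name
        printf '%s\t%s\n' "$species" "$specimen"
    done
}

# Possible usage: count specimens per species.
#   list_specimens simulated_data | cut -f1 | sort | uniq -c
```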