Eric Coissac c694e1f2b0 feat: add benchmark pipeline, expose APIs, and enforce strict paths
Introduces a Make-based orchestration for simulating, indexing, merging, filtering, and verifying k-mer counts and presence. Exposes internal builder and iterator APIs publicly, enforces mandatory leading slashes for predicate patterns, registers the `obitaxonomy` crate, and updates tooling configurations alongside documentation.
2026-06-22 10:18:33 +02:00

Benchmark pipeline

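Requires GNU Make ≥ 4.3, the release that introduced grouped targets (`&:`). On macOS the bundled make is older (3.81), so install a recent GNU Make and invoke it as gmake.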

gmake all          # full pipeline
gmake simulate     # simulation only
gmake reference    # reference kmer sets only
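
Because the pipeline relies on grouped targets, it can be convenient to check the Make version before launching it. The following is a minimal sketch and not part of the benchmark itself: the helper names `make_version_ok` and `pick_make` are hypothetical, and the code assumes that `--version` prints a first line of the form `GNU Make X.Y`.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with the benchmark) to find a usable GNU Make.

# Succeeds if the version string "$1" (e.g. "4.4.1") is at least 4.3.
make_version_ok() {
    major=${1%%.*}
    case "$1" in
        *.*) rest=${1#*.}; minor=${rest%%.*} ;;
        *)   minor=0 ;;
    esac
    # Reject anything that is not a plain number.
    case "$major" in ''|*[!0-9]*) return 1 ;; esac
    case "$minor" in ''|*[!0-9]*) return 1 ;; esac
    [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 3 ]; }
}

# Prints the first of gmake/make whose reported version is recent enough.
pick_make() {
    for candidate in gmake make; do
        command -v "$candidate" >/dev/null 2>&1 || continue
        # Assumption: the first line looks like "GNU Make 4.4.1".
        version=$("$candidate" --version 2>/dev/null |
            sed -n '1s/^GNU Make \([0-9][0-9.]*\).*/\1/p')
        if make_version_ok "$version"; then
            echo "$candidate"
            return 0
        fi
    done
    echo "GNU Make >= 4.3 not found" >&2
    return 1
}

# Possible usage:
#   MAKE_BIN=$(pick_make) && "$MAKE_BIN" all
```

Trying gmake first means the same command line works on macOS and on systems where make is already a recent GNU Make.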

Pipeline overview

flowchart TD
    GENOMES["genomes/*.fna.gz"]
    BIN["obikmer binary"]

    GENOMES --> simulate
    simulate --> simdata[("simulated_data/")]

    simdata --> reference
    reference --> refnpz[("reference_index/*.npz")]

    subgraph presence ["Presence track"]
        simdata  --> index_presence
        BIN      --> index_presence
        index_presence --> pres_done[("specimen_index_presence/")]
        index_presence --> pres_istats[("stats/indexing_presence/")]
        pres_istats --> aggregate_index_presence

        pres_done --> merge_presence
        BIN       --> merge_presence
        merge_presence --> gpres[("global_index_presence/")]

        refnpz    --> verify_presence
        pres_done --> verify_presence
        verify_presence --> vpres_stats[("stats/verify_presence/")]
        vpres_stats --> aggregate_verify_presence

        gpres --> filter_presence
        BIN   --> filter_presence
        filter_presence --> spec_pres[("specific_index_presence/")]
        filter_presence --> spec_pres_stats[("stats/specific_kmer_presence/")]
        spec_pres_stats --> aggregate_filter_presence

        refnpz --> verify_merge_presence
        gpres  --> verify_merge_presence
        verify_merge_presence --> vmp[("stats/verify_merge_presence/")]
    end

    subgraph count ["Count track"]
        simdata --> index_count
        BIN     --> index_count
        index_count --> count_done[("specimen_index_count/")]
        index_count --> count_istats[("stats/indexing_count/")]
        count_istats --> aggregate_index_count

        count_done --> merge_count
        BIN        --> merge_count
        merge_count --> gcount[("global_index_count/")]

        refnpz     --> verify_count
        count_done --> verify_count
        verify_count --> vcount_stats[("stats/verify_count/")]
        vcount_stats --> aggregate_verify_count

        gcount --> filter_count
        BIN    --> filter_count
        filter_count --> spec_count[("specific_index_count/")]
        filter_count --> spec_count_stats[("stats/specific_kmer_count/")]
        spec_count_stats --> aggregate_filter_count

        refnpz --> verify_merge_count
        gcount --> verify_merge_count
        verify_merge_count --> vmc[("stats/verify_merge_count/")]
    end

    aggregate_verify_presence  --> all
    aggregate_verify_count     --> all
    vmp                        --> all
    vmc                        --> all
    all -. "$(MAKE) re-eval" .-> aggregate_filter_presence
    all -. "$(MAKE) re-eval" .-> aggregate_filter_count

Steps

| Target | Script | Description |
|---|---|---|
| `simulate` | `simulate.sh` | Simulate sequencing reads from the reference genomes |
| `reference` | `build_reference.sh` | Build reference kmer sets (.npz) from simulation truth |
| `index_presence` | `index_one_presence.sh` | Index each specimen (presence mode) |
| `index_count` | `index_one_count.sh` | Index each specimen (count mode) |
| `aggregate_index_presence` | `aggregate_stats.sh` | Aggregate per-specimen indexing stats (presence) |
| `aggregate_index_count` | `aggregate_stats.sh` | Aggregate per-specimen indexing stats (count) |
| `merge_presence` | `merge_presence.sh` | Merge all specimen presence indexes into a global index |
| `merge_count` | `merge_count.sh` | Merge all specimen count indexes into a global index |
| `verify_presence` | `verify_one_presence.sh` | Verify each specimen presence index against reference |
| `verify_count` | `verify_one_count.sh` | Verify each specimen count index against reference |
| `aggregate_verify_presence` | `aggregate_stats.sh` | Aggregate per-specimen verification stats (presence) |
| `aggregate_verify_count` | `aggregate_stats.sh` | Aggregate per-specimen verification stats (count) |
| `filter_presence` | `filter_one_presence.sh` | Extract species-specific presence indexes from global index |
| `filter_count` | `filter_one_count.sh` | Extract species-specific count indexes from global index |
| `aggregate_filter_presence` | `aggregate_stats.sh` | Aggregate species-specific kmer stats (presence) |
| `aggregate_filter_count` | `aggregate_stats.sh` | Aggregate species-specific kmer stats (count) |
| `verify_merge_presence` | `verify_merge_presence.sh` | Verify global presence index against all reference sets |
| `verify_merge_count` | `verify_merge_count.sh` | Verify global count index against all reference sets |
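
Since the presence and count tracks use parallel target names, a single track can be driven on its own. The sketch below prints the targets of one track; the helper name `track_targets` is hypothetical, and the ordering is an assumption read off the flowchart rather than taken from the Makefile, which remains the authority on dependencies.

```shell
#!/bin/sh
# Hypothetical helper: print the targets of one track ("presence" or "count"),
# one per line, in the order suggested by the pipeline overview.
track_targets() {
    mode=$1
    case "$mode" in
        presence|count) ;;
        *) echo "usage: track_targets presence|count" >&2; return 1 ;;
    esac
    for step in index aggregate_index merge verify aggregate_verify \
                filter aggregate_filter verify_merge; do
        echo "${step}_${mode}"
    done
}

# Possible usage (simulate and reference are shared by both tracks):
#   gmake simulate reference $(track_targets presence)
```

Make resolves prerequisites itself, so listing the targets explicitly mainly serves to restrict a run to one track.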

Directory layout

benchmark/
├── genomes/                        # input reference genomes (.fna.gz)
├── simulated_data/                 # generated by simulate
│   └── <species>/<specimen>/
├── reference_index/                # reference kmer sets (.npz)
├── specimen_index_presence/        # per-specimen presence indexes
├── specimen_index_count/           # per-specimen count indexes
├── global_index_presence/          # merged global presence index
├── global_index_count/             # merged global count index
├── specific_index_presence/        # species-specific presence indexes
├── specific_index_count/           # species-specific count indexes
└── stats/                          # all benchmark statistics
    ├── indexing_presence/
    ├── indexing_count/
    ├── verify_presence/
    ├── verify_count/
    ├── specific_kmer_presence/
    ├── specific_kmer_count/
    ├── verify_merge_presence/
    └── verify_merge_count/
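
After a run, the layout above can serve as a quick completeness check. The following sketch lists the generated directories that are still missing; the helper name `missing_dirs` is hypothetical, and it assumes every directory in the tree except `genomes/` is produced by the pipeline.

```shell
#!/bin/sh
# Hypothetical helper: print the expected output directories, relative to the
# benchmark root "$1" (default: current directory), that do not exist yet.
missing_dirs() {
    root=${1:-.}
    # Outputs shared by both tracks.
    for d in simulated_data reference_index; do
        [ -d "$root/$d" ] || echo "$d"
    done
    # Per-track outputs: the same names with a _presence or _count suffix.
    for mode in presence count; do
        for d in specimen_index global_index specific_index \
                 stats/indexing stats/verify stats/specific_kmer \
                 stats/verify_merge; do
            [ -d "$root/${d}_${mode}" ] || echo "${d}_${mode}"
        done
    done
}

# Possible usage, from the benchmark/ directory:
#   missing_dirs .
```

An empty output only shows that the directories exist; it says nothing about whether their contents are complete.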