Eric Coissac 469e53b6f5 Add genomic distance benchmarking suite and test data

Introduces scripts to compute and validate pairwise genomic distance matrices across multiple metrics. Updates the Makefile with build and comparison targets, adds .gitignore rules for generated outputs, and includes test CSV matrices and a Newick phylogenetic tree for validating the distance computation pipeline.

2026-06-22 18:24:30 +02:00

aggregate_stats.sh

feat: add benchmark pipeline, expose APIs, and enforce strict paths

2026-06-22 10:18:33 +02:00

build_reference_dist.py

Add genomic distance benchmarking suite and test data

2026-06-22 18:24:30 +02:00

build_reference.py

feat: add benchmark pipeline, expose APIs, and enforce strict paths

2026-06-22 10:18:33 +02:00

build_reference.sh

feat: add benchmark pipeline, expose APIs, and enforce strict paths

2026-06-22 10:18:33 +02:00

compare_all_dist.py

Add genomic distance benchmarking suite and test data

2026-06-22 18:24:30 +02:00

deps.mk

feat: add benchmark pipeline, expose APIs, and enforce strict paths

2026-06-22 10:18:33 +02:00

downloads.sh

feat: add benchmark pipeline, expose APIs, and enforce strict paths

2026-06-22 10:18:33 +02:00

filter_one_count.sh

feat: add benchmark pipeline, expose APIs, and enforce strict paths

2026-06-22 10:18:33 +02:00

filter_one_presence.sh

feat: add benchmark pipeline, expose APIs, and enforce strict paths

2026-06-22 10:18:33 +02:00

index_one_count.sh

feat: add benchmark pipeline, expose APIs, and enforce strict paths

2026-06-22 10:18:33 +02:00

index_one_presence.sh

feat: add benchmark pipeline, expose APIs, and enforce strict paths

2026-06-22 10:18:33 +02:00

make_deps.py

feat: add benchmark pipeline, expose APIs, and enforce strict paths

2026-06-22 10:18:33 +02:00

Makefile

Add genomic distance benchmarking suite and test data

2026-06-22 18:24:30 +02:00

merge_count.sh

feat: add benchmark pipeline, expose APIs, and enforce strict paths

2026-06-22 10:18:33 +02:00

merge_presence.sh

feat: add benchmark pipeline, expose APIs, and enforce strict paths

2026-06-22 10:18:33 +02:00

README.md

feat: add benchmark pipeline, expose APIs, and enforce strict paths

2026-06-22 10:18:33 +02:00

simulate_one.sh

feat: add benchmark pipeline, expose APIs, and enforce strict paths

2026-06-22 10:18:33 +02:00

simulate.sh

feat: add benchmark pipeline, expose APIs, and enforce strict paths

2026-06-22 10:18:33 +02:00

test_dist.csv

Add genomic distance benchmarking suite and test data

2026-06-22 18:24:30 +02:00

test_nj.nwk

Add genomic distance benchmarking suite and test data

2026-06-22 18:24:30 +02:00

test_shared.csv

Add genomic distance benchmarking suite and test data

2026-06-22 18:24:30 +02:00

verify_count.py

feat: add benchmark pipeline, expose APIs, and enforce strict paths

2026-06-22 10:18:33 +02:00

verify_merge_count.py

feat: add benchmark pipeline, expose APIs, and enforce strict paths

2026-06-22 10:18:33 +02:00

verify_merge_count.sh

feat: add benchmark pipeline, expose APIs, and enforce strict paths

2026-06-22 10:18:33 +02:00

verify_merge_presence.py

feat: add benchmark pipeline, expose APIs, and enforce strict paths

2026-06-22 10:18:33 +02:00

verify_merge_presence.sh

feat: add benchmark pipeline, expose APIs, and enforce strict paths

2026-06-22 10:18:33 +02:00

verify_one_count.sh

feat: add benchmark pipeline, expose APIs, and enforce strict paths

2026-06-22 10:18:33 +02:00

verify_one_presence.sh

feat: add benchmark pipeline, expose APIs, and enforce strict paths

2026-06-22 10:18:33 +02:00

verify_presence.py

feat: add benchmark pipeline, expose APIs, and enforce strict paths

2026-06-22 10:18:33 +02:00

README.md

Benchmark pipeline

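Requires GNU Make ≥ 4.3, the release that introduced grouped targets (`&:`). The `make` shipped with macOS is GNU Make 3.81, which is too old, so install a newer GNU Make and invoke it as `gmake`.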

gmake all          # full pipeline
gmake simulate     # simulation only
gmake reference    # reference kmer sets only
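To confirm that the installed Make is recent enough before launching the pipeline, a small check along the following lines may help. This is only a sketch and not part of the repository: the helper names (`parse_make_version`, `supports_grouped_targets`, `check_make`) are hypothetical, and it assumes the tool prints the usual `GNU Make X.Y` banner in response to `--version`.

```python
import re
import subprocess

# Grouped targets (&:) first appeared in GNU Make 4.3.
MIN_VERSION = (4, 3)


def parse_make_version(version_text):
    """Return (major, minor) from `make --version` output, or None if the banner is not GNU Make."""
    match = re.search(r"GNU Make (\d+)\.(\d+)", version_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def supports_grouped_targets(version_text):
    """True when the banner reports GNU Make 4.3 or newer."""
    version = parse_make_version(version_text)
    return version is not None and version >= MIN_VERSION


def check_make(command="gmake"):
    """Run `<command> --version`; return False if the command is missing, fails, or is too old."""
    try:
        result = subprocess.run(
            [command, "--version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return False
    return supports_grouped_targets(result.stdout)
```

Calling `check_make("gmake")` on macOS or `check_make("make")` on Linux returns a boolean that a wrapper script could use to abort early with a clear message.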

Pipeline overview

flowchart TD
    GENOMES["genomes/*.fna.gz"]
    BIN["obikmer binary"]

    GENOMES --> simulate
    simulate --> simdata[("simulated_data/")]

    simdata --> reference
    reference --> refnpz[("reference_index/*.npz")]

    subgraph presence ["Presence track"]
        simdata  --> index_presence
        BIN      --> index_presence
        index_presence --> pres_done[("specimen_index_presence/")]
        index_presence --> pres_istats[("stats/indexing_presence/")]
        pres_istats --> aggregate_index_presence

        pres_done --> merge_presence
        BIN       --> merge_presence
        merge_presence --> gpres[("global_index_presence/")]

        refnpz    --> verify_presence
        pres_done --> verify_presence
        verify_presence --> vpres_stats[("stats/verify_presence/")]
        vpres_stats --> aggregate_verify_presence

        gpres --> filter_presence
        BIN   --> filter_presence
        filter_presence --> spec_pres[("specific_index_presence/")]
        filter_presence --> spec_pres_stats[("stats/specific_kmer_presence/")]
        spec_pres_stats --> aggregate_filter_presence

        refnpz --> verify_merge_presence
        gpres  --> verify_merge_presence
        verify_merge_presence --> vmp[("stats/verify_merge_presence/")]
    end

    subgraph count ["Count track"]
        simdata --> index_count
        BIN     --> index_count
        index_count --> count_done[("specimen_index_count/")]
        index_count --> count_istats[("stats/indexing_count/")]
        count_istats --> aggregate_index_count

        count_done --> merge_count
        BIN        --> merge_count
        merge_count --> gcount[("global_index_count/")]

        refnpz     --> verify_count
        count_done --> verify_count
        verify_count --> vcount_stats[("stats/verify_count/")]
        vcount_stats --> aggregate_verify_count

        gcount --> filter_count
        BIN    --> filter_count
        filter_count --> spec_count[("specific_index_count/")]
        filter_count --> spec_count_stats[("stats/specific_kmer_count/")]
        spec_count_stats --> aggregate_filter_count

        refnpz --> verify_merge_count
        gcount --> verify_merge_count
        verify_merge_count --> vmc[("stats/verify_merge_count/")]
    end

    aggregate_verify_presence  --> all
    aggregate_verify_count     --> all
    vmp                        --> all
    vmc                        --> all
    all -. "$(MAKE) re-eval" .-> aggregate_filter_presence
    all -. "$(MAKE) re-eval" .-> aggregate_filter_count

Steps

| Target | Script | Description |
|---|---|---|
| `simulate` | `simulate.sh` | Simulate sequencing reads from the reference genomes |
| `reference` | `build_reference.sh` | Build reference kmer sets (`.npz`) from simulation truth |
| `index_presence` | `index_one_presence.sh` | Index each specimen (presence mode) |
| `index_count` | `index_one_count.sh` | Index each specimen (count mode) |
| `aggregate_index_presence` | `aggregate_stats.sh` | Aggregate per-specimen indexing stats (presence) |
| `aggregate_index_count` | `aggregate_stats.sh` | Aggregate per-specimen indexing stats (count) |
| `merge_presence` | `merge_presence.sh` | Merge all specimen presence indexes into a global index |
| `merge_count` | `merge_count.sh` | Merge all specimen count indexes into a global index |
| `verify_presence` | `verify_one_presence.sh` | Verify each specimen presence index against reference |
| `verify_count` | `verify_one_count.sh` | Verify each specimen count index against reference |
| `aggregate_verify_presence` | `aggregate_stats.sh` | Aggregate per-specimen verification stats (presence) |
| `aggregate_verify_count` | `aggregate_stats.sh` | Aggregate per-specimen verification stats (count) |
| `filter_presence` | `filter_one_presence.sh` | Extract species-specific presence indexes from global index |
| `filter_count` | `filter_one_count.sh` | Extract species-specific count indexes from global index |
| `aggregate_filter_presence` | `aggregate_stats.sh` | Aggregate species-specific kmer stats (presence) |
| `aggregate_filter_count` | `aggregate_stats.sh` | Aggregate species-specific kmer stats (count) |
| `verify_merge_presence` | `verify_merge_presence.sh` | Verify global presence index against all reference sets |
| `verify_merge_count` | `verify_merge_count.sh` | Verify global count index against all reference sets |
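The verification and filtering steps can be pictured as plain set operations on kmers. The sketch below is purely illustrative and makes several assumptions: kmers are held as Python strings in in-memory sets, and the function names (`compare_to_reference`, `species_specific`) are hypothetical. It does not reflect how `obikmer` or the `verify_*` and `filter_*` scripts are actually implemented, nor the on-disk index or `.npz` formats.

```python
def compare_to_reference(indexed, reference):
    """Summarise how an indexed kmer set differs from its reference set.

    Assumption: both arguments are iterables of kmer strings.
    """
    indexed, reference = set(indexed), set(reference)
    return {
        "shared": len(indexed & reference),      # kmers found in both
        "missing": len(reference - indexed),     # expected but not indexed
        "unexpected": len(indexed - reference),  # indexed but not expected
    }


def species_specific(kmers_by_species):
    """Return, for each species, the kmers that occur in no other species.

    Assumption: `kmers_by_species` maps a species name to an iterable of kmers.
    """
    specific = {}
    for species, kmers in kmers_by_species.items():
        # Union of every kmer seen in the other species.
        others = set()
        for other, other_kmers in kmers_by_species.items():
            if other != species:
                others |= set(other_kmers)
        specific[species] = set(kmers) - others
    return specific
```

Under this picture, a perfect presence index has `missing == 0` and `unexpected == 0`. A count track would additionally compare per-kmer counts, which this sketch leaves out.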

Directory layout

benchmark/
├── genomes/                        # input reference genomes (.fna.gz)
├── simulated_data/                 # generated by simulate
│   └── <species>/<specimen>/
├── reference_index/                # reference kmer sets (.npz)
├── specimen_index_presence/        # per-specimen presence indexes
├── specimen_index_count/           # per-specimen count indexes
├── global_index_presence/          # merged global presence index
├── global_index_count/             # merged global count index
├── specific_index_presence/        # species-specific presence indexes
├── specific_index_count/           # species-specific count indexes
└── stats/                          # all benchmark statistics
    ├── indexing_presence/
    ├── indexing_count/
    ├── verify_presence/
    ├── verify_count/
    ├── specific_kmer_presence/
    ├── specific_kmer_count/
    ├── verify_merge_presence/
    └── verify_merge_count/
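
Given the `simulated_data/<species>/<specimen>/` layout shown above, the per-specimen targets need to enumerate every species and specimen pair. A minimal sketch of that traversal follows. The function name `list_specimens` is hypothetical, and the repository's own `make_deps.py` may well do this differently.

```python
from pathlib import Path


def list_specimens(simulated_root):
    """Return sorted (species, specimen) pairs found under simulated_root.

    Assumes the two-level <species>/<specimen>/ layout; plain files are ignored.
    """
    root = Path(simulated_root)
    specimens = []
    for species_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for specimen_dir in sorted(p for p in species_dir.iterdir() if p.is_dir()):
            specimens.append((species_dir.name, specimen_dir.name))
    return specimens
```

For example, `list_specimens("benchmark/simulated_data")` would yield one tuple per specimen directory, which could then be mapped onto per-specimen output paths such as those under `specimen_index_presence/`.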