2026-05-21 16:57:19 +00:00
13 changed files with 762 additions and 19 deletions
@@ -1,26 +1,8 @@
-## Chose à vérifier suite à la commande index
-
- il faudrait lister les fichier qui vont être indexés
- partition.meta ne devrait plus exister
- les spectrums globaux devrait etre identifier par génome
-  - regrouper dans un sous-dossier spectrums à la racine de l'index avec un nom basé sur le génome
- les spectrum patiels ont-ils vocation à être conserver ?
- l'étape de déreplication dure quasiment autant de temps que le comptage mais ne laisse aucune trace de progression à l'utilisateur
-
 ## commandes à ajouter

- filter : produit un nouvel index filtré à partir d'un index existant en verifiant que les kmer présents dans le nouvel index respectent les critères de filtrage spécifiés
-  - quorum de presence en fraction-(min/max) du nombre de génomes, en nombre-(min/max) de génomes, si mode count la présence peut être défini par un seuil personnalisé minimum et maximum
-
 - aggregate : aggrege toutes les colonnes d'une matrice d'index en une seule colonne.

 - query : scan un fichier de sequences et retourne pour chaque sequence quels kmer sont présents dans l'index et dans quel genomes
-
- distance : calcule la matrice de distance entre les genomes
-  - proposer une option pour chaque distance à calculer
-  - un possibité de récuperer la matrice des kmer communs
-  - un possibité de calculer l'arbre nj 
-  - les matrices sont sauvegardées en CSV
-  - les arbres NJ sont sauvegardés en Newick avec les longeurs de branche
+  --detail et --mismatch à implementer

 - status : affiche le statut de l'index
@@ -0,0 +1,111 @@
+# Query system
+
+## Goal
+
+Given a set of query sequences, determine for each sequence how many of its k-mers are found in the index and, for each indexed genome, how many k-mers match. The query system is the foundation for read classification and sequence-to-genome mapping.
+
+---
+
+## Input
+
+- Query sequences in FASTA or FASTQ format (gzip supported, streaming stdin supported).
+- Sequences shorter than k bases are silently skipped.
+- Non-ACGT characters are handled by the superkmer decomposition layer: they act as hard breaks, producing shorter superkmers (identical to the behaviour at indexing time).
+
+---
+
+## Algorithm
+
+The query follows the same superkmer-based partitioning strategy used at indexing time.
+
+```
+for each query sequence:
+    decompose into superkmers  (non-ACGT breaks, same minimiser scheme as indexing)
+    for each superkmer:
+        route to partition p via minimiser hash
+        for each kmer in the superkmer:
+            lookup kmer in partition p  (MPHF → evidence check → matrix row)
+            accumulate result into per-sequence accumulators
+    emit annotated sequence
+```
+
+Parallelism is **per sequence**: each worker thread handles all partitions of one sequence independently. This avoids cross-thread coordination when merging partial results and keeps memory usage proportional to the number of concurrent sequences rather than to the number of partitions.
+
+---
+
+## Exact vs. approximate matching
+
+### Exact (default)
+
+Standard MPHF lookup followed by evidence check. O(1) per k-mer.
+
+### 1-mismatch (`--mismatch` flag)
+
+For each k-mer of the query, generate all `3·k` single-substitution variants. Each variant is canonicalised and looked up independently in the index. If one or more variants are found, their per-genome rows are **summed** into the result for that k-mer position.
+
+- If a k-mer matches exactly AND one of its variants also matches (distinct k-mers in the index), both contributions are accumulated.
+- Exact and approximate matches are tracked separately in the output (see annotation schema below).
+- The superkmer routing optimisation is **not** used in 1-mismatch mode: each variant is looked up directly via its own minimiser.
+- Cost: up to `3·k` MPHF probes per k-mer position vs. 1 in exact mode.
+
+---
+
+## Output format
+
+Output sequences are written in **OBITools4 format**: the original sequence with a JSON annotation map in the title line.
+
+```
+>read_id {"kmer_total":150,"kmer_found":59,...}
+ATCGATCG...
+```
+
+Genome order in all list-valued annotations follows the genome order recorded in `index.meta`.
+
+---
+
+## Annotation schema
+
+### Summary mode (default)
+
+| Key | Type | Condition | Semantics |
+|---|---|---|---|
+| `kmer_total` | int | always | total k-mers in the (masked) sequence |
+| `kmer_found` | int | always | k-mers with at least one match (exact or approx) |
+| `kmer_missing` | int | `--count-missing` | k-mers absent from the index |
+| `kmer_match` | list[int] | always | per-genome matched k-mer count (exact + approx) |
+| `kmer_match_exact` | list[int] | `--mismatch` | per-genome exact match count |
+| `kmer_match_approx` | list[int] | `--mismatch` | per-genome approx match count |
+| `count_match` | list[int] | count index | per-genome sum of index counts for matched k-mers |
+
+`kmer_match[i]` is the number of k-mer positions in the query that contribute at least one match to genome i. In 1-mismatch mode, a single k-mer position can contribute to multiple genomes if several of its variants are present in the index.
+
+`count_match[i]` sums raw index counts across all matched k-mer positions for genome i. Only meaningful for count indexes.
+
+### Detail mode (`--detail`)
+
+All summary keys, plus per-position coverage vectors — one list per genome, length `len(sequence) − k + 1`:
+
+| Key | Type | Condition | Semantics |
+|---|---|---|---|
+| `cov_<i>` | list[int] | `--detail` | coverage at each k-mer position for genome i; raw count (count index) or 0/1 (presence index); 0 if absent |
+| `cov_exact_<i>` | list[int] | `--detail` + `--mismatch` | exact-match contribution per position |
+| `cov_approx_<i>` | list[int] | `--detail` + `--mismatch` | approx-match contribution per position |
+
+Genome indices in key names are 0-based integers matching the `index.meta` genome order. Genome labels are not used as key names to avoid issues with special characters in long or complex genome identifiers.
+
+---
+
+## CLI
+
+```
+obikmer query -i <index> [--summary | --detail] [--mismatch] [--count-missing] <query.fa>
+```
+
+`--summary` is the default; `--detail` implies `--summary` (all summary keys are always present).
+
+---
+
+## Future work
+
+- **Read classification** (`--classify`): assign each read to the genome with the highest `kmer_match` score; emit as a single annotation key.
+- **Whitelist / blacklist filtering**: accept or reject sequences based on whether their k-mer match score for a designated set of genomes exceeds a threshold.
@@ -1514,6 +1514,7 @@ dependencies = [
 "obisys",
 "pprof",
 "rayon",
+ "serde_json",
 "speedytree",
 "tracing",
 "tracing-subscriber",
@@ -185,6 +185,11 @@ impl KmerIndex {
        Ok(())
    }

+    /// Borrow the inner partition for direct superkmer-level queries.
+    pub fn partition(&self) -> &KmerPartition {
+        &self.partition
+    }
+
    /// Path to the unitigs file for partition `part`, layer `layer`.
    pub fn layer_unitigs_path(&self, part: usize, layer: usize) -> PathBuf {
        self.partition.part_dir(part)
@@ -19,6 +19,7 @@ obisys        = { path = "../obisys" }
 obiskio       = { path = "../obiskio" }
 obikindex     = { path = "../obikindex" }
 clap          = { version = "4", features = ["derive"] }
+serde_json    = "1"
 kodama        = "0.2"
 speedytree    = "0.1"
 rayon         = "1"
@@ -2,6 +2,7 @@ pub mod distance;
 pub mod dump;
 pub mod index;
 pub mod merge;
+pub mod query;
 pub mod rebuild;
 pub mod superkmer;
 pub mod unitig;
@@ -0,0 +1,281 @@
+use std::collections::HashMap;
+use std::io::{self, BufWriter, Write};
+use std::path::PathBuf;
+
+use clap::Args;
+use obikindex::KmerIndex;
+use obiread::record::{SeqRecord, parse_chunk};
+use obiread::chunk::read_sequence_chunks;
+use obikseq::{RoutableSuperKmer, set_k, set_m};
+use obiskbuilder::SuperKmerIter;
+use tracing::info;
+
+// ── CLI ───────────────────────────────────────────────────────────────────────
+
+#[derive(Args)]
+pub struct QueryArgs {
+    /// Index directory
+    pub index: PathBuf,
+
+    /// Input sequences (FASTA/FASTQ, optionally gzip-compressed)
+    #[arg(num_args = 1..)]
+    pub inputs: Vec<String>,
+
+    /// Also report per-position coverage vectors (cov_<i> per genome)
+    #[arg(long)]
+    pub detail: bool,
+
+    /// Enable 1-mismatch approximate matching
+    #[arg(long)]
+    pub mismatch: bool,
+
+    /// Count k-mers absent from the index (adds kmer_missing annotation)
+    #[arg(long)]
+    pub count_missing: bool,
+
+    /// Report per-genome presence (0/1) instead of raw counts
+    #[arg(long)]
+    pub force_presence: bool,
+
+    /// Minimum accumulated match count to declare a genome present (implies --force-presence)
+    #[arg(long, default_value_t = 1)]
+    pub presence_threshold: u32,
+
+    /// Number of worker threads
+    #[arg(
+        short = 'T',
+        long,
+        default_value_t = std::thread::available_parallelism()
+            .map(|n| n.get())
+            .unwrap_or(1)
+    )]
+    pub threads: usize,
+}
+
+// ── SKDesc — one occurrence of a superkmer in the batch ───────────────────────
+
+/// Describes one occurrence of a superkmer in the query batch.
+pub struct SKDesc {
+    /// Index of the source sequence within the batch.
+    pub seq_idx: u32,
+    /// Kmer offset of the first kmer of this superkmer within its sequence.
+    /// Computed as the cumulative number of kmers emitted before this superkmer
+    /// in the same sequence. Used for `--detail` coverage vectors.
+    pub kmer_offset: u32,
+}
+
+// ── QueryBatch ────────────────────────────────────────────────────────────────
+
+/// A batch of query sequences with their superkmers deduplicated.
+///
+/// Each unique `RoutableSuperKmer` maps to all the (seq_idx, kmer_offset)
+/// positions it occupies across the batch. The superkmer is queried once
+/// per partition; the result is broadcast to every SKDesc entry.
+pub struct QueryBatch {
+    /// Sequence ids in batch order.
+    pub ids: Vec<String>,
+    /// Raw sequence bytes (for output), in batch order.
+    pub seqs: Vec<Vec<u8>>,
+    /// Per-sequence total kmer count (kmer_count + kmer_missing).
+    pub n_kmers: Vec<u32>,
+    /// Deduplicated superkmer map.
+    pub map: HashMap<RoutableSuperKmer, Vec<SKDesc>>,
+}
+
+impl QueryBatch {
+    /// Build a batch from a vec of parsed sequence records.
+    pub fn from_records(
+        records: Vec<SeqRecord>,
+        k: usize,
+        level_max: usize,
+        theta: f64,
+    ) -> Self {
+        let mut ids   = Vec::with_capacity(records.len());
+        let mut seqs  = Vec::with_capacity(records.len());
+        let mut n_kmers = Vec::with_capacity(records.len());
+        let mut map: HashMap<RoutableSuperKmer, Vec<SKDesc>> = HashMap::new();
+
+        for (seq_idx, record) in records.into_iter().enumerate() {
+            let mut kmer_offset = 0u32;
+
+            for rsk in SuperKmerIter::new(&record.normalized, k, level_max, theta) {
+                let n = (rsk.seql() - k + 1) as u32;
+                map.entry(rsk)
+                    .or_default()
+                    .push(SKDesc { seq_idx: seq_idx as u32, kmer_offset });
+                kmer_offset += n;
+            }
+
+            ids.push(record.id);
+            seqs.push(record.sequence);
+            n_kmers.push(kmer_offset);
+        }
+
+        Self { ids, seqs, n_kmers, map }
+    }
+
+    /// Split the superkmer map by partition index.
+    ///
+    /// Returns a vec of length `n_partitions`; each slot holds the RSK refs
+    /// whose minimizer routes to that partition.
+    pub fn split_by_partition(&self, n_partitions: usize) -> Vec<Vec<&RoutableSuperKmer>> {
+        let mask = (n_partitions as u64) - 1;
+        let mut by_part: Vec<Vec<&RoutableSuperKmer>> = vec![Vec::new(); n_partitions];
+        for rsk in self.map.keys() {
+            let part = (rsk.minimizer().seq_hash() & mask) as usize;
+            by_part[part].push(rsk);
+        }
+        by_part
+    }
+}
+
+// ── Per-sequence accumulator ──────────────────────────────────────────────────
+
+struct SeqAcc {
+    kmer_count:   u32,
+    kmer_missing: u32,
+    /// Per-genome accumulated count (count mode) or presence sum (presence mode).
+    genome_totals: Vec<u32>,
+}
+
+impl SeqAcc {
+    fn new(n_genomes: usize) -> Self {
+        Self {
+            kmer_count:   0,
+            kmer_missing: 0,
+            genome_totals: vec![0u32; n_genomes],
+        }
+    }
+}
+
+// ── Entry point ───────────────────────────────────────────────────────────────
+
+pub fn run(args: QueryArgs) {
+    let idx = KmerIndex::open(&args.index).unwrap_or_else(|e| {
+        eprintln!("error opening index: {e}");
+        std::process::exit(1);
+    });
+
+    set_k(idx.kmer_size());
+    set_m(idx.minimizer_size());
+
+    let k            = idx.kmer_size();
+    let n_genomes    = idx.meta().genomes.len();
+    let n_partitions = idx.n_partitions();
+    let with_counts  = idx.meta().config.with_counts;
+
+    info!(
+        "query: k={k}, {} genome(s), with_counts={with_counts}, mismatch={}, detail={}",
+        n_genomes, args.mismatch, args.detail
+    );
+
+    if args.mismatch {
+        eprintln!("warning: --mismatch not yet implemented, ignored");
+    }
+    if args.detail {
+        eprintln!("warning: --detail not yet implemented, ignored");
+    }
+
+    let paths: Vec<PathBuf> = args.inputs.iter().map(PathBuf::from).collect();
+    let mut out = BufWriter::new(io::stdout());
+
+    for path in &paths {
+        let chunks = read_sequence_chunks(path.to_str().unwrap_or(""))
+            .unwrap_or_else(|e| {
+                eprintln!("error opening {}: {e}", path.display());
+                std::process::exit(1);
+            });
+
+        for chunk_result in chunks {
+            let chunk = chunk_result.unwrap_or_else(|e| {
+                eprintln!("read error: {e}");
+                std::process::exit(1);
+            });
+
+            let records = parse_chunk(&chunk, k);
+            if records.is_empty() {
+                continue;
+            }
+
+            let batch = QueryBatch::from_records(records, k, 6, 0.7);
+            let n_seqs = batch.ids.len();
+
+            let mut accs: Vec<SeqAcc> = (0..n_seqs).map(|_| SeqAcc::new(n_genomes)).collect();
+
+            let by_part = batch.split_by_partition(n_partitions);
+
+            for (part_idx, part_sks) in by_part.iter().enumerate() {
+                if part_sks.is_empty() {
+                    continue;
+                }
+
+                let kmer_results = idx
+                    .partition()
+                    .query_partition(part_idx, part_sks, k, n_genomes, with_counts)
+                    .unwrap_or_else(|e| {
+                        eprintln!("query error on partition {part_idx}: {e}");
+                        std::process::exit(1);
+                    });
+
+                let presence   = args.force_presence || !with_counts;
+                let threshold  = args.presence_threshold;
+
+                for (rsk, sk_kmer_results) in part_sks.iter().zip(kmer_results.iter()) {
+                    let descs = batch.map.get(*rsk).expect("rsk must be in map");
+                    for desc in descs {
+                        let acc = &mut accs[desc.seq_idx as usize];
+                        for hit in sk_kmer_results.iter() {
+                            match hit {
+                                None => acc.kmer_missing += 1,
+                                Some(row) => {
+                                    acc.kmer_count += 1;
+                                    for (g, &v) in row.iter().enumerate() {
+                                        acc.genome_totals[g] += if presence {
+                                            u32::from(v >= threshold)
+                                        } else {
+                                            v
+                                        };
+                                    }
+                                }
+                            }
+                        }
+                    }
+                }
+            }
+
+            emit_batch(&batch, &accs, idx.meta(), args.count_missing, &mut out);
+        }
+    }
+}
+
+// ── Output ────────────────────────────────────────────────────────────────────
+
+fn emit_batch(
+    batch:        &QueryBatch,
+    accs:         &[SeqAcc],
+    meta:         &obikindex::meta::IndexMeta,
+    count_missing: bool,
+    out:          &mut impl Write,
+) {
+    for (seq_idx, (id, seq)) in batch.ids.iter().zip(batch.seqs.iter()).enumerate() {
+        let acc = &accs[seq_idx];
+        let mut ann = serde_json::Map::new();
+
+        ann.insert("kmer_count".into(), acc.kmer_count.into());
+        if count_missing {
+            ann.insert("kmer_missing".into(), acc.kmer_missing.into());
+        }
+        let mut match_map = serde_json::Map::new();
+        for (g, label) in meta.genomes.iter().enumerate() {
+            match_map.insert(label.clone(), acc.genome_totals[g].into());
+        }
+        ann.insert("kmer_strict_matches".into(), match_map.into());
+
+        let ann_str = serde_json::to_string(&ann).unwrap_or_else(|_| "{}".to_string());
+
+        // OBITools4 FASTA format: >id {"key":value,...}
+        let _ = writeln!(out, ">{id} {ann_str}");
+        let _ = out.write_all(seq);
+        let _ = out.write_all(b"\n");
+    }
+}
@@ -22,6 +22,8 @@ enum Commands {
    Merge(cmd::merge::MergeArgs),
    /// Filter and compact an existing index into a new single-layer index
    Rebuild(cmd::rebuild::RebuildArgs),
+    /// Query an index with sequences and annotate matches
+    Query(cmd::query::QueryArgs),
    /// Dump all indexed kmers as CSV (kmer + per-genome counts or presence)
    Dump(cmd::dump::DumpArgs),
    /// Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees
@@ -54,6 +56,7 @@ fn main() {
        Commands::Merge(args)     => cmd::merge::run(args),
        Commands::Dump(args)      => cmd::dump::run(args),
        Commands::Rebuild(args)   => cmd::rebuild::run(args),
+        Commands::Query(args)     => cmd::query::run(args),
        Commands::Distance(args)  => cmd::distance::run(args),
        Commands::Unitig(args)    => cmd::unitig::run(args),
    }
@@ -5,6 +5,7 @@ mod index_layer;
 mod kmer_sort;
 mod merge_layer;
 mod partition;
+mod query_layer;
 mod rebuild_layer;

 pub use filter::KmerFilter;
@@ -0,0 +1,120 @@
+use std::path::Path;
+
+use obicompactvec::{PersistentBitMatrix, PersistentCompactIntMatrix};
+use obikseq::{CanonicalKmer, RoutableSuperKmer};
+use obiskio::{SKError, SKResult};
+use obilayeredmap::{MphfLayer, OLMError};
+use obilayeredmap::meta::PartitionMeta;
+
+use crate::partition::KmerPartition;
+
+const INDEX_SUBDIR: &str = "index";
+
+fn olm_to_sk(e: OLMError) -> SKError {
+    match e {
+        OLMError::Io(io_err) => SKError::Io(io_err),
+        other => SKError::InvalidData { context: "query", detail: other.to_string() },
+    }
+}
+
+// ── per-layer query handle ────────────────────────────────────────────────────
+
+enum QueryLayer {
+    /// Layer<()> — MPHF-only, no data matrix; all indexed kmers map to 1 per genome.
+    SetOnly(MphfLayer),
+    Presence(MphfLayer, PersistentBitMatrix),
+    Count(MphfLayer, PersistentCompactIntMatrix),
+}
+
+impl QueryLayer {
+    fn open(layer_dir: &Path, with_counts: bool) -> SKResult<Self> {
+        let presence_dir = layer_dir.join("presence");
+        let counts_dir   = layer_dir.join("counts");
+
+        if with_counts && counts_dir.exists() {
+            let mphf = MphfLayer::open(layer_dir).map_err(olm_to_sk)?;
+            let mat  = PersistentCompactIntMatrix::open(&counts_dir).map_err(SKError::Io)?;
+            Ok(QueryLayer::Count(mphf, mat))
+        } else if presence_dir.exists() {
+            let mphf = MphfLayer::open(layer_dir).map_err(olm_to_sk)?;
+            let mat  = PersistentBitMatrix::open(&presence_dir).map_err(SKError::Io)?;
+            Ok(QueryLayer::Presence(mphf, mat))
+        } else if counts_dir.exists() {
+            // presence query on a count index — return counts as-is
+            let mphf = MphfLayer::open(layer_dir).map_err(olm_to_sk)?;
+            let mat  = PersistentCompactIntMatrix::open(&counts_dir).map_err(SKError::Io)?;
+            Ok(QueryLayer::Count(mphf, mat))
+        } else {
+            let mphf = MphfLayer::open(layer_dir).map_err(olm_to_sk)?;
+            Ok(QueryLayer::SetOnly(mphf))
+        }
+    }
+
+    /// Return `Some(per-genome row)` if `kmer` is indexed in this layer, else `None`.
+    fn find(&self, kmer: CanonicalKmer, n_genomes: usize) -> Option<Box<[u32]>> {
+        match self {
+            QueryLayer::SetOnly(mphf) => {
+                mphf.find(kmer)
+                    .map(|_| vec![1u32; n_genomes].into_boxed_slice())
+            }
+            QueryLayer::Presence(mphf, mat) => {
+                mphf.find(kmer)
+                    .map(|slot| mat.row(slot).iter().map(|&b| b as u32).collect())
+            }
+            QueryLayer::Count(mphf, mat) => {
+                mphf.find(kmer).map(|slot| mat.row(slot))
+            }
+        }
+    }
+}
+
+// ── KmerPartition::query_partition ───────────────────────────────────────────
+
+impl KmerPartition {
+    /// Query a single partition for a slice of (already-routed) super-kmers.
+    ///
+    /// Returns one entry per input super-kmer; each entry is a `Vec` with one
+    /// `Option<Box<[u32]>>` per k-mer inside that super-kmer:
+    /// - `None`        — k-mer absent from the index
+    /// - `Some(row)`   — per-genome count (count index) or 0/1 (presence index)
+    ///
+    /// All `superkmers` must belong to this partition (same minimizer bucket).
+    pub fn query_partition(
+        &self,
+        part_idx: usize,
+        superkmers: &[&RoutableSuperKmer],
+        k: usize,
+        n_genomes: usize,
+        with_counts: bool,
+    ) -> SKResult<Vec<Vec<Option<Box<[u32]>>>>> {
+        if superkmers.is_empty() {
+            return Ok(Vec::new());
+        }
+
+        let index_dir = self.part_dir(part_idx).join(INDEX_SUBDIR);
+
+        if !index_dir.exists() {
+            return Ok(superkmers
+                .iter()
+                .map(|rsk| vec![None; rsk.seql() - k + 1])
+                .collect());
+        }
+
+        let meta = PartitionMeta::load(&index_dir).map_err(olm_to_sk)?;
+        let layers: Vec<QueryLayer> = (0..meta.n_layers)
+            .map(|i| QueryLayer::open(&index_dir.join(format!("layer_{i}")), with_counts))
+            .collect::<SKResult<_>>()?;
+
+        Ok(superkmers
+            .iter()
+            .map(|rsk| {
+                rsk.superkmer()
+                    .iter_canonical_kmers()
+                    .map(|kmer| {
+                        layers.iter().find_map(|layer| layer.find(kmer, n_genomes))
+                    })
+                    .collect()
+            })
+            .collect())
+    }
+}
@@ -71,6 +71,20 @@ impl RoutableSuperKmer {
    }
 }

+impl PartialEq for RoutableSuperKmer {
+    fn eq(&self, other: &Self) -> bool {
+        self.superkmer == other.superkmer
+    }
+}
+
+impl Eq for RoutableSuperKmer {}
+
+impl std::hash::Hash for RoutableSuperKmer {
+    fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
+        self.superkmer.hash(state);
+    }
+}
+
 impl Sequence for RoutableSuperKmer {
    type Canonical = RoutableSuperKmer;

@@ -12,6 +12,7 @@ mod mimetype;
 pub mod normalize;
 mod path_iterator;
 pub mod peakreader;
+pub mod record;
 pub mod xopen;

 pub use chunk::{SeqChunkIter, fasta_chunks, fastq_chunks,
@@ -0,0 +1,222 @@
+//! Per-sequence record parser for FASTA and FASTQ chunks.
+//!
+//! Same automaton structure as `normalize.rs` — only the actions differ:
+//! instead of writing into a single flat rope, we accumulate per-sequence
+//! data (id, raw ASCII, normalised ACGT\x00 rope).
+
+use obikrope::{ForwardCursor, Rope, RopeCursor};
+
+/// One sequence record extracted from a FASTA or FASTQ chunk.
+pub struct SeqRecord {
+    /// Sequence identifier (everything before the first space in the header).
+    pub id: String,
+    /// Raw sequence bytes, newlines stripped, non-ACGT characters preserved.
+    /// Reproduced verbatim in query output.
+    pub sequence: Vec<u8>,
+    /// Per-sequence normalised rope: uppercase ACGT segments of length ≥ k
+    /// separated by `\x00`. Ready for `SuperKmerIter`.
+    pub normalized: Rope,
+}
+
+/// Parse all records from a FASTA or FASTQ chunk rope.
+/// Returns an empty vec if the rope carries no recognised mime type.
+pub fn parse_chunk(rope: &Rope, k: usize) -> Vec<SeqRecord> {
+    let cursor = rope.fw_cursor();
+    match rope.mime_type() {
+        Some("text/fasta") => parse_fasta(cursor, k),
+        Some("text/fastq") => parse_fastq(cursor, k),
+        _ => vec![],
+    }
+}
+
+// ── Shared state accumulated while scanning one sequence ──────────────────────
+
+struct RecordBuilder {
+    id:       String,
+    sequence: Vec<u8>,   // raw ASCII, no newlines
+    norm:     Vec<u8>,   // ACGT\x00 segments being built
+    seg_start: usize,    // index in norm where current segment started
+    k:        usize,
+}
+
+impl RecordBuilder {
+    fn new(k: usize) -> Self {
+        Self { id: String::new(), sequence: Vec::new(), norm: Vec::new(), seg_start: 0, k }
+    }
+
+    fn reset(&mut self, id: String) {
+        self.id = id;
+        self.sequence.clear();
+        self.norm.clear();
+        self.seg_start = 0;
+    }
+
+    /// Push one accepted ACGT byte.
+    fn push_acgt(&mut self, b: u8) {
+        self.sequence.push(b);
+        self.norm.push(b);
+    }
+
+    /// Push one non-ACGT byte to the raw sequence only (not to the norm buffer).
+    fn push_raw(&mut self, b: u8) {
+        self.sequence.push(b);
+    }
+
+    /// Close the current ACGT segment (same logic as `end_segment` in normalize.rs).
+    fn end_segment(&mut self) {
+        if self.norm.len() - self.seg_start >= self.k {
+            self.norm.push(0x00);
+            self.seg_start = self.norm.len();
+        } else {
+            self.norm.truncate(self.seg_start);
+        }
+    }
+
+    /// Consume into a SeqRecord. Closes any open segment first.
+    fn finish(mut self) -> Option<SeqRecord> {
+        self.end_segment();
+        if self.id.is_empty() {
+            return None;
+        }
+        let mut rope = Rope::new(None);
+        if !self.norm.is_empty() {
+            rope.push(self.norm);
+        }
+        Some(SeqRecord { id: self.id, sequence: self.sequence, normalized: rope })
+    }
+}
+
+// ── FASTA automaton ───────────────────────────────────────────────────────────
+
+fn parse_fasta(cursor: ForwardCursor<'_>, k: usize) -> Vec<SeqRecord> {
+    let mut records: Vec<SeqRecord> = Vec::new();
+    let mut builder = RecordBuilder::new(k);
+
+    // skip up to (and including) the first '>'
+    loop {
+        match cursor.read_next().ok() {
+            None => return records,
+            Some(b'>') => break,
+            Some(_) => {}
+        }
+    }
+
+    // read first id — read_id already consumes the full header line
+    builder.id = read_id(&cursor);
+
+    loop {
+        match cursor.read_next().ok() {
+            None => {
+                // EOF — close final segment and emit
+                if let Some(rec) = builder.finish() {
+                    records.push(rec);
+                }
+                return records;
+            }
+            Some(b'\n') | Some(b'\r') => {
+                // peek: next non-empty char determines if new record starts
+                match cursor.read_ahead(1).ok() {
+                    Some(b'>') => {
+                        // end of current record
+                        builder.end_segment();
+                        if let Some(rec) = builder.finish() {
+                            records.push(rec);
+                        }
+                        cursor.read_next().ok(); // consume '>'
+                        let id = read_id(&cursor); // already consumes header line
+                        builder = RecordBuilder::new(k);
+                        builder.reset(id);
+                    }
+                    None => {
+                        builder.end_segment();
+                        if let Some(rec) = builder.finish() {
+                            records.push(rec);
+                        }
+                        return records;
+                    }
+                    Some(_) => {} // continuation line — do nothing
+                }
+            }
+            Some(b) => {
+                let upper = b & !0x20u8;
+                if matches!(upper, b'A' | b'C' | b'G' | b'T') {
+                    builder.push_acgt(upper);
+                } else {
+                    builder.push_raw(b);
+                    builder.end_segment();
+                }
+            }
+        }
+    }
+}
+
+// ── FASTQ automaton ───────────────────────────────────────────────────────────
+
+fn parse_fastq(cursor: ForwardCursor<'_>, k: usize) -> Vec<SeqRecord> {
+    let mut records: Vec<SeqRecord> = Vec::new();
+
+    loop {
+        // find '@'
+        loop {
+            match cursor.read_next().ok() {
+                None => return records,
+                Some(b'@') => break,
+                Some(_) => {}
+            }
+        }
+
+        let mut builder = RecordBuilder::new(k);
+        builder.id = read_id(&cursor); // already consumes the full header line
+
+        // sequence line — stop at newline, non-ACGT breaks segment
+        loop {
+            match cursor.read_next().ok() {
+                None | Some(b'\n') | Some(b'\r') => {
+                    builder.end_segment();
+                    break;
+                }
+                Some(b) => {
+                    let upper = b & !0x20u8;
+                    if matches!(upper, b'A' | b'C' | b'G' | b'T') {
+                        builder.push_acgt(upper);
+                    } else {
+                        builder.push_raw(b);
+                        builder.end_segment();
+                    }
+                }
+            }
+        }
+
+        skip_line(&cursor); // '+' line
+        skip_line(&cursor); // quality line
+
+        if let Some(rec) = builder.finish() {
+            records.push(rec);
+        }
+    }
+}
+
+// ── Helpers ───────────────────────────────────────────────────────────────────
+
+fn read_id(cursor: &ForwardCursor<'_>) -> String {
+    let mut id = Vec::new();
+    loop {
+        match cursor.read_next().ok() {
+            None | Some(b'\n') | Some(b'\r') => break,
+            Some(b' ') | Some(b'\t') => {
+                skip_line(cursor);
+                break;
+            }
+            Some(b) => id.push(b),
+        }
+    }
+    String::from_utf8_lossy(&id).into_owned()
+}
+
+fn skip_line(cursor: &ForwardCursor<'_>) {
+    while let Some(c) = cursor.read_next().ok() {
+        if c == b'\n' {
+            return;
+        }
+    }
+}