Push qkpyqurltlpk #1

Merged
coissac merged 11 commits from push-qkpyqurltlpk into main 2026-05-21 16:57:19 +00:00
13 changed files with 762 additions and 19 deletions
Showing only changes of commit 13599dd444 - Show all commits
+1 -19
View File
@@ -1,26 +1,8 @@
## Chose à vérifier suite à la commande index
- il faudrait lister les fichier qui vont être indexés
- partition.meta ne devrait plus exister
- les spectrums globaux devrait etre identifier par génome
- regrouper dans un sous-dossier spectrums à la racine de l'index avec un nom basé sur le génome
- les spectrum patiels ont-ils vocation à être conserver ?
- l'étape de déreplication dure quasiment autant de temps que le comptage mais ne laisse aucune trace de progression à l'utilisateur
## commandes à ajouter ## commandes à ajouter
- filter : produit un nouvel index filtré à partir d'un index existant en verifiant que les kmer présents dans le nouvel index respectent les critères de filtrage spécifiés
- quorum de presence en fraction-(min/max) du nombre de génomes, en nombre-(min/max) de génomes, si mode count la présence peut être défini par un seuil personnalisé minimum et maximum
- aggregate : aggrege toutes les colonnes d'une matrice d'index en une seule colonne. - aggregate : aggrege toutes les colonnes d'une matrice d'index en une seule colonne.
- query : scan un fichier de sequences et retourne pour chaque sequence quels kmer sont présents dans l'index et dans quel genomes - query : scan un fichier de sequences et retourne pour chaque sequence quels kmer sont présents dans l'index et dans quel genomes
--detail et --mismatch à implementer
- distance : calcule la matrice de distance entre les genomes
- proposer une option pour chaque distance à calculer
- un possibité de récuperer la matrice des kmer communs
- un possibité de calculer l'arbre nj
- les matrices sont sauvegardées en CSV
- les arbres NJ sont sauvegardés en Newick avec les longeurs de branche
- status : affiche le statut de l'index - status : affiche le statut de l'index
+111
View File
@@ -0,0 +1,111 @@
# Query system
## Goal
Given a set of query sequences, determine for each sequence how many of its k-mers are found in the index and, for each indexed genome, how many k-mers match. The query system is the foundation for read classification and sequence-to-genome mapping.
---
## Input
- Query sequences in FASTA or FASTQ format (gzip supported, streaming stdin supported).
- Sequences shorter than k bases are silently skipped.
- Non-ACGT characters are handled by the superkmer decomposition layer: they act as hard breaks, producing shorter superkmers (identical to the behaviour at indexing time).
---
## Algorithm
The query follows the same superkmer-based partitioning strategy used at indexing time.
```
for each query sequence:
decompose into superkmers (non-ACGT breaks, same minimiser scheme as indexing)
for each superkmer:
route to partition p via minimiser hash
for each kmer in the superkmer:
lookup kmer in partition p (MPHF → evidence check → matrix row)
accumulate result into per-sequence accumulators
emit annotated sequence
```
Parallelism is **per sequence**: each worker thread handles all partitions of one sequence independently. This avoids cross-thread coordination when merging partial results and keeps memory usage proportional to the number of concurrent sequences rather than to the number of partitions.
---
## Exact vs. approximate matching
### Exact (default)
Standard MPHF lookup followed by evidence check. O(1) per k-mer.
### 1-mismatch (`--mismatch` flag)
For each k-mer of the query, generate all `3·k` single-substitution variants. Each variant is canonicalised and looked up independently in the index. If one or more variants are found, their per-genome rows are **summed** into the result for that k-mer position.
- If a k-mer matches exactly AND one of its variants also matches (distinct k-mers in the index), both contributions are accumulated.
- Exact and approximate matches are tracked separately in the output (see annotation schema below).
- The superkmer routing optimisation is **not** used in 1-mismatch mode: each variant is looked up directly via its own minimiser.
- Cost: up to `3·k` MPHF probes per k-mer position vs. 1 in exact mode.
---
## Output format
Output sequences are written in **OBITools4 format**: the original sequence with a JSON annotation map in the title line.
```
>read_id {"kmer_total":150,"kmer_found":59,...}
ATCGATCG...
```
Genome order in all list-valued annotations follows the genome order recorded in `index.meta`.
---
## Annotation schema
### Summary mode (default)
| Key | Type | Condition | Semantics |
|---|---|---|---|
| `kmer_total` | int | always | total k-mers in the (masked) sequence |
| `kmer_found` | int | always | k-mers with at least one match (exact or approx) |
| `kmer_missing` | int | `--count-missing` | k-mers absent from the index |
| `kmer_match` | list[int] | always | per-genome matched k-mer count (exact + approx) |
| `kmer_match_exact` | list[int] | `--mismatch` | per-genome exact match count |
| `kmer_match_approx` | list[int] | `--mismatch` | per-genome approx match count |
| `count_match` | list[int] | count index | per-genome sum of index counts for matched k-mers |
`kmer_match[i]` is the number of k-mer positions in the query that contribute at least one match to genome i. In 1-mismatch mode, a single k-mer position can contribute to multiple genomes if several of its variants are present in the index.
`count_match[i]` sums raw index counts across all matched k-mer positions for genome i. Only meaningful for count indexes.
### Detail mode (`--detail`)
All summary keys, plus per-position coverage vectors — one list per genome, length `len(sequence) k + 1`:
| Key | Type | Condition | Semantics |
|---|---|---|---|
| `cov_<i>` | list[int] | `--detail` | coverage at each k-mer position for genome i; raw count (count index) or 0/1 (presence index); 0 if absent |
| `cov_exact_<i>` | list[int] | `--detail` + `--mismatch` | exact-match contribution per position |
| `cov_approx_<i>` | list[int] | `--detail` + `--mismatch` | approx-match contribution per position |
Genome indices in key names are 0-based integers matching the `index.meta` genome order. Genome labels are not used as key names to avoid issues with special characters in long or complex genome identifiers.
---
## CLI
```
obikmer query -i <index> [--summary | --detail] [--mismatch] [--count-missing] <query.fa>
```
`--summary` is the default; `--detail` implies `--summary` (all summary keys are always present).
---
## Future work
- **Read classification** (`--classify`): assign each read to the genome with the highest `kmer_match` score; emit as a single annotation key.
- **Whitelist / blacklist filtering**: accept or reject sequences based on whether their k-mer match score for a designated set of genomes exceeds a threshold.
+1
View File
@@ -1514,6 +1514,7 @@ dependencies = [
"obisys", "obisys",
"pprof", "pprof",
"rayon", "rayon",
"serde_json",
"speedytree", "speedytree",
"tracing", "tracing",
"tracing-subscriber", "tracing-subscriber",
+5
View File
@@ -185,6 +185,11 @@ impl KmerIndex {
Ok(()) Ok(())
} }
/// Borrow the inner partition for direct superkmer-level queries.
pub fn partition(&self) -> &KmerPartition {
&self.partition
}
/// Path to the unitigs file for partition `part`, layer `layer`. /// Path to the unitigs file for partition `part`, layer `layer`.
pub fn layer_unitigs_path(&self, part: usize, layer: usize) -> PathBuf { pub fn layer_unitigs_path(&self, part: usize, layer: usize) -> PathBuf {
self.partition.part_dir(part) self.partition.part_dir(part)
+1
View File
@@ -19,6 +19,7 @@ obisys = { path = "../obisys" }
obiskio = { path = "../obiskio" } obiskio = { path = "../obiskio" }
obikindex = { path = "../obikindex" } obikindex = { path = "../obikindex" }
clap = { version = "4", features = ["derive"] } clap = { version = "4", features = ["derive"] }
serde_json = "1"
kodama = "0.2" kodama = "0.2"
speedytree = "0.1" speedytree = "0.1"
rayon = "1" rayon = "1"
+1
View File
@@ -2,6 +2,7 @@ pub mod distance;
pub mod dump; pub mod dump;
pub mod index; pub mod index;
pub mod merge; pub mod merge;
pub mod query;
pub mod rebuild; pub mod rebuild;
pub mod superkmer; pub mod superkmer;
pub mod unitig; pub mod unitig;
+281
View File
@@ -0,0 +1,281 @@
use std::collections::HashMap;
use std::io::{self, BufWriter, Write};
use std::path::PathBuf;
use clap::Args;
use obikindex::KmerIndex;
use obiread::record::{SeqRecord, parse_chunk};
use obiread::chunk::read_sequence_chunks;
use obikseq::{RoutableSuperKmer, set_k, set_m};
use obiskbuilder::SuperKmerIter;
use tracing::info;
// ── CLI ───────────────────────────────────────────────────────────────────────
#[derive(Args)]
pub struct QueryArgs {
/// Index directory
pub index: PathBuf,
/// Input sequences (FASTA/FASTQ, optionally gzip-compressed)
#[arg(num_args = 1..)]
pub inputs: Vec<String>,
/// Also report per-position coverage vectors (cov_<i> per genome)
#[arg(long)]
pub detail: bool,
/// Enable 1-mismatch approximate matching
#[arg(long)]
pub mismatch: bool,
/// Count k-mers absent from the index (adds kmer_missing annotation)
#[arg(long)]
pub count_missing: bool,
/// Report per-genome presence (0/1) instead of raw counts
#[arg(long)]
pub force_presence: bool,
/// Minimum accumulated match count to declare a genome present (implies --force-presence)
#[arg(long, default_value_t = 1)]
pub presence_threshold: u32,
/// Number of worker threads
#[arg(
short = 'T',
long,
default_value_t = std::thread::available_parallelism()
.map(|n| n.get())
.unwrap_or(1)
)]
pub threads: usize,
}
// ── SKDesc — one occurrence of a superkmer in the batch ───────────────────────
/// Describes one occurrence of a superkmer in the query batch.
pub struct SKDesc {
/// Index of the source sequence within the batch.
pub seq_idx: u32,
/// Kmer offset of the first kmer of this superkmer within its sequence.
/// Computed as the cumulative number of kmers emitted before this superkmer
/// in the same sequence. Used for `--detail` coverage vectors.
pub kmer_offset: u32,
}
// ── QueryBatch ────────────────────────────────────────────────────────────────
/// A batch of query sequences with their superkmers deduplicated.
///
/// Each unique `RoutableSuperKmer` maps to all the (seq_idx, kmer_offset)
/// positions it occupies across the batch. The superkmer is queried once
/// per partition; the result is broadcast to every SKDesc entry.
pub struct QueryBatch {
/// Sequence ids in batch order.
pub ids: Vec<String>,
/// Raw sequence bytes (for output), in batch order.
pub seqs: Vec<Vec<u8>>,
/// Per-sequence total kmer count (kmer_count + kmer_missing).
pub n_kmers: Vec<u32>,
/// Deduplicated superkmer map.
pub map: HashMap<RoutableSuperKmer, Vec<SKDesc>>,
}
impl QueryBatch {
/// Build a batch from a vec of parsed sequence records.
pub fn from_records(
records: Vec<SeqRecord>,
k: usize,
level_max: usize,
theta: f64,
) -> Self {
let mut ids = Vec::with_capacity(records.len());
let mut seqs = Vec::with_capacity(records.len());
let mut n_kmers = Vec::with_capacity(records.len());
let mut map: HashMap<RoutableSuperKmer, Vec<SKDesc>> = HashMap::new();
for (seq_idx, record) in records.into_iter().enumerate() {
let mut kmer_offset = 0u32;
for rsk in SuperKmerIter::new(&record.normalized, k, level_max, theta) {
let n = (rsk.seql() - k + 1) as u32;
map.entry(rsk)
.or_default()
.push(SKDesc { seq_idx: seq_idx as u32, kmer_offset });
kmer_offset += n;
}
ids.push(record.id);
seqs.push(record.sequence);
n_kmers.push(kmer_offset);
}
Self { ids, seqs, n_kmers, map }
}
/// Split the superkmer map by partition index.
///
/// Returns a vec of length `n_partitions`; each slot holds the RSK refs
/// whose minimizer routes to that partition.
pub fn split_by_partition(&self, n_partitions: usize) -> Vec<Vec<&RoutableSuperKmer>> {
let mask = (n_partitions as u64) - 1;
let mut by_part: Vec<Vec<&RoutableSuperKmer>> = vec![Vec::new(); n_partitions];
for rsk in self.map.keys() {
let part = (rsk.minimizer().seq_hash() & mask) as usize;
by_part[part].push(rsk);
}
by_part
}
}
// ── Per-sequence accumulator ──────────────────────────────────────────────────
struct SeqAcc {
kmer_count: u32,
kmer_missing: u32,
/// Per-genome accumulated count (count mode) or presence sum (presence mode).
genome_totals: Vec<u32>,
}
impl SeqAcc {
fn new(n_genomes: usize) -> Self {
Self {
kmer_count: 0,
kmer_missing: 0,
genome_totals: vec![0u32; n_genomes],
}
}
}
// ── Entry point ───────────────────────────────────────────────────────────────
pub fn run(args: QueryArgs) {
let idx = KmerIndex::open(&args.index).unwrap_or_else(|e| {
eprintln!("error opening index: {e}");
std::process::exit(1);
});
set_k(idx.kmer_size());
set_m(idx.minimizer_size());
let k = idx.kmer_size();
let n_genomes = idx.meta().genomes.len();
let n_partitions = idx.n_partitions();
let with_counts = idx.meta().config.with_counts;
info!(
"query: k={k}, {} genome(s), with_counts={with_counts}, mismatch={}, detail={}",
n_genomes, args.mismatch, args.detail
);
if args.mismatch {
eprintln!("warning: --mismatch not yet implemented, ignored");
}
if args.detail {
eprintln!("warning: --detail not yet implemented, ignored");
}
let paths: Vec<PathBuf> = args.inputs.iter().map(PathBuf::from).collect();
let mut out = BufWriter::new(io::stdout());
for path in &paths {
let chunks = read_sequence_chunks(path.to_str().unwrap_or(""))
.unwrap_or_else(|e| {
eprintln!("error opening {}: {e}", path.display());
std::process::exit(1);
});
for chunk_result in chunks {
let chunk = chunk_result.unwrap_or_else(|e| {
eprintln!("read error: {e}");
std::process::exit(1);
});
let records = parse_chunk(&chunk, k);
if records.is_empty() {
continue;
}
let batch = QueryBatch::from_records(records, k, 6, 0.7);
let n_seqs = batch.ids.len();
let mut accs: Vec<SeqAcc> = (0..n_seqs).map(|_| SeqAcc::new(n_genomes)).collect();
let by_part = batch.split_by_partition(n_partitions);
for (part_idx, part_sks) in by_part.iter().enumerate() {
if part_sks.is_empty() {
continue;
}
let kmer_results = idx
.partition()
.query_partition(part_idx, part_sks, k, n_genomes, with_counts)
.unwrap_or_else(|e| {
eprintln!("query error on partition {part_idx}: {e}");
std::process::exit(1);
});
let presence = args.force_presence || !with_counts;
let threshold = args.presence_threshold;
for (rsk, sk_kmer_results) in part_sks.iter().zip(kmer_results.iter()) {
let descs = batch.map.get(*rsk).expect("rsk must be in map");
for desc in descs {
let acc = &mut accs[desc.seq_idx as usize];
for hit in sk_kmer_results.iter() {
match hit {
None => acc.kmer_missing += 1,
Some(row) => {
acc.kmer_count += 1;
for (g, &v) in row.iter().enumerate() {
acc.genome_totals[g] += if presence {
u32::from(v >= threshold)
} else {
v
};
}
}
}
}
}
}
}
emit_batch(&batch, &accs, idx.meta(), args.count_missing, &mut out);
}
}
}
// ── Output ────────────────────────────────────────────────────────────────────
fn emit_batch(
batch: &QueryBatch,
accs: &[SeqAcc],
meta: &obikindex::meta::IndexMeta,
count_missing: bool,
out: &mut impl Write,
) {
for (seq_idx, (id, seq)) in batch.ids.iter().zip(batch.seqs.iter()).enumerate() {
let acc = &accs[seq_idx];
let mut ann = serde_json::Map::new();
ann.insert("kmer_count".into(), acc.kmer_count.into());
if count_missing {
ann.insert("kmer_missing".into(), acc.kmer_missing.into());
}
let mut match_map = serde_json::Map::new();
for (g, label) in meta.genomes.iter().enumerate() {
match_map.insert(label.clone(), acc.genome_totals[g].into());
}
ann.insert("kmer_strict_matches".into(), match_map.into());
let ann_str = serde_json::to_string(&ann).unwrap_or_else(|_| "{}".to_string());
// OBITools4 FASTA format: >id {"key":value,...}
let _ = writeln!(out, ">{id} {ann_str}");
let _ = out.write_all(seq);
let _ = out.write_all(b"\n");
}
}
+3
View File
@@ -22,6 +22,8 @@ enum Commands {
Merge(cmd::merge::MergeArgs), Merge(cmd::merge::MergeArgs),
/// Filter and compact an existing index into a new single-layer index /// Filter and compact an existing index into a new single-layer index
Rebuild(cmd::rebuild::RebuildArgs), Rebuild(cmd::rebuild::RebuildArgs),
/// Query an index with sequences and annotate matches
Query(cmd::query::QueryArgs),
/// Dump all indexed kmers as CSV (kmer + per-genome counts or presence) /// Dump all indexed kmers as CSV (kmer + per-genome counts or presence)
Dump(cmd::dump::DumpArgs), Dump(cmd::dump::DumpArgs),
/// Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees /// Compute pairwise distance matrix between genomes; optionally build NJ/UPGMA trees
@@ -54,6 +56,7 @@ fn main() {
Commands::Merge(args) => cmd::merge::run(args), Commands::Merge(args) => cmd::merge::run(args),
Commands::Dump(args) => cmd::dump::run(args), Commands::Dump(args) => cmd::dump::run(args),
Commands::Rebuild(args) => cmd::rebuild::run(args), Commands::Rebuild(args) => cmd::rebuild::run(args),
Commands::Query(args) => cmd::query::run(args),
Commands::Distance(args) => cmd::distance::run(args), Commands::Distance(args) => cmd::distance::run(args),
Commands::Unitig(args) => cmd::unitig::run(args), Commands::Unitig(args) => cmd::unitig::run(args),
} }
+1
View File
@@ -5,6 +5,7 @@ mod index_layer;
mod kmer_sort; mod kmer_sort;
mod merge_layer; mod merge_layer;
mod partition; mod partition;
mod query_layer;
mod rebuild_layer; mod rebuild_layer;
pub use filter::KmerFilter; pub use filter::KmerFilter;
+120
View File
@@ -0,0 +1,120 @@
use std::path::Path;
use obicompactvec::{PersistentBitMatrix, PersistentCompactIntMatrix};
use obikseq::{CanonicalKmer, RoutableSuperKmer};
use obiskio::{SKError, SKResult};
use obilayeredmap::{MphfLayer, OLMError};
use obilayeredmap::meta::PartitionMeta;
use crate::partition::KmerPartition;
const INDEX_SUBDIR: &str = "index";
fn olm_to_sk(e: OLMError) -> SKError {
match e {
OLMError::Io(io_err) => SKError::Io(io_err),
other => SKError::InvalidData { context: "query", detail: other.to_string() },
}
}
// ── per-layer query handle ────────────────────────────────────────────────────
enum QueryLayer {
/// Layer<()> — MPHF-only, no data matrix; all indexed kmers map to 1 per genome.
SetOnly(MphfLayer),
Presence(MphfLayer, PersistentBitMatrix),
Count(MphfLayer, PersistentCompactIntMatrix),
}
impl QueryLayer {
fn open(layer_dir: &Path, with_counts: bool) -> SKResult<Self> {
let presence_dir = layer_dir.join("presence");
let counts_dir = layer_dir.join("counts");
if with_counts && counts_dir.exists() {
let mphf = MphfLayer::open(layer_dir).map_err(olm_to_sk)?;
let mat = PersistentCompactIntMatrix::open(&counts_dir).map_err(SKError::Io)?;
Ok(QueryLayer::Count(mphf, mat))
} else if presence_dir.exists() {
let mphf = MphfLayer::open(layer_dir).map_err(olm_to_sk)?;
let mat = PersistentBitMatrix::open(&presence_dir).map_err(SKError::Io)?;
Ok(QueryLayer::Presence(mphf, mat))
} else if counts_dir.exists() {
// presence query on a count index — return counts as-is
let mphf = MphfLayer::open(layer_dir).map_err(olm_to_sk)?;
let mat = PersistentCompactIntMatrix::open(&counts_dir).map_err(SKError::Io)?;
Ok(QueryLayer::Count(mphf, mat))
} else {
let mphf = MphfLayer::open(layer_dir).map_err(olm_to_sk)?;
Ok(QueryLayer::SetOnly(mphf))
}
}
/// Return `Some(per-genome row)` if `kmer` is indexed in this layer, else `None`.
fn find(&self, kmer: CanonicalKmer, n_genomes: usize) -> Option<Box<[u32]>> {
match self {
QueryLayer::SetOnly(mphf) => {
mphf.find(kmer)
.map(|_| vec![1u32; n_genomes].into_boxed_slice())
}
QueryLayer::Presence(mphf, mat) => {
mphf.find(kmer)
.map(|slot| mat.row(slot).iter().map(|&b| b as u32).collect())
}
QueryLayer::Count(mphf, mat) => {
mphf.find(kmer).map(|slot| mat.row(slot))
}
}
}
}
// ── KmerPartition::query_partition ───────────────────────────────────────────
impl KmerPartition {
/// Query a single partition for a slice of (already-routed) super-kmers.
///
/// Returns one entry per input super-kmer; each entry is a `Vec` with one
/// `Option<Box<[u32]>>` per k-mer inside that super-kmer:
/// - `None` — k-mer absent from the index
/// - `Some(row)` — per-genome count (count index) or 0/1 (presence index)
///
/// All `superkmers` must belong to this partition (same minimizer bucket).
pub fn query_partition(
&self,
part_idx: usize,
superkmers: &[&RoutableSuperKmer],
k: usize,
n_genomes: usize,
with_counts: bool,
) -> SKResult<Vec<Vec<Option<Box<[u32]>>>>> {
if superkmers.is_empty() {
return Ok(Vec::new());
}
let index_dir = self.part_dir(part_idx).join(INDEX_SUBDIR);
if !index_dir.exists() {
return Ok(superkmers
.iter()
.map(|rsk| vec![None; rsk.seql() - k + 1])
.collect());
}
let meta = PartitionMeta::load(&index_dir).map_err(olm_to_sk)?;
let layers: Vec<QueryLayer> = (0..meta.n_layers)
.map(|i| QueryLayer::open(&index_dir.join(format!("layer_{i}")), with_counts))
.collect::<SKResult<_>>()?;
Ok(superkmers
.iter()
.map(|rsk| {
rsk.superkmer()
.iter_canonical_kmers()
.map(|kmer| {
layers.iter().find_map(|layer| layer.find(kmer, n_genomes))
})
.collect()
})
.collect())
}
}
+14
View File
@@ -71,6 +71,20 @@ impl RoutableSuperKmer {
} }
} }
impl PartialEq for RoutableSuperKmer {
fn eq(&self, other: &Self) -> bool {
self.superkmer == other.superkmer
}
}
impl Eq for RoutableSuperKmer {}
impl std::hash::Hash for RoutableSuperKmer {
fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
self.superkmer.hash(state);
}
}
impl Sequence for RoutableSuperKmer { impl Sequence for RoutableSuperKmer {
type Canonical = RoutableSuperKmer; type Canonical = RoutableSuperKmer;
+1
View File
@@ -12,6 +12,7 @@ mod mimetype;
pub mod normalize; pub mod normalize;
mod path_iterator; mod path_iterator;
pub mod peakreader; pub mod peakreader;
pub mod record;
pub mod xopen; pub mod xopen;
pub use chunk::{SeqChunkIter, fasta_chunks, fastq_chunks, pub use chunk::{SeqChunkIter, fasta_chunks, fastq_chunks,
+222
View File
@@ -0,0 +1,222 @@
//! Per-sequence record parser for FASTA and FASTQ chunks.
//!
//! Same automaton structure as `normalize.rs` — only the actions differ:
//! instead of writing into a single flat rope, we accumulate per-sequence
//! data (id, raw ASCII, normalised ACGT\x00 rope).
use obikrope::{ForwardCursor, Rope, RopeCursor};
/// One sequence record extracted from a FASTA or FASTQ chunk.
pub struct SeqRecord {
/// Sequence identifier (everything before the first space in the header).
pub id: String,
/// Raw sequence bytes, newlines stripped, non-ACGT characters preserved.
/// Reproduced verbatim in query output.
pub sequence: Vec<u8>,
/// Per-sequence normalised rope: uppercase ACGT segments of length ≥ k
/// separated by `\x00`. Ready for `SuperKmerIter`.
pub normalized: Rope,
}
/// Parse all records from a FASTA or FASTQ chunk rope.
/// Returns an empty vec if the rope carries no recognised mime type.
pub fn parse_chunk(rope: &Rope, k: usize) -> Vec<SeqRecord> {
let cursor = rope.fw_cursor();
match rope.mime_type() {
Some("text/fasta") => parse_fasta(cursor, k),
Some("text/fastq") => parse_fastq(cursor, k),
_ => vec![],
}
}
// ── Shared state accumulated while scanning one sequence ──────────────────────
struct RecordBuilder {
id: String,
sequence: Vec<u8>, // raw ASCII, no newlines
norm: Vec<u8>, // ACGT\x00 segments being built
seg_start: usize, // index in norm where current segment started
k: usize,
}
impl RecordBuilder {
fn new(k: usize) -> Self {
Self { id: String::new(), sequence: Vec::new(), norm: Vec::new(), seg_start: 0, k }
}
fn reset(&mut self, id: String) {
self.id = id;
self.sequence.clear();
self.norm.clear();
self.seg_start = 0;
}
/// Push one accepted ACGT byte.
fn push_acgt(&mut self, b: u8) {
self.sequence.push(b);
self.norm.push(b);
}
/// Push one non-ACGT byte to the raw sequence only (not to the norm buffer).
fn push_raw(&mut self, b: u8) {
self.sequence.push(b);
}
/// Close the current ACGT segment (same logic as `end_segment` in normalize.rs).
fn end_segment(&mut self) {
if self.norm.len() - self.seg_start >= self.k {
self.norm.push(0x00);
self.seg_start = self.norm.len();
} else {
self.norm.truncate(self.seg_start);
}
}
/// Consume into a SeqRecord. Closes any open segment first.
fn finish(mut self) -> Option<SeqRecord> {
self.end_segment();
if self.id.is_empty() {
return None;
}
let mut rope = Rope::new(None);
if !self.norm.is_empty() {
rope.push(self.norm);
}
Some(SeqRecord { id: self.id, sequence: self.sequence, normalized: rope })
}
}
// ── FASTA automaton ───────────────────────────────────────────────────────────
fn parse_fasta(cursor: ForwardCursor<'_>, k: usize) -> Vec<SeqRecord> {
let mut records: Vec<SeqRecord> = Vec::new();
let mut builder = RecordBuilder::new(k);
// skip up to (and including) the first '>'
loop {
match cursor.read_next().ok() {
None => return records,
Some(b'>') => break,
Some(_) => {}
}
}
// read first id — read_id already consumes the full header line
builder.id = read_id(&cursor);
loop {
match cursor.read_next().ok() {
None => {
// EOF — close final segment and emit
if let Some(rec) = builder.finish() {
records.push(rec);
}
return records;
}
Some(b'\n') | Some(b'\r') => {
// peek: next non-empty char determines if new record starts
match cursor.read_ahead(1).ok() {
Some(b'>') => {
// end of current record
builder.end_segment();
if let Some(rec) = builder.finish() {
records.push(rec);
}
cursor.read_next().ok(); // consume '>'
let id = read_id(&cursor); // already consumes header line
builder = RecordBuilder::new(k);
builder.reset(id);
}
None => {
builder.end_segment();
if let Some(rec) = builder.finish() {
records.push(rec);
}
return records;
}
Some(_) => {} // continuation line — do nothing
}
}
Some(b) => {
let upper = b & !0x20u8;
if matches!(upper, b'A' | b'C' | b'G' | b'T') {
builder.push_acgt(upper);
} else {
builder.push_raw(b);
builder.end_segment();
}
}
}
}
}
// ── FASTQ automaton ───────────────────────────────────────────────────────────
fn parse_fastq(cursor: ForwardCursor<'_>, k: usize) -> Vec<SeqRecord> {
let mut records: Vec<SeqRecord> = Vec::new();
loop {
// find '@'
loop {
match cursor.read_next().ok() {
None => return records,
Some(b'@') => break,
Some(_) => {}
}
}
let mut builder = RecordBuilder::new(k);
builder.id = read_id(&cursor); // already consumes the full header line
// sequence line — stop at newline, non-ACGT breaks segment
loop {
match cursor.read_next().ok() {
None | Some(b'\n') | Some(b'\r') => {
builder.end_segment();
break;
}
Some(b) => {
let upper = b & !0x20u8;
if matches!(upper, b'A' | b'C' | b'G' | b'T') {
builder.push_acgt(upper);
} else {
builder.push_raw(b);
builder.end_segment();
}
}
}
}
skip_line(&cursor); // '+' line
skip_line(&cursor); // quality line
if let Some(rec) = builder.finish() {
records.push(rec);
}
}
}
// ── Helpers ───────────────────────────────────────────────────────────────────
fn read_id(cursor: &ForwardCursor<'_>) -> String {
let mut id = Vec::new();
loop {
match cursor.read_next().ok() {
None | Some(b'\n') | Some(b'\r') => break,
Some(b' ') | Some(b'\t') => {
skip_line(cursor);
break;
}
Some(b) => id.push(b),
}
}
String::from_utf8_lossy(&id).into_owned()
}
fn skip_line(cursor: &ForwardCursor<'_>) {
while let Some(c) = cursor.read_next().ok() {
if c == b'\n' {
return;
}
}
}