refactor: aggregate query results at sequence level

Refactor the query pipeline to buffer partition outputs into a per-sequence `seq_results` vector, deferring final accumulation until all partitions complete. This ensures global position ordering before computing k-mer presence, counts, and coverage statistics. Additionally, removes a resolved TODO and documents a known BLAST false-positive issue where chloroplast and bacterial contaminants yield unrealistic high-confidence matches.
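The buffering strategy described above can be illustrated with a minimal Python sketch. It is not the project's code: only the name `seq_results` comes from the commit message. `KmerHit`, `aggregate_sequence`, and the assumption that each k-mer of a sequence is answered by exactly one partition are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KmerHit:
    """Hypothetical record returned by one index partition for one k-mer."""
    position: int  # 0-based start of the k-mer in the query sequence
    count: int     # occurrences of that k-mer found by the partition


def aggregate_sequence(partition_outputs):
    """Merge the per-partition hits of ONE sequence, then compute statistics.

    Assumption: every k-mer of the sequence is answered by exactly one
    partition, so the buffered hits hold one entry per k-mer position.
    """
    # Step 1: buffer only. Nothing is accumulated while partitions
    # are still being collected.
    seq_results = []
    for hits in partition_outputs:
        seq_results.extend(hits)

    # Step 2: restore the global position order across partitions.
    seq_results.sort(key=lambda hit: hit.position)

    # Step 3: final accumulation on the ordered buffer.
    coverage = [1 if hit.count > 0 else 0 for hit in seq_results]
    return {
        "coverage": coverage,              # per-position presence flags
        "kmer_matches": sum(coverage),     # number of k-mers found
        "total_count": sum(hit.count for hit in seq_results),
    }


# Two partitions return their hits in arbitrary order.
stats = aggregate_sequence([
    [KmerHit(2, 1), KmerHit(0, 3)],
    [KmerHit(1, 0), KmerHit(3, 2)],
])
print(stats["coverage"])  # [1, 0, 1, 1]
```

Sorting once after all partitions have reported keeps the accumulation step independent of the order in which partitions finish.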
Eric Coissac
2026-05-30 07:16:23 +02:00
parent 3138f6382c
commit 708b0abf9b
2 changed files with 89 additions and 38 deletions
@@ -24,3 +24,37 @@ Except that with an approximate index, the results will be approximate.
--detail and --mismatch remain to be implemented
- status: displays the status of the index
## Biological problem with contaminant identification
Example of a problematic read:
```
>LH00534:161:22WMGWLT4:4:1101:45301:1420 {"coverage":{"gbbct":[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]},"kmer_count":117,"kmer_strict_matches":{"gbbct":117}}
GCCCCTACCGTACTCCAGCTTGGTAGTTTCCACCGCCTGTCCAGGGTTGAGCCCTGGGATTTGACGGCGGACTTAAAAAGCCACCTACAGACGCTTTACGCCCAATCATTCCGGATAACGCTTGCATCCTCTGTATTACCGCGGCTGCTGG
```
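To inspect such a record programmatically, the JSON annotation can be split from the read identifier. The following is a minimal sketch that assumes the header layout shown above (`>id`, one space, then a JSON object). The helper names `parse_annotated_header` and `covered_fraction` are hypothetical, and the example header is a shortened, made-up record, not real data.

```python
import json


def parse_annotated_header(header):
    """Split a '>id {json}' FASTA header into (id, annotations)."""
    seq_id, _, payload = header.lstrip(">").partition(" ")
    annotations = json.loads(payload) if payload else {}
    return seq_id, annotations


def covered_fraction(annotations, index_name):
    """Fraction of positions flagged as covered for one index label."""
    coverage = annotations["coverage"][index_name]
    return sum(1 for flag in coverage if flag > 0) / len(coverage)


# Shortened, invented header that mimics the layout of the read above.
header = ('>read1 {"coverage":{"gbbct":[1,1,0,1]},'
          '"kmer_count":4,"kmer_strict_matches":{"gbbct":3}}')
seq_id, ann = parse_annotated_header(header)

print(seq_id)                                                 # read1
print(covered_fraction(ann, "gbbct"))                         # 0.75
print(ann["kmer_strict_matches"]["gbbct"] / ann["kmer_count"])  # 0.75
```

For the read shown above, the same ratio is 117/117: every k-mer of the read has a strict match under the `gbbct` label.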
With BLAST, this read matches an implausible number of chloroplast genomes at 100% identity (6554 hits for Streptophyta),
but also a substantial number of bacterial OTU sequences (uncultured bacteria, 115 hits), likewise at 100% similarity.
```
Uncultured bacterium clone Otu01032 16S ribosomal RNA gene, partial sequence
Sequence ID: KX996137.1  Length: 440  Number of Matches: 1
Range 1: 153 to 303
Alignment statistics for match #1
Score          Expect  Identities     Gaps       Strand
273 bits(302)  2e-69   151/151(100%)  0/151(0%)  Plus/Minus
Query  1    GCCCCTACCGTACTCCAGCTTGGTAGTTTCCACCGCCTGTCCAGGGTTGAGCCCTGGGAT  60
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  303  GCCCCTACCGTACTCCAGCTTGGTAGTTTCCACCGCCTGTCCAGGGTTGAGCCCTGGGAT  244
Query  61   TTGACGGCGGACTTAAAAAGCCACCTACAGACGCTTTACGCCCAATCATTCCGGATAACG  120
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  243  TTGACGGCGGACTTAAAAAGCCACCTACAGACGCTTTACGCCCAATCATTCCGGATAACG  184
Query  121  CTTGCATCCTCTGTATTACCGCGGCTGCTGG  151
            |||||||||||||||||||||||||||||||
Sbjct  183  CTTGCATCCTCTGTATTACCGCGGCTGCTGG  153
```