Commit Graph

7 Commits

Author SHA1 Message Date
Eric Coissac
761e0dbed3 Implémentation d'un parseur GenBank utilisant rope pour réduire l'usage de mémoire
Ajout d'un parseur GenBank basé sur rope pour réduire l'usage de mémoire (RSS) et les allocations heap.

- Ajout de `gbRopeScanner` pour lire les lignes sans allocation heap
- Implémentation de `GenbankChunkParserRope` qui utilise rope au lieu de `Pack()`
- Modification de `_ParseGenbankFile` et `ReadGenbank` pour utiliser le nouveau parseur
- Réduction du RSS attendue de 57 GB à ~128 MB × workers
- Conservation de l'ancien parseur pour compatibilité et tests

Réduction significative des allocations (~50M) et temps sys, avec un temps user comparable ou meilleur.
2026-03-10 15:35:36 +01:00
Eric Coissac
a7ea47624b Optimisation du parsing des grandes séquences
Implémente une optimisation du parsing des grandes séquences en évitant l'allocation de mémoire inutile lors de la fusion des chunks. Ajoute un support pour le parsing direct de la structure rope, ce qui permet de réduire les allocations et d'améliorer les performances lors du traitement de fichiers GenBank/EMBL et FASTA/FASTQ de plusieurs Gbp. Les parseurs sont mis à jour pour utiliser la rope non-packée et le nouveau mécanisme d'écriture in-place pour les séquences GenBank.
2026-03-10 14:20:21 +01:00
Eric Coissac
a016ad5b8a Refactor kmer index to disk-based partitioning with minimizer
Refactor kmer index package to use disk-based partitioning with minimizer

- Replace roaring64 bitmaps with disk-based kmer index
- Implement partitioned kmer sets with delta-varint encoding
- Add support for frequency filtering during construction
- Introduce new builder pattern for index construction
- Add streaming operations for set operations (union, intersect, etc.)
- Add support for super-kmer encoding during construction
- Update command line tool to use new index format
- Remove dependency on roaring bitmap library

This change introduces a new architecture for kmer indexing that is more memory efficient and scalable for large datasets.
2026-02-09 17:52:37 +01:00
Eric Coissac
db98ddb241 Fix super k-mer minimizer bijection and add validation test
This commit addresses a bug in the super k-mer implementation where the minimizer bijection property was not properly enforced. The fix ensures that:

1. All k-mers within a super k-mer share the same minimizer
2. Identical super k-mer sequences have the same minimizer

The changes include:

- Fixing the super k-mer iteration logic to properly validate the minimizer bijection property
- Adding a comprehensive test suite (TestSuperKmerMinimizerBijection) that validates the intrinsic property of super k-mers
- Updating the .gitignore file to properly track relevant files

This resolves issues where the same sequence could be associated with different minimizers, violating the super k-mer definition.
2026-02-08 13:47:33 +01:00
Eric Coissac
7a979ba77f Add obisuperkmer command implementation and tests
This commit adds the implementation of the obisuperkmer command, including:

- The main command in cmd/obitools/obisuperkmer/
- The package implementation in pkg/obitools/obisuperkmer/
- Automated tests in obitests/obitools/obisuperkmer/
- Documentation for the implementation and tests

The obisuperkmer command extracts super k-mers from DNA sequences, following the standard OBITools architecture. It includes proper CLI option handling, validation of parameters, and integration with the OBITools pipeline system.

Tests cover basic functionality, parameter validation, output format, metadata preservation, and file I/O operations.
2026-02-07 13:54:02 +01:00
Eric Coissac
00c8be6b48 docs: add architecture documentation for OBITools commands
Ajout d'une documentation détaillée sur l'architecture des commandes OBITools, incluant la structure modulaire, les patterns architecturaux et les bonnes pratiques pour la création de nouvelles commandes.
2026-02-07 12:26:35 +01:00
Eric Coissac
500144051a Add jj Makefile targets and k-mer encoding utilities
Add new Makefile targets for jj operations (jjnew, jjpush, jjfetch) to streamline commit workflow.

Introduce k-mer encoding utilities in pkg/obikmer:
- EncodeKmers: converts DNA sequences to encoded k-mers
- ReverseComplement: computes reverse complement of k-mers
- NormalizeKmer: returns canonical form of k-mers
- EncodeNormalizedKmers: encodes sequences with normalized k-mers

Add comprehensive tests for k-mer encoding functions including edge cases, buffer reuse, and performance benchmarks.

Document k-mer index design for large genomes, covering:
- Use cases and objectives
- Volume estimations
- Distance metrics (Jaccard, Sørensen-Dice, Bray-Curtis)
- Indexing options (Bloom filters, sorted sets, MPHF)
- Optimization techniques (k-2-mer indexing)
- MinHash for distance acceleration
- Recommended architecture for presence/absence and counting queries
2026-02-04 14:27:10 +01:00