This commit refactors the k-mer encoding logic to handle ambiguous bases more consistently and introduces a KmerSet type for better management of k-mer collections. The frequency filter now works with KmerSet instead of roaring bitmaps directly, and the API has been updated to support level-based frequency queries. Additionally, the commit updates the version and commit hash.
This commit introduces error handling for ambiguous DNA bases (N, R, Y, W, S, K, M, B, D, H, V) in k-mer encoding. It adds new functions IterNormalizedKmersWithErrors and EncodeNormalizedKmersWithErrors that track and encode the number of ambiguous bases in each k-mer using error markers in the top 2 bits. The commit also updates the version string to reflect the latest changes.
Implémentation complète du filtre de fréquence utilisant v niveaux de Roaring Bitmaps pour éliminer efficacement les erreurs de séquençage.
- Ajout de la logique de filtrage par fréquence avec v niveaux
- Intégration des bibliothèques RoaringBitmap et bitset
- Ajout d'exemples d'utilisation et de documentation
- Implémentation de l'itérateur de k-mers pour une utilisation mémoire efficace
- Optimisation pour les distributions skewed typiques du séquençage
Ce changement permet de filtrer les k-mers par fréquence minimale avec une utilisation mémoire optimale et une seule passe sur les données.
This commit introduces error marker functionality for k-mers with odd lengths up to 31. The top 2 bits of each k-mer are now reserved for error coding (0-3), allowing for error detection and correction capabilities. Key changes include:
- Added constants KmerErrorMask and KmerSequenceMask for bit manipulation
- Implemented SetKmerError, GetKmerError, and ClearKmerError functions
- Updated EncodeKmers, ExtractSuperKmers, EncodeNormalizedKmers functions to enforce k ≤ 31
- Enhanced ReverseComplement to preserve error bits during reverse complement operations
- Added comprehensive tests for error marker functionality including edge cases and integration tests
The maximum k-mer size is now capped at 31 to accommodate the error bits, ensuring that k-mers with odd lengths ≤ 31 utilize only 62 bits of the 64-bit uint64, leaving the top 2 bits available for error coding.
This commit introduces the ExtractSuperKmers function which identifies maximal subsequences where all consecutive k-mers share the same minimizer. It includes:
- SuperKmer struct to represent the maximal subsequences
- dequeItem struct for tracking minimizers in a sliding window
- Efficient algorithm using monotone deque for O(1) amortized minimizer tracking
- Comprehensive parameter validation
- Support for buffer reuse for performance optimization
- Extensive test cases covering basic functionality, edge cases, and performance benchmarks
The implementation uses simultaneous forward/reverse m-mer encoding for O(1) canonical m-mer computation and maintains a monotone deque to track minimizers efficiently.