Commit Graph

628 Commits

Author SHA1 Message Date
Eric Coissac
00dcd78e84 Refactor k-mer encoding and frequency filtering with KmerSet
This commit refactors the k-mer encoding logic to handle ambiguous bases more consistently and introduces a KmerSet type for better management of k-mer collections. The frequency filter now works with KmerSet instead of roaring bitmaps directly, and the API has been updated to support level-based frequency queries. Additionally, the commit updates the version and commit hash.
2026-02-05 14:41:59 +01:00
Eric Coissac
60f27c1dc8 Add error handling for ambiguous bases in k-mer encoding
This commit introduces error handling for ambiguous DNA bases (N, R, Y, W, S, K, M, B, D, H, V) in k-mer encoding. It adds new functions IterNormalizedKmersWithErrors and EncodeNormalizedKmersWithErrors that track and encode the number of ambiguous bases in each k-mer using error markers in the top 2 bits. The commit also updates the version string to reflect the latest changes.
2026-02-04 21:45:08 +01:00
Eric Coissac
28162ac36f Ajout du filtre de fréquence avec v niveaux Roaring Bitmaps
Implémentation complète du filtre de fréquence utilisant v niveaux de Roaring Bitmaps pour éliminer efficacement les erreurs de séquençage.

- Ajout de la logique de filtrage par fréquence avec v niveaux
- Intégration des bibliothèques RoaringBitmap et bitset
- Ajout d'exemples d'utilisation et de documentation
- Implémentation de l'itérateur de k-mers pour une utilisation mémoire efficace
- Optimisation pour les distributions skewed typiques du séquençage

Ce changement permet de filtrer les k-mers par fréquence minimale avec une utilisation mémoire optimale et une seule passe sur les données.
2026-02-04 21:21:10 +01:00
Eric Coissac
1a1adb83ac Add error marker support for k-mers with enhanced documentation
This commit introduces error marker functionality for k-mers with odd lengths up to 31. The top 2 bits of each k-mer are now reserved for error coding (0-3), allowing for error detection and correction capabilities. Key changes include:

- Added constants KmerErrorMask and KmerSequenceMask for bit manipulation
- Implemented SetKmerError, GetKmerError, and ClearKmerError functions
- Updated EncodeKmers, ExtractSuperKmers, EncodeNormalizedKmers functions to enforce k ≤ 31
- Enhanced ReverseComplement to preserve error bits during reverse complement operations
- Added comprehensive tests for error marker functionality including edge cases and integration tests

The maximum k-mer size is now capped at 31 to accommodate the error bits, ensuring that k-mers with odd lengths ≤ 31 utilize only 62 bits of the 64-bit uint64, leaving the top 2 bits available for error coding.
2026-02-04 16:21:47 +01:00
Eric Coissac
05de9ca58e Add SuperKmer extraction functionality
This commit introduces the ExtractSuperKmers function which identifies maximal subsequences where all consecutive k-mers share the same minimizer. It includes:

- SuperKmer struct to represent the maximal subsequences
- dequeItem struct for tracking minimizers in a sliding window
- Efficient algorithm using monotone deque for O(1) amortized minimizer tracking
- Comprehensive parameter validation
- Support for buffer reuse for performance optimization
- Extensive test cases covering basic functionality, edge cases, and performance benchmarks

The implementation uses simultaneous forward/reverse m-mer encoding for O(1) canonical m-mer computation and maintains a monotone deque to track minimizers efficiently.
2026-02-04 16:04:06 +01:00
Eric Coissac
500144051a Add jj Makefile targets and k-mer encoding utilities
Add new Makefile targets for jj operations (jjnew, jjpush, jjfetch) to streamline commit workflow.

Introduce k-mer encoding utilities in pkg/obikmer:
- EncodeKmers: converts DNA sequences to encoded k-mers
- ReverseComplement: computes reverse complement of k-mers
- NormalizeKmer: returns canonical form of k-mers
- EncodeNormalizedKmers: encodes sequences with normalized k-mers

Add comprehensive tests for k-mer encoding functions including edge cases, buffer reuse, and performance benchmarks.

Document k-mer index design for large genomes, covering:
- Use cases and objectives
- Volume estimations
- Distance metrics (Jaccard, Sørensen-Dice, Bray-Curtis)
- Indexing options (Bloom filters, sorted sets, MPHF)
- Optimization techniques (k-2-mer indexing)
- MinHash for distance acceleration
- Recommended architecture for presence/absence and counting queries
2026-02-04 14:27:10 +01:00
coissac
740f66b4c7 Merge pull request #71 from metabarcoding/push-onwzsyuooozn
Implémentation du filtrage unique basé sur séquence et catégories
2026-01-14 19:19:27 +01:00
Eric Coissac
b49aba9c09 Implémentation du filtrage unique basé sur séquence et catégories
Ajout d'une fonctionnalité pour le filtrage unique qui prend en compte à la fois la séquence et les catégories.

- Modification de la fonction ISequenceChunk pour accepter un classifieur unique optionnel
- Implémentation du traitement unique sur disque en utilisant un classifieur composite
- Mise à jour du classifieur utilisé pour le tri sur disque
- Correction de la gestion des clés de unicité en utilisant le code et la valeur du classifieur
- Mise à jour du numéro de commit
2026-01-14 19:18:17 +01:00
coissac
52244cdb64 Merge pull request #70 from metabarcoding/push-kuwnszsxmxpn
Refactor chunk processing and update version commit
2026-01-14 18:47:17 +01:00
Eric Coissac
0678181023 Refactor chunk processing and update version commit
Optimize chunk processing by moving variable declarations inside the loop and update the commit hash in version.go to reflect the latest changes.
2026-01-14 18:46:04 +01:00
coissac
f55dd553c7 Merge pull request #68 from metabarcoding/push-rrulynolpprl
Push rrulynolpprl
2026-01-14 17:44:36 +01:00
coissac
4a383ac6c9 Merge branch 'master' into push-rrulynolpprl 2025-12-18 14:12:56 +01:00
Eric Coissac
371e702423 obiannotate --cut bug 2025-12-18 14:11:11 +01:00
Eric Coissac
ac0d3f3fe4 Update obiuniq for very large dataset 2025-12-18 14:11:11 +01:00
Eric Coissac
547135c747 End of obilowmask 2025-12-03 11:49:07 +01:00
coissac
f4a919732e Merge pull request #65 from metabarcoding/push-yurwulsmpxkq
End of obilowmask
2025-11-26 12:13:08 +01:00
Eric Coissac
e681666aaa End of obilowmask 2025-11-26 11:14:56 +01:00
coissac
adf2486295 Merge pull request #64 from metabarcoding/push-yurwulsmpxkq
End of obilowmask
2025-11-24 15:36:20 +01:00
Eric Coissac
272f5c9c35 End of obilowmask 2025-11-24 15:27:38 +01:00
coissac
c1b9503ca6 Merge pull request #63 from metabarcoding/push-vypwrurrsxuk
obicsv bug with stat on value map fields
2025-11-21 14:04:34 +01:00
Eric Coissac
86e60aedd0 obicsv bug with stat on value map fields 2025-11-21 14:03:31 +01:00
coissac
961abcea7b Merge pull request #61 from metabarcoding/push-mvxssvnysyxn
Push mvxssvnysyxn
2025-11-21 13:25:19 +01:00
Eric Coissac
57c65f9d50 obimatrix bug 2025-11-21 13:24:24 +01:00
Eric Coissac
e65b2a5efe obimatrix bugs 2025-11-21 13:24:06 +01:00
coissac
3e5f3f76b0 Merge pull request #60 from metabarcoding/push-qpnzxskwpoxo
Push qpnzxskwpoxo
2025-11-18 15:35:41 +01:00
Eric Coissac
ccc827afd3 finalise obilowmask 2025-11-18 15:33:08 +01:00
Eric Coissac
cef29005a5 debug url reading 2025-11-18 15:30:20 +01:00
Eric Coissac
4603d7973e implementation de obilowmask 2025-11-18 15:30:20 +01:00
coissac
8bc47c13d3 Merge pull request #58 from metabarcoding/push-vxkqkkrokwuz
debug obimultiplex
2025-11-06 15:44:31 +01:00
Eric Coissac
07cdd6f758 debug obimultiplex
bug option obimultiplex
2025-11-06 15:43:13 +01:00
coissac
432da366e2 Merge pull request #57 from metabarcoding/push-ywktmvpvtvmv
debug taxonomy core dump
2025-11-05 19:07:41 +01:00
Eric Coissac
2d7dc7d09d debug taxonomy core dump 2025-11-05 19:01:15 +01:00
coissac
5e12ed5400 Merge pull request #56 from metabarcoding/push-tnrvpwvqtzyo
update install script
2025-11-04 18:11:21 +01:00
Eric Coissac
7500ee1d15 update install script 2025-11-04 18:09:15 +01:00
coissac
5a1d66bf06 Merge pull request #53 from metabarcoding/push-skmxzrzulvtq
Push skmxzrzulvtq
2025-10-28 14:27:19 +01:00
Eric Coissac
0844dcc607 bug obimatrix 2025-10-28 13:57:31 +01:00
Eric Coissac
7f4ebe757e Bug obiuniq - don't clean the chunks 2025-10-28 13:50:22 +01:00
coissac
5150947e23 Merge pull request #51 from metabarcoding/push-urtwmwktsrru
Push urtwmwktsrru
2025-10-20 17:41:33 +02:00
Eric Coissac
d17a9520b9 work on obiclean chimera detection 2025-10-20 17:29:47 +02:00
Eric Coissac
29bf4ce871 add a feature to obimatrix adding obicsv option to obimatrix 2025-10-20 16:34:58 +02:00
coissac
d7ed9d343e Update install_obitools.sh for missing directory 2025-10-15 08:32:06 +02:00
Eric Coissac
82b6bb1ab6 correct a bug in func (worker SeqWorker) ChainWorkers(next SeqWorker) SeqWorker 2025-08-11 15:09:49 +02:00
Eric Coissac
6d204f6281 Patch the fastq detector 2025-08-08 10:23:03 -04:00
Eric Coissac
7a6d552450 Changes to be committed:
modified:   pkg/obioptions/version.go
2025-08-07 17:01:48 -04:00
Eric Coissac
412b54822c Patch a bug in obliclean for d>1 leading to some instability in the result 2025-08-07 17:01:38 -04:00
Eric Coissac
730d448fc3 Allows for only one cpu and it should work 2025-08-06 16:09:25 -04:00
Eric Coissac
04f3af3e60 some renaming of functions 2025-08-06 15:54:50 -04:00
Eric Coissac
997b6e8c01 correct the fastq detector for distinguish with a csv ngsfilter 2025-08-06 15:52:54 -04:00
Eric Coissac
f239e8da92 Rename ISequenceChunk 2025-08-05 08:49:45 -04:00
Eric Coissac
ed28d3fb5b Adds a --u-to-t option 2025-07-07 15:35:26 +02:00