obitools4

mirror of https://github.com/metabarcoding/obitools4.git synced 2026-03-26 05:50:52 +00:00

Author	SHA1	Message	Date
Eric Coissac	ac41dd8a22	Refactor k-mer matching pipeline with improved concurrency and memory management Refactor k-mer matching to use a pipeline architecture with improved concurrency and memory management: - Replace sort.Slice with slices.SortFunc and cmp.Compare for better performance - Introduce PreparedQueries struct to encapsulate query buckets with metadata - Implement MergeQueries function to merge query buckets from multiple batches - Rewrite MatchBatch to use pre-allocated results and mutexes instead of map-based accumulation - Add seek optimization in matchPartition to reduce linear scanning - Refactor match command to use a multi-stage pipeline with proper batching and merging - Add index directory option for match command - Improve parallel processing of sequence batches This refactoring improves performance by reducing memory allocations, optimizing k-mer lookup, and implementing a more efficient pipeline for large-scale k-mer matching operations.	2026-02-10 22:10:36 +01:00
Eric Coissac	bebbbbfe7d	Add entropy-based filtering for k-mers This commit introduces entropy-based filtering for k-mers to remove low-complexity sequences. It adds: - New KmerEntropy and KmerEntropyFilter functions in pkg/obikmer/entropy.go for computing and filtering k-mer entropy - Integration of entropy filtering in the k-mer set builder (pkg/obikmer/kmer_set_builder.go) - A new 'filter' command in obik tool (pkg/obitools/obik/filter.go) to apply entropy filtering on existing indices - CLI options for configuring entropy filtering during index building and filtering The entropy filter helps improve the quality of k-mer sets by removing repetitive sequences that may interfere with downstream analyses.	2026-02-10 18:20:35 +01:00
Eric Coissac	c6e04265f1	Add sparse index support for KDI files with fast seeking This commit introduces sparse index support for KDI files to enable fast random access during k-mer matching. It adds a new .kdx index file format and updates the KDI reader and writer to handle index creation and seeking. The changes include: - New KdxIndex struct and related functions for loading, searching, and writing .kdx files - Modified KdiReader to support seeking with the new index - Updated KdiWriter to create .kdx index files during writing - Enhanced KmerSetGroup.Contains to use the new index for faster lookups - Added a new 'match' command to annotate sequences with k-mer match positions The index is created automatically during KDI file creation and allows for an O(log(N / stride)) binary search followed by a linear scan of at most stride steps, significantly improving performance for large datasets.	2026-02-10 13:24:24 +01:00
Eric Coissac	9babcc0fae	Refactor lowmask options and shared kmer options Refactor lowmask options to use shared kmer options and CLI getters This commit refactors the lowmask subcommand to use shared kmer options and CLI getters instead of local variables. It also moves the kmer size and minimizer size options to a shared location and adds new CLI getters for the lowmask options. - Move kmer size and minimizer size options to shared location - Add CLI getters for lowmask options - Refactor lowmask to use CLI getters - Remove unused strings import - Add MaskingMode type and related functions	2026-02-10 09:52:38 +01:00
Eric Coissac	e775f7e256	Add option to keep shorter fragments in lowmask Add a new boolean option 'keep-shorter' to preserve fragments shorter than kmer-size during split/extract mode. This change introduces a new flag _lowmaskKeepShorter that controls whether fragments shorter than the kmer size should be kept during split/extract operations. The implementation: 1. Adds the new boolean variable _lowmaskKeepShorter 2. Registers the command-line option "keep-shorter" 3. Updates the lowMaskWorker function signature to accept the keepShorter parameter 4. Modifies the fragment selection logic to check the keepShorter flag 5. Updates the worker creation to pass the global flag value This allows users to control the behavior when dealing with short sequences in split/extract modes, providing more flexibility in low-complexity masking.	2026-02-10 09:36:42 +01:00
Eric Coissac	f2937af1ad	Add max frequency filtering and top-kmer saving capabilities This commit introduces max frequency filtering to limit k-mer occurrences and adds functionality to save the N most frequent k-mers per set to CSV files. It also includes the ability to output k-mer frequency spectra as CSV and updates the CLI options accordingly.	2026-02-10 09:27:04 +01:00
Eric Coissac	56c1f4180c	Refactor k-mer index management with subcommands and enhanced metadata support This commit refactors the k-mer index management tools to use a unified subcommand structure with obik, adds support for per-set metadata and ID management, enhances the k-mer set group builder to support appending to existing groups, and improves command-line option handling with a new global options registration system. Key changes: - Introduce obik command with subcommands (index, ls, summary, cp, mv, rm, super, lowmask) - Add support for per-set metadata and ID management in kmer set groups - Implement ability to append to existing kmer index groups - Refactor option parsing to use a global options registration system - Add new commands for listing, copying, moving, and removing sets - Enhance low-complexity masking with new options and output formats - Improve kmer index summary with Jaccard distance matrix support - Remove deprecated obikindex and obisuperkmer commands - Update build process to use the new subcommand structure	2026-02-10 06:49:31 +01:00
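
Note on commit c6e04265f1: the lookup strategy it describes (binary search over a sparse index, then a bounded linear scan) can be illustrated with a minimal in-memory sketch. This is not the repository's KdxIndex or KdiReader code. The names sparseIndex, buildSparseIndex and contains are hypothetical, and the sketch assumes the k-mers are held as a sorted slice of uint64 rather than read from a .kdi file.

```go
package main

import (
	"fmt"
	"sort"
)

// sparseIndex keeps every stride-th value of a sorted k-mer slice.
// Hypothetical type for illustration only; it is not the KdxIndex
// struct from the repository.
type sparseIndex struct {
	stride  int
	samples []uint64 // samples[i] == kmers[i*stride]
}

// buildSparseIndex samples the sorted slice every `stride` entries.
// It assumes stride > 0 and kmers sorted in ascending order.
func buildSparseIndex(kmers []uint64, stride int) sparseIndex {
	idx := sparseIndex{stride: stride}
	for i := 0; i < len(kmers); i += stride {
		idx.samples = append(idx.samples, kmers[i])
	}
	return idx
}

// contains binary-searches the samples to find the block that may hold
// target, then scans at most `stride` entries of that block.
func contains(kmers []uint64, idx sparseIndex, target uint64) bool {
	// j is the index of the first sample strictly greater than target.
	j := sort.Search(len(idx.samples), func(i int) bool {
		return idx.samples[i] > target
	})
	if j == 0 {
		return false // target is smaller than every k-mer, or the set is empty
	}
	start := (j - 1) * idx.stride
	end := start + idx.stride
	if end > len(kmers) {
		end = len(kmers)
	}
	for _, k := range kmers[start:end] {
		if k == target {
			return true
		}
		if k > target {
			return false // sorted input: target cannot appear later
		}
	}
	return false
}

func main() {
	kmers := []uint64{2, 5, 9, 14, 20, 27, 33}
	idx := buildSparseIndex(kmers, 3)
	fmt.Println(contains(kmers, idx, 20), contains(kmers, idx, 10))
}
```

With N k-mers and a given stride, the sample slice has about N / stride entries, so the binary search costs O(log(N / stride)) comparisons and the scan touches at most stride entries.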

7 Commits
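
Note on commit bebbbbfe7d: the idea of entropy-based filtering can be sketched as follows. This is a simplified stand-in, not the KmerEntropy or KmerEntropyFilter functions from pkg/obikmer/entropy.go, whose exact definition is not given in the log. The sketch assumes k-mers are plain strings and uses the Shannon entropy of single-nucleotide frequencies. The names shannonEntropy and filterLowComplexity, and the threshold value, are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// shannonEntropy returns the Shannon entropy, in bits, of the
// single-nucleotide composition of kmer. Simplified illustration;
// the repository's KmerEntropy may use a different definition.
func shannonEntropy(kmer string) float64 {
	if len(kmer) == 0 {
		return 0
	}
	counts := make(map[byte]int)
	for i := 0; i < len(kmer); i++ {
		counts[kmer[i]]++
	}
	n := float64(len(kmer))
	h := 0.0
	for _, c := range counts {
		p := float64(c) / n
		h -= p * math.Log2(p)
	}
	return h
}

// filterLowComplexity keeps only the k-mers whose entropy is at least
// threshold. Hypothetical helper, named for this sketch only.
func filterLowComplexity(kmers []string, threshold float64) []string {
	kept := make([]string, 0, len(kmers))
	for _, k := range kmers {
		if shannonEntropy(k) >= threshold {
			kept = append(kept, k)
		}
	}
	return kept
}

func main() {
	kmers := []string{"AAAAAAAA", "ATATATAT", "ACGTACGT"}
	// Homopolymers and short repeats score low and are dropped.
	fmt.Println(filterLowComplexity(kmers, 1.5))
}
```

A homopolymer such as AAAAAAAA has entropy 0 and a k-mer using all four nucleotides equally has entropy 2 bits, so a threshold between those values removes the most repetitive k-mers while keeping the diverse ones.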