Commit Graph

282 Commits

Author SHA1 Message Date
Eric Coissac 42910c7db9 🔧 Rename mismatch fields to error in pcrtag.go
- Renamed `obimultiplex_forward_mismatches` to "error" for consistency
- Similarly renamed 
  `obimultiplex_reverse_mismatches` to "error"
- Applied changes in both annotation dictionaries (aanot, banot)
2026-04-29 15:29:25 +02:00
Eric Coissac 8b4cf677c6 [obitools] Add validation for paired files and config template support
- Enforce requirement of both forward (-F) and reverse files in obipairing/main.go
- Add config template support to obtagpcr via CLIAskConfigTemplate()
- Remove redundant Required() constraints in options.go
- Introduce new helper CLIHasPairedFiles()
2026-04-29 15:01:37 +02:00
Eric Coissac 434d2e5930 +feat: add support for map_summaries aggregation in obisummary
- Implement merging logic of `map summaries` across datasets
  - Ensure proper initialization and population in multi-threaded context
- Add `map_summaries` to final output dictionary when non-empty
2026-04-16 14:58:18 +02:00
Eric Coissac 7cb02ded69 Refactor: Extract utility function for string reversal
- Introduce `inverser_chaine()` helper to centralize logic
 - Replace inline reverse implementations across modules
2026-04-16 13:42:51 +02:00
Eric Coissac a2b26712b2 refactor: replace fixed batch size with dynamic flushing based on count and memory
Replace the old fixed batch-size mechanism in Distribute with a dynamic strategy that flushes batches when either BatchSizeMax() sequences or BatchMem() bytes are reached per key. This aligns with the RebatchBySize strategy and removes the optional sizes parameter. Also update related code: simplify Lua wrapper to accept optional capacity, and fix buffer growth logic in worker.go using slices.Grow correctly. Remove unused BatchSize() usage from obidistribute.
2026-03-16 22:06:44 +01:00
Eric Coissac c188580aac Replace Rebatch with RebatchBySize using default batch parameters
Replace calls to Rebatch(size) with RebatchBySize(obidefault.BatchMem(), obidefault.BatchSizeMax()) in batchiterator.go, fragment.go, and obirefidx.go to ensure consistent use of default memory and size limits for batch rebatching.
2026-03-13 15:16:33 +01:00
Eric Coissac 1e1f575d1c refactor: replace single batch size with min/max bounds and memory limits
Introduce separate _BatchSize (min) and _BatchSizeMax (max) constants to replace the single _BatchSize variable. Update RebatchBySize to accept both maxBytes and maxCount parameters, flushing when either limit is exceeded. Set default batch size min to 1, max to 2000, and memory limit to 128 MB. Update CLI options and sequence_reader.go accordingly.
2026-03-13 15:07:35 +01:00
Eric Coissac 40769bf827 Add memory-based batching support
Implement memory-aware batch sizing with --batch-mem CLI option, enabling adaptive batching based on estimated sequence memory footprint. Key changes:
- Added _BatchMem and related getters/setters in pkg/obidefault
- Implemented RebatchBySize() in pkg/obiter for memory-constrained batching
- Added BioSequence.MemorySize() for conservative memory estimation
- Integrated batch-mem option in pkg/obioptions with human-readable size parsing (e.g., 128K, 64M, 1G)
- Added obiutils.ParseMemSize/FormatMemSize for unit conversion
- Enhanced pool GC in pkg/obiseq/pool.go to trigger explicit GC for large slice discards
- Updated sequence_reader.go to apply memory-based rebatching when enabled
2026-03-13 14:54:21 +01:00
Eric Coissac 8dd32dc1bf Fix CompressStream call to use compressed variable
Replace hardcoded boolean with the `compressed` variable in CompressStream call to ensure correct compression behavior.
2026-03-12 18:48:22 +01:00
Eric Coissac 1ce5da9bee Support new sequence file formats and improve error handling
Add support for .gbff and .gbff.gz file extensions in sequence reader.

Update the logic to return an error instead of using NilIBioSequence when no sequence files are found, improving the error handling and user feedback.
2026-02-11 06:31:10 +01:00
Eric Coissac ac41dd8a22 Refactor k-mer matching pipeline with improved concurrency and memory management
Refactor k-mer matching to use a pipeline architecture with improved concurrency and memory management:

- Replace sort.Slice with slices.SortFunc and cmp.Compare for better performance
- Introduce PreparedQueries struct to encapsulate query buckets with metadata
- Implement MergeQueries function to merge query buckets from multiple batches
- Rewrite MatchBatch to use pre-allocated results and mutexes instead of map-based accumulation
- Add seek optimization in matchPartition to reduce linear scanning
- Refactor match command to use a multi-stage pipeline with proper batching and merging
- Add index directory option for match command
- Improve parallel processing of sequence batches

This refactoring improves performance by reducing memory allocations, optimizing k-mer lookup, and implementing a more efficient pipeline for large-scale k-mer matching operations.
2026-02-10 22:10:36 +01:00
Eric Coissac bebbbbfe7d Add entropy-based filtering for k-mers
This commit introduces entropy-based filtering for k-mers to remove low-complexity sequences. It adds:

- New KmerEntropy and KmerEntropyFilter functions in pkg/obikmer/entropy.go for computing and filtering k-mer entropy
- Integration of entropy filtering in the k-mer set builder (pkg/obikmer/kmer_set_builder.go)
- A new 'filter' command in obik tool (pkg/obitools/obik/filter.go) to apply entropy filtering on existing indices
- CLI options for configuring entropy filtering during index building and filtering

The entropy filter helps improve the quality of k-mer sets by removing repetitive sequences that may interfere with downstream analyses.
2026-02-10 18:20:35 +01:00
Eric Coissac c6e04265f1 Add sparse index support for KDI files with fast seeking
This commit introduces sparse index support for KDI files to enable fast random access during k-mer matching. It adds a new .kdx index file format and updates the KDI reader and writer to handle index creation and seeking. The changes include:

- New KdxIndex struct and related functions for loading, searching, and writing .kdx files
- Modified KdiReader to support seeking with the new index
- Updated KdiWriter to create .kdx index files during writing
- Enhanced KmerSetGroup.Contains to use the new index for faster lookups
- Added a new 'match' command to annotate sequences with k-mer match positions

The index is created automatically during KDI file creation and allows for O(log N / stride) binary search followed by at most stride linear scan steps, significantly improving performance for large datasets.
2026-02-10 13:24:24 +01:00
Eric Coissac 9babcc0fae Refactor lowmask options and shared kmer options
Refactor lowmask options to use shared kmer options and CLI getters

This commit refactors the lowmask subcommand to use shared kmer options and CLI getters instead of local variables. It also moves the kmer size and minimizer size options to a shared location and adds new CLI getters for the lowmask options.

- Move kmer size and minimizer size options to shared location
- Add CLI getters for lowmask options
- Refactor lowmask to use CLI getters
- Remove unused strings import
- Add MaskingMode type and related functions
2026-02-10 09:52:38 +01:00
Eric Coissac e775f7e256 Add option to keep shorter fragments in lowmask
Add a new boolean option 'keep-shorter' to preserve fragments shorter than kmer-size during split/extract mode.

This change introduces a new flag _lowmaskKeepShorter that controls whether fragments
shorter than the kmer size should be kept during split/extract operations.

The implementation:
1. Adds the new boolean variable _lowmaskKeepShorter
2. Registers the command-line option "keep-shorter"
3. Updates the lowMaskWorker function signature to accept the keepShorter parameter
4. Modifies the fragment selection logic to check the keepShorter flag
5. Updates the worker creation to pass the global flag value

This allows users to control the behavior when dealing with short sequences in
split/extract modes, providing more flexibility in low-complexity masking.
2026-02-10 09:36:42 +01:00
Eric Coissac f2937af1ad Add max frequency filtering and top-kmer saving capabilities
This commit introduces max frequency filtering to limit k-mer occurrences and adds functionality to save the N most frequent k-mers per set to CSV files. It also includes the ability to output k-mer frequency spectra as CSV and updates the CLI options accordingly.
2026-02-10 09:27:04 +01:00
Eric Coissac 56c1f4180c Refactor k-mer index management with subcommands and enhanced metadata support
This commit refactors the k-mer index management tools to use a unified subcommand structure with obik, adds support for per-set metadata and ID management, enhances the k-mer set group builder to support appending to existing groups, and improves command-line option handling with a new global options registration system.

Key changes:
- Introduce obik command with subcommands (index, ls, summary, cp, mv, rm, super, lowmask)
- Add support for per-set metadata and ID management in kmer set groups
- Implement ability to append to existing kmer index groups
- Refactor option parsing to use a global options registration system
- Add new commands for listing, copying, moving, and removing sets
- Enhance low-complexity masking with new options and output formats
- Improve kmer index summary with Jaccard distance matrix support
- Remove deprecated obikindex and obisuperkmer commands
- Update build process to use the new subcommand structure
2026-02-10 06:49:31 +01:00
Eric Coissac f78543ee75 Refactor k-mer index building to use disk-based KmerSetGroupBuilder
Refactor k-mer index building to use the new disk-based KmerSetGroupBuilder instead of the old KmerSet and FrequencyFilter approaches. This change introduces a more efficient and scalable approach to building k-mer indices by using partitioned disk storage with streaming operations.

- Replace BuildKmerIndex and BuildFrequencyFilterIndex with KmerSetGroupBuilder
- Add support for frequency filtering via WithMinFrequency option
- Remove deprecated k-mer set persistence methods
- Update CLI to use new builder approach
- Add new disk-based k-mer operations (union, intersect, difference, quorum)
- Introduce KDI (K-mer Delta Index) file format for efficient storage
- Add K-way merge operations for combining sorted k-mer streams
- Update documentation and examples to reflect new API

This refactoring provides better memory usage, faster operations on large datasets, and more flexible k-mer set operations.
2026-02-10 06:49:31 +01:00
Eric Coissac a016ad5b8a Refactor kmer index to disk-based partitioning with minimizer
Refactor kmer index package to use disk-based partitioning with minimizer

- Replace roaring64 bitmaps with disk-based kmer index
- Implement partitioned kmer sets with delta-varint encoding
- Add support for frequency filtering during construction
- Introduce new builder pattern for index construction
- Add streaming operations for set operations (union, intersect, etc.)
- Add support for super-kmer encoding during construction
- Update command line tool to use new index format
- Remove dependency on roaring bitmap library

This change introduces a new architecture for kmer indexing that is more memory efficient and scalable for large datasets.
2026-02-09 17:52:37 +01:00
Eric Coissac 99a8e69d10 Optimize low-complexity masking algorithm
This commit optimizes the low-complexity masking algorithm by:

1. Precomputing logarithm values and normalization tables to avoid repeated calculations
2. Replacing the MinMultiset-based sliding minimum with a more efficient deque-based implementation
3. Improving entropy calculation by using precomputed n*log(n) values
4. Simplifying the circular normalization process with precomputed tables
5. Removing unused imports and log statements

The changes significantly improve performance while maintaining the same masking behavior.
2026-02-09 09:05:46 +01:00
Eric Coissac 1a28d5ed64 Add progress bar configuration and conditional display
This commit introduces a new configuration module `obidefault` to manage progress bar settings, allowing users to disable progress bars via a `--no-progressbar` option. It updates various packages to conditionally display progress bars based on this new configuration, improving user experience by providing control over progress bar output. The changes also include improvements to progress bar handling in several packages, ensuring they are only displayed when appropriate (e.g., when stderr is a terminal and stdout is not piped).
2026-02-08 16:14:02 +01:00
Eric Coissac 7c12b1ee83 Disable progress bar when output is piped
Modify CLIProgressBar function to check if stdout is a named pipe and disable the progress bar accordingly. This prevents the progress bar from being displayed when the output is redirected or piped to another command.
2026-02-08 14:48:13 +01:00
Eric Coissac 7a979ba77f Add obisuperkmer command implementation and tests
This commit adds the implementation of the obisuperkmer command, including:

- The main command in cmd/obitools/obisuperkmer/
- The package implementation in pkg/obitools/obisuperkmer/
- Automated tests in obitests/obitools/obisuperkmer/
- Documentation for the implementation and tests

The obisuperkmer command extracts super k-mers from DNA sequences, following the standard OBITools architecture. It includes proper CLI option handling, validation of parameters, and integration with the OBITools pipeline system.

Tests cover basic functionality, parameter validation, output format, metadata preservation, and file I/O operations.
2026-02-07 13:54:02 +01:00
Eric Coissac 16f72e6305 refactoring of obikmer 2026-02-05 16:05:48 +01:00
Eric Coissac 371e702423 obiannotate --cut bug 2025-12-18 14:11:11 +01:00
Eric Coissac 547135c747 End of obilowmask 2025-12-03 11:49:07 +01:00
Eric Coissac e65b2a5efe obimatrix bugs 2025-11-21 13:24:06 +01:00
Eric Coissac ccc827afd3 finalise obilowmask 2025-11-18 15:33:08 +01:00
Eric Coissac cef29005a5 debug url reading 2025-11-18 15:30:20 +01:00
Eric Coissac 4603d7973e implementation de obilowmask 2025-11-18 15:30:20 +01:00
Eric Coissac 07cdd6f758 debug obimultiplex
bug option obimultiplex
2025-11-06 15:43:13 +01:00
Eric Coissac 0844dcc607 bug obimatrix 2025-10-28 13:57:31 +01:00
Eric Coissac d17a9520b9 work on obiclean chimera detection 2025-10-20 17:29:47 +02:00
Eric Coissac 29bf4ce871 add a feature to obimatrix adding obicsv option to obimatrix 2025-10-20 16:34:58 +02:00
Eric Coissac 82b6bb1ab6 correct a bug in func (worker SeqWorker) ChainWorkers(next SeqWorker) SeqWorker 2025-08-11 15:09:49 +02:00
Eric Coissac 412b54822c Patch a bug in obliclean for d>1 leading to some instability in the result 2025-08-07 17:01:38 -04:00
Eric Coissac 04f3af3e60 some renaming of functions 2025-08-06 15:54:50 -04:00
Eric Coissac f239e8da92 Rename ISequenceChunk 2025-08-05 08:49:45 -04:00
Eric Coissac ed28d3fb5b Adds a --u-to-t option 2025-07-07 15:35:26 +02:00
Eric Coissac 235a7e202a Update obisummary to account new obiseq.StatsOnValues type 2025-06-19 17:21:30 +02:00
Eric Coissac 27fa984a63 Patch obimatrix accoring to the new type obiseq.StatsOnValues 2025-06-19 16:51:53 +02:00
Eric Coissac 9965370d85 Manage a lock on StatsOnValues 2025-06-17 16:46:11 +02:00
Eric Coissac d31e677304 Patch a bug in obitag 2025-06-04 14:47:28 +02:00
Eric Coissac 6cb7a5a352 Changes to be committed:
modified:   cmd/obitools/obitag/main.go
	modified:   cmd/obitools/obitaxonomy/main.go
	modified:   pkg/obiformats/csvtaxdump_read.go
	modified:   pkg/obiformats/ecopcr_read.go
	modified:   pkg/obiformats/ncbitaxdump_read.go
	modified:   pkg/obiformats/ncbitaxdump_readtar.go
	modified:   pkg/obiformats/newick_write.go
	modified:   pkg/obiformats/options.go
	modified:   pkg/obiformats/taxonomy_read.go
	modified:   pkg/obiformats/universal_read.go
	modified:   pkg/obiiter/extract_taxonomy.go
	modified:   pkg/obioptions/options.go
	modified:   pkg/obioptions/version.go
	new file:   pkg/obiphylo/tree.go
	modified:   pkg/obiseq/biosequenceslice.go
	modified:   pkg/obiseq/taxonomy_methods.go
	modified:   pkg/obitax/taxonomy.go
	modified:   pkg/obitax/taxonset.go
	modified:   pkg/obitools/obiconvert/sequence_reader.go
	modified:   pkg/obitools/obitag/obitag.go
	modified:   pkg/obitools/obitaxonomy/obitaxonomy.go
	modified:   pkg/obitools/obitaxonomy/options.go
	deleted:    sample/.DS_Store
2025-06-04 09:48:10 +02:00
Eric Coissac c0ecaf90ab Add the --number option to obiannotate 2025-04-22 18:35:51 +02:00
Eric Coissac 7542e33010 Several bugs dicoverd during the doc writing 2025-04-04 16:59:27 +02:00
Eric Coissac 03b5ce9397 Patch a bug in obitag when some reference sequences have taxid absent from the taxonomy 2025-03-27 16:45:02 +01:00
Eric Coissac fd80249b85 Patch a bug in obitag when a taxon from the reference library is unknown in the taxonomy 2025-03-27 14:28:15 +01:00
Eric Coissac 5a3705b6bb Adds the --silent-warning options to the obitools commands and removes the --pared-with option from some of the obitols commands. 2025-03-25 16:44:46 +01:00
Eric Coissac 2ab6f67d58 Add a progress bar to chimera detection 2025-03-25 08:37:27 +01:00