obitools4

mirror of https://github.com/metabarcoding/obitools4.git synced 2026-06-24 09:41:00 +00:00

Author	SHA1	Message	Date
coissac	ff6e515b2a	Merge pull request #91 from metabarcoding/push-uotrstkymowq 4.4.20: Rope-based parsing, improved release tooling, and bug fixes	2026-03-12 20:15:33 +01:00
Eric Coissac	cd0c525f50	4.4.20: Rope-based parsing, improved release tooling, and bug fixes ### Enhancements - Rope-based parsing: Added direct rope parsing for FASTA, EMBL, and FASTQ formats via `FastaChunkParserRope`, `EmblChunkParserRope`, and `FastqChunkParserRope`. Sequence extraction now supports U→T conversion and improved line ending detection. - Rope scanner refactoring: Unified rope scanning logic under a new `ropeScanner`, improving maintainability and consistency. - Sequence handling: Added `TakeQualities()` method to BioSequence for more efficient quality data handling. ### Bug Fixes - Compression behavior: Fixed `CompressStream` to correctly use the `compressed` variable instead of a hardcoded boolean. - String splitting: Replaced ambiguous `SplitInTwo` calls with precise `LeftSplitInTwo` or `RightSplitInTwo`, and added dedicated right-split utility. ### Tooling & Workflow Improvements - Makefile enhancements: Added colored terminal output, a `help` target for documenting all targets, and improved release workflow automation. - Release process: Refactored `jjpush` into modular targets (`jjpush-notes`, `jjpush-push`, `jjpush-tag`), replaced `orla` with `aichat` for AI-assisted release notes, and introduced robust JSON parsing using Python. Release notes are now generated and stored in temp files for tag creation. - Versioning: `bump-version` now supports the VERSION environment variable for manual version setting. - Submission: Switched from raw `jj git push` to `stakk` for PR submission. ### Internal Notes - Installation instructions are now included in release tags. - Fixed-size carry buffer replaced with dynamic slice for arbitrarily long line support without extra allocations.	2026-03-12 20:14:11 +01:00
Eric Coissac	abe935aa18	Add help target, colorize output, and improve release workflow - Add colored terminal output support (GREEN, YELLOW, BLUE, NC) - Introduce `help` target to document all Makefile targets - Enhance `bump-version` to accept VERSION env var for manual version setting - Refactor jjpush: split into modular targets (jjpush-notes, jjpush-push, jjpush-tag) - Replace orla with aichat for AI-powered release notes generation - Add robust JSON parsing using Python for release notes extraction - Use stakk for PR submission (replacing raw `jj git push`) - Generate and store release notes in temp files for tag creation - Add installation instructions to release tags - Update .PHONY with new targets 4.4.20: Rope-based parsing, improved release tooling, and bug fixes ### Enhancements - Rope-based parsing: Added direct rope parsing for FASTA, EMBL, and FASTQ formats via `FastaChunkParserRope`, `EmblChunkParserRope`, and `FastqChunkRope` functions, eliminating unnecessary memory allocation via Pack(). Sequence extraction now supports U→T conversion and improved line ending detection. - Rope scanner refactoring: Unified rope scanning logic under a new `ropeScanner`, improving maintainability and consistency across parsers. - Sequence handling: Added `TakeQualities()` method to BioSequence for more efficient quality data handling. ### Bug Fixes - Compression behavior: Fixed CompressStream to correctly use the `compressed` variable instead of a hardcoded boolean. - String splitting: Replaced ambiguous `SplitInTwo` calls with precise `LeftSplitInTwo` or `RightSplitInTwo`, and added dedicated right-split utility. ### Tooling & Workflow Improvements - Makefile enhancements: Added colored terminal output, a `help` target for documenting all targets, and improved release workflow automation. 
- Release process: Refactored `jjpush` into modular targets (`jjpush-notes`, `jjpush-push`, `jjpush-tag`), replaced `orla` with `aichat` for AI-assisted release notes, and introduced robust JSON parsing using Python. Release notes are now generated and stored in temp files for tag creation. - Versioning: `bump-version` now supports the VERSION environment variable for manual version setting. - Submission: Switched from raw `jj git push` to `stakk` for PR submission. ### Internal Notes - Installation instructions are now included in release tags. - Fixed-size carry buffer replaced with dynamic slice for arbitrarily long line support without extra allocations. Release_4.4.20	2026-03-12 20:14:11 +01:00
Eric Coissac	8dd32dc1bf	Fix CompressStream call to use compressed variable Replace hardcoded boolean with the `compressed` variable in CompressStream call to ensure correct compression behavior.	2026-03-12 18:48:22 +01:00
Eric Coissac	6ee8750635	Replace SplitInTwo with LeftSplitInTwo/RightSplitInTwo for precise splitting Replace SplitInTwo calls with LeftSplitInTwo or RightSplitInTwo depending on the intended split direction. In fastseq_json_header.go, extract rank from suffix without splitting; in biosequenceslice.go and taxid.go, use LeftSplitInTwo to split from the left; add RightSplitInTwo utility function for splitting from the right.	2026-03-12 18:41:28 +01:00
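The difference between splitting from the left and from the right can be sketched as follows. This is a minimal illustration only: the helper names `leftSplitInTwo` and `rightSplitInTwo`, their signatures, and their behavior when the separator is missing are assumptions made for this example, not the repository's actual utilities.

```go
package main

import "strings"

// leftSplitInTwo is a hypothetical helper: it cuts s at the FIRST
// occurrence of sep. If sep is absent, it returns s and "".
func leftSplitInTwo(s, sep string) (string, string) {
	before, after, found := strings.Cut(s, sep)
	if !found {
		return s, ""
	}
	return before, after
}

// rightSplitInTwo is a hypothetical helper: it cuts s at the LAST
// occurrence of sep. If sep is absent, it returns s and "".
func rightSplitInTwo(s, sep string) (string, string) {
	i := strings.LastIndex(s, sep)
	if i < 0 {
		return s, ""
	}
	return s[:i], s[i+len(sep):]
}
```

With an input such as `taxon:9606:species`, the left split yields `taxon` and `9606:species`, while the right split yields `taxon:9606` and `species`, which is why an unqualified "split in two" is ambiguous whenever the separator can occur more than once.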
Eric Coissac	8c318c480e	replace fixed-size carry buffer with dynamic slice Replace the fixed [256]byte carry buffer with a dynamic []byte slice to support arbitrarily long lines without heap allocation during accumulation. Update all carry buffer handling logic to use len(s.carry) and append instead of fixed-size copy operations.	2026-03-11 20:44:45 +01:00
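The idea of a growable carry buffer for lines that straddle chunk boundaries can be sketched roughly as below. The `lineScanner` type, its fields, and the `readLine` method are hypothetical names chosen for this illustration; the repository's scanner may be organized differently.

```go
package main

import "bytes"

// lineScanner (hypothetical) reads lines from a list of chunks, as a rope would
// provide them. Lines crossing a chunk boundary are accumulated in carry.
type lineScanner struct {
	chunks [][]byte
	idx    int    // index of the current chunk
	pos    int    // read position inside the current chunk
	carry  []byte // dynamic slice, reused between lines; grows only when needed
}

// trimCR removes a trailing '\r' so that CRLF and LF inputs behave the same.
func trimCR(line []byte) []byte {
	if n := len(line); n > 0 && line[n-1] == '\r' {
		return line[:n-1]
	}
	return line
}

// readLine returns the next line without its terminator. The returned slice
// is only valid until the next call, since it may alias the carry buffer.
func (s *lineScanner) readLine() ([]byte, bool) {
	s.carry = s.carry[:0] // keep capacity, drop content
	for s.idx < len(s.chunks) {
		data := s.chunks[s.idx][s.pos:]
		if i := bytes.IndexByte(data, '\n'); i >= 0 {
			s.pos += i + 1
			line := data[:i]
			if len(s.carry) > 0 {
				s.carry = append(s.carry, line...)
				line = s.carry
			}
			return trimCR(line), true
		}
		// No newline in the rest of this chunk: stash it and move on.
		s.carry = append(s.carry, data...)
		s.idx++
		s.pos = 0
	}
	if len(s.carry) > 0 {
		return trimCR(s.carry), true // last line without terminator
	}
	return nil, false
}
```

Because `carry` is truncated with `s.carry[:0]` rather than reallocated, its backing array is reused from one line to the next, so accumulation only allocates when a line is longer than any seen before.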
Eric Coissac	09fbc217d3	Add EMBL rope parsing support and improve sequence extraction Introduce EmblChunkParserRope function to parse EMBL chunks directly from a rope without using Pack(). Add extractEmblSeq helper to scan sequence sections and handle U to T conversion. Update parser logic to use rope-based parsing when available, and fix feature table handling for WGS entries.	2026-03-10 17:02:14 +01:00
Eric Coissac	3d2e205722	Refactor rope scanner and add FASTQ rope parser This commit refactors the rope scanner implementation by renaming gbRopeScanner to ropeScanner and extracting the common functionality into a new file. It also introduces a new FastqChunkParserRope function that parses FASTQ chunks directly from a rope without Pack(), enabling more efficient memory usage. The existing parsers are updated to use the new rope-based parser when available. The BioSequence type is enhanced with a TakeQualities method for more efficient quality data handling.	2026-03-10 16:47:03 +01:00
Eric Coissac	623116ab13	Add rope-based FASTA parsing and improve sequence handling Introduce FastaChunkParserRope for direct rope-based FASTA parsing, enhance sequence extraction with whitespace skipping and U->T conversion, and update parser logic to support both rope and raw data sources. - Added extractFastaSeq function to scan sequence bytes directly from rope - Implemented FastaChunkParserRope for rope-based parsing - Modified _ParseFastaFile to use rope when available - Updated sequence handling to support U->T conversion - Fixed line ending detection for FASTA parsing	2026-03-10 16:34:33 +01:00
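Whitespace skipping combined with U to T conversion can be illustrated with a small byte filter. This is only a sketch: `cleanSequence` is a hypothetical name, it works on a plain byte slice rather than a rope, and whether the real parser also changes letter case is not stated in the commit message, so case is preserved here.

```go
package main

// cleanSequence (hypothetical) appends the residues of src to dst, dropping
// whitespace and line terminators and converting RNA 'U'/'u' to DNA 'T'/'t'.
func cleanSequence(dst, src []byte) []byte {
	for _, b := range src {
		switch b {
		case ' ', '\t', '\r', '\n':
			continue // skip whitespace, including CR and LF
		case 'U':
			b = 'T'
		case 'u':
			b = 't'
		}
		dst = append(dst, b)
	}
	return dst
}
```

Passing a pre-allocated `dst` (for example `make([]byte, 0, len(src))`) keeps the extraction to a single allocation per sequence.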
coissac	1e4509cb63	Merge pull request #90 from metabarcoding/push-uzpqqoqvpnxw Push uzpqqoqvpnxw	2026-03-10 15:53:08 +01:00
Eric Coissac	b33d7705a8	Bump version to 4.4.19 Update version from 4.4.18 to 4.4.19 in both version.txt and pkg/obioptions/version.go	2026-03-10 15:51:36 +01:00
Eric Coissac	1342c83db6	Use NewBioSequenceOwning to avoid unnecessary sequence copying Replace NewBioSequence with NewBioSequenceOwning in genbank_read.go to take ownership of sequence slices without copying, improving performance. Update biosequence.go to add the new TakeSequence method and NewBioSequenceOwning constructor. Release_4.4.19	2026-03-10 15:51:35 +01:00
Eric Coissac	b246025907	Optimize Fasta batch formatting Optimize FormatFastaBatch to pre-allocate buffer and write sequences directly without intermediate strings, improving performance and memory usage.	2026-03-10 15:43:59 +01:00
Eric Coissac	761e0dbed3	Implement a rope-based GenBank parser to reduce memory usage Add a rope-based GenBank parser to reduce memory usage (RSS) and heap allocations. - Add `gbRopeScanner` to read lines without heap allocation - Implement `GenbankChunkParserRope`, which uses the rope instead of `Pack()` - Modify `_ParseGenbankFile` and `ReadGenbank` to use the new parser - Expected RSS reduction from 57 GB to ~128 MB × workers - Keep the old parser for compatibility and tests Significant reduction in allocations (~50M) and sys time, with comparable or better user time.	2026-03-10 15:35:36 +01:00
Eric Coissac	a7ea47624b	Optimize parsing of large sequences Implements an optimization for parsing large sequences by avoiding unnecessary memory allocation when merging chunks. Adds support for parsing the rope structure directly, which reduces allocations and improves performance when processing multi-Gbp GenBank/EMBL and FASTA/FASTQ files. The parsers are updated to use the unpacked rope and the new in-place write mechanism for GenBank sequences.	2026-03-10 14:20:21 +01:00
Eric Coissac	61e346658e	Refactor jjpush workflow and enhance release notes generation Split the jjpush target into multiple sub-targets (jjpush-describe, jjpush-bump, jjpush-push, jjpush-tag) for better modularity and control. Enhance release notes generation by: - Using git log with full commit messages instead of GitHub API for pre-release mode - Adding robust JSON parsing with fallbacks for release notes - Including detailed installation instructions in release notes - Supporting both pre-release and published release modes Update release_notes.sh to handle pre-release mode, improve commit message fetching, and add installation section to release notes. Add .PHONY declarations for new sub-targets.	2026-03-10 11:09:19 +01:00
coissac	1ba1294b11	Merge pull request #89 from metabarcoding/push-uoqxkozlonwx Push uoqxkozlonwx	2026-02-20 11:42:40 +01:00
Eric Coissac	b2476fffcb	Bump version to 4.4.18 Update version from 4.4.17 to 4.4.18 in version.txt and corresponding Go variable _Version.	2026-02-20 11:40:43 +01:00
Eric Coissac	b05404721e	Bump version to 4.4.16 Update version from 4.4.15 to 4.4.16 in version.go and version.txt files. Release_4.4.18	2026-02-20 11:40:40 +01:00
Eric Coissac	c57e788459	Fix GenBank parsing and add release notes script This commit fixes an issue in the GenBank parser where empty parts were being included in the parsed data. It also introduces a new script `release_notes.sh` to automate the generation of GitHub-compatible release notes for OBITools4 versions, including support for LLM summarization and various output modes.	2026-02-20 11:37:51 +01:00
coissac	1cecf23978	Merge pull request #86 from metabarcoding/push-oulwykrpwxuz Push oulwykrpwxuz	2026-02-11 06:34:05 +01:00
Eric Coissac	4c824ef9b7	Bump version to 4.4.15 Update version from 4.4.14 to 4.4.15 in version.txt and pkg/obioptions/version.go	2026-02-11 06:31:11 +01:00
Eric Coissac	1ce5da9bee	Support new sequence file formats and improve error handling Add support for .gbff and .gbff.gz file extensions in sequence reader. Update the logic to return an error instead of using NilIBioSequence when no sequence files are found, improving the error handling and user feedback. Release_4.4.15	2026-02-11 06:31:10 +01:00
coissac	dc23d9de9a	Merge pull request #85 from metabarcoding/push-smturnsrozkp Push smturnsrozkp	2026-02-10 22:19:22 +01:00
Eric Coissac	aa9d7bbf72	Bump version to 4.4.14 Update version number from 4.4.13 to 4.4.14 in both version.go and version.txt files.	2026-02-10 22:17:23 +01:00
Eric Coissac	db22d20d0a	Rename obisuperkmer test script to obik-super and update command references Update test script name from obisuperkmer to obik-super and adjust all command references accordingly. - Changed TEST_NAME from 'obisuperkmer' to 'obik-super' - Changed CMD from 'obisuperkmer' to 'obik' - Updated MCMD to 'OBIk-super' - Modified command calls to use '$CMD super' instead of direct command names - Updated help test to use '$CMD super -h' - Updated all test cases to use the new command format Release_4.4.14	2026-02-10 22:17:22 +01:00
coissac	7c05bdb01c	Merge pull request #84 from metabarcoding/push-uxvowwlxkrlq Push uxvowwlxkrlq	2026-02-10 22:12:18 +01:00
Eric Coissac	b6542c4523	Bump version to 4.4.13 Update version from 4.4.12 to 4.4.13 in version.txt and pkg/obioptions/version.go	2026-02-10 22:10:38 +01:00
Eric Coissac	ac41dd8a22	Refactor k-mer matching pipeline with improved concurrency and memory management Refactor k-mer matching to use a pipeline architecture with improved concurrency and memory management: - Replace sort.Slice with slices.SortFunc and cmp.Compare for better performance - Introduce PreparedQueries struct to encapsulate query buckets with metadata - Implement MergeQueries function to merge query buckets from multiple batches - Rewrite MatchBatch to use pre-allocated results and mutexes instead of map-based accumulation - Add seek optimization in matchPartition to reduce linear scanning - Refactor match command to use a multi-stage pipeline with proper batching and merging - Add index directory option for match command - Improve parallel processing of sequence batches This refactoring improves performance by reducing memory allocations, optimizing k-mer lookup, and implementing a more efficient pipeline for large-scale k-mer matching operations. Release_4.4.13	2026-02-10 22:10:36 +01:00
Eric Coissac	bebbbbfe7d	Add entropy-based filtering for k-mers This commit introduces entropy-based filtering for k-mers to remove low-complexity sequences. It adds: - New KmerEntropy and KmerEntropyFilter functions in pkg/obikmer/entropy.go for computing and filtering k-mer entropy - Integration of entropy filtering in the k-mer set builder (pkg/obikmer/kmer_set_builder.go) - A new 'filter' command in obik tool (pkg/obitools/obik/filter.go) to apply entropy filtering on existing indices - CLI options for configuring entropy filtering during index building and filtering The entropy filter helps improve the quality of k-mer sets by removing repetitive sequences that may interfere with downstream analyses.	2026-02-10 18:20:35 +01:00
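As a rough illustration of entropy-based filtering, the sketch below scores each k-mer by the Shannon entropy of its nucleotide composition and keeps those at or above a threshold. The function names, the choice of composition entropy, and the threshold semantics are all assumptions for this example; the commit message does not specify how `KmerEntropy` is actually defined.

```go
package main

import "math"

// compositionEntropy (hypothetical) returns the Shannon entropy, in bits,
// of the symbol frequencies in kmer: 0 for a homopolymer, 2 for a k-mer
// with equal numbers of A, C, G and T.
func compositionEntropy(kmer string) float64 {
	if len(kmer) == 0 {
		return 0
	}
	var counts [256]int
	for i := 0; i < len(kmer); i++ {
		counts[kmer[i]]++
	}
	n := float64(len(kmer))
	h := 0.0
	for _, c := range counts {
		if c > 0 {
			p := float64(c) / n
			h -= p * math.Log2(p)
		}
	}
	return h
}

// filterLowComplexity (hypothetical) keeps the k-mers whose entropy is at
// least threshold.
func filterLowComplexity(kmers []string, threshold float64) []string {
	kept := make([]string, 0, len(kmers))
	for _, k := range kmers {
		if compositionEntropy(k) >= threshold {
			kept = append(kept, k)
		}
	}
	return kept
}
```

A composition-only score does not detect repeats such as `ACGTACGT`, so a production filter would likely look at longer sub-words as well; the sketch only shows the filtering pattern.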
Eric Coissac	c6e04265f1	Add sparse index support for KDI files with fast seeking This commit introduces sparse index support for KDI files to enable fast random access during k-mer matching. It adds a new .kdx index file format and updates the KDI reader and writer to handle index creation and seeking. The changes include: - New KdxIndex struct and related functions for loading, searching, and writing .kdx files - Modified KdiReader to support seeking with the new index - Updated KdiWriter to create .kdx index files during writing - Enhanced KmerSetGroup.Contains to use the new index for faster lookups - Added a new 'match' command to annotate sequences with k-mer match positions The index is created automatically during KDI file creation and allows for O(log N / stride) binary search followed by at most stride linear scan steps, significantly improving performance for large datasets.	2026-02-10 13:24:24 +01:00
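The lookup pattern described above (binary search over a sparse sample, then a bounded linear scan) can be sketched in memory as follows. The `sparseIndex` type and its methods are hypothetical; the real `.kdx` index presumably records file offsets into a delta-encoded `.kdi` stream rather than positions in a slice.

```go
package main

import "sort"

// sparseIndex (hypothetical) samples one key every stride entries of a
// sorted k-mer list: keys[i] == sorted[i*stride].
type sparseIndex struct {
	stride int
	keys   []uint64
}

// buildSparseIndex samples the sorted list; stride must be >= 1.
func buildSparseIndex(sorted []uint64, stride int) sparseIndex {
	idx := sparseIndex{stride: stride}
	for i := 0; i < len(sorted); i += stride {
		idx.keys = append(idx.keys, sorted[i])
	}
	return idx
}

// contains performs a binary search over the sampled keys, then scans at
// most stride consecutive entries of the sorted list.
func (idx sparseIndex) contains(sorted []uint64, target uint64) bool {
	// j is the first sampled key strictly greater than target.
	j := sort.Search(len(idx.keys), func(i int) bool { return idx.keys[i] > target })
	if j == 0 {
		return false // target is smaller than every stored k-mer
	}
	start := (j - 1) * idx.stride
	end := start + idx.stride
	if end > len(sorted) {
		end = len(sorted)
	}
	for _, v := range sorted[start:end] {
		if v == target {
			return true
		}
		if v > target {
			return false
		}
	}
	return false
}
```

The binary search touches about log2(N / stride) sampled keys, and the scan is bounded by `stride`, which matches the cost stated in the commit message.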
Eric Coissac	9babcc0fae	Refactor lowmask options and shared kmer options Refactor lowmask options to use shared kmer options and CLI getters This commit refactors the lowmask subcommand to use shared kmer options and CLI getters instead of local variables. It also moves the kmer size and minimizer size options to a shared location and adds new CLI getters for the lowmask options. - Move kmer size and minimizer size options to shared location - Add CLI getters for lowmask options - Refactor lowmask to use CLI getters - Remove unused strings import - Add MaskingMode type and related functions	2026-02-10 09:52:38 +01:00
Eric Coissac	e775f7e256	Add option to keep shorter fragments in lowmask Add a new boolean option 'keep-shorter' to preserve fragments shorter than kmer-size during split/extract mode. This change introduces a new flag _lowmaskKeepShorter that controls whether fragments shorter than the kmer size should be kept during split/extract operations. The implementation: 1. Adds the new boolean variable _lowmaskKeepShorter 2. Registers the command-line option "keep-shorter" 3. Updates the lowMaskWorker function signature to accept the keepShorter parameter 4. Modifies the fragment selection logic to check the keepShorter flag 5. Updates the worker creation to pass the global flag value This allows users to control the behavior when dealing with short sequences in split/extract modes, providing more flexibility in low-complexity masking.	2026-02-10 09:36:42 +01:00
Eric Coissac	f2937af1ad	Add max frequency filtering and top-kmer saving capabilities This commit introduces max frequency filtering to limit k-mer occurrences and adds functionality to save the N most frequent k-mers per set to CSV files. It also includes the ability to output k-mer frequency spectra as CSV and updates the CLI options accordingly.	2026-02-10 09:27:04 +01:00
Eric Coissac	56c1f4180c	Refactor k-mer index management with subcommands and enhanced metadata support This commit refactors the k-mer index management tools to use a unified subcommand structure with obik, adds support for per-set metadata and ID management, enhances the k-mer set group builder to support appending to existing groups, and improves command-line option handling with a new global options registration system. Key changes: - Introduce obik command with subcommands (index, ls, summary, cp, mv, rm, super, lowmask) - Add support for per-set metadata and ID management in kmer set groups - Implement ability to append to existing kmer index groups - Refactor option parsing to use a global options registration system - Add new commands for listing, copying, moving, and removing sets - Enhance low-complexity masking with new options and output formats - Improve kmer index summary with Jaccard distance matrix support - Remove deprecated obikindex and obisuperkmer commands - Update build process to use the new subcommand structure	2026-02-10 06:49:31 +01:00
Eric Coissac	f78543ee75	Refactor k-mer index building to use disk-based KmerSetGroupBuilder Refactor k-mer index building to use the new disk-based KmerSetGroupBuilder instead of the old KmerSet and FrequencyFilter approaches. This change introduces a more efficient and scalable approach to building k-mer indices by using partitioned disk storage with streaming operations. - Replace BuildKmerIndex and BuildFrequencyFilterIndex with KmerSetGroupBuilder - Add support for frequency filtering via WithMinFrequency option - Remove deprecated k-mer set persistence methods - Update CLI to use new builder approach - Add new disk-based k-mer operations (union, intersect, difference, quorum) - Introduce KDI (K-mer Delta Index) file format for efficient storage - Add K-way merge operations for combining sorted k-mer streams - Update documentation and examples to reflect new API This refactoring provides better memory usage, faster operations on large datasets, and more flexible k-mer set operations.	2026-02-10 06:49:31 +01:00
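A k-way merge over sorted k-mer streams, here used to compute a union, might look like the sketch below. It merges in-memory slices with `container/heap`; the names `cursor`, `cursorHeap` and `mergeUnion` are hypothetical, and the repository's implementation presumably streams from partitioned files instead.

```go
package main

import "container/heap"

// cursor (hypothetical) tracks the read position in one sorted stream.
type cursor struct {
	stream []uint64
	pos    int
}

// cursorHeap is a min-heap of cursors ordered by their current value.
type cursorHeap []*cursor

func (h cursorHeap) Len() int { return len(h) }
func (h cursorHeap) Less(i, j int) bool {
	return h[i].stream[h[i].pos] < h[j].stream[h[j].pos]
}
func (h cursorHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *cursorHeap) Push(x any)   { *h = append(*h, x.(*cursor)) }
func (h *cursorHeap) Pop() any {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// mergeUnion merges any number of sorted streams into one sorted,
// duplicate-free list.
func mergeUnion(streams ...[]uint64) []uint64 {
	h := &cursorHeap{}
	for _, s := range streams {
		if len(s) > 0 {
			*h = append(*h, &cursor{stream: s})
		}
	}
	heap.Init(h)
	var out []uint64
	for h.Len() > 0 {
		c := (*h)[0] // cursor holding the smallest current value
		v := c.stream[c.pos]
		if len(out) == 0 || out[len(out)-1] != v {
			out = append(out, v)
		}
		c.pos++
		if c.pos < len(c.stream) {
			heap.Fix(h, 0) // value at the root changed, restore heap order
		} else {
			heap.Pop(h) // stream exhausted
		}
	}
	return out
}
```

Intersection, difference, and quorum follow the same pattern by counting how many streams contribute each value before deciding whether to emit it.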
Eric Coissac	a016ad5b8a	Refactor kmer index to disk-based partitioning with minimizer Refactor kmer index package to use disk-based partitioning with minimizer - Replace roaring64 bitmaps with disk-based kmer index - Implement partitioned kmer sets with delta-varint encoding - Add support for frequency filtering during construction - Introduce new builder pattern for index construction - Add streaming operations for set operations (union, intersect, etc.) - Add support for super-kmer encoding during construction - Update command line tool to use new index format - Remove dependency on roaring bitmap library This change introduces a new architecture for kmer indexing that is more memory efficient and scalable for large datasets.	2026-02-09 17:52:37 +01:00
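Delta-varint encoding of a sorted k-mer list can be sketched with the standard `encoding/binary` package. The function names are hypothetical and the byte layout shown here (first value stored as a delta from zero, no header) is an assumption for illustration, not the actual on-disk format.

```go
package main

import "encoding/binary"

// encodeDeltas (hypothetical) stores each value of a sorted list as the
// unsigned varint of its difference from the previous value.
func encodeDeltas(sorted []uint64) []byte {
	buf := make([]byte, 0, len(sorted)*2)
	var tmp [binary.MaxVarintLen64]byte
	prev := uint64(0)
	for _, v := range sorted {
		n := binary.PutUvarint(tmp[:], v-prev)
		buf = append(buf, tmp[:n]...)
		prev = v
	}
	return buf
}

// decodeDeltas reverses encodeDeltas by accumulating the decoded deltas.
func decodeDeltas(buf []byte) []uint64 {
	var out []uint64
	prev := uint64(0)
	for len(buf) > 0 {
		d, n := binary.Uvarint(buf)
		if n <= 0 {
			break // truncated or malformed input
		}
		prev += d
		out = append(out, prev)
		buf = buf[n:]
	}
	return out
}
```

Since neighboring values in a sorted k-mer set tend to be close, most deltas fit in one or two bytes instead of the eight bytes of a raw `uint64`.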
coissac	09d437d10f	Merge pull request #83 from metabarcoding/push-xssnppvunmlq Push xssnppvunmlq Release_4.4.12	2026-02-09 09:58:06 +01:00
Eric Coissac	d00ab6f83a	Bump version from 4.4.11 to 4.4.12 Update version number in version.txt from 4.4.11 to 4.4.12	2026-02-09 09:46:12 +01:00
Eric Coissac	8037860518	Update version and improve release note generation Update version from 4.4.11 to 4.4.12 - Bump version in version.go - Enhance release note generation in Makefile to use JSON output from orla and fallback to raw output if JSON parsing fails - Improve test script to verify minimum super k-mer length is >= k (default k=31)	2026-02-09 09:46:10 +01:00
coissac	43d6cbe56a	Merge pull request #82 from metabarcoding/push-vkprtnlyxmkl Push vkprtnlyxmkl	2026-02-09 09:16:20 +01:00
Eric Coissac	6dadee9371	Bump version to 4.4.12 Update version from 4.4.11 to 4.4.12 in version.txt and pkg/obioptions/version.go	2026-02-09 09:05:49 +01:00
Eric Coissac	99a8e69d10	Optimize low-complexity masking algorithm This commit optimizes the low-complexity masking algorithm by: 1. Precomputing logarithm values and normalization tables to avoid repeated calculations 2. Replacing the MinMultiset-based sliding minimum with a more efficient deque-based implementation 3. Improving entropy calculation by using precomputed n*log(n) values 4. Simplifying the circular normalization process with precomputed tables 5. Removing unused imports and log statements The changes significantly improve performance while maintaining the same masking behavior.	2026-02-09 09:05:46 +01:00
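The deque-based sliding minimum mentioned in point 2 is a standard technique and can be sketched as below. The function name and the use of `float64` values are assumptions for this example; the masking code may operate on a different element type.

```go
package main

// slidingMin (hypothetical) returns the minimum of every window of width w
// over values, in amortized O(1) per element.
func slidingMin(values []float64, w int) []float64 {
	if w <= 0 || len(values) < w {
		return nil
	}
	out := make([]float64, 0, len(values)-w+1)
	// deque holds indices whose values are strictly increasing from front
	// to back; the front is always the minimum of the current window.
	deque := make([]int, 0, w)
	for i, v := range values {
		// Drop candidates that can never be a minimum again.
		for len(deque) > 0 && values[deque[len(deque)-1]] >= v {
			deque = deque[:len(deque)-1]
		}
		deque = append(deque, i)
		// Drop the front once it has slid out of the window.
		if deque[0] <= i-w {
			deque = deque[1:]
		}
		if i >= w-1 {
			out = append(out, values[deque[0]])
		}
	}
	return out
}
```

Each index is pushed and popped at most once, so the total work is linear in the input length, whereas a multiset-based minimum pays a logarithmic cost per update.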
Eric Coissac	c0ae49ef92	Add obilowmask_ref to .gitignore Add the obilowmask_ref file to .gitignore so that Git does not track it.	2026-02-08 19:31:12 +01:00
Eric Coissac	08490420a2	Fix whitespace in test script and add merge consistency tests This commit fixes minor whitespace issues in the test script and adds new tests to ensure merge attribute consistency between in-memory and on-disk paths. - Removed trailing spaces in log messages - Added tests for merge consistency between in-memory and on-disk paths - These tests catch a bug where shared classifier in on-disk dereplication path caused incorrect merged attributes	2026-02-08 18:08:29 +01:00
Eric Coissac	1a28d5ed64	Add progress bar configuration and conditional display This commit introduces a new configuration module `obidefault` to manage progress bar settings, allowing users to disable progress bars via a `--no-progressbar` option. It updates various packages to conditionally display progress bars based on this new configuration, improving user experience by providing control over progress bar output. The changes also include improvements to progress bar handling in several packages, ensuring they are only displayed when appropriate (e.g., when stderr is a terminal and stdout is not piped).	2026-02-08 16:14:02 +01:00
Eric Coissac	b2d16721f0	Fix classifier cloning and reset in chunk processing This commit fixes an issue in the chunk processing logic where the wrong classifier instance was being reset and used for code generation. A local clone of the classifier is now created and used to ensure correct behavior during dereplication.	2026-02-08 15:52:25 +01:00
Eric Coissac	7c12b1ee83	Disable progress bar when output is piped Modify CLIProgressBar function to check if stdout is a named pipe and disable the progress bar accordingly. This prevents the progress bar from being displayed when the output is redirected or piped to another command.	2026-02-08 14:48:13 +01:00
Eric Coissac	db98ddb241	Fix super k-mer minimizer bijection and add validation test This commit addresses a bug in the super k-mer implementation where the minimizer bijection property was not properly enforced. The fix ensures that: 1. All k-mers within a super k-mer share the same minimizer 2. Identical super k-mer sequences have the same minimizer The changes include: - Fixing the super k-mer iteration logic to properly validate the minimizer bijection property - Adding a comprehensive test suite (TestSuperKmerMinimizerBijection) that validates the intrinsic property of super k-mers - Updating the .gitignore file to properly track relevant files This resolves issues where the same sequence could be associated with different minimizers, violating the super k-mer definition.	2026-02-08 13:47:33 +01:00
Eric Coissac	7a979ba77f	Add obisuperkmer command implementation and tests This commit adds the implementation of the obisuperkmer command, including: - The main command in cmd/obitools/obisuperkmer/ - The package implementation in pkg/obitools/obisuperkmer/ - Automated tests in obitests/obitools/obisuperkmer/ - Documentation for the implementation and tests The obisuperkmer command extracts super k-mers from DNA sequences, following the standard OBITools architecture. It includes proper CLI option handling, validation of parameters, and integration with the OBITools pipeline system. Tests cover basic functionality, parameter validation, output format, metadata preservation, and file I/O operations.	2026-02-07 13:54:02 +01:00

721 Commits