This release introduces dynamic batch flushing in the Distribute component, replacing the previous fixed-size batching with a memory- and count-aware strategy. Batches now flush automatically when either the maximum sequence count (BatchSizeMax()) or memory threshold (BatchMem()) per key is reached, ensuring more efficient resource usage and consistent behavior with the RebatchBySize strategy. The optional sizes parameter has been removed, and related code—including the Lua wrapper and worker buffer handling—has been updated for correctness and simplicity. Unused BatchSize() references have been eliminated from obidistribute.
Additionally, this release includes improvements to static Linux builds and overall build stability, enhancing reliability across deployment environments.
This release focuses on improving build reliability, memory efficiency for large datasets, and portability of Linux binaries.
### Static Linux Binaries
- Linux binaries are now built with static linking using musl, eliminating external runtime dependencies and ensuring portability across distributions.
### Memory-Aware Batching
- Users can now control memory usage during processing with the new `--batch-mem` option, specifying limits such as 128K, 64M, or 1G.
- Batching logic now respects both size and memory constraints: batches are flushed when either threshold is exceeded.
- Conservative memory estimation for sequences helps avoid over-allocation, and explicit garbage collection after large batch discards reduces memory spikes.
### Build System Improvements
- Upgraded to Go 1.26 for improved performance and toolchain stability.
- Fixed cross-compilation issues by replacing generic include paths with architecture-specific ones (x86_64-linux-gnu and aarch64-linux-gnu).
- Streamlined macOS builds by removing special flags, using standard `make` targets.
- Enhanced error reporting during build failures: logs are now shown before cleanup and exit.
- Updated install script to correctly configure GOROOT, GOPATH, and GOTOOLCHAIN, with visual progress feedback for downloads.
All batching behavior is non-breaking and maintains backward compatibility while offering more predictable resource usage on large datasets.
Update version to 4.4.27 in version.txt and pkg/obioptions/version.go.
Add zlib-static package to release workflow to ensure static linking of zlib, resolving potential runtime dependency issues with the external link mode.
- Upgrade Go version from 1.23 to 1.26 in release.yml
- Remove CGO_CFLAGS from cross-compilation matrix entries
- Replace Linux build tools installation with Docker-based static build using golang:1.26-alpine
- Simplify macOS build to use standard make without special flags
- Increment version to 4.4.26
Update version to 4.4.25 in version.txt and pkg/obioptions/version.go.
Fix CGO_CFLAGS in release.yml by replacing generic '-I/usr/include' with architecture-specific paths (x86_64-linux-gnu and aarch64-linux-gnu) to ensure correct header inclusion during cross-compilation on Linux.
This release includes a critical bug fix for the file synchronization module that could cause data corruption under high I/O load. Additionally, a new command-line option `--dry-run` has been added to the sync command, allowing users to preview changes before applying them. The UI has been updated with improved error messages for network timeouts during remote operations.
- Update version from 4.4.22 to 4.4.23 in version.txt and pkg/obioptions/version.go
- Add zlib1g-dev dependency to Linux release workflow for potential linking requirements
- Improve tag creation in Makefile by resolving commit hash with `jj log` for better CI/CD integration
### Memory-Aware Batching
- Replaced single batch size limits with configurable min/max bounds and memory limits for more precise control over resource usage.
- Added `--batch-mem` CLI option to enable adaptive batching based on estimated sequence memory footprint (e.g., 128K, 64M, 1G).
- Introduced `RebatchBySize()` with explicit support for both byte and count limits, flushing when either threshold is exceeded.
- Implemented conservative memory estimation via `BioSequence.MemorySize()` and enhanced garbage collection to trigger explicit cleanup after large batch discards.
- Updated internal batching logic across `batchiterator.go`, `fragment.go`, and `obirefidx.go` to consistently use default memory (128 MB) and size (min: 1, max: 2000) bounds.
### Linux Build Enhancements
- Enabled static linking for Linux binaries using musl, producing portable, self-contained executables without external dependencies.
### Notes
- This release consolidates and improves batching behavior introduced in 4.4.20, with no breaking changes to the public API.
- All user-facing batching behavior is now governed by consistent memory and count constraints, improving predictability and stability during large dataset processing.
This release introduces significant improvements to build reliability and performance, alongside key parsing enhancements for sequence data.
### Build & Installation Improvements
- Added support for parallel compilation via `-j/--jobs` option in both the Makefile and install script, enabling faster builds on multi-core systems. The default remains single-threaded for safety.
- Enhanced Makefile with `.DEFAULT_GOAL := all` for consistent behavior and a documented `help` target.
- Replaced fragile file operations with robust error handling, clear diagnostics, and automatic preservation of the build directory on copy failures to aid recovery.
### Rope-Based Parsing Enhancements (from 4.4.20)
- Introduced direct rope-based parsers for FASTA, EMBL, and FASTQ formats, improving memory efficiency for large files.
- Added U→T conversion support during sequence extraction and more reliable line ending detection.
- Unified rope scanning logic under a new `ropeScanner` for better maintainability.
- Added `TakeQualities()` method to BioSequence for more efficient handling of quality data.
### Bug Fixes (from 4.4.20)
- Fixed `CompressStream` to correctly respect the `compressed` variable.
- Replaced ambiguous string splitting utilities with precise left/right split variants (`LeftSplitInTwo`, `RightSplitInTwo`).
### Release Tooling (from 4.4.20)
- Streamlined release process with modular targets (`jjpush-notes`, `jjpush-push`, `jjpush-tag`) and AI-assisted note generation via `aichat`.
- Improved versioning support via the `VERSION` environment variable in `bump-version`.
- Switched PR submission from raw `jj git push` to `stakk` for consistency and reliability.
Note: This release incorporates key enhancements from 4.4.20 that impact end users, while focusing on build robustness and performance gains.
### Enhancements
- **Rope-based parsing**: Added direct rope parsing for FASTA, EMBL, and FASTQ formats via `FastaChunkParserRope`, `EmblChunkParserRope`, and `FastqChunkParserRope`. Sequence extraction now supports U→T conversion and improved line ending detection.
- **Rope scanner refactoring**: Unified rope scanning logic under a new `ropeScanner`, improving maintainability and consistency.
- **Sequence handling**: Added `TakeQualities()` method to BioSequence for more efficient quality data handling.
### Bug Fixes
- **Compression behavior**: Fixed `CompressStream` to correctly use the `compressed` variable instead of a hardcoded boolean.
- **String splitting**: Replaced ambiguous `SplitInTwo` calls with precise `LeftSplitInTwo` or `RightSplitInTwo`, and added dedicated right-split utility.
### Tooling & Workflow Improvements
- **Makefile enhancements**: Added colored terminal output, a `help` target for documenting all targets, and improved release workflow automation.
- **Release process**: Refactored `jjpush` into modular targets (`jjpush-notes`, `jjpush-push`, `jjpush-tag`), replaced `orla` with `aichat` for AI-assisted release notes, and introduced robust JSON parsing using Python. Release notes are now generated and stored in temp files for tag creation.
- **Versioning**: `bump-version` now supports the VERSION environment variable for manual version setting.
- **Submission**: Switched from raw `jj git push` to `stakk` for PR submission.
### Internal Notes
- Installation instructions are now included in release tags.
- Fixed-size carry buffer replaced with dynamic slice for arbitrarily long line support without extra allocations.
Add Jaccard distance and similarity computations for KmerSet and KmerSetGroup
This commit introduces Jaccard distance and similarity methods for KmerSet and KmerSetGroup.
For KmerSet:
- Added JaccardDistance method to compute the Jaccard distance between two KmerSets
- Added JaccardSimilarity method to compute the Jaccard similarity between two KmerSets
For KmerSetGroup:
- Added JaccardDistanceMatrix method to compute a pairwise Jaccard distance matrix
- Added JaccardSimilarityMatrix method to compute a pairwise Jaccard similarity matrix
Also includes:
- New DistMatrix implementation in pkg/obidist for storing and computing distance/similarity matrices
- Updated version handling with bump-version target in Makefile
- Added tests for all new methods
This commit refactors the k-mer normalization functions, renaming them from 'NormalizeKmer' to 'CanonicalKmer' to better reflect their purpose of returning canonical k-mers. It also introduces new quorum operations (AtLeast, AtMost, Exactly) for k-mer set groups, along with comprehensive tests and benchmarks. The version commit hash has also been updated.
Ajout de la fonctionnalité de sauvegarde et de chargement pour FrequencyFilter en utilisant le KmerSetGroup sous-jacent.
- Nouvelle méthode Save() pour enregistrer le filtre dans un répertoire avec formatage des métadonnées
- Nouvelle méthode LoadFrequencyFilter() pour charger un filtre depuis un répertoire
- Initialisation des métadonnées lors de la création du filtre
- Optimisation des méthodes Union() et Intersect() du KmerSetGroup
- Mise à jour du commit hash
This commit refactors all k-mer encoding and normalization functions to consistently use 'canonical' instead of 'normalized' terminology. This includes renaming functions like EncodeNormalizedKmer to EncodeCanonicalKmer, IterNormalizedKmers to IterCanonicalKmers, and NormalizeKmer to CanonicalKmer. The change aligns the API with biological conventions where 'canonical' refers to the lexicographically smallest representation of a k-mer and its reverse complement. All related documentation and examples have been updated accordingly. The commit also updates the version file with a new commit hash.
This commit introduces new functions for encoding and decoding k-mers, including support for normalized k-mers. It also updates the frequency filter and k-mer set implementations to use the new encoding functions, providing zero-allocation encoding for better performance. The commit hash has been updated to reflect the latest changes.
This commit refactors the KmerSet and related structures to use an immutable K parameter and introduces consistent Copy methods instead of Clone. It also adds attribute API support for KmerSet and KmerSetGroup, and updates persistence logic to handle IDs and metadata correctly.
Cette modification ajoute la capacité de stocker et de persister des métadonnées utilisateur dans les structures KmerSet et KmerSetGroup. Les changements incluent l'ajout d'un champ Metadata dans KmerSet et KmerSetGroup, ainsi que la mise à jour des méthodes de clonage et de persistance pour gérer ces métadonnées. Cela permet de conserver des informations supplémentaires liées aux ensembles de k-mers tout en maintenant la compatibilité avec les opérations existantes.
This commit adds support for saving and loading KmerSet and KmerSetGroup structures using TOML, YAML, and JSON formats for metadata. It includes:
- Added github.com/pelletier/go-toml/v2 dependency
- Implemented Save and Load methods for KmerSet and KmerSetGroup
- Added metadata persistence with support for multiple formats (TOML, YAML, JSON)
- Added helper functions for format detection and metadata handling
- Updated version commit hash
This commit refactors the k-mer encoding logic to handle ambiguous bases more consistently and introduces a KmerSet type for better management of k-mer collections. The frequency filter now works with KmerSet instead of roaring bitmaps directly, and the API has been updated to support level-based frequency queries. Additionally, the commit updates the version and commit hash.
This commit introduces error handling for ambiguous DNA bases (N, R, Y, W, S, K, M, B, D, H, V) in k-mer encoding. It adds new functions IterNormalizedKmersWithErrors and EncodeNormalizedKmersWithErrors that track and encode the number of ambiguous bases in each k-mer using error markers in the top 2 bits. The commit also updates the version string to reflect the latest changes.
Ajout d'une fonctionnalité pour le filtrage unique qui prend en compte à la fois la séquence et les catégories.
- Modification de la fonction ISequenceChunk pour accepter un classifieur unique optionnel
- Implémentation du traitement unique sur disque en utilisant un classifieur composite
- Mise à jour du classifieur utilisé pour le tri sur disque
- Correction de la gestion des clés de unicité en utilisant le code et la valeur du classifieur
- Mise à jour du numéro de commit