- Added strict length matching between sequences and quality scores in the `SetQualities`, `TakeQualities`, and `Subsequence` operations; an error is now raised if the lengths do not match.
- Introduced a new `map_summaries` aggregation feature in obisummary to merge map summary data across datasets, supporting safe concurrent access and inclusion of non-empty results in the final output.
- Centralized string reversal logic via a new `inverser_chaine()` utility function, replacing duplicated inline implementations throughout the codebase.
- Implement merging logic for `map_summaries` across datasets
- Ensure proper initialization and population in multi-threaded context
- Add `map_summaries` to final output dictionary when non-empty
[obiseq] Add length validation for qualities in SetQualities, TakeQualities and Subsequence
- Panic if sequence/qualities length mismatch when setting or taking qualities in BioSequence.
- Add same check before slicing Qualities() for Subsequence to ensure consistency.
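The length check described above can be sketched as follows. This is a minimal illustration, not the obiseq code: `BioSeq` and `trySetQualities` are hypothetical names, and the panic message is an assumption.

```go
package main

import "fmt"

// BioSeq is a hypothetical stand-in for the obiseq BioSequence type.
type BioSeq struct {
	sequence  []byte
	qualities []byte
}

// SetQualities stores the quality scores and panics when their length
// differs from the sequence length.
func (s *BioSeq) SetQualities(q []byte) {
	if len(q) != len(s.sequence) {
		panic(fmt.Sprintf("sequence/qualities length mismatch: %d != %d",
			len(s.sequence), len(q)))
	}
	s.qualities = q
}

// trySetQualities reports whether SetQualities accepted the scores,
// converting the panic into a boolean for demonstration purposes.
func trySetQualities(s *BioSeq, q []byte) (ok bool) {
	defer func() {
		if recover() != nil {
			ok = false
		}
	}()
	s.SetQualities(q)
	return true
}

func main() {
	s := &BioSeq{sequence: []byte("acgt")}
	fmt.Println(trySetQualities(s, []byte{30, 30, 30, 30})) // matching lengths
	fmt.Println(trySetQualities(s, []byte{30, 30}))         // mismatched lengths
}
```

Failing at assignment time keeps a later slice of the qualities, as in a subsequence operation, from reading out of range.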
Refactor pushInterfaceToLua to delegate unsupported types (nil, bool/int/float/string/map/slice) recursively via a new lvalueFromInterface helper. Simplify the typed slice and map handlers, remove the explicit nil case (now handled by lvalueFromInterface), and eliminate redundant type switches in pushMapStringIntToLua and similar functions. Add a new luajson.go with RegisterJSON and lua.JSONEncode/Decode bindings that use lvalueFromInterface and Table2Interface for bidirectional round-trips. Include comprehensive tests covering scalars, nested structures (e.g., kmindex response), arrays, and error cases.
- **Bug fix**: Corrected logic in 4-mer calculation to properly handle sequences of length exactly three. Previously, such cases could produce invalid or unexpected results due to an incomplete guard condition (`length < 0`), which failed for `length == 3` (where the computed step size was zero). The fix ensures all sequences shorter than four bases are safely excluded.
- **Refactor**: Introduced a new internal utility function (`inverser_chaine`) to centralize string reversal logic, improving code maintainability and test coverage without affecting user-facing behavior.
- Improved concurrency safety by replacing the global HTTP client with a thread-safe, lazy-initialized instance using `sync.Once`. The new implementation enables connection pooling (`MaxIdleConnsPerHost`, connections per host) and dynamically configures pool size based on `obidefault.ParallelWorkers()`, ensuring robust behavior in multi-threaded Lua environments.
- Updated GitHub Actions workflows to the latest stable versions of `actions/setup-go` and `actions/checkout`, improving build reliability.
- Removed outdated Go dependency checksums for buger/jsonparser v1.1.x to keep the build clean and consistent.
- Replace global _httpClient variable by a sync.Once-based lazy initialization
- Add getHTTPClient() function to safely initialize client with connection pooling settings (MaxIdleConnsPerHost, MaxConnsPerHost)
- Set connection pool size based on obidefault.ParallelWorkers()
This ensures safe concurrent access and better resource management in multi-threaded Lua environments.
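A lazily initialized shared client of this kind might look like the sketch below. The `parallelWorkers` function is a hypothetical stand-in for `obidefault.ParallelWorkers()`, and the timeout and pool sizes are assumed values, not those of the actual code.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

var (
	httpClient     *http.Client
	httpClientOnce sync.Once
)

// parallelWorkers is a hypothetical stand-in for obidefault.ParallelWorkers().
func parallelWorkers() int { return 8 }

// getHTTPClient returns the shared client, building it on first use only.
// sync.Once guarantees the initializer runs exactly once, even when many
// goroutines call getHTTPClient concurrently.
func getHTTPClient() *http.Client {
	httpClientOnce.Do(func() {
		n := parallelWorkers()
		httpClient = &http.Client{
			Timeout: 30 * time.Second, // assumed value
			Transport: &http.Transport{
				MaxIdleConns:        n * 2, // assumed sizing
				MaxIdleConnsPerHost: n,
				MaxConnsPerHost:     n,
			},
		}
	})
	return httpClient
}

func main() {
	var wg sync.WaitGroup
	clients := make([]*http.Client, 4)
	for i := range clients {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			clients[i] = getHTTPClient()
		}(i)
	}
	wg.Wait()
	// Every goroutine received the same client instance.
	fmt.Println(clients[0] == clients[3])
}
```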
Registers the http module in the Lua state as a global, aligning with obicontext and BioSequence conventions.
The change ensures consistent module exposure across Lua environments.
- Update obioptions/version.go and version.txt from Release v4.5 to 68302a1
- Increment patch version: from `Release v4.5` → 68302a1
- Align version.txt with current release tag
In the 4mer calculation:
length := slength - 3
- for sequences with <4 bases, length is <=0
The stop check only caught length < 0, that is, sequences of 2 bases or fewer, leaving sequences of exactly 3 bases unguarded:
if length < 0 {
return nil
}
This release introduces dynamic batch flushing in the Distribute component, replacing the previous fixed-size batching with a memory- and count-aware strategy. Batches now flush automatically when either the maximum sequence count (BatchSizeMax()) or memory threshold (BatchMem()) per key is reached, ensuring more efficient resource usage and consistent behavior with the RebatchBySize strategy. The optional sizes parameter has been removed, and related code—including the Lua wrapper and worker buffer handling—has been updated for correctness and simplicity. Unused BatchSize() references have been eliminated from obidistribute.
Additionally, this release includes improvements to static Linux builds and overall build stability, enhancing reliability across deployment environments.
Replace the old fixed batch-size mechanism in Distribute with a dynamic strategy that flushes batches when either BatchSizeMax() sequences or BatchMem() bytes are reached per key. This aligns with the RebatchBySize strategy and removes the optional sizes parameter. Also update related code: simplify Lua wrapper to accept optional capacity, and fix buffer growth logic in worker.go using slices.Grow correctly. Remove unused BatchSize() usage from obidistribute.
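The per-key flushing rule can be illustrated with the sketch below. The constants stand in for `BatchSizeMax()` and `BatchMem()`, sequence size is approximated by string length, and checking the limits after each append is an assumption of this sketch rather than a description of the Distribute code.

```go
package main

import "fmt"

// Hypothetical limits standing in for BatchSizeMax() and BatchMem().
const (
	maxCount = 3
	maxBytes = 10
)

// pending accumulates the sequences waiting to be flushed for one key.
type pending struct {
	seqs  []string
	bytes int
}

// distribute groups sequences by key and flushes a key's batch as soon as
// either its sequence count or its estimated size reaches the limit.
func distribute(items [][2]string, emit func(key string, batch []string)) {
	buffers := make(map[string]*pending)
	for _, it := range items {
		key, seq := it[0], it[1]
		p := buffers[key]
		if p == nil {
			p = &pending{}
			buffers[key] = p
		}
		p.seqs = append(p.seqs, seq)
		p.bytes += len(seq)
		if len(p.seqs) >= maxCount || p.bytes >= maxBytes {
			emit(key, p.seqs)
			buffers[key] = &pending{} // start a fresh batch for this key
		}
	}
	for key, p := range buffers { // flush whatever remains at the end
		if len(p.seqs) > 0 {
			emit(key, p.seqs)
		}
	}
}

func main() {
	items := [][2]string{
		{"a", "aaaaaa"}, {"a", "cccccc"}, // flushed on the byte limit
		{"b", "a"}, {"b", "c"}, {"b", "g"}, // flushed on the count limit
		{"b", "t"}, // flushed as a remainder
	}
	distribute(items, func(key string, batch []string) {
		fmt.Println(key, len(batch))
	})
}
```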
This release focuses on improving build reliability, memory efficiency for large datasets, and portability of Linux binaries.
### Static Linux Binaries
- Linux binaries are now built with static linking using musl, eliminating external runtime dependencies and ensuring portability across distributions.
### Memory-Aware Batching
- Users can now control memory usage during processing with the new `--batch-mem` option, specifying limits such as 128K, 64M, or 1G.
- Batching logic now respects both size and memory constraints: batches are flushed when either threshold is exceeded.
- Conservative memory estimation for sequences helps avoid over-allocation, and explicit garbage collection after large batch discards reduces memory spikes.
### Build System Improvements
- Upgraded to Go 1.26 for improved performance and toolchain stability.
- Fixed cross-compilation issues by replacing generic include paths with architecture-specific ones (x86_64-linux-gnu and aarch64-linux-gnu).
- Streamlined macOS builds by removing special flags, using standard `make` targets.
- Enhanced error reporting during build failures: logs are now shown before cleanup and exit.
- Updated install script to correctly configure GOROOT, GOPATH, and GOTOOLCHAIN, with visual progress feedback for downloads.
All batching behavior is non-breaking and maintains backward compatibility while offering more predictable resource usage on large datasets.
Update version to 4.4.27 in version.txt and pkg/obioptions/version.go.
Add zlib-static package to release workflow to ensure static linking of zlib, resolving potential runtime dependency issues with the external link mode.
- Upgrade Go version from 1.23 to 1.26 in release.yml
- Remove CGO_CFLAGS from cross-compilation matrix entries
- Replace Linux build tools installation with Docker-based static build using golang:1.26-alpine
- Simplify macOS build to use standard make without special flags
- Increment version to 4.4.26
Update version to 4.4.25 in version.txt and pkg/obioptions/version.go.
Fix CGO_CFLAGS in release.yml by replacing generic '-I/usr/include' with architecture-specific paths (x86_64-linux-gnu and aarch64-linux-gnu) to ensure correct header inclusion during cross-compilation on Linux.
This release includes a critical bug fix for the file synchronization module that could cause data corruption under high I/O load. Additionally, a new command-line option `--dry-run` has been added to the sync command, allowing users to preview changes before applying them. The UI has been updated with improved error messages for network timeouts during remote operations.
- Update version from 4.4.22 to 4.4.23 in version.txt and pkg/obioptions/version.go
- Add zlib1g-dev dependency to Linux release workflow for potential linking requirements
- Improve tag creation in Makefile by resolving commit hash with `jj log` for better CI/CD integration
### Memory-Aware Batching
- Replaced single batch size limits with configurable min/max bounds and memory limits for more precise control over resource usage.
- Added `--batch-mem` CLI option to enable adaptive batching based on estimated sequence memory footprint (e.g., 128K, 64M, 1G).
- Introduced `RebatchBySize()` with explicit support for both byte and count limits, flushing when either threshold is exceeded.
- Implemented conservative memory estimation via `BioSequence.MemorySize()` and enhanced garbage collection to trigger explicit cleanup after large batch discards.
- Updated internal batching logic across `batchiterator.go`, `fragment.go`, and `obirefidx.go` to consistently use default memory (128 MB) and size (min: 1, max: 2000) bounds.
### Linux Build Enhancements
- Enabled static linking for Linux binaries using musl, producing portable, self-contained executables without external dependencies.
### Notes
- This release consolidates and improves batching behavior introduced in 4.4.20, with no breaking changes to the public API.
- All user-facing batching behavior is now governed by consistent memory and count constraints, improving predictability and stability during large dataset processing.
Replace calls to Rebatch(size) with RebatchBySize(obidefault.BatchMem(), obidefault.BatchSizeMax()) in batchiterator.go, fragment.go, and obirefidx.go to ensure consistent use of default memory and size limits for batch rebatching.
Introduce separate _BatchSize (min) and _BatchSizeMax (max) constants to replace the single _BatchSize variable. Update RebatchBySize to accept both maxBytes and maxCount parameters, flushing when either limit is exceeded. Set default batch size min to 1, max to 2000, and memory limit to 128 MB. Update CLI options and sequence_reader.go accordingly.
Implement memory-aware batch sizing with --batch-mem CLI option, enabling adaptive batching based on estimated sequence memory footprint. Key changes:
- Added _BatchMem and related getters/setters in pkg/obidefault
- Implemented RebatchBySize() in pkg/obiter for memory-constrained batching
- Added BioSequence.MemorySize() for conservative memory estimation
- Integrated batch-mem option in pkg/obioptions with human-readable size parsing (e.g., 128K, 64M, 1G)
- Added obiutils.ParseMemSize/FormatMemSize for unit conversion
- Enhanced pool GC in pkg/obiseq/pool.go to trigger explicit GC for large slice discards
- Updated sequence_reader.go to apply memory-based rebatching when enabled
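Human-readable size parsing of the kind accepted by `--batch-mem` might be sketched as follows. This is not the `obiutils.ParseMemSize` implementation: the function name `parseMemSize`, the use of binary multiples (1K = 1024 bytes), and the error messages are all assumptions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMemSize converts strings such as "128K", "64M" or "1G" into a
// number of bytes. A value without suffix is read as plain bytes.
func parseMemSize(s string) (int64, error) {
	s = strings.ToUpper(strings.TrimSpace(s))
	if s == "" {
		return 0, fmt.Errorf("empty memory size")
	}
	mult := int64(1)
	switch s[len(s)-1] {
	case 'K':
		mult = 1 << 10
	case 'M':
		mult = 1 << 20
	case 'G':
		mult = 1 << 30
	}
	if mult != 1 {
		s = s[:len(s)-1] // drop the unit suffix
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil || n < 0 {
		return 0, fmt.Errorf("invalid memory size %q", s)
	}
	return n * mult, nil
}

func main() {
	for _, arg := range []string{"128K", "64M", "1G"} {
		n, err := parseMemSize(arg)
		fmt.Println(arg, n, err)
	}
}
```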
This release introduces significant improvements to build reliability and performance, alongside key parsing enhancements for sequence data.
### Build & Installation Improvements
- Added support for parallel compilation via `-j/--jobs` option in both the Makefile and install script, enabling faster builds on multi-core systems. The default remains single-threaded for safety.
- Enhanced Makefile with `.DEFAULT_GOAL := all` for consistent behavior and a documented `help` target.
- Replaced fragile file operations with robust error handling, clear diagnostics, and automatic preservation of the build directory on copy failures to aid recovery.
### Rope-Based Parsing Enhancements (from 4.4.20)
- Introduced direct rope-based parsers for FASTA, EMBL, and FASTQ formats, improving memory efficiency for large files.
- Added U→T conversion support during sequence extraction and more reliable line ending detection.
- Unified rope scanning logic under a new `ropeScanner` for better maintainability.
- Added `TakeQualities()` method to BioSequence for more efficient handling of quality data.
### Bug Fixes (from 4.4.20)
- Fixed `CompressStream` to correctly respect the `compressed` variable.
- Replaced ambiguous string splitting utilities with precise left/right split variants (`LeftSplitInTwo`, `RightSplitInTwo`).
### Release Tooling (from 4.4.20)
- Streamlined release process with modular targets (`jjpush-notes`, `jjpush-push`, `jjpush-tag`) and AI-assisted note generation via `aichat`.
- Improved versioning support via the `VERSION` environment variable in `bump-version`.
- Switched PR submission from raw `jj git push` to `stakk` for consistency and reliability.
Note: This release incorporates key enhancements from 4.4.20 that impact end users, while focusing on build robustness and performance gains.
### Enhancements
- **Rope-based parsing**: Added direct rope parsing for FASTA, EMBL, and FASTQ formats via `FastaChunkParserRope`, `EmblChunkParserRope`, and `FastqChunkParserRope`. Sequence extraction now supports U→T conversion and improved line ending detection.
- **Rope scanner refactoring**: Unified rope scanning logic under a new `ropeScanner`, improving maintainability and consistency.
- **Sequence handling**: Added `TakeQualities()` method to BioSequence for more efficient quality data handling.
### Bug Fixes
- **Compression behavior**: Fixed `CompressStream` to correctly use the `compressed` variable instead of a hardcoded boolean.
- **String splitting**: Replaced ambiguous `SplitInTwo` calls with precise `LeftSplitInTwo` or `RightSplitInTwo`, and added dedicated right-split utility.
### Tooling & Workflow Improvements
- **Makefile enhancements**: Added colored terminal output, a `help` target for documenting all targets, and improved release workflow automation.
- **Release process**: Refactored `jjpush` into modular targets (`jjpush-notes`, `jjpush-push`, `jjpush-tag`), replaced `orla` with `aichat` for AI-assisted release notes, and introduced robust JSON parsing using Python. Release notes are now generated and stored in temp files for tag creation.
- **Versioning**: `bump-version` now supports the VERSION environment variable for manual version setting.
- **Submission**: Switched from raw `jj git push` to `stakk` for PR submission.
### Internal Notes
- Installation instructions are now included in release tags.
- Fixed-size carry buffer replaced with dynamic slice for arbitrarily long line support without extra allocations.
Replace SplitInTwo calls with LeftSplitInTwo or RightSplitInTwo depending on the intended split direction. In fastseq_json_header.go, extract rank from suffix without splitting; in biosequenceslice.go and taxid.go, use LeftSplitInTwo to split from the left; add RightSplitInTwo utility function for splitting from the right.
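The difference between the two split directions can be shown with a small sketch. The lowercase function names, their string-separator signatures, and the sample input are hypothetical and may not match the obiutils helpers.

```go
package main

import (
	"fmt"
	"strings"
)

// leftSplitInTwo splits s at the first occurrence of sep.
// When sep is absent, it returns s and an empty string.
func leftSplitInTwo(s, sep string) (string, string) {
	left, right, _ := strings.Cut(s, sep)
	return left, right
}

// rightSplitInTwo splits s at the last occurrence of sep.
// When sep is absent, it returns s and an empty string.
func rightSplitInTwo(s, sep string) (string, string) {
	i := strings.LastIndex(s, sep)
	if i < 0 {
		return s, ""
	}
	return s[:i], s[i+len(sep):]
}

func main() {
	const id = "taxon:9606:species" // made-up example value
	fmt.Println(leftSplitInTwo(id, ":"))
	fmt.Println(rightSplitInTwo(id, ":"))
}
```

With a separator that occurs more than once, the two functions return different pairs, which is why each call site has to state the direction it intends.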
Replace the fixed [256]byte carry buffer with a dynamic []byte slice to support arbitrarily long lines without heap allocation during accumulation. Update all carry buffer handling logic to use len(s.carry) and append instead of fixed-size copy operations.
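A carry buffer based on `append` can be sketched as below. The `lineScanner` type and its `feed` method are hypothetical and much simpler than the rope scanner; the sketch only shows how a partial line of any length is carried across chunk boundaries.

```go
package main

import (
	"bytes"
	"fmt"
)

// lineScanner yields complete lines from data that arrives in chunks.
type lineScanner struct {
	carry []byte // partial line left over from previous chunks
}

// feed consumes one chunk and calls emit for each complete line found.
// The slice passed to emit is only valid during the call.
func (s *lineScanner) feed(chunk []byte, emit func(line []byte)) {
	for {
		i := bytes.IndexByte(chunk, '\n')
		if i < 0 {
			// No line end yet: keep the fragment, growing as needed.
			s.carry = append(s.carry, chunk...)
			return
		}
		if len(s.carry) > 0 {
			s.carry = append(s.carry, chunk[:i]...)
			emit(s.carry)
			s.carry = s.carry[:0] // keep the capacity for reuse
		} else {
			emit(chunk[:i])
		}
		chunk = chunk[i+1:]
	}
}

func main() {
	var s lineScanner
	for _, chunk := range []string{"abc", "def\ngh", "i\n"} {
		s.feed([]byte(chunk), func(line []byte) {
			fmt.Println(string(line))
		})
	}
}
```

Resetting the slice with `s.carry[:0]` keeps its backing array, so once the buffer has grown to the longest line seen, later lines reuse it.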
Introduce EmblChunkParserRope function to parse EMBL chunks directly from a rope without using Pack(). Add extractEmblSeq helper to scan sequence sections and handle U to T conversion. Update parser logic to use rope-based parsing when available, and fix feature table handling for WGS entries.
This commit refactors the rope scanner implementation by renaming gbRopeScanner to ropeScanner and extracting the common functionality into a new file. It also introduces a new FastqChunkParserRope function that parses FASTQ chunks directly from a rope without Pack(), enabling more efficient memory usage. The existing parsers are updated to use the new rope-based parser when available. The BioSequence type is enhanced with a TakeQualities method for more efficient quality data handling.
Introduce FastaChunkParserRope for direct rope-based FASTA parsing, enhance sequence extraction with whitespace skipping and U->T conversion, and update parser logic to support both rope and raw data sources.
- Added extractFastaSeq function to scan sequence bytes directly from rope
- Implemented FastaChunkParserRope for rope-based parsing
- Modified _ParseFastaFile to use rope when available
- Updated sequence handling to support U->T conversion
- Fixed line ending detection for FASTA parsing
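Sequence extraction with whitespace skipping and U->T conversion might look like the following sketch. A slice of byte chunks stands in for the rope, `extractSeq` is a hypothetical name, and preserving letter case is an assumption.

```go
package main

import "fmt"

// extractSeq appends the sequence bytes found in each chunk to dst,
// skipping whitespace (including both LF and CRLF line endings) and
// rewriting U/u as T/t.
func extractSeq(dst []byte, chunks [][]byte) []byte {
	for _, chunk := range chunks {
		for _, b := range chunk {
			switch b {
			case ' ', '\t', '\r', '\n':
				continue // skip whitespace and line endings
			case 'U':
				b = 'T'
			case 'u':
				b = 't'
			}
			dst = append(dst, b)
		}
	}
	return dst
}

func main() {
	chunks := [][]byte{[]byte("ACGU\r\n"), []byte("uacg\n")}
	fmt.Println(string(extractSeq(nil, chunks)))
}
```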
Replace NewBioSequence with NewBioSequenceOwning in genbank_read.go to take ownership of sequence slices without copying, improving performance. Update biosequence.go to add the new TakeSequence method and NewBioSequenceOwning constructor.
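The trade-off between a copying and an owning constructor can be sketched as follows. The `bioSeq` type and both constructors are hypothetical simplifications of `NewBioSequence` and `NewBioSequenceOwning`.

```go
package main

import "fmt"

// bioSeq is a hypothetical, reduced stand-in for BioSequence.
type bioSeq struct{ sequence []byte }

// newCopying duplicates the input, so the caller may keep using its slice.
func newCopying(data []byte) *bioSeq {
	return &bioSeq{sequence: append([]byte(nil), data...)}
}

// newOwning keeps the caller's slice without copying. The caller must not
// modify or reuse the slice afterwards.
func newOwning(data []byte) *bioSeq {
	return &bioSeq{sequence: data}
}

func main() {
	buf := []byte("acgt")
	copied, owned := newCopying(buf), newOwning(buf)
	buf[0] = 'n' // breaks the ownership contract, to show the sharing
	fmt.Println(string(copied.sequence), string(owned.sequence))
}
```

The owning variant saves one allocation and one copy per sequence, at the cost of a contract the compiler does not enforce.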
Add a rope-based GenBank parser to reduce memory usage (RSS) and heap allocations.
- Add `gbRopeScanner` to read lines without heap allocation
- Implement `GenbankChunkParserRope`, which uses the rope instead of `Pack()`
- Modify `_ParseGenbankFile` and `ReadGenbank` to use the new parser
- Expected RSS reduction from 57 GB to ~128 MB × workers
- Keep the old parser for compatibility and tests
Significant reduction in allocations (~50M) and sys time, with comparable or better user time.
Optimize the parsing of large sequences by avoiding unnecessary memory allocation when merging chunks. Add support for parsing the rope structure directly, which reduces allocations and improves performance when processing multi-Gbp GenBank/EMBL and FASTA/FASTQ files. The parsers are updated to use the unpacked rope and the new in-place write mechanism for GenBank sequences.