Compare commits

...

18 Commits

Author SHA1 Message Date
Eric Coissac
94b0887069 Memory-aware Batching and Static Linux Builds
### Memory-Aware Batching
- Replaced single batch size limits with configurable min/max bounds and memory limits for more precise control over resource usage.
- Added `--batch-mem` CLI option to enable adaptive batching based on estimated sequence memory footprint (e.g., 128K, 64M, 1G).
- Introduced `RebatchBySize()` with explicit support for both byte and count limits, flushing when either threshold is exceeded.
- Implemented conservative memory estimation via `BioSequence.MemorySize()` and enhanced garbage collection to trigger explicit cleanup after large batch discards.
- Updated internal batching logic across `batchiterator.go`, `fragment.go`, and `obirefidx.go` to consistently use default memory (128 MB) and size (min: 1, max: 2000) bounds.

### Linux Build Enhancements
- Enabled static linking for Linux binaries using musl, producing portable, self-contained executables without external dependencies.

### Notes
- This release consolidates and improves batching behavior introduced in 4.4.20, with no breaking changes to the public API.
- All user-facing batching behavior is now governed by consistent memory and count constraints, improving predictability and stability during large dataset processing.
2026-03-13 15:16:41 +01:00
Eric Coissac
c188580aac Replace Rebatch with RebatchBySize using default batch parameters
Replace calls to Rebatch(size) with RebatchBySize(obidefault.BatchMem(), obidefault.BatchSizeMax()) in batchiterator.go, fragment.go, and obirefidx.go to ensure consistent use of default memory and size limits for batch rebatching.
2026-03-13 15:16:33 +01:00
Eric Coissac
1e1f575d1c refactor: replace single batch size with min/max bounds and memory limits
Introduce separate _BatchSize (min) and _BatchSizeMax (max) constants to replace the single _BatchSize variable. Update RebatchBySize to accept both maxBytes and maxCount parameters, flushing when either limit is exceeded. Set default batch size min to 1, max to 2000, and memory limit to 128 MB. Update CLI options and sequence_reader.go accordingly.
2026-03-13 15:07:35 +01:00
Eric Coissac
40769bf827 Add memory-based batching support
Implement memory-aware batch sizing with --batch-mem CLI option, enabling adaptive batching based on estimated sequence memory footprint. Key changes:
- Added _BatchMem and related getters/setters in pkg/obidefault
- Implemented RebatchBySize() in pkg/obiter for memory-constrained batching
- Added BioSequence.MemorySize() for conservative memory estimation
- Integrated batch-mem option in pkg/obioptions with human-readable size parsing (e.g., 128K, 64M, 1G)
- Added obiutils.ParseMemSize/FormatMemSize for unit conversion
- Enhanced pool GC in pkg/obiseq/pool.go to trigger explicit GC for large slice discards
- Updated sequence_reader.go to apply memory-based rebatching when enabled
2026-03-13 14:54:21 +01:00
Eric Coissac
74e6fcaf83 feat: add static linking for Linux builds using musl
Enable static linking for Linux binaries by installing musl-tools and passing appropriate LDFLAGS during build. This ensures portable, self-contained executables for Linux targets.
2026-03-13 14:26:31 +01:00
coissac
30ec8b1b63 Merge pull request #92 from metabarcoding/push-mvpuxnxoyypu
4.4.21: Parallel builds, robust installation, and rope-based parsing enhancements
2026-03-13 12:00:32 +01:00
Eric Coissac
cdc72c5346 4.4.21: Parallel builds, robust installation, and rope-based parsing enhancements
This release introduces significant improvements to build reliability and performance, alongside key parsing enhancements for sequence data.

### Build & Installation Improvements
- Added support for parallel compilation via `-j/--jobs` option in both the Makefile and install script, enabling faster builds on multi-core systems. The default remains single-threaded for safety.
- Enhanced Makefile with `.DEFAULT_GOAL := all` for consistent behavior and a documented `help` target.
- Replaced fragile file operations with robust error handling, clear diagnostics, and automatic preservation of the build directory on copy failures to aid recovery.

### Rope-Based Parsing Enhancements (from 4.4.20)
- Introduced direct rope-based parsers for FASTA, EMBL, and FASTQ formats, improving memory efficiency for large files.
- Added U→T conversion support during sequence extraction and more reliable line ending detection.
- Unified rope scanning logic under a new `ropeScanner` for better maintainability.
- Added `TakeQualities()` method to BioSequence for more efficient handling of quality data.

### Bug Fixes (from 4.4.20)
- Fixed `CompressStream` to correctly respect the `compressed` variable.
- Replaced ambiguous string splitting utilities with precise left/right split variants (`LeftSplitInTwo`, `RightSplitInTwo`).

### Release Tooling (from 4.4.20)
- Streamlined release process with modular targets (`jjpush-notes`, `jjpush-push`, `jjpush-tag`) and AI-assisted note generation via `aichat`.
- Improved versioning support via the `VERSION` environment variable in `bump-version`.
- Switched PR submission from raw `jj git push` to `stakk` for consistency and reliability.

Note: This release incorporates key enhancements from 4.4.20 that impact end users, while focusing on build robustness and performance gains.
2026-03-13 11:59:32 +01:00
Eric Coissac
82a9972be7 Add parallel compilation support and improve Makefile/install script robustness
- Add .DEFAULT_GOAL := all to Makefile for consistent default target
- Document -j/--jobs option in README.md to allow parallel compilation
- Add JOBS variable and -j/--jobs argument to install script (default: 1)
- Replace fragile mkdir/cp commands with robust error handling and clear diagnostics
- Add build directory preservation on copy failure for manual recovery
- Pass -j option to make during compilation to enable parallel builds
2026-03-13 11:59:20 +01:00
coissac
ff6e515b2a Merge pull request #91 from metabarcoding/push-uotrstkymowq
4.4.20: Rope-based parsing, improved release tooling, and bug fixes
2026-03-12 20:15:33 +01:00
Eric Coissac
cd0c525f50 4.4.20: Rope-based parsing, improved release tooling, and bug fixes
### Enhancements
- **Rope-based parsing**: Added direct rope parsing for FASTA, EMBL, and FASTQ formats via `FastaChunkParserRope`, `EmblChunkParserRope`, and `FastqChunkParserRope`. Sequence extraction now supports U→T conversion and improved line ending detection.
- **Rope scanner refactoring**: Unified rope scanning logic under a new `ropeScanner`, improving maintainability and consistency.
- **Sequence handling**: Added `TakeQualities()` method to BioSequence for more efficient quality data handling.

### Bug Fixes
- **Compression behavior**: Fixed `CompressStream` to correctly use the `compressed` variable instead of a hardcoded boolean.
- **String splitting**: Replaced ambiguous `SplitInTwo` calls with precise `LeftSplitInTwo` or `RightSplitInTwo`, and added dedicated right-split utility.

### Tooling & Workflow Improvements
- **Makefile enhancements**: Added colored terminal output, a `help` target for documenting all targets, and improved release workflow automation.
- **Release process**: Refactored `jjpush` into modular targets (`jjpush-notes`, `jjpush-push`, `jjpush-tag`), replaced `orla` with `aichat` for AI-assisted release notes, and introduced robust JSON parsing using Python. Release notes are now generated and stored in temp files for tag creation.
- **Versioning**: `bump-version` now supports the VERSION environment variable for manual version setting.
- **Submission**: Switched from raw `jj git push` to `stakk` for PR submission.

### Internal Notes
- Installation instructions are now included in release tags.
- Fixed-size carry buffer replaced with dynamic slice for arbitrarily long line support without extra allocations.
2026-03-12 20:14:11 +01:00
Eric Coissac
abe935aa18 Add help target, colorize output, and improve release workflow
- Add colored terminal output support (GREEN, YELLOW, BLUE, NC)
- Introduce `help` target to document all Makefile targets
- Enhance `bump-version` to accept VERSION env var for manual version setting
- Refactor jjpush: split into modular targets (jjpush-notes, jjpush-push, jjpush-tag)
- Replace orla with aichat for AI-powered release notes generation
- Add robust JSON parsing using Python for release notes extraction
- Use stakk for PR submission (replacing raw `jj git push`)
- Generate and store release notes in temp files for tag creation
- Add installation instructions to release tags
- Update .PHONY with new targets

4.4.20: Rope-based parsing, improved release tooling, and bug fixes

### Enhancements
- **Rope-based parsing**: Added direct rope parsing for FASTA, EMBL, and FASTQ formats via `FastaChunkParserRope`, `EmblChunkParserRope`, and `FastqChunkRope` functions, eliminating unnecessary memory allocation via Pack(). Sequence extraction now supports U→T conversion and improved line ending detection.
- **Rope scanner refactoring**: Unified rope scanning logic under a new `ropeScanner`, improving maintainability and consistency across parsers.
- **Sequence handling**: Added `TakeQualities()` method to BioSequence for more efficient quality data handling.

### Bug Fixes
- **Compression behavior**: Fixed CompressStream to correctly use the `compressed` variable instead of a hardcoded boolean.
- **String splitting**: Replaced ambiguous `SplitInTwo` calls with precise `LeftSplitInTwo` or `RightSplitInTwo`, and added dedicated right-split utility.

### Tooling & Workflow Improvements
- **Makefile enhancements**: Added colored terminal output, a `help` target for documenting all targets, and improved release workflow automation.
- **Release process**: Refactored `jjpush` into modular targets (`jjpush-notes`, `jjpush-push`, `jjpush-tag`), replaced `orla` with `aichat` for AI-assisted release notes, and introduced robust JSON parsing using Python. Release notes are now generated and stored in temp files for tag creation.
- **Versioning**: `bump-version` now supports the VERSION environment variable for manual version setting.
- **Submission**: Switched from raw `jj git push` to `stakk` for PR submission.

### Internal Notes
- Installation instructions are now included in release tags.
- Fixed-size carry buffer replaced with dynamic slice for arbitrarily long line support without extra allocations.
2026-03-12 20:14:11 +01:00
Eric Coissac
8dd32dc1bf Fix CompressStream call to use compressed variable
Replace hardcoded boolean with the `compressed` variable in CompressStream call to ensure correct compression behavior.
2026-03-12 18:48:22 +01:00
Eric Coissac
6ee8750635 Replace SplitInTwo with LeftSplitInTwo/RightSplitInTwo for precise splitting
Replace SplitInTwo calls with LeftSplitInTwo or RightSplitInTwo depending on the intended split direction. In fastseq_json_header.go, extract rank from suffix without splitting; in biosequenceslice.go and taxid.go, use LeftSplitInTwo to split from the left; add RightSplitInTwo utility function for splitting from the right.
2026-03-12 18:41:28 +01:00
Eric Coissac
8c318c480e replace fixed-size carry buffer with dynamic slice
Replace the fixed [256]byte carry buffer with a dynamic []byte slice to support arbitrarily long lines without heap allocation during accumulation. Update all carry buffer handling logic to use len(s.carry) and append instead of fixed-size copy operations.
2026-03-11 20:44:45 +01:00
Eric Coissac
09fbc217d3 Add EMBL rope parsing support and improve sequence extraction
Introduce EmblChunkParserRope function to parse EMBL chunks directly from a rope without using Pack(). Add extractEmblSeq helper to scan sequence sections and handle U to T conversion. Update parser logic to use rope-based parsing when available, and fix feature table handling for WGS entries.
2026-03-10 17:02:14 +01:00
Eric Coissac
3d2e205722 Refactor rope scanner and add FASTQ rope parser
This commit refactors the rope scanner implementation by renaming gbRopeScanner to ropeScanner and extracting the common functionality into a new file. It also introduces a new FastqChunkParserRope function that parses FASTQ chunks directly from a rope without Pack(), enabling more efficient memory usage. The existing parsers are updated to use the new rope-based parser when available. The BioSequence type is enhanced with a TakeQualities method for more efficient quality data handling.
2026-03-10 16:47:03 +01:00
Eric Coissac
623116ab13 Add rope-based FASTA parsing and improve sequence handling
Introduce FastaChunkParserRope for direct rope-based FASTA parsing, enhance sequence extraction with whitespace skipping and U->T conversion, and update parser logic to support both rope and raw data sources.

- Added extractFastaSeq function to scan sequence bytes directly from rope
- Implemented FastaChunkParserRope for rope-based parsing
- Modified _ParseFastaFile to use rope when available
- Updated sequence handling to support U->T conversion
- Fixed line ending detection for FASTA parsing
2026-03-10 16:34:33 +01:00
coissac
1e4509cb63 Merge pull request #90 from metabarcoding/push-uzpqqoqvpnxw
Push uzpqqoqvpnxw
2026-03-10 15:53:08 +01:00
26 changed files with 866 additions and 142 deletions

View File

@@ -62,6 +62,12 @@ jobs:
TAG=${GITHUB_REF#refs/tags/Release_}
echo "version=$TAG" >> $GITHUB_OUTPUT
- name: Install build tools (Linux)
if: runner.os == 'Linux'
run: |
sudo apt-get update -q
sudo apt-get install -y musl-tools
- name: Install build tools (macOS)
if: runner.os == 'macOS'
run: |
@@ -74,8 +80,13 @@ jobs:
GOOS: ${{ matrix.goos }}
GOARCH: ${{ matrix.goarch }}
VERSION: ${{ steps.get_version.outputs.version }}
CC: ${{ matrix.goos == 'linux' && 'musl-gcc' || '' }}
run: |
make obitools
if [ "$GOOS" = "linux" ]; then
make LDFLAGS='-linkmode=external -extldflags=-static' obitools
else
make obitools
fi
mkdir -p artifacts
# Create a single tar.gz with all binaries for this platform
tar -czf artifacts/obitools4_${VERSION}_${{ matrix.output_name }}.tar.gz -C build .

113
Makefile
View File

@@ -2,9 +2,17 @@
#export GOBIN=$(GOPATH)/bin
#export PATH=$(GOBIN):$(shell echo $${PATH})
.DEFAULT_GOAL := all
GREEN := \033[0;32m
YELLOW := \033[0;33m
BLUE := \033[0;34m
NC := \033[0m
GOFLAGS=
LDFLAGS=
GOCMD=go
GOBUILD=$(GOCMD) build $(GOFLAGS)
GOBUILD=$(GOCMD) build $(GOFLAGS) $(if $(LDFLAGS),-ldflags='$(LDFLAGS)')
GOGENERATE=$(GOCMD) generate
GOCLEAN=$(GOCMD) clean
GOTEST=$(GOCMD) test
@@ -60,6 +68,28 @@ endif
OUTPUT:=$(shell mktemp)
help:
@printf "$(GREEN)OBITools4 Makefile$(NC)\n\n"
@printf "$(BLUE)Main targets:$(NC)\n"
@printf " %-20s %s\n" "all" "Build all obitools (default)"
@printf " %-20s %s\n" "obitools" "Build all obitools binaries to build/"
@printf " %-20s %s\n" "test" "Run Go unit tests"
@printf " %-20s %s\n" "obitests" "Run integration tests (obitests/)"
@printf " %-20s %s\n" "bump-version" "Increment patch version (or set with VERSION=x.y.z)"
@printf " %-20s %s\n" "update-deps" "Update all Go dependencies"
@printf "\n$(BLUE)Jujutsu workflow:$(NC)\n"
@printf " %-20s %s\n" "jjnew" "Document current commit and start a new one"
@printf " %-20s %s\n" "jjpush" "Release: describe, bump, generate notes, push PR, tag (VERSION=x.y.z optional)"
@printf " %-20s %s\n" "jjfetch" "Fetch latest commits from origin"
@printf "\n$(BLUE)Required tools:$(NC)\n"
@printf " %-20s " "go"; command -v go >/dev/null 2>&1 && printf "$(GREEN)$(NC) %s\n" "$$(go version)" || printf "$(YELLOW)✗ not found$(NC)\n"
@printf " %-20s " "git"; command -v git >/dev/null 2>&1 && printf "$(GREEN)$(NC) %s\n" "$$(git --version)" || printf "$(YELLOW)✗ not found$(NC)\n"
@printf " %-20s " "jj"; command -v jj >/dev/null 2>&1 && printf "$(GREEN)$(NC) %s\n" "$$(jj --version)" || printf "$(YELLOW)✗ not found$(NC)\n"
@printf " %-20s " "gh"; command -v gh >/dev/null 2>&1 && printf "$(GREEN)$(NC) %s\n" "$$(gh --version | head -1)" || printf "$(YELLOW)✗ not found$(NC) (brew install gh)\n"
@printf "\n$(BLUE)Optional tools (release notes generation):$(NC)\n"
@printf " %-20s " "aichat"; command -v aichat >/dev/null 2>&1 && printf "$(GREEN)$(NC) %s\n" "$$(aichat --version)" || printf "$(YELLOW)✗ not found$(NC) (https://github.com/sigoden/aichat)\n"
@printf " %-20s " "jq"; command -v jq >/dev/null 2>&1 && printf "$(GREEN)$(NC) %s\n" "$$(jq --version)" || printf "$(YELLOW)✗ not found$(NC) (brew install jq)\n"
all: install-githook obitools
obitools: $(patsubst %,$(OBITOOLS_PREFIX)%,$(OBITOOLS))
@@ -106,15 +136,20 @@ pkg/obioptions/version.go: version.txt .FORCE
@rm -f $(OUTPUT)
bump-version:
@echo "Incrementing version..."
@current=$$(cat version.txt); \
echo " Current version: $$current"; \
major=$$(echo $$current | cut -d. -f1); \
minor=$$(echo $$current | cut -d. -f2); \
patch=$$(echo $$current | cut -d. -f3); \
new_patch=$$((patch + 1)); \
new_version="$$major.$$minor.$$new_patch"; \
echo " New version: $$new_version"; \
if [ -n "$(VERSION)" ]; then \
new_version="$(VERSION)"; \
echo "Setting version to $$new_version (was $$current)"; \
else \
echo "Incrementing version..."; \
echo " Current version: $$current"; \
major=$$(echo $$current | cut -d. -f1); \
minor=$$(echo $$current | cut -d. -f2); \
patch=$$(echo $$current | cut -d. -f3); \
new_patch=$$((patch + 1)); \
new_version="$$major.$$minor.$$new_patch"; \
echo " New version: $$new_version"; \
fi; \
echo "$$new_version" > version.txt
@echo "✓ Version updated in version.txt"
@$(MAKE) pkg/obioptions/version.go
@@ -130,6 +165,7 @@ jjnew:
jjpush:
@$(MAKE) jjpush-describe
@$(MAKE) jjpush-bump
@$(MAKE) jjpush-notes
@$(MAKE) jjpush-push
@$(MAKE) jjpush-tag
@echo "$(GREEN)✓ Release complete$(NC)"
@@ -142,44 +178,61 @@ jjpush-bump:
@echo "$(BLUE)→ Creating new commit for version bump...$(NC)"
@jj new
@$(MAKE) bump-version
@echo "$(BLUE)→ Documenting version bump commit...$(NC)"
@jj auto-describe
jjpush-push:
@echo "$(BLUE)→ Pushing commits...$(NC)"
@jj git push --change @
jjpush-tag:
jjpush-notes:
@version=$$(cat version.txt); \
tag_name="Release_$$version"; \
echo "$(BLUE)→ Generating release notes for $$tag_name...$(NC)"; \
release_message="Release $$version"; \
if command -v orla >/dev/null 2>&1 && command -v jq >/dev/null 2>&1; then \
previous_tag=$$(git describe --tags --abbrev=0 --match 'Release_*' HEAD^ 2>/dev/null); \
echo "$(BLUE)→ Generating release notes for version $$version...$(NC)"; \
release_title="Release $$version"; \
release_body=""; \
if command -v aichat >/dev/null 2>&1; then \
previous_tag=$$(git describe --tags --abbrev=0 --match 'Release_*' 2>/dev/null); \
if [ -z "$$previous_tag" ]; then \
echo "$(YELLOW)⚠ No previous Release tag found, skipping release notes$(NC)"; \
else \
raw_output=$$(git log --format="%h %B" "$$previous_tag..HEAD" | \
ORLA_MAX_TOOL_CALLS=50 orla agent -m ollama:qwen3-coder-next:latest \
aichat \
"Summarize the following commits into a GitHub release note for version $$version. Ignore commits related to version bumps, .gitignore changes, or any internal housekeeping that is irrelevant to end users. Describe each user-facing change precisely without exposing code. Eliminate redundancy. Output strictly valid JSON with no surrounding text, using this exact schema: {\"title\": \"<short release title>\", \"body\": \"<detailed markdown release notes>\"}" 2>/dev/null) || true; \
if [ -n "$$raw_output" ]; then \
sanitized=$$(echo "$$raw_output" | sed -n '/^{/,/^}/p' | tr -d '\000-\011\013-\014\016-\037'); \
release_title=$$(echo "$$sanitized" | jq -r '.title // empty' 2>/dev/null) ; \
release_body=$$(echo "$$sanitized" | jq -r '.body // empty' 2>/dev/null) ; \
if [ -n "$$release_title" ] && [ -n "$$release_body" ]; then \
release_message="$$release_title"$$'\n\n'"$$release_body"; \
notes=$$(printf '%s\n' "$$raw_output" | python3 tools/json2md.py 2>/dev/null); \
if [ -n "$$notes" ]; then \
release_title=$$(echo "$$notes" | head -1); \
release_body=$$(echo "$$notes" | tail -n +3); \
else \
echo "$(YELLOW)⚠ JSON parsing failed, using default release message$(NC)"; \
fi; \
fi; \
fi; \
fi; \
printf '%s' "$$release_title" > /tmp/obitools4-release-title.txt; \
printf '%s' "$$release_body" > /tmp/obitools4-release-body.txt; \
echo "$(BLUE)→ Setting release notes as commit description...$(NC)"; \
jj desc -m "$$release_title"$$'\n\n'"$$release_body"
jjpush-push:
@echo "$(BLUE)→ Pushing commits...$(NC)"
@jj git push --change @
@echo "$(BLUE)→ Creating/updating PR...$(NC)"
@release_title=$$(cat /tmp/obitools4-release-title.txt 2>/dev/null || echo "Release $$(cat version.txt)"); \
release_body=$$(cat /tmp/obitools4-release-body.txt 2>/dev/null || echo ""); \
branch=$$(jj log -r @ --no-graph -T 'bookmarks.map(|b| b.name()).join("\n")' 2>/dev/null | head -1); \
if [ -n "$$branch" ] && command -v gh >/dev/null 2>&1; then \
gh pr create --title "$$release_title" --body "$$release_body" --base master --head "$$branch" 2>/dev/null \
|| gh pr edit "$$branch" --title "$$release_title" --body "$$release_body" 2>/dev/null \
|| echo "$(YELLOW)⚠ Could not create/update PR$(NC)"; \
fi
jjpush-tag:
@version=$$(cat version.txt); \
tag_name="Release_$$version"; \
release_title=$$(cat /tmp/obitools4-release-title.txt 2>/dev/null || echo "Release $$version"); \
release_body=$$(cat /tmp/obitools4-release-body.txt 2>/dev/null || echo ""); \
install_section=$$'\n## Installation\n\n### Pre-built binaries\n\nDownload the appropriate archive for your system from the\n[release assets](https://github.com/metabarcoding/obitools4/releases/tag/Release_'"$$version"')\nand extract it:\n\n#### Linux (AMD64)\n```bash\ntar -xzf obitools4_'"$$version"'_linux_amd64.tar.gz\n```\n\n#### Linux (ARM64)\n```bash\ntar -xzf obitools4_'"$$version"'_linux_arm64.tar.gz\n```\n\n#### macOS (Intel)\n```bash\ntar -xzf obitools4_'"$$version"'_darwin_amd64.tar.gz\n```\n\n#### macOS (Apple Silicon)\n```bash\ntar -xzf obitools4_'"$$version"'_darwin_arm64.tar.gz\n```\n\nAll OBITools4 binaries are included in each archive.\n\n### From source\n\nYou can also compile and install OBITools4 directly from source using the\ninstallation script:\n\n```bash\ncurl -L https://raw.githubusercontent.com/metabarcoding/obitools4/master/install_obitools.sh | bash -s -- --version '"$$version"'\n```\n\nBy default binaries are installed in `/usr/local/bin`. Use `--install-dir` to\nchange the destination and `--obitools-prefix` to add a prefix to command names:\n\n```bash\ncurl -L https://raw.githubusercontent.com/metabarcoding/obitools4/master/install_obitools.sh | \\\n bash -s -- --version '"$$version"' --install-dir ~/local --obitools-prefix k\n```\n'; \
release_message="$$release_message$$install_section"; \
release_message="$$release_title"$$'\n\n'"$$release_body$$install_section"; \
echo "$(BLUE)→ Creating tag $$tag_name...$(NC)"; \
git tag -a "$$tag_name" -m "$$release_message" 2>/dev/null || echo "$(YELLOW)⚠ Tag $$tag_name already exists$(NC)"; \
echo "$(BLUE)→ Pushing tag $$tag_name...$(NC)"; \
git push origin "$$tag_name" 2>/dev/null || echo "$(YELLOW)⚠ Tag push failed or already pushed$(NC)"
git push origin "$$tag_name" 2>/dev/null || echo "$(YELLOW)⚠ Tag push failed or already pushed$(NC)"; \
rm -f /tmp/obitools4-release-title.txt /tmp/obitools4-release-body.txt
jjfetch:
@echo "$(YELLOW)→ Pulling latest commits...$(NC)"
@@ -187,5 +240,5 @@ jjfetch:
@jj new master@origin
@echo "$(GREEN)✓ Latest commits pulled$(NC)"
.PHONY: all obitools update-deps obitests githubtests jjnew jjpush jjpush-describe jjpush-bump jjpush-push jjpush-tag jjfetch bump-version .FORCE
.PHONY: all obitools update-deps obitests githubtests help jjnew jjpush jjpush-describe jjpush-bump jjpush-notes jjpush-push jjpush-tag jjfetch bump-version .FORCE
.FORCE:

View File

@@ -32,8 +32,12 @@ The installation script offers several options:
>
> -p, --obitools-prefix Prefix added to the obitools command names if you
> want to have several versions of obitools at the
> same time on your system (as example `-p g` will produce
> same time on your system (as example `-p g` will produce
> `gobigrep` command instead of `obigrep`).
>
> -j, --jobs Number of parallel jobs used for compilation
> (default: 1). Increase this value to speed up
> compilation on multi-core systems (e.g., `-j 4`).
### Examples

View File

@@ -7,6 +7,7 @@ INSTALL_DIR="/usr/local"
OBITOOLS_PREFIX=""
VERSION=""
LIST_VERSIONS=false
JOBS=1
# Help message
function display_help {
@@ -21,6 +22,7 @@ function display_help {
echo " gobigrep command instead of obigrep)."
echo " -v, --version Install a specific version (e.g., 4.4.8)."
echo " If not specified, installs the latest version."
echo " -j, --jobs Number of parallel jobs for compilation (default: 1)."
echo " -l, --list List all available versions and exit."
echo " -h, --help Display this help message."
echo ""
@@ -65,6 +67,10 @@ while [ "$#" -gt 0 ]; do
VERSION="$2"
shift 2
;;
-j|--jobs)
JOBS="$2"
shift 2
;;
-l|--list)
LIST_VERSIONS=true
shift
@@ -122,9 +128,15 @@ mkdir -p "${WORK_DIR}/cache" \
exit 1)
# Create installation directory
mkdir -p "${INSTALL_DIR}/bin" 2> /dev/null \
|| (echo "Please enter your password for installing obitools in ${INSTALL_DIR}" 1>&2
sudo mkdir -p "${INSTALL_DIR}/bin")
if ! mkdir -p "${INSTALL_DIR}/bin" 2>/dev/null; then
if [ ! -w "$(dirname "${INSTALL_DIR}")" ] && [ ! -w "${INSTALL_DIR}" ]; then
echo "Please enter your password for installing obitools in ${INSTALL_DIR}" 1>&2
sudo mkdir -p "${INSTALL_DIR}/bin"
else
echo "Error: Could not create ${INSTALL_DIR}/bin (check path or disk space)" 1>&2
exit 1
fi
fi
if [[ ! -d "${INSTALL_DIR}/bin" ]]; then
echo "Could not create ${INSTALL_DIR}/bin directory for installing obitools" 1>&2
@@ -208,16 +220,29 @@ mkdir -p vendor
# Build with or without prefix
if [[ -z "$OBITOOLS_PREFIX" ]] ; then
make GOFLAGS="-buildvcs=false"
make -j"${JOBS}" obitools GOFLAGS="-buildvcs=false"
else
make GOFLAGS="-buildvcs=false" OBITOOLS_PREFIX="${OBITOOLS_PREFIX}"
make -j"${JOBS}" obitools GOFLAGS="-buildvcs=false" OBITOOLS_PREFIX="${OBITOOLS_PREFIX}"
fi
# Install binaries
echo "Installing binaries to ${INSTALL_DIR}/bin..." 1>&2
(cp build/* "${INSTALL_DIR}/bin" 2> /dev/null) \
|| (echo "Please enter your password for installing obitools in ${INSTALL_DIR}" 1>&2
sudo cp build/* "${INSTALL_DIR}/bin")
if ! cp build/* "${INSTALL_DIR}/bin" 2>/dev/null; then
if [ ! -w "${INSTALL_DIR}/bin" ]; then
echo "Please enter your password for installing obitools in ${INSTALL_DIR}" 1>&2
sudo cp build/* "${INSTALL_DIR}/bin"
else
echo "Error: Could not copy binaries to ${INSTALL_DIR}/bin" 1>&2
echo " Source files: $(ls build/ 2>/dev/null || echo 'none found')" 1>&2
echo "" 1>&2
echo "The build directory has been preserved for manual recovery:" 1>&2
echo " $(pwd)/build/" 1>&2
echo "You can install manually with:" 1>&2
echo " cp $(pwd)/build/* ${INSTALL_DIR}/bin/" 1>&2
popd > /dev/null || true
exit 1
fi
fi
popd > /dev/null || exit

View File

@@ -1,6 +1,12 @@
package obidefault
var _BatchSize = 2000
// _BatchSize is the minimum number of sequences per batch (floor).
// Used as the minSeqs argument to RebatchBySize.
var _BatchSize = 1
// _BatchSizeMax is the maximum number of sequences per batch (ceiling).
// A batch is flushed when this count is reached regardless of memory usage.
var _BatchSizeMax = 2000
// SetBatchSize sets the size of the sequence batches.
//
@@ -24,3 +30,42 @@ func BatchSize() int {
func BatchSizePtr() *int {
return &_BatchSize
}
// BatchSizeMax returns the maximum number of sequences per batch.
func BatchSizeMax() int {
return _BatchSizeMax
}
func BatchSizeMaxPtr() *int {
return &_BatchSizeMax
}
// _BatchMem holds the maximum cumulative memory (in bytes) per batch when
// memory-based batching is requested. A value of 0 disables memory-based
// batching and falls back to count-based batching.
var _BatchMem = 128 * 1024 * 1024 // 128 MB default; set to 0 to disable
var _BatchMemStr = ""
// SetBatchMem sets the memory budget per batch in bytes.
func SetBatchMem(n int) {
_BatchMem = n
}
// BatchMem returns the current memory budget per batch in bytes.
// A value of 0 means memory-based batching is disabled.
func BatchMem() int {
return _BatchMem
}
func BatchMemPtr() *int {
return &_BatchMem
}
// BatchMemStr returns the raw --batch-mem string value as provided on the CLI.
func BatchMemStr() string {
return _BatchMemStr
}
func BatchMemStrPtr() *string {
return &_BatchMemStr
}

View File

@@ -161,6 +161,149 @@ func EmblChunkParser(withFeatureTable, UtoT bool) func(string, io.Reader) (obise
return parser
}
// extractEmblSeq scans the sequence section of an EMBL record directly on the
// rope. EMBL sequence lines start with 5 spaces followed by bases in groups of
// 10, separated by spaces, with a position number at the end. The section ends
// with "//".
func (s *ropeScanner) extractEmblSeq(dest []byte, UtoT bool) []byte {
// We use ReadLine and scan each line for bases (skip digits, spaces, newlines).
for {
line := s.ReadLine()
if line == nil {
break
}
if len(line) >= 2 && line[0] == '/' && line[1] == '/' {
break
}
// Lines start with 5 spaces; bases follow separated by single spaces.
// Digits at the end are the position counter — skip them.
// Simplest: take every byte that is a letter.
for _, b := range line {
if b >= 'A' && b <= 'Z' {
b += 'a' - 'A'
}
if UtoT && b == 'u' {
b = 't'
}
if b >= 'a' && b <= 'z' {
dest = append(dest, b)
}
}
}
return dest
}
// EmblChunkParserRope parses an EMBL chunk directly from a rope without Pack().
func EmblChunkParserRope(source string, rope *PieceOfChunk, withFeatureTable, UtoT bool) (obiseq.BioSequenceSlice, error) {
scanner := newRopeScanner(rope)
sequences := obiseq.MakeBioSequenceSlice(100)[:0]
var id string
var scientificName string
defBytes := make([]byte, 0, 256)
featBytes := make([]byte, 0, 1024)
var taxid int
inSeq := false
for {
line := scanner.ReadLine()
if line == nil {
break
}
if inSeq {
// Should not happen — extractEmblSeq consumed up to "//"
inSeq = false
continue
}
switch {
case bytes.HasPrefix(line, []byte("ID ")):
id = string(bytes.SplitN(line[5:], []byte(";"), 2)[0])
case bytes.HasPrefix(line, []byte("OS ")):
scientificName = string(bytes.TrimSpace(line[5:]))
case bytes.HasPrefix(line, []byte("DE ")):
if len(defBytes) > 0 {
defBytes = append(defBytes, ' ')
}
defBytes = append(defBytes, bytes.TrimSpace(line[5:])...)
case withFeatureTable && bytes.HasPrefix(line, []byte("FH ")):
featBytes = append(featBytes, line...)
case withFeatureTable && bytes.Equal(line, []byte("FH")):
featBytes = append(featBytes, '\n')
featBytes = append(featBytes, line...)
case bytes.HasPrefix(line, []byte("FT ")):
if withFeatureTable {
featBytes = append(featBytes, '\n')
featBytes = append(featBytes, line...)
}
if bytes.HasPrefix(line, []byte(`FT /db_xref="taxon:`)) {
rest := line[37:]
end := bytes.IndexByte(rest, '"')
if end > 0 {
taxid, _ = strconv.Atoi(string(rest[:end]))
}
}
case bytes.HasPrefix(line, []byte(" ")):
// First sequence line: extract all bases via extractEmblSeq,
// which also consumes this line's remaining content.
// But ReadLine already consumed this line — we need to process it
// plus subsequent lines. Process this line inline then call helper.
seqDest := make([]byte, 0, 4096)
for _, b := range line {
if b >= 'A' && b <= 'Z' {
b += 'a' - 'A'
}
if UtoT && b == 'u' {
b = 't'
}
if b >= 'a' && b <= 'z' {
seqDest = append(seqDest, b)
}
}
seqDest = scanner.extractEmblSeq(seqDest, UtoT)
seq := obiseq.NewBioSequenceOwning(id, seqDest, string(defBytes))
seq.SetSource(source)
if withFeatureTable {
seq.SetFeatures(featBytes)
}
annot := seq.Annotations()
annot["scientific_name"] = scientificName
annot["taxid"] = taxid
sequences = append(sequences, seq)
// Reset state
id = ""
scientificName = ""
defBytes = defBytes[:0]
featBytes = featBytes[:0]
taxid = 1
case bytes.Equal(line, []byte("//")):
// record ended without SQ/sequence section (e.g. WGS entries)
if id != "" {
seq := obiseq.NewBioSequenceOwning(id, []byte{}, string(defBytes))
seq.SetSource(source)
if withFeatureTable {
seq.SetFeatures(featBytes)
}
annot := seq.Annotations()
annot["scientific_name"] = scientificName
annot["taxid"] = taxid
sequences = append(sequences, seq)
}
id = ""
scientificName = ""
defBytes = defBytes[:0]
featBytes = featBytes[:0]
taxid = 1
}
}
return sequences, nil
}
func _ParseEmblFile(
input ChannelFileChunk,
out obiiter.IBioSequence,
@@ -171,7 +314,14 @@ func _ParseEmblFile(
for chunks := range input {
order := chunks.Order
sequences, err := parser(chunks.Source, chunks.Raw)
var sequences obiseq.BioSequenceSlice
var err error
if chunks.Rope != nil {
sequences, err = EmblChunkParserRope(chunks.Source, chunks.Rope, withFeatureTable, UtoT)
} else {
sequences, err = parser(chunks.Source, chunks.Raw)
}
if err != nil {
log.Fatalf("%s : Cannot parse the embl file : %v", chunks.Source, err)
@@ -196,7 +346,7 @@ func ReadEMBL(reader io.Reader, options ...WithOption) (obiiter.IBioSequence, er
1024*1024*128,
EndOfLastFlatFileEntry,
"\nID ",
true,
false,
)
newIter := obiiter.MakeIBioSequence()

View File

@@ -209,28 +209,121 @@ func FastaChunkParser(UtoT bool) func(string, io.Reader) (obiseq.BioSequenceSlic
return parser
}
// extractFastaSeq scans sequence bytes from the rope directly into dest,
// appending valid nucleotide characters and skipping whitespace.
// Stops when '>' is found at the start of a line (next record) or at EOF.
// Returns (dest with appended bases, hasMore).
// hasMore=true means scanner is now positioned at '>' of the next record.
func (s *ropeScanner) extractFastaSeq(dest []byte, UtoT bool) ([]byte, bool) {
lineStart := true
for s.current != nil {
data := s.current.data[s.pos:]
for i, b := range data {
if lineStart && b == '>' {
s.pos += i
if s.pos >= len(s.current.data) {
s.current = s.current.Next()
s.pos = 0
}
return dest, true
}
if b == '\n' || b == '\r' {
lineStart = true
continue
}
lineStart = false
if b == ' ' || b == '\t' {
continue
}
if b >= 'A' && b <= 'Z' {
b += 'a' - 'A'
}
if UtoT && b == 'u' {
b = 't'
}
dest = append(dest, b)
}
s.current = s.current.Next()
s.pos = 0
}
return dest, false
}
// FastaChunkParserRope parses a FASTA chunk directly from the rope without Pack().
func FastaChunkParserRope(source string, rope *PieceOfChunk, UtoT bool) (obiseq.BioSequenceSlice, error) {
scanner := newRopeScanner(rope)
sequences := obiseq.MakeBioSequenceSlice(100)[:0]
for {
bline := scanner.ReadLine()
if bline == nil {
break
}
if len(bline) == 0 || bline[0] != '>' {
continue
}
// Parse header: ">id definition"
header := bline[1:]
var id string
var definition string
sp := bytes.IndexByte(header, ' ')
if sp < 0 {
sp = bytes.IndexByte(header, '\t')
}
if sp < 0 {
id = string(header)
} else {
id = string(header[:sp])
definition = string(bytes.TrimSpace(header[sp+1:]))
}
seqDest := make([]byte, 0, 4096)
var hasMore bool
seqDest, hasMore = scanner.extractFastaSeq(seqDest, UtoT)
if len(seqDest) == 0 {
log.Fatalf("%s [%s]: sequence is empty", source, id)
}
seq := obiseq.NewBioSequenceOwning(id, seqDest, definition)
seq.SetSource(source)
sequences = append(sequences, seq)
if !hasMore {
break
}
}
return sequences, nil
}
func _ParseFastaFile(
input ChannelFileChunk,
out obiiter.IBioSequence,
UtoT bool,
) {
parser := FastaChunkParser(UtoT)
for chunks := range input {
sequences, err := parser(chunks.Source, chunks.Raw)
// obilog.Warnf("Chunck(%d:%d) -%d- ", chunks.Order, l, sequences.Len())
var sequences obiseq.BioSequenceSlice
var err error
if chunks.Rope != nil {
sequences, err = FastaChunkParserRope(chunks.Source, chunks.Rope, UtoT)
} else {
sequences, err = parser(chunks.Source, chunks.Raw)
}
if err != nil {
log.Fatalf("File %s : Cannot parse the fasta file : %v", chunks.Source, err)
}
out.Push(obiiter.MakeBioSequenceBatch(chunks.Source, chunks.Order, sequences))
}
out.Done()
}
func ReadFasta(reader io.Reader, options ...WithOption) (obiiter.IBioSequence, error) {
@@ -245,7 +338,7 @@ func ReadFasta(reader io.Reader, options ...WithOption) (obiiter.IBioSequence, e
1024*1024,
EndOfLastFastaEntry,
"\n>",
true,
false,
)
for i := 0; i < nworker; i++ {

View File

@@ -303,6 +303,80 @@ func FastqChunkParser(quality_shift byte, with_quality bool, UtoT bool) func(str
return parser
}
// FastqChunkParserRope parses a FASTQ chunk directly from a rope without Pack().
func FastqChunkParserRope(source string, rope *PieceOfChunk, quality_shift byte, with_quality, UtoT bool) (obiseq.BioSequenceSlice, error) {
scanner := newRopeScanner(rope)
sequences := obiseq.MakeBioSequenceSlice(100)[:0]
for {
// Line 1: @id [definition]
hline := scanner.ReadLine()
if hline == nil {
break
}
if len(hline) == 0 || hline[0] != '@' {
continue
}
header := hline[1:]
var id string
var definition string
sp := bytes.IndexByte(header, ' ')
if sp < 0 {
sp = bytes.IndexByte(header, '\t')
}
if sp < 0 {
id = string(header)
} else {
id = string(header[:sp])
definition = string(bytes.TrimSpace(header[sp+1:]))
}
// Line 2: sequence
sline := scanner.ReadLine()
if sline == nil {
log.Fatalf("@%s[%s]: unexpected EOF after header", id, source)
}
seqDest := make([]byte, len(sline))
w := 0
for _, b := range sline {
if b >= 'A' && b <= 'Z' {
b += 'a' - 'A'
}
if UtoT && b == 'u' {
b = 't'
}
seqDest[w] = b
w++
}
seqDest = seqDest[:w]
if len(seqDest) == 0 {
log.Fatalf("@%s[%s]: sequence is empty", id, source)
}
// Line 3: + (skip)
scanner.ReadLine()
// Line 4: quality
qline := scanner.ReadLine()
seq := obiseq.NewBioSequenceOwning(id, seqDest, definition)
seq.SetSource(source)
if with_quality && qline != nil {
qDest := make([]byte, len(qline))
copy(qDest, qline)
for i := range qDest {
qDest[i] -= quality_shift
}
seq.TakeQualities(qDest)
}
sequences = append(sequences, seq)
}
return sequences, nil
}
func _ParseFastqFile(
input ChannelFileChunk,
out obiiter.IBioSequence,
@@ -313,7 +387,14 @@ func _ParseFastqFile(
parser := FastqChunkParser(quality_shift, with_quality, UtoT)
for chunks := range input {
sequences, err := parser(chunks.Source, chunks.Raw)
var sequences obiseq.BioSequenceSlice
var err error
if chunks.Rope != nil {
sequences, err = FastqChunkParserRope(chunks.Source, chunks.Rope, quality_shift, with_quality, UtoT)
} else {
sequences, err = parser(chunks.Source, chunks.Raw)
}
if err != nil {
log.Fatalf("File %s : Cannot parse the fastq file : %v", chunks.Source, err)
@@ -339,7 +420,7 @@ func ReadFastq(reader io.Reader, options ...WithOption) (obiiter.IBioSequence, e
1024*1024,
EndOfLastFastqEntry,
"\n@",
true,
false,
)
for i := 0; i < nworker; i++ {

View File

@@ -296,7 +296,7 @@ func _parse_json_header_(header string, sequence *obiseq.BioSequence) string {
case strings.HasSuffix(skey, "_taxid"):
if dataType == jsonparser.Number || dataType == jsonparser.String {
rank, _ := obiutils.SplitInTwo(skey, '_')
rank := skey[:len(skey)-len("_taxid")]
taxid := string(value)
sequence.SetTaxid(taxid, rank)

View File

@@ -29,70 +29,11 @@ const (
var _seqlenght_rx = regexp.MustCompile(" +([0-9]+) bp")
// gbRopeScanner reads lines from a PieceOfChunk rope without heap allocation.
// The carry buffer (stack) handles lines that span two rope nodes.
type gbRopeScanner struct {
current *PieceOfChunk
pos int
carry [256]byte // max GenBank line = 80 chars; 256 gives ample margin
carryN int
}
func newGbRopeScanner(rope *PieceOfChunk) *gbRopeScanner {
return &gbRopeScanner{current: rope}
}
// ReadLine returns the next line without the trailing \n (or \r\n).
// Returns nil at end of rope. The returned slice aliases carry[] or the node
// data and is valid only until the next ReadLine call.
func (s *gbRopeScanner) ReadLine() []byte {
for {
if s.current == nil {
if s.carryN > 0 {
n := s.carryN
s.carryN = 0
return s.carry[:n]
}
return nil
}
data := s.current.data[s.pos:]
idx := bytes.IndexByte(data, '\n')
if idx >= 0 {
var line []byte
if s.carryN == 0 {
line = data[:idx]
} else {
n := copy(s.carry[s.carryN:], data[:idx])
s.carryN += n
line = s.carry[:s.carryN]
s.carryN = 0
}
s.pos += idx + 1
if s.pos >= len(s.current.data) {
s.current = s.current.Next()
s.pos = 0
}
if len(line) > 0 && line[len(line)-1] == '\r' {
line = line[:len(line)-1]
}
return line
}
// No \n in this node: accumulate into carry and advance
n := copy(s.carry[s.carryN:], data)
s.carryN += n
s.current = s.current.Next()
s.pos = 0
}
}
// extractSequence scans the ORIGIN section byte-by-byte directly on the rope,
// appending compacted bases to dest. Returns the extended slice.
// Stops and returns when "//" is found at the start of a line.
// The scanner is left positioned after the "//" line.
func (s *gbRopeScanner) extractSequence(dest []byte, UtoT bool) []byte {
func (s *ropeScanner) extractSequence(dest []byte, UtoT bool) []byte {
lineStart := true
skipDigits := true
@@ -139,24 +80,6 @@ func (s *gbRopeScanner) extractSequence(dest []byte, UtoT bool) []byte {
return dest
}
// skipToNewline advances the scanner past the next '\n'.
func (s *gbRopeScanner) skipToNewline() {
for s.current != nil {
data := s.current.data[s.pos:]
idx := bytes.IndexByte(data, '\n')
if idx >= 0 {
s.pos += idx + 1
if s.pos >= len(s.current.data) {
s.current = s.current.Next()
s.pos = 0
}
return
}
s.current = s.current.Next()
s.pos = 0
}
}
// parseLseqFromLocus extracts the declared sequence length from a LOCUS line.
// Format: "LOCUS <id> <length> bp ..."
// Returns -1 if not found or parse error.
@@ -205,7 +128,7 @@ func GenbankChunkParserRope(source string, rope *PieceOfChunk,
withFeatureTable, UtoT bool) (obiseq.BioSequenceSlice, error) {
state := inHeader
scanner := newGbRopeScanner(rope)
scanner := newRopeScanner(rope)
sequences := obiseq.MakeBioSequenceSlice(100)[:0]
id := ""

View File

@@ -0,0 +1,77 @@
package obiformats
import "bytes"
// ropeScanner reads lines from a PieceOfChunk rope.
// The carry buffer handles lines that span two rope nodes; it grows as needed.
type ropeScanner struct {
current *PieceOfChunk
pos int
carry []byte
}
func newRopeScanner(rope *PieceOfChunk) *ropeScanner {
return &ropeScanner{current: rope}
}
// ReadLine returns the next line without the trailing \n (or \r\n).
// Returns nil at end of rope. The returned slice aliases carry[] or the node
// data and is valid only until the next ReadLine call.
func (s *ropeScanner) ReadLine() []byte {
for {
if s.current == nil {
if len(s.carry) > 0 {
line := s.carry
s.carry = s.carry[:0]
return line
}
return nil
}
data := s.current.data[s.pos:]
idx := bytes.IndexByte(data, '\n')
if idx >= 0 {
var line []byte
if len(s.carry) == 0 {
line = data[:idx]
} else {
s.carry = append(s.carry, data[:idx]...)
line = s.carry
s.carry = s.carry[:0]
}
s.pos += idx + 1
if s.pos >= len(s.current.data) {
s.current = s.current.Next()
s.pos = 0
}
if len(line) > 0 && line[len(line)-1] == '\r' {
line = line[:len(line)-1]
}
return line
}
// No \n in this node: accumulate into carry and advance
s.carry = append(s.carry, data...)
s.current = s.current.Next()
s.pos = 0
}
}
// skipToNewline advances the scanner past the next '\n'.
func (s *ropeScanner) skipToNewline() {
for s.current != nil {
data := s.current.data[s.pos:]
idx := bytes.IndexByte(data, '\n')
if idx >= 0 {
s.pos += idx + 1
if s.pos >= len(s.current.data) {
s.current = s.current.Next()
s.pos = 0
}
return
}
s.current = s.current.Next()
s.pos = 0
}
}

View File

@@ -444,6 +444,67 @@ func (iterator IBioSequence) Rebatch(size int) IBioSequence {
return newIter
}
// RebatchBySize reorganises the stream into batches bounded by two independent
// upper limits: maxCount (max number of sequences) and maxBytes (max cumulative
// estimated memory). A batch is flushed as soon as either limit would be
// exceeded. A single sequence larger than maxBytes is always emitted alone.
// Passing 0 for a limit disables that constraint; if both are 0 it falls back
// to Rebatch(obidefault.BatchSizeMax()).
func (iterator IBioSequence) RebatchBySize(maxBytes int, maxCount int) IBioSequence {
if maxBytes <= 0 && maxCount <= 0 {
return iterator.Rebatch(obidefault.BatchSizeMax())
}
newIter := MakeIBioSequence()
newIter.Add(1)
go func() {
newIter.WaitAndClose()
}()
go func() {
order := 0
iterator = iterator.SortBatches()
buffer := obiseq.MakeBioSequenceSlice()
bufBytes := 0
source := ""
flush := func() {
if len(buffer) > 0 {
newIter.Push(MakeBioSequenceBatch(source, order, buffer))
order++
buffer = obiseq.MakeBioSequenceSlice()
bufBytes = 0
}
}
for iterator.Next() {
seqs := iterator.Get()
source = seqs.Source()
for _, s := range seqs.Slice() {
sz := s.MemorySize()
countFull := maxCount > 0 && len(buffer) >= maxCount
memFull := maxBytes > 0 && bufBytes+sz > maxBytes && len(buffer) > 0
if countFull || memFull {
flush()
}
buffer = append(buffer, s)
bufBytes += sz
}
}
flush()
newIter.Done()
}()
if iterator.IsPaired() {
newIter.MarkAsPaired()
}
return newIter
}
func (iterator IBioSequence) FilterEmpty() IBioSequence {
newIter := MakeIBioSequence()
@@ -638,7 +699,7 @@ func (iterator IBioSequence) FilterOn(predicate obiseq.SequencePredicate,
trueIter.MarkAsPaired()
}
return trueIter.Rebatch(size)
return trueIter.RebatchBySize(obidefault.BatchMem(), obidefault.BatchSizeMax())
}
func (iterator IBioSequence) FilterAnd(predicate obiseq.SequencePredicate,
@@ -694,7 +755,7 @@ func (iterator IBioSequence) FilterAnd(predicate obiseq.SequencePredicate,
trueIter.MarkAsPaired()
}
return trueIter.Rebatch(size)
return trueIter.RebatchBySize(obidefault.BatchMem(), obidefault.BatchSizeMax())
}
// Load all sequences availables from an IBioSequenceBatch iterator into

View File

@@ -3,6 +3,7 @@ package obiiter
import (
log "github.com/sirupsen/logrus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obidefault"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiseq"
)
@@ -70,7 +71,7 @@ func IFragments(minsize, length, overlap, size, nworkers int) Pipeable {
}
go f(iterator)
return newiter.SortBatches().Rebatch(size)
return newiter.SortBatches().RebatchBySize(obidefault.BatchMem(), obidefault.BatchSizeMax())
}
return ifrg

View File

@@ -8,6 +8,7 @@ import (
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obidefault"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiformats"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
log "github.com/sirupsen/logrus"
"github.com/DavidGamba/go-getoptions"
@@ -55,7 +56,15 @@ func RegisterGlobalOptions(options *getoptions.GetOpt) {
options.IntVar(obidefault.BatchSizePtr(), "batch-size", obidefault.BatchSize(),
options.GetEnv("OBIBATCHSIZE"),
options.Description("Number of sequence per batch for paralelle processing"))
options.Description("Minimum number of sequences per batch (floor, default 1)"))
options.IntVar(obidefault.BatchSizeMaxPtr(), "batch-size-max", obidefault.BatchSizeMax(),
options.GetEnv("OBIBATCHSIZEMAX"),
options.Description("Maximum number of sequences per batch (ceiling, default 2000)"))
options.StringVar(obidefault.BatchMemStrPtr(), "batch-mem", "",
options.GetEnv("OBIBATCHMEM"),
options.Description("Maximum memory per batch (e.g. 128K, 64M, 1G; default: 128M). Set to 0 to disable."))
options.Bool("solexa", false,
options.GetEnv("OBISOLEXA"),
@@ -157,6 +166,15 @@ func ProcessParsedOptions(options *getoptions.GetOpt, parseErr error) {
if options.Called("solexa") {
obidefault.SetReadQualitiesShift(64)
}
if options.Called("batch-mem") {
n, err := obiutils.ParseMemSize(obidefault.BatchMemStr())
if err != nil {
log.Fatalf("Invalid --batch-mem value %q: %v", obidefault.BatchMemStr(), err)
}
obidefault.SetBatchMem(n)
log.Printf("Memory-based batching enabled: %s per batch", obidefault.BatchMemStr())
}
}
func GenerateOptionParser(program string,

View File

@@ -3,7 +3,7 @@ package obioptions
// Version is automatically updated by the Makefile from version.txt
// The patch number (third digit) is incremented on each push to the repository
var _Version = "Release 4.4.19"
var _Version = "Release 4.4.22"
// Version returns the version of the obitools package.
//

View File

@@ -273,6 +273,28 @@ func (s *BioSequence) Len() int {
return len(s.sequence)
}
// MemorySize returns an estimate of the memory footprint of the BioSequence
// in bytes. It accounts for the sequence, quality scores, feature data,
// annotations, and fixed struct overhead. The estimate is conservative
// (cap rather than len for byte slices) so it is suitable for memory-based
// batching decisions.
func (s *BioSequence) MemorySize() int {
if s == nil {
return 0
}
// fixed struct overhead (strings, pointers, mutex pointer)
const overhead = 128
n := overhead
n += cap(s.sequence)
n += cap(s.qualities)
n += cap(s.feature)
n += len(s.id)
n += len(s.source)
// rough annotation estimate: each key+value pair ~64 bytes on average
n += len(s.annotations) * 64
return n
}
// HasQualities checks if the BioSequence has sequence qualitiy scores.
//
// This function does not have any parameters.
@@ -480,6 +502,15 @@ func (s *BioSequence) SetQualities(qualities Quality) {
s.qualities = CopySlice(qualities)
}
// TakeQualities stores the slice directly without copying.
// The caller must not use the slice after this call.
func (s *BioSequence) TakeQualities(qualities Quality) {
if s.qualities != nil {
RecycleSlice(&s.qualities)
}
s.qualities = qualities
}
// A method that appends a byte slice to the qualities of the BioSequence.
func (s *BioSequence) WriteQualities(data []byte) (int, error) {
s.qualities = append(s.qualities, data...)

View File

@@ -195,7 +195,7 @@ func (s *BioSequenceSlice) ExtractTaxonomy(taxonomy *obitax.Taxonomy, seqAsTaxa
return nil, fmt.Errorf("sequence %v has no path", s.Id())
}
last := path[len(path)-1]
taxname, _ := obiutils.SplitInTwo(last, ':')
taxname, _ := obiutils.LeftSplitInTwo(last, ':')
if idx, ok := s.GetIntAttribute("seq_number"); !ok {
return nil, errors.New("sequences are not numbered")
} else {

View File

@@ -1,13 +1,20 @@
package obiseq
import (
"runtime"
"sync"
"sync/atomic"
log "github.com/sirupsen/logrus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
)
const _LargeSliceThreshold = 100 * 1024 // 100 kb — below: leave to GC, above: trigger explicit GC
const _GCBytesBudget = int64(256 * 1024 * 1024) // trigger GC every 256 MB of large discards
var _largeSliceDiscardedBytes = atomic.Int64{}
var _BioSequenceByteSlicePool = sync.Pool{
New: func() interface{} {
bs := make([]byte, 0, 300)
@@ -34,6 +41,13 @@ func RecycleSlice(s *[]byte) {
}
if cap(*s) <= 1024 {
_BioSequenceByteSlicePool.Put(s)
} else if cap(*s) >= _LargeSliceThreshold {
n := int64(cap(*s))
*s = nil
prev := _largeSliceDiscardedBytes.Load()
if _largeSliceDiscardedBytes.Add(n)/_GCBytesBudget > prev/_GCBytesBudget {
runtime.GC()
}
}
}
}

View File

@@ -31,7 +31,7 @@ func NewTaxidFactory(code string, alphabet obiutils.AsciiSet) *TaxidFactory {
// It extracts the relevant part of the string after the first colon (':') if present.
func (f *TaxidFactory) FromString(taxid string) (Taxid, error) {
taxid = obiutils.AsciiSpaceSet.TrimLeft(taxid)
part1, part2 := obiutils.SplitInTwo(taxid, ':')
part1, part2 := obiutils.LeftSplitInTwo(taxid, ':')
if len(part2) == 0 {
taxid = part1
} else {

View File

@@ -64,7 +64,7 @@ func EmpiricalDistCsv(filename string, data [][]Ratio, compressed bool) {
fmt.Println(err)
}
destfile, err := obiutils.CompressStream(file, true, true)
destfile, err := obiutils.CompressStream(file, compressed, true)
if err != nil {
fmt.Println(err)
}

View File

@@ -214,6 +214,8 @@ func CLIReadBioSequences(filenames ...string) (obiiter.IBioSequence, error) {
iterator = iterator.Speed("Reading sequences")
iterator = iterator.RebatchBySize(obidefault.BatchMem(), obidefault.BatchSizeMax())
return iterator, nil
}

View File

@@ -291,5 +291,5 @@ func IndexReferenceDB(iterator obiiter.IBioSequence) obiiter.IBioSequence {
go f()
}
return indexed.Rebatch(obidefault.BatchSize())
return indexed.RebatchBySize(obidefault.BatchMem(), obidefault.BatchSizeMax())
}

85
pkg/obiutils/memsize.go Normal file
View File

@@ -0,0 +1,85 @@
package obiutils
import (
"fmt"
"strconv"
"strings"
"unicode"
)
// ParseMemSize parses a human-readable memory size string and returns the
// equivalent number of bytes. The value is a number optionally followed by a
// unit suffix (case-insensitive):
//
// B or (no suffix) — bytes
// K or KB — kibibytes (1 024)
// M or MB — mebibytes (1 048 576)
// G or GB — gibibytes (1 073 741 824)
// T or TB — tebibytes (1 099 511 627 776)
//
// Examples: "512", "128K", "128k", "64M", "1G", "2GB"
func ParseMemSize(s string) (int, error) {
s = strings.TrimSpace(s)
if s == "" {
return 0, fmt.Errorf("empty memory size string")
}
// split numeric prefix from unit suffix
i := 0
for i < len(s) && (unicode.IsDigit(rune(s[i])) || s[i] == '.') {
i++
}
numStr := s[:i]
unit := strings.ToUpper(strings.TrimSpace(s[i:]))
// strip trailing 'B' from two-letter units (KB→K, MB→M …)
if len(unit) == 2 && unit[1] == 'B' {
unit = unit[:1]
}
val, err := strconv.ParseFloat(numStr, 64)
if err != nil {
return 0, fmt.Errorf("invalid memory size %q: %w", s, err)
}
var multiplier float64
switch unit {
case "", "B":
multiplier = 1
case "K":
multiplier = 1024
case "M":
multiplier = 1024 * 1024
case "G":
multiplier = 1024 * 1024 * 1024
case "T":
multiplier = 1024 * 1024 * 1024 * 1024
default:
return 0, fmt.Errorf("unknown memory unit %q in %q", unit, s)
}
return int(val * multiplier), nil
}
// FormatMemSize formats a byte count as a human-readable string with the
// largest unit that produces a value ≥ 1 (e.g. 1536 → "1.5K").
func FormatMemSize(n int) string {
units := []struct {
suffix string
size int
}{
{"T", 1024 * 1024 * 1024 * 1024},
{"G", 1024 * 1024 * 1024},
{"M", 1024 * 1024},
{"K", 1024},
}
for _, u := range units {
if n >= u.size {
v := float64(n) / float64(u.size)
if v == float64(int(v)) {
return fmt.Sprintf("%d%s", int(v), u.suffix)
}
return fmt.Sprintf("%.1f%s", v, u.suffix)
}
}
return fmt.Sprintf("%dB", n)
}

View File

@@ -144,7 +144,7 @@ func (r *AsciiSet) TrimLeft(s string) string {
return s[i:]
}
func SplitInTwo(s string, sep byte) (string, string) {
func LeftSplitInTwo(s string, sep byte) (string, string) {
i := 0
for ; i < len(s); i++ {
c := s[i]
@@ -157,3 +157,17 @@ func SplitInTwo(s string, sep byte) (string, string) {
}
return s[:i], s[i+1:]
}
func RightSplitInTwo(s string, sep byte) (string, string) {
i := len(s) - 1
for ; i >= 0; i-- {
c := s[i]
if c == sep {
break
}
}
if i == len(s) {
return s, ""
}
return s[:i], s[i+1:]
}

36
tools/json2md.py Executable file
View File

@@ -0,0 +1,36 @@
#!/usr/bin/env python3
"""
Read potentially malformed JSON from stdin (aichat output), extract title and
body, and print them as plain text: title on first line, blank line, then body.
Exits with 1 on failure (no output).
"""
import sys
import json
import re
text = sys.stdin.read()
m = re.search(r'\{.*\}', text, re.DOTALL)
if not m:
sys.exit(1)
s = m.group()
obj = None
try:
obj = json.loads(s)
except Exception:
s2 = re.sub(r'(?<!\\)\n', r'\\n', s)
try:
obj = json.loads(s2)
except Exception:
sys.exit(1)
title = obj.get('title', '').strip()
body = obj.get('body', '').strip()
if not title or not body:
sys.exit(1)
print(f"{title}\n\n{body}")

View File

@@ -1 +1 @@
4.4.19
4.4.22