Compare commits

...

280 Commits

Author SHA1 Message Date
Eric Coissac f32b29db4f Release 4.4.33 2026-04-13 14:29:18 +02:00
Eric Coissac 10f49fe64b 📝 Clarify RegisterHTTP global registration intent
Registers the http module in the Lua state as a global, aligning with obicontext and BioSequence conventions. The change ensures consistent module exposure across Lua environments.
2026-04-13 14:29:16 +02:00
coissac d257917748 Merge pull request #106 from metabarcoding/push-qoqotlnktvls
Push qoqotlnktvls
2026-04-13 14:08:42 +02:00
Eric Coissac fec078c04c Release 4.4.32 2026-04-13 14:08:16 +02:00
Eric Coissac a92393dd51 ⬆️ update go.mod dependencies and improve error messages
- Bump github.com/buger/jsonparser from v1.1.1 to v1.2
- Add error details in log.Fatalf calls for better debugging
2026-04-13 14:08:13 +02:00
coissac 7e76698490 Merge pull request #105 from metabarcoding/push-pnqoquxmpqpq
Push pnqoquxmpqpq
2026-04-13 13:36:13 +02:00
Eric Coissac 64b0b32f61 Release 4.4.31 2026-04-13 13:35:39 +02:00
Eric Coissac c8e6a218cb [release] bump version to 4.4.31
- Update obioptions/version.go and version.txt for the 4.4.31 release
- Increment patch version from 4.4.30 to 4.4.31
- Align version.txt with current release tag
2026-04-13 13:35:33 +02:00
Eric Coissac 8c7017a99d ⬆️ version bump to 4.4.30
- Update obioptions.Version from "Release 4.4.29" to "Release 4.4.30"
- Update version.txt from 4.4.29 → 4.4.30
(automated by Makefile)
2026-04-13 13:34:53 +02:00
Eric Coissac 670edc1958 docs: add the global documentation for OBITools v4
- Add prompt_documentation_globale.md describing the three phases of documentation writing (file → package → tool)
- Non-significant .DS_Store files are present (to be ignored)
2026-03-31 19:02:42 +02:00
coissac f92f285417 Merge pull request #101 from metabarcoding/push-klzowrsmmnyv
Dynamic Batch Flushing and Build Improvements
2026-03-16 22:29:29 +01:00
Eric Coissac a786b58ed3 Dynamic Batch Flushing and Build Improvements
This release introduces dynamic batch flushing in the Distribute component, replacing the previous fixed-size batching with a memory- and count-aware strategy. Batches now flush automatically when either the maximum sequence count (BatchSizeMax()) or memory threshold (BatchMem()) per key is reached, ensuring more efficient resource usage and consistent behavior with the RebatchBySize strategy. The optional sizes parameter has been removed, and related code—including the Lua wrapper and worker buffer handling—has been updated for correctness and simplicity. Unused BatchSize() references have been eliminated from obidistribute.

Additionally, this release includes improvements to static Linux builds and overall build stability, enhancing reliability across deployment environments.
2026-03-16 22:06:51 +01:00
Eric Coissac a2b26712b2 refactor: replace fixed batch size with dynamic flushing based on count and memory
Replace the old fixed batch-size mechanism in Distribute with a dynamic strategy that flushes batches when either BatchSizeMax() sequences or BatchMem() bytes are reached per key. This aligns with the RebatchBySize strategy and removes the optional sizes parameter. Also update related code: simplify Lua wrapper to accept optional capacity, and fix buffer growth logic in worker.go using slices.Grow correctly. Remove unused BatchSize() usage from obidistribute.
2026-03-16 22:06:44 +01:00
coissac 1599abc9ad Merge pull request #99 from metabarcoding/push-urlyqwkrqypt
4.4.28: Static Linux Builds, Memory-Aware Batching, and Build Stability
2026-03-14 12:21:34 +01:00
Eric Coissac af213ab446 4.4.28: Static Linux Builds, Memory-Aware Batching, and Build Stability
This release focuses on improving build reliability, memory efficiency for large datasets, and portability of Linux binaries.

### Static Linux Binaries
- Linux binaries are now built with static linking using musl, eliminating external runtime dependencies and ensuring portability across distributions.

### Memory-Aware Batching
- Users can now control memory usage during processing with the new `--batch-mem` option, specifying limits such as 128K, 64M, or 1G.
- Batching logic now respects both size and memory constraints: batches are flushed when either threshold is exceeded.
- Conservative memory estimation for sequences helps avoid over-allocation, and explicit garbage collection after large batch discards reduces memory spikes.

### Build System Improvements
- Upgraded to Go 1.26 for improved performance and toolchain stability.
- Fixed cross-compilation issues by replacing generic include paths with architecture-specific ones (x86_64-linux-gnu and aarch64-linux-gnu).
- Streamlined macOS builds by removing special flags, using standard `make` targets.
- Enhanced error reporting during build failures: logs are now shown before cleanup and exit.
- Updated install script to correctly configure GOROOT, GOPATH, and GOTOOLCHAIN, with visual progress feedback for downloads.

All batching behavior is non-breaking and maintains backward compatibility while offering more predictable resource usage on large datasets.
2026-03-14 11:59:15 +01:00
Eric Coissac a60184c115 chore: bump version to 4.4.27 and add zlib-static dependency
Update version to 4.4.27 in version.txt and pkg/obioptions/version.go.

Add zlib-static package to release workflow to ensure static linking of zlib, resolving potential runtime dependency issues with the external link mode.
2026-03-14 11:59:04 +01:00
Eric Coissac 585b024bf0 chore: update to Go 1.26 and refactor release workflow
- Upgrade Go version from 1.23 to 1.26 in release.yml
- Remove CGO_CFLAGS from cross-compilation matrix entries
- Replace Linux build tools installation with Docker-based static build using golang:1.26-alpine
- Simplify macOS build to use standard make without special flags
- Increment version to 4.4.26
2026-03-14 11:43:31 +01:00
Eric Coissac afc9ffda85 chore: bump version to 4.4.25 and fix CGO_CFLAGS for cross-compilation
Update version to 4.4.25 in version.txt and pkg/obioptions/version.go.

Fix CGO_CFLAGS in release.yml by replacing generic '-I/usr/include' with architecture-specific paths (x86_64-linux-gnu and aarch64-linux-gnu) to ensure correct header inclusion during cross-compilation on Linux.
2026-03-13 19:30:29 +01:00
Eric Coissac fdd972bbd2 fix: add CGO_CFLAGS for static Linux builds and update go.work.sum
- Add CGO_CFLAGS environment variable to release workflow for Linux builds
- Update go.work.sum with new golang.org/x/net v0.38.0 entry
- Remove obsolete logs archive file
2026-03-13 19:24:18 +01:00
coissac 76f595e1fe Merge pull request #95 from metabarcoding/push-kzmrqmplznrn
Version 4.4.24
2026-03-13 19:13:02 +01:00
coissac 1e1e5443e3 Merge branch 'master' into push-kzmrqmplznrn 2026-03-13 19:12:49 +01:00
Eric Coissac 15d1f1fd80 Version 4.4.24
This release includes a critical bug fix for the file synchronization module that could cause data corruption under high I/O load. Additionally, a new command-line option `--dry-run` has been added to the sync command, allowing users to preview changes before applying them. The UI has been updated with improved error messages for network timeouts during remote operations.
2026-03-13 19:11:58 +01:00
Eric Coissac 8df2cbe22f Bump version to 4.4.23 and update release workflow
- Update version from 4.4.22 to 4.4.23 in version.txt and pkg/obioptions/version.go
- Add zlib1g-dev dependency to Linux release workflow for potential linking requirements
- Improve tag creation in Makefile by resolving commit hash with `jj log` for better CI/CD integration
2026-03-13 19:11:55 +01:00
coissac 58d685926b Merge pull request #94 from metabarcoding/push-lxxxlurqmqrt
4.4.23: Memory-aware batching, static Linux builds, and build improvements
2026-03-13 19:04:15 +01:00
Eric Coissac e9f24426df 4.4.23: Memory-aware batching, static Linux builds, and build improvements
### Memory-Aware Batching
- Introduced configurable min/max batch size bounds and memory limits for precise resource control.
- Added `--batch-mem` CLI option to enable adaptive batching based on estimated sequence memory footprint (e.g., 128K, 64M, 1G).
- Implemented `RebatchBySize()` to handle both byte and count limits, flushing when either threshold is exceeded.
- Added conservative memory estimation via `BioSequence.MemorySize()` and enhanced garbage collection for explicit cleanup after large batch discards.
- Updated internal batching logic across core modules to consistently apply default memory (128 MB) and size (min: 1, max: 2000) bounds.

### Linux Build Enhancements
- Enabled static linking for Linux binaries using musl, producing portable, self-contained executables without external dependencies.

### Build System & Toolchain Improvements
- Updated Go toolchain to 1.26.1 with corresponding dependency bumps (e.g., go-getoptions, gval, regexp2, go-json, progressbar, logrus, testify).
- Fixed Makefile to safely quote LDFLAGS for paths with spaces.
- Improved build error handling: on failure, logs are displayed before cleanup and exit.
- Updated install script to correctly set GOROOT, GOPATH, and GOTOOLCHAIN, ensuring GOPATH directory creation.
- Added progress bar to curl downloads in the install script for visual feedback during Go and OBITools4 downloads.

All batching behavior remains non-breaking, with consistent constraints improving predictability during large dataset processing.
2026-03-13 19:03:50 +01:00
Eric Coissac 2f7be10b5d Build improvements and Go version update
- Update Go version from 1.25.0 to 1.26.1 in go.mod and go.work
- Fix Makefile: quote LDFLAGS to handle spaces safely in -ldflags
- Improve build error handling: on failure, cat log then cleanup and exit with error code
- Update install_obitools.sh: properly set GOROOT, GOPATH, and GOTOOLCHAIN; ensure GOPATH directory is created
2026-03-13 19:03:42 +01:00
Eric Coissac 43125f9f5e feat: add progress bar to curl downloads in install script
Replace silent curl commands with --progress-bar option to provide visual feedback during Go and OBITools4 downloads, improving user experience without changing download logic.
2026-03-13 16:40:55 +01:00
Eric Coissac c23368e929 update dependencies and Go toolchain to 1.25.0
Update go.mod and go.work to Go 1.25.0, bump several direct dependencies (e.g., go-getoptions, gval, regexp2, go-json, progressbar, logrus, testify), update indirect dependencies accordingly, and remove obsolete toolchain directive.
2026-03-13 16:09:34 +01:00
coissac 6cb5a81685 Merge pull request #93 from metabarcoding/push-snmwxkwkqxrm
Memory-aware Batching and Static Linux Builds
2026-03-13 15:18:29 +01:00
Eric Coissac 94b0887069 Memory-aware Batching and Static Linux Builds
### Memory-Aware Batching
- Replaced single batch size limits with configurable min/max bounds and memory limits for more precise control over resource usage.
- Added `--batch-mem` CLI option to enable adaptive batching based on estimated sequence memory footprint (e.g., 128K, 64M, 1G).
- Introduced `RebatchBySize()` with explicit support for both byte and count limits, flushing when either threshold is exceeded.
- Implemented conservative memory estimation via `BioSequence.MemorySize()` and enhanced garbage collection to trigger explicit cleanup after large batch discards.
- Updated internal batching logic across `batchiterator.go`, `fragment.go`, and `obirefidx.go` to consistently use default memory (128 MB) and size (min: 1, max: 2000) bounds.

### Linux Build Enhancements
- Enabled static linking for Linux binaries using musl, producing portable, self-contained executables without external dependencies.

### Notes
- This release consolidates and improves batching behavior introduced in 4.4.20, with no breaking changes to the public API.
- All user-facing batching behavior is now governed by consistent memory and count constraints, improving predictability and stability during large dataset processing.
2026-03-13 15:16:41 +01:00
Eric Coissac c188580aac Replace Rebatch with RebatchBySize using default batch parameters
Replace calls to Rebatch(size) with RebatchBySize(obidefault.BatchMem(), obidefault.BatchSizeMax()) in batchiterator.go, fragment.go, and obirefidx.go to ensure consistent use of default memory and size limits for batch rebatching.
2026-03-13 15:16:33 +01:00
Eric Coissac 1e1f575d1c refactor: replace single batch size with min/max bounds and memory limits
Introduce separate _BatchSize (min) and _BatchSizeMax (max) constants to replace the single _BatchSize variable. Update RebatchBySize to accept both maxBytes and maxCount parameters, flushing when either limit is exceeded. Set default batch size min to 1, max to 2000, and memory limit to 128 MB. Update CLI options and sequence_reader.go accordingly.
2026-03-13 15:07:35 +01:00
Eric Coissac 40769bf827 Add memory-based batching support
Implement memory-aware batch sizing with --batch-mem CLI option, enabling adaptive batching based on estimated sequence memory footprint. Key changes:
- Added _BatchMem and related getters/setters in pkg/obidefault
- Implemented RebatchBySize() in pkg/obiter for memory-constrained batching
- Added BioSequence.MemorySize() for conservative memory estimation
- Integrated batch-mem option in pkg/obioptions with human-readable size parsing (e.g., 128K, 64M, 1G)
- Added obiutils.ParseMemSize/FormatMemSize for unit conversion
- Enhanced pool GC in pkg/obiseq/pool.go to trigger explicit GC for large slice discards
- Updated sequence_reader.go to apply memory-based rebatching when enabled
2026-03-13 14:54:21 +01:00
Eric Coissac 74e6fcaf83 feat: add static linking for Linux builds using musl
Enable static linking for Linux binaries by installing musl-tools and passing appropriate LDFLAGS during build. This ensures portable, self-contained executables for Linux targets.
2026-03-13 14:26:31 +01:00
coissac 30ec8b1b63 Merge pull request #92 from metabarcoding/push-mvpuxnxoyypu
4.4.21: Parallel builds, robust installation, and rope-based parsing enhancements
2026-03-13 12:00:32 +01:00
Eric Coissac cdc72c5346 4.4.21: Parallel builds, robust installation, and rope-based parsing enhancements
This release introduces significant improvements to build reliability and performance, alongside key parsing enhancements for sequence data.

### Build & Installation Improvements
- Added support for parallel compilation via `-j/--jobs` option in both the Makefile and install script, enabling faster builds on multi-core systems. The default remains single-threaded for safety.
- Enhanced Makefile with `.DEFAULT_GOAL := all` for consistent behavior and a documented `help` target.
- Replaced fragile file operations with robust error handling, clear diagnostics, and automatic preservation of the build directory on copy failures to aid recovery.

### Rope-Based Parsing Enhancements (from 4.4.20)
- Introduced direct rope-based parsers for FASTA, EMBL, and FASTQ formats, improving memory efficiency for large files.
- Added U→T conversion support during sequence extraction and more reliable line ending detection.
- Unified rope scanning logic under a new `ropeScanner` for better maintainability.
- Added `TakeQualities()` method to BioSequence for more efficient handling of quality data.

### Bug Fixes (from 4.4.20)
- Fixed `CompressStream` to correctly respect the `compressed` variable.
- Replaced ambiguous string splitting utilities with precise left/right split variants (`LeftSplitInTwo`, `RightSplitInTwo`).

### Release Tooling (from 4.4.20)
- Streamlined release process with modular targets (`jjpush-notes`, `jjpush-push`, `jjpush-tag`) and AI-assisted note generation via `aichat`.
- Improved versioning support via the `VERSION` environment variable in `bump-version`.
- Switched PR submission from raw `jj git push` to `stakk` for consistency and reliability.

Note: This release incorporates key enhancements from 4.4.20 that impact end users, while focusing on build robustness and performance gains.
2026-03-13 11:59:32 +01:00
Eric Coissac 82a9972be7 Add parallel compilation support and improve Makefile/install script robustness
- Add .DEFAULT_GOAL := all to Makefile for consistent default target
- Document -j/--jobs option in README.md to allow parallel compilation
- Add JOBS variable and -j/--jobs argument to install script (default: 1)
- Replace fragile mkdir/cp commands with robust error handling and clear diagnostics
- Add build directory preservation on copy failure for manual recovery
- Pass -j option to make during compilation to enable parallel builds
2026-03-13 11:59:20 +01:00
coissac ff6e515b2a Merge pull request #91 from metabarcoding/push-uotrstkymowq
4.4.20: Rope-based parsing, improved release tooling, and bug fixes
2026-03-12 20:15:33 +01:00
Eric Coissac cd0c525f50 4.4.20: Rope-based parsing, improved release tooling, and bug fixes
### Enhancements
- **Rope-based parsing**: Added direct rope parsing for FASTA, EMBL, and FASTQ formats via `FastaChunkParserRope`, `EmblChunkParserRope`, and `FastqChunkParserRope`. Sequence extraction now supports U→T conversion and improved line ending detection.
- **Rope scanner refactoring**: Unified rope scanning logic under a new `ropeScanner`, improving maintainability and consistency.
- **Sequence handling**: Added `TakeQualities()` method to BioSequence for more efficient quality data handling.

### Bug Fixes
- **Compression behavior**: Fixed `CompressStream` to correctly use the `compressed` variable instead of a hardcoded boolean.
- **String splitting**: Replaced ambiguous `SplitInTwo` calls with precise `LeftSplitInTwo` or `RightSplitInTwo`, and added dedicated right-split utility.

### Tooling & Workflow Improvements
- **Makefile enhancements**: Added colored terminal output, a `help` target for documenting all targets, and improved release workflow automation.
- **Release process**: Refactored `jjpush` into modular targets (`jjpush-notes`, `jjpush-push`, `jjpush-tag`), replaced `orla` with `aichat` for AI-assisted release notes, and introduced robust JSON parsing using Python. Release notes are now generated and stored in temp files for tag creation.
- **Versioning**: `bump-version` now supports the VERSION environment variable for manual version setting.
- **Submission**: Switched from raw `jj git push` to `stakk` for PR submission.

### Internal Notes
- Installation instructions are now included in release tags.
- Fixed-size carry buffer replaced with dynamic slice for arbitrarily long line support without extra allocations.
2026-03-12 20:14:11 +01:00
Eric Coissac abe935aa18 Add help target, colorize output, and improve release workflow
- Add colored terminal output support (GREEN, YELLOW, BLUE, NC)
- Introduce `help` target to document all Makefile targets
- Enhance `bump-version` to accept VERSION env var for manual version setting
- Refactor jjpush: split into modular targets (jjpush-notes, jjpush-push, jjpush-tag)
- Replace orla with aichat for AI-powered release notes generation
- Add robust JSON parsing using Python for release notes extraction
- Use stakk for PR submission (replacing raw `jj git push`)
- Generate and store release notes in temp files for tag creation
- Add installation instructions to release tags
- Update .PHONY with new targets

4.4.20: Rope-based parsing, improved release tooling, and bug fixes

### Enhancements
- **Rope-based parsing**: Added direct rope parsing for FASTA, EMBL, and FASTQ formats via `FastaChunkParserRope`, `EmblChunkParserRope`, and `FastqChunkParserRope` functions, eliminating unnecessary memory allocation via Pack(). Sequence extraction now supports U→T conversion and improved line ending detection.
- **Rope scanner refactoring**: Unified rope scanning logic under a new `ropeScanner`, improving maintainability and consistency across parsers.
- **Sequence handling**: Added `TakeQualities()` method to BioSequence for more efficient quality data handling.

### Bug Fixes
- **Compression behavior**: Fixed CompressStream to correctly use the `compressed` variable instead of a hardcoded boolean.
- **String splitting**: Replaced ambiguous `SplitInTwo` calls with precise `LeftSplitInTwo` or `RightSplitInTwo`, and added dedicated right-split utility.

### Tooling & Workflow Improvements
- **Makefile enhancements**: Added colored terminal output, a `help` target for documenting all targets, and improved release workflow automation.
- **Release process**: Refactored `jjpush` into modular targets (`jjpush-notes`, `jjpush-push`, `jjpush-tag`), replaced `orla` with `aichat` for AI-assisted release notes, and introduced robust JSON parsing using Python. Release notes are now generated and stored in temp files for tag creation.
- **Versioning**: `bump-version` now supports the VERSION environment variable for manual version setting.
- **Submission**: Switched from raw `jj git push` to `stakk` for PR submission.

### Internal Notes
- Installation instructions are now included in release tags.
- Fixed-size carry buffer replaced with dynamic slice for arbitrarily long line support without extra allocations.
2026-03-12 20:14:11 +01:00
Eric Coissac 8dd32dc1bf Fix CompressStream call to use compressed variable
Replace hardcoded boolean with the `compressed` variable in CompressStream call to ensure correct compression behavior.
2026-03-12 18:48:22 +01:00
Eric Coissac 6ee8750635 Replace SplitInTwo with LeftSplitInTwo/RightSplitInTwo for precise splitting
Replace SplitInTwo calls with LeftSplitInTwo or RightSplitInTwo depending on the intended split direction. In fastseq_json_header.go, extract rank from suffix without splitting; in biosequenceslice.go and taxid.go, use LeftSplitInTwo to split from the left; add RightSplitInTwo utility function for splitting from the right.
2026-03-12 18:41:28 +01:00
Eric Coissac 8c318c480e replace fixed-size carry buffer with dynamic slice
Replace the fixed [256]byte carry buffer with a dynamic []byte slice to support arbitrarily long lines without heap allocation during accumulation. Update all carry buffer handling logic to use len(s.carry) and append instead of fixed-size copy operations.
2026-03-11 20:44:45 +01:00
Eric Coissac 09fbc217d3 Add EMBL rope parsing support and improve sequence extraction
Introduce EmblChunkParserRope function to parse EMBL chunks directly from a rope without using Pack(). Add extractEmblSeq helper to scan sequence sections and handle U to T conversion. Update parser logic to use rope-based parsing when available, and fix feature table handling for WGS entries.
2026-03-10 17:02:14 +01:00
Eric Coissac 3d2e205722 Refactor rope scanner and add FASTQ rope parser
This commit refactors the rope scanner implementation by renaming gbRopeScanner to ropeScanner and extracting the common functionality into a new file. It also introduces a new FastqChunkParserRope function that parses FASTQ chunks directly from a rope without Pack(), enabling more efficient memory usage. The existing parsers are updated to use the new rope-based parser when available. The BioSequence type is enhanced with a TakeQualities method for more efficient quality data handling.
2026-03-10 16:47:03 +01:00
Eric Coissac 623116ab13 Add rope-based FASTA parsing and improve sequence handling
Introduce FastaChunkParserRope for direct rope-based FASTA parsing, enhance sequence extraction with whitespace skipping and U->T conversion, and update parser logic to support both rope and raw data sources.

- Added extractFastaSeq function to scan sequence bytes directly from rope
- Implemented FastaChunkParserRope for rope-based parsing
- Modified _ParseFastaFile to use rope when available
- Updated sequence handling to support U->T conversion
- Fixed line ending detection for FASTA parsing
2026-03-10 16:34:33 +01:00
coissac 1e4509cb63 Merge pull request #90 from metabarcoding/push-uzpqqoqvpnxw
Push uzpqqoqvpnxw
2026-03-10 15:53:08 +01:00
Eric Coissac b33d7705a8 Bump version to 4.4.19
Update version from 4.4.18 to 4.4.19 in both version.txt and pkg/obioptions/version.go
2026-03-10 15:51:36 +01:00
Eric Coissac 1342c83db6 Use NewBioSequenceOwning to avoid unnecessary sequence copying
Replace NewBioSequence with NewBioSequenceOwning in genbank_read.go to take ownership of sequence slices without copying, improving performance. Update biosequence.go to add the new TakeSequence method and NewBioSequenceOwning constructor.
2026-03-10 15:51:35 +01:00
Eric Coissac b246025907 Optimize Fasta batch formatting
Optimize FormatFastaBatch to pre-allocate buffer and write sequences directly without intermediate strings, improving performance and memory usage.
2026-03-10 15:43:59 +01:00
Eric Coissac 761e0dbed3 Implement a rope-based GenBank parser to reduce memory usage
Add a rope-based GenBank parser to reduce memory usage (RSS) and heap allocations.

- Add `gbRopeScanner` to read lines without heap allocation
- Implement `GenbankChunkParserRope`, which uses the rope instead of `Pack()`
- Modify `_ParseGenbankFile` and `ReadGenbank` to use the new parser
- Expected RSS reduction from 57 GB to ~128 MB × workers
- Keep the old parser for compatibility and testing

Significant reduction in allocations (~50M) and in sys time, with comparable or better user time.
2026-03-10 15:35:36 +01:00
Eric Coissac a7ea47624b Optimize parsing of large sequences
Implements an optimization for parsing large sequences by avoiding unnecessary memory allocation when merging chunks. Adds support for parsing the rope structure directly, which reduces allocations and improves performance when processing multi-Gbp GenBank/EMBL and FASTA/FASTQ files. The parsers are updated to use the unpacked rope and the new in-place write mechanism for GenBank sequences.
2026-03-10 14:20:21 +01:00
Eric Coissac 61e346658e Refactor jjpush workflow and enhance release notes generation
Split the jjpush target into multiple sub-targets (jjpush-describe, jjpush-bump, jjpush-push, jjpush-tag) for better modularity and control.

Enhance release notes generation by:
- Using git log with full commit messages instead of GitHub API for pre-release mode
- Adding robust JSON parsing with fallbacks for release notes
- Including detailed installation instructions in release notes
- Supporting both pre-release and published release modes

Update release_notes.sh to handle pre-release mode, improve commit message fetching, and add installation section to release notes.

Add .PHONY declarations for new sub-targets.
2026-03-10 11:09:19 +01:00
coissac 1ba1294b11 Merge pull request #89 from metabarcoding/push-uoqxkozlonwx
Push uoqxkozlonwx
2026-02-20 11:42:40 +01:00
Eric Coissac b2476fffcb Bump version to 4.4.18
Update version from 4.4.17 to 4.4.18 in version.txt and corresponding Go variable _Version.
2026-02-20 11:40:43 +01:00
Eric Coissac b05404721e Bump version to 4.4.16
Update version from 4.4.15 to 4.4.16 in version.go and version.txt files.
2026-02-20 11:40:40 +01:00
Eric Coissac c57e788459 Fix GenBank parsing and add release notes script
This commit fixes an issue in the GenBank parser where empty parts were being included in the parsed data. It also introduces a new script `release_notes.sh` to automate the generation of GitHub-compatible release notes for OBITools4 versions, including support for LLM summarization and various output modes.
2026-02-20 11:37:51 +01:00
coissac 1cecf23978 Merge pull request #86 from metabarcoding/push-oulwykrpwxuz
Push oulwykrpwxuz
2026-02-11 06:34:05 +01:00
Eric Coissac 4c824ef9b7 Bump version to 4.4.15
Update version from 4.4.14 to 4.4.15 in version.txt and pkg/obioptions/version.go
2026-02-11 06:31:11 +01:00
Eric Coissac 1ce5da9bee Support new sequence file formats and improve error handling
Add support for .gbff and .gbff.gz file extensions in sequence reader.

Update the logic to return an error instead of using NilIBioSequence when no sequence files are found, improving the error handling and user feedback.
2026-02-11 06:31:10 +01:00
coissac dc23d9de9a Merge pull request #85 from metabarcoding/push-smturnsrozkp
Push smturnsrozkp
2026-02-10 22:19:22 +01:00
Eric Coissac aa9d7bbf72 Bump version to 4.4.14
Update version number from 4.4.13 to 4.4.14 in both version.go and version.txt files.
2026-02-10 22:17:23 +01:00
Eric Coissac db22d20d0a Rename obisuperkmer test script to obik-super and update command references
Update test script name from obisuperkmer to obik-super and adjust all command references accordingly.

- Changed TEST_NAME from 'obisuperkmer' to 'obik-super'
- Changed CMD from 'obisuperkmer' to 'obik'
- Updated MCMD to 'OBIk-super'
- Modified command calls to use '$CMD super' instead of direct command names
- Updated help test to use '$CMD super -h'
- Updated all test cases to use the new command format
2026-02-10 22:17:22 +01:00
coissac 7c05bdb01c Merge pull request #84 from metabarcoding/push-uxvowwlxkrlq
Push uxvowwlxkrlq
2026-02-10 22:12:18 +01:00
Eric Coissac b6542c4523 Bump version to 4.4.13
Update version from 4.4.12 to 4.4.13 in version.txt and pkg/obioptions/version.go
2026-02-10 22:10:38 +01:00
Eric Coissac ac41dd8a22 Refactor k-mer matching pipeline with improved concurrency and memory management
Refactor k-mer matching to use a pipeline architecture with improved concurrency and memory management:

- Replace sort.Slice with slices.SortFunc and cmp.Compare for better performance
- Introduce PreparedQueries struct to encapsulate query buckets with metadata
- Implement MergeQueries function to merge query buckets from multiple batches
- Rewrite MatchBatch to use pre-allocated results and mutexes instead of map-based accumulation
- Add seek optimization in matchPartition to reduce linear scanning
- Refactor match command to use a multi-stage pipeline with proper batching and merging
- Add index directory option for match command
- Improve parallel processing of sequence batches

This refactoring improves performance by reducing memory allocations, optimizing k-mer lookup, and implementing a more efficient pipeline for large-scale k-mer matching operations.
2026-02-10 22:10:36 +01:00
Eric Coissac bebbbbfe7d Add entropy-based filtering for k-mers
This commit introduces entropy-based filtering for k-mers to remove low-complexity sequences. It adds:

- New KmerEntropy and KmerEntropyFilter functions in pkg/obikmer/entropy.go for computing and filtering k-mer entropy
- Integration of entropy filtering in the k-mer set builder (pkg/obikmer/kmer_set_builder.go)
- A new 'filter' command in obik tool (pkg/obitools/obik/filter.go) to apply entropy filtering on existing indices
- CLI options for configuring entropy filtering during index building and filtering

The entropy filter helps improve the quality of k-mer sets by removing repetitive sequences that may interfere with downstream analyses.
2026-02-10 18:20:35 +01:00
Eric Coissac c6e04265f1 Add sparse index support for KDI files with fast seeking
This commit introduces sparse index support for KDI files to enable fast random access during k-mer matching. It adds a new .kdx index file format and updates the KDI reader and writer to handle index creation and seeking. The changes include:

- New KdxIndex struct and related functions for loading, searching, and writing .kdx files
- Modified KdiReader to support seeking with the new index
- Updated KdiWriter to create .kdx index files during writing
- Enhanced KmerSetGroup.Contains to use the new index for faster lookups
- Added a new 'match' command to annotate sequences with k-mer match positions

The index is created automatically during KDI file creation and allows for O(log N / stride) binary search followed by at most stride linear scan steps, significantly improving performance for large datasets.
2026-02-10 13:24:24 +01:00
Eric Coissac 9babcc0fae Refactor lowmask options and shared kmer options
This commit refactors the lowmask subcommand to use shared kmer options and CLI getters instead of local variables. It also moves the kmer size and minimizer size options to a shared location and adds new CLI getters for the lowmask options.

- Move kmer size and minimizer size options to shared location
- Add CLI getters for lowmask options
- Refactor lowmask to use CLI getters
- Remove unused strings import
- Add MaskingMode type and related functions
2026-02-10 09:52:38 +01:00
Eric Coissac e775f7e256 Add option to keep shorter fragments in lowmask
Add a new boolean option 'keep-shorter' to preserve fragments shorter than kmer-size during split/extract mode.

This change introduces a new flag _lowmaskKeepShorter that controls whether fragments
shorter than the kmer size should be kept during split/extract operations.

The implementation:
1. Adds the new boolean variable _lowmaskKeepShorter
2. Registers the command-line option "keep-shorter"
3. Updates the lowMaskWorker function signature to accept the keepShorter parameter
4. Modifies the fragment selection logic to check the keepShorter flag
5. Updates the worker creation to pass the global flag value

This allows users to control the behavior when dealing with short sequences in
split/extract modes, providing more flexibility in low-complexity masking.
2026-02-10 09:36:42 +01:00
Eric Coissac f2937af1ad Add max frequency filtering and top-kmer saving capabilities
This commit introduces max frequency filtering to limit k-mer occurrences and adds functionality to save the N most frequent k-mers per set to CSV files. It also includes the ability to output k-mer frequency spectra as CSV and updates the CLI options accordingly.
2026-02-10 09:27:04 +01:00
Eric Coissac 56c1f4180c Refactor k-mer index management with subcommands and enhanced metadata support
This commit refactors the k-mer index management tools to use a unified subcommand structure with obik, adds support for per-set metadata and ID management, enhances the k-mer set group builder to support appending to existing groups, and improves command-line option handling with a new global options registration system.

Key changes:
- Introduce obik command with subcommands (index, ls, summary, cp, mv, rm, super, lowmask)
- Add support for per-set metadata and ID management in kmer set groups
- Implement ability to append to existing kmer index groups
- Refactor option parsing to use a global options registration system
- Add new commands for listing, copying, moving, and removing sets
- Enhance low-complexity masking with new options and output formats
- Improve kmer index summary with Jaccard distance matrix support
- Remove deprecated obikindex and obisuperkmer commands
- Update build process to use the new subcommand structure
2026-02-10 06:49:31 +01:00
Eric Coissac f78543ee75 Refactor k-mer index building to use disk-based KmerSetGroupBuilder
Refactor k-mer index building to use the new disk-based KmerSetGroupBuilder instead of the old KmerSet and FrequencyFilter approaches. This change introduces a more efficient and scalable approach to building k-mer indices by using partitioned disk storage with streaming operations.

- Replace BuildKmerIndex and BuildFrequencyFilterIndex with KmerSetGroupBuilder
- Add support for frequency filtering via WithMinFrequency option
- Remove deprecated k-mer set persistence methods
- Update CLI to use new builder approach
- Add new disk-based k-mer operations (union, intersect, difference, quorum)
- Introduce KDI (K-mer Delta Index) file format for efficient storage
- Add K-way merge operations for combining sorted k-mer streams
- Update documentation and examples to reflect new API

This refactoring provides better memory usage, faster operations on large datasets, and more flexible k-mer set operations.
2026-02-10 06:49:31 +01:00
Eric Coissac a016ad5b8a Refactor kmer index to disk-based partitioning with minimizer
Refactor kmer index package to use disk-based partitioning with minimizer

- Replace roaring64 bitmaps with disk-based kmer index
- Implement partitioned kmer sets with delta-varint encoding
- Add support for frequency filtering during construction
- Introduce new builder pattern for index construction
- Add streaming operations for set operations (union, intersect, etc.)
- Add support for super-kmer encoding during construction
- Update command line tool to use new index format
- Remove dependency on roaring bitmap library

This change introduces a new architecture for kmer indexing that is more memory efficient and scalable for large datasets.
2026-02-09 17:52:37 +01:00
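The delta-varint encoding mentioned above exploits the fact that gaps between consecutive values in a sorted k-mer stream are small. A minimal round-trip sketch using the standard library's varints (function names are illustrative, not the KDI format itself):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeDeltas encodes a sorted slice of k-mer codes as varint-encoded
// gaps between consecutive values, so dense sorted streams cost only a
// few bytes per k-mer.
func encodeDeltas(sorted []uint64) []byte {
	buf := make([]byte, 0, len(sorted)*2)
	tmp := make([]byte, binary.MaxVarintLen64)
	var prev uint64
	for _, v := range sorted {
		n := binary.PutUvarint(tmp, v-prev)
		buf = append(buf, tmp[:n]...)
		prev = v
	}
	return buf
}

// decodeDeltas reverses encodeDeltas by accumulating the gaps.
func decodeDeltas(buf []byte) []uint64 {
	var out []uint64
	var prev uint64
	for len(buf) > 0 {
		d, n := binary.Uvarint(buf)
		prev += d
		out = append(out, prev)
		buf = buf[n:]
	}
	return out
}

func main() {
	kmers := []uint64{100, 105, 106, 100000}
	enc := encodeDeltas(kmers)
	fmt.Println(decodeDeltas(enc)) // [100 105 106 100000]
	fmt.Println(len(enc))          // 6 bytes for 4 values
}
```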
coissac 09d437d10f Merge pull request #83 from metabarcoding/push-xssnppvunmlq
Push xssnppvunmlq
2026-02-09 09:58:06 +01:00
Eric Coissac d00ab6f83a Bump version from 4.4.11 to 4.4.12
Update version number in version.txt from 4.4.11 to 4.4.12
2026-02-09 09:46:12 +01:00
Eric Coissac 8037860518 Update version and improve release note generation
Update version from 4.4.11 to 4.4.12

- Bump version in version.go
- Enhance release note generation in Makefile to use JSON output from orla and fallback to raw output if JSON parsing fails
- Improve test script to verify minimum super k-mer length is >= k (default k=31)
2026-02-09 09:46:10 +01:00
coissac 43d6cbe56a Merge pull request #82 from metabarcoding/push-vkprtnlyxmkl
Push vkprtnlyxmkl
2026-02-09 09:16:20 +01:00
Eric Coissac 6dadee9371 Bump version to 4.4.12
Update version from 4.4.11 to 4.4.12 in version.txt and pkg/obioptions/version.go
2026-02-09 09:05:49 +01:00
Eric Coissac 99a8e69d10 Optimize low-complexity masking algorithm
This commit optimizes the low-complexity masking algorithm by:

1. Precomputing logarithm values and normalization tables to avoid repeated calculations
2. Replacing the MinMultiset-based sliding minimum with a more efficient deque-based implementation
3. Improving entropy calculation by using precomputed n*log(n) values
4. Simplifying the circular normalization process with precomputed tables
5. Removing unused imports and log statements

The changes significantly improve performance while maintaining the same masking behavior.
2026-02-09 09:05:46 +01:00
Eric Coissac c0ae49ef92 Add obilowmask_ref to .gitignore
Add the obilowmask_ref file to .gitignore so that it is not tracked by Git.
2026-02-08 19:31:12 +01:00
Eric Coissac 08490420a2 Fix whitespace in test script and add merge consistency tests
This commit fixes minor whitespace issues in the test script and adds new tests to ensure merge attribute consistency between in-memory and on-disk paths.

- Removed trailing spaces in log messages
- Added tests for merge consistency between in-memory and on-disk paths
- These tests catch a bug where shared classifier in on-disk dereplication path caused incorrect merged attributes
2026-02-08 18:08:29 +01:00
Eric Coissac 1a28d5ed64 Add progress bar configuration and conditional display
This commit introduces a new configuration module `obidefault` to manage progress bar settings, allowing users to disable progress bars via a `--no-progressbar` option. It updates various packages to conditionally display progress bars based on this new configuration, improving user experience by providing control over progress bar output. The changes also include improvements to progress bar handling in several packages, ensuring they are only displayed when appropriate (e.g., when stderr is a terminal and stdout is not piped).
2026-02-08 16:14:02 +01:00
Eric Coissac b2d16721f0 Fix classifier cloning and reset in chunk processing
This commit fixes an issue in the chunk processing logic where the wrong classifier instance was being reset and used for code generation. A local clone of the classifier is now created and used to ensure correct behavior during dereplication.
2026-02-08 15:52:25 +01:00
Eric Coissac 7c12b1ee83 Disable progress bar when output is piped
Modify CLIProgressBar function to check if stdout is a named pipe and disable the progress bar accordingly. This prevents the progress bar from being displayed when the output is redirected or piped to another command.
2026-02-08 14:48:13 +01:00
Eric Coissac db98ddb241 Fix super k-mer minimizer bijection and add validation test
This commit addresses a bug in the super k-mer implementation where the minimizer bijection property was not properly enforced. The fix ensures that:

1. All k-mers within a super k-mer share the same minimizer
2. Identical super k-mer sequences have the same minimizer

The changes include:

- Fixing the super k-mer iteration logic to properly validate the minimizer bijection property
- Adding a comprehensive test suite (TestSuperKmerMinimizerBijection) that validates the intrinsic property of super k-mers
- Updating the .gitignore file to properly track relevant files

This resolves issues where the same sequence could be associated with different minimizers, violating the super k-mer definition.
2026-02-08 13:47:33 +01:00
Eric Coissac 7a979ba77f Add obisuperkmer command implementation and tests
This commit adds the implementation of the obisuperkmer command, including:

- The main command in cmd/obitools/obisuperkmer/
- The package implementation in pkg/obitools/obisuperkmer/
- Automated tests in obitests/obitools/obisuperkmer/
- Documentation for the implementation and tests

The obisuperkmer command extracts super k-mers from DNA sequences, following the standard OBITools architecture. It includes proper CLI option handling, validation of parameters, and integration with the OBITools pipeline system.

Tests cover basic functionality, parameter validation, output format, metadata preservation, and file I/O operations.
2026-02-07 13:54:02 +01:00
Eric Coissac 00c8be6b48 docs: add architecture documentation for OBITools commands
Add detailed documentation on the architecture of OBITools commands, covering the modular structure, architectural patterns, and best practices for creating new commands.
2026-02-07 12:26:35 +01:00
Eric Coissac 4ae331db36 Refactor SuperKmer extraction to use iterator pattern
This commit refactors the SuperKmer extraction functionality to use Go's new iterator pattern. The ExtractSuperKmers function is now implemented as a wrapper around a new IterSuperKmers iterator function, which yields results one at a time instead of building a complete slice. This change provides better memory efficiency and more flexible consumption of super k-mers. The functionality remains the same, but the interface is now more idiomatic and efficient for large datasets.
2026-02-07 12:23:12 +01:00
Eric Coissac f1e2846d2d Improve the release process with automatic release-note generation
Update the Makefile to improve the version-bump and tag-creation process.

- Use variables to store the previous and current versions
- Automatically generate release notes from the commits between tags
- Add fallback logic when orla is not available
- Improve documentation of the release process steps
- Update the tag-creation command to use the generated message
2026-02-07 11:48:26 +01:00
coissac cd5562fb30 Merge pull request #81 from metabarcoding/push-nrylumyxtxnr
Push nrylumyxtxnr
2026-02-06 10:10:22 +01:00
Eric Coissac f79b018430 Bump version to 4.4.11
Update version from 4.4.10 to 4.4.11 in version.txt and pkg/obioptions/version.go
2026-02-06 10:09:56 +01:00
Eric Coissac aa819618c2 Enhance OBITools4 installation script with version control and documentation
Update installation script to support specific version installation, list available versions, and improve documentation.

- Add support for installing specific versions with -v/--version flag
- Add -l/--list flag to list all available versions
- Improve help message with examples
- Update README.md to reflect new installation options and examples
- Add note on version compatibility between OBITools2 and OBITools4
- Remove ecoprimers directory
- Improve error handling and user feedback during installation
- Add version detection and download logic from GitHub releases
- Update installation process to use tagged releases instead of master branch
2026-02-06 10:09:54 +01:00
coissac da8d851d4d Merge pull request #80 from metabarcoding/push-vvonlpwlnwxy
Remove ecoprimers submodule
2026-02-06 09:53:29 +01:00
Eric Coissac 9823bcb41b Remove ecoprimers submodule 2026-02-06 09:52:54 +01:00
coissac 9c162459b0 Merge pull request #79 from metabarcoding/push-tpytwyyyostt
Remove ecoprimers submodule
2026-02-06 09:51:42 +01:00
Eric Coissac 25b494e562 Remove ecoprimers submodule 2026-02-06 09:50:45 +01:00
coissac 0b5cadd104 Merge pull request #78 from metabarcoding/push-pwvvkzxzmlux
Push pwvvkzxzmlux
2026-02-06 09:48:47 +01:00
Eric Coissac a2106e4e82 Bump version to 4.4.10
Update version from 4.4.9 to 4.4.10 in version.txt and pkg/obioptions/version.go
2026-02-06 09:48:27 +01:00
Eric Coissac a8a00ba0f7 Simplify artifact packaging and update release notes
This commit simplifies the artifact packaging process by creating a single tar.gz file containing all binaries for each platform, instead of individual files. It also updates the release notes to reflect the new packaging approach and corrects the documentation to use the new naming convention 'obitools4' instead of '<tool>'.
2026-02-06 09:48:25 +01:00
coissac 1595a74ada Merge pull request #77 from metabarcoding/push-lwtnswxmorrq
Push lwtnswxmorrq
2026-02-06 09:35:05 +01:00
Eric Coissac 68d723ecba Bump version to 4.4.9
Update version from 4.4.8 to 4.4.9 in version.txt and corresponding Go file.
2026-02-06 09:34:43 +01:00
Eric Coissac 250d616129 Update release workflows for the new OS versions
Update the release workflow to use ubuntu-24.04-arm instead of ubuntu-latest for ARM64, and macos-15-intel instead of macos-latest for macOS. Remove cross-compilation for ARM64 and adjust the build-tool installation for macOS.
2026-02-06 09:34:41 +01:00
coissac fbf816d219 Merge pull request #76 from metabarcoding/push-tzpmmnnxkvxx
Push tzpmmnnxkvxx
2026-02-06 09:09:05 +01:00
Eric Coissac 7f0133a196 Bump version to 4.4.8
Update version from 4.4.7 to 4.4.8 in version.txt and _Version variable.
2026-02-06 09:08:35 +01:00
Eric Coissac f798f22434 Add cross-platform binary builds and release workflow improvements
This commit introduces a new build job that compiles binaries for multiple platforms (Linux, macOS) and architectures (amd64, arm64). It also refactors the release process to download pre-built artifacts and simplify the release directory preparation. The workflow now uses matrix strategy for building binaries and downloads all artifacts for the final release, removing the previous manual build steps for each platform.
2026-02-06 09:08:33 +01:00
coissac 248bc9f672 Merge pull request #75 from metabarcoding/push-mxxuykppzlpw
Push mxxuykppzlpw
2026-02-05 18:11:12 +01:00
Eric Coissac 7a7db703f1 Bump version to 4.4.7
Update version from 4.4.6 to 4.4.7 in version.txt and pkg/obioptions/version.go
2026-02-05 18:10:45 +01:00
Eric Coissac da195ac5cb Optimize binary builds
Modify the release workflow file to compile only the obitools binaries when building for each platform (Linux AMD64, Linux ARM64, macOS AMD64, macOS ARM64, Windows AMD64). This streamlines the build process by producing only the binaries that are needed.
2026-02-05 18:10:43 +01:00
coissac 20a0a09f5f Merge pull request #74 from metabarcoding/push-yqrwnpmoqllk
Push yqrwnpmoqllk
2026-02-05 18:03:28 +01:00
coissac 7d8c578c57 Merge branch 'master' into push-yqrwnpmoqllk 2026-02-05 18:03:18 +01:00
Eric Coissac d7f615108f Bump version to 4.4.6
Update version from 4.4.5 to 4.4.6 in version.txt and pkg/obioptions/version.go
2026-02-05 18:02:30 +01:00
Eric Coissac 71574f240b Update version and add CI tests
Update version to 4.4.5 and add a test job in the release workflow to ensure tests pass before creating a release.
2026-02-05 18:02:28 +01:00
coissac c98501a898 Merge pull request #73 from metabarcoding/push-pklkwsssrkuv
Push pklkwsssrkuv
2026-02-05 17:54:39 +01:00
Eric Coissac 23f145a4c2 Bump version to 4.4.5
Update version number from 4.4.4 to 4.4.5 in both version.go and version.txt files.
2026-02-05 17:53:53 +01:00
Eric Coissac fe6d74efbf Add automated release workflow and update tag creation
This commit introduces a new GitHub Actions workflow to automatically create releases when tags matching the pattern 'Release_*' are pushed. It also updates the Makefile to use the new tag format 'Release_<version>' for tagging commits, ensuring consistency with the new release automation.
2026-02-05 17:53:52 +01:00
coissac cff8135468 Merge pull request #72 from metabarcoding/push-zsprzlqxurrp
Push zsprzlqxurrp
2026-02-05 17:42:48 +01:00
Eric Coissac 02ab683fa0 Bump version to 4.4.4
Update version from 4.4.3 to 4.4.4 in version.txt and pkg/obioptions/version.go
2026-02-05 17:42:01 +01:00
Eric Coissac de88e7eecd Fix typo in variable name
Corrected a typo in the variable name 'usreId' to 'userId' to ensure proper functionality.
2026-02-05 17:41:59 +01:00
Eric Coissac e3c41fc11b Add Jaccard distance and similarity computations for KmerSet and KmerSetGroup

This commit introduces Jaccard distance and similarity methods for KmerSet and KmerSetGroup.

For KmerSet:
- Added JaccardDistance method to compute the Jaccard distance between two KmerSets
- Added JaccardSimilarity method to compute the Jaccard similarity between two KmerSets

For KmerSetGroup:
- Added JaccardDistanceMatrix method to compute a pairwise Jaccard distance matrix
- Added JaccardSimilarityMatrix method to compute a pairwise Jaccard similarity matrix

Also includes:
- New DistMatrix implementation in pkg/obidist for storing and computing distance/similarity matrices
- Updated version handling with bump-version target in Makefile
- Added tests for all new methods
2026-02-05 17:39:23 +01:00
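The Jaccard metrics introduced above reduce to a single ratio over set cardinalities. A minimal sketch on plain map-based sets (the real methods operate on KmerSets, not maps):

```go
package main

import "fmt"

// jaccardSimilarity computes |A ∩ B| / |A ∪ B| for two k-mer sets.
// The Jaccard distance is 1 minus this value.
func jaccardSimilarity(a, b map[uint64]bool) float64 {
	inter := 0
	for k := range a {
		if b[k] {
			inter++
		}
	}
	union := len(a) + len(b) - inter
	if union == 0 {
		return 1.0 // two empty sets are conventionally identical
	}
	return float64(inter) / float64(union)
}

func main() {
	a := map[uint64]bool{1: true, 2: true, 3: true}
	b := map[uint64]bool{2: true, 3: true, 4: true}
	s := jaccardSimilarity(a, b)
	fmt.Printf("similarity=%.2f distance=%.2f\n", s, 1-s)
	// similarity=0.50 distance=0.50
}
```

A pairwise distance matrix, as in JaccardDistanceMatrix, simply applies this to every pair of sets in the group.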
Eric Coissac aa2e94dd6f Refactor k-mer normalization functions and add quorum operations
This commit refactors the k-mer normalization functions, renaming them from 'NormalizeKmer' to 'CanonicalKmer' to better reflect their purpose of returning canonical k-mers. It also introduces new quorum operations (AtLeast, AtMost, Exactly) for k-mer set groups, along with comprehensive tests and benchmarks. The version commit hash has also been updated.
2026-02-05 17:11:34 +01:00
Eric Coissac a43e6258be docs: translate comments to English
This commit translates all French comments in the kmer filtering and set management code to English, improving code readability and maintainability for international collaborators.
2026-02-05 16:35:55 +01:00
Eric Coissac 12ca62b06a Implement full persistence for FrequencyFilter
Add save and load functionality for FrequencyFilter using the underlying KmerSetGroup.

- New Save() method to write the filter to a directory with formatted metadata
- New LoadFrequencyFilter() function to load a filter from a directory
- Initialize metadata when the filter is created
- Optimize the Union() and Intersect() methods of KmerSetGroup
- Update the commit hash
2026-02-05 16:26:10 +01:00
Eric Coissac 09ac15a76b Refactor k-mer encoding functions to use 'canonical' terminology
This commit refactors all k-mer encoding and normalization functions to consistently use 'canonical' instead of 'normalized' terminology. This includes renaming functions like EncodeNormalizedKmer to EncodeCanonicalKmer, IterNormalizedKmers to IterCanonicalKmers, and NormalizeKmer to CanonicalKmer. The change aligns the API with biological conventions where 'canonical' refers to the lexicographically smallest representation of a k-mer and its reverse complement. All related documentation and examples have been updated accordingly. The commit also updates the version file with a new commit hash.
2026-02-05 16:14:35 +01:00
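The canonical-k-mer convention named above can be sketched in a few lines, assuming the usual 2-bit encoding (A=00, C=01, G=10, T=11); the function names here are illustrative, not the renamed API:

```go
package main

import "fmt"

// reverseComplement returns the reverse complement of a 2-bit-encoded
// k-mer. With A=00, C=01, G=10, T=11, complementing a base is a bitwise
// NOT of its 2-bit code.
func reverseComplement(kmer uint64, k int) uint64 {
	var rc uint64
	for i := 0; i < k; i++ {
		rc = (rc << 2) | (^kmer & 0x3)
		kmer >>= 2
	}
	return rc
}

// canonicalKmer returns the smaller of a k-mer and its reverse
// complement: the 'canonical' representation the log refers to.
func canonicalKmer(kmer uint64, k int) uint64 {
	if rc := reverseComplement(kmer, k); rc < kmer {
		return rc
	}
	return kmer
}

func main() {
	// "TG" = 11 10 = 0b1110; its reverse complement "CA" = 01 00 = 0b0100.
	fmt.Printf("%04b\n", canonicalKmer(0b1110, 2)) // 0100
}
```

Because a sequence and its reverse complement yield the same canonical k-mers, both strands index identically.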
Eric Coissac 16f72e6305 refactoring of obikmer 2026-02-05 16:05:48 +01:00
Eric Coissac 6c6c369ee2 Add k-mer encoding and decoding functions with normalized k-mer support
This commit introduces new functions for encoding and decoding k-mers, including support for normalized k-mers. It also updates the frequency filter and k-mer set implementations to use the new encoding functions, providing zero-allocation encoding for better performance. The commit hash has been updated to reflect the latest changes.
2026-02-05 15:51:52 +01:00
Eric Coissac c5dd477675 Refactor KmerSet and FrequencyFilter to use immutable K parameter and consistent Copy/Clone methods
This commit refactors the KmerSet and related structures to use an immutable K parameter and introduces consistent Copy methods instead of Clone. It also adds attribute API support for KmerSet and KmerSetGroup, and updates persistence logic to handle IDs and metadata correctly.
2026-02-05 15:32:36 +01:00
Eric Coissac afcb43b352 Add user metadata management to KmerSet and KmerSetGroup
This change adds the ability to store and persist user metadata in the KmerSet and KmerSetGroup structures. It adds a Metadata field to KmerSet and KmerSetGroup, and updates the cloning and persistence methods to handle this metadata. Additional information attached to k-mer sets is now preserved while remaining compatible with existing operations.
2026-02-05 15:02:36 +01:00
Eric Coissac b26b76cbf8 Add TOML persistence support for KmerSet and KmerSetGroup
This commit adds support for saving and loading KmerSet and KmerSetGroup structures using TOML, YAML, and JSON formats for metadata. It includes:

- Added github.com/pelletier/go-toml/v2 dependency
- Implemented Save and Load methods for KmerSet and KmerSetGroup
- Added metadata persistence with support for multiple formats (TOML, YAML, JSON)
- Added helper functions for format detection and metadata handling
- Updated version commit hash
2026-02-05 14:57:22 +01:00
Eric Coissac aa468ec462 Refactor FrequencyFilter to use KmerSetGroup
Refactor FrequencyFilter to inherit from KmerSetGroup for better code organization and maintainability. This change replaces the direct bitmap management with a group-based approach, simplifying the implementation and improving readability.
2026-02-05 14:46:57 +01:00
Eric Coissac 00dcd78e84 Refactor k-mer encoding and frequency filtering with KmerSet
This commit refactors the k-mer encoding logic to handle ambiguous bases more consistently and introduces a KmerSet type for better management of k-mer collections. The frequency filter now works with KmerSet instead of roaring bitmaps directly, and the API has been updated to support level-based frequency queries. Additionally, the commit updates the version and commit hash.
2026-02-05 14:41:59 +01:00
Eric Coissac 60f27c1dc8 Add error handling for ambiguous bases in k-mer encoding
This commit introduces error handling for ambiguous DNA bases (N, R, Y, W, S, K, M, B, D, H, V) in k-mer encoding. It adds new functions IterNormalizedKmersWithErrors and EncodeNormalizedKmersWithErrors that track and encode the number of ambiguous bases in each k-mer using error markers in the top 2 bits. The commit also updates the version string to reflect the latest changes.
2026-02-04 21:45:08 +01:00
Eric Coissac 28162ac36f Add frequency filter with v levels of Roaring Bitmaps
Complete implementation of the frequency filter using v levels of Roaring Bitmaps to efficiently eliminate sequencing errors.

- Add v-level frequency filtering logic
- Integrate the RoaringBitmap and bitset libraries
- Add usage examples and documentation
- Implement a k-mer iterator for memory-efficient processing
- Optimize for the skewed distributions typical of sequencing data

This change makes it possible to filter k-mers by minimum frequency with optimal memory usage and a single pass over the data.
2026-02-04 21:21:10 +01:00
Eric Coissac 1a1adb83ac Add error marker support for k-mers with enhanced documentation
This commit introduces error marker functionality for k-mers with odd lengths up to 31. The top 2 bits of each k-mer are now reserved for error coding (0-3), allowing for error detection and correction capabilities. Key changes include:

- Added constants KmerErrorMask and KmerSequenceMask for bit manipulation
- Implemented SetKmerError, GetKmerError, and ClearKmerError functions
- Updated EncodeKmers, ExtractSuperKmers, EncodeNormalizedKmers functions to enforce k ≤ 31
- Enhanced ReverseComplement to preserve error bits during reverse complement operations
- Added comprehensive tests for error marker functionality including edge cases and integration tests

The maximum k-mer size is now capped at 31 to accommodate the error bits, ensuring that k-mers with odd lengths ≤ 31 utilize only 62 bits of the 64-bit uint64, leaving the top 2 bits available for error coding.
2026-02-04 16:21:47 +01:00
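The bit layout described above (k ≤ 31, so the k-mer uses at most 62 bits and the top 2 bits carry an error count in 0–3) can be sketched as follows. The constant and function names mirror those mentioned in the commit, but the bodies are illustrative sketches, not the project's code:

```go
package main

import "fmt"

const (
	KmerErrorMask    uint64 = 0xC000000000000000 // top 2 bits: error count
	KmerSequenceMask uint64 = ^KmerErrorMask     // low 62 bits: sequence
)

// SetKmerError stores an error count (0..3) in the top 2 bits.
func SetKmerError(kmer uint64, errs uint64) uint64 {
	return (kmer & KmerSequenceMask) | (errs << 62)
}

// GetKmerError extracts the error count from the top 2 bits.
func GetKmerError(kmer uint64) uint64 {
	return kmer >> 62
}

// ClearKmerError zeroes the error bits, leaving the sequence bits.
func ClearKmerError(kmer uint64) uint64 {
	return kmer & KmerSequenceMask
}

func main() {
	k := SetKmerError(0x0123456789ABCDEF, 2)
	fmt.Println(GetKmerError(k))           // 2
	fmt.Printf("%#x\n", ClearKmerError(k)) // 0x123456789abcdef
}
```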
Eric Coissac 05de9ca58e Add SuperKmer extraction functionality
This commit introduces the ExtractSuperKmers function which identifies maximal subsequences where all consecutive k-mers share the same minimizer. It includes:

- SuperKmer struct to represent the maximal subsequences
- dequeItem struct for tracking minimizers in a sliding window
- Efficient algorithm using monotone deque for O(1) amortized minimizer tracking
- Comprehensive parameter validation
- Support for buffer reuse for performance optimization
- Extensive test cases covering basic functionality, edge cases, and performance benchmarks

The implementation uses simultaneous forward/reverse m-mer encoding for O(1) canonical m-mer computation and maintains a monotone deque to track minimizers efficiently.
2026-02-04 16:04:06 +01:00
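The monotone-deque technique at the heart of the extraction above is the classic sliding-window minimum: each value enters and leaves the deque at most once, giving O(1) amortized minimizer updates. A minimal sketch on plain integers (the real code tracks canonical m-mer codes, not raw values):

```go
package main

import "fmt"

// slidingMinima returns the minimum of each window of w consecutive
// values, using a monotone deque of indices whose values increase from
// front to back. The front of the deque is always the current window's
// minimum, which is how a super k-mer's minimizer is tracked.
func slidingMinima(vals []uint64, w int) []uint64 {
	var deque []int // indices; vals[deque] is increasing
	var out []uint64
	for i, v := range vals {
		// Evict the front if it has slid out of the window.
		if len(deque) > 0 && deque[0] <= i-w {
			deque = deque[1:]
		}
		// Evict entries at the back that can never be a minimum again.
		for len(deque) > 0 && vals[deque[len(deque)-1]] >= v {
			deque = deque[:len(deque)-1]
		}
		deque = append(deque, i)
		if i >= w-1 {
			out = append(out, vals[deque[0]])
		}
	}
	return out
}

func main() {
	fmt.Println(slidingMinima([]uint64{4, 2, 7, 1, 5, 3}, 3)) // [2 1 1 1]
}
```

A super k-mer boundary falls exactly where the front of the deque (the minimizer) changes between consecutive windows.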
Eric Coissac 500144051a Add jj Makefile targets and k-mer encoding utilities
Add new Makefile targets for jj operations (jjnew, jjpush, jjfetch) to streamline commit workflow.

Introduce k-mer encoding utilities in pkg/obikmer:
- EncodeKmers: converts DNA sequences to encoded k-mers
- ReverseComplement: computes reverse complement of k-mers
- NormalizeKmer: returns canonical form of k-mers
- EncodeNormalizedKmers: encodes sequences with normalized k-mers

Add comprehensive tests for k-mer encoding functions including edge cases, buffer reuse, and performance benchmarks.

Document k-mer index design for large genomes, covering:
- Use cases and objectives
- Volume estimations
- Distance metrics (Jaccard, Sørensen-Dice, Bray-Curtis)
- Indexing options (Bloom filters, sorted sets, MPHF)
- Optimization techniques (k-2-mer indexing)
- MinHash for distance acceleration
- Recommended architecture for presence/absence and counting queries
2026-02-04 14:27:10 +01:00
coissac 740f66b4c7 Merge pull request #71 from metabarcoding/push-onwzsyuooozn
Implémentation du filtrage unique basé sur séquence et catégories
2026-01-14 19:19:27 +01:00
Eric Coissac b49aba9c09 Implement unique filtering based on sequence and categories
Add a unique-filtering feature that takes both the sequence and the categories into account.

- Modify the ISequenceChunk function to accept an optional unique classifier
- Implement on-disk unique processing using a composite classifier
- Update the classifier used for on-disk sorting
- Fix uniqueness-key handling by using the classifier's code and value
- Update the commit number
2026-01-14 19:18:17 +01:00
coissac 52244cdb64 Merge pull request #70 from metabarcoding/push-kuwnszsxmxpn
Refactor chunk processing and update version commit
2026-01-14 18:47:17 +01:00
Eric Coissac 0678181023 Refactor chunk processing and update version commit
Optimize chunk processing by moving variable declarations inside the loop and update the commit hash in version.go to reflect the latest changes.
2026-01-14 18:46:04 +01:00
coissac f55dd553c7 Merge pull request #68 from metabarcoding/push-rrulynolpprl
Push rrulynolpprl
2026-01-14 17:44:36 +01:00
coissac 4a383ac6c9 Merge branch 'master' into push-rrulynolpprl 2025-12-18 14:12:56 +01:00
Eric Coissac 371e702423 obiannotate --cut bug 2025-12-18 14:11:11 +01:00
Eric Coissac ac0d3f3fe4 Update obiuniq for very large dataset 2025-12-18 14:11:11 +01:00
Eric Coissac 547135c747 End of obilowmask 2025-12-03 11:49:07 +01:00
coissac f4a919732e Merge pull request #65 from metabarcoding/push-yurwulsmpxkq
End of obilowmask
2025-11-26 12:13:08 +01:00
Eric Coissac e681666aaa End of obilowmask 2025-11-26 11:14:56 +01:00
coissac adf2486295 Merge pull request #64 from metabarcoding/push-yurwulsmpxkq
End of obilowmask
2025-11-24 15:36:20 +01:00
Eric Coissac 272f5c9c35 End of obilowmask 2025-11-24 15:27:38 +01:00
coissac c1b9503ca6 Merge pull request #63 from metabarcoding/push-vypwrurrsxuk
obicsv bug with stat on value map fields
2025-11-21 14:04:34 +01:00
Eric Coissac 86e60aedd0 obicsv bug with stat on value map fields 2025-11-21 14:03:31 +01:00
coissac 961abcea7b Merge pull request #61 from metabarcoding/push-mvxssvnysyxn
Push mvxssvnysyxn
2025-11-21 13:25:19 +01:00
Eric Coissac 57c65f9d50 obimatrix bug 2025-11-21 13:24:24 +01:00
Eric Coissac e65b2a5efe obimatrix bugs 2025-11-21 13:24:06 +01:00
coissac 3e5f3f76b0 Merge pull request #60 from metabarcoding/push-qpnzxskwpoxo
Push qpnzxskwpoxo
2025-11-18 15:35:41 +01:00
Eric Coissac ccc827afd3 finalize obilowmask 2025-11-18 15:33:08 +01:00
Eric Coissac cef29005a5 debug url reading 2025-11-18 15:30:20 +01:00
Eric Coissac 4603d7973e Implementation of obilowmask 2025-11-18 15:30:20 +01:00
coissac 8bc47c13d3 Merge pull request #58 from metabarcoding/push-vxkqkkrokwuz
debug obimultiplex
2025-11-06 15:44:31 +01:00
Eric Coissac 07cdd6f758 debug obimultiplex
Fix a bug in an obimultiplex option
2025-11-06 15:43:13 +01:00
coissac 432da366e2 Merge pull request #57 from metabarcoding/push-ywktmvpvtvmv
debug taxonomy core dump
2025-11-05 19:07:41 +01:00
Eric Coissac 2d7dc7d09d debug taxonomy core dump 2025-11-05 19:01:15 +01:00
coissac 5e12ed5400 Merge pull request #56 from metabarcoding/push-tnrvpwvqtzyo
update install script
2025-11-04 18:11:21 +01:00
Eric Coissac 7500ee1d15 update install script 2025-11-04 18:09:15 +01:00
coissac 5a1d66bf06 Merge pull request #53 from metabarcoding/push-skmxzrzulvtq
Push skmxzrzulvtq
2025-10-28 14:27:19 +01:00
Eric Coissac 0844dcc607 bug obimatrix 2025-10-28 13:57:31 +01:00
Eric Coissac 7f4ebe757e Bug obiuniq - don't clean the chunks 2025-10-28 13:50:22 +01:00
coissac 5150947e23 Merge pull request #51 from metabarcoding/push-urtwmwktsrru
Push urtwmwktsrru
2025-10-20 17:41:33 +02:00
Eric Coissac d17a9520b9 work on obiclean chimera detection 2025-10-20 17:29:47 +02:00
Eric Coissac 29bf4ce871 Add obicsv options to obimatrix 2025-10-20 16:34:58 +02:00
coissac d7ed9d343e Update install_obitools.sh for missing directory 2025-10-15 08:32:06 +02:00
Eric Coissac 82b6bb1ab6 correct a bug in func (worker SeqWorker) ChainWorkers(next SeqWorker) SeqWorker 2025-08-11 15:09:49 +02:00
Eric Coissac 6d204f6281 Patch the fastq detector 2025-08-08 10:23:03 -04:00
Eric Coissac 7a6d552450 Changes to be committed:
modified:   pkg/obioptions/version.go
2025-08-07 17:01:48 -04:00
Eric Coissac 412b54822c Patch a bug in obiclean for d>1 leading to some instability in the results 2025-08-07 17:01:38 -04:00
Eric Coissac 730d448fc3 Allow running with only one CPU 2025-08-06 16:09:25 -04:00
Eric Coissac 04f3af3e60 some renaming of functions 2025-08-06 15:54:50 -04:00
Eric Coissac 997b6e8c01 Correct the fastq detector to distinguish it from a CSV ngsfilter 2025-08-06 15:52:54 -04:00
Eric Coissac f239e8da92 Rename ISequenceChunk 2025-08-05 08:49:45 -04:00
Eric Coissac ed28d3fb5b Adds a --u-to-t option 2025-07-07 15:35:26 +02:00
Eric Coissac 43b285587e Debug on taxonomy extraction and CSV conversion 2025-07-07 15:29:40 +02:00
Eric Coissac 8d53d253d4 Add a reading option on readers to convert U to T 2025-07-07 15:29:07 +02:00
Eric Coissac 8c26fc9884 Add a new test on obisummary 2025-07-07 15:28:29 +02:00
Eric Coissac 235a7e202a Update obisummary to account new obiseq.StatsOnValues type 2025-06-19 17:21:30 +02:00
Eric Coissac 27fa984a63 Patch obimatrix according to the new type obiseq.StatsOnValues 2025-06-19 16:51:53 +02:00
Eric Coissac add9d89ccc Patch the Min and Max values of the expression language 2025-06-19 16:43:26 +02:00
Eric Coissac 9965370d85 Manage a lock on StatsOnValues 2025-06-17 16:46:11 +02:00
Eric Coissac 8a2bb1fe82 Changes to be committed:
modified:   pkg/obioptions/version.go
	modified:   pkg/obiseq/merge.go
2025-06-17 12:11:35 +02:00
Eric Coissac efc3f3af29 Patch a concurrent access problem 2025-06-17 12:05:42 +02:00
Eric Coissac 1c6ab1c559 Changes to be committed:
modified:   pkg/obingslibrary/multimatch.go
	modified:   pkg/obioptions/version.go
2025-06-17 09:06:42 +02:00
Eric Coissac 38dcd98d4a Patch the genbank parser automata 2025-06-17 08:52:45 +02:00
Eric Coissac 7b23985693 Add _ to allowed in taxid 2025-06-06 14:37:57 +02:00
Eric Coissac d31e677304 Patch a bug in obitag 2025-06-04 14:47:28 +02:00
Eric Coissac 6cb7a5a352 Changes to be committed:
modified:   cmd/obitools/obitag/main.go
	modified:   cmd/obitools/obitaxonomy/main.go
	modified:   pkg/obiformats/csvtaxdump_read.go
	modified:   pkg/obiformats/ecopcr_read.go
	modified:   pkg/obiformats/ncbitaxdump_read.go
	modified:   pkg/obiformats/ncbitaxdump_readtar.go
	modified:   pkg/obiformats/newick_write.go
	modified:   pkg/obiformats/options.go
	modified:   pkg/obiformats/taxonomy_read.go
	modified:   pkg/obiformats/universal_read.go
	modified:   pkg/obiiter/extract_taxonomy.go
	modified:   pkg/obioptions/options.go
	modified:   pkg/obioptions/version.go
	new file:   pkg/obiphylo/tree.go
	modified:   pkg/obiseq/biosequenceslice.go
	modified:   pkg/obiseq/taxonomy_methods.go
	modified:   pkg/obitax/taxonomy.go
	modified:   pkg/obitax/taxonset.go
	modified:   pkg/obitools/obiconvert/sequence_reader.go
	modified:   pkg/obitools/obitag/obitag.go
	modified:   pkg/obitools/obitaxonomy/obitaxonomy.go
	modified:   pkg/obitools/obitaxonomy/options.go
	deleted:    sample/.DS_Store
2025-06-04 09:48:10 +02:00
Eric Coissac 3424d3057f Changes to be committed:
modified:   pkg/obiformats/ngsfilter_read.go
	modified:   pkg/obioptions/version.go
	modified:   pkg/obiutils/mimetypes.go
2025-05-14 14:53:25 +02:00
Eric Coissac f9324dd8f4 add min and max to the obitools expression language 2025-05-13 16:03:03 +02:00
Eric Coissac f1b9ac4a13 Update the expression language 2025-05-07 20:45:05 +02:00
Eric Coissac e065e2963b Update the install script 2025-05-01 11:45:46 +02:00
Eric Coissac 13ff892ac9 Patch type mismatch in apat C library 2025-04-23 16:14:10 +02:00
Eric Coissac c0ecaf90ab Add the --number option to obiannotate 2025-04-22 18:35:51 +02:00
Eric Coissac a57cfda675 Make the replace function of the eval language accepting regex 2025-04-10 15:17:15 +02:00
Eric Coissac c2f38e737b Update of the packages 2025-04-10 15:16:36 +02:00
Eric Coissac 0aec5ba4df change the tests according to the corrections in obipairing 2025-04-04 17:10:17 +02:00
Eric Coissac 67e5b6ef24 Changes to be committed:
modified:   pkg/obioptions/version.go
2025-04-04 17:02:45 +02:00
Eric Coissac 3b1aa2869e Changes to be committed:
modified:   pkg/obioptions/version.go
2025-04-04 17:01:20 +02:00
Eric Coissac 7542e33010 Several bugs discovered during the doc writing 2025-04-04 16:59:27 +02:00
Eric Coissac 03b5ce9397 Patch a bug in obitag when some reference sequences have taxid absent from the taxonomy 2025-03-27 16:45:02 +01:00
Eric Coissac 2d52322876 Patch a bug in the obi2 annotation parser on map indexed by integers 2025-03-27 14:54:13 +01:00
Eric Coissac fd80249b85 Patch a bug in obitag when a taxon from the reference library is unknown in the taxonomy 2025-03-27 14:28:15 +01:00
Eric Coissac 5a3705b6bb Adds the --silent-warning option to the obitools commands and removes the --pared-with option from some of the obitools commands. 2025-03-25 16:44:46 +01:00
Eric Coissac 2ab6f67d58 Add a progress bar to chimera detection 2025-03-25 08:37:27 +01:00
Eric Coissac 8b379d30da Adds the --newick-output option to the obitaxonomy command 2025-03-14 14:24:12 +01:00
Eric Coissac 8448783499 Make sequence files recognized as a taxonomy 2025-03-14 14:22:22 +01:00
Eric Coissac d1c31c54de add a first version of the inline documentation 2025-03-12 14:40:42 +01:00
Eric Coissac 7a9dc1ab3b update release notes 2025-03-12 14:06:20 +01:00
Eric Coissac 3a1cf4fe97 Accelerate the parsing of very long fasta sequences, and more generally of every format 2025-03-12 13:29:41 +01:00
Eric Coissac 83926c91e1 Patch the install script to deactivate the CSV check 2025-03-12 13:28:52 +01:00
Eric Coissac 937a483aa6 Changes to be committed:
modified:   Makefile
2025-03-12 12:55:41 +01:00
Eric Coissac dada70e6b1 Changes to be committed:
modified:   Makefile
2025-03-12 12:49:34 +01:00
Eric Coissac 62e5a93492 update the compress option name 2025-03-11 17:14:40 +01:00
Eric Coissac f21f51ae62 Correct the logic of --update-taxid and --fail-on-taxonomy 2025-03-11 16:56:02 +01:00
Eric Coissac 3b5d4ba455 patch a bug in obiannotate 2025-03-11 16:35:38 +01:00
Eric Coissac 50d11ce374 Add a pre-push git-hook to run tests on obitools commands before pushing on master 2025-03-08 18:56:02 +01:00
Eric Coissac 52d5f6fe11 make makefile crashing on test error 2025-03-08 16:54:24 +01:00
Eric Coissac 78caabd2fd Add basic test on -h for all the commands 2025-03-08 16:28:06 +01:00
Eric Coissac 65bd29b955 normalize the usage of obitaxonomy 2025-03-08 13:00:55 +01:00
Eric Coissac b18c9b7ac6 add the --raw-taxid option 2025-03-08 09:40:06 +01:00
Eric Coissac 78df7db18d typos 2025-03-08 07:44:41 +01:00
Eric Coissac fc08c12ab0 update release notes 2025-03-08 07:42:20 +01:00
Eric Coissac 0339e4dffa Patch the size limit of the filetype guesser 2025-03-08 07:34:02 +01:00
Eric Coissac 706b44c37f Add option for csv input format 2025-03-08 07:21:24 +01:00
Eric Coissac fbe7d15dc3 Changes to be committed:
modified:   pkg/obioptions/version.go
	modified:   pkg/obitools/obicleandb/obicleandb.go
	modified:   pkg/obitools/obicleandb/options.go
2025-03-06 13:38:38 +01:00
Eric Coissac b5cf586f17 patch a duplicate --taxonomy option in obirefidxdb 2025-03-06 11:36:20 +01:00
Eric Coissac 286e27d6ba patch the scienctific_name tag name to "scientific_name" 2025-03-05 14:22:12 +01:00
Eric Coissac 996ec69bd9 update the release notes for version 4.4.0 2025-03-01 12:56:39 +01:00
Eric Coissac 5f9182d25b Changes to be committed:
modified:   pkg/obioptions/version.go
2025-03-01 09:20:39 +01:00
Eric Coissac 9913fa8354 Changes to be committed:
modified:   pkg/obioptions/version.go
2025-03-01 09:14:56 +01:00
Eric Coissac 7b23314651 Some typos 2025-03-01 08:29:27 +01:00
Eric Coissac 1e541eac4c Last commit version 2025-03-01 08:24:26 +01:00
Eric Coissac 13cd4c86ac Patch the bug on --out with paired sequence files 2025-02-27 18:13:21 +01:00
Eric Coissac 75dd535201 Add a --valid-taxid option to obigrep 2025-02-27 18:12:55 +01:00
Eric Coissac 573acafafc Patch bug on ecotag with too short sequences 2025-02-27 15:09:07 +01:00
Eric Coissac 0067152c2b Patch the production of the ratio file 2025-02-27 10:19:39 +01:00
Eric Coissac 791d253edc Generate the ratio file as compressed if -Z option enabled. 2025-02-27 09:06:07 +01:00
Eric Coissac 6245d7f684 Changes to be committed:
modified:   .gitignore
2025-02-24 15:47:45 +01:00
Eric Coissac 13d610aff7 Changes to be committed:
modified:   pkg/obioptions/version.go
	modified:   pkg/obitools/obiclean/chimera.go
2025-02-24 15:25:45 +01:00
Eric Coissac db284f1d44 Add an experimental chimera detection... 2025-02-24 15:02:49 +01:00
Eric Coissac 51b3e83d32 some cleaning 2025-02-24 11:31:49 +01:00
Eric Coissac 8671285d02 add the --min-sample-count option to obiclean. 2025-02-24 08:48:31 +01:00
Eric Coissac 51d11aa36d Changes to be committed:
modified:   pkg/obialign/alignment.go
	modified:   pkg/obialign/pairedendalign.go
	modified:   pkg/obioptions/version.go
	modified:   pkg/obitools/obipairing/pairing.go
2025-02-23 17:37:56 +01:00
Eric Coissac fb6f857d8c Update the computation of the consensus quality score 2025-02-23 15:16:31 +01:00
Eric Coissac d4209b4549 Add a basic test for obipairing 2025-02-22 09:57:44 +01:00
Eric Coissac ef05d4975f Update the scoring schema of obipairing 2025-02-21 22:41:34 +01:00
Eric Coissac 4588bf8b5d Patch the make file to fail on error 2025-02-19 15:55:07 +01:00
Eric Coissac 090633850d Changes to be committed:
modified:   obitests/obitools/obicount/test.sh
2025-02-19 15:28:42 +01:00
Eric Coissac 15a058cf63 with all the sample files for tests 2025-02-19 15:27:38 +01:00
Eric Coissac 2f5f7634d6 Changes to be committed:
modified:   obitests/obitools/obicount/test.sh
2025-02-19 14:50:10 +01:00
Eric Coissac 48138b605c Changes to be committed:
modified:   .github/workflows/obitest.yml
	modified:   Makefile
	modified:   obitests/obitools/obicount/test.sh
2025-02-19 14:37:05 +01:00
Eric Coissac aed22c12a6 Changes to be committed:
modified:   obitests/obitools/obicount/test.sh
2025-02-19 14:34:22 +01:00
Eric Coissac 443a9b3ce3 Changes to be committed:
modified:   Makefile
	modified:   obitests/obitools/obicount/test.sh
2025-02-19 14:28:49 +01:00
Eric Coissac 7e90537379 Run tests using bash in the makefile 2025-02-19 13:58:52 +01:00
Eric Coissac d3d15acc6c Changes to be committed:
modified:   obitests/obitools/obicount/test.sh
	modified:   pkg/obioptions/version.go
2025-02-19 13:54:01 +01:00
Eric Coissac bd4a0b5ca5 Trial of a GitHub action to run the obitools tests 2025-02-19 13:45:43 +01:00
Eric Coissac 952f85f312 A first trial of a test for obicount 2025-02-19 13:17:36 +01:00
Eric Coissac 4774438644 Changes to be committed:
modified:   pkg/obiformats/universal_read.go
	modified:   pkg/obioptions/version.go
	modified:   pkg/obiseq/taxonomy_methods.go
2025-02-12 08:40:38 +01:00
Eric Coissac 6a8061cc4f Add management of the taxonomy alias policy 2025-02-10 14:05:47 +01:00
Eric Coissac e2563cd8df Patch a bug in registering merged taxa 2025-02-10 11:42:46 +01:00
Eric Coissac f2e81adf95 Changes to be committed:
modified:   .gitignore
	deleted:    xxx.csv
2025-02-05 19:28:19 +01:00
Eric Coissac f27e9bc91e patch a bug related to csv and qualities 2025-02-05 19:27:00 +01:00
Eric Coissac 773e54965d Patch a bug on compressed output 2025-02-05 14:18:24 +01:00
Eric Coissac ceca33998b Add the fq extension to directory scanning 2025-02-04 20:34:58 +01:00
Eric Coissac b9bee5f426 Changes to be committed:
modified:   go.mod
	modified:   go.sum
	modified:   pkg/obilua/obilib.go
	modified:   pkg/obilua/obiseq.go
	modified:   pkg/obilua/obiseqslice.go
	new file:   pkg/obilua/obitaxon.go
	new file:   pkg/obilua/obitaxonomy.go
	modified:   pkg/obioptions/version.go
2025-02-02 16:52:52 +01:00
Eric Coissac c10df073a7 Changes to be committed:
modified:   pkg/obioptions/version.go
	modified:   pkg/obitax/iterator.go
2025-02-01 12:06:19 +01:00
Eric Coissac d3dac1b21f Make obitag able to use the taxonomic path included in reference database as taxonomy 2025-01-30 11:50:03 +01:00
Eric Coissac 0df082da06 Adds possibility to extract a taxonomy from taxonomic path included in sequence files 2025-01-30 11:18:21 +01:00
Eric Coissac 2452aef7a9 patch multiple -Z options 2025-01-29 21:35:28 +01:00
Eric Coissac 337954592d add the --out option to the obitaxonomy 2025-01-29 13:22:35 +01:00
Eric Coissac 8a28c9ae7c add the --download-ncbi option to obitaxonomy 2025-01-29 12:38:39 +01:00
Eric Coissac b6b18c0fa1 Changes to be committed:
modified:   pkg/obioptions/version.go
2025-01-29 11:34:01 +01:00
Eric Coissac 67e2758d63 Switch to release number 4.3.0 2025-01-29 11:33:30 +01:00
678 changed files with 43349 additions and 6594 deletions
+19
@@ -0,0 +1,19 @@
name: "Run the obitools command test suite"
on:
push:
branches:
- master
- V*
jobs:
build:
runs-on: ubuntu-latest
steps:
- name: Setup Go
uses: actions/setup-go@v2
with:
go-version: '1.23'
- name: Checkout obitools4 project
uses: actions/checkout@v4
- name: Run tests
run: make githubtests
+187
@@ -0,0 +1,187 @@
name: Create Release on Tag
on:
push:
tags:
- "Release_*"
permissions:
contents: write
jobs:
# First run tests
test:
runs-on: ubuntu-latest
steps:
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: "1.26"
- name: Checkout obitools4 project
uses: actions/checkout@v4
- name: Run tests
run: make githubtests
# Build binaries for each platform
build:
needs: test
strategy:
matrix:
include:
- os: ubuntu-latest
goos: linux
goarch: amd64
output_name: linux_amd64
- os: ubuntu-24.04-arm
goos: linux
goarch: arm64
output_name: linux_arm64
- os: macos-15-intel
goos: darwin
goarch: amd64
output_name: darwin_amd64
- os: macos-latest
goos: darwin
goarch: arm64
output_name: darwin_arm64
runs-on: ${{ matrix.os }}
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: "1.26"
- name: Extract version from tag
id: get_version
run: |
TAG=${GITHUB_REF#refs/tags/Release_}
echo "version=$TAG" >> $GITHUB_OUTPUT
- name: Install build tools (macOS)
if: runner.os == 'macOS'
run: |
# Ensure Xcode Command Line Tools are installed
xcode-select --install 2>/dev/null || true
xcode-select -p
- name: Build binaries (Linux)
if: runner.os == 'Linux'
env:
VERSION: ${{ steps.get_version.outputs.version }}
run: |
docker run --rm \
-v "$(pwd):/src" \
-w /src \
-e VERSION="${VERSION}" \
golang:1.26-alpine \
sh -c "apk add --no-cache gcc musl-dev zlib-dev zlib-static make && \
make LDFLAGS='-linkmode=external -extldflags=-static' obitools"
mkdir -p artifacts
tar -czf artifacts/obitools4_${VERSION}_${{ matrix.output_name }}.tar.gz -C build .
- name: Build binaries (macOS)
if: runner.os == 'macOS'
env:
GOOS: ${{ matrix.goos }}
GOARCH: ${{ matrix.goarch }}
VERSION: ${{ steps.get_version.outputs.version }}
run: |
make obitools
mkdir -p artifacts
tar -czf artifacts/obitools4_${VERSION}_${{ matrix.output_name }}.tar.gz -C build .
- name: Upload artifacts
uses: actions/upload-artifact@v4
with:
name: binaries-${{ matrix.output_name }}
path: artifacts/*
# Create the release
create-release:
needs: build
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Extract version from tag
id: get_version
run: |
TAG=${GITHUB_REF#refs/tags/Release_}
echo "version=$TAG" >> $GITHUB_OUTPUT
- name: Download all artifacts
uses: actions/download-artifact@v4
with:
path: release-artifacts
- name: Prepare release directory
run: |
mkdir -p release
find release-artifacts -type f -name "*.tar.gz" -exec cp {} release/ \;
ls -lh release/
- name: Generate Release Notes
env:
VERSION: ${{ steps.get_version.outputs.version }}
run: |
PREV_TAG=$(git describe --tags --abbrev=0 HEAD^ 2>/dev/null || echo "")
echo "# OBITools4 Release ${VERSION}" > release_notes.md
echo "" >> release_notes.md
if [ -n "$PREV_TAG" ]; then
echo "## Changes since ${PREV_TAG}" >> release_notes.md
echo "" >> release_notes.md
git log ${PREV_TAG}..HEAD --pretty=format:"- %s" >> release_notes.md
else
echo "## Changes" >> release_notes.md
echo "" >> release_notes.md
git log --pretty=format:"- %s" -n 20 >> release_notes.md
fi
echo "" >> release_notes.md
echo "" >> release_notes.md
echo "## Installation" >> release_notes.md
echo "" >> release_notes.md
echo "Download the appropriate archive for your system and extract it:" >> release_notes.md
echo "" >> release_notes.md
echo "### Linux (AMD64)" >> release_notes.md
echo '```bash' >> release_notes.md
echo "tar -xzf obitools4_${VERSION}_linux_amd64.tar.gz" >> release_notes.md
echo '```' >> release_notes.md
echo "" >> release_notes.md
echo "### Linux (ARM64)" >> release_notes.md
echo '```bash' >> release_notes.md
echo "tar -xzf obitools4_${VERSION}_linux_arm64.tar.gz" >> release_notes.md
echo '```' >> release_notes.md
echo "" >> release_notes.md
echo "### macOS (Intel)" >> release_notes.md
echo '```bash' >> release_notes.md
echo "tar -xzf obitools4_${VERSION}_darwin_amd64.tar.gz" >> release_notes.md
echo '```' >> release_notes.md
echo "" >> release_notes.md
echo "### macOS (Apple Silicon)" >> release_notes.md
echo '```bash' >> release_notes.md
echo "tar -xzf obitools4_${VERSION}_darwin_arm64.tar.gz" >> release_notes.md
echo '```' >> release_notes.md
echo "" >> release_notes.md
echo "All OBITools4 binaries are included in each archive." >> release_notes.md
- name: Create GitHub Release
uses: softprops/action-gh-release@v1
with:
name: Release ${{ steps.get_version.outputs.version }}
body_path: release_notes.md
files: release/*
draft: false
prerelease: false
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
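The version extraction used by both the build and release jobs relies on POSIX prefix removal on the pushed tag ref. A standalone sketch, using an assumed example tag value rather than a real workflow run:

```shell
# Sketch of the "Extract version from tag" step: strip the refs/tags/Release_ prefix.
GITHUB_REF="refs/tags/Release_4.4.33"   # assumed example value supplied by the runner
TAG=${GITHUB_REF#refs/tags/Release_}    # POSIX prefix removal, no external tools needed
echo "$TAG"                             # prints 4.4.33
```

Because `${var#pattern}` removes only a leading match, a ref that does not start with `refs/tags/Release_` passes through unchanged, which is why the workflow restricts itself to `Release_*` tags.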
+36 -133
@@ -1,134 +1,37 @@
cpu.pprof
cpu.trace
test
bin
vendor
*.fastq
*.fasta
*.fastq.gz
*.fasta.gz
.DS_Store
*.gml
*.log
/argaly
/obiconvert
/obicount
/obimultiplex
/obipairing
/obipcr
/obifind
/obidistribute
/obiuniq
/build
/Makefile.old
.Rproj.user
obitools.Rproj
Stat_error.knit.md
.Rhistory
Stat_error.nb.html
Stat_error.Rmd
/.luarc.json
/doc/TAXO/
/doc/results/
/doc/_main.log
/doc/_book/_main.tex
/doc/_freeze/
/doc/tutorial_files/
/doc/wolf_data/
/taxdump/
/.vscode/
/Algo-Alignement.numbers
/Estimate_proba_true_seq.html
/Estimate_proba_true_seq.nb.html
/Estimate_proba_true_seq.Rmd
/modele_error_euka.qmd
/obitools.code-workspace
.DS_Store
.RData
x
xxx
y
/doc/wolf_diet.tgz
/doc/man/depends
/sample/wolf_R1.fasta.gz
/sample/wolf_R2.fasta.gz
/sample/euka03.ecotag.fasta.gz
/sample/ratio.csv
/sample/STD_PLN_1.dat
/sample/STD_PLN_2.dat
/sample/subset_Pasvik_R1.fastq.gz
/sample/subset_Pasvik_R2.fastq.gz
/sample/test_gobitools.fasta.bz2
euka03.csv*
gbbct793.seq.gz
gbinv1003.seq.gz
gbpln210.seq
/doc/book/OBITools-V4.aux
/doc/book/OBITools-V4.fdb_latexmk
/doc/book/OBITools-V4.fls
/doc/book/OBITools-V4.log
/doc/book/OBITools-V4.pdf
/doc/book/OBITools-V4.synctex.gz
/doc/book/OBITools-V4.tex
/doc/book/OBITools-V4.toc
getoptions.adoc
Archive.zip
.DS_Store
sample/.DS_Store
sample/consensus_graphs/specimen_hac_plants_Vern_disicolor_.gml
93954
Bact03.e5.gb_R254.obipcr.idx.fasta.save
sample/test.obipcr.log
Bact02.e3.gb_R254.obipcr.fasta.gz
Example_Arth03.ngsfilter
SPER01.csv
SPER03.csv
wolf_diet_ngsfilter.txt
**/cpu.pprof
**/cpu.trace
**/test
**/bin
**/vendor
**/*.fastq
**/*.fasta
**/*.fastq.gz
**/*.fasta.gz
**/.DS_Store
**/*.gml
**/*.log
**/xxx*
**/*.sav
**/*.old
**/*.tgz
**/*.yaml
**/*.csv
**/*.pb.gz
xx
xxx.gb
yyy_geom.csv
yyy_LCS.csv
yyy.json
bug_obimultiplex/toto
bug_obimultiplex/toto_mapping
bug_obimultiplex/tutu
bug_obimultiplex/tutu_mapping
bug_obipairing/GIT1_GH_ngsfilter.txt
doc/book/TAXO/citations.dmp
doc/book/TAXO/delnodes.dmp
doc/book/TAXO/division.dmp
doc/book/TAXO/gc.prt
doc/book/TAXO/gencode.dmp
doc/book/TAXO/merged.dmp
doc/book/TAXO/names.dmp
doc/book/TAXO/nodes.dmp
doc/book/TAXO/readme.txt
doc/book/wolf_data/Release-253/ncbitaxo/citations.dmp
doc/book/wolf_data/Release-253/ncbitaxo/delnodes.dmp
doc/book/wolf_data/Release-253/ncbitaxo/division.dmp
doc/book/wolf_data/Release-253/ncbitaxo/gc.prt
doc/book/wolf_data/Release-253/ncbitaxo/gencode.dmp
doc/book/wolf_data/Release-253/ncbitaxo/merged.dmp
doc/book/wolf_data/Release-253/ncbitaxo/names.dmp
doc/book/wolf_data/Release-253/ncbitaxo/nodes.dmp
doc/book/wolf_data/Release-253/ncbitaxo/readme.txt
doc/book/results/toto.tasta
sample/.DS_Store
GO
ncbitaxo/citations.dmp
ncbitaxo/delnodes.dmp
ncbitaxo/division.dmp
ncbitaxo/gc.prt
ncbitaxo/gencode.dmp
ncbitaxo/merged.dmp
ncbitaxo/names.dmp
ncbitaxo/nodes.dmp
ncbitaxo/readme.txt
template.16S
xxx.gz
*.sav
*.old
ncbitaxo.tgz
.rhistory
/.vscode
/build
/bugs
autodoc
/ncbitaxo
!/obitests/**
!/sample/**
LLM/**
*_files
entropy.html
bug_id.txt
obilowmask_ref
test_*
+170 -26
@@ -2,8 +2,17 @@
#export GOBIN=$(GOPATH)/bin
#export PATH=$(GOBIN):$(shell echo $${PATH})
.DEFAULT_GOAL := all
GREEN := \033[0;32m
YELLOW := \033[0;33m
BLUE := \033[0;34m
NC := \033[0m
GOFLAGS=
LDFLAGS=
GOCMD=go
GOBUILD=$(GOCMD) build # -compiler gccgo -gccgoflags -O3
GOBUILD=$(GOCMD) build $(GOFLAGS) $(if $(LDFLAGS),-ldflags="$(LDFLAGS)")
GOGENERATE=$(GOCMD) generate
GOCLEAN=$(GOCMD) clean
GOTEST=$(GOCMD) test
@@ -16,6 +25,12 @@ PACKAGES_SRC:= $(wildcard pkg/*/*.go pkg/*/*/*.go)
PACKAGE_DIRS:=$(sort $(patsubst %/,%,$(dir $(PACKAGES_SRC))))
PACKAGES:=$(notdir $(PACKAGE_DIRS))
GITHOOK_SRC_DIR=git-hooks
GITHOOKS_SRC:=$(wildcard $(GITHOOK_SRC_DIR)/*)
GITHOOK_DIR=.git/hooks
GITHOOKS:=$(patsubst $(GITHOOK_SRC_DIR)/%,$(GITHOOK_DIR)/%,$(GITHOOKS_SRC))
OBITOOLS_SRC:= $(wildcard cmd/obitools/*/*.go)
OBITOOLS_DIRS:=$(sort $(patsubst %/,%,$(dir $(OBITOOLS_SRC))))
OBITOOLS:=$(notdir $(OBITOOLS_DIRS))
@@ -36,7 +51,7 @@ $(OBITOOLS_PREFIX)$(notdir $(1)): $(BUILD_DIR) $(1) pkg/obioptions/version.go
@echo -n - Building obitool $(notdir $(1))...
@$(GOBUILD) -o $(BUILD_DIR)/$(OBITOOLS_PREFIX)$(notdir $(1)) ./$(1) \
2> $(OBITOOLS_PREFIX)$(notdir $(1)).log \
|| cat $(OBITOOLS_PREFIX)$(notdir $(1)).log
|| { cat $(OBITOOLS_PREFIX)$(notdir $(1)).log; rm -f $(OBITOOLS_PREFIX)$(notdir $(1)).log; exit 1; }
@rm -f $(OBITOOLS_PREFIX)$(notdir $(1)).log
@echo Done.
endef
@@ -53,27 +68,53 @@ endif
OUTPUT:=$(shell mktemp)
all: obitools
help:
@printf "$(GREEN)OBITools4 Makefile$(NC)\n\n"
@printf "$(BLUE)Main targets:$(NC)\n"
@printf " %-20s %s\n" "all" "Build all obitools (default)"
@printf " %-20s %s\n" "obitools" "Build all obitools binaries to build/"
@printf " %-20s %s\n" "test" "Run Go unit tests"
@printf " %-20s %s\n" "obitests" "Run integration tests (obitests/)"
@printf " %-20s %s\n" "bump-version" "Increment patch version (or set with VERSION=x.y.z)"
@printf " %-20s %s\n" "update-deps" "Update all Go dependencies"
@printf "\n$(BLUE)Jujutsu workflow:$(NC)\n"
@printf " %-20s %s\n" "jjnew" "Document current commit and start a new one"
@printf " %-20s %s\n" "jjpush" "Release: describe, bump, generate notes, push PR, tag (VERSION=x.y.z optional)"
@printf " %-20s %s\n" "jjfetch" "Fetch latest commits from origin"
@printf "\n$(BLUE)Required tools:$(NC)\n"
@printf " %-20s " "go"; command -v go >/dev/null 2>&1 && printf "$(GREEN)$(NC) %s\n" "$$(go version)" || printf "$(YELLOW)✗ not found$(NC)\n"
@printf " %-20s " "git"; command -v git >/dev/null 2>&1 && printf "$(GREEN)$(NC) %s\n" "$$(git --version)" || printf "$(YELLOW)✗ not found$(NC)\n"
@printf " %-20s " "jj"; command -v jj >/dev/null 2>&1 && printf "$(GREEN)$(NC) %s\n" "$$(jj --version)" || printf "$(YELLOW)✗ not found$(NC)\n"
@printf " %-20s " "gh"; command -v gh >/dev/null 2>&1 && printf "$(GREEN)$(NC) %s\n" "$$(gh --version | head -1)" || printf "$(YELLOW)✗ not found$(NC) (brew install gh)\n"
@printf "\n$(BLUE)Optional tools (release notes generation):$(NC)\n"
@printf " %-20s " "aichat"; command -v aichat >/dev/null 2>&1 && printf "$(GREEN)$(NC) %s\n" "$$(aichat --version)" || printf "$(YELLOW)✗ not found$(NC) (https://github.com/sigoden/aichat)\n"
@printf " %-20s " "jq"; command -v jq >/dev/null 2>&1 && printf "$(GREEN)$(NC) %s\n" "$$(jq --version)" || printf "$(YELLOW)✗ not found$(NC) (brew install jq)\n"
all: install-githook obitools
obitools: $(patsubst %,$(OBITOOLS_PREFIX)%,$(OBITOOLS))
install-githook: $(GITHOOKS)
$(GITHOOK_DIR)/%: $(GITHOOK_SRC_DIR)/%
@echo installing $$(basename $@)...
@mkdir -p $(GITHOOK_DIR)
@cp $< $@
@chmod +x $@
packages: $(patsubst %,pkg-%,$(PACKAGES))
obitools: $(patsubst %,$(OBITOOLS_PREFIX)%,$(OBITOOLS))
update-deps:
go get -u ./...
test:
test: .FORCE
$(GOTEST) ./...
man:
make -C doc man
obibook:
make -C doc obibook
doc: man obibook
macos-pkg:
@bash pkgs/macos/macos-installer-builder-master/macOS-x64/build-macos-x64.sh \
OBITools \
0.0.1
obitests:
@for t in $$(find obitests -name test.sh -print) ; do \
bash $${t} || exit 1;\
done
githubtests: obitools obitests
$(BUILD_DIR):
mkdir -p $@
@@ -83,19 +124,122 @@ $(foreach P,$(PACKAGE_DIRS),$(eval $(call MAKE_PKG_RULE,$(P))))
$(foreach P,$(OBITOOLS_DIRS),$(eval $(call MAKE_OBITOOLS_RULE,$(P))))
pkg/obioptions/version.go: .FORCE
ifneq ($(strip $(COMMIT_ID)),)
@cat $@ \
| sed -E 's/^var _Commit = "[^"]*"/var _Commit = "'$(COMMIT_ID)'"/' \
| sed -E 's/^var _Version = "[^"]*"/var _Version = "'"$(LAST_TAG)"'"/' \
pkg/obioptions/version.go: version.txt .FORCE
@version=$$(cat version.txt); \
cat $@ \
| sed -E 's/^var _Version = "[^"]*"/var _Version = "Release '$$version'"/' \
> $(OUTPUT)
@diff $@ $(OUTPUT) 2>&1 > /dev/null \
|| echo "Update version.go : $@ to $(LAST_TAG) ($(COMMIT_ID))" \
&& mv $(OUTPUT) $@
|| (echo "Update version.go to $$(cat version.txt)" && mv $(OUTPUT) $@)
@rm -f $(OUTPUT)
endif
.PHONY: all packages obitools man obibook doc update-deps .FORCE
.FORCE:
bump-version:
@current=$$(cat version.txt); \
if [ -n "$(VERSION)" ]; then \
new_version="$(VERSION)"; \
echo "Setting version to $$new_version (was $$current)"; \
else \
echo "Incrementing version..."; \
echo " Current version: $$current"; \
major=$$(echo $$current | cut -d. -f1); \
minor=$$(echo $$current | cut -d. -f2); \
patch=$$(echo $$current | cut -d. -f3); \
new_patch=$$((patch + 1)); \
new_version="$$major.$$minor.$$new_patch"; \
echo " New version: $$new_version"; \
fi; \
echo "$$new_version" > version.txt
@echo "✓ Version updated in version.txt"
@$(MAKE) pkg/obioptions/version.go
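The patch-increment arithmetic in `bump-version` can be exercised as a standalone shell snippet. This is an illustrative sketch with an assumed example version string, not part of the Makefile itself:

```shell
# Sketch of the bump-version patch increment (assumed input "4.4.32").
current="4.4.32"                            # stand-in for: cat version.txt
major=$(echo "$current" | cut -d. -f1)      # 4
minor=$(echo "$current" | cut -d. -f2)      # 4
patch=$(echo "$current" | cut -d. -f3)      # 32
new_version="$major.$minor.$((patch + 1))"  # bump only the patch component
echo "$new_version"                         # prints 4.4.33
```

Passing `VERSION=x.y.z` on the `make` command line bypasses this arithmetic entirely and writes the given version verbatim.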
jjnew:
@echo "$(YELLOW)→ Creating a new commit...$(NC)"
@echo "$(BLUE)→ Documenting current commit...$(NC)"
@jj auto-describe
@echo "$(BLUE)→ Done.$(NC)"
@jj new
@echo "$(GREEN)✓ New commit created$(NC)"
jjpush:
@$(MAKE) jjpush-describe
@$(MAKE) jjpush-bump
@$(MAKE) jjpush-notes
@$(MAKE) jjpush-push
@$(MAKE) jjpush-tag
@echo "$(GREEN)✓ Release complete$(NC)"
jjpush-describe:
@echo "$(BLUE)→ Documenting current commit...$(NC)"
@jj auto-describe
jjpush-bump:
@echo "$(BLUE)→ Creating new commit for version bump...$(NC)"
@jj new
@$(MAKE) bump-version
jjpush-notes:
@version=$$(cat version.txt); \
echo "$(BLUE)→ Generating release notes for version $$version...$(NC)"; \
release_title="Release $$version"; \
release_body=""; \
if command -v aichat >/dev/null 2>&1; then \
previous_tag=$$(git describe --tags --abbrev=0 --match 'Release_*' 2>/dev/null); \
if [ -z "$$previous_tag" ]; then \
echo "$(YELLOW)⚠ No previous Release tag found, skipping release notes$(NC)"; \
else \
raw_output=$$(git log --format="%h %B" "$$previous_tag..HEAD" | \
aichat \
"Summarize the following commits into a GitHub release note for version $$version. Ignore commits related to version bumps, .gitignore changes, or any internal housekeeping that is irrelevant to end users. Describe each user-facing change precisely without exposing code. Eliminate redundancy. Output strictly valid JSON with no surrounding text, using this exact schema: {\"title\": \"<short release title>\", \"body\": \"<detailed markdown release notes>\"}" 2>/dev/null) || true; \
if [ -n "$$raw_output" ]; then \
notes=$$(printf '%s\n' "$$raw_output" | python3 tools/json2md.py 2>/dev/null); \
if [ -n "$$notes" ]; then \
release_title=$$(echo "$$notes" | head -1); \
release_body=$$(echo "$$notes" | tail -n +3); \
else \
echo "$(YELLOW)⚠ JSON parsing failed, using default release message$(NC)"; \
fi; \
fi; \
fi; \
fi; \
printf '%s' "$$release_title" > /tmp/obitools4-release-title.txt; \
printf '%s' "$$release_body" > /tmp/obitools4-release-body.txt; \
echo "$(BLUE)→ Setting release notes as commit description...$(NC)"; \
jj desc -m "$$release_title"$$'\n\n'"$$release_body"
jjpush-push:
@echo "$(BLUE)→ Pushing commits...$(NC)"
@jj git push --change @
@echo "$(BLUE)→ Creating/updating PR...$(NC)"
@release_title=$$(cat /tmp/obitools4-release-title.txt 2>/dev/null || echo "Release $$(cat version.txt)"); \
release_body=$$(cat /tmp/obitools4-release-body.txt 2>/dev/null || echo ""); \
branch=$$(jj log -r @ --no-graph -T 'bookmarks.map(|b| b.name()).join("\n")' 2>/dev/null | head -1); \
if [ -n "$$branch" ] && command -v gh >/dev/null 2>&1; then \
gh pr create --title "$$release_title" --body "$$release_body" --base master --head "$$branch" 2>/dev/null \
|| gh pr edit "$$branch" --title "$$release_title" --body "$$release_body" 2>/dev/null \
|| echo "$(YELLOW)⚠ Could not create/update PR$(NC)"; \
fi
jjpush-tag:
@version=$$(cat version.txt); \
tag_name="Release_$$version"; \
release_title=$$(cat /tmp/obitools4-release-title.txt 2>/dev/null || echo "Release $$version"); \
release_body=$$(cat /tmp/obitools4-release-body.txt 2>/dev/null || echo ""); \
install_section=$$'\n## Installation\n\n### Pre-built binaries\n\nDownload the appropriate archive for your system from the\n[release assets](https://github.com/metabarcoding/obitools4/releases/tag/Release_'"$$version"')\nand extract it:\n\n#### Linux (AMD64)\n```bash\ntar -xzf obitools4_'"$$version"'_linux_amd64.tar.gz\n```\n\n#### Linux (ARM64)\n```bash\ntar -xzf obitools4_'"$$version"'_linux_arm64.tar.gz\n```\n\n#### macOS (Intel)\n```bash\ntar -xzf obitools4_'"$$version"'_darwin_amd64.tar.gz\n```\n\n#### macOS (Apple Silicon)\n```bash\ntar -xzf obitools4_'"$$version"'_darwin_arm64.tar.gz\n```\n\nAll OBITools4 binaries are included in each archive.\n\n### From source\n\nYou can also compile and install OBITools4 directly from source using the\ninstallation script:\n\n```bash\ncurl -L https://raw.githubusercontent.com/metabarcoding/obitools4/master/install_obitools.sh | bash -s -- --version '"$$version"'\n```\n\nBy default binaries are installed in `/usr/local/bin`. Use `--install-dir` to\nchange the destination and `--obitools-prefix` to add a prefix to command names:\n\n```bash\ncurl -L https://raw.githubusercontent.com/metabarcoding/obitools4/master/install_obitools.sh | \\\n bash -s -- --version '"$$version"' --install-dir ~/local --obitools-prefix k\n```\n'; \
release_message="$$release_title"$$'\n\n'"$$release_body$$install_section"; \
echo "$(BLUE)→ Creating tag $$tag_name...$(NC)"; \
commit_hash=$$(jj log -r @ --no-graph -T 'commit_id' 2>/dev/null); \
git tag -a "$$tag_name" $${commit_hash:+"$$commit_hash"} -m "$$release_message" 2>/dev/null || echo "$(YELLOW)⚠ Tag $$tag_name already exists$(NC)"; \
echo "$(BLUE)→ Pushing tag $$tag_name...$(NC)"; \
git push origin "$$tag_name" 2>/dev/null || echo "$(YELLOW)⚠ Tag push failed or already pushed$(NC)"; \
rm -f /tmp/obitools4-release-title.txt /tmp/obitools4-release-body.txt
jjfetch:
@echo "$(YELLOW)→ Pulling latest commits...$(NC)"
@jj git fetch
@jj new master@origin
@echo "$(GREEN)✓ Latest commits pulled$(NC)"
.PHONY: all obitools update-deps obitests githubtests help jjnew jjpush jjpush-describe jjpush-bump jjpush-notes jjpush-push jjpush-tag jjfetch bump-version .FORCE
.FORCE:
+33 -7
@@ -16,28 +16,54 @@ The easiest way to run it is to copy and paste the following command into your t
curl -L https://raw.githubusercontent.com/metabarcoding/obitools4/master/install_obitools.sh | bash
```
By default, the script installs the *OBITools* commands and other associated files into the `/usr/local` directory.
The names of the commands in the new *OBITools4* are mostly identical to those in *OBITools2*.
Therefore, installing the new *OBITools* may hide or delete the old ones. If you want both versions to be
available on your system, the installation script offers two options:
By default, the script installs the latest version of *OBITools* commands and other associated files into the `/usr/local` directory.
### Installation Options
The installation script offers several options:
> -l, --list List all available versions and exit.
>
> -v, --version Install a specific version (e.g., `-v 4.4.3`).
> By default, the latest version is installed.
>
> -i, --install-dir Directory where obitools are installed
> (as example use `/usr/local` not `/usr/local/bin`).
>
> -p, --obitools-prefix Prefix added to the obitools command names if you
> want to have several versions of obitools at the
> same time on your system (as example `-p g` will produce
> `gobigrep` command instead of `obigrep`).
>
> -j, --jobs Number of parallel jobs used for compilation
> (default: 1). Increase this value to speed up
> compilation on multi-core systems (e.g., `-j 4`).
### Examples
List all available versions:
```{bash}
curl -L https://raw.githubusercontent.com/metabarcoding/obitools4/master/install_obitools.sh | bash -s -- --list
```
Install a specific version:
```{bash}
curl -L https://raw.githubusercontent.com/metabarcoding/obitools4/master/install_obitools.sh | bash -s -- --version 4.4.3
```
Install in a custom directory with command prefix:
```{bash}
curl -L https://raw.githubusercontent.com/metabarcoding/obitools4/master/install_obitools.sh | \
bash -s -- --install-dir test_install --obitools-prefix k
```
In this last example, the binaries will be installed in the `test_install` directory and all command names will be prefixed with the letter `k`. Thus, `obigrep` will be named `kobigrep`.
### Note on Version Compatibility
The names of the commands in the new *OBITools4* are mostly identical to those in *OBITools2*.
Therefore, installing the new *OBITools* may hide or delete the old ones. If you want both versions to be
available on your system, use the `--install-dir` and `--obitools-prefix` options as shown above.
## Continuing the analysis...
# OBITools release notes
## New changes
### Bug fixes
- In `obipairing`, correct the misspelling of the `obiparing_*` tags, where
  the `i` was missing, into `obipairing_*`.
- In `obigrep` the **-C** option that excludes sequences too abundant was not
functional.
- In `obitaxonomy` the **-l** option that lists all the taxonomic ranks defined
  by a taxonomy was not functional.
- The file type guesser was not using enough data to correctly detect the file
  format when sequences were too long in FASTQ and FASTA files, or when lines
  were too long in CSV files. That is now corrected.
- Options **--fasta** and **--fastq**, usable to specify the input format, were
  ignored. They are now correctly taken into account.
- The `obiannotate` command was crashing when a selection option was used but
  no editing option was specified.
- The `--fail-on-taxonomy` option led to an error on merged taxa even when the
  `--update-taxid` option was used.
- The `--compressed` option was not correctly named. It has been renamed to `--compress`.
### Enhancement
- Some sequences in the Genbank and EMBL databases are several gigabases long. The
sequence parser had to reallocate and recopy memory many times to read them,
resulting in a complexity of O(N^2) for reading such large sequences.
The new file chunk reader has a linear algorithm that speeds up the reading
of very long sequences.
- A new option **--csv** is added to every obitools command to indicate that
  the input format is CSV.
- The new version of obitools now prints taxids in a fancy way, including the
  scientific name and the taxonomic rank (`"taxon:9606 [Homo
  sapiens]@species"`). If you need the old-fashioned raw taxid, a new option
  **--raw-taxid** has been added to make obitools print the taxids without any
  decoration (`"9606"`).
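The reading-complexity enhancement above can be illustrated outside of Go: repeatedly concatenating a growing buffer recopies it on every read, giving quadratic behaviour, while accumulating chunks and joining once is linear. This is a minimal Python sketch of the idea, not the actual OBITools4 chunk reader.

```python
import io

def read_all_quadratic(stream, chunk_size=4):
    # Naive accumulation: each `+=` recopies the whole buffer,
    # giving O(N^2) behaviour on very long sequences.
    data = b""
    while block := stream.read(chunk_size):
        data += block
    return data

def read_all_linear(stream, chunk_size=4):
    # Collect chunks in a list and join once at the end: O(N).
    parts = []
    while block := stream.read(chunk_size):
        parts.append(block)
    return b"".join(parts)

seq = b"acgt" * 1000
assert read_all_linear(io.BytesIO(seq)) == seq
assert read_all_quadratic(io.BytesIO(seq)) == seq
```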
## March 1st, 2025. Release 4.4.0
A new documentation website is available at https://obitools4.metabarcoding.org.
Its development is still in progress.
The biggest step forward in this new version is taxonomy management. The new
version is now able to handle taxonomic identifiers that are not just integer
values. This is a first step towards an easy way to handle other taxonomy
databases soon, such as the GBIF or Catalog of Life taxonomies. This version
is able to handle files containing taxonomic information created by previous
versions of OBITools, but files created by this new version may have some
problems to be analyzed by previous versions, at least for the taxonomic
information.
### Breaking changes
- In `obimultiplex`, the short version of the **--tag-list** option used to
specify the list of tags and primers to be used for the demultiplexing has
been changed from `-t` to `-s`.
- The command `obifind` is now renamed `obitaxonomy`.
- The **--taxdump** option used to specify the path to the taxdump containing
the NCBI taxonomy has been renamed to **--taxonomy**.
### Bug fixes
- Correction of a bug when using paired sequence file with the **--out** option.
- Correction of a bug in `obitag` when trying to annotate very short sequences of
4 bases or less.
- In `obipairing`, correct the stats `seq_a_single` and `seq_b_single` when
  in right alignment mode.
the batch size and not reading the qualities from the fastq files as `obiuniq`
is producing only fasta output without qualities.
- In `obitag`, correct the wrong assignment of the **obitag_bestmatch**
attribute.
- In `obiclean`, the **--no-progress-bar** option disables all progress bars,
not just the data.
- Several fixes in reading FASTA and FASTQ files, including some code
simplification and factorization.
- Fixed a bug in all obitools that caused the same file to be processed
multiple times, when specifying a directory name as input.
### New features
- `obigrep` adds a new **--valid-taxid** option to keep only sequences with a
  valid taxid.
- `obiclean` adds a new **--min-sample-count** option, with a default value of 1,
  to filter out sequences that do not occur in at least the specified number
  of samples.
- `obitaxonomy`: a new **--dump|D** option allows for dumping a sub-taxonomy.
- Taxonomy dump can now be provided as a four-columns CSV file to the
**--taxonomy** option.
- NCBI Taxonomy dump does not need to be uncompressed and unarchived anymore. The
  path of the tar and gzipped dump file can be directly specified using the
allow the processing of the rare fasta and fastq files not recognized.
- In `obiscript`, adds new methods to the Lua sequence object:
- `md5_string()`: returning the MD5 check sum as a hexadecimal string,
- `subsequence(from,to)`: allows extracting a subsequence on a 0 based
coordinate system, upper bound excluded like in go.
- `reverse_complement`: returning a sequence object corresponding to the
reverse complement of the current sequence.
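The `subsequence(from, to)` convention above follows Go slice semantics: 0-based start, exclusive upper bound. A small Python sketch of the assumed bounds checking (Python slices share the same half-open convention):

```python
def subsequence(seq, frm, to):
    # 0-based coordinates, upper bound excluded, matching Go (and
    # Python) slice semantics; bounds are validated explicitly.
    if not (0 <= frm <= to <= len(seq)):
        raise IndexError("subsequence bounds out of range")
    return seq[frm:to]

assert subsequence("acgtacgt", 0, 4) == "acgt"   # first half
assert subsequence("acgtacgt", 4, 8) == "acgt"   # second half
assert subsequence("acgtacgt", 2, 2) == ""       # empty, not an error
```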
### Enhancement
- All obitools now have a **--taxonomy** option. If specified, the taxonomy is
loaded first and taxids annotating the sequences are validated against that
taxonomy. A warning is issued for any invalid taxid and for any taxid that
is transferred to a new taxid. The **--update-taxid** option allows these
old taxids to be replaced with their new equivalent in the result of the
obitools command.
- The scoring system used by the `obipairing` command has been changed to be
  more coherent. In the new version, the scores associated with a match and a
  mismatch involving a nucleotide with a quality score of 0 are equal. This is
  expected, as a zero quality score means complete indecision on the read
  nucleotide; therefore there is no reason to penalize a match differently
  from a mismatch (see
  https://obitools4.metabarcoding.org/docs/commands/alignments/obipairing/exact-alignment/).
- In every *OBITools* command, the progress bar is automatically deactivated
when the standard error output is redirected.
- As Genbank and ENA:EMBL contain very large sequences, while
  OBITools4 is optimized for short sequences, `obipcr` faces some problems
with excessive consumption of computer resources, especially memory. Several
improvements in the tuning of the default `obipcr` parameters and some new
features, currently only available for FASTA and FASTQ file readers, have
been implemented to limit the memory impact of `obipcr` without changing the
computational efficiency too much.
- The logging system, and therefore its format, has been homogenized.
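The quality-zero scoring argument above can be checked numerically. Below is a minimal log-odds sketch in which a quality of 0 is interpreted as complete indecision (error probability capped at 3/4, errors assumed uniform over the three other bases); it illustrates the reasoning, not obipairing's actual scoring function.

```python
import math

def phred_to_perr(q, cap=0.75):
    # Phred: P(error) = 10^(-q/10), capped at 3/4 so that Q=0 means
    # complete indecision among the four nucleotides (an assumption
    # made for this sketch).
    return min(10 ** (-q / 10), cap)

def score(q, is_match):
    # Log-odds of the observed base against a uniform 1/4 background;
    # the mismatch probability is spread uniformly over the 3 other bases.
    e = phred_to_perr(q)
    p = (1 - e) if is_match else e / 3
    return math.log(p / 0.25)

# At Q=0, a match and a mismatch get exactly the same (zero) score:
assert score(0, True) == score(0, False) == 0.0
# At higher qualities, a match scores strictly better than a mismatch:
assert score(30, True) > 0 > score(30, False)
```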
## August 2nd, 2024. Release 4.3.0
### Change of git repository
- The OBITools4 git repository has been moved to the GitHub repository.
The new address is: https://github.com/metabarcoding/obitools4.
Take care to use the new install script for retrieving the new version.
```bash
curl -L https://metabarcoding.org/obitools4/install.sh \
| bash
```
or with options:
```bash
curl -L https://metabarcoding.org/obitools4/install.sh \
| bash -s -- --install-dir test_install --obitools-prefix k
```
### CPU limitation
- By default, *OBITools4* tries to use all the computing power available on
your computer. In some circumstances this can be problematic (e.g. if you
are running on a computer cluster managed by your university). You can limit
the number of CPU cores used by *OBITools4*, either by using the **--max-cpu**
option or by setting the **OBIMAXCPU** environment variable. Some strange
behaviour of *OBITools4* has been observed when users try to limit the
maximum number of usable CPU cores to one. This seems to be caused by the Go
language, and it is not obvious to get *OBITools4* to run correctly on a
single core in all circumstances. Therefore, if you ask to use a single
core, **OBITools4** will print a warning message and actually set this
parameter to two cores. If you really want a single core, you can use the
**--force-one-core** option. But be aware that this can lead to incorrect
calculations.
### New features
- The output of the obitools will evolve to produce results only in standard
formats such as fasta and fastq. For non-sequential data, the output will be
in CSV format, with the separator `,`, the decimal separator `.`, and a
header line with the column names. It is more convenient to use the output
in other programs. For example, you can use the `csvtomd` command to
reformat the CSV output into a Markdown table. The first command to initiate
this change is `obicount`, which now produces a 3-line CSV output.
database for `obitag` is to use `obipcr` on a local copy of Genbank or EMBL.
However, these sequence databases are known to contain many taxonomic
errors, such as bacterial sequences annotated with the taxid of their host
species. `obicleandb` tries to detect these errors. To do this, it first keeps
only sequences annotated with the taxid to which a species, genus, and
family taxid can be assigned. Then, for each sequence, it compares the
distance of the sequence to the other sequences belonging to the same genus
with the p-value of the Mann-Whitney U test in the **obicleandb_trusted**
slot. Later, the distribution of this p-value can be analyzed to determine a
threshold. Empirically, a threshold of 0.05 is a good compromise and allows
filtering out less than 1‰ of the sequences. These sequences can then be
removed using `obigrep`.
- Adds a new `obijoin` utility to join information contained in a sequence
- Adds a new tool `obidemerge` to demerge a `merge_xxx` slot by recreating the
multiple identical sequences having the slot `xxx` recreated with its initial
value and the sequence count set to the number of occurrences referred in the
`merge_xxx` slot. During the operation, the `merge_xxx` slot is removed.
- Adds CSV as one of the input format for every obitools command. To encode
sequence the CSV file must include a column named `sequence` and another
column named `id`. An extra column named `qualities` can be added to specify
the quality scores of the sequence following the same ASCII encoding as the
fastq format. All the other columns will be considered as annotations and will
be interpreted as JSON objects, potentially encoding atomic values. If a
column value can not be decoded as JSON it will be considered as a string.
- A new option **--version** has been added to every obitools command. It will
print the version of the command.
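The CSV input convention described above (mandatory `sequence` and `id` columns, optional `qualities`, every other column decoded as JSON with a fallback to plain strings) can be sketched as follows; the `decode_row` helper is illustrative, not part of OBITools:

```python
import csv
import io
import json

def decode_row(row):
    # 'sequence' and 'id' are mandatory columns; 'qualities' is
    # optional; every other column is decoded as JSON, falling back
    # to a plain string when decoding fails.
    record = {"id": row.pop("id"), "sequence": row.pop("sequence")}
    if "qualities" in row:
        record["qualities"] = row.pop("qualities")
    record["annotations"] = {}
    for key, value in row.items():
        try:
            record["annotations"][key] = json.loads(value)
        except json.JSONDecodeError:
            record["annotations"][key] = value
    return record

data = "id,sequence,count,sample\nseq1,acgt,3,A1\n"
rec = decode_row(next(csv.DictReader(io.StringIO(data))))
assert rec["annotations"]["count"] == 3      # valid JSON -> integer
assert rec["annotations"]["sample"] == "A1"  # not JSON -> plain string
```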
quality scores from a BioSequence object.\
- In `obimultiplex` the ngsfilter file describing the samples can now be provided
not only using the classical ngsfilter format but also using the CSV format.
When using CSV, the first line must contain the column names. 5 columns are
expected:
- `experiment` the name of the experiment
Supplementary columns are allowed. Their names and content will be used to
annotate the sequence corresponding to the sample, as the `key=value;` did
in the ngsfilter format.
The CSV format used allows for comment lines starting with `#` character.
Special data lines starting with `@param` in the first column allow configuring
the algorithm. The **--template** option provides an over-commented example of
the CSV format, including all the possible options.
### Enhancement
- By default, *OBITools4* tries to use all the computing power available on
your computer. In some circumstances this can be problematic (e.g. if you
are running on a computer cluster managed by your university). You can limit
the number of CPU cores used by *OBITools4*, either by using the **--max-cpu**
option or by setting the **OBIMAXCPU** environment variable. Some strange
behavior of *OBITools4* has been observed when users try to limit the
maximum number of usable CPU cores to one. This seems to be caused by the Go
language, and it is not obvious to get *OBITools4* to run correctly on a
single core in all circumstances. Therefore, if you ask to use a single
core, **OBITools4** will print a warning message and actually set this
parameter to two cores. If you really want a single core, you can use the
**--force-one-core** option. But be aware that this can lead to incorrect
calculations.
- In every *OBITools* command, the progress bars are automatically deactivated
  when the standard error output is redirected.
- As Genbank and ENA:EMBL contain very large sequences, while
  OBITools4 is optimised for short sequences, `obipcr` faces some problems
with excessive consumption of computer resources, especially memory. Several
improvements in the tuning of the default `obipcr` parameters and some new
features, currently only available for FASTA and FASTQ file readers, have
been implemented to limit the memory impact of `obipcr` without changing the
computational efficiency too much.
- The logging system, and therefore its format, has been homogenized.
### Bug
- In `obitag`, correct the wrong assignment of the **obitag_bestmatch**
attribute.
- In `obiclean`, the **--no-progress-bar** option disables all progress bars,
not just the data.
- Several fixes in reading FASTA and FASTQ files, including some code
  simplification and factorization.
- Fixed a bug in all obitools that caused the same file to be processed
  multiple times when specifying a directory name as input.
## April 2nd, 2024. Release 4.2.0
### New features
- A new OBITools named `obiscript` allows processing each sequence according
to a Lua script. This is an experimental tool. The **--template** option
allows for generating an example script on the `stdout`.
- Two of the main classes, `obiseq.SeqWorker` and `obiseq.SeqWorker`, have their
  declaration changed. Both now return two values: a `obiseq.BioSequenceSlice`
and an `error`. This allows a worker to return potentially several sequences
as the result of the processing of a single sequence, or zero, which is
equivalent to filtering out the input sequence.
- In `obitag` if the reference database contains sequences annotated by taxid
not referenced in the taxonomy, the corresponding sequences are discarded
from the reference database and a warning indicating the sequence *id* and the
wrong taxid is emitted.
- The bug corrected in the parsing of EMBL and Genbank files as implemented in
version 4.1.2 of OBITools4, potentially induced some reduction in the
performance of the parsing. This should now be fixed.
- In the same idea, the parsing of Genbank and EMBL files was reading and storing
  in memory not only the sequence but also the annotations (features table).
  Up to now, none of the OBITools use this information, but with large
  complete genomes, it occupies a lot of memory. To reduce this impact,
### New feature
- In `obimatrix` a **--transpose** option allows transposing the produced
matrix table in CSV format.
- In `obipairing` and `obipcrtag`, two new options, **--exact-mode** and
  **--fast-absolute**, control the heuristic used in the alignment
the exact algorithm at the cost of speed. **--fast-absolute** changes the
scoring schema of the heuristic.
- In `obiannotate` adds the possibility to annotate the first match of a
pattern using the same algorithm as the one used in `obipcr` and
`obimultiplex`. For that, four options were added :
- **--pattern** : to specify the pattern. It can use IUPAC codes and
position with no error tolerated has to be followed by a `#` character.
### Bugs
- In the obitools language, the `composition` function now returns a map
indexed by lowercase string "a", "c", "g", "t" and "o" for other instead of
being indexed by the ASCII codes of the corresponding letters.
- Correction of the reverse-complement operation. Every reverse complement of
duplicating the quality values. This made `obimultiplex` produce fastq
files with sequences having duplicated quality values.
### Be careful
GO 1.21.0 is out, and it includes new functionalities which are used in the
OBITools4 code. If you use the recommended method for compiling OBITools on your
computer, there is no problem, as the script always loads the latest GO version.
If you rely on your personal GO install, please think of updating.
## August 29th, 2023. Release 4.0.5
### Bugs
- Patch a bug in the `obiseq.BioSequence` constructor leading to an error on
almost every obitools. The error message indicates : `fatal error: sync:
unlock of unlocked mutex`. This bug was introduced in release 4.0.4.
data structure to limit the number of alignments actually computed. This
increases a bit the speed of both programs. `obirefidx` is nevertheless
still too slow compared to my expectation.
- Switch to a parallel version of the GZIP library, allowing for high speed
compression and decompression operations on files.
### New feature
```bash
--unidentified not_assigned.fastq
```
The command produced four files : `tagged_library_R1.fastq` and
`tagged_library_R2.fastq` containing the assigned reads and
`not_assigned_R1.fastq` and `not_assigned_R2.fastq` containing the
unassignable reads.
The tagged library files can then be split using `obidistribute`:
```{bash}
mkdir pcr_reads
```
- Adding of two options **--add-lca-in** and **--lca-error** to `obiannotate`.
These options aim to help during construction of reference database using
`obipcr`. On `obipcr` output, `obiuniq` is commonly run. To merge identical
sequences annotated with different taxids, it is now possible to use the
following strategy :
```{bash}
obiuniq -m taxid myrefdb.obipcr.fasta \
```
- Correction of a bug in `obiconsensus` leading into the deletion of a base
close to the beginning of the consensus sequence.
## March 31st, 2023. Release 4.0.2
### Compiler change
- Add the possibility of looking for patterns with indels. This has been added to
  `obimultiplex` through the **--with-indels** option.
- Every obitools command has a **--pprof** option making the command
publishing a profiling website available at the address :
<http://localhost:8080/debug/pprof/>
- A new `obiconsensus` command has been added. It is a prototype. It aims to
build a consensus sequence from a set of reads. The consensus is estimated
for all the sequences contained in the input file. If several input files
or a directory name are provided, the result contains a consensus per file.
The *id* of the sequence is the name of the input file depleted of its
directory name and of all its extensions.
- In `obipcr` an experimental option **--fragmented** allows for splitting very
long query sequences into shorter fragments, with an overlap between two
contiguous fragments ensuring that no amplicons are missed despite the split.
As a side effect, some amplicons can be identified twice.
### Enhancement
- *OBITools* automatically process all the sequence files contained in
  a directory and its subdirectories\
  recursively if its name is provided as input. To easily process Genbank
files, the corresponding filename extensions have been added. Today the
following extensions are recognized as sequence files : `.fasta`, `.fastq`,
``` bash
export OBICPUMAX=4
```
- Adds a new option --out\|-o allowing to specify the name of an output file.
``` bash
obiconvert -o xyz.fasta xxx.fastq
```
matched files remain consistent when processed.
- Adding of the function `ifelse` to the expression language for computing
conditional values.
- Adding two functions to the expression language related to sequence
  composition : `composition` and `gcskew`. Both are taking a sequence as
single argument.
## February 18th, 2023. Release 4.0.0
It is the first version of the *OBITools* version 4. I decided to tag it
following two weeks of intensive data analysis with them, which allowed
discovering many small bugs present in the previous non-official version.
Obviously, other
bugs are certainly present in the code, and you are welcome to use the git
ticket system to mention them. But they seem to produce now reliable results.
### Corrected bugs
of sequences and to the production of an incorrect file because of the last
sequence record, sometimes truncated in its middle. This was only occurring
when more than a single CPU was used. It was affecting every obitools.
- The `obipairing` software had a bug in the right alignment procedure. This led
  to the non-alignment of very short barcodes during the pairing of the forward
  and reverse reads.
- The `obipairing` tool had a non-deterministic behavior when aligning a
  pair of very low quality reads. This meant that the result for the same low
  quality read pair was not the same from run to run.
### New features
- Adding of a `--compress|-Z` option to every obitools allowing to produce
`gz` compressed output. OBITools were already able to deal with gzipped input
files transparently. They can now produce their results in the same format.
- Adding of a `--append|-A` option to the `obidistribute` tool. It allows
  appending the result of an `obidistribute` execution to preexisting files.
- Adding of a `--directory|-d` option to the `obidistribute` tool. It allows
  declaring a secondary classification key over the one defined by the
  `--category|-c` option. This extra key leads to producing directories in
which files produced according to the primary criterion are stored.
- Adding of the functions `subspc`, `printf`, `int`, `numeric`, and `bool` to
the expression language.
# NAME
obicomplement — reverse complement of sequences
---
# SYNOPSIS
```
obicomplement [--batch-mem <string>] [--batch-size <int>]
[--batch-size-max <int>] [--compress|-Z] [--csv] [--debug]
[--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta]
[--fasta-output] [--fastq] [--fastq-output] [--genbank]
[--help|-h|-?] [--input-OBI-header] [--input-json-header]
[--json-output] [--max-cpu <int>] [--no-order]
[--no-progressbar] [--out|-o <FILENAME>]
[--output-OBI-header|-O] [--output-json-header]
[--paired-with <FILENAME>] [--raw-taxid] [--silent-warning]
[--skip-empty] [--solexa] [--taxonomy|-t <string>] [--u-to-t]
[--update-taxid] [--with-leaves] [<args>]
```
---
# DESCRIPTION
`obicomplement` computes the reverse complement of every sequence in the
input. For each input sequence, the nucleotides are first reversed, then
each base is replaced by its Watson-Crick complement (A↔T, C↔G), yielding
the strand that would pair with the original sequence read in the opposite
direction.
When quality scores are present (FASTQ data), they are reversed in the same
order as the sequence so that each quality value remains associated with its
corresponding base. Ambiguous IUPAC characters (e.g. `N`, `R`, `Y`) are
handled correctly and preserved in the output.
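The transformation can be sketched in a few lines of Python; the IUPAC complement table below follows the standard pairings, and the quality list is reversed alongside the bases. This is an illustration, not the Go code used by `obicomplement`.

```python
# IUPAC complement table; both lowercase and uppercase are handled.
_COMP = str.maketrans(
    "acgtrykmbdhvnswACGTRYKMBDHVNSW",
    "tgcayrmkvhdbnswTGCAYRMKVHDBNSW",
)

def reverse_complement(seq, qualities=None):
    # Complement each base, then reverse the sequence; qualities,
    # when present, are reversed in the same order so that each
    # score stays associated with its base.
    rc = seq.translate(_COMP)[::-1]
    return rc, (qualities[::-1] if qualities is not None else None)

rc, q = reverse_complement("acgtn", [40, 38, 30, 20, 10])
assert rc == "nacgt"
assert q == [10, 20, 30, 38, 40]
```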
This operation is commonly needed when sequences have been sequenced on the
wrong strand, when a primer is designed on the reverse strand, or when
preparing sequences for strand-aware downstream analyses.
The command reads from standard input or from one or more files, processes
sequences in parallel, and writes the result to standard output or to the
file specified with `--out`.
---
# INPUT
`obicomplement` accepts biological sequence data in FASTA, FASTQ, EMBL,
GenBank, ecoPCR output, and CSV formats. When no format flag is given, the
format is inferred automatically from the file contents or extension.
Input is read from standard input when no filename argument is provided, or
from one or more files passed as positional arguments. Gzip-compressed files
are handled transparently.
Paired-end data can be provided with `--paired-with`, which specifies the
file containing the second mate. Both mates are reverse-complemented and
written to separate output files.
---
# OUTPUT
The output is a sequence file in which every sequence is the reverse
complement of the corresponding input sequence. The output format matches
the input by default (FASTA if no quality data, FASTQ if quality data are
present), and can be overridden with `--fasta-output`, `--fastq-output`, or
`--json-output`.
All annotations (attributes stored in the sequence header) are preserved
unchanged. Quality scores, when present, are reversed to stay aligned with
their bases.
## Observed output example
```
>seq001 {"definition":"basic DNA sequence"}
cgatcgatcgatcgatcgat
>seq002 {"definition":"GC-rich sequence"}
gcgcgcgcgcgcgcgcgcgc
>seq003 {"definition":"AT-rich sequence"}
atatatatatatatatatat
>seq004 {"definition":"palindromic sequence"}
aattccggaattccggaatt
>seq005 {"definition":"mixed sequence"}
agctagcatgcatagccgat
```
---
# OPTIONS
## Input format
**`--fasta`**
: Default: false. Force parsing of input as FASTA format.
**`--fastq`**
: Default: false. Force parsing of input as FASTQ format.
**`--embl`**
: Default: false. Force parsing of input as EMBL flatfile format.
**`--genbank`**
: Default: false. Force parsing of input as GenBank flatfile format.
**`--ecopcr`**
: Default: false. Force parsing of input as ecoPCR output format.
**`--csv`**
: Default: false. Force parsing of input as CSV format.
**`--solexa`**
: Default: false. Decode quality scores using the Solexa/Illumina pre-1.3
convention instead of the standard Phred+33 encoding.
**`--input-OBI-header`**
: Default: false. Interpret FASTA/FASTQ header annotations using the OBI
key=value format.
**`--input-json-header`**
: Default: false. Interpret FASTA/FASTQ header annotations using JSON
format.
**`--no-order`**
: Default: false. When several input files are given, declare that no
ordering relationship exists among them, allowing the reader to interleave
records freely.
**`--paired-with <FILENAME>`**
: Default: none. File containing the paired (R2) reads. When set,
`obicomplement` processes both mates and writes them to separate output
files.
## Sequence preprocessing
**`--u-to-t`**
: Default: false. Convert Uracil (U) to Thymine (T) before computing the
reverse complement. Useful when processing RNA sequences that must be
treated as DNA.
**`--skip-empty`**
: Default: false. Discard sequences of length zero from the output.
## Output format
**`--fasta-output`**
: Default: false. Write output in FASTA format regardless of whether quality
scores are present.
**`--fastq-output`**
: Default: false. Write output in FASTQ format (requires quality data).
**`--json-output`**
: Default: false. Write output in JSON format.
**`--out|-o <FILENAME>`**
: Default: `-` (standard output). File used to save the output.
**`--output-OBI-header|-O`**
: Default: false. Write FASTA/FASTQ header annotations in OBI key=value
format.
**`--output-json-header`**
: Default: false. Write FASTA/FASTQ header annotations in JSON format.
**`--compress|-Z`**
: Default: false. Compress the output with gzip.
## Taxonomy
**`--taxonomy|-t <string>`**
: Default: none. Path to a taxonomy database. Required only when the input
sequences carry taxid annotations that need to be validated or updated.
**`--fail-on-taxonomy`**
: Default: false. Cause `obicomplement` to exit with an error if a taxid
referenced in the data is not a currently valid node in the loaded
taxonomy.
**`--update-taxid`**
: Default: false. Automatically replace taxids that have been declared
merged into a newer node by the taxonomy database.
**`--raw-taxid`**
: Default: false. Print taxids without appending the taxon name and rank.
**`--with-leaves`**
: Default: false. When the taxonomy is extracted from the sequence file,
attach sequences as leaves of their taxid node.
## Performance and diagnostics
**`--max-cpu <int>`**
: Default: 16 (env: `OBIMAXCPU`). Number of parallel threads used to
process sequences.
**`--batch-size <int>`**
: Default: 1 (env: `OBIBATCHSIZE`). Minimum number of sequences per
processing batch.
**`--batch-size-max <int>`**
: Default: 2000 (env: `OBIBATCHSIZEMAX`). Maximum number of sequences per
processing batch.
**`--batch-mem <string>`**
: Default: `128M` (env: `OBIBATCHMEM`). Maximum memory allocated per batch
(e.g. `128K`, `64M`, `1G`). Set to `0` to disable the memory limit.
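The size strings accepted by `--batch-mem` can be modelled as sketched below. This is a hypothetical parser written for illustration only; whether OBITools treats `K`/`M`/`G` as binary or decimal multiples is an assumption here (binary is used).

```python
# Hypothetical parser for --batch-mem style size strings such as 128K, 64M, 1G.
# Assumption: suffixes denote binary multiples (1K = 1024 bytes); check the
# OBITools source for the authoritative rule.
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}

def parse_batch_mem(spec):
    """Return the byte limit encoded by spec, or None when spec is '0'."""
    spec = spec.strip().upper()
    if spec == "0":                  # 0 disables the memory limit
        return None
    if spec[-1] in UNITS:
        return int(spec[:-1]) * UNITS[spec[-1]]
    return int(spec)                 # plain byte count
```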
**`--no-progressbar`**
: Default: false. Disable the progress bar printed to stderr.
**`--silent-warning`**
: Default: false (env: `OBIWARNING`). Suppress warning messages.
**`--debug`**
: Default: false (env: `OBIDEBUG`). Enable debug logging.
---
# EXAMPLES
```bash
# Reverse complement all sequences in a FASTA file
obicomplement sequences.fasta > out_default.fasta
```
**Expected output:** 5 sequences written to `out_default.fasta`.
```bash
# Reverse complement a FASTQ file, preserving quality scores
obicomplement reads.fastq --fastq-output --out out_fastq.fastq
```
**Expected output:** 5 sequences written to `out_fastq.fastq`.
```bash
# Convert RNA sequences to their reverse complement DNA strand
obicomplement --u-to-t rna_sequences.fasta > out_rna_rc.fasta
```
**Expected output:** 3 sequences written to `out_rna_rc.fasta`.
```bash
# Reverse complement paired-end reads into two separate output files
obicomplement R1.fastq --paired-with R2.fastq --out out_paired.fastq
```
**Expected output:** 3 sequences written to `out_paired_R1.fastq` and 3 sequences to `out_paired_R2.fastq`.
```bash
# Reverse complement and compress output, skipping any empty sequences
obicomplement --skip-empty --compress sequences.fasta --out out_compressed.fasta.gz
```
**Expected output:** 5 sequences written to `out_compressed.fasta.gz` (gzip-compressed FASTA).
```bash
# Reverse complement with OBI-format header output
obicomplement --output-OBI-header sequences.fasta --out out_obi.fasta
```
**Expected output:** 5 sequences written to `out_obi.fasta`.
```bash
# Reverse complement with explicit JSON-format header output
obicomplement --output-json-header sequences.fasta --out out_jsonheader.fasta
```
**Expected output:** 5 sequences written to `out_jsonheader.fasta`.
```bash
# Reverse complement and write full JSON output format
obicomplement --json-output sequences.fasta --out out_json.json
```
**Expected output:** 5 sequences written to `out_json.json`.
---
# SEE ALSO
- `obiconvert` — format conversion and sequence filtering pipeline
- `obipairing` — paired-end read merging (uses reverse complement internally)
- `obigrep` — sequence filtering and selection
---
# NOTES
Quality scores (Phred-scaled) are reversed in lock-step with the sequence
so that positional quality information remains valid after the reverse
complement operation. This is essential for downstream tools that rely on
per-base quality for alignment or variant calling.
Ambiguous IUPAC characters and gap symbols (`-`) are handled gracefully:
standard ambiguous bases are complemented according to IUPAC rules, while
gap and missing-data symbols are preserved unchanged.
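The behaviour described above can be sketched in a few lines. This is an illustrative Python model, not the actual Go implementation; the translation table encodes the standard IUPAC complement pairs, with `n` and the gap symbol mapped to themselves.

```python
# Illustrative model of obicomplement's core operation (the real tool is
# written in Go; this sketch only mirrors the documented behaviour).
COMP = str.maketrans("acgtrykmswbdhvn-", "tgcayrmkswvhdbn-")

def reverse_complement(seq, quals=None):
    """Reverse-complement a sequence; quality scores, if any, are reversed
    in lock-step so each score stays attached to its original base."""
    rc = seq.lower().translate(COMP)[::-1]
    return rc, (quals[::-1] if quals is not None else None)
```

For example, `reverse_complement("cgat")` yields `("atcg", None)`, and passing a quality list returns it reversed alongside the sequence.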
# obiconsensus(1) — OBITools4 Manual
## NAME
`obiconsensus` — denoise Oxford Nanopore Technology (ONT) reads by building consensus sequences
## SYNOPSIS
```
obiconsensus [OPTIONS] [FILE...]
```
## DESCRIPTION
`obiconsensus` is designed to correct sequencing errors in long reads produced by Oxford Nanopore Technology (ONT) sequencers. Because ONT reads have a relatively high error rate compared to short-read technologies, sequences originating from the same biological molecule can differ slightly from one another. `obiconsensus` groups these related reads and builds a single, more reliable consensus sequence for each group.
The tool works by constructing a *difference graph*: each unique read is represented as a node, and two nodes are connected if their sequences differ by at most a small number of positions (controlled by `--distance`). Within each sample, clusters of closely related reads are identified, and a consensus is assembled from the cluster members using a *de Bruijn graph* approach. The result is a set of high-quality representative sequences, one per cluster.
Two denoising strategies are available:
- **Standard mode** (default): identifies hub nodes (likely true sequences) in the difference graph and builds a consensus from each hub and its immediate neighbours.
- **Clustering mode** (`--cluster`): groups reads around local abundance maxima and builds a consensus from each neighbourhood.
Sequences are read from one or more files, or from standard input when no file is given. Results are written to standard output or to a file specified with `--out`.
The tool processes data on a per-sample basis. Sample identity is taken from a sequence annotation attribute (default: `sample`). Each sample's reads are denoised independently.
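The hub-selection idea behind standard mode can be illustrated with a toy model. This is schematic Python only: it names a read a "hub" when it is at least as abundant as every read within `--distance` differences, whereas the real tool additionally assembles a de Bruijn consensus for each hub and works per sample.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between two equal-length reads."""
    return sum(x != y for x, y in zip(a, b))

def hubs(reads, d=1):
    """Toy difference graph: unique reads are nodes, edges connect reads
    within distance d, and a hub is a read at least as abundant as all of
    its neighbours. Schematic illustration of obiconsensus' standard mode."""
    counts = Counter(reads)
    uniq = list(counts)
    result = []
    for u in uniq:
        neighbours = [v for v in uniq
                      if v != u and len(v) == len(u) and hamming(u, v) <= d]
        if all(counts[u] >= counts[v] for v in neighbours):
            result.append(u)
    return result
```

With five copies of a read, one single-substitution variant, and an unrelated read, only the abundant read and the unrelated one survive as hubs; the low-abundance variant is absorbed by its more abundant neighbour.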
## INPUT FORMATS
`obiconsensus` recognises the following input formats automatically. A specific format can be forced with the corresponding flag:
| Flag | Format |
|------|--------|
| `--fasta` | FASTA |
| `--fastq` | FASTQ |
| `--embl` | EMBL flat file |
| `--genbank` | GenBank flat file |
| `--ecopcr` | ecoPCR output |
| `--csv` | CSV tabular format |
Header annotation styles can be selected with `--input-OBI-header` (OBITools format) or `--input-json-header` (JSON format).
## OUTPUT FORMATS
By default, the output format matches the input format (FASTQ when quality scores are present, FASTA otherwise). The format can be forced:
- `--fasta-output` — write FASTA
- `--fastq-output` — write FASTQ
- `--json-output` — write JSON
- `--output-OBI-header` / `-O` — annotate FASTA/FASTQ title lines in OBITools format
- `--output-json-header` — annotate FASTA/FASTQ title lines in JSON format
- `--compress` / `-Z` — compress output with gzip
Use `--out FILE` / `-o FILE` to write results to a file instead of standard output.
## DENOISING OPTIONS
`--distance INT`, `-d INT`
: Maximum number of differences allowed between two reads for them to be considered related and placed in the same cluster. Default: 1. A value of 1 means reads differing by a single nucleotide substitution are grouped together.
`--cluster`, `-C`
: Switch to clustering mode. Instead of identifying hub sequences, reads are grouped around local abundance maxima. This mode may produce fewer but more representative consensus sequences.
`--kmer-size SIZE`
: Size of the short words (k-mers) used when building the de Bruijn graph for consensus assembly. The default value of `-1` means the size is estimated automatically from the data. Manual adjustment is rarely needed.
`--no-singleton`
: Discard any read (or cluster) that occurs only once across the dataset. Singleton sequences are often the result of sequencing errors and carry little biological signal.
`--low-coverage FLOAT`
: Discard any sample whose sequence coverage falls below this threshold. Default: 0 (no filtering). Useful for removing poorly sequenced samples.
`--sample ATTRIBUTE`, `-s ATTRIBUTE`
: Name of the sequence annotation attribute that identifies the sample of origin. Default: `sample`. Each unique value of this attribute is treated as an independent sample during denoising.
## OUTPUT ANNOTATION OPTIONS
`--unique`, `-U`
: After denoising, dereplicate the output sequences (equivalent to running `obiuniq`). Identical consensus sequences across samples are merged into a single record carrying abundance information.
`--save-graph DIRECTORY`
: Save the difference graphs built during denoising to the specified directory. Each graph is written in GraphML format, one file per sample. Useful for inspecting the clustering structure.
`--save-ratio FILE`
: Save a table of abundance ratios on graph edges to the specified CSV file. Each row describes the relative abundance of a read compared to its neighbours. Useful for quality control and parameter tuning.
## PERFORMANCE OPTIONS
`--max-cpu INT`
: Number of parallel threads to use for computation. Default: all available processors (up to 16). Reducing this value limits memory and CPU usage.
`--batch-size INT`
: Minimum number of sequences processed together in a single batch. Default: 1.
`--batch-size-max INT`
: Maximum number of sequences in a single batch. Default: 2000.
`--batch-mem STRING`
: Maximum memory allocated per batch (e.g., `128M`, `1G`). Default: `128M`. Set to `0` to disable the memory limit.
`--no-progressbar`
: Disable the progress bar.
`--no-order`
: When reading from multiple files, indicate that there is no meaningful order among them. This can improve performance for large multi-file inputs.
## OTHER OPTIONS
`--u-to-t`
: Convert uracil (U) to thymine (T) in all input sequences. Use this option when working with RNA data stored in a DNA context.
`--skip-empty`
: Remove sequences of length zero from the output.
`--solexa`
: Interpret quality scores using the Solexa encoding rather than the standard Phred encoding.
`--silent-warning`
: Suppress warning messages.
`--debug`
: Enable detailed logging for troubleshooting.
`--version`
: Print the version number and exit.
`--help`, `-h`
: Display a brief help message and exit.
## OUTPUT ATTRIBUTES
Each output consensus sequence carries several annotation attributes describing how it was built:
| Attribute | Description |
|-----------|-------------|
| `consensus` | Boolean flag: `true` if the sequence is a true consensus, `false` if it was kept unchanged (e.g., isolated singleton) |
| `merged_sample` | Map of sample names to read counts contributing to this consensus |
| `count` | Total number of reads merged into this consensus across all samples |
| `kmer_size` | Size of the k-mers used to build the de Bruijn graph for this consensus |
| `seq_length` | Length of the consensus sequence |
## EXAMPLES
**Basic denoising of a FASTQ file:**
```sh
obiconsensus reads.fastq > denoised.fastq
```
**Increase the allowed distance between reads to 2:**
```sh
obiconsensus --distance 2 reads.fastq > denoised.fastq
```
**Use clustering mode and remove singletons:**
```sh
obiconsensus --cluster --no-singleton reads.fastq > denoised.fastq
```
**Denoise, then dereplicate the output:**
```sh
obiconsensus --unique reads.fastq > denoised_uniq.fastq
```
**Save denoising graphs for inspection:**
```sh
obiconsensus --save-graph ./graphs reads.fastq > denoised.fastq
```
**Specify the sample annotation attribute:**
```sh
obiconsensus --sample library reads.fastq > denoised.fastq
```
## SEE ALSO
`obiuniq`(1), `obiclean`(1), `obigrep`(1), `obiconvert`(1)
## NOTES
`obiconsensus` was designed primarily for Oxford Nanopore Technology amplicon data, where individual reads of the same molecule may carry different sequencing errors. For short-read Illumina data, `obiclean` may be more appropriate.
The automatic k-mer size selection (`--kmer-size -1`) works well in most cases. If the consensus assembly fails for a group (e.g., due to circular structures in the de Bruijn graph), the k-mer size is progressively increased until the assembly succeeds or a fallback strategy is used.
# NAME
obiconvert — conversion of sequence files to various formats
---
# SYNOPSIS
```
obiconvert [--batch-mem <string>] [--batch-size <int>]
[--batch-size-max <int>] [--compress|-Z] [--csv] [--debug]
[--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta]
[--fasta-output] [--fastq] [--fastq-output] [--genbank]
[--help|-h|-?] [--input-OBI-header] [--input-json-header]
[--json-output] [--max-cpu <int>] [--no-order] [--no-progressbar]
[--out|-o <FILENAME>] [--output-OBI-header|-O]
[--output-json-header] [--paired-with <FILENAME>] [--pprof]
[--pprof-goroutine <int>] [--pprof-mutex <int>] [--raw-taxid]
[--silent-warning] [--skip-empty] [--solexa]
[--taxonomy|-t <string>] [--u-to-t] [--update-taxid] [--version]
[--with-leaves] [<args>]
```
---
# DESCRIPTION
obiconvert is a versatile command-line tool that converts biological sequence data between multiple standard bioinformatics formats. It enables biologists to process large datasets by reading from one format and writing to another, with support for quality scores, taxonomic annotations, and various input/output combinations. The tool is optimized for high-performance processing with configurable batching, parallel execution, and memory management.
Biologists use obiconvert to standardize sequence data for compatibility with different bioinformatics tools, extract quality information from FASTQ files into more readable formats, or convert between FASTA and FASTQ when working with DNA/RNA sequences that have associated quality data. The tool automatically detects input formats and selects the output format based on the data present (FASTQ when quality scores exist, FASTA otherwise); the output filename extension plays no role in this choice. To force a specific output format regardless of input content, use the explicit output flags (`--fasta-output`, `--fastq-output`, `--json-output`).
---
# INPUT
obiconvert accepts input in multiple biological sequence formats:
- **FASTA**: Standard text-based format with `>` headers and sequence data
- **FASTQ**: Text-based format combining sequences with per-base quality scores (default output when both sequence and quality data are present)
- **GenBank**: Comprehensive biological record format with annotations
- **EMBL**: EMBL flatfile format for sequence and feature information
- **ecoPCR**: Specialized output format from ecoPCR amplification tools
- **CSV**: Tabular sequence data with configurable delimiters
Input is provided as positional arguments (file paths or `-` for stdin). The tool automatically detects the input format based on file content and can handle multiple files in sequence. When paired-end sequencing is used, the `--paired-with` option specifies the mate read file.
---
# OUTPUT
obiconvert produces sequence data in several output formats depending on input content and user selection:
- **FASTA**: Text format with sequence only (use `--fasta-output` to force)
- **FASTQ**: Format including quality scores (default when quality data present; use `--fastq-output` to force)
- **JSON**: Structured output with all sequence metadata and attributes (use `--json-output`)
The tool preserves all sequence annotations (taxonomic information, custom attributes) during conversion. When converting to FASTA/FASTQ formats, title line annotations can be formatted as OBI structured data or JSON using the `--output-OBI-header`/`--output-json-header` flags. Sequences of zero length can be suppressed with `--skip-empty`.
## Observed output example
```
>seq001 {"definition":"DNA sequence with quality scores for FASTQ to FASTA conversion"}
atcgatcgatcgatcgatcgatcgatcgatcgatcgatcg
>seq002 {"definition":"Second sequence with moderate quality scores"}
gctagctagctagctagctagctagctagctagctagct
>seq003 {"definition":"Third sequence with high quality scores"}
ttaaccggttaaccggttaaccggttaaccggttaaccg
>seq004 {"definition":"Fourth sequence with variable quality scores"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacg
```
---
# OPTIONS
## Input Format Options
- **--fasta**: Read data following the fasta format. (default: false)
- **--fastq**: Read data following the fastq format. (default: false)
- **--genbank**: Read data following the Genbank flatfile format. (default: false)
- **--embl**: Read data following the EMBL flatfile format. (default: false)
- **--ecopcr**: Read data following the ecoPCR output format. (default: false)
- **--csv**: Read data following the CSV format. (default: false)
## Input Header Options
- **--input-OBI-header**: FASTA/FASTQ title line annotations follow OBI format. (default: false)
- **--input-json-header**: FASTA/FASTQ title line annotations follow json format. (default: false)
## Output Format Options
- **--fasta-output**: Write sequence in fasta format (default if no quality data available). (default: false)
- **--fastq-output**: Write sequence in fastq format (default if quality data available). (default: false)
- **--json-output**: Write sequence in json format. (default: false)
## Output Header Options
- **--output-OBI-header|-O**: output FASTA/FASTQ title line annotations follow OBI format. (default: false)
- **--output-json-header**: output FASTA/FASTQ title line annotations follow json format. (default: false)
## Processing Options
- **--skip-empty**: Sequences of length equal to zero are suppressed from the output (default: false)
- **--no-order**: When several input files are provided, indicates that there is no order among them. (default: false)
- **--u-to-t**: Convert Uracil to Thymine. (default: false)
- **--update-taxid**: Automatically replace taxids that have been declared merged into a newer node of the taxonomy. (default: false)
- **--raw-taxid**: When set, taxids are printed without any supplementary information (taxon name and rank). (default: false)
- **--fail-on-taxonomy**: Fail with an error if a taxid used in the data is not a currently valid one. (default: false)
- **--with-leaves**: If the taxonomy is extracted from a sequence file, sequences are attached as leaves of their taxid node. (default: false)
## File Options
- **--out|-o <FILENAME>**: Filename used for saving the output (default: "-")
- **--paired-with <FILENAME>**: Filename containing the paired reads (default: "")
## Performance Options
- **--batch-mem <string>**: Maximum memory per batch (e.g. 128K, 64M, 1G; default: 128M). Set to 0 to disable. (default: "", env: OBIBATCHMEM)
- **--batch-size <int>**: Minimum number of sequences per batch (floor, default 1) (default: 1, env: OBIBATCHSIZE)
- **--batch-size-max <int>**: Maximum number of sequences per batch (ceiling, default 2000) (default: 2000, env: OBIBATCHSIZEMAX)
- **--max-cpu <int>**: Number of parallel threads computing the result (default: 16, env: OBIMAXCPU)
- **--compress|-Z**: Compress all the result using gzip (default: false)
## Debug Options
- **--debug**: Enable debug mode, by setting log level to debug. (default: false, env: OBIDEBUG)
- **--silent-warning**: Stop printing of the warning message (default: false, env: OBIWARNING)
- **--no-progressbar**: Disable the progress bar printing (default: false)
## Profiling Options
- **--pprof**: Enable pprof server. Look at the log for details. (default: false)
- **--pprof-goroutine <int>**: Enable profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)
- **--pprof-mutex <int>**: Enable profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
## Utility Options
- **--taxonomy|-t <string>**: Path to the taxonomy database. (default: "")
- **--solexa**: Decodes quality string according to the Solexa specification. (default: false, env: OBISOLEXA)
- **--help|-h|-?**: Show help message (default: false)
- **--version**: Prints the version and exits. (default: false)
---
# EXAMPLES
## Convert FASTQ to FASTA
```bash
# Convert quality-score data from FASTQ to readable FASTA format
obiconvert --fastq --fasta-output input.fastq -o output.fasta
```
**Expected output:** 4 sequences written to `output.fasta`.
## Convert FASTA to JSON
```bash
# Convert sequences to structured JSON format preserving all annotations
obiconvert --fasta --json-output input.fasta -o output.json
```
**Expected output:** 3 sequences written to `output.json`.
## Process paired-end sequencing data
```bash
# Convert paired FASTQ files preserving read pairing
obiconvert --fastq --fasta-output forward.fastq --paired-with reverse.fastq -o merged_sequences.fasta
```
**Expected output:** 4 sequences written to `merged_sequences_R1.fasta` and `merged_sequences_R2.fasta`.
---
# SEE ALSO
- obiannotate: Add taxonomic and functional annotations to sequences
- obicount: Count sequences in files
- obigrep: Filter sequences based on attributes or patterns
- obisummary: Generate statistics from sequence files
- obiuniq: Remove duplicate sequences
---
# NOTES
obiconvert automatically selects the output format based on the input data, preferring FASTQ when quality scores are available and FASTA otherwise; the output filename extension is not used to determine the format. To force a specific output format, use `--fasta-output`, `--fastq-output`, or `--json-output` explicitly.
Memory usage is controlled through batch processing, with configurable memory limits per batch to handle large datasets efficiently. Progress reporting can be disabled for scripting purposes using `--no-progressbar`.
When working with taxonomic data, ensure the taxonomy database is accessible and properly formatted to avoid failures during sequence annotation processing.
# NAME
obicount — counts the sequences present in a file of sequences
---
# SYNOPSIS
```
obicount [--batch-mem <string>] [--batch-size <int>] [--batch-size-max <int>]
[--csv] [--debug] [--ecopcr] [--embl] [--fasta] [--fastq]
[--genbank] [--help|-h|-?] [--input-OBI-header]
[--input-json-header] [--max-cpu <int>] [--no-order] [--pprof]
[--pprof-goroutine <int>] [--pprof-mutex <int>] [--reads|-r]
[--silent-warning] [--solexa] [--symbols|-s] [--u-to-t]
[--variants|-v] [--version] [<args>]
```
---
# DESCRIPTION
obicount is a command-line tool designed to count biological sequences from various input formats. It helps biologists quickly obtain quantitative metrics about sequence collections, which is essential for quality control, data assessment, and pipeline monitoring. The tool can count reads (total sequences), variants (unique sequence strings), or symbols (sum of character lengths), providing flexibility to focus on specific aspects of sequence data depending on the analysis needs.
---
# INPUT
obicount accepts input from files or stdin, supporting multiple biological sequence formats:
- FASTA (.fasta[.gz])
- FASTQ (.fastq[.fq][.gz])
- GenBank/EMBL (.gb|.gbff|.dat[.gz])
- ecoPCR format (.ecopcr[.gz])
- CSV format (--csv flag)
Input can be provided as multiple filenames or read from stdin. The tool automatically detects file formats and parses sequences accordingly.
---
# OUTPUT
obicount outputs one or more of the following metrics, depending on the flags used:
- **Read counts**: Total number of sequences in the input
- **Variant counts**: Number of unique sequence strings (distinct sequences)
- **Symbol counts**: Sum of all character lengths across all sequences
When no specific counting flag is provided (`-r`, `-v`, `-s`), all three metrics are reported by default. Output is printed to stdout as CSV with the header `entities,n`: one row per metric, giving the entity type and its count.
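A downstream script can consume this report directly. The sketch below is illustrative; it assumes only the `entities,n` table is passed to the parser, with log lines such as those in the observed example filtered out or written separately.

```python
import csv
import io

def parse_obicount(report):
    """Parse obicount's CSV report (header 'entities,n') into a dict.
    Assumption: the input contains only the CSV table, no log lines."""
    rows = csv.DictReader(io.StringIO(report))
    return {row["entities"]: int(row["n"]) for row in rows}
```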
---
# OPTIONS
## General Options
- --help|-h|-?
Show help message and exit.
- --max-cpu <int>
Number of parallel threads computing the result (default: 16, env: OBIMAXCPU).
- --debug
Enable debug mode, by setting log level to debug. (default: false, env: OBIDEBUG)
- --silent-warning
Stop printing of the warning message (default: false, env: OBIWARNING)
## Input Format Options
- --fasta
Read data following the fasta format. (default: false)
- --fastq
Read data following the fastq format. (default: false)
- --genbank
Read data following the Genbank flatfile format. (default: false)
- --embl
Read data following the EMBL flatfile format. (default: false)
- --ecopcr
Read data following the ecoPCR output format. (default: false)
- --csv
Read data following the CSV format. (default: false)
## Input Header Options
- --input-OBI-header
FASTA/FASTQ title line annotations follow OBI format. (default: false)
- --input-json-header
FASTA/FASTQ title line annotations follow json format. (default: false)
## Counting Mode Options
- --reads|-r
Prints read counts. (default: false)
- --variants|-v
Prints variant counts. (default: false)
- --symbols|-s
Prints symbol counts. (default: false)
## Processing Options
- --u-to-t
Convert Uracil to Thymine. (default: false)
- --solexa
Decodes quality string according to the Solexa specification. (default: false, env: OBISOLEXA)
- --no-order
When several input files are provided, indicates that there is no order among them. (default: false)
## Performance Options
- --batch-mem <string>
Maximum memory per batch (e.g. 128K, 64M, 1G; default: 128M). Set to 0 to disable. (default: "", env: OBIBATCHMEM)
- --batch-size <int>
Minimum number of sequences per batch (floor, default 1) (default: 1, env: OBIBATCHSIZE)
- --batch-size-max <int>
Maximum number of sequences per batch (ceiling, default 2000) (default: 2000, env: OBIBATCHSIZEMAX)
- --max-cpu <int>
Number of parallel threads computing the result (default: 16, env: OBIMAXCPU)
## Profiling Options
- --pprof
Enable pprof server. Look at the log for details. (default: false)
- --pprof-goroutine <int>
Enable profiling of goroutine blocking profile. (default: 6060, env: OBIPPROFGOROUTINE)
- --pprof-mutex <int>
Enable profiling of mutex lock. (default: 10, env: OBIPPROFMUTEX)
- --version
Prints the version and exits. (default: false)
---
# EXAMPLES
```bash
# Count total number of sequences in a FASTA file
# Useful for quick assessment of dataset size
obicount input.fasta
```
**Expected output:** 4 sequences, out_default.txt
```bash
# Count only the number of unique sequence variants
# Helpful for identifying genetic diversity in population data
obicount --variants input.fasta
```
**Expected output:** 4 sequences, out_variants.txt
```bash
# Count sum of all sequence symbol lengths (nucleotides/amino acids)
# Useful for estimating total data volume or computing average read length
obicount --symbols input.fasta
```
**Expected output:** 4 sequences, out_symbols.txt
```bash
# Count reads from FASTQ format with quality scores
# Essential for assessing read throughput in sequencing data
obicount --fastq --reads input.fastq
```
**Expected output:** 4 sequences, out_fastq_reads.txt
---
# OUTPUT
## Observed output example
```
time="2026-04-02T19:33:11+02:00" level=info msg="Number of workers set 16"
time="2026-04-02T19:33:11+02:00" level=info msg="Found 1 files to process"
time="2026-04-02T19:33:11+02:00" level=info msg="input.fasta mime type: text/fasta"
entities,n
variants,5
reads,5
symbols,435
```
---
# SEE ALSO
- obiconvert - Convert between biological sequence file formats
- obiuniq - Remove duplicate sequences from files
---
# NOTES
_(not available)_
# NAME
obicsv — converts sequence files to CSV format
---
# SYNOPSIS
```
obicsv [--auto] [--batch-mem <string>] [--batch-size <int>]
[--batch-size-max <int>] [--compress|-Z] [--count] [--csv] [--debug]
[--definition|-d] [--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta]
[--fastq] [--genbank] [--help|-h|-?] [--ids|-i] [--input-OBI-header]
[--input-json-header] [--keep|-k <KEY>]... [--max-cpu <int>]
[--na-value <NAVALUE>] [--no-order] [--no-progressbar] [--obipairing]
[--out|-o <FILENAME>] [--pprof] [--pprof-goroutine <int>]
[--pprof-mutex <int>] [--quality|-q] [--raw-taxid] [--sequence|-s]
[--silent-warning] [--solexa] [--taxon] [--taxonomy|-t <string>]
[--u-to-t] [--update-taxid] [--version] [--with-leaves] [<args>]
```
---
# DESCRIPTION
obicsv converts biological sequence data into CSV format for easy inspection, spreadsheet analysis, or integration with other tools. A biologist might use it to export sequences from OBITools for quality control, taxonomic inspection, or downstream analysis in R or Python.
Columns must be explicitly selected: use `--ids` for the identifier, `--sequence` for the nucleotide sequence, `--quality` for quality scores, `--taxon` for taxonomic information, `--auto` to auto-detect annotation attributes, or `--keep` for specific named attributes. Multiple flags can be combined freely.
The command uses parallel workers to process large datasets efficiently and can write output to stdout or directly to a file.
---
# INPUT
obicsv accepts input from files or stdin. The input format is automatically detected based on the file extension, but can be explicitly specified using format flags.
Supported input formats:
- FASTA (`--fasta`)
- FASTQ (`--fastq`)
- GenBank (`--genbank`)
- EMBL (`--embl`)
- ecoPCR output (`--ecopcr`)
- CSV (`--csv`)
Input sources:
- Local files (specified as arguments)
- stdin (when no input file is provided)
- Remote URLs (`http://`, `https://`, `ftp://`)
- Directories (automatically scanned for valid files)
Header formats:
- OBI format (`--input-OBI-header`)
- JSON format (`--input-json-header`)
- Auto-detection (default)
Taxonomy database can be provided with `--taxonomy|-t`.
---
# OUTPUT
The output is a CSV file with one row per sequence. The columns included depend on the flags used:
| Column | Flag | Description |
|--------|------|-------------|
| id | `--ids\|-i` | Sequence identifier |
| sequence | `--sequence\|-s` | DNA/RNA sequence |
| qualities | `--quality\|-q` | Quality scores (ASCII-encoded) |
| definition | `--definition\|-d` | Sequence description/annotation |
| count | `--count` | Number of reads represented by this sequence |
| taxid | `--taxon` | NCBI taxonomy identifier |
| scientific_name | `--taxon` | Taxonomic scientific name |
| custom attributes | `--keep\|-k` | Any attribute stored in sequence annotations |
If `--auto` is used, columns are automatically determined based on the attributes present in the first batch of sequences.
Missing values are written as the NA value (default: "NA").
## Observed output example
```csv
id,sequence
seq001,atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc
seq002,ggggaaaattttccccggggaaaattttccccggggaaaattttccccggggaaaatttt
seq003,cccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
```
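Output like the example above loads directly into standard CSV tooling. A minimal Python sketch, assuming the default `NA` marker, maps missing values back to `None`:

```python
import csv
import io

def read_obicsv(text, na="NA"):
    """Load obicsv output; cells equal to the NA marker become None."""
    return [{k: (None if v == na else v) for k, v in row.items()}
            for row in csv.DictReader(io.StringIO(text))]
```

Pass a different `na` argument when the file was produced with a custom `--na-value`.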
---
# OPTIONS
## Output Columns
These flags control which columns appear in the CSV output.
- **`--ids|-i`**
- Default: `false`
- Meaning: Include the sequence identifier column. Useful for tracking or linking sequences.
- **`--sequence|-s`**
- Default: `false`
- Meaning: Include the nucleotide or amino acid sequence. This is the main biological data.
- **`--quality|-q`**
- Default: `false`
- Meaning: Include quality scores for each position. Essential for quality control and filtering.
- **`--definition|-d`**
- Default: `false`
- Meaning: Include the sequence description or definition from the source file.
- **`--count`**
- Default: `false`
- Meaning: Include the count attribute, representing how many original reads were collapsed into this sequence (e.g., from clustering or demultiplexing).
- **`--taxon`**
- Default: `false`
- Meaning: Include taxonomic information. Outputs both the NCBI taxid and the scientific name. Requires a taxonomy database (see `--taxonomy`).
- **`--obipairing`**
- Default: `false`
- Meaning: Include attributes that were added by the `obipairing` command (pairing scores, mismatches, etc.).
- **`--auto`**
- Default: `false`
- Meaning: Automatically detect which columns to output by examining the first batch of sequences. Outputs all annotation attributes found in the headers. Can be combined with `--ids`, `--sequence`, etc. to add those columns on top of the auto-detected ones.
- **`--keep|-k <KEY>`**
- Default: `none`
- Meaning: Keep only the specified attribute(s). Can be used multiple times to keep several columns. Useful for extracting specific annotations.
- **`--na-value <NAVALUE>`**
- Default: `"NA"`
- Meaning: String to use for missing or unavailable values in the CSV. Customize for compatibility with other tools (e.g., empty string, "NA", "null").
## Input/Output Files
- **`--out|-o <FILENAME>`**
- Default: `"-"` (stdout)
- Meaning: Write output to the specified file instead of stdout.
- **`--compress|-Z`**
- Default: `false`
- Meaning: Compress the output using gzip.
## Input Format
- **`--fasta`**, **`--fastq`**, **`--genbank`**, **`--embl`**, **`--ecopcr`**, **`--csv`**
- Default: auto-detection
- Meaning: Explicitly specify the input format.
- **`--input-OBI-header`**, **`--input-json-header`**
- Default: auto-detection
- Meaning: Specify the header format in FASTA/FASTQ files (OBI or JSON annotations).
- **`--u-to-t`**
- Default: `false`
- Meaning: Convert Uracil to Thymine. Useful for RNA sequences.
- **`--solexa`**
- Default: `false`
- Meaning: Decode quality strings according to the Solexa specification instead of Phred.
## Taxonomy
- **`--taxonomy|-t <string>`**
- Default: `""`
- Meaning: Path to the taxonomy database directory. Required for `--taxon` output.
- **`--fail-on-taxonomy`**
- Default: `false`
- Meaning: Fail with an error if a taxid used in the data is not currently valid.
- **`--update-taxid`**
- Default: `false`
- Meaning: Automatically update taxids that have been merged to their newest valid taxid.
- **`--raw-taxid`**
- Default: `false`
- Meaning: Print only taxids without supplementary information (name and rank).
- **`--with-leaves`**
- Default: `false`
- Meaning: Add sequences as leaves of their taxid annotation when taxonomy is extracted from a sequence file.
## Performance
- **`--max-cpu <int>`**
- Default: `16`
- Meaning: Number of parallel threads for processing.
- **`--batch-size <int>`**
- Default: `1`
- Meaning: Minimum number of sequences per batch.
- **`--batch-size-max <int>`**
- Default: `2000`
- Meaning: Maximum number of sequences per batch.
- **`--batch-mem <string>`**
- Default: `"128M"`
- Meaning: Maximum memory per batch (e.g., 128K, 64M, 1G).
- **`--no-order`**
- Default: `false`
- Meaning: When multiple input files are provided, indicates there is no order among them.
- **`--no-progressbar`**
- Default: `false`
- Meaning: Disable the progress bar.
## Other Options
- **`--debug`**
- Default: `false`
- Meaning: Enable debug mode by setting log level to debug.
- **`--pprof`**
- Default: `false`
- Meaning: Enable pprof server.
- **`--pprof-goroutine <int>`**
- Default: `6060`
- Meaning: Enable profiling of goroutine blocking.
- **`--pprof-mutex <int>`**
- Default: `10`
- Meaning: Enable profiling of mutex lock.
- **`--silent-warning`**
- Default: `false`
- Meaning: Suppress warning messages.
- **`--version`**
- Default: `false`
- Meaning: Print version information and exit.
- **`--help|-h|-?`**
- Default: `false`
- Meaning: Print help information.
---
# EXAMPLES
**Export sequences with identifiers to CSV**
Extracts sequence IDs and sequences from a FASTQ file.
```bash
obicsv --ids --sequence sequences.fastq -o output1.csv
```
**Expected output:** 3 sequences written to `output1.csv`.
**Export sequences with quality scores**
Useful for quality control and filtering in downstream tools.
```bash
obicsv --ids --sequence --quality sequences.fastq -o output2.csv
```
**Expected output:** 3 sequences written to `output2.csv`.
**Export with taxonomic information**
Includes taxid and scientific name for taxonomic analysis.
```bash
obicsv --ids --sequence --taxon --taxonomy /path/to/taxonomy sequences.fasta -o output.csv
```
**Auto-detect annotation columns from sequence headers**
Automatically discovers all annotation attributes present in the sequence headers and outputs them as CSV columns. Combined with `--ids` to also include the sequence identifier.
```bash
obicsv --auto --ids sequences.fasta -o output4.csv
```
**Expected output:** 3 rows in `output4.csv` with columns `id`, `sample`, `taxid` (attributes found in sequence headers).
**Extract specific attributes**
Keeps only the specified attributes as columns. Attributes not present in a sequence are written as the NA value.
```bash
obicsv --keep sample --keep taxid sequences.fasta -o output5.csv
```
**Expected output:** 3 rows in `output5.csv` with columns `taxid`, `sample`.
**Export with compression**
Writes gzip-compressed CSV output for large datasets.
```bash
obicsv --ids --sequence -Z sequences.fasta -o output6.csv.gz
```
**Expected output:** 3 sequences written to `output6.csv.gz`.
---
# SEE ALSO
- `obiconvert` — input/output handling framework
- `obipairing` — pairing information (used with `--obipairing`)
- Other export commands: `obifasta`, `obifastq`, `obijson`
---
# NOTES
- Without any column selection flag (`--ids`, `--sequence`, `--quality`, `--taxon`, `--auto`, `--keep`), the output contains no columns and no data.
- The `--taxon` option requires a valid taxonomy database specified with `--taxonomy`.
- Output is written to stdout by default; use `--out` to write to a file.
- Missing attributes are written as the NA value (customizable with `--na-value`).
- Input sequences are processed using streaming iterators to minimize memory footprint, even for large files.
# obidemerge
## NAME
`obidemerge` — split merged sequence records back into individual, sample-annotated copies
## SYNOPSIS
```
obidemerge [options] [input_files...]
```
## DESCRIPTION
In a typical metabarcoding workflow, `obiuniq` or similar tools collapse identical sequences
from multiple samples into a single representative record. That record carries a statistics
attribute (for example `merged_sample`) that stores, for every original sample, how many
times the sequence was observed. This compact representation is convenient for clustering
and denoising, but some downstream analyses need the original, per-sample view.
`obidemerge` reverses that merging step. For each input sequence, it reads the statistics
stored under a chosen attribute (by default `sample`) and produces one output sequence per
entry in that statistics map. Each output sequence is a copy of the original, but:
- its `sample` attribute (or whichever slot you chose) is set to the name of the individual
sample,
- its read count is set to the abundance recorded for that sample.
The original statistics attribute is removed from all output sequences.
Sequences that carry no statistics for the chosen slot are passed through unchanged.
The command reads sequences from one or more files, or from standard input when no file is
given, and writes the results to standard output or to the file specified with `--out`.
## INPUT FORMATS
`obidemerge` accepts all sequence formats supported by OBITools4:
| Format | Description |
|--------|-------------|
| FASTA | Plain nucleotide sequences with annotation in the title line |
| FASTQ | Sequences with per-base quality scores |
| EMBL | European Nucleotide Archive flat-file format |
| GenBank | NCBI GenBank flat-file format |
| ecoPCR | Output produced by the ecoPCR tool |
| CSV | Comma-separated values with sequence and metadata columns |
The format is detected automatically from the file extension or content. You can override
detection with the format flags listed under **Input format options** below.
Annotations embedded in FASTA/FASTQ title lines can follow the OBI key=value style
(`--input-OBI-header`) or JSON style (`--input-json-header`).
## OUTPUT FORMATS
By default, the output format mirrors the input:
- If the input contains quality scores, output is FASTQ.
- Otherwise, output is FASTA with OBI-style annotations.
You can force a specific format with `--fasta-output`, `--fastq-output`, or `--json-output`.
## OPTIONS
### Demerge option
`--demerge <slot>`, `-d <slot>`
: Name of the sequence attribute that holds the per-sample statistics to expand.
Each key in that statistics map becomes a separate output sequence.
**Default:** `sample`
### Output options
`--out <FILENAME>`, `-o <FILENAME>`
: Write output to this file instead of standard output. Use `-` for standard output.
**Default:** `-` (standard output)
`--fasta-output`
: Write output in FASTA format, even when quality scores are available.
**Default:** false
`--fastq-output`
: Write output in FASTQ format (requires quality scores in the input).
**Default:** false
`--json-output`
: Write output in JSON format, one record per line.
**Default:** false
`--output-OBI-header`, `-O`
: Write FASTA/FASTQ title lines in OBI key=value annotation style.
**Default:** false (JSON-style headers)
`--output-json-header`
: Write FASTA/FASTQ title lines in JSON annotation style.
**Default:** false
`--compress`, `-Z`
: Compress the output with gzip.
**Default:** false
`--skip-empty`
: Discard sequences of length zero from the output.
**Default:** false
### Input format options
`--fasta`
: Force reading in FASTA format.
`--fastq`
: Force reading in FASTQ format.
`--embl`
: Force reading in EMBL flat-file format.
`--genbank`
: Force reading in GenBank flat-file format.
`--ecopcr`
: Force reading in ecoPCR output format.
`--csv`
: Force reading in CSV format.
`--input-OBI-header`
: Parse FASTA/FASTQ title lines as OBI-style key=value annotations.
`--input-json-header`
: Parse FASTA/FASTQ title lines as JSON annotations.
`--solexa`
: Decode quality scores using the Solexa/Illumina 1.0 convention instead of the standard
Phred scale. Use this only for very old sequencing data.
**Default:** false
`--u-to-t`
: Convert uracil (U) to thymine (T) in all sequences. Useful when working with RNA-derived
data that should be treated as DNA.
**Default:** false
`--no-order`
: When reading from several input files, do not attempt to preserve the order of records
across files. May improve speed when order does not matter.
**Default:** false
### Taxonomy options
`--taxonomy <path>`, `-t <path>`
: Path to the OBITools4 taxonomy database. Required only if taxonomic identifiers need to
be resolved or validated during output.
**Default:** none
`--fail-on-taxonomy`
: Stop with an error if a taxonomic identifier in the data is not found in the loaded
taxonomy database.
**Default:** false
`--raw-taxid`
: Print taxonomic identifiers as plain numbers, without appending the taxon name and rank.
**Default:** false
`--update-taxid`
: Automatically replace deprecated taxonomic identifiers with their current equivalents,
as declared in the taxonomy database.
**Default:** false
`--with-leaves`
: When a taxonomy is extracted from the sequence file itself, treat each sequence as a
leaf node under its annotated taxonomic identifier.
**Default:** false
### Performance options
`--max-cpu <int>`
: Maximum number of parallel processing threads. Increase for faster processing on
multi-core machines.
**Default:** 16 (or the value of the `OBIMAXCPU` environment variable)
`--batch-size <int>`
: Minimum number of sequences processed together as a group.
**Default:** 1
`--batch-size-max <int>`
: Maximum number of sequences processed together as a group.
**Default:** 2000
`--batch-mem <size>`
: Maximum memory used per processing group (e.g. `64M`, `1G`). Set to `0` to disable the
memory limit and rely on `--batch-size-max` alone.
**Default:** `128M`
### Display options
`--no-progressbar`
: Hide the progress bar.
**Default:** false
`--silent-warning`
: Suppress warning messages.
**Default:** false
`--debug`
: Enable verbose debug logging.
**Default:** false
`--version`
: Print the OBITools4 version and exit.
`--help`, `-h`, `-?`
: Print this help message and exit.
## EXAMPLES
### Example 1 — basic demerge using the default slot
After running `obiuniq`, the file `unique.fasta` contains merged sequences whose
`merged_sample` attribute records abundance per sample. Demerge back to one
sequence per sample:
```bash
obidemerge -d sample unique.fasta > per_sample_merged.fasta
```
**Expected output:** 7 sequences written to `per_sample_merged.fasta`.
### Example 2 — demerge with the default `sample` slot
If the statistics are already stored under the attribute named `sample` (the default),
no `-d` flag is needed:
```bash
obidemerge unique.fasta > per_sample_default.fasta
```
**Expected output:** 7 sequences written to `per_sample_default.fasta`.
### Example 3 — write compressed output to a file
```bash
obidemerge -d sample -o per_sample.fasta.gz --compress unique.fasta
```
**Expected output:** 7 sequences written (compressed) to `per_sample.fasta.gz`.
### Example 4 — pipeline use: cluster, then demerge
Obtain unique sequences, cluster them, then expand the clusters back to individual
sample records for ecological analysis:
```bash
obiuniq -m sample reads.fastq \
| obiclean ... \
| obidemerge -d sample -o demerged.fasta
```
### Example 5 — process multiple input files
```bash
obidemerge -d sample run1_unique.fasta run2_unique.fasta > combined_demerged.fasta
```
**Expected output:** 6 sequences written to `combined_demerged.fasta`.
## SEE ALSO
`obiuniq(1)` — collapses identical sequences and records per-sample counts (the inverse operation)
`obiclean(1)` — removes PCR/sequencing artefacts from a set of unique sequences
`obiannotate(1)` — adds or modifies sequence attributes
`obigrep(1)` — filters sequences by attributes or sequence content
`obicount(1)` — counts sequences and total reads in a file
## NOTES
**Relationship to `obiuniq`.**
`obiuniq --merge sample` stores per-sample counts under an attribute named `merged_sample`.
When you later call `obidemerge`, you must therefore pass `-d sample` to match that
attribute name. The `-d` option takes the **logical** slot name (here `sample`), not the
internal storage name (`merged_sample`).
**Read counts after demerging.**
Each output sequence has its read count set to the value recorded in the statistics map for
that sample. If you sum the counts of all output sequences that share the same identifier,
you recover the total count of the original merged record.
**Order of output sequences.**
The order in which the per-sample copies of a single merged sequence appear in the output
is not guaranteed. If a stable order is required, pipe the output through `obisort`.
## OUTPUT
`obidemerge` writes one sequence record per sample entry found in the statistics attribute.
Each output record is a copy of the input sequence, with:
- the statistics attribute (`merged_<slot>`) removed,
- the `<slot>` attribute set to the sample name,
- the `count` attribute set to the abundance for that sample.
Sequences with no statistics for the chosen slot are passed through unchanged.
## Observed output example
```
>seq001 {"count":5,"sample":"sampleA"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgt
>seq001 {"count":3,"sample":"sampleB"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgt
>seq001 {"count":1,"sample":"sampleC"}
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgt
>seq002 {"count":2,"sample":"sampleA"}
ttggccaattggccaattggccaattggccaattggccaa
>seq002 {"count":7,"sample":"sampleD"}
ttggccaattggccaattggccaattggccaattggccaa
>seq003 {"count":4,"sample":"sampleB"}
gctagctagctagctagctagctagctagctagctagcta
>seq004 {"count":6}
aaaaccccggggttttaaaaccccggggttttaaaacccc
```
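The count-conservation property described in the NOTES can be checked mechanically. A sketch (assuming JSON-style headers like the observed output above; the file path is illustrative) that sums per-sample counts per identifier:

```shell
# Sanity-check count conservation after demerging: summing the per-sample
# counts of records sharing an identifier recovers the merged total.
cat > /tmp/demerged.fasta <<'EOF'
>seq001 {"count":5,"sample":"sampleA"}
acgtacgt
>seq001 {"count":3,"sample":"sampleB"}
acgtacgt
EOF
awk '/^>/ { id = substr($1, 2)
            if (match($0, /"count":[0-9]+/))
              total[id] += substr($0, RSTART + 8, RLENGTH - 8) }
     END { for (id in total) print id, total[id] }' /tmp/demerged.fasta
```

For the two records above this prints `seq001 8`, the total of the original merged record.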
# NAME
obidistribute — divides an input set of sequences into subsets
---
# SYNOPSIS
```
obidistribute --pattern|-p <string> [--append|-A] [--batch-mem <string>]
[--batch-size <int>] [--batch-size-max <int>]
[--batches|-n <int>] [--classifier|-c <string>] [--compress|-Z]
[--csv] [--debug] [--directory|-d <string>] [--ecopcr] [--embl]
[--fasta] [--fasta-output] [--fastq] [--fastq-output]
[--genbank] [--hash|-H <int>] [--help|-h|-?]
[--input-OBI-header] [--input-json-header] [--json-output]
[--max-cpu <int>] [--na-value <string>] [--no-order]
[--no-progressbar] [--out|-o <FILENAME>]
[--output-OBI-header|-O] [--output-json-header] [--pprof]
[--pprof-goroutine <int>] [--pprof-mutex <int>]
[--silent-warning] [--skip-empty] [--solexa] [--u-to-t]
[--version] [<args>]
```
---
# DESCRIPTION
`obidistribute` splits a set of biological sequences into multiple output files according to one of three distribution strategies: annotation-based classification, round-robin batch assignment, or hash-based sharding.
The most common use case in metabarcoding is demultiplexing: sequences carry a tag annotation (e.g., `sample_id`) and `obidistribute` writes each sample's sequences into its own file. The output filename for each group is built from a user-supplied pattern containing `%s`, which is replaced by the classifier value or batch index.
When no classifier is specified, sequences can be split into a fixed number of batches (`--batches`) for parallel downstream processing, or sharded deterministically by hash (`--hash`) to ensure reproducible partitioning regardless of input order.
Output files can be organised into subdirectories (one per classifier value) using `--directory`, and existing files can be extended rather than overwritten with `--append`. Sequences lacking the classifier annotation are assigned to a file whose name uses the NA value (default: `"NA"`).
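The filename construction is ordinary `%s` substitution. A minimal sketch of how classifier values (and the NA fallback) map onto a pattern, using the shell's own `printf` (sample names are illustrative):

```shell
# Expand a --pattern template for a few classifier values, the way
# obidistribute builds output filenames; untagged sequences fall back
# to the NA value.
pattern='out_%s.fastq'
for sample in sampleA sampleB NA; do
  printf "${pattern}\n" "$sample"
done
```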
---
# INPUT
`obidistribute` reads biological sequences from one or more files supplied as positional arguments, or from standard input when no files are given. All major NGS and flat-file formats are supported and auto-detected:
- FASTA / FASTQ (plain or gzip-compressed)
- GenBank and EMBL flat files
- ecoPCR output
- CSV
Format can be forced with `--fasta`, `--fastq`, `--embl`, `--genbank`, `--ecopcr`, or `--csv`. Header annotation style can be specified with `--input-OBI-header` or `--input-json-header`.
---
# OUTPUT
Each distribution group produces a separate output file named according to the `--pattern` template. The `%s` placeholder in the pattern is replaced by the classifier value, batch index, or hash shard index, depending on the chosen distribution mode.
Output format follows the same rules as other OBITools commands: FASTQ is used when quality scores are present, FASTA otherwise. The format can be forced with `--fasta-output`, `--fastq-output`, or `--json-output`. All annotations present in the input sequences are preserved in the output files.
When `--directory` is used together with `--classifier`, output files are placed in subdirectories named after the classifier values, allowing hierarchical organisation of results.
## Observed output example
```
@seq001 {"sample_id":"sampleA"}
atcgatcgatcgatcgatcg
+
IIIIIIIIIIIIIIIIIIII
@seq002 {"sample_id":"sampleA"}
gctagctagctagctagcta
+
IIIIIIIIIIIIIIIIIIII
@seq003 {"sample_id":"sampleA"}
ttagctaatcggtaatcggt
+
IIIIIIIIIIIIIIIIIIII
@seq009 {"sample_id":"sampleA"}
atgatgatgatgatgatgat
+
IIIIIIIIIIIIIIIIIIII
```
---
# OPTIONS
## Distribution mode
- **`--pattern|-p <string>`** — _(required)_
Default: none.
The template used to build output filenames. The variable part is represented by `%s`. Example: `toto_%s.fastq`.
- **`--classifier|-c <string>`**
Default: `""`.
The name of an annotation tag on the sequences. Sequences are dispatched into separate files based on the value of this tag. The tag value must be a string, integer, or boolean.
- **`--batches|-n <int>`**
Default: `0`.
Splits the input into exactly *N* batches by round-robin assignment, regardless of sequence metadata.
- **`--hash|-H <int>`**
Default: `0`.
Splits the input into at most *N* batches using a hash of the sequence. Produces deterministic, reproducible sharding.
- **`--directory|-d <string>`**
Default: `""`.
Used together with `--classifier`: organises output files into subdirectories named after classifier values.
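The determinism promised by `--hash` can be pictured with any stable checksum. This sketch uses `cksum` purely as an illustration (it is not obidistribute's actual hash function) to show that identical sequences always land in the same shard:

```shell
# Map a sequence to one of N shards via a stable checksum: an
# illustration of deterministic sharding, not the tool's internals.
shard() { printf '%s' "$1" | cksum | awk -v n="$2" '{ print $1 % n }'; }
shard acgtacgtacgt 4
shard acgtacgtacgt 4   # identical input: identical shard index
```

Because the shard depends only on the sequence content, the partitioning is reproducible regardless of input order.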
## Output file handling
- **`--append|-A`**
Default: `false`.
Appends sequences to output files if they already exist, instead of overwriting them.
- **`--na-value <string>`**
Default: `"NA"`.
Value used as the filename component when a sequence does not have the classifier tag defined.
- **`--compress|-Z`**
Default: `false`.
Compresses all output files using gzip.
## Input format
- **`--fasta`**
Default: `false`.
Read data following the FASTA format.
- **`--fastq`**
Default: `false`.
Read data following the FASTQ format.
- **`--embl`**
Default: `false`.
Read data following the EMBL flatfile format.
- **`--genbank`**
Default: `false`.
Read data following the GenBank flatfile format.
- **`--ecopcr`**
Default: `false`.
Read data following the ecoPCR output format.
- **`--csv`**
Default: `false`.
Read data following the CSV format.
- **`--input-OBI-header`**
Default: `false`.
FASTA/FASTQ title line annotations follow OBI format.
- **`--input-json-header`**
Default: `false`.
FASTA/FASTQ title line annotations follow JSON format.
- **`--solexa`**
Default: `false`.
Decodes quality string according to the Solexa specification.
- **`--u-to-t`**
Default: `false`.
Convert Uracil to Thymine.
- **`--skip-empty`**
Default: `false`.
Sequences of zero length are removed from the output.
- **`--no-order`**
Default: `false`.
When several input files are provided, indicates that there is no order among them.
## Output format
- **`--fasta-output`**
Default: `false`.
Write sequences in FASTA format (default if no quality data available).
- **`--fastq-output`**
Default: `false`.
Write sequences in FASTQ format (default if quality data available).
- **`--json-output`**
Default: `false`.
Write sequences in JSON format.
- **`--output-OBI-header|-O`**
Default: `false`.
Output FASTA/FASTQ title line annotations follow OBI format.
- **`--output-json-header`**
Default: `false`.
Output FASTA/FASTQ title line annotations follow JSON format.
- **`--out|-o <FILENAME>`**
Default: `"-"`.
Filename used for saving the output.
## Performance
- **`--max-cpu <int>`**
Default: `16`.
Number of parallel threads computing the result.
- **`--batch-size <int>`**
Default: `1`.
Minimum number of sequences per batch.
- **`--batch-size-max <int>`**
Default: `2000`.
Maximum number of sequences per batch.
- **`--batch-mem <string>`**
Default: `"128M"`.
Maximum memory per batch (e.g. `128K`, `64M`, `1G`). Set to `0` to disable.
## Diagnostic & debug
- **`--debug`**
Default: `false`.
Enable debug mode by setting the log level to debug.
- **`--no-progressbar`**
Default: `false`.
Disable the progress bar printing.
- **`--silent-warning`**
Default: `false`.
Suppress warning messages.
- **`--pprof`**
Default: `false`.
Enable pprof server. Look at the log for details.
- **`--pprof-goroutine <int>`**
Default: `6060`.
Enable profiling of goroutine blocking.
- **`--pprof-mutex <int>`**
Default: `10`.
Enable profiling of mutex lock.
---
# EXAMPLES
```bash
# Demultiplex sequences by sample_id annotation into per-sample FASTQ files
obidistribute --classifier sample_id --pattern out_ex1_%s.fastq --no-progressbar --input-json-header reads.fastq
```
**Expected output:** 10 sequences written to 4 files: `out_ex1_sampleA.fastq` (4 sequences), `out_ex1_sampleB.fastq` (3 sequences), `out_ex1_sampleC.fastq` (2 sequences), `out_ex1_NA.fastq` (1 sequence).
```bash
# Demultiplex into subdirectories, one directory per sample
obidistribute --classifier sample_id --directory demux --pattern %s/reads.fastq reads.fastq
```
```bash
# Split a large dataset into 3 equal batches for parallel processing
obidistribute --batches 3 --pattern chunk_%s.fasta --fasta-output --no-progressbar sequences.fasta
```
**Expected output:** 10 sequences written to 3 files: `chunk_1.fasta` (4 sequences), `chunk_2.fasta` (3 sequences), `chunk_3.fasta` (3 sequences). Batch indices are 1-based.
```bash
# Hash-based sharding into 4 reproducible shards
obidistribute --hash 4 --pattern shard_%s.fastq --no-progressbar reads.fastq
```
**Expected output:** 10 sequences written to 4 files: `shard_0.fastq` through `shard_3.fastq`. Shard indices are 0-based.
```bash
# Append new sequences to existing per-sample files (incremental demultiplexing)
obidistribute --classifier sample_id --pattern samples_%s.fastq --append new_reads.fastq
```
```bash
# Demultiplex sequences, replacing the NA label for unclassified sequences
obidistribute --classifier sample_id --na-value unclassified --pattern out_ex6_%s.fastq --no-progressbar --input-json-header reads.fastq
```
**Expected output:** 10 sequences written to 4 files including `out_ex6_unclassified.fastq` (1 sequence without `sample_id` annotation).
---
# SEE ALSO
`obiconvert`, `obisplit`, `obigrep`
---
# NOTES
- Sequences that lack the annotation specified by `--classifier` are written to the file whose name is built using the `--na-value` (default: `"NA"`).
- The three distribution modes (`--classifier`, `--batches`, `--hash`) are mutually exclusive.
- When using `--directory` together with `--classifier`, subdirectories are created automatically if they do not exist.
- Batch indices produced by `--batches` are 1-based; hash shard indices produced by `--hash` are 0-based.
# obigrep(1) — OBITools4 Manual
## NAME
`obigrep` — select a subset of sequence records on various criteria
## SYNOPSIS
```
obigrep [OPTIONS] [FILE...]
```
## DESCRIPTION
`obigrep` filters a set of biological sequence records (in FASTA or FASTQ format) and writes only those matching all specified criteria to the output. Its name is modelled on the Unix `grep` command, but instead of filtering lines in a text file, it filters sequence records.
Filtering criteria can be combined freely: only sequence records satisfying **all** specified conditions are retained. The selection can be inverted with `--inverse-match` to keep the records that would otherwise be discarded.
Sequences are read from one or more files, or from standard input if no file is given. Results are written to standard output or to a file specified with `--out`. Records that do not pass the filters can optionally be saved to a separate file with `--save-discarded`.
## INPUT FORMATS
`obigrep` recognises the following input formats automatically. A specific format can be forced with the corresponding flag:
| Flag | Format |
|------|--------|
| `--fasta` | FASTA |
| `--fastq` | FASTQ |
| `--embl` | EMBL flat file |
| `--genbank` | GenBank flat file |
| `--ecopcr` | ecoPCR output |
| `--csv` | CSV tabular format |
Header annotation styles can be selected with `--input-OBI-header` (OBITools format) or `--input-json-header` (JSON format).
## OUTPUT FORMATS
By default, the output format matches the input format (FASTQ when quality scores are present, FASTA otherwise). The format can be forced:
- `--fasta-output` — write FASTA
- `--fastq-output` — write FASTQ
- `--json-output` — write JSON
- `--output-OBI-header` / `-O` — annotate FASTA/FASTQ title lines in OBITools format
- `--output-json-header` — annotate FASTA/FASTQ title lines in JSON format
- `--compress` / `-Z` — compress output with gzip
Use `--out FILE` / `-o FILE` to write results to a file instead of standard output.
## FILTERING OPTIONS
### By sequence length
- `--min-length LENGTH` / `-l LENGTH`
Keep only sequences at least *LENGTH* bases long.
- `--max-length LENGTH` / `-L LENGTH`
Keep only sequences at most *LENGTH* bases long.
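To make the length-filter semantics concrete, here is a plain-awk sketch of what `--min-length` selects on a toy two-record FASTA (obigrep does this natively; this only illustrates the idea for single-line sequences):

```shell
# Keep records whose sequence line is at least 8 bases long,
# mimicking the effect of obigrep --min-length 8 on simple FASTA.
cat > /tmp/obigrep_demo.fasta <<'EOF'
>short
acgt
>long
acgtacgtacgt
EOF
awk '/^>/ { hdr = $0; next }
     length($0) >= 8 { print hdr; print $0 }' /tmp/obigrep_demo.fasta
```

Only the `>long` record survives the filter.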
### By read abundance
Sequence records can carry a `count` attribute recording how many times the sequence was observed. The following options filter on that count:
- `--min-count COUNT` / `-c COUNT`
Keep only sequences observed at least *COUNT* times (default: 1).
- `--max-count COUNT` / `-C COUNT`
Keep only sequences observed at most *COUNT* times.
### By sequence pattern
- `--sequence PATTERN` / `-s PATTERN`
Keep records whose nucleotide sequence matches the regular expression *PATTERN* (case-insensitive). This option can be repeated; all patterns must match.
- `--approx-pattern PATTERN`
Keep records whose sequence contains an approximate match to *PATTERN*. The number of allowed differences is controlled by `--pattern-error`. This option can be repeated.
- `--pattern-error N`
Maximum number of mismatches (or indels, if `--allows-indels` is set) tolerated when using `--approx-pattern` (default: 0, i.e. exact match).
- `--allows-indels`
Allow insertions and deletions (in addition to substitutions) when performing approximate pattern matching.
- `--only-forward`
Search patterns on the forward strand only. By default both strands are searched.
### By identifier or definition
- `--identifier PATTERN` / `-I PATTERN`
Keep records whose identifier matches the regular expression *PATTERN* (case-insensitive). Can be repeated.
- `--id-list FILENAME`
Keep only records whose identifier appears in *FILENAME*, a plain-text file with one identifier per line.
- `--definition PATTERN` / `-D PATTERN`
Keep records whose definition line matches the regular expression *PATTERN* (case-insensitive). Can be repeated.
### By attribute (metadata)
Sequence records can carry arbitrary key/value annotations:
- `--has-attribute KEY` / `-A KEY`
Keep records that possess an attribute named *KEY*, regardless of its value. Can be repeated.
- `--attribute KEY=PATTERN` / `-a KEY=PATTERN`
Keep records for which the value of attribute *KEY* matches the regular expression *PATTERN* (case-sensitive). Can be repeated; all constraints must be satisfied.
### By custom boolean expression
- `--predicate EXPRESSION` / `-p EXPRESSION`
Keep records for which the boolean expression *EXPRESSION* evaluates to true. Attributes are accessed via the `annotations` map (e.g. `annotations["count"]`). The special variable `sequence` refers to the sequence object; its length can be obtained with `len(sequence)`. Can be repeated; all expressions must be true.
Example: `-p 'annotations["count"] >= 10 && len(sequence) < 200'`
### By taxonomy
Taxonomy-based filtering requires a taxonomy database to be provided with `--taxonomy`.
- `--taxonomy PATH` / `-t PATH`
Path to the taxonomy database.
- `--restrict-to-taxon TAXID` / `-r TAXID`
Keep only records whose taxon belongs to the lineage of *TAXID* (i.e. is *TAXID* itself or a descendant). Can be repeated; sequences must satisfy at least one of the provided taxids.
- `--ignore-taxon TAXID` / `-i TAXID`
Discard records whose taxon belongs to the lineage of *TAXID*. Can be repeated.
- `--valid-taxid`
Keep only records that carry a valid, recognised taxonomic identifier.
- `--require-rank RANK_NAME`
Keep only records whose taxon has a defined ancestor at the given rank (e.g. *species*, *genus*, *family*). Can be repeated.
- `--update-taxid`
Automatically update merged taxids to their current valid equivalent.
- `--fail-on-taxonomy`
Exit with an error if a taxid referenced in the data is not valid.
- `--with-leaves`
When the taxonomy is extracted from a sequence file, attach each sequence as a leaf node under its annotated taxid.
- `--raw-taxid`
Print taxids in output files without supplementary information (taxon name and rank).
### Inversion
- `--inverse-match` / `-v`
Invert the selection: output the records that would otherwise be discarded.
## PAIRED-END OPTIONS
When paired-end sequencing data are provided (forward and reverse reads stored in two files), `obigrep` can apply filters taking both reads into account.
- `--paired-with FILENAME`
File containing the reverse (paired) reads.
- `--paired-mode MODE`
How to combine the filter result from the forward and reverse reads. *MODE* is one of:
| Mode | Meaning |
|------|---------|
| `forward` | Keep the pair if the **forward** read passes (default) |
| `reverse` | Keep the pair if the **reverse** read passes |
| `and` | Keep the pair if **both** reads pass |
| `or` | Keep the pair if **at least one** read passes |
| `andnot` | Keep the pair if the **forward** passes and the **reverse** does not |
| `xor` | Keep the pair if **exactly one** read passes |
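The six modes are elementary boolean combinations of the two per-read filter results. As an illustration only (not obigrep's actual code), the decision logic can be sketched in Python:

```python
# Decision logic for --paired-mode, modelled as pure boolean functions.
# fwd and rev are the filter results for the forward and reverse reads.
PAIRED_MODES = {
    "forward": lambda fwd, rev: fwd,
    "reverse": lambda fwd, rev: rev,
    "and":     lambda fwd, rev: fwd and rev,
    "or":      lambda fwd, rev: fwd or rev,
    "andnot":  lambda fwd, rev: fwd and not rev,
    "xor":     lambda fwd, rev: fwd != rev,
}

def keep_pair(mode: str, fwd: bool, rev: bool) -> bool:
    """Return True if the pair is kept under the given --paired-mode."""
    return PAIRED_MODES[mode](fwd, rev)
```

For instance, `keep_pair("andnot", True, True)` is `False`: the pair is dropped because the reverse read also passes the filter.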
## OUTPUT CONTROL
- `--save-discarded FILENAME`
Write sequence records that do **not** pass the filters to *FILENAME*.
- `--out FILENAME` / `-o FILENAME`
Write the selected records to *FILENAME* (default: standard output).
- `--skip-empty`
Suppress sequences of length zero from the output.
## PERFORMANCE OPTIONS
- `--max-cpu N`
Number of parallel processing threads (default: number of available CPUs).
- `--batch-size N`
Minimum number of sequences per processing batch (default: 1).
- `--batch-size-max N`
Maximum number of sequences per processing batch (default: 2000).
- `--batch-mem SIZE`
Maximum memory per batch (e.g. `128M`, `1G`). Overrides `--batch-size-max` when memory is the limiting factor. Can also be set via the environment variable `OBIBATCHMEM`.
- `--no-order`
When multiple input files are provided, indicates that no ordering is assumed between them, which can improve throughput.
- `--no-progressbar`
Disable the progress bar.
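The `--batch-mem` size strings above use conventional binary suffixes. A hedged sketch of how such a value can be interpreted (the suffix set and the 1024 base are assumptions for illustration, not a specification of the OBITools4 parser):

```python
def parse_batch_mem(size: str) -> int:
    """Convert a size string such as '128K', '64M' or '1G' to bytes.

    Illustrative only: suffixes K/M/G with base 1024 are assumed here.
    A plain number is taken as a byte count; '0' disables the limit.
    """
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    size = size.strip().upper()
    if size and size[-1] in units:
        return int(size[:-1]) * units[size[-1]]
    return int(size)
```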
## MISCELLANEOUS OPTIONS
- `--u-to-t`
Convert uracil (U) to thymine (T) in all sequences (useful for RNA data).
- `--solexa`
Decode quality scores according to the legacy Solexa specification instead of the standard Phred encoding.
- `--silent-warning`
Suppress warning messages.
- `--debug`
Enable verbose debug logging.
- `--version`
Print version information and exit.
- `--help` / `-h` / `-?`
Display the help message and exit.
## EXAMPLES
Keep all sequences longer than 100 bases:
```bash
obigrep --min-length 100 input.fasta > out_min_length.fasta
```
**Expected output:** 6 sequences written to `out_min_length.fasta`.
Select sequences observed at least 10 times:
```bash
obigrep --min-count 10 input.fasta > out_min_count.fasta
```
**Expected output:** 4 sequences written to `out_min_count.fasta`.
Keep sequences whose identifier starts with `BOLD`:
```bash
obigrep --identifier '^BOLD' input.fasta > out_bold.fasta
```
**Expected output:** 2 sequences written to `out_bold.fasta`.
Select only sequences carrying the IUPAC primer motif `GGGCWATGTTTCATAAYGGG` with up to 2 mismatches:
```bash
obigrep --approx-pattern GGGCWATGTTTCATAAYGGG --pattern-error 2 input.fasta > out_primer.fasta
```
**Expected output:** 2 sequences written to `out_primer.fasta`.
Retain sequences belonging to the genus *Homo* (taxid 9605) in an NCBI taxonomy:
```bash
obigrep --taxonomy /data/ncbi_tax --restrict-to-taxon 9605 input.fasta
```
Keep sequences that have a `sample` attribute equal to `lake1` and save the rest to a separate file:
```bash
obigrep --attribute sample='^lake1$' --save-discarded discarded.fasta \
input.fasta > lake1.fasta
```
**Expected output:** 5 sequences written to `lake1.fasta`, 5 sequences written to `discarded.fasta`.
Invert a length filter (discard sequences shorter than 50 bases):
```bash
obigrep --min-length 50 --inverse-match input.fasta > out_short.fasta
```
**Expected output:** 1 sequence written to `out_short.fasta`.
Apply a custom predicate (sequences with count ≥ 5):
```bash
obigrep -p 'annotations["count"] >= 5' input.fasta > out_predicate.fasta
```
**Expected output:** 6 sequences written to `out_predicate.fasta`.
## OUTPUT
### Attribute table
Attributes present on sequence records are preserved unchanged in the output. No new attributes are added by `obigrep` itself — only filtering occurs.
| Attribute | Type | Description |
|-----------|------|-------------|
| `count` | integer | Number of times the sequence was observed (read from input) |
| `sample` | string | Sample identifier (read from input) |
Any other annotations present in the input are carried through to the output unmodified.
### Observed output example
```
>seq001 {"count":15,"sample":"lake1"}
acgtacgtacgtacgtacgtgggcaatgtttcataatgggacgtacgtacgtacgtacgt
acgtacgtacgtacgtacgtacgtacgtacgtacgtacgtacgtacgtacgtacgtacgt
acgtacgtacgtacgtacgtacgtacgtacgt
>seq002 {"count":3,"sample":"lake1"}
tgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgca
tgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgca
>seq004 {"count":2,"sample":"lake1"}
aaacccgggtttagctagctagctagctagctagctagctagctagctagctagctagct
agctagctagctagctagctagctagctagctagctagctagctagctagctagctagct
atacgtatcgatcg
>BOLD_005 {"count":8,"sample":"pond1"}
cgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgat
cgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcg
>seq008 {"count":7,"sample":"river2"}
ttacgatcgatcgatcgatcgggcaatgtttcataaggggacgatcgatcgatcgatcga
tcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgat
```
## SEE ALSO
`obiannotate`(1), `obiuniq`(1), `obiconvert`(1), `obitag`(1), `obisplit`(1)
## OBITools4
`obigrep` is part of the **OBITools4** suite for analysing DNA metabarcoding and environmental DNA data.
# NAME
obijoin — merges annotations from one file into another
---
# SYNOPSIS
```
obijoin --join-with|-j <string> [--batch-mem <string>] [--batch-size <int>]
[--batch-size-max <int>] [--by|-b <string>]... [--compress|-Z]
[--csv] [--debug] [--ecopcr] [--embl] [--fail-on-taxonomy] [--fasta]
[--fasta-output] [--fastq] [--fastq-output] [--genbank]
[--help|-h|-?] [--input-OBI-header] [--input-json-header]
[--json-output] [--max-cpu <int>] [--no-order] [--no-progressbar]
[--out|-o <FILENAME>] [--output-OBI-header|-O] [--output-json-header]
[--pprof] [--pprof-goroutine <int>] [--pprof-mutex <int>]
[--raw-taxid] [--silent-warning] [--skip-empty] [--solexa]
[--taxonomy|-t <string>] [--u-to-t] [--update-id|-i]
[--update-quality|-q] [--update-sequence|-s] [--update-taxid]
[--version] [--with-leaves] [<args>]
```
---
# DESCRIPTION
`obijoin` merges annotations from a secondary file into a primary sequence dataset. For each sequence in the primary input, it looks up matching records in the secondary file based on one or more shared attribute keys, then copies all annotations from matched partner records onto the primary sequence.
The join is a **left outer join**: every sequence in the primary dataset is preserved in the output, whether or not a match is found in the secondary file. Unmatched sequences simply receive no additional annotations. Key matching is exact string equality.
A common use case is enriching amplicon or read sequences with external sample metadata. The secondary file (the *annotation source*) can be a FASTA/FASTQ sequence file, a CSV table, an EMBL or GenBank flat file, or any other format accepted by OBITools4. This makes it straightforward to prepare a simple spreadsheet with sample identifiers and metadata columns, save it as CSV, and merge it directly into a sequence dataset: the CSV format is auto-detected, so no format conversion or extra flag is needed.
In addition to transferring annotations, `obijoin` can optionally replace the sequence identifier, nucleotide sequence, or quality scores of each primary sequence with values from its matched partner, controlled by the `--update-id`, `--update-sequence`, and `--update-quality` flags.
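The join semantics described above (left outer join, exact key equality, partner values overwriting primary ones) can be sketched in a few lines of Python; the names and data structures are illustrative, not the obijoin implementation:

```python
def left_outer_join(primary, secondary, keys):
    """Annotate each primary record with its matching secondary record.

    primary, secondary: lists of annotation dicts.
    keys: list of (primary_key, secondary_key) pairs; all must match
    by exact string equality for a pair to be considered a hit.
    Unmatched primary records are emitted unchanged (left outer join).
    """
    # Index secondary records by their key tuple for O(1) lookup.
    index = {}
    for rec in secondary:
        k = tuple(str(rec.get(sk)) for _, sk in keys)
        index.setdefault(k, rec)
    out = []
    for rec in primary:
        k = tuple(str(rec.get(pk)) for pk, _ in keys)
        partner = index.get(k)
        merged = dict(rec)
        if partner is not None:
            merged.update(partner)  # partner values overwrite primary ones
        out.append(merged)
    return out
```

A cross-attribute key such as `--by sample=well` corresponds to `keys=[("sample", "well")]` in this model.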
---
# INPUT
`obijoin` accepts a primary sequence dataset on standard input or as one or more file arguments. The supported formats are automatically detected and include FASTA, FASTQ, EMBL, GenBank, ecoPCR output, CSV, and JSON. Format-specific flags (`--fasta`, `--fastq`, `--embl`, `--genbank`, `--ecopcr`, `--csv`) can force a specific parser when auto-detection is ambiguous.
The secondary file, supplied via `--join-with`, is loaded entirely into memory before processing begins, and supports the same set of formats, including CSV; the format is auto-detected.
When multiple primary input files are provided and their ordering across files is irrelevant, `--no-order` allows the reader to return batches in whichever order they complete, improving throughput.
---
# OUTPUT
The output is a sequence file in FASTA or FASTQ format (determined automatically by the presence of quality data), written to standard output or to the file specified by `--out`. Alternative output formats can be requested with `--fasta-output`, `--fastq-output`, or `--json-output`. The output can be gzip-compressed with `--compress`.
Each output sequence carries all annotations from the primary dataset, enriched with every annotation attribute copied from the matched partner record. If a field name exists in both, the partner value overwrites the primary value. When `--update-id`, `--update-sequence`, or `--update-quality` are set, the corresponding sequence-level fields are also replaced with the partner's values.
## Observed output example
```
>seq001 {"barcode":"ATGC","experiment":"amplicon_run1","location":"Paris","sample":"S1"}
atgcatgcatgcatgcatgc
>seq002 {"barcode":"GCTA","experiment":"amplicon_run2","location":"Lyon","sample":"S2"}
gctagctagctagctagcta
>seq003 {"barcode":"TTTT","sample":"S3"}
tttttttttttttttttttt
>seq004 {"barcode":"ATGC","experiment":"amplicon_run1","location":"Paris","sample":"S1"}
aaaaatttttcccccggggg
>seq005 {"barcode":"GCTA","experiment":"amplicon_run2","location":"Lyon","sample":"S2"}
gggggaaaaatttttccccc
>seq006 {"barcode":"AAAA","sample":"S4"}
ccccccgggggtttttaaaaa
```
---
# OPTIONS
## Required
`--join-with|-j <string>`
: Path to the secondary file whose records are joined onto the primary sequences. This parameter is mandatory. The file can be in any format accepted by OBITools4 (FASTA, FASTQ, CSV, EMBL, GenBank, ecoPCR); the format is auto-detected. Default: none.
## Join control
`--by|-b <string>`
: Declares a join key as an attribute name or a `primary_attr=secondary_attr` mapping. Repeat the flag to join on multiple keys simultaneously; all keys must match for a record pair to be considered a hit (intersection semantics). When omitted, the join defaults to matching by sequence identifier (`id`). Default: `[]`.
`--update-id|-i`
: Replace the identifier of each primary sequence with the identifier from its matched partner record. Default: `false`.
`--update-sequence|-s`
: Replace the nucleotide or amino acid sequence of each primary sequence with the sequence from its matched partner. Default: `false`.
`--update-quality|-q`
: Replace the per-base quality scores of each primary sequence with the quality scores from its matched partner. Relevant only when both datasets carry quality information (FASTQ). Default: `false`.
## Input format
`--csv`
: Read the primary input data in OBITools CSV format (e.g., sequences exported by `obicsv`). This flag applies to the primary input only; secondary files supplied via `--join-with` are always auto-detected. Default: `false`.
`--ecopcr`
: Read data following the ecoPCR output format. Default: `false`.
`--embl`
: Read data following the EMBL flatfile format. Default: `false`.
`--fasta`
: Read data following the FASTA format. Default: `false`.
`--fastq`
: Read data following the FASTQ format. Default: `false`.
`--genbank`
: Read data following the GenBank flatfile format. Default: `false`.
`--input-OBI-header`
: Treat FASTA/FASTQ title line annotations as OBI format. Default: `false`.
`--input-json-header`
: Treat FASTA/FASTQ title line annotations as JSON format. Default: `false`.
`--solexa`
: Decode the quality string according to the Solexa specification. Default: `false`.
`--u-to-t`
: Convert uracil (U) to thymine (T) in input sequences. Default: `false`.
`--skip-empty`
: Suppress sequences of length zero from the output. Default: `false`.
`--no-order`
: When several input files are provided, indicates that there is no order among them. Default: `false`.
## Output format
`--out|-o <FILENAME>`
: Filename used for saving the output. Default: `-` (standard output).
`--fasta-output`
: Write sequences in FASTA format (default when no quality data are available). Default: `false`.
`--fastq-output`
: Write sequences in FASTQ format (default when quality data are available). Default: `false`.
`--json-output`
: Write sequences in JSON format. Default: `false`.
`--output-OBI-header|-O`
: Output FASTA/FASTQ title line annotations in OBI format. Default: `false`.
`--output-json-header`
: Output FASTA/FASTQ title line annotations in JSON format. Default: `false`.
`--compress|-Z`
: Compress the output using gzip. Default: `false`.
## Taxonomy
`--taxonomy|-t <string>`
: Path to the taxonomy database. Default: `""`.
`--fail-on-taxonomy`
: Cause `obijoin` to fail with an error if a taxid encountered is not currently valid. Default: `false`.
`--raw-taxid`
: Print taxids in files without supplementary information (taxon name and rank). Default: `false`.
`--update-taxid`
: Automatically update taxids that are declared as merged to a newer one. Default: `false`.
`--with-leaves`
: When taxonomy is extracted from a sequence file, add sequences as leaves of their taxid annotation. Default: `false`.
## Performance
`--max-cpu <int>`
: Number of parallel threads used to compute the result. Default: `16`.
`--batch-size <int>`
: Minimum number of sequences per processing batch. Default: `1`.
`--batch-size-max <int>`
: Maximum number of sequences per processing batch. Default: `2000`.
`--batch-mem <string>`
: Maximum memory per batch (e.g. `128K`, `64M`, `1G`). Set to `0` to disable. Default: `128M`.
## Diagnostics
`--no-progressbar`
: Disable the progress bar. Default: `false`.
`--silent-warning`
: Stop printing warning messages. Default: `false`.
`--debug`
: Enable debug mode by setting the log level to debug. Default: `false`.
---
# EXAMPLES
```bash
# Annotate amplicon sequences with sample metadata from a CSV table,
# matching on the sample attribute. CSV format is auto-detected.
obijoin --join-with metadata.csv --by sample input.fasta > out_basic.fasta
```
**Expected output:** 6 sequences written to `out_basic.fasta`.
```bash
# Join using a cross-attribute key: primary sequences have a 'sample' attribute,
# while the annotation CSV uses 'well' for the same identifier.
obijoin --join-with well_metadata.csv --by sample=well input.fasta > out_crosskey.fasta
```
**Expected output:** 6 sequences written to `out_crosskey.fasta`.
```bash
# Join on two keys simultaneously: match only when both sample and barcode agree,
# then update sequence identifiers with those from the reference file.
obijoin --join-with references.fasta \
--by sample --by barcode \
--update-id \
input.fasta > out_multikey.fasta
```
**Expected output:** 6 sequences written to `out_multikey.fasta`.
```bash
# Replace sequences and quality scores of reads with values from a corrected FASTQ file,
# joining by sequence ID (default when no --by is specified).
obijoin --join-with corrected.fastq \
--update-sequence --update-quality \
input.fastq > out_updated.fastq
```
**Expected output:** 3 sequences written to `out_updated.fastq`.
```bash
# Use an OBITools CSV file as primary input (--csv flag), join with a metadata CSV,
# then write compressed FASTA output without showing the progress bar.
obijoin --join-with metadata.csv --by sample \
--csv --fasta-output --compress \
--no-progressbar \
--out out_compressed.fasta.gz \
primary.csv
```
**Expected output:** 3 sequences written to `out_compressed.fasta.gz`.
---
# NOTES
- The secondary file supplied via `--join-with` is loaded entirely into memory before the join begins. For very large secondary files this may require significant RAM.
- Key matching is based on exact string equality; no regular expression or fuzzy matching is applied.
- The join is a left outer join: primary sequences without a matching partner in the secondary file are still emitted, unchanged, in the output.
- When the annotation source is a plain CSV spreadsheet (columns = attributes, rows = records), the format is auto-detected — no `--csv` flag is needed. The `--csv` flag applies exclusively to the primary input and is intended for sequences stored in OBITools CSV format.
# NAME
obimicrosat — looks for microsatellite sequences in a sequence file
---
# SYNOPSIS
```
obimicrosat [options] [<filename>...]
```
---
# DESCRIPTION
`obimicrosat` scans DNA sequences for simple sequence repeats (SSRs), also called
microsatellites — tandem repetitions of a short motif (1–6 bp by default). For each
sequence containing a qualifying repeat, the command annotates it with the location,
unit sequence, repeat count, and flanking regions, then writes it to output. Sequences
with no detected microsatellite are silently discarded.
The detection works in two passes. A first regular expression finds any tandem repeat
satisfying the unit-length and repeat-count constraints. The true minimal repeat unit
is then determined, and a second scan refines the exact boundaries. The repeat unit is
normalized to its lexicographically smallest rotation across all rotations and its
reverse complement, which allows equivalent loci to be grouped consistently across
samples.
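The normalization rule (smallest rotation over the unit and its reverse complement) can be illustrated with a short Python sketch; this models the rule as stated, not the obimicrosat source:

```python
# Normalize a repeat unit to its canonical form: the lexicographically
# smallest string among all rotations of the unit and all rotations of
# its reverse complement.
COMP = str.maketrans("acgt", "tgca")

def rotations(s):
    return [s[i:] + s[:i] for i in range(len(s))]

def revcomp(s):
    return s.translate(COMP)[::-1]

def canonical_unit(unit):
    """Return (normalized_unit, orientation) for a repeat unit."""
    unit = unit.lower()
    fwd = min(rotations(unit))
    rev = min(rotations(revcomp(unit)))
    if fwd <= rev:
        return fwd, "direct"
    return rev, "reverse"
```

Consistent with the observed output example, a `gt` unit normalizes to `ac` with orientation `reverse`, which is when the sequence is reoriented by default.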
By default, when the canonical form of a unit requires the reverse complement, the
whole sequence is reoriented so that the microsatellite is always reported on the
direct strand of the normalized unit. This behaviour can be disabled with
`--not-reoriented`.
A common use case is identifying polymorphic SSR markers for population genetics, or
flagging repeat-rich regions before designing PCR primers.
---
# INPUT
Accepts one or more sequence files on the command line. If no file is given, sequences
are read from standard input. Supported formats include FASTA, FASTQ, JSON/OBI, GenBank,
EMBL, ecoPCR output, and CSV. Compressed files (gzip) are handled transparently.
Format is detected automatically unless overridden by input flags.
---
# OUTPUT
Outputs only the sequences in which a microsatellite was found. Each retained sequence
carries the following additional attributes:
| Attribute | Content |
|---|---|
| `microsat` | Full repeat region as a string |
| `microsat_from` | 1-based start position of the repeat |
| `microsat_to` | End position of the repeat (inclusive) |
| `microsat_unit` | Repeat unit as observed in the sequence |
| `microsat_unit_normalized` | Lexicographically smallest canonical form |
| `microsat_unit_orientation` | `direct` or `reverse` |
| `microsat_unit_length` | Length of the repeat unit (bp) |
| `microsat_unit_count` | Number of complete unit repetitions |
| `seq_length` | Total length of the (possibly reoriented) sequence |
| `microsat_left` | Flanking sequence to the left of the repeat |
| `microsat_right` | Flanking sequence to the right of the repeat |
When a sequence is reoriented (reverse-complemented), `_cmp` is appended to its
identifier.
The output format follows the same rules as the rest of OBITools4: FASTQ when quality
scores are present, FASTA or JSON/OBI otherwise, configurable via output flags.
## Observed output example
```
>seq001 {"definition":"dinucleotide AC repeat 16x with 40bp non-repetitive flanks","microsat":"acacacacacacacacacacacacacacacac","microsat_from":40,"microsat_left":"agtcgaacttgcatgccttcagggcaagtctagcttacg","microsat_right":"cgatagtcatgcaagtcttgcggcatagatcgttacca","microsat_to":71,"microsat_unit":"ac","microsat_unit_count":16,"microsat_unit_length":2,"microsat_unit_normalized":"ac","microsat_unit_orientation":"direct","seq_length":109}
agtcgaacttgcatgccttcagggcaagtctagcttacgacacacacacacacacacaca
cacacacacaccgatagtcatgcaagtcttgcggcatagatcgttacca
>seq006_cmp {"definition":"GT repeat 16x with 40bp non-repetitive flanks canonical form is AC","microsat":"acacacacacacacacacacacacacacacac","microsat_from":39,"microsat_left":"tggtaacgatctatgccgcaagacttgcatgactatcg","microsat_right":"cgtaagctagacttgccctgaaggcatgcaagttcgact","microsat_to":70,"microsat_unit":"ac","microsat_unit_count":16,"microsat_unit_length":2,"microsat_unit_normalized":"ac","microsat_unit_orientation":"reverse","seq_length":109}
tggtaacgatctatgccgcaagacttgcatgactatcgacacacacacacacacacacac
acacacacaccgtaagctagacttgccctgaaggcatgcaagttcgact
```
---
# OPTIONS
## Microsatellite detection
**`--min-unit-length` / `-m`**
- Default: `1`
- Minimum length in base pairs of the repeated motif. Set to `2` to exclude
  mononucleotide repeats, or `3` to exclude both mono- and dinucleotide repeats.
**`--max-unit-length` / `-M`**
- Default: `6`
- Maximum length in base pairs of the repeated motif. Increasing this value detects
longer repeat units (minisatellites) at the cost of more complex patterns.
**`--min-unit-count`**
- Default: `5`
- Minimum number of times the motif must be repeated. A value of `5` with a 2 bp unit
requires at least 10 bp of pure repeat.
**`--min-length` / `-l`**
- Default: `20`
- Minimum total length (in bp) of the repeat region. This filter applies after the
unit-count filter and is useful to exclude very short but technically qualifying
repeats.
**`--min-flank-length` / `-f`**
- Default: `0`
- Minimum length of the flanking sequence on each side of the repeat. Sequences with
flanks shorter than this threshold are discarded, which is useful when the output
will feed a primer-design step.
**`--not-reoriented` / `-n`**
- Default: `false` (reorientation is active by default)
- When set, sequences are never reverse-complemented to match the canonical orientation
of the repeat unit. The microsatellite is reported as found, in its original
orientation.
## Input / output
Inherited from the standard OBITools4 conversion layer. Common flags include:
**`--input-OBI-header`** — parse OBI-style FASTA/FASTQ headers.
**`--input-json-header`** — parse JSON-encoded headers.
**`--skip-empty`** — skip sequences with no nucleotides.
**`--u-to-t`** — convert U to T (RNA → DNA).
**`--output-json-header`** — write JSON-encoded headers.
**`--output-obi-header`** — write OBI-style headers.
**`--gzip`** — compress output with gzip.
**`--workers` / `-p`** — number of parallel processing workers.
---
# EXAMPLES
```bash
# Detect microsatellites with default settings (unit 1–6 bp, ≥5 repeats, ≥20 bp total)
obimicrosat sequences.fasta > out_default.fasta
```
**Expected output:** 6 sequences written to `out_default.fasta`.
```bash
# Restrict to di- and trinucleotide repeats only
obimicrosat -m 2 -M 3 sequences.fasta > out_dinucleotide.fasta
```
**Expected output:** 4 sequences written to `out_dinucleotide.fasta`
(mononucleotide and tetranucleotide repeats excluded).
```bash
# Require at least 30 bp flanking sequence on each side (for primer design)
obimicrosat -f 30 sequences.fasta > out_primer_ready.fasta
```
**Expected output:** 3 sequences written to `out_primer_ready.fasta`
(sequences with flanks shorter than 30 bp are discarded).
```bash
# Keep sequences in their original orientation (no reverse-complement)
obimicrosat --not-reoriented sequences.fasta > out_no_reorient.fasta
```
**Expected output:** 6 sequences written to `out_no_reorient.fasta`
(GT-repeat sequence kept as-is without `_cmp` suffix; `microsat_unit_orientation` is `reverse`).
```bash
# Require at least 8 repeat units and a minimum repeat length of 30 bp
obimicrosat --min-unit-count 8 -l 30 sequences.fasta > out_strict.fasta
```
**Expected output:** 4 sequences written to `out_strict.fasta`
(short or low-count repeats excluded).
---
# SEE ALSO
`obigrep` — filter sequences by annotation after microsatellite detection.
`obiannotate` — add or modify sequence annotations.
`obiconvert` — format conversion for sequence files.
---
# NOTES
- Only sequences with at least one qualifying microsatellite are written to output;
all others are silently filtered out.
- The normalization algorithm considers all rotations of the unit and their reverse
complements, selecting the lexicographically smallest string. This ensures consistent
grouping of loci regardless of which strand was sequenced.
- When reorientation is active (the default), sequences whose canonical unit falls on
the reverse strand are reverse-complemented and their ID receives the suffix `_cmp`.
Coordinate attributes (`microsat_from`, `microsat_to`) always refer to the
(possibly reoriented) output sequence.
- Repetitive low-complexity sequences may match multiple overlapping patterns; only the
first match is reported per sequence.
- Flanking sequences must be **non-repetitive** to avoid the tool detecting a tandem
repeat within the flank instead of the intended SSR. When designing synthetic test
data, ensure flanking regions do not contain tandem repeat motifs of their own.
# NAME
obiscript — executes a Lua script on the input sequences
---
# SYNOPSIS
```
obiscript [--allows-indels] [--approx-pattern <PATTERN>]...
[--attribute|-a <KEY=VALUE>]... [--batch-mem <string>]
[--batch-size <int>] [--batch-size-max <int>] [--compress|-Z]
[--csv] [--debug] [--definition|-D <PATTERN>]... [--ecopcr]
[--embl] [--fail-on-taxonomy] [--fasta] [--fasta-output] [--fastq]
[--fastq-output] [--genbank] [--has-attribute|-A <KEY>]...
[--help|-h|-?] [--id-list <FILENAME>]
[--identifier|-I <PATTERN>]... [--ignore-taxon|-i <TAXID>]...
[--input-OBI-header] [--input-json-header] [--inverse-match|-v]
[--json-output] [--max-count|-C <COUNT>] [--max-cpu <int>]
[--max-length|-L <LENGTH>] [--min-count|-c <COUNT>]
[--min-length|-l <LENGTH>] [--no-order] [--no-progressbar]
[--only-forward] [--out|-o <FILENAME>] [--output-OBI-header|-O]
[--output-json-header]
[--paired-mode <forward|reverse|and|or|andnot|xor>]
[--pattern-error <int>] [--pprof] [--pprof-goroutine <int>]
[--pprof-mutex <int>] [--predicate|-p <EXPRESSION>]...
[--raw-taxid] [--require-rank <RANK_NAME>]...
[--restrict-to-taxon|-r <TAXID>]... [--script|-S <string>]
[--sequence|-s <PATTERN>]... [--silent-warning] [--skip-empty]
[--solexa] [--taxonomy|-t <string>] [--template] [--u-to-t]
[--update-taxid] [--valid-taxid] [--version] [--with-leaves]
[<args>]
```
---
# DESCRIPTION
`obiscript` applies a user-provided Lua script to a stream of biological sequences. For each input sequence record, the script's `worker(sequence)` function is called, giving the user full programmatic access to the sequence's identifier, data, quality scores, and metadata attributes. This makes it possible to implement custom annotation logic, computed filters, or record transformations that go beyond what fixed-function OBITools commands offer.
The Lua script may also define two optional lifecycle hooks: `begin()`, called once before any sequence is processed (useful for initialising counters or opening files), and `finish()`, called after the last sequence (useful for printing summary statistics or flushing output). A thread-safe shared table `obicontext` is available across all workers, allowing aggregation across parallel executions.
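The lifecycle can be pictured as a simple pipeline. The Python sketch below models the calling convention only: real obiscript scripts are written in Lua, and `obicontext` is modelled here as an explicitly passed dict rather than a global table.

```python
def run_script(records, begin=None, worker=None, finish=None):
    """Model of the obiscript lifecycle: begin() once, worker() per
    record, finish() once. obicontext is modelled as a shared dict."""
    obicontext = {}
    if begin:
        begin(obicontext)
    out = []
    for rec in records:
        out.append(worker(obicontext, rec) if worker else rec)
    if finish:
        finish(obicontext)
    return out

# Hypothetical example script: prefix each identifier with its sample
# attribute and count processed records in the shared context.
def begin(ctx):
    ctx["n"] = 0

def worker(ctx, rec):
    ctx["n"] += 1
    rec = dict(rec)
    rec["id"] = f'{rec["sample"]}_{rec["id"]}'
    return rec
```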
Sequences are read from files or standard input in any format supported by OBITools4 (FASTA, FASTQ, EMBL, GenBank, ecoPCR, CSV). The sequence filtering flags (such as `--min-length` or `--predicate`) select which sequences the Lua script is applied to; sequences that do not match the filter pass through to the output unchanged without the script being executed on them. All sequences, scripted or not, are written to the output.
To get started, use `--template` to print a minimal Lua script skeleton with stubs for all three hooks and inline documentation.
---
# INPUT
`obiscript` reads biological sequences from one or more files supplied as positional arguments, or from standard input if no files are given. All formats supported by OBITools4 are accepted: FASTA, FASTQ, EMBL flatfile, GenBank flatfile, ecoPCR output, and CSV. Format auto-detection is used by default; explicit format flags (`--fasta`, `--fastq`, `--embl`, `--genbank`, `--ecopcr`, `--csv`) override it. Header annotation style can be forced with `--input-OBI-header` or `--input-json-header`.
---
# OUTPUT
Sequences processed by the Lua script are written to standard output, or to the file given by `--out`. Any modifications made to sequence records inside `worker()` (identifier, sequence, attributes) are reflected in the output. The output format defaults to FASTA when no quality data are present and to FASTQ otherwise; use `--fasta-output`, `--fastq-output`, or `--json-output` to override. Header annotation style in FASTA/FASTQ output can be set with `--output-OBI-header` or `--output-json-header`. Output can be gzip-compressed with `--compress`.
## Observed output example
```
>sample1_seq001 {"definition":"control sequence for annotation test","sample":"sample1"}
atcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcgatcg
>sample1_seq002 {"definition":"another control sequence from sample1","sample":"sample1"}
gctagctagctagctagctagctagctagctagctagctagctagcta
>sample2_seq003 {"definition":"second sample sequence","sample":"sample2"}
ttaattaattaattaattaattaattaattaattaattaattaattaa
>sample2_seq004 {"definition":"second sample another sequence","sample":"sample2"}
ccggccggccggccggccggccggccggccggccggccggccggccgg
>sample3_seq005 {"definition":"third sample first sequence","sample":"sample3"}
aaaattttccccggggaaaattttccccggggaaaattttccccgggg
>sample3_seq006 {"definition":"third sample second sequence","sample":"sample3"}
ttttaaaaccccggggttttaaaaccccggggttttaaaaccccgggg
```
---
# OPTIONS
## Script
### `--script|-S <string>`
- Default: `""`
- Path to the Lua script file to execute. The file must exist and be syntactically valid Lua. The script should define a `worker(sequence)` function, and optionally `begin()` and `finish()`.
### `--template`
- Default: `false`
- Print a minimal Lua script template to standard output, with stubs for `begin()`, `worker()`, and `finish()` and inline documentation, then exit. Use this to bootstrap a new script.
## Sequence filtering (selects sequences on which the script is applied; non-matching sequences pass through unchanged)
### `--predicate|-p <EXPRESSION>`
- Default: `[]`
- Boolean expression evaluated for each sequence record. Attribute keys are accessible as variable names; `sequence` refers to the record itself. Multiple `-p` options are combined with AND.
### `--sequence|-s <PATTERN>`
- Default: `[]`
- Regular expression matched against the nucleotide sequence. Case-insensitive. Multiple patterns are combined with AND.
### `--identifier|-I <PATTERN>`
- Default: `[]`
- Regular expression matched against the sequence identifier. Case-insensitive.
### `--definition|-D <PATTERN>`
- Default: `[]`
- Regular expression matched against the sequence definition line. Case-insensitive.
### `--approx-pattern <PATTERN>`
- Default: `[]`
- Pattern matched approximately against the sequence. Use `--pattern-error` to set the maximum number of errors allowed.
### `--pattern-error <int>`
- Default: `0`
- Maximum number of errors (mismatches) allowed during approximate pattern matching.
### `--allows-indels`
- Default: `false`
- Allow insertions and deletions (in addition to mismatches) during approximate pattern matching.
### `--only-forward`
- Default: `false`
- Restrict pattern matching to the forward strand only.
### `--has-attribute|-A <KEY>`
- Default: `[]`
- Apply the script only to records that have an attribute with key `<KEY>`; others pass through.
### `--attribute|-a <KEY=VALUE>`
- Default: `{}`
- Apply the script only to records where the attribute `KEY` matches the regular expression `VALUE`. Case-sensitive. Multiple `-a` options are combined with AND.
### `--id-list <FILENAME>`
- Default: `""`
- Path to a text file containing one sequence identifier per line. The script is applied only to records whose identifier appears in the file; others pass through.
### `--min-length|-l <LENGTH>`
- Default: `1`
- Apply the script only to sequences whose length is at least `LENGTH`; shorter sequences pass through unchanged.
### `--max-length|-L <LENGTH>`
- Default: `2000000000`
- Apply the script only to sequences whose length is at most `LENGTH`; longer sequences pass through unchanged.
### `--min-count|-c <COUNT>`
- Default: `1`
- Apply the script only to sequences with a count (abundance) of at least `COUNT`; others pass through unchanged.
### `--max-count|-C <COUNT>`
- Default: `2000000000`
- Apply the script only to sequences with a count (abundance) of at most `COUNT`; others pass through unchanged.
### `--inverse-match|-v`
- Default: `false`
- Invert the selection: apply the script to records that do NOT match the filter criteria; matching records pass through unchanged.
## Taxonomic filtering
### `--taxonomy|-t <string>`
- Default: `""`
- Path to the taxonomy database. Required for taxonomy-based options.
### `--restrict-to-taxon|-r <TAXID>`
- Default: `[]`
- Retain only sequences whose taxid belongs to the specified taxon.
### `--ignore-taxon|-i <TAXID>`
- Default: `[]`
- Exclude sequences whose taxid belongs to the specified taxon.
### `--require-rank <RANK_NAME>`
- Default: `[]`
- Retain only sequences whose taxon has the specified rank (e.g., `species`, `genus`).
### `--valid-taxid`
- Default: `false`
- Retain only sequences that carry a currently valid NCBI taxid.
### `--fail-on-taxonomy`
- Default: `false`
- Abort with an error if a taxid used during filtering is not currently valid.
### `--update-taxid`
- Default: `false`
- Automatically replace taxids declared as merged with their current equivalent.
### `--raw-taxid`
- Default: `false`
- Print taxids in output without supplementary information (taxon name and rank).
### `--with-leaves`
- Default: `false`
- When extracting taxonomy from a sequence file, attach sequences as leaves of their taxid annotation.
## Paired-end mode
### `--paired-mode <forward|reverse|and|or|andnot|xor>`
- Default: `"forward"`
- When paired reads are provided, determines how filter conditions are applied to both reads of a pair.
## Input format
### `--fasta`
- Default: `false`
- Force FASTA format parsing.
### `--fastq`
- Default: `false`
- Force FASTQ format parsing.
### `--embl`
- Default: `false`
- Force EMBL flatfile format parsing.
### `--genbank`
- Default: `false`
- Force GenBank flatfile format parsing.
### `--ecopcr`
- Default: `false`
- Force ecoPCR output format parsing.
### `--csv`
- Default: `false`
- Force CSV format parsing.
### `--input-OBI-header`
- Default: `false`
- Parse FASTA/FASTQ title line annotations as OBI format.
### `--input-json-header`
- Default: `false`
- Parse FASTA/FASTQ title line annotations as JSON format.
### `--solexa`
- Default: `false`
- Decode quality strings according to the Solexa specification.
### `--u-to-t`
- Default: `false`
- Convert uracil (U) to thymine (T) in sequences.
### `--skip-empty`
- Default: `false`
- Suppress sequences of length zero from the output.
### `--no-order`
- Default: `false`
- When multiple input files are provided, indicates that no ordering is assumed among them.
## Output format
### `--out|-o <FILENAME>`
- Default: `"-"` (standard output)
- File path for saving the output.
### `--fasta-output`
- Default: `false`
- Write output in FASTA format.
### `--fastq-output`
- Default: `false`
- Write output in FASTQ format.
### `--json-output`
- Default: `false`
- Write output in JSON format.
### `--output-OBI-header|-O`
- Default: `false`
- Write FASTA/FASTQ title line annotations in OBI format.
### `--output-json-header`
- Default: `false`
- Write FASTA/FASTQ title line annotations in JSON format.
### `--compress|-Z`
- Default: `false`
- Compress output using gzip.
## Performance
### `--max-cpu <int>`
- Default: `16` (env: `OBIMAXCPU`)
- Number of parallel threads used for processing.
### `--batch-size <int>`
- Default: `1` (env: `OBIBATCHSIZE`)
- Minimum number of sequences per processing batch.
### `--batch-size-max <int>`
- Default: `2000` (env: `OBIBATCHSIZEMAX`)
- Maximum number of sequences per processing batch.
### `--batch-mem <string>`
- Default: `""` (128M effective; env: `OBIBATCHMEM`)
- Maximum memory per batch (e.g. `128K`, `64M`, `1G`). Set to `0` to disable.
## Diagnostics
### `--debug`
- Default: `false` (env: `OBIDEBUG`)
- Enable debug logging.
### `--no-progressbar`
- Default: `false`
- Disable the progress bar.
### `--silent-warning`
- Default: `false` (env: `OBIWARNING`)
- Suppress warning messages.
### `--pprof`
- Default: `false`
- Enable the pprof profiling HTTP server (see log for address).
### `--pprof-goroutine <int>`
- Default: `6060` (env: `OBIPPROFGOROUTINE`)
- Port for goroutine blocking profile.
### `--pprof-mutex <int>`
- Default: `10` (env: `OBIPPROFMUTEX`)
- Rate for mutex lock profiling.
---
# EXAMPLES
```bash
# Print a starter script template and save it to my_script.lua
obiscript --template > my_script.lua
```
**Expected output:** Lua template with `begin()`, `worker()`, and `finish()` stubs written to `my_script.lua`.
```bash
# Add a custom annotation to every sequence record
# (the script sets a new attribute 'sample' from the identifier prefix)
obiscript --script annotate.lua --fasta-output sequences.fasta > annotated.fasta
```
**Expected output:** 6 sequences written to `annotated.fasta`.
```bash
# Count reads per taxon using the finish() hook, filtering to a specific taxon
obiscript --script count_taxa.lua \
--restrict-to-taxon 6231 \
--taxonomy /data/ncbi_tax \
sequences.fasta > filtered_annotated.fasta
```
```bash
# Apply a script to FASTQ sequences with a length filter
obiscript --script process_pairs.lua \
--min-length 100 \
--out result.fastq \
reads.fastq
```
**Expected output:** 4 sequences written to `result.fastq`.
```bash
# Run on FASTQ input, output JSON, using 4 CPU threads
obiscript --script enrich.lua \
--json-output \
--max-cpu 4 \
sequences.fastq > enriched.json
```
**Expected output:** 4 sequences written to `enriched.json`.
---
# SEE ALSO
`obigrep` — filter sequences using the same selection criteria without scripting.
`obiannotate` — add or modify sequence attributes without scripting.
---
# NOTES
- The Lua `worker(sequence)` function is called in parallel across multiple goroutines. Use the thread-safe `obicontext` table (with `obicontext:lock()` / `obicontext:unlock()`) for any shared mutable state accessed across workers.
- The `begin()` and `finish()` hooks each run in a single goroutine and do not need locking for their own internal state.
- Sequence records modified inside `worker()` must be returned (or the original returned unmodified) for the record to appear in the output. Returning `nil` drops the sequence.
# NAME
obisummary — summarise the main information from a sequence file
---
# SYNOPSIS
```
obisummary [--batch-mem <string>] [--batch-size <int>]
[--batch-size-max <int>] [--csv] [--debug] [--ecopcr] [--embl]
[--fasta] [--fastq] [--genbank] [--help|-h|-?]
[--input-OBI-header] [--input-json-header] [--json-output]
[--map <string>]... [--max-cpu <int>] [--no-order] [--pprof]
[--pprof-goroutine <int>] [--pprof-mutex <int>] [--silent-warning]
[--solexa] [--u-to-t] [--version] [--yaml-output] [<args>]
```
---
# DESCRIPTION
`obisummary` reads a set of biological sequences and computes a statistical
summary of their content and annotations. Rather than producing a new sequence
file, it outputs a single structured record describing the dataset as a whole.
The summary covers three main areas. First, global counts: the total number of
reads (sequences weighted by their `count` attribute), the number of distinct
sequence variants, and the total sequence length across all records. Second,
annotation profiling: `obisummary` inspects every annotation key present in
the dataset and classifies it as a scalar attribute (single value per
sequence), a map attribute (key-to-count mapping), or a vector attribute
(multi-value per sequence). Third, per-sample statistics: when sequences carry
sample information (via `merged_sample` or equivalent per-sample annotations),
`obisummary` reports for each sample the number of reads, the number of
variants, and the number of singletons. If `obiclean` has been run previously,
the summary also captures `obiclean_status` and related quality flags per
sample.
The output is a single JSON record by default, or YAML when `--yaml-output` is
requested.
`obisummary` is typically used after processing steps such as
`obiclean` or `obiuniq` to quickly validate the state of a dataset before
downstream analysis.
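The scalar/map/vector classification can be illustrated with a small Python sketch (a conceptual model only; the attribute names are invented here, and the real classification is done by the Go implementation):

```python
def classify_annotations(records):
    """Classify each annotation key as scalar, map, or vector,
    from the value types observed across records (conceptual sketch)."""
    kinds = {}
    for rec in records:
        for key, value in rec.items():
            if isinstance(value, dict):
                kind = "map"        # key-to-count mapping
            elif isinstance(value, list):
                kind = "vector"     # multi-value per sequence
            else:
                kind = "scalar"     # single value per sequence
            kinds.setdefault(key, set()).add(kind)
    return {k: v.pop() if len(v) == 1 else "mixed" for k, v in kinds.items()}

records = [
    {"sample": "s1", "merged_sample": {"s1": 3}},
    {"sample": "s2", "taxid_path": [1, 131567]},
]
print(classify_annotations(records))
# {'sample': 'scalar', 'merged_sample': 'map', 'taxid_path': 'vector'}
```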
---
# INPUT
`obisummary` accepts biological sequence data from one or more files supplied
as positional arguments, or from standard input when no files are given.
Supported formats include FASTA, FASTQ, GenBank flatfile, EMBL flatfile,
ecoPCR output, and CSV. By default the format is detected automatically; use
the format flags (`--fasta`, `--fastq`, `--genbank`, `--embl`, `--ecopcr`,
`--csv`) to force a specific parser.
FASTA/FASTQ annotation headers may follow the OBI format (`--input-OBI-header`)
or JSON format (`--input-json-header`). RNA sequences can be read as DNA by
converting uracil to thymine with `--u-to-t`. Quality strings encoded according
to the Solexa specification are handled with `--solexa`.
When multiple input files are provided, `obisummary` assumes they are ordered;
use `--no-order` to indicate that no ordering exists among them.
---
# OUTPUT
`obisummary` writes a single structured record to standard output. The default
format is JSON; use `--yaml-output` to obtain YAML instead.
The record contains three top-level sections:
- **`count`**: global metrics including `variants` (distinct sequences),
`reads` (total weighted count), and `total_length` (sum of all sequence
lengths).
- **`annotations`**: a breakdown of all annotation keys found in the dataset,
classified as `scalar_attributes`, `map_attributes`, or `vector_attributes`,
together with the observed keys and their occurrence counts within each
category.
- **`samples`**: when sample information is present, `sample_count` and a
per-sample `sample_stats` table with `reads`, `variants`, and `singletons`
fields. If `obiclean` data is present, an `obiclean_bad` field is also
reported per sample.
When `--map` is used, the named map attribute is included in the annotation
detail for that attribute.
## Observed output example
```
{
"annotations": {
"keys": {
"scalar": {
"count": 5
}
},
"map_attributes": 0,
"scalar_attributes": 1,
"vector_attributes": 0
},
"count": {
"reads": 21,
"total_length": 100,
"variants": 5
}
}
```
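A record of this shape can be consumed directly for quick checks; a minimal Python sketch reading the example above:

```python
import json

# The observed summary record shown above.
summary_text = """
{
  "annotations": {"keys": {"scalar": {"count": 5}},
                  "map_attributes": 0, "scalar_attributes": 1,
                  "vector_attributes": 0},
  "count": {"reads": 21, "total_length": 100, "variants": 5}
}
"""
summary = json.loads(summary_text)
variants = summary["count"]["variants"]
reads = summary["count"]["reads"]
mean_len = summary["count"]["total_length"] / variants
print(f"{variants} variants, {reads} reads, mean length {mean_len:.0f}")
# 5 variants, 21 reads, mean length 20
```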
---
# OPTIONS
## Summary output
**`--json-output`**
- Default: `false`
- Print the result as a JSON record (this is the default behaviour; this flag
makes the choice explicit).
**`--yaml-output`**
- Default: `false`
- Print the result as a YAML record instead of the default JSON format.
**`--map <string>`**
- Default: `[]` (none)
- Name of a map attribute to include in the summary. This option may be
repeated to request multiple map attributes. Each named attribute will be
detailed in the `map_attributes` section of the output.
## Input format
**`--fasta`**
- Default: `false`
- Read data following the FASTA format.
**`--fastq`**
- Default: `false`
- Read data following the FASTQ format.
**`--genbank`**
- Default: `false`
- Read data following the GenBank flatfile format.
**`--embl`**
- Default: `false`
- Read data following the EMBL flatfile format.
**`--ecopcr`**
- Default: `false`
- Read data following the ecoPCR output format.
**`--csv`**
- Default: `false`
- Read data following the CSV format.
**`--input-OBI-header`**
- Default: `false`
- FASTA/FASTQ title line annotations follow OBI format.
**`--input-json-header`**
- Default: `false`
- FASTA/FASTQ title line annotations follow JSON format.
**`--solexa`**
- Default: `false`
- Decode quality strings according to the Solexa specification.
**`--u-to-t`**
- Default: `false`
- Convert uracil (U) to thymine (T) when reading RNA sequences.
## Batch control
**`--batch-size <int>`**
- Default: `1`
- Minimum number of sequences per processing batch.
**`--batch-size-max <int>`**
- Default: `2000`
- Maximum number of sequences per processing batch.
**`--batch-mem <string>`**
- Default: `""` (128M effective)
- Maximum memory per batch (e.g. `128K`, `64M`, `1G`). Set to `0` to disable
the memory limit.
## Processing
**`--max-cpu <int>`**
- Default: `16`
- Number of parallel threads used to compute the result.
**`--no-order`**
- Default: `false`
- When several input files are provided, indicates that there is no order
among them.
## General
**`--debug`**
- Default: `false`
- Enable debug mode by setting the log level to debug.
**`--silent-warning`**
- Default: `false`
- Stop printing warning messages.
**`--version`**
- Default: `false`
- Print the version and exit.
**`--help` / `-h` / `-?`**
- Default: `false`
- Display help and exit.
**`--pprof`**
- Default: `false`
- Enable the pprof profiling server. Consult the log for the server address.
**`--pprof-goroutine <int>`**
- Default: `6060`
- Port for goroutine blocking profile.
**`--pprof-mutex <int>`**
- Default: `10`
- Rate for mutex lock profiling.
---
# EXAMPLES
```bash
# Get a JSON summary of a FASTA file produced by obiclean
obisummary cleaned.fasta > out_default.yaml
```
**Expected output:** a JSON summary record in `out_default.yaml`.
```bash
# Get the summary as an explicit JSON record for programmatic processing
obisummary --json-output cleaned.fasta > out_json.json
```
**Expected output:** a JSON summary record in `out_json.json`.
```bash
# Get a YAML record from a FASTQ file
obisummary --yaml-output --fastq reads.fastq > out_yaml.yaml
```
**Expected output:** a YAML summary record in `out_yaml.yaml`.
```bash
# Summarise data read from standard input, forcing FASTA format
obigrep -p 'annotations.count > 1' sequences.fasta | obisummary --fasta > out_pipeline.yaml
```
**Expected output:** a JSON summary record in `out_pipeline.yaml` (3 variants, 10 reads).
---
# SEE ALSO
`obiclean`, `obiuniq`, `obicount`
# NAME
obiuniq — dereplicate sequence data sets
---
# SYNOPSIS
```
obiuniq [--batch-mem <string>] [--batch-size <int>] [--batch-size-max <int>]
[--category-attribute|-c <CATEGORY>]... [--chunk-count <int>]
[--compress|-Z] [--csv] [--debug] [--ecopcr] [--embl]
[--fail-on-taxonomy] [--fasta] [--fasta-output] [--fastq]
[--fastq-output] [--genbank] [--help|-h|-?] [--in-memory]
[--input-OBI-header] [--input-json-header] [--json-output]
[--max-cpu <int>] [--merge|-m <KEY>]... [--na-value <NA_NAME>]
[--no-order] [--no-progressbar] [--no-singleton]
[--out|-o <FILENAME>] [--output-OBI-header|-O] [--output-json-header]
[--pprof] [--pprof-goroutine <int>] [--pprof-mutex <int>]
[--raw-taxid] [--silent-warning] [--skip-empty] [--solexa]
[--taxonomy|-t <string>] [--u-to-t] [--update-taxid] [--version]
[--with-leaves] [<args>]
```
---
# DESCRIPTION
`obiuniq` groups identical sequences together and replaces them with a single
representative, recording the total number of original occurrences as an
abundance count. This process — called dereplication — is a standard step in
amplicon sequencing workflows: it dramatically reduces the number of sequence
records to process, while preserving exact counts needed for downstream
statistical analyses.
By default, two sequences are considered identical if and only if their
nucleotide strings are the same. Using `--category-attribute` (repeatable),
additional metadata fields can be included in the identity criterion. For
example, grouping by sample name keeps the same sequence as separate records
when it occurs in different samples, enabling per-sample abundance tracking.
For each group of identical sequences, `obiuniq` emits one output record
carrying the merged metadata of all members. The `--merge` option (repeatable)
instructs the command to also record, in an attribute named `merged_<KEY>`, the
distribution of `KEY` attribute values across the sequences collapsed into each
group — useful for provenance tracking and quality control.
Sequences that appear only once in the entire dataset (singletons) can be
removed with `--no-singleton`. Singletons often represent sequencing errors
rather than genuine biological variants, so their removal is a common
noise-reduction step.
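The grouping and merging semantics described above can be sketched in Python (a conceptual model, not the actual Go implementation; record and attribute names are illustrative):

```python
def dereplicate(records, categories=(), merge_keys=(), na="NA"):
    """Group records by (sequence, category values), summing counts and
    building merged_<KEY> value distributions (conceptual sketch)."""
    groups = {}
    for rec in records:
        key = (rec["seq"],) + tuple(rec.get(c, na) for c in categories)
        g = groups.setdefault(
            key, {"seq": rec["seq"], "count": 0,
                  **{f"merged_{k}": {} for k in merge_keys}})
        n = rec.get("count", 1)
        g["count"] += n
        for k in merge_keys:
            v = rec.get(k, na)
            g[f"merged_{k}"][v] = g[f"merged_{k}"].get(v, 0) + n
    return list(groups.values())

reads = [
    {"seq": "acgt", "sample": "s1"},
    {"seq": "acgt", "sample": "s1"},
    {"seq": "acgt", "sample": "s2"},
    {"seq": "ttgg"},                     # lacks 'sample': grouped under NA
]
for g in dereplicate(reads, merge_keys=("sample",)):
    print(g["seq"], g["count"], g["merged_sample"])
# acgt 3 {'s1': 2, 's2': 1}
# ttgg 1 {'NA': 1}
```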
---
# INPUT
`obiuniq` accepts biological sequence data in FASTA, FASTQ, EMBL, GenBank,
ecoPCR, or CSV format (auto-detected by default, or forced with format flags
such as `--fasta`, `--fastq`, `--embl`, etc.). Input is read from one or more
files given as positional arguments, or from standard input when no files are
provided.
When multiple input files are provided, `obiuniq` assumes they are ordered
(e.g., paired-end reads in the same read order). If no such ordering exists,
use `--no-order` to signal that files can be consumed independently.
FASTA/FASTQ header annotations are parsed heuristically by default. Use
`--input-OBI-header` for OBI-formatted headers or `--input-json-header` for
JSON-formatted headers. RNA sequences can be normalised to DNA on the fly with
`--u-to-t`.
---
# OUTPUT
`obiuniq` writes dereplicated sequences to standard output or to the file
specified by `--out`. Each output record represents one group of identical
sequences (identical under the chosen grouping criterion). The output carries
the merged metadata from all input records in the group.
The output format defaults to FASTA. Even when the input contains quality
scores (FASTQ), quality information is not preserved across merged sequences,
so the output is written in FASTA format unless `--fastq-output` is explicitly
requested.
Output annotations follow the OBI header format when `--output-OBI-header` is
set, or JSON when `--output-json-header` is set. The output can be
gzip-compressed with `--compress`.
For each output record:
- The abundance count reflects how many input sequences were merged into the
group.
- Attributes created by `--merge KEY` are named `merged_KEY` and map each
observed value of the `KEY` attribute to the count of input sequences
carrying that value within the group.
- All other attributes are merged from the contributing records according to
the standard OBITools4 merging rules.
## Observed output example
```
>seq008 {"count":1,"primer":"p1"}
cccccccccccccccccccc
>seq001 {"count":4,"primer":"p1"}
atcgatcgatcgatcgatcg
>seq004 {"count":2,"primer":"p1","sample":"s1"}
gctagctagctagctagcta
>seq007 {"count":1,"primer":"p1","sample":"s2"}
tttttttttttttttttttt
```
---
# OPTIONS
## Dereplication Options
**`--category-attribute|-c <CATEGORY>`** (default: `[]`)
Adds one metadata attribute to the grouping criterion. Two sequences are
placed in the same group only when they are nucleotide-identical **and** share
the same value for every attribute listed with `-c`. This option can be
repeated to combine multiple attributes (e.g., `-c sample -c primer`).
Records that lack a listed attribute receive the value set by `--na-value`.
**`--chunk-count <int>`** (default: `100`)
Controls how many internal partitions the dataset is split into during
processing. A higher value reduces per-partition memory usage at the cost of
more temporary files; a lower value increases per-partition memory but reduces
I/O overhead. Tune this when processing very large or very small datasets.
**`--in-memory`** (default: `false`)
Stores intermediate data chunks in RAM rather than in temporary disk files.
Speeds up processing on datasets that fit comfortably in available memory;
omit this flag (the default) for large datasets that exceed available RAM.
**`--merge|-m <KEY>`** (default: `[]`)
Creates an output attribute named `merged_KEY` that maps each observed value
of the `KEY` attribute to the count of input sequences carrying that value
within the group. Repeat to track multiple attributes.
Useful for tracking which samples or categories contributed the reads collapsed into each group.
**`--na-value <NA_NAME>`** (default: `"NA"`)
Value assigned to a category attribute when a sequence record does not carry
that attribute. All sequences lacking the attribute are grouped together under
this placeholder, rather than being treated as incomparable.
**`--no-singleton`** (default: `false`)
Discards all output records whose abundance count is exactly one — i.e.,
sequences that occur only once across the entire input. Removing singletons
is a standard heuristic for excluding sequencing errors from further analysis.
## Input Options
**`--batch-mem <string>`** (default: `""`, env: `OBIBATCHMEM`)
Maximum memory budget per processing batch (e.g. `128K`, `64M`, `1G`). Set
to `0` to disable the memory ceiling. Overrides `--batch-size-max` when
both are set.
**`--batch-size <int>`** (default: `10`, env: `OBIBATCHSIZE`)
Minimum number of sequences per batch (floor).
**`--batch-size-max <int>`** (default: `2000`, env: `OBIBATCHSIZEMAX`)
Maximum number of sequences per batch (ceiling).
**`--csv`** (default: `false`)
Parse input as CSV format.
**`--ecopcr`** (default: `false`)
Parse input as ecoPCR output format.
**`--embl`** (default: `false`)
Parse input as EMBL flatfile format.
**`--fasta`** (default: `false`)
Parse input as FASTA format.
**`--fastq`** (default: `false`)
Parse input as FASTQ format.
**`--genbank`** (default: `false`)
Parse input as GenBank flatfile format.
**`--input-OBI-header`** (default: `false`)
Treat FASTA/FASTQ title line annotations as OBI-format key=value pairs.
**`--input-json-header`** (default: `false`)
Treat FASTA/FASTQ title line annotations as JSON objects.
**`--no-order`** (default: `false`)
When multiple input files are provided, indicates that there is no ordering
relationship among them.
**`--skip-empty`** (default: `false`)
Suppress sequences of length zero from the output.
**`--solexa`** (default: `false`, env: `OBISOLEXA`)
Decode quality strings according to the Solexa specification rather than the
standard Phred encoding.
**`--u-to-t`** (default: `false`)
Convert uracil (U) to thymine (T) in all input sequences, normalising RNA to
DNA representation.
## Output Options
**`--compress|-Z`** (default: `false`)
Compress output using gzip.
**`--fasta-output`** (default: `false`)
Write output in FASTA format (default when no quality scores are available).
**`--fastq-output`** (default: `false`)
Write output in FASTQ format (default when quality scores are present).
**`--json-output`** (default: `false`)
Write output in JSON format.
**`--out|-o <FILENAME>`** (default: `"-"`)
Write output to the specified file instead of standard output.
**`--output-OBI-header|-O`** (default: `false`)
Write FASTA/FASTQ title line annotations in OBI format.
**`--output-json-header`** (default: `false`)
Write FASTA/FASTQ title line annotations in JSON format.
## Taxonomy Options
**`--fail-on-taxonomy`** (default: `false`)
Cause `obiuniq` to exit with an error if a taxid in the data is not a
currently valid taxon in the loaded taxonomy.
**`--raw-taxid`** (default: `false`)
Print taxids in output without supplementary information (taxon name and rank).
**`--taxonomy|-t <string>`** (default: `""`)
Path to the taxonomy database used to validate or update taxids.
**`--update-taxid`** (default: `false`)
Automatically replace merged taxids with the most recent valid taxid.
**`--with-leaves`** (default: `false`)
When taxonomy is extracted from a sequence file, add sequences as leaves of
their taxid annotation.
## Execution Options
**`--max-cpu <int>`** (default: `16`, env: `OBIMAXCPU`)
Number of parallel threads used to compute the result.
**`--debug`** (default: `false`, env: `OBIDEBUG`)
Enable debug mode by setting the log level to debug.
**`--no-progressbar`** (default: `false`)
Disable the progress bar.
**`--silent-warning`** (default: `false`, env: `OBIWARNING`)
Suppress warning messages.
**`--pprof`** (default: `false`)
Enable the pprof profiling server (address logged at startup).
**`--pprof-goroutine <int>`** (default: `6060`, env: `OBIPPROFGOROUTINE`)
Port for the goroutine blocking profile endpoint.
**`--pprof-mutex <int>`** (default: `10`, env: `OBIPPROFMUTEX`)
Rate for the mutex contention profile.
**`--version`** (default: `false`)
Print the version string and exit.
**`--help|-h|-?`** (default: `false`)
Print usage information and exit.
---
# EXAMPLES
```bash
# Dereplicate a FASTQ file of amplicon reads; unique sequences are written
# as FASTA records (qualities are dropped on merging) to out_basic.fastq.
obiuniq reads.fastq -o out_basic.fastq
```
**Expected output:** 4 sequences written to `out_basic.fastq`.
```bash
# Dereplicate keeping sequences separate per sample (category attribute),
# and discard singletons to remove likely sequencing errors.
obiuniq -c sample --no-singleton reads.fastq -o out_no_singleton.fastq
```
**Expected output:** 2 sequences written to `out_no_singleton.fastq`.
```bash
# Dereplicate per sample, recording the sample distribution in 'merged_sample',
# and use 'UNKNOWN' for reads missing the sample attribute.
obiuniq -c sample --merge sample --na-value UNKNOWN reads.fastq -o out_merge.fastq
```
**Expected output:** 5 sequences written to `out_merge.fastq`.
```bash
# Process a dataset entirely in memory using 200 internal partitions,
# writing gzip-compressed output.
obiuniq --in-memory --chunk-count 200 --compress -o out_inmemory.fastq.gz reads.fastq
```
**Expected output:** 4 sequences written to `out_inmemory.fastq.gz`.
```bash
# Dereplicate reads from two sample files with no assumed ordering between them,
# grouping by both sample and primer attributes.
obiuniq --no-order -c sample -c primer sample1.fastq sample2.fastq -o out_multifile.fastq
```
**Expected output:** 4 sequences written to `out_multifile.fastq`.
---
# SEE ALSO
- `obigrep` — filter dereplicated sequences by abundance, length, or annotation
- `obiannotate` — add or modify annotations on dereplicated records
- `obicount` — count sequences or groups in a dataset
- `obiclean` — remove sequencing artefacts from a dereplicated dataset
- `obisummary` — summarise annotation distributions across a sequence set
---
# NOTES
For datasets that do not fit in RAM, `obiuniq` uses temporary disk-backed
chunk files by default. The number of chunks is controlled by `--chunk-count`
(default 100). Increasing this value lowers per-chunk memory requirements;
decreasing it reduces I/O at the cost of higher peak memory. Use `--in-memory`
only when the full working set fits in available RAM, as exceeding memory will
degrade performance or cause out-of-memory failures.
Singletons (sequences with abundance = 1) are a common source of noise in
amplicon sequencing, often arising from PCR or sequencing errors. The
`--no-singleton` flag is therefore recommended for most metabarcoding
workflows, unless the study design requires retaining all observed variants.
When the `--category-attribute` option is used, records that lack the
specified attribute are grouped together under the `--na-value` placeholder
(default `"NA"`). This ensures that all records participate in dereplication
without being silently dropped, but users should be aware that heterogeneous
records with different missing attributes may be unintentionally merged.
# `neural-ensemble` — A Lightweight Library for Modular Neural Ensemble Learning
The `neural-ensemble` package provides tools to build, train, evaluate, and deploy ensembles of neural networks with minimal boilerplate. It emphasizes modularity, reproducibility, and scalability—supporting both homogeneous (e.g., multiple ResNets) and heterogeneous ensembles (mix of CNNs, Transformers, MLPs)—while offering unified interfaces for data handling, training orchestration, and uncertainty quantification.
## Core Functionalities
### 1. **Model Composition**
- `Ensemble`: A container class to manage multiple models (heterogeneous or homogeneous), supporting dynamic model registration, weighted averaging, voting, and stacking.
- `ModelConfig`: A dataclass to declaratively specify model architecture (e.g., backbone, input shape), training hyperparameters, and checkpoint paths.
### 2. **Training & Orchestration**
- `EnsembleTrainer`: Handles distributed or sequential training of ensemble members, with support for early stopping, learning rate scheduling per member, and custom loss weighting.
- `TrainerCallback`: Abstract base for implementing logging, checkpointing, or metric tracking hooks.
### 3. **Data Handling**
- `EnsembleDataset`: Wraps any PyTorch-compatible dataset and automatically replicates inputs across all ensemble members (with optional per-member augmentation).
- `EnsembleDataModule`: Lightning-compatible data module for seamless integration with PyTorch Lightning workflows.
### 4. **Inference & Aggregation**
- `EnsemblePredictor`: Provides `.predict()` and `.forward_ensemble()`, supporting:
- *Hard/soft voting* (classification)
- *Mean/variance aggregation* (regression)
- *Monte Carlo dropout & deep ensembles* for uncertainty estimation
- `UncertaintyMetrics`: Computes ECE, NLL, Brier score, and predictive entropy.
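The soft-voting rule (average member probabilities, then take the argmax) can be sketched independently of the package API; `soft_vote` below is an illustration, not part of `neural-ensemble`:

```python
def soft_vote(member_probs):
    """Average class probabilities across ensemble members,
    then return (argmax class, mean probability vector)."""
    n, k = len(member_probs), len(member_probs[0])
    mean = [sum(p[c] for p in member_probs) / n for c in range(k)]
    return max(range(k), key=mean.__getitem__), mean

pred, mean = soft_vote([[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]])
print(pred)  # 1  (class 1 wins on average despite member 0 preferring class 0)
```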
### 5. **Evaluation & Calibration**
- `EnsembleEvaluator`: Runs comprehensive evaluation across members and the ensemble, reporting per-member vs. aggregate metrics.
- `CalibrationWrapper`: Applies temperature scaling or isotonic regression to calibrate ensemble outputs.
### 6. **Serialization & Deployment**
- `Ensemble.save()` / `.load()`: Persists full ensemble state (weights, configs) to disk.
- `Ensemble.to_torchscript()`: Exports the ensemble for production inference (e.g., via TorchServe or ONNX).
## Key Design Principles
- **Minimal dependencies**: Built on top of PyTorch, with optional integrations (Lightning, HuggingFace).
- **No hidden state**: All ensemble behavior is controlled via explicit configuration.
- **Extensible hooks**: Custom aggregation rules, losses, or metrics can be injected via inheritance.
## Example Workflow
```python
ensemble = Ensemble([
ModelConfig(backbone="resnet18", input_shape=(3, 224, 224)),
ModelConfig(backbone="vit_b_16", input_shape=(3, 224, 224)),
])
trainer = EnsembleTrainer(ensemble=ensemble)
trainer.fit(train_loader, val_loader)
preds, uncertainties = EnsemblePredictor(ensemble).predict(test_loader, return_uncertainty=True)
```
# `obialign` Package: Sequence Alignment Utilities
The `obialign` package provides core functions for pairwise biological sequence alignment in Go, designed to work with `obiseq.BioSequence` objects.
- **Core Alignment Construction**: `_BuildAlignment()` and `BuildAlignment()` reconstruct aligned sequences from a precomputed alignment path (e.g., output by dynamic programming). It supports gap characters and reuses buffers for efficiency.
- **Quality-Aware Consensus Building**: `BuildQualityConsensus()` generates a consensus sequence from an alignment and per-base quality scores:
- At mismatches, it retains the higher-quality base.
- When qualities are equal and bases differ, an IUPAC ambiguity code is used (via `_FourBitsBaseCode`/`_Decode`).
- Quality values are combined and adjusted for mismatches using a Phred-like error probability model.
- Optionally records mismatch statistics in sequence attributes.
- **Performance & Memory Efficiency**: Uses preallocated buffers (via `PEAlignArena`) or fallback allocation, with slice recycling to minimize GC pressure.
- **Metadata Handling**: Preserves sequence IDs and definitions in output; supports optional mismatch reporting for downstream analysis.
- **Alignment Path Format**: The path is a sequence of signed integers encoding:
- Negative steps → deletions in seqB (insertion in A),
- Positive steps → insertions in B,
- Consecutive pairs encode match/mismatch runs.
This package is part of the OBITools4 ecosystem, targeting high-throughput amplicon or metagenomic data processing.
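The signed-integer path encoding described above can be illustrated with a small decoder. This is a sketch under the conventions stated here, not the package's actual code; the pairing of a gap step with a following diagonal-run length is assumed from the description.

```go
package main

import "fmt"

// decodePath expands a compact alignment path into CIGAR-like operations.
// Assumed encoding (a sketch, not the exact OBITools4 layout): the path is a
// flat list of signed integers read in pairs; the first element of a pair is
// a gap step (negative: gap in seqB, positive: gap in seqA, zero: none) and
// the second is the length of the following match/mismatch diagonal run.
func decodePath(path []int) []string {
	ops := []string{}
	for i := 0; i+1 < len(path); i += 2 {
		gap, diag := path[i], path[i+1]
		switch {
		case gap < 0:
			ops = append(ops, fmt.Sprintf("%dD", -gap)) // deletion run
		case gap > 0:
			ops = append(ops, fmt.Sprintf("%dI", gap)) // insertion run
		}
		if diag > 0 {
			ops = append(ops, fmt.Sprintf("%dM", diag)) // match/mismatch run
		}
	}
	return ops
}

func main() {
	// No leading gap + 5 diagonal steps, then a 1-base insertion + 3 diagonal steps.
	fmt.Println(decodePath([]int{0, 5, 1, 3})) // prints [5M 1I 3M]
}
```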
@@ -0,0 +1,30 @@
# Semantic Description of `obialign` Backtracking Module
The `_Backtracking` function implements a **traceback algorithm** for sequence alignment, reconstructing the optimal path through an alignment matrix.
## Core Functionality
- **Input**:
- `pathMatrix`: Encodes alignment decisions (match/mismatch/gap) as integers.
- `lseqA`, `lseqB`: Lengths of sequences A and B.
- `path`: Pre-allocated slice to store the traceback path.
- **Output**: A compact representation of alignment steps, alternating between:
- Diagonal moves (`ldiag`): Matches/mismatches (one step in both sequences).
- Horizontal/vertical moves (`lleft` or `lup`): Gaps in sequence B (horizontal) or A (vertical).
## Algorithm Highlights
- **Reverse traversal** from `(lseqA-1, lseqB-1)` back to the origin.
- **Batching logic**: Consecutive gaps in same direction are aggregated (e.g., `lleft += step`) to compress run-length encoding.
- **Path reconstruction**: Steps are pushed *backwards* into the `path` slice using a moving pointer `p`.
- **Memory efficiency**: Uses `slices.Grow()` to preallocate space and logs resizing for debugging.
## Encoded Path Semantics
Each pair in the returned slice encodes:
- `[diag_count, move_type]`, where `move_type` is either a gap length (`lleft > 0`: horizontal, or `lup < 0`: vertical) or zero (end of diagonal run).
## Use Case
Enables efficient reconstruction and serialization of alignment paths—ideal for tools requiring low-level control over dynamic programming backtracking (e.g., pairwise aligners, edit-distance decompositions).
@@ -0,0 +1,26 @@
# Semantic Description of `obialign` Package
This Go package provides core utilities for **DNA sequence alignment scoring**, leveraging probabilistic models and log-space computations to ensure numerical stability.
## Key Functionalities
- **Four-bit nucleotide encoding**: Uses `_FourBitsBaseCode` (implied but not shown) to encode DNA bases as 4-bit values, enabling bitwise operations for fast comparison.
- **Bitwise match ratio (`_MatchRatio`)**: Computes a normalized overlap score between two encoded bases by counting shared bits, adjusting for presence/absence in each operand.
- **Log-space arithmetic helpers**:
- `_Logaddexp`: Stable computation of `log(exp(a) + exp(b))`.
- `_Log1mexp`, `_Logdiffexp`: Accurate log-domain operations for `log(1 - exp(a))` and `log(exp(a) - exp(b))`, critical for probability transformations.
- **Match/mismatch scoring (`_MatchScoreRatio`)**:
- Derives log-probability-based scores for observed matches/mismatches using Phred-quality inputs (`QF`, `QR`).
- Incorporates base composition priors (e.g., uniform 4-mer assumption via `log(3)`, `log(4)`).
- **Precomputed scoring matrices**:
- `_NucPartMatch`: Precomputes match ratios for all base-pair combinations.
- `_NucScorePartMatch{Match,Mismatch}`: Stores integer-scaled alignment scores (×10) for all Phred-quality pairs, enabling fast lookup during dynamic programming.
- **Thread-safe initialization**:
- `_InitDNAScoreMatrix` ensures one-time setup of all matrices using a mutex guard, preventing race conditions.
All computations are designed for high performance and numerical robustness in large-scale sequence alignment tasks.
@@ -0,0 +1,23 @@
# Semantic Description of `obialign` Package
The `obialign` package provides low-level utilities for efficiently encoding, decoding, and manipulating alignment-related metrics—specifically **score**, **path length**, and an **out-flag**—within compact 64-bit integers. This design supports high-performance operations in sequence alignment pipelines (e.g., OBITools4).
- **Core Encoding Strategy**:
A `uint64` encodes three fields: a *score* (upper bits), an inverted path *length*, and a single-bit flag indicating whether the value represents an "out" (i.e., terminal/invalid) state.
- **`encodeValues(score, length int, out bool)`**:
Packs `score`, `-length-1` (to preserve ordering via unsigned comparison), and the `out` flag into one integer. The most significant bit (bit 32) marks out-values.
- **`decodeValues(value uint64)`**:
Reverses encoding: extracts score, reconstructs original length via `((value + 1) ^ mask)`, and checks the out-flag.
- **Utility Bitwise Helpers**:
- `_incpath(value)`: decrements the stored length field (since the length is negated, this corresponds to a longer actual path).
- `_incscore(value)`: increments score by `1 << wsize`.
- `_setout(value)`: clears the out-flag, marking value as *not* terminal.
- **Predefined Constants**:
- `_empty`: neutral state (score=0, length=0).
- `_out`/`_notavail`: sentinel values for invalid or unavailable paths (high length, score=0).
This compact representation enables fast comparisons and updates during dynamic programming or alignment graph traversal—critical for scalability in large-scale metabarcoding analyses.
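A self-contained sketch of this packing scheme follows. The field widths and bit positions here are illustrative (the real layout may differ); what it demonstrates is the key trick of storing `-length-1` so that unsigned comparison prefers a higher score first and a shorter path on ties.

```go
package main

import "fmt"

const wsize = 32 // illustrative field width; the real package layout may differ

// encodeValues packs score, path length, and an out-flag into one uint64:
// score in the high bits, the flag at bit wsize, and the bitwise-NOT of the
// length (i.e., -length-1) in the low 32 bits.
func encodeValues(score, length int, out bool) uint64 {
	v := uint64(score)<<(wsize+1) | uint64(^uint32(length))
	if out {
		v |= 1 << wsize
	}
	return v
}

// decodeValues reverses the packing, undoing the NOT on the length field.
func decodeValues(v uint64) (score, length int, out bool) {
	score = int(v >> (wsize + 1))
	out = v&(1<<wsize) != 0
	length = int(^uint32(v))
	return
}

func main() {
	fmt.Println(decodeValues(encodeValues(42, 7, false))) // prints 42 7 false
}
```

Because the length is stored inverted, two values with equal scores compare so that the shorter path wins, which is exactly what a dynamic-programming cell update wants from a single unsigned comparison.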
@@ -0,0 +1,42 @@
# Semantic Description of `obialign` Package
The `obialign` package provides high-performance functions for computing the **Longest Common Subsequence (LCS)** between two biological sequences, with support for error tolerance and end-gap-free alignment.
## Core Algorithm
- Implements a **Needleman-Wunsch** dynamic programming algorithm optimized for speed and memory efficiency.
- Uses bit-packed encoding (`uint64`) to store score, path length, and gap status in a compact form.
- Leverages **diagonal banding** to restrict computation only within the allowed error margin, reducing time and space complexity.
## Scoring Scheme
- **Match**: +1 point
- **Mismatch or gap (indel)**: 0 points
## Key Functions
1. `FastLCSEGFScoreByte(bA, bB []byte, maxError int, endgapfree bool, buffer *[]uint64) (int, int, int)`
- Computes LCS score and alignment length between raw byte sequences.
- If `endgapfree` is true, ignores leading/trailing gaps (useful for read alignment).
- Returns `(score, length, end_position)`; `end_position` marks where the LCS ends in sequence A.
- Returns `-1, -1, -1` if the actual error count exceeds `maxError`.
2. `FastLCSEGFScore(seqA, seqB *obiseq.BioSequence, maxError int, buffer ...)`
- Wrapper for `FastLCSEGFScoreByte` with end-gap-free mode enabled by default.
- Designed for standard biosequence inputs.
3. `FastLCSScore(seqA, seqB *obiseq.BioSequence, maxError int, buffer ...)`
- Computes standard LCS (including end gaps). Returns `(score, alignment_length)`.
## Features
- **Error-bounded**: Supports `maxError = -1` (unlimited) or a fixed max number of mismatches + gaps.
- **Memory-efficient**: Reuses user-provided or auto-created buffers to avoid allocations during repeated calls.
- **IUPAC-aware**: Uses `obiseq.SameIUPACNuc()` to handle ambiguous nucleotide codes (e.g., `R`, `Y`).
- **Optimized for short reads**: Particularly suited to high-throughput sequencing data alignment tasks (e.g., in OBITools4).
## Use Cases
- Molecular barcode/UMI clustering
- Read-to-reference alignment in amplicon sequencing
- Similarity filtering of biological sequences
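For reference, the scoring recurrence (match +1, mismatch and gap 0) can be written as a naive O(n·m) dynamic program. This is a didactic sketch of what `FastLCSScore` computes, minus the banding, bit-packing, error bounds, and IUPAC handling the package adds on top.

```go
package main

import "fmt"

// lcsScore is a plain two-row DP for the LCS score: matches score +1,
// mismatches and gaps score 0. Reference implementation only.
func lcsScore(a, b []byte) int {
	prev := make([]int, len(b)+1)
	cur := make([]int, len(b)+1)
	for i := 1; i <= len(a); i++ {
		for j := 1; j <= len(b); j++ {
			best := prev[j] // gap in b
			if cur[j-1] > best {
				best = cur[j-1] // gap in a
			}
			if a[i-1] == b[j-1] && prev[j-1]+1 > best {
				best = prev[j-1] + 1 // match extends the LCS
			}
			cur[j] = best
		}
		prev, cur = cur, prev
	}
	return prev[len(b)]
}

func main() {
	fmt.Println(lcsScore([]byte("ACGTACGT"), []byte("ACGACGT"))) // prints 7
}
```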
@@ -0,0 +1,15 @@
# Semantic Description of `obialign` Package
The `obialign` package provides low-level utilities for efficient nucleotide sequence encoding and decoding, specifically designed for bioinformatics alignment tasks.
- **Core functionality**: Encodes IUPAC nucleotide symbols (including ambiguous codes like `R`, `Y`, `N`) into compact 4-bit binary representations.
- **Binary encoding scheme**: Each bit in a byte corresponds to one canonical nucleotide: A (bit 0), C (bit 1), G (bit 2), T (bit 3).
- **Ambiguity support**: Codes like `R` (A/G) set both corresponding bits (`0b0101`). Fully ambiguous `N` sets all four bits (`0b1111`).
- **Gap/missing handling**: Symbols `.` and `-`, as well as non-nucleotide characters, map to `0b0000`.
- **Memory efficiency**: The encoding avoids allocations via optional buffer reuse.
- **Lookup tables**:
- `_FourBitsBaseCode`: Maps ASCII nucleotide characters (lowercased via `nuc & 31`) to their binary code.
- `_FourBitsBaseDecode`: Inverse mapping for human-readable output (not exported, used internally).
- **Integration**: Works with `obiseq.BioSequence`, a generic biological sequence container from the OBITools4 ecosystem.
The `Encode4bits` function enables fast, space-efficient sequence processing—ideal for high-throughput sequencing data where alignment speed and memory usage are critical.
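The bit layout above makes ambiguity-aware base comparison a single AND. The table below is a sketch of the scheme (a few representative codes only, with map lookup standing in for the package's array indexed via `nuc & 31`); `_FourBitsBaseCode` itself is not shown in the source.

```go
package main

import "fmt"

// fourBits maps IUPAC nucleotides to 4-bit codes: A=bit0, C=bit1, G=bit2,
// T=bit3. Ambiguity codes set every bit of their alternatives; gaps and
// unknown symbols map to 0. Partial, illustrative table.
var fourBits = map[byte]uint8{
	'A': 0b0001, 'C': 0b0010, 'G': 0b0100, 'T': 0b1000,
	'R': 0b0101, // A or G
	'Y': 0b1010, // C or T
	'N': 0b1111, // any base
	'.': 0b0000, '-': 0b0000,
}

// sameNuc reports whether two symbols share at least one nucleotide:
// ambiguity-aware comparison reduces to a bitwise AND of their codes.
func sameNuc(x, y byte) bool {
	return fourBits[x]&fourBits[y] != 0
}

func main() {
	fmt.Println(sameNuc('R', 'A'), sameNuc('R', 'C')) // prints true false
}
```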
@@ -0,0 +1,19 @@
## `obialign` Package: Semantic Overview
The `obialign` package provides a lightweight, high-performance utility for **detecting single-edit-distance relationships** between biological sequences (`obiseq.BioSequence`). Its core function, `D1Or0`, determines whether two sequences are either **identical** or differ by exactly **one substitution, insertion, or deletion (indel)**.
- `abs[k]`: A generic helper computing absolute values for integers or floats (via Go generics).
- `D1Or0(...)`: Returns a 4-tuple:
- **`int` (first)**: `0` if identical, `1` if differing by one edit, `-1` otherwise.
- **`int` (second)**: Position of the differing site (`-1` if identical).
- **`byte`, `byte`**: Mismatched characters (or `'-'` for gaps indicating indels).
**Algorithmic strategy:**
1. Early rejection if length difference exceeds 1.
2. Forward scan until first mismatch → identifies left boundary of divergence.
3. Backward scan from ends to find rightmost match boundary.
4. Validates whether the mismatch region allows exactly one edit:
- Single substitution: equal lengths, single divergent position.
- Insertion/deletion: length differs by 1 and only one non-overlapping character remains.
Designed for speed in **OTU/ASV dereplication or error correction** pipelines (e.g., metabarcoding), where rapid filtering of near-identical sequences is critical. Does *not* compute full alignments; optimized for binary decision-making under strict edit constraints.
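The four-step strategy above can be sketched as follows. This is an independent reimplementation of the idea, not the package's `D1Or0` (which additionally reports the mismatch position and the bases involved).

```go
package main

import "fmt"

// editDistanceLe1 mirrors the D1Or0 strategy: reject on length difference
// > 1, scan forward to the first mismatch, scan backward to the last, then
// check whether the remaining window fits one substitution or one indel.
// Returns 0 (identical), 1 (one edit), or -1 (more than one edit).
func editDistanceLe1(a, b string) int {
	if len(a) < len(b) {
		a, b = b, a // ensure a is the longer string
	}
	if len(a)-len(b) > 1 {
		return -1 // early rejection: at least two edits
	}
	i := 0
	for i < len(b) && a[i] == b[i] { // forward scan
		i++
	}
	ja, jb := len(a), len(b)
	for jb > i && ja > i && a[ja-1] == b[jb-1] { // backward scan
		ja--
		jb--
	}
	switch {
	case i == jb && ja == jb: // scans met: identical
		return 0
	case ja-i == 1 && jb-i <= 1: // one substitution or one indel
		return 1
	default:
		return -1
	}
}

func main() {
	fmt.Println(editDistanceLe1("ACGT", "ACGT"), // 0
		editDistanceLe1("ACGT", "AGGT"), // 1: substitution
		editDistanceLe1("ACGT", "ACT"))  // 1: deletion
}
```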
@@ -0,0 +1,29 @@
# `LocatePattern` Functionality Overview
The `obialign.LocatePattern` function implements a **local alignment algorithm** to find the best approximate match of a short DNA pattern (e.g., primer) within a longer biological sequence, using **dynamic programming**.
- **Input**:
- `id`: identifier for logging/error reporting.
- `pattern []byte`: the query sequence (e.g., primer).
- `sequence []byte`: the target read/contig.
- **Constraints**:
- Pattern must be strictly shorter than the sequence (`len(pattern) < len(sequence)`).
- **Scoring Scheme**:
- Match: `+0` (using IUPAC compatibility via `obiseq.SameIUPACNuc`).
- Mismatch/Gap: `-1`.
- **Algorithm Features**:
- End-gap free alignment (no penalty for gaps at sequence ends), enabling flexible primer positioning.
- Uses a flattened buffer (`buffIndex`) for memory-efficient matrix storage (width × height).
- Tracks alignment path via `path` array: diagonal (`0`, match/mismatch), up (`+1`, deletion in pattern/left gap), left (`-1`, insertion/deletion).
- Backtracks from the bottom-right to find optimal local alignment start/end coordinates.
- **Output**:
- `start`: starting index in `sequence`.
- `end+1`: ending index (exclusive) of best match.
- Error count: `-score`, i.e., number of mismatches/gaps in alignment.
- **Use Case**:
Designed for high-throughput amplicon processing (e.g., primer trimming in metabarcoding pipelines like OBITools4).
@@ -0,0 +1,37 @@
# Semantic Description of `obialign` Package
The `obialign` package provides high-performance, memory-efficient tools for **pairwise alignment of paired-end biological sequences**, optimized specifically for Next-Generation Sequencing (NGS) data.
## Core Functionalities
### 1. **Memory Arena Management**
- `PEAlignArena` is a reusable memory buffer to avoid repeated allocations during multiple alignments.
- Preallocates matrices (`scoreMatrix`, `pathMatrix`), alignment buffers, and auxiliary structures based on expected max sequence lengths.
### 2. **Dynamic Programming Alignment Functions**
Implements three specialized global alignment variants using Needleman-Wunsch with affine gap penalties (scaled per mismatch):
- **`PELeftAlign`**: Free gaps at the *start* of `seqB` and end of `seqA`. Ideal for aligning overlapping reads where the first read starts before or within the second.
- **`PERightAlign`**: Free gaps at start of `seqA` and end of `seqB`. Suited when the second read extends beyond the first.
- **`PECenterAlign`**: Free gaps at both ends of *both* sequences; requires `seqA ≥ seqB`. Designed for full overlap scenarios (e.g., merging paired-end reads).
All use column-major matrix storage and efficient index arithmetic via helper functions `_GetMatrix`, `_SetMatrices`, etc.
### 3. **Scoring & Quality Integration**
- Pairwise base/quality scores computed by `_PairingScorePeAlign`, combining:
- Nucleotide compatibility (via precomputed `_NucPartMatch`)
- Phred quality scores (`_NucScorePartMatchMatch`, `_NucScorePartMatchMismatch`)
- A user-defined `scale` factor to modulate mismatch penalties.
### 4. **Fast Heuristic Pre-Alignment**
The main `PEAlign` function integrates a kmer-based fast pre-screening:
- Uses 4-mer indexing (`obikmer.Index4mer`) and shift estimation via `FastShiftFourMer`.
- If overlap is significant (`fastCount + 3 < over`), performs localized DP only on the predicted overlapping region (using `PELeftAlign` or `PERightAlign`) to save time.
- Otherwise, computes full alignment over entire sequences (both left and right variants), selecting the best score.
### 5. **Backtracking & Path Output**
- `_Backtracking` reconstructs the optimal alignment path from `pathMatrix`.
- Paths encoded as alternating `(offset, length)` pairs for aligned segments (diagonal = 0), with gaps encoded as `-1`/`+1`.
### Use Case
Designed for **paired-end read merging**, overlap detection, and consensus building in metagenomic pipelines (e.g., the OBITools4 ecosystem). Arena reuse keeps it efficient and scalable for large batch processing.
@@ -0,0 +1,58 @@
# Semantic Description of `obialign.ReadAlign`
The `ReadAlign` function performs **paired-end read alignment** with quality-aware scoring, optimized for overlapping consensus construction in NGS data processing.
## Core Functionality
- **Input**: Two biological sequences (`seqA`, `seqB`) as `BioSequence` objects, plus alignment parameters:
- `gap`: gap penalty (linear)
- `scale`: scaling factor for quality scores
- `delta`: extension buffer around initial overlap estimate
- `fastScoreRel`: use relative vs absolute k-mer matching score
## Algorithm Overview
1. **Preprocessing & Initialization**
- Ensures DNA scoring matrix is initialized (`_InitDNAScoreMatrix`).
2. **Fast Overlap Estimation via 4-mer Indexing**
- Builds a k-mer index of `seqA` using `obikmer.Index4mer`.
- Computes optimal shift via `_FastShiftFourMer` in both forward and reverse-complement orientations.
- Selects orientation (direct or reversed) yielding highest k-mer match count (`fastCount`) and score (`fastScore`).
3. **Overlap Computation**
- Determines overlap length `over` based on shift:
```text
over = |seqA| - shift          if shift > 0
       |seqB| + shift          if shift < 0
       min(|seqA|, |seqB|)     otherwise
```
4. **Dynamic Programming Alignment**
- If overlap is *not* identical (`fastCount + 3 < over`):
- Extracts subregions with `delta`-buffered boundaries.
- Calls either `_FillMatrixPeLeftAlign` (left-aligned case) or `_FillMatrixPERightAlign`.
- Backtracks via `_Backtracking` to produce alignment path.
- Else (near-perfect overlap):
- Skips DP; computes score directly from quality scores using `_NucScorePartMatchMatch`.
- Returns trivial path `[extra5, partLen]`.
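The overlap rule from step 3 translates directly into code. A minimal sketch (the function name is illustrative, not the package's):

```go
package main

import "fmt"

// overlapLength applies the overlap rule: a positive shift means seqB starts
// inside seqA, a negative shift means the reverse, and a zero shift caps the
// overlap at the shorter read.
func overlapLength(lenA, lenB, shift int) int {
	switch {
	case shift > 0:
		return lenA - shift
	case shift < 0:
		return lenB + shift
	default:
		if lenA < lenB {
			return lenA
		}
		return lenB
	}
}

func main() {
	fmt.Println(overlapLength(150, 150, 30),  // 120
		overlapLength(150, 150, -30), // 120
		overlapLength(150, 100, 0))   // 100
}
```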
## Output
Returns:
| Index | Type | Meaning |
|-------|----------|---------|
| 0 | `int` | Final alignment score (weighted by quality) |
| 1 | `[]int` | Alignment path (list of positions: `[startA, endA, startB, endB]` or similar) |
| 2 | `int` | K-mer match count (`fastCount`) |
| 3 | `int` | Overlap length (`over`) |
| 4 | `float64` | K-mer-based score (`fastScore`) |
| 5 | `bool` | Whether alignment was performed in direct orientation (`true`) or on reverse-complement of `seqB` |
## Key Design Highlights
- **Efficient pre-filtering** using 4-mers avoids full DP for nearly identical reads.
- **Quality-aware scoring**, leveraging Phred scores via `_NucScorePartMatchMatch`.
- Supports **asymmetric overlaps** (left/right alignment) with boundary padding (`delta`).
- Uses preallocated memory arenas to minimize GC pressure in high-throughput pipelines.
@@ -0,0 +1,25 @@
# Apat Package: Pattern Matching for Biological Sequences
The `obiapat` Go package provides high-performance pattern matching over biological sequences using the **Apat algorithm**, a C-based implementation wrapped in Go. It supports fuzzy matching (with mismatches and indels), reverse-complement patterns, memory-safe resource management via finalizers, and efficient filtering of non-overlapping matches.
## Core Types
- `ApatPattern`: Represents a compiled pattern (up to 64 bp), supporting IUPAC ambiguity codes (`W`, `[AT]`), negated bases (`!A`), and fixed positions (`#`).
- `ApatSequence`: Wraps a biological sequence (from `obiseq.BioSequence`) for fast matching, with optional circular topology support and memory recycling.
## Key Functions & Methods
- `MakeApatPattern(pattern string, errormax int, allowsIndel bool)`: Compiles a pattern with max error tolerance and optional indels.
- `ReverseComplement()`: Returns the reverse-complemented pattern (useful for DNA strand symmetry).
- `FindAllIndex(...)`: Returns all matches as `[start, end, errors]`, supporting partial sequence searches.
- `IsMatching(...)`: Boolean check for presence of at least one match in a region.
- `BestMatch(...)`: Finds the *best* (lowest-error) match, with local realignment for indel-containing patterns.
- `FilterBestMatch(...)`: Returns *non-overlapping* matches, prioritizing lower-error occurrences.
- `AllMatches(...)`: Filters and refines all valid matches (including indel-aware alignment).
- `Free()`, `Len()`: Explicit memory cleanup and length queries.
## Implementation Notes
Internally, the package uses `cgo` to interface with C structures (`Pattern`, `Seq`) allocated via custom memory management. Finalizers ensure safe deallocation, while unsafe pointer arithmetic avoids data copying during search (e.g., `unsafe.SliceData`). Logging is integrated via Logrus.
This package enables scalable, low-level pattern mining in NGS data preprocessing pipelines (e.g., primer detection, adapter trimming).
@@ -0,0 +1,32 @@
# Semantic Description of `obiapat` Package Functionality
The `obiapat` package provides utilities for constructing and representing **approximate sequence patterns**—flexible biological or symbolic string templates supporting mismatches, insertions, and deletions.
## Core Functionality
- **`MakeApatPattern(pattern string, errormax int, allowsIndel bool)`**
Parses a pattern specification (e.g., `"A[T]C!GT"`) and returns an internal representation (`*ApatPattern`) suitable for approximate matching.
- `pattern`: A string where:
- Standard characters (e.g., `'A'`, `'C'`) denote exact matches.
- Brackets `[X]` indicate *optional* or *variable positions*, e.g., ambiguity (like IUPAC codes).
- Exclamation `!` marks positions where **errors** (substitutions) are permitted.
- `errormax`: Maximum number of allowed errors (mismatches or indels, depending on flags).
- `allowsIndel`: Boolean flag enabling/disabling insertion/deletion operations.
## Behavior & Semantics
- Returns a compiled pattern object (non-nil) on success; errors may arise from malformed input or invalid parameters.
- Supports three modes:
- **Exact matching** (`errormax = 0`, `allowsIndel = false`).
- **Substitution-only approximation** (`errormax > 0`, `allowsIndel = false`).
- **Full approximate matching with indels** (`errormax > 0`, `allowsIndel = true`).
## Testing Coverage
The provided test suite validates:
- Valid pattern parsing across different configurations.
- Correct handling of `nil` vs. non-nil output pointers.
- Robustness against error conditions (e.g., invalid inputs would trigger expected errors).
In summary, `obiapat` enables efficient definition and handling of *approximate regular expressions* tailored for sequence analysis in bioinformatics or pattern recognition contexts.
@@ -0,0 +1,27 @@
# PCR Simulation Module (`obiapat`)
This Go package implements a **PCR (Polymerase Chain Reaction) simulation algorithm** for biological sequence analysis. It supports flexible primer matching, amplicon extraction with optional flanking extensions, and handles both linear and circular DNA topologies.
## Key Functionalities
- **Primer Matching**: Accepts forward/reverse primers with configurable mismatch tolerance (`OptionForwardPrimer`, `OptionReversePrimer`). Internally builds pattern objects and their reverse complements.
- **Amplicon Extraction**: Identifies valid amplicons bounded by primer pairs, respecting user-defined length constraints (`OptionMinLength`, `OptionMaxLength`).
- **Extension Support**: Optionally adds fixed-length flanking regions (`OptionWithExtension`) — either strict full-extension only or partial trimming allowed.
- **Topology Handling**: Supports linear (`Circular: false`) and circular DNA sequences via `OptionCircular`.
- **Batch & Parallel Processing**: Configurable batch size (`OptionBatchSize`) and parallel workers count (`OptionParallelWorkers`), enabling efficient processing of large datasets.
- **Annotation-Rich Output**: Each amplicon includes detailed annotations (primer sequences, match positions, errors, direction), preserving original sequence metadata.
## Core API
- `PCRSim(sequence, options...)`: Simulates PCR on a single sequence.
- `PCRSlice(sequencesSlice, options...)`: Applies simulation across multiple sequences in a slice.
- `PCRSliceWorker(options...)`: Returns a reusable worker function for parallel execution via `obiseq.MakeISliceWorker`.
## Implementation Details
- Uses pattern-matching (`ApatPattern`) with fuzzy search to locate primers.
- Handles circular topology by wrapping indices around sequence boundaries.
- Reuses internal memory via `MakeApatSequence`/`Free`, supporting efficient GC and large-scale processing.
- Logs critical errors with `logrus`; debug-level details for amplicon generation.
Designed to integrate within the OBITools4 ecosystem, this module enables high-fidelity *in silico* PCR for metabarcoding and NGS data validation workflows.
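The circular-topology handling mentioned above reduces to wrapping indices with a modulus. A hypothetical helper (not the module's API) makes the trick concrete:

```go
package main

import "fmt"

// circularSubseq extracts a window from a circular sequence, wrapping reads
// past the end back to the start — the same index arithmetic needed to find
// amplicons that span the origin of circular DNA. Illustrative only.
func circularSubseq(seq []byte, start, length int) []byte {
	out := make([]byte, length)
	for i := 0; i < length; i++ {
		out[i] = seq[(start+i)%len(seq)]
	}
	return out
}

func main() {
	// A 4-base window starting 2 bases before the origin of "ACGTAC".
	fmt.Println(string(circularSubseq([]byte("ACGTAC"), 4, 4))) // prints ACAC
}
```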
@@ -0,0 +1,23 @@
## Semantic Description of `IsPatternMatchSequence`
The function `IsPatternMatchSequence` defines a **sequence predicate** for pattern-based matching in biological sequences (e.g., DNA/RNA), supporting fuzzy and strand-aware search.
### Core Functionality:
- **Input Parameters**
- `pattern`: A regular expression-like string describing the target pattern.
- `errormax`: Maximum allowed mismatches (substitutions only by default).
- `bothStrand`: If true, also search on the reverse-complement strand.
- `allowIndels`: Enables insertion/deletion errors (beyond mismatches) when set to true.
- **Internal Workflow**
- Parses the pattern into an automaton (`apat`) via `MakeApatPattern`.
- Computes its reverse complement for dual-strand matching.
- Returns a closure (`SequencePredicate`) that tests whether a given `BioSequence` matches the pattern (or its RC), within error tolerance.
- **Matching Logic**
- Converts input sequence to `apat` format.
- Checks match on forward strand first; if failed and `bothStrand=true`, tries reverse complement.
- Uses automaton-based matching (`IsMatching`) for efficient fuzzy search.
### Semantic Use Case:
Enables flexible, error-tolerant detection of sequence motifs (e.g., primers, barcodes) in high-throughput sequencing data—supporting both *in silico* primer design validation and read filtering in metagenomic pipelines.
@@ -0,0 +1,15 @@
# `ISequenceChunk` Function — Semantic Description
The `ISequenceChunk` function provides a unified interface for processing biological sequence data in chunks, supporting two execution modes: **in-memory** and **on-disk**, depending on resource constraints or performance needs.
- It accepts an iterator over biological sequences (`obiiter.IBioSequence`) and a sequence classifier (`obiseq.BioSequenceClassifier`), used to annotate or categorize sequences.
- A boolean flag `onMemory` determines whether processing occurs in RAM (`ISequenceChunkOnMemory`) or on disk (`ISequenceChunkOnDisk`), enabling scalability for large datasets.
- Optional parameters allow fine-tuning:
- `dereplicate`: enables deduplication of identical sequences.
- `na`: specifies how missing or ambiguous values are handled (e.g., `"?"`, `"N"`, etc.).
- `statsOn`: configures what metadata (e.g., description fields) are tracked for statistics.
- `uniqueClassifier`: an optional secondary classifier used to assign unique identifiers or labels.
The function abstracts the underlying implementation, ensuring consistent behavior regardless of storage strategy. It returns an iterator over processed sequences (`obiiter.IBioSequence`) or an error, supporting streaming workflows and compatibility with downstream pipeline stages.
This design promotes flexibility, memory efficiency, and modularity in high-throughput sequence analysis pipelines (e.g., metabarcoding).
@@ -0,0 +1,18 @@
# `obichunk` Package: On-Disk Chunking and Dereplication of Biosequences
The `obichunk` package provides functionality to efficiently process large sets of biological sequences by splitting them into manageable, disk-based chunks. Its core feature is the `ISequenceChunkOnDisk` function, which takes a sequence iterator and distributes sequences into temporary files using a classifier. Each file corresponds to one *batch* (e.g., `chunk_*.fastx`), enabling scalable, parallel-friendly workflows.
Key capabilities include:
- **Temporary Directory Management**: Automatically creates and cleans up a system temp directory (`obiseq_chunks_*`) for intermediate storage.
- **File Discovery**: Recursively finds all `.fastx` files generated during chunking via `find`.
- **Asynchronous Streaming**: Returns an iterator (`obiiter.IBioSequence`) that yields batches asynchronously, decoupling chunk creation from consumption.
- **Optional Dereplication**: When enabled (`dereplicate = true`), sequences are deduplicated *per batch* using a composite key (sequence + classification categories). Merged duplicates retain aggregated statistics.
- **Logging & Monitoring**: Logs total batch count and per-batch processing start events for transparency.
Internally, `ISequenceChunkOnDisk` uses:
- `obiiter.MakeIBioSequence()` to build the output iterator,
- `obiformats.WriterDispatcher` for parallel writing of distributed sequences into chunk files,
- and a second goroutine to read, optionally dereplicate (via `BioSequenceClassifier`), and push batches back into the output iterator.
Designed for memory efficiency, it avoids loading all sequences in RAM by streaming and chunking on-disk—ideal for large-scale NGS data preprocessing.
@@ -0,0 +1,21 @@
# `ISequenceChunkOnMemory` Function — Semantic Description
The function `ISequenceChunkOnMemory`, from the Go package `obichunk`, implements **asynchronous in-memory chunking** of biological sequence data.
It consumes an iterator over `BioSequence` objects and distributes them into **heterogeneous batches** using a provided classifier. The core purpose is to group sequences by classification (e.g., sample, taxon, or feature), store each group in memory as a slice (`BioSequenceSlice`), and emit them sequentially via an output iterator.
Key features:
- **Parallel processing**: Each classification group (referred to as a *flux*) is processed in its own goroutine.
- **Thread-safe aggregation**: A mutex ensures safe concurrent updates to shared `chunks` and `sources` maps.
- **Lazy emission**: Batches are emitted only after all classification groups have been fully processed (`jobDone.Wait()`).
- **Ordered output**: Batches are emitted in increasing `order` index (0, 1, …), preserving determinism despite parallel internal processing.
- **Error handling**: Critical failures (e.g., channel retrieval errors) terminate the program with `log.Fatalf`.
Input:
- An iterator (`obiiter.IBioSequence`) of raw sequences.
- A `*obiseq.BioSequenceClassifier`, used to route each sequence into a classification bucket.
Output:
- A new iterator yielding `BioSequenceBatch` objects, each containing all sequences belonging to one classification group and its source identifier.
Use case: Efficient parallel preprocessing of high-throughput sequencing data into sample- or taxon-specific batches for downstream analysis.
@@ -0,0 +1,26 @@
# Semantic Description of `obichunk` Package
The `obichunk` package provides a flexible and configurable options management system for data processing pipelines, particularly in the context of biological sequence analysis (e.g., metabarcoding). It defines a typed `Options` struct and associated builder-style configuration functions.
## Core Concepts
- **Immutable Configuration Builder**: Options are constructed via `MakeOptions([]WithOption)`, applying a list of functional setters (`WithOption`) to an internal `__options__` struct.
- **Encapsulation**: The concrete options are hidden behind a pointer (`pointer *__options__`) to ensure safe sharing and mutation control.
## Supported Functionalities
- **Categorization**: `OptionSubCategory(keys...)` appends category labels (e.g., sample or marker names) to an internal list; `PopCategories()` retrieves and removes the first category.
- **Missing Value Handling**: `OptionNAValue(na string)` customizes placeholder for missing data (default: `"NA"`).
- **Statistical Tracking**: `OptionStatOn(keys...)` registers statistical descriptions (via `obiseq.StatsOnDescription`) for per-field metrics collection.
- **Batch Processing Control**:
- `OptionBatchCount(number)` sets the number of batches.
- `OptionsBatchSize(size)` defines how many items per batch (default from `obidefault`).
- **Parallelization**: `OptionsParallelWorkers(nworkers)` configures concurrency level (default from environment).
- **Disk vs Memory Sorting**: `OptionSortOnDisk()` enables disk-backed sorting; `OptionSortOnMemory()` disables it (default).
- **Singleton Filtering**: `OptionsNoSingleton()` excludes singleton sequences; `OptionsWithSingleton()` allows them (default).
## Design Highlights
- Functional options pattern for extensibility and readability.
- Default values derived from `obidefault` where applicable (e.g., batch size, workers).
- Designed for integration with `obiseq` and `obidefault`, supporting scalable, reproducible NGS data workflows.
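The functional-options pattern described here can be sketched in a few lines. Names and defaults below are illustrative stand-ins for the package's `MakeOptions`/`WithOption` machinery, not its actual definitions:

```go
package main

import "fmt"

// options mimics the hidden __options__ struct; fields are illustrative.
type options struct {
	naValue   string
	batchSize int
	onDisk    bool
}

// WithOption is a functional setter applied onto the options struct.
type WithOption func(*options)

func OptionNAValue(na string) WithOption {
	return func(o *options) { o.naValue = na }
}

func OptionBatchSize(n int) WithOption {
	return func(o *options) { o.batchSize = n }
}

func OptionSortOnDisk() WithOption {
	return func(o *options) { o.onDisk = true }
}

// MakeOptions applies each setter in order on top of the defaults,
// so later options override earlier ones and untouched fields keep
// their default values.
func MakeOptions(setters []WithOption) *options {
	o := &options{naValue: "NA", batchSize: 1000} // defaults
	for _, set := range setters {
		set(o)
	}
	return o
}

func main() {
	o := MakeOptions([]WithOption{OptionBatchSize(500), OptionSortOnDisk()})
	fmt.Println(o.naValue, o.batchSize, o.onDisk) // prints NA 500 true
}
```

The pattern's appeal, as the section notes, is that new options can be added without breaking existing call sites, and defaults live in exactly one place.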
@@ -0,0 +1,29 @@
# Semantic Description of `obichunk.ISequenceSubChunk`
The function `ISequenceSubChunk` in the `obichunk` package implements **parallel, class-based sorting and batching of biological sequences**, preserving input order within each batch while reordering across batches by classification code.
## Core Functionality
- **Input**:
- An iterator over `BioSequence` batches (`obiiter.IBioSequence`)
- A sequence classifier (`obiseq.BioSequenceClassifier`) assigning each sequence a numeric class code
- A number of worker goroutines (`nworkers`), defaulting to system-configured parallelism
- **Processing**:
- Each worker consumes its own iterator split and classifier clone, enabling concurrent batch processing.
- For each incoming `BioSequenceBatch`:
- If the batch has >1 sequence: sequences are extracted, classified into `code`, and sorted *in-place* by class code.
- Consecutive sequences with the same `code` are grouped into new batches; a new batch is emitted upon code change.
- If the batch has ≤1 sequence, it's passed through unchanged (but reordered with a new order ID).
- **Ordering Mechanism**:
- Uses `atomic.AddInt32` to assign strictly increasing order IDs (`nextOrder`) across workers, preserving deterministic inter-batch ordering.
- Sorting within batches is performed via a custom `sort.Interface` implementation using closures for flexible comparison logic (here, by ascending class code).
- **Output**:
- Returns a new iterator (`obiiter.IBioSequence`) emitting batches grouped by classification code, with globally ordered batch IDs.
- Workers are coordinated via `newIter.Done()`/`Wait()`/`Close()`, ensuring clean termination.
## Semantic Purpose
Enables efficient, parallel **grouping of sequences by taxonomic or functional class** (e.g., OTU assignment), optimizing downstream processing that requires sorted/class-ordered input — e.g., consensus building, alignment, or read merging per group.
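The sort-and-split step can be sketched as follows (simplified types; `nextOrder` is assumed shared across workers, as in the description above):

```go
package main

import (
	"fmt"
	"sort"
	"sync/atomic"
)

type seq struct {
	id   string
	code int // class code assigned by a classifier
}

var nextOrder int32 = -1

// splitByCode sorts a batch in place by class code, then emits one new
// batch per run of identical codes, each tagged with a strictly
// increasing global order ID (safe across concurrent workers).
func splitByCode(batch []seq) map[int32][]seq {
	sort.Slice(batch, func(i, j int) bool { return batch[i].code < batch[j].code })
	out := map[int32][]seq{}
	start := 0
	for i := 1; i <= len(batch); i++ {
		if i == len(batch) || batch[i].code != batch[start].code {
			order := atomic.AddInt32(&nextOrder, 1)
			out[order] = batch[start:i]
			start = i
		}
	}
	return out
}

func main() {
	batches := splitByCode([]seq{{"a", 2}, {"b", 1}, {"c", 2}})
	fmt.Println(len(batches)) // two groups: code 1 and code 2
}
```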
# Semantic Description of `IUniqueSequence` Functionality
The `IUniqueSequence` function performs **dereplication** of biological sequence data — i.e., grouping identical or near-identical sequences while preserving metadata and counts. It operates on an `obiiter.IBioSequenceBatch` iterator.
## Core Workflow
1. **Input Processing**
Accepts an input sequence iterator and optional configuration via `WithOption`.
2. **Parallelization Strategy**
Supports configurable parallel workers (`nworkers`). When `SortOnDisk()` is enabled, it falls back to single-threaded processing for disk-based sorting.
3. **Data Splitting Phase**
- Uses `HashClassifier` to partition input into buckets (controlled by `BatchCount`).
- Ensures deterministic chunking for reproducibility.
4. **Storage Choice**
- *In-memory*: via `ISequenceChunkOnMemory`.
- *Disk-based*: via `ISequenceSubChunk` + external sorting (requires single worker).
5. **Uniqueness Classification**
- Builds a composite classifier combining:
- Sequence identity (`SequenceClassifier`)
- Optional annotation categories (e.g., sample, primer), with NA handling.
- If no annotations are specified, only raw sequence identity is used.
6. **Singleton Filtering**
Optionally excludes singleton reads (count = 1) via `NoSingleton()` option.
7. **Parallel Dereplication**
- Spawns worker goroutines to process chunks.
- Each worker applies `ISequenceSubChunk` + deduplication logic per classifier group.
8. **Output Merging**
- Aggregates results using `IMergeSequenceBatch`, preserving:
- Sequence counts
- Statistics (if enabled)
- NA handling and ordering
## Key Features
- **Scalable**: Supports both memory-efficient (disk) and high-speed (RAM) modes.
- **Configurable**: Via functional options (`Options`).
- **Thread-safe**: Uses `sync.Mutex` for deterministic ordering.
- **Metadata-aware**: Incorporates annotation-based grouping (e.g., sample, primer).
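The core dereplication and singleton-filtering logic can be sketched as follows (hypothetical `read` type; the real code works on `BioSequence` batches grouped by composite classifiers):

```go
package main

import "fmt"

type read struct {
	seq   string
	count int
}

// dereplicate merges identical sequences, summing their counts.
// When noSingleton is true, groups with a total count of 1 are dropped.
func dereplicate(reads []read, noSingleton bool) map[string]int {
	counts := map[string]int{}
	for _, r := range reads {
		counts[r.seq] += r.count
	}
	if noSingleton {
		for s, c := range counts {
			if c == 1 {
				delete(counts, s)
			}
		}
	}
	return counts
}

func main() {
	uniq := dereplicate([]read{{"acgt", 1}, {"acgt", 2}, {"ttaa", 1}}, true)
	fmt.Println(uniq) // "ttaa" is a singleton and is filtered out
}
```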
# Aho-Corasick-Based Sequence Analysis in `obicorazick`
This Go package provides efficient pattern-matching utilities for biological sequence data, leveraging the Aho-Corasick algorithm.
## Core Components
- **`AhoCorazickWorker(slot string, patterns []string) obiseq.SeqWorker`**
Builds *multiple* Aho-Corasick matchers in parallel (batched to manage memory), then returns a `SeqWorker` function.
- Scans each sequence *forward* and its reverse complement.
- Counts total matches (`slot`), forward-only (`_Fwd`) and reverse-complement-specific (`_Rev`) matches.
- Attaches match counts as sequence attributes.
- **`AhoCorazickPredicate(minMatches int, patterns []string) obiseq.SequencePredicate`**
Compiles a *single* matcher and returns a predicate function.
- Returns `true` if the number of matches ≥ `minMatches`.
- Useful for filtering sequences (e.g., taxonomic assignment or contamination detection).
## Technical Highlights
- **Batched compilation**: Large pattern sets are split into chunks (default `10⁷` patterns/batch) to avoid memory overload.
- **Parallelization**: Matcher construction uses goroutines, scaled by `obidefault.ParallelWorkers()`.
- **Progress tracking**: Optional CLI progress bar via `progressbar/v3`, enabled globally.
- **Logging & debugging**: Uses Logrus for info/debug messages; logs match counts per sequence.
## Use Cases
- Rapid screening of sequences against large reference databases (e.g., primers, barcodes, contaminants).
- Filtering or annotating sequences based on pattern presence/abundance.
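The predicate shape can be illustrated with a naive substring counter standing in for the compiled Aho-Corasick automaton (illustrative only; the real matcher compiles all patterns into one automaton and also scans the reverse complement):

```go
package main

import (
	"fmt"
	"strings"
)

type SequencePredicate func(seq string) bool

// matchPredicate returns true when the total number of pattern hits
// reaches minMatches. strings.Count stands in here for the automaton,
// which scans the sequence once for all patterns simultaneously.
func matchPredicate(minMatches int, patterns []string) SequencePredicate {
	return func(seq string) bool {
		total := 0
		for _, p := range patterns {
			total += strings.Count(seq, p)
			if total >= minMatches {
				return true
			}
		}
		return false
	}
}

func main() {
	keep := matchPredicate(2, []string{"acg", "tta"})
	fmt.Println(keep("acgttaacg")) // three hits in total
}
```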
# ObiDefault Package: Batch Configuration Module
This Go module provides centralized configuration for sequence batching in Obitools, supporting both **count-based** and **memory-aware** batch processing.
## Core Features
- `_BatchSize` / `SetBatchSize()`
Defines and configures the *minimum* number of sequences per batch (default: `1`).
Used internally as `minSeqs` in `RebatchBySize`.
- `_BatchSizeMax()` / `SetBatchSizeMax()`
Sets the *maximum* sequences per batch (default: `2000`). Batches are flushed upon reaching this limit, regardless of memory.
- **CLI & Environment Integration**
Batch size is determined by `--batch-size` CLI flag and/or the `OBIBATCHSIZE` environment variable (via parsing logic not shown here but implied by comments).
- `_BatchMem()` / `SetBatchMem(n int)`
Configures the *maximum memory per batch* (default: `128 MB`). A value of `0` disables memory-based batching, falling back to pure count-based logic.
- `_BatchMemStr()`
Stores the *raw CLI string* passed to `--batch-mem` (e.g., `"256M"`, `"1G"`), enabling human-readable input parsing elsewhere.
## Utility Functions
- `BatchSizePtr()`, `BatchMemPtr()`
Expose pointers to internal variables for direct modification or inter-process sharing.
- `BatchSizeMaxPtr()`, `BatchMemStrPtr()`
Provide read/write access to max-size and raw memory string values.
## Design Intent
- Separates **configuration** (defaults, CLI/env parsing) from **processing logic**, enabling modular and testable batch handling.
- Supports both scalable, large-scale processing (via count limits) and memory-constrained environments (via soft RAM caps).
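The interplay of the hard count limit and the soft memory cap can be sketched as follows (hypothetical helper; names are illustrative):

```go
package main

import "fmt"

// shouldFlush reports whether a batch must be emitted: either the hard
// count limit is reached, or (when memory-based batching is enabled,
// i.e. batchMem > 0) the accumulated size exceeds the soft RAM cap.
// batchMem == 0 falls back to pure count-based logic.
func shouldFlush(nSeqs, maxSeqs int, usedBytes, batchMem int) bool {
	if nSeqs >= maxSeqs {
		return true
	}
	return batchMem > 0 && usedBytes >= batchMem
}

func main() {
	// 500 sequences but 200 MB accumulated against a 128 MB cap.
	fmt.Println(shouldFlush(500, 2000, 200<<20, 128<<20))
}
```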
# Output Compression Control Module
This Go package (`obidefault`) provides a simple, global configuration mechanism for toggling output compression behavior across an application.
## Core Features
- **Global Compression Flag**: A package-level boolean variable `__compress__` (default: `false`) controls whether output should be compressed.
- **Read Access**:
- `CompressOutput()` returns the current compression setting as a boolean.
- **Write Access**:
- `SetCompressOutput(b bool)` updates the compression flag to a new value.
- **Pointer Access**:
- `CompressOutputPtr()` returns a pointer to the internal flag, enabling indirect modification (e.g., for UI bindings or reflection-based updates).
## Design Intent
- Minimal, side-effect-free API.
- Thread-safety *not* guaranteed — intended for use in single-threaded initialization or controlled environments.
- Encapsulation via unexported variable `__compress__`, enforced through accessor functions.
## Typical Usage
```go
// Enable compression globally:
obidefault.SetCompressOutput(true)
if obidefault.CompressOutput() {
// Apply compression logic (e.g., gzip, brotli)
}
```
## Notes
- The double underscore prefix (`__compress__`) signals internal/private status (convention, not enforced).
- Designed for runtime configurability without recompilation.
# `obidefault` Package — Semantic Overview
This minimal Go package provides a centralized, mutable global flag for controlling warning verbosity across an application.
## Core Functionality
- **`__silent_warning__`**:
A package-level boolean variable (unexported) that determines whether warnings should be suppressed.
- **`SilentWarning() bool`**:
A read-only accessor returning the current state of `__silent_warning__`. Enables safe, non-mutating checks elsewhere in the codebase.
- **`SilentWarningPtr() *bool`**:
Returns a pointer to `__silent_warning__`, allowing external code (e.g., CLI parsers, config loaders) to directly mutate the flag — e.g., `*SilentWarningPtr() = true`.
## Design Intent
- **Simplicity & Centralization**:
Avoids scattering warning-control logic; provides a single source of truth.
- **Flexibility**:
Supports both *read-only* inspection (via `SilentWarning()`) and *global mutation* (via pointer), useful for early initialization phases.
- **Explicit Semantics**:
When `SilentWarning()` returns `true`, all warning-generating code *should* suppress output (implementation responsibility lies outside this package).
## Usage Example
```go
// Suppress warnings globally:
*obidefault.SilentWarningPtr() = true
if !obidefault.SilentWarning() {
log.Println("⚠️ Warning: something happened")
}
```
> **Note**: The double underscore prefix on `__silent_warning__` signals internal/private status, discouraging direct access.
# Progress Bar Control Module (`obidefault`)
This Go package provides a simple, global mechanism to enable or disable progress bar display across an application.
## Core Functionality
- **`ProgressBar()`**: Returns `true` if progress bars are *enabled* (i.e., when `__no_progress_bar__` is `false`).
- **`NoProgressBar()`**: Returns the current state of `__no_progress_bar__`, i.e., whether progress bars are *disabled*.
- **`SetNoProgressBar(b bool)`**: Sets the global flag `__no_progress_bar__`. Passing `true` disables progress bars; passing `false` enables them.
- **`NoProgressBarPtr()`**: Returns a pointer to the internal `__no_progress_bar__` variable, allowing direct read/write access (e.g., for reflection or UI binding).
## Design Intent
- Centralizes progress bar visibility control in one place.
- Supports both boolean query/set and pointer-based manipulation for flexibility (e.g., CLI flags, config binding).
- Uses a *negative* flag name (`__no_progress_bar__`) internally to default progress bars **on** (i.e., `false` → enabled).
## Usage Example
```go
// Disable progress bars globally:
obidefault.SetNoProgressBar(true)
// Check status:
if !obidefault.ProgressBar() {
log.Println("Progress bars are disabled.")
}
```
## Notes
- Thread-safety is *not* guaranteed; concurrent access should be externally synchronized.
- The double underscore prefix (`__no_progress_bar__`) signals internal/private usage per Go convention (though not enforced).
# Quality Shift and Read/Write Control Module
This Go package (`obidefault`) provides configurable controls over quality score handling in sequence data processing (e.g., FASTQ files). It defines three global variables and corresponding accessor/mutator functions:
- `_Quality_Shift_Input`: Input quality score offset (default: `33`, i.e., Phred+33/Sanger format).
- `_Quality_Shift_Output`: Output quality score offset (default: `33`), allowing format conversion.
- `_Read_Qualities`: Boolean flag indicating whether quality scores should be parsed/processed (`true` by default).
## Public API
| Function | Purpose |
|---------|--------|
| `SetReadQualitiesShift(shift byte)` | Sets the quality score offset for *input* data (e.g., when reading FASTQ). |
| `ReadQualitiesShift() byte` | Returns the current input quality offset. |
| `SetWriteQualitiesShift(shift byte)` | Sets the quality score offset for *output* data (e.g., when writing FASTQ). |
| `WriteQualitiesShift() byte` | Returns the current output quality offset. |
| `SetReadQualities(read bool)` | Enables/disables reading/processing of quality scores. |
| `ReadQualities() bool` | Returns whether qualities are currently being read/used. |
## Semantic Use Cases
- **Format Interoperability**: Allows seamless conversion between Phred+33 (Sanger), Phred+64, or other quality encodings.
- **Performance Optimization**: Disabling `ReadQualities` skips parsing of quality strings, useful when only sequences are needed.
- **Centralized Configuration**: Global state enables consistent behavior across modules without passing parameters.
All functions are thread-unsafe by design—intended for initialization before concurrent processing begins.
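The format conversion these offsets enable can be sketched as follows (hypothetical helper name; the real code applies the shifts when reading and writing FASTQ):

```go
package main

import "fmt"

// reencodeQualities converts a quality string from one ASCII offset to
// another, e.g. Phred+64 (old Illumina) to Phred+33 (Sanger).
func reencodeQualities(qual string, inShift, outShift byte) string {
	out := make([]byte, len(qual))
	for i := 0; i < len(qual); i++ {
		score := qual[i] - inShift // recover the numeric Phred score
		out[i] = score + outShift  // re-encode with the output offset
	}
	return string(out)
}

func main() {
	// 'h' in Phred+64 is score 40, which is 'I' in Phred+33.
	fmt.Println(reencodeQualities("hhhh", 64, 33))
}
```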
# `obidefault` Package: Configuration State Management
This Go package provides a centralized, thread-safe(ish) configuration layer for taxonomy-related settings in the OBIDMS (Open Biological and Biomedical Data Management System) framework. It exposes simple getters, setters, and pointer accessors for five core boolean/string flags that control how taxonomic identifiers (taxids) are handled during data processing.
## Core Configuration Flags
- `__taxonomy__`: Stores the currently selected taxonomy (e.g., `"NCBI"`, `"UNIPROT"`).
- `__alternative_name__`: Enables/disables use of alternative taxonomic names (e.g., synonyms).
- `__fail_on_taxonomy__`: If true, processing halts on taxonomy mismatches/errors.
- `__update_taxid__`: If true, taxids are auto-updated to current NCBI/DB versions.
- `__raw_taxid__`: If true, raw (unprocessed) taxids are preserved instead of normalized.
## Public API
- **Getters**: `UseRawTaxids()`, `SelectedTaxonomy()`, `HasSelectedTaxonomy()`, etc., return current values.
- **Pointer Accessors**: e.g., `SelectedTaxonomyPtr()` returns a pointer for direct mutation (advanced use).
- **Setters**: `SetSelectedTaxonomy()`, `SetAlternativeNamesSelected()`, etc., update state.
## Use Case
Typically used at application startup to configure global behavior (e.g., `SetSelectedTaxonomy("NCBI")`, `SetUpdateTaxid(true)`), then referenced by downstream modules during data import, validation, or mapping. Minimalist and explicit—no external dependencies.
# Obidefault: Parallelism Configuration Module
This Go package (`obidefault`) provides a centralized, configurable interface for managing parallel execution parameters—particularly useful in I/O- and CPU-bound workloads.
## Core Concepts
- **CPU-aware defaults**: Automatically detects available cores via `runtime.NumCPU()`.
- **Configurable workers per core**:
- General: `_WorkerPerCore` (default `1.0`)
- Read-specific: `_ReadWorkerPerCore` (`0.25`, i.e., ~1 reader per 4 cores)
- Write-specific: `_WriteWorkerPerCore` (`0.25`)
- **Strict overrides**: Allow hardcoding worker counts via `SetStrictReadWorker()`/`Write...`, bypassing per-core scaling.
## Public API
| Function | Purpose |
|---------|--------|
| `ParallelWorkers()` | Total workers = `MaxCPU() × WorkerPerCore` |
| `Read/WriteParallelWorkers()` | Resolves to strict count if set, else per-core calculation (min 1) |
| `ParallelFilesRead()` | Files read in parallel: defaults to `ReadParallelWorkers()`, overridable |
| Getters (`MaxCPU`, `WorkerPerCore`, etc.) | Expose current settings safely |
| Setters (`Set*`) | Dynamically adjust behavior at runtime |
## Configuration Sources
- **Command-line flags**: e.g., `--max-cpu` or `-m`
- **Environment variable**: `OBIMAXCPU`
## Design Highlights
✅ Decouples resource discovery from policy
✅ Supports both *proportional* (per-core) and *absolute* (strict) worker definitions
✅ Ensures non-zero defaults for critical paths (`ReadParallelWorkers` ≥ 1)
⚠️ **Note**: `WriteParallelWorkers()` contains a likely bug—returns `_StrictReadWorker` in the else branch instead of `StrictWriteWorker`.
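The resolution policy can be sketched as follows (hypothetical helper; the real package reads its inputs from the getters described above):

```go
package main

import (
	"fmt"
	"runtime"
)

// resolveWorkers applies the policy described above: a strict override
// wins when set (> 0); otherwise the count scales with the CPU count,
// never dropping below one worker.
func resolveWorkers(maxCPU int, perCore float64, strict int) int {
	if strict > 0 {
		return strict
	}
	n := int(float64(maxCPU) * perCore)
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	// Read workers at 0.25 per core on this machine.
	fmt.Println(resolveWorkers(runtime.NumCPU(), 0.25, 0))
}
```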
# `obidist` Package: Efficient Symmetric Distance/Similarity Matrix Management
The `*DistMatrix` type provides a memory-efficient, symmetric matrix implementation for distance or similarity data.
- **Storage Strategy**: Only the upper triangle (i < j) is stored, reducing storage from *n²* to *n(n−1)/2* entries.
- **Diagonal Handling**: Diagonal entries are fixed (0.0 for distances, 1.0 for similarities); assignments to diagonal indices are silently ignored.
- **Symmetry Guarantee**: `Get(i, j)` and `Set(i, j, v)` automatically handle both (i,j) and (j,i), ensuring consistency.
## Constructors
| Function | Description |
|---------|-------------|
| `NewDistMatrix(n)` / `WithLabels(labels)` | Creates *n×n* distance matrix (diag = 0). |
| `NewSimilarityMatrix(n)` / `WithLabels(labels)` | Creates *n×n* similarity matrix (diag = 1). |
## Core Operations
- `Get(i, j)` / `Set(i, j, v)`: Access/update symmetric entries.
- `Size() int`, `GetLabel(i)` / `SetLabel(i, label)`: Query/mutate element labels.
- `Labels() []string`, `GetRow(i)` / `GetColumn(j)`: Retrieve full rows/columns (as copies).
## Analysis Helpers
- `MinDistance()` / `MaxDistance()`: Return `(value, i, j)` for the extremal off-diagonal entry.
- `Copy() *DistMatrix`: Deep copy for immutability-safe operations.
- `ToFullMatrix() [][]float64`: Converts to dense representation (use sparingly).
Designed for clustering, phylogenetics, or any domain requiring fast symmetric matrix access with minimal footprint.
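The triangular storage scheme can be sketched as follows (a simplified `DistMatrix`; the real type also carries labels and bounds checking):

```go
package main

import "fmt"

// triIndex maps an off-diagonal pair (i, j) of an n×n symmetric matrix
// onto the flat upper-triangle array of length n(n-1)/2.
func triIndex(i, j, n int) int {
	if i > j {
		i, j = j, i // symmetry: (j, i) shares the slot of (i, j)
	}
	return i*n - i*(i+1)/2 + (j - i - 1)
}

type DistMatrix struct {
	n    int
	data []float64
}

func NewDistMatrix(n int) *DistMatrix {
	return &DistMatrix{n: n, data: make([]float64, n*(n-1)/2)}
}

func (m *DistMatrix) Set(i, j int, v float64) {
	if i == j {
		return // diagonal is fixed (0.0 for distances); silently ignored
	}
	m.data[triIndex(i, j, m.n)] = v
}

func (m *DistMatrix) Get(i, j int) float64 {
	if i == j {
		return 0.0
	}
	return m.data[triIndex(i, j, m.n)]
}

func main() {
	m := NewDistMatrix(4)
	m.Set(2, 0, 0.7)
	fmt.Println(m.Get(0, 2)) // symmetric access returns the same value
}
```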
# `obidist` Package: Semantic Feature Overview
The `obidist` Go package provides two core data structures for managing **distance** and **similarity matrices**, with built-in guarantees suitable for scientific computing (e.g., clustering, phylogenetics). Key features include:
- **`DistMatrix`**: A symmetric `n×n` matrix representing pairwise distances, where:
- Diagonal entries are *always* `0.0` (self-distance).
- Off-diagonals obey symmetry: `dist(i, j) == dist(j, i)`.
- Automatic enforcement via dedicated `Set()`/`Get()` methods.
- **`SimilarityMatrix`**: A symmetric matrix where:
- Diagonal entries are *always* `1.0`.
- Off-diagonals represent similarity scores (e.g., between `0` and `1`, though not enforced).
- Symmetry is similarly guaranteed.
Both matrix types support:
- **Optional labels**: Associate human-readable identifiers (e.g., sample names) with rows/columns.
- **Safe bounds checking**: Panics on out-of-range access (tested via `defer/recover`).
- **Deep copy support**: Ensures isolation between original and copied instances.
- **Utility methods**:
- `MinDistance()` / `MaxDistance()`: Return extremal values and their indices.
- `GetRow(i)`: Retrieve a full row as a slice (symmetric copy).
- `ToFullMatrix()`: Export the matrix as an immutable 2D slice.
Edge cases are rigorously handled:
- Empty (`n=0`) and singleton (`n=1`) matrices return `(0.0, -1, -1)` for min/max.
- Label mutations do not affect internal state via defensive copying.
All behaviors are validated through comprehensive unit tests, emphasizing correctness and robustness.
# Semantic Description of `ReadSequencesBatchFromFiles`
This function implements **concurrent, batched streaming** of biological sequences from multiple input files.
## Core Functionality
- **Input**: A slice of file paths (`[]string`), an optional batch reader interface, and a concurrency level.
- **Default behavior**: Uses `ReadSequencesFromFile` if no custom reader is provided.
## Concurrency Model
- Launches `concurrent_readers` goroutines to process files in parallel.
- Files are distributed via a shared channel (`filenameChan`) — ensuring fair load balancing.
## Streaming Interface
- Returns an `obiiter.IBioSequence`, a streaming iterator over batches of biological sequences.
- Internally uses an atomic counter (`nextCounter`) to assign unique, ordered IDs to sequence batches (via `Reorder`), preserving global order despite parallelism.
## Error Handling & Logging
- Panics on file-open failure (via `log.Panicf`).
- Logs start/end of reading per file using structured logging (`log.Printf`, `log.Println`).
## Resource Management
- Uses a barrier pattern: each reader goroutine calls `batchiter.Done()` upon completion.
- A finalizer goroutine waits for all readers (`WaitAndClose`) and logs termination.
## Design Intent
- Enables scalable, memory-efficient ingestion of large NGS datasets.
- Decouples *reading logic* (via `IBatchReader`) from orchestration — supporting pluggable formats.
- Prioritizes throughput and deterministic ordering over strict FIFO per-file semantics.
## Key Abstractions
| Type/Interface | Role |
|----------------|------|
| `IBatchReader` | Reader factory: `(filename, options...) → SequenceIterator` |
| `obiiter.IBioSequence` | Thread-safe batch iterator (push model) |
| `AtomicCounter` | Ensures globally unique, sequential batch IDs across goroutines |
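The fan-out pattern above can be sketched as follows (strings standing in for files and `BioSequenceBatch` values):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// readAll distributes filenames to nReaders goroutines over a shared
// channel; each emitted batch gets a unique, ordered ID from an atomic
// counter, keeping batch numbering deterministic despite parallelism.
func readAll(files []string, nReaders int) []string {
	filenameChan := make(chan string)
	results := make(chan string, len(files))
	var nextID int32 = -1
	var wg sync.WaitGroup

	for w := 0; w < nReaders; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done() // barrier: each reader signals completion
			for name := range filenameChan {
				id := atomic.AddInt32(&nextID, 1)
				results <- fmt.Sprintf("batch %d from %s", id, name)
			}
		}()
	}
	for _, f := range files {
		filenameChan <- f
	}
	close(filenameChan)
	wg.Wait() // finalizer step: wait for all readers, then close
	close(results)

	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	fmt.Println(len(readAll([]string{"a.fastq", "b.fastq"}, 2)))
}
```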
# `obiformats` Package — Semantic Overview
The `obiformats` package provides a standardized interface for **format-agnostic batch reading of biological sequence data** within the OBITools4 ecosystem.
## Core Abstraction
- **`IBatchReader`** is a function type defining the contract for opening and iterating over sequence files:
```go
func(string, ...WithOption) (obiiter.IBioSequence, error)
```
- It accepts:
- A file path (`string`)
- Optional configuration via variadic `WithOption` arguments (e.g., filtering, parsing rules)
- Returns:
- An iterator over biological sequences (`obiiter.IBioSequence`)
- Or an error if the file cannot be opened/parsed
## Semantic Intent
- **Decouples format handling from iteration logic**: Enables uniform consumption of FASTA, FASTQ, SAM/BAM, etc., via a single entry point.
- **Supports extensibility**: New format readers can be registered as `IBatchReader` implementations without altering client code.
- **Enables lazy, streaming access**: Sequences are yielded on-demand via the iterator—memory-efficient for large datasets.
## Typical Usage Pattern
1. Select or compose an `IBatchReader` implementation (e.g., for FASTQ).
2. Call it with a file path and optional options.
3. Iterate over the returned `IBioSequence` to process sequences one-by-one.
## Design Principles
- **Functional, minimal API**: Single responsibility—reading and iteration.
- **Option-based configurability**: Avoids combinatorial function overloading via `With...` patterns.
- **Integration-ready**: Built to work seamlessly with the broader OBITools4 iterator and sequence abstractions.
> *Note: Actual format-specific readers (e.g., `NewFASTQBatchReader`) are expected to conform to this interface but reside outside the core type definition.*
# CSV Import Module for Biological Sequences (`obiformats`)
This Go package provides functionality to parse biological sequence data from CSV files into structured objects compatible with the OBItools4 framework.
## Core Features
- **CSV Parsing**: Reads CSV data via `io.Reader`, supporting comments (`#`), flexible field counts, and leading-space trimming.
- **Sequence Extraction**: Identifies columns named `sequence`, `id`, or `qualities` by header and maps them to corresponding biological sequence fields.
- **Quality Score Adjustment**: Applies a configurable Phred score shift (default: `33`) to quality strings.
- **Metadata Handling**:
- Special handling for taxonomic IDs (`taxid`, `*_taxid`).
- Generic attributes parsed as JSON when possible; fallback to raw string otherwise.
- **Batched Output**: Streams sequences in configurable batches (`batchSize`) via an iterator interface (`obiiter.IBioSequence`).
- **Multiple Entry Points**:
- `ReadCSV`: From any `io.Reader`.
- `ReadCSVFromFile`: Loads from a file (with source naming derived from filename).
- `ReadCSVFromStdin`: Reads from standard input.
- **Error & Edge Handling**:
- Gracefully handles empty files/streams via `ReadEmptyFile`.
- Uses structured logging (Logrus) for fatal and informational messages.
## Integration
Designed to integrate with OBItools4's core types:
- `obiseq.BioSequence`: Holds sequence, ID, qualities, taxid, and arbitrary attributes.
- `obiiter.IBioSequence`: Streaming interface for batched sequence iteration.
## Use Case
Efficient, flexible ingestion of tabular biological data (e.g., from alignment outputs or preprocessed FASTQ/FASTA conversions) into downstream analysis pipelines.
# CSVSequenceRecord Function Description
The `CSVSequenceRecord` function converts a biological sequence object (`*obiseq.BioSequence`) into a slice of strings suitable for CSV output. It dynamically constructs the record based on user-defined options (`opt Options`), enabling flexible column selection.
## Core Features
- **Sequence ID**: Includes the sequence identifier if `opt.CSVId()` is enabled.
- **Abundance Count**: Appends the sequence count (e.g., read depth) if `opt.CSVCount()` is true.
- **Taxonomic Information**: Adds both NCBI taxid and scientific name (retrieved from attributes or fallback via `opt.CSVNAValue()`).
- **Definition Line**: Includes the sequence definition/description if requested via `opt.CSVDefinition()`.
- **Custom Attributes**: Iterates over keys from `opt.CSVKeys()` and appends corresponding attribute values (or NA if missing).
- **Nucleotide Sequence**: Appends the raw sequence string when `opt.CSVSequence()` is enabled.
- **Quality Scores**: Converts Phred-quality scores to ASCII characters (using a configurable shift) if available; otherwise inserts NA.
## Design Highlights
- Uses `obiutils.InterfaceToString()` for safe type conversion of arbitrary attribute values.
- Handles missing data consistently via `opt.CSVNAValue()`.
- Supports both standard and user-defined metadata fields.
- Adapts quality encoding to common formats (e.g., Sanger/Illumina) via `obidefault.WriteQualitiesShift()`.
This function enables interoperable, configurable export of sequence data to tabular formats.
# `CSVTaxaIterator` Function — Semantic Description
The function `CSVTaxaIterator`, part of the `obiformats` package, converts a taxonomic iterator (`*obitax.ITaxon`) into an **incremental CSV record generator** via `obiitercsv.ICSVRecord`. It enables streaming, batched export of taxonomic data to CSV format with configurable fields.
### Core Functionality:
- **Input**: A pointer-based taxonomic iterator (`*obitax.ITaxon`) and optional configuration via `WithOption`.
- **Output**: An asynchronous CSV record iterator (`*obiitercsv.ICSVRecord`) that yields batches of records.
### Configurable Output Fields (via options):
- `query`: Taxon-associated query identifier, if enabled (`WithPattern`).
- `taxid`: Either raw node ID (e.g., string pointer) or formatted taxon path (`WithRawTaxid` toggle).
- `parent`: Parent taxonomic ID or string representation, if enabled (`WithParent`).
- `taxonomic_rank`: Taxon rank (e.g., "species", "genus").
- `scientific_name`: Full scientific name of the taxon.
- Custom metadata fields: Specified via `WithMetadata`, extracted from taxon metadata store.
- `path`: Full lineage path (e.g., "k__Bacteria; p__; c__..."), if enabled (`WithPath`).
### Implementation Highlights:
- Uses **goroutines** for non-blocking push of batches and clean shutdown (`WaitAndClose`, `Done`).
- Supports **batching** (configurable via `BatchSize`) to optimize I/O.
- Dynamically builds CSV headers based on selected options before processing begins.
### Use Case:
Efficient, memory-light conversion of large taxonomic datasets (e.g., from classification pipelines) into structured CSV for downstream analysis or reporting.
## CSV Taxonomy Loader for OBITools4
This Go module provides a function `LoadCSVTaxonomy` to parse and load taxonomic data from CSV files into an internal taxonomy structure.
### Key Features:
- **Robust CSV Parsing**: Uses Go's `encoding/csv` with configurable options (comment lines, lazy quotes, whitespace trimming).
- **Column Mapping**: Dynamically identifies required columns: `taxid`, `parent`, `scientific_name`, and `taxonomic_rank`.
- **Error Handling**: Validates presence of all required columns; fails early with descriptive errors.
- **Taxonomy Construction**:
- Builds a hierarchical taxonomy using `obitax.Taxon` objects.
- Ensures existence of a root node; returns error otherwise.
- **Metadata Extraction**:
- Derives taxonomy name and short code (e.g., prefix before `:` in first taxid).
- Logs key metadata for traceability.
- **Scalable Design**:
- Processes records line-by-line (memory-efficient).
- Supports large datasets via streaming CSV reading.
### Input Format:
CSV must contain exactly four columns (case-sensitive headers):
- `taxid`: Unique taxon identifier.
- `parent`: Parent taxonomic node ID (empty for root).
- `scientific_name`: Binomial or descriptive name.
- `taxonomic_rank`: e.g., *species*, *genus*.
### Output:
Returns a fully populated `obitax.Taxonomy` object ready for downstream phylogenetic or sequence classification tasks.
# Semantic Description of `obiformats.WriterDispatcher`
The package `obiformats` provides utilities for writing biosequences (e.g., DNA/RNA/protein reads) to files in a structured, parallelized manner. Its core component is the `WriterDispatcher` function.
- **Purpose**: Enables concurrent, classifier-guided writing of biosequence batches to multiple output files based on dynamic dispatching logic.
- **Input**: Takes a prototype filename template (`prototypename`), an `IDistribute` dispatcher (which partitions and routes sequences by classification keys), a formatting/writing function (`formater` of type `SequenceBatchWriterToFile`), and optional configuration.
- **Concurrency**: Launches one goroutine per classification category (via `dispatcher.News()`), ensuring scalable parallel writes.
- **Classification Handling**: Supports simple and composite keys (e.g., dual annotations like sample + region), parsing JSON-encoded classifier values when needed.
- **File Naming & Organization**: Substitutes keys into the prototype name, appends `.gz` if compression is enabled, and creates subdirectories (e.g., for sample groups) as required.
- **Error Handling**: Uses `log.Fatalf` to abort on unrecoverable errors (e.g., failed key parsing, directory creation issues).
- **Resource Management**: Ensures all goroutines complete before returning via `sync.WaitGroup`.
- **Extensibility**: The generic `SequenceBatchWriterToFile` type allows plugging in different output formats (e.g., FASTA, JSON) without modifying the dispatcher logic.
In summary: `WriterDispatcher` is a high-level orchestrator for parallel, classifier-based batch writing of biological sequences to organized file outputs.
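The dispatch-and-wait pattern can be sketched as follows (in-memory strings standing in for sequence batches and file writers):

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// dispatch routes each (key, batch) pair to a per-key goroutine that
// "writes" under a filename derived from the prototype; a WaitGroup
// ensures all writers finish before the function returns.
func dispatch(prototype string, batches map[string][]string) []string {
	var wg sync.WaitGroup
	var mu sync.Mutex
	var written []string

	for key, batch := range batches {
		wg.Add(1)
		go func(key string, batch []string) {
			defer wg.Done()
			name := strings.ReplaceAll(prototype, "{key}", key)
			mu.Lock()
			written = append(written, fmt.Sprintf("%s: %d seqs", name, len(batch)))
			mu.Unlock()
		}(key, batch)
	}
	wg.Wait()
	return written
}

func main() {
	out := dispatch("sample_{key}.fasta", map[string][]string{"A": {"s1"}, "B": {"s2", "s3"}})
	fmt.Println(len(out))
}
```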
# EcoPCR File Parser for Biological Sequences
This Go package (`obiformats`) provides functionality to parse EcoPCR output files—tab-delimited CSV-like files containing amplified sequence data generated by the *EcoPCR* tool (used in metabarcoding pipelines). The parser supports two versions of the format (`v1` and `v2`) and extracts rich biological metadata alongside sequences.
## Key Features
- **Version Detection**: Automatically detects EcoPCR file version via the `#@ecopcr-v2` header.
- **Primer Extraction**: Reads forward and reverse primer sequences from comment lines in the file header.
- **Mode Inference**: Identifies amplification mode (e.g., `direct`, `inverted`) from header metadata.
- **Sequence Parsing**: Reads each record as a biological sequence (`obiseq.BioSequence`) with:
- Name (with deduplication support)
- Nucleotide/protein sequence
- Comment field
- **Structured Annotation**: Populates rich annotations including:
- Taxonomic hierarchy (taxid, rank, species/genus/family names)
- Primer matching info (`forward_match`, `reverse_mismatch`)
- Melting temperatures (if present in v2)
- Amplicon length and strand orientation
- **Streaming & Batching**: Returns an iterator (`obiiter.IBioSequence`) for memory-efficient, batched processing of large files.
- **File Handling**: Provides both `ReadEcoPCR` (from any `io.Reader`) and `ReadEcoPCRFromFile` convenience functions.
## Implementation Highlights
- Custom line reader (`__readline__`) for robust header parsing.
- CSV parser configured with `|` delimiter and comment support (`#`).
- Deduplication of sequence names using a running count suffix.
- Concurrent goroutine-based streaming to decouple I/O and processing.
This module integrates with the broader *OBItools4* ecosystem for high-throughput sequence analysis in environmental DNA studies.
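The `|`-delimited, `#`-commented layout described above can be handled with the standard library's `encoding/csv`; a minimal sketch, assuming a simplified three-column record (the real EcoPCR column order is far richer):

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parseEcoPCRRecords reads '|'-separated records, skipping '#' comment
// lines, and trims surrounding whitespace from every field.
func parseEcoPCRRecords(data string) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(data))
	r.Comma = '|'
	r.Comment = '#'
	r.TrimLeadingSpace = true
	records, err := r.ReadAll()
	if err != nil {
		return nil, err
	}
	for _, rec := range records {
		for i, f := range rec {
			rec[i] = strings.TrimSpace(f)
		}
	}
	return records, nil
}

func main() {
	input := "# header comment\nAB001 | 9606 | species\n"
	recs, err := parseEcoPCRRecords(input)
	if err != nil {
		panic(err)
	}
	fmt.Println(recs[0])
}
```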
@@ -0,0 +1,17 @@
# EMBL Format Parser for OBITools4
This Go package (`obiformats`) provides robust, streaming parsers for the **EMBL nucleotide sequence format**, supporting both standard and rope-based (memory-efficient) parsing. Key features:
- **Entry Boundary Detection**: `EndOfLastFlatFileEntry()` identifies the end of EMBL entries using the signature terminator pattern `//` (with optional CR/LF), enabling chunked file processing.
- **Two Parsing Modes**:
- `EmblChunkParser()`: Line-scanning parser for buffered I/O (`io.Reader`).
- `EmblChunkParserRope()`: Direct rope-based parser for zero-copy processing of large files.
- **Configurable Options**:
- `withFeatureTable`: Includes EMBL feature table (`FH`/`FT`) lines.
- `UtoT`: Converts RNA uracil (`u/U`) to DNA thymine (`t/T`).
- **Metadata Extraction**: Captures `ID`, `OS` (scientific name), `DE` (description), and taxonomic ID (`/db_xref="taxon:..."`) into sequence annotations.
- **Sequence Handling**: Parses multi-line EMBL sequences (10-bases-per-group, with position numbers), skipping digits and whitespace.
- **Parallel Processing**: `ReadEMBL()`/`ReadEMBLFromFile()` support concurrent parsing via worker goroutines, streaming results as `BioSequenceBatch` objects.
- **Integration**: Outputs are compatible with OBITools4's iterator framework (`obiiter.IBioSequence`) and sequence type `obiseq.BioSequence`.
Designed for scalability, the module handles large EMBL files efficiently—ideal for metagenomic or biodiversity data pipelines.
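The entry-boundary idea behind `EndOfLastFlatFileEntry()` can be sketched as a backwards search for the `//` terminator line (a simplification that ignores the CR/LF variants the real function handles):

```go
package main

import (
	"bytes"
	"fmt"
)

// endOfLastEntry returns the offset just past the last complete EMBL
// entry terminator ("//" on its own line), or -1 if none is present.
func endOfLastEntry(buf []byte) int {
	i := bytes.LastIndex(buf, []byte("\n//\n"))
	if i < 0 {
		// the buffer may itself begin with a terminator line
		if bytes.HasPrefix(buf, []byte("//\n")) {
			return len("//\n")
		}
		return -1
	}
	return i + len("\n//\n")
}

func main() {
	chunk := []byte("ID   X;\nSQ   Sequence\n     acgt\n//\nID   Y;\n")
	// everything before the returned offset is safe to hand to a parser
	fmt.Println(endOfLastEntry(chunk))
}
```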
@@ -0,0 +1,22 @@
## `ReadEmptyFile` Function — Semantic Description
- **Package**: `obiformats`, part of the OBITools4 ecosystem for biological sequence handling.
- **Purpose**: Creates and returns an *empty*, closed iterator over biosequences (`IBioSequence`).
- **Signature**:
`func ReadEmptyFile(options ...WithOption) (obiiter.IBioSequence, error)`
- **Input**: Accepts variadic `WithOption` configuration functions (currently unused in this minimal implementation).
- **Behavior**:
- Instantiates a new `IBioSequence` iterator via `obiiter.MakeIBioSequence()`.
- Immediately closes the stream using `.Close()` — indicating no data will be yielded.
- **Output**:
- Returns a *terminal* iterator (no elements), suitable as a safe default or fallback.
- Error return is always `nil`, since no I/O occurs and the operation is deterministic.
### Semantic Role & Use Cases
- **Default/Placeholder**: Useful in conditional logic where a valid (but empty) sequence iterator is required when no input file exists or parsing fails.
- **Consistency**: Ensures callers always receive a well-formed iterator, avoiding `nil` checks.
- **Resource Safety**: The closed state prevents accidental iteration or memory leaks.
### Design Notes
- Reflects a *pure-functional* and *fail-safe* pattern: no side effects, deterministic behavior.
- Aligns with iterator-based I/O design principles in OBITools4 (lazy, composable streams).
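The closed-iterator pattern can be illustrated without the library using a plain channel: close it immediately and consumers see a well-formed, already-exhausted stream.

```go
package main

import "fmt"

// emptySequences returns an already-closed channel: ranging over it
// yields nothing, so callers never need a nil check.
func emptySequences() <-chan string {
	ch := make(chan string)
	close(ch)
	return ch
}

func main() {
	n := 0
	for range emptySequences() { // loop body never runs
		n++
	}
	fmt.Println("yielded:", n)
}
```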
@@ -0,0 +1,34 @@
# FASTA Parser Module (`obiformats`)
This Go package provides robust, streaming-capable parsing of FASTA-formatted nucleotide sequences. It supports both standard and rope-based (memory-efficient) input handling.
## Core Functionalities
- **`FastaChunkParser(UtoT bool)`**
Returns a parser function for in-memory byte streams. Converts `U→T` if enabled (for RNA/DNA normalization). Validates headers, identifiers, and sequences; rejects invalid characters or malformed entries.
- **`FastaChunkParserRope(...)`**
Parses FASTA directly from a `PieceOfChunk` rope structure, avoiding full data materialization. Optimized for large files.
- **`ReadFasta(reader io.Reader, ...)`**
High-level API to parse FASTA from any `io.Reader`. Uses chunked reading with parallel workers (configurable via options). Supports full-file batching and header annotation parsing.
- **`ReadFastaFromFile(...)` / `ReadFastaFromStdin(...)`**
Convenience wrappers for file and stdin inputs, including source naming and empty-file handling.
- **`EndOfLastFastaEntry(...)`**
Helper to locate the last complete FASTA entry in a buffer, enabling safe chunked streaming without splitting records.
## Key Features
- **Strict validation**: Ensures entries start with `>`, contain valid identifiers, and only use allowed sequence characters (`a-z`, `- . [ ]`).
- **Case normalization**: Converts uppercase to lowercase; optional `U→T` conversion.
- **Whitespace handling**: Ignores spaces/tabs in sequences, preserves line breaks only for parsing structure.
- **Parallel processing**: Configurable worker count via options; batches results by source and order for downstream sorting/aggregation.
- **Integration with `obiseq`/`obiiter`**: Yields typed sequence objects (`BioSequence`) and batched iterators compatible with OBITools4 pipelines.
## Design Highlights
- Minimal allocations via rope-based parsing (`extractFastaSeq`).
- Graceful error reporting with context (source, identifier, invalid char position).
- Extensible via `WithOption` pattern for header parsing and batching behavior.
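The role of `EndOfLastFastaEntry` can be sketched with a backwards search for a record start (a simplification; the real helper also validates the entry's contents):

```go
package main

import (
	"bytes"
	"fmt"
)

// startOfLastFastaEntry returns the offset of the '>' beginning the
// last FASTA record in buf, or -1 when buf holds no record start.
func startOfLastFastaEntry(buf []byte) int {
	if i := bytes.LastIndex(buf, []byte("\n>")); i >= 0 {
		return i + 1
	}
	if bytes.HasPrefix(buf, []byte(">")) {
		return 0
	}
	return -1
}

func main() {
	buf := []byte(">seq1\nacgt\n>seq2\nac")
	// bytes before this offset form complete records; the tail is
	// carried over into the next chunk
	fmt.Println(startOfLastFastaEntry(buf))
}
```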
@@ -0,0 +1,41 @@
# FASTQ Parsing Module (`obiformats`)
This Go package provides robust, streaming-capable parsing of FASTQ files — a standard format for storing nucleotide sequences along with quality scores.
## Core Functionalities
- **`EndOfLastFastqEntry(buffer []byte) int`**
  Locates the start position (`@`) of the last complete FASTQ entry in a byte buffer using state-machine scanning from end to beginning. Returns `-1` if no valid entry is found.
- **`FastqChunkParser(...)`**
Returns a parser function for processing FASTQ data from an `io.Reader`. Handles:
- Header parsing (`@id [definition]`)
- Sequence normalization (uppercase → lowercase, `U→T` conversion if enabled)
- Quality score shifting (`quality_shift`)
- Strict validation (e.g., `+` line, matching sequence/length)
- **`FastqChunkParserRope(...)`**
Optimized parser for rope-based input (`PieceOfChunk`), avoiding unnecessary memory copies. Uses direct line-by-line scanning.
- **Batched File Parsing (`_ParseFastqFile`, `ReadFastq`, etc.)**
Enables concurrent, chunked parsing of large files:
- Splits input into chunks using `ReadFileChunk`
- Uses configurable parallel workers (`nworker`)
- Pushes parsed batches to an iterator interface
- **Convenience I/O Wrappers**
- `ReadFastqFromFile(filename, ...)`: Parses a file by name.
- `ReadFastqFromStdin(...)`: Reads FASTQ from standard input.
## Key Options & Features
- **Quality handling**: Optional quality extraction (`with_quality`), configurable offset (`quality_shift`)
- **Uracil-to-Thymine conversion**: `UtoT` flag for RNA→DNA normalization
- **Header annotation parsing**: Optional post-parsing header interpretation via `ParseFastSeqHeader`
- **Batch sorting & full-file mode**: Supports both streaming and complete-file aggregation
## Design Highlights
- **Memory-efficient chunking** with overlap-aware boundary detection (`EndOfLastFastqEntry`)
- **Strict error reporting**: Fails fast on malformed FASTQ (e.g., invalid chars, length mismatch)
- **Integration with `obiseq`, `obiiter`**: Returns typed biological sequence slices and iterator streams compatible with the broader OBITools4 ecosystem.
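The boundary detection behind `EndOfLastFastqEntry` is subtle because `@` and `+` are legal quality characters; a line-based sketch that uses the sequence/quality length match to disambiguate (much simpler than the actual state machine):

```go
package main

import (
	"bytes"
	"fmt"
)

// startOfLastFastqEntry returns the offset of the '@' header line that
// begins the last complete 4-line record, or -1 if none exists. The
// len(seq) == len(qual) check guards against '@' in quality lines.
func startOfLastFastqEntry(buf []byte) int {
	lines := bytes.Split(buf, []byte{'\n'})
	offsets := make([]int, len(lines))
	pos := 0
	for i, l := range lines {
		offsets[i] = pos
		pos += len(l) + 1
	}
	for i := len(lines) - 4; i >= 0; i-- {
		if len(lines[i]) > 0 && lines[i][0] == '@' &&
			len(lines[i+2]) > 0 && lines[i+2][0] == '+' &&
			len(lines[i+1]) > 0 && len(lines[i+1]) == len(lines[i+3]) {
			return offsets[i]
		}
	}
	return -1
}

func main() {
	buf := []byte("@r1\nacgt\n+\nIIII\n@r2\nac")
	fmt.Println(startOfLastFastqEntry(buf))
}
```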
@@ -0,0 +1,11 @@
## Semantic Description of `obiformats` Package
The `obiformats` package provides core formatting utilities for biological sequence data in standard FASTX formats (FASTA and FASTQ). It defines two functional types:
- `BioSequenceFormater`: Converts a single biological sequence (`*obiseq.BioSequence`) into its string representation.
- `BioSequenceBatchFormater`: Converts a batch of sequences (`obiiter.BioSequenceBatch`) into raw bytes, suitable for file or stream output.
Two main constructor functions enable flexible formatting:
- `BuildFastxSeqFormater(format, header)` returns a sequence-level formatter based on the requested format (`"fasta"` or `"fastq"`), applying optional header metadata via `FormatHeader`.
- `BuildFastxFormater(format, header)` builds a batch formatter by composing the sequence-level function over all sequences in an iterator-driven batch, concatenating results with newline separators.
The package supports extensibility and type safety through function composition while integrating logging (via `logrus`) for critical errors—e.g., unsupported formats trigger a fatal log. It abstracts away low-level I/O, focusing purely on *semantic formatting logic*, making it ideal for pipeline integration in NGS data processing tools.
@@ -0,0 +1,27 @@
# Semantic Description of `obiformats` Package
The `obiformats` package provides utilities for parsing sequence headers in the OBItools4 framework, supporting two distinct formats:
- **JSON-based format** (e.g., `{"id":"seq1", ...}`): Detected by a leading `{` character.
- **Legacy OBI format** (plain text, e.g., `>seq1 description`): Used when no JSON prefix is present.
## Core Functions
- **`ParseGuessedFastSeqHeader(sequence *obiseq.BioSequence)`**
Dynamically routes header parsing based on the first character of the sequence definition:
- Calls `ParseFastSeqJsonHeader` if JSON-prefixed.
- Otherwise invokes `ParseFastSeqOBIHeader`.
- **`IParseFastSeqHeaderBatch(iterator, options...) obiiter.IBioSequence`**
Applies header parsing to a *batch* of sequences:
- Takes an iterator over `BioSequence`s.
- Uses optional configuration (e.g., parallelism, parsing behavior).
- Wraps the parser in a worker pipeline via `MakeIWorker`, preserving sequence flow.
## Design Principles
- **Format agnosticism**: Automatically detects header type.
- **Iterator-based streaming**: Enables memory-efficient batch processing of large datasets (e.g., FASTQ/FASTA).
- **Extensibility**: Options pattern (`WithOption`) supports runtime customization.
This package serves as a header-decoding layer for downstream analysis in metagenomic or metabarcoding workflows.
@@ -0,0 +1,28 @@
# `FormatHeader` Function Type in `obiformats`
The `obiformats` package defines a core functional interface for sequence formatting within the OBITools4 ecosystem.
- **Package**: `obiformats`
Provides utilities for formatting biological sequences according to various output standards (e.g., FASTA, GenBank).
- **Type Definition**:
```go
type FormatHeader func(sequence *obiseq.BioSequence) string
```
- A `FormatHeader` is a *function type* that takes a pointer to an `obiseq.BioSequence` and returns its formatted header as a string.
- **Semantic Role**:
Encapsulates the logic for generating *header lines* (e.g., `>id description`) in sequence file formats.
Decouples header formatting from core data structures (`BioSequence`), enabling modular and reusable format adapters.
- **Usage Context**:
- Used by writers/formatters to produce standardized headers when exporting sequences.
- Allows custom header generation (e.g., for MIxS-compliant metadata, user-defined tags).
- Supports polymorphism: different `FormatHeader` implementations can be swapped per output format.
- **Dependencies**:
- Relies on `obiseq.BioSequence`, the core sequence data model (ID, description, annotations, etc.).
- **Design Intent**:
Promotes clean separation of concerns: data (sequence) ↔ formatting logic.
Facilitates extensibility for new output formats without modifying core types.
@@ -0,0 +1,21 @@
This Go package `obiformats` provides semantic parsing and serialization utilities for FASTQ/FASTA sequence headers encoded in JSON format, primarily used within the OBITools4 framework.
- **JSON Parsing Helpers**:
It defines internal functions (`_parse_json_map_*`, `_parse_json_array_*`) to convert JSON objects/arrays into typed Go maps and slices (`map[string]string`, `[]int`, etc.), using the high-performance [`jsonparser`](https://github.com/buger/jsonparser) library for streaming parsing.
- **Header Interpretation**:
`_parse_json_header_` interprets a FASTQ/FASTA header string containing embedded JSON metadata. It extracts and assigns:
- Core fields (`id`, `definition`, `count`)
- Specialized OBITools annotations (e.g., `"obiclean_weight"`, `"taxid"` with optional taxonomic ranks)
- Generic annotations of any JSON type (string, number, bool, array, object), preserving numeric precision where possible.
- **Sequence Annotation Enrichment**:
`ParseFastSeqJsonHeader` parses the header of a `BioSequence`, extracting JSON metadata into its annotations map and reconstructing non-JSON text as the new definition.
- **Serialization Support**:
`WriteFastSeqJsonHeader` and `FormatFastSeqJsonHeader` serialize sequence annotations back into JSON format, appending them to a buffer or returning as string — enabling round-trip compatibility for annotated sequences.
- **Error Handling**:
Uses `log.Fatalf` on parsing failures, ensuring malformed headers fail fast during processing.
In summary: *structured JSON header ↔ BioSequence annotation mapping*, optimized for metabarcoding workflows.
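The split between the embedded JSON object and the trailing free text can be sketched with the standard library's `encoding/json` instead of `jsonparser` (field names here are illustrative):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// splitJSONHeader decodes the leading JSON object of a header and
// returns the parsed annotations plus the remaining plain-text part.
func splitJSONHeader(header string) (map[string]any, string, error) {
	dec := json.NewDecoder(strings.NewReader(header))
	var ann map[string]any
	if err := dec.Decode(&ann); err != nil {
		return nil, "", err
	}
	// InputOffset points just past the decoded JSON value
	rest := strings.TrimSpace(header[dec.InputOffset():])
	return ann, rest, nil
}

func main() {
	ann, rest, err := splitJSONHeader(`{"count":3,"taxid":"9606"} partial cytochrome b`)
	if err != nil {
		panic(err)
	}
	fmt.Println(ann["count"], rest)
}
```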
@@ -0,0 +1,31 @@
# OBIFormats Package: Semantic Description
The `obiformats` package provides parsing and formatting utilities for **OBI-compliant FASTA headers**, enabling structured annotation of biological sequences.
- It supports parsing key-value annotations embedded in sequence definitions (e.g., `key=value;`), including nested dictionaries.
- Three core parsing functions detect value types:
- `__match__key__`: Identifies assignment patterns (`Key = ...`).
- `__obi_header_value_numeric_pattern__`: Matches floats/integers (e.g., `42.0;`).
- `__obi_header_value_string_pattern__`: Matches quoted strings (e.g., `'example';`).
- `__match__dict__`: Parses balanced `{...}` blocks, handling nested structures and string delimiters.
- Boolean detection (`__is_true__/__is_false__`) handles multiple case variants (e.g., `true`, `True`, `TRUE`).
- The main entry point, **`ParseOBIFeatures(text string, annotations obiseq.Annotation)`,**
iteratively extracts key-value pairs from a header string and populates an `Annotation` map.
- Numeric values are stored as integers if they have no fractional part.
- Dictionary-like strings (e.g., `{'a':1,'b':2}`) are JSON-unmarshalled into typed maps:
  - `*_count` → `map[string]int`,
  - `merged_*` → wrapped in a statistics object (`obiseq.StatsOnValues`),
  - `*_status`/`*_mutation` → `map[string]string`.
- **`ParseFastSeqOBIHeader(sequence *obiseq.BioSequence)`** applies parsing to a sequence's definition line, moving annotations into its metadata map and preserving leftover text.
- **`WriteFastSeqOBIHeader(buffer *bytes.Buffer, sequence)`** serializes annotations back into OBI header format:
- Strings and booleans use `key=value;`.
- Maps/dicts are JSON-encoded, then single-quoted for compatibility.
- Special handling ensures `obiseq.StatsOnValues` are safely marshalled.
- **`FormatFastSeqOBIHeader(sequence)`** returns the formatted header as a string (zero-copy via `unsafe.String` for performance).
- Designed to interoperate with the broader OBITools4 ecosystem (`obiseq`, `obiutils`), supporting both human-readable and machine-processable sequence metadata.
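The `key=value;` extraction can be sketched with a regular expression (heavily simplified: no nested `{...}` dictionaries or quoted strings, which the real parser handles):

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

var kvPattern = regexp.MustCompile(`(\w+)=([^;]+);`)

// parseOBIHeader extracts key=value; pairs, storing integer values as
// int and everything else as a string.
func parseOBIHeader(text string) map[string]any {
	ann := make(map[string]any)
	for _, m := range kvPattern.FindAllStringSubmatch(text, -1) {
		if n, err := strconv.Atoi(m[2]); err == nil {
			ann[m[1]] = n
		} else {
			ann[m[1]] = m[2]
		}
	}
	return ann
}

func main() {
	ann := parseOBIHeader("count=12; direction=forward; leftover description")
	fmt.Println(ann["count"], ann["direction"])
}
```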
@@ -0,0 +1,26 @@
# FastSeq Reader Module — Semantic Description
This Go package (`obiformats`) provides high-performance parsing of FASTA/FASTQ files using a C-backed library (`fastseq_read.h`). It enables streaming, batched reading of biological sequences with optional quality scores.
## Core Features
- **C-based FASTX parsing**: Leverages `kseq.h` via Go's cgo for efficient, low-level file/stream parsing.
- **Batched iteration**: Sequences are grouped into configurable batches (`batch_size`) for memory-efficient processing.
- **Quality score handling**: Supports FASTQ; decodes Phred quality scores using a configurable shift offset (`obidefault.ReadQualitiesShift()`).
- **Source tracking**: Each sequence carries its origin (filename or `"stdin"`), aiding provenance.
- **Header parsing hook**: Optional custom header parser (`ParseFastSeqHeader`) allows metadata extraction or transformation.
- **Full-file batching mode**: When enabled, yields a single batch containing the entire file (useful for small files or global operations).
- **Stdin & File I/O**: Two entry points:
- `ReadFastSeqFromFile(filename, ...)` for regular files.
- `ReadFastSeqFromStdin(...)` to process piped input (e.g., from upstream tools).
- **Error resilience**: Gracefully handles missing files, with logging (via `logrus`) for debugging.
- **Async streaming**: Uses goroutines to decouple reading from consumption, enabling concurrent pipelines.
## Integration
Built on top of `obitools4`'s core abstractions:
- `obiiter.IBioSequence`: Iterator interface for biological sequences.
- `obiseq.BioSequence`: Data model holding name, sequence bytes, comment, and quality.
- `obiutils`, `obidefault`: Utilities for path handling and defaults.
Designed for scalability in high-throughput metabarcoding pipelines.
@@ -0,0 +1,35 @@
# `obiformats` Package Overview
The `obiformats` package provides utilities for formatting and writing biological sequences (e.g., DNA, RNA) in standard formats—primarily **FASTA**. It is designed for high-performance batch processing and supports parallel I/O, compression-aware streaming, and flexible configuration.
## Core Formatting Functions
- **`FormatFasta(seq, formater)`**
Converts a single `BioSequence` into a FASTA string: header (`>id description`) followed by sequence lines of up to 60 characters.
- **`FormatFastaBatch(batch, formater, skipEmpty)`**
Efficiently formats a batch of sequences into FASTA using pre-allocated buffers and direct byte writes—avoiding intermediate strings. Empty sequences are either skipped (with warning) or cause a fatal error.
## File Writing Functions
- **`WriteFasta(iterator, file, options...)`**
Writes a stream of sequences to any `io.WriteCloser`. Supports:
- Parallel workers (`ParallelWorkers`)
- Chunked writing via `WriteFileChunk`
- Optional compression (e.g., gzip)
Returns a new iterator mirroring the input for pipeline chaining.
- **`WriteFastaToStdout(iterator, options...)`**
Convenience wrapper to output FASTA directly to `stdout`, with file-closing behavior configurable.
- **`WriteFastaToFile(iterator, filename, options...)`**
Writes to a named file with:
- Truncation or append mode (`AppendFile`)
- Automatic paired-end output if `HaveToSavePaired()` is enabled
(writes reverse reads to a secondary file specified via `PairedFileName`)
## Key Design Highlights
- **Memory-efficient**: Uses `bytes.Buffer.Grow()` and avoids unnecessary allocations.
- **Robust error handling**: Panics on nil sequences; logs warnings/errors via `logrus`.
- **Pipeline-friendly**: Integrates with the `obiiter` iterator abstraction for streaming workflows.
@@ -0,0 +1,35 @@
# FASTQ Output Module (`obiformats`)
This Go package provides utilities for formatting and writing biological sequence data in **FASTQ format**. It supports single-end, paired-end, batch processing, and parallelized I/O.
## Core Functionality
- **`FormatFastq(seq, headerFormatter)`**: Formats a single `BioSequence` into FASTQ string.
- **`FormatFastqBatch(batch, headerFormatter, skipEmpty)`**: Formats a batch of sequences efficiently with dynamic buffer growth and optional skipping/termination on empty reads.
## Header Customization
- Accepts a `FormatHeader` function to inject custom metadata (e.g., read group, sample ID) after the sequence identifier.
## Writing to Streams/Files
- **`WriteFastq(iterator, fileWriter)`**: Writes sequences from an iterator to any `io.WriteCloser`, supporting compression and parallel workers via options.
- **`WriteFastqToStdout(...)`**: Convenience wrapper for stdout output (e.g., piping).
- **`WriteFastqToFile(...)`**: Writes to a file, with support for:
- Append/truncate modes
- Paired-end output (splits iterator and writes to two files)
- Automatic compression via `obiutils.CompressStream`
## Parallelization & Robustness
- Uses goroutines to parallelize formatting/writing across multiple workers.
- Handles empty sequences gracefully: logs warning or fatal error based on `skipEmpty` option.
- Ensures ordered output via batch tracking (`Order()`) and chunked writing.
## Integration
Designed to work seamlessly with the `obitools4` ecosystem:
- Uses `obiiter.BioSequenceBatch`, `obiseq.BioSequence`, and logging via Logrus.
- Extensible through functional options (`WithOption`) for configuration.
> *Efficient, scalable FASTQ output with support for high-throughput NGS workflows.*
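The 4-line record layout and quality encoding can be sketched as follows, assuming the conventional Sanger shift of 33 (the module's actual offset is configurable):

```go
package main

import "fmt"

// formatFastq renders a 4-line FASTQ record, encoding each Phred
// score as the ASCII character score+shift (33 for Sanger encoding).
func formatFastq(id, seq string, qualities []byte, shift byte) string {
	encoded := make([]byte, len(qualities))
	for i, q := range qualities {
		encoded[i] = q + shift
	}
	return fmt.Sprintf("@%s\n%s\n+\n%s\n", id, seq, encoded)
}

func main() {
	fmt.Print(formatFastq("read1", "acgt", []byte{40, 40, 30, 2}, 33))
}
```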
@@ -0,0 +1,19 @@
# `obiformats` Package Overview
The `obiformats` package provides semantic support for handling and validating structured data formats, particularly focused on biodiversity observation records. It offers:
- **Format Abstraction**: Defines common interfaces and base classes for standardized biodiversity data formats (e.g., Darwin Core, OBIS-ENV).
- **Validation Rules**: Implements semantic validation logic to ensure data integrity and compliance with community standards (e.g., required fields, controlled vocabularies).
- **Mapping Utilities**: Includes tools for transforming records between different biodiversity data schemas (e.g., from local formats to Darwin Core).
- **Ontology Integration**: Leverages semantic web technologies (e.g., RDF, OWL) to support interoperability and reasoning over observation metadata.
- **Type Safety**: Uses strongly-typed data models (e.g., `Occurrence`, `Event`) to reduce runtime errors and improve code clarity.
- **Extensibility**: Designed for easy extension—new formats or standards can be added by implementing core interfaces.
- **Test Coverage**: Includes unit and integration tests to guarantee correctness across format transformations and validations.
The package targets biodiversity data managers, informaticians building OBIS-compatible systems, and researchers working with ecological observation datasets.
@@ -0,0 +1,25 @@
# Semantic Description of `obiformats` Package Functionalities
The `obiformats` package provides robust, streaming-aware chunking utilities for processing large biological sequence files (e.g., FASTA/FASTQ) in a memory-efficient and parallel-friendly manner.
- **`PieceOfChunk`**: A rope-like linked buffer structure enabling efficient concatenation and partial reading of large data streams without full materialization. Supports dynamic chaining (`NewPieceOfChunk`, `Next()`) and final packing into a contiguous slice via `Pack()`.
- **`FileChunk`**: Encapsulates one chunk of raw data (`*bytes.Buffer`) or its rope representation, tagged with source file name and positional order for ordered downstream processing.
- **`ChannelFileChunk`**: A typed channel (`chan FileChunk`) enabling concurrent, pipeline-style data ingestion—ideal for parallel parsing or streaming workflows.
- **`LastSeqRecord`**: A callback type (`func([]byte) int`) used to locate the end of a complete biological record (e.g., last newline after full FASTQ entry), ensuring chunks split only at valid boundaries.
- **`ReadFileChunk()`**: Core function that:
- Reads from an `io.Reader` in configurable chunks (`fileChunkSize`);
- Uses a probe string (e.g., `"@M0"` for FASTQ) to early-exit non-matching segments and avoid unnecessary parsing;
- Extends chunks incrementally (e.g., +1MB) until a full record boundary is found via `splitter`;
- Returns data as an ordered stream of `FileChunk`s on a channel, closing it upon EOF;
- Optionally packs rope buffers to contiguous memory (`pack` flag), balancing speed vs. RAM usage.
- **Key semantics**:
- *Chunking by record integrity*, not fixed byte size — prevents splitting biological entries.
- *Lazy evaluation*: only reads ahead when needed to find record boundaries.
- *Streaming-first design* — supports large files without full loading into memory.
This package is foundational for scalable, robust parsing of high-throughput sequencing data in the OBITools4 ecosystem.
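The rope idea behind `PieceOfChunk` can be sketched as a linked list of byte slices with a single-pass `pack` step:

```go
package main

import (
	"bytes"
	"fmt"
)

// pieceOfChunk is a minimal rope: a linked list of byte slices that
// can be extended without copying earlier pieces.
type pieceOfChunk struct {
	data []byte
	next *pieceOfChunk
}

// pack walks the chain twice: once to size the destination, once to
// concatenate everything into a single contiguous slice.
func (p *pieceOfChunk) pack() []byte {
	size := 0
	for q := p; q != nil; q = q.next {
		size += len(q.data)
	}
	buf := bytes.NewBuffer(make([]byte, 0, size))
	for q := p; q != nil; q = q.next {
		buf.Write(q.data)
	}
	return buf.Bytes()
}

func main() {
	head := &pieceOfChunk{data: []byte(">seq1\n")}
	head.next = &pieceOfChunk{data: []byte("acgt\n")}
	fmt.Printf("%s", head.pack())
}
```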
@@ -0,0 +1,26 @@
# `WriteFileChunk` Function — Semantic Description
The `WriteFileChunk` function in the `obiformats` package implements a **thread-safe, ordered chunk writer** for streaming data to an `io.WriteCloser`. It accepts a destination writer and a flag indicating whether the writer should be closed upon completion.
- **Input**:
- `writer`: An `io.WriteCloser` (e.g., file, buffer) to which data chunks are written.
- `toBeClosed`: Boolean flag specifying if the writer should be closed after all chunks are processed.
- **Core Behavior**:
- Launches a goroutine that consumes `FileChunk` items from an unbuffered channel (`chunk_channel`).
- Ensures **strict sequential ordering** of chunks by their `Order` field (intended for reassembly after parallel or out-of-order processing).
- If a chunk arrives in order (`chunk.Order == nextToPrint`), it is immediately written.
- Out-of-order chunks are buffered in a map (`toBePrinted`) until their predecessor arrives.
- **Buffer Management**:
- After writing an in-order chunk, the function checks for newly consecutive buffered chunks and writes them greedily (e.g., if order 2 arrives, it triggers writing of buffered orders 3,4,... as available).
- **Error Handling**:
- Logs fatal errors on write failures or writer closure issues using `log.Fatalf`.
- **Cleanup & Lifecycle**:
- Closes the underlying writer if requested and unregisters a pipe registration (via `obiutils`) to signal end-of-stream.
- Returns the input channel, enabling external producers to stream `FileChunk` structs.
- **Use Case**:
Designed for robust, ordered reconstruction of large binary/data streams (e.g., sequencing reads) in OBITools4 pipelines, especially where parallel chunking and reassembly occur.
@@ -0,0 +1,34 @@
# GenBank Parser Module (`obiformats`)
This Go package provides high-performance parsing of **GenBank flat files**, optimized for large-scale genomic data processing. It supports both rope-based (memory-efficient) and buffered I/O parsing strategies.
## Core Functionalities
- **State-machine parser**: Processes GenBank records through well-defined states (`inHeader`, `inEntry`, `inFeature`, etc.), ensuring robust handling of structured sections (LOCUS, DEFINITION, SOURCE, FEATURES, ORIGIN/CONTIG).
- **Rope-aware parsing** (`GenbankChunkParserRope`): Directly parses from a `PieceOfChunk` rope structure, avoiding large contiguous memory allocations—critical for chromosomal-scale sequences.
- **Sequence extraction**: Efficient byte-by-byte scanning of the `ORIGIN` section, compacting bases and optionally converting uracil (`u`) to thymine (`t`).
- **Metadata extraction**: Captures sequence ID, declared length (from LOCUS), scientific name (`SOURCE`), and taxonomic ID (`/db_xref="taxon:..."`).
- **Optional feature table support**: When enabled, stores raw FEATURES section content for downstream annotation processing.
- **Parallel streaming I/O**:
- `ReadGenbank()` and `ReadGenbankFromFile()` return an iterator (`obiiter.IBioSequence`) over parsed sequences.
- Supports concurrent parsing via configurable worker count, with chunked file reading and batch output.
## Key Design Decisions
- **Zero-copy where possible**: Rope parser avoids `Pack()` to prevent expensive reallocation.
- **Strict state validation**: Logs fatal errors on unexpected line sequences (e.g., `DEFINITION` outside entry state).
- **Fallback parsing**: Falls back to buffered I/O (`GenbankChunkParser`) when rope data is unavailable.
- **U-to-T conversion**: Optional base modification for RNA→DNA normalization (e.g., in transcriptome data).
- **Error resilience**: Warns on empty IDs but continues processing; rejects overly long lines (>100 chars) in buffered mode.
## Output
Returns a batched iterator of `BioSequence` objects, each containing:
- Identifier (`id`)
- Compact nucleotide sequence
- Definition line (as description)
- Source file origin
- Optional feature table bytes
- Annotations: `scientific_name`, `taxid`
Ideal for pipelines requiring scalable, low-memory GenBank ingestion (e.g., metagenomic databases).
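The state-machine approach can be sketched on a toy flat file (only the `LOCUS`/`ORIGIN`/`//` transitions; the real parser tracks several more states and sections):

```go
package main

import (
	"fmt"
	"strings"
)

type gbState int

const (
	inHeader gbState = iota
	inEntry
	inSequence
)

// scanGenbank walks a flat file line by line: LOCUS opens an entry,
// ORIGIN opens the sequence block, and "//" closes the record.
// It returns the compacted bases of each record.
func scanGenbank(text string) []string {
	state := inHeader
	var records []string
	var seq strings.Builder
	for _, line := range strings.Split(text, "\n") {
		switch {
		case strings.HasPrefix(line, "LOCUS"):
			state = inEntry
		case strings.HasPrefix(line, "ORIGIN"):
			state = inSequence
		case line == "//":
			records = append(records, seq.String())
			seq.Reset()
			state = inHeader
		case state == inSequence:
			// drop position numbers and whitespace, keep the bases
			for _, r := range line {
				if r >= 'a' && r <= 'z' {
					seq.WriteRune(r)
				}
			}
		}
	}
	return records
}

func main() {
	gb := "LOCUS X 8 bp\nDEFINITION demo\nORIGIN\n        1 acgt acgt\n//\n"
	fmt.Println(scanGenbank(gb))
}
```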
@@ -0,0 +1,27 @@
# JSON Output Module for Biological Sequences (`obiformats`)
This Go package provides utilities to serialize biological sequence data (from `obiseq`) into structured JSON format, supporting batch processing and parallel I/O.
- **`JSONRecord(sequence)`**: Converts a single `BioSequence` into an indented JSON object containing:
- `"id"`: Sequence identifier.
- `"sequence"` (optional): Nucleotide/protein sequence string if present.
- `"qualities"` (optional): Quality scores as a string if available.
- `"annotations"` (optional): Metadata annotations map.
- **`FormatJSONBatch(batch)`**: Formats a batch of sequences as JSON array elements, returning a `*bytes.Buffer`. Handles comma separation and indentation.
- **`WriteJSON(iterator, file)`**: Writes a stream of sequences to an `io.Writer`, supporting:
- Parallel workers (configurable via options).
- Automatic compression (`gzip`/`bgzip`) if enabled.
- Proper JSON array wrapping: `[`, chunked batches, and final `]`.
- Atomic ordering to preserve sequence integrity across parallel writes.
- **`WriteJSONToStdout()` / `WriteJSONToFile()`**: Convenience wrappers:
- Outputs to stdout or a file (with append/truncate control).
- Supports paired-end data: writes both forward and reverse reads to separate files when configured.
- **Internal helpers**:
  - `_UnescapeUnicodeCharactersInJSON()`: Fixes double-escaped Unicode in JSON output (e.g., `\\u00E9` → `\u00E9`).
- Uses chunked concurrency with `FileChunk`, ordered by batch number to ensure valid JSON structure.
Designed for high-throughput NGS data pipelines, it ensures correctness and performance while integrating with `obitools4`'s iterator-based processing model.
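One standard-library way to repair double-escaped Unicode, in the spirit of `_UnescapeUnicodeCharactersInJSON()` (a sketch, not necessarily the module's exact implementation):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// unescapeUnicode turns literal `\uXXXX` escapes left in a string
// (e.g. after double escaping) back into their runes.
func unescapeUnicode(s string) (string, error) {
	// Quote escapes the backslashes; re-activating the \u escapes
	// lets Unquote interpret them as runes.
	quoted := strings.ReplaceAll(strconv.Quote(s), `\\u`, `\u`)
	return strconv.Unquote(quoted)
}

func main() {
	out, err := unescapeUnicode(`caf\u00e9`)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```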
@@ -0,0 +1,17 @@
# NCBI Taxonomy Loader Module (`obiformats`)
This Go package provides functionality to parse and load NCBI taxonomy dump files into a structured `Taxonomy` object. It supports three core file types:
- **nodes.dmp**: Defines the taxonomic hierarchy via `taxid|parent_taxid|rank` records.
- **names.dmp**: Maps taxonomic IDs to names and name classes (e.g., "scientific name", "common name").
- **merged.dmp**: Tracks deprecated taxonomic IDs and their replacements.
Key features:
- Custom CSV parsing with `|` delimiter, comment support (`#`), and whitespace trimming.
- Support for loading *only scientific names* via the `onlysn` flag in `LoadNCBITaxDump`.
- Efficient buffered reading (`bufio.Reader`) for large files.
- Automatic root taxon (taxid `"1"`, i.e., *root*) assignment after loading.
- Alias resolution: deprecated taxids are mapped to current ones via `AddAlias`.
- Robust error handling with fatal logging on critical failures (e.g., missing root taxon, invalid parent references).
The main entry point is `LoadNCBITaxDump(directory string, onlysn bool)`, which constructs a fully initialized taxonomy from NCBI dump files. Designed for integration with `obitax` and `obiutils`, it enables downstream applications (e.g., metabarcoding pipelines) to perform taxonomic queries and filtering.
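The dump-file field layout (values separated by `|` and padded with tabs, rows ending with a trailing separator) can be sketched as:

```go
package main

import (
	"fmt"
	"strings"
)

// parseDmpLine splits one NCBI taxonomy dump record into trimmed
// fields, dropping the empty field left by the trailing separator.
func parseDmpLine(line string) []string {
	parts := strings.Split(line, "|")
	out := make([]string, 0, len(parts))
	for _, f := range parts {
		out = append(out, strings.TrimSpace(f))
	}
	if n := len(out); n > 0 && out[n-1] == "" {
		out = out[:n-1]
	}
	return out
}

func main() {
	rec := parseDmpLine("9606\t|\t9605\t|\tspecies\t|")
	fmt.Println(rec[0], rec[1], rec[2]) // taxid, parent taxid, rank
}
```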
@@ -0,0 +1,31 @@
## NCBI Taxonomy Archive Support in `obiformats`
This Go package provides utilities for handling **NCBI Taxonomy dumps archived as `.tar` files**.
### Core Functionalities
1. **Archive Validation (`IsNCBITarTaxDump`)**
- Checks whether a given `.tar` file contains all required NCBI Taxonomy dump files: `citations.dmp`, `division.dmp`, `gencode.dmp`, `names.dmp`, `delnodes.dmp`, `gc.prt`, `merged.dmp`, and `nodes.dmp`.
- Returns a boolean indicating if the archive is a complete NCBI tax dump.
2. **Taxonomy Loading (`LoadNCBITarTaxDump`)**
- Parses the `.tar` archive and extracts key files to build a `Taxonomy` object.
- Steps include:
- **Nodes**: Loads taxonomic hierarchy (`nodes.dmp`) via `loadNodeTable`.
- **Names**: Parses scientific and common names (`names.dmp`) via `loadNameTable`, with an option to load *only scientific names* (`onlysn`).
- **Merged Taxa**: Integrates taxonomic aliases from `merged.dmp`, using `loadMergedTable`.
 - Sets the root taxon to NCBI's default (`taxid = 1`, i.e., *root*).
3. **Integration with Other Modules**
- Uses `obiutils.Ropen`, `TarFileReader` for robust file handling.
- Leverages `obitax.Taxonomy`, a structured representation of taxonomic data.
### Key Parameters
- `onlysn`: If true, only scientific names are loaded (reduces memory usage).
- `seqAsTaxa`: Reserved for future use; currently unused.
### Logging & Error Handling
- Uses `logrus` to log loading progress and counts.
- Returns descriptive errors if required files or the root taxon are missing.
> **Note**: Designed for efficient, standards-compliant ingestion of NCBI Taxonomy data in bioinformatics pipelines.
@@ -0,0 +1,31 @@
# Newick Format Export Functionality in `obiformats`
This Go package provides utilities to export taxonomic data into the **Newick format**, a standard for representing phylogenetic trees.
## Core Components
- `Tree`: A struct modeling a node in a Newick tree, containing:
- `Children`: list of child nodes (nested trees),
- `TaxNode`: reference to a taxonomic entry (`obitax.TaxNode`),
- `Length`: optional branch length (evolutionary distance).
- **`Newick()` methods**:
- `Tree.Newick(...)`: Recursively generates a Newick string for the subtree.
Supports optional annotations: `scientific_name`, `taxid` (with `'@'` for rank), and branch lengths.
- Package-level `Newick(...)`: Converts a full taxon set into a Newick tree string using the root node from `taxa.Sort().Get(0)`.
- **Writing Functions**:
- `WriteNewick(...)`: Asynchronously writes the Newick representation to any `io.WriteCloser`.
- Accepts an iterator over taxa (`*obitax.ITaxon`).
- Validates single-taxonomy input.
- Applies compression (via `obiutils.CompressStream`) if configured via options (`WithOption`).
- `WriteNewickToFile(...)`: Convenience wrapper to write directly to a file.
- `WriteNewickToStdout(...)`: Outputs Newick tree to standard output.
## Configuration Options
Options (e.g., `WithScientificName`, `WithTaxid`, `WithRank`) control annotation content and behavior (e.g., file closing, compression).
## Semantic Summary
The module enables **conversion of hierarchical taxonomic datasets into structured Newick trees**, supporting rich node labeling for downstream phylogenetic or bioinformatic tools.
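The recursive shape of `Tree.Newick(...)` can be illustrated with a stripped-down sketch, leaving out the annotation options (`scientific_name`, `taxid`, branch lengths). The `node` struct and `newick` function are hypothetical stand-ins, not the package's API.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a simplified stand-in for the package's Tree struct.
type node struct {
	name     string
	children []*node
}

// newick renders a subtree as "(child,child,...)label", the core of
// the Newick format: leaves emit their label, internal nodes wrap
// their rendered children in parentheses.
func newick(n *node) string {
	if len(n.children) == 0 {
		return n.name
	}
	parts := make([]string, len(n.children))
	for i, c := range n.children {
		parts[i] = newick(c)
	}
	return "(" + strings.Join(parts, ",") + ")" + n.name
}

func main() {
	root := &node{name: "root", children: []*node{
		{name: "A"},
		{name: "B", children: []*node{{name: "C"}, {name: "D"}}},
	}}
	fmt.Println(newick(root) + ";") // → (A,(C,D)B)root;
}
```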
@@ -0,0 +1,47 @@
# NGSFilter Configuration Parser — Semantic Overview
This Go package (`obiformats`) provides robust parsing and validation of NGS (Next-Generation Sequencing) filter configurations used in the OBITools4 ecosystem. It supports two formats: a legacy line-based text format (`ReadOldNGSFilter`) and a modern CSV-based configuration format with parameter headers.
## Core Functionality
- **Format Detection**:
`OBIMimeNGSFilterTypeGuesser` detects MIME type using content sniffing (via [`mimetype`](https://github.com/gabriel-vasile/mimetype)), distinguishing between `text/csv`, custom `text/ngsfilter-csv`, and plain text.
A heuristic CSV detector (`NGSFilterCsvDetector`) validates structure (consistent column count, non-empty rows).
- **Dual Input Parsing**:
- `ReadOldNGSFilter`: Parses line-based config files (e.g., lines like `"EXP1@SAMPLE1:TAGFWD-TAGREV primer_f primer_r"`), supporting:
- Primer pairs (`forward`, `reverse`)
- Tag pairs (with optional `-` for untagged direction)
- Experiment/sample metadata
- OBIFeatures annotations (via `ParseOBIFeatures`)
- `ReadCSVNGSFilter`: Parses structured CSV files with mandatory columns:
`"experiment"`, `"sample"`, `"sample_tag"`, `"forward_primer"`, `"reverse_primer"`
Additional columns are stored as annotations.
- **Parameter Configuration**:
A rich set of `@param` lines (in CSV or legacy format) configures global/primers-specific settings:
- `spacer`, `forward_spacer`, `reverse_spacer`: Tag-primer spacing (bp)
- `tag_delimiter` / directional variants: Symbol separating tags in sequences
- `matching`: Tag matching algorithm (e.g., exact, fuzzy)
- Error tolerance:
`primer_mismatches`, `forward_mismatches`, `reverse_mismatches` (max mismatches)
`tag_indels`, `forward_tag_indels`, etc. (allow indel errors)
- Indel handling:
`indels` / directional variants (`true/false`) to enable/disable indels in primer matching
- **Validation & Integrity Checks**:
- `CheckPrimerUnicity`: Ensures each primer pair is defined only once.
- Duplicate tag-pair detection per marker (error on reuse).
- Strict column/field validation with informative error messages.
- **Logging & Observability**:
Uses `logrus` for detailed info/warnings (e.g., parameter application, skipped unknown params).
## Design Highlights
- **Extensibility**: New parameters can be added via `library_parameter` map.
- **Robustness**: Handles BOM, line continuation (`ReadLines`), CSV quirks (lazy quotes, comments).
- **Semantic Clarity**: Separates *data* (samples/markers/tags) from *configuration* (parameters).
- **Integration Ready**: Returns a validated `obingslibrary.NGSLibrary` ready for downstream processing.
> **Use Case**: Enables reproducible, metadata-rich NGS filtering setups in metabarcoding workflows.
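The legacy line format can be sketched as follows, using exactly the example shape quoted above (`"EXP1@SAMPLE1:TAGFWD-TAGREV primer_f primer_r"`). This is an illustrative decoder only — the real `ReadOldNGSFilter` fills an `obingslibrary.NGSLibrary` and handles many more cases; `legacyEntry` and `parseLegacyLine` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// legacyEntry holds one parsed sample line.
type legacyEntry struct {
	Experiment, Sample           string
	ForwardTag, ReverseTag       string
	ForwardPrimer, ReversePrimer string
}

// parseLegacyLine decodes "EXP@SAMPLE:FWDTAG-REVTAG primer_f primer_r".
// A missing tag on either side of "-" means "untagged" in that direction.
func parseLegacyLine(line string) (legacyEntry, error) {
	var e legacyEntry
	fields := strings.Fields(line)
	if len(fields) != 3 {
		return e, fmt.Errorf("expected 3 fields, got %d", len(fields))
	}
	head, tags, ok := strings.Cut(fields[0], ":")
	if !ok {
		return e, fmt.Errorf("missing ':' in %q", fields[0])
	}
	e.Experiment, e.Sample, _ = strings.Cut(head, "@")
	e.ForwardTag, e.ReverseTag, _ = strings.Cut(tags, "-")
	e.ForwardPrimer, e.ReversePrimer = fields[1], fields[2]
	return e, nil
}

func main() {
	e, _ := parseLegacyLine("EXP1@SAMPLE1:aacgt-ttgca GGWACWGG TAIACYTC")
	fmt.Printf("%s/%s tags=%s,%s\n", e.Experiment, e.Sample, e.ForwardTag, e.ReverseTag)
}
```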
@@ -0,0 +1,14 @@
# Semantic Description of `obiformats` Package Functionalities
The `go` package `obiformats` provides a flexible, configuration-driven framework for handling biological sequence data (e.g., FASTA/FASTQ) and associated metadata. Its core component is the `Options` type, which encapsulates user-defined settings via an immutable configuration pattern using functional setters (`WithOption`).
Key capabilities include:
- **I/O control**: file handling options (e.g., `OptionCloseFile`, `OptionsAppendFile`), compression support (`OptionsCompressed`), and batch processing modes (e.g., `FullFileBatch`, custom `BatchSize`).
- **Parallelism & performance tuning**: configurable number of workers (`OptionsParallelWorkers`) and memory buffer size (via `TotalSeqSize`).
- **Sequence parsing/formatting**: pluggable header parsers/writers for FASTA/FASTQ (e.g., `OptionsFastSeqHeaderParser`, `OptionFastSeqDoNotParseHeader`), with support for quality scores (`OptionsReadQualities`).
- **CSV export**: granular control over columns (ID, sequence, quality, taxon, count), separators (`CSVSeparator`), NA values (`CSVNAValue`), and auto-inferred keys (`CSVAutoColumn`).
- **Taxonomic metadata integration**: toggles for taxid, scientific name, rank, path (with/without root), parent relationships (`OptionsWithTaxid`, `OptionWithoutRootPath`), and U→T conversion for ambiguous bases.
- **Advanced features**: feature table inclusion (`WithFeatureTable`), pattern matching support (`OptionsWithPattern`), and paired-end read handling via `WritePairedReadsTo`.
- **Metadata extensibility**: arbitrary metadata fields can be attached via `OptionsWithMetadata`, with automatic cleanup (e.g., removal of `"query"` when pattern mode is active).
All options are initialized with sensible defaults (e.g., `batch_size`, `parallel_workers`) and can be composed using the `MakeOptions` constructor. This design enables declarative, reusable configuration across sequence processing pipelines in OBITools4.
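The functional-setter pattern described above can be sketched generically. Field names and defaults here are illustrative, not the package's actual ones; only the overall `WithOption`/`MakeOptions` shape is taken from the text.

```go
package main

import "fmt"

// Options is a bag of settings; setters mutate it only during construction.
type Options struct {
	batchSize       int
	parallelWorkers int
	compressed      bool
}

// WithOption is a functional setter applied by the constructor.
type WithOption func(*Options)

func OptionsBatchSize(n int) WithOption {
	return func(o *Options) { o.batchSize = n }
}

func OptionsCompressed(c bool) WithOption {
	return func(o *Options) { o.compressed = c }
}

// MakeOptions seeds defaults, then applies every setter in order, so
// later options override earlier ones and unset fields keep defaults.
func MakeOptions(setters []WithOption) Options {
	o := Options{batchSize: 5000, parallelWorkers: 4}
	for _, set := range setters {
		set(&o)
	}
	return o
}

func main() {
	o := MakeOptions([]WithOption{OptionsBatchSize(100), OptionsCompressed(true)})
	fmt.Println(o.batchSize, o.parallelWorkers, o.compressed) // → 100 4 true
}
```

This design keeps call sites declarative: a pipeline stage passes only the options it cares about and inherits sensible defaults for the rest.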
@@ -0,0 +1,27 @@
# `ropeScanner` — Line-by-Line Text Scanning over a Rope Data Structure
The `obiformats` package provides the `ropeScanner`, an efficient line-oriented iterator over a *Rope* (a tree-based immutable string representation, implemented here as `PieceOfChunk`). This scanner supports streaming large texts without full materialization.
## Core Functionality
- **`newRopeScanner(rope *PieceOfChunk)`**
Constructs a new scanner starting at the root of the rope.
- **`ReadLine() []byte`**
Returns the next line (without trailing `\n`, or `\r\n`) as a byte slice.
- Returns `nil` when the end of the rope is reached.
- Reuses internal buffers (`carry`) to handle lines spanning multiple nodes efficiently.
- The returned slice aliases rope data and is only valid until the next call.
- **`skipToNewline()`**
Advances internal position to just after the next newline (`\n`), discarding content. Useful for skipping unwanted lines or headers.
## Implementation Highlights
- **Buffered carry-over**: Lines split across rope nodes are assembled incrementally in the `carry` buffer, which grows dynamically.
- **Cross-platform line endings**: Automatically strips `\r\n`, leaving only the content (no trailing CR).
- **Zero-copy where possible**: When a line fits entirely within one node and no carry exists, it returns a slice directly into the rope's underlying data.
## Use Case
Ideal for parsing large text files or streams (e.g., OBIE/Obi formats) where memory efficiency and streaming behavior are critical—without loading the entire content into RAM.
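The carry-over mechanics can be modeled on a plain list of byte chunks. This is a simplified analogue of `ropeScanner` (no `PieceOfChunk` tree, no `skipToNewline`); `chunkScanner` is a hypothetical name.

```go
package main

import (
	"bytes"
	"fmt"
)

// chunkScanner returns one line at a time from a sequence of byte
// chunks, assembling lines that span chunk boundaries in a reusable
// carry buffer. Returned slices are only valid until the next call.
type chunkScanner struct {
	chunks [][]byte
	pos    int // offset inside chunks[0]
	carry  []byte
}

func (s *chunkScanner) ReadLine() []byte {
	s.carry = s.carry[:0]
	for len(s.chunks) > 0 {
		data := s.chunks[0][s.pos:]
		if i := bytes.IndexByte(data, '\n'); i >= 0 {
			s.pos += i + 1
			line := data[:i]
			if len(s.carry) > 0 {
				line = append(s.carry, line...)
			}
			return bytes.TrimSuffix(line, []byte{'\r'}) // handle \r\n
		}
		// Line continues in the next chunk: stash what we have.
		s.carry = append(s.carry, data...)
		s.chunks = s.chunks[1:]
		s.pos = 0
	}
	if len(s.carry) > 0 { // final line without trailing newline
		return bytes.TrimSuffix(s.carry, []byte{'\r'})
	}
	return nil // end of rope
}

func main() {
	s := &chunkScanner{chunks: [][]byte{[]byte("ab"), []byte("c\r\nde\n")}}
	for line := s.ReadLine(); line != nil; line = s.ReadLine() {
		fmt.Printf("%q\n", line)
	}
}
```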
@@ -0,0 +1,34 @@
# Taxonomy Loading Module (`obiformats`)
This Go package provides semantic functionality to automatically detect and load taxonomic data from various file formats. It supports flexible, format-agnostic taxonomy ingestion via a unified interface.
## Core Features
1. **Format Detection**
- `DetectTaxonomyFormat(path)` identifies the taxonomy source format by inspecting file type (directory, MIME-type), filename patterns, or structure.
- Supports:
• NCBI Taxdump (both directory and `.tar` archive)
• CSV files (`text/csv`)
• FASTA/FASTQ sequences (via `mimetype` detection)
2. **Modular Loaders**
- Returns a typed `TaxonomyLoader` function, enabling deferred loading with configurable options (`onlysn`, `seqAsTaxa`).
- Each loader abstracts format-specific parsing (e.g., NCBI `nodes.dmp`, FASTA header taxonomy extraction).
3. **Sequence-Based Taxonomy Extraction**
- For sequence files (FASTA/FASTQ), taxonomy is inferred from headers or associated metadata, using `ExtractTaxonomy()`.
4. **Integration with OBITools Ecosystem**
- Leverages `obitax.Taxonomy` as the canonical output structure.
- Uses custom MIME-type registration (`obiutils.RegisterOBIMimeType()`) for robust detection of bioinformatics formats.
5. **Error Handling & Logging**
- Graceful failure with descriptive errors; informative logging via `logrus`.
## Usage Flow
```go
tax, err := LoadTaxonomy("path/to/data", true, false) // onlysn=true, seqAsTaxa=false
```
The module enables interoperability across taxonomic data sources in metabarcoding workflows.
@@ -0,0 +1,26 @@
# OBIFORMATS Package: Semantic Description
The `obiformats` package provides robust, format-agnostic sequence reading capabilities for biological data in the OBITools4 ecosystem.
It supports automatic detection and parsing of common bioinformatics file formats via MIME-type inference:
- **FASTA** (`text/fasta`): identified by lines starting with `>`.
- **FASTQ** (`text/fastq`): detected via leading `@` characters.
- **ecoPCR2**: recognized by the header line `#@ecopcr-v2`.
- **EMBL** (`text/embl`): detected by lines starting with `ID `.
- **GenBank** (`text/genbank`): identified by either `LOCUS ` or legacy `"Genetic Sequence Data Bank"` headers.
- **CSV** (`text/csv`): generic tabular support.
Core functionality is exposed through:
- `OBIMimeTypeGuesser()`: inspects the first ~1 MiB of an input stream to infer MIME type using `github.com/gabriel-vasile/mimetype`, while preserving unread data for downstream processing.
- `ReadSequencesFromFile()`: reads sequences from a file path, infers format via MIME detection, and dispatches to dedicated parsers (e.g., `ReadFasta`, `ReadFastq`).
- `ReadSequencesFromStdin()`: convenience wrapper to read from stdin, treating `"-"` as filename and auto-closing the stream.
Internally leverages:
- `obiutils.Ropen()` for unified file opening (including stdin handling).
- Path extension stripping and source tagging via `OptionsSource()`.
- Logging (`logrus`) for format diagnostics.
- Iterator interface (`obiiter.IBioSequence`) to abstract sequential access over sequences.
The package ensures extensibility: new formats can be added by extending the `switch` dispatch in `ReadSequencesFromFile()` and registering corresponding MIME types.
Error handling covers empty files, invalid streams, and unsupported formats via explicit logging or fatal exits.
@@ -0,0 +1,29 @@
# `obiformats` Package: Sequence Writing Utilities
This Go package provides utilities for writing biological sequence data to files or standard output in FASTA/FASTQ formats.
## Core Functionality
- **`WriteSequence()`**:
Main dispatcher that detects sequence quality data and writes either FASTQ (if qualities present) or FASTA.
- Accepts an `IBioSequence` iterator, a writable stream (`io.WriteCloser`), and optional configuration.
- Preserves iterator state via `PushBack()` to allow chaining.
- **`WriteSequencesToStdout()`**:
Convenience wrapper writing sequences to `stdout`. Automatically closes the output stream.
- **`WriteSequencesToFile()`**:
Writes sequences to a specified file. Supports:
- File creation/truncation or append mode (`OptionAppendFile()`).
- Paired-end output: writes mate pairs to a second file if `OptionSavePaired()` is enabled.
## Design Highlights
- **Format-Aware Dispatch**: Automatically selects FASTQ vs. FASTA based on presence of quality scores (`HasQualities()`).
- **Iterator Preservation**: Ensures non-consumed sequences remain available after write operations.
- **Error Handling & Logging**: Uses `logrus` for fatal errors during file I/O; returns structured error codes.
- **Configurable Options**: Extensible via `WithOption` pattern (e.g., append mode, paired-end handling).
## Integration
Designed for use within the OBITools4 ecosystem—works with `obiiter.IBioSequence` iterators to support streaming, memory-efficient processing of large sequencing datasets.
@@ -0,0 +1,13 @@
## Uint128 Type in `obifp`: Semantic Overview
This Go package defines a custom 128-bit unsigned integer type (`Uint128`) composed of two `uint64` limbs (high and low). It provides comprehensive arithmetic, comparison, bitwise operations, and type conversions.
- **Basic Constructors**: `Zero()`, `MaxValue()` initialize the smallest/largest possible values.
- **State Checks**: `IsZero()`, and equality/comparison methods (`Equals`, `Cmp`, `<`, `>`, etc.) enable conditional logic.
- **Type Casting**: Safe conversions to/from smaller (`Uint64`, `uint64`) and larger (`Uint256`) integer types, with overflow warnings where applicable.
- **Arithmetic**: Full support for addition (`Add`, `Add64`), subtraction (`Sub`), multiplication (`Mul`, `Mul64`) — with panic on overflow.
- **Division & Modulo**: Integer division (`Div`, `Div64`) and remainder (`Mod`, `Mod64`), implemented via optimized quotient-remainder pairs (`QuoRem`, `QuoRem64`) using hardware-assisted 64-bit operations.
- **Bit Manipulation**: Left/right shifts (`LeftShift`, `RightShift`), and bitwise logic: AND, OR, XOR, NOT.
- **Utility**: Direct access to low limb via `AsUint64()`.
All operations preserve 128-bit precision, with strict overflow checking for correctness in high-precision contexts (e.g., bioinformatics counting).
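The carry propagation and strict overflow check can be sketched with `math/bits`. This is a minimal two-limb addition in the spirit of `Add`, not the package's code; the `uint128` struct here is hypothetical.

```go
package main

import (
	"fmt"
	"math/bits"
)

// uint128 models the two-limb layout: hi holds bits 64..127, lo bits 0..63.
type uint128 struct{ hi, lo uint64 }

// add propagates the carry out of the low limb into the high limb and
// panics if a carry leaves the high limb (128-bit overflow).
func (u uint128) add(v uint128) uint128 {
	lo, carry := bits.Add64(u.lo, v.lo, 0)
	hi, carry := bits.Add64(u.hi, v.hi, carry)
	if carry != 0 {
		panic("uint128 overflow")
	}
	return uint128{hi: hi, lo: lo}
}

func main() {
	a := uint128{lo: ^uint64(0)} // 2^64 - 1
	b := uint128{lo: 1}
	s := a.add(b)
	fmt.Printf("%#x %#x\n", s.hi, s.lo) // → 0x1 0x0
}
```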
@@ -0,0 +1,17 @@
# `obifp.Uint128` Package — Semantic Feature Overview
This Go package provides a 128-bit unsigned integer type (`Uint128`) with comprehensive arithmetic, comparison, and bitwise operations. Internally represented as two `uint64` limbs (`w1`: high, `w0`: low), it supports:
- **Arithmetic Operations**
- `Add`, `Sub`, `Mul` (128×128), and `Mul64` (scalar multiplication)
- Division: `Div`, `Mod`, and combined quotient/remainder via `QuoRem` (and their 64-bit variants)
- **Comparison & Equality**
- `Cmp`, `Equals`, `LessThan`/`GreaterThan`, and their inclusive variants (`≤`, `≥`)
- Support for comparing against both `Uint128` and native `uint64` values
- **Bitwise Operations**
- Logical AND (`And`), OR (`Or`), XOR (`Xor`) between two `Uint128`s
- Bitwise NOT (`Not`) — inverts all bits of the value
- **Conversion & Utility**
- `AsUint64()` safely truncates to lower 64 bits (assumes upper limb is zero)
All operations handle overflow/underflow correctly, including carry propagation in addition and borrow handling in subtraction. Tests cover edge cases: zero values, max `uint64` boundaries (e.g., wrapping in addition/subtraction), and large multiplications. Designed for cryptographic or high-precision numeric use where native integer types are insufficient.
@@ -0,0 +1,30 @@
# Uint256 Type and Operations — Semantic Overview
The `obifp` package provides a custom 256-bit unsigned integer type (`Uint256`) implemented in Go, composed of four 64-bit limbs (`w0` to `w3`). It supports arithmetic, comparison, bitwise operations, and safe casting with overflow detection.
- **Core Representation**: `Uint256` stores values as four 64-bit words, enabling arbitrary-precision unsigned integers up to $2^{256} - 1$.
- **Utility Methods**:
- `Zero()` / `MaxValue()`: Return the neutral and maximum values.
- `IsZero()`, `Equals(v)`, comparison methods (`LessThan`, etc.): Enable logical and ordering checks.
- **Casting & Conversion**:
- `Uint64()`, `Uint128()` downcast with warnings on overflow.
- `Set64(v)`: Initializes from a standard `uint64`.
- `AsUint64()`: Direct access to least-significant limb.
- **Bitwise Operations**:
- `And`, `Or`, `Xor`, `Not`: Standard bitwise logic per limb.
- **Shifts**:
- `LeftShift(n)` / `RightShift(n)`: Multi-limb shifts with carry propagation.
- **Arithmetic**:
 - `Add(v)`, `Sub(v)`, `Mul(v)`: Use Go's `math/bits` for carry-aware operations; panic on overflow.
- `Div(v)`: Implements long division via repeated subtraction of shifted multiples; panics on zero divisor.
- **Safety & Logging**:
- Warnings via `obilog.Warnf` for silent overflows during narrowing casts.
- Panics on arithmetic overflow or division-by-zero using `log.Panicf`.
This type is suitable for cryptographic, genomic (OBITools), or high-precision counting use cases requiring precise control over large unsigned integers.
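The multi-limb shift with carry propagation can be sketched on a raw four-limb array. This illustrates the mechanism behind `LeftShift(n)` only, under the assumption of little-endian limb order (`w0` least significant); `shiftLeft256` is a hypothetical helper, not the package's method.

```go
package main

import "fmt"

// shiftLeft256 shifts a little-endian 4-limb value left by n bits
// (0 <= n < 256). Whole-limb moves are handled by indexing; the
// remaining sub-limb shift pulls carry bits from the limb below.
func shiftLeft256(w [4]uint64, n uint) [4]uint64 {
	var r [4]uint64
	limb, bit := int(n/64), n%64
	for i := 3; i >= limb; i-- {
		r[i] = w[i-limb] << bit
		if bit > 0 && i-limb-1 >= 0 {
			r[i] |= w[i-limb-1] >> (64 - bit) // carry from lower limb
		}
	}
	return r
}

func main() {
	v := [4]uint64{0x8000000000000000, 0, 0, 0} // 2^63
	r := shiftLeft256(v, 1)
	fmt.Printf("%#x\n", r[1]) // top bit carried into the second limb → 0x1
}
```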
@@ -0,0 +1,34 @@
# Uint64 Type Functionalities Overview
The `obifp` package provides a custom `Uint64` type wrapping Go's native 64-bit unsigned integer (`uint64`) to support arithmetic, bitwise operations, and type conversions in a structured way.
## Core Operations
- **`Zero()` / `MaxValue()`**: Returns the zero and maximum representable values, respectively.
- **`IsZero()` / `Equals(v)`**: Checks if the value is zero or equal to another.
- **`Cmp(v)`, `LessThan(v)`**, etc.: Standard comparison operations returning `-1/0/+1` or boolean results.
## Arithmetic with Overflow Detection
- **Add/Sub/Mul**: Performs 64-bit addition, subtraction, and multiplication.
- Uses `math/bits` for low-level operations (`bits.Add64`, etc.).
- Panics on overflow (carry ≠ 0), enforcing strict safety.
## Bitwise Operations
- **`And`, `Or`, `Xor`, `Not()`**: Standard bitwise logic operations.
- **`LeftShift(n)` / `RightShift(n)`**:
- Shifts bits left/right by *n* positions.
- Uses internal `LeftShift64`/`RightShift64`, supporting *carry-in* for multi-word arithmetic.
## Extended Precision Conversions
- **`Uint128()` / `Uint256()`**: Casts the 64-bit value into larger unsigned integer types (zero-extended).
- **`Set64(v)`**: Reassigns the internal value from a raw `uint64`.
## Utility & Logging
- **`AsUint64()`**: Extracts the underlying `uint64`.
- **Warning on overflow in shift operations** (e.g., shifts ≥ 128 bits) via `obilog.Warnf`.
> Designed for use in high-precision or cryptographic contexts where explicit overflow handling and type safety are critical.
@@ -0,0 +1,32 @@
# Obifp Package: Generic Fixed-Point Unsigned Integer Operations
This Go package (`obifp`) provides a generic, type-safe interface for fixed-point unsigned integer arithmetic over three size variants: `Uint64`, `Uint128`, and `Uint256`.
## Core Interface: `FPUint[T]`
The interface defines a unified API for unsigned integer types, supporting:
- **Initialization & Conversion**:
- `Zero()`, `Set64(v)`: Create zero or set from a `uint64`.
- `AsUint64()`: Downcast to standard `uint64`.
- **Logical Operations**:
- Bitwise: `And`, `Or`, `Xor`, `Not`.
- Shifts: `LeftShift(n)`, `RightShift(n)`.
- **Arithmetic**:
- Addition (`Add`), subtraction (`Sub`), multiplication (`Mul`). Division is commented out—likely reserved for future implementation.
- **Comparison**:
- Full ordering: `<`, `<=`, `>`, `>=`.
- **Utility Predicates**:
- `IsZero()` for zero-checking.
## Helper Functions
- `ZeroUint[T]`: Returns the neutral element (zero) for type `T`.
- `OneUint[T]`: Constructs value 1 via `Set64(1)`.
- `From64[T]`: Converts a standard Go `uint64` into the generic type.
All operations are **method-chaining friendly** (return `T`, not pointers), enabling fluent syntax. The design promotes correctness and performance in cryptographic or financial contexts where large, fixed-size integers are required.
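The self-referential generic constraint that makes this work can be sketched as follows. The interface and helper names here are illustrative reductions of `FPUint[T]`, `From64`, and friends, with a toy `uint64`-backed implementation standing in for the real types.

```go
package main

import "fmt"

// fpUint is a self-referential constraint: each method returns the
// concrete type T, which is what makes chaining possible.
type fpUint[T any] interface {
	Set64(v uint64) T
	Add(v T) T
	AsUint64() uint64
	IsZero() bool
}

// u64 is a toy implementation over a plain uint64.
type u64 struct{ v uint64 }

func (a u64) Set64(v uint64) u64 { return u64{v} }
func (a u64) Add(b u64) u64      { return u64{a.v + b.v} }
func (a u64) AsUint64() uint64   { return a.v }
func (a u64) IsZero() bool       { return a.v == 0 }

// sum is generic over any width: the same code would drive a
// 128- or 256-bit implementation of the constraint.
func sum[T fpUint[T]](xs []uint64) T {
	var acc T
	for _, x := range xs {
		var t T
		acc = acc.Add(t.Set64(x))
	}
	return acc
}

func main() {
	total := sum[u64]([]uint64{1, 2, 3})
	fmt.Println(total.AsUint64()) // → 6
}
```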
@@ -0,0 +1,30 @@
# `obigraph` Package: Semantic Overview
The `obigraph` package provides a generic, type-safe undirected/directed graph implementation in Go. Its core features include:
- **Generic Graph Structure**: Parametrized over vertex type `V` and edge data type `T`, enabling flexible use with arbitrary user-defined types.
- **Bidirectional Edge Tracking**: Maintains both forward (`Edges`) and reverse (`ReverseEdges`) adjacency maps for efficient neighbor/parent queries.
- **Edge Management**:
- `AddEdge`: Adds an *undirected* edge (inserted in both directions).
- `AddDirectedEdge`: Adds a *directed* edge (only one direction).
- `SetAsDirectedEdge`: Converts an existing undirected edge into a directed one by removing the reverse link.
- **Graph Queries**:
- `Neighbors(v)`: Returns all adjacent vertices (outgoing in directed case).
- `Parents(v)`: Returns incoming neighbors via reverse adjacency.
- `Degree(v)` / `ParentDegree(v)`: Compute vertex degrees (total or incoming).
- **Customizable Vertex/Edge Properties**:
- `VertexWeight`, `EdgeWeight`: Funcs to assign weights (default: constant weight = 1.0).
- `VertexId`: Custom vertex label generator (default: `"V%d"`).
- **GML Export**:
- `Gml(...)` / `WriteGml(...)`: Generates or writes a Graph Modelling Language (GML) representation.
- Supports directed/undirected modes, degree-based filtering (`min_degree`), and visual styling:
- Vertex shape: `circle` if weight ≥ threshold, else `rectangle`.
- Size scaled by square root of vertex weight.
 - Uses Go's `text/template` for rendering.
- **File I/O**: Directly writes GML to file via `WriteGmlFile(...)`.
- **Logging & Safety**: Uses Logrus for bounds-checking errors; panics on template parsing/writing failures.
The package is designed for lightweight, high-performance graph modeling and visualization-ready export.
@@ -0,0 +1,14 @@
# `obigraph.GraphBuffer` Feature Overview
The `GraphBuffer[V, T]` type provides a **thread-safe graph construction interface** using buffered edge insertion via Go channels.
- **Asynchronous Edge Addition**: Edges are enqueued through a `chan Edge[T]`, processed in the background by a goroutine that updates an underlying static graph (`Graph[V, T]`).
- **Non-blocking API**: `AddEdge` and `AddDirectedEdge` are asynchronous — they send to the channel without waiting for graph mutation, enabling high-throughput edge ingestion.
- **Graph Initialization**: `NewGraphBuffer` initializes both the graph and a dedicated worker goroutine to consume edges.
- **GML Export Support**: Full support for exporting the final graph in [Graph Modelling Language (GML)](https://en.wikipedia.org/wiki/Graph_Modelling_Language), with optional filtering (`min_degree`) and layout parameters (`threshold`, `scale`).
- **File & Stream Output**: Methods `WriteGml` and `WriteGmlFile` allow writing GML to any `io.Writer`, including files.
- **Resource Cleanup**: The explicit `Close()` method terminates the worker goroutine by closing the channel, ensuring clean shutdown.
- **Generic Design**: Fully generic over vertex (`V`) and edge data types (`T`), supporting arbitrary value semantics.
> ⚠️ **Note**: The buffer is *not* safe for concurrent `AddEdge` calls without external synchronization beyond channel semantics.
> ✅ Ideal for producer-consumer patterns where edges are streamed from multiple goroutines into a single graph.
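The producer-consumer pattern above can be sketched with a channel and a single worker goroutine that owns the adjacency state, so the graph itself needs no lock. Type and method names are hypothetical simplifications of `GraphBuffer`.

```go
package main

import (
	"fmt"
	"sync"
)

type edge struct{ from, to int }

// edgeBuffer: AddDirectedEdge only sends on a channel; one background
// goroutine consumes edges and mutates the adjacency map.
type edgeBuffer struct {
	ch   chan edge
	done sync.WaitGroup
	adj  map[int][]int
}

func newEdgeBuffer() *edgeBuffer {
	b := &edgeBuffer{ch: make(chan edge, 100), adj: map[int][]int{}}
	b.done.Add(1)
	go func() {
		defer b.done.Done()
		for e := range b.ch {
			b.adj[e.from] = append(b.adj[e.from], e.to)
		}
	}()
	return b
}

func (b *edgeBuffer) AddDirectedEdge(from, to int) { b.ch <- edge{from, to} }

// Close terminates the worker by closing the channel, then waits for
// it to drain — after Close returns, reading adj is race-free.
func (b *edgeBuffer) Close() {
	close(b.ch)
	b.done.Wait()
}

func main() {
	b := newEdgeBuffer()
	b.AddDirectedEdge(1, 2)
	b.AddDirectedEdge(1, 3)
	b.Close()
	fmt.Println(b.adj[1]) // → [2 3]
}
```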
@@ -0,0 +1,29 @@
# BioSequenceBatch: A Container for Ordered Biological Sequences
`BioSequenceBatch` is a structured data type encapsulating an ordered collection of biological sequences (`obiseq.BioSequenceSlice`) along with metadata: a `source` identifier and an integer `order`. It serves as a lightweight, immutable-friendly container for batch processing in bioinformatics pipelines.
## Core Properties
- **`source`**: String identifying the origin (e.g., file, pipeline stage).
- **`order`**: Integer defining processing sequence or priority.
- **`slice`**: Holds the actual sequences via `obiseq.BioSequenceSlice`.
## Key Functionalities
- **Construction**:
`MakeBioSequenceBatch(source, order, sequences)` creates a new batch.
- **Accessors**:
`Source()`, `Order()` return metadata; `Slice()` exposes the sequence slice.
- **Mutation (via copy)**:
`Reorder(newOrder)` returns a new batch with updated order.
- **Size & emptiness**:
`Len()` gives sequence count; `NotEmpty()` checks non-emptiness.
- **Consumption**:
`Pop0()` removes and returns the first sequence (FIFO behavior).
- **Safety**:
`IsNil()` detects uninitialized batches; a global `NilBioSequenceBatch` sentinel exists.
## Design Notes
- Instances are value types (struct), enabling safe copying.
- Operations follow Go idioms: methods return updated values rather than mutating in place (except internal slice mutation via `Pop0`).
- Designed for interoperability with the OBITools4 ecosystem (`obiseq` package).
This abstraction supports modular, traceable sequence processing workflows—ideal for pipeline stages where ordering and provenance matter.
@@ -0,0 +1,47 @@
# `obiiter`: Stream-Based Biosequence Iterator Library
This Go package provides a concurrent, batch-oriented iterator for processing large collections of biological sequences (`BioSequence`), designed for high-throughput NGS data pipelines.
## Core Functionality
- **Batched Streaming**: Reads sequences in configurable batches (`BioSequenceBatch`) via a channel-based iterator.
- **Thread Safety**: Uses `sync.WaitGroup`, RWMutex, and atomic flags for safe concurrent access.
- **Lazy Evaluation**: Iteration is on-demand via `Next()`/`Get()`, supporting memory-efficient processing.
## Iterator Management
- **Construction**: `MakeIBioSequence()` initializes a new iterator with default settings.
- **Lifecycle Control**:
- `Add(n)`, `Done()`: Track active workers (like goroutines).
- `Lock/RLock` and `Unlock/RUnlock`: Explicit synchronization.
- `Wait()` / `Close()`, `WaitAndClose()`: Graceful shutdown.
## Batch Transformation & Reorganization
- **`Rebatch(size)`**: Redistributes sequences into fixed-size batches (requires sorting).
- **`RebatchBySize(maxBytes, maxCount)`**: Dynamic batching respecting memory and count limits.
- **`SortBatches()`**: Ensures batches are emitted in strict order (by `order` field).
- **Concatenation & Pooling**:
- `Concat(...)`: Sequentially merges multiple iterators.
- `Pool(...)`: Interleaves batches from several sources (preserves order via renumbering).
## Filtering & Predicate-Based Processing
- **`FilterOn(pred, size)`**: Applies a sequence predicate in parallel (configurable workers), recycling discarded sequences.
- **`FilterAnd(pred, size)`**: Same as `FilterOn`, but also checks paired-end consistency.
- **`DivideOn(pred, size)`**: Splits input into two iterators (`true`, `false`) based on predicate.
## Utility & Analysis
- **`Load()`**: Collects all sequences into a single slice (for small datasets).
- **`Count(recycle)`**: Returns `(variants, reads, nucleotides)`.
- **`Consume()` / `Recycle()`**: Drains iterator, optionally triggering sequence recycling.
- **`CompleteFileIterator()`**: Reads entire remaining file as one batch.
## Additional Features
- Supports **paired-end data** via `MarkAsPaired()` / `IsPaired()`.
- Batch ordering preserved for downstream reproducibility.
- Integrates with OBITools4's `obidefault`, `obiutils` for config and resource management.
> Designed for scalability, low memory footprint, and composability in bioinformatics workflows.
@@ -0,0 +1,32 @@
# `IDistribute`: Semantic Description of Biosequence Distribution Functionality
The `IDistribute` type implements a thread-safe mechanism for distributing biosequences into classified, batched outputs.
- **Core Purpose**: Enables concurrent processing of sequences by routing them to dedicated output channels based on classification keys.
- **Key Fields**:
- `outputs`: A map from integer class codes to output streams (`IBioSequence`).
- `news`: An unbuffered channel emitting class codes when new output streams are created.
- `classifier`: A pointer to a sequence classifier used to assign sequences to keys during distribution.
- **Thread Safety**: All access to shared state (`outputs`, `slices`) is synchronized via a mutex.
- **Batching Strategy**:
- Sequences are accumulated per class key until either `BatchSizeMax()` sequences or `BatchMem()` bytes (per key) are reached.
- Batches are flushed automatically and on finalization.
- **Asynchronous Processing**:
- The `Distribute()` method launches a goroutine that consumes the input iterator, classifies each sequence, and feeds batches to per-key outputs.
- Outputs are closed only after all sequences have been processed.
- **Notifications**:
- The `News()` channel allows consumers to be notified of newly created output streams (i.e., when a new class key appears).
- **Error Handling**:
- `Outputs(key)` returns an error if the requested key has no associated output.
- **Integration**:
- Leverages `obidefault.BatchSizeMax()` and `BatchMem()` for configurable batch limits.
- Uses `SortBatches()` on the input iterator to ensure ordered processing.
In summary, `IDistribute` provides a scalable, concurrent pipeline for classifying and batching biosequences based on user-defined classification logic.
@@ -0,0 +1,24 @@
# `ExtractTaxonomy` Function — Semantic Description
The `ExtractTaxonomy` method is a core utility in the `obiiter` package, designed to aggregate taxonomic information across biological sequences processed by an iterator.
- **Input**:
- A pointer to `IBioSequence`, representing a sequence iterator over biological data.
- A boolean flag `seqAsTaxa`: if true, each full sequence is treated as a single taxonomic unit; otherwise, individual elements within slices are processed separately.
- **Process**:
- Iterates through all sequences via `iterator.Next()` and retrieves each current slice using `Get().Slice()`.
- For every slice, it calls the underlying `.ExtractTaxonomy()` method (from `obitax`), progressively building or updating a shared `*obitax.Taxonomy` object.
- Stops and returns immediately upon encountering the first error during taxonomy extraction.
- **Output**:
- Returns a fully populated `*obitax.Taxonomy` object (or partial result if early failure occurs).
- Returns `nil` error on success; otherwise, returns the first encountered error.
- **Semantic Role**:
Enables scalable taxonomic profiling of high-throughput sequencing data by delegating per-slice extraction logic to the `obitax` module, while ensuring robust iteration and error handling.
- **Dependencies**:
Relies on `obitax.Taxonomy` for structured taxonomic representation and assumes slices implement the `.ExtractTaxonomy()` interface.
This function exemplifies a *map-reduce*-style pattern: mapping taxonomy extraction over slices, and reducing results into a unified taxonomic summary.
@@ -0,0 +1,28 @@
# `IFragments` Functionality Overview
The `IFragments()` function in the `obiiter` package implements a parallelized sequence fragmentation pipeline for biological sequences. It is designed to split long nucleotide or protein sequences into smaller, overlapping fragments while preserving metadata and enabling concurrent processing.
## Core Parameters
- `minsize`: Length a sequence must exceed to be fragmented; shorter sequences pass through unchanged.
- `length`: Desired fragment size (in bases/amino acids).
- `overlap`: Number of overlapping residues between consecutive fragments.
- `size`, `nworkers`: Batch size and number of worker goroutines (currently unused in active logic).
## Workflow
1. **Batch Sorting**: Input sequences are batched and sorted for efficient processing.
2. **Parallel Fragmentation**:
- Each worker processes a subset of batches independently using goroutines.
- For each sequence longer than `minsize`, it is split into overlapping fragments of length `length` with step size = `length - overlap`.
- The final fragment is extended to cover the remainder (fusion mode), avoiding tiny trailing pieces.
3. **Resource Management**:
- Original sequences are recycled (`s.Recycle()`) to optimize memory usage.
- Fragments are reassembled into batches, sorted by source and order, then rebatched to respect memory/size limits.
## Key Features
- **Overlap handling**: Ensures contiguous coverage without gaps.
- **Memory efficiency**: Uses recycling and batched output.
- **Scalability**: Leverages Go concurrency, processing batches across worker goroutines.
- **Error handling**: Panics on subsequence errors (e.g., invalid indices).
## Use Case
Ideal for preparing long-read sequencing data (e.g., PacBio, Nanopore) or assembled contigs for downstream analysis requiring fixed-length inputs (e.g., k-mer indexing, ML inference).
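The windowing arithmetic described in step 2 can be sketched as follows. `fragment` is a hypothetical helper, not the `IFragments` code itself; it assumes `overlap < length` (otherwise the step would be non-positive) and shows the fusion rule: when the *next* window would overrun the sequence, the current fragment is extended to the end instead of emitting a tiny trailing piece.

```go
package main

import "fmt"

// fragment splits seq into windows of size length with step
// length-overlap. The final window is fused with the remainder,
// so every residue is covered and no short tail is emitted.
func fragment(seq string, minsize, length, overlap int) []string {
	if len(seq) <= minsize {
		return []string{seq} // short sequences skip fragmentation
	}
	step := length - overlap // assumes overlap < length
	var frags []string
	for start := 0; ; start += step {
		end := start + length
		if end+step > len(seq) {
			// fusion mode: extend the last fragment to the end
			frags = append(frags, seq[start:])
			break
		}
		frags = append(frags, seq[start:end])
	}
	return frags
}

func main() {
	// 10-mer, fragments of 4 with overlap 2 -> step 2.
	fmt.Println(fragment("ACGTACGTAC", 5, 4, 2)) // prints: [ACGT GTAC ACGT GTAC]
}
```

Note how consecutive fragments share `overlap` residues, giving the gap-free contiguous coverage the overview promises.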
# Memory-Limited Biosequence Iterator
This Go function extends an `IBioSequence` iterator with memory-aware throttling to prevent excessive heap allocation during data processing.
## Core Functionality
- **`LimitMemory(fraction float64)`**
Returns a new iterator that respects an upper bound on heap usage relative to total system memory.
- **Memory Monitoring**
Uses `runtime.ReadMemStats()` and `github.com/pbnjay/memory.TotalMemory()` to compute the current heap fraction (`Alloc / TotalMemory`) dynamically.
- **Backpressure Mechanism**
While the memory fraction exceeds `fraction`, the producer goroutine yields control (`runtime.Gosched()`) until sufficient memory becomes available.
- **Logging**
Warns via `obilog.Warnf` when:
- Memory pressure persists (every ~1000 yields),
- Or wait duration becomes unusually long (>10,000 yielding cycles).
- **Concurrency Model**
- A producer goroutine consumes from the original iterator and pushes items to `newIter`, pausing as needed.
- A dedicated consumer goroutine calls `WaitAndClose()` to ensure graceful termination and resource cleanup.
## Semantic Behavior
- **Non-blocking consumer**: Downstream consumers are not stalled; they read from an internal buffered channel (`newIter`).
- **Adaptive rate control**: The iterator automatically slows down when memory pressure rises, avoiding OOM conditions.
- **Predictable resource use**: Keeps heap usage approximately below the specified `fraction` (e.g., 0.5 → ≤ 50% of total RAM).
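The backpressure loop above can be sketched with the standard library alone. `waitForMemory` is a hypothetical helper: the real code compares `Alloc` against a fraction of total system RAM (via `github.com/pbnjay/memory`), while this dependency-free sketch uses a fixed byte budget.

```go
package main

import (
	"fmt"
	"runtime"
)

// waitForMemory cooperatively yields while the Go heap exceeds the
// given byte budget, returning how many times it yielded. This is
// the producer-side throttle described above, minus the logging.
func waitForMemory(budget uint64) int {
	var m runtime.MemStats
	yields := 0
	for {
		runtime.ReadMemStats(&m)
		if m.Alloc <= budget {
			return yields
		}
		yields++
		runtime.Gosched() // let consumers run and free memory
	}
}

func main() {
	// With a very generous budget the producer proceeds immediately.
	fmt.Println("yields:", waitForMemory(1 << 40)) // prints: yields: 0
}
```

In the real iterator this wait sits in the producer goroutine, so downstream consumers keep draining the buffered channel while the producer pauses.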
# Semantic Description of `IMergeSequenceBatch` and `MergePipe`
This code defines two related functions in the `obiiter` package for batch-wise merging of biological sequences during iteration.
- **`IMergeSequenceBatch(na, statsOn, sizes...) IBioSequence → IBioSequence`**
- Consumes an input sequence iterator (`IBioSequence`) and returns a new one.
- Groups incoming sequences into batches (default size: `100`, configurable via variadic argument).
- For each batch:
- Collects up to `batchsize` sequences via the input iterator.
- Applies `.Merge(na, statsOn)` on each sequence group (presumably merging reads based on `na`, e.g., nucleotide alignment or overlap).
- Wraps merged results into a `BioSequenceBatch` with ordering metadata.
- Emits batches asynchronously via goroutines; the output iterator is closed when input finishes.
- **`MergePipe(na, statsOn, sizes...) Pipeable → func(IBioSequence) IBioSequence`**
- A *pipeline combinator* (higher-order function), enabling functional composition.
- Returns a `Pipeable` — i.e., a transformation function compatible with iterator pipelines.
**Semantic Purpose**:
Enables efficient, memory-smoothed merging of biological sequence reads (e.g., paired-end merges) in streaming fashion, with optional statistics tracking (`statsOn`) and configurable batching.
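The combinator shape of `MergePipe` can be illustrated with simplified types. `Iter`, `Pipeable`, and `mergePipe` below are stand-ins, not the `obiiter` API: the point is that a configured stage is returned as a function, so it composes with other stages in a pipeline.

```go
package main

import "fmt"

// Iter stands in for IBioSequence; Pipeable is a pipeline stage.
type Iter []string

type Pipeable func(Iter) Iter

// mergePipe captures its configuration (here just a batch size) and
// returns a stage, in the same spirit as MergePipe(na, statsOn, sizes...).
func mergePipe(batchsize int) Pipeable {
	return func(in Iter) Iter {
		var out Iter
		for i := 0; i < len(in); i += batchsize {
			end := i + batchsize
			if end > len(in) {
				end = len(in)
			}
			// "Merge" a batch into a single element.
			out = append(out, fmt.Sprint(in[i:end]))
		}
		return out
	}
}

func main() {
	stage := mergePipe(2)
	fmt.Println(stage(Iter{"a", "b", "c"})) // prints: [[a b] [c]]
}
```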
# `NumberSequences` Function — Semantic Description
The `NumberSequences` method assigns a unique sequential identifier (`seq_number`) to each biological sequence in an `IBioSequence` iterator, preserving consistency for paired-end reads.
## Core Functionality
- **Sequential numbering**: Assigns consecutive integers to sequences, beginning at a user-defined `start` (default 0).
- **Thread-safe**: Uses `sync.Mutex` and `atomic.Int64` to safely manage the global counter during concurrent processing.
- **Paired-read support**: When input is paired (`IsPaired()`), both reads in a pair receive the *same* `seq_number`, ensuring alignment between mates.
## Parallelization Strategy
- **Default mode**: Uses multiple workers (`ParallelWorkers()`) for performance; batches are processed concurrently.
- **Reordering mode**: If `forceReordering` is true:
- Input iterator is batch-sorted (`SortBatches()`).
- Parallelism disabled (1 worker) to ensure deterministic numbering order.
## Implementation Details
- Each goroutine processes its own split of the input iterator.
- A shared `next_first` counter tracks the next available sequence number globally.
- Locking ensures atomic increment and assignment, preventing race conditions.
## Output
Returns a new `IBioSequence` iterator:
- Contains the same sequence batches (possibly reordered if sorted).
- Each `BioSequence` object now carries a `"seq_number"` attribute.
- Paired sequences are co-numbered and marked accordingly.
## Use Cases
- Preparing data for downstream tools requiring unique sequence IDs.
- Maintaining cross-read identity in paired-end workflows (e.g., assembly, mapping).
- Reproducible numbering across pipeline stages or restarts.
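The co-numbering of mates from a shared atomic counter can be sketched as below. `read` and `numberPairs` are hypothetical stand-ins for `BioSequence` and the numbering workers; the key invariant is that both mates of a pair receive the same `seq_number` even under concurrency.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// read stands in for a BioSequence with an attribute map.
type read struct{ attrs map[string]int64 }

// numberPairs assigns one number per pair from a shared atomic
// counter, with each pair handled by its own goroutine.
func numberPairs(pairs [][2]*read, start int64) {
	var next atomic.Int64
	next.Store(start)
	var wg sync.WaitGroup
	for _, p := range pairs {
		wg.Add(1)
		go func(p [2]*read) {
			defer wg.Done()
			n := next.Add(1) - 1 // atomic fetch-and-increment
			p[0].attrs["seq_number"] = n
			p[1].attrs["seq_number"] = n // mate gets the same number
		}(p)
	}
	wg.Wait()
}

func main() {
	pairs := [][2]*read{
		{{map[string]int64{}}, {map[string]int64{}}},
		{{map[string]int64{}}, {map[string]int64{}}},
	}
	numberPairs(pairs, 0)
	fmt.Println(pairs[0][0].attrs["seq_number"] == pairs[0][1].attrs["seq_number"]) // prints: true
}
```

In the real method a mutex additionally serializes batch-level bookkeeping; the sketch keeps only the counter to isolate the co-numbering idea.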
# Paired-End Sequence Handling in `obiiter`
This Go package provides semantic functionality for managing **paired-end biological sequences** within batched iterators.
- `BioSequenceBatch` methods:
- **`IsPaired()`**: Checks whether the batch contains paired reads.
- **`PairedWith()`**: Returns a new batch containing only the mate (partner) of each read in the current batch.
- **`PairTo(*BioSequenceBatch)`**: Synchronizes and pairs reads between two batches *of identical order*; fails if orders differ.
- **`UnPair()`**: Removes pairing metadata, treating reads as unpaired.
- `IBioSequence` (iterator) methods:
- **`MarkAsPaired()`**: Marks the iterator as producing paired-end data.
- **`PairTo(IBioSequence)`**: Combines two iterators into a new paired-end iterator by aligning corresponding batches and calling `PairTo` on each pair.
- **`PairedWith()`**: Generates a new iterator yielding only the mate reads (i.e., second ends) from an existing paired-end stream.
- **`IsPaired()`**: Returns whether the iterator was explicitly marked as paired.
All operations preserve batched processing and concurrency via goroutines, ensuring efficient handling of large NGS datasets while maintaining semantic correctness for paired-end workflows.
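The order check behind `PairTo` can be sketched with a minimal batch type. `batch` and `pairTo` below are illustrative, not the real `obiiter` types: pairing succeeds only when the two batches carry the same order tag, otherwise an error is returned.

```go
package main

import (
	"errors"
	"fmt"
)

// batch stands in for BioSequenceBatch: an order tag, its reads,
// and (after pairing) the mates aligned index-by-index.
type batch struct {
	order int
	reads []string
	mates []string
}

// pairTo refuses to pair batches whose orders differ, mirroring the
// "identical order" precondition described above.
func (b *batch) pairTo(other *batch) error {
	if b.order != other.order {
		return errors.New("cannot pair batches with different orders")
	}
	b.mates = other.reads
	return nil
}

func main() {
	fwd := &batch{order: 0, reads: []string{"r1/1", "r2/1"}}
	rev := &batch{order: 0, reads: []string{"r1/2", "r2/2"}}
	if err := fwd.pairTo(rev); err != nil {
		panic(err)
	}
	fmt.Println(fwd.mates[0]) // prints: r1/2
}
```

The iterator-level `PairTo` applies this batch-level operation to each pair of corresponding batches drawn from the two streams.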
