Compare commits

277 Commits

Author SHA1 Message Date
Eric Coissac
0580611031 Implement canonical super-kmers and clean up GenBank parsing
Add the IterCanonicalSuperKmers function in superkmer_iter.go to implement canonical super-kmers as described in the architecture document.

Fixes in genbank_read.go:
- Clean data lines with strings.TrimSpace
- Increase the number of parts extracted with SplitN to 7
- Start the loop at index 1 instead of 0 to skip the empty first element

Create the Canonical-superkmers.md file to document the implementation.
2026-02-19 18:30:54 +01:00
Eric Coissac
c30a22d356 Refactor build workflow and update version
Update GitHub Actions workflow to use setup-go v5 and align with latest tooling practices.

Update version to 4.4.15 in version.txt and pkg/obioptions/version.go.

Add comprehensive documentation for the canonical super-kmer strategy, including:
- Analysis of index v1 limitations
- Experimental observations on super-kmer efficiency
- Detailed pipeline for building v3 index
- Explanation of minimizer-canonization
- Description of unitig construction and frequency filtering
- Storage format specifications for v3
- Aho-Corasick matching implementation

This change introduces a major improvement in index compactness and performance through the use of canonical super-kmers, unitigs, and efficient storage formats.
2026-02-11 22:57:28 +01:00
Eric Coissac
1ce5da9bee Support new sequence file formats and improve error handling
Add support for .gbff and .gbff.gz file extensions in sequence reader.

Update the logic to return an error instead of using NilIBioSequence when no sequence files are found, improving the error handling and user feedback.
2026-02-11 06:31:10 +01:00
coissac
dc23d9de9a Merge pull request #85 from metabarcoding/push-smturnsrozkp
Push smturnsrozkp
2026-02-10 22:19:22 +01:00
Eric Coissac
aa9d7bbf72 Bump version to 4.4.14
Update version number from 4.4.13 to 4.4.14 in both version.go and version.txt files.
2026-02-10 22:17:23 +01:00
Eric Coissac
db22d20d0a Rename obisuperkmer test script to obik-super and update command references
Update test script name from obisuperkmer to obik-super and adjust all command references accordingly.

- Changed TEST_NAME from 'obisuperkmer' to 'obik-super'
- Changed CMD from 'obisuperkmer' to 'obik'
- Updated MCMD to 'OBIk-super'
- Modified command calls to use '$CMD super' instead of direct command names
- Updated help test to use '$CMD super -h'
- Updated all test cases to use the new command format
2026-02-10 22:17:22 +01:00
coissac
7c05bdb01c Merge pull request #84 from metabarcoding/push-uxvowwlxkrlq
Push uxvowwlxkrlq
2026-02-10 22:12:18 +01:00
Eric Coissac
b6542c4523 Bump version to 4.4.13
Update version from 4.4.12 to 4.4.13 in version.txt and pkg/obioptions/version.go
2026-02-10 22:10:38 +01:00
Eric Coissac
ac41dd8a22 Refactor k-mer matching pipeline with improved concurrency and memory management
Refactor k-mer matching to use a pipeline architecture with improved concurrency and memory management:

- Replace sort.Slice with slices.SortFunc and cmp.Compare for better performance
- Introduce PreparedQueries struct to encapsulate query buckets with metadata
- Implement MergeQueries function to merge query buckets from multiple batches
- Rewrite MatchBatch to use pre-allocated results and mutexes instead of map-based accumulation
- Add seek optimization in matchPartition to reduce linear scanning
- Refactor match command to use a multi-stage pipeline with proper batching and merging
- Add index directory option for match command
- Improve parallel processing of sequence batches

This refactoring improves performance by reducing memory allocations, optimizing k-mer lookup, and implementing a more efficient pipeline for large-scale k-mer matching operations.
2026-02-10 22:10:36 +01:00
Eric Coissac
bebbbbfe7d Add entropy-based filtering for k-mers
This commit introduces entropy-based filtering for k-mers to remove low-complexity sequences. It adds:

- New KmerEntropy and KmerEntropyFilter functions in pkg/obikmer/entropy.go for computing and filtering k-mer entropy
- Integration of entropy filtering in the k-mer set builder (pkg/obikmer/kmer_set_builder.go)
- A new 'filter' command in obik tool (pkg/obitools/obik/filter.go) to apply entropy filtering on existing indices
- CLI options for configuring entropy filtering during index building and filtering

The entropy filter helps improve the quality of k-mer sets by removing repetitive sequences that may interfere with downstream analyses.
2026-02-10 18:20:35 +01:00
Eric Coissac
c6e04265f1 Add sparse index support for KDI files with fast seeking
This commit introduces sparse index support for KDI files to enable fast random access during k-mer matching. It adds a new .kdx index file format and updates the KDI reader and writer to handle index creation and seeking. The changes include:

- New KdxIndex struct and related functions for loading, searching, and writing .kdx files
- Modified KdiReader to support seeking with the new index
- Updated KdiWriter to create .kdx index files during writing
- Enhanced KmerSetGroup.Contains to use the new index for faster lookups
- Added a new 'match' command to annotate sequences with k-mer match positions

The index is created automatically during KDI file creation and allows an O(log(N / stride)) binary search followed by a linear scan of at most stride entries, significantly improving performance for large datasets.
2026-02-10 13:24:24 +01:00
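The lookup strategy described in commit c6e04265f1 (binary search over a sparse index, then a bounded linear scan) can be sketched as follows. This is a minimal in-memory model under stated assumptions: the sparseIndex type, the function names, and the use of a plain sorted slice in place of a KDI file are all hypothetical and do not reflect the actual KdxIndex layout.

```go
package main

import (
	"fmt"
	"sort"
)

// sparseIndex keeps every stride-th value of a sorted k-mer list.
// Hypothetical type for illustration only.
type sparseIndex struct {
	stride  int
	samples []uint64 // samples[i] == kmers[i*stride]
}

func buildSparseIndex(kmers []uint64, stride int) sparseIndex {
	idx := sparseIndex{stride: stride}
	for i := 0; i < len(kmers); i += stride {
		idx.samples = append(idx.samples, kmers[i])
	}
	return idx
}

// contains binary-searches the samples, then scans at most stride
// entries of the full list starting at the selected sample.
func contains(kmers []uint64, idx sparseIndex, target uint64) bool {
	// Position of the first sample strictly greater than target.
	j := sort.Search(len(idx.samples), func(i int) bool {
		return idx.samples[i] > target
	})
	if j == 0 {
		return false // target is smaller than every stored k-mer
	}
	start := (j - 1) * idx.stride
	end := start + idx.stride
	if end > len(kmers) {
		end = len(kmers)
	}
	for _, v := range kmers[start:end] {
		if v == target {
			return true
		}
		if v > target {
			return false
		}
	}
	return false
}

func main() {
	kmers := []uint64{2, 5, 9, 14, 20, 27, 35, 44, 54}
	idx := buildSparseIndex(kmers, 4)
	fmt.Println(contains(kmers, idx, 27), contains(kmers, idx, 28)) // true false
}
```

In an on-disk setting each sample would presumably also carry a file offset so that the reader can seek directly to the start of the scan window.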
Eric Coissac
9babcc0fae Refactor lowmask options and shared kmer options
Refactor lowmask options to use shared kmer options and CLI getters

This commit refactors the lowmask subcommand to use shared kmer options and CLI getters instead of local variables. It also moves the kmer size and minimizer size options to a shared location and adds new CLI getters for the lowmask options.

- Move kmer size and minimizer size options to shared location
- Add CLI getters for lowmask options
- Refactor lowmask to use CLI getters
- Remove unused strings import
- Add MaskingMode type and related functions
2026-02-10 09:52:38 +01:00
Eric Coissac
e775f7e256 Add option to keep shorter fragments in lowmask
Add a new boolean option 'keep-shorter' to preserve fragments shorter than kmer-size during split/extract mode.

This change introduces a new flag _lowmaskKeepShorter that controls whether fragments
shorter than the kmer size should be kept during split/extract operations.

The implementation:
1. Adds the new boolean variable _lowmaskKeepShorter
2. Registers the command-line option "keep-shorter"
3. Updates the lowMaskWorker function signature to accept the keepShorter parameter
4. Modifies the fragment selection logic to check the keepShorter flag
5. Updates the worker creation to pass the global flag value

This allows users to control the behavior when dealing with short sequences in
split/extract modes, providing more flexibility in low-complexity masking.
2026-02-10 09:36:42 +01:00
Eric Coissac
f2937af1ad Add max frequency filtering and top-kmer saving capabilities
This commit introduces max frequency filtering to limit k-mer occurrences and adds functionality to save the N most frequent k-mers per set to CSV files. It also includes the ability to output k-mer frequency spectra as CSV and updates the CLI options accordingly.
2026-02-10 09:27:04 +01:00
Eric Coissac
56c1f4180c Refactor k-mer index management with subcommands and enhanced metadata support
This commit refactors the k-mer index management tools to use a unified subcommand structure with obik, adds support for per-set metadata and ID management, enhances the k-mer set group builder to support appending to existing groups, and improves command-line option handling with a new global options registration system.

Key changes:
- Introduce obik command with subcommands (index, ls, summary, cp, mv, rm, super, lowmask)
- Add support for per-set metadata and ID management in kmer set groups
- Implement ability to append to existing kmer index groups
- Refactor option parsing to use a global options registration system
- Add new commands for listing, copying, moving, and removing sets
- Enhance low-complexity masking with new options and output formats
- Improve kmer index summary with Jaccard distance matrix support
- Remove deprecated obikindex and obisuperkmer commands
- Update build process to use the new subcommand structure
2026-02-10 06:49:31 +01:00
Eric Coissac
f78543ee75 Refactor k-mer index building to use disk-based KmerSetGroupBuilder
Refactor k-mer index building to use the new disk-based KmerSetGroupBuilder instead of the old KmerSet and FrequencyFilter approaches. This change introduces a more efficient and scalable approach to building k-mer indices by using partitioned disk storage with streaming operations.

- Replace BuildKmerIndex and BuildFrequencyFilterIndex with KmerSetGroupBuilder
- Add support for frequency filtering via WithMinFrequency option
- Remove deprecated k-mer set persistence methods
- Update CLI to use new builder approach
- Add new disk-based k-mer operations (union, intersect, difference, quorum)
- Introduce KDI (K-mer Delta Index) file format for efficient storage
- Add K-way merge operations for combining sorted k-mer streams
- Update documentation and examples to reflect new API

This refactoring provides better memory usage, faster operations on large datasets, and more flexible k-mer set operations.
2026-02-10 06:49:31 +01:00
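The K-way merge mentioned in commit f78543ee75 can be illustrated with a small heap-based union of sorted k-mer streams. This is a sketch under assumptions: in-memory slices stand in for the on-disk streams, and the cursor, cursorHeap, and mergeUnion names are invented for the example.

```go
package main

import (
	"container/heap"
	"fmt"
)

// cursor tracks the read position inside one sorted stream.
type cursor struct {
	stream []uint64
	pos    int
}

// cursorHeap orders cursors by the value they currently point to.
type cursorHeap []*cursor

func (h cursorHeap) Len() int { return len(h) }
func (h cursorHeap) Less(i, j int) bool {
	return h[i].stream[h[i].pos] < h[j].stream[h[j].pos]
}
func (h cursorHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *cursorHeap) Push(x any)   { *h = append(*h, x.(*cursor)) }
func (h *cursorHeap) Pop() any {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// mergeUnion merges sorted streams and emits each distinct value once.
func mergeUnion(streams ...[]uint64) []uint64 {
	h := &cursorHeap{}
	for _, s := range streams {
		if len(s) > 0 {
			*h = append(*h, &cursor{stream: s})
		}
	}
	heap.Init(h)
	var out []uint64
	for h.Len() > 0 {
		c := (*h)[0] // cursor holding the smallest current value
		v := c.stream[c.pos]
		if len(out) == 0 || out[len(out)-1] != v {
			out = append(out, v)
		}
		c.pos++
		if c.pos == len(c.stream) {
			heap.Pop(h) // stream exhausted
		} else {
			heap.Fix(h, 0) // restore heap order after advancing
		}
	}
	return out
}

func main() {
	fmt.Println(mergeUnion([]uint64{1, 4, 9}, []uint64{2, 4, 10}, []uint64{4})) // [1 2 4 9 10]
}
```

Intersection, difference, and quorum variants can follow the same loop by counting how many cursors yield the current value before deciding whether to emit it.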
Eric Coissac
a016ad5b8a Refactor kmer index to disk-based partitioning with minimizer
Refactor kmer index package to use disk-based partitioning with minimizer

- Replace roaring64 bitmaps with disk-based kmer index
- Implement partitioned kmer sets with delta-varint encoding
- Add support for frequency filtering during construction
- Introduce new builder pattern for index construction
- Add streaming operations for set operations (union, intersect, etc.)
- Add support for super-kmer encoding during construction
- Update command line tool to use new index format
- Remove dependency on roaring bitmap library

This change introduces a new architecture for kmer indexing that is more memory efficient and scalable for large datasets.
2026-02-09 17:52:37 +01:00
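The delta-varint encoding named in commit a016ad5b8a can be sketched with the standard encoding/binary package: a sorted list is stored as successive differences, each written as an unsigned varint. The function names and the absence of any file header are assumptions for this example; the real partition format is not shown in the log.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// encodeDeltaVarint stores an ascending list as the first value followed
// by the differences between neighbours, each as an unsigned varint.
func encodeDeltaVarint(sorted []uint64) []byte {
	buf := make([]byte, 0, len(sorted))
	tmp := make([]byte, binary.MaxVarintLen64)
	var prev uint64
	for _, v := range sorted {
		n := binary.PutUvarint(tmp, v-prev)
		buf = append(buf, tmp[:n]...)
		prev = v
	}
	return buf
}

// decodeDeltaVarint reverses encodeDeltaVarint.
func decodeDeltaVarint(data []byte) ([]uint64, error) {
	var out []uint64
	var prev uint64
	for len(data) > 0 {
		delta, n := binary.Uvarint(data)
		if n <= 0 {
			return nil, errors.New("malformed varint")
		}
		prev += delta
		out = append(out, prev)
		data = data[n:]
	}
	return out, nil
}

func main() {
	kmers := []uint64{1000, 1003, 1010, 5000}
	enc := encodeDeltaVarint(kmers)
	dec, err := decodeDeltaVarint(enc)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(enc), dec) // 6 [1000 1003 1010 5000]
}
```

Four values that would take 32 bytes as raw uint64 fit in 6 bytes here, because small gaps between sorted neighbours encode in a single byte.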
coissac
09d437d10f Merge pull request #83 from metabarcoding/push-xssnppvunmlq
Push xssnppvunmlq
2026-02-09 09:58:06 +01:00
Eric Coissac
d00ab6f83a Bump version from 4.4.11 to 4.4.12
Update version number in version.txt from 4.4.11 to 4.4.12
2026-02-09 09:46:12 +01:00
Eric Coissac
8037860518 Update version and improve release note generation
Update version from 4.4.11 to 4.4.12

- Bump version in version.go
- Enhance release note generation in Makefile to use JSON output from orla and fallback to raw output if JSON parsing fails
- Improve test script to verify minimum super k-mer length is >= k (default k=31)
2026-02-09 09:46:10 +01:00
coissac
43d6cbe56a Merge pull request #82 from metabarcoding/push-vkprtnlyxmkl
Push vkprtnlyxmkl
2026-02-09 09:16:20 +01:00
Eric Coissac
6dadee9371 Bump version to 4.4.12
Update version from 4.4.11 to 4.4.12 in version.txt and pkg/obioptions/version.go
2026-02-09 09:05:49 +01:00
Eric Coissac
99a8e69d10 Optimize low-complexity masking algorithm
This commit optimizes the low-complexity masking algorithm by:

1. Precomputing logarithm values and normalization tables to avoid repeated calculations
2. Replacing the MinMultiset-based sliding minimum with a more efficient deque-based implementation
3. Improving entropy calculation by using precomputed n*log(n) values
4. Simplifying the circular normalization process with precomputed tables
5. Removing unused imports and log statements

The changes significantly improve performance while maintaining the same masking behavior.
2026-02-09 09:05:46 +01:00
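The deque-based sliding minimum mentioned in commit 99a8e69d10 is a standard technique; a minimal sketch is given below. The function name slidingMin and the use of float64 scores are assumptions for this example and are not taken from the lowmask code.

```go
package main

import "fmt"

// slidingMin returns the minimum of every window of w consecutive values.
// The deque holds indices whose values increase from front to back, so the
// front is always the minimum of the current window. Each index is pushed
// and popped at most once, giving O(1) amortized work per element.
func slidingMin(values []float64, w int) []float64 {
	if w <= 0 || len(values) < w {
		return nil
	}
	out := make([]float64, 0, len(values)-w+1)
	deque := make([]int, 0, w)
	for i, v := range values {
		// Drop candidates that can never be a minimum again.
		for len(deque) > 0 && values[deque[len(deque)-1]] >= v {
			deque = deque[:len(deque)-1]
		}
		deque = append(deque, i)
		// Drop the front if it slid out of the window.
		if deque[0] <= i-w {
			deque = deque[1:]
		}
		if i >= w-1 {
			out = append(out, values[deque[0]])
		}
	}
	return out
}

func main() {
	fmt.Println(slidingMin([]float64{4, 2, 5, 1, 3, 6}, 3)) // [2 1 1 1]
}
```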
Eric Coissac
c0ae49ef92 Add obilowmask_ref to .gitignore
Add the obilowmask_ref file to .gitignore so that Git does not track it.
2026-02-08 19:31:12 +01:00
Eric Coissac
08490420a2 Fix whitespace in test script and add merge consistency tests
This commit fixes minor whitespace issues in the test script and adds new tests to ensure merge attribute consistency between in-memory and on-disk paths.

- Removed trailing spaces in log messages
- Added tests for merge consistency between in-memory and on-disk paths
- These tests catch a bug where shared classifier in on-disk dereplication path caused incorrect merged attributes
2026-02-08 18:08:29 +01:00
Eric Coissac
1a28d5ed64 Add progress bar configuration and conditional display
This commit introduces a new configuration module `obidefault` to manage progress bar settings, allowing users to disable progress bars via a `--no-progressbar` option. It updates various packages to conditionally display progress bars based on this new configuration, improving user experience by providing control over progress bar output. The changes also include improvements to progress bar handling in several packages, ensuring they are only displayed when appropriate (e.g., when stderr is a terminal and stdout is not piped).
2026-02-08 16:14:02 +01:00
Eric Coissac
b2d16721f0 Fix classifier cloning and reset in chunk processing
This commit fixes an issue in the chunk processing logic where the wrong classifier instance was being reset and used for code generation. A local clone of the classifier is now created and used to ensure correct behavior during dereplication.
2026-02-08 15:52:25 +01:00
Eric Coissac
7c12b1ee83 Disable progress bar when output is piped
Modify CLIProgressBar function to check if stdout is a named pipe and disable the progress bar accordingly. This prevents the progress bar from being displayed when the output is redirected or piped to another command.
2026-02-08 14:48:13 +01:00
Eric Coissac
db98ddb241 Fix super k-mer minimizer bijection and add validation test
This commit addresses a bug in the super k-mer implementation where the minimizer bijection property was not properly enforced. The fix ensures that:

1. All k-mers within a super k-mer share the same minimizer
2. Identical super k-mer sequences have the same minimizer

The changes include:

- Fixing the super k-mer iteration logic to properly validate the minimizer bijection property
- Adding a comprehensive test suite (TestSuperKmerMinimizerBijection) that validates the intrinsic property of super k-mers
- Updating the .gitignore file to properly track relevant files

This resolves issues where the same sequence could be associated with different minimizers, violating the super k-mer definition.
2026-02-08 13:47:33 +01:00
Eric Coissac
7a979ba77f Add obisuperkmer command implementation and tests
This commit adds the implementation of the obisuperkmer command, including:

- The main command in cmd/obitools/obisuperkmer/
- The package implementation in pkg/obitools/obisuperkmer/
- Automated tests in obitests/obitools/obisuperkmer/
- Documentation for the implementation and tests

The obisuperkmer command extracts super k-mers from DNA sequences, following the standard OBITools architecture. It includes proper CLI option handling, validation of parameters, and integration with the OBITools pipeline system.

Tests cover basic functionality, parameter validation, output format, metadata preservation, and file I/O operations.
2026-02-07 13:54:02 +01:00
Eric Coissac
00c8be6b48 docs: add architecture documentation for OBITools commands
Add detailed documentation on the architecture of OBITools commands, covering the modular structure, the architectural patterns, and best practices for creating new commands.
2026-02-07 12:26:35 +01:00
Eric Coissac
4ae331db36 Refactor SuperKmer extraction to use iterator pattern
This commit refactors the SuperKmer extraction functionality to use Go's new iterator pattern. The ExtractSuperKmers function is now implemented as a wrapper around a new IterSuperKmers iterator function, which yields results one at a time instead of building a complete slice. This change provides better memory efficiency and more flexible consumption of super k-mers. The functionality remains the same, but the interface is now more idiomatic and efficient for large datasets.
2026-02-07 12:23:12 +01:00
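The wrapper arrangement described in commit 4ae331db36 (a slice-returning function built on top of an iterator) can be sketched as follows. To stay short, the example iterates over plain k-mers of a string rather than super k-mers, and iterKmers, extractKmers, and kmerSeq are hypothetical names. The iterator type has the same shape as iter.Seq[string], so with Go 1.23 or later it could also be consumed with a range loop; here it is called directly so that the code does not depend on that feature.

```go
package main

import "fmt"

// kmerSeq is a push-style iterator: it calls yield for each item and
// stops as soon as yield returns false.
type kmerSeq func(yield func(string) bool)

// iterKmers yields the k-mers of seq one at a time, without building a slice.
func iterKmers(seq string, k int) kmerSeq {
	return func(yield func(string) bool) {
		if k <= 0 {
			return
		}
		for i := 0; i+k <= len(seq); i++ {
			if !yield(seq[i : i+k]) {
				return
			}
		}
	}
}

// extractKmers is the slice-returning wrapper around the iterator.
func extractKmers(seq string, k int) []string {
	var out []string
	iterKmers(seq, k)(func(kmer string) bool {
		out = append(out, kmer)
		return true
	})
	return out
}

func main() {
	fmt.Println(extractKmers("ACGTA", 3)) // [ACG CGT GTA]

	// A consumer can stop early; the remaining k-mers are never produced.
	iterKmers("ACGTA", 3)(func(kmer string) bool {
		fmt.Println("first:", kmer)
		return false
	})
}
```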
Eric Coissac
f1e2846d2d Improve the release process with automatic release-note generation
Update the Makefile to improve the version bump and tag creation process.

- Use variables to store the previous and current versions
- Add automatic generation of release notes from the commits between tags
- Add fallback logic for when orla is not available
- Improve the documentation of the release process steps
- Update the tag creation command to use the generated message
2026-02-07 11:48:26 +01:00
coissac
cd5562fb30 Merge pull request #81 from metabarcoding/push-nrylumyxtxnr
Push nrylumyxtxnr
2026-02-06 10:10:22 +01:00
Eric Coissac
f79b018430 Bump version to 4.4.11
Update version from 4.4.10 to 4.4.11 in version.txt and pkg/obioptions/version.go
2026-02-06 10:09:56 +01:00
Eric Coissac
aa819618c2 Enhance OBITools4 installation script with version control and documentation
Update installation script to support specific version installation, list available versions, and improve documentation.

- Add support for installing specific versions with -v/--version flag
- Add -l/--list flag to list all available versions
- Improve help message with examples
- Update README.md to reflect new installation options and examples
- Add note on version compatibility between OBITools2 and OBITools4
- Remove ecoprimers directory
- Improve error handling and user feedback during installation
- Add version detection and download logic from GitHub releases
- Update installation process to use tagged releases instead of master branch
2026-02-06 10:09:54 +01:00
coissac
da8d851d4d Merge pull request #80 from metabarcoding/push-vvonlpwlnwxy
Remove ecoprimers submodule
2026-02-06 09:53:29 +01:00
Eric Coissac
9823bcb41b Remove ecoprimers submodule
2026-02-06 09:52:54 +01:00
coissac
9c162459b0 Merge pull request #79 from metabarcoding/push-tpytwyyyostt
Remove ecoprimers submodule
2026-02-06 09:51:42 +01:00
Eric Coissac
25b494e562 Remove ecoprimers submodule
2026-02-06 09:50:45 +01:00
coissac
0b5cadd104 Merge pull request #78 from metabarcoding/push-pwvvkzxzmlux
Push pwvvkzxzmlux
2026-02-06 09:48:47 +01:00
Eric Coissac
a2106e4e82 Bump version to 4.4.10
Update version from 4.4.9 to 4.4.10 in version.txt and pkg/obioptions/version.go
2026-02-06 09:48:27 +01:00
Eric Coissac
a8a00ba0f7 Simplify artifact packaging and update release notes
This commit simplifies the artifact packaging process by creating a single tar.gz file containing all binaries for each platform, instead of individual files. It also updates the release notes to reflect the new packaging approach and corrects the documentation to use the new naming convention 'obitools4' instead of '<tool>'.
2026-02-06 09:48:25 +01:00
coissac
1595a74ada Merge pull request #77 from metabarcoding/push-lwtnswxmorrq
Push lwtnswxmorrq
2026-02-06 09:35:05 +01:00
Eric Coissac
68d723ecba Bump version to 4.4.9
Update version from 4.4.8 to 4.4.9 in version.txt and corresponding Go file.
2026-02-06 09:34:43 +01:00
Eric Coissac
250d616129 Update release workflows for new OS versions
Update the release workflow to use ubuntu-24.04-arm instead of ubuntu-latest for ARM64, and macos-15-intel instead of macos-latest for macOS. Remove cross-compilation for ARM64 and adjust the installation of build tools for macOS.
2026-02-06 09:34:41 +01:00
coissac
fbf816d219 Merge pull request #76 from metabarcoding/push-tzpmmnnxkvxx
Push tzpmmnnxkvxx
2026-02-06 09:09:05 +01:00
Eric Coissac
7f0133a196 Bump version to 4.4.8
Update version from 4.4.7 to 4.4.8 in version.txt and _Version variable.
2026-02-06 09:08:35 +01:00
Eric Coissac
f798f22434 Add cross-platform binary builds and release workflow improvements
This commit introduces a new build job that compiles binaries for multiple platforms (Linux, macOS) and architectures (amd64, arm64). It also refactors the release process to download pre-built artifacts and simplify the release directory preparation. The workflow now uses matrix strategy for building binaries and downloads all artifacts for the final release, removing the previous manual build steps for each platform.
2026-02-06 09:08:33 +01:00
coissac
248bc9f672 Merge pull request #75 from metabarcoding/push-mxxuykppzlpw
Push mxxuykppzlpw
2026-02-05 18:11:12 +01:00
Eric Coissac
7a7db703f1 Bump version to 4.4.7
Update version from 4.4.6 to 4.4.7 in version.txt and pkg/obioptions/version.go
2026-02-05 18:10:45 +01:00
Eric Coissac
da195ac5cb Optimize binary builds
Modify the release workflow file to compile only the obitools tools when building binaries for each platform (Linux AMD64, Linux ARM64, macOS AMD64, macOS ARM64, Windows AMD64). This streamlines the build by producing only the binaries that are needed.
2026-02-05 18:10:43 +01:00
coissac
20a0a09f5f Merge pull request #74 from metabarcoding/push-yqrwnpmoqllk
Push yqrwnpmoqllk
2026-02-05 18:03:28 +01:00
coissac
7d8c578c57 Merge branch 'master' into push-yqrwnpmoqllk
2026-02-05 18:03:18 +01:00
Eric Coissac
d7f615108f Bump version to 4.4.6
Update version from 4.4.5 to 4.4.6 in version.txt and pkg/obioptions/version.go
2026-02-05 18:02:30 +01:00
Eric Coissac
71574f240b Update version and add CI tests
Update version to 4.4.5 and add a test job in the release workflow to ensure tests pass before creating a release.
2026-02-05 18:02:28 +01:00
coissac
c98501a898 Merge pull request #73 from metabarcoding/push-pklkwsssrkuv
Push pklkwsssrkuv
2026-02-05 17:54:39 +01:00
Eric Coissac
23f145a4c2 Bump version to 4.4.5
Update version number from 4.4.4 to 4.4.5 in both version.go and version.txt files.
2026-02-05 17:53:53 +01:00
Eric Coissac
fe6d74efbf Add automated release workflow and update tag creation
This commit introduces a new GitHub Actions workflow to automatically create releases when tags matching the pattern 'Release_*' are pushed. It also updates the Makefile to use the new tag format 'Release_<version>' for tagging commits, ensuring consistency with the new release automation.
2026-02-05 17:53:52 +01:00
coissac
cff8135468 Merge pull request #72 from metabarcoding/push-zsprzlqxurrp
Push zsprzlqxurrp
2026-02-05 17:42:48 +01:00
Eric Coissac
02ab683fa0 Bump version to 4.4.4
Update version from 4.4.3 to 4.4.4 in version.txt and pkg/obioptions/version.go
2026-02-05 17:42:01 +01:00
Eric Coissac
de88e7eecd Fix typo in variable name
Corrected a typo in the variable name 'usreId' to 'userId' to ensure proper functionality.
2026-02-05 17:41:59 +01:00
Eric Coissac
e3c41fc11b Add Jaccard distance and similarity computations for KmerSet and KmerSetGroup
This commit introduces Jaccard distance and similarity methods for KmerSet and KmerSetGroup.

For KmerSet:
- Added JaccardDistance method to compute the Jaccard distance between two KmerSets
- Added JaccardSimilarity method to compute the Jaccard similarity between two KmerSets

For KmerSetGroup:
- Added JaccardDistanceMatrix method to compute a pairwise Jaccard distance matrix
- Added JaccardSimilarityMatrix method to compute a pairwise Jaccard similarity matrix

Also includes:
- New DistMatrix implementation in pkg/obidist for storing and computing distance/similarity matrices
- Updated version handling with bump-version target in Makefile
- Added tests for all new methods
2026-02-05 17:39:23 +01:00
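For reference, the Jaccard similarity of two sets A and B is |A ∩ B| / |A ∪ B| and the Jaccard distance is one minus that value. The sketch below computes both on sorted, duplicate-free uint64 slices; the function names, the slice representation, and the convention that two empty sets have similarity 1 are assumptions for this example rather than the behaviour of the KmerSet methods added in commit e3c41fc11b.

```go
package main

import "fmt"

// jaccardSimilarity expects a and b to be sorted and free of duplicates.
// It walks both slices in a single merge pass to count the intersection.
func jaccardSimilarity(a, b []uint64) float64 {
	i, j, inter := 0, 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			inter++
			i++
			j++
		case a[i] < b[j]:
			i++
		default:
			j++
		}
	}
	union := len(a) + len(b) - inter
	if union == 0 {
		return 1 // convention chosen here: two empty sets are identical
	}
	return float64(inter) / float64(union)
}

func jaccardDistance(a, b []uint64) float64 {
	return 1 - jaccardSimilarity(a, b)
}

func main() {
	a := []uint64{1, 2}
	b := []uint64{2, 3, 4}
	fmt.Println(jaccardSimilarity(a, b), jaccardDistance(a, b)) // 0.25 0.75
}
```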
Eric Coissac
aa2e94dd6f Refactor k-mer normalization functions and add quorum operations
This commit refactors the k-mer normalization functions, renaming them from 'NormalizeKmer' to 'CanonicalKmer' to better reflect their purpose of returning canonical k-mers. It also introduces new quorum operations (AtLeast, AtMost, Exactly) for k-mer set groups, along with comprehensive tests and benchmarks. The version commit hash has also been updated.
2026-02-05 17:11:34 +01:00
Eric Coissac
a43e6258be docs: translate comments to English
This commit translates all French comments in the kmer filtering and set management code to English, improving code readability and maintainability for international collaborators.
2026-02-05 16:35:55 +01:00
Eric Coissac
12ca62b06a Complete persistence implementation for FrequencyFilter
Add save and load support for FrequencyFilter using the underlying KmerSetGroup.

- New Save() method to write the filter to a directory with metadata formatting
- New LoadFrequencyFilter() method to load a filter from a directory
- Initialize the metadata when the filter is created
- Optimize the Union() and Intersect() methods of KmerSetGroup
- Update the commit hash
2026-02-05 16:26:10 +01:00
Eric Coissac
09ac15a76b Refactor k-mer encoding functions to use 'canonical' terminology
This commit refactors all k-mer encoding and normalization functions to consistently use 'canonical' instead of 'normalized' terminology. This includes renaming functions like EncodeNormalizedKmer to EncodeCanonicalKmer, IterNormalizedKmers to IterCanonicalKmers, and NormalizeKmer to CanonicalKmer. The change aligns the API with biological conventions where 'canonical' refers to the lexicographically smallest representation of a k-mer and its reverse complement. All related documentation and examples have been updated accordingly. The commit also updates the version file with a new commit hash.
2026-02-05 16:14:35 +01:00
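The notion of a canonical k-mer used in commit 09ac15a76b (the smaller of a k-mer and its reverse complement) can be sketched with a 2-bit encoding. The encoding A=0, C=1, G=2, T=3 and the function names below are assumptions for this example; with that encoding, numeric order on equal-length codes matches lexicographic order on the sequences.

```go
package main

import "fmt"

// encodeKmer packs a DNA word (up to 32 bases) into a uint64, 2 bits per
// base, first base in the most significant position. It returns false if
// the word contains a symbol other than A, C, G, or T.
func encodeKmer(seq string) (uint64, bool) {
	var code uint64
	for i := 0; i < len(seq); i++ {
		var b uint64
		switch seq[i] {
		case 'A', 'a':
			b = 0
		case 'C', 'c':
			b = 1
		case 'G', 'g':
			b = 2
		case 'T', 't':
			b = 3
		default:
			return 0, false
		}
		code = code<<2 | b
	}
	return code, true
}

// reverseComplement reverses the order of the k bases and complements
// each one (with this encoding the complement of b is 3-b).
func reverseComplement(kmer uint64, k int) uint64 {
	var rc uint64
	for i := 0; i < k; i++ {
		rc = rc<<2 | (3 - (kmer & 3))
		kmer >>= 2
	}
	return rc
}

// canonicalKmer returns the smaller of a k-mer and its reverse complement.
func canonicalKmer(kmer uint64, k int) uint64 {
	if rc := reverseComplement(kmer, k); rc < kmer {
		return rc
	}
	return kmer
}

func main() {
	fwd, _ := encodeKmer("CGT")
	rev, _ := encodeKmer("ACG") // reverse complement of CGT
	fmt.Println(fwd, rev, canonicalKmer(fwd, 3), canonicalKmer(rev, 3)) // 27 6 6 6
}
```

Both strands of the same word map to the same code, which is what lets an index store each k-mer once regardless of read orientation.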
Eric Coissac
16f72e6305 Refactoring of obikmer
2026-02-05 16:05:48 +01:00
Eric Coissac
6c6c369ee2 Add k-mer encoding and decoding functions with normalized k-mer support
This commit introduces new functions for encoding and decoding k-mers, including support for normalized k-mers. It also updates the frequency filter and k-mer set implementations to use the new encoding functions, providing zero-allocation encoding for better performance. The commit hash has been updated to reflect the latest changes.
2026-02-05 15:51:52 +01:00
Eric Coissac
c5dd477675 Refactor KmerSet and FrequencyFilter to use immutable K parameter and consistent Copy/Clone methods
This commit refactors the KmerSet and related structures to use an immutable K parameter and introduces consistent Copy methods instead of Clone. It also adds attribute API support for KmerSet and KmerSetGroup, and updates persistence logic to handle IDs and metadata correctly.
2026-02-05 15:32:36 +01:00
Eric Coissac
afcb43b352 Add user metadata management to KmerSet and KmerSetGroup
This change adds the ability to store and persist user metadata in the KmerSet and KmerSetGroup structures. It adds a Metadata field to KmerSet and KmerSetGroup and updates the cloning and persistence methods to handle this metadata. This makes it possible to keep additional information attached to k-mer sets while remaining compatible with existing operations.
2026-02-05 15:02:36 +01:00
Eric Coissac
b26b76cbf8 Add TOML persistence support for KmerSet and KmerSetGroup
This commit adds support for saving and loading KmerSet and KmerSetGroup structures using TOML, YAML, and JSON formats for metadata. It includes:

- Added github.com/pelletier/go-toml/v2 dependency
- Implemented Save and Load methods for KmerSet and KmerSetGroup
- Added metadata persistence with support for multiple formats (TOML, YAML, JSON)
- Added helper functions for format detection and metadata handling
- Updated version commit hash
2026-02-05 14:57:22 +01:00
Eric Coissac
aa468ec462 Refactor FrequencyFilter to use KmerSetGroup
Refactor FrequencyFilter to inherit from KmerSetGroup for better code organization and maintainability. This change replaces the direct bitmap management with a group-based approach, simplifying the implementation and improving readability.
2026-02-05 14:46:57 +01:00
Eric Coissac
00dcd78e84 Refactor k-mer encoding and frequency filtering with KmerSet
This commit refactors the k-mer encoding logic to handle ambiguous bases more consistently and introduces a KmerSet type for better management of k-mer collections. The frequency filter now works with KmerSet instead of roaring bitmaps directly, and the API has been updated to support level-based frequency queries. Additionally, the commit updates the version and commit hash.
2026-02-05 14:41:59 +01:00
Eric Coissac
60f27c1dc8 Add error handling for ambiguous bases in k-mer encoding
This commit introduces error handling for ambiguous DNA bases (N, R, Y, W, S, K, M, B, D, H, V) in k-mer encoding. It adds new functions IterNormalizedKmersWithErrors and EncodeNormalizedKmersWithErrors that track and encode the number of ambiguous bases in each k-mer using error markers in the top 2 bits. The commit also updates the version string to reflect the latest changes.
2026-02-04 21:45:08 +01:00
Eric Coissac
28162ac36f Add frequency filter with v levels of Roaring Bitmaps
Full implementation of the frequency filter using v levels of Roaring Bitmaps to efficiently eliminate sequencing errors.

- Add frequency filtering logic with v levels
- Integrate the RoaringBitmap and bitset libraries
- Add usage examples and documentation
- Implement the k-mer iterator for memory-efficient processing
- Optimize for the skewed distributions typical of sequencing data

This change makes it possible to filter k-mers by minimum frequency with optimal memory usage and a single pass over the data.
2026-02-04 21:21:10 +01:00
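One way to read the "v levels" idea of commit 28162ac36f is as a cascade of sets: each occurrence of a k-mer promotes it one level, and a k-mer reaches the last level only once it has been seen v times. The sketch below is an interpretation under assumptions: Go maps stand in for Roaring Bitmaps, and the type and method names are invented for the example.

```go
package main

import "fmt"

// frequencyFilter holds one set per level. A k-mer present in
// levels[i] has been seen at least i+1 times.
type frequencyFilter struct {
	levels []map[uint64]struct{}
}

func newFrequencyFilter(minFreq int) *frequencyFilter {
	if minFreq < 1 {
		minFreq = 1
	}
	f := &frequencyFilter{levels: make([]map[uint64]struct{}, minFreq)}
	for i := range f.levels {
		f.levels[i] = make(map[uint64]struct{})
	}
	return f
}

// add records one occurrence: the k-mer enters the first level that does
// not contain it yet. Occurrences beyond minFreq change nothing.
func (f *frequencyFilter) add(kmer uint64) {
	for _, level := range f.levels {
		if _, ok := level[kmer]; !ok {
			level[kmer] = struct{}{}
			return
		}
	}
}

// passes reports whether the k-mer was seen at least minFreq times.
func (f *frequencyFilter) passes(kmer uint64) bool {
	_, ok := f.levels[len(f.levels)-1][kmer]
	return ok
}

func main() {
	f := newFrequencyFilter(3)
	for _, k := range []uint64{5, 7, 5, 9, 7, 5} { // single pass over the data
		f.add(k)
	}
	fmt.Println(f.passes(5), f.passes(7), f.passes(9)) // true false false
}
```

Because most erroneous k-mers occur only once, the first level absorbs them and the higher levels stay small, which fits the skewed distributions mentioned in the commit message.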
Eric Coissac
1a1adb83ac Add error marker support for k-mers with enhanced documentation
This commit introduces error marker functionality for k-mers with odd lengths up to 31. The top 2 bits of each k-mer are now reserved for error coding (0-3), allowing for error detection and correction capabilities. Key changes include:

- Added constants KmerErrorMask and KmerSequenceMask for bit manipulation
- Implemented SetKmerError, GetKmerError, and ClearKmerError functions
- Updated EncodeKmers, ExtractSuperKmers, EncodeNormalizedKmers functions to enforce k ≤ 31
- Enhanced ReverseComplement to preserve error bits during reverse complement operations
- Added comprehensive tests for error marker functionality including edge cases and integration tests

The maximum k-mer size is now capped at 31 to accommodate the error bits, ensuring that k-mers with odd lengths ≤ 31 utilize only 62 bits of the 64-bit uint64, leaving the top 2 bits available for error coding.
2026-02-04 16:21:47 +01:00
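The bit layout described in commit 1a1adb83ac (sequence in the low 62 bits, error code in the top 2 bits of a uint64) can be sketched as follows. The constant and function names come from the commit message, but the signatures and bodies below are guesses made for this example and may differ from the actual code in pkg/obikmer.

```go
package main

import "fmt"

const (
	// KmerErrorMask selects the top 2 bits, reserved for the error code.
	KmerErrorMask uint64 = 3 << 62
	// KmerSequenceMask selects the low 62 bits that hold the sequence.
	KmerSequenceMask uint64 = ^KmerErrorMask
)

// SetKmerError stores an error code (0-3) in the top 2 bits.
// Assumed signature; codes above 3 are truncated to their low 2 bits.
func SetKmerError(kmer, code uint64) uint64 {
	return (kmer & KmerSequenceMask) | ((code & 3) << 62)
}

// GetKmerError reads the error code back.
func GetKmerError(kmer uint64) uint64 {
	return kmer >> 62
}

// ClearKmerError returns the k-mer with its error bits set to zero.
func ClearKmerError(kmer uint64) uint64 {
	return kmer & KmerSequenceMask
}

func main() {
	kmer := uint64(0x1234) // an arbitrary encoded sequence
	marked := SetKmerError(kmer, 2)
	fmt.Println(GetKmerError(marked), ClearKmerError(marked) == kmer) // 2 true
}
```

A k-mer of length 31 needs 2 × 31 = 62 bits at 2 bits per base, which is why the cap at k ≤ 31 leaves exactly the top 2 bits free.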
Eric Coissac
05de9ca58e Add SuperKmer extraction functionality
This commit introduces the ExtractSuperKmers function which identifies maximal subsequences where all consecutive k-mers share the same minimizer. It includes:

- SuperKmer struct to represent the maximal subsequences
- dequeItem struct for tracking minimizers in a sliding window
- Efficient algorithm using monotone deque for O(1) amortized minimizer tracking
- Comprehensive parameter validation
- Support for buffer reuse for performance optimization
- Extensive test cases covering basic functionality, edge cases, and performance benchmarks

The implementation uses simultaneous forward/reverse m-mer encoding for O(1) canonical m-mer computation and maintains a monotone deque to track minimizers efficiently.
2026-02-04 16:04:06 +01:00
Eric Coissac
500144051a Add jj Makefile targets and k-mer encoding utilities
Add new Makefile targets for jj operations (jjnew, jjpush, jjfetch) to streamline commit workflow.

Introduce k-mer encoding utilities in pkg/obikmer:
- EncodeKmers: converts DNA sequences to encoded k-mers
- ReverseComplement: computes reverse complement of k-mers
- NormalizeKmer: returns canonical form of k-mers
- EncodeNormalizedKmers: encodes sequences with normalized k-mers

Add comprehensive tests for k-mer encoding functions including edge cases, buffer reuse, and performance benchmarks.
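
As a rough illustration of what such utilities compute, here is a minimal sketch. The function names follow the list above, but the signatures, the buffer-reuse convention, and the A=0, C=1, G=2, T=3 coding are assumptions rather than the actual pkg/obikmer API.

```go
package main

import "fmt"

// baseCode maps a nucleotide to 2 bits (assumed coding: A=0, C=1, G=2, T=3).
// In this sketch any other symbol is treated as A.
func baseCode(b byte) uint64 {
	switch b {
	case 'C', 'c':
		return 1
	case 'G', 'g':
		return 2
	case 'T', 't':
		return 3
	}
	return 0
}

// EncodeKmers packs every k-mer of seq into a uint64, 2 bits per base,
// first base in the most significant position. buf is reused when provided.
func EncodeKmers(seq string, k int, buf []uint64) []uint64 {
	out := buf[:0]
	if k < 1 || k > 31 || len(seq) < k {
		return out
	}
	mask := uint64(1)<<(2*uint(k)) - 1
	var cur uint64
	for i := 0; i < len(seq); i++ {
		cur = ((cur << 2) | baseCode(seq[i])) & mask
		if i >= k-1 {
			out = append(out, cur)
		}
	}
	return out
}

// ReverseComplement reverses the base order and complements each base.
// With the assumed coding, the complement of code c is 3-c.
func ReverseComplement(kmer uint64, k int) uint64 {
	var rc uint64
	for i := 0; i < k; i++ {
		rc = (rc << 2) | (3 - (kmer & 3))
		kmer >>= 2
	}
	return rc
}

// NormalizeKmer returns the smaller of a k-mer and its reverse complement.
func NormalizeKmer(kmer uint64, k int) uint64 {
	if rc := ReverseComplement(kmer, k); rc < kmer {
		return rc
	}
	return kmer
}

func main() {
	for _, km := range EncodeKmers("ACGTTG", 4, nil) {
		fmt.Printf("%08b -> %08b\n", km, NormalizeKmer(km, 4))
	}
}
```

Taking the minimum of the two strands makes a k-mer and its reverse complement map to the same key, which is what strand-independent counting needs.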

Document k-mer index design for large genomes, covering:
- Use cases and objectives
- Volume estimations
- Distance metrics (Jaccard, Sørensen-Dice, Bray-Curtis)
- Indexing options (Bloom filters, sorted sets, MPHF)
- Optimization techniques (k-2-mer indexing)
- MinHash for distance acceleration
- Recommended architecture for presence/absence and counting queries
2026-02-04 14:27:10 +01:00
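For reference, the three distance metrics named in this commit can be computed from k-mer count tables as sketched below. This uses the textbook definitions; the `map[uint64]int` representation and the function names are assumptions made for the example and are not taken from the design document.

```go
package main

import "fmt"

// shared returns the number of k-mers present in both tables and the
// sum over those k-mers of the smaller of the two counts.
// Every key stored in a table is assumed to have a count > 0.
func shared(a, b map[uint64]int) (n, minSum int) {
	for k, ca := range a {
		if cb, ok := b[k]; ok {
			n++
			if ca < cb {
				minSum += ca
			} else {
				minSum += cb
			}
		}
	}
	return n, minSum
}

// total sums all counts of a table.
func total(a map[uint64]int) int {
	s := 0
	for _, c := range a {
		s += c
	}
	return s
}

// Jaccard similarity on presence/absence: |A ∩ B| / |A ∪ B|.
func Jaccard(a, b map[uint64]int) float64 {
	n, _ := shared(a, b)
	union := len(a) + len(b) - n
	if union == 0 {
		return 0
	}
	return float64(n) / float64(union)
}

// Dice (Sørensen-Dice) similarity on presence/absence: 2|A ∩ B| / (|A| + |B|).
func Dice(a, b map[uint64]int) float64 {
	n, _ := shared(a, b)
	if len(a)+len(b) == 0 {
		return 0
	}
	return 2 * float64(n) / float64(len(a)+len(b))
}

// BrayCurtis dissimilarity on counts: 1 - 2*sum(min(a_i, b_i)) / (sum(a) + sum(b)).
func BrayCurtis(a, b map[uint64]int) float64 {
	_, minSum := shared(a, b)
	t := total(a) + total(b)
	if t == 0 {
		return 0
	}
	return 1 - 2*float64(minSum)/float64(t)
}

func main() {
	a := map[uint64]int{1: 2, 2: 1, 3: 1}
	b := map[uint64]int{2: 3, 3: 1, 4: 2}
	fmt.Println(Jaccard(a, b), Dice(a, b), BrayCurtis(a, b))
}
```

Jaccard and Dice only need presence/absence, so they fit set-like indexes; Bray-Curtis needs counts.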
coissac
740f66b4c7 Merge pull request #71 from metabarcoding/push-onwzsyuooozn
Implement unique filtering based on sequence and categories
2026-01-14 19:19:27 +01:00
Eric Coissac
b49aba9c09 Implement unique filtering based on sequence and categories
Add a unique-filtering feature that takes both the sequence and the categories into account.

- Modify the ISequenceChunk function to accept an optional unique classifier
- Implement on-disk unique processing using a composite classifier
- Update the classifier used for on-disk sorting
- Fix the handling of uniqueness keys by using the classifier's code and value
- Update the commit number
2026-01-14 19:18:17 +01:00
coissac
52244cdb64 Merge pull request #70 from metabarcoding/push-kuwnszsxmxpn
Refactor chunk processing and update version commit
2026-01-14 18:47:17 +01:00
Eric Coissac
0678181023 Refactor chunk processing and update version commit
Optimize chunk processing by moving variable declarations inside the loop and update the commit hash in version.go to reflect the latest changes.
2026-01-14 18:46:04 +01:00
coissac
f55dd553c7 Merge pull request #68 from metabarcoding/push-rrulynolpprl
Push rrulynolpprl
2026-01-14 17:44:36 +01:00
coissac
4a383ac6c9 Merge branch 'master' into push-rrulynolpprl 2025-12-18 14:12:56 +01:00
Eric Coissac
371e702423 obiannotate --cut bug 2025-12-18 14:11:11 +01:00
Eric Coissac
ac0d3f3fe4 Update obiuniq for very large dataset 2025-12-18 14:11:11 +01:00
Eric Coissac
547135c747 End of obilowmask 2025-12-03 11:49:07 +01:00
coissac
f4a919732e Merge pull request #65 from metabarcoding/push-yurwulsmpxkq
End of obilowmask
2025-11-26 12:13:08 +01:00
Eric Coissac
e681666aaa End of obilowmask 2025-11-26 11:14:56 +01:00
coissac
adf2486295 Merge pull request #64 from metabarcoding/push-yurwulsmpxkq
End of obilowmask
2025-11-24 15:36:20 +01:00
Eric Coissac
272f5c9c35 End of obilowmask 2025-11-24 15:27:38 +01:00
coissac
c1b9503ca6 Merge pull request #63 from metabarcoding/push-vypwrurrsxuk
obicsv bug with stat on value map fields
2025-11-21 14:04:34 +01:00
Eric Coissac
86e60aedd0 obicsv bug with stat on value map fields 2025-11-21 14:03:31 +01:00
coissac
961abcea7b Merge pull request #61 from metabarcoding/push-mvxssvnysyxn
Push mvxssvnysyxn
2025-11-21 13:25:19 +01:00
Eric Coissac
57c65f9d50 obimatrix bug 2025-11-21 13:24:24 +01:00
Eric Coissac
e65b2a5efe obimatrix bugs 2025-11-21 13:24:06 +01:00
coissac
3e5f3f76b0 Merge pull request #60 from metabarcoding/push-qpnzxskwpoxo
Push qpnzxskwpoxo
2025-11-18 15:35:41 +01:00
Eric Coissac
ccc827afd3 finalise obilowmask 2025-11-18 15:33:08 +01:00
Eric Coissac
cef29005a5 debug url reading 2025-11-18 15:30:20 +01:00
Eric Coissac
4603d7973e implementation of obilowmask 2025-11-18 15:30:20 +01:00
coissac
8bc47c13d3 Merge pull request #58 from metabarcoding/push-vxkqkkrokwuz
debug obimultiplex
2025-11-06 15:44:31 +01:00
Eric Coissac
07cdd6f758 debug obimultiplex
bug option obimultiplex
2025-11-06 15:43:13 +01:00
coissac
432da366e2 Merge pull request #57 from metabarcoding/push-ywktmvpvtvmv
debug taxonomy core dump
2025-11-05 19:07:41 +01:00
Eric Coissac
2d7dc7d09d debug taxonomy core dump 2025-11-05 19:01:15 +01:00
coissac
5e12ed5400 Merge pull request #56 from metabarcoding/push-tnrvpwvqtzyo
update install script
2025-11-04 18:11:21 +01:00
Eric Coissac
7500ee1d15 update install script 2025-11-04 18:09:15 +01:00
coissac
5a1d66bf06 Merge pull request #53 from metabarcoding/push-skmxzrzulvtq
Push skmxzrzulvtq
2025-10-28 14:27:19 +01:00
Eric Coissac
0844dcc607 bug obimatrix 2025-10-28 13:57:31 +01:00
Eric Coissac
7f4ebe757e Bug obiuniq - don't clean the chunks 2025-10-28 13:50:22 +01:00
coissac
5150947e23 Merge pull request #51 from metabarcoding/push-urtwmwktsrru
Push urtwmwktsrru
2025-10-20 17:41:33 +02:00
Eric Coissac
d17a9520b9 work on obiclean chimera detection 2025-10-20 17:29:47 +02:00
Eric Coissac
29bf4ce871 add a feature to obimatrix adding obicsv option to obimatrix 2025-10-20 16:34:58 +02:00
coissac
d7ed9d343e Update install_obitools.sh for missing directory 2025-10-15 08:32:06 +02:00
Eric Coissac
82b6bb1ab6 correct a bug in func (worker SeqWorker) ChainWorkers(next SeqWorker) SeqWorker 2025-08-11 15:09:49 +02:00
Eric Coissac
6d204f6281 Patch the fastq detector 2025-08-08 10:23:03 -04:00
Eric Coissac
7a6d552450 Changes to be committed:
modified:   pkg/obioptions/version.go
2025-08-07 17:01:48 -04:00
Eric Coissac
412b54822c Patch a bug in obiclean for d>1 leading to some instability in the result 2025-08-07 17:01:38 -04:00
Eric Coissac
730d448fc3 Allows for only one cpu and it should work 2025-08-06 16:09:25 -04:00
Eric Coissac
04f3af3e60 some renaming of functions 2025-08-06 15:54:50 -04:00
Eric Coissac
997b6e8c01 correct the fastq detector to distinguish fastq from a csv ngsfilter 2025-08-06 15:52:54 -04:00
Eric Coissac
f239e8da92 Rename ISequenceChunk 2025-08-05 08:49:45 -04:00
Eric Coissac
ed28d3fb5b Adds a --u-to-t option 2025-07-07 15:35:26 +02:00
Eric Coissac
43b285587e Debug on taxonomy extraction and CSV conversion 2025-07-07 15:29:40 +02:00
Eric Coissac
8d53d253d4 Add a reading option on readers to convert U to T 2025-07-07 15:29:07 +02:00
Eric Coissac
8c26fc9884 Add a new test on obisummary 2025-07-07 15:28:29 +02:00
Eric Coissac
235a7e202a Update obisummary to account for the new obiseq.StatsOnValues type 2025-06-19 17:21:30 +02:00
Eric Coissac
27fa984a63 Patch obimatrix according to the new type obiseq.StatsOnValues 2025-06-19 16:51:53 +02:00
Eric Coissac
add9d89ccc Patch the Min and Max values of the expression language 2025-06-19 16:43:26 +02:00
Eric Coissac
9965370d85 Manage a lock on StatsOnValues 2025-06-17 16:46:11 +02:00
Eric Coissac
8a2bb1fe82 Changes to be committed:
modified:   pkg/obioptions/version.go
	modified:   pkg/obiseq/merge.go
2025-06-17 12:11:35 +02:00
Eric Coissac
efc3f3af29 Patch a concurrent access problem 2025-06-17 12:05:42 +02:00
Eric Coissac
1c6ab1c559 Changes to be committed:
modified:   pkg/obingslibrary/multimatch.go
	modified:   pkg/obioptions/version.go
2025-06-17 09:06:42 +02:00
Eric Coissac
38dcd98d4a Patch the genbank parser automata 2025-06-17 08:52:45 +02:00
Eric Coissac
7b23985693 Add _ to allowed in taxid 2025-06-06 14:37:57 +02:00
Eric Coissac
d31e677304 Patch a bug in obitag 2025-06-04 14:47:28 +02:00
Eric Coissac
6cb7a5a352 Changes to be committed:
modified:   cmd/obitools/obitag/main.go
	modified:   cmd/obitools/obitaxonomy/main.go
	modified:   pkg/obiformats/csvtaxdump_read.go
	modified:   pkg/obiformats/ecopcr_read.go
	modified:   pkg/obiformats/ncbitaxdump_read.go
	modified:   pkg/obiformats/ncbitaxdump_readtar.go
	modified:   pkg/obiformats/newick_write.go
	modified:   pkg/obiformats/options.go
	modified:   pkg/obiformats/taxonomy_read.go
	modified:   pkg/obiformats/universal_read.go
	modified:   pkg/obiiter/extract_taxonomy.go
	modified:   pkg/obioptions/options.go
	modified:   pkg/obioptions/version.go
	new file:   pkg/obiphylo/tree.go
	modified:   pkg/obiseq/biosequenceslice.go
	modified:   pkg/obiseq/taxonomy_methods.go
	modified:   pkg/obitax/taxonomy.go
	modified:   pkg/obitax/taxonset.go
	modified:   pkg/obitools/obiconvert/sequence_reader.go
	modified:   pkg/obitools/obitag/obitag.go
	modified:   pkg/obitools/obitaxonomy/obitaxonomy.go
	modified:   pkg/obitools/obitaxonomy/options.go
	deleted:    sample/.DS_Store
2025-06-04 09:48:10 +02:00
Eric Coissac
3424d3057f Changes to be committed:
modified:   pkg/obiformats/ngsfilter_read.go
	modified:   pkg/obioptions/version.go
	modified:   pkg/obiutils/mimetypes.go
2025-05-14 14:53:25 +02:00
Eric Coissac
f9324dd8f4 add min and max to the obitools expression language 2025-05-13 16:03:03 +02:00
Eric Coissac
f1b9ac4a13 Update the expression language 2025-05-07 20:45:05 +02:00
Eric Coissac
e065e2963b Update the install script 2025-05-01 11:45:46 +02:00
Eric Coissac
13ff892ac9 Patch type mismatch in apat C library 2025-04-23 16:14:10 +02:00
Eric Coissac
c0ecaf90ab Add the --number option to obiannotate 2025-04-22 18:35:51 +02:00
Eric Coissac
a57cfda675 Make the replace function of the eval language accepting regex 2025-04-10 15:17:15 +02:00
Eric Coissac
c2f38e737b Update of the packages 2025-04-10 15:16:36 +02:00
Eric Coissac
0aec5ba4df change the tests according to the corrections in obipairing 2025-04-04 17:10:17 +02:00
Eric Coissac
67e5b6ef24 Changes to be committed:
modified:   pkg/obioptions/version.go
2025-04-04 17:02:45 +02:00
Eric Coissac
3b1aa2869e Changes to be committed:
modified:   pkg/obioptions/version.go
2025-04-04 17:01:20 +02:00
Eric Coissac
7542e33010 Several bugs discovered while writing the doc 2025-04-04 16:59:27 +02:00
Eric Coissac
03b5ce9397 Patch a bug in obitag when some reference sequences have taxid absent from the taxonomy 2025-03-27 16:45:02 +01:00
Eric Coissac
2d52322876 Patch a bug in the obi2 annotation parser on map indexed by integers 2025-03-27 14:54:13 +01:00
Eric Coissac
fd80249b85 Patch a bug in obitag when a taxon from the reference library is unknown in the taxonomy 2025-03-27 14:28:15 +01:00
Eric Coissac
5a3705b6bb Adds the --silent-warning option to the obitools commands and removes the --pared-with option from some of the obitools commands. 2025-03-25 16:44:46 +01:00
Eric Coissac
2ab6f67d58 Add a progress bar to chimera detection 2025-03-25 08:37:27 +01:00
Eric Coissac
8b379d30da Adds the --newick-output option to the obitaxonomy command 2025-03-14 14:24:12 +01:00
Eric Coissac
8448783499 Make sequence files recognized as a taxonomy 2025-03-14 14:22:22 +01:00
Eric Coissac
d1c31c54de add a first version of the inline documentation 2025-03-12 14:40:42 +01:00
Eric Coissac
7a9dc1ab3b update release notes 2025-03-12 14:06:20 +01:00
Eric Coissac
3a1cf4fe97 Speed up the reading of very long fasta sequences, and more generally of every format 2025-03-12 13:29:41 +01:00
Eric Coissac
83926c91e1 Patch the install script to deactivate the CSV check 2025-03-12 13:28:52 +01:00
Eric Coissac
937a483aa6 Changes to be committed:
modified:   Makefile
2025-03-12 12:55:41 +01:00
Eric Coissac
dada70e6b1 Changes to be committed:
modified:   Makefile
2025-03-12 12:49:34 +01:00
Eric Coissac
62e5a93492 update the compress option name 2025-03-11 17:14:40 +01:00
Eric Coissac
f21f51ae62 Correct the logic of --update-taxid and --fail-on-taxonomy 2025-03-11 16:56:02 +01:00
Eric Coissac
3b5d4ba455 patch a bug in obiannotate 2025-03-11 16:35:38 +01:00
Eric Coissac
50d11ce374 Add a pre-push git-hook to run tests on obitools commands before pushing on master 2025-03-08 18:56:02 +01:00
Eric Coissac
52d5f6fe11 make makefile crashing on test error 2025-03-08 16:54:24 +01:00
Eric Coissac
78caabd2fd Add basic test on -h for all the commands 2025-03-08 16:28:06 +01:00
Eric Coissac
65bd29b955 normalize the usage of obitaxonomy 2025-03-08 13:00:55 +01:00
Eric Coissac
b18c9b7ac6 add the --raw-taxid option 2025-03-08 09:40:06 +01:00
Eric Coissac
78df7db18d typos 2025-03-08 07:44:41 +01:00
Eric Coissac
fc08c12ab0 update release notes 2025-03-08 07:42:20 +01:00
Eric Coissac
0339e4dffa Patch the size limit of the filetype guesser 2025-03-08 07:34:02 +01:00
Eric Coissac
706b44c37f Add option for csv input format 2025-03-08 07:21:24 +01:00
Eric Coissac
fbe7d15dc3 Changes to be committed:
modified:   pkg/obioptions/version.go
	modified:   pkg/obitools/obicleandb/obicleandb.go
	modified:   pkg/obitools/obicleandb/options.go
2025-03-06 13:38:38 +01:00
Eric Coissac
b5cf586f17 patch a duplicate --taxonomy option in obirefidxdb 2025-03-06 11:36:20 +01:00
Eric Coissac
286e27d6ba patch the scienctific_name tag name to "scientific_name" 2025-03-05 14:22:12 +01:00
Eric Coissac
996ec69bd9 update the release notes for version 4.4.0 2025-03-01 12:56:39 +01:00
Eric Coissac
5f9182d25b Changes to be committed:
modified:   pkg/obioptions/version.go
2025-03-01 09:20:39 +01:00
Eric Coissac
9913fa8354 Changes to be committed:
modified:   pkg/obioptions/version.go
2025-03-01 09:14:56 +01:00
Eric Coissac
7b23314651 Some typos 2025-03-01 08:29:27 +01:00
Eric Coissac
1e541eac4c Last commit version 2025-03-01 08:24:26 +01:00
Eric Coissac
13cd4c86ac Patch the bug on --out with paired sequence files 2025-02-27 18:13:21 +01:00
Eric Coissac
75dd535201 Add a --valid-taxid option to obigrep 2025-02-27 18:12:55 +01:00
Eric Coissac
573acafafc Patch bug on ecotag with too short sequences 2025-02-27 15:09:07 +01:00
Eric Coissac
0067152c2b Patch the production of the ratio file 2025-02-27 10:19:39 +01:00
Eric Coissac
791d253edc Generate the ratio file as compressed if -Z option enabled. 2025-02-27 09:06:07 +01:00
Eric Coissac
6245d7f684 Changes to be committed:
modified:   .gitignore
2025-02-24 15:47:45 +01:00
Eric Coissac
13d610aff7 Changes to be committed:
modified:   pkg/obioptions/version.go
	modified:   pkg/obitools/obiclean/chimera.go
2025-02-24 15:25:45 +01:00
Eric Coissac
db284f1d44 Add an experimental chimera detection... 2025-02-24 15:02:49 +01:00
Eric Coissac
51b3e83d32 some cleaning 2025-02-24 11:31:49 +01:00
Eric Coissac
8671285d02 add the --min-sample-count option to obiclean. 2025-02-24 08:48:31 +01:00
Eric Coissac
51d11aa36d Changes to be committed:
modified:   pkg/obialign/alignment.go
	modified:   pkg/obialign/pairedendalign.go
	modified:   pkg/obioptions/version.go
	modified:   pkg/obitools/obipairing/pairing.go
2025-02-23 17:37:56 +01:00
Eric Coissac
fb6f857d8c Update the computation of the consensus quality score 2025-02-23 15:16:31 +01:00
Eric Coissac
d4209b4549 Add a basic test for obipairing 2025-02-22 09:57:44 +01:00
Eric Coissac
ef05d4975f Update the scoring schema of obipairing 2025-02-21 22:41:34 +01:00
Eric Coissac
4588bf8b5d Patch the make file to fail on error 2025-02-19 15:55:07 +01:00
Eric Coissac
090633850d Changes to be committed:
modified:   obitests/obitools/obicount/test.sh
2025-02-19 15:28:42 +01:00
Eric Coissac
15a058cf63 with all the sample files for tests 2025-02-19 15:27:38 +01:00
Eric Coissac
2f5f7634d6 Changes to be committed:
modified:   obitests/obitools/obicount/test.sh
2025-02-19 14:50:10 +01:00
Eric Coissac
48138b605c Changes to be committed:
modified:   .github/workflows/obitest.yml
	modified:   Makefile
	modified:   obitests/obitools/obicount/test.sh
2025-02-19 14:37:05 +01:00
Eric Coissac
aed22c12a6 Changes to be committed:
modified:   obitests/obitools/obicount/test.sh
2025-02-19 14:34:22 +01:00
Eric Coissac
443a9b3ce3 Changes to be committed:
modified:   Makefile
	modified:   obitests/obitools/obicount/test.sh
2025-02-19 14:28:49 +01:00
Eric Coissac
7e90537379 For run of test using bash in makefile 2025-02-19 13:58:52 +01:00
Eric Coissac
d3d15acc6c Changes to be committed:
modified:   obitests/obitools/obicount/test.sh
	modified:   pkg/obioptions/version.go
2025-02-19 13:54:01 +01:00
Eric Coissac
bd4a0b5ca5 Trial of a google action to run the obitools tests 2025-02-19 13:45:43 +01:00
Eric Coissac
952f85f312 A first trial of a test for obicount 2025-02-19 13:17:36 +01:00
Eric Coissac
4774438644 Changes to be committed:
modified:   pkg/obiformats/universal_read.go
	modified:   pkg/obioptions/version.go
	modified:   pkg/obiseq/taxonomy_methods.go
2025-02-12 08:40:38 +01:00
Eric Coissac
6a8061cc4f Add management of the taxonomy alias policy 2025-02-10 14:05:47 +01:00
Eric Coissac
e2563cd8df Patch a bug in registering merged taxa 2025-02-10 11:42:46 +01:00
Eric Coissac
f2e81adf95 Changes to be committed:
modified:   .gitignore
	deleted:    xxx.csv
2025-02-05 19:28:19 +01:00
Eric Coissac
f27e9bc91e patch a bug related to csv and qualities 2025-02-05 19:27:00 +01:00
Eric Coissac
773e54965d Patch a bug on compressed output 2025-02-05 14:18:24 +01:00
Eric Coissac
ceca33998b add extensions fq in directory scanning 2025-02-04 20:34:58 +01:00
Eric Coissac
b9bee5f426 Changes to be committed:
modified:   go.mod
	modified:   go.sum
	modified:   pkg/obilua/obilib.go
	modified:   pkg/obilua/obiseq.go
	modified:   pkg/obilua/obiseqslice.go
	new file:   pkg/obilua/obitaxon.go
	new file:   pkg/obilua/obitaxonomy.go
	modified:   pkg/obioptions/version.go
2025-02-02 16:52:52 +01:00
Eric Coissac
c10df073a7 Changes to be committed:
modified:   pkg/obioptions/version.go
	modified:   pkg/obitax/iterator.go
2025-02-01 12:06:19 +01:00
Eric Coissac
d3dac1b21f Make obitag able to use the taxonomic path included in reference database as taxonomy 2025-01-30 11:50:03 +01:00
Eric Coissac
0df082da06 Adds possibility to extract a taxonomy from taxonomic path included in sequence files 2025-01-30 11:18:21 +01:00
Eric Coissac
2452aef7a9 patch multiple -Z options 2025-01-29 21:35:28 +01:00
Eric Coissac
337954592d add the --out option to the obitaxonomy 2025-01-29 13:22:35 +01:00
Eric Coissac
8a28c9ae7c add the --download-ncbi option to obitaxonomy 2025-01-29 12:38:39 +01:00
Eric Coissac
b6b18c0fa1 Changes to be committed:
modified:   pkg/obioptions/version.go
2025-01-29 11:34:01 +01:00
Eric Coissac
67e2758d63 Switch to release number 4.3.0 2025-01-29 11:33:30 +01:00
Eric Coissac
00f2dc2697 Rename obifind obitaxonomy and introduce the new CSV format for taxonomy. 2025-01-29 10:45:26 +01:00
Eric Coissac
c50a0f409d break the import cycle 2025-01-27 17:23:07 +01:00
Eric Coissac
7c4042df6b introduce obidefault 2025-01-27 17:12:45 +01:00
Eric Coissac
0a567f621c small changes 2025-01-24 18:12:37 +01:00
Eric Coissac
9acb4a85a8 Refactoring of the default values 2025-01-24 18:09:59 +01:00
Eric Coissac
3137c1f841 Adds the ability to read gzip-tar file for the taxonomy dump 2025-01-24 11:47:59 +01:00
Eric Coissac
ffd67252c3 patch a bug in obimultiplex: change option -t into -s 2025-01-23 13:50:24 +01:00
Eric Coissac
757448cb1e debug obigrep taxonomy options 2025-01-15 20:48:34 +01:00
Eric Coissac
4ae3336135 obidemerge: set demerge on sample the default 2025-01-10 17:09:44 +01:00
Eric Coissac
d066bb6878 Changes to be committed:
modified:   .gitignore
	modified:   cmd/test/main.go
	modified:   pkg/obioptions/version.go
2025-01-09 07:24:41 +01:00
Eric Coissac
becb995e3d misspellings penality -> pelnalty 2025-01-09 07:22:23 +01:00
Eric Coissac
c58d9772ac Two small corrections on options 2025-01-07 11:51:37 +01:00
Eric Coissac
67c0d00a4d remove the obitag2 command 2025-01-07 11:50:28 +01:00
Eric Coissac
4fe0db63ff Patch CSV reader to use the new taxonomy system 2024-12-20 21:30:00 +01:00
Eric Coissac
ccd3b06532 Merge branch 'master' into taxonomy 2024-12-20 20:06:57 +01:00
Eric Coissac
5d0f996625 Patch a small bug on json write 2024-12-20 19:42:03 +01:00
Eric Coissac
abfa8f357a patch a bug in Taxid parsing 2024-12-19 13:54:23 +01:00
Eric Coissac
795df34d1a Changes to be committed:
modified:   cmd/obitools/obitag/main.go
	modified:   cmd/obitools/obitag2/main.go
	modified:   go.mod
	modified:   go.sum
	modified:   pkg/obiformats/ncbitaxdump/read.go
	modified:   pkg/obioptions/version.go
	modified:   pkg/obiseq/attributes.go
	modified:   pkg/obiseq/taxonomy_lca.go
	modified:   pkg/obiseq/taxonomy_methods.go
	modified:   pkg/obiseq/taxonomy_predicate.go
	modified:   pkg/obitax/inner.go
	modified:   pkg/obitax/lca.go
	new file:   pkg/obitax/taxid.go
	modified:   pkg/obitax/taxon.go
	modified:   pkg/obitax/taxonomy.go
	modified:   pkg/obitax/taxonslice.go
	modified:   pkg/obitools/obicleandb/obicleandb.go
	modified:   pkg/obitools/obigrep/options.go
	modified:   pkg/obitools/obilandmark/obilandmark.go
	modified:   pkg/obitools/obilandmark/options.go
	modified:   pkg/obitools/obirefidx/famlilyindexing.go
	modified:   pkg/obitools/obirefidx/geomindexing.go
	modified:   pkg/obitools/obirefidx/obirefidx.go
	modified:   pkg/obitools/obirefidx/options.go
	modified:   pkg/obitools/obitag/obigeomtag.go
	modified:   pkg/obitools/obitag/obitag.go
	modified:   pkg/obitools/obitag/options.go
	modified:   pkg/obiutils/strings.go
2024-12-19 13:36:59 +01:00
Eric Coissac
f2525d7b07 Changes to be committed:
modified:   pkg/obilua/lua.go
	modified:   pkg/obioptions/version.go
2024-12-04 11:42:05 +01:00
Eric Coissac
39dd3e3ce8 Add management of []interface{} to the LUA API. 2024-12-04 10:58:17 +01:00
Eric Coissac
f41a6fbb60 Patch a small bug on json write 2024-11-29 18:39:18 +01:00
Eric Coissac
00b0edc15a refactoring of the file chunk writing 2024-11-29 18:15:03 +01:00
Eric Coissac
ad2461a656 some minor improvements 2024-11-27 13:36:06 +01:00
Eric Coissac
40fb4e9767 reduce the memory impact of obiuniq. 2024-11-27 13:30:16 +01:00
Eric Coissac
d29a56dcbf Changes to be committed:
modified:   Release-notes.md
	modified:   pkg/obialign/pairedendalign.go
	modified:   pkg/obilua/obiseq.go
	modified:   pkg/obioptions/version.go
	modified:   pkg/obiseq/biosequence.go
	modified:   pkg/obitools/obipairing/pairing.go
2024-11-27 09:56:22 +01:00
Eric Coissac
69ef1758a2 obicsv debug 2024-11-24 23:38:49 +01:00
Eric Coissac
3d06978808 a functional new version of obifind 2024-11-24 19:33:24 +01:00
Eric Coissac
7884a74f9c Patch a bug in obitagpcr 2024-11-18 21:10:47 +01:00
Eric Coissac
36327c79c8 Changes to be committed:
modified:   .gitignore
	new file:   pkg/obitax/default_taxonomy.go
	modified:   pkg/obitax/taxon.go
	modified:   pkg/obitax/taxonnode.go
	modified:   pkg/obitax/taxonomy.go
	modified:   pkg/obitax/taxonset.go
	modified:   pkg/obitax/taxonslice.go
	modified:   pkg/obitools/obifind/iterator.go
	modified:   pkg/obitools/obifind/options.go
2024-11-16 10:01:49 +01:00
Eric Coissac
f3d8707c08 Add default taxonomy 2024-11-16 10:01:07 +01:00
Eric Coissac
7633fc4d23 update documentation 2024-11-16 06:00:27 +01:00
Eric Coissac
f5d79d0bc4 update api documentation 2024-11-16 05:59:41 +01:00
Eric Coissac
03f4e88a17 First functional version 2024-11-14 19:10:23 +01:00
Eric Coissac
9471fedfa1 First step in the obitax rewriting 2024-11-08 09:48:16 +01:00
Eric Coissac
4b65bfce84 Changes to be committed:
modified:   pkg/obioptions/version.go
	modified:   pkg/obitools/obitagpcr/pcrtag.go
2024-11-04 14:30:17 +01:00
Eric Coissac
fc75974c68 Changes to be committed:
modified:   pkg/obioptions/version.go
	modified:   pkg/obitools/obitagpcr/options.go
2024-11-04 13:25:52 +01:00
Eric Coissac
422f11cceb Changes to be committed:
modified:   pkg/obitools/obitagpcr/pcrtag.go
2024-10-29 16:53:10 +01:00
Eric Coissac
fefc360f80 Changes to be committed:
modified:   pkg/obioptions/version.go
	modified:   pkg/obitools/obitagpcr/pcrtag.go
	modified:   pkg/obiutils/abs.go
	new file:   pkg/obiutils/abs_test.go
2024-10-28 21:51:21 +01:00
Eric Coissac
3e00d39d47 In obimultiplex, patch a bug when no tag are associated to a primer. 2024-10-22 14:12:20 +02:00
Eric Coissac
9e8a7fd9be Patch a bug in fastq reader 2024-10-20 16:07:43 +02:00
Eric Coissac
74280e4704 Merge branch 'no-pools' 2024-09-24 16:38:23 +02:00
Eric Coissac
7255c71576 commit version 2024-09-24 16:36:59 +02:00
Eric Coissac
241f2286f2 remove the slice pool management 2024-09-24 16:31:30 +02:00
Eric Coissac
b37fc39ead switch to go version 1.23.1 2024-09-24 15:55:01 +02:00
Eric Coissac
2b4a633c30 EndGapFree alignments 2024-09-24 15:52:12 +02:00
Eric Coissac
05bf2bfd6c Add option related to agrep match on obigrep and obiannotate 2024-09-09 16:52:13 +02:00
Eric Coissac
65ae82622e correction of several small bugs 2024-09-03 06:08:07 -03:00
Eric Coissac
373464cb06 On development genome skim tools 2024-08-30 11:17:33 +02:00
Eric Coissac
cd330db672 Add option to obimicrosat to control microsat length and orientation 2024-08-30 11:16:43 +02:00
Eric Coissac
31bfc88eb9 Patch a bug on writing to stdout, and add clearer error on opening data files 2024-08-13 09:45:28 +02:00
Eric Coissac
bdb96dda94 Adds the obimicrosat command 2024-08-05 15:31:20 +02:00
Eric Coissac
3f57935328 Adjust the size of the genbank and embl buffer size 2024-08-05 11:32:37 +02:00
Eric Coissac
886b5d9a96 Optimize memory for readers and writers 2024-08-05 10:48:28 +02:00
Eric Coissac
f83032e643 Tag version 2024-08-02 14:30:53 +02:00
357 changed files with 32848 additions and 5507 deletions

19
.github/workflows/obitest.yml vendored Normal file

@@ -0,0 +1,19 @@
name: "Run the obitools command test suite"
on:
  push:
    branches:
      - master
      - V*
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout obitools4 project
        uses: actions/checkout@v4
      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: "1.23"
      - name: Run tests
        run: make githubtests

172
.github/workflows/release.yml vendored Normal file

@@ -0,0 +1,172 @@
name: Create Release on Tag
on:
  push:
    tags:
      - "Release_*"
permissions:
  contents: write
jobs:
  # First run tests
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: "1.23"
      - name: Checkout obitools4 project
        uses: actions/checkout@v4
      - name: Run tests
        run: make githubtests
  # Build binaries for each platform
  build:
    needs: test
    strategy:
      matrix:
        include:
          - os: ubuntu-latest
            goos: linux
            goarch: amd64
            output_name: linux_amd64
          - os: ubuntu-24.04-arm
            goos: linux
            goarch: arm64
            output_name: linux_arm64
          - os: macos-15-intel
            goos: darwin
            goarch: amd64
            output_name: darwin_amd64
          - os: macos-latest
            goos: darwin
            goarch: arm64
            output_name: darwin_arm64
    runs-on: ${{ matrix.os }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: "1.23"
      - name: Extract version from tag
        id: get_version
        run: |
          TAG=${GITHUB_REF#refs/tags/Release_}
          echo "version=$TAG" >> $GITHUB_OUTPUT
      - name: Install build tools (macOS)
        if: runner.os == 'macOS'
        run: |
          # Ensure Xcode Command Line Tools are installed
          xcode-select --install 2>/dev/null || true
          xcode-select -p
      - name: Build binaries
        env:
          GOOS: ${{ matrix.goos }}
          GOARCH: ${{ matrix.goarch }}
          VERSION: ${{ steps.get_version.outputs.version }}
        run: |
          make obitools
          mkdir -p artifacts
          # Create a single tar.gz with all binaries for this platform
          tar -czf artifacts/obitools4_${VERSION}_${{ matrix.output_name }}.tar.gz -C build .
      - name: Upload artifacts
        uses: actions/upload-artifact@v4
        with:
          name: binaries-${{ matrix.output_name }}
          path: artifacts/*
  # Create the release
  create-release:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Extract version from tag
        id: get_version
        run: |
          TAG=${GITHUB_REF#refs/tags/Release_}
          echo "version=$TAG" >> $GITHUB_OUTPUT
      - name: Download all artifacts
        uses: actions/download-artifact@v4
        with:
          path: release-artifacts
      - name: Prepare release directory
        run: |
          mkdir -p release
          find release-artifacts -type f -name "*.tar.gz" -exec cp {} release/ \;
          ls -lh release/
      - name: Generate Release Notes
        env:
          VERSION: ${{ steps.get_version.outputs.version }}
        run: |
          PREV_TAG=$(git describe --tags --abbrev=0 HEAD^ 2>/dev/null || echo "")
          echo "# OBITools4 Release ${VERSION}" > release_notes.md
          echo "" >> release_notes.md
          if [ -n "$PREV_TAG" ]; then
            echo "## Changes since ${PREV_TAG}" >> release_notes.md
            echo "" >> release_notes.md
            git log ${PREV_TAG}..HEAD --pretty=format:"- %s" >> release_notes.md
          else
            echo "## Changes" >> release_notes.md
            echo "" >> release_notes.md
            git log --pretty=format:"- %s" -n 20 >> release_notes.md
          fi
          echo "" >> release_notes.md
          echo "" >> release_notes.md
          echo "## Installation" >> release_notes.md
          echo "" >> release_notes.md
          echo "Download the appropriate archive for your system and extract it:" >> release_notes.md
          echo "" >> release_notes.md
          echo "### Linux (AMD64)" >> release_notes.md
          echo '```bash' >> release_notes.md
          echo "tar -xzf obitools4_${VERSION}_linux_amd64.tar.gz" >> release_notes.md
          echo '```' >> release_notes.md
          echo "" >> release_notes.md
          echo "### Linux (ARM64)" >> release_notes.md
          echo '```bash' >> release_notes.md
          echo "tar -xzf obitools4_${VERSION}_linux_arm64.tar.gz" >> release_notes.md
          echo '```' >> release_notes.md
          echo "" >> release_notes.md
          echo "### macOS (Intel)" >> release_notes.md
          echo '```bash' >> release_notes.md
          echo "tar -xzf obitools4_${VERSION}_darwin_amd64.tar.gz" >> release_notes.md
          echo '```' >> release_notes.md
          echo "" >> release_notes.md
          echo "### macOS (Apple Silicon)" >> release_notes.md
          echo '```bash' >> release_notes.md
          echo "tar -xzf obitools4_${VERSION}_darwin_arm64.tar.gz" >> release_notes.md
          echo '```' >> release_notes.md
          echo "" >> release_notes.md
          echo "All OBITools4 binaries are included in each archive." >> release_notes.md
      - name: Create GitHub Release
        uses: softprops/action-gh-release@v1
        with:
          name: Release ${{ steps.get_version.outputs.version }}
          body_path: release_notes.md
          files: release/*
          draft: false
          prerelease: false
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

154
.gitignore vendored

@@ -1,120 +1,36 @@
cpu.pprof
cpu.trace
test
bin
vendor
*.fastq
*.fasta
*.fastq.gz
*.fasta.gz
.DS_Store
*.gml
*.log
/argaly
/obiconvert
/obicount
/obimultiplex
/obipairing
/obipcr
/obifind
/obidistribute
/obiuniq
/build
/Makefile.old
.Rproj.user
obitools.Rproj
Stat_error.knit.md
.Rhistory
Stat_error.nb.html
Stat_error.Rmd
/.luarc.json
/doc/TAXO/
/doc/results/
/doc/_main.log
/doc/_book/_main.tex
/doc/_freeze/
/doc/tutorial_files/
/doc/wolf_data/
/taxdump/
/.vscode/
/Algo-Alignement.numbers
/Estimate_proba_true_seq.html
/Estimate_proba_true_seq.nb.html
/Estimate_proba_true_seq.Rmd
/modele_error_euka.qmd
/obitools.code-workspace
.DS_Store
.RData
x
xxx
y
/doc/wolf_diet.tgz
/doc/man/depends
/sample/wolf_R1.fasta.gz
/sample/wolf_R2.fasta.gz
/sample/euka03.ecotag.fasta.gz
/sample/ratio.csv
/sample/STD_PLN_1.dat
/sample/STD_PLN_2.dat
/sample/subset_Pasvik_R1.fastq.gz
/sample/subset_Pasvik_R2.fastq.gz
/sample/test_gobitools.fasta.bz2
euka03.csv*
gbbct793.seq.gz
gbinv1003.seq.gz
gbpln210.seq
/doc/book/OBITools-V4.aux
/doc/book/OBITools-V4.fdb_latexmk
/doc/book/OBITools-V4.fls
/doc/book/OBITools-V4.log
/doc/book/OBITools-V4.pdf
/doc/book/OBITools-V4.synctex.gz
/doc/book/OBITools-V4.tex
/doc/book/OBITools-V4.toc
getoptions.adoc
Archive.zip
.DS_Store
sample/.DS_Store
sample/consensus_graphs/specimen_hac_plants_Vern_disicolor_.gml
93954
Bact03.e5.gb_R254.obipcr.idx.fasta.save
sample/test.obipcr.log
Bact02.e3.gb_R254.obipcr.fasta.gz
Example_Arth03.ngsfilter
SPER01.csv
SPER03.csv
wolf_diet_ngsfilter.txt
**/cpu.pprof
**/cpu.trace
**/test
**/bin
**/vendor
**/*.fastq
**/*.fasta
**/*.fastq.gz
**/*.fasta.gz
**/.DS_Store
**/*.gml
**/*.log
**/xxx*
**/*.sav
**/*.old
**/*.tgz
**/*.yaml
**/*.csv
xx
xxx.gb
yyy_geom.csv
yyy_LCS.csv
yyy.json
bug_obimultiplex/toto
bug_obimultiplex/toto_mapping
bug_obimultiplex/tutu
bug_obimultiplex/tutu_mapping
bug_obipairing/GIT1_GH_ngsfilter.txt
doc/book/TAXO/citations.dmp
doc/book/TAXO/delnodes.dmp
doc/book/TAXO/division.dmp
doc/book/TAXO/gc.prt
doc/book/TAXO/gencode.dmp
doc/book/TAXO/merged.dmp
doc/book/TAXO/names.dmp
doc/book/TAXO/nodes.dmp
doc/book/TAXO/readme.txt
doc/book/wolf_data/Release-253/ncbitaxo/citations.dmp
doc/book/wolf_data/Release-253/ncbitaxo/delnodes.dmp
doc/book/wolf_data/Release-253/ncbitaxo/division.dmp
doc/book/wolf_data/Release-253/ncbitaxo/gc.prt
doc/book/wolf_data/Release-253/ncbitaxo/gencode.dmp
doc/book/wolf_data/Release-253/ncbitaxo/merged.dmp
doc/book/wolf_data/Release-253/ncbitaxo/names.dmp
doc/book/wolf_data/Release-253/ncbitaxo/nodes.dmp
doc/book/wolf_data/Release-253/ncbitaxo/readme.txt
doc/book/results/toto.tasta
sample/.DS_Store
GO
.rhistory
/.vscode
/build
/bugs
/ncbitaxo
!/obitests/**
!/sample/**
LLM/**
*_files
entropy.html
bug_id.txt
obilowmask_ref
test_*

122
Makefile

@@ -2,8 +2,9 @@
#export GOBIN=$(GOPATH)/bin
#export PATH=$(GOBIN):$(shell echo $${PATH})
GOFLAGS=
GOCMD=go
GOBUILD=$(GOCMD) build # -compiler gccgo -gccgoflags -O3
GOBUILD=$(GOCMD) build $(GOFLAGS)
GOGENERATE=$(GOCMD) generate
GOCLEAN=$(GOCMD) clean
GOTEST=$(GOCMD) test
@@ -16,6 +17,12 @@ PACKAGES_SRC:= $(wildcard pkg/*/*.go pkg/*/*/*.go)
PACKAGE_DIRS:=$(sort $(patsubst %/,%,$(dir $(PACKAGES_SRC))))
PACKAGES:=$(notdir $(PACKAGE_DIRS))
GITHOOK_SRC_DIR=git-hooks
GITHOOKS_SRC:=$(wildcard $(GITHOOK_SRC_DIR)/*)
GITHOOK_DIR=.git/hooks
GITHOOKS:=$(patsubst $(GITHOOK_SRC_DIR)/%,$(GITHOOK_DIR)/%,$(GITHOOKS_SRC))
OBITOOLS_SRC:= $(wildcard cmd/obitools/*/*.go)
OBITOOLS_DIRS:=$(sort $(patsubst %/,%,$(dir $(OBITOOLS_SRC))))
OBITOOLS:=$(notdir $(OBITOOLS_DIRS))
@@ -53,27 +60,31 @@ endif
OUTPUT:=$(shell mktemp)
all: obitools
all: install-githook obitools
obitools: $(patsubst %,$(OBITOOLS_PREFIX)%,$(OBITOOLS))
install-githook: $(GITHOOKS)
$(GITHOOK_DIR)/%: $(GITHOOK_SRC_DIR)/%
@echo installing $$(basename $@)...
@mkdir -p $(GITHOOK_DIR)
@cp $< $@
@chmod +x $@
packages: $(patsubst %,pkg-%,$(PACKAGES))
obitools: $(patsubst %,$(OBITOOLS_PREFIX)%,$(OBITOOLS))
update-deps:
go get -u ./...
test:
test: .FORCE
$(GOTEST) ./...
man:
make -C doc man
obibook:
make -C doc obibook
doc: man obibook
macos-pkg:
@bash pkgs/macos/macos-installer-builder-master/macOS-x64/build-macos-x64.sh \
OBITools \
0.0.1
obitests:
@for t in $$(find obitests -name test.sh -print) ; do \
bash $${t} || exit 1;\
done
githubtests: obitools obitests
$(BUILD_DIR):
mkdir -p $@
@@ -83,19 +94,80 @@ $(foreach P,$(PACKAGE_DIRS),$(eval $(call MAKE_PKG_RULE,$(P))))
$(foreach P,$(OBITOOLS_DIRS),$(eval $(call MAKE_OBITOOLS_RULE,$(P))))
pkg/obioptions/version.go: .FORCE
ifneq ($(strip $(COMMIT_ID)),)
@cat $@ \
| sed -E 's/^var _Commit = "[^"]*"/var _Commit = "'$(COMMIT_ID)'"/' \
| sed -E 's/^var _Version = "[^"]*"/var _Version = "'"$(LAST_TAG)"'"/' \
pkg/obioptions/version.go: version.txt .FORCE
@version=$$(cat version.txt); \
cat $@ \
| sed -E 's/^var _Version = "[^"]*"/var _Version = "Release '$$version'"/' \
> $(OUTPUT)
@diff $@ $(OUTPUT) 2>&1 > /dev/null \
|| echo "Update version.go : $@ to $(LAST_TAG) ($(COMMIT_ID))" \
&& mv $(OUTPUT) $@
|| (echo "Update version.go to $$(cat version.txt)" && mv $(OUTPUT) $@)
@rm -f $(OUTPUT)
endif
.PHONY: all packages obitools man obibook doc update-deps .FORCE
.FORCE:
bump-version:
@echo "Incrementing version..."
@current=$$(cat version.txt); \
echo " Current version: $$current"; \
major=$$(echo $$current | cut -d. -f1); \
minor=$$(echo $$current | cut -d. -f2); \
patch=$$(echo $$current | cut -d. -f3); \
new_patch=$$((patch + 1)); \
new_version="$$major.$$minor.$$new_patch"; \
echo " New version: $$new_version"; \
echo "$$new_version" > version.txt
@echo "✓ Version updated in version.txt"
@$(MAKE) pkg/obioptions/version.go
jjnew:
@echo "$(YELLOW)→ Creating a new commit...$(NC)"
@echo "$(BLUE)→ Documenting current commit...$(NC)"
@jj auto-describe
@echo "$(BLUE)→ Done.$(NC)"
@jj new
@echo "$(GREEN)✓ New commit created$(NC)"
jjpush:
@echo "$(YELLOW)→ Pushing commit to repository...$(NC)"
@echo "$(BLUE)→ Documenting current commit...$(NC)"
@jj auto-describe
@echo "$(BLUE)→ Creating new commit for version bump...$(NC)"
@jj new
@previous_version=$$(cat version.txt); \
$(MAKE) bump-version; \
version=$$(cat version.txt); \
tag_name="Release_$$version"; \
previous_tag="Release_$$previous_version"; \
echo "$(BLUE)→ Documenting version bump commit...$(NC)"; \
jj auto-describe; \
echo "$(BLUE)→ Generating release notes from $$previous_tag to current commit...$(NC)"; \
if command -v orla >/dev/null 2>&1 && command -v jq >/dev/null 2>&1; then \
release_json=$$(jj log -r "$$previous_tag::@" -T 'commit_id.short() ++ " " ++ description' | \
ORLA_MAX_TOOL_CALLS=50 orla agent -m ollama:qwen3-coder-next:latest \
"Summarize the following commits into a GitHub release note for version $$version. Ignore commits related to version bumps, .gitignore changes, or any internal housekeeping that is irrelevant to end users. Describe each user-facing change precisely without exposing code. Eliminate redundancy. Output strictly valid JSON with no surrounding text, using this exact schema: {\"title\": \"<short release title>\", \"body\": \"<detailed markdown release notes>\"}"); \
release_json=$$(echo "$$release_json" | sed -n '/^{/,/^}/p'); \
release_title=$$(echo "$$release_json" | jq -r '.title // empty') ; \
release_body=$$(echo "$$release_json" | jq -r '.body // empty') ; \
if [ -n "$$release_title" ] && [ -n "$$release_body" ]; then \
release_message="$$release_title"$$'\n\n'"$$release_body"; \
else \
echo "$(YELLOW)⚠ JSON parsing failed, falling back to raw output$(NC)"; \
release_message="Release $$version"$$'\n\n'"$$release_json"; \
fi; \
else \
release_message="Release $$version"; \
fi; \
echo "$(BLUE)→ Pushing commits and creating tag $$tag_name...$(NC)"; \
jj git push --change @; \
git tag -a "$$tag_name" -m "$$release_message" 2>/dev/null || echo "Tag $$tag_name already exists"; \
git push origin "$$tag_name" 2>/dev/null || echo "Tag already pushed"
@echo "$(GREEN)✓ Commits and tag pushed to repository$(NC)"
jjfetch:
@echo "$(YELLOW)→ Pulling latest commits...$(NC)"
@jj git fetch
@jj new master@origin
@echo "$(GREEN)✓ Latest commits pulled$(NC)"
.PHONY: all obitools update-deps obitests githubtests jjnew jjpush jjfetch bump-version .FORCE
.FORCE:


@@ -16,12 +16,17 @@ The easiest way to run it is to copy and paste the following command into your t
curl -L https://raw.githubusercontent.com/metabarcoding/obitools4/master/install_obitools.sh | bash
```
By default, the script installs the *OBITools* commands and other associated files into the `/usr/local` directory.
The names of the commands in the new *OBITools4* are mostly identical to those in *OBITools2*.
Therefore, installing the new *OBITools* may hide or delete the old ones. If you want both versions to be
available on your system, the installation script offers two options:
By default, the script installs the latest version of *OBITools* commands and other associated files into the `/usr/local` directory.
### Installation Options
The installation script offers several options:
> -l, --list List all available versions and exit.
>
> -v, --version Install a specific version (e.g., `-v 4.4.3`).
> By default, the latest version is installed.
>
> -i, --install-dir Directory where obitools are installed
> (for example, use `/usr/local`, not `/usr/local/bin`).
>
@@ -30,14 +35,31 @@ available on your system, the installation script offers two options:
> same time on your system (for example, `-p g` will produce the
> `gobigrep` command instead of `obigrep`).
You can use these options by following the installation command:
### Examples
List all available versions:
```{bash}
curl -L https://raw.githubusercontent.com/metabarcoding/obitools4/master/install_obitools.sh | bash -s -- --list
```
Install a specific version:
```{bash}
curl -L https://raw.githubusercontent.com/metabarcoding/obitools4/master/install_obitools.sh | bash -s -- --version 4.4.3
```
Install in a custom directory with command prefix:
```{bash}
curl -L https://raw.githubusercontent.com/metabarcoding/obitools4/master/install_obitools.sh | \
bash -s -- --install-dir test_install --obitools-prefix k
```
In this case, the binaries will be installed in the `test_install` directory and all command names will be prefixed with the letter `k`. Thus `obigrep` will be named `kobigrep`.
In this last example, the binaries will be installed in the `test_install` directory and all command names will be prefixed with the letter `k`. Thus, `obigrep` will be named `kobigrep`.
### Note on Version Compatibility
The names of the commands in the new *OBITools4* are mostly identical to those in *OBITools2*.
Therefore, installing the new *OBITools* may hide or delete the old ones. If you want both versions to be
available on your system, use the `--install-dir` and `--obitools-prefix` options as shown above.
## Continuing the analysis...


@@ -1,51 +1,190 @@
# OBITools release notes
## Latest changes
## New changes
### Bug fixes
- In `obipairing`, the misspelled `obiparing_*` tags, where the `i` was
missing, are corrected to `obipairing_*`.
- In `obigrep`, the **-C** option that excludes overly abundant sequences was
not functional.
- In `obitaxonomy`, the **-l** option that lists all the taxonomic ranks defined
by a taxonomy was not functional.
- The file type guesser was not using enough data to correctly detect the
file format when sequences were too long in fastq and fasta files, or when
lines were too long in CSV files. This is now corrected.
- The **--fasta** and **--fastq** options used to specify the input format were
ignored. They are now correctly taken into account.
- The `obiannotate` command was crashing when a selection option was used
without any editing option.
- The `--fail-on-taxonomy` option led to an error on merged taxa even when the
`--update-taxid` option was used.
- The `--compressed` option was not correctly named. It has been renamed to
`--compress`.
### Enhancement
- Some sequences in the Genbank and EMBL databases are several gigabases long. The
sequence parser had to reallocate and recopy memory many times to read them,
resulting in a complexity of O(N^2) for reading such large sequences.
The new file chunk reader has a linear algorithm that speeds up the reading
of very long sequences.
- A new option **--csv** is added to every obitools command to indicate that the
input format is CSV.
- The new version of obitools now prints taxids in a decorated form
including the scientific name and the taxonomic rank (`"taxon:9606 [Homo
sapiens]@species"`). If you need the old-style raw taxid, a new option
**--raw-taxid** makes obitools print the taxids without any
decoration (`"9606"`).
## March 1st, 2025. Release 4.4.0
A new documentation website is available at https://obitools4.metabarcoding.org.
Its development is still in progress.
The biggest step forward in this new version is taxonomy management. The new
version is now able to handle taxonomic identifiers that are not just integer
values. This is a first step towards an easy way to handle other taxonomy
databases soon, such as the GBIF or Catalog of Life taxonomies. This version
is able to handle files containing taxonomic information created by previous
versions of OBITools, but files created by this new version may have some
problems to be analyzed by previous versions, at least for the taxonomic
information.
### Breaking changes
- In `obimultiplex`, the short version of the **--tag-list** option used to
specify the list of tags and primers to be used for the demultiplexing has
been changed from `-t` to `-s`.
- The command `obifind` is now renamed `obitaxonomy`.
- The **--taxdump** option used to specify the path to the taxdump containing
the NCBI taxonomy has been renamed to **--taxonomy**.
### Bug fixes
- Correction of a bug when using paired sequence file with the **--out** option.
- Correction of a bug in `obitag` when trying to annotate very short sequences of
4 bases or less.
- In `obipairing`, correct the stats `seq_a_single` and `seq_b_single` when
in right alignment mode.
- Not really a bug but the memory impact of `obiuniq` has been reduced by reducing
the batch size and not reading the qualities from the fastq files as `obiuniq`
is producing only fasta output without qualities.
- In `obitag`, correct the wrong assignment of the **obitag_bestmatch**
attribute.
- In `obiclean`, the **--no-progress-bar** option disables all progress bars,
not just the data.
- Several fixes in reading FASTA and FASTQ files, including some code
simplification and factorization.
- Fixed a bug in all obitools that caused the same file to be processed
multiple times, when specifying a directory name as input.
### New features
- `obigrep` adds a new **--valid-taxid** option to keep only sequences with a
valid taxid.
- `obiclean` adds a new **--min-sample-count** option with a default value of 1,
filtering out sequences that do not occur in at least the
specified number of samples.
- In `obitaxonomy`, a new **--dump|D** option allows for dumping a sub-taxonomy.
- A taxonomy dump can now be provided as a four-column CSV file to the
**--taxonomy** option.
- The NCBI taxonomy dump no longer needs to be uncompressed and unarchived. The
path of the tarred and gzipped dump file can be directly specified using the
**--taxonomy** option.
- Most of the time, obitools automatically identify the sequence file format, but
detection sometimes fails. Two new options, **--fasta** and **--fastq**, are added to
allow the processing of the rare fasta and fastq files that are not recognized.
- In `obiscript`, adds new methods to the Lua sequence object:
- `md5_string()`: returning the MD5 check sum as a hexadecimal string,
- `subsequence(from,to)`: allows extracting a subsequence on a 0-based
coordinate system, upper bound excluded as in Go.
- `reverse_complement`: returning a sequence object corresponding to the
reverse complement of the current sequence.
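The semantics of these three methods can be illustrated in Go, the language whose slicing convention `subsequence` follows. This is only a sketch of the behaviour described above, not the Lua API itself: the function names `subsequence`, `reverseComplement` and `md5String` are hypothetical, and mapping every symbol other than `a`, `c`, `g`, `t` to `n` is a simplifying assumption.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// subsequence returns seq[from:to]: 0-based coordinates, `to` excluded.
func subsequence(seq string, from, to int) (string, error) {
	if from < 0 || to > len(seq) || from > to {
		return "", fmt.Errorf("invalid range [%d,%d) for length %d", from, to, len(seq))
	}
	return seq[from:to], nil
}

// reverseComplement handles lowercase a, c, g, t.
// Any other symbol becomes 'n' (simplifying assumption).
func reverseComplement(seq string) string {
	comp := map[byte]byte{'a': 't', 'c': 'g', 'g': 'c', 't': 'a'}
	out := make([]byte, len(seq))
	for i := 0; i < len(seq); i++ {
		c, ok := comp[seq[i]]
		if !ok {
			c = 'n'
		}
		out[len(seq)-1-i] = c
	}
	return string(out)
}

// md5String returns the MD5 checksum of seq as a hexadecimal string.
func md5String(seq string) string {
	sum := md5.Sum([]byte(seq))
	return hex.EncodeToString(sum[:])
}

func main() {
	sub, _ := subsequence("acgtacgt", 2, 5)
	fmt.Println(sub)                       // bases at positions 2, 3 and 4
	fmt.Println(reverseComplement("aacg")) // complement, then reverse
	fmt.Println(md5String("acgt"))
}
```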
### Enhancement
- All obitools now have a **--taxonomy** option. If specified, the taxonomy is
loaded first and taxids annotating the sequences are validated against that
taxonomy. A warning is issued for any invalid taxid and for any taxid that
is transferred to a new taxid. The **--update-taxid** option allows these
old taxids to be replaced with their new equivalent in the result of the
obitools command.
- The scoring system used by the `obipairing` command has been changed to be
more coherent. In the new version, the scores associated to a match and a
mismatch involving a nucleotide with a quality score of 0 are equal. Which
is normal as a zero quality score means a perfect indecision on the read
nucleotide, therefore there is no reason to penalize a match differently
from a mismatch (see
https://obitools4.metabarcoding.org/docs/commands/alignments/obipairing/exact-alignment/).
- In every *OBITools* command, the progress bar is automatically deactivated
when the standard error output is redirected.
- As Genbank and ENA:EMBL contain very large sequences, while
OBITools4 is optimized for short sequences, `obipcr` faces some problems
with excessive consumption of computer resources, especially memory. Several
improvements in the tuning of the default `obipcr` parameters and some new
features, currently only available for FASTA and FASTQ file readers, have
been implemented to limit the memory impact of `obipcr` without changing the
computational efficiency too much.
- The logging system, and therefore the log format, has been homogenized.
## August 2nd, 2024. Release 4.3.0
### Change of git repositiory
### Change of git repository
- The OBITools4 git repository has been moved to the github repository.
- The OBITools4 git repository has been moved to the GitHub repository.
The new address is: https://github.com/metabarcoding/obitools4.
Take care to use the new install script to retrieve the new version.
```bash
curl -L https://raw.githubusercontent.com/metabarcoding/obitools4/master/install_obitools.sh \
curl -L https://metabarcoding.org/obitools4/install.sh \
| bash
```
or with options:
```bash
curl -L https://raw.githubusercontent.com/metabarcoding/obitools4/master/install_obitools.sh \
curl -L https://metabarcoding.org/obitools4/install.sh \
| bash -s -- --install-dir test_install --obitools-prefix k
```
### CPU limitation
- By default, *OBITools4* tries to use all the computing power available on
your computer. In some circumstances this can be problematic (e.g. if you
are running on a computer cluster managed by your university). You can limit
the number of CPU cores used by *OBITools4* either by using the **--max-cpu**
option or by setting the **OBIMAXCPU** environment variable. Some strange
behaviour of *OBITools4* has been observed when users try to limit the
maximum number of usable CPU cores to one. This seems to be caused by the Go
language, and it is not obvious to get *OBITools4* to run correctly on a
single core in all circumstances. Therefore, if you ask to use a single
core, **OBITools4** will print a warning message and actually set this
parameter to two cores. If you really want a single core, you can use the
**--force-one-core** option. But be aware that this can lead to incorrect
calculations.
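The rule described above (a request for one core is raised to two unless explicitly forced) can be summarised by a small decision function. The sketch below is a hypothetical reconstruction from this paragraph only: `effectiveCPU` is not an OBITools4 function, and treating a missing or non-numeric **OBIMAXCPU** value as "no limit" is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
)

// effectiveCPU returns the number of cores to use.
// requested <= 0 means that no limit was requested (assumption).
func effectiveCPU(requested, available int, forceOneCore bool) int {
	switch {
	case requested <= 0:
		return available
	case requested == 1 && !forceOneCore:
		// A single core is raised to two unless explicitly forced.
		return 2
	default:
		return requested
	}
}

func main() {
	requested := 0
	if v, err := strconv.Atoi(os.Getenv("OBIMAXCPU")); err == nil {
		requested = v
	}
	n := effectiveCPU(requested, runtime.NumCPU(), false)
	runtime.GOMAXPROCS(n)
	fmt.Printf("using %d core(s)\n", n)
}
```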
### New features
- The output of the obitools will evolve to produce results only in standard
formats such as fasta and fastq. For non-sequential data, the output will be
in CSV format, with the separator `,`, the decimal separator `.`, and a
header line with the column names. It is more convenient to use the output
in other programs. For example, you can use the `csvtomd` command to
reformat the csv output into a markdown table. The first command to initiate
reformat the CSV output into a Markdown table. The first command to initiate
this change is `obicount`, which now produces a 3-line CSV output.
```bash
@@ -57,7 +196,7 @@
database for `obitag` is to use `obipcr` on a local copy of Genbank or EMBL.
However, these sequence databases are known to contain many taxonomic
errors, such as bacterial sequences annotated with the taxid of their host
species. obicleandb tries to detect these errors. To do this, it first keeps
species. `obicleandb` tries to detect these errors. To do this, it first keeps
only sequences annotated with the taxid to which a species, genus, and
family taxid can be assigned. Then, for each sequence, it compares the
distance of the sequence to the other sequences belonging to the same genus
@@ -68,7 +207,7 @@
with the p-value of the Mann-Whitney U test in the **obicleandb_trusted**
slot. Later, the distribution of this p-value can be analyzed to determine a
threshold. Empirically, a threshold of 0.05 is a good compromise and allows
to filter out less than 1‰ of the sequences. These sequences can then be
filtering out less than 1‰ of the sequences. These sequences can then be
removed using `obigrep`.
- Adds a new `obijoin` utility to join information contained in a sequence
@@ -78,16 +217,16 @@
- Adds a new tool `obidemerge` to demerge a `merge_xxx` slot by recreating the
multiple identical sequences having the slot `xxx` recreated with its initial
value and the sequence count set to the number of occurences refered in the
value and the sequence count set to the number of occurrences referred in the
`merge_xxx` slot. During the operation, the `merge_xxx` slot is removed.
- Adds CSV as one of the input format for every obitools command. To encode
sequence the CSV file must includes a column named `sequence` and another
sequence the CSV file must include a column named `sequence` and another
column named `id`. An extra column named `qualities` can be added to specify
the quality scores of the sequence following the same ascii encoding than the
the quality scores of the sequence following the same ASCII encoding as the
fastq format. All the other columns will be considered as annotations and will
be interpreted as JSON objects encoding potentially for atomic values. If a
calumn value can not be decoded as JSON it will be considered as a string.
column value can not be decoded as JSON it will be considered as a string.
- A new option **--version** has been added to every obitools command. It will
print the version of the command.
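The CSV input convention described above (mandatory `id` and `sequence` columns, optional `qualities`, every other column decoded as JSON with a fallback to a plain string) can be sketched with the Go standard library. The `Record` type and `readCSV` function are hypothetical and only illustrate the decoding rule; they are not the OBITools4 reader.

```go
package main

import (
	"encoding/csv"
	"encoding/json"
	"fmt"
	"strings"
)

// Record is a hypothetical container for one CSV-encoded sequence.
type Record struct {
	ID, Sequence, Qualities string
	Annotations             map[string]interface{}
}

// readCSV decodes CSV text following the rule described in the release note.
func readCSV(text string) ([]Record, error) {
	rows, err := csv.NewReader(strings.NewReader(text)).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(rows) == 0 {
		return nil, fmt.Errorf("empty CSV input")
	}
	header := rows[0]
	has := map[string]bool{}
	for _, col := range header {
		has[col] = true
	}
	if !has["id"] || !has["sequence"] {
		return nil, fmt.Errorf("columns `id` and `sequence` are mandatory")
	}
	records := make([]Record, 0, len(rows)-1)
	for _, row := range rows[1:] {
		rec := Record{Annotations: map[string]interface{}{}}
		for i, col := range header {
			value := row[i]
			switch col {
			case "id":
				rec.ID = value
			case "sequence":
				rec.Sequence = value
			case "qualities":
				rec.Qualities = value
			default:
				// Try JSON first; fall back to the raw string.
				var decoded interface{}
				if err := json.Unmarshal([]byte(value), &decoded); err != nil {
					decoded = value
				}
				rec.Annotations[col] = decoded
			}
		}
		records = append(records, rec)
	}
	return records, nil
}

func main() {
	recs, err := readCSV("id,sequence,count,sample\nseq1,acgt,12,soil\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", recs[0])
}
```

With this rule, `12` is decoded as a JSON number while `soil`, which is not valid JSON, is kept as a string.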
@@ -96,8 +235,8 @@
quality scores from a BioSequence object.\
- In `obimultiplex`, the ngsfilter file describing the samples can now be provided
not only using the classical nfsfilter format but also using the csv format.
When using csv, the first line must contain the column names. 5 columns are
not only using the classical ngsfilter format but also using the CSV format.
When using CSV, the first line must contain the column names. 5 columns are
expected:
- `experiment` the name of the experiment
@@ -113,43 +252,34 @@
Supplementary columns are allowed. Their names and content will be used to
annotate the sequence corresponding to the sample, as the `key=value;` did
in the nfsfilter format.
in the ngsfilter format.
The CSV format used allows for comment lines starting with `#` character.
Special data lines starting with `@param` in the first column allow to
configure the algorithm. The options **--template** provided an over
commented example of the csv format, including all the possible options.
Special data lines starting with `@param` in the first column allow configuring the algorithm. The options **--template** provided an over
commented example of the CSV format, including all the possible options.
### CPU limitation
### Enhancement
- By default, *OBITools4* tries to use all the computing power available on
your computer. In some circumstances this can be problematic (e.g. if you
are running on a computer cluster managed by your university). You can limit
the number of CPU cores used by *OBITools4* or by using the **--max-cpu**
option or by setting the **OBIMAXCPU** environment variable. Some strange
behavior of *OBITools4* has been observed when users try to limit the
maximum number of usable CPU cores to one. This seems to be caused by the Go
language, and it is not obvious to get *OBITools4* to run correctly on a
single core in all circumstances. Therefore, if you ask to use a single
core, **OBITools4** will print a warning message and actually set this
parameter to two cores. If you really want a single core, you can use the
**--force-one-core** option. But be aware that this can lead to incorrect
calculations.
- In every *OBITools* command, the progress bar are automatically deactivated
when the standard error output is redirected.
- Because Genbank and ENA:EMBL contain very large sequences, while OBITools4
are optimized As Genbank and ENA:EMBL contain very large sequences, while
OBITools4 is optimised for short sequences, `obipcr` faces some problems
with excessive consumption of computer resources, especially memory. Several
improvements in the tuning of the default `obipcr` parameters and some new
features, currently only available for FASTA and FASTQ file readers, have
been implemented to limit the memory impact of `obipcr` without changing the
computational efficiency too much.
- Logging system and therefore format, have been homogenized.
### Bug
- In `obitag`, correct the wrong assignment of the **obitag_bestmatch**
attribute.
- In `obiclean`, the **--no-progress-bar** option disables all progress bars,
not just the data.
- Several fixes in reading FASTA and FASTQ files, including some code
simplification and and factorization.
- Fixed a bug in all obitools that caused the same file to be processed
multiple times. when specifying a directory name as input.
## April 2nd, 2024. Release 4.2.0
### New features
- A new OBITools named `obiscript` allows to process each sequence according
- A new OBITools named `obiscript` allows processing each sequence according
to a Lua script. This is an experimental tool. The **--template** option
allows for generating an example script on the `stdout`.
@@ -157,7 +287,7 @@
- Two of the main class `obiseq.SeqWorker` and `obiseq.SeqWorker` have their
declaration changed. Both now return two values a `obiseq.BioSequenceSlice`
and an `error`. This allow a worker to return potentially several sequences
and an `error`. This allows a worker to return potentially several sequences
as the result of the processing of a single sequence, or zero, which is
equivalent to filter out the input sequence.
@@ -165,12 +295,12 @@
- In `obitag` if the reference database contains sequences annotated by taxid
not referenced in the taxonomy, the corresponding sequences are discarded
from the reference database and a warning indicating the sequence id and the
from the reference database and a warning indicating the sequence *id* and the
wrong taxid is emitted.
- The bug corrected in the parsing of EMBL and Genbank files as implemented in
version 4.1.2 of OBITools4, potentially induced some reduction in the
performance of the parsing. This should have been now fixed.
- In the same idea, parsing of genbank and EMBL files were reading and storing
- In the same idea, parsing of Genbank and EMBL files were reading and storing
in memory not only the sequence but also the annotations (features table).
Up to now none of the OBITools are using this information, but with large
complete genomes, it is occupying a lot of memory. To reduce this impact,
@@ -209,7 +339,7 @@
### New feature
- In `obimatrix` a **--transpose** option allows to transpose the produced
- In `obimatrix` a **--transpose** option allows transposing the produced
matrix table in CSV format.
- In `obipairing` and `obipcrtag` two new options **--exact-mode** and
**--fast-absolute** to control the heuristic used in the alignment
@@ -217,7 +347,7 @@
the exact algorithm at the cost of a speed. **--fast-absolute** change the
scoring schema of the heuristic.
- In `obiannotate` adds the possibility to annotate the first match of a
pattern using the same algorithm than the one used in `obipcr` and
pattern using the same algorithm as the one used in `obipcr` and
`obimultiplex`. For that four option were added :
- **--pattern** : to specify the pattern. It can use IUPAC codes and
position with no error tolerated has to be followed by a `#` character.
@@ -298,7 +428,7 @@
### Bugs
- in the obitools language, the `composition` function now returns a map
- In the obitools language, the `composition` function now returns a map
indexed by lowercase string "a", "c", "g", "t" and "o" for other instead of
being indexed by the ASCII codes of the corresponding letters.
- Correction of the reverse-complement operation. Every reverse complement of
@@ -311,18 +441,18 @@
duplicating the quality values. This made `obimultiplex` to produce fastq
files with sequences having quality values duplicated.
### Becareful
### Be careful
GO 1.21.0 is out, and it includes new functionalities which are used in the
OBITools4 code. If you use the recommanded method for compiling OBITools on your
computer, their is no problem, as the script always load the latest GO version.
If you rely on you personnal GO install, please think to update.
OBITools4 code. If you use the recommended method for compiling OBITools on your
computer, there is no problem, as the script always load the latest GO version.
If you rely on your personal GO install, please think to update.
## August 29th, 2023. Release 4.0.5
### Bugs
- Patch a bug in the `obiseq.BioSequence` constructor leading to a error on
- Patch a bug in the `obiseq.BioSequence` constructor leading to an error on
almost every obitools. The error message indicates : `fatal error: sync:
unlock of unlocked mutex` This bug was introduced in the release 4.0.4
@@ -341,7 +471,7 @@ If you rely on you personnal GO install, please think to update.
data structure to limit the number of alignments actually computed. This
increase a bit the speed of both the software. `obirefidx` is nevertheless
still too slow compared to my expectation.
- Switch to a parallel version of the gzip library, allowing for high speed
- Switch to a parallel version of the GZIP library, allowing for high speed
compress and decompress operation on files.
### New feature
@@ -385,12 +515,12 @@ If you rely on you personnal GO install, please think to update.
--unidentified not_assigned.fastq
```
the command produced four files : `tagged_library_R1.fastq` and
The command produced four files : `tagged_library_R1.fastq` and
`tagged_library_R2.fastq` containing the assigned reads and
`not_assigned_R1.fastq` and `not_assigned_R2.fastq` containing the
unassignable reads.
the tagged library files can then be split using `obidistribute`:
The tagged library files can then be split using `obidistribute`:
```{bash}
mkdir pcr_reads
@@ -400,9 +530,9 @@ If you rely on you personnal GO install, please think to update.
- Adding of two options **--add-lca-in** and **--lca-error** to `obiannotate`.
These options aim to help during construction of reference database using
`obipcr`. On obipcr output, it is commonly run obiuniq. To merge identical
`obipcr`. On `obipcr` output, it is commonly run `obiuniq`. To merge identical
sequences annotated with different taxids, it is now possible to use the
following strategie :
following strategies :
```{bash}
obiuniq -m taxid myrefdb.obipcr.fasta \
@@ -433,7 +563,7 @@ If you rely on you personnal GO install, please think to update.
- Correction of a bug in `obiconsensus` leading into the deletion of a base
close to the beginning of the consensus sequence.
## March 31th, 2023. Release 4.0.2
## March 31st, 2023. Release 4.0.2
### Compiler change
@@ -444,15 +574,15 @@ If you rely on you personnal GO install, please think to update.
- Add the possibility for looking pattern with indels. This has been added to
`obimultiplex` through the **--with-indels** option.
- Every obitools command has a **--pprof** option making the command
publishing a profiling web site available at the address :
publishing a profiling website available at the address :
<http://localhost:8080/debug/pprof/>
- A new `obiconsensus` command has been added. It is a prototype. It aims to
build a consensus sequence from a set of reads. The consensus is estimated
for all the sequences contained in the input file. If several input files,
or a directory name are provided the result contains a consensus per file.
The id of the sequence is the name of the input file depleted of its
The *id* of the sequence is the name of the input file depleted of its
directory name and of all its extensions.
- In `obipcr` an experimental option **--fragmented** allows for spliting very
- In `obipcr` an experimental option **--fragmented** allows for splitting very
long query sequences into shorter fragments with an overlap between the two
contiguous fragments, ensuring that no amplicons are missed despite the split.
As a side effect, some amplicons can be identified twice.
@@ -495,7 +625,7 @@ If you rely on you personnal GO install, please think to update.
### Enhancement
- *OBITools* are automatically processing all the sequences files contained in
a directory and its sub-directory\
a directory and its subdirectory\
recursively if its name is provided as input. To process easily Genbank
files, the corresponding filename extensions have been added. Today the
following extensions are recognized as sequence files : `.fasta`, `.fastq`,
@@ -512,7 +642,7 @@ If you rely on you personnal GO install, please think to update.
export OBICPUMAX=4
```
- Adds a new option --out\|-o allowing to specify the name of an outpout file.
- Adds a new option --out\|-o allowing to specify the name of an output file.
``` bash
obiconvert -o xyz.fasta xxx.fastq
@@ -534,10 +664,10 @@ If you rely on you personnal GO install, please think to update.
matched files remain consistent when processed.
- Adding of the function `ifelse` to the expression language for computing
conditionnal values.
conditional values.
- Adding two function to the expression language related to sequence
conposition : `composition` and `gcskew`. Both are taking a sequence as
composition : `composition` and `gcskew`. Both are taking a sequence as
single argument.
## February 18th, 2023. Release 4.0.0
@@ -545,8 +675,8 @@ If you rely on you personnal GO install, please think to update.
It is the first version of the *OBITools* version 4. I decided to tag them
following two weeks of intensive data analysis with them, which revealed
many small bugs present in the previous non-official version. Obviously other
bugs are certainly persent in the code, and you are welcome to use the git
ticket system to mention them. But they seems to produce now reliable results.
bugs are certainly present in the code, and you are welcome to use the git
ticket system to mention them. But they seem to produce now reliable results.
### Corrected bugs
@@ -554,11 +684,11 @@ ticket system to mention them. But they seems to produce now reliable results.
of sequences and to the production of incorrect file because of the last
sequence record, sometime truncated in its middle. This was only occurring
when more than a single CPU was used. It was affecting every obitools.
- The `obiparing` software had a bug in the right aligment procedure. This led
to the non alignment of very sort barcode during the paring of the forward
- The `obiparing` software had a bug in the right alignment procedure. This led
to the non-alignment of very sort barcode during the paring of the forward
and reverse reads.
- The `obipairing` tools had a non deterministic comportment when aligning a
paor very low quality reads. This induced that the result of the same low
- The `obipairing` tools had a non-deterministic comportment when aligning a
pair very low quality reads. This induced that the result of the same low
quality read pair was not the same from run to run.
### New features
@@ -566,11 +696,10 @@ ticket system to mention them. But they seems to produce now reliable results.
- Adding of a `--compress|-Z` option to every obitools allowing to produce
`gz` compressed output. OBITools were already able to deal with gziped input
files transparently. They can now produce their results in the same format.
- Adding of a `--append|-A` option to the `obidistribute` tool. It allows to
append the result of an `obidistribute` execution to preexisting files. -
- Adding of a `--append|-A` option to the `obidistribute` tool. It allows appending the result of an `obidistribute` execution to preexisting files. -
Adding of a `--directory|-d` option to the `obidistribute` tool. It allows
to declare a secondary classification key over the one defined by the
'--category\|-c\` option. This extra key leads to produce directories in
declaring a secondary classification key over the one defined by the
`--category\|-c\` option. This extra key leads to produce directories in
which files produced according to the primary criterion are stored.
- Adding of the functions `subspc`, `printf`, `int`, `numeric`, and `bool` to
the expression language.


@@ -0,0 +1,755 @@
# Design proposal: k-mer index v3 (canonical super-kmers, unitigs, and Aho-Corasick)
## 1. Assessment of the v1 index
The current index (`.kdi`, delta-varint) stores 18.6 billion k-mers (k=31, m=13, P=4096, 2 sets) in 85 GB, i.e. 4.8-5.6 bytes/k-mer. The causes:
- The standard canonical form `min(fwd, rc)` scatters k-mers over 62 bits → deltas of ~2^40 → 5-6 varint bytes
- K-mers shared between sets are stored N times (once per set)
- Matching requires N×P file opens (N passes)
## 2. Experimental observations
### 2.1 Raw dereplication
On a *Betula exilis* genome sequenced at 15× coverage, the pipeline `obik lowmask | obik super | obiuniq` reduces **80 GB of fastq.gz to 5.6 GB of fasta.gz**, a factor of 14×. This shows that dereplication at the super-kmer level is extremely effective and that super-kmers form a naturally compact representation.
### 2.2 After frequency filtering (count > 1)
Removing the super-kmers observed only once (sequencing errors) shrinks the file from 5.6 GB to **2.7 GB of fasta.gz**. Detailed statistics (obicount):
| Metric | Value |
|--------|-------|
| Variants (unique super-kmers) | 37,294,271 |
| Reads (sum of counts) | 148,828,167 |
| Symbols (total bases over variants) | 1,415,018,593 |
| Mean super-kmer length | **37.9 bases** |
| Mean k-mers per super-kmer (k=31) | **7.9** |
| Estimated total k-mers | **~295M** |
| Mean count per super-kmer | **4.0×** |
### 2.3 Comparison with the v1 index
| Format | Size | K-mers | Bytes/k-mer |
|--------|------|--------|-------------|
| .kdi v1 index (Human set in Contaminent_idx) | 12.8 GB | ~3B | 4.3 |
| Hypothetical delta-varint (295M k-mers) | ~1.5 GB | 295M | 5.0 |
| 2-bit packed super-kmers (*Betula*, count>1) | ~354 MB | 295M | **1.2** |
| fasta.gz super-kmers (*Betula*, count>1) | 2.7 GB | 295M | 9.2* |
\* The fasta.gz file includes headers, counts, and gzip compression, so it is not directly comparable to the binary format.
**The 2-bit super-kmer format is ~4× more compact than delta-varint** for the same number of k-mers. The gain comes from the fact that a 38-base super-kmer encodes 8 k-mers in ~10 bytes instead of 8 × 5 = 40 bytes in delta-varint.
Note: the comparison is not direct (Contaminent_idx = assembled genomes, *Betula* = filtered raw reads), but the bytes/k-mer ratio is comparable because it depends on super-kmer length, not on the data source.
## 3. Proposed strategy: v3 construction pipeline
### 3.1 Definition of the minimizer-canonical k-mer
The canonical form of a k-mer is redefined in terms of its minimizer:
```
CanonicalKmer(kmer, k, m):
    minimizer = smallest canonical m-mer in the k-mer
    if minimizer == forward_mmer(minimizer_pos)
        → keep the k-mer as is
    else
        → take the reverse complement of the k-mer
```
Properties:
- **m odd** → no m-mer can be palindromic (`m_mer != RC(m_mer)` always) → canonization by the minimizer is always unambiguous. It is m, not k, that must be odd: the ambiguity would come from a palindromic minimizer (`min == RC(min)`), in which case there would be no way to decide how to orient the k-mer/super-kmer.
- All k-mers of a super-kmer share the same minimizer
- **Canonization can be done at the level of the whole super-kmer**: if `minimizer != canonical(minimizer)`, the complete super-kmer is reverse-complemented. All the k-mers it contains automatically become minimizer-canonical.
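The properties above can be illustrated with a minimal sketch. It assumes the 2-bit encoding A=0, C=1, G=2, T=3 (the document does not fix one), and the helper names `encode`, `revComp`, and `canonical` are hypothetical, not taken from the OBITools code base. The `main` function checks exhaustively, for m=5, that no odd-length word equals its own reverse complement.

```go
package main

import "fmt"

// code maps a nucleotide to its 2-bit value (assumed: A=0, C=1, G=2, T=3).
func code(b byte) uint64 {
	switch b {
	case 'C':
		return 1
	case 'G':
		return 2
	case 'T':
		return 3
	}
	return 0
}

// encode packs a short ACGT string (at most 32 bases) into a uint64,
// first base in the most significant position.
func encode(s string) uint64 {
	var x uint64
	for i := 0; i < len(s); i++ {
		x = x<<2 | code(s[i])
	}
	return x
}

// revComp returns the reverse complement of an n-base word.
// With the assumed encoding, the complement of a base b is 3-b.
func revComp(x uint64, n int) uint64 {
	var rc uint64
	for i := 0; i < n; i++ {
		rc = rc<<2 | (3 - x&3)
		x >>= 2
	}
	return rc
}

// canonical returns min(x, RC(x)) and reports whether x had to be flipped.
func canonical(x uint64, n int) (uint64, bool) {
	if rc := revComp(x, n); rc < x {
		return rc, true
	}
	return x, false
}

func main() {
	const m = 5
	palindromes := 0
	for x := uint64(0); x < 1<<(2*m); x++ {
		if revComp(x, m) == x {
			palindromes++
		}
	}
	fmt.Println("palindromic 5-mers:", palindromes)

	c, flipped := canonical(encode("TTACG"), m)
	fmt.Println(c == encode("CGTAA"), flipped)
}
```

When `canonical` reports a flip for the minimizer, the same `revComp` operation would be applied to the whole super-kmer.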
### 3.2 Construction pipeline
```
Raw sequences ([]byte, 1 byte/base)
[0] 2-bit encoding + cleaning
│ - Encode each sequence at 2 bits/base (packed []byte)
│ - Cut at ambiguous bases (N, R, Y, W, S, K, M, B, D, H, V)
│ - Discard fragments shorter than k
│ - Result: clean 2-bit fragments, ready for all operations
[1] Complexity filter (lowmask on 2-bit vectors)
│ Removes/masks low-entropy regions
[2] Super-kmer extraction (on 2-bit vectors, not canonized)
│ Each super-kmer has a minimizer and a 2-bit packed sequence
[3] Canonization at the super-kmer level
│ If minimizer != canonical(minimizer) → RC the super-kmer (bit operation)
│ Result: canonical 2-bit packed super-kmers
[4] Write to the .skm partitions (partition = minimizer % P)
│ Native 2-bit format → direct write, no conversion
[5] Per-partition super-kmer dereplication
│ Sort the super-kmers (uint64 comparison on packed data → very fast)
│ Count identical occurrences
│ Result: unique super-kmers with counts
[6] Per-partition construction of canonical unitigs
│ Assemble super-kmers that overlap by (k-1) bases
│ into linear non-branching chains (all in 2-bit)
│ Propagate counts: one weight vector per unitig
[7] Frequency filter on the weighted graph (see section 4)
│ Remove k-mers (positions) with weight < threshold
│ Recompute unitigs after filtering
[8] Unitig storage with multiset bitmask
│ Compact on-disk format (already 2-bit, direct write)
Index v3
```
### 3.2bis Why encode in 2-bit from the start?
**Rejected alternative**: work in `[]byte` (1 byte/base) and encode to 2-bit only for final storage.
| Aspect | `[]byte` (1 byte/base) | 2-bit packed |
|--------|----------------------|--------------|
| Programming | Simple (native slicing, no bit shifts) | More complex (masks, shifts) |
| Memory per super-kmer (38 bases) | 38 bytes | 10 bytes (**3.8×** less) |
| 37M super-kmers in RAM | ~1.4 GB | ~370 MB |
| Sorting (comparison) | `bytes.Compare` on slices | uint64 comparison (**much** faster) |
| .skm format | Encode/decode conversion on every I/O | Direct read/write |
| RC of a super-kmer | Loop over bytes + lookup | Bit operations (one instruction for the complement) |
The most expensive operation in the pipeline is **sorting the super-kmers** for dereplication (step 5). In 2-bit packed form, a super-kmer of ≤32 bases fits in one `uint64` → sorting by integer comparison (one CPU instruction). A super-kmer of 33-64 bases fits in two `uint64` → sorting with two comparisons.
The 2-bit manipulation code is harder to write, but it is **written only once** (as a library of primitives) and benefits the whole chain. The memory gain (~4×) and the sorting speed-up are significant over tens of millions of super-kmers.
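Step [0] of the pipeline can be sketched as follows. This is only an illustration under stated assumptions: the function names `splitClean` and `pack2bit` are hypothetical, only uppercase A/C/G/T are treated as unambiguous, and the byte layout (4 bases per byte, first base in the high-order bits) is one possible choice that the document does not specify.

```go
package main

import "fmt"

func isACGT(b byte) bool { return b == 'A' || b == 'C' || b == 'G' || b == 'T' }

// splitClean cuts seq at every non-ACGT symbol and keeps only the
// fragments whose length is at least k.
func splitClean(seq string, k int) []string {
	var frags []string
	start := 0
	for i := 0; i <= len(seq); i++ {
		if i < len(seq) && isACGT(seq[i]) {
			continue
		}
		if i-start >= k {
			frags = append(frags, seq[start:i])
		}
		start = i + 1
	}
	return frags
}

// pack2bit packs a clean fragment at 4 bases per byte (A=0, C=1, G=2, T=3),
// first base in the high-order bits; unused trailing bits stay at zero.
func pack2bit(frag string) []byte {
	out := make([]byte, (len(frag)+3)/4)
	for i := 0; i < len(frag); i++ {
		var c byte
		switch frag[i] {
		case 'C':
			c = 1
		case 'G':
			c = 2
		case 'T':
			c = 3
		}
		out[i/4] |= c << (6 - 2*uint(i%4))
	}
	return out
}

func main() {
	for _, f := range splitClean("ACGTNACGRTTTTT", 4) {
		fmt.Printf("%s -> %d byte(s): %x\n", f, len(pack2bit(f)), pack2bit(f))
	}
}
```

With this layout a 38-base super-kmer occupies (38+3)/4 = 10 bytes, which matches the figure used in the table above.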
### 3.3 Canonizing super-kmers: why it works
**Key point**: super-kmers must be built using the **non-canonical** minimizer (the raw m-mer as it appears in the sequence), not the canonical minimizer `min(fwd, rc)`.
**Why?** If the canonical minimizer is used as the grouping criterion, a single super-kmer could contain the minimizer in **both orientations** at different positions (the forward m-mer at one position and its RC form at another, both having the same canonical value). In that case, reverse-complementing the super-kmer would not resolve the ambiguity.
**Correct algorithm**:
1. **Extraction**: build super-kmers by grouping consecutive k-mers that share the same minimal **non-canonical** m-mer (the raw m-mer). Within such a super-kmer, the minimizer always appears in **a single orientation**.
2. **Canonization**: for each super-kmer, compare its raw minimizer with `canonical(minimizer) = min(minimizer, RC(minimizer))`:
- If `minimizer == canonical(minimizer)` → the minimizer is already forward → keep the super-kmer as is
- If `minimizer != canonical(minimizer)` → the minimizer is in RC → RC the whole super-kmer → the minimizer now appears forward
After this step, **every k-mer of the super-kmer** contains the canonical minimizer in forward orientation, which is exactly the definition of a minimizer-canonical k-mer.
**Note**: this means that the current `IterSuperKmers` algorithm (which uses the canonical minimizer for grouping) must be modified to use the raw minimizer. This changes the super-kmer break criterion: a break occurs when the **raw minimal m-mer** changes, not when the **canonical minimal m-mer** changes. The resulting super-kmers may be shorter (a change in minimizer orientation forces a cut), but that is the price of absolute canonicity.
### 3.4 Super-kmer dereplication
Two identical super-kmers (same sequence, same minimizer) correspond to the same k-mers. They can be dereplicated by sorting:
1. By minimizer (already partitioned)
2. By sequence (lexicographic sort of the 2-bit packed sequences)
Identical super-kmers become consecutive in the sorted order → linear-time counting.
Sorting can be done on a partition's .skm files, in memory if the partition fits in RAM, or by external merge sort otherwise.
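The sort-then-count idea can be sketched in a few lines. In this illustration plain strings stand in for the 2-bit packed sequences, and the `Counted` type and `dereplicate` function are hypothetical names; only the in-memory case is shown.

```go
package main

import (
	"fmt"
	"sort"
)

// Counted is a unique super-kmer with its number of occurrences.
type Counted struct {
	Seq   string
	Count int
}

// dereplicate sorts the super-kmers in place, then counts each run of
// identical sequences in a single linear pass.
func dereplicate(sks []string) []Counted {
	sort.Strings(sks)
	var out []Counted
	for _, s := range sks {
		if n := len(out); n > 0 && out[n-1].Seq == s {
			out[n-1].Count++
		} else {
			out = append(out, Counted{s, 1})
		}
	}
	return out
}

func main() {
	fmt.Println(dereplicate([]string{"CGTA", "ACGT", "CGTA", "ACGT", "ACGT"}))
}
```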
## 4. Frequency filter
### 4.1 Problem
The frequency filter (`--min-occurrence N`) removes k-mers seen fewer than N times. With super-kmer dereplication there is one count per super-kmer, not per k-mer. A k-mer can appear in several different super-kmers (at junctions, or when the minimizer changes), so the exact count of a k-mer is only known after merging.
### 4.2 Solution: filtering on the weighted De Bruijn graph
The frequency filter must be applied **after** the construction of canonical unitigs (section 5), not before. The pipeline becomes:
```
Dereplicated canonical super-kmers (with counts)
Construction of canonical unitigs (section 5)
│ Each position in a unitig carries a weight
│ = sum of the counts of the super-kmers covering that k-mer
Weighted De Bruijn graph (implicit in the unitigs)
Filtering: remove k-mers (positions) with weight < threshold
│ This breaks some unitigs into fragments
Recomputation of unitigs on the filtered graph
Final filtered unitigs
```
**Advantages**:
- The filter operates on the **exact k-mers** with their **exact counts** (not a per-super-kmer approximation)
- The De Bruijn graph is implicitly contained in the unitigs, so there is no need to build it explicitly with a `map[uint64]uint`
- K-mers at super-kmer junctions have their counts correctly aggregated
### 4.3 Computing the weight of each position in a unitig
A unitig is built by chaining super-kmers. Each super-kmer S of length L and count C contributes (L-k+1) k-mers, each with weight C. When two super-kmers share k-mers in the unitig (that is, when they overlap by at least k bases; an overlap of exactly (k-1) bases shares a node but no k-mer), the shared k-mers receive the **sum** of the counts of both super-kmers.
In practice, while the unitig is being built by chaining, a weight vector `weights[0..nkmers-1]` is filled in:
```
For each super-kmer S (count=C) added to the unitig:
    For each k-mer position i covered by S in the unitig:
        weights[i] += C
```
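The accumulation loop above translates almost directly to Go. The sketch below assumes that each super-kmer is already placed on the unitig by the offset of its first base; the `Placed` type and `kmerWeights` function are hypothetical names used only for illustration.

```go
package main

import "fmt"

// Placed describes a super-kmer positioned on a unitig.
type Placed struct {
	Offset int // position of its first base in the unitig
	Len    int // length in bases
	Count  int // occurrences after dereplication
}

// kmerWeights returns, for each k-mer position of the unitig, the sum of
// the counts of the super-kmers covering that k-mer.
func kmerWeights(unitigLen, k int, sks []Placed) []int {
	weights := make([]int, unitigLen-k+1)
	for _, s := range sks {
		// s contributes Len-k+1 k-mers, starting at k-mer index s.Offset.
		for i := s.Offset; i <= s.Offset+s.Len-k; i++ {
			weights[i] += s.Count
		}
	}
	return weights
}

func main() {
	// Unitig of 8 bases with k=4, hence 5 k-mers. The two super-kmers
	// overlap by 4 bases and therefore share exactly one k-mer (index 2).
	fmt.Println(kmerWeights(8, 4, []Placed{{0, 6, 3}, {2, 6, 2}}))
}
```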
### 4.4 Filtering and reconstruction
After filtering (`weights[i] < threshold` → remove position i), the unitig may be cut into fragments. Each contiguous run of retained positions forms a new unitig (or a super-kmer if it is short).
Recomputing the unitigs after filtering is trivial: the fragments are already linear paths, so it suffices to check the non-branching conditions at the new ends.
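A possible sketch of the cutting step is shown below. It only splits one unitig into its retained fragments and does not re-check branching at the new ends; `Run`, `keptRuns`, and `fragments` are hypothetical names.

```go
package main

import "fmt"

// Run is a half-open interval [Start, End) of k-mer positions.
type Run struct{ Start, End int }

// keptRuns returns the maximal runs of positions whose weight reaches
// the threshold.
func keptRuns(weights []int, threshold int) []Run {
	var runs []Run
	start := -1
	for i, w := range weights {
		switch {
		case w >= threshold && start < 0:
			start = i
		case w < threshold && start >= 0:
			runs = append(runs, Run{start, i})
			start = -1
		}
	}
	if start >= 0 {
		runs = append(runs, Run{start, len(weights)})
	}
	return runs
}

// fragments cuts the unitig sequence according to the kept runs:
// a run of k-mers [s, e) spans the bases [s, e+k-1).
func fragments(unitig string, k int, runs []Run) []string {
	var out []string
	for _, r := range runs {
		out = append(out, unitig[r.Start:r.End+k-1])
	}
	return out
}

func main() {
	runs := keptRuns([]int{3, 3, 1, 2, 2}, 2)
	fmt.Println(runs, fragments("ACGTTGCA", 4, runs))
}
```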
### 4.5 Frequency spectrum
The exact frequency spectrum can be computed directly from the unitig weight vectors: `weights[i]` gives the exact count of the k-mer at position i. The spectrum is a histogram over all positions of all unitigs.
### 4.6 Memory feasibility: weighted graph per partition
Data measured on a *Betula* index (k=31, P=4096 partitions, 1 set, assembled genome), distribution of the .kdi file sizes:
| Metric | .kdi file size | Estimated k-mers (~5 B/k-mer) | Super-kmers (~8 k-mers/skm) |
|--------|----------------|-------------------------------|-----------------------------|
| Mode | 100-200 KB | 20,000-40,000 | 2,500-5,000 |
| Median | ~350-400 KB | ~70,000-80,000 | ~9,000-10,000 |
| Max | ~2.3 MB | ~460,000 | ~57,000 |
The weighted De Bruijn graph for a partition requires extracting all k-mers (edges) and (k-1)-mers (nodes) from the super-kmers:
| Partition | K-mers (edges) | Edge RAM (~20 B) | Node RAM (~16 B) | Total |
|-----------|----------------|------------------|------------------|-------|
| Typical (~10K skm, 38 bases avg) | ~80K | ~1.6 MB | ~1.3 MB | **~3 MB** |
| Maximal (~57K skm) | ~460K | ~9.2 MB | ~7.4 MB | **~17 MB** |
This **fits easily in memory**. Since partitions are independent, they can be processed in parallel by a pool of goroutines. With 8 goroutines each holding a worst-case partition: **~136 MB** at peak, which is negligible. The arrays can be reused across partitions (single allocation).
**Conclusion**: building the weighted De Bruijn graph partition by partition is not only feasible but trivial in terms of memory. This is a strong argument for the "filter after unitigs" approach over "filter on super-kmers".
### 4.7 Invariance of the distribution with respect to canonization
Redefining the canonical k-mer (by the minimizer instead of `min(fwd, rc)`) changes **nothing** in the set of k-mers or in their distribution across partitions:
- It is a **bijection**: each k-mer still has exactly one canonical representative; only the choice of strand changes
- Partitioning is done on `canonical(minimizer) % P`, and the value of the canonical minimizer is the same under both conventions
- **Same number of k-mers per partition, same size distribution**
- **Same De Bruijn graph topology** (same nodes, same edges)
What changes is the **orientation**: with minimizer-based canonicity, canonical unitigs follow only "forward" edges (`suffix(S1) == prefix(S2)`, exact identity). Some edges that BCALM2 can traverse in RC become break points. The graph is no larger; only canonical unitigs need to be built, which **simplifies** the algorithm (no strand-crossing handling).
## 5. Construction of canonical unitigs
### 5.1 Definition: absolute canonical unitig
A **canonical unitig** is a linear non-branching path in the De Bruijn graph where:
1. Each k-mer is **minimizer-canonical** (the minimizer appears in it in forward orientation)
2. Each constituent super-kmer is **canonical** (same convention)
3. Chaining is done **without strand crossing**: `suffix(k-1, S1) == prefix(k-1, S2)` on the same strand (not in RC)
This is more restrictive than BCALM2 unitigs (which allow `suffix(S1) == RC(prefix(S2))`), but it guarantees that **every k-mer extracted by a sliding window is directly in its canonical form**, with no re-canonization.
### 5.2 Why absolute canonicity is essential
**Matching**: query k-mers are canonized once (by the minimizer), then compared directly with the unitig k-mers by scanning. No on-the-fly re-canonization → faster and simpler.
**Set operations**: two indexes using the same convention produce the same canonical unitigs for the same k-mers. Intersection/union can operate by direct comparison of sorted sequences.
**Multiset bitmask**: merging N sets is trivial: it amounts to merging lists of canonical super-kmers/unitigs sorted by sequence.
**Determinism**: a set of k-mers always produces the same canonical unitigs, whatever the insertion order or the data source.
### 5.3 Impact on compaction
The canonical constraint forbids strand crossings at junctions → canonical unitigs are **shorter** than BCALM2 unitigs. Estimate:
- BCALM2 (unconstrained unitigs): 63 bases on average (measured on *Betula*)
- Canonical unitigs: probably ~45-55 bases on average
- Dereplicated super-kmers: 38 bases on average
The compaction factor is slightly reduced, but the gain in operational simplicity largely compensates.
### 5.4 Per-partition construction: the actual De Bruijn graph
Dereplicated canonical super-kmers are **paths** in the De Bruijn graph, not nodes. They cannot be chained directly as nodes because:
- Two super-kmers can **overlap** (share k-mers at junctions)
- A short super-kmer can have its k-mers **included** in a longer super-kmer
A super-kmer of length L contains (L-k+1) k-mers, i.e. (L-k+1) edges and (L-k+2) nodes ((k-1)-mers) in the De Bruijn graph.
#### 5.4.1 Nodes = (k-1)-mers, edges = k-mers
The per-partition De Bruijn graph has:
- **Nodes**: the unique (k-1)-mers (extracted from all positions in the super-kmers)
- **Edges**: the k-mers (each position in a super-kmer = one edge between two consecutive (k-1)-mers)
- **Weights**: each edge (k-mer) carries the count of the super-kmer that contains it
Branch points (a node with in-degree > 1 or out-degree > 1) can occur:
- At the **ends** of super-kmers (junctions between super-kmers)
- At **internal positions** if a k-mer from another super-kmer joins an internal (k-1)-mer
#### 5.4.2 The complete graph is required
Building the graph with **all** (k-1)-mers (internal and end) is necessary to detect branch points correctly. Restricting it to super-kmer ends would be incorrect, because an end (k-1)-mer of one super-kmer can correspond to an internal node of another.
For a typical partition of 10K super-kmers with a mean length of 38 bases → ~80K k-mers → ~80K edges and ~80K nodes. See section 4.6 for memory feasibility (~3 MB per typical partition, ~17 MB max).
#### 5.4.3 Data structure: sorted edge array
Rather than hash maps, the graph uses a **sorted array**:
```go
type Edge struct {
	srcKmer uint64 // source (k-1)-mer (prefix of the k-mer)
	dstKmer uint64 // destination (k-1)-mer (suffix of the k-mer)
	weight  int32  // count of the super-kmer containing this k-mer
}
```
For each super-kmer S of length L and count C, (L-k+1) edges are emitted. Total array for a typical partition: ~80K × 20 bytes = **~1.6 MB** (20 bytes of payload per edge; on 64-bit platforms Go pads this struct to 24 bytes, i.e. ~1.9 MB).
Sorting by `srcKmer` gives the outgoing adjacency list; alternatively, two sorted views (by src and by dst) give both incoming and outgoing adjacency.
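Edge emission and sorting might look like the following sketch. It assumes the 2-bit encoding A=0, C=1, G=2, T=3 and k ≤ 33 so that a (k-1)-mer fits in a `uint64`; `appendEdges` and `sortAndMerge` are hypothetical helpers, and summing the weights of duplicate k-mers is this sketch's way of aggregating counts.

```go
package main

import (
	"fmt"
	"sort"
)

type Edge struct {
	srcKmer uint64 // source (k-1)-mer
	dstKmer uint64 // destination (k-1)-mer
	weight  int32  // accumulated count
}

func code(b byte) uint64 {
	switch b {
	case 'C':
		return 1
	case 'G':
		return 2
	case 'T':
		return 3
	}
	return 0
}

// appendEdges emits one Edge per k-mer of seq, using a rolling (k-1)-mer.
func appendEdges(edges []Edge, seq string, k int, count int32) []Edge {
	mask := uint64(1)<<(2*uint(k-1)) - 1
	var node uint64
	for i := 0; i < len(seq); i++ {
		prev := node // (k-1)-mer ending at base i-1
		node = (node<<2 | code(seq[i])) & mask
		if i >= k-1 {
			edges = append(edges, Edge{prev, node, count})
		}
	}
	return edges
}

// sortAndMerge sorts edges by (srcKmer, dstKmer) and sums the weights of
// identical k-mers, reusing the input slice for the result.
func sortAndMerge(edges []Edge) []Edge {
	sort.Slice(edges, func(i, j int) bool {
		if edges[i].srcKmer != edges[j].srcKmer {
			return edges[i].srcKmer < edges[j].srcKmer
		}
		return edges[i].dstKmer < edges[j].dstKmer
	})
	out := edges[:0]
	for _, e := range edges {
		if n := len(out); n > 0 && out[n-1].srcKmer == e.srcKmer && out[n-1].dstKmer == e.dstKmer {
			out[n-1].weight += e.weight
		} else {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	var edges []Edge
	edges = appendEdges(edges, "CGTA", 3, 2)
	edges = appendEdges(edges, "ACGT", 3, 5)
	fmt.Println(sortAndMerge(edges))
}
```

In the example, the k-mer CGT occurs in both super-kmers, so its edge ends up with weight 2 + 5 = 7.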
#### 5.4.4 Detecting canonical unitigs
A canonical unitig is a maximal non-branching path. The algorithm:
```
1. Extract all edges from the super-kmers → array edges[]
2. Sort edges[] by srcKmer → outgoing view
   Sort a copy by dstKmer → incoming view
3. For each unique (k-1)-mer:
   - out_degree = number of distinct edges with this srcKmer
   - in_degree = number of distinct edges with this dstKmer
   - If out_degree == 1 AND in_degree == 1 → internal unitig node
   - Otherwise → branch node (start or end of a unitig)
4. Walk the non-branching paths to build the unitigs
   - Each unitig is a sequence of chained (k-1)-mers
   - The weight vector is the sequence of weights of the traversed edges
```
The (k-1)-mers are **not canonized**: the orientation of the canonical super-kmers is respected. Chaining is strictly oriented.
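The four steps can be sketched compactly as below. For brevity this illustration departs from the sorted-array design: it uses Go maps keyed by strings instead of sorted `uint64` arrays, and it leaves out the weight vectors. The function name `unitigs` is hypothetical. Edges are only followed on the same strand.

```go
package main

import (
	"fmt"
	"sort"
)

// unitigs returns the maximal non-branching paths over the distinct k-mers
// of the given super-kmers. Nodes are (k-1)-mers, edges are k-mers.
func unitigs(superKmers []string, k int) []string {
	out := map[string][]string{} // (k-1)-mer -> distinct outgoing k-mers
	in := map[string]int{}       // (k-1)-mer -> number of distinct incoming k-mers
	seen := map[string]bool{}
	var kmers []string
	for _, s := range superKmers {
		for i := 0; i+k <= len(s); i++ {
			km := s[i : i+k]
			if seen[km] {
				continue
			}
			seen[km] = true
			kmers = append(kmers, km)
			out[km[:k-1]] = append(out[km[:k-1]], km)
			in[km[1:]]++
		}
	}
	sort.Strings(kmers) // deterministic traversal order
	internal := func(node string) bool { return in[node] == 1 && len(out[node]) == 1 }

	used := map[string]bool{}
	extend := func(km string) string {
		path := km
		used[km] = true
		for node := km[1:]; internal(node); {
			next := out[node][0]
			if used[next] {
				break // the walk closed a cycle
			}
			used[next] = true
			path += next[k-1:]
			node = next[1:]
		}
		return path
	}

	var res []string
	// A unitig starts at every edge leaving a branching or terminal node.
	for _, km := range kmers {
		if !internal(km[:k-1]) {
			res = append(res, extend(km))
		}
	}
	// Edges still unused belong to isolated cycles; open each one at its
	// smallest k-mer.
	for _, km := range kmers {
		if !used[km] {
			res = append(res, extend(km))
		}
	}
	return res
}

func main() {
	fmt.Println(unitigs([]string{"ACGT", "CGTT", "CGTA"}, 3))
}
```

In the example the node GT has two outgoing k-mers (GTT and GTA), so the path ACGT stops there and each branch forms its own unitig.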
#### 5.4.5 Memory estimate
| Partition | K-mers (edges) | Edge RAM | Node RAM | Total |
|-----------|----------------|----------|----------|-------|
| Typical (~10K skm, 38 bases avg) | ~80K | ~1.6 MB | ~1.3 MB | **~3 MB** |
| Maximal (~57K skm) | ~460K | ~9.2 MB | ~7.4 MB | **~17 MB** |
With parallel processing by a pool of G goroutines: max RAM = G × 17 MB. With G=8: **~136 MB** at peak. The edge array can be reused across partitions (single allocation, reset between uses).
Complexity: O(E log E) per partition, with E = total number of k-mers. It is dominated by the two sorts.
### 5.5 One graph per minimizer, not per partition
K-mers (edges) are partitioned by minimizer. Two adjacent k-mers in the De Bruijn graph can have different minimizers; this is exactly what defines super-kmer boundaries. If the graph is built per partition (which groups several minimizers), the junction (k-1)-mers between different minimizers would appear as nodes shared between partitions → the per-partition graph is not self-contained.
**Solution: build one graph per minimizer.**
A super-kmer is by definition a path in which **all k-mers share the same minimizer**. Therefore:
- All edges of a super-kmer belong to a single minimizer
- Each per-minimizer graph is **100% self-contained**: all its edges and internal nodes are local to it
- The (k-1)-mers at super-kmer ends that touch another minimizer are extremities (degree 0 in this graph) → natural unitig ends
- No inter-graph junctions → no artificial unitig breaks
**Graph sizes**: with ~16K theoretical minimizers per partition (4^13 / 4096, for P=4096 and m=13), the naive calculation gives ~5 edges/minimizer. In practice, however, many minimizers are not represented (biological sequences are not random) and the distribution is very uneven. If only ~500-1000 minimizers are actually present in a typical partition of 80K edges, the average is rather **80-160 edges** per minimizer, with a tail reaching hundreds or thousands for the most frequent minimizers. Even then, the graphs remain small (a few KB to a few tens of KB).
*To be measured*: the number of distinct minimizers per partition and the distribution of the number of k-mers per minimizer on an existing index.
**Algorithm**: since the super-kmers are already sorted by minimizer within the partition, they are iterated sequentially and a small graph is built and discarded at each change of minimizer. This is simpler than the per-partition graph: no global sort of all edges, just a reused local buffer.
**The resulting unitigs** are the maximal non-branching paths within one minimizer. A unitig never crosses a minimizer boundary, which is correct: all k-mers of a unitig share the same canonical minimizer, which reinforces the absolute canonicity property.
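The sequential iteration with a reused buffer could be sketched as follows. The `SuperKmer` type, the `forEachMinimizer` function, and the minimizer values in `main` are hypothetical and chosen only for illustration; a real callback would build the per-minimizer graph instead of just collecting k-mers.

```go
package main

import "fmt"

type SuperKmer struct {
	Minimizer uint64
	Seq       string
	Count     int
}

// forEachMinimizer calls fn once per run of consecutive super-kmers that
// share the same minimizer. sks must already be sorted by Minimizer.
func forEachMinimizer(sks []SuperKmer, fn func(minimizer uint64, group []SuperKmer)) {
	for start := 0; start < len(sks); {
		end := start + 1
		for end < len(sks) && sks[end].Minimizer == sks[start].Minimizer {
			end++
		}
		fn(sks[start].Minimizer, sks[start:end])
		start = end
	}
}

func main() {
	const k = 4
	sks := []SuperKmer{{7, "ACGTT", 2}, {7, "CGTTA", 1}, {9, "GGATC", 4}}
	var kmers []string // local buffer, reused for every minimizer
	forEachMinimizer(sks, func(minimizer uint64, group []SuperKmer) {
		kmers = kmers[:0]
		for _, s := range group {
			for i := 0; i+k <= len(s.Seq); i++ {
				kmers = append(kmers, s.Seq[i:i+k])
			}
		}
		fmt.Println(minimizer, len(group), len(kmers))
	})
}
```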
### 5.6 When canonical unitigs do not help
- If the super-kmers are short (little overlap between adjacent super-kmers)
- If the graph is highly branched (regions of divergence between genomes)
- If many junctions occur through strand crossing (the canonical constraint prevents merging)
- Metabarcoding data with high taxonomic diversity → short unitigs
In these cases, storing the dereplicated super-kmers directly is sufficient: they are already canonical by construction.
## 6. Multiset construction: super-kmers per set, common graph
### 6.1 Two-phase pipeline
Building a multiset index (N sets) is done in two distinct phases:
**Phase 1: per set (independent, parallelizable)**:
Each set i (i = 0..N-1) independently produces its canonical super-kmers:
```
Set i: sequences → [0] 2-bit → [1] lowmask → [2] super-kmers → [3] canonization → [4] partition .skm_i
```
Then comes dereplication per partition: sorted super-kmers with counts, written to separate `.skm` files per set.
**Phase 2: per partition, all sets together**:
For each partition p (parallelizable, one goroutine per partition):
```
.skm_0[p], .skm_1[p], ..., .skm_{N-1}[p] (sorted super-kmers of each set)
[a] N-way merge of the sorted super-kmers
│ Same super-kmer in sets i and j → merge into a count vector [c_0, ..., c_{N-1}]
│ Super-kmer only in set i → counts = [0, ..., c_i, ..., 0]
[b] Extraction of the De Bruijn graph edges
│ Each k-mer (edge) carries a weight vector [w_0, ..., w_{N-1}]
│ w_i = count of the super-kmer containing this k-mer in set i
[c] Construction of a SINGLE De Bruijn graph per partition
│ Branch points are defined by the UNION of all sets:
│ if a (k-1)-mer has degree > 1 once the edges of all sets are pooled, it is a branch point
[d] Extraction of the common canonical unitigs
│ Same sequence for all sets
│ Each position carries a weight vector (one count per set)
[e] Frequency filter (optional, per set or global)
[f] Run-length encoding of the bitmask along each unitig
Write to the partition's .sku
```
### 6.2 The graph is defined by the union
Key point: the unitigs are determined by the **topology of the union** of all sets. A branch point in a single set forces a unitig break for all sets. This guarantees that:
- The unitigs are the same whatever the order of the sets
- A given k-mer is always found in the same place (same unitig, same position)
- Set operations (intersection, union, difference) operate on the same unitigs
### 6.3 Edges with weight vectors
The Edge structure (section 5.4.3) is extended for the multiset:
```go
type Edge struct {
	srcKmer uint64  // source (k-1)-mer
	dstKmer uint64  // destination (k-1)-mer
	weights []int32 // weights[i] = count in set i (0 if absent)
}
```
For branch detection, the degree of a node is the number of distinct edges (by dstKmer for the out-degree), **independently** of the sets. An edge present in set 0 but not in set 1 still counts.
### 6.4 Run-length bitmask along unitigs
Along a unitig, the bitmask (which sets contain this k-mer) rarely changes, because the regions conserved between genomes are long. It is encoded as:
```
unitig_bitmask = [(bitmask_1, run_length_1), (bitmask_2, run_length_2), ...]
```
`bitmask_i` has one bit per set (bit j = 1 if `weights[j] > 0` at that position).
For a unitig of 70 k-mers with 2 sets:
- Fully shared: 1 run `(0b11, 70)` → 2 bytes
- Divergent in the middle: 2-3 runs → 4-6 bytes
- Worst case: 70 runs → 140 bytes (very rare)
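Deriving the runs from the per-position weight vectors could look like this sketch. It is limited to 8 sets so that a mask fits in a `uint8`, and the `Run` type and `bitmaskRuns` function are hypothetical names; the on-disk byte layout of a run is not addressed here.

```go
package main

import "fmt"

// Run is one (bitmask, run length) pair along a unitig.
type Run struct {
	Mask   uint8
	Length int
}

// bitmaskRuns turns per-position weight vectors (weights[pos][set]) into
// runs of identical bitmasks. Bit j of a mask is set when weights[pos][j] > 0.
func bitmaskRuns(weights [][]int32) []Run {
	var runs []Run
	for _, w := range weights {
		var mask uint8
		for set, c := range w {
			if c > 0 {
				mask |= 1 << uint(set)
			}
		}
		if n := len(runs); n > 0 && runs[n-1].Mask == mask {
			runs[n-1].Length++
		} else {
			runs = append(runs, Run{mask, 1})
		}
	}
	return runs
}

func main() {
	// Four k-mers, two sets: shared, shared, set 1 only, set 1 only.
	fmt.Println(bitmaskRuns([][]int32{{3, 1}, {2, 1}, {0, 4}, {0, 2}}))
}
```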
### 6.5 Memory impact of the multiset
The per-edge weight vector increases the size of the graph:
- 1 set: `weight int32` → 4 bytes/edge
- N sets: `weights [N]int32` → 4N bytes/edge
For the typical partition (~80K edges) with N=2 sets: overhead = 80K × 4 = **320 KB** extra. Negligible.
For N=64 sets (extreme case): 80K × 256 = **~20 MB** per partition. This remains feasible, but very large numbers of sets may require a more compact encoding (sparse vector if there are many zeros).
### 6.6 Merging super-kmers: N-way merge on sorted sequences
Merging the N lists of sorted super-kmers (sorted by 2-bit sequence) is a classic N-way merge with a min-heap:
- Each .skm is already sorted by sequence (dereplication step)
- The 2-bit packed sequences are compared (uint64 comparison, very fast)
- When the same super-kmer appears in several sets, the counts are merged
- When a super-kmer is unique to one set, the other counts are 0
This is analogous to the existing `KWayMerge` on sorted k-mers, extended to super-kmers.
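The merge described above can be sketched with the standard `container/heap` package. This is not the existing `KWayMerge` code: the `Entry`, `Merged`, and `mergeSets` names are hypothetical, strings stand in for the packed sequences, and every input list is assumed to be sorted and dereplicated.

```go
package main

import (
	"container/heap"
	"fmt"
)

// Entry is a dereplicated super-kmer of one set.
type Entry struct {
	Seq   string
	Count int32
}

// Merged is a super-kmer with one count per set.
type Merged struct {
	Seq    string
	Counts []int32
}

// item is the head of one input list: its sequence, set number, and index.
type item struct {
	seq      string
	set, idx int
}

type minHeap []item

func (h minHeap) Len() int            { return len(h) }
func (h minHeap) Less(i, j int) bool  { return h[i].seq < h[j].seq }
func (h minHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x interface{}) { *h = append(*h, x.(item)) }
func (h *minHeap) Pop() interface{} {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// mergeSets merges N sorted lists into one sorted list of count vectors.
func mergeSets(sets [][]Entry) []Merged {
	h := &minHeap{}
	for s, list := range sets {
		if len(list) > 0 {
			heap.Push(h, item{list[0].Seq, s, 0})
		}
	}
	var out []Merged
	for h.Len() > 0 {
		it := heap.Pop(h).(item)
		if n := len(out); n == 0 || out[n-1].Seq != it.seq {
			out = append(out, Merged{it.seq, make([]int32, len(sets))})
		}
		out[len(out)-1].Counts[it.set] += sets[it.set][it.idx].Count
		if next := it.idx + 1; next < len(sets[it.set]) {
			heap.Push(h, item{sets[it.set][next].Seq, it.set, next})
		}
	}
	return out
}

func main() {
	set0 := []Entry{{"AAC", 2}, {"ACG", 1}}
	set1 := []Entry{{"ACG", 4}, {"TTT", 1}}
	fmt.Println(mergeSets([][]Entry{set0, set1}))
}
```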
## 7. v3 storage format: parallel files
### 7.1 Architecture: 3 files per partition
For each partition, three aligned files:
```
index_v3/
  metadata.toml
  parts/
    part_PPPP.sku   # 2-bit sequences of the concatenated unitigs
    part_PPPP.skx   # per-minimizer index (offsets into .sku)
    part_PPPP.skb   # multiset bitmask (1 entry per k-mer)
    ...
  set_N/spectrum.bin   # per-set frequency spectrum
```
### 7.2 .sku file: concatenated unitig sequences
All the unitigs of a partition are concatenated end to end in 2-bit packed form, **ordered by minimizer**. There is no separator between two unitigs in the 2-bit stream.
A **length table**, stored in the header or in the .skx, gives the length (in bases) of each unitig, in order. This table makes it possible to:
- Recover the unitig boundaries
- Determine whether an AC match straddles a junction (to be filtered out)
- Index a unitig directly by its number
```
.sku format:
  Magic:      "SKU\x01" (4 bytes)
  TotalBases: uint64 LE (total number of bases in the partition)
  NUnitigs:   uint64 LE (number of unitigs)
  Lengths:    [NUnitigs]varint (length in bases of each unitig)
  Sequence:   ceil(TotalBases/4) bytes (continuous 2-bit stream)
```
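A writer for this layout might be sketched as below. It assumes that "varint" means the unsigned varint produced by Go's `binary.PutUvarint` (the document does not say which variant is intended) and that the caller supplies an already packed sequence; `encodeSKU` is a hypothetical name.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeSKU serializes the header, the unitig lengths (in bases), and the
// already packed 2-bit sequence stream into one byte slice.
func encodeSKU(lengths []uint64, seq []byte) []byte {
	var total uint64
	for _, l := range lengths {
		total += l
	}
	var buf bytes.Buffer
	buf.WriteString("SKU\x01") // Magic

	var u64 [8]byte
	binary.LittleEndian.PutUint64(u64[:], total) // TotalBases
	buf.Write(u64[:])
	binary.LittleEndian.PutUint64(u64[:], uint64(len(lengths))) // NUnitigs
	buf.Write(u64[:])

	var v [binary.MaxVarintLen64]byte
	for _, l := range lengths { // Lengths, one unsigned varint each
		n := binary.PutUvarint(v[:], l)
		buf.Write(v[:n])
	}
	buf.Write(seq) // Sequence, ceil(TotalBases/4) bytes expected
	return buf.Bytes()
}

func main() {
	// Two unitigs of 5 and 3 bases: 8 bases, hence 2 bytes of sequence.
	b := encodeSKU([]uint64{5, 3}, []byte{0x1B, 0x1B})
	fmt.Printf("%d bytes: %x\n", len(b), b)
}
```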
### 7.3 .skx file: per-minimizer index
For each minimizer present in the partition, the index gives the offset (in bases) into the .sku stream and the number of unitigs:
```
.skx format:
  Magic:       "SKX\x01" (4 bytes)
  NMinimizers: uint32 LE (number of minimizers present)
  Entries: [NMinimizers] {
    Minimizer:    uint64 LE (value of the canonical minimizer)
    BaseOffset:   uint64 LE (offset in bases into the .sku stream)
    UnitigOffset: uint32 LE (index of this minimizer's first unitig in the length table)
    NUnitigs:     uint32 LE (number of unitigs for this minimizer)
  }
```
Entries are sorted by `Minimizer` → binary search in O(log N).
To access the unitigs of a given minimizer:
1. Binary search in the .skx → `BaseOffset`, `UnitigOffset`, `NUnitigs`
2. Seek in the .sku to bit `BaseOffset × 2` of the sequence stream
3. Read `NUnitigs` unitigs (lengths in the table starting at `UnitigOffset`)
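Step 1 of the access procedure maps naturally onto `sort.Search`. In this sketch the entries are assumed to be already loaded in memory, and `SkxEntry` and `lookup` are hypothetical names mirroring the fields of the format.

```go
package main

import (
	"fmt"
	"sort"
)

// SkxEntry mirrors one entry of the .skx index.
type SkxEntry struct {
	Minimizer    uint64
	BaseOffset   uint64
	UnitigOffset uint32
	NUnitigs     uint32
}

// lookup finds the entry of a minimizer by binary search.
// entries must be sorted by Minimizer.
func lookup(entries []SkxEntry, minimizer uint64) (SkxEntry, bool) {
	i := sort.Search(len(entries), func(i int) bool {
		return entries[i].Minimizer >= minimizer
	})
	if i < len(entries) && entries[i].Minimizer == minimizer {
		return entries[i], true
	}
	return SkxEntry{}, false
}

func main() {
	entries := []SkxEntry{{3, 0, 0, 2}, {10, 90, 2, 1}, {42, 131, 3, 4}}
	if e, ok := lookup(entries, 10); ok {
		// The block starts at bit BaseOffset*2 of the sequence stream.
		fmt.Println(e.BaseOffset*2, e.UnitigOffset, e.NUnitigs)
	}
	_, ok := lookup(entries, 11)
	fmt.Println(ok)
}
```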
### 7.4 .skb file: parallel multiset bitmask
The bitmask file is **aligned position by position** with the k-mer stream of the unitigs. Each k-mer (position in a unitig) has exactly one entry in the .skb, in the same order in which the k-mers appear when the unitigs are read sequentially.
```
.skb format:
  Magic:       "SKB\x01" (4 bytes)
  TotalKmers:  uint64 LE (total number of k-mers)
  NSets:       uint8 (number of sets)
  BitmaskSize: uint8 (ceil(NSets/8) bytes per entry)
  Bitmasks:    [TotalKmers × BitmaskSize] bytes
```
**Direct access**: the absolute position of a k-mer in the unitig stream (offset in k-mers from the start of the partition) directly gives the index into the .skb file:
```
bitmask_offset = header_size + kmer_position × BitmaskSize
```
For 2 sets: 1 byte per k-mer (6 unused bits).
For ≤8 sets: 1 byte per k-mer.
For ≤16 sets: 2 bytes per k-mer.
**Cost**: for 295M k-mers (*Betula*, 2 sets): 295 MB. For the Contaminent_idx index (18.6B k-mers, 2 sets): ~18.6 GB. This is significant; see section 7.5 for compression.
### 7.5 Compressing the bitmask: RLE or not?
| Approach | Size (2 sets, 295M k-mers) | Access |
|----------|------------------------------|-------|
| Uncompressed (1 byte/k-mer) | 295 MB | O(1) direct |
| RLE per unitig | ~10-50 MB (estimated) | O(decode) per unitig |
| Bitset per set (1 bit/k-mer/set) | 74 MB | O(1) direct |
The **bitset per set** approach (1 bit per k-mer per set, packed into bytes) is a good compromise:
- 2 sets: 2 bits/k-mer → ~74 MB (vs 295 MB uncompressed)
- O(1) access: `bit = (data[kmer_pos / 4] >> ((kmer_pos % 4) × 2)) & 0x3`
- No sequential decompression needed
For very large indexes (18.6B k-mers), even the bitset takes ~4.6 GB. RLE per minimizer (or per unitig) could bring this down to ~1-2 GB, but it loses O(1) access.
**Recommendation**: packed bitset for ≤8 sets (O(1) access), RLE for >8 sets or very large indexes.
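As a small illustration of the packed bitset for 2 sets, the sketch below applies the access formula quoted above and adds the matching write operation. The function names `SetMask` and `PutMask` are hypothetical; the layout (four k-mers per byte, lowest bit pair first) is simply what that formula implies.

```go
package main

import "fmt"

// SetMask returns the 2-bit membership mask of the k-mer at kmerPos,
// using the formula from the text: 4 k-mers per byte, 2 sets.
func SetMask(data []byte, kmerPos uint64) byte {
	return (data[kmerPos/4] >> ((kmerPos % 4) * 2)) & 0x3
}

// PutMask stores a 2-bit mask at kmerPos (the inverse of SetMask).
func PutMask(data []byte, kmerPos uint64, mask byte) {
	shift := (kmerPos % 4) * 2
	data[kmerPos/4] &^= byte(0x3) << shift     // clear the two bits
	data[kmerPos/4] |= (mask & 0x3) << shift   // write the new mask
}

func main() {
	const totalKmers = 8
	data := make([]byte, (totalKmers+3)/4) // ceil(totalKmers/4) bytes

	PutMask(data, 0, 0b01) // k-mer 0: present in set 0 only
	PutMask(data, 5, 0b10) // k-mer 5: present in set 1 only
	PutMask(data, 7, 0b11) // k-mer 7: present in both sets

	for pos := uint64(0); pos < totalKmers; pos++ {
		fmt.Printf("kmer %d -> mask %02b\n", pos, SetMask(data, pos))
	}
}
```

With this layout, bit 0 of the mask stands for set 0 and bit 1 for set 1; that assignment is an assumption of the sketch, as the text does not fix it.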
## 8. Matching with Aho-Corasick on the unitig stream
### 8.1 Principle
For each partition whose query k-mers share the minimizer:
1. Seek in the .sku to the minimizer's block (via the .skx)
2. Build an AC automaton from the canonical query k-mers of this minimizer
3. Scan the 2-bit stream of this minimizer's unitigs
4. For each match: check that it does not straddle a unitig boundary
5. For each valid match: look up the .skb at the corresponding position → bitmask
### 8.2 The problem of false matches at junctions
With 2-bit encoding there is no fifth letter available to separate unitigs. The AC scan over the continuous stream can therefore report matches that straddle two adjacent unitigs.
**Solution: post-filtering with the length table.**
During the scan, we maintain a position counter and an index into the length table (cumulative prefix). When a match is found at position `p`:
- The match covers bases `[p, p+k-1]`
- If these bases straddle a unitig boundary → false positive, ignore
- Otherwise → valid match
The cost of post-filtering is O(1) per match (the boundary counter advances sequentially).
**Estimated false-positive rate**: with unitigs of ~50 bases on average there is a junction roughly every 50 bases, and with k=31 each junction is straddled by k-1 = 30 window positions, so about 30/50 = ~60% of stream positions could in principle yield a false match. Only a tiny fraction of those positions, however, correspond to a pattern in the AC automaton. In practice, the number of false positives is negligible.
### 8.3 From a match to the absolute position in the .skb
An AC match at position `p` in the minimizer's stream translates into a k-mer position in the .skb:
```
kmer_position_in_partition = base_offset_of_minimizer_in_partition
                           + p
                           - (number of padding/boundary bases before p)
```
More precisely, if the length table gives unitig lengths in bases, the cumulative k-mer position is:
```
For unitig i containing the match:
kmer_base = sum of the lengths of unitigs 0..i-1
kmer_offset_in_unitig = p - kmer_base
kmer_index = sum of (len_j - k + 1) for j=0..i-1 + kmer_offset_in_unitig
```
This `kmer_index` is the direct index into the .skb file.
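The junction filter of section 8.2 and the index computation above can be combined in one helper, sketched here in Go. It assumes every unitig is at least k bases long and that `p` is counted from the first base of unitig 0 in the length table; the name `KmerIndex` is hypothetical. For brevity it walks the table from the start on each call, whereas a real scanner would keep a running cursor to get the O(1) cost per match mentioned in the text.

```go
package main

import "fmt"

// KmerIndex maps a match starting at base position p to a k-mer index,
// following the formulas above. It reports ok=false when the k bases
// [p, p+k-1] straddle a unitig boundary or when p is out of range.
func KmerIndex(lengths []int, k, p int) (index int, ok bool) {
	base := 0  // total bases in unitigs 0..i-1
	kmers := 0 // total k-mers in unitigs 0..i-1
	for _, l := range lengths {
		if p < base+l { // the match starts inside this unitig
			if p+k > base+l {
				return 0, false // it runs into the next unitig: false match
			}
			return kmers + (p - base), true
		}
		base += l
		kmers += l - k + 1
	}
	return 0, false // p lies past the end of the stream
}

func main() {
	lengths := []int{5, 4} // two unitigs, with k = 3
	for p := 0; p < 9; p++ {
		idx, ok := KmerIndex(lengths, 3, p)
		fmt.Printf("p=%d valid=%v index=%d\n", p, ok, idx)
	}
}
```

If `p` is relative to one minimizer's block rather than to the partition, the k-mer offset of that block would still have to be added to obtain the absolute .skb position.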
### 8.4 Comparison with the v1 merge-scan
| Aspect | Merge-scan (v1) | AC on unitigs (v3) |
|--------|----------------|---------------------|
| Prerequisite | Sorting the queries, O(Q log Q) | Building the AC automaton, O(Q×k) |
| Seek | .kdx sparse index | .skx per-minimizer index |
| Scan | O(Q + K) linear merge per set | O(bases_of_minimizer + matches) |
| Multi-set | **N passes** (one per set) | **A single pass** (.skb bitmask) |
| I/O | N×P file opens | 1 seek + sequential read + .skb lookup |
| Bitmask access | implicit (each .kdi = 1 set) | O(1) in the .skb |
The main gain of v3 is the **elimination of the N passes**: instead of scanning N times (once per set), we scan once and consult the bitmask. For N=2 sets and P=4096 partitions, this reduces the number of file opens from 2×4096 = 8192 to 4096.
## 9. Size estimates and experimental validation
### 9.1 Measured case: *Betula exilis* 15× (raw reads, count > 1)
| Metric | Value |
|----------|--------|
| Unique super-kmers (count > 1) | 37.3M |
| Mean length | 37.9 bases |
| Total bases | 1.415G |
**2-bit packed binary storage**:
- Sequences: 1.415G / 4 = **354 MB**
- Headers (varint length + minimizer): 37.3M × ~4 bytes = **150 MB**
- Bitmask (1 set → 0 bytes, or 2 sets → 1 byte/entry = 37 MB)
- **Estimated total: ~500-550 MB** for one set
### 9.2 Extrapolation to the Plants+Human index (2 sets)
The current v1 index holds 18.6B k-mers in 85 GB. With the v3 pipeline:
**Scenario: raw reads at 15× per genome** (extrapolated from *Betula exilis*):
- *Betula exilis* measured: ~37M super-kmers, ~1.4G bases → ~500 MB
- Proportionally, for the Contaminent_idx index (18.6B k-mers): **~2-5 GB**
**Scenario: assembled genome (no frequency filter)**:
- A 3 Gbase assembled genome → estimated ~80M super-kmers × 38 bases → **760 MB**
- A 10 Gbase assembled genome → estimated ~350M super-kmers × 38 bases → **3.3 GB**
- With multiset overlap: shared super-kmers are merged (bitmask) → **~4 GB**
**The gain is dramatic in both scenarios**:
- Raw reads: a factor of **~30-40×** thanks to dereplication + frequency filtering
- Assembled genomes: a factor of **~20×** thanks to the super-kmer format alone
The super-kmer format is intrinsically more efficient than delta-varint because it exploits the local structure of the De Bruijn graph: consecutive k-mers share (k-1) bases, which are encoded only once in the super-kmer.
### 9.3 Experimental validation: BCALM2 unitigs
*Betula exilis* 15×, after lowmask + canonical super-kmers + dereplication + count>1 filter, run through BCALM2 (`-kmer-size 31 -abundance-min 1`):
| Metric | Super-kmers (count>1) | Unitigs (BCALM2) | Ratio |
|----------|----------------------|-------------------|-------|
| Variants | 37,294,271 | 6,473,171 | **5.8×** |
| Total bases | 1,415,018,593 | 408,070,894 | **3.5×** |
| Mean length | 37.9 bases | 63.0 bases | 1.7× |
| Estimated k-mers | ~295M | ~213M | n/a |
### Estimated storage
| Format | Estimated size | Bytes/k-mer | Factor vs v1 |
|--------|---------------|-------------|---------------|
| .kdi v1 (delta-varint, assembled) | 12.8 GB | 4.3 | 1× |
| Super-kmers 2-bit (count>1) | ~500 MB | 1.7 | 25× |
| **Unitigs 2-bit (BCALM2)** | **~130 MB** | **0.6** | **98×** |
### Extrapolation to the Contaminent_idx index (Plants+Human, 2 sets)
The ~100× factor measured on *Betula exilis* 15× breaks down as follows:
- Dereplication of redundant reads: factor ~15× (15× coverage)
- Super-kmer/unitig compaction vs delta-varint: factor ~100/15 ≈ **6.7×**
The Contaminent_idx index is built from **assembled genomes** (no sequencing redundancy). Only the unitig compaction factor applies:
- Current v1 index: 85 GB (Plants 72 GB + Human 12.8 GB)
- **Unitig estimate: ~85 / 6.7 ≈ 12-13 GB** (factor **~6.7×**)
This is a significant gain, but far less dramatic than on raw reads. The factor could be better if unitigs from assembled genomes turn out longer than those from reads (less fragmentation caused by sequencing errors).
### Observation on the number of k-mers
The unitigs contain ~213M k-mers versus ~295M estimated in the super-kmers. The difference (~80M) probably comes from k-mers that were counted in several super-kmers (at junctions) and are counted only once in the unitigs (exact deduplication by the De Bruijn graph).
### Conclusion
The unitig approach is massively more compact than all the alternatives. The final storage format should be based on unitigs (or at the very least on dereplicated super-kmers) rather than on individual k-mers in delta-varint.
## 10. Open questions
### 10.1 Is the super-kmer format always better than delta-varint?
According to the revised estimates (section 9.2), the 2-bit super-kmer format is **always more compact** than delta-varint, even for assembled genomes:
- Raw reads 15×: ~500 MB vs ~1.5 GB (factor 3×, for the same k-mers) + massive dereplication
- Assembled genomes: ~1.2 bytes/k-mer vs ~5 bytes/k-mer (factor 4×)
The fundamental reason: delta-varint encodes each k-mer independently (even with deltas), whereas the super-kmer exploits the overlap of (k-1) bases between consecutive k-mers. This is a structural advantage that delta-varint cannot make up.
**The super-kmer format therefore seems preferable in all cases.**
### 10.2 Should the index store super-kmers or k-mers?
Storing super-kmers/unitigs as the final index format has advantages (compactness, natural scan) but also drawbacks:
- No fast seek to a specific k-mer (vs the .kdx sparse index)
- Matching by full scan is O(total_bases) vs O(Q + K) for the merge-scan
- Set operations (Union, Intersect) become more complex
**Possible hybrid approach**:
1. Construction phase: lowmask → canonical super-kmers → dereplication → frequency filter
2. Finalization phase: extract the unique k-mers from the filtered super-kmers → delta-varint .kdi (v1 or v2)
3. Super-kmers serve as an **efficient intermediate format**, not as the final index format
This combines the best of both worlds:
- Highly efficient dereplication at the super-kmer level (factor 16× on raw reads)
- A final index that is compact and query-efficient in delta-varint
### 10.3 Is the simple frequency filter (super-kmer level) sufficient?
To be validated experimentally:
- Compare the number of k-mers retained by the super-kmer filter vs the exact k-mer filter
- Measure the impact on biological metrics (Jaccard, match positions)
- If the difference is <1%, the simple filter is enough
### 10.4 Aho-Corasick vs merge-scan for the final matching?
If the final index format remains delta-varint (question 10.2), the merge-scan remains the natural matching method. AC/hash-set matching is only worthwhile if the storage format is sequence-based (unitigs/super-kmers).
## 11. Next step: experimental validation
Before changing the architecture, validate on real data:
1. **Super-kmer compaction rate**: on an assembled genome vs raw reads, measure the number of unique super-kmers and their mean length
2. **Impact of the super-kmer filter**: compare filtering at the super-kmer level vs filtering at the exact k-mer level on a reference dataset
3. **Unitig assembly rate**: measure the length of the unitigs obtained from the dereplicated super-kmers
4. **Storage benchmark**: compare index size for super-kmer vs delta-varint vs unitig on the same data
5. **Matching benchmark**: compare AC/hash matching time vs merge-scan for different query densities


@@ -0,0 +1,508 @@
# Redesign plan for the obikmer package: disk-based index partitioned by minimizer
## Observation
Roaring64 bitmaps are not suited to storing 10^10 k-mers
(k=31) scattered over a space of 2^62. The structural overhead (roaring
containers per 32-bit high key) exceeds the size of the data itself,
and `Or()` operations between fragmented bitmaps do not finish in
reasonable time.
## Principle of the new architecture
A `KmerSet` is a sorted set of canonical k-mers (uint64) stored
on disk and partitioned by minimizer. Each partition is a binary
file containing sorted uint64 values, compressed with delta-varint.
A `KmerSetGroup` is a directory containing N sets partitioned
in the same way (same k, same m, same P).
A `KmerSet` is a `KmerSetGroup` of size 1 (singleton).
Set operations are performed partition by partition, as a streaming
merge, without loading the full index into memory.
## Index life cycle
An index goes through three distinct phases:
1. **Construction phase (mutable)**: the index is opened and sequences
are added to it. For each sequence, the super-kmers are extracted
and written compactly (2 bits/base) to the temporary file of the
corresponding partition (`minimizer % P`). Super-kmers are a
natural compressed representation of overlapping k-mers: a
super-kmer of length L encodes L-k+1 k-mers while storing only
~L/4 bytes instead of (L-k+1) × 8 bytes.
2. **Closing phase (optimization)**: the index is closed, which
triggers processing **partition by partition** (independent,
parallelizable):
- Load the super-kmers of the partition
- Extract all canonical k-mers from them
- Sort the k-mer array
- Deduplicate (and count if FrequencyFilter is enabled)
- Delta-encode and write the final .kdi file
After closing, the index is static and immutable.
3. **Read phase (immutable)**: set operations,
Jaccard, Quorum, Contains, iteration. All of them streaming.
---
## On-disk format
### Finalized index
```
index_dir/
  metadata.toml
  set_0/
    part_0000.kdi
    part_0001.kdi
    ...
    part_{P-1}.kdi
  set_1/
    part_0000.kdi
    ...
  ...
  set_{N-1}/
    ...
```
### Temporary files during construction
```
index_dir/
  .build/
    set_0/
      part_0000.skm # super-kmers encoded at 2 bits/base
      part_0001.skm
      ...
    set_1/
      ...
```
The `.build/` directory is removed after Close().
### metadata.toml
```toml
id = "mon_index"
k = 31
m = 13
partitions = 1024
type = "KmerSetGroup" # or "KmerSet" (N=1)
size = 3 # number of sets (N)
sets_ids = ["genome_A", "genome_B", "genome_C"]
[user_metadata]
organism = "Triticum aestivum"
[sets_metadata]
# per-set metadata if needed
```
### .kdi file (Kmer Delta Index)
Binary format:
```
[magic: 4 bytes "KDI\x01"]
[count: uint64 little-endian] # number of k-mers in this partition
[first: uint64 little-endian] # first k-mer (absolute value)
[delta_1: varint] # arr[1] - arr[0]
[delta_2: varint] # arr[2] - arr[1]
...
[delta_{count-1}: varint] # arr[count-1] - arr[count-2]
```
Varint: unsigned encoding, 7 payload bits per byte, most significant
bit = continuation (identical to the protobuf varint).
Empty file (partition with no k-mer): magic + count=0.
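To make the .kdi layout concrete, here is a minimal in-memory round trip in Go. It is a sketch under assumptions, not the planned `KdiWriter`/`KdiReader`: the function names `EncodeKdi` and `DecodeKdi` are hypothetical, and the input is assumed to be sorted and deduplicated. It relies on `encoding/binary`, whose unsigned varint uses the same 7-bits-plus-continuation scheme as protobuf.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

var kdiMagic = []byte("KDI\x01")

// EncodeKdi serialises a sorted, deduplicated k-mer slice using the
// layout above: magic, count, first value, then varint deltas.
func EncodeKdi(kmers []uint64) []byte {
	var buf bytes.Buffer
	buf.Write(kdiMagic)
	var u64 [8]byte
	binary.LittleEndian.PutUint64(u64[:], uint64(len(kmers)))
	buf.Write(u64[:])
	if len(kmers) == 0 {
		return buf.Bytes() // empty partition: magic + count=0
	}
	binary.LittleEndian.PutUint64(u64[:], kmers[0])
	buf.Write(u64[:])
	var v [binary.MaxVarintLen64]byte
	for i := 1; i < len(kmers); i++ {
		n := binary.PutUvarint(v[:], kmers[i]-kmers[i-1])
		buf.Write(v[:n])
	}
	return buf.Bytes()
}

// DecodeKdi is the inverse of EncodeKdi.
func DecodeKdi(data []byte) ([]uint64, error) {
	r := bytes.NewReader(data)
	magic := make([]byte, len(kdiMagic))
	if _, err := io.ReadFull(r, magic); err != nil {
		return nil, err
	}
	if !bytes.Equal(magic, kdiMagic) {
		return nil, fmt.Errorf("bad magic %q", magic)
	}
	var count, prev uint64
	if err := binary.Read(r, binary.LittleEndian, &count); err != nil {
		return nil, err
	}
	kmers := make([]uint64, 0, count)
	if count == 0 {
		return kmers, nil
	}
	if err := binary.Read(r, binary.LittleEndian, &prev); err != nil {
		return nil, err
	}
	kmers = append(kmers, prev)
	for i := uint64(1); i < count; i++ {
		delta, err := binary.ReadUvarint(r)
		if err != nil {
			return nil, err
		}
		prev += delta // rebuild the absolute value from the delta
		kmers = append(kmers, prev)
	}
	return kmers, nil
}

func main() {
	in := []uint64{3, 10, 310, 1 << 40}
	data := EncodeKdi(in)
	out, err := DecodeKdi(data)
	fmt.Println(len(data), out, err)
}
```

Small deltas cost one byte each, which is where the compression over a flat 8 bytes per k-mer comes from; a streaming implementation would wrap a `bufio.Writer`/`bufio.Reader` around the partition file instead of a byte slice.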
### .skm file (temporary super-kmer file)
Binary format, a sequence of encoded super-kmers:
```
[len: uint16 little-endian] # super-kmer length in bases
[sequence: ceil(len/4) bytes] # sequence encoded at 2 bits/base, packed
...
```
**Compression relative to storing raw k-mers**:
A super-kmer of length L contains L-k+1 k-mers.
- Super-kmer storage: 2 + ceil(L/4) bytes
- Raw k-mer storage: (L-k+1) × 8 bytes
Example with k=31 and a typical super-kmer of L=50:
- Super-kmer: 2 + 13 = 15 bytes → encodes 20 k-mers
- Raw k-mers: 20 × 8 = 160 bytes
- **Compression factor: ~10×**
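The record layout and the 15-byte figure above can be checked with a short Go sketch. The base codes (A=0, C=1, G=2, T=3) and the bit order within a byte (first base in the two highest bits) are assumptions made for this example, since the text does not specify them; `EncodeSkmRecord` is a hypothetical name.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"strings"
)

// EncodeSkmRecord packs one super-kmer as [len: uint16 LE][2 bits/base].
// Assumed conventions: A=0, C=1, G=2, T=3, first base in the high bits.
func EncodeSkmRecord(seq string) ([]byte, error) {
	if len(seq) > 0xFFFF {
		return nil, fmt.Errorf("super-kmer too long: %d bases", len(seq))
	}
	out := make([]byte, 2+(len(seq)+3)/4) // 2 + ceil(len/4) bytes
	binary.LittleEndian.PutUint16(out, uint16(len(seq)))
	for i := 0; i < len(seq); i++ {
		var code byte
		switch seq[i] {
		case 'A', 'a':
			code = 0
		case 'C', 'c':
			code = 1
		case 'G', 'g':
			code = 2
		case 'T', 't':
			code = 3
		default:
			return nil, fmt.Errorf("invalid base %q at position %d", seq[i], i)
		}
		out[2+i/4] |= code << uint(6-2*(i%4))
	}
	return out, nil
}

func main() {
	const k = 31
	sk := strings.Repeat("ACGTA", 10) // a 50-base super-kmer
	rec, err := EncodeSkmRecord(sk)
	if err != nil {
		panic(err)
	}
	nKmers := len(sk) - k + 1
	fmt.Printf("record: %d bytes, raw k-mers: %d bytes\n", len(rec), nKmers*8)
}
```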
For a 10 Gbase genome (~10^10 raw k-mers):
- Raw k-mers: ~80 GB per temporary set
- Super-kmers: **~8 GB** per temporary set
With FrequencyFilter and 30× coverage:
- Raw k-mers: ~2.4 TB
- Super-kmers: **~240 GB**
---
## FrequencyFilter
FrequencyFilter is no longer a separate data type. It is a
**construction mode** of the builder. The result is a standard
KmerSetGroup.
### Principle
During construction, all super-kmers are written to the temporary
.skm files, duplicates included (every occurrence of every sequence
is written).
During Close(), for each partition:
1. Load all super-kmers of the partition
2. Extract all canonical k-mers into a []uint64 array
3. Sort the array
4. Scan linearly: identical k-mers are consecutive
5. Count the occurrences of each k-mer
6. If count >= minFreq → write to the final .kdi (once only)
7. Otherwise → skip
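Steps 3 to 7 amount to a sort followed by a run-length scan. The Go sketch below shows that core on an in-memory slice; the name `FilterByFrequency` is hypothetical, and the k-mer extraction of steps 1 and 2 is assumed to have happened already.

```go
package main

import (
	"fmt"
	"sort"
)

// FilterByFrequency sorts kmers in place, then keeps one copy of every
// value that occurs at least minFreq times. The result is sorted.
func FilterByFrequency(kmers []uint64, minFreq int) []uint64 {
	sort.Slice(kmers, func(i, j int) bool { return kmers[i] < kmers[j] })
	var kept []uint64
	for i := 0; i < len(kmers); {
		j := i
		for j < len(kmers) && kmers[j] == kmers[i] {
			j++ // identical k-mers are consecutive after sorting
		}
		if j-i >= minFreq {
			kept = append(kept, kmers[i]) // written once only
		}
		i = j
	}
	return kept
}

func main() {
	kmers := []uint64{5, 1, 5, 3, 1, 5}
	fmt.Println(FilterByFrequency(kmers, 2))
}
```

Because the output is already sorted and deduplicated, it can be fed directly to a delta-varint writer. With `minFreq = 1` the same function performs plain deduplication.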
### Sizing
For a 10 Gbase genome with 30× coverage:
- N_raw ≈ 3×10^11 raw k-mers
- Temporary .skm space ≈ 240 GB (super-kmer compressed)
- RAM per partition during Close():
  - With P=1024: ~3×10^8 k-mers/partition × 8 = **~2.4 GB**
  - With P=4096: ~7.3×10^7 k-mers/partition × 8 = **~600 MB**
The choice of P sets the trade-off between the number of files and the
RAM per partition.
### Without FrequencyFilter (simple deduplication)
For simple deduplication (each k-mer written once), the
builder can deduplicate in its RAM buffers before flushing.
This significantly reduces temporary space, because duplicates
within a single buffer (coming from closely related sequences) are
eliminated immediately.
---
## Target public API
### Structures
```go
// KmerSetGroup is the base entity.
// A KmerSet is a KmerSetGroup with Size() == 1.
type KmerSetGroup struct {
	// internal fields: path, k, m, P, N, metadata, state
}

// KmerSetGroupBuilder builds a mutable KmerSetGroup.
type KmerSetGroupBuilder struct {
	// internal fields: I/O buffers per partition and per set,
	// temporary .skm files, parameters (minFreq, etc.)
}
```
### Construction
```go
// NewKmerSetGroupBuilder creates a builder for a new KmerSetGroup.
//   directory: destination directory
//   k: k-mer size (1-31)
//   m: minimizer size (-1 for auto = ceil(k/2.5))
//   n: number of sets in the group
//   P: number of partitions (-1 for auto)
//   options: construction options (FrequencyFilter, etc.)
func NewKmerSetGroupBuilder(directory string, k, m, n, P int,
	options ...BuilderOption) (*KmerSetGroupBuilder, error)

// WithMinFrequency enables FrequencyFilter mode.
// Only k-mers seen >= minFreq times are kept in the final index.
// Super-kmers are written with their duplicates during
// construction; exact counting happens in Close().
func WithMinFrequency(minFreq int) BuilderOption

// AddSequence extracts the super-kmers of a sequence and writes them
// to the temporary partition files of the set given by setIndex.
func (b *KmerSetGroupBuilder) AddSequence(setIndex int, seq *obiseq.BioSequence)

// AddSuperKmer writes a super-kmer to the temporary file of its
// partition for the set given by setIndex.
func (b *KmerSetGroupBuilder) AddSuperKmer(setIndex int, sk SuperKmer)

// Close finalizes the construction:
//   - flush the write buffers
//   - for each partition of each set (parallelizable):
//     - load the super-kmers from the .skm
//     - extract the canonical k-mers
//     - sort, deduplicate (count if the frequency filter is on)
//     - delta-encode and write the .kdi
//   - write metadata.toml
//   - remove the .build/ directory
// Returns the read-only KmerSetGroup.
func (b *KmerSetGroupBuilder) Close() (*KmerSetGroup, error)
```
### Reading and operations
```go
// OpenKmerSetGroup opens a finalized index read-only.
func OpenKmerSetGroup(directory string) (*KmerSetGroup, error)

// --- Metadata (API unchanged) ---
func (ksg *KmerSetGroup) K() int
func (ksg *KmerSetGroup) M() int          // new: minimizer size
func (ksg *KmerSetGroup) Partitions() int // new: number of partitions
func (ksg *KmerSetGroup) Size() int
func (ksg *KmerSetGroup) Id() string
func (ksg *KmerSetGroup) SetId(id string)
func (ksg *KmerSetGroup) HasAttribute(key string) bool
func (ksg *KmerSetGroup) GetAttribute(key string) (interface{}, bool)
func (ksg *KmerSetGroup) SetAttribute(key string, value interface{})
// ... etc. (the whole current attribute API is kept)

// --- Set operations ---
// All of them produce a new singleton KmerSetGroup on disk.
// They operate partition by partition, streaming.
func (ksg *KmerSetGroup) Union(outputDir string) (*KmerSetGroup, error)
func (ksg *KmerSetGroup) Intersect(outputDir string) (*KmerSetGroup, error)
func (ksg *KmerSetGroup) Difference(outputDir string) (*KmerSetGroup, error)
func (ksg *KmerSetGroup) QuorumAtLeast(q int, outputDir string) (*KmerSetGroup, error)
func (ksg *KmerSetGroup) QuorumExactly(q int, outputDir string) (*KmerSetGroup, error)
func (ksg *KmerSetGroup) QuorumAtMost(q int, outputDir string) (*KmerSetGroup, error)

// --- Operations between two KmerSetGroups ---
// Both groups must have the same k, m, P.
func (ksg *KmerSetGroup) UnionWith(other *KmerSetGroup, outputDir string) (*KmerSetGroup, error)
func (ksg *KmerSetGroup) IntersectWith(other *KmerSetGroup, outputDir string) (*KmerSetGroup, error)

// --- Metrics (result in memory, no disk output) ---
func (ksg *KmerSetGroup) JaccardDistanceMatrix() *obidist.DistMatrix
func (ksg *KmerSetGroup) JaccardSimilarityMatrix() *obidist.DistMatrix

// --- Individual access ---
func (ksg *KmerSetGroup) Len(setIndex ...int) uint64
func (ksg *KmerSetGroup) Contains(setIndex int, kmer uint64) bool
func (ksg *KmerSetGroup) Iterator(setIndex int) iter.Seq[uint64]
```
---
## Internal implementation
### Low-level primitives
**`varint.go`**: encode/decode uint64 varints
```go
func EncodeVarint(w io.Writer, v uint64) (int, error)
func DecodeVarint(r io.Reader) (uint64, error)
```
### .kdi format
**`kdi_writer.go`**: writes a .kdi file from a sorted stream of
uint64 values (delta-encoding on the fly).
```go
type KdiWriter struct { ... }
func NewKdiWriter(path string) (*KdiWriter, error)
func (w *KdiWriter) Write(kmer uint64) error
func (w *KdiWriter) Close() error
```
**`kdi_reader.go`**: streaming read of a .kdi file (decodes the
deltas on the fly).
```go
type KdiReader struct { ... }
func NewKdiReader(path string) (*KdiReader, error)
func (r *KdiReader) Next() (uint64, bool)
func (r *KdiReader) Count() uint64
func (r *KdiReader) Close() error
```
### .skm format
**`skm_writer.go`**: writes super-kmers encoded at 2 bits/base.
```go
type SkmWriter struct { ... }
func NewSkmWriter(path string) (*SkmWriter, error)
func (w *SkmWriter) Write(sk SuperKmer) error
func (w *SkmWriter) Close() error
```
**`skm_reader.go`**: reads super-kmers from a .skm file.
```go
type SkmReader struct { ... }
func NewSkmReader(path string) (*SkmReader, error)
func (r *SkmReader) Next() (SuperKmer, bool)
func (r *SkmReader) Close() error
```
### Streaming merge
**`kdi_merge.go`**: k-way merge of several sorted streams.
```go
type KWayMerge struct { ... }
func NewKWayMerge(readers []*KdiReader) *KWayMerge
func (m *KWayMerge) Next() (kmer uint64, count int, ok bool)
func (m *KWayMerge) Close() error
```
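One way to realise the k-way merge is a min-heap of cursors, as sketched below. To stay self-contained the sketch merges in-memory sorted slices instead of `KdiReader` streams, and the type `SliceMerge` is hypothetical. It assumes each input is strictly increasing, so the returned `count` is the number of sets containing the k-mer, which is what union, intersection and quorum tests need.

```go
package main

import (
	"container/heap"
	"fmt"
)

// cursor walks one sorted input.
type cursor struct {
	data []uint64
	pos  int
}

// cursorHeap orders cursors by their current value.
type cursorHeap []*cursor

func (h cursorHeap) Len() int           { return len(h) }
func (h cursorHeap) Less(i, j int) bool { return h[i].data[h[i].pos] < h[j].data[h[j].pos] }
func (h cursorHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *cursorHeap) Push(x any)        { *h = append(*h, x.(*cursor)) }
func (h *cursorHeap) Pop() any {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// SliceMerge merges sorted, duplicate-free slices.
type SliceMerge struct{ h cursorHeap }

func NewSliceMerge(inputs [][]uint64) *SliceMerge {
	m := &SliceMerge{}
	for _, in := range inputs {
		if len(in) > 0 {
			m.h = append(m.h, &cursor{data: in})
		}
	}
	heap.Init(&m.h)
	return m
}

// Next returns the next distinct k-mer and the number of inputs holding it.
func (m *SliceMerge) Next() (kmer uint64, count int, ok bool) {
	if m.h.Len() == 0 {
		return 0, 0, false
	}
	kmer = m.h[0].data[m.h[0].pos]
	for m.h.Len() > 0 && m.h[0].data[m.h[0].pos] == kmer {
		c := m.h[0]
		count++
		c.pos++
		if c.pos < len(c.data) {
			heap.Fix(&m.h, 0) // cursor moved on: restore heap order
		} else {
			heap.Pop(&m.h) // cursor exhausted: drop it
		}
	}
	return kmer, count, true
}

func main() {
	m := NewSliceMerge([][]uint64{{1, 3, 5}, {3, 5, 7}, {5}})
	for k, c, ok := m.Next(); ok; k, c, ok = m.Next() {
		fmt.Println(k, c)
	}
}
```

With N inputs, union keeps every k-mer, intersection keeps those with `count == N`, and `QuorumAtLeast(q)` keeps those with `count >= q`.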
### Builder
**`kmer_set_builder.go`**: construction of a KmerSetGroup.
The builder manages:
- P × N buffered .skm writers (one per partition × set)
- At closing time: processing partition by partition
(parallelizable across several cores)
Memory management of the write buffers:
- Each SkmWriter has an I/O buffer of reasonable size (~64 KB)
- With P=1024 and N=1: 1024 × 64 KB = 64 MB of buffers
- With P=1024 and N=10: 640 MB of buffers
- No k-mer buffer in RAM: everything is written to disk
immediately through the super-kmers
RAM during Close() (sorting one partition):
- Load the super-kmers → extract the k-mers → []uint64 array
- With P=1024 and 10^10 k-mers/set: ~10^7 k-mers/partition × 8 = ~80 MB
- With FrequencyFilter (duplicates) and 30× coverage:
~3×10^8/partition × 8 = ~2.4 GB (adjustable through P)
### Disk-based structure
**`kmer_set_disk.go`**: read-only KmerSetGroup.
**`kmer_set_disk_ops.go`**: set operations by streaming merge,
partition by partition.
---
## What changes compared with the current API
### Semantic changes
| Aspect | Old (roaring) | New (disk-based) |
|---|---|---|
| Storage | In memory (roaring64.Bitmap) | On disk (.kdi delta-encoded) |
| Temporary data during construction | In memory | Super-kmers on disk (.skm, 2 bits/base) |
| Mutability | Mutable at any time | Builder → Close() → immutable |
| Set operations | Result in memory | Result on disk (new directory) |
| Contains | O(1) roaring lookup | O(log n) binary search on .kdi |
| Iteration | Roaring iterator | Streaming delta-varint decoding |
### API kept (identical or near-identical signatures)
- `KmerSetGroup`: `K()`, `Size()`, `Id()`, `SetId()`
- The whole attribute API
- `JaccardDistanceMatrix()`, `JaccardSimilarityMatrix()`
- `Len()`, `Contains()`
### API modified
- `Union()`, `Intersect()`, etc.: `outputDir` parameter added
- `QuorumAtLeast()`, etc.: same
- Construction: `NewKmerSetGroupBuilder()` + `AddSequence()` + `Close()`
instead of direct manipulation
### API removed
- `KmerSet` as a distinct type (replaced by a singleton KmerSetGroup)
- `FrequencyFilter` as a distinct type (now a Builder mode)
- Any direct access to `roaring64.Bitmap`
- `KmerSet.Copy()` (copy the directory instead)
- `KmerSet.Union()`, `.Intersect()`, `.Difference()` (they become methods
of KmerSetGroup taking outputDir)
---
## Files to create / modify in pkg/obikmer
### New files
| File | Content |
|---|---|
| `varint.go` | Encode/Decode uint64 varints |
| `kdi_writer.go` | Writer for .kdi files (delta-encoded) |
| `kdi_reader.go` | Streaming reader for .kdi files |
| `skm_writer.go` | Writer for super-kmers encoded at 2 bits/base |
| `skm_reader.go` | Reader for super-kmers from .skm files |
| `kdi_merge.go` | Streaming k-way merge of sorted streams |
| `kmer_set_builder.go` | KmerSetGroupBuilder (construction) |
| `kmer_set_disk.go` | Disk-based KmerSetGroup (reading, metadata) |
| `kmer_set_disk_ops.go` | Streaming set operations |
### Files to delete
| File | Reason |
|---|---|
| `kmer_set.go` | Replaced by kmer_set_disk.go |
| `kmer_set_group.go` | Same |
| `kmer_set_attributes.go` | Merged into kmer_set_disk.go |
| `kmer_set_persistence.go` | The index is natively on disk |
| `kmer_set_group_quorum.go` | Merged into kmer_set_disk_ops.go |
| `frequency_filter.go` | Builder mode, no separate type any more |
| `kmer_index_builder.go` | Replaced by kmer_set_builder.go |
### Files kept as they are
| File | Content |
|---|---|
| `encodekmer.go` | K-mer encoding/decoding |
| `superkmer.go` | SuperKmer structure |
| `superkmer_iter.go` | IterSuperKmers, IterCanonicalKmers |
| `encodefourmer.go` | Encode4mer |
| `counting.go` | Count4Mer |
| `kmermap.go` | KmerMap (independent usage) |
| `debruijn.go` | De Bruijn graph |
---
## Implementation order
1. `varint.go` + tests
2. `skm_writer.go` + `skm_reader.go` + tests
3. `kdi_writer.go` + `kdi_reader.go` + tests
4. `kdi_merge.go` + tests
5. `kmer_set_builder.go` + tests (construction + Close)
6. `kmer_set_disk.go` (structure, metadata, Open)
7. `kmer_set_disk_ops.go` + tests (Union, Intersect, Quorum, Jaccard)
8. Adapt `pkg/obitools/obikindex/`
9. Delete the old roaring files
10. Adapt the existing tests
Each step can be tested independently.
---
## External dependencies
### Removed
- `github.com/RoaringBitmap/roaring`: no longer needed for the
k-mer indexes (check whether other packages still use it)
### Added
- None. Varint, delta-encoding, merge, 2 bits/base encoding:
everything can be implemented with standard Go.

View File

@@ -0,0 +1,213 @@
# K-mer index for large genomes
## Context and objectives
### Use case
- Indexing long k-mers (k=31) for large genomes (< 10 GB per genome)
- Number of genomes: several dozen to a few hundred
- Parallel indexing
- Disk storage
- Genomes can be added, but an existing genome cannot be modified
### Target queries
- **Presence/absence** of a k-mer in a genome
- **Intersection** between genomes
- **Distances**: Jaccard (presence/absence) and possibly Bray-Curtis (counts)
### Available resources
- 128 GB of RAM
- Disk storage
---
## Volume estimates
### Per genome
- **10 GB of sequence** → ~10¹⁰ raw (overlapping) k-mers
- **After deduplication**: typically 10-50% unique k-mers → **~1-5 × 10⁹ distinct k-mers**
### Theoretical space
- **k=31** → 62 bits → ~4.6 × 10¹⁸ possible k-mers
- A direct-indexing table is impossible
---
## Distance metrics
### Presence/absence (binary)
- **Jaccard**: |A ∩ B| / |A ∪ B|
- **Sørensen-Dice**: 2|A ∩ B| / (|A| + |B|)
### Counts (abundance)
- **Bray-Curtis**: 1 - (2 × Σ min(aᵢ, bᵢ)) / (Σ aᵢ + Σ bᵢ)
Note: Bray-Curtis requires storing the counts, which significantly increases the size of the index.
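For exact sorted k-mer sets, both binary metrics follow from a single linear merge that counts the intersection. The Go sketch below illustrates this; the function names are hypothetical, the inputs are assumed sorted and duplicate-free, and returning 0 for two empty sets is a convention chosen for the example.

```go
package main

import "fmt"

// intersectionSize counts the values common to two sorted,
// duplicate-free slices with one linear merge.
func intersectionSize(a, b []uint64) int {
	i, j, inter := 0, 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			inter++
			i++
			j++
		}
	}
	return inter
}

// Jaccard returns |A ∩ B| / |A ∪ B| (0 when both sets are empty).
func Jaccard(a, b []uint64) float64 {
	inter := intersectionSize(a, b)
	union := len(a) + len(b) - inter
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

// SorensenDice returns 2|A ∩ B| / (|A| + |B|) (0 when both are empty).
func SorensenDice(a, b []uint64) float64 {
	if len(a)+len(b) == 0 {
		return 0
	}
	return 2 * float64(intersectionSize(a, b)) / float64(len(a)+len(b))
}

func main() {
	a := []uint64{1, 2, 3, 4}
	b := []uint64{3, 4, 5, 6}
	fmt.Printf("Jaccard=%.3f Dice=%.3f\n", Jaccard(a, b), SorensenDice(a, b))
}
```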
---
## Indexing options
### Option 1: one Bloom filter per genome
**Principle**: a probabilistic structure for membership testing.
**Advantages:**
- Very compact: ~10 bits/element for an FPR of ~1%
- Fast, streaming construction
- Easy to serialize/deserialize
- Intersection and Jaccard can be estimated with analytical formulas
**Drawbacks:**
- False positives (no false negatives)
- Approximate distances
**Estimated size**: 1-6 GB per genome (depending on the target FPR)
#### Sizing Bloom filters
The false-positive rate is approximately
$$
\mathrm{FPR} \approx \left(1 - e^{-h n / m}\right)^{h}
$$
where n is the number of elements, m the number of bits, and h the number of hash functions (written h here to avoid a clash with the k-mer size k).
| Bits/element | Optimal FPR | h (hash functions) |
|--------------|-------------|---------------------|
| 8 | ~2% | 5-6 |
| 10 | ~1% | 7 |
| 12 | ~0.3% | 8 |
| 16 | ~0.05% | 11 |
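The sizing table can be recomputed from the standard Bloom filter approximation, using the optimal number of hash functions (m/n)·ln 2 rounded to an integer. The Go sketch below does that; `OptimalHashes` and `BloomFPR` are hypothetical helper names, and the formula is the usual approximation rather than an exact rate.

```go
package main

import (
	"fmt"
	"math"
)

// OptimalHashes returns the integer number of hash functions closest
// to the optimum (m/n) * ln 2 for the given bits per element.
func OptimalHashes(bitsPerElement float64) int {
	h := int(math.Round(bitsPerElement * math.Ln2))
	if h < 1 {
		h = 1
	}
	return h
}

// BloomFPR evaluates (1 - e^(-h*n/m))^h with m/n = bitsPerElement.
func BloomFPR(bitsPerElement float64, h int) float64 {
	return math.Pow(1-math.Exp(-float64(h)/bitsPerElement), float64(h))
}

func main() {
	for _, bits := range []float64{8, 10, 12, 16} {
		h := OptimalHashes(bits)
		fmt.Printf("%2.0f bits/element: h=%d, FPR=%.4f%%\n",
			bits, h, 100*BloomFPR(bits, h))
	}
}
```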
### Option 2: sorted set of k-mers
**Principle**: store the k-mers (uint64) sorted, with optional compression.
**Advantages:**
- Exact (no false positives)
- Intersection/union by merge in O(n+m)
- Effective compression (delta encoding on sorted k-mers)
**Drawbacks:**
- Bulkier: 8 bytes/k-mer
- Slower construction (sorting required)
**Estimated size**: 8-40 GB per genome (uncompressed)
### Option 3: MPHF (Minimal Perfect Hash Function)
**Principle**: a minimal perfect hash function for the k-mers that are present.
**Advantages:**
- Very compact: ~3-4 bits/element
- O(1) lookup
- Exact for the k-mers that are present
**Drawbacks:**
- Costly construction (several passes)
- Static (no k-mers can be added after construction)
- Cannot tell "absent" from "never seen" without an auxiliary structure
### Option 4: hybrid MPHF + Bloom filter
- MPHF for a compact mapping of the k-mers that are present
- Bloom filter for pre-filtering the absent ones
---
## Optimization: indexing (k-2)-mers to answer k-mer queries
### Principle
Instead of indexing the 31-mers directly in a Bloom filter, we index the 29-mers. To test the presence of a 31-mer, we check that the **three 29-mers** it contains are present:
- positions 0-28
- positions 1-29
- positions 2-30
### Probabilistic analysis
If the Bloom filter has an FPR of p for a single 29-mer, the effective FPR for a 31-mer becomes **p³**, assuming the three false positives are independent (all three queries must be false positives).
| 29-mer FPR | Effective 31-mer FPR |
|------------|---------------------|
| 10% | 0.1% |
| 5% | 0.0125% |
| 1% | 0.0001% |
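The triple query can be sketched in Go on 2-bit encoded k-mers. The sketch assumes the 31-mer occupies the low 62 bits of a uint64 with the first base in the most significant pair, and it ignores canonicalization of the 29-mers; `Sub29` and `Contains31` are hypothetical names, and `has29` stands in for the Bloom filter query (any membership test works for the demonstration).

```go
package main

import "fmt"

const mask29 = (uint64(1) << 58) - 1 // 29 bases × 2 bits

// Sub29 returns the three 29-mers (positions 0-28, 1-29, 2-30) of a
// 31-mer stored on 62 bits, first base in the most significant pair.
func Sub29(kmer uint64) [3]uint64 {
	var subs [3]uint64
	for i := 0; i < 3; i++ {
		subs[i] = (kmer >> uint(2*(2-i))) & mask29
	}
	return subs
}

// Contains31 reports a 31-mer as present only if all three of its
// 29-mers pass the membership test.
func Contains31(kmer uint64, has29 func(uint64) bool) bool {
	for _, s := range Sub29(kmer) {
		if !has29(s) {
			return false
		}
	}
	return true
}

func main() {
	index := map[uint64]bool{} // stand-in for the 29-mer Bloom filter
	kmer := uint64(0x0123456789ABCDEF)
	for _, s := range Sub29(kmer) {
		index[s] = true
	}
	has := func(x uint64) bool { return index[x] }
	fmt.Println(Contains31(kmer, has), Contains31(kmer+1, has))
}
```

Note that even with an exact 29-mer set, three present 29-mers do not strictly guarantee that the 31-mer itself occurs in the genome, so the p³ figure covers only the Bloom filter's own false positives.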
### Advantages
1. **Fewer elements to store**: a genome contains fewer distinct 29-mers than distinct 31-mers (two different 31-mers can share the same 29-mer)
2. **Drastically reduced FPR**: FPR³ with only 3 queries
3. **More compact index**: fewer bits per element can be used (a higher FPR is acceptable on the 29-mer) while still obtaining a very low FPR on the 31-mer
### Trade-off
A Bloom filter at **5-6 bits/element** for the 29-mers has a per-query FPR of roughly 6-9%, which gives an effective FPR of about 0.02-0.08% for the 31-mers. A direct 31-mer filter would need about 15-18 bits/element to reach the same rate, so the index is roughly **2-3× more compact** than the direct approach at equal quality.
**Cost**: 3× more queries per lookup (but Bloom filter queries are very fast).
---
## Speeding up distance computations: MinHash
### Principle
Precompute a compact "signature" (sketch) of each genome that allows Jaccard to be estimated quickly without loading the full indexes.
### Advantages
- Distance matrix between 100+ genomes in a few seconds
- Fixed-size signature (e.g. 1000-10000 hash values) whatever the genome
- Minimal storage
### Usage
1. Construction: one pass over the k-mers of each genome
2. Distance: comparison of the sketches in O(sketch size)
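One common variant is the bottom-s sketch: keep the s smallest hash values of each genome, then estimate Jaccard from the s smallest values of the merged sketches. The Go sketch below illustrates it under assumptions: the mixing function is the splitmix64 finalizer, chosen here only as a convenient 64-bit hash, and the names `Sketch` and `EstimateJaccard` are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// mix64 is the splitmix64 finalizer, used here as a simple 64-bit hash.
func mix64(x uint64) uint64 {
	x ^= x >> 30
	x *= 0xbf58476d1ce4e5b9
	x ^= x >> 27
	x *= 0x94d049bb133111eb
	x ^= x >> 31
	return x
}

// Sketch returns the s smallest distinct hash values, sorted.
func Sketch(kmers []uint64, s int) []uint64 {
	hashes := make([]uint64, len(kmers))
	for i, km := range kmers {
		hashes[i] = mix64(km)
	}
	sort.Slice(hashes, func(i, j int) bool { return hashes[i] < hashes[j] })
	out := hashes[:0]
	for _, h := range hashes {
		if len(out) == 0 || h != out[len(out)-1] {
			out = append(out, h) // drop duplicate hashes
		}
	}
	if len(out) > s {
		out = out[:s]
	}
	return out
}

// EstimateJaccard walks the s smallest values of the union of two
// sketches and returns the fraction found in both.
func EstimateJaccard(a, b []uint64, s int) float64 {
	i, j, seen, shared := 0, 0, 0, 0
	for seen < s && i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			shared++
			i++
			j++
		}
		seen++
	}
	if seen == 0 {
		return 0
	}
	return float64(shared) / float64(seen)
}

func main() {
	var a, b []uint64
	for i := uint64(0); i < 10000; i++ {
		a = append(a, i)
		b = append(b, i+5000) // half of b overlaps a; true Jaccard = 1/3
	}
	fmt.Printf("estimated Jaccard: %.3f\n",
		EstimateJaccard(Sketch(a, 1000), Sketch(b, 1000), 1000))
}
```

The estimate is statistical, so it fluctuates around the true value; larger sketches reduce the variance.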
---
## Recommended architecture
### For presence/absence + Jaccard
1. **Main index**: Bloom filter of (k-2)-mers with the optimization described above
   - Compact (~3-5 GB per genome)
   - Very low FPR for k-mers thanks to the triple queries
2. **MinHash sketches**: for fast computation of distances between genomes
   - A few KB per genome
   - Allows fast exploration of the distance matrix
### For counting + Bray-Curtis
1. **Main index**: sorted k-mers + counts
   - uint64 (k-mer) + uint8/uint16 (count)
   - Delta compression possible
   - Larger but exact
2. **Sketches**: MinHash variants for weighted data (e.g. weighted MinHash / BagMinHash)
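With sorted k-mer/count arrays, the Bray-Curtis dissimilarity reduces to a single merge pass. The sketch below assumes the usual definition BC = 1 - 2·Σ min(a, b) / (Σ a + Σ b); the `kmerCount` type and the function names are hypothetical, not the project's storage format.

```go
package main

import "fmt"

// kmerCount pairs a 2-bit encoded k-mer with its abundance (assumed layout).
type kmerCount struct {
	kmer  uint64
	count uint16
}

// minU16 returns the smaller of two counts.
func minU16(a, b uint16) uint16 {
	if a < b {
		return a
	}
	return b
}

// brayCurtis merges two arrays sorted by k-mer value and returns
// 1 - 2*sum(min counts) / (sum of all counts).
func brayCurtis(a, b []kmerCount) float64 {
	var shared, total uint64
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i].kmer == b[j].kmer:
			shared += uint64(minU16(a[i].count, b[j].count))
			total += uint64(a[i].count) + uint64(b[j].count)
			i++
			j++
		case a[i].kmer < b[j].kmer:
			total += uint64(a[i].count)
			i++
		default:
			total += uint64(b[j].count)
			j++
		}
	}
	// K-mers remaining in only one of the two arrays.
	for ; i < len(a); i++ {
		total += uint64(a[i].count)
	}
	for ; j < len(b); j++ {
		total += uint64(b[j].count)
	}
	if total == 0 {
		return 0
	}
	return 1 - 2*float64(shared)/float64(total)
}

func main() {
	a := []kmerCount{{1, 2}, {3, 4}}
	b := []kmerCount{{1, 1}, {2, 5}, {3, 4}}
	fmt.Println(brayCurtis(a, b))
}
```

Because both arrays are read sequentially, the same loop works on delta-compressed files decoded on the fly.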
---
## Next steps
1. Implement a Bloom filter optimized for k-mers
2. Implement the (k-2)-mer → k-mer optimization
3. Implement MinHash for the sketches
4. Define the on-disk serialization format
5. Benchmark on real genomes


@@ -0,0 +1,3 @@
Read the file [@canonical-super-kmer-strategy.md](file:///Users/coissac/Sync/travail/__MOI__/GO/obitools4/blackboard/Prospective/canonical-super-kmer-strategy.md).
In the file [@superkmer_iter.go](file:///Users/coissac/Sync/travail/__MOI__/GO/obitools4/pkg/obikmer/superkmer_iter.go), implement a new function IterCanonicalSuperKmers, modeled on IterSuperKmers, that implements the notion of canonical super k-mers presented in the architecture document.


@@ -0,0 +1,735 @@
# Architecture of an OBITools command
## Overview
An OBITools command follows a modular, standardized architecture that clearly separates responsibilities between:
- The command package in `pkg/obitools/<command_name>/`
- The executable in `cmd/obitools/<command_name>/`
This architecture promotes code reuse, testability, and consistency across the commands of the OBITools suite.
## Project structure
```
obitools4/
├── pkg/obitools/
│   ├── obiconvert/             # Conversion command (base for all others)
│   │   ├── obiconvert.go       # Empty functions (no implementation)
│   │   ├── options.go          # CLI option definitions
│   │   ├── sequence_reader.go  # Sequence reading
│   │   └── sequence_writer.go  # Sequence writing
│   ├── obiuniq/                # Dereplication command
│   │   ├── obiuniq.go          # (empty file)
│   │   ├── options.go          # obiuniq-specific options
│   │   └── unique.go           # Processing implementation
│   ├── obipairing/             # Paired-end read assembly
│   ├── obisummary/             # Sequence file summary
│   └── obimicrosat/            # Microsatellite detection
└── cmd/obitools/
    ├── obiconvert/
    │   └── main.go             # Command entry point
    ├── obiuniq/
    │   └── main.go
    ├── obipairing/
    │   └── main.go
    ├── obisummary/
    │   └── main.go
    └── obimicrosat/
        └── main.go
```
## Architecture components
### 1. Package `pkg/obitools/<command>/`
Each command has its own package in `pkg/obitools/` containing the full implementation of its business logic. The package is split into several files:
#### a) `options.go` - CLI option management
This file defines:
- The private **global variables** (prefixed with `_`) that store the option values
- The **`OptionSet()`** function, which configures all the options of the command
- The **`CLI*()`** functions, which return the option values (getters)
- The **`Set*()`** functions, which set the options programmatically (setters)
**Example (obiuniq/options.go):**
```go
package obiuniq

import (
	"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
	"github.com/DavidGamba/go-getoptions"
)

// Private global variables storing the option values
var _StatsOn = make([]string, 0, 10)
var _Keys = make([]string, 0, 10)
var _InMemory = false
var _chunks = 100

// Configuration of the command-specific options
func UniqueOptionSet(options *getoptions.GetOpt) {
	options.StringSliceVar(&_StatsOn, "merge", 1, 1,
		options.Alias("m"),
		options.ArgName("KEY"),
		options.Description("Adds a merged attribute..."))
	options.BoolVar(&_InMemory, "in-memory", _InMemory,
		options.Description("Use memory instead of disk..."))
	options.IntVar(&_chunks, "chunk-count", _chunks,
		options.Description("In how many chunks..."))
}

// OptionSet combines the base options with the command-specific ones
func OptionSet(options *getoptions.GetOpt) {
	obiconvert.OptionSet(false)(options) // Base options
	UniqueOptionSet(options)             // Command-specific options
}

// Getters giving access to the option values
func CLIStatsOn() []string {
	return _StatsOn
}

func CLIUniqueInMemory() bool {
	return _InMemory
}

// Setters to define the options programmatically
func SetUniqueInMemory(inMemory bool) {
	_InMemory = inMemory
}
```
**Naming convention:**
- Private variables: `_NomOption` (underscore prefix)
- Getters: `CLINomOption()` (CLI prefix)
- Setters: `SetNomOption()` (Set prefix)
#### b) Implementation file(s)
One or more files containing the business logic of the command:
**Example (obiuniq/unique.go):**
```go
package obiuniq

import (
	"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obichunk"
	"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
	log "github.com/sirupsen/logrus"
)

// Main CLI function orchestrating the processing
func CLIUnique(sequences obiiter.IBioSequence) obiiter.IBioSequence {
	// Retrieve the options through the CLI*() getters
	options := make([]obichunk.WithOption, 0, 30)
	options = append(options,
		obichunk.OptionBatchCount(CLINumberOfChunks()),
	)
	if CLIUniqueInMemory() {
		options = append(options, obichunk.OptionSortOnMemory())
	} else {
		options = append(options, obichunk.OptionSortOnDisk())
	}
	// Call the function that does the actual processing
	iUnique, err := obichunk.IUniqueSequence(sequences, options...)
	if err != nil {
		log.Fatal(err)
	}
	return iUnique
}
```
**Other implementation examples:**
- **obimicrosat/microsat.go**: contains `MakeMicrosatWorker()` and `CLIAnnotateMicrosat()`
- **obisummary/obisummary.go**: contains `ISummary()` and the data structures
#### c) Utility files (optional)
Some commands have additional files for specific features.
**Example (obipairing/options.go):**
```go
// Special function creating an iterator over paired sequences
func CLIPairedSequence() (obiiter.IBioSequence, error) {
	forward, err := obiconvert.CLIReadBioSequences(_ForwardFile)
	if err != nil {
		return obiiter.NilIBioSequence, err
	}
	reverse, err := obiconvert.CLIReadBioSequences(_ReverseFile)
	if err != nil {
		return obiiter.NilIBioSequence, err
	}
	paired := forward.PairTo(reverse)
	return paired, nil
}
```
### 2. Package `obiconvert` - The common base
The `obiconvert` package is special because it provides the base functionality used by all the other commands:
#### Features provided:
1. **Sequence reading** (`sequence_reader.go`)
   - `CLIReadBioSequences()`: reads from files or stdin
   - Support for multiple formats (FASTA, FASTQ, EMBL, GenBank, etc.)
   - Handling of multiple files
   - Optional progress bar
2. **Sequence writing** (`sequence_writer.go`)
   - `CLIWriteBioSequences()`: writes to files or stdout
   - Support for multiple formats
   - Handling of paired reads
   - Optional compression
3. **Common options** (`options.go`)
   - Input options (format, skip, etc.)
   - Output options (format, file, compression)
   - Mode options (progress bar, etc.)
#### Use by the other commands:
All commands include the `obiconvert` options through:
```go
func OptionSet(options *getoptions.GetOpt) {
	obiconvert.OptionSet(false)(options) // false = no paired files
	MaCommandeOptionSet(options)         // Command-specific options
}
```
### 3. Executable `cmd/obitools/<command>/main.go`
The `main.go` file of each command is deliberately **minimal** and always follows the same pattern:
```go
package main

import (
	"os"

	"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obidefault"
	"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
	"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/macommande"
	"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
	"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
)

func main() {
	// 1. Optional configuration of default parameters
	obidefault.SetBatchSize(10)
	// 2. Generation of the option parser
	optionParser := obioptions.GenerateOptionParser(
		"macommande",                 // Command name
		"description de la commande", // Description
		macommande.OptionSet)         // Option configuration function
	// 3. Argument parsing
	_, args := optionParser(os.Args)
	// 4. Reading of the input sequences
	sequences, err := obiconvert.CLIReadBioSequences(args...)
	obiconvert.OpenSequenceDataErrorMessage(args, err)
	// 5. Command-specific processing
	resultat := macommande.CLITraitement(sequences)
	// 6. Writing of the results
	obiconvert.CLIWriteBioSequences(resultat, true)
	// 7. Wait for the end of the pipeline
	obiutils.WaitForLastPipe()
}
```
## Architectural patterns
### Pattern 1: Sequence processing pipeline
Most commands follow this pattern:
```
Read → Process → Write
```
**Examples:**
- **obiconvert**: Read → Write (format conversion)
- **obiuniq**: Read → Dereplicate → Write
- **obimicrosat**: Read → Annotate → Filter → Write
### Pattern 2: Processing with multiple inputs
Some commands accept several input files:
**obipairing**:
```
Read forward + Read reverse → Pairing → Assembly → Write
```
### Pattern 3: Processing without sequence output
**obisummary**: produces a JSON/YAML summary instead of sequences
```go
func main() {
	// ... option parsing and reading ...
	summary := obisummary.ISummary(fs, obisummary.CLIMapSummary())
	// Formatting and direct output
	if obisummary.CLIOutFormat() == "json" {
		output, _ := json.MarshalIndent(summary, "", " ")
		fmt.Print(string(output))
	} else {
		output, _ := yaml.Marshal(summary)
		fmt.Print(string(output))
	}
}
```
### Pattern 4: Use of workers
Commands that transform sequences often use the worker pattern:
```go
// Create a worker
worker := MakeMicrosatWorker(
	CLIMinUnitLength(),
	CLIMaxUnitLength(),
	// ... other parameters
)
// Apply the worker to the iterator
newIter := iterator.MakeIWorker(
	worker,
	false,                        // merge results
	obidefault.ParallelWorkers(), // parallelization
)
```
## Steps to implement a new command
### Step 1: Create the package in `pkg/obitools/`
```bash
mkdir -p pkg/obitools/macommande
```
### Step 2: Create `options.go`
```go
package macommande

import (
	"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
	"github.com/DavidGamba/go-getoptions"
)

// Private variables for the options
var _MonOption = "valeur_par_defaut"

// Configuration of the command-specific options
func MaCommandeOptionSet(options *getoptions.GetOpt) {
	options.StringVar(&_MonOption, "mon-option", _MonOption,
		options.Alias("o"),
		options.Description("Description de l'option"))
}

// OptionSet combines the base options with the command-specific ones
func OptionSet(options *getoptions.GetOpt) {
	obiconvert.OptionSet(false)(options) // false when there are no paired files
	MaCommandeOptionSet(options)
}

// Getters
func CLIMonOption() string {
	return _MonOption
}

// Setters
func SetMonOption(value string) {
	_MonOption = value
}
```
### Step 3: Create the implementation file
Create `macommande.go` (or a more descriptive name):
```go
package macommande

import (
	"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
)

// Main processing function
func CLIMaCommande(sequences obiiter.IBioSequence) obiiter.IBioSequence {
	// Retrieve the options
	option := CLIMonOption()
	_ = option // placeholder: use the option in the processing below
	// Implementation of the processing
	// ...
	resultat := sequences // placeholder: returns the input unchanged
	return resultat
}
```
### Step 4: Create the executable in `cmd/obitools/`
```bash
mkdir -p cmd/obitools/macommande
```
Create `main.go`:
```go
package main

import (
	"os"

	"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
	"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/macommande"
	"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
	"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
)

func main() {
	// Option parser
	optionParser := obioptions.GenerateOptionParser(
		"macommande",
		"Description courte de ma commande",
		macommande.OptionSet)
	_, args := optionParser(os.Args)
	// Reading
	sequences, err := obiconvert.CLIReadBioSequences(args...)
	obiconvert.OpenSequenceDataErrorMessage(args, err)
	// Processing
	resultat := macommande.CLIMaCommande(sequences)
	// Writing
	obiconvert.CLIWriteBioSequences(resultat, true)
	// Waiting
	obiutils.WaitForLastPipe()
}
```
### Step 5: Optional configuration
In `main.go`, before option parsing, you can configure:
```go
// Sequence batch size
obidefault.SetBatchSize(10)
// Number of read workers (strict)
obidefault.SetStrictReadWorker(2)
// Number of write workers
obidefault.SetStrictWriteWorker(2)
// Disable the reading of qualities
obidefault.SetReadQualities(false)
```
### Step 6: Error handling
Use the utility functions to get consistent error messages:
```go
// For file opening errors
obiconvert.OpenSequenceDataErrorMessage(args, err)
// For general errors
if err != nil {
	log.Errorf("Message d'erreur: %v", err)
	os.Exit(1)
}
```
### Step 7: Tests and debugging (optional)
Comments in the code show how to enable profiling:
```go
// go tool pprof -http=":8000" ./macommande ./cpu.pprof
// f, err := os.Create("cpu.pprof")
// if err != nil {
// log.Fatal(err)
// }
// pprof.StartCPUProfile(f)
// defer pprof.StopCPUProfile()
// go tool trace cpu.trace
// ftrace, err := os.Create("cpu.trace")
// if err != nil {
// log.Fatal(err)
// }
// trace.Start(ftrace)
// defer trace.Stop()
```
## Observed best practices
### 1. Separation of responsibilities
- **`main.go`**: minimal orchestration
- **`options.go`**: option definition and management
- **Implementation files**: business logic
### 2. Consistent naming convention
- Option variables: `_NomOption`
- CLI getters: `CLINomOption()`
- Setters: `SetNomOption()`
- CLI processing functions: `CLITraitement()`
### 3. Code reuse
- All commands reuse `obiconvert` for I/O
- Common options are shared
- Utility functions are centralized
### 4. Default configuration
Default values are:
- Defined when the variables are initialized
- Overridable through the CLI options
- Overridable programmatically through the setters
### 5. Format handling
Automatic support for multiple formats:
- FASTA / FASTQ (with gzip compression)
- EMBL / GenBank
- ecoPCR
- CSV
- JSON (with different header formats)
### 6. Parallelization
Commands use parallel workers through:
- `obidefault.ParallelWorkers()`
- `obidefault.SetStrictReadWorker(n)`
- `obidefault.SetStrictWriteWorker(n)`
### 7. Consistent logging
`logrus` is used for all logging:
```go
log.Printf("Message informatif")
log.Errorf("Message d'erreur: %v", err)
log.Fatal(err) // Stops the program
```
## Main dependencies
### Internal OBITools packages
- `pkg/obidefault`: default values and global configuration
- `pkg/obioptions`: generation of the option parser
- `pkg/obiiter`: iterators over biological sequences
- `pkg/obiseq`: structures and functions for biological sequences
- `pkg/obiformats`: reading/writing of the various formats
- `pkg/obiutils`: miscellaneous utility functions
- `pkg/obichunk`: chunked processing (for dereplication, etc.)
### External packages
- `github.com/DavidGamba/go-getoptions`: CLI option parsing
- `github.com/sirupsen/logrus`: structured logging
- `gopkg.in/yaml.v3`: YAML encoding/decoding
- `github.com/dlclark/regexp2`: advanced regular expressions
## Special cases
### Command with paired files (obipairing)
```go
func OptionSet(options *getoptions.GetOpt) {
	obiconvert.OutputOptionSet(options)
	obiconvert.InputOptionSet(options)
	PairingOptionSet(options) // Pairing-specific options
}

func CLIPairedSequence() (obiiter.IBioSequence, error) {
	forward, err := obiconvert.CLIReadBioSequences(_ForwardFile)
	// ...
	reverse, err := obiconvert.CLIReadBioSequences(_ReverseFile)
	// ...
	paired := forward.PairTo(reverse)
	return paired, nil
}
```
In `main.go`:
```go
pairs, err := obipairing.CLIPairedSequence() // Special reading step
if err != nil {
	log.Errorf("Cannot open file (%v)", err)
	os.Exit(1)
}
paired := obipairing.IAssemblePESequencesBatch(
	pairs,
	obipairing.CLIGapPenality(),
	// ... other parameters
)
```
### Command without sequence output (obisummary)
Instead of calling `obiconvert.CLIWriteBioSequences()`, the result is printed directly:
```go
summary := obisummary.ISummary(fs, obisummary.CLIMapSummary())
if obisummary.CLIOutFormat() == "json" {
output, _ := json.MarshalIndent(summary, "", " ")
fmt.Print(string(output))
} else {
output, _ := yaml.Marshal(summary)
fmt.Print(string(output))
}
fmt.Printf("\n")
```
### Command with custom workers (obimicrosat)
```go
func CLIAnnotateMicrosat(iterator obiiter.IBioSequence) obiiter.IBioSequence {
	// Create the worker
	worker := MakeMicrosatWorker(
		CLIMinUnitLength(),
		CLIMaxUnitLength(),
		CLIMinUnitCount(),
		CLIMinLength(),
		CLIMinFlankLength(),
		CLIReoriented(),
	)
	// Apply the worker
	newIter := iterator.MakeIWorker(
		worker,
		false,                        // no merge
		obidefault.ParallelWorkers(), // parallelization
	)
	return newIter.FilterEmpty() // Filter out empty results
}
```
## Execution flow diagram
```
┌─────────────────────────────────────────────────────────────┐
│ cmd/obitools/macommande/main.go │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 1. Génération du parser d'options │
│ obioptions.GenerateOptionParser( │
│ "macommande", │
│ "description", │
│ macommande.OptionSet) │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ pkg/obitools/macommande/options.go │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ func OptionSet(options *getoptions.GetOpt) │ │
│ │ obiconvert.OptionSet(false)(options) ───────────┐ │ │
│ │ MaCommandeOptionSet(options) │ │ │
│ └───────────────────────────────────────────────────┼─┘ │
└────────────────────────────────────────────────────────┼─────┘
│ │
│ │
┌─────────────┘ │
│ │
▼ ▼
┌─────────────────────────────────┐ ┌───────────────────────────────┐
│ 2. Parsing des arguments │ │ pkg/obitools/obiconvert/ │
│ _, args := optionParser(...) │ │ options.go │
└─────────────────────────────────┘ │ - InputOptionSet() │
│ │ - OutputOptionSet() │
▼ │ - PairedFilesOptionSet() │
┌─────────────────────────────────┐ └───────────────────────────────┘
│ 3. Lecture des séquences │
│ CLIReadBioSequences(args) │
└─────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ pkg/obitools/obiconvert/sequence_reader.go │
│ - ExpandListOfFiles() │
│ - ReadSequencesFromFile() / ReadSequencesFromStdin() │
│ - Support: FASTA, FASTQ, EMBL, GenBank, ecoPCR, CSV │
└─────────────────────────────────────────────────────────────┘
▼ obiiter.IBioSequence
┌─────────────────────────────────────────────────────────────┐
│ 4. Traitement spécifique │
│ macommande.CLITraitement(sequences) │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ pkg/obitools/macommande/<implementation>.go │
│ - Récupération des options via CLI*() getters │
│ - Application de la logique métier │
│ - Retour d'un nouvel iterator │
└─────────────────────────────────────────────────────────────┘
▼ obiiter.IBioSequence
┌─────────────────────────────────────────────────────────────┐
│ 5. Écriture des résultats │
│ CLIWriteBioSequences(resultat, true) │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ pkg/obitools/obiconvert/sequence_writer.go │
│ - WriteSequencesToFile() / WriteSequencesToStdout() │
│ - Support: FASTA, FASTQ, JSON │
│ - Gestion des lectures pairées │
│ - Compression optionnelle │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 6. Attente de fin du pipeline │
│ obiutils.WaitForLastPipe() │
└─────────────────────────────────────────────────────────────┘
```
## Conclusion
The architecture of OBITools commands is designed to:
1. **Maximize reuse**: `obiconvert` provides the common functionality
2. **Simplify the addition of new commands**: standardized, minimal pattern
3. **Ease maintenance**: clear separation of responsibilities
4. **Guarantee consistency**: uniform naming conventions and structure
5. **Optimize performance**: built-in parallelization and batch processing
This modular architecture makes it possible to create new commands quickly while keeping quality and consistency high across the whole OBITools suite.


@@ -0,0 +1,99 @@
# Definition of the super k-mer
## Definition
A **super k-mer** is a **MAXIMAL subsequence** of a sequence in which **all consecutive k-mers share the same minimizer**.
### Terms
- **k-mer**: subsequence of length k
- **minimizer**: the smallest canonical m-mer among all the m-mers of a k-mer
- **consecutive k-mers**: k-mers at positions i and i+1 (overlapping by k-1 nucleotides)
- **MAXIMAL**: cannot be extended to the left or to the right
## ABSOLUTE RULES
### RULE 1: Minimum length = k
A super k-mer contains at least k nucleotides.
```
length(super-kmer) >= k
```
### RULE 2: Mandatory overlap = k-1
Two consecutive super k-mers overlap by EXACTLY k-1 nucleotides (with `End` exclusive).
```
SK1.End - SK2.Start = k - 1
```
### RULE 3: One sequence, one minimizer
A super k-mer sequence has ONE and ONLY ONE minimizer. The converse does not hold: distinct super k-mers may share the same minimizer.
```
Same sequence → Same minimizer (ALWAYS)
```
**If you observe the same sequence with two different minimizers, it is a BUG.**
### RULE 4: All k-mers share the minimizer
ALL the k-mers contained in a super k-mer have the same minimizer.
```
∀ k-mer K in SK : minimizer(K) = SK.minimizer
```
### RULE 5: Maximality
A super k-mer cannot be extended.
- If a nucleotide is added on the left: the new k-mer has a different minimizer (or the start of the sequence has been reached)
- If a nucleotide is added on the right: the new k-mer has a different minimizer (or the end of the sequence has been reached)
## FORBIDDEN VIOLATIONS
- **Super k-mer of length < k**
- **Overlap ≠ k-1 between consecutive super k-mers**
- **Same sequence with different minimizers**
- **K-mer inside the super k-mer with a different minimizer**
- **Extensible (non-maximal) super k-mer**
## PRACTICAL CONSEQUENCES
### For extraction
The algorithm must:
1. Compute the minimizer of each k-mer
2. Cut whenever the minimizer changes
3. Assign to the super k-mer the minimizer common to all its k-mers
4. Guarantee that each super k-mer contains at least k nucleotides
5. Guarantee the k-1 overlap between consecutive super k-mers
### For validation
If, after dereplication (obiuniq), we observe:
```
Sequence: ACGT...
Minimizers: {M1, M2} // several minimizers
```
This is PROOF of a bug: the algorithm produced this sequence with different minimizers, which violates RULE 3.
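The five extraction steps can be sketched with a deliberately naive reference implementation, useful mainly as a test oracle. It is an assumption-laden illustration, not the `obikmer` code: minimizers are ordered lexicographically on uppercase A/C/G/T strings, the cost is O(n·k·m), and the names `superKmer`, `minimizer` and `superKmers` are hypothetical.

```go
package main

import "fmt"

// superKmer is a half-open interval [start, end) of the sequence together
// with the canonical minimizer shared by all its k-mers.
type superKmer struct {
	start, end int
	minimizer  string
}

var complement = map[byte]byte{'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}

// revComp returns the reverse complement of an uppercase A/C/G/T string.
func revComp(s string) string {
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		out[len(s)-1-i] = complement[s[i]]
	}
	return string(out)
}

// canonical returns the smaller of s and its reverse complement.
func canonical(s string) string {
	if rc := revComp(s); rc < s {
		return rc
	}
	return s
}

// minimizer returns the smallest canonical m-mer of a k-mer (step 1).
func minimizer(kmer string, m int) string {
	best := canonical(kmer[:m])
	for i := 1; i+m <= len(kmer); i++ {
		if c := canonical(kmer[i : i+m]); c < best {
			best = c
		}
	}
	return best
}

// superKmers cuts seq whenever the minimizer of the next k-mer changes
// (steps 2 and 3); the construction yields length >= k and an overlap of
// exactly k-1 between consecutive super k-mers (steps 4 and 5).
func superKmers(seq string, k, m int) []superKmer {
	var out []superKmer
	if len(seq) < k {
		return out
	}
	start, current := 0, minimizer(seq[:k], m)
	for i := 1; i+k <= len(seq); i++ {
		if next := minimizer(seq[i:i+k], m); next != current {
			// The last k-mer of the run starts at i-1 and ends at i-1+k.
			out = append(out, superKmer{start, i - 1 + k, current})
			start, current = i, next
		}
	}
	return append(out, superKmer{start, len(seq), current})
}

func main() {
	const k, m = 5, 3
	seq := "ACGTTGCAAGGCTTAACC"
	for _, sk := range superKmers(seq, k, m) {
		fmt.Printf("[%2d,%2d) %-12s minimizer=%s\n",
			sk.start, sk.end, seq[sk.start:sk.end], sk.minimizer)
	}
}
```

An optimized extractor can be checked against this oracle on small inputs, as long as both use the same minimizer ordering.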
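The validation check described above amounts to grouping records by sequence and reporting any sequence seen with more than one minimizer. The sketch below assumes the records have already been parsed into (sequence, minimizer) pairs; the `record` type and `conflictingSequences` function are hypothetical helpers, not part of OBITools.

```go
package main

import (
	"fmt"
	"sort"
)

// record is a parsed super k-mer: its sequence and its assigned minimizer.
type record struct {
	sequence  string
	minimizer string
}

// conflictingSequences returns, for every sequence observed with more than
// one minimizer (a RULE 3 violation), the sorted list of those minimizers.
func conflictingSequences(records []record) map[string][]string {
	seen := map[string]map[string]struct{}{}
	for _, r := range records {
		if seen[r.sequence] == nil {
			seen[r.sequence] = map[string]struct{}{}
		}
		seen[r.sequence][r.minimizer] = struct{}{}
	}
	conflicts := map[string][]string{}
	for seq, mins := range seen {
		if len(mins) > 1 {
			list := make([]string, 0, len(mins))
			for m := range mins {
				list = append(list, m)
			}
			sort.Strings(list)
			conflicts[seq] = list
		}
	}
	return conflicts
}

func main() {
	records := []record{
		{"ACGTA", "ACG"},
		{"ACGTA", "CGT"}, // same sequence, different minimizer: a bug
		{"GGGTT", "AAC"},
		{"GGGTT", "AAC"},
	}
	for seq, mins := range conflictingSequences(records) {
		fmt.Printf("RULE 3 violated for %s: minimizers %v\n", seq, mins)
	}
}
```

An empty result does not prove the extractor correct; it only shows that this particular symptom is absent from the data set.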
## BUG DIAGNOSIS
**Observed bug**: same sequence with different minimizers after obiuniq
**Possible cause**: the algorithm assigns the wrong minimizer OR cuts the super k-mers incorrectly
**What the bug CANNOT be**:
- A problem in obiuniq (it reveals the bug, it does not create it)
- A problem with the legitimate overlap (k-1 is correct)
**What the bug MUST be**:
- A minimizer that is miscomputed or misassigned
- Incorrect cutting (wrong endPos)
- Incorrect copying of the data


@@ -0,0 +1,316 @@
# Guide to writing an obitest
## Essential rules
1. **Data < 1 KB** - Very small test files
2. **Execution < 10 s** - Fast tests for CI/CD
3. **Self-contained** - No external dependencies
4. **Self-cleaning** - No leftover files
## Minimal structure
```
obitests/obitools/<command>/
├── test.sh      # Executable script
└── data.fasta   # Minimal data (optional)
```
## test.sh template
```bash
#!/bin/bash
TEST_NAME=<commande>
CMD=<commande>
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR"
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find $TEST_DIR | awk -F'/' '{print $NF}' | tail -n +2)"
########## TESTS ##########
# Test 1: Help (MANDATORY)
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
# Add your tests here...
###########################
cleanup
```
## Test pattern
```bash
((ntest++))
if commande args > "${TMPDIR}/output.txt" 2>&1
then
log "$MCMD: description OK"
((success++))
else
log "$MCMD: description failed"
((failed++))
fi
```
## Common tests
### Basic execution
```bash
((ntest++))
if $CMD "${TEST_DIR}/input.fasta" > "${TMPDIR}/output.fasta" 2>&1
then
log "$MCMD: basic execution OK"
((success++))
else
log "$MCMD: basic execution failed"
((failed++))
fi
```
### Non-empty output
```bash
((ntest++))
if [ -s "${TMPDIR}/output.fasta" ]
then
log "$MCMD: output not empty OK"
((success++))
else
log "$MCMD: output empty - failed"
((failed++))
fi
```
### Counting
```bash
((ntest++))
count=$(grep -c "^>" "${TMPDIR}/output.fasta")
if [ "$count" -gt 0 ]
then
log "$MCMD: extracted $count sequences OK"
((success++))
else
log "$MCMD: no sequences - failed"
((failed++))
fi
```
### Presence of content
```bash
((ntest++))
if grep -q "expected_string" "${TMPDIR}/output.fasta"
then
log "$MCMD: expected content found OK"
((success++))
else
log "$MCMD: content not found - failed"
((failed++))
fi
```
### Comparison with a reference
```bash
((ntest++))
if diff "${TEST_DIR}/expected.fasta" "${TMPDIR}/output.fasta" > /dev/null
then
log "$MCMD: matches reference OK"
((success++))
else
log "$MCMD: differs from reference - failed"
((failed++))
fi
```
### Test with options
```bash
((ntest++))
if $CMD --opt value "${TEST_DIR}/input.fasta" > "${TMPDIR}/out.fasta" 2>&1
then
log "$MCMD: with option OK"
((success++))
else
log "$MCMD: with option failed"
((failed++))
fi
```
## Important variables
- **TEST_DIR** - Test directory (input data)
- **TMPDIR** - Temporary directory (outputs)
- **CMD** - Command name
- **MCMD** - Name formatted for the logs
## Golden rules
Do:
- **Inputs** → `${TEST_DIR}/`
- **Outputs** → `${TMPDIR}/`
- **Always redirect** → `> file 2>&1`
- **Increment ntest** → before each test
- **Clear messages** → explicit descriptions

Don't:
- **No hard-coded paths**
- **No direct use of /tmp**
- **No output to TEST_DIR**
- **No commands without redirection**
## Test data
Create a minimal file (< 500 bytes):
```fasta
>seq1
ACGTACGTACGTACGT
>seq2
AAAACCCCGGGGTTTT
>seq3
ATCGATCGATCGATCG
```
## Quick setup
```bash
# 1. Create the directory
mkdir -p obitests/obitools/<command>
cd obitests/obitools/<command>
# 2. Create the test data
cat > test_data.fasta << 'EOF'
>seq1
ACGTACGTACGTACGT
>seq2
AAAACCCCGGGGTTTT
EOF
# 3. Copy the template into test.sh
# 4. Adapt TEST_NAME and CMD
# 5. Add the tests
# 6. Make it executable
chmod +x test.sh
# 7. Run it
./test.sh
```
## Checklist
- [ ] `test.sh` is executable (`chmod +x`)
- [ ] Help test included
- [ ] Data < 1 KB
- [ ] Outputs go to `${TMPDIR}/`
- [ ] Inputs come from `${TEST_DIR}/`
- [ ] `2>&1` redirections
- [ ] Clear messages
- [ ] Tested locally
- [ ] Exit code 0 on success
## Debug
Keep TMPDIR for inspection:
```bash
cleanup() {
echo "Temporary directory: $TMPDIR" 1>&2
# rm -rf "$TMPDIR" # Commented out
...
}
```
Verbose mode:
```bash
set -x # At the beginning of the script
```
## Examples
**Simple (1 test)** - obimicrosat
```bash
# Help only
```
**Medium (4-5 tests)** - obisuperkmer
```bash
# Help + execution + output validation + content
```
**Complete (7+ tests)** - obiuniq
```bash
# Help + execution + CSV comparison + options + multiple cases
```
## Useful commands
```bash
# Count sequences
grep -c "^>" file.fasta
# Non-empty file
[ -s file ]
# Compare
diff file1 file2 > /dev/null
# Compare compressed files
zdiff file1.gz file2.gz
# Count bases
grep -v "^>" file | tr -d '\n' | wc -c
```
## Key takeaways
A good test is **SHORT**, **FAST** and **SIMPLE**:
- 3-10 tests at most
- Data < 1 KB
- Execution < 10 seconds
- Standard pattern followed


@@ -0,0 +1,268 @@
# Implementation of the obisuperkmer command
## Overview
The `obisuperkmer` command was implemented following the standard architecture of OBITools commands described in `architecture-commande-obitools.md`. It extracts the super k-mers from biological sequence files.
## What is a super k-mer?
A super k-mer is a maximal subsequence in which all consecutive k-mers share the same minimizer. This decomposition is useful for:
- Efficient k-mer indexing
- Reducing redundancy in analyses
- Optimizing memory use in k-mer data structures
## Structure of the implementation
### 1. Package `pkg/obitools/obisuperkmer/`
The package contains three files:
#### `obisuperkmer.go`
Package documentation with a description of its role.
#### `options.go`
Defines the command-line options:
```go
var _KmerSize = 21      // K-mer size (default 21)
var _MinimizerSize = 11 // Minimizer size (default 11)
```
**Available CLI options:**
- `--kmer-size` / `-k`: k-mer size (between m+1 and 31)
- `--minimizer-size` / `-m`: minimizer size (between 1 and k-1)
**Accessor functions:**
- `CLIKmerSize()`: returns the k-mer size
- `CLIMinimizerSize()`: returns the minimizer size
- `SetKmerSize(k int)`: sets the k-mer size
- `SetMinimizerSize(m int)`: sets the minimizer size
#### `superkmer.go`
Implements the processing logic:
```go
func CLIExtractSuperKmers(iterator obiiter.IBioSequence) obiiter.IBioSequence
```
This function:
1. Retrieves the parameters k and m from the CLI options
2. Validates the parameters (m < k, k <= 31, etc.)
3. Creates a worker using `obikmer.SuperKmerWorker(k, m)`
4. Applies the worker in parallel to the sequence iterator
5. Returns an iterator over super k-mers
### 2. Executable `cmd/obitools/obisuperkmer/main.go`
The executable follows the standard minimal pattern:
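Step 2 (parameter validation) could look like the sketch below. It only encodes the constraints stated in this document (1 <= m <= k-1 and k <= 31); the function name `validateParams` and the error messages are hypothetical and do not reflect the actual `obisuperkmer` code.

```go
package main

import "fmt"

// validateParams checks the constraints documented for obisuperkmer:
// the minimizer size m must be in [1, k-1] and the k-mer size k at most 31.
func validateParams(k, m int) error {
	if m < 1 {
		return fmt.Errorf("minimizer size must be >= 1, got %d", m)
	}
	if m >= k {
		return fmt.Errorf("minimizer size (%d) must be smaller than k-mer size (%d)", m, k)
	}
	if k > 31 {
		return fmt.Errorf("k-mer size must be <= 31, got %d", k)
	}
	return nil
}

func main() {
	for _, p := range [][2]int{{21, 11}, {31, 15}, {32, 11}, {11, 11}, {21, 0}} {
		if err := validateParams(p[0], p[1]); err != nil {
			fmt.Printf("k=%d m=%d: invalid (%v)\n", p[0], p[1], err)
		} else {
			fmt.Printf("k=%d m=%d: ok\n", p[0], p[1])
		}
	}
}
```

Returning an error rather than exiting keeps the check usable both from the CLI entry point and from the programmatic setters.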
```go
func main() {
	// 1. Generation of the option parser
	optionParser := obioptions.GenerateOptionParser(
		"obisuperkmer",
		"extract super k-mers from sequence files",
		obisuperkmer.OptionSet)
	// 2. Argument parsing
	_, args := optionParser(os.Args)
	// 3. Reading of the sequences
	sequences, err := obiconvert.CLIReadBioSequences(args...)
	obiconvert.OpenSequenceDataErrorMessage(args, err)
	// 4. Extraction of the super k-mers
	superkmers := obisuperkmer.CLIExtractSuperKmers(sequences)
	// 5. Writing of the results
	obiconvert.CLIWriteBioSequences(superkmers, true)
	// 6. Wait for the end of the pipeline
	obiutils.WaitForLastPipe()
}
```
## Using the `obikmer` package
The implementation relies on the `obikmer` package, which provides:
### `SuperKmerWorker(k int, m int) obiseq.SeqWorker`
Creates a worker that:
- Extracts the super k-mers from a BioSequence
- Returns a slice of BioSequence, one per super k-mer
- Attaches the following attributes to each super k-mer:
```go
// Metadata added to each super k-mer:
{
	"minimizer_value": uint64, // Canonical value of the minimizer
	"minimizer_seq": string,   // DNA sequence of the minimizer
	"k": int,                  // k-mer size used
	"m": int,                  // Minimizer size used
	"start": int,              // Start position (0-based)
	"end": int,                // End position (exclusive)
	"parent_id": string,       // ID of the parent sequence
}
```
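As an illustration of how a downstream tool might consume these attributes, here is a minimal sketch that decodes the JSON annotation block of a FASTA header into a Go struct. The `SuperKmerInfo` type, the `parseHeader` helper and the header values are hypothetical and are not part of `obikmer`; only the attribute names come from the list above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// SuperKmerInfo mirrors the attributes listed above (hypothetical type).
type SuperKmerInfo struct {
	MinimizerValue uint64 `json:"minimizer_value"`
	MinimizerSeq   string `json:"minimizer_seq"`
	K              int    `json:"k"`
	M              int    `json:"m"`
	Start          int    `json:"start"`
	End            int    `json:"end"`
	ParentID       string `json:"parent_id"`
}

// parseHeader splits a FASTA header into its ID and its JSON annotations.
func parseHeader(line string) (string, SuperKmerInfo, error) {
	var info SuperKmerInfo
	line = strings.TrimPrefix(line, ">")
	id, rest, found := strings.Cut(line, " ")
	if !found {
		return id, info, fmt.Errorf("no annotation block in header %q", line)
	}
	err := json.Unmarshal([]byte(strings.TrimSpace(rest)), &info)
	return id, info, err
}

func main() {
	// Made-up header, used only to exercise the parser.
	header := `>seq1_superkmer_0_25 {"minimizer_value":42,"minimizer_seq":"acgtacgtacg","k":21,"m":11,"start":0,"end":25,"parent_id":"seq1"}`
	id, info, err := parseHeader(header)
	if err != nil {
		panic(err)
	}
	// With an exclusive end, the super k-mer length is end - start.
	fmt.Println(id, info.ParentID, info.End-info.Start)
}
```

Because `end` is exclusive, `end - start` gives the length of the super k-mer directly.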
### Underlying algorithm
The `obikmer` package uses:
- `IterSuperKmers(seq []byte, k int, m int)`: iterator over the super k-mers
- A monotone deque to track minimizers in a sliding window
- Time complexity: O(n), where n is the sequence length
- Space complexity: O(k-m+1) for the deque
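The monotone-deque technique mentioned above can be sketched as follows. This is not the `obikmer` code: `slidingMinIndex` and `groupRuns` are hypothetical helpers, the input is a plain slice of integers standing in for m-mer values, and grouping windows by the position of their minimum is an assumption about how super k-mers are delimited.

```go
package main

import "fmt"

// slidingMinIndex returns, for every window of w consecutive values,
// the index of the smallest value (leftmost on ties). Each index is
// pushed and popped at most once, so the whole scan is O(n).
func slidingMinIndex(values []uint64, w int) []int {
	if w <= 0 || len(values) < w {
		return nil
	}
	deque := make([]int, 0, w) // candidate indices, values non-decreasing front to back
	result := make([]int, 0, len(values)-w+1)
	for i, v := range values {
		// Candidates strictly larger than v can never be a minimum again.
		for len(deque) > 0 && values[deque[len(deque)-1]] > v {
			deque = deque[:len(deque)-1]
		}
		deque = append(deque, i)
		// Drop the front candidate once it has left the window.
		if deque[0] <= i-w {
			deque = deque[1:]
		}
		if i >= w-1 {
			result = append(result, deque[0])
		}
	}
	return result
}

// groupRuns merges consecutive windows that share the same minimum index
// and returns half-open [first, last) window ranges.
func groupRuns(minIdx []int) [][2]int {
	var runs [][2]int
	for i := range minIdx {
		if i == 0 || minIdx[i] != minIdx[i-1] {
			runs = append(runs, [2]int{i, i + 1})
		} else {
			runs[len(runs)-1][1] = i + 1
		}
	}
	return runs
}

func main() {
	values := []uint64{5, 3, 4, 1, 2, 6}
	minIdx := slidingMinIndex(values, 3)
	fmt.Println(minIdx)
	fmt.Println(groupRuns(minIdx))
}
```

In the super k-mer setting, the window width would be `k-m+1` (the number of m-mers inside one k-mer), which matches the space bound given above, and each run of windows would correspond to one super k-mer.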
## Usage example
### Command line
```bash
# Extraction with default parameters (k=21, m=11)
obisuperkmer sequences.fasta > superkmers.fasta
# Specify the k-mer and minimizer sizes
obisuperkmer -k 25 -m 13 sequences.fasta -o superkmers.fasta
# With several input files
obisuperkmer --kmer-size 31 --minimizer-size 15 file1.fasta file2.fasta > output.fasta
# FASTQ input, FASTA output
obisuperkmer sequences.fastq --fasta-output -o superkmers.fasta
# With compression
obisuperkmer sequences.fasta -o superkmers.fasta.gz --compress
```
### Sample output
For an input sequence:
```
>seq1
ACGTACGTACGTACGTACGTACGT
```
the output contains one or more super k-mers. The records below only illustrate the header layout; their values are placeholders rather than real output (with k=21 and m=11, every super k-mer is at least 21 nt long and `minimizer_seq` is 11 nt long):
```
>seq1_superkmer_0_15 {"minimizer_value":123456,"minimizer_seq":"acgtacgt","k":21,"m":11,"start":0,"end":15,"parent_id":"seq1"}
ACGTACGTACGTACG
>seq1_superkmer_8_24 {"minimizer_value":789012,"minimizer_seq":"gtacgtac","k":21,"m":11,"start":8,"end":24,"parent_id":"seq1"}
TACGTACGTACGTACGT
```
## Options inherited from `obiconvert`
The command inherits all the standard OBITools options:
### Input options
- `--fasta`: force the FASTA format
- `--fastq`: force the FASTQ format
- `--ecopcr`: ecoPCR format
- `--embl`: EMBL format
- `--genbank`: GenBank format
- `--input-json-header`: JSON headers
- `--input-OBI-header`: OBI headers
### Output options
- `--out` / `-o`: output file (default: stdout)
- `--fasta-output`: FASTA output
- `--fastq-output`: FASTQ output
- `--json-output`: JSON output
- `--output-json-header`: JSON headers in the output
- `--output-OBI-header` / `-O`: OBI headers in the output
- `--compress` / `-Z`: gzip compression
- `--skip-empty`: skip empty sequences
- `--no-progressbar`: disable the progress bar
## Compilation
To build the command:
```bash
cd /path/to/obitools4
go build -o bin/obisuperkmer ./cmd/obitools/obisuperkmer/
```
## Tests
To test the command:
```bash
# Create a test file
echo -e ">test\nACGTACGTACGTACGTACGTACGTACGTACGT" > test.fasta
# Run obisuperkmer
obisuperkmer test.fasta
# Check with different parameters
obisuperkmer -k 15 -m 7 test.fasta
```
## Parameter validation
The command automatically checks that:
- `1 <= m < k`: the minimizer must be shorter than the k-mer
- `2 <= k <= 31`: constraint imposed by the 64-bit encoding
- `len(sequence) >= k`: the sequence must be long enough
If the parameters are invalid, the command prints an explicit error and stops.
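The three checks above can be expressed as a single function. This is a minimal sketch: `validateParams` is a hypothetical name, and the error messages are invented for illustration rather than taken from the command.

```go
package main

import "fmt"

// validateParams applies the three constraints listed above and returns
// a descriptive error for the first one that fails (hypothetical helper).
func validateParams(k, m, seqLen int) error {
	switch {
	case k < 2 || k > 31:
		return fmt.Errorf("k-mer size %d out of range [2, 31]", k)
	case m < 1 || m >= k:
		return fmt.Errorf("minimizer size %d must satisfy 1 <= m < k (k=%d)", m, k)
	case seqLen < k:
		return fmt.Errorf("sequence length %d is shorter than k=%d", seqLen, k)
	}
	return nil
}

func main() {
	fmt.Println(validateParams(21, 11, 32)) // valid combination
	fmt.Println(validateParams(32, 11, 100))
	fmt.Println(validateParams(21, 21, 100))
	fmt.Println(validateParams(21, 11, 20))
}
```

Checking `k` before `m` keeps the messages unambiguous, since the bound on `m` depends on a valid `k`.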
## Integration with the OBITools pipeline
The command fits naturally into OBITools pipelines:
```bash
# Complete analysis pipeline
obiconvert sequences.fastq --fasta-output | \
  obisuperkmer -k 21 -m 11 | \
  obiuniq | \
  obigrep -p "minimizer_value>1000" > filtered_superkmers.fasta
```
## Parallelization
The command automatically uses:
- `obidefault.ParallelWorkers()` for parallel processing
- Workers distributed across the input sequences
- Parallelization that is transparent to the user
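The general idea of distributing workers across input sequences can be sketched with a plain goroutine pool. This is not how `obiiter` implements it; `processParallel` is a hypothetical helper that uses only the standard library, with the worker count standing in for the value returned by `obidefault.ParallelWorkers()`.

```go
package main

import (
	"fmt"
	"sync"
)

// processParallel applies work to every input using nworkers goroutines
// and returns the results in input order (hypothetical helper).
func processParallel(inputs []string, nworkers int, work func(string) int) []int {
	if nworkers < 1 {
		nworkers = 1
	}
	results := make([]int, len(inputs))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < nworkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each job index is handled by exactly one goroutine,
			// so writes to results never overlap.
			for i := range jobs {
				results[i] = work(inputs[i])
			}
		}()
	}
	for i := range inputs {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	seqs := []string{
		"ACGTACGTACGTACGTACGTACGTACGTACGT",
		"AAAACCCCGGGGTTTTAAAACCCCGGGGTTTT",
		"ATCGATCG",
	}
	const k = 21
	// Stand-in workload: count the k-mers of each sequence.
	countKmers := func(s string) int {
		if len(s) < k {
			return 0
		}
		return len(s) - k + 1
	}
	fmt.Println(processParallel(seqs, 4, countKmers))
}
```

Sending indices rather than values through the channel lets the results be stored in input order without any extra sorting step.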
## Compliance with the OBITools architecture
The implementation follows all the principles of the architecture:
✅ Separation of concerns (package + command)
✅ Consistent naming convention (CLI*, Set*, _variables)
✅ Reuse of `obiconvert` for I/O
✅ Shared standard options
✅ Worker pattern for processing
✅ Parameter validation
✅ Logging with `logrus`
✅ Consistent error handling
✅ Complete documentation
## Files created
```
pkg/obitools/obisuperkmer/
├── obisuperkmer.go # Package documentation
├── options.go # CLI option definitions
└── superkmer.go # Processing implementation
cmd/obitools/obisuperkmer/
└── main.go # Command entry point
```
## Next steps
1. **Compilation**: Build the command with `go build`
2. **Unit tests**: Create tests in `pkg/obitools/obisuperkmer/superkmer_test.go`
3. **User documentation**: Add documentation for the command
4. **CI/CD integration**: Add the command to the integration tests
5. **Benchmarks**: Measure performance on different datasets
## References
- OBITools command architecture: `architecture-commande-obitools.md`
- `obikmer` package: `pkg/obikmer/`
- Package tests: `pkg/obikmer/superkmer_iter_test.go`


@@ -0,0 +1,440 @@
# Automated tests for obisuperkmer
## Overview
Automated tests for the `obisuperkmer` command have been created in the `obitests/obitools/obisuperkmer/` directory. They follow the standard pattern used by all OBITools commands and are designed to run in a CI/CD environment.
## Files created
```
obitests/obitools/obisuperkmer/
├── test.sh # Main test script (6.7 KB)
├── test_sequences.fasta # Test data (117 bytes)
└── README.md # Documentation (4.1 KB)
```
### Total size: ~11 KB
This small footprint suits a Git repository and keeps CI/CD runs fast.
## Test dataset
### File: `test_sequences.fasta` (117 bytes)
The file contains 3 sequences of 32 nucleotides each:
```fasta
>seq1
ACGTACGTACGTACGTACGTACGTACGTACGT
>seq2
AAAACCCCGGGGTTTTAAAACCCCGGGGTTTT
>seq3
ATCGATCGATCGATCGATCGATCGATCGATCG
```
#### Rationale
1. **seq1**: Simple repeated motif (ACGT)
- Tests super k-mer extraction on a low-complexity sequence
- The minimizers should be fairly regular
2. **seq2**: Homopolymer blocks
- Tests the behavior on regions of very low complexity
- The minimizers will vary between the A, C, G and T blocks
3. **seq3**: Different motif (ATCG)
- Tests the diversity of the extracted super k-mers
- Differs from seq1 to check that the two are distinguished
#### Characteristics
- **Length**: 32 nucleotides per sequence
- **Total size**: 96 nucleotides (3 × 32)
- **Format**: FASTA with JSON-compatible headers
- **Alphabet**: A, C, G, T only (no ambiguous bases)
- **File size**: 117 bytes
With k=21 (default), each 32-bp sequence can produce:
- 32 - 21 + 1 = 12 k-mers
- Several super k-mers, depending on the minimizers
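The arithmetic above generalizes to any sequence length and parameter pair. A minimal sketch, with hypothetical helper names `kmerCount` and `mmersPerKmer`:

```go
package main

import "fmt"

// kmerCount returns the number of k-mers in a sequence of length n.
func kmerCount(n, k int) int {
	if k <= 0 || n < k {
		return 0
	}
	return n - k + 1
}

// mmersPerKmer returns the number of m-mers contained in one k-mer,
// that is, the number of minimizer candidates per window (assumes 0 < m <= k).
func mmersPerKmer(k, m int) int {
	return k - m + 1
}

func main() {
	n, k, m := 32, 21, 11
	fmt.Printf("k-mers per sequence: %d\n", kmerCount(n, k))
	fmt.Printf("m-mers per k-mer: %d\n", mmersPerKmer(k, m))
}
```

Assuming every k-mer belongs to exactly one super k-mer, each 32-nt test sequence therefore yields between 1 and 12 super k-mers with the default parameters.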
## Test script: `test.sh`
### Structure
The script follows the standard OBITools pattern:
```bash
#!/bin/bash
TEST_NAME=obisuperkmer
CMD=obisuperkmer
# Standard variables and functions
TEST_DIR="..."
OBITOOLS_DIR="..."
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() { ... }
log() { ... }
# Tests (12 in total)
# ...
cleanup
```
### Implemented tests
#### 1. Help test (`-h`)
```bash
obisuperkmer -h
```
Checks that the command can print its help without error.
#### 2. Basic extraction with default parameters
```bash
obisuperkmer test_sequences.fasta > output_default.fasta
```
Tests a run with k=21, m=11 (defaults).
#### 3. Non-empty output check
```bash
[ -s output_default.fasta ]
```
Makes sure the command produces a result.
#### 4. Super k-mer count
```bash
grep -c "^>" output_default.fasta
```
Checks that at least one super k-mer was extracted.
#### 5. Presence of metadata
```bash
grep -q "minimizer_value" output_default.fasta
grep -q "minimizer_seq" output_default.fasta
grep -q "parent_id" output_default.fasta
```
Checks that the required attributes are present.
#### 6. Extraction with custom parameters
```bash
obisuperkmer -k 15 -m 7 test_sequences.fasta > output_k15_m7.fasta
```
Tests setting k and m.
#### 7. Validation of custom parameters
```bash
grep -q '"k":15' output_k15_m7.fasta
grep -q '"m":7' output_k15_m7.fasta
```
Checks that the parameters are recorded correctly.
#### 8. FASTA output format
```bash
obisuperkmer --fasta-output test_sequences.fasta > output_fasta.fasta
```
Tests the explicit format option.
#### 9. ID check
```bash
grep "^>" output_default.fasta | grep -q "superkmer"
```
Makes sure the IDs contain "superkmer".
#### 10. Preservation of parent IDs
```bash
grep -q "seq1" output_default.fasta
grep -q "seq2" output_default.fasta
grep -q "seq3" output_default.fasta
```
Checks that the IDs of the parent sequences are preserved.
#### 11. Output file option (`-o`)
```bash
obisuperkmer -o output_file.fasta test_sequences.fasta
```
Tests writing to a file.
#### 12. File creation check
```bash
[ -s output_file.fasta ]
```
Makes sure the file was created and is not empty.
#### 13. Length consistency
```bash
# Checks that length(output) <= length(input)
```
Makes sure the super k-mers are not longer than the input.
### Counters
- **ntest**: Number of tests run
- **success**: Number of tests that passed
- **failed**: Number of tests that failed
### Script output
#### On success
```
========================================
## Results of the obisuperkmer tests:
- 12 tests run
- 12 successfully completed
- 0 failed tests
Cleaning up the temporary directory...
========================================
```
Exit code: **0**
#### On failure
```
========================================
## Results of the obisuperkmer tests:
- 12 tests run
- 10 successfully completed
- 2 failed tests
Cleaning up the temporary directory...
========================================
```
Exit code: **1**
## CI/CD integration
### Automatic execution
The script is designed to run automatically in a CI/CD pipeline:
1. The build produces the executable in `build/obisuperkmer`
2. The test script adds `build/` to the PATH
3. The tests run
4. The exit code signals success (0) or failure (1)
### Sample CI/CD configuration
```yaml
# .github/workflows/test.yml or equivalent
test-obisuperkmer:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v2
    - name: Build obitools
      run: make build
    - name: Test obisuperkmer
      run: ./obitests/obitools/obisuperkmer/test.sh
```
### Advantages
- **Speed**: Minimal test data (117 bytes)
- **Reliability**: Reproducible tests
- **Isolation**: Uses a temporary directory
- **Automatic cleanup**: No leftover files
- **Logging**: Timestamped, detailed messages
- **Compatibility**: Standard OBITools pattern
## Running locally
### Steps
1. Build obisuperkmer:
```bash
cd /path/to/obitools4
go build -o build/obisuperkmer ./cmd/obitools/obisuperkmer/
```
2. Change to the test directory:
```bash
cd obitests/obitools/obisuperkmer
```
3. Run the script:
```bash
./test.sh
```
### Sample output
```
[obisuperkmer @ Fri Feb 7 13:00:00 CET 2026] Testing obisuperkmer...
[obisuperkmer @ Fri Feb 7 13:00:00 CET 2026] Test directory is /path/to/obitests/obitools/obisuperkmer
[obisuperkmer @ Fri Feb 7 13:00:00 CET 2026] obitools directory is /path/to/build
[obisuperkmer @ Fri Feb 7 13:00:00 CET 2026] Temporary directory is /tmp/tmp.abc123
[obisuperkmer @ Fri Feb 7 13:00:00 CET 2026] files: README.md test.sh test_sequences.fasta
[obisuperkmer @ Fri Feb 7 13:00:01 CET 2026] OBISuperkmer: printing help OK
[obisuperkmer @ Fri Feb 7 13:00:02 CET 2026] OBISuperkmer: basic extraction with default parameters OK
[obisuperkmer @ Fri Feb 7 13:00:02 CET 2026] OBISuperkmer: output file is not empty OK
[obisuperkmer @ Fri Feb 7 13:00:02 CET 2026] OBISuperkmer: extracted 8 super k-mers OK
[obisuperkmer @ Fri Feb 7 13:00:02 CET 2026] OBISuperkmer: super k-mers contain required metadata OK
[obisuperkmer @ Fri Feb 7 13:00:03 CET 2026] OBISuperkmer: extraction with custom k=15, m=7 OK
[obisuperkmer @ Fri Feb 7 13:00:03 CET 2026] OBISuperkmer: custom parameters correctly set in metadata OK
[obisuperkmer @ Fri Feb 7 13:00:03 CET 2026] OBISuperkmer: FASTA output format OK
[obisuperkmer @ Fri Feb 7 13:00:03 CET 2026] OBISuperkmer: super k-mer IDs contain 'superkmer' OK
[obisuperkmer @ Fri Feb 7 13:00:03 CET 2026] OBISuperkmer: parent sequence IDs preserved OK
[obisuperkmer @ Fri Feb 7 13:00:04 CET 2026] OBISuperkmer: output to file with -o option OK
[obisuperkmer @ Fri Feb 7 13:00:04 CET 2026] OBISuperkmer: output file created with -o option OK
[obisuperkmer @ Fri Feb 7 13:00:04 CET 2026] OBISuperkmer: super k-mer total length <= input length OK
========================================
## Results of the obisuperkmer tests:
- 12 tests run
- 12 successfully completed
- 0 failed tests
Cleaning up the temporary directory...
========================================
```
## Debugging the tests
### Keeping the temporary files
Temporarily modify the `cleanup()` function:
```bash
cleanup() {
  echo "Temporary directory: $TMPDIR" 1>&2
  # Leave this line commented out to keep the files
  # rm -rf "$TMPDIR"
  ...
}
```
### Enabling verbose mode
Add at the top of the script:
```bash
set -x # Prints every command as it is executed
```
### Testing a single command
Extract the command and run it by hand:
```bash
export TEST_DIR=/path/to/obitests/obitools/obisuperkmer
export TMPDIR=$(mktemp -d)
obisuperkmer "${TEST_DIR}/test_sequences.fasta" > "${TMPDIR}/output.fasta"
cat "${TMPDIR}/output.fasta"
```
## Adding new tests
To add another test:
1. Increment the `ntest` counter
2. Write the test condition
3. Log the success or failure
4. Increment the matching counter
```bash
((ntest++))
if my_new_test_command
then
  log "Test description: OK"
  ((success++))
else
  log "Test description: failed"
  ((failed++))
fi
```
## Comparison with other tests
### Test data size
| Command | Data size | Number of files |
|----------|-------------------|-------------------|
| obiconvert | 925 KB | 1 file |
| obiuniq | ~600 bytes | 4 files |
| obimicrosat | 0 bytes | 0 files (generated on the fly) |
| **obisuperkmer** | **117 bytes** | **1 file** |
The `obisuperkmer` test is among the lightest, which suits CI/CD well.
### Number of tests
| Command | Number of tests |
|----------|----------------|
| obiconvert | 3 tests |
| obiuniq | 7 tests |
| obimicrosat | 1 test |
| **obisuperkmer** | **12 tests** |
With 12 distinct tests, `obisuperkmer` has the broadest coverage of the commands listed here.
## Test coverage
The tests cover:
✅ Help display
✅ Basic execution
✅ Default parameters (k=21, m=11)
✅ Custom parameters (k=15, m=7)
✅ Output formats (FASTA)
✅ Writing to a file (`-o`)
✅ Presence of metadata
✅ ID validation
✅ Preservation of parent IDs
✅ Length consistency
✅ Non-empty results
## Maintenance
### Updating the tests
If the implementation of `obisuperkmer` changes:
1. Check that the existing tests still pass
2. Add new tests for the new features
3. Update `README.md` if needed
4. Document the changes
### Regular checks
Run periodically:
```bash
cd obitests/obitools/obisuperkmer
./test.sh
```
Or as part of the full test suite:
```bash
cd obitests
for dir in obitools/*/; do
  if [ -f "$dir/test.sh" ]; then
    echo "Testing $(basename $dir)..."
    (cd "$dir" && ./test.sh) || echo "FAILED: $(basename $dir)"
  fi
done
```
## Conclusion
The `obisuperkmer` tests are:
- **Comprehensive**: 12 tests covering the main features
- **Lightweight**: 117 bytes of test data
- **Fast**: Run in a few seconds
- **Reliable**: Proven pattern used by all OBITools commands
- **Maintainable**: Clear, documented structure
- **CI/CD ready**: Appropriate exit code and automatic cleanup
They check that the command works correctly on every commit and make regressions easier to catch early.


@@ -3,13 +3,11 @@ package main
import (
"os"
log "github.com/sirupsen/logrus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiseq"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiannotate"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
)
func main() {
@@ -32,20 +30,25 @@ func main() {
// trace.Start(ftrace)
// defer trace.Stop()
optionParser := obioptions.GenerateOptionParser(obiannotate.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obiannotate",
"edits the sequence annotations",
obiannotate.OptionSet,
)
_, args := optionParser(os.Args)
sequences, err := obiconvert.CLIReadBioSequences(args...)
if err != nil {
log.Errorf("Cannot open file (%v)", err)
os.Exit(1)
}
obiconvert.OpenSequenceDataErrorMessage(args, err)
annotator := obiannotate.CLIAnnotationPipeline()
if obiannotate.CLIHasSetNumberFlag() {
sequences = sequences.NumberSequences(1, !obiconvert.CLINoInputOrder())
}
obiconvert.CLIWriteBioSequences(sequences.Pipe(annotator), true)
obiiter.WaitForLastPipe()
obiutils.WaitForLastPipe()
}


@@ -3,31 +3,28 @@ package main
import (
"os"
log "github.com/sirupsen/logrus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiclean"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
)
func main() {
optionParser := obioptions.GenerateOptionParser(obiclean.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obiclean",
"",
obiclean.OptionSet)
_, args := optionParser(os.Args)
fs, err := obiconvert.CLIReadBioSequences(args...)
if err != nil {
log.Errorf("Cannot open file (%v)", err)
os.Exit(1)
}
obiconvert.OpenSequenceDataErrorMessage(args, err)
cleaned := obiclean.CLIOBIClean(fs)
obiconvert.CLIWriteBioSequences(cleaned, true)
obiiter.WaitForLastPipe()
obiutils.WaitForLastPipe()
}


@@ -3,33 +3,31 @@ package main
import (
"os"
log "github.com/sirupsen/logrus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obidefault"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obicleandb"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
)
func main() {
obioptions.SetBatchSize(10)
obidefault.SetBatchSize(10)
optionParser := obioptions.GenerateOptionParser(obicleandb.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obicleandb",
"clean-up reference databases",
obicleandb.OptionSet)
_, args := optionParser(os.Args)
fs, err := obiconvert.CLIReadBioSequences(args...)
if err != nil {
log.Errorf("Cannot open file (%v)", err)
os.Exit(1)
}
obiconvert.OpenSequenceDataErrorMessage(args, err)
cleaned := obicleandb.ICleanDB(fs)
toconsume, _ := obiconvert.CLIWriteBioSequences(cleaned, false)
toconsume.Consume()
obiiter.WaitForLastPipe()
obiutils.WaitForLastPipe()
}


@@ -3,30 +3,27 @@ package main
import (
"os"
log "github.com/sirupsen/logrus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiseq"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
)
func main() {
optionParser := obioptions.GenerateOptionParser(obiconvert.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obicomplement",
"reverse complement of sequences",
obiconvert.OptionSet(true))
_, args := optionParser(os.Args)
fs, err := obiconvert.CLIReadBioSequences(args...)
if err != nil {
log.Errorf("Cannot open file (%v)", err)
os.Exit(1)
}
obiconvert.OpenSequenceDataErrorMessage(args, err)
comp := fs.MakeIWorker(obiseq.ReverseComplementWorker(true), true)
obiconvert.CLIWriteBioSequences(comp, true)
obiiter.WaitForLastPipe()
obiutils.WaitForLastPipe()
}


@@ -3,31 +3,28 @@ package main
import (
"os"
log "github.com/sirupsen/logrus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconsensus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
)
func main() {
optionParser := obioptions.GenerateOptionParser(obiconsensus.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obiconsensus",
"ONT reads denoising",
obiconsensus.OptionSet)
_, args := optionParser(os.Args)
fs, err := obiconvert.CLIReadBioSequences(args...)
if err != nil {
log.Errorf("Cannot open file (%v)", err)
os.Exit(1)
}
obiconvert.OpenSequenceDataErrorMessage(args, err)
cleaned := obiconsensus.CLIOBIMinion(fs)
obiconvert.CLIWriteBioSequences(cleaned, true)
obiiter.WaitForLastPipe()
obiutils.WaitForLastPipe()
}


@@ -3,31 +3,29 @@ package main
import (
"os"
log "github.com/sirupsen/logrus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obidefault"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
)
func main() {
obioptions.SetStrictReadWorker(2)
obioptions.SetStrictWriteWorker(2)
obidefault.SetStrictReadWorker(2)
obidefault.SetStrictWriteWorker(2)
optionParser := obioptions.GenerateOptionParser(obiconvert.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obiconvert",
"convertion of sequence files to various formats",
obiconvert.OptionSet(true))
_, args := optionParser(os.Args)
fs, err := obiconvert.CLIReadBioSequences(args...)
if err != nil {
log.Errorf("Cannot open file (%v)", err)
os.Exit(1)
}
obiconvert.OpenSequenceDataErrorMessage(args, err)
obiconvert.CLIWriteBioSequences(fs, true)
obiiter.WaitForLastPipe()
obiutils.WaitForLastPipe()
}


@@ -4,8 +4,7 @@ import (
"fmt"
"os"
log "github.com/sirupsen/logrus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obidefault"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obicount"
@@ -29,23 +28,21 @@ func main() {
// defer trace.Stop()
optionParser := obioptions.GenerateOptionParser(
"obicount",
"counts the sequences present in a file of sequences",
obiconvert.InputOptionSet,
obicount.OptionSet,
)
_, args := optionParser(os.Args)
obioptions.SetStrictReadWorker(min(4, obioptions.CLIParallelWorkers()))
obidefault.SetStrictReadWorker(min(4, obidefault.ParallelWorkers()))
fs, err := obiconvert.CLIReadBioSequences(args...)
if err != nil {
log.Errorf("Cannot open file (%v)", err)
os.Exit(1)
}
obiconvert.OpenSequenceDataErrorMessage(args, err)
nvariant, nread, nsymbol := fs.Count(true)
fmt.Print("entites,n\n")
fmt.Print("entities,n\n")
if obicount.CLIIsPrintingVariantCount() {
fmt.Printf("variants,%d\n", nvariant)


@@ -3,28 +3,25 @@ package main
import (
"os"
log "github.com/sirupsen/logrus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obicsv"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
)
func main() {
optionParser := obioptions.GenerateOptionParser(obicsv.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obicsv",
"converts sequence files to CSV format",
obicsv.OptionSet)
_, args := optionParser(os.Args)
fs, err := obiconvert.CLIReadBioSequences(args...)
obiconvert.OpenSequenceDataErrorMessage(args, err)
if err != nil {
log.Errorf("Cannot open file (%v)", err)
os.Exit(1)
}
obicsv.CLIWriteSequenceCSV(fs, true)
obicsv.CLIWriteCSV(fs, true)
obiiter.WaitForLastPipe()
obiutils.WaitForLastPipe()
}


@@ -3,34 +3,32 @@ package main
import (
"os"
log "github.com/sirupsen/logrus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obidefault"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obidemerge"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
)
func main() {
obioptions.SetStrictReadWorker(2)
obioptions.SetStrictWriteWorker(2)
obidefault.SetStrictReadWorker(2)
obidefault.SetStrictWriteWorker(2)
optionParser := obioptions.GenerateOptionParser(obidemerge.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obidemerge",
"",
obidemerge.OptionSet)
_, args := optionParser(os.Args)
fs, err := obiconvert.CLIReadBioSequences(args...)
if err != nil {
log.Errorf("Cannot open file (%v)", err)
os.Exit(1)
}
obiconvert.OpenSequenceDataErrorMessage(args, err)
demerged := obidemerge.CLIDemergeSequences(fs)
obiconvert.CLIWriteBioSequences(demerged, true)
obiiter.WaitForLastPipe()
obiutils.WaitForLastPipe()
}


@@ -3,29 +3,26 @@ package main
import (
"os"
log "github.com/sirupsen/logrus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obidistribute"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
)
func main() {
optionParser := obioptions.GenerateOptionParser(obidistribute.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obidistribute",
"divided an input set of sequences into subsets",
obidistribute.OptionSet)
_, args := optionParser(os.Args)
fs, err := obiconvert.CLIReadBioSequences(args...)
if err != nil {
log.Errorf("Cannot open file (%v)", err)
os.Exit(1)
}
obiconvert.OpenSequenceDataErrorMessage(args, err)
obidistribute.CLIDistributeSequence(fs)
obiiter.WaitForLastPipe()
obiutils.WaitForLastPipe()
}


@@ -1,68 +0,0 @@
package main
import (
"fmt"
"os"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obifind"
)
func main() {
optionParser := obioptions.GenerateOptionParser(obifind.OptionSet)
_, args := optionParser(os.Args)
//prof, _ := os.Create("obifind.prof")
//pprof.StartCPUProfile(prof)
restrictions, err := obifind.ITaxonRestrictions()
if err != nil {
fmt.Printf("%+v", err)
}
switch {
case obifind.CLIRequestsPathForTaxid() >= 0:
taxonomy, err := obifind.CLILoadSelectedTaxonomy()
if err != nil {
fmt.Printf("%+v", err)
}
taxon, err := taxonomy.Taxon(obifind.CLIRequestsPathForTaxid())
if err != nil {
fmt.Printf("%+v", err)
}
s, err := taxon.Path()
if err != nil {
fmt.Printf("%+v", err)
}
obifind.TaxonWriter(s.Iterator(),
fmt.Sprintf("path:%d", taxon.Taxid()))
case len(args) == 0:
taxonomy, err := obifind.CLILoadSelectedTaxonomy()
if err != nil {
fmt.Printf("%+v", err)
}
obifind.TaxonWriter(restrictions(taxonomy.Iterator()), "")
default:
matcher, err := obifind.ITaxonNameMatcher()
if err != nil {
fmt.Printf("%+v", err)
}
for _, pattern := range args {
s := restrictions(matcher(pattern))
obifind.TaxonWriter(s, pattern)
}
}
//pprof.StopCPUProfile()
}


@@ -3,13 +3,11 @@ package main
import (
"os"
log "github.com/sirupsen/logrus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiseq"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obigrep"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
)
func main() {
@@ -32,18 +30,18 @@ func main() {
// trace.Start(ftrace)
// defer trace.Stop()
optionParser := obioptions.GenerateOptionParser(obigrep.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obigrep",
"select a subset of sequences on various criteria",
obigrep.OptionSet)
_, args := optionParser(os.Args)
sequences, err := obiconvert.CLIReadBioSequences(args...)
obiconvert.OpenSequenceDataErrorMessage(args, err)
if err != nil {
log.Errorf("Cannot open file (%v)", err)
os.Exit(1)
}
selected := obigrep.CLIFilterSequence(sequences)
obiconvert.CLIWriteBioSequences(selected, true)
obiiter.WaitForLastPipe()
obiutils.WaitForLastPipe()
}


@@ -3,34 +3,32 @@ package main
import (
"os"
log "github.com/sirupsen/logrus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obidefault"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obijoin"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
)
func main() {
obioptions.SetStrictReadWorker(2)
obioptions.SetStrictWriteWorker(2)
obidefault.SetStrictReadWorker(2)
obidefault.SetStrictWriteWorker(2)
optionParser := obioptions.GenerateOptionParser(obijoin.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obijoin",
"merge annotations contained in a file to another file",
obijoin.OptionSet)
_, args := optionParser(os.Args)
fs, err := obiconvert.CLIReadBioSequences(args...)
if err != nil {
log.Errorf("Cannot open file (%v)", err)
os.Exit(1)
}
obiconvert.OpenSequenceDataErrorMessage(args, err)
joined := obijoin.CLIJoinSequences(fs)
obiconvert.CLIWriteBioSequences(joined, true)
obiiter.WaitForLastPipe()
obiutils.WaitForLastPipe()
}

34
cmd/obitools/obik/main.go Normal file

@@ -0,0 +1,34 @@
package main
import (
"context"
"errors"
"os"
log "github.com/sirupsen/logrus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiseq"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obik"
"github.com/DavidGamba/go-getoptions"
)
func main() {
defer obiseq.LogBioSeqStatus()
opt, parser := obioptions.GenerateSubcommandParser(
"obik",
"Manage disk-based kmer indices",
obik.OptionSet,
)
_, remaining := parser(os.Args)
err := opt.Dispatch(context.Background(), remaining)
if err != nil {
if errors.Is(err, getoptions.ErrorHelpCalled) {
os.Exit(0)
}
log.Fatalf("Error: %v", err)
}
}


@@ -0,0 +1,54 @@
package main
import (
"os"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiseq"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obikmersim"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
)
func main() {
defer obiseq.LogBioSeqStatus()
// go tool pprof -http=":8000" ./obipairing ./cpu.pprof
// f, err := os.Create("cpu.pprof")
// if err != nil {
// log.Fatal(err)
// }
// pprof.StartCPUProfile(f)
// defer pprof.StopCPUProfile()
// go tool trace cpu.trace
// ftrace, err := os.Create("cpu.trace")
// if err != nil {
// log.Fatal(err)
// }
// trace.Start(ftrace)
// defer trace.Stop()
optionParser := obioptions.GenerateOptionParser(
"obikmermatch",
"",
obikmersim.MatchOptionSet)
_, args := optionParser(os.Args)
var err error
sequences := obiiter.NilIBioSequence
if !obikmersim.CLISelf() {
sequences, err = obiconvert.CLIReadBioSequences(args...)
}
obiconvert.OpenSequenceDataErrorMessage(args, err)
selected := obikmersim.CLIAlignSequences(sequences)
obiconvert.CLIWriteBioSequences(selected, true)
obiutils.WaitForLastPipe()
}

@@ -0,0 +1,62 @@
package main
import (
"log"
"os"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiseq"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obikmersim"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
)
func main() {
defer obiseq.LogBioSeqStatus()
// go tool pprof -http=":8000" ./obipairing ./cpu.pprof
// f, err := os.Create("cpu.pprof")
// if err != nil {
// log.Fatal(err)
// }
// pprof.StartCPUProfile(f)
// defer pprof.StopCPUProfile()
// go tool trace cpu.trace
// ftrace, err := os.Create("cpu.trace")
// if err != nil {
// log.Fatal(err)
// }
// trace.Start(ftrace)
// defer trace.Stop()
optionParser := obioptions.GenerateOptionParser(
"obikmersimcount",
"",
obikmersim.CountOptionSet)
_, args := optionParser(os.Args)
var err error
sequences := obiiter.NilIBioSequence
if !obikmersim.CLISelf() {
sequences, err = obiconvert.CLIReadBioSequences(args...)
}
obiconvert.OpenSequenceDataErrorMessage(args, err)
counted := obikmersim.CLILookForSharedKmers(sequences)
topull, err := obiconvert.CLIWriteBioSequences(counted, false)
if err != nil {
log.Panic(err)
}
topull.Consume()
obiutils.WaitForLastPipe()
}

@@ -3,30 +3,27 @@ package main
import (
"os"
log "github.com/sirupsen/logrus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obilandmark"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
)
func main() {
optionParser := obioptions.GenerateOptionParser(obilandmark.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obilandmark",
"",
obilandmark.OptionSet)
_, args := optionParser(os.Args)
fs, err := obiconvert.CLIReadBioSequences(args...)
if err != nil {
log.Errorf("Cannot open file (%v)", err)
os.Exit(1)
}
obiconvert.OpenSequenceDataErrorMessage(args, err)
indexed := obilandmark.CLISelectLandmarkSequences(fs)
obiconvert.CLIWriteBioSequences(indexed, true)
obiiter.WaitForLastPipe()
obiutils.WaitForLastPipe()
}

@@ -4,8 +4,6 @@ import (
"fmt"
"os"
log "github.com/sirupsen/logrus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiseq"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
@@ -33,17 +31,15 @@ func main() {
// defer trace.Stop()
optionParser := obioptions.GenerateOptionParser(
"obimatrix",
"",
obimatrix.OptionSet,
)
_, args := optionParser(os.Args)
fs, err := obiconvert.CLIReadBioSequences(args...)
if err != nil {
log.Errorf("Cannot open file (%v)", err)
os.Exit(1)
}
obiconvert.OpenSequenceDataErrorMessage(args, err)
matrix := obimatrix.IMatrix(fs)

@@ -0,0 +1,47 @@
package main
import (
"os"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiseq"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obimicrosat"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
)
func main() {
defer obiseq.LogBioSeqStatus()
// go tool pprof -http=":8000" ./obipairing ./cpu.pprof
// f, err := os.Create("cpu.pprof")
// if err != nil {
// log.Fatal(err)
// }
// pprof.StartCPUProfile(f)
// defer pprof.StopCPUProfile()
// go tool trace cpu.trace
// ftrace, err := os.Create("cpu.trace")
// if err != nil {
// log.Fatal(err)
// }
// trace.Start(ftrace)
// defer trace.Stop()
optionParser := obioptions.GenerateOptionParser(
"obimicrosat",
"looks for microsatellite sequences in a sequence file",
obimicrosat.OptionSet)
_, args := optionParser(os.Args)
sequences, err := obiconvert.CLIReadBioSequences(args...)
obiconvert.OpenSequenceDataErrorMessage(args, err)
selected := obimicrosat.CLIAnnotateMicrosat(sequences)
obiconvert.CLIWriteBioSequences(selected, true)
obiutils.WaitForLastPipe()
}

@@ -6,10 +6,10 @@ import (
log "github.com/sirupsen/logrus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obimultiplex"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
)
func main() {
@@ -28,7 +28,10 @@ func main() {
// trace.Start(ftrace)
// defer trace.Stop()
optionParser := obioptions.GenerateOptionParser(obimultiplex.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obimultiplex",
"demultiplex amplicons",
obimultiplex.OptionSet)
_, args := optionParser(os.Args)
@@ -43,14 +46,11 @@ func main() {
}
sequences, err := obiconvert.CLIReadBioSequences(args...)
obiconvert.OpenSequenceDataErrorMessage(args, err)
if err != nil {
log.Errorf("Cannot open file (%v)", err)
os.Exit(1)
}
amplicons, _ := obimultiplex.IExtractBarcode(sequences)
obiconvert.CLIWriteBioSequences(amplicons, true)
amplicons.Wait()
obiiter.WaitForLastPipe()
obiutils.WaitForLastPipe()
}

@@ -5,10 +5,11 @@ import (
log "github.com/sirupsen/logrus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obidefault"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obipairing"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
)
func main() {
@@ -29,12 +30,15 @@ func main() {
// trace.Start(ftrace)
// defer trace.Stop()
optionParser := obioptions.GenerateOptionParser(obipairing.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obipairing",
"align forward and reverse reads of paired-end data",
obipairing.OptionSet)
optionParser(os.Args)
obioptions.SetStrictReadWorker(2)
obioptions.SetStrictWriteWorker(2)
obidefault.SetStrictReadWorker(2)
obidefault.SetStrictWriteWorker(2)
pairs, err := obipairing.CLIPairedSequence()
if err != nil {
@@ -51,10 +55,10 @@ func main() {
obipairing.CLIFastMode(),
obipairing.CLIFastRelativeScore(),
obipairing.CLIWithStats(),
obioptions.CLIParallelWorkers(),
obidefault.ParallelWorkers(),
)
obiconvert.CLIWriteBioSequences(paired, true)
obiiter.WaitForLastPipe()
obiutils.WaitForLastPipe()
}

@@ -3,12 +3,11 @@ package main
import (
"os"
log "github.com/sirupsen/logrus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obidefault"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obipcr"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
)
func main() {
@@ -25,24 +24,23 @@ func main() {
// trace.Start(ftrace)
// defer trace.Stop()
obioptions.SetWorkerPerCore(2)
obioptions.SetReadWorkerPerCore(0.5)
obioptions.SetParallelFilesRead(obioptions.CLIParallelWorkers() / 4)
obioptions.SetBatchSize(10)
obidefault.SetWorkerPerCore(2)
obidefault.SetReadWorkerPerCore(0.5)
obidefault.SetParallelFilesRead(obidefault.ParallelWorkers() / 4)
obidefault.SetBatchSize(10)
optionParser := obioptions.GenerateOptionParser(obipcr.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obipcr",
"simulates a PCR on sequence files",
obipcr.OptionSet)
_, args := optionParser(os.Args)
sequences, err := obiconvert.CLIReadBioSequences(args...)
if err != nil {
log.Errorf("Cannot open file (%v)", err)
os.Exit(1)
}
obiconvert.OpenSequenceDataErrorMessage(args, err)
amplicons, _ := obipcr.CLIPCR(sequences)
obiconvert.CLIWriteBioSequences(amplicons, true)
obiiter.WaitForLastPipe()
obiutils.WaitForLastPipe()
}

@@ -3,30 +3,27 @@ package main
import (
"os"
log "github.com/sirupsen/logrus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obirefidx"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
)
func main() {
optionParser := obioptions.GenerateOptionParser(obirefidx.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obireffamidx",
"",
obirefidx.OptionSet)
_, args := optionParser(os.Args)
fs, err := obiconvert.CLIReadBioSequences(args...)
if err != nil {
log.Errorf("Cannot open file (%v)", err)
os.Exit(1)
}
obiconvert.OpenSequenceDataErrorMessage(args, err)
indexed := obirefidx.IndexFamilyDB(fs)
obiconvert.CLIWriteBioSequences(indexed, true)
obiiter.WaitForLastPipe()
obiutils.WaitForLastPipe()
}

@@ -3,29 +3,27 @@ package main
import (
"os"
log "github.com/sirupsen/logrus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obirefidx"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
)
func main() {
optionParser := obioptions.GenerateOptionParser(obirefidx.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obirefidx",
"",
obirefidx.OptionSet)
_, args := optionParser(os.Args)
fs, err := obiconvert.CLIReadBioSequences(args...)
obiconvert.OpenSequenceDataErrorMessage(args, err)
if err != nil {
log.Errorf("Cannot open file (%v)", err)
os.Exit(1)
}
indexed := obirefidx.IndexReferenceDB(fs)
obiconvert.CLIWriteBioSequences(indexed, true)
obiiter.WaitForLastPipe()
obiutils.WaitForLastPipe()
}

@@ -4,13 +4,11 @@ import (
"fmt"
"os"
log "github.com/sirupsen/logrus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiseq"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiscript"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
)
func main() {
@@ -33,7 +31,10 @@ func main() {
// trace.Start(ftrace)
// defer trace.Stop()
optionParser := obioptions.GenerateOptionParser(obiscript.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obiscript",
"executes a lua script on the input sequences",
obiscript.OptionSet)
_, args := optionParser(os.Args)
@@ -43,15 +44,11 @@ func main() {
}
sequences, err := obiconvert.CLIReadBioSequences(args...)
if err != nil {
log.Errorf("Cannot open file (%v)", err)
os.Exit(1)
}
obiconvert.OpenSequenceDataErrorMessage(args, err)
annotator := obiscript.CLIScriptPipeline()
obiconvert.CLIWriteBioSequences(sequences.Pipe(annotator), true)
obiiter.WaitForLastPipe()
obiutils.WaitForLastPipe()
}

@@ -4,13 +4,11 @@ import (
"fmt"
"os"
log "github.com/sirupsen/logrus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiseq"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obisplit"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
)
func main() {
@@ -33,7 +31,10 @@ func main() {
// trace.Start(ftrace)
// defer trace.Stop()
optionParser := obioptions.GenerateOptionParser(obisplit.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obisplit",
"",
obisplit.OptionSet)
_, args := optionParser(os.Args)
@@ -43,15 +44,11 @@ func main() {
}
sequences, err := obiconvert.CLIReadBioSequences(args...)
if err != nil {
log.Errorf("Cannot open file (%v)", err)
os.Exit(1)
}
obiconvert.OpenSequenceDataErrorMessage(args, err)
annotator := obisplit.CLISlitPipeline()
obiconvert.CLIWriteBioSequences(sequences.Pipe(annotator), true)
obiiter.WaitForLastPipe()
obiutils.WaitForLastPipe()
}

@@ -5,7 +5,6 @@ import (
"fmt"
"os"
log "github.com/sirupsen/logrus"
"gopkg.in/yaml.v3"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
@@ -34,16 +33,15 @@ func main() {
// trace.Start(ftrace)
// defer trace.Stop()
optionParser := obioptions.GenerateOptionParser(obisummary.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obisummary",
"summarizes the main information from a sequence file",
obisummary.OptionSet)
_, args := optionParser(os.Args)
fs, err := obiconvert.CLIReadBioSequences(args...)
if err != nil {
log.Errorf("Cannot open file (%v)", err)
os.Exit(1)
}
obiconvert.OpenSequenceDataErrorMessage(args, err)
summary := obisummary.ISummary(fs, obisummary.CLIMapSummary())

@@ -6,10 +6,13 @@ import (
log "github.com/sirupsen/logrus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obidefault"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitax"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obifind"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obitag"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obitaxonomy"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
)
@@ -32,39 +35,55 @@ func main() {
// trace.Start(ftrace)
// defer trace.Stop()
obioptions.SetWorkerPerCore(2)
obioptions.SetStrictReadWorker(1)
obioptions.SetStrictWriteWorker(1)
obioptions.SetBatchSize(10)
obidefault.SetWorkerPerCore(2)
obidefault.SetStrictReadWorker(1)
obidefault.SetStrictWriteWorker(1)
obidefault.SetBatchSize(10)
optionParser := obioptions.GenerateOptionParser(obitag.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obitag",
"performs taxonomic assignment",
obitag.OptionSet)
_, args := optionParser(os.Args)
fs, err := obiconvert.CLIReadBioSequences(args...)
obiconvert.OpenSequenceDataErrorMessage(args, err)
if err != nil {
log.Errorf("Cannot open file (%v)", err)
os.Exit(1)
}
taxo, error := obifind.CLILoadSelectedTaxonomy()
if error != nil {
log.Panicln(error)
}
taxo := obitax.DefaultTaxonomy()
references := obitag.CLIRefDB()
if references == nil {
log.Panicln("No loaded reference database")
}
if taxo == nil {
taxo, err = references.ExtractTaxonomy(nil, obitaxonomy.CLINewickWithLeaves())
if err != nil {
log.Fatalf("No taxonomy specified or extractable from reference database: %v", err)
}
taxo.SetAsDefault()
}
if taxo == nil {
log.Panicln("No loaded taxonomy")
}
var identified obiiter.IBioSequence
fsrb := fs.Rebatch(obidefault.BatchSize())
if obitag.CLIGeometricMode() {
identified = obitag.CLIGeomAssignTaxonomy(fs, references, taxo)
identified = obitag.CLIGeomAssignTaxonomy(fsrb, references, taxo)
} else {
identified = obitag.CLIAssignTaxonomy(fs, references, taxo)
identified = obitag.CLIAssignTaxonomy(fsrb, references, taxo)
}
obiconvert.CLIWriteBioSequences(identified, true)
obiiter.WaitForLastPipe()
obiutils.WaitForLastPipe()
obitag.CLISaveRefetenceDB(references)
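
The hunk above changes how `obitag` obtains its taxonomy: it starts from the default taxonomy and, if none is selected, tries to extract one from the reference database before giving up. A minimal sketch of that fallback chain follows, using only the standard library. The `taxonomy` type and the functions `resolveTaxonomy`, `fromReferences` and `noReferences` are hypothetical placeholders for illustration, not the obitax API.

```go
package main

import (
	"errors"
	"fmt"
)

// taxonomy is a hypothetical placeholder for the real taxonomy type.
type taxonomy struct{ name string }

// resolveTaxonomy returns the explicitly selected taxonomy when there is one,
// otherwise falls back to extracting one (here through the extract callback,
// standing in for the reference database), and fails if neither works.
func resolveTaxonomy(selected *taxonomy, extract func() (*taxonomy, error)) (*taxonomy, error) {
	if selected != nil {
		return selected, nil
	}
	taxo, err := extract()
	if err != nil {
		return nil, fmt.Errorf("no taxonomy specified or extractable from reference database: %w", err)
	}
	if taxo == nil {
		return nil, errors.New("no loaded taxonomy")
	}
	return taxo, nil
}

// fromReferences simulates a reference database that carries taxonomic data.
func fromReferences() (*taxonomy, error) { return &taxonomy{name: "extracted"}, nil }

// noReferences simulates a reference database without usable taxonomic data.
func noReferences() (*taxonomy, error) { return nil, errors.New("references carry no taxonomy") }

func main() {
	t, err := resolveTaxonomy(nil, fromReferences)
	fmt.Println(t.name, err)
	_, err = resolveTaxonomy(nil, noReferences)
	fmt.Println(err)
}
```

Returning an error instead of panicking inside the helper leaves the choice between `log.Fatalf` and `log.Panicln` to the caller.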

@@ -1,65 +0,0 @@
package main
import (
"fmt"
"os"
log "github.com/sirupsen/logrus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obifind"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obitag"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obitag2"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
)
func main() {
// go tool pprof -http=":8000" ./build/obitag ./cpu.pprof
// f, err := os.Create("cpu.pprof")
// if err != nil {
// log.Fatal(err)
// }
// pprof.StartCPUProfile(f)
// defer pprof.StopCPUProfile()
// go tool trace cpu.trace
// ftrace, err := os.Create("cpu.trace")
// if err != nil {
// log.Fatal(err)
// }
// trace.Start(ftrace)
// defer trace.Stop()
obioptions.SetWorkerPerCore(2)
obioptions.SetStrictReadWorker(1)
obioptions.SetStrictWriteWorker(1)
obioptions.SetBatchSize(10)
optionParser := obioptions.GenerateOptionParser(obitag.OptionSet)
_, args := optionParser(os.Args)
fs, err := obiconvert.CLIReadBioSequences(args...)
if err != nil {
log.Errorf("Cannot open file (%v)", err)
os.Exit(1)
}
taxo, error := obifind.CLILoadSelectedTaxonomy()
if error != nil {
log.Panicln(error)
}
references := obitag.CLIRefDB()
identified := obitag2.CLIAssignTaxonomy(fs, references, taxo)
obiconvert.CLIWriteBioSequences(identified, true)
obiiter.WaitForLastPipe()
fmt.Println("")
}

@@ -5,11 +5,12 @@ import (
log "github.com/sirupsen/logrus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obidefault"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obipairing"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obitagpcr"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
)
func main() {
@@ -30,9 +31,12 @@ func main() {
// trace.Start(ftrace)
// defer trace.Stop()
obioptions.SetWorkerPerCore(1)
obidefault.SetWorkerPerCore(1)
optionParser := obioptions.GenerateOptionParser(obitagpcr.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obitagpcr",
"split a paired raw read data set per sample",
obitagpcr.OptionSet)
optionParser(os.Args)
pairs, err := obipairing.CLIPairedSequence()
@@ -54,5 +58,5 @@ func main() {
obiconvert.CLIWriteBioSequences(paired, true)
obiiter.WaitForLastPipe()
obiutils.WaitForLastPipe()
}

@@ -0,0 +1,142 @@
package main
import (
"os"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obidefault"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiitercsv"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitax"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obicsv"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obitaxonomy"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
log "github.com/sirupsen/logrus"
)
func main() {
optionParser := obioptions.GenerateOptionParser(
"obitaxonomy",
"manipulates and queries taxonomy",
obitaxonomy.OptionSet)
_, args := optionParser(os.Args)
var iterator *obitax.ITaxon
if obitaxonomy.CLIDownloadNCBI() {
err := obitaxonomy.CLIDownloadNCBITaxdump()
if err != nil {
log.Errorf("Cannot download NCBI taxonomy: %s", err.Error())
os.Exit(1)
}
os.Exit(0)
}
if !obidefault.HasSelectedTaxonomy() {
log.Fatal("you must indicate a taxonomy using the -t or --taxonomy option")
}
switch {
case obitaxonomy.CLIAskForRankList():
newIter := obiitercsv.NewICSVRecord()
newIter.Add(1)
newIter.AppendField("rank")
go func() {
ranks := obitax.DefaultTaxonomy().RankList()
data := make([]obiitercsv.CSVRecord, len(ranks))
for i, rank := range ranks {
record := make(obiitercsv.CSVRecord)
record["rank"] = rank
data[i] = record
}
newIter.Push(obiitercsv.MakeCSVRecordBatch(obitax.DefaultTaxonomy().Name(), 0, data))
newIter.Close()
newIter.Done()
}()
obicsv.CLICSVWriter(newIter, true)
obiutils.WaitForLastPipe()
os.Exit(0)
case obitaxonomy.CLIExtractTaxonomy():
iter, err := obiconvert.CLIReadBioSequences(args...)
// Check the read error before using the iterator.
if err != nil {
log.Fatalf("Cannot extract taxonomy: %v", err)
}
iter = iter.NumberSequences(1, true)
taxonomy, err := iter.ExtractTaxonomy(obitaxonomy.CLINewickWithLeaves())
if err != nil {
log.Fatalf("Cannot extract taxonomy: %v", err)
}
taxonomy.SetAsDefault()
log.Infof("Number of extracted taxa: %d", taxonomy.Len())
iterator = taxonomy.AsTaxonSet().Sort().Iterator()
case obitaxonomy.CLIDumpSubtaxonomy():
iterator = obitaxonomy.CLISubTaxonomyIterator()
case obitaxonomy.CLIRequestsPathForTaxid() != "NA":
taxon, isAlias, err := obitax.DefaultTaxonomy().Taxon(obitaxonomy.CLIRequestsPathForTaxid())
if err != nil {
log.Fatalf("Cannot identify the requested taxon: %s (%v)",
obitaxonomy.CLIRequestsPathForTaxid(), err)
}
if isAlias {
if obidefault.FailOnTaxonomy() {
log.Fatalf("Taxon %s is an alias for %s", taxon.String(), taxon.Parent().String())
}
}
s := taxon.Path()
if s == nil {
log.Fatalf("Cannot extract taxonomic path describing %s", taxon.String())
}
iterator = s.Iterator()
if obitaxonomy.CLIWithQuery() {
iterator = iterator.AddMetadata("query", taxon.String())
}
case len(args) == 0:
iterator = obitax.DefaultTaxonomy().Iterator()
default:
iters := make([]*obitax.ITaxon, len(args))
for i, pat := range args {
ii := obitax.DefaultTaxonomy().IFilterOnName(pat, obitaxonomy.CLIFixedPattern(), true)
if obitaxonomy.CLIWithQuery() {
ii = ii.AddMetadata("query", pat)
}
iters[i] = ii
}
iterator = iters[0]
if len(iters) > 1 {
iterator = iterator.Concat(iters[1:]...)
}
}
iterator = obitaxonomy.CLITaxonRestrictions(iterator)
if obitaxonomy.CLIAsNewick() {
obitaxonomy.CLINewickWriter(iterator, true)
} else {
obitaxonomy.CLICSVTaxaWriter(iterator, true)
}
obiutils.WaitForLastPipe()
}

@@ -3,13 +3,12 @@ package main
import (
"os"
log "github.com/sirupsen/logrus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obidefault"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiseq"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiuniq"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
)
func main() {
@@ -32,20 +31,21 @@ func main() {
// trace.Start(ftrace)
// defer trace.Stop()
optionParser := obioptions.GenerateOptionParser(obiuniq.OptionSet)
obidefault.SetBatchSize(10)
obidefault.SetReadQualities(false)
optionParser := obioptions.GenerateOptionParser(
"obiuniq",
"dereplicate sequence data sets",
obiuniq.OptionSet)
_, args := optionParser(os.Args)
sequences, err := obiconvert.CLIReadBioSequences(args...)
if err != nil {
log.Errorf("Cannot open file (%v)", err)
os.Exit(1)
}
obiconvert.OpenSequenceDataErrorMessage(args, err)
unique := obiuniq.CLIUnique(sequences)
obiconvert.CLIWriteBioSequences(unique, true)
obiiter.WaitForLastPipe()
obiutils.WaitForLastPipe()
}

@@ -3,36 +3,14 @@ package main
import (
"os"
log "github.com/sirupsen/logrus"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiformats"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
)
func main() {
optionParser := obioptions.GenerateOptionParser(obiconvert.OptionSet)
_, args := optionParser(os.Args)
fs, err := obiconvert.CLIReadBioSequences(args...)
if err != nil {
log.Errorf("Cannot open file (%v)", err)
os.Exit(1)
}
frags := obiiter.IFragments(
1000,
100,
10,
100,
obioptions.CLIParallelWorkers(),
)
obiconvert.CLIWriteBioSequences(fs.Pipe(frags), true)
obiiter.WaitForLastPipe()
obiformats.DetectTaxonomyFormat(os.Args[1])
println(obiutils.RemoveAllExt("toto/tutu/test.txt"))
println(obiutils.Basename("toto/tutu/test.txt"))
}

23
git-hooks/pre-push Executable file

@@ -0,0 +1,23 @@
#!/bin/bash
remote="$1"
#url="$2"
log() {
echo -e "[Pre-Push tests @ $(date)] $*" 1>&2
}
current_branch=$(git symbolic-ref --short HEAD)
cmd="make githubtests"
if [[ $current_branch = "master" ]]; then
log "you are on $current_branch, running build test"
if ! eval "$cmd"; then
log "Pre-push tests failed $cmd"
exit 1
fi
fi
log "Tests are OK, ready to push on $remote"
exit 0

27
go.mod

@@ -1,42 +1,39 @@
module git.metabarcoding.org/obitools/obitools4/obitools4
go 1.22.1
go 1.23.4
toolchain go1.24.2
require (
github.com/DavidGamba/go-getoptions v0.28.0
github.com/PaesslerAG/gval v1.2.2
github.com/barkimedes/go-deepcopy v0.0.0-20220514131651-17c30cfc62df
github.com/buger/jsonparser v1.1.1
github.com/chen3feng/stl4go v0.1.1
github.com/dlclark/regexp2 v1.11.4
github.com/goccy/go-json v0.10.3
github.com/klauspost/pgzip v1.2.6
github.com/pbnjay/memory v0.0.0-20210728143218-7b4eea64cf58
github.com/pelletier/go-toml/v2 v2.2.4
github.com/rrethy/ahocorasick v1.0.0
github.com/schollz/progressbar/v3 v3.13.1
github.com/sirupsen/logrus v1.9.3
github.com/stretchr/testify v1.8.4
github.com/tevino/abool/v2 v2.1.0
github.com/yuin/gopher-lua v1.1.1
golang.org/x/exp v0.0.0-20231006140011-7918f672742d
golang.org/x/exp v0.0.0-20231110203233-9a3e6036ecaa
gonum.org/v1/gonum v0.14.0
gopkg.in/yaml.v3 v3.0.1
scientificgo.org/special v0.0.0
)
require (
github.com/bytedance/sonic v1.11.9 // indirect
github.com/bytedance/sonic/loader v0.1.1 // indirect
github.com/cloudwego/base64x v0.1.4 // indirect
github.com/cloudwego/iasm v0.2.0 // indirect
github.com/davecgh/go-spew v1.1.1 // indirect
github.com/goombaio/orderedmap v0.0.0-20180924084748-ba921b7e2419 // indirect
github.com/klauspost/cpuid/v2 v2.0.9 // indirect
github.com/kr/pretty v0.3.0 // indirect
github.com/kr/pretty v0.3.1 // indirect
github.com/kr/text v0.2.0 // indirect
github.com/montanaflynn/stats v0.7.1 // indirect
github.com/pmezard/go-difflib v1.0.0 // indirect
github.com/rogpeppe/go-internal v1.6.1 // indirect
github.com/twitchyliquid64/golang-asm v0.15.1 // indirect
golang.org/x/arch v0.0.0-20210923205945-b76863e36670 // indirect
github.com/rogpeppe/go-internal v1.12.0 // indirect
)
require (
@@ -49,8 +46,8 @@ require (
github.com/rivo/uniseg v0.4.4 // indirect
github.com/shopspring/decimal v1.3.1 // indirect
github.com/ulikunitz/xz v0.5.11
golang.org/x/net v0.17.0 // indirect
golang.org/x/sys v0.17.0 // indirect
golang.org/x/term v0.13.0 // indirect
golang.org/x/net v0.35.0 // indirect
golang.org/x/sys v0.30.0 // indirect
golang.org/x/term v0.29.0 // indirect
gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c
)

62
go.sum

@@ -6,27 +6,21 @@ github.com/PaesslerAG/jsonpath v0.1.0 h1:gADYeifvlqK3R3i2cR5B4DGgxLXIPb3TRTH1mGi
github.com/PaesslerAG/jsonpath v0.1.0/go.mod h1:4BzmtoM/PI8fPO4aQGIusjGxGir2BzcV0grWtFzq1Y8=
github.com/barkimedes/go-deepcopy v0.0.0-20220514131651-17c30cfc62df h1:GSoSVRLoBaFpOOds6QyY1L8AX7uoY+Ln3BHc22W40X0=
github.com/barkimedes/go-deepcopy v0.0.0-20220514131651-17c30cfc62df/go.mod h1:hiVxq5OP2bUGBRNS3Z/bt/reCLFNbdcST6gISi1fiOM=
github.com/bytedance/sonic v1.11.9 h1:LFHENlIY/SLzDWverzdOvgMztTxcfcF+cqNsz9pK5zg=
github.com/bytedance/sonic v1.11.9/go.mod h1:LysEHSvpvDySVdC2f87zGWf6CIKJcAvqab1ZaiQtds4=
github.com/bytedance/sonic/loader v0.1.1 h1:c+e5Pt1k/cy5wMveRDyk2X4B9hF4g7an8N3zCYjJFNM=
github.com/bytedance/sonic/loader v0.1.1/go.mod h1:ncP89zfokxS5LZrJxl5z0UJcsk4M4yY2JpfqGeCtNLU=
github.com/buger/jsonparser v1.1.1 h1:2PnMjfWD7wBILjqQbt530v576A/cAbQvEW9gGIpYMUs=
github.com/buger/jsonparser v1.1.1/go.mod h1:6RYKKt7H4d4+iWqouImQ9R2FZql3VbhNgx27UK13J/0=
github.com/chen3feng/stl4go v0.1.1 h1:0L1+mDw7pomftKDruM23f1mA7miavOj6C6MZeadzN2Q=
github.com/chen3feng/stl4go v0.1.1/go.mod h1:5ml3psLgETJjRJnMbPE+JiHLrCpt+Ajc2weeTECXzWU=
github.com/cloudwego/base64x v0.1.4 h1:jwCgWpFanWmN8xoIUHa2rtzmkd5J2plF/dnLS6Xd/0Y=
github.com/cloudwego/base64x v0.1.4/go.mod h1:0zlkT4Wn5C6NdauXdJRhSKRlJvmclQ1hhJgA0rcu/8w=
github.com/cloudwego/iasm v0.2.0 h1:1KNIy1I1H9hNNFEEH3DVnI4UujN+1zjpuk6gwHLTssg=
github.com/cloudwego/iasm v0.2.0/go.mod h1:8rXZaNYT2n95jn+zTI1sDr+IgcD2GVs0nlbbQPiEFhY=
github.com/creack/pty v1.1.9/go.mod h1:oKZEueFk5CKHvIhNR5MUki03XCEU+Q6VDXinZuGJ33E=
github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/dlclark/regexp2 v1.11.4 h1:rPYF9/LECdNymJufQKmri9gV604RvvABwgOA8un7yAo=
github.com/dlclark/regexp2 v1.11.4/go.mod h1:DHkYz0B9wPfa6wondMfaivmHpzrQ3v9q8cnmRbL6yW8=
github.com/dsnet/compress v0.0.1 h1:PlZu0n3Tuv04TzpfPbrnI0HW/YwodEXDS+oPKahKF0Q=
github.com/dsnet/compress v0.0.1/go.mod h1:Aw8dCMJ7RioblQeTqt88akK31OvO8Dhf5JflhBbQEHo=
github.com/dsnet/golib v0.0.0-20171103203638-1ea166775780/go.mod h1:Lj+Z9rebOhdfkVLjJ8T6VcRQv3SXugXy999NBtR9aFY=
github.com/gabriel-vasile/mimetype v1.4.3 h1:in2uUcidCuFcDKtdcBxlR0rJ1+fsokWf+uqxgUFjbI0=
github.com/gabriel-vasile/mimetype v1.4.3/go.mod h1:d8uq/6HKRL6CGdk+aubisF/M5GcPfT7nKyLpA0lbSSk=
github.com/goccy/go-json v0.10.2 h1:CrxCmQqYDkv1z7lO7Wbh2HN93uovUHgrECaO5ZrCXAU=
github.com/goccy/go-json v0.10.2/go.mod h1:6MelG93GURQebXPDq3khkgXZkazVtN9CRI+MGFi0w8I=
github.com/goccy/go-json v0.10.3 h1:KZ5WoDbxAIgm2HNbYckL0se1fHD6rz5j4ywS6ebzDqA=
github.com/goccy/go-json v0.10.3/go.mod h1:oq7eo15ShAhp70Anwd5lgX2pLfOS3QCiwU/PULtXL6M=
github.com/goombaio/orderedmap v0.0.0-20180924084748-ba921b7e2419 h1:SajEQ6tktpF9SRIuzbiPOX9AEZZ53Bvw0k9Mzrts8Lg=
@@ -37,17 +31,12 @@ github.com/k0kubun/go-ansi v0.0.0-20180517002512-3bf9e2903213/go.mod h1:vNUNkEQ1
github.com/klauspost/compress v1.4.1/go.mod h1:RyIbtBH6LamlWaDj8nUwkbUhJ87Yi3uG0guNDohfE1A=
github.com/klauspost/compress v1.17.2 h1:RlWWUY/Dr4fL8qk9YG7DTZ7PDgME2V4csBXA8L/ixi4=
github.com/klauspost/compress v1.17.2/go.mod h1:ntbaceVETuRiXiv4DpjP66DpAtAGkEQskQzEyD//IeE=
github.com/klauspost/cpuid v1.2.0 h1:NMpwD2G9JSFOE1/TJjGSo5zG7Yb2bTe7eq1jH+irmeE=
github.com/klauspost/cpuid v1.2.0/go.mod h1:Pj4uuM528wm8OyEC2QMXAi2YiTZ96dNQPGgoMS4s3ek=
github.com/klauspost/cpuid/v2 v2.0.9 h1:lgaqFMSdTdQYdZ04uHyN2d/eKdOMyi2YLSvlQIBFYa4=
github.com/klauspost/cpuid/v2 v2.0.9/go.mod h1:FInQzS24/EEf25PyTYn52gqo7WaD8xa0213Md/qVLRg=
github.com/klauspost/pgzip v1.2.6 h1:8RXeL5crjEUFnR2/Sn6GJNWtSQ3Dk8pq4CL3jvdDyjU=
github.com/klauspost/pgzip v1.2.6/go.mod h1:Ch1tH69qFZu15pkjo5kYi6mth2Zzwzt50oCQKQE9RUs=
github.com/knz/go-libedit v1.10.1/go.mod h1:MZTVkCWyz0oBc7JOWP3wNAzd002ZbM/5hgShxwh4x8M=
github.com/kr/pretty v0.1.0/go.mod h1:dAy3ld7l9f0ibDNOQOHHMYYIIbhfbHSm3C4ZsoJORNo=
github.com/kr/pretty v0.2.1/go.mod h1:ipq/a2n7PKx3OHsz4KJII5eveXtPO4qwEXGdVfWzfnI=
github.com/kr/pretty v0.3.0 h1:WgNl7dwNpEZ6jJ9k1snq4pZsg7DOEN8hP9Xw0Tsjwk0=
github.com/kr/pretty v0.3.0/go.mod h1:640gp4NfQd8pI5XOwp5fnNeVWj67G7CFk/SaSQn7NBk=
github.com/kr/pretty v0.3.1 h1:flRD4NNwYAUpkphVc1HcthR4KEIFJ65n8Mw5qdRn3LE=
github.com/kr/pretty v0.3.1/go.mod h1:hoEshYVHaxMs3cyo3Yncou5ZscifuDolrwPKZanG3xk=
github.com/kr/pty v1.1.1/go.mod h1:pFQYn66WHrOpPYNljwOMqo10TkYh1fy3cYio2l3bCsQ=
github.com/kr/text v0.1.0/go.mod h1:4Jbv+DJW3UT/LiOwJeYQe1efqtUx/iVham/4vfdArNI=
github.com/kr/text v0.2.0 h1:5Nx0Ya0ZqY2ygV366QzturHI13Jq95ApcVaJBhpS+AY=
@@ -58,17 +47,19 @@ github.com/mattn/go-runewidth v0.0.15 h1:UNAjwbU9l54TA3KzvqLGxwWjHmMgBUVhBiTjelZ
github.com/mattn/go-runewidth v0.0.15/go.mod h1:Jdepj2loyihRzMpdS35Xk/zdY8IAYHsh153qUoGf23w=
github.com/mitchellh/colorstring v0.0.0-20190213212951-d06e56a500db h1:62I3jR2EmQ4l5rM/4FEfDWcRD+abF5XlKShorW5LRoQ=
github.com/mitchellh/colorstring v0.0.0-20190213212951-d06e56a500db/go.mod h1:l0dey0ia/Uv7NcFFVbCLtqEBQbrT4OCwCSKTEv6enCw=
github.com/montanaflynn/stats v0.7.1 h1:etflOAAHORrCC44V+aR6Ftzort912ZU+YLiSTuV8eaE=
github.com/montanaflynn/stats v0.7.1/go.mod h1:etXPPgVO6n31NxCd9KQUMvCM+ve0ruNzt6R8Bnaayow=
github.com/pbnjay/memory v0.0.0-20210728143218-7b4eea64cf58 h1:onHthvaw9LFnH4t2DcNVpwGmV9E1BkGknEliJkfwQj0=
github.com/pbnjay/memory v0.0.0-20210728143218-7b4eea64cf58/go.mod h1:DXv8WO4yhMYhSNPKjeNKa5WY9YCIEBRbNzFFPJbWO6Y=
github.com/pelletier/go-toml/v2 v2.2.4 h1:mye9XuhQ6gvn5h28+VilKrrPoQVanw5PMw/TB0t5Ec4=
github.com/pelletier/go-toml/v2 v2.2.4/go.mod h1:2gIqNv+qfxSVS7cM2xJQKtLSTLUE9V8t9Stt+h56mCY=
github.com/pkg/diff v0.0.0-20210226163009-20ebb0f2a09e/go.mod h1:pJLUxLENpZxwdsKMEsNbx1VGcRFpLqf3715MtcvvzbA=
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/rivo/uniseg v0.2.0/go.mod h1:J6wj4VEh+S6ZtnVlnTBMWIodfgj8LQOQFoIToxlJtxc=
github.com/rivo/uniseg v0.4.4 h1:8TfxU8dW6PdqD27gjM8MVNuicgxIjxpm4K7x4jp8sis=
github.com/rivo/uniseg v0.4.4/go.mod h1:FN3SvrM+Zdj16jyLfmOkMNblXMcoc8DfTHruCPUcx88=
github.com/rogpeppe/go-internal v1.6.1 h1:/FiVV8dS/e+YqF2JvO3yXRFbBLTIuSDkuC7aBOAvL+k=
github.com/rogpeppe/go-internal v1.6.1/go.mod h1:xXDCJY+GAPziupqXw64V24skbSoqbTEfhy4qGm1nDQc=
github.com/rogpeppe/go-internal v1.9.0/go.mod h1:WtVeX8xhTBvf0smdhujwtBcq4Qrzq/fJaraNFVN+nFs=
github.com/rogpeppe/go-internal v1.12.0 h1:exVL4IDcn6na9z1rAb56Vxr+CgyK3nn3O+epU5NdKM8=
github.com/rogpeppe/go-internal v1.12.0/go.mod h1:E+RYuTGaKKdloAfM02xzb0FW3Paa99yedzYV+kq4uf4=
github.com/rrethy/ahocorasick v1.0.0 h1:YKkCB+E5PXc0xmLfMrWbfNht8vG9Re97IHSWZk/Lk8E=
github.com/rrethy/ahocorasick v1.0.0/go.mod h1:nq8oScE7Vy1rOppoQxpQiiDmPHuKCuk9rXrNcxUV3R0=
github.com/schollz/progressbar/v3 v3.13.1 h1:o8rySDYiQ59Mwzy2FELeHY5ZARXZTVJC7iHD6PEFUiE=
@@ -78,50 +69,37 @@ github.com/shopspring/decimal v1.3.1/go.mod h1:DKyhrW/HYNuLGql+MJL6WCR6knT2jwCFR
github.com/sirupsen/logrus v1.9.3 h1:dueUQJ1C2q9oE3F7wvmSGAaVtTmUizReu6fjN8uqzbQ=
github.com/sirupsen/logrus v1.9.3/go.mod h1:naHLuLoDiP4jHNo9R0sCBMtWGeIprob74mVsIT4qYEQ=
github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
github.com/stretchr/objx v0.4.0/go.mod h1:YvHI0jy2hoMjB+UWwv71VJQ9isScKT/TqJzVSSt89Yw=
github.com/stretchr/objx v0.5.0/go.mod h1:Yh+to48EsGEfYuaHDzXPcE3xhTkx73EhmCGUpEOglKo=
github.com/stretchr/testify v1.3.0/go.mod h1:M5WIy9Dh21IEIfnGCwXGc5bZfKNJtfHm1UVUgZn+9EI=
github.com/stretchr/testify v1.7.0/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
github.com/stretchr/testify v1.7.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
github.com/stretchr/testify v1.8.0/go.mod h1:yNjHg4UonilssWZ8iaSj1OCr/vHnekPRkoO+kdMU+MU=
github.com/stretchr/testify v1.8.1/go.mod h1:w2LPCIKwWwSfY2zedu0+kehJoqGctiVI29o6fzry7u4=
github.com/stretchr/testify v1.8.4 h1:CcVxjf3Q8PM0mHUKJCdn+eZZtm5yQwehR5yeSVQQcUk=
github.com/stretchr/testify v1.8.4/go.mod h1:sz/lmYIOXD/1dqDmKjjqLyZ2RngseejIcXlSw2iwfAo=
github.com/tevino/abool/v2 v2.1.0 h1:7w+Vf9f/5gmKT4m4qkayb33/92M+Um45F2BkHOR+L/c=
github.com/tevino/abool/v2 v2.1.0/go.mod h1:+Lmlqk6bHDWHqN1cbxqhwEAwMPXgc8I1SDEamtseuXY=
github.com/twitchyliquid64/golang-asm v0.15.1 h1:SU5vSMR7hnwNxj24w34ZyCi/FmDZTkS4MhqMhdFk5YI=
github.com/twitchyliquid64/golang-asm v0.15.1/go.mod h1:a1lVb/DtPvCB8fslRZhAngC2+aY1QWCk3Cedj/Gdt08=
github.com/ulikunitz/xz v0.5.6/go.mod h1:2bypXElzHzzJZwzH67Y6wb67pO62Rzfn7BSiF4ABRW8=
github.com/ulikunitz/xz v0.5.11 h1:kpFauv27b6ynzBNT/Xy+1k+fK4WswhN/6PN5WhFAGw8=
github.com/ulikunitz/xz v0.5.11/go.mod h1:nbz6k7qbPmH4IRqmfOplQw/tblSgqTqBwxkY0oWt/14=
github.com/yuin/gopher-lua v1.1.1 h1:kYKnWBjvbNP4XLT3+bPEwAXJx262OhaHDWDVOPjL46M=
github.com/yuin/gopher-lua v1.1.1/go.mod h1:GBR0iDaNXjAgGg9zfCvksxSRnQx76gclCIb7kdAd1Pw=
golang.org/x/arch v0.0.0-20210923205945-b76863e36670 h1:18EFjUmQOcUvxNYSkA6jO9VAiXCnxFY6NyDX0bHDmkU=
golang.org/x/arch v0.0.0-20210923205945-b76863e36670/go.mod h1:5om86z9Hs0C8fWVUuoMHwpExlXzs5Tkyp9hOrfG7pp8=
golang.org/x/exp v0.0.0-20231006140011-7918f672742d h1:jtJma62tbqLibJ5sFQz8bKtEM8rJBtfilJ2qTU199MI=
golang.org/x/exp v0.0.0-20231006140011-7918f672742d/go.mod h1:ldy0pHrwJyGW56pPQzzkH36rKxoZW1tw7ZJpeKx+hdo=
golang.org/x/net v0.17.0 h1:pVaXccu2ozPjCXewfr1S7xza/zcXTity9cCdXQYSjIM=
golang.org/x/net v0.17.0/go.mod h1:NxSsAGuq816PNPmqtQdLE42eU2Fs7NoRIZrHJAlaCOE=
golang.org/x/exp v0.0.0-20231110203233-9a3e6036ecaa h1:FRnLl4eNAQl8hwxVVC17teOw8kdjVDVAiFMtgUdTSRQ=
golang.org/x/exp v0.0.0-20231110203233-9a3e6036ecaa/go.mod h1:zk2irFbV9DP96SEBUUAy67IdHUaZuSnrz1n472HUCLE=
golang.org/x/net v0.35.0 h1:T5GQRQb2y08kTAByq9L4/bz8cipCdA8FbRTXewonqY8=
golang.org/x/net v0.35.0/go.mod h1:EglIi67kWsHKlRzzVMUD93VMSWGFOMSZgxFjparz1Qk=
golang.org/x/sys v0.0.0-20220715151400-c0bba94af5f8/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.0.0-20220811171246-fbc7d0a398ab/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.6.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.17.0 h1:25cE3gD+tdBA7lp7QfhuV+rJiE9YXTcS3VG1SqssI/Y=
golang.org/x/sys v0.17.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA=
golang.org/x/sys v0.30.0 h1:QjkSwP/36a20jFYWkSue1YwXzLmsV5Gfq7Eiy72C1uc=
golang.org/x/sys v0.30.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA=
golang.org/x/term v0.6.0/go.mod h1:m6U89DPEgQRMq3DNkDClhWw02AUbt2daBVO4cn4Hv9U=
golang.org/x/term v0.13.0 h1:bb+I9cTfFazGW51MZqBVmZy7+JEJMouUHTUSKVQLBek=
golang.org/x/term v0.13.0/go.mod h1:LTmsnFJwVN6bCy1rVCoS+qHT1HhALEFxKncY3WNNh4U=
golang.org/x/term v0.29.0 h1:L6pJp37ocefwRRtYPKSWOWzOtWSxVajvz2ldH/xi3iU=
golang.org/x/term v0.29.0/go.mod h1:6bl4lRlvVuDgSf3179VpIxBF0o10JUpXWOnI7nErv7s=
gonum.org/v1/gonum v0.14.0 h1:2NiG67LD1tEH0D7kM+ps2V+fXmsAnpUeec7n8tcr4S0=
gonum.org/v1/gonum v0.14.0/go.mod h1:AoWeoz0becf9QMWtE8iWXNXc27fK4fNeHNf/oMejGfU=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c h1:Hei/4ADfdWqJk1ZMxUNpqntNwaWcugrBjAiHlqqRiVk=
gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c/go.mod h1:JHkPIbrfpd72SG/EVd6muEfDQjcINNoR0C8j2r3qZ4Q=
gopkg.in/errgo.v2 v2.1.0/go.mod h1:hNsd1EY+bozCKY1Ytp96fpM3vjJbqLJn88ws8XvfDNI=
gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
nullprogram.com/x/optparse v1.0.0/go.mod h1:KdyPE+Igbe0jQUrVfMqDMeJQIJZEuyV7pjYmp6pbG50=
rsc.io/pdf v0.1.1/go.mod h1:n8OzWcQ6Sp37PL01nO98y4iUCRdTGarVfzxY20ICaU4=
scientificgo.org/special v0.0.0 h1:P6WJkECo6tgtvZAEfNXl+KEB9ReAatjKAeX8U07mjSc=
scientificgo.org/special v0.0.0/go.mod h1:LoGVh9tS431RLTJo7gFlYDKFWq44cEb7QqL+M0EKtZU=
scientificgo.org/testutil v0.0.0 h1:y356DHRo0tAz9zIFmxlhZoKDlHPHaWW/DCm9k3PhIMA=


@@ -1,3 +1,5 @@
go 1.22.1
go 1.23.4
toolchain go1.24.2
use .


@@ -2,7 +2,6 @@ git.sr.ht/~sbinet/gg v0.3.1 h1:LNhjNn8DerC8f9DHLz6lS0YYul/b602DUxDgGkd/Aik=
git.sr.ht/~sbinet/gg v0.3.1/go.mod h1:KGYtlADtqsqANL9ueOFkWymvzUvLMQllU5Ixo+8v3pc=
github.com/ajstarks/svgo v0.0.0-20211024235047-1546f124cd8b h1:slYM766cy2nI3BwyRiyQj/Ud48djTMtMebDqepE95rw=
github.com/ajstarks/svgo v0.0.0-20211024235047-1546f124cd8b/go.mod h1:1KcenG0jGWcpt8ov532z81sp/kMMUG485J2InIOyADM=
github.com/buger/jsonparser v1.1.1/go.mod h1:6RYKKt7H4d4+iWqouImQ9R2FZql3VbhNgx27UK13J/0=
github.com/chzyer/logex v1.1.10 h1:Swpa1K6QvQznwJRcfTfQJmTE72DqScAa40E+fbHEXEE=
github.com/chzyer/logex v1.1.10/go.mod h1:+Ywpsq7O8HXn0nuIou7OrIPyXbp3wmkHB+jjWRnGsAI=
github.com/chzyer/logex v1.2.0 h1:+eqR0HfOetur4tgnC8ftU5imRnhi4te+BadWS95c5AM=
@@ -20,15 +19,20 @@ github.com/go-latex/latex v0.0.0-20230307184459-12ec69307ad9 h1:NxXI5pTAtpEaU49b
github.com/go-latex/latex v0.0.0-20230307184459-12ec69307ad9/go.mod h1:gWuR/CrFDDeVRFQwHPvsv9soJVB/iqymhuZQuJ3a9OM=
github.com/go-pdf/fpdf v0.6.0 h1:MlgtGIfsdMEEQJr2le6b/HNr1ZlQwxyWr77r2aj2U/8=
github.com/go-pdf/fpdf v0.6.0/go.mod h1:HzcnA+A23uwogo0tp9yU+l3V+KXhiESpt1PMayhOh5M=
github.com/go-quicktest/qt v1.101.0/go.mod h1:14Bz/f7NwaXPtdYEgzsx46kqSxVwTbzVZsDC26tQJow=
github.com/goccmack/gocc v0.0.0-20230228185258-2292f9e40198 h1:FSii2UQeSLngl3jFoR4tUKZLprO7qUlh/TKKticc0BM=
github.com/goccmack/gocc v0.0.0-20230228185258-2292f9e40198/go.mod h1:DTh/Y2+NbnOVVoypCCQrovMPDKUGp4yZpSbWg5D0XIM=
github.com/golang/freetype v0.0.0-20170609003504-e2365dfdc4a0 h1:DACJavvAHhabrF08vX0COfcOBJRhZ8lUbR+ZWIs0Y5g=
github.com/golang/freetype v0.0.0-20170609003504-e2365dfdc4a0/go.mod h1:E/TSTwGwJL78qG/PmXZO1EjYhfJinVAhrmmHX6Z8B9k=
github.com/google/go-cmdtest v0.4.1-0.20220921163831-55ab3332a786/go.mod h1:apVn/GCasLZUVpAJ6oWAuyP7Ne7CEsQbTnc0plM3m+o=
github.com/google/go-cmp v0.5.8 h1:e6P7q2lk1O+qJJb4BtCQXlK8vWEO8V1ZeuEdJNOqZyg=
github.com/google/go-cmp v0.5.8/go.mod h1:17dUlkBOakJ0+DkrSSNjCkIjxS6bF9zb3elmeNGIjoY=
github.com/google/renameio v0.1.0/go.mod h1:KWCgfxg9yswjAJkECMjeO8J8rahYeXnNhOm40UhjYkI=
github.com/google/safehtml v0.1.0/go.mod h1:L4KWwDsUJdECRAEpZoBn3O64bQaywRscowZjJAzjHnU=
github.com/hashicorp/errwrap v1.0.0/go.mod h1:YH+1FKiLXxHSkmPseP+kNlulaMuP3n2brvKWEqk/Jc4=
github.com/hashicorp/go-multierror v1.1.1/go.mod h1:iw975J/qwKPdAO1clOe2L8331t/9/fmwbPZ6JB6eMoM=
github.com/ianlancetaylor/demangle v0.0.0-20220319035150-800ac71e25c2 h1:rcanfLhLDA8nozr/K289V1zcntHr3V+SHlXwzz1ZI2g=
github.com/jba/templatecheck v0.7.1/go.mod h1:n1Etw+Rrw1mDDD8dDRsEKTwMZsJ98EkktgNJC6wLUGo=
github.com/k0kubun/go-ansi v0.0.0-20180517002512-3bf9e2903213 h1:qGQQKEcAR99REcMpsXCp3lJ03zYT1PkRd3kQGPn9GVg=
github.com/klauspost/cpuid v1.2.0 h1:NMpwD2G9JSFOE1/TJjGSo5zG7Yb2bTe7eq1jH+irmeE=
github.com/kr/pty v1.1.1 h1:VkoXIwSboBpnk99O/KFauAEILuNHv5DVFKZMBN/gUgw=
@@ -39,17 +43,22 @@ github.com/stretchr/objx v0.1.0 h1:4G4v2dO3VZwixGIRoQ5Lfboy6nUhCyYzaqnIAPPhYs4=
github.com/stretchr/objx v0.5.0 h1:1zr/of2m5FGMsad5YfcqgdqdWrIhu+EBEJRhR1U7z/c=
github.com/stretchr/objx v0.5.0/go.mod h1:Yh+to48EsGEfYuaHDzXPcE3xhTkx73EhmCGUpEOglKo=
github.com/yuin/goldmark v1.4.13 h1:fVcFKWvrslecOb/tg+Cc05dkeYx540o0FuFt3nUVDoE=
github.com/yuin/goldmark v1.4.13/go.mod h1:6yULJ656Px+3vBD8DxQVa3kxgyrAnzto9xy5taEt/CY=
golang.org/x/crypto v0.14.0 h1:wBqGXzWJW6m1XrIKlAH0Hs1JJ7+9KBwnIO8v66Q9cHc=
golang.org/x/crypto v0.14.0/go.mod h1:MVFd36DqK4CsrnJYDkBA3VC4m2GkXAM0PvzMCn4JQf4=
golang.org/x/crypto v0.33.0/go.mod h1:bVdXmD7IV/4GdElGPozy6U7lWdRXA4qyRVGJV57uQ5M=
golang.org/x/image v0.6.0 h1:bR8b5okrPI3g/gyZakLZHeWxAR8Dn5CyxXv1hLH5g/4=
golang.org/x/image v0.6.0/go.mod h1:MXLdDR43H7cDJq5GEGXEVeeNhPgi+YYEQ2pC1byI1x0=
golang.org/x/mod v0.13.0 h1:I/DsJXRlw/8l/0c24sM9yb0T4z9liZTduXvdAWYiysY=
golang.org/x/mod v0.13.0/go.mod h1:hTbmBsO62+eylJbnUtE2MGJUyE7QWk4xUqPFrRgJ+7c=
golang.org/x/mod v0.14.0/go.mod h1:hTbmBsO62+eylJbnUtE2MGJUyE7QWk4xUqPFrRgJ+7c=
golang.org/x/sync v0.0.0-20220722155255-886fb9371eb4 h1:uVc8UZUe6tr40fFVnUP5Oj+veunVezqYl9z7DYw9xzw=
golang.org/x/text v0.13.0 h1:ablQoSUd0tRdKxZewP80B+BaqeKJuVhuRxj/dkrun3k=
golang.org/x/text v0.13.0/go.mod h1:TvPlkZtksWOMsz7fbANvkp4WM8x/WCo/om8BMLbz+aE=
golang.org/x/text v0.22.0/go.mod h1:YRoo4H8PVmsu+E3Ou7cqLVH8oXWIHVoX0jqUWALQhfY=
golang.org/x/tools v0.14.0 h1:jvNa2pY0M4r62jkRQ6RwEZZyPcymeL9XZMLBbV7U2nc=
golang.org/x/tools v0.14.0/go.mod h1:uYBEerGOWcJyEORxN+Ek8+TT266gXkNlHdJBwexUsBg=
golang.org/x/tools v0.15.0/go.mod h1:hpksKq4dtpQWS1uQ61JkdqWM3LscIS6Slf+VVkm+wQk=
golang.org/x/xerrors v0.0.0-20190717185122-a985d3407aa7 h1:9zdDQZ7Thm29KFXgAX/+yaf3eVbP7djjWp/dXAppNCc=
gonum.org/v1/plot v0.10.1 h1:dnifSs43YJuNMDzB7v8wV64O4ABBHReuAVAoBxqBqS4=
gonum.org/v1/plot v0.10.1/go.mod h1:VZW5OlhkL1mysU9vaqNHnsy86inf6Ot+jB3r+BczCEo=


@@ -1,27 +1,56 @@
#!/bin/bash
INSTALL_DIR="/usr/local"
OBITOOLS_PREFIX=""
# default values
# Default values
URL="https://go.dev/dl/"
OBIURL4="https://github.com/metabarcoding/obitools4/archive/refs/heads/master.zip"
GITHUB_REPO="https://github.com/metabarcoding/obitools4"
INSTALL_DIR="/usr/local"
OBITOOLS_PREFIX=""
VERSION=""
LIST_VERSIONS=false
# help message
# Help message
function display_help {
echo "Usage: $0 [OPTIONS]"
echo ""
echo "Options:"
echo " -i, --install-dir Directory where obitools are installed "
echo " (as example use /usr/local not /usr/local/bin)."
echo " (e.g., use /usr/local not /usr/local/bin)."
echo " -p, --obitools-prefix Prefix added to the obitools command names if you"
echo " want to have several versions of obitools at the"
echo " same time on your system (as example -p g will produce "
echo " same time on your system (e.g., -p g will produce "
echo " gobigrep command instead of obigrep)."
echo " -v, --version Install a specific version (e.g., 4.4.8)."
echo " If not specified, installs the latest version."
echo " -l, --list List all available versions and exit."
echo " -h, --help Display this help message."
echo ""
echo "Examples:"
echo " $0 # Install latest version"
echo " $0 -l # List available versions"
echo " $0 -v 4.4.8 # Install specific version"
echo " $0 -i /opt/local # Install to custom directory"
}
# List available versions from GitHub releases
function list_versions {
echo "Fetching available versions..." 1>&2
echo ""
curl -s "https://api.github.com/repos/metabarcoding/obitools4/releases" \
| grep '"tag_name":' \
| sed -E 's/.*"tag_name": "Release_([0-9.]+)".*/\1/' \
| sort -V -r
}
# Get latest version from GitHub releases
function get_latest_version {
curl -s "https://api.github.com/repos/metabarcoding/obitools4/releases" \
| grep '"tag_name":' \
| sed -E 's/.*"tag_name": "Release_([0-9.]+)".*/\1/' \
| sort -V -r \
| head -1
}
# Parse command line arguments
while [ "$#" -gt 0 ]; do
case "$1" in
-i|--install-dir)
@@ -32,62 +61,104 @@ while [ "$#" -gt 0 ]; do
OBITOOLS_PREFIX="$2"
shift 2
;;
-v|--version)
VERSION="$2"
shift 2
;;
-l|--list)
LIST_VERSIONS=true
shift
;;
-h|--help)
display_help 1>&2
display_help
exit 0
;;
*)
echo "Error: Unsupported option $1" 1>&2
echo "Error: Unsupported option $1" 1>&2
display_help 1>&2
exit 1
;;
esac
done
# the directory from where the script is run
# List versions and exit if requested
if [ "$LIST_VERSIONS" = true ]; then
echo "Available OBITools4 versions:"
echo "=============================="
list_versions
exit 0
fi
# Determine version to install
if [ -z "$VERSION" ]; then
echo "Fetching latest version..." 1>&2
VERSION=$(get_latest_version)
if [ -z "$VERSION" ]; then
echo "Error: Could not determine latest version" 1>&2
exit 1
fi
echo "Latest version: $VERSION" 1>&2
else
echo "Installing version: $VERSION" 1>&2
fi
# Construct source URL for the specified version
OBIURL4="${GITHUB_REPO}/archive/refs/tags/Release_${VERSION}.zip"
# The directory from where the script is run
DIR="$(pwd)"
# the temp directory used, within $DIR
# omit the -p parameter to create a temporal directory in the default location
# WORK_DIR=$(mktemp -d -p "$DIR" "obitools4.XXXXXX" 2> /dev/null || \
# mktemp -d -t "$DIR" "obitools4.XXXXXX")
# Create temporary directory
WORK_DIR=$(mktemp -d "obitools4.XXXXXX")
# check if tmp dir was created
# Check if tmp dir was created
if [[ ! "$WORK_DIR" || ! -d "$WORK_DIR" ]]; then
echo "Could not create temp dir" 1>&2
echo "Could not create temp dir" 1>&2
exit 1
fi
mkdir -p "${WORK_DIR}/cache" \
|| { echo "Cannot create ${WORK_DIR}/cache directory" 1>&2
exit 1; }
# Create installation directory
mkdir -p "${INSTALL_DIR}/bin" 2> /dev/null \
|| (echo "Please enter your password for installing obitools in ${INSTALL_DIR}" 1>&2
|| (echo "Please enter your password for installing obitools in ${INSTALL_DIR}" 1>&2
sudo mkdir -p "${INSTALL_DIR}/bin")
if [[ ! -d "${INSTALL_DIR}/bin" ]]; then
echo "Could not create ${INSTALL_DIR}/bin directory for installing obitools" 1>&2
echo "Could not create ${INSTALL_DIR}/bin directory for installing obitools" 1>&2
exit 1
fi
INSTALL_DIR="$(cd $INSTALL_DIR && pwd)"
INSTALL_DIR="$(cd "${INSTALL_DIR}" && pwd)"
echo WORK_DIR=$WORK_DIR 1>&2
echo INSTALL_DIR=$INSTALL_DIR 1>&2
echo OBITOOLS_PREFIX=$OBITOOLS_PREFIX 1>&2
echo "================================" 1>&2
echo "OBITools4 Installation" 1>&2
echo "================================" 1>&2
echo "VERSION=$VERSION" 1>&2
echo "WORK_DIR=$WORK_DIR" 1>&2
echo "INSTALL_DIR=$INSTALL_DIR" 1>&2
echo "OBITOOLS_PREFIX=$OBITOOLS_PREFIX" 1>&2
echo "================================" 1>&2
pushd "$WORK_DIR"|| exit
pushd "$WORK_DIR" > /dev/null || exit
# Detect OS and architecture
OS=$(uname -a | awk '{print $1}')
ARCH=$(uname -m)
if [[ "$ARCH" == "x86_64" ]] ; then
ARCH="amd64"
if [[ "$ARCH" == "x86_64" ]] ; then
ARCH="amd64"
fi
if [[ "$ARCH" == "aarch64" ]] ; then
ARCH="arm64"
if [[ "$ARCH" == "aarch64" ]] ; then
ARCH="arm64"
fi
GOFILE=$(curl "$URL" \
# Download and install Go
echo "Downloading Go..." 1>&2
GOFILE=$(curl -s "$URL" \
| grep 'class="download"' \
| grep "\.tar\.gz" \
| sed -E 's@^.*/dl/(go[1-9].+\.tar\.gz)".*$@\1@' \
@@ -95,35 +166,71 @@ GOFILE=$(curl "$URL" \
| grep -i "$ARCH" \
| head -1)
GOURL=$(curl "${URL}${GOFILE}" \
GOURL=$(curl -s "${URL}${GOFILE}" \
| sed -E 's@^.*href="(.*\.tar\.gz)".*$@\1@')
echo "Install GO from : $GOURL" 1>&2
curl "$GOURL" \
| tar zxf -
echo "Installing Go from: $GOURL" 1>&2
curl -s "$GOURL" | tar zxf -
PATH="$(pwd)/go/bin:$PATH"
export PATH
GOPATH="$(pwd)/go"
export GOPATH
export GOCACHE="$(pwd)/cache"
curl -L "$OBIURL4" > master.zip
unzip master.zip
echo "GOCACHE=$GOCACHE" 1>&2
mkdir -p "$GOCACHE"
echo "Install OBITOOLS from : $OBIURL4"
# Download OBITools4 source
echo "Downloading OBITools4 v${VERSION}..." 1>&2
echo "Source URL: $OBIURL4" 1>&2
cd obitools4-master || exit
if [[ -z "$OBITOOLS_PREFIX" ]] ; then
make
else
make OBITOOLS_PREFIX="${OBITOOLS_PREFIX}"
if ! curl -fsL "$OBIURL4" > obitools4.zip; then
echo "Error: Could not download OBITools4 version ${VERSION}" 1>&2
echo "Please check that this version exists with: $0 --list" 1>&2
exit 1
fi
unzip -q obitools4.zip
# Find the extracted directory
OBITOOLS_DIR=$(ls -d obitools4-* 2>/dev/null | head -1)
if [ -z "$OBITOOLS_DIR" ] || [ ! -d "$OBITOOLS_DIR" ]; then
echo "Error: Could not find extracted OBITools4 directory" 1>&2
exit 1
fi
echo "Building OBITools4..." 1>&2
cd "$OBITOOLS_DIR" || exit
mkdir -p vendor
# Build with or without prefix
if [[ -z "$OBITOOLS_PREFIX" ]] ; then
make GOFLAGS="-buildvcs=false"
else
make GOFLAGS="-buildvcs=false" OBITOOLS_PREFIX="${OBITOOLS_PREFIX}"
fi
# Install binaries
echo "Installing binaries to ${INSTALL_DIR}/bin..." 1>&2
(cp build/* "${INSTALL_DIR}/bin" 2> /dev/null) \
|| (echo "Please enter your password for installing obitools in ${INSTALL_DIR}"
|| (echo "Please enter your password for installing obitools in ${INSTALL_DIR}" 1>&2
sudo cp build/* "${INSTALL_DIR}/bin")
popd || exit
popd > /dev/null || exit
# Cleanup
echo "Cleaning up..." 1>&2
chmod -R +w "$WORK_DIR"
rm -rf "$WORK_DIR"
echo "" 1>&2
echo "================================" 1>&2
echo "OBITools4 v${VERSION} installed successfully!" 1>&2
echo "Binaries location: ${INSTALL_DIR}/bin" 1>&2
if [[ -n "$OBITOOLS_PREFIX" ]] ; then
echo "Command prefix: ${OBITOOLS_PREFIX}" 1>&2
fi
echo "================================" 1>&2
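The `list_versions` and `get_latest_version` functions in the script above rely on the same grep/sed/sort pipeline. As a minimal sketch, assuming release tags always follow the `Release_X.Y.Z` pattern, that pipeline can be tried offline on a hand-written sample. The `extract_versions` helper and the sample JSON are hypothetical and only illustrate the technique; they are not part of the install script.

```bash
#!/bin/bash
# Hypothetical helper (not in the install script): the same
# grep/sed/sort pipeline, reading the JSON from stdin instead of curl.
extract_versions() {
    grep '"tag_name":' \
        | sed -E 's/.*"tag_name": "Release_([0-9.]+)".*/\1/' \
        | sort -V -r
}

# Made-up sample mimicking the shape of a pretty-printed releases listing.
sample_json='[
  {
    "tag_name": "Release_4.4.8",
    "name": "older"
  },
  {
    "tag_name": "Release_4.4.15",
    "name": "newest"
  },
  {
    "tag_name": "Release_4.4.9",
    "name": "middle"
  }
]'

# All versions, newest first (sort -V compares version fields numerically).
printf '%s\n' "$sample_json" | extract_versions

# Latest version only, as get_latest_version does with head -1.
latest=$(printf '%s\n' "$sample_json" | extract_versions | head -1)
echo "latest=$latest"
```

The version sort matters here: a plain lexical reverse sort would place 4.4.9 ahead of 4.4.15. `sort -V` is not specified by POSIX, so the sketch assumes a `sort` implementation that supports it. Note also that `sed` prints non-matching lines unchanged, so a tag that does not follow the `Release_` pattern would pass through the pipeline as a raw JSON line.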


@@ -0,0 +1,109 @@
#!/bin/bash
#
# Give the name of the test series here
#
TEST_NAME=obiannotate
CMD=obiannotate
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Remove the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find "$TEST_DIR" | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test:
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then / else
#### - The command must return 0 on success
#### - The command must return a non-zero exit code on failure
#### - The data files are stored in the same directory as the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup
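The comment block in the script above describes the pattern each test follows: increment `ntest`, run the command as the condition of an `if`, then increment `success` or `failed`. The sketch below shows that pattern with one passing and one failing test. It assumes stand-in commands (`fake_ok` and `fake_fail`, both hypothetical) so that it runs without OBITools installed, and it prints a summary instead of calling `exit` as the real `cleanup` function does.

```bash
#!/bin/bash
# Sketch of the test pattern with stand-in commands.
# fake_ok and fake_fail are hypothetical; they replace a real OBITools command.
TEST_NAME=demo
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0

log() {
    echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}

fake_ok() { echo "usage: fake_ok [options]"; }   # always returns 0
fake_fail() { return 3; }                         # always returns non-zero

# Test 1: the command succeeds, so the then clause runs.
((ntest++))
if fake_ok > "${TMPDIR}/help.txt" 2>&1
then
    log "fake_ok: printing help OK"
    ((success++))
else
    log "fake_ok: printing help failed"
    ((failed++))
fi

# Test 2: the command fails, so the else clause runs.
((ntest++))
if fake_fail > "${TMPDIR}/out.txt" 2>&1
then
    log "fake_fail: run OK"
    ((success++))
else
    log "fake_fail: run failed"
    ((failed++))
fi

rm -rf "$TMPDIR"
echo "$ntest tests run, $success succeeded, $failed failed"
```

One design point worth knowing: `((ntest++))` returns a non-zero status when the counter was 0 before the increment, so this counting style assumes the script is not run under `set -e`.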


@@ -0,0 +1,109 @@
#!/bin/bash
#
# Give the name of the test series here
#
TEST_NAME=obiclean
CMD=obiclean
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Remove the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find "$TEST_DIR" | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test:
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then / else
#### - The command must return 0 on success
#### - The command must return a non-zero exit code on failure
#### - The data files are stored in the same directory as the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup
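Two lines in the setup section of these test scripts rely on bash parameter expansion: `MCMD` is built from substrings of `CMD`, and `OBITOOLS_DIR` is derived from `TEST_DIR` by pattern substitution. The sketch below shows what each expansion produces. The checkout path is a hypothetical example chosen for illustration, not the project's actual layout.

```bash
#!/bin/bash
CMD=obiclean

# ${CMD:0:4} is the first four characters, ${CMD:4} is the rest.
# The first part is upper-cased, the second part lower-cased.
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
echo "$MCMD"

# Hypothetical location of a test script directory inside a checkout.
TEST_DIR="/home/user/obitools4/obitest/obitools/obiclean"

# ${TEST_DIR/obitest*/} deletes everything from the first "obitest"
# to the end of the string; "build" is then appended.
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
echo "$OBITOOLS_DIR"
```

With these inputs the sketch prints `OBIClean` and `/home/user/obitools4/build`. The substitution only works as intended when the test directory sits somewhere below a directory whose name starts with `obitest`; otherwise the pattern does not match and `build` is appended to the unchanged path.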


@@ -0,0 +1,109 @@
#!/bin/bash
#
# Give the name of the test series here
#
TEST_NAME=obicleandb
CMD=obicleandb
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Remove the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find "$TEST_DIR" | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test:
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then / else
#### - The command must return 0 on success
#### - The command must return a non-zero exit code on failure
#### - The data files are stored in the same directory as the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup


@@ -0,0 +1,109 @@
#!/bin/bash
#
# Give the name of the test series here
#
TEST_NAME=obicomplement
CMD=obicomplement
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Remove the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find "$TEST_DIR" | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test:
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then / else
#### - The command must return 0 on success
#### - The command must return a non-zero exit code on failure
#### - The data files are stored in the same directory as the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup


@@ -0,0 +1,109 @@
#!/bin/bash
#
# Give the name of the test series here
#
TEST_NAME=obiconsensus
CMD=obiconsensus
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Remove the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find "$TEST_DIR" | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test:
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then / else
#### - The command must return 0 on success
#### - The command must return a non-zero exit code on failure
#### - The data files are stored in the same directory as the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

Binary file not shown.


@@ -0,0 +1,144 @@
#!/bin/bash
#
# Give the name of the test series here
#
TEST_NAME=obiconvert
CMD=obiconvert
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
if [ -z "$TEST_DIR" ] ; then
TEST_DIR="."
fi
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Remove the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find "$TEST_DIR" | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test:
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then / else
#### - The command must return 0 on success
#### - The command must return a non-zero exit code on failure
#### - The data files are stored in the same directory as the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
((ntest++))
if obiconvert -Z "${TEST_DIR}/gbpln1088.4Mb.fasta.gz" \
> "${TMPDIR}/xxx.fasta.gz" && \
zdiff "${TEST_DIR}/gbpln1088.4Mb.fasta.gz" \
"${TMPDIR}/xxx.fasta.gz"
then
log "$MCMD: converting large fasta file to fasta OK"
((success++))
else
log "$MCMD: converting large fasta file to fasta failed"
((failed++))
fi
((ntest++))
if obiconvert -Z --fastq-output \
"${TEST_DIR}/gbpln1088.4Mb.fasta.gz" \
> "${TMPDIR}/xxx.fastq.gz" && \
obiconvert -Z --fasta-output \
"${TMPDIR}/xxx.fastq.gz" \
> "${TMPDIR}/yyy.fasta.gz" && \
zdiff "${TEST_DIR}/gbpln1088.4Mb.fasta.gz" \
"${TMPDIR}/yyy.fasta.gz"
then
log "$MCMD: converting large file between fasta and fastq OK"
((success++))
else
log "$MCMD: converting large file between fasta and fastq failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup
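
The bookkeeping described in the comment block (increment `ntest`, run the command as an `if` condition, then bump `success` or `failed`) is repeated by hand for every test. A minimal sketch of how it could be factored out is shown here; `run_test` is a hypothetical helper that does not exist in these scripts, and the `log` function is a simplified stand-in for the one defined above.

```bash
#!/bin/bash
# Sketch only: run_test is a hypothetical helper, not part of the
# original test scripts. It wraps the counter bookkeeping that each
# test otherwise repeats by hand.
ntest=0
success=0
failed=0

# Simplified stand-in for the log function of the test scripts.
log() {
    echo -e "[sketch @ $(date)] $*" 1>&2
}

# run_test LABEL COMMAND [ARGS...]
# Runs COMMAND as the condition of an if, exactly as the test
# template asks, and updates the three counters.
run_test() {
    local label="$1"
    shift
    ntest=$((ntest + 1))
    if "$@"; then
        log "$label OK"
        success=$((success + 1))
    else
        log "$label failed"
        failed=$((failed + 1))
    fi
}

# Two trivial commands stand in for real obitools invocations.
run_test "always succeeds" true
run_test "always fails" false
```

A test that needs redirections or a `&&` chain, like the round-trip conversions above, would first have to be wrapped in a small function so that it can be passed to `run_test` as a single command.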


@@ -0,0 +1,163 @@
#!/bin/bash
#
# Give the name of the test series here
#
TEST_NAME=obicount
CMD=obicount
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Remove the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find "$TEST_DIR" | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test:
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then / else
#### - The command must return 0 on success
#### - The command must return a non-zero exit code on failure
#### - The data files are stored in the same directory as the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
((ntest++))
if obicount "${TEST_DIR}/wolf_F.fasta.gz" \
> "${TMPDIR}/wolf_F.fasta_count.csv"
then
log "OBICount: fasta reading OK"
((success++))
else
log "OBICount: fasta reading failed"
((failed++))
fi
((ntest++))
if obicount "${TEST_DIR}/wolf_F.fastq.gz" \
> "${TMPDIR}/wolf_F.fastq_count.csv"
then
log "OBICount: fastq reading OK"
((success++))
else
log "OBICount: fastq reading failed"
((failed++))
fi
((ntest++))
if obicount "${TEST_DIR}/wolf_F.csv.gz" \
> "${TMPDIR}/wolf_F.csv_count.csv"
then
log "OBICount: csv reading OK"
((success++))
else
log "OBICount: csv reading failed"
((failed++))
fi
((ntest++))
if diff "${TMPDIR}/wolf_F.fasta_count.csv" \
"${TMPDIR}/wolf_F.fastq_count.csv" > /dev/null
then
log "OBICount: fasta and fastq counts are identical OK"
((success++))
else
log "OBICount: fasta and fastq counts differ: failed"
((failed++))
fi
((ntest++))
if diff "${TMPDIR}/wolf_F.fasta_count.csv" \
"${TMPDIR}/wolf_F.csv_count.csv" > /dev/null
then
log "OBICount: fasta and csv counts are identical OK"
((success++))
else
log "OBICount: fasta and csv counts differ: failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup
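
The last two tests check that counting gives the same result whatever the input format, using pairwise `diff` calls against the fasta result. As a rough sketch of the same idea for any number of files, the hypothetical helper `all_identical` below compares every file to the first one with `cmp -s`. The helper name and the file contents are invented for the illustration and are not real obicount output.

```bash
#!/bin/bash
# Sketch only: all_identical is a hypothetical helper, not part of the
# original script. It succeeds only when every listed file has exactly
# the same content as the first one.
all_identical() {
    local reference="$1"
    shift
    local other
    for other in "$@"; do
        cmp -s "$reference" "$other" || return 1
    done
    return 0
}

# Stand-in data: invented contents, not real obicount output.
workdir="$(mktemp -d)"
printf 'key,value\nx,1\n' > "${workdir}/a.csv"
cp "${workdir}/a.csv" "${workdir}/b.csv"
printf 'key,value\nx,2\n' > "${workdir}/c.csv"

# a and b are copies of each other, so this comparison succeeds.
if all_identical "${workdir}/a.csv" "${workdir}/b.csv"; then
    same_ab=0
else
    same_ab=1
fi

# c differs from a, so the three-way comparison fails.
if all_identical "${workdir}/a.csv" "${workdir}/b.csv" "${workdir}/c.csv"; then
    same_abc=0
else
    same_abc=1
fi

rm -rf "$workdir"
```

Unlike separate `diff` tests, a single three-way check does not report which pair differs, so the pairwise form used in the script gives more precise failure messages.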

Binary file not shown.

Binary file not shown.

Binary file not shown.

obitests/obitools/obicsv/test.sh Executable file

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Give the name of the test series here
#
TEST_NAME=obicsv
CMD=obicsv
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Remove the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find "$TEST_DIR" | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test:
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then / else
#### - The command must return 0 on success
#### - The command must return a non-zero exit code on failure
#### - The data files are stored in the same directory as the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup
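
Two of the variable definitions shared by every script rely on bash parameter expansion and are easy to misread: `OBITOOLS_DIR` is derived by deleting everything from the first `obitest` onwards in `TEST_DIR`, and `MCMD` upper-cases the first four characters of `CMD`. The sketch below evaluates both on a made-up path; the `/home/user/obitools4` prefix is an assumption chosen only for the illustration.

```bash
#!/bin/bash
# Sketch only: the TEST_DIR value is a made-up example path.
TEST_DIR="/home/user/obitools4/obitests/obitools/obicsv"
CMD=obicsv

# ${VAR/pattern/} removes the first (longest) match of the pattern.
# "obitest*" matches from "obitests" to the end of the string, so
# only the repository root remains, to which "build" is appended.
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"

# ${CMD:0:4} is the first four characters ("obic"), ${CMD:4} the rest.
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"

echo "$OBITOOLS_DIR"   # /home/user/obitools4/build
echo "$MCMD"           # OBICsv
```

Because the pattern is anchored on the literal text `obitest`, the derivation only works when the script lives somewhere below a directory whose name starts with `obitest`.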


@@ -0,0 +1,109 @@
#!/bin/bash
#
# Give the name of the test series here
#
TEST_NAME=obidemerge
CMD=obidemerge
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Remove the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find "$TEST_DIR" | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test:
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then / else
#### - The command must return 0 on success
#### - The command must return a non-zero exit code on failure
#### - The data files are stored in the same directory as the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup


@@ -0,0 +1,109 @@
#!/bin/bash
#
# Give the name of the test series here
#
TEST_NAME=obidistribute
CMD=obidistribute
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Remove the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find "$TEST_DIR" | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test:
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then / else
#### - The command must return 0 on success
#### - The command must return a non-zero exit code on failure
#### - The data files are stored in the same directory as the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

obitests/obitools/obigrep/test.sh Executable file

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Give the name of the test series here
#
TEST_NAME=obigrep
CMD=obigrep
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Remove the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find "$TEST_DIR" | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test:
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then / else
#### - The command must return 0 on success
#### - The command must return a non-zero exit code on failure
#### - The data files are stored in the same directory as the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

obitests/obitools/obijoin/test.sh Executable file

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Give the name of the test series here
#
TEST_NAME=obijoin
CMD=obijoin
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Remove the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find "$TEST_DIR" | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test:
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then / else
#### - The command must return 0 on success
#### - The command must return a non-zero exit code on failure
#### - The data files are stored in the same directory as the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup


@@ -0,0 +1,109 @@
#!/bin/bash
#
# Give the name of the test series here
#
TEST_NAME=obikmermatch
CMD=obikmermatch
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Remove the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find "$TEST_DIR" | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test:
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then / else
#### - The command must return 0 on success
#### - The command must return a non-zero exit code on failure
#### - The data files are stored in the same directory as the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup


@@ -0,0 +1,109 @@
#!/bin/bash
#
# Give the name of the test series here
#
TEST_NAME=obikmersimcount
CMD=obikmersimcount
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Remove the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find "$TEST_DIR" | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test:
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then / else
#### - The command must return 0 on success
#### - The command must return a non-zero exit code on failure
#### - The data files are stored in the same directory as the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup


@@ -0,0 +1,109 @@
#!/bin/bash
#
# Give the name of the test series here
#
TEST_NAME=obilandmark
CMD=obilandmark
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Remove the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find "$TEST_DIR" | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test:
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then / else
#### - The command must return 0 on success
#### - The command must return a non-zero exit code on failure
#### - The data files are stored in the same directory as the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup


@@ -0,0 +1,109 @@
#!/bin/bash
#
# Give the name of the test series here
#
TEST_NAME=obimatrix
CMD=obimatrix
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Remove the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find "$TEST_DIR" | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test:
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then / else
#### - The command must return 0 on success
#### - The command must return a non-zero exit code on failure
#### - The data files are stored in the same directory as the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup


@@ -0,0 +1,109 @@
#!/bin/bash
#
# Give the name of the test series here
#
TEST_NAME=obimicrosat
CMD=obimicrosat
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Remove the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find "$TEST_DIR" | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test:
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then / else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The data files are stored in the same directory as the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Give the name of the test series here
#
TEST_NAME=obimultiplex
CMD=obimultiplex
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Remove the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find "$TEST_DIR" | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test:
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then / else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The data files are stored in the same directory as the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

@@ -0,0 +1,150 @@
#!/bin/bash
#
# Give the name of the test series here
#
TEST_NAME=obipairing
CMD=obipairing
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Remove the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find "$TEST_DIR" | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test:
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then / else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The data files are stored in the same directory as the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
((ntest++))
if obipairing -F "${TEST_DIR}/wolf_F.fastq.gz" \
-R "${TEST_DIR}/wolf_R.fastq.gz" \
| obidistribute -Z -c mode \
-p "${TMPDIR}/wolf_paired_%s.fastq.gz"
then
log "OBIPairing: sequence pairing OK"
((success++))
else
log "OBIPairing: sequence pairing failed"
((failed++))
fi
((ntest++))
if obicsv -Z -s -i \
-k ali_dir -k ali_length -k pairing_fast_count \
-k pairing_fast_overlap -k pairing_fast_score \
-k score -k score_norm -k seq_a_single \
-k seq_b_single -k seq_ab_match \
"${TMPDIR}/wolf_paired_alignment.fastq.gz" \
> "${TMPDIR}/wolf_paired_alignment.csv.gz" \
&& zdiff -c "${TEST_DIR}/wolf_paired_alignment.csv.gz" \
"${TMPDIR}/wolf_paired_alignment.csv.gz"
then
log "OBIPairing: check aligned sequences OK"
((success++))
else
log "OBIPairing: check aligned sequences failed"
((failed++))
fi
((ntest++))
if obicsv -Z -s -i \
"${TMPDIR}/wolf_paired_join.fastq.gz" \
> "${TMPDIR}/wolf_paired_join.csv.gz" \
&& zdiff -c "${TEST_DIR}/wolf_paired_join.csv.gz" \
"${TMPDIR}/wolf_paired_join.csv.gz"
then
log "OBIPairing: check joined sequences OK"
((success++))
else
log "OBIPairing: check joined sequences failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

Binary file not shown.

Binary file not shown.

Binary file not shown.

109
obitests/obitools/obipcr/test.sh Executable file

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Give the name of the test series here
#
TEST_NAME=obipcr
CMD=obipcr
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Remove the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find "$TEST_DIR" | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test:
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then / else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The data files are stored in the same directory as the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Give the name of the test series here
#
TEST_NAME=obirefidx
CMD=obirefidx
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Remove the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find "$TEST_DIR" | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test:
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then / else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The data files are stored in the same directory as the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Give the name of the test series here
#
TEST_NAME=obiscript
CMD=obiscript
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Remove the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find "$TEST_DIR" | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test:
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then / else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The data files are stored in the same directory as the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Give the name of the test series here
#
TEST_NAME=obisplit
CMD=obisplit
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Remove the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find "$TEST_DIR" | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test:
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then / else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The data files are stored in the same directory as the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

@@ -0,0 +1,9 @@
>Seq_1 {"count":2,"merged_sample":{"15a_F730814":1,"29a_F260619":1}}
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat
agctyaaaactcaaaggacttggcggtgctttataccctt
>Seq_2 {"count":22,"merged_sample":{"15a_F730814":12,"29a_F260619":10}}
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat
atcttaaaactcaaaggacttggcggtgctttataccctt
>Seq_3 {"count":22,"merged_sample":{"15a_F730814":15,"29a_F260619":7}}
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcgat
agcttaaaactcaaaggacttggcggtgctttataccctt

@@ -0,0 +1,35 @@
{
"annotations": {
"keys": {
"map": {
"merged_sample": 3
},
"scalar": {
"count": 3
}
},
"map_attributes": 1,
"scalar_attributes": 1,
"vector_attributes": 0
},
"count": {
"reads": 46,
"total_length": 300,
"variants": 3
},
"samples": {
"sample_count": 2,
"sample_stats": {
"15a_F730814": {
"reads": 28,
"singletons": 1,
"variants": 3
},
"29a_F260619": {
"reads": 18,
"singletons": 1,
"variants": 3
}
}
}
}

@@ -0,0 +1,25 @@
annotations:
keys:
map:
merged_sample: 3
scalar:
count: 3
map_attributes: 1
scalar_attributes: 1
vector_attributes: 0
count:
reads: 46
total_length: 300
variants: 3
samples:
sample_count: 2
sample_stats:
15a_F730814:
reads: 28
singletons: 1
variants: 3
29a_F260619:
reads: 18
singletons: 1
variants: 3

@@ -0,0 +1,152 @@
#!/bin/bash
#
# Give the name of the test series here
#
TEST_NAME=obisummary
CMD=obisummary
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Remove the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find "$TEST_DIR" | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test:
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then / else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The data files are stored in the same directory as the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
((ntest++))
if obisummary "${TEST_DIR}/some_uniq_seq.fasta" \
> "${TMPDIR}/some_uniq_seq.json"
then
log "$MCMD: formatting json execution OK"
((success++))
else
log "$MCMD: formatting json execution failed"
((failed++))
fi
((ntest++))
if diff "${TEST_DIR}/some_uniq_seq.json" \
"${TMPDIR}/some_uniq_seq.json" > /dev/null
then
log "$MCMD: formatting json OK"
((success++))
else
log "$MCMD: formatting json failed"
((failed++))
fi
((ntest++))
if obisummary --yaml "${TEST_DIR}/some_uniq_seq.fasta" \
> "${TMPDIR}/some_uniq_seq.yaml"
then
log "$MCMD: formatting yaml execution OK"
((success++))
else
log "$MCMD: formatting yaml execution failed"
((failed++))
fi
((ntest++))
if diff "${TEST_DIR}/some_uniq_seq.yaml" \
"${TMPDIR}/some_uniq_seq.yaml" > /dev/null
then
log "$MCMD: formatting yaml OK"
((success++))
else
log "$MCMD: formatting yaml failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

@@ -0,0 +1,148 @@
# Tests for obisuperkmer
## Description
This directory contains the automated tests for the `obisuperkmer` command.
## Files
- `test.sh`: Main test script (executable)
- `test_sequences.fasta`: Minimal test dataset (3 short sequences)
- `README.md`: This file
## Test dataset
The `test_sequences.fasta` file contains 3 sequences of 32 nucleotides each:
1. **seq1**: Repetition of the ACGT motif (regular sequence)
2. **seq2**: Alternating homopolymer blocks (AAAA, CCCC, GGGG, TTTT)
3. **seq3**: Repetition of the ATCG motif (different from seq1)
These sequences are deliberately short in order to:
- Minimize the size of the Git repository
- Speed up test execution in CI/CD
- Test different super k-mer extraction cases
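Because the three sequences are plain repetitions of short motifs, the dataset can be rebuilt from its description. The following is a minimal sketch, not part of the repository; `repeat_motif` and `make_test_fasta` are hypothetical helper names introduced only for illustration.

```bash
#!/bin/bash
# Sketch only: these helpers are illustrative and do not exist in OBITools.

# repeat_motif MOTIF COUNT: print MOTIF repeated COUNT times on one line.
repeat_motif() {
    local motif="$1" count="$2" out="" i
    for ((i = 0; i < count; i++)); do
        out+="$motif"
    done
    printf '%s\n' "$out"
}

# make_test_fasta: print three 32-nucleotide records in FASTA format.
make_test_fasta() {
    printf '>seq1\n%s\n' "$(repeat_motif ACGT 8)"              # regular 4-nt motif
    printf '>seq2\n%s\n' "$(repeat_motif AAAACCCCGGGGTTTT 2)"  # homopolymer blocks
    printf '>seq3\n%s\n' "$(repeat_motif ATCG 8)"              # a second 4-nt motif
}

make_test_fasta
```

Redirecting the output to a file gives a dataset equivalent to the one described above, which can be handy when experimenting with other motif lengths.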
## Tests performed
The `test.sh` script runs 13 tests:
### Test 1: Help display
Checks that `obisuperkmer -h` runs correctly.
### Test 2: Basic extraction with default parameters
Runs `obisuperkmer` with k=21, m=11 (default values).
### Test 3: Output file is not empty
Makes sure the command produces output.
### Test 4: Count of extracted super k-mers
Checks that at least one super k-mer was extracted.
### Test 5: Presence of the required metadata
Checks that each super k-mer contains:
- `minimizer_value`
- `minimizer_seq`
- `parent_id`
### Test 6: Extraction with custom parameters
Tests with k=15 and m=7.
### Test 7: Parameters recorded in the metadata
Makes sure the values k=15 and m=7 are present in the output.
### Test 8: Explicit FASTA output format
Tests the `--fasta-output` option.
### Test 9: Super k-mer IDs
Makes sure all IDs contain "superkmer".
### Test 10: Parent IDs are preserved
Checks that seq1, seq2 and seq3 appear in the output.
### Test 11: -o option for the output file
Tests writing to a file with `-o`.
### Test 12: File creation with -o
Makes sure the output file was created.
### Test 13: Length consistency
Checks that the sum of the super k-mer lengths is less than or equal to the total length of the input sequences.
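Several of the checks listed above (counting extracted records, inspecting sequence lengths) reduce to a few lines of shell. A minimal sketch, assuming one sequence line per FASTA record; `count_records` and `min_seq_length` are hypothetical helper names, and the sample file is made up for the demonstration rather than taken from real `obisuperkmer` output.

```bash
#!/bin/bash
# Sketch only: helper names and sample data are illustrative assumptions.

# count_records FILE: number of FASTA records (one '>' header per record).
count_records() {
    grep -c '^>' "$1"
}

# min_seq_length FILE: length of the shortest sequence line.
# Assumes each record stores its sequence on a single line.
min_seq_length() {
    grep -v '^>' "$1" | awk '{ print length($0) }' | sort -n | head -n 1
}

# Small made-up FASTA file with sequences of length 8, 3 and 5.
sample="$(mktemp)"
printf '>a\nACGTACGT\n>b\nACG\n>c\nACGTA\n' > "$sample"

n_records="$(count_records "$sample")"
shortest="$(min_seq_length "$sample")"
echo "records: $n_records, shortest: $shortest"

rm -f "$sample"
```

Keeping each measurement in its own small function makes the `if` condition of a test read as a single comparison, for example `[ "$(min_seq_length file)" -ge "$k" ]`.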
## Running the tests
### Locally
```bash
cd /path/to/obitools4/obitests/obitools/obisuperkmer
./test.sh
```
### In CI/CD
The tests run automatically on each commit through the CI/CD system configured for the project.
### Prerequisites
- The `obisuperkmer` command must be compiled and available in `../../build/`
- System dependencies: bash, grep, etc.
## Test script structure
The script follows the standard pattern used by all OBITools tests:
1. **Header**: Defines the test name and the command
2. **Variables**: Sets up the paths and counters
3. **cleanup() function**: Prints the results and removes the temporary directory
4. **log() function**: Prints timestamped messages
5. **Tests**: Series of tests that increment the counters
6. **cleanup() call**: Cleans up and exits with the appropriate return code
## Output format
Each test prints:
```
[obisuperkmer @ date] message
```
At the end of the run:
```
========================================
## Results of the obisuperkmer tests:
- 13 tests run
- 13 successfully completed
- 0 failed tests
Cleaning up the temporary directory...
========================================
```
## Return codes
- **0**: All tests passed
- **1**: At least one test failed
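Since every test script reports its outcome through its exit code, a caller can aggregate several scripts with a short loop. The sketch below is an assumption about how such a runner might look; `run_all` is a hypothetical function, and the two stand-in scripts are created only for the demonstration.

```bash
#!/bin/bash
# Sketch only: run_all is a hypothetical runner, not part of the repository.

# run_all SCRIPT...: run each script; return 1 if any exits with a non-zero code.
run_all() {
    local status=0 script
    for script in "$@"; do
        if ! bash "$script"; then
            echo "FAILED: $script" 1>&2
            status=1
        fi
    done
    return "$status"
}

# Demonstration with two stand-in scripts in a scratch directory.
demo_dir="$(mktemp -d)"
printf 'exit 0\n' > "${demo_dir}/passing.sh"
printf 'exit 1\n' > "${demo_dir}/failing.sh"

all_pass=0
run_all "${demo_dir}/passing.sh" || all_pass=$?

one_fail=0
run_all "${demo_dir}/passing.sh" "${demo_dir}/failing.sh" || one_fail=$?

echo "all passing -> $all_pass, one failing -> $one_fail"
rm -rf "$demo_dir"
```

The runner keeps going after a failure so that one broken test series does not hide the results of the others.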
## Adding new tests
To add a new test, follow this pattern:
```bash
((ntest++))
if test_command arguments
then
log "Description: OK"
((success++))
else
log "Description: failed"
((failed++))
fi
```
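The pattern above is repeated verbatim for every test, so it can also be wrapped in a function. This is a sketch of one possible refactoring, not what the OBITools scripts do; `run_test` is a hypothetical helper, and the `log` function and counters are redefined here only to keep the example self-contained.

```bash
#!/bin/bash
# Sketch only: run_test is a hypothetical helper, not part of the test scripts.
ntest=0
success=0
failed=0

log() {
    echo -e "[example @ $(date)] $*" 1>&2
}

# run_test DESCRIPTION COMMAND [ARGS...]
# Runs the command, logs the outcome and updates the three counters.
run_test() {
    local description="$1"
    shift
    ntest=$((ntest + 1))
    if "$@"
    then
        log "${description}: OK"
        success=$((success + 1))
    else
        log "${description}: failed"
        failed=$((failed + 1))
    fi
}

# Two trivial commands stand in for real test commands.
run_test "true always succeeds" true
run_test "false always fails" false

echo "$ntest tests run, $success succeeded, $failed failed"
```

A helper like this only accepts a simple command with its arguments. Tests that need pipelines or redirections would have to be wrapped in a function first, which is one reason to keep the explicit `if` / `then` / `else` form.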
## Notes
- Temporary files are created in `$TMPDIR` (created by mktemp)
- Data files are in `$TEST_DIR`
- The command under test must be in `$OBITOOLS_DIR` (../../build/)
- The temporary directory is automatically removed at the end

@@ -0,0 +1,232 @@
#!/bin/bash
#
# Give the name of the test series here
#
TEST_NAME=obik-super
CMD=obik
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="OBIk-super"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Remove the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find "$TEST_DIR" | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
######################################################################
((ntest++))
if $CMD super -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
# Test 1: Basic super k-mer extraction with default parameters
((ntest++))
if $CMD super "${TEST_DIR}/test_sequences.fasta" \
> "${TMPDIR}/output_default.fasta" 2>&1
then
log "$MCMD: basic extraction with default parameters OK"
((success++))
else
log "$MCMD: basic extraction with default parameters failed"
((failed++))
fi
# Test 2: Verify output is not empty
((ntest++))
if [ -s "${TMPDIR}/output_default.fasta" ]
then
log "$MCMD: output file is not empty OK"
((success++))
else
log "$MCMD: output file is empty - failed"
((failed++))
fi
# Test 3: Count number of super k-mers extracted (should be > 0)
((ntest++))
num_sequences=$(grep -c "^>" "${TMPDIR}/output_default.fasta")
if [ "$num_sequences" -gt 0 ]
then
log "$MCMD: extracted $num_sequences super k-mers OK"
((success++))
else
log "$MCMD: no super k-mers extracted - failed"
((failed++))
fi
# Test 4: Verify super k-mers have required metadata attributes
((ntest++))
if grep -q "minimizer_value" "${TMPDIR}/output_default.fasta" && \
grep -q "minimizer_seq" "${TMPDIR}/output_default.fasta" && \
grep -q "parent_id" "${TMPDIR}/output_default.fasta"
then
log "$MCMD: super k-mers contain required metadata OK"
((success++))
else
log "$MCMD: super k-mers missing metadata - failed"
((failed++))
fi
# Test 5: Extract super k-mers with custom k and m parameters
((ntest++))
if $CMD super -k 15 -m 7 "${TEST_DIR}/test_sequences.fasta" \
> "${TMPDIR}/output_k15_m7.fasta" 2>&1
then
log "$MCMD: extraction with custom k=15, m=7 OK"
((success++))
else
log "$MCMD: extraction with custom k=15, m=7 failed"
((failed++))
fi
# Test 6: Verify custom parameters in output metadata
((ntest++))
if grep -q '"k":15' "${TMPDIR}/output_k15_m7.fasta" && \
grep -q '"m":7' "${TMPDIR}/output_k15_m7.fasta"
then
log "$MCMD: custom parameters correctly set in metadata OK"
((success++))
else
log "$MCMD: custom parameters not in metadata - failed"
((failed++))
fi
# Test 7: Test with different output format (FASTA output explicitly)
((ntest++))
if $CMD super --fasta-output -k 21 -m 11 \
"${TEST_DIR}/test_sequences.fasta" \
> "${TMPDIR}/output_fasta.fasta" 2>&1
then
log "$MCMD: FASTA output format OK"
((success++))
else
log "$MCMD: FASTA output format failed"
((failed++))
fi
# Test 8: Verify that super k-mer IDs contain 'superkmer' (passes if at least one header matches)
((ntest++))
if grep "^>" "${TMPDIR}/output_default.fasta" | grep -q "superkmer"
then
log "$MCMD: super k-mer IDs contain 'superkmer' OK"
((success++))
else
log "$MCMD: super k-mer IDs missing 'superkmer' - failed"
((failed++))
fi
# Test 9: Verify parent sequence IDs are preserved
((ntest++))
if grep -q "seq1" "${TMPDIR}/output_default.fasta" && \
grep -q "seq2" "${TMPDIR}/output_default.fasta" && \
grep -q "seq3" "${TMPDIR}/output_default.fasta"
then
log "$MCMD: parent sequence IDs preserved OK"
((success++))
else
log "$MCMD: parent sequence IDs not preserved - failed"
((failed++))
fi
# Test 10: Test with output file option
((ntest++))
if $CMD super -o "${TMPDIR}/output_file.fasta" \
"${TEST_DIR}/test_sequences.fasta" 2>&1
then
log "$MCMD: output to file with -o option OK"
((success++))
else
log "$MCMD: output to file with -o option failed"
((failed++))
fi
# Test 11: Verify output file was created with -o option
((ntest++))
if [ -s "${TMPDIR}/output_file.fasta" ]
then
log "$MCMD: output file created with -o option OK"
((success++))
else
log "$MCMD: output file not created with -o option - failed"
((failed++))
fi
# Test 12: Verify each super k-mer length is >= k (default k=31)
((ntest++))
min_len=$(grep -v "^>" "${TMPDIR}/output_default.fasta" | awk '{print length}' | sort -n | head -1)
if [ "$min_len" -ge 31 ]
then
log "$MCMD: all super k-mers have length >= k OK"
((success++))
else
log "$MCMD: some super k-mers shorter than k ($min_len < 31) - failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup
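
Tests 8 and 12 above look at one matching header and at the shortest sequence line. As a minimal sketch, both checks could be done in a single pass that examines every record. The function name `check_superkmers` and its two arguments are hypothetical and not part of the test suite, and the sketch assumes each sequence sits on a single line.

```bash
# Hypothetical helper, not part of the obitools test suite.
# Usage: check_superkmers FASTA K
# Returns 0 only if the file holds at least one record, every header
# line contains "superkmer", and every sequence line has length >= K.
check_superkmers() {
    local fasta="$1" k="$2"
    awk -v k="$k" '
        /^$/ { next }                                   # ignore blank lines
        /^>/ { n++; if ($0 !~ /superkmer/) bad++; next } # header check
        length($0) < k { bad++ }                        # sequence too short
        END { exit (n > 0 && bad == 0) ? 0 : 1 }
    ' "$fasta"
}
```

In a test it could be used as the condition of the usual if / then / else, for example `if check_superkmers "${TMPDIR}/output_default.fasta" 31`.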


@@ -0,0 +1,6 @@
>seq1
ACGTACGTACGTACGTACGTACGTACGTACGT
>seq2
AAAACCCCGGGGTTTTAAAACCCCGGGGTTTT
>seq3
ATCGATCGATCGATCGATCGATCGATCGATCG

obitests/obitools/obitag/test.sh Executable file

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Give the name of the test series here
#
TEST_NAME=obitag
CMD=obitag
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Remove the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find "$TEST_DIR" | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then / else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory as the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup


@@ -0,0 +1,109 @@
#!/bin/bash
#
# Give the name of the test series here
#
TEST_NAME=obipcr
CMD=obipcr
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Remove the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find "$TEST_DIR" | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then / else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory as the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup


@@ -0,0 +1,109 @@
#!/bin/bash
#
# Give the name of the test series here
#
TEST_NAME=obitaxonomy
CMD=obitaxonomy
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Remove the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find "$TEST_DIR" | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then / else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory as the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

obitests/obitools/obiuniq/test.sh Executable file

@@ -0,0 +1,258 @@
#!/bin/bash
#
# Give the name of the test series here
#
TEST_NAME=obiuniq
CMD=obiuniq
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Remove the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find "$TEST_DIR" | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then / else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory as the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
((ntest++))
if obiuniq "${TEST_DIR}/touniq.fasta" \
> "${TMPDIR}/touniq_u.fasta"
then
log "OBIUniq simple: running OK"
((success++))
else
log "OBIUniq simple: running failed"
((failed++))
fi
obicsv -s --auto "${TEST_DIR}/touniq_u.fasta" \
| tail -n +2 \
| sort \
> "${TMPDIR}/touniq_u_ref.csv"
obicsv -s --auto "${TMPDIR}/touniq_u.fasta" \
| tail -n +2 \
| sort \
> "${TMPDIR}/touniq_u.csv"
((ntest++))
if diff "${TMPDIR}/touniq_u_ref.csv" \
"${TMPDIR}/touniq_u.csv" > /dev/null
then
log "OBIUniq simple: result OK"
((success++))
else
log "OBIUniq simple: result failed"
((failed++))
fi
((ntest++))
if obiuniq -c a "${TEST_DIR}/touniq.fasta" \
> "${TMPDIR}/touniq_u_a.fasta"
then
log "OBIUniq one category: running OK"
((success++))
else
log "OBIUniq one category: running failed"
((failed++))
fi
obicsv -s --auto "${TEST_DIR}/touniq_u_a.fasta" \
| tail -n +2 \
| sort \
> "${TMPDIR}/touniq_u_a_ref.csv"
obicsv -s --auto "${TMPDIR}/touniq_u_a.fasta" \
| tail -n +2 \
| sort \
> "${TMPDIR}/touniq_u_a.csv"
((ntest++))
if diff "${TMPDIR}/touniq_u_a_ref.csv" \
"${TMPDIR}/touniq_u_a.csv" > /dev/null
then
log "OBIUniq one category: result OK"
((success++))
else
log "OBIUniq one category: result failed"
((failed++))
fi
((ntest++))
if obiuniq -c a -c b "${TEST_DIR}/touniq.fasta" \
> "${TMPDIR}/touniq_u_a_b.fasta"
then
log "OBIUniq two categories: running OK"
((success++))
else
log "OBIUniq two categories: running failed"
((failed++))
fi
obicsv -s --auto "${TEST_DIR}/touniq_u_a_b.fasta" \
| tail -n +2 \
| sort \
> "${TMPDIR}/touniq_u_a_b_ref.csv"
obicsv -s --auto "${TMPDIR}/touniq_u_a_b.fasta" \
| tail -n +2 \
| sort \
> "${TMPDIR}/touniq_u_a_b.csv"
((ntest++))
if diff "${TMPDIR}/touniq_u_a_b_ref.csv" \
"${TMPDIR}/touniq_u_a_b.csv" > /dev/null
then
log "OBIUniq two categories: result OK"
((success++))
else
log "OBIUniq two categories: result failed"
((failed++))
fi
##
## Test merge attributes consistency between in-memory and on-disk paths
## This test catches the bug where the shared classifier in the on-disk
## dereplication path caused incorrect merged attributes.
##
((ntest++))
if obiuniq -m a -m b --in-memory \
"${TEST_DIR}/touniq.fasta" \
> "${TMPDIR}/touniq_u_merge_mem.fasta" 2>/dev/null
then
log "OBIUniq merge in-memory: running OK"
((success++))
else
log "OBIUniq merge in-memory: running failed"
((failed++))
fi
((ntest++))
if obiuniq -m a -m b --chunk-count 4 \
"${TEST_DIR}/touniq.fasta" \
> "${TMPDIR}/touniq_u_merge_disk.fasta" 2>/dev/null
then
log "OBIUniq merge on-disk: running OK"
((success++))
else
log "OBIUniq merge on-disk: running failed"
((failed++))
fi
# Extract sorted annotations (JSON attributes) from both outputs
# to compare merge results independently of sequence ordering
grep '^>' "${TMPDIR}/touniq_u_merge_mem.fasta" \
| sed 's/^>seq[0-9]* //' \
| sort \
> "${TMPDIR}/touniq_u_merge_mem.json"
grep '^>' "${TMPDIR}/touniq_u_merge_disk.fasta" \
| sed 's/^>seq[0-9]* //' \
| sort \
> "${TMPDIR}/touniq_u_merge_disk.json"
((ntest++))
if diff "${TMPDIR}/touniq_u_merge_mem.json" \
"${TMPDIR}/touniq_u_merge_disk.json" > /dev/null
then
log "OBIUniq merge on-disk vs in-memory: result OK"
((success++))
else
log "OBIUniq merge on-disk vs in-memory: result failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup
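
The merge consistency test above strips the sequence IDs from the FASTA headers, sorts what remains, and diffs the two lists. A rough sketch of the same comparison as a reusable function follows. The name `annotations_match` is hypothetical, and unlike the `sed` expression in the script it strips any ID up to the first space, not only IDs of the form `seqN`.

```bash
# Hypothetical helper, not part of the obitools test suite.
# Usage: annotations_match FASTA_A FASTA_B
# Returns 0 when both files carry the same multiset of header
# annotations, ignoring record order and sequence IDs.
# Note: two files without any header line also compare as equal.
annotations_match() {
    local a b
    # Keep header lines, drop ">ID " so only the annotation text remains.
    a=$(grep '^>' "$1" | sed 's/^>[^ ]* *//' | sort)
    b=$(grep '^>' "$2" | sed 's/^>[^ ]* *//' | sort)
    [ "$a" = "$b" ]
}
```

It could replace the grep / sed / sort / diff sequence with a single condition such as `if annotations_match "${TMPDIR}/touniq_u_merge_mem.fasta" "${TMPDIR}/touniq_u_merge_disk.fasta"`.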


@@ -0,0 +1,16 @@
>seq1 {"a":2, "b":4,"c":5}
aaacccgggttt
>seq2 {"a":3, "b":4,"c":5}
aaacccgggttt
>seq3 {"a":3, "b":5,"c":5}
aaacccgggttt
>seq4 {"a":3, "b":5,"c":6}
aaacccgggttt
>seq5 {"a":2, "b":4,"c":5}
aaacccgggtttca
>seq6 {"a":3, "b":4,"c":5}
aaacccgggtttca
>seq7 {"a":3, "b":5,"c":5}
aaacccgggtttca
>seq8 {"a":3, "b":5,"c":6}
aaacccgggtttca


@@ -0,0 +1,4 @@
>seq5 {"count":4}
aaacccgggtttca
>seq1 {"count":4}
aaacccgggttt
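
The 16-line fixture holds eight records over two distinct sequences, and the 4-line reference above gives each a count of 4. The sketch below reproduces only that counting step with awk, as an illustration of what plain dereplication is expected to yield. It is not how obiuniq is implemented, the function name `derep_counts` is hypothetical, and it assumes single-line sequences.

```bash
# Hypothetical illustration, not the obiuniq implementation.
# Usage: derep_counts FASTA
# Prints one "<sequence> <count>" line per distinct sequence,
# sorted by sequence.
derep_counts() {
    awk '
        !/^>/ && NF { count[$0]++ }   # tally every non-empty sequence line
        END { for (s in count) print s, count[s] }
    ' "$1" | LC_ALL=C sort
}
```

Run on the eight-record input fixture, this should report each of the two sequences with a count of 4, in line with the reference file.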


@@ -0,0 +1,8 @@
>seq5 {"a":2,"b":4,"c":5,"count":1}
aaacccgggtttca
>seq6 {"a":3,"count":3}
aaacccgggtttca
>seq1 {"a":2,"b":4,"c":5,"count":1}
aaacccgggttt
>seq2 {"a":3,"count":3}
aaacccgggttt


@@ -0,0 +1,12 @@
>seq5 {"a":2,"b":4,"c":5,"count":1}
aaacccgggtttca
>seq6 {"a":3,"b":4,"c":5,"count":1}
aaacccgggtttca
>seq7 {"a":3,"b":5,"count":2}
aaacccgggtttca
>seq1 {"a":2,"b":4,"c":5,"count":1}
aaacccgggttt
>seq2 {"a":3,"b":4,"c":5,"count":1}
aaacccgggttt
>seq3 {"a":3,"b":5,"count":2}
aaacccgggttt
