Compare commits

...

120 Commits

Author SHA1 Message Date
Eric Coissac
71574f240b Update version and add CI tests
Update version to 4.4.5 and add a test job in the release workflow to ensure tests pass before creating a release.
2026-02-05 18:02:28 +01:00
Eric Coissac
fe6d74efbf Add automated release workflow and update tag creation
This commit introduces a new GitHub Actions workflow to automatically create releases when tags matching the pattern 'Release_*' are pushed. It also updates the Makefile to use the new tag format 'Release_<version>' for tagging commits, ensuring consistency with the new release automation.
2026-02-05 17:53:52 +01:00
coissac
cff8135468 Merge pull request #72 from metabarcoding/push-zsprzlqxurrp
Push zsprzlqxurrp
2026-02-05 17:42:48 +01:00
Eric Coissac
02ab683fa0 Bump version to 4.4.4
Update version from 4.4.3 to 4.4.4 in version.txt and pkg/obioptions/version.go
2026-02-05 17:42:01 +01:00
Eric Coissac
de88e7eecd Fix typo in variable name
Corrected a typo in the variable name 'usreId' to 'userId' to ensure proper functionality.
2026-02-05 17:41:59 +01:00
Eric Coissac
e3c41fc11b Add Jaccard distance and similarity computations for KmerSet and KmerSetGroup
Add Jaccard distance and similarity computations for KmerSet and KmerSetGroup

This commit introduces Jaccard distance and similarity methods for KmerSet and KmerSetGroup.

For KmerSet:
- Added JaccardDistance method to compute the Jaccard distance between two KmerSets
- Added JaccardSimilarity method to compute the Jaccard similarity between two KmerSets

For KmerSetGroup:
- Added JaccardDistanceMatrix method to compute a pairwise Jaccard distance matrix
- Added JaccardSimilarityMatrix method to compute a pairwise Jaccard similarity matrix

Also includes:
- New DistMatrix implementation in pkg/obidist for storing and computing distance/similarity matrices
- Updated version handling with bump-version target in Makefile
- Added tests for all new methods
2026-02-05 17:39:23 +01:00
Eric Coissac
aa2e94dd6f Refactor k-mer normalization functions and add quorum operations
This commit refactors the k-mer normalization functions, renaming them from 'NormalizeKmer' to 'CanonicalKmer' to better reflect their purpose of returning canonical k-mers. It also introduces new quorum operations (AtLeast, AtMost, Exactly) for k-mer set groups, along with comprehensive tests and benchmarks. The version commit hash has also been updated.
2026-02-05 17:11:34 +01:00
Eric Coissac
a43e6258be docs: translate comments to English
This commit translates all French comments in the kmer filtering and set management code to English, improving code readability and maintainability for international collaborators.
2026-02-05 16:35:55 +01:00
Eric Coissac
12ca62b06a Implémentation complète de la persistance pour FrequencyFilter
Ajout de la fonctionnalité de sauvegarde et de chargement pour FrequencyFilter en utilisant le KmerSetGroup sous-jacent.

- Nouvelle méthode Save() pour enregistrer le filtre dans un répertoire avec formatage des métadonnées
- Nouvelle méthode LoadFrequencyFilter() pour charger un filtre depuis un répertoire
- Initialisation des métadonnées lors de la création du filtre
- Optimisation des méthodes Union() et Intersect() du KmerSetGroup
- Mise à jour du commit hash
2026-02-05 16:26:10 +01:00
Eric Coissac
09ac15a76b Refactor k-mer encoding functions to use 'canonical' terminology
This commit refactors all k-mer encoding and normalization functions to consistently use 'canonical' instead of 'normalized' terminology. This includes renaming functions like EncodeNormalizedKmer to EncodeCanonicalKmer, IterNormalizedKmers to IterCanonicalKmers, and NormalizeKmer to CanonicalKmer. The change aligns the API with biological conventions where 'canonical' refers to the lexicographically smallest representation of a k-mer and its reverse complement. All related documentation and examples have been updated accordingly. The commit also updates the version file with a new commit hash.
2026-02-05 16:14:35 +01:00
Eric Coissac
16f72e6305 refactoring of obikmer 2026-02-05 16:05:48 +01:00
Eric Coissac
6c6c369ee2 Add k-mer encoding and decoding functions with normalized k-mer support
This commit introduces new functions for encoding and decoding k-mers, including support for normalized k-mers. It also updates the frequency filter and k-mer set implementations to use the new encoding functions, providing zero-allocation encoding for better performance. The commit hash has been updated to reflect the latest changes.
2026-02-05 15:51:52 +01:00
Eric Coissac
c5dd477675 Refactor KmerSet and FrequencyFilter to use immutable K parameter and consistent Copy/Clone methods
This commit refactors the KmerSet and related structures to use an immutable K parameter and introduces consistent Copy methods instead of Clone. It also adds attribute API support for KmerSet and KmerSetGroup, and updates persistence logic to handle IDs and metadata correctly.
2026-02-05 15:32:36 +01:00
Eric Coissac
afcb43b352 Ajout de la gestion des métadonnées utilisateur dans KmerSet et KmerSetGroup
Cette modification ajoute la capacité de stocker et de persister des métadonnées utilisateur dans les structures KmerSet et KmerSetGroup. Les changements incluent l'ajout d'un champ Metadata dans KmerSet et KmerSetGroup, ainsi que la mise à jour des méthodes de clonage et de persistance pour gérer ces métadonnées. Cela permet de conserver des informations supplémentaires liées aux ensembles de k-mers tout en maintenant la compatibilité avec les opérations existantes.
2026-02-05 15:02:36 +01:00
Eric Coissac
b26b76cbf8 Add TOML persistence support for KmerSet and KmerSetGroup
This commit adds support for saving and loading KmerSet and KmerSetGroup structures using TOML, YAML, and JSON formats for metadata. It includes:

- Added github.com/pelletier/go-toml/v2 dependency
- Implemented Save and Load methods for KmerSet and KmerSetGroup
- Added metadata persistence with support for multiple formats (TOML, YAML, JSON)
- Added helper functions for format detection and metadata handling
- Updated version commit hash
2026-02-05 14:57:22 +01:00
Eric Coissac
aa468ec462 Refactor FrequencyFilter to use KmerSetGroup
Refactor FrequencyFilter to inherit from KmerSetGroup for better code organization and maintainability. This change replaces the direct bitmap management with a group-based approach, simplifying the implementation and improving readability.
2026-02-05 14:46:57 +01:00
Eric Coissac
00dcd78e84 Refactor k-mer encoding and frequency filtering with KmerSet
This commit refactors the k-mer encoding logic to handle ambiguous bases more consistently and introduces a KmerSet type for better management of k-mer collections. The frequency filter now works with KmerSet instead of roaring bitmaps directly, and the API has been updated to support level-based frequency queries. Additionally, the commit updates the version and commit hash.
2026-02-05 14:41:59 +01:00
Eric Coissac
60f27c1dc8 Add error handling for ambiguous bases in k-mer encoding
This commit introduces error handling for ambiguous DNA bases (N, R, Y, W, S, K, M, B, D, H, V) in k-mer encoding. It adds new functions IterNormalizedKmersWithErrors and EncodeNormalizedKmersWithErrors that track and encode the number of ambiguous bases in each k-mer using error markers in the top 2 bits. The commit also updates the version string to reflect the latest changes.
2026-02-04 21:45:08 +01:00
Eric Coissac
28162ac36f Ajout du filtre de fréquence avec v niveaux Roaring Bitmaps
Implémentation complète du filtre de fréquence utilisant v niveaux de Roaring Bitmaps pour éliminer efficacement les erreurs de séquençage.

- Ajout de la logique de filtrage par fréquence avec v niveaux
- Intégration des bibliothèques RoaringBitmap et bitset
- Ajout d'exemples d'utilisation et de documentation
- Implémentation de l'itérateur de k-mers pour une utilisation mémoire efficace
- Optimisation pour les distributions skewed typiques du séquençage

Ce changement permet de filtrer les k-mers par fréquence minimale avec une utilisation mémoire optimale et une seule passe sur les données.
2026-02-04 21:21:10 +01:00
Eric Coissac
1a1adb83ac Add error marker support for k-mers with enhanced documentation
This commit introduces error marker functionality for k-mers with odd lengths up to 31. The top 2 bits of each k-mer are now reserved for error coding (0-3), allowing for error detection and correction capabilities. Key changes include:

- Added constants KmerErrorMask and KmerSequenceMask for bit manipulation
- Implemented SetKmerError, GetKmerError, and ClearKmerError functions
- Updated EncodeKmers, ExtractSuperKmers, EncodeNormalizedKmers functions to enforce k ≤ 31
- Enhanced ReverseComplement to preserve error bits during reverse complement operations
- Added comprehensive tests for error marker functionality including edge cases and integration tests

The maximum k-mer size is now capped at 31 to accommodate the error bits, ensuring that k-mers with odd lengths ≤ 31 utilize only 62 bits of the 64-bit uint64, leaving the top 2 bits available for error coding.
2026-02-04 16:21:47 +01:00
Eric Coissac
05de9ca58e Add SuperKmer extraction functionality
This commit introduces the ExtractSuperKmers function which identifies maximal subsequences where all consecutive k-mers share the same minimizer. It includes:

- SuperKmer struct to represent the maximal subsequences
- dequeItem struct for tracking minimizers in a sliding window
- Efficient algorithm using monotone deque for O(1) amortized minimizer tracking
- Comprehensive parameter validation
- Support for buffer reuse for performance optimization
- Extensive test cases covering basic functionality, edge cases, and performance benchmarks

The implementation uses simultaneous forward/reverse m-mer encoding for O(1) canonical m-mer computation and maintains a monotone deque to track minimizers efficiently.
2026-02-04 16:04:06 +01:00
Eric Coissac
500144051a Add jj Makefile targets and k-mer encoding utilities
Add new Makefile targets for jj operations (jjnew, jjpush, jjfetch) to streamline commit workflow.

Introduce k-mer encoding utilities in pkg/obikmer:
- EncodeKmers: converts DNA sequences to encoded k-mers
- ReverseComplement: computes reverse complement of k-mers
- NormalizeKmer: returns canonical form of k-mers
- EncodeNormalizedKmers: encodes sequences with normalized k-mers

Add comprehensive tests for k-mer encoding functions including edge cases, buffer reuse, and performance benchmarks.

Document k-mer index design for large genomes, covering:
- Use cases and objectives
- Volume estimations
- Distance metrics (Jaccard, Sørensen-Dice, Bray-Curtis)
- Indexing options (Bloom filters, sorted sets, MPHF)
- Optimization techniques (k-2-mer indexing)
- MinHash for distance acceleration
- Recommended architecture for presence/absence and counting queries
2026-02-04 14:27:10 +01:00
coissac
740f66b4c7 Merge pull request #71 from metabarcoding/push-onwzsyuooozn
Implémentation du filtrage unique basé sur séquence et catégories
2026-01-14 19:19:27 +01:00
Eric Coissac
b49aba9c09 Implémentation du filtrage unique basé sur séquence et catégories
Ajout d'une fonctionnalité pour le filtrage unique qui prend en compte à la fois la séquence et les catégories.

- Modification de la fonction ISequenceChunk pour accepter un classifieur unique optionnel
- Implémentation du traitement unique sur disque en utilisant un classifieur composite
- Mise à jour du classifieur utilisé pour le tri sur disque
- Correction de la gestion des clés de unicité en utilisant le code et la valeur du classifieur
- Mise à jour du numéro de commit
2026-01-14 19:18:17 +01:00
coissac
52244cdb64 Merge pull request #70 from metabarcoding/push-kuwnszsxmxpn
Refactor chunk processing and update version commit
2026-01-14 18:47:17 +01:00
Eric Coissac
0678181023 Refactor chunk processing and update version commit
Optimize chunk processing by moving variable declarations inside the loop and update the commit hash in version.go to reflect the latest changes.
2026-01-14 18:46:04 +01:00
coissac
f55dd553c7 Merge pull request #68 from metabarcoding/push-rrulynolpprl
Push rrulynolpprl
2026-01-14 17:44:36 +01:00
coissac
4a383ac6c9 Merge branch 'master' into push-rrulynolpprl 2025-12-18 14:12:56 +01:00
Eric Coissac
371e702423 obiannotate --cut bug 2025-12-18 14:11:11 +01:00
Eric Coissac
ac0d3f3fe4 Update obiuniq for very large dataset 2025-12-18 14:11:11 +01:00
Eric Coissac
547135c747 End of obilowmask 2025-12-03 11:49:07 +01:00
coissac
f4a919732e Merge pull request #65 from metabarcoding/push-yurwulsmpxkq
End of obilowmask
2025-11-26 12:13:08 +01:00
Eric Coissac
e681666aaa End of obilowmask 2025-11-26 11:14:56 +01:00
coissac
adf2486295 Merge pull request #64 from metabarcoding/push-yurwulsmpxkq
End of obilowmask
2025-11-24 15:36:20 +01:00
Eric Coissac
272f5c9c35 End of obilowmask 2025-11-24 15:27:38 +01:00
coissac
c1b9503ca6 Merge pull request #63 from metabarcoding/push-vypwrurrsxuk
obicsv bug with stat on value map fields
2025-11-21 14:04:34 +01:00
Eric Coissac
86e60aedd0 obicsv bug with stat on value map fields 2025-11-21 14:03:31 +01:00
coissac
961abcea7b Merge pull request #61 from metabarcoding/push-mvxssvnysyxn
Push mvxssvnysyxn
2025-11-21 13:25:19 +01:00
Eric Coissac
57c65f9d50 obimatrix bug 2025-11-21 13:24:24 +01:00
Eric Coissac
e65b2a5efe obimatrix bugs 2025-11-21 13:24:06 +01:00
coissac
3e5f3f76b0 Merge pull request #60 from metabarcoding/push-qpnzxskwpoxo
Push qpnzxskwpoxo
2025-11-18 15:35:41 +01:00
Eric Coissac
ccc827afd3 finalise obilowmask 2025-11-18 15:33:08 +01:00
Eric Coissac
cef29005a5 debug url reading 2025-11-18 15:30:20 +01:00
Eric Coissac
4603d7973e implementation de obilowmask 2025-11-18 15:30:20 +01:00
coissac
8bc47c13d3 Merge pull request #58 from metabarcoding/push-vxkqkkrokwuz
debug obimultiplex
2025-11-06 15:44:31 +01:00
Eric Coissac
07cdd6f758 debug obimultiplex
bug option obimultiplex
2025-11-06 15:43:13 +01:00
coissac
432da366e2 Merge pull request #57 from metabarcoding/push-ywktmvpvtvmv
debug taxonomy core dump
2025-11-05 19:07:41 +01:00
Eric Coissac
2d7dc7d09d debug taxonomy core dump 2025-11-05 19:01:15 +01:00
coissac
5e12ed5400 Merge pull request #56 from metabarcoding/push-tnrvpwvqtzyo
update install script
2025-11-04 18:11:21 +01:00
Eric Coissac
7500ee1d15 update install script 2025-11-04 18:09:15 +01:00
coissac
5a1d66bf06 Merge pull request #53 from metabarcoding/push-skmxzrzulvtq
Push skmxzrzulvtq
2025-10-28 14:27:19 +01:00
Eric Coissac
0844dcc607 bug obimatrix 2025-10-28 13:57:31 +01:00
Eric Coissac
7f4ebe757e Bug obiuniq - don't clean the chunks 2025-10-28 13:50:22 +01:00
coissac
5150947e23 Merge pull request #51 from metabarcoding/push-urtwmwktsrru
Push urtwmwktsrru
2025-10-20 17:41:33 +02:00
Eric Coissac
d17a9520b9 work on obiclean chimera detection 2025-10-20 17:29:47 +02:00
Eric Coissac
29bf4ce871 add a feature to obimatrix adding obicsv option to obimatrix 2025-10-20 16:34:58 +02:00
coissac
d7ed9d343e Update install_obitools.sh for missing directory 2025-10-15 08:32:06 +02:00
Eric Coissac
82b6bb1ab6 correct a bug in func (worker SeqWorker) ChainWorkers(next SeqWorker) SeqWorker 2025-08-11 15:09:49 +02:00
Eric Coissac
6d204f6281 Patch the fastq detector 2025-08-08 10:23:03 -04:00
Eric Coissac
7a6d552450 Changes to be committed:
modified:   pkg/obioptions/version.go
2025-08-07 17:01:48 -04:00
Eric Coissac
412b54822c Patch a bug in obliclean for d>1 leading to some instability in the result 2025-08-07 17:01:38 -04:00
Eric Coissac
730d448fc3 Allows for only one cpu and it should work 2025-08-06 16:09:25 -04:00
Eric Coissac
04f3af3e60 some renaming of functions 2025-08-06 15:54:50 -04:00
Eric Coissac
997b6e8c01 correct the fastq detector for distinguish with a csv ngsfilter 2025-08-06 15:52:54 -04:00
Eric Coissac
f239e8da92 Rename ISequenceChunk 2025-08-05 08:49:45 -04:00
Eric Coissac
ed28d3fb5b Adds a --u-to-t option 2025-07-07 15:35:26 +02:00
Eric Coissac
43b285587e Debug on taxonomy extraction and CSV conversion 2025-07-07 15:29:40 +02:00
Eric Coissac
8d53d253d4 Add a reading option on readers to convet U to T 2025-07-07 15:29:07 +02:00
Eric Coissac
8c26fc9884 Add a new test on obisummary 2025-07-07 15:28:29 +02:00
Eric Coissac
235a7e202a Update obisummary to account new obiseq.StatsOnValues type 2025-06-19 17:21:30 +02:00
Eric Coissac
27fa984a63 Patch obimatrix accoring to the new type obiseq.StatsOnValues 2025-06-19 16:51:53 +02:00
Eric Coissac
add9d89ccc Patch the Min and Max values of the expression language 2025-06-19 16:43:26 +02:00
Eric Coissac
9965370d85 Manage a lock on StatsOnValues 2025-06-17 16:46:11 +02:00
Eric Coissac
8a2bb1fe82 Changes to be committed:
modified:   pkg/obioptions/version.go
	modified:   pkg/obiseq/merge.go
2025-06-17 12:11:35 +02:00
Eric Coissac
efc3f3af29 Patch a concurrent access problem 2025-06-17 12:05:42 +02:00
Eric Coissac
1c6ab1c559 Changes to be committed:
modified:   pkg/obingslibrary/multimatch.go
	modified:   pkg/obioptions/version.go
2025-06-17 09:06:42 +02:00
Eric Coissac
38dcd98d4a Patch the genbank parser automata 2025-06-17 08:52:45 +02:00
Eric Coissac
7b23985693 Add _ to allowed in taxid 2025-06-06 14:37:57 +02:00
Eric Coissac
d31e677304 Patch a bug in obitag 2025-06-04 14:47:28 +02:00
Eric Coissac
6cb7a5a352 Changes to be committed:
modified:   cmd/obitools/obitag/main.go
	modified:   cmd/obitools/obitaxonomy/main.go
	modified:   pkg/obiformats/csvtaxdump_read.go
	modified:   pkg/obiformats/ecopcr_read.go
	modified:   pkg/obiformats/ncbitaxdump_read.go
	modified:   pkg/obiformats/ncbitaxdump_readtar.go
	modified:   pkg/obiformats/newick_write.go
	modified:   pkg/obiformats/options.go
	modified:   pkg/obiformats/taxonomy_read.go
	modified:   pkg/obiformats/universal_read.go
	modified:   pkg/obiiter/extract_taxonomy.go
	modified:   pkg/obioptions/options.go
	modified:   pkg/obioptions/version.go
	new file:   pkg/obiphylo/tree.go
	modified:   pkg/obiseq/biosequenceslice.go
	modified:   pkg/obiseq/taxonomy_methods.go
	modified:   pkg/obitax/taxonomy.go
	modified:   pkg/obitax/taxonset.go
	modified:   pkg/obitools/obiconvert/sequence_reader.go
	modified:   pkg/obitools/obitag/obitag.go
	modified:   pkg/obitools/obitaxonomy/obitaxonomy.go
	modified:   pkg/obitools/obitaxonomy/options.go
	deleted:    sample/.DS_Store
2025-06-04 09:48:10 +02:00
Eric Coissac
3424d3057f Changes to be committed:
modified:   pkg/obiformats/ngsfilter_read.go
	modified:   pkg/obioptions/version.go
	modified:   pkg/obiutils/mimetypes.go
2025-05-14 14:53:25 +02:00
Eric Coissac
f9324dd8f4 add min and max to the obitools expression language 2025-05-13 16:03:03 +02:00
Eric Coissac
f1b9ac4a13 Update the expression language 2025-05-07 20:45:05 +02:00
Eric Coissac
e065e2963b Update the install script 2025-05-01 11:45:46 +02:00
Eric Coissac
13ff892ac9 Patch type mismatch in apat C library 2025-04-23 16:14:10 +02:00
Eric Coissac
c0ecaf90ab Add the --number option to obiannotate 2025-04-22 18:35:51 +02:00
Eric Coissac
a57cfda675 Make the replace function of the eval language accepting regex 2025-04-10 15:17:15 +02:00
Eric Coissac
c2f38e737b Update of the packages 2025-04-10 15:16:36 +02:00
Eric Coissac
0aec5ba4df change the tests according to the corrections in obipairing 2025-04-04 17:10:17 +02:00
Eric Coissac
67e5b6ef24 Changes to be committed:
modified:   pkg/obioptions/version.go
2025-04-04 17:02:45 +02:00
Eric Coissac
3b1aa2869e Changes to be committed:
modified:   pkg/obioptions/version.go
2025-04-04 17:01:20 +02:00
Eric Coissac
7542e33010 Several bugs dicoverd during the doc writing 2025-04-04 16:59:27 +02:00
Eric Coissac
03b5ce9397 Patch a bug in obitag when some reference sequences have taxid absent from the taxonomy 2025-03-27 16:45:02 +01:00
Eric Coissac
2d52322876 Patch a bug in the obi2 annotation parser on map indexed by integers 2025-03-27 14:54:13 +01:00
Eric Coissac
fd80249b85 Patch a bug in obitag when a taxon from the reference library is unknown in the taxonomy 2025-03-27 14:28:15 +01:00
Eric Coissac
5a3705b6bb Adds the --silent-warning options to the obitools commands and removes the --pared-with option from some of the obitols commands. 2025-03-25 16:44:46 +01:00
Eric Coissac
2ab6f67d58 Add a progress bar to chimera detection 2025-03-25 08:37:27 +01:00
Eric Coissac
8b379d30da Adds the --newick-output option to the obitaxonomy command 2025-03-14 14:24:12 +01:00
Eric Coissac
8448783499 Make sequence files recognized as a taxonomy 2025-03-14 14:22:22 +01:00
Eric Coissac
d1c31c54de add a first version of the inline documentation 2025-03-12 14:40:42 +01:00
Eric Coissac
7a9dc1ab3b update release notes 2025-03-12 14:06:20 +01:00
Eric Coissac
3a1cf4fe97 Accelerate the speed of very long fasta sequences, and more generaly of every format 2025-03-12 13:29:41 +01:00
Eric Coissac
83926c91e1 Patch the install script to desactivate the CSV check 2025-03-12 13:28:52 +01:00
Eric Coissac
937a483aa6 Changes to be committed:
modified:   Makefile
2025-03-12 12:55:41 +01:00
Eric Coissac
dada70e6b1 Changes to be committed:
modified:   Makefile
2025-03-12 12:49:34 +01:00
Eric Coissac
62e5a93492 update the compress option name 2025-03-11 17:14:40 +01:00
Eric Coissac
f21f51ae62 Correct the logic of --update-taxid and --fail-on-taxonomy 2025-03-11 16:56:02 +01:00
Eric Coissac
3b5d4ba455 patch a bug in obiannotate 2025-03-11 16:35:38 +01:00
Eric Coissac
50d11ce374 Add a pre-push git-hook to run tests on obitools commands before pushing on master 2025-03-08 18:56:02 +01:00
Eric Coissac
52d5f6fe11 make makefile crashing on test error 2025-03-08 16:54:24 +01:00
Eric Coissac
78caabd2fd Add basic test on -h for all the commands 2025-03-08 16:28:06 +01:00
Eric Coissac
65bd29b955 normalize the usage of obitaxonomy 2025-03-08 13:00:55 +01:00
Eric Coissac
b18c9b7ac6 add the --raw-taxid option 2025-03-08 09:40:06 +01:00
Eric Coissac
78df7db18d typos 2025-03-08 07:44:41 +01:00
Eric Coissac
fc08c12ab0 update release notes 2025-03-08 07:42:20 +01:00
Eric Coissac
0339e4dffa Patch size limite of the filetype guesser 2025-03-08 07:34:02 +01:00
Eric Coissac
706b44c37f Add option for csv input format 2025-03-08 07:21:24 +01:00
Eric Coissac
fbe7d15dc3 Changes to be committed:
modified:   pkg/obioptions/version.go
	modified:   pkg/obitools/obicleandb/obicleandb.go
	modified:   pkg/obitools/obicleandb/options.go
2025-03-06 13:38:38 +01:00
Eric Coissac
b5cf586f17 patch a duplicate --taxonomy option in obirefidxdb 2025-03-06 11:36:20 +01:00
Eric Coissac
286e27d6ba patch the scienctific_name tag name to "scientific_name" 2025-03-05 14:22:12 +01:00
219 changed files with 14136 additions and 1186 deletions

176
.github/workflows/release.yml vendored Normal file
View File

@@ -0,0 +1,176 @@
name: Create Release on Tag
on:
push:
tags:
- "Release_*"
permissions:
contents: write
jobs:
# First run tests
test:
runs-on: ubuntu-latest
steps:
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: "1.23"
- name: Checkout obitools4 project
uses: actions/checkout@v4
- name: Run tests
run: make githubtests
# Then create release only if tests pass
create-release:
needs: test
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: "1.23"
- name: Extract version from tag
id: get_version
run: |
TAG=${GITHUB_REF#refs/tags/Release_}
echo "version=$TAG" >> $GITHUB_OUTPUT
echo "tag_name=Release_$TAG" >> $GITHUB_OUTPUT
- name: Build binaries for multiple platforms
env:
VERSION: ${{ steps.get_version.outputs.version }}
run: |
mkdir -p release
# Build for Linux AMD64
echo "Building for Linux AMD64..."
GOOS=linux GOARCH=amd64 make
cd build
for binary in *; do
tar -czf ../release/${binary}_${VERSION}_linux_amd64.tar.gz ${binary}
done
cd ..
rm -rf build
# Build for Linux ARM64
echo "Building for Linux ARM64..."
GOOS=linux GOARCH=arm64 make
cd build
for binary in *; do
tar -czf ../release/${binary}_${VERSION}_linux_arm64.tar.gz ${binary}
done
cd ..
rm -rf build
# Build for macOS AMD64 (Intel)
echo "Building for macOS AMD64..."
GOOS=darwin GOARCH=amd64 make
cd build
for binary in *; do
tar -czf ../release/${binary}_${VERSION}_darwin_amd64.tar.gz ${binary}
done
cd ..
rm -rf build
# Build for macOS ARM64 (Apple Silicon)
echo "Building for macOS ARM64..."
GOOS=darwin GOARCH=arm64 make
cd build
for binary in *; do
tar -czf ../release/${binary}_${VERSION}_darwin_arm64.tar.gz ${binary}
done
cd ..
rm -rf build
# Build for Windows AMD64
echo "Building for Windows AMD64..."
GOOS=windows GOARCH=amd64 make
cd build
for binary in *; do
# Windows binaries have .exe extension
if [ -f "${binary}.exe" ]; then
zip ../release/${binary}_${VERSION}_windows_amd64.zip ${binary}.exe
else
zip ../release/${binary}_${VERSION}_windows_amd64.zip ${binary}
fi
done
cd ..
echo "Built archives:"
ls -lh release/
- name: Generate Release Notes
id: release_notes
env:
VERSION: ${{ steps.get_version.outputs.version }}
run: |
PREV_TAG=$(git describe --tags --abbrev=0 HEAD^ 2>/dev/null || echo "")
echo "# OBITools4 Release ${VERSION}" > release_notes.md
echo "" >> release_notes.md
if [ -n "$PREV_TAG" ]; then
echo "## Changes since ${PREV_TAG}" >> release_notes.md
echo "" >> release_notes.md
git log ${PREV_TAG}..HEAD --pretty=format:"- %s" >> release_notes.md
else
echo "## Changes" >> release_notes.md
echo "" >> release_notes.md
git log --pretty=format:"- %s" -n 20 >> release_notes.md
fi
echo "" >> release_notes.md
echo "" >> release_notes.md
echo "## Installation" >> release_notes.md
echo "" >> release_notes.md
echo "Download the appropriate binary for your system and extract it:" >> release_notes.md
echo "" >> release_notes.md
echo "### Linux (AMD64)" >> release_notes.md
echo '```bash' >> release_notes.md
echo "tar -xzf <tool>_${VERSION}_linux_amd64.tar.gz" >> release_notes.md
echo '```' >> release_notes.md
echo "" >> release_notes.md
echo "### Linux (ARM64)" >> release_notes.md
echo '```bash' >> release_notes.md
echo "tar -xzf <tool>_${VERSION}_linux_arm64.tar.gz" >> release_notes.md
echo '```' >> release_notes.md
echo "" >> release_notes.md
echo "### macOS (Intel)" >> release_notes.md
echo '```bash' >> release_notes.md
echo "tar -xzf <tool>_${VERSION}_darwin_amd64.tar.gz" >> release_notes.md
echo '```' >> release_notes.md
echo "" >> release_notes.md
echo "### macOS (Apple Silicon)" >> release_notes.md
echo '```bash' >> release_notes.md
echo "tar -xzf <tool>_${VERSION}_darwin_arm64.tar.gz" >> release_notes.md
echo '```' >> release_notes.md
echo "" >> release_notes.md
echo "### Windows (AMD64)" >> release_notes.md
echo '```powershell' >> release_notes.md
echo "Expand-Archive <tool>_${VERSION}_windows_amd64.zip" >> release_notes.md
echo '```' >> release_notes.md
echo "" >> release_notes.md
echo "Available tools: Replace \`<tool>\` with one of the obitools commands." >> release_notes.md
- name: Create GitHub Release
uses: softprops/action-gh-release@v1
with:
name: Release ${{ steps.get_version.outputs.version }}
body_path: release_notes.md
files: release/*
draft: false
prerelease: false
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

6
.gitignore vendored
View File

@@ -16,12 +16,18 @@
**/*.tgz
**/*.yaml
**/*.csv
xx
.rhistory
/.vscode
/build
/bugs
/ncbitaxo
!/obitests/**
!/sample/**
LLM/**
*_files
entropy.html

104
Makefile
View File

@@ -2,8 +2,9 @@
#export GOBIN=$(GOPATH)/bin
#export PATH=$(GOBIN):$(shell echo $${PATH})
GOFLAGS=
GOCMD=go
GOBUILD=$(GOCMD) build # -compiler gccgo -gccgoflags -O3
GOBUILD=$(GOCMD) build $(GOFLAGS)
GOGENERATE=$(GOCMD) generate
GOCLEAN=$(GOCMD) clean
GOTEST=$(GOCMD) test
@@ -16,6 +17,12 @@ PACKAGES_SRC:= $(wildcard pkg/*/*.go pkg/*/*/*.go)
PACKAGE_DIRS:=$(sort $(patsubst %/,%,$(dir $(PACKAGES_SRC))))
PACKAGES:=$(notdir $(PACKAGE_DIRS))
GITHOOK_SRC_DIR=git-hooks
GITHOOKS_SRC:=$(wildcard $(GITHOOK_SRC_DIR)/*)
GITHOOK_DIR=.git/hooks
GITHOOKS:=$(patsubst $(GITHOOK_SRC_DIR)/%,$(GITHOOK_DIR)/%,$(GITHOOKS_SRC))
OBITOOLS_SRC:= $(wildcard cmd/obitools/*/*.go)
OBITOOLS_DIRS:=$(sort $(patsubst %/,%,$(dir $(OBITOOLS_SRC))))
OBITOOLS:=$(notdir $(OBITOOLS_DIRS))
@@ -53,34 +60,31 @@ endif
OUTPUT:=$(shell mktemp)
all: obitools
all: install-githook obitools
obitools: $(patsubst %,$(OBITOOLS_PREFIX)%,$(OBITOOLS))
install-githook: $(GITHOOKS)
$(GITHOOK_DIR)/%: $(GITHOOK_SRC_DIR)/%
@echo installing $$(basename $@)...
@mkdir -p $(GITHOOK_DIR)
@cp $< $@
@chmod +x $@
packages: $(patsubst %,pkg-%,$(PACKAGES))
obitools: $(patsubst %,$(OBITOOLS_PREFIX)%,$(OBITOOLS))
update-deps:
go get -u ./...
test:
test: .FORCE
$(GOTEST) ./...
obitests:
obitests:
@for t in $$(find obitests -name test.sh -print) ; do \
bash $${t} ;\
done
bash $${t} || exit 1;\
done
githubtests: obitools obitests
man:
make -C doc man
obibook:
make -C doc obibook
doc: man obibook
macos-pkg:
@bash pkgs/macos/macos-installer-builder-master/macOS-x64/build-macos-x64.sh \
OBITools \
0.0.1
$(BUILD_DIR):
mkdir -p $@
@@ -90,19 +94,61 @@ $(foreach P,$(PACKAGE_DIRS),$(eval $(call MAKE_PKG_RULE,$(P))))
$(foreach P,$(OBITOOLS_DIRS),$(eval $(call MAKE_OBITOOLS_RULE,$(P))))
pkg/obioptions/version.go: .FORCE
ifneq ($(strip $(COMMIT_ID)),)
@cat $@ \
| sed -E 's/^var _Commit = "[^"]*"/var _Commit = "'$(COMMIT_ID)'"/' \
| sed -E 's/^var _Version = "[^"]*"/var _Version = "'"$(LAST_TAG)"'"/' \
pkg/obioptions/version.go: version.txt .FORCE
@version=$$(cat version.txt); \
cat $@ \
| sed -E 's/^var _Version = "[^"]*"/var _Version = "Release '$$version'"/' \
> $(OUTPUT)
@diff $@ $(OUTPUT) 2>&1 > /dev/null \
|| echo "Update version.go : $@ to $(LAST_TAG) ($(COMMIT_ID))" \
&& mv $(OUTPUT) $@
|| (echo "Update version.go to $$(cat version.txt)" && mv $(OUTPUT) $@)
@rm -f $(OUTPUT)
endif
.PHONY: all packages obitools man obibook doc update-deps obitests githubtests .FORCE
.FORCE:
bump-version:
@echo "Incrementing version..."
@current=$$(cat version.txt); \
echo " Current version: $$current"; \
major=$$(echo $$current | cut -d. -f1); \
minor=$$(echo $$current | cut -d. -f2); \
patch=$$(echo $$current | cut -d. -f3); \
new_patch=$$((patch + 1)); \
new_version="$$major.$$minor.$$new_patch"; \
echo " New version: $$new_version"; \
echo "$$new_version" > version.txt
@echo "✓ Version updated in version.txt"
@$(MAKE) pkg/obioptions/version.go
jjnew:
@echo "$(YELLOW)→ Creating a new commit...$(NC)"
@echo "$(BLUE)→ Documenting current commit...$(NC)"
@jj auto-describe
@echo "$(BLUE)→ Done.$(NC)"
@jj new
@echo "$(GREEN)✓ New commit created$(NC)"
jjpush:
@echo "$(YELLOW)→ Pushing commit to repository...$(NC)"
@echo "$(BLUE)→ Documenting current commit...$(NC)"
@jj auto-describe
@echo "$(BLUE)→ Creating new commit for version bump...$(NC)"
@jj new
@$(MAKE) bump-version
@echo "$(BLUE)→ Documenting version bump commit...$(NC)"
@jj auto-describe
@version=$$(cat version.txt); \
tag_name="Release_$$version"; \
echo "$(BLUE)→ Pushing commits and creating tag $$tag_name...$(NC)"; \
jj git push --change @; \
git tag -a "$$tag_name" -m "Release $$version" 2>/dev/null || echo "Tag $$tag_name already exists"; \
git push origin "$$tag_name" 2>/dev/null || echo "Tag already pushed"
@echo "$(GREEN)✓ Commits and tag pushed to repository$(NC)"
jjfetch:
@echo "$(YELLOW)→ Pulling latest commits...$(NC)"
@jj git fetch
@jj new master@origin
@echo "$(GREEN)✓ Latest commits pulled$(NC)"
.PHONY: all obitools update-deps obitests githubtests jjnew jjpush jjfetch bump-version .FORCE
.FORCE:

View File

@@ -1,13 +1,64 @@
# OBITools release notes
## New changes
### Bug fixes
- In `obipairing` correct the misspelling of the `obiparing_*` tags where the `i`
was missing to `obipairing_`.
- In `obigrep` the **-C** option that excludes sequences too abundant was not
functional.
- In `obitaxonomy` the **-l** option that lists all the taxonomic rank defined by
a taxonomy was not functional
- The file type guesser was not using enough data to be able to correctly detect
file format when sequences were too long in fastq and fasta or when lines were
to long in CSV files. That's now corrected
- Options **--fasta** or **--fastq** usable to specify input format were ignored.
They are now correctly considered
- The `obiannotate` command were crashing when a selection option was used but
no editing option.
- The `--fail-on-taxonomy` led to an error on merged taxa even when the
`--update-taxid` option was used.
- The `--compressed` option was not correctly named. It was renamed to `--compress`
### Enhancement
- Some sequences in the Genbank and EMBL databases are several gigabases long. The
sequence parser had to reallocate and recopy memory many times to read them,
resulting in a complexity of O(N^2) for reading such large sequences.
The new file chunk reader has a linear algorithm that speeds up the reading
of very long sequences.
- A new option **--csv** is added to every obitools to indicate that the input
format is CSV
- The new version of obitools are now printing the taxids in a fancy way
including the scientific name and the taxonomic rank (`"taxon:9606 [Homo
sapiens]@species"`). But if you need the old fashion raw taxid, a new option
**--raw-taxid** has been added to get obitools printing the taxids without any
decorations (`"9606"`).
## March 1st, 2025. Release 4.4.0
A new documentation website is available at https://obitools4.metabarcoding.org.
Its development is still in progress.
The biggest step forward in this new version is taxonomy management.
The new version is now able to handle taxonomic identifiers that are not just integer values. This is a first step towards an easy way to handle other taxonomy databases soon, such as the GBIF or Catalogue of Life taxonomies.
This version is able to handle files containing taxonomic information created by previous versions of OBITools, but files created by this new version may have some problems to be analysed by previous versions, at least for the taxonomic information.
The biggest step forward in this new version is taxonomy management. The new
version is now able to handle taxonomic identifiers that are not just integer
values. This is a first step towards an easy way to handle other taxonomy
databases soon, such as the GBIF or Catalog of Life taxonomies. This version
is able to handle files containing taxonomic information created by previous
versions of OBITools, but files created by this new version may have some
problems to be analyzed by previous versions, at least for the taxonomic
information.
### Breaking changes

View File

@@ -0,0 +1,213 @@
# Index de k-mers pour génomes de grande taille
## Contexte et objectifs
### Cas d'usage
- Indexation de k-mers longs (k=31) pour des génomes de grande taille (< 10 Go par génome)
- Nombre de génomes : plusieurs dizaines à quelques centaines
- Indexation en parallèle
- Stockage sur disque
- Possibilité d'ajouter des génomes, mais pas de modifier un génome existant
### Requêtes cibles
- **Présence/absence** d'un k-mer dans un génome
- **Intersection** entre génomes
- **Distances** : Jaccard (présence/absence) et potentiellement Bray-Curtis (comptage)
### Ressources disponibles
- 128 Go de RAM
- Stockage disque
---
## Estimation des volumes
### Par génome
- **10 Go de séquence** → ~10¹⁰ k-mers bruts (chevauchants)
- **Après déduplication** : typiquement 10-50% de k-mers uniques → **~1-5 × 10⁹ k-mers distincts**
### Espace théorique
- **k=31** → 62 bits → ~4.6 × 10¹⁸ k-mers possibles
- Table d'indexation directe impossible
---
## Métriques de distance
### Présence/absence (binaire)
- **Jaccard** : |A ∩ B| / |A B|
- **Sørensen-Dice** : 2|A ∩ B| / (|A| + |B|)
### Comptage (abondance)
- **Bray-Curtis** : 1 - (2 × Σ min(aᵢ, bᵢ)) / (Σ aᵢ + Σ bᵢ)
Note : Pour Bray-Curtis, le stockage des comptages est nécessaire, ce qui augmente significativement la taille de l'index.
---
## Options d'indexation
### Option 1 : Bloom Filter par génome
**Principe** : Structure probabiliste pour test d'appartenance.
**Avantages :**
- Très compact : ~10 bits/élément pour FPR ~1%
- Construction rapide, streaming
- Facile à sérialiser/désérialiser
- Intersection et Jaccard estimables via formules analytiques
**Inconvénients :**
- Faux positifs (pas de faux négatifs)
- Distances approximatives
**Taille estimée** : 1-6 Go par génome (selon FPR cible)
#### Dimensionnement des Bloom filters
```
\mathrm{FPR} ;=; \left(1 - e^{-h n / m}\right)^h
```
| Bits/élément | FPR optimal | k (hash functions) |
|--------------|-------------|---------------------|
| 8 | ~2% | 5-6 |
| 10 | ~1% | 7 |
| 12 | ~0.3% | 8 |
| 16 | ~0.01% | 11 |
Formule du taux de faux positifs :
```
FPR ≈ (1 - e^(-kn/m))^k
```
Où n = nombre d'éléments, m = nombre de bits, k = nombre de hash functions.
### Option 2 : Ensemble trié de k-mers
**Principe** : Stocker les k-mers (uint64) triés, avec compression possible.
**Avantages :**
- Exact (pas de faux positifs)
- Intersection/union par merge sort O(n+m)
- Compression efficace (delta encoding sur k-mers triés)
**Inconvénients :**
- Plus volumineux : 8 octets/k-mer
- Construction plus lente (tri nécessaire)
**Taille estimée** : 8-40 Go par génome (non compressé)
### Option 3 : MPHF (Minimal Perfect Hash Function)
**Principe** : Fonction de hash parfaite minimale pour les k-mers présents.
**Avantages :**
- Très compact : ~3-4 bits/élément
- Lookup O(1)
- Exact pour les k-mers présents
**Inconvénients :**
- Construction coûteuse (plusieurs passes)
- Statique (pas d'ajout de k-mers après construction)
- Ne distingue pas "absent" vs "jamais vu" sans structure auxiliaire
### Option 4 : Hybride MPHF + Bloom filter
- MPHF pour mapping compact des k-mers présents
- Bloom filter pour pré-filtrage des absents
---
## Optimisation : Indexation de (k-2)-mers pour requêtes k-mers
### Principe
Au lieu d'indexer directement les 31-mers dans un Bloom filter, on indexe les 29-mers. Pour tester la présence d'un 31-mer, on vérifie que les **trois 29-mers** qu'il contient sont présents :
- positions 0-28
- positions 1-29
- positions 2-30
### Analyse probabiliste
Si le Bloom filter a un FPR de p pour un 29-mer individuel, le FPR effectif pour un 31-mer devient **p³** (les trois requêtes doivent toutes être des faux positifs).
| FPR 29-mer | FPR 31-mer effectif |
|------------|---------------------|
| 10% | 0.1% |
| 5% | 0.0125% |
| 1% | 0.0001% |
### Avantages
1. **Moins d'éléments à stocker** : il y a moins de 29-mers distincts que de 31-mers distincts dans un génome (deux 31-mers différents peuvent partager un même 29-mer)
2. **FPR drastiquement réduit** : FPR³ avec seulement 3 requêtes
3. **Index plus compact** : on peut utiliser moins de bits par élément (FPR plus élevé acceptable sur le 29-mer) tout en obtenant un FPR très bas sur le 31-mer
### Trade-off
Un Bloom filter à **5-6 bits/élément** pour les 29-mers donnerait un FPR effectif < 0.01% pour les 31-mers, soit environ **2× plus compact** que l'approche directe à qualité égale.
**Coût** : 3× plus de requêtes par lookup (mais les requêtes Bloom sont très rapides).
---
## Accélération des calculs de distance : MinHash
### Principe
Pré-calculer une "signature" compacte (sketch) de chaque génome permettant d'estimer rapidement Jaccard sans charger les index complets.
### Avantages
- Matrice de distances entre 100+ génomes en quelques secondes
- Signature de taille fixe (ex: 1000-10000 hash values) quel que soit le génome
- Stockage minimal
### Utilisation
1. Construction : une passe sur les k-mers de chaque génome
2. Distance : comparaison des sketches en O(taille du sketch)
---
## Architecture recommandée
### Pour présence/absence + Jaccard
1. **Index principal** : Bloom filter de (k-2)-mers avec l'optimisation décrite
- Compact (~3-5 Go par génome)
- FPR très bas pour les k-mers grâce aux requêtes triples
2. **Sketches MinHash** : pour calcul rapide des distances entre génomes
- Quelques Ko par génome
- Permet exploration rapide de la matrice de distances
### Pour comptage + Bray-Curtis
1. **Index principal** : k-mers triés + comptages
- uint64 (k-mer) + uint8/uint16 (count)
- Compression delta possible
- Plus volumineux mais exact
2. **Sketches** : variantes de MinHash pour données pondérées (ex: HyperMinHash)
---
## Prochaines étapes
1. Implémenter un Bloom filter optimisé pour k-mers
2. Implémenter l'optimisation (k-2)-mer → k-mer
3. Implémenter MinHash pour les sketches
4. Définir le format de sérialisation sur disque
5. Benchmarker sur des génomes réels

View File

@@ -30,7 +30,11 @@ func main() {
// trace.Start(ftrace)
// defer trace.Stop()
optionParser := obioptions.GenerateOptionParser(obiannotate.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obiannotate",
"edits the sequence annotations",
obiannotate.OptionSet,
)
_, args := optionParser(os.Args)
@@ -38,6 +42,11 @@ func main() {
obiconvert.OpenSequenceDataErrorMessage(args, err)
annotator := obiannotate.CLIAnnotationPipeline()
if obiannotate.CLIHasSetNumberFlag() {
sequences = sequences.NumberSequences(1, !obiconvert.CLINoInputOrder())
}
obiconvert.CLIWriteBioSequences(sequences.Pipe(annotator), true)
obiutils.WaitForLastPipe()

View File

@@ -11,7 +11,10 @@ import (
)
func main() {
optionParser := obioptions.GenerateOptionParser(obiclean.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obiclean",
"",
obiclean.OptionSet)
_, args := optionParser(os.Args)

View File

@@ -14,7 +14,10 @@ import (
func main() {
obidefault.SetBatchSize(10)
optionParser := obioptions.GenerateOptionParser(obicleandb.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obicleandb",
"clean-up reference databases",
obicleandb.OptionSet)
_, args := optionParser(os.Args)

View File

@@ -11,7 +11,10 @@ import (
)
func main() {
optionParser := obioptions.GenerateOptionParser(obiconvert.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obicomplement",
"reverse complement of sequences",
obiconvert.OptionSet(true))
_, args := optionParser(os.Args)

View File

@@ -11,7 +11,10 @@ import (
)
func main() {
optionParser := obioptions.GenerateOptionParser(obiconsensus.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obiconsensus",
"ONT reads denoising",
obiconsensus.OptionSet)
_, args := optionParser(os.Args)

View File

@@ -14,7 +14,10 @@ func main() {
obidefault.SetStrictReadWorker(2)
obidefault.SetStrictWriteWorker(2)
optionParser := obioptions.GenerateOptionParser(obiconvert.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obiconvert",
"convertion of sequence files to various formats",
obiconvert.OptionSet(true))
_, args := optionParser(os.Args)

View File

@@ -28,6 +28,8 @@ func main() {
// defer trace.Stop()
optionParser := obioptions.GenerateOptionParser(
"obicount",
"counts the sequences present in a file of sequences",
obiconvert.InputOptionSet,
obicount.OptionSet,
)

View File

@@ -10,7 +10,10 @@ import (
)
func main() {
optionParser := obioptions.GenerateOptionParser(obicsv.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obicsv",
"converts sequence files to CSV format",
obicsv.OptionSet)
_, args := optionParser(os.Args)

View File

@@ -15,7 +15,10 @@ func main() {
obidefault.SetStrictReadWorker(2)
obidefault.SetStrictWriteWorker(2)
optionParser := obioptions.GenerateOptionParser(obidemerge.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obidemerge",
"",
obidemerge.OptionSet)
_, args := optionParser(os.Args)

View File

@@ -11,7 +11,10 @@ import (
)
func main() {
optionParser := obioptions.GenerateOptionParser(obidistribute.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obidistribute",
"divided an input set of sequences into subsets",
obidistribute.OptionSet)
_, args := optionParser(os.Args)

View File

@@ -30,7 +30,10 @@ func main() {
// trace.Start(ftrace)
// defer trace.Stop()
optionParser := obioptions.GenerateOptionParser(obigrep.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obigrep",
"select a subset of sequences on various criteria",
obigrep.OptionSet)
_, args := optionParser(os.Args)

View File

@@ -15,7 +15,10 @@ func main() {
obidefault.SetStrictReadWorker(2)
obidefault.SetStrictWriteWorker(2)
optionParser := obioptions.GenerateOptionParser(obijoin.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obijoin",
"merge annotations contained in a file to another file",
obijoin.OptionSet)
_, args := optionParser(os.Args)

View File

@@ -31,7 +31,10 @@ func main() {
// trace.Start(ftrace)
// defer trace.Stop()
optionParser := obioptions.GenerateOptionParser(obikmersim.MatchOptionSet)
optionParser := obioptions.GenerateOptionParser(
"obikmermatch",
"",
obikmersim.MatchOptionSet)
_, args := optionParser(os.Args)

View File

@@ -32,7 +32,10 @@ func main() {
// trace.Start(ftrace)
// defer trace.Stop()
optionParser := obioptions.GenerateOptionParser(obikmersim.CountOptionSet)
optionParser := obioptions.GenerateOptionParser(
"obikmersimcount",
"",
obikmersim.CountOptionSet)
_, args := optionParser(os.Args)

View File

@@ -11,7 +11,10 @@ import (
)
func main() {
optionParser := obioptions.GenerateOptionParser(obilandmark.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obilandmark",
"",
obilandmark.OptionSet)
_, args := optionParser(os.Args)

View File

@@ -0,0 +1,47 @@
package main
import (
"os"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiseq"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obilowmask"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
)
func main() {
defer obiseq.LogBioSeqStatus()
// go tool pprof -http=":8000" ./obipairing ./cpu.pprof
// f, err := os.Create("cpu.pprof")
// if err != nil {
// log.Fatal(err)
// }
// pprof.StartCPUProfile(f)
// defer pprof.StopCPUProfile()
// go tool trace cpu.trace
// ftrace, err := os.Create("cpu.trace")
// if err != nil {
// log.Fatal(err)
// }
// trace.Start(ftrace)
// defer trace.Stop()
optionParser := obioptions.GenerateOptionParser(
"obimicrosat",
"looks for microsatellites sequences in a sequence file",
obilowmask.OptionSet)
_, args := optionParser(os.Args)
sequences, err := obiconvert.CLIReadBioSequences(args...)
obiconvert.OpenSequenceDataErrorMessage(args, err)
selected := obilowmask.CLISequenceEntropyMasker(sequences)
obiconvert.CLIWriteBioSequences(selected, true)
obiutils.WaitForLastPipe()
}

View File

@@ -31,6 +31,8 @@ func main() {
// defer trace.Stop()
optionParser := obioptions.GenerateOptionParser(
"obimatrix",
"",
obimatrix.OptionSet,
)

View File

@@ -30,7 +30,10 @@ func main() {
// trace.Start(ftrace)
// defer trace.Stop()
optionParser := obioptions.GenerateOptionParser(obimicrosat.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obimicrosat",
"looks for microsatellites sequences in a sequence file",
obimicrosat.OptionSet)
_, args := optionParser(os.Args)

View File

@@ -28,7 +28,10 @@ func main() {
// trace.Start(ftrace)
// defer trace.Stop()
optionParser := obioptions.GenerateOptionParser(obimultiplex.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obimultiplex",
"demultiplex amplicons",
obimultiplex.OptionSet)
_, args := optionParser(os.Args)

View File

@@ -30,7 +30,10 @@ func main() {
// trace.Start(ftrace)
// defer trace.Stop()
optionParser := obioptions.GenerateOptionParser(obipairing.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obipairing",
"align forward with reverse reads with paired reads",
obipairing.OptionSet)
optionParser(os.Args)

View File

@@ -29,7 +29,10 @@ func main() {
obidefault.SetParallelFilesRead(obidefault.ParallelWorkers() / 4)
obidefault.SetBatchSize(10)
optionParser := obioptions.GenerateOptionParser(obipcr.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obipcr",
"simulates a PCR on a sequence files",
obipcr.OptionSet)
_, args := optionParser(os.Args)

View File

@@ -11,7 +11,10 @@ import (
)
func main() {
optionParser := obioptions.GenerateOptionParser(obirefidx.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obireffamidx",
"",
obirefidx.OptionSet)
_, args := optionParser(os.Args)

View File

@@ -11,7 +11,10 @@ import (
)
func main() {
optionParser := obioptions.GenerateOptionParser(obirefidx.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obirefidx",
"",
obirefidx.OptionSet)
_, args := optionParser(os.Args)

View File

@@ -31,7 +31,10 @@ func main() {
// trace.Start(ftrace)
// defer trace.Stop()
optionParser := obioptions.GenerateOptionParser(obiscript.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obiscript",
"executes a lua script on the input sequences",
obiscript.OptionSet)
_, args := optionParser(os.Args)

View File

@@ -31,7 +31,10 @@ func main() {
// trace.Start(ftrace)
// defer trace.Stop()
optionParser := obioptions.GenerateOptionParser(obisplit.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obisplit",
"",
obisplit.OptionSet)
_, args := optionParser(os.Args)

View File

@@ -33,7 +33,10 @@ func main() {
// trace.Start(ftrace)
// defer trace.Stop()
optionParser := obioptions.GenerateOptionParser(obisummary.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obisummary",
"resume main information from a sequence file",
obisummary.OptionSet)
_, args := optionParser(os.Args)

View File

@@ -11,6 +11,7 @@ import (
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitax"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obitag"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obitaxonomy"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
@@ -39,7 +40,10 @@ func main() {
obidefault.SetStrictWriteWorker(1)
obidefault.SetBatchSize(10)
optionParser := obioptions.GenerateOptionParser(obitag.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obitag",
"realizes taxonomic assignment",
obitag.OptionSet)
_, args := optionParser(os.Args)
@@ -55,7 +59,7 @@ func main() {
}
if taxo == nil {
taxo, err = references.ExtractTaxonomy(nil)
taxo, err = references.ExtractTaxonomy(nil, obitaxonomy.CLINewickWithLeaves())
if err != nil {
log.Fatalf("No taxonomy specified or extractable from reference database: %v", err)
@@ -70,10 +74,12 @@ func main() {
var identified obiiter.IBioSequence
fsrb := fs.Rebatch(obidefault.BatchSize())
if obitag.CLIGeometricMode() {
identified = obitag.CLIGeomAssignTaxonomy(fs, references, taxo)
identified = obitag.CLIGeomAssignTaxonomy(fsrb, references, taxo)
} else {
identified = obitag.CLIAssignTaxonomy(fs, references, taxo)
identified = obitag.CLIAssignTaxonomy(fsrb, references, taxo)
}
obiconvert.CLIWriteBioSequences(identified, true)

View File

@@ -33,7 +33,10 @@ func main() {
obidefault.SetWorkerPerCore(1)
optionParser := obioptions.GenerateOptionParser(obitagpcr.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obitagpcr",
"split a paired raw read data set per sample",
obitagpcr.OptionSet)
optionParser(os.Args)
pairs, err := obipairing.CLIPairedSequence()

View File

@@ -4,9 +4,11 @@ import (
"os"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obidefault"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiitercsv"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obioptions"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitax"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obiconvert"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obicsv"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitools/obitaxonomy"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
@@ -14,30 +16,59 @@ import (
)
func main() {
optionParser := obioptions.GenerateOptionParser(obitaxonomy.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obitaxonomy",
"manipulates and queries taxonomy",
obitaxonomy.OptionSet)
_, args := optionParser(os.Args)
var iterator *obitax.ITaxon
switch {
case obitaxonomy.CLIDownloadNCBI():
if obitaxonomy.CLIDownloadNCBI() {
err := obitaxonomy.CLIDownloadNCBITaxdump()
if err != nil {
log.Errorf("Cannot download NCBI taxonomy: %s", err.Error())
os.Exit(1)
}
os.Exit(0)
}
if !obidefault.HasSelectedTaxonomy() {
log.Fatal("you must indicate a taxonomy using the -t or --taxonomy option")
}
switch {
case obitaxonomy.CLIAskForRankList():
newIter := obiitercsv.NewICSVRecord()
newIter.Add(1)
newIter.AppendField("rank")
go func() {
ranks := obitax.DefaultTaxonomy().RankList()
data := make([]obiitercsv.CSVRecord, len(ranks))
for i, rank := range ranks {
record := make(obiitercsv.CSVRecord)
record["rank"] = rank
data[i] = record
}
newIter.Push(obiitercsv.MakeCSVRecordBatch(obitax.DefaultTaxonomy().Name(), 0, data))
newIter.Close()
newIter.Done()
}()
obicsv.CLICSVWriter(newIter, true)
obiutils.WaitForLastPipe()
os.Exit(0)
case obitaxonomy.CLIExtractTaxonomy():
iter, err := obiconvert.CLIReadBioSequences(args...)
iter = iter.NumberSequences(1, true)
if err != nil {
log.Fatalf("Cannot extract taxonomy: %v", err)
}
taxonomy, err := iter.ExtractTaxonomy()
taxonomy, err := iter.ExtractTaxonomy(obitaxonomy.CLINewickWithLeaves())
if err != nil {
log.Fatalf("Cannot extract taxonomy: %v", err)
@@ -99,7 +130,12 @@ func main() {
}
iterator = obitaxonomy.CLITaxonRestrictions(iterator)
obitaxonomy.CLICSVTaxaWriter(iterator, true)
if obitaxonomy.CLIAsNewick() {
obitaxonomy.CLINewickWriter(iterator, true)
} else {
obitaxonomy.CLICSVTaxaWriter(iterator, true)
}
obiutils.WaitForLastPipe()

View File

@@ -33,7 +33,10 @@ func main() {
obidefault.SetBatchSize(10)
obidefault.SetReadQualities(false)
optionParser := obioptions.GenerateOptionParser(obiuniq.OptionSet)
optionParser := obioptions.GenerateOptionParser(
"obiuniq",
"dereplicate sequence data sets",
obiuniq.OptionSet)
_, args := optionParser(os.Args)

View File

@@ -3,13 +3,13 @@ package main
import (
"os"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obitax"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiformats"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
)
func main() {
obitax.DetectTaxonomyFormat(os.Args[1])
obiformats.DetectTaxonomyFormat(os.Args[1])
println(obiutils.RemoveAllExt("toto/tutu/test.txt"))
println(obiutils.Basename("toto/tutu/test.txt"))

1
ecoprimers Submodule

Submodule ecoprimers added at b7552200bd

23
git-hooks/pre-push Executable file
View File

@@ -0,0 +1,23 @@
#!/bin/bash
remote="$1"
#url="$2"
log() {
echo -e "[Pre-Push tests @ $(date)] $*" 1>&2
}
current_branch=$(git symbolic-ref --short head)
cmd="make githubtests"
if [[ $current_branch = "master" ]]; then
log "you are on $current_branch, running build test"
if ! eval "$cmd"; then
log "Pre-push tests failed $cmd"
exit 1
fi
fi
log "Tests are OK, ready to push on $remote"
exit 0

21
go.mod
View File

@@ -1,11 +1,12 @@
module git.metabarcoding.org/obitools/obitools4/obitools4
go 1.23.1
go 1.23.4
toolchain go1.24.2
require (
github.com/DavidGamba/go-getoptions v0.28.0
github.com/PaesslerAG/gval v1.2.2
github.com/TuftsBCB/io v0.0.0-20140121014543-22b94e9b23f9
github.com/barkimedes/go-deepcopy v0.0.0-20220514131651-17c30cfc62df
github.com/buger/jsonparser v1.1.1
github.com/chen3feng/stl4go v0.1.1
@@ -19,19 +20,23 @@ require (
github.com/stretchr/testify v1.8.4
github.com/tevino/abool/v2 v2.1.0
github.com/yuin/gopher-lua v1.1.1
golang.org/x/exp v0.0.0-20231006140011-7918f672742d
golang.org/x/exp v0.0.0-20231110203233-9a3e6036ecaa
gonum.org/v1/gonum v0.14.0
gopkg.in/yaml.v3 v3.0.1
scientificgo.org/special v0.0.0
)
require (
github.com/RoaringBitmap/roaring v1.9.4 // indirect
github.com/bits-and-blooms/bitset v1.12.0 // indirect
github.com/davecgh/go-spew v1.1.1 // indirect
github.com/goombaio/orderedmap v0.0.0-20180924084748-ba921b7e2419 // indirect
github.com/kr/pretty v0.3.0 // indirect
github.com/kr/pretty v0.3.1 // indirect
github.com/kr/text v0.2.0 // indirect
github.com/mschoch/smat v0.2.0 // indirect
github.com/pelletier/go-toml/v2 v2.2.4 // indirect
github.com/pmezard/go-difflib v1.0.0 // indirect
github.com/rogpeppe/go-internal v1.6.1 // indirect
github.com/rogpeppe/go-internal v1.12.0 // indirect
)
require (
@@ -44,8 +49,8 @@ require (
github.com/rivo/uniseg v0.4.4 // indirect
github.com/shopspring/decimal v1.3.1 // indirect
github.com/ulikunitz/xz v0.5.11
golang.org/x/net v0.17.0 // indirect
golang.org/x/sys v0.17.0 // indirect
golang.org/x/term v0.13.0 // indirect
golang.org/x/net v0.35.0 // indirect
golang.org/x/sys v0.30.0 // indirect
golang.org/x/term v0.29.0 // indirect
gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c
)

39
go.sum
View File

@@ -4,10 +4,12 @@ github.com/PaesslerAG/gval v1.2.2 h1:Y7iBzhgE09IGTt5QgGQ2IdaYYYOU134YGHBThD+wm9E
github.com/PaesslerAG/gval v1.2.2/go.mod h1:XRFLwvmkTEdYziLdaCeCa5ImcGVrfQbeNUbVR+C6xac=
github.com/PaesslerAG/jsonpath v0.1.0 h1:gADYeifvlqK3R3i2cR5B4DGgxLXIPb3TRTH1mGi0jPI=
github.com/PaesslerAG/jsonpath v0.1.0/go.mod h1:4BzmtoM/PI8fPO4aQGIusjGxGir2BzcV0grWtFzq1Y8=
github.com/TuftsBCB/io v0.0.0-20140121014543-22b94e9b23f9 h1:Zc1/GNsUpgZR9qm1EmRSKrnOHA7CCd0bIzGdq0cREN0=
github.com/TuftsBCB/io v0.0.0-20140121014543-22b94e9b23f9/go.mod h1:PZyV4WA3NpqtezSY0h6E6NARAmdDm0qwrydveOyR5Gc=
github.com/RoaringBitmap/roaring v1.9.4 h1:yhEIoH4YezLYT04s1nHehNO64EKFTop/wBhxv2QzDdQ=
github.com/RoaringBitmap/roaring v1.9.4/go.mod h1:6AXUsoIEzDTFFQCe1RbGA6uFONMhvejWj5rqITANK90=
github.com/barkimedes/go-deepcopy v0.0.0-20220514131651-17c30cfc62df h1:GSoSVRLoBaFpOOds6QyY1L8AX7uoY+Ln3BHc22W40X0=
github.com/barkimedes/go-deepcopy v0.0.0-20220514131651-17c30cfc62df/go.mod h1:hiVxq5OP2bUGBRNS3Z/bt/reCLFNbdcST6gISi1fiOM=
github.com/bits-and-blooms/bitset v1.12.0 h1:U/q1fAF7xXRhFCrhROzIfffYnu+dlS38vCZtmFVPHmA=
github.com/bits-and-blooms/bitset v1.12.0/go.mod h1:7hO7Gc7Pp1vODcmWvKMRA9BNmbv6a/7QIWpPxHddWR8=
github.com/buger/jsonparser v1.1.1 h1:2PnMjfWD7wBILjqQbt530v576A/cAbQvEW9gGIpYMUs=
github.com/buger/jsonparser v1.1.1/go.mod h1:6RYKKt7H4d4+iWqouImQ9R2FZql3VbhNgx27UK13J/0=
github.com/chen3feng/stl4go v0.1.1 h1:0L1+mDw7pomftKDruM23f1mA7miavOj6C6MZeadzN2Q=
@@ -36,10 +38,9 @@ github.com/klauspost/compress v1.17.2/go.mod h1:ntbaceVETuRiXiv4DpjP66DpAtAGkEQs
github.com/klauspost/cpuid v1.2.0/go.mod h1:Pj4uuM528wm8OyEC2QMXAi2YiTZ96dNQPGgoMS4s3ek=
github.com/klauspost/pgzip v1.2.6 h1:8RXeL5crjEUFnR2/Sn6GJNWtSQ3Dk8pq4CL3jvdDyjU=
github.com/klauspost/pgzip v1.2.6/go.mod h1:Ch1tH69qFZu15pkjo5kYi6mth2Zzwzt50oCQKQE9RUs=
github.com/kr/pretty v0.1.0/go.mod h1:dAy3ld7l9f0ibDNOQOHHMYYIIbhfbHSm3C4ZsoJORNo=
github.com/kr/pretty v0.2.1/go.mod h1:ipq/a2n7PKx3OHsz4KJII5eveXtPO4qwEXGdVfWzfnI=
github.com/kr/pretty v0.3.0 h1:WgNl7dwNpEZ6jJ9k1snq4pZsg7DOEN8hP9Xw0Tsjwk0=
github.com/kr/pretty v0.3.0/go.mod h1:640gp4NfQd8pI5XOwp5fnNeVWj67G7CFk/SaSQn7NBk=
github.com/kr/pretty v0.3.1 h1:flRD4NNwYAUpkphVc1HcthR4KEIFJ65n8Mw5qdRn3LE=
github.com/kr/pretty v0.3.1/go.mod h1:hoEshYVHaxMs3cyo3Yncou5ZscifuDolrwPKZanG3xk=
github.com/kr/pty v1.1.1/go.mod h1:pFQYn66WHrOpPYNljwOMqo10TkYh1fy3cYio2l3bCsQ=
github.com/kr/text v0.1.0/go.mod h1:4Jbv+DJW3UT/LiOwJeYQe1efqtUx/iVham/4vfdArNI=
github.com/kr/text v0.2.0 h1:5Nx0Ya0ZqY2ygV366QzturHI13Jq95ApcVaJBhpS+AY=
@@ -50,15 +51,21 @@ github.com/mattn/go-runewidth v0.0.15 h1:UNAjwbU9l54TA3KzvqLGxwWjHmMgBUVhBiTjelZ
github.com/mattn/go-runewidth v0.0.15/go.mod h1:Jdepj2loyihRzMpdS35Xk/zdY8IAYHsh153qUoGf23w=
github.com/mitchellh/colorstring v0.0.0-20190213212951-d06e56a500db h1:62I3jR2EmQ4l5rM/4FEfDWcRD+abF5XlKShorW5LRoQ=
github.com/mitchellh/colorstring v0.0.0-20190213212951-d06e56a500db/go.mod h1:l0dey0ia/Uv7NcFFVbCLtqEBQbrT4OCwCSKTEv6enCw=
github.com/mschoch/smat v0.2.0 h1:8imxQsjDm8yFEAVBe7azKmKSgzSkZXDuKkSq9374khM=
github.com/mschoch/smat v0.2.0/go.mod h1:kc9mz7DoBKqDyiRL7VZN8KvXQMWeTaVnttLRXOlotKw=
github.com/pbnjay/memory v0.0.0-20210728143218-7b4eea64cf58 h1:onHthvaw9LFnH4t2DcNVpwGmV9E1BkGknEliJkfwQj0=
github.com/pbnjay/memory v0.0.0-20210728143218-7b4eea64cf58/go.mod h1:DXv8WO4yhMYhSNPKjeNKa5WY9YCIEBRbNzFFPJbWO6Y=
github.com/pelletier/go-toml/v2 v2.2.4 h1:mye9XuhQ6gvn5h28+VilKrrPoQVanw5PMw/TB0t5Ec4=
github.com/pelletier/go-toml/v2 v2.2.4/go.mod h1:2gIqNv+qfxSVS7cM2xJQKtLSTLUE9V8t9Stt+h56mCY=
github.com/pkg/diff v0.0.0-20210226163009-20ebb0f2a09e/go.mod h1:pJLUxLENpZxwdsKMEsNbx1VGcRFpLqf3715MtcvvzbA=
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/rivo/uniseg v0.2.0/go.mod h1:J6wj4VEh+S6ZtnVlnTBMWIodfgj8LQOQFoIToxlJtxc=
github.com/rivo/uniseg v0.4.4 h1:8TfxU8dW6PdqD27gjM8MVNuicgxIjxpm4K7x4jp8sis=
github.com/rivo/uniseg v0.4.4/go.mod h1:FN3SvrM+Zdj16jyLfmOkMNblXMcoc8DfTHruCPUcx88=
github.com/rogpeppe/go-internal v1.6.1 h1:/FiVV8dS/e+YqF2JvO3yXRFbBLTIuSDkuC7aBOAvL+k=
github.com/rogpeppe/go-internal v1.6.1/go.mod h1:xXDCJY+GAPziupqXw64V24skbSoqbTEfhy4qGm1nDQc=
github.com/rogpeppe/go-internal v1.9.0/go.mod h1:WtVeX8xhTBvf0smdhujwtBcq4Qrzq/fJaraNFVN+nFs=
github.com/rogpeppe/go-internal v1.12.0 h1:exVL4IDcn6na9z1rAb56Vxr+CgyK3nn3O+epU5NdKM8=
github.com/rogpeppe/go-internal v1.12.0/go.mod h1:E+RYuTGaKKdloAfM02xzb0FW3Paa99yedzYV+kq4uf4=
github.com/rrethy/ahocorasick v1.0.0 h1:YKkCB+E5PXc0xmLfMrWbfNht8vG9Re97IHSWZk/Lk8E=
github.com/rrethy/ahocorasick v1.0.0/go.mod h1:nq8oScE7Vy1rOppoQxpQiiDmPHuKCuk9rXrNcxUV3R0=
github.com/schollz/progressbar/v3 v3.13.1 h1:o8rySDYiQ59Mwzy2FELeHY5ZARXZTVJC7iHD6PEFUiE=
@@ -79,25 +86,23 @@ github.com/ulikunitz/xz v0.5.11 h1:kpFauv27b6ynzBNT/Xy+1k+fK4WswhN/6PN5WhFAGw8=
github.com/ulikunitz/xz v0.5.11/go.mod h1:nbz6k7qbPmH4IRqmfOplQw/tblSgqTqBwxkY0oWt/14=
github.com/yuin/gopher-lua v1.1.1 h1:kYKnWBjvbNP4XLT3+bPEwAXJx262OhaHDWDVOPjL46M=
github.com/yuin/gopher-lua v1.1.1/go.mod h1:GBR0iDaNXjAgGg9zfCvksxSRnQx76gclCIb7kdAd1Pw=
golang.org/x/exp v0.0.0-20231006140011-7918f672742d h1:jtJma62tbqLibJ5sFQz8bKtEM8rJBtfilJ2qTU199MI=
golang.org/x/exp v0.0.0-20231006140011-7918f672742d/go.mod h1:ldy0pHrwJyGW56pPQzzkH36rKxoZW1tw7ZJpeKx+hdo=
golang.org/x/net v0.17.0 h1:pVaXccu2ozPjCXewfr1S7xza/zcXTity9cCdXQYSjIM=
golang.org/x/net v0.17.0/go.mod h1:NxSsAGuq816PNPmqtQdLE42eU2Fs7NoRIZrHJAlaCOE=
golang.org/x/exp v0.0.0-20231110203233-9a3e6036ecaa h1:FRnLl4eNAQl8hwxVVC17teOw8kdjVDVAiFMtgUdTSRQ=
golang.org/x/exp v0.0.0-20231110203233-9a3e6036ecaa/go.mod h1:zk2irFbV9DP96SEBUUAy67IdHUaZuSnrz1n472HUCLE=
golang.org/x/net v0.35.0 h1:T5GQRQb2y08kTAByq9L4/bz8cipCdA8FbRTXewonqY8=
golang.org/x/net v0.35.0/go.mod h1:EglIi67kWsHKlRzzVMUD93VMSWGFOMSZgxFjparz1Qk=
golang.org/x/sys v0.0.0-20220715151400-c0bba94af5f8/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.0.0-20220811171246-fbc7d0a398ab/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.6.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.17.0 h1:25cE3gD+tdBA7lp7QfhuV+rJiE9YXTcS3VG1SqssI/Y=
golang.org/x/sys v0.17.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA=
golang.org/x/sys v0.30.0 h1:QjkSwP/36a20jFYWkSue1YwXzLmsV5Gfq7Eiy72C1uc=
golang.org/x/sys v0.30.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA=
golang.org/x/term v0.6.0/go.mod h1:m6U89DPEgQRMq3DNkDClhWw02AUbt2daBVO4cn4Hv9U=
golang.org/x/term v0.13.0 h1:bb+I9cTfFazGW51MZqBVmZy7+JEJMouUHTUSKVQLBek=
golang.org/x/term v0.13.0/go.mod h1:LTmsnFJwVN6bCy1rVCoS+qHT1HhALEFxKncY3WNNh4U=
golang.org/x/term v0.29.0 h1:L6pJp37ocefwRRtYPKSWOWzOtWSxVajvz2ldH/xi3iU=
golang.org/x/term v0.29.0/go.mod h1:6bl4lRlvVuDgSf3179VpIxBF0o10JUpXWOnI7nErv7s=
gonum.org/v1/gonum v0.14.0 h1:2NiG67LD1tEH0D7kM+ps2V+fXmsAnpUeec7n8tcr4S0=
gonum.org/v1/gonum v0.14.0/go.mod h1:AoWeoz0becf9QMWtE8iWXNXc27fK4fNeHNf/oMejGfU=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c h1:Hei/4ADfdWqJk1ZMxUNpqntNwaWcugrBjAiHlqqRiVk=
gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c/go.mod h1:JHkPIbrfpd72SG/EVd6muEfDQjcINNoR0C8j2r3qZ4Q=
gopkg.in/errgo.v2 v2.1.0/go.mod h1:hNsd1EY+bozCKY1Ytp96fpM3vjJbqLJn88ws8XvfDNI=
gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=

View File

@@ -1,3 +1,5 @@
go 1.23.1
go 1.23.4
toolchain go1.24.2
use .

View File

@@ -2,7 +2,6 @@ git.sr.ht/~sbinet/gg v0.3.1 h1:LNhjNn8DerC8f9DHLz6lS0YYul/b602DUxDgGkd/Aik=
git.sr.ht/~sbinet/gg v0.3.1/go.mod h1:KGYtlADtqsqANL9ueOFkWymvzUvLMQllU5Ixo+8v3pc=
github.com/ajstarks/svgo v0.0.0-20211024235047-1546f124cd8b h1:slYM766cy2nI3BwyRiyQj/Ud48djTMtMebDqepE95rw=
github.com/ajstarks/svgo v0.0.0-20211024235047-1546f124cd8b/go.mod h1:1KcenG0jGWcpt8ov532z81sp/kMMUG485J2InIOyADM=
github.com/buger/jsonparser v1.1.1/go.mod h1:6RYKKt7H4d4+iWqouImQ9R2FZql3VbhNgx27UK13J/0=
github.com/chzyer/logex v1.1.10 h1:Swpa1K6QvQznwJRcfTfQJmTE72DqScAa40E+fbHEXEE=
github.com/chzyer/logex v1.1.10/go.mod h1:+Ywpsq7O8HXn0nuIou7OrIPyXbp3wmkHB+jjWRnGsAI=
github.com/chzyer/logex v1.2.0 h1:+eqR0HfOetur4tgnC8ftU5imRnhi4te+BadWS95c5AM=
@@ -20,15 +19,20 @@ github.com/go-latex/latex v0.0.0-20230307184459-12ec69307ad9 h1:NxXI5pTAtpEaU49b
github.com/go-latex/latex v0.0.0-20230307184459-12ec69307ad9/go.mod h1:gWuR/CrFDDeVRFQwHPvsv9soJVB/iqymhuZQuJ3a9OM=
github.com/go-pdf/fpdf v0.6.0 h1:MlgtGIfsdMEEQJr2le6b/HNr1ZlQwxyWr77r2aj2U/8=
github.com/go-pdf/fpdf v0.6.0/go.mod h1:HzcnA+A23uwogo0tp9yU+l3V+KXhiESpt1PMayhOh5M=
github.com/go-quicktest/qt v1.101.0/go.mod h1:14Bz/f7NwaXPtdYEgzsx46kqSxVwTbzVZsDC26tQJow=
github.com/goccmack/gocc v0.0.0-20230228185258-2292f9e40198 h1:FSii2UQeSLngl3jFoR4tUKZLprO7qUlh/TKKticc0BM=
github.com/goccmack/gocc v0.0.0-20230228185258-2292f9e40198/go.mod h1:DTh/Y2+NbnOVVoypCCQrovMPDKUGp4yZpSbWg5D0XIM=
github.com/golang/freetype v0.0.0-20170609003504-e2365dfdc4a0 h1:DACJavvAHhabrF08vX0COfcOBJRhZ8lUbR+ZWIs0Y5g=
github.com/golang/freetype v0.0.0-20170609003504-e2365dfdc4a0/go.mod h1:E/TSTwGwJL78qG/PmXZO1EjYhfJinVAhrmmHX6Z8B9k=
github.com/google/go-cmdtest v0.4.1-0.20220921163831-55ab3332a786/go.mod h1:apVn/GCasLZUVpAJ6oWAuyP7Ne7CEsQbTnc0plM3m+o=
github.com/google/go-cmp v0.5.8 h1:e6P7q2lk1O+qJJb4BtCQXlK8vWEO8V1ZeuEdJNOqZyg=
github.com/google/go-cmp v0.5.8/go.mod h1:17dUlkBOakJ0+DkrSSNjCkIjxS6bF9zb3elmeNGIjoY=
github.com/google/renameio v0.1.0/go.mod h1:KWCgfxg9yswjAJkECMjeO8J8rahYeXnNhOm40UhjYkI=
github.com/google/safehtml v0.1.0/go.mod h1:L4KWwDsUJdECRAEpZoBn3O64bQaywRscowZjJAzjHnU=
github.com/hashicorp/errwrap v1.0.0/go.mod h1:YH+1FKiLXxHSkmPseP+kNlulaMuP3n2brvKWEqk/Jc4=
github.com/hashicorp/go-multierror v1.1.1/go.mod h1:iw975J/qwKPdAO1clOe2L8331t/9/fmwbPZ6JB6eMoM=
github.com/ianlancetaylor/demangle v0.0.0-20220319035150-800ac71e25c2 h1:rcanfLhLDA8nozr/K289V1zcntHr3V+SHlXwzz1ZI2g=
github.com/jba/templatecheck v0.7.1/go.mod h1:n1Etw+Rrw1mDDD8dDRsEKTwMZsJ98EkktgNJC6wLUGo=
github.com/k0kubun/go-ansi v0.0.0-20180517002512-3bf9e2903213 h1:qGQQKEcAR99REcMpsXCp3lJ03zYT1PkRd3kQGPn9GVg=
github.com/klauspost/cpuid v1.2.0 h1:NMpwD2G9JSFOE1/TJjGSo5zG7Yb2bTe7eq1jH+irmeE=
github.com/kr/pty v1.1.1 h1:VkoXIwSboBpnk99O/KFauAEILuNHv5DVFKZMBN/gUgw=
@@ -39,17 +43,22 @@ github.com/stretchr/objx v0.1.0 h1:4G4v2dO3VZwixGIRoQ5Lfboy6nUhCyYzaqnIAPPhYs4=
github.com/stretchr/objx v0.5.0 h1:1zr/of2m5FGMsad5YfcqgdqdWrIhu+EBEJRhR1U7z/c=
github.com/stretchr/objx v0.5.0/go.mod h1:Yh+to48EsGEfYuaHDzXPcE3xhTkx73EhmCGUpEOglKo=
github.com/yuin/goldmark v1.4.13 h1:fVcFKWvrslecOb/tg+Cc05dkeYx540o0FuFt3nUVDoE=
github.com/yuin/goldmark v1.4.13/go.mod h1:6yULJ656Px+3vBD8DxQVa3kxgyrAnzto9xy5taEt/CY=
golang.org/x/crypto v0.14.0 h1:wBqGXzWJW6m1XrIKlAH0Hs1JJ7+9KBwnIO8v66Q9cHc=
golang.org/x/crypto v0.14.0/go.mod h1:MVFd36DqK4CsrnJYDkBA3VC4m2GkXAM0PvzMCn4JQf4=
golang.org/x/crypto v0.33.0/go.mod h1:bVdXmD7IV/4GdElGPozy6U7lWdRXA4qyRVGJV57uQ5M=
golang.org/x/image v0.6.0 h1:bR8b5okrPI3g/gyZakLZHeWxAR8Dn5CyxXv1hLH5g/4=
golang.org/x/image v0.6.0/go.mod h1:MXLdDR43H7cDJq5GEGXEVeeNhPgi+YYEQ2pC1byI1x0=
golang.org/x/mod v0.13.0 h1:I/DsJXRlw/8l/0c24sM9yb0T4z9liZTduXvdAWYiysY=
golang.org/x/mod v0.13.0/go.mod h1:hTbmBsO62+eylJbnUtE2MGJUyE7QWk4xUqPFrRgJ+7c=
golang.org/x/mod v0.14.0/go.mod h1:hTbmBsO62+eylJbnUtE2MGJUyE7QWk4xUqPFrRgJ+7c=
golang.org/x/sync v0.0.0-20220722155255-886fb9371eb4 h1:uVc8UZUe6tr40fFVnUP5Oj+veunVezqYl9z7DYw9xzw=
golang.org/x/text v0.13.0 h1:ablQoSUd0tRdKxZewP80B+BaqeKJuVhuRxj/dkrun3k=
golang.org/x/text v0.13.0/go.mod h1:TvPlkZtksWOMsz7fbANvkp4WM8x/WCo/om8BMLbz+aE=
golang.org/x/text v0.22.0/go.mod h1:YRoo4H8PVmsu+E3Ou7cqLVH8oXWIHVoX0jqUWALQhfY=
golang.org/x/tools v0.14.0 h1:jvNa2pY0M4r62jkRQ6RwEZZyPcymeL9XZMLBbV7U2nc=
golang.org/x/tools v0.14.0/go.mod h1:uYBEerGOWcJyEORxN+Ek8+TT266gXkNlHdJBwexUsBg=
golang.org/x/tools v0.15.0/go.mod h1:hpksKq4dtpQWS1uQ61JkdqWM3LscIS6Slf+VVkm+wQk=
golang.org/x/xerrors v0.0.0-20190717185122-a985d3407aa7 h1:9zdDQZ7Thm29KFXgAX/+yaf3eVbP7djjWp/dXAppNCc=
gonum.org/v1/plot v0.10.1 h1:dnifSs43YJuNMDzB7v8wV64O4ABBHReuAVAoBxqBqS4=
gonum.org/v1/plot v0.10.1/go.mod h1:VZW5OlhkL1mysU9vaqNHnsy86inf6Ot+jB3r+BczCEo=

View File

@@ -59,6 +59,11 @@ if [[ ! "$WORK_DIR" || ! -d "$WORK_DIR" ]]; then
exit 1
fi
mkdir -p "${WORK_DIR}/cache" \
|| (echo "Cannot create ${WORK_DIR}/cache directory" 1>&2
exit 1)
mkdir -p "${INSTALL_DIR}/bin" 2> /dev/null \
|| (echo "Please enter your password for installing obitools in ${INSTALL_DIR}" 1>&2
sudo mkdir -p "${INSTALL_DIR}/bin")
@@ -68,11 +73,11 @@ if [[ ! -d "${INSTALL_DIR}/bin" ]]; then
exit 1
fi
INSTALL_DIR="$(cd $INSTALL_DIR && pwd)"
INSTALL_DIR="$(cd ${INSTALL_DIR} && pwd)"
echo WORK_DIR=$WORK_DIR 1>&2
echo INSTALL_DIR=$INSTALL_DIR 1>&2
echo OBITOOLS_PREFIX=$OBITOOLS_PREFIX 1>&2
echo "WORK_DIR=$WORK_DIR" 1>&2
echo "INSTALL_DIR=$INSTALL_DIR" 1>&2
echo "OBITOOLS_PREFIX=$OBITOOLS_PREFIX" 1>&2
pushd "$WORK_DIR"|| exit
@@ -105,6 +110,13 @@ curl "$GOURL" \
PATH="$(pwd)/go/bin:$PATH"
export PATH
GOPATH="$(pwd)/go"
export GOPATH
export GOCACHE="$(pwd)/cache"
echo "GOCACHE=$GOCACHE" 1>&2@
mkdir -p "$GOCACHE"
curl -L "$OBIURL4" > master.zip
unzip master.zip
@@ -112,11 +124,12 @@ unzip master.zip
echo "Install OBITOOLS from : $OBIURL4"
cd obitools4-master || exit
mkdir vendor
if [[ -z "$OBITOOLS_PREFIX" ]] ; then
make
make GOFLAGS="-buildvcs=false"
else
make OBITOOLS_PREFIX="${OBITOOLS_PREFIX}"
make GOFLAGS="-buildvcs=false" OBITOOLS_PREFIX="${OBITOOLS_PREFIX}"
fi
(cp build/* "${INSTALL_DIR}/bin" 2> /dev/null) \
@@ -125,5 +138,6 @@ fi
popd || exit
chmod -R +w "$WORK_DIR"
rm -rf "$WORK_DIR"

View File

@@ -0,0 +1,292 @@
# Filtre de Fréquence avec v Niveaux de Roaring Bitmaps
## Algorithme
```go
Pour chaque k-mer rencontré dans les données:
c = 0
tant que (k-mer index[c] ET c < v):
c++
si c < v:
index[c].insert(k-mer)
```
**Résultat** : `index[v-1]` contient les k-mers vus **≥ v fois**
---
## Exemple d'exécution (v=3)
```
Données:
Read1: kmer X
Read2: kmer X
Read3: kmer X (X vu 3 fois)
Read4: kmer Y
Read5: kmer Y (Y vu 2 fois)
Read6: kmer Z (Z vu 1 fois)
Exécution:
Read1 (X):
c=0: X ∉ index[0] → index[0].add(X)
État: index[0]={X}, index[1]={}, index[2]={}
Read2 (X):
c=0: X ∈ index[0] → c=1
c=1: X ∉ index[1] → index[1].add(X)
État: index[0]={X}, index[1]={X}, index[2]={}
Read3 (X):
c=0: X ∈ index[0] → c=1
c=1: X ∈ index[1] → c=2
c=2: X ∉ index[2] → index[2].add(X)
État: index[0]={X}, index[1]={X}, index[2]={X}
Read4 (Y):
c=0: Y ∉ index[0] → index[0].add(Y)
État: index[0]={X,Y}, index[1]={X}, index[2]={X}
Read5 (Y):
c=0: Y ∈ index[0] → c=1
c=1: Y ∉ index[1] → index[1].add(Y)
État: index[0]={X,Y}, index[1]={X,Y}, index[2]={X}
Read6 (Z):
c=0: Z ∉ index[0] → index[0].add(Z)
État: index[0]={X,Y,Z}, index[1]={X,Y}, index[2]={X}
Résultat final:
index[0] (freq≥1): {X, Y, Z}
index[1] (freq≥2): {X, Y}
index[2] (freq≥3): {X} ← K-mers filtrés ✓
```
---
## Utilisation
```go
// Créer le filtre
filter := obikmer.NewFrequencyFilter(31, 3) // k=31, minFreq=3
// Ajouter les séquences
for _, read := range reads {
filter.AddSequence(read)
}
// Récupérer les k-mers filtrés (freq ≥ 3)
filtered := filter.GetFilteredSet("filtered")
fmt.Printf("K-mers de qualité: %d\n", filtered.Cardinality())
// Statistiques
stats := filter.Stats()
fmt.Println(stats.String())
```
---
## Performance
### Complexité
**Par k-mer** :
- Lookups : Moyenne ~v/2, pire cas v
- Insertions : 1 Add
- **Pas de Remove** ✅
**Total pour n k-mers** :
- Temps : O(n × v/2)
- Mémoire : O(unique_kmers × v × 2 bytes)
### Early exit pour distribution skewed
Avec distribution typique (séquençage) :
```
80% singletons → 1 lookup (early exit)
15% freq 2-3 → 2-3 lookups
5% freq ≥4 → jusqu'à v lookups
Moyenne réelle : ~2 lookups/kmer (au lieu de v/2)
```
---
## Mémoire
### Pour 10^8 k-mers uniques
| v (minFreq) | Nombre bitmaps | Mémoire | vs map simple |
|-------------|----------------|---------|---------------|
| v=2 | 2 | ~400 MB | 6x moins |
| v=3 | 3 | ~600 MB | 4x moins |
| v=5 | 5 | ~1 GB | 2.4x moins |
| v=10 | 10 | ~2 GB | 1.2x moins |
| v=20 | 20 | ~4 GB | ~égal |
**Note** : Avec distribution skewed (beaucoup de singletons), la mémoire réelle est bien plus faible car les niveaux hauts ont peu d'éléments.
### Exemple réaliste (séquençage)
Pour 10^8 k-mers totaux, v=3 :
```
Distribution:
80% singletons → 80M dans index[0]
15% freq 2-3 → 15M dans index[1]
5% freq ≥3 → 5M dans index[2]
Mémoire:
index[0]: 80M × 2 bytes = 160 MB
index[1]: 15M × 2 bytes = 30 MB
index[2]: 5M × 2 bytes = 10 MB
Total: ~200 MB ✅
vs map simple: 80M × 24 bytes = ~2 GB
Réduction: 10x
```
---
## Comparaison des approches
| Approche | Mémoire (10^8 kmers) | Passes | Lookups/kmer | Quand utiliser |
|----------|----------------------|--------|--------------|----------------|
| **v-Bitmaps** | **200-600 MB** | **1** | **~2 (avg)** | **Standard** ✅ |
| Map simple | 2.4 GB | 1 | 1 | Si RAM illimitée |
| Multi-pass | 400 MB | v | v | Si I/O pas cher |
---
## Avantages de v-Bitmaps
**Une seule passe** sur les données
**Mémoire optimale** avec Roaring bitmaps
**Pas de Remove** (seulement Contains + Add)
**Early exit** efficace sur singletons
**Scalable** jusqu'à v~10-20
**Simple** à implémenter et comprendre
---
## Cas d'usage typiques
### 1. Éliminer erreurs de séquençage
```go
filter := obikmer.NewFrequencyFilter(31, 3)
// Traiter FASTQ
for read := range StreamFastq("sample.fastq") {
filter.AddSequence(read)
}
// K-mers de qualité (pas d'erreurs)
cleaned := filter.GetFilteredSet("cleaned")
```
**Résultat** : Élimine 70-80% des k-mers (erreurs)
### 2. Assemblage de génome
```go
filter := obikmer.NewFrequencyFilter(31, 2)
// Filtrer avant l'assemblage
for read := range reads {
filter.AddSequence(read)
}
solidKmers := filter.GetFilteredSet("solid")
// Utiliser solidKmers pour le graphe de Bruijn
```
### 3. Comparaison de génomes
```go
collection := obikmer.NewKmerSetCollection(31)
for _, genome := range genomes {
filter := obikmer.NewFrequencyFilter(31, 3)
filter.AddSequences(genome.Reads)
cleaned := filter.GetFilteredSet(genome.ID)
collection.Add(cleaned)
}
// Analyses comparatives sur k-mers de qualité
matrix := collection.ParallelPairwiseJaccard(8)
```
---
## Limites
**Pour v > 20** :
- Trop de lookups (v lookups/kmer)
- Mémoire importante (v × 200MB pour 10^8 kmers)
**Solutions alternatives pour v > 20** :
- Utiliser map simple (9 bytes/kmer) si RAM disponible
- Algorithme différent (sketch, probabiliste)
---
## Optimisations possibles
### 1. Parallélisation
```go
// Traiter plusieurs fichiers en parallèle
filters := make([]*FrequencyFilter, numFiles)
var wg sync.WaitGroup
for i, file := range files {
wg.Add(1)
go func(idx int, f string) {
defer wg.Done()
filters[idx] = ProcessFile(f, k, minFreq)
}(i, file)
}
wg.Wait()
// Merger les résultats
merged := MergeFilters(filters)
```
### 2. Streaming avec seuil adaptatif
```go
// Commencer avec v=5, réduire progressivement
filter := obikmer.NewFrequencyFilter(31, 5)
// ... traitement ...
// Si trop de mémoire, réduire à v=3
if filter.MemoryUsage() > threshold {
filter = ConvertToLowerThreshold(filter, 3)
}
```
---
## Récapitulatif final
**Pour filtrer les k-mers par fréquence ≥ v :**
1. **Créer** : `filter := NewFrequencyFilter(k, v)`
2. **Traiter** : `filter.AddSequence(read)` pour chaque read
3. **Résultat** : `filtered := filter.GetFilteredSet(id)`
**Mémoire** : ~2v MB par million de k-mers uniques
**Temps** : Une seule passe, ~2 lookups/kmer en moyenne
**Optimal pour** : v ≤ 20, distribution skewed (séquençage)
---
## Code fourni
1. **frequency_filter.go** - Implémentation complète
2. **examples_frequency_filter_final.go** - Exemples d'utilisation
**Tout est prêt à utiliser !** 🚀

View File

@@ -0,0 +1,320 @@
package main
import (
"fmt"
"obikmer"
)
func main() {
// ==========================================
// EXEMPLE 1 : Utilisation basique
// ==========================================
fmt.Println("=== EXEMPLE 1 : Utilisation basique ===\n")
k := 31
minFreq := 3 // Garder les k-mers vus ≥3 fois
// Créer le filtre
filter := obikmer.NewFrequencyFilter(k, minFreq)
// Simuler des séquences avec différentes fréquences
sequences := [][]byte{
[]byte("ACGTACGTACGTACGTACGTACGTACGTACG"), // Kmer X
[]byte("ACGTACGTACGTACGTACGTACGTACGTACG"), // Kmer X (freq=2)
[]byte("ACGTACGTACGTACGTACGTACGTACGTACG"), // Kmer X (freq=3) ✓
[]byte("TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT"), // Kmer Y
[]byte("TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT"), // Kmer Y (freq=2) ✗
[]byte("GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG"), // Kmer Z (freq=1) ✗
}
fmt.Printf("Traitement de %d séquences...\n", len(sequences))
for _, seq := range sequences {
filter.AddSequence(seq)
}
// Récupérer les k-mers filtrés
filtered := filter.GetFilteredSet("filtered")
fmt.Printf("\nK-mers avec freq ≥ %d: %d\n", minFreq, filtered.Cardinality())
// Statistiques
stats := filter.Stats()
fmt.Println("\n" + stats.String())
// ==========================================
// EXEMPLE 2 : Vérifier les niveaux
// ==========================================
fmt.Println("\n=== EXEMPLE 2 : Inspection des niveaux ===\n")
// Vérifier chaque niveau
for level := 0; level < minFreq; level++ {
levelSet := filter.GetKmersAtLevel(level)
fmt.Printf("Niveau %d (freq≥%d): %d k-mers\n",
level+1, level+1, levelSet.Cardinality())
}
// ==========================================
// EXEMPLE 3 : Données réalistes
// ==========================================
fmt.Println("\n=== EXEMPLE 3 : Simulation données séquençage ===\n")
filter2 := obikmer.NewFrequencyFilter(31, 3)
// Simuler un dataset réaliste :
// - 1000 reads
// - 80% contiennent des erreurs (singletons)
// - 15% vrais k-mers à basse fréquence
// - 5% vrais k-mers à haute fréquence
// Vraie séquence répétée
trueSeq := []byte("ACGTACGTACGTACGTACGTACGTACGTACG")
for i := 0; i < 50; i++ {
filter2.AddSequence(trueSeq)
}
// Séquence à fréquence moyenne
mediumSeq := []byte("CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC")
for i := 0; i < 5; i++ {
filter2.AddSequence(mediumSeq)
}
// Erreurs de séquençage (singletons)
for i := 0; i < 100; i++ {
errorSeq := []byte(fmt.Sprintf("TTTTTTTTTTTTTTTTTTTTTTTTTTTT%03d", i))
filter2.AddSequence(errorSeq)
}
stats2 := filter2.Stats()
fmt.Println(stats2.String())
fmt.Println("Distribution attendue:")
fmt.Println(" - Beaucoup de singletons (erreurs)")
fmt.Println(" - Peu de k-mers à haute fréquence (signal)")
fmt.Println(" → Filtrage efficace !")
// ==========================================
// EXEMPLE 4 : Tester différents seuils
// ==========================================
fmt.Println("\n=== EXEMPLE 4 : Comparaison de seuils ===\n")
testSeqs := [][]byte{
[]byte("ACGTACGTACGTACGTACGTACGTACGTACG"),
[]byte("ACGTACGTACGTACGTACGTACGTACGTACG"),
[]byte("ACGTACGTACGTACGTACGTACGTACGTACG"),
[]byte("ACGTACGTACGTACGTACGTACGTACGTACG"),
[]byte("ACGTACGTACGTACGTACGTACGTACGTACG"), // freq=5
[]byte("TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT"),
[]byte("TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT"),
[]byte("TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT"), // freq=3
[]byte("GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG"), // freq=1
}
for _, minFreq := range []int{2, 3, 5} {
f := obikmer.NewFrequencyFilter(31, minFreq)
f.AddSequences(testSeqs)
fmt.Printf("minFreq=%d: %d k-mers retenus (%.2f MB)\n",
minFreq,
f.Cardinality(),
float64(f.MemoryUsage())/1024/1024)
}
// ==========================================
// EXEMPLE 5 : Comparaison mémoire
// ==========================================
fmt.Println("\n=== EXEMPLE 5 : Comparaison mémoire ===\n")
filter3 := obikmer.NewFrequencyFilter(31, 3)
// Simuler 10000 séquences
for i := 0; i < 10000; i++ {
seq := make([]byte, 100)
for j := range seq {
seq[j] = "ACGT"[(i+j)%4]
}
filter3.AddSequence(seq)
}
fmt.Println(filter3.CompareWithSimpleMap())
// ==========================================
// EXEMPLE 6 : Workflow complet
// ==========================================
fmt.Println("\n=== EXEMPLE 6 : Workflow complet ===\n")
fmt.Println("1. Créer le filtre")
finalFilter := obikmer.NewFrequencyFilter(31, 3)
fmt.Println("2. Traiter les données (simulation)")
// En pratique : lire depuis FASTQ
// for read := range ReadFastq("data.fastq") {
// finalFilter.AddSequence(read)
// }
// Simulation
for i := 0; i < 1000; i++ {
seq := []byte("ACGTACGTACGTACGTACGTACGTACGTACG")
finalFilter.AddSequence(seq)
}
fmt.Println("3. Récupérer les k-mers filtrés")
result := finalFilter.GetFilteredSet("final")
fmt.Println("4. Utiliser le résultat")
fmt.Printf(" K-mers de qualité: %d\n", result.Cardinality())
fmt.Printf(" Mémoire utilisée: %.2f MB\n", float64(finalFilter.MemoryUsage())/1024/1024)
fmt.Println("5. Sauvegarder (optionnel)")
// result.Save("filtered_kmers.bin")
// ==========================================
// EXEMPLE 7 : Vérification individuelle
// ==========================================
fmt.Println("\n=== EXEMPLE 7 : Vérification de k-mers spécifiques ===\n")
checkFilter := obikmer.NewFrequencyFilter(31, 3)
testSeq := []byte("ACGTACGTACGTACGTACGTACGTACGTACG")
for i := 0; i < 5; i++ {
checkFilter.AddSequence(testSeq)
}
var kmers []uint64
kmers = obikmer.EncodeKmers(testSeq, 31, &kmers)
if len(kmers) > 0 {
testKmer := kmers[0]
fmt.Printf("K-mer test: 0x%016X\n", testKmer)
fmt.Printf(" Présent dans filtre: %v\n", checkFilter.Contains(testKmer))
fmt.Printf(" Fréquence approx: %d\n", checkFilter.GetFrequency(testKmer))
}
// ==========================================
// EXEMPLE 8 : Intégration avec collection
// ==========================================
fmt.Println("\n=== EXEMPLE 8 : Intégration avec KmerSetCollection ===\n")
// Créer une collection de génomes filtrés
collection := obikmer.NewKmerSetCollection(31)
genomes := map[string][][]byte{
"Genome1": {
[]byte("ACGTACGTACGTACGTACGTACGTACGTACG"),
[]byte("ACGTACGTACGTACGTACGTACGTACGTACG"),
[]byte("ACGTACGTACGTACGTACGTACGTACGTACG"),
[]byte("TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT"), // Erreur
},
"Genome2": {
[]byte("ACGTACGTACGTACGTACGTACGTACGTACG"),
[]byte("ACGTACGTACGTACGTACGTACGTACGTACG"),
[]byte("ACGTACGTACGTACGTACGTACGTACGTACG"),
[]byte("GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG"), // Erreur
},
}
for id, sequences := range genomes {
// Filtrer chaque génome
genomeFilter := obikmer.NewFrequencyFilter(31, 3)
genomeFilter.AddSequences(sequences)
// Ajouter à la collection
filteredSet := genomeFilter.GetFilteredSet(id)
collection.Add(filteredSet)
fmt.Printf("%s: %d k-mers de qualité\n", id, filteredSet.Cardinality())
}
// Analyser la collection
fmt.Println("\nAnalyse comparative:")
collectionStats := collection.ComputeStats()
fmt.Printf(" Core genome: %d k-mers\n", collectionStats.CoreSize)
fmt.Printf(" Pan genome: %d k-mers\n", collectionStats.PanGenomeSize)
// ==========================================
// RÉSUMÉ
// ==========================================
fmt.Println("\n=== RÉSUMÉ ===\n")
fmt.Println("Le FrequencyFilter permet de:")
fmt.Println(" ✓ Filtrer les k-mers par fréquence minimale")
fmt.Println(" ✓ Utiliser une mémoire optimale avec Roaring bitmaps")
fmt.Println(" ✓ Une seule passe sur les données")
fmt.Println(" ✓ Éliminer efficacement les erreurs de séquençage")
fmt.Println("")
fmt.Println("Workflow typique:")
fmt.Println(" 1. filter := NewFrequencyFilter(k, minFreq)")
fmt.Println(" 2. for each sequence: filter.AddSequence(seq)")
fmt.Println(" 3. filtered := filter.GetFilteredSet(id)")
fmt.Println(" 4. Utiliser filtered dans vos analyses")
}
// ==================================
// FONCTION HELPER POUR BENCHMARKS
// ==================================
func BenchmarkFrequencyFilter() {
k := 31
minFreq := 3
// Test avec différentes tailles
sizes := []int{1000, 10000, 100000}
fmt.Println("\n=== BENCHMARK ===\n")
for _, size := range sizes {
filter := obikmer.NewFrequencyFilter(k, minFreq)
// Générer des séquences
for i := 0; i < size; i++ {
seq := make([]byte, 100)
for j := range seq {
seq[j] = "ACGT"[(i+j)%4]
}
filter.AddSequence(seq)
}
fmt.Printf("Size=%d reads:\n", size)
fmt.Printf(" Filtered k-mers: %d\n", filter.Cardinality())
fmt.Printf(" Memory: %.2f MB\n", float64(filter.MemoryUsage())/1024/1024)
fmt.Println()
}
}
// ==================================
// FONCTION POUR DONNÉES RÉELLES
// ==================================
func ProcessRealData() {
// Exemple pour traiter de vraies données FASTQ
k := 31
minFreq := 3
filter := obikmer.NewFrequencyFilter(k, minFreq)
// Pseudo-code pour lire un FASTQ
/*
fastqFile := "sample.fastq"
reader := NewFastqReader(fastqFile)
for reader.HasNext() {
read := reader.Next()
filter.AddSequence(read.Sequence)
}
// Récupérer le résultat
filtered := filter.GetFilteredSet("sample_filtered")
filtered.Save("sample_filtered_kmers.bin")
// Stats
stats := filter.Stats()
fmt.Println(stats.String())
*/
fmt.Println("Workflow pour données réelles:")
fmt.Println(" 1. Créer le filtre avec minFreq approprié (2-5 typique)")
fmt.Println(" 2. Stream les reads depuis FASTQ")
fmt.Println(" 3. Récupérer les k-mers filtrés")
fmt.Println(" 4. Utiliser pour assemblage/comparaison/etc.")
_ = filter // unused
}

View File

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Here give the name of the test serie
#
TEST_NAME=obiannotate
CMD=obiannotate
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Suppress the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find $TEST_DIR | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then /else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory than the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

View File

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Here give the name of the test serie
#
TEST_NAME=obiclean
CMD=obiclean
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Suppress the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find $TEST_DIR | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then /else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory than the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

View File

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Here give the name of the test serie
#
TEST_NAME=obicleandb
CMD=obicleandb
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Suppress the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find $TEST_DIR | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then /else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory than the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

View File

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Here give the name of the test serie
#
TEST_NAME=obicomplement
CMD=obicomplement
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Suppress the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find $TEST_DIR | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then /else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory than the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

View File

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Here give the name of the test serie
#
TEST_NAME=obiconsensus
CMD=obiconsensus
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Suppress the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find $TEST_DIR | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then /else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory than the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

Binary file not shown.

View File

@@ -0,0 +1,144 @@
#!/bin/bash
#
# Here give the name of the test serie
#
TEST_NAME=obiconvert
CMD=obiconvert
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
if [ -z "$TEST_DIR" ] ; then
TEST_DIR="."
fi
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Suppress the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find $TEST_DIR | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then /else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory than the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
((ntest++))
if obiconvert -Z "${TEST_DIR}/gbpln1088.4Mb.fasta.gz" \
> "${TMPDIR}/xxx.fasta.gz" && \
zdiff "${TEST_DIR}/gbpln1088.4Mb.fasta.gz" \
"${TMPDIR}/xxx.fasta.gz"
then
log "$MCMD: converting large fasta file to fasta OK"
((success++))
else
log "$MCMD: converting large fasta file to fasta failed"
((failed++))
fi
((ntest++))
if obiconvert -Z --fastq-output \
"${TEST_DIR}/gbpln1088.4Mb.fasta.gz" \
> "${TMPDIR}/xxx.fastq.gz" && \
obiconvert -Z --fasta-output \
"${TMPDIR}/xxx.fastq.gz" \
> "${TMPDIR}/yyy.fasta.gz" && \
zdiff "${TEST_DIR}/gbpln1088.4Mb.fasta.gz" \
"${TMPDIR}/yyy.fasta.gz"
then
log "$MCMD: converting large file between fasta and fastq OK"
((success++))
else
log "$MCMD: converting large file between fasta and fastq failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

View File

@@ -5,6 +5,7 @@
#
TEST_NAME=obicount
CMD=obicount
######
#
@@ -15,6 +16,7 @@ TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
@@ -38,9 +40,14 @@ cleanup() {
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
@@ -79,6 +86,18 @@ log "files: $(find $TEST_DIR | awk -F'/' '{print $NF}' | tail -n +2)"
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
((ntest++))
if obicount "${TEST_DIR}/wolf_F.fasta.gz" \
> "${TMPDIR}/wolf_F.fasta_count.csv"

109
obitests/obitools/obicsv/test.sh Executable file
View File

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Here give the name of the test serie
#
TEST_NAME=obicsv
CMD=obicsv
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Suppress the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find $TEST_DIR | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then /else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory than the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

View File

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Here give the name of the test serie
#
TEST_NAME=obidemerge
CMD=obidemerge
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Suppress the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find $TEST_DIR | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then /else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory than the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

View File

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Here give the name of the test serie
#
TEST_NAME=obidistribute
CMD=obidistribute
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Suppress the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find $TEST_DIR | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then /else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory than the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

109
obitests/obitools/obigrep/test.sh Executable file
View File

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Here give the name of the test serie
#
TEST_NAME=obigrep
CMD=obigrep
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Suppress the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find $TEST_DIR | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then /else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory than the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

109
obitests/obitools/obijoin/test.sh Executable file
View File

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Here give the name of the test serie
#
TEST_NAME=obijoin
CMD=obijoin
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Suppress the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find $TEST_DIR | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then /else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory than the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

View File

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Here give the name of the test serie
#
TEST_NAME=obikmermatch
CMD=obikmermatch
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Suppress the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find $TEST_DIR | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then /else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory than the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

View File

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Here give the name of the test serie
#
TEST_NAME=obikmersimcount
CMD=obikmersimcount
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Suppress the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find $TEST_DIR | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then /else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory than the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

View File

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Here give the name of the test serie
#
TEST_NAME=obilandmark
CMD=obilandmark
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Suppress the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find $TEST_DIR | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then /else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory than the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

View File

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Here give the name of the test serie
#
TEST_NAME=obimatrix
CMD=obimatrix
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Suppress the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find $TEST_DIR | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then /else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory than the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

View File

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Here give the name of the test serie
#
TEST_NAME=obimicrosat
CMD=obimicrosat
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Suppress the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find $TEST_DIR | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then /else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory than the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

View File

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Here give the name of the test serie
#
TEST_NAME=obimultiplex
CMD=obimultiplex
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Suppress the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find $TEST_DIR | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then /else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory than the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

View File

@@ -4,7 +4,8 @@
# Here give the name of the test serie
#
TEST_NAME=obiparing
TEST_NAME=obipairing
CMD=obipairing
######
#
@@ -15,6 +16,7 @@ TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
@@ -38,9 +40,13 @@ cleanup() {
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
@@ -79,6 +85,16 @@ log "files: $(find $TEST_DIR | awk -F'/' '{print $NF}' | tail -n +2)"
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
((ntest++))
if obipairing -F "${TEST_DIR}/wolf_F.fastq.gz" \
-R "${TEST_DIR}/wolf_R.fastq.gz" \
@@ -94,8 +110,8 @@ fi
((ntest++))
if obicsv -Z -s -i \
-k ali_dir -k ali_length -k paring_fast_count \
-k paring_fast_overlap -k paring_fast_score \
-k ali_dir -k ali_length -k pairing_fast_count \
-k pairing_fast_overlap -k pairing_fast_score \
-k score -k score_norm -k seq_a_single \
-k seq_b_single -k seq_ab_match \
"${TMPDIR}/wolf_paired_alignment.fastq.gz" \

109
obitests/obitools/obipcr/test.sh Executable file
View File

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Here give the name of the test serie
#
TEST_NAME=obipcr
CMD=obipcr
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Suppress the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find $TEST_DIR | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then /else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory than the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

View File

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Here give the name of the test serie
#
TEST_NAME=obirefidx
CMD=obirefidx
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Suppress the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find $TEST_DIR | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then /else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory than the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

View File

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Here give the name of the test serie
#
TEST_NAME=obiscript
CMD=obiscript
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Suppress the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find $TEST_DIR | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then /else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory than the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

View File

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Here give the name of the test serie
#
TEST_NAME=obisplit
CMD=obisplit
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Suppress the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find $TEST_DIR | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then /else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory than the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

View File

@@ -0,0 +1,9 @@
>Seq_1 {"count":2,"merged_sample":{"15a_F730814":1,"29a_F260619":1}}
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat
agctyaaaactcaaaggacttggcggtgctttataccctt
>Seq_2 {"count":22,"merged_sample":{"15a_F730814":12,"29a_F260619":10}}
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcaat
atcttaaaactcaaaggacttggcggtgctttataccctt
>Seq_3 {"count":22,"merged_sample":{"15a_F730814":15,"29a_F260619":7}}
ttagccctaaacacaagtaattaatataacaaaattattcgccagagtactaccggcgat
agcttaaaactcaaaggacttggcggtgctttataccctt

View File

@@ -0,0 +1,35 @@
{
"annotations": {
"keys": {
"map": {
"merged_sample": 3
},
"scalar": {
"count": 3
}
},
"map_attributes": 1,
"scalar_attributes": 1,
"vector_attributes": 0
},
"count": {
"reads": 46,
"total_length": 300,
"variants": 3
},
"samples": {
"sample_count": 2,
"sample_stats": {
"15a_F730814": {
"reads": 28,
"singletons": 1,
"variants": 3
},
"29a_F260619": {
"reads": 18,
"singletons": 1,
"variants": 3
}
}
}
}

View File

@@ -0,0 +1,25 @@
annotations:
keys:
map:
merged_sample: 3
scalar:
count: 3
map_attributes: 1
scalar_attributes: 1
vector_attributes: 0
count:
reads: 46
total_length: 300
variants: 3
samples:
sample_count: 2
sample_stats:
15a_F730814:
reads: 28
singletons: 1
variants: 3
29a_F260619:
reads: 18
singletons: 1
variants: 3

View File

@@ -0,0 +1,152 @@
#!/bin/bash
#
# Here give the name of the test serie
#
TEST_NAME=obisummary
CMD=obisummary
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Suppress the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find $TEST_DIR | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then /else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory than the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
((ntest++))
if obisummary "${TEST_DIR}/some_uniq_seq.fasta" \
> "${TMPDIR}/some_uniq_seq.json"
then
log "$MCMD: formating json execution OK"
((success++))
else
log "$MCMD: formating json execution failed"
((failed++))
fi
((ntest++))
if diff "${TEST_DIR}/some_uniq_seq.json" \
"${TMPDIR}/some_uniq_seq.json" > /dev/null
then
log "$MCMD: formating json OK"
((success++))
else
log "$MCMD: formating json failed"
((failed++))
fi
((ntest++))
if obisummary --yaml "${TEST_DIR}/some_uniq_seq.fasta" \
> "${TMPDIR}/some_uniq_seq.yaml"
then
log "$MCMD: formating yaml execution OK"
((success++))
else
log "$MCMD: formating yaml execution failed"
((failed++))
fi
((ntest++))
if diff "${TEST_DIR}/some_uniq_seq.yaml" \
"${TMPDIR}/some_uniq_seq.yaml" > /dev/null
then
log "$MCMD: formating yaml OK"
((success++))
else
log "$MCMD: formating yaml failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

109
obitests/obitools/obitag/test.sh Executable file
View File

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Here give the name of the test serie
#
TEST_NAME=obitag
CMD=obitag
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Suppress the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find $TEST_DIR | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then /else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory than the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

View File

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Here give the name of the test serie
#
TEST_NAME=obipcr
CMD=obipcr
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Suppress the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find $TEST_DIR | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then /else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory than the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

View File

@@ -0,0 +1,109 @@
#!/bin/bash
#
# Here give the name of the test serie
#
TEST_NAME=obitaxonomy
CMD=obitaxonomy
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Suppress the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find $TEST_DIR | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then /else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory than the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

205
obitests/obitools/obiuniq/test.sh Executable file
View File

@@ -0,0 +1,205 @@
#!/bin/bash
#
# Here give the name of the test serie
#
TEST_NAME=obiuniq
CMD=obiuniq
######
#
# Some variable and function definitions: please don't change them
#
######
TEST_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
OBITOOLS_DIR="${TEST_DIR/obitest*/}build"
export PATH="${OBITOOLS_DIR}:${PATH}"
MCMD="$(echo "${CMD:0:4}" | tr '[:lower:]' '[:upper:]')$(echo "${CMD:4}" | tr '[:upper:]' '[:lower:]')"
TMPDIR="$(mktemp -d)"
ntest=0
success=0
failed=0
cleanup() {
echo "========================================" 1>&2
echo "## Results of the $TEST_NAME tests:" 1>&2
echo 1>&2
echo "- $ntest tests run" 1>&2
echo "- $success successfully completed" 1>&2
echo "- $failed failed tests" 1>&2
echo 1>&2
echo "Cleaning up the temporary directory..." 1>&2
echo 1>&2
echo "========================================" 1>&2
rm -rf "$TMPDIR" # Suppress the temporary directory
if [ $failed -gt 0 ]; then
log "$TEST_NAME tests failed"
log
log
exit 1
fi
log
log
exit 0
}
log() {
echo -e "[$TEST_NAME @ $(date)] $*" 1>&2
}
log "Testing $TEST_NAME..."
log "Test directory is $TEST_DIR"
log "obitools directory is $OBITOOLS_DIR"
log "Temporary directory is $TMPDIR"
log "files: $(find $TEST_DIR | awk -F'/' '{print $NF}' | tail -n +2)"
######################################################################
####
#### Below are the tests
####
#### Before each test :
#### - increment the variable ntest
####
#### Run the command as the condition of an if / then /else
#### - The command must return 0 on success
#### - The command must return an exit code different from 0 on failure
#### - The datafiles are stored in the same directory than the test script
#### - The test script directory is stored in the TEST_DIR variable
#### - If result files have to be produced they must be stored
#### in the temporary directory (TMPDIR variable)
####
#### then clause is executed on success of the command
#### - Write a success message using the log function
#### - increment the variable success
####
#### else clause is executed on failure of the command
#### - Write a failure message using the log function
#### - increment the variable failed
####
######################################################################
((ntest++))
if $CMD -h > "${TMPDIR}/help.txt" 2>&1
then
log "$MCMD: printing help OK"
((success++))
else
log "$MCMD: printing help failed"
((failed++))
fi
((ntest++))
if obiuniq "${TEST_DIR}/touniq.fasta" \
> "${TMPDIR}/touniq_u.fasta"
then
log "OBIUniq simple: running OK"
((success++))
else
log "OBIUniq simple: running failed"
((failed++))
fi
obicsv -s --auto ${TEST_DIR}/touniq_u.fasta \
| tail -n +2 \
| sort \
> "${TMPDIR}/touniq_u_ref.csv"
obicsv -s --auto ${TMPDIR}/touniq_u.fasta \
| tail -n +2 \
| sort \
> "${TMPDIR}/touniq_u.csv"
((ntest++))
if diff "${TMPDIR}/touniq_u_ref.csv" \
"${TMPDIR}/touniq_u.csv" > /dev/null
then
log "OBIUniq simple: result OK"
((success++))
else
log "OBIUniq simple: result failed"
((failed++))
fi
((ntest++))
if obiuniq -c a "${TEST_DIR}/touniq.fasta" \
> "${TMPDIR}/touniq_u_a.fasta"
then
log "OBIUniq one category: running OK"
((success++))
else
log "OBIUniq one category: running failed"
((failed++))
fi
obicsv -s --auto ${TEST_DIR}/touniq_u_a.fasta \
| tail -n +2 \
| sort \
> "${TMPDIR}/touniq_u_a_ref.csv"
obicsv -s --auto ${TMPDIR}/touniq_u_a.fasta \
| tail -n +2 \
| sort \
> "${TMPDIR}/touniq_u_a.csv"
((ntest++))
if diff "${TMPDIR}/touniq_u_a_ref.csv" \
"${TMPDIR}/touniq_u_a.csv" > /dev/null
then
log "OBIUniq one category: result OK"
((success++))
else
log "OBIUniq one category: result failed"
((failed++))
fi
((ntest++))
if obiuniq -c a -c b "${TEST_DIR}/touniq.fasta" \
> "${TMPDIR}/touniq_u_a_b.fasta"
then
log "OBIUniq two categories: running OK"
((success++))
else
log "OBIUniq two categories: running failed"
((failed++))
fi
obicsv -s --auto ${TEST_DIR}/touniq_u_a_b.fasta \
| tail -n +2 \
| sort \
> "${TMPDIR}/touniq_u_a_b_ref.csv"
obicsv -s --auto ${TMPDIR}/touniq_u_a_b.fasta \
| tail -n +2 \
| sort \
> "${TMPDIR}/touniq_u_a_b.csv"
((ntest++))
if diff "${TMPDIR}/touniq_u_a_b_ref.csv" \
"${TMPDIR}/touniq_u_a_b.csv" > /dev/null
then
log "OBIUniq two categories: result OK"
((success++))
else
log "OBIUniq two categories: result failed"
((failed++))
fi
#########################################
#
# At the end of the tests
# the cleanup function is called
#
#########################################
cleanup

View File

@@ -0,0 +1,16 @@
>seq1 {"a":2, "b":4,"c":5}
aaacccgggttt
>seq2 {"a":3, "b":4,"c":5}
aaacccgggttt
>seq3 {"a":3, "b":5,"c":5}
aaacccgggttt
>seq4 {"a":3, "b":5,"c":6}
aaacccgggttt
>seq5 {"a":2, "b":4,"c":5}
aaacccgggtttca
>seq6 {"a":3, "b":4,"c":5}
aaacccgggtttca
>seq7 {"a":3, "b":5,"c":5}
aaacccgggtttca
>seq8 {"a":3, "b":5,"c":6}
aaacccgggtttca

View File

@@ -0,0 +1,4 @@
>seq5 {"count":4}
aaacccgggtttca
>seq1 {"count":4}
aaacccgggttt

View File

@@ -0,0 +1,8 @@
>seq5 {"a":2,"b":4,"c":5,"count":1}
aaacccgggtttca
>seq6 {"a":3,"count":3}
aaacccgggtttca
>seq1 {"a":2,"b":4,"c":5,"count":1}
aaacccgggttt
>seq2 {"a":3,"count":3}
aaacccgggttt

View File

@@ -0,0 +1,12 @@
>seq5 {"a":2,"b":4,"c":5,"count":1}
aaacccgggtttca
>seq6 {"a":3,"b":4,"c":5,"count":1}
aaacccgggtttca
>seq7 {"a":3,"b":5,"count":2}
aaacccgggtttca
>seq1 {"a":2,"b":4,"c":5,"count":1}
aaacccgggttt
>seq2 {"a":3,"b":4,"c":5,"count":1}
aaacccgggttt
>seq3 {"a":3,"b":5,"count":2}
aaacccgggttt

View File

@@ -169,7 +169,7 @@ func BuildQualityConsensus(seqA, seqB *obiseq.BioSequence, path []int, statOnMis
right = len(*bufferQA) - right
// log.Warnf("BuildQualityConsensus: left = %d right = %d\n", left, right)
// obilog.Warnf("BuildQualityConsensus: left = %d right = %d\n", left, right)
for i, qA = range *bufferQA {
nA := (*bufferSA)[i]

View File

@@ -117,7 +117,7 @@ func _MatchScoreRatio(QF, QR byte) (float64, float64) {
term1 := _Logaddexp(qF, qR)
term2 := _Logdiffexp(term1, qF+qR)
// log.Warnf("MatchScoreRatio: %v, %v , %v, %v", QF, QR, term1, term2)
// obilog.Warnf("MatchScoreRatio: %v, %v , %v, %v", QF, QR, term1, term2)
match_logp := _Log1mexp(term2 + l3 - l4)
match_score := match_logp - _Log1mexp(match_logp)

View File

@@ -4,33 +4,6 @@ import (
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiseq"
)
var _iupac = [26]byte{
// a b c d e f
1, 14, 2, 13, 0, 0,
// g h i j k l
4, 11, 0, 0, 12, 0,
// m n o p q r
3, 15, 0, 0, 0, 5,
// s t u v w x
6, 8, 8, 13, 9, 0,
// y z
10, 0,
}
func _samenuc(a, b byte) bool {
if (a >= 'A') && (a <= 'Z') {
a |= 32
}
if (b >= 'A') && (b <= 'Z') {
b |= 32
}
if (a >= 'a') && (a <= 'z') && (b >= 'a') && (b <= 'z') {
return (_iupac[a-'a'] & _iupac[b-'a']) > 0
}
return a == b
}
// FastLCSEGFScoreByte calculates the score of the Longest Common Subsequence (LCS) between two byte slices.
//
// The score is calculated using the following scoring matrix:
@@ -165,7 +138,7 @@ func FastLCSEGFScoreByte(bA, bB []byte, maxError int, endgapfree bool, buffer *[
default:
// We are in the middle of the matrix
Sdiag = _incpath(previous[x])
if _samenuc(bA[j-1], bB[i-1]) {
if obiseq.SameIUPACNuc(bA[j-1], bB[i-1]) {
Sdiag = _incscore(Sdiag)
}
@@ -265,7 +238,7 @@ func FastLCSEGFScoreByte(bA, bB []byte, maxError int, endgapfree bool, buffer *[
Sleft = _notavail
default:
Sdiag = _incpath(previous[x])
if _samenuc(bA[j-1], bB[i-1]) {
if obiseq.SameIUPACNuc(bA[j-1], bB[i-1]) {
Sdiag = _incscore(Sdiag)
}

View File

@@ -1,6 +1,9 @@
package obialign
import log "github.com/sirupsen/logrus"
import (
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiseq"
log "github.com/sirupsen/logrus"
)
// buffIndex converts a pair of coordinates (i, j) into a linear index in a matrix
// of size width x width. The coordinates are (-1)-indexed, and the linear index
@@ -69,7 +72,7 @@ func LocatePattern(id string, pattern, sequence []byte) (int, int, int) {
// Mismatch score = -1
// Match score = 0
match := -1
if _samenuc(pattern[j], sequence[i]) {
if obiseq.SameIUPACNuc(pattern[j], sequence[i]) {
match = 0
}
@@ -103,7 +106,7 @@ func LocatePattern(id string, pattern, sequence []byte) (int, int, int) {
// Mismatch score = -1
// Match score = 0
match := -1
if _samenuc(pattern[jmax], sequence[i]) {
if obiseq.SameIUPACNuc(pattern[jmax], sequence[i]) {
match = 0
}
@@ -152,7 +155,7 @@ func LocatePattern(id string, pattern, sequence []byte) (int, int, int) {
}
// log.Warnf("from : %d to: %d error: %d match: %v",
// obilog.Warnf("from : %d to: %d error: %d match: %v",
// i, end+1, -buffer[buffIndex(len(sequence)-1, len(pattern)-1, width)],
// string(sequence[i:(end+1)]))
return i, end + 1, -buffer[buffIndex(len(sequence)-1, len(pattern)-1, width)]

View File

@@ -53,10 +53,10 @@ func ReadAlign(seqA, seqB *obiseq.BioSequence,
over = min(seqA.Len(), seqB.Len())
}
// log.Warnf("fw/fw: %v shift=%d fastCount=%d/over=%d fastScore=%f",
// obilog.Warnf("fw/fw: %v shift=%d fastCount=%d/over=%d fastScore=%f",
// directAlignment, shift, fastCount, over, fastScore)
// log.Warnf(("seqA: %s\nseqB: %s\n"), seqA.String(), seqB.String())
// obilog.Warnf(("seqA: %s\nseqB: %s\n"), seqA.String(), seqB.String())
// At least one mismatch exists in the overlaping region
if fastCount+3 < over {

14
pkg/obiapat/obiapat.c Normal file → Executable file
View File

@@ -149,9 +149,9 @@ char *LowerSequence(char *seq)
char *cseq;
for (cseq = seq ; *cseq ; cseq++)
if (IS_UPPER(*cseq))
if (IS_UPPER(*cseq)) {
*cseq = TO_LOWER(*cseq);
}
return seq;
}
@@ -299,14 +299,14 @@ int32_t delete_apatseq(SeqPtr pseq,
return 1;
}
PatternPtr buildPattern(const char *pat, int32_t error_max, uint8_t hasIndel,
Pattern *buildPattern(const char *pat, int32_t error_max, uint8_t hasIndel,
int *errno, char **errmsg)
{
PatternPtr pattern;
Pattern *pattern;
int32_t patlen;
int32_t patlen2;
patlen = strlen(pat);
patlen = (int32_t)strlen(pat);
patlen2 = lenPattern(pat);
pattern = ECOMALLOC(sizeof(Pattern) + // Space for struct Pattern
@@ -341,10 +341,10 @@ PatternPtr buildPattern(const char *pat, int32_t error_max, uint8_t hasIndel,
}
PatternPtr complementPattern(PatternPtr pat, int *errno,
Pattern *complementPattern(Pattern *pat, int *errno,
char **errmsg)
{
PatternPtr pattern;
Pattern *pattern;
pattern = ECOMALLOC(sizeof(Pattern) +
sizeof(char) * strlen(pat->cpat) + 1 +

10
pkg/obiapat/obiapat.h Normal file → Executable file
View File

@@ -116,13 +116,13 @@ ecoseq_t *new_ecoseq_with_data( char *AC,
int32_t delete_apatseq(SeqPtr pseq,
int32_t delete_apatseq(Seq *pseq,
int *errno, char **errmsg);
PatternPtr buildPattern(const char *pat, int32_t error_max, uint8_t hasIndel, int *errno, char **errmsg);
PatternPtr complementPattern(PatternPtr pat, int *errno, char **errmsg);
Pattern *buildPattern(const char *pat, int32_t error_max, uint8_t hasIndel, int *errno, char **errmsg);
Pattern *complementPattern(Pattern *pat, int *errno, char **errmsg);
SeqPtr new_apatseq(const char *in,int32_t circular, int32_t seqlen,
SeqPtr out,
Seq *new_apatseq(const char *in,int32_t circular, int32_t seqlen,
Seq *out,
int *errno, char **errmsg);
char *ecoComplementPattern(char *nucAcSeq);

13
pkg/obiapat/pattern.go Normal file → Executable file
View File

@@ -26,7 +26,7 @@ var _AllocatedApaPattern = 0
// ApatPattern stores a regular pattern usable by the
// Apat algorithm functions and methods
type _ApatPattern struct {
pointer *C.Pattern
pointer C.PatternPtr
pattern string
}
@@ -72,6 +72,7 @@ var NilApatSequence = ApatSequence{nil}
//
// Returns an ApatPattern object and an error if the pattern is invalid.
func MakeApatPattern(pattern string, errormax int, allowsIndel bool) (ApatPattern, error) {
cpattern := C.CString(pattern)
defer C.free(unsafe.Pointer(cpattern))
cerrormax := C.int32_t(errormax)
@@ -159,7 +160,7 @@ func (pattern ApatPattern) Free() {
// Print method prints the ApatPattern to the standard output.
// This is mainly a debug method.
func (pattern ApatPattern) Print() {
C.PrintDebugPattern(C.PatternPtr(pattern.pointer.pointer))
C.PrintDebugPattern((*C.Pattern)(pattern.pointer.pointer))
}
// MakeApatSequence casts an obiseq.BioSequence to an ApatSequence.
@@ -410,8 +411,8 @@ func (pattern ApatPattern) FilterBestMatch(sequence ApatSequence, begin, length
best := [3]int{0, 0, 10000}
for _, m := range res {
// log.Warnf("Current : Begin : %d End : %d Err : %d", m[0], m[1], m[2])
// log.Warnf("Best : Begin : %d End : %d Err : %d", best[0], best[1], best[2])
// obilog.Warnf("Current : Begin : %d End : %d Err : %d", m[0], m[1], m[2])
// obilog.Warnf("Best : Begin : %d End : %d Err : %d", best[0], best[1], best[2])
if (m[0] - m[2]) < best[1]+best[2] {
// match are overlapping
// log.Warnln("overlap")
@@ -467,7 +468,7 @@ func (pattern ApatPattern) AllMatches(sequence ApatSequence, begin, length int)
// Recompute the start and end position of the match
// when the pattern allows for indels
if m[2] > 0 && pattern.pointer.pointer.hasIndel {
// log.Warnf("Locating indel on sequence %s[%s]", sequence.pointer.reference.Id(), pattern.String())
// obilog.Warnf("Locating indel on sequence %s[%s]", sequence.pointer.reference.Id(), pattern.String())
start := m[0] - m[2]*2
start = max(start, 0)
end := start + int(pattern.pointer.pointer.patlen) + 4*m[2]
@@ -489,7 +490,7 @@ func (pattern ApatPattern) AllMatches(sequence ApatSequence, begin, length int)
m[0] = start + pb
m[1] = start + pe
// log.Warnf("seq[%d@%d:%d] %d: %s %d - %s:%s:%s", i, m[0], m[1], olderr, sequence.pointer.reference.Id(), score,
// obilog.Warnf("seq[%d@%d:%d] %d: %s %d - %s:%s:%s", i, m[0], m[1], olderr, sequence.pointer.reference.Id(), score,
// frg, (*cpattern)[0:int(pattern.pointer.pointer.patlen)], sequence.pointer.reference.Sequence()[m[0]:m[1]])
}

22
pkg/obichunk/chunk.go Normal file
View File

@@ -0,0 +1,22 @@
package obichunk
import (
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiseq"
)
func ISequenceChunk(iterator obiiter.IBioSequence,
classifier *obiseq.BioSequenceClassifier,
onMemory bool,
dereplicate bool,
na string,
statsOn obiseq.StatsOnDescriptions,
uniqueClassifier *obiseq.BioSequenceClassifier,
) (obiiter.IBioSequence, error) {
if onMemory {
return ISequenceChunkOnMemory(iterator, classifier)
} else {
return ISequenceChunkOnDisk(iterator, classifier, dereplicate, na, statsOn, uniqueClassifier)
}
}

View File

@@ -10,6 +10,7 @@ import (
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiformats"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiiter"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiseq"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiutils"
)
// tempDir creates a temporary directory with a prefix "obiseq_chunks_"
@@ -73,7 +74,13 @@ func find(root, ext string) []string {
// is removed. The function logs the number of batches created and the processing
// status of each batch.
func ISequenceChunkOnDisk(iterator obiiter.IBioSequence,
classifier *obiseq.BioSequenceClassifier) (obiiter.IBioSequence, error) {
classifier *obiseq.BioSequenceClassifier,
dereplicate bool,
na string,
statsOn obiseq.StatsOnDescriptions,
uniqueClassifier *obiseq.BioSequenceClassifier,
) (obiiter.IBioSequence, error) {
obiutils.RegisterAPipe()
dir, err := tempDir()
if err != nil {
return obiiter.NilIBioSequence, err
@@ -86,7 +93,7 @@ func ISequenceChunkOnDisk(iterator obiiter.IBioSequence,
go func() {
defer func() {
os.RemoveAll(dir)
log.Debugln("Clear the cache directory")
obiutils.UnregisterPipe()
}()
newIter.Wait()
@@ -111,11 +118,45 @@ func ISequenceChunkOnDisk(iterator obiiter.IBioSequence,
panic(err)
}
source, chunk := iseq.Load()
if dereplicate {
u := make(map[string]*obiseq.BioSequence)
var source string
uniqueClassifier.Reset()
newIter.Push(obiiter.MakeBioSequenceBatch(source, order, chunk))
log.Infof("Start processing of batch %d/%d : %d sequences",
order, nbatch, len(chunk))
for iseq.Next() {
batch := iseq.Get()
source = batch.Source()
for _, seq := range batch.Slice() {
// Use composite key: sequence + categories
code := uniqueClassifier.Code(seq)
key := uniqueClassifier.Value(code)
prev, ok := u[key]
if ok {
prev.Merge(seq, na, true, statsOn)
} else {
u[key] = seq
}
}
}
chunk := obiseq.MakeBioSequenceSlice(len(u))
i := 0
for _, seq := range u {
chunk[i] = seq
i++
}
newIter.Push(obiiter.MakeBioSequenceBatch(source, order, chunk))
} else {
source, chunk := iseq.Load()
newIter.Push(obiiter.MakeBioSequenceBatch(source, order, chunk))
log.Infof("Start processing of batch %d/%d : %d sequences",
order+1, nbatch, len(chunk))
}
}

View File

@@ -9,7 +9,20 @@ import (
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiseq"
)
func ISequenceChunk(iterator obiiter.IBioSequence,
// ISequenceChunkOnMemory processes a sequence iterator by distributing the sequences
// into chunks in memory. It uses a classifier to determine how to distribute
// the sequences and returns a new iterator for the processed sequences.
//
// Parameters:
// - iterator: An iterator of biosequences to be processed.
// - classifier: A pointer to a BioSequenceClassifier used to classify the sequences
// during distribution.
//
// Returns:
// An iterator of biosequences representing the processed chunks.
//
// The function operates asynchronously.
func ISequenceChunkOnMemory(iterator obiiter.IBioSequence,
classifier *obiseq.BioSequenceClassifier) (obiiter.IBioSequence, error) {
newIter := obiiter.MakeIBioSequence()

View File

@@ -25,18 +25,35 @@ func IUniqueSequence(iterator obiiter.IBioSequence,
log.Infoln("Starting data splitting")
cat := opts.Categories()
na := opts.NAValue()
// Classifier for bucketing: Hash only to control number of chunks
bucketClassifier := obiseq.HashClassifier(opts.BatchCount())
// Classifier for uniqueness: Sequence + categories
var uniqueClassifier *obiseq.BioSequenceClassifier
if len(cat) > 0 {
cls := make([]*obiseq.BioSequenceClassifier, len(cat)+1)
cls[0] = obiseq.SequenceClassifier()
for i, c := range cat {
cls[i+1] = obiseq.AnnotationClassifier(c, na)
}
uniqueClassifier = obiseq.CompositeClassifier(cls...)
} else {
uniqueClassifier = obiseq.SequenceClassifier()
}
if opts.SortOnDisk() {
nworkers = 1
iterator, err = ISequenceChunkOnDisk(iterator,
obiseq.HashClassifier(opts.BatchCount()))
iterator, err = ISequenceChunkOnDisk(iterator, bucketClassifier, true, na, opts.StatsOn(), uniqueClassifier)
if err != nil {
return obiiter.NilIBioSequence, err
}
} else {
iterator, err = ISequenceChunk(iterator,
obiseq.HashClassifier(opts.BatchCount()))
iterator, err = ISequenceChunkOnMemory(iterator, bucketClassifier)
if err != nil {
return obiiter.NilIBioSequence, err
@@ -63,63 +80,25 @@ func IUniqueSequence(iterator obiiter.IBioSequence,
return neworder
}
var ff func(obiiter.IBioSequence,
*obiseq.BioSequenceClassifier,
int)
cat := opts.Categories()
na := opts.NAValue()
ff = func(input obiiter.IBioSequence,
classifier *obiseq.BioSequenceClassifier,
icat int) {
icat--
ff := func(input obiiter.IBioSequence,
classifier *obiseq.BioSequenceClassifier) {
input, err = ISequenceSubChunk(input,
classifier,
1)
var next obiiter.IBioSequence
if icat >= 0 {
next = obiiter.MakeIBioSequence()
iUnique.Add(1)
go ff(next,
obiseq.AnnotationClassifier(cat[icat], na),
icat)
}
o := 0
for input.Next() {
batch := input.Get()
if icat < 0 || len(batch.Slice()) == 1 {
// No more sub classification of sequence or only a single sequence
if !(opts.NoSingleton() && len(batch.Slice()) == 1 && batch.Slice()[0].Count() == 1) {
iUnique.Push(batch.Reorder(nextOrder()))
}
} else {
// A new step of classification must du realized
next.Push(batch.Reorder(o))
o++
if !(opts.NoSingleton() && len(batch.Slice()) == 1 && batch.Slice()[0].Count() == 1) {
iUnique.Push(batch.Reorder(nextOrder()))
}
}
if icat >= 0 {
next.Close()
}
iUnique.Done()
}
for i := 0; i < nworkers-1; i++ {
go ff(iterator.Split(),
obiseq.SequenceClassifier(),
len(cat))
go ff(iterator.Split(), uniqueClassifier.Clone())
}
go ff(iterator,
obiseq.SequenceClassifier(),
len(cat))
go ff(iterator, uniqueClassifier)
iMerged := iUnique.IMergeSequenceBatch(opts.NAValue(),
opts.StatsOn(),

View File

@@ -2,21 +2,75 @@ package obicorazick
import (
log "github.com/sirupsen/logrus"
"sync"
"os"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obiseq"
"git.metabarcoding.org/obitools/obitools4/obitools4/pkg/obidefault"
"github.com/rrethy/ahocorasick"
"github.com/schollz/progressbar/v3"
)
func AhoCorazickWorker(slot string, patterns []string) obiseq.SeqWorker {
matcher := ahocorasick.CompileStrings(patterns)
sizebatch:=10000000
nmatcher := len(patterns) / sizebatch + 1
log.Infof("Building AhoCorasick %d matcher for %d patterns in slot %s",
nmatcher, len(patterns), slot)
if nmatcher == 0 {
log.Errorln("No patterns provided")
}
matchers := make([]*ahocorasick.Matcher, nmatcher)
ieme := make(chan int)
mutex := &sync.WaitGroup{}
npar := min(obidefault.ParallelWorkers(), nmatcher)
mutex.Add(npar)
pbopt := make([]progressbar.Option, 0, 5)
pbopt = append(pbopt,
progressbar.OptionSetWriter(os.Stderr),
progressbar.OptionSetWidth(15),
progressbar.OptionShowCount(),
progressbar.OptionShowIts(),
progressbar.OptionSetDescription("Building AhoCorasick matcher..."),
)
bar := progressbar.NewOptions(nmatcher, pbopt...)
bar.Add(0)
builder := func() {
for i := range ieme {
matchers[i] = ahocorasick.CompileStrings(patterns[i*sizebatch:min((i+1)*sizebatch,len(patterns))])
bar.Add(1)
}
mutex.Done()
}
for i := 0; i < npar; i++ {
go builder()
}
for i := 0; i < nmatcher; i++ {
ieme <- i
}
close(ieme)
mutex.Wait()
fslot := slot + "_Fwd"
rslot := slot + "_Rev"
f := func(s *obiseq.BioSequence) (obiseq.BioSequenceSlice, error) {
matchesF := len(matcher.FindAllByteSlice(s.Sequence()))
matchesR := len(matcher.FindAllByteSlice(s.ReverseComplement(false).Sequence()))
matchesF := 0
matchesR := 0
b := s.Sequence()
bc := s.ReverseComplement(false).Sequence()
for _, matcher := range matchers {
matchesF += len(matcher.FindAllByteSlice(b))
matchesR += len(matcher.FindAllByteSlice(bc))
}
log.Debugln("Macthes = ", matchesF, matchesR)
matches := matchesF + matchesR

View File

@@ -1,15 +1,15 @@
package obidefault
var __compressed__ = false
var __compress__ = false
func CompressOutput() bool {
return __compressed__
return __compress__
}
func SetCompressOutput(b bool) {
__compressed__ = b
__compress__ = b
}
func CompressedPtr() *bool {
return &__compressed__
func CompressOutputPtr() *bool {
return &__compress__
}

Some files were not shown because too many files have changed in this diff Show More