Refactoring code to remove buffer size options. And some other changes...

Former-commit-id: 10b57cc1a27446ade3c444217341e9651e89cdce
2025-06-29 16:20:46 +00:00 · 2023-03-07 11:12:13 +07:00
parent 9811e440b8
commit d88de15cdc
52 changed files with 1172 additions and 421 deletions
--- a/doc/book/annexes.qmd
+++ b/doc/book/annexes.qmd
@ -1,82 +1,107 @@
 # Annexes

-### Sequence attributes
+## Sequence attributes

-#### Reserved sequence attributes
+**ali_dir (`string`)**

-##### `ali_dir`
+  - Set by the *obipairing* tool
+  - The attribute takes one of two string values: `left` or `right`.

-###### Type : `string`
+  *obipairing* uses a 3'-end gap-free alignment algorithm.
+  Two cases can occur when aligning the forward and reverse reads. If the
+  barcode is long enough, the reads overlap only at their 3' ends. In
+  that case, the alignment direction `ali_dir` is set to *left*. If the
+  barcode is shorter than the read length, the paired reads overlap at
+  their 5' ends, and the complete barcode is sequenced by both reads.
+  In the latter case, `ali_dir` is set to *right*.

-The attribute can contain 2 string values `"left"` or `"right".`
+**ali_length (`int`)**

-###### Set by the *obipairing* tool
+  - Set by the *obipairing* tool

-The alignment generated by *obipairing* is a 3'-end gap free algorithm.
-Two cases can occur when aligning the forward and reverse reads. If the
-barcode is long enough, both the reads overlap only on their 3' ends. In
-such case, the alignment direction `ali_dir` is set to *left*. If the
-barcode is shorter than the read length, the paired reads overlap by
-their 5' ends, and the complete barcode is sequenced by both the reads.
-In that later case, `ali_dir` is set to *right*.
+  Length of the aligned parts when merging forward and reverse reads.

-##### `ali_length`

-###### Set by the *obipairing* tool
+**count (`int`)**

-Length of the aligned parts when merging forward and reverse reads
+  - Set by the *obiuniq* tool
+  - Getter : method `Count()`
+  - Setter : method `SetCount(int)`

-##### `count` : the number of sequence occurrences
+  The `count` attribute indicates how many strictly identical reads
+  have been merged into a single record. It contains an integer value. If it
+  is absent, the sequence record represents a single
+  occurrence of the sequence.

-###### Set by the *obiuniq* tool
+  The `Count()` method gives access to the `count` attribute as an
+  integer value. If the `count` attribute is not defined for the given
+  sequence, the value *1* is returned.
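
The default-to-one behaviour described above can be sketched in Go as follows. This is only an illustration: the `Record` type and its `Attributes` field are hypothetical stand-ins, not the actual OBITools types; only the method names `Count()` and `SetCount(int)` come from the text.

```go
package main

import "fmt"

// Record is a hypothetical stand-in for a sequence record (assumption,
// not the OBITools type). Attributes holds the annotations by name.
type Record struct {
	Attributes map[string]interface{}
}

// Count returns the "count" attribute as an int, or 1 when the
// attribute is absent, mirroring the behaviour described in the text.
func (r Record) Count() int {
	if v, ok := r.Attributes["count"].(int); ok {
		return v
	}
	return 1
}

// SetCount stores n in the "count" attribute, creating the map if needed.
func (r *Record) SetCount(n int) {
	if r.Attributes == nil {
		r.Attributes = make(map[string]interface{})
	}
	r.Attributes["count"] = n
}

func main() {
	var r Record
	fmt.Println("without attribute:", r.Count())
	r.SetCount(5)
	fmt.Println("after SetCount(5):", r.Count())
}
```

Returning 1 for a missing attribute lets downstream code sum counts without special-casing records that were never dereplicated.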

-The `count` attribute indicates how-many strictly identical sequences
-have been merged in a single record. It contains an integer value. If it
-is absent this means that the sequence record represents a single
-occurrence of the sequence.
+**merged_* (`map[string]int`)**

-###### Getter : method `Count()`
+  - Set by the *obiuniq* tool

-The `Count()` method allows to access to the count attribute as an
-integer value. If the `count` attribute is not defined for the given
-sequence, the value *1* is returned
+  The `-m` option of the *obiuniq* tool keeps track of the
+  distribution of the values stored in a given attribute of interest. This
+  option is often used to summarise the distribution of a sequence variant
+  across samples when *obiuniq* is run after *obimultiplex*. The
+  actual name of the attribute depends on the name of the monitored
+  attribute. If the `-m` option is used with the attribute *sample*, then this
+  attribute is named *merged_sample*.

-##### `merged_*`
+**mode (`string`)**

-###### Type : `map[string]int`
+  - Set by the *obipairing* tool
+  - The attribute takes one of two string values: `join` or `alignment`.

-###### Set by the *obiuniq* tool

-The `-m` option of the *obiuniq* tools allows for keeping track of the
-distribution of the values stored in given attribute of interest. Often
-this option is used to summarise distribution of a sequence variant
-accross samples when *obiuniq* is run after running *obimultiplex*. The
-actual name of the attribute depends on the name of the monitored
-attribute. If `-m` option is used with the attribute *sample*, then this
-attribute names *merged_sample*.
+**obitag_ref_index (`map[string]string`)**

-##### `mode`
+  - Set by the *obirefidx* tool.

-###### Set by the *obipairing* tool
+  It summarises the taxonomic annotation to which a match to that sequence must
+  lead, according to the number of differences between the query
+  sequence and the reference sequence carrying that tag.

-**`obitag_ref_index`**
+```json
+   {"0":"9606@Homo sapiens@species",
+    "2":"207598@Homininae@subfamily",
+    "3":"9604@Hominidae@family",
+    "8":"314295@Hominoidea@superfamily",
+    "10":"9526@Catarrhini@parvorder",
+    "12":"1437010@Boreoeutheria@clade",
+    "16":"9347@Eutheria@clade",
+    "17":"40674@Mammalia@class",
+    "22":"117571@Euteleostomi@clade",
+    "25":"7776@Gnathostomata@clade",
+    "29":"33213@Bilateria@clade",
+    "30":"6072@Eumetazoa@clade"}
+```
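
One possible way to read such an index is sketched below in Go. It assumes that each key is a minimum number of differences and that the applicable annotation is the one with the largest key not exceeding the observed number of differences. That lookup rule and the `annotationFor` helper are assumptions made for illustration, not a description of how *obitag* is implemented.

```go
package main

import (
	"fmt"
	"strconv"
)

// annotationFor returns the entry whose numeric key is the largest one
// not exceeding diffs. ASSUMPTION: this "largest threshold <= diffs"
// rule is an interpretation of the index, not the confirmed algorithm.
func annotationFor(index map[string]string, diffs int) (string, bool) {
	best := -1
	label := ""
	for k, v := range index {
		n, err := strconv.Atoi(k)
		if err != nil {
			continue // ignore keys that are not integers
		}
		if n <= diffs && n > best {
			best, label = n, v
		}
	}
	return label, best >= 0
}

func main() {
	index := map[string]string{
		"0": "9606@Homo sapiens@species",
		"2": "207598@Homininae@subfamily",
		"3": "9604@Hominidae@family",
	}
	for _, d := range []int{0, 1, 2, 5} {
		label, ok := annotationFor(index, d)
		fmt.Println(d, label, ok)
	}
}
```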

-###### Set by the *obirefidx* tool.
+**pairing_mismatches (`map[string]string`)**

-It resumes to which taxonomic annotation a match to that sequence must
-lead according to the number of differences existing between the query
-sequence and the reference sequence having that tag.
+  - Set by the *obipairing* tool

-###### Getter : method `Count()`
+**seq_a_single (`int`)**

-##### `pairing_mismatches`
+  - Set by the *obipairing* tool

-###### Set by the *obipairing* tool
+**seq_ab_match (`int`)**

-##### `score`
+  - Set by the *obipairing* tool

-###### Set by the *obipairing* tool
+**seq_b_single (`int`)**

-##### `score_norm`
+  - Set by the *obipairing* tool

-###### Set by the *obipairing* tool
+**score (`int`)**
+
+  - Set by the *obipairing* tool
+
+**score_norm (`float`)**
+
+  - Set by the *obipairing* tool
+  - The value ranges between 0 and 1.
+
+  Score of the alignment between forward and reverse reads expressed as a fraction of identity.
+ 
--- a/doc/book/comm_sampling.qmd
+++ b/doc/book/comm_sampling.qmd
@ -10,13 +10,39 @@

 Sequences can be selected on several of their characteristics: their length, their id, their sequence. Options allow for specifying the selection condition.

+**Selection based on the sequence**
+
+
+Sequence records can be selected according to whether or not they match a pattern. The simplest pattern is a short sequence (*e.g.* `AACCTT`), but regular patterns allow searching for more complex motifs. For example, `A[TG]C+G` matches an `A`, followed by a `T` or a `G`, then one or several `C`, and finally a `G`.
+
+{{< include ../lib/options/selection/_sequence.qmd >}}
+
+*Examples:*
+
+: Selects only the sequence records that contain an *EcoRI* restriction site.
+
+```bash   
+obigrep -s 'GAATTC' seq1.fasta > seq2.fasta
+```
+    
+: Selects only the sequence records that contain a stretch of at least 10 ``A``.    
+    
+```bash   
+obigrep -s 'A{10,}' seq1.fasta > seq2.fasta
+```
+
+: Selects only the sequence records that do not contain ambiguous nucleotides.
+    
+```bash   
+obigrep -s '^[ACGT]+$' seq1.fasta > seq2.fasta
+```
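
The patterns used above can be tried out with Go's standard `regexp` package, as in the minimal sketch below. The `matches` helper is hypothetical, and the sketch only illustrates the pattern syntax: it does not reproduce how *obigrep* handles case or sequence encoding.

```go
package main

import (
	"fmt"
	"regexp"
)

// matches reports whether pattern is found anywhere in seq.
// Hypothetical helper for illustration; it panics on an invalid pattern.
func matches(pattern, seq string) bool {
	return regexp.MustCompile(pattern).MatchString(seq)
}

func main() {
	// EcoRI restriction site
	fmt.Println(matches(`GAATTC`, "TTGAATTCAA"))
	// Stretch of at least 10 A
	fmt.Println(matches(`A{10,}`, "CCAAAAAAAAAAAAGG"))
	// Only unambiguous nucleotides over the whole sequence
	fmt.Println(matches(`^[ACGT]+$`, "ACGTNACGT"))
	// An A, then T or G, then one or more C, then G
	fmt.Println(matches(`A[TG]C+G`, "TTAGCCCGTT"))
}
```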


 {{< include ../lib/options/selection/_min-count.qmd >}}

 {{< include ../lib/options/selection/_max-count.qmd >}}

-Example
+*Examples*

 : Selecting sequence records representing at least five reads in the dataset.

--- a/doc/book/expressions.qmd
+++ b/doc/book/expressions.qmd
@ -11,26 +11,64 @@ Several OBITools (*e.g.* obigrep, obiannotate) allow the user to specify some si

 ### Introspection functions {.unnumbered}

- `len(x)`is a generic function allowing to retreive the size of a object. It returns 
+**`len(x)`**
+
+: A generic function that retrieves the size of an object. It returns 
  the length of a sequence, the number of elements in a map like `annotations`, or the number
  of elements in an array. The returned value is an `int`.

 ### Cast functions {.unnumbered}

- `int(x)`  converts if possible the `x` value to an integer value. The function 
+**`int(x)`**  
+
+: Converts the `x` value to an integer value, if possible. The function 
  returns an `int`.
- `numeric(x)` converts if possible the `x` value to a float value. The function 
+
+**`numeric(x)`** 
+
+: Converts the `x` value to a float value, if possible. The function 
  returns a `float`.
- `bool(x)` converts if possible the `x` value to a boolean value. The function 
+
+**`bool(x)`** 
+
+: Converts the `x` value to a boolean value, if possible. The function 
  returns a `bool`.

 ### String related functions {.unnumbered}

- `printf(format,...)` allows to combine several values to build a string. `format` follows the
+**`printf(format,...)`** 
+
+: Combines several values to build a string. `format` follows the
   classical C `printf` syntax. The function returns a `string`.
- `subspc(x)` substitutes every space in the `x` string by the underscore (`_`) character. The function 
+
+**`subspc(x)`** 
+
+: Substitutes every space in the `x` string with the underscore (`_`) character. The function 
   returns a `string`. 
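
The behaviour of these two functions can be approximated in plain Go, as sketched below. This only illustrates the semantics described above: `subspc` is re-implemented here as a hypothetical helper, and Go's `fmt.Sprintf` stands in for the expression-language `printf`, whose exact set of supported verbs is not specified in this text.

```go
package main

import (
	"fmt"
	"strings"
)

// subspc replaces every space in x with an underscore.
// Hypothetical re-implementation for illustration only.
func subspc(x string) string {
	return strings.ReplaceAll(x, " ", "_")
}

func main() {
	// Build a label from a taxon name and a count, C printf style.
	label := fmt.Sprintf("%s_%d", subspc("Homo sapiens"), 42)
	fmt.Println(label)
}
```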

+### Condition function {.unnumbered}
+
+**`ifelse(condition,val1,val2)`**
+
+: The `condition` value has to be a `bool` value. If it is `true`, the function returns `val1`;
+  otherwise, it returns `val2`.
+
+### Sequence analysis related functions
+
+**`composition(sequence)`**
+
+: The nucleotide composition of the sequence is returned as a map indexed by `a`, `c`, `g`, or `t`, where
+  each value is the number of occurrences of that nucleotide. A fifth key, `others`, accounts for
+  all other symbols.  
+
+**`gcskew(sequence)`**
+
+: Computes the excess of G compared to C in the sequence, known as the GC skew.
+
+    $$
+    Skew_{GC}=\frac{G-C}{G+C}
+    $$
+    
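
A minimal Go sketch of both computations is given below. It follows the map layout and the formula above, with two assumptions of its own: counting is case-insensitive, and a sequence without any G or C yields a skew of 0 to avoid a division by zero. These helpers are illustrative and are not the OBITools implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// composition counts a, c, g and t; every other symbol goes to "others".
// ASSUMPTION: counting is case-insensitive.
func composition(seq string) map[string]int {
	comp := map[string]int{"a": 0, "c": 0, "g": 0, "t": 0, "others": 0}
	for _, r := range strings.ToLower(seq) {
		switch r {
		case 'a', 'c', 'g', 't':
			comp[string(r)]++
		default:
			comp["others"]++
		}
	}
	return comp
}

// gcskew computes (G - C) / (G + C).
// ASSUMPTION: returns 0 when the sequence contains neither G nor C.
func gcskew(seq string) float64 {
	comp := composition(seq)
	g, c := float64(comp["g"]), float64(comp["c"])
	if g+c == 0 {
		return 0
	}
	return (g - c) / (g + c)
}

func main() {
	seq := "GGGCATN"
	fmt.Println(composition(seq))
	fmt.Println(gcskew(seq))
}
```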
 ## Accessing to the sequence annotations

 The `annotations` variable is a map object containing all the annotations associated with the currently processed sequence. The keys of the map are the attribute names. There are two possibilities to retrieve
@ -53,4 +91,7 @@ Special attributes of the sequence are accessible only by dedicated methods of t
 - The sequence identifier : `Id()`
 - The sequence definition : `Definition()`

+```go
+sequence.Id()
+```