Refactoring code to remove buffer size options. And some other changes...

Former-commit-id: 10b57cc1a27446ade3c444217341e9651e89cdce
2023-03-07 11:12:13 +07:00
parent 9811e440b8
commit d88de15cdc
52 changed files with 1172 additions and 421 deletions


@ -1,82 +1,107 @@
# Annexes
### Sequence attributes
## Sequence attributes
#### Reserved sequence attributes
**ali_dir (`string`)**
##### `ali_dir`
- Set by the *obipairing* tool
- The attribute can contain 2 string values `left` or `right`.
###### Type : `string`
*obipairing* aligns the reads with a 3'-end gap-free algorithm.
Two cases can occur when aligning the forward and reverse reads. If the
barcode is long enough, the reads overlap only at their 3' ends. In
such a case, the alignment direction `ali_dir` is set to *left*. If the
barcode is shorter than the read length, the paired reads overlap at
their 5' ends, and the complete barcode is sequenced by both reads.
In that latter case, `ali_dir` is set to *right*.
The attribute can contain 2 string values `"left"` or `"right"`.
**ali_length (`int`)**
###### Set by the *obipairing* tool
- Set by the *obipairing* tool
*obipairing* aligns the reads with a 3'-end gap-free algorithm.
Two cases can occur when aligning the forward and reverse reads. If the
barcode is long enough, the reads overlap only at their 3' ends. In
such a case, the alignment direction `ali_dir` is set to *left*. If the
barcode is shorter than the read length, the paired reads overlap at
their 5' ends, and the complete barcode is sequenced by both reads.
In that latter case, `ali_dir` is set to *right*.
Length of the aligned parts when merging forward and reverse reads
##### `ali_length`
###### Set by the *obipairing* tool
**count (`int`)**
Length of the aligned parts when merging forward and reverse reads
- Set by the *obiuniq* tool
- Getter : method `Count()`
- Setter : method `SetCount(int)`
##### `count` : the number of sequence occurrences
The `count` attribute indicates how many strictly identical reads
have been merged in a single record. It contains an integer value. If it
is absent this means that the sequence record represents a single
occurrence of the sequence.
###### Set by the *obiuniq* tool
The `Count()` method gives access to the `count` attribute as an
integer value. If the `count` attribute is not defined for the given
sequence, the value *1* is returned.
The `count` attribute indicates how many strictly identical sequences
have been merged in a single record. It contains an integer value. If it
is absent this means that the sequence record represents a single
occurrence of the sequence.
**merged_* (`map[string]int`)**
###### Getter : method `Count()`
- Set by the *obiuniq* tool
The `Count()` method gives access to the `count` attribute as an
integer value. If the `count` attribute is not defined for the given
sequence, the value *1* is returned.
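The default-to-one behaviour described above can be sketched as follows. This is only an illustration: the `record` type and its `annotations` field are hypothetical stand-ins, not the actual OBITools types.

```go
package main

import "fmt"

// record is a hypothetical stand-in for a sequence record;
// it is not the OBITools type.
type record struct {
	annotations map[string]any
}

// Count returns the value of the "count" attribute,
// or 1 when the attribute is absent (or is not an int).
func (r record) Count() int {
	if v, ok := r.annotations["count"]; ok {
		if n, ok := v.(int); ok {
			return n
		}
	}
	return 1
}

func main() {
	merged := record{annotations: map[string]any{"count": 12}}
	single := record{annotations: map[string]any{}}
	fmt.Println(merged.Count(), single.Count()) // prints "12 1"
}
```

Returning 1 rather than 0 for a missing attribute keeps read totals correct when summing over records that were never processed by *obiuniq*.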
The `-m` option of the *obiuniq* tool allows for keeping track of the
distribution of the values stored in a given attribute of interest. Often
this option is used to summarise the distribution of a sequence variant
across samples when *obiuniq* is run after running *obimultiplex*. The
actual name of the attribute depends on the name of the monitored
attribute. If the `-m` option is used with the attribute *sample*, then this
attribute is named *merged_sample*.
##### `merged_*`
**mode (`string`)**
###### Type : `map[string]int`
- Set by the *obipairing* tool
- The attribute can contain 2 string values `join` or `alignment`.
###### Set by the *obiuniq* tool
The `-m` option of the *obiuniq* tool allows for keeping track of the
distribution of the values stored in a given attribute of interest. Often
this option is used to summarise the distribution of a sequence variant
across samples when *obiuniq* is run after running *obimultiplex*. The
actual name of the attribute depends on the name of the monitored
attribute. If the `-m` option is used with the attribute *sample*, then this
attribute is named *merged_sample*.
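To make the shape of a `merged_sample` value concrete, here is a rough sketch of how such a `map[string]int` could be tallied. The `read` type and the `mergedSample` function are hypothetical and do not reproduce the *obiuniq* implementation.

```go
package main

import "fmt"

// read is a hypothetical, simplified sequence record.
type read struct {
	sequence string
	sample   string
}

// mergedSample tallies, for each distinct sequence,
// how many reads come from each sample.
func mergedSample(reads []read) map[string]map[string]int {
	merged := make(map[string]map[string]int)
	for _, r := range reads {
		if merged[r.sequence] == nil {
			merged[r.sequence] = make(map[string]int)
		}
		merged[r.sequence][r.sample]++
	}
	return merged
}

func main() {
	reads := []read{
		{"acgt", "s1"}, {"acgt", "s1"}, {"acgt", "s2"}, {"ttga", "s2"},
	}
	for seq, dist := range mergedSample(reads) {
		fmt.Println(seq, dist)
	}
}
```

In this sketch the inner map plays the role of the `merged_sample` attribute of one dereplicated record, and the sum of its values corresponds to that record's `count`.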
**obitag_ref_index (`map[string]string`)**
##### `mode`
- Set by the *obirefidx* tool.
###### Set by the *obipairing* tool
It summarizes which taxonomic annotation a match to that sequence must
lead to, according to the number of differences between the query
sequence and the reference sequence carrying that tag.
**`obitag_ref_index`**
```json
{"0":"9606@Homo sapiens@species",
"2":"207598@Homininae@subfamily",
"3":"9604@Hominidae@family",
"8":"314295@Hominoidea@superfamily",
"10":"9526@Catarrhini@parvorder",
"12":"1437010@Boreoeutheria@clade",
"16":"9347@Eutheria@clade",
"17":"40674@Mammalia@class",
"22":"117571@Euteleostomi@clade",
"25":"7776@Gnathostomata@clade",
"29":"33213@Bilateria@clade",
"30":"6072@Eumetazoa@clade"}
```
###### Set by the *obirefidx* tool.
**pairing_mismatches (`map[string]string`)**
It summarizes which taxonomic annotation a match to that sequence must
lead to, according to the number of differences between the query
sequence and the reference sequence carrying that tag.
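One possible reading of such an index is that a query with `d` differences receives the annotation stored under the largest key not exceeding `d`. The sketch below works under that assumption; `taxonFor` is a hypothetical helper, not an OBITools function.

```go
package main

import (
	"fmt"
	"strconv"
)

// taxonFor returns the entry whose key is the largest number of
// differences not exceeding d. This reading of the index is an
// assumption made for illustration.
func taxonFor(index map[string]string, d int) (string, bool) {
	best, found := -1, false
	var taxon string
	for k, v := range index {
		n, err := strconv.Atoi(k)
		if err != nil || n > d {
			continue // skip malformed keys and thresholds above d
		}
		if n > best {
			best, taxon, found = n, v, true
		}
	}
	return taxon, found
}

func main() {
	index := map[string]string{
		"0": "9606@Homo sapiens@species",
		"2": "207598@Homininae@subfamily",
		"3": "9604@Hominidae@family",
	}
	for _, d := range []int{0, 1, 2, 5} {
		taxon, _ := taxonFor(index, d)
		fmt.Println(d, taxon)
	}
}
```

Under this assumption, a query with one difference would still be assigned to the species, while a query with five differences would only be assigned to the family.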
- Set by the *obipairing* tool
###### Getter : method `Count()`
**seq_a_single (`int`)**
##### `pairing_mismatches`
- Set by the *obipairing* tool
###### Set by the *obipairing* tool
**seq_ab_match (`int`)**
##### `score`
- Set by the *obipairing* tool
###### Set by the *obipairing* tool
**seq_b_single (`int`)**
##### `score_norm`
- Set by the *obipairing* tool
###### Set by the *obipairing* tool
**score (`int`)**
- Set by the *obipairing* tool
**score_norm (`float`)**
- Set by the *obipairing* tool
- The value ranges between 0 and 1.
Score of the alignment between forward and reverse reads expressed as a fraction of identity.


@ -10,13 +10,39 @@
Sequences can be selected on several of their characteristics: their length, their id, their sequence. Options allow for specifying the selection condition.
**Selection based on the sequence**
Sequence records can be selected according to whether or not they match a pattern. The simplest pattern is a short sequence (*e.g.* `AACCTT`), but regular patterns allow looking for more complex motifs. For example, `A[TG]C+G` matches an `A`, followed by a `T` or a `G`, then one or several `C` and finally a `G`.
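The behaviour of the `A[TG]C+G` pattern can be checked with a small sketch. It assumes Go's standard `regexp` syntax and case-sensitive matching; the text above does not state which pattern engine *obigrep* relies on, and `matchesPattern` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesPattern reports whether the pattern occurs anywhere in seq.
func matchesPattern(pattern, seq string) bool {
	return regexp.MustCompile(pattern).MatchString(seq)
}

func main() {
	pattern := `A[TG]C+G`
	for _, seq := range []string{"TTAGCCCGAA", "TTACGAA", "ATCG"} {
		fmt.Println(seq, matchesPattern(pattern, seq))
	}
}
```

Here `TTAGCCCGAA` and `ATCG` match, whereas `TTACGAA` does not, because its `A` is never followed by a `T` or a `G` before the `C`.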
{{< include ../lib/options/selection/_sequence.qmd >}}
*Examples:*
: Selects only the sequence records that contain an *EcoRI* restriction site.
```bash
obigrep -s 'GAATTC' seq1.fasta > seq2.fasta
```
: Selects only the sequence records that contain a stretch of at least 10 ``A``.
```bash
obigrep -s 'A{10,}' seq1.fasta > seq2.fasta
```
: Selects only the sequence records that do not contain ambiguous nucleotides.
```bash
obigrep -s '^[ACGT]+$' seq1.fasta > seq2.fasta
```
{{< include ../lib/options/selection/_min-count.qmd >}}
{{< include ../lib/options/selection/_max-count.qmd >}}
Example
*Examples*
: Selecting sequence records representing at least five reads in the dataset.


@ -11,26 +11,64 @@ Several OBITools (*e.g.* obigrep, obiannotate) allow the user to specify some si
### Introspection functions {.unnumbered}
- `len(x)` is a generic function that retrieves the size of an object. It returns
**`len(x)`**
: It is a generic function that retrieves the size of an object. It returns
the length of a sequence, the number of elements in a map like `annotations`, or the number
of elements in an array. The returned value is an `int`.
### Cast functions {.unnumbered}
- `int(x)` converts if possible the `x` value to an integer value. The function
**`int(x)`**
: Converts if possible the `x` value to an integer value. The function
returns an `int`.
- `numeric(x)` converts if possible the `x` value to a float value. The function
**`numeric(x)`**
: Converts if possible the `x` value to a float value. The function
returns a `float`.
- `bool(x)` converts if possible the `x` value to a boolean value. The function
**`bool(x)`**
: Converts if possible the `x` value to a boolean value. The function
returns a `bool`.
### String related functions {.unnumbered}
- `printf(format,...)` combines several values to build a string. `format` follows the
**`printf(format,...)`**
: Combines several values to build a string. `format` follows the
classical C `printf` syntax. The function returns a `string`.
- `subspc(x)` substitutes every space in the `x` string by the underscore (`_`) character. The function
**`subspc(x)`**
: Substitutes every space in the `x` string by the underscore (`_`) character. The function
returns a `string`.
### Condition function {.unnumbered}
**`ifelse(condition,val1,val2)`**
: The `condition` value has to be a `bool` value. If it is `true` the function returns `val1`,
otherwise, it returns `val2`.
### Sequence analysis related function
**`composition(sequence)`**
: The nucleotide composition of the sequence is returned as a map indexed by `a`, `c`, `g`, or `t`, where
each value is the number of occurrences of that nucleotide. A fifth key `others` accounts for
all other symbols.
**`gcskew(sequence)`**
: Computes the excess of G compared to C in the sequence, known as the GC skew.
$$
Skew_{GC}=\frac{G-C}{G+C}
$$
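The two functions above can be approximated in a few lines of Go. This is a sketch under stated assumptions rather than the OBITools implementation: the input is lower-cased before counting, and a skew of 0 is returned when the sequence contains neither G nor C, a case the formula leaves undefined.

```go
package main

import (
	"fmt"
	"strings"
)

// composition counts a, c, g and t; every other symbol goes to "others".
func composition(seq string) map[string]int {
	comp := map[string]int{"a": 0, "c": 0, "g": 0, "t": 0, "others": 0}
	for _, r := range strings.ToLower(seq) {
		switch r {
		case 'a', 'c', 'g', 't':
			comp[string(r)]++
		default:
			comp["others"]++
		}
	}
	return comp
}

// gcSkew computes (G-C)/(G+C) from the composition.
func gcSkew(seq string) float64 {
	comp := composition(seq)
	g, c := float64(comp["g"]), float64(comp["c"])
	if g+c == 0 {
		return 0 // assumption: the formula is undefined without G or C
	}
	return (g - c) / (g + c)
}

func main() {
	seq := "GGGCATN"
	fmt.Println(composition(seq))
	fmt.Println(gcSkew(seq)) // (3-1)/(3+1) = 0.5
}
```

A positive value indicates more G than C and a negative value the opposite; the result always lies between -1 and 1.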
## Accessing the sequence annotations
The `annotations` variable is a map object containing all the annotations associated with the currently processed sequence. The keys of the map are the attribute names. There are two possibilities to retrieve
@ -53,4 +91,7 @@ Special attributes of the sequence are accessible only by dedicated methods of t
- The sequence identifier : `Id()`
- The sequence definition : `Definition()`
```go
sequence.Id()
```