Files
obitools4/doc/book/library.qmd
Eric Coissac c94b2974fb Reorganization of the documentation book directory
Former-commit-id: 095acaf9c8649b0e527c6253dc79330feedac865
2023-02-23 23:41:24 +01:00

213 lines
7.3 KiB
Plaintext

# The GO *OBITools* library
## BioSequence
The `BioSequence` class is used to represent biological sequences. It
allows for storing : - the sequence itself as a `[]byte` - the
sequencing quality score as a `[]byte` if needed - an identifier as a
`string` - a definition as a `string` - a set of *(key, value)* pairs in
a `map[sting]interface{}`
BioSequence is defined in the obiseq module and is included using the
code
``` go
import (
"git.metabarcoding.org/lecasofts/go/obitools/pkg/obiseq"
)
```
### Creating new instances
To create new instance, use
- `MakeBioSequence(id string, sequence []byte, definition string) obiseq.BioSequence`
- `NewBioSequence(id string, sequence []byte, definition string) *obiseq.BioSequence`
Both create a `BioSequence` instance, but when the first one returns the
instance, the second returns a pointer on the new instance. Two other
functions `MakeEmptyBioSequence`, and `NewEmptyBioSequence` do the same
job but provide an uninitialized objects.
- `id` parameters corresponds to the unique identifier of the
sequence. It mist be a string constituted of a single word (not
containing any space).
- `sequence` is the DNA sequence itself, provided as a `byte` array
(`[]byte`).
- `definition` is a `string`, potentially empty, but usualy containing
a sentence explaining what is that sequence.
``` go
import (
"git.metabarcoding.org/lecasofts/go/obitools/pkg/obiseq"
)
func main() {
myseq := obiseq.NewBiosequence(
"seq_GH0001",
bytes.FromString("ACGTGTCAGTCG"),
"A short test sequence",
)
}
```
When formated as fasta the parameters correspond to the following schema
>id definition containing potentially several words
sequence
### End of life of a `BioSequence` instance
When an instance of `BioSequence` is no longer in use, it is normally taken over by the GO garbage collector. If you know that an instance will never be used again, you can, if you wish, call the `Recycle` method on it to store the allocated memory elements in a `pool` to limit the allocation effort when many sequences are being handled. Once the recycle method has been called on an instance, you must ensure that no other method is called on it.
### Accessing to the elements of a sequence
The different elements of an `obiseq.BioSequence` must be accessed using
a set of methods. For the three main elements provided during the
creation of a new instance methodes are :
- `Id() string`
- `Sequence() []byte`
- `Definition() string`
It exists pending method to change the value of these elements
- `SetId(id string)`
- `SetSequence(sequence []byte)`
- `SetDefinition(definition string)`
``` go
import (
"fmt"
"git.metabarcoding.org/lecasofts/go/obitools/pkg/obiseq"
)
func main() {
myseq := obiseq.NewBiosequence(
"seq_GH0001",
bytes.FromString("ACGTGTCAGTCG"),
"A short test sequence",
)
fmt.Println(myseq.Id())
myseq.SetId("SPE01_0001")
fmt.Println(myseq.Id())
}
```
#### Different ways for accessing an editing the sequence
If `Sequence()`and `SetSequence(sequence []byte)` methods are the basic
ones, several other methods exist.
- `String() string` return the sequence directly converted to a
`string` instance.
- The `Write` method family allows for extending an existing sequence
following the buffer protocol.
- `Write(data []byte) (int, error)` allows for appending a byte
array on 3' end of the sequence.
- `WriteString(data string) (int, error)` allows for appending a
`string`.
- `WriteByte(data byte) error` allows for appending a single
`byte`.
The `Clear` method empties the sequence buffer.
``` go
import (
"fmt"
"git.metabarcoding.org/lecasofts/go/obitools/pkg/obiseq"
)
func main() {
myseq := obiseq.NewEmptyBiosequence()
myseq.WriteString("accc")
myseq.WriteByte(byte('c'))
fmt.Println(myseq.String())
}
```
#### Sequence quality scores
Sequence quality scores cannot be initialized at the time of instance
creation. You must use dedicated methods to add quality scores to a
sequence.
To be coherent the length of both the DNA sequence and que quality score
sequence must be equal. But assessment of this constraint is realized.
It is of the programmer responsability to check that invariant.
While accessing to the quality scores relies on the method
`Quality() []byte`, setting the quality need to call one of the
following method. They run similarly to their sequence dedicated
conterpart.
- `SetQualities(qualities Quality)`
- `WriteQualities(data []byte) (int, error)`
- `WriteByteQualities(data byte) error`
In a way analogous to the `Clear` method, `ClearQualities()` empties the
sequence of quality scores.
### The annotations of a sequence
A sequence can be annotated with attributes. Each attribute is associated with a value. An attribute is identified by its name.
The name of an attribute consists of a character string containing no spaces or blank characters. Values can be of several types.
- Scalar types:
- integer
- numeric
- character
- boolean
- Container types:
- vector
- map
Vectors can contain any type of scalar. Maps are compulsorily indexed by strings and can contain any scalar type. It is not possible to have nested container type.
Annotations are stored in an object of type `bioseq.Annotation` which is an alias of `map[string]interface{}`. This map can be retrieved using the `Annotations() Annotation` method. If no annotation has been defined for this sequence, the method returns an empty map. It is possible to test an instance of `BioSequence` using its `HasAnnotation() bool` method to see if it has any annotations associated with it.
- GetAttribute(key string) (interface{}, bool)
## The sequence iterator
The pakage *obiter* provides an iterator mecanism for manipulating sequences. The main class provided by this package is `obiiter.IBioSequence`. An `IBioSequence` iterator provides batch of sequences.
### Basic usage of a sequence iterator
Many functions, among them functions reading sequences from a text file, return a `IBioSequence` iterator. The iterator class provides two main methods:
- `Next() bool`
- `Get() obiiter.BioSequenceBatch`
The `Next` method moves the iterator to the next value, while the `Get` method returns the currently pointed value. Using them, it is possible to loop over the data as in the following code chunk.
``` go
import (
"git.metabarcoding.org/lecasofts/go/obitools/pkg/obiformats"
)
func main() {
mydata := obiformats.ReadFastSeqFromFile("myfile.fasta")
for mydata.Next() {
data := mydata.Get()
//
// Whatever you want to do with the data chunk
//
}
}
```
An `obiseq.BioSequenceBatch` instance is a set of sequences stored in an `obiseq.BioSequenceSlice` and a sequence number. The number of sequences in a batch is not defined. A batch can even contain zero sequences, if for example all sequences initially included in the batch have been filtered out at some stage of their processing.
### The `Pipable` functions
A function consuming a `obiiter.IBioSequence` and returning a `obiiter.IBioSequence` is of class `obiiter.Pipable`.
### The `Teeable` functions
A function consuming a `obiiter.IBioSequence` and returning two `obiiter.IBioSequence` instance is of class `obiiter.Teeable`.