mirror of
https://github.com/metabarcoding/obitools4.git
synced 2025-06-29 16:20:46 +00:00
213 lines
7.3 KiB
Plaintext
213 lines
7.3 KiB
Plaintext
# The GO *OBITools* library
|
|
|
|
## BioSequence
|
|
|
|
The `BioSequence` class is used to represent biological sequences. It
|
|
allows for storing : - the sequence itself as a `[]byte` - the
|
|
sequencing quality score as a `[]byte` if needed - an identifier as a
|
|
`string` - a definition as a `string` - a set of *(key, value)* pairs in
|
|
a `map[sting]interface{}`
|
|
|
|
BioSequence is defined in the obiseq module and is included using the
|
|
code
|
|
|
|
``` go
|
|
import (
|
|
"git.metabarcoding.org/lecasofts/go/obitools/pkg/obiseq"
|
|
)
|
|
```
|
|
|
|
### Creating new instances
|
|
|
|
To create new instance, use
|
|
|
|
- `MakeBioSequence(id string, sequence []byte, definition string) obiseq.BioSequence`
|
|
- `NewBioSequence(id string, sequence []byte, definition string) *obiseq.BioSequence`
|
|
|
|
Both create a `BioSequence` instance, but when the first one returns the
|
|
instance, the second returns a pointer on the new instance. Two other
|
|
functions `MakeEmptyBioSequence`, and `NewEmptyBioSequence` do the same
|
|
job but provide an uninitialized objects.
|
|
|
|
- `id` parameters corresponds to the unique identifier of the
|
|
sequence. It mist be a string constituted of a single word (not
|
|
containing any space).
|
|
- `sequence` is the DNA sequence itself, provided as a `byte` array
|
|
(`[]byte`).
|
|
- `definition` is a `string`, potentially empty, but usualy containing
|
|
a sentence explaining what is that sequence.
|
|
|
|
``` go
|
|
import (
|
|
"git.metabarcoding.org/lecasofts/go/obitools/pkg/obiseq"
|
|
)
|
|
|
|
func main() {
|
|
myseq := obiseq.NewBiosequence(
|
|
"seq_GH0001",
|
|
bytes.FromString("ACGTGTCAGTCG"),
|
|
"A short test sequence",
|
|
)
|
|
}
|
|
```
|
|
|
|
When formated as fasta the parameters correspond to the following schema
|
|
|
|
>id definition containing potentially several words
|
|
sequence
|
|
|
|
### End of life of a `BioSequence` instance
|
|
|
|
When an instance of `BioSequence` is no longer in use, it is normally taken over by the GO garbage collector. If you know that an instance will never be used again, you can, if you wish, call the `Recycle` method on it to store the allocated memory elements in a `pool` to limit the allocation effort when many sequences are being handled. Once the recycle method has been called on an instance, you must ensure that no other method is called on it.
|
|
|
|
|
|
### Accessing to the elements of a sequence
|
|
|
|
The different elements of an `obiseq.BioSequence` must be accessed using
|
|
a set of methods. For the three main elements provided during the
|
|
creation of a new instance methodes are :
|
|
|
|
- `Id() string`
|
|
- `Sequence() []byte`
|
|
- `Definition() string`
|
|
|
|
It exists pending method to change the value of these elements
|
|
|
|
- `SetId(id string)`
|
|
- `SetSequence(sequence []byte)`
|
|
- `SetDefinition(definition string)`
|
|
|
|
``` go
|
|
import (
|
|
"fmt"
|
|
"git.metabarcoding.org/lecasofts/go/obitools/pkg/obiseq"
|
|
)
|
|
|
|
func main() {
|
|
myseq := obiseq.NewBiosequence(
|
|
"seq_GH0001",
|
|
bytes.FromString("ACGTGTCAGTCG"),
|
|
"A short test sequence",
|
|
)
|
|
|
|
fmt.Println(myseq.Id())
|
|
myseq.SetId("SPE01_0001")
|
|
fmt.Println(myseq.Id())
|
|
}
|
|
```
|
|
|
|
#### Different ways for accessing an editing the sequence
|
|
|
|
If `Sequence()`and `SetSequence(sequence []byte)` methods are the basic
|
|
ones, several other methods exist.
|
|
|
|
- `String() string` return the sequence directly converted to a
|
|
`string` instance.
|
|
- The `Write` method family allows for extending an existing sequence
|
|
following the buffer protocol.
|
|
- `Write(data []byte) (int, error)` allows for appending a byte
|
|
array on 3' end of the sequence.
|
|
- `WriteString(data string) (int, error)` allows for appending a
|
|
`string`.
|
|
- `WriteByte(data byte) error` allows for appending a single
|
|
`byte`.
|
|
|
|
The `Clear` method empties the sequence buffer.
|
|
|
|
``` go
|
|
import (
|
|
"fmt"
|
|
"git.metabarcoding.org/lecasofts/go/obitools/pkg/obiseq"
|
|
)
|
|
|
|
func main() {
|
|
myseq := obiseq.NewEmptyBiosequence()
|
|
|
|
myseq.WriteString("accc")
|
|
myseq.WriteByte(byte('c'))
|
|
fmt.Println(myseq.String())
|
|
}
|
|
```
|
|
|
|
#### Sequence quality scores
|
|
|
|
Sequence quality scores cannot be initialized at the time of instance
|
|
creation. You must use dedicated methods to add quality scores to a
|
|
sequence.
|
|
|
|
To be coherent the length of both the DNA sequence and que quality score
|
|
sequence must be equal. But assessment of this constraint is realized.
|
|
It is of the programmer responsability to check that invariant.
|
|
|
|
While accessing to the quality scores relies on the method
|
|
`Quality() []byte`, setting the quality need to call one of the
|
|
following method. They run similarly to their sequence dedicated
|
|
conterpart.
|
|
|
|
- `SetQualities(qualities Quality)`
|
|
- `WriteQualities(data []byte) (int, error)`
|
|
- `WriteByteQualities(data byte) error`
|
|
|
|
In a way analogous to the `Clear` method, `ClearQualities()` empties the
|
|
sequence of quality scores.
|
|
|
|
### The annotations of a sequence
|
|
|
|
A sequence can be annotated with attributes. Each attribute is associated with a value. An attribute is identified by its name.
|
|
The name of an attribute consists of a character string containing no spaces or blank characters. Values can be of several types.
|
|
|
|
- Scalar types:
|
|
- integer
|
|
- numeric
|
|
- character
|
|
- boolean
|
|
- Container types:
|
|
- vector
|
|
- map
|
|
|
|
Vectors can contain any type of scalar. Maps are compulsorily indexed by strings and can contain any scalar type. It is not possible to have nested container type.
|
|
|
|
Annotations are stored in an object of type `bioseq.Annotation` which is an alias of `map[string]interface{}`. This map can be retrieved using the `Annotations() Annotation` method. If no annotation has been defined for this sequence, the method returns an empty map. It is possible to test an instance of `BioSequence` using its `HasAnnotation() bool` method to see if it has any annotations associated with it.
|
|
|
|
- GetAttribute(key string) (interface{}, bool)
|
|
|
|
## The sequence iterator
|
|
|
|
The pakage *obiter* provides an iterator mecanism for manipulating sequences. The main class provided by this package is `obiiter.IBioSequence`. An `IBioSequence` iterator provides batch of sequences.
|
|
|
|
### Basic usage of a sequence iterator
|
|
|
|
Many functions, among them functions reading sequences from a text file, return a `IBioSequence` iterator. The iterator class provides two main methods:
|
|
|
|
- `Next() bool`
|
|
- `Get() obiiter.BioSequenceBatch`
|
|
|
|
The `Next` method moves the iterator to the next value, while the `Get` method returns the currently pointed value. Using them, it is possible to loop over the data as in the following code chunk.
|
|
|
|
``` go
|
|
import (
|
|
"git.metabarcoding.org/lecasofts/go/obitools/pkg/obiformats"
|
|
)
|
|
|
|
func main() {
|
|
mydata := obiformats.ReadFastSeqFromFile("myfile.fasta")
|
|
|
|
for mydata.Next() {
|
|
data := mydata.Get()
|
|
//
|
|
// Whatever you want to do with the data chunk
|
|
//
|
|
}
|
|
}
|
|
```
|
|
|
|
An `obiseq.BioSequenceBatch` instance is a set of sequences stored in an `obiseq.BioSequenceSlice` and a sequence number. The number of sequences in a batch is not defined. A batch can even contain zero sequences, if for example all sequences initially included in the batch have been filtered out at some stage of their processing.
|
|
|
|
### The `Pipable` functions
|
|
|
|
A function consuming a `obiiter.IBioSequence` and returning a `obiiter.IBioSequence` is of class `obiiter.Pipable`.
|
|
|
|
### The `Teeable` functions
|
|
|
|
A function consuming a `obiiter.IBioSequence` and returning two `obiiter.IBioSequence` instance is of class `obiiter.Teeable`.
|