4 The GO OBITools library
4.1 BioSequence
The BioSequence type is used to represent biological sequences. It allows for storing:
- the sequence itself as a []byte
- the sequencing quality scores as a []byte, if needed
- an identifier as a string
- a definition as a string
- a set of (key, value) pairs in a map[string]interface{}
BioSequence is defined in the obiseq package, which is imported as follows:
import (
"git.metabarcoding.org/lecasofts/go/obitools/pkg/obiseq"
)
4.1.1 Creating new instances
To create a new instance, use one of the following functions:
MakeBioSequence(id string, sequence []byte, definition string) obiseq.BioSequence
NewBioSequence(id string, sequence []byte, definition string) *obiseq.BioSequence
Both create a BioSequence instance, but while the first returns the instance itself, the second returns a pointer to the new instance. Two other functions, MakeEmptyBioSequence and NewEmptyBioSequence, do the same job but provide an uninitialized object.
- id is the unique identifier of the sequence. It must be a string consisting of a single word (containing no spaces).
- sequence is the DNA sequence itself, provided as a byte array ([]byte).
- definition is a string, potentially empty, but usually containing a sentence describing the sequence.
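package main

import (
	"git.metabarcoding.org/lecasofts/go/obitools/pkg/obiseq"
)

func main() {
	// Build a sequence from its identifier, its bases and its definition.
	myseq := obiseq.NewBioSequence(
		"seq_GH0001",
		[]byte("ACGTGTCAGTCG"),
		"A short test sequence",
	)
	_ = myseq // placeholder use so that the example compiles
}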
When formatted as FASTA, the parameters correspond to the following layout:
>id definition containing potentially several words
sequence
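The layout above can be sketched with a small formatting function. This is only an illustration: formatFasta is a hypothetical helper written for this example, not a function of the obiseq package, and it writes the whole sequence on a single line.

```go
package main

import (
	"fmt"
	"strings"
)

// formatFasta is a hypothetical helper (not part of obiseq) that lays out
// an identifier, a definition and a sequence as a single FASTA record.
func formatFasta(id, definition string, sequence []byte) string {
	var b strings.Builder
	b.WriteString(">")
	b.WriteString(id)
	if definition != "" {
		// The definition is separated from the identifier by one space.
		b.WriteString(" ")
		b.WriteString(definition)
	}
	b.WriteString("\n")
	b.Write(sequence)
	b.WriteString("\n")
	return b.String()
}

func main() {
	fmt.Print(formatFasta("seq_GH0001", "A short test sequence", []byte("ACGTGTCAGTCG")))
}
```

Because the identifier ends at the first space of the header line, an identifier containing a space could not be read back unambiguously, which is why id must be a single word.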
4.1.2 End of life of a BioSequence instance
When an instance of BioSequence is no longer in use, it is normally reclaimed by the Go garbage collector. If you know that an instance will never be used again, you can optionally call its Recycle method to store the allocated memory elements in a pool, which limits the allocation effort when many sequences are being handled. Once the Recycle method has been called on an instance, you must ensure that no other method is called on it.
4.1.3 Accessing the elements of a sequence
The different elements of an obiseq.BioSequence must be accessed through a set of methods. For the three main elements provided when creating a new instance, the methods are:
Id() string
Sequence() []byte
Definition() string
Corresponding methods exist to change the value of these elements:
SetId(id string)
SetSequence(sequence []byte)
SetDefinition(definition string)
package main

import (
	"fmt"

	"git.metabarcoding.org/lecasofts/go/obitools/pkg/obiseq"
)

func main() {
	myseq := obiseq.NewBioSequence(
		"seq_GH0001",
		[]byte("ACGTGTCAGTCG"),
		"A short test sequence",
	)

	fmt.Println(myseq.Id())    // identifier given at creation
	myseq.SetId("SPE01_0001")  // replace the identifier
	fmt.Println(myseq.Id())    // identifier after the change
}
4.1.3.1 Different ways of accessing and editing the sequence
While Sequence() and SetSequence(sequence []byte) are the basic methods, several others exist.
- String() string returns the sequence directly converted to a string.
- The Write method family allows for extending an existing sequence following the buffer protocol.
  - Write(data []byte) (int, error) appends a byte array to the 3’ end of the sequence.
  - WriteString(data string) (int, error) appends a string.
  - WriteByte(data byte) error appends a single byte.
- The Clear method empties the sequence buffer.
package main

import (
	"fmt"

	"git.metabarcoding.org/lecasofts/go/obitools/pkg/obiseq"
)

func main() {
	myseq := obiseq.NewEmptyBioSequence()

	myseq.WriteString("accc")   // append a string to the 3' end
	myseq.WriteByte(byte('c'))  // append a single byte
	fmt.Println(myseq.String())
}
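The "buffer protocol" mentioned above matches the standard io.Writer, io.StringWriter and io.ByteWriter interfaces. The sketch below uses a hypothetical stand-in type, seqBuffer, to show how such a method family behaves; it is not the obiseq implementation.

```go
package main

import (
	"fmt"
	"io"
)

// seqBuffer is a hypothetical stand-in for the sequence part of a BioSequence.
type seqBuffer struct {
	data []byte
}

// Write appends a byte slice to the 3' end of the sequence.
func (s *seqBuffer) Write(data []byte) (int, error) {
	s.data = append(s.data, data...)
	return len(data), nil
}

// WriteString appends a string to the 3' end of the sequence.
func (s *seqBuffer) WriteString(data string) (int, error) {
	s.data = append(s.data, data...)
	return len(data), nil
}

// WriteByte appends a single byte to the 3' end of the sequence.
func (s *seqBuffer) WriteByte(data byte) error {
	s.data = append(s.data, data)
	return nil
}

// Clear empties the buffer while keeping its allocated capacity.
func (s *seqBuffer) Clear() { s.data = s.data[:0] }

// String returns the sequence converted to a string.
func (s *seqBuffer) String() string { return string(s.data) }

// Compile-time checks that the stand-in satisfies the standard interfaces.
var (
	_ io.Writer       = (*seqBuffer)(nil)
	_ io.StringWriter = (*seqBuffer)(nil)
	_ io.ByteWriter   = (*seqBuffer)(nil)
)

func main() {
	s := &seqBuffer{}
	fmt.Fprint(s, "acgt") // any function accepting an io.Writer can extend the sequence
	s.WriteString("accc")
	s.WriteByte('c')
	fmt.Println(s.String()) // acgtacccc
}
```

Following these interfaces means that generic code written against io.Writer, such as fmt.Fprint, can append to a sequence without knowing its concrete type.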
4.1.3.2 Sequence quality scores
Sequence quality scores cannot be initialized at the time of instance creation. You must use dedicated methods to add quality scores to a sequence.
To be coherent, the DNA sequence and the quality score sequence must have the same length, but no check of this constraint is performed. It is the programmer's responsibility to maintain that invariant.
Quality scores are read with the Quality() []byte method; setting them requires calling one of the following methods, which behave like their sequence counterparts.
SetQualities(qualities Quality)
WriteQualities(data []byte) (int, error)
WriteByteQualities(data byte) error
In a way analogous to the Clear method, ClearQualities() empties the sequence of quality scores.
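Since the library leaves the length invariant to the caller, a small guard can be run before a sequence is passed on. The following is a sketch only: checkQualityLength and errLengthMismatch are hypothetical names, and treating an empty quality slice as "no scores attached" is an assumption made for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// errLengthMismatch is a hypothetical sentinel error for this sketch.
var errLengthMismatch = errors.New("sequence and quality lengths differ")

// checkQualityLength is a hypothetical helper that verifies the invariant:
// when quality scores are present, there must be exactly one per base.
func checkQualityLength(sequence, qualities []byte) error {
	if len(qualities) == 0 {
		return nil // assumption: an empty slice means no quality scores
	}
	if len(sequence) != len(qualities) {
		return fmt.Errorf("%w: %d bases, %d scores",
			errLengthMismatch, len(sequence), len(qualities))
	}
	return nil
}

func main() {
	seq := []byte("ACGT")
	fmt.Println(checkQualityLength(seq, []byte{30, 30, 32, 35}))
	fmt.Println(checkQualityLength(seq, []byte{30, 30}))
}
```

With the real type, the two arguments would presumably come from the Sequence() and Quality() accessors listed above.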
4.1.4 The annotations of a sequence
A sequence can be annotated with attributes. Each attribute is associated with a value. An attribute is identified by its name. The name of an attribute consists of a character string containing no spaces or blank characters. Values can be of several types.
- Scalar types:
- integer
- numeric
- character
- boolean
- Container types:
- vector
- map
Vectors can contain any type of scalar. Maps are always indexed by strings and can contain any scalar type. Container types cannot be nested.
Annotations are stored in an object of type obiseq.Annotation, which is an alias of map[string]interface{}. This map can be retrieved using the Annotations() Annotation method. If no annotation has been defined for this sequence, the method returns an empty map. An instance of BioSequence can be tested with its HasAnnotation() bool method to see whether it has any annotations associated with it.
- GetAttribute(key string) (interface{}, bool)
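Because annotation values are stored as interface{}, reading one back usually involves a lookup followed by a type switch. The sketch below works on a plain map with the same layout. The names annotation, getAttribute and describe are hypothetical, and the mapping of "integer" to int and "numeric" to float64 is an assumption made for this example.

```go
package main

import "fmt"

// annotation mirrors the map[string]interface{} layout described above.
type annotation map[string]interface{}

// getAttribute is a hypothetical lookup returning the value and whether
// the key was present, in the same shape as GetAttribute.
func getAttribute(a annotation, key string) (interface{}, bool) {
	v, ok := a[key]
	return v, ok
}

// describe reports the kind of value stored under an attribute.
func describe(v interface{}) string {
	switch x := v.(type) {
	case int:
		return fmt.Sprintf("integer %d", x)
	case float64:
		return fmt.Sprintf("numeric %g", x)
	case string:
		return fmt.Sprintf("character %q", x)
	case bool:
		return fmt.Sprintf("boolean %t", x)
	case []interface{}:
		return fmt.Sprintf("vector of %d values", len(x))
	case map[string]interface{}:
		return fmt.Sprintf("map of %d entries", len(x))
	default:
		return "unsupported type"
	}
}

func main() {
	a := annotation{
		"count":  12,
		"sample": "SPE01",
		"merged": map[string]interface{}{"s1": 3, "s2": 9},
	}
	if v, ok := getAttribute(a, "count"); ok {
		fmt.Println(describe(v))
	}
	_, ok := getAttribute(a, "missing")
	fmt.Println(ok)
}
```

The second return value distinguishes a missing attribute from one whose stored value happens to be a zero value.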
4.2 The sequence iterator
The package obiiter provides an iterator mechanism for manipulating sequences. The main type provided by this package is obiiter.IBioSequence. An IBioSequence iterator provides batches of sequences.
4.2.1 Basic usage of a sequence iterator
Many functions, among them the functions reading sequences from a text file, return an IBioSequence iterator. The iterator type provides two main methods:
Next() bool
Get() obiiter.BioSequenceBatch
The Next method moves the iterator to the next value, while the Get method returns the value currently pointed to. Using them, it is possible to loop over the data as in the following code chunk.
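package main

import (
	"git.metabarcoding.org/lecasofts/go/obitools/pkg/obiformats"
)

func main() {
	mydata := obiformats.ReadFastSeqFromFile("myfile.fasta")

	// Next advances to the next batch; Get returns it.
	for mydata.Next() {
		data := mydata.Get()
		//
		// Whatever you want to do with the data chunk
		//
		_ = data // placeholder use so that the example compiles
	}
}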
An obiiter.BioSequenceBatch instance is a set of sequences stored in an obiseq.BioSequenceSlice, together with a sequence number. The number of sequences in a batch is not fixed. A batch can even contain zero sequences, for example if all the sequences initially included in the batch have been filtered out at some stage of their processing.
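The Next/Get protocol and the possibility of empty batches can be sketched with stand-in types. Here batch and batchIterator are hypothetical simplifications (sequences are plain strings and batches are held in memory); they only mimic the shape of the API described above.

```go
package main

import "fmt"

// batch is a hypothetical stand-in for a sequence batch:
// a number plus a slice of sequences (plain strings here).
type batch struct {
	number int
	seqs   []string
}

// batchIterator is a hypothetical stand-in for obiiter.IBioSequence.
type batchIterator struct {
	batches []batch
	pos     int   // index of the next batch to serve
	current batch // batch selected by the last call to Next
}

// Next advances to the next batch and reports whether one was available.
func (it *batchIterator) Next() bool {
	if it.pos >= len(it.batches) {
		return false
	}
	it.current = it.batches[it.pos]
	it.pos++
	return true
}

// Get returns the batch selected by the last successful call to Next.
func (it *batchIterator) Get() batch { return it.current }

// countSequences drains the iterator and sums the batch sizes.
// An empty batch simply contributes zero.
func countSequences(it *batchIterator) int {
	total := 0
	for it.Next() {
		total += len(it.Get().seqs)
	}
	return total
}

func main() {
	it := &batchIterator{batches: []batch{
		{number: 0, seqs: []string{"ACGT", "TTGA"}},
		{number: 1, seqs: nil}, // every sequence was filtered out
		{number: 2, seqs: []string{"GGCA"}},
	}}
	fmt.Println(countSequences(it)) // 3
}
```

Code consuming batches should therefore never assume that a batch holds at least one sequence.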
4.2.2 The Pipable functions
A function consuming an obiiter.IBioSequence and returning an obiiter.IBioSequence is of type obiiter.Pipable.
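A function type with this shape makes processing steps composable. The sketch below illustrates the idea with hypothetical stand-ins: stream replaces the iterator with a channel of strings, and pipable, mapStream and pipe are names invented for this example, not the obiiter API.

```go
package main

import (
	"fmt"
	"strings"
)

// stream is a hypothetical stand-in for an iterator: a channel of sequences.
type stream <-chan string

// pipable is a stand-in for a function consuming one stream and returning one.
type pipable func(stream) stream

// source emits the given sequences on a new stream, then closes it.
func source(seqs ...string) stream {
	out := make(chan string)
	go func() {
		defer close(out)
		for _, s := range seqs {
			out <- s
		}
	}()
	return out
}

// mapStream builds a pipable that applies f to every sequence.
func mapStream(f func(string) string) pipable {
	return func(in stream) stream {
		out := make(chan string)
		go func() {
			defer close(out)
			for s := range in {
				out <- f(s)
			}
		}()
		return out
	}
}

// pipe chains several pipables, left to right, into a single one.
func pipe(steps ...pipable) pipable {
	return func(in stream) stream {
		for _, step := range steps {
			in = step(in)
		}
		return in
	}
}

func main() {
	process := pipe(
		mapStream(strings.ToUpper),
		mapStream(func(s string) string { return s + "N" }),
	)
	for s := range process(source("acgt", "ttga")) {
		fmt.Println(s)
	}
}
```

Since the result of pipe is itself a pipable, chains of any length can be built and reused as a single step.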
4.2.3 The Teeable functions
A function consuming an obiiter.IBioSequence and returning two obiiter.IBioSequence instances is of type obiiter.Teeable.
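One possible use of such a function is to route each sequence to one of two outputs. The sketch below is an illustration under stated assumptions: stream, teeable, splitBy and drainBoth are hypothetical names, sequences are plain strings, and splitting by a predicate is only one behaviour a function of this shape could have (duplicating the input to both outputs would be another).

```go
package main

import "fmt"

// stream is a hypothetical stand-in for an iterator: a channel of sequences.
type stream <-chan string

// teeable is a stand-in for a function consuming one stream and returning two.
type teeable func(stream) (stream, stream)

// source emits the given sequences on a new stream, then closes it.
func source(seqs ...string) stream {
	out := make(chan string)
	go func() {
		defer close(out)
		for _, s := range seqs {
			out <- s
		}
	}()
	return out
}

// splitBy builds a teeable sending sequences accepted by keep to the
// first output and all the others to the second output.
func splitBy(keep func(string) bool) teeable {
	return func(in stream) (stream, stream) {
		yes := make(chan string)
		no := make(chan string)
		go func() {
			defer close(yes)
			defer close(no)
			for s := range in {
				if keep(s) {
					yes <- s
				} else {
					no <- s
				}
			}
		}()
		return yes, no
	}
}

// drainBoth reads both streams concurrently until both are closed.
// Reading them one after the other could block the producer forever.
func drainBoth(a, b stream) (first, second []string) {
	for a != nil || b != nil {
		select {
		case s, ok := <-a:
			if !ok {
				a = nil // a closed stream is ignored from now on
				continue
			}
			first = append(first, s)
		case s, ok := <-b:
			if !ok {
				b = nil
				continue
			}
			second = append(second, s)
		}
	}
	return first, second
}

func main() {
	atLeastFour := func(s string) bool { return len(s) >= 4 }
	long, short := splitBy(atLeastFour)(source("ACGT", "AC", "TTGAC"))
	l, s := drainBoth(long, short)
	fmt.Println(l, s)
}
```

With unbuffered outputs, both streams have to be consumed together, which is why the example drains them in a single select loop.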