4  The GO OBITools library

4.1 BioSequence

The BioSequence class is used to represent biological sequences. It stores:

  • the sequence itself as a []byte
  • the sequencing quality scores as a []byte, if needed
  • an identifier as a string
  • a definition as a string
  • a set of (key, value) pairs in a map[string]interface{}

BioSequence is defined in the obiseq package, which is imported as follows:

import (
    "git.metabarcoding.org/lecasofts/go/obitools/pkg/obiseq"
)

4.1.1 Creating new instances

To create a new instance, use one of the following functions:

  • MakeBioSequence(id string, sequence []byte, definition string) obiseq.BioSequence
  • NewBioSequence(id string, sequence []byte, definition string) *obiseq.BioSequence

Both create a BioSequence instance: the first returns the instance itself, while the second returns a pointer to the new instance. Two other functions, MakeEmptyBioSequence and NewEmptyBioSequence, do the same job but provide uninitialized objects. The parameters are as follows:

  • id is the unique identifier of the sequence. It must be a single word (a string containing no spaces).
  • sequence is the DNA sequence itself, provided as a byte slice ([]byte).
  • definition is a string, potentially empty, but usually containing a sentence describing the sequence.
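
package main

import (
    "fmt"

    "git.metabarcoding.org/lecasofts/go/obitools/pkg/obiseq"
)

func main() {
    // Build a new sequence from its identifier, nucleotides and definition.
    myseq := obiseq.NewBioSequence(
        "seq_GH0001",
        []byte("ACGTGTCAGTCG"),
        "A short test sequence",
    )

    fmt.Println(myseq.Id())
}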

When the sequence is formatted as FASTA, these parameters map onto the record as follows:

>id definition containing potentially several words
sequence
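
The layout above can be sketched in a few lines of Go. The function formatFasta below is a hypothetical helper written for this illustration; it is not part of obiseq, and it writes the sequence on a single line without wrapping.

```go
package main

import (
	"fmt"
	"strings"
)

// formatFasta is a hypothetical helper (not an obiseq function) that lays
// out an identifier, a definition and a sequence as one FASTA record.
func formatFasta(id, definition string, sequence []byte) string {
	var b strings.Builder
	b.WriteString(">")
	b.WriteString(id)
	if definition != "" {
		// The definition follows the identifier, separated by one space.
		b.WriteString(" ")
		b.WriteString(definition)
	}
	b.WriteString("\n")
	b.Write(sequence)
	b.WriteString("\n")
	return b.String()
}

func main() {
	fmt.Print(formatFasta(
		"seq_GH0001",
		"A short test sequence",
		[]byte("ACGTGTCAGTCG"),
	))
}
```

Because the identifier ends at the first space of the header line, an identifier containing a space would be read back as a shorter identifier followed by a longer definition.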

4.1.2 End of life of a BioSequence instance

When an instance of BioSequence is no longer in use, it is normally reclaimed by the Go garbage collector. If you know that an instance will never be used again, you can optionally call its Recycle method, which stores the allocated memory in a pool and so reduces the allocation effort when many sequences are being handled. Once Recycle has been called on an instance, you must ensure that no other method is called on it.
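
The general idea of recycling memory through a pool can be sketched with the standard sync.Pool type. This is only an illustration of the technique under that assumption, not the obiseq implementation; the names bufferPool, getBuffer and recycleBuffer are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// bufferPool keeps released byte slices for later reuse.
// Illustrative stand-in only: this is not the obiseq code.
var bufferPool = sync.Pool{
	New: func() interface{} {
		b := make([]byte, 0, 64)
		return &b
	},
}

// getBuffer returns an empty buffer, reusing a pooled one when available.
func getBuffer() *[]byte {
	b := bufferPool.Get().(*[]byte)
	*b = (*b)[:0] // keep the capacity, drop the previous content
	return b
}

// recycleBuffer hands the buffer back to the pool.
// The caller must not touch the buffer afterwards.
func recycleBuffer(b *[]byte) {
	bufferPool.Put(b)
}

func main() {
	buf := getBuffer()
	*buf = append(*buf, "ACGTGTCAGTCG"...)
	fmt.Println(string(*buf))

	recycleBuffer(buf)
	buf = nil // drop the reference so it cannot be used by mistake

	next := getBuffer() // may or may not reuse the same memory
	fmt.Println(len(*next))
}
```

The sketch also shows why a recycled object must not be used again: the pool may hand the same memory to another caller, who would then see it change unexpectedly.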

4.1.3 Accessing the elements of a sequence

The elements of an obiseq.BioSequence must be accessed through a set of methods. For the three main elements provided when a new instance is created, these methods are:

  • Id() string
  • Sequence() []byte
  • Definition() string

Corresponding methods change the value of these elements:

  • SetId(id string)
  • SetSequence(sequence []byte)
  • SetDefinition(definition string)

package main

import (
    "fmt"

    "git.metabarcoding.org/lecasofts/go/obitools/pkg/obiseq"
)

func main() {
    myseq := obiseq.NewBioSequence(
        "seq_GH0001",
        []byte("ACGTGTCAGTCG"),
        "A short test sequence",
    )

    fmt.Println(myseq.Id()) // identifier given at creation
    myseq.SetId("SPE01_0001")
    fmt.Println(myseq.Id()) // identifier after SetId
}

4.1.3.1 Different ways of accessing and editing the sequence

While Sequence() and SetSequence(sequence []byte) are the basic methods, several others exist.

  • String() string returns the sequence directly converted to a string.
  • The Write method family extends an existing sequence, following the same conventions as the standard bytes.Buffer type.
    • Write(data []byte) (int, error) appends a byte slice to the 3’ end of the sequence.
    • WriteString(data string) (int, error) appends a string.
    • WriteByte(data byte) error appends a single byte.

The Clear method empties the sequence buffer.

package main

import (
    "fmt"

    "git.metabarcoding.org/lecasofts/go/obitools/pkg/obiseq"
)

func main() {
    // Start from an empty sequence and extend it at its 3' end.
    myseq := obiseq.NewEmptyBioSequence()

    myseq.WriteString("accc")
    myseq.WriteByte('c')
    fmt.Println(myseq.String())
}
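
The signatures of Write, WriteString and WriteByte listed above are the same as those of the standard io.Writer, io.StringWriter and io.ByteWriter interfaces. The sketch below uses a hypothetical stand-in type, seqBuffer, rather than the real BioSequence, to show what that buys: any type with these methods can be handed to standard functions such as fmt.Fprintf. Whether BioSequence is actually used this way in the library is not stated here.

```go
package main

import (
	"fmt"
	"io"
)

// seqBuffer is a hypothetical stand-in for the sequence part of a
// BioSequence; it is not the obiseq type.
type seqBuffer struct {
	data []byte
}

// Write appends a byte slice at the 3' end of the sequence.
func (s *seqBuffer) Write(data []byte) (int, error) {
	s.data = append(s.data, data...)
	return len(data), nil
}

// WriteString appends a string.
func (s *seqBuffer) WriteString(data string) (int, error) {
	s.data = append(s.data, data...)
	return len(data), nil
}

// WriteByte appends a single byte.
func (s *seqBuffer) WriteByte(data byte) error {
	s.data = append(s.data, data)
	return nil
}

// Clear empties the buffer while keeping its allocated memory.
func (s *seqBuffer) Clear() { s.data = s.data[:0] }

// String returns the sequence as a string.
func (s *seqBuffer) String() string { return string(s.data) }

// Compile-time checks: the method set matches the standard interfaces.
var (
	_ io.Writer       = (*seqBuffer)(nil)
	_ io.StringWriter = (*seqBuffer)(nil)
	_ io.ByteWriter   = (*seqBuffer)(nil)
)

func main() {
	s := &seqBuffer{}
	s.WriteString("accc")
	s.WriteByte('c')
	fmt.Fprintf(s, "%s", "gt") // accepted because s is an io.Writer
	fmt.Println(s.String())    // prints "accccgt"

	s.Clear()
	fmt.Println(len(s.String())) // prints 0
}
```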

4.1.3.2 Sequence quality scores

Sequence quality scores cannot be initialized at the time of instance creation. You must use dedicated methods to add quality scores to a sequence.

To be coherent, the DNA sequence and the quality score sequence must have the same length. However, this constraint is not checked by the library. It is the programmer's responsibility to maintain this invariant.

Quality scores are read with the Quality() []byte method. Setting them requires one of the following methods, which behave like their sequence counterparts.

  • SetQualities(qualities Quality)
  • WriteQualities(data []byte) (int, error)
  • WriteByteQualities(data byte) error

By analogy with the Clear method, ClearQualities() empties the quality scores of the sequence.
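
Since the length invariant is left to the programmer, a small check can be run before a sequence is passed on. The sketch below is one possible way to do it. The helper checkQualityLength and the error value errLengthMismatch are hypothetical names, and treating an empty quality slice as "no scores attached" is an assumption of this example.

```go
package main

import (
	"errors"
	"fmt"
)

// errLengthMismatch is a hypothetical error value for this sketch.
var errLengthMismatch = errors.New("sequence and quality lengths differ")

// checkQualityLength is a hypothetical helper (not an obiseq function).
// It returns nil when no quality scores are attached or when there is
// exactly one score per base, and an error otherwise.
func checkQualityLength(sequence, qualities []byte) error {
	if len(qualities) == 0 {
		return nil // assumption: an empty slice means no quality scores
	}
	if len(sequence) != len(qualities) {
		return fmt.Errorf("%w: %d bases, %d scores",
			errLengthMismatch, len(sequence), len(qualities))
	}
	return nil
}

func main() {
	seq := []byte("ACGT")
	fmt.Println(checkQualityLength(seq, []byte{30, 30, 28, 25}))
	fmt.Println(checkQualityLength(seq, []byte{30, 30}))
}
```

With a real instance, the two arguments would come from the Sequence() and Quality() methods described above.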

4.1.4 The annotations of a sequence

A sequence can be annotated with attributes. Each attribute is associated with a value. An attribute is identified by its name. The name of an attribute consists of a character string containing no spaces or blank characters. Values can be of several types.

  • Scalar types:
    • integer
    • numeric
    • character
    • boolean
  • Container types:
    • vector
    • map

Vectors can contain any type of scalar. Maps must be indexed by strings and can contain any scalar type. Container types cannot be nested.

Annotations are stored in an object of type obiseq.Annotation, which is an alias of map[string]interface{}. This map is retrieved with the Annotations() Annotation method. If no annotation has been defined for the sequence, the method returns an empty map. The HasAnnotation() bool method tests whether a BioSequence instance has any annotations associated with it.

A single attribute is retrieved by name with:

  • GetAttribute(key string) (interface{}, bool)
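
Because the values are stored as interface{}, reading an attribute usually involves a presence test followed by a type switch. The sketch below works on a local Annotation type defined to mirror the alias described above; it does not import obiseq. The function describe, the attribute names, and the mapping of the value kinds onto the Go types int, float64, string and bool are all assumptions made for the illustration.

```go
package main

import "fmt"

// Annotation is a local stand-in mirroring the alias described in the
// text; it is not the obiseq type.
type Annotation map[string]interface{}

// describe is a hypothetical helper reporting what is stored under key.
// The Go types chosen for each kind of value are assumptions.
func describe(a Annotation, key string) string {
	value, ok := a[key]
	if !ok {
		return "absent"
	}
	switch v := value.(type) {
	case int:
		return fmt.Sprintf("integer %d", v)
	case float64:
		return fmt.Sprintf("numeric %g", v)
	case string:
		return fmt.Sprintf("character %q", v)
	case bool:
		return fmt.Sprintf("boolean %t", v)
	case []interface{}:
		return fmt.Sprintf("vector of %d values", len(v))
	case map[string]interface{}:
		return fmt.Sprintf("map of %d entries", len(v))
	default:
		return "unsupported"
	}
}

func main() {
	// Attribute names and values are invented for the example.
	a := Annotation{
		"count":  12,
		"taxon":  "Homo sapiens",
		"merged": map[string]interface{}{"sample_a": 7, "sample_b": 5},
	}
	for _, key := range []string{"count", "taxon", "merged", "missing"} {
		fmt.Println(key, "->", describe(a, key))
	}
}
```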

4.2 The sequence iterator

The package obiiter provides an iterator mechanism for manipulating sequences. The main class provided by this package is obiiter.IBioSequence. An IBioSequence iterator delivers sequences in batches.

4.2.1 Basic usage of a sequence iterator

Many functions, among them the functions reading sequences from a text file, return an IBioSequence iterator. The iterator class provides two main methods:

  • Next() bool
  • Get() obiiter.BioSequenceBatch

The Next method moves the iterator to the next value, while the Get method returns the value the iterator currently points to. Using them, it is possible to loop over the data as in the following code chunk.

package main

import (
    "git.metabarcoding.org/lecasofts/go/obitools/pkg/obiformats"
)

func main() {
    mydata := obiformats.ReadFastSeqFromFile("myfile.fasta")

    for mydata.Next() {
        data := mydata.Get()
        //
        // Whatever you want to do with the data chunk
        //
        _ = data // placeholder so that the unused variable compiles
    }
}

An obiiter.BioSequenceBatch instance is a set of sequences stored in an obiseq.BioSequenceSlice, together with a sequence number. The number of sequences in a batch is not fixed. A batch can even contain zero sequences if, for example, all the sequences it initially held have been filtered out at some stage of processing.
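
The Next/Get protocol and the handling of empty batches can be tried out without the library. The sketch below uses hypothetical stand-in types (batch, batchIterator) that only imitate the two methods described above. Sequences are plain strings here, and the field names are invented.

```go
package main

import "fmt"

// batch is a hypothetical stand-in for a batch of sequences:
// a number plus a slice of sequences (plain strings in this sketch).
type batch struct {
	number    int
	sequences []string
}

// batchIterator is a hypothetical stand-in for a sequence iterator.
type batchIterator struct {
	batches []batch
	pos     int // index of the current batch, -1 before the first Next
}

func newBatchIterator(batches []batch) *batchIterator {
	return &batchIterator{batches: batches, pos: -1}
}

// Next moves to the next batch and reports whether one is available.
func (it *batchIterator) Next() bool {
	if it.pos+1 >= len(it.batches) {
		return false
	}
	it.pos++
	return true
}

// Get returns the batch the iterator currently points to.
// It must only be called after Next has returned true.
func (it *batchIterator) Get() batch {
	return it.batches[it.pos]
}

// countSequences consumes the iterator and sums the batch sizes.
// Empty batches need no special case: they simply add zero.
func countSequences(it *batchIterator) int {
	total := 0
	for it.Next() {
		total += len(it.Get().sequences)
	}
	return total
}

func main() {
	it := newBatchIterator([]batch{
		{number: 0, sequences: []string{"ACGT", "TTGA"}},
		{number: 1, sequences: nil}, // every sequence was filtered out
		{number: 2, sequences: []string{"GGCA"}},
	})
	fmt.Println(countSequences(it)) // prints 3
}
```

Code that loops over batches should therefore never assume that a batch holds at least one sequence.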

4.2.2 The Pipable functions

A function consuming an obiiter.IBioSequence and returning an obiiter.IBioSequence is of class obiiter.Pipable.

4.2.3 The Teeable functions

A function consuming an obiiter.IBioSequence and returning two obiiter.IBioSequence instances is of class obiiter.Teeable.
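
The two function shapes, one iterator in and one out, or one iterator in and two out, can be illustrated with stand-in types. In the sketch below the iterator is replaced by an eager slice of batches, chosen for brevity. The types pipable and teeable and the helpers minLength, splitAt, pipe and count are all hypothetical and only mimic the shape of the signatures described above.

```go
package main

import "fmt"

// batches is a hypothetical, eager stand-in for a sequence iterator:
// a list of batches, each batch being a list of sequences.
type batches [][]string

// pipable mimics "one iterator in, one iterator out".
type pipable func(batches) batches

// teeable mimics "one iterator in, two iterators out".
type teeable func(batches) (batches, batches)

// minLength returns a pipable keeping sequences of at least n bases.
func minLength(n int) pipable {
	return func(in batches) batches {
		out := make(batches, 0, len(in))
		for _, b := range in {
			var kept []string
			for _, s := range b {
				if len(s) >= n {
					kept = append(kept, s)
				}
			}
			out = append(out, kept) // a batch may end up empty
		}
		return out
	}
}

// splitAt returns a teeable sending sequences of at least n bases to
// the first output and all the others to the second.
func splitAt(n int) teeable {
	return func(in batches) (batches, batches) {
		long := make(batches, 0, len(in))
		short := make(batches, 0, len(in))
		for _, b := range in {
			var l, s []string
			for _, seq := range b {
				if len(seq) >= n {
					l = append(l, seq)
				} else {
					s = append(s, seq)
				}
			}
			long = append(long, l)
			short = append(short, s)
		}
		return long, short
	}
}

// pipe chains several pipable functions, applied left to right.
func pipe(steps ...pipable) pipable {
	return func(in batches) batches {
		for _, step := range steps {
			in = step(in)
		}
		return in
	}
}

// count returns the total number of sequences over all batches.
func count(b batches) int {
	total := 0
	for _, seqs := range b {
		total += len(seqs)
	}
	return total
}

func main() {
	data := batches{{"ACGT", "AC"}, {"A"}, {"ACGTAC"}}

	kept := pipe(minLength(2), minLength(4))(data)
	fmt.Println(count(kept))

	long, short := splitAt(4)(data)
	fmt.Println(count(long), count(short))
}
```

Since a pipable takes and returns the same type, any number of them can be chained into a new pipable, which is what the pipe helper does. A teeable ends such a chain by splitting it into two independent streams.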