Add documentation for the new options and an option to manage the ecotag cache size
2015-07-03 10:39:59 +02:00
parent aa064dda57
commit 2af94b9da7
3 changed files with 76 additions and 28 deletions

@@ -3,6 +3,24 @@ Options to specify input format
 .. program:: obitools
+Restrict the analysis to a sub-part of the input file
+.....................................................
+.. cmdoption:: --skip <N>
+The first N sequence records of the file are discarded from the analysis and
+are not reported to the output file.
+.. cmdoption:: --only <N>
+Only the next N sequence records of the file are analyzed. The following sequences
+in the file are neither analyzed nor reported to the output file.
+This option can be used together with the `--skip` option.
 Sequence annotated format
 .........................
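
The combined effect of `--skip` and `--only` described in this hunk can be illustrated with a minimal Python sketch. The commit does not show how obitools applies these options internally, so the helper name `restrict_records` and the use of `itertools.islice` are assumptions made for illustration only.

```python
from itertools import islice


def restrict_records(records, skip=0, only=None):
    """Hypothetical helper: drop the first `skip` records, then keep at most `only`.

    This mirrors the documented behaviour of --skip/--only; it is not the
    obitools implementation.
    """
    stop = None if only is None else skip + only
    return islice(records, skip, stop)


# Ten placeholder record identifiers standing in for sequence records.
records = ["seq%d" % i for i in range(1, 11)]
selected = list(restrict_records(records, skip=3, only=4))
print(selected)
```

With `skip=3` and `only=4`, records 4 through 7 are kept and everything else is neither analyzed nor written out, which matches the wording of the two option descriptions.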

@@ -1,55 +1,70 @@
 .. automodule:: ecotag
 :py:mod:`ecotag` specific options
 ---------------------------------
 .. cmdoption:: -R <FILENAME>, --ref-database=<FILENAME>
 <FILENAME> is the fasta file containing the reference sequences
 .. cmdoption:: -m FLOAT, --minimum-identity=FLOAT
-When sequence identity is less than FLOAT, the taxonomic
-assignment for the sequence record is not indicated in ``ecotag``'s
-output. FLOAT is included in a [0,1] interval.
-(This option doesn't seem to work).
+When the best match with the reference database presents an identity
+level below FLOAT, the taxonomic assignment for the sequence record
+is not computed. The sequence record is nevertheless included in the
+output file. FLOAT is included in a [0,1] interval.
+.. cmdoption:: --minimum-circle=FLOAT
+Minimum identity considered for the assignment circle.
+FLOAT is included in a [0,1] interval.
 .. cmdoption:: -x RANK, --explain=RANK
 .. cmdoption:: -u, --uniq
 When this option is specified, the program first dereplicates the sequence
 records to work on unique sequences only. This option greatly improves
 the program's speed, especially for highly redundant datasets.
 .. cmdoption:: --sort=<KEY>
 The output is sorted based on the values of the relevant attribute.
 .. cmdoption:: -r, --reverse
 The output is sorted in reverse order (should be used with the --sort option).
 (Works even if the --sort option is not set, but could not find on what
 the output is sorted).
 .. cmdoption:: -E FLOAT, --errors=FLOAT
 FLOAT is the fraction of reference sequences that will
 be ignored when looking for the most recent common ancestor. This
 option is useful when a non-negligible proportion of reference sequences
 is expected to be assigned to the wrong taxon, for example because of
 taxonomic misidentification. FLOAT is included in a [0,1] interval.
+.. cmdoption:: --cache-size=INTEGER
+A cache of computed similarities is maintained by `ecotag`. The default
+size of this cache is 1,000,000 scores. This option changes
+the cache size.
 .. include:: ../optionsSet/taxonomyDB.txt
+.. include:: ../optionsSet/inputformat.txt
+.. include:: ../optionsSet/outputformat.txt
 .. include:: ../optionsSet/defaultoptions.txt
 :py:mod:`ecotag` added sequence attributes
 ------------------------------------------
 .. hlist::
 :columns: 3
 - :doc:`best_identity <../attributes/best_identity>`
 - :doc:`best_match <../attributes/best_match>`
 - :doc:`family <../attributes/family>`
@@ -65,4 +80,3 @@
 - :doc:`species_list <../attributes/species_list>`
 - :doc:`species_name <../attributes/species_name>`
 - :doc:`taxid <../attributes/taxid>`

@@ -148,6 +148,13 @@ def addSearchOptions(optionManager):
 default=0.0,
 help='Tolerated rate of wrong assignation')
+optionManager.add_option('--cache-size',
+action='store',dest='cache',
+type='int',
+metavar='<SIZE>',
+default=1000000,
+help='Cache size for the alignment score')
 def count(data):
 rep = {}
@@ -203,6 +210,7 @@ def cachedLenLCS(s1,s2,minid,normalized,reference):
 global __LCSCache__
 global __INCache__
 global __OUTCache__
+global __CACHE_SIZE__
 pair=frozenset((s1.id,s2.id))
@@ -217,7 +225,7 @@ def cachedLenLCS(s1,s2,minid,normalized,reference):
 __LCSCache__[pair]=rep
-if len(__LCSCache__) > 1000000:
+if len(__LCSCache__) > __CACHE_SIZE__:
 __LCSCache__.popitem(0)
 return rep
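
The eviction pattern in this hunk (store a score, then drop one entry once the cache exceeds its limit) can be sketched as a small self-contained class. The diff does not show how `__LCSCache__` is declared, so this sketch assumes an ordered mapping from which the oldest entry is evicted. The `BoundedCache` name, its methods, and the stand-in scoring function are hypothetical.

```python
from collections import OrderedDict


class BoundedCache(object):
    """Hypothetical FIFO cache keyed by an unordered pair of sequence ids."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._data = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        # Return the stored score when present, otherwise compute and store it.
        if key in self._data:
            self.hits += 1
            return self._data[key]
        self.misses += 1
        value = compute()
        self._data[key] = value
        # Evict the oldest entry once the limit is exceeded.
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)
        return value

    def __len__(self):
        return len(self._data)


cache = BoundedCache(max_size=2)
for a, b in [("s1", "s2"), ("s2", "s1"), ("s1", "s3"), ("s3", "s4")]:
    # frozenset makes (a, b) and (b, a) share one cache entry, as in the diff.
    pair = frozenset((a, b))
    cache.get_or_compute(pair, lambda: len(a) + len(b))  # stand-in for an alignment score
print(len(cache), cache.hits, cache.misses)
```

Keying on a `frozenset` of the two ids means a pair is found regardless of argument order, which is why the second lookup in the usage loop counts as a hit.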
@@ -292,9 +300,15 @@ if __name__=='__main__':
 __INCache__=1.0
 __OUTCache__=1.0
 optionParser = getOptionManager([addSearchOptions,addTaxonomyDBOptions,addInOutputOption],progdoc=__doc__)
 (options, entries) = optionParser()
+__CACHE_SIZE__=options.cache
+if __CACHE_SIZE__ < 10:
+__CACHE_SIZE__=10
 taxonomy = loadTaxonomyDatabase(options)
 writer = sequenceWriterGenerator(options)
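
The hunk above reads the `--cache-size` value and raises anything below 10 to 10. A minimal standalone sketch of the same idea follows. It uses the standard `argparse` module rather than the obitools option manager, whose API is not shown in this commit, and the names `parse_cache_size` and `MIN_CACHE_SIZE` are assumptions.

```python
import argparse

MIN_CACHE_SIZE = 10  # floor applied by the commit; the constant name is illustrative


def parse_cache_size(argv):
    """Hypothetical helper: parse --cache-size and clamp it to the floor."""
    parser = argparse.ArgumentParser(prog="ecotag-sketch")
    parser.add_argument("--cache-size", dest="cache", type=int,
                        metavar="<SIZE>", default=1000000,
                        help="Cache size for the alignment score")
    options = parser.parse_args(argv)
    # Same effect as: if size < 10: size = 10
    return max(options.cache, MIN_CACHE_SIZE)


print(parse_cache_size([]))
print(parse_cache_size(["--cache-size", "3"]))
```

Clamping silently, as the commit does, keeps the program running on a nonsensical value. Rejecting the value with an error message would be an alternative design.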
@@ -317,7 +331,7 @@ if __name__=='__main__':
 taxonlink = {}
 rankid = taxonomy.findRankByName(options.explain)
 for seq in db:
 id = seq.id[0:46]
 seq.id=id
@@ -338,6 +352,8 @@ if __name__=='__main__':
 search = lcsIteratorSelf(entries,db,options)
+print >>sys.stderr,'\nCache size : %d\n' % __CACHE_SIZE__
 for seq,best,match in search:
 try:
@@ -424,9 +440,9 @@ if __name__=='__main__':
 else:
 seq['species_name']=None
-print >>sys.stderr,'\rCache size : %5.3f ' % (__INCache__/__OUTCache__),
 writer(seq)
+print >>sys.stderr,'\n%5.3f%% of the alignments was cached' % (__INCache__/(__INCache__+__OUTCache__)*100)
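
The final summary reports the share of alignment scores served from the cache. With Python's `%` string formatting, a literal percent sign must be written `%%`. Otherwise the characters that follow are parsed as another conversion specifier. The sketch below computes the same ratio from hit and miss counters. The function names are hypothetical, and the explicit zero guard replaces the counters seeded at 1.0 in the original script.

```python
def cached_percentage(hits, misses):
    """Percentage of lookups answered from the cache (0.0 when nothing was looked up)."""
    total = hits + misses
    return 100.0 * hits / total if total else 0.0


def summary(hits, misses):
    # '%%' yields a literal percent sign in the formatted output.
    return "%5.3f%% of the alignments were cached" % cached_percentage(hits, misses)


print(summary(1.0, 3.0))
```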