bioOTU is a computational tool for clustering 16S rRNA gene sequences from high-throughput sequencing-based microbial studies into operational taxonomic units (OTUs). It aims at two algorithmic improvements: (i) effectively distinguishing the error-free sequences of known species in the reference database from potentially erroneous sequences, and then clustering them into OTUs by different strategies; and (ii) performing taxonomic assignment and OTU clustering simultaneously to achieve higher reliability. Performance tests on both mock and real communities showed that bioOTU recovers the underlying profiles of both microbial composition and abundance well, and that it produces a comparable or smaller number of OTUs than the prevailing tools Mothur (http://mothur.org) and UPARSE (http://drive5.com/uparse).
Schematic illustration of the bioOTU algorithm.
Input unique tags are denoted by circles (A). Each tag belongs either to a known species, as an error-free (dark orange, big) or artificial (light orange, small) sequence, or to an unknown species, as an error-free (dark olive, big) or artificial (light olive, small) sequence. After taxonomic assignment, the tags are separated into assigned (B) and unassigned (C) tags. The assigned tags are grouped by genus for taxonomy-guided clustering into OTUs (D, dashed circles), to which the taxonomically unassigned tags are then appended to generate the complete OTUs (E). Finally, all pending tags (F) are clustered into OTUs by the improved heuristic algorithm (G).
The framework of bioOTU is written in Python, whereas the computationally expensive steps of the clustering process, such as pairwise alignment with the Needleman-Wunsch algorithm, are written in C for speed. bioOTU also supports multithreaded computation. The source code and reference manual of bioOTU are freely available here.
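For readers unfamiliar with the Needleman-Wunsch algorithm mentioned above, the following is a minimal pure-Python sketch of global pairwise alignment. It is not the C implementation shipped with bioOTU. The function names (needleman_wunsch, identity), the linear gap model, and the scoring values (match=1, mismatch=-1, gap=-2) are assumptions chosen only for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b with a linear gap penalty.

    Returns (score, aligned_a, aligned_b). The scoring values are
    illustrative assumptions, not bioOTU's actual parameters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First column and first row: aligning a prefix against gaps only.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Fill the dynamic-programming matrix.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


def identity(aligned_a, aligned_b):
    """Fraction of alignment columns in which both sequences carry the same base."""
    if not aligned_a:
        return 0.0
    same = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return same / len(aligned_a)
```

Both time and memory grow with the product of the two sequence lengths, which is one reason an alignment step of this kind is commonly moved to a compiled language when many tag pairs have to be compared.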
Additionally, we designed a new algorithm and provide custom scripts to taxonomically annotate OTUs at the species level, including preparation of the reference database, extraction of representative sequences from OTUs, and homology search against the reference database (details are given in the Supplementary Data of bioOTU).
We also provide a Python script to calculate the normalized mutual information (NMI) score. It requires one tab-delimited input file with three columns: the sequence name, the name of the reference species (gold standard), and the label of the corresponding OTU (details are given in the Supplementary Data of bioOTU).
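To illustrate what such an NMI calculation involves, here is a minimal sketch that reads a three-column tab-delimited file of the kind described above. It is not the script distributed with bioOTU. The function names (read_labels, entropy, nmi) are hypothetical, and the normalization by the arithmetic mean of the two entropies, NMI = 2 I(S;O) / (H(S) + H(O)), is an assumption, since several normalizations are in common use.

```python
import math
from collections import Counter


def read_labels(path):
    """Read columns 2 (reference species) and 3 (OTU label) of a tab-delimited file."""
    species, otus = [], []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            _name, ref, otu = line.split("\t")[:3]
            species.append(ref)
            otus.append(otu)
    return species, otus


def entropy(labels):
    """Shannon entropy (natural log) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(species, otus):
    """NMI between two labelings, normalized by the arithmetic mean of the entropies.

    The choice of normalization is an assumption made for this sketch.
    """
    n = len(species)
    joint = Counter(zip(species, otus))
    sp_counts = Counter(species)
    otu_counts = Counter(otus)

    # Mutual information from the joint and marginal label counts.
    mi = sum(
        (c / n) * math.log(c * n / (sp_counts[s] * otu_counts[o]))
        for (s, o), c in joint.items()
    )
    denom = entropy(species) + entropy(otus)
    # Both labelings consist of a single group: treat as perfect agreement.
    return 1.0 if denom == 0 else 2.0 * mi / denom
```

A score of 1 means the OTUs reproduce the reference species exactly (up to renaming of labels), while a score near 0 means the OTU labels carry no information about the species. With a hypothetical file name, usage would be: species, otus = read_labels("labels.tsv"), followed by nmi(species, otus).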
Chen SY, Deng F, Huang Y, Jia X, Liu YP, Lai SJ. bioOTU: An Improved Method for Simultaneous Taxonomic Assignments and Operational Taxonomic Units Clustering of 16s rRNA Gene Sequences. Journal of Computational Biology. 2016, 23(4):229-38.