This methodology is probably not restricted to pyrosequencing datasets and could, after some modifications, be applied to datasets obtained with any kind of sequencing technique.

Acknowledgements

This research was financed by
the Swiss National Science Foundation, Grants No. 120536, 138148 and 120627. We acknowledge the excellent assistance of Yoan Rappaz in molecular biology analyses. We thank Scot E. Dowd, Yan Sun and Lars Koenig at Research and Testing Laboratory (Lubbock, Texas, USA), Timothy M. Vogel, Sébastien Cecillon and the Environmental Microbial Genomics Group at Ecole Centrale de Lyon (France), and GATC Biotech (Konstanz, Germany) for pyrosequencing analyses and advice. We are grateful to Ioannis Xenarios for support and access to the Vital-IT HPCC of the Swiss Institute of Bioinformatics (Lausanne, Switzerland).

Electronic supplementary
material

Additional file 1: Quality plots generated for samples pyrosequenced with the LowRA (>3,000 reads) and HighRA (>10,000 reads) methods. Sequence quality PHRED scores over all bases (A): PHRED scores are defined from the base-calling error probability as $\mathrm{PHRED} = -10 \log_{10} P_{\mathrm{error}}$, or equivalently $P_{\mathrm{error}} = 10^{-\mathrm{PHRED}/10}$. Box plots represent the distribution of read quality at each sequence length. The black curve represents the mean sequence quality as a function of sequence length. Distribution of the mean sequence quality PHRED score over the pyrosequencing reads (B). Distribution of sequence lengths over all pyrosequencing reads (C). Only sequences between 300 and 500 bp were kept for dT-RFLP analysis. (PDF 163 KB)

Additional
file 2: Assessment of mapping performance with pyrosequencing datasets denoised without (0–500 bp) and with (300–500 bp) a minimal read length cutoff. Examples are given for the groundwater sample GRW01, the flocculent activated sludge sample FLS01 and the aerobic granular sludge sample AGS01. After denoising with either method, each dataset was mapped against a reference database with MG-RAST [66]. No cutoff was set for e-value, minimum identity or minimum alignment length. Because 35–45% of the sequences remained unassigned with Greengenes, the Ribosomal Database Project (RDP) [67] was used as the reference database for this assessment (only 4% unassigned sequences). Correlations between the bacterial community profiles obtained with both denoising methods and both reference databases were analyzed with STAMP [68]. (PDF 375 KB)

Additional file 3: Comparison of the distributions of the SW mapping score and of the traditional identity score used by microbial ecologists in the field of environmental sciences for phylogenetic affiliation of sequences.