On Sat, Feb 01, 2020 at 07:45:20AM +0100, Paul Gevers wrote:
> Hi,
>
> On Fri, 31 Jan 2020 12:22:42 +0100 Andreas Tille <andr...@an3as.eu> wrote:
>
> > I'd love if some users of fastqc could comment on the outcome of these
> > tests.
>
> Me too. Because how it looks to me (Release Team member hat on) this
> test should actually not reward FastQC with the reduced age. Hence I
> have aged this version of fastqc to 10 days.

Hi Paul, hello Andreas,

I spent a while looking into this bug (and thereby took a crash course
in the Sequence Alignment Map file format) and I am convinced that
there's not an issue with fastqc.

The "FAIL" output in the summary indicates that FastQC is finding issues
with the data quality in the toy.sam and toy.bam datasets.  And since
the latter is merely the binary version of the former, the quality
issues detected are the same.

From the upstream README [1]:

> FastQC is an application which takes a FastQ file and runs a series
> of tests on it to generate a comprehensive QC report.  This will
> tell you if there is anything unusual about your sequence.  Each
> test is flagged as a pass, warning or fail depending on how far it
> departs from what you'd expect from a normal large dataset with no
> significant biases.  It's important to stress that warnings or even
> failures do not necessarily mean that there is a problem with your
> data, only that it is unusual.  It is possible that the biological
> nature of your sample means that you would expect this particular
> bias in your results.

In this tutorial [2], the warning is stronger:

> The output from FastQC, after analyzing a FASTQ file of sequence reads,
> is an html file that may be viewed in your browser. The report contains
> one result section for each FastQC module. In addition to the graphical
> or list data provided by each module, a flag of “Passed”, “Warn” or
> “Fail” is assigned. Researchers should be very cautious about relying
> on these flags when assessing sequence data. The thresholds used to
> assign these flags are based on a very specific set of assumptions that
> are applicable to a very specific type of sequence data. Specifically,
> they are tuned for good quality whole genome shotgun DNA sequencing.
> They are less reliable with other types of sequencing, for example
> mRNA-Seq, small RNA-Seq, methyl-seq, targeted sequence capture and
> targeted amplicon sequencing. Therefore, a module result that has a
> “Warn” or “Fail” flag does not necessarily mean that the sequence run
> failed. “Warn” and “Fail” flags mean that the researcher must stop and
> consider what that results mean in the context of that particular sample
> and the type of sequencing that was run.

In order to prove this to myself, I ran FastQC against a number of
different datasets in the SAM and BAM format that I found online (for
example in this Galaxy tutorial [3]) and against the SAM files found in
picard-tools [4] sources, and it found issues with all of them.  So I
think the tool is doing the right thing outputting FAIL for these files.

I propose that we update the test to ensure that a summary file is
produced and contains the requisite number of lines for each of the data
quality tests, and that each line contains one of "PASS|WARN|FAIL" to
indicate that FastQC was able to run the tests.  This will ensure that
FastQC isn't exiting unexpectedly while trying to read the file, etc.

Going further, we could establish some known results and compare them to
the output during the build.

For reference, good output looks like this:

$ cat GSM461177_untreat_paired_chr4_fastqc/summary.txt
PASS    Basic Statistics              GSM461177_untreat_paired_chr4.bam
PASS    Per base sequence quality     GSM461177_untreat_paired_chr4.bam
PASS    Per sequence quality scores   GSM461177_untreat_paired_chr4.bam
FAIL    Per base sequence content     GSM461177_untreat_paired_chr4.bam
PASS    Per sequence GC content       GSM461177_untreat_paired_chr4.bam
PASS    Per base N content            GSM461177_untreat_paired_chr4.bam
PASS    Sequence Length Distribution  GSM461177_untreat_paired_chr4.bam
WARN    Sequence Duplication Levels   GSM461177_untreat_paired_chr4.bam
WARN    Overrepresented sequences     GSM461177_untreat_paired_chr4.bam
PASS    Adapter Content               GSM461177_untreat_paired_chr4.bam

Bad output looks like a Java exception stack trace...
:)

Cheers,
tony

[1] https://raw.githubusercontent.com/s-andrews/FastQC/master/README.txt
[2] https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/
[3] https://galaxyproject.github.io/training-material/topics/sequence-analysis/tutorials/quality-control/tutorial.html
[4] https://salsa.debian.org/med-team/picard-tools/-/tree/master/testdata%2Fpicard%2Fsam
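
P.S. The summary check proposed above could be sketched roughly as
follows in Python.  This is only an illustration under stated
assumptions: the function name check_summary is hypothetical, and the
default of 10 result lines is simply the count in the sample output
quoted above, not a value confirmed by FastQC or by the Debian package.
The flag is read as the first whitespace-separated token, so the sketch
does not depend on the exact field separator used in summary.txt.

```python
# Hypothetical helper, not part of FastQC or the Debian packaging.
VALID_FLAGS = {"PASS", "WARN", "FAIL"}


def check_summary(text, expected_lines=10):
    """Return a list of problems found in the contents of a summary.txt.

    expected_lines defaults to 10 only because the sample output in this
    mail has 10 result lines; adjust it for the data set under test.
    """
    problems = []
    # Ignore blank lines so a trailing newline does not affect the count.
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) != expected_lines:
        problems.append(
            "expected %d result lines, found %d" % (expected_lines, len(lines))
        )
    for number, line in enumerate(lines, start=1):
        # The first token on each line should be the module's flag.
        flag = line.split(None, 1)[0]
        if flag not in VALID_FLAGS:
            problems.append(
                "line %d has no PASS/WARN/FAIL flag: %r" % (number, line)
            )
    return problems
```

A test script could pass the contents of the generated summary.txt to
this function and fail whenever the returned list is non-empty.  A FAIL
flag on an individual module would not fail the test; only a missing or
malformed summary (for example a Java stack trace) would.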