fastqc
The FastQC module parses results generated by FastQC, a quality control tool for high throughput sequence data written by Simon Andrews at the Babraham Institute.
FastQC generates an HTML report, which is what most people use when they run the program. However, it also helpfully generates a file called fastqc_data.txt, which is relatively easy to parse.
A typical run will produce the following files:
mysample_fastqc.html
mysample_fastqc/
  Icons/
  Images/
  fastqc.fo
  fastqc_data.txt
  fastqc_report.html
  summary.txt
Sometimes the directory is zipped, with just mysample_fastqc.zip.
The FastQC MultiQC module looks for files called fastqc_data.txt or ending in _fastqc.zip. If zip files are found, they are read in memory and the fastqc_data.txt file inside them is parsed.
The directory and zip file are often both present. To speed up MultiQC execution, zip files will be skipped if the file name suggests that they will share a sample name with data that has already been parsed.
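As a rough illustration of reading a zipped report without unpacking it to disk, the sketch below uses Python's standard zipfile module. The helper name read_fastqc_data_from_zip is hypothetical and this is not MultiQC's own implementation. It assumes only that the archive contains a member named fastqc_data.txt somewhere inside it.

```python
import zipfile


def read_fastqc_data_from_zip(zip_source):
    """Return the text of fastqc_data.txt from a FastQC zip archive.

    `zip_source` may be a file path or a binary file-like object.
    Hypothetical helper for illustration; returns None if no such member exists.
    """
    with zipfile.ZipFile(zip_source) as zf:
        for name in zf.namelist():
            # The file normally sits one level down,
            # e.g. mysample_fastqc/fastqc_data.txt
            if name.split("/")[-1] == "fastqc_data.txt":
                with zf.open(name) as fh:
                    return fh.read().decode("utf-8", errors="replace")
    return None
```

The archive member is read directly into memory, so nothing is extracted to the analysis directory.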
You can customise the patterns used for finding these files in your MultiQC config (see Module search patterns). The default file patterns are:
sp:
  fastqc/data:
    fn: "fastqc_data.txt"
  fastqc/zip:
    fn: "*_fastqc.zip"
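To get a feel for which files these patterns would pick up, the following sketch tests file names against them with Python's fnmatch module. It assumes that fn patterns are glob-style matches against the bare file name. The PATTERNS dictionary and the matching_keys helper are hypothetical names used only for this example.

```python
from fnmatch import fnmatch

# Default patterns copied from the config above
PATTERNS = {
    "fastqc/data": "fastqc_data.txt",
    "fastqc/zip": "*_fastqc.zip",
}


def matching_keys(filename):
    """Return the search-pattern keys whose glob matches a bare file name."""
    return [key for key, pattern in PATTERNS.items() if fnmatch(filename, pattern)]
```

For instance, matching_keys("mysample_fastqc.zip") returns only the fastqc/zip key, while summary.txt matches nothing.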
Sample names are discovered by parsing the line beginning Filename in fastqc_data.txt, not based on the FastQC report names.
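A minimal sketch of that lookup is shown here, assuming the Filename line is tab-separated like the other entries in the Basic Statistics section of fastqc_data.txt. The function name is hypothetical, and any further clean-up of the name is outside the scope of this example.

```python
def sample_name_from_fastqc_data(text):
    """Return the value on the 'Filename' line, or None if there is no such line."""
    for line in text.splitlines():
        if line.startswith("Filename"):
            # Expected shape: "Filename<TAB>mysample.fastq.gz"
            parts = line.split("\t", 1)
            if len(parts) == 2:
                return parts[1].strip()
    return None
```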
Theoretical GC Content
It is possible to plot a dashed line showing the theoretical GC content for a reference genome. MultiQC comes with genome and transcriptome guides for Human and Mouse. You can use these in your reports by adding the following MultiQC config keys (see Configuring MultiQC):
fastqc_config:
  fastqc_theoretical_gc: "hg38_genome"
Only one theoretical distribution can be plotted. The following guides are available: (txome = transcriptome)
hg38_genome
hg38_txome
mm10_genome
mm10_txome
Alternatively, a custom theoretical guide can be used in reports. To do this, create a file with fastqc_theoretical_gc in the filename and place it with your analysis files. It should be tab delimited with the following format (column 1 = %GC, column 2 = % of genome):
# FastQC theoretical GC content curve: YOUR REFERENCE NAME
0 0.005311768
1 0.004108502
2 0.004060371
3 0.005066476
[...]
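To sanity-check a custom file before using it, something like the following sketch can help. It assumes the layout described above: an optional comment header starting with #, followed by tab-delimited pairs of numbers. Treating the text after the colon in the header as the reference name is an assumption of this example, and parse_theoretical_gc is a hypothetical helper, not part of MultiQC.

```python
def parse_theoretical_gc(text):
    """Parse a theoretical GC file.

    Returns (reference_name, points) where points is a list of
    (gc_percent, percent_of_genome) tuples. Raises ValueError on
    lines that are not two tab-separated numbers.
    """
    name = None
    points = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Header comment: keep whatever follows the first colon as the name
            if ":" in line:
                name = line.split(":", 1)[1].strip()
            continue
        gc, pct = line.split("\t")
        points.append((float(gc), float(pct)))
    return name, points
```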
You can generate these files using an R package called fastqcTheoreticalGC written by Mike Love. Please see the package readme for more details.
Result files from this package are searched for with the following search pattern (can be customised as described above):
sp:
  fastqc/theoretical_gc:
    fn: "*fastqc_theoretical_gc*"
If you want to always use a specific custom file for MultiQC reports without having to add it to the analysis directory, add the full file path to the same MultiQC config variable described above:
fastqc_config:
  fastqc_theoretical_gc: "/path/to/your/custom_fastqc_theoretical_gc.txt"
Overrepresented sequences
The overrepresented sequences table shows the most common sequences found, ranked by the number of samples in which they are reported as overrepresented. By default, the table shows the top 20 sequences. This can be customised in the config:
fastqc_config:
  top_overrepresented_sequences: 50
You can also choose to rank the top sequences by the total number of reads rather than by number of samples:
fastqc_config:
  top_overrepresented_sequences_by: "total"
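The difference between the two rankings can be illustrated with a small sketch. This is not MultiQC's implementation: the function name, the input shape (sample name mapped to per-sequence read counts) and the "samples" value for the by argument are all assumptions made for the example. Only "total" comes from the config above.

```python
from collections import defaultdict


def rank_overrepresented(per_sample, by="samples", top_n=20):
    """Rank sequences across samples.

    per_sample maps sample name -> {sequence: read count}.
    by="samples" ranks by how many samples report the sequence;
    by="total" ranks by summed read count. Ties are broken alphabetically.
    """
    n_samples = defaultdict(int)
    total_reads = defaultdict(int)
    for seqs in per_sample.values():
        for seq, count in seqs.items():
            n_samples[seq] += 1
            total_reads[seq] += count
    score = total_reads if by == "total" else n_samples
    ranked = sorted(score, key=lambda seq: (-score[seq], seq))
    return ranked[:top_n]
```

A sequence that appears in many samples at low abundance ranks first under the per-sample count, whereas one that dominates a single sample can come first when ranking by total reads.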
Changing the order of sections
You can customise the order in which the different module sections appear in the report. See the docs for more information.
For example, to show the Status Checks section at the top, use the following config:
report_section_order:
  fastqc_status_checks:
    order: -1000
Showing FastQC status checks
FastQC uses thresholds to mark samples as "pass", "warn" or "fail" for various checks. If you prefer the MultiQC module to ignore those thresholds, and use standard MultiQC colors for samples instead, use the following config:
fastqc_config:
  status_checks: false