What is a "Custom Reference Genome"?
A reference genome contains the nucleotide sequence of the chromosomes, scaffolds, or contigs for a single species, representative of a specific genome build or release.
In Galaxy, a custom reference genome is a FASTA formatted dataset that can be used in place of a native reference genome with most tools.
custom: a dataset from the history loaded by users
native: local or cached by administrators (see NGS Setup)
Overview
There are three basic steps to using a Custom Reference Genome:
obtain a FASTA copy of the target genome
FTP the genome to Galaxy and load into a history as a dataset
set a tool form's options to use a custom reference genome from the history and select the loaded genome
Screencasts & Tutorials
Screencast |
Topic |
demonstrates Bowtie tool form options using a custom genome from the history |
|
(-coming soon-) Extract DNA using a Custom Reference Genome |
demonstrates Extract Genomic DNA tool form options using a custom genome from the history |
if you need to know what to do before the tool form options are set, this is the quickie to watch |
Sources
- UCSC, Ensembl, NCBI/GenBank
- Other research projects associated with specific genome projects
- Internal research projects
Format
Custom genomes are required to be in FASTA format
The data should be formatted as FASTA prior to upload into Galaxy
The dataset will need to be labeled as FASTA after it is loaded (if the datatype is not automatically assigned)
Sorting
Many tools expect reference genomes to be sorted in lexicographical order. These tools are often downstream of the initial mapping tools, which means that a large investment in a project has already been made (e.g. a long mapping process) before a sorting problem surfaces in the later analysis tools. No one likes to start over!
How to avoid? Always sort your FASTA reference genome dataset at the beginning of a project. Many sources only provide sorted genomes, but double checking is your own responsibility, and super easy in Galaxy. So easy that there isn't even a shared workflow, just a recipe (but feel free to make your own):
quick lexicographical sort recipe:
1. Convert Formats -> FASTA-to-Tabular
2. Filter and Sort -> Sort
on column: c1
with flavor: Alphabetical
everything in: Ascending order
3. Convert Formats -> Tabular-to-FASTA
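For checking or sorting a genome locally before upload, the recipe above can be approximated in a few lines of Python. This is only a sketch under stated assumptions, not what the Galaxy tools do internally: the function names `read_fasta` and `sort_fasta` are hypothetical, records are ordered by their full title line using Python's default string comparison (which may differ from Galaxy's Alphabetical sort for unusual characters), and the 60-character output wrapping is an arbitrary choice.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (title, sequence) pairs; title is the '>' line without the '>'."""
    title = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


def sort_fasta(lines: Iterable[str], width: int = 60) -> List[str]:
    """Return FASTA lines with records sorted by title, ascending."""
    # Assumption: plain code-point ordering stands in for the recipe's
    # "Alphabetical, Ascending" sort on column c1.
    records = sorted(read_fasta(lines), key=lambda rec: rec[0])
    out: List[str] = []
    for title, seq in records:
        out.append(">" + title)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return out


example = [">chr2", "ACGT", ">chr10", "GGCC", ">chr1", "TTAA"]
sorted_lines = sort_fasta(example)
print("\n".join(sorted_lines))
```

Note that lexicographical order places chr10 before chr2, which is the ordering the recipe produces as well.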
Troubleshooting
If a custom genome dataset is producing errors, clicking on the green bug icon will often provide a description of the problem. This does not automatically submit a bug report, and it is not always necessary to do so, but it is a good way to get more information about why a job is failing.
Common problems and solutions:
# | Problem | Symptoms | Tests | Solution |
1. | Custom genome not assigned as FASTA format | Dataset not included in the custom genome pull-down menu on tool forms | Check the datatype assigned to the dataset | Click on the dataset's pencil icon and assign the FASTA datatype |
2. | Incomplete file load | Sometimes none if all steps are run in Galaxy, or only downstream as data analysis inconsistencies. Errors can appear if some steps (such as Tophat) are run outside of Galaxy, but later steps (such as Cufflinks) are run in Galaxy. | Use Text Manipulation → Select last lines from a dataset to check the last 10 lines and see whether the file is truncated | Reload the file (switch to FTP if not already using it). If FTP was used for the prior load, check your FTP client logs. |
3. | Extra spaces, extra lines, inconsistent line wrapping, or any other deviation from strict FASTA format | RNA-seq tools (Cufflinks, Cuffcompare, Cuffmerge, Cuffdiff, but not Tophat) fail with the error "Error: sequence lines in a FASTA record must have the same length!" | Test and correct the file locally and re-upload, or test and fix it within Galaxy, then re-run | Start with FASTA manipulation → FASTA Width formatter with a value between 40 and 80 (60 is common) to reformat wrapping. Next, use Filter and Sort → Select with ">" to examine identifiers. Use a combination of Convert Formats → FASTA-to-Tabular, Text Manipulation tools, then Tabular-to-FASTA to correct. Finally, use Filter and Sort → Select with "^$" to search for empty lines (use "NOT matching" to remove). |
4. | Inconsistent line wrapping, common when merging chromosomes from various GenBank records (e.g. primary chromosomes with mitochondrial) | Tools (SAMTools, Extract Genomic DNA, but rarely alignment tools) may complain about unexpected line lengths or missing identifiers | Test and correct the file locally and re-upload, or test and fix it within Galaxy, then re-run | Use FASTA manipulation → FASTA Width formatter with a value between 40 and 80 (60 is common) to reformat, then re-run |
5. | Unsorted genome | Tools such as Extract Genomic DNA report problems with sequence lengths | First try sorting in Galaxy and re-run. If the problem persists, test and correct the file locally and re-upload, or test and fix as for #3 above | To sort, follow the instructions for Sorting a Custom Genome |
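For the "test and correct locally" route mentioned for problems 3 and 4, a small script can handle the most common cleanups: stray whitespace, empty lines, and inconsistent wrapping. The following is a minimal sketch, assuming the whole file fits comfortably in memory; `normalize_fasta` is a hypothetical helper name, not a Galaxy tool or library function, and the default width of 60 simply follows the commonly used value noted above.

```python
from typing import Iterable, List


def normalize_fasta(lines: Iterable[str], width: int = 60) -> List[str]:
    """Strip stray whitespace, drop empty lines, and rewrap sequence lines."""
    if width < 1:
        raise ValueError("width must be a positive integer")
    out: List[str] = []
    pending: List[str] = []  # sequence chunks of the current record

    def flush() -> None:
        # Emit the buffered sequence in fixed-width lines.
        seq = "".join(pending)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
        pending.clear()

    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # drop empty lines
        if line.startswith(">"):
            flush()
            out.append(line)  # title line, trimmed but otherwise unchanged
        else:
            pending.append("".join(line.split()))  # remove internal spaces
    flush()
    return out


messy = [">chr1 ", "ACGTAC", "", "GT  ", ">chrM", "AC GT", "A"]
clean = normalize_fasta(messy, width=4)
print("\n".join(clean))
```

After this pass every sequence line in a record has the same length except possibly the last one. The sketch does not validate identifiers or sequence characters, so those still need to be inspected separately.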
A problem or not a problem? Certain job errors with RNA-seq tools can at first appear to look like a format problem with a custom reference genome, but are actually a bit more complicated...
- Cufflinks, Cuffmerge, or Cuffdiff reports a missing or problematic transcripts.gtf file. This generally indicates a mismatch in the chromosome identifiers between the reference genome used for the original (Tophat) alignment, the reference annotation GTF data, and the reference genome selected on the current tool form.
The problem can sometimes be corrected by altering the chromosome identifiers in the GTF file or the reference genome (see the RNA-seq FAQ: http://main.g2.bx.psu.edu/u/jeremy/p/transcriptome-analysis-faq).
A quick solution is to not use the GTF file and/or to turn off the bias correction option on the tool form.
- The best solution is to use the same exact reference genome for all steps in the same analysis pipeline. Alignment tools (BWA, Bowtie, Tophat) are generally tolerant of minor formatting problems with reference genomes. However, downstream tools tend to have more stringent format requirements. To avoid having to reprocess, a best practice is to verify that the formatting is correct before any steps are started.
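One way to verify identifiers before starting is to compare the chromosome names in the reference genome with those in the annotation. The sketch below is illustrative only: `fasta_ids` and `gtf_ids` are hypothetical helper names, the FASTA identifier is assumed to be the text after ">" up to the first whitespace, and the GTF chromosome name is read from the first tab-separated column. Header lines other than "#" comments (for example track lines) are not handled.

```python
from typing import Iterable, Set


def fasta_ids(lines: Iterable[str]) -> Set[str]:
    """Collect identifiers: text after '>' up to the first whitespace."""
    ids: Set[str] = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def gtf_ids(lines: Iterable[str]) -> Set[str]:
    """Collect the first (sequence name) column, skipping '#' comments."""
    ids: Set[str] = set()
    for line in lines:
        if line.strip() and not line.startswith("#"):
            ids.add(line.split("\t", 1)[0].strip())
    return ids


genome = [">chr1 example title", "ACGT", ">chrM", "ACGT"]
annotation = [
    "# example annotation",
    '1\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chrM\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g2"; transcript_id "t2";',
]
# Names used in the annotation that the genome does not contain.
missing = sorted(gtf_ids(annotation) - fasta_ids(genome))
print(missing)
```

A non-empty result, such as "1" in the annotation versus "chr1" in the genome here, points to the kind of identifier mismatch described above.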

