So you've sequenced a genome...now what? (A crash course in bioinformatics, part 1)
“How did you learn all this?”
I get this question basically every time I teach a bit of bioinformatics, from troubleshooting code for a labmate to running workshops. The truth of it is, I’m mostly self-taught, with a smattering of workshops and slightly related classes to fill out my knowledge here and there. It gets the job done, and I’ve managed to get to the point that I can pretty comfortably handle most computational issues that get thrown my way. This seems to be pretty typical, from the many conversations I’ve had with other evolutionary biologists. While some folks do come into the field with a background in computer science and serious coding chops, it’s definitely not that common. Most of us muddle our way through, following pipelines and workflows designed by others until we figure out what it is we’re really doing.
If this sounds fairly inefficient, well, I agree. While there are certainly great programs out there that give a comprehensive and thorough grounding in these skills, the norm is to just… figure it out, as a series of polls I ran over on my Twitter feed demonstrates. This leads to a lot of frustration, a lot of reinventing the wheel, and a lot of black-box programming that can lead to mistakes. We can do better than this, though!
The basics: what the hell are our data?
Personally, I find the most helpful thing to do is to roll all the way back to a question that is essential, but overlooked way too often when teaching bioinformatics: what is “genomic data”? How do we get from tubes of clear liquid to strings of letters in a computer? Understanding this process makes some of the properties of our data make a lot more sense, so it’s worth spending some time on.
Let’s consider an Illumina flow cell. Obviously there are other platforms, but as this is the most commonly used one, it’s a good starting point. The flow cell is basically very fancy Velcro, but made from DNA- it’s covered in little nubbins of DNA that can pair with the adaptors you attach during library prep, sticking them onto the surface. Once this happens, there are several rounds of PCR that grow each strand out into a little cluster of identical copies of that strand- just enough for the sensors to see.
This is where, as far as I’m concerned, magic happens. Each cluster is mapped to a very precise grid, and a solution with bases tagged with fluorescent dye washes across it. The color-coded bases pair with your sequences, building the strand just like in PCR. The difference is that for each cycle of that solution flooding the cell, a camera takes a picture of the cell. These images (in super high definition, obviously) are stored up, and then the color sequence- red-green-green-blue-red-yellow, for example- gets extracted for each set of coordinates and translated into the bases that go with those colors- in our little example, TGGCTA.
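To make that translation step concrete, here’s a toy version in Python. The color-to-base assignments are invented for this example- real basecalling software does this statistically, across millions of clusters at once:

```python
# Toy basecalling: translate the color sequence recorded for one
# cluster into bases. These color assignments are made up for the
# example; the actual chemistry and dyes vary by platform.
color_to_base = {"red": "T", "green": "G", "blue": "C", "yellow": "A"}

cluster_colors = ["red", "green", "green", "blue", "red", "yellow"]
read = "".join(color_to_base[c] for c in cluster_colors)
print(read)  # TGGCTA
```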
Okay, so why are we getting into this? Apart from the fact that it’s one of the coolest damn things- we hacked the inner workings of extremophile bacteria and harnessed light to read the code of life itself, dammit- those little clusters represent one of the fundamental units in bioinformatics that we will be spending a lot of time talking about in this series: a read.
Each read is a unique fragment of DNA from your sample. If you’re doing shotgun-based preps, it’s one of the pieces you broke your DNA extraction into via enzymes or sonication; if it’s an amplicon approach, it’s the region your primers target. When you get your data back from the sequencing center, this will be the form it’s in: files that, for each indexed sample, list all of the sequences for the clusters tagged with that index. These will likely be the first type of files you handle, but they’re far from the only ones.
A field guide to file formats
There’s an old joke about how bioinformatics is at least 70% converting between file types, and I have yet to find compelling evidence to the contrary. You will probably encounter a frankly bewildering array of file types as you process your data, and it’s a lot to keep track of. However, most file formats generally group into one of three broad types- sequences, alignments, and sites. For this post, I’ll focus on the first type, sequences, with the others covered in subsequent posts.
One other property of files to keep in mind is the flexibility of the format. Simply put, this is how rigid each type is in terms of character position and allowable characters. Rigid formats are ones with little variation in things like line length, where the exact position in a given row and/or column means something. A simple version of this is a tab-delimited file- you may have a header to describe each column and row, but ultimately the data are defined by the position they are stored in. Meanwhile, flexible formats use things like keywords or indexing to define what a piece of data is. This is more or less a continuum- many formats combine elements of both- but I will note whenever a file format tends towards one or the other, as it’s helpful for understanding the limits of each.
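As a toy illustration (both layouts invented for the example), here’s the same little record stored rigidly, where column position carries the meaning, and flexibly, where keywords do:

```
# Rigid: tab-delimited; which column a value sits in defines what it is
sample  locus  length
bird01  ND2    1041

# Flexible: keywords define what each value is, so order doesn't matter
length=1041 sample=bird01 locus=ND2
```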
Sequences
The simplest type is the sequence-based format. These do what they say on the box- they list the bases of a given stretch of DNA, usually one record per read or, for assembled sequences, per contig or even chromosome. The most basic, which you will almost certainly run into, is the FASTA format. Files in this format, which typically have the extensions .fasta, .fa, .fna, or .fas, consist of a header followed by the sequence:
>sequence 1
AGCTCCGTCCGATAAAAAATATACCGCGCA
>sequence 2
GTCCGTAGCCAGATAGAGGGAGAGCTCTATTTTACGCTCGCG
FASTAs can consist of multiple loci from a single individual, the same locus for multiple individuals, or (g-d help you) some mix of both. The header is flexible- i.e., it can contain most basic ASCII characters, at any length, between the > and the end of the line. The sequences can be of any length, and do not need to be aligned. While the standard four bases are the most common if you work with DNA, the format also accepts other characters- RNA, amino acids, expanded ambiguity codes to mark variation (e.g., the IUPAC codes, where R marks a base that could be A or G), as well as dashes and Ns to represent either deletions or missing data. Keep in mind, though, that just because these characters are generally allowable in the format doesn’t mean all programs can parse them properly.
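The format is simple enough that you can parse it in a few lines. Here’s a minimal sketch in Python- no error handling, and it assumes the whole file fits in memory; for real work, a maintained parser like Biopython’s SeqIO will handle the edge cases for you:

```python
def read_fasta(path):
    """Parse a FASTA file into a dict of {header: sequence}.

    Minimal sketch: assumes well-formed input and a file small
    enough to hold in memory.
    """
    sequences = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                header = line[1:]               # drop the ">" marker
                sequences[header] = []
            elif header is not None and line:
                sequences[header].append(line)  # sequences may wrap over lines
    # join any wrapped sequence lines into single strings
    return {h: "".join(parts) for h, parts in sequences.items()}
```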
A close relative of the FASTA is the FASTQ. The Q in this case stands for “quality”, which is the main difference here, and the file extensions to look for are .fastq and .fq. Unlike the header-plus-sequence format of the FASTA, FASTQs encode each read in four lines:
@UCB-SEQ:6:73:941:1973#0/1
ATGGTCGTAGCTCAAAA
+
ii*iiiig9iififghg
@UCB-SEQ:6:79:8278:234#0/1
TTGCTCTAGGCAATATT
+
ii*iijhg9iififghg
FASTQs are typically used for unassembled or unaligned reads- basically, these are what you get back from the sequencing center. The first three lines are relatively self-explanatory: the header gives info on the coordinates of the read on the flow cell, which “names” it; the next line is the sequence itself; and the third links the sequence to the fourth line. But as well as giving you, well, your data, the FASTQ includes a bit of really important metadata: the Phred score, a per-base measure of how likely each call is to be accurate.
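Because every record is exactly four lines, pulling a FASTQ apart is mechanical. Here’s a minimal sketch in Python (the field names are mine, for readability; the format itself only fixes the order of the lines, it doesn’t name them):

```python
from collections import namedtuple

# Field names are for illustration only; the format doesn't define them.
FastqRecord = namedtuple("FastqRecord", ["header", "sequence", "plus", "quality"])

def next_record(handle):
    """Read the next four-line FASTQ record from an open file handle."""
    lines = [handle.readline().strip() for _ in range(4)]
    if not lines[0]:
        return None  # end of file
    return FastqRecord(*lines)
```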
Why do we need these scores? Well, let’s jump back to how these data are generated on the sequencer. Each base in your sequence is created from a picture of a flash of color- which, while amazing, isn’t exactly foolproof. It’s very accurate, considering, but with so many individual bases in so many reads being called in a given run, even a 0.001% chance of the computer misreading a flash (or of a base pairing incorrectly as the strand is built) works out to thousands of incorrect calls across your dataset. Managing this error is an important component of bioinformatics pipelines- and one we’ll really get into in the next post in this series- but to be able to do anything about it, we need to know what this likelihood is for each and every base in our sequences. FASTQ files store this important information in a way we can relatively easily access.
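Formally, a Phred score Q relates to the error probability P as Q = -10·log10(P), so Q20 means a 1-in-100 chance the call is wrong and Q30 a 1-in-1,000 chance. Each score is stored as a single ASCII character on that fourth line. Here’s a minimal sketch of decoding one, assuming the now-standard Phred+33 offset (older Illumina data used Phred+64, so check what your own files actually use):

```python
def phred_error_probs(quality_string, offset=33):
    """Convert a FASTQ quality string to per-base error probabilities.

    Assumes the Phred+33 encoding (offset=33); older Illumina data
    used Phred+64. Q = -10 * log10(P_error), so P_error = 10 ** (-Q / 10).
    """
    return [10 ** (-(ord(char) - offset) / 10) for char in quality_string]

# A base whose quality character is "I" (Phred+33 score 40) has a
# 1-in-10,000 chance of being wrong:
print(phred_error_probs("I"))  # [0.0001]
```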
One downside of sequence-based files, especially FASTQs, is that they very quickly become very large. Even small genomes are an awful lot of data to handle, and once you start talking about vertebrate genomes or (terrifyingly) plant genomes, these get big very, very fast. Since a plain FASTA stores roughly one byte per base, file size tracks genome size: an assembled hummingbird genome will give you a FASTA of around 1.1 GB, a human 3.2 GB, an axolotl 32 GB, and the unassuming plant Paris japonica a whopping 149 GB. But FASTQs are even bigger- if you want to sequence a whole genome at 10x coverage, you want 10 copies of every single base pair, and each of those reads will be its own set of four lines. For this reason, FASTQs in particular are typically found compressed as gzipped files, since otherwise all of our computers would just constantly be smoking husks of themselves.
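Conveniently, you rarely have to decompress them- most tools read gzipped FASTQs directly, and it’s easy to do the same yourself. Here’s a sketch that streams a gzipped file and tallies reads and bases, which is handy for a back-of-the-envelope coverage check (the filename is made up):

```python
import gzip
from itertools import islice

def count_reads_and_bases(path):
    """Stream a gzipped FASTQ, tallying reads and bases without
    ever decompressing the file to disk."""
    n_reads = 0
    n_bases = 0
    with gzip.open(path, "rt") as handle:     # "rt" = read as text
        while True:
            record = list(islice(handle, 4))  # one four-line record
            if len(record) < 4:
                break
            n_reads += 1
            n_bases += len(record[1].strip())  # line 2 is the sequence
    return n_reads, n_bases

# Dividing n_bases by your genome size gives a rough expected coverage.
reads, bases = count_reads_and_bases("sample_R1.fastq.gz")
print(reads, bases)
```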
But that’s far from the only file type you’ll be encountering. In the next post, I’ll talk about initial read processing, quality control, and alignment. Stay tuned!