The Ancestry Composition feature is an innovative, well-tested system for analysing your ancestry based on your DNA. Our powerful machine learning algorithm identifies your ancestry using the following approach:
- Determine which genetic information was inherited together on the same chromosome
- Estimate ancestry for each window of the genome
- Further smooth the window assignments
- Calibrate and return results
DNA carries a history of human migrations and origins. By comparing your DNA to the DNA of present-day individuals from different parts of the world, we can map your own genetic ancestry. The locations in the genome that vary from person to person are called SNPs (single nucleotide polymorphisms), and the different versions of a SNP are called alleles. Everyone carries two alleles at most SNPs, one allele inherited from each parent. Although any one SNP only contains a small amount of information about your geographic origins, by combining the evidence across many SNPs, we can develop a picture of your genetic ancestry.
Step 1: Phasing
Recall that for each of your 23 chromosome pairs, one chromosome in each pair comes from your mother, and the other from your father. Genotyping chips – like the one 23andMe uses – don't capture the information about which genetic information or alleles came from which parent; DNA data from our genotyping chip is called "unphased." We can figure out what you inherited from each parent through a process referred to as “phasing”.
Here's a quick example to illustrate. Say, for a short stretch of Chromosome 1, you inherited the following genotypes:
from Dad: A-T-C
from Mom: G-T-A
When we get your data and we look at your genotype in this spot on Chromosome 1, we'll see the following:
You: A/G - T/T - A/C
Since runs of markers tell us more about your ancestry than individual markers do, we first need to determine which alleles sit together on the same chromosome.
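Using our example above, here are two possibilities that are consistent with the data. It could be:
chromosome copy 1: A-T-C
chromosome copy 2: G-T-A
Or it could be:
chromosome copy 1: A-T-A
chromosome copy 2: G-T-C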
To determine which alleles are on the same chromosome together, we phase your chromosomes using an in-house adaptation of the well-known program EAGLE. When you have a parent or child in the system, you'll get a very high-quality phasing, since we will be able to compare your genetic information with your parent's to determine the correct arrangement.
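To make the ambiguity concrete, the short Python sketch below enumerates every pair of chromosome copies that is consistent with an unphased genotype. It is only an illustration of why phasing is needed, not how EAGLE or our in-house adaptation works; the function name `possible_phasings` is hypothetical.

```python
from itertools import product

def possible_phasings(genotypes):
    """Enumerate every pair of haplotypes consistent with unphased genotypes.

    `genotypes` is a list of (allele, allele) tuples, one per SNP.
    Each result is a sorted 2-tuple of haplotype strings, so swapping
    the two chromosome copies is not counted as a different phasing.
    """
    # At each SNP, either ordering of the two alleles is possible.
    options = [[(a, b), (b, a)] for a, b in genotypes]
    phasings = set()
    for choice in product(*options):
        hap1 = "-".join(pair[0] for pair in choice)
        hap2 = "-".join(pair[1] for pair in choice)
        phasings.add(tuple(sorted((hap1, hap2))))
    return phasings

# The unphased genotype from the example: A/G - T/T - A/C
unphased = [("A", "G"), ("T", "T"), ("A", "C")]
for hap1, hap2 in sorted(possible_phasings(unphased)):
    print(hap1, "/", hap2)
```

For the three-marker example this prints the two candidate arrangements, A-T-A / G-T-C and A-T-C / G-T-A. The number of candidates doubles with each additional heterozygous marker, which is why a statistical phasing method is needed for real data.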
Step 2: Window Classification
Once we have determined which alleles were inherited on the same chromosome together, we break up the chromosomes into short, non-overlapping, adjacent windows, like boxcars in a train. The idea is that you inherited all the DNA in that window from a single parent, grandparent, great-grandparent, etc., going back quite a few generations. We assume that any single window has a single ancestry, but different windows can come from different populations.
After phasing your chromosomes, we segment them into consecutive windows containing ~300 genetic markers each. We measure between 7,400 and 45,000 markers per chromosome, which translates to 24 to 149 windows, depending on the chromosome's length. We then take each of those windows in turn and compare it against the DNA from reference individuals (who represent various populations) to determine which population your DNA most likely came from.
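As a rough illustration of the windowing idea, the sketch below cuts an ordered list of markers into consecutive, non-overlapping windows. The function name `split_into_windows` is hypothetical, and keeping a shorter final window for leftover markers is an assumption; the production pipeline may treat the end of a chromosome differently.

```python
def split_into_windows(markers, window_size=300):
    """Split an ordered list of markers into consecutive, non-overlapping windows.

    Assumption: leftover markers at the end form a shorter final window.
    """
    return [markers[i:i + window_size]
            for i in range(0, len(markers), window_size)]

# Toy example: 10 placeholder markers, 4 markers per window.
toy_markers = [f"m{i}" for i in range(10)]
for window in split_into_windows(toy_markers, window_size=4):
    print(window)
```

Each window would then be classified on its own, which is what makes the later smoothing step necessary.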
We use a well-known machine learning technique called a support vector machine (SVM) to classify each window, because this algorithm performed the best among the techniques that we tested.
Learn more about how Ancestry Composition uses machine learning to detect fine-scale ancestry differences
View our reference populations
Step 3: Smoothing
Each window is initially classified independently, giving us a "first draft" version of the customer's ancestry. For example, suppose we just had three reference populations: X, Y, and Z. What comes out of the SVM might look like this:
chromosome 1, parent 1: X - X - Y - X - Z - Z - Z - X - Z
chromosome 1, parent 2: Z - Z - Z - Z - X - Y - X - X - X
The smoother uses a version of another well-known mathematical tool called a Hidden Markov Model. The smoother's job is to try to correct or "smooth" two kinds of mistakes, both of which we can see in this example.
- The first kind of mistake is an unusual assignment amid a run of similar assignments. In the first line above, there is a run of Z's, interrupted by a single X: Z - Z - Z - X - Z. That lone X may have been a close call between X and Z that went the wrong way. In context, however, we can see that it probably should have been a Z. The smoother can correct this to Z - Z - Z - Z - Z.
- The second kind of mistake is one inherited from the phasing step. Phasing algorithms can make a mistake known as a "switch error," where they mix up the DNA from one parent with the DNA from the other. The smoother can switch the ancestry assignments back between your two versions of a given chromosome. In this case, it looks like there's a switch error after the fourth window, so it would be better to keep the runs of X's together and the runs of Z's together.
The smoother will output something like this:
chromosome 1, parent 1: Z - Z - Z - Z - Z - Z - Z - Z - Z
chromosome 1, parent 2: X - X - X - X - X - X - X - X - X
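The first kind of correction can be sketched with a textbook Hidden Markov Model decoded by the Viterbi algorithm. This is a minimal illustration, not our production smoother: it handles a single chromosome copy, does not attempt switch-error correction, and the `p_error` and `p_switch` values are made-up assumptions rather than fitted parameters.

```python
import math

def smooth(labels, populations, p_error=0.2, p_switch=0.05):
    """Viterbi decoding of the most likely ancestry path for one chromosome copy.

    labels      -- per-window first-draft calls (e.g. from a classifier)
    populations -- all possible hidden states (at least two)
    p_error     -- assumed chance that a window's first-draft call is wrong
    p_switch    -- assumed chance that ancestry changes between adjacent windows
    """
    k = len(populations)
    log_emit_ok = math.log(1 - p_error)
    log_emit_bad = math.log(p_error / (k - 1))
    log_stay = math.log(1 - p_switch)
    log_move = math.log(p_switch / (k - 1))

    def emit(state, obs):
        return log_emit_ok if state == obs else log_emit_bad

    def trans(prev, state):
        return log_stay if prev == state else log_move

    # best[s] = log-probability of the best path so far that ends in state s
    best = {s: math.log(1 / k) + emit(s, labels[0]) for s in populations}
    backpointers = []
    for obs in labels[1:]:
        new_best, pointers = {}, {}
        for s in populations:
            prev = max(populations, key=lambda r: best[r] + trans(r, s))
            new_best[s] = best[prev] + trans(prev, s) + emit(s, obs)
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)

    # Trace back from the most likely final state.
    state = max(populations, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# The run from the example: the lone X is smoothed to Z.
print(" - ".join(smooth("Z Z Z X Z".split(), ["X", "Y", "Z"])))
```

With these assumed parameters, staying in Z and treating the lone X as a misclassification is far more probable than switching ancestry twice, so the decoded path is Z - Z - Z - Z - Z.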
This simplified example illustrates the purpose of the Smoother, but the reality is even more complex. Instead of assigning a single ancestry to each window like we did in this example, the smoother estimates the probabilities of each Ancestry Composition population matching each window of DNA. The following picture shows a concrete example:
This is the output of the smoother analysis of one copy of chromosome 2. Starting on the left, there is a short run of pink, then a wider run of green, then another run of pink. In this chart, pink is the color for Sub-Saharan African ancestry, and green is the color for Native American. The y-axis runs from 0 to 100 percent, and it shows the probability that the DNA in that region of the chromosome comes from each Ancestry Composition population. These pink and green regions fill the entire vertical space of the graph, which means that we are 100 percent confident that the DNA in those regions has Sub-Saharan African and Native American genetic ancestry, respectively.
The next region to the right, between positions 50 and 100 on the x-axis, is a stretch of multi-colored blue. The thickest strip at the bottom is dark teal, which is the color for British & Irish. This segment of DNA has somewhere between a 50 percent and a 60 percent chance of reflecting British & Irish ancestry. The other shades of blue show that the same DNA segment also has a chance of reflecting Italian, Iberian, or French & German ancestry. If you think back to the haplogroup example above, this result makes sense: it is normal for a DNA marker to match reference DNA from lots of places, even if it matches some places better than others. In this example, the result shows that this DNA segment matches reference DNA from all over Europe. We can very confidently conclude that this stretch of DNA reflects European ancestry, but the evidence isn’t strong enough to assign it to one specific region of Europe with high confidence.
Step 4: Calibration
We developed a recalibration scheme based on simulated data that adjusts the ancestry proportions so that a population is not overrepresented relative to how often it actually occurs.
Step 5: Aggregation & Reporting
As a last step we summarize the results for display. The way we do this is to apply a threshold to the probability plot, like so:
Here, we're showing a 70 percent confidence threshold. You can view your own Chromosome Painting at different confidence thresholds ranging from 50 percent (speculative) to 90 percent (conservative).
Any region of the genome with a probability exceeding the threshold (in this case 70 percent) gets assigned as that ancestry. You can see that with the exception of the blue European stretch, we exceed this threshold over the majority of the chromosome.
- In the case of the European segment, no single population exceeds our 70 percent threshold, so we won't report that DNA as coming from any of those populations. In this case, we refer to our hierarchy of reference populations. For example, we have a "Northwestern European" group that contains three reference populations: Britain & Ireland, Scandinavia, and France & Germany. We will add up the contributions of each of these subgroups, and see if the total "Northwestern European" contribution exceeds the 70 percent threshold. If it does, then we would report the region as "Broadly Northwestern European".
- In this particular case, the Northwestern European reference populations do not exceed 70 percent. So we go up another step, and see if the contributions of all the European populations exceed 70 percent. They do, and so we would report this region as "Broadly European."
- In regions where we go all the way to the top of our hierarchy, but no group of populations exceeds the threshold in place, we report "Unassigned."
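The roll-up logic described above can be sketched as follows. The hierarchy in `PARENT` is a small, hypothetical fragment built only from the groups named in this article, the function names are illustrative, and the probabilities in the example are invented to mirror the European segment discussed earlier.

```python
# Hypothetical, partial hierarchy: child -> parent group.
PARENT = {
    "British & Irish": "Northwestern European",
    "Scandinavian": "Northwestern European",
    "French & German": "Northwestern European",
    "Northwestern European": "European",
    "Italian": "European",
    "Iberian": "European",
}
GROUPS = set(PARENT.values())

def ancestors(name):
    """Yield the groups above `name`, nearest first."""
    while name in PARENT:
        name = PARENT[name]
        yield name

def report_label(probs, threshold=0.7):
    """Return the most specific label whose summed probability exceeds the threshold."""
    # Add each population's probability to itself and to every group above it.
    totals = {}
    for pop, p in probs.items():
        for node in (pop, *ancestors(pop)):
            totals[node] = totals.get(node, 0.0) + p

    passing = [node for node, total in totals.items() if total > threshold]
    if not passing:
        return "Unassigned"
    # The deepest passing node in the hierarchy is the most specific one.
    best = max(passing, key=lambda node: len(list(ancestors(node))))
    return f"Broadly {best}" if best in GROUPS else best

# Invented probabilities resembling the European segment in the example.
segment = {"British & Irish": 0.55, "French & German": 0.10,
           "Italian": 0.20, "Iberian": 0.15}
print(report_label(segment, threshold=0.7))
```

Here no single population passes 70 percent, and the Northwestern European total (about 65 percent) falls short as well, so the segment is reported as "Broadly European". Lowering the threshold to 50 percent would report it as British & Irish instead.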
The populations assigned to you are displayed in your online account.
You can learn more about how Ancestry Composition works by reading our Ancestry Composition Guide or viewing our more technical ASHG poster.