The Genographic Project is studying the genetic signatures of ancient human migrations and creating an open-source research database. It allows members of the public to participate in a real-time anthropological genetics study by submitting personal samples for analysis and donating the genetic results to the database.
In its first scientific publication, the project team reports on genotyping human mitochondrial DNA during the project's first 18 months.
To make sorting and cataloguing so much data easier, they created the Nearest Neighbor haplogroup prediction tool. The accurate classification of genetic lineages into distinct branches on the human family tree, known as haplogroups, has long been a struggle for anthropologists.
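This summary does not describe how the Nearest Neighbor tool works internally, so the following is only a minimal sketch of nearest-neighbor classification in general, not the project's implementation. It assumes each haplotype is represented as a set of variant positions and that distance is the number of positions at which two haplotypes differ. The function names, the labels HgA and HgB, and the position numbers are all invented for illustration.

```python
from collections import Counter

def haplotype_distance(a, b):
    """Count the positions present in one haplotype but not the other."""
    return len(a ^ b)

def predict_haplogroup(query, reference, k=1):
    """Return the majority label among the k reference haplotypes closest to query."""
    ranked = sorted(reference, key=lambda entry: haplotype_distance(query, entry[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy reference set: (variant positions, haplogroup label).
# Positions and labels are made up; they are not real data.
REFERENCE = [
    (frozenset({16069, 16126}), "HgA"),
    (frozenset({16069, 16126, 16145}), "HgA"),
    (frozenset({16223, 16278}), "HgB"),
    (frozenset({16223, 16278, 16362}), "HgB"),
]

# A query that shares two positions with the HgA entries and none with HgB.
prediction = predict_haplogroup(frozenset({16069, 16126, 16193}), REFERENCE, k=3)
print(prediction)
```

In a sketch like this, a larger and better-labelled reference set directly improves predictions, which is one reason a growing reference database matters for classification.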
Samples are collected in two ways. As a foundation, the project comprises a consortium of ten scientific teams from around the world, each responsible for sample collection and analysis in its own region. In addition, the project promotes public participation in countries around the world, and anyone can take part by purchasing a participation kit.
Mitochondrial DNA (mtDNA), typed in female participants, is inherited from the mother without recombination, which makes it particularly informative about maternal ancestry.
Over the first 18 months of public participation, the project built the largest database of mtDNA variants to date, containing 78,590 entries from around the world. Here, they describe the procedures used to generate, manage, and analyze the genetic data, along with the first insights drawn from them, so that scientists can understand new aspects of the structure of the mtDNA tree and develop better ways of classifying mtDNA.
They have released this dataset and the new methods they have developed and will continue to update them as more people join the Genographic Project.
A total of 78,590 mtDNA samples were analyzed: 41,552 were genotyped with a panel of 10 SNPs, 5,046 with 20 SNPs, 15,021 with 21 SNPs, and 16,971 with 22 SNPs. Samples whose SNP genotyping result was summarized as “uninformative” were excluded from the analysis, as were heteroplasmic positions.
Therefore, they consider three different versions of the database: (1) The entire database: 76,638 samples. (2) The reference database made up of the subset of samples genotyped with the panel of 22 SNPs, currently comprising 16,609 samples. This reference database is expanding, as all new samples are genotyped with these 22 SNPs. (3) The consented database, released to the public with the participants' consent. So far, data from 21,141 samples (7,174 of which belong to the reference database) have been donated to the scientific community and are reported in Dataset S1.
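The figures quoted above can be cross-checked with a little arithmetic. The short sketch below uses only numbers stated in this article; the variable names are ours, and reading the gaps between totals as excluded samples is an inference from the text rather than something the article states outright.

```python
# Samples analyzed, keyed by the size of the SNP panel used.
panel_counts = {10: 41_552, 20: 5_046, 21: 15_021, 22: 16_971}

total_analyzed = sum(panel_counts.values())
entire_database = 76_638
reference_database = 16_609
consented = 21_141
consented_in_reference = 7_174

# Gap between samples analyzed and samples retained (presumably the exclusions).
excluded_overall = total_analyzed - entire_database
# Same gap for the 22-SNP panel versus the reference database.
excluded_from_22_panel = panel_counts[22] - reference_database

# Share of each database that participants have released to the public.
consented_share = consented / entire_database
consented_reference_share = consented_in_reference / reference_database

print(total_analyzed, excluded_overall, excluded_from_22_panel)
print(f"{consented_share:.1%} of the entire database is consented")
print(f"{consented_reference_share:.1%} of the reference database is consented")
```

The four panel counts add up to the 78,590 samples reported, and the entire database is 1,952 samples smaller than that total.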
Analyses using complete haplotype information are restricted to this consented dataset. The database presents the following information about each sample: a sequential serial number (different from the anonymous Genographic participant ID number), the number of SNPs genotyped, the results of all genotyped SNPs, the haplogroup (Hg) inferred from the SNP genotyping, the final Hg assigned in the current study, and the haplotype for hypervariable segment I (HVS-I).
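As a rough illustration of the per-sample record described above, the fields could be modelled as follows. This is a sketch only: the field names, types, and example values are hypothetical and do not reflect the actual column names or contents of Dataset S1.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SampleRecord:
    """One database entry, with the six pieces of information listed in the article."""
    serial_number: int            # sequential, not the participant ID
    snps_genotyped: int           # panel size: 10, 20, 21, or 22
    snp_results: dict[str, str]   # SNP name -> observed allele (hypothetical encoding)
    hg_from_snps: str             # haplogroup inferred from SNP genotyping
    hg_final: str                 # haplogroup assigned in the study
    hvs1_haplotype: frozenset[int]  # HVS-I variant positions (hypothetical encoding)

    def in_reference_panel(self) -> bool:
        """True if the sample was genotyped with the full 22-SNP panel."""
        return self.snps_genotyped == 22

# Placeholder values for illustration; not real data.
example = SampleRecord(
    serial_number=1,
    snps_genotyped=22,
    snp_results={"snp_1": "A"},
    hg_from_snps="HgA",
    hg_final="HgA",
    hvs1_haplotype=frozenset({16069, 16126}),
)
print(example.in_reference_panel())
```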
Read the full report at PLoS.