About the 1000 Genomes Project
Project Overview
Recent improvements in sequencing technology ("next-gen" sequencing platforms) have sharply reduced the cost of sequencing. The 1000 Genomes Project is the first project to sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation.
As with other major human genome reference projects, data from the 1000 Genomes Project will be made available quickly to the worldwide scientific community through freely accessible public databases. (See Data use statement.)
The goal of the 1000 Genomes Project is to find most genetic variants that have frequencies of at least 1% in the populations studied. This goal can be attained by sequencing many individuals lightly. To sequence a person's genome, many copies of the DNA are broken into short pieces and each piece is sequenced. The many copies of DNA mean that the DNA pieces are more-or-less randomly distributed across the genome. The pieces are then aligned to the reference sequence and joined together. Finding the complete genomic sequence of one person with current sequencing platforms requires sequencing that person's DNA the equivalent of about 28 times (called 28X coverage). If the amount of sequence done is only an average of once across the genome (1X), then much of the sequence will be missed, because some genomic locations will be covered by several pieces while others will have none. The deeper the sequencing coverage, the more of the genome will be covered at least once. Also, people are diploid; the deeper the sequencing coverage, the more likely that both chromosomes at a location will be included. In addition, deeper coverage is particularly useful for detecting structural variants, and allows sequencing errors to be corrected.
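The relationship between coverage depth and how much of the genome is seen can be illustrated with a small calculation. The sketch below assumes an idealized model that the text does not state explicitly: reads land uniformly at random, so the number of reads over any site follows a Poisson distribution with mean equal to the coverage, and each of the two chromosome copies receives half of that depth. The function names are illustrative only. Under this assumption, about 37% of sites are missed at 1X and about 2% at 4X.

```python
import math


def fraction_covered(depth: float, min_reads: int = 1) -> float:
    """Expected fraction of sites covered by at least `min_reads` reads.

    Assumes reads fall uniformly at random (Poisson model), which real
    sequencing data only approximates.
    """
    p_fewer = sum(
        math.exp(-depth) * depth ** k / math.factorial(k)
        for k in range(min_reads)
    )
    return 1.0 - p_fewer


def prob_both_chromosomes_seen(depth: float) -> float:
    """Probability that both chromosome copies at a site are read at least once.

    Assumes each copy independently receives Poisson(depth / 2) reads.
    """
    p_one_copy_seen = 1.0 - math.exp(-depth / 2.0)
    return p_one_copy_seen ** 2


if __name__ == "__main__":
    for depth in (1, 4, 28):
        print(
            f"{depth:>2}X: covered at least once = {fraction_covered(depth):.4f}, "
            f"both chromosomes seen = {prob_both_chromosomes_seen(depth):.4f}"
        )
```

Real data depart from this model (coverage varies with sequence content and mapping quality), so the numbers are best read as rough guides rather than predictions.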
Sequencing is still too expensive to deeply sequence the many samples being studied for this project. However, any particular region of the genome generally contains a limited number of haplotypes. Data can be combined across many samples to allow efficient detection of most of the variants in a region. The Project currently plans to sequence each sample to about 4X coverage; at this depth sequencing cannot provide the complete genotype of each sample, but should allow the detection of most variants with frequencies as low as 1%. Combining the data from 2500 samples should allow highly accurate estimation (imputation) of the variants and genotypes for each sample that were not seen directly by the light sequencing.
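A rough calculation shows why light sequencing of many samples can still find variants at 1% frequency. The sketch below is an assumption-laden illustration, not the Project's actual variant-calling method: it assumes 2500 diploid samples, Poisson-distributed reads with each chromosome copy receiving half the depth, and it ignores sequencing errors. With 2500 samples at 4X and an allele frequency of 1%, the expected number of reads carrying the variant is 2500 × 4 × 0.01 = 100, even though any single sample rarely shows it.

```python
import math


def expected_variant_reads(n_samples: int, depth: float, allele_freq: float) -> float:
    """Expected number of reads carrying the variant allele, pooled over samples.

    Each sample contributes `depth` reads on average at the site, and a
    fraction `allele_freq` of those reads come from a variant chromosome.
    """
    return n_samples * depth * allele_freq


def prob_variant_missed(n_samples: int, depth: float, allele_freq: float) -> float:
    """Probability that no read in any sample carries the variant allele.

    Each of the 2 * n_samples chromosome copies carries the allele with
    probability `allele_freq`; a carried copy is read at least once with
    probability 1 - exp(-depth / 2) under the Poisson assumption.
    """
    p_copy_carries_and_is_seen = allele_freq * (1.0 - math.exp(-depth / 2.0))
    return (1.0 - p_copy_carries_and_is_seen) ** (2 * n_samples)


if __name__ == "__main__":
    print("expected variant reads:", expected_variant_reads(2500, 4, 0.01))
    print("P(missed) in one sample:", prob_variant_missed(1, 4, 0.01))
    print("P(missed) in 2500 samples:", prob_variant_missed(2500, 4, 0.01))
```

Seeing the allele in at least one read is a weaker condition than confidently calling the variant, since real callers must also separate true variants from sequencing errors.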
Three pilot studies provided data to inform the design of the full-scale project:
|Pilot||Purpose||Coverage||Strategy||Status|
|1 - low coverage||Assess strategy of sharing data across samples||2-4X||Whole-genome sequencing of 180 samples||Sequencing completed October 2008|
|2 - trios||Assess coverage and platforms and centers||20-60X||Whole-genome sequencing of 2 mother-father-adult child trios||Sequencing completed October 2008|
|3 - gene regions||Assess methods for gene-region-capture||50X||1000 gene regions in 900 samples||Sequencing completed June 2009|
Public releases of summary processed data and variant calls are available through the data tab of this website. The data from the pilot projects have been analyzed, especially to determine whether the strategy of 4X coverage is adequate to meet the goals of the Project. A paper describing the analyses of the pilot data and design of the full project was published in October 2010 in the journal Nature and is available here.
A summary of sequencing done for each of the three pilot projects is available here, and the list of samples and allocations is provided in a spreadsheet. Regions targeted in Pilot 3 are available on the project FTP site.
The plan for the full project is to sequence about 2,500 samples at 4X coverage. The first set of samples for sequencing includes 1167 samples that already existed or could be collected quickly, from 13 populations, for sequencing in 2010 and early 2011. The second set includes 633 samples that are being collected, from 7 populations, for sequencing in early 2011. The third set, consisting of 700 samples, is expected to be available for sequencing in late 2011. Full details of the samples to be sequenced are below.
Use of the Project data and samples
In genome-wide association (GWA) studies, researchers aim to discover regions of the genome that are associated with particular diseases or traits. They genotype samples from hundreds to thousands of people with a disease and without a disease for hundreds of thousands to about a million SNPs and structural variants. They look for which regions have variants that are more common (containing risk variants) or less common (containing protective variants) in the people with the disease as compared to those without the disease. Because genetic variants in a region are generally associated with many other variants in the region (linkage disequilibrium, LD), a variant associated with a disease is a marker for the region it is in, but the LD patterns do not allow a determination of which particular variant or, frequently, which gene or genetic element, causally contributes to the risk for the disease.
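Linkage disequilibrium between two variants is commonly summarized by the squared correlation r². The sketch below computes r² for two biallelic sites from phased haplotypes. The function name and the toy haplotypes are made up for illustration and are not Project data. An r² near 1 means one variant is a good marker for the other, which is why an associated SNP flags a region rather than a specific causal variant.

```python
def ld_r_squared(haplotypes):
    """Return r^2 between two biallelic sites.

    `haplotypes` is a list of (a, b) tuples, one per chromosome, where
    a and b are 0 (reference allele) or 1 (alternate allele).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # alternate allele frequency at site A
    p_b = sum(b for _, b in haplotypes) / n   # alternate allele frequency at site B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic in the sample")
    return d * d / denom


if __name__ == "__main__":
    # Toy data: alleles always travel together (complete LD).
    linked = [(0, 0), (0, 0), (1, 1), (1, 1)]
    # Toy data: alleles occur in all combinations equally (no LD).
    unlinked = [(0, 0), (0, 1), (1, 0), (1, 1)]
    print("linked r^2:", ld_r_squared(linked))
    print("unlinked r^2:", ld_r_squared(unlinked))
```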
Disease researchers will use 1000 Genomes data in two ways. They will combine the 1000 Genomes data with the genotype data in their disease GWA study to impute the genotypes in their samples for millions of additional variants beyond those they genotyped directly. They will do this computationally, with no genotyping cost. The additional genotype data will allow the researchers to localize the disease-associated regions more precisely.
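The idea behind imputation can be sketched with a deliberately simplified example. The code below fills in untyped sites on a study haplotype by copying from the best-matching haplotype in a reference panel. This nearest-haplotype rule is a stand-in chosen for brevity, not the method used by actual imputation software, which relies on probabilistic models over many reference haplotypes. The function name and the toy panel are hypothetical.

```python
def impute_haplotype(observed, reference_panel):
    """Fill untyped sites in `observed` from the closest reference haplotype.

    observed: list with 0/1 at genotyped sites and None at untyped sites.
    reference_panel: list of complete haplotypes (lists of 0/1) of the
    same length as `observed`.
    """
    def mismatches(ref):
        # Count disagreements at the sites that were actually genotyped.
        return sum(
            1 for o, r in zip(observed, ref) if o is not None and o != r
        )

    best = min(reference_panel, key=mismatches)  # ties go to the first match
    # Keep genotyped alleles; copy the reference allele where data is missing.
    return [r if o is None else o for o, r in zip(observed, best)]


if __name__ == "__main__":
    # Toy reference panel standing in for densely sequenced haplotypes.
    panel = [
        [0, 0, 1, 0, 1],
        [1, 1, 0, 1, 0],
        [0, 1, 1, 0, 0],
    ]
    # Study haplotype genotyped at sites 0, 2 and 4 only.
    study = [1, None, 0, None, 0]
    print(impute_haplotype(study, panel))
```

The sketch shows why a denser reference panel helps: the more variants the panel carries, the more untyped sites can be inferred from the same genotyped markers.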
Once the disease-associated regions are identified, researchers will want to know all of the variants in those regions. The Project will provide data on almost all of the variants with a frequency of at least 1% in the populations studied. This will save disease researchers the time and expense of having to sequence their own samples. Although this list will not tell them which variants cause the increased risk for the disease, it will give them the set of suspects. They will then do experimental studies to determine which genes, genetic elements, and variants functionally cause the increased risk of disease.
Many researchers will compare the allele frequencies and LD patterns that they find in their studies with those in the 1000 Genomes data. Researchers will also examine the 1000 Genomes data to study recombination, natural selection, and population structure and admixture.
DNA from the samples will be available for researchers to follow up with more detailed studies of variation in particular regions. The cell lines will allow researchers to study cellular phenotypes such as gene expression, epigenetic patterns, and drug response. The extensive genotype data available on the samples, and the trio samples for some of the populations, will allow researchers to map regions of the genome computationally that affect the cellular phenotypes, and to study the heritability of these phenotypes. See http://ccr.coriell.org/sections/Collections/NHGRI/hapmap.aspx?PgId=266&coll=GM
Samples included in the project
The 1000 Genomes Project developed guidelines on ethical considerations for investigators doing sampling, outlined in the Informed Consent Background Document and the Informed Consent Form Template for new collections. All new collections included in the Project followed these ethical guidelines and model informed consent language. The 1000 Genomes Project Steering Committee, with input from the Project's Samples and ELSI Group, made final decisions about which populations and sample sets to include in the Project.
The samples for the 1000 Genomes Project are mostly anonymous and have no associated medical or phenotype data; for some of the populations the collectors have phenotype data, but these data are not at Coriell and are not distributed. Although the 1000 Genomes samples have no phenotype data, the genetic variation data produced by the Project will be used by researchers to study many diseases, in sets of disease and control samples that have been carefully phenotyped. More information about using the 1000 Genomes Project data is available on the data tab of this website.
Below is a summary of the samples the 1000 Genomes Project has sequenced. A full list of these samples is available from our sample spreadsheet. Cell lines for all these samples should be available from the Coriell Cell Repository.
|1000 Genomes Samples|
|Population||DNA sequenced from blood||Offspring Samples from Trios Available||Pilot Samples||Phase 1 Samples||Final Phase Samples||Total|
|Chinese Dai in Xishuangbanna, China (CDX)||no||yes||0||0||99||99|
|Han Chinese in Beijing, China (CHB)||no||no||91||97||103||106|
|Japanese in Tokyo, Japan (JPT)||no||no||94||89||104||105|
|Kinh in Ho Chi Minh City, Vietnam (KHV)||yes||yes||0||0||101||101|
|Southern Han Chinese, China (CHS)||no||yes||0||100||108||112|
|Total East Asian Ancestry (ASN)||185||286||515||523|
|Bengali in Bangladesh (BEB)||no||yes||0||0||86||86|
|Gujarati Indian in Houston, TX (GIH)||no||yes||0||0||106||106|
|Indian Telugu in the UK (ITU)||yes||yes||0||0||103||103|
|Punjabi in Lahore, Pakistan (PJL)||yes||yes||0||0||96||96|
|Sri Lankan Tamil in the UK (STU)||yes||yes||0||0||103||103|
|Total South Asian Ancestry (SAN)||0||0||494||494|
|African Ancestry in Southwest US (ASW)||no||yes||0||61||66||66|
|African Caribbean in Barbados (ACB)||yes||yes||0||0||96||96|
|Esan in Nigeria (ESN)||no||yes||0||0||99||99|
|Gambian in Western Division, The Gambia (GWD)||no||yes||0||0||113||113|
|Luhya in Webuye, Kenya (LWK)||no||yes||102||97||101||116|
|Mende in Sierra Leone (MSL)||no||yes||0||0||85||85|
|Yoruba in Ibadan, Nigeria (YRI)||no||yes||106||88||109||116|
|Total African Ancestry (AFR)||208||246||669||691|
|British in England and Scotland (GBR)||no||yes||0||89||92||94|
|Finnish in Finland (FIN)||no||no||0||93||99||100|
|Iberian populations in Spain (IBS)||no||yes||0||14||107||107|
|Toscani in Italy (TSI)||no||no||66||98||108||110|
|Utah residents with Northern and Western European ancestry (CEU)||no||yes||94||85||99||103|
|Total European Ancestry (EUR)||160||379||505||514|
|Colombian in Medellin, Colombia (CLM)||no||yes||0||60||94||95|
|Mexican Ancestry in Los Angeles, California (MXL)||no||yes||0||66||67||69|
|Peruvian in Lima, Peru (PEL)||yes||yes||0||0||86||86|
|Puerto Rican in Puerto Rico (PUR)||yes||yes||0||55||105||105|
|Total Americas Ancestry (AMR)||0||181||352||355|
Note: Some samples from the pilot or Phase 1 did not progress to the final phase for reasons of quality control or changing criteria with respect to complete samples. This causes some discrepancies in the total numbers of samples compared to the numbers in the final phase.
Summary of all Samples
A spreadsheet detailing the samples we are sequencing, and related samples for which we have HD genotype data, is available from the FTP site. The columns are documented in the README. There is also a pedigree file available.
The samples, as cell lines and DNA, are available to researchers from the non-profit Coriell Institute for Medical Research, or will become available when the cell lines have been made. For some of the populations, additional samples are available, including ones from the children in mother-father-adult child trios. See http://ccr.coriell.org/sections/Collections/NHGRI/hapmap.aspx?PgId=266&coll=GM
Summaries of how much sequencing was carried out for the pilot and Phase 1 studies are in the supplementary information of the two respective publications: http://www.nature.com/nature/journal/v467/n7319/extref/nature09534-s2.xls and http://www.nature.com/nature/journal/v491/n7422/extref/nature11632-s1.pdf