About the 1000 Genomes Project
Project Overview
Project Design
Use of the Project data and samples
Samples included in the project
Publications and project documents
Project Overview
Recent improvements in sequencing technology ("next-gen" sequencing platforms) have sharply reduced the cost of sequencing. The 1000 Genomes Project is the first project to sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation.
As with other major human genome reference projects, data from the 1000 Genomes Project will be made available quickly to the worldwide scientific community through freely accessible public databases. (See Data use statement.)
The goal of the 1000 Genomes Project is to find most genetic variants that have frequencies of at least 1% in the populations studied. This goal can be attained by sequencing many individuals lightly. To sequence a person's genome, many copies of the DNA are broken into short pieces and each piece is sequenced. The many copies of DNA mean that the DNA pieces are more-or-less randomly distributed across the genome. The pieces are then aligned to the reference sequence and joined together. To find the complete genomic sequence of one person with current sequencing platforms requires sequencing that person's DNA the equivalent of about 28 times (called 28X). If the amount of sequence done is only an average of once across the genome (1X), then much of the sequence will be missed, because some genomic locations will be covered by several pieces while others will have none. The deeper the sequencing coverage, the more of the genome will be covered at least once. Also, people are diploid; the deeper the sequencing coverage, the more likely that both chromosomes at a location will be included. In addition, deeper coverage is particularly useful for detecting structural variants, and allows sequencing errors to be corrected.
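The relationship between sequencing depth and how much of the genome is seen can be sketched with a simple Poisson model. This is an idealized illustration, not the Project's own calculation: it assumes reads land uniformly at random, so the number of reads covering a site is Poisson-distributed with mean equal to the depth, and that each of the two chromosome copies independently receives half of that depth. The function names are hypothetical.

```python
import math


def fraction_covered(depth: float, min_reads: int = 1) -> float:
    """Probability that a site is covered by at least `min_reads` reads,
    assuming the per-site read count is Poisson with mean `depth`."""
    p_fewer = sum(
        math.exp(-depth) * depth**k / math.factorial(k) for k in range(min_reads)
    )
    return 1.0 - p_fewer


def both_chromosomes_sampled(depth: float) -> float:
    """Probability that each of the two chromosome copies is seen at least once,
    assuming each copy independently receives Poisson(depth / 2) reads."""
    return (1.0 - math.exp(-depth / 2.0)) ** 2


for depth in (1, 4, 28):
    print(
        f"{depth:>2}X: covered at least once = {fraction_covered(depth):.4f}, "
        f"both chromosomes seen = {both_chromosomes_sampled(depth):.4f}"
    )
```

Under these assumptions, about 63% of sites are covered at least once at 1X and about 98% at 4X, while both chromosome copies are seen at only about 75% of sites at 4X. Real data deviate from this model because coverage is not perfectly uniform.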
Sequencing is still too expensive to deeply sequence the many samples being studied for this project. However, any particular region of the genome generally contains a limited number of haplotypes. Data can be combined across many samples to allow efficient detection of most of the variants in a region. The Project currently plans to sequence each sample to about 4X coverage; at this depth sequencing cannot provide the complete genotype of each sample, but should allow the detection of most variants with frequencies as low as 1%. Combining the data from 2500 samples should allow highly accurate estimation (imputation) of the variants and genotypes for each sample that were not seen directly by the light sequencing.
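A rough calculation shows why light sequencing of many samples can still find a 1% variant. The sketch below is a simplification under stated assumptions: read counts are Poisson-distributed, sequencing errors and alignment problems are ignored, and a variant counts as "seen" once at least two reads carry it. The two-read threshold and the function names are hypothetical choices for illustration.

```python
import math


def poisson_at_least(mean: float, k: int) -> float:
    """P(X >= k) for X ~ Poisson(mean)."""
    below = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))
    return 1.0 - below


def expected_variant_reads(n_samples: int, depth: float, allele_freq: float) -> float:
    """Expected number of reads carrying the variant allele at one site,
    pooled over all samples (each sample contributes `depth` reads on average)."""
    return n_samples * depth * allele_freq


MIN_READS = 2  # hypothetical evidence threshold

# One heterozygous sample at 4X: the variant chromosome gets about 2 reads on average.
single_sample = poisson_at_least(4 / 2, MIN_READS)

# 2500 samples at 4X, variant allele frequency 1%.
pooled_mean = expected_variant_reads(2500, 4, 0.01)
pooled = poisson_at_least(pooled_mean, MIN_READS)

print(f"single heterozygous sample: P(>= {MIN_READS} variant reads) = {single_sample:.3f}")
print(f"pooled: expected variant reads = {pooled_mean:.0f}, P(>= {MIN_READS}) = {pooled:.6f}")
```

In this model a single heterozygous sample shows two or more variant reads only about 59% of the time, whereas the pooled data are expected to contain roughly 100 variant reads, so the variant itself is almost certain to be detected even though individual genotypes remain uncertain.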
Project Design
Pilot Project
Three pilot studies provided data to inform the design of the full-scale project:
| Pilot | Purpose | Coverage | Strategy | Status |
|---|---|---|---|---|
| 1 - low coverage | Assess strategy of sharing data across samples | 2-4X | Whole-genome sequencing of 180 samples | Sequencing completed October 2008 |
| 2 - trios | Assess coverage and platforms and centers | 20-60X | Whole-genome sequencing of 2 mother-father-adult child trios | Sequencing completed October 2008 |
| 3 - gene regions | Assess methods for gene-region-capture | 50X | 1000 gene regions in 900 samples | Sequencing completed June 2009 |
Public releases of summary processed data and variant calls are available through the data tab of this website. The data from the pilot projects have been analyzed, especially to determine whether the strategy of 4X coverage is adequate to meet the goals of the Project. A paper describing the analyses of the pilot data and design of the full project was published in October 2010 in the journal Nature and is available here.
A summary of sequencing done for each of the three pilot projects is available here. The list of samples and allocations is provided in a spreadsheet. Regions targeted in Pilot 3 are available on the project FTP site.
Main Project
The plan for the full project is to sequence about 2,500 samples at 4X coverage. The first set of samples for sequencing includes 1167 samples that already existed or could be collected quickly, from 13 populations, for sequencing in 2010 and early 2011. The second set includes 633 samples that are being collected, from 7 populations, for sequencing in early 2011. The third set, consisting of 700 samples, is expected to be available for sequencing in late 2011. Full details of the samples to be sequenced are below.
Use of the Project data and samples
In genome-wide association (GWA) studies, researchers aim to discover regions of the genome that are associated with particular diseases or traits. They genotype samples from hundreds to thousands of people with a disease and without a disease for hundreds of thousands to about a million SNPs and structural variants. They look for which regions have variants that are more common (containing risk variants) or less common (containing protective variants) in the people with the disease as compared to those without the disease. Because genetic variants in a region are generally associated with many other variants in the region (linkage disequilibrium, LD), a variant associated with a disease is a marker for the region it is in, but the LD patterns do not allow a determination of which particular variant or, frequently, which gene or genetic element, causally contributes to the risk for the disease.
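The case-versus-control comparison at a single variant can be illustrated with a basic allelic chi-square test. This is only a minimal sketch: real GWA analyses typically use regression models with covariates and correct for testing many variants. The allele counts below are made up for illustration, and the function name is hypothetical.

```python
import math


def allelic_association(case_alt: int, case_ref: int, ctrl_alt: int, ctrl_ref: int):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    Returns (chi2, p_value, odds_ratio)."""
    a, b, c, d = case_alt, case_ref, ctrl_alt, ctrl_ref
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the table is empty")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return chi2, p_value, odds_ratio


# Hypothetical counts: 1000 cases and 1000 controls, so 2000 alleles per group.
# The variant allele is at 25% in cases and 20% in controls.
chi2, p_value, odds_ratio = allelic_association(500, 1500, 400, 1600)
print(f"chi2 = {chi2:.2f}, p = {p_value:.2e}, odds ratio = {odds_ratio:.2f}")
```

An odds ratio above 1 corresponds to a variant that is more common in people with the disease, and one below 1 to a variant that is less common in them.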
Disease researchers will use 1000 Genomes data in two ways. They will combine the 1000 Genomes data with the genotype data in their disease GWA study to impute the genotypes in their samples for millions of additional variants beyond those they genotyped directly. They will do this computationally, with no genotyping cost. The additional genotype data will allow the researchers to localize the disease-associated regions more precisely.
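The idea behind imputation can be shown with a deliberately simple toy: find the reference haplotype that best matches a study haplotype at the genotyped sites, then copy its alleles at the untyped sites. This is not how production imputation software works (such tools generally use probabilistic models over many reference haplotypes and handle unphased genotypes). The panel, the study haplotype, and the function name are all hypothetical.

```python
from typing import List, Optional, Sequence


def impute_haplotype(
    study: Sequence[Optional[int]], panel: Sequence[Sequence[int]]
) -> List[int]:
    """Fill untyped sites (None) in `study` from the closest reference haplotype.

    Alleles are coded 0 (reference) or 1 (variant)."""

    def mismatches(ref: Sequence[int]) -> int:
        # Count disagreements only at sites that were genotyped directly.
        return sum(1 for s, r in zip(study, ref) if s is not None and s != r)

    best = min(panel, key=mismatches)
    return [r if s is None else s for s, r in zip(study, best)]


# Hypothetical reference panel: three haplotypes over six variant sites.
panel = [
    [0, 0, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 1],
]
# Study haplotype genotyped at sites 0, 2 and 5 only.
study = [1, None, 0, None, None, 0]

imputed = impute_haplotype(study, panel)
print(imputed)
```

Here the second reference haplotype agrees with the study haplotype at every genotyped site, so its alleles fill the three untyped positions.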
Once the disease-associated regions are identified, researchers will want to know all of the variants in those regions. The Project will provide data on almost all of the variants with a frequency of at least 1% in the populations studied. This will save disease researchers the time and expense of having to sequence their own samples. Although this list will not tell them which variants cause the increased risk for the disease, it will give them the set of suspects. They will then do experimental studies to determine which genes, genetic elements, and variants functionally cause the increased risk of disease.
Many researchers will compare the allele frequencies and LD patterns that they find in their studies with those in the 1000 Genomes data. Researchers will also examine the 1000 Genomes data to study recombination, natural selection, and population structure and admixture.
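One common way to summarize LD between two variants is the squared correlation r² computed from phased haplotypes. The sketch below uses a small made-up set of haplotypes and a hypothetical function name; it is meant only to make the quantity concrete.

```python
from typing import Sequence, Tuple


def ld_r_squared(haplotypes: Sequence[Tuple[int, int]]) -> float:
    """r^2 between two biallelic sites, given haplotypes as (allele_a, allele_b)
    pairs with alleles coded 0 or 1."""
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a site is not polymorphic")
    d = p_ab - p_a * p_b  # the LD coefficient D
    return d * d / denom


# Hypothetical sample of 10 haplotypes at two nearby sites.
haps = [(1, 1)] * 4 + [(0, 0)] * 4 + [(1, 0), (0, 1)]
r2 = ld_r_squared(haps)
print(f"r^2 = {r2:.2f}")
```

Values near 1 mean that the allele at one site almost determines the allele at the other, which is why a genotyped variant can serve as a marker for its neighbors.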
DNA from the samples will be available for researchers to follow up with more detailed studies of variation in particular regions. The cell lines will allow researchers to study cellular phenotypes such as gene expression, epigenetic patterns, and drug response. The extensive genotype data available on the samples, and the trio samples for some of the populations, will allow researchers to map regions of the genome computationally that affect the cellular phenotypes, and to study the heritability of these phenotypes. See http://ccr.coriell.org/sections/Collections/NHGRI/hapmap.aspx?PgId=266&coll=GM
Samples included in the project
The 1000 Genomes Project developed guidelines on ethical considerations for investigators doing sampling, outlined in the Informed Consent Background Document and the Informed Consent Form Template for new collections. All new collections included in the Project followed these ethical guidelines and model informed consent language. The 1000 Genomes Project Steering Committee, with input from the Project's Samples and ELSI Group, made final decisions about which populations and sample sets to include in the Project.
The samples for the 1000 Genomes Project are mostly anonymous and have no associated medical or phenotype data; for some of the populations the collectors have phenotype data, but these data are not at Coriell and are not distributed. Although the 1000 Genomes samples have no phenotype data, the genetic variation data produced by the Project will be used by researchers to study many diseases, in sets of disease and control samples that have been carefully phenotyped. More information about using the 1000 Genomes Project data is available on the data tab of this website.
Below is a summary of the samples that the 1000 Genomes Project plans to sequence for the full-scale study. The first set of samples for sequencing includes those available by the end of 2009; the second set includes those expected to be available by early 2011.
1000 Genomes Samples, 16-Feb-12

| Population | When cell line avail. (approx) | DNA sequenced from blood | Offspring samples from trios avail. | First set | Second set | Third set | Total |
|---|---|---|---|---|---|---|---|
| Utah residents (CEPH) with Northern and Western European ancestry (CEU) | Available | no | yes | 100 | | | 100 |
| Toscani in Italia (TSI) | Available | no | no | 100 | | | 100 |
| British from England and Scotland (GBR) | Available | no | no | 96 | 4 | | 100 |
| Finnish from Finland (FIN) | Available | no | no | 100 | | | 100 |
| Iberian populations in Spain (IBS) | Available | no | yes | 30 | 70 | | 100 |
| TOTAL European ancestry | | | | 426 | 74 | | 500 |
| Han Chinese in Beijing, China (CHB) | Available | no | no | 100 | | | 100 |
| Japanese in Tokyo, Japan (JPT) | Available | no | no | 100 | | | 100 |
| Han Chinese South (CHS) | Available | most | yes | 100 | | | 100 |
| Chinese Dai in Xishuangbanna (CDX) | Feb 2012 | some | no | | 100 | | 100 |
| Kinh in Ho Chi Minh City, Vietnam (KHV) | Available | some | some | | 100 | | 100 |
| Chinese in Denver, Colorado (CHD) (pilot 3 only) | Available | no | no | | | | 0 |
| TOTAL East Asian ancestry | | | | 300 | 200 | | 500 |
| Yoruba in Ibadan, Nigeria (YRI) | Available | no | yes | 100 | | | 100 |
| Luhya in Webuye, Kenya (LWK) | Available | no | no | 100 | | | 100 |
| Gambian in Western Division, The Gambia (GWD) | Aug 2012 | no | yes | | | 100 | 100 |
| Mende in Sierra Leone (MSL) | Aug 2012 | no | yes | | | 100 | 100 |
| Esan in Nigeria (ESN) | Aug 2012 | no | yes | | | 100 | 100 |
| TOTAL West African ancestry | | | | 200 | | 300 | 500 |
| African Ancestry in Southwest US (ASW) | Available | no | some | 61 | 1 | | 62 |
| African Caribbean in Barbados (ACB) | Available | yes | yes | | 79 | 21 | 100 |
| Mexican Ancestry in Los Angeles, CA (MXL) | Available | no | yes | 70 | | | 70 |
| Puerto Rican in Puerto Rico (PUR) | Available | yes | yes | 70 | | 20 | 90 |
| Colombian in Medellin, Colombia (CLM) | Available | no | yes | 70 | | 19 | 89 |
| Peruvian in Lima, Peru (PEL) | Available | yes | yes | | 70 | 19 | 89 |
| TOTAL Americas | | | | 271 | 150 | 79 | 500 |
| Gujarati Indian in Houston, TX (GIH) | Available | no | no | | 100 | | 100 |
| Punjabi in Lahore, Pakistan (PJL) | May-Aug 2012 | yes | yes | | | 100 | 100 |
| Bengali in Bangladesh (BEB) | Aug 2012 | no | yes | | | 100 | 100 |
| Sri Lankan Tamil in the UK (STU) | Aug 2012 | yes | yes | | | 100 | 100 |
| Indian Telugu in the UK (ITU) | Aug 2012 | yes | yes | | | 100 | 100 |
| TOTAL South Asian ancestry | | | | | 100 | 400 | 500 |
| TOTAL | | | | 1197 | 524 | 779 | 2500 |
Note 1: Some samples in the first (or second) set are being sequenced with the second (or third) set of samples.
Note 2: All samples are being sequenced lightly and in exomes. 322 samples and their children (161 trios) and 11 unrelated samples are being sequenced deeply.
Current list of samples planned for sequencing
The list of samples for sequencing as of 6 May 2011 is provided in a spreadsheet. A pedigree file is also available.
The samples, as cell lines and DNA, are available to researchers from the non-profit Coriell Institute for Medical Research, or will become available when the cell lines have been made. For some of the populations, additional samples are available, including ones from the children in mother-father-adult child trios. See http://ccr.coriell.org/sections/Collections/NHGRI/hapmap.aspx?PgId=266&coll=GM
Publications and project documents
Project Documents
May 2011
1000 Genomes Browser Quick Start Guide
October 2010
1000 genomes pilot paper
September 2007
Report of the 1000 Genomes planning meeting