Project Overview

Recent improvements in sequencing technology ("next-gen" sequencing platforms) have sharply reduced the cost of sequencing. The 1000 Genomes Project is the first project to sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation.

As with other major human genome reference projects, data from the 1000 Genomes Project will be made available quickly to the worldwide scientific community through freely accessible public databases. (See Data use statement.)

The goal of the 1000 Genomes Project is to find most genetic variants that have frequencies of at least 1% in the populations studied. This goal can be attained by sequencing many individuals lightly. To sequence a person's genome, many copies of the DNA are broken into short pieces and each piece is sequenced. The many copies of DNA mean that the DNA pieces are more-or-less randomly distributed across the genome. The pieces are then aligned to the reference sequence and joined together. To find the complete genomic sequence of one person with current sequencing platforms requires sequencing that person's DNA the equivalent of about 28 times (called 28X). If the amount of sequence done is only an average of once across the genome (1X), then much of the sequence will be missed, because some genomic locations will be covered by several pieces while others will have none. The deeper the sequencing coverage, the more of the genome will be covered at least once. Also, people are diploid; the deeper the sequencing coverage, the more likely that both chromosomes at a location will be included. In addition, deeper coverage is particularly useful for detecting structural variants, and allows sequencing errors to be corrected.
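The relationship between coverage depth and how much of the genome is sampled can be illustrated with an idealized model. The sketch below assumes that reads land uniformly at random, so the number of reads covering a position follows a Poisson distribution whose mean is the coverage, and that reads at a diploid position split evenly between the two chromosome copies. This is a textbook simplification, not the Project's analysis method, and the function names are hypothetical.

```python
import math


def fraction_covered(coverage: float) -> float:
    """Expected fraction of positions covered by at least one read,
    assuming read depth per position is Poisson(coverage)."""
    return 1.0 - math.exp(-coverage)


def fraction_both_copies_covered(coverage: float) -> float:
    """Expected fraction of diploid positions at which each of the two
    chromosome copies is covered at least once, assuming each copy
    independently receives Poisson(coverage / 2) reads."""
    per_copy = 1.0 - math.exp(-coverage / 2.0)
    return per_copy ** 2


if __name__ == "__main__":
    # Depths mentioned in the text: 1X, the planned 4X, and 28X.
    for depth in (1, 4, 28):
        print(f"{depth:>2}X: covered at least once {fraction_covered(depth):.4f}, "
              f"both copies covered {fraction_both_copies_covered(depth):.4f}")
```

Under these assumptions only about 63% of positions are covered at least once at 1X, which is why much of the sequence is missed at that depth, and the fraction with both chromosome copies covered rises more slowly than the fraction covered at all.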

Sequencing is still too expensive to deeply sequence the many samples being studied for this project. However, any particular region of the genome generally contains a limited number of haplotypes. Data can be combined across many samples to allow efficient detection of most of the variants in a region. The Project currently plans to sequence each sample to about 4X coverage; at this depth sequencing cannot provide the complete genotype of each sample, but should allow the detection of most variants with frequencies as low as 1%. Combining the data from 2500 samples should allow highly accurate estimation (imputation) of the variants and genotypes for each sample that were not seen directly by the light sequencing.
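The benefit of combining light sequencing across many samples can be shown with a rough calculation. The sketch below assumes that each chromosome copy carries the variant independently with probability equal to the allele frequency, and that each copy receives a Poisson-distributed number of reads with mean equal to half the coverage. These are illustrative assumptions, the function names are hypothetical, and real variant detection needs several supporting reads and a sequencing-error model, which this sketch ignores.

```python
import math


def expected_variant_chromosomes(n_samples: int, allele_freq: float) -> float:
    """Expected number of chromosome copies carrying the variant
    among n_samples diploid individuals."""
    return 2 * n_samples * allele_freq


def expected_variant_reads(n_samples: int, allele_freq: float,
                           coverage: float) -> float:
    """Expected number of reads showing the variant allele across all
    samples; each chromosome copy gets coverage / 2 reads on average."""
    return expected_variant_chromosomes(n_samples, allele_freq) * coverage / 2.0


def prob_variant_seen_at_least_once(n_samples: int, allele_freq: float,
                                    coverage: float) -> float:
    """Probability that at least one read anywhere in the sample set
    shows the variant allele, under the stated Poisson assumptions."""
    # A single chromosome copy yields no variant read if it lacks the
    # variant, or carries it but receives zero reads at this position.
    p_no_variant_read = 1.0 - allele_freq * (1.0 - math.exp(-coverage / 2.0))
    return 1.0 - p_no_variant_read ** (2 * n_samples)


if __name__ == "__main__":
    for n in (1, 2500):
        print(n,
              expected_variant_chromosomes(n, 0.01),
              expected_variant_reads(n, 0.01, 4.0),
              prob_variant_seen_at_least_once(n, 0.01, 4.0))
```

With 2500 samples at 4X, a variant at 1% frequency is expected on about 50 chromosome copies and in about 100 reads across the whole sample set, whereas a single 4X sample would show it less than 2% of the time under this model.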

Project Design

Pilot Project

Three pilot studies provided data to inform the design of the full-scale project:

Pilot | Purpose | Coverage | Strategy | Status
1 - low coverage | Assess strategy of sharing data across samples | 2-4X | Whole-genome sequencing of 180 samples | Sequencing completed October 2008
2 - trios | Assess coverage and platforms and centers | 20-60X | Whole-genome sequencing of 2 mother-father-adult child trios | Sequencing completed October 2008
3 - gene regions | Assess methods for gene-region-capture | 50X | 1000 gene regions in 900 samples | Sequencing completed June 2009

Public releases of summary processed data and variant calls are available through the data tab of this website. The data from the pilot projects have been analyzed, in particular to determine whether the strategy of 4X coverage is adequate to meet the goals of the Project. A paper describing the analyses of the pilot data and the design of the full project was published in October 2010 in the journal Nature and is available here.

A summary of sequencing done for each of the three pilot projects is available here, and the list of samples and allocations is provided in a spreadsheet. Regions targeted in Pilot 3 are available on the project FTP site.

Main Project

The plan for the full project is to sequence about 2,500 samples at 4X coverage. The first set of samples for sequencing includes 1167 samples that already existed or could be collected quickly, from 13 populations, for sequencing in 2010 and early 2011. The second set includes 633 samples that are being collected, from 7 populations, for sequencing in early 2011. The third set, consisting of 700 samples, is expected to be available for sequencing in late 2011. Full details of the samples to be sequenced are below.

Use of the Project data and samples

In genome-wide association (GWA) studies, researchers aim to discover regions of the genome that are associated with particular diseases or traits. They genotype samples from hundreds to thousands of people with a disease and without a disease for hundreds of thousands to about a million SNPs and structural variants. They look for which regions have variants that are more common (containing risk variants) or less common (containing protective variants) in the people with the disease as compared to those without the disease. Because genetic variants in a region are generally associated with many other variants in the region (linkage disequilibrium, LD), a variant associated with a disease is a marker for the region it is in, but the LD patterns do not allow a determination of which particular variant or, frequently, which gene or genetic element, causally contributes to the risk for the disease.
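The comparison of variant frequencies between people with and without a disease can be sketched as a simple allelic test on a 2x2 table of allele counts. The example below computes an odds ratio and a Pearson chi-square statistic with one degree of freedom. It is a minimal illustration, not the method of any particular GWA study; the function name and the allele counts are hypothetical, and real studies also correct for multiple testing and population structure.

```python
import math
from typing import Tuple


def allelic_association(case_alt: int, case_ref: int,
                        control_alt: int, control_ref: int
                        ) -> Tuple[float, float, float]:
    """Return (odds_ratio, chi_square, p_value) for a 2x2 table of
    allele counts in cases and controls."""
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d

    # Odds ratio above 1 suggests a risk allele, below 1 a protective one.
    odds_ratio = (a * d) / (b * c)

    # Pearson chi-square statistic for a 2x2 table (no continuity correction).
    chi_square = n * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d))

    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return odds_ratio, chi_square, p_value


if __name__ == "__main__":
    # Hypothetical counts: 1000 cases and 1000 controls (2000 alleles each).
    odds, chi2, p = allelic_association(300, 1700, 200, 1800)
    print(f"odds ratio {odds:.3f}, chi-square {chi2:.2f}, p-value {p:.2e}")
```

Because of linkage disequilibrium, a significant result for one variant marks the surrounding region rather than identifying the causal variant itself.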

Disease researchers will use 1000 Genomes data in two ways. First, they will combine the 1000 Genomes data with the genotype data from their disease GWA study to impute the genotypes in their samples for millions of additional variants beyond those they genotyped directly. This is done computationally, with no genotyping cost. The additional genotype data will allow the researchers to localize the disease-associated regions more precisely.
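The idea behind imputation can be shown with a deliberately simplified toy. The sketch below treats a study sample as a single haplotype typed at only some sites, finds the reference-panel haplotypes that agree best at the typed sites, and fills in the untyped sites by majority vote among them. Real imputation software uses statistical models over diploid genotypes and is far more sophisticated; the function name, the panel, and the alleles here are all hypothetical.

```python
from collections import Counter
from typing import List, Optional


def impute_haplotype(observed: List[Optional[int]],
                     reference_panel: List[List[int]]) -> List[int]:
    """Fill untyped sites (None) in `observed` using the reference
    haplotypes that best match it at the typed sites."""
    typed_sites = [i for i, allele in enumerate(observed) if allele is not None]

    def mismatches(ref: List[int]) -> int:
        # Count disagreements at the sites that were genotyped directly.
        return sum(ref[i] != observed[i] for i in typed_sites)

    fewest = min(mismatches(ref) for ref in reference_panel)
    closest = [ref for ref in reference_panel if mismatches(ref) == fewest]

    imputed = []
    for i, allele in enumerate(observed):
        if allele is not None:
            imputed.append(allele)  # keep directly genotyped alleles
        else:
            # Majority vote among the closest reference haplotypes;
            # on a tie, the allele encountered first wins.
            votes = Counter(ref[i] for ref in closest)
            imputed.append(votes.most_common(1)[0][0])
    return imputed


if __name__ == "__main__":
    # Hypothetical reference panel of five haplotypes over five sites
    # (0 = reference allele, 1 = alternate allele).
    panel = [
        [0, 0, 1, 0, 1],
        [0, 0, 1, 0, 1],
        [1, 1, 0, 1, 0],
        [1, 1, 0, 1, 0],
        [1, 1, 0, 0, 0],
    ]
    # The study sample was genotyped only at sites 0 and 2.
    print(impute_haplotype([1, None, 0, None, None], panel))
```

The toy works only because variants in a region are correlated: a limited number of haplotypes means that a few typed sites are often enough to identify which reference haplotypes a sample resembles.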

Second, once the disease-associated regions are identified, researchers will want to know all of the variants in those regions. The Project will provide data on almost all of the variants with a frequency of at least 1% in the populations studied. This will save disease researchers the time and expense of having to sequence their own samples. Although this list will not tell them which variants cause the increased risk for the disease, it will give them the set of suspects. They will then do experimental studies to determine which genes, genetic elements, and variants functionally cause the increased risk of disease.

Many researchers will compare the allele frequencies and LD patterns that they find in their studies with those in the 1000 Genomes data. Researchers will also examine the 1000 Genomes data to study recombination, natural selection, and population structure and admixture.

DNA from the samples will be available for researchers to follow up with more detailed studies of variation in particular regions. The cell lines will allow researchers to study cellular phenotypes such as gene expression, epigenetic patterns, and drug response. The extensive genotype data available on the samples, and the trio samples for some of the populations, will allow researchers to computationally map regions of the genome that affect the cellular phenotypes, and to study the heritability of these phenotypes. See http://ccr.coriell.org/sections/Collections/NHGRI/hapmap.aspx?PgId=266&coll=GM

Samples included in the project

The 1000 Genomes Project developed guidelines on ethical considerations for investigators doing sampling, outlined in the Informed Consent Background Document and the Informed Consent Form Template for new collections. All new collections included in the Project followed these ethical guidelines and model informed consent language. The 1000 Genomes Project Steering Committee, with input from the Project's Samples and ELSI Group, made final decisions about which populations and sample sets to include in the Project.

The samples for the 1000 Genomes Project are mostly anonymous and have no associated medical or phenotype data; for some of the populations the collectors have phenotype data, but these data are not at Coriell and are not distributed. Although the 1000 Genomes samples have no phenotype data, the genetic variation data produced by the Project will be used by researchers to study many diseases, in sets of disease and control samples that have been carefully phenotyped. More information about using the 1000 Genomes Project data is available on the data tab of this website.

Below is a summary of the samples that the 1000 Genomes Project plans to sequence for the full-scale study. The first set of samples for sequencing includes those available by the end of 2009; the second set includes those expected to be available by early 2011.

1000 Genomes Samples 16-Feb-12
Population | When cell line available (approx.) | DNA sequenced from blood | Offspring samples from trios available | First set | Second set | Third set | Total
Utah residents (CEPH) with Northern and Western European ancestry (CEU) | Available | no | yes | 100 | | | 100
Toscani in Italia (TSI) | Available | no | no | 100 | | | 100
British from England and Scotland (GBR) | Available | no | no | 96 | 4 | | 100
Finnish from Finland (FIN) | Available | no | no | 100 | | | 100
Iberian populations in Spain (IBS) | Available | no | yes | 30 | 70 | | 100
TOTAL European ancestry | | | | 426 | 74 | | 500
Han Chinese in Beijing, China (CHB) | Available | no | no | 100 | | | 100
Japanese in Tokyo, Japan (JPT) | Available | no | no | 100 | | | 100
Han Chinese South (CHS) | Available | most | yes | 100 | | | 100
Chinese Dai in Xishuangbanna (CDX) | Feb 2012 | some | no | | 100 | | 100
Kinh in Ho Chi Minh City, Vietnam (KHV) | Available | some | some | | 100 | | 100
Chinese in Denver, Colorado (CHD) (pilot 3 only) | Available | no | no | | | | 0
TOTAL East Asian ancestry | | | | 300 | 200 | | 500
Yoruba in Ibadan, Nigeria (YRI) | Available | no | yes | 100 | | | 100
Luhya in Webuye, Kenya (LWK) | Available | no | no | 100 | | | 100
Gambian in Western Division, The Gambia (GWD) | Aug 2012 | no | yes | | | 100 | 100
Mende in Sierra Leone (MSL) | Aug 2012 | no | yes | | | 100 | 100
Esan in Nigeria (ESN) | Aug 2012 | no | yes | | | 100 | 100
TOTAL West African ancestry | | | | 200 | | 300 | 500
African Ancestry in Southwest US (ASW) | Available | no | some | 61 | 1 | | 62
African Caribbean in Barbados (ACB) | Available | yes | yes | | 79 | 21 | 100
Mexican Ancestry in Los Angeles, CA (MXL) | Available | no | yes | 70 | | | 70
Puerto Rican in Puerto Rico (PUR) | Available | yes | yes | 70 | | 20 | 90
Colombian in Medellin, Colombia (CLM) | Available | no | yes | 70 | | 19 | 89
Peruvian in Lima, Peru (PEL) | Available | yes | yes | | 70 | 19 | 89
TOTAL Americas | | | | 271 | 150 | 79 | 500
Gujarati Indian in Houston, TX (GIH) | Available | no | no | | 100 | | 100
Punjabi in Lahore, Pakistan (PJL) | May-Aug 2012 | yes | yes | | | 100 | 100
Bengali in Bangladesh (BEB) | Aug 2012 | no | yes | | | 100 | 100
Sri Lankan Tamil in the UK (STU) | Aug 2012 | yes | yes | | | 100 | 100
Indian Telugu in the UK (ITU) | Aug 2012 | yes | yes | | | 100 | 100
TOTAL South Asian ancestry | | | | | 100 | 400 | 500
TOTAL | | | | 1197 | 524 | 779 | 2500

Note 1: Some samples in the first (or second) set are being sequenced with the second (or third) set of samples.

Note 2: All samples are being sequenced lightly and in exomes. In addition, 322 samples and their children (161 trios) and 11 unrelated samples are being sequenced deeply.

Current list of samples planned for sequencing

The list of samples for sequencing as of 6 May 2011 is provided in a spreadsheet. A pedigree file is also available.

The samples, as cell lines and DNA, are available to researchers from the non-profit Coriell Institute for Medical Research, or will become available when the cell lines have been made. For some of the populations, additional samples are available, including ones from the children in mother-father-adult child trios. See http://ccr.coriell.org/sections/Collections/NHGRI/hapmap.aspx?PgId=266&coll=GM

Publications and project documents

Project Documents

May 2011
1000 Genomes Browser Quick Start Guide

October 2010
1000 Genomes pilot paper

September 2007
Report of the 1000 Genomes planning meeting