About IGSR and the 1000 Genomes Project

The 1000 Genomes Project ran between 2008 and 2015, creating the largest public catalogue of human variation and genotype data. When the project ended, the Data Coordination Centre at EMBL-EBI received continued funding from the Wellcome Trust to maintain and expand the resource. The International Genome Sample Resource (IGSR) was set up to do this and has the following aims:

  1. Ensure the future access to and usability of the 1000 Genomes reference data
  2. Incorporate additional published genomic data on the 1000 Genomes samples
  3. Expand the data collection to include new populations not represented in the 1000 Genomes Project

The 1000 Genomes Project

Overview of the 1000 Genomes Project

The goal of the 1000 Genomes Project was to find most genetic variants with frequencies of at least 1% in the populations studied.

The 1000 Genomes Project took advantage of developments in sequencing technology, which sharply reduced the cost of sequencing. It was the first project to sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation. Data from the 1000 Genomes Project was quickly made available to the worldwide scientific community through freely accessible public databases.

Sequencing remained too expensive to deeply sequence the many samples being studied in the project. However, any particular region of the genome generally contains a limited number of haplotypes, so data was combined across samples to allow efficient detection of most of the variants in a region. The project planned to sequence each sample to 4x genome coverage; at this depth, sequencing cannot discover all variants in each sample, but it can detect most variants with frequencies as low as 1%. In the final phase of the project, data from 2,504 samples was combined to allow highly accurate assignment of the genotypes in each sample at all the variant sites the project discovered. The multi-sample approach, combined with genotype imputation, allowed the project to determine a sample’s genotype even at variant sites not covered by sequencing reads in that sample.
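The coverage argument above can be illustrated with a rough back-of-the-envelope calculation. This is a minimal sketch, not the project's actual variant-calling method: it assumes read counts at a site follow a Poisson distribution, that each chromosome copy receives half of a sample's coverage, and that sequencing errors can be ignored. The threshold of two variant-carrying reads (MIN_ALT_READS) is a hypothetical choice made purely for illustration.

```python
import math


def poisson_tail(mean, k):
    """Return P(X >= k) for X ~ Poisson(mean)."""
    below = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))
    return 1.0 - below


def alt_reads_single_het(coverage):
    """Expected variant-carrying reads in one heterozygous sample."""
    # Only one of the two chromosome copies carries the variant.
    return coverage / 2.0


def alt_reads_pooled(n_samples, allele_freq, coverage):
    """Expected variant-carrying reads summed over all samples."""
    # 2 * n_samples chromosome copies, a fraction allele_freq of which carry
    # the variant; each copy receives coverage / 2 reads on average.
    return 2 * n_samples * allele_freq * coverage / 2.0


MIN_ALT_READS = 2  # hypothetical evidence threshold, for illustration only

single = poisson_tail(alt_reads_single_het(4), MIN_ALT_READS)
pooled_mean = alt_reads_pooled(2504, 0.01, 4)
pooled = poisson_tail(pooled_mean, MIN_ALT_READS)

print(f"One heterozygous sample at 4x: P(>= {MIN_ALT_READS} variant reads) = {single:.2f}")
print(f"Expected variant reads pooled over 2,504 samples at 1% frequency: {pooled_mean:.0f}")
print(f"Pooled P(>= {MIN_ALT_READS} variant reads) = {pooled:.4f}")
```

Under these simplified assumptions, a single heterozygous sample at 4x shows two or more variant-carrying reads only about 59% of the time, while pooling 2,504 samples gives roughly 100 expected variant-carrying reads for a variant at 1% frequency. This is the intuition behind combining data across samples; real callers additionally model sequencing error and haplotype structure.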

The contribution of the 1000 Genomes Project to genomics was summarised in Nature in the issue containing the final publications from the main project.

Design of the 1000 Genomes Project

The Project was planned during a meeting at the Wellcome Genome Campus in September 2007. You can read the original plan in the meeting report. Once underway, the project was conducted in four stages: a pilot phase and three phases of the main project. In the main project, phases one and three produced data, with phase two concentrating on technical development.

Pilot project

Three pilot studies provided data to inform the design of the full-scale project:

Pilot | Purpose | Coverage | Strategy | Status
1 - low coverage | Assess strategy of sharing data across samples | 2-4X | Whole-genome sequencing of 180 samples | Sequencing completed October 2008
2 - trios | Assess coverage and platforms and centres | 20-60X | Whole-genome sequencing of 2 mother-father-adult child trios | Sequencing completed October 2008
3 - gene regions | Assess methods for gene-region-capture | 50X | 1000 gene regions in 900 samples | Sequencing completed June 2009

Data from the pilot projects was analysed to determine whether the strategy of 4x coverage was adequate to meet the goals of the project.

Main project

Sequencing was carried out in phases one and three of the main project, with data releases and analysis corresponding to each. The final data freeze, associated with the third and final phase, took place on 2 May 2013. This data set (defined in the 20130502.sequence.index file) is the finalised data set on which the phase three analysis was based, and it supersedes previous data releases. Analysis methods were developed further during the course of the project, so the phase three analysis replaces earlier versions.

The final data set contains data for 2,504 individuals from 26 populations. Low coverage and exome sequence data are present for all of these individuals; 24 individuals were also sequenced to high coverage for validation purposes.
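Counts such as these can be checked against an index file by tallying distinct samples per population. The sketch below is illustrative only: it assumes a tab-delimited file with a single header row, and the column names SAMPLE_NAME and POPULATION are assumptions that should be checked against the actual file before use. The rows shown are toy data, not project data.

```python
import csv
import io
from collections import defaultdict


def samples_per_population(index_text,
                           sample_col="SAMPLE_NAME",     # assumed column name
                           population_col="POPULATION"):  # assumed column name
    """Count distinct samples per population in a tab-delimited index."""
    reader = csv.DictReader(io.StringIO(index_text), delimiter="\t")
    seen = defaultdict(set)
    for row in reader:
        # A sample usually has several sequencing runs, so collect names in
        # a set to count each individual once.
        seen[row[population_col]].add(row[sample_col])
    return {pop: len(names) for pop, names in seen.items()}


# Toy rows for illustration only; these are not real project records.
toy_index = (
    "SAMPLE_NAME\tPOPULATION\tRUN_ID\n"
    "sampleA\tPOP1\trun1\n"
    "sampleA\tPOP1\trun2\n"
    "sampleB\tPOP1\trun3\n"
    "sampleC\tPOP2\trun4\n"
)

counts = samples_per_population(toy_index)
print(counts)
print("individuals:", sum(counts.values()), "populations:", len(counts))
```

For a real index file, the text would be read from disk and passed to the same function; any comment lines preceding the header row would need to be skipped first.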

Analyses covered both short variants (up to 50 base pairs in length) and structural variants. These analyses were published at the conclusion of the project in 2015. A list of our main publications can be seen below.


1000 Genomes Project samples and data

The 1000 Genomes Project developed guidelines on ethical considerations for investigators doing sampling, outlined in the Informed Consent Background Document and the Informed Consent Form Template. All collections included in the Project followed these ethical guidelines and model informed consent language. The 1000 Genomes Project Steering Committee, with input from the Project’s Samples and ELSI Group, made final decisions about which populations and sample sets to include in the Project.

Data from the 1000 Genomes Project is available without embargo, following the final publications from the project. Use of the data should be cited in the usual way; current citation details, along with further guidance on using 1000 Genomes Project data, can be found in the FAQs. Additional information on using data provided by IGSR is available and should also be consulted.

The available data from the 1000 Genomes Project can be explored on our data page, alongside other data in IGSR. Cell lines and DNA are available for all 1000 Genomes samples and can be obtained from the Coriell Institute. A complete list of the populations available can be found on our Cell lines and DNA page.

The samples for the 1000 Genomes Project are anonymous and have no associated medical or phenotype data. The project holds self-reported ethnicity and gender. All participants declared themselves to be healthy at the time the samples were collected.


As stated, the IGSR was set up to ensure the future usability and accessibility of data from the 1000 Genomes Project and to extend the data set to include new data on the 1000 Genomes samples and new populations where sampling has been carried out in line with IGSR sampling principles.

1. Ensuring the future usability of the 1000 Genomes reference data

In 2014, the Genome Reference Consortium (GRC) released an update of the human assembly, GRCh38. This update to the human reference assembly substantially increases the number of alternative loci represented. It now contains 178 genomic regions with associated alternative loci, covering 61.9 Mb or about 2% of chromosomal sequence. These regions are represented by 261 alternative loci, which contain 3.6 Mb of novel sequence relative to the chromosomes. The GRC was also able to resolve more than 1000 issues from the previous version of the assembly.

Taking advantage of the alternative loci when identifying variation and calling genotypes is an important step in improving our ability to discover human variation. Currently, very few tools can use the alternative loci data. IGSR has remapped the phase 3 1000 Genomes data to GRCh38 in an alternative-loci-aware manner using BWA-MEM. This provides the method development community with a source of alignments that can drive new methods forward, and it gives the wider community up-to-date alignments, so that everyone can benefit from the data in the context of the new assembly. The IGSR plans to call variants again on these new alignments.

In addition, further sets of genomic sequence data are being aligned to GRCh38, with the Platinum Genomes data from Illumina being the first new collection of data to be aligned.

2. Incorporate published genomic data on the 1000 Genomes samples

The 1000 Genomes samples have proved a popular resource for molecular phenotyping experiments and for investigating the associations between genetic variation and expression or measurements of epigenetic state. Large datasets have been generated on these samples by projects such as GEUVADIS, which generated RNA-Seq data on the 1000 Genomes European samples and the YRI population, and ENCODE, which has carried out extensive assays on the NA12878 cell line. Many other groups have also conducted assays on the 1000 Genomes samples. The IGSR would like to present all this information in a unified manner so the community can benefit from all the data that exists on these samples.

3. Expand the data collection to include new populations

The IGSR recognises that the current 1000 Genomes Project samples do not reflect all populations. An important aim for IGSR is to expand the populations represented in the collection and ensure the available public data represents the maximum possible population diversity. This will ensure the 1000 Genomes dataset remains a valuable open resource for the community over the next five years. The IGSR will work with the groups who were unable to contribute samples to the 1000 Genomes Project before it finished sample collection and investigate collaborations with other groups to ensure the population diversity gaps are filled. You can find more details about this on our sample collection principles page.

Questions about any of the above can be sent to