The International Genome Sample Resource (IGSR) and the 1000 Genomes Project

IGSR was set up to ensure the future usability and accessibility of data from the 1000 Genomes Project. It also extends the project's data set with new data generated from the 1000 Genomes Project samples, and with new populations where sampling has been carried out in line with IGSR sampling principles.

The 1000 Genomes Project ran between 2008 and 2015, creating the largest public catalogue of human variation and genotype data. As the project ended, the Data Coordination Centre at EMBL-EBI received funding from the Wellcome Trust to create IGSR with the following aims:

  1. Ensure the future access to and usability of the 1000 Genomes reference data
  2. Incorporate additional published genomic data on the 1000 Genomes samples
  3. Expand the data collection to include new populations not represented in the 1000 Genomes Project

1. Ensure the future usability of the 1000 Genomes reference data

One aim of IGSR is keeping the 1000 Genomes Project data up to date with current reference data sets.

In 2014, the Genome Reference Consortium (GRC) released an update of the human genome reference assembly, GRCh38. This update increased the number of alternative loci represented. GRCh38 contains 178 genomic regions with associated alternative loci, covering 61.9 Mb (2% of chromosomal sequence). These regions are represented by 261 alternative loci, which contain 3.6 Mb of sequence that is novel relative to the chromosomes. The GRC also resolved more than 1000 issues from the previous version of the assembly, providing a better basis for alignment and subsequent analysis.
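As a rough consistency check on the figures quoted above, the arithmetic can be sketched in a few lines of Python. The constant and function names are illustrative assumptions rather than part of any IGSR or GRC tooling, and the quoted 2% is a rounded value, so the results are approximate.

```python
# Figures as quoted in the text; all names here are hypothetical.
ALT_REGION_SPAN_MB = 61.9    # chromosomal sequence in regions with alternative loci
ALT_REGION_FRACTION = 0.02   # quoted as "2%" (rounded, so results are approximate)
NOVEL_SEQUENCE_MB = 3.6      # novel sequence contributed by the alternative loci
REGION_COUNT = 178           # genomic regions with associated alternative loci
ALT_LOCUS_COUNT = 261        # alternative loci across those regions


def implied_total_mb(span_mb: float, fraction: float) -> float:
    """Total chromosomal length implied by a span and its share of the total."""
    return span_mb / fraction


# Total chromosomal sequence implied by "61.9 Mb is 2%".
total_mb = implied_total_mb(ALT_REGION_SPAN_MB, ALT_REGION_FRACTION)

# Average number of alternative loci per region.
loci_per_region = ALT_LOCUS_COUNT / REGION_COUNT

# Novel sequence as a share of the implied chromosomal total.
novel_share = NOVEL_SEQUENCE_MB / total_mb

print(f"Implied chromosomal sequence: {total_mb:.0f} Mb")
print(f"Alternative loci per region:  {loci_per_region:.2f}")
print(f"Novel sequence share:         {novel_share:.3%}")
```

The implied total of roughly 3,095 Mb is in line with a human genome of about 3.1 Gb, so the 2% and 61.9 Mb figures agree with each other. On average each region carries about one and a half alternative loci.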

As part of its work to maintain the 1000 Genomes Project data, IGSR realigned the original project’s sequence data to GRCh38 and used these alignments to call biallelic SNVs and INDELs.

Subsequent work by the New York Genome Center (NYGC), funded by NHGRI, generated new high-coverage data for the 1000 Genomes samples and analysed the data on GRCh38.

2. Incorporate published genomic data on the 1000 Genomes samples

Along with the high-coverage genomic sequence data from NYGC, other researchers have used the cell lines generated by the 1000 Genomes Project to produce further data sets. One example is the GEUVADIS project, which generated RNA-Seq data on the 1000 Genomes samples of European ancestry and the YRI population. Groups such as the Human Genome Structural Variation Consortium (HGSVC) have also generated a wide variety of data from the 1000 Genomes Project cell lines.

3. Expand the data collection to include new populations

Recognising that the sample collections used in the 1000 Genomes Project do not capture all populations, IGSR’s sample collection principles enable the sharing of further samples collected under consent similar to that of the 1000 Genomes Project samples.

During the lifetime of IGSR, further populations have been added, mainly from the Human Genome Diversity Project (HGDP), the Simons Genome Diversity Project (SGDP) and the Gambian Genome Variation Project (GGVP).

Learning more

Please email questions about any of the above to info@1000genomes.org.