1000 Genomes Data and Sample Information

The 1000 Genomes Project is a community resource project that aims to release data rapidly for the benefit of the scientific community.

Description of data released by the project
How to Access 1000 Genomes Data
Data Release Policy
Sample Availability
Use of the Project data, presentations and publications, and authorship

Data released by the 1000 Genomes Project

Sample lists and sequencing progress

A summary of sequencing done for each of the three pilot projects is available here. The list of samples and allocations is provided in a spreadsheet.

Variant Calls

Our variant calls are always released in VCF format. The releases can be found in the release directory EBI|NCBI.
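As a rough illustration of what a VCF data line contains, the sketch below splits one tab-delimited record into the eight fixed columns defined by the VCF specification (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The names VcfRecord and parse_vcf_line are hypothetical helpers introduced here, not part of any Project software, and the sketch ignores header lines and per-sample genotype columns.

```python
from typing import Dict, List, NamedTuple


class VcfRecord(NamedTuple):
    """The eight fixed VCF columns (hypothetical container for this sketch)."""
    chrom: str
    pos: int
    id: str
    ref: str
    alt: List[str]
    qual: str
    filter: str
    info: Dict[str, str]


def parse_vcf_line(line: str) -> VcfRecord:
    """Parse one VCF data line (not a '#' header line) into a VcfRecord."""
    fields = line.rstrip("\n").split("\t")
    # Only the first eight columns are fixed; genotype columns are ignored here.
    chrom, pos, vid, ref, alt, qual, filt, info = fields[:8]
    info_map: Dict[str, str] = {}
    for item in info.split(";"):
        if item in ("", "."):
            continue  # empty or missing INFO entry
        # Flag-style entries have no '=' and are stored with an empty value.
        key, _, value = item.partition("=")
        info_map[key] = value
    return VcfRecord(chrom, int(pos), vid, ref, alt.split(","), qual, filt, info_map)
```

For real analyses a dedicated VCF library is likely a better choice than hand-rolled parsing; the sketch is only meant to show the record layout.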

Alignments

The main project alignments are available in BAM format. A list of the files currently available can be found in the alignment index EBI|NCBI. Alignment statistics can be found in the alignment_indices directory EBI|NCBI. There is also a README which explains the alignment process and file layout.

Raw sequence files

The main project raw sequence data is available in fastq format. A list of files currently available can be found in the sequence.index EBI|NCBI. Sequence statistics can be found in the sequence_indices directory EBI|NCBI. There is also a README which explains the sequence processing and the file layout.
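The index files can be read with a few lines of code. The sketch below assumes that an index is a tab-delimited text file whose first non-comment line is a header row, that lines starting with "##" are comments, and that the header may carry a leading "#". These are assumptions about the layout, so check them against the README before relying on them. The function name read_index is hypothetical.

```python
import csv
from typing import Dict, Iterable, Iterator


def read_index(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per row of a tab-delimited index with a header row."""
    # Assumed convention: blank lines and '##' lines carry no row data.
    body = (ln for ln in lines if ln.strip() and not ln.startswith("##"))
    reader = csv.reader(body, delimiter="\t", quoting=csv.QUOTE_NONE)
    try:
        header = next(reader)
    except StopIteration:
        return  # empty input: nothing to yield
    # Assumed convention: the header line may begin with a single '#'.
    header[0] = header[0].lstrip("#")
    for row in reader:
        yield dict(zip(header, row))
```

Because the function accepts any iterable of lines, it can be given an open file handle directly, for example "with open(path) as fh: rows = list(read_index(fh))", where path points at a locally downloaded index file.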

How to Access 1000 Genomes Data

Download data

The data for this project can be downloaded from the 1000 Genomes FTP site hosted at the EBI: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/

The data can be downloaded via FTP, Aspera and Globus GridFTP. More information about using Aspera or Globus can be found in our FAQ.

How to download files using aspera
How to download files using globus
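For plain FTP, a single file can be fetched with Python's standard ftplib module. The sketch below assumes the server accepts anonymous logins, and the helper names split_ftp_url and download are hypothetical. It is a minimal example with no retries or resume support, so Aspera or Globus are probably better suited to large transfers.

```python
from ftplib import FTP
from typing import Tuple
from urllib.parse import urlparse


def split_ftp_url(url: str) -> Tuple[str, str]:
    """Split an ftp:// URL into (host, path on the server)."""
    parts = urlparse(url)
    if parts.scheme != "ftp" or not parts.hostname:
        raise ValueError(f"not an FTP URL: {url!r}")
    return parts.hostname, parts.path


def download(url: str, dest: str) -> None:
    """Download one file over anonymous FTP and write it to dest."""
    host, path = split_ftp_url(url)
    with FTP(host) as ftp:
        ftp.login()  # no arguments: anonymous login (assumed to be allowed)
        with open(dest, "wb") as out:
            # RETR streams the file in binary mode, one block per callback.
            ftp.retrbinary(f"RETR {path}", out.write)
```

A call would look like download(url, "local_copy"), where url is the full ftp:// address of a file under the FTP root given above.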

The NCBI FTP site and the Amazon S3 bucket still host 1000 Genomes data but no longer mirror new data posted by the ongoing project. Both locations reflect the structure of the FTP site in August 2015 and hold all the pilot, phase1 and phase3 data.

NCBI FTP Site : ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp
Amazon S3 : s3://1000genomes
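Since both mirrors reflect the FTP site's directory structure, a path relative to the FTP root can be translated into a location on each host. The sketch below does this using the three roots listed on this page. It assumes the relative path is identical on every host, which only holds for files that existed in the August 2015 layout. The names MIRROR_ROOTS and mirror_urls are hypothetical.

```python
from typing import Dict

# Root locations taken from this page; trailing slashes added for joining.
MIRROR_ROOTS: Dict[str, str] = {
    "ebi": "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/",
    "ncbi": "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/",
    "s3": "s3://1000genomes/",
}


def mirror_urls(relative_path: str) -> Dict[str, str]:
    """Map a path relative to the FTP root onto each hosting location."""
    # Assumption: the same relative path exists on all three hosts.
    rel = relative_path.lstrip("/")
    return {name: root + rel for name, root in MIRROR_ROOTS.items()}
```

The function only builds addresses; it does not check that a file is actually present on a given mirror.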

FTP Hierarchy 

The FTP structure was changed in September 2015. The new structure is described in the FTP site structure README.

Browse data

The first set of SNP calls representing the July 2010 data release can be viewed directly through the 1000 Genomes browser at http://browser.1000genomes.org. Launch the browser and view a sample region here.

For more information about using the 1000 Genomes browser, download the Quick start guide.

Amazon Web Services

The alignment files for the 1000 Genomes Pilot Project are available as a public data set via Amazon Web Services (AWS). Further information can be found on the 1000 Genomes public data set page or directly on http://s3.amazonaws.com/1000genomes.

Data release policy

As a community resource project, the 1000 Genomes Project publicly releases data on a regular basis. The sequencing centers release the raw reads to the SRA. The Data Processing Group, Analysis Group, and DCC release alignments, re-calibrated error rates, and SNP calls on individual samples. Data that passed quality filters are the most visible, but data that did not pass quality filters are also available. When the Structural Variation Group has developed its methods, the structural variant calls on individual samples will be released. Eventually the Project will release haplotypes and imputed variant calls. Data formats and analysis software developed by the Project are also made publicly available.

Sample Availability

All the samples studied by the 1000 Genomes Project are or will be available as DNA and cell lines to scientific investigators for research projects. Many of the samples are currently available from the non-profit Coriell Institute for Medical Research. Cell lines and DNA from additional populations will become available through 2011.

Use of the Project data, presentations and publications, and authorship

The data producers release the Project data quickly, prior to publication, in the expectation that they will be valuable for many researchers. In keeping with Fort Lauderdale principles, data users may use the data for many studies, but are expected to allow the data producers to make the first presentations and to publish the first paper with global analyses of the data.

Global analyses of Project data

The Project plans to publish global analyses of the sequence data and quality; SNPs, structural variants, STRs and microsatellites; haplotypes and LD patterns; population genetic phenomena such as population comparisons, mutation rates, and signals of selection; and functional annotations, as well as analyses of regions of general interest such as HLA. Talks, posters, and papers on all such analyses are to be published first by the 1000 Genomes Consortium, by approved presenters on behalf of the Project, with the Consortium as author. Once the first major Project paper on these analyses is published, researchers inside and outside the Project are free to present and publish using the Project data for these and other analyses.

Large-scale analyses of Project data

Groups within the Project may make presentations and publish papers on more extensive analyses of topics to be included in the main analysis presentations and papers, coincident with the main project analysis presentations and papers. The major points would be included in the main Project presentations and papers, but these additional presentations and papers allow more focused discussion of methods and results. The author list would include the Consortium.

Methods development using Project data

Researchers who have used small amounts of Project data (no more than one chromosome) may present methods development posters, talks, and papers that include these data prior to the first major Project paper, without needing Project approval or authorship, although the Project should be acknowledged. Methods presentations or papers on global analyses or analyses using large amounts of Project data, on topics that the Consortium plans to examine, would be treated like large-scale analyses of Project data: researchers within the Project may make presentations or submit papers at the same time as the main Project presentations and papers, and others could do so after the Project publishes the first major analysis paper.

Disease studies using Project data

Researchers may present and publish on use of Project data in specific chromosome regions (that are not of general interest) or as summaries (such as the total number of variants) for disease studies without Project approval, prior to the first major Project paper being published. The Project should not be listed as an author.

Population comparisons using Project data

Researchers may use Project data as controls or additional information for comparisons with their samples from people with particular diseases or from other populations, prior to the major Project paper being published, as long as the analyses that the Project plans to do are not included. These are not Project studies, and the Project should not be listed as an author.

Researchers who have questions about whether they may make presentations or submit papers using Project data, or whether to include the 1000 Genomes Consortium as an author, may contact Richard Durbin, David Altshuler, Gil McVean, Gonçalo Abecasis, or Lisa Brooks.