1000 Genomes Data and Sample Information

The 1000 Genomes Project is a community resource project that aims to release data rapidly for the benefit of the scientific community.

Description of data released by the project
How to Access 1000 Genomes Data
Data Release Policy
Sample Availability
Use of the Project data, presentations and publications, and authorship

Data released by the 1000 Genomes Project

Sample lists and sequencing progress

A summary of sequencing done for each of the three pilot projects is available here. The list of samples and allocations is provided in a spreadsheet.

Variant Calls

Our variant calls are always released in VCF format. The released files can be found in the release directory EBI|NCBI.
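As a rough illustration of working with a downloaded call set, the sketch below reads a gzip-compressed VCF file using only the Python standard library. It assumes the tab-separated layout with eight fixed columns described in the VCF specification; the function name iter_vcf_records is hypothetical and not part of any Project software.

```python
import gzip
from typing import Dict, Iterator

# The eight fixed columns defined by the VCF specification.
VCF_FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def iter_vcf_records(path: str) -> Iterator[Dict[str, str]]:
    """Yield the fixed columns of each data line in a gzip-compressed VCF file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Skip meta-information lines (##...), the header line (#CHROM...)
            # and any blank lines.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            yield dict(zip(VCF_FIXED_COLUMNS, fields[:8]))
```

Genotype columns, when present, follow the eight fixed columns and are ignored here to keep the example short.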

Alignments

The main project alignments are available in BAM format. A list of the files currently available can be found in the alignment index EBI|NCBI. Alignment statistics can be found in the alignment_indices directory EBI|NCBI. There is also a README which explains the alignment process and file layout.

Raw sequence files

The main project raw sequence data is available in fastq format. A list of files currently available can be found in the sequence.index EBI|NCBI. Sequence statistics can be found in the sequence_indices directory EBI|NCBI. There is also a README which explains the sequence processing and the file layout.

How to Access 1000 Genomes Data

Download data

The sequence and alignment data generated by the 1000 Genomes Project is made available as quickly as possible via our mirrored FTP sites.

EBI FTP: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/
NCBI FTP: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/

Users in the Americas should use the NCBI FTP site; users in Europe and the rest of the world should use the EBI FTP site.
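The mirror choice can be captured in a small helper. This is a minimal sketch that assumes a relative path resolves identically under both base URLs, as the mirroring implies; the function name mirror_url and its in_americas flag are hypothetical.

```python
EBI_FTP = "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/"
NCBI_FTP = "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/"


def mirror_url(relative_path: str, in_americas: bool) -> str:
    """Return the full URL of a path on the recommended mirror.

    Users in the Americas get the NCBI mirror; everyone else gets EBI.
    """
    base = NCBI_FTP if in_americas else EBI_FTP
    # Strip a leading slash so the path is appended to the mirror root.
    return base + relative_path.lstrip("/")
```

For example, mirror_url("release/20100804/", in_americas=False) points at the 20100804 release directory on the EBI site.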

The data is also available via an Aspera server from both sites. To use this service you need to download the Aspera Connect software, which provides both a Firefox plug-in for downloading data and a bulk download client called ascp.

The plug-in should start automatically when you visit either the EBI Aspera site or the NCBI Aspera site.

An example command line for ascp looks like:

ascp -i bin/aspera/etc/asperaweb_id_dsa.putty -Tr -Q -l 100M -L- fasp-g1k@fasp.1000genomes.ebi.ac.uk:vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz ./

The argument to -i may differ depending on the location of the default key file. This command should not ask you for a password. All the 1000 Genomes data is freely accessible, but you do need to give ascp the SSH key to complete the command.
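For scripted downloads, the same command can be assembled as an argument list. The sketch below only rebuilds the example command shown above: the flags and the EBI host are copied from it unchanged, while the function name build_ascp_command and its parameters are hypothetical.

```python
from typing import List

# Account and host copied from the example command above.
ASPERA_EBI_HOST = "fasp-g1k@fasp.1000genomes.ebi.ac.uk"


def build_ascp_command(
    key_file: str,
    remote_path: str,
    destination: str = "./",
    rate_limit: str = "100M",
) -> List[str]:
    """Return the ascp argument list for fetching one remote path from EBI."""
    return [
        "ascp",
        "-i", key_file,      # location of the Aspera key file on your system
        "-Tr",
        "-Q",
        "-l", rate_limit,
        "-L-",
        f"{ASPERA_EBI_HOST}:{remote_path}",
        destination,
    ]
```

If ascp is installed and on your PATH, the resulting list can be passed to subprocess.run(command, check=True) to start the transfer.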

FTP Hierarchy

The FTP site follows a specific data hierarchy to enable data discovery.

 

  • CHANGELOG, This file gives summaries and dates for changes made to the FTP site.
  • changelog_details, This directory contains file lists specifying particular files which have changed. The filename format is explained in README.FTP_structure.
  • current.tree, This file describes the current file hierarchy, giving the size of each file in it, its md5 checksum and the date it was last updated.
  • data, This directory contains one directory per individual, named for the sample name, like NA12878. The contents of each individual directory are described below in the data directory section.
  • pilot_data, This directory contains all the data used for the pilot analysis. Its structure mirrors that of the data, release and technical directories of the main FTP site.
  • release, This directory contains a dated directory for each release of analysis results like SNP or CNV calls. Each directory is named for the sequence index release it is based on, in the form YYYYMMDD, e.g. our first main project release directory is called 20100804 to match the sequence.index it is based on.
  • technical, This directory contains various technical information required for the project. Its contents are explained further in the technical directory section.
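Because release directories are named in the form YYYYMMDD, their names can be parsed as dates and compared. The following is a minimal sketch; the function names are hypothetical, and the list of directory names is assumed to have been obtained separately, for example from an FTP listing.

```python
from datetime import date, datetime
from typing import Iterable


def parse_release_date(directory_name: str) -> date:
    """Convert a YYYYMMDD release directory name into a date."""
    # strptime alone accepts some shorter strings, so check the shape first.
    if len(directory_name) != 8 or not directory_name.isdigit():
        raise ValueError(f"not a YYYYMMDD name: {directory_name!r}")
    return datetime.strptime(directory_name, "%Y%m%d").date()


def latest_release(directory_names: Iterable[str]) -> str:
    """Return the most recent dated name, ignoring entries that are not dates."""
    dated = []
    for name in directory_names:
        try:
            dated.append((parse_release_date(name), name))
        except ValueError:
            continue  # e.g. a README sitting next to the dated directories
    if not dated:
        raise ValueError("no dated release directories found")
    return max(dated)[1]
```

For the first main project release, parse_release_date("20100804") gives 4 August 2010.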

 

The data dir

Under the data dir there is a directory for every individual sequenced so far as part of the project. Each directory is named with the NAXXXXX identifier assigned to the individual. This system is also mirrored under pilot_data/data.

Each NAXXXXX directory itself contains several directories:

  • sequence_read, This directory contains all the sequence data associated with that individual. This data is also indexed in the sequence.index file.
  • alignment, This directory contains all the bam, bai and bas files associated with the alignments for a particular individual. Generally there is one bam file per individual and technology. For high coverage individuals each bam file is split on a per chromosome basis. This information is indexed in the alignment.index file.
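The per-individual layout described above can be expressed as a small path helper. This is only a sketch of the directory naming given in the text: the function name sample_directories is hypothetical, and the returned paths are relative to the FTP root.

```python
import posixpath
from typing import Dict


def sample_directories(sample: str, root: str = "data") -> Dict[str, str]:
    """Return the relative paths of the sub-directories for one individual.

    Use root="pilot_data/data" for the pilot data, which follows the same layout.
    """
    base = posixpath.join(root, sample)
    return {
        "sequence_read": posixpath.join(base, "sequence_read"),
        "alignment": posixpath.join(base, "alignment"),
    }
```

For sample NA12878 this gives data/NA12878/sequence_read and data/NA12878/alignment.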

 

The technical dir

Under the technical dir there are several directories designed to store different data types. This is primarily a location for exchange of data between project participants. It also stores reference data sets in different directories.

  • method development, This contains useful files for testing out new methods and assessing other collaborators' methods
  • qc, This contains information about any qc done on the data in the database
  • reference, This contains reference datasets like a copy of the currently used human genome and annotation sets like gene predictions or ancestral alignments
  • retired_reference, This contains any reference data which is no longer used by the project
  • simulations, This contains any simulated data and the programs used to create it
  • working, This contains a series of dated directories which contain technical data required for the ongoing project

 

Browse data

The first set of SNP calls representing the July 2010 data release can be viewed directly through the 1000 Genomes browser at http://browser.1000genomes.org. Launch the browser and view a sample region here.

For more information about using the 1000 Genomes browser, download the Quick start guide.

Amazon Web Services

The alignment files for the 1000 Genomes Pilot Project are available as a public data set via Amazon Web Services (AWS). Further information can be found on the 1000 Genomes public data set page or directly at http://s3.amazonaws.com/1000genomes.

Data Release Policy

As a community resource project, the 1000 Genomes Project publicly releases data on a regular basis. The sequencing centers release the raw reads to the SRA. The Data Processing Group, Analysis Group, and DCC release alignments, re-calibrated error rates, and SNP calls on individual samples. Data that passed quality filters are the most visible, but data that did not pass quality filters are also available. When the Structural Variation Group has developed its methods, the structural variant calls on individual samples will be released. Eventually the Project will release haplotypes and imputed variant calls. Data formats and analysis software developed by the Project are also made publicly available.

Sample Availability

All the samples studied by the 1000 Genomes Project are or will be available as DNA and cell lines to scientific investigators for research projects. Many of the samples are currently available from the non-profit Coriell Institute for Medical Research. Cell lines and DNA from additional populations will become available through 2011.

Use of the Project data, presentations and publications, and authorship

The data producers release the Project data quickly, prior to publication, in the expectation that they will be valuable for many researchers. In keeping with the Fort Lauderdale principles, data users may use the data for many studies, but are expected to allow the data producers to make the first presentations and to publish the first paper with global analyses of the data.

Global analyses of Project data

The Project plans to publish global analyses of the sequence data and quality; SNPs, structural variants, STRs and microsatellites; haplotypes and LD patterns; population genetic phenomena such as population comparisons, mutation rates, and signals of selection; and functional annotations; as well as analyses of regions of general interest such as HLA. Talks, posters, and papers on all such analyses are to be published first by the 1000 Genomes Consortium, by approved presenters on behalf of the Project, with the Consortium as author. Once the first major Project paper on these analyses is published, researchers inside and outside the Project are free to present and publish using the Project data for these and other analyses.

Large-scale analyses of Project data

Groups within the Project may make presentations and publish papers on more extensive analyses of topics to be included in the main analysis presentations and papers, coincident with the main project analysis presentations and papers. The major points would be included in the main Project presentations and papers, but these additional presentations and papers allow more focused discussion of methods and results. The author list would include the Consortium.

Methods development using Project data

Researchers who have used small amounts of Project data (<= one chromosome) may present methods development posters, talks, and papers that include these data prior to the first major Project paper, without needing Project approval or authorship, although the Project should be acknowledged. Methods presentations or papers on global analyses or analyses using large amounts of Project data, on topics that the Consortium plans to examine, would be similar to large-scale analyses of Project data: researchers within the Project may make presentations or submit papers at the same time as the main Project presentations and papers, and others could do so after the Project publishes the first major analysis paper.

Disease studies using Project data

Researchers may present and publish on use of Project data in specific chromosome regions (that are not of general interest) or as summaries (such as the total number of variants) for disease studies without Project approval, prior to the first major Project paper being published. The Project should not be listed as an author.

Population comparisons using Project data

Researchers may use Project data as controls or additional information for comparisons with their samples from people with particular diseases or from other populations, prior to the major Project paper being published, as long as the analyses that the Project plans to do are not included. These are not Project studies, and the Project should not be listed as an author.

Researchers who have questions about whether they may make presentations or submit papers using Project data, or whether to include the 1000 Genomes Consortium as an author, may contact Richard Durbin, David Altshuler, Gil McVean, Gonçalo Abecasis, or Lisa Brooks.