Announcements

Monday September 15, 2014

Our final release of the Phase 3 variant set is now available on the FTP site

This update represents version 5 of our release. The issues which have been resolved since our initial release are covered in the Known Issues README

This release includes super population allele frequencies in the main release vcfs and functional annotation from the Ensembl Variant Effect Predictor alongside many other datasets in the supporting directory. The complete list of data is covered in the Supporting Directory README

Please send any questions about this data set to info@1000genomes.org



Wednesday June 25, 2014

We have created a new tool to calculate population specific allele frequencies. The Allele Frequency Calculator will calculate and provide a table of population specific allele frequencies from a vcf file and sample panel file.

The tool is documented on this website

The tool currently has two run modes. The first gives you the allele frequencies for a particular population. The second is run by selecting the ALL population and gives you the allele frequency for each population as well as the global allele frequency.
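The two run modes can be illustrated with a minimal Python sketch. This is not the Allele Frequency Calculator's own code: the function name, the in-memory genotype representation and the toy sample names are assumptions made for illustration. It counts alternate alleles at one biallelic site from VCF-style genotype strings and a sample-to-population mapping such as a sample panel file would provide.

```python
from collections import defaultdict


def allele_frequencies(genotypes, panel, population="ALL"):
    """Return {population: alternate allele frequency} for one biallelic site.

    genotypes:  {sample: VCF-style GT string such as "0|1"} (assumed input shape)
    panel:      {sample: population code}
    population: a single population code, or "ALL" for every population
                plus a global value stored under the key "ALL"
    """
    alt = defaultdict(int)    # alternate allele count per population
    total = defaultdict(int)  # called allele count per population
    for sample, gt in genotypes.items():
        pop = panel.get(sample)
        if pop is None:
            continue  # sample is not listed in the panel
        for allele in gt.replace("/", "|").split("|"):
            if allele == ".":
                continue  # missing call
            total[pop] += 1
            total["ALL"] += 1
            if allele != "0":
                alt[pop] += 1
                alt["ALL"] += 1
    freqs = {pop: alt[pop] / n for pop, n in total.items() if n}
    if population == "ALL":
        return freqs
    return {population: freqs[population]} if population in freqs else {}


# Hypothetical samples; GBR and YRI are used only as example population codes.
panel = {"S1": "GBR", "S2": "GBR", "S3": "YRI"}
genotypes = {"S1": "0|1", "S2": "0|0", "S3": "1|1"}
print(allele_frequencies(genotypes, panel, "GBR"))
print(allele_frequencies(genotypes, panel))
```

In the first call only the requested population is reported; in the second, every population in the panel is reported together with the global frequency.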

Please note that the tool will run more slowly in very large genomic regions and in regions which are very variant dense.

Please send any questions about the tool to info@1000genomes.org



Tuesday June 24, 2014

The initial call set from the 1000 Genomes Project Phase 3 analysis is now available on our ftp site in the directory release/20130502/.

This release contains more than 79 million variant sites and includes not just biallelic SNPs but also indels, deletions, complex short substitutions and other structural variant classes. It is based on data from 2535 individuals from 26 different populations around the world.

More details about the variant set can be found in the README

Please send any questions about this data set to info@1000genomes.org



Tuesday June 17, 2014

The 1000 Genomes FTP site is now available as an endpoint in the Globus Online service.

The endpoint name is ebi#1000genomes 

Our FAQ has more details about how to access the data via Globus.



Thursday February 20, 2014

1000 Genomes Project and Beyond
24-26 June 2014
Churchill College, Cambridge, UK

This Wellcome Trust conference will focus on advances enabled by the 1000 Genomes Project, including the new directions in genetics and genomics that it has facilitated. It is the latest in the successful series of community meetings for the HapMap and 1000 Genomes Project, marking the end of the 1000 Genomes Project this summer.

Scientific sessions will include:
Patterns of genetic variation within and between populations
Management and processing of whole genome sequence data
Whole genome sequencing in complex and rare diseases
Human evolution
Functional analysis of variation
Genome sequencing: the past, present and future

Scientific programme committee
Richard Durbin, Wellcome Trust Sanger Institute, UK
Goncalo Abecasis, University of Michigan, USA
David Altshuler, Broad Institute of Harvard and MIT, USA
Lisa Brooks, National Human Genome Research Institute, USA
Gil McVean, University of Oxford, UK

For further information, visit the Wellcome Trust Scientific Conferences Page



Monday January 06, 2014

All the samples from the 1000 genomes are available as lymphoblastoid cell lines (LCLs) and LCL derived DNA from the Coriell Cell Repository as part of the NHGRI Catalog. In addition Standard Population DNA Panels for the 1000 Genomes and HapMap projects are available at $1000 or less each.

A full listing of all the populations available can be seen on the Coriell Cell lines and DNA page



Monday October 21, 2013

Our 1000 Genomes Browser has been updated to contain Ensembl v73. This contains the majority of the Phase 1 Integrated release.

You can find our tutorial here.



Thursday October 10, 2013

The 1000 Genomes project is giving a tutorial during ASHG 2013 in Boston.

This meeting is free for the public to attend but we ask that you register so we know how many people to expect.

For full details of the schedule please see our 2013 tutorial page.



Friday October 04, 2013

The Functional Analysis group from the 1000 Genomes Project Consortium has published its findings based on the Phase 1 integrated analysis in Science today.

The functional annotation itself is available from the phase1 analysis results directory on our ftp site.

We also provide the supplementary information associated with the paper in the phase1 paper directory



Tuesday September 17, 2013

Olivier Delaneau and Jonathan Marchini have provided improved haplotypes for the phase1 integrated call set on the autosomes generated using SHAPEIT2.

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/shapeit2_phased_haplotypes/

Using a set of validation genotypes at SNPs and biallelic indels, Olivier and Jonathan have been able to show that these haplotypes have lower genotype discordance and improved imputation performance into downstream GWAS samples, especially at low frequency.

README



Thursday August 01, 2013

Lossy CRAMs for mapped phase3 BAMs are available now on the ftp site:

 
In these lossy CRAMs, the quality scores were binned according to the Illumina 8-binning scheme; the tags OQ, BQ and CQ were dropped.
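As a rough illustration of what quality binning does, the sketch below maps each Phred score to a representative value for its bin. The bin edges and representative values are an assumption based on the mapping commonly published for Illumina's 8-level scheme (with scores 0 to 1 treated as a no-call bin and mapped to 0 here); they are not taken from README.crams and should be checked against it. The function names are hypothetical.

```python
import bisect

# Assumed bin edges and representative values for an 8-level scheme.
# Verify against README.crams before relying on them.
BIN_STARTS = [0, 2, 10, 20, 25, 30, 35, 40]
BIN_VALUES = [0, 6, 15, 22, 27, 33, 37, 40]


def bin_quality(q):
    """Map one Phred quality score to the representative value of its bin."""
    if q < 0:
        raise ValueError("quality scores must be non-negative")
    # bisect_right finds how many bin starts are <= q; the last such bin holds q.
    return BIN_VALUES[bisect.bisect_right(BIN_STARTS, q) - 1]


def bin_qualities(scores):
    """Bin every score in a read's quality list."""
    return [bin_quality(q) for q in scores]


print(bin_qualities([1, 8, 19, 24, 31, 41]))
```

Because many distinct scores collapse to one value per bin, the quality strings compress far better, which is where much of the size saving in a lossy file comes from.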
 
Two index files for the lossy CRAMs can be found in 
 
 
20130502.alignment.lossy_cram.index
20130502.exome.alignment.lossy_cram.index
 
The average size of the lossy CRAMs is 30.7% and 31.0% of that of the low coverage and exome BAMs, respectively.
 
The above information can also be found in ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/README.crams
 


Friday July 26, 2013

The CG data for 427 samples are available now at:

 
 
Of the 427 samples, 6 samples from the PUR trio and KHV trio were sequenced in both blood and LCL, so the total count of datasets is 433.
 
CG BAMs, VCFs and tar files can be found under subdirectory NAxxxxxx/cg_data; these files are listed in an index file:
 
 
The reference data that came with the CG data can be found under subdirectory NAxxxxxx/cg_data/REF_[LCL | Blood | Buffy]; an index file summarises the reference files:
 
 
We have also untarred the tarballs and put the files under NAxxxxxx/cg_data/ASM_[lcl | blood | buffy]; the untarred files are listed in the following index file:
 
 


Saturday May 25, 2013

The official release of phase3 low coverage and exome data is complete and available on the ftp site. The alignment data were generated by the Sanger Center. All BAMs have gone through the DCC QA process; samples and runs identified as problematic have been withdrawn. The 20130502.analysis.sequence.index has been updated to reflect the withdrawals:

Here are the main alignment index files: 
 
or 
 
There are 2535 samples in the index files; all of them passed QA and have both exome and low coverage data.
 
In the alignment_indices directory ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/alignment_indices/, you may find associated stats files and summary bas files and an exome HsMetrics file:
 
20130502_20120522.alignment_stats.low_coverage.csv
20130502_20120522.alignment_stats.exome.csv
20130502.low_coverage.alignment.index.bas.gz
20130502.exome.alignment.index.bas.gz  
20130502.exome.alignment.index.HsMetrics.gz
20130502.exome.alignment.index.HsMetrics.gz.stats  
 
A handful of samples passed all QA but only have either low coverage data (23) or exome data (16); we keep the BAM files for these samples at
 
 
Two alignment index files can be found in the same directory:
 
20130502.exome_only.alignment.index
20130502.lc_only.alignment.index


Monday April 22, 2013

The final sequence index file is released on the FTP site
http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20130502.sequence.index


The corresponding analysis.sequence.index that contains only >70bp long Illumina reads is 
http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20130502.analysis.sequence.index
 
You may find different stats files for this release in http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices.  We have achieved our goal of 2500 samples for both low coverage and exome projects!  The overlap between >5Gb exome samples and >10Gb low coverage samples is also greater than 2500. 


Monday April 15, 2013

Another sequence index file is released on the FTP site

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20130415.sequence.index

The corresponding analysis.sequence.index that contains only >70bp long Illumina reads is 

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20130415.analysis.sequence.index
 
You may find different stats files for this release in http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices.  We have achieved our goal of 2500 samples for both low coverage and exome projects!  The overlap between >5Gb exome samples and >10Gb low coverage samples is 2466.


Monday April 08, 2013

Another sequence index file is released on the FTP site


http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20130408.sequence.index

The corresponding analysis.sequence.index that contains only >70bp long Illumina reads is 

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20130408.analysis.sequence.index
 
You may find different stats files for this release in http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices.  We are very close to our goal of 2500 samples for both low coverage and exome projects!  


Tuesday April 02, 2013

Another sequence index file is released on the FTP site

 
The corresponding analysis.sequence.index that contains only >70bp long Illumina reads is 
 
You may find different stats files for this release in http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices


Monday March 25, 2013

Another sequence index file is released on the FTP site

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20130325.sequence.index

The corresponding analysis.sequence.index that contains only >70bp long Illumina reads is 

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20130325.analysis.sequence.index
 
You may find different stats files for this release in http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices. Here is a snapshot for your convenience. Please note that we are now reporting, by population, the number of samples with greater than 10Gb of low coverage data and greater than 5Gb of exome data.
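The per-population reporting described here can be sketched as follows. This is only an illustration under stated assumptions: the function names, the per-sample base-count dictionaries, the sample names and the convention that 1 Gb means 10^9 bases are all hypothetical and are not taken from the stats files themselves.

```python
from collections import Counter

GB = 10**9  # assumption: 1 Gb is counted as 1e9 bases


def samples_over(base_counts, threshold_gb):
    """Return the set of samples with more than threshold_gb of sequence."""
    return {s for s, bases in base_counts.items() if bases > threshold_gb * GB}


def overlap_by_population(low_cov, exome, panel):
    """Count, per population, samples passing both the 10Gb and 5Gb cutoffs."""
    both = samples_over(low_cov, 10) & samples_over(exome, 5)
    return Counter(panel[s] for s in both)


# Hypothetical base counts per sample.
low_cov = {"S1": 12 * GB, "S2": 9 * GB, "S3": 15 * GB}
exome = {"S1": 6 * GB, "S2": 7 * GB, "S3": 4 * GB}
panel = {"S1": "GBR", "S2": "GBR", "S3": "YRI"}
print(overlap_by_population(low_cov, exome, panel))
```

Only S1 exceeds both cutoffs in this toy data, so the overlap is smaller than either per-project count, which is the same effect seen when comparing the overlap figure with the individual project totals.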


Tuesday March 19, 2013

Another sequence index file is released on the FTP site


http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20130319.sequence.index

The corresponding analysis.sequence.index that contains only >70bp long Illumina reads is 

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20130319.analysis.sequence.index
 
You may find different stats files for this release in http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices. Here is a snapshot for your convenience. Please note that we are now reporting the number of samples with greater than 10Gb and 5Gb of data by population.
 


Monday March 11, 2013

Another sequence index file is released on the FTP site

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20130311.sequence.index

The corresponding analysis.sequence.index that contains only >70bp long Illumina reads is 

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20130311.analysis.sequence.index
 
You may find different stats files for this release in http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices


Wednesday March 06, 2013
A new sequence index is now available on the FTP site

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20130305.sequence.index

The corresponding analysis.sequence.index that contains only >70bp long Illumina reads is 

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20130305.analysis.sequence.index
 
You may find different stats files for this release in http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices


Wednesday February 20, 2013

In order to meet phase3 release timelines, we are going to make more frequent sequence index releases. We are checking data submissions weekly; a new sequence index will be released when 25 or more new samples or 250GB or more of new sequence are detected. This way BAM files can be created incrementally without much delay.
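The release rule above amounts to a simple either/or threshold check. A minimal sketch follows; the function and parameter names are hypothetical and only the two thresholds come from the announcement.

```python
def should_release_index(new_samples, new_sequence_gb,
                         min_samples=25, min_sequence_gb=250):
    """Return True when either weekly release threshold is met."""
    return new_samples >= min_samples or new_sequence_gb >= min_sequence_gb


print(should_release_index(new_samples=30, new_sequence_gb=10))
print(should_release_index(new_samples=3, new_sequence_gb=120))
```

Either condition alone is enough to trigger a release, so a week with few new samples but a large volume of new sequence still produces a new index.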

 
A new sequence index is now available on the FTP site

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20130218.sequence.index

The corresponding analysis.sequence.index that contains only >70bp long Illumina reads is 

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20130218.analysis.sequence.index
 
You may find different stats files for this release in http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices


Wednesday December 19, 2012
Complete Genomics data for 57 samples are available now in
 
 
The data are organised in the above directory under sample_name/cg_data. Each sample has a CG native ASM tar file, an evidence BAM file, an evidence BAM file with supporting reads, a VCF file, and corresponding index files (bai and tbi).
 
An index for these files is 
 


Thursday December 13, 2012

Our 1000 Genomes Browser has been updated to contain Ensembl v69. This contains the majority of the Phase 1 Integrated release.

You can find our tutorial here.



Tuesday December 11, 2012

A new alignment release is available on the ftp site. The exome and low coverage alignment.index files describe the location of all the alignment files. This release is based on the 20120522.analysis.sequence.index. All BAMs have gone through the DCC QA process; samples and runs identified as problematic have been withdrawn.



Tuesday December 11, 2012

Additional sequence data from the 1000 Genomes full project are now available. The current sequence.index file can be found at:

20121211.sequence.index

Going forward the project plans to produce alignments and variant calls using only Illumina platform sequence data with 70bp reads or longer. This data set has been put into the analysis.sequence.index. There is more info about this in our FAQ
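A filter of this kind can be sketched in Python as below. This is an illustration under loud assumptions, not the project's actual procedure: the column names (INSTRUMENT_PLATFORM, READ_COUNT, BASE_COUNT, RUN_ID) are assumed here and should be checked against the sequence index file format documentation, and read length is approximated as bases divided by reads. The cutoff is applied as "70bp or longer", following the wording above.

```python
import csv
import io


def analysis_rows(index_text, min_read_length=70):
    """Yield rows for Illumina runs whose mean read length meets the cutoff.

    index_text is tab-delimited text with a header line (assumed layout).
    """
    reader = csv.DictReader(io.StringIO(index_text), delimiter="\t")
    for row in reader:
        if row["INSTRUMENT_PLATFORM"].upper() != "ILLUMINA":
            continue  # keep Illumina platform data only
        reads = int(row["READ_COUNT"])
        if reads == 0:
            continue  # avoid dividing by zero on empty runs
        if int(row["BASE_COUNT"]) / reads >= min_read_length:
            yield row


# Hypothetical three-run index with made-up run identifiers.
example = (
    "RUN_ID\tINSTRUMENT_PLATFORM\tREAD_COUNT\tBASE_COUNT\n"
    "run_a\tILLUMINA\t1000\t100000\n"
    "run_b\tILLUMINA\t1000\t36000\n"
    "run_c\tABI_SOLID\t1000\t50000\n"
)
print([row["RUN_ID"] for row in analysis_rows(example)])
```

Here run_b is dropped for short reads and run_c for its platform, leaving only run_a.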

Data access links: EBI / NCBI

Instructions for data download and Aspera

Sequence index and Statistics files

Sequence index file format



Friday November 09, 2012

The slides from our tutorial session which was held on Wednesday 7th and the Data Access Poster are both available from our website

Tutorial Slides

Poster Slide



Wednesday October 31, 2012

The Phase 1 publication, An Integrated map of genetic variation from 1092 human genomes, is now available from Nature and can be downloaded directly from the ftp site. The paper is distributed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported licence. Please share our paper appropriately.

All the data files associated with this paper can be found in our phase1 analysis results directory.



Tuesday October 23, 2012

The 1000 Genomes Project is holding a tutorial during ASHG 2012 on Wednesday 7th November, 7:00 to 9:30pm, at the San Francisco Marriott Marquis.

The 1000 Genomes Project has released the sequence data and an integrated set of variants, genotypes, and haplotypes for the 1092 samples in the phase 1 set, and the sequence data for the phase 2 set.  This tutorial describes the data sets, how to access them, and how to use them.

There are more details on the tutorial page. Please use this form to sign up.



Thursday September 06, 2012

Two Accessibility Tracks have now been added to the 1000 Genomes Browser 

This information was built using sequence data from the phase1 dataset

The two tracks are called the 1000 Genomes Pilot Accessibility Mask and the 1000 Genomes Strict Accessibility Mask.

There is a README which describes how this data set was created. The raw bed and fasta files are also available in the accessible genome ftp directory



Monday July 02, 2012

Analysis results based on our phase1 integrated variant call set are now available.

This includes chrY and chrMT variant calls, functional annotation of our variant calls and local area ancestry inference for our admixed populations.

Full details of the directory contents can be found on this webpage.

Data Access links: EBI / NCBI

README: EBI / NCBI 



Tuesday May 22, 2012

Additional sequence data from the 1000 Genomes full project are now available. The current sequence.index file can be found at:

20120522.sequence.index

Data access links: EBI / NCBI

Instructions for data download and Aspera

Sequence index and Statistics files

Sequence index file format



Wednesday May 09, 2012

Our 1000 Genomes Browser has been updated to contain the Integrated Phase 1 variant set. It has also been moved to Ensembl version 65.

This update includes improved navigation and a track for the Exome Sequencing Project SNPs

You can find our tutorial here.



Sunday April 29, 2012

Nature Methods has published The 1000 Genomes Project: data management and community access.

This paper describes how the consortium manages its data and the tools we have created to make access easier.

The paper itself is freely available under a creative commons license

Article PDF Supplementary Info PDF



Friday April 20, 2012

The new index can be found at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20120419.sequence.index

Associated stats can be found in the same directory.


Thursday April 05, 2012

The 1000 Genomes Project is holding a Community Meeting at the University of Michigan, Ann Arbor on the 12th and 13th of July 2012

The Meeting is meant to showcase advances made by the 1000 Genomes Project, both in methods for the generation and analysis of sequence data and in our understanding of human genetic variation. It will show how 1000 Genomes Project data and methodology are advancing our understanding of human disease, both in disease studies that use 1000 Genomes Project data as a reference and in studies that apply population sequencing and other project technologies to phenotyped samples. It will also highlight other cutting edge sequencing studies and technologies in humans and, finally, generate discussion and lay the groundwork for the next round of community resource sequencing projects.

For more details and to register please go to http://1000gconference.sph.umich.edu/



Thursday March 29, 2012

The 1000 Genomes data set is now available in the Amazon Web Services (AWS) cloud.

For more information about how to access and use the data in the cloud please look at our documentation

More details are available in the NIH Press Release



Friday March 16, 2012

This March 2012 release represents an improved set of our integrated phase 1 variant release. This release represents version 3 of an integrated variant call set based on both low coverage whole genome and exome sequence data. This is an updated set from version 2 (February 2012) of the 20110521 release. For this release approximately 2.38M indels have been filtered from the version 2 call set. See README_v3 for more information.

Our FAQ contains instructions on how to get smaller subsections of these files

Data access links: EBI / NCBI



Tuesday March 13, 2012

The official release of phase2 alignment data is complete and available on the ftp site. All BAMs have gone through the DCC QA process; samples and runs identified as problematic have been withdrawn. The 20111114.sequence.index has been updated to reflect the withdrawals.

EXOME data (Illumina and SOLiD)

In the alignment_indices directory ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/alignment_indices/, you may find associated stats files and summary bas files:
Low Coverage:
20111114_20101123.alignment_stats.low_coverage.csv
20111114.alignment.index.bas.gz
EXOME:
20111114_20101123.alignment_stats.exome.csv
20111114.exome.alignment.index.bas.gz
20111114.exome.alignment.index.HsMetrics.gz
20111114.exome.alignment.index.HsMetrics.stats


Thursday March 01, 2012

We have created a tutorial to give users a basic background to the 1000 Genomes Project.

This includes a series of slides covering the background and history of the project, the structure and format of the raw data, our browser and our tools.

There are exercises for both our browser and web based tools and command line tools which are useful when working with 1000 genomes data.

All the tutorial documents can be found on our ftp site.



Tuesday February 14, 2012

This February 2012 release represents a new improved set of phased genotypes for our integrated phase 1 variant release. This release contains SNPs, short INDELs and Deletions based on low coverage and exome sequencing data across 1092 individuals.

Please note the sites list has been filtered to remove a small number of indels which were discovered to have a high false positive rate. There is more information about this in the README

Our FAQ contains instructions on how to get smaller subsections of these files

Data access links: EBI / NCBI

Link to additional information: README file



Monday January 30, 2012

Additional sequence data from the 1000 Genomes full project are now available. The current sequence.index file can be found at:

20120130.sequence.index

Data access links: EBI / NCBI / Instructions for data download and Aspera

Sequence index and Statistics files

Sequence index file format



Friday January 27, 2012

The EBI 1000 Genomes ftp site is now also available via http



Wednesday January 11, 2012

Trio high coverage BAMs generated for the pilot2 project have been moved to

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/pilot2_high_cov_GRCh37_bams 

Exon targeted BAMs generated for the pilot3 project have been moved to

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/pilot3_exon_targetted_GRCh37_bams

Both sets of BAMs are mapped to GRCh37.



Thursday January 05, 2012

In preparation for the incoming phase 2 BAMs, all phase 1 BAMs have been moved to a phase1 freeze directory:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/  or



Monday December 05, 2011

The October 2011 Integrated Phase 1 Variant Release has been updated. It now includes variant calls on chrX.

For more information please see the new README



Tuesday November 15, 2011

Additional sequence data from the 1000 Genomes full project are now available. The current sequence.index file can be found at:

20111114.sequence.index

Data access links: EBI / NCBI / Instructions for data download and Aspera

Sequence index and Statistics files

Sequence index file format



Friday November 11, 2011

The October 2011 Integrated Phase 1 Variant Release has been updated to include additional information such as rs numbers from dbSNP, Ancestral Alleles and Allele Frequencies.

For more information please see the new README



Wednesday October 26, 2011

At ICHG2011 we presented a tutorial about 1000 Genomes Data. The slides from this tutorial can be found here



Thursday October 13, 2011

A new Project browser based on our Interim 20101123 phase 1 variant calls has been released.

It is based on Ensembl release 63.

Please read our tutorial document for more information about the browser.



Wednesday October 12, 2011

This October 2011 release represents an integrated set of variant calls and phased genotypes including SNPs, short INDELs and Deletions based on low coverage and exome sequencing data across 1092 individuals.

Our FAQ contains instructions on how to get smaller subsections of these files

Data access links: EBI / NCBI

Link to additional information: README file



Wednesday October 12, 2011

The Poster which was presented at the ICHG 2011 Poster session on 12th October is available in powerpoint format here

The 1000 Genomes Project Resources



Tuesday September 20, 2011

Additional sequence data from the 1000 Genomes full project are now available. The current sequence.index file can be found at:

20110920.sequence.index

Data access links: EBI / NCBI / Instructions for data download and Aspera

Sequence index and Statistics files

Sequence index file format



Thursday September 01, 2011

The project browser is based on the 20100804 variant release and has been moved to the latest version of Ensembl, 63.

The new browser features a new Variation Pattern Finder and improvements to the Data Slicer which allow vcf slices to be subselected by individual or, if associations are provided, by population. All location based pages also now have a get vcf button which transfers you to the Data Slicer with the coordinates and the path of the vcf for our 20100804 release filled in. For more information about the new features which are part of the Ensembl project please look at their blog post on the matter



Friday August 12, 2011

The alignments based on the 20110521.sequence.index have been updated. 

Illumina data was aligned at Boston College and SOLiD data at Baylor College of Medicine. Full project data is aligned to the GRCh37 human assembly. In the updated release, all Baylor SOLiD BAMs and associated bai and bas files have been replaced; the new BAMs are now re-calibrated and locally re-aligned.

Data access links: EBI / NCBI / Instructions for data download and Aspera



Tuesday July 19, 2011

The alignments based on the 20110521.sequence.index have been released. This is the first set of exome alignments to be released by the project.

Illumina data was aligned at Boston College and SOLiD data at Baylor College of Medicine. Full project data is aligned to the GRCh37 human assembly.

Data access links: EBI / NCBI / Instructions for data download and Aspera



Tuesday July 19, 2011

Additional sequence data from the 1000 Genomes full project are now available. The current sequence.index file can be found at: 20110719.sequence.index

Data access links: EBI / NCBI / Instructions for data download and Aspera

Sequence index and Statistics files

Sequence index file format



Tuesday July 12, 2011

A new reference assembly which is to be used in the mapping of the Phase 2 data is now available on our FTP site

All existing phase 1 data will be remapped to this assembly for the phase 2 analysis.

Data access links: EBI / NCBI

Link to additional information: README



Thursday June 23, 2011

Genotypes for 1094 individuals for the May 2011 SNP calls from the 20101123 sequence and alignment release of the 1000 genomes project have now been released. This release is based on the GRCh37 assembly of the human genome and is in VCF 4.0 format.

Our FAQ contains instructions on how to get smaller subsections of these files

Data access links: EBI / NCBI

Link to additional information: README file



Thursday June 16, 2011

We have released a public mysql instance of the Ensembl databases which sit behind our browser.

There are more details about this instance on the Public Mysql Instance page.

Please email info@1000genomes.org if you have any questions



Wednesday June 01, 2011

The Database of Genomic Variants Archive (DGVa) has recently released the Structural Variants associated with our pilot study paper.

The variants are available from the ftp sites of both the DGVa and dbVar and can be browsed on the dbVar summary page



Saturday May 21, 2011

Additional sequence data from the 1000 Genomes full project are now available. The current sequence.index file can be found at: 20110521.sequence.index

Data access links: EBI / NCBI / Instructions for data download and Aspera

Sequence index and Statistics files

Sequence index file format



Thursday May 12, 2011

Full Project low coverage SNP call release

SNP calls based on 1094 individuals from the 20101123 sequence and alignment release of the 1000 genomes project have now been released. This release is based on the GRCh37 assembly of the human genome and is in VCF 4.0 format.

Data access links: EBI / NCBI

Link to additional information: README file



Friday May 06, 2011

The project browser has been updated to include the 20100804 variant release and has been moved to the latest version of Ensembl, 62.

A tutorial for the updated browser is available here.

New features include a Data Slicer tool which provides the ability to get subsections of vcf and bam files. Transcript consequences also now include Sift and Polyphen scores for non synonymous SNPs.

The pilot browser is still available at http://pilotbrowser.1000genomes.org



Friday May 06, 2011

Additional sequence data from the 1000 Genomes full project are now available. The current sequence.index file can be found at: 20110506.sequence.index

Data access links: EBI / NCBI / Instructions for data download and Aspera

Sequence index and Statistics files

Sequence index file format



Monday April 11, 2011

Additional sequence data from the 1000 Genomes full project are now available. The current sequence.index file can be found at: 20110411.sequence.index

Data access links: EBI / NCBI / Instructions for data download and Aspera

Sequence index and Statistics files

Sequence index file format



Monday February 28, 2011

Additional sequence data from the 1000 Genomes full project are now available. The current sequence.index file can be found at: 20110228.sequence.index

Data access links: EBI / NCBI / Instructions for data download and Aspera

Links to additional information:Sequence index file format



Wednesday February 16, 2011

The alignments based on the 20101123.sequence.index have been released. There are both new BAM files and updated BAM files to which more data were added. For updated files, the older, redundant files have been withdrawn.

Illumina data was aligned at the Sanger Institute and SOLiD data at TGEN. Full project data is aligned to the GRCh37 human assembly.

Data access links: EBI / NCBI / Instructions for data download and Aspera



Wednesday February 16, 2011

Full Project Indel Release

Indel calls from Dindel. These calls are based on 629 individuals from the 20100804 sequence and alignment release of the 1000 genomes project. This release is based on the GRCh37 assembly of the human genome and is in VCF 4.0 format.

Data access links: EBI / NCBI

Link to additional information: README file



Thursday February 03, 2011

The Structural Variant group from the 1000 Genomes Project Consortium has published its findings based on the pilot data analysis in Nature today. Mapping copy number variation by population scale genome sequencing. The data supporting this paper can be found on the ftp site EBI|NCBI



Tuesday January 25, 2011

Additional sequence data from the 1000 Genomes full project are now available. The current sequence.index file can be found at: 20110124.sequence.index

Data access links: EBI / NCBI / Instructions for data download and Aspera

Sequence index and Statistics files

Sequence index file format



Thursday December 16, 2010

Full Project Genotype Release

Genotypes and haplotypes have been added to the SNP calls released in November. These calls are based on 629 individuals from the 20100804 sequence and alignment release of the 1000 genomes project. This release is based on the GRCh37 assembly of the human genome and is in VCF 4.0 format.

Data access links: EBI / NCBI

Link to additional information: README file



Monday November 22, 2010

The 1000 Genomes Project Data Tutorial was held on Wednesday, November 3, during the American Society of Human Genetics meeting in Washington, DC. The slides and video of the tutorial are publicly available on the NHGRI website.



Tuesday November 09, 2010

Full Project SNP call release

SNP calls based on 628 individuals from the 20100804 sequence and alignment release of the 1000 genomes project have now been released. This release is based on the GRCh37 assembly of the human genome and is in VCF 4.0 format.

Data access links: EBI / NCBI

Link to additional information: README file



Wednesday October 27, 2010

The 1000 Genomes Project Consortium has published the results of the pilot project analysis in the journal Nature, in an article appearing online today. The paper, A map of human genome variation from population-scale sequencing, is available from the Nature web site and is distributed under the terms of the Creative Commons Attribution-Non-Commercial-Share Alike licence to ensure wide distribution. The paper is also available directly from this link, and supporting data for the paper is available from our mirror web sites at the EBI and NCBI.



Monday October 04, 2010

The latest release of sequence data from the 1000 Genomes full project is now available. The new sequence.index file can be found at: 20101004.sequence.index
 

Data access links: EBI / NCBI / Instructions for data download and Aspera

Links to additional information: List of new index and statistics files / Sequence index file format



Monday October 04, 2010

The alignments based on the 20100804.sequence.index have been released. There are both new BAM files and updated BAM files to which more data has been added. For updated files, the older, redundant files have been withdrawn.

Illumina data was aligned at the Sanger Institute and SOLiD data at TGEN. More information is available in the README file linked below. Full project data is aligned to the GRCh37 human assembly.

Data access links: EBI / NCBI / Instructions for data download and Aspera



Wednesday August 04, 2010

The latest release of sequence data from the 1000 Genomes full project is now available. The new sequence.index file can be found at: 20100804.sequence.index

Data access links: EBI / NCBI / Instructions for data download and Aspera

Links to additional information: List of new index and statistics files / Sequence index file format



Tuesday July 20, 2010

Pilot Project Variant call release

Variant calls from the three pilot projects are now available in VCF 4.0 format. This release includes SNPs, short indels and large scale structural variants. All 1000 Genomes pilot project files reference the NCBI build 36 assembly of the human genome.

Data access links: EBI / NCBI

Link to additional information: README file



Monday July 19, 2010

The alignments based on the 20100611.sequence.index have been released. There are both new BAM files and updated BAM files to which more data has been added. For updated files, the older, redundant files have been withdrawn.

Illumina data was aligned at the Sanger Institute and SOLiD data at TGEN. More information is available in the README file linked below. Full project data is aligned to the GRCh37 human assembly.

Data access links: EBI / NCBI / Instructions for data download and Aspera

Links to additional information: New Files / Withdrawn bams



Monday July 12, 2010

Additional sequence data from the 1000 Genomes full project are now available. The current sequence.index file can be found at: 20100710.sequence.index

Data access links: EBI / NCBI / Instructions for data download and Aspera

Links to additional information: List of new index and statistics files / Sequence index file format



Sunday June 06, 2010

The alignments based on the 20100517.sequence.index have been released. BAM files with more data replace older (and now withdrawn) BAM files with less data. Illumina data was aligned at the Sanger Institute and SOLiD data at TGEN. More information is available from the links below. Main project data is aligned to the GRCh37 human assembly.

Data access links: EBI / NCBI / Instructions for data download and Aspera

Links to additional information: New Illumina Files / New SOLiD Files / List of withdrawn bams



Thursday April 29, 2010

New main project sequence files are available on the FTP site.

Link to additional information: 20100429.sequence.index / README.sequence_data / README.populations



Friday April 16, 2010

The Pilot 1 mask files have been patched to support creation of *.fai files with SAMtools.

Data access links: EBI / NCBI

Link to additional information: Changelog



Thursday April 15, 2010

The first set of alignment files from the main phase of the 1000 Genomes project are now available for download.

Illumina data was aligned at the Sanger Institute and SOLiD data at TGEN. More information is available in the README file linked below. Main project data is aligned to the GRCh37 human assembly.

Data access links: EBI / NCBI / Instructions for data download and Aspera

Link to additional information: Changelog (Illumina files) / Changelog (SOLiD files) / README.alignment_data



Wednesday March 31, 2010

The final set of SNPs from Pilots 1, 2 and 3 are now available in VCF format. All 1000 Genomes pilot project files reference the NCBI build 36 assembly of the human genome.

Data access links: EBI / NCBI

Link to additional information: README file



Tuesday February 23, 2010

An updated set of sequence data is now available on the ftp site. We have also updated the readme to reflect a new column added to the sequence.index file and the additional statistics file which is now associated with the index.

The population column of the sequence index has also been rationalised to use only the three letter population codes. Definitions of each code can be found in the readme file below.

Links to additional information: sequence.index / README.sequence_data / README.populations



Monday December 21, 2009

Updated Pilot 2 and Pilot 3 SNP calls in VCF format are now available.

Data access links: EBI / NCBI

Link to additional information: Supplemental and supporting data



Thursday December 17, 2009

The 1000 Genomes web site at 1000genomes.org is now available again.

As part of the recovery of the site, we had to change all of the passwords for those of you that have wiki accounts. You can request your password by selecting the "Send me my password" link on the left side of the page. You will then receive an automated email with a new password and instructions for activating that password. You may have to check your junk mail box to find this email.



Thursday December 10, 2009

The 1000genomes.org web site is currently unavailable. We are aware of the problem and working to get it corrected.



Tuesday November 24, 2009

A Pilot 1 bam file (plus .bas file): NA18940.SLX.maq.SRP000031.2009_09.unmapped.bam has been added to the ftp site. The new version contains corrected library information.

Links to additional information: Changelog / alignment.index



Wednesday November 18, 2009

Several updates to the project alignment files are detailed below:

1. Five pilot 2 chromY bam files (plus associated .bai and .bas files) have been withdrawn. This data included female samples mapped to the Y chromosome. Changelog

2. New .bas files for unmapped.bam files already on the ftp site have been added. Changelog

3. Replacement of 166 .bas files. The replacement files contain corrected "library" entries that are consistent with the pilot_data.sequence.index file. Changelog

The "alignment.index" file has been updated to reflect these changes: alignment.index



Wednesday November 11, 2009

Updated filtered fastq files have been added to the FTP site. This is a regular monthly update of the filtered fastq files for all the pilots and on-going production project. The update includes replacement of 206 buggy filtered fastq files, as well as addition of 430 new filtered fastq files obtained and screened from ERA.

In addition, 159 replacement filtered fastq files for specific CHB runs have been added to the FTP site.

Data access links: EBI / NCBI

Links to additional information: Changelog (new files) / Changelog (replacement of 206 files) / Changelog (replacement of 159 CHB files)



Tuesday November 10, 2009

New annotation files to be used in the analysis of the pilot project have been added to the FTP site. These files include:

  • Phastcons regions of conservation, conserved_noncoding_phastcons_17way.txt.gz. This is a set of phastcons non-coding conserved regions.
  • Uniqueness of the genome, hs36_uniqueness_mask.fa.gz. This is a fasta-like file with a rating for how unique each position of the genome is. The associated readme explains the meaning of each score.
  • Genetic map and recombination hotspots, genetic_map_b36.tar.gz. Genetic map for the 22 autosomes and a recombination hotspot file.

A full list of the annotation can be found on the annotation wiki page.

Data access links: EBI / NCBI

 

Link to further information: Changelog.



Monday November 09, 2009

Replacement bams for pilot 1 CHB alignments and pilot 2 NA19239 have been added to account for the previously missed libraries.

Links to additional information: Changelog / alignment.index



Friday November 06, 2009

Several fastq files that were used to create the alignment files, but were missing from the FTP site, have been reinstated.

Links to additional information: Changelog / pilot_data.sequence.index.



Wednesday November 04, 2009

Five replacement pilot 1 .bas files have been added. The files they replace were incorrectly labelled SRP000032 in column 3.

Links to additional information: Changelog



Tuesday November 03, 2009

New files containing scripts and data related to the GENCODE annotation for determining conserved noncoding regions and introns have been added to the FTP site.

Data access links: EBI / NCBI.

Links for additional information: Changelog / README / Code download (.tgz).



Monday November 02, 2009

The alignment.index file was not correctly updated after the addition of replacement bams on October 29. This has been corrected.

Links to additional information: alignment.index



Thursday October 29, 2009

48 replacement pilot 2 454 bam files (and associated bas and bai files) for NA12878 and NA19240 have been added to the ftp site.

Links to additional information: Changelog (withdrawn files) / Changelog (replacement files) / alignment.index



Tuesday October 27, 2009

Replacement BAM files for the CHB population have been added to the DCC ftp site with an up-to-date alignment index file. 42 old BAM files and associated bai and bas files have been withdrawn and replaced by 18 BAM files and associated bai and bas files.

Links to data: EBI

Links to additional information: Changelog (replacement files) / Changelog (withdrawn files)



Friday October 23, 2009

Pilot 1 replacement bam, bai, bas files have been added to the FTP site.

Links to additional information: Changelog (withdrawn files) / Changelog (replacement files)



Wednesday October 21, 2009

A total of 351 new Pilot 3 454 BAM files have been added to the FTP site replacing a set of older BAM files. These files use a more stringent method of duplicate removal. In the changelog files below, those labelled with the pattern *454.ssaha2.SRP000033.2009_09.ba* were withdrawn.
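
As a rough illustration of how the pattern above can be applied, the following Python sketch separates file names that match the withdrawn-file pattern from those that do not. The example file names are hypothetical and are not taken from the changelog files; only the pattern itself comes from this announcement.

```python
from fnmatch import fnmatchcase

# Pattern quoted in the announcement for the withdrawn Pilot 3 454 files.
WITHDRAWN_PATTERN = "*454.ssaha2.SRP000033.2009_09.ba*"


def split_withdrawn(names):
    """Return (withdrawn, other) lists of file names, preserving input order."""
    withdrawn = [n for n in names if fnmatchcase(n, WITHDRAWN_PATTERN)]
    other = [n for n in names if not fnmatchcase(n, WITHDRAWN_PATTERN)]
    return withdrawn, other


# Hypothetical file names, for illustration only.
example_names = [
    "NA00001.454.ssaha2.SRP000033.2009_09.bam",
    "NA00001.454.ssaha2.SRP000033.2009_09.bam.bai",
    "NA00001.454.ssaha2.SRP000033.2009_10.bam",
]
withdrawn, other = split_withdrawn(example_names)
```

Because the pattern ends in ba*, it covers bam files as well as any index or statistics files whose names continue from the same stem.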

Links to additional information: Changelog (replacement files) / Changelog (withdrawn files)



Monday October 12, 2009

A pilot 3 gene list and a bed file for pilot 3 consensus exonic target regions as described by Fuli below have been put on the 1000 Genomes ftp site.

Links to data: EBI / NCBI

Links to additional information: Changelog



Monday October 12, 2009

Replacement bam (and .bai, .bas) files have been added, correcting for withdrawn lanes that were included in the bam files.

Links to additional information: Changelog (replacement files) / Changelog (withdrawn files) / alignment.index



Monday October 12, 2009

The reference genome to be used for the main project is GRCh37 and the files to be used for alignment in the main project are now available. The procedure for creating the reference genome is:

1. Download individual chrs for the GRCh37 assembly from Ensembl: FTP site

2. Download the newer version of the MT (NC_012920): Data link

3. Create a reference with chrs1-22, X, Y, NC_012920 MT, and include the non-chromosomal supercontigs.
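
The assembly in step 3 can be sketched in Python as a simple concatenation of FASTA files. This is only an illustrative sketch and not the script the project used. It assumes the downloaded files have already been decompressed, that each file ends with a newline, and that the supercontigs are appended after the MT record. The function name and its arguments are hypothetical.

```python
import shutil
from pathlib import Path

# Sequence order from step 3: chromosomes 1-22, X, Y, then the MT record.
CHROMOSOME_ORDER = [str(i) for i in range(1, 23)] + ["X", "Y", "MT"]


def build_reference(fasta_by_name, supercontig_fastas, out_path):
    """Concatenate uncompressed FASTA files into a single reference file.

    fasta_by_name maps each name in CHROMOSOME_ORDER to a FASTA path.
    supercontig_fastas is a list of FASTA paths appended after the
    chromosomes (this placement is an assumption, not stated in the text).
    """
    missing = [c for c in CHROMOSOME_ORDER if c not in fasta_by_name]
    if missing:
        raise ValueError(f"missing sequences: {missing}")

    ordered = [fasta_by_name[c] for c in CHROMOSOME_ORDER]
    ordered += list(supercontig_fastas)

    # Copy the files byte for byte, in order, into the output file.
    with open(out_path, "wb") as out:
        for path in ordered:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return Path(out_path)
```

Checking for missing sequences before writing anything avoids leaving a partial reference file behind if one of the downloads is absent.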

Links to data: EBI / NCBI

The pilot project reference genome (based on NCBI build 36) remains available in the retired_reference directory on the FTP site.

Link to additional information: Changelog (file locations)



Wednesday October 07, 2009

A total of 844 files have been added to the ftp/data/ directory, representing new and replacement filtered fastq files. Replacement files were required for a small number of fastq files with incorrect quality scores. The files with the incorrect quality scores were removed. In the sequence index file the removed files are marked "WITHDRAWN FROM ARCHIVE".

Links to data: EBI / NCBI.

Links to additional information: Changelog (new files) / Changelog (replaced files) / Changelog (withdrawn files)



Tuesday October 06, 2009

A truncated bam file and the associated bai and bas files have been replaced.

Links to replacement file: EBI / NCBI

Link to additional information: Changelog



Friday October 02, 2009

719 new .bas (bam statistics) files have been added to the FTP site and 1129 .bas files have been replaced. The replaced files have corrected bam file names and statistics including #_total_bases and %_of_mismatched_bases. The alignment index files will be updated shortly to give #_total_bases and #_mapped_bases for each bam file entry.

Links to additional information: Changelog (new .bas files) / Changelog (replacement .bas files)



Friday October 02, 2009

A set of 33315 fastq files containing sequence reads that have passed a DCC fastq QC process are now available on the FTP site.

The filtered_fastq files contain only the reads that passed the DCC fastq QC process. The input to the DCC QC pipeline is all fastq files retrieved from ERA, including reads generated by all three pilots and the main project, as of 4 September 2009.

Summary statistics of the current set of files:

 

Total read count: 143,297,692,374

Total base count: 6,248,070,765,643

Total filtered read count: 130,945,326,161

Total filtered base count: 5,715,866,786,951

Percentage of good reads: 91.38%

Percentage of good bases: 91.48%
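
As a quick consistency check, the two percentages above can be recomputed from the four counts. The short Python sketch below does only that arithmetic; the variable and function names are chosen for illustration.

```python
# Counts copied from the summary statistics above.
total_reads = 143_297_692_374
total_bases = 6_248_070_765_643
filtered_reads = 130_945_326_161
filtered_bases = 5_715_866_786_951


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


good_reads = f"{percent(filtered_reads, total_reads):.2f}%"
good_bases = f"{percent(filtered_bases, total_bases):.2f}%"
print(good_reads, good_bases)  # prints: 91.38% 91.48%
```

In other words, the QC process removed a little under 9% of both reads and bases.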

Links to data: EBI / NCBI

Link to additional information: Changelog (list of new files) / QC criteria in README.sequence_data



Friday May 15, 2009

Indel calls have been made by several groups on the trio children NA12878 (CEU) and NA19240 (YRI).

There is a README which describes the call set.

These calls are available on the ftp site: EBI / NCBI



Sunday February 15, 2009

The first set of SNPs and other supporting data from the low coverage individuals is now officially released on the EBI and NCBI FTP sites associated with the 1000 Genomes project.

This intermediate release of results from the 1000 Genomes project consists of the following (all file paths below are relative to the following root on the two mirror ftp sites: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/2009_02/ or ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/2009_02/).

1. SNP calls from 104 individuals in the low coverage pilot, Pilot 1.

See this file: Pilot1/README_SRP000031_2009_02 for more details on how the calls were made and what the file formats are. All the files related to this set of calls, excepting the human genome reference used (see point 3 below), are in the Pilot1 directory.

See this file: Pilot1/alignments.index for a tab-delimited list of alignment files, one per individual, including md5 checksums for each file. The third and fourth columns give the project id for Pilot1, SRP000031, and the identifier for the corresponding individual. The alignment files are in SAM format (see http://samtools.sourceforge.net/).

Please note these are intermediate results, NOT based on the sequencing data from the complete Pilot.

2. The human genome reference files used can be found in the following directory: ../../technical/reference.

The male and female reference files are in gzipped fasta format, and there are also two index files, for indexing directly into the compressed references, for BAM file manipulation. This index format is defined in the README within that directory.
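
For readers who want to work with the tab-delimited Pilot1/alignments.index file described in point 1, the Python sketch below shows one possible way to read the project id and individual identifier from the third and fourth columns, and to compute an md5 checksum for a downloaded file. The function names are hypothetical, skipping lines that start with "#" is an assumption, and the positions of the file path and checksum columns are not stated here, so they are left for the caller to pick out of the returned row.

```python
import csv
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_index(handle):
    """Yield (project_id, individual, row) for each tab-delimited data row.

    Columns 3 and 4 (indices 2 and 3) hold the project id and the
    individual identifier, as described in the announcement. Blank lines,
    lines starting with '#' (an assumption) and rows with fewer than four
    columns are skipped.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#") or len(row) < 4:
            continue
        yield row[2], row[3], row
```

A checksum computed with md5_of_file can then be compared with the value listed for the same file in the index.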



Tuesday December 23, 2008

The first set of SNP calls representing the preliminary analysis of four genome sequences is now available to download through the EBI FTP site and the NCBI FTP site. The README file dealing with the FTP structure will help you find the data you are looking for.

The data can also be viewed directly through the 1000 Genomes browser at http://browser.1000genomes.org. Launch the browser and view a sample region here.

For more information about using the 1000 Genomes browser, download the Quick start guide.

The 1000 Genomes Project announces the release of the first set of SNP calls for 4 individuals that are part of the high coverage pilot project. These SNPs represent the preliminary analysis of a portion of the data so far collected and are released in accordance with the Ft. Lauderdale agreement for community resource projects.

This preliminary release is designed both to provide data to the community and to test the systems used to make the data available, including the 1000 Genomes browser available at http://browser.1000genomes.org. Additional updates of this site and the 1000 Genomes browser will take place throughout January.

In addition to the SNP files and the 1000 Genomes project browser, raw project data is made available as soon as possible by NCBI and the EBI. The data consist of README files, fastq files (nucleotides and qualities) suitable for use with most aligners, and md5 checksum data.

Users from the Americas should download the mirrored 1000 Genomes data from NCBI via ftp at: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ or via the Aspera high speed data transfer client at: http://fasp.ncbi.nlm.nih.gov/1000genomes.html.

Users from Europe and the rest of the world should download the mirrored 1000 Genomes data from the EBI at: ftp://ftp.1000genomes.ebi.ac.uk/. The EBI will be implementing an Aspera client in the near future.

The data available at the EBI and the NCBI are identical. Depending on server load and Internet connectivity some users in the Americas may have faster FTP downloads from the EBI and some users in Europe and the rest of the world may have faster downloads from NCBI.

Raw data for a portion of the project is already available through the Short Read Archive at the NCBI and the European Read Archive at the EBI. These comprehensive archives are specifically designed for short read data and will be making the complete project data available as soon as possible.

The 1000 Genomes project plans to release summary data including positions of variants in individuals and populations. These data releases will include SNPs and CNV analysis for the six high-coverage individuals and all of the low coverage individuals. Quarterly data releases are planned starting in January 2009.