Announcements
The official release of phase3 low coverage and exome data is complete and available on the ftp site. The alignment data were generated by Sanger Center. All BAMs have gone through the DCC QA process; samples and runs identified as problematic have been withdrawn. The 20130502.analysis.sequence.index has been updated to reflect the withdrawals.
The final sequence index file is released on the FTP site
http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20130422.sequence.index
The corresponding analysis.sequence.index that contains only >70bp long Illumina reads is
http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20130422.analysis.sequence.index
Another sequence index file is released on the FTP site
The corresponding analysis.sequence.index that contains only >70bp long Illumina reads is
http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20130415.analysis.sequence.index
Another sequence index file is released on the FTP site
http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20130408.sequence.index
The corresponding analysis.sequence.index that contains only >70bp long Illumina reads is
http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20130408.analysis.sequence.index
Another sequence index file is released on the FTP site
The corresponding analysis.sequence.index that contains only >70bp long Illumina reads is
http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20130325.analysis.sequence.index
Another sequence index file is released on the FTP site
http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20130319.sequence.index
The corresponding analysis.sequence.index that contains only >70bp long Illumina reads is
http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20130319.analysis.sequence.index
Another sequence index file is released on the FTP site
The corresponding analysis.sequence.index that contains only >70bp long Illumina reads is
http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20130311.analysis.sequence.index
http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20130305.sequence.index
The corresponding analysis.sequence.index that contains only >70bp long Illumina reads is
http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20130305.analysis.sequence.index
To meet phase3 release timelines, we are going to make more frequent sequence index releases. We check data submissions weekly; a new sequence index will be released when 25 or more new samples or 250GB or more of new sequence data are detected. This way BAM files can be created incrementally without much delay.
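The release rule above can be written as a simple threshold check. This is only a sketch of the stated rule, not the DCC's actual tooling: the function name and parameter names are hypothetical, and the "GB" unit is taken as written in the announcement without further interpretation.

```python
# Thresholds quoted in the announcement.
MIN_NEW_SAMPLES = 25
MIN_NEW_SEQUENCE_GB = 250  # unit as written in the announcement ("250GB")


def should_release_index(new_samples: int, new_sequence_gb: float) -> bool:
    """Return True when either weekly threshold is met (hypothetical helper)."""
    return new_samples >= MIN_NEW_SAMPLES or new_sequence_gb >= MIN_NEW_SEQUENCE_GB


if __name__ == "__main__":
    # Enough new samples triggers a release even with little new sequence.
    print(should_release_index(new_samples=30, new_sequence_gb=12.0))
    # Neither threshold met: wait for next week's check.
    print(should_release_index(new_samples=3, new_sequence_gb=40.0))
```

Either condition alone is sufficient, which is why the check uses `or` rather than `and`.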
http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20130218.sequence.index
The corresponding analysis.sequence.index that contains only >70bp long Illumina reads is
http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20130218.analysis.sequence.index
Our 1000 Genomes Browser has been updated to contain Ensembl v69. This contains the majority of the Phase 1 Integrated release.
You can find our tutorial here.
A new alignment release is available on the ftp site. The exome and low coverage alignment.index files describe the location of all the alignment files. This release is based on the 20120522.analysis.sequence.index. All BAMs have gone through the DCC QA process; samples and runs identified as problematic have been withdrawn.
Additional sequence data from the 1000 Genomes full project are now available. The current sequence.index file can be found at:
Going forward, the project plans to produce alignments and variant calls using only Illumina platform sequence data with reads of 70bp or longer. This data set is listed in the analysis.sequence.index. There is more information about this in our FAQ.
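As a rough illustration of the selection described above, the sketch below keeps only Illumina records whose reads are at least 70bp. The field names (`INSTRUMENT_PLATFORM`, `READ_COUNT`, `BASE_COUNT`) are assumptions about the index columns, and using mean read length (bases divided by reads) as the length test is a stand-in chosen for this example; the announcement does not say how the project itself measures read length.

```python
from typing import Dict, Iterable, List

MIN_READ_LENGTH = 70  # "70bp reads or longer"


def mean_read_length(record: Dict[str, str]) -> float:
    """Approximate read length as bases / reads (an assumption of this sketch)."""
    reads = int(record["READ_COUNT"])
    return int(record["BASE_COUNT"]) / reads if reads else 0.0


def select_for_analysis(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep Illumina records whose mean read length meets the threshold."""
    return [
        r
        for r in records
        if r["INSTRUMENT_PLATFORM"].upper() == "ILLUMINA"
        and mean_read_length(r) >= MIN_READ_LENGTH
    ]


if __name__ == "__main__":
    # Made-up records, for illustration only.
    example = [
        {"INSTRUMENT_PLATFORM": "ILLUMINA", "READ_COUNT": "1000", "BASE_COUNT": "100000"},
        {"INSTRUMENT_PLATFORM": "ILLUMINA", "READ_COUNT": "1000", "BASE_COUNT": "36000"},
        {"INSTRUMENT_PLATFORM": "ABI_SOLID", "READ_COUNT": "1000", "BASE_COUNT": "75000"},
    ]
    print(len(select_for_analysis(example)))
```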
Instructions for data download and Aspera
Sequence index and Statistics files
The slides from our tutorial session, which was held on Wednesday 7th, and the Data Access Poster are both available from our website.
The Phase 1 publication, An Integrated map of genetic variation from 1092 human genomes, is now available from Nature and can be downloaded directly from the ftp site. The paper is distributed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported licence. Please share our paper appropriately.
All the data files associated with this paper can be found in our phase1 analysis results directory.
The 1000 Genomes Project is holding a tutorial during ASHG 2012 on Wednesday 7th November, 7:00 to 9:30pm, at the San Francisco Marriott Marquis.
The 1000 Genomes Project has released the sequence data and an integrated set of variants, genotypes, and haplotypes for the 1092 samples in the phase 1 set, and the sequence data for the phase 2 set. This tutorial describes the data sets, how to access them, and how to use them.
There are more details on the tutorial page. Please use this form to sign up.
Two Accessibility Tracks have now been added to the 1000 Genomes Browser
This information was built using sequence data from the phase1 dataset
The two tracks are called the 1000 Genomes Pilot Accessibility Mask and the 1000 Genomes Strict Accessibility Mask.
There is a README which describes how this data set was created. The raw bed and fasta files are also available in the accessible genome ftp directory
Analysis results based on our phase1 integrated variant call set are now available.
This includes chrY and chrMT variant calls, functional annotation of our variant calls and local area ancestry inference for our admixed populations.
Full details of the directory contents can be found on this webpage.
Additional sequence data from the 1000 Genomes full project are now available. The current sequence.index file can be found at:
Instructions for data download and Aspera
Sequence index and Statistics files
Our 1000 Genomes Browser has been updated to contain the Integrated Phase 1 variant set. It has also been moved to Ensembl version 65.
This update includes improved navigation and a track for the Exome Sequencing Project SNPs
You can find our tutorial here.
Nature Methods has published The 1000 Genomes Project: data management and community access.
This paper describes how the consortium manages its data and the tools we have created to make access easier.
The paper itself is freely available under a creative commons license
Article PDF / Supplementary Info PDF
The new index can be found at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20120419.sequence.index
The 1000 Genomes Project is holding a Community Meeting at the University of Michigan, Ann Arbor on the 12th and 13th of July 2012
The meeting is meant to showcase advances made by the 1000 Genomes Project, both in methods for generating and analysing sequence data and in our understanding of human genetic variation. It will show how 1000 Genomes Project data and methodology are advancing our understanding of human disease, both in disease studies that use 1000 Genomes Project data as a reference and in studies that apply population sequencing and other project technologies to phenotyped samples. It will also highlight other cutting-edge sequencing studies and technologies in humans and, finally, generate discussion and lay the groundwork for the next round of community resource sequencing projects.
For more details and to register please go to http://1000gconference.sph.umich.edu/
The 1000 Genomes data set is now available in the Amazon Web Services (AWS) cloud.
For more information about how to access and use the data in the cloud please look at our documentation
More details are available in the NIH Press Release
This March 2012 release represents an improved set of our integrated phase 1 variant release. This release represents version 3 of an integrated variant call set based on both low coverage and exome whole genome sequence data. This is an updated set from version 2 (February 2012) of the 20110521 release. For this release approximately 2.38M indels have been filtered from the version 2 call set. See README_v3 for more information.
Our FAQ contains instructions on how to get smaller subsections of these files
The official release of phase2 alignment data is complete and available on the ftp site. All BAMs have gone through the DCC QA process; samples and runs identified as problematic have been withdrawn. The 20111114.sequence.index has been updated to reflect the withdrawals.
We have created a tutorial to give users a basic introduction to the 1000 Genomes Project.
This includes a series of slides covering the background and history of the project, the structure and format of the raw data, our browser and our tools.
There are exercises for both our browser and web based tools and command line tools which are useful when working with 1000 genomes data.
All the tutorial documents can be found on our ftp site.
This February 2012 release represents a new improved set of phased genotypes for our integrated phase 1 variant release. This release contains SNPs, short INDELs and Deletions based on low coverage and exome sequencing data across 1092 individuals.
Please note the sites list has been filtered to remove a small number of indels which were discovered to have a high false positive rate. There is more information about this in the README.
Our FAQ contains instructions on how to get smaller subsections of these files
Link to additional information: README file
Additional sequence data from the 1000 Genomes full project are now available. The current sequence.index file can be found at:
Data access links: EBI / NCBI / Instructions for data download and Aspera
Trio high coverage BAMs generated for the pilot2 project have been moved to
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/pilot2_high_cov_GRCh37_bams
Exon-targeted BAMs generated for the pilot3 project have been moved to
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/pilot3_exon_targetted_GRCh37_bams
Both sets of BAMs are mapped to GRCh37.
In preparation for the incoming phase 2 BAMs, all phase 1 BAMs have been moved to a phase1 freeze directory -
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/ or
The October 2011 Integrated Phase 1 Variant Release has been updated. It now includes variant calls on chrX.
For more information please see the new README
Additional sequence data from the 1000 Genomes full project are now available. The current sequence.index file can be found at:
Data access links: EBI / NCBI / Instructions for data download and Aspera
The October 2011 Integrated Phase 1 Variant Release has been updated to include additional information such as rs numbers from dbSNP, Ancestral Alleles and Allele Frequencies.
For more information please see the new README
A new Project browser based on our Interim 20101123 phase 1 variant calls has been released.
It is based on Ensembl release 63.
Please read our tutorial document for more information about the browser.
This October 2011 release represents an integrated set of variant calls and phased genotypes including SNPs, short INDELs and Deletions based on low coverage and exome sequencing data across 1092 individuals.
Our FAQ contains instructions on how to get smaller subsections of these files
Link to additional information: README file
The Poster which was presented at the ICHG 2011 Poster session on 12th October is available in PowerPoint format here.
The 1000 Genomes Project Resources
Additional sequence data from the 1000 Genomes full project are now available. The current sequence.index file can be found at:
Data access links: EBI / NCBI / Instructions for data download and Aspera
Sequence index and Statistics files
The project browser, based on the 20100804 variant release, has been moved to the latest version of Ensembl, 63.
The new browser features a new Variation Pattern Finder and improvements to the Data Slicer that allow VCF slices to be subselected by individual or, if associations are provided, by population. All location-based pages also now have a get vcf button which transfers you to the Data Slicer with the coordinates and the path of the VCF for our 20100804 release filled in. For more information about the new features, which are part of the Ensembl project, please see their blog post on the matter.
The alignments based on the 20110521.sequence.index have been updated.
Illumina data was aligned at Boston College and SOLiD data at Baylor College of Medicine. Full project data is aligned to the GRCh37 human assembly. In the updated release, all Baylor SOLiD BAMs and associated bai and bas files have been replaced; the new BAMs are now re-calibrated and locally re-aligned.
Data access links: EBI / NCBI / Instructions for data download and Aspera
The alignments based on the 20110521.sequence.index have been released. This is the first set of exome alignments to be released by the project.
Illumina data was aligned at Boston College and SOLiD data at Baylor College of Medicine. Full project data is aligned to the GRCh37 human assembly.
Data access links: EBI / NCBI / Instructions for data download and Aspera
Additional sequence data from the 1000 Genomes full project are now available. The current sequence.index file can be found at: 20110719.sequence.index
Data access links: EBI / NCBI / Instructions for data download and Aspera
Sequence index and Statistics files
Genotypes for 1094 individuals for the May 2011 SNP calls from the 20101123 sequence and alignment release of the 1000 Genomes Project have now been released. This release is based on the GRCh37 assembly of the human genome and is released in VCF 4.0 format.
Our FAQ contains instructions on how to get smaller subsections of these files
Link to additional information: README file
We have released a public mysql instance of the Ensembl databases which sit behind our browser.
There are more details about this instance on the Public Mysql Instance page.
Please email info@1000genomes.org if you have any questions.
The Database of Genomic Variants Archive (DGVa) has recently released the Structural Variants associated with our pilot study paper.
The variants are available from the ftp sites of both the DGVa and dbVar and can be browsed on the dbVar summary page.
Additional sequence data from the 1000 Genomes full project are now available. The current sequence.index file can be found at: 20110521.sequence.index
Data access links: EBI / NCBI / Instructions for data download and Aspera
Sequence index and Statistics files
Full Project low coverage SNP call release
SNP calls based on 1094 individuals from the 20101123 sequence and alignment release of the 1000 Genomes Project have now been released. This release is based on the GRCh37 assembly of the human genome and is released in VCF 4.0 format.
Link to additional information: README file
The project browser has been updated to include the 20100804 variant release and has been moved to the latest version of Ensembl, 62.
A tutorial for the updated browser is available here.
New features include a Data Slicer tool which provides the ability to get subsections of VCF and BAM files. Transcript consequences also now include SIFT and PolyPhen scores for non-synonymous SNPs.
The pilot browser is still available at http://pilotbrowser.1000genomes.org
Additional sequence data from the 1000 Genomes full project are now available. The current sequence.index file can be found at: 20110506.sequence.index
Data access links: EBI / NCBI / Instructions for data download and Aspera
Sequence index and Statistics files
Additional sequence data from the 1000 Genomes full project are now available. The current sequence.index file can be found at: 20110411.sequence.index
Data access links: EBI / NCBI / Instructions for data download and Aspera
Sequence index and Statistics files
Additional sequence data from the 1000 Genomes full project are now available. The current sequence.index file can be found at: 20110228.sequence.index
Data access links: EBI / NCBI / Instructions for data download and Aspera
Links to additional information: Sequence index file format
The alignments based on the 20101123.sequence.index have been released. There are both new BAM files and updated BAM files to which more data were added. For updated files, the older, redundant files have been withdrawn.
Illumina data was aligned at the Sanger Institute and SOLiD data at TGEN. Full project data is aligned to the GRCh37 human assembly.
Data access links: EBI / NCBI / Instructions for data download and Aspera
Full Project Indel Release
Indel calls from Dindel. These calls are based on 629 individuals from the 20100804 sequence and alignment release of the 1000 Genomes Project. This release is based on the GRCh37 assembly of the human genome and is released in VCF 4.0 format.
Link to additional information: README file
The Structural Variant group from the 1000 Genomes Project Consortium has published its findings based on the pilot data analysis in Nature today, in the paper Mapping copy number variation by population scale genome sequencing. The data supporting this paper can be found on the ftp site: EBI / NCBI
Additional sequence data from the 1000 Genomes full project are now available. The current sequence.index file can be found at: 20110124.sequence.index
Data access links: EBI / NCBI / Instructions for data download and Aspera
Sequence index and Statistics files
Full Project Genotype Release
Genotypes and haplotypes have been added to the SNP calls released in November. These calls are based on 629 individuals from the 20100804 sequence and alignment release of the 1000 Genomes Project. This release is based on the GRCh37 assembly of the human genome and is released in VCF 4.0 format.
Link to additional information: README file
The 1000 Genomes Project Data Tutorial was held on Wednesday, November 3, during the American Society of Human Genetics meeting in Washington, DC. The slides and video of the tutorial are publicly available on the NHGRI website.
Full Project SNP call release
SNP calls based on 628 individuals from the 20100804 sequence and alignment release of the 1000 Genomes Project have now been released. This release is based on the GRCh37 assembly of the human genome and is released in VCF 4.0 format.
Link to additional information: README file
The 1000 Genomes Project Consortium has published the results of the pilot project analysis in the journal Nature in an article appearing online today. The paper, A map of human genome variation from population-scale sequencing, is available from the Nature web site and is distributed under the terms of the Creative Commons Attribution-Non-Commercial-Share Alike licence to ensure wide distribution. The paper is also available directly from this link, and supporting data for the paper is available from our mirror web sites at the EBI and NCBI.
The latest release of sequence data from the 1000 Genomes full project is now available. The new sequence.index file can be found at: 20101004.sequence.index
Data access links: EBI / NCBI / Instructions for data download and Aspera
Links to additional information: List of new index and statistics files / Sequence index file format
The alignments based on the 20100804.sequence.index have been released. There are both new BAM files and updated BAM files to which more data were added. For updated files, the older, redundant files have been withdrawn.
Illumina data was aligned at the Sanger Institute and SOLiD data at TGEN. More information is available in the README file linked below. Full project data is aligned to the GRCh37 human assembly.
Data access links: EBI / NCBI / Instructions for data download and Aspera
The latest release of sequence data from the 1000 Genomes full project is now available. The new sequence.index file can be found at: 20100804.sequence.index
Data access links: EBI / NCBI / Instructions for data download and Aspera
Links to additional information: List of new index and statistics files / Sequence index file format
Pilot Project Variant call release
Variant Calls from the three pilot projects are now available in VCF 4.0 format. This release includes SNPs, short indels and large scale structural variants. All 1000 genomes pilot project files reference the NCBI build 36 assembly of the human genome
Link to additional information: README file
The alignments based on the 20100611.sequence.index have been released. There are both new BAM files and updated BAM files to which more data were added. For updated files, the older, redundant files have been withdrawn.
Illumina data was aligned at the Sanger Institute and SOLiD data at TGEN. More information is available in the README file linked below. Full project data is aligned to the GRCh37 human assembly.
Data access links: EBI / NCBI / Instructions for data download and Aspera
Links to additional information: New Files / Withdrawn bams
Additional sequence data from the 1000 Genomes full project are now available. The current sequence.index file can be found at: 20100710.sequence.index
Data access links: EBI / NCBI / Instructions for data download and Aspera
Links to additional information: List of new index and statistics files / Sequence index file format
The alignments based on the 20100517.sequence.index have been released. BAM files with more data replace older (and now withdrawn) BAM files with less data. Illumina data was aligned at the Sanger Institute and SOLiD data at TGEN. More information is available from the links below. Main project data is aligned to the GRCh37 human assembly.
Data access links: EBI / NCBI / Instructions for data download and Aspera
Links to additional information: New Illumina Files / New SOLiD Files / List of withdrawn bams
New main project sequence files are available on the FTP site.
Link to additional information: 20100429.sequence.index / README.sequence_data / README.populations
The first set of alignment files from the main phase of the 1000 Genomes project are now available for download.
Illumina data was aligned at the Sanger Institute and SOLiD data at TGEN. More information is available in the README file linked below. Main project data is aligned to the GRCh37 human assembly.
Data access links: EBI / NCBI / Instructions for data download and Aspera
Link to additional information: Changelog (Illumina files) / Changelog (SOLiD files) / README.alignment_data
The final set of SNPs from Pilots 1, 2 and 3 are now available in VCF format. All 1000 Genomes pilot project files reference the NCBI build 36 assembly of the human genome.
Link to additional information: README file
An updated set of sequence data is now available on the ftp site. We have also updated the readme to reflect a new column added to the sequence.index file and the additional statistics file which is now associated with the index.
The population column of the sequence index has also been rationalised to use only the three letter population codes. Definitions of each code can be found in the readme file below.
Links to additional information: sequence.index / README.sequence_data / README.populations
Updated Pilot2 and Pilot3 SNP calls in VCF format are now available.
Link to additional information: Supplemental and supporting data
The 1000 Genomes web site at 1000genomes.org is now available again.
As part of the recovery of the site, we had to change all of the passwords for those of you that have wiki accounts. You can request your password by selecting the "Send me my password" link on the left side of the page. You will then receive an automated email with a new password and instructions for activating that password. You may have to check your junk mail box to find this email.
The 1000genomes.org web site is currently unavailable. We are aware of the problem and working to get it corrected.
A Pilot 1 bam file (plus .bas file): NA18940.SLX.maq.SRP000031.2009_09.unmapped.bam has been added to the ftp site. The new version contains corrected library information.
Links to additional information: Changelog / alignment.index
Several updates to the project alignment files are detailed below:
1. Five pilot 2 chromY bams (plus associated .bai and .bas files) have been withdrawn. This data included female samples mapped to the Y chromosome. Changelog
2. New .bas files for unmapped.bam files already on the ftp site have been added. Changelog
3. Replacement of 166 .bas files. The replacement files contain corrected "library" entries that are consistent with pilot_data.sequence.index file. Changelog
The "alignment.index" file has been updated to reflect these changes: alignment.index
Updated filtered fastq files have been added to the FTP site. This is a regular monthly update of the filtered fastq files for all the pilots and on-going production project. The update includes replacement of 206 buggy filtered fastq files, as well as addition of 430 new filtered fastq files obtained and screened from ERA.
In addition, 159 replacement filtered fastq files for specific CHB runs have been added to the FTP site.
Links to additional information: Changelog (new files) / Changelog (replacement of 206 files) / Changelog (replacement of 159 CHB files)
New annotation files to be used in the analysis of the pilot project have been added to the FTP site. These files include:
- Phastcons regions of conservation, conserved_noncoding_phastcons_17way.txt.gz. This is a set of phastcons non-coding conserved regions.
- Uniqueness of the genome, hs36_uniqueness_mask.fa.gz. This is a fasta-like file with a rating for how unique each position of the genome is. The associated readme explains the meaning of each score.
- Genetic map and Recombination hotspots, genetic_map_b36.tar.gz, Genetic map for the 22 autosomes and a recombination hotspot file
A full list of the annotation can be found on the annotation wiki page.
Link to further information: Changelog.
Replacement bams for pilot1 CHB alignments and pilot2 NA19239 have been added to account for previously missed libraries.
Links to additional information: Changelog / alignment.index
Several fastq files that were used to create the alignment files but were missing from the FTP site have been reinstated.
Links to additional information: Changelog / pilot_data.sequence.index.
5 replacement pilot 1 .bas files have been added; the files replaced were incorrectly labelled SRP000032 in column 3.
Links to additional information: Changelog
New files containing scripts and data related to the GENCODE annotation for determining conserved noncoding regions and introns have been added to the FTP site.
Data access links: EBI / NCBI.
Links for additional information: Changelog / README / Code download (.tgz).
The alignment.index file was not correctly updated after the addition of replacement bams on October 29. This has been corrected.
Links to additional information: alignment.index
48 replacement pilot 2 454 bams (and associated bas and bai) files for NA12878 and NA19240 have been added to the ftp site.
Links to additional information: Changelog (withdrawn files) / Changelog (replacement files) / alignment.index
Replacement BAM files for the CHB population have been added to the DCC ftp site with an up-to-date alignment index file. 42 old BAM files and associated bai and bas files have been withdrawn and replaced by 18 BAM files and associated bai and bas files.
Links to data: EBI
Links to additional information: Changelog (replacement files) / Changelog (withdrawn files)
Pilot 1 replacement bam, bai, bas files have been added to the FTP site.
Links to additional information: Changelog (withdrawn files) / Changelog (replacement files)
A total of 351 new Pilot 3 454 BAM files have been added to the FTP site replacing a set of older BAM files. These files use a more stringent method of duplicate removal. In the changelog files below, those labelled with the pattern *454.ssaha2.SRP000033.2009_09.ba* were withdrawn.
Links to additional information: Changelog (replacement files) / Changelog (withdrawn files)
Replacement bam (and .bai, .bas) files added, correcting for withdrawn lanes included in bam file.
Links to additional information: Changelog (replacement files) / Changelog (withdrawn files) / alignment.index
The reference genome to be used for the main project is GRCh37 and the files to be used for alignment in the main project are now available. The procedure for creating the reference genome is:
1. Download individual chrs for the GRCh37 assembly from Ensembl: FTP site
2. Download the newer version of the MT (NC_012920): Data link
3. Create a reference with chrs1-22, X, Y, NC_012920 MT, and include the non-chromosomal supercontigs.
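Step 3 above amounts to concatenating FASTA files in a fixed order. The sketch below shows one way this could be done; it is not the project's own script. The per-chromosome file names (`chr1.fa`, `chr2.fa`, ...) and the function name are hypothetical, and placing the supercontigs after MT simply follows the order in which step 3 lists them.

```python
from pathlib import Path
from typing import BinaryIO

# Chromosomes 1-22, then X and Y, as listed in step 3.
CHROMOSOME_ORDER = [str(n) for n in range(1, 23)] + ["X", "Y"]


def _append_fasta(src_path: Path, out: BinaryIO) -> None:
    """Copy one FASTA file to `out`, making sure it ends with a newline."""
    last_byte = b"\n"
    with src_path.open("rb") as src:
        while True:
            chunk = src.read(1 << 20)
            if not chunk:
                break
            out.write(chunk)
            last_byte = chunk[-1:]
    if last_byte != b"\n":
        # Avoid gluing the next header onto the last sequence line.
        out.write(b"\n")


def build_reference(chrom_dir: Path, mt_fasta: Path,
                    supercontig_fasta: Path, output: Path) -> None:
    """Concatenate chromosomes, MT and supercontigs into one reference file.

    Assumes (hypothetically) one file per chromosome named chr<NAME>.fa.
    """
    parts = [chrom_dir / f"chr{name}.fa" for name in CHROMOSOME_ORDER]
    parts += [mt_fasta, supercontig_fasta]
    with output.open("wb") as out:
        for part in parts:
            _append_fasta(part, out)
```

Copying in binary chunks keeps memory use flat even for chromosome-sized files.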
The pilot project reference genome (based on NCBI build 36) remains available in the retired_reference directory on the FTP site.
Link to additional information: Changelog (file locations)
A total of 844 files have been added to the ftp/data/ directory, representing new and replacement filtered fastq files. Replacement files were required for a small number of fastq files with incorrect quality scores. The files with the incorrect quality scores were removed. In the sequence index file the removed files are marked "WITHDRAWN FROM ARCHIVE".
Links to additional information: Changelog (new files) / Changelog (replaced files) / Changelog (withdrawn files)
719 new .bas (bam statistics) files have been added to the FTP site and 1129 .bas files have been replaced. The replaced files have corrected bam file names and statistics including #_total_bases and %_of_mismatched_bases. The alignment index files will be updated shortly to give #_total_bases and #_mapped_bases for each bam file entry.
Links to additional information: Changelog (new .bas files) / Changelog (replacement .bas files)
A set of 33315 fastq files containing sequence reads that have passed a DCC fastq QC process are now available on the FTP site.
The filtered_fastq files contain reads passing the DCC fastq QC process and have been put on the ftp site. The input to the DCC QC pipeline is all fastq files retrieved from ERA, including reads generated by all three pilots and the main project, as of 4 September 2009.
Summary statistics of the current set of files:
Total read count: 143,297,692,374
Total base count: 6,248,070,765,643
Total filtered read count: 130,945,326,161
Total filtered base count: 5,715,866,786,951
Percentage of good reads: 91.38%
Percentage of good bases: 91.48%
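The two percentages above follow directly from the four counts. As a quick check, the short sketch below recomputes them from the figures listed; the variable and function names are only for illustration.

```python
# Counts copied from the summary statistics above.
total_reads = 143_297_692_374
total_bases = 6_248_070_765_643
filtered_reads = 130_945_326_161
filtered_bases = 5_715_866_786_951


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


print(f"Percentage of good reads: {percent(filtered_reads, total_reads):.2f}%")  # 91.38%
print(f"Percentage of good bases: {percent(filtered_bases, total_bases):.2f}%")  # 91.48%
```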
Link to additional information: Changelog (list of new files) / QC criteria in README.sequence_data
The first set of SNPs and other supporting data from the low coverage individuals is now officially released on the EBI and NCBI FTP sites associated with the 1000 Genomes project.
This intermediate release of results from the 1000 Genomes project consists of the following (all file paths below are relative to the following root on the two mirror ftp sites: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/2009_02/ or ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/2009_02/).
1. SNP calls from 104 individuals in the low coverage pilot, Pilot 1.
See this file: Pilot1/README_SRP000031_2009_02 for more details on how the calls were made and what the file formats are. All the files related to this set of calls, excepting the human genome reference used (see point 3 below), are in the Pilot1 directory.
See this file: Pilot1/alignments.index for a tab-delimited list of alignment files, one per individual, including md5 checksums for each file. The third and fourth columns give the project id for Pilot1, SRP000031, and the identifier for the corresponding individual. The alignment files are in SAM format (see http://samtools.sourceforge.net/).
Please note these are intermediate results, NOT based on the sequencing data from the complete Pilot.
2. The human genome reference files used can be found in the following directory: ../../technical/reference.
The male and female reference files are in gzipped fasta format, and there are also two index files, for indexing directly into the compressed references, for BAM file manipulation. This index format is defined in the README within that directory.
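The Pilot1/alignments.index file described in point 1 is tab-delimited, with the project id in the third column and the individual's identifier in the fourth. A minimal sketch of reading those two columns follows. Only those two column positions come from the text above; the function name, the tuple type, and the placeholder values in the example line are hypothetical.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    project_id: str   # third column, e.g. SRP000031 for Pilot1
    individual: str   # fourth column


def parse_index_line(line: str) -> IndexEntry:
    """Extract the third and fourth tab-delimited columns from one index line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 tab-delimited columns, got {len(fields)}")
    return IndexEntry(project_id=fields[2], individual=fields[3])


if __name__ == "__main__":
    # Placeholder line: the first two columns and the individual id are made up.
    example = "column_1\tcolumn_2\tSRP000031\tINDIVIDUAL_ID\n"
    print(parse_index_line(example))
```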
The first set of SNP calls representing the preliminary analysis of four genome sequences are now available to download through the EBI FTP site and the NCBI FTP site. The README file dealing with the FTP structure will help you find the data you are looking for.
The data can also be viewed directly through the 1000 Genomes browser at http://browser.1000genomes.org. Launch the browser and view a sample region here.
For more information about using the 1000 Genomes browser, download the Quick start guide.
The 1000 Genomes Project announces the release of the first set of SNP calls for 4 individuals that are part of the high coverage pilot project. These SNPs represent the preliminary analysis of a portion of the data so far collected and are released in accordance with the Ft. Lauderdale agreement for community resource projects.
This preliminary release is designed both to provide data to the community and to test the systems used to make the data available, including the 1000 Genomes browser available at http://browser.1000genomes.org. Additional updates of this site and the 1000 Genomes browser will take place throughout January.
In addition to the SNP files and the 1000 Genomes project browser, raw project data is made available as soon as possible by NCBI and the EBI. The data consist of README files, fastq files (nucleotides and qualities) suitable for use with most aligners, and md5 checksum data.
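The md5 checksum data mentioned above can be used to confirm that a downloaded file arrived intact. The sketch below shows one way to do that with Python's standard `hashlib` module; the function names are hypothetical, and where the expected checksum for a given file is published is not specified here.

```python
import hashlib
from pathlib import Path
from typing import Union


def md5_of_file(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: Union[str, Path], expected_md5: str) -> bool:
    """Compare a file's md5 with the published value (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Reading in chunks matters here because fastq files can be far larger than available memory.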
Users from the Americas should download the mirrored 1000 Genomes data from NCBI via ftp at: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ or via the Aspera high speed data transfer client at: http://fasp.ncbi.nlm.nih.gov/1000genomes.html.
Users from Europe and the rest of the world should download the mirrored 1000 Genomes data from the EBI at: ftp://ftp.1000genomes.ebi.ac.uk/. The EBI will be implementing an Aspera client in the near future.
The data available at the EBI and the NCBI are identical. Depending on server load and Internet connectivity some users in the Americas may have faster FTP downloads from the EBI and some users in Europe and the rest of the world may have faster downloads from NCBI.
Raw data for a portion of the project is already available through the Short Read Archive at the NCBI and the European Read Archive at the EBI. These comprehensive archives are specifically designed for short read data and will be making the complete project data available as soon as possible.
The 1000 Genomes project plans to release summary data including positions of variants in individuals and populations. These data releases will include SNPs and CNV analysis for the six high-coverage individuals and all of the low coverage individuals. Quarterly data releases are planned starting in January 2009.