Software Tools
Major tools used by the 1000 Genomes Project to process data are detailed on this page. Many of these tools have been published and are in active development, so for many of them the most current version available from the developers' web sites may be newer than the version that was used to create the relevant 1000 Genomes release files.
Recalibration
Quality scores from multiple centers and platforms must be calibrated before the data can be used together in a combined analysis. For example, due to a number of experimental and technical variables, bases with a reported quality of Q25 may actually mismatch the reference at a rate of 1 in 1000 and so should be recorded as Q30. The recalibration procedure aligns reads from a given run, determines the empirical mismatch rates and produces recalibrated quality values from the analysis.
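The arithmetic behind the Q25/Q30 example can be sketched in a few lines of Python. This is a minimal illustration of the Phred scale (Q = -10 * log10 of the error probability), not the GATK implementation: the function names, the mismatch counts and the +1/+2 smoothing are all assumptions made for the example.

```python
import math


def phred_from_error_rate(p):
    """Convert an error probability into a Phred-scaled quality score."""
    return -10.0 * math.log10(p)


def error_rate_from_phred(q):
    """Convert a Phred-scaled quality score back into an error probability."""
    return 10.0 ** (-q / 10.0)


def empirical_quality(mismatches, observations):
    """Phred quality implied by an observed mismatch rate.

    The +1/+2 smoothing is an illustrative choice (an assumption, not
    necessarily what GATK does) that avoids an infinite quality when a
    bin happens to contain zero mismatches.
    """
    rate = (mismatches + 1) / (observations + 2)
    return phred_from_error_rate(rate)


# Hypothetical counts per reported quality: (mismatches, aligned bases observed).
counts = {
    25: (1_000, 1_000_000),  # reported Q25, observed about 1 mismatch in 1000
    30: (5_000, 1_000_000),  # reported Q30, observed about 5 mismatches in 1000
}

# Map each reported quality to the quality implied by the observed mismatches.
recalibrated = {
    reported: round(empirical_quality(mismatches, observed))
    for reported, (mismatches, observed) in counts.items()
}

print(recalibrated)
```

With these made-up counts, the reported Q25 bin is raised to Q30, matching the example in the text, while the over-optimistic Q30 bin is lowered. A real recalibration would typically also exclude known variant sites, so that true differences from the reference are not counted as sequencing errors.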
The recalibration tool used by the 1000 Genomes Project is embedded in the Genome Analysis Toolkit (GATK) developed by the Broad Institute.
Further information on the recalibration tool and how it is used is available here.
Alignment
The 1000 Genomes Pilot Project used the following alignment programs:
- MAQ: [Pubmed] [Homepage/Source Code]
- MOSAIK: [Homepage/Source code]
- SSAHA2: [Pubmed] [Homepage/Source code]
- Corona lite: [Homepage/Source code]
- cross_match: [Homepage/Source code]
For pilots 1 and 2, MAQ was used to align Illumina data, Corona lite to align SOLiD data and SSAHA2 to align 454 data. All pilot 3 data was aligned with MOSAIK. Depending on the production center, pilot 3 data was also aligned with SSAHA2, cross_match, MAQ or Corona lite.
The 1000 Genomes full project is using the following aligners:
- BWA: [Pubmed] [Homepage/Source Code]
- BFAST: [Pubmed] [Homepage/Source Code]
- MOSAIK: [Homepage/Source code]
For the low coverage project, the Illumina runs are aligned independently using both BWA and MOSAIK, and the SOLiD runs are aligned using BFAST.
Assembly
Efforts to create de novo assemblies of 1000 Genomes data are currently underway, but no assemblies have been released by the project.
The Gerstein Lab at Yale University has created a version of the NA12878 genome based on NCBI build 36 and incorporating SNPs, indels and SVs identified by the 1000 Genomes Project. This genome sequence is available at http://sv.gersteinlab.org/NA12878_diploid. Users of this assembly are requested to cite: Rozowsky J et al. Coordination between Allele Specific Expression and Binding in a Network Framework. (submitted)
SNP calling
- GATK: [Pubmed] [Homepage/Source code]
- GlfMultiples: [Homepage/Source code]
- FreeBayes: [Homepage/Source code]
- bcftools/mpileup: [Homepage/Source code]