The following analysis methods were used to create the data provided by the 1000 Genomes Project and released through the FTP site.
Sequence reads were aligned to the human genome reference sequence. The copy of the FASTA file used for the alignments can be found on our FTP site. It currently represents the full chromosomes of the GRCh37 build of the human reference.
Different alignment algorithms were used for each sequencing technology. For the Trio and Low Coverage pilots, Illumina data were mapped using the MAQ algorithm, SOLiD data were mapped using Corona Lite, and 454 data were mapped using SSAHA2. For the Exon pilot, Illumina and 454 reads were mapped using MOSAIK.
Corona Lite: http://solidsoftwaretools.com/gf/project/corona/
All data from the Illumina and 454 platforms have been recalibrated using the GATK package (McKenna, Hanna et al. 2010).
Multiple SNP calling procedures have been used by the 1000 Genomes Project. In part this was because different methods were appropriate for different aspects of the Project, in part because different approaches were under active development by individual members of the Project, and in part because we found empirically that, given the state of current methods, the consensus of multiple primary call sets from different methods proved of higher quality than any of the primary call sets themselves.
Low Coverage SNP Calling
Calling was performed separately for the three analysis panels: CEU, YRI and CHB+JPT. For each, three primary SNP call sets were generated, from the Broad Institute (Broad), the University of Michigan (Michigan) and the Sanger Institute (Sanger). Details of the methods are given in separate publications (DePristo, Banks et al. 2010; Le and Durbin 2010; Li, Willer et al. 2010). All three were produced by a two-step process. In the first step, a set of candidate calls was made based on the evidence at each base pair in the reference, independent of neighbouring sites. Then in each case a more computationally intensive linkage disequilibrium (LD)/imputation based approach was used to refine the call set and genotypes for all individuals.
Analysis of the individual call sets produced by Broad, Michigan and Sanger demonstrated that a consensus approach to defining variant positions led to a higher quality data set, as shown by: (a) genotype calls that were more accurate than in any single call set, (b) a transition-transversion ratio at newly discovered SNPs that was more similar to that observed at dbSNP sites, (c) an overall transition-transversion ratio that was closer to the value of slightly >2 observed in previous SNP discovery efforts, (d) consistently high rediscovery rates for dbSNPs and HapMap SNPs. Overall, genotypes obtained through a consensus procedure are estimated to have 30% fewer errors than those generated by any single caller.
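As an illustration of the transition-transversion ratio used as a quality metric above, the following is a minimal Python sketch. The function name ti_tv_ratio and the (ref, alt) input format are assumptions made for this example and are not taken from any of the Project's pipelines.

```python
# Hypothetical helper: compute the transition/transversion (Ti/Tv) ratio
# for a collection of single-base substitutions.
BASES = set("ACGT")
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snps):
    """snps: iterable of (ref, alt) single-base pairs. Returns Ti/Tv or None."""
    ti = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        # Skip anything that is not a simple single-base substitution.
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    # The ratio is undefined when there are no transversions.
    return ti / tv if tv else None
```

A call set enriched for false positives would be expected to show a lower ratio, since random errors are not biased towards transitions.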
Consensus genotypes were defined using a simple majority vote among callers. In case of conflicting evidence (for example if a different genotype was listed in all three sets or the genotype was different in two sets and absent from the third), the following order of precedence was used: Sanger, Michigan, Broad for the CEU and YRI call sets, and Michigan, Sanger, Broad for the CHB+JPT call set. Where possible, phase was taken from the primary dataset. If the chosen genotype was not from the primary dataset, then the genotype was set as unphased. SNP and genotype quality scores were not retained in the merged dataset.
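The voting rule can be sketched in Python as follows. This is an illustration under stated assumptions rather than the Project's merging code: genotypes are encoded as VCF-style strings ("0|1" phased, "0/1" unphased, None when absent), the "primary dataset" is taken to be the first caller in the precedence order, and the names PRECEDENCE and consensus_genotype are hypothetical.

```python
from collections import Counter

# Order of precedence per analysis panel, as described in the text.
PRECEDENCE = {
    "CEU": ("Sanger", "Michigan", "Broad"),
    "YRI": ("Sanger", "Michigan", "Broad"),
    "CHB+JPT": ("Michigan", "Sanger", "Broad"),
}


def _unordered(gt):
    """Genotype string -> allele tuple that ignores phase and allele order."""
    return tuple(sorted(gt.replace("|", "/").split("/")))


def consensus_genotype(calls, panel):
    """calls: dict mapping caller name -> genotype string or None (absent)."""
    order = PRECEDENCE[panel]
    present = {c: _unordered(calls[c]) for c in order if calls.get(c) is not None}
    if not present:
        return None
    best, votes = Counter(present.values()).most_common(1)[0]
    if votes < 2:
        # No majority: take the highest-precedence caller that made a call.
        best = present[next(c for c in order if c in present)]
    primary = order[0]  # assumption: primary dataset = first in precedence
    # Keep phase only if the chosen genotype matches a phased primary call.
    if present.get(primary) == best and "|" in calls[primary]:
        return calls[primary]
    return "/".join(best)
```

For example, when Sanger reports "0|1", Michigan "1/0" and Broad "1/1" in the CEU panel, the heterozygous genotype wins the vote and the Sanger phase is kept.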
Low coverage SNP calls on the X chromosome were derived from call sets generated by the Broad and Sanger pipelines described above, and only calls made by both pipelines were reported. At sites in the intersection, the genotypes reported came from the Sanger calls, which were sex aware.
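The intersection rule for the X chromosome can be illustrated with a short Python sketch. Keying each call by a (chromosome, position, alternate allele) tuple and the function name intersect_calls are assumptions made for this example.

```python
def intersect_calls(sanger, broad):
    """Keep only sites called by both pipelines; genotypes come from Sanger.

    Each argument is a dict mapping a site key, assumed here to be
    (chrom, pos, alt), to that pipeline's genotype for the site.
    """
    return {site: gt for site, gt in sanger.items() if site in broad}
```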
Low Coverage Phasing
The low coverage call set merging produced a set of partially phased haplotypes, i.e. a series of haplotype fragments that were phased relative to each other but separated by unphased heterozygotes. We sought to "phase-finish" these haplotypes by using an LD-based method to place the remaining unphased alleles onto the haplotype scaffold that resulted from the merging. This was achieved using the IMPUTE2 software (Howie, Donnelly et al. 2009) to produce best-guess haplotypes from unphased genotype data. Howie et al. (2009) explain the basic procedure for sampling from the joint posterior distribution of haplotypes underlying the genotypes of a number of individuals: the IMPUTE model (Marchini, Howie et al. 2007) is embedded in a Gibbs sampler, and at each iteration every individual samples a new pair of haplotypes, conditional on the current guesses of the other individuals.
Trio SNP calling
Trio SNP calling was analogous to low coverage calling, except that only two call sets were generated, from Broad and Michigan. The LD/imputation based refinement stage, which was relevant to population data, was simply removed in the Broad set, which relied on much deeper data, and was replaced in the Michigan set by a structured prior that enforced Mendelian segregation. The intersection of the two call sets was taken as the consensus.
Once the trio genotypes were called, we set aside all SNPs that violated the rules of Mendelian inheritance (these were later used in the analysis of de novo mutation and structural variation). We then phased as many of the remaining SNPs as possible by identifying unambiguous transmissions from at least one parent. SNPs that are heterozygous in the child and both parents cannot be phased in this way, and we sought to recover some of these by borrowing LD information from the low coverage pilot.
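The Mendelian check and transmission-based phasing described above can be sketched in Python for a single biallelic SNP. This is a simplified illustration, not the Project's implementation: genotypes are assumed to be encoded as alternate-allele counts (0, 1 or 2), only the child's phase is resolved, and the function names are hypothetical.

```python
from itertools import product


def alleles(dosage):
    """Alternate-allele count (0, 1, 2) -> allele pair (0 = ref, 1 = alt)."""
    return {0: (0, 0), 1: (0, 1), 2: (1, 1)}[dosage]


def phase_child(father, mother, child):
    """Return 'violation', 'ambiguous', or the child's (paternal, maternal) alleles."""
    # Enumerate every (paternal, maternal) transmission consistent with the child.
    transmissions = {
        (p, m)
        for p, m in product(alleles(father), alleles(mother))
        if p + m == child
    }
    if not transmissions:
        return "violation"  # genotypes break Mendelian inheritance
    if len(transmissions) > 1:
        return "ambiguous"  # all three heterozygous: phase not determined
    return next(iter(transmissions))
```

With these encodings, phase_child(1, 1, 1) is the only Mendelian-consistent configuration that comes back as ambiguous, which corresponds to the triple-heterozygote case that requires LD information.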
For each trio, we used IMPUTE2 to phase the intersection of the SNPs called in the low-coverage pilot dataset and the corresponding trio pilot dataset (CEU and YRI, respectively). We fixed the phase of the low coverage haplotypes (following the phase-finishing step described earlier), as well as the haplotypes of the trio parents at SNPs with unambiguous phase. We then phased the remaining SNPs in the parents, separately for CEU and YRI.
The trio parents were phased as if they were unrelated, in the sense that the model did not force the parents to transmit opposite alleles at SNPs that were heterozygous in both parents and their child. Instead, this constraint was enforced post-hoc by recalculating the marginal posterior probabilities for each unphased genotype, conditional on the parents transmitting opposite alleles.
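One way to picture the post-hoc conditioning step is the Python sketch below. It assumes, purely for illustration, that each parent's phasing yields an independent marginal probability of transmitting the alternate allele; the source does not state the exact calculation used, and the function name condition_on_opposite is hypothetical.

```python
def condition_on_opposite(p_father_alt, p_mother_alt):
    """P(father transmits alt, mother transmits ref | parents transmit opposite alleles).

    Inputs are the marginal probabilities that each parent transmits the
    alternate allele, treated here as independent (an assumption).
    """
    father_alt = p_father_alt * (1 - p_mother_alt)   # father alt, mother ref
    mother_alt = (1 - p_father_alt) * p_mother_alt   # father ref, mother alt
    total = father_alt + mother_alt
    if total == 0:
        raise ValueError("opposite transmission has zero probability")
    # Renormalise over the two configurations allowed by the constraint.
    return father_alt / total
```

In this sketch, an uninformative mother (probability 0.5) leaves the father's probability unchanged, while two equally confident parents that disagree with the constraint cancel out to 0.5.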
Exon SNP calling
The Exon Pilot SNP call set was composed of the intersection of SNP calls made using the GATK Unified Genotyper essentially as described above for low coverage and trio calls, with calls from an alternate pipeline using the MOSAIK read mapper and the GigaBayes SNP caller. Note that unlike the low coverage and trio pilots, the different SNP call sets started with different read alignments. SNP calls were made separately for each of the 7 Exon Project populations.
The process of generating indel variant and genotype calls for the genome wide data was broken up into three stages: candidate indel identification, calculation of genotype likelihood through local re-alignment, and LD-based genotype inference and calling. This workflow was chosen because, in contrast to SNPs, it is not possible to enumerate all potential indel mutations, and compute the posterior support of each. Instead, a pseudo-Bayesian approach was used, in which potential candidates are first obtained, and then subsequently tested (together with the reference) in a Bayesian framework.
All likelihood-based calculations were made from the Illumina data, although candidate indels from the 454 and SOLiD reads were considered.
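To make the genotype likelihood stage more concrete, the following Python sketch uses a standard diploid mixture model in which each read is drawn from one of the two haplotypes with equal probability. It is an assumption-laden illustration, not the Project's indel caller: the per-read probabilities are taken as given (for example from local re-alignment), they are assumed to be strictly positive, and the function name is hypothetical.

```python
import math


def genotype_log_likelihoods(read_liks):
    """Log-likelihoods for genotypes carrying 0, 1 or 2 copies of a candidate indel.

    read_liks: list of (P(read | reference haplotype), P(read | indel haplotype)),
    each assumed to be > 0.
    """
    result = []
    for copies in (0, 1, 2):
        w = copies / 2  # fraction of haplotypes carrying the indel
        # Each read comes from the indel haplotype with probability w.
        result.append(
            sum(math.log((1 - w) * p_ref + w * p_alt) for p_ref, p_alt in read_liks)
        )
    return result
```

A site covered by one read supporting the reference and one supporting the indel receives its highest likelihood for the heterozygous genotype under this model.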
The Low Coverage Pilot release consists of indel calls on the finished autosomal sequence. No indel calls on the X, Y, mitochondrial, or unplaced chromosomes have been made.
Structural Variant Discovery
Several methods were used to discover structural variants (SVs) from trio and low coverage raw sequencing reads, using algorithms based on several distinct rationales for SV detection. The high-confidence SV set reported in this study is based on validated SVs and on those discovered by algorithms achieving an FDR ≤10%. Further details can be found in the 1000 Genomes Pilot Project publication.