About alignment files (BAM and CRAM)

Answer:

All our alignment files are in BAM or CRAM format. BAM is a standard alignment format which was defined by the 1000 Genomes consortium and has since seen wide community adoption. CRAM is a compressed version of this format, in which the compression is driven by the reference the sequence data is aligned to.

The CRAM file format was designed by the EBI to reduce the disk footprint of alignment data in these days of ever-increasing data volumes.

The CRAM files the 1000 Genomes Project distributes are lossy CRAM files, in which the base quality scores are reduced using the Illumina 8-bin compression scheme, as described in the lossy compression section of the CRAM usage page.

There is a GitHub page where the CRAM file format is discussed and help can be found.

CRAM files can be read using many Picard tools, and recent versions of samtools can also read and write the format natively when given the reference the data was compressed against.
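
As a minimal sketch (assuming a recent samtools; the file and reference names below are illustrative only, not specific files we distribute), a CRAM file can be viewed by supplying the reference FASTA it was compressed against:

# Illustrative only: view reads in a CRAM, supplying the reference used for compression
samtools view -h -T GRCh38_full_analysis_set_plus_decoy_hla.fa NA12878.final.cram | head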

BAM file names

The bam file names look like:

NA00000.location.platform.population.analysis_group.YYYYMMDD.bam

The bai index and bas statistics files are also named in the same way.

The name is made up of the following parts:

  1. The individual sample ID.
  2. Where the sequence is mapped to: a chromosome name if the file only contains mappings to that chromosome, "mapped" for whole-genome mappings, or "unmapped" for reads which failed to map to the reference (pairs where one mate mapped and the other did not are kept in the mapped file).
  3. The sequencing platform.
  4. The population of the sample, given as our three-letter population code.
  5. The analysis group, i.e. the sequencing strategy.
  6. The date, which matches the date of the sequence index used to build the BAMs and can also be found in the sequence.index file name.
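
For example, HG01375.mapped.ILLUMINA.bwa.CLM.low_coverage.20120522.bam (used in the coverage example later in this FAQ) is a whole-genome, Illumina, low coverage alignment for the Colombian (CLM) sample HG01375, built from the 20120522 sequence index; note that the aligner name (bwa) also appears in some file names.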

Unmapped bams

The unmapped bams contain all the reads for the given individual which could not be placed on the reference genome. They contain no mapping information.

Please note that for any paired-end sequence where one end successfully maps but the other does not, both reads are found in the mapped bam.

Bas files

Bas files contain statistics we generate for our alignment files and are distributed alongside them.

These are read group level statistics presented in a tab-delimited format and are described in this README.

Each mapped and unmapped bam file has an associated bas file and we provide them collected together into a single file in the alignment_indices directory, dated to match the alignment release.
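
As a minimal sketch (the file name is illustrative; bas files typically take the BAM file name plus a .bas extension), you can list the column headings of a bas file with standard command line tools:

head -1 HG01375.mapped.ILLUMINA.bwa.CLM.low_coverage.20120522.bam.bas | tr '\t' '\n'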

Are the variant calls in IGSR phased?

Answer:

You can tell when a VCF file contains phased genotypes because the delimiter used in the GT field is a pipe symbol (|), e.g.

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG00096
10   60523  rs148087467    T     G       100     PASS    AC=0;AF=0.01;AFR_AF=0.06;AMR_AF=0.0028;AN=2; GT:GL 0|0:-0.19,-0.46,-2.28
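
As a quick check (a sketch assuming bcftools is installed; the file name is illustrative), you can print the genotypes for a few records and look for | rather than / separators:

bcftools query -f '%CHROM\t%POS[\t%GT]\n' ALL.chr10.phase3.genotypes.vcf.gz | head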

The VCF files produced by the final phase of the 1000 Genomes Project (phase 3) are phased. They can be found in the final release directory from the project and in the directory supporting the final publications.

Some other studies have also produced phased versions of their calls. These include the analysis of high-coverage data across 3,202 samples on GRCh38 completed by NYGC. Multiple sets of VCFs are available, including phased VCFs, linked to from the page for that collection.

Are there any genomic regions that have not been studied?

Answer:

The 1000 Genomes Project created what they defined as accessibility masks for the pilot phase, phase one and phase three of the Project. Some other studies have similar files.

In phase three of the 1000 Genomes Project, using the pilot criteria, 95.9% of the genome was found to be accessible. For the stricter mask created during phase three, 76.9% was found to be accessible. A detailed description of the accessibility masks created during phase three, the final phase of the Project, can be found in section 9.2 of the supplementary material for the main publication. The percentages quoted are for non-N bases.

While the above was generated on GRCh37, similar files were created on GRCh38 for the reanalysis of the 1000 Genomes Project data on GRCh38. HGSVC2 also have files listing regions of the genome that were not analysed.
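
As an illustrative sketch (the file names are placeholders and bedtools is assumed to be installed), variants falling within one of these accessibility mask BED files can be selected with bedtools:

bedtools intersect -u -header -a ALL.chr20.phase3.genotypes.vcf.gz -b 20141020.strict_mask.whole_genome.bed > chr20.strict_mask_only.vcf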

Can I get phased genotypes and haplotypes for the individual genomes?

Answer:

Phased variant call sets are described in “Are the variant calls in IGSR phased?”.

You can obtain individual phased genotypes either through the Ensembl Data Slicer or by using a combination of tabix and VCFtools to subsample VCF files for a particular individual or list of individuals.

The Data Slicer has options to filter by individual and by population. The individual filter takes the individual names in the VCF header and presents them as a list before giving you the final file. If you wish to filter by population, you must also provide a panel file which pairs individuals with populations; again, you are presented with a list to select from before being given the final file. Both lists allow multiple elements to be selected.

To use tabix you must also use a VCFtools Perl script called vcf-subset. The command line would look like:

tabix -h ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 17:1471000-1472000 | perl vcf-subset -c HG00098 | bgzip -c > /tmp/HG00098.20100804.genotypes.vcf.gz
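
Alternatively (a sketch assuming a recent bcftools and that the VCF and its tabix index are available; the file name is illustrative), bcftools can slice a region and subset a sample in one step:

bcftools view -r 17:1471000-1472000 -s HG00098 -Oz -o HG00098.subset.vcf.gz ALL.2of4intersection.20100804.genotypes.vcf.gz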

Please also note that some studies, such as the second phase of the Human Genome Structural Variation Consortium (HGSVC), are now producing haplotype-resolved assemblies.

Can I map your variant coordinates between different genome assemblies?

Answer:

We have data presented on GRCh38, GRCh37 and NCBI36; please check the data portal to see which assembly a given data set is on. If you need variant calls to be on a particular assembly, it is best to go to dbSNP, Ensembl or an equivalent archive using their rs numbers, as this will provide a definitive mapping.

If an rs number or equivalent is not available, there are tools to map coordinates between NCBI36, GRCh37 and GRCh38 from both Ensembl and the NCBI.

Can I use the IGSR data for imputation?

Answer:

The developers of Beagle, Mach and Impute2 have all created data sets based on the 1000 Genomes data to use for imputation.

Please look at each tool’s website to find those files.

Does the 1000 Genomes Project use HapMap data?

Answer:

The 1000 Genomes Project shares some samples with the HapMap project; any sample which starts with NA was likely part of the HapMap project. In the pilot stages of the project HapMap genotypes were also used to help quality control the data and identify sample swaps and contamination. Since phase 1 the HapMap data has not been used by the 1000 Genomes Project, and all genotypes were independently identified by 1000 Genomes.

How are allele frequencies calculated?

Answer:

Our standard AF values are allele frequencies calculated from the allele count (AC) and allele number (AN) values and rounded to two decimal places.
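
For example (a sketch assuming bcftools with its fill-tags plugin; the file names are illustrative), AC, AN and AF can be recomputed directly from the genotypes and compared with the values already in the INFO column:

bcftools +fill-tags ALL.chr22.phase3.genotypes.vcf.gz -Oz -o chr22.recalculated.vcf.gz -- -t AC,AN,AF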

LDAF is an allele frequency value given in the INFO column of our phase 1 VCF files; it is the allele frequency as inferred from the haplotype estimation. You will note that LDAF does sometimes differ from the AF calculated on the basis of allele count and allele number. This generally means there are many uncertain genotypes for that site, which is particularly true close to the ends of the chromosomes.

Genotype Dosage

The phase 1 data set also contains Genotype Dosage values. These come from Mach/Thunder, the imputation engine used for genotype refinement in the phase 1 data set.

The dosage represents the predicted dosage of the non-reference allele given the data available; it will always have a value between 0 and 2.

The formula is Dosage = Pr(Het|Data) + 2*Pr(Alt|Data)
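
For example (illustrative numbers only), if Pr(Het|Data) = 0.3 and Pr(Alt|Data) = 0.6, then Dosage = 0.3 + 2*0.6 = 1.5.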

The dosage value gives an indication of how well the genotype is supported by the imputation engine. The genotype likelihood gives an indication of how well the genotype is supported by the sequence data.

How do I find the most up-to-date data?

Answer:

Reviewing the list of data collections and their publications in our data portal is a good starting point.

We also share data via our FTP site, and data may be available there some time before being added to the website. Data collection directories are available on the FTP site.

In addition, to help track changes on our FTP site, we provide change logs and a current.tree file, which together list all the files present on the FTP site and any changes made to them.
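
For example (a sketch; the sample ID and file pattern searched for are illustrative, and the path shown is the usual location of current.tree on the FTP site), you can download current.tree and search it to locate files of interest:

wget -q ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/current.tree
grep 'HG00096' current.tree | grep '\.cram'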

How do you calculate ancestral alleles?

Answer:

Information relating to ancestral alleles is available for phase three of the 1000 Genomes Project. The work on annotating ancestral alleles is described in Section 8.3 of the supplementary material of the publication accompanying that work.

How was imputation used in 1000 Genomes to fill in gaps in sequencing?

Answer:

In the original phase 1 and phase 3 sequencing of the 1000 Genomes individuals, many genomes were only sequenced in full at low coverage, so for some individuals some genotypes will be based on imputation.

This means that if an individual has no coverage at a particular location, but overall we have been able to determine there is variation at that location, then we can statistically infer the genotype for that variant in that individual using haplotype information. As a result, we are able to provide complete haplotypes for all the variation we discover.

The process used to create our genotypes first gave our merged sites and genotype likelihood sets to Beagle to generate initial haplotypes (using 50 iterations across all samples); these were then refined using a modified version of Thunder (which used 300 states chosen by longest matching haplotype at each iteration, in addition to 100 randomly chosen states).

This process means we are unable to precisely identify which sites used imputation to generate their genotype. Without this process the approximate error rate for our heterozygous sites would be 20%, so you can estimate that 20% of our heterozygous sites will have been changed on the basis of imputation. The sites covered by our exome sequencing represent our highest accuracy sites and these are the least likely to have been changed by this process. The converse is also true: any site without any sequence alignment will have been imputed. You can find the depth of coverage at any site using our bam files. Other sites may have been given greater evidence on the basis of the imputation and refinement process.
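
For example (a sketch; the BAM file name is the one used in the coverage example elsewhere in this FAQ, the region is illustrative, and the BAM and its index are assumed to have been downloaded), samtools can report the per-base depth of coverage over a region of interest:

samtools depth -r 2:1000000-1001000 HG01375.mapped.ILLUMINA.bwa.CLM.low_coverage.20120522.bam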

You can find out more about this in our Phase 1 paper.

Is the sequencing data in IGSR contaminated with mycoplasma?

Answer:

William Langdon published in April 2014 in BioData Mining about mycoplasma reads in the 1000 Genomes sequencing data. He tested 2% of the total runs produced by the project (3982/187720) and found 7% of them (269/3982) to be contaminated with mycoplasma. A full description of the analysis can be found in his paper.

A full list of the runs Langdon tested and their contamination status can be found on our FTP site.

We recognise that there are mycoplasma sequence reads in some of the 1000 Genomes Project raw data sets but we do not believe that they have affected any of the published results and analyses, nor the use of data from the project by additional users.

The primary outcome from the 1000 Genomes Project is a collection of more than 35 million human genetic variants. These are obtained from reads that map to the human reference genome. As Langdon points out, the mycoplasma reads he identified do not map to the human reference genome, so they do not contaminate the results on human genetic variation.

The 1000 Genomes Project makes its raw data sets available for reanalysis, and the complete read sets include mycoplasma reads, as well as reads from Epstein-Barr virus (EBV) and potentially from other non-human organisms that may have been present in the starting material. However, the project also makes aligned data sets available and the vast majority of users only examine the reads aligned to the human reference, along with the inferred individual genome sequences derived from them. We make all the original raw data available as a matter of policy, both for transparency with respect to our data processing, and also to support those who would like to examine additional technical or biological phenomena that can be derived from the data.

Most of the DNA used for 1000 Genomes Project sequencing was obtained from immortalised cell lines and, although mycoplasmal infection of laboratory cell lines is undesirable, it is not a great surprise that some of these had mycoplasma infections, especially given that some of the cell lines and DNA were prepared a long time ago.

Is there gene expression and/or functional annotation available for the samples?

Answer:

Functional annotation

As part of our phase 1 analysis we performed functional annotation of our phase 1 variants with respect to both coding and non-coding annotation from GENCODE and the ENCODE project respectively.

This functional annotation can be found in our phase 1 analysis results directory. We present both the annotation we compared the variants to and VCF files which contain the functional consequences for each variant.

Gene expression

The most important existing expression datasets involving 1000 Genomes individuals are probably the following:

Pre-publication RNA-sequencing data from the Geuvadis project is available through http://www.geuvadis.org

http://www.ebi.ac.uk/arrayexpress/experiments/E-GEUV-1/samples.html
http://www.ebi.ac.uk/arrayexpress/experiments/E-GEUV-2/samples.html

http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-197

http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-198
http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-264

http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-19480

References

  1. Montgomery SB, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J, Guigo R, Dermitzakis ET. Transcriptome genetics using second generation sequencing in a Caucasian population. Nature. 2010 Apr 1;464(7289):773-7. Epub 2010 Mar 10.
  2. Stranger BE, Montgomery SB, Dimas AS, Parts L, Stegle O, Ingle CE, Sekowska M, Davey Smith G, Evans D, Gutierrez-Arcelus M, Price A, Raj T, Nisbett J, Nica AC, Beazley C, Durbin R, Deloukas P, Dermitzakis ET. Patterns of cis regulatory variation in diverse human populations. PLoS Genetics, in press.
  3. Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras JB, Stephens M, Gilad Y, Pritchard JK. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 2010 Apr 1;464(7289):768-72. Epub 2010 Mar 10.

Was HLA Diversity studied in IGSR?

Answer:

HLA diversity is not something which was studied by the 1000 Genomes Project directly. However, groups have looked at the HLA diversity of the samples in the 1000 Genomes Project.

2018 data

The most recent of these studies was published by Laurent Abi-Rached, Julien Paganini and colleagues in 2018 and covers 2,693 samples from the work of the 1000 Genomes Project. Details of the study and data used in this work are available via the publication and the HLA types are available on our FTP site at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HLA_types/.

2014 data

The FTP site also hosts data from an earlier study by Pierre-Antoine Gourraud, Jorge Oksenberg and colleagues at UCSF who carried out an HLA typing assay on DNA sourced from Coriell for 1000 Genomes samples. This earlier study looks at only the 1,267 samples that were available at that time.

The earlier work assessing HLA diversity is published in “HLA diversity in the 1000 Genomes dataset”, with data available from the 1000 Genomes FTP site in ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20140725_hla_genotypes/.

Were the same analysis tools used for every sample in one data collection?

Answer:

The analysis tools used for samples in a given data collection can vary depending on the technologies (i.e. PacBio, exome, etc.) that were used to generate data for any given sample. The technologies may not be the same across all samples in a collection.

Generally, however, for any given analysis the data types will be the same or very similar and will have been analysed in a consistent manner.

For data collections where the analysis has been published, the publication will give details of what methods were used. Checking this may involve looking at the supplementary material. We list publications on our data collections page. In addition, for analyses which may not have been published, you will find README files on our FTP site, in the directories for each data collection, that provide further information.

What are the different data collections available for 1000 Genomes?

Answer:

In IGSR, data is organised into collections that roughly correspond to studies or projects.

The samples collected by the 1000 Genomes Project have now been used in many different studies, some generating new data and others reanalysing existing data.

The final phase of the 1000 Genomes Project was phase 3, which represents 2,504 samples on GRCh37.

The data from phase three of the 1000 Genomes Project was subsequently reanalysed on GRCh38.

Following this work, the samples have been resequenced to high-coverage, with additional related samples being sequenced, bringing the total number of samples up to 3,202. This data was analysed on GRCh38.

Further studies have also generated data on samples from the 1000 Genomes Project, including work by the Human Genome Structural Variation Consortium (HGSVC).

These data collections are listed in our data portal.

What is the coverage depth?

Answer:

The Phase 1 integrated variant set does not report the depth of coverage for each individual at each site. We instead report genotype likelihoods and dosage. If you would like to see depth of coverage numbers you will need to calculate them directly.

The bedtools suite provides a method to do this.

genomeCoverageBed is a tool which can produce a BED file specifying the coverage for every base in the genome, and intersectBed provides an intersection between two VCF/BED/BAM files.

These commands also require samtools, tabix and vcftools to be installed.

An example set of commands would be:

samtools view -b  ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG01375/alignment/HG01375.mapped.ILLUMINA.bwa.CLM.low_coverage.20120522.bam 2:1,000,000-2,000,000 | genomeCoverageBed -ibam stdin -bg > coverage.bg

This command gives you a bedGraph file of the coverage of the HG01375 bam between 2:1,000,000-2,000,000.

tabix -h http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets/ALL.chr2.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz 2:1,000,000-2,000,000 | vcf-subset -c HG01375 | bgzip -c > HG01375.vcf.gz

This command gives you the vcf file for 2:1,000,000-2,000,000 with just the genotypes for HG01375.

To get the coverage for all those sites you would use:

intersectBed -a HG01375.vcf.gz -b coverage.bg -wb > depth_numbers.vcf

For more information about BED file formats, please see the Ensembl File Formats Help.

For more information you may wish to look at our documentation about data slicing.

What methods were used for generating alignments?

Answer:

Details of alignment methodology differ between data sets and the types of sequence data being aligned.

Full details of methodology can be found in the publications accompanying the data collections and, for unpublished alignments, in the README files placed with the data collections on our FTP site.

Which reference assembly do you use?

Answer:

The reference assembly the 1000 Genomes Project has mapped sequence data to has changed over the course of the project.

For the pilot phase we mapped data to NCBI36. A copy of our reference fasta file can be found on the ftp site.

For the phase 1 and phase 3 analyses we mapped to GRCh37. Our fasta file, which can be found on our ftp site, is called human_g1k_v37.fasta.gz; it contains the autosomes, X, Y and MT but no haplotype sequence or EBV.

Our most recent alignment release was mapped to GRCh38; this reference also contained decoy sequences, alternative haplotypes and EBV. The data was mapped using an alt-aware version of BWA-MEM. The fasta files can be found on our ftp site.

Why are there differences between the different analyses of the 1000 Genomes samples?

Answer:

The phase 1 variant list released in 2012 and the phase 3 variant list released in 2014 overlap, but phase 3 is not a complete superset of phase 1. The phase 1 and phase 3 releases were compared on the basis of variant positions. This shows that 2.3M phase 1 sites are not present in phase 3; of these 2.3M sites, 1.92M are SNPs and the rest are either indels or structural variants (SVs).

The difference between the two lists can be explained by a number of different reasons.

  1. Some phase 1 samples were not used in phase 3 for various reasons. If a sample was not part of phase 3, variants private to that sample will not be part of the phase 3 set.

  2. Our input sequence data is different. In phase 1 we had a mixture of read lengths (36bp to >100bp) and a mixture of sequencing platforms (Illumina, ABI SOLiD and 454). In phase 3 we only used data from the Illumina sequencing platform, with read lengths of 70bp or more. We believe that these calls are higher quality, and that variants excluded this way were probably not real.

  3. The first two reasons listed explain 548k missing SNPs, leaving 1.37M SNPs still to be explained.

    The phase 1 and phase 3 variant calling pipelines are different. Phase 3 had an expanded set of variant callers, including haplotype-aware callers and callers that used de novo assembly, and it considered low coverage and exome sequence together rather than independently. Our genotype calling was also different, using ShapeIt2 and MVNcall, allowing the integration of multi-allelic variants and complex events that wasn’t possible in phase 1.

    891k of the 1.37M phase 1 sites missing from phase 3 were not identified by any phase 3 variant caller. These 891k SNPs have a relatively high Ts/Tv ratio (1.84), which means these were likely missed in phase 3 because they are very rare, not because they are wrong; the increase in sample number in phase 3 made it harder to detect very rare events, especially if the extra 1,400 samples in phase 3 did not carry the alternative allele.

    481k of these SNPs were initially called in phase 3. 340k of them failed our initial SVM filter so were not included in our final merged variant set. 57k overlapped with larger variant events so were not accurately called. 84k sites did not make it into our final set of genotypes due to losses in our pipeline. Some of these sites will be false positives but we have no strong evidence as to which of these sites are wrong and which were lost for other reasons.

  4. The reference genomes used for our alignments are different. Phase 1 alignments were aligned to the standard GRCh37 primary reference including unplaced contigs. In phase 3 we added EBV and a decoy set to the reference to reduce mismapping. This will have reduced our false positive variant calling, as mismapped reads can lead to false SNP calls. We cannot quantify this effect.

We have made no attempt to elucidate why our SV and indel numbers changed. Since the release of the phase 1 data, the algorithms to detect and validate indels and SVs have improved dramatically. By and large, we assume that the indels and SVs in phase 1 that are missing from phase 3 were false positives in phase 1.

You can get more details about our comparison from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/phase1_sites_missing_in_phase3/

Why is the allele frequency different from allele count/allele number?

Answer:

In some early main project releases the allele frequency (AF) was estimated using additional information such as LD, mapping quality and haplotype information. This means that in these releases the AF was not always the same as allele count/allele number (AC/AN). In the phase 1 release, AF should always match AC/AN rounded to two decimal places.
