About VCF variant files

Answer:

Variants are released in VCF format. As these have been released at different times, they are on different versions of the format - this will be indicated in the file heading. Our VCFs are multi-individual, with genotypes listed for each sample; we do not have individual or population specific VCFs.

Are all the genotype calls in the 1000 Genomes Project VCF files bi-allelic?

No. While bi-allelic calling was used in earlier phases of the 1000 Genomes Project, multi-allelic SNPs, indels, and a diverse set of structural variants (SVs) were called in the final phase 3 call set. More information can be found in the main phase 3 publication from the 1000 Genomes Project and the structural variation publication. The supplementary information for both papers provides further detail.

In earlier phases of the 1000 Genomes Project, the programs used for genotyping were unable to genotype sites with more than two alleles. In most cases, the highest frequency alternative allele was chosen and genotyped. Depth of coverage, base quality and mapping quality were also used when making this decision. This was the approach used in phase 1 of the 1000 Genomes Project. As methods were developed during the 1000 Genomes Project, it is recommended to use the final phase 3 data in preference to earlier call sets.

Related questions:

Can I get phased genotypes and haplotypes for the individual genomes?

Answer:

Phased variant call sets are described in “Are the variant calls in IGSR phased?”.

You can obtain individual phased genotypes through either the Ensembl Data Slicer or using a combination of tabix and VCFtools allows you to sub sample VCF files for a particular individual or list of individuals.

The Data Slicer has both filter by individual and population options. The individual filter takes the individual names in the VCF header and presents them as a list before giving you the final file. If you wish to filter by population, you also must provide a panel file which pairs individuals with populations, again you are presented with a list to select from before being given the final file, both lists can have multiple elements selected.

To use tabix you must also use a VCFtools Perl script called vcf-subset. The command line would look like:

tabix -h ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 17:1471000-1472000 | perl vcf-subset -c HG00098 | bgzip -c /tmp/HG00098.20100804.genotypes.vcf.gz

Please also note that some studies, such as the second phase of the Human Genome Structural Variation Consortium (HGSVC), are now producing haplotype resolved asssemblies.

Related questions:

How are allele frequencies calculated?

Answer:

Our standard AF values are allele frequencies rounded to two decimal places calculated using allele count (AC) and allele number (AN) values.

LDAF is an allele frequency value in the info column of our phase 1 VCF files. LDAF is the allele frequency as inferred from the haplotype estimation. You will note that LDAF does sometimes differ from the AF calculated on the basis of allele count and allele number. This generally means there are many uncertain genotypes for this site. This is particularly true close to the ends of the chromosomes.

Genotype Dosage

The phase 1 data set also contains Genotype Dosage values. This comes from Mach/Thunder, imputation engine used for genotype refinement in the phase 1 data set.

The Dosage represents the predicted dosage of the non reference allele given the data available, it will always have a value between 0 and 2.

The formula is Dosage = Pr(Het|Data) + 2*Pr(Alt|Data)

The dosage value gives an indication of how well the genotype is supported by the imputation engine. The genotype likelihood gives an indication of how well the genotype is supported by the sequence data.

Related questions:

Was HLA Diversity studied in IGSR?

Answer:

HLA diversity is not something which was studied by the 1000 Genomes Project directly. However, groups have looked at the HLA diversity of the samples in the 1000 Genomes Project.

2018 data

The most recent of these studies was published by Laurent Abi-Rached, Julien Paganini and colleagues in 2018 and covers 2,693 samples from the work of the 1000 Genomes Project. Details of the study and data used in this work are available via the publication and the HLA types are available on our FTP site at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HLA_types/.

2014 data

The FTP site also hosts data from an earlier study by Pierre-Antoine Gourraud, Jorge Oksenberg and colleages at UCSF who carried out an HLA typing assay on DNA sourced from Coriell for 1000 Genomes samples. This earlier study looks at only the 1,267 samples that were available at that time.

The earlier work assessing HLA Diversity is publised in “HLA diversity in the 1000 Genome Dataset”, with data available from the 1000 Genomes FTP site in ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20140725_hla_genotypes/.

Related questions: