This page is to support discussion of the draft spec for the Variant Call Format (VCF), in which the 1000 Genomes SNP, short indel, and CNV calls are released.
At the 1000 Genomes Meeting in May 2012 we introduced the specification for BCF2, the official binary version of VCF. In addition, we have compiled a "Quick Reference" document for those who may find it useful. We are also currently drafting VCF v4.2. If you wish to join the discussion, please join the vcftools-spec mailing list.
There is also a vcftools-help mailing list for more general questions about the format and the vcftools codebase.
Please note the specification is now being maintained on GitHub at https://github.com/amarcket/vcf_spec
Any documentation here may be out of date after 1st August 2013.
The currently supported version is VCF4.1 (drafted Feb 15th, 2011). The 1000 Genomes Project data are available in this format, and there are various software tools supporting it, such as a VCF Validator. Notable changes from v4.0 are:
- Recommend usage of SAM-style SQ tags in the header for describing the reference backing the calls.
- REF and ALT characters can be lowercase.
- We state explicitly that multiple records can have the same POS.
- Non-alphanumeric characters are permitted in many of the fields; each field defines precisely which characters are allowed.
- The GT field is no longer required to be present; the description of ploidy has been fleshed out.
- The reserved GQ and GL tags have been better defined.
- PL, PS, GP, PQ, and EC tags and CNL for structural variants have been added to the list of reserved key names for the FORMAT field.
- The SV specs have been included in the VCF4.1 spec.
The v4.1 version of the example is:
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=
##phasing=partial
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##FILTER=
##FILTER=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2
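As a rough illustration of how the data lines in this example are laid out, the following Python sketch splits one record into its fixed columns, an INFO dictionary, and one dictionary per sample keyed by the FORMAT tags. The function names `parse_info` and `parse_record` are hypothetical and are not part of vcftools; the sketch assumes whitespace-separated columns (real files use tabs) and does no validation.

```python
FIXED = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]

def parse_info(text):
    """Split an INFO string into a dict; keys without '=' become True flags."""
    info = {}
    if text == ".":
        return info
    for item in text.split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info

def parse_record(line, sample_names):
    """Parse one data line into fixed fields, an INFO dict and per-sample dicts."""
    # Real files are tab-delimited; split() also accepts the spaces shown on this page.
    cols = line.split()
    record = dict(zip(FIXED, cols[:9]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")
    record["INFO"] = parse_info(record["INFO"])
    keys = record["FORMAT"].split(":")
    # Some example rows omit the trailing HQ subfield, so zip() simply
    # stops at the shorter of the key list and the value list.
    record["samples"] = {
        name: dict(zip(keys, col.split(":")))
        for name, col in zip(sample_names, cols[9:])
    }
    return record

samples = ["NA00001", "NA00002", "NA00003"]
line = ("20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ "
        "0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3")
rec = parse_record(line, samples)
print(rec["FILTER"], rec["INFO"], rec["samples"]["NA00003"])
```

All values are kept as strings apart from POS; a fuller parser would use the Type declared in the header to convert them.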
The previous version was VCF4.0 (drafted May 12th, 2010; finalized July 15th, 2010). The aim of this update from VCF 3.3 was to improve the various data types and missing values used in the VCF file and to derive a consensus for placing short indels. Major changes from v3.3 were:
- The REF column can be a String (list of bases).
- The ALT column no longer allows for I/D codes, but does permit angle-bracketed IDs.
- InDels must include the base before the event (and this must be reflected by the value in the POS field).
- There is a standard missing value character (“.”) used.
- “PASS” is used to signify that a record passes filters instead of “0”.
- The meta information fields in the header use key/value pairs.
- END and CIGAR have been included in the list of reserved key names for the INFO field.
- GL is reserved as a genotype field tag to represent genotype likelihoods. How to represent this information for multi-allelic sites is still under discussion.
The v4.0 version of the example is:
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=-1,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
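The key/value style of the v4.0 meta-information lines can be read with a small regular expression. The sketch below is illustrative only: `parse_meta_line` is a hypothetical helper, not part of any released tool, and it assumes values are either double-quoted strings (which may contain commas) or plain tokens without commas.

```python
import re

# A value is either a double-quoted string (commas allowed inside)
# or a run of characters up to the next comma.
_PAIR = re.compile(r'(\w+)=("([^"]*)"|[^,]*)')

def parse_meta_line(line):
    """Parse a '##key=value' meta line.

    Structured lines such as ##INFO=<ID=...,...> return (key, dict);
    plain lines such as ##fileformat=VCFv4.0 return (key, str).
    """
    if not line.startswith("##"):
        raise ValueError("not a meta-information line")
    key, _, value = line[2:].partition("=")
    if value.startswith("<") and value.endswith(">"):
        fields = {}
        for m in _PAIR.finditer(value[1:-1]):
            name, raw, quoted = m.group(1), m.group(2), m.group(3)
            # Use the text inside the quotes when the value was quoted.
            fields[name] = quoted if quoted is not None else raw
        return key, fields
    return key, value

kind, fields = parse_meta_line(
    '##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">'
)
print(kind, fields["ID"], fields["Description"])
```

Handling the quoted Description separately matters because a naive split on commas would cut "dbSNP membership, build 129" in two.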
Before that, the previous version was VCF3.3 (drafted November 4th, 2009). There are some software tools that still support this version, such as a VCF 3.3 Validator. The aim of this update from 3.2 was to clarify the various data types and missing values used in the VCF file. Major changes from v3.2 are:
- Each field now has a defined type and missing value.
- CHROM and POS are now required.
- ##fileformat must be declared in the meta information.
- Meta information flags ##INFO, ##FILTER and ##FORMAT are used to define the information presented in the INFO, FILTER and FORMAT fields respectively.
The v3.3 version of the example is:
##fileformat=VCFv3.3
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=NS,1,Integer,"Number of Samples With Data"
##INFO=DP,1,Integer,"Total Depth"
##INFO=AF,-1,Float,"Allele Frequency"
##INFO=AA,1,String,"Ancestral Allele"
##INFO=DB,0,Flag,"dbSNP membership, build 129"
##INFO=H2,0,Flag,"HapMap2 membership"
##FILTER=q10,"Quality below 10"
##FILTER=s50,"Less than 50% of samples have data"
##FORMAT=GT,1,String,"Genotype"
##FORMAT=GQ,1,Integer,"Genotype Quality"
##FORMAT=DP,1,Integer,"Read Depth"
##FORMAT=HQ,2,Integer,"Haplotype Quality"
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 0 NS=58;DP=258;AF=0.786;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5
20 17330 . T A 3 q10 NS=55;DP=202;AF=0.024 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 0 NS=55;DP=276;AF=0.421,0.579;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 0 NS=57;DP=257;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 G D4,IGA 50 0 NS=55;DP=250;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
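To make the difference between the positional v3.3 meta lines and the later key/value style concrete, here is a minimal sketch that rewrites a v3.3 ##INFO, ##FILTER or ##FORMAT line in the v4.0 layout. `upgrade_meta_line` is a hypothetical name; the sketch only covers the header syntax and does not convert data lines (for example the "0" filter value or the I/D codes in ALT).

```python
import csv

def upgrade_meta_line(line):
    """Rewrite a v3.3 ##INFO/##FILTER/##FORMAT line in v4.0 key/value style."""
    key, _, value = line[2:].partition("=")
    if key in ("INFO", "FORMAT"):
        # csv.reader keeps the comma inside the quoted description intact.
        ident, number, typ, desc = next(csv.reader([value]))
        return f'##{key}=<ID={ident},Number={number},Type={typ},Description="{desc}">'
    if key == "FILTER":
        ident, desc = next(csv.reader([value]))
        return f'##{key}=<ID={ident},Description="{desc}">'
    # Other meta lines (including ##fileformat) pass through unchanged here.
    return line

print(upgrade_meta_line('##INFO=DB,0,Flag,"dbSNP membership, build 129"'))
print(upgrade_meta_line('##FILTER=q10,"Quality below 10"'))
```

The output lines match the corresponding header lines of the v4.0 example.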
The SCR version 2.0 proposal from Gabor 24/7 is available from SCR2.0. This would have stored a few example lines as:
CHR POS FLT PRB REF ALT INF NA00001 NA00002 NA00003
20 14370 0 29 G A NS=58;DP=258;DB;H2 G|G:48:1:51,51 A|G:48:8:51,51 A/A:43:5
20 13330 1 3 T A NS=55;DP=202 T|T:49:3:58,50 T|A:3:5:65,3 T/T:41:3
20 1110696 0 67 A G,T NS=55;DP=276;DB G|T:21:6:23,27 T|G:2:0:18,2 T/T:35:4
20 10237 0 47 T . NS=57;DP=257 T|T:54:7:56,60 T|T:48:4:51,51 T/T:61:2
with fixed genotype data fields GenoType;Quality;Depth;HaplotypeQuality. Comments on this from Richard 29/7 are at SCR2.0-RD, and there were other useful comments from Gil, Sol Katzman and others that could be added.
Following this we agreed the meta-information format. Example:
##format=VCFv2.0
##fileDate=20090805
##source=QCALLv1.0
##reference=1000GenomesPilot-NCBI36
##phasing=partial
and also agreed to require all the fixed fields, and add extra info fields for allele frequency and ancestral allele.
The VCF 3.0 proposal from Gabor 3/8 is available at VCF3.0. The example lines are now:
#CHR POS FLT PRB REF ALT INF NA00001 NA00002 NA00003
20 14370 1 0.99867 G A NS=58;DP=258;AF=0.786;DB=1;H2=1 G|G&0&0&48&1&HQ=51,51 A|G&0&0&48&8&HQ=51,51 A/A&0&0&43&5
20 13330 2 0.43336 T A NS=55;DP=202;AF=0.024 T|T&0&0&49&3&HQ=58,50 T|A&0&0&3&5&HQ=65,3 T/T&0&0&41&3
20 1110696 1 1.00000 A G,T NS=55;DP=276;AF=0.421,0.579;AA=T;DB=1 G|T&0&0&21&6&HQ=23,27 T|G&0&1&2&0&HQ=18,2 T/T&0&0&35&4
20 10237 0 0.00002 T . NS=57;DP=257;AA=T T|T&0&0&54&7&HQ=56,60 T|T&0&0&48&4&HQ=51,51 T/T&0&0&61&2
with the FLT field recording whether there is a SNP (1) or not (0) when the site is not filtered, and a value greater than 1 when it is filtered; additional info fields for allele frequency and ancestral allele; and additional filter and method data for each genotype.
A proposed revised version of this from Richard 5/8 is given at VCF3.1. The example lines are now:
#CHROM POS REF ALT PROB FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 G A 0.99867 0 NS=58;DP=258;AF=0.786;DB;H2 GT;GQ;DP;HQ G|G;48;1;51,51 A|G;48;8;51,51 A/A;43;5
20 13330 T A 0.43336 p90 NS=55;DP=202;AF=0.024 GT;GQ;DP;HQ T|T;49;3;58,50 T|A;3;5;65,3 T/T;41;3
20 1110696 A G,T 1.00000 0 NS=55;DP=276;AF=0.421,0.579;AA=T;DB GT;GQ;DP;HQ G|T;21;6;23,27 T|G;2;0;18,2 T/T;35;4
20 10237 T . 0.00002 0 NS=57;DP=257;AA=T GT;GQ;DP;HQ T|T;54;7;56,60 T|T;48;4;51,51 T/T;61;2
- I am proposing a new way to store per sample genotype data, to support more flexibility. There is now a FORMAT field describing which data are present for each genotype. This applies to the whole record. Gabor changed the set of fixed fields, I presume based on the needs of the pilot 3 group, for example allowing a filter for each genotype. Goncalo also wants the potential to store other information per genotype. Gabor proposed fixed fields followed by a key=value system, as in the main record, but this gets quite heavy per item, and I am pretty sure that all genotypes on a line, if not in a file, will have the same data. This is part way towards Goncalo's desire to have a TYPE field, then just one type of genotype information per record, with multiple records per variant. He could use the proposed revised format that way, though I'd prefer that we distribute 1000 Genomes results with one line per position, and perhaps different per-sample information for the different pilots.
- The revised genotype fields look like the 2.0 ones because I put into 2.0 the fields we at Sanger wanted to generate for pilot 1. But by changing the format it would be possible to store per genotype filters and methods, which were the fields that Gabor added and may be important for pilot 3, and for example drop haplotype qualities, which may not be important.
Suggestion from Gil: One item that immediately jumps out as being absent is a unique identifier for the novel variants. Of course, the chr/pos is unique, but it would be good to have a project identifier that we can use before the variant gets folded into dbSNP. I'd like to propose another field, which is the unique identifier - dbSNP rsID where known and a project given identifier where not. This could be another term in INF, but I would suggest that having this as a FIXED field makes sense.
- I suggest changing the order of the fixed fields, based on feedback from potential users. Apologies if it is a bit late to do this. We could revert back, but this way round the reference follows the chromosome and position, which seems more natural to people.
- I reverted to the filter being 0 for no filter, i.e. a good call, and using alphanumeric codes for failed filters. e.g. p90 above for the probability of a SNP being less than 90%.
- I reverted to allowing keys without values in info, as flags. e.g. DB, H2 above for dbSNP and HapMap 2 membership. It would be good to give the rs number as the DB value.
- I changed the separator in the genotype data to be ';'. It had originally been ':', was changed by Gabor to '&', and this last change makes it consistent with the separator in the INFO field.
- I also reverted to allow missing data in genotype data subfields to be dropped, rather than represented by '.'.
- I accept, with a bit of reluctance, the change to give the probability of a variant as a real number, rather than the phred-scaled probability of the call made in ALT being wrong. In the example the probability is to 5 decimal places but I recommend at least 6. For some applications, e.g. searching for de novo mutations or cancer, one wants very high confidence, and can get it with data depths commonly achieved. Phred scores (-10log_10 p(call is wrong)) have higher dynamic range, but seem to be confusing in this context.
- I would propose to distribute one file with genotypes for all sites called as variant, perhaps with additional “close-to-threshold” sites also present with the filter that ruled them out. Then a second file without genotypes with every position, showing which positions could be called, what the depth is at them, what the probability of a variant is at them (normally very low - need high resolution again in probability space) etc.
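The trade-off between a raw probability and a phred-scaled score discussed above can be seen with a short calculation. This is only a sketch of the standard phred transform, Q = -10 log10 p(call is wrong); the function names are made up for illustration.

```python
import math

def prob_to_phred(p_wrong):
    """Phred-scale the probability that a call is wrong."""
    return -10.0 * math.log10(p_wrong)

def phred_to_prob(q):
    """Invert the phred scale back to an error probability."""
    return 10.0 ** (-q / 10.0)

# A variant probability of 0.99867 leaves 0.00133 for "no variant".
print(round(prob_to_phred(1 - 0.99867), 2))

# With five decimal places, anything beyond 0.99999 prints as 1.00000,
# so error probabilities below 1e-5 (phred 50) cannot be told apart.
for q in (30, 50, 70):
    print(q, f"{1 - phred_to_prob(q):.5f}")
```

Note that `math.log10` raises `ValueError` for an error probability of exactly zero, so a rounded value of 1.00000 cannot be converted without more digits.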
Changes from 3.1 that were agreed are:
- Add a required ID field following POS. Use rs numbers where they are available, else other unique identifiers if available, else '.'.
- The PROB field should become QUAL, and encode a phred-scaled quality score for the assertion made in ALT. i.e. give -10log_10 prob(call in ALT is wrong). If ALT is '.' (no variant) then this is -10log_10 p(variant), and if ALT is not '.' this is -10log_10 p(no variant). High QUAL scores indicate high confidence calls. Although historically people have used integer phred scores, we agreed to permit floating point scores to enable higher resolution for low confidence calls if desired.
- In the INFO field:
  - Allow keys to be short alphanumeric strings, not just two characters, e.g. dbSNP, HapMap2.
  - We require the '=' sign and a value for all keys. For boolean values we can write e.g. “HapMap3=1”. We agreed it should not be necessary to give negative values, i.e. it is not necessary for all sites not in HapMap3 to include “HapMap3=0”. The rationale for this is that many keys may be very sparse, and the aim for this field is to provide a list of information for each site, not a table.
- In the FORMAT field we require GT (Genotype) to be first. Again, we permit short alphanumeric strings for keys, not just 2-character keys.
- In the genotype fields:
  - the subfield separator should be ':' not ';'. This facilitates some processing, and arguably looks cleaner.
  - the alleles in the genotype (first subfield) are given with numbers, using 0 for the reference, 1 for the first alternate allele listed in ALT, 2 for the second etc. This is much more compact for large insertions, and consistent with standard genotype practice.
  - the genotype can contain just one allele (haploid), two alleles separated by one of '\', '|', '/' (diploid) or potentially more for triploid etc. We need haploid genotypes for male X and Y, MT and haploid organisms. We see no immediate requirement for higher ploidy than diploid, but think it is wise to permit it in principle. It is OK for software that comes across ploidy greater than two to complain that it doesn't handle it yet, but it should detect it.
- We agreed the proposed changes in 3.1 to the FLT field (allow ';' separated list of short strings denoting failed filters).
- In email correspondence, it was proposed that the AF allele frequency info tag should be used for allele frequencies derived from primary sequence data, and that if the AC and AN allele counts tags were used they would summarise the counts in the genotype calls.
- It was confirmed that the CHR field should contain an identifier from the appropriate reference fasta file.
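The numeric genotype encoding and the AC/AN convention agreed above can be sketched as follows. The helper names `parse_gt` and `allele_counts` are hypothetical; the code assumes GT strings use '/', '|' or a backslash as separators and '.' for a missing allele.

```python
import re

def parse_gt(gt):
    """Split a GT string into allele indices; a missing allele ('.') becomes None.

    The length of the returned list is the ploidy of the call.
    """
    return [None if a == "." else int(a) for a in re.split(r"[/|\\]", gt)]

def allele_counts(gts, n_alt):
    """Return (AC, AN): one count per ALT allele, and the number of called alleles."""
    ac = [0] * n_alt
    an = 0
    for gt in gts:
        for allele in parse_gt(gt):
            if allele is None:
                continue
            an += 1
            if allele > 0:  # 0 is the reference allele
                ac[allele - 1] += 1
    return ac, an

# Genotypes from the example record at position 1110696 (ALT = G,T).
ac, an = allele_counts(["1|2", "2|1", "2/2"], n_alt=2)
print(ac, an)

# Software that only handles haploid and diploid calls can still detect others.
for gt in ("0", "0/1", "0/1/2"):
    ploidy = len(parse_gt(gt))
    print(gt, ploidy, "unsupported" if ploidy > 2 else "ok")
```

Counting from the genotype calls in this way gives AC and AN as summaries of the called genotypes, which is distinct from an AF value estimated from the primary sequence data.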
A previous proposal was VCFv3.2, which incorporated proposed changes from a phone call including representatives from Baylor, Broad, Boston College, Oxford and Sanger on August 7, 2009.