BCF (Binary VCF) version 2

Please note this specification has been merged with the VCF specification and is now being maintained in github at https://github.com/samtools/hts-specs

The current specification version is 2.1.  


Introduction

VCF is very expressive, accommodates multiple samples, and is widely used in the community. Its biggest drawback is that it is big and slow. Files are text and therefore require a lot of space on disk. A normal batch of ~100 exomes is a few GB, but large-scale VCFs with thousands of exome samples quickly become hundreds of GBs. Because the file is text, it is extremely slow to parse.

 

Overall, the idea behind BCF2 is simple. BCF2 is a binary, compressed equivalent of VCF that can be indexed with tabix and can be efficiently decoded from disk or streams. For efficiency reasons BCF2 only supports a subset of VCF, in that all INFO and genotype fields must have their full types specified. That is, BCF2 requires that if e.g. an INFO field AC is present then the header must contain an equivalent VCF header line noting that AC is an allele-indexed array of type integer.

This page is a more detailed description to help implementers understand the format. It is complementary to the BCF2 quick reference.


Overall file organization


A BCF2 file is composed of a mandatory header, followed by a series of BGZF compressed blocks of binary BCF2 records. The BGZF blocks allow BCF2 files to be indexed with tabix.

BGZF blocks are composed of a VCF header with a few additional records and a block of records. Following the last BGZF BCF2 record block is an empty BGZF block (a block containing zero bytes of data), indicating that the records are done.

A BCF2 header follows exactly the specification as VCF, with a few extensions / restrictions:

  • All BCF2 files must have fully specified contig definitions.  No record may refer to a contig not present in the header itself.
  • All INFO and GENOTYPE fields must be fully typed in the BCF2 header to enable type-specific encoding of the fields in records. An error should be thrown when converting a VCF to BCF2 when an unknown or not fully specified field is encountered in the records.


Header

The BCF2 header begins with the "BCF2 magic", 5 bytes that encode "BCFXY" where X and Y are bytes indicating the major number (currently 2) and the minor number (currently 1). This magic can be used to quickly examine the file to determine that it's a BCF2 file.  Immediately following the BCF2 magic are the standard VCF header lines in text format, beginning with ##fileformat=VCFvX.Y.  Because the type is encoded directly in the header, the recommended extension for BCF2 formatted files is ".bcf".  BCF2 supports encoding values in a dictionary of strings. The string map is provided by the keyword ##dictionary=S0,S1,...,SN as a comma-separated ordered list of strings. See the "Dictionary of strings" section for more details.
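As a rough illustration, the magic check could be sketched in Python as follows. The function name parse_bcf2_magic is made up for this example, and the sketch assumes that data holds the first bytes of the already decompressed stream.

```python
def parse_bcf2_magic(data):
    """Return (major, minor) read from the 5-byte magic, or raise ValueError."""
    if len(data) < 5 or data[:3] != b"BCF":
        raise ValueError("missing 'BCF' magic, probably not a BCF2 file")
    # X and Y are stored as raw byte values, not as ASCII digits.
    return data[3], data[4]


# A current file starts with 'B', 'C', 'F', 0x02, 0x01, then the text header.
print(parse_bcf2_magic(b"BCF\x02\x01##fileformat=VCFv4.1\n"))
```

A reader would typically reject files whose major number it does not understand before trying to parse the text header.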


Dictionary of strings

Throughout the BCF file most string values are specified by integer reference to their dictionary values. For example, the following VCF records:

##INFO=<ID=ASP,Number=0,Type=Flag,Description="X">
##INFO=<ID=RSPOS,Number=1,Type=Integer,Description="Y">
##INFO=<ID=dbSNPBuildID,Number=1,Type=Integer,Description="Z">
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens">
#CHROM POS ID REF ALT QUAL FILTER INFO
20 10144 rs144773400 TA T . PASS ASP;RSPOS=10145;dbSNPBuildID=134
20 10228 rs143255646 TA T . PASS ASP;RSPOS=10229;dbSNPBuildID=134

would be encoded inline in BCF2 by reference to the relative position of the header line in the header (PASS implicitly encoded at the first offset, PASS=0, followed by ASP=1, RSPOS=2, and dbSNPBuildID=3):

##INFO=<ID=ASP,Number=0,Type=Flag,Description="X">
##INFO=<ID=RSPOS,Number=1,Type=Integer,Description="Y">
##INFO=<ID=dbSNPBuildID,Number=1,Type=Integer,Description="Z">
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens">
#CHROM POS ID REF ALT QUAL FILTER INFO
0 10144 rs144773400 TA T . s0 s1;s2=10145;s3=134
0 10228 rs143255646 TA T . s0 s1;s2=10229;s3=134

Note that the dictionary encoding has the magic prefix 's' here to indicate that the field's value is actually in the dictionary entry given by the subsequent offset. This representation isn't actually the one used in BCF2 records but it provides a clean visual guide for the above example.  Note also how the contig has been recoded as an offset into the list of contig declarations.

Note that "PASS" is always implicitly encoded as the first entry in the header dictionary.  This is because VCF allows FILTER fields to be PASS without explicitly listing this in the FILTER field itself.
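To make the offsets concrete, here is a minimal Python sketch that builds such a dictionary from header lines. The function name build_string_dictionary is made up for this example, and the sketch assumes that only INFO, FILTER and FORMAT header lines contribute entries, in order of appearance, after the implicit PASS entry.

```python
import re


def build_string_dictionary(header_lines):
    """Map each INFO/FILTER/FORMAT ID to its offset; PASS is always offset 0."""
    dictionary = {"PASS": 0}
    pattern = re.compile(r"^##(?:INFO|FILTER|FORMAT)=<ID=([^,>]+)")
    for line in header_lines:
        match = pattern.match(line)
        if match and match.group(1) not in dictionary:
            dictionary[match.group(1)] = len(dictionary)
    return dictionary


header = [
    '##INFO=<ID=ASP,Number=0,Type=Flag,Description="X">',
    '##INFO=<ID=RSPOS,Number=1,Type=Integer,Description="Y">',
    '##INFO=<ID=dbSNPBuildID,Number=1,Type=Integer,Description="Z">',
    '##contig=<ID=20,length=62435964>',
]
print(build_string_dictionary(header))
```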

Dictionary of contigs

The CHROM field in BCF2 is encoded as an integer offset into the list of ##contig field headers in the VCF header.  The offsets begin, like the dictionary of strings, at 0.  So for example if in BCF2 the contig value is 10, this indicates that the actual chromosome is the 11th element in the ordered list of ##contig elements.  Here's a more concrete example:
 

##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens">
##contig=<ID=21,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens">
##contig=<ID=22,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens">
#CHROM POS ID REF ALT QUAL FILTER INFO
20 1 . T A . PASS .
21 2 . T A . PASS .
22 3 . T A . PASS .
 

the actual CHROM field values in the encoded BCF2 records would be 0, 1, and 2 corresponding to the first (offset 0) ##contig element, etc. 
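The same idea can be sketched for contigs. The helper name contig_offsets is made up for this example; it simply numbers the ##contig lines in the order they appear, starting at 0.

```python
import re


def contig_offsets(header_lines):
    """Return {contig ID: offset} in order of appearance of ##contig lines."""
    offsets = {}
    for line in header_lines:
        match = re.match(r"^##contig=<ID=([^,>]+)", line)
        if match:
            # len(offsets) is evaluated before insertion, so the first ID gets 0.
            offsets.setdefault(match.group(1), len(offsets))
    return offsets


header = [
    "##contig=<ID=20,length=62435964,assembly=B36>",
    "##contig=<ID=21,length=62435964,assembly=B36>",
    "##contig=<ID=22,length=62435964,assembly=B36>",
]
offsets = contig_offsets(header)
names = {offset: name for name, offset in offsets.items()}  # decoder direction
print(offsets["21"], names[2])
```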

BCF2 records


In BCF2, the original VCF records are converted to binary and encoded as BGZF blocks. Each record is conceptually two parts. First is the site information (chr, pos, INFO field). Immediately after the sites data is the genotype data for every sample in the BCF2 file. The genotype data may be omitted entirely from the record if there is no genotype data in the VCF file. Note that it's acceptable to not BGZF compress a BCF2 file, but not all readers may handle this uncompressed encoding.


Site encoding

BCF2 site information encoding
Field                       Type                            Notes
l_shared                    uint32_t                        Data length from CHROM to the end of INFO
l_indiv                     uint32_t                        Data length of FORMAT and individual genotype fields
CHROM                       int32_t                         Given as an offset into the mandatory contig dictionary
POS                         int32_t                         0-based leftmost coordinate
rlen                        int32_t                         Length of the record as projected onto the reference sequence. May be the actual length of the REF allele but for symbolic alleles should be the declared length respecting the END attribute
QUAL                        float                           Variant quality; 0x7F800001 for a missing value
n_allele_info               int32_t                         n_allele<<16 | n_info, where n_allele is the number of REF+ALT alleles in this record, and n_info is the number of VCF INFO fields present in this record
n_fmt_sample                uint32_t                        n_fmt<<24 | n_sample, where n_fmt is the number of format fields for genotypes in this record, and n_sample is the number of samples present in this record. Note that the number of samples must be equal to the number of samples in the header
ID                          typed string
REF+ALT                     list of n_allele typed strings  The first allele is REF (mandatory) followed by n_allele - 1 ALT alleles, all encoded as typed strings
FILTER                      Typed vector of integers        A vector of integer offsets into the dictionary, one for each FILTER field value. "." is encoded as MISSING
INFO field key/value pairs  n_info pairs of typed vectors   The first value must be a typed atomic integer giving the offset of the INFO field key into the dictionary. The second value is a typed vector giving the value of the field
Genotype value block        see below                       see below
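The two packed count fields can be illustrated with a short Python sketch. The helper names pack_counts and unpack_counts are made up for this example; the shifts follow the n_allele<<16|n_info and n_fmt<<24|n_sample forms used in the worked example later on this page.

```python
import struct


def pack_counts(n_allele, n_info, n_fmt, n_sample):
    """Combine the four counts into the two 32-bit fields of a record."""
    n_allele_info = (n_allele << 16) | n_info
    n_fmt_sample = (n_fmt << 24) | n_sample
    return n_allele_info, n_fmt_sample


def unpack_counts(n_allele_info, n_fmt_sample):
    """Recover (n_allele, n_info, n_fmt, n_sample) from the packed fields."""
    return (n_allele_info >> 16, n_allele_info & 0xFFFF,
            n_fmt_sample >> 24, n_fmt_sample & 0xFFFFFF)


packed = pack_counts(n_allele=2, n_info=4, n_fmt=5, n_sample=3)
print([hex(value) for value in packed])
# On disk each field is written little endian, e.g. with struct:
print(struct.pack("<II", *packed).hex())
```

One consequence of this layout is that a record can describe at most 65535 INFO fields and 255 FORMAT fields.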


Genotypes encoding

Genotype fields are encoded not by sample as in VCF but rather by field, with a vector of values for each sample following each field. In BCF2, the following VCF line:

FORMAT NA00001 NA00002 NA00003

GT:GQ:DP 0/0:48:1 0/1:48:8 1/1:43:5

would be encoded as the equivalent of:

GT=0/0,0/1,1/1  GQ=48,48,43  DP=1,8,5
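This by-sample to by-field regrouping can be sketched in a few lines of Python. The function name transpose_genotypes is made up for this example, and the values are kept as text; the binary encoding of each field is a separate step.

```python
def transpose_genotypes(format_column, sample_columns):
    """Regroup per-sample 'a:b:c' strings into one list of values per field."""
    keys = format_column.split(":")
    per_field = {key: [] for key in keys}
    for sample in sample_columns:
        for key, value in zip(keys, sample.split(":")):
            per_field[key].append(value)
    return per_field


fields = transpose_genotypes("GT:GQ:DP", ["0/0:48:1", "0/1:48:8", "1/1:43:5"])
for key, values in fields.items():
    print(key + "=" + ",".join(values))
```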

Suppose there are several genotype fields in a specific record. Each field is encoded by a triplet:

BCF2 genotype field encoding
Field         Type             Notes
fmt_key       typed int        Format key as an offset into the dictionary
fmt_type      uint8_t+         Typing byte of each individual value, possibly followed by a typed int for the vector length. In effect this is the same as the typing value for a single vector, but for genotype values it appears only once before the array of genotype field values
fmt_values    (by fmt type)    Array of values. The information of each individual is concatenated in the vector. Every value is of the same fmt type. Variable-length vectors are padded with missing values; a string is stored as a vector of characters

The value is always implicitly a vector of N values, where N is the number of samples. The type byte of the value field indicates the type of each value of the N length vector. For atomic values this is straightforward (size = 1). But if the type field indicates that the values are themselves vectors (as often occurs, such as with the PL field) then each of the N values in the outer vector is itself a vector of values. This encoding is efficient when every value in the genotype field vector has the same length and type.

Note that the specific order of fields isn't defined, but it's probably a good idea to respect the ordering as specified in the input VCF/BCF2 file.

If there are no sample records (genotype data) in this VCF/BCF2 file, the size of the genotypes block will be 0.


Type encoding

In BCF2 values are all strongly typed in the file. The type information is encoded in a prefix byte before the value, which contains information about the low-level type of the value(s) such as int32 or float, as well as the number of elements in the value. The encoding is as follows:

BCF2 type descriptor byte
Bit             Meaning
5,6,7,8 bits    The number of elements of the upcoming type. For atomic values, the size must be 1. If the size is set to 15, this indicates that the vector has 15 or more elements, and that the subsequent BCF2 byte stream contains a typed Integer indicating the true size of the vector. If the size is between 2-14, then this Integer is omitted from the stream and the upcoming stream begins immediately with the first value of the vector. A size of 0 indicates that the value is MISSING
1,2,3,4 bits    Type

The lowest four bits encode an unsigned integer that indicates the type of the upcoming value in the data stream.

BCF2 types
Value in the lowest 4 bits    Hexadecimal encoding    Corresponding atomic type
1                             0x?1                    Integer [8 bit]
2                             0x?2                    Integer [16 bit]
3                             0x?3                    Integer [32 bit]
5                             0x?5                    Float [32 bit]
7                             0x?7                    Character, ASCII encoded in 8 bits. Note this is not used in BCF2, but its type is reserved in case this becomes necessary. In BCF2 characters are simply represented by strings with a single element
0,4,6,8-15                                            reserved for future use
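The descriptor byte is easy to build and take apart with shifts and masks. Here is a minimal Python sketch; the constant and function names are made up for this example.

```python
INT8, INT16, INT32, FLOAT, CHAR = 1, 2, 3, 5, 7
OVERFLOW_SIZE = 15  # size nibble meaning "true size follows as a typed integer"


def make_type_byte(size, type_code):
    """Put the (capped) element count in the high nibble, the type in the low."""
    return (min(size, OVERFLOW_SIZE) << 4) | type_code


def split_type_byte(byte):
    """Return (size nibble, type code) for one descriptor byte."""
    return byte >> 4, byte & 0x0F


print(hex(make_type_byte(1, INT8)))   # atomic 8-bit integer
print(hex(make_type_byte(4, CHAR)))   # 4-character string
print(split_type_byte(0xF7))          # overflow size, character type
```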

 

 

Integers

Integers may be encoded as 8, 16, or 32 bit values, in little-endian order. It is up to the encoder to determine the appropriate ranged value to use when writing the BCF2 file. For each integer size, the value with only the highest bit set (0x80, 0x8000, and 0x80000000 for 8, 16, and 32 bit values, respectively), which is the most negative value of that size, indicates that the field is a missing value.
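One way an encoder might pick the width is sketched below in Python. The names are made up for this example, and the sketch assumes that only the MISSING sentinel itself is excluded from the usable range of each width.

```python
import struct

# (type code, struct format, MISSING sentinel) for 8, 16 and 32 bit integers
INT_TYPES = [(1, "<b", -0x80), (2, "<h", -0x8000), (3, "<i", -0x80000000)]


def encode_typed_int(value):
    """Encode one atomic integer as a type byte plus the smallest width."""
    for code, fmt, missing in INT_TYPES:
        # The sentinel is reserved for MISSING, so it is not a usable value.
        if missing < value <= -missing - 1:
            return bytes([0x10 | code]) + struct.pack(fmt, value)
    raise ValueError("value does not fit in 32 bits")


def encode_missing_int(code):
    """Encode an atomic MISSING integer of the given type code (1, 2 or 3)."""
    for type_code, fmt, missing in INT_TYPES:
        if type_code == code:
            return bytes([0x10 | code]) + struct.pack(fmt, missing)
    raise ValueError("unknown integer type code")


print(encode_typed_int(27).hex(), encode_typed_int(300).hex())
print(encode_missing_int(1).hex())
```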


Floats

Floats are encoded as single-precision (32 bit) in the basic format defined by the IEEE 754-1985 standard. This is the standard representation for floating point numbers on modern computers, with direct support in programming languages like C and Java (see Java's Double class for example). BCF2 supports the full range of values from -Infinity to +Infinity, including NaN. BCF2 needs to represent missing values for single precision floating point numbers. This is accomplished by writing the NaN value as the quiet NaN (qNaN), while the MISSING value is encoded as a signaling NaN. From the NaN wikipedia entry, we have:

For example, a bit-wise example of a IEEE floating-point standard single precision (32-bit) NaN would be:
s111 1111 1axx xxxx xxxx xxxx xxxx xxxx where s is the sign (most often ignored in applications), a determines the type of NaN, and x
is an extra payload (most often ignored in applications). If a = 1, it is a quiet NaN; if a is zero and the payload is nonzero, then it is a
signaling NaN.

A good way to understand these values is to play around with the IEEE encoder website.


BCF2 bit representation for floating point NaN and MISSING
Value 32-bit precision
NaN 0b0111 1111 1100 0000 0000 0000 0000 0000 = 0x7FC00000
MISSING 0b0111 1111 1000 0000 0000 0000 0000 0001 = 0x7F800001
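Because a signaling NaN may be silently turned into a quiet NaN when it passes through floating point hardware, a codec can compare and write the raw bit patterns instead. The following Python sketch shows one way to do this; the function names are made up for this example, and None is used here to stand for MISSING.

```python
import math
import struct

FLOAT_MISSING_BITS = 0x7F800001  # signaling NaN reserved for MISSING
FLOAT_NAN_BITS = 0x7FC00000      # quiet NaN used for an ordinary NaN


def encode_float(value):
    """Return 4 little-endian bytes for a float, NaN, or None (MISSING)."""
    if value is None:
        return struct.pack("<I", FLOAT_MISSING_BITS)
    if math.isnan(value):
        return struct.pack("<I", FLOAT_NAN_BITS)
    return struct.pack("<f", value)


def decode_float(raw):
    """Check the bit pattern first so MISSING is never confused with NaN."""
    (bits,) = struct.unpack("<I", raw)
    if bits == FLOAT_MISSING_BITS:
        return None
    return struct.unpack("<f", raw)[0]


print(encode_float(None).hex(), encode_float(float("nan")).hex())
print(decode_float(encode_float(None)))
```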


Character

Character values are not explicitly typed in BCF2. Instead, VCF Character values should be encoded by a single character string. As with Strings, UNICODE characters are not supported.


Flag


Flag values -- which can only appear in INFO fields -- in BCF2 should be encoded by any non-MISSING value. The recommended best practice is to encode the value as a 1-element INT8 (type 0x11) with value of 1 to indicate present. Because FLAG values can only be encoded in INFO fields, BCF2 provides no mechanism to encode FLAG values in genotypes, but could be easily extended to do so if allowed in a future VCF version.


String

There are two basic encodings for strings. INFO, FORMAT, and FILTER keys are encoded by integer offsets into the header dictionary. String values, such as those found in the ID, REF, ALT, INFO, and FORMAT fields, are encoded as a typed array of ASCII encoded bytes. The array isn't terminated by a null byte. The length of the string is given by the size in the type descriptor.

Suppose you want to encode the string ACAC. First, we need the type descriptor byte, which is the string type 0x07 or'd with inline size (4) yielding the type byte of 0x40 | 0x07 = 0x47. Immediately following the type byte is the four byte ASCII encoding of "ACAC" 0x41 0x43 0x41 0x43. So the final encoding is:

0x47 // String type with inline size of 4
0x41 0x43 0x41 0x43 // ACAC in ASCII

Suppose you want to encode the string MarkDePristoWorksAtTheBroad, a string of size 27. First, we need the type descriptor byte, which is the string type 0x07. Because the size does not fit in the inline size (27 >= 15) we set the size to overflow, yielding the type byte of 0xF0 | 0x07 = 0xF7. Immediately following the type byte is the typed size of 27, which we encode by the atomic INT8 value: 0x11 followed by the actual size 0x1B. Finally come the actual bytes of the string: 0x4D 0x61 0x72 0x6B 0x44 0x65 0x50 0x72 0x69 0x73 0x74 0x6F 0x57 0x6F 0x72 0x6B 0x73 0x41 0x74 0x54 0x68 0x65 0x42 0x72 0x6F 0x61 0x64. So the final encoding is:

0xF7 // string with overflow size
0x11 0x1B // overflow size encoded as INT8 with value 27
0x4D 0x61 0x72 0x6B 0x44 0x65 0x50 0x72 0x69 0x73 0x74 0x6F 0x57 0x6F 0x72 0x6B 0x73 0x41 0x74 0x54 0x68 0x65 0x42 0x72 0x6F 0x61 0x64 // message in ASCII

Suppose you want to encode the missing value '.'. This is simply a string of size 0 = 0x07.

In VCF there are sometimes fields of type list of strings, such as a number field of unbounded size encoding the amino acid changes due to a mutation.  Since BCF2 doesn't directly support vectors of strings (a vector of characters is already a string) we collapse the list of strings into a single comma-separated string, encode it as a regular BCF2 vector of characters, and on reading explode it back into the list of strings.  This works because strings in VCF cannot contain "," (it's a field separator) and so we can safely use "," to separate the individual strings.  For efficiency reasons we put a comma at the start of the collapsed string, so that just the first character can be examined to determine if the string is collapsed.

To be concrete, suppose we have an INFO field X=[A,B,C,D].  This is encoded in BCF2 as a single string ",A,B,C,D" of size 8, so it would have type byte 0x87 followed by the ASCII encoding 0x2C 0x41 0x2C 0x42 0x2C 0x43 0x2C 0x44.
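The string cases above can be reproduced with a small Python sketch. All function names are made up for this example, and the overflow size is written in the smallest integer width that holds it, which is an assumption of this sketch.

```python
def encode_size(count):
    """Typed atomic integer holding an overflow size (smallest width first)."""
    if count <= 127:
        return bytes([0x11, count])
    if count <= 32767:
        return bytes([0x12]) + count.to_bytes(2, "little")
    return bytes([0x13]) + count.to_bytes(4, "little")


def encode_string(text):
    """Type byte (plus overflow size if needed) followed by the ASCII bytes."""
    data = text.encode("ascii")
    if len(data) < 15:
        return bytes([(len(data) << 4) | 0x07]) + data
    return bytes([0xF7]) + encode_size(len(data)) + data


def collapse_strings(values):
    """Join a list of strings with a leading comma marking the collapse."""
    return "," + ",".join(values)


def explode_string(text):
    """Inverse of collapse_strings; plain strings come back as one item."""
    return text[1:].split(",") if text.startswith(",") else [text]


print(encode_string("ACAC").hex())
print(encode_string(collapse_strings(["A", "B", "C", "D"])).hex())
```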


Vectors

The BCF2 type byte may indicate that the upcoming data stream contains not a single value but a fixed length vector of values. The vector values occur in order (1st, 2nd, 3rd, etc) encoded as expected for the type declared in the vector's type byte. For example, a vector of 3 16-bit integers would be laid out as first the vector type byte, followed immediately by 3 2-byte values, one for each integer, for a total of 7 bytes.

Missing values in vectors are handled slightly differently from atomic values. There are two possibilities for missing values:

  1. One (or more) of the values in the vector may be missing, but others in the vector are not. Here each value should be represented in the vector, and each corresponding BCF2 vector value either set to its present value or the type equivalent MISSING value.
  2. Alternatively the entire vector of values may be missing. In this case the correct encoding is as a type byte with size 0 and the appropriate type MISSING.

Suppose we are encoding the record "AC=[1,2,3]" from the INFO field. The AC key is encoded in the standard way. This would be immediately followed by a typed 8-bit integer vector of size 3, which is encoded by the type descriptor 0x31. The type descriptor is immediately followed by the three 8-bit integer values: 0x01 0x02 0x03, for a grand total of 4 bytes: 0x31010203.

Suppose we are at a site with many alternative alleles so AC=[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]. Since there are 16 values, we have to use the long vector encoding. The type of this field is 8 bit integer with the size set to 15 to indicate that the size is the next stream value, so this has type of 0xF1. The next value in the stream is the size, as a typed 8-bit atomic integer: 0x11 with value 16 0x10. Each integer AC value is represented by its value as an 8 bit integer. The grand total representation here is:

0xF1 0x11 0x10 // 8 bit integer vector with overflow size
0x01 0x02 0x03 0x04 0x05 0x06 0x07 0x08 0x09 0x0A 0x0B 0x0C 0x0D 0x0E 0x0F 0x10 // 1-16 as hexadecimal 8 bit integers

Suppose this INFO field contains the "AC=.", indicating that the AC field is missing from a record with two alt alleles. The correct representation is as the typed pair of AC followed by a MISSING vector of type 8-bit integer: 0x01.
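The three AC cases can be checked with a short Python sketch. The function name encode_int8_vector is made up for this example; it only handles values that fit in 8 bits and uses None for a missing element or a missing vector.

```python
INT8_MISSING = 0x80


def encode_int8_vector(values):
    """Encode a list of small integers (or None for MISSING) as an INT8 vector."""
    if values is None:
        return bytes([0x01])  # size 0, type INT8: the whole vector is MISSING
    if any(v is not None and not -127 <= v <= 127 for v in values):
        raise ValueError("this sketch only handles values that fit in INT8")
    body = bytes(INT8_MISSING if v is None else v & 0xFF for v in values)
    if len(values) < 15:
        return bytes([(len(values) << 4) | 0x01]) + body
    if len(values) > 127:
        raise ValueError("this sketch only writes INT8 overflow sizes")
    # Overflow: size nibble 15, then the true size as a typed INT8.
    return bytes([0xF1, 0x11, len(values)]) + body


print(encode_int8_vector([1, 2, 3]).hex())
print(encode_int8_vector(list(range(1, 17))).hex())
print(encode_int8_vector(None).hex())
```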


Vectors of mixed length

In some cases genotype fields may be vectors whose length differs among samples. For example, some CNV call sets encode different numbers of genotype likelihoods for each sample, given the large number of potential copy number states, rather than padding all samples to have the same number of fields. For example, one sample could have CN0:0,CN1:10 and another CN0:0,CN1:10,CN2:10. When a genotype field contains vector values of different lengths, these are represented in BCF2 by a vector of the maximum length per sample, with all values in each vector aligned to the left, and MISSING values assigned to all values not present in the original vector. The BCF2 encoder / decoder must automatically add and remove these MISSING values from the vectors.

For example, suppose I have two samples, each with a FORMAT field X. Sample A has values [1], while sample B has [2,3]. In BCF2 this would be encoded as [1, MISSING] and [2, 3]. Diving into the complete details, suppose X is at offset 3 in the dictionary, which is encoded by the typed INT8 descriptor 0x11 followed by the value 0x03. Next we have the type of each format field value, which here is a 2 element INT8 vector: 0x21. Next we have the encoding for each sample, A = 0x01 0x80 followed by B = 0x02 0x03. All together we have:

0x11 0x03 // X dictionary offset
0x21 // each value is a 2 element INT8 value
0x01 0x80 // A is [1, MISSING]
0x02 0x03 // B is [2, 3]

Note that this means that it's illegal to encode a vector VCF field with missing values; the BCF2 codec should signal an error in this case.
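The padding step can be sketched in Python as follows. The function names are made up for this example, and the sketch assumes INT8 values, a dictionary offset below 128, and fewer than 15 values per sample.

```python
INT8_MISSING = 0x80


def encode_format_int8(key_offset, per_sample):
    """Encode one FORMAT field whose per-sample vectors differ in length."""
    width = max(len(values) for values in per_sample)
    out = bytearray([0x11, key_offset])   # fmt_key as a typed INT8
    out.append((width << 4) | 0x01)       # fmt_type: width-element INT8 vector
    for values in per_sample:
        # Left-align the real values and fill the rest with MISSING.
        padded = list(values) + [None] * (width - len(values))
        out.extend(INT8_MISSING if v is None else v & 0xFF for v in padded)
    return bytes(out)


def strip_padding(raw_values):
    """Decoder side: drop the trailing MISSING bytes added as padding."""
    values = list(raw_values)
    while values and values[-1] == INT8_MISSING:
        values.pop()
    return values


print(encode_format_int8(3, [[1], [2, 3]]).hex())
print(strip_padding(b"\x01\x80"))
```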


Genotype (GT) field

A genotype is encoded in a typed integer vector (can be 8, 16, or even 32 bit if necessary) with the number of elements equal to the maximum ploidy among all samples at a site. For one individual, each integer in the vector is organized as ‘(allele+1) << 1 | phased’ where allele is set to -1 if the allele in GT is a dot ‘.’ (thus the higher bits are all 0). The vector is padded with missing values if the GT has a lower ploidy than the maximum.

  • Example: 0/1 in standard format: (0 + 1) << 1 | 0 = 0x02 followed by (1 + 1) << 1 | 0 = 0x04
  • Example: 0/1 and 1/1 and 0/0 for three samples: the first sample is 0x0204, the second 0x0404 and the third is simply 0x0202. So we'd expect to see on disk an array of 6 bytes 0x020404040202
  • Example: 0|1 is just like 0/1 but with the phasing bit set for the second allele: (1 + 1) << 1 | 1 = 0x05 preceded by the standard first byte value 0x02. So here we have 0x0205.
  • Example: ./. where both alleles are missing: 0x00 0x00.
  • Example: 0 as a haploid, and so is represented by a single byte 0x02
  • Example: 1 as a haploid, and so is represented by a single byte 0x04
  • Example: 0/1/2 is triploid, with alleles 0x02 0x04 0x06
  • Example: 0/1|2 is triploid with a single phased allele, encoded as 0x02 0x04 0x07
  • Example: on chromosome X we have a male (1) and a female (0/1). Here we pad out the final allele for the male, so 0x04 (first allele) 0x80 (absent) followed by the female 0x02 0x04.
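The rule ‘(allele+1) << 1 | phased’ can be sketched in Python as follows. The function name encode_gt is made up for this example; it assumes that the phasing bit of an allele is taken from the separator in front of it and that padding uses the INT8 MISSING value.

```python
import re

INT8_MISSING = 0x80


def encode_gt(gt, ploidy=None):
    """Encode one GT string such as '0/1' or '0|1' as a list of byte values."""
    alleles = re.split(r"[/|]", gt)
    separators = re.findall(r"[/|]", gt)
    codes = []
    for position, allele in enumerate(alleles):
        index = -1 if allele == "." else int(allele)
        # The first allele has no preceding separator, so it is never phased.
        phased = 1 if position > 0 and separators[position - 1] == "|" else 0
        codes.append(((index + 1) << 1) | phased)
    if ploidy is not None:
        codes += [INT8_MISSING] * (ploidy - len(codes))
    return codes


print(encode_gt("0/1"), encode_gt("./."), encode_gt("0/1|2"))
print(encode_gt("1", ploidy=2))  # haploid call padded to diploid width
```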

Misc. notes

A type byte value of 0x00 is an allowed special case meaning MISSING but without an explicit type provided. 

 

Encoding a VCF record example

Let's encode a realistic (but made-up) VCF record. This is an A/C SNP in HM3 (not really) called in 3 samples. In this section we'll build up the BCF2 encoding for this record.

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
chr1 101 rs123 A C 30.1 PASS HM3;AC=3;AN=6;AA=C GT:GQ:DP:AD:PL 0/0:10:32:32,0:0,10,100 0/1:10:48:32,16:10,0,100 1/1:10:64:0,64:100,10,0


Encoding CHROM and POS

First, let's assume that chr1 is the second chromosome to appear in the contig list -- right after chrM (MT). So its offset is 1. The VCF POS is 101, but BCF2 stores POS as a 0-based coordinate, so the stored value is 100 = 0x64. CHROM, POS, and rlen are not typed values: each is written as a fixed-width 32-bit little-endian integer with no type byte. So in total these are represented as:

0x01000000 // CHROM offset is at 1 in 32 bit little endian
0x64000000 // POS in 0 base 32 bit little endian
0x01000000 // rlen = 1 (it's just a SNP)


Encoding QUAL

The QUAL field value is 30.1, which we encode as an untyped single precision 32-bit float:

0x41 0xF0 0xCC 0xCD // QUAL = 30.1 as 32-bit float


Encoding ID

For ID the type byte is that of a 5-element string (type descriptor 0x57), which is then followed by the five bytes for the string rs123: 0x72 0x73 0x31 0x32 0x33. The full encoding is:

0x57 0x72 0x73 0x31 0x32 0x33 // ID


Encoding REF / ALT fields

We encode each of REF and ALT as typed strings, first REF followed immediately by ALT. Each is a 1 element string (0x17), which is then followed by the single byte for the base, 0x41 (A) and 0x43 (C) respectively:

0x17 0x41 // REF A
0x17 0x43 // ALT C

Just for discussion, suppose instead that ALT was ALT=C,T. The only thing that would change in this part of the record is that there would be another typed string following immediately after C, encoded as 0x17 (1 element string) with the value of 0x54.


Encoding FILTER

"PASS" is implicitly encoded as the first entry in the header dictionary (see dictionary of strings). Here we encode the PASS FILTER field as a vector of size 1 of type 8-bit, which has type byte 0x11. The value is the offset 0:

0x11 0x00 // FILTER field PASS


Encoding the INFO fields

HM3;AC=3;AN=6;AA=C
Let's assume that the header dictionary elements for HM3, AC, AN, and AA are at 80, 81, 82, and 83 respectively. All of these can be encoded by 1-element INT8 values (0x11), with associated hex values of 0x50, 0x51, 0x52, and 0x53 respectively.

First is HM3. The entry begins with the key: 0x11 0x50. Next we have a Flag value to indicate the field is present, represented as a 1 element INT8 value of 1. Altogether we have:

0x11 0x50 0x11 0x01 // HM3 flag is present

Now let's encode the two atomic 8-bit integer fields AC and AN:

0x11 0x51 // AC key
0x11 0x03 // with value of 3
0x11 0x52 // AN key
0x11 0x06 // with value of 6

The ancestral allele (AA) tells us that among other primates the original allele is C, a Character here. Because we represent Characters as single element strings in BCF2, the value is encoded as 0x17 followed by 0x43 (C). The AA key is at offset 83 = 0x53. So the entire key/value pair is:

0x11 0x53 // AA key
0x17 0x43 // with value of C


Encoding Genotypes

Continuing with our example:

FORMAT NA00001 NA00002 NA00003

GT:GQ:DP:AD:PL 0/0:10:32:32,0:0,10,100 0/1:10:48:32,16:10,0,100 1/1:10:64:0,64:100,10,0

Here we have the specially encoded GT field. We have two integer fields GQ and DP. We have the AD field, which is a vector of 2 values per sample. And finally we have the PL field which is 3 values per sample. Let's say that the FORMAT keys for GT, GQ, DP, AD, and PL are at offsets 1, 2, 3, 4, and 5, respectively.
Now let's encode each of the genotype fields in order of the VCF record (GT, GQ, DP, AD, and then PL):

  1. GT triplet begins with the key: 0x1101. Next is the type of the field, which will be a 2-element (diploid) INT8 type: 0x21. This is followed by 3 2-byte arrays of values 0x0202 0x0204 0x0404 (see genotype encoding example for details). The final encoding is 0x1101 0x21 0x020202040404
  2. GQ triplet begins with the key 0x1102. Because these values are small, we encode them as 8 bit atomic integers with type code 0x11. As each value is the same (10 = 0x0A) the GQ field is encoded as 0x1102 0x11 0x0A0A0A
  3. DP is almost identical to GQ. First is the 0x1103 key, followed by 3 8-bit atomic integers encoded as 0x11 (the type) 0x20 (DP=32), 0x30 (DP=48) and 0x40 (DP=64). So we have: 0x1103 0x11 0x203040
  4. AD is more complex. The key is simple, just like the others, with 0x1104. Because the AD field is a vector of 2 values for each genotype, the value of the key/value pair is a vector type. Because the integer values in each AD field of each sample are small they are encoded by 8 bit values. So the value type is 0x21. For sample one there are two values: 32,0 which are 0x20 and 0x00. Samples two and three are 0x20 0x10 (32,16) and 0x00 0x40 (0,64) respectively. So ultimately this field is encoded as 0x1104 0x21 0x200020100040
  5. PL is just like AD but with three values per sample. The key is 0x1105. Because the PL field is a vector of 3 values for each genotype, the value of the key/value pair is a vector type, and because the size is 3 it's encoded in the size field of the type. Again, because the integer values in each PL field of each sample are small they are encoded by 8 bit values. So the value type is 0x31. For sample one there are three values: 0, 10, and 100 which are 0x00, 0x0A, and 0x64. Samples two and three have the same values but in a slightly different order. So ultimately the PL field is encoded as 0x1105 0x31 0x000A64 0x0A0064 0x640A00

So the genotype block contains:

0x1101 0x21 0x020202040404 // GT
0x1102 0x11 0x0A0A0A // GQ
0x1103 0x11 0x203040 // DP
0x1104 0x21 0x200020100040 // AD
0x1105 0x31 0x000A640A0064640A00 // PL
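As a cross-check, the genotype block can be assembled from the decimal values of the example record with a few lines of Python. The helper name format_triplet is made up for this example, and it assumes INT8 values and keys throughout, as in the walkthrough above.

```python
def format_triplet(key_offset, width, per_sample):
    """fmt_key (typed INT8), fmt_type (INT8 vector of `width`), then values."""
    out = bytearray([0x11, key_offset, (width << 4) | 0x01])
    for values in per_sample:
        out.extend(values)
    return bytes(out)


block = b"".join([
    # GT values are already in (allele + 1) << 1 | phased form.
    format_triplet(1, 2, [[0x02, 0x02], [0x02, 0x04], [0x04, 0x04]]),
    format_triplet(2, 1, [[10], [10], [10]]),                          # GQ
    format_triplet(3, 1, [[32], [48], [64]]),                          # DP
    format_triplet(4, 2, [[32, 0], [32, 16], [0, 64]]),                # AD
    format_triplet(5, 3, [[0, 10, 100], [10, 0, 100], [100, 10, 0]]),  # PL
])
print(len(block))  # 42
```

The length printed at the end is the value that goes into l_indiv for this record.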


Putting it all together

We need to determine a few values before writing out the final block:

  • l_shared = 52 (Data length from CHROM to the end of INFO: 24 bytes of fixed-width fields, 12 bytes for ID, REF, ALT, and FILTER, and 16 bytes of INFO)
  • l_indiv = 42 (Data length of FORMAT and individual genotype fields)
  • n_allele_info = n_allele<<16|n_info = 2 << 16 | 4 = 0x00020004
  • n_fmt_samples = n_fmt<<24|n_sample = 5 << 24 | 3 = 0x05000003

0x34000000 // l_shared as little endian hex
0x2A000000 // l_indiv as little endian hex
0x01000000 // CHROM offset is at 1 in 32 bit little endian
0x64000000 // POS in 0 base 32 bit little endian
0x01000000 // rlen = 1 (it's just a SNP)
0x41 0xF0 0xCC 0xCD // QUAL = 30.1 as 32-bit float (shown most significant byte first)
0x00020004 // n_allele_info (shown as a 32-bit value, not in on-disk byte order)
0x05000003 // n_fmt_samples (shown as a 32-bit value, not in on-disk byte order)
0x57 0x72 0x73 0x31 0x32 0x33 // ID
0x17 0x41 // REF A
0x17 0x43 // ALT C
0x11 0x00 // FILTER field PASS
0x11 0x50 0x11 0x01 // HM3 flag is present
0x11 0x51 // AC key
0x11 0x03 // with value of 3
0x11 0x52 // AN key
0x11 0x06 // with value of 6
0x11 0x53 // AA key
0x17 0x43 // with value of C
0x1101 0x21 0x020202040404 // GT
0x1102 0x11 0x0A0A0A // GQ
0x1103 0x11 0x203040 // DP
0x1104 0x21 0x200020100040 // AD
0x1105 0x31 0x000A640A0064640A00 // PL

That's quite a lot of information encoded in only 102 bytes (8 bytes for the two length fields, 52 bytes of site data, and 42 bytes of genotype data)!


BCF2 block gzip and indexing

These raw binary records may be subsequently encoded into BGZF blocks following the BGZF compression format, section 3 of the SAM format specification. BCF2 records can be raw, though, in cases where the decoding / encoding costs of bgzipping the data make it reasonable to process the data uncompressed, such as streaming BCF2s through pipes with samtools and bcftools.  Here the files should still be written as BGZF but with compression level 0.  Note that currently the GATK generates raw BCF2 files (no BGZF compression at all) but this will change in the near future.

BCF2 files are expected to be indexed through the same index scheme (section 4) as BAM files and other block-compressed files with BGZF.