What format are your variant files in?

Our main file formats are fastq for sequence data, bam for alignment data and vcf for variant calls. More information about these formats is available on the formats page.

Our filenames should also provide information about the file in question.

  • fastq files, E|SRR000000.filt.fastq.gz. The name gives the SRA run id associated with the sequencing run. Files with _1 and _2 in their names are associated with paired-end sequencing runs. If there is also a file with no number in its name, it contains the fragments where the other end failed QC. The .filt in the name indicates that the data in the file has been filtered after retrieval from the archive. This filtering process is described in a README.
  • bam files, NA00000.location.platform.population.analysis_group.YYYYMMDD.bam. The bam files and their respective bai and bas files are named after the individual the sequence data comes from. The name also includes: the location of the mapping, which is either a chromosome name or mapped/unmapped, meaning that the data is mapped to the whole genome or is not mapped to any location; the platform the sequence was generated on; the population the individual belongs to; the analysis group, which describes the sequencing strategy (low_coverage for low coverage wgs, exome for whole exome pull down); and a date, which should match the date of the sequence.index that was used to provide the input sequence for the alignment. Each bam file is accompanied by a bas file, which contains statistics about each alignment, and a bai file, which acts as an index for tools reading the bam file. The Alignment README covers the format of the bas files in more detail.
  • vcf files, ALL.[chrN|wgs|wex].2of4intersection.20100804.[snps|indels|sv].genotypes.[analysis_group].vcf.gz. The name starts with the population that the variants were discovered in; ALL means that all the individuals available at that date were used. It then gives a description of how the call set was produced and a date, which again matches the input sequence index. The name also states either sites or genotypes: a sites file contains just the first 8 columns of the vcf format, while a genotypes file also contains individual genotype data. The filename can also contain some optional fields. The first, which follows the population, shows whether the file contains some part of the genome or the whole genome; wgs or wex files will generally contain sites on at least all the autosomes, with wex limited to the exome. After the date, the file may specify what type of variant it contains; without this field it should be assumed to contain just snps. The final optional tag describes which analysis group the data is based on, either low coverage or exome. Without this tag the data is most likely based on low coverage data alone, but it might represent an integrated data set that includes data from both the low coverage and the exome sequencing efforts.
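
The fastq and bam naming conventions above can be split into their fields with a regular expression. The following Python sketch is illustrative only and is not part of our pipeline: the function names, the patterns and the example filenames are assumptions made for this example, and the patterns assume that no field contains a dot.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern for NA00000.location.platform.population.analysis_group.YYYYMMDD.bam
# Assumption: fields are dot-separated and never contain a dot themselves.
BAM_NAME = re.compile(
    r"^(?P<individual>[^.]+)\."
    r"(?P<location>[^.]+)\."
    r"(?P<platform>[^.]+)\."
    r"(?P<population>[^.]+)\."
    r"(?P<analysis_group>[^.]+)\."
    r"(?P<date>\d{8})\.bam$"
)

# Hypothetical pattern for ERR/SRR run ids with an optional _1 or _2 suffix.
FASTQ_NAME = re.compile(
    r"^(?P<run_id>[ES]RR\d+)(?:_(?P<mate>[12]))?\.filt\.fastq\.gz$"
)


def parse_bam_name(name: str) -> Optional[Dict[str, str]]:
    """Return the fields of a bam filename, or None if it does not match."""
    match = BAM_NAME.match(name)
    return match.groupdict() if match else None


def parse_fastq_name(name: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the run id and mate number of a filtered fastq filename.

    The mate is "1" or "2" for paired-end files, and None for the file
    holding fragments whose other end failed QC.
    """
    match = FASTQ_NAME.match(name)
    return match.groupdict() if match else None


if __name__ == "__main__":
    # Made-up filenames that follow the conventions described above.
    print(parse_bam_name("NA00001.mapped.ILLUMINA.CEU.low_coverage.20100804.bam"))
    print(parse_fastq_name("SRR000001_1.filt.fastq.gz"))
    print(parse_fastq_name("ERR000001.filt.fastq.gz"))
```

Both functions return None for names that do not follow the convention, so a caller can skip files it does not recognise rather than guess at their contents.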

We also provide metadata and statistics about our alignment and sequence data in tab-delimited index files and bam statistics (bas) files, which are also described on the formats page.