Key file formats
Information on key file formats is normally provided in README files located near the relevant data on the FTP site. Information on the major formats is also collected here. Tools developed by the 1000 Genomes project, or appropriate for working with the project's data, are listed on the tools page.
Index file formats
Sequence index
The sequence.index file is a tab-delimited file containing all the metadata you should need to download and subset the files on this FTP site by individual, library, experiment and sequencing technology.
The columns are:
1. FASTQ_FILE, path to fastq file on ftp site
2. MD5, md5sum of file
3. RUN_ID, SRA/ERA run accession
4. STUDY_ID, SRA/ERA study accession
5. STUDY_NAME, Name of study
6. CENTER_NAME, Submission centre name
7. SUBMISSION_ID, SRA/ERA submission accession
8. SUBMISSION_DATE, Date sequence submitted, YYYY-MM-DD
9. SAMPLE_ID, SRA/ERA sample accession
10. SAMPLE_NAME, Sample name
11. POPULATION, Sample population
12. EXPERIMENT_ID, Experiment accession
13. INSTRUMENT_PLATFORM, Type of sequencing machine
14. INSTRUMENT_MODEL, Model of sequencing machine
15. LIBRARY_NAME, Library name
16. RUN_NAME, Name of machine run
17. RUN_BLOCK_NAME, Name of machine run sector
18. INSERT_SIZE, Submitter specified insert size
19. LIBRARY_LAYOUT, Library layout, this can be either PAIRED or SINGLE
20. PAIRED_FASTQ, Name of mate pair file if exists (Runs with failed mates will have a library layout of PAIRED but no paired fastq file)
21. WITHDRAWN, 0/1 to indicate if the file has been withdrawn, only present if a file has been withdrawn
22. WITHDRAWN_DATE, date of withdrawal, this should only be defined if a file is withdrawn
23. COMMENT, comment about reason for withdrawal
24. READ_COUNT, read count for the file
25. BASE_COUNT, basepair count for the file
26. ANALYSIS_GROUP, the analysis group of the sequence, this reflects sequencing strategy. Currently this includes low coverage, high coverage and exon targeted to reflect the 3 strategies used by the 1000 Genomes project.
Any run_id can have up to 3 files associated with it. Single runs have one file. Paired runs can have anywhere from 1 to 3 files depending on the success of the pairing.
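The column layout above can be read with a few lines of Python. The following is a minimal sketch, not project code: the function names (read_sequence_index, select_fastqs, files_per_run) are hypothetical, and it assumes that a header line, if present, starts with FASTQ_FILE (optionally prefixed with '#') and that withdrawn files carry the value 1 in the WITHDRAWN column.

```python
import csv
from collections import Counter

# Column names in the order listed above (columns 1 to 26).
SEQUENCE_INDEX_COLUMNS = [
    "FASTQ_FILE", "MD5", "RUN_ID", "STUDY_ID", "STUDY_NAME", "CENTER_NAME",
    "SUBMISSION_ID", "SUBMISSION_DATE", "SAMPLE_ID", "SAMPLE_NAME",
    "POPULATION", "EXPERIMENT_ID", "INSTRUMENT_PLATFORM", "INSTRUMENT_MODEL",
    "LIBRARY_NAME", "RUN_NAME", "RUN_BLOCK_NAME", "INSERT_SIZE",
    "LIBRARY_LAYOUT", "PAIRED_FASTQ", "WITHDRAWN", "WITHDRAWN_DATE",
    "COMMENT", "READ_COUNT", "BASE_COUNT", "ANALYSIS_GROUP",
]


def read_sequence_index(handle):
    """Yield one dict per data row of a sequence.index file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Assumption: a header line, if any, starts with the first column name.
        if not row or row[0].lstrip("#") == "FASTQ_FILE":
            continue
        # Pad short rows so that every column name is present in the dict.
        row = row + [""] * (len(SEQUENCE_INDEX_COLUMNS) - len(row))
        yield dict(zip(SEQUENCE_INDEX_COLUMNS, row))


def select_fastqs(records, sample_name, analysis_group=None):
    """Return FASTQ paths for one sample, skipping withdrawn files."""
    paths = []
    for rec in records:
        if rec["WITHDRAWN"] == "1":
            continue
        if rec["SAMPLE_NAME"] != sample_name:
            continue
        if analysis_group is not None and rec["ANALYSIS_GROUP"] != analysis_group:
            continue
        paths.append(rec["FASTQ_FILE"])
    return paths


def files_per_run(records):
    """Count how many files are listed for each RUN_ID."""
    return dict(Counter(rec["RUN_ID"] for rec in records))
```

With a local copy of the index, list(read_sequence_index(open("sequence.index", newline=""))) gives the records, and files_per_run should report between 1 and 3 files for each run. The exact spelling of the ANALYSIS_GROUP values is not shown here, so check the file before filtering on that column.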
Alignment Index
The alignment.index file is a tab-delimited file containing all the metadata needed to download the BAM alignment files from the FTP site. The alignment file formats are described below.
The columns are:
1. bam file
2. MD5
3. bai index file
4. bai index file md5
5. bas statistics file
6. bas statistic file md5
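As an illustration of the six columns above, the sketch below reads an alignment.index file and computes an MD5 checksum that can be compared with the listed value after download. The names AlignmentEntry, read_alignment_index and md5_hexdigest are hypothetical, and the sketch assumes the file has no header line and that lines with fewer than six fields can be skipped.

```python
import csv
import hashlib
from typing import NamedTuple


class AlignmentEntry(NamedTuple):
    """One row of alignment.index, in the column order listed above."""
    bam: str
    bam_md5: str
    bai: str
    bai_md5: str
    bas: str
    bas_md5: str


def read_alignment_index(handle):
    """Yield an AlignmentEntry for each complete row of the index."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 6:
            continue  # assumption: blank or short lines carry no entry
        yield AlignmentEntry(*row[:6])


def md5_hexdigest(stream, chunk_size=1 << 20):
    """Return the MD5 hex digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()
```

After downloading a file, md5_hexdigest(open(local_path, "rb")) can be compared with the matching md5 field of its entry; a mismatch suggests an incomplete or corrupted download.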
All other meta information associated with a particular file can be found in the filename, e.g. HG00098.chrom1.ILLUMINA.bwa.GBR.low_coverage.20100804.bam
The fields represent:
Sample Name
Chromosome name, nonchrom represents the reads mapped to non chromosomal parts of the reference genome and unmapped represents read pairs which did not map to the reference genome
Instrument Platform, the sequencing platform the sequence comes from.
Algorithm, the algorithm used to map the data
Population, the population the sample comes from
Analysis group, the sequencing strategy used to generate the sequence data
Sequence index date, the date from the sequence.index used to supply the input sequence data for the alignment
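The filename fields listed above can be pulled apart by splitting on the dots. This is a minimal sketch under the assumption that no individual field contains a dot and that the name ends in .bam; BamName and parse_bam_name are hypothetical names introduced for illustration.

```python
import posixpath
from typing import NamedTuple


class BamName(NamedTuple):
    """Fields encoded in a BAM filename, in the order described above."""
    sample: str
    chromosome: str
    platform: str
    algorithm: str
    population: str
    analysis_group: str
    index_date: str


def parse_bam_name(path):
    """Split a BAM path or filename into its seven metadata fields."""
    name = posixpath.basename(path)
    parts = name.split(".")
    # Seven metadata fields followed by the "bam" extension.
    if len(parts) != 8 or parts[-1] != "bam":
        raise ValueError("unexpected BAM filename layout: " + name)
    return BamName(*parts[:7])
```

For the example filename, parse_bam_name returns sample HG00098, chromosome chrom1, platform ILLUMINA, algorithm bwa, population GBR, analysis group low_coverage and sequence index date 20100804.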
Each BAM file is paired with an index file, which has the same name but the extension .bai, and a statistics file, which has the extension .bas. More information about the files can be found in README.alignment_data.
The sequence files all use the reference genome which can be found under:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/
Bas File
Data file formats
Alignment files (BAM)
BAM files are binary representations of the Sequence Alignment/Map format. These files and the associated SAMtools package are described in a recent Bioinformatics publication.
Additional information about SAM/BAM is available at the SAMtools development site: http://samtools.sourceforge.net/.
Additional tools used by the 1000 Genomes project that use BAM files are described on the tools page.
Variant Call Format (VCF)
The VCF format is a tab-delimited format for storing variant calls and individual genotypes. It is able to store all variant calls from single nucleotide variants to large scale insertions and deletions.
The format is still under development and the current format definition is available on a publicly visible page of the 1000 Genomes Wiki: VCF 4.0
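To give a feel for the tab-delimited layout, here is a minimal reading sketch. It assumes the VCF 4.0 layout of eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) followed by an optional FORMAT column and one column per sample, with '##' meta lines and a single '#CHROM' header line. Because the format is still changing, check these assumptions against the current definition. The function name read_vcf is hypothetical, and the sketch does not decode the INFO field or validate values.

```python
# Fixed columns of a VCF data line (assumed from the VCF 4.0 definition).
VCF_FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def read_vcf(handle):
    """Yield one dict per variant line, with per-sample genotype fields."""
    samples = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # blank line or meta-information line
        if line.startswith("#"):
            # Header line: sample names follow the FORMAT column.
            samples = line[1:].split("\t")[9:]
            continue
        fields = line.split("\t")
        record = dict(zip(VCF_FIXED_COLUMNS, fields[:8]))
        record["POS"] = int(record["POS"])
        genotypes = {}
        if len(fields) > 9:
            # FORMAT names the colon-separated keys used in each sample column.
            keys = fields[8].split(":")
            for sample, value in zip(samples, fields[9:]):
                genotypes[sample] = dict(zip(keys, value.split(":")))
        record["GENOTYPES"] = genotypes
        yield record
```

For real analyses a maintained VCF library is likely a better choice than hand parsing; the sketch is only meant to show how variant calls and individual genotypes sit side by side on one line.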