Can I map your variant coordinates between different genome assemblies?

Answer:

We have data presented on GRCh38, GRCh37 and NCBI36, please check the data portal to see what assembly the data is on. If you need variant calls to be in a particular assembly it is best to go to dbSNP, Ensembl or an equivalent archive using their rs numbers as this will provide a definitive mapping.

If an rs number or equivalent is not available there are tools available to map between NCBI36, GRCh37 and GRCh38 from both Ensembl and the NCBI

Related questions:

Do you have assembled FASTA sequences for samples?

Answer:

Recent projects, such as the second phase of the Human Genome Structural Variation Consortium (HGSVC) have produced assemblies. These are linked to from the page for that data collection. These are haplotype resolved assemblies. More details can be found in the accompanying publication.

The 1000 Genomes Project did not create any assemblies from the genome sequence data it generated.

The Gerstein Lab at Yale University created a diploid version of the NA12878 sequence, which is available from the Gerstein website under NA12878_diploid. When used, groups should cite AlleleSeq: analysis of allele-specific expression and binding in a network framework, Rozowsky et al., Molecular Systems Biology 7:522.

You can create a FASTA file incorporating the variants from an individual with a VCFtools Perl script called vcf-consensus.

An example set of command lines would be:

#Extract the region and individual of interest from the VCF file you want to produce the consensus from
tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.chr17.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz 17:1471000-1472000 | perl vcf-subset -c HG00098 | bgzip -c > HG00098.vcf.gz

#Index the new VCF file so it can be used by vcf-consensus
tabix -p vcf HG00098.vcf.gz

#Run vcf-consensus and use --sample to apply sample-specific variants. If not given, all the variants are applied
cat ref.fa | vcf-consensus HG00098.vcf.gz --sample HG00098 > HG00098.fa

You can get more support for VCFtools on their help mailing list.

Related questions:

Which reference assembly do you use?

Answer:

The reference assembly the 1000 Genomes Project has mapped sequence data to has changed over the course of the project.

For the pilot phase we mapped data to NCBI36. A copy of our reference fasta file can be found on the ftp site.

For the phase 1 and phase 3 analysis we mapped to GRCh37. Our fasta file which can be found on our ftp site called human_g1k_v37.fasta.gz, it contains the autosomes, X, Y and MT but no haplotype sequence or EBV.

Our most recent alignment release was mapped to GRCh38, this also contained decoy sequence, alternative haplotypes and EBV. It was mapped using an alt aware version of BWA-mem. The fasta files can be found on our ftp site.

Related questions:

Why are the coordinates of some variants different to what is displayed in other databases?

Answer:

Data from the 1000 Genomes Project has now been analysed multiple times and on different reference assemblies. In addition, further data sets have been generated and analysed. These studies, therefore, use different data, different reference assemblies and different methodologies, which can lead to different variant calls being made.

The studies that IGSR makes data available for are listed with their accompanying publications in our data portal. The publications can provide further details.

Related questions: