Over the last 20 years, the technology for DNA sequencing has rapidly evolved, making it possible to sequence individual whole genomes on a large scale. But when trying to find the underlying genetic causes of disease, every sequence donated matters. How can the sequences generated using older technology be reused? And can researchers analyze sequence data generated in different locations, on different platforms, using different technologies? These are becoming increasingly important issues for the research community to address. 

The Genome Center for Alzheimer’s Disease (GCAD) tackled these problems for the whole exome sequencing data in the Alzheimer’s Disease Sequencing Project (ADSP). The recent publication “Human Whole-exome Genotype Data for Alzheimer’s Disease” describes a method to harmonize and analyze 20,504 whole exome sequencing samples. The result is the largest whole exome sequencing dataset for AD, available to qualified researchers via NIAGADS DSS (NG00067).

To address the biases that may exist while conducting whole exome sequencing at different institutions using different platforms, GCAD developed a modified version of their genomic variant calling pipeline and data management tool for ADSP (VCPA), known as VCPA-WES.  

VCPA-WES allowed joint calling of 20,504 individuals sequenced at 14 different institutions using 10 WES capture kits. CRAMs were generated without any capture kit information, so differences in CRAM metrics were independent of capture kit and mostly reflected the different sequencing platforms. Importantly, this also meant that the portion of the exome examined could include all captured areas, whereas more traditional approaches are limited to the regions where capture kits overlap.
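The difference between examining all captured areas and examining only the overlap can be illustrated with a minimal sketch. It is not taken from VCPA-WES: the two kits, their coordinates, and the helper names `merge`, `intersect`, and `total_length` are all hypothetical, and the intervals are assumed to be half-open ranges on a single chromosome.

```python
from typing import List, Tuple

# Half-open [start, end) interval on one chromosome (an assumption of this sketch).
Interval = Tuple[int, int]


def merge(intervals: List[Interval]) -> List[Interval]:
    """Return the union of the intervals as a sorted, non-overlapping list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the regions covered by both interval lists."""
    a, b = merge(a), merge(b)
    shared: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            shared.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def total_length(intervals: List[Interval]) -> int:
    """Total number of bases covered by non-overlapping intervals."""
    return sum(end - start for start, end in intervals)


# Hypothetical target regions for two capture kits (made-up coordinates).
kit_a = [(100, 200), (300, 400)]
kit_b = [(150, 250), (500, 600)]

all_captured = merge(kit_a + kit_b)    # every region captured by either kit
overlap_only = intersect(kit_a, kit_b)  # only regions both kits capture

print("all captured:", all_captured, total_length(all_captured), "bases")
print("overlap only:", overlap_only, total_length(overlap_only), "bases")
```

In this toy case the overlap-only region is a small fraction of what either kit captures, which is the limitation the paragraph above describes. With 10 kits, the same idea extends by merging or intersecting across all of them.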

By carefully designing the VCPA-WES pipeline, the authors greatly reduced experimental artifacts: no systemic bias was attributed to sequencing center or platform, with the exception of Q30 score and 20X coverage. They called more than 1.4 million variants, comprising 15 billion genotypes, and achieved 99.43% concordance in variant calls with the previously published ADSP-Discovery WES.

Qualified researchers can apply for access to the largest joint-called WES dataset for AD (NG00067) through NIAGADS DSS.

Researchers can also read “Human Whole-exome Genotype Data for Alzheimer’s Disease,” available via Nature Communications, to see how VCPA-WES might be applied to their own research.