Enabling efficient analysis of biobank-scale data with genotype representation graphs

Abstract

Computational analysis of a large number of genomes requires a data structure that can represent the dataset compactly while also enabling efficient operations on variants and samples. However, encoding genetic data in existing tabular data structures and file formats has become costly and unsustainable. Here we introduce the genotype representation graph (GRG), a fully connected hierarchical data structure that losslessly encodes phased whole-genome polymorphisms. Exploiting variant-sharing across samples enables GRG to compress 200,000 UK Biobank phased human genomes to 5–26 gigabytes per chromosome, also enabling graph-traversal algorithms to reuse computed values in random access memory. Constructing and processing GRG files scales to a million whole genomes. Using allele frequencies and association effects as examples, we show that computation on GRG via graph traversal runs the fastest among all tested alternatives. GRG-based algorithms have the potential to increase the scalability and reduce the cost of analyzing large genomic datasets.

This is a preview of subscription content, access via your institution

Access options

/* style specs start */

/* style specs end */

Buy this article

Purchase on SpringerLink
Instant access to full article PDF

Buy now

Prices may be subject to local taxes which are calculated during checkout

/* style specs start */
style {
display: none !important;
}
.LiveAreaSection * {
align-content: stretch;
align-items: stretch;
align-self: auto;
animation-delay: 0s;
animation-direction: normal;
animation-duration: 0s;
animation-fill-mode: none;
animation-iteration-count: 1;
animation-name: none;
animation-play-state: running;
animation-timing-function: ease;
azimuth: center;
backface-visibility: visible;
background-attachment: scroll;
background-blend-mode: normal;
background-clip: borderBox;
background-color: transparent;
background-image: none;
background-origin: paddingBox;
background-position: 0 0;
background-repeat: repeat;
background-size: auto auto;
block-size: auto;
border-block-end-color: currentcolor;
border-block-end-style: none;
border-block-end-width: medium;
border-block-start-color: currentcolor;
border-block-start-style: none;
border-block-start-width: medium;
border-bottom-color: currentcolor;
border-bottom-left-radius: 0;
border-bottom-right-radius: 0;
border-bottom-style: none;
border-bottom-width: medium;
border-collapse: separate;
border-image-outset: 0s;
border-image-repeat: stretch;
border-image-slice: 100%;
border-image-source: none;
border-image-width: 1;
border-inline-end-color: currentcolor;
border-inline-end-style: none;
border-inline-end-width: medium;
border-inline-start-color: currentcolor;
border-inline-start-style: none;
border-inline-start-width: medium;
border-left-color: currentcolor;
border-left-style: none;
border-left-width: medium;
border-right-color: currentcolor;
border-right-style: none;
border-right-width: medium;
border-spacing: 0;
border-top-color: currentcolor;
border-top-left-radius: 0;
border-top-right-radius: 0;
border-top-style: none;
border-top-width: medium;
bottom: auto;
box-decoration-break: slice;
box-shadow: none;
box-sizing: border-box;
break-after: auto;
break-before: auto;
break-inside: auto;
caption-side: top;
caret-color: auto;
clear: none;
clip: auto;
clip-path: none;
color: initial;
column-count: auto;
column-fill: balance;
column-gap: normal;
column-rule-color: currentcolor;
column-rule-style: none;
column-rule-width: medium;
column-span: none;
column-width: auto;
content: normal;
counter-increment: none;
counter-reset: none;
cursor: auto;
display: inline;
empty-cells: show;
filter: none;
flex-basis: auto;
flex-direction: row;
flex-grow: 0;
flex-shrink: 1;
flex-wrap: nowrap;
float: none;
font-family: initial;
font-feature-settings: normal;
font-kerning: auto;
font-language-override: normal;
font-size: medium;
font-size-adjust: none;
font-stretch: normal;
font-style: normal;
font-synthesis: weight style;
font-variant: normal;
font-variant-alternates: normal;
font-variant-caps: normal;
font-variant-east-asian: normal;
font-variant-ligatures: normal;
font-variant-numeric: normal;
font-variant-position: normal;
font-weight: 400;
grid-auto-columns: auto;
grid-auto-flow: row;
grid-auto-rows: auto;
grid-column-end: auto;
grid-column-gap: 0;
grid-column-start: auto;
grid-row-end: auto;
grid-row-gap: 0;
grid-row-start: auto;
grid-template-areas: none;
grid-template-columns: none;
grid-template-rows: none;
height: auto;
hyphens: manual;
image-orientation: 0deg;
image-rendering: auto;
image-resolution: 1dppx;
ime-mode: auto;
inline-size: auto;
isolation: auto;
justify-content: flexStart;
left: auto;
letter-spacing: normal;
line-break: auto;
line-height: normal;
list-style-image: none;
list-style-position: outside;
list-style-type: disc;
margin-block-end: 0;
margin-block-start: 0;
margin-bottom: 0;
margin-inline-end: 0;
margin-inline-start: 0;
margin-left: 0;
margin-right: 0;
margin-top: 0;
mask-clip: borderBox;
mask-composite: add;
mask-image: none;
mask-mode: matchSource;
mask-origin: borderBox;
mask-position: 0 0;
mask-repeat: repeat;
mask-size: auto;
mask-type: luminance;
max-height: none;
max-width: none;
min-block-size: 0;
min-height: 0;
min-inline-size: 0;
min-width: 0;
mix-blend-mode: normal;
object-fit: fill;
object-position: 50% 50%;
offset-block-end: auto;
offset-block-start: auto;
offset-inline-end: auto;
offset-inline-start: auto;
opacity: 1;
order: 0;
orphans: 2;
outline-color: initial;
outline-offset: 0;
outline-style: none;
outline-width: medium;
overflow: visible;
overflow-wrap: normal;
overflow-x: visible;
overflow-y: visible;
padding-block-end: 0;
padding-block-start: 0;
padding-bottom: 0;
padding-inline-end: 0;
padding-inline-start: 0;
padding-left: 0;
padding-right: 0;
padding-top: 0;
page-break-after: auto;
page-break-before: auto;
page-break-inside: auto;
perspective: none;
perspective-origin: 50% 50%;
pointer-events: auto;
position: static;
quotes: initial;
resize: none;
right: auto;
ruby-align: spaceAround;
ruby-merge: separate;
ruby-position: over;
scroll-behavior: auto;
scroll-snap-coordinate: none;
scroll-snap-destination: 0 0;
scroll-snap-points-x: none;
scroll-snap-points-y: none;
scroll-snap-type: none;
shape-image-threshold: 0;
shape-margin: 0;
shape-outside: none;
tab-size: 8;
table-layout: auto;
text-align: initial;
text-align-last: auto;
text-combine-upright: none;
text-decoration-color: currentcolor;
text-decoration-line: none;
text-decoration-style: solid;
text-emphasis-color: currentcolor;
text-emphasis-position: over right;
text-emphasis-style: none;
text-indent: 0;
text-justify: auto;
text-orientation: mixed;
text-overflow: clip;
text-rendering: auto;
text-shadow: none;
text-transform: none;
text-underline-position: auto;
top: auto;
touch-action: auto;
transform: none;
transform-box: borderBox;
transform-origin: 50% 50%0;
transform-style: flat;
transition-delay: 0s;
transition-duration: 0s;
transition-property: all;
transition-timing-function: ease;
vertical-align: baseline;
visibility: visible;
white-space: normal;
widows: 2;
width: auto;
will-change: auto;
word-break: normal;
word-spacing: normal;
word-wrap: normal;
writing-mode: horizontalTb;
z-index: auto;
-webkit-appearance: none;
-moz-appearance: none;
-ms-appearance: none;
appearance: none;
margin: 0;
}
.LiveAreaSection {
width: 100%;
}
.LiveAreaSection .login-option-buybox {
display: block;
width: 100%;
font-size: 17px;
line-height: 30px;
color: #222;
padding-top: 30px;
font-family: Harding, Palatino, serif;
}
.LiveAreaSection .additional-access-options {
display: block;
font-weight: 700;
font-size: 17px;
line-height: 30px;
color: #222;
font-family: Harding, Palatino, serif;
}
.LiveAreaSection .additional-login > li:not(:first-child)::before {
transform: translateY(-50%);
content: “”;
height: 1rem;
position: absolute;
top: 50%;
left: 0;
border-left: 2px solid #999;
}
.LiveAreaSection .additional-login > li:not(:first-child) {
padding-left: 10px;
}
.LiveAreaSection .additional-login > li {
display: inline-block;
position: relative;
vertical-align: middle;
padding-right: 10px;
}
.BuyBoxSection {
display: flex;
flex-wrap: wrap;
flex: 1;
flex-direction: row-reverse;
margin: -30px -15px 0;
}
.BuyBoxSection .box-inner {
width: 100%;
height: 100%;
padding: 30px 5px;
display: flex;
flex-direction: column;
justify-content: space-between;
}
.BuyBoxSection p {
margin: 0;
}
.BuyBoxSection .readcube-buybox {
background-color: #f3f3f3;
flex-shrink: 1;
flex-grow: 1;
flex-basis: 255px;
background-clip: content-box;
padding: 0 15px;
margin-top: 30px;
}
.BuyBoxSection .subscribe-buybox {
background-color: #f3f3f3;
flex-shrink: 1;
flex-grow: 4;
flex-basis: 300px;
background-clip: content-box;
padding: 0 15px;
margin-top: 30px;
}
.BuyBoxSection .subscribe-buybox-nature-plus {
background-color: #f3f3f3;
flex-shrink: 1;
flex-grow: 4;
flex-basis: 100%;
background-clip: content-box;
padding: 0 15px;
margin-top: 30px;
}
.BuyBoxSection .title-readcube,
.BuyBoxSection .title-buybox {
display: block;
margin: 0;
margin-right: 10%;
margin-left: 10%;
font-size: 24px;
line-height: 32px;
color: #222;
text-align: center;
font-family: Harding, Palatino, serif;
}
.BuyBoxSection .title-asia-buybox {
display: block;
margin: 0;
margin-right: 5%;
margin-left: 5%;
font-size: 24px;
line-height: 32px;
color: #222;
text-align: center;
font-family: Harding, Palatino, serif;
}
.BuyBoxSection .asia-link,
.Link-328123652,
.Link-2926870917,
.Link-2291679238,
.Link-595459207 {
color: #069;
cursor: pointer;
text-decoration: none;
font-size: 1.05em;
font-family: -apple-system, BlinkMacSystemFont, “Segoe UI”, Roboto,
Oxygen-Sans, Ubuntu, Cantarell, “Helvetica Neue”, sans-serif;
line-height: 1.05em6;
}
.BuyBoxSection .access-readcube {
display: block;
margin: 0;
margin-right: 10%;
margin-left: 10%;
font-size: 14px;
color: #222;
padding-top: 10px;
text-align: center;
font-family: -apple-system, BlinkMacSystemFont, “Segoe UI”, Roboto,
Oxygen-Sans, Ubuntu, Cantarell, “Helvetica Neue”, sans-serif;
line-height: 20px;
}
.BuyBoxSection ul {
margin: 0;
}
.BuyBoxSection .link-usp {
display: list-item;
margin: 0;
margin-left: 20px;
padding-top: 6px;
list-style-position: inside;
}
.BuyBoxSection .link-usp span {
font-size: 14px;
color: #222;
font-family: -apple-system, BlinkMacSystemFont, “Segoe UI”, Roboto,
Oxygen-Sans, Ubuntu, Cantarell, “Helvetica Neue”, sans-serif;
line-height: 20px;
}
.BuyBoxSection .access-asia-buybox {
display: block;
margin: 0;
margin-right: 5%;
margin-left: 5%;
font-size: 14px;
color: #222;
padding-top: 10px;
text-align: center;
font-family: -apple-system, BlinkMacSystemFont, “Segoe UI”, Roboto,
Oxygen-Sans, Ubuntu, Cantarell, “Helvetica Neue”, sans-serif;
line-height: 20px;
}
.BuyBoxSection .access-buybox {
display: block;
margin: 0;
margin-right: 10%;
margin-left: 10%;
font-size: 14px;
color: #222;
opacity: 0.8px;
padding-top: 10px;
text-align: center;
font-family: -apple-system, BlinkMacSystemFont, “Segoe UI”, Roboto,
Oxygen-Sans, Ubuntu, Cantarell, “Helvetica Neue”, sans-serif;
line-height: 20px;
}
.BuyBoxSection .price-buybox {
display: block;
font-size: 30px;
color: #222;
font-family: -apple-system, BlinkMacSystemFont, “Segoe UI”, Roboto,
Oxygen-Sans, Ubuntu, Cantarell, “Helvetica Neue”, sans-serif;
padding-top: 30px;
text-align: center;
}
.BuyBoxSection .price-buybox-to {
display: block;
font-size: 30px;
color: #222;
font-family: -apple-system, BlinkMacSystemFont, “Segoe UI”, Roboto,
Oxygen-Sans, Ubuntu, Cantarell, “Helvetica Neue”, sans-serif;
text-align: center;
}
.BuyBoxSection .price-info-text {
font-size: 16px;
padding-right: 10px;
color: #222;
font-family: -apple-system, BlinkMacSystemFont, “Segoe UI”, Roboto,
Oxygen-Sans, Ubuntu, Cantarell, “Helvetica Neue”, sans-serif;
}
.BuyBoxSection .price-value {
font-size: 30px;
font-family: -apple-system, BlinkMacSystemFont, “Segoe UI”, Roboto,
Oxygen-Sans, Ubuntu, Cantarell, “Helvetica Neue”, sans-serif;
}
.BuyBoxSection .price-per-period {
font-family: -apple-system, BlinkMacSystemFont, “Segoe UI”, Roboto,
Oxygen-Sans, Ubuntu, Cantarell, “Helvetica Neue”, sans-serif;
}
.BuyBoxSection .price-from {
font-size: 14px;
padding-right: 10px;
color: #222;
font-family: -apple-system, BlinkMacSystemFont, “Segoe UI”, Roboto,
Oxygen-Sans, Ubuntu, Cantarell, “Helvetica Neue”, sans-serif;
line-height: 20px;
}
.BuyBoxSection .issue-buybox {
display: block;
font-size: 13px;
text-align: center;
color: #222;
font-family: -apple-system, BlinkMacSystemFont, “Segoe UI”, Roboto,
Oxygen-Sans, Ubuntu, Cantarell, “Helvetica Neue”, sans-serif;
line-height: 19px;
}
.BuyBoxSection .no-price-buybox {
display: block;
font-size: 13px;
line-height: 18px;
text-align: center;
padding-right: 10%;
padding-left: 10%;
padding-bottom: 20px;
padding-top: 30px;
color: #222;
font-family: -apple-system, BlinkMacSystemFont, “Segoe UI”, Roboto,
Oxygen-Sans, Ubuntu, Cantarell, “Helvetica Neue”, sans-serif;
}
.BuyBoxSection .vat-buybox {
display: block;
margin-top: 5px;
margin-right: 20%;
margin-left: 20%;
font-size: 11px;
color: #222;
padding-top: 10px;
padding-bottom: 15px;
text-align: center;
font-family: -apple-system, BlinkMacSystemFont, “Segoe UI”, Roboto,
Oxygen-Sans, Ubuntu, Cantarell, “Helvetica Neue”, sans-serif;
line-height: 17px;
}
.BuyBoxSection .tax-buybox {
display: block;
width: 100%;
color: #222;
padding: 20px 16px;
text-align: center;
font-family: -apple-system, BlinkMacSystemFont, “Segoe UI”, Roboto,
Oxygen-Sans, Ubuntu, Cantarell, “Helvetica Neue”, sans-serif;
line-height: NaNpx;
}
.BuyBoxSection .button-container {
display: flex;
padding-right: 20px;
padding-left: 20px;
justify-content: center;
}
.BuyBoxSection .button-container > * {
flex: 1px;
}
.BuyBoxSection .button-container > a:hover,
.Button-505204839:hover,
.Button-1078489254:hover,
.Button-2737859108:hover {
text-decoration: none;
}
.BuyBoxSection .btn-secondary {
background: #fff;
}
.BuyBoxSection .button-asia {
background: #069;
border: 1px solid #069;
border-radius: 0;
cursor: pointer;
display: block;
padding: 9px;
outline: 0;
text-align: center;
text-decoration: none;
min-width: 80px;
margin-top: 75px;
}
.BuyBoxSection .button-label-asia,
.ButtonLabel-3869432492,
.ButtonLabel-3296148077,
.ButtonLabel-1636778223 {
display: block;
color: #fff;
font-size: 17px;
line-height: 20px;
font-family: -apple-system, BlinkMacSystemFont, “Segoe UI”, Roboto,
Oxygen-Sans, Ubuntu, Cantarell, “Helvetica Neue”, sans-serif;
text-align: center;
text-decoration: none;
cursor: pointer;
}
.Button-505204839,
.Button-1078489254,
.Button-2737859108 {
background: #069;
border: 1px solid #069;
border-radius: 0;
cursor: pointer;
display: block;
padding: 9px;
outline: 0;
text-align: center;
text-decoration: none;
min-width: 80px;
max-width: 320px;
margin-top: 20px;
}
.Button-505204839 .btn-secondary-label,
.Button-1078489254 .btn-secondary-label,
.Button-2737859108 .btn-secondary-label {
color: #069;
}
.uList-2102244549 {
list-style: none;
padding: 0;
margin: 0;
}
/* style specs end */

**Fig. 1: Data encodings for phased polymorphisms.**

**Fig. 2: Overview and performance of the GRG construction algorithm.**

**Fig. 3: Cost, time and file size for GRG construction.**

**Fig. 4: Benchmarking GRG for genome-wide summary statistic computation.**

**Fig. 5: The biological interpretation of the GRG data structure.**

Data availability

The phased vcf.gz files used in this study from the Phase III 1000 Genomes Project can be downloaded through https://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/. The UK Biobank data are available to approved researchers through http://dnanexus.com/. The latest SHAPEIT phased VCFs from the field 20279 are used for analysis and the details about it are documented at https://biobank.ndph.ox.ac.uk/showcase/refer.cgi?id=1910. The constructed GRG from the SHAPEIT phased data will be returned to the UK Biobank for access by approved projects. The inferred tree sequences of SARS-CoV-2 genomes from Zhan et al. 2023 (ref. ⁷⁸) are available by request from the original authors with approved GISAID data access. The original genomes of the SARS-CoV-2 are available on GISAID⁷⁹, via gisaid.org/EPI_SET_230329cd and included in Supplementary Data 1 (GISAID EPI SET PDF). The converted GRG of SARS-CoV-2 genomes are available by request with approved GISAID data access. Source data are provided with this paper.

Code availability

We have released an open-source, lightweight VCF parser picovcf v2.0 through https://github.com/aprilweilab/picovcf/releases/tag/v2.0. The Genotype Representation Graph Library (GRGL)⁸⁰, which implements the construction of GRGs and downstream computations, is available through https://github.com/aprilweilab/grgl. A detailed description of the software is provided in Supplementary Information. External softwares used in this study include: msprime v.1.2.0, https://pypi.org/project/msprime/1.2.0/; tskit v.0.5.5, https://github.com/tskit-dev/tskit/releases/tag/0.5.5; tsinfer v.0.2.3, https://github.com/tskit-dev/tsinfer/releases/tag/0.2.3; cyvcf2 v.0.30.22, https://github.com/brentp/cyvcf2/releases/tag/v0.30.22; plink v.1.90b6.26, https://www.cog-genomics.org/plink/; XSI v4.0.1 (dev), https://github.com/rwk-unil/xSqueezeIt/tree/21562519; Savvy v2.1.0, https://github.com/statgen/savvy/releases/tag/v2.1.0; and GCTA v.1.94.1, https://yanglab.westlake.edu.cn/software/gcta/bin/gcta-1.94.1-linux-kernel-3-x86_64.zip.

References

Kaiser, J. 200,000 whole genomes made available for biomedical studies by U.K. effort. ScienceInsider https://doi.org/10.1126/science.acx9678 (2021).
Browning, B. L. & Browning, S. R. Statistical phasing of 150,119 sequenced genomes in the UK Biobank. Am. J. Hum. Genet. 110, 161–165 (2023).

Article

Google Scholar
Hofmeister, R. J., Ribeiro, D. M., Rubinacci, S. & Delaneau, O. Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank. Nat. Genet. 55, 1243–1249 (2023).

Article

Google Scholar
Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).

Article

Google Scholar
Band, G. & Marchini, J. BGEN: a binary file format for imputed genotype and haplotype data. Preprint at bioRxiv https://doi.org/10.1101/308296 (2018).
Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–2993 (2011).

Article

Google Scholar
Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).

Article

Google Scholar
Layer, R. M., Kindlon, N., Karczewski, K. J., Consortium, E. A. & Quinlan, A. R. Efficient genotype compression and analysis of large genetic-variation data sets. Nat. Methods 13, 63–65 (2016).

Article

Google Scholar
Danek, A. & Deorowicz, S. GTC: how to maintain huge genotype collections in a compressed form. Bioinformatics 34, 1834–1840 (2018).

Article

Google Scholar
Browning, B. L., Zhou, Y. & Browning, S. R. A one-penny imputed genome from next-generation reference panels. Am. J. Hum. Genet. 103, 338–348 (2018).

Article

Google Scholar
Deorowicz, S. & Danek, A. GTShark: genotype compression in large projects. Bioinformatics 35, 4791–4793 (2019).

Article

Google Scholar
Lan, D., Tobler, R., Souilmi, Y. & Llamas, B. Genozip: a universal extensible genomic data compressor. Bioinformatics 37, 2225–2230 (2021).

Article

Google Scholar
LeFaive, J., Smith, A. V., Kang, H. M. & Abecasis, G. Sparse allele vectors and the savvy software suite. Bioinformatics 37, 4248–4250 (2021).

Article

Google Scholar
Wertenbroek, R., Rubinacci, S., Xenarios, I., Thoma, Y. & Delaneau, O. XSI—a genotype compression tool for compressive genomics in large biobanks. Bioinformatics 38, 3778–3784 (2022).

Article

Google Scholar
Rivas, M. A. & Chang, C. Efficient storage and regression computation for population-scale genome sequencing studies. Preprint at bioRxiv https://doi.org/10.1101/2024.04.11.589062 (2024).
Durbin, R. Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT). Bioinformatics 30, 1266–1272 (2014).

Article

Google Scholar
Wu, K., Otoo, EJ. & Shoshani, A. A performance comparison of bitmap indexes. In Proc. of the Tenth International Conference on Information and Knowledge Management 559–561 (ACM, 2001).
Collet, Y. & Kucherawy, M. RFC8878: zstandard compression and the ‘application/Zstd’ media type. RFC Editor https://doi.org/10.17487/RFC8878 (2021).
Kapli, P., Yang, Z. & Telford, M. J. Phylogenetic tree building in the genomic age. Nat. Rev. Genet. 21, 428–444 (2020).

Article

Google Scholar
Fitch, W. M. & Margoliash, E. Construction of phylogenetic trees. Science 155, 279–284 (1967).

Article

Google Scholar
Turakhia, Y. et al. Ultrafast Sample placement on Existing tRees (UShER) enables real-time phylogenetics for the SARS-CoV-2 pandemic. Nat. Genet. 53, 809–816 (2021).

Article

Google Scholar
Huson, D. H. & Bryant, D. Application of phylogenetic networks in evolutionary studies. Mol. Biol. Evol. 23, 254–267 (2006).

Article

Google Scholar
Brandt, Y. C., Wei, X. D., Deng, Y., Vaughn, A. H. & Nielsen, R. Evaluation of methods for estimating coalescence times using ancestral recombination graphs. Genetics 221, iyac044 (2022).

Article

Google Scholar
Brandt, D. Y. C., Huber, C. D., Chiang, C. W. K. & Ortega-Del Vecchyo, D. The promise of inferring the past using the ancestral recombination graph. Genome Biol. Evol. 16, evae005 (2024).

Article

Google Scholar
Lewanski, A. L., Grundler, M. C. & Bradburd, G. S. The era of the ARG: an introduction to ancestral recombination graphs and their significance in empirical evolutionary genomics. PLoS Genet. 20, e1011110 (2024).

Article

Google Scholar
Wong, Y. et al. A general and efficient representation of ancestral recombination graphs. Genetics 228, iyae100 (2024).

Article

Google Scholar
Hudson, R. R. Properties of a neutral allele model with intragenic recombination. Theor. Popul. Biol. 23, 183–201 (1983).

Article

Google Scholar
Tavaré, S. Line-of-descent and genealogical processes, and their applications in population genetics models. Theor. Popul. Biol. 26, 119–164 (1984).

Article
MathSciNet

Google Scholar
Zhang, B. C., Biddanda, A., Gunnarsson, Á. F., Cooper, F. & Palamara, P. F. Biobank-scale inference of ancestral recombination graphs enables genealogical analysis of complex traits. Nat. Genet. 55, 768–776 (2023).

Article

Google Scholar
Rasmussen, M. D., Hubisz, M. J., Gronau, I. & Siepel, A. Genome-wide inference of ancestral recombination graphs. PLoS Genet. 10, e1004342 (2014).

Article

Google Scholar
Speidel, L., Forest, M., Shi, S. & Myers, S. R. A method for genome-wide genealogy estimation for thousands of samples. Nat. Genet. 51, 1321–1329 (2019).

Article

Google Scholar
Kelleher, J. et al. Inferring whole-genome histories in large population datasets. Nat. Genet. 51, 1330–1338 (2019).

Article

Google Scholar
Eddy, S. R. What is dynamic programming? Nat. Biotechnol. 22, 909–910 (2004).

Article

Google Scholar
Greenland, S., Pearl, J. & Robins, J. M. Causal diagrams for epidemiologic research. Epidemiology 10, 37–48 (1999).

Article

Google Scholar
Kelleher, J., Etheridge, A. M. & McVean, G. Efficient coalescent simulation and genealogical analysis for large sample sizes. PLoS Comput. Biol. 12, e1004842 (2016).

Article

Google Scholar
Baumdicker, F. et al. Efficient ancestry and mutation simulation with msprime 1.0. Genetics 220, iyab229 (2022).

Article

Google Scholar
Giegerich, R. A systematic approach to dynamic programming in bioinformatics. Bioinformatics 16, 665–677 (2000).

Article

Google Scholar
Ralph, P., Thornton, K. & Kelleher, J. Efficiently summarizing relationships in large samples: a general duality between statistics of genealogies and genomes. Genetics 215, 779–797 (2020).

Article

Google Scholar
Cormen, T. H., Leiserson, C. E., Rivest, R. L. & Stein, C. Introduction to Algorithms (MIT Press, 2022).
Steiß, V., Letschert, T., Schäfer, H. & Pahl, R. PERMORY-MPI: a program for high-speed parallel permutation testing in genome-wide association studies. Bioinformatics 28, 1168–1169 (2012).

Article

Google Scholar
Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaSci 4, 7 (2015).

Article

Google Scholar
Purcell, S. & Chang, C. PLINK [1.90b626]. https://www.cog-genomics.org/plink/1.9 (2022).
Kelleher, J., Thornton, K. R., Ashander, J. & Ralph, P. L. Efficient pedigree recording for fast population genetics simulation. PLoS Comput. Biol. 14, e1006581 (2018).

Article

Google Scholar
Visscher, P. M. et al.10 years of GWAS discovery: biology, function and translation. Am. J. Hum. Genet. 101, 5–22 (2017).

Article

Google Scholar
Abdellaoui, A., Yengo, L., Verweij, K. J. H. & Visscher, P. M. 15 years of GWAS discovery: realizing the promise. Am. J. Hum. Genet. 110, 179–194 (2023).

Article

Google Scholar
Loh, P.-R., Baym, M. & Berger, B. Compressive genomics. Nat. Biotechnol. 30, 627–630 (2012).

Article

Google Scholar
Furnas, G. W. & Zacks, J. Multitrees: enriching and reusing hierarchical structure. In Proc. SIGCHI conference on Human factors in Computing Systems Celebrating Interdependence 330–336 (ACM, 1994).
Wakeley, J. Coalescent Theory: An Introduction (Roberts & Co, 2009).
Hudson, R. R. Testing the constant-rate neutral allele model with protein sequence data. Evolution 37, 203–217 (1983).

Article

Google Scholar
Saunders, I. W., Brohede, J. & Hannan, G. N. Estimating genotyping error rates from Mendelian errors in SNP array genotypes and their impact on inference. Genomics 90, 291–296 (2007).

Article

Google Scholar
Browning, B. L. & Yu, Z. Simultaneous genotype calling and haplotype phasing improves genotype accuracy and reduces false-positive associations for genome-wide association studies. Am. J. Hum. Genet. 85, 847–861 (2009).

Article

Google Scholar
Halldorsson, B. V. et al. The sequences of 150,119 genomes in the UK Biobank. Nature 607, 732–740 (2022).

Article

Google Scholar
Hudson, R. R. Gene genealogies and the coalescent process. Oxf. Surv. Evol. Biol. 7, 1–44 (1990).

Google Scholar
Griffiths, R. C. & Marjoram, P. Ancestral inference from samples of DNA sequences with recombination. J. Comput. Biol. 3, 479–502 (1996).

Article

Google Scholar
Griffiths, R. C. & Marjoram, P. An ancestral recombination graph. Prog. Popul. Genet. Hum. Evol. 257–270 (1997).
Slatkin, M. Linkage disequilibrium—understanding the evolutionary past and mapping the medical future. Nat. Rev. Genet. 9, 477–485 (2008).

Article

Google Scholar
Salehi Nowbandegani, P. et al. Extremely sparse models of linkage disequilibrium in ancestrally diverse association studies. Nat. Genet. 55, 1494–1502 (2023).

Article

Google Scholar
Browning, S. R. & Browning, B. L. Haplotype phasing: existing methods and new developments. Nat. Rev. Genet. 12, 703–714 (2011).

Article

Google Scholar
Geza, E. et al. A comprehensive survey of models for dissecting local ancestry deconvolution in human genome. Brief. Bioinform. 20, 1709–1724 (2019).

Article

Google Scholar
Mbatchou, J. et al. Computationally efficient whole-genome regression for quantitative and binary traits. Nat. Genet. 53, 1097–1103 (2021).

Article

Google Scholar
Gray, AG., Moore, AW., Nichol RC., Connolly, AJ., Genovese, C. & Wasserman, L. Multi-Tree Methods for Statistics on Very Large Datasets in Astronomy. In Twenty Years of ADASS: A Retrospective of the First Twenty Years of the Astronomical Data Analysis Software and Systems Conference Series (eds F. Ochsenbein et al.) 314 (Astronomical Society of the Pacific, 2014).
Waggener, W. N. Pulse Code Modulation Techniques: With Applications in Communications and Data Recording (Van Nostrand Reinhold, 1995).
Miller, J. R., Koren, S. & Sutton, G. Assembly algorithms for next-generation sequencing data. Genomics 95, 315–327 (2010).

Article

Google Scholar
Bloom, B. H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 422–426 (1970).

Article

Google Scholar
Burkhard, W. A. & Keller, R. M. Some approaches to best-match file searching. Commun. ACM 16, 230–236 (1973).

Article

Google Scholar
Duff, I. S. Computer solution of large sparse positive definite systems (Alan George and Joseph W. Liu). SIAM Rev. 26, 289–291 (1984).

Article

Google Scholar
Buluç, A., Fineman, J. T., Frigo, M., Gilbert, J. R. & Leiserson, C. E. Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In Proc. Twenty-first Annual Symposium on Parallelism in Algorithms and Architectures 233–244 (ACM, 2009).
Brandes, U., Eiglsperger, M., Herman, I., Himsolt, M. & Marshall, M. S. in Graph Drawing (eds Mutzel, P. et al.) Vol. 2265, 501–512 (Springer, 2002).
The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).

Article

Google Scholar
McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).

Article

Google Scholar
Eggertsson, H. P. et al. Graphtyper enables population-scale genotyping using pangenome graphs. Nat. Genet. 49, 1654–1660 (2017).

Article

Google Scholar
Ribeiro, D., Hofmeister, R., Rubinacci, S. & Delaneau, O. Phasing of the UK Biobank whole genome sequencing data interim release of 200,031 samples (2023).
GNU Scientific Library Reference Manual: For GSL Version 1.12. (Network Theory, 2009).
Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 88, 76–82 (2011).

Article

Google Scholar
Pedersen, B. S. & Quinlan, A. R. cyvcf2: fast, flexible variant analysis with Python. Bioinformatics 33, 1867–1869 (2017).

Article

Google Scholar
Kingman, J. F. C. On the genealogy of large populations. J. Appl. Probab. 19, 27–43 (1982).

Article
MathSciNet

Google Scholar
Kingman, J. F. C. The coalescent. Stoch. Process. Their Appl. 13, 235–248 (1982).

Article
MathSciNet

Google Scholar
Zhan, SH., Ignatieva, A., Wong, Y., et al. Towards pandemic-scale ancestral recombination graphs of SARS-CoV-2. Preprint at bioRxiv https://doi.org/10.1101/2023.06.08.544212 (2023).
GISAID (GISAID); https://doi.org/10.55876/gis8.230329cd (2023).
DeHaas, D. & Pan, Z. aprilweilab/grgl: GRG paper version. Zenodo https://doi.org/10.5281/zenodo.14002478 (2024).

Download references

Acknowledgements

We acknowledge all data contributors, that is, the authors and their originating laboratories responsible for obtaining the specimens, and their submitting laboratories for generating the genetic sequence and metadata and sharing via the GISAID Initiative, on which this research is based. We thank J. Kelleher for sharing the inferred SARS-CoV-2 tree sequences with us. This work is partly supported by NIH R35-GM150579 to X.W. This research has been conducted using the UK Biobank Resource under application number 97908.

Author information

These authors contributed equally: Drew DeHaas, Ziqing Pan.

Authors and Affiliations

Department of Computational Biology, Cornell University, Ithaca, NY, USA

Drew DeHaas, Ziqing Pan & Xinzhu Wei

Authors

Drew DeHaas

View author publications

You can also search for this author in
PubMed Google Scholar
Ziqing Pan

View author publications

You can also search for this author in
PubMed Google Scholar
Xinzhu Wei

View author publications

You can also search for this author in
PubMed Google Scholar

Contributions

D.D., Z.P. and X.W. wrote and revised the paper. D.D., Z.P. and X.W. designed the experiments and interpreted the results. D.D. and Z.P. carried out the main analyses and illustrated the results. D.D. developed the software. Z.P. carried out the pilot investigation. X.W. conceived and supervised the project.

Corresponding author

Correspondence to
Xinzhu Wei.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Jerome Kelleher and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Fernando Chirigati, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Comparisons of GRG construction with varying parameters T and P.

a-b: The effect of parameter T on edge count (a) and file size (b) of GRG. Parameter T controls how many tree-GRGs to construct before mapping mutations. The x-axes are the segment length used for each tree, which can then be converted into the parameter T using the chromosome length. Data is from chromosomes 19 and 22 of the 1000 Genomes Project dataset. c-d: The effect of parameter P on file sizes (c) and construction time (d) of GRG. Data is from chromosome 19 of the 1000 Genomes Project dataset.

Source data

Extended Data Fig. 2 Comparisons of file formats for phased whole-genome sequencing data.

a: Compression cost in British pound sterling (GBP) for converting the whole-chromosome polymorphisms from tabular formats (that is BGEN or BCF) to GRG, XSI and Savvy on the UK Biobank chromosomes 22 (left) and 13 (right). b: Compression time for different formats conversion. See Supplementary Table 1 for a summary of file format features. c: Comparisons of file sizes for encoding whole-genome polymorphisms in the 1000 Genomes Project dataset chromosomes 1–22. Each dot represents one chromosome.

Source data

Extended Data Fig. 3 GRG construction statistics for three different populations from the 1000 Genomes Project dataset.

Samples from Africans (AFR), Europeans (EUR) and East Asians (EAS) are used to construct GRGs separately. We randomly sample 500 individuals from each population. We compare the time to construct GRG (a), GRG file size (b) and number of alternative alleles (c) for each population in the dataset. Each dot presents a chromosome. If every site were bi-allelic, this would be equal to the number of sites. GRG explicitly represents every alternative allele value.

Source data

Extended Data Fig. 4 Schematic illustrations of graph-based algorithms for computing allele frequencies and association effects.

a: Computation of allele frequencies using standard tabular methods. Genotypes of each variant are read row by row from disk and allele frequencies are computed independently for each variant in the tabular structure. The colors for each mutation in panel a matches colors for nodes with mutation in panel b. b: Computation of allele frequencies using GRG. In a GRG, the sample frequencies of nodes are firstly computed (the value inside each node) and then mapped to the mutations based on the node-mutation mapping. Node IDs are labeled next to nodes as numeric digits. Mutation nodes are colored accordingly and indicated with a mutation set (for example, {m1}). The frequency of leaf nodes is initialized as 1/n (for instance, nodes 0–3), where n is the number of haplotypes in the population. For other nodes (for instance, internal nodes and root nodes), frequencies are computed sequentially based on the node IDs (for instance, from node 4 to node 7) as the sum of frequencies of each node’s children nodes (for instance, node 4 = node 0 + node 1). c: Computing association effect (β) with genotype and phenotype vectors. The necessary dot product terms for computing β are given in the second part of the equation. We assume sample 0 and sample 1 are from individual 0, and sample 2 and sample 3 are from individual 1. The individual phenotypes are shown as a vector below the genotype matrix. d: A schematic illustration of computing association effect by computing the dot product terms through graph traversals. Similar to b, the node IDs and mutation lists are labeled near the nodes, while values for each term are listed in the nodes. The initialization at the leaf nodes and the node values after the traversal are illustrated inside the nodes.

Extended Data Fig. 5 Querying haplotypes in a GRG.

Two examples of queried haplotypes in a GRG are shown (the haplotype for sample 0 and the haplotype for sample 2). On the left, we highlight all paths to traverse in order to achieve the full haplotype of a queried sample node in a GRG. The mutations at each node are listed nearby. The full haplotypes are shown on the right, where mutations occurring in the path are designated as alternative alleles (in black) and other positions are designated as reference alleles (in grey).

Extended Data Fig. 6 Allele frequency computations.

a: Comparisons of computational efficiency in the UK Biobank. Allele frequencies were computed on 9 chromosomes (chromosomes 8, 10, 12, 14, 16, 18, 20, 21, 22, labeled as solid dots). These results were used to predict the expected computation time for the remaining chromosomes (presented as open dots) with a linear projection. For Savvy and XSI, we focused on 2 chromosomes (chromosome 13 and 22). b: Comparisons of computational costs in Pound Sterling (GBP) for calculating allele frequencies in the UK Biobank Research Analysis Platform. Similar to a, solid dots represent the results from our experiments while open dots represent the projected computational cost for the remaining chromosomes. c: Comparisons of computational time for allele frequencies in the 1000 Genomes Project: each data point represents the computational time of allele frequencies for each chromosome in the 1000 Genomes Project, either using tabular data structures (VCF and plink bed files (PLINK)) or GRG structure as input.

Source data

Extended Data Fig. 7 Computing allele frequency on subsets of the dataset.

a-c: Average time to compute allele frequency on three separate regions of size 100Kbp (a), 1Mbp (b) and 5 Mbp (c) with different encodings. The ‘GRG (in-memory)’ time is the time to compute allele frequency by traversing the graph, once the graph is loaded into RAM. The ‘GRG’ time includes the time to load the graph from disk. d: Time to compute allele frequency for a subset of 1000 randomly chosen diploid samples on a genome of size 100Mbp. The same random subset was used for Savvy and GRG.

Source data

Extended Data Fig. 8 Comparisons of file formats for phased whole-genome sequencing data.

a-c are the results for the 1000 Genomes Project where each dot represents a full chromosome. a: Conversion times from uncompressed or compressed VCF to other tabular formats including BGEN, IGD, compressed VCF (vcf.gz) and plink bed files (BED). b: Time for GRG construction from different file formats with 20 computational CPU threads. c: Time to scan each file type after the data is loaded into memory. d: Time to convert tskit tree sequences to GRGs against varying numbers of samples with the simulated data. The tree sequences are simulated by msprime with a length of 100Mbp and a human-like genetic architecture.

Source data

Extended Data Fig. 9 Investigating the optimality of GRG construction.

A simulated genetic sequence of 100Mbp is used in the benchmark for panels a and b. a: Comparisons of the file sizes among VCFs (VCF), bed files (PLINK), true tree sequences from msprime (true TS), and GRGs converted from true tree sequences (true TS-GRG). b: Comparisons of file sizes of true TS, true TS-GRG, inferred tree sequences from tsinfer (inferred TS), GRGs converted from inferred tree sequences (inferred TS-GRG) and GRG constructed from true genotypes of varied sample sizes (GRG). The phased whole-genome sequences from the Phase III 1000 Genomes Project are used for benchmarking for panels c and d. c: Comparisons of the file sizes using data from the 1000 Genomes Project. Each chromosome is represented as a data point. d: Runtime of GRG construction and tsinfer inference. The benchmark is among the inference time using tsinfer (tsinfer), the construction time of GRG (GRG), and the conversion time from the inferred tree sequences to inferred TS-GRGs (inferred TS-GRG).

Source data

Extended Data Fig. 10 Robustness of GRG construction with noisy simulated data.

The file sizes (a-b) and creation efficiency (c-d) of inferred tree sequences and constructed GRGs in simulated data: (a, c) without error and (b, d) with 0.1% error. The file size comparison is among the inferred tree sequences from tsinfer (inferred TS), the GRG converted from the inferred TS (inferred TS-GRG) and the GRG constructed directly from the genotypes (GRG). The computation time benchmark is for tsinfer and GRG construction. e, f: The ratio of the number of mutations in inferred tree sequences to the number of input alternative alleles in VCF with different error rates in the data.

Source data

Supplementary information

Supplementary Information

Supplementary Sections 1–13, Tables 1 and 2, and Figs. 1–10.

Reporting Summary

Peer Review File

Supplementary Data 1

This file is from the original data contributor and is required to be uploaded by the GISAID database. Per the GISAID website, the use of the Supplemental Table must be in its original form without alteration, redaction or other comment.

Source data

Source Data Fig. 2

Source data for Fig. 2b–d.

Source Data Fig. 3

Source data for Fig. 3.

Source Data Fig. 4

Source data for Fig. 4.

Source Data Extended Data Fig. 1

Source data for Extended Data Fig. 1.

Source Data Extended Data Fig. 2

Source data for Extended Data Fig. 2.

Source Data Extended Data Fig. 3

Source data for Extended Data Fig. 3.

Source Data Extended Data Fig. 6

Source data for Extended Data Fig. 6.

Source Data Extended Data Fig. 7

Source data for Extended Data Fig. 7.

Source Data Extended Data Fig. 8

Source data for Extended Data Fig. 8.

Source Data Extended Data Fig. 9

Source data for Extended Data Fig. 9.

Source Data Extended Data Fig. 10

Source data for Extended Data Fig. 10.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

DeHaas, D., Pan, Z. & Wei, X. Enabling efficient analysis of biobank-scale data with genotype representation graphs.
Nat Comput Sci (2024). https://doi.org/10.1038/s43588-024-00739-9

Download citation

Received: 08 May 2024
Accepted: 06 November 2024
Published: 05 December 2024
DOI: https://doi.org/10.1038/s43588-024-00739-9

Enabling efficient analysis of biobank-scale data with genotype representation graphs

Abstract

Access options

Data availability

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Peer review

Peer review information

Additional information

Extended data

Extended Data Fig. 1 Comparisons of GRG construction with varying parameters T and P.

Extended Data Fig. 2 Comparisons of file formats for phased whole-genome sequencing data.

Extended Data Fig. 3 GRG construction statistics for three different populations from the 1000 Genomes Project dataset.

Extended Data Fig. 4 Schematic illustrations of graph-based algorithms for computing allele frequencies and association effects.

Extended Data Fig. 5 Querying haplotypes in a GRG.

Extended Data Fig. 6 Allele frequency computations.

Extended Data Fig. 7 Computing allele frequency on subsets of the dataset.

Extended Data Fig. 8 Comparisons of file formats for phased whole-genome sequencing data.

Extended Data Fig. 9 Investigating the optimality of GRG construction.

Extended Data Fig. 10 Robustness of GRG construction with noisy simulated data.

Supplementary information

Supplementary Information

Reporting Summary

Peer Review File

Supplementary Data 1

Source data

Source Data Fig. 2

Source Data Fig. 3

Source Data Fig. 4

Source Data Extended Data Fig. 1

Source Data Extended Data Fig. 2

Source Data Extended Data Fig. 3

Source Data Extended Data Fig. 6

Source Data Extended Data Fig. 7

Source Data Extended Data Fig. 8

Source Data Extended Data Fig. 9

Source Data Extended Data Fig. 10

Rights and permissions

About this article

Cite this article

Share this article

Leave a Reply Cancel reply