Background & Summary

Geothermal features in Antarctica represent unique and extreme environments where heat from volcanic activity creates isolated oases of warmth amidst the frigid landscape. These geothermally influenced habitats can be found in exposed volcanic mountains, islands, and even subglacial environments1,2. Currently, there are four active geothermal sites in Antarctica: one on Deception Island in the South Shetland Islands near the Antarctic Peninsula, and three in continental Antarctica, specifically in Victoria Land (Mount Erebus, Mount Melbourne, and Mount Rittmann). Geothermal activity results in warm soils and fumaroles, which can support bryophytes and microbial life by providing liquid water and relatively stable temperatures, despite the harsh external climate. This unique thermal moderation enables the persistence of diverse microbial communities specifically adapted to the combined thermal and chemical extremes of these habitats.

Over the past four decades, researchers have explored microbial diversity in these Antarctic geothermal environments using various methodologies. Early studies relied heavily on culture-dependent techniques, leading to the identification of only a limited number of bacterial phyla3,4,5,6. However, the advent of molecular methods, such as marker gene-based surveys and shotgun metagenomics, has unveiled a surprisingly diverse array of both cosmopolitan and endemic prokaryotic lineages, encompassing over 20 phyla1,7,8,9. These microorganisms have evolved mechanisms to thrive in extreme conditions characterized by high temperatures, fluctuating pH levels, and the presence of various metals and sulfur compounds. These unique conditions facilitate the development of specialized metabolic pathways, including thermophilic and chemolithotrophic processes. Studying these microbial communities provides valuable insights into the resilience and adaptability of life, offering clues about the potential for life in similar extreme environments elsewhere on Earth and possibly on other planets. While previous research has primarily focused on Mount Erebus and Deception Island, there is a scarcity of studies conducted on Mount Melbourne and Mount Rittmann8,9,10,11.

Shotgun metagenomic sequencing provides a comprehensive overview of microbial community composition and function in various environments. Deeper sequencing enhances our understanding at the genome level by reconstructing metagenome-assembled genomes (MAGs), which is particularly useful in less complex environments12,13. However, this approach has limitations when short-read sequencing is applied: it often misses low-abundance taxa, and the resulting MAGs can be fragmented and error-prone14. Furthermore, short-read-based MAGs often represent population-level composite genomes, making it difficult to distinguish closely related species or strains15. In contrast, long-read metagenomic sequencing, particularly PacBio high-accuracy long-read (HiFi) sequencing, substantially improves nucleotide accuracy and read length, enabling the assembly of complete MAGs16. Hi-C sequencing also advances microbial ecology by chemically linking phages and plasmids to their host cells and by enabling the retrieval of strain-level MAGs from Hi-C contact maps17,18. Single-cell genomics, on the other hand, allows for the isolation and amplification of DNA from individual cells. This approach enables the detailed characterization of rare and uncultured microbes, revealing strain-level diversity that might be overlooked in bulk metagenomic analyses19. Single-cell amplified genomes (SAGs) are especially useful in uncovering unique adaptations and evolutionary strategies of microbes in extreme or specialized environments20,21. Therefore, integrating SAGs with PacBio HiFi and Hi-C metagenomic sequencing can significantly enhance our understanding of microbial diversity and functionality by providing more complete and accurate genomic information from both abundant and rare microbial community members.

Here, we present an integrated microbial genomic diversity dataset from two high-altitude geothermal sites, Mount Melbourne and Mount Rittmann, in Antarctica. This dataset includes PacBio HiFi and Hi-C sequencing reads, resulting in 75 high-quality MAGs, 69% of which are single-contig MAGs. Additionally, 224 SAGs, derived from a nontargeted cell sorting method, were sequenced, assembled, and taxonomically identified. This collection of genomic data on bacteria and archaea facilitates a genome-level understanding of microbial communities and their functionalities in these unique environments, enabling researchers to make comparisons with other geothermal habitats on Earth.

Methods

Soil sampling

Soil sampling was conducted at two high-altitude geothermal sites located in northern Victoria Land, Antarctica: Mount Melbourne (74°21′S, 164°42′E; 2,733 m) and Mount Rittmann (73°27′S, 165°30′E; 2,600 m) (Fig. 1a and b; Table 1). These sites are within Antarctic Specially Protected Area (ASPA) No. 175. At Mount Melbourne, four soil samples were collected along Cryptogam Ridge, located at the northeast edge of an old caldera rim, and combined to form a composite sample. Similarly, at Mount Rittmann, four soil samples were collected from areas with active fumarolic activity on the upper and middle slopes of the caldera rim, and then pooled to create a composite sample. Surface soil temperatures on Cryptogam Ridge reach 40 to 50 °C6, while temperatures of 43.4 °C have been recorded at Mount Rittmann22. To minimize site contamination, we adhered to strict protocols, including the use of clean suits and masks during sampling. Surface soils (0–10 cm depth) were collected using trowels sterilized with 70% ethanol. The collected soil samples were transported to the Jang Bogo station and stored at −20 °C before being shipped to South Korea via air cargo under cold room conditions. Subsamples were placed in glycerol-TE buffer for subsequent single-cell sorting23.

Fig. 1

Sampling locations and workflow for multi-modal microbiome profiling strategies. (a) Site map, (b) landscape pictures of Mount Melbourne and Mount Rittmann, and (c) schematic workflows for PacBio HiFi and Hi-C metagenomic sequencing, and single-cell genomics. This figure was created with BioRender.com.

Table 1 Site information and soil physicochemical properties.

PacBio HiFi metagenomic sequencing

DNA was extracted from 10 g of soil using the DNeasy PowerMax Soil Kit (QIAGEN) following the manufacturer’s instructions. Due to the low microbial biomass in these geothermal habitats, DNA from four separate extractions was pooled to generate a composite sample for each location. Electrophoresis was performed on a 1% agarose gel to visualize the extracted DNA. The quantity and quality of the extracted DNA were assessed using the PicoGreen (Invitrogen) method with a Victor 3 fluorometer. The DNA size distribution was determined using the Femto Pulse system (Agilent), revealing average fragment sizes of 17,167 bp for Mount Melbourne and 10,156 bp for Mount Rittmann. The genomic DNA was sheared using the Megaruptor 3 and purified using AMPure PB magnetic beads. HiFi SMRTbell libraries were then prepared following the PacBio protocol “Preparing 10 kb library using SMRTbell Express Template Prep kit 2.0 for metagenomics shotgun sequencing”, and subsequently annealed with the Sequel II Binding Kit 2.2 and Internal Control Kit 1.0.0. Sequencing was performed on the Sequel II platform (Pacific Biosciences) using the Sequel II Sequencing Kit 2.0 and Sequel II SMRT Cell 8M Tray, with a 30-h movie collected for each SMRT cell. Library preparation and PacBio HiFi sequencing were carried out at Macrogen Inc. (Seoul, Republic of Korea).

Summary statistics of HiFi reads were calculated and visualized using Pauvre24. HiFi sequencing reads were assembled using three different assemblers: metaFlye v2.9.425, hifiasm-meta v0.3.1 (r63.2)26, and metaMDBG v0.327. We ran hifiasm-meta with the default parameters, and primary contigs (p_ctg.gfa) were used for further analyses. metaFlye was executed with the --pacbio-hifi and --meta options. metaMDBG was run with its default settings, and “l” (linear), “c” (circular), and “rc” (rescued circular) contigs were used. Minimap2 v2.28 (r1209)28 was used with the ‘-ax map-hifi’ option to determine the fraction of reads mapped to the assemblies.
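As an illustration of how a mapped-read fraction can be derived from SAM output such as that produced by minimap2 with ‘-ax map-hifi’, the following minimal Python sketch counts each read once and checks the standard SAM FLAG bits. The function name and the toy records are hypothetical, and this is not necessarily the procedure used to produce the values reported here.

```python
def mapped_read_fraction(sam_lines):
    """Return the fraction of reads whose primary record is mapped."""
    total = 0
    mapped = 0
    for line in sam_lines:
        # Header lines start with "@"; blank lines carry no record.
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])
        # Skip secondary (0x100) and supplementary (0x800) records so that
        # every read contributes exactly one record to the count.
        if flag & 0x100 or flag & 0x800:
            continue
        total += 1
        if not flag & 0x4:  # bit 0x4 marks an unmapped read
            mapped += 1
    return mapped / total if total else 0.0


# Toy SAM records (hypothetical): read1 and read3 map, read2 does not,
# and the supplementary record of read1 is ignored.
example = [
    "@SQ\tSN:contig_1\tLN:5000",
    "read1\t0\tcontig_1\t1\t60\t10M\t*\t0\t0\tACGTACGTAC\t*",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\t*",
    "read1\t2048\tcontig_1\t100\t60\t5M5H\t*\t0\t0\tACGTA\t*",
    "read3\t16\tcontig_1\t50\t60\t10M\t*\t0\t0\tACGTACGTAC\t*",
]
fraction = mapped_read_fraction(example)
print(f"{fraction:.3f}")
```

Counting only primary records avoids inflating the numerator when long reads produce supplementary alignments across contig boundaries.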

The resultant assemblies were then input into the HiFi-MAG-Pipeline from pb-metagenomics-tools29 to generate high-quality MAGs (Fig. 1c). In brief, contigs underwent a ‘completeness-aware’ strategy, retaining MAGs with ≥93% completeness as high-quality outputs, while MAGs with <93% completeness were subjected to additional binning and refining pathways to improve their quality.

Hi-C sequencing

Twenty grams of soil were processed following the low biomass protocol outlined in the ProxiMeta™ Hi-C Kit Protocol v4.0. Hi-C libraries were then prepared using the ProxiMeta™ Kit (Phase Genomics, Seattle, WA) according to the manufacturer’s instructions (Fig. 1c). This procedure includes cross-linking, digestion, and the formation of chimeric junctions. Subsequently, the Hi-C libraries were sequenced on the NovaSeq 6000 platform (paired-end, 150 bp) at DNA Link Inc. (Seoul, Republic of Korea). The raw Hi-C reads and HiFi assembled contigs were analyzed using the ProxiMeta metagenome deconvolution platform30.

MAGs recovered from both the HiFi-MAG-Pipeline (HiFi reads) and ProxiMeta (Hi-C reads) were dereplicated using dRep v3.5.0 at an average nucleotide identity (ANI) >0.99 and maximum alignment coverage >0.95 to remove redundant genomes while retaining strain-level diversity31. CheckM2 v1.0.232 was used to assess the quality of MAGs, and phylogenetic inference and taxonomic assignment of MAGs were performed using GTDB-Tk v2.4.033 with GTDB release 214. We defined high-quality MAGs as those with ≥90% completeness and ≤5% contamination. The resultant genome tree was then annotated and visualized using Interactive Tree of Life (iTOL) v6.9.134 (Fig. 2).
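The high-quality definition above (≥90% completeness and ≤5% contamination) can be expressed as a simple filter. The sketch below is illustrative only; the function name, the tuple layout, and the example MAG identifiers and values are hypothetical and do not come from the dataset.

```python
def is_high_quality_mag(completeness, contamination):
    """Apply the thresholds stated in the text (both values in percent)."""
    return completeness >= 90.0 and contamination <= 5.0


# Hypothetical (identifier, completeness %, contamination %) records.
mags = [
    ("mag_a", 98.7, 0.4),
    ("mag_b", 91.2, 6.3),   # fails on contamination
    ("mag_c", 88.9, 1.0),   # fails on completeness
    ("mag_d", 90.0, 5.0),   # exactly on both thresholds, so it passes
]

high_quality = [name for name, comp, cont in mags
                if is_high_quality_mag(comp, cont)]
print(high_quality)
```

Both comparisons are inclusive, matching the "≥" and "≤" signs in the definition.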

Fig. 2

Genome tree of metagenome-assembled genomes (MAGs) and single-cell amplified genomes (SAGs). In total, 75 dereplicated high-quality MAGs and 21 medium-quality SAGs were included in the tree.


Single-cell genomics

Soil samples were shipped on dry ice to the Bigelow Laboratory for Ocean Sciences’ Single Cell Genomics Center (SCGC), and microbial cells were sorted following SCGC procedures (Fig. 1c). In brief, microbial particles were stained with the fluorescent DNA stain SYTO-9 (Thermo Fisher Scientific) and then sorted using an inFlux Mariner (Beckman Coulter). Each sample was sorted into two 384-well plates by fluorescence-activated cell sorting (FACS). The cells were lysed through two freeze-thaw cycles followed by KOH treatment, and genomic DNA was amplified using Whole Genome Amplification-X (WGA-X®)35. WGA Cp values indicate the time (in hours) required to reach half of the maximal DNA-SYTO-9 fluorescence. Each plate included 64 negative controls (no droplet) and 3 positive controls (10 cells per well).

Sequencing libraries were prepared using Nextera XT (Illumina), and sequencing was performed on an Illumina NextSeq2000 with P3 reagents (2 × 100 bp), resulting in approximately 1.1 billion paired-end reads. Initial quality trimming of raw reads was performed using Trimmomatic v0.3936, and de novo assembly was conducted using SPAdes v2.2.1037. Contigs shorter than 2,000 bp were removed. Completeness and contamination of SAGs were determined using CheckM2 v1.0.232. Contaminant contigs were identified by principal component analysis of nucleotide tetramer frequencies38. Functional annotation was performed with Prokka39, supplemented by the Swiss-Prot database.
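The contig length filter described above (removal of contigs shorter than 2,000 bp) can be sketched in a few lines of Python. This is a minimal illustration with a hand-written FASTA parser; the function names and the toy sequences are hypothetical and the sketch is not the SCGC workflow itself.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(records, min_length=2000):
    """Keep contigs of at least min_length bp (shorter ones are removed)."""
    return [(name, seq) for name, seq in records if len(seq) >= min_length]


# Toy assembly (hypothetical): one 2,500 bp contig split over two lines,
# one contig of exactly 2,000 bp, and one 1,999 bp contig.
toy_fasta = [
    ">contig_1", "A" * 1500, "C" * 1000,
    ">contig_2", "G" * 2000,
    ">contig_3", "T" * 1999,
]
kept = filter_contigs(read_fasta(toy_fasta))
print([name for name, _ in kept])
```

A contig of exactly 2,000 bp is retained because the text removes only contigs shorter than the cutoff.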

The quality of SAGs was determined based on the Minimum Information about a Single Amplified Genome (MISAG)40. SAGs with <50% completeness were considered ‘low quality’, while those with ≥50% completeness and <10% contamination were considered ‘medium quality’. The recovery of rRNA genes was checked with Barrnap v0.941. Taxonomic assignment of SAGs was performed using GTDB-Tk v2.4.033, and classification based on rRNA genes was carried out using the SILVA v138.1 database42.

Data Records

The PacBio HiFi and Hi-C sequencing reads, along with single-cell amplified genomes (SAGs), have been deposited in the NCBI Sequence Read Archive (SRA) under BioProject accession number PRJNA112633143. HiFi reads are available at SRR29483436 and SRR29483437, and Hi-C reads at SRR29496777 and SRR2949677843. High-quality MAGs are available at SRR30404090-SRR30463440 and SAGs at SRR29679156-SRR29679379 under the same NCBI BioProject43. Detailed information for MAGs (Supplementary Table 1), SAGs (Supplementary Table 2), and virus-host associations (Supplementary Table 3) has been deposited in figshare44 (https://doi.org/10.6084/m9.figshare.26157934).

Technical Validation

PacBio HiFi sequencing generates long reads with high accuracy (99.9% single-molecule read accuracy) using the circular consensus sequencing (CCS) method. The HiFi read length and quality distributions for the two samples demonstrate the generation of long, high-quality reads: 11,808 bp ± 4,112 bp (mean ± standard deviation) for Mount Melbourne, and 9,343 bp ± 3,139 bp for Mount Rittmann (Fig. 3a and d; Table 2). The MAGs that passed the HiFi-MAG-Pipeline criteria (≥70% completeness, ≤10% contamination, <20 contigs) include 113 MAGs from Mount Melbourne (Fig. 3b and c) and 89 MAGs from Mount Rittmann (Fig. 3e and f). Dereplication at 99% ANI yielded 57 high-quality MAGs.

Fig. 3

Quality validation for PacBio HiFi reads and recovered metagenome-assembled genomes (MAGs). PacBio HiFi read length and quality distribution for (a) Mount Melbourne and (d) Mount Rittmann. (b,e) show the relationship between completeness and contamination for all generated MAGs with marginal density plots, highlighting the quality pass criteria of completeness >70% and contamination <10%. (c,f) display the relationship between completeness and contamination for the MAGs that passed the quality criteria. (a), (b), and (c) represent data from Mount Melbourne, while (d), (e), and (f) correspond to data from Mount Rittmann.

Table 2 Summary of PacBio HiFi and Hi-C metagenomic sequencing dataset and single-cell amplified genomes (SAGs).

ProxiMeta generated 83 and 259 clusters (proximity-assembled genomes) for Mount Melbourne and Mount Rittmann, respectively (Table 2). Proximity ligation binning resulted in a 31.6% increase (from 57 MAGs to 75 MAGs) in the total number of high-quality MAGs compared to those obtained solely from the HiFi-MAG-Pipeline. Of these, 29 MAGs (39%) were unique to the HiFi-MAG-Pipeline, 18 MAGs (24%) were unique to ProxiMeta, and 28 MAGs (37%) were shared between the HiFi-MAG-Pipeline and ProxiMeta (Fig. 4a). ProxiMeta also identified 14 virus-host associations44, though no plasmids were linked to their host MAGs.
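The counts reported above are internally consistent, which the short Python check below makes explicit. It uses only the numbers given in the text; the variable names are ours.

```python
# Counts taken from the text.
hifi_unique = 29   # MAGs unique to the HiFi-MAG-Pipeline
hic_unique = 18    # MAGs unique to ProxiMeta
shared = 28        # MAGs recovered by both methods

total = hifi_unique + hic_unique + shared   # final high-quality set
hifi_pipeline_total = hifi_unique + shared  # HiFi-MAG-Pipeline alone

# Relative gain from adding proximity-ligation binning.
increase_pct = 100 * (total - hifi_pipeline_total) / hifi_pipeline_total

# Share of each category in the final set, rounded to whole percent.
shares = {
    "HiFi unique": round(100 * hifi_unique / total),
    "Hi-C unique": round(100 * hic_unique / total),
    "Both": round(100 * shared / total),
}

print(total, hifi_pipeline_total, round(increase_pct, 1), shares)
```

The 18 MAGs unique to ProxiMeta account for the whole increase from 57 to 75, that is, 18/57 ≈ 31.6%.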

Fig. 4

The final set of high-quality metagenome-assembled genomes (MAGs) obtained from both HiFi and Hi-C reads. (a) Categories of high-quality MAGs and (b) the relationship between completeness and contamination for MAGs from each method. In panel (a), ‘Hi-C unique’ refers to MAGs unique to ProxiMeta-based binning, ‘HiFi unique’ refers to MAGs unique to the HiFi-MAG-Pipeline, and ‘Both (Hi-C or HiFi)’ represents MAGs shared between the two methods but preferentially selected by one based on the MAG quality.


In single-cell genomics, we used WGA-X for genomic DNA amplification instead of multiple displacement amplification (MDA) to enhance genome recovery and reduce amplification biases42. SAGs were selected for sequencing based on their low Cp values, as genome recovery has been shown to increase with decreasing WGA-X Cp values (hours)35. Among 490 positive wells, 224 wells with Cp values less than 02:13 (h:min) were selected for genome sequencing, comprising 108 from Mount Melbourne and 116 from Mount Rittmann. The completeness of these SAGs ranged broadly from 3.69% to 80.38% (Fig. 5), consistent with findings from other soil studies45,46. Contamination levels were low, averaging 0.21%, with only one SAG above 10% and all others below 2%. Additionally, 135 out of 224 SAGs (60.3%) recovered 16S rRNA sequences, indicating a higher retrieval rate of 16S rRNA genes compared to previous single-cell genomics studies in soil45,47.
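The Cp-based selection of wells can be sketched as follows. This is a minimal example that assumes Cp values are recorded as "hh:mm" strings (our reading of the 02:13 cutoff); the well identifiers and Cp values are hypothetical.

```python
def cp_to_minutes(cp):
    """Convert a Cp string in 'hh:mm' form (assumed format) to minutes."""
    hours, minutes = cp.split(":")
    return int(hours) * 60 + int(minutes)


def select_wells(cp_by_well, cutoff="02:13"):
    """Return wells whose Cp is strictly below the cutoff, sorted by name."""
    limit = cp_to_minutes(cutoff)
    return sorted(well for well, cp in cp_by_well.items()
                  if cp_to_minutes(cp) < limit)


# Hypothetical plate excerpt: well identifier -> WGA-X Cp value.
plate = {
    "A01": "01:45",
    "A02": "02:13",   # equal to the cutoff, so not selected
    "A03": "02:12",
    "A04": "03:30",
}
selected = select_wells(plate)
print(selected)
```

Comparing values in minutes rather than as strings keeps the comparison correct even if a Cp value is written without zero padding.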

Fig. 5

Quality validation plots for single-cell amplified genomes (SAGs). Relationship between completeness and contamination for all generated SAGs with marginal box plots for (a) Mount Melbourne, and (b) Mount Rittmann.


Usage Notes

We sequenced 224 SAGs out of 490 positive wells, leaving more than half of the SAGs (266) unsequenced. We encourage researchers to use these stored SAGs for taxonomic identification based on the SSU rRNA gene or for genome sequencing. Despite advances in amplification chemistry, our SAGs still exhibit low genome recovery, although contamination remains low (<2% in all but one SAG; Fig. 5). This limitation can be mitigated by co-assembling multiple closely related SAGs, or by performing hybrid assemblies that combine SAGs with bulk metagenomic data obtained from the same samples48.