To test our hypotheses, we employed two molecular tests (HDMKPRF and LASSI Plus) to detect selection dynamics using population genomic data from three beetle species. Both methods rely on population-level genetic variation but differ primarily in their genomic focus (coding regions vs. genome-wide), and the timescales over which they have the most statistical power. HDMKPRF (high-dimension McDonald-Kreitman Poisson random field method, Zhao et al. 2019) is an extension of the classical MK-test (McDonald and Kreitman 1991) and the MKPRF method developed by Sawyer and Hartl (1992), improving the inference of directionality and the relative strength of selection along lineages unique to each species. LASSI Plus (Harris and DeGiorgio 2020), on the other hand, relies on the site frequency spectrum (SFS), and thus provides insights into more recent selection events. Moreover, HDMKPRF is limited to coding regions, while LASSI Plus detects selection across the entire genome, including the regulatory regions flanking genes. Finally, LASSI Plus estimates haplotype frequency spectrum statistics within sliding windows of population genomic data to detect soft and hard sweeps, with stronger power for the latter category due to the more pronounced signature they leave in the SFS (Harris et al. 2018).
The three Galerucella species (Coleoptera: Chrysomelidae) are closely related, with recent divergence times: G. pusilla and G. calmariensis diverged around 77,000 years ago while G. tenella diverged around 400,000 years ago (Fig. 1, Hambäck et al. 2013). G. pusilla and G. calmariensis are monophagous on Lythrum salicaria, whereas G. tenella is oligophagous with the primary host Filipendula ulmaria. The three beetle species have similar life cycles. Adults appear in the area in May and begin laying eggs on leaves or stems of their host plants. It takes a few weeks for the eggs to hatch, 2-3 weeks for the larvae to pupate, and another 2-3 weeks for the adults to emerge from the pupae. Adults then overwinter until next May. The geographic distribution in Sweden differs between species: G. pusilla occurs in the south up to central Sweden (62°N, 17°E) whereas G. calmariensis and G. tenella occur both in the south and north, along the entire Baltic seashore (Supporting Information, Fig. S1). Within regions, the species commonly co-occur at moderate to high densities, though variable between sites.
The three species share a common endoparasitoid wasp enemy Asecodes parviclava (Hymenoptera: Eulophidae), which lays one or more eggs in the beetle larvae (Stenberg and Hambäck 2010). No other parasitoid species attack these beetles in the region. When successfully parasitized, the wasp eggs hatch, and the wasp larvae turn the beetle larvae to a black mummy containing the wasp pupae. However, if the beetles manage to defend themselves, their immune system encapsulates and kills the wasp eggs, allowing the host larvae to continue growing and developing (Fors et al. 2014). Previous studies show that the beetle species differ in their capacity to mount an effective defence against parasitoid attack. While G. pusilla exhibits a strong ability to encapsulate wasp eggs, encapsulation is rarely observed in G. calmariensis and occurs at an intermediate frequency in G. tenella (Fors et al. 2016; Fors et al. 2014). These differences also correspond to selection by wasp females in response to larval odour cues (Fors et al. 2018). The parasitism rates vary between sites and most notably is higher in the northernmost sites, often >50%, compared to southernmost sites, often <10%.
We collected 45 adult individuals, 15 samples from each Galerucella species, during May and June 2019 from the following sites: three G. calmariensis populations: Norrfjärden (62°3'28''N, 17°26'18''E), Våtnäs (61°32'93''N, 17°12'77''E) and Hölick (61°37'22''N, 17°27'18''E); three G. pusilla populations: Rastsjön (60°6'36''N, 17°53'97''E), Lörudden (62°14'14''N, 17°39'12''E) and Haversjön (59°2'31''N, 17°9'49''E); three G. tenella populations: Umeå-1 (63°46'72''N, 20°36'00''E), Umeå-2 (63°46'36''N, 20°37'48''E) and Umeå-3 (63°47'18''N, 20°35'89''E). For each population, five individuals were sampled.
All individual samples were snap-frozen in liquid nitrogen and stored at -80° before DNA extraction. Genomic DNA were extracted from whole adult bodies using KingFisher™ Cell and Tissue DNA Kit following the "DNA Extraction from Single Insects" sample preparation protocol. After extraction, DNA concentrations were measured with a Qubit 3.0 Fluorometer using the dsDNA HS Assay Kit (Thermo Fisher Scientific) and a Nanodrop 8000 to ensure an absorbance ratio at 260/280 between 1.7 and 2. We estimated DNA fragmentation using agarose gel electrophoresis stained with 2% GelRed and only retained samples with minimal degradation. Library preparation was performed with the Illumina TruSeq DNA PCR-free library preparation kit, followed by paired-end 2×150-bp sequencing on a NovaSeq6000 platform at SciLifeLab, Sweden. Library preparation failed for one G. calmariensis sample from Våtnäs and this sample was excluded from downstream analysis. The total number of samples with whole-genome resequencing data was thus 44 and we generated 1.6 billion reads of sequence data ( > Q30) in total (out of 1.8 billion reads), corresponding to an average of 34.8 million reads per sample.
We assessed the quality of the resequencing data using FastQC v0.11.55 before and after filtering, retaining only reads ≥50 bp with a quality score >30 in both read start and end. All sequence reads were mapped against the Galerucella calmariensis reference genome, which was the least fragmented genome (Yang et al. 2021), using NextGenMap version 0.4.12 (Sedlazeck et al. 2013). The reference genome had an assembly size of 588 Mbp, containing 39,255 scaffolds and 40,031 predicted proteins with 91.3% and 85.1% complete orthologs in the genome and proteome, respectively, compared with the endopterygota_odb10 database (Simão et al. 2015) (For further info on the reference genome assembly see Yang et al. 2021). Mapping rates were consistent between samples (85% to 95%). We filtered the resulting bam files with Samtools v1.3.1 (Li et al. 2009) to retain alignments with mapping quality>20 (-q 20).
We next called SNPs across all samples using FREEBAYES v0.9.21 (Garrison and Marth 2012). For SNP filtering of all sites, we only kept bi-allelic sites with a minimum read depth of 5X, a quality score >30 and a maximum proportion of missing data of 20%. Genetic diversity (nucleotide polymorphism, π) was estimated for each species using pixy (Korunes and Samuk 2021). Independent of these steps, we performed a PCA analysis to assess population genetic structure across populations within each species. For this purpose, we first conducted LD-based pruning of our high quality SNPs (--indep-pairwise 50 10 0.2), followed by a principal component analysis (PCA) using Plink v1.9 (Purcell et al. 2007) across all the samples and for each species separately (Figures S2 and S3).
To detect selection, we employed two molecular tests that rely on population level genetic variation but primarily differ in their genomic focus (coding regions vs. genome-wide), and the time scale over which they have the most statistical power. For coding regions, we used HDMKPRF (Zhao et al. 2019), an extension of MKPRF to analyse selection across multiple species (Bustamante et al. 2001; Sawyer and Hartl 1992). A comprehensive justification for using HDMKPRF to detect genes under selection can be found in Okamura et al. (2022); however, a key advantage is its higher power for detecting weak and moderate selection. Compared to the classical MK-test (McDonald and Kreitman 1991), the greater power of both MKPRF and HDMKPRF arises from adopting a Poisson random-field framework (Sawyer and Hartl 1992), where per gene selection intensities are estimated using a Bayesian approach that combines information across multiple loci and derive posterior distributions. The concept of selection intensities is based on Bustamante et al. (2001) and is akin to a neutrality index measure (Hahn 2019). HDMKPRF improves the MKPRF-test by simultaneously analysing polymorphism and divergence across multiple species, allowing the test to determine in which lineage selection occurred (Zhao et al. 2019). Additionally, HDMKPRF estimates population genetic parameters for each species, such as effective population sizes and mutation rates, and uses these to account for lineage specific differences when estimating selection intensity per locus (for details see Zhao et al. 2019).
To detect positive selection outside the coding regions of genes, we used a maximum likelihood analysis of the haplotype frequency spectrum across the genome to identify putative targets of positive selection via signatures of both soft and hard sweeps. For this, we employed LASSI Plus (Harris and DeGiorgio 2020) and the saltiLASSI statistic (DeGiorgio and Szpiech 2022). This approach is capable of using unphased sequencing data to infer haplotypes and identify genomic regions within population samples that exhibit greater than expected changes in their haplotype allele frequencies, given the background genomic patterns assumed to represent neutrality. This method can estimate both the likelihood of a given haplotype sweeping and the inferred width and number of haplotypes sweeping within a given species. LASSI Plus estimates haplotype frequency spectrum statistics in sliding windows of population genomic data to detect soft and hard sweeps, with stronger power for the latter category due to the more dramatic signature left in the SFS (Harris et al. 2018).
Multiple consensus sequences of coding sequences (CDS) for all samples were extracted using bam2consensus function from BamBam v1.4 (Borowiec 2016), allowing a minimum read coverage per site of 4X. BamBam uses individual bam files, mapped to the G. calmariensis draft de-novo genome, and extracts consensus sequences for each CDS region based upon the genome annotation. We then assessed summary statistics of the consensus sequences using AMAS v1.0 (Borowiec 2016). Only CDS regions longer than 300 bp and with a low proportion of missing values (<10%) were retained for downstream analysis (N = 11,368).
To limit our analysis to orthologous loci among the three species, we assessed orthology between the G. calmariensis protein sets and those of three other Coleoptera species (Asian long-horned beetle [Anoplophora glabripennis], red flour beetle [Tribolium castaneum] and mountain pine beetle [Dendroctonus ponderosae]) using OrthoVenn2 (Xu et al. 2019) with default settings. A total of 4591 single-copy-orthologs (SCO) were identified in G. calmariensis. These SCOs were cross-references from the previously identified high-quality consensus sequences, resulting in 4154 SCOs for downstream analysis.
Estimates of selection dynamics using synonymous and nonsynonymous polymorphisms and divergence across species can be confounded by rare variants. Such variants could represent weakly deleterious mutations or technical artifacts (e.g., sequencing error). We therefore quantified α, the contribution of positive selection to amino acid divergence in genes, for both our full dataset and after singleton removal. The latter increased the estimated proportion of positively selected genes in all three species and reduced indices of negative selection (Table 1). To balance the concerns of quality filtering versus over-filtering through the additional removal of rare variants, we proceeded with selection inferences using the singletons removed dataset. In G. calmariensis, we primarily detected positive selection (α > 0) and a higher proportion of positively selected genes. Conversely, in G. tenella, weak negative selection (α < 0) was more prevalent, even after singleton removal, with a higher proportion of negatively selected genes.
The script (Table S1) for performing the HDMKPRF to derive selection intensities was kindly provided by Zhao et al. (2019) The input data for the analysis included all 4154 SCOs and the analysis was implemented by first running 200,000 burn-in steps, followed by estimating posterior parameter distributions from 400,000 steps in a Markov chain Monte Carlo process with a thinning interval of 5, based on the author's recommendations. A gene was considered to be under positive selection when the 95% posterior credibility interval for the selection intensity was >0, and under negative selection when the interval was <0. Because estimates of positive selection using population resequencing data are usually biased downward by the segregation of slightly deleterious mutations and some singleton errors (Charlesworth and Eyre-Walker 2008), we removed singleton polymorphisms from all gene sets using a custom script (Sattath et al. 2011). We then tested the effect on the power of detecting adaptive evolution by comparing HDMKPRF results for gene sets before and after singleton removal.
To specifically analyse immune genes, we first used BLASTP to identify candidate genes from a previous RNA-seq study (Yang et al. 2020) in the G. calmariensis proteome. This gene set contains 166 genes suggested to play a role in the immune response to parasitoid wasp attacks in Drosophila (Table S2), subdivided into seven functional immune gene categories: recognition (N = 17), signalling (N = 35), effector (N = 21), proteases (N = 35), haematopoiesis (N = 31), melanisation (N = 18), and wound healing (N = 9). The threshold used in BLASTP was an E-value ≤ 1 × 10 and a bitscore>60, which identified a set of 96 immune genes in our genomic dataset. When multiple hits were recovered during the BLAST search, we used the one with the highest bitscore. When incorporating these genes into our SCO gene analysis, only 40 of the immune genes passed the conservative threshold (they are part of the 4154 SCOs gene set). To further focus on the 96 immune genes, we conducted a second HDMKPRF analysis. Given that this second analysis was run on a smaller gene set, HDMKPRF has reduced power. Therefore, we compared the estimated selection intensities for the 40 SCOs between the two analyses and found them to be almost identical, suggesting this set size was sufficient for accurate model parameter estimates.
To avoid reference bias in the analysis, we first aligned the short read sequencing data from 15 individuals of each species to their respective reference genomes (Yang et al. 2021) using the bwa-mem2 v.2.0pre2 (Vasimuddin et al. 2019). Variants were then called using bcftools v.1.13-35-ge3ba077 to generate an all-site VCF (Danecek et al. 2021). The resulting VCFs were filtered for low-quality calls (QUAL > 30), read depth (5-50), the exclusion of indels and no more than 2 alternative alleles per site. Inferences of selective sweeps were made using the salti statistic in the LASSI Plus software package (k = 10, window size=52 SNPs, step size=12). To identify outlier windows across the genome (i.e, regions likely to have experienced a sweep), we extracted all windows with a salti statistic (L) greater than 4 standard deviations above the mean. L is a composite likelihood ratio test statistic indicating that the haplotype frequency spectrum in a given window is distorted relative to genomic background (for statistical code see Table S3).
Finally, for illustrative purposes to compare the distribution and location of selective sweeps between species while avoiding reference bias from aligning samples to a single reference, we scaffolded each species genome against a common chromosome-level assembly from a species in a sister genus, the beetle Lochmaea crataegi (NCBI: GCA_947563755.1). Scaffolding was performed using Ragtag v.2.1.0, with default settings, using minimap2 (Alonge et al. 2022), and alignments were filtered to remove any contigs shorter than 50 kb.
Comparisons of genome-wide targets of positive selection across species were conducted using two approaches. First, we identified genes within any of the detected sweep regions by intersecting the outlier sweep locations with gene annotations using bedtools intersect v2.27.1 (Quinlan and Hall 2010), for each species genome (Yang et al. 2021). Protein sequences of any gene, from start to stop, overlapping any sweep region interval, were then retrieved for each species. We assessed the overlap of the protein sets identified in the three species using Orthovenn2 (Xu et al. 2019), which clusters proteins based on sequence similarity. This approach allows us to identify potential targets of selection shared across the three species, without requiring selection on the exact same gene region, since this analysis accommodates independent members of a gene family being targeted. To determine whether any species had a higher proportion of immune genes among the putative targets of positive selection, we also included the 96 candidate immune genes previously identified in G. calmariensis in the Orthovenn2 analysis. Second, we repeated the analysis but extended the candidate gene region by including 5 kb on either side of the gene body (i.e., 5 kb upstream of the start and downstream of the stop) to capture potential signatures of positive selection in regulatory regions. We chose this 10 kb flanking region to account for regulatory evolution based on evidence from other insect groups (Ghavi-Helm et al. 2014; Lewis and Reed 2019).