4. Whole-genome population and association studies

4.6. Population genomics: Experimental design

Whereas genome-wide association studies (GWAS) link SNPs to specific traits, population genomics identifies changes in the genetic composition of populations. To do this, it analyzes hundreds to thousands, even millions, of markers (usually SNPs) across multiple individuals and populations to unravel microevolutionary patterns and processes. Because high-throughput sequencing and genotyping technologies typically generate massive amounts of data, population genomics has evolved from classic population genetics (in which the number of markers was on the order of tens) into a data-driven, computational science that requires substantial computing resources, memory, and bioinformatics expertise.

In honey bees, as in many other organisms, population genomics provides important insights into the bees’ demographic and adaptive history (Chen et al., 2016; Cridland et al., 2017; Fuller et al., 2015; Harpur et al., 2014; Henriques, Browne, et al., 2018; Nelson et al., 2017; Wallberg et al., 2014) and into the genomic basis of important traits, such as tolerance to diseases and parasites (Saelao et al., 2020), royal jelly production (Rizwan et al., 2020; Wragg et al., 2016), and defensive, scouting, and recruiting behaviors (Avalos et al., 2020; Harpur et al., 2020; Southey et al., 2016).

4.6.1. Sampling strategy

The first step in a population genomics study is to design the sampling strategy. Before collecting the samples, several sampling-related aspects must be considered, including (i) sample sizes of individuals and markers, (ii) sample breadth, (iii) sampling design, (iv) sampling workers versus drones, and (v) sampling a single individual versus multiple individuals per colony. Final decisions are greatly constrained by the study's goals (e.g., estimating diversity within colonies or at the population level, inferring population structure, finding signatures of local adaptation, or landscape genomics), the complexity of the genetic patterns of the focal subspecies, and the budget available.

4.6.1.1. Sample sizes of individuals and markers

Simulation studies have shown that population genetics inference (e.g., diversity, demographics, differentiation, or gene flow) is influenced by the sample sizes of markers and individuals (Aguirre-Liguori et al., 2020; Flesch et al., 2018; Foster et al., 2021; Landguth et al., 2012). While increasing both simultaneously produces more robust estimates, empirical and simulation studies have consistently shown that accuracy benefits far more from increasing the number of markers than from increasing the number of individuals (Landguth et al., 2012; Nazareno et al., 2017; Willing et al., 2012).
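The trade-off between markers and individuals can be explored with a toy simulation. The sketch below is not taken from the cited studies: it assumes independent biallelic SNPs whose true allele frequencies are drawn uniformly between 0.05 and 0.95, and it measures how much a genome-wide estimate of expected heterozygosity varies across replicate samples. The function names (`estimate_mean_het`, `spread`) and all parameter values are illustrative assumptions.

```python
import random
import statistics


def estimate_mean_het(n_individuals, n_markers, rng):
    """Mean expected heterozygosity estimated from one simulated sample.

    Assumptions (illustrative only): diploid individuals, independent
    biallelic SNPs, true allele frequencies uniform on [0.05, 0.95].
    """
    n_chrom = 2 * n_individuals  # gene copies sampled per locus
    total = 0.0
    for _ in range(n_markers):
        p_true = rng.uniform(0.05, 0.95)
        # Draw each sampled gene copy independently (binomial sampling).
        k = sum(rng.random() < p_true for _ in range(n_chrom))
        p_hat = k / n_chrom
        # Gene diversity with the usual n/(n-1) small-sample correction.
        total += 2 * p_hat * (1 - p_hat) * n_chrom / (n_chrom - 1)
    return total / n_markers


def spread(n_individuals, n_markers, replicates=30, seed=1):
    """Standard deviation of the estimate across replicate samples."""
    rng = random.Random(seed)
    estimates = [
        estimate_mean_het(n_individuals, n_markers, rng)
        for _ in range(replicates)
    ]
    return statistics.stdev(estimates)


if __name__ == "__main__":
    print("baseline   (8 ind, 20 SNPs): ", spread(8, 20))
    print("10x inds   (80 ind, 20 SNPs):", spread(80, 20))
    print("10x SNPs   (8 ind, 200 SNPs):", spread(8, 200))
```

Under these assumptions, a tenfold increase in markers narrows the spread of the estimate more than a tenfold increase in individuals, because locus-to-locus variation dominates the error. Real data include linkage, population structure, and ascertainment effects that this sketch ignores.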

To the best of our knowledge, there are no simulation studies on the optimal sample sizes for population genetics or genomics inquiries in honey bees. However, inferring from studies on other organisms, when the number of markers is on the order of hundreds to thousands (which is becoming commonplace in the post-genomics era), around eight individuals per population suffice for accurate estimates of diversity and differentiation (Aguirre-Liguori et al., 2020; Flesch et al., 2018; Li et al., 2020; Nazareno et al., 2017).

In honey bee studies, however, sample sizes have typically been larger. They have ranged from n = 9 to 87 individuals per group in population genomics studies (Avalos et al., 2020; Chen et al., 2016; Harpur et al., 2014; Henriques, Wallberg, et al., 2018; Nelson et al., 2017; Wallberg et al., 2014; Wragg et al., 2016) and from n = 12 to 117 when developing SNP assays (Chapman et al., 2015; Henriques, Parejo, et al., 2018; J. C. Jones et al., 2020).

While empirical evidence from honey bee studies is lacking, the optimal number of sampled individuals will certainly vary among subspecies, depending on their distributional range or evolutionary complexity (Henriques, Parejo, et al., 2018). For example, the optimal sample size for A. m. ruttneri, a subspecies confined to the small island nation of Malta, is expected to be much smaller than that of A. m. mellifera, a subspecies with one of the greatest geographical distributions, or of A. m. iberiensis, a subspecies with a complex history involving natural hybridization. Finally, it is also important to keep in mind that unequal population samples can impact the outcomes of some analytical approaches, such as inferences on population structure (Puechmaille, 2016).

4.6.1.2. Sample breadth

In addition to the sample size, when sampling honey bees for population genomics studies, it is also important to consider the number of sampled populations, or breadth (coverage of distributional range), which can influence inferences for classic population genetics as well as landscape genetics or outlier tests (Aguirre-Liguori et al., 2020; Albert et al., 2010; Nazareno et al., 2017; Schwartz & McKelvey, 2009). In honey bees, Henriques et al. (2018) showed that sampling a geographically restricted area within the A. m. iberiensis distributional range would erroneously identify a set of SNPs as having a fixation index of one (F~ST~ = 1) between this and C-lineage subspecies, with an impact on the design of reduced panels of highly informative SNPs. Because the true F~ST~ values are < 1, these SNPs are not actually diagnostic and therefore carry less information than the restricted sample suggests. When the goal of the study is to find genomic evidence of local adaptation, it is critical to ensure that populations (or individuals) are sampled across environmental gradients (e.g., latitudinal or altitudinal) and environments (e.g., arid and humid), for increased power in detecting outlier SNPs and therefore candidate genes (Manel et al., 2012).
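The effect of sample breadth on apparently diagnostic SNPs can be illustrated with a small worked example. The sketch below uses Wright's F~ST~ for a single biallelic SNP, computed as (H_T - H_S) / H_T with equal weight given to both populations. The allele counts are invented for illustration and are not data from the cited study, and the function names are assumptions.

```python
def allele_freq(allele_count, n_chromosomes):
    """Frequency of the focal allele among the sampled gene copies."""
    return allele_count / n_chromosomes


def fst_two_pops(p1, p2):
    """Wright's F_ST for one biallelic SNP and two equally weighted populations."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)                      # H_T
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # H_S
    if h_total == 0:
        return 0.0  # monomorphic SNP: F_ST is undefined, reported here as 0
    return (h_total - h_within) / h_total


# Hypothetical counts of the focal allele (invented numbers).
# Restricted sampling of the focal subspecies: 20 of 20 gene copies carry it.
# Broad sampling of the same subspecies:       54 of 60 gene copies carry it.
# Comparison subspecies:                        0 of 20 gene copies carry it.
restricted = fst_two_pops(allele_freq(20, 20), allele_freq(0, 20))
broad = fst_two_pops(allele_freq(54, 60), allele_freq(0, 20))

print(f"restricted sampling: F_ST = {restricted:.3f}")  # 1.000
print(f"broad sampling:      F_ST = {broad:.3f}")       # 0.818
```

In this invented case, the SNP looks fixed between groups only because the restricted sample missed the part of the range where the alternative allele segregates.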

4.6.1.3. Sampling design

The sampling design will depend on the individuals' spatial distribution and the study's goal, among other factors (Paradis, 2020). For example, if the goal is to find signatures of selection in honey bee populations, sampling should cover environmental gradients. When sampling encompasses pairs of populations that maximize the environmental differences while minimizing the evolutionary differences, there is great potential for detecting selection footprints (Delaneau et al., 2012; Lotterhos & Whitlock, 2015). This approach was followed in a study of honey bee adaptation to altitude in East Africa (Wallberg et al., 2017). By sampling two pairs of populations representing mountain forests and lowland savannahs, the authors could detect strong candidates for adaptation to highland habitats in whole-genome scans.

Several sampling designs can be used by honey bee researchers. For continuously distributed populations, random and systematic sampling designs (e.g., transects or grids) have proven to be effective in simulation studies (Oyler-McCance et al., 2013). The systematic design was implemented in Iberia to unravel the genetic diversity patterns and underlying processes of A. m. iberiensis, via the establishment of three north-south transects. With this design, the authors uncovered a secondary contact zone in Iberia and found evidence for selection as another evolutionary force shaping complex genetic patterns in A. m. iberiensis (Chávez-Galarza et al., 2013, 2015; Henriques, Wallberg, et al., 2018). If, on the other hand, the distribution is patchy, cluster sampling (sampling several groups of individuals) is more appropriate (Oyler-McCance et al., 2013).
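A systematic transect design can be laid out programmatically before fieldwork. The sketch below places evenly spaced sites along north-south transects. The longitudes, latitude range, and number of sites are arbitrary illustrative values, not the coordinates used in the Iberian studies, and `transect_sites` is a hypothetical helper.

```python
def transect_sites(longitudes, lat_south, lat_north, sites_per_transect):
    """Return (longitude, latitude) pairs evenly spaced along N-S transects.

    Each longitude defines one transect; sites run from lat_south to
    lat_north inclusive.
    """
    if sites_per_transect < 2:
        raise ValueError("need at least two sites to span a transect")
    step = (lat_north - lat_south) / (sites_per_transect - 1)
    return [
        (lon, lat_south + i * step)
        for lon in longitudes
        for i in range(sites_per_transect)
    ]


# Illustrative values only: three transects, five sites each.
sites = transect_sites([-8.0, -4.0, 0.0], 37.0, 43.0, 5)
for lon, lat in sites:
    print(f"lon={lon:6.1f}  lat={lat:5.1f}")
```

In practice, the planned sites would then be matched to the nearest accessible apiaries, so the realized design is only approximately regular.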

4.6.1.4. Sampling workers versus drones

The choice of sampling workers (diploid) over drones (haploid) will depend on the questions being addressed and, thereby, on the type of analyses that will be performed (see Section 4.2 for more on general ploidy and pooling considerations). For example, Hardy-Weinberg Equilibrium (HWE) testing can only be done with individual diploid workers. Other analyses do not require HWE testing, and for these, haploid data can be advantageous. In addition to reducing sequencing costs, drones generate phased data, thereby circumventing the hurdles of statistical phasing. Phased genomic data offer increased power for detecting selection signatures because they enable haplotype-based methods.
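HWE testing requires diploid genotypes because it compares observed and expected genotype counts, including heterozygotes, which haploid drones cannot show. A minimal sketch of a chi-square HWE test for one biallelic SNP follows. The genotype counts are hypothetical, the function name is an assumption, and for small samples or rare alleles an exact test is generally preferable to this approximation.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions for one biallelic SNP.

    Takes observed counts of the three diploid genotypes and returns
    (chi2, p_value), using one degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    # Skip classes with zero expectation (monomorphic SNPs).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical worker genotype counts: AA, AB, BB.
chi2, p_value = hwe_chi_square(30, 40, 30)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

The p-value uses the identity that, with one degree of freedom, the chi-square tail probability equals `erfc(sqrt(chi2 / 2))`, which avoids a dependency on an external statistics library.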

4.6.1.5. Sampling a single individual versus multiple individuals per colony

Classic population genetics analyses (e.g., HWE, diversity, differentiation, structure) require sampling a single diploid worker per colony. Sampling multiple individuals from the same colony would over-represent the queen’s genotype and violate the assumption of sample independence; taking one worker per colony avoids both problems. However, when the study addresses questions at the intra-colony level (e.g., patriline analysis, colony structure), sampling multiple individuals is an unavoidable requirement. The number of sampled individuals per colony can vary, depending on the questions being addressed and on budget limitations. For example, the numerous studies published in the pre-genomics era, which required patriline analysis, typically sampled tens of workers per colony (61 ± 72.6; see the review of Tarpy et al. (2004)). When multiple individuals are sampled from within a colony, they have historically been genotyped separately, but pooling is becoming increasingly popular in the post-genomics era (see Section 4.2).
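When a dataset already contains several workers per colony, it can be reduced to one worker per colony before population-level analyses. The sketch below draws one sample per colony at random with a fixed seed for reproducibility. The sample and colony identifiers are invented, and `one_per_colony` is a hypothetical helper rather than part of any existing package.

```python
import random
from collections import defaultdict


def one_per_colony(samples, seed=0):
    """Pick one sample ID per colony at random.

    `samples` is an iterable of (sample_id, colony_id) pairs. Returns a
    dict mapping each colony_id to the chosen sample_id.
    """
    by_colony = defaultdict(list)
    for sample_id, colony_id in samples:
        by_colony[colony_id].append(sample_id)
    rng = random.Random(seed)
    # Sort colonies and IDs so the result depends only on the seed,
    # not on the order of the input records.
    return {
        colony: rng.choice(sorted(ids))
        for colony, ids in sorted(by_colony.items())
    }


# Invented identifiers for illustration.
samples = [
    ("W01", "C1"), ("W02", "C1"), ("W03", "C1"),
    ("W04", "C2"),
    ("W05", "C3"), ("W06", "C3"),
]
chosen = one_per_colony(samples, seed=42)
for colony, sample_id in chosen.items():
    print(colony, sample_id)
```

For intra-colony questions such as patriline analysis, this filtering step would be skipped and all workers retained.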