4. Whole-genome population and association studies

4.1. Introduction

When considering whole-genome analyses, one major advantage of working with A. mellifera, A. cerana or even A. dorsata is that their genomes are very compact, only ~223-228 Mb long (Oppenheim et al., 2020; Wallberg et al., 2019; Wang et al., 2020). In comparison, the honey bee parasite Varroa has a larger genome of 368 Mb (Cornman et al., 2010; Techer et al., 2019), while the human genome, for instance, is larger still at 3055 Mb (Nurk et al., 2022). As a result, it quickly became evident that the cost-to-benefit ratio of whole-genome sequencing for honey bee applications was excellent compared with other approaches. For instance, SNP chips only allow the study of between 10,000 and 100,000 markers, for a price that is not much lower than that of whole-genome sequencing (between one third and half the price at the time of writing), whereas whole-genome sequencing enables the detection of several million markers. Other disadvantages of SNP chips are that the choice of markers is biased, and that the high density of SNPs and indels in the A. mellifera genome complicates chip design (allele drop-out often occurs because of neighboring polymorphisms).
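The cost-to-benefit argument above can be made concrete with a small back-of-the-envelope calculation. The sketch below compares markers obtained per unit cost, taking the price of whole-genome sequencing as 1 and the SNP chip price as one third to one half of that, as stated in the text. The figure of 3,000,000 markers for whole-genome sequencing is an illustrative assumption standing in for "several million", and the function name is hypothetical.

```python
from fractions import Fraction

def markers_per_unit_cost(n_markers, relative_price):
    """Markers obtained per unit of cost (WGS price is defined as 1)."""
    if relative_price <= 0:
        raise ValueError("relative_price must be positive")
    return n_markers / Fraction(relative_price)

# Assumption: "several million" markers is taken as 3 million for illustration.
WGS_MARKERS = 3_000_000
wgs = markers_per_unit_cost(WGS_MARKERS, Fraction(1))

# Most favourable chip scenario: densest chip (100,000 markers) at one third the price.
chip_best = markers_per_unit_cost(100_000, Fraction(1, 3))
# Least favourable chip scenario: sparsest chip (10,000 markers) at half the price.
chip_worst = markers_per_unit_cost(10_000, Fraction(1, 2))

ratio_low = wgs / chip_best    # WGS advantage against the best chip scenario
ratio_high = wgs / chip_worst  # WGS advantage against the worst chip scenario
print(f"WGS yields {float(ratio_low):.0f}x to {float(ratio_high):.0f}x more markers per unit cost")
```

Under these assumptions, whole-genome sequencing returns between 10 and 150 times more markers per unit cost than a SNP chip; the exact factor depends on the number of variants actually detected in a given dataset.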

Nevertheless, the whole-genome sequencing approach has its drawbacks, namely complex and intricate analysis pipelines and longer computation times. While bioinformatics is included in most molecular biology and ecology curricula, the steep learning curve can be discouraging for new users without proper guidance (Carvalho & Rustici, 2013). Additionally, whole-genome computations should ideally be carried out on remote-access high-performance computing (HPC) clusters or local workstations. Whole-genome sequencing may thus not be ideal when rapid answers to questions such as parentage testing or subspecies assignment are needed. Moreover, data storage can be a critical issue: on average, 2-3 GB are needed per sample for the raw sequencing files (FASTQ: text-based files that contain the nucleotide sequences and associated quality scores) and 3-4 GB for the corresponding alignment files (BAM: binary alignment map files), while the files reporting genomic sequence variation (VCF: variant call format files) can reach up to one terabyte, depending on the dataset size.
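To plan storage before a project starts, the per-sample figures above can be turned into a rough estimate. The following is a minimal sketch, reading the quoted sizes as gigabytes and using 1 TB = 1000 GB; the function name, the example of 500 samples and the 1 TB VCF allowance are illustrative assumptions, not recommendations.

```python
# Per-sample size ranges in GB (low, high), taken from the figures quoted in the text.
FASTQ_GB = (2.0, 3.0)  # raw sequencing reads
BAM_GB = (3.0, 4.0)    # alignment files

def storage_estimate_gb(n_samples, vcf_gb=0.0):
    """Return a (low, high) estimate of total storage in GB.

    vcf_gb is a single allowance for the dataset-wide variant files,
    since their size depends on the dataset rather than on each sample.
    """
    if n_samples < 0 or vcf_gb < 0:
        raise ValueError("n_samples and vcf_gb must be non-negative")
    low = n_samples * (FASTQ_GB[0] + BAM_GB[0]) + vcf_gb
    high = n_samples * (FASTQ_GB[1] + BAM_GB[1]) + vcf_gb
    return low, high

# Hypothetical project: 500 samples plus a 1 TB allowance for VCF files.
low, high = storage_estimate_gb(500, vcf_gb=1000.0)
print(f"Estimated storage: {low / 1000:.1f}-{high / 1000:.1f} TB")
```

For this hypothetical project the estimate is 3.5-4.5 TB, before accounting for intermediate files or backups, which are not covered by the figures in the text.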

Until now, most sequencing results have been obtained with paired-end Illumina sequencing. With the recent progress in long-read sequencing, producing reference-quality assemblies for several individuals, ideally from different subspecies within a targeted Apis species, could become a future standard. This would allow the construction of pan-genome graphs, which take inter-individual variation into account at the reference genome level. Long-read sequencing of individual samples will also allow the detection of large structural rearrangements and the characterization of repeat landscapes, neither of which can be analysed accurately with short reads.