Online Resource: Varroa Host Switch Genomics and Demography

This online resource is there to provide additional interactive visualization tool for the Varroa mites sampling and sequencing data obtained from the associated paper “Techer, M. A., Roberts, J. M. K., Cartwright, R. A., & Mikheyev, A. S. (2020). The first steps toward a global pandemic: Reconstructing the demographic history of parasite host switches in its native range. https://www.researchsquare.com/article/rs-196900/v1

A Snakemake pipeline created and used for genomics mapping, variant calling, and analysis is provided in this host GitHub repository. Most genomics and demographic analysis using fastsimcoal2 were computed on the Okinawa Institute of Science and Technology OIST cluster, Sango.

Click on the “code” button on the right to make appear the libraries used and R code for each section.

#packages necessary to run the code in order of need
library("knitr") # for the markdown
library("rmdformats")
library("tidyverse") # for reading table and parsing data
library("ggplot2")  # for general plotting 
library("rgdal") # for maps and using the function "readOGR"
library("leaflet") # to create interactive maps
library("plotly") # for interactive plots
library("multcompView") # for putting letters on Anova
library("boot") # for computing bootstraps on demographic parameters

## Global options
Sys.setenv('R_MAX_VSIZE'=100000000000)
options(max.print="10000")
opts_chunk$set(echo=FALSE,
                 cache=TRUE,
               prompt=FALSE,
               tidy=TRUE,
               comment=NA,
               message=FALSE,
               warning=FALSE)
opts_knit$set(width=75)
setwd("~/Documents/GitHub/PopGenomics_notes/Host_Switch_Oct2020")

Varroa Asian extended dataset n = 63

Sampling distribution

We sequenced 31 adult females V. destructor and 32 V. jacobsoni throughout Varroa mites native ranges where original and novel hosts occur in sympatry.

We import the metadata of the samples here with following information:
ID = Name code of the samples obtained from the CSIRO varroa collection
Species = Identified species according to mitogenomes barcoding and analysis
Host = Varroa mites were collected from within honey bee colonies identified by the collector as either Western honey bee Apis mellifera or Asian honey bee Apis cerana.
Country/Date = Exact informations on sampling
Site = Code name of each sampling site corresponding on the map Latitude/Longitude = GPS coordinates from samples obtained before 2008 are inferred from the approximate location given. All samples geocoordinates after 2008 were obtained directly from field collection.

Lineage = mtDNA haplogroup identified from mtDNA COX1 partial region (reconstruct from HiSeq4000 reads and confirmed by Sanger sequencing).
Haplotype = mtDNA haplotype name by concatenating four genes (COX1, COX3, ATP6 and CytB)

Host_Read_Identity = Host DNA identified by the number of reads mapped against either A. cerana or A. mellifera
Reads_mellifera/cerana = Number of reads mapped against the reference genome of A. cerana or A. mellifera (Q > 20)

OGR data source with driver: ESRI Shapefile 
Source: "/Users/alphamanae/Documents/GitHub/varroa-host-jump", layer: "TM_WORLD_BORDERS_SIMPL-0.3"
with 246 features
It has 11 fields
Integer64 fields read as strings:  POP2005 

The following map is interactive. If you pass your mouse over a particular point, a pop-up will present essential information about the specific sample. On the right top corner, a layer control button allows sub-selecting species/host couple.

V. destructor on A. cerana
V. destructor on A. mellifera
V. jacobsoni on A. cerana
V. jacobsoni on A. mellifera

Figure OR1: Interactive sampling map of V. destructor and V. jacobsoni distribution on their original and new hosts in their native range.

Reads Mapping and Coverage

Commands used for each analysis step are available on our Snakemake script. Briefly, we assessed demultiplexed fastq reads quality using FastQC. We then mapped reads to the V. destructor reference genome on NCBI [GCF_002443255.1] separately from the complete mitogenome [NC_004454.2] using the soft-clipping and very sensitive mode of NextGenMap v0.5.0 (following a comparison with Bowtie2 v2.6). Reads were sorted, and duplicates were removed using SAMtools and subsampled to a maximum coverage of 200 using VariantBam to speed up processing. Mapping rates and reads depths were computed from the generated BAM files.

The mean read depth was computed from each .bam file mapped to the reference vdes3.0.fasta using samtools depth -a FILE.bam | awk '{c++;s+=$3}END{print s/c}'.

Figure OR2: Mapping quality and coverage of the samples allows in depth genomic comparisons.

All-sites dataset (nuclear analysis)

Variant and invariant sites calling

We decided to guide our workflow by following the methods described in Yamasaki et al. (2020) “Genome-wide patterns of divergence and introgression after secondary contact between Pungitius sticklebacks”.

First, we called both variants and non-variants from our .bam files using the function mpileup and call from bcftools to obtain a dataset to be used for SFS (Site Frequency Spectrum) analysis. We designed a Snakemake pipeline which implement these commands in the rule mpileup_call such as:

 bcftools mpileup {params.mapqual} --samples-file {params.idlist} --fasta-ref {VDESRef} {params.span} {params.bams} | bcftools call --multiallelic-caller -Ov -o {output} 

{VDESRef} is the path to our fasta file containing the nuclear contigs and the mitogenome.

{params.span} corresponds to one of the 1/300 genome split regions to parallelize the computation that we will merge afterward as a single file.

{params.bams} corresponds to all the .bam files here generated with NextGenMap. It is worth noting we decided to only keep the reads associated with a minimum mapping quality of 10 (leaving reads with less than 0.1% chance of being mapped incorrectly) with {params.mapqual}

{params.idlist} we specify here the samples list we want to use for the analysis (e.g., if we decide in the future to subset with only V. destructor mites)

We obtained the file /flash/MikheyevU/Maeva/varroa-jump-asia/data_2020/var/ngm/allsites63ind/allsites63_mpileupM2020.vcf (126Gb), for which bcftools stats revealed the following stats:

number of samples: 63
number of records: 359036379
number of no-ALTs: 353752353
number of SNPs: 4404174
number of MNPs: 0
number of indels: 879852
number of others: 0
number of multiallelic sites: 88005
number of multiallelic SNP sites: 18782

In comparison, we obtained ~3.7 million SNPs with the 2019 dataset, which included only 43 samples and in 2020, we detected ~4.4 million SNPs.

SN 0 number of samples: 43
SN 0 number of records: 359094759
SN 0 number of no-ALTs: 354498052
SN 0 number of SNPs: 3763743
SN 0 number of MNPs: 0
SN 0 number of indels: 832964
SN 0 number of others: 0
SN 0 number of multiallelic sites: 75285
SN 0 number of multiallelic SNP sites: 13417

To filter this raw file, we need to know the site mean depth. However, I couldn’t use vcftools --site-mean-depth option on this .vcf as it seems the DP is not coded in the right format. Instead, we parsed the .vcf file directly by using the following Perl command.

 perl -ne 'print "$1\n" if /DP=(\d+)/' {input.vcf} > allsites63_mpileupM2020_depth.txt 

We use this output file to plot the raw depth distribution before any filtering. In parallel, we also checked if computing the depth of only the seven major chromosomes will greatly affect the mean computed, and it does have only little effect. Thus no need to run the rule prevfilterVCF with the following parameters:

 vcftools --vcf {input.rawvcf} --keep {input.list} --chr NW_019211454.1 --chr NW_019211455.1 --chr NW_019211456.1 --chr NW_019211457.1 --chr NW_019211458.1 --chr NW_019211459.1 --chr NW_019211460.1 --recode --recode-INFO-all --out {output} 

mean63
[1] 15.16026
mean63*2
[1] 30.32052

Filtering the high depth artifact, indels and MNPs

So following this, we know that the site mean depth is 15.16 (double is 30.32) so we filtered the file as follow:

 vcftools --vcf {input.rawvcf} --keep {input.list} {params.span} {params.filters} {params.cutoff} --recode --recode-INFO-all --out {output} 

{input.list} refers to the sample list if we decided to subset only V. destructor or V. jacobsoni individuals.

{params.span} includes the seven chromosomes
 --chr NW_019211454.1 --chr NW_019211455.1 --chr NW_019211456.1 --chr NW_019211457.1 --chr NW_019211458.1 --chr NW_019211459.1 --chr NW_019211460.1 
{params.filters} includes several filtering options that allow to reduce the artifact of sequencing due to low or high depth regions with
 --remove-indels --minDP 8 --minQ 30 --max-missing 1 --max-alleles 2 
{params.cutoff} here would correspond to twice the mean depth we computed from the previous step and here is
 --max-meanDP 30.3 

The generated new file /flash/MikheyevU/Maeva/varroa-jump-asia/data_2020/var/ngm/allsites63ind/allsites63_filtered2020.recode.vcf (41G) has the following stats:

SN 0 number of samples: 63
SN 0 number of records: 122409498
SN 0 number of no-ALTs: 120279163
SN 0 number of SNPs: 2130335
SN 0 number of MNPs: 0
SN 0 number of indels: 0
SN 0 number of others: 0
SN 0 number of multiallelic sites: 0
SN 0 number of multiallelic SNP sites: 0

Host switch and demographic analyses

Reducing the effect of selection by keeping sites 50kb away from annotated genes

We downloaded the tabular version of the protein-coding regions here. After extracting them in excel in a file named gene_Vdes_3.txt in which we have the columns Accession, Start, Stop, GeneID, and Locus (the last two are removed). We drew the duplicate entries and kept only for the seven main chromosome NW_019211454.1-0.60.1

Before processing to further filtering, we computed the depth distribution for a routine check.

The next step consisted of excluding any genes and 50kb wide regions flanking annotated genes. We use VCFtools for this step and subsequently proceed

 vcftools --vcf {input.rawvcf} --exclude-bed {input.list} --recode --recode-INFO-all --out {output} 

{input.list} is the path to our list gene_Vdes_3_50kb.txt.

We, of course, lose a lot of the variation observed by such strict depletion. However we still harvest 224,568 biallelic SNPs and 12,370,234 invariant sites to build the SFS.

We will run fastsimcoal2 independently for V. destructor and V. jacobsoni and will not select all individuals as we wish to only test for similar dates and in sympatry. For that I will subset the final vcf with a sample list using :

 bcftools view --samples-file {input.list} -Ov -o {output} {input.rawvcf} 

Converting vcf into minor allelic frequency spectrum (folded SFS)

We used vcf2sfs scripts which were developed by Shenglin-liu to convert, fold and plot the file. Important note, the vcf file should be sorted so that the order of individuals name is the same as in the population file (separated by a tabulation, not only space)

Maximum likelihood distribution fastsimcoal2 runs

After estimating the demographic history for each Varroa species on their novel and original hosts using fastsimcoal2, we plotted here the distribution of the log-likelihood from the 100 run replicates for each scenario and each SFS subset size.

This method for model selection comes on top of the AIC calculation and is followed the same methods as described in Meier et al. 2016. Mol. Ecol..

Figure OR3: Distribution of the log likelihood for V. destructor demographic runs in regards to the mutation rate setting (1.0 x 10-11, 2.0 x 10-10, 4.0 x 10-10, 6.0 x 10-10 to 8.0 x 10-10).

Replicates generated with lower mutation rates than the directly estimated 8.0 x 10-10 were not associated with a better fit to the observed SFS from our genomic data for V. destructor.


    Kruskal-Wallis rank sum test

data:  mutation_model by lhoods_value
Kruskal-Wallis chi-squared = 497.61, df = 494, p-value = 0.446

No significant differences were detected in the likelihood variance among different mutation rates. However, pairwise Wilcoxon test significance (letters) after Bonferroni corrections showed that the lower mutation rate differentiated from other rates.


    Kruskal-Wallis rank sum test

data:  mutation_model by lhoods_value
Kruskal-Wallis chi-squared = 499, df = 498, p-value = 0.479

Figure OR4: Distribution of the log likelihood for V. jacobsoni demographic runs depending of the mutation rate selected.

Replicates generated with lower mutation rates than the directly estimated 8.0 x 10-10 were not associated with a better fit to the observed SFS from our genomic data for V. jacobsoni.

Bootstraps V. destructor

After summarizing the 100 best replicates from the 100 simulated SFS run, we computed the mean estimate value and 95% confidence interval for each demographic parameter using the boot package.

[1] 17166.46
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL : 
boot.ci(boot.out = NVAC1.boot)

Intervals : 
Level      Normal              Basic         
95%   (16602, 17731 )   (16611, 17730 )  

Level     Percentile            BCa          
95%   (16603, 17722 )   (16584, 17711 )  
Calculations and Intervals on Original Scale
[1] 29658.88
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL : 
boot.ci(boot.out = NVAM0.boot)

Intervals : 
Level      Normal              Basic         
95%   (28695, 30627 )   (28668, 30602 )  

Level     Percentile            BCa          
95%   (28716, 30649 )   (28786, 30751 )  
Calculations and Intervals on Original Scale
[1] 155.82
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL : 
boot.ci(boot.out = NBOTAM.boot)

Intervals : 
Level      Normal              Basic         
95%   (146.1, 165.4 )   (146.3, 165.5 )  

Level     Percentile            BCa          
95%   (146.2, 165.4 )   (146.0, 165.2 )  
Calculations and Intervals on Original Scale
[1] 886.69
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL : 
boot.ci(boot.out = TJUMP.boot)

Intervals : 
Level      Normal              Basic         
95%   (866.6, 907.0 )   (867.7, 907.6 )  

Level     Percentile            BCa          
95%   (865.8, 905.7 )   (864.4, 904.9 )  
Calculations and Intervals on Original Scale
[1] 709.85
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL : 
boot.ci(boot.out = TBOTEND.boot)

Intervals : 
Level      Normal              Basic         
95%   (693.6, 726.4 )   (693.6, 726.6 )  

Level     Percentile            BCa          
95%   (693.1, 726.1 )   (692.8, 725.9 )  
Calculations and Intervals on Original Scale
[1] -0.0013822
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL : 
boot.ci(boot.out = GAM.boot)

Intervals : 
Level      Normal              Basic         
95%   (-0.0015, -0.0013 )   (-0.0015, -0.0013 )  

Level     Percentile            BCa          
95%   (-0.0015, -0.0013 )   (-0.0015, -0.0013 )  
Calculations and Intervals on Original Scale
[1] 4.8149
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL : 
boot.ci(boot.out = N1M0.boot)

Intervals : 
Level      Normal              Basic         
95%   ( 4.693,  4.938 )   ( 4.693,  4.938 )  

Level     Percentile            BCa          
95%   ( 4.692,  4.937 )   ( 4.692,  4.937 )  
Calculations and Intervals on Original Scale
[1] 4.1949
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL : 
boot.ci(boot.out = N0M1.boot)

Intervals : 
Level      Normal              Basic         
95%   ( 4.110,  4.280 )   ( 4.108,  4.275 )  

Level     Percentile            BCa          
95%   ( 4.115,  4.282 )   ( 4.121,  4.293 )  
Calculations and Intervals on Original Scale
[1] 0.0002848
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL : 
boot.ci(boot.out = MACtoAM.boot)

Intervals : 
Level      Normal              Basic         
95%   ( 0.0003,  0.0003 )   ( 0.0003,  0.0003 )  

Level     Percentile            BCa          
95%   ( 0.0003,  0.0003 )   ( 0.0003,  0.0003 )  
Calculations and Intervals on Original Scale
[1] 0.0001436
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL : 
boot.ci(boot.out = MAMtoAC.boot)

Intervals : 
Level      Normal              Basic         
95%   ( 0.0001,  0.0001 )   ( 0.0001,  0.0001 )  

Level     Percentile            BCa          
95%   ( 0.0001,  0.0001 )   ( 0.0001,  0.0001 )  
Calculations and Intervals on Original Scale

Bootstraps V. jacobsoni

We repeated the same process of bootstraps independently for V. jacobsoni.

[1] 17375.64
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL : 
boot.ci(boot.out = NVAC1.boot)

Intervals : 
Level      Normal              Basic         
95%   (16589, 18156 )   (16584, 18162 )  

Level     Percentile            BCa          
95%   (16589, 18167 )   (16570, 18152 )  
Calculations and Intervals on Original Scale
[1] 5220.21
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL : 
boot.ci(boot.out = NVAM0.boot)

Intervals : 
Level      Normal              Basic         
95%   (4766, 5677 )   (4758, 5663 )  

Level     Percentile            BCa          
95%   (4777, 5682 )   (4797, 5703 )  
Calculations and Intervals on Original Scale
[1] 138.01
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL : 
boot.ci(boot.out = NBOTAM.boot)

Intervals : 
Level      Normal              Basic         
95%   (125.8, 150.3 )   (125.9, 150.6 )  

Level     Percentile            BCa          
95%   (125.4, 150.1 )   (125.8, 150.4 )  
Calculations and Intervals on Original Scale
[1] 183.24
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL : 
boot.ci(boot.out = TJUMP.boot)

Intervals : 
Level      Normal              Basic         
95%   (179.5, 186.9 )   (179.8, 187.1 )  

Level     Percentile            BCa          
95%   (179.4, 186.7 )   (178.7, 186.3 )  
Calculations and Intervals on Original Scale
[1] 170.44
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL : 
boot.ci(boot.out = TBOTEND.boot)

Intervals : 
Level      Normal              Basic         
95%   (165.2, 175.6 )   (165.4, 175.7 )  

Level     Percentile            BCa          
95%   (165.2, 175.5 )   (164.8, 175.2 )  
Calculations and Intervals on Original Scale
[1] -0.463584
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL : 
boot.ci(boot.out = GAM.boot)

Intervals : 
Level      Normal              Basic         
95%   (-0.4882, -0.4394 )   (-0.4870, -0.4388 )  

Level     Percentile            BCa          
95%   (-0.4884, -0.4402 )   (-0.4910, -0.4419 )  
Calculations and Intervals on Original Scale
[1] 0.0160944
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL : 
boot.ci(boot.out = MACtoAM.boot)

Intervals : 
Level      Normal              Basic         
95%   ( 0.0154,  0.0168 )   ( 0.0154,  0.0168 )  

Level     Percentile            BCa          
95%   ( 0.0154,  0.0168 )   ( 0.0155,  0.0169 )  
Calculations and Intervals on Original Scale
[1] 0.0295358
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL : 
boot.ci(boot.out = MAMtoAC.boot)

Intervals : 
Level      Normal              Basic         
95%   ( 0.0282,  0.0309 )   ( 0.0281,  0.0308 )  

Level     Percentile            BCa          
95%   ( 0.0283,  0.0309 )   ( 0.0284,  0.0311 )  
Calculations and Intervals on Original Scale
[1] 268.9309
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL : 
boot.ci(boot.out = N1M0.boot)

Intervals : 
Level      Normal              Basic         
95%   (261.8, 276.2 )   (262.1, 276.5 )  

Level     Percentile            BCa          
95%   (261.4, 275.8 )   (260.7, 275.3 )  
Calculations and Intervals on Original Scale
[1] 146.8824
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL : 
boot.ci(boot.out = N0M1.boot)

Intervals : 
Level      Normal              Basic         
95%   (136.3, 157.4 )   (136.1, 157.1 )  

Level     Percentile            BCa          
95%   (136.6, 157.7 )   (137.1, 158.3 )  
Calculations and Intervals on Original Scale

Genome-wide analysis of genetic diversity and divergence

We use the all-sites dataset to detect if the bottleneck associated with host-switch affected all genomic regions the same way. For that, we computed three indices using a sliding window throughout the genome.

For this part, we followed the tutorial by Mark Ravinet and Joana Meier, and we used the python scripts developed by Simon Martin for parsing our .vcf into a .geno object and then computed the diversity and divergence analyses in sliding window with popgenWindows.py.

π, as a measure of genetic variation FST, as a measure of genomic differentiation Dxy, as a measure of absolute divergence

Estimation of genetic diversity (π) within host/species population

We started with a preliminary run using a large window (-w 50000 -s 20000 -m 100) which specified a window size of 50kb that is sliding by 20kb and with a minimum of 100 sites covered. The windows were too large, and so I went with a much smaller one going of 10kb (-w 10000), sliding by 5kb (-s 5000) with a minimum of 100 sites covered (-m 100).

Below are the graphs of the nucleotide diversity computed from allsites63_filtered2020_small.Fst.Dxy.pi.csv.gz
1) populations are separated by species and hosts
2) populations are separated by species, hosts and lineages

Figure OR5: Nucleotide diversity in V. destructor mites between original host and new host populations.

We plotted the nucleotide variation present along the genome of V. destructor mites parasitizing their original host, A. cerana (dark-red), or the new one, A. mellifera (light-red). We can see that the nucleotide diversity is much smaller in A. mellifera mites in general, which reflect the expected host switch bottleneck (# lineages and diversity were higher in the source population).

Figure OR6: Nucleotide diversity in V. jacobsoni mites between original host and new host populations.

Similarly, we plotted for V. jacobsoni with blue shades.

Estimation of genetic differentiation FST including invariants sites

We observed differences in the variability among populations and we wanted to see whether after host switch, mites populations differentiated and whether divergence is only localized in some regions. We computed FST estimates for entire chromosome but the analysis only considered bi-allelic sites.

Figure OR7: Genome-wide differentiation among all V. destructor mites

Figure OR8: Genome-wide differentiation among all switched V. destructor (sharing a Korean K1 mtDNA ancestry)

The Korean K1 lineage is one of the two reported to have switched hosts in Asia. Here, we also found that all V. destructor mites from A. mellifera in Asia clustered with K1 mites from A. cerana.

Figure OR9: Genome-wide differentiation among all V. jacobsoni mites

Figure OR10: Genome-wide differentiation among all switched V. jacobsoni (sharing a Papua New Guinea 1 mtDNA ancestry)

Estimation of genetic divergence between populations (dxy)

Dxy absolute measure of differentiation is independent of the levels of diversity within the two populations being compared Cruickshank et Hanh, 2014.

Figure OR11: Absolute genetic divergence between the two Varroa species populations parasitizing the shared original host A. cerana

We used these levels as they should be the most extreme between species to compare the levels observed among host populations within species.

Figure OR12: Absolute genetic divergence between the two V. destructor host populations ( A. cerana vs A. mellifera)

Figure OR13: Comparison of the absolute genetic divergence between host population pairs of source, switched and non-switched V. destructor in regards to maternal lineage.

Here we aim to detect whether some regions are highly diverging among non-switched and switched mite lineages.

In the next step, we do the same comparison process but among V. jacobsoni mites.

Figure OR14: Comparison of the absolute genetic divergence between host population pairs of source, switched and non-switched V. jacobsoni in regards to maternal lineage.

Zoom in of outliers

We wanted to zoom in on the outliers sites, and we started with chromosome 6 (NW_019211459.1:22180001-22190000). Contrary to the plots before, we do an interactive plot_ly which allowed us to zoom in on the areas of interest.

Figure OR14: Dxy sliding window between V. destructor on A. cerana vs on A. mellifera.

After checking the most outlier window, I used the genome data browser in NCBI to identify which genes may be associated to this region.

Chromosome 2 NW_019211455.1:7750001-7760000 VJAC
NW_019211455.1:7755001-7765000 VJAC
NW_019211455.1:7760001-7770000 VJAC VDES

NW_019211455.1:34005001-34015000 VJAC VDES

NW_019211455.1:60405001-60415000 VJAC
NW_019211455.1:60410001-60420000 VJAC VDES

Chromosome 3 NW_019211456.1:760001-770000 VJAC VDES

Chromosome 6 NW_019211459.1:22180001-22190000 VJAC VDES NW_019211459.1:22515001-22525000 VJAC VDES NW_019211459.1:22520001-22530000 VJAC VDES

Chromosome 2

The 1st region (NW_019211455.1:7750001-7770000) spanned on a large annotated gene LOC111244985 and three small ones LOC111244987, LOC111244984, LOC111244986 which are all uncharacterized.

The 2nd region (NW_019211455.1:34005001-34015000) is not associated with any annotated genes, but some RNA-seq exon coverage are found.

The 3rd region (NW_019211455.1:60405001-60420000) has no annotated genes.

Chromosome 3

In the region (NW_019211456.1:760001-770000), we found one single annotated region with the locus LOC111246310, also uncharacterized.

Chromosome 6

The entire region (NW_019211459.1:22180001-22190000) spanned on the annotated gene LOC111252929 which is described as a calmodulin-binding transcription activator 2-like.

In the same chromosome, we also have the region spanning from (NW_019211459.1:22515001-22530000) which shows two estimates outside of the other values. These two regions are associated to an uncharacterized LOC111253430 locus.

**All sites with Dxy value > 0.01 were investigated manually by examining individual genotypes, which showed an excess in heterozygosity levels. These regions were also poorly conserved within species. They had BLAST matches elsewhere in the genome, suggesting the presence of sequencing artifacts rather than biological signals (labeled *af on the plots).**

All-sites (mitochondrial analysis)

SNP-only dataset

Variant calling and filtering steps

Linkage disequilibrium pruning

Inferring population structure with PCA

Figure OR15: Interactive 3D-PCA using a MAF filtering of 0.01 on the 63 mites sequenced

Figure OR16: PCA using a MAF filtering of 0.01 on the 63 mites sequenced colored by species.

We did the same within each species later to obtain the base of figure 2 presented in the manuscript.

Population structure and admixture with NGSadmix

(see graphs in manuscript)