
8. Proteomics

8.4. Proteomics data processing

8.4.1. Software recommendations

Raw mass spectrometry data must be ‘searched’ in order to derive biological meaning from it. That is, a computer algorithm matches mass spectra to peptide sequences, deconvolutes the pool of identified peptides into the most parsimonious set of proteins that must be present to explain all those peptides, translates peptide intensity data into quantitative protein data, and controls false discovery rates. While there are many software options available to perform these tasks (reviewed elsewhere (Verheggen et al., 2020)), we recommend MaxQuant (Cox & Mann, 2008) owing to its i) high regard in the field, ii) robust label-free quantification (LFQ) algorithm, iii) delayed normalization feature to accommodate fractionated samples, iv) continual feature upgrades, v) capability of handling DIA and DDA data, and vi) lack of associated cost (please see the latest information from the annual MaxQuant Summer School for upcoming tutorials: https://maxquant.org/summer_school/). DIA-NN is also an excellent search tool that is especially well suited for processing DIA data (Demichev et al., 2020). MaxQuant and DIA-NN are only compatible with Windows and Linux machines. Theoretically, spectra can also be sequenced de novo (without an existing protein database) using other software, such as PEAKS (B. Ma et al., 2003); however, this is only viable for very high-quality spectra, and the resulting proteome coverage is therefore exceedingly low.

8.4.1.1. MaxQuant and DIA-NN search parameters

The default search parameters within MaxQuant and DIA-NN are generally appropriate for most shotgun proteomics analyses, but some options should be considered (see Sinitcyn et al. (2021) for handling DIA data in MaxQuant). In particular, “match between runs” (MBR) is an option that increases sensitivity by borrowing peptide identification information across samples. For example, if a spectrum is confidently matched to a peptide in sample 1 but not in sample 2, sample 2 is re-inspected for likely features of that spectrum and, if they are found, it can receive the same peptide assignment. This approach assumes that if a peptide is confidently identified in one sample, it has a high likelihood of being present in other, similar samples, so the spectrum quality threshold for matching that peptide can reasonably be lowered. We recommend enabling this option in both MaxQuant and DIA-NN to reduce the frequency of missing data. In DIA-NN, “unrelated runs” should also be checked if samples represent independent replicates, and the “Protein inference” option should be set to “Protein names (from FASTA)”.
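As an illustration of how such settings translate to a non-interactive run, the sketch below wraps a DIA-NN command-line call in Python. The executable path and file names are hypothetical, and the flag spellings (e.g. --reanalyse for MBR) are assumptions based on recent DIA-NN releases; they should be checked against the documentation for the installed version.

```python
# Minimal sketch of launching a DIA-NN library-free (DIA) search from Python.
# Paths, sample files, and flag names are illustrative assumptions; confirm
# them against the DIA-NN documentation for your installed version.
import subprocess

diann_exe = r"C:\DIA-NN\diann.exe"         # hypothetical install location

cmd = [
    diann_exe,
    "--f", "sample_01.raw",                # one --f entry per raw file
    "--f", "sample_02.raw",
    "--fasta", "apis_mellifera_plus_viruses.fasta",
    "--fasta-search",                      # library-free search of the FASTA
    "--out", "report.tsv",
    "--qvalue", "0.01",                    # 1% FDR threshold
    "--reanalyse",                         # assumed flag enabling match between runs
    "--threads", "8",
]

subprocess.run(cmd, check=True)            # raises if DIA-NN exits with an error
```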

8.4.1.2. Choosing an appropriate protein database

For typical shotgun proteomics experiments, the data processing software requires at least two inputs: the raw data files and a database of proteins to which it can compare spectra. It is very important to choose an appropriate protein database; failure to do so can result in flawed data with an unacceptable level of false positive or false negative errors. One example from the literature is a published paper claiming to have discovered a link between invertebrate iridescent virus-6, detected in honey bee proteomics samples, and colony collapse disorder (CCD) (Bromenshenk et al., 2010). The database used to search the mass spectrometry data included only viral protein sequences and no host (honey bee) proteins, despite host proteins composing the majority of the sample.

Because spectrum matching is a probabilistic task, spectra from host peptides can be assigned to viral peptides if those are the most likely matches within the constraints of the protein database supplied. Indeed, that is exactly what happened, leading to incorrect peptide assignments, dramatically skewed false discoveries, and ultimately flawed conclusions. When the host proteins were included in the search database, spectra that previously matched to iridescent virus-6 actually had far higher scoring matches to host peptides, indicating that the virus was unlikely to have been present in the sample (Foster, 2011; Tokarz et al., 2011), let alone cause CCD.

Since genome builds, as well as gene and protein annotation databases, are continually upgraded, the most up-to-date reference proteome should be used. Furthermore, and following the above discussion, the database should contain all sequences with a reasonable probability of being found in the sample. For honey bees, this means that, in addition to honey bee protein sequences, honey bee virus sequences should be included in the protein database (FASTA file) for virtually all sample types, given the high incidence of asymptomatic infections (Grozinger & Flenniken, 2019). Nosema spp., chalkbrood (Ascosphaera apis), European foulbrood (Melissococcus plutonius), American foulbrood (Paenibacillus larvae), or any other likely pathogen or colonizing microbe may be added as well, if applicable. We recommend obtaining FASTA files from UniProt due to the ease of subsequently incorporating gene ontology (GO) information during data analysis. We also recommend including protein sets for the core gut bacteria (Motta & Moran, 2024) when bee abdomens form part of the sample (see Section 10).
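In practice, a combined search database can be assembled by simply concatenating the relevant FASTA files. The sketch below shows one way to do this in Python; the file names are hypothetical placeholders for proteomes downloaded from UniProt (or NCBI, for pathogens lacking a UniProt proteome).

```python
# Minimal sketch: build a combined search database (FASTA) from the host
# proteome plus likely pathogens/symbionts. File names are placeholders for
# proteomes downloaded from UniProt or NCBI.
from pathlib import Path

fasta_inputs = [
    "apis_mellifera.fasta",        # host (honey bee) reference proteome
    "honey_bee_viruses.fasta",     # common honey bee viruses
    "nosema_spp.fasta",            # optional: include if relevant to the samples
    "core_gut_bacteria.fasta",     # optional: for samples containing abdomens/guts
]

with open("search_database.fasta", "w") as out:
    for fasta in fasta_inputs:
        path = Path(fasta)
        if not path.exists():
            print(f"Warning: {fasta} not found, skipping")
            continue
        out.write(path.read_text().rstrip() + "\n")  # keep FASTA records newline-separated
```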

8.4.1.3. Statistical analysis

When finished searching, MaxQuant will output a series of tables, including one named proteinGroups.txt, which contains the protein quantification information, including the dominant members of each protein group and their LFQ intensities. The equivalent output from DIA-NN is the report.pg_matrix.tsv file. The matrix of protein names (rows) and LFQ intensities (columns) is used for subsequent differential expression analyses. Any proteins indicated as reverse hits, potential contaminants, or those only identified by site are undesirable and typically excluded. MaxQuant’s companion program, Perseus (Tyanova et al., 2016), can be used for basic statistical tests and figure generation; however, most R packages originally intended for microarray or RNA-seq data analysis (e.g. limma (Ritchie et al., 2015)) are also appropriate for proteomics data and offer more flexibility. For users who are new to proteomics analysis, we recommend using Perseus, since it is a user-friendly platform developed specifically for proteomics, and is accompanied by detailed step-by-step tutorials (http://www.coxdocs.org/doku.php?id=perseus:user:use_cases:interactions) and lectures (http://www.coxdocs.org/doku.php?id=perseus:user:tutorials). The tutorial “label-free interaction data” provides a detailed guide to data preparation (loading, filtering, transforming, etc.), quality control, statistical analyses, and visualization. Currently, Perseus is only compatible with Windows.
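To make the filtering step concrete, the sketch below shows one way to prepare a MaxQuant proteinGroups.txt table for differential expression analysis in Python with pandas. The column names follow typical MaxQuant output conventions but should be verified against your own file, and the log2 transformation is a common convention we assume here rather than a required step.

```python
# Minimal sketch: load MaxQuant's proteinGroups.txt, drop reverse hits,
# potential contaminants, and "only identified by site" entries, and extract
# log2-transformed LFQ intensities. Column names reflect typical MaxQuant
# output and should be checked against your own table.
import numpy as np
import pandas as pd

pg = pd.read_csv("proteinGroups.txt", sep="\t", low_memory=False)

# MaxQuant flags undesirable rows with a "+" in these columns
# (older versions label contaminants simply "Contaminant").
for flag in ["Reverse", "Potential contaminant", "Contaminant", "Only identified by site"]:
    if flag in pg.columns:
        pg = pg[pg[flag] != "+"]

# Keep the LFQ intensity columns (one per sample) for downstream statistics.
lfq_cols = [c for c in pg.columns if c.startswith("LFQ intensity ")]
lfq = pg.set_index("Majority protein IDs")[lfq_cols]

# Log2-transform; MaxQuant reports missing values as 0, which become NaN here.
lfq = np.log2(lfq.replace(0, np.nan))

lfq.to_csv("lfq_matrix_filtered.csv")
```

The resulting protein-by-sample matrix can then be loaded into Perseus or an R package such as limma for the differential expression tests described above.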

Once differential expression analysis is complete, the results may be used for gene ontology (GO) term enrichment tests similar to those conducted for microarray or RNA-seq data. While a multitude of suitable tools exist for such analyses (reviewed in Laukens et al., 2015), we recommend ErmineJ (Gillis et al., 2010; Lee et al., 2005) for its flexibility, simplicity, and ability to account for both multiple hypothesis testing and protein multifunctionality when determining enrichment significance.
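If ErmineJ is used, the differential expression results typically need to be reshaped into a simple identifier-plus-score table. The sketch below illustrates one way to do this in Python; the input column names are hypothetical, and the expected identifier type and score convention (e.g. raw p-values versus -log10 p-values) should be checked against the ErmineJ documentation and the annotation file being used.

```python
# Minimal sketch: convert differential expression results into a two-column
# gene score table for GO enrichment analysis in ErmineJ. Input column names
# are hypothetical; check the ErmineJ documentation for the expected
# identifier type and score convention.
import numpy as np
import pandas as pd

de = pd.read_csv("differential_expression_results.csv")  # hypothetical limma/Perseus export

scores = pd.DataFrame({
    "gene": de["protein_id"],              # must match the identifiers in the ErmineJ annotation file
    "score": -np.log10(de["p_value"]),     # larger score = stronger evidence
})

scores.to_csv("ermineJ_gene_scores.txt", sep="\t", index=False)
```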