In this tutorial you will learn the basics of phydms. We will walk through examples of analyses you may want to run to compare your deep mutational scanning data to natural sequence evolution. For more details on any of the steps, please see the phydms documentation.

phydms was developed at the Bloom Lab (full list of contributors)

Comparing your DMS data to natural sequence evolution

After you perform your deep mutational scanning experiment, you may want to know how well the experimental measurements you made in the lab describe the natural sequence evolution of your protein. Using phydms you can compare the phylogenetic substitution model ExpCM, which takes into account the site-specific amino-acid preferences from your deep mutational scanning experiment, to traditional, non-site-specific models from the YNGKP family. phydms_comprehensive runs phydms in several different modes to generate the results needed for these comparisons.

For the standard phydms_comprehensive analysis, you will need amino-acid preferences from your deep mutational scanning experiment and a codon-level sequence alignment of your gene. phydms_comprehensive will then

• infer a tree using RAxML
• run phydms with the YNGKP_M0, the YNGKP_M5, the ExpCM, and a control ExpCM run with averaged preferences
• summarize the results

See the full documentation for specifics on these models.

We are going to walk through an example of phydms_comprehensive for the \(\beta\)-lactamase gene. We will compare an ExpCM with deep mutational scanning data from Stiffler et al, 2015 to the YNGKP family of models.

phydms_comprehensive command-line usage

Here is the full list of requirements and options for phydms_comprehensive. Below is a discussion of the input files, running the phydms_comprehensive command, and interpretation of the results.

phydms_comprehensive -h
usage: phydms_comprehensive [-h] (--raxml RAXML | --tree TREE) [--ncpus NCPUS]
                            [--brlen {scale,optimize}] [--omegabysite]
                            [--diffprefsbysite] [--gammaomega] [--gammabeta]
                            [--no-avgprefs] [--randprefs] [-v]
                            outprefix alignment prefsfiles [prefsfiles ...]

Comprehensive phylogenetic model comparison and detection of selection
informed by deep mutational scanning data. This program runs 'phydms'
repeatedly to compare substitution models and detect selection. The 'phydms'
package is written by the Bloom lab (see
https://github.com/jbloomlab/phydms/contributors). Version 2.2.dev1. Full
documentation at http://jbloomlab.github.io/phydms

positional arguments:
  outprefix             Output file prefix.
  alignment             Existing FASTA file with aligned codon sequences.
  prefsfiles            Existing files with site-specific amino-acid
                        preferences.

optional arguments:
  -h, --help            show this help message and exit
  --raxml RAXML         Path to RAxML (e.g., 'raxml') (default: None)
  --tree TREE           Existing Newick file giving input tree. (default:
                        None)
  --ncpus NCPUS         Use this many CPUs; -1 means all available. (default:
                        -1)
  --brlen {scale,optimize}
                        How to handle branch lengths: scale by single
                        parameter or optimize each one (default: optimize)
  --omegabysite         Fit omega (dN/dS) for each site. (default: False)
  --diffprefsbysite     Fit differential preferences for each site. (default:
                        False)
  --gammaomega          Fit ExpCM with gamma distributed omega. (default:
                        False)
  --gammabeta           Fit ExpCM with gamma distributed beta. (default:
                        False)
  --no-avgprefs         No fitting of models with preferences averaged across
                        sites for ExpCM. (default: False)
  --randprefs           Include ExpCM models with randomized preferences.
                        (default: False)
  -v, --version         show program's version number and exit

Input Files

This example uses the following files: betaLactamase_enrichmentScores.csv, betaLactamase_prefs.csv, and betaLactamase_alignment.fasta.

Amino-acid preferences

Often data from deep mutational scanning experiments is reported as the enrichment of a given amino acid compared to the wild-type amino acid. Here is a snippet of the log enrichment scores for \(\beta\)-lactamase from the file betaLactamase_enrichmentScores.csv (Stiffler et al, 2015 Supplemental File 1) and the full dataset visualized as a heatmap.

library(ggplot2)
df = read.csv("example_data/betaLactamase_enrichmentScores.csv")
head(df,6)
  Site wildtype AminoAcid Trial_1_AmpConc_2500 Trial_2_AmpConc_2500.1
1   26        H         A         -0.003757501            -0.01580963
2   26        H         C         -0.319356518            -0.51716333
3   26        H         D         -0.155937758            -0.20727546
4   26        H         E         -0.200042482            -0.35800585
5   26        H         F         -0.493981846            -1.14299234
6   26        H         G         -0.043527004            -0.07695634
base_size <- 15
p <- ggplot(df, aes(Site, AminoAcid)) + geom_tile(aes(fill = Trial_1_AmpConc_2500),colour = "white") + scale_fill_gradient(low = "white",high = "steelblue") + guides(fill = guide_legend(title = "Enrichment relative to \nwildtype amino acid"))
p = p + theme_grey(base_size = base_size) + scale_x_discrete(expand = c(0, 0)) + scale_y_discrete(expand = c(0, 0))
p

phydms uses amino-acid preferences rather than enrichment scores. We can transform these log enrichment scores to amino-acid preferences by normalizing the exponentiation of the scores for each site. Below are the first few sites and amino acids of the resulting preferences. Notice that the numbering has changed from above. In phydms a site is numbered in relation to the first site in the preferences rather than to the start codon.

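import pandas as pd

# Read the log enrichment scores (one row per site and amino acid)
df = pd.read_csv("example_data/betaLactamase_enrichmentScores.csv")
minenrichment = 1.0e-4 # minimum allowed enrichment
# Convert log10 enrichment to enrichment and apply the floor.
# Note: both terms use Trial_1_AmpConc_2500, so this "average" equals the trial 1 value.
df["preference"] = [max(minenrichment, (10**df["Trial_1_AmpConc_2500"][x] + 10**df["Trial_1_AmpConc_2500"][x])/2) for x in range(len(df))]
# Reshape to one row per site and one column per amino acid
df = df.pivot(index = "Site", columns = "AminoAcid", values = "preference")
df.fillna(1, inplace = True) # amino acids without a score get an enrichment of 1
df = df.div(df.sum(axis=1), axis=0) # normalize so each site sums to 1
df.insert(0, "site", range(1,len(df)+1)) # renumber sites sequentially from 1
df.to_csv("example_data/betaLactamase_prefs.csv", index = False)
print(df.iloc[:,:9].head(6).to_string(index = False))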
site         A         C         D         E         F         G         H         I
                                                                                    
   1  0.070118  0.033902  0.049391  0.044621  0.022678  0.063982  0.070727  0.036060
   2  0.002058  0.035362  0.057655  0.078354  0.040415  0.012446  0.070924  0.049739
   3  0.057806  0.042278  0.050113  0.059354  0.048054  0.068718  0.063015  0.043187
   4  0.013668  0.004088  0.003170  0.002596  0.001298  0.001802  0.005931  0.177391
   5  0.066102  0.004275  0.000273  0.002788  0.027495  0.003923  0.025634  0.105117
   6  0.056257  0.048057  0.056197  0.070021  0.042867  0.054567  0.048590  0.047577
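
Written out, the transformation performed by the script above is as follows. The symbols are notation introduced here for illustration only: \(s_{r,a}\) is the log10 enrichment score of amino acid \(a\) at site \(r\), \(\epsilon = 10^{-4}\) is the minenrichment floor, and amino acids with no reported score at a site receive an unnormalized value of 1, matching the fillna(1) call.

```latex
\[
\pi_{r,a} = \frac{x_{r,a}}{\sum_{a'} x_{r,a'}},
\qquad
x_{r,a} =
\begin{cases}
\max\left(\epsilon,\; 10^{s_{r,a}}\right) & \text{if a score } s_{r,a} \text{ is reported,} \\
1 & \text{otherwise.}
\end{cases}
\]
```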

We can visualize the preferences as a logoplot using phydms_logoplot:

For more information on the preference file formats, please see the full phydms documentation. For more detailed information on transforming enrichment scores into amino-acid preferences, please see dms_tools: Algorithm to infer site-specific preferences.
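
As a quick sanity check on a preferences file like the one written above, the sketch below confirms that the preferences at each site sum to one. It is only an illustration: the helper name check_prefs_rows and the tolerance are assumptions rather than part of phydms, and the two-site table in the string is made-up toy data that copies the column layout shown above (a site column followed by one column per amino acid).

```python
import csv
import io


def check_prefs_rows(csv_text, tol=1e-6):
    """Return the site labels whose preferences do not sum to 1 (within tol)."""
    bad_sites = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        site = row.pop("site")  # every remaining column is an amino acid
        total = sum(float(value) for value in row.values())
        if abs(total - 1.0) > tol:
            bad_sites.append(site)
    return bad_sites


# Toy data (hypothetical, three amino acids only) in the prefs-file layout.
toy_prefs = "site,A,C,D\n1,0.5,0.25,0.25\n2,0.6,0.3,0.2\n"
bad = check_prefs_rows(toy_prefs)
print("Sites that are not normalized:", bad)
```

To check a real file, the same function could be called on the text of example_data/betaLactamase_prefs.csv, possibly with a looser tolerance to allow for rounding in the written values.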

Sequences

We will use an alignment of \(\beta\)-lactamase sequences. Note that the preferences file and the alignment cover exactly the same number of sites.

Sequences read from: example_data/betaLactamase_alignment.fasta
There are 50 sequences
Each sequence is 263 amino acids long
Preferences read from: example_data/betaLactamase_prefs.csv
There are preferences measured for 263 sites

You can use the phydms auxiliary program phydms_prepalignment to filter your sequences and prepare an alignment for phydms_comprehensive.
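
The requirement that the alignment and the preferences cover the same number of sites can be checked before starting a long run. The sketch below is a hedged illustration, not phydms code: read_fasta and codon_sites are hypothetical helpers, and the two short sequences are toy data. For the \(\beta\)-lactamase example, 263 codon sites correspond to an alignment that is 789 nucleotides long.

```python
def read_fasta(fasta_text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def codon_sites(records):
    """Return the number of codon sites, checking that all sequences agree."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("sequences are not all the same length")
    (length,) = lengths
    if length % 3 != 0:
        raise ValueError("alignment length is not a multiple of 3")
    return length // 3


# Toy codon alignment (hypothetical): two sequences of three codons each.
toy_alignment = ">seq1\nATGGCTAAA\n>seq2\nATGGCCAAA\n"
n_sites = codon_sites(read_fasta(toy_alignment))
n_pref_sites = 3  # number of rows in the matching toy preferences file
print("Site counts match:", n_sites == n_pref_sites)
```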

phydms_comprehensive

We can now run phydms_comprehensive by specifying our output prefix (in this case a directory called betaLactamase), our preferences, and our alignment. The output below is the phydms_comprehensive run log.

phydms_comprehensive betaLactamase/ example_data/betaLactamase_alignment.fasta example_data/betaLactamase_prefs.csv --raxml raxml
2018-01-24 21:46:20,943 - INFO - Beginning execution of phydms_comprehensive in directory /Users/sarah/Desktop/phydms/tutorial

2018-01-24 21:46:20,943 - INFO - Progress is being logged to betaLactamase/log.log

2018-01-24 21:46:20,943 - INFO - Version information:
    Time and date: Wed Jan 24 21:46:20 2018
    Platform: Darwin-15.6.0-x86_64-i386-64bit
    Python version: 3.5.2 (v3.5.2:4def2a2901a5, Jun 26 2016, 10:47:25)  [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
    phydms version: 2.2.dev1
    Bio version: 1.68
    cython version: 0.27.3
    numpy version: 1.13.3
    scipy version: 0.19.0
    matplotlib version: 2.0.2
    natsort version: 5.1.1
    sympy version: 1.0
    six version: 1.10.0
    pandas version: 0.20.3
    pyvolve version: 0.8.4
    statsmodels version: 0.8.0
    weblogolib version: 3.5.0
    PyPDF2 version: 1.26.0

2018-01-24 21:46:20,943 - INFO - Parsed the following command-line arguments:
    randprefs = False
    noavgprefs = False
    raxml = raxml
    outprefix = betaLactamase/
    prefsfiles = ['example_data/betaLactamase_prefs.csv']
    alignment = example_data/betaLactamase_alignment.fasta
    omegabysite = False
    gammaomega = False
    diffprefsbysite = False
    gammabeta = False
    brlen = optimize
    tree = None
    ncpus = -1

2018-01-24 21:46:20,944 - INFO - Checking that the alignment example_data/betaLactamase_alignment.fasta is valid...
2018-01-24 21:46:21,150 - INFO - Valid alignment specifying 50 sequences of length 789.

2018-01-24 21:46:21,150 - INFO - Tree not specified.
2018-01-24 21:46:21,169 - INFO - Inferring tree with RAxML using command raxml
2018-01-24 21:46:23,122 - INFO - RAxML inferred tree is now named betaLactamase/RAxML_tree.newick
2018-01-24 21:46:23,123 - INFO - Removed the following existing files that have names that match the names of output files that will be created: betaLactamase/YNGKP_M5_log.log, betaLactamase/ExpCM_betaLactamase_prefs_log.log, betaLactamase/YNGKP_M0_log.log, betaLactamase/averaged_ExpCM_betaLactamase_prefs_log.log

2018-01-24 21:46:23,123 - INFO - Starting analysis to optimize tree in betaLactamase/RAxML_tree.newick using model YNGKP_M5. The command is: phydms example_data/betaLactamase_alignment.fasta betaLactamase/RAxML_tree.newick YNGKP_M5 betaLactamase/YNGKP_M5 --brlen optimize --ncpus 1

2018-01-24 21:46:23,124 - INFO - Starting analysis to optimize tree in betaLactamase/RAxML_tree.newick using model ExpCM_betaLactamase_prefs. The command is: phydms example_data/betaLactamase_alignment.fasta betaLactamase/RAxML_tree.newick ExpCM_example_data/betaLactamase_prefs.csv betaLactamase/ExpCM_betaLactamase_prefs --brlen optimize --ncpus 1

2018-01-24 21:46:23,124 - INFO - Starting analysis to optimize tree in betaLactamase/RAxML_tree.newick using model YNGKP_M0. The command is: phydms example_data/betaLactamase_alignment.fasta betaLactamase/RAxML_tree.newick YNGKP_M0 betaLactamase/YNGKP_M0 --brlen optimize --ncpus 1

2018-01-24 21:46:23,124 - INFO - Starting analysis to optimize tree in betaLactamase/RAxML_tree.newick using model averaged_ExpCM_betaLactamase_prefs. The command is: phydms example_data/betaLactamase_alignment.fasta betaLactamase/RAxML_tree.newick ExpCM_example_data/betaLactamase_prefs.csv betaLactamase/averaged_ExpCM_betaLactamase_prefs --brlen optimize --avgprefs --ncpus 1

2018-01-24 21:53:11,022 - INFO - Analysis completed for YNGKP_M0
2018-01-24 21:53:11,023 - INFO - Found expected output file betaLactamase/YNGKP_M0_log.log
2018-01-24 21:53:11,024 - INFO - Found expected output file betaLactamase/YNGKP_M0_tree.newick
2018-01-24 21:53:11,024 - INFO - Found expected output file betaLactamase/YNGKP_M0_loglikelihood.txt
2018-01-24 21:53:11,024 - INFO - Found expected output file betaLactamase/YNGKP_M0_modelparams.txt
2018-01-24 21:53:11,025 - INFO - Analysis successful for YNGKP_M0

2018-01-24 22:02:11,197 - INFO - Analysis completed for YNGKP_M5
2018-01-24 22:02:11,207 - INFO - Found expected output file betaLactamase/YNGKP_M5_log.log
2018-01-24 22:02:11,207 - INFO - Found expected output file betaLactamase/YNGKP_M5_tree.newick
2018-01-24 22:02:11,208 - INFO - Found expected output file betaLactamase/YNGKP_M5_loglikelihood.txt
2018-01-24 22:02:11,208 - INFO - Found expected output file betaLactamase/YNGKP_M5_modelparams.txt
2018-01-24 22:02:11,208 - INFO - Analysis successful for YNGKP_M5

2018-01-24 22:03:07,584 - INFO - Analysis completed for ExpCM_betaLactamase_prefs
2018-01-24 22:03:07,584 - INFO - Found expected output file betaLactamase/ExpCM_betaLactamase_prefs_log.log
2018-01-24 22:03:07,584 - INFO - Found expected output file betaLactamase/ExpCM_betaLactamase_prefs_tree.newick
2018-01-24 22:03:07,585 - INFO - Found expected output file betaLactamase/ExpCM_betaLactamase_prefs_loglikelihood.txt
2018-01-24 22:03:07,585 - INFO - Found expected output file betaLactamase/ExpCM_betaLactamase_prefs_modelparams.txt
2018-01-24 22:03:07,585 - INFO - Analysis successful for ExpCM_betaLactamase_prefs

2018-01-24 22:05:44,594 - INFO - Analysis completed for averaged_ExpCM_betaLactamase_prefs
2018-01-24 22:05:44,594 - INFO - Found expected output file betaLactamase/averaged_ExpCM_betaLactamase_prefs_log.log
2018-01-24 22:05:44,594 - INFO - Found expected output file betaLactamase/averaged_ExpCM_betaLactamase_prefs_tree.newick
2018-01-24 22:05:44,595 - INFO - Found expected output file betaLactamase/averaged_ExpCM_betaLactamase_prefs_loglikelihood.txt
2018-01-24 22:05:44,595 - INFO - Found expected output file betaLactamase/averaged_ExpCM_betaLactamase_prefs_modelparams.txt
2018-01-24 22:05:44,595 - INFO - Analysis successful for averaged_ExpCM_betaLactamase_prefs

2018-01-24 22:05:45,612 - INFO - Successful completion of phydms_comprehensive

Output files

phydms_comprehensive produces both the standard phydms output files for each model and a summary file. Please see the full documentation for more information on the standard output files.

ls betaLactamase/*
betaLactamase/ExpCM_betaLactamase_prefs_log.log
betaLactamase/ExpCM_betaLactamase_prefs_loglikelihood.txt
betaLactamase/ExpCM_betaLactamase_prefs_modelparams.txt
betaLactamase/ExpCM_betaLactamase_prefs_tree.newick
betaLactamase/RAxML_bestTree.betaLactamase_alignment
betaLactamase/RAxML_tree.newick
betaLactamase/YNGKP_M0_log.log
betaLactamase/YNGKP_M0_loglikelihood.txt
betaLactamase/YNGKP_M0_modelparams.txt
betaLactamase/YNGKP_M0_tree.newick
betaLactamase/YNGKP_M5_log.log
betaLactamase/YNGKP_M5_loglikelihood.txt
betaLactamase/YNGKP_M5_modelparams.txt
betaLactamase/YNGKP_M5_tree.newick
betaLactamase/averaged_ExpCM_betaLactamase_prefs_log.log
betaLactamase/averaged_ExpCM_betaLactamase_prefs_loglikelihood.txt
betaLactamase/averaged_ExpCM_betaLactamase_prefs_modelparams.txt
betaLactamase/averaged_ExpCM_betaLactamase_prefs_tree.newick
betaLactamase/log.log
betaLactamase/modelcomparison.md
betaLactamase/ramxl_output.txt
betaLactamase/raxml_output.txt

Interpretation of phydms_comprehensive results

To compare the ExpCM with \(\beta\)-lactamase deep mutational scanning data to the YNGKP family we can look at the summary file modelcomparison.md.

Model                               deltaAIC  LogLikelihood  nParams  ParamValues
ExpCM_betaLactamase_prefs           0.00      -2592.16       6        beta=1.36, kappa=2.64, omega=0.69
YNGKP_M5                            717.42    -2944.87       12       alpha_omega=0.30, beta_omega=0.49, kappa=3.02
averaged_ExpCM_betaLactamase_prefs  794.48    -2989.40       6        beta=0.82, kappa=2.36, omega=0.28
YNGKP_M0                            819.26    -2996.79       11       kappa=2.39, omega=0.28

First, we can see the ExpCM has the largest log-likelihood of all four of the models. It significantly outperforms (evaluated by the \(\Delta\)AIC) the non-site-specific YNGKP models. It also outperforms the ExpCM control where the preferences are averaged across the sites rendering the model non-site-specific. These comparisons are evidence that the ExpCM model informed by the deep mutational scanning results describes the natural evolution of \(\beta\)-lactamase better than traditional, non-site-specific models.
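
The \(\Delta\)AIC column can be reproduced from the log-likelihoods and parameter counts in the table. The sketch below assumes the standard definition AIC = 2k - 2 lnL (k free parameters, lnL the log-likelihood), which recovers the tabulated values; the function and variable names are illustrative only.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 lnL."""
    return 2 * n_params - 2 * log_likelihood


# (log-likelihood, number of parameters) copied from modelcomparison.md
models = {
    "ExpCM_betaLactamase_prefs": (-2592.16, 6),
    "YNGKP_M5": (-2944.87, 12),
    "averaged_ExpCM_betaLactamase_prefs": (-2989.40, 6),
    "YNGKP_M0": (-2996.79, 11),
}

aics = {name: aic(lnl, k) for name, (lnl, k) in models.items()}
best = min(aics.values())  # the best model has the smallest AIC
delta_aic = {name: round(value - best, 2) for name, value in aics.items()}

for name, value in sorted(delta_aic.items(), key=lambda item: item[1]):
    print(f"{name}\t{value:.2f}")
```

Because AIC penalizes each additional parameter, the ExpCM is favored here both for its higher log-likelihood and for using fewer parameters than the YNGKP models.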

We can also evaluate the amino-acid preferences by the value of the ExpCM stringency parameter, \(\beta\). \(\beta\) is a way to gauge how well the selection in the lab compares to selection in nature. When \(\beta\) is fit to be greater than \(1\) it means the selection in nature prefers the same amino acids but with a greater stringency. The converse is true when \(\beta\) is less than \(1\). When our preferences are re-scaled with a \(\beta>1\), we see the strongly preferred amino acids “grow” while the weakly preferred amino acids “shrink”.

Above is a snippet of the \(\beta\)-lactamase preferences without scaling by \(\beta\)

Above is a snippet of the \(\beta\)-lactamase preferences scaled by \(\beta = 3\)

When \(\beta=0\), the heights become uniform and the ExpCM loses all of its site-specific information.

Above is a snippet of the \(\beta\)-lactamase preferences scaled by \(\beta = 0\)
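
The effect of \(\beta\) on a single site can be sketched numerically. The example assumes the re-scaling raises each preference to the power \(\beta\) and renormalizes the site to sum to one; the function name and the three-amino-acid site are hypothetical and serve only to illustrate the "grow" and "shrink" behavior described above.

```python
def rescale_prefs(prefs, beta):
    """Raise each preference to the power beta, then renormalize to sum to 1."""
    powered = {aa: p ** beta for aa, p in prefs.items()}
    total = sum(powered.values())
    return {aa: value / total for aa, value in powered.items()}


# Hypothetical preferences for one site (only three amino acids for brevity).
site_prefs = {"A": 0.6, "C": 0.3, "D": 0.1}

scaled = {beta: rescale_prefs(site_prefs, beta) for beta in (0, 1, 3)}
for beta, prefs in scaled.items():
    formatted = ", ".join(f"{aa}={p:.3f}" for aa, p in prefs.items())
    print(f"beta={beta}: {formatted}")
```

With \(\beta = 0\) every amino acid gets the same preference, with \(\beta = 1\) the preferences are unchanged, and with \(\beta = 3\) the most preferred amino acid takes up most of the total.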

In the \(\beta\)-lactamase example, \(\beta\) is fit to be \(1.36\). Since this number is close to \(1\), we can conclude not only that the ExpCM with amino-acid preferences is a better description of natural sequence evolution than non-site-specific models, but also that natural evolution prefers the same amino acids that are preferred in the experiment, with slightly greater stringency. For more information on scaling amino-acid preferences by \(\beta\), please see Bloom, 2014.

Please see the full documentation if you would like to learn more about the phydms_comprehensive program and its other options.

Comparing two DMS datasets for the same protein

If you perform a deep mutational scanning experiment multiple times under slightly different experimental conditions, you may want to compare how well each dataset explains natural sequence variation. These experimental differences could be in how the variant libraries were generated, how the selection pressure was exerted, etc. We can use phydms_comprehensive to compare ExpCM models with two or more different sets of preferences to both the YNGKP family of models and to each other.

For this phydms_comprehensive analysis, you will need multiple sets of amino-acid preferences for the same protein from your deep mutational scanning experiments and a codon-level sequence alignment of your gene. phydms_comprehensive will then

• infer a tree using RAxML
• run phydms with the YNGKP_M0, the YNGKP_M5, and the ExpCM (with and without averaged preferences) for each set of preferences
• summarize the results

See the full documentation for specifics on these models or the phydms_comprehensive program.

We are going to walk through an example of phydms_comprehensive and compare ExpCMs with amino-acid preferences for the influenza virus protein hemagglutinin described in Thyagarajan and Bloom, 2014 and Doud and Bloom, 2016.

phydms_comprehensive command-line usage

The requirements and options for phydms_comprehensive are the same as those shown in the Comparing your DMS data to natural sequence evolution section of this tutorial (run phydms_comprehensive -h to print them). Below is a discussion of the input files, running the phydms_comprehensive command, and interpretation of the results.

Input Files

This example uses the following files: HA_prefs_Thyagarajan.csv, HA_prefs_Doud.csv, and HA_alignment.fasta.

Amino-acid preferences

The HA amino-acid preferences from Thyagarajan and Bloom, 2014 and Doud and Bloom, 2016 were measured using two different library construction strategies. (Please see Thyagarajan and Bloom, 2014 and Doud and Bloom, 2016 for more information on the reverse-genetics strategy and the helper virus strategy respectively). We would like to know if one set of preferences significantly changes the behavior of the ExpCM compared to the other set or the average of the two sets.

We can compare the amino-acid preference measurements found in HA_prefs_Thyagarajan.csv and HA_prefs_Doud.csv using phydms_logoplot:

HA_prefs_Thyagarajan

HA_prefs_Doud

As we would expect, the preferences measured from the two experiments are similar but not identical.

For more information on how to convert enrichment scores to amino-acid preferences, and for an example of a preference file, please see the Comparing your DMS data to natural sequence evolution section of this tutorial. For more information on the preference file formats, please see the full phydms documentation.

Sequences

We will use an alignment of HA sequences. Note that each preference file covers exactly the same number of sites as the alignment.

Sequences read from: example_data/HA_alignment.fasta
There are 34 sequences
Each sequence is 564 amino acids long

Preferences read from: example_data/HA_prefs_Doud.csv
There are preferences measured for 564 sites
Preferences read from: example_data/HA_prefs_Thyagarajan.csv
There are preferences measured for 564 sites

You can use the phydms auxiliary program phydms_prepalignment to filter your sequences and prepare an alignment for phydms_comprehensive.

phydms_comprehensive

We can now run phydms_comprehensive by specifying our output prefix (in this case a directory called HA), our preferences, and our alignment.

phydms_comprehensive HA/ example_data/HA_alignment.fasta example_data/HA_prefs_Thyagarajan.csv example_data/HA_prefs_Doud.csv --raxml raxml

Output files

phydms_comprehensive produces both the standard phydms output files for each model and a summary file. See the Comparing your DMS data to natural sequence evolution section of the tutorial for an example and the full phydms documentation for more information on the standard output files.

Interpretation of phydms_comprehensive results

To compare the ExpCM with the different preferences, we can look at the summary file modelcomparison.md.

Model                                deltaAIC  LogLikelihood  nParams  ParamValues
ExpCM_HA_prefs_Doud                  0.00      -4877.65       6        beta=2.11, kappa=5.14, omega=0.52
ExpCM_HA_prefs_Thyagarajan           44.18     -4899.74       6        beta=1.72, kappa=4.94, omega=0.55
averaged_ExpCM_HA_prefs_Doud         2090.58   -5922.94       6        beta=0.68, kappa=5.36, omega=0.22
averaged_ExpCM_HA_prefs_Thyagarajan  2097.86   -5926.58       6        beta=0.31, kappa=5.36, omega=0.22
YNGKP_M5                             2113.50   -5928.40       12       alpha_omega=0.30, beta_omega=1.42, kappa=4.68
YNGKP_M0                             2219.64   -5982.47       11       kappa=4.61, omega=0.20

First, we can see that both ExpCMs have significantly larger (evaluated by the \(\Delta\)AIC) log-likelihoods than the two ExpCMs with averaged preferences. These, in turn, have larger log-likelihoods than the YNGKP family. Second, we can see the ExpCM with the preferences from Doud and Bloom, 2016 has a larger log-likelihood than the ExpCM with the Thyagarajan and Bloom, 2014 preferences.

These comparisons are evidence that the ExpCM with the Doud and Bloom, 2016 preferences is a better description of natural sequence evolution than the ExpCM with the Thyagarajan and Bloom, 2014 preferences or either one of the non-site-specific models.

Please see the full documentation if you would like to learn more about the phydms_comprehensive program and its other options.

Detecting Diversifying Selection

A deep mutational scanning experiment measures the amino-acid preferences of a given protein for a given selection pressure. However, selection in the lab is not expected to faithfully describe the selection a protein faces in nature. We can use phydms_comprehensive with the flag --omegabysite to identify sites that deviate from the ExpCM via an unexpectedly high or low rate of amino-acid substitution. That is, we can distinguish sites under diversifying selection (a high rate) from sites under a selective constraint not measured in the lab (a low rate). This is in contrast to differential selection, which selects for unexpected amino-acid substitutions rather than unexpected rates.

For more information on the exact procedure, please see the full documentation or Bloom, 2017. This procedure is analogous to the FEL method described by Pond and Frost, 2005.

We are going to walk through an example of phydms_comprehensive to detect sites under diversifying selection in the influenza virus protein hemagglutinin using the preferences measured by Doud and Bloom, 2016 and in \(\beta\)-lactamase using the preferences measured by Stiffler et al, 2015.

phydms_comprehensive command-line usage

The full list of requirements and options for phydms_comprehensive is shown in the Comparing your DMS data to natural sequence evolution section of this tutorial (run phydms_comprehensive -h to print it). To detect diversifying selection, we are going to include the optional flag --omegabysite, which fits omega (dN/dS) for each site. Below is a discussion of the input files, running the phydms_comprehensive command, and interpretation of the results.

Input Files

A full discussion of the amino-acid preferences and sequences for \(\beta\)-lactamase and HA can be found in the earlier sections of this tutorial. Briefly,

BetaLactamase
Sequences read from: example_data/betaLactamase_alignment.fasta
There are 50 sequences
Each sequence is 263 amino acids long
Preferences read from: example_data/betaLactamase_prefs.csv
There are preferences measured for 263 sites

HA
Sequences read from: example_data/HA_alignment.fasta
There are 34 sequences
Each sequence is 564 amino acids long
Preferences read from: example_data/HA_prefs_Doud.csv
There are preferences measured for 564 sites

phydms_comprehensive

We can now run phydms_comprehensive by specifying our output prefix (in this case a directory called HA_omegabysite or betaLactamase_omegabysite), our preferences, our alignment, and the flag --omegabysite. Each alignment requires its own phydms_comprehensive run.

phydms_comprehensive HA_omegabysite/ example_data/HA_alignment.fasta example_data/HA_prefs_Doud.csv --raxml raxml --omegabysite

phydms_comprehensive betaLactamase_omegabysite/ example_data/betaLactamase_alignment.fasta example_data/betaLactamase_prefs.csv --raxml raxml --omegabysite
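
When several datasets need the same analysis, the two commands above can be generated from a small table of inputs. This is a minimal sketch using only the positional arguments and flags listed in the phydms_comprehensive help text; the helper comprehensive_command is hypothetical, and the commands are only printed here, not executed.

```python
def comprehensive_command(outprefix, alignment, prefsfiles,
                          raxml="raxml", omegabysite=False):
    """Build the phydms_comprehensive argument list for one dataset."""
    cmd = ["phydms_comprehensive", outprefix, alignment] + list(prefsfiles)
    cmd += ["--raxml", raxml]
    if omegabysite:
        cmd.append("--omegabysite")
    return cmd


# (output prefix, alignment, preference files) for each protein
datasets = [
    ("HA_omegabysite/", "example_data/HA_alignment.fasta",
     ["example_data/HA_prefs_Doud.csv"]),
    ("betaLactamase_omegabysite/", "example_data/betaLactamase_alignment.fasta",
     ["example_data/betaLactamase_prefs.csv"]),
]

commands = [comprehensive_command(out, aln, prefs, omegabysite=True)
            for out, aln, prefs in datasets]
for cmd in commands:
    # Each list could be passed to subprocess.run(cmd, check=True) to run it.
    print(" ".join(cmd))
```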

Output files

phydms_comprehensive produces both the standard phydms output files for each model and a summary file. See the Comparing your DMS data to natural sequence evolution section of the tutorial for an example and the full phydms documentation for more information on the standard output files.

Interpretation of phydms_comprehensive results

To detect sites under diversifying selection, we can look at the summary files with the suffix _omegabysite.txt. First, we will look at the results from the HA analysis.

HA_omegabysite/ExpCM_HA_prefs_Doud_omegabysite.txt lists the site, the fitted \(\omega_r\) value, the p-value for the hypothesis \(H_0: \omega_r = 1\), the dLnL (difference in log-likelihood between ExpCM with \(\omega_r = 1\) and the ExpCM with the fitted \(\omega_r\) value), and the Q-value, which controls for multiple comparisons via the false discovery rate. For more information on these metrics, please see the full phydms_comprehensive documentation. The sites are sorted by the p-value which means the sites with the strongest evidence for deviations from \(\omega_r = 1\) will be at the top of the file.

head -n 20 HA_omegabysite/ExpCM_HA_prefs_Doud_omegabysite.txt
# Omega fit to each site after fixing tree and all other parameters.
# Fits compared to null model of omega = 1.
# P-values NOT corrected for multiple testing, so consider Q-values too.
# Q-values computed separately for omega > and < 1 # Will fit different synonymous rate for each site.
#
site    omega   P   dLnL    Q
277 0.000   0.00246 4.585   0.694
147 0.000   0.00184 4.853   0.694
1   1.000   1   0.000   1
373 1.000   1   0.000   1
374 0.000   0.4 0.353   1
375 0.000   0.284   0.574   1
376 1.000   1   0.000   1
377 1.000   1   0.000   1
378 0.000   0.523   0.204   1
380 1.000   1   0.000   1
372 0.000   0.507   0.220   1
381 0.000   0.975   0.000   1
382 0.000   0.456   0.278   1
383 1.000   1   0.000   1

and sites with the weakest evidence for deviation from \(\omega_r = 1\) will be at the bottom of the file.

tail -n 20 HA_omegabysite/ExpCM_HA_prefs_Doud_omegabysite.txt
204 0.356   0.292   0.554   1
141 100.000 0.0783  1.550   1
195 1.000   1   0.000   1
193 0.000   0.618   0.125   1
179 1.000   0.999   0.000   1
180 1.000   1   0.000   1
181 0.455   0.585   0.149   1
182 1.000   1   0.000   1
183 0.462   0.428   0.314   1
184 1.820   0.633   0.114   1
194 100.000 0.152   1.027   1
185 1.306   0.797   0.033   1
187 1.000   1   0.000   1
188 1.000   1   0.000   1
189 1.000   1   0.000   1
190 0.000   0.0538  1.860   1
191 0.000   0.686   0.081   1
192 1.000   1   0.000   1
186 1.000   1   0.000   1
564 0.000   0.472   0.259   1

You will notice that the sites with strong evidence have a fitted \(\omega_r\) that is either very large (\(100\)) or very small (\(0\)) while the sites with weak evidence have a fitted \(\omega_r\) close to \(1\).
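
The P and dLnL columns in the tables above appear to be linked by a likelihood-ratio test: the tabulated values are consistent with comparing \(2 \times \mathrm{dLnL}\) to a \(\chi^2\) distribution with one degree of freedom (the single extra parameter \(\omega_r\)). This is an inference from the numbers shown rather than something stated in this section, so treat the sketch below as a check under that assumption. The function name `lrt_pvalue` is hypothetical.

```python
import math

def lrt_pvalue(dlnl):
    """P-value for a likelihood-ratio test with one degree of freedom.

    Assumes 2 * dLnL is chi-squared distributed with 1 df under the null.
    For 1 df the chi-squared survival function reduces to erfc(sqrt(dLnL)).
    """
    if dlnl < 0:
        raise ValueError("dLnL must be non-negative")
    return math.erfc(math.sqrt(dlnl))

# dLnL values copied from the first two HA rows shown above.
for site, dlnl in [(277, 4.585), (147, 4.853)]:
    print(site, round(lrt_pvalue(dlnl), 5))
```

Under this assumption, a site whose fitted \(\omega_r\) leaves the likelihood unchanged (dLnL of 0) gets a P-value of 1, which matches the rows with \(\omega_r = 1\).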

Here are the sites in \(\beta\)-lactamase with the strongest evidence for deviations from \(\omega_r = 1\):

head -n 20 betaLactamase_omegabysite/ExpCM_betaLactamase_prefs_omegabysite.txt
# Omega fit to each site after fixing tree and all other parameters.
# Fits compared to null model of omega = 1.
# P-values NOT corrected for multiple testing, so consider Q-values too.
# Q-values computed separately for omega > and < 1 # Will fit different synonymous rate for each site.
#
site    omega   P   dLnL    Q
248 20.043  0.000567    5.941   0.0745
213 37.250  0.000345    6.404   0.0745
218 95.530  0.00262 4.527   0.23
233 0.000   0.00164 4.956   0.432
139 33.371  0.0112  3.215   0.738
134 0.000   0.0133  3.064   0.875
146 0.000   0.00988 3.328   0.875
87  0.029   0.0102  3.299   0.875
257 0.000   0.0169  2.853   0.889
209 100.000 0.0175  2.823   0.92
105 72.028  0.0286  2.396   0.951
15  100.000 0.0326  2.284   0.951
141 100.000 0.0307  2.334   0.951
46  82.846  0.0218  2.629   0.951

In both examples, a small subset of sites displays a faster (\(\omega_r > 1\)) or slower (\(\omega_r < 1\)) than expected rate of amino-acid substitution.
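
Such sites can also be pulled out programmatically. The following is a minimal sketch, assuming the `_omegabysite.txt` file is tab-separated with `#` comment lines above the column header. The function name `significant_sites` and the Q-value cutoff are illustrative choices, not phydms defaults. The example rows are copied from the \(\beta\)-lactamase output above.

```python
import io
import pandas as pd

def significant_sites(source, q_cutoff=0.05):
    """Return (faster, slower) site lists with Q below the cutoff.

    `source` is a path or file-like object; comment lines starting
    with '#' are skipped and columns are assumed to be tab-separated.
    """
    df = pd.read_csv(source, sep="\t", comment="#")
    hits = df[df["Q"] < q_cutoff]
    faster = hits[hits["omega"] > 1]["site"].tolist()
    slower = hits[hits["omega"] < 1]["site"].tolist()
    return faster, slower

# A few rows from the beta-lactamase table, rebuilt as tab-separated text.
example = io.StringIO(
    "# Omega fit to each site after fixing tree and all other parameters.\n"
    "site\tomega\tP\tdLnL\tQ\n"
    "248\t20.043\t0.000567\t5.941\t0.0745\n"
    "213\t37.250\t0.000345\t6.404\t0.0745\n"
    "218\t95.530\t0.00262\t4.527\t0.23\n"
    "233\t0.000\t0.00164\t4.956\t0.432\n"
)

faster, slower = significant_sites(example, q_cutoff=0.1)
print("omega > 1:", faster)
print("omega < 1:", slower)
```

Filtering on Q rather than P keeps the multiple-testing correction in play; the cutoff itself is a judgment call for each analysis.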

Detecting Differential Selection

A deep mutational scanning experiment measures the amino-acid preferences of a given protein for a given selection pressure. However, selection in the lab is not expected to faithfully reproduce the selection a protein faces in nature. We can use phydms_comprehensive and the flag --diffprefsbysite to identify sites which deviate from the ExpCM model via differential selection. In contrast to diversifying selection, differential selection leads to unexpected amino-acid substitutions rather than unexpected rates.

For more information on the exact procedure, please see the full documentation or Bloom, 2017.

We are going to walk through an example of phydms_comprehensive to detect sites under differential selection in the influenza virus protein hemagglutinin using the preferences measured by Doud and Bloom, 2016 and in \(\beta\)-lactamase using the preferences measured by Stiffler et al, 2015.

phydms_comprehensive command-line usage

Here is the full list of requirements and options for phydms_comprehensive. To detect differential selection, we are going to include the optional flag --diffprefsbysite. Below is a discussion of the input files, running the phydms_comprehensive command, and interpretation of the results.

phydms_comprehensive -h
usage: phydms_comprehensive [-h] (--raxml RAXML | --tree TREE) [--ncpus NCPUS]
                            [--brlen {scale,optimize}] [--omegabysite]
                            [--diffprefsbysite] [--gammaomega] [--gammabeta]
                            [--no-avgprefs] [--randprefs] [-v]
                            outprefix alignment prefsfiles [prefsfiles ...]

Comprehensive phylogenetic model comparison and detection of selection
informed by deep mutational scanning data. This program runs 'phydms'
repeatedly to compare substitution models and detect selection. The 'phydms'
package is written by the Bloom lab (see
https://github.com/jbloomlab/phydms/contributors). Version 2.2.dev1. Full
documentation at http://jbloomlab.github.io/phydms

positional arguments:
  outprefix             Output file prefix.
  alignment             Existing FASTA file with aligned codon sequences.
  prefsfiles            Existing files with site-specific amino-acid
                        preferences.

optional arguments:
  -h, --help            show this help message and exit
  --raxml RAXML         Path to RAxML (e.g., 'raxml') (default: None)
  --tree TREE           Existing Newick file giving input tree. (default:
                        None)
  --ncpus NCPUS         Use this many CPUs; -1 means all available. (default:
                        -1)
  --brlen {scale,optimize}
                        How to handle branch lengths: scale by single
                        parameter or optimize each one (default: optimize)
  --omegabysite         Fit omega (dN/dS) for each site. (default: False)
  --diffprefsbysite     Fit differential preferences for each site. (default:
                        False)
  --gammaomega          Fit ExpCM with gamma distributed omega. (default:
                        False)
  --gammabeta           Fit ExpCM with gamma distributed beta. (default:
                        False)
  --no-avgprefs         No fitting of models with preferences averaged across
                        sites for ExpCM. (default: False)
  --randprefs           Include ExpCM models with randomized preferences.
                        (default: False)
  -v, --version         show program's version number and exit

Input Files

A full discussion of the amino-acid preferences and sequences for \(\beta\)-lactamase and HA can be found in the earlier sections of this tutorial. Briefly,

BetaLactamase
Sequences read from: example_data/betaLactamase_alignment.fasta
There are 50 sequences
Each sequence is 263 amino acids long
Preferences read from: example_data/betaLactamase_prefs.csv
There are preferences measured for 263 sites

HA
Sequences read from: example_data/HA_alignment.fasta
There are 34 sequences
Each sequence is 564 amino acids long
Preferences read from: example_data/HA_prefs_Doud.csv
There are preferences measured for 564 sites

phydms_comprehensive

We can now run phydms_comprehensive by specifying our output prefix (in this case a directory called HA_diffprefs or betaLactamase_diffprefs), our alignment, our preferences, and the flag --diffprefsbysite. Each alignment requires its own phydms_comprehensive run.

phydms_comprehensive HA_diffprefs/ example_data/HA_alignment.fasta example_data/HA_prefs_Doud.csv --raxml raxml --diffprefsbysite

phydms_comprehensive betaLactamase_diffprefs/ example_data/betaLactamase_alignment.fasta example_data/betaLactamase_prefs.csv --raxml raxml --diffprefsbysite

Output files

phydms_comprehensive produces both the standard phydms output files for each model and a summary file. See the Comparing your DMS data to natural sequence evolution section of the tutorial for an example and the full phydms documentation for more information on the standard output files.

Interpretation of phydms_comprehensive results

To detect sites under differential selection, we can look at the summary files with the suffix _diffprefsbysite.txt. First, we will look at the results from the HA analysis.

HA_diffprefs/ExpCM_HA_prefs_Doud_diffprefsbysite.txt lists the site, the differential preference for each amino acid, and half the absolute sum of the differential preferences for that site. A large differential preference value means that substitutions to that amino acid were seen at that site in the alignment more often than expected given the ExpCM and the measured amino-acid preferences. The last column can be used as a summary of differential selection strength for a given site. Larger values indicate that the site is under stronger differential selection. For more information on these metrics, please see the full phydms_comprehensive documentation.
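
The summary column can be recomputed from the per-amino-acid columns. The sketch below assumes the differential preference columns are named with a `dpi_` prefix, as in the output shown in this section. The function name `half_sum_abs_dpi` and the toy values are hypothetical.

```python
import pandas as pd

def half_sum_abs_dpi(df):
    """Half the sum of absolute differential preferences for each site."""
    dpi_cols = [c for c in df.columns if c.startswith("dpi_")]
    return 0.5 * df[dpi_cols].abs().sum(axis=1)

# Toy data with three amino acids instead of twenty (illustrative values).
toy = pd.DataFrame({
    "site": [1, 2],
    "dpi_A": [0.15, 0.0],
    "dpi_C": [-0.05, 0.0],
    "dpi_D": [-0.10, 0.0],
})
toy["half_sum_abs_dpi"] = half_sum_abs_dpi(toy)
print(toy.to_string(index=False))
```

If the differential preferences at a site sum to zero, which is assumed here and is consistent with the rows displayed in this section, the summary equals the total of the positive entries: 0.15 for the first toy site and 0 for the second.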

Here are the sites (and a subset of the differential preferences) in HA with the strongest evidence for differential selection:

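import pandas as pd
# Skip the first three lines of the file, which precede the column header.
df = pd.read_csv("HA_diffprefs/ExpCM_HA_prefs_Doud_diffprefsbysite.txt", sep='\t', skiprows=(0,1,2))
# Print the site, the first eight differential preferences, and the summary column.
print(df.iloc[:, list(range(9)) + [-1]].head(6).to_string(index=False))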
site   dpi_A   dpi_C   dpi_D   dpi_E   dpi_F   dpi_G   dpi_H   dpi_I  half_sum_abs_dpi
 174 -0.0089 -0.0050 -0.0074 -0.0091 -0.0022 -0.0097 -0.0065 -0.0094            0.1958
 127 -0.0166 -0.0020 -0.0018 -0.0006  0.1702 -0.0137 -0.0054 -0.0226            0.1702
 171 -0.0148 -0.0001 -0.0166 -0.0147 -0.0003  0.1638 -0.0028 -0.0159            0.1638
 277 -0.0072 -0.0030 -0.0051 -0.0182 -0.0025  0.1633 -0.0042 -0.0092            0.1633
   9 -0.0123  0.1620 -0.0031 -0.0022 -0.0074 -0.0068 -0.0087 -0.0152            0.1620
 496 -0.0079 -0.0040 -0.0054 -0.0098 -0.0049 -0.0092 -0.0076 -0.0117            0.1554

and the sites with the weakest evidence:

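import pandas as pd
# Skip the first three lines of the file, which precede the column header.
df = pd.read_csv("HA_diffprefs/ExpCM_HA_prefs_Doud_diffprefsbysite.txt", sep='\t', skiprows=(0,1,2))
# Same columns as before, but for the last six rows of the file.
print(df.iloc[:, list(range(9)) + [-1]].tail(6).to_string(index=False))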
site   dpi_A   dpi_C   dpi_D   dpi_E   dpi_F   dpi_G   dpi_H   dpi_I  half_sum_abs_dpi
 351 -0.0010 -0.0006 -0.0003 -0.0001  0.0068 -0.0012 -0.0001 -0.0014            0.0092
 327 -0.0001  0.0016 -0.0003 -0.0003  0.0003 -0.0010  0.0004 -0.0043            0.0088
 378  0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000            0.0000
 342 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000            0.0000
  73 -0.0000 -0.0000 -0.0000 -0.0000  0.0000 -0.0000 -0.0000  0.0000            0.0000
 283 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000            0.0000

Here are the sites in \(\beta\)-lactamase with the strongest evidence for differential selection:

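import pandas as pd
# Skip the first three lines of the file, which precede the column header.
df = pd.read_csv("betaLactamase_diffprefs/ExpCM_betaLactamase_prefs_diffprefsbysite.txt", sep='\t', skiprows=(0,1,2))
# Print the site, the first eight differential preferences, and the summary column.
print(df.iloc[:, list(range(9)) + [-1]].head(6).to_string(index=False))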
site   dpi_A   dpi_C   dpi_D   dpi_E   dpi_F   dpi_G   dpi_H   dpi_I  half_sum_abs_dpi
 213 -0.3701 -0.0003 -0.0009 -0.0003 -0.0003 -0.5691 -0.0005 -0.0003            0.9656
 214 -0.0518 -0.0002 -0.0492 -0.0417  0.0032 -0.0533 -0.0048  0.0049            0.3394
 104 -0.0201 -0.0054 -0.0076 -0.0092 -0.0035 -0.0009 -0.0080 -0.0285            0.2555
 167 -0.0147 -0.0042 -0.0063 -0.0198 -0.0027 -0.0145 -0.0063 -0.0072            0.2271
  44 -0.0217 -0.0119 -0.0006 -0.0006 -0.0006 -0.0211 -0.0031  0.0429            0.2206
  90 -0.0187 -0.0024  0.1839 -0.0134 -0.0009 -0.0329 -0.0097 -0.0037            0.1839
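
The tables above show only the first eight amino-acid columns, so the amino acid driving a large summary value is often out of view (site 213, for example). A short sketch like the following reports, for the top-ranked sites, the amino acid with the largest differential preference. The function name `top_amino_acid` is hypothetical, and the toy table reuses three HA rows with only three of the twenty amino-acid columns.

```python
import pandas as pd

def top_amino_acid(df, n=6):
    """For the n sites with the largest summary value, report the
    amino acid with the largest differential preference."""
    dpi_cols = [c for c in df.columns if c.startswith("dpi_")]
    top = df.nlargest(n, "half_sum_abs_dpi")
    best = top[dpi_cols].idxmax(axis=1).str.replace("dpi_", "", regex=False)
    return pd.DataFrame({
        "site": top["site"],
        "amino_acid": best,
        "dpi": top[dpi_cols].max(axis=1),
    })

# Three HA rows from the table above, restricted to three amino acids.
toy = pd.DataFrame({
    "site": [9, 127, 171],
    "dpi_C": [0.1620, -0.0020, -0.0001],
    "dpi_F": [-0.0074, 0.1702, -0.0003],
    "dpi_G": [-0.0068, -0.0137, 0.1638],
    "half_sum_abs_dpi": [0.1620, 0.1702, 0.1638],
})
result = top_amino_acid(toy, n=3)
print(result.to_string(index=False))
```

On a real `_diffprefsbysite.txt` file the same function would be applied to the data frame loaded with `pd.read_csv` as in the snippets above.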

We can visualize the \(\beta\)-lactamase differential preferences for each site using phydms_logoplot:

phydms_logoplot logoplots/betaLactamase_prefs_diff.pdf --mapmetric functionalgroup --colormap mapmetric --diffprefs betaLactamase_diffprefs/ExpCM_betaLactamase_prefs_diffprefsbysite.txt

In both the HA and the \(\beta\)-lactamase examples, a small subset of sites appears to be under differential selection.