phydms Tutorial
In this tutorial you will learn the basics of phydms. We will walk through examples of analyses you may want to run to compare your deep mutational scanning data to natural sequence evolution. For more details on any of the steps, please see the phydms documentation.
phydms was developed at the Bloom Lab (full list of contributors).
Comparing your DMS data to natural sequence evolution
After you perform your deep mutational scanning experiment, you may want to know how well the experimental measurements you made in the lab describe the natural sequence evolution of your protein. Using phydms you can compare the phylogenetic substitution model ExpCM, which takes into account the site-specific amino-acid preferences from your deep mutational scanning experiment, to traditional, non-site-specific models from the YNGKP family. phydms_comprehensive will run phydms in several different modes to generate results for the appropriate comparisons.
For the standard phydms_comprehensive analysis, you will need amino-acid preferences from your deep mutational scanning experiment and a codon-level sequence alignment of your gene. phydms_comprehensive will then
• infer a tree using RAxML
• run phydms with the YNGKP_M0, the YNGKP_M5, the ExpCM, and a control ExpCM run with averaged preferences
• summarize the results
See the full documentation for specifics on these models.
We are going to walk through an example of phydms_comprehensive for the \(\beta\)-lactamase gene. We will compare an ExpCM with deep mutational scanning data from Stiffler et al, 2015 to the YNGKP family of models.
phydms_comprehensive command-line usage
Here is the full list of requirements and options for phydms_comprehensive. Below is a discussion of the input files, running the phydms_comprehensive command, and interpretation of the results.
phydms_comprehensive -h
usage: phydms_comprehensive [-h] (--raxml RAXML | --tree TREE) [--ncpus NCPUS]
[--brlen {scale,optimize}] [--omegabysite]
[--diffprefsbysite] [--gammaomega] [--gammabeta]
[--no-avgprefs] [--randprefs] [-v]
outprefix alignment prefsfiles [prefsfiles ...]
Comprehensive phylogenetic model comparison and detection of selection
informed by deep mutational scanning data. This program runs 'phydms'
repeatedly to compare substitution models and detect selection. The 'phydms'
package is written by the Bloom lab (see
https://github.com/jbloomlab/phydms/contributors). Version 2.2.dev1. Full
documentation at http://jbloomlab.github.io/phydms
positional arguments:
outprefix Output file prefix.
alignment Existing FASTA file with aligned codon sequences.
prefsfiles Existing files with site-specific amino-acid
preferences.
optional arguments:
-h, --help show this help message and exit
--raxml RAXML Path to RAxML (e.g., 'raxml') (default: None)
--tree TREE Existing Newick file giving input tree. (default:
None)
--ncpus NCPUS Use this many CPUs; -1 means all available. (default:
-1)
--brlen {scale,optimize}
How to handle branch lengths: scale by single
parameter or optimize each one (default: optimize)
--omegabysite Fit omega (dN/dS) for each site. (default: False)
--diffprefsbysite Fit differential preferences for each site. (default:
False)
--gammaomega Fit ExpCM with gamma distributed omega. (default:
False)
--gammabeta Fit ExpCM with gamma distributed beta. (default:
False)
--no-avgprefs No fitting of models with preferences averaged across
sites for ExpCM. (default: False)
--randprefs Include ExpCM models with randomized preferences.
(default: False)
-v, --version show program's version number and exit
This example uses the following files: betaLactamase_enrichmentScores.csv, betaLactamase_prefs.csv, and betaLactamase_alignment.fasta.
Data from deep mutational scanning experiments are often reported as the enrichment of a given amino acid compared to the wild-type amino acid. Here is a snippet of the log enrichment scores for \(\beta\)-lactamase from the file betaLactamase_enrichmentScores.csv (Stiffler et al, 2015 Supplemental File 1) and the full dataset visualized as a heatmap.
library(ggplot2)
df = read.csv("example_data/betaLactamase_enrichmentScores.csv")
head(df,6)
Site wildtype AminoAcid Trial_1_AmpConc_2500 Trial_2_AmpConc_2500.1
1 26 H A -0.003757501 -0.01580963
2 26 H C -0.319356518 -0.51716333
3 26 H D -0.155937758 -0.20727546
4 26 H E -0.200042482 -0.35800585
5 26 H F -0.493981846 -1.14299234
6 26 H G -0.043527004 -0.07695634
base_size <- 15
p <- ggplot(df, aes(Site, AminoAcid)) + geom_tile(aes(fill = Trial_1_AmpConc_2500),colour = "white") + scale_fill_gradient(low = "white",high = "steelblue") + guides(fill = guide_legend(title = "Enrichment relative to \nwildtype amino acid"))
p = p + theme_grey(base_size = base_size) + scale_x_discrete(expand = c(0, 0)) + scale_y_discrete(expand = c(0, 0))
p
phydms uses amino-acid preferences rather than enrichment scores. We can transform these log enrichment scores to amino-acid preferences by exponentiating the scores and then normalizing them so that the values at each site sum to one. Below are the first few sites and amino acids of the resulting preferences. Notice that the numbering has changed from above. In phydms a site is numbered in relation to the first site in the preferences rather than to the start codon.
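import pandas as pd
df = pd.read_csv("example_data/betaLactamase_enrichmentScores.csv")
minenrichment = 1.0e-4 # minimum allowed enrichment
# Convert the log10 scores to enrichments, floored at minenrichment.
# Note: as written, both terms use Trial_1_AmpConc_2500, so this is just the Trial 1 enrichment.
df["preference"] = [max(minenrichment, (10**df["Trial_1_AmpConc_2500"][x] + 10**df["Trial_1_AmpConc_2500"][x])/2) for x in range(len(df))]
# Reshape to one row per site and one column per amino acid.
df = df.pivot(index = "Site", columns = "AminoAcid", values = "preference")
df.fillna(1, inplace = True) # entries with no score are treated as an enrichment of 1
df = df.div(df.sum(axis=1), axis=0) # normalize so the preferences at each site sum to 1
df.insert(0, "site", range(1,len(df)+1)) # renumber sites sequentially starting from 1
df.to_csv("example_data/betaLactamase_prefs.csv", index = False)
print(df.iloc[:,:9].head(6).to_string(index = False))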
site A C D E F G H I
1 0.070118 0.033902 0.049391 0.044621 0.022678 0.063982 0.070727 0.036060
2 0.002058 0.035362 0.057655 0.078354 0.040415 0.012446 0.070924 0.049739
3 0.057806 0.042278 0.050113 0.059354 0.048054 0.068718 0.063015 0.043187
4 0.013668 0.004088 0.003170 0.002596 0.001298 0.001802 0.005931 0.177391
5 0.066102 0.004275 0.000273 0.002788 0.027495 0.003923 0.025634 0.105117
6 0.056257 0.048057 0.056197 0.070021 0.042867 0.054567 0.048590 0.047577
We can visualize the preferences as a logoplot using phydms_logoplot:
For more information on the preference file formats, please see the full phydms documentation. For more detailed information on transforming enrichment scores into amino-acid preferences, please see dms_tools: Algorithm to infer site-specific preferences.
We will use an alignment of \(\beta\)-lactamase sequences. Note that the number of sites in the preferences file must exactly match the number of codon sites in the alignment.
Sequences read from: example_data/betaLactamase_alignment.fasta
There are 50 sequences
Each sequence is 263 amino acids long
Preferences read from: example_data/betaLactamase_prefs.csv
There are preferences measured for 263 sites
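The matching-length requirement can be checked before starting a long run. The following is a minimal sketch and is not part of phydms: the helper names `read_fasta` and `check_site_counts` are hypothetical, and it assumes a preferences CSV with a header row followed by one row per site. It uses tiny in-memory inputs so that it runs without the example data.

```python
import csv
import io


def read_fasta(handle):
    """Return a dict mapping sequence name to sequence from a FASTA handle."""
    seqs = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line
    return seqs


def check_site_counts(fasta_handle, prefs_handle):
    """Return (n_codon_sites, n_pref_sites); raise ValueError on a mismatch."""
    seqs = read_fasta(fasta_handle)
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("sequences are missing or not all the same length")
    nt_length = lengths.pop()
    if nt_length % 3 != 0:
        raise ValueError("alignment length is not a multiple of 3")
    n_codon_sites = nt_length // 3
    # Assumption: one CSV row per site after the header row.
    n_pref_sites = sum(1 for _ in csv.DictReader(prefs_handle))
    if n_codon_sites != n_pref_sites:
        raise ValueError(
            f"{n_codon_sites} codon sites but {n_pref_sites} preference sites"
        )
    return n_codon_sites, n_pref_sites


# Toy inputs: two aligned sequences of 2 codons, and preferences for 2 sites.
toy_fasta = io.StringIO(">seq1\nATGGCT\n>seq2\nATGGCA\n")
toy_prefs = io.StringIO("site,A,C\n1,0.6,0.4\n2,0.5,0.5\n")
result = check_site_counts(toy_fasta, toy_prefs)
print(result)
```

For the \(\beta\)-lactamase example, the run log in this tutorial reports an alignment of length 789 nucleotides, that is 789 / 3 = 263 codon sites, which matches the 263 sites with preferences.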
You can use the phydms auxiliary program phydms_prepalignment to filter your sequences and prepare an alignment for phydms_comprehensive.
phydms_comprehensive
We can now run phydms_comprehensive by specifying our output prefix (in this case a directory called betaLactamase), our preferences, and our alignment. The output below is the phydms_comprehensive run log.
phydms_comprehensive betaLactamase/ example_data/betaLactamase_alignment.fasta example_data/betaLactamase_prefs.csv --raxml raxml
2018-01-24 21:46:20,943 - INFO - Beginning execution of phydms_comprehensive in directory /Users/sarah/Desktop/phydms/tutorial
2018-01-24 21:46:20,943 - INFO - Progress is being logged to betaLactamase/log.log
2018-01-24 21:46:20,943 - INFO - Version information:
Time and date: Wed Jan 24 21:46:20 2018
Platform: Darwin-15.6.0-x86_64-i386-64bit
Python version: 3.5.2 (v3.5.2:4def2a2901a5, Jun 26 2016, 10:47:25) [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
phydms version: 2.2.dev1
Bio version: 1.68
cython version: 0.27.3
numpy version: 1.13.3
scipy version: 0.19.0
matplotlib version: 2.0.2
natsort version: 5.1.1
sympy version: 1.0
six version: 1.10.0
pandas version: 0.20.3
pyvolve version: 0.8.4
statsmodels version: 0.8.0
weblogolib version: 3.5.0
PyPDF2 version: 1.26.0
2018-01-24 21:46:20,943 - INFO - Parsed the following command-line arguments:
randprefs = False
noavgprefs = False
raxml = raxml
outprefix = betaLactamase/
prefsfiles = ['example_data/betaLactamase_prefs.csv']
alignment = example_data/betaLactamase_alignment.fasta
omegabysite = False
gammaomega = False
diffprefsbysite = False
gammabeta = False
brlen = optimize
tree = None
ncpus = -1
2018-01-24 21:46:20,944 - INFO - Checking that the alignment example_data/betaLactamase_alignment.fasta is valid...
2018-01-24 21:46:21,150 - INFO - Valid alignment specifying 50 sequences of length 789.
2018-01-24 21:46:21,150 - INFO - Tree not specified.
2018-01-24 21:46:21,169 - INFO - Inferring tree with RAxML using command raxml
2018-01-24 21:46:23,122 - INFO - RAxML inferred tree is now named betaLactamase/RAxML_tree.newick
2018-01-24 21:46:23,123 - INFO - Removed the following existing files that have names that match the names of output files that will be created: betaLactamase/YNGKP_M5_log.log, betaLactamase/ExpCM_betaLactamase_prefs_log.log, betaLactamase/YNGKP_M0_log.log, betaLactamase/averaged_ExpCM_betaLactamase_prefs_log.log
2018-01-24 21:46:23,123 - INFO - Starting analysis to optimize tree in betaLactamase/RAxML_tree.newick using model YNGKP_M5. The command is: phydms example_data/betaLactamase_alignment.fasta betaLactamase/RAxML_tree.newick YNGKP_M5 betaLactamase/YNGKP_M5 --brlen optimize --ncpus 1
2018-01-24 21:46:23,124 - INFO - Starting analysis to optimize tree in betaLactamase/RAxML_tree.newick using model ExpCM_betaLactamase_prefs. The command is: phydms example_data/betaLactamase_alignment.fasta betaLactamase/RAxML_tree.newick ExpCM_example_data/betaLactamase_prefs.csv betaLactamase/ExpCM_betaLactamase_prefs --brlen optimize --ncpus 1
2018-01-24 21:46:23,124 - INFO - Starting analysis to optimize tree in betaLactamase/RAxML_tree.newick using model YNGKP_M0. The command is: phydms example_data/betaLactamase_alignment.fasta betaLactamase/RAxML_tree.newick YNGKP_M0 betaLactamase/YNGKP_M0 --brlen optimize --ncpus 1
2018-01-24 21:46:23,124 - INFO - Starting analysis to optimize tree in betaLactamase/RAxML_tree.newick using model averaged_ExpCM_betaLactamase_prefs. The command is: phydms example_data/betaLactamase_alignment.fasta betaLactamase/RAxML_tree.newick ExpCM_example_data/betaLactamase_prefs.csv betaLactamase/averaged_ExpCM_betaLactamase_prefs --brlen optimize --avgprefs --ncpus 1
2018-01-24 21:53:11,022 - INFO - Analysis completed for YNGKP_M0
2018-01-24 21:53:11,023 - INFO - Found expected output file betaLactamase/YNGKP_M0_log.log
2018-01-24 21:53:11,024 - INFO - Found expected output file betaLactamase/YNGKP_M0_tree.newick
2018-01-24 21:53:11,024 - INFO - Found expected output file betaLactamase/YNGKP_M0_loglikelihood.txt
2018-01-24 21:53:11,024 - INFO - Found expected output file betaLactamase/YNGKP_M0_modelparams.txt
2018-01-24 21:53:11,025 - INFO - Analysis successful for YNGKP_M0
2018-01-24 22:02:11,197 - INFO - Analysis completed for YNGKP_M5
2018-01-24 22:02:11,207 - INFO - Found expected output file betaLactamase/YNGKP_M5_log.log
2018-01-24 22:02:11,207 - INFO - Found expected output file betaLactamase/YNGKP_M5_tree.newick
2018-01-24 22:02:11,208 - INFO - Found expected output file betaLactamase/YNGKP_M5_loglikelihood.txt
2018-01-24 22:02:11,208 - INFO - Found expected output file betaLactamase/YNGKP_M5_modelparams.txt
2018-01-24 22:02:11,208 - INFO - Analysis successful for YNGKP_M5
2018-01-24 22:03:07,584 - INFO - Analysis completed for ExpCM_betaLactamase_prefs
2018-01-24 22:03:07,584 - INFO - Found expected output file betaLactamase/ExpCM_betaLactamase_prefs_log.log
2018-01-24 22:03:07,584 - INFO - Found expected output file betaLactamase/ExpCM_betaLactamase_prefs_tree.newick
2018-01-24 22:03:07,585 - INFO - Found expected output file betaLactamase/ExpCM_betaLactamase_prefs_loglikelihood.txt
2018-01-24 22:03:07,585 - INFO - Found expected output file betaLactamase/ExpCM_betaLactamase_prefs_modelparams.txt
2018-01-24 22:03:07,585 - INFO - Analysis successful for ExpCM_betaLactamase_prefs
2018-01-24 22:05:44,594 - INFO - Analysis completed for averaged_ExpCM_betaLactamase_prefs
2018-01-24 22:05:44,594 - INFO - Found expected output file betaLactamase/averaged_ExpCM_betaLactamase_prefs_log.log
2018-01-24 22:05:44,594 - INFO - Found expected output file betaLactamase/averaged_ExpCM_betaLactamase_prefs_tree.newick
2018-01-24 22:05:44,595 - INFO - Found expected output file betaLactamase/averaged_ExpCM_betaLactamase_prefs_loglikelihood.txt
2018-01-24 22:05:44,595 - INFO - Found expected output file betaLactamase/averaged_ExpCM_betaLactamase_prefs_modelparams.txt
2018-01-24 22:05:44,595 - INFO - Analysis successful for averaged_ExpCM_betaLactamase_prefs
2018-01-24 22:05:45,612 - INFO - Successful completion of phydms_comprehensive
phydms_comprehensive produces both the standard phydms output files for each model and a summary file. Please see the full documentation for more information on the standard output files.
ls betaLactamase/*
betaLactamase/ExpCM_betaLactamase_prefs_log.log
betaLactamase/ExpCM_betaLactamase_prefs_loglikelihood.txt
betaLactamase/ExpCM_betaLactamase_prefs_modelparams.txt
betaLactamase/ExpCM_betaLactamase_prefs_tree.newick
betaLactamase/RAxML_bestTree.betaLactamase_alignment
betaLactamase/RAxML_tree.newick
betaLactamase/YNGKP_M0_log.log
betaLactamase/YNGKP_M0_loglikelihood.txt
betaLactamase/YNGKP_M0_modelparams.txt
betaLactamase/YNGKP_M0_tree.newick
betaLactamase/YNGKP_M5_log.log
betaLactamase/YNGKP_M5_loglikelihood.txt
betaLactamase/YNGKP_M5_modelparams.txt
betaLactamase/YNGKP_M5_tree.newick
betaLactamase/averaged_ExpCM_betaLactamase_prefs_log.log
betaLactamase/averaged_ExpCM_betaLactamase_prefs_loglikelihood.txt
betaLactamase/averaged_ExpCM_betaLactamase_prefs_modelparams.txt
betaLactamase/averaged_ExpCM_betaLactamase_prefs_tree.newick
betaLactamase/log.log
betaLactamase/modelcomparison.md
betaLactamase/ramxl_output.txt
betaLactamase/raxml_output.txt
phydms_comprehensive results
To compare the ExpCM with \(\beta\)-lactamase deep mutational scanning data to the YNGKP family we can look at the summary file modelcomparison.md.
Model | deltaAIC | LogLikelihood | nParams | ParamValues |
---|---|---|---|---|
ExpCM_betaLactamase_prefs | 0.00 | -2592.16 | 6 | beta=1.36, kappa=2.64, omega=0.69 |
YNGKP_M5 | 717.42 | -2944.87 | 12 | alpha_omega=0.30, beta_omega=0.49, kappa=3.02 |
averaged_ExpCM_betaLactamase_prefs | 794.48 | -2989.40 | 6 | beta=0.82, kappa=2.36, omega=0.28 |
YNGKP_M0 | 819.26 | -2996.79 | 11 | kappa=2.39, omega=0.28 |
First, we can see the ExpCM has the largest log-likelihood of all four of the models. It significantly outperforms (evaluated by the \(\Delta\)AIC) the non-site-specific YNGKP models. It also outperforms the ExpCM control where the preferences are averaged across the sites rendering the model non-site-specific. These comparisons are evidence that the ExpCM model informed by the deep mutational scanning results describes the natural evolution of \(\beta\)-lactamase better than traditional, non-site-specific models.
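The deltaAIC column can be reproduced from the LogLikelihood and nParams columns, assuming the standard definition \(\mathrm{AIC} = 2k - 2\ln L\), where \(k\) is nParams and \(\ln L\) is LogLikelihood. The sketch below is illustrative and is not phydms code; the numbers are copied from the table above.

```python
# (model name, log-likelihood, number of free parameters) from modelcomparison.md
models = [
    ("ExpCM_betaLactamase_prefs", -2592.16, 6),
    ("YNGKP_M5", -2944.87, 12),
    ("averaged_ExpCM_betaLactamase_prefs", -2989.40, 6),
    ("YNGKP_M0", -2996.79, 11),
]


def aic(loglik, nparams):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * nparams - 2 * loglik


aics = {name: aic(loglik, k) for name, loglik, k in models}
best = min(aics.values())
# deltaAIC is each model's AIC minus the AIC of the best model.
delta_aic = {name: round(value - best, 2) for name, value in aics.items()}

for name, delta in sorted(delta_aic.items(), key=lambda item: item[1]):
    print(f"{name}\t{delta:.2f}")
```

Because the AIC penalizes each extra parameter by 2 units, the ExpCM is favored here both by its higher log-likelihood and by having fewer free parameters than the YNGKP models.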
We can also evaluate the amino-acid preferences by the value of the ExpCM stringency parameter, \(\beta\). \(\beta\) is a way to gauge how well the selection in the lab compares to selection in nature. When \(\beta\) is fit to be greater than \(1\) it means the selection in nature prefers the same amino acids but with a greater stringency. The converse is true when \(\beta\) is less than \(1\). When our preferences are re-scaled with a \(\beta>1\), we see the strongly preferred amino acids “grow” while the weakly preferred amino acids “shrink”.
[Figure: snippet of the \(\beta\)-lactamase preferences without scaling by \(\beta\)]
[Figure: snippet of the \(\beta\)-lactamase preferences scaled by \(\beta = 3\)]
When \(\beta=0\), the heights become uniform and the ExpCM loses all of its site-specific information.
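The re-scaling described above can be sketched in a few lines: each preference at a site is raised to the power \(\beta\) and the results are re-normalized to sum to one. The function name `rescale_prefs` is hypothetical and the preferences for the single site below are made up for illustration; this is not the phydms implementation.

```python
def rescale_prefs(prefs, beta):
    """Raise each preference to the power beta and renormalize to sum to 1."""
    powered = {aa: p ** beta for aa, p in prefs.items()}
    total = sum(powered.values())
    return {aa: value / total for aa, value in powered.items()}


# Made-up preferences for one site with three allowed amino acids.
site_prefs = {"A": 0.6, "G": 0.3, "S": 0.1}

for beta in (0, 1, 3):
    scaled = rescale_prefs(site_prefs, beta)
    print(beta, {aa: round(p, 3) for aa, p in scaled.items()})
```

With \(\beta = 0\) every amino acid gets the same value, with \(\beta = 1\) the preferences are unchanged, and with \(\beta = 3\) the most preferred amino acid takes up a larger share of the total.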
In the \(\beta\)-lactamase example, \(\beta\) is fit to be \(1.36\). Since this number is close to \(1\), we can conclude that not only is the ExpCM with amino-acid preferences a better description of natural sequence evolution than non-site specific models but that natural evolution prefers the same amino acids that are preferred in the experiment, but with slightly greater stringency. For more information on scaling amino-acid preferences by \(\beta\), please see Bloom, 2014.
Please see the full documentation if you would like to learn more about the phydms_comprehensive program and its other options.
DMS datasets for the same protein
If you perform a deep mutational scanning experiment multiple times under slightly different experimental conditions, you may want to compare how well each dataset explains natural sequence variation. These experimental differences could be in how the variant libraries were generated, how the selection pressure was exerted, etc. We can use phydms_comprehensive to compare ExpCM models with two or more different sets of preferences to both the YNGKP family of models and to each other.
For this phydms_comprehensive analysis, you will need multiple sets of amino-acid preferences for the same protein from your deep mutational scanning experiments and a codon-level sequence alignment of your gene. phydms_comprehensive will then
• infer a tree using RAxML
• run phydms with the YNGKP_M0, the YNGKP_M5 and the ExpCM (with and without averaged preferences) for each set of preferences
• summarize the results
See the full documentation for specifics on these models or the phydms_comprehensive program.
We are going to walk through an example of phydms_comprehensive and compare ExpCMs with amino-acid preferences for the influenza virus protein hemagglutinin described in Thyagarajan and Bloom, 2014 and Doud and Bloom, 2016.
phydms_comprehensive command-line usage
Here is the full list of requirements and options for phydms_comprehensive. Below is a discussion of the input files, running the phydms_comprehensive command, and interpretation of the results.
phydms_comprehensive -h
usage: phydms_comprehensive [-h] (--raxml RAXML | --tree TREE) [--ncpus NCPUS]
[--brlen {scale,optimize}] [--omegabysite]
[--diffprefsbysite] [--gammaomega] [--gammabeta]
[--no-avgprefs] [--randprefs] [-v]
outprefix alignment prefsfiles [prefsfiles ...]
Comprehensive phylogenetic model comparison and detection of selection
informed by deep mutational scanning data. This program runs 'phydms'
repeatedly to compare substitution models and detect selection. The 'phydms'
package is written by the Bloom lab (see
https://github.com/jbloomlab/phydms/contributors). Version 2.2.dev1. Full
documentation at http://jbloomlab.github.io/phydms
positional arguments:
outprefix Output file prefix.
alignment Existing FASTA file with aligned codon sequences.
prefsfiles Existing files with site-specific amino-acid
preferences.
optional arguments:
-h, --help show this help message and exit
--raxml RAXML Path to RAxML (e.g., 'raxml') (default: None)
--tree TREE Existing Newick file giving input tree. (default:
None)
--ncpus NCPUS Use this many CPUs; -1 means all available. (default:
-1)
--brlen {scale,optimize}
How to handle branch lengths: scale by single
parameter or optimize each one (default: optimize)
--omegabysite Fit omega (dN/dS) for each site. (default: False)
--diffprefsbysite Fit differential preferences for each site. (default:
False)
--gammaomega Fit ExpCM with gamma distributed omega. (default:
False)
--gammabeta Fit ExpCM with gamma distributed beta. (default:
False)
--no-avgprefs No fitting of models with preferences averaged across
sites for ExpCM. (default: False)
--randprefs Include ExpCM models with randomized preferences.
(default: False)
-v, --version show program's version number and exit
This example uses the following files: HA_prefs_Thyagarajan.csv, HA_prefs_Doud.csv, and HA_alignment.fasta.
The HA amino-acid preferences from Thyagarajan and Bloom, 2014 and Doud and Bloom, 2016 were measured using two different library construction strategies. (Please see Thyagarajan and Bloom, 2014 and Doud and Bloom, 2016 for more information on the reverse-genetics strategy and the helper virus strategy respectively). We would like to know if one set of preferences significantly changes the behavior of the ExpCM compared to the other set or the average of the two sets.
We can compare the amino-acid preference measurements found in HA_prefs_Thyagarajan.csv and HA_prefs_Doud.csv using phydms_logoplot:
[Figure: logoplot of HA_prefs_Thyagarajan]
[Figure: logoplot of HA_prefs_Doud]
As we would expect, the preferences measured from the two experiments are similar but not identical.
For more information on how to change enrichment scores to amino-acid preferences, please see the Comparing your DMS data to natural sequence evolution section of this tutorial. For more information on the preference file formats, please see the full phydms documentation or an example in the Comparing your DMS data to natural sequence evolution section of this tutorial.
We will use an alignment of HA sequences. Note that the number of sites in each preference file must exactly match the number of codon sites in the alignment.
Sequences read from: example_data/HA_alignment.fasta
There are 34 sequences
Each sequence is 564 amino acids long
Preferences read from: example_data/HA_prefs_Doud.csv
There are preferences measured for 564 sites
Preferences read from: example_data/HA_prefs_Thyagarajan.csv
There are preferences measured for 564 sites
You can use the phydms auxiliary program phydms_prepalignment to filter your sequences and prepare an alignment for phydms_comprehensive.
phydms_comprehensive
We can now run phydms_comprehensive by specifying our output prefix (in this case a directory called HA), our preferences, and our alignment.
phydms_comprehensive HA/ example_data/HA_alignment.fasta example_data/HA_prefs_Thyagarajan.csv example_data/HA_prefs_Doud.csv --raxml raxml
phydms_comprehensive produces both the standard phydms output files for each model and a summary file. See the Comparing your DMS data to natural sequence evolution section of the tutorial for an example and the full phydms documentation for more information on the standard output files.
phydms_comprehensive results
To compare the ExpCM with the different preferences, we can look at the summary file modelcomparison.md.
Model | deltaAIC | LogLikelihood | nParams | ParamValues |
---|---|---|---|---|
ExpCM_HA_prefs_Doud | 0.00 | -4877.65 | 6 | beta=2.11, kappa=5.14, omega=0.52 |
ExpCM_HA_prefs_Thyagarajan | 44.18 | -4899.74 | 6 | beta=1.72, kappa=4.94, omega=0.55 |
averaged_ExpCM_HA_prefs_Doud | 2090.58 | -5922.94 | 6 | beta=0.68, kappa=5.36, omega=0.22 |
averaged_ExpCM_HA_prefs_Thyagarajan | 2097.86 | -5926.58 | 6 | beta=0.31, kappa=5.36, omega=0.22 |
YNGKP_M5 | 2113.50 | -5928.40 | 12 | alpha_omega=0.30, beta_omega=1.42, kappa=4.68 |
YNGKP_M0 | 2219.64 | -5982.47 | 11 | kappa=4.61, omega=0.20 |
First, we can see that both ExpCM models have significantly larger (evaluated by the \(\Delta\)AIC) log-likelihoods than the two ExpCMs with averaged preferences. These, in turn, have significantly larger log-likelihoods than the YNGKP family. Second, we can see the ExpCM with the preferences from Doud and Bloom, 2016 has a larger log-likelihood than the ExpCM with the Thyagarajan and Bloom, 2014 preferences.
These comparisons are evidence that the ExpCM with the Doud and Bloom, 2016 preferences is a better description of natural sequence evolution than the ExpCM with the Thyagarajan and Bloom, 2014 preferences or either one of the non-site-specific models.
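When several preference sets are compared, it can be convenient to read the summary table programmatically. The sketch below assumes the pipe-delimited layout of modelcomparison.md shown above; the function name `parse_model_comparison` is hypothetical and the table text is pasted inline so the example runs without the output directory.

```python
def parse_model_comparison(text):
    """Parse the pipe-delimited summary table into a list of dicts."""
    rows = [line.strip() for line in text.strip().splitlines() if line.strip()]
    header = [cell.strip() for cell in rows[0].strip("|").split("|")]
    records = []
    for line in rows[2:]:  # skip the header row and the ---|--- separator row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        record = dict(zip(header, cells))
        record["deltaAIC"] = float(record["deltaAIC"])
        record["LogLikelihood"] = float(record["LogLikelihood"])
        record["nParams"] = int(record["nParams"])
        records.append(record)
    return records


table = """
Model | deltaAIC | LogLikelihood | nParams | ParamValues |
---|---|---|---|---|
ExpCM_HA_prefs_Doud | 0.00 | -4877.65 | 6 | beta=2.11, kappa=5.14, omega=0.52 |
ExpCM_HA_prefs_Thyagarajan | 44.18 | -4899.74 | 6 | beta=1.72, kappa=4.94, omega=0.55 |
averaged_ExpCM_HA_prefs_Doud | 2090.58 | -5922.94 | 6 | beta=0.68, kappa=5.36, omega=0.22 |
averaged_ExpCM_HA_prefs_Thyagarajan | 2097.86 | -5926.58 | 6 | beta=0.31, kappa=5.36, omega=0.22 |
YNGKP_M5 | 2113.50 | -5928.40 | 12 | alpha_omega=0.30, beta_omega=1.42, kappa=4.68 |
YNGKP_M0 | 2219.64 | -5982.47 | 11 | kappa=4.61, omega=0.20 |
"""

records = parse_model_comparison(table)
best = min(records, key=lambda record: record["deltaAIC"])
print("Best model by AIC:", best["Model"])

# Compare only the site-specific ExpCMs (exclude the averaged controls).
expcm = [r for r in records if r["Model"].startswith("ExpCM_")]
for record in expcm:
    print(record["Model"], record["deltaAIC"], record["ParamValues"])
```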
Please see the full documentation if you would like to learn more about the phydms_comprehensive program and its other options.
A deep mutational scanning experiment measures the amino-acid preferences of a given protein for a given selection pressure. However, it is not expected that the selection in the lab faithfully describes the selection a protein faces in nature. We can use phydms_comprehensive and the flag --omegabysite to identify sites which deviate from the ExpCM model via an unexpectedly high or low rate of amino-acid substitution. That is, we will be able to differentiate between sites under diversifying selection (a high rate) and sites under a selective constraint not measured in the lab (a low rate). This is in contrast to differential selection, which selects for unexpected amino-acid substitutions rather than unexpected rates.
For more information on the exact procedure, please see the full documentation or Bloom, 2017. This procedure is analogous to the FEL method described by Pond and Frost, 2005.
We are going to walk through an example of phydms_comprehensive to detect sites under diversifying selection in influenza virus protein hemagglutinin using the preferences measured by Doud and Bloom, 2016 and in \(\beta\)-lactamase using the preferences measured by Stiffler et al, 2015.
phydms_comprehensive command-line usage
Here is the full list of requirements and options for phydms_comprehensive. To detect diversifying pressure, we are going to include the optional flag --omegabysite. Below is a discussion of the input files, running the phydms_comprehensive command, and interpretation of the results.
phydms_comprehensive -h
usage: phydms_comprehensive [-h] (--raxml RAXML | --tree TREE) [--ncpus NCPUS]
[--brlen {scale,optimize}] [--omegabysite]
[--diffprefsbysite] [--gammaomega] [--gammabeta]
[--no-avgprefs] [--randprefs] [-v]
outprefix alignment prefsfiles [prefsfiles ...]
Comprehensive phylogenetic model comparison and detection of selection
informed by deep mutational scanning data. This program runs 'phydms'
repeatedly to compare substitution models and detect selection. The 'phydms'
package is written by the Bloom lab (see
https://github.com/jbloomlab/phydms/contributors). Version 2.2.dev1. Full
documentation at http://jbloomlab.github.io/phydms
positional arguments:
outprefix Output file prefix.
alignment Existing FASTA file with aligned codon sequences.
prefsfiles Existing files with site-specific amino-acid
preferences.
optional arguments:
-h, --help show this help message and exit
--raxml RAXML Path to RAxML (e.g., 'raxml') (default: None)
--tree TREE Existing Newick file giving input tree. (default:
None)
--ncpus NCPUS Use this many CPUs; -1 means all available. (default:
-1)
--brlen {scale,optimize}
How to handle branch lengths: scale by single
parameter or optimize each one (default: optimize)
--omegabysite Fit omega (dN/dS) for each site. (default: False)
--diffprefsbysite Fit differential preferences for each site. (default:
False)
--gammaomega Fit ExpCM with gamma distributed omega. (default:
False)
--gammabeta Fit ExpCM with gamma distributed beta. (default:
False)
--no-avgprefs No fitting of models with preferences averaged across
sites for ExpCM. (default: False)
--randprefs Include ExpCM models with randomized preferences.
(default: False)
-v, --version show program's version number and exit
A full discussion of the amino-acid preferences and sequences for \(\beta\)-lactamase and HA can be found in the earlier sections of this tutorial. Briefly,
BetaLactamase
Sequences read from: example_data/betaLactamase_alignment.fasta
There are 50 sequences
Each sequence is 263 amino acids long
Preferences read from: example_data/betaLactamase_prefs.csv
There are preferences measured for 263 sites
HA
Sequences read from: example_data/HA_alignment.fasta
There are 34 sequences
Each sequence is 564 amino acids long
Preferences read from: example_data/HA_prefs_Doud.csv
There are preferences measured for 564 sites
phydms_comprehensive
We can now run phydms_comprehensive by specifying our output prefix (in this case a directory called HA_omegabysite or betaLactamase_omegabysite), our preferences, our alignment and the flag --omegabysite. Each alignment requires its own phydms_comprehensive run.
phydms_comprehensive HA_omegabysite/ example_data/HA_alignment.fasta example_data/HA_prefs_Doud.csv --raxml raxml --omegabysite
phydms_comprehensive betaLactamase_omegabysite/ example_data/betaLactamase_alignment.fasta example_data/betaLactamase_prefs.csv --raxml raxml --omegabysite
phydms_comprehensive produces both the standard phydms output files for each model and a summary file. See the Comparing your DMS data to natural sequence evolution section of the tutorial for an example and the full phydms documentation for more information on the standard output files.
phydms_comprehensive results
To detect sites under diversifying selection, we can look at the summary files with the suffix _omegabysite.txt. First, we will look at the results from the HA analysis.
HA_omegabysite/ExpCM_HA_prefs_Doud_omegabysite.txt lists the site, the fitted \(\omega_r\) value, the p-value for the hypothesis \(H_0: \omega_r = 1\), the dLnL (difference in log-likelihood between the ExpCM with \(\omega_r = 1\) and the ExpCM with the fitted \(\omega_r\) value), and the Q-value, which controls for multiple comparisons via the false discovery rate. For more information on these metrics, please see the full phydms_comprehensive documentation. In the output shown here the sites appear in order of increasing Q-value, which means the sites with the strongest evidence for deviations from \(\omega_r = 1\) will be at the top of the file.
head -n 20 HA_omegabysite/ExpCM_HA_prefs_Doud_omegabysite.txt
# Omega fit to each site after fixing tree and all other parameters.
# Fits compared to null model of omega = 1.
# P-values NOT corrected for multiple testing, so consider Q-values too.
# Q-values computed separately for omega > and < 1 # Will fit different synonymous rate for each site.
#
site omega P dLnL Q
277 0.000 0.00246 4.585 0.694
147 0.000 0.00184 4.853 0.694
1 1.000 1 0.000 1
373 1.000 1 0.000 1
374 0.000 0.4 0.353 1
375 0.000 0.284 0.574 1
376 1.000 1 0.000 1
377 1.000 1 0.000 1
378 0.000 0.523 0.204 1
380 1.000 1 0.000 1
372 0.000 0.507 0.220 1
381 0.000 0.975 0.000 1
382 0.000 0.456 0.278 1
383 1.000 1 0.000 1
and sites with the weakest evidence for deviation from \(\omega_r = 1\) will be at the bottom of the file.
tail -n 20 HA_omegabysite/ExpCM_HA_prefs_Doud_omegabysite.txt
204 0.356 0.292 0.554 1
141 100.000 0.0783 1.550 1
195 1.000 1 0.000 1
193 0.000 0.618 0.125 1
179 1.000 0.999 0.000 1
180 1.000 1 0.000 1
181 0.455 0.585 0.149 1
182 1.000 1 0.000 1
183 0.462 0.428 0.314 1
184 1.820 0.633 0.114 1
194 100.000 0.152 1.027 1
185 1.306 0.797 0.033 1
187 1.000 1 0.000 1
188 1.000 1 0.000 1
189 1.000 1 0.000 1
190 0.000 0.0538 1.860 1
191 0.000 0.686 0.081 1
192 1.000 1 0.000 1
186 1.000 1 0.000 1
564 0.000 0.472 0.259 1
You will notice that the sites with strong evidence have a fitted \(\omega_r\) that is either very large (\(100\)) or very small (\(0\)) while the sites with weak evidence have a fitted \(\omega_r\) close to \(1\).
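The P and dLnL columns shown above are numerically consistent with a likelihood-ratio test in which \(2 \times \mathrm{dLnL}\) is compared to a \(\chi^2\) distribution with one degree of freedom. This is an inference from the tabulated values rather than a statement of how phydms computes them, so please check the full documentation for the exact procedure. Under that assumption the p-value needs only the standard library, because the \(\chi^2_1\) survival function at \(x\) equals \(\mathrm{erfc}(\sqrt{x/2})\). The function name `lrt_pvalue` is hypothetical.

```python
import math


def lrt_pvalue(dlnl):
    """P-value for the statistic 2*dLnL under chi-square with 1 df (assumed test).

    The chi-square(1) survival function at x is erfc(sqrt(x / 2)); with
    x = 2 * dLnL this simplifies to erfc(sqrt(dLnL)).
    """
    return math.erfc(math.sqrt(dlnl))


# (site, dLnL, reported P) copied from the HA output above.
for site, dlnl, reported_p in [(277, 4.585, 0.00246), (147, 4.853, 0.00184)]:
    print(site, f"computed={lrt_pvalue(dlnl):.3g}", f"reported={reported_p}")
```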
Here are the sites in \(\beta\)-lactamase with the strongest evidence for deviations from \(\omega_r = 1\):
head -n 20 betaLactamase_omegabysite/ExpCM_betaLactamase_prefs_omegabysite.txt
# Omega fit to each site after fixing tree and all other parameters.
# Fits compared to null model of omega = 1.
# P-values NOT corrected for multiple testing, so consider Q-values too.
# Q-values computed separately for omega > and < 1 # Will fit different synonymous rate for each site.
#
site omega P dLnL Q
248 20.043 0.000567 5.941 0.0745
213 37.250 0.000345 6.404 0.0745
218 95.530 0.00262 4.527 0.23
233 0.000 0.00164 4.956 0.432
139 33.371 0.0112 3.215 0.738
134 0.000 0.0133 3.064 0.875
146 0.000 0.00988 3.328 0.875
87 0.029 0.0102 3.299 0.875
257 0.000 0.0169 2.853 0.889
209 100.000 0.0175 2.823 0.92
105 72.028 0.0286 2.396 0.951
15 100.000 0.0326 2.284 0.951
141 100.000 0.0307 2.334 0.951
46 82.846 0.0218 2.629 0.951
In both examples, a small subset of sites displays a faster (\(\omega_r>1\)) or slower (\(\omega_r <1\)) than expected rate of amino-acid substitution.
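To pull such sites out of an omegabysite file programmatically, one option is to skip the `#` comment lines, parse the remaining columns, and split the sites by the direction of the deviation. The following is a minimal sketch using the first rows of the \(\beta\)-lactamase table above as inline data; `read_omegabysite` and `split_by_direction` are hypothetical helper names, and the Q-value cutoff of 0.1 is an arbitrary illustrative choice, not a phydms recommendation.

```python
import io

# First rows of the beta-lactamase table shown above. The columns are
# split on any whitespace, so tab-separated files are handled as well.
sample = """\
# Omega fit to each site after fixing tree and all other parameters.
# Fits compared to null model of omega = 1.
site omega P dLnL Q
248 20.043 0.000567 5.941 0.0745
213 37.250 0.000345 6.404 0.0745
218 95.530 0.00262 4.527 0.23
233 0.000 0.00164 4.956 0.432
139 33.371 0.0112 3.215 0.738
"""


def read_omegabysite(handle):
    """Yield one dict per site, skipping blank lines and '#' comments."""
    lines = (ln for ln in handle if ln.strip() and not ln.startswith("#"))
    header = next(lines).split()
    for line in lines:
        row = dict(zip(header, line.split()))
        yield {"site": int(row["site"]), "omega": float(row["omega"]),
               "P": float(row["P"]), "dLnL": float(row["dLnL"]),
               "Q": float(row["Q"])}


def split_by_direction(rows, q_cutoff):
    """Return (sites with omega > 1, sites with omega < 1) passing the cutoff."""
    faster = [r["site"] for r in rows if r["Q"] <= q_cutoff and r["omega"] > 1]
    slower = [r["site"] for r in rows if r["Q"] <= q_cutoff and r["omega"] < 1]
    return faster, slower


rows = list(read_omegabysite(io.StringIO(sample)))
faster, slower = split_by_direction(rows, q_cutoff=0.1)
print("omega > 1:", faster)
print("omega < 1:", slower)
```

To use this on a real file, replace `io.StringIO(sample)` with an open file handle.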
A deep mutational scanning experiment measures the amino-acid preferences of a given protein under a given selection pressure. However, selection in the lab is not expected to faithfully reproduce the selection a protein faces in nature. We can use phydms_comprehensive
and the flag --diffprefsbysite
to identify sites which deviate from the ExpCM model via differential selection. In contrast to diversifying selection, differential selection leads to unexpected amino-acid substitutions rather than unexpected rates.
For more information on the exact procedure, please see the full documentation or Bloom, 2017.
We are going to walk through an example of phydms_comprehensive
to detect sites under differential selection in the influenza virus protein hemagglutinin (HA) using the preferences measured by Doud and Bloom, 2016 and in \(\beta\)-lactamase using the preferences measured by Stiffler et al, 2015.
phydms_comprehensive
command-line usage
Here is the full list of requirements and options for phydms_comprehensive
. To detect differential selection, we are going to include the optional flag --diffprefsbysite
. Below is a discussion of the input files, running the phydms_comprehensive
command, and interpretation of the results.
phydms_comprehensive -h
usage: phydms_comprehensive [-h] (--raxml RAXML | --tree TREE) [--ncpus NCPUS]
[--brlen {scale,optimize}] [--omegabysite]
[--diffprefsbysite] [--gammaomega] [--gammabeta]
[--no-avgprefs] [--randprefs] [-v]
outprefix alignment prefsfiles [prefsfiles ...]
Comprehensive phylogenetic model comparison and detection of selection
informed by deep mutational scanning data. This program runs 'phydms'
repeatedly to compare substitution models and detect selection. The 'phydms'
package is written by the Bloom lab (see
https://github.com/jbloomlab/phydms/contributors). Version 2.2.dev1. Full
documentation at http://jbloomlab.github.io/phydms
positional arguments:
outprefix Output file prefix.
alignment Existing FASTA file with aligned codon sequences.
prefsfiles Existing files with site-specific amino-acid
preferences.
optional arguments:
-h, --help show this help message and exit
--raxml RAXML Path to RAxML (e.g., 'raxml') (default: None)
--tree TREE Existing Newick file giving input tree. (default:
None)
--ncpus NCPUS Use this many CPUs; -1 means all available. (default:
-1)
--brlen {scale,optimize}
How to handle branch lengths: scale by single
parameter or optimize each one (default: optimize)
--omegabysite Fit omega (dN/dS) for each site. (default: False)
--diffprefsbysite Fit differential preferences for each site. (default:
False)
--gammaomega Fit ExpCM with gamma distributed omega. (default:
False)
--gammabeta Fit ExpCM with gamma distributed beta. (default:
False)
--no-avgprefs No fitting of models with preferences averaged across
sites for ExpCM. (default: False)
--randprefs Include ExpCM models with randomized preferences.
(default: False)
-v, --version show program's version number and exit
A full discussion of the amino-acid preferences and sequences for \(\beta\)-lactamase and HA can be found in the earlier sections of this tutorial. Briefly,
BetaLactamase
Sequences read from: example_data/betaLactamase_alignment.fasta
There are 50 sequences
Each sequence is 263 amino acids long
Preferences read from: example_data/betaLactamase_prefs.csv
There are preferences measured for 263 sites
HA
Sequences read from: example_data/HA_alignment.fasta
There are 34 sequences
Each sequence is 564 amino acids long
Preferences read from: example_data/HA_prefs_Doud.csv
There are preferences measured for 564 sites
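The summaries above imply a simple consistency requirement: every sequence in the codon alignment has the same length, that length is a multiple of three, and the number of codon sites equals the number of sites with measured preferences. The sketch below shows one way to check this before starting a run. It is an illustration with made-up toy sequences and hypothetical helper names (`read_fasta`, `codon_alignment_sites`), not the validation that phydms itself performs.

```python
def read_fasta(lines):
    """Return {header: sequence} from an iterable of FASTA lines."""
    seqs, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line)
    return {n: "".join(parts) for n, parts in seqs.items()}


def codon_alignment_sites(seqs):
    """Return the number of codon sites, checking the alignment is consistent."""
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("sequences differ in length: %s" % sorted(lengths))
    (length,) = lengths
    if length % 3 != 0:
        raise ValueError("length %d is not a multiple of 3" % length)
    return length // 3


# Toy alignment (made up for illustration): 2 sequences, 4 codons each.
toy = [">seq1", "ATGGCTAAAGGT", ">seq2", "ATGGCAAAAGGC"]
seqs = read_fasta(toy)
n_sites = codon_alignment_sites(seqs)
n_pref_sites = 4  # in practice, the number of sites in the preferences file
print(len(seqs), "sequences,", n_sites, "codon sites")
if n_sites != n_pref_sites:
    raise ValueError("alignment and preferences cover different numbers of sites")
```

For a real alignment, pass an open file handle to `read_fasta` in place of the `toy` list.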
phydms_comprehensive
We can now run phydms_comprehensive
by specifying our output prefix (in this case a directory called HA_diffprefs
or betaLactamase_diffprefs
), our preferences, our alignment and the flag --diffprefsbysite
. Each alignment requires its own phydms_comprehensive
run.
phydms_comprehensive HA_diffprefs/ example_data/HA_alignment.fasta example_data/HA_prefs_Doud.csv --raxml raxml --diffprefsbysite
phydms_comprehensive betaLactamase_diffprefs/ example_data/betaLactamase_alignment.fasta example_data/betaLactamase_prefs.csv --raxml raxml --diffprefsbysite
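Because each alignment needs its own run, it can be convenient to generate the two commands from a small table of inputs. The sketch below only builds and prints the command lines; it uses the same paths and flags as the commands above, and `build_command` is a hypothetical helper name. The call that would actually launch a run is left as a comment.

```python
import shlex

# Output prefix stem -> (alignment, preferences), as used in the commands above.
datasets = {
    "HA": ("example_data/HA_alignment.fasta",
           "example_data/HA_prefs_Doud.csv"),
    "betaLactamase": ("example_data/betaLactamase_alignment.fasta",
                      "example_data/betaLactamase_prefs.csv"),
}


def build_command(name, alignment, prefs, raxml="raxml"):
    """Return the phydms_comprehensive argument list for one dataset."""
    return ["phydms_comprehensive", name + "_diffprefs/", alignment, prefs,
            "--raxml", raxml, "--diffprefsbysite"]


commands = [build_command(n, a, p) for n, (a, p) in datasets.items()]
for cmd in commands:
    print(" ".join(shlex.quote(part) for part in cmd))
    # To launch the run: subprocess.run(cmd, check=True)
```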
phydms_comprehensive
produces both the standard phydms
output files for each model and a summary file. See the Comparing your DMS
data to natural sequence evolution section of the tutorial for an example and the full phydms
documentation for more information on the standard output files.
phydms_comprehensive
results
To detect sites under differential selection, we can look at the summary files with the suffix diffprefsbysite.txt
. First, we will look at the results from the HA analysis.
HA_diffprefs/ExpCM_HA_prefs_Doud_diffprefsbysite.txt lists the site, the differential preference for each amino acid, and half the sum of the absolute values of the differential preferences for that site. A large positive differential preference means that substitutions to that amino acid were seen at that site in the alignment more often than expected given the ExpCM and the measured amino-acid preferences. The last column can be used as a summary of the strength of differential selection at a given site: larger values indicate that the site is under stronger differential selection. For more information on these metrics, please see the full phydms_comprehensive
documentation.
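As a worked illustration of the summary column, the sketch below computes half the sum of the absolute differential preferences for a made-up site. It assumes that the differential preferences at a site sum to zero, so that preferences plus differential preferences still sum to one. Under that assumption the summary equals the total preference mass shifted toward the favored amino acids. This agrees with rows in the HA table in this section where one large positive value equals the last column, but it is an assumption here, and `half_sum_abs` is a hypothetical helper name.

```python
def half_sum_abs(dpi):
    """Half the sum of the absolute values of the differential preferences."""
    return 0.5 * sum(abs(x) for x in dpi)


# Hypothetical differential preferences for a 4-letter alphabet (sum to zero).
dpi = [0.20, -0.05, -0.05, -0.10]
summary = half_sum_abs(dpi)
gained = sum(x for x in dpi if x > 0)  # mass shifted toward favored letters
print(round(summary, 4), round(gained, 4))
```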
Here are the sites (and a subset of the differential preferences) in HA with the strongest evidence for differential selection
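import pandas as pd
# Skip the first three lines of the file, then read the tab-separated table.
df = pd.read_csv("HA_diffprefs/ExpCM_HA_prefs_Doud_diffprefsbysite.txt", sep='\t', skiprows=[0, 1, 2])
# Show the first nine columns plus the last (summary) column for the first six sites.
print(df.iloc[:, list(range(9)) + [-1]].head(6).to_string(index=False))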
site dpi_A dpi_C dpi_D dpi_E dpi_F dpi_G dpi_H dpi_I half_sum_abs_dpi
174 -0.0089 -0.0050 -0.0074 -0.0091 -0.0022 -0.0097 -0.0065 -0.0094 0.1958
127 -0.0166 -0.0020 -0.0018 -0.0006 0.1702 -0.0137 -0.0054 -0.0226 0.1702
171 -0.0148 -0.0001 -0.0166 -0.0147 -0.0003 0.1638 -0.0028 -0.0159 0.1638
277 -0.0072 -0.0030 -0.0051 -0.0182 -0.0025 0.1633 -0.0042 -0.0092 0.1633
9 -0.0123 0.1620 -0.0031 -0.0022 -0.0074 -0.0068 -0.0087 -0.0152 0.1620
496 -0.0079 -0.0040 -0.0054 -0.0098 -0.0049 -0.0092 -0.0076 -0.0117 0.1554
and sites with the weakest evidence
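import pandas as pd
# Skip the first three lines of the file, then read the tab-separated table.
df = pd.read_csv("HA_diffprefs/ExpCM_HA_prefs_Doud_diffprefsbysite.txt", sep='\t', skiprows=[0, 1, 2])
# Show the first nine columns plus the last (summary) column for the last six sites.
print(df.iloc[:, list(range(9)) + [-1]].tail(6).to_string(index=False))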
site dpi_A dpi_C dpi_D dpi_E dpi_F dpi_G dpi_H dpi_I half_sum_abs_dpi
351 -0.0010 -0.0006 -0.0003 -0.0001 0.0068 -0.0012 -0.0001 -0.0014 0.0092
327 -0.0001 0.0016 -0.0003 -0.0003 0.0003 -0.0010 0.0004 -0.0043 0.0088
378 0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 0.0000
342 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 0.0000
73 -0.0000 -0.0000 -0.0000 -0.0000 0.0000 -0.0000 -0.0000 0.0000 0.0000
283 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 0.0000
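For each strongly selected site it is often useful to know which amino acid is favored. The sketch below takes rows from the head of the HA table above and reports, for each site, the listed amino acid with the largest positive differential preference. Only the eight amino-acid columns printed in this tutorial are included, so a site such as 174 whose favored amino acid lies in an omitted column is reported as `None`. `favored_amino_acids` is a hypothetical helper name.

```python
# Rows copied from the head of the HA table above (subset of columns).
table = """\
site dpi_A dpi_C dpi_D dpi_E dpi_F dpi_G dpi_H dpi_I half_sum_abs_dpi
174 -0.0089 -0.0050 -0.0074 -0.0091 -0.0022 -0.0097 -0.0065 -0.0094 0.1958
127 -0.0166 -0.0020 -0.0018 -0.0006 0.1702 -0.0137 -0.0054 -0.0226 0.1702
171 -0.0148 -0.0001 -0.0166 -0.0147 -0.0003 0.1638 -0.0028 -0.0159 0.1638
9 -0.0123 0.1620 -0.0031 -0.0022 -0.0074 -0.0068 -0.0087 -0.0152 0.1620
"""


def favored_amino_acids(text):
    """Map each site to the listed amino acid with the largest positive dpi."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    aa_cols = [c for c in header if c.startswith("dpi_")]
    result = {}
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        best = max(aa_cols, key=lambda c: float(row[c]))
        # Report None when no listed amino acid has a positive value.
        is_positive = float(row[best]) > 0
        result[int(row["site"])] = best[len("dpi_"):] if is_positive else None
    return result


favored = favored_amino_acids(table)
print(favored)
```

With the full file, all twenty amino-acid columns would be present and every site under differential selection would have a favored amino acid reported.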
Here are the sites in \(\beta\)-lactamase with the strongest evidence for differential selection
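import pandas as pd
# Skip the first three lines of the file, then read the tab-separated table.
df = pd.read_csv("betaLactamase_diffprefs/ExpCM_betaLactamase_prefs_diffprefsbysite.txt", sep='\t', skiprows=[0, 1, 2])
# Show the first nine columns plus the last (summary) column for the first six sites.
print(df.iloc[:, list(range(9)) + [-1]].head(6).to_string(index=False))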
site dpi_A dpi_C dpi_D dpi_E dpi_F dpi_G dpi_H dpi_I half_sum_abs_dpi
213 -0.3701 -0.0003 -0.0009 -0.0003 -0.0003 -0.5691 -0.0005 -0.0003 0.9656
214 -0.0518 -0.0002 -0.0492 -0.0417 0.0032 -0.0533 -0.0048 0.0049 0.3394
104 -0.0201 -0.0054 -0.0076 -0.0092 -0.0035 -0.0009 -0.0080 -0.0285 0.2555
167 -0.0147 -0.0042 -0.0063 -0.0198 -0.0027 -0.0145 -0.0063 -0.0072 0.2271
44 -0.0217 -0.0119 -0.0006 -0.0006 -0.0006 -0.0211 -0.0031 0.0429 0.2206
90 -0.0187 -0.0024 0.1839 -0.0134 -0.0009 -0.0329 -0.0097 -0.0037 0.1839
We visualize the \(\beta\)-lactamase differential preferences for each site using phydms_logoplot
phydms_logoplot logoplots/betaLactamase_prefs_diff.pdf --mapmetric functionalgroup --colormap mapmetric --diffprefs betaLactamase_diffprefs/ExpCM_betaLactamase_prefs_diffprefsbysite.txt
In both the HA and the \(\beta\)-lactamase examples, a small subset of sites appears to be under differential selection.