Moshkov et al. 2022
- tka.external_models.moshkov.load_assay_metadata() → DataFrame
Loads assay metadata of the assays used by Moshkov et al.
- tka.external_models.moshkov.predict_from_ge(df: DataFrame, gene_id: str, checkpoint_dir: str, auc_modality_filter: dict = {}) → DataFrame
Make predictions from a pd.DataFrame of standard scaled gene expressions and a trained model checkpoint.
- Parameters:
df (pd.DataFrame) – A DataFrame whose columns are the L1000 features (977 features) and whose index is the identification column.
gene_id (str) – Type of identifier used in the header row: one of “affyID”, “entrezID” or “ensemblID”.
checkpoint_dir (str) – Directory containing the trained checkpoint.
auc_modality_filter (dict, optional) –
If supplied, assays whose prediction accuracy at the given modality is lower than auc are dropped. The dict has two keys: ‘auc’ and ‘modality’. Allowed modalities are:
[‘late_fusion_cs_ge’, ‘late_fusion_cs_ge_mobc’, ‘late_fusion_cs_mobc’, ‘late_fusion_ge_mobc’, ‘cpcl_es_op’, ‘cp_es_op’, ‘ges_es_op’, ‘ge_cp_es_op’, ‘ge_es_op’, ‘ge_mobc_cp_es_op’, ‘ge_mobc_es_op’, ‘ge_mo_cp_es_op’, ‘ge_mo_es_op’, ‘mobc_cp_es_op’, ‘mobc_es_op’, ‘mo_cp_es_op’, ‘mo_es_op’]
Allowed auc thresholds are any floating point values between 0.5 and 1.0.
- Returns:
Predictions with df’s first column as indices and assays as columns.
- Return type:
pd.DataFrame
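The constraints on auc_modality_filter described above can be checked before calling a prediction function. The following is a minimal sketch: validate_auc_modality_filter and ALLOWED_MODALITIES are hypothetical names that are not part of tka, and the sketch assumes the 0.5 and 1.0 bounds are inclusive.

```python
# Hypothetical helper, not part of tka: checks an auc_modality_filter dict
# against the constraints listed in the parameter description.
ALLOWED_MODALITIES = {
    "late_fusion_cs_ge", "late_fusion_cs_ge_mobc", "late_fusion_cs_mobc",
    "late_fusion_ge_mobc", "cpcl_es_op", "cp_es_op", "ges_es_op",
    "ge_cp_es_op", "ge_es_op", "ge_mobc_cp_es_op", "ge_mobc_es_op",
    "ge_mo_cp_es_op", "ge_mo_es_op", "mobc_cp_es_op", "mobc_es_op",
    "mo_cp_es_op", "mo_es_op",
}


def validate_auc_modality_filter(auc_modality_filter: dict) -> dict:
    """Return the filter unchanged if it is well-formed, else raise ValueError."""
    if set(auc_modality_filter) != {"auc", "modality"}:
        raise ValueError("expected exactly the keys 'auc' and 'modality'")
    auc = auc_modality_filter["auc"]
    modality = auc_modality_filter["modality"]
    if modality not in ALLOWED_MODALITIES:
        raise ValueError(f"unknown modality: {modality!r}")
    # bool is a subclass of int, so it is rejected explicitly.
    is_number = isinstance(auc, (int, float)) and not isinstance(auc, bool)
    # Assumption: both bounds are inclusive.
    if not is_number or not 0.5 <= auc <= 1.0:
        raise ValueError("auc must be a number between 0.5 and 1.0")
    return auc_modality_filter


checked = validate_auc_modality_filter({"auc": 0.9, "modality": "mobc_es_op"})
```

Validating up front gives a clear error message for a mistyped modality name before any checkpoint is loaded.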
Examples
>>> df
   ENSG00000132423  ENSG00000182158  ENSG00000122873  ENSG00000213585  ...
0        -0.559783         1.127299         0.767661        -0.103637  ...
1         1.055605        -0.131212         0.170593         0.485176  ...
...            ...              ...              ...              ...  ...
(10, 977)
# Assuming df is a pd.DataFrame with shape (X, 977)
# and the columns are either Ensembl, Entrez or Affymetrix IDs.
>>> predict_from_ge(
...     df=df,
...     gene_id="ensemblID",
...     checkpoint_dir=".../Moshkov(etal)-single-models/2021-02-mobc-es-op"
... )
smiles  AmyloidFormation.Absorb.AB42_1_1  ...  HoxA13DNABinding.FluorOligo.HoxDNA_93_259  ...
0                               0.013138  ...                                   0.207173  ...
1                               0.064487  ...                                   0.389113  ...
...                                  ...  ...                                        ...  ...
(10, 270)
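predict_from_ge expects standard scaled gene expressions. The sketch below shows one way to standard scale a raw expression matrix with pandas. The data is synthetic and the column names are placeholders. Whether the scaling statistics should come from your own samples or from the data the checkpoint was trained on is not stated here, so treat the per-dataset scaling as an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for an expression matrix: 10 samples x 977 genes.
# Column names are placeholders, not real gene identifiers.
rng = np.random.default_rng(0)
raw = pd.DataFrame(
    rng.normal(loc=5.0, scale=2.0, size=(10, 977)),
    columns=[f"gene_{i}" for i in range(977)],
)

# Standard scaling: subtract each column's mean and divide by its
# population standard deviation (ddof=0).
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=0)
print(scaled.shape)
```

After this step every column has zero mean and unit variance, and the frame keeps the (X, 977) shape assumed in the example above.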
- tka.external_models.moshkov.predict_from_mobc(df_real: DataFrame, checkpoint_dir: str, auc_modality_filter: dict = {}) → DataFrame
Make predictions from a dataframe of batch effect corrected morphology profiles from CellProfiler and a trained model checkpoint.
- Parameters:
df_real (pd.DataFrame) – A DataFrame whose columns are the CellProfiler features (1746 features) and whose index is the identification column.
checkpoint_dir (str) – Directory containing the trained checkpoint.
auc_modality_filter (dict, optional) –
If supplied, assays whose prediction accuracy at the given modality is lower than auc are dropped. The dict has two keys: ‘auc’ and ‘modality’. Allowed modalities are:
[‘late_fusion_cs_ge’, ‘late_fusion_cs_ge_mobc’, ‘late_fusion_cs_mobc’, ‘late_fusion_ge_mobc’, ‘cpcl_es_op’, ‘cp_es_op’, ‘ges_es_op’, ‘ge_cp_es_op’, ‘ge_es_op’, ‘ge_mobc_cp_es_op’, ‘ge_mobc_es_op’, ‘ge_mo_cp_es_op’, ‘ge_mo_es_op’, ‘mobc_cp_es_op’, ‘mobc_es_op’, ‘mo_cp_es_op’, ‘mo_es_op’]
Allowed auc thresholds are any floating point values between 0.5 and 1.0.
- Returns:
Predictions with df_real’s first column as indices and assays as columns.
- Return type:
pd.DataFrame
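The effect of auc_modality_filter on the returned frame can be illustrated on toy data. This is only a sketch of the described behavior, not the library's implementation: the per-assay AUC table, its layout, and all assay and compound names below are hypothetical.

```python
import pandas as pd

# Hypothetical per-assay AUC table (assays as rows, modalities as columns).
# The layout of the real assay metadata may differ.
assay_auc = pd.DataFrame(
    {"mobc_es_op": [0.95, 0.62, 0.88], "cp_es_op": [0.71, 0.93, 0.55]},
    index=["assay_a", "assay_b", "assay_c"],
)

# Toy predictions: compounds as rows, assays as columns.
predictions = pd.DataFrame(
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    index=["cpd_1", "cpd_2"],
    columns=["assay_a", "assay_b", "assay_c"],
)

auc_modality_filter = {"auc": 0.8, "modality": "mobc_es_op"}

# Keep only the assays whose AUC at the chosen modality reaches the threshold.
aucs = assay_auc[auc_modality_filter["modality"]]
kept_assays = aucs[aucs >= auc_modality_filter["auc"]].index
filtered = predictions.loc[:, predictions.columns.isin(kept_assays)]
print(list(filtered.columns))
```

With a threshold of 0.8 at ‘mobc_es_op’, the toy assay with an AUC of 0.62 is dropped and the other two columns are kept.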
Examples
In the following code, identifier_col is the only column kept besides the CellProfiler features. Sphering normalization is applied to df_real, which is why df_dmso is required.
>>> import pandas as pd
>>> from tka.external_models.moshkov import predict_from_mobc
>>> from tka.utils import prepare_df_for_mobc_predictions
>>> # Load dataset for prediction
>>> df = pd.read_csv("path/to/dataset.csv")
>>> # Split the DMSO control wells from the treated wells
>>> df_dmso = df.loc[df["Metadata_broad_sample"] == "DMSO"]
>>> df_real = df.loc[df["Metadata_broad_sample"] != "DMSO"]
>>> out_df = prepare_df_for_mobc_predictions(
...     df_dmso=df_dmso, df_real=df_real, identifier_col="Metadata_pert_id"
... )
>>> predict_from_mobc(
...     df_real=out_df,
...     checkpoint_dir=".../Moshkov(etal)-single-models/2021-02-mobc-es-op",
... )
smiles         AmyloidFormation.Absorb.AB42_1_1  ...  HoxA13DNABinding.FluorOligo.HoxDNA_93_259
BRD-K18619710                      0.000000e+00  ...                               0.000000e+00
BRD-K20742498                      3.456357e-10  ...                               1.632998e-03
...                                         ...  ...                                        ...
Shape: (X, 270)
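To show why the DMSO controls are needed, here is a minimal NumPy sketch of sphering, written as ZCA whitening fitted on control profiles and then applied to treated profiles. It is an illustration under stated assumptions, not the code behind prepare_df_for_mobc_predictions: fit_sphering, apply_sphering, the eps regularizer, and the synthetic data are all hypothetical.

```python
import numpy as np


def fit_sphering(controls: np.ndarray, eps: float = 1e-6):
    """Fit a ZCA whitening transform on control profiles (rows are wells)."""
    mean = controls.mean(axis=0)
    centered = controls - mean
    cov = centered.T @ centered / (controls.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # eps keeps the inverse square root finite for near-zero eigenvalues.
    whitener = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return mean, whitener


def apply_sphering(profiles: np.ndarray, mean: np.ndarray, whitener: np.ndarray):
    """Center profiles on the control mean, then decorrelate and rescale them."""
    return (profiles - mean) @ whitener


# Synthetic correlated features standing in for CellProfiler measurements.
rng = np.random.default_rng(0)
mixing = np.eye(5) + 0.5 * np.triu(np.ones((5, 5)), k=1)
dmso = rng.normal(size=(500, 5)) @ mixing
treated = rng.normal(size=(20, 5)) @ mixing + 1.0

mean, whitener = fit_sphering(dmso)
dmso_sphered = apply_sphering(dmso, mean, whitener)
treated_sphered = apply_sphering(treated, mean, whitener)
```

The transform is estimated from the controls only, so the sphered controls have approximately identity covariance and the treated profiles are expressed relative to that control distribution.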
- tka.external_models.moshkov.predict_from_smiles(smiles_list: List[str], checkpoint_dir: str, auc_modality_filter: dict = {}) → DataFrame
Make predictions from a list of SMILES strings using a trained checkpoint.
- Parameters:
smiles_list (List[str]) – List of SMILES strings for which to make predictions.
checkpoint_dir (str) – Directory containing the trained checkpoint.
auc_modality_filter (dict, optional) –
If supplied, assays whose prediction accuracy at the given modality is lower than auc are dropped. The dict has two keys: ‘auc’ and ‘modality’. Allowed modalities are:
[‘late_fusion_cs_ge’, ‘late_fusion_cs_ge_mobc’, ‘late_fusion_cs_mobc’, ‘late_fusion_ge_mobc’, ‘cpcl_es_op’, ‘cp_es_op’, ‘ges_es_op’, ‘ge_cp_es_op’, ‘ge_es_op’, ‘ge_mobc_cp_es_op’, ‘ge_mobc_es_op’, ‘ge_mo_cp_es_op’, ‘ge_mo_es_op’, ‘mobc_cp_es_op’, ‘mobc_es_op’, ‘mo_cp_es_op’, ‘mo_es_op’]
Allowed auc thresholds are any floating point values between 0.5 and 1.0.
- Returns:
Predictions with SMILES as indices and assays as columns.
- Return type:
pd.DataFrame
Examples
>>> predict_from_smiles(
...     smiles_list=["CCC", "CCCC", "CH4"],
...     checkpoint_dir=".../Moshkov(etal)-single-models/2021-02-cp-es-op"
... )
smiles  AmyloidFormation.Absorb.AB42_1_1  ...  HoxA13DNABinding.FluorOligo.HoxDNA_93_259
CCC                             0.000082  ...                                   0.442998
CCCC                            0.000082  ...                                   0.442998
CH4                       Invalid SMILES  ...                             Invalid SMILES
(3, 270)
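In the output above, "CH4" is reported as invalid because SMILES writes methane as "C" and requires brackets for explicit hydrogens ("[CH4]"). The sketch below shows one way to separate such rows from the numeric predictions with pandas. It assumes invalid rows are marked with the literal string "Invalid SMILES" as in the example; the frame is built by hand and its column names are shortened placeholders.

```python
import pandas as pd

# Hand-built frame mimicking the example output; assay names are placeholders.
preds = pd.DataFrame(
    {
        "assay_1": [0.000082, 0.000082, "Invalid SMILES"],
        "assay_2": [0.442998, 0.442998, "Invalid SMILES"],
    },
    index=pd.Index(["CCC", "CCCC", "CH4"], name="smiles"),
)

# Coerce every column to numeric; cells that cannot be parsed become NaN.
numeric = preds.apply(pd.to_numeric, errors="coerce")

# A row counts as invalid when none of its cells could be parsed as a number.
invalid_mask = numeric.isna().all(axis=1)
invalid_smiles = list(numeric.index[invalid_mask])
valid_preds = numeric.loc[~invalid_mask]
```

Here invalid_smiles collects the inputs to correct and resubmit, while valid_preds holds float columns that are safe to use in downstream arithmetic.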