Moshkov et al. 2022

tka.external_models.moshkov.load_assay_metadata() → DataFrame

Loads metadata for the assays used by Moshkov et al.

tka.external_models.moshkov.predict_from_ge(df: List[str], gene_id: str, checkpoint_dir: str, auc_modality_filter: dict = {}) → DataFrame

Make predictions from a pd.DataFrame of standard-scaled gene expression values and a trained model checkpoint.

Parameters:
  • df (pd.DataFrame) – A pd.DataFrame whose columns are the L1000 features (977 features) and whose index is the identifier column.

  • gene_id (str) – Type of gene identifier used in the column headers: one of "affyID", "entrezID" or "ensemblID".

  • checkpoint_dir (str) – Directory containing the trained checkpoint.

  • auc_modality_filter (dict, optional) –

    If supplied, assays whose prediction AUC for the given modality falls below the auc threshold are dropped. The dict has two keys, 'auc' and 'modality'. Allowed modalities are:

    ['late_fusion_cs_ge', 'late_fusion_cs_ge_mobc', 'late_fusion_cs_mobc', 'late_fusion_ge_mobc', 'cpcl_es_op', 'cp_es_op', 'ges_es_op', 'ge_cp_es_op', 'ge_es_op', 'ge_mobc_cp_es_op', 'ge_mobc_es_op', 'ge_mo_cp_es_op', 'ge_mo_es_op', 'mobc_cp_es_op', 'mobc_es_op', 'mo_cp_es_op', 'mo_es_op']

    Allowed auc thresholds are any floating point values between 0.5 and 1.0.

Returns:

Predictions with df’s first column as indices and assays as columns.

Return type:

pd.DataFrame

Examples

>>> df
    ENSG00000132423  ENSG00000182158  ENSG00000122873  ENSG00000213585  ...
0         -0.559783         1.127299         0.767661        -0.103637  ...
1          1.055605        -0.131212         0.170593         0.485176  ...
...             ...              ...              ...              ...  ...
(10, 977)
>>> # Assuming df is a pd.DataFrame with shape (X, 977)
>>> # whose columns are Ensembl, Entrez or Affymetrix IDs.
>>> from tka.external_models.moshkov import predict_from_ge
>>> predict_from_ge(
...     df=df,
...     gene_id="ensemblID",
...     checkpoint_dir=".../Moshkov(etal)-single-models/2021-02-mobc-es-op"
... )
smiles  AmyloidFormation.Absorb.AB42_1_1  ...  HoxA13DNABinding.FluorOligo.HoxDNA_93_259  ...
0                               0.013138  ...                                   0.207173  ...
1                               0.064487  ...                                   0.389113  ...
...                                  ...  ...                                        ...  ...
(10, 270)
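
The auc_modality_filter argument described above is a plain dict, so it can be worth validating before a long prediction run. The following is a minimal sketch and not part of tka: make_auc_modality_filter is a hypothetical helper, and treating the 0.5 and 1.0 bounds as inclusive is an assumption.

```python
# Modalities copied from the parameter description above.
ALLOWED_MODALITIES = frozenset({
    "late_fusion_cs_ge", "late_fusion_cs_ge_mobc", "late_fusion_cs_mobc",
    "late_fusion_ge_mobc", "cpcl_es_op", "cp_es_op", "ges_es_op",
    "ge_cp_es_op", "ge_es_op", "ge_mobc_cp_es_op", "ge_mobc_es_op",
    "ge_mo_cp_es_op", "ge_mo_es_op", "mobc_cp_es_op", "mobc_es_op",
    "mo_cp_es_op", "mo_es_op",
})


def make_auc_modality_filter(auc: float, modality: str) -> dict:
    """Hypothetical helper: build a validated auc_modality_filter dict."""
    if modality not in ALLOWED_MODALITIES:
        raise ValueError(f"unknown modality: {modality!r}")
    # Assumption: both bounds of the documented 0.5 to 1.0 range are inclusive.
    if not 0.5 <= auc <= 1.0:
        raise ValueError(f"auc must be between 0.5 and 1.0, got {auc}")
    return {"auc": auc, "modality": modality}


# Keep only assays with an AUC of at least 0.9 for the 'mobc_es_op' modality.
example_filter = make_auc_modality_filter(auc=0.9, modality="mobc_es_op")
```

The resulting dict would then be passed as auc_modality_filter=example_filter to predict_from_ge, predict_from_mobc or predict_from_smiles.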

tka.external_models.moshkov.predict_from_mobc(df_real: DataFrame, checkpoint_dir: str, auc_modality_filter: dict = {}) → DataFrame

Make predictions from a dataframe of batch effect corrected morphology profiles from CellProfiler and a trained model checkpoint.

Parameters:
  • df_real (pd.DataFrame) – A pd.DataFrame whose columns are the CellProfiler features (1746 features) and whose index is the identifier column.

  • checkpoint_dir (str) – Directory containing the trained checkpoint.

  • auc_modality_filter (dict, optional) –

    If supplied, assays whose prediction AUC for the given modality falls below the auc threshold are dropped. The dict has two keys, 'auc' and 'modality'. Allowed modalities are:

    ['late_fusion_cs_ge', 'late_fusion_cs_ge_mobc', 'late_fusion_cs_mobc', 'late_fusion_ge_mobc', 'cpcl_es_op', 'cp_es_op', 'ges_es_op', 'ge_cp_es_op', 'ge_es_op', 'ge_mobc_cp_es_op', 'ge_mobc_es_op', 'ge_mo_cp_es_op', 'ge_mo_es_op', 'mobc_cp_es_op', 'mobc_es_op', 'mo_cp_es_op', 'mo_es_op']

    Allowed auc thresholds are any floating point values between 0.5 and 1.0.

Returns:

Predictions with df_real’s first column as indices and assays as columns.

Return type:

pd.DataFrame

Examples

In the following code, identifier_col is the only column kept besides the CellProfiler features. Sphering normalization is applied to df_real, which is why df_dmso is required.

>>> import pandas as pd
>>> from tka.external_models.moshkov import predict_from_mobc
>>> from tka.utils import prepare_df_for_mobc_predictions
>>> # Load dataset for prediction
>>> df = pd.read_csv("path/to/dataset.csv")
>>> # Split DMSO control rows from treatment rows
>>> df_dmso = df.loc[df["Metadata_broad_sample"] == "DMSO"]
>>> df_real = df.loc[df["Metadata_broad_sample"] != "DMSO"]
>>> out_df = prepare_df_for_mobc_predictions(
...     df_dmso=df_dmso, df_real=df_real, identifier_col="Metadata_pert_id"
... )
>>> predict_from_mobc(
...     df_real=out_df,
...     checkpoint_dir=".../Moshkov(etal)-single-models/2021-02-mobc-es-op",
... )
smiles         AmyloidFormation.Absorb.AB42_1_1  ...  HoxA13DNABinding.FluorOligo.HoxDNA_93_259
BRD-K18619710                      0.000000e+00  ...                               0.000000e+00
BRD-K20742498                      3.456357e-10  ...                               1.632998e-03
        ...                               ...  ...                                        ...
Shape: (X, 270)
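
Because the prediction functions expect a fixed number of feature columns (1746 CellProfiler features here, 977 L1000 features for predict_from_ge), a quick shape check before predicting can catch a badly prepared frame early. This is only a sketch: check_feature_frame is a hypothetical helper, not part of tka, and the toy frame uses placeholder column names rather than real CellProfiler feature names.

```python
import pandas as pd

N_MOBC_FEATURES = 1746  # CellProfiler feature count stated above
N_GE_FEATURES = 977     # L1000 feature count stated for predict_from_ge


def check_feature_frame(df: pd.DataFrame, n_features: int) -> None:
    """Hypothetical helper: raise ValueError if df is not a numeric
    feature matrix with exactly n_features columns."""
    if df.shape[1] != n_features:
        raise ValueError(
            f"expected {n_features} feature columns, got {df.shape[1]}"
        )
    non_numeric = df.select_dtypes(exclude="number").columns
    if len(non_numeric) > 0:
        raise ValueError(f"non-numeric columns found: {list(non_numeric[:5])}")


# Toy frame: identifiers in the index, placeholder feature columns.
toy = pd.DataFrame(
    0.0,
    index=pd.Index(["BRD-K18619710", "BRD-K20742498"], name="Metadata_pert_id"),
    columns=[f"feature_{i}" for i in range(N_MOBC_FEATURES)],
)
check_feature_frame(toy, N_MOBC_FEATURES)  # passes silently
```

A leftover metadata column in the frame would show up here either as a wrong column count or as a non-numeric column.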

tka.external_models.moshkov.predict_from_smiles(smiles_list: List[str], checkpoint_dir: str, auc_modality_filter: dict = {}) → DataFrame

Make predictions from a list of SMILES strings using a trained checkpoint.

Parameters:
  • smiles_list (List[str]) – List of SMILES strings for which to make predictions.

  • checkpoint_dir (str) – Directory containing the trained checkpoint.

  • auc_modality_filter (dict, optional) –

    If supplied, assays whose prediction AUC for the given modality falls below the auc threshold are dropped. The dict has two keys, 'auc' and 'modality'. Allowed modalities are:

    ['late_fusion_cs_ge', 'late_fusion_cs_ge_mobc', 'late_fusion_cs_mobc', 'late_fusion_ge_mobc', 'cpcl_es_op', 'cp_es_op', 'ges_es_op', 'ge_cp_es_op', 'ge_es_op', 'ge_mobc_cp_es_op', 'ge_mobc_es_op', 'ge_mo_cp_es_op', 'ge_mo_es_op', 'mobc_cp_es_op', 'mobc_es_op', 'mo_cp_es_op', 'mo_es_op']

    Allowed auc thresholds are any floating point values between 0.5 and 1.0.

Returns:

Predictions with SMILES as indices and assays as columns.

Return type:

pd.DataFrame

Examples

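>>> from tka.external_models.moshkov import predict_from_smiles
>>> predict_from_smiles(
...     smiles_list=["CCC", "CCCC", "CH4"],  # "CH4" is not a valid SMILES string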
...     checkpoint_dir=".../Moshkov(etal)-single-models/2021-02-cp-es-op"
... )
smiles AmyloidFormation.Absorb.AB42_1_1  ... HoxA13DNABinding.FluorOligo.HoxDNA_93_259
CCC                            0.000082  ...                                  0.442998
CCCC                           0.000082  ...                                  0.442998
CH4                      Invalid SMILES  ...                            Invalid SMILES
(3, 270)
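
As the output above shows, rows for unparsable inputs carry the string "Invalid SMILES" in place of numbers, which leaves the assay columns non-numeric. One way to separate such rows before further analysis is sketched below. split_invalid_smiles is a hypothetical helper, not part of tka, and the toy frame copies only two assay columns from the example output.

```python
import pandas as pd

INVALID = "Invalid SMILES"  # marker string shown in the example output above


def split_invalid_smiles(preds: pd.DataFrame):
    """Hypothetical helper: return (valid_predictions, invalid_smiles)."""
    # Treat a row as invalid if any of its cells holds the marker string.
    invalid_mask = preds.eq(INVALID).any(axis=1)
    # The remaining rows hold only numbers, so they can be cast to float.
    valid = preds.loc[~invalid_mask].astype(float)
    return valid, preds.index[invalid_mask].tolist()


# Toy frame mimicking the layout of the predict_from_smiles output.
preds = pd.DataFrame(
    {
        "AmyloidFormation.Absorb.AB42_1_1": [0.000082, 0.000082, INVALID],
        "HoxA13DNABinding.FluorOligo.HoxDNA_93_259": [0.442998, 0.442998, INVALID],
    },
    index=pd.Index(["CCC", "CCCC", "CH4"], name="smiles"),
)
valid, invalid = split_invalid_smiles(preds)
```

Here valid keeps the "CCC" and "CCCC" rows as floats, and invalid lists "CH4" so it can be corrected (methane is written "C" in SMILES) and resubmitted.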