Moshkov et al. 2022

tka.external_models.moshkov.load_assay_metadata() → DataFrame

Loads metadata for the assays used by Moshkov et al.

tka.external_models.moshkov.predict_from_ge(df: List[str], gene_id: str, checkpoint_dir: str, model_id: str, auc_threshold: float = 0.0) → DataFrame

Make predictions from a pd.DataFrame of standard-scaled gene expression values and a trained model checkpoint.

Parameters:
  • df (pd.DataFrame) – a pd.DataFrame whose columns are the 977 L1000 features and whose index is the identification column. Although the signature annotates df as List[str], a DataFrame is expected.

  • gene_id (str) – Type of identifier used in the header row: one of “affyID”, “entrezID” or “ensemblID”.

  • checkpoint_dir (str) – Directory containing the trained checkpoint.

  • model_id (str) – One of [“2023-02-mobc-es-op”, “2023-01-mobc-es-op”, “2021-02-mobc-es-op”, “2024-01-mobc-es-op”].

  • auc_threshold (float, optional) – If supplied, assays whose prediction performance (AUC) is lower than auc_threshold are dropped. Supplied values should lie between 0.5 and 1.0; with the default of 0.0 no assay is dropped.

Returns:

Predictions with df’s first column as indices and assays as columns.

Return type:

pd.DataFrame

Examples

>>> df
    ENSG00000132423  ENSG00000182158  ENSG00000122873  ENSG00000213585  ...
0         -0.559783         1.127299         0.767661        -0.103637  ...
1          1.055605        -0.131212         0.170593         0.485176  ...
...             ...              ...              ...              ...  ...
(10, 977)
# Assuming df is a pd.DataFrame with shape (X, 977)
# and the columns are either Ensembl, Entrez or Affy IDs.
>>> predict_from_ge(
...     df=df,
...     gene_id="ensemblID",
...     checkpoint_dir=".../Moshkov(etal)-single-models/2021-02-mobc-es-op",
...     model_id="2021-02-mobc-es-op",
... )
smiles  AmyloidFormation.Absorb.AB42_1_1  ...  HoxA13DNABinding.FluorOligo.HoxDNA_93_259  ...
0                               0.013138  ...                                   0.207173  ...
1                               0.064487  ...                                   0.389113  ...
...                                  ...  ...                                        ...  ...
(10, 270)
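
predict_from_ge expects standard-scaled values. The following is a minimal sketch of one way to z-score raw expression values per gene with pandas before making the call. The helper name standard_scale and the toy data are illustrative assumptions, not part of tka, and the source does not state which statistics the original models were scaled with, so treat this as an approximation.

```python
import numpy as np
import pandas as pd


def standard_scale(raw: pd.DataFrame) -> pd.DataFrame:
    """Z-score each column: subtract its mean, divide by its standard deviation.

    Hypothetical helper for illustration; not part of tka.
    """
    std = raw.std(ddof=0)
    # Guard against constant columns, which would otherwise divide by zero.
    std = std.replace(0.0, 1.0)
    return (raw - raw.mean()) / std


# Toy input: 4 samples x 3 genes (a real input would have 977 L1000 columns).
rng = np.random.default_rng(0)
raw = pd.DataFrame(
    rng.normal(loc=5.0, scale=2.0, size=(4, 3)),
    columns=["ENSG00000132423", "ENSG00000182158", "ENSG00000122873"],
)
scaled = standard_scale(raw)
```

After scaling, every column has mean 0 and unit (population) standard deviation, and the frame keeps its original index and column names, so it can be passed on as df.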

tka.external_models.moshkov.predict_from_mobc(df_real: DataFrame, checkpoint_dir: str, model_id: str, auc_threshold: float = 0.0, impute_missing_features: bool = True) → DataFrame

Make predictions from a DataFrame of batch-effect-corrected morphology profiles from CellProfiler and a trained model checkpoint.

Parameters:
  • df_real (pd.DataFrame) – a pd.DataFrame with the columns being features (either CellProfiler or custom) and the index column being the identification column

  • checkpoint_dir (str) – Directory containing the trained checkpoint.

  • model_id (str) – One of [“2023-02-mobc-es-op”, “2023-01-mobc-es-op”, “2021-02-mobc-es-op”, “2024-01-mobc-es-op”].

  • auc_threshold (float, optional) – If supplied, assays whose prediction performance (AUC) is lower than auc_threshold are dropped. Supplied values should lie between 0.5 and 1.0; with the default of 0.0 no assay is dropped.

  • impute_missing_features (bool) – If set to True, all missing features will be replaced by the mean value from the training set.

Returns:

Predictions with df_real’s first column as indices and assays as columns.

Return type:

pd.DataFrame

Examples

In the following code, identifier_col is the only data left besides the CellProfiler features. Sphering normalization is used to modify df_real, which is why df_dmso is required.

>>> import pandas as pd
>>> df = pd.read_csv("<path_to_dataset>")
>>> predict_from_mobc(
...     df_real=df,
...     checkpoint_dir=".../2023_Moshkov_NatComm/models/2023-01-mobc-es-op",
...     model_id="2023-01-mobc-es-op",
...     auc_threshold=0.9,
... )
smiles         AmyloidFormation.Absorb.AB42_1_1  ...  HoxA13DNABinding.FluorOligo.HoxDNA_93_259
BRD-K18619710                      0.000000e+00  ...                               0.000000e+00
BRD-K20742498                      3.456357e-10  ...                               1.632998e-03
        ...                               ...  ...                                        ...
Shape: (X, 270)
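
The impute_missing_features option is described as replacing missing features with the training-set mean. The sketch below shows what such an imputation could look like in pandas. The helper impute_with_training_means, the feature names and the mean values are hypothetical, and the source does not say whether tka performs the step exactly this way.

```python
import numpy as np
import pandas as pd


def impute_with_training_means(
    df: pd.DataFrame, training_means: pd.Series
) -> pd.DataFrame:
    """Align df to the training features and fill gaps with the training means.

    Hypothetical helper for illustration; not part of tka.
    """
    # reindex adds every feature column the input lacks (filled with NaN),
    # orders the columns like the training set and drops unknown columns.
    aligned = df.reindex(columns=training_means.index)
    # Passing a Series to fillna fills each column with the value stored
    # under that column's name.
    return aligned.fillna(training_means)


# Hypothetical training-set means for three features.
training_means = pd.Series(
    {
        "Cells_AreaShape_Area": 0.1,
        "Nuclei_AreaShape_Area": -0.2,
        "Cytoplasm_AreaShape_Area": 0.3,
    }
)
# The input lacks one feature entirely and has one missing value.
df = pd.DataFrame(
    {
        "Cells_AreaShape_Area": [1.0, np.nan],
        "Nuclei_AreaShape_Area": [0.5, 0.7],
    },
    index=["BRD-K18619710", "BRD-K20742498"],
)
imputed = impute_with_training_means(df, training_means)
```

Aligning to the training feature order first means both kinds of gap, an absent column and an individual NaN, are handled by the same fillna call.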

tka.external_models.moshkov.predict_from_smiles(smiles_list: List[str], checkpoint_dir: str, model_id: str, auc_threshold: float = 0.0) → DataFrame

Make predictions from a list of SMILES strings using a trained checkpoint.

Parameters:
  • smiles_list (List[str]) – List of SMILES strings for which to make predictions.

  • checkpoint_dir (str) – Directory containing the trained checkpoint.

  • model_id (str) – One of [“2023-02-mobc-es-op”, “2023-01-mobc-es-op”, “2021-02-mobc-es-op”, “2024-01-mobc-es-op”].

  • auc_threshold (float, optional) – If supplied, assays whose prediction performance (AUC) is lower than auc_threshold are dropped. Supplied values should lie between 0.5 and 1.0; with the default of 0.0 no assay is dropped.

Returns:

Predictions with SMILES as indices and assays as columns.

Return type:

pd.DataFrame

Examples

>>> predict_from_smiles(
...     smiles_list=["CCC", "CCCC", "CH4"],
...     checkpoint_dir=".../Moshkov(etal)-single-models/2021-02-cp-es-op",
...     model_id="2021-02-cp-es-op",
... )
smiles AmyloidFormation.Absorb.AB42_1_1  ... HoxA13DNABinding.FluorOligo.HoxDNA_93_259
CCC                            0.000082  ...                                  0.442998
CCCC                           0.000082  ...                                  0.442998
CH4                      Invalid SMILES  ...                            Invalid SMILES
(3, 270)
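
As the output above shows, rows for unparseable inputs hold the string "Invalid SMILES" instead of numbers ("CH4" is not valid SMILES; methane is written "C"). The sketch below shows one possible way to split such a result into valid numeric predictions and a list of rejected inputs. The preds frame is a hand-built stand-in using values from the example output, not the result of a real call.

```python
import pandas as pd

INVALID = "Invalid SMILES"  # marker string seen in the example output

# Stand-in for a predict_from_smiles result (two of the assay columns).
preds = pd.DataFrame(
    {
        "AmyloidFormation.Absorb.AB42_1_1": [0.000082, 0.000082, INVALID],
        "HoxA13DNABinding.FluorOligo.HoxDNA_93_259": [0.442998, 0.442998, INVALID],
    },
    index=pd.Index(["CCC", "CCCC", "CH4"], name="smiles"),
)

# A row counts as invalid if any of its cells holds the marker string.
is_invalid = preds.eq(INVALID).any(axis=1)
invalid_smiles = preds.index[is_invalid].tolist()

# Keep the valid rows and convert the object-typed columns to floats.
valid_preds = preds.loc[~is_invalid].astype(float)
```

Converting the remaining rows with astype(float) matters because a single string cell forces the whole column to the object dtype, which breaks numeric operations such as thresholding or averaging.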