Moshkov et al. 2022
- tka.external_models.moshkov.load_assay_metadata() → DataFrame
Loads metadata for the assays used by Moshkov et al.
- tka.external_models.moshkov.predict_from_ge(df: DataFrame, gene_id: str, checkpoint_dir: str, model_id: str, auc_threshold: float = 0.0) → DataFrame
Make predictions from a pd.DataFrame of standard-scaled gene expression values and a trained model checkpoint.
- Parameters:
df (pd.DataFrame) – A pd.DataFrame whose columns are the L1000 features (977 features) and whose index is the identification column.
gene_id (str) – Type of identifier present in the header row: one of “affyID”, “entrezID” or “ensemblID”.
checkpoint_dir (str) – Directory containing the trained checkpoint.
model_id (str) – One of [“2023-02-mobc-es-op”, “2023-01-mobc-es-op”, “2021-02-mobc-es-op”, “2024-01-mobc-es-op”].
auc_threshold (float, optional) – If supplied, assays whose prediction accuracy is lower than auc_threshold are dropped. Allowed values are any floating-point numbers between 0.5 and 1.0.
- Returns:
Predictions with df’s first column as indices and assays as columns.
- Return type:
pd.DataFrame
Examples
>>> df
     ENSG00000132423  ENSG00000182158  ENSG00000122873  ENSG00000213585  ...
0          -0.559783         1.127299         0.767661        -0.103637  ...
1           1.055605        -0.131212         0.170593         0.485176  ...
...              ...              ...              ...              ...  ...
(10, 977)
# Assuming df is a pd.DataFrame with shape (X, 977)
# and the columns are either ensembl, entrez or affyIDs.
>>> predict_from_ge(
...     df=df,
...     gene_id="ensemblID",
...     checkpoint_dir=".../Moshkov(etal)-single-models/2021-02-mobc-es-op"
... )
smiles  AmyloidFormation.Absorb.AB42_1_1  ...  HoxA13DNABinding.FluorOligo.HoxDNA_93_259  ...
0                               0.013138  ...                                   0.207173  ...
1                               0.064487  ...                                   0.389113  ...
...                                  ...  ...                                        ...  ...
(10, 270)
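The df argument is described as holding standard-scaled expression values. The following is a minimal sketch of how such a frame might be prepared with pandas alone. The helper name standard_scale is hypothetical and not part of tka, the toy frame has three gene columns instead of 977, and scaling by the input's own statistics is an assumption (the trained models may expect statistics from their training data instead).

```python
import pandas as pd


def standard_scale(raw: pd.DataFrame) -> pd.DataFrame:
    """Scale each column to zero mean and unit variance (hypothetical helper)."""
    # Population standard deviation (ddof=0), the usual "standard scaling" choice.
    std = raw.std(ddof=0)
    # Constant columns have zero spread; divide by 1 so they become all zeros.
    std = std.replace(0.0, 1.0)
    return (raw - raw.mean()) / std


# Toy data: three samples, three Ensembl-style columns (a real input has 977).
raw = pd.DataFrame(
    {
        "ENSG00000132423": [1.0, 2.0, 3.0],
        "ENSG00000182158": [10.0, 10.0, 10.0],
        "ENSG00000122873": [0.0, 5.0, 10.0],
    }
)
scaled = standard_scale(raw)
```

The resulting frame keeps the original column names and index, so it can be checked for the expected (X, 977) shape before being passed on.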
- tka.external_models.moshkov.predict_from_mobc(df_real: DataFrame, checkpoint_dir: str, model_id: str, auc_threshold: float = 0.0, impute_missing_features: bool = True) → DataFrame
Make predictions from a dataframe of batch-effect-corrected morphology profiles from CellProfiler and a trained model checkpoint.
- Parameters:
df_real (pd.DataFrame) – A pd.DataFrame whose columns are features (either CellProfiler or custom) and whose index is the identification column.
checkpoint_dir (str) – Directory containing the trained checkpoint.
model_id (str) – One of [“2023-02-mobc-es-op”, “2023-01-mobc-es-op”, “2021-02-mobc-es-op”, “2024-01-mobc-es-op”].
auc_threshold (float, optional) – If supplied, assays whose prediction accuracy is lower than auc_threshold are dropped. Allowed values are any floating-point numbers between 0.5 and 1.0.
impute_missing_features (bool) – If set to True, all missing features will be replaced by the mean value from the training set.
- Returns:
Predictions with df_real’s first column as indices and assays as columns.
- Return type:
pd.DataFrame
Examples
In the following code, identifier_col is the only data left besides the CellProfiler features. Sphering normalization is also used to modify df_real, which is why df_dmso is required.
>>> import pandas as pd
>>> df = pd.read_csv("<path_to_dataset>")
>>> predict_from_mobc(
...     df_real=df,
...     checkpoint_dir=".../2023_Moshkov_NatComm/models/2023-01-mobc-es-op",
...     model_id="2023-01-mobc-es-op",
...     auc_threshold=0.9
... )
smiles         AmyloidFormation.Absorb.AB42_1_1  ...  HoxA13DNABinding.FluorOligo.HoxDNA_93_259
BRD-K18619710                      0.000000e+00  ...                               0.000000e+00
BRD-K20742498                      3.456357e-10  ...                               1.632998e-03
...                                         ...  ...                                        ...
Shape: (X, 270)
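The impute_missing_features option is described as replacing missing features with the training-set mean. The sketch below illustrates that idea in plain pandas; it is not the library's implementation. The helper impute_with_training_means, the feature names and the training_means values are all hypothetical.

```python
import pandas as pd


def impute_with_training_means(
    df: pd.DataFrame, training_means: pd.Series
) -> pd.DataFrame:
    """Align df to the training features and fill gaps with training means."""
    # reindex adds any feature column the input lacks (as NaN) and drops extras.
    aligned = df.reindex(columns=training_means.index)
    # fillna with a Series fills each column using the value at the matching label.
    return aligned.fillna(training_means)


# Hypothetical per-feature means, standing in for statistics from a training set.
training_means = pd.Series({"feat_a": 0.5, "feat_b": -1.0, "feat_c": 2.0})

# Toy profiles: feat_b is absent entirely and feat_a has one missing value.
profiles = pd.DataFrame(
    {"feat_a": [0.1, None], "feat_c": [1.5, 2.5]},
    index=["BRD-K18619710", "BRD-K20742498"],
)
imputed = impute_with_training_means(profiles, training_means)
```

Here both a wholly absent column and an individual missing cell end up holding the corresponding training mean, while observed values are left unchanged.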
- tka.external_models.moshkov.predict_from_smiles(smiles_list: List[str], checkpoint_dir: str, model_id: str, auc_threshold: float = 0.0) → DataFrame
Make predictions from a list of SMILES strings using a trained checkpoint.
- Parameters:
smiles_list (List[str]) – List of SMILES strings for which to make predictions.
checkpoint_dir (str) – Directory containing the trained checkpoint.
model_id (str) – One of [“2023-02-mobc-es-op”, “2023-01-mobc-es-op”, “2021-02-mobc-es-op”, “2024-01-mobc-es-op”].
auc_threshold (float, optional) – If supplied, assays whose prediction accuracy is lower than auc_threshold are dropped. Allowed values are any floating-point numbers between 0.5 and 1.0.
- Returns:
Predictions with SMILES as indices and assays as columns.
- Return type:
pd.DataFrame
Examples
>>> predict_from_smiles(
...     smiles_list=["CCC", "CCCC", "CH4"],
...     checkpoint_dir=".../Moshkov(etal)-single-models/2021-02-cp-es-op"
... )
smiles  AmyloidFormation.Absorb.AB42_1_1  ...  HoxA13DNABinding.FluorOligo.HoxDNA_93_259
CCC                             0.000082  ...                                   0.442998
CCCC                            0.000082  ...                                   0.442998
CH4                       Invalid SMILES  ...                             Invalid SMILES
(3, 270)
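In the example output, rows for unparseable inputs such as "CH4" carry the string "Invalid SMILES" instead of numbers. The sketch below shows one way such rows might be separated before numeric analysis. It assumes the returned frame looks like the example output; the toy frame is built by hand with two assay columns standing in for the full set.

```python
import pandas as pd

# Toy frame imitating the example output; values are copied from it.
preds = pd.DataFrame(
    {
        "AmyloidFormation.Absorb.AB42_1_1": [0.000082, 0.000082, "Invalid SMILES"],
        "HoxA13DNABinding.FluorOligo.HoxDNA_93_259": [0.442998, 0.442998, "Invalid SMILES"],
    },
    index=pd.Index(["CCC", "CCCC", "CH4"], name="smiles"),
)

# Rows where any cell holds the marker string are treated as failed inputs.
invalid_mask = preds.eq("Invalid SMILES").any(axis=1)
invalid_smiles = preds.index[invalid_mask.to_numpy()].tolist()

# Keep the valid rows and convert the object-typed columns to floats.
valid_preds = preds.loc[~invalid_mask].astype(float)
```

After this step, invalid_smiles lists the inputs that failed and valid_preds contains only numeric predictions.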