Utils

class tka.utils.SpheringNormalizer(controls)[source]

Bases: object

normalize(X)[source]
sphering_transform(X, lambda_, rotate=True)[source]
tka.utils.docs()[source]

Calling this function opens the documentation website.

tka.utils.is_valid_smiles(smiles: str)[source]

Returns True if the SMILES representation is valid and False otherwise.

tka.utils.load_l1000_ordered_feature_columns(gene_id)[source]

Loads L1000 ordered features in a list format based on the specified gene_id.

Parameters:

gene_id (str) – one of “affyID”, “ensemblID” or “entrezID”

Raises:

ValueError – If gene_id is not one of the allowed probe types.

Returns:

L1000 ordered features in a list format based on the specified gene_id

Return type:

list

tka.utils.load_mobc_ordered_feature_columns(model_id: str = '2023-02-mobc-es-op')[source]

Loads cell morphology ordered features in a list format. Currently all models use CellProfiler features.

Parameters:

model_id (str) – One of [“2023-02-mobc-es-op”, “2023-01-mobc-es-op”, “2021-02-mobc-es-op”].

tka.utils.prepare_df_for_mobc_predictions(df_dmso: DataFrame, df_real: DataFrame, identifier_col: str = 'SMILES', grouping_col: str = '', normalize: bool = True)[source]

Prepares df_real for predict_from_ge() inference.

Based on DMSO negative controls, this function normalizes df_real, extracts the relevant features and, optionally, groups them based on a prespecified column.

Parameters:
  • df_dmso (pd.DataFrame) – a df of negative control samples where the rows represent samples and the columns represent CellProfiler features

  • df_real (pd.DataFrame) – a df of treated samples where the rows represent samples and the columns represent CellProfiler features

  • identifier_col (str) – name of the column used for indexing the output dataframe

  • grouping_col (str, optional) – if provided, the output df is grouped by that column and aggregated by mean

  • normalize (bool) – If set to False, sphering normalization will not be used and df_dmso is not required.

Raises:

ValueError – if df_real or df_dmso is missing any of the columns listed in mobc_features

Returns:

a normalized df analogous to df_real but with only CellProfiler features

Return type:

pd.DataFrame
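
The column check and the optional grouping step described above can be illustrated on plain Python data. This is a minimal sketch under stated assumptions, not the library code: rows are dicts rather than DataFrame rows, the names FEATURES, check_columns and group_mean are hypothetical, the feature names are placeholders rather than real CellProfiler features, and the sphering normalization step is left out entirely.

```python
from collections import defaultdict
from statistics import fmean
from typing import Dict, List

# Placeholder feature names (assumption); the real list would come from mobc_features.
FEATURES = ["feat_a", "feat_b"]


def check_columns(rows: List[dict], name: str) -> None:
    """Raise ValueError if any row lacks one of the expected feature columns."""
    for row in rows:
        missing = [f for f in FEATURES if f not in row]
        if missing:
            raise ValueError(f"{name} is missing columns: {missing}")


def group_mean(rows: List[dict], grouping_col: str) -> Dict[str, Dict[str, float]]:
    """Group rows by grouping_col and average each feature within a group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[grouping_col]].append(row)
    return {
        key: {f: fmean(r[f] for r in members) for f in FEATURES}
        for key, members in groups.items()
    }


# Toy stand-in for df_real: two samples share one identifier, one is on its own.
df_real = [
    {"SMILES": "CCO", "feat_a": 1.0, "feat_b": 4.0},
    {"SMILES": "CCO", "feat_a": 3.0, "feat_b": 6.0},
    {"SMILES": "CCN", "feat_a": 5.0, "feat_b": 8.0},
]
check_columns(df_real, "df_real")
aggregated = group_mean(df_real, "SMILES")
```

Here the two "CCO" rows collapse into a single averaged record, which mirrors what passing grouping_col is described as doing to the output df.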

tka.utils.transform_l1000_ids(from_id, to_id, gene_ids, dataset_path='l1000_mapped.csv', ignore_missing=False) → Dict[source]

Transforms L1000 gene IDs from one format to another.

Parameters:
  • from_id (str) – The source probe type (“affyID”, “entrezID”, “ensemblID”).

  • to_id (str) – The target probe type (“affyID”, “entrezID”, “ensemblID”).

  • gene_ids (list) – List of L1000 gene IDs to transform.

  • dataset_path (str) – Path to the DataFrame containing L1000 gene IDs for each probe type.

  • ignore_missing (bool) – If set to True, it will not raise an error on missing or invalid probe IDs.

Raises:
  • ValueError – If either from_id or to_id is not one of the allowed values.

  • ValueError – If any of the gene IDs in the dataset is not within the scope of L1000.

Returns:

Original and transformed L1000 gene IDs as keys and values respectively.

Return type:

dict
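
The mapping behaviour described above can be sketched without the real dataset. The following is an illustration only, not tka's implementation: MAPPING_ROWS stands in for the contents of the file at dataset_path, its IDs are made-up placeholders, and transform_ids_sketch is a hypothetical name. It also assumes that unknown IDs are simply skipped when ignore_missing is True, which the description does not spell out.

```python
from typing import Dict, List

ALLOWED_PROBES = ("affyID", "entrezID", "ensemblID")

# Placeholder rows standing in for the mapping file; these are NOT real L1000 IDs.
MAPPING_ROWS = [
    {"affyID": "affy_A", "entrezID": "entrez_1", "ensemblID": "ens_X"},
    {"affyID": "affy_B", "entrezID": "entrez_2", "ensemblID": "ens_Y"},
]


def transform_ids_sketch(
    from_id: str,
    to_id: str,
    gene_ids: List[str],
    ignore_missing: bool = False,
) -> Dict[str, str]:
    """Map gene_ids from one probe type to another (illustrative only)."""
    for name, value in (("from_id", from_id), ("to_id", to_id)):
        if value not in ALLOWED_PROBES:
            raise ValueError(f"{name} must be one of {ALLOWED_PROBES}, got {value!r}")

    # Build a lookup from source IDs to target IDs.
    lookup = {row[from_id]: row[to_id] for row in MAPPING_ROWS}

    result: Dict[str, str] = {}
    for gene_id in gene_ids:
        if gene_id in lookup:
            result[gene_id] = lookup[gene_id]
        elif not ignore_missing:
            raise ValueError(f"{gene_id!r} is not a known {from_id}")
        # Assumption: with ignore_missing=True, unknown IDs are skipped.
    return result
```

The returned dictionary keeps the original IDs as keys and the transformed IDs as values, matching the return description above.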

tka.utils.transform_moshkov_outputs(identifier_col_vals: List[str], output: List[List], model_id: str, auc_threshold: float = 0.0, use_full_assay_names: bool = False) → DataFrame[source]

Transform Moshkov outputs into a Pandas DataFrame.

Parameters:
  • identifier_col_vals (List[str]) – List of id strings corresponding to input data points (or any other identifiers).

  • output (List[List]) – List of lists containing output data (shape: (X, 270)).

  • auc_threshold (float, optional) – If supplied, assays whose prediction accuracies are lower than auc_threshold are dropped. Allowed values are floats between 0.5 and 1.0.

  • use_full_assay_names (bool, optional) – Whether to use full assay names from the CSV. Defaults to False.

Returns:

df with identifier_col_vals as the first column and assay data columns.

Return type:

pd.DataFrame
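
The effect of auc_threshold can be pictured as a column filter applied before the table is built. The sketch below is an assumption-laden illustration rather than the library code: ASSAY_AUCS holds hypothetical per-assay values under placeholder assay names (three assays instead of 270), to_rows_sketch and the "identifier" column name are made up, and plain dicts stand in for DataFrame rows. How the library obtains per-assay accuracies is not described here.

```python
from typing import Dict, List

# Hypothetical per-assay AUCs with placeholder names; three assays instead of 270.
ASSAY_AUCS = {"assay_1": 0.91, "assay_2": 0.62, "assay_3": 0.78}


def to_rows_sketch(
    identifier_col_vals: List[str],
    output: List[List[float]],
    auc_threshold: float = 0.0,
) -> List[Dict[str, object]]:
    """Build one record per input, keeping assays with AUC >= auc_threshold."""
    assay_names = list(ASSAY_AUCS)
    # Assumption: one output row is expected per identifier.
    if len(identifier_col_vals) != len(output):
        raise ValueError("identifier_col_vals and output must have the same length")

    # Indices of the assay columns that survive the threshold.
    kept = [i for i, name in enumerate(assay_names) if ASSAY_AUCS[name] >= auc_threshold]

    rows = []
    for identifier, preds in zip(identifier_col_vals, output):
        row: Dict[str, object] = {"identifier": identifier}
        row.update({assay_names[i]: preds[i] for i in kept})
        rows.append(row)
    return rows


rows = to_rows_sketch(
    ["CCO", "CCN"],
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    auc_threshold=0.7,
)
```

With a threshold of 0.7, the placeholder assay_2 column is dropped from every record while the identifier stays in first position.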