Utils
- tka.utils.docs()[source]
Calling this function opens the documentation website.
- tka.utils.is_valid_smiles(smiles: str)[source]
Returns True if the SMILES representation is valid and False otherwise.
- tka.utils.load_l1000_ordered_feature_columns(gene_id)[source]
Loads L1000 ordered features in a list format based on the specified gene_id
- Parameters:
gene_id (str) – one of “affyID”, “ensemblID” or “entrezID”
- Raises:
ValueError – If gene_id is not one of the allowed probe types.
- Returns:
L1000 ordered features in a list format based on the specified gene_id
- Return type:
list
- tka.utils.load_mobc_ordered_feature_columns(version: str = 'v4')[source]
Loads CellProfiler ordered features in a list format.
- Parameters:
version (str) – One of [“v3”, “v4”].
- tka.utils.prepare_df_for_mobc_predictions(df_dmso: DataFrame, df_real: DataFrame, identifier_col: str = 'SMILES', grouping_col: str = '', normalize: bool = True)[source]
Prepares df_real for predict_from_ge() inference.
Using the DMSO negative controls, this function normalizes df_real, extracts the relevant features and optionally groups them by a prespecified column.
- Parameters:
df_dmso (pd.DataFrame) – a df of negative control samples, where rows are samples and columns are CellProfiler features
df_real (pd.DataFrame) – a df of treated samples, where rows are samples and columns are CellProfiler features
identifier_col (str) – name of the column used for indexing the output dataframe
grouping_col (str, optional) – if provided the output df will be grouped and mean aggregated based on that column
normalize (bool) – If set to False, sphering normalization will not be used and df_dmso is not required.
- Raises:
ValueError – if any columns are missing from either df_real or df_dmso compared against mobc_features
- Returns:
a normalized df analogous to df_real but with only CellProfiler features
- Return type:
pd.DataFrame
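The normalize-then-group flow described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the tka implementation: `zscore_and_group` and the `FEATURES` list are hypothetical, rows are plain dicts rather than DataFrames, and per-feature z-scoring against the DMSO controls is used as a simpler stand-in for sphering normalization.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical feature names; the real list comes from the CellProfiler feature set.
FEATURES = ["feat_a", "feat_b"]


def zscore_and_group(dmso_rows, real_rows, grouping_col):
    """Scale treated rows by DMSO control statistics, then mean-aggregate per group."""
    # Refuse inputs that lack any expected feature column.
    for label, rows in (("df_dmso", dmso_rows), ("df_real", real_rows)):
        for row in rows:
            missing = [f for f in FEATURES if f not in row]
            if missing:
                raise ValueError(f"{label} is missing columns: {missing}")

    # Per-feature mean and standard deviation of the negative controls.
    stats = {}
    for f in FEATURES:
        values = [row[f] for row in dmso_rows]
        stats[f] = (mean(values), pstdev(values) or 1.0)  # avoid dividing by zero

    # Z-score each treated row (a simplification, not sphering) and bucket it.
    groups = defaultdict(list)
    for row in real_rows:
        scaled = [(row[f] - stats[f][0]) / stats[f][1] for f in FEATURES]
        groups[row[grouping_col]].append(scaled)

    # Average each feature within a group.
    return {
        key: {f: mean(col) for f, col in zip(FEATURES, zip(*rows))}
        for key, rows in groups.items()
    }
```

Grouping by the identifier column collapses replicate rows of the same compound into a single averaged profile, which mirrors the mean aggregation that grouping_col is described as performing.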
- tka.utils.transform_l1000_ids(from_id, to_id, gene_ids, dataset_path='l1000_mapped.csv', ignore_missing=False) → Dict[source]
Transforms L1000 gene IDs from one format to another.
- Parameters:
from_id (str) – The source probe type (“affyID”, “entrezID”, “ensemblID”).
to_id (str) – The target probe type (“affyID”, “entrezID”, “ensemblID”).
gene_ids (list) – List of L1000 gene IDs to transform.
dataset_path (str) – Path to the CSV file containing L1000 gene IDs for each probe type.
ignore_missing (bool) – If set to True, it will not raise an error on missing or invalid probe IDs.
- Raises:
ValueError – If either from_id or to_id is not one of the allowed values.
ValueError – If any of the gene IDs in the dataset is not within the scope of L1000.
- Returns:
Original and transformed L1000 gene IDs as keys and values respectively.
- Return type:
dict
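The documented behaviour (validate both probe types, look each ID up, return an original-to-transformed dict) can be pictured with a minimal sketch. It is not the tka implementation: `map_ids` is a hypothetical name, the CSV column names and the placeholder IDs are invented for illustration, and skipping unknown IDs when `ignore_missing` is True is an assumption about what "will not raise an error" means in practice.

```python
import csv
import io

ALLOWED_IDS = ("affyID", "entrezID", "ensemblID")

# Made-up placeholder rows standing in for the mapping file; not real L1000 IDs.
# The column names are assumed to match the probe type names.
MAPPING_CSV = """affyID,entrezID,ensemblID
affy_A,entrez_1,ens_X
affy_B,entrez_2,ens_Y
"""


def map_ids(from_id, to_id, gene_ids, csv_text=MAPPING_CSV, ignore_missing=False):
    """Map gene_ids from one probe type to another (illustrative sketch)."""
    for name in (from_id, to_id):
        if name not in ALLOWED_IDS:
            raise ValueError(f"{name!r} is not one of {ALLOWED_IDS}")

    # Build a lookup from source IDs to target IDs.
    reader = csv.DictReader(io.StringIO(csv_text))
    lookup = {row[from_id]: row[to_id] for row in reader}

    result = {}
    for gene_id in gene_ids:
        if gene_id not in lookup:
            if ignore_missing:
                continue  # assumption: unknown IDs are silently skipped
            raise ValueError(f"{gene_id!r} is not a known {from_id}")
        result[gene_id] = lookup[gene_id]
    return result
```

With the placeholder table, `map_ids("affyID", "entrezID", ["affy_A"])` returns a one-entry dict keyed by the original ID, matching the "keys and values" layout described in the Returns field.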
- tka.utils.transform_moshkov_outputs(identifier_col_vals: List[str], output: List[List], auc_modality_filter: dict = {}, use_full_assay_names: bool = False) → DataFrame[source]
Transform Moshkov outputs into a Pandas DataFrame.
- Parameters:
identifier_col_vals (List[str]) – List of id strings corresponding to input data points (or any other identifiers).
output (List[List[]]) – List of lists containing output data (shape: X, 270).
auc_modality_filter (dict, optional) –
If supplied, assays whose prediction accuracy is lower than auc for the given modality will be dropped. The dict has three keys: ‘auc’, ‘modality’ and ‘version’. Allowed modalities are in the following list:
[‘late_fusion_cs_ge’, ‘late_fusion_cs_ge_mobc’, ‘late_fusion_cs_mobc’, ‘late_fusion_ge_mobc’, ‘cpcl_es_op’, ‘cp_es_op’, ‘ges_es_op’, ‘ge_cp_es_op’, ‘ge_es_op’, ‘ge_mobc_cp_es_op’, ‘ge_mobc_es_op’, ‘ge_mo_cp_es_op’, ‘ge_mo_es_op’, ‘mobc_cp_es_op’, ‘mobc_es_op’, ‘mo_cp_es_op’, ‘mo_es_op’]
Allowed auc thresholds are any floating point values between 0.5 and 1.0. Allowed versions are ‘v3’ and ‘v4’.
use_full_assay_names (bool, optional) – Whether to use full assay names from the CSV. Defaults to False.
- Raises:
ValueError – If the auc_modality_filter contains an invalid modality.
- Returns:
A DataFrame with identifier_col_vals as the first column, followed by the assay data columns.
- Return type:
pd.DataFrame
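To make the expected shape of auc_modality_filter concrete, here is a small validation sketch built only from the constraints listed above. `validate_auc_modality_filter` is a hypothetical helper, not part of tka. The source documents a ValueError only for an invalid modality, so the checks on missing keys, version and auc range are assumptions.

```python
# Allowed modalities, copied from the parameter description.
ALLOWED_MODALITIES = {
    "late_fusion_cs_ge", "late_fusion_cs_ge_mobc", "late_fusion_cs_mobc",
    "late_fusion_ge_mobc", "cpcl_es_op", "cp_es_op", "ges_es_op",
    "ge_cp_es_op", "ge_es_op", "ge_mobc_cp_es_op", "ge_mobc_es_op",
    "ge_mo_cp_es_op", "ge_mo_es_op", "mobc_cp_es_op", "mobc_es_op",
    "mo_cp_es_op", "mo_es_op",
}
ALLOWED_VERSIONS = {"v3", "v4"}
REQUIRED_KEYS = {"auc", "modality", "version"}


def validate_auc_modality_filter(flt: dict) -> None:
    """Raise ValueError if a non-empty filter breaks the documented constraints."""
    if not flt:
        return  # an empty dict (the default) means no filtering
    missing = REQUIRED_KEYS - flt.keys()
    if missing:
        raise ValueError(f"filter is missing keys: {sorted(missing)}")  # assumed check
    if flt["modality"] not in ALLOWED_MODALITIES:
        raise ValueError(f"invalid modality: {flt['modality']!r}")
    if flt["version"] not in ALLOWED_VERSIONS:
        raise ValueError(f"invalid version: {flt['version']!r}")  # assumed check
    if not 0.5 <= flt["auc"] <= 1.0:
        raise ValueError("auc must lie between 0.5 and 1.0")  # assumed check


# A filter that satisfies every documented constraint.
example_filter = {"auc": 0.9, "modality": "late_fusion_cs_ge", "version": "v4"}
validate_auc_modality_filter(example_filter)
```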