standardize

Module standardize

This module is used to standardize molecules and molecular DataFrames.

class npfc.standardize.FullUncharger[source]

A class derived from rdkit.Chem.MolStandardize.charge.Uncharger. Instead of attempting to create zwitterions, it removes all possible charges from the molecule.

For instance:

>>> # Uncharger:
>>> [O-][N+](C)(C)C[O-] -> [O-][N+](C)(C)CO
>>> # FullUncharger:
>>> [O-][N+](C)(C)C[O-] -> O[N+](C)(C)CO

Create an instance of FullUncharger.

Todo

This will remove charges from -2 to +2 only. This could probably be improved by using more general SMARTS patterns.

full_uncharge(mol)[source]

Neutralize a molecule by adding or removing hydrogens. Does not attempt to preserve zwitterions. For now, only charges from -2 to +2 are taken into account.

Parameters

mol (Mol) – the input molecule

Return type

Mol

Returns

the uncharged molecule

class npfc.standardize.Standardizer(protocol=None, col_mol='mol', col_id='idm', elements_medchem={'B', 'Br', 'C', 'Cl', 'F', 'H', 'I', 'N', 'O', 'P', 'S'}, timeout=10)[source]

A class for standardizing molecular structures. The standardization itself is based on a protocol that the user can modify.

By default, this protocol consists of 15 tasks applied to each molecule individually:

  1. initiate_mol: check if the molecule passed the RDKit conversion

  2. filter_empty: filter molecules with empty structures

  3. disconnect_metal: break bonds involving metallic atoms, resulting in potentially several molecules per structure.

  4. clear_mixtures: retrieve only the “best” molecule from a mixture, which might not always be the largest one.

  5. deglycosylate: remove all external sugar-like rings from the molecule and return the remaining non-linear entity.

  6. filter_num_heavy_atom: filter molecules with a heavy atom count not in the accepted range. By default: num_heavy_atom > 3.

  7. filter_molecular_weight: filter molecules with a molecular weight not in the accepted range. By default: molecular_weight <= 1000.0.

  8. filter_num_ring: filter molecules with a number of rings (Smallest Sets of Smallest Rings or SSSR) not in the accepted range. By default: num_ring > 0.

  9. filter_elements: filter molecules with elements not considered as medchem. By default: elements in H, B, C, N, O, F, P, S, Cl, Br, I.

  10. clear_isotopes: set all atoms to their most common isotope (e.g. 14C becomes 12C, which is written as C).

  11. normalize: always write the same functional groups in the same manner.

  12. uncharge: remove all charges from a molecule whenever possible. This differs from the rdkit.Chem.MolStandardize.charge module in that no attempt is made to reach the zwitterion.

  13. canonicalize: enumerate the canonical tautomer.

  14. clear_stereo: remove all remaining stereochemistry flags on the molecule.

  15. reset_mol: convert to SMILES format and back to discard any residual outdated flags on atoms and bonds.
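The default thresholds of the four filter tasks (tasks 6 to 9) can be summarized in a small sketch. It works on precomputed descriptors passed as a plain dictionary rather than on RDKit molecules. The function name `first_failed_filter` and the dictionary keys are hypothetical and are not part of npfc.

```python
# Default accepted elements, as listed for the filter_elements task.
MEDCHEM_ELEMENTS = {"H", "B", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I"}


def first_failed_filter(desc):
    """Return the name of the first filter task a molecule fails, or None.

    `desc` is a hypothetical dict of precomputed descriptors with the keys
    'num_heavy_atom', 'molecular_weight', 'num_ring' and 'elements'.
    """
    # Checks follow the order of the default protocol (tasks 6 to 9).
    checks = [
        ("filter_num_heavy_atom", desc["num_heavy_atom"] > 3),
        ("filter_molecular_weight", desc["molecular_weight"] <= 1000.0),
        ("filter_num_ring", desc["num_ring"] > 0),
        ("filter_elements", set(desc["elements"]) <= MEDCHEM_ELEMENTS),
    ]
    for task, passed in checks:
        if not passed:
            return task
    return None


# Descriptor values are entered by hand for illustration.
benzene = {"num_heavy_atom": 6, "molecular_weight": 78.11, "num_ring": 1, "elements": {"C", "H"}}
hexane = {"num_heavy_atom": 6, "molecular_weight": 86.18, "num_ring": 0, "elements": {"C", "H"}}
print(first_failed_filter(benzene))
print(first_failed_filter(hexane))
```

With these inputs, benzene passes every check, while hexane is stopped at filter_num_ring because it has no ring.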

Other steps are not part of this protocol but can be executed as well for convenience:

  • depict: find the “best” possible 2D depiction of the molecule among Input/rdDepictor/Avalon/CoordGen methods

  • extract_murcko: return the Murcko Scaffold from the molecule

  • clear_side_chains: remove any exocyclic atom that is not part of a linker

  • reset_mol: reset the molecule by converting it to SMILES and back

This results in new columns in the input DataFrame:

  • the ‘mol’ column: updated structure (only for the protocol)

  • the ‘status’ column: either passed, filtered or error.

  • the ‘task’ column: the latest task that was applied to the molecule.

The standardizer works either on a molecule (method: ‘run’) or on a DataFrame containing molecules (‘run_df’).

In the latter case, the inchikey is computed and can be used for identifying duplicate entries.

A timeout value is set by default and is applied to each molecule individually to prevent the process from getting stuck on difficult edge cases. This value can be set either when initializing the Standardizer object or as an option in the protocol (the latter takes priority if defined).
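The timeout behaviour can be illustrated with a minimal sketch. It does not reproduce npfc's implementation, which is not described here: the 'timeout' key in the protocol, the helper names `resolve_timeout` and `run_with_timeout`, and the use of a thread-based future are all assumptions made for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def resolve_timeout(init_timeout, protocol):
    """Return the protocol value if defined, else the constructor value."""
    # The 'timeout' key name is an assumption for this sketch.
    return protocol.get("timeout", init_timeout)


def run_with_timeout(task, item, timeout):
    """Run task(item) and return an (item, status, task name) tuple."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(task, item)
    try:
        return future.result(timeout=timeout), "passed", task.__name__
    except FutureTimeout:
        # Items that take too long are reported as filtered by 'timeout'.
        return item, "filtered", "timeout"
    finally:
        # Threads cannot be killed: a timed-out worker finishes in the background.
        executor.shutdown(wait=False)


def slow_task(item):
    time.sleep(1.0)  # stands in for a molecule that is slow to process
    return item


def fast_task(item):
    return item.upper()


timeout = resolve_timeout(10, {"tasks": ["slow_task"], "timeout": 0.2})
print(run_with_timeout(slow_task, "CCO", timeout))
print(run_with_timeout(fast_task, "cco", timeout))
```

A thread-based future only stops waiting for the result; it does not interrupt the computation. A process-based approach would be needed to actually abort a stuck task.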

Create a Standardizer object.

Parameters
  • protocol (Optional[str]) – Either a JSON file or a dictionary. The resulting dictionary needs a ‘tasks’ key that lists all tasks to be executed as a list.

  • col_mol (str) – the column with the molecule for when running the run_df method

  • col_id (str) – the column with the id for when running the run_df method

  • filter_duplicates
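As a rough illustration of the protocol argument, the sketch below accepts either a dictionary or a path to a JSON file and checks for the ‘tasks’ list. The helper `load_protocol` is hypothetical and only mirrors the requirement stated above; any other keys a real npfc protocol may accept are not shown.

```python
import json
from pathlib import Path


def load_protocol(protocol):
    """Return a protocol dict from either a dict or a path to a JSON file."""
    if isinstance(protocol, dict):
        data = protocol
    else:
        data = json.loads(Path(protocol).read_text())
    # The only documented requirement: a 'tasks' key holding a list.
    if not isinstance(data, dict) or not isinstance(data.get("tasks"), list):
        raise ValueError("protocol needs a 'tasks' key containing a list of task names")
    return data


# A shortened protocol that reuses task names from the default protocol.
custom = load_protocol({"tasks": ["initiate_mol", "filter_empty", "clear_isotopes", "uncharge"]})
print(custom["tasks"])
```

The same dictionary could be saved with json.dump and passed as a file path instead.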

clear_isotopes(mol)[source]

Return a molecule without any isotopes.

Parameters

mol (Mol) – the input molecule

Return type

Mol

Returns

the molecule without isotope

clear_mixtures(mol)[source]

Return the “best” molecule found in a molecular structure.

The “best” molecule is determined by the following criteria, sorted by priority:

  1. contains only medchem elements

  2. contains at least one ring

  3. has the largest molecular weight of the mixture

To summarize:

medchem > non linear > molecular weight

So the largest molecule of a mixture might not always be selected. For instance, a very long aliphatic chain would be dismissed in favor of a benzene molecule.

It is implemented this way because the fragments used for substructure search contain at least one ring. Conversely, the long aliphatic chain would be kept in a mixture with a non-medchem molecule.
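The priority order can be expressed as a tuple sort key. The sketch below uses a hypothetical `Component` record with hand-entered descriptors instead of RDKit molecules, so it only illustrates the ranking logic, not the actual method.

```python
from dataclasses import dataclass

MEDCHEM_ELEMENTS = {"H", "B", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I"}


@dataclass
class Component:
    """Hypothetical stand-in for one molecule of a mixture."""
    name: str
    elements: frozenset
    num_ring: int
    molecular_weight: float


def rank(component):
    """Sort key implementing: medchem > non linear > molecular weight."""
    is_medchem = component.elements <= MEDCHEM_ELEMENTS
    has_ring = component.num_ring > 0
    # Tuples compare element by element, so earlier criteria take priority.
    return (is_medchem, has_ring, component.molecular_weight)


def best_component(mixture):
    """Return the highest-ranked component of the mixture."""
    return max(mixture, key=rank)


chain = Component("triacontane", frozenset({"C", "H"}), 0, 422.8)
benzene = Component("benzene", frozenset({"C", "H"}), 1, 78.11)
silane = Component("tetramethylsilane", frozenset({"C", "H", "Si"}), 0, 88.22)

print(best_component([chain, benzene]).name)
print(best_component([chain, silane]).name)
```

The first call selects benzene over the heavier chain because it has a ring. The second keeps the chain because the other component contains a non-medchem element.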

Parameters

mol (Mol) – the input molecule(s)

Return type

Mol

Returns

the best molecule

clear_side_chains(mol, debug=False)[source]

Clear the side chains of a molecule.

This method operates in 3 steps:

  1. Quickly remove all atoms in side chains except the one attached to a ring, starting from the terminal atom. (This would certainly fail for linear molecules.)

  2. Iterate over the remaining exocyclic atoms and remove an atom only when doing so does not break the ring aromaticity. Single and double bonds can be broken, and ring atoms that were attached to removed atoms are neutralized.

  3. Remove any remaining nitrogen radicals by SMILES editing.

Warning

I found only nitrogen radicals in my dataset, this might be insufficient on a larger scale.

Warning

I found a bug for this molecule ‘O=C(O)C1OC(OCC2OC(O)C(O)C(O)C2O)C(O)C(O)C1O’, where a methyl remains after processing.

Parameters

mol (Mol) – the molecule to simplify

Return type

Mol

Returns

a simplified copy of the molecule

deglycosylate(mol, mode='run')[source]

Function to deglycosylate molecules.

Several rules are applied for removing Sugar-Like Rings (SLRs) from molecules:

  1. Only external SLRs are removed, so a molecule with aglycan-SLR-aglycan is not modified

  2. Only molecules with both aglycans and SLRs are modified (so molecules with only SLRs or with no SLRs are left untouched)

  3. Linear aglycans are considered to be part of linkers and are thus never returned as results

  4. Glycosidic bonds are defined as either O or CO and can be linked to a larger linear linker. So, from the SLR side, either nothing or only one C is allowed before the glycosidic bond oxygen

  5. Linker atoms until the glycosidic bond oxygen atom are appended to the definition of the SLR, so that any extra methyl is also removed.
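Rules 1 and 2 can be pictured on a graph of rings. The sketch below is a simplification under stated assumptions: rings are pre-labelled by hand as 'SLR' or 'aglycan', an SLR counts as external when it is attached to at most one other ring, and the helper `external_slrs` is hypothetical. Linker handling (rules 3 to 5) is not covered.

```python
def external_slrs(labels, edges):
    """Return the sugar-like rings (SLRs) that are candidates for removal.

    labels: dict mapping a ring id to 'SLR' or 'aglycan'
    edges: iterable of (ring_a, ring_b) pairs for connected rings
    """
    kinds = set(labels.values())
    # Rule 2: only molecules with both aglycans and SLRs are modified.
    if not {"SLR", "aglycan"} <= kinds:
        return set()
    neighbors = {ring: set() for ring in labels}
    for ring_a, ring_b in edges:
        neighbors[ring_a].add(ring_b)
        neighbors[ring_b].add(ring_a)
    # Rule 1 (assumed definition): an SLR is external when it is attached
    # to at most one other ring, i.e. it sits at an end of the ring graph.
    return {
        ring
        for ring, kind in labels.items()
        if kind == "SLR" and len(neighbors[ring]) <= 1
    }


# aglycan-SLR: the terminal sugar-like ring is a removal candidate.
print(external_slrs({0: "aglycan", 1: "SLR"}, [(0, 1)]))
# aglycan-SLR-aglycan: the internal SLR is kept, nothing is removed.
print(external_slrs({0: "aglycan", 1: "SLR", 2: "aglycan"}, [(0, 1), (1, 2)]))
# Only SLRs: the molecule is left untouched.
print(external_slrs({0: "SLR", 1: "SLR"}, [(0, 1)]))
```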

Figure: deglycosylation algorithm (_images/std_deglyco_algo.svg)

Parameters
  • mol (Mol) – the input molecule

  • mode (str) – either ‘run’ for actually deglycosylating the molecule or ‘graph’ for returning a graph of rings instead (useful for presentations or debugging)

Return type

Union[Mol, Graph]

Returns

the deglycosylated molecule or a graph of rings

run(mol, timeout=10)[source]

Execute the standardization protocol on a molecule. Molecules that exceed the timeout value are filtered with task=’timeout’.

As a final step of the protocol, InChIKeys (‘inchikey’) are computed for identifying molecules.

Parameters
  • mol (Mol) – the input molecule

  • timeout (int) – the maximum number of seconds for processing a molecule

Return type

tuple

Returns

a tuple containing the molecule, its status and the name of the last task it reached

run_df(df)[source]

Apply the standardization protocol to a DataFrame, with the option of directly filtering duplicate entries as well. This can be very useful because standardization can expose duplicate entries through salt removal, neutralization, canonical tautomer enumeration, and removal of stereocenter labels.

If a reference file is specified, duplicate removal becomes possible across chunks.

Parameters
  • df (DataFrame) – the input DataFrame

  • timeout – the maximum number of seconds for processing a molecule

Return type

tuple

Returns

three DataFrames separated by status:

  • passed

  • filtered

  • error

Note

As a side effect, the output DataFrames are indexed by idm. The ‘inchikey’ column is not returned, but its values can be accessed using the reference file.
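The three-way split and the chunk-wise duplicate detection can be sketched with plain Python records in place of DataFrames. Everything here is illustrative: `split_by_status` is a hypothetical helper, a shared set plays the role of the reference file, the 'filter_duplicates' task label for duplicates is an assumption, and the InChIKey values are placeholders.

```python
def split_by_status(records, seen_inchikeys=None):
    """Group records by status and flag duplicates by 'inchikey'.

    records: list of dicts with the keys 'idm', 'status', 'task', 'inchikey'
    seen_inchikeys: set shared across chunks (stands in for the reference file)
    """
    seen = set() if seen_inchikeys is None else seen_inchikeys
    groups = {"passed": {}, "filtered": {}, "error": {}}
    for record in records:
        row = dict(record)  # copy so the input is left untouched
        idm = row.pop("idm")  # outputs are keyed (indexed) by idm
        inchikey = row.pop("inchikey")  # not part of the returned rows
        if row["status"] == "passed":
            if inchikey in seen:
                # Assumed labelling of a duplicate entry.
                row["status"], row["task"] = "filtered", "filter_duplicates"
            else:
                seen.add(inchikey)
        groups[row["status"]][idm] = row
    return groups["passed"], groups["filtered"], groups["error"]


seen = set()
chunk_1 = [
    {"idm": "m1", "status": "passed", "task": "reset_mol", "inchikey": "KEY-A"},
    {"idm": "m2", "status": "filtered", "task": "filter_num_ring", "inchikey": None},
]
chunk_2 = [
    {"idm": "m3", "status": "passed", "task": "reset_mol", "inchikey": "KEY-A"},
    {"idm": "m4", "status": "error", "task": "initiate_mol", "inchikey": None},
]
passed_1, filtered_1, error_1 = split_by_status(chunk_1, seen)
passed_2, filtered_2, error_2 = split_by_status(chunk_2, seen)
print(sorted(passed_1), sorted(filtered_2), sorted(error_2))
```

Because the set of seen keys is shared between the two calls, m3 in the second chunk is recognized as a duplicate of m1 from the first chunk.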

Parameters
  • df – The DataFrame with molecules to standardize

  • return – a tuple of 3 DataFrames: standardized, filtered and error.