standardize
Module standardize
This module is used to standardize molecules and molecular DataFrames.
- class npfc.standardize.FullUncharger
A class derived from rdkit.Chem.MolStandardize.charge.Uncharger. Instead of attempting to create zwitterions, it removes all possible charges from the molecule.
For instance:
>>> # Uncharger:
>>> [O-][N+](C)(C)C[O-] -> [O-][N+](C)(C)CO
>>> # FullUncharger:
>>> [O-][N+](C)(C)C[O-] -> O[N+](C)(C)CO
Create an instance of FullUncharger.
Todo
This only removes charges from -2 to +2. Could this be improved by using more general SMARTS?
- class npfc.standardize.Standardizer(protocol=None, col_mol='mol', col_id='idm', elements_medchem={'B', 'Br', 'C', 'Cl', 'F', 'H', 'I', 'N', 'O', 'P', 'S'}, timeout=10)
A class for standardizing molecular structures. The standardization itself is based on a protocol that the user can modify.
By default, this protocol consists of 15 tasks applied to each molecule individually:
initiate_mol: check if the molecule passed the RDKit conversion
filter_empty: filter molecules with empty structures
disconnect_metal: break bonds involving metallic atoms, resulting in potentially several molecules per structure.
clear_mixtures: retrieve only the “best” molecule from a mixture, which might not always be the largest one.
deglycosylate: remove all external sugar-like rings from the molecule and return the remaining non-linear entity.
filter_num_heavy_atom: filter molecules with a heavy atom count not in the accepted range. By default: num_heavy_atom > 3.
filter_molecular_weight: filter molecules with a molecular weight not in the accepted range. By default: molecular_weight <= 1000.0.
filter_num_ring: filter molecules with a number of rings (Smallest Sets of Smallest Rings or SSSR) not in the accepted range. By default: num_ring > 0.
filter_elements: filter molecules with elements not considered as medchem. By default: elements in H, B, C, N, O, F, P, S, Cl, Br, I.
clear_isotopes: set all atoms to their most common isotope (e.g. 14C becomes 12C, which is written as C).
normalize: always write the same functional groups in the same manner.
uncharge: remove all charges on a molecule when possible. This differs from the rdkit.Chem.MolStandardize.charge module, as there is no attempt to reach the zwitterion.
canonicalize: enumerate the canonical tautomer.
clear_stereo: remove all remaining stereochemistry flags on the molecule.
reset_mol: convert to SMILES format and back to discard potential residual outdated flags on atoms and bonds.
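The filter tasks in the list above amount to simple range checks on precomputed descriptors. A minimal sketch using the default thresholds quoted above; the `first_failed_filter` helper and the descriptor dictionary keys are hypothetical and not part of npfc, which works on RDKit molecules:

```python
# Default accepted elements, as quoted in the task list above.
MEDCHEM_ELEMENTS = {"H", "B", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I"}


def first_failed_filter(desc):
    """Return the name of the first filter rejecting the molecule, or None.

    `desc` is a hypothetical dictionary of precomputed descriptors.
    """
    if not desc["num_heavy_atom"] > 3:
        return "filter_num_heavy_atom"
    if not desc["molecular_weight"] <= 1000.0:
        return "filter_molecular_weight"
    if not desc["num_ring"] > 0:
        return "filter_num_ring"
    if not set(desc["elements"]) <= MEDCHEM_ELEMENTS:
        return "filter_elements"
    return None


benzene = {"num_heavy_atom": 6, "molecular_weight": 78.11,
           "num_ring": 1, "elements": {"C", "H"}}
hexane = {"num_heavy_atom": 6, "molecular_weight": 86.18,
          "num_ring": 0, "elements": {"C", "H"}}

print(first_failed_filter(benzene))  # no filter rejects benzene
print(first_failed_filter(hexane))   # rejected for having no ring
```

The checks run in the same order as the tasks are listed, so the returned name corresponds to the first task that would filter the molecule.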
Other steps are not part of this protocol but can be executed as well for convenience:
depict: find the “best” possible 2D depiction of the molecule among Input/rdDepictor/Avalon/CoordGen methods
extract_murcko: return the Murcko Scaffold from the molecule
clear_side_chains: remove any exocyclic atom that is not part of a linker
reset_mol: reset the molecule by converting it to SMILES and back
This results in new columns in the input DataFrame:
the ‘mol’ column: updated structure (only for the protocol)
the ‘status’ column: either passed, filtered or error.
the ‘task’ column: the latest task that was applied to the molecule.
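How the 'status' and 'task' values relate to each other can be illustrated with a small sketch. It assumes that tasks run in order, that a task signals filtering by returning None, and that an exception is reported as an error; these conventions and the `run_protocol` helper are assumptions made for illustration, not the npfc implementation. The toy tasks operate on SMILES strings rather than RDKit molecules.

```python
def run_protocol(mol, tasks):
    """Apply (name, function) pairs in order, tracking status and latest task."""
    task_name = None
    for task_name, func in tasks:
        try:
            result = func(mol)
        except Exception:
            # The task crashed: report an error at this task.
            return {"mol": mol, "status": "error", "task": task_name}
        if result is None:
            # Assumed convention: None means the molecule is filtered out.
            return {"mol": mol, "status": "filtered", "task": task_name}
        mol = result
    return {"mol": mol, "status": "passed", "task": task_name}


# Toy stand-ins for two protocol tasks (hypothetical behavior).
tasks = [
    ("filter_empty", lambda smi: smi if smi else None),
    ("reset_mol", lambda smi: smi.strip()),
]

print(run_protocol(" CCO ", tasks))
print(run_protocol("", tasks))
```

In this sketch, a molecule that passes reports the last task of the protocol, while a filtered or failed molecule reports the task at which processing stopped.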
The standardizer works either on a molecule (method: ‘run’) or on a DataFrame containing molecules (‘run_df’).
In the latter case, the inchikey is computed and can be used for identifying duplicate entries.
A timeout value is set by default and is applied to each molecule individually to prevent the process from getting stuck on rare difficult cases. This value can be set either when initializing the Standardizer object or as an option in the protocol (the latter takes priority if defined).
Create a Standardizer object.
- Parameters
protocol (Optional[str]) – Either a JSON file or a dictionary. The resulting dictionary needs a ‘tasks’ key that lists all tasks to be executed.
col_mol (str) – the column with the molecule when running the run_df method
col_id (str) – the column with the id when running the run_df method
filter_duplicates –
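As an illustration of the protocol parameter, the sketch below builds a dictionary with the required 'tasks' key, writes it to a JSON file and reads it back. The task names are the 15 default tasks listed earlier. The 'timeout' key name and the `effective_timeout` helper are assumptions: the documentation only states that the timeout can be defined as a protocol option and that this option takes priority.

```python
import json
import os
import tempfile

protocol = {
    "tasks": [
        "initiate_mol", "filter_empty", "disconnect_metal", "clear_mixtures",
        "deglycosylate", "filter_num_heavy_atom", "filter_molecular_weight",
        "filter_num_ring", "filter_elements", "clear_isotopes", "normalize",
        "uncharge", "canonicalize", "clear_stereo", "reset_mol",
    ],
    # Assumed key name for the protocol-level timeout option.
    "timeout": 5,
}

# Round-trip through a JSON file, e.g. to share a protocol between runs.
path = os.path.join(tempfile.mkdtemp(), "protocol.json")
with open(path, "w") as handle:
    json.dump(protocol, handle, indent=2)
with open(path) as handle:
    loaded = json.load(handle)


def effective_timeout(init_timeout, protocol):
    """Return the protocol-level timeout if defined, else the init value."""
    return protocol.get("timeout", init_timeout)


print(effective_timeout(10, loaded))
```

Either the dictionary or the path to the JSON file could then be given as the protocol argument, since both forms are accepted according to the parameter description.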
- clear_isotopes(mol)
Return a molecule without any isotopes.
- Parameters
mol (Mol) – the input molecule
- Return type
Mol
- Returns
the molecule without isotopes
- clear_mixtures(mol)
Return the “best” molecule found in a molecular structure.
The “best” molecule is determined by the following criteria, sorted by priority:
contains only medchem elements
contains at least one ring
has the largest molecular weight of the mixture
To summarize, the largest molecule of a mixture might not always be selected. For instance, a very long aliphatic chain would be dismissed in favor of a benzene molecule.
This is implemented in such a way because our fragments used for substructure search contain at least one ring. Conversely, this long aliphatic chain would be kept in a mixture with a non-medchem molecule.
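The three criteria and their priority order can be expressed as a sort key. A minimal sketch, assuming the relevant properties of each component are already known; the `Component` class and the `pick_best` helper are hypothetical and only mirror the ranking described above, not the actual implementation:

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    only_medchem: bool       # contains only medchem elements
    num_ring: int            # number of rings
    molecular_weight: float  # approximate values, for illustration


def pick_best(components):
    """Rank by medchem-only first, then ring presence, then molecular weight."""
    return max(
        components,
        key=lambda c: (c.only_medchem, c.num_ring > 0, c.molecular_weight),
    )


benzene = Component("benzene", True, 1, 78.11)
icosane = Component("icosane", True, 0, 282.55)
silacyclohexane = Component("silacyclohexane", False, 1, 100.24)

# The ring-containing benzene wins over the heavier aliphatic chain.
print(pick_best([benzene, icosane]).name)
# The chain wins over a molecule containing a non-medchem element (Si).
print(pick_best([icosane, silacyclohexane]).name)
```

Tuples compare element by element, so a later criterion only matters when all earlier ones are tied.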
- Parameters
mol (Mol) – the input molecule(s)
- Return type
Mol
- Returns
the best molecule
- clear_side_chains(mol, debug=False)
Clear the side chains of a molecule.
This method operates in 3 steps:
Quickly remove all atoms in side chains except the one attached to a ring, starting from the terminal atom (this would certainly fail for linear molecules).
Iterate over the remaining exocyclic atoms and remove an atom only when doing so does not break ring aromaticity. Single and double bonds can be broken, and ring atoms that were attached to removed atoms are neutralized.
Remove any remaining nitrogen radicals by SMILES editing.
Warning
I found only nitrogen radicals in my dataset, so this might be insufficient on a larger scale.
Warning
I found a bug for this molecule ‘O=C(O)C1OC(OCC2OC(O)C(O)C(O)C2O)C(O)C(O)C1O’, where a methyl remains after processing.
- Parameters
mol (Mol) – the molecule to simplify
- Return type
Mol
- Returns
a simplified copy of the molecule
- deglycosylate(mol, mode='run')
Function to deglycosylate molecules.
Several rules are applied for removing Sugar-Like Rings (SLRs) from molecules:
Only external SLRs are removed, so a molecule with aglycan-SLR-aglycan is not modified
Only molecules with both aglycans and SLRs are modified (so molecules with only SLRs or without any SLR are left untouched)
Linear aglycans are considered to be part of linkers and are thus never returned as results
Glycosidic bonds are defined as either O or CO and can be linked to a larger linear linker. So, from the SLR side, either no atom or only 1 C is allowed before the glycosidic bond oxygen
Linker atoms until the glycosidic bond oxygen atom are appended to the definition of the SLR, so that any extra methyl is also removed.
- run(mol, timeout=10)
Execute the standardization protocol on a molecule. Molecules that exceed the timeout value are filtered with task=’timeout’.
As a final step of the protocol, InChIKeys (‘inchikey’) are computed for identifying molecules.
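The timeout behavior described above can be sketched with the standard library. This is only an illustration of the reported outcome (status 'filtered', task 'timeout'); the `run_with_timeout` helper is hypothetical and the mechanism actually used by npfc may differ. In this thread-based version, the slow call is abandoned rather than interrupted.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_with_timeout(func, mol, timeout):
    """Run func(mol) and report a timeout as a filtered molecule."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func, mol)
    try:
        return {"mol": future.result(timeout=timeout), "status": "passed"}
    except FutureTimeout:
        # The worker thread keeps running; only its result is discarded.
        return {"mol": mol, "status": "filtered", "task": "timeout"}
    finally:
        pool.shutdown(wait=False)


def fast_task(mol):
    return mol


def slow_task(mol):
    time.sleep(0.3)  # stands in for a difficult molecule
    return mol


print(run_with_timeout(fast_task, "CCO", timeout=1.0))
print(run_with_timeout(slow_task, "CCO", timeout=0.05))
```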
- run_df(df)
Apply the standardization protocol to a DataFrame, with the option of directly filtering duplicate entries as well. This can be very useful because standardization can expose duplicate entries through salt removal, neutralization, canonical tautomer enumeration, and removal of stereocenter labels.
If a reference file is specified, duplicate removal becomes possible across chunks.
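The idea of removing duplicates across chunks can be sketched as follows. An in-memory set stands in for the reference file, records are plain dictionaries instead of DataFrame rows, and the keys are placeholders rather than real InChIKeys. The `filter_duplicates` helper is hypothetical and only illustrates the principle.

```python
def filter_duplicates(records, seen):
    """Keep the first record per key; `seen` persists across chunks."""
    kept = []
    for record in records:
        key = record["inchikey"]
        if key in seen:
            continue  # already kept in this chunk or a previous one
        seen.add(key)
        kept.append(record)
    return kept


# Stand-in for the reference file shared by all chunks.
seen = set()

chunk_1 = [
    {"idm": "mol_1", "inchikey": "KEY-A"},
    {"idm": "mol_2", "inchikey": "KEY-B"},
    {"idm": "mol_3", "inchikey": "KEY-A"},  # duplicate within the chunk
]
chunk_2 = [
    {"idm": "mol_4", "inchikey": "KEY-B"},  # duplicate across chunks
    {"idm": "mol_5", "inchikey": "KEY-C"},
]

kept_1 = filter_duplicates(chunk_1, seen)
kept_2 = filter_duplicates(chunk_2, seen)
print([r["idm"] for r in kept_1])
print([r["idm"] for r in kept_2])
```

Without the shared reference, mol_4 would be kept because its duplicate sits in a different chunk.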
- Parameters
df (DataFrame) – the input DataFrame
timeout – the maximum number of seconds for processing a molecule
- Return type
- Returns
three DataFrames separated by status:
passed
filtered
error
Note
As a side effect, the output DataFrames are indexed by idm. The ‘inchikey’ column is not returned, but its values can be accessed using the reference file.
- Parameters
df – The DataFrame with molecules to standardize
return – a tuple of 3 DataFrames: standardized, filtered and error.