load
Module load
A module for loading molecules from files and databases into DataFrames.
- npfc.load.count_mols(input_file, keep_uncompressed=False)
Count the number of molecules found in a text file. In case the file is compressed (gzip), it is uncompressed first. The resulting uncompressed file can be kept for further use.
Note: this function failed on ZINC (9,902,598).
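For very large inputs, counting records while streaming through the compressed file avoids writing an uncompressed copy to disk. The following is a minimal sketch of that idea, not the npfc implementation: `count_sdf_records` is a hypothetical helper, and it assumes the input is an SDF file in which every record ends with a `$$$$` line.

```python
import gzip
from pathlib import Path


def count_sdf_records(path):
    """Count '$$$$' record terminators in an SDF file (plain or gzip).

    Hypothetical helper for illustration; not part of npfc.
    """
    # Pick the opener from the file extension; gzip.open in text mode
    # decompresses on the fly, so nothing is written to disk.
    opener = gzip.open if Path(path).suffix == ".gz" else open
    with opener(path, "rt") as handle:
        # Each SDF record is terminated by a line containing only '$$$$'.
        return sum(1 for line in handle if line.strip() == "$$$$")
```

Because the file is read line by line, memory use stays flat regardless of how many molecules the file contains.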
- npfc.load.file(input_file, in_id='idm', in_mol='mol', csv_sep='|', mol_format='rdkit', out_id='idm', out_mol='mol', keep_props=True, decode=True)
Load a file into a DataFrame.
- Parameters
input_file (str) – the input file to load
in_id (str) – the column/property to use for molecule ids
in_mol (str) – the column to use for molecules (irrelevant for SDF)
csv_sep (str) – the column separator to use for parsing the input file (CSV)
mol_format (str) – the input format for molecules
out_id (str) – the column name used for storing molecule ids
out_mol (str) – the column name used for storing molecules
keep_props (bool) – keep all properties found in the input file. If False, then only out_id and out_mol are kept.
decode (Union[bool, List[str]]) – decode base64 strings into objects. Columns with encoded objects are labelled with a leading ‘_’. For molecules, reserved names are ‘mol’ and ‘mol_frag’.
- Return type
DataFrame
- Returns
a DataFrame
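The column-handling parameters are easier to follow with a small worked example. The sketch below mimics the documented behavior of `in_id`, `in_mol`, `csv_sep`, `out_id`, `out_mol` and `keep_props` using only the standard library. It is an illustration under assumptions, not the npfc code: `load_rows` is a hypothetical function, it returns a list of dicts instead of a DataFrame, and it leaves molecules as plain strings without parsing or decoding them.

```python
import csv
import io


def load_rows(text, in_id="idm", in_mol="mol", csv_sep="|",
              out_id="idm", out_mol="mol", keep_props=True):
    """Parse delimited text into dicts, renaming the id and molecule columns.

    Hypothetical helper for illustration; not part of npfc.
    """
    rows = []
    for record in csv.DictReader(io.StringIO(text), delimiter=csv_sep):
        # The id and molecule columns are always kept, under their output names.
        row = {out_id: record[in_id], out_mol: record[in_mol]}
        if keep_props:
            # Carry over every remaining property unchanged.
            row.update({k: v for k, v in record.items()
                        if k not in (in_id, in_mol)})
        rows.append(row)
    return rows


sample = "name|smiles|mw\nethanol|CCO|46.07\nbenzene|c1ccccc1|78.11\n"
all_props = load_rows(sample, in_id="name", in_mol="smiles")
ids_and_mols = load_rows(sample, in_id="name", in_mol="smiles", keep_props=False)
```

With `keep_props=False`, each row holds only the `idm` and `mol` keys; with the default, the `mw` property is carried along as well.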
- npfc.load.pgsql(dbname, user, psql, src_id, src_mol, mol_format=None, col_mol='mol', col_id='idm', keep_db_cols=False)
Load molecules from a PostgreSQL query. The col_mol column is parsed by RDKit according to the mol_format argument. If no mol_format is specified, the column is returned untouched.
Note
For this function to work, you need to be able to log into the target database without being prompted. Tested only with ChEMBL24 with default ports.
- Parameters
dbname (str) – the input PostgreSQL database name
user (str) – the user name for logging into the database
psql – the PostgreSQL statement to execute
mol_format (Optional[str]) – the molecular format to use to parse the molecules. If none is specified, then no parsing is performed. Currently only molblock and smiles are allowed.
col_mol (str) – the column with the molecules to parse
keep_db_cols (bool) – keep src_id and src_mol columns in the output DataFrame. This does not impact any other column extracted from the psql query.
- Return type
DataFrame
- Returns
a DataFrame with Mol objects
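The mol_format rule described above (no format means no parsing, otherwise only smiles or molblock) can be sketched as a small dispatch function. This is an assumption about how such logic could look, not the npfc source: `parse_mol_column` and its `parsers` argument are hypothetical, and RDKit is imported only when parsing is actually requested and no parsers are supplied.

```python
def parse_mol_column(values, mol_format=None, parsers=None):
    """Parse molecule strings according to mol_format.

    Hypothetical helper for illustration; not part of npfc.
    """
    if mol_format is None:
        # No format given: return the values untouched.
        return list(values)
    if parsers is None:
        # Lazy import so the untouched path works without RDKit installed.
        from rdkit import Chem
        parsers = {
            "smiles": Chem.MolFromSmiles,
            "molblock": Chem.MolFromMolBlock,
        }
    if mol_format not in parsers:
        raise ValueError(f"unsupported mol_format: {mol_format!r}")
    return [parsers[mol_format](value) for value in values]
```

Passing a custom `parsers` mapping makes the dispatch testable without RDKit, for example `parse_mol_column(["cco"], "smiles", {"smiles": str.upper})`.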