load

Module load

A module containing the Loader class, used for storing DataFrames with molecules on disk.

npfc.load.count_mols(input_file, keep_uncompressed=False)[source]

Count the number of molecules found in a text file. If the file is gzip-compressed, it is uncompressed first. The resulting uncompressed file can be kept for further use.

Known issue: this function failed on ZINC (9,902,598).

Parameters
  • input_file (str) – the input file

  • keep_uncompressed (bool) – if the input file is compressed (gzip), do not remove the uncompressed file when finished
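The behaviour described above can be sketched with the standard library alone. This is not the npfc implementation: the function name count_sdf_records is hypothetical, and it assumes an SDF input in which every record ends with a `$$$$` line, so that counting those lines counts molecules.

```python
import gzip
import os
import shutil


def count_sdf_records(input_file, keep_uncompressed=False):
    """Count '$$$$' record terminators in an SDF file (hypothetical helper).

    A '.gz' input is first uncompressed next to the original file.
    Note that an existing file with the uncompressed name is overwritten.
    """
    path = input_file
    was_compressed = input_file.endswith(".gz")
    if was_compressed:
        path = input_file[:-3]  # strip the '.gz' suffix
        with gzip.open(input_file, "rb") as src, open(path, "wb") as dst:
            shutil.copyfileobj(src, dst)
    try:
        # Read line by line so that memory use does not grow with file size.
        with open(path, "r") as handle:
            return sum(1 for line in handle if line.strip() == "$$$$")
    finally:
        # Remove the uncompressed copy unless the caller asked to keep it.
        if was_compressed and not keep_uncompressed:
            os.remove(path)
```

Other text formats would need a different record test, for example one molecule per line in a SMILES file.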

npfc.load.file(input_file, in_id='idm', in_mol='mol', csv_sep='|', mol_format='rdkit', out_id='idm', out_mol='mol', keep_props=True, decode=True)[source]

Load a file into a DataFrame.

Parameters
  • input_file (str) – the input file to load

  • in_id (str) – the column/property to use for molecule ids

  • in_mol (str) – the column to use for molecules (irrelevant for SDF)

  • csv_sep (str) – the column separator to use for parsing the input file (CSV)

  • mol_format (str) – the input format for molecules

  • out_id (str) – the column name used for storing molecule ids

  • out_mol (str) – the column name used for storing molecules

  • keep_props (bool) – keep all properties found in the input file. If False, then only out_id and out_mol are kept.

  • decode (Union[bool, List[str]]) – decode base64 strings into objects. Columns with encoded objects are labelled with a leading ‘_’. For molecules, reserved names are ‘mol’ and ‘mol_frag’.

Return type

DataFrame

Returns

a DataFrame
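To illustrate how in_id, in_mol, csv_sep, out_id, out_mol and keep_props relate to one another for CSV input, here is a minimal standard-library sketch. It is not how npfc.load.file is implemented: load_rows is a hypothetical name, it returns a list of dictionaries rather than a DataFrame, and it leaves molecules as plain strings (no RDKit parsing, no base64 decoding).

```python
import csv
import io


def load_rows(handle, in_id="idm", in_mol="mol", csv_sep="|",
              out_id="idm", out_mol="mol", keep_props=True):
    """Read delimited text and rename the id/molecule columns (hypothetical helper)."""
    reader = csv.DictReader(handle, delimiter=csv_sep)
    rows = []
    for record in reader:
        # The id and molecule columns are always kept, under their output names.
        row = {out_id: record[in_id], out_mol: record[in_mol]}
        if keep_props:
            # Copy every other column; renamed columns win on a name clash.
            for key, value in record.items():
                if key not in (in_id, in_mol):
                    row.setdefault(key, value)
        rows.append(row)
    return rows


# Example input with a '|' separator, the documented default for csv_sep.
text = "id|smiles|mw\nm1|CCO|46.07\n"
rows = load_rows(io.StringIO(text), in_id="id", in_mol="smiles")
```

With keep_props=False the same call would keep only the out_id and out_mol entries of each row.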

npfc.load.pgsql(dbname, user, psql, src_id, src_mol, mol_format=None, col_mol='mol', col_id='idm', keep_db_cols=False)[source]

Load molecules from a PGSQL query. The col_mol column is parsed by RDKit depending on the mol_format argument. If no mol_format is specified, then the column is returned untouched.

Note

For this function to work, you need to be able to log into the target database without prompt. Tested only with ChEMBL24 with default ports.

Parameters
  • dbname (str) – the input postgres database name

  • user (str) – the user name for logging into database

  • psql – the PostgreSQL statement to execute

  • mol_format (Optional[str]) – the molecular format to use to parse the molecules. If none is specified, then no parsing is performed. Currently only molblock and smiles are allowed.

  • col_mol (str) – the column with the molecules to parse

  • keep_db_cols (bool) – keep src_id and src_mol columns in output DataFrame. This does not impact any other column extracted from the psql query.

Return type

DataFrame

Returns

a DataFrame with Mol objects
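The mol_format rule described above (no format means values are returned untouched; otherwise only molblock and smiles are accepted) can be sketched as a small dispatch. This is an illustration under assumptions, not the npfc code: parse_molecules and its parsers argument are hypothetical, and the default parsers assume that RDKit is installed.

```python
def parse_molecules(values, mol_format=None, parsers=None):
    """Parse molecule strings according to mol_format (hypothetical helper)."""
    if mol_format is None:
        # No format given: return the values untouched.
        return list(values)
    if parsers is None:
        # Imported lazily so that RDKit is only needed when parsing is requested.
        from rdkit import Chem
        parsers = {
            "smiles": Chem.MolFromSmiles,
            "molblock": Chem.MolFromMolBlock,
        }
    if mol_format not in parsers:
        raise ValueError(f"unsupported mol_format: {mol_format!r}")
    parser = parsers[mol_format]
    return [parser(value) for value in values]
```

Passing a custom parsers mapping makes the dispatch easy to exercise without a database or RDKit at hand.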