chemicalchecker.core.chemcheck.ChemicalChecker

class ChemicalChecker(cc_root=None, custom_data_path=None, dbconnect=True)

Bases: object

ChemicalChecker class.

Initialize a ChemicalChecker instance.

If the CC_ROOT directory is empty a skeleton of CC is initialized. Otherwise the directory is explored and molset and datasets variables are discovered.

Parameters:
  • cc_root (None, str) – The Chemical Checker root directory. If not specified the root is taken from the config file.

  • custom_data_path (None, str) – Path to one or more h5 files. Their signature type, molset and dataset code are detected from their ‘attrs’ record.

  • dbconnect (True, bool) – If True, try to connect to the database.
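A minimal usage sketch of the constructor arguments above. The import path follows the module name in this page's title. The `open_local_cc` helper and the `/tmp/cc_local` path are hypothetical, and the `cc_class` argument exists only so the sketch can be exercised without the package installed.

```python
def open_local_cc(cc_root, cc_class=None):
    """Open a CC rooted at `cc_root` without trying to reach the database."""
    if cc_class is None:
        # Lazy import: module path as given in this page's title.
        from chemicalchecker.core.chemcheck import ChemicalChecker as cc_class
    # dbconnect=False skips the DB connection attempt described above.
    return cc_class(cc_root=cc_root, dbconnect=False)


# Example (requires the chemicalchecker package; the path is a placeholder):
# cc = open_local_cc("/tmp/cc_local")
```

If `cc_root` points to an empty directory, the documentation above says a CC skeleton is initialized there, so a scratch directory is a reasonable place to experiment.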

Methods

add_model_metadata

Add metadata to available models.

add_sign_metadata

Add metadata to available signatures.

check_dir_existance_create

Args: dir_path (str): root path; additional_path (list): list of strings with additional path parts to append to the root path.

check_file_existance_create

This method creates an empty file if it doesn't already exist.

copy_signature_from

Copy a signature file from another CC instance.

datasets_exemplary

Iterator on Chemical Checker exemplary datasets.

export

Export a signature h5 file to a given path.

export_cc

Export a zipped folder containing the minimum files necessary

export_symlinks

Creates symlinks for all available signatures H5 or models path

get_data_signature

Return the data signature for the given dataset code.

get_global_signature

Return the (stacked) global signature if the given molecule belongs to the universe.

get_h5_metadata

Return H5 metadata.

get_metadatas

Extract the metadata from files using a function.

get_model_metadata

Return model metadata.

get_molecule

Return a molecule Mol object.

get_signature

Return the signature for the given dataset code.

get_signature_path

Return the signature path for the given dataset code.

get_universe_inchi

link

Link H5 files and models from a given custom directory.

report_available

Report available signatures in the CC.

report_dimensions

Report dimensions of all available signatures in the CC.

report_keys

Report keys of all available signatures in the CC.

report_status

Report status of signatures in the CC.

set_verbosity

Set the verbosity for logging module.

sign_metadata

Return the metadata of the Chemical Checker.

signature

symlink_to

Link current CC instance to other via symlinks.

Attributes

coordinates

Iterator on Chemical Checker coordinates.

datasets

Iterator on Chemical Checker datasets.

metadata

Return the metadata of the Chemical Checker.

name

Return the name of the Chemical Checker.

add_model_metadata(molsets=['full', 'reference'], dataset='*', signature='sign*')

Add metadata to available models.

Parameters:
  • molsets (list, optional) – Filter for the molecule sets, e.g. ‘full’ or ‘reference’

  • dataset (str, optional) – Filter for the dataset e.g. A1.001

  • signature (str, optional) – Filter for signature type e.g. ‘sign1’

add_sign_metadata(molset='*', dataset='*', signature='*')

Add metadata to available signatures.

Parameters:
  • molset (str, optional) – Filter for the moleculeset e.g. ‘full’ or ‘reference’

  • dataset (str, optional) – Filter for the dataset e.g. A1.001

  • signature (str, optional) – Filter for signature type e.g. ‘sign1’

check_dir_existance_create(additional_path=None)

Parameters:
  • dir_path (str) – Root path.

  • additional_path (list) – List of strings with additional path parts to append to the root path.

check_file_existance_create()

This method creates an empty file if it doesn’t already exist.

property coordinates

Iterator on Chemical Checker coordinates.

copy_signature_from(source_cc, cctype, molset, dataset_code, overwrite=False)

Copy a signature file from another CC instance.

Parameters:
  • source_cc (ChemicalChecker) – A different CC instance.

  • cctype (str) – The Chemical Checker datatype (i.e. one of the sign*).

  • molset (str) – The molecule set name.

  • dataset_code (str) – The dataset code of the Chemical Checker.

property datasets

Iterator on Chemical Checker datasets.

datasets_exemplary()

Iterator on Chemical Checker exemplary datasets.

export(destination, signature, h5_filter=None, h5_names_map={}, overwrite=False, version=None)

Export a signature h5 file to a given path.

Both the h5 datasets to copy and the names they receive in the exported file can be specified.

Parameters:
  • destination (str) – A destination path.

  • signature (sign) – A signature object.

  • h5_filter (list) – List of h5 dataset name to export.

  • h5_names_map (dict) – Dictionary of current to final h5 dataset name.

  • overwrite (bool) – Whether to allow overwriting the export.

  • version (int) – Mark the exported signature with a version number.
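The interplay of `h5_filter` and `h5_names_map` can be sketched on plain lists and dicts. This is not the library's implementation: `plan_export` is a hypothetical helper, the dataset names are illustrative, and it assumes that a `None` filter means every dataset is exported.

```python
def plan_export(available, h5_filter=None, h5_names_map=None):
    """Map each h5 dataset name to copy onto its name in the exported file."""
    names_map = h5_names_map or {}
    # Assumption: no filter means every available dataset is exported.
    if h5_filter is None:
        selected = list(available)
    else:
        selected = [name for name in available if name in h5_filter]
    # Datasets absent from the map keep their current name.
    return {name: names_map.get(name, name) for name in selected}


plan = plan_export(
    ["V", "keys", "mappings"],          # illustrative dataset names
    h5_filter=["V", "keys"],
    h5_names_map={"V": "signature"},
)
print(plan)
```

Here only `V` and `keys` are selected, and `V` is renamed to `signature` in the export plan.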

export_cc(root_destination, folder_destination)
Export a zipped folder containing the minimum files necessary to run a complete CC protocol.

It includes:
  • full: all sign0 (.h5 files + fit.ready file in models folder)

  • reference: sign1 models folder (.pkl files only) –> sign1 are going to be generated based on sign0 at the initialization of the ChemicalChecker instance

  • reference: sign2 models (savedmodel folder only) –> sign2 are going to be generated based on sign1 and neig1 (also generated once sign1 is ready)

Parameters:
  • root_destination (str) – An export destination path

  • folder_destination (str) – additional path to append to root_destination –> used to define the base_dir when zipping
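The role of `folder_destination` as the `base_dir` when zipping can be illustrated with the standard library's `shutil.make_archive`. This is a sketch under assumptions: the source does not say which zipping routine `export_cc` uses, and `zip_export`, the `cc_export` folder and the placeholder file are hypothetical.

```python
import os
import shutil
import tempfile
import zipfile


def zip_export(root_destination, folder_destination):
    """Zip root_destination/folder_destination, keeping the folder name in the archive."""
    archive_base = os.path.join(root_destination, folder_destination)
    # root_dir is where archiving starts; base_dir is the folder stored in the zip.
    return shutil.make_archive(
        archive_base, "zip",
        root_dir=root_destination,
        base_dir=folder_destination,
    )


root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "cc_export", "full"))
with open(os.path.join(root, "cc_export", "full", "placeholder.txt"), "w") as fh:
    fh.write("stand-in for a sign0 h5 file\n")

archive = zip_export(root, "cc_export")
with zipfile.ZipFile(archive) as zf:
    names = zf.namelist()
```

Every entry in the archive is prefixed with the `base_dir` name, so unzipping recreates the `cc_export` folder rather than scattering files into the current directory.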

export_symlinks

Creates symlinks for all available signature H5 files or model paths in a single folder.

Parameters:
  • dest_path (str) – The destination for symlink, if None then the default under the cc_root is generated.

  • signatures (bool) – export signature files.

  • models (bool) – export models paths.

get_data_signature(cctype, dataset_code)

Return the data signature for the given dataset code.

Parameters:
  • cctype (str) – The Chemical Checker datatype (i.e. one of the sign*).

  • dataset_code (str) – The dataset code of the Chemical Checker.

Returns:

A DataSignature object, the specific type depends on the cctype passed. It only allows access to the sign data.

Return type:

data(Signature)

get_global_signature(mol_str, str_type=None)

Return the (stacked) global signature if the given molecule belongs to the universe.

Parameters:
  • mol_str – Compound identifier (e.g. SMILES string)

  • str_type – Type of identifier (‘inchikey’, ‘inchi’ and ‘smiles’ are accepted). If None, the type is guessed on a best-effort basis.
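One way such a guess could work is sketched below. The library's actual heuristic is not documented here, so `guess_str_type` is a hypothetical helper: it relies only on the standard `InChI=` prefix and on the fixed 14-10-1 block layout of InChIKeys, and falls back to SMILES otherwise.

```python
import re

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def guess_str_type(mol_str):
    """Return 'inchi', 'inchikey' or 'smiles' for a compound identifier."""
    mol_str = mol_str.strip()
    if mol_str.startswith("InChI="):
        return "inchi"
    if INCHIKEY_RE.match(mol_str):
        return "inchikey"
    # Fallback: anything else is treated as SMILES without validation.
    return "smiles"
```

Passing `str_type` explicitly avoids relying on any guess, which matters for short strings that could be read more than one way.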

static get_h5_metadata(fn, format_dict)

Return H5 metadata.

Returns:

tuple of the type (‘full’, ‘A’, ‘A1’, ‘A1.001’, ‘sign3’, h5_path), or None if something’s wrong
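The shape of the returned tuple suggests that the metadata can be read off the file path. The sketch below assumes a layout of `<molset>/<A>/<A1>/<A1.001>/<cctype>/<file>.h5`, which is an inference from the tuple and not confirmed by this page. `h5_metadata_from_path` is a hypothetical helper and does not model the `format_dict` argument.

```python
from pathlib import PurePosixPath


def h5_metadata_from_path(h5_path):
    """Return (molset, level, coordinate, dataset, cctype, h5_path) or None."""
    parts = PurePosixPath(h5_path).parts
    if len(parts) < 6:
        return None
    # Assumed layout: .../<molset>/<A>/<A1>/<A1.001>/<signX>/<file>.h5
    molset, level, coordinate, dataset, cctype = parts[-6:-1]
    # Sanity check: the three codes must nest (A -> A1 -> A1.001).
    if not (coordinate.startswith(level) and dataset.startswith(coordinate)):
        return None
    return (molset, level, coordinate, dataset, cctype, h5_path)
```

Returning None instead of raising mirrors the documented behavior for paths where something is wrong.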

static get_metadatas(files, metadata_func, format_dict)

Extract the metadata from files using a function.

static get_model_metadata(fn, format_dict)

Return model metadata.

Returns:

tuple of the type (‘full’, ‘A’, ‘A1’, ‘A1.001’, ‘sign3’, h5_path), or None if something’s wrong

get_molecule(mol_str, str_type=None)

Return a molecule Mol object.

Parameters:
  • mol_str – Compound identifier (e.g. SMILES string)

  • str_type – Type of identifier (‘inchikey’, ‘inchi’ and ‘smiles’ are accepted). If None, the type is guessed on a best-effort basis.

get_signature(cctype, molset, dataset_code, *args, **kwargs)

Return the signature for the given dataset code.

Parameters:
  • cctype (str) – The Chemical Checker datatype (i.e. one of the sign*).

  • molset (str) – The molecule set name.

  • dataset_code (str) – The dataset code of the Chemical Checker.

  • params (dict) – Optional. The set of parameters to initialize and compute the signature. If the signature is already initialized this argument will be ignored.

  • as_dataframe (bool) – True to get the signature as pandas DataFrame.

Returns:

A Signature object, the specific type depends on the cctype passed.

Return type:

data(Signature)
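A short usage sketch of the positional argument order documented above. `signature_for` is a hypothetical wrapper, its defaults (`sign1`, `full`) are illustrative choices, and `cc` is assumed to be an already initialized ChemicalChecker instance.

```python
def signature_for(cc, dataset_code, cctype="sign1", molset="full"):
    """Fetch one signature object from an existing ChemicalChecker instance."""
    # Positional order follows the documented signature:
    # get_signature(cctype, molset, dataset_code, *args, **kwargs)
    return cc.get_signature(cctype, molset, dataset_code)


# Example (requires a populated CC):
# sign1 = signature_for(cc, "A1.001")
```

Note that the signature type comes first and the dataset code last, which is easy to swap by accident since all three arguments are strings.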

get_signature_path(cctype, molset, dataset_code)

Return the signature path for the given dataset code.

This should be the only place where we define the directory structure. The signature directory typically contains the signature HDF5 file.

Parameters:
  • cctype (str) – The Chemical Checker datatype i.e. one of the sign*.

  • molset (str) – The molecule set name.

  • dataset_code (str) – The dataset of the Chemical Checker.

Returns:

The signature path.

Return type:

signature_path(str)
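The directory structure itself is not spelled out on this page. The sketch below assumes the nesting hinted at by the metadata tuple documented for get_h5_metadata (‘full’, ‘A’, ‘A1’, ‘A1.001’, ‘sign3’), that is `<cc_root>/<molset>/<A>/<A1>/<A1.001>/<cctype>`. `sketch_signature_path` is a hypothetical helper, not the method's implementation.

```python
from pathlib import PurePosixPath


def sketch_signature_path(cc_root, cctype, molset, dataset_code):
    """Build a signature directory path under the assumed CC layout."""
    # Assumed nesting: <cc_root>/<molset>/<A>/<A1>/<A1.001>/<cctype>
    level = dataset_code[:1]        # e.g. 'A'
    coordinate = dataset_code[:2]   # e.g. 'A1'
    path = PurePosixPath(cc_root) / molset / level / coordinate / dataset_code / cctype
    return str(path)
```

Keeping path construction in a single function, as the text above recommends, means a layout change only has to be made in one place.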

link

Link H5 files and models from a given custom directory.

Populates local CC instance with symlinks to external signatures H5s or models.

Parameters:

custom_data_path (str) – Path to a directory containing signature H5s, models or symlinks.

property metadata

Return the metadata of the Chemical Checker.

property name

Return the name of the Chemical Checker.

report_available(molset='*', dataset='*', signature='*')

Report available signatures in the CC.

Get the moleculeset/dataset combination where signatures are available. Use arguments to apply filters.

Parameters:
  • molset (str, optional) – Filter for the moleculeset e.g. ‘full’ or ‘reference’

  • dataset (str, optional) – Filter for the dataset e.g. A1.001

  • signature (str, optional) – Filter for signature type e.g. ‘sign1’

Returns:

Nested dictionary with molset, dataset and list of signatures
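The nested result and the wildcard filters can be sketched without the library. This assumes the filters behave like shell-style patterns, as the ‘*’ defaults suggest. `sketch_report_available` and the `found` triples are hypothetical.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def sketch_report_available(found, molset="*", dataset="*", signature="*"):
    """Group (molset, dataset, signature) triples into {molset: {dataset: [signatures]}}."""
    report = defaultdict(lambda: defaultdict(list))
    for m, d, s in sorted(found):
        # Assumption: filters are shell-style patterns, as the '*' defaults suggest.
        if fnmatchcase(m, molset) and fnmatchcase(d, dataset) and fnmatchcase(s, signature):
            report[m][d].append(s)
    # Convert to plain dicts so the result prints and compares cleanly.
    return {m: dict(by_dataset) for m, by_dataset in report.items()}


found = [
    ("full", "A1.001", "sign0"),
    ("full", "A1.001", "sign1"),
    ("reference", "A1.001", "sign1"),
    ("full", "B4.001", "sign1"),
]
report = sketch_report_available(found, molset="full", signature="sign1")
print(report)
```

With these filters only the `full` molset and `sign1` entries remain, grouped by dataset code.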

report_dimensions(molset='*', dataset='*', signature='*', matrix='V')

Report dimensions of all available signatures in the CC.

Get the moleculeset/dataset combination where signatures are available. Report the size of the ‘V’ matrix. Use arguments to apply filters.

Parameters:
  • molset (str) – Filter for the moleculeset e.g. ‘full’ or ‘reference’

  • dataset (str) –

  • signature (str) – Filter for signature type e.g. ‘sign1’

Returns:

Nested dictionary with molset, dataset and list of signatures

report_keys(molset='full', dataset='*', signature='sign1')

Report keys of all available signatures in the CC.

Get the moleculeset/dataset combination where signatures are available. Report the list of keys. Use arguments to apply filters.

Parameters:
  • molset (str) – Filter for the moleculeset e.g. ‘full’ or ‘reference’

  • dataset (str) –

  • signature (str) – Filter for signature type e.g. ‘sign1’

Returns:

Nested dictionary with molset, dataset and list of signatures

report_status(molset='*', dataset='*', signature='*')

Report status of signatures in the CC.

Parameters:
  • molset (str) – Filter for the moleculeset e.g. ‘full’ or ‘reference’

  • dataset (str) –

  • signature (str) – Filter for signature type e.g. ‘sign1’

Returns:

Nested dictionary with molset, dataset and list of signatures

static set_verbosity(level='warning', logger_name='chemicalchecker', format=None)

Set the verbosity for logging module.
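For orientation, the same effect on a standard-library logger can be sketched as follows. This assumes the level is given as a lowercase name like the ‘warning’ default. `sketch_set_verbosity` is a hypothetical stand-in, not the library's implementation, and it does not model the `format` argument.

```python
import logging


def sketch_set_verbosity(level="warning", logger_name="chemicalchecker"):
    """Set a named logger's threshold from a level name such as 'debug'."""
    numeric = getattr(logging, level.upper(), None)
    if not isinstance(numeric, int):
        raise ValueError("unknown logging level: %r" % level)
    logger = logging.getLogger(logger_name)
    logger.setLevel(numeric)
    return logger


# With the package installed, the documented static method would be called as:
# ChemicalChecker.set_verbosity(level="debug")  # 'debug' assumed valid by analogy with 'warning'
```

Because the method is static, it can be called on the class before any ChemicalChecker instance exists.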

sign_metadata(key, molset, dataset, cctype)

Return the metadata of the Chemical Checker.

symlink_to

Link current CC instance to other via symlinks.

When experimenting with signature parameters it’s useful to have low cctype (e.g. sign0, sign1) not copied but simply linked.

Parameters:
  • source_cc (ChemicalChecker) – A different CC instance to link.

  • cctypes (list) – The signature (i.e. sign*) to link.

  • molsets (list) – The molecule set name to link.

  • datasets (list) – The codes of dataset to link.

  • rename_dataset (dict) – None by default, which means no renaming. Otherwise a mapping of source to destination name should be provided.

  • models (bool) – If True, models directory will also be linked. This will delete the local models for the specified datasets.
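The difference between linking and copying can be sketched with `os.symlink` on a throwaway directory tree. The directory layout and the `link_signature_dir` helper are assumptions made for illustration and do not reproduce the method's implementation.

```python
import os
import tempfile


def link_signature_dir(source_dir, dest_dir):
    """Expose `source_dir` at `dest_dir` through a symlink instead of copying it."""
    os.makedirs(os.path.dirname(dest_dir), exist_ok=True)
    if os.path.lexists(dest_dir):
        # Refuse to replace anything that already exists at the destination.
        raise FileExistsError(dest_dir)
    os.symlink(source_dir, dest_dir, target_is_directory=True)
    return dest_dir


workdir = tempfile.mkdtemp()
# Assumed layout: <cc_root>/<molset>/<A>/<A1>/<A1.001>/<cctype>
source = os.path.join(workdir, "cc_source", "full", "A", "A1", "A1.001", "sign0")
os.makedirs(source)
dest = os.path.join(workdir, "cc_experiment", "full", "A", "A1", "A1.001", "sign0")
link_signature_dir(source, dest)
```

The experimental CC then reads sign0 through the link, so no data is duplicated and changes in the source CC are immediately visible.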