chemicalchecker.core.preprocess.Preprocess

class Preprocess(signature_path, dataset, *args, **kwargs)[source]

Bases: object

Preprocess class.

Initialize a Preprocess instance.

This class handles calling the external run.py for each dataset and provides shared methods.

Parameters:

signature_path (str) – the path to the signature directory.

Methods

call_preprocess

Call the external pre-process script.

fit

Call the external preprocess script to generate H5 data.

get_datasources

get_parser

is_fit

predict

Call the external preprocess script to generate H5 data.

preprocess

Return the file with the raw data preprocessed.

preprocess_predict

Runs the preprocessing script 'predict'.

save_output

Save raw data produced by the preprocess script as matrix.

to_feature_string

Convert signature to a string with feature names.

to_features

Convert signature to explicit feature names.

call_preprocess(output, method, infile=None, entry=None)[source]

Call the external pre-process script.

fit()[source]

Call the external preprocess script to generate H5 data.

The preprocess script is invoked with the fit argument, which means features are extracted from the datasource and saved.

predict(input_data_file, destination, entry_point)[source]

Call the external preprocess script to generate H5 data.

classmethod preprocess(sign, **params)[source]

Return the file with the raw data preprocessed.

Parameters:
  • sign – signature object (e.g. obtained from cc.get_signature)

  • params – specific parameters for a given preprocess script

Returns:

The name of the file where the data is saved.

Return type:

datafile(str)

ex:

os.path.join(self.raw_path, "preprocess.h5")

classmethod preprocess_predict(sign, input_file, destination, entry_point)[source]

Runs the preprocessing script ‘predict’.

Run on an input file of raw data that is correctly formatted for the space of interest.

Parameters:
  • sign – signature object (e.g. obtained from cc.get_signature)

  • input_file (str) – path to the H5 file containing the data on which to apply ‘predict’

  • destination (str) – Path to a H5 file where the predicted signature will be saved.

  • entry_point (str) – Entry point of the input data for the signaturization process. It depends on the type of data passed in input_file.

Returns:

The H5 file containing the predicted data after preprocessing.

Return type:

datafile(str)

static save_output(output_file, inchikey_raw, method, models_path, discrete, features, features_int=False, chunk=2000)[source]

Save raw data produced by the preprocess script as matrix.

The results of preprocess scripts are usually in a compact format (e.g. binary data only lists features with a value of 1), since the data might be sparse and memory intensive to handle. This method converts them to a signature-like (explicit, extended) format. The produced H5 file will contain 3 datasets:

  • ‘keys’: identifier (usually inchikey),

  • ‘features’: features names,

  • ‘X’: the data matrix

Parameters:
  • output_file (str) – Path to output H5 file.

  • inchikey_raw (dict) – inchikey -> list of values (dense format).

  • method (str) – Same as used in the preprocess script.

  • models_path (str) – Path to signature models directory.

  • discrete (bool) – True if data is binary/discrete, False for continuous data.

  • features (list) – List of feature names from original sign0, None when method is ‘fit’.

  • features_int (bool) – Set to True when features have no names, so that integers are used as feature names.

  • chunk (int) – Chunk size for loading data.
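The compact-to-explicit conversion described above can be illustrated with a minimal, pure-Python sketch. It is not the library's implementation: the helper name `expand_binary` is hypothetical, the input is assumed to map each key to the list of its active (value 1) feature names, and the result is kept in memory rather than written to H5.

```python
def expand_binary(compact):
    """Expand {key: [active feature names]} into (keys, features, X).

    Hypothetical helper for illustration only; it mirrors the three
    datasets named above ('keys', 'features', 'X') as plain lists.
    """
    # Sort keys and feature names so the row and column order is reproducible.
    keys = sorted(compact)
    features = sorted({name for active in compact.values() for name in active})
    column = {name: j for j, name in enumerate(features)}

    # Start from an all-zero matrix and set a 1 for every listed feature.
    X = [[0] * len(features) for _ in keys]
    for i, key in enumerate(keys):
        for name in compact[key]:
            X[i][column[name]] = 1
    return keys, features, X


# Placeholder identifiers, not real InChIKeys.
compact = {
    "KEY-B": ["f2"],
    "KEY-A": ["f1", "f3"],
}
keys, features, X = expand_binary(compact)
```

Each row of `X` lines up with one entry of `keys`, and each column with one entry of `features`, which is the alignment the three H5 datasets are described as sharing.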

to_feature_string(signatures, string_func)[source]

Convert signature to a string with feature names.

Parameters:
  • signatures (array) – Signature array(s).

  • string_func (func) – A function taking a dictionary as input and returning a single string.
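A possible string_func could look like the sketch below. The function name `join_feature_names` and the sample dictionaries are hypothetical; the only property taken from the description above is that the function receives a dictionary and returns a single string.

```python
def join_feature_names(feature_dict):
    """Return the feature names of one signature as a space-separated string.

    Illustrative only: the values are ignored and the names are sorted
    so the output does not depend on dictionary insertion order.
    """
    return " ".join(sorted(feature_dict))


# Hypothetical per-signature dictionaries (feature name -> value).
feature_dicts = [{"f3": 1, "f1": 1}, {"f2": 1}]
strings = [join_feature_names(d) for d in feature_dicts]
```

A function with this shape would be passed as the `string_func` argument; whether names are sorted, weighted by value, or joined with another separator is a choice left to the caller.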

to_features(signatures)[source]

Convert signature to explicit feature names.

Parameters:

signatures (array) – a signature 0 for 1 or more molecules

Returns:

One dictionary per signature, with feature names as keys and the corresponding values as values.

Return type:

list of dict
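The return shape can be sketched as follows. This is an assumption-laden illustration rather than the method's actual code: `rows_to_feature_dicts` is a hypothetical helper, the feature names are passed in explicitly, and zero entries are assumed to be dropped from each dictionary.

```python
def rows_to_feature_dicts(rows, feature_names):
    """Turn matrix rows into one {feature_name: value} dict per row.

    Hypothetical helper; it assumes zero entries are omitted, which may
    differ from what to_features does.
    """
    return [
        {name: value for name, value in zip(feature_names, row) if value != 0}
        for row in rows
    ]


feature_names = ["f1", "f2", "f3"]
rows = [
    [0, 2.5, 0],    # first molecule: only f2 is non-zero
    [1.0, 0, 3.0],  # second molecule: f1 and f3 are non-zero
]
result = rows_to_feature_dicts(rows, feature_names)
```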