chemicalchecker.core.preprocess.Preprocess
- class Preprocess(signature_path, dataset, *args, **kwargs)[source]
Bases:
object
Preprocess class.
Initialize a Preprocess instance.
This class handles calling the external run.py for each dataset and provides shared methods.
- Parameters:
signature_path (str) – the path to the signature directory.
Methods
call_preprocess – Call the external pre-process script.
fit – Call the external preprocess script to generate H5 data.
get_datasources
get_parser
is_fit
predict – Call the external preprocess script to generate H5 data.
preprocess – Return the file with the raw data preprocessed.
preprocess_predict – Runs the preprocessing script 'predict'.
save_output – Save raw data produced by the preprocess script as matrix.
Convert signature to a string with feature names.
Convert signature to explicit feature names.
- call_preprocess(output, method, infile=None, entry=None)[source]
Call the external pre-process script.
- fit()[source]
Call the external preprocess script to generate H5 data.
The preprocess script is invoked with the fit argument, which means features are extracted from the datasource and saved.
- predict(input_data_file, destination, entry_point)[source]
Call the external preprocess script to generate H5 data.
- classmethod preprocess(sign, **params)[source]
Return the file with the raw data preprocessed.
- Parameters:
sign – signature object (e.g. obtained from cc.get_signature)
params – specific parameters for a given preprocess script
- Returns:
The name of the file where the data is saved.
- Return type:
datafile(str)
- ex:
os.path.join(self.raw_path, "preprocess.h5")
- classmethod preprocess_predict(sign, input_file, destination, entry_point)[source]
Runs the preprocessing script ‘predict’.
Runs on an input file of raw data correctly formatted for the space of interest.
- Parameters:
sign – signature object (e.g. obtained from cc.get_signature)
input_file (str) – path to the H5 file containing the data on which to apply ‘predict’
destination (str) – Path to a H5 file where the predicted signature will be saved.
entry_point (str) – Entry point of the input data for the signaturization process. It depends on the type of data passed in input_file.
- Returns:
The H5 file containing the predicted data after preprocessing.
- Return type:
datafile(str)
- static save_output(output_file, inchikey_raw, method, models_path, discrete, features, features_int=False, chunk=2000)[source]
Save raw data produced by the preprocess script as matrix.
The results of preprocess scripts are usually in a compact format (e.g. binary data lists only the features with a value of 1), since the data might be sparse and memory-intensive to handle. This method converts them to a signature-like (explicit, extended) format. The produced H5 file will contain three datasets:
‘keys’: identifier (usually inchikey),
‘features’: features names,
‘X’: the data matrix
- Parameters:
output_file (str) – Path to output H5 file.
inchikey_raw (dict) – inchikey -> list of values (dense format).
method (str) – Same as used in the preprocess script.
models_path (str) – Path to signature models directory.
discrete (bool) – True if data is binary/discrete, False for continuous data.
features (list) – List of feature names from original sign0, None when method is ‘fit’.
features_int (bool) – True if features have no names, in which case integers are used as feature names.
chunk (int) – Chunk size for loading data.
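The conversion from compact to explicit format described for save_output can be illustrated with a minimal pure-Python sketch. This is not the library's implementation: the helper names compact_to_explicit and iter_chunks, the placeholder keys, and the choice to sort keys and features are all assumptions made for illustration. It assumes compact binary input that maps each key to the list of features whose value is 1, and it only builds the three pieces ('keys', 'features', 'X') in memory rather than writing an H5 file.

```python
def compact_to_explicit(key_to_features):
    """Expand {key: [feature, ...]} into (keys, features, X).

    Hypothetical helper: X is a binary matrix (list of rows) with one row
    per key and one column per feature name.
    """
    # Assumption: keys and features are sorted so the output is deterministic.
    keys = sorted(key_to_features)
    features = sorted({f for feats in key_to_features.values() for f in feats})
    column = {name: j for j, name in enumerate(features)}

    X = []
    for key in keys:
        row = [0] * len(features)
        for name in key_to_features[key]:
            row[column[name]] = 1  # a listed feature means "value of 1"
        X.append(row)
    return keys, features, X


def iter_chunks(rows, chunk=2000):
    """Yield consecutive slices of at most `chunk` rows (hypothetical helper)."""
    for start in range(0, len(rows), chunk):
        yield rows[start:start + chunk]


# Placeholder identifiers; real data would usually use InChIKeys.
compact = {
    "KEY_B": ["f3"],
    "KEY_A": ["f1", "f3"],
}
keys, features, X = compact_to_explicit(compact)
chunks = list(iter_chunks(X, chunk=1))
```

Here keys, features and X play the roles of the 'keys', 'features' and 'X' datasets, and iter_chunks mirrors the idea behind the chunk parameter: processing rows in fixed-size blocks so the full matrix need not be handled at once.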