chemicalchecker.util.sanitize.sanitizer.Sanitizer
- class Sanitizer(*args, impute_missing=True, trim=True, max_features=10000, check_features=True, min_feature_abs=5, max_feature_freq=0.8, check_keys=True, min_keys_abs=1, max_keys_freq=0.8, sample_size=1000, max_categories=20, zero_as_missing=True, chunk_size=10000, tmp_path=None, **kwargs)
Bases: object
Sanitizer class.
Initialize a Sanitizer instance.
- Parameters:
impute_missing (bool) – True if NaNs (and -inf/+inf) are to be imputed. NaNs are replaced by the column median; -inf/+inf are replaced by the column min/max.
trim (bool) – Trim dataset to have a maximum number of features.
max_features (int) – Maximum number of features to keep (default=10000).
check_features (bool) – True if we want to drop features based on the frequency arguments. For categorical data, 0 is considered missing. For continuous data, any non-numerical value is considered missing.
min_feature_abs (int) – Minimum number (counts) of occurrences of feature, column-wise. (default=5).
max_feature_freq (float) – Maximum proportion of occurrences of the feature, column-wise. (default=0.8).
check_keys (bool) – True if we want to drop keys based on the frequency arguments. For categorical data, 0 is considered missing. For continuous data, any non-numerical value is considered missing.
min_keys_abs (int) – Minimum number (count) of occurrences per key, row-wise (default=1).
max_keys_freq (float) – Maximum proportion of occurrences per key, row-wise (default=0.8).
sample_size (int) – Number of rows sampled to determine the data type.
max_categories (int) – Maximum number of categories we can expect.
zero_as_missing (bool) – Only applied to categorical (usually binary) data, where 0 denotes missing information. Used when filtering rows or columns by frequency.
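The imputation rule described for impute_missing can be illustrated with a minimal sketch. This is not the Sanitizer implementation: the helper name impute_column is hypothetical, the column is a plain Python list to keep the example dependency-free, and computing the median/min/max over the finite values only is an assumption.

```python
import math
from statistics import median


def impute_column(column):
    """Return a copy of `column` with NaN -> median, -inf -> min, +inf -> max.

    Hypothetical helper: statistics are taken over the finite values only,
    which is an assumption about how the rule is meant to be applied.
    """
    finite = [x for x in column if math.isfinite(x)]
    if not finite:
        # No finite value to base the imputation on; return the column as-is.
        return list(column)
    col_median, col_min, col_max = median(finite), min(finite), max(finite)

    imputed = []
    for x in column:
        if math.isnan(x):
            imputed.append(col_median)
        elif x == math.inf:
            imputed.append(col_max)
        elif x == -math.inf:
            imputed.append(col_min)
        else:
            imputed.append(x)
    return imputed


column = [1.0, float("nan"), 3.0, float("inf"), float("-inf"), 5.0]
result = impute_column(column)
print(result)  # [1.0, 3.0, 3.0, 5.0, 1.0, 5.0]
```

Here the finite values are 1.0, 3.0 and 5.0, so the NaN becomes 3.0 (median), +inf becomes 5.0 (max) and -inf becomes 1.0 (min).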
Methods
transform – Sanitize data
- transform(data=None, V=None, keys=None, keys_raw=None, features=None, sign=None)
Sanitize data
- Parameters:
data (str) – Path to an H5 file, or a DataSignature (default=None).
V (matrix) – Input matrix (default=None).
keys (array) – Keys (default=None).
keys_raw (array) – Keys raw (default=None).
features (array) – Features (default=None).
sign (DataSignature) – Auxiliary data used to impute (default=None).
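To show how V, keys and features stay aligned when rows and columns are dropped by frequency, here is a minimal sketch on a small binary matrix. It is not the Sanitizer code: the function filter_by_frequency and the helper _keep are hypothetical, zeros are treated as missing (as with zero_as_missing=True), features are assumed to be filtered before keys, and the thresholds are assumed to mean "keep if count >= minimum and proportion <= maximum".

```python
def _keep(count, total, min_abs, max_freq):
    """Keep an item if it occurs often enough, but not too often."""
    return total > 0 and count >= min_abs and count / total <= max_freq


def filter_by_frequency(V, keys, features,
                        min_feature_abs=5, max_feature_freq=0.8,
                        min_keys_abs=1, max_keys_freq=0.8):
    """Drop columns, then rows, by non-zero frequency (illustrative only)."""
    # Column-wise: count non-zero entries per feature.
    n_rows = len(V)
    keep_cols = [
        j for j in range(len(features))
        if _keep(sum(1 for row in V if row[j] != 0), n_rows,
                 min_feature_abs, max_feature_freq)
    ]
    V = [[row[j] for j in keep_cols] for row in V]
    features = [features[j] for j in keep_cols]

    # Row-wise: count non-zero entries per key on the remaining features.
    n_cols = len(features)
    keep_rows = [
        i for i, row in enumerate(V)
        if _keep(sum(1 for x in row if x != 0), n_cols,
                 min_keys_abs, max_keys_freq)
    ]
    V = [V[i] for i in keep_rows]
    keys = [keys[i] for i in keep_rows]
    return V, keys, features


V = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
]
keys = ["k1", "k2", "k3", "k4", "k5"]
features = ["f1", "f2", "f3", "f4"]

V_out, keys_out, features_out = filter_by_frequency(
    V, keys, features,
    min_feature_abs=2, max_feature_freq=0.7,
    min_keys_abs=1, max_keys_freq=1.0,
)
print(features_out)  # ['f3']
print(keys_out)      # ['k1', 'k4']
print(V_out)         # [[1], [1]]
```

In this toy input, f1 is dropped for being too frequent (4 of 5 rows), f2 and f4 for being too rare, and the keys left with no non-zero value on the remaining feature are then removed.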