chemicalchecker.util.sanitize.sanitizer.Sanitizer

class Sanitizer(*args, impute_missing=True, trim=True, max_features=10000, check_features=True, min_feature_abs=5, max_feature_freq=0.8, check_keys=True, min_keys_abs=1, max_keys_freq=0.8, sample_size=1000, max_categories=20, zero_as_missing=True, chunk_size=10000, tmp_path=None, **kwargs)

Bases: object

Sanitizer class.

Initialize a Sanitizer instance.

Parameters:
  • impute_missing (bool) – If True, impute NaN and -inf/+inf values. NaN is replaced by the column median; -inf and +inf are replaced by the column minimum and maximum, respectively.

  • trim (bool) – If True, trim the dataset to at most max_features features.

  • max_features (int) – Maximum number of features to keep (default=10000).

  • check_features (bool) – If True, drop features based on the frequency arguments. For categorical data, 0 is considered missing. For continuous data, any non-numerical value is considered missing.

  • min_feature_abs (int) – Minimum number of occurrences (count) of a feature, column-wise (default=5).

  • max_feature_freq (float) – Maximum proportion of occurrences of a feature, column-wise (default=0.8).

  • check_keys (bool) – If True, drop keys based on the frequency arguments. For categorical data, 0 is considered missing. For continuous data, any non-numerical value is considered missing.

  • min_keys_abs (int) – Minimum number of occurrences (count) per key, row-wise (default=1).

  • max_keys_freq (float) – Maximum proportion of occurrences per key, row-wise (default=0.8).

  • sample_size (int) – Number of rows sampled to determine the data type (default=1000).

  • max_categories (int) – Maximum number of categories to expect (default=20).

  • zero_as_missing (bool) – Applies only to categorical (usually binary) data, where 0 denotes missing information. Used when filtering rows or columns by frequency (default=True).
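
The imputation and frequency rules described by the parameters above can be illustrated with a minimal, standard-library sketch. This is not the Sanitizer implementation: the helper names impute_column and keep_feature are hypothetical, the statistics are assumed to be computed over the finite values of each column, and the way the two thresholds are combined (count at least min_feature_abs and proportion at most max_feature_freq) is an assumed reading of the parameter descriptions.

```python
import math
from statistics import median


def impute_column(values):
    """Hypothetical helper: NaN -> median, +inf -> max, -inf -> min.

    Assumption: median/min/max are taken over the finite values only.
    """
    finite = [v for v in values if math.isfinite(v)]
    if not finite:
        return list(values)  # nothing to impute from
    med, low, high = median(finite), min(finite), max(finite)
    out = []
    for v in values:
        if math.isnan(v):
            out.append(med)
        elif v == math.inf:
            out.append(high)
        elif v == -math.inf:
            out.append(low)
        else:
            out.append(v)
    return out


def keep_feature(column, min_feature_abs=5, max_feature_freq=0.8,
                 zero_as_missing=True):
    """Hypothetical helper: decide whether a feature (column) is kept.

    Assumption: a value is "present" if it is not NaN and, when
    zero_as_missing is True, not 0.
    """
    if not column:
        return False

    def present(v):
        if math.isnan(v):
            return False
        return not (zero_as_missing and v == 0)

    count = sum(1 for v in column if present(v))
    return count >= min_feature_abs and count / len(column) <= max_feature_freq


column = [1.0, float("nan"), 3.0, float("inf"), float("-inf"), 5.0]
imputed = impute_column(column)

balanced = [1, 0, 0, 1, 1, 1, 1, 0, 0, 0]   # 5 of 10 present
too_frequent = [1] * 9 + [0]                 # 9 of 10 present
too_rare = [1] + [0] * 9                     # 1 of 10 present
decisions = [keep_feature(c) for c in (balanced, too_frequent, too_rare)]
```

The same logic applied row-wise, with min_keys_abs and max_keys_freq, would correspond to the key filter enabled by check_keys.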

Methods

transform

Sanitize data

transform(data=None, V=None, keys=None, keys_raw=None, features=None, sign=None)

Sanitize data

Parameters:
  • data (str) – Path to an H5 file, or a DataSignature (default=None).

  • V (matrix) – Input matrix (default=None).

  • keys (array) – Keys (default=None).

  • keys_raw (array) – Keys raw (default=None).

  • features (array) – Features (default=None).

  • sign (DataSignature) – Auxiliary data used to impute (default=None).
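
As a hedged usage sketch, the call below passes an in-memory matrix instead of a file path. The wrapper name sanitize_matrix is hypothetical, and several points are assumptions not confirmed by this page: that V, keys, keys_raw and features are supplied together when data is None, that V is a NumPy-style matrix, and that the raw keys can simply repeat the keys. The return value of transform is not documented here, so the wrapper hands back whatever the method returns.

```python
def sanitize_matrix(V, keys, features):
    """Hypothetical wrapper around Sanitizer.transform for in-memory data."""
    # Imported lazily so this sketch can be loaded without the package installed.
    from chemicalchecker.util.sanitize.sanitizer import Sanitizer

    # Keyword arguments and defaults are taken from the class signature above.
    sanitizer = Sanitizer(
        impute_missing=True,
        trim=True,
        max_features=10000,
        check_features=True,
        check_keys=True,
    )
    # Assumption: keys_raw may reuse keys when no separate raw keys exist.
    return sanitizer.transform(V=V, keys=keys, keys_raw=keys, features=features)
```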