chemicalchecker.util.remove_near_duplicates.remove_near_duplicates.RNDuplicates

class RNDuplicates(nbits=128, only_duplicates=False, cpu=1)

Bases: object

RNDuplicates class.

Initialize an RNDuplicates instance.

Parameters:
  • nbits (int) – Number of bits to use to quantize.

  • only_duplicates (boolean) – Remove only exact duplicates.

  • cpu (int) – Number of cores to use.
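
The parameter list does not say how vectors are quantized. As an illustration of the general idea only, the sketch below assumes a sign-of-random-projection code: each vector is reduced to `nbits` bits, and vectors that share a code are treated as near-duplicates. The helper names (`make_hyperplanes`, `quantize`, `group_by_code`) are hypothetical and are not part of chemicalchecker.

```python
import random


def make_hyperplanes(dim, nbits, seed=0):
    """Draw nbits random Gaussian directions (assumed scheme, not the library's)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(nbits)]


def quantize(vector, hyperplanes):
    """Return one bit per hyperplane: 1 if the projection is non-negative."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vector)) >= 0.0 else 0
        for plane in hyperplanes
    )


def group_by_code(data, nbits=128, seed=0):
    """Group row indices whose quantized codes are identical."""
    planes = make_hyperplanes(len(data[0]), nbits, seed)
    groups = {}
    for index, vector in enumerate(data):
        groups.setdefault(quantize(vector, planes), []).append(index)
    return list(groups.values())


data = [
    [1.0, 0.0, 0.5],
    [1.0, 0.0, 0.5],     # exact duplicate of row 0, so it gets the same code
    [-1.0, 0.0, -0.5],   # negation of row 0, so every bit of its code flips
]
groups = group_by_code(data, nbits=16)
print(groups)
```

In this sketch a smaller `nbits` makes codes coarser, so more rows collide and are merged; a larger value keeps more rows distinct.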

Methods

  • remove – Remove redundancy from data.

  • save – Save non-redundant data.

remove(data, keys=None, save_dest=None, just_mappings=False)

Remove redundancy from data.

Parameters:
  • data (array) – The data to remove duplicates from. It can be a numpy array or the path to an HDF5 file containing a dataset named V.

  • keys (array) – Array of keys for the input data. If None, keys are taken from the HDF5 dataset keys.

  • save_dest (str) – Path of the file in which to save the result, if it needs to be saved. (default: None)

  • just_mappings (bool) – Return only the mappings. Only applies if save_dest is None. (default: False)

Returns:
  • keys (array)

  • data (array)

  • mappings (dictionary)
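
The return values are listed without descriptions. As a rough illustration of how keys, data and mappings could relate, the sketch below handles only the exact-duplicate case in plain Python. It assumes that mappings sends every input key to the key of the row retained for it; that format is an assumption, not something this page confirms, and `remove_exact_duplicates` is a hypothetical helper rather than the library's code.

```python
def remove_exact_duplicates(data, keys):
    """Keep the first occurrence of each distinct row.

    Returns (kept_keys, kept_data, mappings), where mappings is assumed to
    map every input key to the key of the row that was retained for it.
    """
    first_seen = {}  # row content -> key of the first row with that content
    kept_keys, kept_data, mappings = [], [], {}
    for key, row in zip(keys, data):
        signature = tuple(row)
        if signature not in first_seen:
            first_seen[signature] = key
            kept_keys.append(key)
            kept_data.append(row)
        mappings[key] = first_seen[signature]
    return kept_keys, kept_data, mappings


keys = ["a", "b", "c", "d"]
data = [[0, 1], [1, 1], [0, 1], [1, 1]]
kept_keys, kept_data, mappings = remove_exact_duplicates(data, keys)
print(kept_keys)   # keys of the retained rows
print(mappings)    # every original key and the retained key it collapses to
```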

save(destination)

Save non-redundant data to an HDF5 file.

Returns:
  • destination (str) – The destination file path.
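
A minimal sketch of how the two methods might be combined, using only the signatures documented on this page. It assumes that remove returns keys, data and mappings in that order, and that save can be called on the same instance after remove; neither is confirmed here. The wrapper `deduplicate_and_save` is a hypothetical helper, and it takes the instance as an argument so the snippet does not depend on the package being installed.

```python
def deduplicate_and_save(rnd, data, keys, destination):
    """Run remove() and then save() on an RNDuplicates-like object.

    rnd is expected to expose remove(data, keys=...) and save(destination)
    as documented above, e.g. an instance created with
    RNDuplicates(nbits=128, only_duplicates=False, cpu=1).
    """
    # Assumed return order: keys, data, mappings.
    kept_keys, kept_data, mappings = rnd.remove(data, keys=keys)
    # Write the non-redundant data to the HDF5 file at `destination`.
    rnd.save(destination)
    return kept_keys, kept_data, mappings
```

Passing save_dest to remove is the documented alternative when the result should be written during the same call.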