chemicalchecker.core.signature_data.DataSignature

class DataSignature(data_path, ds_data='V', keys_name='keys')[source]

Bases: object

DataSignature class.

Initialize a DataSignature instance.

Methods

add_datasets

Add datasets to an H5 file.

apply_mappings

Map a signature through mappings.

as_dataframe

check_mappings

chunk_iter

Iterate over chunks of data.

chunker

Iterate on signatures.

clear

close_hdf5

compute_distance_pvalues

Compute the distance pvalues according to the selected metric.

consistency_check

Check that signature is valid.

copy_from

Copy dataset 'key' to current signature.

dataloader

Return a pytorch DataLoader object for quick signature iteration.

export_features

filter_h5_dataset

Apply a mask to a dataset, dropping columns or rows.

generator_fn

Return the generator function that we can query for batches.

get_h5_dataset

Get a specific dataset in the signature.

get_vectors

Get vectors for a list of keys, sorted by default.

get_vectors_lite

Iterate on signatures.

h5_str

hstack_signatures

Merge horizontally a list of signatures.

index

Give the index according to the key.

is_valid

make_filtered_copy

Make a copy, applying a filtering mask on rows.

open_hdf5

refresh

Refresh all cached properties

string_dtype

subsample

Subsample from a signature without replacement.

vstack_signatures

Merge vertically a list of signatures.

Attributes

info_h5

Get the dictionary of dataset and shapes.

shape

Get the V matrix shape.

size

Get the V matrix size.

__getitem__(key)[source]

Return the vector corresponding to the key.

The key can be a string (then it is mapped through self.keys) or an int. Lookup is fast with bisect, but should return None if the key is not in keys (ideally, keep a set to do this).
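The bisect-based lookup described above can be sketched in plain Python. This is an illustrative sketch of the behavior, not the library's actual implementation; `index_of` is a hypothetical helper name.

```python
import bisect

def index_of(keys, key):
    """Locate `key` in a sorted list via binary search; None if absent."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return i
    return None

keys = ["AAA", "BBB", "DDD"]  # keys are kept sorted, as in the signature H5
print(index_of(keys, "BBB"))  # found at position 1
print(index_of(keys, "CCC"))  # absent: None
```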

__iter__()[source]

By default iterate on signatures V.

add_datasets(data_dict, overwrite=True, chunks=None, compression=None)[source]

Add datasets to an H5 file.

apply_mappings(out_file, mappings=None)[source]

Map a signature through mappings.

chunk_iter(key, chunk_size, axis=0, chunk=False, bar=True)[source]

Iterate over chunks of data.
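Chunked iteration of this kind can be sketched with numpy. This is an assumption-laden illustration of the pattern (slicing along an axis, optionally yielding the slice objects themselves), not the method's real implementation, which reads from an H5 dataset named by `key`.

```python
import numpy as np

def iter_chunks(data, chunk_size, axis=0, chunk=False):
    """Yield successive slices of `data` along `axis`.

    With chunk=True, yield the slice objects instead of the data,
    so callers can index the dataset themselves.
    """
    total = data.shape[axis]
    for start in range(0, total, chunk_size):
        sl = slice(start, min(start + chunk_size, total))
        if chunk:
            yield sl
        else:
            yield data[sl] if axis == 0 else data[:, sl]

X = np.arange(10).reshape(5, 2)
chunks = list(iter_chunks(X, chunk_size=2))
print(len(chunks))         # 3 chunks: rows 0-1, 2-3, 4
print(chunks[-1].shape)    # last chunk holds the single remaining row
```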

chunker(size=2000, n=None)[source]

Iterate on signatures.

compute_distance_pvalues(bg_file, metric, sample_pairs=None, unflat=True, memory_safe=False, limit_inks=None)[source]

Compute the distance pvalues according to the selected metric.

Parameters:
  • bg_file (str) – The file where the distances are stored.

  • metric (str) – The metric name ('cosine' or 'euclidean').

  • sample_pairs (int) – Number of pairs to sample for distance calculation.

  • unflat (bool) – Remove flat regions whenever we observe them.

  • memory_safe (bool) – Avoid loading the full matrix in memory; computing distances is much faster when the full matrix fits in memory.

  • limit_inks (list) – Compute distances only for this subset of InChIKeys.

Returns:

Dictionary with distances and p-values

Return type:

bg_distances(dict)

consistency_check()[source]

Check that signature is valid.

copy_from(sign, key, chunk=None)[source]

Copy dataset ‘key’ to current signature.

Parameters:
  • sign (SignatureBase) – The source signature.

  • key (str) – The dataset to copy from.

dataloader(batch_size=32, num_workers=1, shuffle=False, weak_shuffle=False, drop_last=False)[source]

Return a pytorch DataLoader object for quick signature iteration.

filter_h5_dataset(key, mask, axis, chunk_size=1000)[source]

Apply a mask to a dataset, dropping columns or rows.

Parameters:
  • key (str) – The H5 dataset to filter.

  • mask (np.array) – A one-dimensional boolean mask array. True values will be kept.

  • axis (int) – Whether the mask refers to rows (0) or columns (1).
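The row/column masking semantics can be shown with a small numpy example. This is an in-memory illustration only; the actual method applies the mask to an on-disk H5 dataset in chunks.

```python
import numpy as np

V = np.arange(12).reshape(3, 4)

# axis=0: a boolean mask over rows; True rows are kept
row_mask = np.array([True, False, True])
rows_kept = np.compress(row_mask, V, axis=0)
print(rows_kept.shape)   # (2, 4)

# axis=1: a boolean mask over columns; True columns are kept
col_mask = np.array([True, True, False, True])
cols_kept = np.compress(col_mask, V, axis=1)
print(cols_kept.shape)   # (3, 3)
```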

generator_fn(weak_shuffle=False, batch_size=None)[source]

Return the generator function that we can query for batches.

get_h5_dataset(h5_dataset_name, mask=None)[source]

Get a specific dataset in the signature.

get_vectors(keys, include_nan=False, dataset_name='V', output_missing=False)[source]

Get vectors for a list of keys, sorted by default.

Parameters:
  • keys (list) – A list of strings; only the subset overlapping the signature keys is considered.

  • include_nan (bool) – Whether to include requested but absent molecule signatures as NaNs.

  • dataset_name (str) – Return any dataset in the H5 that is organized by sorted keys.

get_vectors_lite(keys, chunk_size=2000, chunk_above=10000)[source]

Iterate on signatures.

static hstack_signatures(sign_list, destination, chunk_size=1000, aggregate_keys=None)[source]

Merge horizontally a list of signatures.

index(key)[source]

Give the index according to the key.

Parameters:

key (str) – the key to search index in the matrix.

Returns:

Index in the matrix

Return type:

index(int)

property info_h5

Get the dictionary of dataset and shapes.

make_filtered_copy(destination, mask, include_all=False, data_file=None, datasets=None, dst_datasets=None, chunk_size=1000, compression=None)[source]

Make a copy, applying a filtering mask on rows.

Parameters:
  • destination (str) – The destination file path.

  • mask (bool array) – A numpy mask array (e.g. the result of np.isin).

  • include_all (bool) – Whether to copy other datasets (e.g. features, date, name…).

  • data_file (str) – A specific file to copy (by default the signature H5).

refresh()[source]

Refresh all cached properties

property shape

Get the V matrix shape.

property size

Get the V matrix size.

subsample(n, seed=42)[source]

Subsample from a signature without replacement.

Parameters:

n (int) – Maximum number of samples (default=10000).

Returns:

A (samples, features) matrix. keys(array): The list of keys.

Return type:

V(matrix)
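Subsampling without replacement, returning both the matrix and the corresponding keys, can be sketched as below. This is a hypothetical in-memory helper mirroring the described return values, not the library's implementation; sorting the drawn indices keeps the key order of the original signature.

```python
import numpy as np

def subsample(V, keys, n, seed=42):
    """Draw up to n rows without replacement, preserving key order."""
    rng = np.random.default_rng(seed)
    size = min(n, V.shape[0])
    idx = np.sort(rng.choice(V.shape[0], size=size, replace=False))
    return V[idx], [keys[i] for i in idx]

V = np.arange(20).reshape(10, 2)
keys = [f"k{i}" for i in range(10)]
sub_V, sub_keys = subsample(V, keys, 4)
print(sub_V.shape)     # (4, 2)
print(len(sub_keys))   # one key per sampled row
```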

static vstack_signatures(sign_list, destination, chunk_size=10000, vchunk_size=100)[source]

Merge vertically a list of signatures.
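The chunked vertical merge can be illustrated with numpy: rows from each source are copied into a pre-allocated destination a fixed number at a time, which is the pattern used when the sources are too large to load at once. The helper name and chunking scheme here are assumptions for illustration; the real method streams between H5 files.

```python
import numpy as np

def vstack_chunked(matrices, chunk_size=2):
    """Concatenate matrices vertically, copying chunk_size rows at a time."""
    n_rows = sum(m.shape[0] for m in matrices)
    n_cols = matrices[0].shape[1]
    out = np.empty((n_rows, n_cols), dtype=matrices[0].dtype)
    pos = 0
    for m in matrices:
        for start in range(0, m.shape[0], chunk_size):
            end = min(start + chunk_size, m.shape[0])
            out[pos + start:pos + end] = m[start:end]
        pos += m.shape[0]
    return out

A = np.ones((3, 4))
B = np.zeros((2, 4))
merged = vstack_chunked([A, B])
print(merged.shape)   # (5, 4), identical to np.vstack([A, B])
```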