chemicalchecker.core.signature_data.DataSignature
- class DataSignature(data_path, ds_data='V', keys_name='keys')[source]
Bases:
object
DataSignature class.
Initialize a DataSignature instance.
Methods
Add a dataset to an H5 file.
Map signature through mappings.
as_dataframe
check_mappings
Iterator on chunks of data
Iterate on signatures.
clear
close_hdf5
Compute the distance pvalues according to the selected metric.
Check that signature is valid.
Copy dataset 'key' to current signature.
Return a pytorch DataLoader object for quick signature iteration.
export_features
Apply a mask to a dataset, dropping columns or rows.
Return the generator function that we can query for batches.
Get a specific dataset in the signature.
Get vectors for a list of keys, sorted by default.
Iterate on signatures.
h5_str
Merge horizontally a list of signatures.
Give the index according to the key.
is_valid
Make a copy of applying a filtering mask on rows.
open_hdf5
Refresh all cached properties
string_dtype
Subsample from a signature without replacement.
Merge vertically a list of signatures.
Attributes
Get the dictionary of dataset and shapes.
Get the V matrix shape.
Get the V matrix size.
- __getitem__(key)[source]
Return the vector corresponding to the key.
The key can be a string (in which case it is mapped through self.keys) or an int. Lookup is fast because it uses bisect, but it should return None if the key is not in keys (ideally, keep a set to do this).
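The lookup described above can be sketched with the standard library alone. This is a minimal illustration, not the class's actual code: the `lookup` function and the in-memory `keys` and `vectors` lists are hypothetical stand-ins for the sorted keys and the V matrix, and returning None for a missing key is the behaviour the note above asks for, not necessarily what the method does today.

```python
from bisect import bisect_left


def lookup(keys, vectors, key):
    """Return the vector for `key`, or None if it is absent.

    Assumes `keys` is sorted and `vectors[i]` belongs to `keys[i]`.
    """
    if isinstance(key, int):
        # Integer keys are treated as row positions.
        return vectors[key] if 0 <= key < len(vectors) else None
    # Binary search for the leftmost position where `key` could be inserted.
    pos = bisect_left(keys, key)
    # bisect only gives an insertion point, so confirm an exact match.
    if pos < len(keys) and keys[pos] == key:
        return vectors[pos]
    return None


# Hypothetical toy data: three sorted keys and their vectors.
keys = ["AAA", "BBB", "DDD"]
vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

hit = lookup(keys, vectors, "BBB")    # present key
miss = lookup(keys, vectors, "CCC")   # absent key
by_pos = lookup(keys, vectors, 2)     # integer position
```

The equality check after `bisect_left` is what separates a real hit from a mere insertion point. A set of keys, as the note suggests, would let the check happen before the search.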
- compute_distance_pvalues(bg_file, metric, sample_pairs=None, unflat=True, memory_safe=False, limit_inks=None)[source]
Compute the distance pvalues according to the selected metric.
- Parameters:
bg_file (str) – The file where to store the distances.
metric (str) – the metric name (cosine or euclidean).
sample_pairs (int) – Number of pairs for distance calculation.
unflat (bool) – Remove flat regions whenever we observe them.
memory_safe (bool) – Computing distances is much faster if we can load the full matrix in memory.
limit_inks (list) – Compute distances only for this subset of InChIKeys.
- Returns:
Dictionary with distances and p-values.
- Return type:
bg_distances(dict)
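To make the idea of distance p-values concrete, the sketch below builds a background of pairwise distances from randomly sampled pairs and reads an empirical p-value off it. It is one common way to do this and is not claimed to be what compute_distance_pvalues does internally. All function names here, and the `seed` argument, are hypothetical.

```python
import math
import random
from bisect import bisect_right


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


METRICS = {"cosine": cosine_distance, "euclidean": euclidean_distance}


def background_distances(vectors, metric, sample_pairs, seed=0):
    """Sorted distances between `sample_pairs` randomly chosen pairs."""
    rng = random.Random(seed)
    dist = METRICS[metric]
    distances = []
    for _ in range(sample_pairs):
        # Two distinct row indices per sampled pair.
        i, j = rng.sample(range(len(vectors)), 2)
        distances.append(dist(vectors[i], vectors[j]))
    return sorted(distances)


def empirical_pvalue(background, distance):
    """Fraction of background distances that are <= `distance`."""
    return bisect_right(background, distance) / len(background)


# Hypothetical toy signature matrix.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.5]]
background = background_distances(toy_vectors, "euclidean", sample_pairs=50)
p_small = empirical_pvalue(background, background[0])
```

A small p-value here means the distance is unusually short compared with random pairs, which is the usual reading when distances are used to call two molecules similar.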
- copy_from(sign, key, chunk=None)[source]
Copy dataset ‘key’ to current signature.
- Parameters:
sign (SignatureBase) – The source signature.
key (str) – The dataset to copy from.
- dataloader(batch_size=32, num_workers=1, shuffle=False, weak_shuffle=False, drop_last=False)[source]
Return a pytorch DataLoader object for quick signature iteration.
- filter_h5_dataset(key, mask, axis, chunk_size=1000)[source]
Apply a mask to a dataset, dropping columns or rows.
- Parameters:
key (str) – The H5 dataset to filter.
mask (np.array) – A one-dimensional boolean mask array. True values will be kept.
axis (int) – Whether the mask refers to rows (0) or columns (1).
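The masking semantics (True values are kept, axis 0 for rows, axis 1 for columns) can be illustrated on a plain list of lists processed in chunks. This is only a sketch of the idea under those stated semantics: `filter_rows_or_columns` is a hypothetical helper and it does not touch an H5 file.

```python
def filter_rows_or_columns(matrix, mask, axis, chunk_size=1000):
    """Keep the rows (axis=0) or columns (axis=1) where `mask` is True."""
    if axis not in (0, 1):
        raise ValueError("axis must be 0 (rows) or 1 (columns)")
    n_cols = len(matrix[0]) if matrix else 0
    expected = len(matrix) if axis == 0 else n_cols
    if len(mask) != expected:
        raise ValueError("mask length does not match the selected axis")

    kept = []
    # Walk the rows in chunks so only `chunk_size` rows are handled at once.
    for start in range(0, len(matrix), chunk_size):
        chunk = matrix[start:start + chunk_size]
        if axis == 0:
            # Row mask: slice the mask in step with the chunk.
            chunk_mask = mask[start:start + chunk_size]
            kept.extend(row for row, keep in zip(chunk, chunk_mask) if keep)
        else:
            # Column mask: the same mask applies to every row.
            kept.extend(
                [value for value, keep in zip(row, mask) if keep]
                for row in chunk
            )
    return kept


# Hypothetical toy dataset.
toy_matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
rows_kept = filter_rows_or_columns(toy_matrix, [True, False, True], axis=0, chunk_size=2)
cols_kept = filter_rows_or_columns(toy_matrix, [True, False, True], axis=1, chunk_size=2)
```

Chunking matters mostly for on-disk data, where reading the whole dataset at once may not fit in memory.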
- generator_fn(weak_shuffle=False, batch_size=None)[source]
Return the generator function that we can query for batches.
- get_vectors(keys, include_nan=False, dataset_name='V', output_missing=False)[source]
Get vectors for a list of keys, sorted by default.
- Parameters:
keys (list) – a list of strings; only the subset that overlaps with the signature keys is considered.
include_nan (bool) – whether to include requested but absent molecule signatures as NaNs.
dataset_name (str) – return any dataset in the h5 which is organized by sorted keys.
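The interplay between the requested keys and include_nan can be sketched as follows. The helper name `select_vectors`, the in-memory data, and the `(keys, rows)` return shape are assumptions made for illustration; the documentation above does not state what get_vectors returns.

```python
import math


def select_vectors(sign_keys, matrix, requested, include_nan=False):
    """Return (keys, rows) for the requested keys, in sorted key order.

    Keys absent from `sign_keys` are skipped, or returned as NaN rows
    when `include_nan` is True.
    """
    position = {key: i for i, key in enumerate(sign_keys)}
    width = len(matrix[0]) if matrix else 0
    out_keys, out_rows = [], []
    for key in sorted(set(requested)):
        if key in position:
            out_keys.append(key)
            out_rows.append(matrix[position[key]])
        elif include_nan:
            # Placeholder row for a requested but absent key.
            out_keys.append(key)
            out_rows.append([math.nan] * width)
    return out_keys, out_rows


# Hypothetical toy signature: three keys with two-dimensional vectors.
sign_keys = ["AAA", "BBB", "DDD"]
sign_matrix = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

found_keys, found_rows = select_vectors(sign_keys, sign_matrix, ["DDD", "CCC", "AAA"])
all_keys, all_rows = select_vectors(
    sign_keys, sign_matrix, ["DDD", "CCC", "AAA"], include_nan=True
)
```

With include_nan left at False the result only covers the overlap, so its length can be shorter than the request. With include_nan set to True every requested key gets a row.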
- static hstack_signatures(sign_list, destination, chunk_size=1000, aggregate_keys=None)[source]
Merge horizontally a list of signatures.
- index(key)[source]
Give the index according to the key.
- Parameters:
key (str) – the key to search index in the matrix.
- Returns:
Index in the matrix
- Return type:
index(int)
- property info_h5
Get the dictionary of dataset and shapes.
- make_filtered_copy(destination, mask, include_all=False, data_file=None, datasets=None, dst_datasets=None, chunk_size=1000, compression=None)[source]
Make a copy, applying a filtering mask on rows.
- Parameters:
destination (str) – The destination file path.
mask (bool array) – A numpy mask array (e.g. the result of np.isin).
include_all (bool) – Whether to copy other datasets (e.g. features, date, name…).
data_file (str) – A specific file to copy (by default, the signature h5).
- property shape
Get the V matrix shape.
- property size
Get the V matrix size.