chemicalchecker.core.sign0.sign0
- class sign0(signature_path, dataset, **params)[source]
Bases: BaseSignature, DataSignature
Signature type 0 class.
Initialize a Signature.
- Parameters:
signature_path (str) – the path to the signature directory.
dataset (str) – NS ex A1.001, here only serves as the ‘name’ record of the h5 file.
Methods
Add datasets to an H5 file.
Map signature through mappings.
as_dataframe
This signature data is available.
Return the background distances according to the selected metric.
check_mappings
Iterator on chunks of data
Iterate on signatures.
Remove everything from this signature.
Remove everything from this signature for both reference and full.
close_hdf5
Compute the distance pvalues according to the selected metric.
Check that signature is valid.
Copy dataset 'key' to current signature.
Return a pytorch DataLoader object for quick signature iteration.
diagnosis
export_features
Apply a mask to a dataset, dropping columns or rows.
Process the input data.
Conclude fit method.
Execute the fit method on the configured HPC.
Execute any method on the configured HPC.
Return the generator function that we can query for batches.
Return the CC where the signature is present
Get data in the right format.
Get a specific dataset in the signature.
Return the intersection between two signatures.
Return a signature from a different molset
Return the neighbors signature, given a signature
Return the non redundant intersection between two signatures.
Return the signature type for current dataset
get_status_stack
Get vectors for a list of keys, sorted by default.
Iterate on signatures.
h5_str
Merge horizontally a list of signatures.
Give the index according to the key.
is_fit
is_valid
Make a copy of applying a filtering mask on rows.
mark_ready
open_hdf5
Given data, produce a sign0.
Define feature names.
Given keys, and key type validate them.
refesh
Refresh all cached properties
Restricts the keys contained in the universe.
Map the non redundant signature in explicit full molset.
Save a non redundant signature in reference molset.
sort
string_dtype
Subsample from a signature without replacement.
Write smiles to h5.
update_status
Perform validations.
Merge vertically a list of signatures.
Attributes
Get the dictionary of dataset and shapes.
Signature qualified name (e.g. ‘B1.001-sign1-full’).
Get the V matrix shape.
Get the V matrix size.
status
- __getitem__(key)
Return the vector corresponding to the key.
The key can be a string (then it’s mapped through self.keys) or an int. Works fast with bisect, but should return None if the key is not in keys (ideally, keep a set to do this).
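The lookup described above can be sketched with the standard bisect module. This is only an illustration under assumptions, not the class's code: lookup is a hypothetical helper, keys is assumed to be sorted, and vectors stands in for the V matrix.

```python
from bisect import bisect_left


def lookup(keys, vectors, key):
    """Return the vector for `key` (a string or an int index), or None if absent.

    Hypothetical helper: `keys` must be sorted and aligned with `vectors`.
    """
    if isinstance(key, int):
        # Integer keys are treated as direct row indices.
        in_range = -len(vectors) <= key < len(vectors)
        return vectors[key] if in_range else None
    idx = bisect_left(keys, key)  # binary search in the sorted keys
    if idx < len(keys) and keys[idx] == key:
        return vectors[idx]
    return None  # the key is not present


# Hypothetical data for illustration only.
example_keys = ["AAA", "BBB", "DDD"]
example_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
found = lookup(example_keys, example_vectors, "BBB")
missing = lookup(example_keys, example_vectors, "CCC")
```

The equality check after bisect_left is what lets a missing key return None without keeping a separate set of keys.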
- __iter__()
By default iterate on signatures V.
- __repr__()
String representing the signature.
- add_datasets(data_dict, overwrite=True, chunks=None, compression=None)
Add datasets to an H5 file.
- apply_mappings(out_file, mappings=None)
Map signature through mappings.
- available()
This signature data is available.
- background_distances(metric, limit_inks=None, name=None)
Return the background distances according to the selected metric.
- Parameters:
metric (str) – the metric name (cosine or euclidean).
- chunk_iter(key, chunk_size, axis=0, chunk=False, bar=True)
Iterator on chunks of data
- chunker(size=2000, n=None)
Iterate on signatures.
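Chunked iteration of this kind can be sketched as a generator of slices. This is an assumption-laden illustration rather than the method itself: iter_slices is a hypothetical standalone function, and n is assumed to be the number of rows to cover.

```python
def iter_slices(n, size=2000):
    """Yield consecutive slices of at most `size` rows that cover range(n).

    Hypothetical helper, written only to illustrate chunked iteration.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, n, size):
        # The last slice is shortened so that it never runs past n.
        yield slice(start, min(start + size, n))


rows = list(range(5))  # stand-in for the rows of a signature matrix
chunks = [rows[chunk] for chunk in iter_slices(len(rows), size=2)]
```

Yielding slices instead of data keeps memory use bounded, since each chunk is only read when the slice is applied.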
- clear()
Remove everything from this signature.
- clear_all()
Remove everything from this signature for both reference and full.
- compute_distance_pvalues(bg_file, metric, sample_pairs=None, unflat=True, memory_safe=False, limit_inks=None)
Compute the distance pvalues according to the selected metric.
- Parameters:
bg_file (str) – The file where to store the distances.
metric (str) – the metric name (cosine or euclidean).
sample_pairs (int) – Amount of pairs for distance calculation.
unflat (bool) – Remove flat regions whenever we observe them.
memory_safe (bool) – Computing distances is much faster if we can load the full matrix in memory.
limit_inks (list) – Compute distances only for this subset of inchikeys.
- Returns:
Dictionary with distances and Pvalues
- Return type:
bg_distances(dict)
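The idea of a background distance distribution and an empirical p-value can be sketched in plain Python. This is a simplified illustration and not the method's actual procedure: all function names are hypothetical, only the cosine metric is shown, and the p-value is assumed to be the fraction of background distances that are less than or equal to the observed one.

```python
import math
import random
from bisect import bisect_right


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def sample_background(V, sample_pairs, seed=42):
    """Sorted distances between randomly sampled pairs of distinct rows."""
    rng = random.Random(seed)
    distances = []
    for _ in range(sample_pairs):
        i, j = rng.sample(range(len(V)), 2)  # two different row indices
        distances.append(cosine_distance(V[i], V[j]))
    return sorted(distances)


def empirical_pvalue(background, distance):
    """Fraction of background distances that are <= `distance`."""
    return bisect_right(background, distance) / len(background)


# Hypothetical toy matrix, one row per molecule.
toy_V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
background = sample_background(toy_V, sample_pairs=100)
```

Sampling pairs rather than computing all pairwise distances keeps the cost independent of the number of molecules, which is the role sample_pairs plays in the signature above.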
- consistency_check()
Check that signature is valid.
- copy_from(sign, key, chunk=None)
Copy dataset ‘key’ to current signature.
- Parameters:
sign (SignatureBase) – The source signature.
key (str) – The dataset to copy from.
- dataloader(batch_size=32, num_workers=1, shuffle=False, weak_shuffle=False, drop_last=False)
Return a pytorch DataLoader object for quick signature iteration.
- filter_h5_dataset(key, mask, axis, chunk_size=1000)
Apply a mask to a dataset, dropping columns or rows.
- Parameters:
key (str) – The H5 dataset to filter.
mask (np.array) – A one-dimensional boolean mask array. True values will be kept.
axis (int) – Whether the mask refers to rows (0) or columns (1).
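The effect of the mask and axis arguments can be illustrated on a small in-memory matrix. This sketch assumes nothing about how the H5 dataset is read or written in chunks: apply_mask is a hypothetical helper operating on a list of rows.

```python
def apply_mask(matrix, mask, axis):
    """Keep the rows (axis=0) or columns (axis=1) where `mask` is True.

    Hypothetical helper that mirrors the semantics described above.
    """
    if axis == 0:
        if len(mask) != len(matrix):
            raise ValueError("mask length must match the number of rows")
        return [row for row, keep in zip(matrix, mask) if keep]
    if axis == 1:
        if matrix and len(mask) != len(matrix[0]):
            raise ValueError("mask length must match the number of columns")
        return [[v for v, keep in zip(row, mask) if keep] for row in matrix]
    raise ValueError("axis must be 0 (rows) or 1 (columns)")


toy = [[1, 2, 3], [4, 5, 6]]
kept_rows = apply_mask(toy, [True, False], axis=0)
kept_cols = apply_mask(toy, [True, False, True], axis=1)
```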
- fit(cc_root=None, pairs=None, X=None, keys=None, features=None, data_file=None, key_type='inchikey', agg_method='average', do_triplets=False, sanitize=True, sanitizer_kwargs={}, **kwargs)[source]
Process the input data.
We produce a sign0 (full) and a sign0 (reference). Data are sorted (keys and features).
- Parameters:
cc_root (str) – Path to a CC instance. This is important to produce the triplets. If None specified, the same CC where the signature is present will be used (default=None).
pairs (array of tuples or file) – Data. If a file, it needs to be an H5 file with a dataset called ‘pairs’.
X (matrix or file) – Data. If a file, it needs to be an H5 file with datasets called ‘X’, ‘keys’ and maybe ‘features’.
keys (array) – Row names.
key_type (str) – Type of key. May be inchikey or smiles (default=’inchikey’).
features (array) – Column names (default=None).
data_file (str) – Input data file in H5 format; it should contain the required data as datasets.
do_triplets (boolean) – Draw triplets from the CC (default=False).
- fit_end(**kwargs)
Conclude fit method.
We compute background distances, run validations (including diagnostics) and finally mark the signature as ready.
- fit_hpc(*args, **kwargs)
Execute the fit method on the configured HPC.
- Parameters:
args (tuple) – the arguments for the fit method.
kwargs (dict) – arguments for the HPC method.
- func_hpc(func_name, *args, **kwargs)
Execute any method on the configured HPC.
- Parameters:
args (tuple) – the arguments for the method named by func_name.
kwargs (dict) – arguments for the HPC method.
- generator_fn(weak_shuffle=False, batch_size=None)
Return the generator function that we can query for batches.
- get_cc(cc_root=None)
Return the CC where the signature is present
- get_data(pairs, X, keys, features, data_file, key_type, agg_method)[source]
Get data in the right format.
Input data for ‘fit’ or ‘predict’ can come in two main formats: as a matrix or as pairs. If an ‘X’ matrix is passed, we also expect the row identifiers (‘keys’) and optionally the column identifiers (‘features’). If ‘pairs’ (dense representation) are passed, we expect combinations of key and feature that may or may not be associated with a value. The information can be bundled in an H5 file or provided as arguments. Basic checks are performed to ensure consistency of ‘keys’ and ‘features’.
- Parameters:
pairs (list) – list of pairs (key, feature) or triplets (key, feature, value)
X (array) – 2D matrix; rows correspond to molecules and columns correspond to features
keys (list) – list of string identifiers for molecules
features (list) – list of string identifiers for features
data_file (str) – path to an input file; it must contain at least the datasets ‘pairs’, or ‘X’ and ‘keys’
key_type (str) – the type of molecule identifier used
agg_method (str) – the aggregation method to use
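The relationship between the two input formats can be illustrated by converting pairs into a matrix. This is a minimal sketch under stated assumptions, not the library's conversion: pairs_to_matrix is a hypothetical helper, a pair without a value is assumed to mean 1.0, and when a (key, feature) combination repeats the last value wins (the real method aggregates according to agg_method).

```python
def pairs_to_matrix(pairs):
    """Convert (key, feature) or (key, feature, value) tuples to (X, keys, features).

    Hypothetical helper. Keys and features are sorted, matching the
    statement that data are sorted by keys and features.
    """
    keys = sorted({p[0] for p in pairs})
    features = sorted({p[1] for p in pairs})
    row = {k: i for i, k in enumerate(keys)}
    col = {f: j for j, f in enumerate(features)}
    X = [[0.0] * len(features) for _ in keys]
    for p in pairs:
        # Assumption: a pair without an explicit value counts as 1.0.
        value = float(p[2]) if len(p) == 3 else 1.0
        X[row[p[0]]][col[p[1]]] = value
    return X, keys, features


# Hypothetical molecule and feature identifiers.
toy_pairs = [("k2", "f1"), ("k1", "f2", 0.5), ("k1", "f1", 2.0)]
toy_X, toy_keys, toy_features = pairs_to_matrix(toy_pairs)
```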
- get_h5_dataset(h5_dataset_name, mask=None)
Get a specific dataset in the signature.
- get_intersection(sign)
Return the intersection between two signatures.
- get_molset(molset)
Return a signature from a different molset
- get_neig()
Return the neighbors signature, given a signature
- get_non_redundant_intersection(sign)
Return the non redundant intersection between two signatures.
(i.e. keys and vectors that are common to both signatures.) N.B: to maximize overlap it’s better to use signatures of type ‘full’. N.B: Near duplicates are found in the first signature.
- get_sign(sign_type)
Return the signature type for current dataset
- get_vectors(keys, include_nan=False, dataset_name='V', output_missing=False)
Get vectors for a list of keys, sorted by default.
- Parameters:
keys (list) – a list of strings; only the subset overlapping with the signature keys is considered.
include_nan (bool) – whether to include requested but absent molecule signatures as NaNs.
dataset_name (str) – return any dataset in the h5 which is organized by sorted keys.
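The effect of include_nan can be illustrated with a dictionary standing in for the signature. This is a sketch under assumptions: select_vectors is a hypothetical helper, and the (keys, vectors) return shape is chosen for illustration rather than taken from the method.

```python
import math


def select_vectors(signature, keys, include_nan=False):
    """Return (keys, vectors), sorted by key, for the requested keys.

    Hypothetical helper: `signature` is a non-empty {key: vector} dict.
    """
    dim = len(next(iter(signature.values())))
    requested = sorted(set(keys))
    if not include_nan:
        # Only the overlap with the signature keys is kept.
        requested = [k for k in requested if k in signature]
    # Requested but absent keys are filled with NaN vectors.
    vectors = [signature.get(k, [math.nan] * dim) for k in requested]
    return requested, vectors


toy_signature = {"a": [1.0, 2.0], "b": [3.0, 4.0]}
overlap_keys, overlap_vectors = select_vectors(toy_signature, ["b", "z", "a"])
all_keys, all_vectors = select_vectors(toy_signature, ["b", "z", "a"], include_nan=True)
```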
- get_vectors_lite(keys, chunk_size=2000, chunk_above=10000)
Iterate on signatures.
- static hstack_signatures(sign_list, destination, chunk_size=1000, aggregate_keys=None)
Merge horizontally a list of signatures.
- index(key)
Give the index according to the key.
- Parameters:
key (str) – the key to search index in the matrix.
- Returns:
Index in the matrix
- Return type:
index(int)
- property info_h5
Get the dictionary of dataset and shapes.
- make_filtered_copy(destination, mask, include_all=False, data_file=None, datasets=None, dst_datasets=None, chunk_size=1000, compression=None)
Make a copy, applying a filtering mask on rows.
- Parameters:
destination (str) – The destination file path.
mask (bool array) – A numpy mask array (e.g. result of np.isin).
include_all (bool) – Whether to copy other datasets (e.g. features, date, name…).
data_file (str) – A specific file to copy (by default, the signature h5).
- predict(pairs=None, X=None, keys=None, features=None, data_file=None, key_type=None, merge=False, merge_method='new', destination=None, chunk_size=10000)[source]
Given data, produce a sign0.
- Parameters:
pairs (array of tuples or file) – Data. If a file, it needs to be an H5 file with a dataset called ‘pairs’.
X (matrix or file) – Data. If a file, it needs to be an H5 file with datasets called ‘X’, ‘keys’ and maybe ‘features’.
keys (array) – Row names.
key_type (str) – Type of key. May be inchikey or smiles. If None specified, no filtering is applied (default=None).
features (array) – Column names (default=None).
merge (bool) – Merge queried data with the currently existing one.
merge_method (str) – Merging method to be applied when a repeated key is found. Can be ‘average’, ‘old’ or ‘new’ (default=new).
destination (str) – Path to the H5 file. If none specified, a (V, keys, features) tuple is returned.
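The three merge_method options can be illustrated with plain dictionaries. This is a sketch of the semantics as described above, not the method's code: merge_vectors is a hypothetical helper and the {key: vector} layout is an assumption.

```python
def merge_vectors(old, new, merge_method="new"):
    """Merge two {key: vector} dicts, resolving repeated keys.

    Hypothetical helper: 'new' keeps the incoming vector, 'old' keeps the
    existing one, and 'average' takes the element-wise mean of the two.
    """
    if merge_method not in ("average", "old", "new"):
        raise ValueError("merge_method must be 'average', 'old' or 'new'")
    merged = dict(old)
    for key, vec in new.items():
        if key not in merged or merge_method == "new":
            merged[key] = list(vec)
        elif merge_method == "average":
            merged[key] = [(a + b) / 2 for a, b in zip(merged[key], vec)]
        # merge_method == "old": the existing vector is left untouched.
    return merged


existing = {"a": [1.0, 3.0], "b": [0.0, 0.0]}
incoming = {"a": [3.0, 5.0], "c": [9.0, 9.0]}
averaged = merge_vectors(existing, incoming, merge_method="average")
```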
- process_features(features, n)[source]
Define feature names.
Process features. Give an arbitrary name to features if not provided. Returns the feature names as a numpy array of strings.
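Generating placeholder names when no features are given can be sketched as follows. The naming scheme is an assumption made for illustration (the names the method actually produces are not stated here), default_feature_names is a hypothetical helper, and a plain list is returned instead of a numpy array.

```python
def default_feature_names(features, n):
    """Return `features` as strings, or n generated names if features is None.

    Hypothetical helper with an assumed 'feature_<index>' naming scheme.
    """
    if features is None:
        # Zero-padding keeps lexicographic order equal to numeric order.
        width = len(str(max(n - 1, 0)))
        return [f"feature_{i:0{width}d}" for i in range(n)]
    if len(features) != n:
        raise ValueError("number of features does not match number of columns")
    return [str(f) for f in features]


generated = default_feature_names(None, 3)
```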
- process_keys(keys, key_type, sort=False)[source]
Given keys, and key type validate them.
If None is specified, then all keys are kept, and no validation is performed.
- Returns:
keys (list) – the processed InChIKeys.
raw_keys (list) – raw input keys.
indices (list) – index of valid keys.
- property qualified_name
Signature qualified name (e.g. ‘B1.001-sign1-full’).
- refresh()
Refresh all cached properties
- save_full(overwrite=False)
Map the non redundant signature in explicit full molset.
It generates a new signature in the full folders.
- Parameters:
overwrite (bool) – Overwrite existing (default=False).
- save_reference(cpu=4, overwrite=False)
Save a non redundant signature in reference molset.
It generates a new signature in the references folders.
- Parameters:
cpu (int) – Number of CPUs (default=4).
overwrite (bool) – Overwrite existing (default=False).
- property shape
Get the V matrix shape.
- property size
Get the V matrix size.
- subsample(n, seed=42)
Subsample from a signature without replacement.
- Parameters:
n (int) – Maximum number of samples.
- Returns:
V (matrix) – A (samples, features) matrix.
keys (array) – The list of keys.
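Seeded subsampling without replacement can be sketched with the standard random module. This is an illustration under assumptions, not the method's code: subsample_rows is a hypothetical helper, and returning the rows in their original order is a choice made here.

```python
import random


def subsample_rows(V, keys, n, seed=42):
    """Return at most n rows of V, with their keys, sampled without replacement.

    Hypothetical helper: rows are returned in their original order.
    """
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    n = min(n, len(keys))      # n is a maximum, not an exact requirement
    idxs = sorted(rng.sample(range(len(keys)), n))
    return [V[i] for i in idxs], [keys[i] for i in idxs]


toy_rows = [[float(i)] for i in range(10)]
toy_row_keys = [f"k{i}" for i in range(10)]
sub_V, sub_keys = subsample_rows(toy_rows, toy_row_keys, n=4)
```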
- to_csv(filename, smiles=None)
Write smiles to h5.
At the moment this is done by querying the Structure table for the inchikey-inchi mapping and then converting via Converter.
- validate(apply_mappings=True, metric='cosine', diagnostics=False)
Perform validations.
A validation file is an external resource that lists pairs of molecules and whether or not they share a given property (i.e. the file format is inchikey inchikey 0/1). Current tests are performed on MOA (Mode Of Action) and ATC (Anatomical Therapeutic Chemical), corresponding to the B1.001 and E1.001 datasets.
- Parameters:
apply_mappings (bool) – Whether to use mappings to compute validation. Signatures which have been redundancy-reduced (i.e. reference) have fewer molecules. The keys are molecules from the full signature and the values are molecules from the reference set.
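The validation file format and the role of the mappings can be sketched in a few lines. This is an illustration under assumptions: both helpers are hypothetical, fields are assumed to be separated by whitespace, and the identifiers are shortened placeholders rather than real InChIKeys.

```python
def parse_validation_lines(lines):
    """Parse 'inchikey inchikey 0/1' lines into (key, key, label) tuples."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        key_a, key_b, label = fields
        pairs.append((key_a, key_b, int(label)))
    return pairs


def map_to_reference(pairs, mappings):
    """Replace full-set keys by their reference-set representative.

    `mappings` is assumed to be a {full_key: reference_key} dict; keys
    without an entry are left unchanged.
    """
    return [(mappings.get(a, a), mappings.get(b, b), y) for a, b, y in pairs]


toy_lines = ["AAA BBB 1", "", "AAA CCC 0"]
toy_validation = parse_validation_lines(toy_lines)
mapped = map_to_reference(toy_validation, {"BBB": "AAA"})
```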
- static vstack_signatures(sign_list, destination, chunk_size=10000, vchunk_size=100)
Merge vertically a list of signatures.