chemicalchecker.core.sign0.sign0

class sign0(signature_path, dataset, **params)[source]

Bases: BaseSignature, DataSignature

Signature type 0 class.

Initialize a Signature.

Parameters:

signature_path (str) – the path to the signature directory.
dataset (str) – NS, e.g. A1.001; here it only serves as the ‘name’ record of the h5 file.

Methods

`add_datasets`	Add dataset to a H5
`apply_mappings`	Map signature through mappings.
`as_dataframe`
`available`	This signature data is available.
`background_distances`	Return the background distances according to the selected metric.
`check_mappings`
`chunk_iter`	Iterator on chunks of data
`chunker`	Iterate on signatures.
`clear`	Remove everything from this signature.
`clear_all`	Remove everything from this signature for both reference and full.
`close_hdf5`
`compute_distance_pvalues`	Compute the distance pvalues according to the selected metric.
`consistency_check`	Check that signature is valid.
`copy_from`	Copy dataset 'key' to current signature.
`dataloader`	Return a pytorch DataLoader object for quick signature iteration.
`diagnosis`
`export_features`
`filter_h5_dataset`	Apply a mask to a dataset, dropping columns or rows.
`fit`	Process the input data.
`fit_end`	Conclude fit method.
`fit_hpc`	Execute the fit method on the configured HPC.
`func_hpc`	Execute any method on the configured HPC.
`generator_fn`	Return the generator function that we can query for batches.
`get_cc`	Return the CC where the signature is present
`get_data`	Get data in the right format.
`get_h5_dataset`	Get a specific dataset in the signature.
`get_intersection`	Return the intersection between two signatures.
`get_molset`	Return a signature from a different molset
`get_neig`	Return the neighbors signature, given a signature
`get_non_redundant_intersection`	Return the non redundant intersection between two signatures.
`get_sign`	Return the signature type for current dataset
`get_status_stack`
`get_vectors`	Get vectors for a list of keys, sorted by default.
`get_vectors_lite`	Iterate on signatures.
`h5_str`
`hstack_signatures`	Merge horizontally a list of signatures.
`index`	Give the index according to the key.
`is_fit`
`is_valid`
`make_filtered_copy`	Make a copy, applying a filtering mask on rows.
`mark_ready`
`open_hdf5`
`predict`	Given data, produce a sign0.
`process_features`	Define feature names.
`process_keys`	Given keys, and key type validate them.
`refesh`
`refresh`	Refresh all cached properties
`restrict_to_universe`	Restricts the keys contained in the universe.
`save_full`	Map the non redundant signature in explicit full molset.
`save_reference`	Save a non redundant signature in reference molset.
`sort`
`string_dtype`
`subsample`	Subsample from a signature without replacement.
`to_csv`	Write smiles to h5.
`update_status`
`validate`	Perform validations.
`vstack_signatures`	Merge vertically a list of signatures.

Attributes

`info_h5`	Get the dictionary of dataset and shapes.
`qualified_name`	Signature qualified name (e.g. ‘B1.001-sign1-full’).
`shape`	Get the V matrix shape.
`size`	Get the V matrix size.
`status`

__getitem__(key)

Return the vector corresponding to the key.

The key can be a string (then it’s mapped through self.keys) or an int. Works fast with bisect, but should return None if the key is not in keys (ideally, keep a set to do this).
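The lookup described above can be sketched with the standard `bisect` module. This is a minimal illustration, assuming the keys are kept sorted; `find_index` is a hypothetical helper, not part of the class.

```python
from bisect import bisect_left


def find_index(sorted_keys, key):
    """Return the row index of `key` in `sorted_keys`, or None if absent."""
    pos = bisect_left(sorted_keys, key)
    # bisect_left only gives an insertion point, so confirm an exact match.
    if pos < len(sorted_keys) and sorted_keys[pos] == key:
        return pos
    return None


# Placeholder keys, not real InChIKeys.
keys = sorted(["KEY-C", "KEY-A", "KEY-B"])
hit = find_index(keys, "KEY-B")
miss = find_index(keys, "KEY-Z")
```

The explicit equality check is what turns a missing key into None instead of the index of a neighbouring row.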

__iter__(): By default iterate on signatures V.

__repr__(): String representig the signature.

add_datasets(data_dict, overwrite=True, chunks=None, compression=None): Add dataset to a H5

apply_mappings(out_file, mappings=None): Map signature through mappings.

available(): This signature data is available.

background_distances(metric, limit_inks=None, name=None)

Return the background distances according to the selected metric.

Parameters:

metric (str) – the metric name (cosine or euclidean).

chunk_iter(key, chunk_size, axis=0, chunk=False, bar=True): Iterator on chunks of data

chunker(size=2000, n=None): Iterate on signatures.
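As an illustration of chunked iteration, the sketch below yields consecutive row slices of a fixed size. It is one plausible reading of `chunker`, not confirmed by this page; `chunk_slices` and its `total` argument are hypothetical.

```python
def chunk_slices(total, size=2000):
    """Yield slice objects covering range(total) in blocks of `size`."""
    for start in range(0, total, size):
        # The last slice is clipped so it never runs past `total`.
        yield slice(start, min(start + size, total))


rows = list(range(5))
chunks = [rows[s] for s in chunk_slices(len(rows), size=2)]
```

Yielding slices rather than data keeps the iterator cheap and lets the caller decide which dataset to read with each slice.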

clear(): Remove everything from this signature.

clear_all(): Remove everything from this signature for both referene and full.

compute_distance_pvalues(bg_file, metric, sample_pairs=None, unflat=True, memory_safe=False, limit_inks=None)

Compute the distance pvalues according to the selected metric.

Parameters:

bg_file (str) – The file where to store the distances.
metric (str) – the metric name (cosine or euclidean).
sample_pairs (int) – Amount of pairs for distance calculation.
unflat (bool) – Remove flat regions whenever we observe them.
memory_safe (bool) – Computing distances is much faster if we can load the full matrix in memory.
limit_inks (list) – Compute distances only for this subset of inchikeys.

Returns:

Dictionary with distances and Pvalues

Return type:

bg_distances(dict)
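To make the idea of background distances concrete, here is a minimal pure-Python sketch. It assumes the background is built by sampling random pairs of rows and reading distance cutoffs off the empirical distribution; the dictionary keys `"distance"` and `"pvalue"` and the helper names are hypothetical, and the library's actual procedure may differ.

```python
import math
import random


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def sample_background(vectors, pvalues, sample_pairs=200, seed=42):
    """Empirical distance cutoffs for each requested p-value."""
    rng = random.Random(seed)
    dists = []
    for _ in range(sample_pairs):
        i, j = rng.sample(range(len(vectors)), 2)  # two distinct rows
        dists.append(cosine_distance(vectors[i], vectors[j]))
    dists.sort()
    # Cutoff for p: the distance below which about a fraction p of pairs fall.
    last = len(dists) - 1
    cutoffs = [dists[min(int(p * len(dists)), last)] for p in pvalues]
    return {"distance": cutoffs, "pvalue": list(pvalues)}


data_rng = random.Random(0)
vectors = [[data_rng.random() + 0.1 for _ in range(4)] for _ in range(20)]
bg = sample_background(vectors, [0.01, 0.05, 0.5])
```

A smaller observed distance then maps to a smaller p-value, because fewer random pairs are that close.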

consistency_check(): Check that signature is valid.

copy_from(sign, key, chunk=None)

Copy dataset ‘key’ to current signature.

Parameters:

sign (SignatureBase) – The source signature.
key (str) – The dataset to copy from.

dataloader(batch_size=32, num_workers=1, shuffle=False, weak_shuffle=False, drop_last=False): Return a pytorch DataLoader object for quick signature iteration.

filter_h5_dataset(key, mask, axis, chunk_size=1000)

Apply a mask to a dataset, dropping columns or rows.

Parameters:

key (str) – The H5 dataset to filter.
mask (np.array) – A one-dimensional boolean mask array. True values will be kept.
axis (int) – Whether the mask refers to rows (0) or columns (1).
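The mask semantics can be illustrated on a plain nested list, without touching an H5 file. This is a sketch under that simplification; `filter_matrix` is a hypothetical stand-in that processes rows in chunks, as the `chunk_size` argument suggests.

```python
def filter_matrix(rows, mask, axis, chunk_size=1000):
    """Keep the rows (axis=0) or columns (axis=1) where `mask` is True."""
    if axis == 0:
        if len(mask) != len(rows):
            raise ValueError("mask length must match the number of rows")
        kept = []
        # Walk the rows in chunks, as one would when reading from disk.
        for start in range(0, len(rows), chunk_size):
            stop = start + chunk_size
            pairs = zip(rows[start:stop], mask[start:stop])
            kept.extend(row for row, keep in pairs if keep)
        return kept
    if axis == 1:
        if rows and len(mask) != len(rows[0]):
            raise ValueError("mask length must match the number of columns")
        return [[v for v, keep in zip(row, mask) if keep] for row in rows]
    raise ValueError("axis must be 0 (rows) or 1 (columns)")


matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
mask = [True, False, True]
kept_rows = filter_matrix(matrix, mask, axis=0, chunk_size=2)
kept_cols = filter_matrix(matrix, mask, axis=1)
```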

fit(cc_root=None, pairs=None, X=None, keys=None, features=None, data_file=None, key_type='inchikey', agg_method='average', do_triplets=False, sanitize=True, sanitizer_kwargs={}, **kwargs)[source]

Process the input data.

We produce a sign0 (full) and a sign0 (reference). Data are sorted (keys and features).

Parameters:

cc_root (str) – Path to a CC instance. This is important to produce the triplets. If None specified, the same CC where the signature is present will be used (default=None).
pairs (array of tuples or file) – Data. If a file, it must be an H5 file with a dataset called ‘pairs’.
X (matrix or file) – Data. If a file, it must be an H5 file with datasets called ‘X’, ‘keys’ and, optionally, ‘features’.
keys (array) – Row names.
key_type (str) – Type of key. May be inchikey or smiles (default=’inchikey’).
features (array) – Column names (default=None).
data_file (str) – Input data file in the form of H5 file and it should contain the required data in datasets.
do_triplets (boolean) – Draw triplets from the CC (default=False).

fit_end(**kwargs)

Conclude fit method.

We compute background distances, run validations (including diagnostics) and finally mark the signature as ready.

fit_hpc(*args, **kwargs)

Execute the fit method on the configured HPC.

Parameters:

args (tuple) – the arguments for the fit method.
kwargs (dict) – arguments for the HPC method.

func_hpc(func_name, *args, **kwargs)

Execute any method on the configured HPC.

Parameters:

args (tuple) – the arguments for the method being executed.
kwargs (dict) – arguments for the HPC method.

generator_fn(weak_shuffle=False, batch_size=None): Return the generator function that we can query for batches.

get_cc(cc_root=None): Return the CC where the signature is present

get_data(pairs, X, keys, features, data_file, key_type, agg_method)[source]

Get data in the right format.

Input data for ‘fit’ or ‘predict’ can come in two main formats: as a matrix or as pairs. If an ‘X’ matrix is passed, we also expect the row identifiers (‘keys’) and, optionally, the column identifiers (‘features’). If ‘pairs’ (dense representation) are passed, we expect combinations of key and feature that may or may not be associated with a value. The information can be bundled in an H5 file or provided as arguments. Basic checks are performed to ensure consistency of ‘keys’ and ‘features’.

Parameters:

pairs (list) – list of pair (key, feature) or (key, feature, value)
X (array) – 2D matrix; rows correspond to molecules and columns correspond to features
keys (list) – list of string identifier for molecules
features (list) – list of string identifier for features
data_file (str) – path to an input file; it must contain at least the datasets ‘pairs’, or ‘X’ and ‘keys’
key_type (str) – the type of molecule identifier used
agg_method (str) – the aggregation method to use
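The relationship between the two input formats can be sketched as a conversion from pairs to a matrix with sorted keys and features. This is an illustration only: `pairs_to_matrix` is hypothetical, a bare (key, feature) pair is assumed to mean a value of 1, and averaging repeated entries is just one possible reading of agg_method=’average’.

```python
from collections import defaultdict


def pairs_to_matrix(pairs):
    """Turn (key, feature[, value]) tuples into (X, keys, features)."""
    cells = defaultdict(list)
    for pair in pairs:
        key, feature = pair[0], pair[1]
        # Assumption: a pair without a value marks presence (value 1).
        value = pair[2] if len(pair) == 3 else 1
        cells[(key, feature)].append(value)
    keys = sorted({k for k, _ in cells})
    features = sorted({f for _, f in cells})
    row = {k: i for i, k in enumerate(keys)}
    col = {f: j for j, f in enumerate(features)}
    X = [[0.0] * len(features) for _ in keys]
    for (k, f), values in cells.items():
        # Repeated (key, feature) entries are averaged.
        X[row[k]][col[f]] = sum(values) / len(values)
    return X, keys, features


# Placeholder keys and feature names.
pairs = [("K2", "f1", 2.0), ("K1", "f2", 1.0), ("K2", "f1", 4.0), ("K1", "f1")]
X, keys, features = pairs_to_matrix(pairs)
```

Cells that never appear in the pairs stay at 0.0, which is why the pair form is convenient for data where most key and feature combinations are absent.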

get_h5_dataset(h5_dataset_name, mask=None): Get a specific dataset in the signature.

get_intersection(sign): Return the intersection between two signatures.

get_molset(molset): Return a signature from a different molset

get_neig(): Return the neighbors signature, given a signature

get_non_redundant_intersection(sign)

Return the non redundant intersection between two signatures.

(i.e. keys and vectors that are common to both signatures.) N.B: to maximize overlap it’s better to use signatures of type ‘full’. N.B: Near duplicates are found in the first signature.

get_sign(sign_type): Return the signature type for current dataset

get_vectors(keys, include_nan=False, dataset_name='V', output_missing=False)

Get vectors for a list of keys, sorted by default.

Parameters:

keys (list) – a list of strings; only the subset overlapping with the signature keys is considered.
include_nan (bool) – whether to include requested but absent molecule signatures as NaNs.
dataset_name (str) – return any dataset in the h5 which is organized by sorted keys.
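The effect of include_nan can be illustrated on in-memory lists. This is a sketch of the documented behaviour only; `select_vectors` and its first two arguments are hypothetical, and the real method reads from the H5 file.

```python
import math


def select_vectors(sig_keys, sig_vectors, query_keys, include_nan=False):
    """Return (keys, vectors) for the queried keys, sorted by key."""
    index = {k: i for i, k in enumerate(sig_keys)}
    width = len(sig_vectors[0])
    out_keys, out_vectors = [], []
    for key in sorted(set(query_keys)):
        if key in index:
            out_keys.append(key)
            out_vectors.append(sig_vectors[index[key]])
        elif include_nan:
            # Absent molecules are kept as rows filled with NaN.
            out_keys.append(key)
            out_vectors.append([math.nan] * width)
    return out_keys, out_vectors


sig_keys = ["A", "B", "C"]  # placeholder keys
sig_vectors = [[1.0], [2.0], [3.0]]
found = select_vectors(sig_keys, sig_vectors, ["C", "X", "A"])
padded = select_vectors(sig_keys, sig_vectors, ["C", "X", "A"], include_nan=True)
```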

get_vectors_lite(keys, chunk_size=2000, chunk_above=10000): Iterate on signatures.

static hstack_signatures(sign_list, destination, chunk_size=1000, aggregate_keys=None): Merge horizontally a list of signatures.

index(key)

Give the index according to the key.

Parameters:

key (str) – the key to search index in the matrix.

Returns:

Index in the matrix

Return type:

index(int)

property info_h5: Get the dictionary of dataset and shapes.

make_filtered_copy(destination, mask, include_all=False, data_file=None, datasets=None, dst_datasets=None, chunk_size=1000, compression=None)

Make a copy, applying a filtering mask on rows.

Parameters:

destination (str) – The destination file path.
mask (bool array) – A numpy mask array (e.g. result of np.isin).
include_all (bool) – Whether to copy other datasets (e.g. features, date, name…).
data_file (str) – A specific file to copy (by default, the signature h5 file).

predict(pairs=None, X=None, keys=None, features=None, data_file=None, key_type=None, merge=False, merge_method='new', destination=None, chunk_size=10000)[source]

Given data, produce a sign0.

Parameters:

pairs (array of tuples or file) – Data. If a file, it must be an H5 file with a dataset called ‘pairs’.
X (matrix or file) – Data. If a file, it must be an H5 file with datasets called ‘X’, ‘keys’ and, optionally, ‘features’.
keys (array) – Row names.
key_type (str) – Type of key. May be inchikey or smiles. If None specified, no filtering is applied (default=None).
features (array) – Column names (default=None).
merge (bool) – Merge queried data with the currently existing one.
merge_method (str) – Merging method to be applied when a repeated key is found. Can be ‘average’, ‘old’ or ‘new’ (default=new).
destination (str) – Path to the H5 file. If none specified, a (V, keys, features) tuple is returned.
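The three merge_method options can be illustrated with dictionaries mapping keys to vectors. This is a sketch of the documented semantics only; `merge_vectors` is a hypothetical helper, and the real method works on the signature's H5 data.

```python
def merge_vectors(old, new, merge_method="new"):
    """Merge two {key: vector} dicts, resolving repeated keys."""
    if merge_method not in ("average", "old", "new"):
        raise ValueError("merge_method must be 'average', 'old' or 'new'")
    merged = dict(old)
    for key, vec in new.items():
        if key not in merged or merge_method == "new":
            merged[key] = vec  # unseen key, or the new data wins
        elif merge_method == "average":
            merged[key] = [(a + b) / 2 for a, b in zip(merged[key], vec)]
        # merge_method == "old": keep the existing vector untouched
    return merged


old = {"A": [1.0, 1.0], "B": [2.0, 2.0]}  # placeholder keys
new = {"B": [4.0, 0.0], "C": [5.0, 5.0]}
keep_new = merge_vectors(old, new, "new")
keep_old = merge_vectors(old, new, "old")
averaged = merge_vectors(old, new, "average")
```

Only keys present in both inputs are affected by the choice of method; keys unique to either side are always carried over.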

process_features(features, n)[source]

Define feature names.

Process features. Give an arbitrary name to features if not provided. Returns the feature names as a numpy array of strings.
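A minimal sketch of the naming fallback described above. The `feature_` prefix and zero padding are assumptions made for illustration, and the sketch returns a plain list where the method returns a numpy array of strings.

```python
def default_feature_names(features, n):
    """Return `n` feature names, inventing them when none are given."""
    if features is None:
        # Hypothetical naming scheme: zero-padded running index.
        width = len(str(n))
        return ["feature_" + str(i).zfill(width) for i in range(n)]
    if len(features) != n:
        raise ValueError("expected %d feature names, got %d" % (n, len(features)))
    return [str(f) for f in features]


generated = default_feature_names(None, 3)
provided = default_feature_names(["b", "a"], 2)
```

Zero padding keeps generated names in the same order when sorted as strings, which matters if features are sorted later.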

process_keys(keys, key_type, sort=False)[source]

Given keys, and key type validate them.

If None is specified, then all keys are kept, and no validation is performed.

Returns:

keys (list) – the processed InChIKeys.
raw_keys (list) – raw input keys.
indices (list) – index of valid keys.

property qualified_name: Signature qualified name (e.g. ‘B1.001-sign1-full’).

refresh(): Refresh all cached properties

restrict_to_universe()[source]: Restricts the keys contained in the universe.

save_full(overwrite=False)

Map the non redundant signature in explicit full molset.

It generates a new signature in the full folders.

Parameters:

overwrite (bool) – Overwrite existing (default=False).

save_reference(cpu=4, overwrite=False)

Save a non redundant signature in reference molset.

It generates a new signature in the references folders.

Parameters:

cpu (int) – Number of CPUs (default=4).
overwrite (bool) – Overwrite existing (default=False).

property shape: Get the V matrix shape.

property size: Get the V matrix size.

subsample(n, seed=42)

Subsample from a signature without replacement.

Parameters:

n (int) – Maximum number of samples (default=10000).

Returns:

V (matrix) – A (samples, features) matrix.
keys (array) – The list of keys.
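Seeded sampling without replacement can be sketched with the standard `random` module. `subsample_rows` is a hypothetical stand-in that works on in-memory lists; returning the sampled rows in their original order is an assumption.

```python
import random


def subsample_rows(keys, vectors, n, seed=42):
    """Pick at most `n` rows without replacement, reproducibly."""
    if n >= len(keys):
        return list(vectors), list(keys)
    rng = random.Random(seed)
    # sample() never repeats an index; sorting preserves the row order.
    idx = sorted(rng.sample(range(len(keys)), n))
    return [vectors[i] for i in idx], [keys[i] for i in idx]


keys = ["K%02d" % i for i in range(10)]  # placeholder keys
vectors = [[float(i)] for i in range(10)]
sub_vectors, sub_keys = subsample_rows(keys, vectors, 3)
```

Because n is a maximum, asking for more rows than exist simply returns everything.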

to_csv(filename, smiles=None)

Write smiles to h5.

At the moment this is done by querying the Structure table for the inchikey-to-inchi mapping and then converting via Converter.

validate(apply_mappings=True, metric='cosine', diagnostics=False)

Perform validations.

A validation file is an external resource that lists pairs of molecules and whether or not they share a given property (i.e. the file format is inchikey inchikey 0/1). Current tests are performed on MOA (Mode Of Action) and ATC (Anatomical Therapeutic Chemical), corresponding to the B1.001 and E1.001 datasets.

Parameters:

apply_mappings (bool) – Whether to use mappings to compute validation. Signatures which have been redundancy-reduced (i.e. reference) have fewer molecules. The keys are molecules from the full signature and the values are molecules from the reference set.
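The validation file format described above (inchikey inchikey 0/1) can be read with a few lines of Python. This is a sketch that assumes whitespace-separated columns, which the page does not state; `parse_validation_lines` is hypothetical and the keys are placeholders.

```python
def parse_validation_lines(lines):
    """Parse 'key_a key_b label' lines into (key_a, key_b, int) tuples."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key_a, key_b, label = line.split()
        if label not in ("0", "1"):
            raise ValueError("label must be 0 or 1, got %r" % label)
        pairs.append((key_a, key_b, int(label)))
    return pairs


# Placeholder keys and made-up labels, for illustration only.
lines = ["KEY_A KEY_B 1", "", "KEY_A KEY_C 0"]
validation_pairs = parse_validation_lines(lines)
positives = [p for p in validation_pairs if p[2] == 1]
```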

static vstack_signatures(sign_list, destination, chunk_size=10000, vchunk_size=100): Merge vertically a list of signatures.