chemicalchecker.core.char.char

class char(signature_path, dataset, **params)[source]

Bases: BaseSignature, DataSignature

Enrichment signature class

Initialize a visualization class.

Parameters:

signature_path (str) – The path to the signature directory.
dataset (object) – The dataset object with all info related.
metric (str) – The metric used for the SAFE algorithm: euclidean or cosine (default: cosine).

Methods

`SAFE`
`add_attr`	Add dataset to an H5
`add_datasets`	Add dataset to an H5
`apply_mappings`	Map signature through mappings.
`as_dataframe`
`available`	This signature data is available.
`background_distances`	Return the background distances according to the selected metric.
`check_mappings`
`chunk_iter`	Iterator on chunks of data
`chunker`	Iterate on signatures.
`clear`	Remove everything from this signature.
`clear_all`	Remove everything from this signature for both reference and full.
`close_hdf5`
`cluster_analysis`
`compute_distance_pvalues`	Compute the distance pvalues according to the selected metric.
`consistency_check`	Check that signature is valid.
`copy_from`	Copy dataset 'key' to current signature.
`dataloader`	Return a pytorch DataLoader object for quick signature iteration.
`diagnosis`
`export_features`
`filter_h5_dataset`	Apply a mask to a dataset, dropping columns or rows.
`find_thr`	Find the score threshold that best recapitulates the signature 0 data.
`fit`	Fit the visualization class.
`fit_end`	Conclude fit method.
`fit_hpc`	Execute the fit method on the configured HPC.
`func_hpc`	Execute any method on the configured HPC.
`generator_fn`	Return the generator function that we can query for batches.
`get_cc`	Return the CC where the signature is present
`get_coords`
`get_dict`	Get mappings between feature tags and their descriptions.
`get_h5_attr`	Get a specific attribute in the signature.
`get_h5_dataset`	Get a specific dataset in the signature.
`get_intersection`	Return the intersection between two signatures.
`get_molset`	Return a signature from a different molset
`get_neig`	Return the neighbors signature, given a signature
`get_neighborhoods`
`get_non_redundant_intersection`	Return the non redundant intersection between two signatures.
`get_sign`	Return the signature type for current dataset
`get_status_stack`
`get_vectors`	Get vectors for a list of keys, sorted by default.
`get_vectors_lite`	Iterate on signatures.
`h5_str`
`hstack_signatures`	Merge horizontally a list of signatures.
`index`	Give the index according to the key.
`is_fit`
`is_valid`
`make_filtered_copy`	Make a copy applying a filtering mask on rows.
`mark_ready`
`molecule_boulder`
`open_hdf5`
`plot_neighborhoods`
`plot_projection`	Plot and store a projection of the whole space (used as a background for the visualizations).
`predict`	Returns the inferred classes for a given molecule.
`predict_feat`	Visualize a feature.
`project_scores`	Use the CC projection's module to compute the tSNE coordinates of the enrichment scores.
`refresh`	Refresh all cached properties
`run_SAFE`	Parallelizes the enrichment analysis making use of the HPC.
`save_full`	Map the non redundant signature in explicit full molset.
`save_reference`	Save a non redundant signature in reference molset.
`string_dtype`
`subsample`	Subsample from a signature without replacement.
`to_csv`	Write smiles to h5.
`update_status`
`validate`	Perform validations.
`vstack_signatures`	Merge vertically a list of signatures.

Attributes

`info_h5`	Get the dictionary of dataset and shapes.
`qualified_name`	Signature qualified name (e.g. ‘B1.001-sign1-full’).
`shape`	Get the V matrix shape.
`size`	Get the V matrix size.
`status`

__getitem__(key)

Return the vector corresponding to the key.

The key can be a string (in which case it is mapped through self.keys) or an int. The lookup is fast because it uses bisect, but it should return None if the key is not in keys (ideally, keep a set to do this).
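A bisect-based lookup of this kind can be sketched in plain Python. This is only an illustration under assumptions, not the library's implementation: `lookup`, `keys` and `vectors` are hypothetical names, and plain lists stand in for the sorted signature keys and the V matrix.

```python
from bisect import bisect_left


def lookup(keys, vectors, key):
    """Return the row for `key` (str or int), or None if a str key is absent.

    `keys` must be sorted and `vectors[i]` must be the row for `keys[i]`.
    """
    if isinstance(key, int):
        # Integer keys are treated as direct row positions.
        return vectors[key]
    idx = bisect_left(keys, key)
    # bisect_left only gives an insertion point, so confirm an exact match.
    if idx < len(keys) and keys[idx] == key:
        return vectors[idx]
    return None


keys = ["AAA", "CCC", "EEE"]
vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
```

The explicit equality check after `bisect_left` is what makes a missing key return None instead of the row of its nearest neighbour in sort order.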

__iter__(): By default iterate on signatures V.

__repr__(): String representig the signature.

add_attr(data_dict, overwrite=True)[source]: Add dataset to an H5

add_datasets(data_dict, overwrite=True, chunks=None, compression=None): Add dataset to an H5

apply_mappings(out_file, mappings=None): Map signature through mappings.

available(): This signature data is available.

background_distances(metric, limit_inks=None, name=None)

Return the background distances according to the selected metric.

Parameters: metric (str) – the metric name (cosine or euclidean).

chunk_iter(key, chunk_size, axis=0, chunk=False, bar=True): Iterator on chunks of data

chunker(size=2000, n=None): Iterate on signatures.
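Chunked iteration over rows can be sketched as follows. The source only says that `chunker` iterates on signatures, so the behaviour shown here (yielding `slice` objects over row positions) is an assumption, and `row_chunks` is a hypothetical helper name.

```python
def row_chunks(n, size=2000):
    """Yield consecutive slices covering range(n) in blocks of at most `size`."""
    for start in range(0, n, size):
        yield slice(start, min(start + size, n))


rows = list(range(5))
# Each slice can index any row-aligned container (list, array, H5 dataset).
blocks = [rows[chunk] for chunk in row_chunks(len(rows), size=2)]
```

Yielding slices rather than data keeps the iterator cheap and lets the caller decide which dataset to read for each block.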

clear(): Remove everything from this signature.

clear_all(): Remove everything from this signature for both referene and full.

compute_distance_pvalues(bg_file, metric, sample_pairs=None, unflat=True, memory_safe=False, limit_inks=None)

Compute the distance pvalues according to the selected metric.

Parameters:

bg_file (str) – The file where to store the distances.
metric (str) – the metric name (cosine or euclidean).
sample_pairs (int) – Amount of pairs for distance calculation.
unflat (bool) – Remove flat regions whenever we observe them.
memory_safe (bool) – Computing distances is much faster if we can load the full matrix in memory.
limit_inks (list) – Compute distances only for this subset of inchikeys.

Returns:

Dictionary with distances and Pvalues

Return type:

bg_distances(dict)
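The idea of a background distance distribution can be illustrated with a small, self-contained sketch: sample random pairs of rows, compute their distances, and read off the distance below which a given fraction of pairs fall. This is not the library's implementation. The function names, the `"distance"` and `"pvalue"` dictionary keys, and the quantile rule are all assumptions, and options such as `unflat`, `memory_safe` and `limit_inks` are not modelled.

```python
import math
import random


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def background_thresholds(matrix, pvalues, sample_pairs=1000, seed=0):
    """Map each p-value to the distance below which that fraction of pairs fall."""
    rng = random.Random(seed)
    dists = []
    for _ in range(sample_pairs):
        i, j = rng.sample(range(len(matrix)), 2)  # two distinct rows
        dists.append(cosine_distance(matrix[i], matrix[j]))
    dists.sort()
    thresholds = []
    for p in pvalues:
        # Empirical quantile: index into the sorted sampled distances.
        rank = min(int(p * len(dists)), len(dists) - 1)
        thresholds.append(dists[rank])
    return {"distance": thresholds, "pvalue": list(pvalues)}


data_rng = random.Random(1)
matrix = [[data_rng.gauss(0, 1) for _ in range(8)] for _ in range(50)]
bg = background_thresholds(matrix, pvalues=[0.01, 0.05, 0.5])
```

With such a table, a pair of molecules whose distance is below the threshold for p = 0.01 is closer than roughly 99% of the sampled random pairs.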

consistency_check(): Check that signature is valid.

copy_from(sign, key, chunk=None)

Copy dataset ‘key’ to current signature.

Parameters:

sign (SignatureBase) – The source signature.
key (str) – The dataset to copy from.

dataloader(batch_size=32, num_workers=1, shuffle=False, weak_shuffle=False, drop_last=False): Return a pytorch DataLoader object for quick signature iteration.

filter_h5_dataset(key, mask, axis, chunk_size=1000)

Apply a mask to a dataset, dropping columns or rows.

Parameters:

key (str) – The H5 dataset to filter.
mask (np.array) – A one-dimensional boolean mask array. True values will be kept.
axis (int) – Whether the mask refers to rows (0) or columns (1).
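The masking semantics (True values are kept, axis 0 for rows and axis 1 for columns) can be sketched on an in-memory matrix. `apply_mask` is a hypothetical helper that uses nested lists instead of an H5 dataset and ignores chunking.

```python
def apply_mask(matrix, mask, axis):
    """Keep the rows (axis=0) or columns (axis=1) where `mask` is True."""
    if axis == 0:
        if len(mask) != len(matrix):
            raise ValueError("mask length must match the number of rows")
        return [row for row, keep in zip(matrix, mask) if keep]
    if axis == 1:
        if matrix and len(mask) != len(matrix[0]):
            raise ValueError("mask length must match the number of columns")
        return [[v for v, keep in zip(row, mask) if keep] for row in matrix]
    raise ValueError("axis must be 0 (rows) or 1 (columns)")


matrix = [[1, 2, 3], [4, 5, 6]]
kept_rows = apply_mask(matrix, [True, False], axis=0)
kept_cols = apply_mask(matrix, [True, False, True], axis=1)
```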

find_thr(stat='fscore', n_max_samples=10000)[source]: Find the score threshold that best recapitulates the signature 0 data.

fit(safe, sign0=None, sign1=None, back_dist_pvalue=0.01)[source]

Fit the visualization class. A SAFE analysis is performed over the signature 4 of those molecules with available signature 0. This is followed by a tSNE of the resulting scores. Finally, the projected molecules are clustered by HDBSCAN.

Parameters:

safe (bool) – A boolean indicating whether to perform the SAFE analysis or not. This is useful in case the instance has already been fitted and we want to change the downstream analysis without recomputing the SAFE results.
sign0 (object) – Signature 0 of the dataset of interest.
sign1 (object) – Signature 1 of the dataset of interest.
back_dist_pvalue (float) – Distance p-value threshold for a molecule to be considered as close when searching for neighbors in the SAFE analysis (default: 0.01).

fit_end(**kwargs)

Conclude fit method.

We compute background distances, run validations (including diagnostics) and finally mark the signature as ready.

fit_hpc(*args, **kwargs)

Execute the fit method on the configured HPC.

Parameters:

args (tuple) – the arguments of the fit method.
kwargs (dict) – arguments for the HPC method.

func_hpc(func_name, *args, **kwargs)

Execute any method on the configured HPC.

Parameters:

args (tuple) – the arguments of the method to execute.
kwargs (dict) – arguments for the HPC method.

generator_fn(weak_shuffle=False, batch_size=None): Return the generator function that we can query for batches.

get_cc(cc_root=None): Return the CC where the signature is present

get_dict()[source]: Get mappings between feature tags and their descriptions.

get_h5_attr(h5_dataset_name)[source]: Get a specific attribute in the signature.

get_h5_dataset(h5_dataset_name, mask=None): Get a specific dataset in the signature.

get_intersection(sign): Return the intersection between two signatures.

get_molset(molset): Return a signature from a different molset

get_neig(): Return the neighbors signature, given a signature

get_non_redundant_intersection(sign)

Return the non redundant intersection between two signatures.

(i.e. keys and vectors that are common to both signatures.) N.B: to maximize overlap it’s better to use signatures of type ‘full’. N.B: Near duplicates are found in the first signature.
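The key-intersection step shared by both intersection methods can be sketched as below. This is a simplified illustration: `intersect` is a hypothetical helper, plain lists stand in for signatures, and the near-duplicate removal that makes the result non redundant is not modelled.

```python
def intersect(keys_a, rows_a, keys_b, rows_b):
    """Return the shared keys (sorted) with the matching rows of each input."""
    index_a = {key: i for i, key in enumerate(keys_a)}
    index_b = {key: i for i, key in enumerate(keys_b)}
    shared = sorted(index_a.keys() & index_b.keys())
    return (
        shared,
        [rows_a[index_a[key]] for key in shared],
        [rows_b[index_b[key]] for key in shared],
    )


shared, va, vb = intersect(
    ["AAA", "BBB", "CCC"], [[1], [2], [3]],
    ["BBB", "CCC", "DDD"], [[20], [30], [40]],
)
```

The two row lists come back aligned on the same sorted keys, which is what makes them directly comparable.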

get_sign(sign_type): Return the signature type for current dataset

get_vectors(keys, include_nan=False, dataset_name='V', output_missing=False)

Get vectors for a list of keys, sorted by default.

Parameters:

keys (list) – a list of strings; only the subset overlapping the signature keys is considered.
include_nan (bool) – whether to include requested but absent molecule signatures as NaNs.
dataset_name (str) – return any dataset in the h5 which is organized by sorted keys.
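The effect of `include_nan` can be illustrated with a minimal sketch. `select_vectors` is a hypothetical stand-in that works on plain lists. The return shape (a pair of keys and rows) and the handling of duplicate keys are assumptions, not documented behaviour.

```python
import math


def select_vectors(sig_keys, sig_rows, keys, include_nan=False):
    """Return (keys, rows) for the requested keys, sorted by key.

    Keys absent from the signature are dropped, or kept as all-NaN rows
    when `include_nan` is True.
    """
    index = {key: i for i, key in enumerate(sig_keys)}
    width = len(sig_rows[0]) if sig_rows else 0
    out_keys, out_rows = [], []
    for key in sorted(set(keys)):
        if key in index:
            out_keys.append(key)
            out_rows.append(sig_rows[index[key]])
        elif include_nan:
            out_keys.append(key)
            out_rows.append([math.nan] * width)
    return out_keys, out_rows


sig_keys = ["AAA", "BBB", "CCC"]
sig_rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
found = select_vectors(sig_keys, sig_rows, ["CCC", "ZZZ", "AAA"])
padded = select_vectors(sig_keys, sig_rows, ["CCC", "ZZZ", "AAA"], include_nan=True)
```

In this sketch, `found` contains only the two keys present in the signature, while `padded` also lists the missing key with a row of NaNs, so the output stays aligned with the sorted request.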

get_vectors_lite(keys, chunk_size=2000, chunk_above=10000): Iterate on signatures.

static hstack_signatures(sign_list, destination, chunk_size=1000, aggregate_keys=None): Merge horizontally a list of signatures.

index(key)

Give the index according to the key.

Parameters: key (str) – the key whose index in the matrix is looked up.
Returns: Index in the matrix
Return type: index(int)

property info_h5: Get the dictionary of dataset and shapes.

make_filtered_copy(destination, mask, include_all=False, data_file=None, datasets=None, dst_datasets=None, chunk_size=1000, compression=None)

Make a copy applying a filtering mask on rows.

Parameters:

destination (str) – The destination file path.
mask (bool array) – A numpy mask array (e.g. result of np.isin).
include_all (bool) – Whether to copy other datasets (e.g. features, date, name…).
data_file (str) – A specific file to copy (by default, the signature h5).

plot_projection()[source]: Plot and store a projection of the whole space (used as a background for the visualizations).

predict(query, kde=True, scatter=False)[source]

Returns the inferred classes for a given molecule. It also plots the approximate location of the molecule in the space and a KDE representation of the inferred classes.

Parameters:

query (str) – InChI key, name or SMILES of the molecule of interest.
keytype (str) – Type of query. Any of ‘inchikey’, ‘name’ or ‘smiles’.

predict_feat(features, coords=None, mode=None)[source]

Visualize a feature. Plots the tSNE projection of the molecules with available signature 0 and a KDE (Kernel Density Estimate) of the molecules having the feature of interest on top.

Parameters: features (str or list) – feature(s) of interest.

project_scores()[source]: Use the CC projection’s module to compute the tSNE coordinates of the enrichment scores.

property qualified_name: Signature qualified name (e.g. ‘B1.001-sign1-full’).

refresh(): Refresh all cached properties

run_SAFE(elements=None)[source]

Parallelizes the enrichment analysis making use of the HPC.

Parameters:

elements (list) – A list containing the column indexes of the features of the experimental data that we want to analyze. Only useful for re-running failed jobs. By default, all the features are analyzed.

save_full(overwrite=False)

Map the non redundant signature in explicit full molset.

It generates a new signature in the full folders.

Parameters: overwrite (bool) – Overwrite existing (default=False).

save_reference(cpu=4, overwrite=False)

Save a non redundant signature in reference molset.

It generates a new signature in the references folders.

Parameters:

cpu (int) – Number of CPUs (default=4).
overwrite (bool) – Overwrite existing (default=False).

property shape: Get the V matrix shape.

property size: Get the V matrix size.

subsample(n, seed=42)

Subsample from a signature without replacement.

Parameters:

n (int) – Maximum number of samples (default=10000).

Returns:

V (matrix) – A (samples, features) matrix.
keys (array) – The list of keys.
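Seeded sampling without replacement can be sketched with the standard library. `subsample_rows` is a hypothetical helper operating on plain lists. Keeping the picked rows in their original order and returning everything when `n` exceeds the number of rows are assumptions, not documented behaviour.

```python
import random


def subsample_rows(keys, rows, n, seed=42):
    """Pick at most `n` distinct rows, reproducibly for a given seed."""
    total = len(keys)
    if n >= total:
        return list(rows), list(keys)
    rng = random.Random(seed)
    # random.sample draws without replacement; sort to keep the original order.
    picked = sorted(rng.sample(range(total), n))
    return [rows[i] for i in picked], [keys[i] for i in picked]


keys = ["K%02d" % i for i in range(10)]
rows = [[float(i)] for i in range(10)]
sample_rows, sample_keys = subsample_rows(keys, rows, 3)
```

Using a dedicated `random.Random(seed)` instance keeps the draw reproducible without touching the global random state.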

to_csv(filename, smiles=None)

Write smiles to h5.

At the moment this is done by querying the Structure table for the inchikey to inchi mapping and then converting via Converter.

validate(apply_mappings=True, metric='cosine', diagnostics=False)

Perform validations.

A validation file is an external resource that lists pairs of molecules and whether or not they share a given property (i.e. the file format is inchikey inchikey 0/1). Current tests are performed on MOA (Mode Of Action) and ATC (Anatomical Therapeutic Chemical), corresponding to the B1.001 and E1.001 datasets.

Parameters: apply_mappings (bool) – Whether to use mappings to compute validation. Signatures which have been redundancy-reduced (i.e. reference) have fewer molecules. The keys are molecules from the full signature and the values are molecules from the reference set.
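The validation file format and the role of the mappings can be sketched as follows. This is an illustration under assumptions: the fields are taken to be whitespace-separated, dropping pairs with unmapped molecules is a choice made for this sketch, and `parse_validation_lines`, `map_pairs` and the example keys are hypothetical.

```python
def parse_validation_lines(lines):
    """Parse 'inchikey inchikey 0/1' lines into (key, key, label) tuples."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        left, right, label = line.split()
        pairs.append((left, right, int(label)))
    return pairs


def map_pairs(pairs, mappings):
    """Translate full-set keys to reference-set keys via `mappings`.

    `mappings` has full-signature molecules as keys and reference-set
    molecules as values; pairs with an unmapped molecule are dropped.
    """
    mapped = []
    for left, right, label in pairs:
        if left in mappings and right in mappings:
            mapped.append((mappings[left], mappings[right], label))
    return mapped


lines = ["KEY-A KEY-B 1", "KEY-A KEY-C 0", "", "KEY-B KEY-X 1"]
mappings = {"KEY-A": "KEY-A", "KEY-B": "KEY-A", "KEY-C": "KEY-C"}
pairs = parse_validation_lines(lines)
mapped = map_pairs(pairs, mappings)
```

Here `KEY-B` maps to its reference representative `KEY-A`, so a pair defined on the full set can still be evaluated against a redundancy-reduced signature.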

static vstack_signatures(sign_list, destination, chunk_size=10000, vchunk_size=100): Merge vertically a list of signatures.