chemicalchecker.core.sign3.sign3

class sign3(signature_path, dataset, **params)[source]

Bases: BaseSignature, DataSignature

Signature type 3 class.

Initialize a Signature.

Parameters:
  • signature_path (str) – The signature root directory.

  • dataset (Dataset) – chemicalchecker.database.Dataset object.

  • params() – Parameters. Expected keys are:
      • ‘sign2’ for learning based on sign2
      • ‘prior’ for learning prior in predictions

Methods

add_datasets

Add dataset to a H5

applicability_domain

apply_mappings

Map signature through mappings.

as_dataframe

available

This signature data is available.

background_distances

Return the background distances according to the selected metric.

check_mappings

chunk_iter

Iterator on chunks of data

chunker

Iterate on signatures.

clear

Remove everything from this signature.

clear_all

Remove everything from this signature for both reference and full.

close_hdf5

complete_sign2_universe

Completes the universe for extra molecules.

compute_distance_pvalues

Compute the distance pvalues according to the selected metric.

conformal_prediction

consistency_check

Check that signature is valid.

copy_from

Copy dataset 'key' to current signature.

dataloader

Return a pytorch DataLoader object for quick signature iteration.

diagnosis

export_features

filter_h5_dataset

Apply a mask to a dataset, dropping columns or rows.

fit

Fit signature 3 given a list of signature 2.

fit_end

Conclude fit method.

fit_hpc

Execute the fit method on the configured HPC.

func_hpc

Execute any method on the configured HPC.

generator_fn

Return the generator function that we can query for batches.

get_cc

Return the CC where the signature is present

get_h5_dataset

Get a specific dataset in the signature.

get_intersection

Return the intersection between two signatures.

get_molset

Return a signature from a different molset

get_neig

Return the neighbors signature, given a signature

get_non_redundant_intersection

Return the non redundant intersection between two signatures.

get_sign

Return the signature type for current dataset

get_status_stack

get_universe_inchikeys

get_vectors

Get vectors for a list of keys, sorted by default.

get_vectors_lite

Iterate on signatures.

h5_str

hstack_signatures

Merge horizontally a list of signatures.

index

Give the index according to the key.

is_fit

is_valid

make_filtered_copy

Make a copy, applying a filtering mask on rows.

mark_ready

open_hdf5

plot_validations

predict

Use the fitted models to predict.

predict_novelty

Model novelty score via LocalOutlierFactor (semi-supervised).

realistic_subsampling_fn

refresh

Refresh all cached properties

rerun_confidence

Rerun confidence training and estimation

save_confidence_distributions

save_full

Map the non redundant signature in explicit full molset.

save_reference

Save a non redundant signature in reference molset.

save_sign2_coverage

Create a file with all signatures 2 coverage of molecule in the CC.

save_sign2_matrix

Save matrix of pairs of horizontally stacked signature 2.

save_sign2_universe

Create a file with all signatures 2 for each molecule in the CC.

string_dtype

subsample

Subsample from a signature without replacement.

to_csv

Write smiles to h5.

train_SNN

Train the Siamese Neural Network model.

train_confidence

Train confidence and prior models.

train_confidence_model

train_prior_model

Train prior predictor.

train_prior_signature_model

Train prior predictor.

update_status

validate

Perform validations.

vstack_signatures

Merge vertically a list of signatures.

Attributes

info_h5

Get the dictionary of dataset and shapes.

qualified_name

Signature qualified name (e.g. ‘B1.001-sign1-full’).

shape

Get the V matrix shape.

sharedx

sharedx_trim

size

Get the V matrix size.

status

__getitem__(key)

Return the vector corresponding to the key.

The key can be a string (then it’s mapped through self.keys) or an int. Works fast with bisect, but should return None if the key is not in keys (ideally, keep a set to do this).
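The lookup described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name lookup and the sample keys and vectors are hypothetical, and it assumes the keys are stored sorted.

```python
from bisect import bisect_left


def lookup(keys, vectors, key):
    """Return the vector for `key`, or None when the key is absent.

    `keys` must be sorted. An int key is used directly as a row index.
    """
    if isinstance(key, int):
        return vectors[key]
    idx = bisect_left(keys, key)  # binary search on the sorted keys
    if idx < len(keys) and keys[idx] == key:
        return vectors[idx]
    return None  # key not present


# Hypothetical data for illustration only.
keys = ["AAA", "BBB", "CCC"]
vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

print(lookup(keys, vectors, "BBB"))
print(lookup(keys, vectors, "ZZZ"))
print(lookup(keys, vectors, 0))
```

Checking the position returned by bisect_left against the stored key is what lets the function return None for a missing key without keeping a separate set.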

__iter__()

By default iterate on signatures V.

__repr__()

String representing the signature.

add_datasets(data_dict, overwrite=True, chunks=None, compression=None)

Add dataset to a H5

apply_mappings(out_file, mappings=None)

Map signature through mappings.

available()

This signature data is available.

background_distances(metric, limit_inks=None, name=None)

Return the background distances according to the selected metric.

Parameters:

metric (str) – the metric name (cosine or euclidean).

chunk_iter(key, chunk_size, axis=0, chunk=False, bar=True)

Iterator on chunks of data

chunker(size=2000, n=None)

Iterate on signatures.
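Chunked iteration of this kind is commonly written as a generator of slices. The sketch below assumes that behaviour (the source does not state what chunker yields); the name iter_slices is hypothetical, and n stands for the number of rows to cover.

```python
def iter_slices(n, size=2000):
    """Yield slices covering range(n) in blocks of at most `size` rows."""
    for start in range(0, n, size):
        yield slice(start, min(start + size, n))


# Hypothetical data: five rows read two at a time.
rows = list(range(5))
chunks = [rows[s] for s in iter_slices(len(rows), size=2)]
print(chunks)
```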

clear()

Remove everything from this signature.

clear_all()

Remove everything from this signature for both reference and full.

complete_sign2_universe(sign2_self, sign2_universe, sign2_coverage, tmp_path=None, calc_ds_idx=[0, 1, 2, 3, 4], calc_ds_names=['A1.001', 'A2.001', 'A3.001', 'A4.001', 'A5.001'], ref_cc=None)[source]

Completes the universe for extra molecules.

Important if the dataset we are fitting is defined on molecules that largely do not overlap with CC molecules. In that case there is no orthogonal information to derive sign3. We should always have at least the chemistry (calculated) level available for all molecules of the dataset.

Parameters:
  • sign2_self (sign2) – Signature 2 of the current space.

  • sign2_universe (str) – Path to the union of all signatures 2 for all molecules in the CC universe. (~1M x 3200)

  • sign2_coverage (str) – Path to the coverage of all signatures 2 for all molecules in the CC universe. (~1M x 25)

  • tmp_path (str) – Temporary path where to save extra molecules’ signatures.

  • calc_spaces (list) – List of indexes for calculated spaces in the coverage matrix.

Returns:

Paths to the new sign2 universe and coverage file.

compute_distance_pvalues(bg_file, metric, sample_pairs=None, unflat=True, memory_safe=False, limit_inks=None)

Compute the distance pvalues according to the selected metric.

Parameters:
  • bg_file (str) – The file where to store the distances.

  • metric (str) – the metric name (cosine or euclidean).

  • sample_pairs (int) – Amount of pairs for distance calculation.

  • unflat (bool) – Remove flat regions whenever we observe them.

  • memory_safe (bool) – Computing distances is much faster if we can load the full matrix in memory.

  • limit_inks (list) – Compute distances only for this subset of inchikeys.

Returns:

Dictionary with distances and Pvalues

Return type:

bg_distances(dict)
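As a rough illustration of the idea of background distances and p-values, the sketch below samples random pairs of rows, computes their cosine distances, and derives an empirical p-value as the fraction of background distances that are at most a given distance. This is an assumption about the general technique, not the library's algorithm; all function names and the random matrix are hypothetical.

```python
import math
import random
from bisect import bisect_right


def cosine_distance(u, v):
    """Cosine distance, 1 - cos(u, v), for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def sample_background(matrix, sample_pairs, seed=42):
    """Return sorted distances between randomly sampled pairs of distinct rows."""
    rng = random.Random(seed)
    dists = []
    for _ in range(sample_pairs):
        i, j = rng.sample(range(len(matrix)), 2)
        dists.append(cosine_distance(matrix[i], matrix[j]))
    return sorted(dists)


def empirical_pvalue(bg, d):
    """Fraction of background distances that are <= d."""
    return bisect_right(bg, d) / len(bg)


# Hypothetical signature matrix: 20 molecules, 8 dimensions.
rng = random.Random(0)
matrix = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(20)]
bg = sample_background(matrix, sample_pairs=100)
print(empirical_pvalue(bg, bg[4]))
```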

consistency_check()

Check that signature is valid.

copy_from(sign, key, chunk=None)

Copy dataset ‘key’ to current signature.

Parameters:
  • sign (SignatureBase) – The source signature.

  • key (str) – The dataset to copy from.

dataloader(batch_size=32, num_workers=1, shuffle=False, weak_shuffle=False, drop_last=False)

Return a pytorch DataLoader object for quick signature iteration.

filter_h5_dataset(key, mask, axis, chunk_size=1000)

Apply a mask to a dataset, dropping columns or rows.

Parameters:
  • key (str) – The H5 dataset to filter.

  • mask (np.array) – A one-dimensional boolean mask array. True values will be kept.

  • axis (int) – Whether the mask refers to rows (0) or columns (1).
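The effect of mask and axis can be illustrated on a plain nested list. This is a sketch of the semantics described above, not the H5-backed implementation; the function name apply_mask and the sample matrix are hypothetical.

```python
def apply_mask(matrix, mask, axis):
    """Keep the rows (axis=0) or columns (axis=1) where `mask` is True."""
    if axis == 0:
        if len(mask) != len(matrix):
            raise ValueError("mask length must match the number of rows")
        return [row for row, keep in zip(matrix, mask) if keep]
    if axis == 1:
        if matrix and len(mask) != len(matrix[0]):
            raise ValueError("mask length must match the number of columns")
        return [[v for v, keep in zip(row, mask) if keep] for row in matrix]
    raise ValueError("axis must be 0 (rows) or 1 (columns)")


# Hypothetical 2 x 3 dataset.
matrix = [[1, 2, 3], [4, 5, 6]]
print(apply_mask(matrix, [True, False], axis=0))
print(apply_mask(matrix, [True, False, True], axis=1))
```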

fit(sign2_list=None, sign2_self=None, triplet_sign=None, sign2_universe=None, complete_universe='full', sign2_coverage=None, model_confidence=True, save_correlations=False, predict_novelty=False, update_preds=True, chunk_size=1000, suffix=None, plots_train=True, triplets_sampler=None, **kwargs)[source]

Fit signature 3 given a list of signature 2.

Parameters:
  • sign2_list (list) – List of signature 2 objects to learn from.

  • sign2_self (sign2) – Signature 2 of the current space.

  • triplet_sign (sign1) – Signature used to define anchor, positive and negative in triplets.

  • sign2_universe (str) – Path to the union of all signatures 2 for all molecules in the CC universe. (~1M x 3200)

  • complete_universe (str) – Add chemistry information for molecules not in the universe. ‘full’ uses all A* spaces, while ‘fast’ skips A2 (3D conformation), which is slow. False adds no signature to the universe. The signature above shows ‘full’ as the default.

  • sign2_coverage (str) – Path to the coverage of all signatures 2 for all molecules in the CC universe. (~1M x 25)

  • model_confidence (bool) – Whether to model confidence. That is based on standard deviation of prediction with dropout.

  • save_correlations (bool) – Whether to save the correlations (… tertile, max) for the given input dataset (result of the evaluation).

  • predict_novelty (bool) –

  • update_preds (bool) – Whether to write or update the sign3.h5

  • normalize_scores (bool) – Whether to normalize confidence scores.

  • chunk_size (int) – Chunk size when writing to sign3.h5

  • suffix (str) – Suffix of the generated model.

  • plots_train (bool) – Whether to plot the outcomes of the trained models (default: True). Applies to train_prior_model, train_prior_signature_model and train_confidence_model.
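The model_confidence parameter describes confidence as based on the standard deviation of predictions with dropout. The sketch below illustrates that general idea (often called Monte Carlo dropout) on a toy linear model standing in for the Siamese network. The function names, the dropout rate and the number of passes are hypothetical and are not part of the sign3 API.

```python
import random
import statistics


def predict_with_dropout(weights, x, rate, rng):
    """One stochastic forward pass of a linear model with input dropout."""
    scale = 1.0 / (1.0 - rate)  # inverted dropout keeps the expected output
    return sum(
        w * xi * scale
        for w, xi in zip(weights, x)
        if rng.random() >= rate  # drop each input with probability `rate`
    )


def dropout_confidence(weights, x, rate=0.2, passes=50, seed=0):
    """Return (mean, standard deviation) over repeated dropout passes."""
    rng = random.Random(seed)
    preds = [predict_with_dropout(weights, x, rate, rng) for _ in range(passes)]
    return statistics.mean(preds), statistics.stdev(preds)


# Hypothetical weights and input.
weights = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 3.0]
mean, std = dropout_confidence(weights, x)
print(mean, std)
```

A small standard deviation across passes means the prediction is stable under dropout, which can be read as higher confidence.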

fit_end(**kwargs)

Conclude fit method.

We compute background distances, run validations (including diagnostics) and finally mark the signature as ready.

fit_hpc(*args, **kwargs)

Execute the fit method on the configured HPC.

Parameters:
  • args (tuple) – the arguments for the fit method

  • kwargs (dict) – arguments for the HPC method.

func_hpc(func_name, *args, **kwargs)

Execute any method on the configured HPC.

Parameters:
  • args (tuple) – the arguments for the method being executed

  • kwargs (dict) – arguments for the HPC method.

generator_fn(weak_shuffle=False, batch_size=None)

Return the generator function that we can query for batches.

get_cc(cc_root=None)

Return the CC where the signature is present

get_h5_dataset(h5_dataset_name, mask=None)

Get a specific dataset in the signature.

get_intersection(sign)

Return the intersection between two signatures.

get_molset(molset)

Return a signature from a different molset

get_neig()

Return the neighbors signature, given a signature

get_non_redundant_intersection(sign)

Return the non redundant intersection between two signatures.

(i.e. keys and vectors that are common to both signatures.) N.B: to maximize overlap it’s better to use signatures of type ‘full’. N.B: Near duplicates are found in the first signature.

get_sign(sign_type)

Return the signature type for current dataset

get_vectors(keys, include_nan=False, dataset_name='V', output_missing=False)

Get vectors for a list of keys, sorted by default.

Parameters:
  • keys (list) – A list of strings; only the subset overlapping with the signature keys is considered.

  • include_nan (bool) – whether to include requested but absent molecule signatures as NaNs.

  • dataset_name (str) – return any dataset in the h5 which is organized by sorted keys.
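The behaviour of keys and include_nan described above can be sketched with a dictionary standing in for the signature. The function name select_vectors and the sample data are hypothetical, and the return shape (keys, vectors) is an assumption.

```python
import math


def select_vectors(requested, signature, include_nan=False):
    """Return (keys, vectors) for the requested keys, sorted by key.

    `signature` maps key -> vector. Absent keys are skipped unless
    `include_nan` is True, in which case they get a row of NaNs.
    """
    dim = len(next(iter(signature.values())))
    out_keys, out_vectors = [], []
    for key in sorted(set(requested)):
        if key in signature:
            out_keys.append(key)
            out_vectors.append(signature[key])
        elif include_nan:
            out_keys.append(key)
            out_vectors.append([math.nan] * dim)
    return out_keys, out_vectors


# Hypothetical signature with two molecules.
signature = {"AAA": [0.1, 0.2], "CCC": [0.5, 0.6]}
print(select_vectors(["CCC", "BBB", "AAA"], signature))
print(select_vectors(["CCC", "BBB", "AAA"], signature, include_nan=True))
```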

get_vectors_lite(keys, chunk_size=2000, chunk_above=10000)

Iterate on signatures.

static hstack_signatures(sign_list, destination, chunk_size=1000, aggregate_keys=None)

Merge horizontally a list of signatures.

index(key)

Give the index according to the key.

Parameters:

key (str) – the key to search index in the matrix.

Returns:

Index in the matrix

Return type:

index(int)

property info_h5

Get the dictionary of dataset and shapes.

make_filtered_copy(destination, mask, include_all=False, data_file=None, datasets=None, dst_datasets=None, chunk_size=1000, compression=None)

Make a copy, applying a filtering mask on rows.

Parameters:
  • destination (str) – The destination file path.

  • mask (bool array) – A numpy mask array (e.g. result of np.isin).

  • include_all (bool) – Whether to copy other datasets (e.g. features, date, name…).

  • data_file (str) – A specific file to copy (by default, the signature h5).

predict(src_file, dst_file, src_h5_ds='x_test', dst_h5_ds='V', model_path=None, chunk_size=1000)[source]

Use the fitted models to predict.

predict_novelty(retrain=False, update_sign3=True, cpu=4)[source]

Model novelty score via LocalOutlierFactor (semi-supervised).

Parameters:
  • retrain (bool) – Drop old model and train again. (default: False)

  • update_sign3 (bool) – Write novelty scores in h5. (default: True)

property qualified_name

Signature qualified name (e.g. ‘B1.001-sign1-full’).

refresh()

Refresh all cached properties

rerun_confidence(cc, suffix, train=True, update_sign=True, chunk_size=10000, sign2_universe=None, sign2_coverage=None, plots_train=True)[source]

Rerun confidence training and estimation

save_full(overwrite=False)

Map the non redundant signature in explicit full molset.

It generates a new signature in the full folders.

Parameters:

overwrite (bool) – Overwrite existing (default=False).

save_reference(cpu=4, overwrite=False)

Save a non redundant signature in reference molset.

It generates a new signature in the references folders.

Parameters:
  • cpu (int) – Number of CPUs (default=4).

  • overwrite (bool) – Overwrite existing (default=False).

static save_sign2_coverage(sign2_list, destination)[source]

Create a file with all signatures 2 coverage of molecule in the CC.

Parameters:
  • sign2_list (list) – List of signature 2 objects to learn from.

  • destination (str) – Path where the H5 is saved.

save_sign2_matrix(destination)[source]

Save matrix of pairs of horizontally stacked signature 2.

This is the matrix for training the signature 3. It is defined for all molecules for which we have a signature 2 in the current space. It’s a subset of the universe of stacked sign2 file.

Parameters:

destination (str) – Path where to save the matrix (HDF5 file).

static save_sign2_universe(sign2_list, destination)[source]

Create a file with all signatures 2 for each molecule in the CC.

Parameters:
  • sign2_list (list) – List of signature 2 objects to learn from.

  • destination (str) – Path where the H5 is saved.
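The relationship between the sign2 universe and the sign2 coverage can be pictured with a small in-memory sketch: each space contributes a fixed-width block of columns to the universe, and the coverage records which spaces have a signature for each molecule. The function name stack_universe, the NaN fill for missing blocks and the sample data are assumptions for illustration; the real methods write H5 files.

```python
import math


def stack_universe(spaces, dim):
    """Stack per-space signatures horizontally over the union of molecules.

    `spaces` is a list of dicts mapping molecule key -> vector of length `dim`.
    Returns (keys, universe, coverage), where universe has len(spaces) * dim
    columns and coverage has one 0/1 column per space.
    """
    keys = sorted(set().union(*spaces))
    universe, coverage = [], []
    for key in keys:
        row, cov = [], []
        for space in spaces:
            vec = space.get(key)
            cov.append(1 if vec is not None else 0)
            # Assumption: a missing block is filled with NaN.
            row.extend(vec if vec is not None else [math.nan] * dim)
        universe.append(row)
        coverage.append(cov)
    return keys, universe, coverage


# Hypothetical data: two spaces with 2-dimensional signatures.
spaces = [
    {"A": [1.0, 2.0], "B": [3.0, 4.0]},
    {"B": [5.0, 6.0]},
]
keys, universe, coverage = stack_universe(spaces, dim=2)
print(keys)
print(coverage)
```

With 25 spaces, the documented shapes (~1M x 3200 for the universe and ~1M x 25 for the coverage) would be consistent with 128 columns per space.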

property shape

Get the V matrix shape.

property size

Get the V matrix size.

subsample(n, seed=42)

Subsample from a signature without replacement.

Parameters:

n (int) – Maximum number of samples (default=10000).

Returns:

  • V (matrix) – A (samples, features) matrix.

  • keys (array) – The list of keys.
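Seeded subsampling without replacement can be sketched as below. This illustrates the idea only: the function name subsample_rows and the sample data are hypothetical, and returning the selected rows in their original order is an assumption.

```python
import random


def subsample_rows(keys, vectors, n, seed=42):
    """Pick at most `n` rows without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)
    n = min(n, len(keys))  # `n` is a maximum, not an exact count
    idx = sorted(rng.sample(range(len(keys)), n))
    return [vectors[i] for i in idx], [keys[i] for i in idx]


# Hypothetical signature with five molecules.
keys = ["K0", "K1", "K2", "K3", "K4"]
vectors = [[float(i), float(i) + 0.5] for i in range(5)]
V, sampled_keys = subsample_rows(keys, vectors, n=3)
print(sampled_keys)
```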

to_csv(filename, smiles=None)

Write smiles to h5.

At the moment this is done by querying the Structure table for the inchikey-to-inchi mapping and then converting via Converter.

train_SNN(params, reuse=True, suffix=None, evaluate=True, plots_train=True, triplets_sampler=None)[source]

Train the Siamese Neural Network model.

This method is used twice: first to evaluate the performance of the Siamese model, then to train the final model on the full set of data. Triplet files are generated and SNNs are trained. When evaluating, the confidence model is also saved.

Parameters:
  • params (dict) – Dictionary with algorithm parameters.

  • reuse (bool) – Whether to reuse intermediate files (e.g. the aggregated signature 2 matrix).

  • suffix (str) – A suffix for the Siamese model path (e.g. ‘sign3/models/siamese_<suffix>’).

  • evaluate (bool) – Whether we are performing a train-test split and evaluating the performances (N.B. this is required for complete confidence scores)

  • plots_train (bool) – plotting outcomes of train models.

train_confidence(siamese, suffix='eval', traintest_file=None, train_file=None, max_x=10000, max_neig=50000, p_self=0.0, plots_train=True)[source]

Train confidence and prior models.

train_prior_model(siamese, train_x, splits, save_path, max_x=10000, n_samples=5, p_self=0.0, plots=True)[source]

Train prior predictor.

train_prior_signature_model(siamese, train_x, splits, save_path, max_x=10000, n_samples=5, p_self=0.0, plots=True)[source]

Train prior predictor.

validate(apply_mappings=True, metric='cosine', diagnostics=False)

Perform validations.

A validation file is an external resource presenting pairs of molecules and whether or not they share a given property (i.e. the file format is inchikey inchikey 0/1). Current tests are performed on MOA (Mode Of Action) and ATC (Anatomical Therapeutic Chemical), corresponding to the B1.001 and E1.001 datasets.

Parameters:

apply_mappings (bool) – Whether to use mappings to compute validation. Signatures which have been redundancy-reduced (i.e. reference) have fewer molecules. The keys are molecules from the full signature and the values are molecules from the reference set.
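The validation file format described above (inchikey inchikey 0/1) could be read as in the sketch below. It assumes whitespace-separated columns, which the source does not specify; the function name parse_validation_pairs, the placeholder keys and the labels are made up for illustration.

```python
def parse_validation_pairs(lines):
    """Parse lines of the form 'inchikey inchikey 0/1' into tuples."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        first, second, label = line.split()
        if label not in ("0", "1"):
            raise ValueError(f"label must be 0 or 1, got {label!r}")
        pairs.append((first, second, int(label)))
    return pairs


# Placeholder keys and labels, not real validation data.
lines = [
    "INCHIKEY_A INCHIKEY_B 1",
    "INCHIKEY_A INCHIKEY_C 0",
    "",
]
pairs = parse_validation_pairs(lines)
print(pairs)
```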

static vstack_signatures(sign_list, destination, chunk_size=10000, vchunk_size=100)

Merge vertically a list of signatures.