chemicalchecker.core.sign4.sign4

class sign4(signature_path, dataset, **params)[source]

Bases: BaseSignature, DataSignature

Signature type 4 class.

Initialize a Signature.

Parameters:
  • signature_path (str) – The signature root directory.

  • dataset (Dataset) – chemicalchecker.database.Dataset object.

  • params – Parameters; expected keys are:

    • ‘sign0_params’ for learning based on sign0 (Morgan Fingerprint)

    • ‘sign0_conf_params’ for learning confidences based on MFP
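
Example (a minimal sketch of direct instantiation; how the Dataset record is obtained and the paths shown are assumptions, and in practice signatures are usually retrieved through the ChemicalChecker entry point):

    from chemicalchecker.core.sign4 import sign4
    from chemicalchecker.database import Dataset

    # hypothetical lookup of the A1.001 dataset record
    dataset = Dataset.get('A1.001')
    # illustrative signature root directory
    s4 = sign4('/path/to/cc/full/A/A1/A1.001/sign4', dataset,
               sign0_params={},        # learning parameters based on sign0 (MFP)
               sign0_conf_params={})   # parameters for confidence learning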

Methods

add_datasets

Add datasets to an H5 file.

apply_mappings

Map signature through mappings.

as_dataframe

available

This signature data is available.

background_distances

Return the background distances according to the selected metric.

check_mappings

chunk_iter

Iterator over chunks of data.

chunker

Iterate on signatures.

clear

Remove everything from this signature.

clear_all

Remove everything from this signature for both reference and full.

close_hdf5

compute_distance_pvalues

Compute the distance pvalues according to the selected metric.

consistency_check

Check that signature is valid.

copy_from

Copy dataset 'key' to current signature.

dataloader

Return a PyTorch DataLoader object for quick signature iteration.

diagnosis

export_features

filter_h5_dataset

Apply a mask to a dataset, dropping columns or rows.

fit

Fit signature 4 from Morgan Fingerprint.

fit_end

Conclude fit method.

fit_hpc

Execute the fit method on the configured HPC.

func_hpc

Execute any given method on the configured HPC.

generator_fn

Return the generator function that we can query for batches.

get_applicability_predict_fn

get_cc

Return the CC where the signature is present

get_h5_dataset

Get a specific dataset in the signature.

get_intersection

Return the intersection between two signatures.

get_molset

Return a signature from a different molset

get_neig

Return the neighbors signature, given a signature

get_non_redundant_intersection

Return the non redundant intersection between two signatures.

get_predict_fn

get_sign

Return the signature type for current dataset

get_status_stack

get_vectors

Get vectors for a list of keys, sorted by default.

get_vectors_lite

Iterate on signatures.

h5_str

hstack_signatures

Merge horizontally a list of signatures.

index

Return the index corresponding to the key.

is_fit

is_valid

learn_sign0

Learn the signature 3 from sign0.

learn_sign0_conf

Learn the signature 3 applicability from sign0.

make_filtered_copy

Make a copy by applying a filtering mask on rows.

mark_ready

open_hdf5

predict

Use the fitted models to predict.

predict_from_sign0

predict_from_smiles

predict_from_string

Given molecule strings, generate MFP and predict sign3.

refresh

Refresh all cached properties

save_full

Map the non redundant signature in explicit full molset.

save_reference

Save a non redundant signature in reference molset.

string_dtype

subsample

Subsample from a signature without replacement.

to_csv

Write smiles to h5.

update_status

validate

Perform validations.

vstack_signatures

Merge vertically a list of signatures.

Attributes

info_h5

Get the dictionary of dataset and shapes.

qualified_name

Signature qualified name (e.g. ‘B1.001-sign1-full’).

shape

Get the V matrix shape.

shared_keys

sign0_vectors

sign3_vectors

size

Get the V matrix size.

status

__getitem__(key)

Return the vector corresponding to the key.

The key can be a string (then it’s mapped through self.keys) or an int. Works fast with bisect, but should return None if the key is not in keys (ideally, keep a set to do this).
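
Example (a sketch; assumes s4 is a fitted sign4 instance and the InChIKey shown is illustrative):

    vector = s4['RZVAJINKPMORJF-UHFFFAOYSA-N']   # access by key string
    first = s4[0]                                # access by integer position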

__iter__()

By default iterate on signatures V.

__repr__()

String representing the signature.

add_datasets(data_dict, overwrite=True, chunks=None, compression=None)

Add datasets to an H5 file.

apply_mappings(out_file, mappings=None)

Map signature through mappings.

available()

This signature data is available.

background_distances(metric, limit_inks=None, name=None)

Return the background distances according to the selected metric.

Parameters:

metric (str) – the metric name (cosine or euclidean).
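
Example (a sketch; assumes s4 is a fitted sign4 instance):

    bg = s4.background_distances('cosine')   # background distances for the cosine metric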

chunk_iter(key, chunk_size, axis=0, chunk=False, bar=True)

Iterator over chunks of data.
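
Example (a sketch; assumes s4 is an available sign4 instance and that ‘V’ is the dataset to iterate):

    for chunk in s4.chunk_iter('V', 1000):   # row-wise chunks of 1000 signatures
        pass                                 # process each chunk here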

chunker(size=2000, n=None)

Iterate on signatures.

clear()

Remove everything from this signature.

clear_all()

Remove everything from this signature for both reference and full.

compute_distance_pvalues(bg_file, metric, sample_pairs=None, unflat=True, memory_safe=False, limit_inks=None)

Compute the distance pvalues according to the selected metric.

Parameters:
  • bg_file (str) – The file where to store the distances.

  • metric (str) – the metric name (cosine or euclidean).

  • sample_pairs (int) – Amount of pairs for distance calculation.

  • unflat (bool) – Remove flat regions whenever we observe them.

  • memory_safe (bool) – Computing distances is much faster if we can load the full matrix in memory.

  • limit_inks (list) – Compute distances only for this subset of InChIKeys.

Returns:

Dictionary with distances and p-values

Return type:

bg_distances(dict)
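
Example (a sketch; the output path and number of sampled pairs are illustrative):

    bg = s4.compute_distance_pvalues('/tmp/bg_cosine.h5', 'cosine',
                                     sample_pairs=100000)
    # bg is a dictionary holding the sampled distances and their p-values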

consistency_check()

Check that signature is valid.

copy_from(sign, key, chunk=None)

Copy dataset ‘key’ to current signature.

Parameters:
  • sign (SignatureBase) – The source signature.

  • key (str) – The dataset to copy from.

dataloader(batch_size=32, num_workers=1, shuffle=False, weak_shuffle=False, drop_last=False)

Return a PyTorch DataLoader object for quick signature iteration.
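
Example (a sketch; assumes PyTorch is installed and s4 is a sign4 instance):

    loader = s4.dataloader(batch_size=128, num_workers=2)
    for batch in loader:
        pass   # consume batches of signature vectors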

filter_h5_dataset(key, mask, axis, chunk_size=1000)

Apply a mask to a dataset, dropping columns or rows.

Parameters:
  • key (str) – The H5 dataset to filter.

  • mask (np.array) – A bool one-dimensional mask array. True values will be kept.

  • axis (int) – Whether the mask refers to rows (0) or columns (1).

fit(sign0=None, sign3=None, suffix=None, include_confidence=True, only_confidence=False, **kwargs)[source]

Fit signature 4 from Morgan Fingerprint.

This method fits a model that uses Morgan fingerprints as features to predict signature 3. In the future, other featurization approaches can be tested.

Parameters:
  • sign0 (str) – Path to the MFP file (i.e. sign0 of A1.001).

  • include_confidence (bool) – Whether to include confidence score in regression problem.

  • only_confidence (bool) – Whether to only train an additional regressor exclusively devoted to confidence.
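
Example (a sketch; the sign0 file path is illustrative and remaining keyword arguments are left at their defaults):

    s4.fit(sign0='/path/to/cc/full/A/A1/A1.001/sign0/sign0.h5',
           include_confidence=True)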

fit_end(**kwargs)

Conclude fit method.

We compute background distances, run validations (including diagnostics) and finally mark the signature as ready.

fit_hpc(*args, **kwargs)

Execute the fit method on the configured HPC.

Parameters:
  • args (tuple) – the arguments of the fit method.

  • kwargs (dict) – arguments for the HPC method.

func_hpc(func_name, *args, **kwargs)

Execute any given method on the configured HPC.

Parameters:
  • args (tuple) – the arguments of the invoked method.

  • kwargs (dict) – arguments for the HPC method.
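
Example (a sketch; the keyword arguments accepted by the HPC layer, e.g. memory and cpu, depend on the local configuration and are assumptions here):

    s4.func_hpc('validate', memory=16, cpu=4)   # submit validate() to the HPC queue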

generator_fn(weak_shuffle=False, batch_size=None)

Return the generator function that we can query for batches.

get_cc(cc_root=None)

Return the CC where the signature is present

get_h5_dataset(h5_dataset_name, mask=None)

Get a specific dataset in the signature.

get_intersection(sign)

Return the intersection between two signatures.

get_molset(molset)

Return a signature from a different molset

get_neig()

Return the neighbors signature, given a signature

get_non_redundant_intersection(sign)

Return the non redundant intersection between two signatures.

(i.e. keys and vectors that are common to both signatures.) N.B: to maximize overlap it’s better to use signatures of type ‘full’. N.B: Near duplicates are found in the first signature.

get_sign(sign_type)

Return the signature type for current dataset

get_vectors(keys, include_nan=False, dataset_name='V', output_missing=False)

Get vectors for a list of keys, sorted by default.

Parameters:
  • keys (list) – a list of strings; only the subset overlapping with the signature keys is considered.

  • include_nan (bool) – whether to include requested but absent molecule signatures as NaNs.

  • dataset_name (str) – return any dataset in the h5 which is organized by sorted keys.
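
Example (a sketch; the InChIKeys are illustrative and the (keys, matrix) return layout is an assumption):

    inks, V = s4.get_vectors(['RZVAJINKPMORJF-UHFFFAOYSA-N',
                              'BSYNRYMUTXBXSQ-UHFFFAOYSA-N'],
                             include_nan=True)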

get_vectors_lite(keys, chunk_size=2000, chunk_above=10000)

Iterate on signatures.

static hstack_signatures(sign_list, destination, chunk_size=1000, aggregate_keys=None)

Merge horizontally a list of signatures.

index(key)

Return the index corresponding to the key.

Parameters:

key (str) – the key to search index in the matrix.

Returns:

Index in the matrix

Return type:

index(int)
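
Example (a sketch; the key is illustrative):

    idx = s4.index('RZVAJINKPMORJF-UHFFFAOYSA-N')
    row = s4[idx]   # the vector stored at that row of the matrix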

property info_h5

Get the dictionary of dataset and shapes.

learn_sign0(sign0, sign3, params, suffix=None, evaluate=True)[source]

Learn the signature 3 from sign0.

This method is used twice: first to evaluate the performance of the model, and second to train the final model on the full set of data.

Parameters:
  • sign0 (list) – Signature 0 object to learn from.

  • params (dict) – Dictionary with algorithm parameters.

  • reuse (bool) – Whether to reuse intermediate files (e.g. the aggregated signature 3 matrix).

  • suffix (str) – A suffix for the siamese model path (e.g. ‘sign3/models/smiles_<suffix>’).

  • evaluate (bool) – Whether we are performing a train-test split and evaluating the performance (N.B. this is required for complete confidence scores).

  • include_confidence (bool) – whether to include confidences.

learn_sign0_conf(sign0, sign3, params, reuse=True, suffix=None, evaluate=True)[source]

Learn the signature 3 applicability from sign0.

This method is used twice: first to evaluate the performance of the model, and second to train the final model on the full set of data.

Parameters:
  • sign0 (list) – Signature 0 object to learn from.

  • reuse (bool) – Whether to reuse intermediate files (e.g. the aggregated signature 3 matrix).

  • suffix (str) – A suffix for the siamese model path (e.g. ‘sign3/models/smiles_<suffix>’).

  • evaluate (bool) – Whether we are performing a train-test split and evaluating the performance (N.B. this is required for complete confidence scores).

  • include_confidence (bool) – whether to include confidences.

make_filtered_copy(destination, mask, include_all=False, data_file=None, datasets=None, dst_datasets=None, chunk_size=1000, compression=None)

Make a copy by applying a filtering mask on rows.

Parameters:
  • destination (str) – The destination file path.

  • mask (bool array) – A numpy mask array (e.g. result of np.isin).

  • include_all (bool) – Whether to copy other datasets (e.g. features, date, name…).

  • data_file (str) – A specific file to copy (by default the signature h5).

abstract predict()

Use the fitted models to predict.

predict_from_string(molecules, dest_file, keytype='SMILES', chunk_size=1000, predict_fn=None, keys=None, components=128, applicability=True, y_order=None)[source]

Given molecule strings, generate MFP and predict sign3.

Parameters:
  • molecules (list) – A list of molecules strings.

  • dest_file (str) – File where to save the predictions.

  • keytype (str) – Whether to interpret molecules as InChI or SMILES.

Returns:

The predicted signatures as a DataSignature object.

Return type:

pred_s3(DataSignature)
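
Example (a sketch; the SMILES strings and destination path are illustrative):

    preds = s4.predict_from_string(['CCO', 'c1ccccc1O'],
                                   '/tmp/sign3_predictions.h5',
                                   keytype='SMILES')
    print(preds.shape)   # DataSignature wrapping the predicted matrix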

property qualified_name

Signature qualified name (e.g. ‘B1.001-sign1-full’).

refresh()

Refresh all cached properties

save_full(overwrite=False)

Map the non redundant signature in explicit full molset.

It generates a new signature in the full folders.

Parameters:

overwrite (bool) – Overwrite existing (default=False).

save_reference(cpu=4, overwrite=False)

Save a non redundant signature in reference molset.

It generates a new signature in the references folders.

Parameters:
  • cpu (int) – Number of CPUs (default=4).

  • overwrite (bool) – Overwrite existing (default=False).

property shape

Get the V matrix shape.

property size

Get the V matrix size.

subsample(n, seed=42)

Subsample from a signature without replacement.

Parameters:

n (int) – Maximum number of samples (default=10000).

Returns:

V (matrix): a (samples, features) matrix. keys (array): the list of keys.

Return type:

V(matrix)
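
Example (a sketch; the (V, keys) return order is an assumption based on the description above):

    V, keys = s4.subsample(10000)   # at most 10000 signatures, without replacement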

to_csv(filename, smiles=None)

Write smiles to h5.

At the moment this is done by querying the Structure table for the InChIKey-to-InChI mapping and then converting via Converter.

validate(apply_mappings=True, metric='cosine', diagnostics=False)

Perform validations.

A validation file is an external resource presenting pairs of molecules and whether or not they share a given property (i.e. the file format is: inchikey inchikey 0/1). Current tests are performed on MOA (Mode of Action) and ATC (Anatomical Therapeutic Chemical), corresponding to the B1.001 and E1.001 datasets.

Parameters:

apply_mappings (bool) – Whether to use mappings to compute the validation. Signatures which have been redundancy-reduced (i.e. reference) have fewer molecules. The keys are molecules from the full signature and the values are molecules from the reference set.
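
Example (a sketch; runs the MOA/ATC validations using cosine distances):

    s4.validate(apply_mappings=True, metric='cosine')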

static vstack_signatures(sign_list, destination, chunk_size=10000, vchunk_size=100)

Merge vertically a list of signatures.