chemicalchecker.core.diagnostics.Diagnosis

class Diagnosis(sign, ref_cc=None, ref_cctype='sign0', save=True, plot=True, overwrite=False, load=True, n=10000, seed=42, cpu=4)[source]

Bases: object

Diagnosis class.

Initialize a Diagnosis instance.

Parameters:
  • ref_cc (ChemicalChecker) – A CC instance used as reference.

  • sign (CC signature) – The CC signature object to be diagnosed.

  • save (bool) – Whether to save results in the diags folder of the signature. (default=True)

  • plot (bool) – Whether to save plots in the diags folder of the signature. (default=True)

  • overwrite (bool) – Whether to overwrite the results of the diagnosis. (default=False)

  • n (int) – Number of molecules to sample. (default=10000)

Methods

across_coverage

Check coverage against a collection of other CC signatures.

across_roc

Check ROC against a collection of other CC signatures.

atc_roc

available

canvas

canvas_hpc

Run HPC jobs.

canvas_large

canvas_medium

canvas_small

clear

Remove all diagnostic data.

cluster_sizes

clusters_projection

confidences

confidences_projection

cosine_distances

Cosine distance distribution.

cross_coverage

Intersection of coverages.

cross_roc

Perform validations.

custom_comparative_vertical

diagnostics_hpc

Run HPC jobs.

dimensions

Get dimensions of the signature and compare to other signatures.

euclidean_distances

Euclidean distance distribution.

features_bins

features_iqr

global_ranks_agreement

Sample-specific global accuracy.

global_ranks_agreement_projection

image

intensities

intensities_projection

key_coverage

key_coverage_projection

keys_bins

keys_iqr

moa_roc

neigh_roc

Check ROC against another signature at different NN levels.

orthogonality

outliers

Computes anomaly score of the input samples.

pr

projection

t-SNE projection of CC signatures.

ranks_agreement

Sample-specific accuracy.

ranks_agreement_projection

redundancy

roc

values

Attributes

V

keys

across_coverage(*args, datasets=None, exemplary=True, ref_cctype='sign1', **kwargs)[source]

Check coverage against a collection of other CC signatures.

Parameters:
  • datasets (list) – List of datasets. If None, all available are used. (default=None)

  • exemplary (bool) – Whether to use only exemplary datasets (recommended). (default=True)

  • ref_cctype (str) – CC signature type. (default='sign1')

  • molset (str) – Molecule set to use. Full is recommended. (default=None)

  • kwargs (dict) – Parameters of the cross_coverage method.

across_roc(*args, datasets=None, exemplary=True, ref_cctype=None, redo=False, include_datasets=None, **kwargs)[source]

Check ROC against a collection of other CC signatures.

Parameters:
  • datasets (list) – List of datasets. If None, all available are used. (default=None).

  • exemplary (bool) – Whether to use only exemplary datasets (recommended). (default=True)

  • ref_cctype (str) – CC signature type. (default='sign0')

  • redo (bool) – Whether to redo the plot. (default=False)

  • include_datasets (list) – Specific datasets to add when exemplary is set to True. (default=None)

  • kwargs (dict) – Parameters of the cross_roc method.

canvas_hpc(tmpdir, **kwargs)[source]

Run HPC jobs.

Parameters:
  • tmpdir (str) – Folder (usually in scratch) where the job directory is generated.

  • cc_root – CC root path.

  • cctype – CC type (sign0, sign1, sign2, sign3) on which the method is applied.

  • molset – 'full' or 'reference'.

  • dss – Datasets to run the diagnostics on.

  • cc_reference – Another version of the CC to use as diagnostic reference.

clear()[source]

Remove all diagnostic data.

cosine_distances(*args, n_pairs=10000, **kwargs)[source]

Cosine distance distribution.

Parameters:

n_pairs (int) – Number of pairs to sample. (default=10000)
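
As an illustration of what a sampled distance distribution looks like, the sketch below draws n_pairs random pairs of rows and computes their cosine distances. It is a minimal standalone example, not the library's implementation: the helper names cosine_distance and sample_pair_distances are hypothetical, pairs are drawn independently (so a pair may repeat), and the convention of returning 1.0 for an all-zero vector is an assumption.

```python
import math
import random


def cosine_distance(u, v):
    """Return 1 - cos(u, v); assumed to be 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def sample_pair_distances(vectors, n_pairs=10000, seed=42):
    """Sample n_pairs random pairs of distinct rows and return their distances."""
    rng = random.Random(seed)
    n = len(vectors)
    if n < 2:
        raise ValueError("need at least two vectors")
    distances = []
    for _ in range(n_pairs):
        i, j = rng.sample(range(n), 2)  # two distinct row indices
        distances.append(cosine_distance(vectors[i], vectors[j]))
    return distances
```

The returned list can be fed to any histogram routine to inspect the shape of the distribution.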

cross_coverage(dataset, *args, ref_cctype='sign1', molset='full', try_conn_layer=False, redo=False, **kwargs)[source]

Intersection of coverages.

Parameters:

sign (signature) – A CC signature object to check against.
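
The idea of intersecting coverages can be sketched with plain sets of InChIKeys. The example below is an assumption-laden illustration rather than the library's code: coverage_overlap and connectivity_layer are hypothetical helpers, and reporting the overlap as a fraction of the first key set is one possible convention. It relies only on the fact that the first hyphen-separated block of an InChIKey (14 characters) encodes connectivity, which is what matching on the connectivity layer compares.

```python
def connectivity_layer(inchikey):
    """Return the first block of an InChIKey (the 14 characters before the first hyphen)."""
    return inchikey.split("-")[0]


def coverage_overlap(my_keys, other_keys, try_conn_layer=False):
    """Return (shared keys, fraction of my_keys that also appear in other_keys)."""
    mine = set(my_keys)
    other = set(other_keys)
    if try_conn_layer:
        # Compare molecules by connectivity only, ignoring the remaining blocks.
        mine = {connectivity_layer(k) for k in mine}
        other = {connectivity_layer(k) for k in other}
    shared = mine & other
    fraction = len(shared) / len(mine) if mine else 0.0
    return shared, fraction
```

With try_conn_layer=True, two keys that differ only after the first block count as the same molecule, so the overlap can only stay the same or grow.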

cross_roc(sign, *args, n_samples=10000, n_neighbors=5, neg_pos_ratio=1, apply_mappings=False, try_conn_layer=False, metric='cosine', redo=False, val_type='roc', **kwargs)[source]

Perform validations.

Parameters:
  • sign (signature) – A CC signature object to validate against.

  • n_samples (int) – Number of samples.

  • apply_mappings (bool) – Whether to use mappings to compute validation. Signatures that have been redundancy-reduced (i.e. reference) have fewer molecules. The keys are molecules from the full signature and the values are molecules from the reference set.

  • try_conn_layer (bool) – Try with the inchikey connectivity layer. (default=False)

  • metric (str) – 'cosine' or 'euclidean'. (default='cosine')

  • val_type (str) – 'roc' or 'pr'. (default='roc')

  • save (bool) – Specific save parameter. If not specified, the global setting is used. (default=None)
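
For val_type='roc', the general idea behind a distance-based validation can be sketched as follows: pairs considered similar by the reference signature are positives, other sampled pairs are negatives, and the area under the ROC curve measures how often a positive pair is closer than a negative pair. The function below is a hypothetical, standalone illustration of that statistic, not the method's actual implementation. It uses a simple quadratic pairwise comparison, which is adequate for small samples only.

```python
def auroc_from_distances(pos_distances, neg_distances):
    """AUROC under the rule 'smaller distance means positive'.

    Equals the probability that a random positive pair is closer than a
    random negative pair, with ties counted as one half.
    """
    if not pos_distances or not neg_distances:
        raise ValueError("need at least one positive and one negative pair")
    wins = 0.0
    for p in pos_distances:
        for q in neg_distances:
            if p < q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_distances) * len(neg_distances))
```

A value near 0.5 means the distances do not separate positives from negatives; a value near 1.0 means positive pairs are consistently closer.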

static diagnostics_hpc(tmpdir, cc_root, cctype, molset, dss, cc_reference, **kwargs)[source]

Run HPC jobs.

Parameters:
  • tmpdir (str) – Folder (usually in scratch) where the job directory is generated.

  • cc_root – CC root path.

  • cctype – CC type (sign0, sign1, sign2, sign3) on which the method is applied.

  • molset – 'full' or 'reference'.

  • dss – Datasets to run the diagnostics on.

  • cc_reference – Another version of the CC to use as diagnostic reference.

dimensions(*args, datasets=None, exemplary=True, ref_cctype='sign1', molset='full', **kwargs)[source]

Get dimensions of the signature and compare to other signatures.

euclidean_distances(*arg, n_pairs=10000, **kwargs)[source]

Euclidean distance distribution.

Parameters:

n_pairs (int) – Number of pairs to sample. (default=10000)

global_ranks_agreement(*args, n_neighbors=100, min_shared=100, metric='minkowski', p=0.9, ref_cctype=None, **kwargs)[source]

Sample-specific global accuracy.

Estimated as general agreement with the rest of the CC, based on a Z-global ranking.

neigh_roc(ds, *args, ref_cctype=None, n_neighbors=[1, 5, 10, 50, 100], **kwargs)[source]

Check ROC against another signature at different NN levels.

Parameters:
  • ds – Dataset against which to run the ROC analysis.

  • ref_cctype (str) – CC signature type.

  • n_neighbors (list) – List of top NN levels for which to compute the ROC. (default=[1, 5, 10, 50, 100])

  • molset (str) – Molecule set to use. Full is recommended. (default='full')

  • kwargs (dict) – Parameters of the cross_coverage method.

outliers(*args, n_estimators=1000, **kwargs)[source]

Computes anomaly score of the input samples.

The lower, the more abnormal. Negative scores represent outliers, positive scores represent inliers.

projection(*args, keys=None, focus_keys=None, max_keys=10000, perplexity=None, max_pca=100, redo=False, **kwargs)[source]

t-SNE projection of CC signatures.

Parameters:
  • keys (list) – Keys to be projected. If None specified, keys are randomly sampled. (default=None)

  • focus_keys (list) – Keys to be highlighted in the projection. (default=None).

  • max_keys (int) – Maximum number of keys to include in the projection. (default=10000)
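
The interplay of keys, focus_keys and max_keys can be sketched as a small selection step that runs before any projection is computed. This is a hypothetical illustration only: select_projection_keys is not part of the library, the seed argument is added for reproducibility, and the choice to highlight only those focus keys that end up in the projected set is an assumption.

```python
import random


def select_projection_keys(all_keys, keys=None, focus_keys=None,
                           max_keys=10000, seed=42):
    """Return (keys_to_project, highlighted_keys) for a 2D projection."""
    rng = random.Random(seed)
    available = set(all_keys)
    if keys is None:
        pool = sorted(available)  # sorted so a fixed seed gives a fixed sample
    else:
        pool = sorted(available.intersection(keys))  # drop unknown keys
    if len(pool) > max_keys:
        pool = sorted(rng.sample(pool, max_keys))  # random subsample
    # Assumption: only focus keys that are actually projected get highlighted.
    highlighted = sorted(set(focus_keys or []).intersection(pool))
    return pool, highlighted
```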

ranks_agreement(*args, datasets=None, exemplary=True, ref_cctype='sign0', n_neighbors=100, min_shared=100, metric='minkowski', p=0.9, **kwargs)[source]

Sample-specific accuracy.

Estimated as general agreement with the rest of the CC.