chemicalchecker.core.molkit.Molset

class Molset(cc, molecules, mol_type=None, add_image=True, generic_scaffold=False, molecules_ids=None)

Bases: object

Molset class.

Given a CC instance, provides access to features of an input set of molecules for one or more datasets of interest. The data is organized in a DataFrame, which is annotated with observed features and/or features predicted by simple kNN inference.

Initialize a Molset instance.

Parameters:

cc (Chemicalchecker) – Chemical Checker instance.
molecules (list) – A list of molecules in a homogeneous format given by ‘mol_type’. This can also be a DataFrame, but it is expected to have been generated by this same class, and therefore to be complete and coherent in terms of identifiers.
mol_type (str) – Type of identifier. Options are ‘inchi’, ‘inchikey’, ‘smiles’ or ‘name’. If ‘name’ is used, external resources (cactus.nci and pubchem) are queried and some molecules might be missing. If ‘inchikey’ is used, the local db is queried first, then the external resources (as a bonus, what is retrieved is added to the local db). The DataFrame is completed by adding the other identifiers if they are not yet present.
mol_col (str) – The name of the columns in the DataFrame.
add_image (bool) – If True, a molecule image is added.
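A minimal construction sketch, assuming the package is installed and a Chemical Checker instance is already available. The helper `make_molset` and its `molset_cls` parameter are hypothetical names (the latter only lets the sketch run without the package); the import path and keyword arguments are the ones documented above.

```python
ALLOWED_MOL_TYPES = ("inchi", "inchikey", "smiles", "name")


def make_molset(cc, molecules, mol_type, molset_cls=None):
    """Build a Molset after checking mol_type against the documented options.

    `make_molset` and `molset_cls` are hypothetical, for illustration only.
    """
    if mol_type not in ALLOWED_MOL_TYPES:
        raise ValueError(
            f"mol_type must be one of {ALLOWED_MOL_TYPES}, got {mol_type!r}"
        )
    if molset_cls is None:
        # Imported lazily so the sketch can be read without the package installed.
        from chemicalchecker.core.molkit import Molset as molset_cls
    # add_image=False: no molecule image is requested.
    return molset_cls(cc, list(molecules), mol_type=mol_type, add_image=False)
```

With the package available, a call such as `make_molset(cc, ['CCO'], 'smiles')` would simply pass its arguments through to `Molset`.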

Methods

`add_image`
`annotate`	Annotate the DataFrame with features fetched from CC spaces.
`combine`
`func_hpc`	Execute any object method on the configured HPC.
`get_chebi`	Get CHEBI id to name dictionary.
`get_chembl_hierarchy`	Get protein class and proteins in tree format.
`get_chembl_protein_classes`	Get protein class id to name dictionary.
`get_inchikey_inchi_map`
`get_name_inchi_map`
`get_uniprot_annotation`	Get information on Uniprot entries.
`load`	Load a Molset instance.
`predict`	Annotate the DataFrame with predicted features based on neighbors.
`project`	Annotate the DataFrame with projection of the signatures.
`save`	Save a Molset instance.
`signaturize`	Annotate the DataFrame with predicted sign4 signatures.

__getitem__(keys)

Forward access to DataFrame columns.

This allows accessing DataFrame columns while keeping the molecule structure visualization.

Parameters: keys (list) – a list of columns of the DataFrame.
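The forwarding idea can be illustrated with a plain-Python stand-in. This is a sketch of the concept only, not the Molset implementation: the class `ColumnView`, the `Structure` column name, and the rule that the structure column is always retained are assumptions made for illustration.

```python
class ColumnView:
    """Hypothetical stand-in for a table that forwards column selection."""

    def __init__(self, columns, keep=("Structure",)):
        self.columns = dict(columns)  # column name -> list of values
        self.keep = tuple(keep)       # columns retained in every selection

    def __getitem__(self, keys):
        if isinstance(keys, str):
            keys = [keys]
        # Start with the always-kept columns, then add the requested ones.
        names = [k for k in self.keep if k in self.columns]
        names += [k for k in keys if k not in names]
        return ColumnView({k: self.columns[k] for k in names}, self.keep)
```

Selecting `view[["B4"]]` on such an object would return a new view holding both the kept column and `B4`, which is the behaviour the description above suggests.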

annotate(dataset_code, shorten_dscode=True, include_features=False, feature_map=None, include_values=False, features_from_raw=False, filter_values=True, include_sign0=False)

Annotate the DataFrame with features fetched from CC spaces.

The minimal annotation is whether or not the molecule is present in the dataset of interest. The features, values and predictions are optionally available. Features and values are fetched from the raw preprocess file of a given space (before dropping any data). The prediction is based on nearest neighbors (NN) at the sign4 level and depends on several parameters: 1) the applicability threshold for considering the sign4 as reliable, 2) the p-value threshold (on sign4 distance) used to define the neighbors, and 3) the number of NN to consider.

Parameters:

dataset_code (str) – the CC dataset code: e.g. B4.001.
include_features (bool) – Include features of the molecules from their raw preprocess sign0.
feature_map (dict) – A dictionary that will be used to map features to a human interpretable format.
include_values (bool) – if True an additional column is added with a list of (feature, value) pairs.
include_prediction (bool) – include NN derived predictions.
mapping_fn (dict) – A dictionary performing the mapping between features ids and their meaning.
shorten_dscode (bool) – If True, get rid of the .001 part of the dataset code.
features_from_raw (bool) – If True, the features are fetched from the raw sign0, which includes all features for all molecules. If False, sign0 features are used, i.e. after the Sanitizer step, which may remove features or molecules.
filter_values (bool) – If True, values == 0 are filtered. False is recommended when the dataset contains continuous data.
include_sign0 (bool) – If True the full sign0 for each molecule is included in the dataframe.
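The effect of `shorten_dscode`, `filter_values` and `feature_map` can be sketched with plain Python. The helper names below are hypothetical, and the fallback to the raw identifier for unmapped features is an assumption of this sketch rather than documented behaviour.

```python
def shorten_dataset_code(dataset_code):
    """Drop the trailing numeric part, e.g. 'B4.001' becomes 'B4'."""
    return dataset_code.split(".", 1)[0]


def feature_value_pairs(features, values, feature_map=None, filter_values=True):
    """Pair features with values, optionally dropping zeros and renaming features."""
    feature_map = feature_map or {}
    pairs = []
    for feature, value in zip(features, values):
        if filter_values and value == 0:
            continue  # mirrors filter_values=True: zero values are dropped
        # Unmapped features keep their raw identifier (assumption of this sketch).
        pairs.append((feature_map.get(feature, feature), value))
    return pairs
```

For continuous data, passing `filter_values=False` keeps entries whose value happens to be exactly zero, which is why the parameter description recommends it in that case.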

func_hpc(func_name, func_args, func_kwargs, hpc_kwargs, data_path=None)

Execute any object method on the configured HPC.

Parameters:

func_name (str) – the name of the function.
func_args (tuple) – the arguments of the function to be called.
func_kwargs (tuple) – the keyword arguments of the function to be called.
hpc_kwargs (dict) – arguments for the HPC class.
data_path (str) – the path to the Molset object, if None a copy of the data in the job folder will be made.
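Calling a method by name, as `func_name`, `func_args` and `func_kwargs` suggest, can be sketched locally as follows. The helper `call_method` is hypothetical, job submission to the HPC is not modelled, and `func_kwargs` is treated as a mapping here, which is an assumption.

```python
def call_method(obj, func_name, func_args=(), func_kwargs=None):
    """Look up a method by name on obj and call it with the given arguments."""
    method = getattr(obj, func_name)  # raises AttributeError for unknown names
    if not callable(method):
        raise TypeError(f"{func_name!r} is not a callable attribute")
    return method(*func_args, **(func_kwargs or {}))
```

Under these assumptions, `call_method(molset, 'annotate', ('B4.001',), {'include_features': True})` would be the local equivalent of the call that `func_hpc` dispatches to a job.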

static get_chebi(chebi_obo): Get CHEBI id to name dictionary.

static get_chembl_hierarchy(chembldb='chembl_27'): Get protein class and proteins in tree format.

static get_chembl_protein_classes(chembldb='chembl_27', prefix_level=False): Get protein class id to name dictionary.

static get_uniprot_annotation(entries): Get information on Uniprot entries.

classmethod load(filename, cc=None, cc_root=None, add_image=False)

Load a Molset instance.

Parameters:

filename (str) – The path of the file to load.
cc (ChemicalChecker) – An already initialized CC instance.
cc_root (str) – Path to the root of a Chemical Checker. If None we try using the saved path. If that is not reachable we use the default (default is cc_config dependent) CC.
add_image (bool) – If True a molecule image is added.

predict(dataset_code, shorten_dscode=True, applicability_thr_query=0, applicability_thr_nn=0, max_nr_nn=1000, pvalue_thr_nn=0.0001, limit_top_nn=1000, return_stats=False, return_sign0=False, blacklist=[], aggregation_thr=0.0, return_probas=False)

Annotate the DataFrame with predicted features based on neighbors.

In this case we can potentially get annotations for every molecule. The molecules are first signaturized (sign4), filtered by applicability, and then searched against the molecules within the CC sign4. Nearest neighbors (NN) are selected for each molecule based on user-specified parameters. Only NN that are found in the sign0 of the space are preserved. The annotations of these NN are aggregated (multiple strategies are possible) and finally assigned to the molecule.

Parameters:

dataset_code (str) – the CC dataset code: e.g. B4.001.
shorten_dscode (bool) – If True, get rid of the .001 part of the dataset code.
applicability_thr_query (float) – Only query with a sign4 applicability above this threshold will be searched for neighbors.
applicability_thr_nn (float) – Only neighbors with a sign4 applicability above this threshold will be used for features inference.
max_nr_nn (int) – Maximum number of neighbors to possibly consider.
pvalue_thr_nn (float) – p-value threshold used to filter neighbors based on distance.
limit_top_nn (int) – Only keep topN neighbors for feature predictions.
return_stats (bool) – if True return a dataframe with statistics on the NN search filtering steps.
return_sign0 (bool) – if True return sign0 format prediction (only for binary spaces).
blacklist (list) – List of inchikeys of molecules to disregard during the neighbors search. Used for comparison to other prediction approaches where these molecules are the test set.
return_probas (bool) – If True, the probabilities for each predicted class are also returned.
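The neighbor filtering and aggregation described above can be sketched in plain Python. This is one possible reading, not the library's implementation: the function name, the tuple layout of `neighbors`, the strict and non-strict comparisons, and the use of a simple neighbor fraction as the aggregation strategy are all assumptions.

```python
from collections import Counter


def aggregate_neighbor_features(neighbors, annotations, applicability_thr_nn=0.0,
                                pvalue_thr_nn=1e-4, limit_top_nn=1000,
                                blacklist=(), aggregation_thr=0.0):
    """Infer features for one query molecule from its neighbors.

    neighbors: iterable of (inchikey, pvalue, applicability) tuples.
    annotations: dict mapping inchikey -> iterable of observed (sign0) features.
    Returns {feature: fraction of retained neighbors carrying it}.
    """
    blocked = set(blacklist)
    kept = [
        (key, pvalue) for key, pvalue, applicability in neighbors
        if key not in blocked                       # skip blacklisted molecules
        and applicability > applicability_thr_nn    # reliable sign4 only
        and pvalue <= pvalue_thr_nn                 # close enough in sign4 space
        and key in annotations                      # must have observed features
    ]
    kept.sort(key=lambda item: item[1])             # closest neighbors first
    kept = kept[:limit_top_nn]
    if not kept:
        return {}
    counts = Counter(f for key, _ in kept for f in set(annotations[key]))
    total = len(kept)
    return {f: c / total for f, c in counts.items() if c / total > aggregation_thr}
```

Raising `aggregation_thr` in this sketch keeps only features shared by a larger fraction of the retained neighbors, while `blacklist` removes held-out molecules before any counting takes place.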

project(projector, projector_name=None, datasets='exemplary')

Annotate the DataFrame with projection of the signatures.

Parameters:

projector (object) – Any preinitialized projector exposing the ‘fit_transform’ function.
projector_name (str) – The name of the projector, used to define the new DataFrame columns.
datasets (list) – The CC dataset codes, e.g. [‘B4.001’]. Use ‘exemplary’ to get signatures for all exemplary CC spaces.
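Any object with a `fit_transform` method qualifies as a projector (scikit-learn's `PCA` is one common example). The toy projector and the `<name>_x` / `<name>_y` column naming below are hypothetical and only illustrate the duck-typed interface; the actual column names produced by `project` are not specified here.

```python
class FirstTwoDims:
    """Toy projector: keeps the first two coordinates of each signature."""

    def fit_transform(self, X):
        return [list(row[:2]) for row in X]


def projection_columns(projector, signatures, projector_name=None):
    """Run projector.fit_transform and return named coordinate columns."""
    name = projector_name or type(projector).__name__
    coords = projector.fit_transform(signatures)
    # Column naming '<name>_x' / '<name>_y' is hypothetical.
    return {
        f"{name}_x": [c[0] for c in coords],
        f"{name}_y": [c[1] for c in coords],
    }
```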

save(destination, overwrite=False)

Save a Molset instance.

Parameters:

destination (str) – The destination path.
overwrite (bool) – Whether to overwrite in case the file exists.
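A save and reload round trip, using only the documented `save` and `load` signatures, might look like the sketch below. The helper `save_and_reload` is hypothetical; it works on any object exposing these two methods.

```python
def save_and_reload(molset, destination, cc=None):
    """Save a Molset-like object, then load it back through its class."""
    molset.save(destination, overwrite=True)
    # `load` is documented as a classmethod, so it is called on the class.
    return type(molset).load(destination, cc=cc, add_image=False)
```

Passing an already initialized `cc` avoids relying on the saved CC root path being reachable when the file is loaded.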

signaturize(datasets='exemplary')

Annotate the DataFrame with predicted sign4 signatures.

Parameters: datasets (list) – The CC dataset codes, e.g. [‘B4.001’]. Use ‘exemplary’ to get signatures for all exemplary CC spaces.