chemicalchecker.database.dataset
Bioactivity Dataset definition.
This is how we define a dataset for a bioactivity space:
Column | Values | Description |
---|---|---|
Dataset_code | e.g. | Identifier of the dataset. |
Level | e.g. | The CC level. |
Coordinate | e.g. | Coordinates in the CC organization. |
Name | 2D fingerprints | Short display name of the dataset. |
Technical name | 1024-bit Morgan fingerprints | A more technical name for the dataset, suitable for chemo-/bio-informaticians. |
Description | 2D fingerprints are… | A long description of the dataset. It is important that the curator outlines here the importance of the dataset, why they decided to include it, and the scenarios where this dataset may be useful. |
Unknowns | | Whether the dataset contains unknown data. Binding data from chemogenomics datasets, for example, are positive-unlabeled, so they do contain unknowns. Conversely, chemical fingerprints or gene expression data do not contain unknowns. |
Discrete | | The type of data that ultimately expresses the dataset, after the pre-processing. Categorical variables are not allowed; they must be converted to one-hot encoding or binarized. Mixed variables are not allowed, either. |
Keys | e.g. | In the core CC database, most of the time this field will correspond to |
Features | e.g. | When features correspond to explicit knowledge, such as proteins, gene ontology processes, or indications, we express with this field the type of biological entities. It is not allowed to mix different feature types. Features can, however, have no type, typically when they come from a heavily-processed dataset, such as gene-expression data. Even if we use |
Exemplary | | Whether the dataset is exemplary of the coordinate. Only one exemplary dataset is valid for each coordinate. Exemplary datasets should have good coverage (in both key space and feature space) and acceptable data quality. |
Public | | Some datasets are public, and some are not, especially those that come from collaborations with the pharma industry. |
Essential | | Essential datasets are required for the signaturization pipeline to work. |
Derived | | Datasets can be derived from existing data (i.e. they come from an external datasource), or they can be calculated and thus virtually available for any compound (e.g. “A1”). |
Datasources | Foreign key to | Data sources that are used for generating signature 0 of the dataset. |
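
As a rough illustration of the columns above, a dataset record could be mirrored in memory as a plain Python dataclass. This is only a sketch under stated assumptions and not the package's actual table class: the names `DatasetRecord` and `check_single_exemplary` are hypothetical, the boolean typing of the flag columns is assumed, and the level and coordinate values in the usage lines are placeholders. The helper checks the rule that only one exemplary dataset is valid for each coordinate.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class DatasetRecord:
    """Hypothetical in-memory mirror of one row of the dataset table."""
    dataset_code: str                 # code style follows "B1.001" from the text
    level: str
    coordinate: str
    name: str
    technical_name: str
    description: str
    unknowns: bool = False            # assumed boolean flag
    discrete: bool = True             # assumed boolean flag
    keys: Optional[str] = None        # type of keys, left unspecified here
    features: Optional[str] = None    # None when the features have no type
    exemplary: bool = False
    public: bool = True
    essential: bool = False
    derived: bool = True


def check_single_exemplary(records: Iterable[DatasetRecord]) -> List[str]:
    """Return the coordinates that have more than one exemplary dataset."""
    counts = Counter(r.coordinate for r in records if r.exemplary)
    return sorted(coord for coord, n in counts.items() if n > 1)


# Usage with placeholder values: two exemplary datasets on one coordinate.
first = DatasetRecord("B1.001", "B", "B1", "placeholder name",
                      "placeholder technical name", "placeholder text",
                      exemplary=True)
second = DatasetRecord("B1.002", "B", "B1", "placeholder name",
                       "placeholder technical name", "placeholder text",
                       exemplary=True)
violations = check_single_exemplary([first, second])
print(violations)
```

A check of this kind could run before records are written, so that a second exemplary dataset for a coordinate is rejected early rather than discovered downstream.
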

Dataset and Datasource have a many-to-many relationship, i.e. a dataset can refer to multiple datasources and one datasource can be used by many datasets. For example, drugbank is a datasource that is used by both B1.001 and E1.001, but each of them also has additional, different datasources.
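
The many-to-many relationship described above is usually stored through an association table. The following is a minimal sketch using Python's standard `sqlite3` module, not the package's own table definitions: the table and column names (`dataset`, `datasource`, `dataset_has_datasource`) and the extra datasources `source_x` and `source_y` are hypothetical, while `drugbank`, `B1.001` and `E1.001` come from the example in the text.

```python
import sqlite3

# In-memory database with two entity tables and one association table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset (dataset_code TEXT PRIMARY KEY);
CREATE TABLE datasource (name TEXT PRIMARY KEY);
CREATE TABLE dataset_has_datasource (
    dataset_code    TEXT NOT NULL REFERENCES dataset(dataset_code),
    datasource_name TEXT NOT NULL REFERENCES datasource(name),
    PRIMARY KEY (dataset_code, datasource_name)
);
""")

conn.executemany("INSERT INTO dataset VALUES (?)",
                 [("B1.001",), ("E1.001",)])
conn.executemany("INSERT INTO datasource VALUES (?)",
                 [("drugbank",), ("source_x",), ("source_y",)])
# drugbank is shared; each dataset also has its own additional datasource.
conn.executemany("INSERT INTO dataset_has_datasource VALUES (?, ?)",
                 [("B1.001", "drugbank"), ("B1.001", "source_x"),
                  ("E1.001", "drugbank"), ("E1.001", "source_y")])


def datasets_using(connection, datasource_name):
    """List the codes of all datasets linked to the given datasource."""
    rows = connection.execute(
        "SELECT dataset_code FROM dataset_has_datasource "
        "WHERE datasource_name = ? ORDER BY dataset_code",
        (datasource_name,))
    return [row[0] for row in rows]


def datasources_of(connection, dataset_code):
    """List the names of all datasources linked to the given dataset."""
    rows = connection.execute(
        "SELECT datasource_name FROM dataset_has_datasource "
        "WHERE dataset_code = ? ORDER BY datasource_name",
        (dataset_code,))
    return [row[0] for row in rows]


print(datasets_using(conn, "drugbank"))
print(datasources_of(conn, "B1.001"))
```

The composite primary key on the association table prevents the same dataset-datasource pair from being recorded twice, while still allowing either side to appear in many rows.
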
Classes
Dataset Table class. |
|
Dataset-Datasource relationship. |