# Source code for chemicalchecker.database.dataset

"""Bioactivity Dataset definition.

This is how we define a dataset for a bioactivity space:

+-----------------------+-----------------------+-----------------------+
| Column                | Values                | Description           |
+=======================+=======================+=======================+
| Dataset_code          | e.g. ``A1.001``       | Identifier of the     |
|                       |                       | dataset.              |
+-----------------------+-----------------------+-----------------------+
| Level                 | e.g. ``A``            | The CC level.         |
+-----------------------+-----------------------+-----------------------+
| Coordinate            | e.g. ``A1``           | Coordinates in the CC |
|                       |                       | organization.         |
+-----------------------+-----------------------+-----------------------+
| Name                  | 2D fingerprints       | Display, short-name   |
|                       |                       | of the dataset.       |
+-----------------------+-----------------------+-----------------------+
| Technical name        | 1024-bit Morgan       | A more technical name |
|                       | fingerprints          | for the dataset,      |
|                       |                       | suitable for          |
|                       |                       | chemo                 |
|                       |                       | -/bio-informaticians. |
+-----------------------+-----------------------+-----------------------+
| Description           | 2D fingerprints are…  | This field contains a |
|                       |                       | long description of   |
|                       |                       | the dataset. It is    |
|                       |                       | important that the    |
|                       |                       | curator outlines here |
|                       |                       | the importance of the |
|                       |                       | dataset, why they     |
|                       |                       | decided to include    |
|                       |                       | it, and what are the  |
|                       |                       | scenarios where this  |
|                       |                       | dataset may be        |
|                       |                       | useful.               |
+-----------------------+-----------------------+-----------------------+
| Unknowns              | ``True``/``False``    | Does the dataset      |
|                       |                       | contain known/unknown |
|                       |                       | data? Binding data    |
|                       |                       | from chemogenomics    |
|                       |                       | datasets, for         |
|                       |                       | example, are          |
|                       |                       | positive-unlabeled,   |
|                       |                       | so they do contain    |
|                       |                       | unknowns. Conversely, |
|                       |                       | chemical fingerprints |
|                       |                       | or gene expression    |
|                       |                       | data do not contain   |
|                       |                       | unknowns.             |
+-----------------------+-----------------------+-----------------------+
| Discrete              | ``True``/``False``    | The type of data that |
|                       |                       | ultimately expresses  |
|                       |                       | the dataset, after    |
|                       |                       | pre-processing.       |
|                       |                       | Categorical variables |
|                       |                       | are not allowed; they |
|                       |                       | must be converted to  |
|                       |                       | one-hot encoding or   |
|                       |                       | binarized. Mixed      |
|                       |                       | variables are not     |
|                       |                       | allowed, either.      |
+-----------------------+-----------------------+-----------------------+
| Keys                  | e.g. ``CPD`` (we use  | In the core CC        |
|                       | ``Bioteque``          | database, this field  |
|                       | nomenclature). Can be | will most often       |
|                       | ``NULL``.             | correspond to         |
|                       |                       | ``CPD``, as the CC is |
|                       |                       | centred on small      |
|                       |                       | molecules. Keys of    |
|                       |                       | different types only  |
|                       |                       | make sense for        |
|                       |                       | connectivity          |
|                       |                       | attempts, e.g. when   |
|                       |                       | mapping disease gene  |
|                       |                       | expression            |
|                       |                       | signatures.           |
+-----------------------+-----------------------+-----------------------+
| Features              | e.g. ``GEN`` (we use  | When features         |
|                       | ``Bioteque``          | correspond to         |
|                       | nomenclature). Can be | explicit knowledge,   |
|                       | ``NULL``.             | such as proteins,     |
|                       |                       | gene ontology         |
|                       |                       | processes, or         |
|                       |                       | indications, we       |
|                       |                       | express with this     |
|                       |                       | field the type of     |
|                       |                       | biological entities.  |
|                       |                       | It is not allowed to  |
|                       |                       | mix different feature |
|                       |                       | types. Features can,  |
|                       |                       | however, have no      |
|                       |                       | type, typically when  |
|                       |                       | they come from a      |
|                       |                       | heavily-processed     |
|                       |                       | dataset, such as      |
|                       |                       | gene-expression data. |
|                       |                       | Even if we use        |
|                       |                       | ``Bioteque``          |
|                       |                       | nomenclature to       |
|                       |                       | define the type of    |
|                       |                       | biological data, it   |
|                       |                       | is not mandatory that |
|                       |                       | the vocabularies are  |
|                       |                       | the ones used by the  |
|                       |                       | ``Bioteque``; for     |
|                       |                       | example, one can use  |
|                       |                       | non-human UniProt     |
|                       |                       | ACs if deemed         |
|                       |                       | necessary.            |
+-----------------------+-----------------------+-----------------------+
| Exemplary             | ``True``/``False``    | Is the dataset        |
|                       |                       | exemplary of the      |
|                       |                       | coordinate? Only one  |
|                       |                       | exemplary dataset is  |
|                       |                       | valid for each        |
|                       |                       | coordinate. Exemplary |
|                       |                       | datasets should have  |
|                       |                       | good coverage (both   |
|                       |                       | in keys space and     |
|                       |                       | feature space) and    |
|                       |                       | acceptable quality of |
|                       |                       | the data.             |
+-----------------------+-----------------------+-----------------------+
| Public                | ``True``/``False``    | Some datasets are     |
|                       |                       | public, and some are  |
|                       |                       | not, especially those |
|                       |                       | that come from        |
|                       |                       | collaborations with   |
|                       |                       | the pharma industry.  |
+-----------------------+-----------------------+-----------------------+
| Essential             | ``True``/``False``    | Essential datasets    |
|                       |                       | are required for      |
|                       |                       | the signaturization   |
|                       |                       | pipeline to work.     |
+-----------------------+-----------------------+-----------------------+
| Derived               | ``True``/``False``    | Datasets can be       |
|                       |                       | derived from existing |
|                       |                       | data (i.e. they come  |
|                       |                       | from an external      |
|                       |                       | datasource) or they   |
|                       |                       | are calculated and    |
|                       |                       | are virtually         |
|                       |                       | available for any     |
|                       |                       | compound (e.g. "A1"). |
+-----------------------+-----------------------+-----------------------+
| Datasources           | Foreign key to        | Data sources that are |
|                       | ``DataSource`` table. | used for generating   |
|                       |                       | signature 0 of the    |
|                       |                       | dataset.              |
+-----------------------+-----------------------+-----------------------+

Dataset and Datasource have a many-to-many relationship, i.e. a dataset can
refer to multiple datasources and one datasource can be used by many datasets.
For example, ``drugbank`` is a :mod:`~chemicalchecker.database.datasource`
used by both ``B1.001`` and ``E1.001``, but each of them also has additional,
different datasources.
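
For example, one can fetch a dataset and list the datasources it maps to
(illustrative usage; assumes a configured CC database)::

    from chemicalchecker.database import Dataset

    dataset = Dataset.get('B1.001')
    for datasource in dataset.datasources:
        print(datasource)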
"""
from sqlalchemy import Column, Text, Boolean, ForeignKey, VARCHAR
from sqlalchemy.orm import class_mapper, ColumnProperty, relationship
import sqlalchemy
from .database import Base, get_session, get_engine

from chemicalchecker.util import logged


@logged
class Dataset(Base):
    """Dataset Table class.

    ``Base`` is the SQLAlchemy declarative base class, so no explicit
    ``__init__`` is needed.

    Parameters:
        dataset_code(str): Primary key, identifier of the dataset.
        level(str): The CC level.
        coordinate(str): Coordinates in the CC organization.
        name(str): Display, short-name of the dataset.
        technical_name(str): A more technical name for the dataset,
            suitable for chemo-/bio-informaticians.
        description(str): A long description of the dataset.
        unknowns(bool): Does the dataset contain known/unknown data?
        discrete(bool): The type of data that ultimately expresses the
            dataset, after the pre-processing.
        keys(str): In the core CC database, this field will most often
            correspond to CPD, as the CC is centred on small molecules.
        features(str): The type of biological entities used as features.
        exemplary(bool): Is the dataset exemplary of the coordinate?
        public(bool): Is the dataset public?
        essential(bool): Is the dataset required by the signaturization
            pipeline?
    """

    __tablename__ = 'dataset'
    dataset_code = Column(VARCHAR(6), primary_key=True)
    level = Column(VARCHAR(1))
    coordinate = Column(VARCHAR(2))
    name = Column(Text)
    technical_name = Column(Text)
    description = Column(Text)
    unknowns = Column(Boolean)
    discrete = Column(Boolean)
    keys = Column(VARCHAR(3))
    features = Column(VARCHAR(3))
    exemplary = Column(Boolean)
    public = Column(Boolean)
    essential = Column(Boolean)
    # derived = Column(Boolean)  # implemented as a property below
    datasources = relationship("Datasource",
                               secondary="dataset_has_datasource",
                               back_populates="datasets",
                               lazy='joined')
    def __repr__(self):
        """String representation."""
        return self.dataset_code
    def __lt__(self, other):
        return self.dataset_code < other.dataset_code

    @property
    def code(self):
        return self.dataset_code

    @property
    def derived(self):
        return len(self.datasources) > 0

    @staticmethod
    def _create_table():
        engine = get_engine()
        Base.metadata.create_all(engine, tables=[Dataset.__table__])

    @staticmethod
    def _drop_table():
        engine = get_engine()
        Dataset.__table__.drop(engine)

    @staticmethod
    def _table_exists():
        engine = get_engine()
        return sqlalchemy.inspect(engine).has_table(Dataset.__tablename__)

    @staticmethod
    def _table_attributes():
        attrs = class_mapper(Dataset).iterate_properties
        return [a.key for a in attrs if isinstance(a, ColumnProperty)]
    @staticmethod
    def add(kwargs):
        """Add a new row to the table.

        Args:
            kwargs(dict): The data in dictionary format.
        """
        if type(kwargs) is dict:
            entry = Dataset(**kwargs)
            Dataset.__log.debug(entry)
            session = get_session()
            session.add(entry)
            session.commit()
            session.close()
    @staticmethod
    def from_csv(filename):
        """Add entries from CSV file.

        Args:
            filename(str): Path to a CSV file.
        """
        import pandas as pd
        df = pd.read_csv(filename)
        # The boolean columns must be converted to boolean values,
        # otherwise SQLAlchemy passes strings.
        bool_cols = ['unknowns', 'discrete', 'exemplary', 'public',
                     'essential']
        for col in bool_cols:
            df[col] = df[col].apply(lambda x: x != 'f')
        # check columns
        needed_cols = Dataset._table_attributes()
        if needed_cols != list(df.columns):
            raise Exception("Input missing columns: %s" %
                            ' '.join(needed_cols))
        # add entries
        for row_nr, row in df.iterrows():
            try:
                Dataset.add(row.dropna().to_dict())
            except Exception as err:
                Dataset.__log.error(
                    "Error in line %s: %s", row_nr, str(err))
    @staticmethod
    def get(code=None, **kwargs):
        """Get Dataset with given code.

        Args:
            code(str): The Dataset code, e.g. ``A1.001``.
        """
        session = get_session()
        if code is not None:
            query = session.query(Dataset).filter_by(dataset_code=code,
                                                     **kwargs)
            res = query.one_or_none()
            session.close()
            return res
        else:
            query = session.query(Dataset).distinct(
                Dataset.dataset_code).filter_by(**kwargs)
            res = query.all()
            session.close()
            return sorted(res)
    @staticmethod
    def get_coordinates():
        """Get Dataset list of possible coordinates."""
        session = get_session()
        query = session.query(Dataset).distinct(Dataset.coordinate)
        res = query.all()
        session.close()
        return res
@logged
class DatasetHasDatasource(Base):
    """Dataset-Datasource relationship.

    Many-to-Many relationship.
    """

    __tablename__ = 'dataset_has_datasource'
    dataset_code = Column(VARCHAR(6), ForeignKey("dataset.dataset_code"),
                          primary_key=True)
    datasource_name = Column(Text, ForeignKey("datasource.datasource_name"),
                             primary_key=True)
    def __repr__(self):
        """String representation."""
        return self.dataset_code + " maps to " + self.datasource_name
    @staticmethod
    def _create_table():
        engine = get_engine()
        Base.metadata.create_all(
            engine, tables=[DatasetHasDatasource.__table__])

    @staticmethod
    def _drop_table():
        engine = get_engine()
        DatasetHasDatasource.__table__.drop(engine)

    @staticmethod
    def _table_exists():
        engine = get_engine()
        return sqlalchemy.inspect(engine).has_table(
            DatasetHasDatasource.__tablename__)

    @staticmethod
    def _table_attributes():
        attrs = class_mapper(DatasetHasDatasource).iterate_properties
        col_attrs = [a.key for a in attrs if isinstance(a, ColumnProperty)]
        return [a for a in col_attrs if a != 'id']
    @staticmethod
    def add(kwargs):
        """Add a new row to the table.

        Args:
            kwargs(dict): The data in dictionary format.
        """
        if type(kwargs) is dict:
            entry = DatasetHasDatasource(**kwargs)
            DatasetHasDatasource.__log.debug(entry)
            session = get_session()
            session.add(entry)
            session.commit()
            session.close()
    @staticmethod
    def from_csv(filename):
        """Add entries from CSV file.

        Args:
            filename(str): Path to a CSV file.
        """
        import pandas as pd
        df = pd.read_csv(filename)
        # check columns
        needed_cols = DatasetHasDatasource._table_attributes()
        if needed_cols != list(df.columns):
            raise Exception("Input missing columns: %s" %
                            ' '.join(needed_cols))
        # add entries
        for row_nr, row in df.iterrows():
            try:
                DatasetHasDatasource.add(row.dropna().to_dict())
            except Exception as err:
                DatasetHasDatasource.__log.error(
                    "Error in line %s: %s", row_nr, str(err))
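
The many-to-many mapping used above can be reproduced in isolation. The
following is a minimal, self-contained sketch (independent of this module and
its database helpers) that mirrors the ``Dataset``/``Datasource``/
``DatasetHasDatasource`` pattern against an in-memory SQLite engine; the
simplified classes here are illustrative stand-ins, not the module's own.

```python
from sqlalchemy import Column, ForeignKey, Text, VARCHAR, create_engine
from sqlalchemy.orm import declarative_base, relationship, sessionmaker

Base = declarative_base()


class Datasource(Base):
    # simplified stand-in for the Datasource table
    __tablename__ = 'datasource'
    datasource_name = Column(Text, primary_key=True)
    datasets = relationship('Dataset', secondary='dataset_has_datasource',
                            back_populates='datasources')


class Dataset(Base):
    # simplified stand-in: only the primary key and the relationship
    __tablename__ = 'dataset'
    dataset_code = Column(VARCHAR(6), primary_key=True)
    datasources = relationship('Datasource',
                               secondary='dataset_has_datasource',
                               back_populates='datasets')


class DatasetHasDatasource(Base):
    # association table: composite primary key of both foreign keys
    __tablename__ = 'dataset_has_datasource'
    dataset_code = Column(VARCHAR(6), ForeignKey('dataset.dataset_code'),
                          primary_key=True)
    datasource_name = Column(Text, ForeignKey('datasource.datasource_name'),
                             primary_key=True)


engine = create_engine('sqlite://')  # in-memory database
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# one datasource shared by two datasets, as in the drugbank example
drugbank = Datasource(datasource_name='drugbank')
b1 = Dataset(dataset_code='B1.001', datasources=[drugbank])
e1 = Dataset(dataset_code='E1.001', datasources=[drugbank])
session.add_all([b1, e1])
session.commit()

shared = session.get(Datasource, 'drugbank')
print(sorted(d.dataset_code for d in shared.datasets))
```

Because both relationships declare ``back_populates``, appending a datasource
to a dataset keeps the reverse ``Datasource.datasets`` collection in sync
automatically, which is the behavior the module relies on.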