chemicalchecker.util.splitter.neighbortriplet.AdriaTripletSampler

class AdriaTripletSampler(*args, **kwargs)

Bases: BaseTripletSampler

Adria’s approach to sampling triplets, best suited to small datasets.

Methods

`generate_triplets`	Generate triplets.
`get_split_indeces`	Get random indexes for different splits.
`save_triplets`	Save sampled triplets to file.

generate_triplets(num_triplets=1000000.0, frac_hard=0.3, frac_neig=0.05, metric='jaccard', low_thr=0.1, high_thr=0.5, plot=True)

Generate triplets.

This function generates triplets, defining positives and negatives under the assumption of a binary signature (e.g. sign0) and computing all pairwise similarities across molecules.

Parameters:

num_triplets (int) – Total number of triplets to generate.
frac_hard (float) – Fraction of triplets to be of the hard case.
frac_neig (float) – Fraction of neighbors to consider.
metric (str) – Metric used to compute similarities; must be a distance metric that can be converted to a similarity as (1 - dist).
low_thr (float) – Low similarity threshold, any pair below this is negative.
high_thr (float) – High similarity threshold, any pair above this is positive.
plot (bool) – Save plots of the sampling.
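
The thresholding described above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the helper names `jaccard_similarity` and `sample_triplets` are hypothetical, the all-zero convention is an assumption, and the roles of `frac_hard` and `frac_neig` are left out because this page does not specify them. It only shows how `low_thr` and `high_thr` could turn pairwise Jaccard similarities into positives and negatives for each anchor.

```python
import random
from itertools import combinations


def jaccard_similarity(a, b):
    """Jaccard similarity (1 - Jaccard distance) of two binary vectors."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    # Two all-zero vectors share nothing; treating them as dissimilar
    # is a convention of this sketch only.
    return inter / union if union else 0.0


def sample_triplets(signatures, num_triplets, low_thr=0.1, high_thr=0.5, seed=0):
    """Hypothetical helper: sample (anchor, positive, negative) index triplets."""
    rng = random.Random(seed)
    n = len(signatures)
    positives = {i: [] for i in range(n)}
    negatives = {i: [] for i in range(n)}
    # Compare every pair once and record the relation in both directions.
    for i, j in combinations(range(n), 2):
        sim = jaccard_similarity(signatures[i], signatures[j])
        if sim > high_thr:
            positives[i].append(j)
            positives[j].append(i)
        elif sim < low_thr:
            negatives[i].append(j)
            negatives[j].append(i)
    # Only molecules with at least one positive and one negative can be anchors.
    anchors = [i for i in range(n) if positives[i] and negatives[i]]
    if not anchors:
        return []
    triplets = []
    for _ in range(num_triplets):
        a = rng.choice(anchors)
        triplets.append((a, rng.choice(positives[a]), rng.choice(negatives[a])))
    return triplets


# Toy binary signatures: rows 0 and 1 are similar, as are rows 2 and 3.
signatures = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
triplets = sample_triplets(signatures, num_triplets=20)
```

Pairs whose similarity falls between the two thresholds (rows 1 and 3 in the toy data) are treated as neither positive nor negative in this sketch.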

get_split_indeces(rows, fractions)

Get random indexes for different splits.
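
As an illustration of what drawing random indexes for several splits can look like, here is a minimal sketch. The function `split_indexes` and its `seed` argument are hypothetical and are not part of the documented API; the actual method may round or order the splits differently.

```python
import random


def split_indexes(rows, fractions, seed=None):
    """Shuffle range(rows) and cut it into consecutive chunks sized by fractions."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    idxs = list(range(rows))
    random.Random(seed).shuffle(idxs)
    splits, start = [], 0
    for k, frac in enumerate(fractions):
        # The last split takes the remainder so rounding never drops a row.
        is_last = k == len(fractions) - 1
        end = rows if is_last else start + int(round(rows * frac))
        splits.append(idxs[start:end])
        start = end
    return splits


train_idx, test_idx = split_indexes(10, [0.8, 0.2], seed=42)
```

Every row index lands in exactly one split, so the splits are disjoint and together cover all rows.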

save_triplets(triplets, mean_center_x=True, shuffle=True, split_names=['train', 'test'], split_fractions=[0.8, 0.2], suffix='eval', cpu=1, x_dtype=<class 'numpy.float32'>, y_dtype=<class 'numpy.float32'>)

Save sampled triplets to file.

This function saves the triplets after performing the train/test split, shuffling, and normalization.

Parameters:

triplets (array) – Indexes of anchor, positive and negative for each triplet.
mean_center_x (bool) – Mean-center the data column-wise.
shuffle (bool) – Shuffle the order of the triplets.
split_names (list of str) – Names of the splits.
split_fractions (list of float) – Fraction of the data assigned to each split.
suffix (str) – Suffix of the generated scaler.
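
To make the `triplets` and `mean_center_x` parameters more concrete, the sketch below shows how an array of index triplets could be turned into anchor, positive, and negative feature arrays after column-wise mean-centering. The helper `triplets_to_arrays` and the toy matrix `X` are hypothetical, and computing the column means over all rows is an assumption; the library may fit its scaler differently and writes the result to file, which is not shown here.

```python
import numpy as np


def triplets_to_arrays(X, triplets, mean_center_x=True, x_dtype=np.float32):
    """Hypothetical helper: look up features for each (anchor, positive, negative)."""
    X = np.asarray(X, dtype=np.float64)
    if mean_center_x:
        # Subtract each column's mean so every feature is centered on zero.
        X = X - X.mean(axis=0)
    X = X.astype(x_dtype)
    triplets = np.asarray(triplets)
    anchors = X[triplets[:, 0]]
    positives = X[triplets[:, 1]]
    negatives = X[triplets[:, 2]]
    return anchors, positives, negatives


# Three molecules with two features each, and two index triplets.
X = np.array([[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]])
triplets = np.array([[0, 1, 2], [2, 1, 0]])
anchors, positives, negatives = triplets_to_arrays(X, triplets)
```

Each returned array has one row per triplet, so the three arrays stay aligned row by row.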