Classifier

Defines the main domain classification class

@author: J. Cid-Sueiro, J.A. Espinosa, A. Gallardo-Antolin, T.Ahlers

class src.domain_classifier.classifier.CorpusClassifier(df_dataset, model_type='mpnet', model_name='sentence-transformers/all-mpnet-base-v2', path2transformers='.', use_cuda=True)

Bases: object

A container of corpus classification methods

AL_sample(n_samples=5, sampler='extremes', p_ratio=0.8, top_prob=0.1)

Returns a given number of samples for active learning (AL)

Parameters:
  • n_samples (int, optional (default=5)) – Number of samples to return

  • sampler (str, optional (default=”extremes”)) – Sample selection algorithm.

    • If “random”, samples are taken at random from all docs with predictions.

    • If “extremes”, samples are taken stochastically, but documents with the highest or lowest probability scores are selected with higher probability.

    • If “full_rs”, samples are taken at random from the whole dataset for testing purposes. Half of the samples are taken at random from the train-test split, while the rest are taken from the other documents.

  • p_ratio (float, optional (default=0.8)) – Ratio of high-score samples. The rest will be low-score samples. (Only for sampler=’extremes’)

  • top_prob (float, optional (default=0.1)) – (Approximate) probability of selecting the doc with the highest score in a single sampling. This parameter is used to control the randomness of the stochastic sampling: if top_prob=1, the highest score samples are taken deterministically. top_prob=0 is equivalent to random sampling.

Returns:

df_out – Selected samples

Return type:

pandas.DataFrame

Notes

Besides the output dataframe, this method updates columns ‘sampler’ and ‘sampling_prob’ from self.dataframe.

  • ‘sampler’ stores the sampling method that selected the doc.

  • ‘sampling_prob’ is the probability with which each doc was selected. This is approximate, since the sampling of multiple documents is done without replacement, but it is reasonably accurate if the population sizes are large enough.
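The roles of p_ratio and top_prob in the “extremes” sampler can be pictured with the sketch below. It is an illustration only, not the library's implementation: the function name sample_extremes, the geometric rank weighting, and the use of a plain dict of scores instead of the dataframe are all assumptions. With this weighting, top_prob=1 picks the extreme documents deterministically, top_prob=0 gives uniform sampling, and for large populations the top-ranked document is drawn with probability close to top_prob.

```python
import random


def sample_extremes(scores, n_samples=5, p_ratio=0.8, top_prob=0.1, seed=None):
    """Hypothetical sketch: draw doc ids, favouring extreme scores.

    scores maps a doc id to its probability score. A fraction p_ratio of
    the samples comes from the high-score end, the rest from the low end.
    """
    rng = random.Random(seed)
    n_high = int(round(p_ratio * n_samples))
    n_low = n_samples - n_high

    # Doc ids ranked from highest to lowest score
    remaining = sorted(scores, key=scores.get, reverse=True)
    selected = []

    def draw(candidates):
        # Assumed weighting: rank k gets weight (1 - top_prob) ** k
        weights = [(1.0 - top_prob) ** k for k in range(len(candidates))]
        return rng.choices(candidates, weights=weights, k=1)[0]

    # Sampling without replacement: each drawn doc leaves the pool
    for _ in range(min(n_high, len(remaining))):
        doc = draw(remaining)
        remaining.remove(doc)
        selected.append(doc)
    for _ in range(min(n_low, len(remaining))):
        doc = draw(remaining[::-1])  # lowest scores ranked first
        remaining.remove(doc)
        selected.append(doc)
    return selected


scores = {i: i / 10 for i in range(10)}
print(sample_extremes(scores, top_prob=1.0))  # [9, 8, 7, 6, 0]
```

Because each document is removed from the pool once drawn, the per-draw probabilities shift slightly after every selection, which is why the stored ‘sampling_prob’ values can only be approximate.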

__init__(df_dataset, model_type='mpnet', model_name='sentence-transformers/all-mpnet-base-v2', path2transformers='.', use_cuda=True)

Initializes a classifier object

Parameters:
  • df_dataset (pandas.DataFrame) – Dataset with text and labels. It must contain at least two columns with names “text” and “labels”, with the input and the target labels for classification.

  • model_type (str, optional (default=”mpnet”)) – Type of transformer model.

  • model_name (str, optional (default=”sentence-transformers/all-mpnet-base-v2”)) – Name of the simpletransformers model.

  • path2transformers (pathlib.Path or str, optional (default=”.”)) – Path to the folder that will store all files produced by the simpletransformers library.

  • use_cuda (boolean, optional (default=True)) – If true, GPU will be used, if available.

Notes

Be aware that the simpletransformers library produces several folders, some of them containing large files. You may want to set path2transformers to a value other than ‘.’.

__weakref__

list of weak references to the object (if defined)

annotate(idx, labels, col='annotations')

Annotate the given labels in the given positions

Parameters:
  • idx (list of int) – Rows to locate the labels.

  • labels (list of int) – Labels to annotate

  • col (str, optional (default = ‘annotations’)) – Column in the dataframe where the labels will be annotated. If it does not exist, it is created.

eval_model(samples='train_test', tag='', batch_size=8)

Compute predictions of the classification model over the input dataset and compute performance metrics.

Parameters:
  • samples (str, optional (default=”train_test”)) – Samples to evaluate. If “train_test” only training and test samples are evaluated. Otherwise, all samples in df_dataset attribute are evaluated

  • tag (str) – Prefix of the score and prediction names. The scores and predictions will be saved in columns of self.df_dataset whose names start with this prefix.

  • batch_size (int, optional (default=8)) – Batch size

get_annotations(annot_name='annotations', include_text=True)

Returns the portion of self.dataset that contains annotated data

Parameters:
  • annot_name (str, optional (default=’annotations’)) – Name of the column in the pandas dataframe containing the annotations

  • include_text (bool, optional (default=True)) – If true, the text of the annotated document is included in the output dataframe. This is useful for a manual inspection of the annotations.

Returns:

df_annotation – The dataframe containing the annotations. All columns related to the annotation are returned.

Return type:

pandas.DataFrame

label2label_metrics(pred_name, true_label_name, subdataset, printout=True, use_sampling_probs=True)

Compute binary performance metrics (i.e. metrics based on the binary labels and predictions only)

Parameters:
  • pred_name (str in {‘PU’, ‘PN’}) – Label of the model to be evaluated.

  • true_label_name (str) – Name of the column to be used as a reference for evaluation

  • subdataset (str) – An indicator of the subdataset to be evaluated. It can take values ‘train’, ‘test’ or ‘unused’

  • printout (boolean, optional (default=True)) – If true, all metrics are printed (except the ROC values)

  • use_sampling_probs (boolean, optional (default=True)) – If true, metrics are weighted by the (inverse) sampling probabilities, if available. If true, unweighted metrics are computed too, and saved in entry ‘unweighted’ of the output dictionary, as complementary info.

Returns:

bmetrics – A dictionary of binary metrics

Return type:

dict
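The effect of use_sampling_probs can be illustrated with a small sketch. It assumes that each sample contributes to the confusion counts with weight 1 / sampling_prob; the function name binary_metrics_sketch and all dictionary keys other than ‘unweighted’ are hypothetical and may differ from what the method actually returns.

```python
def binary_metrics_sketch(y_true, y_pred, sampling_probs=None):
    """Hypothetical sketch of binary metrics from 0/1 labels and predictions.

    If sampling_probs is given, each sample is weighted by the inverse of
    its sampling probability, and the plain metrics are stored under the
    'unweighted' key as complementary info.
    """
    if sampling_probs is None:
        weights = [1.0] * len(y_true)
    else:
        weights = [1.0 / p for p in sampling_probs]

    # Weighted confusion counts
    tp = fp = tn = fn = 0.0
    for t, y, w in zip(y_true, y_pred, weights):
        if t == 1 and y == 1:
            tp += w
        elif t == 0 and y == 1:
            fp += w
        elif t == 0 and y == 0:
            tn += w
        else:
            fn += w

    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    metrics = {
        "accuracy": (tp + tn) / total if total > 0 else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
    if sampling_probs is not None:
        metrics["unweighted"] = binary_metrics_sketch(y_true, y_pred)
    return metrics


m = binary_metrics_sketch([1, 1, 0, 0], [1, 0, 1, 0],
                          sampling_probs=[0.5, 0.25, 0.5, 1.0])
print(m["precision"], m["unweighted"]["recall"])  # 0.5 0.5
```

Inverse-probability weighting compensates for the fact that an active learning sampler over-represents some documents, so the weighted metrics estimate performance on the underlying population rather than on the selected sample.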

load_model()

Loads an existing classification model

Return type:

No variables are returned. The loaded model is stored in attribute self.model

load_model_config()

Load configuration for model.

If there is no previous configuration, copy it from simpletransformers ClassificationModel and save it.

num_annotations()

Return the number of manual annotations available

performance_metrics(tag, true_label_name, subdataset, pred_name=None, score_name=None, printout=True, use_sampling_probs=True)

Compute performance metrics

Parameters:
  • tag (str in {‘PU’, ‘PN’}) – Label of the model to be evaluated.

  • true_label_name (str) – Name of the column to be used as a reference for evaluation

  • subdataset (str) – An indicator of the subdataset to be evaluated. It can take values ‘train’, ‘test’, ‘unused’, ‘notrain’ (which uses train and test) and ‘all’ (which uses all data)

  • printout (boolean, optional (default=True)) – If true, all metrics are printed (except the ROC values)

  • use_sampling_probs (boolean, optional (default=True)) – If true, metrics are weighted by the (inverse) sampling probabilities, if available. If true, unweighted metrics are computed too, and saved in entry ‘unweighted’ of the output dictionaries, as complementary info.

Returns:

  • bmetrics (dict) – A dictionary of binary metrics (i.e. metrics based on the binary labels and predictions only)

  • roc (dict) – A dictionary of score-based metrics (i.e. metrics based on the binary labels and predictions, and the scores (soft decisions) of the classifier)
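As an illustration of what score-based metrics involve, the sketch below builds ROC curve points and the area under the curve from binary labels and classifier scores. It is a generic textbook computation, not the method's code; the function names roc_points and auc_trapezoid are hypothetical, and the actual contents of the roc dictionary are not specified here.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by sweeping a threshold over scores."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("Both classes are needed to compute a ROC curve")

    # Visit samples from the highest to the lowest score
    pairs = sorted(zip(scores, y_true), key=lambda p: -p[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Samples with tied scores cross the threshold together
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
               for k in range(1, len(fpr)))


fpr, tpr = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_trapezoid(fpr, tpr))  # 0.75
```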

print_binary_metrics(bmetrics, tag='')

Pretty-prints the given metrics

Parameters:
  • bmetrics (dict) – Dictionary of metrics (produced by the binary_metrics() method)

  • tag (str, optional (default=””)) – Title to print as a header

retrain_model(freeze_encoder=True, batch_size=8, epochs=3, annotation_gain=10)

Re-train the classifier model using annotations

Parameters:
  • epochs (int, optional (default=3)) – Number of training epochs

  • freeze_encoder (bool, optional (default=True)) – If True, the embedding layer is frozen, so that only the classification layers are updated. This is useful to use precomputed embeddings for large datasets.

  • batch_size (int, optional (default=8)) – Batch size

  • annotation_gain (int or float, optional (default=10)) – Relative value of an annotated sample with respect to a non-annotated one.
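One way to read annotation_gain is as a per-sample weight in the training loss. The sketch below assumes exactly that: annotated samples count annotation_gain times as much as the others in a weighted binary cross-entropy. Whether the library applies the gain through loss weights, oversampling, or some other mechanism is not stated in this reference, and the function name weighted_log_loss is hypothetical.

```python
import math


def weighted_log_loss(labels, probs, is_annotated, annotation_gain=10):
    """Hypothetical sketch: cross-entropy with extra weight on annotations.

    labels are 0/1 targets, probs are predicted probabilities of class 1
    (strictly between 0 and 1), is_annotated flags manually labeled rows.
    """
    weights = [annotation_gain if a else 1.0 for a in is_annotated]
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for y, p in zip(labels, probs)]
    # Weighted average: an annotated sample counts annotation_gain times
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)
```

With annotation_gain=1 the expression reduces to the ordinary average cross-entropy, so the parameter only changes the balance between annotated and non-annotated samples.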

train_model(epochs=3, validate=True, freeze_encoder=True, tag='', batch_size=8)

Train binary text classification model based on transformers

Parameters:
  • epochs (int, optional (default=3)) – Number of training epochs

  • validate (bool, optional (default=True)) – If True, the model epoch is selected based on the F1 score computed over the test data. Otherwise, the model after the last epoch is taken

  • freeze_encoder (bool, optional (default=True)) – If True, the embedding layer is frozen, so that only the classification layers are updated. This is useful to use precomputed embeddings for large datasets.

  • tag (str, optional (default=””)) – A prefix that will be used for all result variables (scores and predictions) saved in the dataset dataframe

  • batch_size (int, optional (default=8)) – Batch size
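The effect of the validate flag can be summarised with a short sketch. It assumes that one F1 score per epoch has been computed over the test data; the function name select_epoch and the list-based interface are hypothetical and only illustrate the selection rule described above.

```python
def select_epoch(f1_per_epoch, validate=True):
    """Return the 0-based index of the epoch whose model is kept.

    With validate=True the epoch with the best test F1 is chosen (the
    earliest one in case of ties); otherwise the last epoch is used.
    """
    if not f1_per_epoch:
        raise ValueError("At least one epoch is required")
    if validate:
        return max(range(len(f1_per_epoch)), key=f1_per_epoch.__getitem__)
    return len(f1_per_epoch) - 1


print(select_epoch([0.6, 0.8, 0.7]))                  # 1
print(select_epoch([0.6, 0.8, 0.7], validate=False))  # 2
```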

train_test_split(max_imbalance=None, nmax=None, train_size=0.6, random_state=None)

Split dataframe dataset into train and test datasets, undersampling the negative class

Parameters:
  • max_imbalance (int or float or None, optional (default=None)) – Maximum ratio of negative vs positive samples. If the ratio in df_dataset is higher, the negative class is subsampled. If None, the original proportions are preserved

  • nmax (int or None (default=None)) – Maximum size of the whole (train+test) dataset

  • train_size (float or int (default=0.6)) – Size of the training set. If float in [0.0, 1.0], proportion of the dataset to include in the train split. If int, absolute number of train samples.

  • random_state (int or None (default=None)) – Controls the shuffling applied to the data before splitting. Pass an int for reproducible output across multiple function calls.

Returns:

No variables are returned. The dataset dataframe in self.df_dataset is updated with a new column ‘train_test’ taking values:

  • 0: if row is selected for training

  • 1: if row is selected for test

  • -1: otherwise
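The interplay of max_imbalance, nmax and train_size can be sketched as follows. This is a simplified stand-in that works on a plain list of 0/1 labels rather than on self.df_dataset; the function name split_flags and the order in which the limits are applied are assumptions, not the method's actual code.

```python
import random


def split_flags(labels, max_imbalance=None, nmax=None, train_size=0.6,
                random_state=None):
    """Hypothetical sketch: return one flag per row (0 train, 1 test, -1 unused)."""
    rng = random.Random(random_state)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]

    # Undersample the negative class if it exceeds the allowed ratio
    if max_imbalance is not None and len(neg) > max_imbalance * len(pos):
        neg = rng.sample(neg, int(max_imbalance * len(pos)))

    selected = pos + neg
    rng.shuffle(selected)
    if nmax is not None:
        selected = selected[:nmax]  # cap the train+test size

    # A float is a proportion, an int is an absolute number of train rows
    if isinstance(train_size, float):
        n_train = int(round(train_size * len(selected)))
    else:
        n_train = min(train_size, len(selected))

    flags = [-1] * len(labels)
    for k, i in enumerate(selected):
        flags[i] = 0 if k < n_train else 1
    return flags


labels = [1] * 10 + [0] * 90
flags = split_flags(labels, max_imbalance=2, random_state=0)
print(flags.count(0), flags.count(1), flags.count(-1))  # 18 12 70
```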

update_annotations(df_annotations, annot_name='annotations')

Updates self.df_dataset with the annotation data and metadata in the input dataframe.

Parameters:
  • df_annotations (pandas.DataFrame) – A dataframe of annotations.

  • annot_name (str, optional (default=’annotations’)) – Name of the column containing the class annotations

class src.domain_classifier.classifier.CorpusClassifierMLP(df_dataset, model_type='mpnet', model_name='sentence-transformers/all-mpnet-base-v2', path2transformers='.', use_cuda=True)

Bases: CorpusClassifier

__init__(df_dataset, model_type='mpnet', model_name='sentence-transformers/all-mpnet-base-v2', path2transformers='.', use_cuda=True)

Initializes a classifier object

Parameters:
  • df_dataset (pandas.DataFrame) – Dataset with text and labels. It must contain at least two columns with names “text” and “labels”, with the input and the target labels for classification.

  • model_type (str, optional (default=”mpnet”)) – Type of transformer model.

  • model_name (str, optional (default=”sentence-transformers/all-mpnet-base-v2”)) – Name of the simpletransformers model.

  • path2transformers (pathlib.Path or str, optional (default=”.”)) – Path to the folder that will store all files produced by the simpletransformers library.

  • use_cuda (boolean, optional (default=True)) – If true, GPU will be used, if available.

eval_model(samples='train_test', tag='', batch_size=8)

Compute predictions of the classification model over the input dataset and compute performance metrics.

Parameters:
  • samples (str, optional (default=”train_test”)) – Samples to evaluate. If “train_test” only training and test samples are evaluated. Otherwise, all samples in df_dataset attribute are evaluated

  • tag (str) – Prefix of the score and prediction names. The scores and predictions will be saved in columns of self.df_dataset whose names start with this prefix.

  • batch_size (int, optional (default=8)) – Batch size

inferData(dPaths)

Infers the dataset

Parameters:

dPaths

load_model()

Loads an existing classification model

Return type:

No variables are returned. The loaded model is stored in attribute self.model

load_model_config()

Not relevant for the MLP classifier. However, it is called by the superclass, so it must be defined here.

retrain_model(freeze_encoder=True, batch_size=8, epochs=3, annotation_gain=10)

Re-train the classifier model using annotations

Parameters:
  • epochs (int, optional (default=3)) – Number of training epochs

  • freeze_encoder (bool, optional (default=True)) – If True, the embedding layer is frozen, so that only the classification layers are updated. This is useful to use precomputed embeddings for large datasets.

  • batch_size (int, optional (default=8)) – Batch size

  • annotation_gain (int or float, optional (default=10)) – Relative value of an annotated sample with respect to a non-annotated one.

train_model(epochs=3, validate=True, freeze_encoder=True, tag='', batch_size=8)

Train binary text classification model based on transformers

Parameters:
  • epochs (int, optional (default=3)) – Number of training epochs

  • validate (bool, optional (default=True)) – If True, the model epoch is selected based on the F1 score computed over the test data. Otherwise, the model after the last epoch is taken

  • freeze_encoder (bool, optional (default=True)) – If True, the embedding layer is frozen, so that only the classification layers are updated. This is useful to use precomputed embeddings for large datasets.

  • tag (str, optional (default=””)) – A prefix that will be used for all result variables (scores and predictions) saved in the dataset dataframe

  • batch_size (int, optional (default=8)) – Batch size

train_test_split(max_imbalance=None, nmax=None, train_size=0.5, random_state=None)

Split dataframe dataset into train and test datasets

Parameters:
  • max_imbalance (int or float or None, optional (default=None)) – Maximum ratio of negative vs positive samples. If the ratio in df_dataset is higher, the negative class is subsampled. If None, the original proportions are preserved

  • nmax (int or None (default=None)) – Maximum size of the whole (train+test) dataset

  • train_size (float or int (default=0.5)) – Size of the training set. If float in [0.0, 1.0], proportion of the dataset to include in the train split. If int, absolute number of train samples.

  • random_state (int or None (default=None)) – Controls the shuffling applied to the data before splitting. Pass an int for reproducible output across multiple function calls.

Return type:

No variables are returned.