Task Manager

Defines classes whose methods run the main tasks of the project, using the core processing classes and methods.

@author: J. Cid-Sueiro, L. Calvo-Bartolome, A. Gallardo-Antolin, T.Ahlers

class src.task_manager.TaskManager(path2project, path2source=None, path2zeroshot=None, config_fname='parameters.yaml', metadata_fname='metadata.yaml', set_logs=True)

Bases: baseTaskManager

This class extends the functionality of the baseTaskManager class for a specific example application.

This class inherits from the baseTaskManager class, which provides the basic methods to create, load and set up an application project.

The behavior of this class might depend on the state of the project, in dictionary self.state, with the following entries:

  • ‘isProject’ : If True, project created. Metadata variables loaded

  • ‘configReady’ : If True, config file successfully loaded. Datamanager activated.

__init__(path2project, path2source=None, path2zeroshot=None, config_fname='parameters.yaml', metadata_fname='metadata.yaml', set_logs=True)

Opens a task manager object.

Parameters:
  • path2project (pathlib.Path) – Path to the application project

  • path2source (str or pathlib.Path or None (default=None)) – Path to the folder containing the data sources

  • path2zeroshot (str or pathlib.Path or None (default=None)) – Path to the folder containing the zero-shot-model

  • config_fname (str, optional (default=’parameters.yaml’)) – Name of the configuration file

  • metadata_fname (str or None, optional (default=’metadata.yaml’)) – Name of the project metadata file. If None, no metadata file is used.

  • set_logs (bool, optional (default=True)) – If True, logger objects are created according to the parameters specified in the configuration file

analyze_keywords(wt: float = 2.0, keywords: str = '')

Get a set of positive labels using keyword-based search

Parameters:
  • wt (float, optional (default=2)) – Weighting factor for the title components. Keyword matches with title words are weighted by this factor

  • keywords (str, optional (default= “”)) – A comma-separated string of keywords. If the string is empty, the keywords are read from self.keywords

annotate()

Save user-provided labels in the dataset

evaluate_PUlabels(true_label_name: str)

Evaluate the current set of PU labels

evaluate_PUmodel(samples: str = 'train_test')

Evaluate a domain classifier

export_annotations(domain_name: str)

Exports annotations to a file in the dataset folder.

This is useful for sharing annotations between projects.

get_feedback(sampler=None)

Gets some labels from a user for a selected subset of documents

get_labels_by_keywords(wt: float = 2.0, n_max: int = 2000, s_min: float = 1.0, tag: str = 'kwds', method: str = 'count', keywords: str = '')

Get a set of positive labels using keyword-based search

Parameters:
  • wt (float, optional (default=2)) – Weighting factor for the title components. Keyword matches with title words are weighted by this factor

  • n_max (int or None, optional (default=2000)) – Maximum number of elements in the output list. The default is a huge number that, in practice, means there is no limit

  • s_min (float, optional (default=1)) – Minimum score. Only elements strictly above s_min are selected

  • tag (str, optional (default=’kwds’)) – Name of the output label set.

  • method (‘embedding’ or ‘count’, optional) – Selection method: ‘count’ (based on counting occurrences of keywords in docs) or ‘embedding’ (based on the computation of similarities between doc and keyword embeddings)

  • keywords (str, optional (default=””)) – A comma-separated string of keywords. If the string is empty, the keywords are read from self.keywords
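
How wt, s_min and n_max interact can be illustrated with a minimal sketch of a count-based selection. This is not the project's implementation: the function names parse_keywords and score_by_keywords, the substring-based counting and the (title, body) document layout are assumptions made for illustration only.

```python
def parse_keywords(keywords: str) -> list:
    """Split a comma-separated string into clean, lowercase keywords."""
    return [k.strip().lower() for k in keywords.split(",") if k.strip()]


def score_by_keywords(docs, keywords: str, wt: float = 2.0,
                      n_max=2000, s_min: float = 1.0) -> list:
    """Return indices of docs with score strictly above s_min, best first.

    docs is a list of (title, body) pairs (a hypothetical layout).
    """
    kwds = parse_keywords(keywords)
    scores = []
    for title, body in docs:
        t, b = title.lower(), body.lower()
        # Title matches are weighted by wt, body matches count once
        scores.append(sum(wt * t.count(k) + b.count(k) for k in kwds))
    # Keep only scores strictly above s_min
    selected = [i for i, s in enumerate(scores) if s > s_min]
    # Sort by descending score (ties keep the original order)
    selected.sort(key=lambda i: -scores[i])
    return selected if n_max is None else selected[:n_max]


docs = [
    ("Deep learning for vision", "We study deep learning models."),
    ("Soil chemistry", "A note on learning outcomes."),
    ("Learning to rank", "Ranking with neural networks."),
]
selected = score_by_keywords(docs, "deep learning, neural")
```

In this toy corpus the third document scores exactly 1.0, so it is excluded under the default s_min=1 because the comparison is strict; lowering s_min to 0.5 would include it.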

get_labels_by_topics(topic_weights, n_max: int = 2000, s_min: float = 1.0, tag: str = 'tpcs')

Get a set of positive labels from a weighted list of topics

Parameters:
  • topic_weights (str or dict) – If dict, a dictionary topics: weights. If str, a string of comma-separated topics and weights: t1, w1, t2, w2, …

  • n_max (int or None, optional (default=2000)) – Maximum number of elements in the output list. The default is a huge number that, in practice, means there is no limit

  • s_min (float, optional (default=1)) – Minimum score. Only elements strictly above s_min are selected

  • tag (str, optional (default=’tpcs’)) – Name of the output label set.
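
The two accepted forms of topic_weights can be reduced to a single dictionary before scoring. The sketch below shows one way to do so; the helper name parse_topic_weights is hypothetical, and treating topics as integer identifiers is an assumption not stated in this documentation.

```python
def parse_topic_weights(topic_weights) -> dict:
    """Normalize topic_weights to a dict {topic_id: weight}.

    Assumes (hypothetically) that topics are integer identifiers.
    """
    if isinstance(topic_weights, dict):
        return {int(t): float(w) for t, w in topic_weights.items()}
    # String form: "t1, w1, t2, w2, ..."
    items = [x.strip() for x in topic_weights.split(",") if x.strip()]
    if len(items) % 2 != 0:
        raise ValueError("Expected pairs: t1, w1, t2, w2, ...")
    # Even positions hold topics, odd positions hold their weights
    return {int(t): float(w) for t, w in zip(items[0::2], items[1::2])}


weights = parse_topic_weights("3, 0.5, 7, 1.5")
```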

get_labels_by_zeroshot(n_max: int = 2000, s_min: float = 0.1, tag: str = 'zeroshot', keywords: str = '')

Get a set of positive labels using a zero-shot classification model

Parameters:
  • n_max (int or None, optional (default=2000)) – Maximum number of elements in the output list. The default is a huge number that, in practice, means there is no limit

  • s_min (float, optional (default=0.1)) – Minimum score. Only elements strictly above s_min are selected

  • tag (str, optional (default=’zeroshot’)) – Name of the output label set.

  • keywords (str, optional (default=””)) – A comma-separated string of keywords. If the string is empty, the keywords are read from self.keywords

get_labels_from_docs()

Requests feedback about the class of given documents.

This method assumes user querying through the command window. It should be overridden by inheriting classes to adapt to the specific UI of the application.

Returns:

labels – Labels for the given documents, in the same order as the documents in the input dataframe

Return type:

list of boolean
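
A command-window query that returns one boolean per document could look like the sketch below. It is only an illustration of the described behavior, not the method's actual code: the function name ask_labels, the prompt text, and the injectable ask callable (added here so the query can be replaced by another UI or by a test) are all assumptions.

```python
def ask_labels(docs, ask=input):
    """Query one yes/no answer per document and return a list of bool."""
    labels = []
    for doc in docs:
        answer = ""
        # Repeat the question until the answer is y or n
        while answer not in ("y", "n"):
            prompt = f"{doc}\nDoes it belong to the class? [y/n] "
            answer = ask(prompt).strip().lower()
        labels.append(answer == "y")
    return labels


# Non-interactive usage: answers come from a list instead of the keyboard
answers = iter(["y", "maybe", "n"])
labels = ask_labels(["doc A", "doc B"], ask=lambda prompt: next(answers))
```

The labels come back in the same order as the input documents; the invalid answer "maybe" is simply asked again.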

import_AI_subcorpus()

Import a subcorpus of documents related to AI.

This method is very specific. Loads a subcorpus from EU_projects that is available from file. Not to be used for other corpora or other target domains.

import_annotations(domain_name: str)

Imports annotations from a file in the dataset folder.

This is useful for sharing annotations between projects.

Parameters:

domain_name (str) – Name of the domain

inference(option=[])

Infers data

Parameters:

option – Unused

load()

Extends the load method from the parent class to load the project corpus and the dataset (if any)

load_corpus(corpus_name: str)

Loads a dataframe of documents from a given corpus.

Parameters:

corpus_name (str) – Name of the corpus. It should be the name of a folder in self.path2source

load_labels(class_name)

Load a set of labels and its corresponding dataset (if it exists)

Parameters:

class_name (str) – Name of the target category

performance_metrics_PN()

Compute all performance metrics based on the data available at the current dataset.

performance_metrics_PU()

Compute all performance metrics for the PU model, based on the data available at the current dataset

reevaluate_model(samples: str = 'train_test')

Evaluate a domain classifier

reset_labels(class_name)

Reset all labels and models associated to a given category

Parameters:

class_name (str) – Name of the category to be removed.

retrain_model(epochs: int = 3)

Improves classifier performance using the labels provided by users

sample_documents(sampler=None)

Gets some labels from a user for a selected subset of documents

setup()

Sets up the application project. To do so, it loads the configuration file and activates the logger objects.

train_PUmodel(max_imbalance: float = 3.0, nmax: int = 400, epochs: int = 3)

Train a domain classifier

Parameters:
  • max_imbalance (float, optional (default=3.0)) – Maximum ratio negative vs positive samples. If the ratio in df_dataset is higher, the negative class is subsampled. If None, the original proportions are preserved

  • nmax (int, optional (default=400)) – Maximum size of the whole (train+test) dataset

  • epochs (int, optional (default=3)) – Number of training epochs
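
The effect of max_imbalance and nmax on the dataset can be sketched with plain Python lists. This is an illustration under stated assumptions, not the project's code: the function name subsample, the seed argument and the use of uniform random sampling for both steps are hypothetical choices.

```python
import random


def subsample(labels, max_imbalance=3.0, nmax=400, seed=0):
    """Return sorted indices of the samples kept for training and testing.

    labels is a list of 0/1 values (1 = positive). Hypothetical helper.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]

    # Subsample negatives if the negative/positive ratio exceeds max_imbalance
    if max_imbalance is not None and len(neg) > max_imbalance * len(pos):
        neg = rng.sample(neg, int(max_imbalance * len(pos)))

    kept = pos + neg
    # Cap the whole (train+test) dataset at nmax samples
    if len(kept) > nmax:
        kept = rng.sample(kept, nmax)
    return sorted(kept)


labels = [1] * 10 + [0] * 90
kept = subsample(labels)
n_pos = sum(labels[i] for i in kept)
n_neg = len(kept) - n_pos
```

With 10 positives and 90 negatives, the default max_imbalance=3 keeps all positives and 30 negatives. Passing max_imbalance=None preserves the original proportions until the nmax cap applies.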

class src.task_manager.TaskManagerCMD(path2project, path2source=None, path2zeroshot=None, config_fname='parameters.yaml', metadata_fname='metadata.yaml', set_logs=True)

Bases: TaskManager

Provides extra functionality to the task manager, requesting parameters from users from a command window.

__init__(path2project, path2source=None, path2zeroshot=None, config_fname='parameters.yaml', metadata_fname='metadata.yaml', set_logs=True)

Opens a task manager object.

Parameters:
  • path2project (pathlib.Path) – Path to the application project

  • path2source (str or pathlib.Path or None (default=None)) – Path to the folder containing the data sources

  • path2zeroshot (str or pathlib.Path or None (default=None)) – Path to the folder containing the zero-shot-model

  • config_fname (str, optional (default=’parameters.yaml’)) – Name of the configuration file

  • metadata_fname (str or None, optional (default=’metadata.yaml’)) – Name of the project metadata file. If None, no metadata file is used.

  • set_logs (bool, optional (default=True)) – If True, logger objects are created according to the parameters specified in the configuration file

analyze_keywords()

Get a set of positive labels using keyword-based search

export_annotations()

Exports annotations to a file in the dataset folder.

This is useful for sharing annotations between projects.

get_labels_by_keywords()

Get a set of positive labels using keyword-based search

get_labels_by_topics()

Get a set of positive labels from a weighted list of topics

get_labels_by_zeroshot()

Get a set of positive labels using a zero-shot classification model

train_PUmodel()

Train a domain classifier

class src.task_manager.TaskManagerGUI(path2project, path2source=None, path2zeroshot=None, config_fname='parameters.yaml', metadata_fname='metadata.yaml', set_logs=True)

Bases: TaskManager

Provides extra functionality to the task manager, to be used by the Graphical User Interface (GUI)

get_feedback(idx, labels)

Gets some labels from a user for a selected subset of documents

Notes

In comparison to the corresponding parent method, STEPS 1 and 2 are carried out directly through the GUI

get_labels_by_keywords(keywords, wt, n_max, s_min, tag, method)

Get a set of positive labels using keyword-based search through the MainWindow

Parameters:
  • keywords (list of str) – List of keywords

  • wt (float, optional (default=2)) – Weighting factor for the title components. Keyword matches with title words are weighted by this factor

  • n_max (int or None, optional (default=2000)) – Maximum number of elements in the output list. The default is a huge number that, in practice, means there is no limit

  • s_min (float, optional (default=1)) – Minimum score. Only elements strictly above s_min are selected

  • tag (str) – Name of the output label set.

  • method (‘embedding’ or ‘count’, optional) – Selection method: ‘count’ (based on counting occurrences of keywords in docs) or ‘embedding’ (based on the computation of similarities between doc and keyword embeddings)

get_labels_by_zeroshot(keywords, n_max, s_min, tag)

Get a set of positive labels using a zero-shot classification model

Parameters:
  • keywords (list of str) – List of keywords

  • n_max (int or None, optional (default=2000)) – Maximum number of elements in the output list. The default is a huge number that, in practice, means there is no limit

  • s_min (float, optional (default=0.1)) – Minimum score. Only elements strictly above s_min are selected

  • tag (str) – Name of the output label set.

get_suggested_keywords()

Get the list of suggested keywords to show in the GUI.

Returns:

suggested_keywords – List of suggested keywords

Return type:

list of str

get_topic_words()

Get a set of positive labels from a weighted list of topics

train_PUmodel(max_imabalance, nmax)

Train a domain classifier

Parameters:
  • max_imabalance (int (default 3)) – Maximum ratio negative vs positive samples in the training set

  • nmax (int (default = 400)) – Maximum number of documents in the training set.