Task Manager¶
Defines classes whose methods run the main tasks of the project, using the core processing classes and methods.
@author: J. Cid-Sueiro, L. Calvo-Bartolome, A. Gallardo-Antolin, T. Ahlers
- class src.task_manager.TaskManager(path2project, path2source=None, path2zeroshot=None, config_fname='parameters.yaml', metadata_fname='metadata.yaml', set_logs=True)¶
Bases:
baseTaskManager
This class extends the functionality of the baseTaskManager class for a specific example application.
It inherits from baseTaskManager, which provides the basic methods to create, load and set up an application project.
The behavior of this class may depend on the state of the project, stored in the dictionary self.state, which has the following entries:
- ‘isProject’ : If True, the project has been created and the metadata variables loaded.
- ‘configReady’ : If True, the config file has been successfully loaded and the data manager activated.
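A minimal sketch of how the two state entries might be consulted before running a task. The can_run helper and its requires argument are hypothetical and are not part of the documented API; only the keys ‘isProject’ and ‘configReady’ come from the description above.

```python
# Hypothetical illustration of the project state dictionary described above.
state = {"isProject": False, "configReady": False}


def can_run(state, requires):
    """Return True only if every required state flag is set to True."""
    return all(state.get(key, False) for key in requires)


# Before the project is created, tasks that need the metadata cannot run.
print(can_run(state, ["isProject"]))

# After the project is created and the configuration file is loaded:
state["isProject"] = True
state["configReady"] = True
print(can_run(state, ["isProject", "configReady"]))
```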
- __init__(path2project, path2source=None, path2zeroshot=None, config_fname='parameters.yaml', metadata_fname='metadata.yaml', set_logs=True)¶
Opens a task manager object.
- Parameters:
path2project (pathlib.Path) – Path to the application project
path2source (str or pathlib.Path or None (default=None)) – Path to the folder containing the data sources
path2zeroshot (str or pathlib.Path or None (default=None)) – Path to the folder containing the zero-shot-model
config_fname (str, optional (default=’parameters.yaml’)) – Name of the configuration file
metadata_fname (str or None, optional (default=’metadata.yaml’)) – Name of the project metadata file. If None, no metadata file is used.
set_logs (bool, optional (default=True)) – If True, logger objects are created according to the parameters specified in the configuration file
- analyze_keywords(wt: float = 2.0, keywords: str = '')¶
Get a set of positive labels using keyword-based search
- Parameters:
wt (float, optional (default=2)) – Weighting factor for the title components. Keyword matches with title words are weighted by this factor
keywords (str, optional (default= “”)) – A comma-separated string of keywords. If the string is empty, the keywords are read from self.keywords
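The handling of the keywords argument can be sketched as follows. This is an illustration under stated assumptions, not the library's code: parse_keywords is a hypothetical helper, and default_keywords stands in for self.keywords.

```python
def parse_keywords(keywords, default_keywords):
    """Split a comma-separated keyword string into a clean list.

    If the string is empty (or only whitespace), fall back to the
    default list, which plays the role of self.keywords here.
    """
    if not keywords.strip():
        return list(default_keywords)
    # Drop surrounding whitespace and ignore empty fragments.
    return [k.strip() for k in keywords.split(",") if k.strip()]


stored = ["artificial intelligence"]  # stand-in for self.keywords
print(parse_keywords("machine learning, neural networks", stored))
print(parse_keywords("", stored))
```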
- annotate()¶
Save user-provided labels in the dataset
- evaluate_PUlabels(true_label_name: str)¶
Evaluate the current set of PU labels
- evaluate_PUmodel(samples: str = 'train_test')¶
Evaluate a domain classifier
- export_annotations(domain_name: str)¶
Exports annotations to a file in the dataset folder.
This is useful for sharing annotations across projects.
- get_feedback(sampler=None)¶
Gets some labels from a user for a selected subset of documents
- get_labels_by_keywords(wt: float = 2.0, n_max: int = 2000, s_min: float = 1.0, tag: str = 'kwds', method: str = 'count', keywords: str = '')¶
Get a set of positive labels using keyword-based search
- Parameters:
wt (float, optional (default=2)) – Weighting factor for the title components. Keyword matches with title words are weighted by this factor
n_max (int or None, optional (default=2000)) – Maximum number of elements in the output list. The default is a huge number that, in practice, means there is no limit
s_min (float, optional (default=1)) – Minimum score. Only elements strictly above s_min are selected
tag (str, optional (default=’kwds’)) – Name of the output label set.
method (‘embedding’ or ‘count’, optional (default=’count’)) – Selection method: ‘count’ (based on counting occurrences of keywords in docs) or ‘embedding’ (based on the computation of similarities between doc and keyword embeddings)
keywords (str, optional (default=””)) – A comma-separated string of keywords. If the string is empty, the keywords are read from self.keywords
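The roles of wt, s_min and n_max with the ‘count’ method can be illustrated with a small sketch. It rests on several assumptions that the text above does not confirm: keyword matching is done by case-insensitive substring counting, documents are ranked by decreasing score before n_max is applied, and the helper names count_score and select_positives are hypothetical.

```python
def count_score(title, body, keywords, wt=2.0):
    """Count keyword occurrences; matches in the title are weighted by wt."""
    title_l, body_l = title.lower(), body.lower()
    score = 0.0
    for kw in keywords:
        kw_l = kw.lower()
        score += wt * title_l.count(kw_l) + body_l.count(kw_l)
    return score


def select_positives(scores, n_max=2000, s_min=1.0):
    """Return doc ids with score strictly above s_min, best first, at most n_max."""
    above = [(doc_id, s) for doc_id, s in scores.items() if s > s_min]
    above.sort(key=lambda pair: pair[1], reverse=True)
    if n_max is not None:
        above = above[:n_max]
    return [doc_id for doc_id, _ in above]


docs = {
    "d1": ("Deep learning for vision", "We train deep networks."),
    "d2": ("Crop yields", "A study of deep soil layers."),
    "d3": ("Tax policy", "No relevant terms here."),
}
scores = {k: count_score(t, b, ["deep"]) for k, (t, b) in docs.items()}
print(scores)
print(select_positives(scores))
```

In this toy corpus, d2 scores exactly 1.0 and is therefore rejected, because only scores strictly above s_min are kept.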
- get_labels_by_topics(topic_weights, n_max: int = 2000, s_min: float = 1.0, tag: str = 'tpcs')¶
Get a set of positive labels from a weighted list of topics
- Parameters:
topic_weights (str or dict) – If dict, a dictionary topics: weights. If str, a string of comma-separated topics and weights: t1, w1, t2, w2, …
n_max (int or None, optional (default=2000)) – Maximum number of elements in the output list. The default is a huge number that, in practice, means there is no limit
s_min (float, optional (default=1)) – Minimum score. Only elements strictly above s_min are selected
tag (str, optional (default=’tpcs’)) – Name of the output label set.
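The two accepted forms of topic_weights can be normalized to a single dictionary, as in the sketch below. The helper parse_topic_weights is hypothetical, and the sketch assumes that topics are identified by integer ids, which the description above does not state.

```python
def parse_topic_weights(topic_weights):
    """Normalize topic weights to a dict {topic_id: weight}.

    Accepts either a dict or a string of the form "t1, w1, t2, w2, ...".
    Assumption: topic ids are integers.
    """
    if isinstance(topic_weights, dict):
        return {int(t): float(w) for t, w in topic_weights.items()}
    tokens = [tok.strip() for tok in topic_weights.split(",") if tok.strip()]
    if len(tokens) % 2 != 0:
        raise ValueError("Expected an even number of comma-separated values")
    # Even positions hold topic ids, odd positions hold their weights.
    return {int(t): float(w) for t, w in zip(tokens[0::2], tokens[1::2])}


print(parse_topic_weights("0, 0.5, 3, 1.5"))
print(parse_topic_weights({0: 0.5, 3: 1.5}))
```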
- get_labels_by_zeroshot(n_max: int = 2000, s_min: float = 0.1, tag: str = 'zeroshot', keywords: str = '')¶
Get a set of positive labels using a zero-shot classification model
- Parameters:
n_max (int or None, optional (default=2000)) – Maximum number of elements in the output list. The default is a huge number that, in practice, means there is no limit
s_min (float, optional (default=0.1)) – Minimum score. Only elements strictly above s_min are selected
tag (str, optional (default=’zeroshot’)) – Name of the output label set.
keywords (str, optional (default=””)) – A comma-separated string of keywords. If the string is empty, the keywords are read from self.keywords
- get_labels_from_docs()¶
Requests feedback about the class of given documents.
This method assumes that the user is queried through the command window. It should be overridden by inheriting classes to adapt it to the specific UI of the application
- Returns:
labels – Labels for the given documents, in the same order as the documents in the input dataframe
- Return type:
list of boolean
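A rough sketch of command-window querying that returns one boolean per document, in input order. The function ask_labels, its prompt text and the ask argument are hypothetical; passing the input function as an argument is one possible way to let a subclass or a test replace the command window with another UI.

```python
def ask_labels(docs, ask=input):
    """Ask a yes/no question for each document and return a list of bool.

    The labels are returned in the same order as the documents.
    """
    labels = []
    for doc in docs:
        answer = ask(f"Does this document belong to the class? [y/n]\n{doc}\n> ")
        labels.append(answer.strip().lower().startswith("y"))
    return labels


# Scripted answers stand in for a user typing at the command window.
scripted = iter(["y", "n", "Yes"])
print(ask_labels(["doc A", "doc B", "doc C"], ask=lambda prompt: next(scripted)))
```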
- import_AI_subcorpus()¶
Import a subcorpus of documents related to AI.
This method is very specific. Loads a subcorpus from EU_projects that is available from file. Not to be used for other corpora or other target domains.
- import_annotations(domain_name: str)¶
Imports annotations from a file in the dataset folder.
This is useful for sharing annotations across projects.
- Parameters:
domain_name (str) – Name of the domain
- inference(option=[])¶
Infers data
- Parameters:
option – Unused
- load()¶
Extends the load method from the parent class to load the project corpus and the dataset (if any)
- load_corpus(corpus_name: str)¶
Loads a dataframe of documents from a given corpus.
- Parameters:
corpus_name (str) – Name of the corpus. It should be the name of a folder in self.path2source
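Since corpus_name should be the name of a folder in self.path2source, the valid names can be found by listing its subfolders. The sketch below is only an illustration: list_corpora is a hypothetical helper, and the temporary directory stands in for the real source folder.

```python
import tempfile
from pathlib import Path


def list_corpora(path2source):
    """Return the sorted names of the subfolders of path2source."""
    root = Path(path2source)
    return sorted(p.name for p in root.iterdir() if p.is_dir())


# Build a throwaway source folder with two corpora and one stray file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "EU_projects").mkdir()
    (Path(tmp) / "other_corpus").mkdir()
    (Path(tmp) / "notes.txt").write_text("not a corpus")
    print(list_corpora(tmp))
```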
- load_labels(class_name)¶
Load a set of labels and its corresponding dataset (if it exists)
- Parameters:
class_name (str) – Name of the target category
- performance_metrics_PN()¶
Compute all performance metrics based on the data available at the current dataset.
- performance_metrics_PU()¶
Compute all performance metrics for the PU model, based on the data available at the current dataset
- reevaluate_model(samples: str = 'train_test')¶
Evaluate a domain classifier
- reset_labels(class_name)¶
Reset all labels and models associated to a given category
- Parameters:
class_name (str) – Name of the category to be removed.
- retrain_model(epochs: int = 3)¶
Improves classifier performance using the labels provided by users
- sample_documents(sampler=None)¶
Gets some labels from a user for a selected subset of documents
- setup()¶
Sets up the application project. To do so, it loads the configuration file and activates the logger objects.
- train_PUmodel(max_imbalance: float = 3.0, nmax: int = 400, epochs: int = 3)¶
Train a domain classifier
- Parameters:
max_imbalance (float, optional (default=3.0)) – Maximum ratio of negative to positive samples. If the ratio in df_dataset is higher, the negative class is subsampled. If None, the original proportions are preserved
nmax (int, optional (default=400)) – Maximum size of the whole (train+test) dataset
epochs (int, optional (default=3)) – Number of training epochs
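The effect of max_imbalance and nmax on the dataset can be sketched as below. This is not the library's implementation: the subsample helper is hypothetical, labels are assumed to be 0/1, and uniform random sampling is assumed both for reducing the negative class and for enforcing nmax.

```python
import random


def subsample(labels, max_imbalance=3.0, nmax=400, seed=0):
    """Return the sorted indices of the samples kept for train+test.

    Negatives are subsampled when they exceed max_imbalance times the
    number of positives; the result is then capped at nmax samples.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # With max_imbalance=None the original proportions are preserved.
    if max_imbalance is not None and pos and len(neg) > max_imbalance * len(pos):
        neg = rng.sample(neg, int(max_imbalance * len(pos)))
    selected = pos + neg
    if nmax is not None and len(selected) > nmax:
        selected = rng.sample(selected, nmax)
    return sorted(selected)


labels = [1] * 10 + [0] * 100
kept = subsample(labels)
print(len(kept), sum(labels[i] for i in kept))
```

With 10 positives and 100 negatives, the default ratio of 3.0 keeps all positives and 30 negatives, which is below the default nmax.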
- class src.task_manager.TaskManagerCMD(path2project, path2source=None, path2zeroshot=None, config_fname='parameters.yaml', metadata_fname='metadata.yaml', set_logs=True)¶
Bases:
TaskManager
Provides extra functionality to the task manager, requesting parameters from users from a command window.
- __init__(path2project, path2source=None, path2zeroshot=None, config_fname='parameters.yaml', metadata_fname='metadata.yaml', set_logs=True)¶
Opens a task manager object.
- Parameters:
path2project (pathlib.Path) – Path to the application project
path2source (str or pathlib.Path or None (default=None)) – Path to the folder containing the data sources
path2zeroshot (str or pathlib.Path or None (default=None)) – Path to the folder containing the zero-shot-model
config_fname (str, optional (default=’parameters.yaml’)) – Name of the configuration file
metadata_fname (str or None, optional (default=’metadata.yaml’)) – Name of the project metadata file. If None, no metadata file is used.
set_logs (bool, optional (default=True)) – If True, logger objects are created according to the parameters specified in the configuration file
- analyze_keywords()¶
Get a set of positive labels using keyword-based search
- export_annotations()¶
Exports annotations to a file in the dataset folder.
This is useful for sharing annotations across projects.
- get_labels_by_keywords()¶
Get a set of positive labels using keyword-based search
- get_labels_by_topics()¶
Get a set of positive labels from a weighted list of topics
- get_labels_by_zeroshot()¶
Get a set of positive labels using a zero-shot classification model
- train_PUmodel()¶
Train a domain classifier
- class src.task_manager.TaskManagerGUI(path2project, path2source=None, path2zeroshot=None, config_fname='parameters.yaml', metadata_fname='metadata.yaml', set_logs=True)¶
Bases:
TaskManager
Provides extra functionality to the task manager, to be used by the Graphical User Interface (GUI)
- get_feedback(idx, labels)¶
Gets some labels from a user for a selected subset of documents
Notes
In comparison to the corresponding parent method, STEPS 1 and 2 are carried out directly through the GUI
- get_labels_by_keywords(keywords, wt, n_max, s_min, tag, method)¶
Get a set of positive labels using keyword-based search through the MainWindow
- Parameters:
keywords (list of str) – List of keywords
wt (float, optional (default=2)) – Weighting factor for the title components. Keyword matches with title words are weighted by this factor
n_max (int or None, optional (default=2000)) – Maximum number of elements in the output list. The default is a huge number that, in practice, means there is no limit
s_min (float, optional (default=1)) – Minimum score. Only elements strictly above s_min are selected
tag (str) – Name of the output label set.
method (‘embedding’ or ‘count’, optional) – Selection method: ‘count’ (based on counting occurrences of keywords in docs) or ‘embedding’ (based on the computation of similarities between doc and keyword embeddings)
- get_labels_by_zeroshot(keywords, n_max, s_min, tag)¶
Get a set of positive labels using a zero-shot classification model
- Parameters:
keywords (list of str) – List of keywords
n_max (int or None, optional (default=2000)) – Maximum number of elements in the output list. The default is a huge number that, in practice, means there is no limit
s_min (float, optional (default=0.1)) – Minimum score. Only elements strictly above s_min are selected
tag (str) – Name of the output label set.
- get_suggested_keywords()¶
Get the list of suggested keywords to show in the GUI.
- Returns:
suggested_keywords – List of suggested keywords
- Return type:
list of str
- get_topic_words()¶
Get a set of positive labels from a weighted list of topics
- train_PUmodel(max_imabalance, nmax)¶
Train a domain classifier
- Parameters:
max_imabalance (int (default 3)) – Maximum ratio of negative to positive samples in the training set
nmax (int (default = 400)) – Maximum number of documents in the training set.