Data Manager

Defines a data manager class to provide basic read-write functionality for the project.

@author: J. Cid-Sueiro, A. Gallardo-Antolin, T. Ahlers

class src.data_manager.DataManager(path2source, path2datasets, path2models, path2embeddings=None)

Bases: object

This class contains all read / write functionalities required by the domain_classification project.

It assumes that source and destination data will be stored in files.

__init__(path2source, path2datasets, path2models, path2embeddings=None)

Initializes the data manager object

Parameters:

path2source (str or pathlib.Path) – Path to the folder containing all external source data
path2datasets (str or pathlib.Path) – Path to the folder containing datasets
path2models (str or pathlib.Path) – Path to the folder containing classifier models
path2embeddings (str or pathlib.Path) – Path to the folder containing the document embeddings
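
As a rough illustration of the constructor contract above, the sketch below stores the four folders as pathlib.Path objects. DataManagerSketch is a hypothetical stand-in, not the project's class, and the folder names are made up for the example; the real implementation may normalize or validate the paths differently.

```python
from pathlib import Path


class DataManagerSketch:
    """Hypothetical stand-in mirroring the documented constructor signature."""

    def __init__(self, path2source, path2datasets, path2models,
                 path2embeddings=None):
        # Accept either str or pathlib.Path, as the parameter list allows.
        self.path2source = Path(path2source)
        self.path2datasets = Path(path2datasets)
        self.path2models = Path(path2models)
        # The embeddings folder is optional, so None is preserved as-is.
        self.path2embeddings = (
            None if path2embeddings is None else Path(path2embeddings))


# Illustrative folder names only (assumptions, not project defaults).
dm = DataManagerSketch("data/source", "data/datasets", Path("data/models"))
```

Converting everything to Path once, in the constructor, means later code can join sub-paths with the / operator regardless of what the caller passed in.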

__weakref__: list of weak references to the object (if defined)

export_annotations(df_annotations, tag)

Exports a dataframe of annotations to a CSV file.

Parameters:

df_annotations (pandas.DataFrame) – A dataframe of annotations
tag (str) – Name for the domain. It is included as a suffix of the file name.

get_annotation_list(): Returns the list of available annotations

get_corpus_list(): Returns the list of available corpora

get_dataset_list(): Returns the list of available datasets

get_keywords_list(filename='IA_keywords_SEAD_REV_JAG.txt')

Returns a list of AI-related keywords read from a file.

Parameters: filename (str, optional (default='IA_keywords_SEAD_REV_JAG.txt')) – Name of the file with the keywords
Returns: keywords – A list of keywords (empty if the file does not exist)
Return type: list
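
The "empty if the file does not exist" behaviour can be sketched as follows. This is a minimal sketch, not the project's code: read_keywords is a hypothetical helper, and the one-keyword-per-line layout is an assumption, since the actual file format is not documented here.

```python
from pathlib import Path
import tempfile


def read_keywords(path):
    """Return the non-empty lines of a text file, or [] if it is missing."""
    path = Path(path)
    if not path.is_file():
        # Mirror the documented contract: a missing file yields an empty list.
        return []
    with path.open(encoding="utf-8") as f:
        # Assumption: one keyword per line; blank lines are skipped.
        return [line.strip() for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    kw_file = Path(tmp) / "keywords.txt"
    kw_file.write_text("machine learning\nneural network\n", encoding="utf-8")
    found = read_keywords(kw_file)
    missing = read_keywords(Path(tmp) / "no_such_file.txt")
```

Returning an empty list instead of raising lets callers iterate over the result without a separate existence check.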

get_model_list(): Returns the list of available models

import_AI_subcorpus(ids_corpus=None, tag='imported')

Loads a subcorpus of positive labels from file.

Parameters:

ids_corpus (list) – List of ids of the documents in the corpus. Only the labels with ids in ids_corpus are imported and saved into the output file.
tag (str, optional (default="imported")) – Name for the category defined by the positive labels.

Returns:

ids_pos – List of ids of documents from the positive class

Return type:

list
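
The filtering step described for ids_corpus (only labels whose ids appear in the corpus are kept) might look like the sketch below. filter_positive_ids is a hypothetical helper, and the behaviour for ids_corpus=None is an assumption; the project's method also reads from and writes to files, which is omitted here.

```python
def filter_positive_ids(ids_pos, ids_corpus=None):
    """Keep only the positive ids that also appear in ids_corpus."""
    if ids_corpus is None:
        # Assumption: with no corpus ids given, nothing is filtered out.
        return list(ids_pos)
    corpus = set(ids_corpus)  # set membership tests are O(1) on average
    # Preserve the original order of the positive ids.
    return [doc_id for doc_id in ids_pos if doc_id in corpus]


# Made-up document ids, for illustration only.
kept = filter_positive_ids(["d1", "d7", "d9"], ids_corpus=["d1", "d2", "d9"])
```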

import_annotations(tag)

Loads a file with annotations.

Parameters: tag (str) – Name used to identify the annotation file.
Returns: df_annotations – Dataframe of annotations.
Return type: pandas.DataFrame

is_model(class_name)

Checks if a model exists for the given class_name in the folder of models.

Parameters: class_name (str) – Name of the class
Returns: True if the model folder exists, False otherwise
Return type: bool
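
A folder-existence check of this kind can be sketched with pathlib. The helper name model_folder_exists and the layout (one subfolder per class, named after class_name, inside the models folder) are assumptions made for illustration.

```python
from pathlib import Path
import tempfile


def model_folder_exists(path2models, class_name):
    """Return True if path2models contains a folder named class_name."""
    # Assumption: each model lives in a subfolder named after its class.
    return (Path(path2models) / class_name).is_dir()


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "ai").mkdir()  # simulate one stored model
    has_ai = model_folder_exists(tmp, "ai")
    has_other = model_folder_exists(tmp, "robotics")
```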

load_corpus(corpus_name, sampling_factor=1)

Loads a dataframe of documents from a given corpus.

When available, the names of the relevant dataframe components are mapped to normalized names: id, title, description, keywords and target_xxx

Parameters:

corpus_name (str) – Name of the corpus. It should be the name of a folder in self.path2source
sampling_factor (float, optional (default=1)) – Fraction of documents to be taken from the original corpus. (Used for SemanticScholar and patstat only)
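
To make the meaning of sampling_factor concrete, the sketch below draws a fraction of the documents at random. It is only an illustration under stated assumptions: sample_fraction is a hypothetical helper, it works on a plain list rather than a dataframe to stay self-contained, and the seed parameter is added here for reproducibility.

```python
import random


def sample_fraction(records, sampling_factor=1, seed=0):
    """Return a random subset holding a given fraction of the records."""
    if not 0 < sampling_factor <= 1:
        raise ValueError("sampling_factor must be in the interval (0, 1]")
    if sampling_factor == 1:
        # A factor of 1 keeps the whole corpus, in its original order.
        return list(records)
    n_docs = round(len(records) * sampling_factor)
    rng = random.Random(seed)  # local generator, so global state is untouched
    return rng.sample(list(records), n_docs)


docs = [f"doc_{i}" for i in range(100)]  # made-up document ids
subset = sample_fraction(docs, sampling_factor=0.25)
```

With a pandas dataframe, the equivalent operation would typically be DataFrame.sample(frac=sampling_factor).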

load_dataset(tag='')

Loads a labeled dataset of documents in the format required by the classifier modules

Parameters: tag (str, optional (default="")) – Name of the dataset

load_new_labels(tag='')

Loads a temporary dataframe of new labels

Parameters: tag (str, optional (default="")) – Suffix for the file name
Returns: df – A dataframe of labels
Return type: pandas.DataFrame

load_selected_docs(tag='')

Loads a temporary dataframe of documents selected for annotation

Parameters: tag (str, optional (default="")) – Label suffix for the file name
Returns: df – Selected documents
Return type: pandas.DataFrame

load_topic_metadata(): Loads the metadata associated with the topic matrix from the selected corpus

load_topics(): Loads a topic matrix for a specific corpus

remove_temp_files(tag): Removes temporary files associated with a given annotation round

reset_labels(tag='')

Deletes all files related to a given class

Parameters: tag (str, optional (default="")) – Name of the class to be removed

save_dataset(df_dataset, tag='', save_csv=False)

Saves the dataset in the input dataframe to a feather file.

Parameters:

df_dataset (pandas.DataFrame) – Dataset to save
tag (str, optional (default="")) – Optional string to add to the output file name.
save_csv (bool, optional (default=False)) – If True, the dataset is also saved in CSV format

save_new_labels(idx, labels, tag='')

Save labels from the last annotation round in a temporary file

Parameters:

idx (list-like) – Indices of the annotated documents.
labels (list) – List of labels
tag (str, optional (default="")) – Label suffix for the file name

save_selected_docs(df_docs, tag='')

Save a temporary dataframe of documents selected for annotation

Parameters:

df_docs (pandas.DataFrame) – Dataframe of documents to be annotated
tag (str, optional (default="")) – Suffix for the file name

src.data_manager.detect_english(x)

Returns True if x contains text in English.

Parameters: x (str) – Input string
Returns: True if x contains English text, False otherwise.
Return type: bool