Data Manager¶
Defines a data manager class to provide basic read-write functionality for the project.
@author: J. Cid-Sueiro, A. Gallardo-Antolin, T. Ahlers
- class src.data_manager.DataManager(path2source, path2datasets, path2models, path2embeddings=None)¶
Bases:
object
This class contains all read / write functionalities required by the domain_classification project.
It assumes that source and destination data will be stored in files.
- __init__(path2source, path2datasets, path2models, path2embeddings=None)¶
Initializes the data manager object
- Parameters:
path2source (str or pathlib.Path) – Path to the folder containing all external source data
path2datasets (str or pathlib.Path) – Path to the folder containing datasets
path2models (str or pathlib.Path) – Path to the folder containing classifier models
path2embeddings (str or pathlib.Path) – Path to the folder containing the document embeddings
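As a minimal sketch of the constructor contract described above, the stand-in class below accepts either str or pathlib.Path for each folder and leaves the optional embeddings path as None when it is omitted. The name DataManagerSketch and its body are hypothetical and only mirror the documented signature; they are not the project's implementation.

```python
import pathlib


class DataManagerSketch:
    """Hypothetical stand-in mirroring the documented constructor signature."""

    def __init__(self, path2source, path2datasets, path2models,
                 path2embeddings=None):
        # Accept str or pathlib.Path, as the parameter list allows.
        self.path2source = pathlib.Path(path2source)
        self.path2datasets = pathlib.Path(path2datasets)
        self.path2models = pathlib.Path(path2models)
        # The embeddings folder is optional, so None is kept as-is.
        self.path2embeddings = (
            None if path2embeddings is None
            else pathlib.Path(path2embeddings))


# Mixed str / Path arguments are both valid inputs.
dm = DataManagerSketch("data/source", pathlib.Path("data/datasets"),
                       "data/models")
```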
- export_annotations(df_annotations, tag)¶
Exports a dataframe of annotations to a csv file.
- Parameters:
df_annotations (pandas.DataFrame) – A dataframe of annotations
tag (str) – Name for the domain. To be included as a suffix of the file name.
- get_annotation_list()¶
Returns the list of available annotations
- get_corpus_list()¶
Returns the list of available corpora
- get_dataset_list()¶
Returns the list of available datasets
- get_keywords_list(filename='IA_keywords_SEAD_REV_JAG.txt')¶
Returns a list of AI-related keywords read from a file.
- Parameters:
filename (str, optional (default='IA_keywords_SEAD_REV_JAG.txt')) – Name of the file with the keywords
- Returns:
keywords – A list of keywords (empty if the file does not exist)
- Return type:
list
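The documented fallback (an empty list when the file does not exist) can be sketched as follows. The helper name read_keywords_sketch and the one-keyword-per-line file layout are assumptions made for illustration; the actual file format is not described here.

```python
import pathlib
import tempfile


def read_keywords_sketch(folder, filename):
    """Hypothetical helper: read one keyword per line (assumed layout)."""
    path = pathlib.Path(folder) / filename
    if not path.is_file():
        # Documented behavior: an empty list if the file does not exist.
        return []
    with open(path, encoding="utf-8") as f:
        # Strip surrounding whitespace and skip blank lines.
        return [line.strip() for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    (pathlib.Path(tmp) / "kw.txt").write_text(
        "machine learning\n\nneural network\n", encoding="utf-8")
    found = read_keywords_sketch(tmp, "kw.txt")
    missing = read_keywords_sketch(tmp, "no_such_file.txt")
```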
- get_model_list()¶
Returns the list of available models
- import_AI_subcorpus(ids_corpus=None, tag='imported')¶
Loads a subcorpus of positive labels from file.
- Parameters:
ids_corpus (list) – List of ids of the documents in the corpus. Only the labels with ids in ids_corpus are imported and saved into the output file.
tag (str, optional (default=”imported”)) – Name for the category defined by the positive labels.
- Returns:
ids_pos – List of ids of documents from the positive class
- Return type:
list
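The filtering rule for ids_corpus (only labels whose ids appear in the corpus are kept) could look roughly like the sketch below. The helper filter_positive_ids is hypothetical, and treating ids_corpus=None as "no filtering" is an assumption, since the behavior for the default value is not documented.

```python
def filter_positive_ids(ids_pos, ids_corpus=None):
    """Hypothetical helper: keep only positive ids that occur in the corpus."""
    if ids_corpus is None:
        # Assumption: with no corpus ids given, nothing is filtered out.
        return list(ids_pos)
    corpus = set(ids_corpus)  # a set gives constant-time membership tests
    # Preserve the original order of the positive ids.
    return [doc_id for doc_id in ids_pos if doc_id in corpus]


ids_pos = filter_positive_ids(["d3", "d7", "d9"],
                              ids_corpus=["d1", "d3", "d9"])
```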
- import_annotations(tag)¶
Loads a file with annotations.
- Parameters:
tag (str) – Name used to identify the annotation file.
- Returns:
df_annotations – Dataframe of annotations.
- Return type:
pandas.DataFrame
- is_model(class_name)¶
Checks if a model exists for the given class_name in the folder of models
- Parameters:
class_name (str) – Name of the class
- Returns:
True if the model folder exists
- Return type:
bool
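A check of this kind can be sketched with pathlib. The assumption here, not confirmed by the documentation, is that each model lives in a subfolder of the models folder named exactly after the class; is_model_sketch is a hypothetical name.

```python
import pathlib
import tempfile


def is_model_sketch(path2models, class_name):
    """Hypothetical check: does a folder named class_name exist?"""
    return (pathlib.Path(path2models) / class_name).is_dir()


with tempfile.TemporaryDirectory() as tmp:
    # Create one model folder so that both outcomes can be observed.
    (pathlib.Path(tmp) / "AI").mkdir()
    has_ai = is_model_sketch(tmp, "AI")
    has_other = is_model_sketch(tmp, "robotics")
```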
- load_corpus(corpus_name, sampling_factor=1)¶
Loads a dataframe of documents from a given corpus.
When available, the names of the relevant dataframe components are mapped to normalized names: id, title, description, keywords and target_xxx
- Parameters:
corpus_name (str) – Name of the corpus. It should be the name of a folder in self.path2source
sampling_factor (float, optional (default=1)) – Fraction of documents to be taken from the original corpus. (Used for SemanticScholar and patstat only)
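The two steps described above, mapping source columns to normalized names and keeping a fraction of the documents, can be illustrated with plain pandas calls. The source column names (paperID, paperTitle, abstract) and the mapping are hypothetical; the real names depend on each corpus.

```python
import pandas as pd

# Toy corpus with hypothetical source column names.
df_raw = pd.DataFrame({
    "paperID": [1, 2, 3, 4],
    "paperTitle": ["a", "b", "c", "d"],
    "abstract": ["w", "x", "y", "z"],
})

# Hypothetical mapping from source names to the normalized names.
mapping = {"paperID": "id", "paperTitle": "title", "abstract": "description"}
df_corpus = df_raw.rename(columns=mapping)

# Keep a random fraction of the documents; random_state makes it repeatable.
sampling_factor = 0.5
df_sample = df_corpus.sample(frac=sampling_factor, random_state=0)
```

With sampling_factor=1 the same call returns every row (in shuffled order), which matches the documented default of taking the whole corpus.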
- load_dataset(tag='')¶
Loads a labeled dataset of documents in the format required by the classifier modules
- Parameters:
tag (str, optional (default=””)) – Name of the dataset
- load_new_labels(tag='')¶
Loads a temporary dataframe of labels from the last annotation round
- Parameters:
tag (str, optional (default=””)) – Suffix for the file name
- Returns:
df – A dataframe of labels
- Return type:
pandas.DataFrame
- load_selected_docs(tag='')¶
Loads a temporary dataframe of documents selected for annotation
- Parameters:
tag (str, optional (default=””)) – Label suffix for the file name
- Returns:
df – Selected documents
- Return type:
pandas.DataFrame
- load_topic_metadata()¶
Loads the metadata associated with the topic matrix from the selected corpus
- load_topics()¶
Loads a topic matrix for a specific corpus
- remove_temp_files(tag)¶
Removes temporary files associated with a given annotation round
- reset_labels(tag='')¶
Deletes all files related to a given class
- Parameters:
tag (str, optional (default=””)) – Name of the class to be removed
- save_dataset(df_dataset, tag='', save_csv=False)¶
Saves the dataset in the input dataframe to a Feather file.
- Parameters:
df_dataset (pandas.DataFrame) – Dataset to save
tag (str, optional (default=””)) – Optional string to add to the output file name.
save_csv (boolean, optional (default=False)) – If True, the dataset is saved in csv format too
- save_new_labels(idx, labels, tag='')¶
Saves labels from the last annotation round in a temporary file
- Parameters:
idx (list like) – Indices of the annotated documents.
labels (list) – List of labels
tag (str, optional (default=””)) – Label suffix for the file name
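To make the role of idx, labels and tag concrete, here is a rough sketch that pairs indices with labels and writes them to a tag-suffixed file. The file name pattern, the CSV format and the column names are all illustrative assumptions; the temporary file format actually used by the project is not documented here.

```python
import pathlib
import tempfile

import pandas as pd


def save_labels_sketch(folder, idx, labels, tag=""):
    """Hypothetical helper: write (idx, labels) pairs to a tag-suffixed CSV."""
    # The file name pattern and column names are illustrative assumptions.
    path = pathlib.Path(folder) / f"new_labels_{tag}.csv"
    df = pd.DataFrame({"id": list(idx), "labels": list(labels)})
    df.to_csv(path, index=False)
    return path


with tempfile.TemporaryDirectory() as tmp:
    path = save_labels_sketch(tmp, [10, 42], [1, 0], tag="AI")
    # Reading the file back recovers the same index/label pairs.
    df_back = pd.read_csv(path)
```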
- save_selected_docs(df_docs, tag='')¶
Saves a temporary dataframe of documents selected for annotation
- Parameters:
df_docs (pandas.DataFrame) – Dataframe of documents to be annotated
tag (str, optional (default=””)) – Suffix for the file name
- src.data_manager.detect_english(x)¶
Returns True if x contains text in English.
- Parameters:
x (str) – Input string
- Returns:
True if x contains English text, False otherwise.
- Return type:
bool
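The detection method behind this function is not described here. Purely to illustrate the str-to-bool contract, the sketch below swaps in a crude stopword-ratio heuristic; detect_english_sketch, its tiny stopword list and the threshold are all hypothetical, and a real detector would rely on far more evidence.

```python
# Tiny illustrative stopword list (an assumption, not the project's data).
ENGLISH_STOPWORDS = {
    "the", "and", "of", "is", "in", "to", "a", "that", "it", "for"}


def detect_english_sketch(x, threshold=0.2):
    """Crude stand-in: True if enough tokens are common English stopwords."""
    tokens = x.lower().split()
    if not tokens:
        # An empty string carries no evidence of English text.
        return False
    hits = sum(token in ENGLISH_STOPWORDS for token in tokens)
    return hits / len(tokens) >= threshold


english = detect_english_sketch("the model is trained in a few steps")
spanish = detect_english_sketch("el modelo se entrena en pocos pasos")
```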