Data Manager

Defines a data manager class to provide basic read-write functionality for the project.

@author: J. Cid-Sueiro, A. Gallardo-Antolin, T. Ahlers

class src.data_manager.DataManager(path2source, path2datasets, path2models, path2embeddings=None)

Bases: object

This class contains all read / write functionalities required by the domain_classification project.

It assumes that source and destination data will be stored in files.

__init__(path2source, path2datasets, path2models, path2embeddings=None)

Initializes the data manager object

Parameters:
  • path2source (str or pathlib.Path) – Path to the folder containing all external source data

  • path2datasets (str or pathlib.Path) – Path to the folder containing datasets

  • path2models (str or pathlib.Path) – Path to the folder containing classifier models

  • path2embeddings (str or pathlib.Path) – Path to the folder containing the document embeddings
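Since every path argument accepts either a str or a pathlib.Path, it can help to normalize them before constructing the manager. The following is a minimal sketch under stated assumptions: as_path and build_paths are hypothetical helpers, and the subfolder names are illustrative only; neither is part of the documented API.

```python
from pathlib import Path
from typing import Dict, Optional, Union

PathLike = Union[str, Path]


def as_path(p: Optional[PathLike]) -> Optional[Path]:
    """Hypothetical helper: convert a str or Path to Path, keeping None."""
    return None if p is None else Path(p)


def build_paths(root: PathLike) -> Dict[str, Optional[Path]]:
    """Build the four constructor arguments under one project root.

    The subfolder names are illustrative assumptions, not project defaults.
    """
    root = Path(root)
    return {
        "path2source": as_path(root / "source"),
        "path2datasets": as_path(str(root / "datasets")),  # str is accepted too
        "path2models": as_path(root / "models"),
        "path2embeddings": None,  # optional argument, left unset here
    }
```

The resulting dictionary could then be unpacked into the constructor, for example DataManager(**build_paths("my_project")).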

export_annotations(df_annotations, tag)

Exports a dataframe of annotations to a CSV file.

Parameters:
  • df_annotations (pandas.DataFrame) – A dataframe of annotations

  • tag (str) – Name for the domain. To be included as a suffix of the file name.
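To illustrate how a tag can end up as a file-name suffix, here is a minimal sketch that uses only the standard library. The function export_rows and the file-name pattern annotations_<tag>.csv are assumptions made for illustration; the actual naming convention is not stated in this reference. With pandas, the write step would typically be a call to DataFrame.to_csv.

```python
import csv
from pathlib import Path
from typing import Dict, List


def export_rows(rows: List[Dict[str, str]], folder: Path, tag: str) -> Path:
    """Write rows to <folder>/annotations_<tag>.csv (hypothetical name)."""
    folder.mkdir(parents=True, exist_ok=True)
    out = folder / f"annotations_{tag}.csv"
    # Column names are taken from the first row; no rows means no columns.
    fieldnames = list(rows[0]) if rows else []
    with out.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return out
```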

get_annotation_list()

Returns the list of available annotations

get_corpus_list()

Returns the list of available corpora

get_dataset_list()

Returns the list of available datasets

get_keywords_list(filename='IA_keywords_SEAD_REV_JAG.txt')

Returns a list of AI-related keywords read from a file.

Parameters:

filename (str, optional (default='IA_keywords_SEAD_REV_JAG.txt')) – Name of the file with the keywords

Returns:

keywords – A list of keywords (empty if the file does not exist)

Return type:

list
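The "empty if the file does not exist" behavior can be sketched as follows. This is an illustrative stand-in, not the class's implementation: read_keywords is a hypothetical name, and the one-keyword-per-line layout is an assumption about the file format.

```python
from pathlib import Path
from typing import List


def read_keywords(path: Path) -> List[str]:
    """Return the keywords in path, or [] if the file is missing.

    Assumes one keyword per line; blank lines are skipped.
    """
    if not path.is_file():
        return []
    with path.open(encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```

Returning an empty list instead of raising lets callers treat a missing keyword file as "no keywords" without extra error handling.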

get_model_list()

Returns the list of available models

import_AI_subcorpus(ids_corpus=None, tag='imported')

Loads a subcorpus of positive labels from file.

Parameters:
  • ids_corpus (list) – List of ids of the documents in the corpus. Only the labels with ids in ids_corpus are imported and saved into the output file.

  • tag (str, optional (default=”imported”)) – Name for the category defined by the positive labels.

Returns:

ids_pos – List of ids of documents from the positive class

Return type:

list
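The role of ids_corpus as a filter can be sketched in a few lines. The helper filter_positive_ids is hypothetical, and treating ids_corpus=None as "no filtering" is an assumption; the reference does not say what happens in that case.

```python
from typing import Iterable, List, Optional


def filter_positive_ids(ids_pos: Iterable[str],
                        ids_corpus: Optional[Iterable[str]] = None
                        ) -> List[str]:
    """Keep only the positive ids that also appear in ids_corpus."""
    if ids_corpus is None:
        # Assumption: with no corpus ids given, nothing is filtered out.
        return list(ids_pos)
    allowed = set(ids_corpus)  # set membership keeps the filter O(n)
    return [i for i in ids_pos if i in allowed]
```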

import_annotations(tag)

Loads a file with annotations.

Parameters:

tag (str) – Name used to identify the annotation file.

Returns:

df_annotations – Dataframe of annotations.

Return type:

pandas.DataFrame

is_model(class_name)

Checks if a model exists for the given class_name in the folder of models

Parameters:

class_name (str) – Name of the class

Returns:

True if the model folder exists

Return type:

bool
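A folder-existence check of this kind can be sketched with pathlib. The function model_folder_exists is a hypothetical name, and the layout (one subfolder per class name, directly under the models folder) is an assumption.

```python
from pathlib import Path
from typing import Union


def model_folder_exists(path2models: Union[str, Path], class_name: str) -> bool:
    """Return True if <path2models>/<class_name> is an existing folder.

    Assumes each model lives in a subfolder named after its class.
    """
    return (Path(path2models) / class_name).is_dir()
```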

load_corpus(corpus_name, sampling_factor=1)

Loads a dataframe of documents from a given corpus.

When available, the names of the relevant dataframe components are mapped to normalized names: id, title, description, keywords and target_xxx

Parameters:
  • corpus_name (str) – Name of the corpus. It should be the name of a folder in self.path2source

  • sampling_factor (float, optional (default=1)) – Fraction of documents to be taken from the original corpus. (Used for SemanticScholar and patstat only)
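The two ideas described here, renaming source columns to normalized names and keeping only a fraction of the documents, can be sketched with the standard library. COLUMN_MAP, normalize_record and subsample are hypothetical, and the source column names are invented for illustration. With pandas, the equivalent operations would be DataFrame.rename(columns=...) and DataFrame.sample(frac=...).

```python
import random
from typing import Dict, List

# Hypothetical mapping from source column names to normalized names.
COLUMN_MAP = {"paperID": "id", "paperTitle": "title", "abstract": "description"}


def normalize_record(rec: Dict[str, object],
                     mapping: Dict[str, str]) -> Dict[str, object]:
    """Rename known keys; keys without a mapping are kept unchanged."""
    return {mapping.get(k, k): v for k, v in rec.items()}


def subsample(records: List[dict], sampling_factor: float = 1.0,
              seed: int = 0) -> List[dict]:
    """Return a random fraction of records (all of them if the factor is 1)."""
    if not 0 < sampling_factor <= 1:
        raise ValueError("sampling_factor must be in (0, 1]")
    if sampling_factor == 1:
        return list(records)
    n = round(len(records) * sampling_factor)
    # A seeded generator makes the illustrative sample reproducible.
    return random.Random(seed).sample(records, n)
```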

load_dataset(tag='')

Loads a labeled dataset of documents in the format required by the classifier modules

Parameters:

tag (str, optional (default=””)) – Name of the dataset

load_new_labels(tag='')

Loads a temporary dataframe of labels from the last annotation round

Parameters:

tag (str, optional (default=””)) – Suffix for the file name

Returns:

df – A dataframe of labels

Return type:

pandas.DataFrame

load_selected_docs(tag='')

Loads a temporary dataframe of documents selected for annotation

Parameters:

tag (str, optional (default=””)) – Label suffix for the file name

Returns:

df – Selected documents

Return type:

pandas.DataFrame

load_topic_metadata()

Loads the metadata associated with the topic matrix from the selected corpus

load_topics()

Loads a topic matrix for a specific corpus

remove_temp_files(tag)

Removes temporary files associated with a given annotation round

reset_labels(tag='')

Delete all files related to a given class

Parameters:

tag (str, optional (default=””)) – Name of the class to be removed

save_dataset(df_dataset, tag='', save_csv=False)

Saves the dataset in the input dataframe to a Feather file.

Parameters:
  • df_dataset (pandas.DataFrame) – Dataset to save

  • tag (str, optional (default=””)) – Optional string to add to the output file name.

  • save_csv (bool, optional (default=False)) – If True, the dataset is also saved in CSV format

save_new_labels(idx, labels, tag='')

Save labels from the last annotation round in a temporary file

Parameters:
  • idx (list like) – Indices of the annotated documents.

  • labels (list) – List of labels

  • tag (str, optional (default=””)) – Label suffix for the file name

save_selected_docs(df_docs, tag='')

Save a temporary dataframe of documents selected for annotation

Parameters:
  • df_docs (pandas.DataFrame) – Dataframe of documents to be annotated

  • tag (str, optional (default=””)) – Suffix for the file name

src.data_manager.detect_english(x)

Returns True if x contains text in English.

Parameters:

x (str) – Input string

Returns:

True if x contains English text, False otherwise.

Return type:

bool
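This reference does not say how the detection is done; a real implementation may rely on a language-detection library. As a rough stand-in that shows the str-to-bool contract, the sketch below uses a stopword heuristic. The name looks_english, the stopword set and the min_ratio threshold are all assumptions made for illustration.

```python
import re

# Small illustrative stopword set, not an exhaustive list.
ENGLISH_STOPWORDS = frozenset(
    {"the", "and", "of", "to", "in", "is", "that", "for", "with", "this"}
)


def looks_english(x: str, min_ratio: float = 0.15) -> bool:
    """Heuristic: True if enough tokens are common English stopwords."""
    words = re.findall(r"[a-z']+", x.lower())
    if not words:
        return False  # empty or non-alphabetic input
    hits = sum(w in ENGLISH_STOPWORDS for w in words)
    return hits / len(words) >= min_ratio
```

A heuristic like this is cheap but unreliable on very short strings, which is one reason a dedicated library is usually preferred for this task.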