Preprocessor

Defines classes and methods providing the main functionality for document selection through keywords, a category name or a weighted list of topics.

@author: J. Cid-Sueiro, A. Gallardo-Antolin

class src.domain_classifier.preprocessor.CorpusDFProcessor(df_corpus, path2embeddings=None, path2zeroshot=None)

Bases: object

A container of corpus processing methods. It assumes that a corpus is given by a dataframe of documents.

Each dataframe must contain three columns:
  • id: document identifiers

  • title: document titles

  • description: body of the document text

__init__(df_corpus, path2embeddings=None, path2zeroshot=None)

Initializes a preprocessor object

Parameters:
  • df_corpus (pandas.DataFrame) – Input corpus.

  • path2embeddings (str or pathlib.Path or None, optional (default=None)) – Path to the folder containing the document embeddings. If None, no embeddings will be used and document scores will be based on word counts.

  • path2zeroshot (str or pathlib.Path or None, optional (default=None)) – Path to the folder containing the pretrained zero-shot model. If None, zero-shot classification will not be available.

__weakref__

list of weak references to the object (if defined)

compute_keyword_stats(keywords, wt=2)

Computes keyword statistics

Parameters:
  • keywords (list of str) – List of keywords

  • wt (float, optional (default=2)) – Weighting factor for the title components. Keyword matches with title words are weighted by this factor

Returns:

  • df_stats (dict) – Dictionary of document frequencies per keyword. df_stats[k] is the number of docs containing keyword k

  • kf_stats (dict) – Dictionary of keyword frequencies. kf_stats[k] is the number of times keyword k appears in the corpus
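The two returned dictionaries can be illustrated with a stdlib-only sketch of the documented semantics (the actual method operates on self.df_corpus and weights title matches by wt; here a plain list of strings stands in for the corpus):

```python
# Illustrative sketch of the documented df_stats / kf_stats semantics.
corpus = [
    "solar energy storage",
    "energy markets and solar energy panels",
    "genomics pipelines",
]
keywords = ["solar", "energy"]

df_stats = {}  # document frequency: number of docs containing keyword k
kf_stats = {}  # keyword frequency: total occurrences of keyword k
for k in keywords:
    df_stats[k] = sum(1 for doc in corpus if k in doc.split())
    kf_stats[k] = sum(doc.split().count(k) for doc in corpus)

print(df_stats)  # {'solar': 2, 'energy': 2}
print(kf_stats)  # {'solar': 2, 'energy': 3}
```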

evaluate_filter(scores, target_col, n_max, s_min, verbose=False)

Compute evaluation metrics for the generation of the subcorpus

To do so, it requires self.df_corpus to have at least the following columns: id, title, description, target_bio, target_tic, target_ene

Parameters:
  • scores (list of float) – list of unsorted scores

  • target_col (str) – Name of the column in the corpus dataframe that will be used as a reference for evaluation.

  • n_max (int or None, optional (default=1e100)) – Maximum number of elements in the output list. The default is a huge number that, in practice, means there is no limit

  • s_min (float, optional (default=0)) – Minimum score. Only elements strictly above s_min are selected

  • verbose (bool, optional) – If True, the evaluation results are logged at level INFO.

Returns:

eval_scores – A dictionary of evaluation metrics. If there are no labels available for evaluation, an empty dictionary is returned.

Return type:

dict

filter_by_keywords(keywords, wt=2, n_max=1e+100, s_min=0, model_name='all-MiniLM-L6-v2', method='embedding')

Select documents from a given set of keywords

Parameters:
  • keywords (list of str) – List of keywords

  • wt (float, optional (default=2)) – Weighting factor for the title components. Keyword matches with title words are weighted by this factor. Not used if self.path2embeddings is None

  • n_max (int or None, optional (default=1e100)) – Maximum number of elements in the output list. The default is a huge number that, in practice, means there is no limit

  • s_min (float, optional (default=0)) – Minimum score. Only elements strictly above s_min are selected

  • model_name (str, optional (default=‘all-MiniLM-L6-v2’)) – Name of the SBERT transformer model

  • method (str in {‘embedding’, ‘count’}) –

    • If ‘count’, documents are scored according to word counts

    • If ‘embedding’, scores are based on neural embeddings

Returns:

  • ids (list) – List of ids of the selected documents

  • scores (list of float) – List of scores, one per document in the corpus

filter_by_topics(T, doc_ids, topic_weights, n_max=1e+100, s_min=0)

Select documents with a significant presence of a given set of topics

Parameters:
  • T (numpy.ndarray or scipy.sparse) – Topic matrix.

  • doc_ids (array-like) – Ids of the documents in the topic matrix. doc_ids[i] = ‘123’ means that document with id ‘123’ has topic vector T[i]

  • topic_weights (dict) – Dictionary {t_i: w_i}, where t_i is a topic index and w_i is the weight of the topic

  • n_max (int or None, optional (default=1e100)) – Maximum number of elements in the output list. The default is a huge number that, in practice, means there is no limit

  • s_min (float, optional (default=0)) – Minimum score. Only elements strictly above s_min are selected

Returns:

ids – List of ids of the selected documents

Return type:

list

get_top_scores(scores, n_max=1e+100, s_min=0)

Select documents from the corpus whose score is strictly above a lower bound

Parameters:
  • scores (array-like of float) – List of scores. It must have the same size as the number of docs in the corpus

  • n_max (int or None, optional (default=1e100)) – Maximum number of elements in the output list. The default is a huge number that, in practice, means there is no limit

  • s_min (float, optional (default=0)) – Minimum score. Only elements strictly above s_min are selected
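The selection rule described above (scores strictly above s_min, at most n_max elements) can be sketched in plain Python. This is an illustrative reimplementation, not the actual code; the assumed return value is the list of selected document indices, ordered from highest to lowest score:

```python
def top_scores(scores, n_max=int(1e100), s_min=0):
    """Illustrative sketch: indices of documents whose score is strictly
    above s_min, at most n_max of them, highest scores first."""
    # Sort candidate indices by decreasing score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected = [i for i in order if scores[i] > s_min]
    return selected[:n_max]

print(top_scores([0.1, 0.9, 0.0, 0.5], n_max=2, s_min=0.0))  # [1, 3]
```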

make_PU_dataset(ids, scores=None)

Returns the labeled dataframe in the format required by the CorpusClassifier class

Parameters:
  • ids (array-like) – Ids of documents with positive labels

  • scores (array-like or None, optional) – A list or np.array of score values, one per row in self.df_corpus. It is used to fill the base_scores column in the output dataframe. They are expected to contain the scores used to select the positive labels for PU learning. Thus, the docs listed in ids should be those with the highest scores.

Returns:

df_dataset – A pandas dataframe with three columns: id, text and labels.

Return type:

pandas.DataFrame
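A sketch of what the output dataset plausibly looks like, assuming the usual PU-learning convention that documents listed in ids are labeled 1 and all remaining documents 0 (unlabeled). The text-building step (title concatenated with description) is also an assumption, and plain dicts stand in for the pandas DataFrame:

```python
# Hypothetical sketch of the PU labeling step.
corpus = [
    {"id": "d1", "title": "Solar power", "description": "Grid storage."},
    {"id": "d2", "title": "Genomics", "description": "Sequencing."},
]
positive_ids = {"d1"}  # ids of documents with positive labels

df_dataset = [
    {
        "id": doc["id"],
        "text": doc["title"] + " " + doc["description"],  # assumed concatenation
        "labels": 1 if doc["id"] in positive_ids else 0,  # assumed PU convention
    }
    for doc in corpus
]
print([row["labels"] for row in df_dataset])  # [1, 0]
```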

remove_docs_from_topics(T, df_metadata, col_id='id')

Removes, from a given topic-document matrix and its corresponding metadata dataframe, all documents that do not belong to the corpus

Parameters:
  • T (numpy.ndarray or scipy.sparse) – Topic matrix (one column per topic)

  • df_metadata (pandas.DataFrame) – Dataframe of metadata. It must include a column with document ids

  • col_id (str, optional (default=’id’)) – Name of the column containing the document ids in df_metadata

Returns:

  • T_out (numpy.ndarray or scipy.sparse) – Reduced topic matrix (after document removal)

  • df_out (pandas.DataFrame) – Metadata dataframe, after document removal
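The removal step amounts to keeping only the rows of T and df_metadata whose id appears in the corpus. A stdlib-only sketch of that logic (lists stand in for the matrix and the metadata dataframe):

```python
# Illustrative sketch: drop every document whose id is not in the corpus.
corpus_ids = {"d1", "d3"}
metadata = [{"id": "d1"}, {"id": "d2"}, {"id": "d3"}]
T = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]  # one topic-vector row per document

keep = [i for i, row in enumerate(metadata) if row["id"] in corpus_ids]
T_out = [T[i] for i in keep]
df_out = [metadata[i] for i in keep]

print([row["id"] for row in df_out])  # ['d1', 'd3']
```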

score_by_keyword_count(keywords, wt=2)

Computes a score for every document in the corpus dataframe according to the frequency of occurrence of some given keywords

Parameters:
  • keywords (list of str) – List of keywords

  • wt (float, optional (default=2)) – Weighting factor for the title components. Keyword matches with title words are weighted by this factor

Returns:

score – List of scores, one per document in the corpus

Return type:

list of float
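The role of the wt factor can be sketched as follows. The exact formula is an assumption (title matches weighted by wt, description matches by 1), meant only to illustrate the documented behavior:

```python
# Illustrative sketch of count-based scoring with a title weight.
def keyword_count_score(title, description, keywords, wt=2):
    # Count keyword occurrences in title and description separately.
    t_hits = sum(title.lower().split().count(k) for k in keywords)
    d_hits = sum(description.lower().split().count(k) for k in keywords)
    # Assumed formula: title matches are weighted by wt.
    return wt * t_hits + d_hits

score = keyword_count_score("Solar energy", "solar panels on the grid",
                            ["solar"], wt=2)
print(score)  # 3: one title match weighted by 2, plus one description match
```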

score_by_keywords(keywords, wt=2, model_name='all-MiniLM-L6-v2', method='embedding')

Computes a score for every document in the corpus dataframe according to the frequency of occurrence of some given keywords

Parameters:
  • keywords (list of str) – List of keywords

  • wt (float, optional (default=2)) – Weighting factor for the title components. Keyword matches with title words are weighted by this factor. This argument is used only if self.path2embeddings is None

  • model_name (str, optional (default=‘all-MiniLM-L6-v2’)) – Name of the SBERT transformer model

  • method (str in {‘embedding’, ‘count’}) –

    • If ‘count’, documents are scored according to word counts

    • If ‘embedding’, scores are based on neural embeddings

Returns:

score – List of scores, one per document in the corpus

Return type:

list of float

score_by_topics(T, doc_ids, topic_weights)

Computes a score for every document in a given pandas dataframe according to the relevance of a weighted list of topics

Parameters:
  • T (numpy.ndarray or scipy.sparse) – Topic matrix (one column per topic)

  • doc_ids (array-like) – Ids of the documents in the topic matrix. doc_ids[i] = ‘123’ means that document with id ‘123’ has topic vector T[i]

  • topic_weights (dict) – Dictionary {t_i: w_i}, where t_i is a topic index and w_i is the weight of the topic

Returns:

score – List of scores, one per document in the corpus

Return type:

list of float

score_by_zeroshot(keyword)

Computes a score for every document in the corpus dataframe according to its relevance to a given keyword, as estimated by a pretrained zero-shot classifier

Parameters:

keyword (str) – Keyword defining the target category

Returns:

score – List of scores, one per document in the corpus

Return type:

list of float

class src.domain_classifier.preprocessor.CorpusProcessor(path2embeddings=None, path2zeroshot=None)

Bases: object

A container of corpus preprocessing methods. It provides basic processing methods for a corpus of text documents. The input corpus must be given by a list of strings (or a pandas Series of strings).

__init__(path2embeddings=None, path2zeroshot=None)

Initializes a preprocessor object

Parameters:
  • path2embeddings (str or pathlib.Path or None, optional (default=None)) – Path to the folder containing the document embeddings. If None, no embeddings will be used and document scores will be based on word counts.

  • path2zeroshot (str or pathlib.Path or None, optional (default=None)) – Path to the folder containing the pretrained zero-shot model. If None, zero-shot classification will not be available.

__weakref__

list of weak references to the object (if defined)

compute_keyword_stats(corpus, keywords)

Computes keyword statistics

Parameters:
  • corpus (list (or pandas.Series) of str) – Input corpus.

  • keywords (list of str) – List of keywords

Returns:

  • df_stats (dict) – Dictionary of document frequencies per keyword. df_stats[k] is the number of docs containing keyword k

  • kf_stats (dict) – Dictionary of keyword frequencies. kf_stats[k] is the number of times keyword k appears in the corpus

get_top_scores(scores, n_max=1e+100, s_min=0)

Select the elements from a given list of numbers that fulfill some conditions

Parameters:
  • scores (array-like of float) – List of scores

  • n_max (int or None, optional (default=1e100)) – Maximum number of elements in the output list. The default is a huge number that, in practice, means there is no limit

  • s_min (float, optional (default=0)) – Minimum score. Only elements strictly above s_min are selected

performance_metrics(scores, target, s_min, n_max)

Compute evaluation metrics for the generation of the subcorpus using a keyword

To do so, it requires self.df_corpus to have at least the following columns: id, title, description, target_bio, target_tic, target_ene

Parameters:
  • scores (np.array) – Score values

  • target (np.array) – Target values

  • s_min (float, optional (default=0)) – Minimum score. Only elements strictly above s_min are selected

  • n_max (int or None, optional (default=1e100)) – Maximum number of elements in the output list. The default is a huge number that, in practice, means there is no limit

Returns:

eval_scores – A dictionary of evaluation metrics.

Return type:

dict
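Evaluation of a score-based selection against binary targets can be sketched as follows. The selection rule mirrors the documented one (scores strictly above s_min, at most n_max items); the specific metric names (precision, recall) are assumptions about what eval_scores contains:

```python
# Illustrative sketch of subcorpus evaluation metrics.
def eval_selection(scores, target, s_min=0.0, n_max=int(1e100)):
    # Select indices with score strictly above s_min, highest first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    sel = [i for i in order if scores[i] > s_min][:n_max]
    tp = sum(target[i] for i in sel)   # selected docs that are true positives
    n_pos = sum(target)                # all positive docs in the corpus
    return {
        "n_selected": len(sel),
        "precision": tp / len(sel) if sel else 0.0,
        "recall": tp / n_pos if n_pos else 0.0,
    }

m = eval_selection([0.9, 0.8, 0.1, 0.7], [1, 0, 1, 1], s_min=0.5)
print(m)  # selects docs 0, 1, 3; precision and recall are both 2/3
```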

score_docs_by_keyword_count(corpus, keywords)

Computes a score for every document in a given corpus according to the frequency of occurrence of some given keywords

Parameters:
  • corpus (list (or pandas.Series) of str) – Input corpus.

  • keywords (list of str) – List of keywords

Returns:

score – List of scores, one per document in corpus

Return type:

list of float

score_docs_by_keywords(corpus, keywords, model_name='all-MiniLM-L6-v2')

Computes a score for every document in a given corpus according to the frequency of occurrence of some given keywords

Parameters:
  • corpus (list (or pandas.Series) of str) – Input corpus.

  • keywords (list of str) – List of keywords

  • model_name (str, optional (default=‘all-MiniLM-L6-v2’)) – Name of the SBERT transformer model

Returns:

score – List of scores, one per document in corpus

Return type:

list of float
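When embeddings are used, the real method embeds documents and keywords with the SBERT model named by model_name. The following stdlib sketch substitutes small hand-made vectors for those embeddings and scores each document by its cosine similarity to the mean keyword vector; the mean-then-cosine combination is an assumed but common choice, not necessarily the implemented one:

```python
import math

# Illustrative sketch of embedding-based keyword scoring.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

doc_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # toy document embeddings
kw_embs = [[1.0, 0.1], [0.9, 0.0]]               # one toy vector per keyword
kw_mean = [sum(c) / len(kw_embs) for c in zip(*kw_embs)]

score = [cosine(e, kw_mean) for e in doc_embs]
print([round(s, 2) for s in score])  # doc 0 is most keyword-like
```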

score_docs_by_zeroshot(corpus, keyword)

Computes a score for every document in a given corpus according to a given keyword and a pretrained zero-shot classifier

Parameters:
  • corpus (list (or pandas.Series) of str) – Input corpus.

  • keyword (str) – Keyword defining the target category

Returns:

score – List of scores, one per document in corpus

Return type:

list of float

Notes

Adapted from code contributed by BSC for the IntelComp project.