Preprocessor¶
Defines classes and methods providing the main functionality for document selection through keywords, a category name or a weighted list of topics.
@author: J. Cid-Sueiro, A. Gallardo-Antolin
- class src.domain_classifier.preprocessor.CorpusDFProcessor(df_corpus, path2embeddings=None, path2zeroshot=None)¶
Bases:
object
A container of corpus processing methods. It assumes that a corpus is given by a dataframe of documents.
Each dataframe must contain three columns: id (document identifiers), title (document titles) and description (body of the document text).
- __init__(df_corpus, path2embeddings=None, path2zeroshot=None)¶
Initializes a preprocessor object
- Parameters:
df_corpus (pandas.dataFrame) – Input corpus.
path2embeddings (str or pathlib.Path or None, optional (default=None)) – Path to the folder containing the document embeddings. If None, no embeddings will be used and document scores will be based on word counts.
path2zeroshot (str or pathlib.Path or None, optional (default=None)) – Path to the folder containing the pretrained zero-shot model. If None, zero-shot classification will not be available.
- __weakref__¶
list of weak references to the object (if defined)
- compute_keyword_stats(keywords, wt=2)¶
Computes keyword statistics
- Parameters:
keywords (list of str) – List of keywords
wt (float, optional (default=2)) – Weighting factor for the title components. Keyword matches with title words are weighted by this factor
- Returns:
df_stats (dict) – Dictionary of document frequencies per keyword. df_stats[k] is the number of docs containing keyword k
kf_stats (dict) – Dictionary of keyword frequencies. kf_stats[k] is the number of times keyword k appears in the corpus
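The relationship between the two returned dictionaries can be illustrated with a minimal pure-Python sketch. The helper `keyword_stats` below is hypothetical (it is not the library's implementation and uses simple case-insensitive substring matching):

```python
def keyword_stats(docs, keywords):
    """Count, per keyword, how many documents contain it (df_stats)
    and how many times it occurs in total (kf_stats)."""
    df_stats = {}  # document frequency per keyword
    kf_stats = {}  # total keyword frequency in the corpus
    for k in keywords:
        counts = [d.lower().count(k.lower()) for d in docs]
        df_stats[k] = sum(c > 0 for c in counts)
        kf_stats[k] = sum(counts)
    return df_stats, kf_stats
```

Note that df_stats[k] <= kf_stats[k] always holds, since a document containing a keyword contributes at least one occurrence.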
- evaluate_filter(scores, target_col, n_max, s_min, verbose=False)¶
Compute evaluation metrics for the generation of the subcorpus
To do so, it requires self.df_corpus to have at least the following columns: id, title, description, target_bio, target_tic, target_ene
- Parameters:
scores (list of float) – list of unsorted scores
target_col (str) – Name of the column in the corpus dataframe that will be used as a reference for evaluation.
n_max (int or None, optional (default=1e100)) – Maximum number of elements in the output list. The default is a huge number that, in practice, means there is no limit
s_min (float, optional (default=0)) – Minimum score. Only elements strictly above s_min are selected
verbose (bool, optional) – If True, the evaluation results are logged at level INFO.
- Returns:
eval_scores – A dictionary of evaluation metrics. If there are no labels available for evaluation, an empty dictionary is returned.
- Return type:
dict
- filter_by_keywords(keywords, wt=2, n_max=1e+100, s_min=0, model_name='all-MiniLM-L6-v2', method='embedding')¶
Select documents from a given set of keywords
- Parameters:
keywords (list of str) – List of keywords
wt (float, optional (default=2)) – Weighting factor for the title components. Keyword matches with title words are weighted by this factor. Not used if self.path2embeddings is None
n_max (int or None, optional (default=1e100)) – Maximum number of elements in the output list. The default is a huge number that, in practice, means there is no limit
s_min (float, optional (default=0)) – Minimum score. Only elements strictly above s_min are selected
model_name (str, optional (default=‘all-MiniLM-L6-v2’)) – Name of the SBERT transformer model
method (str in {‘embedding’, ‘count’}) –
If ‘count’, documents are scored according to word counts
If ‘embedding’, scores are based on neural embeddings
- Returns:
ids (list) – List of ids of the selected documents
scores (list of float) – List of scores, one per document in the corpus
- filter_by_topics(T, doc_ids, topic_weights, n_max=1e+100, s_min=0)¶
Select documents with a significant presence of a given set of topics
- Parameters:
T (numpy.ndarray or scipy.sparse) – Topic matrix.
doc_ids (array-like) – Ids of the documents in the topic matrix. doc_ids[i] = ‘123’ means that document with id ‘123’ has topic vector T[i]
topic_weights (dict) – Dictionary {t_i: w_i}, where t_i is a topic index and w_i is the weight of the topic
n_max (int or None, optional (default=1e100)) – Maximum number of elements in the output list. The default is a huge number that, in practice, means there is no limit
s_min (float, optional (default=0)) – Minimum score. Only elements strictly above s_min are selected
- Returns:
ids – List of ids of the selected documents
- Return type:
list
- get_top_scores(scores, n_max=1e+100, s_min=0)¶
Select documents from the corpus whose score is strictly above a lower bound
- Parameters:
scores (array-like of float) – List of scores. It must be the same size as the number of docs in the corpus
n_max (int or None, optional (default=1e100)) – Maximum number of elements in the output list. The default is a huge number that, in practice, means there is no limit
s_min (float, optional (default=0)) – Minimum score. Only elements strictly above s_min are selected
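The selection rule combining s_min and n_max can be sketched in a few lines of plain Python. The helper `top_scores` below is illustrative only (the library works on document ids; here plain list indices stand in for them):

```python
def top_scores(scores, n_max=1e100, s_min=0):
    """Return indices of scores strictly above s_min, highest score
    first, truncated to at most n_max elements."""
    # Keep only elements strictly above the threshold
    idx = [i for i, s in enumerate(scores) if s > s_min]
    # Sort the survivors by decreasing score
    idx.sort(key=lambda i: scores[i], reverse=True)
    # Truncate to at most n_max elements
    return idx[:int(min(n_max, len(idx)))]
```

With the default n_max=1e100, the truncation step is effectively a no-op, which matches the "no limit in practice" behaviour described above.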
- make_PU_dataset(ids, scores=None)¶
Returns the labeled dataframe in the format required by the CorpusClassifier class
- Parameters:
ids (array-like) – ids of documents with positive labels
scores (array-like or None, optional) – A list or np.array of score values, one per row in self.df_corpus. It is used to fill the base_scores column in the output dataframe. They are expected to contain the scores used to select the positive labels for PU learning. Thus, the docs listed in ids should be those with the highest scores.
- Returns:
df_dataset – A pandas dataframe with three columns: id, text and labels.
- Return type:
pandas.DataFrame
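The PU-labeling scheme can be sketched with pandas as follows. The helper `make_pu_dataset` is hypothetical, and the way the text column is built (title concatenated with description) is an assumption, not necessarily what the library does:

```python
import pandas as pd

def make_pu_dataset(df_corpus, ids, scores=None):
    """Label a corpus for PU learning: documents whose id is in `ids`
    get label 1 (positive); the rest get label 0 (unlabeled)."""
    df = pd.DataFrame({
        "id": df_corpus["id"],
        # Assumption: text is the title joined with the description
        "text": df_corpus["title"] + ". " + df_corpus["description"],
        "labels": df_corpus["id"].isin(ids).astype(int),
    })
    if scores is not None:
        # Optional column with the scores used to pick the positives
        df["base_scores"] = scores
    return df
```

In PU learning, label 0 means "unlabeled" rather than "negative", which is why the positives should be the highest-scoring documents.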
- remove_docs_from_topics(T, df_metadata, col_id='id')¶
Removes, from a given topic-document matrix and its corresponding metadata dataframe, all documents that do not belong to the corpus
- Parameters:
T (numpy.ndarray or scipy.sparse) – Topic matrix (one column per topic)
df_metadata (pandas.DataFrame) – Dataframe of metadata. It must include a column with document ids
col_id (str, optional (default=’id’)) – Name of the column containing the document ids in df_metadata
- Returns:
T_out (numpy.ndarray or scipy.sparse) – Reduced topic matrix (after document removal)
df_out (pandas.DataFrame) – Metadata dataframe, after document removal
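The row-filtering logic amounts to applying one boolean mask to both the topic matrix and the metadata. The sketch below is illustrative: `corpus_ids` is passed explicitly here, whereas the class method obtains the valid ids from self.df_corpus:

```python
import numpy as np
import pandas as pd

def remove_docs_from_topics(T, df_metadata, corpus_ids, col_id="id"):
    """Keep only the rows of T (and of df_metadata) whose document id
    appears in corpus_ids."""
    # Boolean mask: True for rows whose id belongs to the corpus
    mask = df_metadata[col_id].isin(corpus_ids).to_numpy()
    return T[mask], df_metadata[mask].reset_index(drop=True)
```

Because the same mask is applied to both objects, row i of the reduced matrix still corresponds to row i of the reduced metadata.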
- score_by_keyword_count(keywords, wt=2)¶
Computes a score for every document in the corpus according to the frequency of occurrence of a given set of keywords
- Parameters:
keywords (list of str) – List of keywords
wt (float, optional (default=2)) – Weighting factor for the title components. Keyword matches with title words are weighted by this factor
- Returns:
score – List of scores, one per document in the corpus
- Return type:
list of float
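The title weighting can be made concrete with a small sketch. The helper `keyword_count_scores` is hypothetical (it assumes case-insensitive substring counts over separate title and description strings):

```python
def keyword_count_scores(titles, descriptions, keywords, wt=2):
    """Score each document by keyword occurrences; matches in the
    title are weighted by wt, matches in the description by 1."""
    scores = []
    for title, desc in zip(titles, descriptions):
        s = 0.0
        for k in keywords:
            s += wt * title.lower().count(k.lower())   # weighted title hits
            s += desc.lower().count(k.lower())         # plain body hits
        scores.append(s)
    return scores
```

With the default wt=2, a keyword occurring once in the title contributes as much as two occurrences in the description.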
- score_by_keywords(keywords, wt=2, model_name='all-MiniLM-L6-v2', method='embedding')¶
Computes a score for every document in the corpus according to the frequency of occurrence of a given set of keywords
- Parameters:
keywords (list of str) – List of keywords
wt (float, optional (default=2)) – Weighting factor for the title components. Keyword matches with title words are weighted by this factor. This argument is used only if self.path2embeddings is None
model_name (str, optional (default=‘all-MiniLM-L6-v2’)) – Name of the SBERT transformer model
method (str in {‘embedding’, ‘count’}) –
If ‘count’, documents are scored according to word counts
If ‘embedding’, scores are based on neural embeddings
- Returns:
score – List of scores, one per document in the corpus
- Return type:
list of float
- score_by_topics(T, doc_ids, topic_weights)¶
Computes a score for every document in the corpus according to the relevance of a weighted list of topics
- Parameters:
T (numpy.ndarray or scipy.sparse) – Topic matrix (one column per topic)
doc_ids (array-like) – Ids of the documents in the topic matrix. doc_ids[i] = ‘123’ means that document with id ‘123’ has topic vector T[i]
topic_weights (dict) – Dictionary {t_i: w_i}, where t_i is a topic index and w_i is the weight of the topic
- Returns:
score – List of scores, one per document in the corpus
- Return type:
list of float
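Given a topic matrix with one row per document, the natural reading of the weighted-topic score is a weighted sum of the selected topic columns. The sketch below illustrates that reading (it is not the library's implementation, which must also align T's rows with corpus ids via doc_ids):

```python
import numpy as np

def topic_scores(T, topic_weights):
    """score[d] = sum_i w_i * T[d, t_i] for a dense topic matrix T
    (one row per document, one column per topic)."""
    score = np.zeros(T.shape[0])
    for t, w in topic_weights.items():
        score += w * T[:, t]   # accumulate each weighted topic column
    return score
```

Documents whose topic mass concentrates on highly weighted topics receive the largest scores.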
- score_by_zeroshot(keyword)¶
Computes a score for every document in the corpus according to its relevance to a given keyword, as estimated by a pretrained zero-shot classifier
- Parameters:
keyword (str) – Keyword defining the target category
- Returns:
score – List of scores, one per document in the corpus
- Return type:
list of float
- class src.domain_classifier.preprocessor.CorpusProcessor(path2embeddings=None, path2zeroshot=None)¶
Bases:
object
A container of corpus preprocessing methods. It provides basic processing methods for a corpus of text documents. The input corpus must be given as a list of strings (or a pandas Series of strings).
- __init__(path2embeddings=None, path2zeroshot=None)¶
Initializes a preprocessor object
- Parameters:
path2embeddings (str or pathlib.Path or None, optional (default=None)) – Path to the folder containing the document embeddings. If None, no embeddings will be used and document scores will be based on word counts.
path2zeroshot (str or pathlib.Path or None, optional (default=None)) – Path to the folder containing the pretrained zero-shot model. If None, zero-shot classification will not be available.
- __weakref__¶
list of weak references to the object (if defined)
- compute_keyword_stats(corpus, keywords)¶
Computes keyword statistics
- Parameters:
corpus (list (or pandas.Series) of str) – Input corpus.
keywords (list of str) – List of keywords
- Returns:
df_stats (dict) – Dictionary of document frequencies per keyword. df_stats[k] is the number of docs containing keyword k
kf_stats (dict) – Dictionary of keyword frequencies. kf_stats[k] is the number of times keyword k appears in the corpus
- get_top_scores(scores, n_max=1e+100, s_min=0)¶
Select the elements from a given list of numbers that fulfill some conditions
- Parameters:
scores (array-like of float) – List of scores
n_max (int or None, optional (default=1e100)) – Maximum number of elements in the output list. The default is a huge number that, in practice, means there is no limit
s_min (float, optional (default=0)) – Minimum score. Only elements strictly above s_min are selected
- performance_metrics(scores, target, s_min, n_max)¶
Compute evaluation metrics for the generation of the subcorpus using a keyword
- Parameters:
scores (np.array) – Score values
target (np.array) – Target values
s_min (float, optional (default=0)) – Minimum score. Only elements strictly above s_min are selected
n_max (int or None, optional (default=1e100)) – Maximum number of elements in the output list. The default is a huge number that, in practice, means there is no limit
- Returns:
eval_scores – A dictionary of evaluation metrics.
- Return type:
dict
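The evaluation can be understood as scoring the selection rule "keep the top-scoring documents above s_min" against binary relevance targets. The sketch below computes precision and recall for that rule; the exact metrics reported by the library may differ, and the helper name is hypothetical:

```python
def subcorpus_metrics(scores, target, s_min=0, n_max=1e100):
    """Precision and recall of the selection 'score > s_min, keep at
    most n_max top-scoring docs' against binary targets (0/1)."""
    # Apply the same selection rule used to build the subcorpus
    idx = sorted((i for i, s in enumerate(scores) if s > s_min),
                 key=lambda i: scores[i], reverse=True)
    idx = idx[:int(min(n_max, len(idx)))]
    tp = sum(target[i] for i in idx)   # selected docs that are relevant
    n_pos = sum(target)                # all relevant docs
    return {
        "precision": tp / len(idx) if idx else 0.0,
        "recall": tp / n_pos if n_pos else 0.0,
    }
```

Raising s_min (or lowering n_max) typically trades recall for precision, which is the tension these metrics make visible.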
- score_docs_by_keyword_count(corpus, keywords)¶
Computes a score for every document in a given corpus according to the frequency of occurrence of a given set of keywords
- Parameters:
corpus (list (or pandas.Series) of str) – Input corpus.
keywords (list of str) – List of keywords
- Returns:
score – List of scores, one per document in corpus
- Return type:
list of float
- score_docs_by_keywords(corpus, keywords, model_name='all-MiniLM-L6-v2')¶
Computes a score for every document in a given corpus according to the frequency of occurrence of a given set of keywords
- Parameters:
corpus (list (or pandas.Series) of str) – Input corpus.
keywords (list of str) – List of keywords
model_name (str, optional (default=‘all-MiniLM-L6-v2’)) – Name of the SBERT transformer model
- Returns:
score – List of scores, one per document in corpus
- Return type:
list of float
- score_docs_by_zeroshot(corpus, keyword)¶
Computes a score for every document in a given corpus according to a given keyword and a pre-trained zero-shot classifier
- Parameters:
corpus (list (or pandas.Series) of str) – Input corpus.
keyword (str) – Keyword defining the target category
- Returns:
score – List of scores, one per document in corpus
- Return type:
list of float
Notes
Adapted from code contributed by BSC for the IntelComp project.