Taskmanager

class rdigraphs.labtaskmanager.LabTaskManager(path2project, paths2data)

Bases: SgTaskManager

Task Manager for the RDIgraph analyzer.

The behavior of this class depends on the state of the project, in dictionary self.state, with the following entries:

  • ‘isProject’ : If True, project created. Metadata variables loaded.

  • ‘configReady’ : If True, config file successfully loaded. Datamanager activated.

__init__(path2project, paths2data)

Initializes the LabTaskManager object

Parameters:
  • path2project (str) – Path to the graph processing project

  • paths2data (dict) – Paths to data sources

analyze_radius(corpus)

This method analyzes the generation of similarity (semantic) graphs. More specifically, it explores how several graph parameters vary as a function of the radius, for a fixed number of nodes, a given corpus, and several topic models (from the same corpus).

The radius is the bound on the JS distance used to sparsify the graph. The number of nodes is a subsample of the total number of nodes.

Parameters:

corpus (str) – Corpus. All topic models based on this corpus will be analyzed.
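The radius-based sparsification described above can be pictured with a small, self-contained sketch. It assumes the JS measure is the base-2 Jensen-Shannon divergence between rows of a doc-topic matrix (the library may use its square root or a different normalization). The names js_divergence and sparsify_by_radius are hypothetical and are not part of rdigraphs.

```python
import math


def js_divergence(p, q):
    """Base-2 Jensen-Shannon divergence between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL divergence.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return (kl(p, m) + kl(q, m)) / 2


def sparsify_by_radius(topics, radius):
    """Keep only the pairs (i, j) whose divergence does not exceed the radius."""
    edges = []
    for i in range(len(topics)):
        for j in range(i + 1, len(topics)):
            d = js_divergence(topics[i], topics[j])
            if d <= radius:
                edges.append((i, j, d))
    return edges


# Toy doc-topic matrix: rows 0 and 1 are identical, row 2 is disjoint from both.
topics = [[0.5, 0.5, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
close_pairs = sparsify_by_radius(topics, radius=0.1)
```

A smaller radius keeps fewer edges, which is why graph parameters such as density change as the radius is varied.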

analyze_sampling(corpus)

Analyze sampling for a given corpus

Parameters:

corpus (str) – Corpus. All topic models based on this corpus will be analyzed.

compute_citation_centralities(path, n=200)

Computes all centralities for a given graph and saves the top n nodes by centrality to an output file.

Parameters:

path (str) – Path to snode
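As an illustration of ranking nodes by a centrality and keeping the top n, the sketch below uses degree centrality on a plain edge list. Which centralities the method actually computes is not stated here, so degree centrality is an assumption, and top_degree_centrality is a hypothetical helper rather than a library function.

```python
from collections import Counter


def top_degree_centrality(edges, n=200):
    """Return the n nodes with the highest degree centrality.

    Only nodes that appear in at least one edge are considered.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n_nodes = len(degree)
    if n_nodes < 2:
        return []
    # Degree centrality: degree divided by the maximum possible degree.
    scores = {node: d / (n_nodes - 1) for node, d in degree.items()}
    # Sort by decreasing score; break ties by node name for reproducibility.
    return sorted(scores.items(), key=lambda kv: (-kv[1], str(kv[0])))[:n]


star = [("a", "b"), ("a", "c"), ("a", "d")]
ranking = top_degree_centrality(star, n=2)
```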

get_equivalent_classes(corpus)

Extracts basic information about the doc-topic matrices in the source folders.

For each matrix, plot the distribution of the number of nonzero topics

Parameters:

corpus (str) – Name of the corpus

get_keyword_descriptions(vocabfile, W, n_keywords=10)

Returns a description of each row of a weight matrix. Each column of the weight matrix is assumed to contain the weights associated with a single word specified in a given vocabulary.

Parameters:
  • vocabfile (str) – Path to vocabulary

  • W (numpy array (n_words x n)) – Weight matrix

  • n_keywords (int, optional (default=10)) – Number of keywords used to describe each row of the weight matrix
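A minimal sketch of the idea follows: each row is described by the words whose columns carry the largest weights. To stay self-contained it takes an id-to-word dictionary and a nested list instead of a vocabulary file and a numpy array. The comma-separated output format and the name keyword_descriptions are assumptions.

```python
def keyword_descriptions(vocab_id2w, W, n_keywords=10):
    """Describe each row of W by its n_keywords highest-weighted words."""
    descriptions = []
    for row in W:
        # Column indices sorted by decreasing weight.
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        words = [vocab_id2w[j] for j in top[:n_keywords]]
        descriptions.append(", ".join(words))
    return descriptions


vocab_id2w = {0: "graph", 1: "topic", 2: "node"}
W = [[0.1, 0.7, 0.2],
     [0.6, 0.1, 0.3]]
described = keyword_descriptions(vocab_id2w, W, n_keywords=2)
```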

get_source_info()

Extracts basic information about the doc-topic matrices in the source folders.

For each matrix, plot the distribution of the number of nonzero topics

read_vocabulary(vocab_filename)

Reads a vocabulary from file.

Parameters:

vocab_filename (str) – Path to vocabulary

Returns:

  • vocab_w2id (dict) – Vocabulary dictionary using words as keys {word_i : id_word_i}

  • vocab_id2w (dict) – Vocabulary dictionary using word ids as keys {i : word_i}
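The two returned dictionaries are inverse mappings of each other. The sketch below builds them from an iterable of lines, assuming one word per line. The actual file format is not specified here, and build_vocabulary is a hypothetical name.

```python
def build_vocabulary(lines):
    """Build word-to-id and id-to-word dictionaries from one word per line."""
    words = [line.strip() for line in lines if line.strip()]
    vocab_w2id = {word: i for i, word in enumerate(words)}
    vocab_id2w = {i: word for i, word in enumerate(words)}
    return vocab_w2id, vocab_id2w


vocab_w2id, vocab_id2w = build_vocabulary(["graph\n", "topic\n", "\n", "node\n"])
```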

show_all_citation_centralities(path, n=400)

Shows all available local parameters for a given graph, and saves them in output files containing the top n nodes for each parameter.

Parameters:

path (str) – Path to snode

show_equivalent_classes()

Show equivalent classes

show_validation_results_e(path)

Shows the results of the topic model validation in self.validate_topic_models()

Parameters:

path (str) – Path to data

validate_all_models_cd(d)

Analyzes the influence of the topic model and the similarity graph parameters on the quality of the community structures

The similarity graph is validated using a citations graph.

Parameters:

d (str) – Similarity measure. Options are JS, He(llinger) or l1.
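For reference, two of the listed measures are simple to state for discrete probability vectors. The sketch below uses the standard definitions of the Hellinger and l1 distances. Whether the library applies the same normalization (for example, the factor 1/2 inside the Hellinger square root) is an assumption.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(0.5 * total)


def l1(p, q):
    """l1 distance: sum of absolute componentwise differences, in [0, 2]."""
    return sum(abs(a - b) for a, b in zip(p, q))


disjoint_h = hellinger([1.0, 0.0], [0.0, 1.0])
disjoint_l1 = l1([1.0, 0.0], [0.0, 1.0])
```

Both distances are zero for identical topic vectors and reach their maximum for vectors with disjoint support.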

validate_topic_models(path2models)

Calls self._validate_topic_models for a specific value of the radius and the sampling factor.

Parameters:

path2models (str) – Path specifying the class of models to validate.

visualize_bigraph()

Generate a bigraph visualization for a demo

class rdigraphs.sgtaskmanager.SgTaskManager(path2project, paths2data={}, path2source=None, metadata_fname='metadata.pkl', config_fname='parameters.yaml', keep_active=False)

Bases: object

Task Manager for the RDIgraph analyzer.

The behavior of this class depends on the state of the project, in dictionary self.state, with the following entries:

  • ‘isProject’ : If True, project created. Metadata variables loaded.

  • ‘configReady’ : If True, config file loaded. Datamanager activated.

__init__(path2project, paths2data={}, path2source=None, metadata_fname='metadata.pkl', config_fname='parameters.yaml', keep_active=False)

Opens a corpus classification project.

Parameters:
  • path2project (str) – Path to the project

  • paths2data (dict, optional (default={})) – Dictionary of paths

  • metadata_fname (str, optional (default=’metadata.pkl’)) – Name of the metadata file

  • config_fname (str, optional (default=’parameters.yaml’)) – Name of the file containing the configuration variables

  • keep_active (bool, optional (default=False)) – If False, graphs are removed from memory (but not from files) after the tasks.

__weakref__

list of weak references to the object (if defined)

community_metric(path, community, parameter)

Compute a global metric for a graph partition resulting from a community detection algorithm

Parameters:
  • community (str) – Community detection algorithm

  • parameter (str) – Metric to compute

compare_communities(path1, comm1, path2, comm2, metric)

Compare two graph partitions

Parameters:
  • path1 (str) – Path to 1st snode

  • comm1 (str) – Name of the partition from 1st snode

  • path2 (str) – Path to 2nd snode

  • comm2 (str) – Name of the partition from 2nd snode

  • metric (str) – Metric used for the comparison
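To make the notion of comparing two partitions concrete, the sketch below computes the Rand index, the fraction of node pairs on which both partitions agree (both together or both apart). The metrics actually supported by this method are not listed here, so the Rand index is only an illustrative choice, and rand_index is a hypothetical helper.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of node pairs treated consistently by both partitions."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Same partition with renamed labels: perfect agreement.
same = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Partitions that split the nodes along different lines.
different = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index depends only on which nodes are grouped together, not on the community labels themselves.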

create(f_struct=None)

Creates an RDI graph analysis project. To do so, it defines the main folder structure, and creates (or cleans) the project folder, specified in self.path2project.

Parameters:

f_struct (dict or None, optional (default=None)) – Contains all information related to the structure of project files and folders: paths (relative to ppath), file names, suffixes, prefixes or extensions that could be used to define other files or folders. (default names are used when not given)

If None, default names are given to the whole folder tree

detectCommunities(algorithm, path, comm_label=None)

Applies a community detection algorithm to a given snode

Parameters:
  • algorithm (str) – Community detection algorithm

  • path (str) – Path to snode

disambiguate_node()

Disambiguate a given node from a given graph based on the topological structure of the related graphs and bigraphs in the supergraph

Parameters:

path (str) – Path to snode

display_bigraph(path2sedge, s_att1, s_att2, t_att, t_att2=None, template_html='bigraph_template.html', template_js='make_bigraph_template.js')

Generates a bigraph visualization based on halo.

Parameters:
  • path2sedge (str) – Path to the bipartite graph. The bigraph (sedge) and both the source and target snodes must already exist in the supergraph structure. The names of the sedge and of the source and target snodes are taken from the folder name.

  • s_att1 (str) – Name of the first attribute of the source snode. It should be a string attribute (not tested with others)

  • s_att2 (str) – Name of the second attribute of the source node. It should be a string attribute, though it could work with integers too (not fully tested).

  • t_att (str) – Name of the attribute of the target node. It should be a string attribute (not tested with others)

  • t_att2 (str or None, optional (default=None)) – Name of the second attribute of the target node. It should be a string attribute (not tested with others)

  • template_html (str) – Name of the template html file.

  • template_js (str) – Name of the template js file

equivalence_graph(path)

This method manages the equivalence graph, which is a graph that groups all nodes with the same topic vector into a single equivalence class.

Parameters:

path (str) – Path to snode
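The grouping behind an equivalence graph can be sketched as follows: nodes whose topic vectors are exactly equal fall into the same class. This is a simplified illustration on a plain dictionary. The function name equivalence_classes and the exact-equality criterion are assumptions, not the library's implementation.

```python
from collections import defaultdict


def equivalence_classes(topic_vectors):
    """Group node names by identical topic vector."""
    classes = defaultdict(list)
    for node, vector in topic_vectors.items():
        # Tuples are hashable, so they can serve as dictionary keys.
        classes[tuple(vector)].append(node)
    return list(classes.values())


topic_vectors = {
    "doc_a": [0.5, 0.5],
    "doc_b": [1.0, 0.0],
    "doc_c": [0.5, 0.5],
}
classes = equivalence_classes(topic_vectors)
```

Keeping one representative per class yields the reduced topic matrix without repeated rows.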

export_bigraph(path2graph, label_source, label_target)

Export bipartite graph from csv files to neo4J.

Parameters:

path2graph (str) – Path to graph

export_graph(path2graph, label_nodes)

Export graphs from csv files to neo4J.

Parameters:

path2graph (str) – Path to graph

generate_minigraph()

Generates a hand-made minigraph for testing purposes.

getGDBstruct()

Get structure of the graph database

Returns:

  • snodes (list) – Name of snodes

  • sedges (list) – Name of sedges

get_Neo4J_sedges()

Get sedges in Neo4J db

Returns:

sedges – Name of sedges

Return type:

list

get_Neo4J_snodes()

Get snodes in Neo4J db

Returns:

snodes – Name of snodes

Return type:

list

get_attributes(path)

Returns attributes of the graph in path. At this time, this version returns communities only. Eventually, it should be able to return all attributes that could be transformed into independent snodes.

Parameters:

path (str) – Path to data

Returns:

c – List of attributes

Return type:

list

get_communities(path)

Returns community models of the graph in path.

Parameters:

path (str) – Path to data

Returns:

atts – List of community models computed for the graph

Return type:

list

get_graphs_with_features(*args)

Returns a list of the available snodes with saved attributes

get_local_features(path)

Returns the local features available for the graph in path.

Parameters:

path (str) – Path to data

Returns:

atts – List of local features computed for the graph

Return type:

list

get_names_of_SQL_dbs()

Returns the list of available databases

get_names_of_dataset_tables()

Returns the list of available tables with raw graph data

get_source_atts(path, *args)

Returns attributes of the source snode for the bipartite graph in path

Parameters:
  • path (str) – Path to data

  • args (tuple, optional) – Possible extra arguments that are ignored

Returns:

atts – List of available attributes at the selected snode

Return type:

list

get_sql_table_names(graph, db)

Get tables in the given database

Parameters:

db (str) – Name of the database

Returns:

table_names – Names of the tables in the database

Return type:

list of str

get_table_atts(graph, db, table, *args)

Get table attributes in the given database

Parameters:
  • graph (str) – Not used

  • db (str) – Name of the database

  • table (str) – Name of the table to read attributes

Returns:

table_names – Names of the tables in the database

Return type:

list of str

get_target_atts(path, *args)

Returns attributes of the target snode for the bipartite graph in path

Parameters:
  • path (str) – Path to data

  • args (tuple, optional) – Possible extra arguments that are ignored

Returns:

atts – List of available attributes at the selected snode

Return type:

list

graph_layout(path2snode, attribute)

Compute the layout of the given graph

Parameters:
  • path2snode (str) – Path to snode

  • attribute (str) – Snode attribute used to color the graph

import_SCOPUS_citations_graph(type_of_graph)

Loads a citations graph from table citations in SCOPUS SQL database

The graph is not restricted to the docs with Spanish authors. It includes all items in the citations table (about 5.8 M nodes).

No attributes are included, because the SCOPUS database contains attributes from a small subset of papers.

Parameters:

type_of_graph (str {‘undirected’, ‘cite_to’, ‘cited_by’}) – Type of graph

import_SCOPUS_citations_subgraph(type_of_graph)

Loads a citations subgraph from table ‘citations’ in SCOPUS SQL database

The subgraph contains the nodes with attributes in table ‘document’ from the same database.

Parameters:

type_of_graph (str {‘undirected’, ‘cite_to’, ‘cited_by’}) – Type of graph

import_agents(path2tables, path2snode)

Import agents

Parameters:
  • path2tables (str) – Path to tables

  • path2snode (str) – Path to snode

import_and_infer_sim_graph(path, sim, n0=None, n_epn=None, label=None)

This method manages the generation of similarity (semantic) graphs.

Parameters:
  • path (str) – Path to the model

  • sim (str) – Similarity measure

  • n0 (int, float or None, optional (default=None)) – Number of nodes. If None, it is requested from the user. If 0 < n0 < 1, it is the fraction of the total number of nodes. If 0, all nodes are taken.

  • n_epn (int or None, optional (default=None)) – Average number of edges per node. If None, it is requested to the user. If 0, 10 nodes are taken.
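The convention for the number of nodes can be summarized in a few lines. The sketch reads the fractional case as applying to n0, rounds the fraction to the nearest integer, and caps the result at the total number of nodes. These three choices are assumptions, and resolve_sample_size is a hypothetical helper that does not cover the interactive None case.

```python
def resolve_sample_size(n0, n_total):
    """Translate n0 into an absolute number of nodes to sample."""
    if n0 == 0:
        # Zero means "take all nodes".
        return n_total
    if 0 < n0 < 1:
        # A value in (0, 1) is a fraction of the total number of nodes.
        return int(round(n0 * n_total))
    # Otherwise n0 is an absolute count, capped at the graph size.
    return min(int(n0), n_total)


all_nodes = resolve_sample_size(0, 1000)
quarter = resolve_sample_size(0.25, 1000)
fixed = resolve_sample_size(200, 1000)
```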

import_co_citations_graph()

Loads a co-citations-graph (only for ACL) and saves it in a new snode

import_node_atts(path, dbname, table, att, att_ref)

Load attributes from a given table from a SQL database and add them to a given snode.

Parameters:
  • path (str) – Path to the graph to add the new attribute

  • dbname (str) – Type of the database storing the data

  • table (str) – Name of the table in the given db that contains the attribute

  • att (str) – Name of the attribute

  • att_ref (str) – Name of the attribute containing the node identifier

import_nodes_and_model(path)

This method manages the generation of similarity (semantic) graphs.

Parameters:

path (str) – Path to the model

import_snode_from_table(table_name)

inferBGfromA(path, attribute, t_label=None, e_label=None)

Infer bipartite graph from a categorical attribute

Parameters:
  • path (str) – Path to snode

  • attribute (str) – Name of the snode attribute used to generate the bipartite graph

  • t_label (str or None, optional (default=None)) – Name of the target s_node

  • e_label (str or None, optional (default=None)) – Name of the bipartite graph
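A bipartite graph inferred from a categorical attribute links each source node to the node that represents its attribute value. The following sketch shows this on a plain dictionary. The function name and the returned structure are illustrative assumptions.

```python
def bipartite_from_attribute(node_attribute):
    """Build target nodes and source-target edges from a categorical attribute."""
    # One target node per distinct attribute value.
    targets = sorted(set(node_attribute.values()))
    # One edge from every source node to its attribute value.
    edges = [(node, value) for node, value in node_attribute.items()]
    return targets, edges


areas = {"paper_1": "ML", "paper_2": "NLP", "paper_3": "ML"}
targets, edges = bipartite_from_attribute(areas)
```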

inferSimBG()

Not available

inferTransit(path_xm, path_my)

Infer transitive graph from two bipartite graphs

Parameters:
  • path_xm (str) – Path to first bipartite graph (sedge)

  • path_my (str) – Path to second bipartite graph (sedge)
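Composing two bipartite graphs X-M and M-Y yields a graph X-Y whose edges connect nodes that share at least one intermediate node in M. The sketch below weights each X-Y edge by the number of shared intermediate nodes. This weighting and the name transitive_edges are assumptions, since the weighting used by the library is not described here.

```python
def transitive_edges(edges_xm, edges_my):
    """Compose X-M and M-Y edge lists into weighted X-Y edges."""
    # Index the second graph by its intermediate node.
    m_to_y = {}
    for m, y in edges_my:
        m_to_y.setdefault(m, []).append(y)

    weights = {}
    for x, m in edges_xm:
        for y in m_to_y.get(m, []):
            # Each shared intermediate node adds one unit of weight.
            weights[(x, y)] = weights.get((x, y), 0) + 1
    return weights


edges_xm = [("a", "m1"), ("a", "m2"), ("b", "m2")]
edges_my = [("m1", "y1"), ("m2", "y1"), ("m2", "y2")]
transit = transitive_edges(edges_xm, edges_my)
```

In matrix terms this corresponds to multiplying the two biadjacency matrices.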

infer_eq_simgraph(path, sim)

This method manages the generation of an equivalence similarity (semantic) graph.

It is similar to a concatenation of self.equivalence_graph() (to transform the original topic matrix into the reduced matrix without row repetitions) and infer_sim_graph() (to compute the similarity graph).

Parameters:
  • path (str) – Path to the model

  • sim (str) – Similarity measure

infer_ppr_graph(path)

Compute a transductive graph from a snode and a sedge

Parameters:

path (str) – Path to the sedge (snode is inferred from the sedge name)

infer_sim_bigraph(s_label, t_label, sim)

This method manages the generation of similarity (semantic) bipartite graphs.

It assumes that the feature vectors in source and target nodes are comparable.

Parameters:
  • s_label (str) – Name of the source graph (it must contain a feature matrix)

  • t_label (str) – Name of the target graph (it must contain a feature matrix that is comparable to that of the source graph)

  • sim (str) – Similarity measure

infer_sim_graph(path2snode, sim, n0=None, n_epn=None)

This method manages the generation of similarity (semantic) graphs.

Parameters:
  • path2snode (str) – Path to the snode

  • sim (str) – Similarity measure

  • n0 (int, float or None, optional (default=None)) – Number of nodes. If None, it is requested from the user. If 0 < n0 < 1, it is the fraction of the total number of nodes. If 0, all nodes are taken.

  • n_epn (int or None, optional (default=None)) – Average number of edges per node. If None, it is requested to the user. If 0, 10 nodes are taken.

largest_community_subgraph(path, comm)

Subsample graph taking the nodes from the largest community.

Parameters:
  • path (str) – Path to graph

  • comm (str) – Name of the community
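Subsampling by the largest community amounts to finding the most frequent community label and keeping only the nodes and edges inside it. A minimal sketch on plain Python structures follows. The function signature is hypothetical and differs from the path-based method above.

```python
from collections import Counter


def largest_community_subgraph(edges, community):
    """Keep the nodes of the largest community and the edges among them."""
    largest_label, _ = Counter(community.values()).most_common(1)[0]
    keep = {node for node, label in community.items() if label == largest_label}
    sub_edges = [(u, v) for u, v in edges if u in keep and v in keep]
    return keep, sub_edges


community = {"a": 0, "b": 0, "c": 0, "d": 1}
edges = [("a", "b"), ("b", "c"), ("c", "d")]
kept_nodes, kept_edges = largest_community_subgraph(edges, community)
```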

load(f_struct={})

Loads an existing project, by reading the metadata file in the project folder.

It can be used to modify file or folder names, or paths, by specifying the new names/paths in the f_struct dictionary.

Parameters:

f_struct (dict or None, optional (default=None)) – Contains all information related to the structure of project files and folders: paths (relative to ppath), file names, suffixes, prefixes or extensions that could be used to define other files or folders. (default names are used when not given)

If None, default names are given to the whole folder tree

local_graph_analysis(parameter, path)

Computes a local parameter for a snode

Parameters:
  • parameter (str) – Local parameter to compute

  • path (str) – Path to snode

remove_isolated_nodes(path)

Remove isolated nodes

Parameters:

path (str) – Path to snode
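An isolated node is one that takes part in no edge. The sketch below filters such nodes from a node list given an edge list. It is a simplified stand-in for the method, with a hypothetical name, and it treats a node with only a self-loop as connected.

```python
def drop_isolated_nodes(nodes, edges):
    """Return the nodes that appear in at least one edge, in original order."""
    connected = {node for edge in edges for node in edge}
    return [node for node in nodes if node in connected]


nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "d")]
remaining = drop_isolated_nodes(nodes, edges)
```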

remove_snode_attributes(path, att)

Removes an attribute from a given snode.

Parameters:
  • path (str) – Path to the graph where the attribute must be removed

  • att (str) – Name of the attribute

resetGDBdata(option, snodes, sedges)

Reset (drop and create empty) tables from the database.

Parameters:
  • option (str) – Selected node or edge to reset

  • snodes (list) – List of available nodes

  • sedges (list) – List of available sedges

reset_Neo4J()

Reset the whole database

reset_Neo4J_sedge(sedge)

Reset (drop and create empty) tables from the database.

Parameters:

sedge (str) – Selected sedge to reset

reset_Neo4J_snode(snode)

Reset (drop and create empty) tables from the database.

Parameters:

snode (str) – Selected snode to reset

reset_sedge(path)

Reset sedge in path

Parameters:

path (str) – Path to sedge

reset_snode(path)

Reset snode in path

Parameters:

path (str) – Path to snode

save_metadata()

Save metadata into a pickle file

set_logs()

Configure logging messages.

setup()

Set up the classification project. To do so:

  1. Loads the configuration file and initializes the data manager.

  2. Creates a DB table.

showGDBdata(option, snodes, sedges)

Print a general overview of the selected database

Parameters:
  • option (str) – Name of the db table

  • snodes (list) – List of snodes

  • sedges (list) – List of sedges

showSDBdata(option)

Print a general overview of the selected source (SQL) database

Parameters:

option (str) – Name of the db table

show_Neo4J()

Print a general overview of the whole database

show_Neo4J_sedge(sedge)

Print a general overview of the selected sedge

Parameters:

sedge (str) – Name of the sedge

show_Neo4J_snode(snode)

Print a general overview of the selected snode

Parameters:

snode (str) – Name of the snode

show_SuperGraph()

Show current supergraph structure

show_sedge(path2sedge)

A quick preview of a superedge.

Parameters:

path2sedge (str) – Path to the superedge

show_snode(path2snode)

A quick preview of a supernode.

Parameters:

path2snode (str) – Path to the supernode

show_top_nodes(path, feature)

Shows a reduced list of nodes from a given graph, ranked by the value of a single feature

Parameters:
  • path (str) – Path to the graph

  • feature (str) – Name of the local feature

subsample_graph(path, mode)

Subsample graph

Parameters:
  • path (str) – Path to graph

  • mode (str) – If ‘newgraph’, create a new snode with the subgraph

transduce(path, order)

Compute a transductive graph from a snode and a sedge

Parameters:
  • path (str) – Path to the sedge (snode is inferred from the sedge name)

  • order (int) – Order parameter of the transduced graph

update_folders(f_struct=None)

Updates the project folder structure using the file and folder names in f_struct.

Parameters:

f_struct (dict or None, optional (default=None)) – Contains all information related to the structure of project files and folders: paths (relative to ppath), file names, suffixes, prefixes or extensions that could be used to define other files or folders. (default names are used when not given)

If None, default names are given to the whole folder tree