Taskmanager
- class rdigraphs.labtaskmanager.LabTaskManager(path2project, paths2data)
Bases:
SgTaskManager
Task Manager for the RDIgraph analyzer.
The behavior of this class depends on the state of the project, in dictionary self.state, with the following entries:
- ‘isProject’ : If True, project created. Metadata variables loaded.
- ‘configReady’ : If True, config file successfully loaded. Datamanager activated.
- __init__(path2project, paths2data)
Initializes the LabTaskManager object
- Parameters:
path2project (str) – Path to the graph processing project
paths2data (dict) – Paths to data sources
- analyze_radius(corpus)
This method analyzes the generation of similarity (semantic) graphs. More specifically, it explores how several graph parameters vary as a function of the radius, for a fixed number of nodes, a given corpus and several topic models (all from the same corpus).
The radius is the bound on the JS distance used to sparsify the graph. The number of nodes is a subsample of the total number of nodes.
- Parameters:
corpus (str) – Corpus. All topic models based on this corpus will be analyzed.
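To illustrate the role of the radius, the sketch below sparsifies a small graph by keeping only the pairs of nodes whose JS distance does not exceed a given bound. It is a minimal illustration and not the code used by LabTaskManager: the helper names (js_divergence, sparsify_by_radius), the choice of the square root of the base-2 JS divergence as the distance, and the sample topic vectors are all assumptions.

```python
import math
from itertools import combinations


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two probability vectors."""
    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the sum
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def sparsify_by_radius(vectors, radius):
    """Return the edges (i, j, distance) whose JS distance is <= radius."""
    edges = []
    for i, j in combinations(range(len(vectors)), 2):
        # Assumption: the JS distance is the square root of the divergence
        d = math.sqrt(max(js_divergence(vectors[i], vectors[j]), 0.0))
        if d <= radius:
            edges.append((i, j, d))
    return edges


# Hypothetical doc-topic vectors (each row sums to 1)
topics = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
]
edges_small = sparsify_by_radius(topics, radius=0.2)
edges_large = sparsify_by_radius(topics, radius=1.0)
print(len(edges_small), len(edges_large))
```

A smaller radius keeps fewer edges, which is the trade-off this method explores across topic models.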
- analyze_sampling(corpus)
Analyze sampling for a given corpus
- Parameters:
corpus (str) – Corpus. All topic models based on this corpus will be analyzed.
- compute_citation_centralities(path, n=200)
Computes all centralities for a given graph, and saves the top n nodes by centrality in an output file
- Parameters:
path (str) – Path to snode
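As an illustration of what ranking nodes by centrality involves, the sketch below computes degree centrality from an edge list and keeps the top n nodes. The set of centralities actually computed by this method is not listed here, so degree centrality is only an assumed example, and the names degree_centrality and top_n are hypothetical.

```python
from collections import Counter


def degree_centrality(edges):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n = len(degree)
    if n < 2:
        return {}
    return {node: d / (n - 1) for node, d in degree.items()}


def top_n(centrality, n):
    """Return the n (node, value) pairs with the highest centrality."""
    ranked = sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]


# Hypothetical citation edges
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
centrality = degree_centrality(edges)
print(top_n(centrality, n=2))
```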
- get_equivalent_classes(corpus)
Extracts basic information about the doc-topic matrices in the source folders.
For each matrix, plot the distribution of the number of nonzero topics
- Parameters:
corpus (str) – Name of the corpus
- get_keyword_descriptions(vocabfile, W, n_keywords=10)
Returns a description of each row of a weight matrix. Each column of the weight matrix is assumed to contain the weights associated with a single word specified in a given vocabulary
- Parameters:
vocabfile (str) – Path to vocabulary
W (numpy array (n_words x n)) – Weight matrix
n_keywords (int, optional (default=10)) – Number of keywords used to describe each row of the weight matrix
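The idea of describing a row by its keywords can be sketched as follows: for each row, take the columns with the largest weights and map them to words. This is a hedged sketch, not the library implementation. It assumes that columns index words, takes an already loaded id-to-word dictionary instead of a vocabulary file, and returns a list of word lists; the function name keyword_descriptions and the sample data are hypothetical.

```python
def keyword_descriptions(vocab_id2w, W, n_keywords=10):
    """Describe each row of W by its n_keywords highest-weighted words."""
    descriptions = []
    for row in W:
        # Column indices sorted by decreasing weight
        top_cols = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        descriptions.append([vocab_id2w[j] for j in top_cols[:n_keywords]])
    return descriptions


# Hypothetical vocabulary (column index -> word) and weight matrix
vocab_id2w = {0: "graph", 1: "topic", 2: "node", 3: "corpus"}
W = [
    [0.1, 0.6, 0.05, 0.25],
    [0.5, 0.1, 0.4, 0.0],
]
descriptions = keyword_descriptions(vocab_id2w, W, n_keywords=2)
print(descriptions)
```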
- get_source_info()
Extracts basic information about the doc-topic matrices in the source folders.
For each matrix, plot the distribution of the number of nonzero topics
- read_vocabulary(vocab_filename)
Reads a vocabulary from file.
- Parameters:
vocab_filename (str) – Path to vocabulary
- Returns:
vocab_w2id (dict) – Vocabulary dictionary using words as keys {word_i : id_word_i}
vocab_id2w (dict) – Vocabulary dictionary using word ids as keys {i : word_i}
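The two returned dictionaries are inverse mappings of each other. The sketch below shows one way to build them, assuming a vocabulary file with one word per line where the line index is the word id. The file format is an assumption (the actual format is not specified here), and read_vocab is a hypothetical stand-in for the method.

```python
import os
import tempfile


def read_vocab(vocab_filename):
    """Build word->id and id->word dictionaries from a vocabulary file.

    Assumes one word per line; the id of a word is its line index.
    """
    with open(vocab_filename, encoding="utf-8") as f:
        words = [line.strip() for line in f if line.strip()]
    vocab_w2id = {w: i for i, w in enumerate(words)}
    vocab_id2w = {i: w for i, w in enumerate(words)}
    return vocab_w2id, vocab_id2w


# Write a small hypothetical vocabulary file and read it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("graph\ntopic\nnode\n")
    vocab_w2id, vocab_id2w = read_vocab(path)

print(vocab_w2id, vocab_id2w)
```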
- show_all_citation_centralities(path, n=400)
Shows all available local parameters for a given graph, and saves them in output files containing the top n nodes for each parameter
- Parameters:
path (str) – Path to snode
- show_equivalent_classes()
Show equivalent classes
- show_validation_results_e(path)
Shows the results of the topic model validation in self.validate_topic_models()
- Parameters:
path (str) – Path to data
- validate_all_models_cd(d)
Analyzes the influence of the topic model and the similarity graph parameters on the quality of the community structures
The similarity graph is validated using a citations graph.
- Parameters:
d (str) – Similarity measure: options are: JS, He(llinger) or l1.
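For reference, the three options named for d can be written as functions of two probability vectors. The sketch below uses the textbook definitions (base-2 Jensen-Shannon divergence, Hellinger distance and l1 distance); whether the library uses exactly these normalizations is an assumption, and the MEASURES dictionary is a hypothetical convenience.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two probability vectors."""
    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the sum
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def hellinger_distance(p, q):
    """Hellinger distance, bounded in [0, 1]."""
    s = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(s / 2)


def l1_distance(p, q):
    """Sum of absolute differences, bounded in [0, 2] for probability vectors."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))


# Keys mirror the option names listed for parameter d
MEASURES = {"JS": js_divergence, "He": hellinger_distance, "l1": l1_distance}

p = [0.7, 0.2, 0.1]
q = [0.1, 0.1, 0.8]
for name, measure in MEASURES.items():
    print(name, round(measure(p, q), 4))
```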
- validate_topic_models(path2models)
Calls self._validate_topic_models for a specific value of the radius and the sampling factor.
- Parameters:
path2models (str) – Path specifying the class of models to validate.
- visualize_bigraph()
Generate a bigraph visualization for a demo
- class rdigraphs.sgtaskmanager.SgTaskManager(path2project, paths2data={}, path2source=None, metadata_fname='metadata.pkl', config_fname='parameters.yaml', keep_active=False)
Bases:
object
Task Manager for the RDIgraph analyzer.
The behavior of this class depends on the state of the project, in dictionary self.state, with the following entries:
- ‘isProject’ : If True, project created. Metadata variables loaded.
- ‘configReady’ : If True, config file loaded. Datamanager activated.
- __init__(path2project, paths2data={}, path2source=None, metadata_fname='metadata.pkl', config_fname='parameters.yaml', keep_active=False)
Opens a corpus classification project.
- Parameters:
path2project (str) – Path to the project
paths2data (dict, optional (default={})) – Dictionary of paths
metadata_fname (str, optional (default=’metadata.pkl’)) – Name of the metadata file
config_fname (str, optional (default=’parameters.yaml’)) – Name of the file containing the configuration variables
keep_active (bool, optional (default=False)) – If False, graphs are removed from memory (but not from files) after the tasks.
- __weakref__
list of weak references to the object (if defined)
- community_metric(path, community, parameter)
Compute a global metric for a graph partition resulting from a community detection algorithm
- Parameters:
community (str) – Community detection algorithm
parameter (str) – Metric to compute
- compare_communities(path1, comm1, path2, comm2, metric)
Compare two graph partitions
- Parameters:
path1 (str) – Path to 1st snode
comm1 (str) – Name of the partition from 1st snode
path2 (str) – Path to 2nd snode
comm2 (str) – Name of the partition from 2nd snode
metric (str) – Metric used for the comparison
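To make the comparison of two partitions concrete, the sketch below computes the Rand index: the fraction of node pairs on which both partitions agree (together in both, or separated in both). The metrics actually accepted by this method are not listed here, so the Rand index is only an assumed example, and the function name and sample labels are hypothetical.

```python
from itertools import combinations


def rand_index(labels1, labels2):
    """Fraction of node pairs treated consistently by both partitions."""
    # Only nodes present in both partitions can be compared
    nodes = sorted(set(labels1) & set(labels2))
    pairs = list(combinations(nodes, 2))
    if not pairs:
        raise ValueError("At least two common nodes are required")
    agree = sum(
        (labels1[u] == labels1[v]) == (labels2[u] == labels2[v])
        for u, v in pairs
    )
    return agree / len(pairs)


# Hypothetical partitions of the same four nodes
comm1 = {"a": 0, "b": 0, "c": 1, "d": 1}
comm2 = {"a": "x", "b": "x", "c": "x", "d": "y"}
print(rand_index(comm1, comm2))
```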
- create(f_struct=None)
Creates an RDI graph analysis project. To do so, it defines the main folder structure, and creates (or cleans) the project folder, specified in self.path2project.
- Parameters:
f_struct (dict or None, optional (default=None)) – Contains all information related to the structure of project files and folders: paths (relative to ppath), file names, suffixes, prefixes or extensions that could be used to define other files or folders. (default names are used when not given)
If None, default names are given to the whole folder tree
- detectCommunities(algorithm, path, comm_label=None)
Applies a community detection algorithm to a given snode
- Parameters:
algorithm (str) – Community detection algorithm
path (str) – Path to snode
- disambiguate_node()
Disambiguate a given node from a given graph based on the topological structure of the related graphs and bigraphs in the supergraph
- Parameters:
path (str) – Path to snode
- display_bigraph(path2sedge, s_att1, s_att2, t_att, t_att2=None, template_html='bigraph_template.html', template_js='make_bigraph_template.js')
Generates a bigraph visualization based on halo.
- Parameters:
path2sedge (str) – Path to the bipartite graph. The bigraph (sedge) and both the source and target snodes must already exist in the supergraph structure. The name of the sedge and the source and target snodes is taken from the folder name.
s_att1 (str) – Name of the first attribute of the source snode. It should be a string attribute (not tested with others)
s_att2 (str) – Name of the second attribute of the source node. It should be a string attribute, though it could work with integers too (not fully tested).
t_att (str) – Name of the attribute of the target node. It should be a string attribute (not tested with others)
t_att2 (str or None, optional (default=None)) – Name of the second attribute of the target node. It should be a string attribute (not tested with others)
template_html (str) – Name of the template html file.
template_js (str) – Name of the template js file
- equivalence_graph(path)
This method manages the equivalence graph, which is a graph that connects all nodes with the same topic vector into its equivalence class.
- Parameters:
path (str) – Path to snode
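The notion of an equivalence class can be sketched by grouping the nodes that share exactly the same topic vector. This is an illustration only: the function name, the dictionary input and the sample vectors are assumptions, and the actual method operates on the snode stored in path.

```python
from collections import defaultdict


def equivalence_classes(topic_vectors):
    """Group node ids that have identical topic vectors."""
    groups = defaultdict(list)
    for node, vec in topic_vectors.items():
        # Tuples are hashable, so they can index the groups
        groups[tuple(vec)].append(node)
    return list(groups.values())


# Hypothetical topic vectors: nodes "a" and "c" are identical
topic_vectors = {
    "a": [0.5, 0.5],
    "b": [1.0, 0.0],
    "c": [0.5, 0.5],
}
classes = equivalence_classes(topic_vectors)
print(classes)
```

Each class can then be treated as a single node with one representative topic vector, which removes repeated rows from the topic matrix.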
- export_bigraph(path2graph, label_source, label_target)
Export bipartite graph from csv files to neo4J.
- Parameters:
path2graph (str) – Path to graph
- export_graph(path2graph, label_nodes)
Export graphs from csv files to neo4J.
- Parameters:
path2graph (str) – Path to graph
- generate_minigraph()
Generates a hand-made minigraph for testing purposes.
- getGDBstruct()
Get structure of the graph database
- Returns:
snodes (list) – Name of snodes
sedges (list) – Name of sedges
- get_Neo4J_sedges()
Get sedges in Neo4J db
- Returns:
sedges – Name of sedges
- Return type:
list
- get_Neo4J_snodes()
Get snodes in Neo4J db
- Returns:
snodes – Name of snodes
- Return type:
list
- get_attributes(path)
Returns attributes of the graph in path. At this time, this version returns communities only. Eventually, it should be able to return all attributes that could be transformed into independent snodes
- Parameters:
path (str) – Path to data
- Returns:
c – List of attributes
- Return type:
list
- get_communities(path)
Returns community models of the graph in path.
- Parameters:
path (str) – Path to data
- Returns:
atts – List of community models computed for the graph
- Return type:
list
- get_graphs_with_features(*args)
Returns a list of the available snodes with saved attributes
- get_local_features(path)
Returns the local features available for the graph in path.
- Parameters:
path (str) – Path to data
- Returns:
atts – List of local features computed for the graph
- Return type:
list
- get_names_of_SQL_dbs()
Returns the list of available databases
- get_names_of_dataset_tables()
Returns the list of available tables with raw graph data
- get_source_atts(path, *args)
Returns attributes of the source snode for the bipartite graph in path
- Parameters:
path (str) – Path to data
args (tuple, optional) – Possible extra arguments that are ignored
- Returns:
atts – List of available attributes at the selected snode
- Return type:
list
- get_sql_table_names(graph, db)
Get tables in the given database
- Parameters:
db (str) – Name of the database
- Returns:
table_names – Names of the tables in the database
- Return type:
list of str
- get_table_atts(graph, db, table, *args)
Get table attributes in the given database
- Parameters:
graph (str) – Not used
db (str) – Name of the database
table (str) – Name of the table to read attributes
- Returns:
table_names – Names of the tables in the database
- Return type:
list of str
- get_target_atts(path, *args)
Returns attributes of the target snode for the bipartite graph in path
- Parameters:
path (str) – Path to data
args (tuple, optional) – Possible extra arguments that are ignored
- Returns:
atts – List of available attributes at the selected snode
- Return type:
list
- graph_layout(path2snode, attribute)
Compute the layout of the given graph
- Parameters:
path2snode (str) – Path to snode
attribute (str) – Snode attribute used to color the graph
- import_SCOPUS_citations_graph(type_of_graph)
Loads a citations graph from table citations in SCOPUS SQL database
The graph is not restricted to the docs with Spanish authors. It includes all items in the citations table (about 5.8 M nodes).
No attributes are included, because the SCOPUS database contains attributes from a small subset of papers.
- Parameters:
type_of_graph (str {‘undirected’, ‘cite_to’, ‘cited_by’}) – Type of graph
- import_SCOPUS_citations_subgraph(type_of_graph)
Loads a citations subgraph from table ‘citations’ in SCOPUS SQL database
The subgraph contains the nodes with attributes in table ‘document’ from the same database.
- Parameters:
type_of_graph (str {‘undirected’, ‘cite_to’, ‘cited_by’}) – Type of graph
- import_agents(path2tables, path2snode)
Import agents
- Parameters:
path2tables (str) – Path to tables
path2snode (str) – Path to snode
- import_and_infer_sim_graph(path, sim, n0=None, n_epn=None, label=None)
This method manages the generation of similarity (semantic) graphs.
- Parameters:
path (str) – Path to the model
sim (str) – Similarity measure
n0 (int, float or None, optional (default=None)) – Number of nodes. If None, it is requested from the user. If 0 < n0 < 1, it is the fraction of the total number of nodes. If 0, all nodes are taken.
n_epn (int or None, optional (default=None)) – Average number of edges per node. If None, it is requested from the user. If 0, 10 nodes are taken.
- import_co_citations_graph()
Loads a co-citations-graph (only for ACL) and saves it in a new snode
- import_node_atts(path, dbname, table, att, att_ref)
Load attributes from a given table from a SQL database and add them to a given snode.
- Parameters:
path (str) – Path to the graph to add the new attribute
dbname (str) – Type of the database storing the data
table (str) – Name of the table in the given db that contains the attribute
att (str) – Name of the attribute
att_ref (str) – Name of the attribute containing the node identifier
- import_nodes_and_model(path)
This method manages the generation of similarity (semantic) graphs.
- Parameters:
path (str) – Path to the model
- import_snode_from_table(table_name)
- inferBGfromA(path, attribute, t_label=None, e_label=None)
Infer bipartite graph from a categorical attribute
- Parameters:
path (str) – Path to snode
attribute (str) – Name of the snode attribute used to generate the bipartite graph
t_label (str or None, optional (default=None)) – Name of the target s_node
e_label (str or None, optional (default=None)) – Name of the bipartite graph
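Inferring a bipartite graph from a categorical attribute amounts to creating one target node per distinct attribute value and linking each source node to its value. The sketch below illustrates this with plain dictionaries; the function name and the sample attribute are hypothetical and do not reflect how snodes are stored.

```python
def bigraph_from_attribute(node_attribute):
    """Return (target_nodes, edges) linking each node to its category."""
    # One target node per distinct attribute value
    target_nodes = sorted(set(node_attribute.values()))
    edges = [(node, value) for node, value in node_attribute.items()]
    return target_nodes, edges


# Hypothetical categorical attribute: research area of each paper
area = {"paper1": "NLP", "paper2": "vision", "paper3": "NLP"}
target_nodes, edges = bigraph_from_attribute(area)
print(target_nodes)
print(edges)
```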
- inferSimBG()
Not available
- inferTransit(path_xm, path_my)
Infer transitive graph from two bipartite graphs
- Parameters:
path_xm (str) – Path to first bipartite graph (sedge)
path_my (str) – Path to second bipartite graph (sedge)
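A transitive graph links a node x to a node y whenever both are connected to a common intermediate node m. The sketch below composes two edge lists in this way and uses the number of shared intermediate nodes as the edge weight. The weighting scheme is an assumption (the actual one is not specified here), and the function name and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def transitive_graph(edges_xm, edges_my):
    """Compose X-M and M-Y edges into weighted X-Y edges."""
    m_to_y = defaultdict(list)
    for m, y in edges_my:
        m_to_y[m].append(y)
    weights = Counter()
    for x, m in edges_xm:
        # Each shared intermediate node m adds 1 to the weight of (x, y)
        for y in m_to_y.get(m, []):
            weights[(x, y)] += 1
    return dict(weights)


# Hypothetical bipartite graphs: papers-authors and authors-institutions
edges_xm = [("paper1", "alice"), ("paper1", "bob"), ("paper2", "bob")]
edges_my = [("alice", "uni_a"), ("bob", "uni_a"), ("bob", "uni_b")]
xy = transitive_graph(edges_xm, edges_my)
print(xy)
```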
- infer_eq_simgraph(path, sim)
This method manages the generation of an equivalence similarity (semantic) graph.
It is similar to a concatenation of self.equivalence_graph() (to transform the original topic matrix into the reduced matrix without row repetitions) and infer_sim_graph() (to compute the similarity graph)
- Parameters:
path (str) – Path to the model
sim (str) – Similarity measure
- infer_ppr_graph(path)
Compute a transductive graph from a snode and a sedge
- Parameters:
path (str) – Path to the sedge (snode is inferred from the sedge name)
- infer_sim_bigraph(s_label, t_label, sim)
This method manages the generation of similarity (semantic) bipartite graphs.
It assumes that the feature vectors in source and target nodes are comparable.
- Parameters:
s_label (str) – Name of the source graph (it must contain a feature matrix)
t_label (str) – Name of the target graph (it must contain a feature matrix that is comparable to that of the source graph)
sim (str) – Similarity measure
- infer_sim_graph(path2snode, sim, n0=None, n_epn=None)
This method manages the generation of similarity (semantic) graphs.
- Parameters:
path2snode (str) – Path to the snode
sim (str) – Similarity measure
n0 (int, float or None, optional (default=None)) – Number of nodes. If None, it is requested from the user. If 0 < n0 < 1, it is the fraction of the total number of nodes. If 0, all nodes are taken.
n_epn (int or None, optional (default=None)) – Average number of edges per node. If None, it is requested from the user. If 0, 10 nodes are taken.
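The rules for n0 can be summarized in a small helper that turns the parameter into an actual number of nodes. This is a sketch under assumptions: it reads the fraction rule as applying to n0, rounds fractional sizes to the nearest integer, caps the result at the total number of nodes, and does not reproduce the interactive request made when n0 is None. The name resolve_n_nodes is hypothetical.

```python
def resolve_n_nodes(n0, n_total):
    """Translate n0 into a number of nodes, given n_total available nodes."""
    if n0 is None:
        # The interactive request to the user is not reproduced here
        raise ValueError("n0 must be given in this sketch")
    if n0 == 0:
        # 0 means: take all nodes
        return n_total
    if 0 < n0 < 1:
        # A value in (0, 1) is a fraction of the total number of nodes
        return int(round(n0 * n_total))
    # Otherwise n0 is an absolute number of nodes (capped, by assumption)
    return min(int(n0), n_total)


print(resolve_n_nodes(0, 1000))
print(resolve_n_nodes(0.25, 1000))
print(resolve_n_nodes(500, 1000))
```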
- largest_community_subgraph(path, comm)
Subsample graph taking the nodes from the largest community.
- Parameters:
path (str) – Path to graph
comm (str) – Name of the community
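The subsampling step can be sketched as follows: find the community label with the most members, keep those nodes, and keep only the edges with both endpoints among them. The data structures (edge list and membership dictionary) and the function name are assumptions made for illustration.

```python
from collections import Counter


def largest_community_nodes_and_edges(edges, membership):
    """Return the nodes of the largest community and the induced edges."""
    # Label of the community with the most members
    largest, _ = Counter(membership.values()).most_common(1)[0]
    keep = {node for node, c in membership.items() if c == largest}
    # Induced subgraph: both endpoints must be kept
    sub_edges = [(u, v) for u, v in edges if u in keep and v in keep]
    return sorted(keep), sub_edges


# Hypothetical graph and community assignment
edges = [("a", "b"), ("b", "c"), ("c", "d")]
membership = {"a": 0, "b": 0, "c": 0, "d": 1}
sub_nodes, sub_edges = largest_community_nodes_and_edges(edges, membership)
print(sub_nodes, sub_edges)
```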
- load(f_struct={})
Loads an existing project, by reading the metadata file in the project folder.
It can be used to modify file or folder names, or paths, by specifying the new names/paths in the f_struct dictionary.
- Parameters:
f_struct (dict or None, optional (default=None)) – Contains all information related to the structure of project files and folders: paths (relative to ppath), file names, suffixes, prefixes or extensions that could be used to define other files or folders. (default names are used when not given)
If None, default names are given to the whole folder tree
- local_graph_analysis(parameter, path)
Computes a local parameter for a snode
- Parameters:
parameter (str) – Local parameter to compute
path (str) – Path to snode
- remove_isolated_nodes(path)
Remove isolated nodes
- Parameters:
path (str) – Path to snode
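As a minimal illustration, a node is isolated when it does not appear in any edge. The sketch below filters such nodes out of a node list; the function name and the sample graph are hypothetical, and the actual method works on the snode stored in path.

```python
def drop_isolated_nodes(nodes, edges):
    """Keep only the nodes that are an endpoint of at least one edge."""
    connected = {n for edge in edges for n in edge}
    return [n for n in nodes if n in connected]


# Hypothetical graph: node "d" has no edges
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c")]
remaining = drop_isolated_nodes(nodes, edges)
print(remaining)
```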
- remove_snode_attributes(path, att)
Removes an attribute from a given snode.
- Parameters:
path (str) – Path to the graph where the attribute must be removed
att (str) – Name of the attribute
- resetGDBdata(option, snodes, sedges)
Reset (drop and create empty) tables from the database.
- Parameters:
option (str) – Selected node or edge to reset
snodes (list) – List of available nodes
sedges – List of available sedges
- reset_Neo4J()
Reset the whole database
- reset_Neo4J_sedge(sedge)
Reset (drop and create empty) tables from the database.
- Parameters:
sedge (str) – Selected node or edge to reset
- reset_Neo4J_snode(snode)
Reset (drop and create empty) tables from the database.
- Parameters:
snode (str) – Selected node or edge to reset
- reset_sedge(path)
Reset sedge in path
- Parameters:
path (str) – Path to sedge
- reset_snode(path)
Reset snode in path
- Parameters:
path (str) – Path to snode
- save_metadata()
Save metadata into a pickle file
- set_logs()
Configure logging messages.
- setup()
Set up the classification project. To do so:
Loads the configuration file and initializes the data manager.
Creates a DB table.
- showGDBdata(option, snodes, sedges)
Print a general overview of the selected database
- Parameters:
option (str) – Name of the db table
snodes (list) – List of snodes
sedges (list) – List of sedges
- showSDBdata(option)
Print a general overview of the selected source (SQL) database
- Parameters:
option (str) – Name of the db table
- show_Neo4J()
Print a general overview of the whole database
- show_Neo4J_sedge(sedge)
Print a general overview of the selected sedge
- Parameters:
sedge (str) – Name of the sedge
- show_Neo4J_snode(snode)
Print a general overview of the selected snode
- Parameters:
snode (str) – Name of the snode
- show_SuperGraph()
Show current supergraph structure
- show_sedge(path2sedge)
A quick preview of a superedge.
- Parameters:
path2sedge (str) – Path to the superedge
- show_snode(path2snode)
A quick preview of a supernode.
- Parameters:
path2snode (str) – Path to the supernode
- show_top_nodes(path, feature)
Shows a reduced list of nodes from a given graph, ranked by the value of a single feature
- Parameters:
path (str) – Path to the graph
feature (str) – Name of the local feature
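Ranking nodes by a single feature can be sketched as a sort over (node, value) pairs followed by truncation. The length of the reduced list is not stated here, so the parameter n below is an assumption, as are the function name and the sample values.

```python
def top_nodes(feature_values, n=3):
    """Return the n (node, value) pairs with the largest feature value."""
    ranked = sorted(feature_values.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]


# Hypothetical values of a local feature for five nodes
feature_values = {"a": 0.10, "b": 0.45, "c": 0.30, "d": 0.05, "e": 0.20}
ranking = top_nodes(feature_values, n=3)
print(ranking)
```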
- subsample_graph(path, mode)
Subsample graph
- Parameters:
path (str) – Path to graph
mode (str) – If ‘newgraph’, create a new snode with the subgraph
- transduce(path, order)
Compute a transductive graph from a snode and a sedge
- Parameters:
path (str) – Path to the sedge (snode is inferred from the sedge name)
order (int) – Order parameters of the transduced graph
- update_folders(f_struct=None)
Updates the project folder structure using the file and folder names in f_struct.
- Parameters:
f_struct (dict or None, optional (default=None)) – Contains all information related to the structure of project files and folders: paths (relative to ppath), file names, suffixes, prefixes or extensions that could be used to define other files or folders. (default names are used when not given)
If None, default names are given to the whole folder tree