Datamanager
The classes in this file provide functionality to interact with the specific databases provided for the PTL projects.
- class rdigraphs.datamanager.datamanager.DMneo4j(db_server: str, db_password: str, db_user: str = 'neo4j')
Bases:
BaseDMneo4j
This class is an extension of BaseDMneo4j to include some additional functionality
- class rdigraphs.datamanager.datamanager.DMsql(db_name, db_connector, path2db=None, db_server=None, db_user=None, db_password=None, db_port=None, unix_socket=None, charset='utf8mb4')
Bases:
BaseDMsql
This class is an extension of BaseDMsql to include some additional functionality
- class rdigraphs.datamanager.datamanager.DataManager(path2project, db_params, path2source=None)
Bases:
object
This is the datamanager for a supergraph project. It provides functionality to manage both the Neo4j graph DB and the SQL databases containing the source data. To do so, it uses generic managers for Neo4j and SQL.
- __init__(path2project, db_params, path2source=None)
Initializes the datamanager object, which facilitates read and write operations.
File operation methods are available.
Also, several SQL and Neo4j DataManager objects are created to facilitate interaction with databases.
Each SQL manager is stored in dictionary self.SQL. Typically:
self.SQL[‘db1’] : SQL database named db1
self.SQL[‘db2’] : SQL database named db2
…
self.Neo4j : Neo4j graph database
- Parameters:
path2project (str) – Path to the project folder
db_params (dict) – Parameters to establish db connections.
path2source (str or None, optional (default=None)) – Path to the folder containing several data sources. This parameter is optional to allow backward compatibility. Future versions of this datamanager will modify all methods to use this variable, which will then necessarily be string-like.
- __weakref__
list of weak references to the object (if defined)
- import_graph_data_from_tables(table_name, sampling_factor=1)
Loads a dataframe of documents from one or several files in tabular format.
- Parameters:
table_name (str) – Name of the tabular dataset. It should be the name of a folder in self.path2source
sampling_factor (float, optional (default=1)) – Fraction of documents to be taken from the original corpus. (Used for SemanticScholar and patstat only)
- load_SCOPUS_citations_all(col_ref, atts, mode='cite_to')
Extracts all data from table ‘citations’ in SCOPUS SQL database
The graph contains all nodes in the citation graph, no matter if they have attributes in table ‘document’.
- Parameters:
col_ref (str) – Column in sql table that will be used as index
atts (list of str) – Columns in sql table that will be taken as attributes
mode (str {‘cite_to’, ‘cited_by’}) – If ‘cite_to’, source nodes are the citing papers. If ‘cited_by’, source nodes are the cited papers.
- Returns:
nodes (list) – Nodes with attributes in table ‘document’
df_atts (pandas dataframe) – Attributes of the nodes
source_nodes (list) – Source of all edges between nodes in the list of nodes
target_nodes – Target of all edges between nodes in the list of nodes. Edges are assumed to be in the same order in source_nodes and target_nodes
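How mode affects edge direction can be illustrated with a minimal sketch. The helper name split_edges and the (citing, cited) pair layout are hypothetical and are not taken from the library, which reads the pairs from the SQL table.

```python
def split_edges(citations, mode="cite_to"):
    """Split (citing, cited) pairs into parallel source/target lists.

    Hypothetical helper for illustration only.
    """
    if mode == "cite_to":
        # Source nodes are the citing papers.
        source_nodes = [citing for citing, cited in citations]
        target_nodes = [cited for citing, cited in citations]
    elif mode == "cited_by":
        # Source nodes are the cited papers.
        source_nodes = [cited for citing, cited in citations]
        target_nodes = [citing for citing, cited in citations]
    else:
        raise ValueError(f"Unknown mode: {mode!r}")
    return source_nodes, target_nodes


pairs = [("p1", "p2"), ("p1", "p3"), ("p2", "p3")]
src, tgt = split_edges(pairs, mode="cited_by")
print(src)  # ['p2', 'p3', 'p3']
print(tgt)  # ['p1', 'p1', 'p2']
```

Edge i is always (source_nodes[i], target_nodes[i]), so both lists must keep the same order.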
- load_SCOPUS_citations_with_atts(col_ref, atts, mode='cite_to')
Extracts data from table ‘citations’ in SCOPUS SQL database
The subgraph contains the nodes with attributes in table ‘document’ from the same database.
- Parameters:
col_ref (str) – Column in sql table that will be used as index
atts (list of str) – Columns in sql table that will be taken as attributes
mode (str {‘cite_to’, ‘cited_by’}) – If ‘cite_to’, source nodes are the citing papers. If ‘cited_by’, source nodes are the cited papers.
- Returns:
nodes (list) – Nodes with attributes in table ‘document’
df_atts (pandas dataframe) – Attributes of the nodes
source_nodes (list) – Source of all edges between nodes in the list of nodes
target_nodes – Target of all edges between nodes in the list of nodes. Edges are assumed to be in the same order in source_nodes and target_nodes
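The restriction to nodes that have attributes can be pictured as an edge filter. The sketch below is an assumption about the idea, not the library's implementation, and restrict_to_nodes is a hypothetical name.

```python
def restrict_to_nodes(nodes, source_nodes, target_nodes):
    """Keep only the edges whose two endpoints belong to `nodes`."""
    keep = set(nodes)
    kept = [
        (s, t) for s, t in zip(source_nodes, target_nodes)
        if s in keep and t in keep
    ]
    new_sources = [s for s, _ in kept]
    new_targets = [t for _, t in kept]
    return new_sources, new_targets


nodes = ["p1", "p2"]          # nodes that have attributes
src = ["p1", "p1", "p2"]
tgt = ["p2", "p3", "p3"]
print(restrict_to_nodes(nodes, src, tgt))  # (['p1'], ['p2'])
```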
- readCoordsFromFile(fpath=None, fields=['thetas'], sparse=False, path2nodenames=None, ref_col='corpusid')
Reads a data matrix from a given path. This method assumes a particular data structure of the PTL projects
- Parameters:
fpath (str or None, optional (default=None)) – Path to the file that contains the topic model. (the data file is assumed to be modelo.npz or modelo_sparse.npz inside that folder)
fields (str or list, optional (default=[‘thetas’])) – Name of the field or fields containing the doc-topic matrix
sparse (bool, optional (default=False)) – If True, the doc-topic matrix is sparse, otherwise dense
path2nodenames (str or None, optional (default=None)) – path to file containing metadata (in particular node names). If None, the file is assumed to be in fpath, with name docs_metadata.csv
ref_col (str, optional (default=’corpusid’)) – Name of the column in the metadata file (given by path2nodenames) that contains the doc id.
- Returns:
data_out (dict) – Output data dictionary
df_nodes (dataframe) – Dataframe of nodes
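The default file locations described by the parameters can be sketched as follows. This is only an illustration: the helper name resolve_model_paths and the folder "models/tm" are hypothetical, and it is an assumption that the sparse flag is what selects modelo_sparse.npz over modelo.npz.

```python
from pathlib import Path


def resolve_model_paths(fpath, sparse=False, path2nodenames=None):
    """Return (data file, metadata file) following the documented defaults."""
    folder = Path(fpath)
    # Assumption: sparse=True selects the sparse file name.
    data_file = folder / ("modelo_sparse.npz" if sparse else "modelo.npz")
    if path2nodenames is None:
        # Documented default: metadata lives next to the model.
        metadata_file = folder / "docs_metadata.csv"
    else:
        metadata_file = Path(path2nodenames)
    return data_file, metadata_file


data_file, metadata_file = resolve_model_paths("models/tm", sparse=True)
print(data_file.name)      # modelo_sparse.npz
print(metadata_file.name)  # docs_metadata.csv
```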
SQL
This class provides functionality for managing a generic sqlite or mysql database:
reading specific fields (with the possibility to filter by field values)
storing calculated values in the dataset
Created on May 11 2018
@author: Jerónimo Arenas García
- class rdigraphs.datamanager.base_dm_sql.BaseDMsql(db_name, db_connector, path2db=None, db_server=None, db_user=None, db_password=None, db_port=None, unix_socket=None, charset='utf8mb4')
Bases:
object
Data manager base class.
- __del__()
When destroying the object, it is necessary to commit changes in the database and close the connection
- __init__(db_name, db_connector, path2db=None, db_server=None, db_user=None, db_password=None, db_port=None, unix_socket=None, charset='utf8mb4')
Initializes a DataManager object
- Parameters:
db_name (str) – Name of the DB
db_connector (str {‘mysql’, ‘sqlite’}) – Connector
path2db (str or None, optional (default=None)) – Path to the project folder (sqlite only)
db_server (str or None, optional (default=None)) – Server (mysql only)
db_user (str or None, optional (default=None)) – User (mysql only)
db_password (str or None, optional (default=None)) – Password (mysql only)
db_port (str or None, optional (default=None)) – Port (mysql via TCP only). Necessary if not 3306.
unix_socket (str or None, optional (default=None)) – Socket for local connectivity. If available, connection is slightly faster than through TCP.
charset (str, optional (default=’utf8mb4’)) – Coding to use by default in the connection
- __weakref__
list of weak references to the object (if defined)
- addTableColumn(tablename, columnname, columntype)
Add a new column to the specified table.
- Parameters:
tablename (str) – Table to which the column will be added
columnname (str) – Name of new column
columntype – Type of new column.
Notes
For mysql, if type is TXT or VARCHAR, the character set is forced to be utf8.
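For the sqlite connector, adding a column presumably comes down to an ALTER TABLE statement. The sketch below uses only the standard sqlite3 module. The function add_table_column and the table docs are hypothetical, and this is not the library's code.

```python
import sqlite3


def add_table_column(conn, tablename, columnname, columntype):
    """Add a column with plain SQL (identifiers are assumed to be trusted)."""
    conn.execute(
        f"ALTER TABLE {tablename} ADD COLUMN {columnname} {columntype}"
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (ref TEXT PRIMARY KEY)")
add_table_column(conn, "docs", "lemmas", "TEXT")

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(docs)")]
print(columns)  # ['ref', 'lemmas']
```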
- deleteDBtables(tables=None)
Delete existing database, and regenerate empty tables
- Parameters:
tables (str or list or None, optional (default=None)) – If string, name of the table to reset. If list, list of tables to reset. If None, all tables are deleted and all tables (including those that might not have existed previously) are regenerated.
- dropTableColumn(tablename, columnname)
Remove column from the specified table
- Parameters:
tablename (str) – Table containing the column to be removed
columnname (str) – Name of column to be removed
- exportTable(tablename, fileformat, path, filename, cols=None)
Export columns from a table to a file.
- Parameters:
tablename (str) – Name of the table
fileformat (str {‘xlsx’, ‘pkl’}) – Type of output file
path (str) – Route to the output folder
filename (str) – Name of the output file
cols (list or str) – Columns to save. It can be a list or a string of comma-separated columns. If None, all columns saved.
- export_table_to_csv(table_name: str, output_file: Union[str, Path], block_size: Optional[int] = None, max_rows: Optional[int] = None, gzipped: bool = True, callbacks: Optional[List[Callable[[DataFrame], DataFrame]]] = None, select_options: Optional[str] = None, filter_options: Optional[str] = None, order_options: Optional[str] = None)
Exports a table to csv.
- Parameters:
table_name (str) – Name of the SQL table.
output_file (str) – Name of the output csv file.
block_size (int, optional) – Table is read blockwise (to avoid running out of memory) in batches of this size.
max_rows (int, optional) – Maximum number of rows to be read. If None, the whole table is read.
gzipped (bool, optional (default=True)) – Whether to write a gzipped csv file (as opposed to plain text).
callbacks (list or None, optional (default=None)) – A list of functions to be called on every block read, before actually writing to disk. Each callable receives a DataFrame and returns another one with the same structure.
select_options (str or None, optional (default=None)) – “select” options to be passed to readDBtable
filter_options (str or None, optional (default=None)) – “filter” options to be passed to readDBtable
order_options (str or None, optional (default=None)) – “order” options to be passed to readDBtable
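The interplay of block_size, max_rows, gzipped and callbacks can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's implementation: it reads from a sqlite3 connection passed as the first argument, and each callback receives and returns a list of row tuples rather than a DataFrame.

```python
import csv
import gzip
import os
import sqlite3
import tempfile


def export_table_to_csv(conn, table_name, output_file, block_size=1000,
                        max_rows=None, gzipped=True, callbacks=None):
    """Write a table to csv block by block; return the number of rows read."""
    opener = gzip.open if gzipped else open
    cursor = conn.execute(f"SELECT * FROM {table_name}")
    header = [col[0] for col in cursor.description]
    read = 0
    with opener(output_file, "wt", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        while max_rows is None or read < max_rows:
            # Never fetch more rows than max_rows allows.
            size = block_size if max_rows is None else min(block_size, max_rows - read)
            block = cursor.fetchmany(size)
            if not block:
                break
            read += len(block)
            # Every callback transforms the block before it is written.
            for callback in callbacks or []:
                block = callback(block)
            writer.writerows(block)
    return read


def to_upper(block):
    """Example callback: upper-case the title of each row."""
    return [(ref, title.upper()) for ref, title in block]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (ref TEXT, title TEXT)")
conn.executemany("INSERT INTO docs VALUES (?, ?)",
                 [(f"r{i}", f"title {i}") for i in range(5)])
out = os.path.join(tempfile.mkdtemp(), "docs.csv.gz")
n = export_table_to_csv(conn, "docs", out, block_size=2, max_rows=3,
                        callbacks=[to_upper])
print(n)  # 3
```

Reading in blocks keeps memory use bounded by block_size rows, whatever the size of the table.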
- getColumnNames(tablename)
Returns a list with the names of all columns in the indicated table
- Parameters:
tablename (str) – Table to be read from
- Returns:
columnames – Names of all columns in the selected table
- Return type:
list
- getTableInfo(tablename)
Get information about the given table (size and columns)
- Parameters:
tablename (str) – Table to be read from
- Returns:
cols (list) – Names of all columns in the table
n_rows (int) – Number of rows in table
- getTableNames()
Provides access to table names
- Returns:
tbnames – Names of all tables in the database
- Return type:
list
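For a sqlite database, the information returned by getTableNames, getColumnNames and getTableInfo can be obtained with the queries below. This is a sketch using the standard sqlite3 module. The snake_case function names and the docs table are hypothetical, and a mysql backend would need different queries.

```python
import sqlite3


def get_table_names(conn):
    """Names of all tables, read from sqlite's catalogue."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
    return [row[0] for row in rows]


def get_table_info(conn, tablename):
    """Column names and number of rows of one table."""
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info({tablename})")]
    n_rows = conn.execute(f"SELECT COUNT(*) FROM {tablename}").fetchone()[0]
    return cols, n_rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (ref TEXT, title TEXT)")
conn.executemany("INSERT INTO docs VALUES (?, ?)", [("r1", "a"), ("r2", "b")])

print(get_table_names(conn))         # ['docs']
print(get_table_info(conn, "docs"))  # (['ref', 'title'], 2)
```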
- insertInTable(tablename, columns, arguments)
Insert new records into table
- Parameters:
tablename (str) – Name of table in which the data will be inserted
columns (list) – Name of columns for which data are provided
arguments (list of list of tuples) – A list of lists of tuples, each element associated to one new entry for the table
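A bulk insert of this kind is commonly written with executemany and parameter placeholders. The sketch below assumes a sqlite3 connection and takes the new entries as a flat list of tuples, which is a simplification of the documented arguments format. The name insert_in_table is hypothetical.

```python
import sqlite3


def insert_in_table(conn, tablename, columns, arguments):
    """Insert one row per tuple in `arguments` (sqlite '?' placeholders)."""
    placeholders = ", ".join("?" for _ in columns)
    sql = (
        f"INSERT INTO {tablename} ({', '.join(columns)}) "
        f"VALUES ({placeholders})"
    )
    conn.executemany(sql, arguments)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (ref TEXT, title TEXT, year INTEGER)")
insert_in_table(conn, "docs", ["ref", "year"], [("r1", 2017), ("r2", 2018)])
rows = conn.execute("SELECT ref, title, year FROM docs ORDER BY ref").fetchall()
print(rows)  # [('r1', None, 2017), ('r2', None, 2018)]
```

Columns that are not listed (title here) are left as NULL.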
- readDBtable(tablename, limit=None, selectOptions=None, filterOptions=None, orderOptions=None)
Read data from a table in the database. It is possible to read only some specific fields.
- Parameters:
tablename (str) – Table to be read from
limit (int or None, optional (default=None)) – The maximum number of records to retrieve
selectOptions (str or None, optional (default=None)) – string with fields that will be retrieved (e.g. ‘REFERENCIA, Resumen’)
filterOptions (str or None, optional (default=None)) – Filtering options for the SQL query (e.g., ‘WHERE UNESCO_cd=23’)
orderOptions (str or None, optional (default=None)) – Field that will be used for sorting the results of the query (e.g, ‘Cconv’)
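Judging by the examples given for each option, the arguments plausibly map onto the parts of a SELECT statement as sketched below. The function build_select and the table name proyectos are hypothetical, and the exact query assembled by the library may differ.

```python
def build_select(tablename, limit=None, selectOptions=None,
                 filterOptions=None, orderOptions=None):
    """Assemble a SELECT query from the optional pieces."""
    query = f"SELECT {selectOptions or '*'} FROM {tablename}"
    if filterOptions:
        # The documented example already contains the WHERE keyword.
        query += f" {filterOptions}"
    if orderOptions:
        # The documented example is a bare field name.
        query += f" ORDER BY {orderOptions}"
    if limit is not None:
        query += f" LIMIT {int(limit)}"
    return query


print(build_select("proyectos", limit=10,
                   selectOptions="REFERENCIA, Resumen",
                   filterOptions="WHERE UNESCO_cd=23",
                   orderOptions="Cconv"))
# SELECT REFERENCIA, Resumen FROM proyectos WHERE UNESCO_cd=23 ORDER BY Cconv LIMIT 10
```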
- setField(tablename, keyfld, valueflds, values)
Update records of a DB table
- Parameters:
tablename (str) – Table that will be modified
keyfld (str) – Name of the column that will be used as key (e.g. ‘REFERENCIA’)
valueflds (list) – Names of the columns that will be updated (e.g., ‘Lemas’)
values (list of tuples) – A list of tuples in the format (keyfldvalue, valuefldvalue) (e.g., [(‘Ref1’, ‘gen celula’), (‘Ref2’, ‘big_data, algorithm’)])
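The update described above can be sketched as a parameterised UPDATE run once per tuple. This example assumes a sqlite3 connection, and the name set_field and the table docs are hypothetical. It reuses the sample values from the parameter description.

```python
import sqlite3


def set_field(conn, tablename, keyfld, valueflds, values):
    """Update `valueflds` for every (key, value, ...) tuple in `values`."""
    assignments = ", ".join(f"{fld} = ?" for fld in valueflds)
    sql = f"UPDATE {tablename} SET {assignments} WHERE {keyfld} = ?"
    # Move the key to the end so that it matches the WHERE placeholder.
    params = [tuple(row[1:]) + (row[0],) for row in values]
    conn.executemany(sql, params)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (REFERENCIA TEXT PRIMARY KEY, Lemas TEXT)")
conn.executemany("INSERT INTO docs (REFERENCIA) VALUES (?)",
                 [("Ref1",), ("Ref2",)])
set_field(conn, "docs", "REFERENCIA", ["Lemas"],
          [("Ref1", "gen celula"), ("Ref2", "big_data, algorithm")])
rows = conn.execute("SELECT * FROM docs ORDER BY REFERENCIA").fetchall()
print(rows)  # [('Ref1', 'gen celula'), ('Ref2', 'big_data, algorithm')]
```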
- upsert(tablename, keyfld, df, robust=True)
Update records of a DB table with the values in the df. This function implements the following additional functionality:
* If there are columns in df that are not in the SQL table, columns will be added.
* New records will be created in the table if there are rows in the dataframe without an entry already in the table. For this, keyfld indicates which is the column that will be used as an index.
- Parameters:
tablename (str) – Table that will be modified
keyfld (str) – Name of the column that will be used as key (e.g. ‘REFERENCIA’)
df (dataframe) – Dataframe that we wish to save in table tablename
robust (bool, optional (default=True)) – If False, verifications are skipped (for a faster execution)
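The three behaviours listed above (add missing columns, update existing records, insert new ones) can be sketched with sqlite3. As stated assumptions: a list of dicts stands in for the DataFrame, new columns are created as TEXT, and the robust checks are omitted. This is not the library's implementation.

```python
import sqlite3


def upsert(conn, tablename, keyfld, records):
    """Update or insert `records` (list of dicts), keyed by `keyfld`."""
    table_cols = {row[1] for row in conn.execute(f"PRAGMA table_info({tablename})")}
    # 1. Add the columns that the table is missing.
    for col in {c for rec in records for c in rec} - table_cols:
        conn.execute(f"ALTER TABLE {tablename} ADD COLUMN {col} TEXT")
    keys = {row[0] for row in conn.execute(f"SELECT {keyfld} FROM {tablename}")}
    for rec in records:
        if rec[keyfld] in keys:
            # 2. Update the record that is already in the table.
            cols = [c for c in rec if c != keyfld]
            if not cols:
                continue
            sets = ", ".join(f"{c} = ?" for c in cols)
            conn.execute(
                f"UPDATE {tablename} SET {sets} WHERE {keyfld} = ?",
                [rec[c] for c in cols] + [rec[keyfld]],
            )
        else:
            # 3. Insert a new record.
            cols = list(rec)
            marks = ", ".join("?" for _ in cols)
            conn.execute(
                f"INSERT INTO {tablename} ({', '.join(cols)}) VALUES ({marks})",
                [rec[c] for c in cols],
            )
            keys.add(rec[keyfld])
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (ref TEXT PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO docs VALUES ('r1', 'old')")
upsert(conn, "docs", "ref", [
    {"ref": "r1", "title": "new", "year": "2018"},
    {"ref": "r2", "title": "other", "year": "2019"},
])
rows = conn.execute("SELECT ref, title, year FROM docs ORDER BY ref").fetchall()
print(rows)  # [('r1', 'new', '2018'), ('r2', 'other', '2019')]
```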
Neo4j
This class provides general functionality for managing a Neo4J database
reading specific fields (with the possibility to filter by field values)
storing calculated values in the dataset
Created on Sep 06 2018
@author: Manu A. Vázquez
- class rdigraphs.datamanager.base_dm_neo4j.BaseDMneo4j(db_server: str, db_password: str, db_user: str = 'neo4j')
Bases:
object
Base class for interacting with a Neo4j database.
- __del__()
Tidy up before deleting the object.
- __init__(db_server: str, db_password: str, db_user: str = 'neo4j') None
Initializer
- Parameters:
db_server (str) – The URL for the server
db_password (str) – User Password
db_user (str, optional) – User login
- __weakref__
list of weak references to the object (if defined)
- drop_node_property(label: str, property: str) None
Deletes a property from nodes.
- Parameters:
label (str) – The label of the node
property (str) – The name of the property
- drop_relationship(relationship_type: str) None
Deletes a relationship.
- Parameters:
relationship_type (str) – The type of the relationship
- export_graph(label_nodes, path2nodes, col_ref_nodes, label_edges, path2edges)
Export graph to Neo4J
- Parameters:
label_nodes (str or tuple of str) – If str, all nodes are of the same type. If tuple, one type for sources, the other for destinations.
path2nodes (str) – Path to the file of nodes: a csv file with one column for the node names, and possibly other columns for attributes (that may be numeric or str)
col_ref_nodes (str) – Name of the column in the file of nodes containing the node names
label_edges (str) – Type of relationship for all edges.
path2edges (str) – Path to the edges file: a csv file with 3 columns: Source, Target, Weight
- get_db_structure() dict
Returns meta-data.
- Returns:
out – Metadata
- Return type:
dictionary
- make_edges(df: DataFrame, source: Tuple[str, Tuple[str, str]], destination: Tuple[str, Tuple[str, str]], relationship: Tuple[str, Dict[str, str]])
Makes edges between nodes as specified in a dataframe; if the requested nodes don’t exist, they are created
- Parameters:
df (pandas Dataframe) – The data
source (tuple of a str and a tuple of two str) – The first element is the name of the column (within this DataFrame) specifying the source node; the second is a tuple whose first element is the label in the graph and whose second is the property (in the nodes with the aforementioned label) that must match the values in the column.
destination (tuple of a str and a tuple of two str) – The first element is the name of the column (within this DataFrame) specifying the destination node; the second is a tuple whose first element is the label in the graph and whose second is the property (in the nodes with the aforementioned label) that must match the values in the column.
relationship (tuple of a str and a dict) – The first element is the type of the relationship to be created between nodes; the second is a dictionary that maps columns in this DataFrame to properties of the relationship.
- properties_of_label(label: str) List[str]
Returns all the properties (across all the nodes) of a given label.
- Parameters:
label (str) – The label (type)
- Returns:
out – A list with the properties
- Return type:
list
- properties_of_relationship(relationship_type: str) List[str]
Returns all the properties of a given relationship.
- Parameters:
relationship_type (str) – The type of the relationship
- Returns:
out – A list with the properties
- Return type:
list
- read_edges(relationship_type: str, limit: Optional[int] = None) Optional[DataFrame]
Reads edges from the database.
- Parameters:
relationship_type (str) – Type of the relationship
limit (int) – Maximum number of edges
- Returns:
out – Every row contains information about a single edge
- Return type:
Pandas dataframe
- read_nodes(label: str, limit: Optional[int] = None, select_options=None, filter_options=None, order_options=None) DataFrame
Reads nodes from the database.
- Parameters:
label (str) – Label of the nodes
limit (int) – Maximum number of nodes
select_options (unused)
filter_options (unused)
order_options (unused)
- Returns:
out – Every row contains information about a node
- Return type:
Pandas dataframe
- reset_database() None
Reset the database, deleting everything.
- write_dataframe(df: DataFrame, source: Tuple[str, List[str]], destination: Tuple[str, List[str]], edge: Tuple[str, List[str]]) None
Writes a dataframe in the database
- Parameters:
df (pandas Dataframe) – The data
source (tuple of a str and a list of str) – The first element in the tuple is the label for the source node, and the second one is a list with the columns of the dataframe that will become the properties of the node
destination (tuple of a str and a list of str) – The first element in the tuple is the label for the destination node, and the second one is a list with the columns of the dataframe that will become the properties of the node
edge (tuple of a str and a list of str) – The first element in the tuple is the label for the edge, and the second one is a list with the columns of the dataframe that will become the properties of the edge
Cypher
- class rdigraphs.datamanager.cypher.DataLoadingStatement(csv_files_path: Path, csv_line: str = 'line', data_access: str = 'file://')
Bases:
Statement
Class to load data from csv files into Neo4j.
- __init__(csv_files_path: Path, csv_line: str = 'line', data_access: str = 'file://') None
Initializer.
- Parameters:
csv_files_path (pathlib’s Path) – Path to the csv files
csv_line (str, optional) – Name of a line from the file when referred to in another statement
- flatten_dic(d, sep=':', property_name_prefix='')
Takes a dictionary specifying SQL to neo4j conversions and returns a string amenable to be used in cypher.
- Parameters:
d (dict) – A dictionary mapping properties to values.
sep (str) – Separator between the property and the value
property_name_prefix (str) – A string to prepend to every property name.
- Returns:
out – A piece of cypher statement specifying a mapping between properties and values.
- Return type:
str
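One plausible reading of the sep and property_name_prefix parameters is sketched below. The output format is an assumption, since the source does not show the string the method actually produces, and the keys and values in conversions are made-up examples.

```python
def flatten_dic(d, sep=":", property_name_prefix=""):
    """Join property/value pairs into a comma-separated Cypher fragment."""
    return ", ".join(
        f"{property_name_prefix}{prop}{sep} {value}"
        for prop, value in d.items()
    )


# Hypothetical mapping from node properties to csv-line expressions.
conversions = {"name": "line.NOMBRE", "year": "toInteger(line.YEAR)"}

print(flatten_dic(conversions))
# name: line.NOMBRE, year: toInteger(line.YEAR)

print(flatten_dic(conversions, sep=" =", property_name_prefix="node."))
# node.name = line.NOMBRE, node.year = toInteger(line.YEAR)
```

With the default separator the result fits inside a property map, while a prefix together with an equals sign gives the form used after SET.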
- load_clause(parameters, using_periodic_commit=True) str
Returns a clause to load data from a csv file.
- Parameters:
parameters (dictionary) – Settings
using_periodic_commit (bool) – Whether to use the “periodic commit” operation mode
- Returns:
statement – Neo4j statement
- Return type:
str
- match_clause(labels, matching_columns, matching_properties, nodes=['origin', 'destination']) str
Returns a clause to match data read from the file with that in the database.
- Parameters:
labels (list) – Labels of the nodes to be matched
matching_columns (list) – Columns in the csv file
matching_properties (list) – Properties of the nodes to be matched
nodes (list) – Names of the nodes to be matched when referred to in another statement
- Returns:
out – Neo4j statement
- Return type:
str
- class rdigraphs.datamanager.cypher.LoadAttributesTable(csv_files_path: Path, csv_line: str = 'line', data_access: str = 'file://')
Bases:
DataLoadingStatement
Class to load data from a csv file representing an “attributes” table.
- assemble(parameters: dict) Union[str, List[str]]
Returns a statement to load data from a csv file representing an “attributes” table.
- Parameters:
parameters (dictionary) – Settings
- Returns:
out – Neo4j statement
- Return type:
str
- class rdigraphs.datamanager.cypher.LoadJoinTable(csv_files_path: Path, csv_line: str = 'line', data_access: str = 'file://')
Bases:
DataLoadingStatement
Class to load data from a csv file representing a “join” table.
- assemble(parameters: dict) Union[str, List[str]]
Returns a statement to load data from a csv file representing a “join” table.
- Parameters:
parameters (dictionary) – Settings
- Returns:
out – Neo4j statement
- Return type:
str
- class rdigraphs.datamanager.cypher.LoadTable(csv_files_path: Path, csv_line: str = 'line', data_access: str = 'file://')
Bases:
DataLoadingStatement
Class to load a table from a csv file.
- assemble(parameters: dict) Union[str, List[str]]
Returns a statement to load data from a csv file.
- Parameters:
parameters (dictionary) – Settings
- Returns:
out – Neo4j statement
- Return type:
str
- class rdigraphs.datamanager.cypher.MakeRelationship(csv_files_path: Path, csv_line: str = 'line', data_access: str = 'file://')
Bases:
DataLoadingStatement
Class to load data from a csv file representing a “relationship” table.
- assemble(parameters: dict) Union[str, List[str]]
Returns a statement to load data from a csv file representing a “relationship” table.
- Parameters:
parameters (dictionary) – Settings
- Returns:
out – Neo4j statement
- Return type:
str
- class rdigraphs.datamanager.cypher.Statement
Bases:
object
Base class for a Neo4j statement.
- __weakref__
list of weak references to the object (if defined)
- class rdigraphs.datamanager.cypher.UpdateTable(csv_files_path: Path, csv_line: str = 'line', data_access: str = 'file://')
Bases:
DataLoadingStatement
Class to update an existing table using data from a csv file.
- assemble(parameters: dict) Union[str, List[str]]
Returns a statement to update an existing table using data from a csv file.
- Parameters:
parameters (dictionary) – Settings
- Returns:
out – Neo4j statement
- Return type:
str
- rdigraphs.datamanager.cypher.assert_uniqueness_clause(property: str, node: str = 'node') str
Returns the part of a statement that ensures a property of a node is unique.
- Parameters:
property (str) – Name of the meant-to-be-unique property
node (str, optional) – Name of the node (coming from other statement)
- Returns:
out – Neo4j statement
- Return type:
str
- rdigraphs.datamanager.cypher.create_constraint_clause(label: str, node: str = 'node') str
Returns the part of a statement that creates a constraint.
- Parameters:
label (str) – The label (node type) on which to create the index
node (str, optional) – Name of the node when referred in another statement
- Returns:
out – Neo4j statement
- Return type:
str
- rdigraphs.datamanager.cypher.create_indexes(label: str, properties: List[str]) Union[str, List[str]]
Returns a statement that creates indexes on a given label.
- Parameters:
label (str) – The label (node type) on which to create the index
properties (list) – A list of properties to be indexed on the given label
- Returns:
out – Neo4j statement
- Return type:
str
- rdigraphs.datamanager.cypher.create_or_merge_node(label: str, properties: dict = {}, name: str = 'A', merge: bool = False) str
Returns a statement that creates or merges a new node.
- Parameters:
label (str) – The label of the node
properties (dictionary) – The properties of the node
name (str, optional) – Name of the node when referred in another statement
merge (boolean) – Whether a “merge” statement rather than a “create” one is to be returned
- Returns:
out – Neo4j statement
- Return type:
str
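The effect of the merge flag can be illustrated with a minimal sketch. The exact text returned by the library is not shown in the source, so the output format here is an assumption. Property values are rendered with Python's repr, which is adequate only for simple strings and numbers.

```python
def create_or_merge_node(label, properties=None, name="A", merge=False):
    """Build a CREATE or MERGE statement for one node (illustrative)."""
    props = ", ".join(
        f"{key}: {value!r}" for key, value in (properties or {}).items()
    )
    # Wrap the properties in braces only when there are any.
    props = f" {{{props}}}" if props else ""
    keyword = "MERGE" if merge else "CREATE"
    return f"{keyword} ({name}:{label}{props})"


print(create_or_merge_node("Author"))
# CREATE (A:Author)
print(create_or_merge_node("Author", {"name": "Ada", "h_index": 12}, merge=True))
# MERGE (A:Author {name: 'Ada', h_index: 12})
```

In Cypher, CREATE always adds a new node, whereas MERGE reuses a node that already matches the pattern. The sketch defaults properties to None rather than to a shared mutable dict.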
- rdigraphs.datamanager.cypher.create_or_merge_relationship(origin: str, destination: str, rel_type: str, properties: dict = {}, name: str = 'rel', merge: bool = False) str
Returns a statement that creates or merges a new relationship.
- Parameters:
origin (str) – Name (identifier/variable) of the origin node
destination (str) – Name (identifier/variable) of the destination node
rel_type (str) – Relationship type
properties (dictionary) – Properties to be added to the relationship
name (str, optional) – Name of the relationship when referred to in another statement
merge (boolean) – Whether a “merge” statement rather than a “create” one is to be returned
- Returns:
out – Neo4j statement
- Return type:
str
- rdigraphs.datamanager.cypher.create_unique_constraint(label: str, property: str) str
Returns a statement to create a unique constraint.
- Parameters:
label (str) – The label of the node
property (str) – The property of the node
- Returns:
out – Neo4j statement
- Return type:
str
- rdigraphs.datamanager.cypher.dic_to_properties(d: dict)
Takes a dictionary and returns a string amenable to be used in cypher.
- Parameters:
d (dict) – A dictionary mapping properties to values.
- Returns:
out – A piece of cypher statement specifying properties and values.
- Return type:
str
- rdigraphs.datamanager.cypher.drop_null_valued_keys(d: dict)
Convenience function to get rid of null (None) values in a dictionary.
- Parameters:
d (dict) – Any dictionary
- Returns:
out – The input dictionary without null values
- Return type:
dict
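A function with this behaviour fits in a single dictionary comprehension. The sketch below is an equivalent illustration, not necessarily the library's code.

```python
def drop_null_valued_keys(d):
    """Return a copy of `d` without the keys whose value is None."""
    return {key: value for key, value in d.items() if value is not None}


print(drop_null_valued_keys({"a": 1, "b": None, "c": 0}))  # {'a': 1, 'c': 0}
```

The test is `is not None`, so falsy values such as 0 or an empty string are kept.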
- rdigraphs.datamanager.cypher.get_metadata() str
Returns a statement to get the “schema” of the database.
- Returns:
out – Neo4j statement
- Return type:
str
- rdigraphs.datamanager.cypher.labels() str
Returns a statement to get all the labels (node types) in the database.
- Returns:
out – Neo4j statement
- Return type:
str
- rdigraphs.datamanager.cypher.make_relationship(origin_label: str = '', destination_label: str = '', relationship_type: str = '', origin_properties: dict = {}, relationship_properties: dict = {}, destination_properties: dict = {})
Returns a statement to make a relationship while merging properties in the nodes involved.
- Parameters:
origin_label (str) – Label of the origin node
destination_label (str) – Label of the destination node
relationship_type (str) – Type of the relationship
origin_properties (dictionary) – Properties of the origin node
relationship_properties (dictionary) – Properties of the relationship
destination_properties (dictionary) – Properties of the destination node
- Returns:
out – Neo4j statement
- Return type:
str
- rdigraphs.datamanager.cypher.match_label(label: str, limit: Optional[int] = None, dummy: str = 'node') str
Returns a statement that matches and returns a label.
- Parameters:
label (str) – The label of the node
limit (int, optional) – The maximum number of matches
dummy (str, optional) – Name of the node when referred in another statement
- Returns:
out – Neo4j statement
- Return type:
str
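A statement of this kind might look as follows. The sketch assumes the simplest MATCH ... RETURN form, and the string actually returned by the library may differ.

```python
def match_label(label, limit=None, dummy="node"):
    """Build a statement returning the nodes that carry `label`."""
    statement = f"MATCH ({dummy}:{label}) RETURN {dummy}"
    if limit is not None:
        statement += f" LIMIT {int(limit)}"
    return statement


print(match_label("Author"))           # MATCH (node:Author) RETURN node
print(match_label("Author", limit=5))  # MATCH (node:Author) RETURN node LIMIT 5
```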
- rdigraphs.datamanager.cypher.match_relationship(relationship: str, limit: Optional[int] = None, surrogate_relationship: str = 'rel', surrogate_path: str = 'path') str
Returns a statement that matches relationships.
- Parameters:
relationship (str) – The (type of the) relationship
limit (int, optional) – The maximum number of matches
surrogate_relationship (str, optional) – Name of the relationship when referred in another statement
surrogate_path (str, optional) – Name of the path when referred in another statement
- Returns:
out – Neo4j statement
- Return type:
str
- rdigraphs.datamanager.cypher.merge_relationship_with_properties_clause(origin: str = 'origin', destination: str = 'destination', origin_label: str = '', destination_label: str = '', relationship_type: str = '', origin_properties: dict = {}, relationship_properties: dict = {}, destination_properties: dict = {}, relationship: str = 'rel', arrowhead: str = '>')
Returns a clause that merges a relationship specifying properties for each entity.
- Parameters:
origin (str) – Name (identifier/variable) of the origin node
destination (str) – Name (identifier/variable) of the destination node
origin_label (str) – Label of the origin node
destination_label (str) – Label of the destination node
relationship_type (str) – Type of the relationship
origin_properties (dictionary) – Properties of the origin node
relationship_properties (dictionary) – Properties of the relationship
destination_properties (dictionary) – Properties of the destination node
relationship (str, optional) – Name of the relationship when referred to in another statement
arrowhead (string) – Suffix controlling whether the relationship is (bi)directional or not
- Returns:
out – Neo4j statement
- Return type:
str
- rdigraphs.datamanager.cypher.relationship_types() str
Returns a statement to get all the relationship types in the database.
- Returns:
out – Neo4j statement
- Return type:
str
- rdigraphs.datamanager.cypher.remove_property(label: str, property: str, surrogate_node: str = 'node') str
Returns a statement that removes a property from a node.
- Parameters:
label (str) – The label of the node
property (str) – The name of the property
surrogate_node (str, optional) – Name of the node when referred in another statement
- Returns:
out – Neo4j statement
- Return type:
str
- rdigraphs.datamanager.cypher.remove_relationship(relationship_type: str, var: str = 'r') str
Returns a statement that removes a relationship.
- Parameters:
relationship_type (str) – The type of the relationship
var (str, optional) – Name of the relationship when referred in another statement
- Returns:
out – Neo4j statement
- Return type:
str
- rdigraphs.datamanager.cypher.reset_database() List[str]
Returns a statement that resets the database, i.e., deletes everything.
- Returns:
out – Neo4j statement
- Return type:
str
Migration Callbacks
- class rdigraphs.datamanager.callbacks.AuthorsDisambiguator(disambiguation_map: List[str], new_id: str)
Bases:
Disambiguator
Class containing callback methods needed for disambiguating authors.
- patstats_person(df: DataFrame)
Callback function for “person” table in “patstats” database.
- Parameters:
df (Pandas dataframe) – Input data
- Returns:
df – Output data
- Return type:
Pandas dataframe
- patstats_person_application(df: DataFrame)
Callback function for “application” table in “patstats” database.
- Parameters:
df (Pandas dataframe) – Input data
- Returns:
df – Output data
- Return type:
Pandas dataframe
- projects_researcher_project(df: DataFrame)
Callback function for “researcher_project” table in “projects” database.
- Parameters:
df (Pandas dataframe) – Input data
- Returns:
df – Output data
- Return type:
Pandas dataframe
- projects_researchers(df: DataFrame)
Callback function for “researchers” table in “projects” database.
- Parameters:
df (Pandas dataframe) – Input data
- Returns:
df – Output data
- Return type:
Pandas dataframe
- scopus_authorship(df: DataFrame)
Callback function for “authorship” table in “scopus” database.
- Parameters:
df (Pandas dataframe) – Input data
- Returns:
df – Output data
- Return type:
Pandas dataframe
- class rdigraphs.datamanager.callbacks.Disambiguator(disambiguation_map: List[str], new_id: str)
Bases:
object
Class to process disambiguation data.
- __init__(disambiguation_map: List[str], new_id: str) None
Initializer.
- Parameters:
disambiguation_map (list) – Path to the disambiguation map
new_id (str) – The name of the new field/column/neo4j property to be created for storing the final disambiguated id
- __weakref__
list of weak references to the object (if defined)
- static handle_duplicates(df)
Get rid of duplicated (disambiguated) ids.
- Parameters:
df (Pandas dataframe) – Data with duplicates
- Returns:
unmapped – Input data with the only row with the value “left_only” in column source, or the 1st row of the input dataframe if there are several.
- Return type:
Pandas dataframe
- merge(mapping: DataFrame, df: DataFrame, field: str, prefix: str)
Replace within the passed DataFrame the values of field that are present in the disambiguation map, while at the same time ensuring that the new disambiguated values are unique.
- Parameters:
mapping (Pandas dataframe) – Disambiguation map
df (Pandas dataframe) – Data to be disambiguated
field (str) – Column to be used in df
prefix (str) – Prefix to be added to a “new” identifier built from the already existing (not present in the disambiguation map)
- Returns:
merge – Input data with the disambiguation map applied.
- Return type:
Pandas dataframe
- replace(mapping: DataFrame, df: DataFrame, field: str, prefix: str)
Replace within the passed DataFrame the values of field that are present in the disambiguation map.
- Parameters:
mapping (Pandas dataframe) – Disambiguation map
df (Pandas dataframe) – Data to be disambiguated
field (str) – Column to be used in df
prefix (str) – Prefix to be added to a “new” identifier built from the already existing (not present in the disambiguation map)
- Returns:
merge – Input data with the disambiguation map applied.
- Return type:
Pandas dataframe
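The replacement logic can be pictured with plain Python structures. In this sketch a dict stands in for the disambiguation map and a list of dicts for the DataFrame. The name replace_ids, the default column name disambiguated_id and the sample identifiers are all hypothetical, and storing the result in a separate new_id column is an assumption based on the Disambiguator initializer.

```python
def replace_ids(mapping, records, field, prefix, new_id="disambiguated_id"):
    """Attach a disambiguated id to every record.

    Ids found in `mapping` are replaced; the others get a "new"
    identifier built as prefix + original id.
    """
    out = []
    for rec in records:
        old = rec[field]
        new = mapping.get(old, f"{prefix}{old}")
        out.append({**rec, new_id: new})
    return out


mapping = {"a1": "AUTH_7", "a2": "AUTH_7"}   # two ids for the same person
records = [{"author": "a1"}, {"author": "a2"}, {"author": "a3"}]
result = replace_ids(mapping, records, "author", prefix="scopus_")
print([rec["disambiguated_id"] for rec in result])
# ['AUTH_7', 'AUTH_7', 'scopus_a3']
```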
- class rdigraphs.datamanager.callbacks.OrganizationsDisambiguator(disambiguation_map: List[str], new_id: str)
Bases:
AuthorsDisambiguator
Class containing callback methods needed for disambiguating organizations.
- projects_organizations(df: DataFrame)
Callback function for “organizations” table in “projects” database.
- Parameters:
df (Pandas dataframe) – Input data
- Returns:
df – Output data
- Return type:
Pandas dataframe
- projects_projects(df: DataFrame)
Callback function for “projects” table in “projects” database.
- Parameters:
df (Pandas dataframe) – Input data
- Returns:
df – Output data
- Return type:
Pandas dataframe
- scopus_authorship(df: DataFrame)
Callback function for “authorship” table in “scopus” database.
- Parameters:
df (Pandas dataframe) – Input data
- Returns:
df – Output data
- Return type:
Pandas dataframe
- rdigraphs.datamanager.callbacks.initialize(parameters: dict) None
Initializes this module.
- Parameters:
parameters (dictionary) – Settings
- rdigraphs.datamanager.callbacks.patents_literature(df: DataFrame)
Callback function for relating patents literature with itself in “patents” database.
- Parameters:
df (Pandas dataframe) – Input data
- Returns:
df – Output data
- Return type:
Pandas dataframe
- rdigraphs.datamanager.callbacks.patents_non_literature(df: DataFrame)
Callback function for relating patents literature with non-patents literature in “patents” database.
- Parameters:
df (Pandas dataframe) – Input data
- Returns:
df – Output data
- Return type:
Pandas dataframe
- rdigraphs.datamanager.callbacks.patents_person(df: DataFrame)
Callback function for “person” table in “patents” database.
- Parameters:
df (Pandas dataframe) – Input data
- Returns:
df – Output data
- Return type:
Pandas dataframe
- rdigraphs.datamanager.callbacks.projects_researchers(df: DataFrame, name_col='NOMBRE')
Callback function for “researchers” table in “projects” database.
- Parameters:
df (Pandas dataframe) – Input data
name_col (str, optional) – Name of the column with the researchers’ names
- Returns:
df – Output data
- Return type:
Pandas dataframe
- rdigraphs.datamanager.callbacks.publications_authorship(df: DataFrame)
Callback function for “authorship” table in “publications” database.
- Parameters:
df (Pandas dataframe) – Input data
- Returns:
df – Output data
- Return type:
Pandas dataframe
Util
- rdigraphs.datamanager.util.simple_http_server(host: str = '0.0.0.0', port: int = 4001, path: str = '.')
Sets up an HTTP server for files.
- Parameters:
host (str) – The IP where the host will “listen”
port (int) – The port where the host will “listen”
path (str) – The path to be served
- Returns:
start (function) – It starts the server
stop (function) – It stops the server
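A file server that hands back start and stop functions can be built from the standard library, as in the sketch below. It is an assumption that the library works this way. The sketch runs the server in a background thread so that start returns immediately.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def simple_http_server(host="0.0.0.0", port=4001, path="."):
    """Return (start, stop) functions for a file server rooted at `path`."""
    handler = functools.partial(SimpleHTTPRequestHandler, directory=path)
    # The socket is bound here, before the server starts serving.
    server = ThreadingHTTPServer((host, port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)

    def start():
        thread.start()

    def stop():
        # shutdown() would block forever if serve_forever never ran.
        if thread.is_alive():
            server.shutdown()
            thread.join()
        server.server_close()

    return start, stop


# Typical use (not executed here):
#   start, stop = simple_http_server(port=4001, path="exported_csv")
#   start()
#   ...  # let Neo4j fetch the csv files over http
#   stop()
```

The folder name exported_csv in the usage comment is only a placeholder.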