Datamanager

The classes in this file provide functionality to interact with the specific databases provided for the PTL projects.

class rdigraphs.datamanager.datamanager.DMneo4j(db_server: str, db_password: str, db_user: str = 'neo4j')

Bases: BaseDMneo4j

This class is an extension of BaseDMneo4j to include some additional functionality

class rdigraphs.datamanager.datamanager.DMsql(db_name, db_connector, path2db=None, db_server=None, db_user=None, db_password=None, db_port=None, unix_socket=None, charset='utf8mb4')

Bases: BaseDMsql

This class is an extension of BaseDMsql to include some additional functionality

class rdigraphs.datamanager.datamanager.DataManager(path2project, db_params, path2source=None)

Bases: object

This is the datamanager for a supergraph project. It provides functionality to manage both the Neo4j graph DB and the SQL databases containing the source data. To do so, it uses generic managers for Neo4j and SQL.

__init__(path2project, db_params, path2source=None)

Initializes the datamanager object, which facilitates read and write operations.

File operation methods are available.

Also, several SQL and Neo4j DataManager objects are created to facilitate interaction with databases.

Each SQL manager is stored in the dictionary self.SQL. Typically:

  • self.SQL[‘db1’] : SQL database named db1

  • self.SQL[‘db2’] : SQL database named db2

  • …

  • self.Neo4j : Neo4j graph database

Parameters:
  • path2project (str) – Path to the project folder

  • db_params (dict) – Parameters to establish db connections.

  • path2source (str or None, optional (default=None)) – Path to the folder containing several data sources. This parameter is optional to allow backward compatibility. Future versions of this datamanager will modify all methods to use this variable, which will then necessarily be string-like.

__weakref__

list of weak references to the object (if defined)

import_graph_data_from_tables(table_name, sampling_factor=1)

Loads a dataframe of documents from one or several files in tabular format.

Parameters:
  • table_name (str) – Name of the tabular dataset. It should be the name of a folder in self.path2source

  • sampling_factor (float, optional (default=1)) – Fraction of documents to be taken from the original corpus. (Used for SemanticScholar and patstat only)

load_SCOPUS_citations_all(col_ref, atts, mode='cite_to')

Extracts all data from table ‘citations’ in SCOPUS SQL database

The graph contains all nodes in the citation graph, no matter if they have attributes in table ‘document’.

Parameters:
  • col_ref (str) – Column in sql table that will be used as index

  • atts (list of str) – Columns in sql table that will be taken as attributes

  • mode (str {‘cite_to’, ‘cited_by’}) – If ‘cite_to’, source nodes are the citing papers. If ‘cited_by’, source nodes are the cited papers.

Returns:

  • nodes (list) – Nodes with attributes in table ‘document’

  • df_atts (pandas dataframe) – Attributes of the nodes

  • source_nodes (list) – Source of all edges between nodes in the list of nodes

  • target_nodes – Target of all edges between nodes in the list of nodes. Edges are assumed to be in the same order in source_nodes and target_nodes

load_SCOPUS_citations_with_atts(col_ref, atts, mode='cite_to')

Extracts data from table ‘citations’ in SCOPUS SQL database

The subgraph contains the nodes with attributes in table ‘document’ from the same database.

Parameters:
  • col_ref (str) – Column in sql table that will be used as index

  • atts (list of str) – Columns in sql table that will be taken as attributes

  • mode (str {‘cite_to’, ‘cited_by’}) – If ‘cite_to’, source nodes are the citing papers. If ‘cited_by’, source nodes are the cited papers.

Returns:

  • nodes (list) – Nodes with attributes in table ‘document’

  • df_atts (pandas dataframe) – Attributes of the nodes

  • source_nodes (list) – Source of all edges between nodes in the list of nodes

  • target_nodes – Target of all edges between nodes in the list of nodes. Edges are assumed to be in the same order in source_nodes and target_nodes
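The effect of mode on the two edge lists can be illustrated with a minimal sketch. It is not the library's implementation: the function name split_edges, the (citing, cited) pair layout and the in-memory lists are assumptions that stand in for the SQL tables. Edges are kept only when both endpoints have attributes, which mirrors the subgraph described for load_SCOPUS_citations_with_atts.

```python
def split_edges(citations, nodes_with_atts, mode="cite_to"):
    """Build parallel source/target lists from (citing, cited) pairs.

    Hypothetical helper: `citations` stands in for table 'citations' and
    `nodes_with_atts` for the ids that have a row in table 'document'.
    """
    if mode not in ("cite_to", "cited_by"):
        raise ValueError("mode must be 'cite_to' or 'cited_by'")
    keep = set(nodes_with_atts)
    source_nodes, target_nodes = [], []
    for citing, cited in citations:
        # Keep only edges whose two endpoints have attributes.
        if citing in keep and cited in keep:
            if mode == "cite_to":
                source_nodes.append(citing)
                target_nodes.append(cited)
            else:
                source_nodes.append(cited)
                target_nodes.append(citing)
    # Both lists share the same order: edge i is source_nodes[i] -> target_nodes[i].
    return source_nodes, target_nodes
```

Switching mode only swaps the roles of the two lists, so the number of edges is the same in both cases.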

readCoordsFromFile(fpath=None, fields=['thetas'], sparse=False, path2nodenames=None, ref_col='corpusid')

Reads a data matrix from a given path. This method assumes a particular data structure of the PTL projects

Parameters:
  • fpath (str or None, optional (default=None)) – Path to the file that contains the topic model. (the data file is assumed to be modelo.npz or modelo_sparse.npz inside that folder)

  • fields (str or list, optional (default=[‘thetas’])) – Name of the field or fields containing the doc-topic matrix

  • sparse (bool, optional (default=False)) – If True, the doc-topic matrix is sparse, otherwise dense

  • path2nodenames (str or None, optional (default=None)) – path to file containing metadata (in particular node names). If None, the file is assumed to be in fpath, with name docs_metadata.csv

  • ref_col (str, optional (default=’corpusid’)) – Name of the column in the metadata file (given by path2nodenames) that contains the doc id.

Returns:

  • data_out (dict) – Output data dictionary

  • df_nodes (dataframe) – Dataframe of nodes

SQL

This class provides functionality for managing a generic sqlite or mysql database:

  • reading specific fields (with the possibility to filter by field values)

  • storing calculated values in the dataset

Created on May 11 2018

@author: Jerónimo Arenas García

class rdigraphs.datamanager.base_dm_sql.BaseDMsql(db_name, db_connector, path2db=None, db_server=None, db_user=None, db_password=None, db_port=None, unix_socket=None, charset='utf8mb4')

Bases: object

Data manager base class.

__del__()

When destroying the object, it is necessary to commit changes in the database and close the connection

__init__(db_name, db_connector, path2db=None, db_server=None, db_user=None, db_password=None, db_port=None, unix_socket=None, charset='utf8mb4')

Initializes a DataManager object

Parameters:
  • db_name (str) – Name of the DB

  • db_connector (str {‘mysql’, ‘sqlite’}) – Connector

  • path2db (str or None, optional (default=None)) – Path to the project folder (sqlite only)

  • db_server (str or None, optional (default=None)) – Server (mysql only)

  • db_user (str or None, optional (default=None)) – User (mysql only)

  • db_password (str or None, optional (default=None)) – Password (mysql only)

  • db_port (str or None, optional (default=None)) – Port (mysql via TCP only). Necessary if not 3306.

  • unix_socket (str or None, optional (default=None)) – Socket for local connectivity. If available, connection is slightly faster than through TCP.

  • charset (str, optional (default=’utf8mb4’)) – Character encoding used by default in the connection

__weakref__

list of weak references to the object (if defined)

addTableColumn(tablename, columnname, columntype)

Add a new column to the specified table.

Parameters:
  • tablename (str) – Table to which the column will be added

  • columnname (str) – Name of new column

  • columntype – Type of new column.

Notes

For mysql, if type is TXT or VARCHAR, the character set is forced to be utf8.
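As an illustration of what adding a column involves, here is a minimal sketch for the sqlite case only, written against the standard sqlite3 module. The helper name add_table_column and the "skip if the column already exists" check are assumptions, and the mysql character-set handling mentioned in the note is not reproduced.

```python
import sqlite3


def add_table_column(conn, tablename, columnname, columntype):
    """Add a column to a table unless it is already there (sqlite only).

    Hypothetical helper. Table and column names are interpolated into the
    SQL text, so they must come from trusted code, never from user input.
    """
    # PRAGMA table_info returns one row per column; index 1 is the name.
    existing = [row[1] for row in conn.execute(f"PRAGMA table_info({tablename})")]
    if columnname in existing:
        return False
    conn.execute(f"ALTER TABLE {tablename} ADD COLUMN {columnname} {columntype}")
    conn.commit()
    return True


# Usage on a throwaway in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (REFERENCIA TEXT PRIMARY KEY)")
add_table_column(conn, "docs", "Lemas", "TEXT")
```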

deleteDBtables(tables=None)

Delete existing database, and regenerate empty tables

Parameters:

tables (str or list or None, optional (default=None)) – If string, name of the table to reset. If list, list of tables to reset. If None, all tables are deleted and regenerated (including those that might not have existed previously).

dropTableColumn(tablename, columnname)

Remove column from the specified table

Parameters:
  • tablename (str) – Table containing the column to be removed

  • columnname (str) – Name of column to be removed

exportTable(tablename, fileformat, path, filename, cols=None)

Export columns from a table to a file.

Parameters:
  • tablename (str) – Name of the table

  • fileformat (str {‘xlsx’, ‘pkl’}) – Type of output file

  • path (str) – Route to the output folder

  • filename (str) – Name of the output file

  • cols (list or str) – Columns to save. It can be a list or a string of comma-separated columns. If None, all columns saved.
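The three accepted forms of cols (None, a list, or a comma-separated string) can be reduced to a single column clause. The following is a small sketch of that normalization; the helper name normalize_cols and the use of "*" for "all columns" are assumptions, not the documented behaviour of exportTable.

```python
def normalize_cols(cols=None):
    """Turn None, a list, or a comma-separated string into a column clause."""
    if cols is None:
        # No selection given: take every column.
        return "*"
    if isinstance(cols, str):
        cols = cols.split(",")
    # Strip stray spaces so 'a,b' and 'a, b' give the same result.
    return ", ".join(col.strip() for col in cols)
```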

export_table_to_csv(table_name: str, output_file: Union[str, Path], block_size: Optional[int] = None, max_rows: Optional[int] = None, gzipped: bool = True, callbacks: Optional[List[Callable[[DataFrame], DataFrame]]] = None, select_options: Optional[str] = None, filter_options: Optional[str] = None, order_options: Optional[str] = None)

Exports a table to csv.

Parameters:
  • table_name (str) – Name of the SQL table.

  • output_file (str) – Name of the output csv file.

  • block_size (int, optional) – Table is read blockwise (to avoid running out of memory) in batches of this size.

  • max_rows (int, optional) – Maximum number of rows to be read. If None, the whole table is read.

  • gzipped (bool, optional (default=True)) – Whether to write a gzipped csv file (as opposed to plain text).

  • callbacks (list or None, optional (default=None)) – A list of functions to be called on every block read, before actually writing to disk. Each callable receives a DataFrame and returns another one with the same structure.

  • select_options (str or None, optional (default=None)) – “select” options to be passed to readDBtable

  • filter_options (str or None, optional (default=None)) – “filter” options to be passed to readDBtable

  • order_options (str or None, optional (default=None)) – “order” options to be passed to readDBtable
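The blockwise reading, the row limit, the optional gzip compression and the callbacks can be sketched with the standard library alone. This is an illustration under stated assumptions, not the method itself: it takes an open sqlite3 connection, the callbacks operate on lists of row tuples instead of DataFrames, and the select, filter and order options are left out.

```python
import csv
import gzip
import sqlite3


def export_table_to_csv(conn, table_name, output_file, block_size=1000,
                        max_rows=None, gzipped=True, callbacks=None):
    """Stream a table to csv in blocks, so it never sits fully in memory."""
    cursor = conn.execute(f"SELECT * FROM {table_name}")
    header = [col[0] for col in cursor.description]
    opener = gzip.open if gzipped else open
    n_read = 0
    with opener(output_file, "wt", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        while max_rows is None or n_read < max_rows:
            # Never fetch more rows than the remaining budget allows.
            size = block_size if max_rows is None else min(block_size, max_rows - n_read)
            block = cursor.fetchmany(size)
            if not block:
                break
            n_read += len(block)
            # Each callback receives the block and returns a transformed one.
            for callback in callbacks or []:
                block = callback(block)
            writer.writerows(block)
    return n_read
```

Counting rows before the callbacks run keeps max_rows tied to what was read from the table, even if a callback were to change the number of rows.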

getColumnNames(tablename)

Returns a list with the names of all columns in the indicated table

Parameters:

tablename (str) – Table to be read from

Returns:

columnames – Names of all columns in the selected table

Return type:

list

getTableInfo(tablename)

Get information about the given table (size and columns)

Parameters:

tablename (str) – Table to be read from

Returns:

  • cols (list) – Names of all columns in the table

  • n_rows (int) – Number of rows in table

getTableNames()

Provides access to table names

Returns:

tbnames – Names of all tables in the database

Return type:

list

insertInTable(tablename, columns, arguments)

Insert new records into table

Parameters:
  • tablename (str) – Name of table in which the data will be inserted

  • columns (list) – Name of columns for which data are provided

  • arguments (list of list of tuples) – A list of lists of tuples, each element associated to one new entry for the table

readDBtable(tablename, limit=None, selectOptions=None, filterOptions=None, orderOptions=None)

Reads data from a table in the database. Optionally, only some specific fields are read.

Parameters:
  • tablename (str) – Table to be read from

  • limit (int or None, optional (default=None)) – The maximum number of records to retrieve

  • selectOptions (str or None, optional (default=None)) – string with fields that will be retrieved (e.g. ‘REFERENCIA, Resumen’)

  • filterOptions (str or None, optional (default=None)) – Filtering options for the SQL query (e.g., ‘WHERE UNESCO_cd=23’)

  • orderOptions (str or None, optional (default=None)) – Field that will be used for sorting the results of the query (e.g, ‘Cconv’)
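How the four optional arguments might combine into one query is sketched below. This is a guess at the composition, not the method's actual code: the helper name build_select is hypothetical, and filterOptions is assumed to carry its own WHERE keyword, as in the example given for that parameter. The table name proyectos in the usage line is made up.

```python
def build_select(tablename, limit=None, selectOptions=None,
                 filterOptions=None, orderOptions=None):
    """Compose a SELECT statement from the optional pieces."""
    query = f"SELECT {selectOptions or '*'} FROM {tablename}"
    if filterOptions:
        # Assumption: the caller passes the full clause, e.g. 'WHERE x=1'.
        query += f" {filterOptions}"
    if orderOptions:
        query += f" ORDER BY {orderOptions}"
    if limit is not None:
        query += f" LIMIT {int(limit)}"
    return query


query = build_select("proyectos", limit=5, selectOptions="REFERENCIA, Resumen",
                     filterOptions="WHERE UNESCO_cd=23", orderOptions="Cconv")
# query == "SELECT REFERENCIA, Resumen FROM proyectos WHERE UNESCO_cd=23 ORDER BY Cconv LIMIT 5"
```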

setField(tablename, keyfld, valueflds, values)

Update records of a DB table

Parameters:
  • tablename (str) – Table that will be modified

  • keyfld (str) – Name of the column that will be used as key (e.g. ‘REFERENCIA’)

  • valueflds (list) – Names of the columns that will be updated (e.g., ‘Lemas’)

  • values (list of tuples) – A list of tuples in the format (keyfldvalue, valuefldvalue) (e.g., [(‘Ref1’, ‘gen celula’), (‘Ref2’, ‘big_data, algorithm’)])
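The (keyfldvalue, valuefldvalue) tuple format maps naturally onto a parameterized UPDATE. The sketch below shows this for sqlite with the standard sqlite3 module; the helper name set_field and the explicit connection argument are assumptions, and the real method may build its statement differently.

```python
import sqlite3


def set_field(conn, tablename, keyfld, valueflds, values):
    """Update `valueflds` for every key listed in `values` (sqlite sketch)."""
    if isinstance(valueflds, str):
        valueflds = [valueflds]
    assignments = ", ".join(f"{col}=?" for col in valueflds)
    sql = f"UPDATE {tablename} SET {assignments} WHERE {keyfld}=?"
    # Input tuples are (key, value, ...); the SQL expects (value, ..., key).
    params = [tuple(row[1:]) + (row[0],) for row in values]
    conn.executemany(sql, params)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (REFERENCIA TEXT PRIMARY KEY, Lemas TEXT)")
conn.executemany("INSERT INTO docs VALUES (?, ?)", [("Ref1", None), ("Ref2", None)])
set_field(conn, "docs", "REFERENCIA", ["Lemas"],
          [("Ref1", "gen celula"), ("Ref2", "big_data, algorithm")])
```

Passing the values as query parameters, instead of formatting them into the SQL text, avoids quoting problems with strings that contain commas or apostrophes.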

upsert(tablename, keyfld, df, robust=True)

Update records of a DB table with the values in the df. This function implements the following additional functionality:

  • If there are columns in df that are not in the SQL table, columns will be added

  • New records will be created in the table if there are rows in the dataframe without an entry already in the table. For this, keyfld indicates which is the column that will be used as an index

Parameters:
  • tablename (str) – Table that will be modified

  • keyfld (str) – Name of the column that will be used as key (e.g. ‘REFERENCIA’)

  • df (dataframe) – Dataframe that we wish to save in table tablename

  • robust (bool, optional (default=True)) – If False, verifications are skipped (for a faster execution)
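The two behaviours described for upsert (adding missing columns, then updating existing keys and inserting new ones) can be sketched for sqlite as follows. Several things here are assumptions made for the sake of a stdlib-only example: a list of dicts replaces the dataframe, new columns are created as TEXT, and the robust flag is omitted.

```python
import sqlite3


def upsert(conn, tablename, keyfld, rows):
    """Update-or-insert `rows` (a list of dicts with identical keys)."""
    if not rows:
        return
    columns = list(rows[0])
    # Step 1: add any column that the table does not have yet.
    table_cols = {r[1] for r in conn.execute(f"PRAGMA table_info({tablename})")}
    for col in columns:
        if col not in table_cols:
            conn.execute(f"ALTER TABLE {tablename} ADD COLUMN {col} TEXT")
    # Step 2: update rows whose key exists, insert the others.
    known_keys = {r[0] for r in conn.execute(f"SELECT {keyfld} FROM {tablename}")}
    value_cols = [c for c in columns if c != keyfld]
    for row in rows:
        if row[keyfld] in known_keys:
            if value_cols:
                assignments = ", ".join(f"{c}=?" for c in value_cols)
                conn.execute(
                    f"UPDATE {tablename} SET {assignments} WHERE {keyfld}=?",
                    [row[c] for c in value_cols] + [row[keyfld]])
        else:
            placeholders = ", ".join("?" for _ in columns)
            conn.execute(
                f"INSERT INTO {tablename} ({', '.join(columns)}) VALUES ({placeholders})",
                [row[c] for c in columns])
            known_keys.add(row[keyfld])
    conn.commit()
```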

Neo4j

This class provides general functionality for managing a Neo4J database

  • reading specific fields (with the possibility to filter by field values)

  • storing calculated values in the dataset

Created on Sep 06 2018

@author: Manu A. Vázquez

class rdigraphs.datamanager.base_dm_neo4j.BaseDMneo4j(db_server: str, db_password: str, db_user: str = 'neo4j')

Bases: object

Base class for interacting with a Neo4j database.

__del__()

Tidies up before deleting the object.

__init__(db_server: str, db_password: str, db_user: str = 'neo4j') None

Initializer

Parameters:
  • db_server (str) – The URL for the server

  • db_password (str) – User Password

  • db_user (str, optional) – User login

__weakref__

list of weak references to the object (if defined)

drop_node_property(label: str, property: str) None

Deletes a property from nodes.

Parameters:
  • label (str) – The label of the node

  • property (str) – The name of the property

drop_relationship(relationship_type: str) None

Deletes a relationship.

Parameters:

relationship_type (str) – The type of the relationship

export_graph(label_nodes, path2nodes, col_ref_nodes, label_edges, path2edges)

Export graph to Neo4J

Parameters:
  • label_nodes (str or tuple of str) – If str, all nodes are of the same type. If tuple, one type for sources, the other for destinations.

  • path2nodes (str) – Path to the file of nodes: a csv file with one column for the node names, and possibly other columns for attributes (that may be numeric or str)

  • col_ref_nodes (str) – Name of the column in the file of nodes containing the node names

  • label_edges (str) – Type of relationship for all edges.

  • path2edges (str) – Path to the edges file: a csv file with 3 columns: Source, Target, Weight

get_db_structure() dict

Returns meta-data.

Returns:

out – Metadata

Return type:

dictionary

make_edges(df: DataFrame, source: Tuple[str, Tuple[str, str]], destination: Tuple[str, Tuple[str, str]], relationship: Tuple[str, Dict[str, str]])

Makes edges between nodes as specified in a dataframe; if the requested nodes don’t exist, they are created

Parameters:
  • df (pandas Dataframe) – The data

  • source (tuple of a str and a tuple of two str) – The first element is the name of the column (within this DataFrame) specifying the source node; the second is a tuple whose first element is the label in the graph and whose second is the property (in the nodes with the aforementioned label) that must match the values in the column.

  • destination (tuple of a str and a tuple of two str) – The first element is the name of the column (within this DataFrame) specifying the destination node; the second is a tuple whose first element is the label in the graph and whose second is the property (in the nodes with the aforementioned label) that must match the values in the column.

  • relationship (tuple of a str and a dict) – The first element is the type of the relationship to be created between nodes; the second is a dictionary that maps columns in this DataFrame to properties of the relationship.

properties_of_label(label: str) List[str]

Returns all the properties (across all the nodes) of a given label.

Parameters:

label (str) – The label (type)

Returns:

out – A list with the properties

Return type:

list

properties_of_relationship(relationship_type: str) List[str]

Returns all the properties of a given relationship.

Parameters:

relationship_type (str) – The type of the relationship

Returns:

out – A list with the properties

Return type:

list

read_edges(relationship_type: str, limit: Optional[int] = None) Optional[DataFrame]

Reads edges from the database.

Parameters:
  • relationship_type (str) – Type of the relationship

  • limit (int) – Maximum number of edges

Returns:

out – Every row contains information about a single edge

Return type:

Pandas dataframe

read_nodes(label: str, limit: Optional[int] = None, select_options=None, filter_options=None, order_options=None) DataFrame

Reads nodes from the database.

Parameters:
  • label (str) – Label of the nodes

  • limit (int) – Maximum number of nodes

  • select_options (unused)

  • filter_options (unused)

  • order_options (unused)

Returns:

out – Every row contains information about a node

Return type:

Pandas dataframe

reset_database() None

Reset the database, deleting everything.

write_dataframe(df: DataFrame, source: Tuple[str, List[str]], destination: Tuple[str, List[str]], edge: Tuple[str, List[str]]) None

Writes a dataframe in the database

Parameters:
  • df (pandas Dataframe) – The data

  • source (tuple of a str and a list of str) – The first element in the tuple is the label for the source node, and the second one is a list with the columns of the dataframe that will become the properties of the node

  • destination (tuple of a str and a list of str) – The first element in the tuple is the label for the destination node, and the second one is a list with the columns of the dataframe that will become the properties of the node

  • edge (tuple of a str and a list of str) – The first element in the tuple is the label for the edge, and the second one is a list with the columns of the dataframe that will become the properties of the edge

Cypher

class rdigraphs.datamanager.cypher.DataLoadingStatement(csv_files_path: Path, csv_line: str = 'line', data_access: str = 'file://')

Bases: Statement

Class to load data from csv files into Neo4j.

__init__(csv_files_path: Path, csv_line: str = 'line', data_access: str = 'file://') None

Initializer.

Parameters:
  • csv_files_path (pathlib’s Path) – Path to the csv files

  • csv_line (str, optional) – Name of a line from the file when referred to in another statement

flatten_dic(d, sep=':', property_name_prefix='')

Takes a dictionary specifying SQL to neo4j conversions and returns a string amenable to be used in cypher.

Parameters:
  • d (dict) – A dictionary mapping properties to values.

  • sep (str) – Separator between the property and the value

  • property_name_prefix (str) – A string to prepend to every property name.

Returns:

out – A piece of cypher statement specifying a mapping between properties and values.

Return type:

str

load_clause(parameters, using_periodic_commit=True) str

Returns a clause to load data from a csv file.

Parameters:
  • parameters (dictionary) – Settings

  • using_periodic_commit (bool) – Whether to use the “periodic commit” operation mode

Returns:

statement – Neo4j statement

Return type:

str

match_clause(labels, matching_columns, matching_properties, nodes=['origin', 'destination']) str

Returns a clause to match data read from the file with that in the database.

Parameters:
  • labels (list) – Labels of the nodes to be matched

  • matching_columns (list) – Columns in the csv file

  • matching_properties (list) – Properties of the nodes to be matched

  • nodes (list) – Names of the nodes to be matched when referred to in another statement

Returns:

out – Neo4j statement

Return type:

str

class rdigraphs.datamanager.cypher.LoadAttributesTable(csv_files_path: Path, csv_line: str = 'line', data_access: str = 'file://')

Bases: DataLoadingStatement

Class to load data from a csv file representing an “attributes” table.

assemble(parameters: dict) Union[str, List[str]]

Returns a statement to load data from a csv file representing an “attributes” table.

Parameters:

parameters (dictionary) – Settings

Returns:

out – Neo4j statement

Return type:

str

class rdigraphs.datamanager.cypher.LoadJoinTable(csv_files_path: Path, csv_line: str = 'line', data_access: str = 'file://')

Bases: DataLoadingStatement

Class to load data from a csv file representing a “join” table.

assemble(parameters: dict) Union[str, List[str]]

Returns a statement to load data from a csv file representing a “join” table.

Parameters:

parameters (dictionary) – Settings

Returns:

out – Neo4j statement

Return type:

str

class rdigraphs.datamanager.cypher.LoadTable(csv_files_path: Path, csv_line: str = 'line', data_access: str = 'file://')

Bases: DataLoadingStatement

Class to load a table from a csv file.

assemble(parameters: dict) Union[str, List[str]]

Returns a statement to load data from a csv file.

Parameters:

parameters (dictionary) – Settings

Returns:

out – Neo4j statement

Return type:

str

class rdigraphs.datamanager.cypher.MakeRelationship(csv_files_path: Path, csv_line: str = 'line', data_access: str = 'file://')

Bases: DataLoadingStatement

Class to load data from a csv file representing a “relationship” table.

assemble(parameters: dict) Union[str, List[str]]

Returns a statement to load data from a csv file representing a “relationship” table.

Parameters:

parameters (dictionary) – Settings

Returns:

out – Neo4j statement

Return type:

str

class rdigraphs.datamanager.cypher.Statement

Bases: object

Base class for a Neo4j statement.

__weakref__

list of weak references to the object (if defined)

class rdigraphs.datamanager.cypher.UpdateTable(csv_files_path: Path, csv_line: str = 'line', data_access: str = 'file://')

Bases: DataLoadingStatement

Class to update an existing table using data from a csv file.

assemble(parameters: dict) Union[str, List[str]]

Returns a statement to update an existing table using data from a csv file.

Parameters:

parameters (dictionary) – Settings

Returns:

out – Neo4j statement

Return type:

str

rdigraphs.datamanager.cypher.assert_uniqueness_clause(property: str, node: str = 'node') str

Returns the part of a statement that ensures a property of a node is unique.

Parameters:
  • property (str) – Name of the meant-to-be-unique property

  • node (str, optional) – Name of the node (coming from other statement)

Returns:

out – Neo4j statement

Return type:

str

rdigraphs.datamanager.cypher.create_constraint_clause(label: str, node: str = 'node') str

Returns the part of a statement that creates a constraint.

Parameters:
  • label (str) – The label (node type) on which to create the index

  • node (str, optional) – Name of the node when referred in another statement

Returns:

out – Neo4j statement

Return type:

str

rdigraphs.datamanager.cypher.create_indexes(label: str, properties: List[str]) Union[str, List[str]]

Returns a statement that creates indexes on a given label.

Parameters:
  • label (str) – The label (node type) on which to create the index

  • properties (list) – A list of properties to be indexed on the given label

Returns:

out – Neo4j statement

Return type:

str

rdigraphs.datamanager.cypher.create_or_merge_node(label: str, properties: dict = {}, name: str = 'A', merge: bool = False) str

Returns a statement that creates or merges a new node.

Parameters:
  • label (str) – The label of the node

  • properties (dictionary) – The properties of the node

  • name (str, optional) – Name of the node when referred in another statement

  • merge (boolean) – Whether a “merge” statement rather than a “create” one is to be returned

Returns:

out – Neo4j statement

Return type:

str
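To make the role of each argument concrete, here is a minimal sketch of a function with the same signature that builds a Cypher CREATE or MERGE pattern. The exact text produced by the real function is not shown in this reference, so the output format below, and the use of json.dumps to quote values, are assumptions.

```python
import json


def create_or_merge_node(label, properties=None, name="A", merge=False):
    """Return e.g. CREATE (A:Person {name: "Ada"}) as a string (sketch)."""
    keyword = "MERGE" if merge else "CREATE"
    props = ""
    if properties:
        # json.dumps yields double-quoted strings and bare numbers,
        # both of which are valid Cypher literals.
        body = ", ".join(f"{key}: {json.dumps(value)}"
                         for key, value in properties.items())
        props = " {" + body + "}"
    return f"{keyword} ({name}:{label}{props})"


statement = create_or_merge_node("Person", {"name": "Ada"})
# statement == 'CREATE (A:Person {name: "Ada"})'
```

The sketch uses None as the default for properties instead of the mutable {} shown in the signature, which avoids sharing one dictionary between calls.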

rdigraphs.datamanager.cypher.create_or_merge_relationship(origin: str, destination: str, rel_type: str, properties: dict = {}, name: str = 'rel', merge: bool = False) str

Returns a statement that creates or merges a new relationship.

Parameters:
  • origin (str) – Name (identifier/variable) of the origin node

  • destination (str) – Name (identifier/variable) of the destination node

  • rel_type (str) – Relationship type

  • properties (dictionary) – Properties to be added to the relationship

  • name (str, optional) – Name of the relationship when referred to in another statement

  • merge (boolean) – Whether a “merge” statement rather than a “create” one is to be returned

Returns:

out – Neo4j statement

Return type:

str

rdigraphs.datamanager.cypher.create_unique_constraint(label: str, property: str) str

Returns a statement to create a unique constraint.

Parameters:
  • label (str) – The label of the node

  • property (str) – The property of the node

Returns:

out – Neo4j statement

Return type:

str

rdigraphs.datamanager.cypher.dic_to_properties(d: dict)

Takes a dictionary and returns a string amenable to be used in cypher.

Parameters:

d (dict) – A dictionary mapping properties to values.

Returns:

out – A piece of cypher statement specifying properties and values.

Return type:

str

rdigraphs.datamanager.cypher.drop_null_valued_keys(d: dict)

Convenience function to get rid of null (None) values in a dictionary.

Parameters:

d (dict) – Any dictionary

Returns:

out – The input dictionary without null values

Return type:

dict
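The two dictionary helpers above are simple enough to sketch in a few lines. The versions below share the documented names for readability but are illustrations only: in particular, the map-literal format returned by dic_to_properties and the use of json.dumps for quoting are assumptions.

```python
import json


def drop_null_valued_keys(d):
    """Return a copy of `d` without the keys whose value is None."""
    return {key: value for key, value in d.items() if value is not None}


def dic_to_properties(d):
    """Render a dict as a Cypher map literal, e.g. {name: "Ada", age: 36}."""
    parts = [f"{key}: {json.dumps(value, ensure_ascii=False)}"
             for key, value in d.items()]
    return "{" + ", ".join(parts) + "}"


props = drop_null_valued_keys({"name": "Ada", "age": 36, "email": None})
text = dic_to_properties(props)
# text == '{name: "Ada", age: 36}'
```

Dropping None values first matters because a Cypher MERGE does not accept null property values in its pattern.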

rdigraphs.datamanager.cypher.get_metadata() str

Returns a statement to get the “schema” of the database.

Returns:

out – Neo4j statement

Return type:

str

rdigraphs.datamanager.cypher.labels() str

Returns a statement to get all the labels (node types) in the database.

Returns:

out – Neo4j statement

Return type:

str

rdigraphs.datamanager.cypher.make_relationship(origin_label: str = '', destination_label: str = '', relationship_type: str = '', origin_properties: dict = {}, relationship_properties: dict = {}, destination_properties: dict = {})

Returns a statement to make a relationship while merging properties in the nodes involved.

Parameters:
  • origin_label (str) – Label of the origin node

  • destination_label (str) – Label of the destination node

  • relationship_type (str) – Type of the relationship

  • origin_properties (dictionary) – Properties of the origin node

  • relationship_properties (dictionary) – Properties of the relationship

  • destination_properties (dictionary) – Properties of the destination node

Returns:

out – Neo4j statement

Return type:

str

rdigraphs.datamanager.cypher.match_label(label: str, limit: Optional[int] = None, dummy: str = 'node') str

Returns a statement that matches and returns a label.

Parameters:
  • label (str) – The label of the node

  • limit (int, optional) – The maximum number of matches

  • dummy (str, optional) – Name of the node when referred in another statement

Returns:

out – Neo4j statement

Return type:

str
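A statement that "matches and returns a label" could look like the following sketch. The signature is the documented one, but the generated text is an assumption about what such a statement plausibly contains.

```python
def match_label(label, limit=None, dummy="node"):
    """Return a Cypher statement that matches all nodes with `label`."""
    statement = f"MATCH ({dummy}:{label}) RETURN {dummy}"
    if limit is not None:
        # Cap the number of returned nodes.
        statement += f" LIMIT {int(limit)}"
    return statement


statement = match_label("Person", limit=10)
# statement == "MATCH (node:Person) RETURN node LIMIT 10"
```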

rdigraphs.datamanager.cypher.match_relationship(relationship: str, limit: Optional[int] = None, surrogate_relationship: str = 'rel', surrogate_path: str = 'path') str

Returns a statement that matches relationships.

Parameters:
  • relationship (str) – The (type of the) relationship

  • limit (int, optional) – The maximum number of matches

  • surrogate_relationship (str, optional) – Name of the relationship when referred in another statement

  • surrogate_path (str, optional) – Name of the path when referred in another statement

Returns:

out – Neo4j statement

Return type:

str

rdigraphs.datamanager.cypher.merge_relationship_with_properties_clause(origin: str = 'origin', destination: str = 'destination', origin_label: str = '', destination_label: str = '', relationship_type: str = '', origin_properties: dict = {}, relationship_properties: dict = {}, destination_properties: dict = {}, relationship: str = 'rel', arrowhead: str = '>')

Returns a clause that merges a relationship specifying properties for each entity.

Parameters:
  • origin (str) – Name (identifier/variable) of the origin node

  • destination (str) – Name (identifier/variable) of the destination node

  • origin_label (str) – Label of the origin node

  • destination_label (str) – Label of the destination node

  • relationship_type (str) – Type of the relationship

  • origin_properties (dictionary) – Properties of the origin node

  • relationship_properties (dictionary) – Properties of the relationship

  • destination_properties (dictionary) – Properties of the destination node

  • relationship (str, optional) – Name of the relationship when referred to in another statement

  • arrowhead (string) – Suffix controlling whether the relationship is (bi)directional or not

Returns:

out – Neo4j statement

Return type:

str

rdigraphs.datamanager.cypher.relationship_types() str

Returns a statement to get all the relationship types in the database.

Returns:

out – Neo4j statement

Return type:

str

rdigraphs.datamanager.cypher.remove_property(label: str, property: str, surrogate_node: str = 'node') str

Returns a statement that removes a property from a node.

Parameters:
  • label (str) – The label of the node

  • property (str) – The name of the property

  • surrogate_node (str, optional) – Name of the node when referred in another statement

Returns:

out – Neo4j statement

Return type:

str

rdigraphs.datamanager.cypher.remove_relationship(relationship_type: str, var: str = 'r') str

Returns a statement that removes a relationship.

Parameters:
  • relationship_type (str) – The type of the relationship

  • var (str, optional) – Name of the relationship when referred in another statement

Returns:

out – Neo4j statement

Return type:

str

rdigraphs.datamanager.cypher.reset_database() List[str]

Returns a statement that resets the database, i.e., deletes everything.

Returns:

out – Neo4j statement

Return type:

str

Migration Callbacks

class rdigraphs.datamanager.callbacks.AuthorsDisambiguator(disambiguation_map: List[str], new_id: str)

Bases: Disambiguator

Class containing callback methods needed for disambiguating authors.

patstats_person(df: DataFrame)

Callback function for “person” table in “patstats” database.

Parameters:

df (Pandas dataframe) – Input data

Returns:

df – Output data

Return type:

Pandas dataframe

patstats_person_application(df: DataFrame)

Callback function for “application” table in “patstats” database.

Parameters:

df (Pandas dataframe) – Input data

Returns:

df – Output data

Return type:

Pandas dataframe

projects_researcher_project(df: DataFrame)

Callback function for “researcher_project” table in “projects” database.

Parameters:

df (Pandas dataframe) – Input data

Returns:

df – Output data

Return type:

Pandas dataframe

projects_researchers(df: DataFrame)

Callback function for “researchers” table in “projects” database.

Parameters:

df (Pandas dataframe) – Input data

Returns:

df – Output data

Return type:

Pandas dataframe

scopus_authorship(df: DataFrame)

Callback function for “authorship” table in “scopus” database.

Parameters:

df (Pandas dataframe) – Input data

Returns:

df – Output data

Return type:

Pandas dataframe

class rdigraphs.datamanager.callbacks.Disambiguator(disambiguation_map: List[str], new_id: str)

Bases: object

Class to process disambiguation data.

__init__(disambiguation_map: List[str], new_id: str) None

Initializer.

Parameters:
  • disambiguation_map (list) – Path to the disambiguation map

  • new_id (str) – The name of the new field/column/neo4j property to be created for storing the final disambiguated id

__weakref__

list of weak references to the object (if defined)

static handle_duplicates(df)

Get rid of duplicated (disambiguated) ids.

Parameters:

df (Pandas dataframe) – Data with duplicates

Returns:

unmapped – Input data with the only row with the value “left_only” in column source, or the 1st row of the input dataframe if there are several.

Return type:

Pandas dataframe

merge(mapping: DataFrame, df: DataFrame, field: str, prefix: str)

Replace within the passed DataFrame the values of field that are present in the disambiguation map, while at the same time ensuring that the new disambiguated values are unique.

Parameters:
  • mapping (Pandas dataframe) – Disambiguation map

  • df (Pandas dataframe) – Data to be disambiguated

  • field (str) – Column to be used in df

  • prefix (str) – Prefix to be added to a “new” identifier built from an existing one that is not present in the disambiguation map

Returns:

merge – Input data with the disambiguation map applied.

Return type:

Pandas dataframe

replace(mapping: DataFrame, df: DataFrame, field: str, prefix: str)

Replace within the passed DataFrame the values of field that are present in the disambiguation map.

Parameters:
  • mapping (Pandas dataframe) – Disambiguation map

  • df (Pandas dataframe) – Data to be disambiguated

  • field (str) – Column to be used in df

  • prefix (str) – Prefix to be added to a “new” identifier built from an existing one that is not present in the disambiguation map

Returns:

merge – Input data with the disambiguation map applied.

Return type:

Pandas dataframe
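One plausible reading of the replace step is sketched below with plain Python structures. It is an interpretation, not the class's code: a dict stands in for the disambiguation map, a list of dicts for the dataframe, and the helper name replace_ids, the new_id default and the choice to keep the original field alongside the new one are all assumptions.

```python
def replace_ids(mapping, rows, field, prefix, new_id="disambiguated_id"):
    """Attach a disambiguated id to every row (hypothetical helper).

    Values of `field` found in `mapping` are translated; the others get a
    "new" identifier made of `prefix` plus the existing value.
    """
    out = []
    for row in rows:
        old = row[field]
        new = mapping.get(old, f"{prefix}{old}")
        # Copy the row so the input data is left untouched.
        out.append({**row, new_id: new})
    return out
```

For example, with mapping {"a1": "p1"} and prefix "scopus_", a row with id "a1" receives "p1" while an unmapped id "a9" becomes "scopus_a9".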

class rdigraphs.datamanager.callbacks.OrganizationsDisambiguator(disambiguation_map: List[str], new_id: str)

Bases: AuthorsDisambiguator

Class containing callback methods needed for disambiguating organizations.

projects_organizations(df: DataFrame)

Callback function for “organizations” table in “projects” database.

Parameters:

df (Pandas dataframe) – Input data

Returns:

df – Output data

Return type:

Pandas dataframe

projects_projects(df: DataFrame)

Callback function for “projects” table in “projects” database.

Parameters:

df (Pandas dataframe) – Input data

Returns:

df – Output data

Return type:

Pandas dataframe

scopus_authorship(df: DataFrame)

Callback function for “authorship” table in “scopus” database.

Parameters:

df (Pandas dataframe) – Input data

Returns:

df – Output data

Return type:

Pandas dataframe

rdigraphs.datamanager.callbacks.initialize(parameters: dict) None

Initializes this module.

Parameters:

parameters (dictionary) – Settings

rdigraphs.datamanager.callbacks.patents_literature(df: DataFrame)

Callback function for relating patents literature with itself in “patents” database.

Parameters:

df (Pandas dataframe) – Input data

Returns:

df – Output data

Return type:

Pandas dataframe

rdigraphs.datamanager.callbacks.patents_non_literature(df: DataFrame)

Callback function for relating patents literature with non-patents literature in “patents” database.

Parameters:

df (Pandas dataframe) – Input data

Returns:

df – Output data

Return type:

Pandas dataframe

rdigraphs.datamanager.callbacks.patents_person(df: DataFrame)

Callback function for “person” table in “patents” database.

Parameters:

df (Pandas dataframe) – Input data

Returns:

df – Output data

Return type:

Pandas dataframe

rdigraphs.datamanager.callbacks.projects_researchers(df: DataFrame, name_col='NOMBRE')

Callback function for “researchers” table in “projects” database.

Parameters:
  • df (Pandas dataframe) – Input data

  • name_col (str, optional) – Name of the column with the researchers’ names

Returns:

df – Output data

Return type:

Pandas dataframe

rdigraphs.datamanager.callbacks.publications_authorship(df: DataFrame)

Callback function for “authorship” table in “publications” database.

Parameters:

df (Pandas dataframe) – Input data

Returns:

df – Output data

Return type:

Pandas dataframe

Util

rdigraphs.datamanager.util.simple_http_server(host: str = '0.0.0.0', port: int = 4001, path: str = '.')

Sets up an HTTP server for files.

Parameters:
  • host (str) – The IP where the host will “listen”

  • port (int) – The port where the host will “listen”

  • path (str) – The path to be served

Returns:

  • start (function) – It starts the server

  • stop (function) – It stops the server
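A function with this signature and this pair of return values can be sketched with the standard http.server module. The version below is an illustration under assumptions (a background daemon thread, a server that is created only when start is called), not the module's actual code.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def simple_http_server(host="0.0.0.0", port=4001, path="."):
    """Return (start, stop) functions for a file server rooted at `path`."""
    handler = functools.partial(SimpleHTTPRequestHandler, directory=path)
    state = {}

    def start():
        # Binding happens here, so nothing listens until start() is called.
        server = ThreadingHTTPServer((host, port), handler)
        thread = threading.Thread(target=server.serve_forever, daemon=True)
        thread.start()
        state["server"], state["thread"] = server, thread

    def stop():
        server = state.pop("server", None)
        if server is None:
            return  # never started, or already stopped
        server.shutdown()      # ends the serve_forever loop
        server.server_close()  # releases the listening socket
        state.pop("thread").join()

    return start, stop
```

A typical use is to call start() before an import that reads csv files over HTTP and stop() once it has finished. Note that the default host 0.0.0.0 listens on every network interface, so the served folder is reachable from other machines.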