Datamanager

The classes in this file provide functionality to interact with the specific databases provided for the PTL projects.

class rdigraphs.datamanager.datamanager.DMneo4j(db_server: str, db_password: str, db_user: str = 'neo4j')

Bases: BaseDMneo4j

This class is an extension of BaseDMneo4j to include some additional functionality

class rdigraphs.datamanager.datamanager.DMsql(db_name, db_connector, path2db=None, db_server=None, db_user=None, db_password=None, db_port=None, unix_socket=None, charset='utf8mb4')

Bases: BaseDMsql

This class is an extension of BaseDMsql to include some additional functionality

class rdigraphs.datamanager.datamanager.DataManager(path2project, db_params, path2source=None)

Bases: object

This is the datamanager for a supergraph project. It provides functionality to manage both the Neo4j graph DB and the SQL databases containing the source data. To do so, it uses generic managers for Neo4j and SQL.

__init__(path2project, db_params, path2source=None)

Initializes the datamanager object, which facilitates read and write operations.

File operation methods are available.

Also, several SQL and Neo4j DataManager objects are created to facilitate interaction with the databases.

Each SQL manager is stored in the dictionary self.SQL. Typically:

self.SQL['db1'] : SQL database named db1
self.SQL['db2'] : SQL database named db2
…
self.Neo4j : Neo4j graph database

Parameters:

path2project (str) – Path to the project folder
db_params (dict) – Parameters to establish db connections.
path2source (str or None, optional (default=None)) – Path to the folder containing several data sources. This parameter is optional to allow backward compatibility. Future versions of this datamanager will modify all methods to use this variable, which will then necessarily be string-like.

__weakref__: list of weak references to the object (if defined)

import_graph_data_from_tables(table_name, sampling_factor=1)

Loads a dataframe of documents from one or several files in tabular format.

Parameters:

table_name (str) – Name of the tabular dataset. It should be the name of a folder in self.path2source
sampling_factor (float, optional (default=1)) – Fraction of documents to be taken from the original corpus. (Used for SemanticScholar and patstat only)

load_SCOPUS_citations_all(col_ref, atts, mode='cite_to')

Extracts all data from table ‘citations’ in SCOPUS SQL database

The graph contains all nodes in the citation graph, no matter if they have attributes in table ‘document’.

Parameters:

col_ref (str) – Column in sql table that will be used as index
atts (list of str) – Columns in sql table that will be taken as attributes
mode (str {‘cite_to’, ‘cited_by’}) – If ‘cite_to’, source nodes are the citing papers. If ‘cited_by’, source nodes are the cited papers.

Returns:

nodes (list) – Nodes with attributes in table ‘document’
df_atts (pandas dataframe) – Attributes of the nodes
source_nodes (list) – Source of all edges between nodes in the list of nodes
target_nodes – Target of all edges between nodes in the list of nodes. Edges are assumed to be in the same order in source_nodes and target_nodes

load_SCOPUS_citations_with_atts(col_ref, atts, mode='cite_to')

Extracts data from table ‘citations’ in SCOPUS SQL database

The subgraph contains the nodes with attributes in table ‘document’ from the same database.

Parameters:

col_ref (str) – Column in sql table that will be used as index
atts (list of str) – Columns in sql table that will be taken as attributes
mode (str {‘cite_to’, ‘cited_by’}) – If ‘cite_to’, source nodes are the citing papers. If ‘cited_by’, source nodes are the cited papers.

Returns:

nodes (list) – Nodes with attributes in table ‘document’
df_atts (pandas dataframe) – Attributes of the nodes
source_nodes (list) – Source of all edges between nodes in the list of nodes
target_nodes – Target of all edges between nodes in the list of nodes. Edges are assumed to be in the same order in source_nodes and target_nodes
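The effect of mode, and the difference between the two SCOPUS loaders, can be illustrated with a minimal sketch. It is not the library's implementation: the helper names orient_edges and restrict_to_nodes, and the (citing, cited) pair layout of the input, are assumptions made for illustration only.

```python
def orient_edges(citations, mode="cite_to"):
    """Split (citing, cited) pairs into source and target lists."""
    if mode == "cite_to":
        # Source nodes are the citing papers
        pairs = [(citing, cited) for citing, cited in citations]
    elif mode == "cited_by":
        # Source nodes are the cited papers
        pairs = [(cited, citing) for citing, cited in citations]
    else:
        raise ValueError(f"Unknown mode: {mode!r}")
    source_nodes = [s for s, _ in pairs]
    target_nodes = [t for _, t in pairs]
    return source_nodes, target_nodes


def restrict_to_nodes(source_nodes, target_nodes, nodes):
    """Keep only the edges whose two endpoints are in `nodes`."""
    keep = set(nodes)
    pairs = [(s, t) for s, t in zip(source_nodes, target_nodes)
             if s in keep and t in keep]
    return [s for s, _ in pairs], [t for _, t in pairs]


# Hypothetical citation pairs: (citing paper, cited paper)
citations = [("p1", "p2"), ("p1", "p3"), ("p4", "p1")]
src, tgt = orient_edges(citations, mode="cited_by")

# Subgraph limited to the nodes that have attributes (here p1, p2, p3)
sub_src, sub_tgt = restrict_to_nodes(src, tgt, nodes=["p1", "p2", "p3"])
```

In both lists the i-th entries describe the same edge, which matches the ordering guarantee stated for source_nodes and target_nodes.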

readCoordsFromFile(fpath=None, fields=['thetas'], sparse=False, path2nodenames=None, ref_col='corpusid')

Reads a data matrix from a given path. This method assumes a particular data structure of the PTL projects

Parameters:

fpath (str or None, optional (default=None)) – Path to the file that contains the topic model. (the data file is assumed to be modelo.npz or modelo_sparse.npz inside that folder)
fields (str or list, optional (default=[‘thetas’])) – Name of the field or fields containing the doc-topic matrix
sparse (bool, optional (default=False)) – If True, the doc-topic matrix is sparse, otherwise dense
path2nodenames (str or None, optional (default=None)) – path to file containing metadata (in particular node names). If None, the file is assumed to be in fpath, with name docs_metadata.csv
ref_col (str, optional (default=’corpusid’)) – Name of the column in the metadata file (given by path2nodenames) that contains the doc id.

Returns:

data_out (dict) – Output data dictionary
df_nodes (dataframe) – Dataframe of nodes

SQL

This class provides functionality for managing a generic sqlite or mysql database:

reading specific fields (with the possibility to filter by field values)
storing calculated values in the dataset

Created on May 11 2018

@author: Jerónimo Arenas García

class rdigraphs.datamanager.base_dm_sql.BaseDMsql(db_name, db_connector, path2db=None, db_server=None, db_user=None, db_password=None, db_port=None, unix_socket=None, charset='utf8mb4')

Bases: object

Data manager base class.

__del__(): When destroying the object, it is necessary to commit changes in the database and close the connection

__init__(db_name, db_connector, path2db=None, db_server=None, db_user=None, db_password=None, db_port=None, unix_socket=None, charset='utf8mb4')

Initializes a DataManager object

Parameters:

db_name (str) – Name of the DB
db_connector (str {‘mysql’, ‘sqlite’}) – Connector
path2db (str or None, optional (default=None)) – Path to the project folder (sqlite only)
db_server (str or None, optional (default=None)) – Server (mysql only)
db_user (str or None, optional (default=None)) – User (mysql only)
db_password (str or None, optional (default=None)) – Password (mysql only)
db_port (str or None, optional (default=None)) – Port (mysql via TCP only). Necessary if not 3306.
unix_socket (str or None, optional (default=None)) – Socket for local connectivity. If available, connection is slightly faster than through TCP.
charset (str, optional (default=’utf8mb4’)) – Coding to use by default in the connection

__weakref__: list of weak references to the object (if defined)

addTableColumn(tablename, columnname, columntype)

Add a new column to the specified table.

Parameters:

tablename (str) – Table to which the column will be added
columnname (str) – Name of new column
columntype – Type of new column.

Notes

For mysql, if type is TXT or VARCHAR, the character set is forced to be utf8.
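For the sqlite connector, adding a column boils down to an ALTER TABLE statement. The following is a minimal sketch using the standard sqlite3 module, not the code of BaseDMsql; the helper name add_table_column and the check that skips an already existing column are assumptions.

```python
import sqlite3


def add_table_column(conn, tablename, columnname, columntype):
    """Add a column unless it already exists. Returns True if it was added.

    Table and column names cannot be bound as SQL parameters, so they are
    interpolated directly and must come from a trusted source.
    """
    # PRAGMA table_info yields one row per column; index 1 is the name
    existing = [row[1] for row in conn.execute(f"PRAGMA table_info({tablename})")]
    if columnname in existing:
        return False
    conn.execute(f"ALTER TABLE {tablename} ADD COLUMN {columnname} {columntype}")
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (REFERENCIA TEXT PRIMARY KEY)")

added = add_table_column(conn, "docs", "Lemas", "TEXT")
added_again = add_table_column(conn, "docs", "Lemas", "TEXT")
columns = [row[1] for row in conn.execute("PRAGMA table_info(docs)")]
```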

deleteDBtables(tables=None)

Delete existing database, and regenerate empty tables

Parameters:: tables (str or list or None, optional (default=None)) – If string, name of the table to reset. If list, list of tables to reset. If None, all tables are deleted, and all tables (including those that might not have existed previously) are regenerated.

dropTableColumn(tablename, columnname)

Remove column from the specified table

Parameters:

tablename (str) – Table containing the column to be removed
columnname (str) – Name of column to be removed

exportTable(tablename, fileformat, path, filename, cols=None)

Export columns from a table to a file.

Parameters:

tablename (str) – Name of the table
fileformat (str {‘xlsx’, ‘pkl’}) – Type of output file
path (str) – Route to the output folder
filename (str) – Name of the output file
cols (list or str) – Columns to save. It can be a list or a string of comma-separated columns. If None, all columns are saved.

export_table_to_csv(table_name: str, output_file: Union[str, Path], block_size: Optional[int] = None, max_rows: Optional[int] = None, gzipped: bool = True, callbacks: Optional[List[Callable[[DataFrame], DataFrame]]] = None, select_options: Optional[str] = None, filter_options: Optional[str] = None, order_options: Optional[str] = None)

Exports a table to csv.

Parameters:

table_name (str) – Name of the SQL table.
output_file (str) – Name of the output csv file.
block_size (int, optional) – Table is read blockwise (to avoid running out of memory) in batches of this size.
max_rows (int, optional) – Maximum number of rows to be read. If None, the whole table is read.
gzipped (bool, optional (default=True)) – Whether to write a gzipped csv file (as opposed to plain text).
callbacks (list or None, optional (default=None)) – List of functions to be called on every block read, before actually writing to disk. Each callable receives a DataFrame and returns another one with the same structure.
select_options (str or None, optional (default=None)) – “select” options to be passed to readDBtable
filter_options (str or None, optional (default=None)) – “filter” options to be passed to readDBtable
order_options (str or None, optional (default=None)) – “order” options to be passed to readDBtable
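The blockwise export described above can be sketched with the standard library alone. This is an illustration of the idea, not the method's actual code: the function name export_table_to_csv_sketch is hypothetical, blocks are plain lists of row tuples instead of DataFrames, and the default block size of 1000 is an arbitrary choice.

```python
import csv
import gzip
import os
import sqlite3
import tempfile


def export_table_to_csv_sketch(conn, table_name, output_file, block_size=1000,
                               max_rows=None, gzipped=True, callbacks=None):
    """Write a table to csv block by block. Returns the number of rows read."""
    cursor = conn.execute(f"SELECT * FROM {table_name}")
    header = [col[0] for col in cursor.description]
    opener = gzip.open if gzipped else open
    read = 0
    with opener(output_file, "wt", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        while max_rows is None or read < max_rows:
            # Never fetch more rows than max_rows allows
            size = block_size if max_rows is None else min(block_size, max_rows - read)
            block = cursor.fetchmany(size)
            if not block:
                break
            read += len(block)
            # Every callback transforms the block before it is written
            for callback in callbacks or []:
                block = callback(block)
            writer.writerows(block)
    return read


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER, title TEXT)")
conn.executemany("INSERT INTO docs VALUES (?, ?)",
                 [(i, f"doc {i}") for i in range(5)])


def upper_titles(block):
    """Example callback: upper-case the title of every row in the block."""
    return [(i, title.upper()) for i, title in block]


out = os.path.join(tempfile.mkdtemp(), "docs.csv.gz")
n_read = export_table_to_csv_sketch(conn, "docs", out, block_size=2,
                                    max_rows=3, callbacks=[upper_titles])

with gzip.open(out, "rt", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
```

Only block_size rows are held in memory at any time, which is the reason given above for reading the table in batches.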

getColumnNames(tablename)

Returns a list with the names of all columns in the indicated table

Parameters:: tablename (str) – Table to be read from
Returns:: columnames – Names of all columns in the selected table
Return type:: list

getTableInfo(tablename)

Get information about the given table (size and columns)

Parameters:

tablename (str) – Table to be read from

Returns:

cols (list) – Names of all columns in the table
n_rows (int) – Number of rows in table
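For a sqlite database, the two returned values can be obtained as in the sketch below. It uses the standard sqlite3 module directly; the helper name get_table_info is hypothetical and the queries are an assumption about one reasonable way to do it, not a copy of the library code.

```python
import sqlite3


def get_table_info(conn, tablename):
    """Return (column names, number of rows) for a sqlite table."""
    # Index 1 of every PRAGMA table_info row is the column name
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info({tablename})")]
    n_rows = conn.execute(f"SELECT COUNT(*) FROM {tablename}").fetchone()[0]
    return cols, n_rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (REFERENCIA TEXT, Resumen TEXT)")
conn.executemany("INSERT INTO docs VALUES (?, ?)",
                 [("Ref1", "first abstract"), ("Ref2", "second abstract")])

cols, n_rows = get_table_info(conn, "docs")
```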

getTableNames()

Provides access to table names

Returns:: tbnames – Names of all tables in the database
Return type:: list

insertInTable(tablename, columns, arguments)

Insert new records into table

Parameters:

tablename (str) – Name of table in which the data will be inserted
columns (list) – Name of columns for which data are provided
arguments (list of list of tuples) – A list of lists of tuples, each element associated to one new entry for the table

readDBtable(tablename, limit=None, selectOptions=None, filterOptions=None, orderOptions=None)

Read data from a table in the database. Optionally, only some specific fields are read.

Parameters:

tablename (str) – Table to be read from
limit (int or None, optional (default=None)) – The maximum number of records to retrieve
selectOptions (str or None, optional (default=None)) – string with fields that will be retrieved (e.g. ‘REFERENCIA, Resumen’)
filterOptions (str or None, optional (default=None)) – Filtering options for the SQL query (e.g., ‘WHERE UNESCO_cd=23’)
orderOptions (str or None, optional (default=None)) – Field that will be used for sorting the results of the query (e.g, ‘Cconv’)
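How the optional arguments could combine into a single query is sketched below. The function build_select_query is hypothetical. In particular, the sketch assumes that filterOptions holds only the condition and prepends the WHERE keyword itself; whether the library expects the keyword to be part of the string, as the example in the parameter description suggests, should be checked against the source.

```python
def build_select_query(tablename, limit=None, selectOptions=None,
                       filterOptions=None, orderOptions=None):
    """Compose a SELECT statement from the optional pieces."""
    query = f"SELECT {selectOptions or '*'} FROM {tablename}"
    if filterOptions:
        # Assumption: the condition is given without the WHERE keyword
        query += f" WHERE {filterOptions}"
    if orderOptions:
        query += f" ORDER BY {orderOptions}"
    if limit is not None:
        query += f" LIMIT {int(limit)}"
    return query


query_all = build_select_query("proyectos")
query_some = build_select_query(
    "proyectos",
    limit=10,
    selectOptions="REFERENCIA, Resumen",
    filterOptions="UNESCO_cd=23",
    orderOptions="Cconv",
)
```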

setField(tablename, keyfld, valueflds, values)

Update records of a DB table

Parameters:

tablename (str) – Table that will be modified
keyfld (str) – Name of the column that will be used as key (e.g. ‘REFERENCIA’)
valueflds (list) – Names of the columns that will be updated (e.g., ‘Lemas’)
values (list of tuples) – A list of tuples in the format (keyfldvalue, valuefldvalue) (e.g., [(‘Ref1’, ‘gen celula’), (‘Ref2’, ‘big_data, algorithm’)])
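The tuple format of values maps naturally onto a parameterised UPDATE. The sketch below shows one way to do this with sqlite3; set_field is a hypothetical stand-in, not the method's implementation, and the table contents are made up.

```python
import sqlite3


def set_field(conn, tablename, keyfld, valueflds, values):
    """Update the columns in valueflds for the rows identified by keyfld."""
    assignments = ", ".join(f"{fld}=?" for fld in valueflds)
    query = f"UPDATE {tablename} SET {assignments} WHERE {keyfld}=?"
    # Tuples arrive as (key, value1, ...); the query needs the key last
    reordered = [tuple(row[1:]) + (row[0],) for row in values]
    conn.executemany(query, reordered)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (REFERENCIA TEXT PRIMARY KEY, Lemas TEXT)")
conn.executemany("INSERT INTO docs VALUES (?, ?)",
                 [("Ref1", None), ("Ref2", None)])

set_field(conn, "docs", "REFERENCIA", ["Lemas"],
          [("Ref1", "gen celula"), ("Ref2", "big_data, algorithm")])

rows = conn.execute("SELECT * FROM docs ORDER BY REFERENCIA").fetchall()
```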

upsert(tablename, keyfld, df, robust=True)

Update records of a DB table with the values in df. This function implements the following additional functionality:

If there are columns in df that are not in the SQL table, the columns will be added.
New records will be created in the table if there are rows in the dataframe without an entry already in the table. For this, keyfld indicates which column will be used as an index.

Parameters:

tablename (str) – Table that will be modified
keyfld (str) – Name of the column that will be used as key (e.g. ‘REFERENCIA’)
df (dataframe) – Dataframe that we wish to save in table tablename
robust (bool, optional (default=True)) – If False, verifications are skipped (for a faster execution)
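The three behaviours of upsert (add missing columns, update existing records, insert new ones) can be sketched as follows. This is a simplified illustration with sqlite3, not the library's code: upsert_sketch is a hypothetical name, a list of dicts with identical keys stands in for the dataframe, new columns are created as TEXT, and the robust checks are left out.

```python
import sqlite3


def upsert_sketch(conn, tablename, keyfld, records):
    """records: list of dicts sharing the same keys (one dict per row)."""
    if not records:
        return
    existing_cols = {row[1] for row in
                     conn.execute(f"PRAGMA table_info({tablename})")}
    # 1. Add the columns present in the data but missing from the table
    for col in records[0]:
        if col not in existing_cols:
            conn.execute(f"ALTER TABLE {tablename} ADD COLUMN {col} TEXT")
    known_keys = {row[0] for row in
                  conn.execute(f"SELECT {keyfld} FROM {tablename}")}
    for rec in records:
        value_cols = [c for c in rec if c != keyfld]
        if rec[keyfld] in known_keys:
            # 2. Update the records that already exist
            if value_cols:
                sets = ", ".join(f"{c}=?" for c in value_cols)
                conn.execute(
                    f"UPDATE {tablename} SET {sets} WHERE {keyfld}=?",
                    [rec[c] for c in value_cols] + [rec[keyfld]])
        else:
            # 3. Insert the records whose key is new
            cols = ", ".join(rec)
            marks = ", ".join("?" for _ in rec)
            conn.execute(
                f"INSERT INTO {tablename} ({cols}) VALUES ({marks})",
                list(rec.values()))
            known_keys.add(rec[keyfld])
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (REFERENCIA TEXT PRIMARY KEY, Lemas TEXT)")
conn.execute("INSERT INTO docs VALUES ('Ref1', 'old')")

upsert_sketch(conn, "docs", "REFERENCIA", [
    {"REFERENCIA": "Ref1", "Lemas": "gen celula", "Year": "2018"},
    {"REFERENCIA": "Ref3", "Lemas": "graph", "Year": "2019"},
])

columns = [row[1] for row in conn.execute("PRAGMA table_info(docs)")]
rows = conn.execute("SELECT * FROM docs ORDER BY REFERENCIA").fetchall()
```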

Neo4j

This class provides general functionality for managing a Neo4J database

reading specific fields (with the possibility to filter by field values)
storing calculated values in the dataset

Created on Sep 06 2018

@author: Manu A. Vázquez

class rdigraphs.datamanager.base_dm_neo4j.BaseDMneo4j(db_server: str, db_password: str, db_user: str = 'neo4j')

Bases: object

Base class for interacting with a Neo4j database.

__del__(): Tidy up stuff after before deleting the object.

__init__(db_server: str, db_password: str, db_user: str = 'neo4j') → None

Initializer

Parameters:

db_server (str) – The URL for the server
db_password (str) – User Password
db_user (str, optional) – User login

__weakref__: list of weak references to the object (if defined)

drop_node_property(label: str, property: str) → None

Deletes a property from nodes.

Parameters:

label (str) – The label of the node
property (str) – The name of the property

drop_relationship(relationship_type: str) → None

Deletes a relationship.

Parameters:: relationship_type (str) – The type of the relationship

export_graph(label_nodes, path2nodes, col_ref_nodes, label_edges, path2edges)

Export graph to Neo4J

Parameters:

label_nodes (str or tuple of str) – If str, all nodes are of the same type. If tuple, one type for sources, the other for destinations.
path2nodes (str) – Path to the file of nodes: a csv file with one column for the node names, and possibly other columns for attributes (that may be numeric or str)
col_ref_nodes (str) – Name of the column in the file of nodes containing the node names
label_edges (str) – Type of relationship for all edges.
path2edges (str) – Path to the edges file: a csv file with 3 columns: Source, Target, Weight

get_db_structure() → dict

Returns meta-data.

Returns:: out – Metadata
Return type:: dictionary

make_edges(df: DataFrame, source: Tuple[str, Tuple[str, str]], destination: Tuple[str, Tuple[str, str]], relationship: Tuple[str, Dict[str, str]])

Makes edges between nodes as specified in a dataframe; if the requested nodes don’t exist, they are created

Parameters:

df (pandas Dataframe) – The data
source (tuple of a str and a tuple of two str) – The first element is the name of the column (within this DataFrame) specifying the source node; the second is a tuple whose first element is the label in the graph and whose second is the property (in the nodes with the aforementioned label) that must match the values in the column.
destination (tuple of a str and a tuple of two str) – The first element is the name of the column (within this DataFrame) specifying the destination node; the second is a tuple whose first element is the label in the graph and whose second is the property (in the nodes with the aforementioned label) that must match the values in the column.
relationship (tuple of a str and a dict) – The first element is the type of the relationship to be created between nodes; the second is a dictionary that maps columns in this DataFrame to properties of the relationship.
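The nested tuples are easier to read with a concrete case. The sketch below unpacks the three arguments and builds a Cypher statement of the kind that would create the nodes and the edge for one row. The function edge_statement, the column names and the use of a $row parameter are assumptions for illustration; the statement that the library actually sends to Neo4j may differ.

```python
def edge_statement(source, destination, relationship):
    """Build one parameterised Cypher statement for a row of the dataframe."""
    src_col, (src_label, src_prop) = source
    dst_col, (dst_label, dst_prop) = destination
    rel_type, rel_props = relationship
    # MERGE creates the nodes and the relationship only if they are missing
    statement = (
        f"MERGE (a:{src_label} {{{src_prop}: $row.{src_col}}}) "
        f"MERGE (b:{dst_label} {{{dst_prop}: $row.{dst_col}}}) "
        f"MERGE (a)-[r:{rel_type}]->(b)"
    )
    if rel_props:
        # The dictionary maps dataframe columns to relationship properties
        sets = ", ".join(f"r.{prop} = $row.{col}"
                         for col, prop in rel_props.items())
        statement += f" SET {sets}"
    return statement


# Hypothetical dataframe columns: "author", "paper" and "n_papers"
statement = edge_statement(
    source=("author", ("Author", "name")),
    destination=("paper", ("Paper", "eid")),
    relationship=("WROTE", {"n_papers": "weight"}),
)
```

Here the column author is matched against the property name of the nodes labelled Author, the column paper against the property eid of the nodes labelled Paper, and the column n_papers is stored as the property weight of the WROTE relationship.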

properties_of_label(label: str) → List[str]

Returns all the properties (across all the nodes) of a given label.

Parameters:: label (str) – The label (type)
Returns:: out – A list with the properties
Return type:: list

properties_of_relationship(relationship_type: str) → List[str]

Returns all the properties of a given relationship.

Parameters:: relationship_type (str) – The type of the relationship
Returns:: out – A list with the properties
Return type:: list

read_edges(relationship_type: str, limit: Optional[int] = None) → Optional[DataFrame]

Reads edges from the database.

Parameters:

relationship_type (str) – Type of the relationship
limit (int) – Maximum number of edges

Returns:

out – Every row contains information about a single edge

Return type:

Pandas dataframe

read_nodes(label: str, limit: Optional[int] = None, select_options=None, filter_options=None, order_options=None) → DataFrame

Reads nodes from the database.

Parameters:

label (str) – Label of the nodes
limit (int) – Maximum number of nodes
select_options (unused)
filter_options (unused)
order_options (unused)

Returns:

out – Every row contains information about a node

Return type:

Pandas dataframe

reset_database() → None: Reset the database, deleting everything.

write_dataframe(df: DataFrame, source: Tuple[str, List[str]], destination: Tuple[str, List[str]], edge: Tuple[str, List[str]]) → None

Writes a dataframe in the database

Parameters:

df (pandas Dataframe) – The data
source (tuple of a str and a list of str) – The first element in the tuple is the label for the source node, and the second one is a list with the columns of the dataframe that will become the properties of the node
destination (tuple of a str and a list of str) – The first element in the tuple is the label for the destination node, and the second one is a list with the columns of the dataframe that will become the properties of the node
edge (tuple of a str and a list of str) – The first element in the tuple is the label for the edge, and the second one is a list with the columns of the dataframe that will become the properties of the edge

Cypher

class rdigraphs.datamanager.cypher.DataLoadingStatement(csv_files_path: Path, csv_line: str = 'line', data_access: str = 'file://')

Bases: Statement

Class to load data from csv files into Neo4j.

__init__(csv_files_path: Path, csv_line: str = 'line', data_access: str = 'file://') → None

Initializer.

Parameters:

csv_files_path (pathlib’s Path) – Path to the csv files
csv_line (str, optional) – Name of a line from the file when referred to in another statement

flatten_dic(d, sep=':', property_name_prefix='')

Takes a dictionary specifying SQL to neo4j conversions and returns a string amenable to be used in cypher.

Parameters:

d (dict) – A dictionary mapping properties to values.
sep (str) – Separator between the property and the value
property_name_prefix (str) – A string to prepend to every property name.

Returns:

out – A piece of cypher statement specifying a mapping between properties and values.

Return type:

str

load_clause(parameters, using_periodic_commit=True) → str

Returns a clause to load data from a csv file.

Parameters:

parameters (dictionary) – Settings
using_periodic_commit (bool) – Whether to use the “periodic commit” operation mode

Returns:

statement – Neo4j statement

Return type:

str
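As an illustration of what such a clause may look like, the sketch below assembles a LOAD CSV clause from the constructor arguments. It is an assumption, not the output of load_clause: the function name, the file_name argument and the WITH HEADERS option are hypothetical, and the USING PERIODIC COMMIT prefix is only accepted by older Neo4j releases.

```python
from pathlib import PurePosixPath


def load_clause_sketch(csv_files_path, file_name, csv_line="line",
                       data_access="file://", using_periodic_commit=True):
    """Assemble a Cypher clause that reads a csv file line by line."""
    url = f"{data_access}{PurePosixPath(csv_files_path) / file_name}"
    clause = f"LOAD CSV WITH HEADERS FROM '{url}' AS {csv_line}"
    if using_periodic_commit:
        # Commits in batches so that large files do not exhaust memory
        clause = "USING PERIODIC COMMIT " + clause
    return clause


clause = load_clause_sketch("/import", "authors.csv")
plain = load_clause_sketch("/import", "authors.csv",
                           using_periodic_commit=False)
```

The value of csv_line is the name by which later clauses refer to the current line of the file.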

match_clause(labels, matching_columns, matching_properties, nodes=['origin', 'destination']) → str

Returns a clause to match data read from the file with that in the database.

Parameters:

labels (list) – Labels of the nodes to be matched
matching_columns (list) – Columns in the csv file
matching_properties (list) – Properties of the nodes to be matched
nodes (list) – Names of the nodes to be matched when referred to in another statement

Returns:

out – Neo4j statement

Return type:

str

class rdigraphs.datamanager.cypher.LoadAttributesTable(csv_files_path: Path, csv_line: str = 'line', data_access: str = 'file://')

Bases: DataLoadingStatement

Class to load data from a csv file representing an “attributes” table.

assemble(parameters: dict) → Union[str, List[str]]

Returns a statement to load data from a csv file representing an “attributes” table.

Parameters:: parameters (dictionary) – Settings
Returns:: out – Neo4j statement
Return type:: str

class rdigraphs.datamanager.cypher.LoadJoinTable(csv_files_path: Path, csv_line: str = 'line', data_access: str = 'file://')

Bases: DataLoadingStatement

Class to load data from a csv file representing a “join” table.

assemble(parameters: dict) → Union[str, List[str]]

Returns a statement to load data from a csv file representing a “join” table.

Parameters:: parameters (dictionary) – Settings
Returns:: out – Neo4j statement
Return type:: str

class rdigraphs.datamanager.cypher.LoadTable(csv_files_path: Path, csv_line: str = 'line', data_access: str = 'file://')

Bases: DataLoadingStatement

Class to load a table from a csv file.

assemble(parameters: dict) → Union[str, List[str]]

Returns a statement to load data from a csv file.

Parameters:: parameters (dictionary) – Settings
Returns:: out – Neo4j statement
Return type:: str

class rdigraphs.datamanager.cypher.MakeRelationship(csv_files_path: Path, csv_line: str = 'line', data_access: str = 'file://')

Bases: DataLoadingStatement

Class to load data from a csv file representing a “relationship” table.

assemble(parameters: dict) → Union[str, List[str]]

Returns a statement to load data from a csv file representing a “relationship” table.

Parameters:: parameters (dictionary) – Settings
Returns:: out – Neo4j statement
Return type:: str

class rdigraphs.datamanager.cypher.Statement

Bases: object

Base class for a Neo4j statement.

__weakref__: list of weak references to the object (if defined)

class rdigraphs.datamanager.cypher.UpdateTable(csv_files_path: Path, csv_line: str = 'line', data_access: str = 'file://')

Bases: DataLoadingStatement

Class to update an existing table using data from a csv file.

assemble(parameters: dict) → Union[str, List[str]]

Returns a statement to update an existing table using data from a csv file.

Parameters:: parameters (dictionary) – Settings
Returns:: out – Neo4j statement
Return type:: str

rdigraphs.datamanager.cypher.assert_uniqueness_clause(property: str, node: str = 'node') → str

Returns the part of a statement that ensures a property of a node is unique.

Parameters:

property (str) – Name of the meant-to-be-unique property
node (str, optional) – Name of the node (coming from other statement)

Returns:

out – Neo4j statement

Return type:

str

rdigraphs.datamanager.cypher.create_constraint_clause(label: str, node: str = 'node') → str

Returns the part of a statement that creates a constraint.

Parameters:

label (str) – The label (node type) on which to create the index
node (str, optional) – Name of the node when referred to in another statement

Returns:

out – Neo4j statement

Return type:

str

rdigraphs.datamanager.cypher.create_indexes(label: str, properties: List[str]) → Union[str, List[str]]

Returns a statement that creates indexes on a given label.

Parameters:

label (str) – The label (node type) on which to create the index
properties (list) – A list of properties to be indexed on the given label

Returns:

out – Neo4j statement

Return type:

str

rdigraphs.datamanager.cypher.create_or_merge_node(label: str, properties: dict = {}, name: str = 'A', merge: bool = False) → str

Returns a statement that creates or merges a new node.

Parameters:

label (str) – The label of the node
properties (dictionary) – The properties of the node
name (str, optional) – Name of the node when referred to in another statement
merge (boolean) – Whether a “merge” statement rather than a “create” one is to be returned

Returns:

out – Neo4j statement

Return type:

str
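The role of the merge flag and of name can be seen in a minimal sketch of a statement builder. The function below is a hypothetical stand-in written for illustration; the exact text produced by create_or_merge_node, in particular how property values are quoted, is not documented here and may differ.

```python
import json


def create_or_merge_node_sketch(label, properties=None, name="A", merge=False):
    """Return a CREATE or MERGE statement for a single node."""
    keyword = "MERGE" if merge else "CREATE"
    # json.dumps gives double-quoted strings and bare numbers,
    # both of which are valid Cypher literals
    props = ", ".join(f"{key}: {json.dumps(value)}"
                      for key, value in (properties or {}).items())
    body = f"{name}:{label}" + (f" {{{props}}}" if props else "")
    return f"{keyword} ({body})"


created = create_or_merge_node_sketch("Paper")
merged = create_or_merge_node_sketch(
    "Author", {"name": "Ada", "h_index": 12}, merge=True)
```

CREATE always adds a new node, while MERGE reuses a node that already matches the label and properties, which is why the flag matters when the same data may be loaded twice.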

rdigraphs.datamanager.cypher.create_or_merge_relationship(origin: str, destination: str, rel_type: str, properties: dict = {}, name: str = 'rel', merge: bool = False) → str

Returns a statement that creates or merges a new relationship.

Parameters:

origin (str) – Name (identifier/variable) of the origin node
destination (str) – Name (identifier/variable) of the destination node
rel_type (str) – Relationship type
properties (dictionary) – Properties to be added to the relationship
name (str, optional) – Name of the relationship when referred to in another statement
merge (boolean) – Whether a “merge” statement rather than a “create” one is to be returned

Returns:

out – Neo4j statement

Return type:

str

rdigraphs.datamanager.cypher.create_unique_constraint(label: str, property: str) → str

Returns a statement that creates a unique constraint.

Parameters:

label (str) – The label of the node
property (str) – The property of the node

Returns:

out – Neo4j statement

Return type:

str

rdigraphs.datamanager.cypher.dic_to_properties(d: dict)

Takes a dictionary and returns a string amenable to be used in cypher.

Parameters:: d (dict) – A dictionary mapping properties to values.
Returns:: out – A piece of cypher statement specifying properties and values.
Return type:: str

rdigraphs.datamanager.cypher.drop_null_valued_keys(d: dict)

Convenience function to get rid of null (None) values in a dictionary.

Parameters:: d (dict) – Any dictionary
Returns:: out – The input dictionary without null values
Return type:: dict
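The two helpers above are typically used together: null values are dropped first, and the remaining entries are rendered as a Cypher property map. The sketch below reproduces that idea with hypothetical functions; the formatting rules of the real dic_to_properties (quoting, separators) are assumptions.

```python
import json


def drop_null_valued_keys_sketch(d):
    """Return a copy of the dictionary without the entries whose value is None."""
    return {key: value for key, value in d.items() if value is not None}


def dic_to_properties_sketch(d):
    """Render a dictionary as a Cypher property map."""
    inner = ", ".join(f"{key}: {json.dumps(value)}" for key, value in d.items())
    return "{" + inner + "}"


raw = {"title": "Graphs", "year": 2018, "doi": None}
cleaned = drop_null_valued_keys_sketch(raw)
properties = dic_to_properties_sketch(cleaned)
```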

rdigraphs.datamanager.cypher.get_metadata() → str

Returns a statement to get the “schema” of the database.

Returns:: out – Neo4j statement
Return type:: str

rdigraphs.datamanager.cypher.labels() → str

Returns a statement to get all the labels (node types) in the database.

Returns:: out – Neo4j statement
Return type:: str

rdigraphs.datamanager.cypher.make_relationship(origin_label: str = '', destination_label: str = '', relationship_type: str = '', origin_properties: dict = {}, relationship_properties: dict = {}, destination_properties: dict = {})

Returns a statement to make a relationship while merging properties in the nodes involved.

Parameters:

origin_label (str) – Label of the origin node
destination_label (str) – Label of the destination node
relationship_type (str) – Type of the relationship
origin_properties (dictionary) – Properties of the origin node
relationship_properties (dictionary) – Properties of the relationship
destination_properties (dictionary) – Properties of the destination node

Returns:

out – Neo4j statement

Return type:

str

rdigraphs.datamanager.cypher.match_label(label: str, limit: Optional[int] = None, dummy: str = 'node') → str

Returns a statement that matches and returns a label.

Parameters:

label (str) – The label of the node
limit (int, optional) – The maximum number of matches
dummy (str, optional) – Name of the node when referred to in another statement

Returns:

out – Neo4j statement

Return type:

str

rdigraphs.datamanager.cypher.match_relationship(relationship: str, limit: Optional[int] = None, surrogate_relationship: str = 'rel', surrogate_path: str = 'path') → str

Returns a statement that matches relationships.

Parameters:

relationship (str) – The (type of the) relationship
limit (int, optional) – The maximum number of matches
surrogate_relationship (str, optional) – Name of the relationship when referred to in another statement
surrogate_path (str, optional) – Name of the path when referred to in another statement

Returns:

out – Neo4j statement

Return type:

str

rdigraphs.datamanager.cypher.merge_relationship_with_properties_clause(origin: str = 'origin', destination: str = 'destination', origin_label: str = '', destination_label: str = '', relationship_type: str = '', origin_properties: dict = {}, relationship_properties: dict = {}, destination_properties: dict = {}, relationship: str = 'rel', arrowhead: str = '>')

Returns a clause that merges a relationship specifying properties for each entity.

Parameters:

origin (str) – Name (identifier/variable) of the origin node
destination (str) – Name (identifier/variable) of the destination node
origin_label (str) – Label of the origin node
destination_label (str) – Label of the destination node
relationship_type (str) – Type of the relationship
origin_properties (dictionary) – Properties of the origin node
relationship_properties (dictionary) – Properties of the relationship
destination_properties (dictionary) – Properties of the destination node
relationship (str, optional) – Name of the relationship when referred to in another statement
arrowhead (string) – Suffix controlling whether the relationship is (bi)directional or not

Returns:

out – Neo4j statement

Return type:

str

rdigraphs.datamanager.cypher.relationship_types() → str

Returns a statement to get all the relationship types in the database.

Returns:: out – Neo4j statement
Return type:: str

rdigraphs.datamanager.cypher.remove_property(label: str, property: str, surrogate_node: str = 'node') → str

Returns a statement that removes a property from a node.

Parameters:

label (str) – The label of the node
property (str) – The name of the property
surrogate_node (str, optional) – Name of the node when referred to in another statement

Returns:

out – Neo4j statement

Return type:

str

rdigraphs.datamanager.cypher.remove_relationship(relationship_type: str, var: str = 'r') → str

Returns a statement that removes a relationship.

Parameters:

relationship_type (str) – The type of the relationship
var (str, optional) – Name of the relationship when referred to in another statement

Returns:

out – Neo4j statement

Return type:

str

rdigraphs.datamanager.cypher.reset_database() → List[str]

Returns a statement that resets the database, i.e., deletes everything.

Returns:: out – Neo4j statement
Return type:: str

Migration Callbacks

class rdigraphs.datamanager.callbacks.AuthorsDisambiguator(disambiguation_map: List[str], new_id: str)

Bases: Disambiguator

Class containing callback methods needed for disambiguating authors.

patstats_person(df: DataFrame)

Callback function for “person” table in “patstats” database.

Parameters:: df (Pandas dataframe) – Input data
Returns:: df – Output data
Return type:: Pandas dataframe

patstats_person_application(df: DataFrame)

Callback function for “application” table in “patstats” database.

Parameters:: df (Pandas dataframe) – Input data
Returns:: df – Output data
Return type:: Pandas dataframe

projects_researcher_project(df: DataFrame)

Callback function for “researcher_project” table in “projects” database.

Parameters:: df (Pandas dataframe) – Input data
Returns:: df – Output data
Return type:: Pandas dataframe

projects_researchers(df: DataFrame)

Callback function for “researchers” table in “projects” database.

Parameters:: df (Pandas dataframe) – Input data
Returns:: df – Output data
Return type:: Pandas dataframe

scopus_authorship(df: DataFrame)

Callback function for “authorship” table in “scopus” database.

Parameters:: df (Pandas dataframe) – Input data
Returns:: df – Output data
Return type:: Pandas dataframe

class rdigraphs.datamanager.callbacks.Disambiguator(disambiguation_map: List[str], new_id: str)

Bases: object

Class to process disambiguation data.

__init__(disambiguation_map: List[str], new_id: str) → None

Initializer.

Parameters:

disambiguation_map (list) – Path to the disambiguation map
new_id (str) – The name of the new field/column/neo4j property to be created for storing the final disambiguated id

__weakref__: list of weak references to the object (if defined)

static handle_duplicates(df)

Get rid of duplicated (disambiguated) ids.

Parameters:: df (Pandas dataframe) – Data with duplicates
Returns:: unmapped – Input data with the only row with the value “left_only” in column source, or the 1st row of the input dataframe if there are several.
Return type:: Pandas dataframe

merge(mapping: DataFrame, df: DataFrame, field: str, prefix: str)

Replace within the passed DataFrame the values of field that are present in the disambiguation map, while at the same time ensuring that the new disambiguated values are unique.

Parameters:

mapping (Pandas dataframe) – Disambiguation map
df (Pandas dataframe) – Data to be disambiguated
field (str) – Column to be used in df
prefix (str) – Prefix added to build a “new” identifier from the existing one when the value is not present in the disambiguation map

Returns:

merge – Input data with the disambiguation map applied.

Return type:

Pandas dataframe

replace(mapping: DataFrame, df: DataFrame, field: str, prefix: str)

Replace within the passed DataFrame the values of field that are present in the disambiguation map.

Parameters:

mapping (Pandas dataframe) – Disambiguation map
df (Pandas dataframe) – Data to be disambiguated
field (str) – Column to be used in df
prefix (str) – Prefix added to build a “new” identifier from the existing one when the value is not present in the disambiguation map

Returns:

merge – Input data with the disambiguation map applied.

Return type:

Pandas dataframe
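One reading of the parameter descriptions above is sketched below: identifiers found in the disambiguation map are replaced by their disambiguated value, and the rest receive a new identifier made of the prefix plus the original value. The function replace_sketch is hypothetical, uses a dict and a list of dicts instead of DataFrames, and stores the result in a separate column; the actual method may behave differently.

```python
def replace_sketch(mapping, records, field, prefix, new_id="disambiguated_id"):
    """Return copies of the records with a disambiguated id added."""
    out = []
    for rec in records:
        old = rec[field]
        # Known ids take the mapped value, unknown ids get a prefixed one
        new = mapping.get(old, f"{prefix}{old}")
        out.append({**rec, new_id: new})
    return out


# Hypothetical map: two author ids that refer to the same person
mapping = {"a1": "person_7", "a2": "person_7"}
records = [{"author_id": "a1"}, {"author_id": "a2"}, {"author_id": "a9"}]

result = replace_sketch(mapping, records, "author_id", prefix="scopus_")
new_ids = [rec["disambiguated_id"] for rec in result]
```

The prefix keeps the identifiers built from unmapped values from colliding with the identifiers that come from the disambiguation map.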

class rdigraphs.datamanager.callbacks.OrganizationsDisambiguator(disambiguation_map: List[str], new_id: str)

Bases: AuthorsDisambiguator

Class containing callback methods needed for disambiguating organizations.

projects_organizations(df: DataFrame)

Callback function for “organizations” table in “projects” database.

Parameters:: df (Pandas dataframe) – Input data
Returns:: df – Output data
Return type:: Pandas dataframe

projects_projects(df: DataFrame)

Callback function for “projects” table in “projects” database.

Parameters:: df (Pandas dataframe) – Input data
Returns:: df – Output data
Return type:: Pandas dataframe

scopus_authorship(df: DataFrame)

Callback function for “authorship” table in “scopus” database.

Parameters:: df (Pandas dataframe) – Input data
Returns:: df – Output data
Return type:: Pandas dataframe

rdigraphs.datamanager.callbacks.initialize(parameters: dict) → None

Initializes this module.

Parameters:: parameters (dictionary) – Settings

rdigraphs.datamanager.callbacks.patents_literature(df: DataFrame)

Callback function for relating patents literature with itself in “patents” database.

Parameters:: df (Pandas dataframe) – Input data
Returns:: df – Output data
Return type:: Pandas dataframe

rdigraphs.datamanager.callbacks.patents_non_literature(df: DataFrame)

Callback function for relating patents literature with non-patents literature in “patents” database.

Parameters:: df (Pandas dataframe) – Input data
Returns:: df – Output data
Return type:: Pandas dataframe

rdigraphs.datamanager.callbacks.patents_person(df: DataFrame)

Callback function for “person” table in “patents” database.

Parameters:: df (Pandas dataframe) – Input data
Returns:: df – Output data
Return type:: Pandas dataframe

rdigraphs.datamanager.callbacks.projects_researchers(df: DataFrame, name_col='NOMBRE')

Callback function for “researchers” table in “projects” database.

Parameters:

df (Pandas dataframe) – Input data
name_col (str, optional) – Name of the column with the researchers’ names

Returns:

df – Output data

Return type:

Pandas dataframe

rdigraphs.datamanager.callbacks.publications_authorship(df: DataFrame)

Callback function for “authorship” table in “publications” database.

Parameters:: df (Pandas dataframe) – Input data
Returns:: df – Output data
Return type:: Pandas dataframe

Util

rdigraphs.datamanager.util.simple_http_server(host: str = '0.0.0.0', port: int = 4001, path: str = '.')

Sets up an HTTP server for files.

Parameters:

host (str) – The IP where the host will “listen”
port (int) – The port where the host will “listen”
path (str) – The path to be served

Returns:

start (function) – It starts the server
stop (function) – It stops the server
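A file server that returns a pair of start and stop functions can be built from the standard library, as in the sketch below. It is an assumption about the general approach, not the code of simple_http_server: the name simple_http_server_sketch is hypothetical, and the example binds to 127.0.0.1 on a free port chosen by the operating system rather than to the documented defaults.

```python
import functools
import os
import tempfile
import threading
import urllib.request
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def simple_http_server_sketch(host="127.0.0.1", port=0, path="."):
    """Return (start, stop) functions for a static file server."""
    handler = functools.partial(SimpleHTTPRequestHandler, directory=path)
    server = ThreadingHTTPServer((host, port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)

    def start():
        thread.start()
        return server.server_address  # (host, port) actually bound

    def stop():
        server.shutdown()       # ends serve_forever()
        server.server_close()   # releases the socket
        thread.join()

    return start, stop


# Serve a temporary folder that contains one small csv file
folder = tempfile.mkdtemp()
with open(os.path.join(folder, "nodes.csv"), "w", newline="") as f:
    f.write("id\n1\n")

start, stop = simple_http_server_sketch(path=folder)
bound_host, bound_port = start()

# An empty ProxyHandler makes sure the request goes straight to localhost
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(f"http://{bound_host}:{bound_port}/nodes.csv") as resp:
    body = resp.read()

stop()
```

Running the server in a background thread is what allows start to return immediately while stop can later shut it down from the calling code.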