Similarity graphs

rdigraphs.sim_graph.sim_graph.JSdist(p, q)

Compute the Jensen-Shannon distance between probability distributions p and q. It assumes that both p and q are normalized and sum up to 1

Parameters

p (numpy array) – Probability distribution
q (numpy array) – Probability distribution (with the same size as p)

Returns

d – JS distance

Return type

float

rdigraphs.sim_graph.sim_graph.JSdist_sp(p, q)

Compute the Jensen-Shannon distance between probability distributions p and q. It assumes that both p and q are normalized and sum up to 1 p and q can be sparse vectors (this is the main difference wrt JSdist())

Parameters

p (numpy array or sparse vector) – Probability distribution
q (numpy array or sparse vector) – Probability distribution (with the same size as p)

Returns

d – JS distance

Return type

float

class rdigraphs.sim_graph.sim_graph.SimGraph(X, blocksize=25000, useGPU=False, tmp_folder=None, save_every=1e+300)

Bases: ThOps

Generic class to generate similarity graphs from data

JS2_affinity(X, Y=None, R2=10, mapping='linear', g=1)

Compute all truncated affinities, based on the squared Jensen-Shannon divergence, between all pairs of connected nodes in the graph based on the node attribute vectors

It assumes that all attribute vectors are normalized to sum up to 1 Attribute matrices in X and Y can be sparse

Parameters

X (numpy array) – Input matrix of probabilistic attribute vectors
Y (numpy array or None, optional (default=None)) – Input matrix of probabilistic attribute vectors. If None, it is assumed Y=X
R2 (float, optional (default=2)) – Radius (maximum squared JS distance). Edges at higher distance are removed). The default is a large value (larger than the maximum possible squared distance between two probability distributions), which implies no edge filtering
mapping (str in {‘linear’, ‘polynomial’, ‘exponential’}) – Type of mapping from distances to similarities.
g (float) – Exponent for the final affinity mapping

Returns

edge_id (list of tuples) – List of edges
weights (list) – List of edge weights

Returns

edge_id (list of tuples) – List of edges
weights (list) – List of edge weights

__init__(X, blocksize=25000, useGPU=False, tmp_folder=None, save_every=1e+300)

Stores the main attributes of a class instance

Parameters

X (scipy.sparse.csr or numpy.array) – Matrix of node attribute vectors
blocksize (int, optional (default=25_000)) – Size (number of rows) of each block in blocwise processing.
useGPU (bool, optional (default=False)) – If True, matrix operations are accelerated using GPU
tmp_folder (str or None, optional (defautl = None)) – Name of the folder to save temporary files
save_every (int, optional (default=0)) – Maximum size of the growing lists. The output lists are constructed incrementally. To avooy memory overload, growing lists are saved every time they reach this size limit. The full liests are thus incrementally saved in files. The default value is extremely large, which de facto implies no temporary saving.

cluster_equivalent_nodes(reduceX=False)

Computes a graph where each node is formed by all nodes at zero distance

Parameters: reduceX (boolean) – If True, it computes self.Xeq, a data matrix without rows at zero distance

computeXeq()

Computes the reduced feature matrix X, with a single row per each equivalent class.

self.X[i] contains the feature vector of all nodes from equivalence class i.

compute_id_graph(R=1e-100, verbose=True)

Computes identity graph.

The identity graph connects nodes a and b with weight 1 iff a and b have the same feature vector in self.T. Otherwise, a and be are disconnected

Parameters

R (float) – Radius. Edges link all data pairs at distance lower than R. It should be a very small value in order to link nodes with almost equal attribute. Nonzero values may be used to allow slight deviations from equality.
verbose (boolean) – If False, block-by-block messaging is omitted in the call to he_neighbors graph.

Returns

self – Updated attribute self.edge_ids (list of edges, as pairs (i, j) of indices)

Return type

object

connectivity_graph(R=None, metric='JS', verbose=True)

Computes a sparse connectivity graph for the self graph structure. The self graph must contain matrix self.X

Parameters

R (float) – Radius. Edges link all data pairs at distance lower than R This is to forze a sparse graph.
metric (string) – Similarity measure used to compute affinity matrix Available options are:

‘JS’, 1 minus Jensen-Shannon (JS) divergence (too slow);

‘l1’, 1 minus l1 distance

‘l2’, Euclidean distance

‘He’, 1-squared Hellinger’s distance (sklearn-based implementation)

‘He2’, 1 minus squared Hellinger distance (self implementation)

‘cosine’, cosine distance
verbose (boolean, optional (default=True)) – (Only for he_neighbors_graph()). If False, block-by-block messaging is omitted

Returns

self – Changes in attributes self.edge_ids (List of edges, as pairs (i, j) of indices) and self.weights (list of affinity values for each pair in edge_ids)

Return type

object

div2sim(div, mapping='linear', g=1, B=None)

Transforms a list of divergence scores (tipically, distances or squared distances) into a list of similarity values

The model considers three types of mappings:

Linear: s = (1 - div / R)
Polynomial: s = 1 - (div / R)**g
Exponential: s = exp(-g div)

Note that the linear mapping is equivalent to polynomial with g=1. Both linear and polynomial mappings are defined for founded distances in [0, div]. Otherwise, the similarity values would be negative.

Parameters

div (float or list of float) – A a divergence or a list of divergence values
mapping (str in {‘linear’, ‘polynomial’, ‘exponential’}) – Type of mapping from distances to similarities.
g (int or float, optional (default=1)) – Power factor. Use g != 1 to apply nonlinear mapping
B (float or None, optional (default=None)) – (only for linear or polynomial mappings). Radius bound. For values in d2 higher than B, the output similarity will be negative.

Returns

s – A list of similarity values with the same size than d

Return type

list

he_affinity(X, R2=10, mapping='linear', g=1)

Compute all Hellinger’s affinities between all nodes in the graph based on the node attribute vectors in matrix

It assumes that all attribute vectors are normalized to sum up to 1 Attribute matrix X can be sparse

Parameters

X (numpy array) – Input matrix of probabilistic attribute vectors
R2 (float, optional (default=2)) – Squared radius (maximum squared He distance). Edges at higher distance are removed). The default is a large value (larger than any possible Hellinger distance between two probability distributions), which implies no edge filtering
mapping (str in {‘linear’, ‘polynomial’, ‘exponential’}) – Type of mapping from distances to similarities.
g (float) – Exponent for the final affinity mapping

Returns

edge_id (list of tuples) – List of edges
weights (list) – List of edge weights

l1_affinity(X, R=10, mapping='linear', g=1)

Compute all l1’s affinities between all nodes in the graph based on the node attribute vectors

It assumes that all attribute vectors are normalized to sum up to 1 Attribute matrix X can be sparse

Parameters

X (numpy array) – Input matrix of probabilistic attribute vectors
R (float, optional (default=2)) – Radius (maximum l1 distance). Edges at higher distance are removed). The default is a large value (larger than the maximum possible l1 distance between two probability distributions), which implies no edge filtering
mapping (str in {‘linear’, ‘polynomial’, ‘exponential’}) – Type of mapping from distances to similarities.
g (float) – Exponent for the final affinity mapping

Returns

edge_id (list of tuples) – List of edges
weights (list) – List of edge weights

l2_affinity(X, R2=10, mapping='exponential', g=1)

Compute all l2’s affinities between all nodes in the graph based on the node attribute vectors

It assumes that all attribute vectors are normalized to sum up to 1 Attribute matrix X can be sparse

Parameters

X (numpy array) – Input matrix of probabilistic attribute vectors
R2 (float, optional (default=2)) – Radius (maximum squared l2 distance). Edges at higher distance are removed). The default is a large value (larger than the maximum possible squared distance between two probability distributions), which implies no edge filtering
mapping (str in {‘linear’, ‘polynomial’, ‘exponential’}) – Type of mapping from distances to similarities.
g (float) – Exponent for the final affinity mapping

Returns

edge_id (list of tuples) – List of edges
weights (list) – List of edge weights

show_JS_bounds(s_min, sim, g=1, out_path=None, verbose=True)

Computes JS bounds for a given similarity measure.

Parameters

s_min (float) – Radius. Edges link all data pairs at distance lower than R This is to forze a sparse graph.
sim (string) – Similarity measure used to compute affinity matrix Available options are:

‘l1->JS’, same as JS, but the graph is computed after preselecting edges using l1 distances and a theoretical bound

‘He->JS’, same as JS, but the graph is computed after preselecting edges using Hellinger’s distances and a theoretical bound

‘He2->JS’, same as He-Js, but using the self implementation of He.
g (float) – Exponent for the affinity mapping
verbose (boolean) – (Only for he_neighbors_graph()). If False, block-by-block messaging is omitted

Returns

self – Modifies self.edge_ids (ist of edges, as pairs (i, j) of indices) and self.weights (list of affinity values for each pair in edge_ids)

Return type

object

sim2div(s, mapping='linear', g=1, B=None)

Transforms a list of similarity values into a list of divergences (tipically, distances or squared distances)

This is the inverse of method div2sim()

Parameters

s (list) – A list of similarity values
g (int or float, optional (default=1)) – Power factor. Use g != 1 to apply nonlinear mapping
B (float or None, optional (default=None)) – (only for linear or polynomial mappings). Radius bound. For values in div higher than B, the output similarity will be negative.

Returns

div – A list of divergences

Return type

list

simTest(D, R, sim=None, g=1, fpath=None, label='l1')

Plot the values in weights vs the values in D selected by the indices in edge_id.

This is used to visualize the efect of selecting samples using one measure (l1, l2) as a first step to reduce the sample set used to elect edges based on the Jensen-Shannon divergence.

Parameters

D (2D array) – Data matrix
R (float) – Radius bound
sim (string) – Name of the similarity function
g (float) – Exponent value
fpath (string) – Output path
label (string) – Label for the figure plot

sim_graph(s_min=None, n_edges=None, **kwargs)

Computes a sparse graph for a given radius or for a given number of edges

Parameters

s_min (float or None, optional (default=None)) – Similarity threshold. Edges link all data pairs with similarity higher than R. This forzes a sparse graph.
n_edges (int) – Number of edges