Similarity graphs
- rdigraphs.sim_graph.sim_graph.JSdist(p, q)
Compute the Jensen-Shannon distance between probability distributions p and q. It assumes that both p and q are normalized and sum up to 1
- Parameters
p (numpy array) – Probability distribution
q (numpy array) – Probability distribution (with the same size as p)
- Returns
d – JS distance
- Return type
float
- rdigraphs.sim_graph.sim_graph.JSdist_sp(p, q)
Compute the Jensen-Shannon distance between probability distributions p and q. It assumes that both p and q are normalized and sum up to 1 p and q can be sparse vectors (this is the main difference wrt JSdist())
- Parameters
p (numpy array or sparse vector) – Probability distribution
q (numpy array or sparse vector) – Probability distribution (with the same size as p)
- Returns
d – JS distance
- Return type
float
- class rdigraphs.sim_graph.sim_graph.SimGraph(X, blocksize=25000, useGPU=False, tmp_folder=None, save_every=1e+300)
Bases:
ThOps
Generic class to generate similarity graphs from data
- JS2_affinity(X, Y=None, R2=10, mapping='linear', g=1)
Compute all truncated affinities, based on the squared Jensen-Shannon divergence, between all pairs of connected nodes in the graph based on the node attribute vectors
It assumes that all attribute vectors are normalized to sum up to 1 Attribute matrices in X and Y can be sparse
- Parameters
X (numpy array) – Input matrix of probabilistic attribute vectors
Y (numpy array or None, optional (default=None)) – Input matrix of probabilistic attribute vectors. If None, it is assumed Y=X
R2 (float, optional (default=2)) – Radius (maximum squared JS distance). Edges at higher distance are removed). The default is a large value (larger than the maximum possible squared distance between two probability distributions), which implies no edge filtering
mapping (str in {‘linear’, ‘polynomial’, ‘exponential’}) – Type of mapping from distances to similarities.
g (float) – Exponent for the final affinity mapping
- Returns
edge_id (list of tuples) – List of edges
weights (list) – List of edge weights
- Returns
edge_id (list of tuples) – List of edges
weights (list) – List of edge weights
- __init__(X, blocksize=25000, useGPU=False, tmp_folder=None, save_every=1e+300)
Stores the main attributes of a class instance
- Parameters
X (scipy.sparse.csr or numpy.array) – Matrix of node attribute vectors
blocksize (int, optional (default=25_000)) – Size (number of rows) of each block in blocwise processing.
useGPU (bool, optional (default=False)) – If True, matrix operations are accelerated using GPU
tmp_folder (str or None, optional (defautl = None)) – Name of the folder to save temporary files
save_every (int, optional (default=0)) – Maximum size of the growing lists. The output lists are constructed incrementally. To avooy memory overload, growing lists are saved every time they reach this size limit. The full liests are thus incrementally saved in files. The default value is extremely large, which de facto implies no temporary saving.
- cluster_equivalent_nodes(reduceX=False)
Computes a graph where each node is formed by all nodes at zero distance
- Parameters
reduceX (boolean) – If True, it computes self.Xeq, a data matrix without rows at zero distance
- computeXeq()
Computes the reduced feature matrix X, with a single row per each equivalent class.
self.X[i] contains the feature vector of all nodes from equivalence class i.
- compute_id_graph(R=1e-100, verbose=True)
Computes identity graph.
The identity graph connects nodes a and b with weight 1 iff a and b have the same feature vector in self.T. Otherwise, a and be are disconnected
- Parameters
R (float) – Radius. Edges link all data pairs at distance lower than R. It should be a very small value in order to link nodes with almost equal attribute. Nonzero values may be used to allow slight deviations from equality.
verbose (boolean) – If False, block-by-block messaging is omitted in the call to he_neighbors graph.
- Returns
self – Updated attribute self.edge_ids (list of edges, as pairs (i, j) of indices)
- Return type
object
- connectivity_graph(R=None, metric='JS', verbose=True)
Computes a sparse connectivity graph for the self graph structure. The self graph must contain matrix self.X
- Parameters
R (float) – Radius. Edges link all data pairs at distance lower than R This is to forze a sparse graph.
metric (string) – Similarity measure used to compute affinity matrix Available options are:
‘JS’, 1 minus Jensen-Shannon (JS) divergence (too slow);
‘l1’, 1 minus l1 distance
‘l2’, Euclidean distance
‘He’, 1-squared Hellinger’s distance (sklearn-based implementation)
‘He2’, 1 minus squared Hellinger distance (self implementation)
‘cosine’, cosine distance
verbose (boolean, optional (default=True)) – (Only for he_neighbors_graph()). If False, block-by-block messaging is omitted
- Returns
self – Changes in attributes self.edge_ids (List of edges, as pairs (i, j) of indices) and self.weights (list of affinity values for each pair in edge_ids)
- Return type
object
- div2sim(div, mapping='linear', g=1, B=None)
Transforms a list of divergence scores (tipically, distances or squared distances) into a list of similarity values
The model considers three types of mappings:
Linear: s = (1 - div / R)
Polynomial: s = 1 - (div / R)**g
Exponential: s = exp(-g div)
Note that the linear mapping is equivalent to polynomial with g=1. Both linear and polynomial mappings are defined for founded distances in [0, div]. Otherwise, the similarity values would be negative.
- Parameters
div (float or list of float) – A a divergence or a list of divergence values
mapping (str in {‘linear’, ‘polynomial’, ‘exponential’}) – Type of mapping from distances to similarities.
g (int or float, optional (default=1)) – Power factor. Use g != 1 to apply nonlinear mapping
B (float or None, optional (default=None)) – (only for linear or polynomial mappings). Radius bound. For values in d2 higher than B, the output similarity will be negative.
- Returns
s – A list of similarity values with the same size than d
- Return type
list
- he_affinity(X, R2=10, mapping='linear', g=1)
Compute all Hellinger’s affinities between all nodes in the graph based on the node attribute vectors in matrix
It assumes that all attribute vectors are normalized to sum up to 1 Attribute matrix X can be sparse
- Parameters
X (numpy array) – Input matrix of probabilistic attribute vectors
R2 (float, optional (default=2)) – Squared radius (maximum squared He distance). Edges at higher distance are removed). The default is a large value (larger than any possible Hellinger distance between two probability distributions), which implies no edge filtering
mapping (str in {‘linear’, ‘polynomial’, ‘exponential’}) – Type of mapping from distances to similarities.
g (float) – Exponent for the final affinity mapping
- Returns
edge_id (list of tuples) – List of edges
weights (list) – List of edge weights
- l1_affinity(X, R=10, mapping='linear', g=1)
Compute all l1’s affinities between all nodes in the graph based on the node attribute vectors
It assumes that all attribute vectors are normalized to sum up to 1 Attribute matrix X can be sparse
- Parameters
X (numpy array) – Input matrix of probabilistic attribute vectors
R (float, optional (default=2)) – Radius (maximum l1 distance). Edges at higher distance are removed). The default is a large value (larger than the maximum possible l1 distance between two probability distributions), which implies no edge filtering
mapping (str in {‘linear’, ‘polynomial’, ‘exponential’}) – Type of mapping from distances to similarities.
g (float) – Exponent for the final affinity mapping
- Returns
edge_id (list of tuples) – List of edges
weights (list) – List of edge weights
- l2_affinity(X, R2=10, mapping='exponential', g=1)
Compute all l2’s affinities between all nodes in the graph based on the node attribute vectors
It assumes that all attribute vectors are normalized to sum up to 1 Attribute matrix X can be sparse
- Parameters
X (numpy array) – Input matrix of probabilistic attribute vectors
R2 (float, optional (default=2)) – Radius (maximum squared l2 distance). Edges at higher distance are removed). The default is a large value (larger than the maximum possible squared distance between two probability distributions), which implies no edge filtering
mapping (str in {‘linear’, ‘polynomial’, ‘exponential’}) – Type of mapping from distances to similarities.
g (float) – Exponent for the final affinity mapping
- Returns
edge_id (list of tuples) – List of edges
weights (list) – List of edge weights
- show_JS_bounds(s_min, sim, g=1, out_path=None, verbose=True)
Computes JS bounds for a given similarity measure.
- Parameters
s_min (float) – Radius. Edges link all data pairs at distance lower than R This is to forze a sparse graph.
sim (string) – Similarity measure used to compute affinity matrix Available options are:
‘l1->JS’, same as JS, but the graph is computed after preselecting edges using l1 distances and a theoretical bound
‘He->JS’, same as JS, but the graph is computed after preselecting edges using Hellinger’s distances and a theoretical bound
‘He2->JS’, same as He-Js, but using the self implementation of He.
g (float) – Exponent for the affinity mapping
verbose (boolean) – (Only for he_neighbors_graph()). If False, block-by-block messaging is omitted
- Returns
self – Modifies self.edge_ids (ist of edges, as pairs (i, j) of indices) and self.weights (list of affinity values for each pair in edge_ids)
- Return type
object
- sim2div(s, mapping='linear', g=1, B=None)
Transforms a list of similarity values into a list of divergences (tipically, distances or squared distances)
This is the inverse of method
div2sim()
- Parameters
s (list) – A list of similarity values
g (int or float, optional (default=1)) – Power factor. Use g != 1 to apply nonlinear mapping
B (float or None, optional (default=None)) – (only for linear or polynomial mappings). Radius bound. For values in div higher than B, the output similarity will be negative.
- Returns
div – A list of divergences
- Return type
list
- simTest(D, R, sim=None, g=1, fpath=None, label='l1')
Plot the values in weights vs the values in D selected by the indices in edge_id.
This is used to visualize the efect of selecting samples using one measure (l1, l2) as a first step to reduce the sample set used to elect edges based on the Jensen-Shannon divergence.
- Parameters
D (2D array) – Data matrix
R (float) – Radius bound
sim (string) – Name of the similarity function
g (float) – Exponent value
fpath (string) – Output path
label (string) – Label for the figure plot
- sim_graph(s_min=None, n_edges=None, **kwargs)
Computes a sparse graph for a given radius or for a given number of edges
- Parameters
s_min (float or None, optional (default=None)) – Similarity threshold. Edges link all data pairs with similarity higher than R. This forzes a sparse graph.
n_edges (int) – Number of edges