Clustering
The module implements the quick shift clustering algorithm, which is used in probabilistic analysis of molecular motifs (PAMM). See Gasparotto and Ceriotti for more details.
Quick Shift
- class skmatter.clustering.QuickShift(dist_cutoff_sq: float | None = None, gabriel_shell: int | None = None, scale: float = 1.0, metric: Callable | None = None, metric_params: dict | None = None)
Conducts quick shift clustering.
This class implements the quick shift clustering algorithm, which is used in probabilistic analysis of molecular motifs (PAMM). There are two ways to search for the next point: (1) search within the given distance cutoff, or (2) search within the given number of neighbor shells of the Gabriel graph. If both are set, the distance cutoff is used.
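The distance-cutoff search can be illustrated with a minimal NumPy sketch. This is a simplified illustration under stated assumptions, not the library's implementation: the helper name quick_shift_sketch is hypothetical, plain (non-periodic) squared Euclidean distances are assumed, and each point is simply linked to its nearest neighbor that has a larger weight and lies within that point's squared cutoff. The data are the six points, weights, and cutoffs from the Examples section of this page.

```python
import numpy as np


def quick_shift_sketch(points, weights, cutoffs_sq):
    """Simplified quick shift: link each point to its nearest heavier neighbor.

    Hypothetical helper for illustration only, not the skmatter code.
    """
    n = len(points)
    # Squared Euclidean distances between all pairs of points.
    diff = points[:, None, :] - points[None, :, :]
    dist_sq = np.sum(diff**2, axis=-1)

    parent = np.arange(n)  # a point that keeps itself as parent is a center
    for i in range(n):
        # Candidates: strictly larger weight and inside this point's cutoff.
        mask = (weights > weights[i]) & (dist_sq[i] < cutoffs_sq[i])
        if mask.any():
            candidates = np.flatnonzero(mask)
            parent[i] = candidates[np.argmin(dist_sq[i, candidates])]

    # Follow the links from every point until a center is reached.
    labels = parent.copy()
    for i in range(n):
        while labels[i] != parent[labels[i]]:
            labels[i] = parent[labels[i]]
    return labels


feature1 = np.array([-1.72, -4.44, 0.54, 3.19, -1.13, 0.55])
feature2 = np.array([-1.32, -2.13, -2.43, -0.49, 2.33, 0.18])
points = np.vstack((feature1, feature2)).T
weights = np.array([-3.94, -12.68, -7.07, -9.03, -8.26, -2.61])
cuts = np.array([6.99, 8.80, 7.68, 9.51, 8.07, 6.22])

labels = quick_shift_sketch(points, weights, cuts)
print(labels)  # [0 0 0 5 5 5]
```

Point 0 stays a center because the only heavier point (index 5) lies at a squared distance of about 7.40, just outside its cutoff of 6.99. Because links only ever go to strictly heavier points, following them cannot loop.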
- Parameters:
dist_cutoff_sq (float, default=None) – The squared distance cutoff for searching for the next point. Two points are considered neighbors if they are within this distance. If None, the Gabriel graph scheme is used.
gabriel_shell (int, default=None) – The number of neighbor shells of the Gabriel graph for searching for the next point. For example, if the number is 1, two points are considered neighbors if they have at least one common neighbor: for the case “A-B-C”, “A-C” are considered neighbors. If the number is 2, for the case “A-B-C-D”, “A-D” are considered neighbors. If None, the distance cutoff scheme is used.
scale (float, default=1.0) – Distance cutoff scaling factor used during the QS clustering. It is squared internally, since this class works with squared distances.
metric (Callable, default=None) – The metric to use. The metric must accept at least three arguments in sequence: X, Y, and squared=True. Here, X and Y are two array-likes of shape (n_samples, n_components). The metric returns an array-like of shape (n_samples, n_samples). To use periodic boundary conditions, provide the cell length in metric_params and a metric that accepts the cell argument. If None, skmatter.metrics.periodic_pairwise_euclidean_distances() is used.
metric_params (dict, default=None) – Additional parameters passed to the metric, e.g. the dimensions of a rectangular cell with side lengths \(a_i\) for skmatter.metrics.periodic_pairwise_euclidean_distances(): {'cell_length': [a_1, a_2, ..., a_n]}
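As an illustration of the metric signature described above, the following sketch shows what a user-supplied periodic metric could look like. The function name periodic_sq_distances is hypothetical, and the minimum image convention for a rectangular cell is an assumption made for this sketch; it is not the code of skmatter.metrics.periodic_pairwise_euclidean_distances().

```python
import numpy as np


def periodic_sq_distances(X, Y, squared=True, cell_length=None):
    """Pairwise distances under the minimum image convention (illustrative)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Component-wise separations, shape (n_samples_X, n_samples_Y, n_components).
    diff = X[:, None, :] - Y[None, :, :]
    if cell_length is not None:
        cell = np.asarray(cell_length, dtype=float)
        # Wrap each separation component into [-a_i / 2, a_i / 2].
        diff -= cell * np.round(diff / cell)
    dist_sq = np.sum(diff**2, axis=-1)
    return dist_sq if squared else np.sqrt(dist_sq)


X = np.array([[0.2, 0.2], [2.9, 0.2]])
open_d = periodic_sq_distances(X, X)
periodic_d = periodic_sq_distances(X, X, cell_length=[3, 3])
print(open_d[0, 1], periodic_d[0, 1])
```

Without a cell, the two points are 2.7 apart along the first axis (squared distance about 7.29). In a cell of side 3 they are only 0.3 apart through the boundary (squared distance about 0.09). Going by the parameter descriptions above, such a callable would presumably be passed as metric=periodic_sq_distances together with metric_params={"cell_length": [3, 3]}.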
- labels_
An array of labels, one for each input sample.
- Type:
- cluster_centers_idx_
An array of indices of cluster centers.
- Type:
- cluster_centers_
An array of cluster centers.
- Type:
Examples
>>> import numpy as np
>>> from skmatter.clustering import QuickShift
Create some points and their weights for quick shift clustering
>>> feature1 = np.array([-1.72, -4.44, 0.54, 3.19, -1.13, 0.55])
>>> feature2 = np.array([-1.32, -2.13, -2.43, -0.49, 2.33, 0.18])
>>> points = np.vstack((feature1, feature2)).T
>>> weights = np.array([-3.94, -12.68, -7.07, -9.03, -8.26, -2.61])
Set cutoffs for searching
>>> cuts = np.array([6.99, 8.80, 7.68, 9.51, 8.07, 6.22])
Do the clustering
>>> model = QuickShift(cuts).fit(points, samples_weight=weights)
>>> print(model.labels_)
[0 0 0 5 5 5]
>>> print(model.cluster_centers_idx_)
[0 5]
We can also apply a periodic boundary condition
>>> model = QuickShift(cuts, metric_params={"cell_length": [3, 3]})
>>> model = model.fit(points, samples_weight=weights)
>>> print(model.labels_)
[5 5 5 5 5 5]
>>> print(model.cluster_centers_idx_)
[5]
Since the search cutoffs are all larger than the maximum distance in the PBC box, all points are expected to be assigned to the same cluster, whose center is the point with the largest weight.
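The claim about the maximum distance can be checked with a short calculation. The sketch below assumes that the periodic metric follows the minimum image convention, so that each component of a separation vector is at most half the cell length in magnitude.

```python
import numpy as np

cell_length = np.array([3.0, 3.0])
cuts = np.array([6.99, 8.80, 7.68, 9.51, 8.07, 6.22])

# Under the minimum image convention, each separation component is at most
# half the cell length, so the largest squared distance is sum((a_i / 2)^2).
max_dist_sq = np.sum((cell_length / 2) ** 2)
print(max_dist_sq)  # 4.5
print(bool(np.all(cuts > max_dist_sq)))  # True
```

With a largest possible squared distance of 4.5 and a smallest cutoff of 6.22, every point is within the cutoff of every other point, so every point except the heaviest one (index 5) can shift toward a heavier neighbor.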