Neighbors#

This module implements a sparse kernel density estimator.

Molecular dynamics sampling can generate very large datasets. The distribution of the sampled data reflects the (free) energetic stability of molecular patterns, so a KDE model can be used to characterize the probability distribution and thus identify the stable patterns in the system. However, the computational cost of a standard KDE is O(N^2), where N is the number of sampled points, which quickly becomes prohibitive. Here we offer a sparse implementation of the KDE model with an O(MN) computational cost, where M is the number of grid points generated from the sampled data and typically M << N.
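
To make the cost contrast concrete, here is a minimal sketch of the idea (not skmatter's implementation; the helper names and the fixed isotropic bandwidth h are illustrative assumptions, whereas SparseKDE estimates bandwidths from the data):

import numpy as np

def naive_kde_at(samples, x, h=0.5):
    # Full KDE: every query touches all N samples, so evaluating the
    # density at the N sample positions themselves costs O(N^2).
    d2 = np.sum((samples - x) ** 2, axis=1)
    norm = len(samples) * (2 * np.pi * h**2) ** (samples.shape[1] / 2)
    return np.exp(-0.5 * d2 / h**2).sum() / norm

def sparse_kde_at(grid, grid_weights, x, h=0.5):
    # Sparse KDE: the N samples only enter through M weighted grid points
    # (weights summing to 1), so building the weights and evaluating the
    # density costs O(MN) overall instead of O(N^2).
    d2 = np.sum((grid - x) ** 2, axis=1)
    norm = (2 * np.pi * h**2) ** (grid.shape[1] / 2)
    return (grid_weights * np.exp(-0.5 * d2 / h**2)).sum() / norm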

The following class is available:

Sparse Kernel Density Estimation#

class skmatter.neighbors.SparseKDE(descriptors: ndarray, weights: ndarray | None = None, metric: Callable | None = None, metric_params: dict | None = None, fspread: float = -1.0, fpoints: float = 0.15, kernel: str = 'gaussian', verbose: bool = False)[source]#

Bases: BaseEstimator

A sparse implementation of kernel density estimation. This class builds a sparse kernel density estimator: it takes a set of descriptors and a set of weights as input, and fits the KDE model on a set of sampled points (e.g. grid points selected by FPS).

Note

Currently only the Gaussian kernel is supported.

Parameters:
  • descriptors (numpy.ndarray) – Descriptors of the system where you want to build a sparse KDE. It should be an array of shape (n_descriptors, n_features).

  • weights (numpy.ndarray, default=None) – Weights of the descriptors. If None, all weights are set to 1/n_descriptors.

  • metric (Callable, default=None) – The metric to use. Your metric should be able to take at least three arguments in sequence: X, Y, and squared=True. Here, X and Y are two array-likes of shape (n_samples, n_components), and the metric returns an array-like of shape (n_samples, n_samples). If you want to use periodic boundary conditions, be sure to provide the cell size in metric_params and provide a metric that can take the cell argument (see the sketch after this parameter list). If None, skmatter.metrics.periodic_pairwise_euclidean_distances() is used.

  • metric_params (dict, default=None) – Additional parameters to be passed to the metric, e.g. the cell dimensions for skmatter.metrics.periodic_pairwise_euclidean_distances(): {'cell_length': [side_length_1, ..., side_length_n]}

  • fspread (float, default=-1.0) – The fractional “space” occupied by the Voronoi cell of each grid point. Use this when each cell is of a similar size.

  • fpoints (float, default=0.15) – The fractional number of points in the Voronoi cell of each grid point. Use this when each cell contains a similar number of points.

  • kernel (str, default='gaussian') – The kernel to use. Currently only the Gaussian kernel is available.

  • verbose (bool, default=False) – Whether to print progress.
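
For the metric and metric_params arguments referenced above, a custom metric only needs to follow the documented call signature. Below is a minimal sketch (the function name and the minimum-image handling are illustrative, and it assumes that entries of metric_params are forwarded to the metric as keyword arguments):

import numpy as np

def periodic_sqeuclidean(X, Y, squared=True, cell=None):
    # Pairwise (squared) Euclidean distances of shape (n_samples, n_samples),
    # optionally wrapped by the minimum-image convention in a rectangular cell.
    diff = np.asarray(X)[:, None, :] - np.asarray(Y)[None, :, :]
    if cell is not None:
        cell = np.asarray(cell)
        diff -= np.round(diff / cell) * cell
    d2 = np.sum(diff**2, axis=-1)
    return d2 if squared else np.sqrt(d2)

It could then be passed as, e.g., SparseKDE(descriptors, metric=periodic_sqeuclidean, metric_params={"cell": [10.0, 10.0]}).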

n_samples#

The number of descriptors.

Type:

int

kdecut_squared#

The squared cut-off value for the KDE. If the squared Mahalanobis distance between two grid points is larger than kdecut_squared, they are considered to be far away.

Type:

float
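
For reference, the squared Mahalanobis distance between points x and y under a covariance matrix cov is (x - y)^T cov^{-1} (x - y). A minimal numpy sketch (names are illustrative):

import numpy as np

def mahalanobis_squared(x, y, cov):
    # (x - y)^T cov^{-1} (x - y), computed via a linear solve
    # rather than forming the explicit matrix inverse
    delta = np.asarray(x) - np.asarray(y)
    return float(delta @ np.linalg.solve(cov, delta))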

cell#

The cell dimension for the metric.

Type:

numpy.ndarray

bandwidth_#

The bandwidth of the KDE.

Type:

numpy.ndarray

Examples

>>> import numpy as np
>>> from skmatter.neighbors import SparseKDE
>>> from skmatter.feature_selection import FPS
>>> np.random.seed(0)
>>> n_samples = 10_000

Create two Gaussians with different means and covariances, and sample from them

>>> cov1 = [[1, 0.5], [0.5, 1]]
>>> cov2 = [[1, 0.5], [0.5, 0.5]]
>>> sample1 = np.random.multivariate_normal([0, 0], cov1, n_samples)
>>> sample2 = np.random.multivariate_normal([4, 4], cov2, n_samples)
>>> samples = np.concatenate([sample1, sample2])

Select grid points using FPS

>>> selector = FPS(n_to_select=int(np.sqrt(2 * n_samples)))
>>> result = selector.fit_transform(samples.T).T

Fit the sparse KDE on the selected grid points

>>> estimator = SparseKDE(samples, None, fpoints=0.5)
>>> _ = estimator.fit(result)

Compute the total log-likelihood under the model

>>> print(round(estimator.score(result), 3))
-759.831
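
Per-sample log-densities can also be queried with score_samples; continuing the example above (the shape follows from n_to_select = int(np.sqrt(2 * n_samples)) = 141):

>>> log_densities = estimator.score_samples(result)
>>> log_densities.shape
(141,)
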
fit(X, y=None, sample_weight=None)[source]#

Fit the Kernel Density model on the data.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points. Each row corresponds to a single data point.

  • y (None) – Ignored. This parameter exists only for compatibility with Pipeline.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights attached to the data X. This parameter is ignored: instead of being read from the input, the weights are computed internally.

Returns:

self (object) – Returns the instance itself.

score_samples(X)[source]#

Compute the log-likelihood of each sample under the model.

Parameters:

X (array-like of shape (n_samples, n_features)) – An array of points to query. Last dimension should match dimension of training data (n_features).

Returns:

density (ndarray of shape (n_samples,)) – Log-likelihood of each sample in X. These are normalized to be probability densities, so values will be low for high-dimensional data.

score(X, y=None)[source]#

Compute the total log-likelihood under the model.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – List of n_features-dimensional data points. Each row corresponds to a single data point.

  • y (None) – Ignored. This parameter exists only for compatibility with Pipeline.

Returns:

logprob (float) – Total log-likelihood of the data in X. This is normalized to be a probability density, so the value will be low for high-dimensional data.