.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "examples/neighbors/sparse-kde.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_examples_neighbors_sparse-kde.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_examples_neighbors_sparse-kde.py:


Sparse KDE examples
===================

Example for the usage of the :class:`skmatter.neighbors.SparseKDE` class. This class
is specifically designed for conducting probabilistic analysis of molecular motifs
(`PAMM `_), which is useful for analyzing motifs such as H-bonds, coordination
polyhedra, and protein secondary structure.

Here we show how to use the sparse KDE model to fit a probability distribution from
sampled data; this kind of density estimate is the basis of a PAMM analysis of motifs
such as H-bonds.

We start from a simple system consisting of three 2D Gaussians. Our task is to
estimate the parameters of these Gaussians from the sampled data. We first draw
samples from the three Gaussians.

.. GENERATED FROM PYTHON SOURCE LINES 24-34

.. code-block:: Python

    import time

    import matplotlib.pyplot as plt
    import numpy as np
    from scipy.stats import gaussian_kde

    from skmatter.feature_selection import FPS
    from skmatter.neighbors import SparseKDE


.. GENERATED FROM PYTHON SOURCE LINES 35-48

.. code-block:: Python

    means = np.array([[0, 0], [4, 4], [6, -2]])
    covariances = np.array(
        [[[1, 0.5], [0.5, 1]], [[1, 0.5], [0.5, 0.5]], [[1, -0.5], [-0.5, 1]]]
    )
    N_SAMPLES = 100_000
    samples = np.concatenate(
        [
            np.random.multivariate_normal(means[0], covariances[0], N_SAMPLES),
            np.random.multivariate_normal(means[1], covariances[1], N_SAMPLES),
            np.random.multivariate_normal(means[2], covariances[2], N_SAMPLES),
        ]
    )


.. GENERATED FROM PYTHON SOURCE LINES 49-52

We can visualize the sampled data:

.. GENERATED FROM PYTHON SOURCE LINES 54-61

.. code-block:: Python

    fig, ax = plt.subplots()
    ax.scatter(samples[:, 0], samples[:, 1], alpha=0.05, s=1)
    ax.scatter(means[:, 0], means[:, 1], marker="+", color="red", s=100)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    plt.show()


.. image-sg:: /examples/neighbors/images/sphx_glr_sparse-kde_001.png
   :alt: sparse kde
   :srcset: /examples/neighbors/images/sphx_glr_sparse-kde_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 62-66

Sparse KDE requires a discretization of the sample space. Here, we use the FPS method
to select grid points in the sample space:

.. GENERATED FROM PYTHON SOURCE LINES 68-80

.. code-block:: Python

    start1 = time.time()
    selector = FPS(n_to_select=int(np.sqrt(3 * N_SAMPLES)))
    # FPS is a feature selector, so we transpose the sample matrix to select
    # sample points rather than features, and transpose the result back.
    grids = selector.fit_transform(samples.T).T
    end1 = time.time()
    fig, ax = plt.subplots()
    ax.scatter(samples[:, 0], samples[:, 1], alpha=0.05, s=1)
    ax.scatter(means[:, 0], means[:, 1], marker="+", color="red", s=100)
    ax.scatter(grids[:, 0], grids[:, 1], color="orange", s=1)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    plt.show()


.. image-sg:: /examples/neighbors/images/sphx_glr_sparse-kde_002.png
   :alt: sparse kde
   :srcset: /examples/neighbors/images/sphx_glr_sparse-kde_002.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 81-84

Now we can fit the sparse KDE (this usually takes tens of seconds):

.. GENERATED FROM PYTHON SOURCE LINES 86-91

.. code-block:: Python

    start2 = time.time()
    estimator = SparseKDE(samples, None, fpoints=0.5)
    estimator.fit(grids)
    end2 = time.time()
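Before comparing with the reference distribution, it is instructive to peek at what has
been fitted: as the plotting code further below shows, the fitted sparse KDE is used as
a Gaussian mixture whose centres are the selected grid points, each carrying a weight
and a bandwidth matrix. The following is a minimal sanity-check sketch, assuming only
the attributes that the plotting code below relies on (the underscore-prefixed ones are
private API and may change between skmatter versions):

.. code-block:: Python

    # Quick inspection of the fitted sparse model. These attributes are the same
    # ones used below to build the plotting helper; ``_grids`` and
    # ``_sample_weights`` are private and not part of the public API.
    print(f"Number of kernel centres kept: {len(estimator._grids)}")
    print(f"Bandwidth array shape: {estimator.bandwidth_.shape}")
    print(f"Sum of kernel weights: {estimator._sample_weights.sum():.3f}")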
.. GENERATED FROM PYTHON SOURCE LINES 92-96

We can compare the fitted model with the original sampling result by plotting both.
For convenience, we create a Gaussian mixture model class to help us plot the results.

.. GENERATED FROM PYTHON SOURCE LINES 99-138

.. code-block:: Python

    class GaussianMixtureModel:
        def __init__(
            self,
            weights: np.ndarray,
            means: np.ndarray,
            covariances: np.ndarray,
            period: np.ndarray = None,
        ):
            self.weights = weights
            self.means = means
            self.covariances = covariances
            self.period = period
            self.dimension = self.means.shape[1]
            self.cov_inv = np.linalg.inv(self.covariances)
            self.cov_det = np.linalg.det(self.covariances)
            self.norm = 1 / np.sqrt((2 * np.pi) ** self.dimension * self.cov_det)

        def __call__(self, x: np.ndarray, i: int = None):
            if len(x.shape) == 1:
                x = x[np.newaxis, :]
            if self.period is not None:
                # Minimum-image displacement between x and the component means
                xij = rij(self.period, x, self.means)
            else:
                xij = x - self.means
            p = (
                self.weights
                * self.norm
                * np.exp(
                    -0.5
                    * (xij[:, np.newaxis, :] @ self.cov_inv @ xij[:, :, np.newaxis])
                ).reshape(-1)
            )
            sum_p = np.sum(p)
            if i is None:
                return sum_p
            return np.sum(p[i]) / sum_p


.. GENERATED FROM PYTHON SOURCE LINES 139-148

.. code-block:: Python

    def rij(period: np.ndarray, xi: np.ndarray, xj: np.ndarray) -> np.ndarray:
        """Get the position vectors between two points. PBC are taken into account."""
        xij = xi - xj
        if period is not None:
            xij -= np.round(xij / period) * period

        return xij


.. GENERATED FROM PYTHON SOURCE LINES 149-150

The original model that we want to fit:

.. GENERATED FROM PYTHON SOURCE LINES 150-175

.. code-block:: Python

    original_model = GaussianMixtureModel(np.full(3, 1 / 3), means, covariances)

    # The fitted model:
    fitted_model = GaussianMixtureModel(
        estimator._sample_weights, estimator._grids, estimator.bandwidth_
    )

    # To plot the probability density contours, we need a grid of evaluation points:
    x, y = np.meshgrid(np.linspace(-6, 12, 100), np.linspace(-8, 8))
    points = np.concatenate(np.stack([x, y], axis=-1))
    probs = np.array([original_model(point) for point in points])
    fitted_probs = np.array([fitted_model(point) for point in points])

    fig, ax = plt.subplots()
    ct1 = ax.contour(x, y, probs.reshape(x.shape), colors="blue")
    ct2 = ax.contour(x, y, fitted_probs.reshape(x.shape), colors="orange")
    h1, _ = ct1.legend_elements()
    h2, _ = ct2.legend_elements()
    ax.legend(
        [h1[0], h2[0]],
        ["original", "fitted"],
    )
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    plt.show()


.. image-sg:: /examples/neighbors/images/sphx_glr_sparse-kde_003.png
   :alt: sparse kde
   :srcset: /examples/neighbors/images/sphx_glr_sparse-kde_003.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 176-179

The performance of the probability density estimation can be characterized by the
Mean Integrated Squared Error (MISE), which is defined as:
:math:`\text{MISE}=\text{E}[\int (\hat{P}(\textbf{x})-P(\textbf{x}))^2 d\textbf{x}]`
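Since both densities are evaluated on a regular grid, the integral can be approximated
by a Riemann sum over the grid cells (the expectation is dropped because we only look
at a single fit); this is what the code below computes:

.. math::

   \text{MISE} \approx \sum_{i} \left(\hat{P}(\textbf{x}_i)-P(\textbf{x}_i)\right)^2
   \Delta x \, \Delta y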
.. GENERATED FROM PYTHON SOURCE LINES 181-185

.. code-block:: Python

    # Riemann-sum approximation of the MISE on the evaluation grid
    RMSE = np.sum(
        (probs - fitted_probs) ** 2 * (x[0][1] - x[0][0]) * (y[1][0] - y[0][0])
    )
    print(f"Time sparse-kde: {end2 - start2} s")
    print(f"RMSE = {RMSE:.2e}")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Time sparse-kde: 44.906388998031616 s
    RMSE = 7.22e-04


.. GENERATED FROM PYTHON SOURCE LINES 186-188

We can compare the result with the KDE class from scipy, which evaluates the density
using all samples instead of a sparse set of grid points. (This can also take a while
for this many samples.)

.. GENERATED FROM PYTHON SOURCE LINES 190-201

.. code-block:: Python

    data = np.vstack([x.ravel(), y.ravel()])
    start = time.time()
    kde = gaussian_kde(samples.T)
    sklearn_probs = kde(data).T
    end = time.time()
    print(f"Time scipy: {end - start} s")

    RMSE_kde = np.sum(
        (probs - sklearn_probs) ** 2 * (x[0][1] - x[0][0]) * (y[1][0] - y[0][0])
    )
    print(f"RMSE_kde = {RMSE_kde:.2e}")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Time scipy: 16.33100938796997 s
    RMSE_kde = 8.85e-04


.. GENERATED FROM PYTHON SOURCE LINES 202-206

We can see that the fitted model closely reproduces the original one. Even though we
have not specified the number of Gaussians, the sparse KDE still performs well. This
allows us to fit distributions of the data automatically, at a quality comparable to
scipy's KDE, while the fitted sparse model is much cheaper to evaluate afterwards
because it uses only a small set of grid points instead of all samples.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (1 minutes 2.548 seconds)


.. _sphx_glr_download_examples_neighbors_sparse-kde.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: sparse-kde.ipynb <sparse-kde.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: sparse-kde.py <sparse-kde.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: sparse-kde.zip <sparse-kde.zip>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_