Feature and Sample Selection#

Data sub-selection modules, primarily methods derived from CUR matrix decomposition and Farthest Point Sampling (FPS). In their classical forms, CUR and FPS determine a data subset that maximizes the variance (CUR) or diversity (FPS) of the features or samples. Both methods can be modified to incorporate supervised target information, yielding the methods PCov-CUR and PCov-FPS. For further reading, refer to [Imbalzano2018] and [Cersonsky2021]. These selectors can be used for both feature and sample selection, with similar instantiations. All sub-selection methods score each feature or sample without requiring an estimator and, at each step, choose the one with the maximum score. A simple example of usage:

>>> # feature selection
>>> import numpy as np
>>> from skmatter.feature_selection import CUR, FPS, PCovCUR, PCovFPS
>>> selector = CUR(
...     # the number of selections to make
...     # if None, set to half the samples or features
...     # if float, fraction of the total dataset to select
...     # if int, absolute number of selections to make
...     n_to_select=2,
...     # option to use a tqdm (https://tqdm.github.io/) progress bar
...     progress_bar=True,
...     # float, cutoff score to stop selecting
...     score_threshold=1e-12,
...     # boolean, whether to select randomly after non-redundant selections
...     # are exhausted
...     full=False,
... )
>>> X = np.array(
...     [
...         [0.12, 0.21, 0.02],  # 3 samples, 3 features
...         [-0.09, 0.32, -0.10],
...         [-0.03, -0.53, 0.08],
...     ]
... )
>>> y = np.array([0.0, 0.0, 1.0])  # classes of each sample
>>> selector.fit(X)
CUR(n_to_select=2, progress_bar=True, score_threshold=1e-12)
>>> Xr = selector.transform(X)
>>> print(Xr.shape)
(3, 2)
>>> selector = PCovCUR(n_to_select=2)
>>> selector.fit(X, y)
PCovCUR(n_to_select=2)
>>> Xr = selector.transform(X)
>>> print(Xr.shape)
(3, 2)
>>>
>>> # Now sample selection
>>> from skmatter.sample_selection import CUR, FPS, PCovCUR, PCovFPS
>>> selector = CUR(n_to_select=2)
>>> selector.fit(X)
CUR(n_to_select=2)
>>> Xr = X[selector.selected_idx_]
>>> print(Xr.shape)
(2, 3)

These selectors are available:

  • CUR decomposition: an iterative feature selection method based upon the singular value decomposition.

  • PCov-CUR decomposition extends upon CUR by using augmented right or left singular vectors inspired by Principal Covariates Regression.

  • Farthest Point-Sampling (FPS): a common selection technique intended to exploit the diversity of the input space. The selection of the first point is made at random or by a separate metric.

  • PCov-FPS extends upon FPS much like PCov-CUR does to CUR.

  • Voronoi FPS: conducts FPS selection, taking advantage of Voronoi tessellations to accelerate selection.

  • Directional Convex Hull (DCH): selects samples by constructing a directional convex hull and determining which samples lie on the bounding surface.

CUR#

CUR decomposition begins by approximating a matrix \({\mathbf{X}}\) using a subset of columns and rows

\[\mathbf{\hat{X}} \approx \mathbf{X}_\mathbf{c} \left(\mathbf{X}_\mathbf{c}^- \mathbf{X} \mathbf{X}_\mathbf{r}^-\right) \mathbf{X}_\mathbf{r}.\]

These subsets of rows and columns, denoted \(\mathbf{X}_\mathbf{r}\) and \(\mathbf{X}_\mathbf{c}\), respectively, can be determined by iteratively maximizing a leverage score \(\pi\), which represents the relative importance of each column or row. From here on, we will refer to selection methods derived from the CUR decomposition as “CUR”, shorthand for “CUR-derived selection”. In each iteration of CUR, we select the column or row that maximizes \(\pi\), then orthogonalize the remaining columns or rows with respect to that selection. These steps are repeated until a sufficient number of features has been selected. This iterative approach, albeit comparatively time-consuming, reduces the number of features needed to approximate \(\mathbf{X}\) more effectively than selecting all features in a single pass based on their initial \(\pi\) importance.

The feature and sample selection versions of CUR differ only in the computation of \(\pi\): in sample selection, \(\pi\) is computed from the left singular vectors, whereas in feature selection it is computed from the right singular vectors.
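As an illustration, the approximation above can be written directly in numpy. This is a minimal sketch assuming the column indices c and row indices r have already been selected, not the library's internal implementation:

>>> import numpy as np
>>> rng = np.random.default_rng(0)
>>> X = rng.normal(size=(4, 3))
>>> c, r = [0, 2], [1, 3]  # hypothetical column and row selections
>>> Xc, Xr = X[:, c], X[r, :]  # selected columns and rows
>>> # X_hat = Xc (Xc^- X Xr^-) Xr, with ^- denoting the pseudo-inverse
>>> X_hat = Xc @ (np.linalg.pinv(Xc) @ X @ np.linalg.pinv(Xr)) @ Xr
>>> X_hat.shape
(4, 3)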

class skmatter.feature_selection.CUR(recompute_every=1, k=1, tolerance=1e-12, n_to_select=None, score_threshold=None, score_threshold_type='absolute', progress_bar=False, full=False, random_state=0)[source]#

Transformer that performs Greedy Feature Selection by choosing features which maximize the magnitude of the right singular vectors, consistent with classic CUR matrix decomposition.

Parameters:
  • recompute_every (int) – number of steps after which to recompute the pi score. Defaults to 1; if 0, no recomputation is done.

  • k (int) – number of eigenvectors to compute the importance score with, defaults to 1

  • tolerance (float) – threshold below which scores will be considered 0, defaults to 1e-12

  • n_to_select (int or float, default=None) – The number of selections to make. If None, half of the features are selected. If integer, the parameter is the absolute number of selections to make. If float between 0 and 1, it is the fraction of the total dataset to select. Stored in self.n_to_select.

  • score_threshold (float, default=None) – Threshold for the score. If None selection will continue until the n_to_select is chosen. Otherwise will stop when the score falls below the threshold. Stored in self.score_threshold.

  • score_threshold_type (str, default="absolute") – How to interpret the score_threshold. When “absolute”, the score used by the selector is compared to the threshold directly. When “relative”, at each iteration, the score used by the selector is compared proportionally to the score of the first selection, i.e. the selector quits when current_score / first_score < threshold. Stored in self.score_threshold_type.

  • progress_bar (bool, default=False) – option to use tqdm progress bar to monitor selections. Stored in self.report_progress.

  • full (bool, default=False) – In the case that all non-redundant selections are exhausted, choose randomly from the remaining features. Stored in self.full.

  • random_state (int or numpy.random.RandomState instance, default=0)

X_current_#

The original matrix orthogonalized by previous selections

Type:

numpy.ndarray (n_samples, n_features)

n_selected_#

Counter tracking the number of selections that have been made

Type:

int

X_selected_#

Matrix containing the selected features, for use in fitting

Type:

numpy.ndarray

pi_#

the importance score; see _compute_pi()

Type:

numpy.ndarray (n_features)

selected_idx_#

indices of selected features

Type:

numpy.ndarray

Examples

>>> from skmatter.feature_selection import CUR
>>> import numpy as np
>>> selector = CUR(n_to_select=2, random_state=0)
>>> X = np.array(
...     [
...         [0.12, 0.21, 0.02],  # 3 samples, 3 features
...         [-0.09, 0.32, -0.10],
...         [-0.03, -0.53, 0.08],
...     ]
... )
>>> selector.fit(X)
CUR(n_to_select=2)
>>> Xr = selector.transform(X)
>>> print(Xr.shape)
(3, 2)
>>> np.round(selector.pi_)  # importance score
array([0., 0., 0.])
>>> selector.selected_idx_
array([1, 0])
_compute_pi(X, y=None)#

For feature selection, the importance score \(\pi\) is the sum over the squares of the first \(k\) components of the right singular vectors

\[\pi_j = \sum_i^k \left(\mathbf{U}_\mathbf{C}\right)_{ij}^2.\]

where \(\mathbf{C} = \mathbf{X}^T\mathbf{X}\).

For sample selection, the importance score \(\pi\) is the sum over the squares of the first \(k\) components of the left singular vectors

\[\pi_j = \sum_i^k \left(\mathbf{U}_\mathbf{K}\right)_{ij}^2.\]

where \(\mathbf{K} = \mathbf{X}\mathbf{X}^T\).

Parameters:
  • X (numpy.ndarray of shape [n_samples, n_features]) – The input samples.

  • y (ignored)

Returns:

pi (numpy.ndarray of (n_to_select_from_)) – \(\pi\) importance for the given samples or features
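As a rough, plain-numpy sketch of the two formulas above (assuming \(k = 1\) and a direct numpy.linalg.svd; the library's internal computation may differ):

>>> import numpy as np
>>> X = np.array([[0.12, 0.21, 0.02], [-0.09, 0.32, -0.10], [-0.03, -0.53, 0.08]])
>>> k = 1
>>> U, _, Vt = np.linalg.svd(X)
>>> pi_features = (Vt[:k] ** 2).sum(axis=0)  # rows of Vt are right singular vectors
>>> pi_samples = (U[:, :k] ** 2).sum(axis=1)  # columns of U are left singular vectors
>>> pi_features.shape, pi_samples.shape
((3,), (3,))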

fit(X, y=None, warm_start=False)#

Learn the features to select.

Parameters:
  • X (numpy.ndarray of shape (n_samples, n_features)) – Training vectors.

  • y (numpy.ndarray of shape (n_samples,), default=None) – Target values.

  • warm_start (bool) – Whether the fit should continue after having already run, after increasing n_to_select. Assumes it is called with the same X and y

Returns:

self (object)
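A short sketch of the intended warm_start workflow on hypothetical data (the `_ =` assignments suppress the doctest repr output): after increasing n_to_select, a warm-started fit resumes from the existing selections instead of restarting.

>>> import numpy as np
>>> from skmatter.feature_selection import CUR
>>> rng = np.random.default_rng(0)
>>> X2 = rng.normal(size=(6, 4))  # hypothetical data: 6 samples, 4 features
>>> warm_selector = CUR(n_to_select=2)
>>> _ = warm_selector.fit(X2)
>>> _ = warm_selector.set_params(n_to_select=3)
>>> _ = warm_selector.fit(X2, warm_start=True)  # continues from the prior selections
>>> warm_selector.n_selected_
3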

fit_transform(X, y=None, **fit_params)#

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new (ndarray array of shape (n_samples, n_features_new)) – Transformed array.

get_feature_names_out(input_features=None)#

Mask feature names according to selected features.

Parameters:

input_features (array-like of str or None, default=None) –

Input features.

  • If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: [“x0”, “x1”, …, “x(n_features_in_ - 1)”].

  • If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns:

feature_names_out (ndarray of str objects) – Transformed feature names.

get_metadata_routing()#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing (MetadataRequest) – A MetadataRequest encapsulating routing information.

get_params(deep=True)#

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params (dict) – Parameter names mapped to their values.

get_support(indices=False, ordered=False)#

Get a mask, or integer index, of the subset

Parameters:
  • indices (bool, default=False) – If True, the return value will be an array of integers, rather than a bool mask.

  • ordered (bool, default=False) – With indices, if True, the return value will be an array of integers, rather than a bool mask, in the order in which they were selected.

Returns:

support (An index that selects the retained subset from the original vectors.) – If indices is False, this is a bool array of shape [# input], in which an element is True iff its corresponding feature or sample is selected for retention. If indices is True, this is an integer array of shape [# n_to_select] whose values are indices into the input vectors.
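For the selector fitted in the Examples above (where selected_idx_ is [1, 0]), the two forms would look as follows (illustrative):

>>> selector.get_support()  # boolean mask over all input features
array([ True,  True, False])
>>> selector.get_support(indices=True, ordered=True)  # indices, in order of selection
array([1, 0])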

inverse_transform(X)#

Reverse the transformation operation.

Parameters:

X (array of shape [n_samples, n_selected_features]) – The input samples.

Returns:

X_original (array of shape [n_samples, n_original_features]) – X with columns of zeros inserted where features would have been removed by transform().
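Continuing the same fitted selector, a sketch of the round trip: the feature removed by transform() comes back as a column of zeros.

>>> Xr = selector.transform(X)
>>> selector.inverse_transform(Xr)
array([[ 0.12,  0.21,  0.  ],
       [-0.09,  0.32,  0.  ],
       [-0.03, -0.53,  0.  ]])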

score(X, y=None)#

Returns the importance score of the given samples or features.

Note

This function does not compute the importance score each time it is called, in order to avoid unnecessary computations. This is done by self._compute_pi().

Parameters:
  • X (ignored)

  • y (ignored)

Returns:

score (numpy.ndarray of (n_to_select_from_)) – \(\pi\) importance for the given samples or features

set_fit_request(*, warm_start: bool | None | str = '$UNCHANGED$') → CUR#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

warm_start (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for warm_start parameter in fit.

Returns:

self (object) – The updated object.

set_output(*, transform=None)#

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform ({"default", "pandas", "polars"}, default=None) –

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • “polars”: Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: “polars” option was added.

Returns:

self (estimator instance) – Estimator instance.
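For example, to have transform return a pandas DataFrame (assumes pandas is installed; this is the standard sklearn set_output mechanism):

>>> _ = selector.set_output(transform="pandas")
>>> type(selector.transform(X)).__name__
'DataFrame'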

set_params(**params)#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self (estimator instance) – Estimator instance.

transform(X, y=None)#

Reduce X to the selected features.

Parameters:
  • X (numpy.ndarray of shape [n_samples, n_features]) – The input samples.

  • y (ignored)

Returns:

X_r (numpy.ndarray) – The selected subset of the input.

class skmatter.sample_selection.CUR(recompute_every=1, k=1, tolerance=1e-12, n_to_select=None, score_threshold=None, score_threshold_type='absolute', progress_bar=False, full=False, random_state=0)[source]#

Transformer that performs Greedy Sample Selection by choosing samples which maximize the magnitude of the left singular vectors, consistent with classic CUR matrix decomposition.

Parameters:
  • recompute_every (int) – number of steps after which to recompute the pi score. Defaults to 1; if 0, no recomputation is done.

  • k (int) – number of eigenvectors to compute the importance score with, defaults to 1

  • tolerance (float) – threshold below which scores will be considered 0, defaults to 1e-12

  • n_to_select (int or float, default=None) – The number of selections to make. If None, half of the samples are selected. If integer, the parameter is the absolute number of selections to make. If float between 0 and 1, it is the fraction of the total dataset to select. Stored in self.n_to_select.

  • score_threshold (float, default=None) – Threshold for the score. If None selection will continue until the n_to_select is chosen. Otherwise will stop when the score falls below the threshold. Stored in self.score_threshold.

  • score_threshold_type (str, default="absolute") – How to interpret the score_threshold. When “absolute”, the score used by the selector is compared to the threshold directly. When “relative”, at each iteration, the score used by the selector is compared proportionally to the score of the first selection, i.e. the selector quits when current_score / first_score < threshold. Stored in self.score_threshold_type.

  • progress_bar (bool, default=False) – option to use tqdm progress bar to monitor selections. Stored in self.report_progress.

  • full (bool, default=False) – In the case that all non-redundant selections are exhausted, choose randomly from the remaining samples. Stored in self.full.

  • random_state (int or numpy.random.RandomState instance, default=0)

X_current_#

The original matrix orthogonalized by previous selections

Type:

numpy.ndarray (n_samples, n_features)

n_selected_#

Counter tracking the number of selections that have been made

Type:

int

X_selected_#

Matrix containing the selected samples, for use in fitting

Type:

numpy.ndarray

y_selected_#

In sample selection, the matrix containing the selected targets, for use in fitting

Type:

numpy.ndarray

pi_#

the importance score; see _compute_pi()

Type:

numpy.ndarray (n_samples)

selected_idx_#

indices of selected samples

Type:

numpy.ndarray

Examples

>>> from skmatter.sample_selection import CUR
>>> import numpy as np
>>> selector = CUR(n_to_select=2, random_state=0)
>>> X = np.array(
...     [
...         [0.12, 0.21, -0.11],  # 4 samples, 3 features
...         [-0.09, 0.32, 0.51],
...         [-0.03, 0.53, 0.14],
...         [-0.83, -0.13, 0.82],
...     ]
... )
>>> selector.fit(X)
CUR(n_to_select=2)
>>> np.round(selector.pi_, 2)  # importance score
array([0.01, 0.99, 0.  , 0.  ])
>>> selector.selected_idx_  # selected idx
array([3, 2])
>>> # selector.transform(X) cannot be used, since the sklearn API
>>> # restricts transformers from changing the number of samples.
>>> # So one has to do
>>> X[selector.selected_idx_]
array([[-0.83, -0.13,  0.82],
       [-0.03,  0.53,  0.14]])
_compute_pi(X, y=None)#

For feature selection, the importance score \(\pi\) is the sum over the squares of the first \(k\) components of the right singular vectors

\[\pi_j = \sum_i^k \left(\mathbf{U}_\mathbf{C}\right)_{ij}^2.\]

where \(\mathbf{C} = \mathbf{X}^T\mathbf{X}\).

For sample selection, the importance score \(\pi\) is the sum over the squares of the first \(k\) components of the left singular vectors

\[\pi_j = \sum_i^k \left(\mathbf{U}_\mathbf{K}\right)_{ij}^2.\]

where \(\mathbf{K} = \mathbf{X}\mathbf{X}^T\).

Parameters:
  • X (numpy.ndarray of shape [n_samples, n_features]) – The input samples.

  • y (ignored)

Returns:

pi (numpy.ndarray of (n_to_select_from_)) – \(\pi\) importance for the given samples or features

fit(X, y=None, warm_start=False)#

Learn the samples to select.

Parameters:
  • X (numpy.ndarray of shape (n_samples, n_features)) – Training vectors.

  • y (numpy.ndarray of shape (n_samples,), default=None) – Target values.

  • warm_start (bool) – Whether the fit should continue after having already run, after increasing n_to_select. Assumes it is called with the same X and y

Returns:

self (object)

fit_transform(X, y=None, **fit_params)#

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new (ndarray array of shape (n_samples, n_features_new)) – Transformed array.

get_feature_names_out(input_features=None)#

Mask feature names according to selected features.

Parameters:

input_features (array-like of str or None, default=None) –

Input features.

  • If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: [“x0”, “x1”, …, “x(n_features_in_ - 1)”].

  • If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns:

feature_names_out (ndarray of str objects) – Transformed feature names.

get_metadata_routing()#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing (MetadataRequest) – A MetadataRequest encapsulating routing information.

get_params(deep=True)#

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params (dict) – Parameter names mapped to their values.

get_support(indices=False, ordered=False)#

Get a mask, or integer index, of the subset

Parameters:
  • indices (bool, default=False) – If True, the return value will be an array of integers, rather than a bool mask.

  • ordered (bool, default=False) – With indices, if True, the return value will be an array of integers, rather than a bool mask, in the order in which they were selected.

Returns:

support (An index that selects the retained subset from the original vectors.) – If indices is False, this is a bool array of shape [# input], in which an element is True iff its corresponding feature or sample is selected for retention. If indices is True, this is an integer array of shape [# n_to_select] whose values are indices into the input vectors.

inverse_transform(X)#

Reverse the transformation operation.

Parameters:

X (array of shape [n_samples, n_selected_features]) – The input samples.

Returns:

X_original (array of shape [n_samples, n_original_features]) – X with columns of zeros inserted where features would have been removed by transform().

score(X, y=None)#

Returns the importance score of the given samples or features.

Note

This function does not compute the importance score each time it is called, in order to avoid unnecessary computations. This is done by self._compute_pi().

Parameters:
  • X (ignored)

  • y (ignored)

Returns:

score (numpy.ndarray of (n_to_select_from_)) – \(\pi\) importance for the given samples or features

set_fit_request(*, warm_start: bool | None | str = '$UNCHANGED$') → CUR#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

warm_start (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for warm_start parameter in fit.

Returns:

self (object) – The updated object.

set_output(*, transform=None)#

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform ({"default", "pandas", "polars"}, default=None) –

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • “polars”: Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: “polars” option was added.

Returns:

self (estimator instance) – Estimator instance.

set_params(**params)#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self (estimator instance) – Estimator instance.

transform(X, y=None)#

Reduce X to the selected samples.

Parameters:
  • X (numpy.ndarray of shape [n_samples, n_features]) – The input samples.

  • y (ignored)

Returns:

X_r (numpy.ndarray) – The selected subset of the input.

PCov-CUR#

PCov-CUR extends upon CUR by using augmented right or left singular vectors inspired by Principal Covariates Regression, as demonstrated in [Cersonsky2021]. These methods employ the modified kernel and covariance matrices introduced in PCovR and available via the Utility Classes.

Again, the feature and sample selection versions of PCov-CUR differ only in the computation of \(\pi\).
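As a hedged numpy sketch of the modified covariance \(\mathbf{\tilde{C}}\) and Gram \(\mathbf{\tilde{K}}\) matrices defined in _compute_pi() below, assuming a least-squares approximation \(\mathbf{\hat{Y}}\) of the targets and a pseudo-inverse square root (the utility classes handle these details internally):

>>> import numpy as np
>>> X = np.array([[0.12, 0.21, 0.02], [-0.09, 0.32, -0.10], [-0.03, -0.53, 0.08]])
>>> y = np.array([[0.0], [0.0], [1.0]])
>>> alpha = 0.5  # the mixing parameter
>>> C = X.T @ X
>>> Y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]  # regression approximation of y
>>> evals, evecs = np.linalg.eigh(C)
>>> # pseudo-inverse square root of C, guarding against (near-)zero eigenvalues
>>> inv_sqrt = np.where(evals > 1e-12, evals, np.inf) ** -0.5
>>> C_inv_sqrt = evecs @ np.diag(inv_sqrt) @ evecs.T
>>> C_tilde = alpha * C + (1 - alpha) * (
...     C_inv_sqrt @ X.T @ Y_hat @ Y_hat.T @ X @ C_inv_sqrt
... )
>>> K_tilde = alpha * X @ X.T + (1 - alpha) * Y_hat @ Y_hat.T
>>> C_tilde.shape, K_tilde.shape
((3, 3), (3, 3))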

class skmatter.feature_selection.PCovCUR(mixing=0.5, recompute_every=1, k=1, tolerance=1e-12, n_to_select=None, score_threshold=None, score_threshold_type='absolute', progress_bar=False, full=False, random_state=0)[source]#

Transformer that performs Greedy Feature Selection by choosing features that maximize the importance score \(\pi\), defined as the sum over the squares of the first \(k\) components of the PCovR-modified right singular vectors.

Parameters:
  • recompute_every (int) – number of steps after which to recompute the pi score. Defaults to 1; if 0, no recomputation is done.

  • k (int) – number of eigenvectors to compute the importance score with, defaults to 1

  • tolerance (float) – threshold below which scores will be considered 0, defaults to 1e-12

  • mixing (float, default=0.5) – The PCovR mixing parameter, as described in PCovR as \({\alpha}\). Stored in self.mixing.

  • n_to_select (int or float, default=None) – The number of selections to make. If None, half of the features are selected. If integer, the parameter is the absolute number of selections to make. If float between 0 and 1, it is the fraction of the total dataset to select. Stored in self.n_to_select.

  • score_threshold (float, default=None) – Threshold for the score. If None selection will continue until the n_to_select is chosen. Otherwise will stop when the score falls below the threshold. Stored in self.score_threshold.

  • score_threshold_type (str, default="absolute") – How to interpret the score_threshold. When “absolute”, the score used by the selector is compared to the threshold directly. When “relative”, at each iteration, the score used by the selector is compared proportionally to the score of the first selection, i.e. the selector quits when current_score / first_score < threshold. Stored in self.score_threshold_type.

  • progress_bar (bool, default=False) – option to use tqdm progress bar to monitor selections. Stored in self.report_progress.

  • full (bool, default=False) – In the case that all non-redundant selections are exhausted, choose randomly from the remaining features. Stored in self.full.

  • random_state (int or numpy.random.RandomState instance, default=0)

X_current_#

The original matrix orthogonalized by previous selections

Type:

numpy.ndarray (n_samples, n_features)

y_current_#

The targets orthogonalized by a regression on the previous selections.

Type:

numpy.ndarray (n_samples, n_properties)

n_selected_#

Counter tracking the number of selections that have been made

Type:

int

X_selected_#

Matrix containing the selected features, for use in fitting

Type:

numpy.ndarray

pi_#

the importance score; see _compute_pi()

Type:

numpy.ndarray (n_features)

selected_idx_#

indices of selected features

Type:

numpy.ndarray

Examples

>>> from skmatter.feature_selection import PCovCUR
>>> import numpy as np
>>> selector = PCovCUR(n_to_select=2, mixing=0.5, random_state=0)
>>> X = np.array(
...     [
...         [0.12, 0.21, 0.02],  # 3 samples, 3 features
...         [-0.09, 0.32, -0.10],
...         [-0.03, -0.53, 0.08],
...     ]
... )
>>> y = np.array([0.0, 0.0, 1.0])  # classes of each sample
>>> selector.fit(X, y)
PCovCUR(n_to_select=2)
>>> Xr = selector.transform(X)
>>> print(Xr.shape)
(3, 2)
>>> np.round(selector.pi_)  # importance score
array([0., 0., 0.])
>>> selector.selected_idx_
array([1, 0])
_compute_pi(X, y=None)#

For feature selection, the importance score \(\pi\) is the sum over the squares of the first \(k\) components of the right singular vectors.

\[\pi_j = \sum_i^k \left(\mathbf{U}_\mathbf{\tilde{C}}\right)_{ij}^2.\]

where \({\mathbf{\tilde{C}} = \alpha \mathbf{X}^T\mathbf{X} + (1 - \alpha)(\mathbf{X}^T\mathbf{X})^{-1/2}\mathbf{X}^T \mathbf{\hat{Y}\hat{Y}}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1/2}}\) for some mixing parameter \({\alpha}\). When \({\alpha = 1}\), this defaults to the covariance matrix \({\mathbf{C} = \mathbf{X}^T\mathbf{X}}\) used in CUR.

For sample selection, the importance score \(\pi\) is the sum over the squares of the first \(k\) components of the left singular vectors

\[\pi_j = \sum_i^k \left(\mathbf{U}_\mathbf{\tilde{K}}\right)_{ij}^2.\]

where \({\mathbf{\tilde{K}} = \alpha \mathbf{XX}^T + (1 - \alpha)\mathbf{\hat{Y}\hat{Y}}^T}\) for some mixing parameter \({\alpha}\). When \({\alpha = 1}\), this defaults to the Gram matrix \({\mathbf{K} = \mathbf{X}\mathbf{X}^T}\).

Parameters:
  • X (numpy.ndarray of shape [n_samples, n_features]) – The input samples.

  • y (ignored)

Returns:

pi (numpy.ndarray of (n_to_select_from_)) – \(\pi\) importance for the given samples or features

fit(X, y=None, warm_start=False)#

Learn the features to select.

Parameters:
  • X (numpy.ndarray of shape (n_samples, n_features)) – Training vectors.

  • y (numpy.ndarray of shape (n_samples,), default=None) – Target values.

  • warm_start (bool) – Whether the fit should continue after having already run, after increasing n_to_select. Assumes it is called with the same X and y

Returns:

self (object)

fit_transform(X, y=None, **fit_params)#

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new (ndarray array of shape (n_samples, n_features_new)) – Transformed array.

get_feature_names_out(input_features=None)#

Mask feature names according to selected features.

Parameters:

input_features (array-like of str or None, default=None) –

Input features.

  • If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: [“x0”, “x1”, …, “x(n_features_in_ - 1)”].

  • If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns:

feature_names_out (ndarray of str objects) – Transformed feature names.

get_metadata_routing()#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing (MetadataRequest) – A MetadataRequest encapsulating routing information.

get_params(deep=True)#

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params (dict) – Parameter names mapped to their values.

get_support(indices=False, ordered=False)#

Get a mask, or integer index, of the subset

Parameters:
  • indices (bool, default=False) – If True, the return value will be an array of integers, rather than a bool mask.

  • ordered (bool, default=False) – With indices, if True, the return value will be an array of integers, rather than a bool mask, in the order in which they were selected.

Returns:

support (An index that selects the retained subset from the original vectors.) – If indices is False, this is a bool array of shape [# input], in which an element is True iff its corresponding feature or sample is selected for retention. If indices is True, this is an integer array of shape [# n_to_select] whose values are indices into the input vectors.

inverse_transform(X)#

Reverse the transformation operation.

Parameters:

X (array of shape [n_samples, n_selected_features]) – The input samples.

Returns:

X_original (array of shape [n_samples, n_original_features]) – X with columns of zeros inserted where features would have been removed by transform().

score(X, y=None)#

Returns the importance score of the given samples or features.

Note

This function does not compute the importance score each time it is called, in order to avoid unnecessary computations. This is done by self._compute_pi().

Parameters:
  • X (ignored)

  • y (ignored)

Returns:

score (numpy.ndarray of (n_to_select_from_)) – \(\pi\) importance for the given samples or features

set_fit_request(*, warm_start: bool | None | str = '$UNCHANGED$') → PCovCUR#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

warm_start (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for warm_start parameter in fit.

Returns:

self (object) – The updated object.

set_output(*, transform=None)#

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform ({"default", "pandas", "polars"}, default=None) –

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • “polars”: Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: “polars” option was added.

Returns:

self (estimator instance) – Estimator instance.

set_params(**params)#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self (estimator instance) – Estimator instance.

transform(X, y=None)#

Reduce X to the selected features.

Parameters:
  • X (numpy.ndarray of shape [n_samples, n_features]) – The input samples.

  • y (ignored)

Returns:

X_r (numpy.ndarray) – The selected subset of the input.

class skmatter.sample_selection.PCovCUR(mixing=0.5, recompute_every=1, k=1, tolerance=1e-12, n_to_select=None, score_threshold=None, score_threshold_type='absolute', progress_bar=False, full=False, random_state=0)[source]#

Transformer that performs Greedy Sample Selection by choosing samples that maximize the importance score \(\pi\), defined as the sum over the squares of the first \(k\) components of the PCovR-modified left singular vectors.

Parameters:
  • mixing (float, default=0.5) – The PCovR mixing parameter, as described in PCovR as \({\alpha}\). Stored in self.mixing.

  • recompute_every (int) – number of steps after which to recompute the pi score. Defaults to 1; if 0, no recomputation is done.

  • k (int) – number of eigenvectors to compute the importance score with, defaults to 1

  • tolerance (float) – threshold below which scores will be considered 0, defaults to 1e-12

  • n_to_select (int or float, default=None) – The number of selections to make. If None, half of the samples are selected. If integer, the parameter is the absolute number of selections to make. If float between 0 and 1, it is the fraction of the total dataset to select. Stored in self.n_to_select.

  • score_threshold (float, default=None) – Threshold for the score. If None selection will continue until the n_to_select is chosen. Otherwise will stop when the score falls below the threshold. Stored in self.score_threshold.

  • score_threshold_type (str, default="absolute") – How to interpret the score_threshold. When “absolute”, the score used by the selector is compared to the threshold directly. When “relative”, at each iteration, the score used by the selector is compared proportionally to the score of the first selection, i.e. the selector quits when current_score / first_score < threshold. Stored in self.score_threshold_type.

  • progress_bar (bool, default=False) – option to use tqdm progress bar to monitor selections. Stored in self.report_progress.

  • full (bool, default=False) – In the case that all non-redundant selections are exhausted, choose randomly from the remaining samples. Stored in self.full.

  • random_state (int or numpy.random.RandomState instance, default=0)

X_current_#

The original matrix orthogonalized by previous selections

Type:

numpy.ndarray (n_samples, n_features)

y_current_#

The targets orthogonalized by a regression on the previous selections.

Type:

numpy.ndarray (n_samples, n_properties)

n_selected_#

Counter tracking the number of selections that have been made

Type:

int

X_selected_#

Matrix containing the selected samples, for use in fitting

Type:

numpy.ndarray

y_selected_#

In sample selection, the matrix containing the selected targets, for use in fitting

Type:

numpy.ndarray

pi_#

the importance score; see _compute_pi()

Type:

numpy.ndarray (n_samples)

selected_idx_#

indices of selected samples

Type:

numpy.ndarray

Examples

>>> from skmatter.sample_selection import PCovCUR
>>> import numpy as np
>>> selector = PCovCUR(n_to_select=2, random_state=0)
>>> X = np.array(
...     [
...         [0.12, 0.21, 0.02],  # 3 samples, 3 features
...         [-0.09, 0.32, -0.10],
...         [-0.03, -0.53, 0.08],
...     ]
... )
>>> y = np.array([0.0, 0.0, 1.0])  # classes of each sample
>>> selector.fit(X, y)
PCovCUR(n_to_select=2)
>>> np.round(selector.pi_, 2)  # importance score
array([1., 0., 0.])
>>> selector.selected_idx_  # selected indices
array([2, 1])
>>> # selector.transform(X) cannot be used, since the sklearn API
>>> # restricts transformers from changing the number of samples.
>>> # So one has to do
>>> X[selector.selected_idx_]
array([[-0.03, -0.53,  0.08],
       [-0.09,  0.32, -0.1 ]])
_compute_pi(X, y=None)#

For feature selection, the importance score \(\pi\) is the sum over the squares of the first \(k\) components of the right singular vectors.

\[\pi_j = \sum_i^k \left(\mathbf{U}_\mathbf{\tilde{C}}\right)_{ij}^2.\]

where \({\mathbf{\tilde{C}} = \alpha \mathbf{X}^T\mathbf{X} + (1 - \alpha)(\mathbf{X}^T\mathbf{X})^{-1/2}\mathbf{X}^T \mathbf{\hat{Y}\hat{Y}}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1/2}}\) for some mixing parameter \({\alpha}\). When \({\alpha = 1}\), this defaults to the covariance matrix \({\mathbf{C} = \mathbf{X}^T\mathbf{X}}\) used in CUR.

For sample selection, the importance score \(\pi\) is the sum over the squares of the first \(k\) components of the left singular vectors

\[\pi_j = \sum_i^k \left(\mathbf{U}_\mathbf{\tilde{K}}\right)_{ij}^2.\]

where \({\mathbf{\tilde{K}} = \alpha \mathbf{XX}^T + (1 - \alpha)\mathbf{\hat{Y}\hat{Y}}^T}\) for some mixing parameter \({\alpha}\). When \({\alpha = 1}\), this defaults to the Gram matrix \({\mathbf{K} = \mathbf{X}\mathbf{X}^T}\).

Parameters:
  • X (numpy.ndarray of shape [n_samples, n_features]) – The input samples.

  • y (ignored)

Returns:

pi (numpy.ndarray of (n_to_select_from_)) – \(\pi\) importance for the given samples or features

fit(X, y=None, warm_start=False)#

Learn the samples to select.

Parameters:
  • X (numpy.ndarray of shape (n_samples, n_features)) – Training vectors.

  • y (numpy.ndarray of shape (n_samples,), default=None) – Target values.

  • warm_start (bool) – Whether the fit should continue after having already run, after increasing n_to_select. Assumes it is called with the same X and y

Returns:

self (object)

fit_transform(X, y=None, **fit_params)#

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new (ndarray array of shape (n_samples, n_features_new)) – Transformed array.

get_feature_names_out(input_features=None)#

Mask feature names according to selected features.

Parameters:

input_features (array-like of str or None, default=None) –

Input features.

  • If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: [“x0”, “x1”, …, “x(n_features_in_ - 1)”].

  • If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns:

feature_names_out (ndarray of str objects) – Transformed feature names.

get_metadata_routing()#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing (MetadataRequest) – A MetadataRequest encapsulating routing information.

get_params(deep=True)#

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params (dict) – Parameter names mapped to their values.

get_support(indices=False, ordered=False)#

Get a mask, or integer index, of the subset

Parameters:
  • indices (bool, default=False) – If True, the return value will be an array of integers, rather than a bool mask.

  • ordered (bool, default=False) – With indices, if True, the return value will be an array of integers, rather than a bool mask, in the order in which they were selected.

Returns:

support (An index that selects the retained subset from the original vectors.) – If indices is False, this is a bool array of shape [# input], in which an element is True iff its corresponding feature or sample is selected for retention. If indices is True, this is an integer array of shape [# n_to_select] whose values are indices into the input vectors.

inverse_transform(X)#

Reverse the transformation operation.

Parameters:

X (array of shape [n_samples, n_selected_features]) – The input samples.

Returns:

X_original (array of shape [n_samples, n_original_features]) – X with columns of zeros inserted where features would have been removed by transform().

score(X, y=None)#

Returns the importance score of the given samples or features.

Note

This function does not compute the importance score each time it is called, in order to avoid unnecessary computations. This is done by self._compute_pi().

Parameters:
  • X (ignored)

  • y (ignored)

Returns:

score (numpy.ndarray of (n_to_select_from_)) – \(\pi\) importance for the given samples or features

set_fit_request(*, warm_start: bool | None | str = '$UNCHANGED$') → PCovCUR#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

warm_start (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for warm_start parameter in fit.

Returns:

self (object) – The updated object.

set_output(*, transform=None)#

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform ({"default", "pandas", "polars"}, default=None) –

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • “polars”: Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: “polars” option was added.

Returns:

self (estimator instance) – Estimator instance.

set_params(**params)#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self (estimator instance) – Estimator instance.

transform(X, y=None)#

Reduce X to the selected samples.

Parameters:
  • X (numpy.ndarray of shape [n_samples, n_features]) – The input samples.

  • y (ignored)

Returns:

X_r (numpy.ndarray) – The selected subset of the input.

Farthest Point-Sampling (FPS)#

Farthest Point Sampling is a common selection technique intended to exploit the diversity of the input space.

In FPS, the selection of the first point is made at random or by a separate metric. Each subsequent selection is made to maximize the Hausdorff distance, i.e. the minimum distance between a point and all previous selections. It is common to use the Euclidean distance, though other distance metrics may be employed.
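The greedy rule is compact enough to sketch in plain numpy (illustrative only; assumes Euclidean distances and a fixed first index, unlike the library's optimized implementation):

>>> import numpy as np
>>> X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
>>> selected = [0]  # first selection, here fixed by index
>>> hausdorff = np.linalg.norm(X - X[0], axis=1)  # distance of each point to the set
>>> for _ in range(2):
...     i = int(np.argmax(hausdorff))  # farthest point from all previous selections
...     selected.append(i)
...     hausdorff = np.minimum(hausdorff, np.linalg.norm(X - X[i], axis=1))
>>> selected
[0, 3, 1]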

Similar to CUR, the feature and sample selection versions of FPS differ only in the way distance is computed (feature selection does so column-wise, sample selection does so row-wise), and are built off of the same base class.

These selectors can be instantiated using skmatter.feature_selection.FPS and skmatter.sample_selection.FPS.

class skmatter.feature_selection.FPS(initialize=0, n_to_select=None, score_threshold=None, score_threshold_type='absolute', progress_bar=False, full=False, random_state=0)[source]#

Transformer performing Greedy Feature Selection using Farthest Point Sampling.

Parameters:
  • initialize (int, list of int, numpy.ndarray of int, or 'random', default=0) – Index of the first selection(s). If ‘random’, picks a random value when fit starts. Stored in self.initialize.

  • n_to_select (int or float, default=None) – The number of selections to make. If None, half of the features are selected. If integer, the parameter is the absolute number of selections to make. If float between 0 and 1, it is the fraction of the total dataset to select. Stored in self.n_to_select.

  • score_threshold (float, default=None) – Threshold for the score. If None selection will continue until the n_to_select is chosen. Otherwise will stop when the score falls below the threshold. Stored in self.score_threshold.

  • score_threshold_type (str, default="absolute") – How to interpret the score_threshold. When “absolute”, the score used by the selector is compared to the threshold directly. When “relative”, at each iteration, the score used by the selector is compared proportionally to the score of the first selection, i.e. the selector quits when current_score / first_score < threshold. Stored in self.score_threshold_type.

  • progress_bar (bool, default=False) – option to use tqdm progress bar to monitor selections. Stored in self.report_progress.

  • full (bool, default=False) – In the case that all non-redundant selections are exhausted, choose randomly from the remaining features. Stored in self.full.

  • random_state (int or numpy.random.RandomState instance, default=0)

n_selected_#

Counter tracking the number of selections that have been made

Type:

int

X_selected_#

Matrix containing the selected features, for use in fitting

Type:

numpy.ndarray

selected_idx_#

indices of selected features

Type:

numpy.ndarray

Examples

>>> from skmatter.feature_selection import FPS
>>> import numpy as np
>>> selector = FPS(
...     n_to_select=2,
...     # int or 'random', default=0
...     # Index of the first selection.
...     # If "random", picks a random value when fit starts.
...     initialize=0,
... )
>>> X = np.array(
...     [
...         [0.12, 0.21, 0.02],  # 3 samples, 3 features
...         [-0.09, 0.32, -0.10],
...         [-0.03, -0.53, 0.08],
...     ]
... )
>>> selector.fit(X)
FPS(n_to_select=2)
>>> Xr = selector.transform(X)
>>> selector.selected_idx_
array([0, 1])
fit(X, y=None, warm_start=False)#

Learn the features to select.

Parameters:
  • X (numpy.ndarray of shape (n_samples, n_features)) – Training vectors.

  • y (numpy.ndarray of shape (n_samples,), default=None) – Target values.

  • warm_start (bool) – Whether the fit should continue after having already run, after increasing n_to_select. Assumes it is called with the same X and y

Returns:

self (object)

fit_transform(X, y=None, **fit_params)#

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new (ndarray array of shape (n_samples, n_features_new)) – Transformed array.

get_distance()#

Traditional FPS employs a column-wise Euclidean distance for feature selection, which can be expressed using the covariance matrix \(\mathbf{C} = \mathbf{X} ^ T \mathbf{X}\)

\[\operatorname{d}_c(i, j) = C_{ii} - 2 C_{ij} + C_{jj}.\]

For sample selection, this is a row-wise Euclidean distance, which can be expressed in terms of the Gram matrix \(\mathbf{K} = \mathbf{X} \mathbf{X} ^ T\)

\[\operatorname{d}_r(i, j) = K_{ii} - 2 K_{ij} + K_{jj}.\]
Returns:

hausdorff (numpy.ndarray of shape (n_to_select_from_)) – the minimum distance from each point to the set of selected points. Once a point is selected, its distance is no longer updated; the final list will reflect the distances at the time of selection.
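A brief usage sketch, reusing the fitted feature selector from the Examples above (assumed workflow):

>>> d = selector.get_distance()  # one entry per candidate feature
>>> d.shape
(3,)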

get_feature_names_out(input_features=None)#

Mask feature names according to selected features.

Parameters:

input_features (array-like of str or None, default=None) –

Input features.

  • If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: [“x0”, “x1”, …, “x(n_features_in_ - 1)”].

  • If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns:

feature_names_out (ndarray of str objects) – Transformed feature names.

get_metadata_routing()#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing (MetadataRequest) – A MetadataRequest encapsulating routing information.

get_params(deep=True)#

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params (dict) – Parameter names mapped to their values.

get_select_distance()#
Returns:

hausdorff_at_select (numpy.ndarray of shape (n_to_select)) – At the time of selection, the minimum distance from each selected point to the set of previously selected points.

get_support(indices=False, ordered=False)#

Get a mask, or integer index, of the subset

Parameters:
  • indices (bool, default=False) – If True, the return value will be an array of integers, rather than a bool mask.

  • ordered (bool, default=False) – With indices, if True, the return value will be an array of integers, rather than a bool mask, in the order in which they were selected.

Returns:

support (An index that selects the retained subset from the original vectors.) – If indices is False, this is a bool array of shape [# input], in which an element is True iff its corresponding feature or sample is selected for retention. If indices is True, this is an integer array of shape [# n_to_select] whose values are indices into the input vectors.

inverse_transform(X)#

Reverse the transformation operation.

Parameters:

X (array of shape [n_samples, n_selected_features]) – The input samples.

Returns:

X_original (array of shape [n_samples, n_original_features]) – X with columns of zeros inserted where features would have been removed by transform().

score(X, y=None)#

Returns the Hausdorff distances of all samples to previous selections

NOTE: This function does not compute the importance score each time it is called, in order to avoid unnecessary computations. The Hausdorff distance is updated in self._update_hausdorff().

Parameters:
  • X (ignored)

  • y (ignored)

Returns:

hausdorff (Hausdorff distances)

set_fit_request(*, warm_start: bool | None | str = '$UNCHANGED$') → FPS#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

warm_start (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for warm_start parameter in fit.

Returns:

self (object) – The updated object.

set_output(*, transform=None)#

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform ({"default", "pandas", "polars"}, default=None) –

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • “polars”: Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: “polars” option was added.

Returns:

self (estimator instance) – Estimator instance.

set_params(**params)#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self (estimator instance) – Estimator instance.

transform(X, y=None)#

Reduce X to the selected features.

Parameters:
  • X (numpy.ndarray of shape [n_samples, n_features]) – The input samples.

  • y (ignored)

Returns:

X_r (numpy.ndarray) – The selected subset of the input.

class skmatter.sample_selection.FPS(initialize=0, n_to_select=None, score_threshold=None, score_threshold_type='absolute', progress_bar=False, full=False, random_state=0)[source]#

Transformer performing Greedy Sample Selection using Farthest Point Sampling.

Parameters:
  • initialize (int, list of int, numpy.ndarray of int, or 'random', default=0) – Index of the first selection(s). If ‘random’, picks a random value when fit starts. Stored in self.initialize.

  • n_to_select (int or float, default=None) – The number of selections to make. If None, half of the samples are selected. If integer, the parameter is the absolute number of selections to make. If float between 0 and 1, it is the fraction of the total dataset to select. Stored in self.n_to_select.

  • score_threshold (float, default=None) – Threshold for the score. If None selection will continue until the n_to_select is chosen. Otherwise will stop when the score falls below the threshold. Stored in self.score_threshold.

  • score_threshold_type (str, default="absolute") – How to interpret the score_threshold. When “absolute”, the score used by the selector is compared to the threshold directly. When “relative”, at each iteration, the score used by the selector is compared proportionally to the score of the first selection, i.e. the selector quits when current_score / first_score < threshold. Stored in self.score_threshold_type.

  • progress_bar (bool, default=False) – Option to use a tqdm progress bar to monitor selections. Stored in self.report_progress.

  • full (bool, default=False) – In the case that all non-redundant selections are exhausted, choose randomly from the remaining samples. Stored in self.full.

  • random_state (int or numpy.random.RandomState instance, default=0)

n_selected_#

Counter tracking the number of selections that have been made

Type:

int

X_selected_#

Matrix containing the selected samples, for use in fitting

Type:

numpy.ndarray

y_selected_#

In sample selection, the matrix containing the selected targets, for use in fitting.

Type:

numpy.ndarray

selected_idx_#

Indices of selected samples

Type:

numpy.ndarray

Examples

>>> from skmatter.sample_selection import FPS
>>> import numpy as np
>>> selector = FPS(
...     n_to_select=2,
...     # int or 'random', default=0
...     # Index of the first selection.
...     # If "random", picks a random value when fit starts.
...     initialize=0,
... )
>>> X = np.array(
...     [
...         [0.12, 0.21, 0.02],  # 3 samples, 3 features
...         [-0.09, 0.32, -0.10],
...         [-0.03, -0.53, 0.08],
...     ]
... )
>>> selector.fit(X)
FPS(n_to_select=2)
>>> selector.selected_idx_
array([0, 2])
fit(X, y=None, warm_start=False)#

Learn the samples to select.

Parameters:
  • X (numpy.ndarray of shape (n_samples, n_features)) – Training vectors.

  • y (numpy.ndarray of shape (n_samples,), default=None) – Target values.

  • warm_start (bool) – Whether the fit should continue after having already run, after increasing n_to_select. Assumes it is called with the same X and y

Returns:

self (object)
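
A hedged sketch of warm-starting, continuing the example above: after fitting with n_to_select=2, one can raise n_to_select and resume the selection instead of restarting it:

>>> selector.n_to_select = 3  # hypothetical: request one more selection
>>> selector = selector.fit(X, warm_start=True)  # resumes from the previous run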

fit_transform(X, y=None, **fit_params)#

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new (ndarray array of shape (n_samples, n_features_new)) – Transformed array.

get_distance()#

Traditional FPS employs a column-wise Euclidean distance for feature selection, which can be expressed using the covariance matrix \(\mathbf{C} = \mathbf{X} ^ T \mathbf{X}\)

\[\operatorname{d}_c(i, j) = C_{ii} - 2 C_{ij} + C_{jj}.\]

For sample selection, this is a row-wise Euclidean distance, which can be expressed in terms of the Gram matrix \(\mathbf{K} = \mathbf{X} \mathbf{X} ^ T\)

\[\operatorname{d}_r(i, j) = K_{ii} - 2 K_{ij} + K_{jj}.\]
Returns:

hausdorff (numpy.ndarray of shape (n_to_select_from_)) – The minimum distance from each point to the set of selected points. Once a point is selected, its distance is no longer updated; the final list reflects the distance of each point at the time it was selected.
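
For instance (a sketch continuing the example above), the stored distances can be inspected after fitting:

>>> selector = FPS(n_to_select=2).fit(X)
>>> hausdorff = selector.get_distance()
>>> hausdorff.shape
(3,)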

get_feature_names_out(input_features=None)#

Mask feature names according to selected features.

Parameters:

input_features (array-like of str or None, default=None) –

Input features.

  • If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: [“x0”, “x1”, …, “x(n_features_in_ - 1)”].

  • If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns:

feature_names_out (ndarray of str objects) – Transformed feature names.

get_metadata_routing()#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing (MetadataRequest) – A MetadataRequest encapsulating routing information.

get_params(deep=True)#

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params (dict) – Parameter names mapped to their values.

get_select_distance()#
Returns:

hausdorff_at_select (numpy.ndarray of shape (n_to_select)) – At the time of selection, the minimum distance from each selected point to the set of previously selected points.

get_support(indices=False, ordered=False)#

Get a mask, or integer index, of the subset

Parameters:
  • indices (bool, default=False) – If True, the return value will be an array of integers, rather than a bool mask.

  • ordered (bool, default=False) – With indices, if True, the return value will be an array of integers, rather than a bool mask, in the order in which they were selected.

Returns:

support (An index that selects the retained subset from the original vectors.) – If indices is False, this is a bool array of shape [# input], in which an element is True iff its corresponding feature or sample is selected for retention. If indices is True, this is an integer array of shape [# n_to_select] whose values are indices into the input vectors.
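
For example (a sketch continuing the FPS example above, where samples 0 and 2 were selected):

>>> mask = selector.get_support()  # boolean mask of length 3
>>> idx = selector.get_support(indices=True)  # integer indices, here array([0, 2])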

inverse_transform(X)#

Reverse the transformation operation.

Parameters:

X (array of shape [n_samples, n_selected_features]) – The input samples.

Returns:

X_original (array of shape [n_samples, n_original_features]) – X with columns of zeros inserted where features would have been removed by transform().

score(X, y=None)#

Returns the Hausdorff distances of all samples to previous selections.

NOTE: This function does not recompute the importance score each time it is called, in order to avoid unnecessary computations. The Hausdorff distance is updated in self._update_hausdorff().

Parameters:
  • X (ignored)

  • y (ignored)

Returns:

hausdorff (Hausdorff distances)

set_fit_request(*, warm_start: bool | None | str = '$UNCHANGED$') → FPS#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

warm_start (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for warm_start parameter in fit.

Returns:

self (object) – The updated object.

set_output(*, transform=None)#

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform ({"default", "pandas", "polars"}, default=None) –

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • “polars”: Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: “polars” option was added.

Returns:

self (estimator instance) – Estimator instance.

set_params(**params)#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self (estimator instance) – Estimator instance.

transform(X, y=None)#

Reduce X to the selected features.

Parameters:
  • X (numpy.ndarray of shape [n_samples, n_features]) – The input samples.

  • y (ignored)

Returns:

X_r (numpy.ndarray) – The selected subset of the input.

PCov-FPS#

PCov-FPS extends upon FPS much like PCov-CUR does to CUR. Instead of using the Euclidean distance solely in the space of \(\mathbf{X}\), we use a combined distance in terms of \(\mathbf{X}\) and \(\mathbf{y}\).
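
For sample selection, for example, this amounts to computing the FPS distances from a PCovR-modified Gram matrix (a sketch following [Cersonsky2021], with \(\alpha\) the mixing parameter):

\[\mathbf{\tilde{K}} = \alpha \mathbf{X} \mathbf{X}^T + (1 - \alpha) \mathbf{y} \mathbf{y}^T,\]

so that \(\operatorname{d}_r(i, j) = \tilde{K}_{ii} - 2 \tilde{K}_{ij} + \tilde{K}_{jj}\) replaces the purely feature-space distance used in FPS.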

class skmatter.feature_selection.PCovFPS(mixing=0.5, initialize=0, n_to_select=None, score_threshold=None, score_threshold_type='absolute', progress_bar=False, full=False, random_state=0)[source]#

Transformer that performs Greedy Feature Selection using PCovR-weighted Farthest Point Sampling.

Parameters:
  • mixing (float, default=0.5) – The PCovR mixing parameter \({\alpha}\), as described in PCovR.

  • initialize (int or 'random', default=0) – Index of the first selection. If ‘random’, picks a random value when fit starts.

  • n_to_select (int or float, default=None) – The number of selections to make. If None, half of the features are selected. If integer, the parameter is the absolute number of selections to make. If float between 0 and 1, it is the fraction of the total dataset to select. Stored in self.n_to_select.

  • score_threshold (float, default=None) – Threshold for the score. If None selection will continue until the n_to_select is chosen. Otherwise will stop when the score falls below the threshold. Stored in self.score_threshold.

  • score_threshold_type (str, default="absolute") – How to interpret the score_threshold. When “absolute”, the score used by the selector is compared to the threshold directly. When “relative”, at each iteration, the score used by the selector is compared proportionally to the score of the first selection, i.e. the selector quits when current_score / first_score < threshold. Stored in self.score_threshold_type.

  • progress_bar (bool, default=False) –

    option to use tqdm progress bar to monitor selections. Stored in self.report_progress.

  • full (bool, default=False) – In the case that all non-redundant selections are exhausted, choose randomly from the remaining features. Stored in self.full.

  • random_state (int or numpy.random.RandomState instance, default=0)

n_selected_#

Counter tracking the number of selections that have been made

Type:

int

X_selected_#

Matrix containing the selected features, for use in fitting

Type:

numpy.ndarray

Examples

>>> from skmatter.feature_selection import PCovFPS
>>> import numpy as np
>>> selector = PCovFPS(
...     n_to_select=2,
...     # int or 'random', default=0
...     # Index of the first selection.
...     # If 'random', picks a random value when fit starts.
...     initialize=0,
... )
>>> X = np.array(
...     [
...         [0.12, 0.21, 0.02],  # 3 samples, 3 features
...         [-0.09, 0.32, -0.10],
...         [-0.03, -0.53, 0.08],
...     ]
... )
>>> y = np.array([0.0, 0.0, 1.0])  # classes of each sample
>>> selector.fit(X, y)
PCovFPS(n_to_select=2)
>>> Xr = selector.transform(X)
>>> selector.selected_idx_
array([0, 1])
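>>> # Sanity check (a sketch): two of the three feature columns are kept
>>> print(Xr.shape)
(3, 2)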
fit(X, y=None, warm_start=False)#

Learn the features to select.

Parameters:
  • X (numpy.ndarray of shape (n_samples, n_features)) – Training vectors.

  • y (numpy.ndarray of shape (n_samples,), default=None) – Target values.

  • warm_start (bool) – Whether the fit should continue after having already run, after increasing n_to_select. Assumes it is called with the same X and y

Returns:

self (object)

fit_transform(X, y=None, **fit_params)#

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new (ndarray array of shape (n_samples, n_features_new)) – Transformed array.

get_distance()#
Returns:

hausdorff (numpy.ndarray of shape (n_to_select_from_)) – The minimum distance from each point to the set of selected points. Once a point is selected, its distance is no longer updated; the final list reflects the distance of each point at the time it was selected.

get_feature_names_out(input_features=None)#

Mask feature names according to selected features.

Parameters:

input_features (array-like of str or None, default=None) –

Input features.

  • If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: [“x0”, “x1”, …, “x(n_features_in_ - 1)”].

  • If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns:

feature_names_out (ndarray of str objects) – Transformed feature names.

get_metadata_routing()#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing (MetadataRequest) – A MetadataRequest encapsulating routing information.

get_params(deep=True)#

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params (dict) – Parameter names mapped to their values.

get_select_distance()#
Returns:

hausdorff_at_select (numpy.ndarray of shape (n_to_select)) – At the time of selection, the minimum distance from each selected point to the set of previously selected points.

get_support(indices=False, ordered=False)#

Get a mask, or integer index, of the subset

Parameters:
  • indices (bool, default=False) – If True, the return value will be an array of integers, rather than a bool mask.

  • ordered (bool, default=False) – With indices, if True, the return value will be an array of integers, rather than a bool mask, in the order in which they were selected.

Returns:

support (An index that selects the retained subset from the original vectors.) – If indices is False, this is a bool array of shape [# input], in which an element is True iff its corresponding feature or sample is selected for retention. If indices is True, this is an integer array of shape [# n_to_select] whose values are indices into the input vectors.

inverse_transform(X)#

Reverse the transformation operation.

Parameters:

X (array of shape [n_samples, n_selected_features]) – The input samples.

Returns:

X_original (array of shape [n_samples, n_original_features]) – X with columns of zeros inserted where features would have been removed by transform().

score(X, y=None)#

Returns the Hausdorff distances of all samples to previous selections.

NOTE: This function does not recompute the importance score each time it is called, in order to avoid unnecessary computations. The Hausdorff distance is updated in self._update_hausdorff().

Parameters:
  • X (ignored)

  • y (ignored)

Returns:

hausdorff (Hausdorff distances)

set_fit_request(*, warm_start: bool | None | str = '$UNCHANGED$') → PCovFPS#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

warm_start (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for warm_start parameter in fit.

Returns:

self (object) – The updated object.

set_output(*, transform=None)#

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform ({"default", "pandas", "polars"}, default=None) –

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • “polars”: Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: “polars” option was added.

Returns:

self (estimator instance) – Estimator instance.

set_params(**params)#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self (estimator instance) – Estimator instance.

transform(X, y=None)#

Reduce X to the selected features.

Parameters:
  • X (numpy.ndarray of shape [n_samples, n_features]) – The input samples.

  • y (ignored)

Returns:

X_r (numpy.ndarray) – The selected subset of the input.

class skmatter.sample_selection.PCovFPS(mixing=0.5, initialize=0, n_to_select=None, score_threshold=None, score_threshold_type='absolute', progress_bar=False, full=False, random_state=0)[source]#

Transformer performing Greedy Sample Selection using PCovR-weighted Farthest Point Sampling.

Parameters:
  • mixing (float, default=0.5) – The PCovR mixing parameter \({\alpha}\), as described in PCovR.

  • initialize (int or 'random', default=0) – Index of the first selection. If ‘random’, picks a random value when fit starts.

  • n_to_select (int or float, default=None) – The number of selections to make. If None, half of the samples are selected. If integer, the parameter is the absolute number of selections to make. If float between 0 and 1, it is the fraction of the total dataset to select. Stored in self.n_to_select.

  • score_threshold (float, default=None) – Threshold for the score. If None selection will continue until the n_to_select is chosen. Otherwise will stop when the score falls below the threshold. Stored in self.score_threshold.

  • score_threshold_type (str, default="absolute") – How to interpret the score_threshold. When “absolute”, the score used by the selector is compared to the threshold directly. When “relative”, at each iteration, the score used by the selector is compared proportionally to the score of the first selection, i.e. the selector quits when current_score / first_score < threshold. Stored in self.score_threshold_type.

  • progress_bar (bool, default=False) – Option to use a tqdm progress bar to monitor selections. Stored in self.report_progress.

  • full (bool, default=False) – In the case that all non-redundant selections are exhausted, choose randomly from the remaining samples. Stored in self.full.

  • random_state (int or numpy.random.RandomState instance, default=0)

n_selected_#

Counter tracking the number of selections that have been made

Type:

int

X_selected_#

Matrix containing the selected samples, for use in fitting

Type:

numpy.ndarray

y_selected_#

In sample selection, the matrix containing the selected targets, for use in fitting

Type:

numpy.ndarray

selected_idx_#

Indices of selected samples

Type:

numpy.ndarray

Examples

>>> from skmatter.sample_selection import PCovFPS
>>> import numpy as np
>>> selector = PCovFPS(
...     n_to_select=2,
...     # int or 'random', default=0
...     # Index of the first selection.
...     # If "random", picks a random value when fit starts.
...     initialize=0,
... )
>>> X = np.array(
...     [
...         [0.12, 0.21, 0.02],  # 3 samples, 3 features
...         [-0.09, 0.32, -0.10],
...         [-0.03, -0.53, 0.08],
...     ]
... )
>>> y = np.array([0.0, 0.0, 1.0])  # classes of each sample
>>> selector.fit(X, y)
PCovFPS(n_to_select=2)
>>> selector.selected_idx_
array([0, 2])
fit(X, y=None, warm_start=False)#

Learn the samples to select.

Parameters:
  • X (numpy.ndarray of shape (n_samples, n_features)) – Training vectors.

  • y (numpy.ndarray of shape (n_samples,), default=None) – Target values.

  • warm_start (bool) – Whether the fit should continue after having already run, after increasing n_to_select. Assumes it is called with the same X and y

Returns:

self (object)

fit_transform(X, y=None, **fit_params)#

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new (ndarray array of shape (n_samples, n_features_new)) – Transformed array.

get_distance()#
Returns:

hausdorff (numpy.ndarray of shape (n_to_select_from_)) – The minimum distance from each point to the set of selected points. Once a point is selected, its distance is no longer updated; the final list reflects the distance of each point at the time it was selected.

get_feature_names_out(input_features=None)#

Mask feature names according to selected features.

Parameters:

input_features (array-like of str or None, default=None) –

Input features.

  • If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: [“x0”, “x1”, …, “x(n_features_in_ - 1)”].

  • If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns:

feature_names_out (ndarray of str objects) – Transformed feature names.

get_metadata_routing()#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing (MetadataRequest) – A MetadataRequest encapsulating routing information.

get_params(deep=True)#

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params (dict) – Parameter names mapped to their values.

get_select_distance()#
Returns:

hausdorff_at_select (numpy.ndarray of shape (n_to_select)) – At the time of selection, the minimum distance from each selected point to the set of previously selected points.

get_support(indices=False, ordered=False)#

Get a mask, or integer index, of the subset

Parameters:
  • indices (bool, default=False) – If True, the return value will be an array of integers, rather than a bool mask.

  • ordered (bool, default=False) – With indices, if True, the return value will be an array of integers, rather than a bool mask, in the order in which they were selected.

Returns:

support (An index that selects the retained subset from the original vectors.) – If indices is False, this is a bool array of shape [# input], in which an element is True iff its corresponding feature or sample is selected for retention. If indices is True, this is an integer array of shape [# n_to_select] whose values are indices into the input vectors.

inverse_transform(X)#

Reverse the transformation operation.

Parameters:

X (array of shape [n_samples, n_selected_features]) – The input samples.

Returns:

X_original (array of shape [n_samples, n_original_features]) – X with columns of zeros inserted where features would have been removed by transform().

score(X, y=None)#

Returns the Hausdorff distances of all samples to previous selections.

NOTE: This function does not recompute the importance score each time it is called, in order to avoid unnecessary computations. The Hausdorff distance is updated in self._update_hausdorff().

Parameters:
  • X (ignored)

  • y (ignored)

Returns:

hausdorff (Hausdorff distances)

set_fit_request(*, warm_start: bool | None | str = '$UNCHANGED$') → PCovFPS#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

warm_start (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for warm_start parameter in fit.

Returns:

self (object) – The updated object.

set_output(*, transform=None)#

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform ({"default", "pandas", "polars"}, default=None) –

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • “polars”: Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: “polars” option was added.

Returns:

self (estimator instance) – Estimator instance.

set_params(**params)#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self (estimator instance) – Estimator instance.

transform(X, y=None)#

Reduce X to the selected features.

Parameters:
  • X (numpy.ndarray of shape [n_samples, n_features]) – The input samples.

  • y (ignored)

Returns:

X_r (numpy.ndarray) – The selected subset of the input.

Voronoi FPS#

class skmatter.sample_selection.VoronoiFPS(n_trial_calculation=4, full_fraction=None, initialize=0, **kwargs)[source]#

In FPS, points are selected based upon their Hausdorff distance to previous selections, i.e. the minimum distance between a given point and any previously selected points. This implicitly constructs a Voronoi tessellation which is updated with each new selection, as each unselected point “belongs” to the Voronoi polyhedron of the nearest previous selection.

This implicit tessellation enables a more efficient evaluation of FPS – at each iteration, we need only consider for selection those points at the boundaries of the Voronoi polyhedra, and when updating the tessellation we need only consider moving those points whose Hausdorff distance is greater than half of the distance between the corresponding Voronoi center and the newly selected point, per the triangle inequality.

[Figure: schematic of the Voronoi FPS algorithm]

To demonstrate the algorithm behind Voronoi FPS, let \(*_{m+1}\) be the newly chosen point and let \(j\) be a point in the polyhedron whose center \(v(j)\) was chosen earlier. From the triangle inequality one can easily see that if \(d(v(j), j) < d(*_{m+1}, v(j))/2\), then point \(j\) is guaranteed not to fall into the newly formed polyhedron, and its distance need not be recalculated.

This algorithm is particularly appealing when using a non-Euclidean or computationally-intensive distance metric, for which the decrease in computational time due to the reduction in distance calculations outweighs the increase from book-keeping. For simple metrics (such as Euclidean distance), VoronoiFPS will likely not accelerate, and may decelerate, computations when compared to FPS.

Parameters:
  • n_trial_calculation (int, default=4) – Number of test calculations used to determine the switching point between Voronoi FPS and traditional FPS (for details, see full_fraction).

  • full_fraction (float, default=None) – Proportion of calculated distances, relative to the total number of samples, at which the switch from Voronoi FPS to traditional FPS occurs. Beyond a certain number of distances to be calculated, Voronoi FPS becomes unreasonably expensive due to the cost of reading data from memory. The switching point depends on many conditions, and it is determined “in situ” for optimal use of the algorithm, with a few test calculations and memory operations.

Examples

>>> from skmatter.sample_selection import VoronoiFPS
>>> import numpy as np
>>> selector = VoronoiFPS(
...     n_to_select=2,
...     progress_bar=True,
...     score_threshold=1e-12,
...     full=False,
...     # n_trial_calculation used for calculation of full_fraction,
...     # so you need to determine only one parameter
...     n_trial_calculation=4,
...     full_fraction=0.45,
...     # int or 'random', default=0
...     # Index of the first selection.
...     # If 'random', picks a random value when fit starts.
...     initialize=0,
... )
>>> X = np.array(
...     [
...         [0.12, 0.21, 0.02],  # 3 samples, 3 features
...         [-0.09, 0.32, -0.10],
...         [-0.03, -0.53, 0.08],
...     ]
... )
>>> selector.fit(X)
VoronoiFPS(full_fraction=0.45)
fit(X, y=None, warm_start=False)#

Learn the samples to select.

Parameters:
  • X (numpy.ndarray of shape (n_samples, n_features)) – Training vectors.

  • y (numpy.ndarray of shape (n_samples,), default=None) – Target values.

  • warm_start (bool) – Whether the fit should continue after having already run, after increasing n_to_select. Assumes it is called with the same X and y

Returns:

self (object)

fit_transform(X, y=None, **fit_params)#

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new (ndarray array of shape (n_samples, n_features_new)) – Transformed array.

get_distance()[source]#

Traditional FPS employs a column-wise Euclidean distance for feature selection, which can be expressed using the covariance matrix \(\mathbf{C} = \mathbf{X} ^ T \mathbf{X}\).

\[\operatorname{d}_c(i, j) = C_{ii} - 2 C_{ij} + C_{jj}.\]

For sample selection, this is a row-wise Euclidean distance, which can be expressed in terms of the Gram matrix \(\mathbf{K} = \mathbf{X} \mathbf{X} ^ T\)

\[\operatorname{d}_r(i, j) = K_{ii} - 2 K_{ij} + K_{jj}.\]
Returns:

hausdorff (numpy.ndarray of shape (n_to_select_from_)) – The minimum distance from each point to the set of selected points. Once a point is selected, its distance is no longer updated; the final list reflects the distance of each point at the time it was selected.

get_feature_names_out(input_features=None)#

Mask feature names according to selected features.

Parameters:

input_features (array-like of str or None, default=None) –

Input features.

  • If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: [“x0”, “x1”, …, “x(n_features_in_ - 1)”].

  • If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns:

feature_names_out (ndarray of str objects) – Transformed feature names.

get_metadata_routing()#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing (MetadataRequest) – A MetadataRequest encapsulating routing information.

get_params(deep=True)#

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params (dict) – Parameter names mapped to their values.

get_select_distance()[source]#
Returns:

hausdorff_at_select (numpy.ndarray of shape (n_to_select)) – At the time of selection, the minimum distance from each selected point to the set of previously selected points.

get_support(indices=False, ordered=False)#

Get a mask, or integer index, of the subset

Parameters:
  • indices (bool, default=False) – If True, the return value will be an array of integers, rather than a bool mask.

  • ordered (bool, default=False) – With indices, if True, the return value will be an array of integers, rather than a bool mask, in the order in which they were selected.

Returns:

support (An index that selects the retained subset from the original vectors.) – If indices is False, this is a bool array of shape [# input], in which an element is True iff its corresponding feature or sample is selected for retention. If indices is True, this is an integer array of shape [# n_to_select] whose values are indices into the input vectors.

inverse_transform(X)#

Reverse the transformation operation.

Parameters:

X (array of shape [n_samples, n_selected_features]) – The input samples.

Returns:

X_original (array of shape [n_samples, n_original_features]) – X with columns of zeros inserted where features would have been removed by transform().

score(X=None, y=None)[source]#

Returns the Hausdorff distances of all samples to previous selections.

NOTE: This function does not recompute the importance score each time it is called, in order to avoid unnecessary computations. The Hausdorff distance is updated in self._update_post_selection().

Parameters:
  • X (ignored)

  • y (ignored)

Returns:

hausdorff (Hausdorff distances)

set_fit_request(*, warm_start: bool | None | str = '$UNCHANGED$') → VoronoiFPS#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

warm_start (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for warm_start parameter in fit.

Returns:

self (object) – The updated object.

set_output(*, transform=None)#

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform ({"default", "pandas", "polars"}, default=None) –

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • “polars”: Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: “polars” option was added.

Returns:

self (estimator instance) – Estimator instance.

set_params(**params)#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self (estimator instance) – Estimator instance.

transform(X, y=None)#

Reduce X to the selected features.

Parameters:
  • X (numpy.ndarray of shape [n_samples, n_features]) – The input samples.

  • y (ignored)

Returns:

X_r (numpy.ndarray) – The selected subset of the input.

When Not to Use Voronoi FPS#

In many cases, this algorithm may not improve efficiency. For simple metrics (such as the Euclidean distance), Voronoi FPS will likely not accelerate, and may decelerate, computations when compared to FPS. The sweet spot for Voronoi FPS is when the number of selected samples is already large enough to divide the space into Voronoi polyhedra, but not yet comparable to the total number of samples; outside this regime, the cost of bookkeeping significantly degrades the speed relative to FPS.

Directional Convex Hull (DCH)#

class skmatter.sample_selection.DirectionalConvexHull(low_dim_idx=None, tolerance=1e-12)[source]#

Performs Sample Selection by constructing a Directional Convex Hull and determining the distance to the hull as outlined in the reference [dch].

Parameters:
  • low_dim_idx (list of int, default=None) – Indices of columns of X containing features to be used for the directional convex hull construction (also known as the low-dimensional (LD) hull). By default [0] is used.

  • tolerance (float, default=1.0E-12) – Tolerance for the negative distances to the directional convex hull to consider a point below the convex hull. The distance is computed differently depending on whether a point lies below or above the convex hull. A very low value can result in completely wrong distances. Distances cannot be distinguished from zero up to the tolerance. It is recommended to keep the default setting.

high_dim_idx_#

Indices of columns in data containing high-dimensional features (i.e. those not used for the convex hull construction)

Type:

list of ints

selected_idx_#

Indices of datapoints that form the vertices of the convex hull

Type:

numpy.ndarray

interpolator_high_dim_#

Interpolator for the features in the high-dimensional space

Type:

scipy.interpolate._interpnd.LinearNDInterpolator

Examples

>>> from skmatter.sample_selection import DirectionalConvexHull
>>> import numpy as np
>>> selector = DirectionalConvexHull(
...     # Indices of columns of X to use for fitting
...     # the convex hull
...     low_dim_idx=[0, 1],
... )
>>> X = np.array(
...     [
...         [0.12, 0.21, 0.02],  # 3 samples, 3 features
...         [-0.09, 0.32, -0.10],
...         [-0.03, -0.53, 0.08],
...         [-0.41, 0.25, 0.34],
...     ]
... )
>>> y = np.array([0.1, 1.0, 0.2, 0.4])  # property of each sample
>>> dch = selector.fit(X, y)
>>> # Get the distance to the convex hull for the samples used to fit the
>>> # convex hull. This can also be called with other samples (X_new)
>>> # and corresponding properties (y_new) that were not used to fit
>>> # the hull. In this case all samples lie on the convex hull, so we
>>> # expect zeros.
>>> np.allclose(dch.score_samples(X, y), [0.0, 0.0, 0.0, 0.0])
True
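
The fitted hull can also score samples that were not used in fitting (a hedged sketch with hypothetical values; per score_samples, a point above the hull receives a positive distance):

>>> X_new = X[:1]  # hypothetical: same low-dim coordinates as the first sample
>>> y_new = np.array([2.0])  # property value well above the fitted hull
>>> d = dch.score_samples(X_new, y_new)  # expected: d[0] > 0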

References

[dch]

A. Anelli, E. A. Engel, C. J. Pickard, and M. Ceriotti, Physical Review Materials, 2018.

property directional_vertices_#
fit(X, y)[source]#

Learn the samples that form the convex hull.

Parameters:
  • X (numpy.ndarray of shape (n_samples, n_features)) – Feature matrix of samples to use for constructing the convex hull.

  • y (numpy.ndarray of shape (n_samples,)) – Target values (property on which the convex hull should be constructed, e.g. Gibbs free energy)

Returns:

self (object) – Fitted scorer.

get_metadata_routing()#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing (MetadataRequest) – A MetadataRequest encapsulating routing information.

get_params(deep=True)#

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params (dict) – Parameter names mapped to their values.

score_feature_matrix(X)[source]#

Calculate the distance (or more specifically, the residuals) of the samples to the convex hull in the high-dimensional space. Samples with a distance value of zero in all the higher dimensions lie on the convex hull.

Parameters:

X (numpy.ndarray of shape (n_samples, n_features)) – Feature matrix of samples to use for determining distance to the convex hull. Please note that samples provided should have the same dimensions (features) as used during fitting of the convex hull. The same column indices will be used for the low- and high-dimensional features.

Returns:

dch_distance (numpy.ndarray of shape (n_samples, len(high_dim_idx_))) – The distance (residuals) of samples to the convex hull in the higher-dimensional space.

score_samples(X, y)[source]#

Calculate the distance of the samples to the convex hull in the target direction y. Samples with a distance > 0 lie above the convex surface. Samples with a distance of zero lie on the convex surface. Samples with a distance value < 0 lie below the convex surface.

Parameters:
  • X (numpy.ndarray of shape (n_samples, n_features)) – Feature matrix of samples to use for determining distance to the convex hull. Please note that samples provided should have the same dimensions (features) as used during fitting of the convex hull. The same column indices will be used for the low- and high-dimensional features.

  • y (numpy.ndarray of shape (n_samples,)) – Target values (property on which the convex hull should be constructed, e.g. Gibbs free energy)

Returns:

dch_distance (numpy.ndarray of shape (n_samples,)) – The distance of each sample to the convex hull in the target direction y.

set_params(**params)#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self (estimator instance) – Estimator instance.