Preprocessing#

Scaling, centering and normalization methods.

KernelNormalizer#

class skmatter.preprocessing.KernelNormalizer(with_center=True, with_trace=True)[source]#

Kernel centering method, similar to sklearn.preprocessing.KernelCenterer, but with additional scaling and the ability to pass a set of sample weights.

Let \(K(x, z)\) be a kernel defined by \(\phi(x)^T \phi(z)\), where \(\phi\) is a function mapping \(x\) to a Hilbert space. KernelNormalizer centers the data (i.e., normalizes it to have zero mean) without explicitly computing \(\phi(x)\). It is equivalent to centering and scaling \(\phi(x)\) with sklearn.preprocessing.StandardScaler(with_std=False).
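
For intuition, the centering performed here follows the standard kernel-centering identity. Below is a minimal NumPy sketch of that identity alone (not the class itself, and not including the trace scaling):

>>> import numpy as np
>>> K = np.array([[9.0, 2.0, -2.0], [2.0, 14.0, -13.0], [-2.0, -13.0, 21.0]])
>>> n = K.shape[0]
>>> one_n = np.full((n, n), 1.0 / n)  # matrix with all entries equal to 1/n
>>> # Centering K is equivalent to centering phi(x) in feature space:
>>> K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n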

Parameters:
  • with_center (bool, default=True) – If True, center the kernel matrix before scaling. If False, do not center the kernel.

  • with_trace (bool, default=True) – If True, scale the kernel so that the trace is equal to the number of samples. If False, do not scale the kernel.

K_fit_rows_#

Average of each column of kernel matrix.

Type:

numpy.ndarray of shape (n_samples,)

K_fit_all_#

Average of kernel matrix.

Type:

float

sample_weight_#

Sample weights (if provided during the fit).

Type:

numpy.ndarray of shape (n_samples,)

scale_#

Scaling parameter used when with_trace=True. Calculated as np.trace(K) / K.shape[0].

Type:

float

Examples

>>> from skmatter.preprocessing import KernelNormalizer
>>> from sklearn.metrics.pairwise import pairwise_kernels
>>> X = [[1.0, -2.0, 2.0], [-2.0, 1.0, 3.0], [4.0, 1.0, -2.0]]
>>> K = pairwise_kernels(X, metric="linear")
>>> K
array([[  9.,   2.,  -2.],
       [  2.,  14., -13.],
       [ -2., -13.,  21.]])
>>> transformer = KernelNormalizer().fit(K)
>>> transformer
KernelNormalizer()
>>> transformer.transform(K)
array([[ 0.39473684,  0.        , -0.39473684],
       [ 0.        ,  1.10526316, -1.10526316],
       [-0.39473684, -1.10526316,  1.5       ]])
>>> transformer.scale_ * transformer.transform(K)
array([[  5.,   0.,  -5.],
       [  0.,  14., -14.],
       [ -5., -14.,  19.]])
fit(K, y=None, sample_weight=None)[source]#

Fit the KernelNormalizer.

Parameters:
  • K (numpy.ndarray of shape (n_samples, n_samples)) – Kernel matrix.

  • y (None) – Ignored.

  • sample_weight (numpy.ndarray of shape (n_samples,), default=None) – Weights for each sample. Sample weighting can be used to center (and scale) data using a weighted mean. Weights are internally normalized before preprocessing.

Returns:

self (object) – Fitted transformer.
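
As a hypothetical sketch, fitting with sample weights reuses the same call pattern (weights are normalized internally, so only their ratios matter; K is the kernel from the example above):

>>> import numpy as np
>>> w = np.array([1.0, 0.5, 0.5])  # hypothetical per-sample weights
>>> weighted = KernelNormalizer().fit(K, sample_weight=w)
>>> K_w = weighted.transform(K)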

fit_transform(K, y=None, sample_weight=None, copy=True, **fit_params)[source]#

Fit to data, then transform it.

Parameters:
  • K (numpy.ndarray of shape (n_samples, n_samples)) – Kernel matrix.

  • y (None) – Ignored.

  • sample_weight (numpy.ndarray of shape (n_samples,), default=None) – Weights for each sample. Sample weighting can be used to center (and scale) data using a weighted mean. Weights are internally normalized before preprocessing.

  • **fit_params – necessary for compatibility with the functions of the TransformerMixin class

Returns:

K_new (numpy.ndarray of shape (n_samples, n_samples)) – Transformed array.

get_feature_names_out(input_features=None)#

Get output feature names for transformation.

The feature names out will be prefixed by the lowercased class name. For example, if the transformer outputs 3 features, then the feature names out are: ["class_name0", "class_name1", "class_name2"].

Parameters:

input_features (array-like of str or None, default=None) – Only used to validate feature names with the names seen in fit.

Returns:

feature_names_out (ndarray of str objects) – Transformed feature names.

get_metadata_routing()#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing (MetadataRequest) – A MetadataRequest encapsulating routing information.

get_params(deep=True)#

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params (dict) – Parameter names mapped to their values.

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') KernelNormalizer#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self (object) – The updated object.

set_output(*, transform=None)#

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform ({"default", "pandas", "polars"}, default=None) –

Configure output of transform and fit_transform.

  • "default": Default output format of a transformer

  • "pandas": DataFrame output

  • "polars": Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: “polars” option was added.

Returns:

self (estimator instance) – Estimator instance.
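
For example, a sketch of requesting pandas output (assumes pandas is installed; per get_feature_names_out above, the columns are prefixed with the lowercased class name):

>>> transformer = KernelNormalizer().set_output(transform="pandas")
>>> df = transformer.fit(K).transform(K)
>>> # df is a pandas.DataFrame with columns "kernelnormalizer0", "kernelnormalizer1", ...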

set_params(**params)#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self (estimator instance) – Estimator instance.

set_transform_request(*, copy: bool | None | str = '$UNCHANGED$') KernelNormalizer#

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

copy (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for copy parameter in transform.

Returns:

self (object) – The updated object.

transform(K, copy=True)[source]#

Center kernel matrix.

Parameters:
  • K (numpy.ndarray of shape (n_samples1, n_samples2)) – Kernel matrix.

  • copy (bool, default=True) – Set to False to perform in-place computation.

Returns:

K_new (numpy.ndarray of shape (n_samples1, n_samples2)) – Transformed array

SparseKernelCenterer#

class skmatter.preprocessing.SparseKernelCenterer(with_center=True, with_trace=True, rcond=1e-12)[source]#

Kernel centering method for sparse kernels, similar to KernelNormalizer.

The main disadvantage of kernel methods, which are widely used in machine learning, is that their time and space complexity grow quickly with the number of samples: for a large dataset, one must not only store a huge kernel matrix but also use it repeatedly in calculations. To avoid this, so-called sparse kernel methods are used, formulated from the low-rank Nyström approximation:

\[\mathbf{K} \approx \hat{\mathbf{K}}_{N N} = \mathbf{K}_{N M} \mathbf{K}_{M M}^{-1} \mathbf{K}_{N M}^{T}\]

where the subscripts of \(\mathbf{K}\) denote the sizes of the sets of samples compared in each kernel, with \(N\) being the size of the full data set and \(M\) the size of a small, active set. With this method only the matrix \(\mathbf{K}_{NM}\) needs to be stored and used, making possible an \(N/M\)-fold improvement in asymptotic memory cost.
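
A minimal NumPy sketch of this approximation (the active set is hypothetically taken as the first M samples; np.linalg.pinv plays the role of the rcond-controlled pseudo-inverse below):

>>> import numpy as np
>>> from sklearn.metrics.pairwise import pairwise_kernels
>>> X = np.random.default_rng(0).normal(size=(10, 3))
>>> M = 4  # hypothetical active-set size
>>> Knm = pairwise_kernels(X, X[:M], metric="linear")  # shape (N, M)
>>> Kmm = pairwise_kernels(X[:M], metric="linear")  # shape (M, M)
>>> K_hat = Knm @ np.linalg.pinv(Kmm) @ Knm.T  # Nystrom approximation of the full kernel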

Parameters:
  • with_center (bool, default=True) – If True, center the kernel matrix before scaling. If False, do not center the kernel.

  • with_trace (bool, default=True) – If True, scale the kernel so that the trace is equal to the number of samples. If False, do not scale the kernel.

  • rcond (float, default=1E-12) – Conditioning parameter to use when computing the Nyström-approximated kernel for scaling.

K_fit_rows_#

Average of each column of kernel matrix.

Type:

numpy.ndarray of shape (n_samples,)

K_fit_all_#

Average of kernel matrix.

Type:

float

sample_weight_#

Sample weights (if provided during the fit).

Type:

numpy.ndarray of shape (n_samples,)

scale_#

Scaling parameter used when with_trace=True. Calculated as np.trace(K) / K.shape[0].

Type:

float

n_active_#

Size of the active set.

Type:

int
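
A minimal usage sketch, reusing Knm and Kmm as constructed in the Nyström example above (outputs omitted):

>>> from skmatter.preprocessing import SparseKernelCenterer
>>> centerer = SparseKernelCenterer().fit(Knm, Kmm)
>>> Knm_centered = centerer.transform(Knm)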

fit(Knm, Kmm, y=None, sample_weight=None)[source]#

Fit the SparseKernelCenterer.

Parameters:
  • Knm (numpy.ndarray of shape (n_samples, n_active)) – Kernel matrix between the reference data set and the active set

  • Kmm (numpy.ndarray of shape (n_active, n_active)) – Kernel matrix between the active set and itself

  • y (None) – Ignored.

  • sample_weight (numpy.ndarray of shape (n_samples,), default=None) – Weights for each sample. Sample weighting can be used to center (and scale) data using a weighted mean. Weights are internally normalized before preprocessing.

Returns:

self (object) – Fitted transformer.

fit_transform(Knm, Kmm, y=None, sample_weight=None, **fit_params)[source]#

Fit to data, then transform it.

Parameters:
  • Knm (numpy.ndarray of shape (n_samples, n_active)) – Kernel matrix between the reference data set and the active set

  • Kmm (numpy.ndarray of shape (n_active, n_active)) – Kernel matrix between the active set and itself

  • y (None) – Ignored.

  • sample_weight (numpy.ndarray of shape (n_samples,), default=None) – Weights for each sample. Sample weighting can be used to center (and scale) data using a weighted mean. Weights are internally normalized before preprocessing.

  • **fit_params – necessary for compatibility with the functions of the TransformerMixin class

Returns:

K_new (numpy.ndarray of shape (n_samples, n_active)) – Transformed array

get_metadata_routing()#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing (MetadataRequest) – A MetadataRequest encapsulating routing information.

get_params(deep=True)#

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params (dict) – Parameter names mapped to their values.

set_fit_request(*, Kmm: bool | None | str = '$UNCHANGED$', Knm: bool | None | str = '$UNCHANGED$', sample_weight: bool | None | str = '$UNCHANGED$') SparseKernelCenterer#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • Kmm (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for Kmm parameter in fit.

  • Knm (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for Knm parameter in fit.

  • sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self (object) – The updated object.

set_output(*, transform=None)#

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform ({"default", "pandas", "polars"}, default=None) –

Configure output of transform and fit_transform.

  • "default": Default output format of a transformer

  • "pandas": DataFrame output

  • "polars": Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: “polars” option was added.

Returns:

self (estimator instance) – Estimator instance.

set_params(**params)#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self (estimator instance) – Estimator instance.

set_transform_request(*, Knm: bool | None | str = '$UNCHANGED$') SparseKernelCenterer#

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

Knm (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for Knm parameter in transform.

Returns:

self (object) – The updated object.

transform(Knm, y=None)[source]#

Center the kernel matrix. The transformer must be fitted before calling this method.

Parameters:
  • Knm (numpy.ndarray of shape (n_samples, n_active)) – Kernel matrix between the reference data set and the active set

  • y (None) – Ignored.

Returns:

K_new (numpy.ndarray of shape (n_samples, n_active)) – Transformed array

StandardFlexibleScaler#

class skmatter.preprocessing.StandardFlexibleScaler(with_mean=True, with_std=True, column_wise=False, rtol=0, atol=1e-12, copy=False)[source]#

Standardize features by removing the mean and scaling to unit variance. The mean of each column is reduced to zero and, in the case of column_wise=True, each column is scaled independently; in the case of column_wise=False, the matrix is scaled as a whole so that its total variance is one (so each column has, on average, a variance of one over the number of columns). The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the samples if with_mean=True, otherwise zero, and s is the standard deviation of the samples if with_std=True, otherwise one.

Centering and scaling can be computed independently for each feature (column_wise=True) or for the whole matrix (column_wise=False) by calculating the appropriate statistics on the input. The mean and standard deviation are then stored for use on later data with transform().

Standardization of a dataset is a common requirement for many machine learning estimators: an improperly scaled / centered dataset may result in anomalous behavior.

At the same time, depending on the task, it may be necessary to preserve the relative scale of the features (for example, when the feature matrix is akin to a covariance matrix), in which case standardization should be carried out on the whole matrix rather than on individual columns, as is done in sklearn.preprocessing.StandardScaler.
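
As a sketch of this difference (X_demo is a hypothetical two-feature matrix): sklearn's StandardScaler forces every column to unit variance, while StandardFlexibleScaler with the default column_wise=False rescales all columns by a single factor, preserving their variance ratios:

>>> import numpy as np
>>> from sklearn.preprocessing import StandardScaler
>>> from skmatter.preprocessing import StandardFlexibleScaler
>>> X_demo = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]])
>>> X_s = StandardScaler().fit_transform(X_demo)  # each column forced to unit variance
>>> X_f = StandardFlexibleScaler().fit_transform(X_demo)
>>> # X_f keeps the columns' variance ratio; only the overall scale changes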

Parameters:
  • with_mean (bool, default=True) – If True, center the data before scaling. If False, keep the mean intact

  • with_std (bool, default=True) – If True, scale the data to unit variance. If False, keep the variance intact

  • column_wise (bool, default=False) – If True, normalize each column separately. If False, normalize the whole matrix with respect to its total variance.

  • rtol (float, default=0) – The relative tolerance for detecting zero variance: a variance is considered zero when it is less than abs(mean) * rtol + atol.

  • atol (float, default=1.0E-12) – The absolute tolerance for detecting zero variance: a variance is considered zero when it is less than abs(mean) * rtol + atol.

  • copy (bool, default=False) – Copy the input X or not.

n_samples_in_#

Number of samples in the reference ndarray

Type:

int

n_features_in_#

Number of features in the reference ndarray

Type:

int

mean_#

The mean value for each feature in the training set. Equal to numpy.ndarray of zeros shape (n_features,) when with_mean=False.

Type:

numpy.ndarray of shape (n_features,)

scale_#

The scaling factor, numpy.ndarray of shape (n_features,) when column_wise=True or float when column_wise = False.

Type:

numpy.ndarray of shape (n_features,), float or None

copy#

Copy the input X or not.

Type:

bool, default=False

Examples

>>> import numpy as np
>>> from skmatter.preprocessing import StandardFlexibleScaler
>>> X = np.array([[1.0, -2.0, 2.0], [-2.0, 1.0, 3.0], [4.0, 1.0, -2.0]])
>>> transformer = StandardFlexibleScaler().fit(X)
>>> transformer
StandardFlexibleScaler()
>>> transformer.transform(X)
array([[ 0.        , -0.56195149,  0.28097574],
       [-0.84292723,  0.28097574,  0.56195149],
       [ 0.84292723,  0.28097574, -0.84292723]])
>>> transformer.scale_ * transformer.transform(X)
array([[ 0., -2.,  1.],
       [-3.,  1.,  2.],
       [ 3.,  1., -3.]])
>>> transformer.scale_ * transformer.transform(X) + transformer.mean_
array([[ 1., -2.,  2.],
       [-2.,  1.,  3.],
       [ 4.,  1., -2.]])
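
Continuing the example, a sketch of the two scaling modes (bool(...) keeps the doctest output stable):

>>> X_t = transformer.transform(X)
>>> bool(np.isclose(X_t.var(axis=0).sum(), 1.0))  # total variance scaled to 1
True
>>> col_scaler = StandardFlexibleScaler(column_wise=True).fit(X)  # per-column scaling
>>> X_c = col_scaler.transform(X)
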
fit(X, y=None, sample_weight=None)[source]#

Compute mean and scaling to be applied for subsequent normalization.

Parameters:
  • X (numpy.ndarray of shape (n_samples, n_features)) – The data used to compute the mean and standard deviation used for later scaling along the features axis.

  • y (None) – Ignored.

  • sample_weight (numpy.ndarray of shape (n_samples,), default=None) – Weights for each sample. Sample weighting can be used to center (and scale) data using a weighted mean. Weights are internally normalized before preprocessing.

Returns:

self (object) – Fitted scaler.

fit_transform(X, y=None, **fit_params)#

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new (ndarray array of shape (n_samples, n_features_new)) – Transformed array.

get_metadata_routing()#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing (MetadataRequest) – A MetadataRequest encapsulating routing information.

get_params(deep=True)#

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params (dict) – Parameter names mapped to their values.

inverse_transform(X_tr)[source]#

Scale back the data to the original representation.

Parameters:

X_tr (numpy.ndarray of shape (n_samples, n_features)) – Transformed matrix

Returns:

X (numpy.ndarray of shape (n_samples, n_features)) – Matrix in the original representation.
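
For example, a round-trip sketch continuing the example above:

>>> X_round = transformer.inverse_transform(transformer.transform(X))
>>> bool(np.allclose(X_round, X))
True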

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') StandardFlexibleScaler#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self (object) – The updated object.

set_inverse_transform_request(*, X_tr: bool | None | str = '$UNCHANGED$') StandardFlexibleScaler#

Request metadata passed to the inverse_transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to inverse_transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to inverse_transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

X_tr (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_tr parameter in inverse_transform.

Returns:

self (object) – The updated object.

set_output(*, transform=None)#

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform ({"default", "pandas", "polars"}, default=None) –

Configure output of transform and fit_transform.

  • "default": Default output format of a transformer

  • "pandas": DataFrame output

  • "polars": Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: “polars” option was added.

Returns:

self (estimator instance) – Estimator instance.

set_params(**params)#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self (estimator instance) – Estimator instance.

set_transform_request(*, copy: bool | None | str = '$UNCHANGED$') StandardFlexibleScaler#

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

copy (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for copy parameter in transform.

Returns:

self (object) – The updated object.

transform(X, y=None, copy=None)[source]#

Normalize the data based on the previously computed mean and scaling.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the features axis.

  • y (None) – Ignored.

  • copy (bool, default=None) – Copy the input X or not.

Returns:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Transformed array.