Preprocessing
Scaling, centering and normalization methods.
KernelNormalizer
- class skmatter.preprocessing.KernelNormalizer(with_center=True, with_trace=True)[source]#
Kernel centering method, similar to KernelCenterer, but with additional scaling and the ability to pass a set of sample weights.
Let \(K(x, z)\) be a kernel defined by \(\phi(x)^T \phi(z)\), where \(\phi\) is a function mapping \(x\) to a Hilbert space. KernelNormalizer centers the data (i.e., normalizes it to have zero mean) without explicitly computing \(\phi(x)\). The centering step is equivalent to centering \(\phi(x)\) with sklearn.preprocessing.StandardScaler(with_std=False); when with_trace=True, the centered kernel is additionally divided by scale_.
- Parameters:
- K_fit_rows_#
Average of each column of kernel matrix.
- Type:
numpy.ndarray of shape (n_samples,)
- scale_#
Scaling parameter used when with_trace=True. Calculated as np.trace(K) / K.shape[0].
- Type:
Examples
>>> from skmatter.preprocessing import KernelNormalizer
>>> from sklearn.metrics.pairwise import pairwise_kernels
>>> X = [[1.0, -2.0, 2.0], [-2.0, 1.0, 3.0], [4.0, 1.0, -2.0]]
>>> K = pairwise_kernels(X, metric="linear")
>>> K
array([[  9.,   2.,  -2.],
       [  2.,  14., -13.],
       [ -2., -13.,  21.]])
>>> transformer = KernelNormalizer().fit(K)
>>> transformer
KernelNormalizer()
>>> transformer.transform(K)
array([[ 0.39473684,  0.        , -0.39473684],
       [ 0.        ,  1.10526316, -1.10526316],
       [-0.39473684, -1.10526316,  1.5       ]])
>>> transformer.scale_ * transformer.transform(K)
array([[  5.,   0.,  -5.],
       [  0.,  14., -14.],
       [ -5., -14.,  19.]])
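The numbers in this example can be reproduced with plain NumPy. The following is a minimal sketch, not the library's implementation: it assumes uniform sample weights, centers the kernel with the usual centering matrix (the name `H` is introduced here for illustration), and divides by the trace of the centered kernel over the number of samples.

```python
import numpy as np

X = np.array([[1.0, -2.0, 2.0], [-2.0, 1.0, 3.0], [4.0, 1.0, -2.0]])
K = X @ X.T  # linear kernel, same values as pairwise_kernels(X, metric="linear")

n = K.shape[0]
# Centering matrix H = I - (1/n) * ones; H @ K @ H removes the feature-space mean
H = np.eye(n) - np.full((n, n), 1.0 / n)
K_centered = H @ K @ H

# Scale so that the trace of the result equals the number of samples
scale = np.trace(K_centered) / n
K_normalized = K_centered / scale

# Cross-check against explicit centering of the features
X_centered = X - X.mean(axis=0)
print(np.allclose(K_centered, X_centered @ X_centered.T))
print(K_normalized)
```

Because the kernel is linear here, the feature map is the identity, which is what allows the explicit cross-check on the last lines. For a nonlinear kernel only the `H @ K @ H` route is available.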
- fit(K, y=None, sample_weight=None)[source]#
Fit KernelNormalizer.
- Parameters:
K (numpy.ndarray of shape (n_samples, n_samples)) – Kernel matrix.
y (None) – Ignored.
sample_weight (numpy.ndarray of shape (n_samples,), default=None) – Weights for each sample. Sample weighting can be used to center (and scale) data using a weighted mean. Weights are internally normalized before preprocessing.
- Returns:
self (object) – Fitted transformer.
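To illustrate what centering with a weighted mean means at the kernel level, here is a hedged NumPy sketch. It uses the standard weighted-mean identity for a symmetric kernel and is an assumption about the mathematics, not a copy of the library's internals. The names `col_mean` and `total_mean` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
K = X @ X.T  # symmetric linear kernel

# Normalize the weights so that they sum to one
w = rng.uniform(0.5, 2.0, size=6)
w = w / w.sum()

ones = np.ones(K.shape[0])
col_mean = w @ K        # weighted average of each column, shape (n_samples,)
total_mean = w @ K @ w  # weighted average of the whole kernel (a scalar)

# K_c[i, j] = K[i, j] - col_mean[j] - col_mean[i] + total_mean
K_centered = K - np.outer(ones, col_mean) - np.outer(col_mean, ones) + total_mean

# Cross-check: subtract the weighted feature mean explicitly
X_centered = X - w @ X
print(np.allclose(K_centered, X_centered @ X_centered.T))
```

With uniform weights `w = 1 / n_samples` this reduces to ordinary kernel centering.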
- fit_transform(K, y=None, sample_weight=None, copy=True, **fit_params)[source]#
Fit to data, then transform it.
- Parameters:
K (numpy.ndarray of shape (n_samples, n_samples)) – Kernel matrix.
y (None) – Ignored.
sample_weight (numpy.ndarray of shape (n_samples,), default=None) – Weights for each sample. Sample weighting can be used to center (and scale) data using a weighted mean. Weights are internally normalized before preprocessing.
**fit_params – necessary for compatibility with the functions of the TransformerMixin class
- Returns:
K_new (numpy.ndarray of shape (n_samples1, n_samples2)) – Transformed array
- get_feature_names_out(input_features=None)#
Get output feature names for transformation.
The feature names out will be prefixed by the lowercased class name. For example, if the transformer outputs 3 features, then the feature names out are: ["class_name0", "class_name1", "class_name2"].
- Parameters:
input_features (array-like of str or None, default=None) – Only used to validate feature names with the names seen in fit.
- Returns:
feature_names_out (ndarray of str objects) – Transformed feature names.
- get_metadata_routing()#
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
routing (MetadataRequest) – A MetadataRequest encapsulating routing information.
- get_params(deep=True)#
Get parameters for this estimator.
- Parameters:
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
params (dict) – Parameter names mapped to their values.
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') KernelNormalizer #
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- set_output(*, transform=None)#
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
transform ({"default", "pandas", "polars"}, default=None) –
Configure output of transform and fit_transform.
"default": Default output format of a transformer
"pandas": DataFrame output
"polars": Polars output
None: Transform configuration is unchanged
Added in version 1.4: "polars" option was added.
- Returns:
self (estimator instance) – Estimator instance.
- set_params(**params)#
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self (estimator instance) – Estimator instance.
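The `<component>__<parameter>` form is easiest to see on a small pipeline. The sketch below uses two scikit-learn estimators as stand-ins (the step names "scaler" and "ridge" are arbitrary choices for this illustration). Since these methods are inherited from scikit-learn, the same pattern should apply to the transformers on this page.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Two named steps; the names become the prefixes of the nested parameters
pipe = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])

# Update one parameter of each step through the pipeline object
pipe.set_params(scaler__with_mean=False, ridge__alpha=0.5)

# get_params(deep=True) exposes the nested parameters under the same keys
params = pipe.get_params(deep=True)
print(params["scaler__with_mean"], params["ridge__alpha"])
```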
- set_transform_request(*, copy: bool | None | str = '$UNCHANGED$') KernelNormalizer #
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
copy (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for copy parameter in transform.
- Returns:
self (object) – The updated object.
- transform(K, copy=True)[source]#
Center kernel matrix.
- Parameters:
K (numpy.ndarray of shape (n_samples1, n_samples2)) – Kernel matrix.
copy (bool, default=True) – Set to False to perform inplace computation.
- Returns:
K_new (numpy.ndarray of shape (n_samples1, n_samples2)) – Transformed array
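The (n_samples1, n_samples2) shape means transform can be applied to a kernel between new samples and the samples seen in fit. The sketch below shows, in plain NumPy and with uniform weights, how statistics stored at fit time can center such a rectangular kernel. The variable `K_fit_rows` mirrors the documented K_fit_rows_ attribute; `K_fit_all` is a hypothetical name for the overall kernel mean. This is an illustration of the underlying identity, not the library's code.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(8, 3))
X_test = rng.normal(size=(4, 3))

K_train = X_train @ X_train.T  # (8, 8) kernel used for fitting
K_test = X_test @ X_train.T    # (4, 8) kernel between new and fitted samples

# Statistics computed once from the training kernel
K_fit_rows = K_train.mean(axis=0)  # average of each column, shape (8,)
K_fit_all = K_train.mean()         # average of the whole kernel
K_train_centered = (
    K_train - K_fit_rows[None, :] - K_train.mean(axis=1)[:, None] + K_fit_all
)
scale = np.trace(K_train_centered) / K_train.shape[0]

# Apply the same centering and scaling to the rectangular kernel
K_test_centered = (
    K_test - K_fit_rows[None, :] - K_test.mean(axis=1)[:, None] + K_fit_all
)
K_test_normalized = K_test_centered / scale

# Cross-check against centering the features with the training mean
mu = X_train.mean(axis=0)
print(np.allclose(K_test_centered, (X_test - mu) @ (X_train - mu).T))
```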
SparseKernelCenterer
- class skmatter.preprocessing.SparseKernelCenterer(with_center=True, with_trace=True, rcond=1e-12)[source]#
Kernel centering method for sparse kernels, similar to KernelNormalizer.
The main disadvantage of kernel methods, which are widely used in machine learning, is that their time and space complexity grows quickly with the number of samples. With a large dataset, the full kernel must not only be stored but also used repeatedly in calculations. To avoid this, so-called sparse kernel methods are formulated from a low-rank (Nyström) approximation:
\[\mathbf{K} \approx \hat{\mathbf{K}}_{N N} = \mathbf{K}_{N M} \mathbf{K}_{M M}^{-1} \mathbf{K}_{N M}^{T}\]
where the subscripts of \(\mathbf{K}\) denote the sizes of the sets of samples compared in each kernel, with \(N\) being the size of the full data set and \(M\) the size of a small active set. With this method, only the matrix \(\mathbf{K}_{NM}\) needs to be stored and used, which reduces the asymptotic memory cost by a factor of \(N/M\).
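The Nyström formula can be checked numerically with a few lines of NumPy. This sketch makes several assumptions for the sake of a clean check: a linear kernel, a randomly chosen active set, and a pseudo-inverse with a cutoff of 1e-12 in place of the plain inverse. With a linear kernel on 4 features and 10 active samples the approximation is exact up to rounding; for a general kernel it is only approximate.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 200, 10
X = rng.normal(size=(N, 4))

# Pick M samples as the active set (random choice, purely for illustration)
active = rng.choice(N, size=M, replace=False)
X_active = X[active]

Knm = X @ X_active.T         # (N, M): full set against active set
Kmm = X_active @ X_active.T  # (M, M): active set against itself

# Pseudo-inverse guards against an ill-conditioned or rank-deficient Kmm
Kmm_inv = np.linalg.pinv(Kmm, rcond=1e-12)
K_hat = Knm @ Kmm_inv @ Knm.T  # Nystrom approximation of the (N, N) kernel

K_full = X @ X.T
print(np.allclose(K_hat, K_full))
print(K_full.size // Knm.size)  # ratio of stored entries, N / M
```

In practice `K_hat` would never be formed explicitly, since avoiding the (N, N) matrix is the point of the method. It is built here only to compare against `K_full`.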
- Parameters:
with_center (bool, default=True) – If True, center the kernel matrix before scaling. If False, do not center the kernel
with_trace (bool, default=True) – If True, scale the kernel so that the trace is equal to the number of samples. If False, do not scale the kernel
rcond (float, default 1E-12) – conditioning parameter to use when computing the Nystrom-approximated kernel for scaling
- K_fit_rows_#
Average of each column of kernel matrix.
- Type:
numpy.ndarray of shape (n_samples,)
- scale_#
Scaling parameter used when with_trace=True. Calculated as np.trace(K) / K.shape[0].
- Type:
- fit(Knm, Kmm, y=None, sample_weight=None)[source]#
Fit SparseKernelCenterer.
- Parameters:
Knm (numpy.ndarray of shape (n_samples, n_active)) – Kernel matrix between the reference data set and the active set
Kmm (numpy.ndarray of shape (n_active, n_active)) – Kernel matrix between the active set and itself
y (None) – Ignored.
sample_weight (numpy.ndarray of shape (n_samples,), default=None) – Weights for each sample. Sample weighting can be used to center (and scale) data using a weighted mean. Weights are internally normalized before preprocessing.
- Returns:
self (object) – Fitted transformer.
- fit_transform(Knm, Kmm, y=None, sample_weight=None, **fit_params)[source]#
Fit to data, then transform it.
- Parameters:
Knm (numpy.ndarray of shape (n_samples, n_active)) – Kernel matrix between the reference data set and the active set
Kmm (numpy.ndarray of shape (n_active, n_active)) – Kernel matrix between the active set and itself
y (None) – Ignored.
sample_weight (numpy.ndarray of shape (n_samples,), default=None) – Weights for each sample. Sample weighting can be used to center (and scale) data using a weighted mean. Weights are internally normalized before preprocessing.
**fit_params – necessary for compatibility with the functions of the TransformerMixin class
- Returns:
K_new (numpy.ndarray of shape (n_samples, n_active)) – Transformed array
- get_metadata_routing()#
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
routing (MetadataRequest) – A MetadataRequest encapsulating routing information.
- get_params(deep=True)#
Get parameters for this estimator.
- Parameters:
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
params (dict) – Parameter names mapped to their values.
- set_fit_request(*, Kmm: bool | None | str = '$UNCHANGED$', Knm: bool | None | str = '$UNCHANGED$', sample_weight: bool | None | str = '$UNCHANGED$') SparseKernelCenterer #
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
Kmm (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for Kmm parameter in fit.
Knm (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for Knm parameter in fit.
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- set_output(*, transform=None)#
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
transform ({"default", "pandas", "polars"}, default=None) –
Configure output of transform and fit_transform.
"default": Default output format of a transformer
"pandas": DataFrame output
"polars": Polars output
None: Transform configuration is unchanged
Added in version 1.4: "polars" option was added.
- Returns:
self (estimator instance) – Estimator instance.
- set_params(**params)#
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self (estimator instance) – Estimator instance.
- set_transform_request(*, Knm: bool | None | str = '$UNCHANGED$') SparseKernelCenterer #
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
Knm (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for Knm parameter in transform.
- Returns:
self (object) – The updated object.
- transform(Knm, y=None)[source]#
Center the kernel matrix. The transformer must be fitted before calling this method.
- Parameters:
Knm (numpy.ndarray of shape (n_samples, n_active)) – Kernel matrix between the reference data set and the active set
y (None) – Ignored.
- Returns:
K_new (numpy.ndarray of shape (n_samples, n_active)) – Transformed array
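As a rough picture of what centering and trace scaling could look like when only the (n_samples, n_active) block is available, consider the NumPy sketch below. It is an assumption-laden illustration rather than the library's algorithm: it subtracts the column means of Knm, then rescales Knm so that the Nyström-approximated kernel has a trace equal to the number of samples. The square root appears because Knm enters the approximation twice. Kmm is left untouched.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 50, 5
X = rng.normal(size=(N, 3))
X_active = X[:M]  # first M samples as the active set (illustrative choice)

Knm = X @ X_active.T
Kmm = X_active @ X_active.T

# Center: remove the average of each column of Knm
Knm_centered = Knm - Knm.mean(axis=0)

# Scale: make trace(Knm Kmm^-1 Knm^T) equal to the number of samples
Kmm_inv = np.linalg.pinv(Kmm, rcond=1e-12)
K_hat = Knm_centered @ Kmm_inv @ Knm_centered.T
scale = np.sqrt(np.trace(K_hat) / N)
Knm_scaled = Knm_centered / scale

print(np.trace(Knm_scaled @ Kmm_inv @ Knm_scaled.T))  # close to N
```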
StandardFlexibleScaler
- class skmatter.preprocessing.StandardFlexibleScaler(with_mean=True, with_std=True, column_wise=False, rtol=0, atol=1e-12, copy=False)[source]#
Standardize features by removing the mean and scaling to unit variance. Reduce the mean of each column to zero and, in the case of column_wise=True, set the variance of each column equal to one / number of columns. The standard score of a sample x is calculated as:
z = (x - u) / s
where u is the mean of the samples if with_mean is True, otherwise zero, and s is the standard deviation of the samples if with_std is True, otherwise one.
Centering and scaling can occur independently for each feature (column_wise=True), by calculating the appropriate statistics for each column of the input, or for the whole matrix (column_wise=False). The mean and standard deviation are then stored for use on later data using transform().
Standardization of a dataset is a common requirement for many machine learning estimators: an improperly scaled or centered dataset may result in anomalous behavior.
At the same time, depending on the task, it may be necessary to preserve the relative scale between the features (for example, when the feature matrix is something like a covariance matrix). In that case the standardization should be carried out for the whole matrix rather than for the individual columns, as sklearn.preprocessing.StandardScaler does.
- Parameters:
with_mean (bool, default=True) – If True, center the data before scaling. If False, keep the mean intact
with_std (bool, default=True) – If True, scale the data to unit variance. If False, keep the variance intact
column_wise (bool, default=False) – If True, normalize each column separately. If False, normalize the whole matrix with respect to its total variance.
rtol (float, default=0) – The relative tolerance for the optimization: variance is considered zero when it is less than abs(mean) * rtol + atol.
atol (float, default=1.0E-12) – The absolute tolerance for the optimization: variance is considered zero when it is less than abs(mean) * rtol + atol.
copy (bool, default=False) – Copy the input X or not.
- mean_#
The mean value for each feature in the training set. Equal to a numpy.ndarray of zeros of shape (n_features,) when with_mean=False.
- Type:
numpy.ndarray of shape (n_features,)
- scale_#
The scaling factor: a numpy.ndarray of shape (n_features,) when column_wise=True, or a float when column_wise=False.
- Type:
numpy.ndarray of shape (n_features,), float or None
Examples
>>> import numpy as np
>>> from skmatter.preprocessing import StandardFlexibleScaler
>>> X = np.array([[1.0, -2.0, 2.0], [-2.0, 1.0, 3.0], [4.0, 1.0, -2.0]])
>>> transformer = StandardFlexibleScaler().fit(X)
>>> transformer
StandardFlexibleScaler()
>>> transformer.transform(X)
array([[ 0.        , -0.56195149,  0.28097574],
       [-0.84292723,  0.28097574,  0.56195149],
       [ 0.84292723,  0.28097574, -0.84292723]])
>>> transformer.scale_ * transformer.transform(X)
array([[ 0., -2.,  1.],
       [-3.,  1.,  2.],
       [ 3.,  1., -3.]])
>>> transformer.scale_ * transformer.transform(X) + transformer.mean_
array([[ 1., -2.,  2.],
       [-2.,  1.,  3.],
       [ 4.,  1., -2.]])
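The default output above can be reproduced by hand, which also makes the difference from per-column scaling concrete. In this NumPy sketch the whole-matrix scale is taken to be the square root of the summed column variances (population variance, `ddof=0`). That choice is inferred from the numbers in the example rather than quoted from the implementation. The per-column variant shown for contrast is the usual StandardScaler-style scaling.

```python
import numpy as np

X = np.array([[1.0, -2.0, 2.0], [-2.0, 1.0, 3.0], [4.0, 1.0, -2.0]])
mean = X.mean(axis=0)
var = X.var(axis=0)  # population variance of each column

# Whole-matrix scaling: one scalar, so ratios between columns are preserved
scale_whole = np.sqrt(var.sum())
Z_whole = (X - mean) / scale_whole

# Per-column scaling: one factor per feature, every column gets unit variance
scale_cols = np.sqrt(var)
Z_cols = (X - mean) / scale_cols

print(Z_whole)
print(Z_whole.var(axis=0).sum())  # total variance is one
print(Z_cols.var(axis=0))         # each column has variance one
```

After whole-matrix scaling the column variances still stand in the same ratio as before (here 6 : 2 : 14/3), which is the property the class description refers to.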
- fit(X, y=None, sample_weight=None)[source]#
Compute mean and scaling to be applied for subsequent normalization.
- Parameters:
X (numpy.ndarray of shape (n_samples, n_features)) – The data used to compute the mean and standard deviation used for later scaling along the features axis.
y (None) – Ignored.
sample_weight (numpy.ndarray of shape (n_samples,)) – Weights for each sample. Sample weighting can be used to center (and scale) data using a weighted mean. Weights are internally normalized before preprocessing.
- Returns:
self (object) – Fitted scaler.
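To see what a weighted mean and variance amount to, the sketch below compares normalized weights against simply repeating a sample. It is a plain NumPy illustration under the assumption of a population-style weighted variance; the estimator used internally by the class may differ in detail.

```python
import numpy as np

X = np.array([[1.0, -2.0, 2.0], [-2.0, 1.0, 3.0], [4.0, 1.0, -2.0]])

# Give the third sample twice the weight of the others, then normalize
w = np.array([1.0, 1.0, 2.0])
w = w / w.sum()

mean_w = w @ X                  # weighted mean of each feature
var_w = w @ (X - mean_w) ** 2   # weighted variance of each feature

# Equivalent unweighted statistics with the third sample repeated once
X_rep = np.vstack([X, X[2]])
print(np.allclose(mean_w, X_rep.mean(axis=0)))
print(np.allclose(var_w, X_rep.var(axis=0)))
```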
- fit_transform(X, y=None, **fit_params)#
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new (ndarray array of shape (n_samples, n_features_new)) – Transformed array.
- get_metadata_routing()#
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
routing (MetadataRequest) – A MetadataRequest encapsulating routing information.
- get_params(deep=True)#
Get parameters for this estimator.
- Parameters:
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
params (dict) – Parameter names mapped to their values.
- inverse_transform(X_tr)[source]#
Scale back the data to the original representation.
- Parameters:
X_tr (numpy.ndarray of shape (n_samples, n_features)) – Transformed matrix
- Returns:
X (original matrix)
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') StandardFlexibleScaler #
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- set_inverse_transform_request(*, X_tr: bool | None | str = '$UNCHANGED$') StandardFlexibleScaler #
Request metadata passed to the inverse_transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to inverse_transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to inverse_transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
X_tr (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_tr parameter in inverse_transform.
- Returns:
self (object) – The updated object.
- set_output(*, transform=None)#
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
transform ({"default", "pandas", "polars"}, default=None) –
Configure output of transform and fit_transform.
"default": Default output format of a transformer
"pandas": DataFrame output
"polars": Polars output
None: Transform configuration is unchanged
Added in version 1.4: "polars" option was added.
- Returns:
self (estimator instance) – Estimator instance.
- set_params(**params)#
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self (estimator instance) – Estimator instance.
- set_transform_request(*, copy: bool | None | str = '$UNCHANGED$') StandardFlexibleScaler #
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
copy (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for copy parameter in transform.
- Returns:
self (object) – The updated object.
- transform(X, y=None, copy=None)[source]#
Normalize a vector based on previously computed mean and scaling.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the features axis.
y (None) – Ignored.
copy (bool, default=None) – Copy the input X or not.
- Returns:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Transformed array.