Utility Classes
Modified Gram Matrix \(\mathbf{\tilde{K}}\)
- skmatter.utils.pcovr_kernel(mixing, X, Y, **kernel_params)
Creates the PCovR modified kernel distances
\[\mathbf{\tilde{K}} = \alpha \mathbf{K} + (1 - \alpha) \mathbf{Y}\mathbf{Y}^T\]
The default kernel is the linear kernel, such that:
\[\mathbf{\tilde{K}} = \alpha \mathbf{X} \mathbf{X}^T + (1 - \alpha) \mathbf{Y}\mathbf{Y}^T\]
- Parameters:
mixing (float) – mixing parameter, as described in PCovR as \({\alpha}\)
X (numpy.ndarray of shape (n x m)) – Data matrix \(\mathbf{X}\)
Y (numpy.ndarray of shape (n x p)) – Array to include in biased selection when mixing < 1
kernel_params (dict, optional) – dictionary of arguments to pass to pairwise_kernels; if none are specified, the kernel is assumed to be linear
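For the default linear kernel, the formula above can be sketched in plain NumPy. This is a minimal illustration rather than the library's implementation; the helper name `modified_gram_linear` and the random data are hypothetical.

```python
import numpy as np


def modified_gram_linear(mixing, X, Y):
    """Hypothetical helper: alpha * X X^T + (1 - alpha) * Y Y^T."""
    K = X @ X.T  # linear kernel between all pairs of samples
    return mixing * K + (1.0 - mixing) * (Y @ Y.T)


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # n = 5 samples, m = 3 features
Y = rng.normal(size=(5, 2))  # n = 5 samples, p = 2 properties
K_tilde = modified_gram_linear(0.5, X, Y)
print(K_tilde.shape)  # (5, 5)
```

With mixing = 1 the result reduces to the plain Gram matrix of X; with mixing = 0 only Y contributes.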
Modified Covariance Matrix \(\mathbf{\tilde{C}}\)
- skmatter.utils.pcovr_covariance(mixing, X, Y, rcond=1e-12, return_isqrt=False, rank=None, random_state=0, iterated_power='auto')
Creates the PCovR modified covariance.
\[\mathbf{\tilde{C}} = \alpha \mathbf{X}^T \mathbf{X} + (1 - \alpha) \left(\left(\mathbf{X}^T \mathbf{X}\right)^{-\frac{1}{2}} \mathbf{X}^T \mathbf{\hat{Y}}\mathbf{\hat{Y}}^T \mathbf{X} \left(\mathbf{X}^T \mathbf{X}\right)^{-\frac{1}{2}}\right)\]
where \(\mathbf{\hat{Y}}\) are the properties obtained by linear regression.
- Parameters:
mixing (float) – mixing parameter, as described in PCovR as \({\alpha}\)
X (numpy.ndarray of shape (n x m)) – Data matrix \(\mathbf{X}\)
Y (numpy.ndarray of shape (n x p)) – Array to include in biased selection when mixing < 1
rcond (float, default=1E-12) – threshold below which eigenvalues will be considered 0
return_isqrt (bool, default=False) – Whether to return the calculated inverse square root of the covariance. Used when the inverse square root is needed and the pcovr_covariance has already been calculated.
rank (int, default=min(X.shape)) – number of eigenpairs to estimate the inverse square root with
random_state (int, default=0) – random seed to use for randomized svd
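The modified covariance can be sketched with a full eigendecomposition in NumPy. This is an illustration under stated assumptions, not the library's code: the helper name `modified_covariance` is hypothetical, `rcond` is treated here as an absolute eigenvalue threshold, and the randomized low-rank path controlled by `rank`, `random_state` and `iterated_power` is not reproduced.

```python
import numpy as np


def modified_covariance(mixing, X, Y, rcond=1e-12):
    """Hypothetical helper following the formula for C-tilde above."""
    C = X.T @ X
    evals, evecs = np.linalg.eigh(C)
    keep = evals > rcond  # assumption: absolute cutoff on eigenvalues
    # Inverse square root built from the retained eigenpairs only.
    C_isqrt = (evecs[:, keep] / np.sqrt(evals[keep])) @ evecs[:, keep].T
    C_inv = C_isqrt @ C_isqrt
    # Y_hat: properties obtained by linear regression of Y on X.
    Y_hat = X @ (C_inv @ X.T @ Y)
    B = C_isqrt @ X.T @ Y_hat
    return mixing * C + (1.0 - mixing) * (B @ B.T)


rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
Y = rng.normal(size=(10, 2))
C_tilde = modified_covariance(0.5, X, Y)
print(C_tilde.shape)  # (3, 3)
```

Setting mixing = 1 recovers the ordinary covariance \(\mathbf{X}^T \mathbf{X}\).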
Orthogonalizers for CUR
When computing non-iterative CUR, it is necessary to orthogonalize the input matrices after each selection. For this, we have supplied a feature and a sample orthogonalizer for feature and sample selection.
- skmatter.utils.X_orthogonalizer(x1, c=None, x2=None, tol=1e-12, copy=False)
Orthogonalizes a feature matrix by the given columns.
Can be used to orthogonalize by samples by calling X = X_orthogonalizer(X.T, row_index).T. After orthogonalization, each column of X will contain only what is orthogonal to X[:, c] or x2.
- Parameters:
x1 (numpy.ndarray of shape (n x m)) – feature matrix to orthogonalize
c (int, less than m, default=None) – index of the column to orthogonalize by
x2 (numpy.ndarray of shape (n x a), default=x1[:, c]) – a separate set of columns to orthogonalize with respect to. Note: the orthogonalizer works column-by-column in column-index order.
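The single-column case can be pictured as projecting one column out of every column of the matrix. The sketch below is a minimal NumPy illustration of that projection, not the library's implementation; the helper name `orthogonalize_by_column` is hypothetical.

```python
import numpy as np


def orthogonalize_by_column(X, c, tol=1e-12):
    """Hypothetical helper: remove the component along X[:, c] from all columns."""
    X = X.copy()
    col = X[:, c].copy()
    norm_sq = col @ col
    if norm_sq > tol:  # skip columns that are numerically zero
        X -= np.outer(col, col @ X) / norm_sq
    return X


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
X_orth = orthogonalize_by_column(X, c=1)
# Every remaining column is orthogonal to the original column 1.
print(np.allclose(X[:, 1] @ X_orth, 0.0))  # True
```

Orthogonalizing by a sample instead follows the transposition pattern described above: `orthogonalize_by_column(X.T, row_index).T`.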
- skmatter.utils.Y_feature_orthogonalizer(y, X, tol=1e-12, copy=True)
Orthogonalizes a property matrix given the selected features in \(\mathbf{X}\).
\[\mathbf{Y} \leftarrow \mathbf{Y} - \mathbf{X} \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T \mathbf{Y}\]
- Parameters:
y (numpy.ndarray of shape (n_samples x n_properties)) – property matrix
X (numpy.ndarray of shape (n_samples x n_features)) – feature matrix
tol (float) – cutoff for small eigenvalues to send to np.linalg.pinv
copy (bool) – whether to return a copy of y or edit in-place, default=True
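The update above amounts to subtracting the least-squares fit of the properties on the selected features. A minimal NumPy sketch follows, assuming the inverse is taken as a pseudo-inverse with `tol` as its cutoff; the helper name `orthogonalize_targets_by_features` is hypothetical.

```python
import numpy as np


def orthogonalize_targets_by_features(Y, X, tol=1e-12):
    """Hypothetical helper: Y - X (X^T X)^+ X^T Y."""
    weights = np.linalg.pinv(X.T @ X, rcond=tol) @ X.T @ Y
    return Y - X @ weights


rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))  # n_samples x n_features
Y = rng.normal(size=(8, 2))  # n_samples x n_properties
Y_orth = orthogonalize_targets_by_features(Y, X)
# The residual carries no component along any column of X.
print(np.allclose(X.T @ Y_orth, 0.0, atol=1e-8))  # True
```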
- skmatter.utils.Y_sample_orthogonalizer(y, X, y_ref, X_ref, tol=1e-12, copy=True)
Orthogonalizes a matrix of targets \({\mathbf{Y}}\) given a reference feature matrix \({\mathbf{X}_r}\) and reference target matrix \({\mathbf{Y}_r}\):
\[\mathbf{Y} \leftarrow \mathbf{Y} - \mathbf{X} \left(\mathbf{X}_{\mathbf{r}}^T \mathbf{X}_{\mathbf{r}}\right)^{-1}\mathbf{X}_{\mathbf{r}}^T \mathbf{Y}_{\mathbf{r}}\]
- Parameters:
y (numpy.ndarray of shape (n_samples x n_properties)) – property matrix
X (numpy.ndarray of shape (n_samples x n_features)) – feature matrix
y_ref (numpy.ndarray of shape (n_ref x n_properties)) – reference property matrix
X_ref (numpy.ndarray of shape (n_ref x n_features)) – reference feature matrix
tol (float) – cutoff for small eigenvalues to send to np.linalg.pinv
copy (bool) – whether to return a copy of y or edit in-place, default=True
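Here the regression weights come from the reference samples only and are then applied to all samples. The sketch below illustrates this in NumPy under the same pseudo-inverse assumption; the helper name `orthogonalize_targets_by_reference` and the choice of reference rows are hypothetical.

```python
import numpy as np


def orthogonalize_targets_by_reference(Y, X, Y_ref, X_ref, tol=1e-12):
    """Hypothetical helper: Y - X (X_r^T X_r)^+ X_r^T Y_r."""
    weights = np.linalg.pinv(X_ref.T @ X_ref, rcond=tol) @ X_ref.T @ Y_ref
    return Y - X @ weights


rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
Y = rng.normal(size=(8, 2))
ref_rows = [0, 2, 5, 7]  # illustrative set of already-selected samples
Y_orth = orthogonalize_targets_by_reference(Y, X, Y[ref_rows], X[ref_rows])
print(Y_orth.shape)  # (8, 2)
```

On the reference rows themselves, the residual is orthogonal to the reference features, which is what the sample-selection step relies on.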
Random Partitioning with Overlaps
- skmatter.model_selection.train_test_split(*arrays, **options)
Extended version of the sklearn train test split supporting overlapping train and test sets.
See sklearn.model_selection.train_test_split.
- Parameters:
*arrays (sequence of indexables with same length / shape[0]) – Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.
test_size (float or int, default=None) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.
train_size (float or int, default=None) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
random_state (int or numpy.random.RandomState instance, default=None) – Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls. See the random state glossary from sklearn.
shuffle (bool, default=True) – Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.
stratify (array-like, default=None) – If not None, data is split in a stratified fashion, using this as the class labels.
train_test_overlap (bool, default=False) – If True, and train and test set are both not None, the train and test set may overlap.
- Returns:
splitting (list, length=2 * len(arrays)) – List containing train-test split of inputs.
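To picture what an overlapping split means, the sketch below draws train and test indices independently, so the same sample can land in both sets. It only illustrates the concept with NumPy; it is not how skmatter implements `train_test_overlap=True`, and the helper name `overlapping_split_indices` is hypothetical.

```python
import numpy as np


def overlapping_split_indices(n_samples, n_train, n_test, seed=0):
    """Hypothetical helper: draw train and test indices independently."""
    rng = np.random.default_rng(seed)
    train_idx = rng.permutation(n_samples)[:n_train]
    test_idx = rng.permutation(n_samples)[:n_test]
    return train_idx, test_idx


train_idx, test_idx = overlapping_split_indices(10, n_train=8, n_test=6)
shared = np.intersect1d(train_idx, test_idx)
print(len(shared) >= 4)  # True: 8 + 6 - 10 = 4 indices must be shared
```

Without overlap, the train and test sizes together cannot exceed the number of samples; allowing overlap lifts that restriction.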
Effective Dimension of Covariance Matrix
- skmatter.utils.effdim(cov)
Calculate the effective dimension of a covariance matrix based on Shannon entropy.
- Parameters:
cov (numpy.ndarray) – The covariance matrix.
- Returns:
float – The effective dimension of the covariance matrix.
Examples
>>> import numpy as np
>>> from skmatter.utils import effdim
>>> cov = np.array([[25, 15, -5], [15, 18, 0], [-5, 0, 11]], dtype=np.float64)
>>> print(round(effdim(cov), 3))
2.214
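One common way to define an entropy-based effective dimension is the exponential of the Shannon entropy of the normalized eigenvalue spectrum. The NumPy sketch below assumes that definition and a symmetric input; the helper name `effective_dimension` is hypothetical and this is not taken from the library source.

```python
import numpy as np


def effective_dimension(cov):
    """Hypothetical helper: exp of the Shannon entropy of the eigenvalue spectrum."""
    evals = np.linalg.eigvalsh(cov)  # assumes cov is symmetric
    evals = evals[evals > 0]         # zero eigenvalues contribute nothing
    p = evals / evals.sum()          # normalize to a probability distribution
    return float(np.exp(-np.sum(p * np.log(p))))


cov = np.array([[25, 15, -5], [15, 18, 0], [-5, 0, 11]], dtype=np.float64)
print(round(effective_dimension(cov), 3))  # approximately 2.214
print(effective_dimension(np.eye(3)))      # equal eigenvalues give the full dimension
```

A spectrum dominated by one eigenvalue gives a value close to 1, while equal eigenvalues give the matrix size.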
Oracle Approximating Shrinkage#
- skmatter.utils.oas(cov: ndarray, n: float, D: int) -> ndarray
Oracle approximating shrinkage (OAS) estimator.
- Parameters:
cov (numpy.ndarray) – A covariance matrix
n (float) – The local population
D (int) – Dimension
Examples
>>> import numpy as np
>>> from skmatter.utils import oas
>>> cov = np.array([[0.5, 1.0], [0.7, 0.4]])
>>> oas(cov, 10, 2)
array([[0.48903924, 0.78078484],
       [0.54654939, 0.41096076]])
- Returns:
np.ndarray – Covariance matrix
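Shrinkage of this kind blends the input matrix with a scaled identity, weighted by a coefficient computed from traces. The sketch below uses one such coefficient that reproduces the documented 2 x 2 example; the exact formula is an assumption (it has only been checked against that example), and the helper name `oas_shrinkage` is hypothetical.

```python
import numpy as np


def oas_shrinkage(cov, n, D):
    """Hypothetical helper: shrink cov toward a scaled identity."""
    tr = np.trace(cov)
    tr_sq = tr**2
    tr_cov_sq = np.trace(cov**2)  # assumption: element-wise square, then trace
    # Shrinkage coefficient (assumed form, checked only on the example below).
    phi = ((1 - 2 / D) * tr_cov_sq + tr_sq) / (
        (n + 1 - 2 / D) * tr_cov_sq - tr_sq / D
    )
    return (1 - phi) * cov + phi * np.eye(D) * tr / D


cov = np.array([[0.5, 1.0], [0.7, 0.4]])
shrunk = oas_shrinkage(cov, 10, 2)
print(shrunk)
```

The trace of the matrix is preserved by construction, since the identity term is scaled by tr / D.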