Datasets#
CSD-1000R#
This dataset, intended for model testing, contains the SOAP power spectrum features and local NMR chemical shieldings for 100 environments selected from CSD-1000r, originally published in [Ceriotti2019].
Function Call#
- skmatter.datasets.load_csd_1000r()#
Data Set Characteristics#
- Number of Instances:
  100 per representation
- Number of Features:
  100 per representation
The representations were computed with [C1] using the hyperparameters:
- rascal hyperparameters:
| key | value |
| --- | --- |
| interaction_cutoff | 3.5 |
| max_radial | 6 |
| max_angular | 6 |
| gaussian_sigma_constant | 0.4 |
| gaussian_sigma_type | "Constant" |
| cutoff_smooth_width | 0.5 |
| normalize | True |
Of the 2,520 resulting features, 100 were selected via FPS using [C2].
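Farthest point sampling (FPS) greedily picks the feature column farthest from those already selected. As a rough illustration of the procedure (a plain-NumPy sketch on random stand-in data, not the implementation in [C2]):

```python
import numpy as np

def fps_columns(X, n_select, start=0):
    """Greedy farthest point sampling over the columns of X."""
    selected = [start]
    # Distance from every column to its nearest already-selected column.
    dist = np.linalg.norm(X.T - X[:, start], axis=1)
    for _ in range(n_select - 1):
        nxt = int(np.argmax(dist))
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X.T - X[:, nxt], axis=1))
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 40))  # toy stand-in for the 100 x 2520 SOAP matrix
idx = fps_columns(X, 10)
X_reduced = X[:, idx]
```

Each iteration adds the column that maximizes the minimum distance to the current selection, which is what makes the selected features maximally diverse.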
Chemical Properties#
The CSD-1000R dataset consists of 100 atomic environments selected from crystal structures in the Cambridge Structural Database (CSD) [C3]. These environments represent a diverse set of chemical compositions and bonding types, including:
- Metals, metalloids, and non-metals
- Covalent, ionic, and metallic bonding environments
- Various coordination numbers and geometries
The dataset captures local chemical environments relevant for modeling properties such as nuclear magnetic resonance (NMR) chemical shieldings, aiding in the understanding of structure-property relationships in materials chemistry.
For more detailed chemical information, users can refer to the original Cambridge Structural Database [C3] or the publication by Ceriotti et al. (2019) [C4].
References#
- [C1] lab-cosmo/librascal commit ade202a6
- [C2] lab-cosmo/scikit-matter commit 4ed1d92
Reference Code#
import numpy as np

from ase.io import read

# SOAP implementation from librascal [C1]
from rascal.representations import SphericalInvariants as SOAP

from skmatter.feature_selection import CUR
from skmatter.preprocessing import StandardFlexibleScaler
from skmatter.sample_selection import FPS

# read all of the frames and book-keep the centers and species
filename = "/path/to/CSD-1000R.xyz"
frames = np.asarray(
    read(filename, ":"),
    dtype=object,
)

n_centers = np.array([len(frame) for frame in frames])
center_idx = np.array([i for i, f in enumerate(frames) for p in f])
n_env_accum = np.zeros(len(frames) + 1, dtype=int)
n_env_accum[1:] = np.cumsum(n_centers)

numbers = np.concatenate([frame.numbers for frame in frames])

# compute radial soap vectors as first pass
hypers = dict(
    soap_type="PowerSpectrum",
    interaction_cutoff=2.5,
    max_radial=6,
    max_angular=0,
    gaussian_sigma_type="Constant",
    gaussian_sigma_constant=0.4,
    cutoff_smooth_width=0.5,
    normalize=False,
    global_species=[1, 6, 7, 8],
    expansion_by_species_method="user defined",
)
soap = SOAP(**hypers)

X_raw = StandardFlexibleScaler(column_wise=False).fit_transform(
    soap.transform(frames).get_features(soap)
)

# rank the environments in terms of diversity
n_samples = 500
i_selected = FPS(n_to_select=n_samples, initialize=0).fit(X_raw).selected_idx_

# book-keep which frames these samples belong in
f_selected = center_idx[i_selected]
reduced_f_selected = list(sorted(set(f_selected)))
frames_selected = frames[f_selected].copy()
ci_selected = i_selected - n_env_accum[f_selected]

properties_select = [
    frames[fi].arrays["CS_local"][ci] for fi, ci in zip(f_selected, ci_selected)
]
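The `n_env_accum` bookkeeping above maps a global environment index back to its frame and within-frame center. In isolation, the idea looks like this (toy frame sizes, not the real dataset):

```python
import numpy as np

# Toy frame sizes: 3 frames with 2, 3, and 1 atoms -> 6 environments total.
n_centers = np.array([2, 3, 1])
n_env_accum = np.zeros(len(n_centers) + 1, dtype=int)
n_env_accum[1:] = np.cumsum(n_centers)  # [0, 2, 5, 6]

# Global environment index 4 falls in the frame whose accumulated range
# contains it; the within-frame center is the offset from that frame's start.
i = 4
frame = np.searchsorted(n_env_accum, i, side="right") - 1
center = i - n_env_accum[frame]
print(frame, center)  # prints: 1 2
```

This is the inverse of the `ci_selected = i_selected - n_env_accum[f_selected]` step in the snippet above.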
Degenerate CH4 manifold#
The dataset contains two representations (SOAP power spectrum and bispectrum) of the two manifolds spanned by the carbon atoms of 2 × 81 methane structures. In the SOAP power spectrum representation, the two manifolds intersect, creating a degenerate manifold/line along which the representation remains the same. In contrast, for higher body-order representations such as the (SOAP) bispectrum, the carbon atoms are uniquely represented and no degenerate manifold arises. Following the naming convention of [Pozdnyakov2020], for each representation the first 81 samples correspond to the X minus manifold and the second 81 samples to the X plus manifold.
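The degeneracy phenomenon is easiest to see in a toy setting: distinct structures can share their entire multiset of pairwise distances, so any descriptor built from distances alone (a low body-order representation) cannot separate them. The 1-D point sets below are a classic homometric pair, used here purely as an analogue of the CH4 manifolds, not as data from this dataset:

```python
from itertools import combinations

def pair_distances(points):
    """Sorted multiset of pairwise distances for a 1-D point set."""
    return sorted(abs(a - b) for a, b in combinations(points, 2))

# Two distinct 1-D "structures" with identical pairwise-distance multisets:
# a distance-only descriptor assigns them the same representation.
A = [0, 1, 4, 10, 12, 17]
B = [0, 1, 8, 11, 13, 17]
print(pair_distances(A) == pair_distances(B))  # True
print(A == B)                                  # False
```

Higher body-order features (triplets and beyond, as in the bispectrum) break this kind of degeneracy.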
Function Call#
- skmatter.datasets.load_degenerate_CH4_manifold()#
Data Set Characteristics#
- Number of Instances:
  162 per representation
- Number of Features:
  12 per representation
The representations were computed with [D1] using the hyperparameters:
- rascal hyperparameters:
| key | value |
| --- | --- |
| radial_basis | "GTO" |
| interaction_cutoff | 4 |
| max_radial | 2 |
| max_angular | 2 |
| gaussian_sigma_constant | 0.5 |
| gaussian_sigma_type | "Constant" |
| cutoff_smooth_width | 0.5 |
| normalize | False |
The SOAP bispectrum features were additionally reduced to 12 features with principal component analysis (PCA) [D2].
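A PCA reduction of this kind can be sketched in plain NumPy via the SVD of the centered feature matrix (random stand-in data below, not the actual bispectrum features or the implementation in [D2]):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(162, 30))  # stand-in for the raw bispectrum features

# PCA: center the data, then project onto the top 12 right singular vectors.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_pca = Xc @ Vt[:12].T
print(X_pca.shape)  # (162, 12)
```

The singular values are sorted in decreasing order, so the retained columns capture the directions of largest variance.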
References#
H2O-BLYP-Piglet#
This dataset contains 27233 hydrogen bond descriptors and corresponding weights from a trajectory of a classical simulation performed with the BLYP exchange-correlation functional and a DZVP basis set. The simulation box contained 64 water molecules. This dataset was originally published in [Gasparotto2014].
Function Call#
- skmatter.datasets.load_hbond_dataset()#
Data Set Characteristics#
- Number of Instances:
27233
- Number of Features:
3
References#
- [1] lab-cosmo/pamm
Reference Code#
- [3] lab-cosmo/pamm
NICE dataset#
This is a toy dataset containing NICE (N-body Iterative Contraction of Equivariants) [1, 4] features for the first 500 configurations of the randomly displaced methane dataset [2, 3].
Function Call#
- skmatter.datasets.load_nice_dataset()#
Data Set Characteristics#
- Number of Instances:
500
- Number of Features:
160
The representations were computed with the NICE package [4] using the following definition of the NICE calculator:
StandardSequence(
    [
        StandardBlock(
            ThresholdExpansioner(num_expand=150),
            CovariantsPurifierBoth(max_take=10),
            IndividualLambdaPCAsBoth(n_components=50),
            ThresholdExpansioner(num_expand=300, mode="invariants"),
            InvariantsPurifier(max_take=50),
            InvariantsPCA(n_components=30),
        ),
        StandardBlock(
            ThresholdExpansioner(num_expand=150),
            CovariantsPurifierBoth(max_take=10),
            IndividualLambdaPCAsBoth(n_components=50),
            ThresholdExpansioner(num_expand=300, mode="invariants"),
            InvariantsPurifier(max_take=50),
            InvariantsPCA(n_components=20),
        ),
        StandardBlock(
            None,
            None,
            None,
            ThresholdExpansioner(num_expand=300, mode="invariants"),
            InvariantsPurifier(max_take=50),
            InvariantsPCA(n_components=20),
        ),
    ],
    initial_scaler=InitialScaler(mode="signal integral", individually=True),
)
References#
- [1] Jigyasa Nigam, Sergey Pozdnyakov, and Michele Ceriotti. "Recursive evaluation and iterative contraction of N-body equivariant features." The Journal of Chemical Physics 153.12 (2020): 121101.
- [2] Sergey N. Pozdnyakov, Michael J. Willatt, Albert P. Bartók, Christoph Ortner, Gábor Csányi, and Michele Ceriotti. "Incompleteness of Atomic Structure Representations." Physical Review Letters 125.16 (2020): 166001.
Reference Code#
[4] lab-cosmo/nice
WHO dataset#
who_dataset.csv is a compilation of multiple publicly available datasets from data.worldbank.org. Specifically, the following versioned datasets are used:
- NY.GDP.PCAP.CD (v2_4770383) [1]
- SE.XPD.TOTL.GD.ZS (v2_4773094) [2]
- SH.DYN.AIDS.ZS (v2_4770518) [3]
- SH.IMM.IDPT (v2_4770682) [4]
- SH.IMM.MEAS (v2_4774112) [5]
- SH.TBS.INCD (v2_4770775) [6]
- SH.XPD.CHEX.GD.ZS (v2_4771258) [7]
- SN.ITK.DEFC.ZS (v2_4771336) [8]
- SP.DYN.LE00.IN (v2_4770556) [9]
- SP.POP.TOTL (v2_4770385) [10]
where the corresponding file names are API_{dataset}_DS2_excel_en_{version}.xls.
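Given this naming scheme, the indicator code can be sliced straight out of a file name: it sits between the `API_` prefix and the next underscore. For example (using one of the file names implied by the list above):

```python
# World Bank file names follow API_{dataset}_DS2_excel_en_{version}.xls,
# so the indicator code is the substring between "API_" and the next "_".
name = "API_SP.POP.TOTL_DS2_excel_en_v2_4770385.xls"
code = name[4 : name[4:].index("_") + 4]
print(code)  # SP.POP.TOTL
```

This is the same slicing expression used in the compilation script under Reference Code.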
This dataset, intended only for demonstration, contains 2020 country-year pairings and the corresponding values above.

Function Call#
- skmatter.datasets.load_who_dataset()#
Data Set Characteristics#
- Number of Instances:
2020
- Number of Features:
10
References#
Reference Code#
The dataset was compiled with the following script, where the source files have been placed in a folder named who_data:
import os

import numpy as np
import pandas as pd

files = os.listdir("who_data/")
indicators = [f[4 : f[4:].index("_") + 4] for f in files]

indicator_codes = {}
data_dict = {}
entries = []

for file in files:
    data = pd.read_excel(
        "who_data/" + file,
        header=3,
        sheet_name="Data",
        index_col=0,
    )
    indicator = data["Indicator Code"].values[0]
    indicator_codes[indicator] = data["Indicator Name"].values[0]

    for index in data.index:
        for year in range(1900, 2022):
            if str(year) in data.loc[index] and not np.isnan(
                data.loc[index].loc[str(year)]
            ):
                if (index, year) not in data_dict:
                    data_dict[(index, year)] = np.nan * np.ones(len(indicators))
                data_dict[(index, year)][indicators.index(indicator)] = data.loc[
                    index
                ].loc[str(year)]

with open("who_data.csv", "w") as outf:
    outf.write("Country,Year," + ",".join(indicators) + "\n")
    for key, data in data_dict.items():
        if np.count_nonzero(~np.isnan(np.array(data, dtype=float))) == len(
            indicators
        ):
            outf.write(
                "{},{},{}\n".format(
                    key[0].replace(",", " "),
                    key[1],
                    ",".join([str(d) for d in data]),
                )
            )
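The final filter in the script keeps a (country, year) row only when every indicator has a value. Isolated from the script, the check behaves like this:

```python
import numpy as np

def is_complete(row, n_indicators):
    """True when the row has no NaN entries, i.e. all indicators are present."""
    return np.count_nonzero(~np.isnan(np.array(row, dtype=float))) == n_indicators

print(is_complete([1.0, 2.0, 3.0], 3))     # True
print(is_complete([1.0, np.nan, 3.0], 3))  # False
```

Rows with any missing indicator are silently dropped, which is why the compiled CSV contains fewer country-year pairings than the raw World Bank files.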