Datasets#

CSD-1000R#

This dataset, intended for model testing, contains the SOAP power spectrum features and local NMR chemical shieldings for 100 environments selected from CSD-1000R, originally published in [Ceriotti2019].

Function Call#

skmatter.datasets.load_csd_1000r()#

Data Set Characteristics#

Number of Instances: 100 (per representation)

Number of Features: 100 (per representation)

The representations were computed with [C1] using the hyperparameters:

rascal hyperparameters:

interaction_cutoff: 3.5
max_radial: 6
max_angular: 6
gaussian_sigma_constant: 0.4
gaussian_sigma_type: "Constant"
cutoff_smooth_width: 0.5
normalize: True

Of the 2,520 resulting features, 100 were selected via FPS using [C2].
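Farthest point sampling (FPS) greedily picks, at each step, the feature (column) farthest from all previously selected ones. The following is a minimal NumPy sketch of that selection principle only; in practice skmatter's own FPS classes should be used, as in the reference code below.

```python
import numpy as np

def fps_columns(X, n_select, start=0):
    """Greedy farthest-point sampling over the columns of X.

    Toy illustration of the FPS selection principle, not skmatter's
    implementation.
    """
    selected = [start]
    # squared distance from every column to the closest selected column
    d = np.sum((X.T - X[:, start]) ** 2, axis=1)
    while len(selected) < n_select:
        pick = int(np.argmax(d))  # farthest column from the selected set
        selected.append(pick)
        d = np.minimum(d, np.sum((X.T - X[:, pick]) ** 2, axis=1))
    return np.array(selected)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))  # 20 samples, 50 features
idx = fps_columns(X, n_select=5)
```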

Chemical Properties#

The CSD-1000R dataset consists of 100 atomic environments selected from crystal structures in the Cambridge Structural Database (CSD) [C3]. These environments represent a diverse set of chemical compositions and bonding types, including:

  • Metals, metalloids, and non-metals

  • Covalent, ionic, and metallic bonding environments

  • Various coordination numbers and geometries

The dataset captures local chemical environments relevant for modeling properties such as nuclear magnetic resonance (NMR) chemical shieldings, aiding in the understanding of structure-property relationships in materials chemistry.

For more detailed chemical information, users can refer to the original Cambridge Structural Database [C3] or the publication by Ceriotti et al. (2019) [C4].

References#

Reference Code#

import numpy as np

from ase.io import read
from rascal.representations import SphericalInvariants as SOAP

from skmatter.feature_selection import CUR
from skmatter.preprocessing import StandardFlexibleScaler
from skmatter.sample_selection import FPS

# read all of the frames and book-keep the centers and species
filename = "/path/to/CSD-1000R.xyz"
frames = np.asarray(
    read(filename, ":"),
    dtype=object,
)

n_centers = np.array([len(frame) for frame in frames])
center_idx = np.array([i for i, frame in enumerate(frames) for _ in frame])
n_env_accum = np.zeros(len(frames) + 1, dtype=int)
n_env_accum[1:] = np.cumsum(n_centers)

numbers = np.concatenate([frame.numbers for frame in frames])

# compute radial soap vectors as first pass
hypers = dict(
    soap_type="PowerSpectrum",
    interaction_cutoff=2.5,
    max_radial=6,
    max_angular=0,
    gaussian_sigma_type="Constant",
    gaussian_sigma_constant=0.4,
    cutoff_smooth_width=0.5,
    normalize=False,
    global_species=[1, 6, 7, 8],
    expansion_by_species_method="user defined",
)
soap = SOAP(**hypers)

X_raw = StandardFlexibleScaler(column_wise=False).fit_transform(
    soap.transform(frames).get_features(soap)
)

# rank the environments in terms of diversity
n_samples = 500
i_selected = FPS(n_to_select=n_samples, initialize=0).fit(X_raw).selected_idx_

# book-keep which frames these samples belong in
f_selected = center_idx[i_selected]
reduced_f_selected = list(sorted(set(f_selected)))
frames_selected = frames[f_selected].copy()
ci_selected = i_selected - n_env_accum[f_selected]

properties_select = [
    frames[fi].arrays["CS_local"][ci] for fi, ci in zip(f_selected, ci_selected)
]

Degenerate CH4 manifold#

The dataset contains two representations (SOAP power spectrum and bispectrum) of the two manifolds spanned by the carbon atoms of two sets of 81 methane structures. In the SOAP power spectrum representation the two manifolds intersect, creating a degenerate manifold/line along which the representation remains the same. In contrast, for higher body-order representations such as the (SOAP) bispectrum, the carbon atoms can be uniquely represented and do not create a degenerate manifold. Following the naming convention of [Pozdnyakov2020], for each representation the first 81 samples correspond to the X minus manifold and the second 81 samples to the X plus manifold.
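Splitting a representation into the two manifolds is plain NumPy slicing. The array below is a placeholder standing in for either loaded representation; its shape is taken from the dataset characteristics listed in this section.

```python
import numpy as np

# Placeholder for one of the two representations returned by
# load_degenerate_CH4_manifold(); the real arrays have shape (162, 12).
X = np.zeros((162, 12))

X_minus = X[:81]  # "X minus" manifold (first 81 samples)
X_plus = X[81:]   # "X plus" manifold (second 81 samples)
```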

Function Call#

skmatter.datasets.load_degenerate_CH4_manifold()#

Data Set Characteristics#

Number of Instances: 162 (per representation)

Number of Features: 12 (per representation)

The representations were computed with [D1] using the hyperparameters:

rascal hyperparameters:

radial_basis: "GTO"
interaction_cutoff: 4
max_radial: 2
max_angular: 2
gaussian_sigma_constant: 0.5
gaussian_sigma_type: "Constant"
cutoff_smooth_width: 0.5
normalize: False

The SOAP bispectrum features were additionally reduced to 12 features with principal component analysis (PCA) [D2].
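The PCA reduction amounts to projecting the centered features onto their leading principal axes. The following is a minimal NumPy sketch of that operation; the input width is illustrative, not the actual raw bispectrum size, and the published features were produced with [D2].

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X onto its first n_components principal axes."""
    Xc = X - X.mean(axis=0)  # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T  # scores in the principal-component basis

rng = np.random.default_rng(0)
X_bispectrum = rng.normal(size=(162, 300))  # stand-in for raw bispectrum features
X_reduced = pca_reduce(X_bispectrum, n_components=12)
```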

References#

H2O-BLYP-Piglet#

This dataset contains 27,233 hydrogen bond descriptors and corresponding weights from a trajectory of a classical simulation performed with the BLYP exchange-correlation functional and a DZVP basis set. The simulation box contained 64 water molecules. This dataset was originally published in [Gasparotto2014].
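The three features are geometric descriptors of a donor-hydrogen-acceptor triplet. As an illustration only, a common choice in proton-transfer analyses is the triplet (nu, mu, r) sketched below; consult [Gasparotto2014] for the definitions actually used for this dataset.

```python
import numpy as np

def hbond_descriptors(d_pos, h_pos, a_pos):
    """Geometric descriptors for a donor(D)-hydrogen(H)-acceptor(A) triplet.

    Illustrative definitions; check [Gasparotto2014] for the exact
    descriptors behind this dataset.
    """
    d_dh = np.linalg.norm(h_pos - d_pos)  # donor-hydrogen distance
    d_ah = np.linalg.norm(h_pos - a_pos)  # acceptor-hydrogen distance
    r = np.linalg.norm(a_pos - d_pos)     # donor-acceptor distance
    nu = d_dh - d_ah                      # proton-transfer coordinate
    mu = d_dh + d_ah                      # proton-sharing coordinate
    return nu, mu, r

nu, mu, r = hbond_descriptors(
    np.array([0.0, 0.0, 0.0]),  # donor oxygen
    np.array([1.0, 0.0, 0.0]),  # hydrogen
    np.array([2.8, 0.0, 0.0]),  # acceptor oxygen
)
```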

Function Call#

skmatter.datasets.load_hbond_dataset()#

Data Set Characteristics#

Number of Instances: 27,233

Number of Features: 3

References#

[1] lab-cosmo/pamm

Reference Code#

[2] GardevoirX/pypamm

[3] lab-cosmo/pamm

NICE dataset#

This is a toy dataset containing NICE [1, 4] (N-body Iterative Contraction of Equivariants) features for the first 500 configurations of a dataset [2, 3] of randomly displaced methane configurations.

Function Call#

skmatter.datasets.load_nice_dataset()#

Data Set Characteristics#

Number of Instances: 500

Number of Features: 160

The representations were computed with the NICE package [4] using the following definition of the NICE calculator:

StandardSequence(
    [
        StandardBlock(
            ThresholdExpansioner(num_expand=150),
            CovariantsPurifierBoth(max_take=10),
            IndividualLambdaPCAsBoth(n_components=50),
            ThresholdExpansioner(num_expand=300, mode="invariants"),
            InvariantsPurifier(max_take=50),
            InvariantsPCA(n_components=30),
        ),
        StandardBlock(
            ThresholdExpansioner(num_expand=150),
            CovariantsPurifierBoth(max_take=10),
            IndividualLambdaPCAsBoth(n_components=50),
            ThresholdExpansioner(num_expand=300, mode="invariants"),
            InvariantsPurifier(max_take=50),
            InvariantsPCA(n_components=20),
        ),
        StandardBlock(
            None,
            None,
            None,
            ThresholdExpansioner(num_expand=300, mode="invariants"),
            InvariantsPurifier(max_take=50),
            InvariantsPCA(n_components=20),
        ),
    ],
    initial_scaler=InitialScaler(mode="signal integral", individually=True),
)
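Each StandardBlock above alternates an expansion step, which raises the body order by combining features, with purification and PCA steps, which contract the result back to a manageable width. A toy, invariants-only analogue of this expand-then-contract loop in plain NumPy (not the NICE algorithm itself) is:

```python
import numpy as np

def expand(X, num_expand):
    """Raise body order: pairwise products of the highest-variance features."""
    order = np.argsort(X.var(axis=0))[::-1][:num_expand]
    pairs = X[:, order, None] * X[:, None, order]
    return pairs.reshape(len(X), -1)

def contract(X, n_components):
    """PCA-style contraction back to n_components features."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))  # stand-in for body-order-1 invariants
for _ in range(2):             # two "blocks" of expand -> contract
    X = contract(expand(X, num_expand=4), n_components=8)
```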

References#

[1] Jigyasa Nigam, Sergey Pozdnyakov, and Michele Ceriotti. "Recursive evaluation and iterative contraction of N-body equivariant features." The Journal of Chemical Physics 153.12 (2020): 121101.

[2] Sergey N. Pozdnyakov, Michael J. Willatt, Albert P. Bartók, Christoph Ortner, Gábor Csányi, and Michele Ceriotti. "Incompleteness of Atomic Structure Representations."

[3] https://archive.materialscloud.org/record/2020.110

Reference Code#

[4] lab-cosmo/nice

WHO dataset#

who_dataset.csv is a compilation of multiple publicly available datasets from data.worldbank.org. Specifically, the following versioned datasets are used:

  • NY.GDP.PCAP.CD (v2_4770383) [1]

  • SE.XPD.TOTL.GD.ZS (v2_4773094) [2]

  • SH.DYN.AIDS.ZS (v2_4770518) [3]

  • SH.IMM.IDPT (v2_4770682) [4]

  • SH.IMM.MEAS (v2_4774112) [5]

  • SH.TBS.INCD (v2_4770775) [6]

  • SH.XPD.CHEX.GD.ZS (v2_4771258) [7]

  • SN.ITK.DEFC.ZS (v2_4771336) [8]

  • SP.DYN.LE00.IN (v2_4770556) [9]

  • SP.POP.TOTL (v2_4770385) [10]

where the corresponding file names are API_{dataset}_DS2_excel_en_{version}.xls.
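As a quick check, the naming rule above can be applied to the indicator/version pairs listed (two shown here):

```python
# Two of the indicator/version pairs from the list above
datasets = {
    "NY.GDP.PCAP.CD": "v2_4770383",
    "SP.POP.TOTL": "v2_4770385",
}
filenames = [
    "API_{dataset}_DS2_excel_en_{version}.xls".format(dataset=d, version=v)
    for d, v in datasets.items()
]
```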

This dataset, intended only for demonstration, contains 2020 country-year pairings and the corresponding values for the indicators listed above.

Function Call#

skmatter.datasets.load_who_dataset()#

Data Set Characteristics#

Number of Instances: 2020

Number of Features: 10

References#

Reference Code#

The following script is compiled, where the datasets have been placed in a folder named who_data:

import os
import pandas as pd
import numpy as np

files = os.listdir("who_data/")
indicators = [f[4 : f[4:].index("_") + 4] for f in files]
indicator_codes = {}
data_dict = {}
entries = []

for file in files:
    data = pd.read_excel(
        "who_data/" + file,
        header=3,
        sheet_name="Data",
        index_col=0,
    )

    indicator = data["Indicator Code"].values[0]
    indicator_codes[indicator] = data["Indicator Name"].values[0]

    for index in data.index:
        for year in range(1900, 2022):
            if str(year) in data.loc[index] and not np.isnan(
                data.loc[index].loc[str(year)]
            ):
                if (index, year) not in data_dict:
                    data_dict[(index, year)] = np.nan * np.ones(len(indicators))
                data_dict[(index, year)][indicators.index(indicator)] = data.loc[
                    index
                ].loc[str(year)]

with open("who_data.csv", "w") as outf:
    outf.write("Country,Year," + ",".join(indicators) + "\n")
    for key, data in data_dict.items():
        if np.count_nonzero(~np.isnan(np.array(data, dtype=float))) == len(
            indicators
        ):
            outf.write(
                "{},{},{}\n".format(
                    key[0].replace(",", " "),
                    key[1],
                    ",".join([str(d) for d in data]),
                )
            )