Loaders

Handlers to pull in datasets and perform preprocessing.

exception sacroml.preprocessing.loaders.DataNotAvailable

Exception raised if the user asks for a dataset that they do not have.

__init__(*args, **kwargs)

with_traceback(): Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

args

exception sacroml.preprocessing.loaders.UnknownDataset

Exception raised if the user passes a name that we don’t recognise.

__init__(*args, **kwargs)

with_traceback(): Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

args

sacroml.preprocessing.loaders.get_data_sklearn(dataset_name: str, data_folder: str = '/home/runner/work/SACRO-ML/SACRO-ML/sacroml/data') → tuple[DataFrame, DataFrame][source]

Get data in a format sensible for sklearn.

Given a dataset name, returns that dataset as a tuple of pandas DataFrames (data, labels).

Parameters:

dataset_name (str): The name of the dataset to load.
data_folder (str): The name of the local folder in which data is stored.

Returns:

X (pd.DataFrame): The input dataframe – rows are examples, columns are variables.
y (pd.DataFrame): The target dataframe – has a single column containing the target values.

Notes

The following datasets are available:

mimic2-iaccd (requires data download)
in-hospital-mortality (requires data download)
medical-mnist-ab-v-br-100 (requires data download)
medical-mnist-ab-v-br-500 (requires data download)
medical-mnist-all-100 (requires data download)
indian liver (requires data download)
synth-ae (requires data download)
synth-ae-small (requires data download)
nursery (downloads automatically)
iris (available out of the box via sklearn)

Datasets can be normalised by adding the following prefixes to the dataset name:

standard: standardises all columns to have zero mean and unit variance.
minmax: rescales all columns to have values between 0 and 1.
round: rounds continuous features to 3 decimal places.

These prefixes can be nested.
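To make the three prefixes concrete, here is a minimal pure-Python sketch of what each transformation does to a single numeric column, and of one way nested prefixes could compose. It is not the library's implementation. The helper names (standardise, minmax, round_3dp, split_prefixes, apply_prefixes) are hypothetical, and so are two assumptions: that prefixes are separated from the dataset name by spaces, and that the prefix nearest the dataset name is applied first.

```python
from statistics import mean, pstdev


def standardise(column):
    """Shift and scale a column to zero mean and unit variance."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation
    return [(value - mu) / sigma for value in column]


def minmax(column):
    """Rescale a column linearly so that its values span [0, 1]."""
    low, high = min(column), max(column)
    return [(value - low) / (high - low) for value in column]


def round_3dp(column):
    """Round every value in a column to 3 decimal places."""
    return [round(value, 3) for value in column]


# Hypothetical mapping from prefix to transformation.
TRANSFORMS = {"standard": standardise, "minmax": minmax, "round": round_3dp}


def split_prefixes(name):
    """Peel known prefixes off the front of a name; the rest is the dataset."""
    tokens = name.split()
    prefixes = []
    while tokens and tokens[0] in TRANSFORMS:
        prefixes.append(tokens.pop(0))
    return prefixes, " ".join(tokens)


def apply_prefixes(name, column):
    """Apply the prefixes in `name`, innermost (rightmost) prefix first (assumed order)."""
    prefixes, _dataset = split_prefixes(name)
    for prefix in reversed(prefixes):
        column = TRANSFORMS[prefix](column)
    return column


# Illustrative values only, not taken from any of the datasets listed above.
sepal_lengths = [5.1, 4.9, 6.3, 7.0]
scaled = apply_prefixes("round minmax iris", sepal_lengths)
print(scaled)
```

Under these assumptions, "round minmax iris" first rescales the column to [0, 1] and then rounds the result, so the order of the prefixes matters. Peeling prefixes from the front, rather than splitting on every space, keeps a name such as "indian liver" intact.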

Examples

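from sacroml.preprocessing.loaders import get_data_sklearn

# pull the mimic2-iaccd data (requires the data to be downloaded first)
X, y = get_data_sklearn("mimic2-iaccd")

# pull the iris data and rescale all columns to lie between 0 and 1
X, y = get_data_sklearn("minmax iris")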
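Because most of the datasets require a manual download, callers may want to handle a missing or misspelled dataset gracefully. The sketch below shows one possible pattern. It assumes that get_data_sklearn raises DataNotAvailable and UnknownDataset, which this page implies but does not state. The helper try_load and the stand-ins FakeUnknownDataset and fake_loader are hypothetical, and exist only so the example runs without sacroml or any downloaded data.

```python
def try_load(loader, dataset_name, expected_errors):
    """Return (result, None) on success, or (None, message) on an expected error."""
    try:
        return loader(dataset_name), None
    except expected_errors as err:
        return None, f"{type(err).__name__}: {err}"


# Intended use with the real library (not executed here; requires sacroml):
# from sacroml.preprocessing.loaders import (
#     DataNotAvailable, UnknownDataset, get_data_sklearn)
# data, problem = try_load(
#     get_data_sklearn, "mimic2-iaccd", (DataNotAvailable, UnknownDataset))


class FakeUnknownDataset(Exception):
    """Hypothetical stand-in for UnknownDataset."""


def fake_loader(name):
    """Hypothetical stand-in loader that only knows the name 'iris'."""
    if name != "iris":
        raise FakeUnknownDataset(name)
    return ("X", "y")


ok, ok_problem = try_load(fake_loader, "iris", (FakeUnknownDataset,))
bad, bad_problem = try_load(fake_loader, "no-such-data", (FakeUnknownDataset,))
print(ok, ok_problem)
print(bad, bad_problem)
```

Passing the exception types in explicitly means that only the anticipated failures are caught; any other error still propagates to the caller.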