Loaders
Handlers to pull in datasets and perform preprocessing.
- exception sacroml.preprocessing.loaders.DataNotAvailable
Exception raised if the user asks for a dataset that they do not have.
- __init__(*args, **kwargs)
- with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- args
- exception sacroml.preprocessing.loaders.UnknownDataset
Exception raised if the user passes a name that we don’t recognise.
- __init__(*args, **kwargs)
- with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- args
- sacroml.preprocessing.loaders.get_data_sklearn(dataset_name: str, data_folder: str = '/home/runner/work/SACRO-ML/SACRO-ML/sacroml/data') → tuple[DataFrame, DataFrame]
Get data in a format sensible for sklearn.
User passes a name and that dataset is returned as a tuple of pandas DataFrames (data, labels).
- Parameters:
- dataset_name : str
The name of the dataset to load.
- data_folder : str
The name of the local folder in which data is stored.
- Returns:
- X : pd.DataFrame
The input dataframe: rows are examples, columns are variables.
- y : pd.DataFrame
The target dataframe: a single column containing the target values.
Notes
The following datasets are available:
- mimic2-iaccd (requires data download)
- in-hospital-mortality (requires data download)
- medical-mnist-ab-v-br-100 (requires data download)
- medical-mnist-ab-v-br-500 (requires data download)
- medical-mnist-all-100 (requires data download)
- indian liver (requires data download)
- synth-ae (requires data download)
- synth-ae-small (requires data download)
- nursery (downloads automatically)
- iris (available out of the box via sklearn)
Datasets can be normalised by adding the following prefixes to the dataset name:
- standard: standardises all columns to have zero mean and unit variance.
- minmax: scales all columns to have values between 0 and 1.
- round: rounds continuous features to 3 decimal places.
These can be nested.
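To make the three prefixes concrete, the following is a minimal sketch of what each transformation computes on a single numeric column, written in plain Python. It is not the library's implementation: the helper names (`standardise`, `minmax`, `round_3dp`) are hypothetical, the use of the population standard deviation is an assumption, and the column is assumed to be non-constant.

```python
from statistics import mean, pstdev


def standardise(column):
    """Shift and scale values to zero mean and unit variance."""
    mu = mean(column)
    # Assumption: population standard deviation; the library may differ.
    sigma = pstdev(column)
    return [(v - mu) / sigma for v in column]


def minmax(column):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]


def round_3dp(column):
    """Round every value to 3 decimal places."""
    return [round(v, 3) for v in column]


column = [2.0, 4.0, 6.0, 8.0]

scaled = minmax(column)
# Nesting: standardise first, then round the standardised values.
nested = round_3dp(standardise(column))
```

Composing the helpers, as in the last line, mirrors the idea of nesting prefixes: one transformation is applied to the output of another.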
Examples
# pull the mimic2-iaccd data
X, y = get_data_sklearn("mimic2-iaccd")
# pull the iris data and scale all columns to between 0 and 1
X, y = get_data_sklearn("minmax iris")