Preprocessing

Loaders.py: A set of handlers that load datasets common to the project and apply the appropriate pre-processing.

exception aisdc.preprocessing.loaders.DataNotAvailable: Exception raised if the user asks for a dataset whose data they do not have. For example, some datasets require a .csv file to have been downloaded.

exception aisdc.preprocessing.loaders.UnknownDataset: Exception raised if the user passes a name that we don’t recognise.

aisdc.preprocessing.loaders.get_data_sklearn(dataset_name: str, data_folder: str = '/home/runner/work/AI-SDC/AI-SDC/aisdc/data') → Tuple[DataFrame, DataFrame]

Main entry point for returning data in a format suitable for sklearn. The user passes a dataset name, and that dataset is returned as a tuple of pandas DataFrames (data, labels).

Parameters:

dataset_name (str): The name of the dataset to load.
data_folder (str): The name of the local folder in which data is stored.

Returns:

X (pd.DataFrame): The input dataframe – rows are examples, columns are variables.
y (pd.DataFrame): The target dataframe – has a single column containing the target values.

Notes

The following datasets are available:

mimic2-iaccd (requires data download)
in-hospital-mortality (requires data download)
medical-mnist-ab-v-br-100 (requires data download)
medical-mnist-ab-v-br-500 (requires data download)
medical-mnist-all-100 (requires data download)
indian liver (requires data download)
synth-ae (requires data download)
synth-ae-small (requires data download)
nursery (downloads automatically)
iris (available out of the box via sklearn)

Datasets can be normalised by adding one of the following prefixes to the dataset name:

standard: standardises all columns to have zero mean and unit variance.
minmax: scales all columns to have values between 0 and 1.
round: rounds continuous features to 3 decimal places.

These can be nested.
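
To make the effect of the prefixes concrete, here is a minimal pure-Python sketch of the three transformations applied to a single column of numbers. It is an illustration only, not the code in loaders.py: the helper names (standardise, minmax, round_3dp, apply_prefixes) are hypothetical, the use of the population standard deviation is an assumption, and the space-separated nesting syntax is inferred from the "minmax iris" example on this page.

```python
from statistics import fmean, pstdev


def standardise(column):
    """Shift to zero mean and scale to unit variance (population std assumed)."""
    mean = fmean(column)
    spread = pstdev(column)
    if spread == 0:
        # A constant column carries no variation; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mean) / spread for value in column]


def minmax(column):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(column), max(column)
    if high == low:
        return [0.0 for _ in column]
    return [(value - low) / (high - low) for value in column]


def round_3dp(column):
    """Round every value to 3 decimal places."""
    return [round(value, 3) for value in column]


# Hypothetical lookup from prefix to transformation.
TRANSFORMS = {"standard": standardise, "minmax": minmax, "round": round_3dp}


def apply_prefixes(name, column):
    """Peel prefixes off a name such as 'round minmax iris'.

    The innermost (right-most) prefix is applied first, so the example
    above scales to [0, 1] and then rounds. Returns (base_name, values).
    """
    prefix, _, rest = name.partition(" ")
    if prefix in TRANSFORMS and rest:
        base_name, inner = apply_prefixes(rest, column)
        return base_name, TRANSFORMS[prefix](inner)
    # No recognised prefix: the whole string is the dataset name.
    return name, list(column)


if __name__ == "__main__":
    print(apply_prefixes("round minmax iris", [1.0, 2.0, 4.0]))
```

Because only the three known prefixes are stripped, a dataset name that itself contains a space, such as "indian liver", passes through this sketch unchanged.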

Examples

>>> X, y = get_data_sklearn("mimic2-iaccd") # pull the mimic2-iaccd data
>>> X, y = get_data_sklearn("minmax iris") # pull the iris data and scale all columns to between 0 and 1