ACRO Class

ACRO: Automatic Checking of Research Outputs.

class acro.acro.ACRO(config: str = 'default', suppress: bool = False)


Examples

>>> acro = ACRO()
>>> results = acro.ols(
...     y, x
... )
>>> results.summary()
>>> acro.finalise(
...     "MYFOLDER",
...     "json",
... )

Attributes
config : dict

Safe parameters and their values.

results : Records

The current outputs including the results of checks.

suppress : bool

Whether to automatically apply suppression.

Methods

add_comments(output, comment)

Add a comment to an output.

add_exception(output, reason)

Add an exception request to an output.

crosstab(index, columns[, values, rownames, ...])

Compute a simple cross tabulation of two (or more) factors.

custom_output(filename[, comment])

Add an unsupported output to the results dictionary.

finalise([path, ext])

Create a results file for checking.

hist(data, column[, by_val, grid, ...])

Create a histogram from a single column.

logit(endog, exog[, missing, check_rank])

Fits Logit model.

logitr(formula, data[, subset, drop_cols])

Fits Logit model from a formula and dataframe.

ols(endog[, exog, missing, hasconst])

Fits Ordinary Least Squares Regression.

olsr(formula, data[, subset, drop_cols])

Fits Ordinary Least Squares Regression from a formula and dataframe.

pivot_table(data[, values, index, columns, ...])

Create a spreadsheet-style pivot table as a DataFrame.

print_outputs()

Print the current results dictionary.

probit(endog, exog[, missing, check_rank])

Fits Probit model.

probitr(formula, data[, subset, drop_cols])

Fits Probit model from a formula and dataframe.

remove_output(key)

Remove an output from the results.

rename_output(old, new)

Rename an output.

surv_func(time, status, output[, entry, ...])

Estimate the survival function.

survival_plot(survival_table, survival_func, ...)

Create the survival plot according to the suppression status.

survival_table(survival_table, safe_table, ...)

Create the survival table according to the suppression status.

add_comments(output: str, comment: str) -> None

Add a comment to an output.

Parameters
output : str

The name of the output.

comment : str

The comment.

add_exception(output: str, reason: str) -> None

Add an exception request to an output.

Parameters
output : str

The name of the output.

reason : str

The reason for the exception request.

crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins: bool = False, margins_name: str = 'All', dropna: bool = True, normalize=False, show_suppressed=False) -> DataFrame

Compute a simple cross tabulation of two (or more) factors.

By default, computes a frequency table of the factors unless an array of values and an aggregation function are passed.

To provide consistent behaviour across aggregation functions, ‘empty’ rows or columns, i.e. those that are all NaN or all 0 (for count and sum), are removed.

Parameters
index : array-like, Series, or list of arrays/Series

Values to group by in the rows.

columns : array-like, Series, or list of arrays/Series

Values to group by in the columns.

values : array-like, optional

Array of values to aggregate according to the factors. Requires aggfunc be specified.

rownames : sequence, default None

If passed, must match number of row arrays passed.

colnames : sequence, default None

If passed, must match number of column arrays passed.

aggfunc : str, optional

If specified, requires values be specified as well.

margins : bool, default False

Add row/column margins (subtotals).

margins_name : str, default ‘All’

Name of the row/column that will contain the totals when margins is True.

dropna : bool, default True

Do not include columns whose entries are all NaN.

normalize : bool, {‘all’, ‘index’, ‘columns’}, or {0,1}, default False

Normalize by dividing all values by the sum of values.

- If passed ‘all’ or True, will normalize over all values.
- If passed ‘index’ will normalize over each row.
- If passed ‘columns’ will normalize over each column.
- If margins is True, will also normalize margin values.

show_suppressed : bool, default False

Controls how the totals are calculated when suppression is applied.

Returns
DataFrame

Cross tabulation of the data.
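
As a rough illustration of what a frequency cross tabulation computes, the sketch below counts co-occurrences of two factors in plain Python and then normalises each row, mirroring normalize=‘index’. It does not call ACRO or pandas and applies no disclosure checks; the helper names simple_crosstab and normalize_rows are hypothetical.

```python
from collections import Counter


def simple_crosstab(index, columns):
    """Count how often each (row, column) pair occurs."""
    counts = Counter(zip(index, columns))
    rows = sorted(set(index))
    cols = sorted(set(columns))
    # Build a nested dict: table[row][col] -> frequency (0 if never seen).
    return {r: {c: counts.get((r, c), 0) for c in cols} for r in rows}


def normalize_rows(table):
    """Divide every cell by its row total (the 'index' normalisation)."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {c: (v / total if total else 0.0) for c, v in cells.items()}
    return result


# Hypothetical example data: two factors of equal length.
region = ["north", "north", "south", "south", "south"]
grade = ["a", "b", "a", "a", "b"]

table = simple_crosstab(region, grade)
shares = normalize_rows(table)
print(table)
print(shares)
```

With normalize=‘columns’ the same idea applies with column totals as the divisor.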

custom_output(filename: str, comment: str = '') -> None

Add an unsupported output to the results dictionary.

Parameters
filename : str

The name of the file that will be added to the list of the outputs.

comment : str

An optional comment.

finalise(path: str = 'outputs', ext='json') -> acro.record.Records | None

Create a results file for checking.

Parameters
path : str

Name of a folder to save outputs.

ext : str

Extension of the results file. Valid extensions: {json, xlsx}.

Returns
Records

Object storing the outputs.
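
The general idea of writing a results file into a named folder can be sketched as follows. This is not ACRO's implementation: the file name results.json, the layout of the outputs dictionary, and the helper write_results are all assumptions made for illustration, and only the ‘json’ extension is handled.

```python
import json
import tempfile
from pathlib import Path


def write_results(outputs, path="outputs", ext="json"):
    """Save a dictionary of outputs to <path>/results.<ext> (JSON only here)."""
    if ext != "json":
        raise ValueError("this sketch only handles the 'json' extension")
    folder = Path(path)
    folder.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    target = folder / f"results.{ext}"
    target.write_text(json.dumps(outputs, indent=2))
    return target


# A temporary directory keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical record layout, not the structure ACRO writes.
    outputs = {"output_0": {"command": "crosstab", "status": "pass"}}
    saved = write_results(outputs, path=str(Path(tmp) / "MYFOLDER"))
    reloaded = json.loads(saved.read_text())
    print(saved.name, reloaded)
```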

hist(data, column, by_val=None, grid=True, xlabelsize=None, xrot=None, ylabelsize=None, yrot=None, axis=None, sharex=False, sharey=False, figsize=None, layout=None, bins=10, backend=None, legend=False, filename='histogram.png', **kwargs)

Create a histogram from a single column.

The dataset and the column’s name should be passed to the function as parameters. If more than one column is supplied, the histogram is not calculated.

To save the histogram plot to a file, specify a filename; otherwise ‘histogram.png’ is used. A number is appended automatically to the filename to avoid overwriting existing files.
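
One way to picture the automatic numbering is the sketch below, which keeps incrementing a counter until it finds an unused file name. The exact naming scheme (an underscore followed by a counter starting at 0) and the helper numbered_filename are assumptions, not necessarily what ACRO does.

```python
import tempfile
from pathlib import Path


def numbered_filename(folder, filename="histogram.png"):
    """Return a path in `folder` that does not clash with an existing file."""
    base = Path(filename)
    counter = 0
    while True:
        # Insert the counter between the stem and the extension.
        candidate = Path(folder) / f"{base.stem}_{counter}{base.suffix}"
        if not candidate.exists():
            return candidate
        counter += 1


with tempfile.TemporaryDirectory() as tmp:
    first = numbered_filename(tmp)
    first.touch()  # pretend the first plot has been saved
    second = numbered_filename(tmp)
    names = (first.name, second.name)
    print(names)
```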

Parameters
data : DataFrame

The pandas object holding the data.

column : str

The column that will be used to plot the histogram.

by_val : object, optional

If passed, then used to form histograms for separate groups.

grid : bool, default True

Whether to show axis grid lines.

xlabelsize : int, default None

If specified changes the x-axis label size.

xrot : float, default None

Rotation of x axis labels. For example, a value of 90 displays the x labels rotated 90 degrees clockwise.

ylabelsize : int, default None

If specified changes the y-axis label size.

yrot : float, default None

Rotation of y axis labels. For example, a value of 90 displays the y labels rotated 90 degrees clockwise.

axis : Matplotlib axes object, default None

The axes to plot the histogram on.

sharex : bool, default True if ax is None else False

In case subplots=True, share x axis and set some x axis labels to invisible; defaults to True if ax is None otherwise False if an ax is passed in. Note that passing in both an ax and sharex=True will alter all x axis labels for all subplots in a figure.

sharey : bool, default False

In case subplots=True, share y axis and set some y axis labels to invisible.

figsize : tuple, optional

The size in inches of the figure to create. Uses the value in matplotlib.rcParams by default.

layout : tuple, optional

Tuple of (rows, columns) for the layout of the histograms.

bins : int or sequence, default 10

Number of histogram bins to be used. If an integer is given, bins + 1 bin edges are calculated and returned. If bins is a sequence, gives bin edges, including left edge of first bin and right edge of last bin.

backend : str, default None

Backend to use instead of the backend specified in the option plotting.backend. For instance, ‘matplotlib’. Alternatively, to specify the plotting.backend for the whole session, set pd.options.plotting.backend.

legend : bool, default False

Whether to show the legend.

filename : str, default ‘histogram.png’

The name of the file where the plot will be saved.

Returns
matplotlib.Axes

The histogram.

str

The name of the file where the histogram is saved.
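
The relationship between an integer bins value and the resulting bins + 1 edges can be sketched in plain Python. This is an illustrative approximation of equal-width binning, not the plotting backend's code; bin_edges and bin_counts are hypothetical helpers and the ages are made-up values.

```python
def bin_edges(values, bins=10):
    """Return bins + 1 equally spaced edges spanning min(values)..max(values)."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    edges = [low + i * width for i in range(bins + 1)]
    edges[-1] = float(high)  # guard against floating-point drift on the last edge
    return edges


def bin_counts(values, edges):
    """Count values per bin; the last bin also includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return counts


ages = [20, 22, 25, 31, 38, 40]
edges = bin_edges(ages, bins=4)
counts = bin_counts(ages, edges)
print(edges, counts)
```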

logit(endog, exog, missing: Optional[str] = None, check_rank: bool = True) -> BinaryResultsWrapper

Fits Logit model.

Parameters
endog : array_like

A 1-d endogenous response variable. The dependent variable.

exog : array_like

A nobs x k array where nobs is the number of observations and k is the number of regressors. An intercept is not included by default and should be added by the user.

missing : str | None

Available options are ‘none’, ‘drop’, and ‘raise’. If ‘none’, no nan checking is done. If ‘drop’, any observations with nans are dropped. If ‘raise’, an error is raised. Default is ‘none’.

check_rank : bool

Check exog rank to determine model degrees of freedom. Default is True. Setting to False reduces model initialization time when exog.shape[1] is large.

Returns
BinaryResultsWrapper

Results.
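
The three missing options can be pictured with the sketch below, which applies a ‘none’, ‘drop’, or ‘raise’ policy to rows containing NaN. It is a simplified stand-in written for this page, not the statsmodels or ACRO code path; handle_missing is a hypothetical helper.

```python
import math


def handle_missing(endog, exog, missing="none"):
    """Apply a 'none' / 'drop' / 'raise' policy to rows containing NaN."""

    def has_nan(y, row):
        return math.isnan(y) or any(math.isnan(v) for v in row)

    if missing == "none":
        return endog, exog  # no NaN checking at all
    if missing == "raise":
        if any(has_nan(y, row) for y, row in zip(endog, exog)):
            raise ValueError("NaNs were found in the data")
        return endog, exog
    if missing == "drop":
        kept = [(y, row) for y, row in zip(endog, exog) if not has_nan(y, row)]
        return [y for y, _ in kept], [row for _, row in kept]
    raise ValueError(f"unknown option: {missing!r}")


# Hypothetical data; the leading 1.0 in each row is a user-supplied constant.
y = [1.0, 0.0, float("nan"), 1.0]
x = [[1.0, 2.5], [1.0, float("nan")], [1.0, 0.3], [1.0, 1.1]]

y_clean, x_clean = handle_missing(y, x, missing="drop")
print(y_clean, x_clean)
```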

logitr(formula, data, subset=None, drop_cols=None, *args, **kwargs) -> RegressionResultsWrapper

Fits Logit model from a formula and dataframe.

Parameters
formula : str or generic Formula object

The formula specifying the model.

data : array_like

The data for the model. See Notes.

subset : array_like

An array-like object of booleans, integers, or index values that indicate the subset of df to use in the model. Assumes df is a pandas.DataFrame.

drop_cols : array_like

Columns to drop from the design matrix. Cannot be used to drop terms involving categoricals.

*args

Additional positional arguments that are passed to the model.

**kwargs

These are passed to the model with one exception. The eval_env keyword is passed to patsy. It can be either a patsy.EvalEnvironment object or an integer indicating the depth of the namespace to use. For example, the default eval_env=0 uses the calling namespace. If you wish to use a “clean” environment, set eval_env=-1.

Returns
RegressionResultsWrapper

Results.

Notes

data must define __getitem__ with the keys in the formula terms, e.g. a numpy structured or rec array, a dictionary, or a pandas DataFrame. args and kwargs are passed on to the model instantiation. Arguments are passed in the same order as statsmodels.

ols(endog, exog=None, missing='none', hasconst=None, **kwargs) -> RegressionResultsWrapper

Fits Ordinary Least Squares Regression.

Parameters
endog : array_like

A 1-d endogenous response variable. The dependent variable.

exog : array_like

A nobs x k array where nobs is the number of observations and k is the number of regressors. An intercept is not included by default and should be added by the user.

missing : str

Available options are ‘none’, ‘drop’, and ‘raise’. If ‘none’, no nan checking is done. If ‘drop’, any observations with nans are dropped. If ‘raise’, an error is raised. Default is ‘none’.

hasconst : None or bool

Indicates whether the RHS includes a user-supplied constant. If True, a constant is not checked for and k_constant is set to 1 and all result statistics are calculated as if a constant is present. If False, a constant is not checked for and k_constant is set to 0.

**kwargs

Extra arguments that are used to set model properties when using the formula interface.

Returns
RegressionResultsWrapper

Results.
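
Because an intercept is not added automatically, the caller supplies a column of ones in exog. The sketch below shows that step for a single regressor and solves the least-squares normal equations by hand. It is a teaching example under simplifying assumptions (exactly two design columns, no disclosure checks), and add_constant and ols_two_columns are hypothetical helpers defined here rather than library calls.

```python
def add_constant(rows):
    """Prepend a column of ones so the model can estimate an intercept."""
    return [[1.0] + list(row) for row in rows]


def ols_two_columns(endog, exog):
    """Solve the normal equations (X'X) b = X'y for a two-column design matrix."""
    s00 = sum(r[0] * r[0] for r in exog)
    s01 = sum(r[0] * r[1] for r in exog)
    s11 = sum(r[1] * r[1] for r in exog)
    t0 = sum(r[0] * y for r, y in zip(exog, endog))
    t1 = sum(r[1] * y for r, y in zip(exog, endog))
    det = s00 * s11 - s01 * s01
    # Cramer's rule for the 2 x 2 system.
    return [(t0 * s11 - s01 * t1) / det, (s00 * t1 - s01 * t0) / det]


x = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]  # constructed so that y = 1 + 2 * x exactly

intercept, slope = ols_two_columns(y, add_constant(x))
print(intercept, slope)
```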

olsr(formula, data, subset=None, drop_cols=None, *args, **kwargs) -> RegressionResultsWrapper

Fits Ordinary Least Squares Regression from a formula and dataframe.

Parameters
formula : str or generic Formula object

The formula specifying the model.

data : array_like

The data for the model. See Notes.

subset : array_like

An array-like object of booleans, integers, or index values that indicate the subset of df to use in the model. Assumes df is a pandas.DataFrame.

drop_cols : array_like

Columns to drop from the design matrix. Cannot be used to drop terms involving categoricals.

*args

Additional positional arguments that are passed to the model.

**kwargs

These are passed to the model with one exception. The eval_env keyword is passed to patsy. It can be either a patsy.EvalEnvironment object or an integer indicating the depth of the namespace to use. For example, the default eval_env=0 uses the calling namespace. If you wish to use a “clean” environment, set eval_env=-1.

Returns
RegressionResultsWrapper

Results.

Notes

data must define __getitem__ with the keys in the formula terms, e.g. a numpy structured or rec array, a dictionary, or a pandas DataFrame. args and kwargs are passed on to the model instantiation. Arguments are passed in the same order as statsmodels.
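
The requirement that data support __getitem__ with the formula's term names can be illustrated with a deliberately naive sketch. Real formulas are parsed by patsy and support far more syntax (interactions, transformations, categorical terms); the helpers formula_terms and lookup_columns below are hypothetical and only handle terms joined by ‘+’.

```python
def formula_terms(formula):
    """Split 'y ~ x1 + x2' into the response name and a list of regressor names."""
    response, _, rhs = formula.partition("~")
    return response.strip(), [term.strip() for term in rhs.split("+")]


def lookup_columns(formula, data):
    """Fetch each named term from `data` via data[name], i.e. __getitem__."""
    response, regressors = formula_terms(formula)
    return data[response], [data[name] for name in regressors]


# A plain dictionary is enough here because it supports data["name"].
data = {"wage": [10.0, 12.0, 15.0], "age": [25, 30, 41], "tenure": [1, 4, 9]}

endog, exog_columns = lookup_columns("wage ~ age + tenure", data)
print(endog, exog_columns)
```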

pivot_table(data: DataFrame, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins: bool = False, dropna: bool = True, margins_name: str = 'All', observed: bool = False, sort: bool = True) -> DataFrame

Create a spreadsheet-style pivot table as a DataFrame.

The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.

To provide consistent behaviour across aggregation functions, ‘empty’ rows or columns, i.e. those that are all NaN or all 0 (for count and sum), are removed.

Parameters
data : DataFrame

The DataFrame to operate on.

values : column, optional

Column to aggregate, optional.

index : column, Grouper, array, or list of the previous

If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). Keys to group by on the pivot table index. If an array is passed, it is used in the same manner as column values.

columns : column, Grouper, array, or list of the previous

If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). Keys to group by on the pivot table column. If an array is passed, it is used in the same manner as column values.

aggfunc : str | list[str], default ‘mean’

If list of strings passed, the resulting pivot table will have hierarchical columns whose top level are the function names (inferred from the function objects themselves).

fill_value : scalar, default None

Value to replace missing values with (in the resulting pivot table, after aggregation).

margins : bool, default False

Add all row / columns (e.g. for subtotal / grand totals).

dropna : bool, default True

Do not include columns whose entries are all NaN.

margins_name : str, default ‘All’

Name of the row / column that will contain the totals when margins is True.

observed : bool, default False

This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.

sort : bool, default True

Specifies if the result should be sorted.

Returns
DataFrame

The resulting pivot table.
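
What the default aggfunc=‘mean’ computes for each combination of index and columns can be sketched without pandas. The helper pivot_mean and the records are hypothetical, and the sketch performs no suppression or disclosure checking.

```python
from collections import defaultdict


def pivot_mean(records, values, index, columns):
    """Average `values` for every (index, columns) combination in `records`."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec[index], rec[columns])].append(rec[values])
    table = defaultdict(dict)
    for (row, col), cell in groups.items():
        table[row][col] = sum(cell) / len(cell)
    return {row: dict(cells) for row, cells in table.items()}


# Made-up records standing in for DataFrame rows.
records = [
    {"region": "north", "year": 2020, "income": 10.0},
    {"region": "north", "year": 2020, "income": 20.0},
    {"region": "north", "year": 2021, "income": 30.0},
    {"region": "south", "year": 2020, "income": 40.0},
]

pivot = pivot_mean(records, values="income", index="region", columns="year")
print(pivot)
```

Note that the (south, 2021) cell is simply absent here; in a DataFrame it would be a missing value that fill_value could replace.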

print_outputs() -> str

Print the current results dictionary.

Returns
str

String representation of all outputs.

probit(endog, exog, missing: Optional[str] = None, check_rank: bool = True) -> BinaryResultsWrapper

Fits Probit model.

Parameters
endog : array_like

A 1-d endogenous response variable. The dependent variable.

exog : array_like

A nobs x k array where nobs is the number of observations and k is the number of regressors. An intercept is not included by default and should be added by the user.

missing : str | None

Available options are ‘none’, ‘drop’, and ‘raise’. If ‘none’, no nan checking is done. If ‘drop’, any observations with nans are dropped. If ‘raise’, an error is raised. Default is ‘none’.

check_rank : bool

Check exog rank to determine model degrees of freedom. Default is True. Setting to False reduces model initialization time when exog.shape[1] is large.

Returns
BinaryResultsWrapper

Results.
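
For orientation, the sketch below contrasts the probit and logit link functions applied to the same linear predictor. It only evaluates the standard textbook formulas and does not fit a model; the coefficients are hypothetical and the helper names are not part of ACRO.

```python
import math
from statistics import NormalDist


def probit_probability(linear_predictor):
    """Probit link: P(y = 1) is the standard normal CDF of x'b."""
    return NormalDist().cdf(linear_predictor)


def logit_probability(linear_predictor):
    """Logit link: P(y = 1) is the logistic function of x'b."""
    return 1.0 / (1.0 + math.exp(-linear_predictor))


coefficients = [0.5, -1.0]  # hypothetical [intercept, slope]
row = [1.0, 0.5]            # leading 1.0 is the user-supplied constant
xb = sum(b * v for b, v in zip(coefficients, row))

p_probit = probit_probability(xb)
p_logit = logit_probability(xb)
print(xb, p_probit, p_logit)
```

Both links map a linear predictor of zero to a probability of one half; they differ in how quickly the probability approaches 0 or 1 in the tails.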

probitr(formula, data, subset=None, drop_cols=None, *args, **kwargs) -> RegressionResultsWrapper

Fits Probit model from a formula and dataframe.

Parameters
formula : str or generic Formula object

The formula specifying the model.

data : array_like

The data for the model. See Notes.

subset : array_like

An array-like object of booleans, integers, or index values that indicate the subset of df to use in the model. Assumes df is a pandas.DataFrame.

drop_cols : array_like

Columns to drop from the design matrix. Cannot be used to drop terms involving categoricals.

*args

Additional positional arguments that are passed to the model.

**kwargs

These are passed to the model with one exception. The eval_env keyword is passed to patsy. It can be either a patsy.EvalEnvironment object or an integer indicating the depth of the namespace to use. For example, the default eval_env=0 uses the calling namespace. If you wish to use a “clean” environment, set eval_env=-1.

Returns
RegressionResultsWrapper

Results.

Notes

data must define __getitem__ with the keys in the formula terms, e.g. a numpy structured or rec array, a dictionary, or a pandas DataFrame. args and kwargs are passed on to the model instantiation. Arguments are passed in the same order as statsmodels.

remove_output(key: str) -> None

Remove an output from the results.

Parameters
key : str

Key specifying which output to remove, e.g., ‘output_0’.

rename_output(old: str, new: str) -> None

Rename an output.

Parameters
old : str

The old name of the output.

new : str

The new name of the output.
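
The output-management methods (add_comments, add_exception, rename_output, remove_output) all act on outputs identified by a string key such as ‘output_0’. A toy model of that bookkeeping is sketched below. The OutputStore class and its internal layout are entirely hypothetical and say nothing about how ACRO stores its records; the sketch only mirrors the method names and argument order documented above.

```python
class OutputStore:
    """A toy stand-in for a results store keyed by output name."""

    def __init__(self):
        self.outputs = {}

    def add(self, key, payload):
        """Register a new output under `key` (helper for this sketch only)."""
        self.outputs[key] = {"payload": payload, "comments": [], "exception": ""}

    def add_comments(self, output, comment):
        self.outputs[output]["comments"].append(comment)

    def add_exception(self, output, reason):
        self.outputs[output]["exception"] = reason

    def rename_output(self, old, new):
        if new in self.outputs:
            raise ValueError(f"{new!r} already exists")
        self.outputs[new] = self.outputs.pop(old)

    def remove_output(self, key):
        del self.outputs[key]


store = OutputStore()
store.add("output_0", "crosstab of region by grade")
store.add("output_1", "histogram of income")
store.add_comments("output_0", "Counts only, no values.")
store.rename_output("output_0", "region_by_grade")
store.remove_output("output_1")
print(store.outputs)
```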

surv_func(time, status, output, entry=None, title=None, freq_weights=None, exog=None, bw_factor=1.0, filename='kaplan-meier.png') -> DataFrame

Estimate the survival function.

Parameters
time : array_like

An array of times (censoring times or event times).

status : array_like

Status at the event time. status==1 is the ‘event’ (e.g. death, failure), meaning that the event occurs at the given value in time; status==0 indicates that censoring has occurred, meaning that the event occurs after the given value in time.

output : str

A string that determines the type of output. Available options are ‘table’ and ‘plot’.

entry : array_like, optional

An array of entry times for handling left truncation (the subject is not in the risk set on or before the entry time).

title : str

Optional title used for plots and summary output.

freq_weights : array_like

Optional frequency weights.

exog : array_like

Optional, if present used to account for violation of independent censoring.

bw_factor : float

Band-width multiplier for kernel-based estimation. Only used if exog is provided.

filename : str

The name of the file where the plot will be saved. Only used if the output is a plot.

Returns
DataFrame

The survival table.
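
To make the roles of time and status concrete, the sketch below computes a basic Kaplan-Meier estimate in plain Python, treating status 1 as an observed event and status 0 as censoring. It ignores entry times, weights, and exog, applies no suppression, and is not the code ACRO or statsmodels runs; kaplan_meier is a hypothetical helper and the data are made up.

```python
def kaplan_meier(time, status):
    """Return (event time, at risk, events, survival probability) rows."""
    rows = []
    survival = 1.0
    for t in sorted(set(time)):
        # Subjects still under observation just before time t.
        at_risk = sum(1 for ti in time if ti >= t)
        events = sum(1 for ti, si in zip(time, status) if ti == t and si == 1)
        if events == 0:
            continue  # censoring only: the estimate does not change here
        survival *= 1.0 - events / at_risk
        rows.append((t, at_risk, events, survival))
    return rows


time = [2, 3, 3, 5, 8]
status = [1, 1, 0, 1, 0]  # 1 = event observed, 0 = censored

table = kaplan_meier(time, status)
for row in table:
    print(row)
```

Each row multiplies the running survival probability by the fraction of at-risk subjects who did not experience the event at that time.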

survival_plot(survival_table, survival_func, filename, status, sdc, command, summary)

Create the survival plot according to the suppression status.

survival_table(survival_table, safe_table, status, sdc, command, summary, outcome)

Create the survival table according to the suppression status.

acro.acro.add_to_acro(src_path: str, dest_path: str = 'sdc_results') -> None

Add outputs to an acro object and create a results file for checking.

Parameters
src_path : str

Name of the folder with outputs produced without using acro.

dest_path : str

Name of the folder to save outputs.