ACRO Class

ACRO: Automatic Checking of Research Outputs.

class acro.acro.ACRO(config: str = 'default', suppress: bool = False)

Attributes:

config (dict): Safe parameters and their values.
results (Records): The current outputs including the results of checks.
suppress (bool): Whether to automatically apply suppression.

Methods

`add_comments`(output, comment)	Add a comment to an output.
`add_exception`(output, reason)	Add an exception request to an output.
`crosstab`(index, columns[, values, rownames, ...])	Compute a simple cross tabulation of two (or more) factors.
`custom_output`(filename[, comment])	Add an unsupported output to the results dictionary.
`finalise`([path, ext])	Create a results file for checking.
`hist`(data, column[, by_val, grid, ...])	Create a histogram from a single column.
`logit`(endog, exog[, missing, check_rank])	Fits Logit model.
`logitr`(formula, data[, subset, drop_cols])	Fits Logit model from a formula and dataframe.
`ols`(endog[, exog, missing, hasconst])	Fits Ordinary Least Squares Regression.
`olsr`(formula, data[, subset, drop_cols])	Fits Ordinary Least Squares Regression from a formula and dataframe.
`pivot_table`(data[, values, index, columns, ...])	Create a spreadsheet-style pivot table as a DataFrame.
`print_outputs`()	Print the current results dictionary.
`probit`(endog, exog[, missing, check_rank])	Fits Probit model.
`probitr`(formula, data[, subset, drop_cols])	Fits Probit model from a formula and dataframe.
`remove_output`(key)	Remove an output from the results.
`rename_output`(old, new)	Rename an output.
`surv_func`(time, status, output[, entry, ...])	Estimate the survival function.
`survival_plot`(survival_table, survival_func, ...)	Create the survival plot according to the suppression status.
`survival_table`(survival_table, safe_table, ...)	Create the survival table according to the suppression status.

Examples

>>> from acro import ACRO
>>> acro = ACRO()
>>> # y and x are the user's response and regressor arrays
>>> results = acro.ols(
...     y, x
... )
>>> results.summary()
>>> acro.finalise(
...     "MYFOLDER",
...     "json",
... )

add_comments(output: str, comment: str) → None

Add a comment to an output.

Parameters:

output (str): The name of the output.
comment (str): The comment.

add_exception(output: str, reason: str) → None

Add an exception request to an output.

Parameters:

output (str): The name of the output.
reason (str): The reason for the exception request.

crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins: bool = False, margins_name: str = 'All', dropna: bool = True, normalize=False, show_suppressed=False) → DataFrame

Compute a simple cross tabulation of two (or more) factors.

By default, computes a frequency table of the factors unless an array of values and an aggregation function are passed.

To provide consistent behaviour with different aggregation functions, ‘empty’ rows or columns, i.e. those that are all NaN or 0 (count, sum), are removed.

Parameters:

index (array-like, Series, or list of arrays/Series): Values to group by in the rows.
columns (array-like, Series, or list of arrays/Series): Values to group by in the columns.
values (array-like, optional): Array of values to aggregate according to the factors. Requires aggfunc to be specified.
rownames (sequence, default None): If passed, must match number of row arrays passed.
colnames (sequence, default None): If passed, must match number of column arrays passed.
aggfunc (str, optional): If specified, requires values to be specified as well.
margins (bool, default False): Add row/column margins (subtotals).
margins_name (str, default ‘All’): Name of the row/column that will contain the totals when margins is True.
dropna (bool, default True): Do not include columns whose entries are all NaN.
normalize (bool, {‘all’, ‘index’, ‘columns’}, or {0, 1}, default False): Normalize by dividing all values by the sum of values. If passed ‘all’ or True, normalizes over all values. If passed ‘index’, normalizes over each row. If passed ‘columns’, normalizes over each column. If margins is True, also normalizes margin values.
show_suppressed (bool, default False): How the totals are calculated when suppression is applied.

Returns:

DataFrame: Cross tabulation of the data.
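
The default frequency table and the three normalize modes described above can be illustrated with a minimal pure-Python sketch. This is not ACRO's or pandas' implementation; the helper names `frequency_table` and `normalize_table` and the sample data are hypothetical and exist only for illustration.

```python
from collections import Counter


def frequency_table(rows, cols):
    """Count how often each (row, column) pair occurs (hypothetical helper)."""
    counts = Counter(zip(rows, cols))
    row_labels = sorted(set(rows))
    col_labels = sorted(set(cols))
    return {r: {c: counts.get((r, c), 0) for c in col_labels} for r in row_labels}


def normalize_table(table, mode):
    """Normalize over 'all' values, each 'index' row, or each 'columns' column."""
    if mode == "all":
        total = sum(sum(row.values()) for row in table.values())
        return {r: {c: v / total for c, v in row.items()} for r, row in table.items()}
    if mode == "index":
        return {
            r: {c: v / sum(row.values()) for c, v in row.items()}
            for r, row in table.items()
        }
    if mode == "columns":
        col_totals = Counter()
        for row in table.values():
            col_totals.update(row)  # adds each column's count to its running total
        return {
            r: {c: v / col_totals[c] for c, v in row.items()}
            for r, row in table.items()
        }
    raise ValueError(f"unknown normalize mode: {mode!r}")


# Made-up sample factors: one entry per observation.
years = ["2010", "2010", "2011", "2011", "2011"]
types = ["G", "N", "G", "G", "N"]

table = frequency_table(years, types)
by_all = normalize_table(table, "all")
by_row = normalize_table(table, "index")
by_col = normalize_table(table, "columns")
```

With ‘index’ each row sums to 1, with ‘columns’ each column sums to 1, and with ‘all’ the whole table sums to 1.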

custom_output(filename: str, comment: str = '') → None

Add an unsupported output to the results dictionary.

Parameters:

filename (str): The name of the file that will be added to the list of the outputs.
comment (str): An optional comment.

finalise(path: str = 'outputs', ext='json') → Records | None

Create a results file for checking.

Parameters:

path (str): Name of a folder to save outputs.
ext (str): Extension of the results file. Valid extensions: {json, xlsx}.

Returns:

Records: Object storing the outputs.

hist(data, column, by_val=None, grid=True, xlabelsize=None, xrot=None, ylabelsize=None, yrot=None, axis=None, sharex=False, sharey=False, figsize=None, layout=None, bins=10, backend=None, legend=False, filename='histogram.png', **kwargs)

Create a histogram from a single column.

The dataset and the column’s name should be passed to the function as parameters. If more than one column is used the histogram will not be calculated.

To save the histogram plot to a file, the user can specify a filename otherwise ‘histogram.png’ will be used as the filename. A number will be appended automatically to the filename to avoid overwriting the files.

Parameters:

data (DataFrame): The pandas object holding the data.
column (str): The column that will be used to plot the histogram.
by_val (object, optional): If passed, then used to form histograms for separate groups.
grid (bool, default True): Whether to show axis grid lines.
xlabelsize (int, default None): If specified, changes the x-axis label size.
xrot (float, default None): Rotation of x axis labels. For example, a value of 90 displays the x labels rotated 90 degrees clockwise.
ylabelsize (int, default None): If specified, changes the y-axis label size.
yrot (float, default None): Rotation of y axis labels. For example, a value of 90 displays the y labels rotated 90 degrees clockwise.
axis (Matplotlib axes object, default None): The axes to plot the histogram on.
sharex (bool, default True if ax is None else False): In case subplots=True, share x axis and set some x axis labels to invisible; defaults to True if ax is None, otherwise False if an ax is passed in. Note that passing in both an ax and sharex=True will alter all x axis labels for all subplots in a figure.
sharey (bool, default False): In case subplots=True, share y axis and set some y axis labels to invisible.
figsize (tuple, optional): The size in inches of the figure to create. Uses the value in matplotlib.rcParams by default.
layout (tuple, optional): Tuple of (rows, columns) for the layout of the histograms.
bins (int or sequence, default 10): Number of histogram bins to be used. If an integer is given, bins + 1 bin edges are calculated and returned. If bins is a sequence, gives bin edges, including left edge of first bin and right edge of last bin.
backend (str, default None): Backend to use instead of the backend specified in the option plotting.backend. For instance, ‘matplotlib’. Alternatively, to specify the plotting.backend for the whole session, set pd.options.plotting.backend.
legend (bool, default False): Whether to show the legend.
filename (str, default ‘histogram.png’): The name of the file where the plot will be saved.

Returns:

matplotlib.Axes: The histogram.
str: The name of the file where the histogram is saved.
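
To make the bins parameter concrete, the following pure-Python sketch shows how an integer bins value yields bins + 1 equally spaced edges between the smallest and largest value. It is an illustration only, not the code ACRO or pandas runs; the helpers `bin_edges` and `histogram_counts` and the sample values are hypothetical. The sketch assumes the usual NumPy convention that every bin is half-open except the last, which also includes its right edge.

```python
def bin_edges(values, bins=10):
    """Return bins + 1 equally spaced edges spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(bins + 1)]
    edges[-1] = hi  # guard against floating-point drift on the final edge
    return edges


def histogram_counts(values, edges):
    """Count values per bin; bins are [left, right) except the last, which is closed."""
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            in_bin = edges[i] <= v < edges[i + 1]
            on_final_edge = i == last and v == edges[-1]
            if in_bin or on_final_edge:
                counts[i] += 1
                break
    return counts


# Made-up sample column.
values = [1, 2, 2, 3, 5, 8, 9, 10]
edges = bin_edges(values, bins=3)
counts = histogram_counts(values, edges)
```

Passing a sequence such as `[1, 4, 7, 10]` as the edges directly corresponds to the second form of bins, where the caller supplies the left edge of the first bin and the right edge of the last bin.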

logit(endog, exog, missing: str | None = None, check_rank: bool = True) → BinaryResultsWrapper

Fits Logit model.

Parameters:

endog (array_like): A 1-d endogenous response variable. The dependent variable.
exog (array_like): A nobs x k array where nobs is the number of observations and k is the number of regressors. An intercept is not included by default and should be added by the user.
missing (str | None): Available options are ‘none’, ‘drop’, and ‘raise’. If ‘none’, no nan checking is done. If ‘drop’, any observations with nans are dropped. If ‘raise’, an error is raised. Default is ‘none’.
check_rank (bool): Check exog rank to determine model degrees of freedom. Default is True. Setting to False reduces model initialization time when exog.shape[1] is large.

Returns:

BinaryResultsWrapper: Results.
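
The three options for missing can be sketched in plain Python as follows. This is a rough illustration of the documented behaviour, not the statsmodels or ACRO code path; `handle_missing` is a hypothetical helper, and it raises a plain ValueError as a stand-in for whatever error the library raises.

```python
import math


def handle_missing(endog, exog, missing="none"):
    """Apply the 'none', 'drop' or 'raise' policy to paired observations."""
    if missing == "none":
        return endog, exog  # no NaN checking at all

    # Flag every observation whose response or any regressor is NaN.
    flags = [
        math.isnan(y) or any(math.isnan(v) for v in row)
        for y, row in zip(endog, exog)
    ]
    if missing == "raise":
        if any(flags):
            raise ValueError("NaNs were found in the data")
        return endog, exog
    if missing == "drop":
        kept = [(y, row) for y, row, bad in zip(endog, exog, flags) if not bad]
        return [y for y, _ in kept], [row for _, row in kept]
    raise ValueError(f"unknown missing option: {missing!r}")


# Made-up data: the second and third observations contain a NaN.
nan = float("nan")
endog = [1.0, 0.0, nan, 1.0]
exog = [[1.0, 0.2], [1.0, nan], [1.0, 0.5], [1.0, 0.9]]

clean_endog, clean_exog = handle_missing(endog, exog, missing="drop")
```

Note that ‘drop’ removes the whole observation, so endog and exog stay aligned row by row.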

logitr(formula, data, subset=None, drop_cols=None, *args, **kwargs) → RegressionResultsWrapper

Fits Logit model from a formula and dataframe.

Parameters:

formula (str or generic Formula object): The formula specifying the model.
data (array_like): The data for the model. See Notes.
subset (array_like): An array-like object of booleans, integers, or index values that indicate the subset of df to use in the model. Assumes df is a pandas.DataFrame.
drop_cols (array_like): Columns to drop from the design matrix. Cannot be used to drop terms involving categoricals.
*args: Additional positional arguments that are passed to the model.
**kwargs: These are passed to the model with one exception. The eval_env keyword is passed to patsy. It can be either a patsy.EvalEnvironment object or an integer indicating the depth of the namespace to use. For example, the default eval_env=0 uses the calling namespace. If you wish to use a “clean” environment, set eval_env=-1.

Returns:

RegressionResultsWrapper: Results.

Notes

data must define __getitem__ with the keys in the formula terms, e.g., a numpy structured or rec array, a dictionary, or a pandas DataFrame. args and kwargs are passed on to the model instantiation. Arguments are passed in the same order as statsmodels.

ols(endog, exog=None, missing='none', hasconst=None, **kwargs) → RegressionResultsWrapper

Fits Ordinary Least Squares Regression.

Parameters:

endog (array_like): A 1-d endogenous response variable. The dependent variable.
exog (array_like): A nobs x k array where nobs is the number of observations and k is the number of regressors. An intercept is not included by default and should be added by the user.
missing (str): Available options are ‘none’, ‘drop’, and ‘raise’. If ‘none’, no nan checking is done. If ‘drop’, any observations with nans are dropped. If ‘raise’, an error is raised. Default is ‘none’.
hasconst (None or bool): Indicates whether the RHS includes a user-supplied constant. If True, a constant is not checked for and k_constant is set to 1 and all result statistics are calculated as if a constant is present. If False, a constant is not checked for and k_constant is set to 0.
**kwargs: Extra arguments that are used to set model properties when using the formula interface.

Returns:

RegressionResultsWrapper: Results.
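
Because an intercept is not included by default, the fitted model depends on whether exog carries a constant column. The pure-Python sketch below contrasts the two cases for a single regressor using the closed-form least-squares solutions. It is an illustration under simplifying assumptions, not how ACRO or statsmodels fits the model; the function names and data are hypothetical. In practice a constant column is commonly added with statsmodels' `add_constant`.

```python
def fit_through_origin(x, y):
    """Least-squares slope when exog holds only x (no constant column)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def fit_with_constant(x, y):
    """Least-squares intercept and slope when a constant column is included."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data generated exactly from y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]

slope_no_const = fit_through_origin(x, y)
intercept, slope = fit_with_constant(x, y)
```

With the constant the sketch recovers the intercept 1 and slope 2; without it the line is forced through the origin and the slope is biased upward (34/14 for this data).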

olsr(formula, data, subset=None, drop_cols=None, *args, **kwargs) → RegressionResultsWrapper

Fits Ordinary Least Squares Regression from a formula and dataframe.

Parameters:

formula (str or generic Formula object): The formula specifying the model.
data (array_like): The data for the model. See Notes.
subset (array_like): An array-like object of booleans, integers, or index values that indicate the subset of df to use in the model. Assumes df is a pandas.DataFrame.
drop_cols (array_like): Columns to drop from the design matrix. Cannot be used to drop terms involving categoricals.
*args: Additional positional arguments that are passed to the model.
**kwargs: These are passed to the model with one exception. The eval_env keyword is passed to patsy. It can be either a patsy.EvalEnvironment object or an integer indicating the depth of the namespace to use. For example, the default eval_env=0 uses the calling namespace. If you wish to use a “clean” environment, set eval_env=-1.

Returns:

RegressionResultsWrapper: Results.

Notes

data must define __getitem__ with the keys in the formula terms, e.g., a numpy structured or rec array, a dictionary, or a pandas DataFrame. args and kwargs are passed on to the model instantiation. Arguments are passed in the same order as statsmodels.
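
The requirement that data define __getitem__ for the formula's terms can be pictured with a toy lookup. The sketch below handles only formulas of the form ‘y ~ a + b’ and is in no way a substitute for patsy's parser; `formula_terms`, `columns_for` and the sample data are hypothetical.

```python
def formula_terms(formula):
    """Split a simple 'y ~ a + b' formula into variable names (toy parser)."""
    lhs, rhs = formula.split("~")
    return [lhs.strip()] + [term.strip() for term in rhs.split("+")]


def columns_for(formula, data):
    """Fetch each named term via data[name]; any object with __getitem__ works."""
    return {name: data[name] for name in formula_terms(formula)}


# A plain dictionary satisfies the requirement, as would a pandas DataFrame.
data = {
    "inc": [10.0, 12.0, 15.0],
    "grants": [1.0, 2.0, 3.0],
    "staff": [4.0, 4.0, 5.0],
    "unused": [0.0, 0.0, 0.0],
}

selected = columns_for("inc ~ grants + staff", data)
```

A term that is missing from data fails at lookup time (a KeyError for a dictionary), which is why every name in the formula must be a valid key.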

pivot_table(data: DataFrame, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins: bool = False, dropna: bool = True, margins_name: str = 'All', observed: bool = False, sort: bool = True) → DataFrame

Create a spreadsheet-style pivot table as a DataFrame.

The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.

To provide consistent behaviour with different aggregation functions, ‘empty’ rows or columns, i.e. those that are all NaN or 0 (count, sum), are removed.

Parameters:

data (DataFrame): The DataFrame to operate on.
values (column, optional): Column to aggregate.
index (column, Grouper, array, or list of the previous): If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). Keys to group by on the pivot table index. If an array is passed, it is used in the same manner as column values.
columns (column, Grouper, array, or list of the previous): If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). Keys to group by on the pivot table column. If an array is passed, it is used in the same manner as column values.
aggfunc (str | list[str], default ‘mean’): If a list of strings is passed, the resulting pivot table will have hierarchical columns whose top level are the function names (inferred from the function objects themselves).
fill_value (scalar, default None): Value to replace missing values with (in the resulting pivot table, after aggregation).
margins (bool, default False): Add all row / columns (e.g. for subtotal / grand totals).
dropna (bool, default True): Do not include columns whose entries are all NaN.
margins_name (str, default ‘All’): Name of the row / column that will contain the totals when margins is True.
observed (bool, default False): This only applies if any of the groupers are Categoricals. If True, only show observed values for categorical groupers. If False, show all values for categorical groupers.
sort (bool, default True): Specifies if the result should be sorted.

Returns:

DataFrame: The pivot table of the data.
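
The roles of index, columns, values, aggfunc and fill_value can be illustrated with a small pure-Python sketch that groups records by a pair of keys and aggregates each group. It is a simplified stand-in, not the pandas or ACRO implementation; the `pivot` helper, the field names and the records are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def pivot(records, index, columns, values, aggfunc=mean, fill_value=None):
    """Aggregate records[values] for every (index, columns) combination."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec[index], rec[columns])].append(rec[values])

    row_labels = sorted({r for r, _ in groups})
    col_labels = sorted({c for _, c in groups})
    return {
        r: {
            # Combinations with no observations get fill_value after aggregation.
            c: aggfunc(groups[(r, c)]) if (r, c) in groups else fill_value
            for c in col_labels
        }
        for r in row_labels
    }


# Made-up records; there is no (2011, "N") observation.
records = [
    {"year": 2010, "type": "G", "income": 100},
    {"year": 2010, "type": "G", "income": 300},
    {"year": 2010, "type": "N", "income": 50},
    {"year": 2011, "type": "G", "income": 400},
]

table = pivot(records, "year", "type", "income", aggfunc=mean, fill_value=0)
totals = pivot(records, "year", "type", "income", aggfunc=sum)
```

Here the default ‘mean’ averages the two 2010 ‘G’ incomes, while the empty 2011 ‘N’ cell takes fill_value (or stays None when no fill_value is given).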

print_outputs() → str

Print the current results dictionary.

Returns:

str: String representation of all outputs.

probit(endog, exog, missing: str | None = None, check_rank: bool = True) → BinaryResultsWrapper

Fits Probit model.

Parameters:

endog (array_like): A 1-d endogenous response variable. The dependent variable.
exog (array_like): A nobs x k array where nobs is the number of observations and k is the number of regressors. An intercept is not included by default and should be added by the user.
missing (str | None): Available options are ‘none’, ‘drop’, and ‘raise’. If ‘none’, no nan checking is done. If ‘drop’, any observations with nans are dropped. If ‘raise’, an error is raised. Default is ‘none’.
check_rank (bool): Check exog rank to determine model degrees of freedom. Default is True. Setting to False reduces model initialization time when exog.shape[1] is large.

Returns:

BinaryResultsWrapper: Results.

probitr(formula, data, subset=None, drop_cols=None, *args, **kwargs) → RegressionResultsWrapper

Fits Probit model from a formula and dataframe.

Parameters:

formula (str or generic Formula object): The formula specifying the model.
data (array_like): The data for the model. See Notes.
subset (array_like): An array-like object of booleans, integers, or index values that indicate the subset of df to use in the model. Assumes df is a pandas.DataFrame.
drop_cols (array_like): Columns to drop from the design matrix. Cannot be used to drop terms involving categoricals.
*args: Additional positional arguments that are passed to the model.
**kwargs: These are passed to the model with one exception. The eval_env keyword is passed to patsy. It can be either a patsy.EvalEnvironment object or an integer indicating the depth of the namespace to use. For example, the default eval_env=0 uses the calling namespace. If you wish to use a “clean” environment, set eval_env=-1.

Returns:

RegressionResultsWrapper: Results.

Notes

data must define __getitem__ with the keys in the formula terms, e.g., a numpy structured or rec array, a dictionary, or a pandas DataFrame. args and kwargs are passed on to the model instantiation. Arguments are passed in the same order as statsmodels.

remove_output(key: str) → None

Remove an output from the results.

Parameters:

key (str): Key specifying which output to remove, e.g., ‘output_0’.

rename_output(old: str, new: str) → None

Rename an output.

Parameters:

old (str): The old name of the output.
new (str): The new name of the output.

surv_func(time, status, output, entry=None, title=None, freq_weights=None, exog=None, bw_factor=1.0, filename='kaplan-meier.png') → DataFrame

Estimate the survival function.

Parameters:

time (array_like): An array of times (censoring times or event times).
status (array_like): Status at the event time. status==1 is the ‘event’ (e.g. death, failure), meaning that the event occurs at the given value in time; status==0 indicates that censoring has occurred, meaning that the event occurs after the given value in time.
output (str): A string that determines the type of output. Available options are ‘table’ and ‘plot’.
entry (array_like, optional): An array of entry times for handling left truncation (the subject is not in the risk set on or before the entry time).
title (str): Optional title used for plots and summary output.
freq_weights (array_like): Optional frequency weights.
exog (array_like): Optional; if present, used to account for violation of independent censoring.
bw_factor (float): Band-width multiplier for kernel-based estimation. Only used if exog is provided.
filename (str): The name of the file where the plot will be saved. Only used if the output is a plot.

Returns:

DataFrame: The survival table.
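
To show how time and status combine, here is a minimal pure-Python sketch of a Kaplan-Meier style estimate. It assumes the estimator is Kaplan-Meier, which the default filename ‘kaplan-meier.png’ suggests but this page does not state, and it ignores entry, freq_weights and exog. The `kaplan_meier` helper and the data are hypothetical and are not ACRO's implementation.

```python
def kaplan_meier(time, status):
    """Return (t, at_risk, events, survival) rows for each distinct event time."""
    survival = 1.0
    table = []
    for t in sorted(set(time)):
        # Subjects still under observation just before time t.
        at_risk = sum(1 for ti in time if ti >= t)
        # status == 1 marks an event at t; status == 0 marks censoring at t.
        events = sum(1 for ti, si in zip(time, status) if ti == t and si == 1)
        if events:
            survival *= 1 - events / at_risk
            table.append((t, at_risk, events, survival))
    return table


# Made-up data: five subjects, two of them censored (status 0).
time = [2, 3, 3, 5, 8]
status = [1, 1, 0, 1, 0]

km_table = kaplan_meier(time, status)
for t, at_risk, events, surv in km_table:
    print(f"t={t} at_risk={at_risk} events={events} survival={surv:.3f}")
```

Censored subjects contribute to the risk set up to their censoring time but never trigger a drop in the survival estimate, which is why the subject censored at time 8 produces no row.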

survival_plot(survival_table, survival_func, filename, status, sdc, command, summary)

Create the survival plot according to the suppression status.

survival_table(survival_table, safe_table, status, sdc, command, summary, outcome)

Create the survival table according to the suppression status.

acro.acro.add_to_acro(src_path: str, dest_path: str = 'sdc_results') → None

Add outputs to an acro object and create a results file for checking.

Parameters:

src_path (str): Name of the folder with outputs produced without using acro.
dest_path (str): Name of the folder to save outputs.