ACRO Class
ACRO: Automatic Checking of Research Outputs.
- class acro.acro.ACRO(config: str = 'default', suppress: bool = False)
ACRO: Automatic Checking of Research Outputs.
Examples
>>> acro = ACRO()
>>> results = acro.ols(
...     y, x
... )
>>> results.summary()
>>> acro.finalise(
...     "MYFOLDER",
...     "json",
... )
- Attributes
- config : dict
Safe parameters and their values.
- results : Records
The current outputs including the results of checks.
- suppress : bool
Whether to automatically apply suppression.
Methods
add_comments(output, comment): Add a comment to an output.
add_exception(output, reason): Add an exception request to an output.
crosstab(index, columns[, values, rownames, ...]): Compute a simple cross tabulation of two (or more) factors.
custom_output(filename[, comment]): Add an unsupported output to the results dictionary.
finalise([path, ext]): Create a results file for checking.
hist(data, column[, by_val, grid, ...]): Create a histogram from a single column.
logit(endog, exog[, missing, check_rank]): Fits Logit model.
logitr(formula, data[, subset, drop_cols]): Fits Logit model from a formula and dataframe.
ols(endog[, exog, missing, hasconst]): Fits Ordinary Least Squares Regression.
olsr(formula, data[, subset, drop_cols]): Fits Ordinary Least Squares Regression from a formula and dataframe.
pivot_table(data[, values, index, columns, ...]): Create a spreadsheet-style pivot table as a DataFrame.
print_outputs(): Print the current results dictionary.
probit(endog, exog[, missing, check_rank]): Fits Probit model.
probitr(formula, data[, subset, drop_cols]): Fits Probit model from a formula and dataframe.
remove_output(key): Remove an output from the results.
rename_output(old, new): Rename an output.
surv_func(time, status, output[, entry, ...]): Estimate the survival function.
survival_plot(survival_table, survival_func, ...): Create the survival plot according to the status of suppressing.
survival_table(survival_table, safe_table, ...): Create the survival table according to the status of suppressing.
- add_comments(output: str, comment: str) -> None
Add a comment to an output.
- Parameters
- output : str
The name of the output.
- comment : str
The comment.
- add_exception(output: str, reason: str) -> None
Add an exception request to an output.
- Parameters
- output : str
The name of the output.
- reason : str
The reason for the exception request.
- crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins: bool = False, margins_name: str = 'All', dropna: bool = True, normalize=False, show_suppressed=False) -> DataFrame
Compute a simple cross tabulation of two (or more) factors.
By default, computes a frequency table of the factors unless an array of values and an aggregation function are passed.
To provide consistent behaviour with different aggregation functions, ‘empty’ rows or columns, i.e. those that are all NaN or 0 (count, sum), are removed.
- Parameters
- index : array-like, Series, or list of arrays/Series
Values to group by in the rows.
- columns : array-like, Series, or list of arrays/Series
Values to group by in the columns.
- values : array-like, optional
Array of values to aggregate according to the factors. Requires aggfunc be specified.
- rownames : sequence, default None
If passed, must match number of row arrays passed.
- colnames : sequence, default None
If passed, must match number of column arrays passed.
- aggfunc : str, optional
If specified, requires values be specified as well.
- margins : bool, default False
Add row/column margins (subtotals).
- margins_name : str, default ‘All’
Name of the row/column that will contain the totals when margins is True.
- dropna : bool, default True
Do not include columns whose entries are all NaN.
- normalize : bool, {‘all’, ‘index’, ‘columns’}, or {0,1}, default False
Normalize by dividing all values by the sum of values.
  - If passed ‘all’ or True, will normalize over all values.
  - If passed ‘index’, will normalize over each row.
  - If passed ‘columns’, will normalize over each column.
  - If margins is True, will also normalize margin values.
- show_suppressed : bool, default False
Determines how the totals are calculated when suppression is enabled.
- Returns
- DataFrame
Cross tabulation of the data.
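The normalize options described above can be illustrated with a minimal, standard-library sketch. The helper toy_crosstab and the sample data are hypothetical and are not part of ACRO or pandas; the sketch covers only the string and True forms of normalize and ignores values, aggfunc, margins, and suppression.

```python
from collections import Counter


def toy_crosstab(index, columns, normalize=False):
    """Hypothetical helper: frequency table of (row, column) label pairs."""
    counts = Counter(zip(index, columns))
    rows = sorted(set(index))
    cols = sorted(set(columns))
    table = {r: {c: float(counts.get((r, c), 0)) for c in cols} for r in rows}

    if normalize is True or normalize == "all":
        # Divide every cell by the grand total.
        total = sum(counts.values())
        for r in rows:
            for c in cols:
                table[r][c] /= total
    elif normalize == "index":
        # Divide every cell by its row total.
        for r in rows:
            row_total = sum(table[r].values())
            for c in cols:
                table[r][c] /= row_total
    elif normalize == "columns":
        # Divide every cell by its column total.
        for c in cols:
            col_total = sum(table[r][c] for r in rows)
            for r in rows:
                table[r][c] /= col_total
    return table


# Illustrative data only.
index = ["a", "a", "b", "b", "b"]
columns = ["x", "y", "x", "x", "y"]

raw = toy_crosstab(index, columns)
by_all = toy_crosstab(index, columns, normalize="all")
by_row = toy_crosstab(index, columns, normalize="index")
by_col = toy_crosstab(index, columns, normalize="columns")
```

With normalize='index' each row sums to 1, with normalize='columns' each column sums to 1, and with 'all' or True the whole table sums to 1.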
- custom_output(filename: str, comment: str = '') -> None
Add an unsupported output to the results dictionary.
- Parameters
- filename : str
The name of the file that will be added to the list of the outputs.
- comment : str
An optional comment.
- finalise(path: str = 'outputs', ext='json') -> acro.record.Records | None
Create a results file for checking.
- Parameters
- path : str
Name of a folder to save outputs.
- ext : str
Extension of the results file. Valid extensions: {json, xlsx}.
- Returns
- Records
Object storing the outputs.
- hist(data, column, by_val=None, grid=True, xlabelsize=None, xrot=None, ylabelsize=None, yrot=None, axis=None, sharex=False, sharey=False, figsize=None, layout=None, bins=10, backend=None, legend=False, filename='histogram.png', **kwargs)
Create a histogram from a single column.
The dataset and the column’s name should be passed to the function as parameters. If more than one column is used the histogram will not be calculated.
To save the histogram plot to a file, the user can specify a filename; otherwise ‘histogram.png’ is used. A number is appended automatically to the filename to avoid overwriting existing files.
- Parameters
- data : DataFrame
The pandas object holding the data.
- column : str
The column that will be used to plot the histogram.
- by_val : object, optional
If passed, then used to form histograms for separate groups.
- grid : bool, default True
Whether to show axis grid lines.
- xlabelsize : int, default None
If specified changes the x-axis label size.
- xrot : float, default None
Rotation of x axis labels. For example, a value of 90 displays the x labels rotated 90 degrees clockwise.
- ylabelsize : int, default None
If specified changes the y-axis label size.
- yrot : float, default None
Rotation of y axis labels. For example, a value of 90 displays the y labels rotated 90 degrees clockwise.
- axis : Matplotlib axes object, default None
The axes to plot the histogram on.
- sharex : bool, default True if ax is None else False
In case subplots=True, share x axis and set some x axis labels to invisible; defaults to True if ax is None otherwise False if an ax is passed in. Note that passing in both an ax and sharex=True will alter all x axis labels for all subplots in a figure.
- sharey : bool, default False
In case subplots=True, share y axis and set some y axis labels to invisible.
- figsize : tuple, optional
The size in inches of the figure to create. Uses the value in matplotlib.rcParams by default.
- layout : tuple, optional
Tuple of (rows, columns) for the layout of the histograms.
- bins : int or sequence, default 10
Number of histogram bins to be used. If an integer is given, bins + 1 bin edges are calculated and returned. If bins is a sequence, gives bin edges, including left edge of first bin and right edge of last bin.
- backend : str, default None
Backend to use instead of the backend specified in the option plotting.backend. For instance, ‘matplotlib’. Alternatively, to specify the plotting.backend for the whole session, set pd.options.plotting.backend.
- legend : bool, default False
Whether to show the legend.
- filename : str, default ‘histogram.png’
The name of the file where the plot will be saved.
- Returns
- matplotlib.Axes
The histogram.
- str
The name of the file where the histogram is saved.
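The description above says a number is appended to the filename so that existing files are not overwritten. The exact naming scheme is not stated here, so the following is only a sketch of the general idea: the helper next_free_filename and the `_N` suffix format are assumptions, not ACRO's implementation.

```python
import os
import tempfile


def next_free_filename(filename, folder):
    """Hypothetical helper: return a path in `folder` that does not exist yet.

    The '<stem>_<counter><ext>' pattern is an assumption for illustration.
    """
    stem, ext = os.path.splitext(filename)
    counter = 0
    while True:
        candidate = os.path.join(folder, f"{stem}_{counter}{ext}")
        if not os.path.exists(candidate):
            return candidate
        counter += 1


# Simulate saving two plots under the same requested name.
with tempfile.TemporaryDirectory() as folder:
    first = next_free_filename("histogram.png", folder)
    open(first, "w").close()  # stand-in for writing the plot file
    second = next_free_filename("histogram.png", folder)
    names = (os.path.basename(first), os.path.basename(second))
```

The second request receives a different name because the first file already exists, so neither plot overwrites the other.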
- logit(endog, exog, missing: Optional[str] = None, check_rank: bool = True) -> BinaryResultsWrapper
Fits Logit model.
- Parameters
- endog : array_like
A 1-d endogenous response variable. The dependent variable.
- exog : array_like
A nobs x k array where nobs is the number of observations and k is the number of regressors. An intercept is not included by default and should be added by the user.
- missing : str | None
Available options are ‘none’, ‘drop’, and ‘raise’. If ‘none’, no nan checking is done. If ‘drop’, any observations with nans are dropped. If ‘raise’, an error is raised. Default is ‘none’.
- check_rank : bool
Check exog rank to determine model degrees of freedom. Default is True. Setting to False reduces model initialization time when exog.shape[1] is large.
- Returns
- BinaryResultsWrapper
Results.
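The three missing options can be pictured with a small standard-library sketch. The helper handle_missing is hypothetical and only mirrors the behaviour described above; treating None the same as ‘none’ and raising ValueError are assumptions of this sketch, not statements about the underlying library.

```python
import math


def handle_missing(endog, exog, missing="none"):
    """Hypothetical helper mirroring the 'none' / 'drop' / 'raise' options."""
    if missing is None or missing == "none":
        # No NaN checking at all: data is passed through untouched.
        return endog, exog

    # Flag every observation that has a NaN in the response or any regressor.
    flags = [
        math.isnan(y) or any(math.isnan(v) for v in row)
        for y, row in zip(endog, exog)
    ]
    if missing == "raise":
        if any(flags):
            raise ValueError("NaN values found in the data")
        return endog, exog
    if missing == "drop":
        kept_y = [y for y, bad in zip(endog, flags) if not bad]
        kept_x = [row for row, bad in zip(exog, flags) if not bad]
        return kept_y, kept_x
    raise ValueError(f"unknown option: {missing!r}")


# Illustrative data: the second observation has a NaN regressor.
y = [1.0, 0.0, 1.0]
X = [[1.0, 2.0], [1.0, float("nan")], [1.0, 3.0]]

dropped_y, dropped_X = handle_missing(y, X, missing="drop")
```

With ‘drop’ the incomplete observation is removed from both endog and exog so the two stay aligned.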
- logitr(formula, data, subset=None, drop_cols=None, *args, **kwargs) -> RegressionResultsWrapper
Fits Logit model from a formula and dataframe.
- Parameters
- formula : str or generic Formula object
The formula specifying the model.
- data : array_like
The data for the model. See Notes.
- subset : array_like
An array-like object of booleans, integers, or index values that indicate the subset of df to use in the model. Assumes df is a pandas.DataFrame.
- drop_cols : array_like
Columns to drop from the design matrix. Cannot be used to drop terms involving categoricals.
- *args
Additional positional arguments that are passed to the model.
- **kwargs
These are passed to the model with one exception. The eval_env keyword is passed to patsy. It can be either a patsy.EvalEnvironment object or an integer indicating the depth of the namespace to use. For example, the default eval_env=0 uses the calling namespace. If you wish to use a “clean” environment, set eval_env=-1.
- Returns
- RegressionResultsWrapper
Results.
Notes
data must define __getitem__ with the keys in the formula terms, e.g., a numpy structured or rec array, a dictionary, or a pandas DataFrame. args and kwargs are passed on to the model instantiation. Arguments are passed in the same order as statsmodels.
- ols(endog, exog=None, missing='none', hasconst=None, **kwargs) -> RegressionResultsWrapper
Fits Ordinary Least Squares Regression.
- Parameters
- endog : array_like
A 1-d endogenous response variable. The dependent variable.
- exog : array_like
A nobs x k array where nobs is the number of observations and k is the number of regressors. An intercept is not included by default and should be added by the user.
- missing : str
Available options are ‘none’, ‘drop’, and ‘raise’. If ‘none’, no nan checking is done. If ‘drop’, any observations with nans are dropped. If ‘raise’, an error is raised. Default is ‘none’.
- hasconst : None or bool
Indicates whether the RHS includes a user-supplied constant. If True, a constant is not checked for and k_constant is set to 1 and all result statistics are calculated as if a constant is present. If False, a constant is not checked for and k_constant is set to 0.
- **kwargs
Extra arguments that are used to set model properties when using the formula interface.
- Returns
- RegressionResultsWrapper
Results.
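Because an intercept is not included in exog by default, the caller has to add a constant column before fitting. The sketch below shows the idea in plain Python for a single regressor; add_constant and fit_two_column_ols are hypothetical helpers written for this illustration (statsmodels offers its own add_constant utility for real use), and the data are made up.

```python
def add_constant(exog):
    """Hypothetical helper: prepend a column of ones to each row of exog."""
    return [[1.0] + list(row) for row in exog]


def fit_two_column_ols(endog, design):
    """Least squares for a two-column design matrix via the normal equations.

    Solves (X'X) b = X'y for the 2x2 case only; an illustration, not a
    general-purpose solver.
    """
    a = sum(r[0] * r[0] for r in design)
    b = sum(r[0] * r[1] for r in design)
    d = sum(r[1] * r[1] for r in design)
    p = sum(r[0] * y for r, y in zip(design, endog))
    q = sum(r[1] * y for r, y in zip(design, endog))
    det = a * d - b * b
    intercept = (d * p - b * q) / det
    slope = (a * q - b * p) / det
    return intercept, slope


# Illustrative data generated from y = 1 + 2x.
x = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]

design = add_constant(x)  # each row becomes [1.0, x]
intercept, slope = fit_two_column_ols(y, design)
```

Without the constant column, the fitted line would be forced through the origin, which is why the intercept has to be supplied explicitly.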
- olsr(formula, data, subset=None, drop_cols=None, *args, **kwargs) -> RegressionResultsWrapper
Fits Ordinary Least Squares Regression from a formula and dataframe.
- Parameters
- formula : str or generic Formula object
The formula specifying the model.
- data : array_like
The data for the model. See Notes.
- subset : array_like
An array-like object of booleans, integers, or index values that indicate the subset of df to use in the model. Assumes df is a pandas.DataFrame.
- drop_cols : array_like
Columns to drop from the design matrix. Cannot be used to drop terms involving categoricals.
- *args
Additional positional arguments that are passed to the model.
- **kwargs
These are passed to the model with one exception. The eval_env keyword is passed to patsy. It can be either a patsy.EvalEnvironment object or an integer indicating the depth of the namespace to use. For example, the default eval_env=0 uses the calling namespace. If you wish to use a “clean” environment, set eval_env=-1.
- Returns
- RegressionResultsWrapper
Results.
Notes
data must define __getitem__ with the keys in the formula terms, e.g., a numpy structured or rec array, a dictionary, or a pandas DataFrame. args and kwargs are passed on to the model instantiation. Arguments are passed in the same order as statsmodels.
- pivot_table(data: DataFrame, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins: bool = False, dropna: bool = True, margins_name: str = 'All', observed: bool = False, sort: bool = True) -> DataFrame
Create a spreadsheet-style pivot table as a DataFrame.
The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.
To provide consistent behaviour with different aggregation functions, ‘empty’ rows or columns, i.e. those that are all NaN or 0 (count, sum), are removed.
- Parameters
- data : DataFrame
The DataFrame to operate on.
- values : column, optional
Column to aggregate.
- index : column, Grouper, array, or list of the previous
Keys to group by on the pivot table index. If an array is passed, it must be the same length as the data and is used in the same manner as column values. The list can contain any of the other types (except list).
- columns : column, Grouper, array, or list of the previous
Keys to group by on the pivot table column. If an array is passed, it must be the same length as the data and is used in the same manner as column values. The list can contain any of the other types (except list).
- aggfunc : str | list[str], default ‘mean’
If list of strings passed, the resulting pivot table will have hierarchical columns whose top level are the function names (inferred from the function objects themselves).
- fill_value : scalar, default None
Value to replace missing values with (in the resulting pivot table, after aggregation).
- margins : bool, default False
Add all row / columns (e.g. for subtotal / grand totals).
- dropna : bool, default True
Do not include columns whose entries are all NaN.
- margins_name : str, default ‘All’
Name of the row / column that will contain the totals when margins is True.
- observed : bool, default False
This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
- sort : bool, default True
Specifies if the result should be sorted.
- Returns
- DataFrame
Cross tabulation of the data.
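As a rough picture of what a pivot table computes, the sketch below groups records by one index key and one column key and aggregates a value column with the mean. The helper toy_pivot and the records are hypothetical; unlike the method documented above, the sketch takes a callable rather than a string for aggfunc and ignores margins, fill_value, and suppression.

```python
from collections import defaultdict
from statistics import mean


def toy_pivot(rows, values, index, columns, aggfunc=mean):
    """Hypothetical helper: aggregate `values` per (index, columns) group."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row[index], row[columns])].append(row[values])

    table = defaultdict(dict)
    for (r, c), cell_values in groups.items():
        table[r][c] = aggfunc(cell_values)
    return {r: dict(cells) for r, cells in table.items()}


# Illustrative records only.
records = [
    {"region": "north", "year": 2020, "income": 10.0},
    {"region": "north", "year": 2020, "income": 20.0},
    {"region": "north", "year": 2021, "income": 30.0},
    {"region": "south", "year": 2020, "income": 40.0},
]

pivot = toy_pivot(records, values="income", index="region", columns="year")
```

Each cell holds the aggregate of all records sharing that row and column key; combinations with no records simply have no cell in this sketch.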
- print_outputs() -> str
Print the current results dictionary.
- Returns
- str
String representation of all outputs.
- probit(endog, exog, missing: Optional[str] = None, check_rank: bool = True) -> BinaryResultsWrapper
Fits Probit model.
- Parameters
- endog : array_like
A 1-d endogenous response variable. The dependent variable.
- exog : array_like
A nobs x k array where nobs is the number of observations and k is the number of regressors. An intercept is not included by default and should be added by the user.
- missing : str | None
Available options are ‘none’, ‘drop’, and ‘raise’. If ‘none’, no nan checking is done. If ‘drop’, any observations with nans are dropped. If ‘raise’, an error is raised. Default is ‘none’.
- check_rank : bool
Check exog rank to determine model degrees of freedom. Default is True. Setting to False reduces model initialization time when exog.shape[1] is large.
- Returns
- BinaryResultsWrapper
Results.
- probitr(formula, data, subset=None, drop_cols=None, *args, **kwargs) -> RegressionResultsWrapper
Fits Probit model from a formula and dataframe.
- Parameters
- formula : str or generic Formula object
The formula specifying the model.
- data : array_like
The data for the model. See Notes.
- subset : array_like
An array-like object of booleans, integers, or index values that indicate the subset of df to use in the model. Assumes df is a pandas.DataFrame.
- drop_cols : array_like
Columns to drop from the design matrix. Cannot be used to drop terms involving categoricals.
- *args
Additional positional arguments that are passed to the model.
- **kwargs
These are passed to the model with one exception. The eval_env keyword is passed to patsy. It can be either a patsy.EvalEnvironment object or an integer indicating the depth of the namespace to use. For example, the default eval_env=0 uses the calling namespace. If you wish to use a “clean” environment, set eval_env=-1.
- Returns
- RegressionResultsWrapper
Results.
Notes
data must define __getitem__ with the keys in the formula terms, e.g., a numpy structured or rec array, a dictionary, or a pandas DataFrame. args and kwargs are passed on to the model instantiation. Arguments are passed in the same order as statsmodels.
- remove_output(key: str) -> None
Remove an output from the results.
- Parameters
- key : str
Key specifying which output to remove, e.g., ‘output_0’.
- rename_output(old: str, new: str) -> None
Rename an output.
- Parameters
- old : str
The old name of the output.
- new : str
The new name of the output.
- surv_func(time, status, output, entry=None, title=None, freq_weights=None, exog=None, bw_factor=1.0, filename='kaplan-meier.png') -> DataFrame
Estimate the survival function.
- Parameters
- time : array_like
An array of times (censoring times or event times).
- status : array_like
Status at the event time. status==1 is the ‘event’ (e.g. death, failure), meaning that the event occurs at the given value in time; status==0 indicates that censoring has occurred, meaning that the event occurs after the given value in time.
- output : str
A string determining the type of output. Available options are ‘table’ and ‘plot’.
- entry : array_like, optional
An array of entry times for handling left truncation (the subject is not in the risk set on or before the entry time).
- title : str
Optional title used for plots and summary output.
- freq_weights : array_like
Optional frequency weights.
- exog : array_like
Optional, if present used to account for violation of independent censoring.
- bw_factor : float
Band-width multiplier for kernel-based estimation. Only used if exog is provided.
- filename : str
The name of the file where the plot will be saved. Only used if the output is a plot.
- Returns
- DataFrame
The survival table.
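The default filename ‘kaplan-meier.png’ suggests a Kaplan-Meier estimate, although that is not stated explicitly above. Under that assumption, the following plain-Python sketch shows how time and status combine into a survival table. The helper kaplan_meier and the data are hypothetical, and the sketch ignores entry, freq_weights, exog, and any suppression.

```python
def kaplan_meier(time, status):
    """Hypothetical helper: product-limit survival estimate.

    Returns one tuple (t, at_risk, events, survival) per distinct event time.
    status == 1 marks an event, status == 0 marks censoring.
    """
    event_times = sorted({t for t, s in zip(time, status) if s == 1})
    survival = 1.0
    table = []
    for t in event_times:
        # Subjects still under observation just before time t.
        at_risk = sum(1 for u in time if u >= t)
        # Events occurring exactly at time t.
        events = sum(1 for u, s in zip(time, status) if u == t and s == 1)
        survival *= 1.0 - events / at_risk
        table.append((t, at_risk, events, survival))
    return table


# Illustrative data: five subjects, two of them censored.
time = [1, 2, 2, 3, 4]
status = [1, 1, 0, 1, 0]

km_table = kaplan_meier(time, status)
```

Censored subjects contribute to the number at risk up to their censoring time but never count as events, which is how status==0 enters the estimate.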
- survival_plot(survival_table, survival_func, filename, status, sdc, command, summary)
Create the survival plot according to the status of suppressing.
- survival_table(survival_table, safe_table, status, sdc, command, summary, outcome)
Create the survival table according to the status of suppressing.