Title: | Tidy Data Validation Reports |
---|---|
Description: | Tools for creating data validation pipelines and tidy reports. This package offers a framework for exploring and validating data frame like objects using 'dplyr' grammar of data manipulation. |
Authors: | Evgeni Chasnovski [aut, cre] |
Maintainer: | Evgeni Chasnovski <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.3.0.9000 |
Built: | 2024-11-20 03:50:06 UTC |
Source: | https://github.com/echasnovski/ruler |
ruler offers a set of tools for creating tidy data validation reports using the dplyr grammar of data manipulation. It is designed to be flexible and extendable in terms of creating rules and using their output.
The common workflow is:
Define dplyr-style packs of rules for basic data units (data, group, column, row, cell) to obey.
Expose some data to those rules. The result is the same data, possibly with a newly created exposure attribute. The exposure contains information about the applied packs and a tidy data validation report.
Use the data and its exposure to perform some actions: assert about rule breakers, impute data, remove outliers and so on.
To learn more about ruler, browse vignettes with browseVignettes(package = "ruler"). The preferred order is:
Design process and exposure format.
Rule packs.
Validation.
Maintainer: Evgeni Chasnovski [email protected] (ORCID)
Useful links:
Report bugs at https://github.com/echasnovski/ruler/issues
A wrapper for consistent application of some actions based on the data after exposure.
act_after_exposure(.tbl, .trigger, .actor)
.tbl | Result of exposure, i.e. data frame with exposure attribute. |
.trigger | Function which takes |
.actor | Function which takes |
Basically, act_after_exposure() does the following:
Check that .tbl has a proper exposure attribute.
Decide whether to perform the intended action by computing .trigger(.tbl).
If the trigger results in TRUE, then .actor(.tbl) is returned. Otherwise .tbl is returned.
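The steps above can be sketched in base R. This is only an illustration of the documented control flow, not the package source: act_sketch(), the toy exposure attribute and both helper functions are hypothetical, and the real exposure check in ruler is stricter than the simple attribute test used here.

```r
# Hypothetical stand-in for the documented control flow; not ruler's source.
act_sketch <- function(.tbl, .trigger, .actor) {
  # Step 1: require an 'exposure' attribute (assumed simplification).
  if (is.null(attr(.tbl, "exposure"))) {
    stop("`.tbl` has no exposure attribute.")
  }
  # Steps 2 and 3: call the actor only when the trigger returns TRUE.
  if (isTRUE(.trigger(.tbl))) .actor(.tbl) else .tbl
}

# Toy input: a data frame with a placeholder exposure attribute.
toy <- data.frame(x = 1:3)
attr(toy, "exposure") <- list(n_breakers = 2)

# Hypothetical trigger and actor.
has_breakers <- function(.tbl) attr(.tbl, "exposure")$n_breakers > 0
drop_first_row <- function(.tbl) .tbl[-1, , drop = FALSE]

acted <- act_sketch(toy, has_breakers, drop_first_row)
untouched <- act_sketch(toy, function(.tbl) FALSE, drop_first_row)
```

Here acted has two rows because the trigger fired, while untouched is the input returned unchanged.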
It is a good idea for .actor to do one of two things:
Make side effects, for example throwing an error (if the condition in .trigger is met), printing some information and so on. In this case it should return .tbl so that it can be used properly inside a pipe.
Change .tbl based on exposure information. In this case it should return the imputed version of .tbl.
any_breaker for a trigger which returns TRUE in case any rule breaker is found in exposure.
assert_any_breaker for usage of act_after_exposure() in building data validation pipelines.
    exposure_printer <- function(.tbl) {
      print(get_exposure(.tbl))
      .tbl
    }
    mtcars_exposed <- mtcars %>%
      expose(data_packs(. %>% dplyr::summarise(nrow_low = nrow(.) > 50))) %>%
      act_after_exposure(any_breaker, exposure_printer)
Function designed to be used as trigger in act_after_exposure(). Returns TRUE if the exposure attribute of .tbl has any information about data units not obeying the rules, i.e. rule breakers.
any_breaker(.tbl)
.tbl | Result of exposure, i.e. data frame with exposure attribute. |
assert_any_breaker for implicit usage of any_breaker().
    mtcars %>%
      expose(data_packs(. %>% dplyr::summarise(nrow_low = nrow(.) > 50))) %>%
      any_breaker()
Function to assert if exposure resulted in detecting some rule breakers.
assert_any_breaker(.tbl, .type = "error", .silent = FALSE, ...)
.tbl | Result of exposure, i.e. data frame with exposure attribute. |
.type | The type of assertion. Can be only one of "error", "warning" or "message". |
.silent | If |
... | Arguments for printing rule breaker information. |
If breakers are present, this function does the following:
If .silent is FALSE, print rows from the exposure report corresponding to rule breakers.
Make an assertion of the chosen .type about breaker presence in exposure.
Return .tbl (for use inside a pipe).
If there are no breakers, only .tbl is returned.
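The behaviour described above can be sketched in base R. This is a hedged illustration only: assert_sketch(), its breakers argument (a data frame of breaker rows supplied by hand rather than taken from an exposure) and the message text are all hypothetical and do not reproduce ruler's implementation.

```r
# Hypothetical helper mirroring the documented steps; not ruler's implementation.
assert_sketch <- function(.tbl, breakers,
                          .type = c("error", "warning", "message"),
                          .silent = FALSE) {
  .type <- match.arg(.type)
  # No breakers: return the input untouched.
  if (nrow(breakers) == 0) {
    return(.tbl)
  }
  # Optionally print the rows that describe rule breakers.
  if (!.silent) {
    print(breakers)
  }
  # Make the assertion of the chosen type (message text is made up).
  msg <- "Some breakers found in exposure."
  switch(.type,
    error = stop(msg, call. = FALSE),
    warning = warning(msg, call. = FALSE),
    message = message(msg)
  )
  # Return the input so the call can sit inside a pipe.
  .tbl
}

toy <- data.frame(x = 1:3)
toy_breakers <- data.frame(rule = "nrow_low", value = FALSE)

# With "message" the pipeline continues and the data is passed through.
passed <- assert_sketch(toy, toy_breakers, .type = "message", .silent = TRUE)
```

With the default .type = "error" the same call would stop, which is the usual choice for a validation pipeline that must not continue on bad data.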
any_breaker for checking breaker presence in the exposure result.
act_after_exposure for making general actions based on the exposure result.
    ## Not run:
    mtcars %>%
      expose(data_packs(. %>% dplyr::summarise(nrow_low = nrow(.) > 50))) %>%
      assert_any_breaker()
    ## End(Not run)
Function to bind several exposures into one.
bind_exposures(..., .validate_output = TRUE)
... | Exposures to bind. |
.validate_output | Whether to validate with |
Note that the output might not have names in list-column fun in packs info, which depends on the version of the dplyr package.
    my_data_packs <- data_packs(
      data_dims = . %>% dplyr::summarise(nrow_low = nrow(.) < 10),
      data_sum = . %>% dplyr::summarise(sum = sum(.) < 1000)
    )
    ref_exposure <- mtcars %>%
      expose(my_data_packs) %>%
      get_exposure()
    exposure_1 <- mtcars %>%
      expose(my_data_packs[1]) %>%
      get_exposure()
    exposure_2 <- mtcars %>%
      expose(my_data_packs[2]) %>%
      get_exposure()
    exposure_binded <- bind_exposures(exposure_1, exposure_2)
    exposure_pipe <- mtcars %>%
      expose(my_data_packs[1]) %>%
      expose(my_data_packs[2]) %>%
      get_exposure()
    identical(exposure_binded, ref_exposure)
    identical(exposure_pipe, ref_exposure)
Cell rule pack is a rule pack which defines a set of rules for cells, i.e. functions which convert cells of interest to logical values. It should return a data frame with the following properties:
Number of rows equals the number of rows for checked cells.
Column names should be treated as a concatenation of 'column name of checked cell' + 'separator' + 'rule name'.
Values indicate whether the cell follows the rule.
This format is inspired by scoped variants of transmute().
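To make the expected output shape concrete, here is a base R sketch that builds such a data frame by hand, without ruler or dplyr. The rule name not_outlier, the threshold of 2 and the literal separator "_._." are assumptions chosen for illustration; the separator is picked so that it consists of ._. surrounded by non-alphanumeric characters.

```r
# Columns whose cells are checked and a hypothetical rule name.
checked_cols <- c("disp", "qsec")
rule_name <- "not_outlier"
separator <- "_._."  # assumed separator: '._.' with a non-alphanumeric neighbour

# One logical value per checked cell: TRUE means the cell follows the rule.
cell_pack_output <- as.data.frame(lapply(
  mtcars[checked_cols],
  function(x) abs(x - mean(x)) / sd(x) < 2
))

# Column names are 'column name' + 'separator' + 'rule name'.
names(cell_pack_output) <- paste0(checked_cols, separator, rule_name)

names(cell_pack_output)
# "disp_._.not_outlier" "qsec_._.not_outlier"
```

The result has one row per row of mtcars and only logical columns, which matches the three properties listed above.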
The most common way to define a cell pack is by creating a functional sequence containing one of:
transmute_all(.funs = rules(...)).
transmute_if(.predicate, .funs = rules(...)).
transmute_at(.vars, .funs = rules(...)).
Note that (as of dplyr version 0.7.4) when only one column is transmuted, names of the output don't have the necessary structure. The 'column name of checked cell' is missing, which results (after exposure) in an empty string in the var column of the validation report. The current way of dealing with this is to name the input column (see examples).
Using rules() to create a list of functions for scoped dplyr "mutating" verbs (such as summarise_all() and transmute_all()) is recommended because:
It is a convenient way to ensure consistent naming of rules without manual naming.
It adds a common prefix to all rule names. This helps in defining the separator as the prefix surrounded by any number of non-alphanumeric characters.
Note that during exposure packs are applied to keyed object with id key. So they can rearrange rows as long as it is done with functions supported by keyholder. Rows will be tracked and recognized as in the original data frame of interest.
Data pack, group pack, column pack, row pack.
    cell_outlier_rules <- . %>% dplyr::transmute_at(
      c("disp", "qsec"),
      rules(z_score = abs(. - mean(.)) / sd(.) > 1)
    )
    cell_packs(outlier = cell_outlier_rules)
    # Dealing with one column edge case
    improper_pack <- . %>% dplyr::transmute_at(
      dplyr::vars(vs),
      rules(improper_is_neg = . < 0)
    )
    proper_pack <- . %>% dplyr::transmute_at(
      dplyr::vars(vs = vs),
      rules(proper_is_neg = . < 0)
    )
    mtcars[1:2, ] %>%
      expose(cell_packs(improper_pack, proper_pack)) %>%
      get_report()
Column rule pack is a rule pack which defines a set of rules for columns as a whole, i.e. functions which convert columns of interest to logical values. It should return a data frame with the following properties:
Number of rows equals one.
Column names should be treated as a concatenation of 'checked column name' + 'separator' + 'rule name'.
Values indicate whether the column as a whole follows the rule.
This format is inspired by dplyr's scoped variants of summarise() applied to non-grouped data.
The most common way to define a column pack is by creating a functional sequence with no grouping and ending with one of:
summarise_all(.funs = rules(...)).
summarise_if(.predicate, .funs = rules(...)).
summarise_at(.vars, .funs = rules(...)).
Note that (as of dplyr version 0.7.4) when only one column is summarised, names of the output don't have the necessary structure. The 'checked column name' is missing, which results (after exposure) in an empty string in the var column of the validation report. The current way of dealing with this is to name the input column (see examples).
Using rules() to create a list of functions for scoped dplyr "mutating" verbs (such as summarise_all() and transmute_all()) is recommended because:
It is a convenient way to ensure consistent naming of rules without manual naming.
It adds a common prefix to all rule names. This helps in defining the separator as the prefix surrounded by any number of non-alphanumeric characters.
Data pack, group pack, row pack, cell pack.
    # Validating present columns
    numeric_column_rules <- . %>% dplyr::summarise_if(
      is.numeric,
      rules(mean(.) > 5, sd(.) < 10)
    )
    character_column_rules <- . %>% dplyr::summarise_if(
      is.character,
      rules(. %in% letters[1:4])
    )
    col_packs(
      num_col = numeric_column_rules,
      chr_col = character_column_rules
    )
    # Dealing with one column edge case
    improper_pack <- . %>% dplyr::summarise_at(
      dplyr::vars(vs),
      rules(improper_is_chr = is.character)
    )
    proper_pack <- . %>% dplyr::summarise_at(
      dplyr::vars(vs = vs),
      rules(proper_is_chr = is.character)
    )
    mtcars %>%
      expose(col_packs(improper_pack, proper_pack)) %>%
      get_report()
Data rule pack is a rule pack which defines a set of rules for data as a whole, i.e. functions which convert data to logical values. It should return a data frame with the following properties:
Number of rows equals one.
Column names should be treated as rule names.
Values indicate whether the data as a whole follows the rule.
This format is inspired by dplyr's summarise() applied to non-grouped data.
The most common way to define a data pack is by creating a functional sequence with no grouping and ending with summarise(...).
Group pack, Column pack, row pack, cell pack.
    data_dims_rules <- . %>%
      dplyr::summarise(
        nrow_low = nrow(.) > 10,
        nrow_up = nrow(.) < 20,
        ncol_low = ncol(.) > 5,
        ncol_up = ncol(.) < 10
      )
    data_na_rules <- . %>%
      dplyr::summarise(all_not_na = Negate(anyNA)(.))
    data_packs(
      data_nrow = data_dims_rules,
      data_na = data_na_rules
    )
Function for applying rule packs to data.
expose(.tbl, ..., .rule_sep = inside_punct("\\._\\."), .remove_obeyers = TRUE, .guess = TRUE)
.tbl | Data frame of interest. |
... | Rule packs. They can be in pure form or inside a list (at any depth). |
.rule_sep | Regular expression used as separator between column and rule names in col packs and cell packs. |
.remove_obeyers | Whether to remove elements which obey rules from report. |
.guess | Whether to guess type of unsupported rule pack type (see Details). |
expose() applies all supplied rule packs to data, creates an exposure object based on the results and stores it in the attribute 'exposure'. It is guaranteed that .tbl is not modified in any other way, so that expose() can be used inside a pipe.
It is a good idea to name all rule packs: explicitly in ... (if they are not supplied inside a list) or during creation with the respective rule pack function. A missing name is imputed based on a possibly existing exposure attribute in .tbl and the supplied rule packs. Imputation is similar to the one in rules() but is applied to every pack type separately.
The default value for .rule_sep is a regular expression matching the characters ._. surrounded by non-alphanumeric characters. It is picked to work smoothly with dplyr's scoped verbs when rules() is used instead of a pure list. In most cases it shouldn't be changed, but if needed it should align with .prefix in rules().
A .tbl with a possibly added 'exposure' attribute containing the resulting exposure. If .tbl already contains an 'exposure' attribute, then the result is bound with it.
To work properly in some edge cases one should specify pack types with the appropriate function. However, with .guess equal to TRUE, expose() will guess the pack type based on its output after applying it to .tbl. It uses the following features:
Presence of non-logical columns: if present, then the guess is group pack. Grouping columns are guessed as all non-logical columns. This works incorrectly if some grouping column is logical: it will be guessed as the result of applying a rule. Note that on most occasions this edge case will produce an error about grouping columns defining non-unique levels.
Combination of whether the number of rows equals 1 (n_rows_one) and the presence of .rule_sep in all column names (all_contain_sep). Guesses are:
Data pack if n_rows_one == TRUE and all_contain_sep == FALSE.
Column pack if n_rows_one == TRUE and all_contain_sep == TRUE.
Row pack if n_rows_one == FALSE and all_contain_sep == FALSE. This works incorrectly if the output has one row which is checked. In this case it will be guessed as data pack.
Cell pack if n_rows_one == FALSE and all_contain_sep == TRUE. This works incorrectly if the output has one row in which cells are checked. In this case it will be guessed as column pack.
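The guessing features above can be written down as a small decision function. The following base R sketch only restates the documented logic: guess_pack_type_sketch(), its return labels and the literal default for rule_sep (assumed to be ._. surrounded by non-alphanumeric characters) are hypothetical and not taken from the package source.

```r
# Hypothetical restatement of the documented guessing rules.
guess_pack_type_sketch <- function(output,
                                   rule_sep = "[^[:alnum:]]*\\._\\.[^[:alnum:]]*") {
  # Any non-logical column means the output is treated as a group pack.
  is_lgl <- vapply(output, is.logical, logical(1))
  if (!all(is_lgl)) {
    return("group_pack")
  }
  n_rows_one <- nrow(output) == 1
  all_contain_sep <- all(grepl(rule_sep, names(output)))
  if (n_rows_one && !all_contain_sep) {
    return("data_pack")
  }
  if (n_rows_one && all_contain_sep) {
    return("col_pack")
  }
  if (!n_rows_one && !all_contain_sep) {
    return("row_pack")
  }
  "cell_pack"
}

data_out <- data.frame(nrow_low = TRUE)
col_out <- data.frame(`mpg_._.is_num` = TRUE, check.names = FALSE)
row_out <- data.frame(row_ok = c(TRUE, FALSE))
group_out <- data.frame(vs = c(0, 1), n_low = c(TRUE, FALSE))

guess_pack_type_sketch(data_out)   # "data_pack"
guess_pack_type_sketch(col_out)    # "col_pack"
guess_pack_type_sketch(row_out)    # "row_pack"
guess_pack_type_sketch(group_out)  # "group_pack"

# Edge case from the text: a row pack output with a single row looks like a data pack.
guess_pack_type_sketch(row_out[1, , drop = FALSE])  # "data_pack"
```

The last call shows why specifying the pack type explicitly is more robust than relying on guessing.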
    my_rule_pack <- . %>% dplyr::summarise(nrow_neg = nrow(.) < 0)
    my_data_packs <- data_packs(my_data_pack_1 = my_rule_pack)
    # These pipes give identical results
    mtcars %>%
      expose(my_data_packs) %>%
      get_report()
    mtcars %>%
      expose(my_data_pack_1 = my_rule_pack) %>%
      get_report()
    # This throws an error because no pack type is specified for my_rule_pack
    ## Not run:
    mtcars %>% expose(my_data_pack_1 = my_rule_pack, .guess = FALSE)
    ## End(Not run)
    # Edge cases against using 'guess = TRUE' for robust code
    group_rule_pack <- . %>%
      dplyr::mutate(vs_one = vs == 1) %>%
      dplyr::group_by(vs_one, am) %>%
      dplyr::summarise(n_low = dplyr::n() > 10)
    group_rule_pack_dummy <- . %>%
      dplyr::mutate(vs_one = vs == 1) %>%
      dplyr::group_by(mpg, vs_one, wt) %>%
      dplyr::summarise(n_low = dplyr::n() > 10)
    row_rule_pack <- . %>% dplyr::transmute(neg_row_sum = rowSums(.) < 0)
    cell_rule_pack <- . %>% dplyr::transmute_all(rules(neg_value = . < 0))
    # Only column 'am' is guessed as grouping which defines non-unique levels.
    ## Not run:
    mtcars %>%
      expose(group_rule_pack, .remove_obeyers = FALSE, .guess = TRUE) %>%
      get_report()
    ## End(Not run)
    # Values in `var` should contain combination of three grouping columns but
    # column 'vs_one' is guessed as rule. No error is thrown because the guessed
    # grouping column define unique levels.
    mtcars %>%
      expose(group_rule_pack_dummy, .remove_obeyers = FALSE, .guess = TRUE) %>%
      get_report()
    # Results should have in column 'id' value 1 and not 0.
    mtcars %>%
      dplyr::slice(1) %>%
      expose(row_rule_pack) %>%
      get_report()
    mtcars %>%
      dplyr::slice(1) %>%
      expose(cell_rule_pack) %>%
      get_report()
Exposure is a result of exposing data to rules. It is implemented with S3 class exposure, which is a list of the following structure: packs_info - a packs_info object; report - tidy data validation report.
is_exposure(.x)
get_exposure(.object)
remove_exposure(.object)
.x | Object to test. |
.object | Object to get or remove |
get_exposure() returns object if it is an exposure and its attribute 'exposure' otherwise.
remove_exposure() returns object with the attribute 'exposure' removed.
    my_col_packs <- col_packs(
      col_sum_props = . %>% dplyr::summarise_all(
        rules(
          col_sum_low = sum(.) > 100,
          col_sum_high = sum(.) < 1000
        )
      )
    )
    mtcars_exposed <- mtcars %>% expose(my_col_packs)
    mtcars_exposure <- mtcars_exposed %>% get_exposure()
    is_exposure(mtcars_exposure)
    identical(remove_exposure(mtcars_exposed), mtcars)
    identical(get_exposure(mtcars_exposure), mtcars_exposure)
Group rule pack is a rule pack which defines a set of rules for groups of rows as a whole, i.e. functions which convert groups of interest to logical values. It should return a data frame with the following properties:
Some columns should be present whose combined values uniquely describe the group. They should be defined during creation with group_packs().
Number of rows equals the number of checked groups.
Names of non-grouping columns should be treated as rule names.
Values indicate whether the group as a whole follows the rule.
This format is inspired by dplyr's summarise() applied to grouped data.
The most common way to define a group pack is by creating a functional sequence with grouping and ending with summarise(...).
Group pack output is interpreted in the following way:
All grouping columns are united with delimiter .group_sep (which is an argument of group_packs()).
Levels of the resulting column are treated as names of some new variables which should be exposed as a whole. Names of non-grouping columns are treated as rule names. They are transformed into column pack format and interpreted accordingly.
The exposure result of a group pack differs from the others in that column var in the exposure report doesn't represent an actual column in the data.
Data pack, Column pack, row pack, cell pack.
    vs_am_rules <- . %>%
      dplyr::group_by(vs, am) %>%
      dplyr::summarise(
        # dplyr::n() takes no arguments and counts rows in the current group
        nrow_low = dplyr::n() > 10,
        nrow_up = dplyr::n() < 20,
        rowmeans_low = rowMeans(.) > 19
      )
    group_packs(vs_am = vs_am_rules, .group_vars = c("vs", "am"))
Function to construct regular expression of form: 'non alpha-numeric characters' + 'some characters' + 'non alpha-numeric characters'.
inside_punct(.x = "\\._\\.")
.x | Middle characters to be put between non alpha-numeric characters. |
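As an illustration of the regular expression form described here, the following base R sketch builds such a pattern and uses it to split an output column name into its column part and rule part. inside_punct_sketch() is a hypothetical re-creation based only on the description above, and the example column name is an assumed output of a scoped dplyr verb.

```r
# Hypothetical re-creation based on the description; not the package source.
inside_punct_sketch <- function(.x = "\\._\\.") {
  paste0("[^[:alnum:]]*", .x, "[^[:alnum:]]*")
}

sep_regex <- inside_punct_sketch()
sep_regex
# "[^[:alnum:]]*\\._\\.[^[:alnum:]]*"

# Split an assumed column name of the form 'column' + 'separator' + 'rule'.
parts <- strsplit("disp_._.z_score", sep_regex)[[1]]
parts
# "disp" "z_score"
```

The underscore inside z_score is kept because the pattern only consumes non-alphanumeric characters that are directly adjacent to the ._. core.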
    inside_punct()
    inside_punct("abc")
An S3 class packs_info to represent information about packs in exposure. It is a tibble with the following structure:
name <chr>: Name of the pack.
type <chr>: Pack type.
fun <list>: List (preferably unnamed) of rule pack functions.
remove_obeyers <lgl>: Value of the .remove_obeyers argument of expose() with which the pack was applied.
is_packs_info(.x, .skip_class = FALSE)
get_packs_info(.object)
.x | Object to test. |
.skip_class | Whether to skip checking inheritance from |
.object | Object to get |
To avoid possible confusion it is preferred (but not required) that list-column fun doesn't have names. Names of packs are stored in the name column. During exposure fun is always created without names.
get_packs_info() returns the packs_info attribute of object if it is an exposure and of its 'exposure' attribute otherwise.
    my_row_packs <- row_packs(
      row_mean_props = . %>%
        dplyr::transmute(row_mean = rowMeans(.)) %>%
        dplyr::transmute(
          row_mean_low = row_mean > 20,
          row_mean_high = row_mean < 60
        ),
      row_outlier = . %>%
        dplyr::transmute(row_sum = rowSums(.)) %>%
        dplyr::transmute(
          not_row_outlier = abs(row_sum - mean(row_sum)) / sd(row_sum) < 1.5
        )
    )
    my_data_packs <- data_packs(
      data_dims = . %>% dplyr::summarise(
        nrow = nrow(.) == 32,
        ncol = ncol(.) == 5
      )
    )
    mtcars_exposed <- mtcars %>%
      expose(my_data_packs, .remove_obeyers = FALSE) %>%
      expose(my_row_packs)
    mtcars_exposed %>% get_packs_info()
    mtcars_exposed %>%
      get_packs_info() %>%
      is_packs_info()
Row rule pack is a rule pack which defines a set of rules for rows as a whole, i.e. functions which convert rows of interest to logical values. It should return a data frame with the following properties:
Number of rows equals the number of checked rows.
Column names should be treated as rule names.
Values indicate whether the row as a whole follows the rule.
This format is inspired by dplyr's transmute().
The most common way to define a row pack is by creating a functional sequence containing transmute(...).
Note that during exposure packs are applied to keyed object with id key. So they can rearrange rows as long as it is done with functions supported by keyholder. Rows will be tracked and recognized as in the original data frame of interest.
Data pack, group pack, column pack, cell pack.
    some_row_mean_rules <- . %>%
      dplyr::slice(1:3) %>%
      dplyr::mutate(row_mean = rowMeans(.)) %>%
      dplyr::transmute(
        row_mean_low = row_mean > 10,
        row_mean_up = row_mean < 20
      )
    all_row_sum_rules <- . %>%
      dplyr::mutate(row_sum = rowSums(.)) %>%
      dplyr::transmute(row_sum_low = row_sum > 30)
    row_packs(
      some_row_mean_rules,
      all_row_sum_rules
    )
Functions for creating different kinds of rule packs. Rule is a function which converts data unit of interest (data, group, column, row, cell) to logical value indicating whether this object satisfies certain condition. Rule pack is a function which combines several rules into one functional block. It takes a data frame of interest and returns a data frame with certain structure and column naming scheme. Types of packs differ in interpretation of their output.
data_packs(...)
group_packs(..., .group_vars, .group_sep = ".")
col_packs(...)
row_packs(...)
cell_packs(...)
... | Functions which define packs. They can be in pure form or inside a list (at any depth). |
.group_vars | Character vector of names of grouping variables. |
.group_sep | String to be used as separator when uniting grouping levels for |
These functions convert ... to a list, apply rlang's squash() and add appropriate classes (group_packs() also adds necessary attributes). They are only constructors and do not check the validity of a given pack. Note that elements of ... are allowed to not have names: they will be computed during exposure. However, it is a good idea to name packs manually.
data_packs() returns a list of what should be data rule packs, group_packs() - group rule packs, col_packs() - column rule packs, row_packs() - row rule packs, cell_packs() - cell rule packs.
A tibble representing the data validation result of certain data units in tidy way:
pack <chr>: Name of the rule pack from column 'name' of the corresponding packs_info object.
rule <chr>: Name of the rule defined in the rule pack.
var <chr>: Name of the variable whose validation result is reported. Value '.all' is reserved and interpreted as 'all columns as a whole'. Note that var doesn't always represent an actual column in the data frame (see group packs).
id <int>: Index of the row in the tested data frame whose validation result is reported. Value 0 is reserved and interpreted as 'all rows as a whole'.
value <lgl>: Whether the described data unit obeys the rule.
is_report(.x, .skip_class = FALSE)
get_report(.object)
.x | Object to test. |
.skip_class | Whether to skip checking inheritance from |
.object | Object to get |
There are four basic combinations of var and id values which define five basic data units:
var == '.all' and id == 0: Data as a whole.
var != '.all' and id == 0: Group (var shouldn't be an actual column name) or column (var should be an actual column name) as a whole.
var == '.all' and id != 0: Row as a whole.
var != '.all' and id != 0: Described cell.
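The mapping above is easy to express as a lookup. The base R sketch below classifies rows of a hand-made, report-like data frame; describe_unit() and the toy rows are hypothetical and only mirror the four combinations listed above.

```r
# Hypothetical classifier for the four var/id combinations.
describe_unit <- function(var, id) {
  ifelse(
    var == ".all",
    ifelse(id == 0, "data", "row"),
    ifelse(id == 0, "group or column", "cell")
  )
}

# Toy rows imitating the var and id columns of a validation report.
report_sketch <- data.frame(
  var = c(".all", "mpg", ".all", "mpg"),
  id = c(0L, 0L, 3L, 3L),
  stringsAsFactors = FALSE
)
report_sketch$unit <- describe_unit(report_sketch$var, report_sketch$id)

report_sketch$unit
# "data" "group or column" "row" "cell"
```

Distinguishing a group from a column in the second case needs extra knowledge, namely whether var is an actual column name of the data.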
get_report() returns the report element of object if it is an exposure and of its 'exposure' attribute otherwise.
    my_row_packs <- row_packs(
      row_mean_props = . %>%
        dplyr::transmute(row_mean = rowMeans(.)) %>%
        dplyr::transmute(
          row_mean_low = row_mean > 20,
          row_mean_high = row_mean < 60
        ),
      row_outlier = . %>%
        dplyr::transmute(row_sum = rowSums(.)) %>%
        dplyr::transmute(
          not_row_outlier = abs(row_sum - mean(row_sum)) / sd(row_sum) < 1.5
        )
    )
    my_data_packs <- data_packs(
      data_dims = . %>% dplyr::summarise(
        nrow = nrow(.) == 32,
        ncol = ncol(.) == 5
      )
    )
    mtcars_exposed <- mtcars %>%
      expose(my_data_packs, .remove_obeyers = FALSE) %>%
      expose(my_row_packs)
    mtcars_exposed %>% get_report()
    mtcars_exposed %>%
      get_report() %>%
      is_report()
rules() is a function designed to create input for the .funs argument of scoped dplyr "mutating" verbs (such as summarise_all() and transmute_all()). It converts bare expressions with . as input into formulas and repairs names of the output.
rules(..., .prefix = "._.")
... | Bare expression(s) with |
.prefix | Prefix to be added to function names. |
rules() repairs names by the following algorithm:
Absent names are replaced with 'rule__{ind}', where {ind} is the index of the function's position in ... .
.prefix is added at the beginning of all names. The default is ._. . It is picked for its symbolism (it is the Morse code of letter 'R') and rare occurrence in names. In those rare cases it can be manually changed, but this will not be tracked further. Note that it is a good idea for .prefix to be syntactic, as dplyr will force tibble names to be syntactic. To check if a string is "good", use it as input to make.names(): if the output equals that string, then it is a "good" choice.
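The two repair steps and the make.names() check can be sketched in base R. repair_rule_names_sketch() and prefix_is_syntactic() are hypothetical helpers that only follow the algorithm as described above; they are not the functions ruler uses internally.

```r
# Hypothetical restatement of the documented name repair.
repair_rule_names_sketch <- function(nms, .prefix = "._.") {
  # Step 1: absent names become 'rule__{ind}' using their position.
  absent <- is.na(nms) | nms == ""
  nms[absent] <- paste0("rule__", which(absent))
  # Step 2: the prefix is added to every name.
  paste0(.prefix, nms)
}

repaired <- repair_rule_names_sketch(c("is_pos", "", "is_small"))
repaired
# "._.is_pos" "._.rule__2" "._.is_small"

# A prefix is a "good" choice if make.names() leaves it unchanged.
prefix_is_syntactic <- function(prefix) identical(make.names(prefix), prefix)

prefix_is_syntactic("._.")   # TRUE
prefix_is_syntactic("a_a_")  # TRUE
prefix_is_syntactic("1_")    # FALSE, make.names() returns "X1_"
```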
    # `rules()` accepts bare expression calls with `.` as input, which is not
    # possible with advised `list()` approach of `dplyr`
    dplyr::summarise_all(mtcars[, 1:2], rules(sd, "sd", sd(.), ~ sd(.)))
    dplyr::summarise_all(mtcars[, 1:2], rules(sd, .prefix = "a_a_"))
    # Use `...` in `summarise_all()` to supply extra arguments
    dplyr::summarise_all(data.frame(x = c(1:2, NA)), rules(sd), na.rm = TRUE)
Function that is used during interpretation of group pack output. It converts grouped summary into column pack format.
spread_groups(.tbl, ..., .group_sep = ".", .col_sep = "._.")
.tbl | Data frame with result of grouped summary. |
... | A selection of grouping columns (as in |
.group_sep | A string to be used as separator of grouping levels. |
.col_sep | A string to be used as separator in column pack. |
Multiple grouping variables are converted to one with tidyr::unite() and separator .group_sep. New values are then treated as variable names which should be validated and which represent the group data as a whole.
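The conversion can be pictured with a base R sketch that unites the grouping columns and spreads every rule over the resulting group labels. spread_groups_sketch() is a hypothetical illustration of the idea and makes no claim about the column order or internals of the real function; the toy grouped summary is made up.

```r
# Hypothetical illustration of turning a grouped summary into column pack format.
spread_groups_sketch <- function(.tbl, group_cols,
                                 .group_sep = ".", .col_sep = "._.") {
  rule_cols <- setdiff(names(.tbl), group_cols)
  # Unite grouping columns into one label per group.
  group_label <- do.call(
    paste,
    c(unname(as.list(.tbl[group_cols])), sep = .group_sep)
  )
  # For every rule, create one column per group: 'group' + separator + 'rule'.
  pieces <- list()
  for (rule in rule_cols) {
    values <- as.list(.tbl[[rule]])
    names(values) <- paste0(group_label, .col_sep, rule)
    pieces <- c(pieces, values)
  }
  data.frame(pieces, check.names = FALSE)
}

grouped_summary <- data.frame(
  vs = c(0, 1),
  am = c(1, 1),
  n_low = c(TRUE, FALSE),
  n_high = c(TRUE, TRUE)
)
spread <- spread_groups_sketch(grouped_summary, c("vs", "am"))

names(spread)
# "0.1._.n_low" "1.1._.n_low" "0.1._.n_high" "1.1._.n_high"
```

The result has a single row of logical values, so it can be interpreted in the same way as the output of a column pack.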
A data frame in column pack format.
    mtcars_grouped_summary <- mtcars %>%
      dplyr::group_by(vs, am) %>%
      dplyr::summarise(n_low = dplyr::n() > 6, n_high = dplyr::n() < 10)
    spread_groups(mtcars_grouped_summary, vs, am)
    spread_groups(mtcars_grouped_summary, vs, am, .group_sep = "__")
    spread_groups(mtcars_grouped_summary, vs, am, .col_sep = "__")