This vignette describes the actual validation step (called ‘exposure’) of the ruler workflow and shows some examples of what one can do with validation results. The packs from the vignette about rule packs are used throughout.
Exposing data to rules means applying rule packs to data, collecting the results in a common format and attaching them to the data as an exposure attribute. In this way the actual exposure can be done in multiple steps and can also be a part of a general data preparation pipeline.
After attaching an exposure to a data frame, one can extract information from it using the following functions:
- get_exposure() for the exposure.
- get_packs_info() for packs info (part of the exposure).
- get_report() for the tidy data validation report (part of the exposure).
For exposing data to rules use expose(). Only the exposure attribute of the input might change. If the input already has an exposure attached to it, the new one is bound to it.
Simple example:
mtcars %>%
  expose(my_group_packs) %>%
  get_exposure()
#> Exposure
#>
#> Packs info:
#> # A tibble: 1 × 4
#> name type fun remove_obeyers
#> <chr> <chr> <list> <lgl>
#> 1 group_pack__1 group_pack <grop_pck> TRUE
#>
#> Tidy data validation report:
#> # A tibble: 2 × 5
#> pack rule var id value
#> <chr> <chr> <chr> <int> <lgl>
#> 1 group_pack__1 any_cyl_6 0.0 0 FALSE
#> 2 group_pack__1 any_cyl_6 1.1 0 FALSE
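The other accessors can be applied to the same exposed data. The following is a minimal sketch: the definition of my_group_packs is an assumption made here only to keep the example self-contained (the actual one lives in the vignette about rule packs).

```r
library(dplyr)
library(ruler)

# Assumed definition of a group pack similar to the one used in this vignette
my_group_packs <- group_packs(
  . %>% group_by(vs, am) %>% summarise(any_cyl_6 = any(cyl == 6)),
  .group_vars = c("vs", "am")
)

exposed <- mtcars %>% expose(my_group_packs)

# Each accessor returns one part of the attached exposure
packs_info <- exposed %>% get_packs_info()
report <- exposed %>% get_report()

# The data keeps its shape; the results live in the 'exposure' attribute
has_exposure <- !is.null(attr(exposed, "exposure"))
same_shape <- identical(dim(exposed), dim(mtcars))
```

Because the results are stored in an attribute, the exposed data frame can be passed further down a pipeline like the original data.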
By default, exposing removes obeyers (report rows in which the rule holds, i.e. value is TRUE). One can keep obeyers by setting .remove_obeyers to FALSE.
mtcars %>%
  expose(my_group_packs, .remove_obeyers = FALSE) %>%
  get_exposure()
#> Exposure
#>
#> Packs info:
#> # A tibble: 1 × 4
#> name type fun remove_obeyers
#> <chr> <chr> <list> <lgl>
#> 1 group_pack__1 group_pack <grop_pck> FALSE
#>
#> Tidy data validation report:
#> # A tibble: 4 × 5
#> pack rule var id value
#> <chr> <chr> <chr> <int> <lgl>
#> 1 group_pack__1 any_cyl_6 0.0 0 FALSE
#> 2 group_pack__1 any_cyl_6 0.1 0 TRUE
#> 3 group_pack__1 any_cyl_6 1.0 0 TRUE
#> 4 group_pack__1 any_cyl_6 1.1 0 FALSE
Notice the imputed group pack name group_pack__1. To change it, one can set a name during creation with group_packs() or pass the pack to expose() as a named argument.
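Both ways of naming a pack might look like the sketch below. The pack function and the name cyl_by_vs_am are hypothetical and chosen only for illustration.

```r
library(dplyr)
library(ruler)

# Hypothetical group pack function, defined here for self-containedness
any_cyl_6_fun <- . %>% group_by(vs, am) %>% summarise(any_cyl_6 = any(cyl == 6))

# Option 1: name the pack when it is created
named_group_packs <- group_packs(
  cyl_by_vs_am = any_cyl_6_fun,
  .group_vars = c("vs", "am")
)
info_named_at_creation <- mtcars %>%
  expose(named_group_packs) %>%
  get_packs_info()

# Option 2: name an unnamed pack when it is supplied to expose()
unnamed_group_packs <- group_packs(any_cyl_6_fun, .group_vars = c("vs", "am"))
info_named_at_exposure <- mtcars %>%
  expose(cyl_by_vs_am = unnamed_group_packs[[1]]) %>%
  get_packs_info()
```

In both cases the name column of the packs info should contain the supplied name instead of the imputed one.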
One can expose data to several packs at once or do it step by step:
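A minimal sketch of both approaches follows. The packs are hypothetical stand-ins defined here so that the example runs on its own; they are not necessarily the ones from the vignette about rule packs.

```r
library(dplyr)
library(ruler)

# Hypothetical packs used only for this illustration
my_data_packs <- data_packs(
  my_data_pack_1 = . %>% summarise(nrow_low = nrow(.) > 10)
)
my_col_packs <- col_packs(
  my_col_pack_1 = . %>% summarise_all(rules(mean_low = mean(.) > 1))
)

# All packs at once
one_step <- mtcars %>%
  expose(my_data_packs, my_col_packs)

# One pack list at a time; each new exposure is bound to the existing one
two_steps <- mtcars %>%
  expose(my_data_packs) %>%
  expose(my_col_packs)

one_step_names <- get_packs_info(one_step)$name
two_steps_names <- get_packs_info(two_steps)$name
```

Both results carry information about the same two packs, so later steps do not need to know how the exposure was accumulated.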
By default expose() guesses which type of pack a function represents (if it is not set manually). This is useful for interactive experiments. The guess is based on features of the pack’s output structure (see ?expose for more details).
mtcars %>%
  expose(some_data_pack = . %>% summarise(nrow = nrow(.) == 10)) %>%
  get_exposure()
#> Exposure
#>
#> Packs info:
#> # A tibble: 1 × 4
#> name type fun remove_obeyers
#> <chr> <chr> <list> <lgl>
#> 1 some_data_pack data_pack <data_pck> TRUE
#>
#> Tidy data validation report:
#> # A tibble: 1 × 5
#> pack rule var id value
#> <chr> <chr> <chr> <int> <lgl>
#> 1 some_data_pack nrow .all 0 FALSE
However, there are some edge cases (especially for group packs). To write strict and robust code, one should use the .guess = FALSE option.
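The difference between the two modes can be sketched as follows. The expectation that an undeclared pack is rejected with an error when .guess = FALSE is an assumption about the strict mode; the error is caught here so that the sketch runs either way.

```r
library(dplyr)
library(ruler)

# A plain function: its pack type is not declared
undeclared_pack <- . %>% summarise(nrow = nrow(.) == 10)

# With guessing enabled (the default) the type is inferred from the output
guessed_info <- mtcars %>%
  expose(some_data_pack = undeclared_pack) %>%
  get_packs_info()

# Assumption: without guessing, an undeclared pack is rejected with an error
strict_attempt <- tryCatch(
  mtcars %>% expose(some_data_pack = undeclared_pack, .guess = FALSE),
  error = function(e) "rejected"
)

# Declaring the type with data_packs() works in strict mode
strict_info <- mtcars %>%
  expose(data_packs(some_data_pack = undeclared_pack), .guess = FALSE) %>%
  get_packs_info()
```

Declaring the type up front removes the dependence on the guessing heuristics, which is what makes the strict variant more robust in non-interactive code.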
If a non-default rule separator was used in rules(), one should take this into account by using the .rule_sep argument. It takes a regular expression describing the separator. Note that by default it is the string ‘._.’ surrounded by any number of ‘non alpha-numeric characters’ (with use of inside_punct()). This is done to account for dplyr’s default separator _.
regular_col_packs <- col_packs(
  . %>% summarise_all(rules(mean(.) > 1))
)
irregular_col_packs <- col_packs(
  . %>% summarise_all(rules(mean(.) > 1, .prefix = "a_a_"))
)
regular_report <- mtcars %>%
  expose(regular_col_packs) %>%
  get_report()
irregular_report <- mtcars %>%
  expose(irregular_col_packs, .rule_sep = inside_punct("a_a_")) %>%
  get_report()
identical(regular_report, irregular_report)
#> [1] TRUE
# Without inside_punct() the '_' suffix stays attached to column names
mtcars %>%
  expose(irregular_col_packs, .rule_sep = "a_a_") %>%
  get_report()
#> Tidy data validation report:
#> # A tibble: 2 × 5
#> pack rule var id value
#> <chr> <chr> <chr> <int> <lgl>
#> 1 col_pack__1 rule__1 vs_ 0 FALSE
#> 2 col_pack__1 rule__1 am_ 0 FALSE
With an exposure attached to data, one can perform different kinds of actions: exploration, assertion, imputation and so on.
General actions are recommended to be done with act_after_exposure(). Besides the data, it takes two arguments:
- .trigger: a function which takes the data with attached exposure and returns TRUE if some action should be made.
- .actor: a function which takes the same argument as .trigger and performs some action.
If the trigger didn’t notify, then the input data is returned untouched. Otherwise the output of .actor() is returned.
Note that act_after_exposure() is often used for creating side effects (printing, throwing an error etc.); in that case the actor should invisibly return its input (to be able to use it with the pipe %>%).
# Trigger: TRUE if more than one pack was applied
trigger_one_pack <- function(.tbl) {
  packs_number <- .tbl %>%
    get_packs_info() %>%
    nrow()
  packs_number > 1
}
# Actor: print a note and invisibly return the input for further piping
actor_one_pack <- function(.tbl) {
  cat("More than one pack was applied.\n")
  invisible(.tbl)
}
mtcars %>%
  expose(my_col_packs, my_row_packs) %>%
  act_after_exposure(
    .trigger = trigger_one_pack,
    .actor = actor_one_pack
  ) %>%
  invisible()
#> More than one pack was applied.
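A trigger does not have to be written by hand. The sketch below assumes that any_breaker() from ruler can serve as a ready-made trigger; the column pack and the actor report_breaking_columns() are hypothetical and defined only for this illustration.

```r
library(dplyr)
library(ruler)

# Hypothetical column pack: every column mean should exceed 1
mean_packs <- col_packs(
  mean_pack = . %>% summarise_all(rules(mean_low = mean(.) > 1))
)

# Hypothetical actor: list columns that broke a rule, then pass the data on
report_breaking_columns <- function(.tbl) {
  report <- get_report(.tbl)
  breaking_vars <- report$var[!report$value]
  cat("Columns breaking rules:", paste(breaking_vars, collapse = ", "), "\n")
  invisible(.tbl)
}

result <- mtcars %>%
  expose(mean_packs) %>%
  act_after_exposure(
    .trigger = any_breaker,
    .actor = report_breaking_columns
  )
```

Since the actor returns its input invisibly, result is still the exposed data frame and can be piped into further steps.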
ruler has the function assert_any_breaker(), which can notify about the presence of any breaker in the exposure.
mtcars %>%
  expose(my_col_packs, my_row_packs) %>%
  assert_any_breaker()
#> Breakers report
#> Tidy data validation report:
#> # A tibble: 4 × 5
#> pack rule var id value
#> <chr> <chr> <chr> <int> <lgl>
#> 1 my_col_pack_1 mean_low vs 0 FALSE
#> 2 my_col_pack_1 mean_low am 0 FALSE
#> 3 col_pack__2 rule__1 vs 0 FALSE
#> 4 my_row_pack_1 is_common_row_mean .all 15 FALSE
#> Error: assert_any_breaker: Some breakers found in exposure.
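When stopping the pipeline is too strict, a softer assertion may be preferable. The sketch below assumes the .type and .silent arguments described in ?assert_any_breaker; the column pack is hypothetical and the handler only shows one possible way of dealing with the warning.

```r
library(dplyr)
library(ruler)

# Hypothetical column pack: every column mean should exceed 1
mean_packs <- col_packs(
  mean_pack = . %>% summarise_all(rules(mean_low = mean(.) > 1))
)

# Assumed arguments: .type = "warning" signals a warning instead of an error,
# .silent = TRUE suppresses printing of the breakers report
checked <- withCallingHandlers(
  mtcars %>%
    expose(mean_packs) %>%
    assert_any_breaker(.type = "warning", .silent = TRUE),
  warning = function(w) {
    # Log the warning and let the pipeline continue
    message("Validation warning: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
```

With this setup the data passes through the assertion, so the decision about how to treat breakers can be postponed to a later step.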