---
title: "Validation"
author: "Evgeni Chasnovski"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Validation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
library(ruler, quietly = TRUE, warn.conflicts = FALSE)
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)

# Packs from previous vignette
my_data_packs <- data_packs(
  my_data_pack_1 = . %>% summarise(
    nrow_low = nrow(.) > 10,
    nrow_high = nrow(.) < 30,
    ncol = ncol(.) == 12
  )
)

my_group_packs <- group_packs(
  . %>% group_by(vs, am) %>%
    summarise(any_cyl_6 = any(cyl == 6)),
  .group_vars = c("vs", "am")
)

is_integerish <- function(x) {
  all(x == as.integer(x))
}

my_col_packs <- col_packs(
  my_col_pack_1 = . %>% summarise_if(
    is_integerish,
    rules(mean_low = mean(.) > 0.5)
  ),
  . %>% summarise_at(vars(vs = "vs"), rules(sum(.) > 300))
)

z_score <- function(x) {
  (x - mean(x)) / sd(x)
}

my_row_packs <- row_packs(
  my_row_pack_1 = . %>% mutate(rowMean = rowMeans(.)) %>%
    transmute(is_common_row_mean = abs(z_score(rowMean)) < 1) %>%
    slice(10:15)
)

my_cell_packs <- cell_packs(
  my_cell_pack_1 = . %>% transmute_if(
    is_integerish,
    rules(is_common = abs(z_score(.)) < 1)
  ) %>%
    slice(20:24)
)
```

This vignette describes the actual validation step (called 'exposure') of the `ruler` workflow and shows some examples of what one can do with validation results. Packs from the vignette about rule packs are used for this.

## Exposure

### Overview

__Exposing__ data to rules means applying rule packs to the data, collecting the results in a common format and attaching them to the data as an `exposure` attribute. In this way the actual exposure can be done in multiple steps and can also be part of a general data preparation pipeline.

After attaching an exposure to a data frame one can extract information from it using the following functions:

- `get_exposure()` for exposure.
- `get_packs_info()` for packs info (part of exposure).
- `get_report()` for tidy data validation report (part of exposure).

For exposing data to rules use `expose()`:

- It takes data as its first argument, followed by the rule packs of interest (in pure form or inside a list of any depth).
- All rule packs are actually applied to a __keyed__ version of the data (see [keyholder](https://echasnovski.github.io/keyholder/)) for reasons described in the "Rule Packs" vignette. If the input has keys, they are removed and an _id key_ is created.
- Its output is guaranteed to be equivalent to the input data frame: only the `exposure` attribute might change. If the input already has an `exposure` attached, the new one is bound to it.

Simple example:

```{r Simple expose}
mtcars %>%
  expose(my_group_packs) %>%
  get_exposure()
```

### Don't remove obeyers

By default exposing removes obeyers. One can keep obeyers by setting `.remove_obeyers` to `FALSE`.

```{r Expose can not remove obeyers}
mtcars %>%
  expose(my_group_packs, .remove_obeyers = FALSE) %>%
  get_exposure()
```

### Set pack name

Notice the imputed group pack name `group_pack__1`. To change it one can set the name during creation with `group_packs()` or write the following:

```{r Renaming pack}
mtcars %>%
  expose(new_group_pack = my_group_packs[[1]]) %>%
  get_report()
```

### Expose step by step

One can expose to several packs at once or do it step by step:

```{r Two-step expose}
mtcars_one_step <- mtcars %>%
  expose(my_data_packs, my_col_packs)

mtcars_two_step <- mtcars %>%
  expose(my_data_packs) %>%
  expose(my_col_packs)

identical(mtcars_one_step, mtcars_two_step)
```

### Guessing

By default `expose()` guesses which type of pack a function represents (if it is not set manually). This is useful for interactive experiments. The guess is based on features of the pack's output structure (see `?expose` for more details).

```{r Expose can guess}
mtcars %>%
  expose(some_data_pack = . %>% summarise(nrow = nrow(.) == 10)) %>%
  get_exposure()
```

However, there are some edge cases (especially for group packs). To write strict and robust code one should use the `.guess = FALSE` option.

```{r Expose can not guess, error = TRUE, purl = FALSE}
mtcars %>%
  expose(some_data_pack = . %>% summarise(nrow = nrow(.) == 10),
         .guess = FALSE)
```

### Using different rule separator

If for some reason a non-default rule separator was used in `rules()`, one should take this into account with the `.rule_sep` argument. It takes a regular expression describing the separator. __Note__ that by default it is the string '._.' surrounded by any number of 'non alpha-numeric characters' (with use of `inside_punct()`). This is done to account for `dplyr`'s default separator `_`.

```{r Expose can change rule separator}
regular_col_packs <- col_packs(
  . %>% summarise_all(rules(mean(.) > 1))
)

irregular_col_packs <- col_packs(
  . %>% summarise_all(rules(mean(.) > 1, .prefix = "a_a_"))
)

regular_report <- mtcars %>%
  expose(regular_col_packs) %>%
  get_report()
irregular_report <- mtcars %>%
  expose(irregular_col_packs, .rule_sep = inside_punct("a_a_")) %>%
  get_report()

identical(regular_report, irregular_report)

# Note suffix '_' after column names
mtcars %>%
  expose(irregular_col_packs, .rule_sep = "a_a_") %>%
  get_report()
```

## Acting after exposure

### General actions

With an exposure attached to data one can perform different kinds of actions: exploration, assertion, imputation and so on.

General actions are recommended to be done with `act_after_exposure()`. Besides the data, it takes two arguments:

- `.trigger` - a function which takes the data with attached exposure and returns `TRUE` if some action should be made.
- `.actor` - a function which takes the same argument as `.trigger` and performs some action.

If the trigger didn't notify, the input data is returned untouched. Otherwise the output of `.actor()` is returned.
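
The 'untouched' case can be illustrated with a minimal sketch in which the trigger never notifies, so the actor is never run. The helpers `never_trigger()` and `loud_actor()` and the inline data pack are hypothetical names made up for this example; they are not part of `ruler`.

```{r Trigger does not notify}
library(ruler)
library(dplyr)

# Hypothetical trigger: never asks for an action
never_trigger <- function(.tbl) {
  FALSE
}

# Hypothetical actor: would fail loudly if it were ever called
loud_actor <- function(.tbl) {
  stop("Actor should not be called.")
}

# Throwaway data pack, defined inline only for this sketch
exposed <- mtcars %>%
  expose(data_packs(. %>% summarise(nrow_low = nrow(.) > 10)))

# The trigger returns FALSE, so the actor is skipped
result <- exposed %>%
  act_after_exposure(.trigger = never_trigger, .actor = loud_actor)

# Compare the returned object with the input
identical(result, exposed)
```

Making the actor throw an error is a deliberate choice here: if the pipeline completes, the actor was indeed skipped.
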
__Note__ that `act_after_exposure()` is often used for creating side effects (printing, throwing an error etc.), and in that case the actor should invisibly return its input (so that the result can be used further with the pipe `%>%`).

```{r Acting after exposure}
trigger_one_pack <- function(.tbl) {
  packs_number <- .tbl %>%
    get_packs_info() %>%
    nrow()

  packs_number > 1
}

actor_one_pack <- function(.tbl) {
  cat("More than one pack was applied.\n")

  invisible(.tbl)
}

mtcars %>%
  expose(my_col_packs, my_row_packs) %>%
  act_after_exposure(
    .trigger = trigger_one_pack,
    .actor = actor_one_pack
  ) %>%
  invisible()
```

### Assert presence of rule breaker

`ruler` has the function `assert_any_breaker()`, which can notify about the presence of any breaker in the exposure.

```{r Assert any breaker, error = TRUE, purl = FALSE}
mtcars %>%
  expose(my_col_packs, my_row_packs) %>%
  assert_any_breaker()
```
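
To see which rules are broken, rather than only learning that some are, one option is to filter the tidy report directly. The sketch below assumes that the report stores rule results in a logical `value` column where `FALSE` marks a breaker; `size_packs` and its rule names are made up for this example.

```{r Inspect breakers}
library(ruler)
library(dplyr)

# Illustrative data pack: mtcars has 32 rows, so only `nrow_high` is broken
size_packs <- data_packs(
  size_pack = . %>% summarise(
    nrow_low = nrow(.) > 10,
    nrow_high = nrow(.) < 30
  )
)

# Keep obeyers in the report, then select breakers explicitly
breakers <- mtcars %>%
  expose(size_packs, .remove_obeyers = FALSE) %>%
  get_report() %>%
  filter(!value)

breakers
```

Obeyers are kept here only to make the filtering step visible; with the default `.remove_obeyers = TRUE` the report should already be limited to breakers.
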