---
title: "Rule Packs"
author: "Evgeni Chasnovski"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Rule Packs}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
library(ruler, quietly = TRUE, warn.conflicts = FALSE)
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)
```

This vignette describes common ways of creating rule packs and explains the logic behind them.

## Overview

A __rule__ is a function which converts a data unit of interest (data, group, column, row, cell) to a logical value indicating whether this object satisfies a certain condition. A __rule pack__ is a function which combines several rules for a common data unit into one functional block. The recommended way of creating rules is to create packs right away with the use of `dplyr` and [magrittr](https://magrittr.tidyverse.org/)'s pipe operator.

Some of `ruler`'s functionality is powered by the [keyholder](https://echasnovski.github.io/keyholder/) package. It is highly recommended to use its supported functions during rule pack construction. All one- and two-table `dplyr` verbs applied to local data frames are supported and considered the most appropriate way to create rule packs.

As described in the vignette about the design process, a rule pack must have a __type__ because outputs for different data units have different structures. For this reason `ruler` has a family of `*_packs()` constructors (where `*` stands for the name of the data unit):

- They take functions defining packs (in pure form or inside a list at any depth) as arguments. It is recommended to name those arguments with future pack names. If no name is supplied, it will be imputed during exposure.
- They return a list of what should be rule packs of a certain type.
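
As a minimal sketch of the constructor behaviour described above, the next chunk passes one named pack directly and one unnamed pack nested inside lists. The functions `has_rows` and `has_cols` and the list `sketch_packs` are hypothetical names used only for illustration; they are not part of `ruler`.

```{r Pack constructor sketch}
library(ruler)
library(dplyr)

# Hypothetical pack-defining functions (illustration only)
has_rows <- . %>% summarise(has_rows = nrow(.) > 0)
has_cols <- . %>% summarise(has_cols = ncol(.) > 0)

sketch_packs <- data_packs(
  # Named argument: the name is meant to become the pack name
  named_pack = has_rows,
  # Unnamed pack nested inside lists; its name should be imputed during exposure
  list(list(has_cols))
)

# Both packs are expected to end up in one flat list
length(sketch_packs)
```

Naming every pack explicitly, as recommended above, should make validation reports easier to read than relying on imputed names.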

## Data rule packs

To check whether the dimensions of `mtcars` obey some rules one can write the following `dplyr` pipeline:

```{r Data properties of mtcars}
mtcars %>% summarise(
  nrow_low = nrow(.) > 10,
  nrow_high = nrow(.) < 30,
  ncol = ncol(.) == 12
)
```

The output has the following structure:

- The number of rows equals __one__.
- Column names define __rule names__.
- Values indicate whether the __data as a whole__ follows the rule.

There is an easy way to transform this pipeline into a function that can be used with any data: `mtcars` should be replaced with the `.` character. To indicate that this function is a rule pack for the data unit 'data', it should be wrapped with `data_packs()`.

The next code creates a list `my_data_packs` with one data rule pack named `my_data_pack_1`. That rule pack defines rules with the names `nrow_low`, `nrow_high` and `ncol`.

```{r Data rule packs}
my_data_packs <- data_packs(
  my_data_pack_1 = . %>% summarise(
    nrow_low = nrow(.) > 10,
    nrow_high = nrow(.) < 30,
    ncol = ncol(.) == 12
  )
)
```

## Group rule packs

To check whether certain groups of rows of `mtcars` obey some rules one can write the following `dplyr` pipeline:

```{r Group properties of mtcars}
mtcars %>%
  group_by(vs, am) %>%
  summarise(any_cyl_6 = any(cyl == 6))
```

The output has the following structure:

- Some columns define group levels (`vs` and `am` in this case).
- The number of rows equals the __number of validated groups__.
- Names of non-grouping columns define __rule names__.
- Values indicate whether the __group as a whole__ follows the rule.

The next code creates a list with one nameless group rule pack (the name will be imputed during exposure). This pack contains one rule `any_cyl_6` which checks every group defined by the `vs` and `am` columns.

```{r Group rule packs}
my_group_packs <- group_packs(
  . %>% group_by(vs, am) %>%
    summarise(any_cyl_6 = any(cyl == 6)),
  .group_vars = c("vs", "am")
)
```

__Notes__:

- In this example the pack's output is a grouped tibble. This doesn't affect anything because during exposure the output is `ungroup`ed.
- Names of grouping columns should be supplied with the `.group_vars` argument to distinguish them from non-grouping ones.
- The value for the `var` column in the validation report is created by [uniting](https://tidyr.tidyverse.org/reference/unite.html) the grouping columns with the default separator `.`. In this case the values will be `0.0`, `0.1`, `1.0`, `1.1`. To change the separator, supply it with the `.group_sep` argument.

## Column rule packs

To check whether certain columns of `mtcars` obey some rules one can write the following `dplyr` pipeline:

```{r Column properties of mtcars}
is_integerish <- function(x) {
  all(x == as.integer(x))
}

mtcars %>%
  summarise_if(is_integerish, list(mean_low = ~ mean(.) > 0.5))
```

The output has the following structure:

- The number of rows equals __one__.
- Column names are constructed as **'validated column name' + 'separator' + 'rule name'**.
- Values indicate whether the __column as a whole__ follows the rule.

In general it is hard to automatically separate the output's column names into 'validated column name' and 'rule name' because the default separator `_` is commonly used inside names. For this reason `ruler` has the function `rules()` with the following functionality:

- Impute rule names that are not supplied. The format is 'rule__{ind}' where {ind} is the position of the function among the arguments of `rules()`.
- Add the rarely used prefix `._.` (Morse code for 'R') to rule names. __Note__ that one can change this prefix with the `.prefix` argument.

The next code creates a list with two elements:

- A column rule pack `my_col_pack_1` which checks obedience of 'integerish' columns to the rule `mean_low`.
- A nameless column rule pack which checks obedience of the column `vs` to an unnamed rule (its name will be imputed as `rule__1`).

__Note__ the use of a named argument in `vars(vs = "vs")`. This is the current way in `dplyr`'s scoped variants of `summarise` and `mutate` to force using both column and function names in the output's column name.

```{r Column rule packs}
my_col_packs <- col_packs(
  my_col_pack_1 = . %>% summarise_if(
    is_integerish,
    rules(mean_low = mean(.) > 0.5)
  ),
  . %>% summarise_at(vars(vs = "vs"), rules(sum(.) > 300))
)
```

## Row rule packs

To check whether certain rows of `mtcars` are not outliers one can write the following `dplyr` pipeline:

```{r Row properties of mtcars}
z_score <- function(x) {
  (x - mean(x)) / sd(x)
}

mtcars %>%
  mutate(rowMean = rowMeans(.)) %>%
  transmute(is_common_row_mean = abs(z_score(rowMean)) < 1) %>%
  slice(10:15)
```

The output has the following structure:

- The number of rows equals the __number of checked rows__.
- Column names define __rule names__.
- Values indicate whether the __row as a whole__ follows the rule.

A pipeline like the one above is quite common: for every row compute some value based on all rows and then validate only some of them. However, in the validation report the column `id` should represent the row index _in the original data frame_, and this information is missing after applying `slice()`. This problem is solved by using the [keyholder](https://echasnovski.github.io/keyholder/) package. Its main purpose is to track information about rows while modifying a data frame. During exposure a pack is applied to the __keyed__ version of the input data, with the key equal to the row index. __Note__ that to use this feature one should create rule packs using a composition of functions supported by `keyholder`.

The next code creates a list with one row pack `my_row_pack_1`. It contains one rule `is_common_row_mean` that checks 6 rows (from 10 to 15) for not being an outlier (based on information from all rows) in terms of row means.

```{r Row rule packs}
my_row_packs <- row_packs(
  my_row_pack_1 = . %>%
    mutate(rowMean = rowMeans(.)) %>%
    transmute(is_common_row_mean = abs(z_score(rowMean)) < 1) %>%
    slice(10:15)
)
```

## Cell rule packs

To check whether certain cells of `mtcars` are not outliers one can write the following `dplyr` pipeline:

```{r Cell properties of mtcars}
mtcars %>%
  transmute_if(
    is_integerish,
    list(is_common = ~ abs(z_score(.)) < 1)
  ) %>%
  slice(20:24)
```

The output has the following structure:

- The number of rows equals the __number of rows for checked cells__.
- Column names are constructed as **'validated column name' + 'separator' + 'rule name'**.
- Values indicate whether the __cell__ follows the rule.

Basically, a cell rule pack is a combination of column and row rule packs. This means:

- Using `rules()` instead of a plain list in scoped variants of `transmute()`.
- Using functions supported by `keyholder`.

The next code creates a list with one cell pack `my_cell_pack_1`. It checks cells of every integer-like column in rows 20-24 for not being an outlier within the column.

```{r Cell rule packs}
my_cell_packs <- cell_packs(
  my_cell_pack_1 = . %>%
    transmute_if(
      is_integerish,
      rules(is_common = abs(z_score(.)) < 1)
    ) %>%
    slice(20:24)
)
```
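
Packs of every type are put to work in the same way, during exposure. As a minimal sketch, assuming `ruler`'s `expose()` and `get_report()` functions, the next chunk exposes `mtcars` to one data pack and one row pack and extracts the validation report. The pack names `sketch_dims` and `sketch_mpg`, the rule `mpg_low`, and the objects `sketch_data_packs`, `sketch_row_packs` and `sketch_report` are hypothetical and serve only as an illustration.

```{r Exposure sketch}
library(ruler)
library(dplyr)

# Hypothetical packs, defined here only to keep the sketch self-contained
sketch_data_packs <- data_packs(
  sketch_dims = . %>% summarise(
    nrow_low = nrow(.) > 10,
    ncol = ncol(.) == 12
  )
)
sketch_row_packs <- row_packs(
  # Compute the rule for all rows, then keep only rows 18 to 20
  sketch_mpg = . %>%
    transmute(mpg_low = mpg < 30) %>%
    slice(18:20)
)

# Apply the packs to the data and extract the validation report
sketch_report <- mtcars %>%
  expose(sketch_data_packs, sketch_row_packs) %>%
  get_report()

sketch_report
```

For entries coming from the row pack, the `id` column of the report should refer to row positions in the original `mtcars` rather than positions after `slice()`, which is the row tracking that `keyholder` provides.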