--- title: "Design Process and Exposure Format" author: "Evgeni Chasnovski" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Design Process and Exposure Format} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- The main idea of the `ruler` package is to create a format of validation results (along with functional API) that will work naturally with [tidyverse](https://www.tidyverse.org/) tools. This vignette will: - Guide you through the design process of __exposure__: `ruler`'s validation result format. This should help to understand the foundations of `ruler` validation workflow. - Describe exposure format. ## Design process The preferred local data structure in `tidyverse` is [tibble](https://tibble.tidyverse.org): "A modern re-imagining of the data frame", on which its implementation is based. That is why `ruler` uses data frames as preferred format for data to be validated. However the initial goal is to use tibbles in creation of validation result format as much as possible. Basically data frame is a list of variables with the same length. It is easier to think about it as two-dimensional structure where columns can be of different types. In abstract form validation of data frame can be put as ___asking whether certain subset of data frame (data unit) obeys certain rule___. The result of validation is logical __value__ representing an answer. With influence of [dplyr](https://dplyr.tidyverse.org)'s grammar of data manipulation a data frame can be represented in terms of the following data units: - Data frame as a whole. Validation can be done by `summarise()` _without_ grouping. - Collection of groups of rows. Validation can be done by `summarise()` _with_ grouping. - Collection of columns. Validation can be done by scoped variants of `summarise()` _without_ grouping: `summarise_all()`, `summarise_if()` and `summarise_at()`. - Collection of rows. Validation can be done by `transmute()`. - 2d-collection of cells. Validation can be done by scoped variants of `transmute()`: `transmute_all()`, `transmute_if()` and `transmute_at()`. In `ruler` data, group, column, row and cell are five basic data units. They all can be described by the combination of two variables: - __var__ which represents the variable name of data unit: - Value '.all' is reserved for 'all columns as a whole'. - Value _equal_ to some column name indicates column of data unit. - Value _not equal_ to some column name indicates the name of group: it is created by uniting (with delimiter) group levels of grouping columns. - __id__ which represents the row index of data unit: - Value 0 is reserved for 'all rows as a whole'. - Value not equal to 0 indicates the row index of data unit. Validation of data units can be done with the `dplyr` functions described above. Their application to some data unit can give answers to multiple questions. That is why by design __rules__ (functions that answer one certain question about one type of data unit) are combined in __rule packs__ (functions that answer multiple questions about one type of data unit). Application of rule pack to data is connected with several points: - Rule packs should have unique __names__ to be used as references. - By the same reason rules should have names. However uniqueness is necessary only within corresponding rule pack which makes pair 'pack name'+'rule name' a key of identifying the actual rule. - Output of rule packs for different data units differ in their structure. Therefore rule packs should have __types__ to apply different interpretations to their outputs. - During the actual validation the most part of results normally indicates obedience to rules. This can cause storing many redundant information in validation results. `ruler` has option of __removing obeyers__ from results during the validation. In `ruler` __exposing__ data to rules means applying rule packs to data, collecting results in common format and attaching them to the data as an `exposure` attribute. In this way actual exposure can be done in multiple steps and also be a part of a general data preparation pipeline. ## Exposure __Exposure__ is a format designed to contain uniform information about validation of different data units. For reproducibility it also saves information about applied packs. Basically exposure is a list with two elements: 1. __Packs info__: a `tibble` with the following structure: - _name_ \ : Name of the pack. If not set manually it will be imputed during exposure. - _type_ \ : Name of pack type. Indicates which data unit pack checks. - _fun_ \ : List (preferably unnamed) of rule pack functions. - _remove_obeyers_ \ : Whether rows about obeyers (data units that obey certain rule) were removed from report after applying pack. 2. __Tidy data validation report__: a `tibble` with the following structure: - _pack_ \ : Name of rule pack (from column 'name' in packs info). - _rule_ \ : Name of the rule defined in rule pack. - _var_ \ : Name of the data unit variable. - _id_ \ : Row index of data unit. - _value_ \ : Whether the described data unit obeys the rule.