---
title: "Introduction to keyholder"
author: "Evgeni Chasnovski"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to keyholder}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
options(tibble.print_min = 3, tibble.print_max = 3)
```

`keyholder` is a package for storing information (*keys*) about rows of data frame-like objects. The common use cases are to track rows of data without modifying it and to back up and restore information about rows. This is done by creating a class __keyed_df__ which has a special attribute "keys". Keys are updated according to changes in the rows of the reference data frame.

`keyholder` is designed to work closely with the [dplyr](https://github.com/tidyverse/dplyr) package. All its one- and two-table verbs update keys properly.

```{r setup, include = FALSE}
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)
library(keyholder, quietly = TRUE, warn.conflicts = FALSE)
```

```{r Create basic example tibble}
mtcars_tbl <- mtcars %>% as_tibble()
```

## Set keys

By convention, keys are always converted to a [tibble](https://tibble.tidyverse.org). This way one can use multiple variables as keys by binding them.

There are two general ways of creating keys:

- By assigning. The assigned object will be converted to a tibble with `as_tibble()`. To make sense it should have the same number of rows as the reference data frame. There are two functions for assigning: `keys<-` and `assign_keys()`, which are basically the same. The former is more suitable for interactive use and the latter for piping with [magrittr](https://magrittr.tidyverse.org)'s pipe operator `%>%`.

```{r set keys by assigning}
mtcars_tbl_keyed <- mtcars_tbl
keys(mtcars_tbl_keyed) <- tibble(id = 1:nrow(mtcars_tbl_keyed))

mtcars_tbl %>% assign_keys(tibble(id = 1:nrow(.)))
```

- With `key_by()` and its scoped variants (`*_all()`, `*_if()` and `*_at()`). This is similar in design to `dplyr`'s `group_by()`: it takes some columns from the reference data frame and makes keys from them. It has two important options: `.add` (whether to add the specified columns to existing keys) and `.exclude` (whether to exclude the specified columns from the reference data frame). Grouping is ignored.

```{r set keys by key_by}
mtcars_tbl %>% key_by(vs, am)

mtcars_tbl %>% key_by(starts_with("c"))

mtcars_tbl %>% key_by(starts_with("c"), .exclude = TRUE)

# Scoped variants
mtcars_tbl %>% key_by_all()

# One can also rename variables before keying by supplying .funs
mtcars_tbl %>% key_by_if(rlang::is_integerish, .funs = toupper)

mtcars_tbl %>% key_by_at(c("vs", "am"))
```

To track rows use `use_id()`, which creates a special key `.id` with row numbers as values.

To properly unkey an object use `unkey()`.

```{r unkey}
mtcars_tbl_keyed <- mtcars_tbl %>% key_by(vs, am)

# Good
mtcars_tbl_keyed %>% unkey()

# Bad: drops the attribute by hand instead of using unkey()
attr(mtcars_tbl_keyed, "keys") <- NULL
mtcars_tbl_keyed
```

## Get keys

There are three ways of extracting keys:

- With `keys()`. This function always returns a tibble. If there are no keys, it returns a tibble with zero columns and the same number of rows as the reference data frame.

```{r get keys with keys}
mtcars_tbl %>% keys()

mtcars_tbl %>% key_by(vs, am) %>% keys()
```

- With `raw_keys()`, which is just a wrapper for `attr(.tbl, "keys")`.

```{r get keys with raw_keys}
mtcars_tbl %>% raw_keys()

mtcars_tbl %>% key_by(vs, am) %>% raw_keys()
```

- With `pull_key()`, which works like `dplyr`'s `pull()` applied to keys:

```{r get keys with pull_key}
mtcars_tbl %>% key_by(vs, am) %>% pull_key(vs)
```

## Manipulate keys

- Remove certain keys with `remove_keys()` and its scoped variants.
If all keys are removed, one can automatically unkey the object by setting the option `.unkey` to `TRUE`.

```{r remove keys}
mtcars_tbl %>% key_by(vs, mpg) %>% remove_keys(vs)

mtcars_tbl %>% key_by(vs, mpg) %>% remove_keys(everything(), .unkey = TRUE)

# Scoped variants
# Identical to previous one
mtcars_tbl %>% key_by(vs, mpg) %>% remove_keys_all(.unkey = TRUE)

mtcars_tbl %>% key_by(vs, mpg) %>% remove_keys_if(rlang::is_integerish)
```

- Restore certain keys with `restore_keys()` and its scoped variants. Restoring means creating or modifying a column in the reference data frame with values taken from keys. After restoring a key one can remove it from keys by setting `.remove` to `TRUE`. There is also an option `.unkey` identical to the one in `remove_keys()` (which is meaningful only when `.remove` is `TRUE`).

```{r restore keys}
mtcars_tbl_keyed <- mtcars_tbl %>%
  key_by(vs, mpg) %>%
  mutate(vs = 1, mpg = 0)
mtcars_tbl_keyed

mtcars_tbl_keyed %>% restore_keys(vs)

mtcars_tbl_keyed %>% restore_keys(vs, .remove = TRUE)

mtcars_tbl_keyed %>% restore_keys(vs, mpg, .unkey = TRUE)

mtcars_tbl_keyed %>% restore_keys(vs, mpg, .remove = TRUE, .unkey = TRUE)

# Scoped variants
mtcars_tbl_keyed %>% restore_keys_all()

mtcars_tbl_keyed %>% restore_keys_if(rlang::is_integerish, .remove = TRUE)
```

One important feature of `restore_keys()` is that restoring keys overrides the rule of not modifying grouping variables. This follows the idea behind keys: they contain information about rows, and by restoring a key you want that information to be available. Groups are recomputed after restoring.

```{r restore keys grouping}
mtcars_tbl_keyed %>% group_by(vs, mpg)

mtcars_tbl_keyed %>% group_by(vs, mpg) %>% restore_keys(vs, mpg)
```

- Rename keys with `rename_keys()` and its scoped variants. Renaming is done with `dplyr`'s `rename()` or its scoped variant, so the renaming format is the same as theirs.

```{r rename keys}
mtcars_tbl %>% key_by(vs, am) %>% rename_keys(Vs = vs)

# Scoped variants
mtcars_tbl %>% key_by(vs, am) %>% rename_keys_all(.funs = toupper)
```

## React to subset

A method for the subsetting function `[` is implemented for `keyed_df` to react to changes in rows: if rows in the reference data frame are rearranged or removed, the same operation is done to keys.

```{r reaction to subset}
mtcars_tbl_subset <- mtcars_tbl %>% key_by(vs, am) %>%
  `[`(c(3, 18, 19), c(2, 8, 9))

mtcars_tbl_subset

keys(mtcars_tbl_subset)
```

## Verbs from dplyr

All one- and two-table verbs from `dplyr` (together with their scoped variants, where present) support `keyed_df`. Most functions react to changes in rows as `[` does, but some functions (`summarise()`, `distinct()` and `do()`) unkey the object.

```{r dplyr verbs}
mtcars_tbl_keyed <- mtcars_tbl %>% key_by(vs, am)

mtcars_tbl_keyed %>% select(gear, mpg)

mtcars_tbl_keyed %>% summarise(meanMPG = mean(mpg))

mtcars_tbl_keyed %>% filter(vs == 1) %>% keys()

mtcars_tbl_keyed %>% arrange_at("mpg") %>% keys()

band_members %>% key_by(name) %>%
  semi_join(band_instruments, by = "name") %>%
  keys()
```
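
The row-tracking use case from the introduction can be sketched by combining `use_id()` with the verbs above. This is a minimal illustration, assuming that `filter()` keeps the `.id` key in sync with the remaining rows in the same way it does for other keys; the names `mtcars_kept` and `kept_ids` are made up for this example.

```{r track rows with use_id}
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)
library(keyholder, quietly = TRUE, warn.conflicts = FALSE)

# Attach row numbers as the `.id` key, then subset rows with a dplyr verb
mtcars_kept <- mtcars %>%
  as_tibble() %>%
  use_id() %>%
  filter(vs == 1, gear == 4)

# The `.id` key should hold the positions of the surviving rows
# in the original data frame
kept_ids <- pull_key(mtcars_kept, .id)
kept_ids
```

Because the identifiers live in the keys rather than in a column, the reference data frame keeps its original set of columns throughout the pipeline.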