--- title: "Create pdqr-functions with `new_*()`" output: rmarkdown::html_vignette: fig_width: 6.5 fig_height: 4 vignette: > %\VignetteIndexEntry{Create pdqr-functions with `new_*()`} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(pdqr) set.seed(101) ``` Package 'pdqr' supports two types of distributions: - **Type "discrete"**: random variable has finite number of output values. It is explicitly defined by the collection of its values with their corresponding probability. - **Type "continuous"**: there are infinite number of output values in the form of continuous random variable. It is explicitly defined by piecewise-linear density function. **Note** that all distributions assume **finite support** (output values are bounded from below and above) and **finite values of density function** (density function in case of "continuous" type can't go to infinity). All `new_*()` functions create a pdqr-function of certain type ("discrete" or "continuous") based on sample or data frame of appropriate structure: - **Sample input** is processed based on type. For "discrete" type it gets tabulated with frequency of unique values serving as their probability. For "continuous" type distribution density is estimated using [`density()`](https://rdrr.io/r/stats/density.html) function if input has at least 2 elements. For 1 element special "dirac-like" pdqr-function is created: an *approximation single number* with triangular distribution of very narrow support (1e-8 of magnitude). Basically, sample input is converted into data frame of appropriate structure that defines distribution (see next list item). - **Data frame input** should completely define distribution. For "discrete" type it should have "x" and "prob" columns for output values and their probabilities. For "continuous" type - "x" and "y" columns for points, which define piecewise-linear continuous density function. Columns "prob" and "y" will be automatically normalized to represent proper distribution: sum of "prob" will be 1 and total square under graph of piecewise-linear function will be 1. We will use the following data frame inputs in examples: ```{r setup_data-frame-inputs} # For type "discrete" dis_df <- data.frame(x = 1:4, prob = 4:1 / 10) # For type "continuous" con_df <- data.frame(x = 1:4, y = c(0, 1, 1, 1)) ``` This vignette is organized as follows: - Four sections about how to create p-, d-, q-, and r-functions (both from sample and data frame). - Section "Special cases", which describes two special cases of pdqr-functions: dirac-like and boolean. - Section "Using `density()` arguments" describes how to use `density()` arguments to tweak smoothing during creation of "continuous" pdqr-functions. - "Metadata of pdqr-functions" describes the concept of metadata of pdqr-functions. ## P-functions P-function (analogue of `p*()` functions in base R) represents a cumulative distribution function of distribution. ### From sample ```{r p-fun_sample} # Treating input as discrete p_mpg_dis <- new_p(mtcars$mpg, type = "discrete") p_mpg_dis # Treating input as continuous p_mpg_con <- new_p(mtcars$mpg, type = "continuous") p_mpg_con # Outputs are actually vectorized functions p_mpg_dis(15:20) p_mpg_con(15:20) # You can plot them directly using base `plot()` and `lines()` plot(p_mpg_con, main = "P-functions from sample") lines(p_mpg_dis, col = "blue") ``` ### From data frame ```{r p-fun_data-frame} p_df_dis <- new_p(dis_df, type = "discrete") p_df_dis p_df_con <- new_p(con_df, type = "continuous") p_df_con plot(p_df_con, main = "P-functions from data frame") lines(p_df_dis, col = "blue") ``` ## D-functions D-function (analogue of `d*()` functions in base R) represents a probability mass function for "discrete" type and density function for "continuous": ### From sample ```{r d-fun_sample} # Treating input as discrete d_mpg_dis <- new_d(mtcars$mpg, type = "discrete") d_mpg_dis # Treating input as continuous d_mpg_con <- new_d(mtcars$mpg, type = "continuous") d_mpg_con # Outputs are actually vectorized functions d_mpg_dis(15:20) d_mpg_con(15:20) # You can plot them directly using base `plot()` and `lines()` op <- par(mfrow = c(1, 2)) plot(d_mpg_con, main = '"continuous" d-function\nfrom sample') plot(d_mpg_dis, main = '"discrete" d-function\nfrom sample', col = "blue") par(op) ``` ### From data frame ```{r d-fun_data-frame} d_df_dis <- new_d(dis_df, type = "discrete") d_df_dis d_df_con <- new_d(con_df, type = "continuous") d_df_con op <- par(mfrow = c(1, 2)) plot(d_df_con, main = '"continuous" d-function\nfrom data frame') plot(d_df_dis, main = '"discrete" d-function\nfrom data frame', col = "blue") par(op) ``` ## Q-functions Q-function (analogue of `q*()` functions in base R) represents a quantile function, an inverse of corresponding p-function: ### From sample ```{r q-fun_sample} # Treating input as discrete q_mpg_dis <- new_q(mtcars$mpg, type = "discrete") q_mpg_dis # Treating input as continuous q_mpg_con <- new_q(mtcars$mpg, type = "continuous") q_mpg_con # Outputs are actually vectorized functions q_mpg_dis(c(0.1, 0.3, 0.7, 1.5)) q_mpg_con(c(0.1, 0.3, 0.7, 1.5)) # You can plot them directly using base `plot()` and `lines()` plot(q_mpg_con, main = "Q-functions from sample") lines(q_mpg_dis, col = "blue") ``` ### From data frame ```{r q-fun_data-frame} q_df_dis <- new_q(dis_df, type = "discrete") q_df_dis q_df_con <- new_q(con_df, type = "continuous") q_df_con plot(q_df_con, main = "Q-functions from data frame") lines(q_df_dis, col = "blue") ``` ## R-functions R-function (analogue of `r*()` functions in base R) represents a random generation function. For "discrete" type it will generate only values present in input. For "continuous" function it will generate values from distribution corresponding to one estimated with `density()`. ### From sample ```{r r-fun_sample} # Treating input as discrete r_mpg_dis <- new_r(mtcars$mpg, type = "discrete") r_mpg_dis # Treating input as continuous r_mpg_con <- new_r(mtcars$mpg, type = "continuous") r_mpg_con # Outputs are actually functions r_mpg_dis(5) r_mpg_con(5) # You can plot them directly using base `plot()` and `lines()` op <- par(mfrow = c(1, 2)) plot(r_mpg_con, main = '"continuous" r-function\nfrom sample') plot(r_mpg_dis, main = '"discrete" r-function\nfrom sample', col = "blue") par(op) ``` ### From data frame ```{r r-fun_data-frame} r_df_dis <- new_r(dis_df, type = "discrete") r_df_dis r_df_con <- new_r(con_df, type = "continuous") r_df_con op <- par(mfrow = c(1, 2)) plot(r_df_con, main = '"continuous" r-function\nfrom data frame') plot(r_df_dis, main = '"discrete" r-function\nfrom data frame', col = "blue") par(op) ``` ## Special cases ### Dirac-like When creating "continuous" pdqr-function with `new_*()` from single number, a special "dirac-like" pdqr-function is created. It is an *approximation of single number* with triangular distribution of very narrow support (1e-8 of magnitude): ```{r dirac} r_dirac <- new_r(3.14, type = "continuous") r_dirac r_dirac(4) # Outputs aren't exactly but approximately equal dput(r_dirac(4)) ``` ### Boolean Boolean pdqr-function is a special case of "discrete" function, which values are exactly 0 and 1. Those functions are usually created after transformations involving logical operators (see vignette on transformation for more details). It is assumed that 0 represents that some expression is false, and 1 is for being true. Corresponding probabilities describe distribution of expression's logical values. The only difference from other "discrete" pdqr-functions is in more detailed printing. ```{r boolean} new_d(data.frame(x = c(0, 1), prob = c(0.25, 0.75)), type = "discrete") ``` ## Using `density()` arguments When creating pdqr-function of "continuous" type, `density()` is used to estimate density. To tweak its performance, supply its extra arguments directly to `new_*()` functions. Here are some examples: ```{r density-args} plot( new_d(mtcars$mpg, "continuous"), lwd = 3, main = "Examples of `density()` options" ) # Argument `adjust` of `density()` helps to define smoothing bandwidth lines(new_d(mtcars$mpg, "continuous", adj = 0.3), col = "blue") # Argument `n` defines number of points to be used in piecewise-linear # approximation lines(new_d(mtcars$mpg, "continuous", n = 5), col = "green") # Argument `cut` defines the "extending" property of density estimation. # Using `cut = 0` assumes that density can't go outside of input's range lines(new_d(mtcars$mpg, "continuous", cut = 0), col = "magenta") ``` ## Metadata of pdqr-functions Every pdqr-function has metadata, information which describes underline distribution and pdqr-function. Family of `meta_*()` functions are implemented to extract that information: - **"x_tbl" metadata** (returned by `meta_x_tbl()`) completely defines distribution. It is a data frame with structure depending on type of pdqr-function: - For "discrete" type it has columns "x" (output values), "prob" (their probability), and "cumprob" (their cumulative probability). - For "continuous" type it has columns "x" (knots of piecewise-linear density), "y" (density values at those points), "cumprob" (their cumulative probability). - **Pdqr class** (returned by `meta_class()`) - class of pdqr-function. This can be one of "p", "d", "q", "r". Represents how pdqr-function describes underlying distribution. - **Pdqr type** (returned by `meta_type()`) - type of pdqr-function. This can be one of "discrete" or "continuous". Represents type of underlying distribution. - **Pdqr support** (returned by `meta_support()`) - support of distribution. This is a range of "x" column from "x_tbl" metadata. ```{r meta_x_tbl} # Type "discrete" d_dis <- new_d(1:4, type = "discrete") meta_x_tbl(d_dis) meta_class(d_dis) meta_type(d_dis) meta_support(d_dis) # Type "continuous" p_con <- new_p(1:4, type = "continuous") head(meta_x_tbl(p_con)) meta_class(p_con) meta_type(p_con) meta_support(p_con) # Dirac-like "continuous" function r_dirac <- new_r(1, type = "continuous") dput(meta_x_tbl(r_dirac)) dput(meta_support(r_dirac)) # `meta_all()` returns all metadata in a single list meta_all(d_dis) ``` For more details go to help page of `meta_all()`.