--- title: "Data Validation Rules" author: "Duncan Garmonsway" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Data Validation Rules} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## What data validation rules are Data validation rules control what constants can be entered into a cell, e.g. any whole number between 0 and 9, or one of several values from another part of the spreadsheet. ‘xlsx_validation()’ returns each of the data validation rules in an xlsx file, and the ranges of cells to which each rule applies. Here is a rule that restricts input to integers between 0 and 9 inclusive, or no value (blank). If any other value is attempted, then an error message is displayed with the imaginative title "message title", the informative body text "message body", and a "stop" symbol. ```{r} library(tidyxl) library(dplyr) library(tidyr) examples <- system.file("extdata/examples.xlsx", package = "tidyxl") glimpse(xlsx_validation(examples)[1, ]) ``` The gamut of possible rules is given in the examples for `xlsx_validation()`. ```{r} as.data.frame(xlsx_validation(examples)) ``` ## Joining rules to cells > There are no built-in functions for joining ranges like `A1:D5,G8` to single > cells like `B3`. For now, use the snippets in this section. In future I > might develop a dplyr-like join function (this is hard currently because dplyr > doesn't yet join on arbitrary functions, or even the standard inequalities > like `>=`). Help and advice would be gratefully accepted! To join rules to cells, a naive method is to use the `sheet` and `ref` columns to match the `sheet` and `address` columns to the output of `xlsx_cells()`. ```{r} rules <- xlsx_validation(examples) cells <- filter(xlsx_cells(examples), row >= 106, col == 1) rules cells inner_join(rules, cells, by = c("sheet" = "sheet", "ref" = "address")) ``` Notice that only 9 cells were joined, even though 15 rules were defined. Surely at least 15 cells ought to be joined? The reason why they are not is that the cells for the other 6 rules don't exist -- rules can be defined for cells that have no value, and cells with no value are not returned by `xlsx_cells()`, otherwise all `r sprintf("%11.0f", 16384 * 1048576)` cells in a worksheet must be returned. A more subtle reason for certain cells not to have joined successfully is that the `ref` column of the rules sometimes refers to more than one cell, and can even refer to several, non-contiguous ranges of cells. Specifically, the seventh rule's `ref` column has ``r rules$ref[7]``. Special treatment is needed here. Ideally, some kind of join function would be defined that can compare indidual cells with ranges. But I haven't written one, so what follows is a workaround. First, the two ranges of cells must be `unnested` into `A115` and `A121:122`. Then the range `A121:122` must be 'unranged' into `A121` and `A122`. ```{r} unrange <- function(x) { limits <- cellranger::as.cell_limits(x) rows <- seq(limits$ul[1], limits$lr[1]) cols <- seq(limits$ul[2], limits$lr[2]) rowcol <- expand.grid(rows, cols) cell_addrs <- cellranger::cell_addr(rowcol[[1]], rowcol[[2]]) cellranger::to_string(cell_addrs, fo = "A1", strict = FALSE) } unnest_ref <- function(x, ref) { UseMethod("unnest_ref") } unnest_ref.default <- function(x, ref_col = ref) { stopifnot(is.character(x), length(x) == 1L) refs <- unlist(strsplit(x, ",", fixed = TRUE)) unlist(lapply(refs, unrange)) } unrange("A121:A122") unnest_ref("A115,A121:A122") ``` The `unnest_ref()` function can also be defined for whole data frames, unnesting them by a column of references. ```{r} unnest_ref.data.frame <- function(x, ref_col) { ref <- rlang::enquo(ref_col) x[[rlang::quo_name(ref)]] <- lapply(x[[rlang::quo_name(ref)]], unnest_ref) tidyr::unnest(x, rlang::UQ(ref)) } (nested_rule <- slice(rules, 7)) unnest_ref(nested_rule, ref) ``` Finally the new data frame of rules can be joined to a data frame of cells in any of the usual ways, via the `sheet` and `ref` columns. Problems with this approach occur with rules that are defined over large ranges of cells: the 'unnesting' of those ranges results in very long vectors of individual cell addresses, or (worse) huge data frames of rules. Such cases are commonplace, because rules are often defined for entire columns of a spreadsheet, and a column has 1048576 rows.