---
title: "HTML Tables"
author: "Duncan Garmonsway"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{HTML Tables}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

This vignette for the [unpivotr](https://github.com/nacnudus/unpivotr) package
demonstrates unpivoting HTML tables of various kinds.

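The HTML files are in the package's `extdata` directory.  To get the path to
one of them, use, for example,
`system.file("extdata", "rowspan.html", package = "unpivotr")`.  The files used
here are `rowspan.html`, `colspan.html`, `row-and-colspan.html`, `nested.html`
and `url.html`.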

```{r, echo = TRUE}
library(dplyr)
library(rvest)
library(htmltools)
library(unpivotr)
```

## Rowspan and colspan examples

If a table has cells merged across rows or columns (or both), then `as_cells()`
does not fill the merged cell's contents across the rows or columns that it
spans.  This differs from other packages, e.g. `rvest`.  However, if merged
cells cause a table not to be rectangular, then `as_cells()` pads the missing
cells with blanks.
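
A small inline table may make the difference easier to see before turning to
the package's example files.  The sketch below uses a hypothetical two-row table
(not one of the shipped files) whose first cell spans both rows, and prints the
result of each package so that they can be compared side by side.

```{r, echo = TRUE}
library(rvest)
library(unpivotr)

# Hypothetical table: the first cell spans rows 1 and 2
merged <- '<table>
  <tr><td rowspan="2">A</td><td>B</td></tr>
  <tr><td>C</td></tr>
</table>'

doc <- read_html(merged)

# rvest returns one data frame per table
html_table(doc)[[1]]

# unpivotr returns one data frame of cells per table
merged_cells <- as_cells(doc)[[1]]
merged_cells
```

Each row of `merged_cells` describes one cell position, with its HTML in the
`html` column.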

### Rowspan

```{r, echo = TRUE}
rowspan <- system.file("extdata", "rowspan.html", package = "unpivotr")
includeHTML(rowspan)

# rvest
rowspan %>%
  read_html() %>%
  html_table()

# unpivotr
rowspan %>%
  read_html() %>%
  as_cells()
```

### Colspan

```{r, echo = TRUE}
colspan <- system.file("extdata", "colspan.html", package = "unpivotr")
includeHTML(colspan)

# rvest
colspan %>%
  read_html() %>%
  html_table()

# unpivotr
colspan %>%
  read_html() %>%
  as_cells()
```

### Both rowspan and colspan: non-square

```{r, echo = TRUE}
rowandcolspan <- system.file("extdata",
                             "row-and-colspan.html",
                             package = "unpivotr")
includeHTML(rowandcolspan)

# rvest
rowandcolspan %>%
  read_html() %>%
  html_table()

# unpivotr
rowandcolspan %>%
  read_html() %>%
  as_cells()
```

## Nested example

`as_cells()` never descends into cells.  If there is a table inside a cell, then
to parse that table, use `as_cells()` again on the HTML of that cell.

```{r, echo = TRUE}
nested <- system.file("extdata", "nested.html", package = "unpivotr")
includeHTML(nested)

# rvest parses both tables
nested %>%
  read_html() %>%
  html_table(fill = TRUE)

# unpivotr
x <-
  nested %>%
  read_html() %>%
  as_cells() %>%
  .[[1]]
x

# The HTML of the cell that contains the nested table
cell <-
  x %>%
  dplyr::filter(row == 2, col == 2) %>%
  .$html
cell

# Parsing the table inside the cell
cell %>%
  read_html() %>%
  as_cells()
```
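
When it is not known in advance which cells contain a table, one option is to
search the `html` column for a `<table` tag.  This is only a sketch: the inline
table and the names `outer_html`, `outer_cells`, `with_table` and `inner_cells`
are made up for illustration, and a plain text match would also catch the
string `<table` appearing in ordinary cell content.

```{r, echo = TRUE}
library(dplyr)
library(rvest)
library(unpivotr)

# Hypothetical outer table whose second cell contains another table
outer_html <- '<table>
  <tr>
    <td>outer</td>
    <td><table><tr><td>inner</td></tr></table></td>
  </tr>
</table>'

outer_cells <- as_cells(read_html(outer_html))[[1]]

# Keep only the cells whose HTML includes a <table> tag
with_table <- filter(outer_cells, grepl("<table", html, fixed = TRUE))

# Parse each nested table separately; as_cells() returns a list per call,
# so take the first element each time
inner_cells <- lapply(with_table$html,
                      function(h) as_cells(read_html(h))[[1]])
inner_cells
```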

## URL example

A motivation for using `unpivotr::as_cells()` is that it extracts more than
just text -- it can extract whatever part of the HTML you need.

Here, we extract URLs.

```{r, echo = TRUE}
urls <- system.file("extdata", "url.html", package = "unpivotr")
includeHTML(urls)

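# Extract the href attribute of every link in a cell's HTML
cell_url <- function(x) {
  if (is.na(x)) return(NA_character_)
  x %>%
    read_html() %>%
    html_nodes("a") %>%
    html_attr("href")
}

# Extract the visible text of every link in a cell's HTML
cell_text <- function(x) {
  if (is.na(x)) return(NA_character_)
  x %>%
    read_html() %>%
    html_nodes("a") %>%
    html_text()
}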

urls %>%
  read_html() %>%
  as_cells() %>%
  .[[1]] %>%
  mutate(text = purrr::map(html, cell_text),
         url = purrr::map(html, cell_url)) %>%
  tidyr::unnest(cols = c(text, url))
```
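
If a cell contained no link at all, `html_nodes("a")` would match nothing, the
helper would return a zero-length vector, and `tidyr::unnest()` would by default
drop that row.  The following is a sketch of one way to keep such rows, using a
hypothetical helper `cell_url_or_na()` that is not part of unpivotr or rvest.

```{r, echo = TRUE}
library(rvest)

# Hypothetical variant of cell_url(): always returns at least one value
cell_url_or_na <- function(x) {
  if (is.na(x)) return(NA_character_)
  href <- x %>%
    read_html() %>%
    html_nodes("a") %>%
    html_attr("href")
  # Fall back to NA when the cell has no <a> element
  if (length(href) == 0) NA_character_ else href
}

cell_url_or_na("<td><a href=\"https://example.com\">link</a></td>")
cell_url_or_na("<td>no link here</td>")
cell_url_or_na(NA)
```

The same guard could be applied to a text-extracting helper, so that both list
columns keep matching lengths when unnested together.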