--- title: "Introduction to unheadr" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to unheadr} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(unheadr) library(dplyr) library(tidyr) ``` When we work with other people's data, we often have to struggle through multiple unexpected steps before we get to a flexible, usable structure. Popular ways of structuring and presenting data can place content beyond the reach of code-based routines to tackle repetitive tasks efficiently. ## Package functions The functions in `unheadr` help us rework data shared by other people, from a human-readable structure to a tidier machine-readable structure on which we can perform common data manipulation tasks. ### Data frames and tibbles Here is how `unheadr` works with tibble and data frame objects that suffer from common issues such as: * Embedded subheaders * Broken values * Merged cells and values wrapped within columns * Broken rows * Broken headers **Embedded subheaders** are usually grouping variables embedded into another variable, used to show hierarchical data or create small multiples of data. A simple example would be a coffee shop menu: ```{r} dat <- data.frame( drink = c( "Cold Drinks", "Soda", "Water", "Juice", "Lemonade", "Hot Drinks", "Tea", "Coffee" ), price = c(NA, 2.99, 1.99, 3.15, 2, NA, 3.99, 1.99), stringsAsFactors = FALSE ) dat ``` The beverage type is embedded in the 'drinks' variable. If we can match them with regular expressions, we can move the grouping values into their own variable using `untangle2()` ```{r} untangle2(dat, "Drinks$", drink, "beverage_type") ``` **Broken values** usually happen when we're pressed for space. For whatever reason, the entries for the Barcelona 1992 and London 2012 Olympics are broken across two contiguous rows and NAs are used as padding in the other variables. ```{r} OGames <- tibble( Games = c("Los Angeles 1984", "Barcelona", "1992", "Atlanta 1996", "Sydney 2000", "London", "2012"), Country = c("USA", "Spain", NA, "USA", "Australia", "UK", NA), Soccer_gold_medal = c("France", "Spain", NA, "Nigeria", "Cameroon", "Mexico", NA) ) OGames ``` In this case, we can use `unbreak_vals()` to 'unbreak' the lines in the 'Games' variable, matching the strings that start with numbers. ```{r} OGames %>% unbreak_vals("^[0-9]", Games, Games_unbroken, slice_groups = TRUE) %>% select(Games_unbroken, everything()) ``` **Wrapped columns** often happen when we merge cells in spreadsheets or use table formatting in a word processor, resulting in empty or NA values used to pad all the vertical space. ```{r} knicks <- data.frame( stringsAsFactors = FALSE, player = c("Allan Houston", NA, "Latrell Sprewell", NA, NA), teams = c( "Pistons", "Knicks", "Warriors", "Knicks", "Timberwolves" ), position = c("Shooting guard", NA, "Small forward", NA, NA) ) knicks ``` We can unwrap the 'teams' values into a single string using `unrwap_cols()`. ```{r} knicks %>% unwrap_cols(groupingVar = player, separator = ", ") ``` This is more or less the opposite to `separate_rows()` from `tidyr`. **Line breaks** meant to separate values inside a merged cell can be problematic. This often results from parsing tables from PDF or Word documents. ```{r} merged_dat <- data.frame( stringsAsFactors = FALSE, Name = c("Valery V."), Asset = c("Apartment\n\nPlot\nHouse\nOther\nApartment"), Area = c("45.7\n\n69.2\n65.0\n32.2\n36.9\n\n\n"), Status = c("Owned\n\nOwned\nIn use\nUnknown\nOwned\n") ) merged_dat ``` To remove duplicated and trailing new line sequences "`\n`", we can use `squish_newlines()`. ```{r} merged_dat %>% mutate(across(c(Asset:Status), squish_newlines)) %>% tidyr::separate_rows(c(Asset:Status), sep = "\n") ``` In this example, after applying the function across the variables with problematic line breaks it was possible to separate the rows into a tidy structure. **Broken rows** have values of two contiguous rows broken up and padded with empty or NA values. ```{r} basketball <- data.frame( stringsAsFactors = FALSE, v1 = c( "Player", NA, "Sleve McDichael", "Dean Wesrey", "Karl Dandleton", "Mike Sernandez", "Glenallen Mixon", "Rey McSriff" ), v2 = c( "Most points", "in a game", "55", "43", "41", "111", "109", "104" ), v3 = c( "Season", "(year ending)", "2001", "2000", "2002", "2000", "2002", "2001" ) ) basketball ``` In this case, we can match any value in any variable along the row that has broken values. ```{r} unbreak_rows(basketball, "^Most", v2) ``` **Broken headers** are variable names broken up across the first few rows. ```{r} vehicles <- data.frame( stringsAsFactors = FALSE, Vehicle = c(NA, NA, NA, "Truck", "Motorcycle", "Sedan", "Van"), Price = c("in","2014", "(local currency)","50000","44000","33000","50000"), Color = c(NA, NA, NA, "White", "Black", "Red", "White"), Emissions = c("Certificate", NA, NA, "TRUE", "FALSE", "TRUE", "TRUE") ) vehicles ``` Here, the column names are broken. The top _three_ rows (in addition to the column name) contain fragments of the name and should be mashed together column-wise. The `mash_colnames()` function makes these many header rows into column names. The names are broken up across the top three rows, which goes in to the `n_name_rows` argument. Unlike other functions in `unheadr`, we provide the number of rows directly, rather than attempt any string matching. ```{r} mash_colnames(df= vehicles, n_name_rows = 3, keep_names = TRUE) ``` When importing data with broken headers into R, variable names are not always assigned from the values in top row, leaving us with automatically generated names (e.g. X1, X2, X3, etc.). ```{r} vehicles_skip <- data.frame( stringsAsFactors = FALSE, X1 = c("Vehicle",NA,NA,NA,"Truck", "Motorcycle","Sedan","Van"), X2 = c("Price","in","2014", "(local currency)","50000","44000","33000","50000"), X3 = c("Color", NA, NA, NA, "White", "Black", "Red", "White"), X4 = c("Emissions","Certificate",NA, NA,"TRUE","FALSE","TRUE","TRUE") ) vehicles_skip ``` In this case, the `keep_names` argument in `mash_colnames()` lets us ignore the object names when building new names from the first four rows of the data. ```{r} mash_colnames(df= vehicles_skip, n_name_rows = 4, keep_names = FALSE) ``` Lastly, the `sliding_headers` argument in `mash_colnames` can be used for tables with ragged names, in which not every column has a value in the very first row. In such cases, attribution by adjacency is assumed, and when `sliding_headers = TRUE` the names are filled row-wise. This can be useful for tables reporting survey data or experimental designs in an untidy manner. ```{r} survey <- data.frame( stringsAsFactors = FALSE, X1 = c("Participant", NA, "12", "34", "45", "123"), X2 = c("How did you hear about us?", "TV","TRUE","FALSE","FALSE","FALSE"), X3 = c(NA, "Social Media", "FALSE", "TRUE", "FALSE", "FALSE"), X4 = c(NA, "Radio", "FALSE", "TRUE", "FALSE", "TRUE"), X5 = c(NA, "Flyer", "FALSE", "FALSE", "FALSE", "FALSE"), X6 = c("Age", NA, "31", "23", "19", "24") ) survey mash_colnames(survey,2,keep_names = FALSE,sliding_headers = TRUE, sep = "_") ``` ### Spreadsheets `unheadr` also includes a function for flattening font and cell formatting in spreadsheet files into character strings in the corresponding cell. #### Supported formatting: - bold text - colored text - italic text, - text with strikethrough - underlined text - double underlined text - cell highlighting One of the example files bundled with `unheadr` looks like this: Font formatting and cell highlighting is being used to label an embedded grouping variable (meaningful formatting). The `annotate_mf()` function flattens the formatting into a character string describing the formatting. The hex8 code of the colors used for cell or text highlighting is also included. ```{r} example_spreadsheet <- system.file("extdata/dog_test.xlsx", package = "unheadr") annotate_mf(example_spreadsheet, orig = Task, new = Task_annotated) ``` To apply this approach to all cells in a spreadsheet, we call `annotate_mf_all()`. In this other bundled example file, negative values (first quarter losses) are indicated by bold. ```{r} example_spreadsheetb <- system.file("extdata/boutiques.xlsx", package = "unheadr") annotate_mf_all(example_spreadsheetb) ``` ### Character vectors Tables from PDF or other similar sources can often be imported into R as character vectors with one element for each line. These can then be parsed as fixed width files or separated into columns. `unheadr` now includes the `regex_valign()` function for aligning elements in these vectors vertically by inserting padding whitespace (and optional separators) at positions along each line matched by a regular expression. This example is based on how data on hotel guests (ID, State of Origin, and Date) in a PDF is parsed by `pdftools::pdf_text`. ```{r} guests <- unlist(strsplit(c("6 COAHUILA 20/03/2020 712 COAHUILA 20/03/2020"),"\n")) guests ``` There is inconsistent whitespace between the first and second data 'columns'. With a regular expression that matches a word boundary and uppercase letters, we can adjust the whitespace so that the matched positions are the same across lines. ```{r} regex_valign(guests, "\\b(?=[A-Z])") ``` This output is easier to parse with `readr` or other data-munging approaches. ## Further reading The underlying reasoning, background, and possible uses of `unheadr` are described in this publication: Verde Arregoitia, L. D., Cooper, N., D'Elía, G. (2018). Good practices for sharing analysis-ready data in mammalogy and biodiversity research. _Hystrix, the Italian Journal of Mammalogy_, 29(2), 155-161. [Open Access, 10.4404/hystrix-00133-2018](https://doi.org/10.4404/hystrix-00133-2018). \doi{10.4404/hystrix-00133-2018}