Package 'unheadr' reference manual

Title:	Handle Data with Messy Header Rows and Broken Values
Description:	Verb-like functions to work with messy data, often derived from spreadsheets or parsed PDF tables. Includes functions for unwrapping values broken up across rows, relocating embedded grouping values, and to annotate meaningful formatting in spreadsheet files.
Authors:	Luis D. Verde Arregoitia [aut, cre]
Maintainer:	Luis D. Verde Arregoitia <[email protected]>
License:	MIT + file LICENSE
Version:	0.4.0
Built:	2025-03-09 03:41:32 UTC
Source:	https://github.com/luisdva/unheadr

Annotate meaningful formatting

Description

Turns cell formatting into annotations for values in the target variable.

Usage

annotate_mf(xlfilepath, orig, new)
annotate_mf(xlfilepath, orig, new)

Arguments

`xlfilepath`	Path to a single-sheet spreadsheet file (xls or xlsx).
`orig`	Target variable to annotate formatting in.
`new`	Name of new variable with cell formatting pasted as a string.

Details

Seven popular approaches for meaningful formatting (bold, colored text, italic, strikethrough, underline, double underline, and cell highlighting) are hardcoded in the function. sheets, skip, and range arguments for spreadsheet input are not supported. The hex8 code of the fill color used for text color and cell highlighting is also appended in the output. Ensure the data in the spreadsheet are rectangular before running; this includes blank but formatted cells beyond the data rectangle.

Value

A tibble with a new column in which the meaningful formatting is embedded as text.

Examples

example_spreadsheet <- system.file("extdata/dog_test.xlsx", package = "unheadr")
annotate_mf(example_spreadsheet, orig = Task, new = Task_annotated)
example_spreadsheet <- system.file("extdata/dog_test.xlsx", package = "unheadr")
annotate_mf(example_spreadsheet, orig = Task, new = Task_annotated)

Annotate meaningful formatting for all cells

Description

Turns cell formatting into annotations for all values across all variables.

Usage

annotate_mf_all(xlfilepath)
annotate_mf_all(xlfilepath)

Arguments

xlfilepath

Path to a single-sheet spreadsheet file (xls or xlsx).

Details

Seven popular approaches for meaningful formatting (bold, colored text, italic, strikethrough, underline, double underline, and cell highlighting) are supported in this function. sheets, skip, and range arguments for spreadsheet input are not supported. The hex8 code of the fill color used for text color and cell highlighting is also appended in the output. Ensure the data in the spreadsheet are rectangular before running; this includes blank but formatted cells beyond the data rectangle.

Value

A tibble with meaningful formatting embedded as text for all rows and columns.

Examples

example_spreadsheet <- system.file("extdata/boutiques.xlsx", package = "unheadr")
annotate_mf_all(example_spreadsheet)
example_spreadsheet <- system.file("extdata/boutiques.xlsx", package = "unheadr")
annotate_mf_all(example_spreadsheet)

Statistics for game units in Age of Empires II: Definitive Edition

Description

A dataset with the numerical values that determine the behavior and performance of selected military units available in AoE2:DE (July 2020 Game Update).

Usage

AOEunits
AOEunits

Format

A data frame with 128 observations of 19 variables:

unit: Unit name
building: Building in which each unit is trained
type: Unit class
age: Age at which the unit becomes trainable
cost_wood: Unit cost in Wood
cost_food: Unit cost in Food
cost_gold: Unit cost in Gold
build_time: Training time in seconds
rate_of_fire: Attack speed
attack_delay: Retasking time
movement_speed: Travel speed on land
line_of_sight: Vision over the surrounding area
hit_points: Unit health
min_range: Minimum attacking range for ranged units
range: Maximum attacking range for ranged units
damage: Damage inflicted per attack
accuracy: Chance that an attack will be on target
melee_armor: Armor against melee attacks
pierce_armor: Armor against projectiles

Source

Age of Empires II. Copyright Microsoft Corporation. This dataset was created under Microsoft's Usage rules for Digital Goods using assets from Age of Empires II, and it is not endorsed by or affiliated with Microsoft. All information shown is an interpretation of data collected in-game with no guarantee on the accuracy of any of the data presented.

Statistics for game units in Age of Empires II: Definitive Edition in a messy presentation

Description

A messy version of the AOEunits dataset, meant for demonstrating data cleaning functions.

Usage

AOEunits_raw
AOEunits_raw

Format

A data frame with 139 observations of 15 variables. See AOEunits for variable descriptions.

Source

boutiques.xlsx spreadsheet

Description

Open XML Format Spreadsheet with 1 sheet, 6 columns, and 8 rows. Toy dataset with Q1 profits for different store locations. Additional information is encoded as meaningful formatting. Bold indicates losses (negative values), colors indicate continent, and italic indicates a second location in the same city.

Details

This data is used in the example for annotate_mf_all().

dog_test.xlsx spreadsheet

Description

Open XML Format Spreadsheet with 1 sheet, 2 columns, and 12 rows. Items describe various tasks or behaviors that dogs can be evaluated on, assigned into three categories which appear along with their average scores as embedded subheaders with meaningful formatting.

Details

This data is used in the example for annotate_mf().

Source

Items are modified from the checklist written by Junior Watson.

References

http://www.dogtrainingbasics.com/checklist-well-behaved-dog/

Make many header rows into column names

Description

Make many header rows into column names

Usage

mash_colnames(
  df,
  n_name_rows,
  keep_names = TRUE,
  sliding_headers = FALSE,
  sep = "_"
)
mash_colnames(
  df,
  n_name_rows,
  keep_names = TRUE,
  sliding_headers = FALSE,
  sep = "_"
)

Arguments

`df`	A `data.frame` or `tibble` object in which the names are broken up across the top n rows.
`n_name_rows`	Number of rows at the top of the data to be used to create the new variable (column) names. Must be >= 1.
`keep_names`	If TRUE, existing names will be included when building the new variable names. Defaults to TRUE.
`sliding_headers`	If TRUE, empty values in the first (topmost) header header row be filled column-wise. Defaults to FALSE. See details.
`sep`	Character string to separate the unified values (default is underscore).

Details

Tables are often shared with the column names broken up across the first few rows. This function takes the number of rows at the top of a table that hold the broken up names and whether or not to include the names, and mashes the values column-wise into a single string for each column. The keep_names argument can be helpful for tables we imported using a skip argument. If keep_names is set to FALSE, adjust the value of n_name_rows accordingly.

This function will throw a warning when possible NA values end up in the variable names. sliding_headers can be used for tables with ragged names in which not every column has a value in the very first row. In these cases attribution by adjacency is assumed, and when sliding_headers is set to TRUE the names in the topmost row are filled row-wise. This can be useful for tables reporting survey data or experimental designs in an untidy manner.

Value

The original data frame, but with new column names and without the top n rows that held the broken up names.

Author(s)

This function was originally contributed by Jarrett Byrnes through a GitHub issue.

Examples

babies <-
  data.frame(
    stringsAsFactors = FALSE,
    Baby = c(NA, NA, "Angie", "Yean", "Pierre"),
    Age = c("in", "months", "11", "9", "7"),
    Weight = c("kg", NA, "2", "3", "4"),
    Ward = c(NA, NA, "A", "B", "C")
  )
# Including the object names
mash_colnames(babies, n_name_rows = 2, keep_names = TRUE)

babies_skip <-
  data.frame(
    stringsAsFactors = FALSE,
    X1 = c("Baby", NA, NA, "Jennie", "Yean", "Pierre"),
    X2 = c("Age", "in", "months", "11", "9", "7"),
    X3 = c("Hospital", NA, NA, "A", "B", "A")
  )
#' # Discarding the automatically-generated names (X1, X2, etc...)
mash_colnames(babies_skip, n_name_rows = 3, keep_names = FALSE)

fish_experiment <-
  data.frame(
    stringsAsFactors = FALSE,
    X1 = c("Sample", NA, "Pacific", "Atlantic", "Freshwater"),
    X2 = c("Larvae", "Control", "12", "11", "10"),
    X3 = c(NA, "Low Dose", "11", "12", "8"),
    X4 = c(NA, "High Dose", "8", "7", "9"),
    X5 = c("Adult", "Control", "13", "13", "8"),
    X6 = c(NA, "Low Dose", "13", "12", "7"),
    X7 = c(NA, "High Dose", "10", "10", "9")
  )
# Ragged names
mash_colnames(fish_experiment,
  n_name_rows = 2,
  keep_names = FALSE, sliding_headers = TRUE
)
babies <-
  data.frame(
    stringsAsFactors = FALSE,
    Baby = c(NA, NA, "Angie", "Yean", "Pierre"),
    Age = c("in", "months", "11", "9", "7"),
    Weight = c("kg", NA, "2", "3", "4"),
    Ward = c(NA, NA, "A", "B", "C")
  )
# Including the object names
mash_colnames(babies, n_name_rows = 2, keep_names = TRUE)

babies_skip <-
  data.frame(
    stringsAsFactors = FALSE,
    X1 = c("Baby", NA, NA, "Jennie", "Yean", "Pierre"),
    X2 = c("Age", "in", "months", "11", "9", "7"),
    X3 = c("Hospital", NA, NA, "A", "B", "A")
  )
#' # Discarding the automatically-generated names (X1, X2, etc...)
mash_colnames(babies_skip, n_name_rows = 3, keep_names = FALSE)

fish_experiment <-
  data.frame(
    stringsAsFactors = FALSE,
    X1 = c("Sample", NA, "Pacific", "Atlantic", "Freshwater"),
    X2 = c("Larvae", "Control", "12", "11", "10"),
    X3 = c(NA, "Low Dose", "11", "12", "8"),
    X4 = c(NA, "High Dose", "8", "7", "9"),
    X5 = c("Adult", "Control", "13", "13", "8"),
    X6 = c(NA, "Low Dose", "13", "12", "7"),
    X7 = c(NA, "High Dose", "10", "10", "9")
  )
# Ragged names
mash_colnames(fish_experiment,
  n_name_rows = 2,
  keep_names = FALSE, sliding_headers = TRUE
)

Comparative data for 54 species of primates

Description

A dataset with embedded subheaders.

Usage

primates2017
primates2017

Format

A data frame with 69 rows and 4 variables:

scientific_name: scientific names, with geographic region and taxonomic family embedded as subheaders.
common_name: vernacular name
red_list_status: IUCN Red List Status in January 2017
mass_kg: mean body mass in kilograms

Source

Estrada, Alejandro, et al. "Impending extinction crisis of the world's primates: Why primates matter." Science Advances 3.1 (2017): e1600946. doi:10.1126/sciadv.1600946

Comparative data for 16 species of primates with some broken values

Description

A dataset with embedded subheaders and some values (T. obscurus, T. leucocephalus and N. bengalensis) in the scientific_names variable broken up across two rows (typically done to fit the content in a table).

Usage

primates2017_broken
primates2017_broken

Format

A data frame with 19 rows and 4 variables:

scientific_name: scientific names, with embedded subheaders for geographic region and taxonomic family and broken values
common_name: vernacular name
red_list_status: IUCN Red List Status in January 2017
mass_kg: mean body mass in kilograms

Source

Estrada, Alejandro, et al. "Impending extinction crisis of the world's primates: Why primates matter." Science Advances 3.1 (2017): e1600946. doi:10.1126/sciadv.1600946

Comparative data for two species of primates

Description

A dataset in which the elements for some of the values are in separate rows'

Usage

primates2017_wrapped
primates2017_wrapped

Format

A data frame with 9 rows and 6 variables:

scientific_name: scientific names, see reference
common_name: vernacular name
habitat: habitat types listed in the IUCN Red List assessments
red_list_status: IUCN Red List Status in January 2017
mass_kg: mean body mass in kilograms
country: Countries where the species is present, from IUCN Red List assessments

Source

Estrada, Alejandro, et al. "Impending extinction crisis of the world's primates: Why primates matter." Science Advances 3.1 (2017): e1600946. doi:10.1126/sciadv.1600946

Vertical character string alignment through regular expressions

Description

Aligning strings with regex.

Usage

regex_valign(stringvec, regex_ai, sep_str = "")
regex_valign(stringvec, regex_ai, sep_str = "")

Arguments

`stringvec`	A character vector with one element for each line.
`regex_ai`	A regular expression matching the position for alignment.
`sep_str`	Optional character vector that will be inserted at the positions matched by the regular expression.

Details

Written mainly for reading fixed width files, text, or tables parsed from PDFs.

Value

A character vector with one element for each line, with padding inserted at the matched positions so that elements are vertically aligned across lines.

Examples

guests <-
  unlist(strsplit(c("6       COAHUILA        20/03/2020
7       COAHUILA             20/03/2020
18 BAJA CALIFORNIA     16/03/2020
109       CDMX      12/03/2020
1230   QUERETARO       21/03/2020"), "\n"))

# align at first uppercase word boundary , inserting a separator
regex_valign(guests, "\\b(?=[A-Z])", " - ")
# align dates at end of string
regex_valign(guests, "\\b(?=[0-9]{2}[\\/]{1}[0-9]{2}[\\/]{1}[0-9]{4}$)")
guests <-
  unlist(strsplit(c("6       COAHUILA        20/03/2020
7       COAHUILA             20/03/2020
18 BAJA CALIFORNIA     16/03/2020
109       CDMX      12/03/2020
1230   QUERETARO       21/03/2020"), "\n"))

# align at first uppercase word boundary , inserting a separator
regex_valign(guests, "\\b(?=[A-Z])", " - ")
# align dates at end of string
regex_valign(guests, "\\b(?=[0-9]{2}[\\/]{1}[0-9]{2}[\\/]{1}[0-9]{4}$)")

Deduplicate and remove trailing line breaks

Description

Deduplicate and remove trailing line breaks

Usage

squish_newlines(sepstring)
squish_newlines(sepstring)

Arguments

sepstring

A character vector with new line control characters.

Details

Useful for tables with merged cells, often imported from Word or PDF files. Can be applied across multiple columns before separating into rows.

Value

A vector without trailing or multiple consecutive new line sequences.

Examples

vecWithNewlines <- c("dog\n\ncat\n\n\npig\n")
squish_newlines(vecWithNewlines)

vecWithNewlines <- c("dog\n\ncat\n\n\npig\n")
squish_newlines(vecWithNewlines)

Merge rows up

Description

Merge rows up

Usage

unbreak_rows(df, regex, ogcol, sep = " ")
unbreak_rows(df, regex, ogcol, sep = " ")

Arguments

`df`	A data frame with at least two contiguous rows to be merged.
`regex`	A regular expression to identify sets of rows to be merged, meant for the leading of the two contiguous rows.
`ogcol`	Variable with the text strings to match.
`sep`	Character string to separate the unified values (default is space).

Details

This function recodes empty strings ("") to NA for smoother pattern matching.

Value

A tibble or data frame with merged rows. Values of the lagging rows are pasted onto the values in the leading row, whitespace is squished, and the lagging row is dropped.

Examples

bball <-
  data.frame(
    stringsAsFactors = FALSE,
    v1 = c(
      "Player", NA, "Sleve McDichael", "Dean Wesrey",
      "Karl Dandleton"
    ),
    v2 = c("Most points", "in a game", "55", "43", "41"),
    v3 = c("Season", "(year ending)", "2001", "2000", "2010")
  )
unbreak_rows(bball, "Most", v2)
bball <-
  data.frame(
    stringsAsFactors = FALSE,
    v1 = c(
      "Player", NA, "Sleve McDichael", "Dean Wesrey",
      "Karl Dandleton"
    ),
    v2 = c("Most points", "in a game", "55", "43", "41"),
    v3 = c("Season", "(year ending)", "2001", "2000", "2010")
  )
unbreak_rows(bball, "Most", v2)

Unbreak values using regex to match the lagging half of the broken value

Description

Unbreak values using regex to match the lagging half of the broken value

Usage

unbreak_vals(df, regex, ogcol, newcol, sep = " ", slice_groups)
unbreak_vals(df, regex, ogcol, newcol, sep = " ", slice_groups)

Arguments

`df`	A data frame with one or more values within a variable broken up across two rows.
`regex`	Regular expression for matching the trailing (lagging) half of the broken values.
`ogcol`	Variable to unbreak.
`newcol`	Name of the new variable with the unified values.
`sep`	Character string to separate the unified values (default is space).
`slice_groups`	Deprecated. See details and Package News.

Details

This function is limited to quite specific cases, but useful when dealing with tables that contain, for example, scientific names broken across two rows. For unwrapping values, see unwrap_cols.

Value

A tibble with 'unbroken' values. The variable that originally contained the broken values gets dropped, and the new variable with the unified values is placed as the first column. The slice_groups argument is now deprecated; the extra rows and the variable with broken values will be dropped.

Examples

data(primates2017_broken)
# regex matches strings starting in lowercase (broken species epithets)
unbreak_vals(primates2017_broken, "^[a-z]", scientific_name, sciname_new)
data(primates2017_broken)
# regex matches strings starting in lowercase (broken species epithets)
unbreak_vals(primates2017_broken, "^[a-z]", scientific_name, sciname_new)

Rectangling embedded subheaders

Description

Rectangling embedded subheaders

Usage

untangle2(df, regex, orig, new)
untangle2(df, regex, orig, new)

Arguments

`df`	A data frame with embedded subheaders.
`regex`	Regular expression to match the subheaders.
`orig`	Variable containing the extraneous subheaders.
`new`	Name of variable that will contain the group values.

Details

Special thanks to Jenny Bryan for fixing the initial tidyeval code and overall function structure.

Value

A tibble without the matched subheaders and a new variable containing the grouping data.

Examples

data(primates2017)
# put taxonomic family in its own variable (matches the suffix "DAE")
untangle2(primates2017, "DAE$", scientific_name, family)
# put geographic regions in their own variable (matching them all by name)
untangle2(
  primates2017, "Asia|Madagascar|Mainland Africa|Neotropics",
  scientific_name, family
)
# with magrittr pipes (re-exported in this package)
primates2017 %>%
  untangle2("DAE$", scientific_name, family) %>%
  untangle2(
    "Asia|Madagascar|Mainland Africa|Neotropics",
    scientific_name, region
  )
data(primates2017)
# put taxonomic family in its own variable (matches the suffix "DAE")
untangle2(primates2017, "DAE$", scientific_name, family)
# put geographic regions in their own variable (matching them all by name)
untangle2(
  primates2017, "Asia|Madagascar|Mainland Africa|Neotropics",
  scientific_name, family
)
# with magrittr pipes (re-exported in this package)
primates2017 %>%
  untangle2("DAE$", scientific_name, family) %>%
  untangle2(
    "Asia|Madagascar|Mainland Africa|Neotropics",
    scientific_name, region
  )

Unwrap values and clean up NAs used as padding

Description

Unwrap values and clean up NAs used as padding

Usage

unwrap_cols(df, groupingVar, separator)
unwrap_cols(df, groupingVar, separator)

Arguments

`df`	A data frame with wrapped values and an inconsistent number of NA values used to as within-group padding.
`groupingVar`	Name of the variable describing the observational units.
`separator`	Character string defining the separator that will delimit the elements of the unwrapped value.

Details

This is roughly the opposite of tidyr::separate_rows().

Value

A summarized tibble. Order is preserved in the grouping variable by making it a factor.

Examples

data(primates2017_wrapped)
# using commas to separate elements
unwrap_cols(primates2017_wrapped, scientific_name, ", ")

# separating with semicolons
df <- data.frame(
  ounits = c("A", NA, "B", "C", "D", NA),
  vals = c(1, 2, 2, 3, 1, 3)
)
unwrap_cols(df, ounits, ";")
data(primates2017_wrapped)
# using commas to separate elements
unwrap_cols(primates2017_wrapped, scientific_name, ", ")

# separating with semicolons
df <- data.frame(
  ounits = c("A", NA, "B", "C", "D", NA),
  vals = c(1, 2, 2, 3, 1, 3)
)
unwrap_cols(df, ounits, ";")

Package 'unheadr'

Help Index

Annotate meaningful formatting

Description

Usage

Arguments

Details

Value

Examples

Annotate meaningful formatting for all cells

Description

Usage

Arguments

Details

Value

Examples

Statistics for game units in Age of Empires II: Definitive Edition

Description

Usage

Format

Source

Statistics for game units in Age of Empires II: Definitive Edition in a messy presentation

Description

Usage

Format

Source

boutiques.xlsx spreadsheet

Description

Details

dog_test.xlsx spreadsheet

Description

Details

Source

References

Make many header rows into column names

Description

Usage

Arguments

Details

Value

Author(s)

Examples

Comparative data for 54 species of primates

Description

Usage

Format

Source

Comparative data for 16 species of primates with some broken values

Description

Usage

Format

Source

Comparative data for two species of primates

Description

Usage

Format

Source

Vertical character string alignment through regular expressions

Description

Usage

Arguments

Details

Value

See Also

Examples

Deduplicate and remove trailing line breaks

Description

Usage

Arguments

Details

Value

Examples

Merge rows up

Description

Usage

Arguments

Details

Value

Examples

Unbreak values using regex to match the lagging half of the broken value