Title: | Handle Data with Messy Header Rows and Broken Values |
---|---|
Description: | Verb-like functions to work with messy data, often derived from spreadsheets or parsed PDF tables. Includes functions for unwrapping values broken up across rows, relocating embedded grouping values, and to annotate meaningful formatting in spreadsheet files. |
Authors: | Luis D. Verde Arregoitia [aut, cre] |
Maintainer: | Luis D. Verde Arregoitia <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.4.0 |
Built: | 2024-11-09 04:23:22 UTC |
Source: | https://github.com/luisdva/unheadr |
Turns cell formatting into annotations for values in the target variable.
annotate_mf(xlfilepath, orig, new)
annotate_mf(xlfilepath, orig, new)
xlfilepath |
Path to a single-sheet spreadsheet file (xls or xlsx). |
orig |
Target variable to annotate formatting in. |
new |
Name of new variable with cell formatting pasted as a string. |
Seven popular approaches for meaningful formatting (bold, colored
text, italic, strikethrough, underline, double underline, and cell
highlighting) are hardcoded in the function. sheets
, skip
, and range
arguments for spreadsheet input are not supported. The hex8 code of the
fill color used for text color and cell highlighting is also appended in
the output. Ensure the data in the spreadsheet are rectangular before
running; this includes blank but formatted cells beyond the data rectangle.
A tibble with a new column in which the meaningful formatting is embedded as text.
example_spreadsheet <- system.file("extdata/dog_test.xlsx", package = "unheadr") annotate_mf(example_spreadsheet, orig = Task, new = Task_annotated)
example_spreadsheet <- system.file("extdata/dog_test.xlsx", package = "unheadr") annotate_mf(example_spreadsheet, orig = Task, new = Task_annotated)
Turns cell formatting into annotations for all values across all variables.
annotate_mf_all(xlfilepath)
annotate_mf_all(xlfilepath)
xlfilepath |
Path to a single-sheet spreadsheet file (xls or xlsx). |
Seven popular approaches for meaningful formatting (bold, colored
text, italic, strikethrough, underline, double underline, and cell
highlighting) are supported in this function. sheets
, skip
, and range
arguments for spreadsheet input are not supported. The hex8 code of the
fill color used for text color and cell highlighting is also appended in
the output. Ensure the data in the spreadsheet are rectangular before
running; this includes blank but formatted cells beyond the data rectangle.
A tibble with meaningful formatting embedded as text for all rows and columns.
example_spreadsheet <- system.file("extdata/boutiques.xlsx", package = "unheadr") annotate_mf_all(example_spreadsheet)
example_spreadsheet <- system.file("extdata/boutiques.xlsx", package = "unheadr") annotate_mf_all(example_spreadsheet)
A dataset with the numerical values that determine the behavior and performance of selected military units available in AoE2:DE (July 2020 Game Update).
AOEunits
AOEunits
A data frame with 128 observations of 19 variables:
Unit name
Building in which each unit is trained
Unit class
Age at which the unit becomes trainable
Unit cost in Wood
Unit cost in Food
Unit cost in Gold
Training time in seconds
Attack speed
Retasking time
Travel speed on land
Vision over the surrounding area
Unit health
Minimum attacking range for ranged units
Maximum attacking range for ranged units
Damage inflicted per attack
Chance that an attack will be on target
Armor against melee attacks
Armor against projectiles
Age of Empires II. Copyright Microsoft Corporation. This dataset was created under Microsoft's Usage rules for Digital Goods using assets from Age of Empires II, and it is not endorsed by or affiliated with Microsoft. All information shown is an interpretation of data collected in-game with no guarantee on the accuracy of any of the data presented.
A messy version of the AOEunits
dataset, meant for demonstrating data cleaning functions.
AOEunits_raw
AOEunits_raw
A data frame with 139 observations of 15 variables. See AOEunits
for variable descriptions.
Age of Empires II. Copyright Microsoft Corporation. This dataset was created under Microsoft's Usage rules for Digital Goods using assets from Age of Empires II, and it is not endorsed by or affiliated with Microsoft. All information shown is an interpretation of data collected in-game with no guarantee on the accuracy of any of the data presented.
Open XML Format Spreadsheet with 1 sheet, 6 columns, and 8 rows. Toy dataset with Q1 profits for different store locations. Additional information is encoded as meaningful formatting. Bold indicates losses (negative values), colors indicate continent, and italic indicates a second location in the same city.
This data is used in the example for annotate_mf_all()
.
Open XML Format Spreadsheet with 1 sheet, 2 columns, and 12 rows. Items describe various tasks or behaviors that dogs can be evaluated on, assigned into three categories which appear along with their average scores as embedded subheaders with meaningful formatting.
This data is used in the example for annotate_mf()
.
Items are modified from the checklist written by Junior Watson.
http://www.dogtrainingbasics.com/checklist-well-behaved-dog/
Make many header rows into column names
mash_colnames( df, n_name_rows, keep_names = TRUE, sliding_headers = FALSE, sep = "_" )
mash_colnames( df, n_name_rows, keep_names = TRUE, sliding_headers = FALSE, sep = "_" )
df |
A |
n_name_rows |
Number of rows at the top of the data to be used to create the new variable (column) names. Must be >= 1. |
keep_names |
If TRUE, existing names will be included when building the new variable names. Defaults to TRUE. |
sliding_headers |
If TRUE, empty values in the first (topmost) header header row be filled column-wise. Defaults to FALSE. See details. |
sep |
Character string to separate the unified values (default is underscore). |
Tables are often shared with the column names broken up across the
first few rows. This function takes the number of rows at the top of a
table that hold the broken up names and whether or not to include the
names, and mashes the values column-wise into a single string for each
column. The keep_names
argument can be helpful for tables we
imported using a skip
argument. If keep_names
is set to FALSE
,
adjust the value of n_name_rows
accordingly.
This function will throw a warning when possible NA
values end up in the
variable names. sliding_headers
can be used for tables with ragged
names in which not every column has a value in the very first row. In these
cases attribution by adjacency is assumed, and when sliding_headers
is set to TRUE
the names in the topmost row are filled row-wise. This can
be useful for tables reporting survey data or experimental designs in an
untidy manner.
The original data frame, but with new column names and without the top n rows that held the broken up names.
This function was originally contributed by Jarrett Byrnes through a GitHub issue.
babies <- data.frame( stringsAsFactors = FALSE, Baby = c(NA, NA, "Angie", "Yean", "Pierre"), Age = c("in", "months", "11", "9", "7"), Weight = c("kg", NA, "2", "3", "4"), Ward = c(NA, NA, "A", "B", "C") ) # Including the object names mash_colnames(babies, n_name_rows = 2, keep_names = TRUE) babies_skip <- data.frame( stringsAsFactors = FALSE, X1 = c("Baby", NA, NA, "Jennie", "Yean", "Pierre"), X2 = c("Age", "in", "months", "11", "9", "7"), X3 = c("Hospital", NA, NA, "A", "B", "A") ) #' # Discarding the automatically-generated names (X1, X2, etc...) mash_colnames(babies_skip, n_name_rows = 3, keep_names = FALSE) fish_experiment <- data.frame( stringsAsFactors = FALSE, X1 = c("Sample", NA, "Pacific", "Atlantic", "Freshwater"), X2 = c("Larvae", "Control", "12", "11", "10"), X3 = c(NA, "Low Dose", "11", "12", "8"), X4 = c(NA, "High Dose", "8", "7", "9"), X5 = c("Adult", "Control", "13", "13", "8"), X6 = c(NA, "Low Dose", "13", "12", "7"), X7 = c(NA, "High Dose", "10", "10", "9") ) # Ragged names mash_colnames(fish_experiment, n_name_rows = 2, keep_names = FALSE, sliding_headers = TRUE )
babies <- data.frame( stringsAsFactors = FALSE, Baby = c(NA, NA, "Angie", "Yean", "Pierre"), Age = c("in", "months", "11", "9", "7"), Weight = c("kg", NA, "2", "3", "4"), Ward = c(NA, NA, "A", "B", "C") ) # Including the object names mash_colnames(babies, n_name_rows = 2, keep_names = TRUE) babies_skip <- data.frame( stringsAsFactors = FALSE, X1 = c("Baby", NA, NA, "Jennie", "Yean", "Pierre"), X2 = c("Age", "in", "months", "11", "9", "7"), X3 = c("Hospital", NA, NA, "A", "B", "A") ) #' # Discarding the automatically-generated names (X1, X2, etc...) mash_colnames(babies_skip, n_name_rows = 3, keep_names = FALSE) fish_experiment <- data.frame( stringsAsFactors = FALSE, X1 = c("Sample", NA, "Pacific", "Atlantic", "Freshwater"), X2 = c("Larvae", "Control", "12", "11", "10"), X3 = c(NA, "Low Dose", "11", "12", "8"), X4 = c(NA, "High Dose", "8", "7", "9"), X5 = c("Adult", "Control", "13", "13", "8"), X6 = c(NA, "Low Dose", "13", "12", "7"), X7 = c(NA, "High Dose", "10", "10", "9") ) # Ragged names mash_colnames(fish_experiment, n_name_rows = 2, keep_names = FALSE, sliding_headers = TRUE )
A dataset with embedded subheaders.
primates2017
primates2017
A data frame with 69 rows and 4 variables:
scientific names, with geographic region and taxonomic family embedded as subheaders.
vernacular name
IUCN Red List Status in January 2017
mean body mass in kilograms
Estrada, Alejandro, et al. "Impending extinction crisis of the world's primates: Why primates matter." Science Advances 3.1 (2017): e1600946. doi:10.1126/sciadv.1600946
A dataset with embedded subheaders and some values (T. obscurus, T. leucocephalus and N. bengalensis) in the scientific_names variable broken up across two rows (typically done to fit the content in a table).
primates2017_broken
primates2017_broken
A data frame with 19 rows and 4 variables:
scientific names, with embedded subheaders for geographic region and taxonomic family and broken values
vernacular name
IUCN Red List Status in January 2017
mean body mass in kilograms
Estrada, Alejandro, et al. "Impending extinction crisis of the world's primates: Why primates matter." Science Advances 3.1 (2017): e1600946. doi:10.1126/sciadv.1600946
A dataset in which the elements for some of the values are in separate rows'
primates2017_wrapped
primates2017_wrapped
A data frame with 9 rows and 6 variables:
scientific names, see reference
vernacular name
habitat types listed in the IUCN Red List assessments
IUCN Red List Status in January 2017
mean body mass in kilograms
Countries where the species is present, from IUCN Red List assessments
Estrada, Alejandro, et al. "Impending extinction crisis of the world's primates: Why primates matter." Science Advances 3.1 (2017): e1600946. doi:10.1126/sciadv.1600946
Aligning strings with regex.
regex_valign(stringvec, regex_ai, sep_str = "")
regex_valign(stringvec, regex_ai, sep_str = "")
stringvec |
A character vector with one element for each line. |
regex_ai |
A regular expression matching the position for alignment. |
sep_str |
Optional character vector that will be inserted at the positions matched by the regular expression. |
Written mainly for reading fixed width files, text, or tables parsed from PDFs.
A character vector with one element for each line, with padding inserted at the matched positions so that elements are vertically aligned across lines.
This function is based loosely on
textutils::valign()
.
guests <- unlist(strsplit(c("6 COAHUILA 20/03/2020 7 COAHUILA 20/03/2020 18 BAJA CALIFORNIA 16/03/2020 109 CDMX 12/03/2020 1230 QUERETARO 21/03/2020"), "\n")) # align at first uppercase word boundary , inserting a separator regex_valign(guests, "\\b(?=[A-Z])", " - ") # align dates at end of string regex_valign(guests, "\\b(?=[0-9]{2}[\\/]{1}[0-9]{2}[\\/]{1}[0-9]{4}$)")
guests <- unlist(strsplit(c("6 COAHUILA 20/03/2020 7 COAHUILA 20/03/2020 18 BAJA CALIFORNIA 16/03/2020 109 CDMX 12/03/2020 1230 QUERETARO 21/03/2020"), "\n")) # align at first uppercase word boundary , inserting a separator regex_valign(guests, "\\b(?=[A-Z])", " - ") # align dates at end of string regex_valign(guests, "\\b(?=[0-9]{2}[\\/]{1}[0-9]{2}[\\/]{1}[0-9]{4}$)")
Deduplicate and remove trailing line breaks
squish_newlines(sepstring)
squish_newlines(sepstring)
sepstring |
A character vector with new line control characters. |
Useful for tables with merged cells, often imported from Word or PDF files. Can be applied across multiple columns before separating into rows.
A vector without trailing or multiple consecutive new line sequences.
vecWithNewlines <- c("dog\n\ncat\n\n\npig\n") squish_newlines(vecWithNewlines)
vecWithNewlines <- c("dog\n\ncat\n\n\npig\n") squish_newlines(vecWithNewlines)
Merge rows up
unbreak_rows(df, regex, ogcol, sep = " ")
unbreak_rows(df, regex, ogcol, sep = " ")
df |
A data frame with at least two contiguous rows to be merged. |
regex |
A regular expression to identify sets of rows to be merged, meant for the leading of the two contiguous rows. |
ogcol |
Variable with the text strings to match. |
sep |
Character string to separate the unified values (default is space). |
This function recodes empty strings ("") to NA
for smoother pattern matching.
A tibble or data frame with merged rows. Values of the lagging rows are pasted onto the values in the leading row, whitespace is squished, and the lagging row is dropped.
bball <- data.frame( stringsAsFactors = FALSE, v1 = c( "Player", NA, "Sleve McDichael", "Dean Wesrey", "Karl Dandleton" ), v2 = c("Most points", "in a game", "55", "43", "41"), v3 = c("Season", "(year ending)", "2001", "2000", "2010") ) unbreak_rows(bball, "Most", v2)
bball <- data.frame( stringsAsFactors = FALSE, v1 = c( "Player", NA, "Sleve McDichael", "Dean Wesrey", "Karl Dandleton" ), v2 = c("Most points", "in a game", "55", "43", "41"), v3 = c("Season", "(year ending)", "2001", "2000", "2010") ) unbreak_rows(bball, "Most", v2)
Unbreak values using regex to match the lagging half of the broken value
unbreak_vals(df, regex, ogcol, newcol, sep = " ", slice_groups)
unbreak_vals(df, regex, ogcol, newcol, sep = " ", slice_groups)
df |
A data frame with one or more values within a variable broken up across two rows. |
regex |
Regular expression for matching the trailing (lagging) half of the broken values. |
ogcol |
Variable to unbreak. |
newcol |
Name of the new variable with the unified values. |
sep |
Character string to separate the unified values (default is space). |
slice_groups |
Deprecated. See details and Package News. |
This function is limited to quite specific cases, but useful when
dealing with tables that contain, for example, scientific names broken across two rows.
For unwrapping values, see unwrap_cols
.
A tibble with 'unbroken' values. The variable that originally
contained the broken values gets dropped, and the new variable with the
unified values is placed as the first column. The slice_groups
argument is now deprecated; the extra rows and the variable with broken
values will be dropped.
data(primates2017_broken) # regex matches strings starting in lowercase (broken species epithets) unbreak_vals(primates2017_broken, "^[a-z]", scientific_name, sciname_new)
data(primates2017_broken) # regex matches strings starting in lowercase (broken species epithets) unbreak_vals(primates2017_broken, "^[a-z]", scientific_name, sciname_new)
Rectangling embedded subheaders
untangle2(df, regex, orig, new)
untangle2(df, regex, orig, new)
df |
A data frame with embedded subheaders. |
regex |
Regular expression to match the subheaders. |
orig |
Variable containing the extraneous subheaders. |
new |
Name of variable that will contain the group values. |
Special thanks to Jenny Bryan for fixing the initial tidyeval code and overall function structure.
A tibble without the matched subheaders and a new variable containing the grouping data.
data(primates2017) # put taxonomic family in its own variable (matches the suffix "DAE") untangle2(primates2017, "DAE$", scientific_name, family) # put geographic regions in their own variable (matching them all by name) untangle2( primates2017, "Asia|Madagascar|Mainland Africa|Neotropics", scientific_name, family ) # with magrittr pipes (re-exported in this package) primates2017 %>% untangle2("DAE$", scientific_name, family) %>% untangle2( "Asia|Madagascar|Mainland Africa|Neotropics", scientific_name, region )
data(primates2017) # put taxonomic family in its own variable (matches the suffix "DAE") untangle2(primates2017, "DAE$", scientific_name, family) # put geographic regions in their own variable (matching them all by name) untangle2( primates2017, "Asia|Madagascar|Mainland Africa|Neotropics", scientific_name, family ) # with magrittr pipes (re-exported in this package) primates2017 %>% untangle2("DAE$", scientific_name, family) %>% untangle2( "Asia|Madagascar|Mainland Africa|Neotropics", scientific_name, region )
Unwrap values and clean up NAs used as padding
unwrap_cols(df, groupingVar, separator)
unwrap_cols(df, groupingVar, separator)
df |
A data frame with wrapped values and an inconsistent number of NA values used to as within-group padding. |
groupingVar |
Name of the variable describing the observational units. |
separator |
Character string defining the separator that will delimit the elements of the unwrapped value. |
This is roughly the opposite of tidyr::separate_rows()
.
A summarized tibble. Order is preserved in the grouping variable by making it a factor.
data(primates2017_wrapped) # using commas to separate elements unwrap_cols(primates2017_wrapped, scientific_name, ", ") # separating with semicolons df <- data.frame( ounits = c("A", NA, "B", "C", "D", NA), vals = c(1, 2, 2, 3, 1, 3) ) unwrap_cols(df, ounits, ";")
data(primates2017_wrapped) # using commas to separate elements unwrap_cols(primates2017_wrapped, scientific_name, ", ") # separating with semicolons df <- data.frame( ounits = c("A", NA, "B", "C", "D", NA), vals = c(1, 2, 2, 3, 1, 3) ) unwrap_cols(df, ounits, ";")