Title: | Fast and Easy Data Cleaning |
---|---|
Description: | Data cleaning functions for classes logical, factor, numeric, character, currency and Date to make data cleaning fast and easy. Relying on very few dependencies, it provides smart guessing, but with user options to override anything if needed. |
Authors: | Matthijs S. Berends [aut, cre] |
Maintainer: | Matthijs S. Berends <[email protected]> |
License: | GPL-2 |
Version: | 1.5.5 |
Built: | 2024-11-19 11:24:51 UTC |
Source: | https://github.com/msberends/cleaner |
Use any of these functions to quickly clean columns in your data set. Use clean()
to pick the functions that return the least relative number of NA
s. They always return the class from the function name (e.g. clean_Date()
always returns class Date
).
clean(x) ## S3 method for class 'data.frame' clean(x) clean_logical( x, true = regex_true(), false = regex_false(), na = NULL, fixed = FALSE, ignore.case = TRUE ) clean_factor( x, levels = unique(x), ordered = FALSE, droplevels = FALSE, fixed = FALSE, ignore.case = TRUE ) clean_numeric(x, remove = "[^0-9.,-]", fixed = FALSE) clean_double(x, remove = "[^0-9.,-]", fixed = FALSE) clean_integer(x, remove = "[^0-9.,-]", fixed = FALSE) clean_character( x, remove = "[^a-z \t\r\n]", fixed = FALSE, ignore.case = TRUE, trim = TRUE ) clean_currency(x, currency_symbol = NULL, remove = "[^0-9.,-]", fixed = FALSE) clean_percentage(x, remove = "[^0-9.,-]", fixed = FALSE) clean_Date(x, format = NULL, guess_each = FALSE, max_date = Sys.Date(), ...) clean_POSIXct( x, tz = "", remove = "[^.0-9 :/-]", fixed = FALSE, max_date = Sys.Date(), ... )
clean(x) ## S3 method for class 'data.frame' clean(x) clean_logical( x, true = regex_true(), false = regex_false(), na = NULL, fixed = FALSE, ignore.case = TRUE ) clean_factor( x, levels = unique(x), ordered = FALSE, droplevels = FALSE, fixed = FALSE, ignore.case = TRUE ) clean_numeric(x, remove = "[^0-9.,-]", fixed = FALSE) clean_double(x, remove = "[^0-9.,-]", fixed = FALSE) clean_integer(x, remove = "[^0-9.,-]", fixed = FALSE) clean_character( x, remove = "[^a-z \t\r\n]", fixed = FALSE, ignore.case = TRUE, trim = TRUE ) clean_currency(x, currency_symbol = NULL, remove = "[^0-9.,-]", fixed = FALSE) clean_percentage(x, remove = "[^0-9.,-]", fixed = FALSE) clean_Date(x, format = NULL, guess_each = FALSE, max_date = Sys.Date(), ...) clean_POSIXct( x, tz = "", remove = "[^.0-9 :/-]", fixed = FALSE, max_date = Sys.Date(), ... )
x |
data to clean |
true |
regex to interpret values as |
false |
regex to interpret values as |
na |
regex to force interpret values as |
fixed |
logical to indicate whether regular expressions should be turned off |
ignore.case |
logical to indicate whether matching should be case-insensitive |
levels |
new factor levels, may be named regular expressions to match existing values, see Details |
ordered |
logical to indicate whether the factor levels should be ordered |
droplevels |
logical to indicate whether non-existing factor levels should be dropped |
remove |
regex to define the character(s) that should be removed, see Details |
trim |
logical to indicate whether the result should be trimmed with |
currency_symbol |
the currency symbol to use, which will be guessed based on the input and otherwise defaults to the current system locale setting (see |
format |
character string giving a date-time format as used by For |
guess_each |
logical to indicate whether all items of |
max_date |
date (coercible with |
... |
for |
tz |
a character string. The time zone specification to be used
for the conversion, if one is required. System-specific (see
time zones), but |
Using clean()
on a vector will guess a cleaning function based on the potential number of NA
s it returns. Using clean()
on a data frame to apply this guessed cleaning over all columns.
Info about the different functions:
clean_logical()
:
Use arguments true
and false
to match values using case-insensitive regular expressions (regex). Unmatched values are considered NA
. By default, values are matched with regex_true()
and regex_false()
. This allows support for values "Yes" and "No" in various languages. Use argument na
to override values as NA
that would otherwise be matched with true
or false
. See Examples.
clean_factor()
:
Use argument levels
to set new factor levels. They can be named case-insensitive regular expressions to match existing values of x
. For matching, new values for levels
are internally temporarily sorted descending on text length. See Examples.
clean_numeric()
, clean_double()
, clean_integer()
and clean_character()
:
Use argument remove
to match values that must be removed from the input, using regular expressions (regex). In the case of clean_numeric()
, commas will be read as dots and only the last dot will be kept. Function clean_character()
will keep middle spaces by default. See Examples.
clean_percentage()
:
This new class works like clean_numeric()
, but transforms it with as.percentage()
, which will retain the original values but will print them as percentages. See Examples.
clean_currency()
:
This new class works like clean_numeric()
, but transforms it with as.currency()
. The currency symbol is guessed based on the most traded currencies by value (see Source): the United States dollar, Euro, Japanese yen, Pound sterling, Swiss franc, Renminbi, Swedish krona, Mexican peso, South Korean won, Turkish lira, Russian ruble, Indian rupee, and the South African rand. See Examples.
clean_Date()
:
Use argument format
to define a date format or leave it empty to have the format guessed. Use "Excel"
to read values as Microsoft Excel dates. The format
argument will be evaluated with format_datetime()
, meaning that a format like "d-mmm-yy"
will be translated internally to "%e-%b-%y"
for convenience. See Examples.
clean_POSIXct()
:
Use argument remove
to match values that must be removed from the input, using regular expressions (regex). The resulting string will be coerced to a date/time element with class POSIXct
, using as.POSIXct()
. See Examples.
The use of invalid regular expressions in any of the above functions will not return an error (as in base R) but will instead interpret the expression as a fixed value and will throw a warning.
The clean_*
functions always return the class from the function name:
clean_logical()
: class logical
clean_factor()
: class factor
clean_numeric()
and clean_double()
: class numeric
clean_integer()
: class integer
clean_character()
: class character
clean_percentage()
: class percentage
clean_currency()
: class currency
clean_Date()
: class Date
clean_POSIXct()
: classes POSIXct/POSIXt
Triennial Central Bank Survey Foreign exchange turnover in April 2016 (PDF). Bank for International Settlements. 11 December 2016. p. 10.
clean_logical(c("Yes", "No")) # English clean_logical(c("Oui", "Non")) # French clean_logical(c("ya", "tidak")) # Indonesian clean_logical(x = c("Positive", "Negative", "Unknown", "Some value"), true = "pos", false = "neg") gender_age <- c("male 0-50", "male 50+", "female 0-50", "female 50+") clean_factor(gender_age, c("M", "F")) clean_factor(gender_age, c("Male", "Female")) clean_factor(gender_age, c("0-50", "50+"), ordered = TRUE) clean_factor(gender_age, levels = c("Group A" = "female", "Group B" = "male 50+", Other = ".*")) clean_Date("13jul18", "ddmmmyy") clean_Date("12 August 2010") clean_Date("12 06 2012") clean_Date("October 1st 2012") clean_Date("43658") clean_Date("14526", "Excel") clean_Date(c("1 Oct 13", "October 1st 2012")) # could not be fitted in 1 format clean_Date(c("1 Oct 13", "October 1st 2012"), guess_each = TRUE) clean_Date(c("12-14-13", "1 Oct 2012"), guess_each = TRUE, format = c("d mmm yyyy", "mm-yy-dd")) # only these formats will be tried clean_POSIXct("Created log on 2020/02/11 11:23 by user Joe") clean_POSIXct("Created log on 2020.02.11 11:23 by user Joe", tz = "UTC") clean_numeric("qwerty123456") clean_numeric("Positive (0.143)") clean_numeric("0,143") clean_numeric("minus 12 degrees") clean_percentage("PCT: 0.143") clean_percentage(c("Total of -12.3%", "Total of +4.5%")) clean_character("qwerty123456") clean_character("Positive (0.143)") clean_currency(c("Received 25", "Received 31.40")) clean_currency(c("Jack sent £ 25", "Bill sent £ 31.40")) df <- data.frame(A = c("2 Apr 2016", "5 Feb 2020"), B = c("yes", "no"), C = c("Total of -12.3%", "Total of +4.5%"), D = c("Marker: 0.4513 mmol/l", "Marker: 0.2732 mmol/l")) df clean(df)
clean_logical(c("Yes", "No")) # English clean_logical(c("Oui", "Non")) # French clean_logical(c("ya", "tidak")) # Indonesian clean_logical(x = c("Positive", "Negative", "Unknown", "Some value"), true = "pos", false = "neg") gender_age <- c("male 0-50", "male 50+", "female 0-50", "female 50+") clean_factor(gender_age, c("M", "F")) clean_factor(gender_age, c("Male", "Female")) clean_factor(gender_age, c("0-50", "50+"), ordered = TRUE) clean_factor(gender_age, levels = c("Group A" = "female", "Group B" = "male 50+", Other = ".*")) clean_Date("13jul18", "ddmmmyy") clean_Date("12 August 2010") clean_Date("12 06 2012") clean_Date("October 1st 2012") clean_Date("43658") clean_Date("14526", "Excel") clean_Date(c("1 Oct 13", "October 1st 2012")) # could not be fitted in 1 format clean_Date(c("1 Oct 13", "October 1st 2012"), guess_each = TRUE) clean_Date(c("12-14-13", "1 Oct 2012"), guess_each = TRUE, format = c("d mmm yyyy", "mm-yy-dd")) # only these formats will be tried clean_POSIXct("Created log on 2020/02/11 11:23 by user Joe") clean_POSIXct("Created log on 2020.02.11 11:23 by user Joe", tz = "UTC") clean_numeric("qwerty123456") clean_numeric("Positive (0.143)") clean_numeric("0,143") clean_numeric("minus 12 degrees") clean_percentage("PCT: 0.143") clean_percentage(c("Total of -12.3%", "Total of +4.5%")) clean_character("qwerty123456") clean_character("Positive (0.143)") clean_currency(c("Received 25", "Received 31.40")) clean_currency(c("Jack sent £ 25", "Bill sent £ 31.40")) df <- data.frame(A = c("2 Apr 2016", "5 Feb 2020"), B = c("yes", "no"), C = c("Total of -12.3%", "Total of +4.5%"), D = c("Marker: 0.4513 mmol/l", "Marker: 0.2732 mmol/l")) df clean(df)
Transform input to a currency. The actual values are numeric, but will be printed as formatted currency values.
as.currency(x, currency_symbol = Sys.localeconv()["int_curr_symbol"], ...) is.currency(x) ## S3 method for class 'currency' print( x, decimal.mark = getOption("OutDec"), big.mark = ifelse(decimal.mark == ",", ".", ","), as_symbol = FALSE, ... ) ## S3 method for class 'currency' format( x, currency_symbol = attributes(x)$currency_symbol, decimal.mark = getOption("OutDec"), big.mark = ifelse(decimal.mark == ",", ".", ","), as_symbol = FALSE, ... )
as.currency(x, currency_symbol = Sys.localeconv()["int_curr_symbol"], ...) is.currency(x) ## S3 method for class 'currency' print( x, decimal.mark = getOption("OutDec"), big.mark = ifelse(decimal.mark == ",", ".", ","), as_symbol = FALSE, ... ) ## S3 method for class 'currency' format( x, currency_symbol = attributes(x)$currency_symbol, decimal.mark = getOption("OutDec"), big.mark = ifelse(decimal.mark == ",", ".", ","), as_symbol = FALSE, ... )
x |
input |
currency_symbol |
the currency symbol to use, which defaults to the current system locale setting (see |
... |
other parameters passed on to methods |
decimal.mark |
symbol to use as a decimal separator, defaults to |
big.mark |
symbol to use as a thousands separator, defaults to a dot if |
as_symbol |
try to format and print using currency symbols instead of text |
Printing currency will always have a currency symbol followed by a space, 2 decimal places and is never written in scientific format (like 2.5e+04).
money <- as.currency(c(0.25, 2.5, 25, 25000)) money sum(money) max(money) mean(money) format(money, currency_symbol = "USD") format(money, currency_symbol = "EUR", decimal.mark = ",") format(money, currency_symbol = "EUR", as_symbol = TRUE) as.currency(2.5e+04)
money <- as.currency(c(0.25, 2.5, 25, 25000)) money sum(money) max(money) mean(money) format(money, currency_symbol = "USD") format(money, currency_symbol = "EUR", decimal.mark = ",") format(money, currency_symbol = "EUR", as_symbol = TRUE) as.currency(2.5e+04)
Use this function to transform generic date/time info writing (dd-mm-yyyy) into POSIX standardised format (%d-%m-%Y), see Examples.
format_datetime(format)
format_datetime(format)
format |
the format that needs to be transformed |
A character
string (a POSIX standardised format)
format_datetime("yyyy-mm-dd") # Very hard to remember all these characters: format(Sys.time(), "%a %b %d %Y %X") # Easy to remember and write the same as above: format(Sys.time(), format_datetime("ddd mmm dd yyyy HH:MM:ss")) # seconds since the Epoch, 1970-01-01 00:00:00 format(Sys.time(), format_datetime("epoch"))
format_datetime("yyyy-mm-dd") # Very hard to remember all these characters: format(Sys.time(), "%a %b %d %Y %X") # Easy to remember and write the same as above: format(Sys.time(), format_datetime("ddd mmm dd yyyy HH:MM:ss")) # seconds since the Epoch, 1970-01-01 00:00:00 format(Sys.time(), format_datetime("epoch"))
This function can be used on any data.frame
, list
or character vector to format their names or values. It supports snake case and camel case.
format_names( x, ..., snake_case = FALSE, camelCase = FALSE, tolower = FALSE, toupper = FALSE )
format_names( x, ..., snake_case = FALSE, camelCase = FALSE, tolower = FALSE, toupper = FALSE )
x |
a |
... |
when |
snake_case |
logical to indicate whether the column names must be in snake case. This will have no effect on manually set column names. |
camelCase |
logical to indicate whether the column names must be in camel case. This will have no effect on manually set column names. |
tolower , toupper
|
logical to indicate whether the column names must be lower/upper case. This will have no effect on manually set column names. |
df <- data.frame(Name.341ABC = "value", name_123def = "value", This.is.a.column = "value") format_names(df, snake_case = TRUE) format_names(df, camelCase = TRUE) format_names(df, letters[1:3]) format_names(df, This.is.a.column = "a_new_colname") rownames(mtcars) <- format_names(rownames(mtcars), snake_case = TRUE) mtcars[, 1:5] format_names(list(a = 1, b = 2), c("new_1", "new_2")) ## Not run: library(dplyr) starwars %>% format_names(camelCase = TRUE) %>% # new column names mutate(name = name %>% format_names(name, snake_case = TRUE)) # new values in column ## End(Not run)
df <- data.frame(Name.341ABC = "value", name_123def = "value", This.is.a.column = "value") format_names(df, snake_case = TRUE) format_names(df, camelCase = TRUE) format_names(df, letters[1:3]) format_names(df, This.is.a.column = "a_new_colname") rownames(mtcars) <- format_names(rownames(mtcars), snake_case = TRUE) mtcars[, 1:5] format_names(list(a = 1, b = 2), c("new_1", "new_2")) ## Not run: library(dplyr) starwars %>% format_names(camelCase = TRUE) %>% # new column names mutate(name = name %>% format_names(name, snake_case = TRUE)) # new values in column ## End(Not run)
This function will round p values according to the APA guideline. It will try to round to two decimals where possible, and will try to avoid printing the value of alpha
, see Examples.
format_p_value(p, alpha = 0.05, prepend_p = FALSE)
format_p_value(p, alpha = 0.05, prepend_p = FALSE)
p |
p value(s) to transform |
alpha |
the value of alpha, defaults to 0.05 |
prepend_p |
a logical to indicate whether "p =" should be prepended to the result |
A character
format_p_value(0.345678) format_p_value(0.05125) # this must not be "0.05", but is not "0.049" either, # so it will add as many decimals as needed: format_p_value(0.04993) format_p_value(c(0.123, 0.00000001)) format_p_value(c(0.123, 0.00000001), prepend_p = TRUE)
format_p_value(0.345678) format_p_value(0.05125) # this must not be "0.05", but is not "0.049" either, # so it will add as many decimals as needed: format_p_value(0.04993) format_p_value(c(0.123, 0.00000001)) format_p_value(c(0.123, 0.00000001), prepend_p = TRUE)
Create a frequency table of a vector
or a data.frame
. It supports tidyverse's quasiquotation and RMarkdown for reports. Easiest practice is: data %>% freq(var)
using the tidyverse.
top_freq
can be used to get the top/bottom n items of a frequency table, with counts as names. It respects ties.
freq(x, ...) ## Default S3 method: freq( x, sort.count = TRUE, nmax = getOption("max.print.freq"), na.rm = TRUE, row.names = TRUE, markdown = !interactive(), digits = 2, quote = NULL, header = TRUE, title = NULL, na = "<NA>", sep = " ", decimal.mark = getOption("OutDec"), big.mark = "", wt = NULL, ... ) ## S3 method for class 'factor' freq(x, ..., droplevels = FALSE) ## S3 method for class 'matrix' freq(x, ..., quote = FALSE) ## S3 method for class 'table' freq(x, ..., sep = " ") ## S3 method for class 'numeric' freq(x, ..., digits = 2) ## S3 method for class 'Date' freq(x, ..., format = "yyyy-mm-dd") ## S3 method for class 'hms' freq(x, ..., format = "HH:MM:SS") is.freq(f) top_freq(f, n) header(f, property = NULL) ## S3 method for class 'freq' print( x, nmax = getOption("max.print.freq", default = 10), markdown = !interactive(), header = TRUE, decimal.mark = getOption("OutDec"), big.mark = ifelse(decimal.mark != ",", ",", "."), ... )
freq(x, ...) ## Default S3 method: freq( x, sort.count = TRUE, nmax = getOption("max.print.freq"), na.rm = TRUE, row.names = TRUE, markdown = !interactive(), digits = 2, quote = NULL, header = TRUE, title = NULL, na = "<NA>", sep = " ", decimal.mark = getOption("OutDec"), big.mark = "", wt = NULL, ... ) ## S3 method for class 'factor' freq(x, ..., droplevels = FALSE) ## S3 method for class 'matrix' freq(x, ..., quote = FALSE) ## S3 method for class 'table' freq(x, ..., sep = " ") ## S3 method for class 'numeric' freq(x, ..., digits = 2) ## S3 method for class 'Date' freq(x, ..., format = "yyyy-mm-dd") ## S3 method for class 'hms' freq(x, ..., format = "HH:MM:SS") is.freq(f) top_freq(f, n) header(f, property = NULL) ## S3 method for class 'freq' print( x, nmax = getOption("max.print.freq", default = 10), markdown = !interactive(), header = TRUE, decimal.mark = getOption("OutDec"), big.mark = ifelse(decimal.mark != ",", ",", "."), ... )
x |
vector of any class or a |
... |
up to nine different columns of |
sort.count |
sort on count, i.e. frequencies. This will be |
nmax |
number of row to print. The default, |
na.rm |
a logical value indicating whether |
row.names |
a logical value indicating whether row indices should be printed as |
markdown |
a logical value indicating whether the frequency table should be printed in markdown format. This will print all rows (except when |
digits |
how many significant digits are to be used for numeric values in the header (not for the items themselves, that depends on |
quote |
a logical value indicating whether or not strings should be printed with surrounding quotes. Default is to print them only around characters that are actually numeric values. |
header |
a logical value indicating whether an informative header should be printed |
title |
text to show above frequency table, at default to tries to coerce from the variables passed to |
na |
a character string that should be used to show empty ( |
sep |
a character string to separate the terms when selecting multiple columns |
decimal.mark |
the character to be used to indicate the numeric decimal point |
big.mark |
character; if not empty used as mark between every 'big.interval' decimals before (hence big) the decimal point |
wt |
frequency weights. If a variable, computes |
droplevels |
a logical value indicating whether in factors empty levels should be dropped |
format |
a character to define the printing format (it supports |
f |
a frequency table |
n |
number of top n items to return, use -n for the bottom n items. It will include more than |
property |
property in header to return this value directly |
Frequency tables (or frequency distributions) are summaries of the distribution of values in a sample. With the 'freq' function, you can create univariate frequency tables. Multiple variables will be pasted into one variable, so it forces a univariate distribution.
Input can be done in many different ways. Base R methods are:
freq(df$variable) freq(df[, "variable"])
Tidyverse methods are:
df$variable %>% freq() df[, "variable"] %>% freq() df %>% freq("variable") df %>% freq(variable)
For numeric values of any class, these additional values will all be calculated with na.rm = TRUE
and shown into the header:
Mean, using mean
Standard Deviation, using sd
Coefficient of Variation (CV), the standard deviation divided by the mean
Mean Absolute Deviation (MAD), using mad
Tukey Five-Number Summaries (minimum, Q1, median, Q3, maximum), see NOTE below
Interquartile Range (IQR) calculated as Q3 - Q1
, see NOTE below
Coefficient of Quartile Variation (CQV, sometimes called coefficient of dispersion) calculated as (Q3 - Q1) / (Q3 + Q1)
, see NOTE below
Outliers (total count and percentage), using boxplot.stats
NOTE: These values are calculated using the same algorithm as used by Minitab and SPSS: p[k] = E[F(x[k])]. See Type 6 on the quantile
page.
For dates and times of any class, these additional values will be calculated with na.rm = TRUE
and shown into the header:
In factors, all factor levels that are not existing in the input data will be dropped at default.
The function top_freq
will include more than n
rows if there are ties. Use a negative number for n (like n = -3
) to select the bottom n values.
A data.frame
(with an additional class "freq"
) with five columns: item
, count
, percent
, cum_count
and cum_percent
.
freq()
functionInterested in extending the freq()
function with your own class? Add a method like below to your package, and optionally define some header info by passing a list
to the .add_header
parameter, like below example for class difftime
. This example assumes that you use the roxygen2
package for package development.
#' @method freq difftime #' @importFrom cleaner freq.default #' @export #' @noRd freq.difftime <- function(x, ...) { freq.default(x = x, ..., .add_header = list(units = attributes(x)$units)) }
Be sure to call freq.default
in your function and not just freq
. Also, add cleaner
to the Imports:
field of your DESCRIPTION
file, to make sure that it will be installed with your package, e.g.:
Imports: cleaner
freq(unclean$gender, markdown = FALSE) freq(x = clean_factor(unclean$gender, levels = c("^m" = "Male", "^f" = "Female")), markdown = TRUE, title = "Frequencies of a cleaned version for a markdown report!", header = FALSE, quote = TRUE)
freq(unclean$gender, markdown = FALSE) freq(x = clean_factor(unclean$gender, levels = c("^m" = "Male", "^f" = "Female")), markdown = TRUE, title = "Frequencies of a cleaned version for a markdown report!", header = FALSE, quote = TRUE)
This is a generic function to replace NA values in data. It takes most data types as input and is extendible by other packages.
na_replace(x, ...) ## Default S3 method: na_replace(x, replacement = "", ...) ## S3 method for class 'data.frame' na_replace(x, ..., replacement = NULL) ## S3 method for class 'matrix' na_replace(x, replacement = 0, ...) ## S3 method for class 'list' na_replace(x, replacement = NULL, ...) ## S3 method for class 'numeric' na_replace(x, replacement = 0, ...) ## S3 method for class 'Date' na_replace(x, replacement = Sys.Date(), ...) ## S3 method for class 'logical' na_replace(x, replacement = FALSE, ...)
na_replace(x, ...) ## Default S3 method: na_replace(x, replacement = "", ...) ## S3 method for class 'data.frame' na_replace(x, ..., replacement = NULL) ## S3 method for class 'matrix' na_replace(x, replacement = 0, ...) ## S3 method for class 'list' na_replace(x, replacement = NULL, ...) ## S3 method for class 'numeric' na_replace(x, replacement = 0, ...) ## S3 method for class 'Date' na_replace(x, replacement = Sys.Date(), ...) ## S3 method for class 'logical' na_replace(x, replacement = FALSE, ...)
x |
any vector, data.frame, matrix or list with values of which |
... |
When |
replacement |
value to replace |
All functions preserve attributes. Within a list
or data.frame
, all attributes per index/item/column are also preserved.
mtrx <- matrix(c(1, 2, NA, 3), nrow = 2) mtrx na_replace(mtrx) na_replace(c(1, 2, NA, NA)) na_replace(c(1, 2, NA, NA), replacement = -1) na_replace(c(1, 2, NA, NA), replacement = c(0, -1)) na_replace(c(Sys.Date(), NA)) # replacement defaults to 'today' na_replace(c(TRUE, FALSE, NA)) na_replace(c(TRUE, FALSE, NA), replacement = TRUE) # we're flexible, the class only remains the same if # the replacement value allows it na_replace(c(1, 2, 3, NA), replacement = "-") # data.frame is a special case mtcars[1:6, c("mpg", "hp")] <- NA head(mtcars) head(na_replace(mtcars, mpg, hp)) # no need to quote columns (but you can) head(na_replace(mtcars, mpg, hp, replacement = c(999, 123))) ## Not run: # practical way using tidyverse library(dplyr) starwars %>% na_replace() # even maintains groups starwars %>% group_by(hair_color) %>% na_replace(hair_color, replacement = "TEST!") %>% summarise(n = n()) ## End(Not run)
mtrx <- matrix(c(1, 2, NA, 3), nrow = 2) mtrx na_replace(mtrx) na_replace(c(1, 2, NA, NA)) na_replace(c(1, 2, NA, NA), replacement = -1) na_replace(c(1, 2, NA, NA), replacement = c(0, -1)) na_replace(c(Sys.Date(), NA)) # replacement defaults to 'today' na_replace(c(TRUE, FALSE, NA)) na_replace(c(TRUE, FALSE, NA), replacement = TRUE) # we're flexible, the class only remains the same if # the replacement value allows it na_replace(c(1, 2, 3, NA), replacement = "-") # data.frame is a special case mtcars[1:6, c("mpg", "hp")] <- NA head(mtcars) head(na_replace(mtcars, mpg, hp)) # no need to quote columns (but you can) head(na_replace(mtcars, mpg, hp, replacement = c(999, 123))) ## Not run: # practical way using tidyverse library(dplyr) starwars %>% na_replace() # even maintains groups starwars %>% group_by(hair_color) %>% na_replace(hair_color, replacement = "TEST!") %>% summarise(n = n()) ## End(Not run)
Transform input to a percentage. The actual values are numeric, but will be printed as formatted percentages.
as.percentage(x, ...) is.percentage(x) ## S3 method for class 'percentage' print(x, ...) ## S3 method for class 'percentage' format(x, digits = NULL, ...) percentage(x, digits = NULL, ...)
as.percentage(x, ...) is.percentage(x) ## S3 method for class 'percentage' print(x, ...) ## S3 method for class 'percentage' format(x, digits = NULL, ...) percentage(x, digits = NULL, ...)
x |
input |
... |
other parameters passed on to methods |
digits |
how many digits should be printed. It defaults to printing all decimals available in the data after transforming to a percentage, with a minimum of 0 and a maximum of 3. |
Printing percentages will always have a percentage symbol and is never written in scientific format (like 2.5e+04%).
The function percentage
is a wrapper around format(as.percentage(...))
with automatic determination of the number of digits, varying between 0 and 1. It also, unlike R, rounds according to basic math rules: percentage(0.4455)
returns "44.6%"
and not "44.5%"
. This function always returns a character, and can also be used in plotting, see Examples.
proportion <- as.percentage(c(0.25, 2.5, 0.0025)) proportion sum(proportion) max(proportion) mean(proportion) as.percentage(2.5e-14) as.percentage(pi) format(as.percentage(pi)) format(as.percentage(pi), digits = 6) round(0.4455 * 100, 1) # mind the rounding percentage(0.4455) # does not round to 44.5% if (require("ggplot2")) { ggplot(iris) + geom_col(aes(Species, Sepal.Length / sum(Sepal.Length)), position = "stack") + # add percentage as function to the labels: scale_y_continuous(labels = percentage) }
proportion <- as.percentage(c(0.25, 2.5, 0.0025)) proportion sum(proportion) max(proportion) mean(proportion) as.percentage(2.5e-14) as.percentage(pi) format(as.percentage(pi)) format(as.percentage(pi), digits = 6) round(0.4455 * 100, 1) # mind the rounding percentage(0.4455) # does not round to 44.5% if (require("ggplot2")) { ggplot(iris) + geom_col(aes(Species, Sepal.Length / sum(Sepal.Length)), position = "stack") + # add percentage as function to the labels: scale_y_continuous(labels = percentage) }
This function provides random date generation with a specified range, that defaults to the beginning and end of the current year.
rdate( n, min = paste0(format(Sys.Date(), "%Y"), "-01-01"), max = paste0(format(Sys.Date(), "%Y"), "-12-31"), ... )
rdate( n, min = paste0(format(Sys.Date(), "%Y"), "-01-01"), max = paste0(format(Sys.Date(), "%Y"), "-12-31"), ... )
n |
number of observations. If |
min , max
|
lower and upper limits of the distribution. Must be (coercible to) valid dates. |
... |
parameters given to |
# generate a million random dates and check the distribution hist(rdate(1000000), breaks = "months")
# generate a million random dates and check the distribution hist(rdate(1000000), breaks = "months")
TRUE
and FALSE
These functions just return a regular expression to define values TRUE
and FALSE
in the most spoken languages in the world. They are the default input for the function clean_logical
.
regex_true() regex_false()
regex_true() regex_false()
Both functions support values "Yes" and "No" in the following languages: Arabic, Bengali, Chinese (Mandarin), Dutch, English, French, German, Hindi, Indonesian, Japanese, Malay, Portuguese, Russian, Spanish, Telugu, Turkish and Urdu.
Note: all these translations are in Latin characters only (e.g. "da" for Russian, "haan" for Hindi and "hai" for Japanese).
Wolfram Alpha, query: https://www.wolframalpha.com/input/?i=20+most+spoken+languages
This typical data example can be used for checking and cleaning.
unclean
unclean
A data.frame
with 500 observations and the following variables:
date
Dates imported from Excel, they are integers ranging from ~30,000 to ~43,000.
gender
Characters with mixed values observed in original data about patients gender.
freq
to check values and clean
to clean them.