| Title: | Audit and Diagnostic Tools for 'data.table' Workflows |
|---|---|
| Description: | Diagnostic tools for auditing data analysis workflows built on 'data.table'. Provides functions to validate join operations, compare data.tables, filter with diagnostic output, summarize data quality, check primary keys and variable relationships, and diagnose string columns. Designed to help analysts understand and document data transformations. |
| Authors: | Fernando Cordeiro [aut, cre, cph] |
| Maintainer: | Fernando Cordeiro <[email protected]> |
| License: | LGPL (>= 3) |
| Version: | 0.1.1 |
| Built: | 2026-05-20 09:24:02 UTC |
| Source: | https://github.com/fpcordeiro/dtaudit |
Applies a cleaning function to a character vector and reports what changed. Provides transparency about the cleaning operation by showing counts and before/after examples.
audit_clean(x, clean_fn, name = NULL)audit_clean(x, clean_fn, name = NULL)
x |
Character vector to clean. |
clean_fn |
A function that takes a character vector and returns a cleaned character vector of the same length. |
name |
Optional name for the variable (used in output). If NULL, attempts to capture the variable name from the call. |
An S3 object of class audit_clean containing:
Name of the variable
Name of the cleaning function used
Total number of elements
Count of values that changed
Count of values that stayed the same
Count of NA values (unchanged by definition)
Percentage of non-NA values that changed
data.table with sample before/after pairs
The cleaned character vector
diagnose_strings() for string quality diagnostics,
clean_var_names() and clean_firm_name() for built-in cleaning functions
library(data.table) firms <- c("Apple Inc.", "MICROSOFT CORP", "Alphabet LLC", NA) result <- audit_clean(firms, clean_firm_name) result$cleanedlibrary(data.table) firms <- c("Apple Inc.", "MICROSOFT CORP", "Alphabet LLC", NA) result <- audit_clean(firms, clean_firm_name) result$cleaned
Verifies whether a date vector contains data for all periods within a specified date range. Reports any missing periods.
check_date_coverage( date_var, start_date, end_date, by = "month", quiet = FALSE )check_date_coverage( date_var, start_date, end_date, by = "month", quiet = FALSE )
date_var |
A vector of dates to check. |
start_date |
Character or Date. Start of the expected date range (format: "YYYY-MM-DD"). |
end_date |
Character or Date. End of the expected date range (format: "YYYY-MM-DD"). |
by |
Character. Period granularity: one of |
quiet |
Logical. If |
An IDate vector of missing periods, returned invisibly.
check_months_coverage() for a convenience wrapper with
by = "month"
library(data.table) dates <- as.IDate(c("2023-01-15", "2023-02-20", "2023-04-10")) check_date_coverage(dates, "2023-01-01", "2023-04-30") check_date_coverage(dates, "2023-01-01", "2023-04-30", by = "quarter")library(data.table) dates <- as.IDate(c("2023-01-15", "2023-02-20", "2023-04-10")) check_date_coverage(dates, "2023-01-01", "2023-04-30") check_date_coverage(dates, "2023-01-01", "2023-04-30", by = "quarter")
Verifies whether a date vector contains data for all months within a specified date range. Reports any missing months.
check_months_coverage(date_var, start_date, end_date, quiet = FALSE)check_months_coverage(date_var, start_date, end_date, quiet = FALSE)
date_var |
A vector of dates to check. |
start_date |
Character or Date. Start of the expected date range (format: "YYYY-MM-DD"). |
end_date |
Character or Date. End of the expected date range (format: "YYYY-MM-DD"). |
quiet |
Logical. If |
This is a convenience wrapper around check_date_coverage() with
by = "month".
An IDate vector of missing months, returned invisibly.
check_date_coverage() for other period granularities
library(data.table) dates <- as.IDate(c("2023-01-15", "2023-02-20", "2023-04-10")) check_months_coverage(dates, "2023-01-01", "2023-04-30")library(data.table) dates <- as.IDate(c("2023-01-15", "2023-02-20", "2023-04-10")) check_months_coverage(dates, "2023-01-01", "2023-04-30")
Normalizes firm names by converting to uppercase ASCII, removing common suffixes (Corp, LLC, Inc, etc.), and standardizing whitespace. Useful for fuzzy matching or deduplication of company names.
clean_firm_name(text)clean_firm_name(text)
text |
Character vector of firm names to clean. |
Character vector of cleaned firm names.
clean_firm_name(c("Apple, Inc.", "MICROSOFT CORP.", "Alphabet LLC"))clean_firm_name(c("Apple, Inc.", "MICROSOFT CORP.", "Alphabet LLC"))
Standardizes variable names by trimming whitespace, converting to lowercase ASCII, replacing all non-alphanumeric characters with underscores, and removing leading/trailing underscores.
clean_var_names(text)clean_var_names(text)
text |
Character vector of variable names to clean. |
Character vector of cleaned variable names containing only lowercase letters, digits, and underscores.
clean_var_names(c("Sales Revenue", "cost-of-goods", " margin ")) # Returns: c("sales_revenue", "cost_of_goods", "margin")clean_var_names(c("Sales Revenue", "cost-of-goods", " margin ")) # Returns: c("sales_revenue", "cost_of_goods", "margin")
Compares two data.tables by examining column names, row counts, key overlap, and numeric discrepancies. Useful for validating data processing pipelines.
compare_datatables(dt1, dt2, key_cols = NULL)compare_datatables(dt1, dt2, key_cols = NULL)
dt1 |
First data.table to compare. |
dt2 |
Second data.table to compare. |
key_cols |
Character vector of column names to use as keys for matching rows. If NULL (default), automatically detects character, factor, and integer columns as keys. |
An S3 object of class compare_dt containing:
Names of the compared objects
Column names present in both tables
Column names only in dt1
Column names only in dt2
Data.table of columns with same name but different types, with columns: column, type_dt1, type_dt2. NULL if no mismatches.
Number of rows in dt1
Number of rows in dt2
Summary of key overlap (if keys found)
Data.table of numeric column discrepancies
How numeric columns were compared ("keys", "row_index", or NA)
Number of rows matched on keys (when method is "keys")
validate_join() for analyzing join relationships,
validate_primary_keys() for key uniqueness validation
library(data.table) dt1 <- data.table(id = 1:3, value = c(10.0, 20.0, 30.0)) dt2 <- data.table(id = 1:3, value = c(10.1, 20.0, 30.5)) compare_datatables(dt1, dt2)library(data.table) dt1 <- data.table(id = 1:3, value = c(10.0, 20.0, 30.0)) dt2 <- data.table(id = 1:3, value = c(10.1, 20.0, 30.5)) compare_datatables(dt1, dt2)
Reports NA counts and percentages for each column in a data.table, sorted by missing percentage in descending order.
diagnose_nas(dt)diagnose_nas(dt)
dt |
A data.table to diagnose. |
An S3 object of class diagnose_na containing:
A data.table with columns variable, n_na, pct_na, and
n_valid, sorted by pct_na descending.
Total number of columns in the input.
Number of columns that have at least one NA.
get_summary_table() for comprehensive column summaries,
diagnose_strings() for string column quality
library(data.table) dt <- data.table( a = c(1, NA, 3), b = c(NA, NA, "x"), c = c(TRUE, FALSE, TRUE) ) diagnose_nas(dt)library(data.table) dt <- data.table( a = c(1, NA, 3), b = c(NA, NA, "x"), c = c(TRUE, FALSE, TRUE) ) diagnose_nas(dt)
Audits a character vector for common data quality issues including missing values, empty strings, whitespace problems, non-ASCII characters, and case inconsistencies. Useful for understanding string data before cleaning.
diagnose_strings(x, name = NULL)diagnose_strings(x, name = NULL)
x |
Character vector to diagnose. |
name |
Optional name for the variable (used in output). If NULL, attempts to capture the variable name from the call. |
An S3 object of class diagnose_strings containing:
Name of the variable
Total number of elements
Count of NA values
Count of empty strings ("")
Count of strings containing only whitespace
Count of non-empty strings with leading whitespace
Count of non-empty strings with trailing whitespace
Count of strings containing non-ASCII characters
Number of unique values that have case variants
Number of groups of case-insensitive duplicates
data.table with examples of case variants
audit_clean() for auditing the effect of cleaning functions,
diagnose_nas() for missing value diagnostics
library(data.table) firms <- c("Apple", "APPLE", "apple", " Microsoft ", "Google", NA, "") diagnose_strings(firms)library(data.table) firms <- c("Apple", "APPLE", "apple", " Microsoft ", "Google", NA, "") diagnose_strings(firms)
Expands a data.table to include all combinations of the specified grouping variables (cartesian product). Missing combinations are filled with NA or a specified value.
embed_into_cartesian(dt, group_vars, dt_frame = NULL, fill = NA)embed_into_cartesian(dt, group_vars, dt_frame = NULL, fill = NA)
dt |
A data.table to expand. |
group_vars |
Character vector of column names defining the grouping structure. |
dt_frame |
Optional data.table containing the target cartesian frame. If NULL (default), creates cartesian product from unique values in dt. |
fill |
Value to fill for missing combinations. Default is NA. Set to NA to leave gaps unfilled. |
A data.table with all combinations of group_vars, with original data merged in.
library(data.table) dt <- data.table( year = c(2020, 2020, 2021), region = c("A", "B", "A"), value = c(10, 20, 30) ) embed_into_cartesian(dt, c("year", "region"))library(data.table) dt <- data.table( year = c(2020, 2020, 2021), region = c("A", "B", "A"), value = c(10, 20, 30) ) embed_into_cartesian(dt, c("year", "region"))
Filters a data.table by DROPPING rows where the expression is TRUE, while reporting statistics about dropped rows and optionally the sum of a statistic column that was dropped.
filter_drop(x, ...) ## S3 method for class 'data.table' filter_drop( x, expr, stat = NULL, na_as = FALSE, quiet = FALSE, warn_threshold = NULL, ... )filter_drop(x, ...) ## S3 method for class 'data.table' filter_drop( x, expr, stat = NULL, na_as = FALSE, quiet = FALSE, warn_threshold = NULL, ... )
x |
A data.table or other object. |
... |
Arguments passed to methods. |
expr |
A filtering expression written in terms of columns of x. Rows where expr is TRUE are DROPPED; others are kept. |
stat |
An unquoted column or expression to total, e.g., sales, price*qty, etc. Reports the amount dropped and its share of total. |
na_as |
Logical. Treat NA results of expr as this value (default FALSE: drop rows where expr is NA). |
quiet |
Logical. If TRUE, suppress printing diagnostics. |
warn_threshold |
Numeric between 0 and 1. If set and the proportion of dropped rows exceeds this threshold, a warning is issued. |
The filtered data.table.
filter_drop(data.table): Method for data.table objects
library(data.table) DT <- data.table( id = 1:5, bad = c(FALSE, TRUE, FALSE, TRUE, FALSE), sales = 10:14 ) # Drop rows where bad == TRUE; report dropped statistics DT2 <- filter_drop(DT, bad == TRUE) # Also report dropped sales value DT3 <- filter_drop(DT, bad == TRUE, stat = sales)library(data.table) DT <- data.table( id = 1:5, bad = c(FALSE, TRUE, FALSE, TRUE, FALSE), sales = 10:14 ) # Drop rows where bad == TRUE; report dropped statistics DT2 <- filter_drop(DT, bad == TRUE) # Also report dropped sales value DT3 <- filter_drop(DT, bad == TRUE, stat = sales)
Filters a data.table while reporting statistics about dropped rows and optionally the sum of a statistic column that was dropped.
filter_keep(x, ...) ## S3 method for class 'data.table' filter_keep( x, expr, stat = NULL, na_as = FALSE, quiet = FALSE, warn_threshold = NULL, ... )filter_keep(x, ...) ## S3 method for class 'data.table' filter_keep( x, expr, stat = NULL, na_as = FALSE, quiet = FALSE, warn_threshold = NULL, ... )
x |
A data.table or other object. |
... |
Arguments passed to methods. |
expr |
A filtering expression written in terms of columns of x. Rows where expr is TRUE are KEPT; others are dropped. |
stat |
An unquoted column or expression to total, e.g., sales, price*qty, etc. Reports the amount dropped and its share of total. |
na_as |
Logical. Treat NA results of expr as this value (default FALSE: drop rows where expr is NA). |
quiet |
Logical. If TRUE, suppress printing diagnostics. |
warn_threshold |
Numeric between 0 and 1. If set and the proportion of dropped rows exceeds this threshold, a warning is issued. |
The filtered data.table.
filter_keep(data.table): Method for data.table objects
library(data.table) DT <- data.table( id = 1:6, keep = c(TRUE, FALSE, TRUE, NA, TRUE, FALSE), sales = c(100, 50, 200, 25, NA, 75) ) # Keep rows where keep == TRUE; report dropped statistics DT2 <- filter_keep(DT, keep == TRUE) # Also report dropped sales value DT3 <- filter_keep(DT, keep == TRUE, stat = sales)library(data.table) DT <- data.table( id = 1:6, keep = c(TRUE, FALSE, TRUE, NA, TRUE, FALSE), sales = c(100, 50, 200, 25, NA, 75) ) # Keep rows where keep == TRUE; report dropped statistics DT2 <- filter_keep(DT, keep == TRUE) # Also report dropped sales value DT3 <- filter_keep(DT, keep == TRUE, stat = sales)
Creates a comprehensive summary of all columns in a data.table, including type, missing values, descriptive statistics, and example values.
get_summary_table(dt, cols = NULL)get_summary_table(dt, cols = NULL)
dt |
A data.table to summarize. |
cols |
Optional character vector of column names to summarize. If
|
A data.table with one row per column containing summary statistics.
summarize_vector() for single-vector summaries,
diagnose_nas() for missing value diagnostics
library(data.table) dt <- data.table( id = 1:100, value = rnorm(100), category = sample(letters[1:5], 100, replace = TRUE) ) get_summary_table(dt) get_summary_table(dt, cols = c("value", "category"))library(data.table) dt <- data.table( id = 1:100, value = rnorm(100), category = sample(letters[1:5], 100, replace = TRUE) ) get_summary_table(dt) get_summary_table(dt, cols = c("value", "category"))
Print Method for audit_clean Objects
## S3 method for class 'audit_clean' print(x, ...)## S3 method for class 'audit_clean' print(x, ...)
x |
An |
... |
Additional arguments (ignored). |
Invisibly returns the input object.
Print Method for compare_dt Objects
## S3 method for class 'compare_dt' print(x, ...)## S3 method for class 'compare_dt' print(x, ...)
x |
A |
... |
Additional arguments (ignored). |
The compare_dt object, invisibly.
Print Method for diagnose_na Objects
## S3 method for class 'diagnose_na' print(x, ...)## S3 method for class 'diagnose_na' print(x, ...)
x |
A |
... |
Additional arguments (ignored). |
The diagnose_na object, invisibly.
Print Method for diagnose_strings Objects
## S3 method for class 'diagnose_strings' print(x, ...)## S3 method for class 'diagnose_strings' print(x, ...)
x |
A |
... |
Additional arguments (ignored). |
Invisibly returns the input object.
Displays a compact summary of join diagnostics.
## S3 method for class 'validate_join' print(x, ...)## S3 method for class 'validate_join' print(x, ...)
x |
A |
... |
Additional arguments (ignored). |
Invisibly returns the input object.
Displays a compact summary of primary key validation results.
## S3 method for class 'validate_pk' print(x, ...)## S3 method for class 'validate_pk' print(x, ...)
x |
A |
... |
Additional arguments (ignored). |
Invisibly returns the input object.
Displays a compact summary of variable relationship validation results.
## S3 method for class 'validate_var_rel' print(x, ...)## S3 method for class 'validate_var_rel' print(x, ...)
x |
A |
... |
Additional arguments (ignored). |
Invisibly returns the input object.
Computes summary statistics for a vector. Handles numeric, character, factor, logical, Date, and other types with appropriate statistics for each.
summarize_vector(x)summarize_vector(x)
x |
A vector to summarize. |
A named character vector with summary statistics including: type, unique count, missing count, most frequent value (for non-numeric), mean, sd, min, quartiles (q25, q50, q75), max, and three example values.
summarize_vector(c(1, 2, 3, NA, 5)) summarize_vector(c("a", "b", "a", "c"))summarize_vector(c(1, 2, 3, NA, 5)) summarize_vector(c("a", "b", "a", "c"))
Returns the summary table from a validate_join object.
## S3 method for class 'validate_join' summary(object, ...)## S3 method for class 'validate_join' summary(object, ...)
object |
A |
... |
Additional arguments (ignored). |
Invisibly returns the summary data.table.
Analyzes a potential join between two data.tables without performing the full join between original tables. Reports relationship type (one-to-one, one-to-many, etc.), match rates, duplicate keys, and unmatched rows. Optionally tracks a numeric statistic column through the join to quantify impact.
validate_join( x, y, by = NULL, by.x = NULL, by.y = NULL, stat = NULL, stat.x = NULL, stat.y = NULL )validate_join( x, y, by = NULL, by.x = NULL, by.y = NULL, stat = NULL, stat.x = NULL, stat.y = NULL )
x |
A data.table (left table). |
y |
A data.table (right table). |
by |
Character vector of column names to join on (used for both tables). |
by.x |
Character vector of column names in |
by.y |
Character vector of column names in |
stat |
Character string naming a numeric column present in both tables. Reports total, matched, and unmatched sums for each table. |
stat.x |
Character string naming a numeric column in |
stat.y |
Character string naming a numeric column in |
An S3 object of class validate_join containing:
Names of the input tables from the original call
Key columns used for the join
List with row counts, match rates, and overlap statistics
When stat, stat.x, or stat.y is provided, a list with
elements stat_col_x and/or stat_col_y (column names) and sublists
x and/or y each containing total, matched, only, rate, and
n_na. NULL when no stat is provided.
List with duplicate key information for each table
A data.table summarizing the join diagnostics
Character string: "one-to-one", "one-to-many", "many-to-one", "many-to-many", or "no matches"
Keys present in x but not in y
Keys present in y but not in x
validate_primary_keys() for key uniqueness validation,
validate_var_relationship() for variable relationship analysis,
compare_datatables() for structural comparison
library(data.table) dt1 <- data.table(id = c(1, 2, 3, 3), value = c("a", "b", "c", "d")) dt2 <- data.table(id = c(2, 3, 4), score = c(10, 20, 30)) result <- validate_join(dt1, dt2, by = "id") print(result) # Track a numeric column through the join orders <- data.table(id = 1:4, revenue = c(100, 200, 300, 400)) products <- data.table(id = 2:5, cost = c(10, 20, 30, 40)) validate_join(orders, products, by = "id", stat.x = "revenue", stat.y = "cost")library(data.table) dt1 <- data.table(id = c(1, 2, 3, 3), value = c("a", "b", "c", "d")) dt2 <- data.table(id = c(2, 3, 4), score = c(10, 20, 30)) result <- validate_join(dt1, dt2, by = "id") print(result) # Track a numeric column through the join orders <- data.table(id = 1:4, revenue = c(100, 200, 300, 400)) products <- data.table(id = 2:5, cost = c(10, 20, 30, 40)) validate_join(orders, products, by = "id", stat.x = "revenue", stat.y = "cost")
Tests whether a set of columns constitute primary keys of a data.table, i.e., whether they uniquely identify every row in the table.
validate_primary_keys(dt, keys)validate_primary_keys(dt, keys)
dt |
A data.table. |
keys |
Character vector of column names to test as primary keys. |
A warning is issued if any key column is numeric (double), as floating-point values can cause unexpected behavior in exact matching operations.
An S3 object of class validate_pk containing:
Name of the input table from the original call
Character vector of column names tested
Logical: TRUE if keys uniquely identify all rows
Total number of rows in the table
Number of distinct key combinations
Number of key combinations that appear more than once
A data.table of duplicated key values with their counts
Logical: TRUE if any key column is of type double
validate_join() for join relationship analysis,
validate_var_relationship() for variable relationship analysis
library(data.table) dt <- data.table( id = c(1L, 2L, 3L, 4L), group = c("A", "A", "B", "B"), value = c(10, 20, 30, 40) ) # Single column that IS a primary key validate_primary_keys(dt, "id") # Single column that is NOT a primary key validate_primary_keys(dt, "group") # Composite key that IS a primary key validate_primary_keys(dt, c("group", "id"))library(data.table) dt <- data.table( id = c(1L, 2L, 3L, 4L), group = c("A", "A", "B", "B"), value = c(10, 20, 30, 40) ) # Single column that IS a primary key validate_primary_keys(dt, "id") # Single column that is NOT a primary key validate_primary_keys(dt, "group") # Composite key that IS a primary key validate_primary_keys(dt, c("group", "id"))
Determines the relationship between two variables in a data.table: one-to-one, one-to-many, many-to-one, or many-to-many.
validate_var_relationship(dt, var1, var2)validate_var_relationship(dt, var1, var2)
dt |
A data.table. |
var1 |
Character string: name of the first variable. |
var2 |
Character string: name of the second variable. |
This function only accepts variables of type character, integer, or factor. Numeric (double) variables are not allowed due to potential floating-point comparison issues.
The relationship is determined as follows:
one-to-one: Each value of var1 maps to exactly one value of var2, and vice versa.
one-to-many: Each value of var1 maps to exactly one value of var2, but some var2 values map to multiple var1 values.
many-to-one: Some var1 values map to multiple var2 values, but each var2 value maps to exactly one var1 value.
many-to-many: Both variables have values that map to multiple values of the other.
An S3 object of class validate_var_rel containing:
Name of the input table from the original call
Names of the variables analyzed
Character string: "one-to-one", "one-to-many", "many-to-one", or "many-to-many"
Number of distinct values in var1
Number of distinct values in var2
Number of unique (var1, var2) pairs
Logical: does any var1 value map to multiple var2 values?
Logical: does any var2 value map to multiple var1 values?
validate_primary_keys() for key uniqueness validation,
validate_join() for join relationship analysis
library(data.table) dt <- data.table( person_id = c(1L, 2L, 3L, 4L), department = c("Sales", "Sales", "Engineering", "Engineering"), country = c("US", "US", "US", "UK") ) # Many-to-one: multiple persons per department validate_var_relationship(dt, "person_id", "department") # Many-to-many: departments and countries have complex mapping validate_var_relationship(dt, "department", "country")library(data.table) dt <- data.table( person_id = c(1L, 2L, 3L, 4L), department = c("Sales", "Sales", "Engineering", "Engineering"), country = c("US", "US", "US", "UK") ) # Many-to-one: multiple persons per department validate_var_relationship(dt, "person_id", "department") # Many-to-many: departments and countries have complex mapping validate_var_relationship(dt, "department", "country")