| Title: | Pipeline Audit Trails and Data Diagnostics for 'tidyverse' Workflows |
|---|---|
| Description: | Provides pipeline audit trails and data diagnostics for 'tidyverse' workflows. The audit trail system captures lightweight metadata snapshots at each step of a pipeline, building a structured record without storing the data itself. Operation-aware taps enrich snapshots with join match rates and filter drop statistics. Trails can be serialized to 'JSON' or 'RDS' and exported as self-contained 'HTML' visualizations. Also includes diagnostic functions for interactive data analysis including frequency tables, string quality auditing, and data comparison. |
| Authors: | Fernando Cordeiro [aut, cre, cph] |
| Maintainer: | Fernando Cordeiro <[email protected]> |
| License: | LGPL (>= 3) |
| Version: | 0.2.1 |
| Built: | 2026-05-28 02:48:27 UTC |
| Source: | https://github.com/fpcordeiro/tidyaudit |
Computes detailed differences between any two snapshots in an audit trail, including row/column/NA deltas, columns added/removed, type changes, per-column NA changes, and numeric distribution shifts.
audit_diff(.trail, from, to)

## S3 method for class 'audit_diff'
print(x, ...)
.trail |
An |
from |
Label (character) or index (integer) of the first snapshot. |
to |
Label (character) or index (integer) of the second snapshot. |
x |
An |
... |
Additional arguments (currently unused). |
An audit_diff object (S3 list).
Other audit trail:
audit_report(),
audit_tap(),
print.audit_snap(),
tab_tap()
trail <- audit_trail("example")
mtcars |>
  audit_tap(trail, "raw") |>
  dplyr::filter(mpg > 20) |>
  audit_tap(trail, "filtered")
audit_diff(trail, "raw", "filtered")
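To make the kinds of deltas listed above concrete, here is a minimal base R sketch that diffs two metadata summaries. The helpers `snap_of()` and `diff_snaps()` are hypothetical and written only for this illustration; they are not tidyaudit functions and do not reflect how audit_diff() is implemented.

```r
# Hypothetical helpers for illustration only (not part of tidyaudit).
snap_of <- function(df) {
  list(
    nrow  = nrow(df),
    cols  = names(df),
    types = vapply(df, function(col) class(col)[1], character(1)),
    n_na  = colSums(is.na(df))
  )
}

diff_snaps <- function(from, to) {
  shared <- intersect(from$cols, to$cols)
  list(
    row_delta    = to$nrow - from$nrow,
    cols_added   = setdiff(to$cols, from$cols),
    cols_removed = setdiff(from$cols, to$cols),
    type_changes = shared[from$types[shared] != to$types[shared]],
    na_delta     = to$n_na[shared] - from$n_na[shared]
  )
}

before <- data.frame(
  id    = 1:4,
  score = c(1.5, NA, 3, 4),
  grp   = c("a", "b", "a", "b")
)

# Drop one row and one column, change a type, add a column.
after <- before[before$id != 2, c("id", "score")]
after$score <- as.character(after$score)
after$flag  <- TRUE

d <- diff_snaps(snap_of(before), snap_of(after))
str(d)
```

Only summaries are compared here, never the rows themselves, which mirrors the metadata-only approach the package description mentions.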
Produces a standalone HTML file that visualises the audit trail as an interactive pipeline flow diagram. The file is completely self-contained — no server, internet connection, or R installation is required to view it. Open it in any browser.
audit_export(.trail, file = NULL)
.trail |
An |
file |
Path to the output |
The trail is serialised via trail_to_list() and embedded as JSON inside
an HTML template with inline CSS and vanilla JavaScript. The visualisation
features:
Horizontal pipeline flow diagram with colour-coded nodes per operation type (snapshot, join, filter).
Edges annotated with key deltas (match rate, drop \ added).
Clickable nodes expanding to show column schema, operation
diagnostics, and custom .fns results.
Clickable edges showing the full diff between adjacent snapshots.
Light / dark theme toggle.
Collapsible JSON export panel.
The file path (character), invisibly.
trail_to_list(), write_trail()
Other audit export:
read_trail(),
trail_to_df(),
trail_to_list(),
write_trail()
trail <- audit_trail("demo")
mtcars |> audit_tap(trail, "raw")
dplyr::filter(mtcars, mpg > 20) |> audit_tap(trail, "filtered")
audit_export(trail, tempfile(fileext = ".html"))
Prints a full audit report for a trail, including the trail summary, all diffs between consecutive snapshots, custom diagnostic results, and a final data profile.
audit_report(.trail, format = "console")
.trail |
An |
format |
Report format. Currently only |
.trail, invisibly.
Other audit trail:
audit_diff(),
audit_tap(),
print.audit_snap(),
tab_tap()
trail <- audit_trail("example")
mtcars |>
  audit_tap(trail, "raw") |>
  dplyr::filter(mpg > 20) |>
  audit_tap(trail, "filtered")
audit_report(trail)
Transparent pipe pass-through that captures a metadata snapshot and appends
it to an audit trail. Returns .data unchanged — the function's only purpose
is its side effect on .trail.
audit_tap(
  .data,
  .trail,
  .label = NULL,
  .fns = NULL,
  .numeric_summary = TRUE,
  .cols_include = NULL,
  .cols_exclude = NULL
)
.data |
A data.frame or tibble flowing through the pipe. |
.trail |
An |
.label |
Optional character label for this snapshot. If |
.fns |
Optional named list of diagnostic functions (or formula lambdas)
to run on |
.numeric_summary |
Logical. If |
.cols_include |
Character vector of column names to include in the
snapshot schema, or |
.cols_exclude |
Character vector of column names to exclude from the
snapshot schema, or |
.data, unchanged, returned invisibly. The function is a
transparent pass-through; its only effect is the side effect on .trail.
Other audit trail:
audit_diff(),
audit_report(),
print.audit_snap(),
tab_tap()
trail <- audit_trail("example")
result <- mtcars |>
  audit_tap(trail, "raw") |>
  dplyr::filter(mpg > 20) |>
  audit_tap(trail, "filtered")
print(trail)
Applies a transformation function to a vector and reports what changed. Works with any vector type: character, numeric, Date/POSIXct, factor, or logical. Diagnostics are adapted to the detected input type.
audit_transform(
  x,
  clean_fn,
  name = NULL,
  .tolerance = sqrt(.Machine$double.eps)
)

## S3 method for class 'audit_transform'
print(x, ...)
x |
Vector to transform. Accepted types: character, numeric, Date, POSIXct, factor, or logical. |
clean_fn |
A function applied to |
name |
Optional name for the variable (used in output). If |
.tolerance |
Numeric tolerance used for the "changed beyond tolerance"
diagnostic (numeric type only). Defaults to |
... |
Additional arguments (currently unused). |
An S3 object of class audit_transform containing:
Name of the variable
Name of the transformation function, or
"<pre-computed>" when a vector was supplied directly
Detected type: "character", "numeric",
"Date", "POSIXct", "factor", or "logical"
Total number of elements
Count of values that changed (including NA status changes)
Count of values that stayed the same
Count of NA values before transformation
Count of NA values after transformation
Percentage of total elements that changed
Data frame with before/after pairs (up to 10)
Type-specific diagnostic list, or NULL for character
The transformed vector, retaining its type
Other data quality:
diagnose_nas(),
diagnose_strings(),
get_summary_table(),
summarize_column(),
tab()
# Character
x <- c(" hello ", "WORLD", " foo ", NA)
result <- audit_transform(x, trimws)
result$cleaned

# Numeric
prices <- c(10.5, 20.0, NA, 30.0)
audit_transform(prices, function(v) round(v))

# Pre-computed result
audit_transform(prices, round(prices))
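The "changed" count described above includes NA status changes. One way such a count could be computed in base R is sketched below; this is an illustration under that reading of the documentation, not the package's actual code, and the variable names are made up for the example.

```r
x <- c(" hello ", "WORLD", " foo ", NA)
cleaned <- trimws(x)

# A value counts as changed if its NA status flipped,
# or if both sides are non-NA and differ.
na_flip    <- is.na(x) != is.na(cleaned)
value_diff <- !is.na(x) & !is.na(cleaned) & x != cleaned
changed    <- na_flip | value_diff

n_changed   <- sum(changed)
n_unchanged <- sum(!changed)
pct_changed <- 100 * n_changed / length(x)

# Before/after pairs for the changed elements.
data.frame(before = x[changed], after = cleaned[changed])
```

Guarding the comparison with `!is.na()` on both sides keeps an element that is NA before and after from being counted as changed.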
Compares two data.frames or tibbles by examining column names, row counts, key overlap, numeric discrepancies, and categorical discrepancies. Useful for validating data processing pipelines.
compare_tables(
  x,
  y,
  key_cols = NULL,
  tol = .Machine$double.eps,
  top_n = Inf,
  compare_cols = NULL,
  exclude_cols = NULL,
  on_non_unique = c("warn", "stop")
)

## S3 method for class 'compare_tbl'
print(x, show_n = 5L, ...)

## S3 method for class 'compare_tbl'
as.data.frame(x, row.names = NULL, optional = FALSE, ...)
x |
First data.frame or tibble to compare. |
y |
Second data.frame or tibble to compare. |
key_cols |
Character vector of column names to use as keys for matching
rows. If |
tol |
Numeric tolerance for comparing numeric columns. Differences
less than or equal to |
top_n |
Maximum number of row-level discrepancies to store per
column (numeric and categorical), and maximum unmatched keys to store.
Defaults to |
compare_cols |
Character vector of column names to compare. If |
exclude_cols |
Character vector of column names to exclude from
comparison. If |
on_non_unique |
What to do when the chosen |
show_n |
Maximum number of rows to display for discrepancies and
unmatched keys in the printed output. Defaults to |
... |
Additional arguments (currently unused). |
row.names |
Passed to |
optional |
Passed to |
An S3 object of class compare_tbl containing:
Names of the compared objects
Column names present in both tables
Column names only in x
Column names only in y
Data.frame of columns with different types, or NULL
Number of rows in x
Number of rows in y
List summarising the chosen keys and their overlap, or
NULL if no keys could be determined. Fields: keys, auto (logical),
x_unique, y_unique, matches, only_x, only_y, is_pk_x,
is_pk_y (logical: do keys uniquely identify rows in each table),
n_dup_combos_x, n_dup_combos_y (number of key combinations appearing
more than once), has_na_keys_x, has_na_keys_y (NA values present in
any key column).
Data.frame of numeric discrepancy quantiles (with
n_over_tol count), or NULL
How columns were compared ("keys",
"row_index", or NA)
Number of rows matched on keys
The tolerance used
The top_n used
Data.frame of row-level numeric discrepancies
exceeding tol (or where one side is NA), with key columns (or
row_index), column, value_x, value_y, abs_diff, and
pct_diff (relative difference as a proportion). NULL if none.
Data.frame with column, n_compared,
n_mismatched, pct_mismatched (proportion, 0–1), n_na_mismatch,
or NULL
Data.frame of row-level categorical
discrepancies with key columns (or row_index), column, value_x,
value_y. NULL if none.
Total number of cell-level discrepancies
across all column types (not limited by top_n)
Data.frame of key combinations only in x (up to
top_n rows), or NULL
Data.frame of key combinations only in y (up to
top_n rows), or NULL
List with only_x, only_y, matched_no_disc,
matched_with_disc, pct_no_disc (proportion, 0–1),
pct_with_disc (proportion, 0–1)
Use as.data.frame() to extract all discrepancies (numeric and categorical)
as a single tidy data.frame.
Other join validation:
validate_join(),
validate_primary_keys(),
validate_var_relationship()
x <- data.frame(id = 1:3, value = c(10.0, 20.0, 30.0))
y <- data.frame(id = 1:3, value = c(10.1, 20.0, 30.5))
compare_tables(x, y)

# With tolerance: differences <= 0.15 are considered equal
compare_tables(x, y, tol = 0.15)

# Categorical columns are also compared
a <- data.frame(id = 1:3, status = c("ok", "warn", "fail"),
                stringsAsFactors = FALSE)
b <- data.frame(id = 1:3, status = c("ok", "warn", "error"),
                stringsAsFactors = FALSE)
compare_tables(a, b)
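The tolerance rule (a numeric difference counts as a discrepancy when it exceeds tol, or when exactly one side is NA) can be sketched in base R as follows. `flag_numeric_diffs()` is a hypothetical helper written for this illustration, handling a single key and a single column; it is not how compare_tables() is implemented.

```r
x <- data.frame(id = 1:3, value = c(10.0, 20.0, 30.0))
y <- data.frame(id = 1:3, value = c(10.1, 20.0, 30.5))

# Hypothetical helper: one key column, one numeric column.
flag_numeric_diffs <- function(x, y, key, col, tol = .Machine$double.eps) {
  merged <- merge(x, y, by = key, suffixes = c("_x", "_y"))
  vx <- merged[[paste0(col, "_x")]]
  vy <- merged[[paste0(col, "_y")]]
  abs_diff <- abs(vx - vy)

  one_side_na <- xor(is.na(vx), is.na(vy))
  over_tol    <- !is.na(abs_diff) & abs_diff > tol
  keep        <- one_side_na | over_tol

  data.frame(
    key      = merged[[key]][keep],
    value_x  = vx[keep],
    value_y  = vy[keep],
    abs_diff = abs_diff[keep]
  )
}

strict <- flag_numeric_diffs(x, y, "id", "value")
loose  <- flag_numeric_diffs(x, y, "id", "value", tol = 0.15)
```

With the default tolerance, rows 1 and 3 are flagged; with `tol = 0.15`, only row 3 remains because the 0.1 difference in row 1 falls within tolerance.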
Reports NA counts and percentages for each column in a data.frame, sorted by missing percentage in descending order.
diagnose_nas(.data)

## S3 method for class 'diagnose_na'
print(x, ...)
.data |
A data.frame or tibble to diagnose. |
x |
An object to print. |
... |
Additional arguments (currently unused). |
An S3 object of class diagnose_na containing:
A data.frame with columns variable, n_na, pct_na, and
n_valid, sorted by pct_na descending.
Total number of columns in the input.
Number of columns that have at least one NA.
Other data quality:
audit_transform(),
diagnose_strings(),
get_summary_table(),
summarize_column(),
tab()
df <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, "x"),
  c = c(TRUE, FALSE, TRUE)
)
diagnose_nas(df)
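For intuition, a table of the kind described above (variable, n_na, pct_na, n_valid, sorted by pct_na descending) can be approximated in base R. This is a sketch only: it assumes pct_na is expressed on a 0 to 100 scale, which the documentation does not state explicitly, and it is not the package's implementation.

```r
df <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, "x"),
  c = c(TRUE, FALSE, TRUE)
)

n_na <- unname(colSums(is.na(df)))

na_table <- data.frame(
  variable = names(df),
  n_na     = n_na,
  pct_na   = 100 * n_na / nrow(df),  # assumption: percentage scale
  n_valid  = nrow(df) - n_na
)

# Sort by missing percentage, highest first.
na_table <- na_table[order(-na_table$pct_na), ]
n_cols_with_na <- sum(na_table$n_na > 0)
na_table
```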
Audits a character vector for common data quality issues including missing values, empty strings, whitespace problems, non-ASCII characters, and case inconsistencies. Requires the stringi package (in Suggests).
diagnose_strings(x, name = NULL)

## S3 method for class 'diagnose_strings'
print(x, ...)
x |
Character vector to diagnose. |
name |
Optional name for the variable (used in output). If |
... |
Additional arguments (currently unused). |
An S3 object of class diagnose_strings containing:
Name of the variable
Total number of elements
Count of NA values
Count of empty strings
Count of whitespace-only strings
Count of strings with leading whitespace
Count of strings with trailing whitespace
Count of strings with non-ASCII characters
Number of unique values with case variants
Number of groups of case-insensitive duplicates
Data.frame with examples of case variants
Other data quality:
audit_transform(),
diagnose_nas(),
get_summary_table(),
summarize_column(),
tab()
firms <- c("Apple", "APPLE", "apple", " Microsoft ", "Google", NA, "")
diagnose_strings(firms)
Filters a data.frame or tibble by DROPPING rows where the conditions are TRUE, while reporting statistics about dropped rows and optionally the sum of a statistic column that was dropped.
filter_drop(.data, ...)

## S3 method for class 'data.frame'
filter_drop(.data, ..., .stat = NULL, .quiet = FALSE, .warn_threshold = NULL)
.data |
A data.frame, tibble, or other object. |
... |
Filter conditions specifying rows to DROP, evaluated in the
context of |
.stat |
An unquoted column or expression to total, e.g., |
.quiet |
Logical. If |
.warn_threshold |
Numeric between 0 and 1. If set and the proportion of dropped rows exceeds this threshold, a warning is issued. |
The filtered data.frame or tibble.
filter_drop(data.frame): Method for data.frame objects
Other filter diagnostics:
filter_keep()
df <- data.frame(
  id = 1:5,
  bad = c(FALSE, TRUE, FALSE, TRUE, FALSE),
  sales = 10:14
)
filter_drop(df, bad == TRUE)
filter_drop(df, bad == TRUE, .stat = sales)
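The statistics reported for a dropping filter (rows dropped, share dropped, and the total of a statistic column among dropped rows) can be reproduced by hand in base R. The sketch below uses the same data as the example above, which contains no NA conditions, and makes no claim about how filter_drop() computes or formats its output.

```r
df <- data.frame(
  id = 1:5,
  bad = c(FALSE, TRUE, FALSE, TRUE, FALSE),
  sales = 10:14
)

drop_mask <- df$bad == TRUE

n_dropped     <- sum(drop_mask)             # rows removed
share_dropped <- mean(drop_mask)            # proportion of rows removed
stat_dropped  <- sum(df$sales[drop_mask])   # sales total among removed rows

kept <- df[!drop_mask, ]

# A threshold check in the spirit of .warn_threshold (hypothetical value).
threshold <- 0.5
exceeds_threshold <- share_dropped > threshold
```

Here two of five rows are dropped (a share of 0.4), carrying 24 of the sales total with them, so a threshold of 0.5 would not be exceeded.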
Filters a data.frame or tibble while reporting statistics about dropped rows
and optionally the sum of a statistic column that was dropped. Keeps rows
where the conditions are TRUE (same as dplyr::filter()).
filter_keep(.data, ...)

## S3 method for class 'data.frame'
filter_keep(.data, ..., .stat = NULL, .quiet = FALSE, .warn_threshold = NULL)
.data |
A data.frame, tibble, or other object. |
... |
Filter conditions, evaluated in the context of |
.stat |
An unquoted column or expression to total, e.g., |
.quiet |
Logical. If |
.warn_threshold |
Numeric between 0 and 1. If set and the proportion of dropped rows exceeds this threshold, a warning is issued. |
The filtered data.frame or tibble.
filter_keep(data.frame): Method for data.frame objects
Other filter diagnostics:
filter_drop()
df <- data.frame(
  id = 1:6,
  keep = c(TRUE, FALSE, TRUE, NA, TRUE, FALSE),
  sales = c(100, 50, 200, 25, NA, 75)
)
filter_keep(df, keep == TRUE)
filter_keep(df, keep == TRUE, .stat = sales)
Performs a diagnostic filter AND records filter diagnostics in an audit trail.
filter_tap() keeps matching rows (like dplyr::filter()),
filter_out_tap() drops matching rows (the inverse).
filter_tap(
  .data,
  ...,
  .trail = NULL,
  .label = NULL,
  .stat = NULL,
  .quiet = FALSE,
  .warn_threshold = NULL,
  .numeric_summary = TRUE,
  .cols_include = NULL,
  .cols_exclude = NULL
)

filter_out_tap(
  .data,
  ...,
  .trail = NULL,
  .label = NULL,
  .stat = NULL,
  .quiet = FALSE,
  .warn_threshold = NULL,
  .numeric_summary = TRUE,
  .cols_include = NULL,
  .cols_exclude = NULL
)
.data |
A data.frame or tibble. |
... |
Filter conditions, evaluated in the context of |
.trail |
An |
.label |
Optional character label for this snapshot. If |
.stat |
An unquoted column or expression to total, e.g., |
.quiet |
Logical. If |
.warn_threshold |
Numeric between 0 and 1. If set and the proportion of dropped rows exceeds this threshold, a warning is issued. |
.numeric_summary |
Logical. If |
.cols_include |
Character vector of column names to include in the
snapshot schema, or |
.cols_exclude |
Character vector of column names to exclude from the
snapshot schema, or |
When .trail is NULL:
No diagnostic args: plain dplyr::filter() / dplyr::filter_out()
Diagnostic args provided: delegates to filter_keep() /
filter_drop() (prints diagnostics but no trail recording)
.label provided: warns that label is ignored
The filtered data.frame or tibble.
Other operation taps:
join_tap
df <- data.frame(id = 1:10, amount = 1:10 * 100, flag = rep(c(TRUE, FALSE), 5))

# With trail
trail <- audit_trail("filter_example")
result <- df |>
  audit_tap(trail, "raw") |>
  filter_tap(amount > 300, .trail = trail, .label = "big_only")
print(trail)

# Inverse: drop matching rows
trail2 <- audit_trail("filter_out_example")
result2 <- df |>
  audit_tap(trail2, "raw") |>
  filter_out_tap(flag == FALSE, .trail = trail2, .label = "flagged_only")
print(trail2)

# Without trail (plain filter)
result3 <- filter_tap(df, amount > 300)
Creates a comprehensive summary of all columns in a data.frame, including type, missing values, descriptive statistics, and example values.
get_summary_table(.data, cols = NULL)
.data |
A data.frame or tibble to summarize. |
cols |
Optional character vector of column names to summarize. If
|
A data.frame with one row per column containing summary statistics.
Other data quality:
audit_transform(),
diagnose_nas(),
diagnose_strings(),
summarize_column(),
tab()
df <- data.frame(
  id = 1:100,
  value = rnorm(100),
  category = sample(letters[1:5], 100, replace = TRUE)
)
get_summary_table(df)
Performs a dplyr join AND records enriched diagnostics in an audit trail.
These functions replace the pattern of wrapping a join with two
audit_tap() calls, capturing information that plain taps cannot:
match rates, relationship type, duplicate keys, and unmatched row counts.
left_join_tap(
  .data,
  y,
  ...,
  .trail = NULL,
  .label = NULL,
  .stat = NULL,
  .numeric_summary = TRUE,
  .cols_include = NULL,
  .cols_exclude = NULL
)

right_join_tap(
  .data,
  y,
  ...,
  .trail = NULL,
  .label = NULL,
  .stat = NULL,
  .numeric_summary = TRUE,
  .cols_include = NULL,
  .cols_exclude = NULL
)

inner_join_tap(
  .data,
  y,
  ...,
  .trail = NULL,
  .label = NULL,
  .stat = NULL,
  .numeric_summary = TRUE,
  .cols_include = NULL,
  .cols_exclude = NULL
)

full_join_tap(
  .data,
  y,
  ...,
  .trail = NULL,
  .label = NULL,
  .stat = NULL,
  .numeric_summary = TRUE,
  .cols_include = NULL,
  .cols_exclude = NULL
)

anti_join_tap(
  .data,
  y,
  ...,
  .trail = NULL,
  .label = NULL,
  .stat = NULL,
  .numeric_summary = TRUE,
  .cols_include = NULL,
  .cols_exclude = NULL
)

semi_join_tap(
  .data,
  y,
  ...,
  .trail = NULL,
  .label = NULL,
  .stat = NULL,
  .numeric_summary = TRUE,
  .cols_include = NULL,
  .cols_exclude = NULL
)
.data |
A data.frame or tibble (left table in the join). |
y |
A data.frame or tibble (right table in the join). |
... |
Arguments passed to the corresponding |
.trail |
An |
.label |
Optional character label for this snapshot. If |
.stat |
An unquoted column name for stat tracking, e.g., |
.numeric_summary |
Logical. If |
.cols_include |
Character vector of column names to include in the
snapshot schema, or |
.cols_exclude |
Character vector of column names to exclude from the
snapshot schema, or |
Enriched diagnostics (match rates, relationship type, duplicate keys) require
equality joins — by as a character vector, named character vector, or
simple equality join_by() expression (e.g., join_by(id),
join_by(a == b)). For non-equi join_by() expressions, the tap records
a basic snapshot without match-rate diagnostics.
All dplyr join features (join_by, multiple, unmatched, suffix, etc.)
work unchanged via ....
When .trail is NULL:
.stat also NULL: plain dplyr join
.stat provided: prints validate_join() diagnostics, then joins
.label provided: warns that label is ignored
The joined data.frame or tibble (same as the corresponding
dplyr::*_join()).
Other operation taps:
filter_tap()
orders <- data.frame(id = 1:4, amount = c(100, 200, 300, 400))
customers <- data.frame(id = c(2, 3, 5), name = c("A", "B", "C"))

# With trail
trail <- audit_trail("join_example")
result <- orders |>
  audit_tap(trail, "raw") |>
  left_join_tap(customers, by = "id", .trail = trail, .label = "joined")
print(trail)

# Without trail (plain join)
result2 <- left_join_tap(orders, customers, by = "id")
Creates an audit trail object that captures metadata snapshots at each step
of a data pipeline. The trail uses environment-based reference semantics so
it can be modified in place inside pipes via audit_tap().
## S3 method for class 'audit_snap'
print(x, ...)

audit_trail(name = NULL)

## S3 method for class 'audit_trail'
print(x, show_custom = TRUE, ...)
x |
An object to print. |
... |
Additional arguments (currently unused). |
name |
Optional name for the trail. If |
show_custom |
Logical. If |
An audit_trail object (S3 class wrapping an environment).
Other audit trail:
audit_diff(),
audit_report(),
audit_tap(),
tab_tap()
trail <- audit_trail("my_analysis")
print(trail)
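The environment-based reference semantics mentioned above are what allow a tap to record a snapshot from inside a pipe without the trail being reassigned. The base R sketch below illustrates that general idea with two hypothetical stand-ins, `new_trail()` and `tap()`; they are not tidyaudit's functions and store far less than a real snapshot would.

```r
# Hypothetical stand-ins for illustration only (not part of tidyaudit).
new_trail <- function(name) {
  trail <- new.env()
  trail$name <- name
  trail$snapshots <- list()
  trail
}

tap <- function(data, trail, label) {
  # Environments are modified in place, so this change is visible
  # to the caller even though `trail` is never returned.
  trail$snapshots[[label]] <- list(nrow = nrow(data), ncol = ncol(data))
  invisible(data)  # pass the data through unchanged
}

trail <- new_trail("sketch")
result <- mtcars |>
  tap(trail, "raw") |>
  subset(mpg > 20) |>
  tap(trail, "filtered")

# Row counts recorded at each step.
vapply(trail$snapshots, function(s) s$nrow, integer(1))
```

Had `trail` been an ordinary list, each call to `tap()` would have modified a local copy and the recorded snapshots would have been lost when the function returned.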
Restores an audit_trail() previously saved with write_trail(). The
file format is detected automatically from the file extension (.rds for
RDS, .json for JSON), or can be specified explicitly via format.
read_trail(file, format = NULL)
file |
Path to an RDS or JSON file created by |
format |
One of |
A reconstructed audit_trail() object with all S3 classes
restored.
Other audit export:
audit_export(),
trail_to_df(),
trail_to_list(),
write_trail()
trail <- audit_trail("example")
mtcars |> audit_tap(trail, "raw")
tmp <- tempfile(fileext = ".rds")
write_trail(trail, tmp)
restored <- read_trail(tmp)
print(restored)
Computes summary statistics for a vector. Handles numeric, character, factor, logical, Date, and other types with appropriate statistics for each.
summarize_column(x)
x |
A vector to summarize. |
A named character vector with summary statistics including: type, unique count, missing count, missing share (proportion from 0 to 1), most frequent value (for non-numeric), mean, sd, min, quartiles (q25, q50, q75), max, and three example values.
Other data quality:
audit_transform(),
diagnose_nas(),
diagnose_strings(),
get_summary_table(),
tab()
summarize_column(c(1, 2, 3, NA, 5))
summarize_column(c("a", "b", "a", "c"))
Produces one-way frequency tables or two-way crosstabulations. One variable gives counts, percentages, and cumulative percentages; two variables give a crosstabulation matrix with row/column totals.
tab(
  .data,
  ...,
  .wt = NULL,
  .sort = c("value_asc", "value_desc", "freq_desc", "freq_asc"),
  .cutoff = NULL,
  .na = c("include", "exclude", "only"),
  .display = c("count", "row_pct", "col_pct", "total_pct")
)

## S3 method for class 'tidyaudit_tab'
print(x, ...)

## S3 method for class 'tidyaudit_tab'
as.data.frame(x, row.names = NULL, optional = FALSE, ...)
.data |
A data.frame or tibble. |
... |
Additional arguments (currently unused). |
.wt |
Optional unquoted column to use as frequency weights. When supplied, frequencies are weighted sums instead of row counts. |
.sort |
How to order the rows (and columns in two-way tables).
|
.cutoff |
Controls how many values to display.
An integer >= 1 keeps the top-N values by frequency.
A number in (0, 1) keeps values that cumulatively account for that
proportion of the total. Remaining values are grouped under |
.na |
How to handle |
.display |
Cell contents for two-way crosstabulations. One of
|
x |
A |
row.names |
Passed to |
optional |
Passed to |
An S3 object of class tidyaudit_tab. Use as.data.frame() to
extract the underlying table.
Other data quality:
audit_transform(),
diagnose_nas(),
diagnose_strings(),
get_summary_table(),
summarize_column()
tab(mtcars, cyl)
tab(mtcars, cyl, .sort = "freq_desc")
tab(mtcars, cyl, gear)
tab(mtcars, cyl, gear, .display = "row_pct")
tab(mtcars, cyl, .wt = mpg)
tab(mtcars, cyl, .cutoff = 2)
Transparent pipe pass-through that runs tab() on the data and stores the
result as a custom diagnostic annotation in the audit trail snapshot.
Returns .data unchanged.
tab_tap(
  .data,
  ...,
  .trail,
  .label = NULL,
  .wt = NULL,
  .sort = c("value_asc", "value_desc", "freq_desc", "freq_asc"),
  .cutoff = NULL,
  .na = c("include", "exclude", "only"),
  .display = c("count", "row_pct", "col_pct", "total_pct"),
  .numeric_summary = TRUE,
  .cols_include = NULL,
  .cols_exclude = NULL
)
.data |
A data.frame or tibble. |
... |
Additional arguments (currently unused). |
.trail |
An |
.label |
Character label for this snapshot. |
.wt |
Optional unquoted column to use as frequency weights. When supplied, frequencies are weighted sums instead of row counts. |
.sort |
How to order the rows (and columns in two-way tables).
|
.cutoff |
Controls how many values to display.
An integer >= 1 keeps the top-N values by frequency.
A number in (0, 1) keeps values that cumulatively account for that
proportion of the total. Remaining values are grouped under |
.na |
How to handle |
.display |
Cell contents for two-way crosstabulations. One of
|
.numeric_summary |
Logical. If |
.cols_include |
Character vector of column names to include in the
snapshot schema, or |
.cols_exclude |
Character vector of column names to exclude from the
snapshot schema, or |
.data, unchanged, returned invisibly.
Other audit trail:
audit_diff(),
audit_report(),
audit_tap(),
print.audit_snap()
trail <- audit_trail("example")
result <- mtcars |>
  tab_tap(cyl, .trail = trail, .label = "by_cyl") |>
  dplyr::filter(mpg > 20) |>
  tab_tap(cyl, .trail = trail, .label = "by_cyl_filtered")
print(trail)
Returns a plain data.frame with one row per snapshot. Nested fields
(all_columns, schema, numeric_summary, changes, diagnostics,
custom, pipeline, controls) become list-columns. Trail metadata is
stored as attributes on the result.
trail_to_df(.trail)
.trail |
An |
A data.frame with columns index, label, type, timestamp,
nrow, ncol, total_nas, all_columns, schema, numeric_summary,
changes, diagnostics, custom, pipeline, and controls. Trail
name and created_at are stored as attributes "trail_name" and
"created_at".
Other audit export:
audit_export(),
read_trail(),
trail_to_list(),
write_trail()
trail <- audit_trail("example")
mtcars |> audit_tap(trail, "raw")
dplyr::filter(mtcars, mpg > 20) |> audit_tap(trail, "filtered")
df <- trail_to_df(trail)
print(df)
attr(df, "trail_name")
Converts an audit_trail() object to a plain R list suitable for
serialisation with jsonlite::toJSON(). All POSIXct timestamps are
converted to ISO 8601 character strings and data.frames are converted
to lists of named rows for JSON compatibility.
trail_to_list(.trail)
.trail |
An |
A named list with elements name, created_at (ISO 8601 string),
n_snapshots, and snapshots (a named list keyed by snapshot label).
Other audit export:
audit_export(),
read_trail(),
trail_to_df(),
write_trail()
trail <- audit_trail("example")
mtcars |> audit_tap(trail, "raw")
lst <- trail_to_list(trail)
str(lst, max.level = 2)
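The two conversions described above (POSIXct to an ISO 8601 string, and a data.frame to a list of named rows) can be sketched in base R as follows. This is an illustration of the general technique, not the package's code; the exact timestamp format that trail_to_list() emits is an assumption here, and the small schema data frame is made up.

```r
# Assumed ISO 8601 layout in UTC; the package's exact format may differ.
ts <- as.POSIXct("2024-01-01 12:30:00", tz = "UTC")
iso <- format(ts, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Made-up two-row table standing in for a nested snapshot field.
schema <- data.frame(
  column = c("mpg", "cyl"),
  type = c("numeric", "numeric"),
  stringsAsFactors = FALSE
)

# One named list per row, which maps naturally onto a JSON array of objects.
rows <- lapply(seq_len(nrow(schema)), function(i) {
  as.list(schema[i, , drop = FALSE])
})
```

Row-wise lists serialise as an array of objects, which tends to be easier for non-R tools to consume than column-wise arrays.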
Analyzes a potential join between two data.frames or tibbles without performing the full join. Reports relationship type (one-to-one, one-to-many, etc.), match rates, duplicate keys, and unmatched rows. Optionally tracks a numeric statistic column through the join to quantify impact.
validate_join(x, y, by = NULL, stat = NULL, stat_x = NULL, stat_y = NULL)

## S3 method for class 'validate_join'
print(x, ...)

## S3 method for class 'validate_join'
summary(object, ...)
x |
A data.frame or tibble (left table). |
y |
A data.frame or tibble (right table). |
by |
A character vector of column names to join on. Use a named vector
|
stat |
Optional single column name (string) to track in both tables when
the column name is the same. Ignored if |
stat_x |
Optional column name (string) for a numeric statistic in |
stat_y |
Optional column name (string) for a numeric statistic in |
... |
Additional arguments (currently unused). |
object |
A |
An S3 object of class validate_join containing:
Names of the input tables from the original call
Key columns used for the join
List with row counts, match rates, and overlap statistics
When stat, stat_x, or stat_y is provided, a list with
stat diagnostics per table. NULL when no stat is provided.
List with duplicate key information for each table
A data.frame summarizing the join diagnostics
Character string describing the relationship
Unmatched keys from x
Unmatched keys from y
Other join validation:
compare_tables(),
validate_primary_keys(),
validate_var_relationship()
x <- data.frame(id = c(1L, 2L, 3L, 3L), value = c("a", "b", "c", "d"))
y <- data.frame(id = c(2L, 3L, 4L), score = c(10, 20, 30))
result <- validate_join(x, y, by = "id")
print(result)

# Track a stat column with different names in each table
x2 <- data.frame(id = 1:3, sales = c(100, 200, 300))
y2 <- data.frame(id = 2:4, cost = c(10, 20, 30))
validate_join(x2, y2, by = "id", stat_x = "sales", stat_y = "cost")
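To make the idea of analysing a join without performing it concrete, here is a minimal base-R sketch that derives the relationship type, match rates, and unmatched keys from the key columns alone. The helper sketch_join_check() is hypothetical and is not the package's implementation; it handles a single key column only, and the relationship labels (read as x-to-y) are an assumed convention.

```r
# Hypothetical helper: key-overlap diagnostics without merging the tables.
sketch_join_check <- function(x, y, by) {
  kx <- x[[by]]
  ky <- y[[by]]

  # Duplicate keys on either side determine the relationship type.
  dup_x <- anyDuplicated(kx) > 0
  dup_y <- anyDuplicated(ky) > 0
  relationship <- if (!dup_x && !dup_y) {
    "one-to-one"
  } else if (dup_x && !dup_y) {
    "many-to-one"
  } else if (!dup_x && dup_y) {
    "one-to-many"
  } else {
    "many-to-many"
  }

  list(
    relationship = relationship,
    match_rate_x = mean(kx %in% ky),  # share of x rows with a match in y
    match_rate_y = mean(ky %in% kx),  # share of y rows with a match in x
    only_x = setdiff(kx, ky),         # keys present only in x
    only_y = setdiff(ky, kx)          # keys present only in y
  )
}

x <- data.frame(id = c(1L, 2L, 3L, 3L), value = c("a", "b", "c", "d"))
y <- data.frame(id = c(2L, 3L, 4L), score = c(10, 20, 30))
res <- sketch_join_check(x, y, by = "id")
```

Because only the key vectors are inspected, the cost stays proportional to the number of keys rather than to the size of the joined result.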
Tests whether a set of columns constitutes a primary key of a data.frame, i.e., whether the columns together uniquely identify every row in the table.
validate_primary_keys(.data, keys)

## S3 method for class 'validate_pk'
print(x, ...)
.data |
A data.frame or tibble. |
keys |
Character vector of column names to test as primary keys. |
x |
An object to print. |
... |
Additional arguments (currently unused). |
An S3 object of class validate_pk containing:
Name of the input table from the original call
Character vector of column names tested
Logical: TRUE if keys uniquely identify all rows AND no key column contains NA values
Total number of rows in the table
Number of distinct key combinations
Number of key combinations that appear more than once
A data.frame of duplicated key values with their counts
Logical: TRUE if any key column is of type double
Logical: TRUE if any key column contains NA values
Named logical vector indicating which key columns contain NAs
Other join validation:
compare_tables(),
validate_join(),
validate_var_relationship()
df <- data.frame(
  id = c(1L, 2L, 3L, 4L),
  group = c("A", "A", "B", "B"),
  value = c(10, 20, 30, 40)
)
validate_primary_keys(df, "id")
validate_primary_keys(df, "group")
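The primary-key condition described above (every key combination is unique and no key column contains NA) can be sketched in base R. The helper sketch_pk_check() and its field names are hypothetical and chosen for illustration; they are not the package's implementation or its return structure.

```r
# Hypothetical helper: uniqueness plus no-NA check on the key columns.
sketch_pk_check <- function(.data, keys) {
  key_df <- .data[, keys, drop = FALSE]

  # Per-column NA flags, named by key column.
  has_na <- vapply(key_df, anyNA, logical(1))

  n_rows <- nrow(key_df)
  n_unique <- nrow(unique(key_df))

  list(
    is_primary_key = n_unique == n_rows && !any(has_na),
    n_rows = n_rows,
    n_unique_keys = n_unique,
    has_na = has_na
  )
}

df <- data.frame(
  id = c(1L, 2L, 3L, 4L),
  group = c("A", "A", "B", "B"),
  value = c(10, 20, 30, 40)
)
pk_id <- sketch_pk_check(df, "id")
pk_group <- sketch_pk_check(df, "group")
```

Here id qualifies as a key because its four values are distinct, while group does not because it has only two distinct values across four rows.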
Determines the relationship between two variables in a data.frame: one-to-one, one-to-many, many-to-one, or many-to-many.
validate_var_relationship(.data, var1, var2)

## S3 method for class 'validate_var_rel'
print(x, ...)
.data |
A data.frame or tibble. |
var1 |
Character string: name of the first variable. |
var2 |
Character string: name of the second variable. |
x |
An object to print. |
... |
Additional arguments (currently unused). |
Only accepts variables of type character, integer, or factor. Numeric (double) variables are not allowed due to floating-point comparison issues.
An S3 object of class validate_var_rel containing:
Name of the input table
Names of the variables analyzed
Character string: "one-to-one", "one-to-many", "many-to-one", or "many-to-many"
Number of distinct values in var1
Number of distinct values in var2
Number of unique (var1, var2) pairs
Does any var1 value map to multiple var2 values?
Does any var2 value map to multiple var1 values?
Other join validation:
compare_tables(),
validate_join(),
validate_primary_keys()
df <- data.frame(
  person_id = c(1L, 2L, 3L, 4L),
  department = c("Sales", "Sales", "Engineering", "Engineering"),
  country = c("US", "US", "US", "UK")
)
validate_var_relationship(df, "person_id", "department")
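The classification can be reasoned about through the distinct (var1, var2) pairs: if a var1 value appears in more than one distinct pair, it maps to several var2 values, and likewise for var2. The base-R sketch below applies that idea. The helper sketch_var_relationship() is hypothetical, and the mapping from the two flags to the four labels is an assumed convention rather than the package's confirmed logic.

```r
# Hypothetical helper: classify the relationship from distinct value pairs.
sketch_var_relationship <- function(.data, var1, var2) {
  pairs <- unique(.data[, c(var1, var2), drop = FALSE])

  # A value repeated among distinct pairs maps to multiple partners.
  var1_to_many <- anyDuplicated(pairs[[var1]]) > 0
  var2_to_many <- anyDuplicated(pairs[[var2]]) > 0

  if (!var1_to_many && !var2_to_many) {
    "one-to-one"
  } else if (var1_to_many && !var2_to_many) {
    "one-to-many"
  } else if (!var1_to_many && var2_to_many) {
    "many-to-one"
  } else {
    "many-to-many"
  }
}

df <- data.frame(
  person_id = c(1L, 2L, 3L, 4L),
  department = c("Sales", "Sales", "Engineering", "Engineering"),
  country = c("US", "US", "US", "UK")
)
rel_person_dept <- sketch_var_relationship(df, "person_id", "department")
rel_dept_country <- sketch_var_relationship(df, "department", "country")
```

In this data each person belongs to one department while departments contain several people, so the first call yields "many-to-one"; Engineering spans two countries and US hosts two departments, so the second yields "many-to-many".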
Saves an audit_trail() to disk as either an RDS file (default) or a
JSON file. The RDS format preserves all R types and can be restored
perfectly with read_trail(). The JSON format produces a human-readable
representation suitable for archiving or interoperability with other tools.
write_trail(.trail, file, format = c("rds", "json"))
.trail |
An |
file |
Path to the output file. A |
format |
One of |
.trail, invisibly.
Custom diagnostic results (the custom field, populated via .fns
in audit_tap()) are serialised on a best-effort basis for JSON output.
Complex R objects such as environments or functions cannot be represented
in JSON and will cause an error.
Other audit export:
audit_export(),
read_trail(),
trail_to_df(),
trail_to_list()
trail <- audit_trail("example")
mtcars |> audit_tap(trail, "raw")
tmp <- tempfile(fileext = ".rds")
write_trail(trail, tmp)
restored <- read_trail(tmp)
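The claim that RDS preserves R types can be illustrated with base R's saveRDS() and readRDS() on a small stand-in object. This sketch is about the RDS format in general; whether write_trail() calls these functions internally is an assumption, and the list below is a made-up object, not a real trail.

```r
# Made-up object mixing types that plain text formats tend to flatten.
snapshot_like <- list(
  label = "raw",
  timestamp = as.POSIXct("2024-01-01 12:00:00", tz = "UTC"),
  schema = data.frame(
    column = c("mpg", "cyl"),
    type = c("double", "double"),
    stringsAsFactors = FALSE
  )
)

tmp <- tempfile(fileext = ".rds")
saveRDS(snapshot_like, tmp)
restored <- readRDS(tmp)

# Classes and attributes survive the round trip.
same <- identical(snapshot_like, restored)
still_posixct <- inherits(restored$timestamp, "POSIXct")

unlink(tmp)
```

A JSON round trip of the same object would typically return the timestamp as a character string, which is why RDS is the better choice when the trail will be read back into R.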