The audit trail layer (audit_tap() and the
operation-aware taps) records exactly the steps you mark.
Audited execution flips that around: write one line at
the top, run your wrangling, and tidyaudit records the lineage of
every data.frame you create or change — no per-step taps
required.
trail <- audit_record({
raw <- as_tibble(mtcars)
clean <- filter(raw, mpg > 20)
lookup <- data.frame(cyl = c(4, 6, 8), label = c("small", "mid", "big"))
joined <- left_join(clean, lookup, by = "cyl")
})
trail
#>
#> ── Audit Trail: "trail_20260627_132548" ────────────────────────────────────────
#> Created: 2026-06-27 13:25:48
#> Snapshots: 4
#>
#> # Label Rows Cols NAs Type
#> ─ ────── ──── ──── ─── ────
#> 1 raw 32 11 0 tap
#> 2 clean 14 11 0 tap
#> 3 lookup 3 2 0 tap
#> 4 joined 14 12 0 tap
Each top-level statement becomes a versioned snapshot, tagged with
the line of code that produced it and the parent data.frames it derived
from. The join is where lineage earns its keep: joined has two
parents, clean and lookup, while raw, the entry point, has none.
ids <- vapply(trail$snapshots, function(s) s$snapshot_id, character(1))
obj <- vapply(trail$snapshots, function(s) s$object_name, character(1))
data.frame(
object = obj,
parents = vapply(trail$snapshots, function(s) {
paste(obj[match(unlist(s$parent_snapshot_ids), ids)], collapse = ", ")
}, character(1))
)
#> object parents
#> 1 raw
#> 2 clean raw
#> 3 lookup
#> 4 joined clean, lookup
Capture is metadata-only by default: shape, types, and NA counts, never the rows.
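To make "metadata-only" concrete, the base R sketch below computes the kind of summary described above (shape, column types, NA counts) without retaining any rows. The helper name snapshot_meta() and its field names are hypothetical; this is an illustration, not tidyaudit's internal representation.

```r
# Hypothetical helper: summarise a data.frame without keeping its rows.
snapshot_meta <- function(df) {
  stopifnot(is.data.frame(df))
  list(
    nrow  = nrow(df),
    ncol  = ncol(df),
    # First class of each column, e.g. "numeric" or "character"
    types = vapply(df, function(col) class(col)[1], character(1)),
    # Total number of missing values across all columns
    n_na  = sum(vapply(df, function(col) sum(is.na(col)), integer(1)))
  )
}

meta <- snapshot_meta(mtcars)
meta$nrow  # 32
meta$ncol  # 11
meta$n_na  # 0
```

Nothing in the returned list can be used to reconstruct individual rows, which is the property the default level relies on.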
| Function | Use it for |
|---|---|
| audit_source("script.R") | The canonical runner. Works in every context: interactive, source()d, or Rscript. |
| audit_record({ ... }) | Auditing an inline block, as above. |
| audit_start() / audit_stop() | Interactive-session convenience: a line at the top of a console session and one at the bottom. |
audit_source() is the right tool for a whole script:
script <- tempfile(fileext = ".R")
writeLines(c(
"raw <- dplyr::as_tibble(mtcars)",
"clean <- dplyr::filter(raw, mpg > 20)",
"agg <- dplyr::summarise(clean, n = dplyr::n(), mean_mpg = mean(mpg))"
), script)
strail <- audit_source(script)
strail
#>
#> ── Audit Trail: "filecf557d925b1.R" ────────────────────────────────────────────
#> Created: 2026-06-27 13:25:48
#> Snapshots: 3
#>
#> # Label Rows Cols NAs Type
#> ─ ───── ──── ──── ─── ────
#> 1 raw 32 11 0 tap
#> 2 clean 14 11 0 tap
#> 3 agg 1 2 0 tap
Note on audit_start(). It registers a top-level task callback, which fires per statement at the console. R treats source("file.R") as a single task, so a script run via source() under audit_start() records only one combined step. For scripts, use audit_source().
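A script runner can see every statement where a task callback sees only one, because the runner can parse the file itself and evaluate the top-level expressions one at a time. The base R sketch below illustrates that idea. It assumes this is roughly the shape of a per-statement runner; run_by_statement() is a hypothetical name, and nothing here is tidyaudit's actual implementation.

```r
# Hypothetical runner: evaluate a script one top-level expression at a time
# and note which data.frames exist after each step.
run_by_statement <- function(path, env = new.env(parent = globalenv())) {
  exprs <- parse(file = path, keep.source = FALSE)
  steps <- vector("list", length(exprs))
  for (i in seq_along(exprs)) {
    eval(exprs[[i]], envir = env)
    # Which bindings in `env` currently hold a data.frame?
    is_df <- vapply(
      ls(env),
      function(nm) is.data.frame(get(nm, envir = env)),
      logical(1)
    )
    steps[[i]] <- list(
      code        = paste(deparse(exprs[[i]]), collapse = " "),
      data_frames = names(is_df)[is_df]
    )
  }
  steps
}

script <- tempfile(fileext = ".R")
writeLines(c(
  "raw <- mtcars",
  "clean <- raw[raw$mpg > 20, ]",
  "n <- nrow(clean)"
), script)

steps <- run_by_statement(script)
length(steps)             # one entry per top-level statement: 3
steps[[2]]$data_frames    # "clean" "raw"
```

Because the loop owns evaluation, it does not depend on how the script was launched, which matches the "works in every context" description in the table.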
Every snapshot carries versioned-lineage fields. Names are for display; IDs are for identity, so reassignments and self-overwrites stay unambiguous.
df <- trail_to_df(trail)
df[, c("object_name", "version", "event", "nrow", "ncol")]
#> object_name version event nrow ncol
#> 1 raw 1 create 32 11
#> 2 clean 1 create 14 11
#> 3 lookup 1 create 3 2
#> 4 joined 1 create 14 12
The identity fields ride along on every row as list-columns. Unpack
them to see the snapshot_id that names each version and the
parent_snapshot_ids that wire the lineage together. These are the
same IDs that the match() lookup above resolved back to names:
data.frame(
object_name = unlist(df$object_name),
snapshot_id = unlist(df$snapshot_id),
parents = vapply(df$parent_snapshot_ids,
function(p) paste(p, collapse = ", "), character(1))
)
#> object_name snapshot_id parents
#> 1 raw s1
#> 2 clean s2 s1
#> 3 lookup s3
#> 4 joined s4 s2, s3
Parents are resolved from the state before each statement runs, so a self-overwrite links to the previous version:
trail2 <- audit_record({
x <- data.frame(a = 1:5)
x <- mutate(x, b = a * 2) # parent is the previous `x`, not the new one
})
vapply(trail2$snapshots, function(s) {
paste0(s$object_name, " v", s$version, " <- ",
paste(s$parent_snapshot_ids, collapse = ", "))
}, character(1))
#> [1] "clean v1 <- " "df v1 <- " "joined v1 <- " "lookup v1 <- "
#> [5] "raw v1 <- " "x v1 <- " "x v2 <- s6"
Assignments, in-place edits, and deletions all produce events:
create, update, delete (via
rm()), retire (a binding that stops being a
data.frame), and unchanged_assignment (assigned, but no
detectable change at the current evidence level).
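The event vocabulary above can be read as a small decision table over a binding's state before and after a statement. A minimal base R sketch follows. It assumes each state is either NULL (no binding), a non-data.frame value, or a data.frame, and that "changed" means a difference in shape, types, or NA counts. classify_event() and meta_of() are hypothetical names, not tidyaudit functions.

```r
# Hypothetical metadata fingerprint: shape, column types, per-column NA counts.
meta_of <- function(df) {
  list(
    dim   = dim(df),
    types = vapply(df, function(col) class(col)[1], character(1)),
    nas   = vapply(df, function(col) sum(is.na(col)), integer(1))
  )
}

# Hypothetical classifier. `before` and `after` are the binding's values around
# a statement (NULL when the name does not exist); `assigned` says whether the
# statement assigned to the name.
classify_event <- function(before, after, assigned = TRUE) {
  was_df <- is.data.frame(before)
  is_df  <- is.data.frame(after)
  if (!was_df && is_df) return("create")
  if (was_df && is.null(after)) return("delete")   # e.g. removed with rm()
  if (was_df && !is_df) return("retire")           # no longer a data.frame
  if (was_df && is_df) {
    if (!identical(meta_of(before), meta_of(after))) return("update")
    if (assigned) return("unchanged_assignment")
  }
  NA_character_                                     # nothing to record
}

x1 <- data.frame(a = c(1, 2, 3))
classify_event(NULL, x1)                          # "create"
classify_event(x1, transform(x1, b = a * 2))      # "update"
classify_event(x1, transform(x1, a = a * 10))     # "unchanged_assignment"
classify_event(x1, NULL)                          # "delete"
classify_event(x1, 42)                            # "retire"
```

The third call is the interesting one: the values changed, but the metadata fingerprint did not, so a metadata-only comparison cannot tell the two versions apart.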
At the default level = "metadata", a change that does
not alter shape, types, or NA counts is reported as
unchanged_assignment. To detect value-only edits, raise the
evidence level — tidyaudit then hashes column contents with a per-run
salt:
audit_record({
x <- data.frame(a = c(1, 2, 3))
x <- mutate(x, a = a * 10) # same shape; value-only change
}, level = "column_hash")$snapshots[[2]]$event
#> [1] "create"
When a hash level is used, each snapshot records an
evidence entry describing the algorithm, sampling, and salt
policy alongside the hash itself, so the trail documents exactly how it
was produced:
s <- audit_record({
x <- data.frame(a = c(1, 2, 3))
}, level = "column_hash")$snapshots[[1]]
str(s$evidence)
#> List of 5
#> $ algorithm : chr "xxhash (rlang::hash)"
#> $ level : chr "column_hash"
#> $ sample : chr "all rows and columns"
#> $ salt_policy: chr "per-run (clock + PID); not privacy-preserving"
#> $ hash : chr "4d0a204b0a947f2a05b73c41577c6d44"
Hashes use a per-run salt and are not a privacy
guarantee — unsalted hashes of small categorical columns can be reversed
by dictionary attack. Keep the default "metadata" level
when the trail may be shared.
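The dictionary-attack caveat can be illustrated in a few lines of base R. The sketch uses MD5 via tools::md5sum() purely as a stand-in (the evidence entry above reports xxhash via rlang::hash), and hash_string(), hash_values(), and the salt string are hypothetical names and values chosen for illustration.

```r
# Hypothetical helper: MD5 of a string, written to a temporary file because
# tools::md5sum() hashes files. MD5 is a stand-in for illustration only.
hash_string <- function(x) {
  path <- tempfile()
  on.exit(unlink(path))
  writeBin(charToRaw(x), path)
  unname(tools::md5sum(path))
}

# Hash each value, optionally prefixed with a salt.
hash_values <- function(values, salt = "") {
  vapply(paste0(salt, values), hash_string, character(1), USE.NAMES = FALSE)
}

labels <- c("small", "mid", "big", "mid")   # a low-cardinality column

# Unsalted: an attacker hashes a guessed dictionary and matches the digests.
dictionary <- c("big", "mid", "small", "tiny", "huge")
published  <- hash_values(labels)
recovered  <- dictionary[match(published, hash_values(dictionary))]
recovered                                   # "small" "mid" "big" "mid"

# Salted: the same lookup fails unless the attacker knows or guesses the salt.
salted <- hash_values(labels, salt = "run-12345-")
dictionary[match(salted, hash_values(dictionary))]   # NA NA NA NA
```

A salt only helps while it stays unknown. One derived from guessable inputs such as the clock and the PID can be searched for, which is consistent with the "not privacy-preserving" wording in the evidence entry.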
Parents are inferred statically from the call you wrote, not from
runtime values. For data-mask verbs (filter(),
mutate(), summarise(), select(),
and the like) only the primary data argument is treated as a parent; the
masked expressions are not scanned. That is deliberate — it keeps a
column reference such as filter(df, mpg > 20) from
inventing a parent when a stray object named mpg happens to
exist. The trade-off is that a genuine cross-data-frame reference
inside a masked argument,
e.g. mutate(df, new = other_df$x), will not record
other_df as a parent. If you want that edge captured in the
lineage, lift the reference into its own statement
(v <- other_df$x) before using it.
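The rule described above can be sketched with base R's expression tools. This is an illustration under stated assumptions, not tidyaudit's code: infer_parents() and mask_verbs are hypothetical names, the verb list is deliberately short, and the primary data argument is assumed to be the first argument of the call.

```r
# Hypothetical set of data-mask verbs; a real implementation would know more.
mask_verbs <- c("filter", "mutate", "summarise", "select")

# Hypothetical inference: which known data.frames does `call` derive from?
infer_parents <- function(call, known) {
  if (!is.call(call)) return(intersect(all.vars(call), known))
  fn <- call[[1]]
  # Strip a namespace prefix such as dplyr::filter
  if (is.call(fn) && identical(fn[[1]], as.name("::"))) fn <- fn[[3]]
  if (is.name(fn) && as.character(fn) %in% mask_verbs && length(call) >= 2) {
    # Data-mask verb: only the primary data argument is scanned
    return(intersect(all.vars(call[[2]]), known))
  }
  # Any other call: every symbol naming a known data.frame counts
  intersect(all.vars(call), known)
}

# Names of data.frames assumed to exist, including a stray one called `mpg`.
known <- c("df", "other_df", "mpg")

infer_parents(quote(filter(df, mpg > 20)), known)            # "df"
infer_parents(quote(mutate(df, new = other_df$x)), known)    # "df"
infer_parents(quote(merge(df, other_df)), known)             # "df" "other_df"
```

The first call shows the benefit (the stray mpg object is not picked up) and the second shows the cost (other_df is missed), which is the trade-off the paragraph describes.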