---
title: "Audited Execution"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Audited Execution}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, message=FALSE}
library(tidyaudit)
library(dplyr)
```

## Capturing lineage without taps

The audit trail layer (`audit_tap()` and the operation-aware taps) records
exactly the steps you mark. **Audited execution** flips that around: write one
line at the top, run your wrangling, and tidyaudit records the lineage of
*every* data.frame you create or change, with no per-step taps required.

```{r}
trail <- audit_record({
  raw <- as_tibble(mtcars)
  clean <- filter(raw, mpg > 20)
  lookup <- data.frame(cyl = c(4, 6, 8), label = c("small", "mid", "big"))
  joined <- left_join(clean, lookup, by = "cyl")
})
trail
```

Each top-level statement becomes a versioned snapshot, tagged with the line of
code that produced it and the parent data.frames it derived from. The join is
where lineage earns its keep: `joined` converges from *both* tables it was built
from, while `raw`, the entry point, has no parent.

```{r}
ids <- vapply(trail$snapshots, function(s) s$snapshot_id, character(1))
obj <- vapply(trail$snapshots, function(s) s$object_name, character(1))
data.frame(
  object = obj,
  parents = vapply(trail$snapshots, function(s) {
    paste(obj[match(unlist(s$parent_snapshot_ids), ids)], collapse = ", ")
  }, character(1))
)
```

Capture is metadata-only by default: shape, types, and NA counts, never the rows.

## Three ways in

| Function | Use it for |
|----------|-----------|
| `audit_source("script.R")` | The canonical runner. Works in every context: interactive, `source()`d, or `Rscript`. |
| `audit_record({ ... })` | Auditing an inline block, as above. |
| `audit_start()` / `audit_stop()` | Interactive-session convenience: a line at the top of a console session and one at the bottom. |

`audit_source()` is the right tool for a whole script:

```{r}
script <- tempfile(fileext = ".R")
writeLines(c(
  "raw <- dplyr::as_tibble(mtcars)",
  "clean <- dplyr::filter(raw, mpg > 20)",
  "agg <- dplyr::summarise(clean, n = dplyr::n(), mean_mpg = mean(mpg))"
), script)

strail <- audit_source(script)
strail
```

> **Note on `audit_start()`.** It registers a top-level task callback, which
> fires per statement at the console. R treats `source("file.R")` as a *single*
> task, so a script run via `source()` under `audit_start()` records only one
> combined step. For scripts, use `audit_source()`.

## What a snapshot knows

Every snapshot carries versioned-lineage fields. Names are for display; IDs are
for identity, so reassignments and self-overwrites stay unambiguous.

```{r}
df <- trail_to_df(trail)
df[, c("object_name", "version", "event", "nrow", "ncol")]
```

The identity fields ride along on every row as list-columns. Unpack them to see
the `snapshot_id` that names each version and the `parent_snapshot_ids` that
wire the lineage together, the same IDs the join above resolved back to names:

```{r}
data.frame(
  object_name = unlist(df$object_name),
  snapshot_id = unlist(df$snapshot_id),
  parents = vapply(df$parent_snapshot_ids,
                   function(p) paste(p, collapse = ", "), character(1))
)
```

Parents are resolved from the state *before* each statement runs, so a
self-overwrite links to the previous version:

```{r}
trail2 <- audit_record({
  x <- data.frame(a = 1:5)
  x <- mutate(x, b = a * 2)  # parent is the previous `x`, not the new one
})
vapply(trail2$snapshots, function(s) {
  paste0(s$object_name, " v", s$version, " <- ",
         paste(s$parent_snapshot_ids, collapse = ", "))
}, character(1))
```

## Lifecycle events

Assignments, in-place edits, and deletions all produce events: `create`,
`update`, `delete` (via `rm()`), `retire` (a binding that stops being a
data.frame), and `unchanged_assignment` (assigned, but no detectable change at
the current evidence level).

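
The event types listed above can be exercised in a single block. The following is a minimal sketch rather than verified output: the event named in each comment is what the descriptions above suggest for that statement, and it assumes that deletions and retirements also appear as rows in `trail_to_df()`. The object names `x` and `y` are illustrative only, and the chunk is not evaluated when the vignette is built.

```{r, eval=FALSE}
library(tidyaudit)
library(dplyr)

events_trail <- audit_record({
  x <- data.frame(a = 1:3)       # new data.frame binding: expect `create`
  x <- mutate(x, b = a * 2)      # shape changes: expect `update`
  x <- mutate(x, b = b + 1)      # same shape, types, NA counts: expect
                                 # `unchanged_assignment` at the default level
  y <- data.frame(k = 1:2)       # second object: expect `create`
  y <- "no longer a data.frame"  # binding stops being a data.frame: expect `retire`
  rm(x)                          # binding removed: expect `delete`
})

# Inspect which event was recorded for each step
events_df <- trail_to_df(events_trail)
events_df[, c("object_name", "version", "event")]
```

Running the same block with `level = "column_hash"` should, per the next section, reclassify the third statement, since its values change even though its shape does not.
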
## Detecting value-only changes

At the default `level = "metadata"`, a change that does not alter shape, types,
or NA counts is reported as `unchanged_assignment`. To detect value-only edits,
raise the evidence level; tidyaudit then hashes column contents with a per-run
salt:

```{r}
audit_record({
  x <- data.frame(a = c(1, 2, 3))
  x <- mutate(x, a = a * 10)  # same shape; value-only change
}, level = "column_hash")$snapshots[[2]]$event
```

When a hash level is used, each snapshot records an `evidence` entry describing
the algorithm, sampling, and salt policy alongside the hash itself, so the trail
documents exactly how it was produced:

```{r}
s <- audit_record({
  x <- data.frame(a = c(1, 2, 3))
}, level = "column_hash")$snapshots[[1]]
str(s$evidence)
```

Hashes use a per-run salt and are **not** a privacy guarantee: unsalted hashes
of small categorical columns can be reversed by dictionary attack. Keep the
default `"metadata"` level when the trail may be shared.

## Lineage resolution limits

Parents are inferred statically from the call you wrote, not from runtime
values. For data-mask verbs (`filter()`, `mutate()`, `summarise()`, `select()`,
and the like) only the primary data argument is treated as a parent; the masked
expressions are not scanned. That is deliberate: it keeps a column reference
such as `filter(df, mpg > 20)` from inventing a parent when a stray object named
`mpg` happens to exist.

The trade-off is that a genuine cross-data-frame reference *inside* a masked
argument, e.g. `mutate(df, new = other_df$x)`, will not record `other_df` as a
parent. If you want that edge captured in the lineage, lift the reference into
its own statement (`v <- other_df$x`) before using it.

## Exporting the report

`audit_export()` renders an audited trail as a self-contained HTML file: a
tabular report (run summary, step timeline, objects and versions, warnings and
errors) followed by a per-object lineage graph where joins converge from two
parents.

```{r, eval=FALSE}
audit_export(trail, "lineage.html")
```
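
As a small variation on the call above, the report can be written to a temporary location so that it does not land in the working directory. This is a sketch under one assumption: that `audit_export()` accepts any writable path as its second argument (only the relative `"lineage.html"` form is shown in this vignette). The name `out` is illustrative.

```{r, eval=FALSE}
library(tidyaudit)
library(dplyr)

# Record a small trail to export
trail <- audit_record({
  raw <- as_tibble(mtcars)
  clean <- filter(raw, mpg > 20)
})

# Write the report into the session's temporary directory
out <- file.path(tempdir(), "lineage.html")
audit_export(trail, out)

# Confirm the file was written, then open it when working interactively
file.exists(out)
if (interactive()) utils::browseURL(out)
```
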