---
title: "Audited Execution"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Audited Execution}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, message=FALSE}
library(tidyaudit)
library(dplyr)
```

## Capturing lineage without taps

The audit trail layer (`audit_tap()` and the operation-aware taps) records
exactly the steps you mark. **Audited execution** flips that around: write one
line at the top, run your wrangling, and tidyaudit records the lineage of
*every* data.frame you create or change — no per-step taps required.

```{r}
trail <- audit_record({
  raw    <- as_tibble(mtcars)
  clean  <- filter(raw, mpg > 20)
  lookup <- data.frame(cyl = c(4, 6, 8), label = c("small", "mid", "big"))
  joined <- left_join(clean, lookup, by = "cyl")
})

trail
```

Each top-level statement becomes a versioned snapshot, tagged with the line of
code that produced it and the parent data.frames it derived from. The join is
where lineage earns its keep: `joined` records *both* of its input tables
(`clean` and `lookup`) as parents, while `raw`, the entry point, has none.

```{r}
# Pull each snapshot's ID and object name into parallel vectors.
ids <- vapply(trail$snapshots, function(s) s$snapshot_id, character(1))
obj <- vapply(trail$snapshots, function(s) s$object_name, character(1))
data.frame(
  object  = obj,
  parents = vapply(trail$snapshots, function(s) {
    # Translate parent snapshot IDs back to object names; no parents gives "".
    paste(obj[match(unlist(s$parent_snapshot_ids), ids)], collapse = ", ")
  }, character(1))
)
```

Capture is metadata-only by default: shape, types, and NA counts, never the rows.

## Three ways in

| Function | Use it for |
|----------|-----------|
| `audit_source("script.R")` | The canonical runner. Works in every context — interactive, `source()`d, or `Rscript`. |
| `audit_record({ ... })` | Auditing an inline block, as above. |
| `audit_start()` / `audit_stop()` | Interactive-session convenience: a line at the top of a console session and one at the bottom. |

`audit_source()` is the right tool for a whole script:

```{r}
script <- tempfile(fileext = ".R")
writeLines(c(
  "raw   <- dplyr::as_tibble(mtcars)",
  "clean <- dplyr::filter(raw, mpg > 20)",
  "agg   <- dplyr::summarise(clean, n = dplyr::n(), mean_mpg = mean(mpg))"
), script)

strail <- audit_source(script)
strail
```

> **Note on `audit_start()`.** It registers a top-level task callback, which
> fires per statement at the console. R treats `source("file.R")` as a *single*
> task, so a script run via `source()` under `audit_start()` records only one
> combined step. For scripts, use `audit_source()`.
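
The gap between per-statement and per-file capture can be sketched in base R. This is a minimal illustration, assuming a script runner that parses the file and evaluates one top-level expression at a time. It is not tidyaudit's implementation, and the names `script_demo`, `exprs`, `env`, and `steps` are made up for the example.

```r
# Hypothetical sketch: step through a script one top-level statement at a time.
script_demo <- tempfile(fileext = ".R")
writeLines(c(
  "a <- data.frame(x = 1:3)",
  "b <- transform(a, y = x * 2)",
  "d <- b[b$y > 2, ]"
), script_demo)

exprs <- parse(script_demo)   # one element per top-level statement
env   <- new.env()            # scratch environment for the script's objects
steps <- character(0)

for (i in seq_along(exprs)) {
  eval(exprs[[i]], envir = env)   # run a single statement
  # Record the statement text; a real runner could snapshot `env` here.
  steps <- c(steps, paste(deparse(exprs[[i]]), collapse = " "))
}

length(steps)   # one recorded step per statement
steps
```

A runner built this way gets a hook after every statement, whereas a task callback only sees `source(script_demo)` finish as one task.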

## What a snapshot knows

Every snapshot carries versioned-lineage fields. Names are for display; IDs are
for identity, so reassignments and self-overwrites stay unambiguous.

```{r}
df <- trail_to_df(trail)
df[, c("object_name", "version", "event", "nrow", "ncol")]
```

The identity fields ride along on every row as list-columns. Unpack them to see
the `snapshot_id` that names each version and the `parent_snapshot_ids` that
wire the lineage together. These are the same IDs that the parent table in the
first section resolved back to object names:

```{r}
data.frame(
  object_name = unlist(df$object_name),
  snapshot_id = unlist(df$snapshot_id),
  parents     = vapply(df$parent_snapshot_ids,
                       function(p) paste(p, collapse = ", "), character(1))
)
```

Parents are resolved from the state *before* each statement runs, so a
self-overwrite links to the previous version:

```{r}
trail2 <- audit_record({
  x <- data.frame(a = 1:5)
  x <- mutate(x, b = a * 2)   # parent is the previous `x`, not the new one
})
vapply(trail2$snapshots, function(s) {
  paste0(s$object_name, " v", s$version, " <- ",
         paste(s$parent_snapshot_ids, collapse = ", "))
}, character(1))
```

## Lifecycle events

Assignments, in-place edits, and deletions all produce events: `create`,
`update`, `delete` (via `rm()`), `retire` (a binding that stops being a
data.frame), and `unchanged_assignment` (assigned, but no detectable change at
the current evidence level).
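
One way to read these definitions is as a decision over a binding's value before and after a statement. The base-R sketch below is hypothetical: `classify_event()` is not a tidyaudit function, it uses `NULL` to stand for "no binding", and it compares only shape and column classes as a stand-in for the metadata level.

```r
# Hypothetical classifier; not tidyaudit's internal code.
# `before` / `after` are a binding's values around one statement (NULL = absent).
classify_event <- function(before, after) {
  was_df <- is.data.frame(before)
  is_df  <- is.data.frame(after)
  if (!was_df && !is_df) return(NA_character_)  # never a data.frame: not tracked
  if (!was_df) return("create")                 # a data.frame appeared
  if (is.null(after)) return("delete")          # binding removed, e.g. by rm()
  if (!is_df) return("retire")                  # binding is no longer a data.frame
  # Both are data.frames: compare a simple metadata summary.
  same_meta <- identical(dim(before), dim(after)) &&
    identical(lapply(before, class), lapply(after, class))
  if (same_meta) "unchanged_assignment" else "update"
}

df <- data.frame(a = 1:3)
classify_event(NULL, df)                      # "create"
classify_event(df, cbind(df, b = df$a * 2))   # "update"
classify_event(df, df)                        # "unchanged_assignment"
classify_event(df, "now a string")            # "retire"
classify_event(df, NULL)                      # "delete"
```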

## Detecting value-only changes

At the default `level = "metadata"`, a change that does not alter shape, types,
or NA counts is reported as `unchanged_assignment`. To detect value-only edits,
raise the evidence level — tidyaudit then hashes column contents with a
per-run salt:

```{r}
audit_record({
  x <- data.frame(a = c(1, 2, 3))
  x <- mutate(x, a = a * 10)   # same shape; value-only change
}, level = "column_hash")$snapshots[[2]]$event
```

When a hash level is used, each snapshot records an `evidence` entry describing
the algorithm, sampling, and salt policy alongside the hash itself, so the trail
documents exactly how it was produced:

```{r}
s <- audit_record({
  x <- data.frame(a = c(1, 2, 3))
}, level = "column_hash")$snapshots[[1]]
str(s$evidence)
```

Hashes use a per-run salt, but they are **not** a privacy guarantee: hashes of
small categorical columns can still be reversed by dictionary attack by anyone
who can obtain the salt. Keep the default `"metadata"` level when the trail may
be shared.

## Lineage resolution limits

Parents are inferred statically from the call you wrote, not from runtime values.
For data-mask verbs (`filter()`, `mutate()`, `summarise()`, `select()`, and the
like) only the primary data argument is treated as a parent; the masked
expressions are not scanned. That is deliberate — it keeps a column reference
such as `filter(df, mpg > 20)` from inventing a parent when a stray object named
`mpg` happens to exist. The trade-off is that a genuine cross-data-frame
reference *inside* a masked argument, e.g. `mutate(df, new = other_df$x)`, will
not record `other_df` as a parent. If you want that edge captured in the
lineage, lift the reference into its own statement (`v <- other_df$x`) before
using it.
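
The rule can be sketched in base R as a small static scan over a quoted call. This is an illustration under stated assumptions, not tidyaudit's resolver: `infer_parents()`, `masked_verbs`, and the `known` vector of tracked object names are all hypothetical.

```r
# Hypothetical static parent inference; not tidyaudit's internal code.
masked_verbs <- c("filter", "mutate", "summarise", "select")

infer_parents <- function(expr, known) {
  stopifnot(is.call(expr))
  fn <- expr[[1]]
  # Handle both `filter(...)` and `dplyr::filter(...)`.
  verb <- if (is.call(fn)) as.character(fn[[3]]) else as.character(fn)
  args <- as.list(expr)[-1]
  if (verb %in% masked_verbs) {
    args <- args[1]   # data-mask verb: scan only the primary data argument
  }
  # Collect symbols from the scanned arguments and keep the tracked ones.
  symbols <- unname(unlist(lapply(args, all.vars)))
  intersect(symbols, known)
}

# `mpg` is listed as a tracked object to mimic a stray object of that name.
known <- c("df", "other_df", "mpg")

infer_parents(quote(filter(df, mpg > 20)), known)               # "df"
infer_parents(quote(mutate(df, new = other_df$x)), known)       # "df"
infer_parents(quote(left_join(df, other_df, by = "cyl")), known)  # "df" "other_df"
```

The first call shows the benefit (the stray `mpg` is ignored) and the second shows the cost (`other_df` is missed), matching the trade-off described above.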

## Exporting the report

`audit_export()` renders an audited trail as a self-contained HTML file: a
tabular report (run summary, step timeline, objects and versions, warnings and
errors) followed by a per-object lineage graph where joins converge from two
parents.

```{r, eval=FALSE}
audit_export(trail, "lineage.html")
```