The audit_* functions create a lightweight analysis
manifest. They are not a workflow engine: the goal is to add small audit
records at natural points in an ordinary ukbflow analysis, using objects
that already exist in the script.
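A typical audit captures:
- the field IDs declared for extraction;
- RAP job IDs and lightweight job metadata;
- phenotype columns created by derive_* functions;
- dataset snapshots, including row counts, column counts, and column names;
- result tables returned by assoc_* functions.
The examples below use synthetic data from ops_toy() and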
can be developed without RAP access. In a real RAP project, the same
audit calls sit next to extract_batch(),
job_result(), derive_*(), and
assoc_*() calls.
Start one audit object near the beginning of the analysis.
audit_start() records the analysis name, start time,
ukbflow version, R session information, and current DNAnexus
user/project when available. If the dx CLI or RAP context is
unavailable, those fields are recorded as NA without
failing.
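The "recorded as NA without failing" behaviour can be pictured with a small base-R sketch. This is not ukbflow's implementation: `dx_or_na()` is a hypothetical helper written for illustration, and the only external command it assumes is the DNAnexus CLI call `dx whoami`.

```r
# Hypothetical helper, not part of ukbflow: run a dx CLI command and
# fall back to NA when the CLI is missing or the call fails.
dx_or_na <- function(args) {
  # Sys.which() returns "" when the executable is not on the PATH
  if (!nzchar(Sys.which("dx"))) {
    return(NA_character_)
  }
  out <- tryCatch(
    suppressWarnings(system2("dx", args, stdout = TRUE, stderr = FALSE)),
    error = function(e) character(0)
  )
  # A non-zero exit code is attached to the output as a "status" attribute
  status <- attr(out, "status")
  if (length(out) == 0L || (!is.null(status) && status != 0L)) {
    return(NA_character_)
  }
  out[[1]]
}

# A user name when a RAP session is available, otherwise NA
user <- dx_or_na("whoami")
```

The point of the pattern is that a script developed on a laptop and later run on RAP can keep the same audit call in both places.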
Field IDs are usually already stored in a vector before extraction. Reuse that object directly in the audit.

fields <- c(
  31, 53, 21022, 21001, 20116, 1558, 22189, 54,
  22009, 20001, 20006, 40006, 40011, 40012, 40005, 40000
)
aud <- audit_fields(aud, fields, label = "analysis_fields")
# In a RAP workflow this same vector can be used for extraction:
# job_id <- extract_batch(field_id = fields, file = "lung_analysis_pheno")
# aud <- audit_job(aud, job_id, "phenotype_extraction")

The manifest stores the declared field IDs, an optional dataset name, a label, the number of fields, and a timestamp.
audit_job() records the DNAnexus job ID and any
lightweight metadata available from
dx describe job-XXXX --json, such as job state and output
file ID. It does not estimate RAP cost; use the DNAnexus / RAP billing
interface for cost review.
Use snapshots at points where the dataset changes meaningfully: raw data, after phenotype derivation, after exclusions, and immediately before modelling.

data <- ops_toy(scenario = "cohort", n = 1000, seed = 2026)
aud <- audit_snapshot(aud, data, "raw")
data <- derive_missing(data)
aud <- audit_snapshot(aud, data, "after_missing")

Each audit snapshot stores the full column names. Retrieve them by label when you need to inspect or compare the data structure recorded in the manifest.
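To make the comparison concrete, the sketch below mimics the kind of information a snapshot holds, using base R only. `snapshot_sketch()` and the toy data frames are hypothetical and do not reflect ukbflow's internal format; the fields simply follow the description in this document (row count, column count, missingness count, and column names).

```r
# Hypothetical stand-in for a snapshot record (not ukbflow's internal format)
snapshot_sketch <- function(data, label) {
  list(
    label     = label,
    n_row     = nrow(data),
    n_col     = ncol(data),
    n_missing = sum(is.na(data)),
    columns   = names(data)
  )
}

# Toy data standing in for a cohort before and after a derivation step
raw <- data.frame(eid = 1:4, p31 = c("Female", "Male", NA, "Male"))
derived <- transform(raw, sex_known = !is.na(p31))

snap_raw <- snapshot_sketch(raw, "raw")
snap_derived <- snapshot_sketch(derived, "after_derivation")

# Columns added between the two stages
added <- setdiff(snap_derived$columns, snap_raw$columns)
# Rows dropped between the two stages
dropped <- snap_raw$n_row - snap_derived$n_row
```

Comparing stored column names with `setdiff()` in both directions is usually enough to see which step introduced or removed a variable.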
After running derive_* functions,
audit_pheno() can summarise phenotype columns that follow
ukbflow’s standard naming convention. It only needs the audit object,
the data, and the phenotype prefix.

data <- derive_selfreport(
  data,
  name = "lung_cancer",
  regex = "lung cancer",
  field = "cancer"
)
data <- derive_icd10(
  data,
  name = "lung",
  icd10 = "^C3[34]",
  match = "regex",
  source = "cancer_registry",
  behaviour = 3L
)
data <- derive_case(
  data,
  name = "lung",
  selfreport_col = "lung_cancer_selfreport",
  selfreport_date_col = "lung_cancer_selfreport_date"
)
data <- derive_timing(data, name = "lung", baseline_col = "p53_i0")
data <- derive_followup(
  data,
  name = "lung",
  event_col = "lung_date",
  baseline_col = "p53_i0",
  censor_date = as.Date("2022-10-31"),
  death_col = "p40000_i0",
  lost_col = FALSE
)
aud <- audit_pheno(aud, data, "lung")
aud <- audit_snapshot(aud, data, "after_phenotype")

audit_pheno() records whichever components exist:
self-report, ICD-10, per-source ICD-10 columns, combined status/date,
timing, and follow-up. Missing components are marked as not present
rather than treated as errors.
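The "record what exists" idea can be sketched in base R as a presence check on prefixed column names. The helper `pheno_components_sketch()` is hypothetical, and its suffix list covers only column names that appear in this document's examples (`lung_status`, `lung_date`, `lung_timing`, `lung_followup_years`); ukbflow's actual naming convention may include more components.

```r
# Hypothetical helper, not part of ukbflow: report which phenotype
# columns for a given prefix are present in the data.
pheno_components_sketch <- function(data, name,
                                    suffixes = c("status", "date", "timing",
                                                 "followup_years")) {
  cols <- paste(name, suffixes, sep = "_")
  data.frame(
    component = suffixes,
    column    = cols,
    present   = cols %in% names(data),
    stringsAsFactors = FALSE
  )
}

# Toy data with only the combined status and date columns derived so far
toy <- data.frame(
  eid = 1:3,
  lung_status = c(TRUE, FALSE, FALSE),
  lung_date = as.Date(c("2015-06-01", NA, NA))
)
components <- pheno_components_sketch(toy, "lung")
```

Here the timing and follow-up rows come back with `present = FALSE` instead of raising an error, which is the behaviour described above.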
Audit snapshots work well for cohort exclusions because they record row count, column count, missingness count, and column names at each stage.

aud <- audit_snapshot(aud, data, "before_exclusions")
data <- data[lung_timing != 1L | is.na(lung_timing)]
aud <- audit_snapshot(aud, data, "after_excluding_prevalent")
data[, smoking_ever := factor(
  ifelse(p20116_i0 == "Never", "Never", "Ever"),
  levels = c("Never", "Ever")
)]
data <- data[
  !is.na(smoking_ever) &
    !is.na(p31) &
    !is.na(p21022) &
    !is.na(p1558_i0) &
    !is.na(p54_i0)
]
aud <- audit_snapshot(aud, data, "analysis_ready")

For UKB withdrawal files, run ops_withdraw() early in
the pipeline and then record an audit snapshot.
ops_withdraw() itself records before/after snapshots in the
session-level ops_snapshot() history.

withdraw_file <- tempfile(fileext = ".csv")
writeLines(as.character(data$eid[1:3]), withdraw_file)
data <- ops_withdraw(data, file = withdraw_file)
aud <- audit_snapshot(aud, data, "after_withdraw")

Association result tables are usually small and already contain the
most useful model summary. audit_model() stores the result
table directly. If the covariate vector already exists in your script,
pass it along.

covars <- c(
  "p21022",
  "p31",
  "p1558_i0",
  "p54_i0"
)
res <- assoc_coxph(
  data = data,
  outcome_col = "lung_status",
  time_col = "lung_followup_years",
  exposure_col = "smoking_ever",
  covariates = covars
)
aud <- audit_model(
  aud,
  result = res,
  label = "smoking_lung_cox",
  covariates = covars
)

The model record stores the full result table, inferred method, exposures, model labels, optional covariates, and a timestamp.
Use summary() for a short directory-style overview.
Write the manifest as JSON alongside the analysis outputs.
The resulting JSON contains the audit metadata, extraction field records, snapshots, phenotype summaries, model result records, and session information.
For most analyses, these are enough:
- audit_start() after loading ukbflow.
- audit_fields() next to the field vector used for extraction.
- audit_snapshot() after loading raw data.
- audit_snapshot() and audit_pheno() after phenotype derivation.
- audit_snapshot() after each major cohort exclusion.
- audit_snapshot() immediately before modelling.
- audit_model() after each main association result.
- audit_job() next to long-running RAP jobs when a job ID is available.
- audit_write() at the end of the script.
Keep the audit close to the real workflow. Do not duplicate logic just for the manifest; record objects that already exist in the analysis.