---
title: "Analysis Audit and Reproducibility"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Analysis Audit and Reproducibility}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

## Overview

The `audit_*` functions create a lightweight analysis manifest. They are not a workflow engine: the goal is to add small audit records at natural points in an ordinary ukbflow analysis, using objects that already exist in the script.

A typical audit captures:

- the analysis name, ukbflow version, session information, and optional RAP context;
- the UKB field IDs requested for extraction;
- dataset snapshots at key stages, including row count, column count, missingness count, object size, and complete column names;
- derived phenotype summaries from standard `derive_*` column names;
- association result tables returned by `assoc_*`;
- DNAnexus job IDs and lightweight job metadata when available;
- a JSON manifest that can be saved with the analysis outputs.

The examples below use synthetic data from `ops_toy()` and can be developed without RAP access. In a real RAP project, the same audit calls sit next to `extract_batch()`, `job_result()`, `derive_*()`, and `assoc_*()` calls.

---

## Start an Audit

Start one audit object near the beginning of the analysis.

```{r audit-start}
library(ukbflow)

aud <- audit_start("smoking_lung_cancer")
aud
```

`audit_start()` records the analysis name, start time, ukbflow version, R session information, and current DNAnexus user/project when available. If the dx CLI or RAP context is unavailable, those fields are recorded as `NA` without failing.

---

## Record Field IDs

Field IDs are usually already stored in a vector before extraction. Reuse that object directly in the audit.

```{r audit-fields}
fields <- c(
  31, 53, 21022, 21001, 20116, 1558, 22189, 54, 22009,
  20001, 20006, 40006, 40011, 40012, 40005, 40000
)

aud <- audit_fields(aud, fields, label = "analysis_fields")

# In a RAP workflow this same vector can be used for extraction:
# job_id <- extract_batch(field_id = fields, file = "lung_analysis_pheno")
# aud <- audit_job(aud, job_id, "phenotype_extraction")
```

The manifest stores the declared field IDs, an optional dataset name, a label, the number of fields, and a timestamp.

`audit_job()` records the DNAnexus job ID and any lightweight metadata available from `dx describe job-XXXX --json`, such as job state and output file ID. It does not estimate RAP cost; use the DNAnexus / RAP billing interface for cost review.

---

## Snapshot Data States

Use snapshots at points where the dataset changes meaningfully: raw data, after phenotype derivation, after exclusions, and immediately before modelling.

```{r audit-snapshots}
data <- ops_toy(scenario = "cohort", n = 1000, seed = 2026)
aud <- audit_snapshot(aud, data, "raw")

data <- derive_missing(data)
aud <- audit_snapshot(aud, data, "after_missing")
```

Each audit snapshot stores the full column names. Retrieve them by label when you need to inspect or compare the data structure recorded in the manifest.

```{r audit-cols}
raw_cols <- audit_cols(aud, "raw")
head(raw_cols)
```

---

## Record Phenotype Summaries

After running `derive_*` functions, `audit_pheno()` can summarise phenotype columns that follow ukbflow's standard naming convention. It only needs the audit object, the data, and the phenotype prefix.

```{r audit-pheno}
data <- derive_selfreport(
  data,
  name = "lung_cancer",
  regex = "lung cancer",
  field = "cancer"
)

data <- derive_icd10(
  data,
  name = "lung",
  icd10 = "^C3[34]",
  match = "regex",
  source = "cancer_registry",
  behaviour = 3L
)

data <- derive_case(
  data,
  name = "lung",
  selfreport_col = "lung_cancer_selfreport",
  selfreport_date_col = "lung_cancer_selfreport_date"
)

data <- derive_timing(data, name = "lung", baseline_col = "p53_i0")

data <- derive_followup(
  data,
  name = "lung",
  event_col = "lung_date",
  baseline_col = "p53_i0",
  censor_date = as.Date("2022-10-31"),
  death_col = "p40000_i0",
  lost_col = FALSE
)

aud <- audit_pheno(aud, data, "lung")
aud <- audit_snapshot(aud, data, "after_phenotype")
```

`audit_pheno()` records whichever components exist: self-report, ICD-10, per-source ICD-10 columns, combined status/date, timing, and follow-up. Missing components are marked as not present rather than treated as errors.

---

## Record Cohort Assembly

Audit snapshots work well for cohort exclusions because they record row count, column count, missingness count, and column names at each stage.

```{r audit-cohort}
aud <- audit_snapshot(aud, data, "before_exclusions")

data <- data[lung_timing != 1L | is.na(lung_timing)]
aud <- audit_snapshot(aud, data, "after_excluding_prevalent")

data[, smoking_ever := factor(
  ifelse(p20116_i0 == "Never", "Never", "Ever"),
  levels = c("Never", "Ever")
)]

data <- data[
  !is.na(smoking_ever) &
    !is.na(p31) &
    !is.na(p21022) &
    !is.na(p1558_i0) &
    !is.na(p54_i0)
]
aud <- audit_snapshot(aud, data, "analysis_ready")
```

For UKB withdrawal files, run `ops_withdraw()` early in the pipeline and then record an audit snapshot. `ops_withdraw()` itself records before/after snapshots in the session-level `ops_snapshot()` history.

```{r audit-withdraw}
withdraw_file <- tempfile(fileext = ".csv")
writeLines(as.character(data$eid[1:3]), withdraw_file)

data <- ops_withdraw(data, file = withdraw_file)
aud <- audit_snapshot(aud, data, "after_withdraw")
```

---

## Record Model Results

Association result tables are usually small and already contain the most useful model summary. `audit_model()` stores the result table directly. If the covariate vector already exists in your script, pass it along.

```{r audit-model}
covars <- c(
  "p21022", "p31", "p1558_i0", "p54_i0"
)

res <- assoc_coxph(
  data = data,
  outcome_col = "lung_status",
  time_col = "lung_followup_years",
  exposure_col = "smoking_ever",
  covariates = covars
)

aud <- audit_model(
  aud,
  result = res,
  label = "smoking_lung_cox",
  covariates = covars
)
```

The model record stores the full result table, inferred method, exposures, model labels, optional covariates, and a timestamp.

---

## Review and Write the Manifest

Use `summary()` for a short directory-style overview.

```{r audit-summary}
summary(aud)
```

Write the manifest as JSON alongside the analysis outputs.

```{r audit-write}
audit_write(aud, "ukbflow-audit.json", overwrite = TRUE)
```

The resulting JSON contains the audit metadata, extraction field records, snapshots, phenotype summaries, model result records, and session information.

---

## Suggested Audit Points

For most analyses, these are enough:

1. `audit_start()` after loading ukbflow.
2. `audit_fields()` next to the field vector used for extraction.
3. `audit_snapshot()` after loading raw data.
4. `audit_snapshot()` and `audit_pheno()` after phenotype derivation.
5. `audit_snapshot()` after each major cohort exclusion.
6. `audit_snapshot()` immediately before modelling.
7. `audit_model()` after each main association result.
8. `audit_job()` next to long-running RAP jobs when a job ID is available.
9. `audit_write()` at the end of the script.

Keep the audit close to the real workflow.
Do not duplicate logic just for the manifest; record objects that already exist in the analysis.
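
Because each snapshot stores complete column names, two recorded stages can be compared with base R set operations. The sketch below is illustrative only: the two vectors are hand-written stand-ins for what `audit_cols(aud, "raw")` and `audit_cols(aud, "after_phenotype")` might return, and the specific column names are assumptions rather than output from a real run.

```{r audit-compare-cols}
# Stand-in vectors (assumed for illustration). In a real analysis these would
# come from audit_cols(aud, "raw") and audit_cols(aud, "after_phenotype").
raw_cols <- c("eid", "p31", "p53_i0", "p21022")
pheno_cols <- c("eid", "p31", "p53_i0", "p21022", "lung_status", "lung_date")

# Columns present in the later stage but not in the earlier one
added_cols <- setdiff(pheno_cols, raw_cols)

# Columns present in the earlier stage but missing from the later one
dropped_cols <- setdiff(raw_cols, pheno_cols)

added_cols
dropped_cols
```

A non-empty `dropped_cols` at a stage that should only add columns is a quick signal that something upstream changed.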