| Title: | Streamlined Workflow for UK Biobank Data Extraction, Analysis, and Visualization |
|---|---|
| Description: | Provides an R-native, RAP-aware workflow system for UK Biobank controlled-data analysis on the Research Analysis Platform (RAP). Includes tools for phenotype extraction and decoding, variable derivation, survival and association analysis, genetic risk score computation, audit records, and publication-quality visualization. For details on the UK Biobank resource, see Bycroft et al. (2018) <doi:10.1038/s41586-018-0579-z>. |
| Authors: | Yibin Zhou [aut, cre] (ORCID: <https://orcid.org/0009-0009-4600-8175>) |
| Maintainer: | Yibin Zhou <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.3.4 |
| Built: | 2026-06-03 08:32:38 UTC |
| Source: | https://github.com/evanbio/ukbflow |
Fits a Fine-Gray subdistribution hazard model (via
survival::finegray() + weighted coxph()) for each exposure
variable and returns a tidy result table with subdistribution hazard ratios
(SHR).
assoc_competing( data, outcome_col, time_col, exposure_col, compete_col = NULL, event_val = 1L, compete_val = 2L, covariates = NULL, base = TRUE, conf_level = 0.95 ) assoc_fg( data, outcome_col, time_col, exposure_col, compete_col = NULL, event_val = 1L, compete_val = 2L, covariates = NULL, base = TRUE, conf_level = 0.95 )assoc_competing( data, outcome_col, time_col, exposure_col, compete_col = NULL, event_val = 1L, compete_val = 2L, covariates = NULL, base = TRUE, conf_level = 0.95 ) assoc_fg( data, outcome_col, time_col, exposure_col, compete_col = NULL, event_val = 1L, compete_val = 2L, covariates = NULL, base = TRUE, conf_level = 0.95 )
data |
(data.frame or data.table) Analysis dataset. |
outcome_col |
(character) Primary event column. In Mode A: a multi-value column (any integer or character codes). In Mode B: a 0/1 binary column. |
time_col |
(character) Follow-up time column (numeric, e.g. years). |
exposure_col |
(character) One or more exposure variable names. |
compete_col |
(character or NULL) Mode B only: name of the 0/1
competing event column. When |
event_val |
Scalar value in |
compete_val |
Scalar value in |
covariates |
(character or NULL) Covariate names for the Fully
adjusted model. When |
base |
(logical) Whether to auto-detect age and sex columns from
standard UKB or decoded names and include an Age and sex adjusted model.
Default: |
conf_level |
(numeric) Confidence level for SHR intervals.
Default: |
Two input modes are supported depending on how the outcome is coded in your dataset:
compete_col = NULL (default). outcome_col contains all
event codes in one column (e.g. 0/1/2/3).
Use event_val and compete_val to identify the event of
interest and the competing event; all other values are treated as
censored. Example: UKB censoring_type where 1 = event, 2 = death
(competing), 0/3 = censored.
compete_col is the name of a separate 0/1 column for the
competing event. outcome_col is a 0/1 column for the primary
event. When both are 1 for the same participant, the primary event takes
priority. Example: outcome_col = "copd_status",
compete_col = "death_status".
Internally both modes are converted to a three-level factor
c("censor", "event", "compete") before being passed to
finegray().
Three adjustment models are produced (where data allow):
Unadjusted - always included.
Age and sex adjusted - when base = TRUE; age/sex
must be detectable from standard UKB or decoded names.
Fully adjusted - when covariates is non-NULL.
A data.table with one row per exposure term
model combination:
exposureExposure variable name.
termCoefficient name as returned by coxph.
modelOrdered factor: Unadjusted <
Age and sex adjusted < Fully adjusted.
nParticipants in the model (after NA removal).
n_eventsPrimary events in the analysis set.
n_competeCompeting events in the analysis set.
SHRSubdistribution hazard ratio.
CI_lowerLower CI bound.
CI_upperUpper CI bound.
p_valueRobust z-test p-value from weighted Cox.
SHR_labelFormatted string, e.g. "1.23 (1.05-1.44)".
dt <- ops_toy(scenario = "association", n = 500) dt <- dt[dm_timing != 1L & htn_timing != 1L] res <- assoc_competing( data = dt, outcome_col = "dm_status", time_col = "dm_followup_years", exposure_col = "p20116_i0", compete_col = "htn_status", covariates = c("bmi_cat", "tdi_cat") )dt <- ops_toy(scenario = "association", n = 500) dt <- dt[dm_timing != 1L & htn_timing != 1L] res <- assoc_competing( data = dt, outcome_col = "dm_status", time_col = "dm_followup_years", exposure_col = "p20116_i0", compete_col = "htn_status", covariates = c("bmi_cat", "tdi_cat") )
Fits one or more Cox models for each exposure variable and returns a tidy result table suitable for downstream forest plots. By default, two standard adjustment models are always included alongside any user-specified model:
assoc_coxph( data, outcome_col, time_col, exposure_col, covariates = NULL, base = TRUE, strata = NULL, test = c("wald", "lrt"), cluster_col = NULL, conf_level = 0.95 ) assoc_cox( data, outcome_col, time_col, exposure_col, covariates = NULL, base = TRUE, strata = NULL, test = c("wald", "lrt"), cluster_col = NULL, conf_level = 0.95 )assoc_coxph( data, outcome_col, time_col, exposure_col, covariates = NULL, base = TRUE, strata = NULL, test = c("wald", "lrt"), cluster_col = NULL, conf_level = 0.95 ) assoc_cox( data, outcome_col, time_col, exposure_col, covariates = NULL, base = TRUE, strata = NULL, test = c("wald", "lrt"), cluster_col = NULL, conf_level = 0.95 )
data |
(data.frame or data.table) Analysis dataset. Must contain all
columns referenced by |
outcome_col |
(character) Name of the event indicator column.
Accepts |
time_col |
(character) Name of the follow-up time column (numeric, in consistent units, e.g. years). |
exposure_col |
(character) One or more exposure variable names. Each variable is analysed separately; results are stacked row-wise. |
covariates |
(character or NULL) Additional covariate column names for
the Fully adjusted model (e.g.
|
base |
(logical) If |
strata |
(character or NULL) Optional stratification variable.
Passed to |
test |
(character) P-value method for Cox models: |
cluster_col |
(character or NULL) Optional clustering variable for
cluster-robust Cox variance, added to the model as
|
conf_level |
(numeric) Confidence level for hazard ratio intervals.
Default: |
Unadjusted - no covariates (crude).
Age and sex adjusted - age + sex auto-detected from
standard UKB names (p21022/p31) or decoded names
(age_at_recruitment/sex). Errors if either column cannot
be found.
Fully adjusted - the covariates supplied via the
covariates argument. Only run when covariates is non-NULL.
Outcome coding: outcome_col may be logical
(TRUE/FALSE) or integer/numeric (0/1).
Logical values are converted to integer internally.
Exposure types supported:
Binary - 0/1 or TRUE/FALSE;
produces one term row per model.
Factor - produces one term row per non-reference level.
Numeric (continuous) - produces one term row per model.
P-value method: test = "wald" returns the term-level Wald
p-value from summary.coxph(). test = "lrt" returns the
exposure-level likelihood-ratio p-value from single-term deletion
(drop1(..., test = "Chisq")); for factor exposures, the same overall
exposure p-value is repeated across the non-reference level rows.
When cluster_col is supplied, only test = "wald" is supported;
the Wald p-value and confidence interval use the cluster-robust variance from
summary.coxph().
A data.table with one row per exposure term
model combination, and the following columns:
exposureExposure variable name.
termCoefficient name as returned by coxph (e.g.
"bmi_categoryObese" for a factor, or the variable name itself
for numeric/binary exposures).
modelOrdered factor: Unadjusted <
Age and sex adjusted < Fully adjusted.
nNumber of participants included in the model (after
NA removal).
n_eventsNumber of events in the model's analysis set
(after NA removal).
person_yearsTotal person-years of follow-up in the
model's analysis set (rounded, after NA removal).
HRHazard ratio (point estimate).
CI_lowerLower bound of the confidence interval.
CI_upperUpper bound of the confidence interval.
p_valueP-value from the method selected by
test. When cluster_col is supplied, this is the
cluster-robust Wald p-value.
HR_labelFormatted string, e.g.
"1.23 (1.05-1.44)".
dt <- ops_toy(scenario = "association", n = 500) dt <- dt[dm_timing != 1L] res <- assoc_coxph( data = dt, outcome_col = "dm_status", time_col = "dm_followup_years", exposure_col = "p20116_i0", covariates = c("bmi_cat", "tdi_cat"), base = FALSE ) # Fully adjusted only (skip Unadjusted + Age-sex) res <- assoc_coxph( data = dt, outcome_col = "dm_status", time_col = "dm_followup_years", exposure_col = "p20116_i0", covariates = c("p21022", "p31", "bmi_cat", "tdi_cat"), base = FALSE )dt <- ops_toy(scenario = "association", n = 500) dt <- dt[dm_timing != 1L] res <- assoc_coxph( data = dt, outcome_col = "dm_status", time_col = "dm_followup_years", exposure_col = "p20116_i0", covariates = c("bmi_cat", "tdi_cat"), base = FALSE ) # Fully adjusted only (skip Unadjusted + Age-sex) res <- assoc_coxph( data = dt, outcome_col = "dm_status", time_col = "dm_followup_years", exposure_col = "p20116_i0", covariates = c("p21022", "p31", "bmi_cat", "tdi_cat"), base = FALSE )
Tests the proportional hazards (PH) assumption using Schoenfeld residuals
via cox.zph. Re-fits the same models as
assoc_coxph (same interface) and returns a tidy result table
with term-level and global test statistics.
assoc_coxph_zph( data, outcome_col, time_col, exposure_col, covariates = NULL, base = TRUE, strata = NULL ) assoc_zph( data, outcome_col, time_col, exposure_col, covariates = NULL, base = TRUE, strata = NULL )assoc_coxph_zph( data, outcome_col, time_col, exposure_col, covariates = NULL, base = TRUE, strata = NULL ) assoc_zph( data, outcome_col, time_col, exposure_col, covariates = NULL, base = TRUE, strata = NULL )
data |
(data.frame or data.table) Analysis dataset. |
outcome_col |
(character) Binary event indicator ( |
time_col |
(character) Follow-up time column name. |
exposure_col |
(character) One or more exposure variable names. |
covariates |
(character or NULL) Covariates for the Fully adjusted
model. Default: |
base |
(logical) Include Unadjusted and Age and sex adjusted models.
Default: |
strata |
(character or NULL) Optional stratification variable. |
A non-significant p-value (p > 0.05) indicates the PH assumption is
satisfied for that term. The global test (global_p) reflects the
overall PH assumption for the whole model.
A data.table with one row per exposure term
model combination, and columns:
exposureExposure variable name.
termCoefficient name.
modelOrdered factor: Unadjusted <
Age and sex adjusted < Fully adjusted.
chisqSchoenfeld residual chi-squared statistic.
dfDegrees of freedom.
p_valueP-value for the PH test (term-level).
ph_satisfiedLogical; TRUE if p_value > 0.05.
global_chisqGlobal chi-squared for the whole model.
global_dfGlobal degrees of freedom.
global_pGlobal p-value for the whole model.
dt <- ops_toy(scenario = "association", n = 500) dt <- dt[dm_timing != 1L] zph <- assoc_coxph_zph( data = dt, outcome_col = "dm_status", time_col = "dm_followup_years", exposure_col = "p20116_i0", covariates = c("bmi_cat", "tdi_cat"), base = FALSE ) zph[ph_satisfied == FALSE]dt <- ops_toy(scenario = "association", n = 500) dt <- dt[dm_timing != 1L] zph <- assoc_coxph_zph( data = dt, outcome_col = "dm_status", time_col = "dm_followup_years", exposure_col = "p20116_i0", covariates = c("bmi_cat", "tdi_cat"), base = FALSE ) zph[ph_satisfied == FALSE]
Runs Cox proportional hazards models at one or more lag periods to assess
whether associations are robust to the exclusion of early events. For each
lag, participants whose follow-up time is less than lag_years are
removed from the analysis dataset; follow-up time is kept on its original
scale (not shifted). This mirrors the approach used in UK Biobank sensitivity
analyses to address reverse causation and detection bias.
assoc_lag( data, outcome_col, time_col, exposure_col, lag_years = c(1, 2), covariates = NULL, base = TRUE, strata = NULL, test = c("wald", "lrt"), conf_level = 0.95 )assoc_lag( data, outcome_col, time_col, exposure_col, lag_years = c(1, 2), covariates = NULL, base = TRUE, strata = NULL, test = c("wald", "lrt"), conf_level = 0.95 )
data |
(data.frame or data.table) Analysis dataset. |
outcome_col |
(character) Event indicator column (0/1 or logical). |
time_col |
(character) Follow-up time column (numeric, e.g. years). |
exposure_col |
(character) One or more exposure variable names. |
lag_years |
(numeric) One or more lag periods in the same units as
|
covariates |
(character or NULL) Covariates for the Fully adjusted
model. When |
base |
(logical) Auto-detect age and sex from standard UKB or decoded
names and include an Age and sex adjusted model. Default: |
strata |
(character or NULL) Optional stratification variable passed to
|
test |
(character) P-value method: |
conf_level |
(numeric) Confidence level for HR intervals.
Default: |
Setting lag_years = 0 (or including 0 in the vector) runs the
model on the full unfiltered cohort, providing a reference against which
lagged results can be compared.
The same three adjustment models produced by assoc_coxph are
available here (Unadjusted, Age and sex adjusted,
Fully adjusted).
P-value method: test = "wald" returns term-level Wald
p-values from summary.coxph(). test = "lrt" returns the
exposure-level likelihood-ratio p-value from single-term deletion within
each lag-specific analysis set.
A data.table with one row per lag exposure
term model combination, containing all columns
produced by assoc_coxph plus:
lag_yearsThe lag period applied (numeric).
n_excludedNumber of participants excluded because their
follow-up time was less than lag_years.
lag_years and n_excluded are placed immediately after
model in the column order.
dt <- ops_toy(scenario = "association", n = 500) dt <- dt[dm_timing != 1L] res <- assoc_lag( data = dt, outcome_col = "dm_status", time_col = "dm_followup_years", exposure_col = "p20116_i0", lag_years = c(0, 1, 2), covariates = c("bmi_cat", "tdi_cat"), base = FALSE ) res[, .(lag_years, n, n_excluded, HR, CI_lower, CI_upper, p_value)]dt <- ops_toy(scenario = "association", n = 500) dt <- dt[dm_timing != 1L] res <- assoc_lag( data = dt, outcome_col = "dm_status", time_col = "dm_followup_years", exposure_col = "p20116_i0", lag_years = c(0, 1, 2), covariates = c("bmi_cat", "tdi_cat"), base = FALSE ) res[, .(lag_years, n, n_excluded, HR, CI_lower, CI_upper, p_value)]
Fits one or more linear regression models for each exposure variable and returns a tidy result table. By default, two standard adjustment models are always included:
assoc_linear( data, outcome_col, exposure_col, covariates = NULL, base = TRUE, test = c("wald", "lrt"), conf_level = 0.95 ) assoc_lm( data, outcome_col, exposure_col, covariates = NULL, base = TRUE, test = c("wald", "lrt"), conf_level = 0.95 )assoc_linear( data, outcome_col, exposure_col, covariates = NULL, base = TRUE, test = c("wald", "lrt"), conf_level = 0.95 ) assoc_lm( data, outcome_col, exposure_col, covariates = NULL, base = TRUE, test = c("wald", "lrt"), conf_level = 0.95 )
data |
(data.frame or data.table) Analysis dataset. |
outcome_col |
(character) Name of the continuous numeric outcome column. |
exposure_col |
(character) One or more exposure variable names. |
covariates |
(character or NULL) Covariate column names for the
Fully adjusted model. Default: |
base |
(logical) Include Unadjusted and Age and sex
adjusted models. Default: |
test |
(character) P-value method for linear models: |
conf_level |
(numeric) Confidence level. Default: |
Unadjusted - no covariates (crude).
Age and sex adjusted - age + sex auto-detected from
standard UKB names (p21022/p31) or decoded names
(age_at_recruitment/sex). Errors if either column cannot
be found.
Fully adjusted - the covariates supplied via the
covariates argument. Only run when covariates is non-NULL.
Outcome: intended for continuous numeric variables. Passing a
binary (0/1) or logical column is permitted (linear probability model) but
will trigger a warning recommending assoc_logistic instead.
CI method: based on the t-distribution via confint.lm(),
which is exact under the normal linear model assumption. There is no
ci_method argument (unlike assoc_logistic) as profile
likelihood does not apply to lm.
SE column: the standard error of is included to
support downstream meta-analysis and GWAS-style summary statistics.
P-value method: test = "wald" returns coefficient-level
t-test p-values from summary.lm(). test = "lrt" returns an
exposure-level nested-model p-value from single-term deletion
(drop1(..., test = "F")); for factor exposures, the same overall
exposure p-value is repeated across the non-reference level rows.
A data.table with one row per exposure term
model combination, and columns:
exposureExposure variable name.
termCoefficient name (e.g. "bmi_categoryObese").
modelOrdered factor: Unadjusted <
Age and sex adjusted < Fully adjusted.
nParticipants in model (after NA removal).
betaRegression coefficient ().
seStandard error of .
CI_lowerLower confidence bound.
CI_upperUpper confidence bound.
p_valueP-value from the method selected by
test.
beta_labelFormatted string, e.g.
"0.23 (0.05-0.41)".
dt <- ops_toy(scenario = "association", n = 500) res <- assoc_linear( data = dt, outcome_col = "p21001_i0", exposure_col = "p20116_i0", covariates = c("bmi_cat", "tdi_cat"), base = FALSE )dt <- ops_toy(scenario = "association", n = 500) res <- assoc_linear( data = dt, outcome_col = "p21001_i0", exposure_col = "p20116_i0", covariates = c("bmi_cat", "tdi_cat"), base = FALSE )
Fits one or more logistic regression models for each exposure variable and returns a tidy result table suitable for downstream forest plots. By default, two standard adjustment models are always included:
assoc_logistic( data, outcome_col, exposure_col, covariates = NULL, base = TRUE, test = c("wald", "lrt"), ci_method = c("wald", "profile"), conf_level = 0.95 ) assoc_logit( data, outcome_col, exposure_col, covariates = NULL, base = TRUE, test = c("wald", "lrt"), ci_method = c("wald", "profile"), conf_level = 0.95 )assoc_logistic( data, outcome_col, exposure_col, covariates = NULL, base = TRUE, test = c("wald", "lrt"), ci_method = c("wald", "profile"), conf_level = 0.95 ) assoc_logit( data, outcome_col, exposure_col, covariates = NULL, base = TRUE, test = c("wald", "lrt"), ci_method = c("wald", "profile"), conf_level = 0.95 )
data |
(data.frame or data.table) Analysis dataset. |
outcome_col |
(character) Binary outcome column ( |
exposure_col |
(character) One or more exposure variable names. |
covariates |
(character or NULL) Covariate column names for the
Fully adjusted model. Default: |
base |
(logical) Include Unadjusted and Age and sex
adjusted models. Default: |
test |
(character) P-value method for logistic models: |
ci_method |
(character) CI calculation method: |
conf_level |
(numeric) Confidence level. Default: |
Unadjusted - no covariates (crude).
Age and sex adjusted - age + sex auto-detected from
standard UKB names (p21022/p31) or decoded names
(age_at_recruitment/sex). Errors if either column cannot
be found.
Fully adjusted - the covariates supplied via the
covariates argument. Only run when covariates is non-NULL.
Outcome coding: outcome_col may be logical
(TRUE/FALSE) or integer/numeric (0/1).
Logical values are converted to integer internally.
CI methods:
"wald" (default) - fast, appropriate for large UKB samples.
"profile" - profile likelihood CI via confint.glm();
slower but more accurate for small or sparse data.
P-value method: test = "wald" returns coefficient-level Wald
p-values from summary.glm(). test = "lrt" returns the
exposure-level likelihood-ratio p-value from single-term deletion
(drop1(..., test = "Chisq")); for factor exposures, the same overall
exposure p-value is repeated across the non-reference level rows.
A data.table with one row per exposure term
model combination, and columns:
exposureExposure variable name.
termCoefficient name (e.g. "bmi_categoryObese").
modelOrdered factor: Unadjusted <
Age and sex adjusted < Fully adjusted.
nParticipants in model (after NA removal).
n_casesNumber of cases (outcome = 1) in model.
OROdds ratio (point estimate).
CI_lowerLower confidence bound.
CI_upperUpper confidence bound.
p_valueP-value from the method selected by
test.
OR_labelFormatted string, e.g. "1.23 (1.05-1.44)".
dt <- ops_toy(scenario = "association", n = 500) res <- assoc_logistic( data = dt, outcome_col = "dm_status", exposure_col = "p20116_i0", covariates = c("bmi_cat", "tdi_cat"), base = FALSE )dt <- ops_toy(scenario = "association", n = 500) res <- assoc_logistic( data = dt, outcome_col = "dm_status", exposure_col = "p20116_i0", covariates = c("bmi_cat", "tdi_cat"), base = FALSE )
Stratifies the dataset by a single grouping variable (by) and runs
the specified association model (coxph, logistic, or
linear) within each subgroup. Unlike assoc_coxph and
its siblings, there is no automatic age-and-sex-adjusted model: at the
subgroup level the variable that defines the stratum would have zero
variance, making the auto-detected adjustment meaningless. Accordingly,
this function does not accept a base argument. Instead, two
models are available:
assoc_subgroup( data, outcome_col, time_col = NULL, exposure_col, by, method = c("coxph", "logistic", "linear"), test = c("wald", "lrt"), covariates = NULL, interaction = TRUE, conf_level = 0.95 ) assoc_sub( data, outcome_col, time_col = NULL, exposure_col, by, method = c("coxph", "logistic", "linear"), test = c("wald", "lrt"), covariates = NULL, interaction = TRUE, conf_level = 0.95 )assoc_subgroup( data, outcome_col, time_col = NULL, exposure_col, by, method = c("coxph", "logistic", "linear"), test = c("wald", "lrt"), covariates = NULL, interaction = TRUE, conf_level = 0.95 ) assoc_sub( data, outcome_col, time_col = NULL, exposure_col, by, method = c("coxph", "logistic", "linear"), test = c("wald", "lrt"), covariates = NULL, interaction = TRUE, conf_level = 0.95 )
data |
(data.frame or data.table) Analysis dataset. |
outcome_col |
(character) Outcome column name. |
time_col |
(character or NULL) Follow-up time column (required when
|
exposure_col |
(character) One or more exposure variable names. |
by |
(character) Single stratification variable name. Its unique
non-NA values (or factor levels, in order) define the subgroups.
Should be a categorical or binary variable with a small number of
levels (e.g. sex, smoking status). Continuous variables are technically
permitted and the interaction LRT will still run, but per-unique-value
subgrouping is rarely meaningful in practice — use a pre-categorised
version (e.g. via |
method |
(character) Regression method: |
test |
(character) P-value method for subgroup-specific main
associations: |
covariates |
(character or NULL) Covariate column names for the Fully
adjusted model. When |
interaction |
(logical) Compute the LRT p-value for the exposure
|
conf_level |
(numeric) Confidence level. Default: |
Unadjusted - always run (no covariates).
Fully adjusted - run when covariates is non-NULL.
Users are responsible for excluding the by variable from
covariates (a warning is issued if it is included).
Interaction test: when interaction = TRUE (default), a
likelihood ratio test (LRT) for the exposure by interaction is
computed on the full dataset for each exposure model
combination and appended as p_interaction. LRT is preferred over
Wald because it handles factor, binary, and continuous by variables
uniformly without requiring the user to recode the by variable.
P-value method: test controls the subgroup-specific main
association p-values. It does not change p_interaction, which remains
an interaction LRT.
A data.table with one row per subgroup level
exposure term model, containing:
subgroupName of the by variable.
subgroup_levelFactor: level of the by variable
(ordered by original level / sort order).
exposureExposure variable name.
termCoefficient name from the fitted model.
modelOrdered factor: Unadjusted <
Fully adjusted.
nParticipants in model (after NA removal).
Effect estimate columns: HR/OR/beta,
CI_lower, CI_upper, p_value, and a formatted
label column. Cox models additionally include n_events and
person_years; logistic models include n_cases; linear
models include se.
p_interactionLRT p-value for the exposure
by interaction on the full dataset. Shared across all subgroup levels
for the same exposure model. NA when the
interaction model fails. Only present when
interaction = TRUE.
dt <- ops_toy(scenario = "association", n = 500) dt <- dt[dm_timing != 1L] res <- assoc_subgroup( data = dt, outcome_col = "dm_status", time_col = "dm_followup_years", exposure_col = "p20116_i0", by = "p31", method = "coxph", covariates = c("bmi_cat", "tdi_cat") )dt <- ops_toy(scenario = "association", n = 500) dt <- dt[dm_timing != 1L] res <- assoc_subgroup( data = dt, outcome_col = "dm_status", time_col = "dm_followup_years", exposure_col = "p20116_i0", by = "p31", method = "coxph", covariates = c("bmi_cat", "tdi_cat") )
Fits categorical and trend models simultaneously for each ordered-factor
exposure, returning per-category effect estimates alongside a p-value for
linear trend. Two models are run internally per exposure
adjustment combination:
assoc_trend( data, outcome_col, time_col = NULL, exposure_col, method = c("coxph", "logistic", "linear"), test = c("wald", "lrt"), covariates = NULL, base = TRUE, scores = NULL, conf_level = 0.95 ) assoc_tr( data, outcome_col, time_col = NULL, exposure_col, method = c("coxph", "logistic", "linear"), test = c("wald", "lrt"), covariates = NULL, base = TRUE, scores = NULL, conf_level = 0.95 )assoc_trend( data, outcome_col, time_col = NULL, exposure_col, method = c("coxph", "logistic", "linear"), test = c("wald", "lrt"), covariates = NULL, base = TRUE, scores = NULL, conf_level = 0.95 ) assoc_tr( data, outcome_col, time_col = NULL, exposure_col, method = c("coxph", "logistic", "linear"), test = c("wald", "lrt"), covariates = NULL, base = TRUE, scores = NULL, conf_level = 0.95 )
data |
(data.frame or data.table) Analysis dataset. |
outcome_col |
(character) Outcome column name. |
time_col |
(character or NULL) Follow-up time column (required when
|
exposure_col |
(character) One or more exposure column names. Each
must be a |
method |
(character) Regression method: |
test |
(character) P-value method: |
covariates |
(character or NULL) Covariates for the Fully adjusted
model. Default: |
base |
(logical) Include Unadjusted and Age-and-sex-adjusted models.
Default: |
scores |
(numeric or NULL) Numeric scores assigned to factor levels in
level order. Length must equal |
conf_level |
(numeric) Confidence level. Default: |
Categorical model - exposure treated as a factor; produces
one row per non-reference level (HR / OR / vs reference).
Trend model - exposure recoded as numeric scores (default:
0, 1, 2, ...); produces the per-score-unit effect estimate
(*_per_score) and p_trend.
Both results are merged: the output contains a reference row (effect = 1 /
0, CI = NA) followed by non-reference rows, with *_per_score and
p_trend appended as shared columns (same value within each
exposure model combination).
Scores: By default levels are scored 0, 1, 2, ... so the
reference group = 0 and each step = 1 unit. Supply scores to use
meaningful units (e.g. median years per category) - only p_trend and
the per-score estimate change; per-category HRs are unaffected.
Adjustment models: follows the same logic as
assoc_coxph - Unadjusted and Age-and-sex-adjusted models are
included by default (base = TRUE); a Fully adjusted model is added
when covariates is non-NULL.
P-value method: test controls both categorical-model
p_value and trend-model p_trend. For method = "linear",
"lrt" uses the conventional nested-model F test.
A data.table with one row per exposure level
model, containing:
exposureExposure variable name.
levelFactor level (ordered factor preserving original level order).
termCoefficient name as returned by the model (reference
row uses paste0(exposure, ref_level)).
modelOrdered factor: Unadjusted <
Age and sex adjusted < Fully adjusted.
nParticipants in model (after NA removal).
Effect estimate columns from the categorical model:
HR/OR/beta, CI_lower, CI_upper,
p_value, and a formatted label. Reference row has
HR = 1 / beta = 0 and NA for CI and p.
Cox models additionally include n_events and
person_years; logistic includes n_cases; linear
includes se.
HR_per_score / OR_per_score /
beta_per_score
Per-score-unit effect estimate from the trend
model. Shared across all levels within the same exposure
model.
HR_per_score_label / OR_per_score_label /
beta_per_score_label
Formatted string for the per-score estimate.
p_trendP-value for linear trend from the trend model.
Shared across all levels within the same exposure
model.
dt <- ops_toy(scenario = "association", n = 500) dt <- dt[dm_timing != 1L] res <- assoc_trend( data = dt, outcome_col = "dm_status", time_col = "dm_followup_years", exposure_col = "bmi_cat", method = "coxph", covariates = c("tdi_cat", "p20116_i0"), base = FALSE )dt <- ops_toy(scenario = "association", n = 500) dt <- dt[dm_timing != 1L] res <- assoc_trend( data = dt, outcome_col = "dm_status", time_col = "dm_followup_years", exposure_col = "bmi_cat", method = "coxph", covariates = c("tdi_cat", "p20116_i0"), base = FALSE )
Returns the complete column names recorded by audit_snapshot
for a given snapshot label. Unlike ops_snapshot_cols, this
helper does not exclude protected columns; it returns exactly the columns
stored in the audit manifest.
audit_cols(audit, label)audit_cols(audit, label)
audit |
A |
label |
(character) Snapshot label passed to
|
A character vector of column names.
aud <- audit_start("example_analysis") dt <- data.frame(eid = 1:3, x = c(1, NA, 3)) aud <- audit_snapshot(aud, dt, "raw", verbose = FALSE) audit_cols(aud, "raw")aud <- audit_start("example_analysis") dt <- data.frame(eid = 1:3, x = c(1, NA, 3)) aud <- audit_snapshot(aud, dt, "raw", verbose = FALSE) audit_cols(aud, "raw")
Appends one extraction record to a audit_start object. The
function records the declared UKB field IDs, optional dataset name, optional
label, number of fields, and timestamp. It does not validate field
availability against RAP; use ops_fields or
extract_ls separately for project-specific field discovery.
audit_fields(audit, field_id, dataset = NULL, label = NULL)audit_fields(audit, field_id, dataset = NULL, label = NULL)
audit |
A |
field_id |
(integer) UKB field IDs used for extraction. |
dataset |
(character or NULL) Optional RAP dataset file name.
Default: |
label |
(character or NULL) Optional label for this extraction record.
Default: |
The updated ukbflow_audit object.
aud <- audit_start("example_analysis") aud <- audit_fields(aud, c(31, 53, 21022), label = "core_fields")aud <- audit_start("example_analysis") aud <- audit_fields(aud, c(31, 53, 21022), label = "core_fields")
Records a DNAnexus job ID and, when available, lightweight metadata from
dx describe job-XXXX --json. The function is best-effort: if the job
cannot be described in the current environment, the job ID is still recorded
and metadata fields are set to NA. Cost is not estimated.
audit_job(audit, job_id, label = NULL)audit_job(audit, job_id, label = NULL)
audit |
A |
job_id |
(character) DNAnexus job ID, e.g. |
label |
(character or NULL) Optional label for this job record.
Default: |
The updated ukbflow_audit object.
aud <- audit_start("example_analysis") ## Not run: job_id <- extract_batch(c(31, 53, 21022)) aud <- audit_job(aud, job_id, "phenotype_extraction") ## End(Not run)aud <- audit_start("example_analysis") ## Not run: job_id <- extract_batch(c(31, 53, 21022)) aud <- audit_job(aud, job_id, "phenotype_extraction") ## End(Not run)
Stores a model result table returned by the assoc_* family in the
audit manifest. The result table is recorded directly because it is usually
small and contains the most useful analysis summary. Optional covariates can
be supplied when they already exist as a vector in the analysis script.
audit_model(audit, result, label = NULL, covariates = NULL)audit_model(audit, result, label = NULL, covariates = NULL)
audit |
A |
result |
A data.frame or data.table result table, typically returned by
|
label |
(character or NULL) Optional label for this model record.
Default: |
covariates |
(character or NULL) Optional covariate column names used
in the model. Default: |
The updated ukbflow_audit object.
aud <- audit_start("example_analysis") res <- data.frame( exposure = "smoking", term = "smokingEver", model = "Fully adjusted", n = 100, HR = 1.2, CI_lower = 1.0, CI_upper = 1.4, p_value = 0.04 ) aud <- audit_model(aud, res, "smoking_model", covariates = c("age", "sex"))aud <- audit_start("example_analysis") res <- data.frame( exposure = "smoking", term = "smokingEver", model = "Fully adjusted", n = 100, HR = 1.2, CI_lower = 1.0, CI_upper = 1.4, p_value = 0.04 ) aud <- audit_model(aud, res, "smoking_model", covariates = c("age", "sex"))
Summarises phenotype columns created by the derive_* family using
ukbflow's standard name prefix convention. The function records
whichever columns are present and marks missing components as not present.
audit_pheno(audit, data, name)audit_pheno(audit, data, name)
audit |
A |
data |
A data.frame or data.table containing derived phenotype columns. |
name |
(character) Phenotype prefix used by |
The updated ukbflow_audit object.
aud <- audit_start("example_analysis") dt <- data.frame( eid = 1:3, lung_status = c(TRUE, FALSE, TRUE), lung_date = as.Date(c("2020-01-01", NA, "2021-01-01")) ) aud <- audit_pheno(aud, dt, "lung")aud <- audit_start("example_analysis") dt <- data.frame( eid = 1:3, lung_status = c(TRUE, FALSE, TRUE), lung_date = as.Date(c("2020-01-01", NA, "2021-01-01")) ) aud <- audit_pheno(aud, dt, "lung")
Captures a lightweight structural snapshot of a data.frame at a named
analysis stage and appends it to the snapshots layer of a
ukbflow_audit object. This function mirrors the core behavior of
ops_snapshot but stores records inside the explicit audit
object rather than a session cache.
audit_snapshot( audit, data = NULL, label = NULL, reset = FALSE, check_na = TRUE, verbose = TRUE )audit_snapshot( audit, data = NULL, label = NULL, reset = FALSE, check_na = TRUE, verbose = TRUE )
audit |
A |
data |
A data.frame or data.table to snapshot. Required unless
|
label |
(character) A unique label for this snapshot, e.g.
|
reset |
(logical) If |
check_na |
(logical) Whether to count columns with any |
verbose |
(logical) Print a short status message. Default:
|
The updated ukbflow_audit object.
aud <- audit_start("example_analysis") dt <- data.frame(eid = 1:3, x = c(1, NA, 3)) aud <- audit_snapshot(aud, dt, "raw")aud <- audit_start("example_analysis") dt <- data.frame(eid = 1:3, x = c(1, NA, 3)) aud <- audit_snapshot(aud, dt, "raw")
Creates a minimal S3 audit object for one analysis. The object records only the root metadata needed to identify and reproduce the analysis context: analysis name, start time, ukbflow version, R session information, and the current DNAnexus user / project when available. Later audit helpers can append fields, snapshots, exclusions, models, and jobs to this object.
audit_start(name)audit_start(name)
name |
(character) User-defined analysis name, e.g.
|
DNAnexus context is captured opportunistically. If the dx CLI is unavailable,
the user is not logged in, or no project is selected, the corresponding
fields are recorded as NA without failing.
An S3 object with class c("ukbflow_audit", "list").
aud <- audit_start("example_analysis") audaud <- audit_start("example_analysis") aud
Writes a ukbflow_audit object to a JSON manifest. The manifest is a
plain-list representation of the audit object: session information is
converted to character lines, and record layers such as extraction
and snapshots are written as JSON arrays.
audit_write(audit, file = "ukbflow-audit.json", overwrite = FALSE)audit_write(audit, file = "ukbflow-audit.json", overwrite = FALSE)
audit |
A |
file |
(character) Output JSON file path. Default:
|
overwrite |
(logical) Overwrite |
Invisibly returns the normalized output path.
aud <- audit_start("example_analysis") outfile <- tempfile(fileext = ".json") audit_write(aud, outfile)aud <- audit_start("example_analysis") outfile <- tempfile(fileext = ".json") audit_write(aud, outfile)
Returns a list of all projects accessible to the current user.
auth_list_projects()auth_list_projects()
A character vector of project names and IDs.
## Not run: auth_list_projects() ## End(Not run)## Not run: auth_list_projects() ## End(Not run)
Authenticates with the DNAnexus Research Analysis Platform using an API
token. Equivalent to running dx login --token on the command line.
auth_login(token = NULL)auth_login(token = NULL)
token |
(character) DNAnexus API token. If NULL, reads from the
environment variable |
Invisible TRUE on success.
## Not run: # Supply token directly auth_login(token = "your_token_here") # Or store token in ~/.Renviron (recommended): # usethis::edit_r_environ() # Add: DX_API_TOKEN=your_token_here # Save and restart R, then call: auth_login() ## End(Not run)## Not run: # Supply token directly auth_login(token = "your_token_here") # Or store token in ~/.Renviron (recommended): # usethis::edit_r_environ() # Add: DX_API_TOKEN=your_token_here # Save and restart R, then call: auth_login() ## End(Not run)
Invalidates the current DNAnexus session on the remote platform. The local
token file is not removed but becomes invalid. A new token must be generated
from the DNAnexus platform before calling auth_login() again.
auth_logout()auth_logout()
Invisible TRUE on success.
## Not run: auth_logout() ## End(Not run)## Not run: auth_logout() ## End(Not run)
Switches the active project context on the DNAnexus platform. Only project
IDs (e.g. "project-XXXXXXXXXXXX") are accepted. Run
auth_list_projects() to find your project ID.
auth_select_project(project)auth_select_project(project)
project |
(character) Project ID in the form |
Invisible TRUE on success.
## Not run: auth_select_project("project-XXXXXXXXXXXX") ## End(Not run)## Not run: auth_select_project("project-XXXXXXXXXXXX") ## End(Not run)
Returns the current logged-in user and selected project.
auth_status()auth_status()
A named list with user and project.
## Not run: auth_status() ## End(Not run)## Not run: auth_status() ## End(Not run)
Renames columns from the raw UKB field ID format used by
extract_pheno (e.g. participant.p31) and
job_result (e.g. p53_i0) to human-readable
snake_case identifiers (e.g. sex,
date_of_attending_assessment_centre_i0).
decode_names(data, max_nchar = 60L)decode_names(data, max_nchar = 60L)
data |
(data.frame or data.table) Data extracted from UKB-RAP via
|
max_nchar |
(integer) Column names longer than this value are flagged.
Default: |
Column labels are taken from the UKB field title dictionary via
extract_ls. Both the dataset name and field list are cached
after the first call, so subsequent calls to decode_names() involve
no network requests.
When an auto-generated name exceeds max_nchar characters it is
flagged with a warning so you can decide whether to shorten it manually
with names(data)[...] <- .... The function never truncates names
automatically, because the right short name depends on scientific context
that only you know.
The input data with column names replaced by snake_case
labels. Returns a data.table if the input is a data.table.
## Not run: df <- extract_pheno(c(31, 53, 21022)) df <- decode_names(df) # participant.eid → eid # participant.p31 → sex # participant.p21022 → age_at_recruitment # participant.p53_i0 → date_of_attending_assessment_centre_i0 (warned if > 60) # Shorten a long name afterwards names(df)[names(df) == "date_of_attending_assessment_centre_i0"] <- "date_baseline" ## End(Not run)## Not run: df <- extract_pheno(c(31, 53, 21022)) df <- decode_names(df) # participant.eid → eid # participant.p31 → sex # participant.p21022 → age_at_recruitment # participant.p53_i0 → date_of_attending_assessment_centre_i0 (warned if > 60) # Shorten a long name afterwards names(df)[names(df) == "date_of_attending_assessment_centre_i0"] <- "date_baseline" ## End(Not run)
Converts raw integer codes produced by extract_pheno into
human-readable labels for all categorical fields
(value_type 21 and 22), using the UKB Showcase encoding tables.
Continuous, text, date, and already-decoded columns are left unchanged.
decode_values(data, metadata_dir = "data/metadata/")decode_values(data, metadata_dir = "data/metadata/")
data |
(data.frame or data.table) Data from |
metadata_dir |
(character) Directory containing |
This function requires two metadata files downloaded from the UKB Research Analysis Platform:
field.tsv - maps field IDs to encoding IDs and value types.
esimpint.tsv - maps encoding ID + integer code to label.
Download them once with:
fetch_metadata(dest_dir = "data/metadata")
Both files are cached in the session after the first read.
Call order: use decode_values() before
decode_names, so that column names still contain the numeric
field ID needed to look up the encoding.
The input data with categorical columns replaced by character
labels. Returns a data.table if the input is a data.table.
## Not run: # Download metadata once fetch_metadata(dest_dir = "data/metadata") # Recommended call order df <- extract_pheno(c(31, 54, 20116, 21000)) df <- decode_values(df) # 0/1 → "Female"/"Male", etc. df <- decode_names(df) # participant.p31 → sex ## End(Not run)## Not run: # Download metadata once fetch_metadata(dest_dir = "data/metadata") # Recommended call order df <- extract_pheno(c(31, 54, 20116, 21000)) df <- decode_values(df) # 0/1 → "Female"/"Male", etc. df <- decode_names(df) # participant.p31 → sex ## End(Not run)
For each name in name, adds one column age_at_{name} (numeric,
years) computed as:
derive_age( data, name, baseline_col, age_col, date_cols = NULL, status_cols = NULL )derive_age( data, name, baseline_col, age_col, date_cols = NULL, status_cols = NULL )
data |
(data.frame or data.table) UKB phenotype data. |
name |
(character) One or more output prefixes, e.g.
|
baseline_col |
(character) Name of the baseline date column
(e.g. |
age_col |
(character) Name of the age-at-baseline column
(e.g. |
date_cols |
(character or NULL) Named character vector mapping each
name to its event date column, e.g.
|
status_cols |
(character or NULL) Named character vector mapping each
name to its status column. |
The value is NA for participants who did not experience the event
(status is FALSE / 0) or who lack an event date.
Auto-detection per name (when date_cols / status_cols
are NULL):
date_col - looked up as {name}_date.
status_col - looked up first as {name}_status, then
as {name} (logical column); if neither exists all rows with a
non-NA date are treated as cases.
data.table pass-by-reference: new columns are added in-place.
The input data with one new age_at_{name} column
per entry in name, added in-place.
dt <- ops_toy(scenario = "association", n = 100) derive_age(dt, name = "dm", baseline_col = "p53_i0", age_col = "p21022")dt <- ops_toy(scenario = "association", n = 100) derive_age(dt, name = "dm", baseline_col = "p53_i0", age_col = "p21022")
The UK Biobank cancer registry links each participant's cancer diagnoses
from national cancer registries. Each diagnosis is stored as a separate
instance with four parallel fields: ICD-10 code (p40006), histology
code (p40011), behaviour code (p40012), and diagnosis date
(p40005). Unlike HES or self-report data, each instance holds
exactly one record - there is no array (a*) dimension.
derive_cancer_registry( data, name, icd10 = NULL, histology = NULL, behaviour = NULL, code_cols = NULL, hist_cols = NULL, behv_cols = NULL, date_cols = NULL )derive_cancer_registry( data, name, icd10 = NULL, histology = NULL, behaviour = NULL, code_cols = NULL, hist_cols = NULL, behv_cols = NULL, date_cols = NULL )
data |
(data.frame or data.table) UKB phenotype data containing cancer registry fields. |
name |
(character) Output column prefix, e.g. |
icd10 |
(character or NULL) Regular expression matched against the
ICD-10 code column ( |
histology |
(integer vector or NULL) Histology codes to retain
( |
behaviour |
(integer vector or NULL) Behaviour codes to retain
( |
code_cols |
(character or NULL) Names of ICD-10 code columns
( |
hist_cols |
(character or NULL) Names of histology columns
( |
behv_cols |
(character or NULL) Names of behaviour columns
( |
date_cols |
(character or NULL) Names of diagnosis date columns
( |
All three filter arguments (icd10, histology,
behaviour) are applied with AND logic: a record must satisfy
every non-NULL filter to be counted. For OR conditions (e.g.\
D04 or C44 with specific histology), call the function twice and
combine the resulting columns downstream.
{name}_cancerLogical flag: TRUE if any cancer
registry record satisfies all supplied filters.
{name}_cancer_dateEarliest matching diagnosis date
(IDate).
The input data with two new columns added in-place:
{name}_cancer (logical) and {name}_cancer_date (IDate).
Always returns a data.table.
dt <- ops_toy(n = 100) derive_cancer_registry(dt, name = "skin", icd10 = "^C44") derive_cancer_registry(dt, name = "invasive", icd10 = "^C", behaviour = 3L)dt <- ops_toy(n = 100) derive_cancer_registry(dt, name = "skin", icd10 = "^C44") derive_cancer_registry(dt, name = "invasive", icd10 = "^C", behaviour = 3L)
Takes the self-report flag and ICD-10 flag produced by
derive_selfreport and derive_icd10 (or any
pair of logical columns) and merges them into a single logical case status
and earliest date.
derive_case( data, name, icd10_col = NULL, selfreport_col = NULL, icd10_date_col = NULL, selfreport_date_col = NULL )derive_case( data, name, icd10_col = NULL, selfreport_col = NULL, icd10_date_col = NULL, selfreport_date_col = NULL )
data |
(data.frame or data.table) UKB phenotype data. |
name |
(character) Column prefix used both to locate the default
input columns and to name the output columns. Defaults:
|
icd10_col |
(character or NULL) Name of the ICD-10 status column.
|
selfreport_col |
(character or NULL) Name of the self-report status
column. |
icd10_date_col |
(character or NULL) Name of the ICD-10 date column.
|
selfreport_date_col |
(character or NULL) Name of the self-report date
column. |
{name}_statusLogical: TRUE if positive in at
least one source.
{name}_dateEarliest diagnosis/report date across both
sources (IDate).
The input data with two new columns added in-place:
{name}_status (logical) and {name}_date (IDate).
Always returns a data.table.
dt <- ops_toy(n = 100) derive_selfreport(dt, name = "htn", regex = "hypertension", field = "noncancer", disease_cols = paste0("p20002_i0_a", 0:4), date_cols = paste0("p20008_i0_a", 0:4), visit_cols = "p53_i0") derive_icd10(dt, name = "htn", icd10 = "I10", source = c("hes", "death")) derive_case(dt, name = "htn")dt <- ops_toy(n = 100) derive_selfreport(dt, name = "htn", regex = "hypertension", field = "noncancer", disease_cols = paste0("p20002_i0_a", 0:4), date_cols = paste0("p20008_i0_a", 0:4), visit_cols = "p53_i0") derive_icd10(dt, name = "htn", icd10 = "I10", source = c("hes", "death")) derive_case(dt, name = "htn")
Converts decoded UKB columns to analysis-ready types: character-encoded
numeric fields to numeric, and categorical fields to factor.
Prints a concise summary for each converted column - mean / median / SD /
missing rate for numeric columns, and level counts for factor columns - so
you can verify distributions without leaving the pipeline.
derive_covariate( data, as_numeric = NULL, as_factor = NULL, factor_levels = NULL, max_levels = 5L )derive_covariate( data, as_numeric = NULL, as_factor = NULL, factor_levels = NULL, max_levels = 5L )
data |
(data.frame or data.table) UKB data, typically output of
|
as_numeric |
(character or NULL) Column names to convert to
|
as_factor |
(character or NULL) Column names to convert to
|
factor_levels |
(named list or NULL) Custom level ordering for specific
factor columns. Names must match entries in |
max_levels |
(integer) Factor columns with more levels than this
threshold trigger a warning suggesting the user consider collapsing
categories. Default: |
data.table pass-by-reference: when the input is a
data.table, modifications are made in-place. Pass
data.table::copy(data) to preserve the original.
The input data with converted columns. Always returns a
data.table.
dt <- ops_toy(n = 100) dt <- derive_missing(dt) # messy_label is a character column with numeric-looking values ("-1", "999") # mixed with non-parseable strings — demonstrates coercion and NA warnings derive_covariate(dt, as_numeric = "messy_label", as_factor = c("p31", "p20116_i0"), factor_levels = list( p20116_i0 = c("Never", "Previous", "Current") ) )dt <- ops_toy(n = 100) dt <- derive_missing(dt) # messy_label is a character column with numeric-looking values ("-1", "999") # mixed with non-parseable strings — demonstrates coercion and NA warnings derive_covariate(dt, as_numeric = "messy_label", as_factor = c("p31", "p20116_i0"), factor_levels = list( p20116_i0 = c("Never", "Previous", "Current") ) )
Creates a new factor column by binning a continuous variable into n
groups. When breaks is omitted, group boundaries are derived from
quantiles of the observed data (equal-frequency binning). When
breaks is supplied, those values are used as interior cut points.
derive_cut(data, col, n, breaks = NULL, labels = NULL, name = NULL)derive_cut(data, col, n, breaks = NULL, labels = NULL, name = NULL)
data |
(data.frame or data.table) UKB data. |
col |
(character) Name of the source numeric column. |
n |
(integer) Number of groups. Any integer |
breaks |
(numeric vector or NULL) Interior cut points; length must
equal |
labels |
(character vector or NULL) Group labels of length |
name |
(character or NULL) Name for the new column. Defaults to
|
Before binning, a numeric summary (mean, median, SD, Q1, Q3, missing rate) is printed for the source column. After binning, the group distribution is printed via an internal summary helper.
Only one column can be processed per call; loop over columns explicitly when binning multiple variables.
data.table pass-by-reference: when the input is a
data.table, the new column is added in-place. Pass
data.table::copy(data) to preserve the original.
The input data with one new factor column appended. Always
returns a data.table.
dt <- ops_toy(n = 100) derive_cut(dt, col = "p21001_i0", n = 3) # Custom breaks and labels derive_cut(dt, col = "p21001_i0", n = 3, breaks = c(25, 30), labels = c("<25", "25-30", "30+"), name = "bmi_group")dt <- ops_toy(n = 100) derive_cut(dt, col = "p21001_i0", n = 3) # Custom breaks and labels derive_cut(dt, col = "p21001_i0", n = 3, breaks = c(25, 30), labels = c("<25", "25-30", "30+"), name = "bmi_group")
Death registry records store the underlying (primary) cause of death in
field p40001 and contributory (secondary) causes in field
p40002, both coded in ICD-10. The date of death is in field
p40000. All three fields have an instance dimension (i0,
i1) reflecting potential amendments; p40002 additionally
has an array dimension (a0, a1, ...).
derive_death_registry( data, name, icd10, match = c("prefix", "exact", "regex"), primary_cols = NULL, secondary_cols = NULL, date_cols = NULL )derive_death_registry( data, name, icd10, match = c("prefix", "exact", "regex"), primary_cols = NULL, secondary_cols = NULL, date_cols = NULL )
data |
(data.frame or data.table) UKB phenotype data containing death registry fields. |
name |
(character) Output column prefix, e.g. |
icd10 |
(character) ICD-10 code(s) to match. For |
match |
(character) Matching strategy: |
primary_cols |
(character or NULL) Names of primary cause columns
( |
secondary_cols |
(character or NULL) Names of secondary cause
columns ( |
date_cols |
(character or NULL) Names of death date columns
( |
{name}_deathLogical flag: TRUE if any death
registry record (primary or secondary cause) contains a matching
ICD-10 code.
{name}_death_dateEarliest death date across matching
instances (IDate). Note: this is the date of death,
not onset date.
The input data with two new columns added in-place:
{name}_death (logical) and {name}_death_date (IDate).
Always returns a data.table.
dt <- ops_toy(n = 100) derive_death_registry(dt, name = "mi", icd10 = "I21") derive_death_registry(dt, name = "lung", icd10 = "C34")dt <- ops_toy(n = 100) derive_death_registry(dt, name = "mi", icd10 = "I21") derive_death_registry(dt, name = "lung", icd10 = "C34")
UKB pre-computes the earliest recorded date for hundreds of ICD-10 chapters
and categories as First Occurrence fields (p131xxx). Each
field contains a single date per participant; no array or instance depth is
involved. This function reads that date, converts it to IDate,
and writes two analysis-ready columns:
derive_first_occurrence(data, name, field, col = NULL)derive_first_occurrence(data, name, field, col = NULL)
data |
(data.frame or data.table) UKB phenotype data. |
name |
(character) Output column prefix, e.g. |
field |
(integer or character) UKB field ID of the First Occurrence
field, e.g. |
col |
(character or NULL) Name of the source column in |
{name}_fo_dateEarliest First Occurrence date (IDate).
Values that cannot be coerced to a valid date (e.g. UKB error codes)
are silently set to NA.
{name}_foLogical flag derived from
{name}_fo_date: TRUE if and only if a valid date exists.
This guarantees that every positive case has a usable date - essential
for time-to-event and prevalent/incident classification.
Column detection: the function locates the source column
automatically from field, handling both the raw format used by
extract_pheno (participant.p131666) and the
snake_case format produced by decode_names
(date_e11_first_reported_type_2_diabetes). Supply col
to override auto-detection.
data.table pass-by-reference: when the input is a
data.table, new columns are added in-place via :=.
The returned object and the original variable point to the same memory.
The input data (invisibly) with two new columns added
in-place: {name}_fo (logical) and {name}_fo_date (IDate).
dt <- ops_toy(n = 100) derive_first_occurrence(dt, name = "outcome", field = 131742L, col = "p131742")dt <- ops_toy(n = 100) derive_first_occurrence(dt, name = "outcome", field = 131742L, col = "p131742")
Adds two columns to data:
{name}_followup_end (IDate) - the earliest of the outcome
event date, death date, lost-to-follow-up date, and the administrative
censoring date.
{name}_followup_years (numeric) - time in years from
baseline_col to {name}_followup_end.
derive_followup( data, name, event_col, baseline_col, censor_date, death_col = NULL, lost_col = NULL )derive_followup( data, name, event_col, baseline_col, censor_date, death_col = NULL, lost_col = NULL )
data |
(data.frame or data.table) UKB phenotype data. |
name |
(character) Output column prefix, e.g. |
event_col |
(character) Name of the outcome event date column
(e.g. |
baseline_col |
(character) Name of the baseline date column
(e.g. |
censor_date |
(Date or character) Scalar administrative censoring date,
e.g. |
death_col |
(character or NULL) Name of the death date column
(UKB field 40000). |
lost_col |
(character or NULL) Name of the lost-to-follow-up date
column (UKB field 191). |
data.table pass-by-reference: when the input is a
data.table, new columns are added in-place via :=.
The input data with two new columns added in-place:
{name}_followup_end (IDate) and {name}_followup_years
(numeric).
dt <- ops_toy(n = 100) derive_hes(dt, name = "htn", icd10 = "I10") derive_followup(dt, name = "htn_hes", event_col = "htn_hes_date", baseline_col = "p53_i0", censor_date = as.Date("2022-06-01"), death_col = "p40000_i0", lost_col = FALSE)dt <- ops_toy(n = 100) derive_hes(dt, name = "htn", icd10 = "I10") derive_followup(dt, name = "htn_hes", event_col = "htn_hes_date", baseline_col = "p53_i0", censor_date = as.Date("2022-06-01"), death_col = "p40000_i0", lost_col = FALSE)
Hospital Episode Statistics (HES) inpatient records store ICD-10 diagnosis
codes in field p41270 (single JSON-array column on UKB RAP) and
corresponding first-diagnosis dates in field p41280
(p41280_a0, p41280_a1, ...). The array index in
p41270 and p41280 are aligned: the N-th code in the
JSON array corresponds to p41280_aN (date of first in-patient
diagnosis for that code).
derive_hes( data, name, icd10, match = c("prefix", "exact", "regex"), disease_cols = NULL, date_cols = NULL )derive_hes( data, name, icd10, match = c("prefix", "exact", "regex"), disease_cols = NULL, date_cols = NULL )
data |
(data.frame or data.table) UKB phenotype data containing HES
fields ( |
name |
(character) Output column prefix, e.g. |
icd10 |
(character) ICD-10 code(s) to match. For |
match |
(character) Matching strategy: |
disease_cols |
(character or NULL) Name of the |
date_cols |
(character or NULL) Names of |
Diagnosis position: derive_hes() uses field
p41270, which represents any recorded HES inpatient ICD-10
diagnosis position. A participant is counted as a case if any matching
ICD-10 code appears in p41270. The function does not currently
distinguish main/primary diagnoses (p41202) from secondary diagnoses
(p41204).
{name}_hesLogical flag: TRUE if any HES record
contains a matching ICD-10 code.
{name}_hes_dateEarliest first-diagnosis date across all
matching codes (IDate). NA if no date is available.
The input data with two new columns added in-place:
{name}_hes (logical) and {name}_hes_date (IDate).
Always returns a data.table.
dt <- ops_toy(n = 100) derive_hes(dt, name = "htn", icd10 = "I10") derive_hes(dt, name = "diabetes", icd10 = c("E10", "E11")) derive_hes(dt, name = "asthma", icd10 = "^J4", match = "regex")dt <- ops_toy(n = 100) derive_hes(dt, name = "htn", icd10 = "I10") derive_hes(dt, name = "diabetes", icd10 = c("E10", "E11")) derive_hes(dt, name = "asthma", icd10 = "^J4", match = "regex")
A high-level wrapper that calls one or more of derive_hes,
derive_death_registry, derive_first_occurrence,
and derive_cancer_registry according to the source
argument, then combines their results into a single status flag and
earliest-date column.
derive_icd10( data, name, icd10, source = c("hes", "death", "first_occurrence", "cancer_registry"), match = c("prefix", "exact", "regex"), fo_field = NULL, fo_col = NULL, histology = NULL, behaviour = NULL, hes_code_col = NULL, hes_date_cols = NULL, primary_cols = NULL, secondary_cols = NULL, death_date_cols = NULL, cr_code_cols = NULL, cr_hist_cols = NULL, cr_behv_cols = NULL, cr_date_cols = NULL )derive_icd10( data, name, icd10, source = c("hes", "death", "first_occurrence", "cancer_registry"), match = c("prefix", "exact", "regex"), fo_field = NULL, fo_col = NULL, histology = NULL, behaviour = NULL, hes_code_col = NULL, hes_date_cols = NULL, primary_cols = NULL, secondary_cols = NULL, death_date_cols = NULL, cr_code_cols = NULL, cr_hist_cols = NULL, cr_behv_cols = NULL, cr_date_cols = NULL )
data |
(data.frame or data.table) UKB phenotype data. |
name |
(character) Output column prefix, e.g. |
icd10 |
(character) ICD-10 code(s) to match. For |
source |
(character) One or more of |
match |
(character) Matching strategy passed to |
fo_field |
(integer or character or NULL) UKB field ID for the
First Occurrence column (e.g. |
fo_col |
(character or NULL) Column name of the First Occurrence
field in |
histology |
(integer vector or NULL) Passed to
|
behaviour |
(integer vector or NULL) Passed to
|
hes_code_col |
(character or NULL) Passed as |
hes_date_cols |
(character or NULL) Passed as |
primary_cols |
(character or NULL) Passed to
|
secondary_cols |
(character or NULL) Passed to
|
death_date_cols |
(character or NULL) Passed as |
cr_code_cols |
(character or NULL) Passed as |
cr_hist_cols |
(character or NULL) Passed as |
cr_behv_cols |
(character or NULL) Passed as |
cr_date_cols |
(character or NULL) Passed as |
All intermediate source columns ({name}_hes, {name}_death,
{name}_fo, {name}_cancer and their _date counterparts)
are retained in data so that per-source contributions remain
traceable.
{name}_icd10Logical flag: TRUE if any selected
source contains a matching record.
{name}_icd10_dateEarliest matching date across all
selected sources (IDate).
The input data with {name}_icd10 (logical) and
{name}_icd10_date (IDate) added in-place, plus all intermediate
source columns. Always returns a data.table.
dt <- ops_toy(n = 100) derive_icd10(dt, name = "htn", icd10 = "I10", source = c("hes", "death")) derive_icd10(dt, name = "mi", icd10 = "I21", source = c("hes", "death", "first_occurrence"), fo_col = "p131742")dt <- ops_toy(n = 100) derive_icd10(dt, name = "htn", icd10 = "I10", source = c("hes", "death")) derive_icd10(dt, name = "mi", icd10 = "I21", source = c("hes", "death", "first_occurrence"), fo_col = "p131742")
After decode_values converts categorical codes to character
labels, some values represent meaningful non-response rather than true data:
"Do not know", "Prefer not to answer", and
"Prefer not to say". This function either converts them to NA
(for complete-case analysis) or retains them as "Unknown" (to
preserve the informative missingness as a model category).
derive_missing( data, cols = tidyselect::everything(), action = c("na", "unknown"), extra_labels = NULL )derive_missing( data, cols = tidyselect::everything(), action = c("na", "unknown"), extra_labels = NULL )
data |
(data.frame or data.table) Decoded UKB data, typically output
of |
cols |
(tidyselect) Columns to process. Default: |
action |
(character) One of
Empty strings are always converted to |
extra_labels |
(character or NULL) Additional labels to treat as
informative missing, appended to the built-in list. Default: |
Empty strings ("") are always converted to NA regardless of
action, as they carry no informational content.
Only character columns are modified; numeric, integer, Date, and logical columns are silently skipped.
data.table pass-by-reference: when the input is a
data.table, modifications are made in-place via
set. The returned object and the original variable
point to the same memory. If you need to preserve the original, pass
data.table::copy(data) instead.
The input data with missing labels replaced in-place.
Always returns a data.table.
dt <- ops_toy(n = 100) derive_missing(dt) # Retain informative non-response as a model category derive_missing(dt, action = "unknown") # Add a custom label derive_missing(dt, extra_labels = "Not applicable")dt <- ops_toy(n = 100) derive_missing(dt) # Retain informative non-response as a model category derive_missing(dt, action = "unknown") # Add a custom label derive_missing(dt, extra_labels = "Not applicable")
Searches UKB self-reported illness fields across all instances and arrays,
matches records against a user-supplied regex, parses the associated
year-month date, and appends two columns to the data:
{name}_selfreport (logical) and {name}_selfreport_date
(IDate, earliest matching instance).
derive_selfreport( data, name, regex, field = c("noncancer", "cancer"), ignore_case = TRUE, disease_cols = NULL, date_cols = NULL, visit_cols = NULL )derive_selfreport( data, name, regex, field = c("noncancer", "cancer"), ignore_case = TRUE, disease_cols = NULL, date_cols = NULL, visit_cols = NULL )
data |
(data.frame or data.table) UKB data containing self-report fields. |
name |
(character) Output column name prefix, e.g. |
regex |
(character) Regular expression matched against disease text
values (after |
field |
(character) Self-report field type: |
ignore_case |
(logical) Should regex matching ignore case?
Default: |
disease_cols |
(character or NULL) Column name(s) containing disease
text (p20002 or p20001). |
date_cols |
(character or NULL) Column name(s) containing the
self-report year-month date (p20008 or p20006). |
visit_cols |
(character or NULL) Column name(s) containing the
baseline assessment date (p53). |
The function auto-detects the relevant columns using the
extract_ls() field dictionary cache (populated by
extract_ls() or extract_pheno()). The three _cols
parameters let you override auto-detection when column names have been
customised (e.g. after decode_names()).
Field mapping by field:
"noncancer": disease text = p20002, date = p20008.
"cancer": disease text = p20001, date = p20006.
Baseline visit date (p53) is used as a fallback when no specific diagnosis date is recorded.
data.table pass-by-reference: new columns are added in-place.
Pass data.table::copy(data) to preserve the original.
The input data with two new columns appended in-place:
{name}_selfreport (logical) and
{name}_selfreport_date (IDate). Always returns a
data.table.
dt <- ops_toy(n = 100) # disease_cols / date_cols / visit_cols can be omitted — auto-detected from # column names matching p20002_* / p20008_* / p53_* respectively derive_selfreport(dt, name = "htn", regex = "hypertension", field = "noncancer")dt <- ops_toy(n = 100) # disease_cols / date_cols / visit_cols can be omitted — auto-detected from # column names matching p20002_* / p20008_* / p53_* respectively derive_selfreport(dt, name = "htn", regex = "hypertension", field = "noncancer")
Assigns each participant an integer timing category based on whether their disease date falls before or after the baseline visit date:
derive_timing(data, name, baseline_col, status_col = NULL, date_col = NULL)derive_timing(data, name, baseline_col, status_col = NULL, date_col = NULL)
data |
(data.frame or data.table) UKB phenotype data. |
name |
(character) Output column prefix. The new column is named
|
baseline_col |
(character) Name of the baseline date column in
|
status_col |
(character or NULL) Name of the logical disease flag.
|
date_col |
(character or NULL) Name of the disease date column
( |
0No disease (status_col is FALSE).
1Prevalent - disease date on or before baseline.
2Incident - disease date strictly after baseline.
NACase with missing date; timing cannot be determined.
Call once per timing variable needed (e.g. once for the combined case, once per individual source).
The input data with one new integer column
{name}_timing (0/1/2/NA) added in-place.
Always returns a data.table.
dt <- ops_toy(n = 100) derive_hes(dt, name = "htn", icd10 = "I10") derive_timing(dt, name = "htn_hes", status_col = "htn_hes", date_col = "htn_hes_date", baseline_col = "p53_i0")dt <- ops_toy(n = 100) derive_hes(dt, name = "htn", icd10 = "I10") derive_timing(dt, name = "htn_hes", status_col = "htn_hes", date_col = "htn_hes_date", baseline_col = "p53_i0")
Submits an asynchronous table-exporter job on the DNAnexus Research
Analysis Platform for large-scale phenotype extraction. Use this instead
of extract_pheno() when extracting many fields (e.g. 50+).
extract_batch( field_id, dataset = NULL, file = NULL, instance_type = NULL, priority = c("low", "high") )extract_batch( field_id, dataset = NULL, file = NULL, instance_type = NULL, priority = c("low", "high") )
field_id |
(integer) Vector of UKB Field IDs to extract.
|
dataset |
(character) Dataset file name. Default: |
file |
(character) Output file name on the cloud (without extension),
e.g. |
instance_type |
(character) DNAnexus instance type, e.g.
|
priority |
(character) Job scheduling priority. |
The job runs on the cloud and typically completes in 20-40 minutes.
Monitor progress and retrieve results using the job_ series.
Invisibly returns the job ID string (e.g. "job-XXXX").
## Not run: job_id <- extract_batch(core_field_ids) job_id <- extract_batch(core_field_ids, file = "ad_cscc_pheno") job_id <- extract_batch(core_field_ids, priority = "high") # Monitor: job_status(job_id) # Download: job_result(job_id, dest = "data/pheno.csv") ## End(Not run)## Not run: job_id <- extract_batch(core_field_ids) job_id <- extract_batch(core_field_ids, file = "ad_cscc_pheno") job_id <- extract_batch(core_field_ids, priority = "high") # Monitor: job_status(job_id) # Download: job_result(job_id, dest = "data/pheno.csv") ## End(Not run)
Returns a data.frame of all fields available for extraction in the current UKB project dataset. Fields reflect what has been approved for your project — not all UKB fields are present.
extract_ls(dataset = NULL, pattern = NULL, refresh = FALSE)extract_ls(dataset = NULL, pattern = NULL, refresh = FALSE)
dataset |
(character) Dataset file name, e.g.
|
pattern |
(character) Optional regex to filter results by
|
refresh |
(logical) Force re-fetch from cloud, ignoring cache.
Default: |
Results are cached in the session after the first call. Subsequent calls
return instantly from cache. Use refresh = TRUE to force a new
network request (e.g. after switching projects).
A data.frame with columns:
Full field name as used in extraction, e.g.
"participant.p31", "participant.p53_i0".
Human-readable field description, e.g.
"Sex", "Date of attending assessment centre | Instance 0".
## Not run: # List all approved fields extract_ls() # Search by keyword extract_ls(pattern = "cancer") extract_ls(pattern = "p31|p53|p22009") # Force refresh after switching projects extract_ls(refresh = TRUE) ## End(Not run)## Not run: # List all approved fields extract_ls() # Search by keyword extract_ls(pattern = "cancer") extract_ls(pattern = "p31|p53|p22009") # Force refresh after switching projects extract_ls(refresh = TRUE) ## End(Not run)
Extracts phenotypic fields from the UKB Research Analysis Platform dataset
and returns a data.table. All instances and arrays are returned for
each requested field. Column names are kept as-is (e.g.
participant.p53_i0); use decode_names() for renaming.
extract_pheno(field_id, dataset = NULL, timeout = 300)extract_pheno(field_id, dataset = NULL, timeout = 300)
field_id |
(integer) Vector of UKB Field IDs to extract, e.g.
|
dataset |
(character) Dataset file name. Default: |
timeout |
(integer) Extraction timeout in seconds. Default: |
A data.table with one row per participant. Column names
follow the participant.p<id>_i<n>_a<m> convention.
Fields not found are skipped with a warning.
## Not run: df <- extract_pheno(c(31, 53, 21022)) df <- extract_pheno(c(31, 53, 20002), dataset = "app12345_20260101.dataset") ## End(Not run)## Not run: df <- extract_pheno(c(31, 53, 21022)) df <- extract_pheno(c(31, 53, 20002), dataset = "app12345_20260101.dataset") ## End(Not run)
Downloads field.tsv from the Showcase metadata/ folder on the
DNAnexus Research Analysis Platform. This file contains the complete UKB
data dictionary: field IDs, titles, value types, and encoding references.
fetch_field(dest_dir, overwrite = FALSE, resume = FALSE, verbose = TRUE)fetch_field(dest_dir, overwrite = FALSE, resume = FALSE, verbose = TRUE)
dest_dir |
(character) Destination directory. Created automatically if it does not exist. |
overwrite |
(logical) Overwrite existing local file. Default:
|
resume |
(logical) Resume an interrupted download. Default:
|
verbose |
(logical) Show download progress. Default: |
Invisibly returns the local file path as a character string.
## Not run: fetch_field(dest_dir = "metadata") fetch_field(dest_dir = "metadata", overwrite = TRUE) ## End(Not run)## Not run: fetch_field(dest_dir = "metadata") fetch_field(dest_dir = "metadata", overwrite = TRUE) ## End(Not run)
Downloads one file or all files within a folder from RAP project storage to the current directory or a specified destination within the RAP environment. This function must be called from within RAP.
fetch_file(path, dest_dir, overwrite = FALSE, resume = FALSE, verbose = TRUE)fetch_file(path, dest_dir, overwrite = FALSE, resume = FALSE, verbose = TRUE)
path |
(character) Remote file or folder path. |
dest_dir |
(character) Destination directory. Created automatically if it does not exist. |
overwrite |
(logical) Overwrite existing local files. Default:
|
resume |
(logical) Resume an interrupted download. Useful for large
files (e.g. |
verbose |
(logical) Show download progress. Default: |
Invisibly returns the local file path(s) as a character vector.
## Not run: # Download a single metadata file fetch_file("Showcase metadata/field.tsv", dest_dir = "data/") # Download an entire folder fetch_file("Showcase metadata/", dest_dir = "data/metadata/") # Resume an interrupted download fetch_file("results/summary_stats.csv", dest_dir = "data/", resume = TRUE) ## End(Not run)## Not run: # Download a single metadata file fetch_file("Showcase metadata/field.tsv", dest_dir = "data/") # Download an entire folder fetch_file("Showcase metadata/", dest_dir = "data/metadata/") # Resume an interrupted download fetch_file("results/summary_stats.csv", dest_dir = "data/", resume = TRUE) ## End(Not run)
Returns a structured data.frame describing the contents of a remote
DNAnexus Research Analysis Platform (RAP) directory. Analogous to
file_info() but for remote project storage.
fetch_ls(path = ".", type = "all", pattern = NULL)fetch_ls(path = ".", type = "all", pattern = NULL)
path |
(character) Remote path to list. Default: |
type |
(character) Filter results by entry type: |
pattern |
(character) Optional regex to filter by name, e.g.
|
A data.frame with columns:
Entry name (no trailing slash for folders).
"file" or "folder".
File size string (e.g. "120.94 GB"), NA for
folders or non-file objects.
Last modified time (POSIXct), NA for
folders.
## Not run: fetch_ls() fetch_ls("Showcase metadata/", type = "file") fetch_ls("results/", pattern = "\\.csv$") ## End(Not run)## Not run: fetch_ls() fetch_ls("Showcase metadata/", type = "file") fetch_ls("results/", pattern = "\\.csv$") ## End(Not run)
Downloads the entire Showcase metadata/ folder from the DNAnexus
Research Analysis Platform to a local directory. This includes
field.tsv, encoding.tsv, and all associated encoding tables.
fetch_metadata(dest_dir, overwrite = FALSE, resume = FALSE, verbose = TRUE)fetch_metadata(dest_dir, overwrite = FALSE, resume = FALSE, verbose = TRUE)
dest_dir |
(character) Destination directory. Created automatically if it does not exist. |
overwrite |
(logical) Overwrite existing local files. Default:
|
resume |
(logical) Resume interrupted downloads. Default: |
verbose |
(logical) Show download progress. Default: |
Invisibly returns the local file paths as a character vector.
## Not run: fetch_metadata(dest_dir = "metadata") fetch_metadata(dest_dir = "metadata", overwrite = TRUE) ## End(Not run)## Not run: fetch_metadata(dest_dir = "metadata") fetch_metadata(dest_dir = "metadata", overwrite = TRUE) ## End(Not run)
Displays the remote directory structure in a tree-like format by
recursively listing sub-folders up to max_depth. Analogous to
file_tree() but for remote project storage.
fetch_tree(path = ".", max_depth = 2, verbose = TRUE)fetch_tree(path = ".", max_depth = 2, verbose = TRUE)
path |
(character) Remote root path. Default: |
max_depth |
(integer) Maximum recursion depth. Default: |
verbose |
(logical) Whether to print the tree to the console.
Default: |
Invisibly returns a character vector of tree lines.
Each level of recursion triggers one HTTPS API call per folder. Deep trees
(e.g. max_depth > 3) on large UKB projects may issue 100+ network
requests, causing the console to hang for tens of seconds or time out.
Keep max_depth at 2-3 for interactive use.
## Not run: fetch_tree() fetch_tree("Bulk/", max_depth = 2) fetch_tree(verbose = FALSE) ## End(Not run)## Not run: fetch_tree() fetch_tree("Bulk/", max_depth = 2) fetch_tree(verbose = FALSE) ## End(Not run)
Generates temporary HTTPS URLs for files on the DNAnexus Research Analysis Platform. For a single file, returns one URL. For a folder, lists all files inside and returns a named character vector of URLs. This function must be called from within RAP, as a pre-authenticated URL is itself a means of retrieving the underlying file.
fetch_url(path, duration = "1d")fetch_url(path, duration = "1d")
path |
(character) Remote file path or folder path, e.g.
|
duration |
(character) How long the URLs remain valid. Accepts
suffixes: |
A named character vector of pre-authenticated HTTPS URLs. Names are the file names.
## Not run: # Single file fetch_url("Showcase metadata/field.tsv") # Entire folder fetch_url("Showcase metadata/", duration = "7d") ## End(Not run)## Not run: # Single file fetch_url("Showcase metadata/field.tsv") # Entire folder fetch_url("Showcase metadata/", duration = "7d") ## End(Not run)
Submits one Swiss Army Knife job per chromosome to the DNAnexus Research Analysis Platform, each converting a UKB imputed BGEN file to PGEN format with a MAF > 0.01 filter applied via plink2. Jobs run in parallel across chromosomes.
grs_bgen2pgen( chr = 1:22, dest = NULL, maf = 0.01, instance = "standard", priority = "low" )grs_bgen2pgen( chr = 1:22, dest = NULL, maf = 0.01, instance = "standard", priority = "low" )
chr |
Integer vector. Chromosomes to process. Default: |
dest |
Character scalar. RAP destination path for output PGEN files
(e.g. |
maf |
Numeric scalar. Minor allele frequency filter passed to plink2
|
instance |
Character scalar. Instance type preset: |
priority |
Character scalar. Job priority: |
The function auto-generates the plink2 driver script, uploads it once to
the RAP project root (/) on RAP, then loops over chr submitting
one job per chromosome. A 500 ms pause between submissions prevents API
rate-limiting.
Output path is critical. The driver script writes plink2 output to
/home/dnanexus/out/out/ - the fixed path that Swiss Army Knife
auto-uploads to dest on completion. Output files per chromosome:
chr{N}_maf001.pgen/.pvar/.psam/.log.
Instance types:
"standard"mem2_ssd1_v2_x4: 4 cores, 12 GB RAM.
Suitable for smaller chromosomes (roughly chr 17–22).
"large"mem2_ssd2_v2_x8: 8 cores, 28 GB RAM,
640 GB SSD. Required for large chromosomes (roughly chr 1–16) where
standard storage is insufficient.
A character vector of job IDs (one per chromosome), returned
invisibly. Failed submissions are NA. Use job_ls,
job_status, or job_wait to monitor progress.
## Not run: # Test with chr22 first (smallest chromosome) ids <- grs_bgen2pgen(chr = 22, dest = "/pgen/", priority = "high") # Small chromosomes - standard instance ids_small <- grs_bgen2pgen(chr = 15:22, dest = "/pgen/") # Large chromosomes - upgrade instance to handle storage ids_large <- grs_bgen2pgen(chr = 1:16, dest = "/pgen/", instance = "large") # Monitor job_ls() ## End(Not run)## Not run: # Test with chr22 first (smallest chromosome) ids <- grs_bgen2pgen(chr = 22, dest = "/pgen/", priority = "high") # Small chromosomes - standard instance ids_small <- grs_bgen2pgen(chr = 15:22, dest = "/pgen/") # Large chromosomes - upgrade instance to handle storage ids_large <- grs_bgen2pgen(chr = 1:16, dest = "/pgen/", instance = "large") # Monitor job_ls() ## End(Not run)
Reads a SNP weights file, validates its content, and writes a plink2-compatible space-delimited output ready for upload to UKB RAP.
grs_check(file, dest = NULL)grs_check(file, dest = NULL)
file |
Character scalar. Path to the input weights file.
Read via |
dest |
Character scalar or |
The input file must contain at least the three columns below (additional columns are ignored):
snpSNP identifier, expected in rs + digits format.
effect_alleleEffect allele; must be one of A / T / C / G.
betaEffect size (log-OR or beta coefficient); must be numeric.
Checks performed:
Required columns present.
No NA values in the three required columns.
No duplicate snp identifiers.
snp matches rs[0-9]+ pattern (warning if not).
effect_allele contains only A / T / C / G (warning if not).
beta is numeric (error if not).
A data.table with columns snp, effect_allele,
and beta, returned invisibly.
tmp_in <- tempfile(fileext = ".csv") weights <- data.frame( snp = c("rs1234567", "rs2345678", "rs3456789"), effect_allele = c("A", "T", "G"), beta = c(0.12, -0.05, 0.23) ) write.csv(weights, tmp_in, row.names = FALSE) w <- grs_check(tmp_in) w # Save validated weights to a file tmp_out <- tempfile(fileext = ".txt") grs_check(tmp_in, dest = tmp_out)tmp_in <- tempfile(fileext = ".csv") weights <- data.frame( snp = c("rs1234567", "rs2345678", "rs3456789"), effect_allele = c("A", "T", "G"), beta = c(0.12, -0.05, 0.23) ) write.csv(weights, tmp_in, row.names = FALSE) w <- grs_check(tmp_in) w # Save validated weights to a file tmp_out <- tempfile(fileext = ".txt") grs_check(tmp_in, dest = tmp_out)
Uploads local SNP weight files to the RAP project root, then submits one
Swiss Army Knife job per GRS. Each job runs plink2 --score across
all 22 chromosomes and saves a single CSV to dest on completion.
Jobs run in parallel; use job_ls to monitor progress.
grs_score( file, pgen_dir = NULL, dest = NULL, maf = 0.01, instance = "standard", priority = "low" )grs_score( file, pgen_dir = NULL, dest = NULL, maf = 0.01, instance = "standard", priority = "low" )
file |
Named character vector of local weight file paths. Names become
the GRS identifiers (output column = |
pgen_dir |
Character scalar. Path to PGEN files on RAP
(e.g. |
dest |
Character scalar. RAP destination path for output CSV files
(e.g. |
maf |
Numeric scalar. MAF filter threshold used when locating PGEN
files. Must match the value used in |
instance |
Character scalar. Instance type preset: |
priority |
Character scalar. Job priority: |
Weight files should have three columns (any delimiter, header required):
Variant ID (e.g. rs IDs).
Effect allele (A1).
Effect weight (beta / log-OR).
This matches the output format of grs_check.
Output per job: dest/<score_name>_scores.csv with columns
IID and the GRS score (named GRS_<name>).
Instance types:
"standard"mem2_ssd1_v2_x4: 4 cores, 12 GB RAM.
"large"mem2_ssd2_v2_x8: 8 cores, 28 GB RAM.
A named character vector of job IDs (one per GRS), returned
invisibly. Failed submissions are NA. Use job_ls to
monitor progress.
## Not run: ids <- grs_score( file = c( grs_a = "weights/grs_a_weights.txt", grs_b = "weights/grs_b_weights.txt" ), pgen_dir = "/mnt/project/pgen", dest = "/grs/", priority = "high" ) job_ls() ## End(Not run)## Not run: ids <- grs_score( file = c( grs_a = "weights/grs_a_weights.txt", grs_b = "weights/grs_b_weights.txt" ), pgen_dir = "/mnt/project/pgen", dest = "/grs/", priority = "high" ) job_ls() ## End(Not run)
Adds a _z column for every selected GRS column:
z = (x - mean(x)) / sd(x). The original columns are kept
unchanged. grs_zscore is an alias for this function.
grs_standardize(data, grs_cols = NULL) grs_zscore(data, grs_cols = NULL)grs_standardize(data, grs_cols = NULL) grs_zscore(data, grs_cols = NULL)
data |
A |
grs_cols |
Character vector of column names to standardise.
If |
The input data as a data.table with one additional
_z column per GRS column appended after its source column.
dt <- data.frame( IID = 1:5, GRS_a = c(0.12, 0.34, 0.56, 0.23, 0.45), GRS_b = c(1.1, 0.9, 1.3, 0.8, 1.0) ) grs_standardize(dt) grs_zscore(dt) # identicaldt <- data.frame( IID = 1:5, GRS_a = c(0.12, 0.34, 0.56, 0.23, 0.45), GRS_b = c(1.1, 0.9, 1.3, 0.8, 1.0) ) grs_standardize(dt) grs_zscore(dt) # identical
For each GRS column, computes four sets of validation metrics:
Per SD - OR (logistic) or HR (Cox) per 1-SD increase.
High vs Low - OR / HR comparing top 20\ (extreme tertile grouping: Low / Mid / High).
Trend test - P-trend across quartiles (Q1–Q4).
Discrimination - AUC (logistic) or C-index (Cox).
grs_validate( data, grs_cols = NULL, outcome_col, time_col = NULL, covariates = NULL )grs_validate( data, grs_cols = NULL, outcome_col, time_col = NULL, covariates = NULL )
data |
A |
grs_cols |
Character vector of GRS column names to validate.
If |
outcome_col |
Character scalar. Name of the outcome column
( |
time_col |
Character scalar or |
covariates |
Character vector or |
GRS grouping columns are created internally via derive_cut
and are not added to the user's data. When time_col is
NULL, logistic regression is used throughout; when supplied, Cox
proportional hazards models are used.
Models follow the same adjustment logic as assoc_logistic /
assoc_coxph: unadjusted and age-sex adjusted models are
always included; a fully adjusted model is added when covariates
is non-NULL.
A named list with four data.table elements:
per_sd: OR / HR per 1-SD increase in GRS.
high_vs_low: OR / HR for High vs Low extreme tertile.
trend: P-trend across Q1–Q4 quartiles.
discrimination: AUC (logistic) or C-index (Cox) with 95\
dt <- ops_toy(scenario = "association", n = 500) dt <- grs_standardize(dt, grs_cols = "grs_bmi") res <- grs_validate( data = dt, grs_cols = "grs_bmi_z", outcome_col = "dm_status", time_col = "dm_followup_years" ) res$per_sd if (requireNamespace("pROC", quietly = TRUE)) { res_logit <- grs_validate( data = dt, grs_cols = "grs_bmi_z", outcome_col = "dm_status" ) res_logit$discrimination }dt <- ops_toy(scenario = "association", n = 500) dt <- grs_standardize(dt, grs_cols = "grs_bmi") res <- grs_validate( data = dt, grs_cols = "grs_bmi_z", outcome_col = "dm_status", time_col = "dm_followup_years" ) res$per_sd if (requireNamespace("pROC", quietly = TRUE)) { res_logit <- grs_validate( data = dt, grs_cols = "grs_bmi_z", outcome_col = "dm_status" ) res_logit$discrimination }
Returns a summary of the most recent jobs, optionally filtered by state. Useful for quickly reviewing which jobs have completed, failed, or are still running.
job_ls(n = 20, state = NULL)job_ls(n = 20, state = NULL)
n |
(integer) Maximum number of recent jobs to return. Must be a
single positive integer. Default: |
state |
(character) Filter by state(s). Must be |
A data.frame with columns:
Job ID string, e.g. "job-XXXX".
Job name (typically "Table exporter").
Current job state.
Job creation time (POSIXct).
Runtime string (e.g. "0:04:36"), NA if
still running.
## Not run: job_ls() job_ls(n = 5) job_ls(state = "failed") job_ls(state = c("done", "failed")) ## End(Not run)## Not run: job_ls() job_ls(n = 5) job_ls(state = "failed") job_ls(state = c("done", "failed")) ## End(Not run)
Returns the absolute /mnt/project/ path of the CSV produced by
extract_batch(). Use this to read the file directly on the RAP
without downloading.
job_path(job_id)job_path(job_id)
job_id |
(character) Job ID returned by |
A character string — the absolute path to the output CSV under
/mnt/project/.
## Not run: path <- job_path(job_id) df <- data.table::fread(path) ## End(Not run)## Not run: path <- job_path(job_id) df <- data.table::fread(path) ## End(Not run)
Reads the output CSV produced by extract_batch() directly from RAP
project storage and returns a data.table. Must be run inside the
RAP environment.
job_result(job_id)job_result(job_id)
job_id |
(character) Job ID returned by |
A data.table with one row per participant.
## Not run: job_id <- extract_batch(c(31, 53, 21022)) job_wait(job_id) df <- job_result(job_id) ## End(Not run)## Not run: job_id <- extract_batch(c(31, 53, 21022)) job_wait(job_id) df <- job_result(job_id) ## End(Not run)
Returns the current state of a job submitted by extract_batch().
For failed jobs, the failure message is attached as an attribute.
job_status(job_id)job_status(job_id)
job_id |
(character) Job ID returned by |
A named character string — the job state. Possible values:
"idle"Queued, waiting to be scheduled.
"runnable"Resources being allocated.
"running"Actively executing.
"done"Completed successfully.
"failed"Failed; see attr(result, "failure_message").
"terminated"Manually terminated.
## Not run: job_id <- extract_batch(c(31, 53, 21022)) job_status(job_id) s <- job_status(job_id) if (s == "failed") cli::cli_inform(attr(s, "failure_message")) ## End(Not run)## Not run: job_id <- extract_batch(c(31, 53, 21022)) job_status(job_id) s <- job_status(job_id) if (s == "failed") cli::cli_inform(attr(s, "failure_message")) ## End(Not run)
Polls job_status() at regular intervals until the job reaches a
terminal state ("done", "failed", or "terminated").
Stops with an error if the job fails, is terminated, or times out.
job_wait(job_id, interval = 30, timeout = Inf, verbose = TRUE)job_wait(job_id, interval = 30, timeout = Inf, verbose = TRUE)
job_id |
(character) Job ID returned by |
interval |
(integer) Polling interval in seconds. Default: |
timeout |
(numeric) Maximum wait time in seconds. Default: |
verbose |
(logical) Print state and elapsed time at each poll.
Default: |
Invisibly returns the final state string ("done").
## Not run: job_id <- extract_batch(c(31, 53, 21022)) job_wait(job_id) # Read result immediately after completion (RAP only) job_wait(job_id) df <- job_result(job_id) ## End(Not run)## Not run: job_id <- extract_batch(c(31, 53, 21022)) job_wait(job_id) # Read result immediately after completion (RAP only) job_wait(job_id) df <- job_result(job_id) ## End(Not run)
Searches the field list returned by extract_ls and summarizes
matching RAP columns at the UKB field-ID level. This is a project-specific
field-discovery helper: results reflect fields approved and available in the
active RAP dataset, not the full UK Biobank data dictionary.
ops_fields( pattern, dataset = NULL, refresh = FALSE, regex = FALSE, details = FALSE )ops_fields( pattern, dataset = NULL, refresh = FALSE, regex = FALSE, details = FALSE )
pattern |
(character) Keyword string or regular expression to search.
For keyword search, separate multiple required keywords with spaces, e.g.
|
dataset |
(character or NULL) Dataset file name, e.g.
|
refresh |
(logical) Force re-fetch from cloud, ignoring the cached
field list. Default: |
regex |
(logical) Interpret |
details |
(logical) Return the raw matching RAP columns instead of the
field-level summary. Default: |
By default, pattern is treated as one or more case-insensitive
keywords that must all be present in either the RAP field name or the field
title. Set regex = TRUE to use pattern as a regular
expression.
A data.table. With details = FALSE, columns are:
field_id, title, n_cols, and
example_field_name. With details = TRUE, columns are:
field_id, field_name, and title.
## Not run: ops_fields("sex") ops_fields("age recruitment") ops_fields("p31|p53|p21022", regex = TRUE) ops_fields("cancer", details = TRUE) ## End(Not run)## Not run: ops_fields("sex") ops_fields("age recruitment") ops_fields("p31|p53|p21022", regex = TRUE) ops_fields("cancer", details = TRUE) ## End(Not run)
Returns a small offline reference table of frequently used UK Biobank field
IDs. This helper is intentionally limited: it is not a complete UKB data
dictionary and does not imply that a field is approved or available in the
current RAP project. Use extract_ls to search the approved
fields in the active project before extraction.
ops_fields_common(pattern = NULL, group = NULL)ops_fields_common(pattern = NULL, group = NULL)
pattern |
(character or NULL) Optional case-insensitive keyword or
regular expression used to filter across |
group |
(character or NULL) Optional group filter. Use values from the
returned |
A data.table with columns:
Integer UKB field ID.
UKB field title.
Short practical description of the field.
Broad reference group.
Expected field shape: "single",
"instance", "array", or "instance_array".
ops_fields_common() ops_fields_common("sex") ops_fields_common(group = "genetics")ops_fields_common() ops_fields_common("sex") ops_fields_common(group = "genetics")
Scans each column of a data.frame or data.table and returns the count and
percentage of missing values. Results are sorted by missingness in
descending order. Columns above 10\
and 10\
threshold.
ops_na(data, threshold = 0, verbose = TRUE)ops_na(data, threshold = 0, verbose = TRUE)
data |
A data.frame or data.table to scan. |
threshold |
(numeric) Columns with |
verbose |
(logical) Print the CLI report. Default |
Missing value definition: a value is counted as missing if it is NA
or an empty string (""). Empty strings are treated as missing because
UKB exports frequently use "" as a placeholder for absent text values.
This means n_na and pct_na reflect effective missingness, not just
is.na(). Numeric and logical columns are not affected (they cannot hold
"").
An invisible data.table with columns column, n_na, and pct_na,
sorted by pct_na descending. n_na counts both NA and "".
Always contains all columns regardless of threshold (which only affects
CLI output).
dt <- ops_toy(n = 100) # Show all columns with any missing value ops_na(dt) # Only list columns with > 10% missing in the CLI output ops_na(dt, threshold = 10) # Programmatic use — retrieve result silently result <- ops_na(dt, verbose = FALSE) result[pct_na > 50]dt <- ops_toy(n = 100) # Show all columns with any missing value ops_na(dt) # Only list columns with > 10% missing in the CLI output ops_na(dt, threshold = 10) # Programmatic use — retrieve result silently result <- ops_na(dt, verbose = FALSE) result[pct_na > 50]
Adds column names to the session-level safe list. Columns in this list are
automatically excluded when ops_snapshot_cols is used to
build a drop vector, in addition to the built-in protected columns
("eid", "sex", "age", "age_at_recruitment").
ops_set_safe_cols(cols = NULL, reset = FALSE)ops_set_safe_cols(cols = NULL, reset = FALSE)
cols |
(character) One or more column names to protect. |
reset |
(logical) If |
Invisibly returns the updated safe cols vector.
ops_set_safe_cols(c("date_baseline", "townsend_index")) ops_set_safe_cols(reset = TRUE) # clear user-registered safe colsops_set_safe_cols(c("date_baseline", "townsend_index")) ops_set_safe_cols(reset = TRUE) # clear user-registered safe cols
Runs a four-block health check covering the dx CLI, dxpy (Python), RAP authentication, and R package dependencies. Designed to be the first function a new user runs after installation.
ops_setup( check_dx = TRUE, check_auth = TRUE, check_deps = TRUE, verbose = TRUE )ops_setup( check_dx = TRUE, check_auth = TRUE, check_deps = TRUE, verbose = TRUE )
check_dx |
(logical) Check dx CLI installation (dxpy is implied by dx). |
check_auth |
(logical) Check RAP login status. |
check_deps |
(logical) Check R package dependencies. |
verbose |
(logical) Print the formatted report. Set to |
The function is read-only: it never modifies system state, installs packages, or authenticates. Auth failures are reported as warnings, not errors, because the check itself does not require a live RAP connection.
An invisible named list with elements dx, dxpy, auth, deps,
and summary. Each element reflects the result of its respective check
block and can be inspected programmatically.
ops_setup(check_dx = FALSE, check_auth = FALSE) result <- ops_setup(check_dx = FALSE, check_auth = FALSE, verbose = FALSE) result$summary$fail == 0ops_setup(check_dx = FALSE, check_auth = FALSE) result <- ops_setup(check_dx = FALSE, check_auth = FALSE, verbose = FALSE) result$summary$fail == 0
Captures a lightweight summary of a data.frame at a given pipeline stage
and stores it in the session cache. Subsequent calls automatically compute
deltas against the previous snapshot, making it easy to track how data
changes through derive_*, assoc_*, and other processing steps.
ops_snapshot( data = NULL, label = NULL, reset = FALSE, verbose = TRUE, check_na = TRUE )ops_snapshot( data = NULL, label = NULL, reset = FALSE, verbose = TRUE, check_na = TRUE )
data |
A data.frame or data.table to snapshot. Pass |
label |
(character) A short name for this pipeline stage, e.g.
|
reset |
(logical) If |
verbose |
(logical) Print the CLI report. Default |
check_na |
(logical) Whether to count columns with any |
When data is supplied, returns the new snapshot row invisibly
(a one-row data.table). When called with no data, returns the full
history data.table invisibly.
ops_snapshot(reset = TRUE, verbose = FALSE) dt <- ops_toy(n = 100) ops_snapshot(dt, label = "raw") dt <- derive_missing(dt) ops_snapshot(dt, label = "after_derive_missing") # View full history ops_snapshot() # Reset history ops_snapshot(reset = TRUE)ops_snapshot(reset = TRUE, verbose = FALSE) dt <- ops_toy(n = 100) ops_snapshot(dt, label = "raw") dt <- derive_missing(dt) ops_snapshot(dt, label = "after_derive_missing") # View full history ops_snapshot() # Reset history ops_snapshot(reset = TRUE)
Returns the column names stored by a previous ops_snapshot
call, optionally excluding columns you wish to keep.
ops_snapshot_cols(label, keep = NULL)ops_snapshot_cols(label, keep = NULL)
label |
(character) Snapshot label passed to |
keep |
(character or NULL) Column names to exclude from the returned
vector (i.e. columns to retain in the data even if they were present at
that snapshot). Default |
A character vector of column names.
ops_snapshot(reset = TRUE, verbose = FALSE) dt <- ops_toy(n = 100) ops_snapshot(dt, label = "raw") ops_snapshot_cols("raw") ops_snapshot_cols("raw", keep = "eid") ops_snapshot(reset = TRUE, verbose = FALSE)ops_snapshot(reset = TRUE, verbose = FALSE) dt <- ops_toy(n = 100) ops_snapshot(dt, label = "raw") ops_snapshot_cols("raw") ops_snapshot_cols("raw", keep = "eid") ops_snapshot(reset = TRUE, verbose = FALSE)
Returns lists of columns added and removed between two recorded snapshots.
ops_snapshot_diff(label1, label2)ops_snapshot_diff(label1, label2)
label1 |
(character) Label of the earlier snapshot. |
label2 |
(character) Label of the later snapshot. |
A named list with two character vectors: added (columns
present in label2 but not label1) and removed
(columns present in label1 but not label2).
ops_snapshot(reset = TRUE, verbose = FALSE) dt <- ops_toy(n = 100) ops_snapshot(dt, label = "raw") dt <- derive_missing(dt) ops_snapshot(dt, label = "derived") ops_snapshot_diff("raw", "derived") # $added — newly derived columns # $removed — columns dropped between snapshots ops_snapshot(reset = TRUE, verbose = FALSE)ops_snapshot(reset = TRUE, verbose = FALSE) dt <- ops_toy(n = 100) ops_snapshot(dt, label = "raw") dt <- derive_missing(dt) ops_snapshot(dt, label = "derived") ops_snapshot_diff("raw", "derived") # $added — newly derived columns # $removed — columns dropped between snapshots ops_snapshot(reset = TRUE, verbose = FALSE)
Drops columns that were present at snapshot from from data,
while automatically protecting built-in safe columns
("eid", "sex", "age", "age_at_recruitment") and
any user-registered safe columns set via ops_set_safe_cols.
Columns that no longer exist in data are silently skipped.
ops_snapshot_remove(data, from, keep = NULL, verbose = TRUE)ops_snapshot_remove(data, from, keep = NULL, verbose = TRUE)
data |
A data.frame or data.table. |
from |
(character) Label of the snapshot whose columns should be
dropped (typically |
keep |
(character or NULL) Additional column names to protect beyond
the built-in and user-registered safe cols. Default |
verbose |
(logical) Print a summary of dropped columns. Default
|
A data.table with the specified columns removed. For
data.table input the operation is performed by reference (in-place);
for data.frame input the data is first converted to a new
data.table — the original data.frame is not modified.
ops_snapshot(reset = TRUE, verbose = FALSE) dt <- ops_toy(n = 100) ops_snapshot(dt, label = "raw") dt <- derive_missing(dt) ops_snapshot(dt, label = "derived") ops_snapshot_diff("raw", "derived") dt <- ops_snapshot_remove(dt, from = "raw") ops_snapshot(reset = TRUE, verbose = FALSE)ops_snapshot(reset = TRUE, verbose = FALSE) dt <- ops_toy(n = 100) ops_snapshot(dt, label = "raw") dt <- derive_missing(dt) ops_snapshot(dt, label = "derived") ops_snapshot_diff("raw", "derived") dt <- ops_snapshot_remove(dt, from = "raw") ops_snapshot(reset = TRUE, verbose = FALSE)
Creates a small, synthetic dataset that mimics the structure of UK Biobank
phenotype data on the RAP. Useful for developing and testing derive_*,
assoc_*, and plot_* functions without requiring real UKB data access.
ops_toy(scenario = "cohort", n = 1000L, seed = 42L)ops_toy(scenario = "cohort", n = 1000L, seed = 42L)
scenario |
(character) Data structure to generate:
|
n |
(integer) Number of participants (or exposures for |
seed |
(integer or NULL) Random seed for reproducibility. Pass |
This dataset is entirely synthetic. Column names follow RAP conventions
(e.g. p41270, p20002_i0_a0).
Includes the following column groups:
Demographics: eid, p31, p34, p53_i0, p21022
Covariates: p21001_i0, p20116_i0, p1558_i0, p21000_i0,
p22189, p54_i0
Genetic PCs: p22009_a1 – p22009_a10
Self-report disease: p20002_i0_a0–a4, p20008_i0_a0–a4
Self-report cancer: p20001_i0_a0–a4, p20006_i0_a0–a4
HES: p41270 (JSON array), p41280_a0–a8
Cancer registry: p40006_i0–i2, p40011_i0–i2, p40012_i0–i2,
p40005_i0–i2
Death registry: p40001_i0, p40002_i0_a0–a2, p40000_i0
First occurrence: p131742
GRS: grs_bmi, grs_raw, grs_finngen
Messy columns: messy_allna, messy_empty, messy_label
Analysis-ready table. All derive inputs (raw arrays, HES JSON, registry fields) are omitted; derive outputs are pre-computed with internally consistent relationships:
Demographics: eid, p31 (factor), p53_i0 (IDate), p21022
Covariates: p21001_i0, bmi_cat (factor, derived from p21001_i0),
p20116_i0 (factor), p1558_i0 (factor), p21000_i0 (factor),
p22189, tdi_cat (factor, derived from p22189 quartiles), p54_i0
(factor)
Genetic PCs: p22009_a1 – p22009_a10
GRS: grs_bmi (continuous exposure)
DM outcome: dm_status, dm_date, dm_timing, dm_followup_end,
dm_followup_years (type 2 diabetes, ~12% prevalence)
HTN outcome: htn_status, htn_date, htn_timing,
htn_followup_end, htn_followup_years (hypertension, ~28% prevalence)
Internal relationships guaranteed:
bmi_cat is cut from p21001_i0 (breaks 18.5 / 25 / 30)
tdi_cat is cut from p22189 quartiles
dm_date is NA iff dm_status = FALSE
dm_timing: 0 = no disease, 1 = prevalent, 2 = incident, NA = no date
dm_followup_years is NA for prevalent cases (dm_timing == 1)
A data.table with UKB-style column names. See Details for the
columns included in each scenario.
# cohort: raw UKB-style columns, feed into derive pipeline dt <- ops_toy(n = 100) dt <- derive_missing(dt) # association: analysis-ready, feed directly into assoc_* functions dt <- ops_toy(scenario = "association", n = 500) dt <- dt[dm_timing != 1L] # exclude prevalent cases # forest: results table for plot_forest() dt <- ops_toy(scenario = "forest")# cohort: raw UKB-style columns, feed into derive pipeline dt <- ops_toy(n = 100) dt <- derive_missing(dt) # association: analysis-ready, feed directly into assoc_* functions dt <- ops_toy(scenario = "association", n = 500) dt <- dt[dm_timing != 1L] # exclude prevalent cases # forest: results table for plot_forest() dt <- ops_toy(scenario = "forest")
Reads a UK Biobank withdrawal list (a headerless single-column CSV of
anonymised participant IDs) and removes the corresponding rows from
data. A pair of ops_snapshot() calls is made automatically so the
before/after row counts are recorded in the session snapshot history.
ops_withdraw(data, file, eid_col = "eid", verbose = TRUE)ops_withdraw(data, file, eid_col = "eid", verbose = TRUE)
data |
A data.frame or data.table containing a participant ID column. |
file |
(character) Path to the UKB withdrawal CSV file. The file must
be a single-column, header-free CSV as supplied by UK Biobank
(e.g. |
eid_col |
(character) Name of the participant ID column in |
verbose |
(logical) Print the CLI report. Default |
A data.table with withdrawn participants removed.
ops_snapshot(reset = TRUE, verbose = FALSE) dt <- ops_toy(n = 100) withdraw_file <- tempfile(fileext = ".csv") writeLines(as.character(dt$eid[1:5]), withdraw_file) dt <- ops_withdraw(dt, file = withdraw_file) ops_snapshot(reset = TRUE, verbose = FALSE)ops_snapshot(reset = TRUE, verbose = FALSE) dt <- ops_toy(n = 100) withdraw_file <- tempfile(fileext = ".csv") writeLines(as.character(dt$eid[1:5]), withdraw_file) dt <- ops_withdraw(dt, file = withdraw_file) ops_snapshot(reset = TRUE, verbose = FALSE)
Produces a publication-ready forest plot with UKB-standard styling.
The user supplies a data frame whose first column is the row label
(item), plus any additional display columns (e.g. Cases/N).
The gap column and the auto-formatted OR (95% CI) text column are
inserted automatically at ci_column. Numeric p-value columns
declared via p_cols are formatted in-place.
plot_forest( data, est, lower, upper, ci_column = 2L, ref_line = 1, xlim = NULL, ticks_at = NULL, arrow_lab = c("Lower risk", "Higher risk"), header = NULL, indent = NULL, bold_label = NULL, ci_col = "black", ci_sizes = 0.6, ci_Theight = 0.2, ci_digits = 2L, ci_sep = ", ", p_cols = NULL, p_digits = 3L, bold_p = TRUE, p_threshold = 0.05, align = NULL, background = "zebra", bg_col = "#F0F0F0", border = "three_line", border_width = 3, row_height = NULL, col_width = NULL, save = FALSE, dest = NULL, save_width = 20, save_height = NULL, theme = "default" )plot_forest( data, est, lower, upper, ci_column = 2L, ref_line = 1, xlim = NULL, ticks_at = NULL, arrow_lab = c("Lower risk", "Higher risk"), header = NULL, indent = NULL, bold_label = NULL, ci_col = "black", ci_sizes = 0.6, ci_Theight = 0.2, ci_digits = 2L, ci_sep = ", ", p_cols = NULL, p_digits = 3L, bold_p = TRUE, p_threshold = 0.05, align = NULL, background = "zebra", bg_col = "#F0F0F0", border = "three_line", border_width = 3, row_height = NULL, col_width = NULL, save = FALSE, dest = NULL, save_width = 20, save_height = NULL, theme = "default" )
data |
data.frame. First column must be the label column ( |
est |
Numeric vector. Point estimates ( |
lower |
Numeric vector. Lower CI bounds (same length as |
upper |
Numeric vector. Upper CI bounds (same length as |
ci_column |
Integer. Column position in the final rendered table where
the gap/CI graphic is placed. Must be between |
ref_line |
Numeric. Reference line. Default: |
xlim |
Numeric vector of length 2. X-axis limits. |
ticks_at |
Numeric vector. Tick positions. |
arrow_lab |
Character vector of length 2. Directional labels.
Default: |
header |
Character vector of length |
indent |
Integer vector (length = |
bold_label |
Logical vector (length = |
ci_col |
Character scalar or vector (length = |
ci_sizes |
Numeric. Point size. Default: |
ci_Theight |
Numeric. CI cap height. Default: |
ci_digits |
Integer. Decimal places for the auto-generated
|
ci_sep |
Character. Separator between lower and upper CI in the label,
e.g. |
p_cols |
Character vector. Names of numeric p-value columns in
|
p_digits |
Integer. Decimal places for p-value formatting.
Default: |
bold_p |
|
p_threshold |
Numeric. P-value threshold for bolding when
|
align |
Integer vector of length |
background |
Character. Row background style: |
bg_col |
Character. Shading colour for backgrounds (scalar), or a
per-row vector of length |
border |
Character. Border style: |
border_width |
Numeric. Border line width(s). Scalar = all three lines
same width; length-3 vector = top-of-header, bottom-of-header,
bottom-of-body. Default: |
row_height |
|
col_width |
|
save |
Logical. Save output to files? Default: |
dest |
Character. Destination file path (extension ignored; all four
formats are saved). Required when |
save_width |
Numeric. Output width in cm. Default: |
save_height |
Numeric or |
theme |
Character preset ( |
A forestploter plot object (gtable), returned invisibly.
Display with plot() or grid::grid.draw().
df <- data.frame( item = c("Exposure vs. control", "Unadjusted", "Fully adjusted"), `Cases/N` = c("", "89/4521", "89/4521"), p_value = c(NA_real_, 0.001, 0.006), check.names = FALSE ) p <- plot_forest( data = df, est = c(NA, 1.52, 1.43), lower = c(NA, 1.18, 1.11), upper = c(NA, 1.96, 1.85), ci_column = 2L, indent = c(0L, 1L, 1L), bold_label = c(TRUE, FALSE, FALSE), p_cols = "p_value", xlim = c(0.5, 3.0) ) plot(p)df <- data.frame( item = c("Exposure vs. control", "Unadjusted", "Fully adjusted"), `Cases/N` = c("", "89/4521", "89/4521"), p_value = c(NA_real_, 0.001, 0.006), check.names = FALSE ) p <- plot_forest( data = df, est = c(NA, 1.52, 1.43), lower = c(NA, 1.18, 1.11), upper = c(NA, 1.96, 1.85), ci_column = 2L, indent = c(0L, 1L, 1L), bold_label = c(TRUE, FALSE, FALSE), p_cols = "p_value", xlim = c(0.5, 3.0) ) plot(p)
Generates a publication-quality baseline-characteristics table (Table 1) using gtsummary, with optional SMD column, Lancet-style theming, and automatic export to four formats.
plot_tableone( data, vars, strata = NULL, type = NULL, label = NULL, statistic = NULL, digits = NULL, percent = "column", missing = "no", add_p = TRUE, add_smd = FALSE, overall = FALSE, exclude_labels = NULL, theme = "lancet", label_width = 200, stat_width = 140, pvalue_width = 100, row_height = 8, save = FALSE, dest = NULL, png_scale = 2, pdf_width = NULL, pdf_height = NULL )plot_tableone( data, vars, strata = NULL, type = NULL, label = NULL, statistic = NULL, digits = NULL, percent = "column", missing = "no", add_p = TRUE, add_smd = FALSE, overall = FALSE, exclude_labels = NULL, theme = "lancet", label_width = 200, stat_width = 140, pvalue_width = 100, row_height = 8, save = FALSE, dest = NULL, png_scale = 2, pdf_width = NULL, pdf_height = NULL )
data |
data.frame. Input data containing all variables. |
vars |
Character vector. Variable names to display in the table. |
strata |
Character scalar. Column name for the grouping/stratification
variable. |
type |
Named list or |
label |
Named list / formula list or |
statistic |
Named list or |
digits |
Named list or |
percent |
Character. Percentage base for categorical variables:
|
missing |
Character. Display of missing counts:
|
add_p |
Logical. Add a p-value column. Default |
add_smd |
Logical. Add a standardised mean difference (SMD) column.
Continuous variables use Cohen's d; categorical variables use
root-mean-square deviation (RMSD) of group proportions. SMD is shown
only on label rows. Default |
overall |
Logical. Add an Overall column when |
exclude_labels |
Character vector or |
theme |
Character. Visual theme preset. Only |
label_width |
Integer. Label column width in pixels. Default |
stat_width |
Integer. Statistics column(s) width in pixels.
Default |
pvalue_width |
Integer. P-value and SMD column width in pixels.
Default |
row_height |
Integer. Data row padding (top + bottom) in pixels.
Default |
save |
Logical. Export the table to files. Default |
dest |
Character or |
png_scale |
Numeric. Zoom factor for PNG export via webshot2.
Higher values produce larger, higher-resolution images. Default: |
pdf_width |
Numeric or |
pdf_height |
Numeric or |
The following behaviours are fixed (not exposed as parameters):
Variable labels are bold (bold_labels()).
P-values are formatted as <0.001 or to 3 decimal places.
Significant p-values () are bold.
The p-value column header is rendered as P-value.
The table uses a three-line (booktabs) border.
Saving always exports word, html, pdf, and png.
A gt table object, returned invisibly.
library(gtsummary) data(trial) # Basic stratified table plot_tableone( data = trial, vars = c("age", "marker", "grade"), strata = "trt", save = FALSE ) # With SMD, custom labels, exclude Unknown level plot_tableone( data = trial, vars = c("age", "marker", "grade", "stage"), strata = "trt", label = list(age ~ "Age (years)", marker ~ "Marker level"), add_smd = TRUE, exclude_labels = "Unknown", save = FALSE )library(gtsummary) data(trial) # Basic stratified table plot_tableone( data = trial, vars = c("age", "marker", "grade"), strata = "trt", save = FALSE ) # With SMD, custom labels, exclude Unknown level plot_tableone( data = trial, vars = c("age", "marker", "grade", "stage"), strata = "trt", label = list(age ~ "Age (years)", marker ~ "Marker level"), add_smd = TRUE, exclude_labels = "Unknown", save = FALSE )