---
title: "Deriving Disease Phenotypes from UKB Data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Deriving Disease Phenotypes from UKB Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

## Overview

The `derive_*` functions convert raw UKB columns into analysis-ready variables.
This vignette covers the disease phenotype derivation pipeline:

| Step | Function(s) | Purpose |
|---|---|---|
| 1 | `derive_missing()` | Handle "Do not know" / "Prefer not to answer" |
| 2 | `derive_covariate()` | Convert types; summarise covariates |
| 3 | `derive_cut()` | Bin continuous variables into groups |
| 4 | `derive_selfreport()` | Self-reported disease status + date |
| 5 | `derive_hes()` | HES inpatient ICD-10 status + date |
| 6 | `derive_first_occurrence()` | First Occurrence field status + date |
| 7 | `derive_cancer_registry()` | Cancer registry status + date |
| 8 | `derive_death_registry()` | Death registry ICD-10 status + date |
| 9 | `derive_icd10()` | Combine any subset of sources (wrapper) |
| 10 | `derive_case()` | Merge self-report + ICD-10 into final case definition |

Current phenotype-source support is intentionally scoped to the common UKB
sources below:

| Source | Code system / field type | Main function(s) |
|---|---|---|
| Self-reported illness / cancer | UKB fields `20002` / `20001` | `derive_selfreport()` |
| HES inpatient diagnoses | ICD-10, any-position field `41270` with dates from `41280` | `derive_hes()` |
| First Occurrence fields | UKB precomputed `p131xxx` dates | `derive_first_occurrence()` |
| Cancer registry | ICD-10, histology, behaviour, diagnosis date | `derive_cancer_registry()` |
| Death registry | ICD-10 primary / secondary cause of death | `derive_death_registry()` |
| Multi-source ICD-10 phenotype | HES, death, First Occurrence, cancer registry | `derive_icd10()` |
| Final case definition | Self-report plus ICD-10-derived status/date | `derive_case()` |

ICD-9, OPCS-4, Read v2, CTV3, and other GP / primary-care code systems are not
part of the current public API.

All functions accept a `data.frame` or `data.table` and return a `data.table`.
For `data.table` input, new columns are added **by reference** (no copy);
`data.frame` input is converted to `data.table` internally before modification.

> **In production**, replace `ops_toy()` with `extract_batch()` followed by
> `decode_values()` and `decode_names()`. See `vignette("decode")`.
> Column names below use the RAP raw format (`p{field}_i{instance}_a{array}`,
> for example `p20116_i0`) as returned by `ops_toy()` and `extract_batch()`
> before decoding.

---

## Setup

```{r load-data}
library(ukbflow)

df <- ops_toy(n = 500)
```

---

## Step 1: Handle Informative Missing Labels

UKB uses special labels such as `"Do not know"` and `"Prefer not to answer"` to
distinguish refusal from true missing data. `derive_missing()` converts these
to `NA` (default) or retains them as `"Unknown"` for modelling.

```{r derive-missing}
df <- derive_missing(df)
```

> **Performance**: `derive_missing()` uses `data.table::set()` for in-place
> replacement, so no column copies are made regardless of dataset size.

To keep non-response as a model category:

```{r derive-missing-unknown}
df <- derive_missing(df, action = "unknown")
```

To add custom labels beyond the built-in list:

```{r derive-missing-extra}
df <- derive_missing(df, extra_labels = "Not applicable")
```

---

## Step 2: Prepare Covariates

`derive_covariate()` converts categorical columns to `factor` and prints a
distribution summary for each.
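
For intuition, the conversion can be pictured with base R alone. This is a minimal sketch on made-up values, not the internals of `derive_covariate()`; the vector `smoking` and its contents are hypothetical. It shows why explicit levels matter: they fix the reference category and the ordering used in later models and tables.

```r
# Hypothetical toy values standing in for a smoking-status column
smoking <- c("Previous", "Never", "Current", "Never", NA)

# Explicit levels put "Never" first, so it becomes the reference category
smoking_f <- factor(smoking, levels = c("Never", "Previous", "Current"))

levels(smoking_f)

# A simple distribution summary, keeping missing values visible
table(smoking_f, useNA = "ifany")
```

With `ukbflow`, the same conversion is requested through `as_factor` and `factor_levels`:
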
```{r derive-covariate}
df <- derive_covariate(
  df,
  as_factor = c(
    "p31",        # sex
    "p20116_i0",  # smoking_status_i0
    "p1558_i0"    # alcohol_intake_frequency_i0
  ),
  factor_levels = list(
    p20116_i0 = c("Never", "Previous", "Current")
  )
)
```

---

## Step 3: Bin Continuous Variables

`derive_cut()` creates a new factor column by binning a continuous variable
into quantile-based or custom groups.

```{r derive-cut}
# Custom breaks: three cut points give four BMI groups
df <- derive_cut(
  df,
  col    = "p21001_i0",   # body_mass_index_bmi_i0
  n      = 4,
  breaks = c(18.5, 25, 30),
  labels = c("Underweight", "Normal", "Overweight", "Obese"),
  name   = "bmi_cat"
)

# Quantile-based: four groups of the Townsend deprivation index
df <- derive_cut(
  df,
  col    = "p22189",      # townsend_deprivation_index_at_recruitment
  n      = 4,
  labels = c("Q1 (least deprived)", "Q2", "Q3", "Q4 (most deprived)"),
  name   = "tdi_cat"
)
```

---

## Step 4: Self-Reported Disease

`derive_selfreport()` searches UKB self-reported non-cancer illness (field
20002) or cancer (field 20001) columns for a disease label matching a regex,
then returns binary status and the earliest report date. Column detection is
automatic from field IDs.

```{r derive-selfreport}
# Non-cancer: type 2 diabetes (field 20002)
df <- derive_selfreport(df,
  name  = "dm",
  regex = "type 2 diabetes"
)
```

```{r derive-selfreport-cancer}
# Cancer: lung cancer (field 20001)
df <- derive_selfreport(df,
  name  = "lung_cancer",
  regex = "lung cancer",
  field = "cancer"
)
```

Each call adds two columns, shown here for `name = "dm"`:

| Column | Type | Description |
|---|---|---|
| `dm_selfreport` | logical | `TRUE` if any instance matched |
| `dm_selfreport_date` | IDate | Earliest report date |

---

## Step 5: HES Inpatient Records

`derive_hes()` scans UKB Hospital Episode Statistics ICD-10 codes (field 41270,
stored as a JSON array per participant) and matches the earliest corresponding
date from field 41280.

Field 41270 holds HES inpatient ICD-10 diagnoses recorded in any diagnostic
position. `derive_hes()` therefore treats any matching ICD-10 code in this
field as a case.
It does not currently distinguish main/primary diagnoses (field 41202) from
secondary diagnoses (field 41204).

```{r derive-hes}
# Prefix match: codes starting with "I10" (hypertension)
df <- derive_hes(df, name = "htn", icd10 = "I10")

# Exact match
df <- derive_hes(df, name = "dm_hes", icd10 = "E11", match = "exact")

# Regex: E10 and E11 simultaneously
df <- derive_hes(df, name = "dm_broad", icd10 = "^E1[01]", match = "regex")
```

The `match` argument controls how codes are compared:

| `match` | Behaviour | Example |
|---|---|---|
| `"prefix"` (default) | Code starts with pattern | `"E11"` matches `"E110"`, `"E119"` |
| `"exact"` | Full 3- or 4-digit match | `"E11"` matches only `"E11"` |
| `"regex"` | Full regular expression | `"^E1[01]"` |

---

## Step 6: First Occurrence Fields

UKB First Occurrence fields (p131xxx) record the earliest date a condition was
observed across **all linked sources** (self-report, HES inpatient, GP records,
and death registry), pre-integrated by UKB. Look up your disease in the
[UKB Field Finder](https://biobank.ndph.ox.ac.uk/showcase/search.cgi).

```{r derive-fo}
# ops_toy includes p131742 as a representative First Occurrence column
df <- derive_first_occurrence(df, name = "htn", field = 131742L, col = "p131742")
```

---

## Step 7: Cancer Registry

`derive_cancer_registry()` searches the cancer registry ICD-10 field (40006)
and optionally filters by histology (field 40011) and behaviour (field 40012).

```{r derive-cancer}
# ICD-10 only
df <- derive_cancer_registry(df,
  name  = "skin_cancer",
  icd10 = "^C44"
)

# With histology and behaviour filters
df <- derive_cancer_registry(df,
  name      = "scc",
  icd10     = "^C44",
  histology = c(8070L, 8071L, 8072L),
  behaviour = 3L   # 3 = malignant
)
```

---

## Step 8: Death Registry

`derive_death_registry()` searches primary (field 40001) and secondary (field
40002) causes of death for ICD-10 codes.
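
The idea of searching both cause-of-death fields can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: the data frame `deaths`, its column names, and the helper `is_match()` are hypothetical, real field 40002 data spans several array columns, and prefix matching is assumed here by analogy with the `derive_hes()` default.

```r
# Hypothetical toy data: one primary and one secondary cause per participant
deaths <- data.frame(
  eid             = 1:4,
  primary_cause   = c("I219", "C349", NA, "E119"),
  secondary_cause = c("E119", NA, NA, "I210"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where a non-missing code starts with the prefix
is_match <- function(code, prefix) {
  !is.na(code) & startsWith(code, prefix)
}

# A participant counts as a case if either cause field matches "I21"
deaths$mi_death <- is_match(deaths$primary_cause, "I21") |
  is_match(deaths$secondary_cause, "I21")

deaths
```

The package call takes the ICD-10 pattern directly:
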
```{r derive-death}
df <- derive_death_registry(df, name = "mi",   icd10 = "I21")
df <- derive_death_registry(df, name = "dm",   icd10 = "E11")
df <- derive_death_registry(df, name = "lung", icd10 = "C34")
```

---

## Step 9: Combine Sources with `derive_icd10()`

`derive_icd10()` is a high-level wrapper that calls any combination of the
source-specific functions above and merges their outputs into a single status
column and earliest date. This is the recommended approach for multi-source
ascertainment.

```{r derive-icd10}
# Non-cancer disease: HES + death + First Occurrence
df <- derive_icd10(df,
  name   = "dm",
  icd10  = "E11",
  source = c("hes", "death", "first_occurrence"),
  fo_col = "p131742"
)

# Cancer outcome: cancer registry
df <- derive_icd10(df,
  name      = "lung",
  icd10     = "^C3[34]",
  match     = "regex",
  source    = "cancer_registry",
  behaviour = 3L
)
```

Intermediate source columns are retained alongside the combined result:

| Column | Type | Description |
|---|---|---|
| `dm_icd10` | logical | `TRUE` if positive in any specified source |
| `dm_icd10_date` | IDate | Earliest date across all sources |
| `dm_hes` | logical | HES status |
| `dm_hes_date` | IDate | HES date |
| `dm_fo` | logical | First Occurrence status |
| `dm_fo_date` | IDate | First Occurrence date |
| `dm_death` | logical | Death registry status |
| `dm_death_date` | IDate | Death registry date |

---

## Step 10: Final Case Definition

`derive_case()` applies an any-source reconciliation rule by default. The final
status is `TRUE` if either the ICD-10-derived status or the self-report status
is `TRUE`; this is an OR rule, not a medical-record confirmation rule. The
final date is the earliest available date across the included sources, computed
with `pmin()`. Use `derive_icd10(source = ...)` to control which medical /
registry sources enter the ICD-10-derived status before calling `derive_case()`.
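
The OR rule and the earliest-date rule can be illustrated with plain vectors. This is a minimal base R sketch of the logic described above, not the body of `derive_case()`; the four toy participants and all variable names are hypothetical, and dropping missing dates with `na.rm = TRUE` is an assumption about how "earliest available date" is handled.

```r
# Hypothetical inputs for four participants
icd10_status <- c(TRUE, FALSE, TRUE, FALSE)
icd10_date   <- as.Date(c("2012-03-01", NA, "2015-06-10", NA))
self_status  <- c(TRUE, TRUE, FALSE, FALSE)
self_date    <- as.Date(c("2010-01-15", "2009-07-01", NA, NA))

# OR rule: a case if positive in either source
final_status <- icd10_status | self_status

# Earliest available date per participant; NA only when both dates are missing
final_date <- pmin(icd10_date, self_date, na.rm = TRUE)

data.frame(final_status, final_date)
```

Participant 1 is positive in both sources and takes the earlier self-report date, participants 2 and 3 are positive in one source only, and participant 4 stays a non-case with no date.
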
If only one of `{name}_icd10` or `{name}_selfreport` is present,
`derive_case()` uses that available source alone and prints a warning.

```{r derive-case}
df <- derive_case(df, name = "dm")
```

Single-source case definitions are also possible. For an ICD-10-derived
medical / registry definition, run `derive_icd10()` for a distinct `name` and
do not create the matching self-report columns:

```{r derive-case-icd10-only}
df <- derive_icd10(df,
  name   = "dm_medical",
  icd10  = "E11",
  source = c("hes", "death", "first_occurrence"),
  fo_col = "p131742"
)

df <- derive_case(df, name = "dm_medical")
```

For a self-report-only definition, run `derive_selfreport()` for a distinct
`name` and do not create the matching ICD-10-derived columns:

```{r derive-case-selfreport-only}
df <- derive_selfreport(df,
  name  = "dm_selfonly",
  regex = "type 2 diabetes"
)

df <- derive_case(df, name = "dm_selfonly")
```

Output columns:

| Column | Type | Description |
|---|---|---|
| `dm_status` | logical | `TRUE` if positive in self-report OR ICD-10 |
| `dm_date` | IDate | **Earliest** date across all sources (`pmin`) |

> **Why the earliest date matters**: `dm_date` is the direct input to
> `derive_timing()`, `derive_age()`, and `derive_followup()`: it is the
> chronological anchor of every downstream survival analysis.
> See `vignette("derive-survival")`.

---

## Getting Help

- `?derive_missing`, `?derive_covariate`, `?derive_cut`
- `?derive_selfreport`, `?derive_hes`, `?derive_first_occurrence`
- `?derive_cancer_registry`, `?derive_death_registry`
- `?derive_icd10`, `?derive_case`
- `vignette("derive-survival")`: timing, age at event, follow-up
- `vignette("decode")`: decoding column names and values
- [GitHub Issues](https://github.com/evanbio/ukbflow/issues)