---
title: "Genetic Risk Score (GRS) Analysis on UKB RAP"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Genetic Risk Score (GRS) Analysis on UKB RAP}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment  = "#>",
  eval     = FALSE
)
```

## Overview

The `grs_*` functions provide an end-to-end pipeline for computing and
validating polygenic / genetic risk scores within the UK Biobank Research
Analysis Platform (RAP). Because individual-level genotype data cannot be
downloaded locally, all computationally intensive steps are executed as
cloud jobs via the DNAnexus Swiss Army Knife app.

| Function | Where it runs | Purpose |
|---|---|---|
| `grs_check()` | Local or RAP | Validate and export a SNP weights file |
| `grs_bgen2pgen()` | Submits RAP jobs | Convert UKB imputed BGEN files to PGEN |
| `grs_score()` | Submits RAP jobs | Score genotypes across chromosomes with plink2 |
| `grs_standardize()` | Local or RAP | Z-score standardise GRS columns |
| `grs_zscore()` | Local or RAP | Alias for `grs_standardize()` |
| `grs_validate()` | Local or RAP | Assess predictive performance (OR/HR, AUC, C-index) |

**Typical pipeline:**

```
grs_check()  ->  grs_bgen2pgen()  ->  grs_score()  ->  grs_standardize()  ->  grs_validate()
```

> **Prerequisite**: you must be authenticated to UKB RAP and have a project
> selected. See `vignette("auth")`.

---

## Step 1: Validate the Weights File -- `grs_check()`

Before uploading to RAP, validate your SNP weights file. The function reads
the file, checks required columns, flags problems, and writes a
plink2-compatible space-delimited output.

**Required columns:**

| Column | Description |
|---|---|
| `snp` | SNP identifier (expected `rs` + digits format) |
| `effect_allele` | Effect allele: one of A / T / C / G |
| `beta` | Effect size (log-OR or beta); must be numeric |

```{r grs-check-local}
library(ukbflow)

# Local: weights file on your machine
w <- grs_check("weights.csv", dest = "weights_clean.txt")
#> Read 'weights.csv': 312 rows, 5 columns.
#> ✔ No NA values.
#> ✔ No duplicate SNPs.
#> ✔ All SNP IDs match rs[0-9]+ format.
#> ✔ All effect alleles are A/T/C/G.
#> Beta summary:
#>   Range     : -0.412 to 0.589
#>   Mean |beta|: 0.183
#>   Positive  : 187 (59.9%)
#>   Negative  : 125 (40.1%)
#>   Zero      : 0
#> ✔ Weights file passed checks: 312 SNPs ready for UKB RAP.
#> ✔ Saved: 'weights_clean.txt'
```

```{r grs-check-rap}
# On RAP (RStudio) -- use /mnt/project/ paths directly
w <- grs_check(
  file = "/mnt/project/weights/weights.csv",
  dest = "/mnt/project/weights/weights_clean.txt"
)
```

The validated weights are also returned invisibly for inspection.

---

## Step 2: Convert BGEN to PGEN -- `grs_bgen2pgen()`

UKB imputed genotype data is stored in BGEN format on RAP. `grs_bgen2pgen()`
submits one Swiss Army Knife job per chromosome to convert BGEN files to the
plink2-native PGEN format with a MAF filter applied via plink2. This staged
workflow separates one-time genotype conversion from repeated GRS scoring, so
the converted PGEN files can be reused across multiple score files.

The trade-off is additional RAP project storage: PGEN outputs should be kept
only when they will be reused, and removed according to the project's storage
and cost policy when no longer needed.

This function is called from your local machine or RAP RStudio -- the heavy
lifting runs entirely in the cloud.

```{r grs-bgen2pgen-test}
# Test on the smallest chromosome first
ids <- grs_bgen2pgen(chr = 22, dest = "/pgen/", priority = "high")
#> Uploading 'grs_bgen2pgen_std.R' to RAP ...
#> ✔ Uploaded: '/grs_bgen2pgen_std.R'
#> Submitting 1 job(s) -- mem2_ssd1_v2_x4 / priority: high
#> ✔ chr22 -> 'job-Gxxxxxxxxxxxxxxxxxxxxxxxx'
#> ✔ 1/1 job(s) submitted. Monitor with job_ls().
```

```{r grs-bgen2pgen-all}
# Full run: small chromosomes on standard, large on upgraded instance
ids_small <- grs_bgen2pgen(chr = 17:22, dest = "/pgen/")
ids_large <- grs_bgen2pgen(chr = 1:16,  dest = "/pgen/", instance = "large")

# Monitor progress (job_wait() takes one job ID at a time)
job_ls()
for (id in c(ids_small, ids_large)) job_wait(id)
```

**Instance types:**

| Preset | DNAnexus instance | Cores | RAM | SSD | Recommended for |
|---|---|---|---|---|---|
| `"standard"` | `mem2_ssd1_v2_x4` | 4 | 12 GB | 200 GB | chr 15–22 |
| `"large"` | `mem2_ssd2_v2_x8` | 8 | 28 GB | 640 GB | chr 1–16 |

**Key arguments:**

| Argument | Default | Description |
|---|---|---|
| `chr` | `1:22` | Chromosomes to process |
| `dest` | — | RAP output path for PGEN files (required) |
| `maf` | `0.01` | MAF filter passed to plink2 `--maf` |
| `instance` | `"standard"` | Instance type preset |
| `priority` | `"low"` | Job priority (`"low"` or `"high"`) |

> **Storage warning**: chromosomes 1–16 may exceed the 200 GB SSD on
> `"standard"` instances. Use `instance = "large"` for these.

---

## Step 3: Calculate GRS Scores -- `grs_score()`

Once PGEN files are ready, `grs_score()` uploads your weights file(s) and
submits one plink2 scoring job per GRS. Each job scores all 22 chromosomes
and saves a single CSV to RAP.

```{r grs-score}
ids <- grs_score(
  file     = c(
    grs_a = "weights/grs_a_weights.txt",
    grs_b = "weights/grs_b_weights.txt"
  ),
  pgen_dir = "/mnt/project/pgen",
  dest     = "/grs/",
  priority = "high"
)
#> ── Uploading 2 weight file(s) to RAP ────────────────────────────────────────
#> Uploading 'grs_a_weights.txt' ...
#> ✔ Uploaded: '/grs_a_weights.txt'
#> Uploading 'grs_b_weights.txt' ...
#> ✔ Uploaded: '/grs_b_weights.txt'
#> Submitting 2 job(s) -- mem2_ssd1_v2_x4 / priority: high
#> ✔ grs_a -> 'job-Gxxxxxxxxxxxxxxxxxxxxxxxx'
#> ✔ grs_b -> 'job-Gyyyyyyyyyyyyyyyyyyyyyyyy'
#> ✔ 2/2 job(s) submitted. Monitor with job_ls().

job_wait(ids)
```

When running from RAP RStudio with weights already at the project root,
the upload step is skipped automatically:

```{r grs-score-rap}
# On RAP: weights already at /mnt/project/grs_a_weights.txt
ids <- grs_score(
  file     = c(grs_a = "/mnt/project/grs_a_weights.txt"),
  pgen_dir = "/mnt/project/pgen",
  dest     = "/grs/"
)
#> ℹ grs_a_weights.txt already at RAP root, skipping upload.
```

> **Important**: the `maf` argument must match the value used in
> `grs_bgen2pgen()` so that the correct PGEN files are located.

Output per job: `/grs/<GRS_name>_scores.csv` with columns `IID` and the GRS
score named `GRS_<name>`.

---

## Step 4: Standardise GRS Columns -- `grs_standardize()`

After downloading the score CSVs from RAP (via `fetch_file()`) and merging
them into your analysis dataset, Z-score standardise each GRS column so that
effect estimates are interpretable as per-SD associations.

```{r grs-standardize}
# Auto-detect all columns containing "grs" (case-insensitive)
cohort <- grs_standardize(cohort)
#> Auto-detected 2 GRS column(s): 'GRS_a', 'GRS_b'
#> ✔ GRS_a -> GRS_a_z  [mean=0.0000, sd=1.0000]
#> ✔ GRS_b -> GRS_b_z  [mean=0.0000, sd=1.0000]
```

```{r grs-standardize-explicit}
# Or specify columns explicitly
cohort <- grs_standardize(cohort, grs_cols = c("GRS_a", "GRS_b"))
```

`grs_zscore()` is an alias and produces an identical result.

The original columns are preserved; `_z` columns are inserted immediately
after their source column.

---

## Step 5: Validate Predictive Performance -- `grs_validate()`

`grs_validate()` runs four complementary validation analyses for each GRS:

| Analysis | What it measures |
|---|---|
| **Per SD** | OR (logistic) or HR (Cox) per 1-SD increase in GRS |
| **High vs Low** | OR / HR comparing top 20% vs bottom 20% |
| **Trend test** | P-trend across Q1–Q4 quartiles |
| **Discrimination** | AUC (logistic) or C-index (Cox) with 95% CI |

### Logistic (cross-sectional)

```{r grs-validate-logistic}
res <- grs_validate(
  data        = cohort,
  grs_cols    = c("GRS_a_z", "GRS_b_z"),
  outcome_col = "outcome"
)
#> ── Creating GRS groups ───────────────────────────────────────────────────────
#> ── Effect per SD (OR) ───────────────────────────────────────────────────────
#> ── High vs Low ──────────────────────────────────────────────────────────────
#> ── Trend test ───────────────────────────────────────────────────────────────
#> ── AUC ──────────────────────────────────────────────────────────────────────
#> ✔ Validation complete.

res$per_sd
res$discrimination
```

### Cox (survival)

```{r grs-validate-cox}
res <- grs_validate(
  data        = cohort,
  grs_cols    = c("GRS_a_z", "GRS_b_z"),
  outcome_col = "outcome",
  time_col    = "followup_years",
  covariates  = c("age", "sex", paste0("pc", 1:10))
)

res$per_sd          # HR per SD x model
res$high_vs_low     # HR: top 20% vs bottom 20%
res$trend           # p-trend across Q1–Q4
res$discrimination  # C-index with 95% CI
```

**Return value:** a named list with four `data.table` elements.

| Element | Columns (logistic) | Columns (Cox) |
|---|---|---|
| `per_sd` | `exposure`, `term`, `model`, `OR`, `CI_lower`, `CI_upper`, `p_value` | same with `HR` |
| `high_vs_low` | same as `per_sd` | same with `HR` |
| `trend` | `exposure`, `model`, `p_trend` | same |
| `discrimination` | `GRS`, `AUC`, `CI_lower`, `CI_upper` | `GRS`, `C_index`, `CI_lower`, `CI_upper` |

> AUC calculation requires the `pROC` package.
> Install with `install.packages("pROC")` if needed.

---

## Complete Pipeline Example

```{r grs-full-pipeline}
library(ukbflow)

# 1. Validate weights (local)
grs_check("weights.csv", dest = "weights_clean.txt")

# 2. Convert BGEN -> PGEN on RAP (submit jobs)
ids_std  <- grs_bgen2pgen(chr = 17:22, dest = "/pgen/", maf = 0.01)
ids_lrg  <- grs_bgen2pgen(chr = 1:16,  dest = "/pgen/", maf = 0.01, instance = "large")
for (id in c(ids_std, ids_lrg)) job_wait(id)

# 3. Score GRS on RAP (submit jobs)
score_ids <- grs_score(
  file     = c(grs_a = "weights_clean.txt"),
  pgen_dir = "/mnt/project/pgen",
  maf      = 0.01,   # must match grs_bgen2pgen()
  dest     = "/grs/"
)
job_wait(score_ids)

# 4. Download score CSV from RAP
fetch_file("/grs/GRS_a_scores.csv", dest = "GRS_a_scores.csv")

# 5. Merge into analysis dataset and standardise
# cohort: your analysis data.table with IID and phenotype columns
scores <- data.table::fread("GRS_a_scores.csv")   # columns: IID, GRS_a
cohort <- scores[cohort, on = "IID"]               # right-join: keep all cohort rows
cohort <- grs_standardize(cohort, grs_cols = "GRS_a")

# 6. Validate
res <- grs_validate(
  data        = cohort,
  grs_cols    = "GRS_a_z",
  outcome_col = "outcome",
  time_col    = "followup_years",
  covariates  = c("age", "sex", paste0("pc", 1:10))
)

res$per_sd
res$discrimination
```

---

## Getting Help

- `?grs_check`, `?grs_bgen2pgen`, `?grs_score`
- `?grs_standardize`, `?grs_validate`
- `vignette("auth")` -- RAP authentication and project selection
- `vignette("job")` -- monitoring and managing RAP jobs
- `vignette("assoc")` -- association analysis functions used inside `grs_validate()`
- [GitHub Issues](https://github.com/evanbio/ukbflow/issues)