Installation
To install the latest version of sHDL
package via
Github, run the following code in R
:
# install.packages('remotes')
remotes::install_github("now2014/sHDL", ref='main')
The data.table
package is also required for this
example:
install.packages('data.table')
Data preparation
Pre-computed LD reference panels
There are three publically available pre-computed
autosome LD reference panels for the
European-ancestry population (335,265 Genomic
British individuals in UK Biobank) from HDL
:
Panel 1. QCed UK Biobank Axiom Array (307,519 SNPs):
wget -c -t 1 \
https://www.dropbox.com/s/fuvpwsf6r8tjd6c/UKB_array_SVD_eigen90_extraction.tar.gz?dl=0 \
--no-check-certificate -O /path/to/UKB_array_SVD_eigen90_extraction.tar.gz
Panel 2. QCed UK Biobank imputed HapMap2 (769,306 SNPs):
wget -c -t 1 \
https://www.dropbox.com/s/4vuktycxz1an6sp/UKB_imputed_hapmap2_SVD_eigen99_extraction.tar.gz?dl=0 \
--no-check-certificate -O /path/to/UKB_imputed_hapmap2_SVD_eigen99_extraction.tar.gz
Panel 3. QCed UK Biobank imputed HapMap3 (1,029,876 SNPs):
wget -c -t 1 \
https://www.dropbox.com/s/6js1dzy4tkc3gac/UKB_imputed_SVD_eigen99_extraction.tar.gz?dl=0 \
--no-check-certificate -O /path/to/UKB_imputed_SVD_eigen99_extraction.tar.gz
You may download one of the above reference panel according to the SNP density of your GWAS summary statistics, or build your own LD reference panel. Alternatively, you can download the reference panels via Baidu Netdisk referring to this FAQ.
Here, we illustrate the usage of sHDL with the Panel 2. QCed UK Biobank Axiom Array.
-
Step 01. Download the LD reference panel.
-
Step 02. Extract files from the archive.
GWAS summary statistics
You can test with the dataset from HDL
directly:
data(gwas1.example, package='HDL') # GWAS summary statistics from the Neale Lab round 2 GWAS of UK Biobank of birth weight.
gwas.df <- gwas1.example[, c('SNP', 'A1', 'A2', 'Z', 'N')]
saveRDS(gwas1.example, file='gwas.df.rds') # save the gwas.df for sHDL
Or you can prepare your data as follows:
-
Step 01. Download the Height GWAS summary statistics from the GIANT consortium.
-
Step 02. Load the GWAS summary statistics into
R
.gwas.df <- data.table::fread('/path/to/GIANT_HEIGHT_YENGO_2022_GWAS_SUMMARY_STATS_EUR.gz', fill=TRUE, header=TRUE) head(gwas.df, 5) # show the first five rows # SNPID RSID CHR POS EFFECT_ALLELE OTHER_ALLELE EFFECT_ALLELE_FREQ BETA SE P N # 1: 1:566875:C:T rs2185539 1 566875 T C 0.002800 -0.046315600 0.03930 0.238128 537968 # 2: 1:728951:C:T rs11240767 1 728951 T C 0.000356 0.167358000 0.12600 0.185025 85591 # 3: 1:734462:A:G rs12564807 1 734462 A G 0.893000 0.004656900 0.01100 0.672866 112953 # 4: 1:752721:A:G rs3131972 1 752721 G A 0.840000 0.000544089 0.00284 0.84811 615932 # 5: 1:754182:A:G rs3131969 1 754182 G A 0.865000 0.001333110 0.00185 0.470389 1100634
-
Step 03. Select & rename 4 required columns for
sHDL
:- SNP: rsid of variants
- A1: effect allele
- A2: other allele
- Z: GWAS Z-score
- N: Sample size
gwas.df$Z <- gwas.df$BETA / gwas.df$SE # calculate the GWAS Z-score gwas.df <- gwas.df[, c('RSID', 'EFFECT_ALLELE', 'OTHER_ALLELE', 'Z', 'N')] colnames(gwas.df) <- c('SNP', 'A1', 'A2', 'Z', 'N') head(gwas.df, 5) # SNP A1 A2 Z N # 1: rs2185539 T C -1.1785140 537968 # 2: rs11240767 T C 1.3282381 85591 # 3: rs12564807 A G 0.4233545 112953 # 4: rs3131972 G A 0.1915806 615932 # 5: rs3131969 G A 0.7206000 1100634 saveRDS(gwas.df, file='gwas.df.rds') # save the gwas.df for sHDL
Genomic annotation
In this example, we will create a binary annotation of all genes based on the gene locations (hg19).
-
Step 01. Download the gene location information from
GeneBreak
package. -
Step 02. Create a binary annotation of all genes.
load('ens.gene.ann.hg19.rda') # load ens.gene.ann.hg19 LD.path <- '/your/path/to/UKB_array_SVD_eigen90_extraction' ## extract gene locations ## gene.pos <- unique(ens.gene.ann.hg19[, c(3:5)]) colnames(gene.pos) <- c('chrom', 'start', 'end') autosomes <- as.character(1:22) gene.pos <- gene.pos[gene.pos$chrom %in% autosomes, ] gene.pos$chrom <- as.numeric(gene.pos$chrom) ## extract variants position from LD reference panel ## bim <- do.call(rbind, lapply( sHDL:::list.LD.ref.files(LD.path, suffix='.bim'), data.table::fread, header=FALSE )) bim <- bim[, c(2, 1, 4)] colnames(bim) <- c('SNP', 'chrom', 'pos') ## match annotated variants ## gene2snp <- function(gene.pos, bim, mc.cores=4){ map.gene <- function(i, gene.pos.chr, bim.chr){ s <- gene.pos.chr[i, 'start'] e <- gene.pos.chr[i, 'end'] return(bim.chr[bim.chr$pos >= s & bim.chr$pos <= e, ]$SNP) } chroms <- unique(bim$chrom) all.snps <- c() for(cur.chrom in chroms){ gene.pos.chr <- gene.pos[gene.pos$chrom == cur.chrom, ] bim.chr <- bim[bim$chrom == cur.chrom, ] snps <- parallel::mclapply( seq_len(nrow(gene.pos.chr)), map.gene, gene.pos.chr=gene.pos.chr, bim.chr=bim.chr, mc.cores=mc.cores ) snps <- unique(unlist(snps)) all.snps <- c(all.snps, snps) } return(all.snps) } snps <- gene2snp(gene.pos, bim, 4) ## create the annotation weights for sHDL ## D <- matrix(1, nrow=length(snps), ncol=1, dimnames=list(snps, 'ENS.gene')) saveRDS(D, file='D.rds') # save the annotation weights for sHDL
Notice: sHDL
will automatically align
D
to the LD reference panel according to rsid
,
and set the weights of missing variants in the LD reference panel to
ZERO.
Run sHDL
library(sHDL)
LD.path <- '/your/path/to/UKB_array_SVD_eigen90_extraction'
gwas.df <- readRDS('gwas.df.rds')
D <- readRDS('D.rds')
t0 <- Sys.time()
res <- sHDL::sHDL(D, gwas.df, LD.path, mc.cores=4, stepwise=TRUE, lam.cut=1, mode='memory', Dr.path=NULL)
print(Sys.time() - t0)
sHDL will finish in ~2 mins with 4 CPU cores.
Down below message in the console during the analysis (birth weight
from HDL
) :
Starting sHDL analysis...
307519 out of 307519 (100.00%) SNPs in reference panel are available in the GWAS.
Transfoming z (D) to zr (Dr)...
D is not NULL and Dr.path is not provided. The default path ./sHDL_Dr will be used.
Applied `minmax` weight nomalization on 147849 (48.078%) annotated variants.
The theoretical upper boundary for enrichment fold is M / Md = 2.080.
Starting optimization...
Optimization done in 52.539 seconds.
item estimation se p note
time 52.5394132137299 NA NA seconds
h2 0.16391651163241 0.00545604327753226 2.68604947190574e-198 total heritability
intercept 0.963235890727237 0.0106589577723641 0 intercept
fold.ENS.gene 1.2739805907671 0.0259721362754955 5.13295507071105e-26 enrichment fold
converged NA NA NA TRUE
message NA NA NA CONVERGENCE: NORM OF PROJECTED GRADIENT <= PGTOL
Time difference of 2.185567 mins
Results of sHDL
item | estimation | se | p | note |
---|---|---|---|---|
time | 52.5394132 | NA | NA | seconds |
h2 | 0.1639165 | 0.0054560 | 2.686049e-198 | total heritability |
intercept | 0.9632359 | 0.0106590 | 0.000000e+00 | intercept |
fold.ENS.gene | 1.2739806 | 0.0259721 | 5.132955e-26 | enrichment fold |
converged | NA | NA | NA | TRUE |
message | NA | NA | NA | CONVERGENCE: NORM OF PROJECTED GRADIENT <= PGTOL |
Notice:
- The item
time
only indicates the time for maximizing log likelihood, NOT including the time for converting z (D) to zr (Dr). - The null hypothesis of
p
value for the fold parameter is \(fold \ne 1\), NOT \(fold > 1\).