Installation
To install the latest version of sHDL
package via
Github, run the following code in R
:
# install.packages('remotes')
remotes::install_github("now2014/sHDL", ref='main')
The data.table
package is also required for this
example:
install.packages('data.table')
Data preparation
Pre-computed LD reference panels
There are three publically available pre-computed
autosome LD reference panels for the
European-ancestry population (335,265 Genomic
British individuals in UK Biobank) from HDL
:
Panel 1. QCed UK Biobank Axiom Array (307,519 SNPs):
wget -c -t 1 \
https://www.dropbox.com/s/fuvpwsf6r8tjd6c/UKB_array_SVD_eigen90_extraction.tar.gz?dl=0 \
--no-check-certificate -O /path/to/UKB_array_SVD_eigen90_extraction.tar.gz
Panel 2. QCed UK Biobank imputed HapMap2 (769,306 SNPs):
wget -c -t 1 \
https://www.dropbox.com/s/4vuktycxz1an6sp/UKB_imputed_hapmap2_SVD_eigen99_extraction.tar.gz?dl=0 \
--no-check-certificate -O /path/to/UKB_imputed_hapmap2_SVD_eigen99_extraction.tar.gz
Panel 3. QCed UK Biobank imputed HapMap3 (1,029,876 SNPs):
wget -c -t 1 \
https://www.dropbox.com/s/6js1dzy4tkc3gac/UKB_imputed_SVD_eigen99_extraction.tar.gz?dl=0 \
--no-check-certificate -O /path/to/UKB_imputed_SVD_eigen99_extraction.tar.gz
You may download one of the above reference panel according to the SNP density of your GWAS summary statistics, or build your own LD reference panel. Alternatively, you can download the reference panels via Baidu Netdisk referring to this FAQ.
Here, we illustrate the usage of sHDL with the Panel 2. QCed UK Biobank Axiom Array.
-
Step 01. Download the LD reference panel.
-
Step 02. Extract files from the archive.
GWAS summary statistics
You can test with the dataset from HDL
directly:
data(gwas1.example, package='HDL') # GWAS summary statistics from the Neale Lab round 2 GWAS of UK Biobank of birth weight.
gwas.df <- gwas1.example[, c('SNP', 'A1', 'A2', 'Z', 'N')]
saveRDS(gwas1.example, file='gwas.df.rds') # save the gwas.df for sHDL
Or you can prepare your data as follows:
-
Step 01. Download the Height GWAS summary statistics from the GIANT consortium.
-
Step 02. Load the GWAS summary statistics into
R
.gwas.df <- data.table::fread('/path/to/GIANT_HEIGHT_YENGO_2022_GWAS_SUMMARY_STATS_EUR.gz', fill=TRUE, header=TRUE) head(gwas.df, 5) # show the first five rows # SNPID RSID CHR POS EFFECT_ALLELE OTHER_ALLELE EFFECT_ALLELE_FREQ BETA SE P N # 1: 1:566875:C:T rs2185539 1 566875 T C 0.002800 -0.046315600 0.03930 0.238128 537968 # 2: 1:728951:C:T rs11240767 1 728951 T C 0.000356 0.167358000 0.12600 0.185025 85591 # 3: 1:734462:A:G rs12564807 1 734462 A G 0.893000 0.004656900 0.01100 0.672866 112953 # 4: 1:752721:A:G rs3131972 1 752721 G A 0.840000 0.000544089 0.00284 0.84811 615932 # 5: 1:754182:A:G rs3131969 1 754182 G A 0.865000 0.001333110 0.00185 0.470389 1100634
-
Step 03. Select & rename 4 required columns for
sHDL
:- SNP: rsid of variants
- A1: effect allele
- A2: other allele
- Z: GWAS Z-score
- N: Sample size
gwas.df$Z <- gwas.df$BETA / gwas.df$SE # calculate the GWAS Z-score gwas.df <- gwas.df[, c('RSID', 'EFFECT_ALLELE', 'OTHER_ALLELE', 'Z', 'N')] colnames(gwas.df) <- c('SNP', 'A1', 'A2', 'Z', 'N') head(gwas.df, 5) # SNP A1 A2 Z N # 1: rs2185539 T C -1.1785140 537968 # 2: rs11240767 T C 1.3282381 85591 # 3: rs12564807 A G 0.4233545 112953 # 4: rs3131972 G A 0.1915806 615932 # 5: rs3131969 G A 0.7206000 1100634 saveRDS(gwas.df, file='gwas.df.rds') # save the gwas.df for sHDL
Genomic annotation
In this example, we will create a binary annotation of all genes based on the gene locations (hg19).
-
Step 01. Download the gene location information from
GeneBreak
package. -
Step 02. Create a binary annotation of all genes.
load('ens.gene.ann.hg19.rda') # load ens.gene.ann.hg19 LD.path <- '/your/path/to/UKB_array_SVD_eigen90_extraction' ## extract gene locations ## gene.pos <- unique(ens.gene.ann.hg19[, c(3:5)]) colnames(gene.pos) <- c('chrom', 'start', 'end') autosomes <- as.character(1:22) gene.pos <- gene.pos[gene.pos$chrom %in% autosomes, ] gene.pos$chrom <- as.numeric(gene.pos$chrom) ## extract variants position from LD reference panel ## bim <- do.call(rbind, lapply( sHDL:::list.LD.ref.files(LD.path, suffix='.bim'), data.table::fread, header=FALSE )) bim <- bim[, c(2, 1, 4)] colnames(bim) <- c('SNP', 'chrom', 'pos') ## match annotated variants ## gene2snp <- function(i, j){ cur.chrom <- gene.pos[j, 'chrom'] s <- gene.pos[j, 'start'] e <- gene.pos[j, 'end'] bim <- bim[bim$chrom == cur.chrom, ] bim <- bim[bim$pos >= s & bim$pos <= e, ] snps <- unique(c(i, bim$SNP)) nsnps <- length(snps) if(j %% 1000 == 0) cat(sprintf( '# of processed genes %d: # of annotated %d variants (%.2f%%) \n', j, nsnps, nsnps/M*100 )) return(snps) } M <- nrow(bim) snps <- Reduce(gene2snp, 1:nrow(gene.pos), init=c()) ## create the annotation weights for sHDL ## D <- rep(1, length(snps)) names(D) <- snps saveRDS(D, file='D.rds') # save the annotation weights for sHDL
Notice: sHDL
will automatically align
D
to the LD reference panel according to rsid
,
and set the weights of missing variants in the LD reference panel to
ZERO.
Run sHDL
library(sHDL)
LD.path <- '/your/path/to/UKB_array_SVD_eigen90_extraction'
gwas.df <- readRDS('gwas.df.rds')
D <- readRDS('D.rds')
t0 <- Sys.time()
res <- sHDL::sHDL(D, gwas.df, LD.path, nthreads=4, stepwise=TRUE, lam.cut=1, mode='memory', Dr.path=NULL)
print(Sys.time() - t0)
sHDL will finish in ~3 mins with 4 CPU cores.
Down below message in the console during the analysis (birth weight
from HDL
) :
Formating GWAS summary statistics ...
307519 out of 307519 (100%) SNPs in reference panel are available in the GWAS.
Converting z (D) to zr (Dr) ...
Applied `minmax` weight nomalization on 147849 (48.078%) annotated variants.
The theoretical upper boundary for enrichment fold is 2.080.
Optimizing ...
Optimization done in 63.731 seconds.
time 63.7308533191681
fold 1.27398058787877
h2 0.16391651163241
intercept 0.963235890727237
fold.se 0.0259721362117597
h2.se 0.00545604327753226
intercept.se 0.0106589577723641
fold.p 5.13295973147164e-26
h2.p 2.68604947190569e-198
intercept.p 0
stepwise TRUE
converged TRUE
optim.message CONVERGENCE: NORM OF PROJECTED GRADIENT <= PGTOL
Time difference of 2.521161 mins
Results of sHDL
Item | Description |
---|---|
time | Optimization time for sHDL, NOT including the time for converting z (D) to zr (Dr). |
fold | Estimation for heritability enrichment fold. |
h2 | Estimation for heritability. |
intercept | Estimation for intercept. |
fold.se | Standard error of estimation for heritability enrichment fold. |
h2.se | Standard error of estimation for heritability. |
intercept.se | Standard error of estimation for intercept.se. |
fold.p | P-value of estimation for heritability enrichment fold. |
h2.p | P-value of estimation for heritability. |
stepwise | Input stepwise parameter. |
converged | Whether the optimization converges or not. |
optim.message | The optimization message returned by optim function in
R . |