Diagnose scale issue of summary statistics#

suppressPackageStartupMessages(library(tidyverse))
library(data.table)
library(ggplot2)
library(patchwork)

In an association of a single SNP \(j\), we have a simple linear model (Yang, Jian, et al(2012)):

\[\begin{equation}y = X_jb_j + e \tag{1}\end{equation}\]

Based on OLS, we have

\[\begin{equation}\hat{b}_j = (X_j^TX_j)^{-1}X_j^Ty = D_j^{-1}X_j^Ty\tag{2}\end{equation}\]

and

\[\begin{equation}\sigma_b^2 = \sigma_e^2 (X_j^TX_j)^{-1} = \sigma_e^2D_j^{-1}\tag{3}\end{equation}\]

Based on ANOVA, we have

\[\begin{equation}\sigma_e^2 = \frac{y^Ty - \hat{y}^T\hat{y}}{(n-1)} = \frac{y^Ty - (X_j\hat{b}_j)^T(X_j\hat{b}_j)}{(n-1)} \tag{4}\end{equation}\]

Based on probability, we have Z-score

\[\begin{equation}z_j = \frac{\hat{b}_j}{\sigma_b}\tag{5}\end{equation}\]

Reformatting Eq(4), we derive

\[\begin{equation}(n-1)\sigma_e^2 = y^Ty - \hat{b}_j^2D_j \tag{6}\end{equation}\]

By combining Eq(3) and Eq(5) into Eq(6), we have

\[\begin{split} \begin{equation} \begin{aligned}(n-1)\sigma_b^2D_j &= y^Ty - z_j^2\sigma_b^2D_j \\ \Leftrightarrow \sigma_b &= \sqrt{\frac{y^Ty}{(n-1)D_j+z_j^2D_j}}\end{aligned} \tag\end{equation}{7} \end{equation} \end{split}\]

Here, we assume trait phenotype \(y\) is standardized (thus, \(y^Ty = n\)) and \(D_j = 2p_j(1-p_j)n\). If only summary statistics Z score is provided, we can re-estimate \(\hat{b}_j\) and \(\sigma_b\) given allele frequency \(p\), sample size \(n\) and zscore \(z\), therefore,

\[\begin{split} \begin{equation} \begin{aligned}\sigma_b &= \sqrt{\frac{n}{(n-1)2p_j(1-p_j)n+z_j^22p_j(1-p_j)n}} \\ &= \sqrt{\frac{1}{(n-1)2p_j(1-p_j)+z_j^22p_j(1-p_j)}}\end{aligned}\tag{8} \end{equation} \end{split}\]

Also, we have

\[\begin{equation}b_j = \sqrt{\frac{z^2}{(n-1)2p_j(1-p_j)+z_j^22p_j(1-p_j)}}\tag{9}\end{equation}\]
############################################
### Run a simulation to simulate gene exression
### y = beta * x + e, where y is scaled gene expression
### x is the genotype at 0/1/2 scale, BETA and SE
###  is the original effect size and standard error
###  N is the sample size
###  A1Freq is the allele frequency 
###  zz is the zscore
############################################
set.seed(1234)
sumPath = "~/Documents/personal/SbayesAnno/downSampling/Verify0319/summary/"
smrIndFileSuffix = "ukbg-chr22-causal-model-50k-ind"
geneLDFile = paste0(sumPath,smrIndFileSuffix)
xqtl = fread(paste0(geneLDFile,".q.query.gz"))
test = xqtl %>% mutate(zz = BETA/SE) %>%
mutate(beta = zz / sqrt(2 * A1Freq * (1-A1Freq) * (N-1 + zz^2)),
       betaScale = BETA * sqrt(2 * A1Freq * (1-A1Freq)),
se = 1 / sqrt(2 * A1Freq * (1-A1Freq) * (N-1 + zz^2))
) 
head(test)
options(repr.plot.width = 12, repr.plot.height = 6)
#### beta based on zscore and Original beta
p1 = ggplot(test, aes(x = beta, y = BETA,)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  labs(
    y = "Orignal Beta",
    x = "Beta based on Zscore",
    title = "Fig 1.1 Orignal Beta vs Beta based on Zscore in the simulation"
  ) + theme_minimal();
  #### beta based on zscore and Original beta
p2 = ggplot(test, aes(x = beta, y = betaScale)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  labs(
    x = "Beta based on Zscore",
    y = "scaled Beta",
    title = "Fig 1.2 scaled Beta vs Beta based on Zscore in the simulation"
  ) + theme_minimal();
(p1 + p2) 
print(paste0("correlation between beta and BETA is ", cor(test$beta, test$BETA)))
print(paste0("correlation between beta and betaScale is ", cor(test$beta, test$betaScale)))
A data.table: 6 x 18
GeneIDGeneChrGeneLengthGenePhyPosSNPIDSNPChrSNPPhyPosA1A2A1FreqBETASEPvalueNzzbetabetaScalese
<chr><int><int><dbl><chr><int><int><chr><chr><dbl><dbl><dbl><dbl><int><dbl><dbl><dbl><dbl>
ENSG0000018790522-99921409346rs428214 2221361060TG0.155100-0.0061234550.008735706 4.833227e-0150000 -0.7009685-0.006123393-0.0031348740.008735618
ENSG0000018790522-99921409346rs373092 2221361870GT0.027860-0.5117405060.0190788132.225000e-30850000-26.8224492-0.511735459-0.1191021920.019078625
ENSG0000018790522-99921409346rs398577 2221362849AC0.139213-0.2755731050.0090601382.225000e-30850000-30.4159945-0.275319807-0.1349085810.009051810
ENSG0000018790522-99921409346rs20752742221363306AG0.371860-0.3016518350.0064052352.225000e-30850000-47.0945754-0.301531294-0.2061763880.006402676
ENSG0000018790522-99921409346rs25165422221363505TC0.126070-0.3421852290.0094034732.225000e-30850000-36.3892402-0.342181893-0.1606277060.009403381
ENSG0000018790522-99921409346rs20752762221363744CT0.498304-0.4346833230.0060302752.225000e-30850000-72.0834933-0.433913672-0.3073657570.006019598
[1] "correlation between beta and BETA is 0.999999335847599"
[1] "correlation between beta and betaScale is 0.957713529698195"
../_images/de11d4ba9af13e7c3b9bec79e594f0d8bc5efdbf27ca5c02d5daa51feb0ba3d1.png

Fig 1.1 suggests that the original beta values are very well captured by the Z-score-based estimates. In Fig 1.2 (right plot), the data points systematically deviate from the reference line \(y = x\), indicating a upward biased (Scaled Beta is larger than Beta based on Z-score).

############################################
### blood pQTLs
############################################
library(data.table)
library(tidyverse)
beqtlPath = "~/Downloads/data/xqtl/"
beqtl = fread(paste0(beqtlPath,"ukb-ppp-pGWAS-v1-eur-hg19-pro3k-chr22-cis1M-HM3q.query.gz"))
test = beqtl %>% mutate(zz = BETA/SE) %>%
mutate(beta = zz / sqrt(2 * A1Freq * (1-A1Freq) * (N-1 + zz^2)),
       betaScale = BETA * sqrt(2 * A1Freq * (1-A1Freq)),
se = 1 / sqrt(2 * A1Freq * (1-A1Freq) * (N-1 + zz^2))
) %>% slice_sample(n = 10000)
head(test)
options(repr.plot.width = 12, repr.plot.height = 6)
p1 = ggplot(test, aes(x = beta, y = BETA)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  labs(
    y = "Beta from raw data",
    x = "Beta based on Zscore",
    title = "Randomly selected 10,000 SNPs from genes for blood pQTLs"
  ) + theme_minimal()
p2 =  test  %>%
mutate( betaScale = BETA * sqrt(2 * A1Freq * (1-A1Freq))) %>%
 ggplot( aes(x = beta, y = betaScale)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  labs(
    y = "Fig 2.1 Beta based on Zscore adjusted by snp2pq (beta * sqrt(2 * A1Freq * (1-A1Freq)))",
    x = "Fig 2.2 Beta based on Zscore",
    title = "Randomly selected 10,000 SNPs from genes for blood pQTLs"
  ) + theme_minimal()
  p1 + p2
 print(paste0("correlation between beta and BETA is ", cor(test$beta, test$BETA)))
print(paste0("correlation between beta and betaScale is ", cor(test$beta, test$betaScale)))
A data.table: 6 x 18
GeneIDGeneChrGeneLengthGenePhyPosSNPIDSNPChrSNPPhyPosA1A2A1FreqBETASEPvalueNzzbetabetaScalese
<chr><int><int><dbl><chr><int><int><chr><chr><dbl><dbl><dbl><dbl><int><dbl><dbl><dbl><dbl>
OID20838_ENSG0000010008322-99937621144rs86582 2237603390TC0.194599 0.0133660000.009724400.16929250232922 1.3744807 0.0135300154 0.00748329450.009843729
OID30372_ENSG0000009997722-99923975534rs170040532224250835GA0.032967-0.0917222000.020526100.00000787533924-4.4685644-0.0960545088-0.02316060090.021495608
OID30382_ENSG0000010034522-99936334645rs5999726 2235502070AC0.620594 0.0123639000.007783280.11216868533924 1.5885205 0.0125677829 0.00848450220.007911628
OID20865_ENSG0000013063822-99945758552rs1555208 2246577210TG0.090022-0.0049312400.014656100.73652151132922-0.3364633-0.0045813734-0.00199600170.013616264
OID30352_ENSG0000010002422-99924511248rs9608196 2224163081CT0.116033 0.0008675720.011569000.94022179633942 0.0749911 0.0008987183 0.00039294280.011984333
OID30708_ENSG0000010034222-99936260300rs2481 2236677400AG0.132781 0.0095437700.010046000.34210871233985 0.9500070 0.0107383219 0.00458001940.011303414
[1] "correlation between beta and BETA is 0.997710343811819"
[1] "correlation between beta and betaScale is 0.921397747836"
../_images/211faa11ad0309823d83e3403452cb645ba8a86eaf74693dd3a8550d88da8a64.png

The transformation behavior in blood pQTLs data is highly similar to the simulated results. The systematic bias pattern (negative values shrinking, positive values stretching) remains evident.

############################################
### Brain eQTLs
############################################
library(data.table)
library(tidyverse)
beqtlPath = "~/Downloads/data/xqtl/"
beqtl = fread(paste0(beqtlPath,"QC-HM3-brain-gene16620-eqtl-1086269-cis1M-chr1-22-2025-03-17.query.gz"))
test = beqtl %>% mutate(zz = BETA/SE) %>%
mutate(beta = zz / sqrt(2 * A1Freq * (1-A1Freq) * (N-1 + zz^2)),
       betaScale = BETA * sqrt(2 * A1Freq * (1-A1Freq)),
se = 1 / sqrt(2 * A1Freq * (1-A1Freq) * (N-1 + zz^2))
) %>% slice_sample(n = 10000)
head(test)
options(repr.plot.width = 12, repr.plot.height = 6)
p1 = ggplot(test, aes(x = beta, y = BETA)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  labs(
    y = "Beta from raw data",
    x = "Beta based on Zscore",
    title = "Randomly selected 10,000 SNPs from genes for brain eQTLs"
  ) + theme_minimal()
p2 =  test  %>%
mutate( betaScale = BETA * sqrt(2 * A1Freq * (1-A1Freq))) %>%
 ggplot( aes(x = beta, y = betaScale)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  labs(
    y = "Fig 3.1 Beta based on Zscore adjusted by snp2pq (beta * sqrt(2 * A1Freq * (1-A1Freq)))",
    x = "Fig 3.2 Beta based on Zscore",
    title = "Randomly selected 10,000 SNPs from genes for brain eQTLs"
  ) + theme_minimal()
  p1 + p2

print(paste0("correlation between beta and BETA is ", cor(test$beta, test$BETA)))
print(paste0("correlation between beta and betaScale is ", cor(test$beta, test$betaScale)))
A data.table: 6 x 18
GeneIDGeneChrGeneLengthGenePhyPosSNPIDSNPChrSNPPhyPosA1A2A1FreqBETASEPvalueNzzbetabetaScalese
<chr><int><int><int><chr><int><int><chr><chr><dbl><dbl><dbl><dbl><int><dbl><dbl><dbl><dbl>
ENSG00000143771.12 1-999224555860rs4653593 1224881970AG0.213115-0.0069129260.037108730.852218592865-0.1862884-0.006010617-0.0040034970.03226512
ENSG00000146909.8 7-999156754138rs4078489 7155858530CT0.371585-0.0062396480.036078180.862692342865-0.1729480-0.004728891-0.0042641000.02734286
ENSG00000254343.2 8-999121066430rs10093792 8121515110AG0.314208-0.0973122120.039779020.014432292865-2.4463199-0.069559050-0.0638832740.02843416
ENSG00000272512.1 1-999 932388rs2250833 1 1854321AG0.306011 0.0448373670.038294680.241658652865 1.1708511 0.033562277 0.0292213250.02866486
ENSG00000213199.8 7-999150747611rs11984138 7150300735AG0.224044-0.0045913390.031811960.885241912865-0.1443274-0.004573626-0.0027073200.03168923
ENSG00000139209.1612-999 47192368rs2465230 12 47324596TC0.423497-0.0872165110.036316310.016324442865-2.4015800-0.064155329-0.0609452220.02671380
[1] "correlation between beta and BETA is 0.978783611798265"
[1] "correlation between beta and betaScale is 0.931821297460271"
../_images/f3e04fb1ec357c2d0a85163daceab7023cf4ad4f44d5c002b64be47ecb9019af.png

Similar trend was found in brain eQTLs but Fig 3.1 had larger variance since here we use unique total sample size for all eQTLs.

# Extreme effect for snp rs7749880
# /QRISdata/Q4062/softwares/smr/smr --beqtl-summary  BrainMeta_cis_eQTL_chr6 --snp rs7749880 --out myquery --query
# SNP	Chr	BP	A1	A2	Freq	Probe	Probe_Chr	Probe_bp	Gene	Orientation	b	SE	p
# rs7749880	6	58346584	A	G	0.469945	ENSG00000271761.1	6	58229156	AL021368.1	-	-0.137552	0.0222519	6.34526e-10
# rs7749880	6	58346584	A	G	0.469945	ENSG00000215190.9	6	58280066	LINC00680	-	**-3156.06**	2.8288	0
# Cis-sQTL mapping was conducted using FastQTL, testing for associations between genetic variants located within ±1 Mb of 
# each gene’s transcription start site (TSS). The analysis incorporated the covariates detailed in Section 4.1, along with
# 15 PEER factors derived from splicing quantifications (Section 3.5.2) (GTEx Consortium, 2020). 

# FastQTL(Ongen, Halit, et al.(2016)) utilizes normalized gene expression data, covariate matrices, and dosage genotype information
# within a regression framework to identify statistically significant associations between genetic variants and molecular phenotypes,
# thereby uncovering both cis (proximal) and trans (distal) QTLs.