Author: Shouye Liu
Date: 2025-03-21
Description: Diagnose scale issue of summary statistics

In [20]:
suppressPackageStartupMessages(library(tidyverse))
library(data.table)
library(ggplot2)
library(patchwork)

In an association of a single SNP $j$, we have a simple linear model (Yang, Jian, et al(2012)): $$y = X_jb_j + e \tag{1}$$ Based on OLS, we have $$\hat{b}_j = (X_j^TX_j)^{-1}X_j^Ty = D_j^{-1}X_j^Ty\tag{2}$$ and $$\sigma_b^2 = \sigma_e^2 (X_j^TX_j)^{-1} = \sigma_e^2D_j^{-1}\tag{3}$$ Based on ANOVA, we have $$\sigma_e^2 = \frac{y^Ty - \hat{y}^T\hat{y}}{(n-1)} = \frac{y^Ty - (X_j\hat{b}_j)^T(X_j\hat{b}_j)}{(n-1)} \tag{4}$$ Based on probability, we have Z-score $$z_j = \frac{\hat{b}_j}{\sigma_b}\tag{5}$$ Reformatting Eq(4), we derive $$(n-1)\sigma_e^2 = y^Ty - \hat{b}_j^2D_j \tag{6}$$ By combining Eq(3) and Eq(5) into Eq(6), we have $$\begin{aligned}(n-1)\sigma_b^2D_j &= y^Ty - z_j^2\sigma_b^2D_j \\ \Leftrightarrow \sigma_b &= \sqrt{\frac{y^Ty}{(n-1)D_j+z_j^2D_j}}\end{aligned}\tag{7}$$ Here, we assume trait phenotype $y$ is standardized (thus, $y^Ty = n$) and $D_j = 2p_j(1-p_j)n$. If only summary statistics Z score is provided, we can re-estimate $\hat{b}_j$ and $\sigma_b$ given allele frequency $p$, sample size $n$ and zscore $z$, therefore, $$\begin{aligned}\sigma_b &= \sqrt{\frac{n}{(n-1)2p_j(1-p_j)n+z_j^22p_j(1-p_j)n}} \\ &= \sqrt{\frac{1}{(n-1)2p_j(1-p_j)+z_j^22p_j(1-p_j)}}\end{aligned}\tag{8}$$ Also, we have $$b_j = \sqrt{\frac{z^2}{(n-1)2p_j(1-p_j)+z_j^22p_j(1-p_j)}}\tag{9}$$

In [21]:
############################################
### Run a simulation to simulate gene exression
### y = beta * x + e, where y is scaled gene expression
### x is the genotype at 0/1/2 scale, BETA and SE
###  is the original effect size and standard error
###  N is the sample size
###  A1Freq is the allele frequency 
###  zz is the zscore
############################################
set.seed(1234)
sumPath = "~/Documents/personal/SbayesAnno/downSampling/Verify0319/summary/"
smrIndFileSuffix = "ukbg-chr22-causal-model-50k-ind"
geneLDFile = paste0(sumPath,smrIndFileSuffix)
xqtl = fread(paste0(geneLDFile,".q.query.gz"))
test = xqtl %>% mutate(zz = BETA/SE) %>%
mutate(beta = zz / sqrt(2 * A1Freq * (1-A1Freq) * (N-1 + zz^2)),
       betaScale = BETA * sqrt(2 * A1Freq * (1-A1Freq)),
se = 1 / sqrt(2 * A1Freq * (1-A1Freq) * (N-1 + zz^2))
) 
head(test)
options(repr.plot.width = 12, repr.plot.height = 6)
#### beta based on zscore and Original beta
p1 = ggplot(test, aes(x = beta, y = BETA,)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  labs(
    y = "Orignal Beta",
    x = "Beta based on Zscore",
    title = "Fig 1.1 Orignal Beta vs Beta based on Zscore in the simulation"
  ) + theme_minimal();
  #### beta based on zscore and Original beta
p2 = ggplot(test, aes(x = beta, y = betaScale)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  labs(
    x = "Beta based on Zscore",
    y = "scaled Beta",
    title = "Fig 1.2 scaled Beta vs Beta based on Zscore in the simulation"
  ) + theme_minimal();
(p1 + p2) 
print(paste0("correlation between beta and BETA is ", cor(test$beta, test$BETA)))
print(paste0("correlation between beta and betaScale is ", cor(test$beta, test$betaScale)))
A data.table: 6 x 18
GeneIDGeneChrGeneLengthGenePhyPosSNPIDSNPChrSNPPhyPosA1A2A1FreqBETASEPvalueNzzbetabetaScalese
<chr><int><int><dbl><chr><int><int><chr><chr><dbl><dbl><dbl><dbl><int><dbl><dbl><dbl><dbl>
ENSG0000018790522-99921409346rs428214 2221361060TG0.155100-0.0061234550.008735706 4.833227e-0150000 -0.7009685-0.006123393-0.0031348740.008735618
ENSG0000018790522-99921409346rs373092 2221361870GT0.027860-0.5117405060.0190788132.225000e-30850000-26.8224492-0.511735459-0.1191021920.019078625
ENSG0000018790522-99921409346rs398577 2221362849AC0.139213-0.2755731050.0090601382.225000e-30850000-30.4159945-0.275319807-0.1349085810.009051810
ENSG0000018790522-99921409346rs20752742221363306AG0.371860-0.3016518350.0064052352.225000e-30850000-47.0945754-0.301531294-0.2061763880.006402676
ENSG0000018790522-99921409346rs25165422221363505TC0.126070-0.3421852290.0094034732.225000e-30850000-36.3892402-0.342181893-0.1606277060.009403381
ENSG0000018790522-99921409346rs20752762221363744CT0.498304-0.4346833230.0060302752.225000e-30850000-72.0834933-0.433913672-0.3073657570.006019598
[1] "correlation between beta and BETA is 0.999999335847599"
[1] "correlation between beta and betaScale is 0.957713529698195"
No description has been provided for this image

Fig 1.1 suggests that the original beta values are very well captured by the Z-score-based estimates. In Fig 1.2 (right plot), the data points systematically deviate from the reference line $y = x$, indicating a upward biased (Scaled Beta is larger than Beta based on Z-score).

In [22]:
############################################
### blood pQTLs
############################################
library(data.table)
library(tidyverse)
beqtlPath = "~/Downloads/data/xqtl/"
beqtl = fread(paste0(beqtlPath,"ukb-ppp-pGWAS-v1-eur-hg19-pro3k-chr22-cis1M-HM3q.query.gz"))
test = beqtl %>% mutate(zz = BETA/SE) %>%
mutate(beta = zz / sqrt(2 * A1Freq * (1-A1Freq) * (N-1 + zz^2)),
       betaScale = BETA * sqrt(2 * A1Freq * (1-A1Freq)),
se = 1 / sqrt(2 * A1Freq * (1-A1Freq) * (N-1 + zz^2))
) %>% slice_sample(n = 10000)
head(test)
options(repr.plot.width = 12, repr.plot.height = 6)
p1 = ggplot(test, aes(x = beta, y = BETA)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  labs(
    y = "Beta from raw data",
    x = "Beta based on Zscore",
    title = "Randomly selected 10,000 SNPs from genes for blood pQTLs"
  ) + theme_minimal()
p2 =  test  %>%
mutate( betaScale = BETA * sqrt(2 * A1Freq * (1-A1Freq))) %>%
 ggplot( aes(x = beta, y = betaScale)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  labs(
    y = "Fig 2.1 Beta based on Zscore adjusted by snp2pq (beta * sqrt(2 * A1Freq * (1-A1Freq)))",
    x = "Fig 2.2 Beta based on Zscore",
    title = "Randomly selected 10,000 SNPs from genes for blood pQTLs"
  ) + theme_minimal()
  p1 + p2
 print(paste0("correlation between beta and BETA is ", cor(test$beta, test$BETA)))
print(paste0("correlation between beta and betaScale is ", cor(test$beta, test$betaScale)))
A data.table: 6 x 18
GeneIDGeneChrGeneLengthGenePhyPosSNPIDSNPChrSNPPhyPosA1A2A1FreqBETASEPvalueNzzbetabetaScalese
<chr><int><int><dbl><chr><int><int><chr><chr><dbl><dbl><dbl><dbl><int><dbl><dbl><dbl><dbl>
OID20838_ENSG0000010008322-99937621144rs86582 2237603390TC0.194599 0.0133660000.009724400.16929250232922 1.3744807 0.0135300154 0.00748329450.009843729
OID30372_ENSG0000009997722-99923975534rs170040532224250835GA0.032967-0.0917222000.020526100.00000787533924-4.4685644-0.0960545088-0.02316060090.021495608
OID30382_ENSG0000010034522-99936334645rs5999726 2235502070AC0.620594 0.0123639000.007783280.11216868533924 1.5885205 0.0125677829 0.00848450220.007911628
OID20865_ENSG0000013063822-99945758552rs1555208 2246577210TG0.090022-0.0049312400.014656100.73652151132922-0.3364633-0.0045813734-0.00199600170.013616264
OID30352_ENSG0000010002422-99924511248rs9608196 2224163081CT0.116033 0.0008675720.011569000.94022179633942 0.0749911 0.0008987183 0.00039294280.011984333
OID30708_ENSG0000010034222-99936260300rs2481 2236677400AG0.132781 0.0095437700.010046000.34210871233985 0.9500070 0.0107383219 0.00458001940.011303414
[1] "correlation between beta and BETA is 0.997710343811819"
[1] "correlation between beta and betaScale is 0.921397747836"
No description has been provided for this image

The transformation behavior in blood pQTLs data is highly similar to the simulated results. The systematic bias pattern (negative values shrinking, positive values stretching) remains evident.

In [23]:
############################################
### Brain eQTLs
############################################
library(data.table)
library(tidyverse)
beqtlPath = "~/Downloads/data/xqtl/"
beqtl = fread(paste0(beqtlPath,"QC-HM3-brain-gene16620-eqtl-1086269-cis1M-chr1-22-2025-03-17.query.gz"))
test = beqtl %>% mutate(zz = BETA/SE) %>%
mutate(beta = zz / sqrt(2 * A1Freq * (1-A1Freq) * (N-1 + zz^2)),
       betaScale = BETA * sqrt(2 * A1Freq * (1-A1Freq)),
se = 1 / sqrt(2 * A1Freq * (1-A1Freq) * (N-1 + zz^2))
) %>% slice_sample(n = 10000)
head(test)
options(repr.plot.width = 12, repr.plot.height = 6)
p1 = ggplot(test, aes(x = beta, y = BETA)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  labs(
    y = "Beta from raw data",
    x = "Beta based on Zscore",
    title = "Randomly selected 10,000 SNPs from genes for brain eQTLs"
  ) + theme_minimal()
p2 =  test  %>%
mutate( betaScale = BETA * sqrt(2 * A1Freq * (1-A1Freq))) %>%
 ggplot( aes(x = beta, y = betaScale)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  labs(
    y = "Fig 3.1 Beta based on Zscore adjusted by snp2pq (beta * sqrt(2 * A1Freq * (1-A1Freq)))",
    x = "Fig 3.2 Beta based on Zscore",
    title = "Randomly selected 10,000 SNPs from genes for brain eQTLs"
  ) + theme_minimal()
  p1 + p2

print(paste0("correlation between beta and BETA is ", cor(test$beta, test$BETA)))
print(paste0("correlation between beta and betaScale is ", cor(test$beta, test$betaScale)))
A data.table: 6 x 18
GeneIDGeneChrGeneLengthGenePhyPosSNPIDSNPChrSNPPhyPosA1A2A1FreqBETASEPvalueNzzbetabetaScalese
<chr><int><int><int><chr><int><int><chr><chr><dbl><dbl><dbl><dbl><int><dbl><dbl><dbl><dbl>
ENSG00000143771.12 1-999224555860rs4653593 1224881970AG0.213115-0.0069129260.037108730.852218592865-0.1862884-0.006010617-0.0040034970.03226512
ENSG00000146909.8 7-999156754138rs4078489 7155858530CT0.371585-0.0062396480.036078180.862692342865-0.1729480-0.004728891-0.0042641000.02734286
ENSG00000254343.2 8-999121066430rs10093792 8121515110AG0.314208-0.0973122120.039779020.014432292865-2.4463199-0.069559050-0.0638832740.02843416
ENSG00000272512.1 1-999 932388rs2250833 1 1854321AG0.306011 0.0448373670.038294680.241658652865 1.1708511 0.033562277 0.0292213250.02866486
ENSG00000213199.8 7-999150747611rs11984138 7150300735AG0.224044-0.0045913390.031811960.885241912865-0.1443274-0.004573626-0.0027073200.03168923
ENSG00000139209.1612-999 47192368rs2465230 12 47324596TC0.423497-0.0872165110.036316310.016324442865-2.4015800-0.064155329-0.0609452220.02671380
[1] "correlation between beta and BETA is 0.978783611798265"
[1] "correlation between beta and betaScale is 0.931821297460271"
No description has been provided for this image

Similar trend was found in brain eQTLs but Fig 3.1 had larger variance since here we use unique total sample size for all eQTLs.

In [19]:
# Extreme effect for snp rs7749880
# /QRISdata/Q4062/softwares/smr/smr --beqtl-summary  BrainMeta_cis_eQTL_chr6 --snp rs7749880 --out myquery --query
# SNP	Chr	BP	A1	A2	Freq	Probe	Probe_Chr	Probe_bp	Gene	Orientation	b	SE	p
# rs7749880	6	58346584	A	G	0.469945	ENSG00000271761.1	6	58229156	AL021368.1	-	-0.137552	0.0222519	6.34526e-10
# rs7749880	6	58346584	A	G	0.469945	ENSG00000215190.9	6	58280066	LINC00680	-	**-3156.06**	2.8288	0
In [ ]:
# Cis-sQTL mapping was conducted using FastQTL, testing for associations between genetic variants located within ±1 Mb of 
# each gene’s transcription start site (TSS). The analysis incorporated the covariates detailed in Section 4.1, along with
# 15 PEER factors derived from splicing quantifications (Section 3.5.2) (GTEx Consortium, 2020). 

# FastQTL(Ongen, Halit, et al.(2016)) utilizes normalized gene expression data, covariate matrices, and dosage genotype information
# within a regression framework to identify statistically significant associations between genetic variants and molecular phenotypes,
# thereby uncovering both cis (proximal) and trans (distal) QTLs.