Author: Shouye Liu
Email: syliu.xue@foxmail.com
Date: 2025-04-18
1. Math background¶
1.1 Exponential family distribution¶
The general form of an exponential family distribution is:
$$ f(x, y) = h(y) \exp\left( \eta^\top T(x, y) - A(\eta) \right) $$
In the context of generalized linear models (GLMs), the natural parameter $\eta$ is linked to the predictors $\mathbf{x}$ via:
$$ \eta = \mathbf{x}^\top \boldsymbol{\beta} $$
Thus, the conditional density of $y$ given $\mathbf{x}$ becomes:
$$ f(y \mid \mathbf{x}) = h(y) \exp\left( y \cdot \mathbf{x}^\top \boldsymbol{\beta} - A(\mathbf{x}^\top \boldsymbol{\beta}) \right) $$
The logistic regression Exponential family distribution for logistic regression: Now, we rewrite Bernoulli distribution in terms of exponential family distribution: $$\begin{aligned} f(y_i|p) &= p^{y_i}(1-p)^{y_i} \\ &= exp\{log[p^{y_i}(1-p)^{y_i}] \} \\ &= exp\{{y_i}log^p + {y_i}log^{(1-p)}\} \\ &= exp\{{y_i}log^{\frac{p}{1-p}} \} \\ \end{aligned} \tag{1}$$ where, $$h(x) = 1\tag{2}$$ $$T(x) = x\tag{3}$$ $$\eta = log^{\frac{p}{1-p}}\tag{4}$$ $$A(\eta) = 1\tag{5}$$ In logistic regression, we assume $\eta = \boldsymbol{x}^T_i\beta$, so $$p = Pr(y_i = 1|\boldsymbol{x_i}) = \frac{e^{\eta}}{1+e^{\eta}} = [1+e^{-\eta}]^{-1} = [1+e^{-\boldsymbol{x}_i\boldsymbol{\beta}}]^{-1}\tag{6}$$
1.2 Maximax likehood estimation¶
$$\begin{aligned}L(\mathbf{y}|\mathbf{X},\boldsymbol{\beta})=\prod^n_{i = 1} p_i^{y_i}(1-p_i)^{1-y_i}
\end{aligned}$$
where $y$ is the total number of success in $n$ trials.
The corresponding log-likelihood is
$$\begin{aligned}\ell(\mathbf{y}|\mathbf{X},\boldsymbol{\beta})
&=log L(\mathbf{y}|\mathbf{X},\boldsymbol{\beta}) \\
&=\sum^n_{i=1}[ y_ilog^{p_i} + (1-y_i)log^{(1-p_i)}] \\
&= \sum^n_{i = 1} [y_ilog^{\frac{p_i}{1-p_i}} + log^{(1-p_i)}] \\
&= \sum^n_{i = 1} [y_i \boldsymbol{x}_i\boldsymbol{\beta} + log^{\frac{1}{1+e^{\boldsymbol{x}_i\boldsymbol{\beta}}}}] \\
&= \sum^n_{i = 1} [y_i \boldsymbol{x}_i\boldsymbol{\beta} - log^{1+e^{\boldsymbol{x}_i\boldsymbol{\beta}}}] \\
&= \sum^n_{i = 1} y_i \boldsymbol{x}_i\boldsymbol{\beta} - \sum^n_{i = 1} log^{1+e^{\boldsymbol{x}_i\boldsymbol{\beta}}} \\
\end{aligned}$$
1.2.1. Gradient¶
Take first-order derivative of $\ell(\mathbf{y}|\mathbf{X},\boldsymbol{\beta})$ about $\boldsymbol{\beta}$, we have $$\begin{aligned} \nabla \ell(\boldsymbol{\beta}) &= \frac{\partial \ell(\mathbf{y}|\mathbf{X},\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} \\ &= \sum^n_{i = 1} y_i \boldsymbol{x}_i - \sum^n_{i = 1} [\frac{e^{\boldsymbol{x}_i\boldsymbol{\beta}} }{1+ e^{\boldsymbol{x}_i\boldsymbol{\beta}}}\times \boldsymbol{x}_i] \\ &= \sum^n_{i=1}(y_i -p_i)\boldsymbol{x}_i \\ &= \mathbf{X}^T(\mathbf{y} - \boldsymbol{\mu}) \\ \end{aligned}$$ where, $$\boldsymbol{\mu} =\begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_n \end{bmatrix} = \begin{bmatrix} \sigma(\mathbf{x}_1^T \boldsymbol{\beta}) \\ \sigma(\mathbf{x}_2^T \boldsymbol{\beta}) \\ \vdots \\ \sigma(\mathbf{x}_n^T \boldsymbol{\beta}) \end{bmatrix}$$
1.2.2. Hessian (second derivative matrix)¶
$$\begin{aligned}\mathbf{H} &= \frac{\partial^2 \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta} \, \partial \boldsymbol{\beta}^T} \\ &= \frac{\partial{[\nabla \ell(\boldsymbol{\beta})]}}{\partial{\boldsymbol{\beta}}} \\ &= \frac{\partial{[- \sum^n_{i = 1} [\frac{e^{\boldsymbol{x}_i\boldsymbol{\beta}} }{1+ e^{\boldsymbol{x}_i\boldsymbol{\beta}}}\times \boldsymbol{x}_i]]}}{\partial{\boldsymbol{\beta}}} \\ &= \frac{\partial{[- \sum^n_{i = 1} [{(1+ e^{-\boldsymbol{x}_i\boldsymbol{\beta}})}^{-1}\times \boldsymbol{x}_i]]}}{\partial{\boldsymbol{\beta}}} \\ &= - \sum^n_{i = 1} (-1) \times [{(1+ e^{-\boldsymbol{x}_i\boldsymbol{\beta}})}^{-2}\times \boldsymbol{x}_i] \times e^{-\boldsymbol{x}_i\boldsymbol{\beta}} \times (-1) \times \boldsymbol{x}_i^T \end{aligned}$$ Given $p_i = {(1+ e^{-\boldsymbol{x}_i\boldsymbol{\beta}})}^{-1}$ and $e^{\boldsymbol{x}_i\boldsymbol{\beta}} = \frac{p_i}{1-p_i}$, then $$\begin{aligned}\mathbf{H} &= - \sum^n_{i = 1} (-1) \times [{(1+ e^{-\boldsymbol{x}_i\boldsymbol{\beta}})}^{-2}\times \boldsymbol{x}_i] \times e^{-\boldsymbol{x}_i\boldsymbol{\beta}} \times (-1) \times \boldsymbol{x}_i^T \\ &= - \sum^n_{i = 1} [p_i^2\times \boldsymbol{x}_i] \times \frac{1-p_i}{p_i} \times \boldsymbol{x}_i^T \\ &= - \sum^n_{i = 1} (1-p_i)p_i \times \boldsymbol{x}_i \times \boldsymbol{x}_i^T \\ &= -\mathbf{X}^T\mathbf{W}\mathbf{X} \end{aligned}$$ Where, $\mathbf{W} = diag[p_i(1-p_i)]$.
1.3 Numerical Approximation¶
1.3.1 Gradient Ascent (on log-likelihood)¶
We aim to maximize the log-likelihood:
$$ \ell(\boldsymbol{\beta}) = \mathbf{y}^T \log(\boldsymbol{\mu}) + (\mathbf{1} - \mathbf{y})^T \log(1 - \boldsymbol{\mu}) $$
Or simplified:
$$ \ell(\boldsymbol{\beta}) = \mathbf{y}^T \mathbf{X} \boldsymbol{\beta} - \mathbf{1}^T \log(1 + e^{\mathbf{X} \boldsymbol{\beta}}) $$
Update rule:
$$ \boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + \alpha \cdot \mathbf{X}^T (\mathbf{y} - \boldsymbol{\mu}) $$
Where:
- $\alpha$: learning rate
- $\boldsymbol{\mu} = \sigma(\mathbf{X} \boldsymbol{\beta}^{(t)})$
This is a first-order method.
1.3.2. Newton-Raphson (Second-order method)¶
Use both the gradient and the Hessian to update $\boldsymbol{\beta}$:
Gradient:
$$ \nabla \ell(\boldsymbol{\beta}) = \mathbf{X}^T (\mathbf{y} - \boldsymbol{\mu}) $$
Hessian:
$$ \mathbf{H} = -\mathbf{X}^T \mathbf{W} \mathbf{X}, \quad \text{where } \mathbf{W} = \operatorname{diag}(\mu_i(1 - \mu_i)) $$
Update rule:
$$ \boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} - \mathbf{H}^{-1} \nabla \ell(\boldsymbol{\beta}^{(t)}) $$
Or equivalently:
$$ \boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T (\mathbf{y} - \boldsymbol{\mu}) $$
This method converges faster than gradient ascent but requires inverting a matrix, which can be expensive.
1.3.3. Iteratively Reweighted Least Squares (IRLS)¶
IRLS is a special case of Newton-Raphson, reformulated as a weighted least squares:
Define:
Pseudo-response (working response):
$$ \mathbf{z} = \boldsymbol{\eta} + \frac{\mathbf{y} - \boldsymbol{\mu}}{\mu_i (1 - \mu_i)} $$Weight matrix:
$$ \mathbf{W} = \operatorname{diag}(\mu_i (1 - \mu_i)) $$
Solve at each step:
$$ \boldsymbol{\beta}^{(t+1)} = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \mathbf{z} $$
This looks like a weighted linear regression, where weights and pseudo-response ( \mathbf{z} ) are updated at each iteration.
1.3.4 Summary Comparison¶
Method | Uses | Update Formula | Notes |
---|---|---|---|
Gradient Ascent | Gradient only | $\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + \alpha \mathbf{X}^T (\mathbf{y} - \boldsymbol{\mu})$ | Simple, slow, needs tuning $\alpha$ |
Newton-Raphson | Gradient + Hessian | $\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T (\mathbf{y} - \boldsymbol{\mu})$ | Faster convergence, more computation |
IRLS | Weighted Least Squares | $\boldsymbol{\beta}^{(t+1)} = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \mathbf{z}$ | Very popular for GLMs |
import numpy as np
import torch
from torch.nn.functional import sigmoid
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import time
# Setup
device = torch.device("cpu")
np.random.seed(42)
# Parameters
n_samples = 500
n_features = 10
# Simulate features and labels
X_np = np.random.randn(n_samples, n_features)
true_beta = np.random.randn(n_features)
logits = X_np @ true_beta
probs = 1 / (1 + np.exp(-logits))
y_np = np.random.binomial(1, probs)
# Convert to PyTorch
X = torch.tensor(X_np, dtype=torch.float32, device=device)
y = torch.tensor(y_np, dtype=torch.float32, device=device).view(-1, 1)
1.5.2. Accuracy Helper Function (Code Cell)¶
def compute_accuracy(beta, X, y):
with torch.no_grad():
preds = (sigmoid(X @ beta) >= 0.5).float()
return (preds == y).float().mean().item()
1.5.3. Gradient Ascent (Code Cell)¶
def gradient_ascent(X, y, lr=0.1, max_iter=100):
# Initialize beta (weights) as a zero vector with shape (features, 1)
# requires_grad=True allows autograd to track operations on beta
beta = torch.zeros((X.shape[1], 1), requires_grad=True, device=device)
# Iterate for a fixed number of steps
for _ in range(max_iter):
# Compute the predicted probabilities using the sigmoid function
mu = sigmoid(X @ beta)
# Compute the negative log-likelihood loss
# Add small epsilon (1e-8) to avoid log(0) instability
loss = -torch.mean(y * torch.log(mu + 1e-8) + (1 - y) * torch.log(1 - mu + 1e-8))
# Compute gradients via backpropagation
loss.backward()
# Update beta using gradient ascent
with torch.no_grad():
# Gradient ascent step: move in the direction of the negative gradient
beta += lr * (-beta.grad)
# Reset gradients to zero for the next iteration
beta.grad.zero_()
# Detach the tensor from the computation graph and return it
return beta.detach()
1.5.4. Newton-Raphson (Code Cell)¶
def newton_raphson(X, y, max_iter=10):
# Initialize beta (weights) as zeros; shape = (features, 1)
beta = torch.zeros((X.shape[1], 1), device=device)
# Loop for a fixed number of iterations
for _ in range(max_iter):
# Compute the predicted probabilities using the sigmoid function
mu = sigmoid(X @ beta) # shape: (n_samples, 1)
# Compute the diagonal weight matrix W = diag(mu * (1 - mu))
W = torch.diag((mu * (1 - mu)).flatten()) # shape: (n_samples, n_samples)
# Compute the adjusted response z (working variable in IRLS/NR)
# z = η + (y - μ) / W = η + W⁻¹ (y - μ)
# η = X @ β
z = X @ beta + torch.inverse(W + 1e-4 * torch.eye(W.shape[0])) @ (y - mu)
# Update rule for Newton-Raphson:
# β = (Xᵀ W X)⁻¹ Xᵀ W z
# Add a small ridge term (1e-4 * I) for numerical stability
beta = torch.inverse(X.T @ W @ X + 1e-4 * torch.eye(X.shape[1])) @ X.T @ W @ z
# Return final estimate of beta
return beta
1.5.5. IRLS (Code Cell)¶
def irls(X, y, max_iter=10):
# Initialize beta (weights) as zeros; shape = (features, 1)
beta = torch.zeros((X.shape[1], 1), device=device)
# Iteratively update beta using weighted least squares
for _ in range(max_iter):
# Compute the linear predictor η = X @ beta
eta = X @ beta
# Compute predicted probabilities using sigmoid(η)
mu = sigmoid(eta) # shape: (n_samples, 1)
# Compute weights: W_diag = μ(1 - μ) (flattened for diag)
W_diag = (mu * (1 - mu)).flatten() # shape: (n_samples,)
# Construct the diagonal weight matrix W
W = torch.diag(W_diag) # shape: (n_samples, n_samples)
# Compute the working response z:
# z = η + (y - μ) / [μ(1 - μ)]
z = eta + (y - mu) / (W_diag.view(-1, 1) + 1e-8) # add epsilon for numerical stability
# Weighted least squares update:
# β = (Xᵀ W X)⁻¹ Xᵀ W z
beta = torch.inverse(X.T @ W @ X + 1e-4 * torch.eye(X.shape[1])) @ X.T @ W @ z
# Return the final estimate of beta
return beta
1.5.6. Compare All Methods (Code Cell)¶
results = {}
start = time.time()
beta_gd = gradient_ascent(X, y)
results['Gradient Ascent'] = (time.time() - start, compute_accuracy(beta_gd, X, y))
start = time.time()
beta_nr = newton_raphson(X, y)
results['Newton-Raphson'] = (time.time() - start, compute_accuracy(beta_nr, X, y))
start = time.time()
beta_irls = irls(X, y)
results['IRLS'] = (time.time() - start, compute_accuracy(beta_irls, X, y))
# Scikit-learn
sk_model = LogisticRegression(fit_intercept=False, max_iter=1000)
start = time.time()
sk_model.fit(X_np, y_np)
sk_time = time.time() - start
sk_acc = accuracy_score(y_np, sk_model.predict(X_np))
results['scikit-learn'] = (sk_time, sk_acc)
# Print comparison
import pandas as pd
pd.DataFrame(results, index=['Time (s)', 'Accuracy']).T
Time (s) | Accuracy | |
---|---|---|
Gradient Ascent | 0.012920 | 0.834 |
Newton-Raphson | 0.016865 | 0.836 |
IRLS | 0.001456 | 0.836 |
scikit-learn | 0.001023 | 0.836 |
1.5.7. Plot Gradient Ascent Convergence (Optional Code Cell)¶
import matplotlib.pyplot as plt
import numpy as np
import torch
from torch.nn.functional import sigmoid
from sklearn.linear_model import LogisticRegression
# --------- Gradient Ascent with loss tracking ---------
def gradient_ascent_trace(X, y, lr=0.1, max_iter=100):
# Initialize beta with zeros and enable gradient tracking
beta = torch.zeros((X.shape[1], 1), requires_grad=True, device=device)
loss_trace = [] # Store loss values for plotting
for _ in range(max_iter):
# Predicted probabilities using sigmoid function
mu = sigmoid(X @ beta)
# Compute negative log-likelihood (add small epsilon for numerical stability)
loss = -torch.mean(y * torch.log(mu + 1e-8) + (1 - y) * torch.log(1 - mu + 1e-8))
loss_trace.append(loss.item()) # Store current loss
# Compute gradient
loss.backward()
# Gradient ascent step
with torch.no_grad():
beta += lr * (-beta.grad) # Ascend the negative gradient
beta.grad.zero_() # Reset gradients for next iteration
return beta.detach(), loss_trace # Return final beta and loss history
# --------- Newton-Raphson with loss tracking ---------
def newton_raphson_trace(X, y, max_iter=10):
beta = torch.zeros((X.shape[1], 1), device=device)
loss_trace = []
for _ in range(max_iter):
mu = sigmoid(X @ beta) # Predicted probabilities
loss = -torch.mean(y * torch.log(mu + 1e-8) + (1 - y) * torch.log(1 - mu + 1e-8))
loss_trace.append(loss.item())
# Construct weight matrix W = diag(mu * (1 - mu))
W = torch.diag((mu * (1 - mu)).flatten())
# Compute pseudo-response z
z = X @ beta + torch.inverse(W + 1e-4 * torch.eye(W.shape[0])) @ (y - mu)
# Newton-Raphson update: beta = (Xᵀ W X)⁻¹ Xᵀ W z
beta = torch.inverse(X.T @ W @ X + 1e-4 * torch.eye(X.shape[1])) @ X.T @ W @ z
return beta, loss_trace
# --------- IRLS with loss tracking ---------
def irls_trace(X, y, max_iter=10):
beta = torch.zeros((X.shape[1], 1), device=device)
loss_trace = []
for _ in range(max_iter):
eta = X @ beta # Linear predictor
mu = sigmoid(eta) # Predicted probabilities
loss = -torch.mean(y * torch.log(mu + 1e-8) + (1 - y) * torch.log(1 - mu + 1e-8))
loss_trace.append(loss.item())
W_diag = (mu * (1 - mu)).flatten() # Diagonal of weight matrix
W = torch.diag(W_diag) # Weight matrix W
# Compute pseudo-response z
z = eta + (y - mu) / (W_diag.view(-1, 1) + 1e-8) # Add epsilon to prevent div-by-zero
# IRLS update: beta = (Xᵀ W X)⁻¹ Xᵀ W z
beta = torch.inverse(X.T @ W @ X + 1e-4 * torch.eye(X.shape[1])) @ X.T @ W @ z
return beta, loss_trace
# --------- scikit-learn model as benchmark ---------
# Fit logistic regression using sklearn for reference
sk_model = LogisticRegression(fit_intercept=False, max_iter=1000)
sk_model.fit(X_np, y_np)
# Compute loss using sklearn's fitted coefficients
logits_sk = X_np @ sk_model.coef_.flatten()
mu_sk = 1 / (1 + np.exp(-logits_sk))
loss_sklearn = -np.mean(y_np * np.log(mu_sk + 1e-8) + (1 - y_np) * np.log(1 - mu_sk + 1e-8))
# --------- Run all methods and collect loss traces ---------
_, loss_gd = gradient_ascent_trace(X, y, lr=0.1, max_iter=100)
_, loss_nr = newton_raphson_trace(X, y, max_iter=10)
_, loss_irls = irls_trace(X, y, max_iter=10)
# For sklearn, use a flat horizontal line for loss comparison
loss_sklearn_line = [loss_sklearn] * max(len(loss_gd), len(loss_nr), len(loss_irls))
# --------- Plot convergence of all methods ---------
plt.figure(figsize=(10, 6))
# Gradient Ascent: smooth line
plt.plot(loss_gd, label="Gradient Ascent", color="blue", linewidth=2)
# Newton-Raphson: dots with orange line
plt.plot(np.arange(len(loss_nr)), loss_nr, 'o-', label="Newton-Raphson", color="orange", markersize=6)
# IRLS: green dashed line with squares
plt.plot(np.arange(len(loss_irls)), loss_irls, 's--', label="IRLS", color="green", markersize=6)
# scikit-learn: black dashed horizontal line
plt.axhline(loss_sklearn, color="black", linestyle="--", linewidth=2, label="scikit-learn")
# Decorate the plot
plt.xlabel("Iteration")
plt.ylabel("Negative Log-Likelihood")
plt.title("Convergence of Logistic Regression Methods")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()