Author: Shouye Liu
Email: syliu.xue@foxmail.com
Date: 2025-04-18

1. Math background¶

1.1 Exponential family distribution¶

The general form of an exponential family distribution is:

$$ f(y \mid \eta) = h(y) \exp\left( \eta^\top T(y) - A(\eta) \right) $$

In the context of generalized linear models (GLMs), the natural parameter $\eta$ is linked to the predictors $\mathbf{x}$ via:

$$ \eta = \mathbf{x}^\top \boldsymbol{\beta} $$

Thus, the conditional density of $y$ given $\mathbf{x}$ becomes:

$$ f(y \mid \mathbf{x}) = h(y) \exp\left( y \cdot \mathbf{x}^\top \boldsymbol{\beta} - A(\mathbf{x}^\top \boldsymbol{\beta}) \right) $$

Exponential family form for logistic regression: we now rewrite the Bernoulli distribution as a member of the exponential family: $$\begin{aligned} f(y_i \mid p) &= p^{y_i}(1-p)^{1-y_i} \\ &= \exp\{\log[p^{y_i}(1-p)^{1-y_i}]\} \\ &= \exp\{y_i \log p + (1-y_i)\log(1-p)\} \\ &= \exp\left\{y_i \log\frac{p}{1-p} + \log(1-p)\right\} \\ \end{aligned} \tag{1}$$ where $$h(y) = 1\tag{2}$$ $$T(y) = y\tag{3}$$ $$\eta = \log\frac{p}{1-p}\tag{4}$$ $$A(\eta) = -\log(1-p) = \log(1+e^{\eta})\tag{5}$$ In logistic regression, we assume $\eta = \boldsymbol{x}_i^T\boldsymbol{\beta}$, so $$p = \Pr(y_i = 1 \mid \boldsymbol{x}_i) = \frac{e^{\eta}}{1+e^{\eta}} = [1+e^{-\eta}]^{-1} = [1+e^{-\boldsymbol{x}_i^T\boldsymbol{\beta}}]^{-1}\tag{6}$$
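
As a quick sanity check of (1)-(6), here is a minimal NumPy sketch (the variable names are ours, for illustration only) confirming that the exponential-family form with $A(\eta) = \log(1+e^{\eta})$ reproduces the Bernoulli pmf:

import numpy as np

p = 0.3
eta = np.log(p / (1 - p))      # natural parameter, Eq. (4)
A = np.log(1 + np.exp(eta))    # log-partition function, Eq. (5)

for y in (0, 1):
    direct = p**y * (1 - p)**(1 - y)  # Bernoulli pmf
    expfam = np.exp(y * eta - A)      # h(y) exp(eta T(y) - A(eta)), with h(y) = 1, T(y) = y
    assert np.isclose(direct, expfam)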

1.2 Maximum likelihood estimation¶

$$L(\mathbf{y}\mid\mathbf{X},\boldsymbol{\beta})=\prod^n_{i = 1} p_i^{y_i}(1-p_i)^{1-y_i}$$ where $y_i \in \{0, 1\}$ is the outcome of observation $i$ and $p_i = \Pr(y_i = 1 \mid \boldsymbol{x}_i)$.
The corresponding log-likelihood is $$\begin{aligned}\ell(\mathbf{y}\mid\mathbf{X},\boldsymbol{\beta}) &= \log L(\mathbf{y}\mid\mathbf{X},\boldsymbol{\beta}) \\ &= \sum^n_{i=1}\left[y_i\log p_i + (1-y_i)\log(1-p_i)\right] \\ &= \sum^n_{i=1}\left[y_i\log\frac{p_i}{1-p_i} + \log(1-p_i)\right] \\ &= \sum^n_{i=1}\left[y_i \boldsymbol{x}_i^T\boldsymbol{\beta} + \log\frac{1}{1+e^{\boldsymbol{x}_i^T\boldsymbol{\beta}}}\right] \\ &= \sum^n_{i=1}\left[y_i \boldsymbol{x}_i^T\boldsymbol{\beta} - \log(1+e^{\boldsymbol{x}_i^T\boldsymbol{\beta}})\right] \\ &= \sum^n_{i=1} y_i \boldsymbol{x}_i^T\boldsymbol{\beta} - \sum^n_{i=1} \log(1+e^{\boldsymbol{x}_i^T\boldsymbol{\beta}}) \\ \end{aligned}$$
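
The simplification in the last two lines is easy to verify numerically; a small sketch on simulated data (illustrative, not part of the derivation):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta = rng.normal(size=3)
eta = X @ beta
p = 1 / (1 + np.exp(-eta))
y = rng.binomial(1, p)

ll_definition = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
ll_simplified = np.sum(y * eta) - np.sum(np.log(1 + np.exp(eta)))
assert np.isclose(ll_definition, ll_simplified)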

1.2.1. Gradient¶

Taking the first-order derivative of $\ell(\mathbf{y}\mid\mathbf{X},\boldsymbol{\beta})$ with respect to $\boldsymbol{\beta}$, we have $$\begin{aligned} \nabla \ell(\boldsymbol{\beta}) &= \frac{\partial \ell(\mathbf{y}\mid\mathbf{X},\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} \\ &= \sum^n_{i = 1} y_i \boldsymbol{x}_i - \sum^n_{i = 1} \left[\frac{e^{\boldsymbol{x}_i^T\boldsymbol{\beta}}}{1+ e^{\boldsymbol{x}_i^T\boldsymbol{\beta}}}\,\boldsymbol{x}_i\right] \\ &= \sum^n_{i=1}(y_i - p_i)\boldsymbol{x}_i \\ &= \mathbf{X}^T(\mathbf{y} - \boldsymbol{\mu}) \\ \end{aligned}$$ where $\sigma(\cdot)$ is the sigmoid function and $$\boldsymbol{\mu} =\begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_n \end{bmatrix} = \begin{bmatrix} \sigma(\mathbf{x}_1^T \boldsymbol{\beta}) \\ \sigma(\mathbf{x}_2^T \boldsymbol{\beta}) \\ \vdots \\ \sigma(\mathbf{x}_n^T \boldsymbol{\beta}) \end{bmatrix}$$
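
This closed form is easy to check against automatic differentiation; a minimal PyTorch sketch on simulated data (names are illustrative):

import torch

torch.manual_seed(0)
X = torch.randn(100, 3)
y = torch.bernoulli(torch.full((100, 1), 0.5))
beta = torch.randn(3, 1, requires_grad=True)

# log-likelihood in the simplified form derived above
eta = X @ beta
ll = (y * eta - torch.log(1 + torch.exp(eta))).sum()
ll.backward()

analytic = (X.T @ (y - torch.sigmoid(eta))).detach()  # X^T (y - mu)
assert torch.allclose(beta.grad, analytic, atol=1e-5)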

1.2.2. Hessian (second derivative matrix)¶

$$\begin{aligned}\mathbf{H} &= \frac{\partial^2 \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta} \, \partial \boldsymbol{\beta}^T} \\ &= \frac{\partial[\nabla \ell(\boldsymbol{\beta})]}{\partial \boldsymbol{\beta}^T} \\ &= \frac{\partial}{\partial \boldsymbol{\beta}^T}\left[- \sum^n_{i = 1} \frac{e^{\boldsymbol{x}_i^T\boldsymbol{\beta}}}{1+ e^{\boldsymbol{x}_i^T\boldsymbol{\beta}}}\,\boldsymbol{x}_i\right] \\ &= \frac{\partial}{\partial \boldsymbol{\beta}^T}\left[- \sum^n_{i = 1} (1+ e^{-\boldsymbol{x}_i^T\boldsymbol{\beta}})^{-1}\,\boldsymbol{x}_i\right] \\ &= - \sum^n_{i = 1} (-1)\,(1+ e^{-\boldsymbol{x}_i^T\boldsymbol{\beta}})^{-2}\, e^{-\boldsymbol{x}_i^T\boldsymbol{\beta}}\,(-1)\,\boldsymbol{x}_i \boldsymbol{x}_i^T \end{aligned}$$ Given $p_i = (1+ e^{-\boldsymbol{x}_i^T\boldsymbol{\beta}})^{-1}$ and $e^{-\boldsymbol{x}_i^T\boldsymbol{\beta}} = \frac{1-p_i}{p_i}$, then $$\begin{aligned}\mathbf{H} &= - \sum^n_{i = 1} p_i^2\,\frac{1-p_i}{p_i}\,\boldsymbol{x}_i \boldsymbol{x}_i^T \\ &= - \sum^n_{i = 1} p_i(1-p_i)\,\boldsymbol{x}_i \boldsymbol{x}_i^T \\ &= -\mathbf{X}^T\mathbf{W}\mathbf{X} \end{aligned}$$ where $\mathbf{W} = \operatorname{diag}[p_i(1-p_i)]$.
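
The same kind of check works for the Hessian, using torch.autograd.functional.hessian (again a sketch on simulated data, with illustrative names):

import torch
from torch.autograd.functional import hessian

torch.manual_seed(0)
X = torch.randn(100, 3)
y = torch.bernoulli(torch.full((100,), 0.5))

def log_lik(beta):
    eta = X @ beta
    return (y * eta - torch.log(1 + torch.exp(eta))).sum()

beta0 = torch.randn(3)
H_autograd = hessian(log_lik, beta0)              # autograd Hessian
mu = torch.sigmoid(X @ beta0)
H_closed = -X.T @ torch.diag(mu * (1 - mu)) @ X   # -X^T W X
assert torch.allclose(H_autograd, H_closed, atol=1e-4)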

1.3 Numerical Approximation¶

1.3.1 Gradient Ascent (on log-likelihood)¶

We aim to maximize the log-likelihood:

$$ \ell(\boldsymbol{\beta}) = \mathbf{y}^T \log(\boldsymbol{\mu}) + (\mathbf{1} - \mathbf{y})^T \log(1 - \boldsymbol{\mu}) $$

Or simplified:

$$ \ell(\boldsymbol{\beta}) = \mathbf{y}^T \mathbf{X} \boldsymbol{\beta} - \mathbf{1}^T \log(1 + e^{\mathbf{X} \boldsymbol{\beta}}) $$

Update rule:

$$ \boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + \alpha \cdot \mathbf{X}^T (\mathbf{y} - \boldsymbol{\mu}) $$

Where:

  • $\alpha$: learning rate
  • $\boldsymbol{\mu} = \sigma(\mathbf{X} \boldsymbol{\beta}^{(t)})$

This is a first-order method.

1.3.2. Newton-Raphson (Second-order method)¶

Use both the gradient and the Hessian to update $\boldsymbol{\beta}$:

Gradient:

$$ \nabla \ell(\boldsymbol{\beta}) = \mathbf{X}^T (\mathbf{y} - \boldsymbol{\mu}) $$

Hessian:

$$ \mathbf{H} = -\mathbf{X}^T \mathbf{W} \mathbf{X}, \quad \text{where } \mathbf{W} = \operatorname{diag}(\mu_i(1 - \mu_i)) $$

Update rule:

$$ \boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} - \mathbf{H}^{-1} \nabla \ell(\boldsymbol{\beta}^{(t)}) $$

Or equivalently:

$$ \boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T (\mathbf{y} - \boldsymbol{\mu}) $$

This method converges faster than gradient ascent but requires inverting a matrix, which can be expensive.
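
In practice one avoids forming $\mathbf{H}^{-1}$ explicitly and instead solves the linear system, which is cheaper and numerically more stable. A minimal sketch of a single Newton step with torch.linalg.solve (the helper name newton_step is ours):

import torch

def newton_step(X, y, beta):
    # One Newton-Raphson step without an explicit matrix inverse
    mu = torch.sigmoid(X @ beta)
    w = (mu * (1 - mu)).flatten()
    XtWX = X.T @ (w.view(-1, 1) * X)   # X^T W X without materializing diag(W)
    grad = X.T @ (y - mu)              # X^T (y - mu)
    return beta + torch.linalg.solve(XtWX, grad)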

1.3.3. Iteratively Reweighted Least Squares (IRLS)¶

IRLS is a special case of Newton-Raphson, reformulated as a weighted least squares problem:

Define:

  • Pseudo-response (working response):
    $$ \mathbf{z} = \boldsymbol{\eta} + \mathbf{W}^{-1}(\mathbf{y} - \boldsymbol{\mu}), \qquad z_i = \eta_i + \frac{y_i - \mu_i}{\mu_i (1 - \mu_i)} $$

  • Weight matrix:
    $$ \mathbf{W} = \operatorname{diag}(\mu_i (1 - \mu_i)) $$

Solve at each step:

$$ \boldsymbol{\beta}^{(t+1)} = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \mathbf{z} $$

This looks like a weighted linear regression, where the weights $\mathbf{W}$ and pseudo-response $\mathbf{z}$ are updated at each iteration.
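
To see that this is exactly the Newton-Raphson update, substitute the definition of $\mathbf{z}$:

$$ \begin{aligned} (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \mathbf{z} &= (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \left[ \mathbf{X} \boldsymbol{\beta}^{(t)} + \mathbf{W}^{-1} (\mathbf{y} - \boldsymbol{\mu}) \right] \\ &= \boldsymbol{\beta}^{(t)} + (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T (\mathbf{y} - \boldsymbol{\mu}) \end{aligned} $$

which recovers the update rule in 1.3.2.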

1.3.4 Summary Comparison¶

| Method | Uses | Update Formula | Notes |
| --- | --- | --- | --- |
| Gradient Ascent | Gradient only | $\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + \alpha \mathbf{X}^T (\mathbf{y} - \boldsymbol{\mu})$ | Simple, slow, needs tuning of $\alpha$ |
| Newton-Raphson | Gradient + Hessian | $\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T (\mathbf{y} - \boldsymbol{\mu})$ | Faster convergence, more computation |
| IRLS | Weighted least squares | $\boldsymbol{\beta}^{(t+1)} = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \mathbf{z}$ | Very popular for GLMs |

1.4 Implementation¶

1.4.1 Simulate dataset¶

In [34]:
import numpy as np
import torch
from torch import sigmoid  # torch.sigmoid; torch.nn.functional.sigmoid is deprecated
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import time

# Setup
device = torch.device("cpu")
np.random.seed(42)

# Parameters
n_samples = 500
n_features = 10

# Simulate features and labels
X_np = np.random.randn(n_samples, n_features)
true_beta = np.random.randn(n_features)
logits = X_np @ true_beta
probs = 1 / (1 + np.exp(-logits))
y_np = np.random.binomial(1, probs)

# Convert to PyTorch
X = torch.tensor(X_np, dtype=torch.float32, device=device)
y = torch.tensor(y_np, dtype=torch.float32, device=device).view(-1, 1)

1.4.2. Accuracy Helper Function (Code Cell)¶

In [35]:
def compute_accuracy(beta, X, y):
    with torch.no_grad():
        preds = (sigmoid(X @ beta) >= 0.5).float()
        return (preds == y).float().mean().item()

1.4.3. Gradient Ascent (Code Cell)¶

In [36]:
def gradient_ascent(X, y, lr=0.1, max_iter=100):
    # Initialize beta (weights) as a zero vector with shape (features, 1)
    # requires_grad=True allows autograd to track operations on beta
    beta = torch.zeros((X.shape[1], 1), requires_grad=True, device=device)

    # Iterate for a fixed number of steps
    for _ in range(max_iter):
        # Compute the predicted probabilities using the sigmoid function
        mu = sigmoid(X @ beta)

        # Compute the negative log-likelihood loss
        # Add small epsilon (1e-8) to avoid log(0) instability
        loss = -torch.mean(y * torch.log(mu + 1e-8) + (1 - y) * torch.log(1 - mu + 1e-8))

        # Compute gradients via backpropagation
        loss.backward()

        # Update beta using gradient ascent
        with torch.no_grad():
            # beta.grad is the gradient of the NLL loss, so stepping
            # against it ascends the log-likelihood
            beta += lr * (-beta.grad)

            # Reset gradients to zero for the next iteration
            beta.grad.zero_()

    # Detach the tensor from the computation graph and return it
    return beta.detach()

1.4.4. Newton-Raphson (Code Cell)¶

In [37]:
def newton_raphson(X, y, max_iter=10):
    # Initialize beta (weights) as zeros; shape = (features, 1)
    beta = torch.zeros((X.shape[1], 1), device=device)

    # Loop for a fixed number of iterations
    for _ in range(max_iter):
        # Compute the predicted probabilities using the sigmoid function
        mu = sigmoid(X @ beta)  # shape: (n_samples, 1)

        # Compute the diagonal weight matrix W = diag(mu * (1 - mu))
        W = torch.diag((mu * (1 - mu)).flatten())  # shape: (n_samples, n_samples)

        # Compute the working response z (pseudo-response in IRLS/NR)
        # z = η + W⁻¹ (y - μ), where η = X @ β
        z = X @ beta + torch.inverse(W + 1e-4 * torch.eye(W.shape[0])) @ (y - mu)

        # Update rule for Newton-Raphson:
        # β = (Xᵀ W X)⁻¹ Xᵀ W z
        # Add a small ridge term (1e-4 * I) for numerical stability
        beta = torch.inverse(X.T @ W @ X + 1e-4 * torch.eye(X.shape[1])) @ X.T @ W @ z

    # Return final estimate of beta
    return beta

1.4.5. IRLS (Code Cell)¶

In [38]:
def irls(X, y, max_iter=10):
    # Initialize beta (weights) as zeros; shape = (features, 1)
    beta = torch.zeros((X.shape[1], 1), device=device)

    # Iteratively update beta using weighted least squares
    for _ in range(max_iter):
        # Compute the linear predictor η = X @ beta
        eta = X @ beta

        # Compute predicted probabilities using sigmoid(η)
        mu = sigmoid(eta)  # shape: (n_samples, 1)

        # Compute weights: W_diag = μ(1 - μ) (flattened for diag)
        W_diag = (mu * (1 - mu)).flatten()  # shape: (n_samples,)

        # Construct the diagonal weight matrix W
        W = torch.diag(W_diag)  # shape: (n_samples, n_samples)

        # Compute the working response z:
        # z = η + (y - μ) / [μ(1 - μ)]
        z = eta + (y - mu) / (W_diag.view(-1, 1) + 1e-8)  # add epsilon for numerical stability

        # Weighted least squares update:
        # β = (Xᵀ W X)⁻¹ Xᵀ W z
        beta = torch.inverse(X.T @ W @ X + 1e-4 * torch.eye(X.shape[1])) @ X.T @ W @ z

    # Return the final estimate of beta
    return beta

1.4.6. Compare All Methods (Code Cell)¶

In [39]:
results = {}

start = time.time()
beta_gd = gradient_ascent(X, y)
results['Gradient Ascent'] = (time.time() - start, compute_accuracy(beta_gd, X, y))

start = time.time()
beta_nr = newton_raphson(X, y)
results['Newton-Raphson'] = (time.time() - start, compute_accuracy(beta_nr, X, y))

start = time.time()
beta_irls = irls(X, y)
results['IRLS'] = (time.time() - start, compute_accuracy(beta_irls, X, y))

# Scikit-learn (note: default L2 penalty with C=1.0, so not exactly the unpenalized MLE)
sk_model = LogisticRegression(fit_intercept=False, max_iter=1000)
start = time.time()
sk_model.fit(X_np, y_np)
sk_time = time.time() - start
sk_acc = accuracy_score(y_np, sk_model.predict(X_np))
results['scikit-learn'] = (sk_time, sk_acc)

# Print comparison
import pandas as pd
pd.DataFrame(results, index=['Time (s)', 'Accuracy']).T
Out[39]:
                 Time (s)  Accuracy
Gradient Ascent  0.012920     0.834
Newton-Raphson   0.016865     0.836
IRLS             0.001456     0.836
scikit-learn     0.001023     0.836

1.4.7. Plot Convergence of All Methods (Optional Code Cell)¶

In [40]:
import matplotlib.pyplot as plt
import numpy as np
import torch
from torch import sigmoid  # torch.sigmoid; torch.nn.functional.sigmoid is deprecated
from sklearn.linear_model import LogisticRegression

# --------- Gradient Ascent with loss tracking ---------
def gradient_ascent_trace(X, y, lr=0.1, max_iter=100):
    # Initialize beta with zeros and enable gradient tracking
    beta = torch.zeros((X.shape[1], 1), requires_grad=True, device=device)
    loss_trace = []  # Store loss values for plotting

    for _ in range(max_iter):
        # Predicted probabilities using sigmoid function
        mu = sigmoid(X @ beta)

        # Compute negative log-likelihood (add small epsilon for numerical stability)
        loss = -torch.mean(y * torch.log(mu + 1e-8) + (1 - y) * torch.log(1 - mu + 1e-8))
        loss_trace.append(loss.item())  # Store current loss

        # Compute gradient
        loss.backward()

        # Gradient ascent step
        with torch.no_grad():
            beta += lr * (-beta.grad)  # step against the NLL gradient (ascends the log-likelihood)
            beta.grad.zero_()          # Reset gradients for next iteration

    return beta.detach(), loss_trace  # Return final beta and loss history

# --------- Newton-Raphson with loss tracking ---------
def newton_raphson_trace(X, y, max_iter=10):
    beta = torch.zeros((X.shape[1], 1), device=device)
    loss_trace = []

    for _ in range(max_iter):
        mu = sigmoid(X @ beta)  # Predicted probabilities
        loss = -torch.mean(y * torch.log(mu + 1e-8) + (1 - y) * torch.log(1 - mu + 1e-8))
        loss_trace.append(loss.item())

        # Construct weight matrix W = diag(mu * (1 - mu))
        W = torch.diag((mu * (1 - mu)).flatten())

        # Compute pseudo-response z
        z = X @ beta + torch.inverse(W + 1e-4 * torch.eye(W.shape[0])) @ (y - mu)

        # Newton-Raphson update: beta = (Xᵀ W X)⁻¹ Xᵀ W z
        beta = torch.inverse(X.T @ W @ X + 1e-4 * torch.eye(X.shape[1])) @ X.T @ W @ z

    return beta, loss_trace

# --------- IRLS with loss tracking ---------
def irls_trace(X, y, max_iter=10):
    beta = torch.zeros((X.shape[1], 1), device=device)
    loss_trace = []

    for _ in range(max_iter):
        eta = X @ beta          # Linear predictor
        mu = sigmoid(eta)       # Predicted probabilities
        loss = -torch.mean(y * torch.log(mu + 1e-8) + (1 - y) * torch.log(1 - mu + 1e-8))
        loss_trace.append(loss.item())

        W_diag = (mu * (1 - mu)).flatten()  # Diagonal of weight matrix
        W = torch.diag(W_diag)              # Weight matrix W

        # Compute pseudo-response z
        z = eta + (y - mu) / (W_diag.view(-1, 1) + 1e-8)  # Add epsilon to prevent div-by-zero

        # IRLS update: beta = (Xᵀ W X)⁻¹ Xᵀ W z
        beta = torch.inverse(X.T @ W @ X + 1e-4 * torch.eye(X.shape[1])) @ X.T @ W @ z

    return beta, loss_trace

# --------- scikit-learn model as benchmark ---------
# Fit logistic regression using sklearn for reference
sk_model = LogisticRegression(fit_intercept=False, max_iter=1000)
sk_model.fit(X_np, y_np)

# Compute loss using sklearn's fitted coefficients
logits_sk = X_np @ sk_model.coef_.flatten()
mu_sk = 1 / (1 + np.exp(-logits_sk))
loss_sklearn = -np.mean(y_np * np.log(mu_sk + 1e-8) + (1 - y_np) * np.log(1 - mu_sk + 1e-8))

# --------- Run all methods and collect loss traces ---------
_, loss_gd = gradient_ascent_trace(X, y, lr=0.1, max_iter=100)
_, loss_nr = newton_raphson_trace(X, y, max_iter=10)
_, loss_irls = irls_trace(X, y, max_iter=10)

# The sklearn solution is drawn as a flat reference line via axhline below

# --------- Plot convergence of all methods ---------
plt.figure(figsize=(10, 6))

# Gradient Ascent: smooth line
plt.plot(loss_gd, label="Gradient Ascent", color="blue", linewidth=2)

# Newton-Raphson: dots with orange line
plt.plot(np.arange(len(loss_nr)), loss_nr, 'o-', label="Newton-Raphson", color="orange", markersize=6)

# IRLS: green dashed line with squares
plt.plot(np.arange(len(loss_irls)), loss_irls, 's--', label="IRLS", color="green", markersize=6)

# scikit-learn: black dashed horizontal line
plt.axhline(loss_sklearn, color="black", linestyle="--", linewidth=2, label="scikit-learn")

# Decorate the plot
plt.xlabel("Iteration")
plt.ylabel("Negative Log-Likelihood")
plt.title("Convergence of Logistic Regression Methods")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
[Figure: negative log-likelihood vs. iteration for Gradient Ascent, Newton-Raphson, and IRLS, with the scikit-learn solution as a horizontal reference line.]