My talk is based on a SciPy 2022 talk¶
- Title: Counting on Poisson Regression with scikit-learn
- Speaker: Thomas J. Fan
- The original talk is on GitHub: thomasjpfan/scipy-2022-poisson
- Video
- Slides
Contents¶
- Short summary of the SciPy 2022 talk
- Generalized Linear Models
- Poisson Regression, Poisson loss and Deviance
- Scikit-learn tools for Poisson Regression
- Case Study: "London Bike Sharing" data set
- Data preprocessing
- Evaluation
- PoissonRegressor
- RandomForestRegressor
- HistGradientBoostingRegressor
- ZeroInflatedRegressor (scikit-lego)
- Short review: Poisson Regression examples from scikit-learn and Murphy's Probabilistic ML
Generalized Linear Models¶
Notation¶
- $\mathbf x \in \mathbf R^d$: independent (explanatory) variables
- $t \in \mathbf R$: dependent variable or target (a random variable)
- By convention, $\mathbf x$ is treated as a $(d+1)$-dimensional vector: the $d$ independent variables with a constant $1$ prepended, so that $w_0$ acts as the bias term.
- $\mathbf w\in \mathbf R^{d+1}$: weight vector with $ \mathbf w=(w_0,..., w_d)^T$
- $y(\mathbf x, \mathbf w)$: output of model or predicted value for $\mathbf x$
Generalized Linear Models¶
- In the linear regression model, the conditional distribution $p(t|\mathbf x)$ is assumed to be Normal, and $\mathbb E[t|\mathbf x]$ is modeled by $\mathbf w ^T \mathbf x$.
- GLM generalizes linear regression by allowing
- the linear model to be related to the target variable via a link function
- the magnitude of the variance of each measurement to be a function of its predicted value
- In a GLM, the target (random) variable $t$ with conditional distribution $p(t|\mathbf x)$ is assumed to be generated from a particular distribution in an exponential family, with
$$ \mathbb E[t|\mathbf x] = \mu = g^{-1}(\mathbf w ^T \mathbf x) $$
- where $g$ is the link function and $g^{-1}$ is called the mean function.
Remark of GLM¶
- Not to be confused with the linear basis function model, i.e. $\mathbb E[t|\mathbf x] = \mathbf w ^T \Phi(\mathbf x)$ for some vector of basis functions $\Phi(\cdot)$.
- The model output $y(\mathbf x, \mathbf w) = g^{-1}(\mathbf w ^T \mathbf x)$ is, in general, not linear w.r.t. $\mathbf w$.
- Generally, a GLM does not have a closed-form solution.
- Exponential family(e.g. Normal, Binomial, Poisson, Gamma)
$$ p(\mathbf x | \boldsymbol \eta) = v(\mathbf x)h(\boldsymbol \eta) \exp \{ \boldsymbol \eta ^T \mathbf u(\mathbf x) \}, \quad \boldsymbol \eta: \text{natural parameter} $$
- where $\mathbf u(\mathbf x)$ is a (vector) function of $\mathbf x$.
GLM components¶
The GLM consists of three elements:
- Particular probability distribution for modeling $p(t|\mathbf x)$
- Linear predictor $\mathbf w ^T \mathbf x$
- Link function $g$ s.t.
$$ \mathbb E[t|\mathbf x] = \mu = g^{-1}(\mathbf w ^T \mathbf x) $$
Linear Regression¶
- Target $t \sim \mathcal N(t|\mu, \sigma^2)$, $-\infty < t < \infty$
$$ p(t|\mathbf x, \mathbf w, \sigma) = \mathcal N(t|\mu, \sigma^2), \quad \sigma >0 $$
- Estimate $\mu$ by the linear predictor $\mathbf w^T \mathbf x$.
$$\begin{align*} p(t_n|\mathbf x_n, \mathbf w, \sigma) & = \mathcal N(t_n|\mathbf w^T \mathbf x_n, \sigma^2)\\ &=\frac{1}{\sqrt{2\pi \sigma^2}} \exp \left( -\frac{1}{2\sigma^2}(t_n - \mathbf w^T \mathbf x_n)^2 \right) \end{align*}$$
- Log-likelihood
$$ \log p(t_n|\mathbf x_n, \mathbf w, \sigma) = -\frac{1}{2\sigma^2}(t_n - \mathbf w^T \mathbf x_n)^2 - \frac{1}{2}\log (2\pi \sigma^2) $$
Logistic Regression¶
- Target $t \sim \text{Ber}(t|\mu)$, $t = 0$ or $1$
$$ \text{Ber}(t|\mu) = \mu ^t (1-\mu)^{1-t}, \quad 0<\mu<1 $$
- Estimate $\mu$ as $\sigma(\mathbf w^T \mathbf x)$ where $\sigma(\cdot )$ is a sigmoid function.
$$ p(t_n |\mathbf x_n, \mathbf w) = \text{Ber}(t_n|\sigma(\mathbf w^T \mathbf x_n)) = \sigma(\mathbf w^T \mathbf x_n)^{t_n} (1- \sigma(\mathbf w^T \mathbf x_n))^{1-t_n} $$
- Log-likelihood
$$ \log p(t_n |\mathbf x_n, \mathbf w) = t_n \log \sigma(\mathbf w^T \mathbf x_n) + (1-t_n)\log(1- \sigma(\mathbf w^T \mathbf x_n)) $$
Poisson Regression!¶
Agriculture¶
Risk modeling¶
Modeling Count data or Frequency distribution¶
- Differential expression analysis for RNAseq using Poisson mixed models
- Nucleic Acids Res. 2017 Jun 20; 45(11): e106.
- A Two-Stage Poisson Model for Testing RNA-Seq Data
- De Gruyter. May 16, 2011
- Poisson Regression Analysis of the Mortality among a Cohort of World War II Nuclear Industry Workers
- Radiation Research Society. Vol. 123, No. 2 (Aug., 1990). 138-152
- Negative-Binomial and quasi-poisson regressions between COVID-19, mobility and environment in São Paulo, Brazil
- Elsevier Environmental Research. 2022 Mar; 204: 112369.
Poisson Regression¶
- Regression technique modeling count or frequency data
- The target variable $t$ follows a Poisson distribution and $\log \mathbb E[t|\mathbf x]$ can be modeled by a linear combination of unknown parameter $\mathbf w$, i.e. $\log \mathbb E[t|\mathbf x, \mathbf w] = \mathbf w^T \mathbf x$ or $\mathbb E[t|\mathbf x, \mathbf w] = \exp(\mathbf w^T \mathbf x)$.
- $t \sim \text{Poi}(t|\mu)$, $t \in \{0,1,...\}$
$$ \text{Poi}(t|\mu) = e^{-\mu}\frac{\mu^t}{t!},\, \mu>0 $$
- Estimate $\mu$ as $\exp(\mathbf w^T \mathbf x)$.
$$ p(t_n |\mathbf x_n, \mathbf w) = \text{Poi}(t_n|\underbrace{\exp(\mathbf w^T \mathbf x_n)}_{=:y_n})=e^{-y_n}\frac{y_n^{t_n}}{t_n!} $$
Poisson distribution¶
- Gives the probability of a given number of events occurring in a fixed interval of time, assuming these events occur with a known constant mean rate and independently of the time since the last event.
- Parametrized by a positive real number $\mu$ (the expected value of $t$)
$$ p(t|\mu) = \text{Poi}(t|\mu)=e^{-\mu} \frac{\mu ^t}{t!}, \quad t=0,1,2,... $$
- $\mathbb E[t|\mu] = \text{Var} [t|\mu ] = \mu$
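As a quick numerical check of $\mathbb E[t|\mu] = \text{Var}[t|\mu] = \mu$, here is a minimal sketch using scipy.stats (the value of $\mu$ is arbitrary):
from scipy.stats import poisson

mu = 3.5
dist = poisson(mu)
print(dist.mean(), dist.var())  # both equal mu = 3.5
print(dist.pmf(2))              # P(t=2) = exp(-3.5) * 3.5**2 / 2!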
Poisson Regression¶
- Observation $\mathcal D = \{ (\mathbf x_n, t_n)_{n=1,2,...,N}\}$
- By the method of maximum likelihood, we find the weight vector $\mathbf w$ that makes the probability of the observed data as large as possible. The log-likelihood is
$$ \sum_{n=1}^N [ t_n \log(y_n) - y_n - \log(t_n !)] = \sum_{n=1}^N [ t_n \mathbf w^T \mathbf x_n - e^{\mathbf w^T \mathbf x_n} - \log(t_n !)] $$
- Since $\log(t_n!)$ does not depend on $\mathbf w$, this is equivalent to maximizing the log-likelihood
$$ L(\mathbf w):= \sum_{n=1}^N [ t_n \log(y_n) - y_n] =\sum_{n=1}^N [ t_n \mathbf w^T \mathbf x_n - e^{\mathbf w^T \mathbf x_n} ] $$
- $-L(\mathbf w)$ is a convex function. But the equation $ \nabla_{\mathbf w}L(\mathbf w)=0$ has no closed-form solution.
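Since no closed form exists, $\mathbf w$ must be found numerically. A minimal sketch of maximizing $L(\mathbf w)$ with scipy.optimize.minimize on synthetic data (unregularized, for illustration only):
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N = 500
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 3))])  # prepend the constant 1
w_true = np.array([0.5, 0.3, -0.2, 0.1])
t = rng.poisson(np.exp(X @ w_true))

def neg_L(w):
    # -L(w) = -sum_n [ t_n w^T x_n - exp(w^T x_n) ]; convex in w
    eta = X @ w
    return np.exp(eta).sum() - t @ eta

res = minimize(neg_L, x0=np.zeros(4), method="L-BFGS-B")
print(res.x)  # close to w_true for large N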
How to quantify performance¶
- MSE, MAE
- Poisson deviance
$$ D(\mathbf t, \mathbf y):= \sum_n d(t_n, y_n) = \sum_n 2 \left [\log p(t_n | y_n^*) - \log p(t_n | y_n) \right ] $$
where $\mathbf t = (t_1,t_2,...,t_N)^T$, $\mathbf y = (y_1,y_2,...,y_N)^T$ and $y_n$ is the predicted value for $\mathbf x_n$ and $y_n^*$ is the optimal parameter estimated by fitting the model just to the true output $t_n$.
In the case of Poisson regression, we have $y_n = \exp(\mathbf w ^T \mathbf x_n)$ and $y_n^* = t_n$. Hence
$$\begin{align*} D(\mathbf t, \mathbf y) &= 2\sum_n \left [ (t_n \log(t_n)-t_n-\log(t_n!)) - (t_n \log(y_n)-y_n-\log(t_n!)) \right ]\\ & = 2\sum_n \left [ t_n \log \frac{t_n}{y_n} + y_n - t_n \right] \end{align*}$$
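The last expression is what sklearn.metrics.mean_poisson_deviance computes, averaged over samples; a small check with made-up values:
import numpy as np
from sklearn.metrics import mean_poisson_deviance

t = np.array([3.0, 0.0, 7.0, 2.0])  # observed counts
y = np.array([2.5, 0.4, 6.0, 2.0])  # predicted means, must be > 0

# unit deviance d(t, y) = 2*[t*log(t/y) - t + y], with t*log(t/y) := 0 at t = 0
with np.errstate(divide="ignore", invalid="ignore"):
    term = np.where(t > 0, t * np.log(t / y), 0.0)
d = 2 * (term - t + y)
print(d.mean(), mean_poisson_deviance(t, y))  # identical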
Remark of Poisson deviance¶
- $\mathbf t = \mathbf y$ $\Leftrightarrow$ $D(\mathbf t, \mathbf y) = 0$
- Maximize log-likelihood $\Leftrightarrow$ Minimize Poisson deviance
Deviance¶
- Metric that elicits predicted expectation values of regression targets
Problem | Target Distribution | Target domain | Unit deviance $d(t,y)$ |
---|---|---|---|
Regression | Normal | $t\in (-\infty, \infty)$ | $(t-y)^2$ |
Binary classification | Bernoulli | $t\in\{0,1\}$ | $2 \left [ t\log\frac{t}{y}+(1-t)\log\frac{1-t}{1-y} \right ]$ |
Regression with count data | Poisson | $t\in[0, \infty)$ | $2\left [ t\log\frac{t}{y}-t+y \right ]$ |
Poisson Regression with Scikit-learn¶
PoissonRegressor¶
from sklearn.linear_model import PoissonRegressor
reg = PoissonRegressor()
RandomForestRegressor¶
from sklearn.ensemble import RandomForestRegressor
reg = RandomForestRegressor(criterion="poisson")
HistGradientBoostingRegressor¶
from sklearn.ensemble import HistGradientBoostingRegressor
reg = HistGradientBoostingRegressor(loss="poisson")
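All three estimators share the usual fit/predict API; a minimal sketch on synthetic count data (sizes and coefficients are arbitrary):
import numpy as np
from sklearn.linear_model import PoissonRegressor
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 4))
t = rng.poisson(np.exp(1.0 + X @ np.array([0.5, -0.3, 0.2, 0.1])))

for reg in (PoissonRegressor(),
            RandomForestRegressor(criterion="poisson", random_state=0),
            HistGradientBoostingRegressor(loss="poisson", random_state=0)):
    reg.fit(X, t)
    print(type(reg).__name__, reg.predict(X[:3]))  # predictions are nonnegative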
Minimization problem¶
scipy.optimize.minimize is used with L-BFGS-B.
$$ \text{min}_{\mathbf w} \frac{1}{2N}\sum_{n=1}^N d(t_n,y_n)+\frac{\alpha}{2} \| \mathbf w \|_2^2 $$
- where $\alpha$ is the $L2$ regularization penalty and $d(t,y)=2\left [ t\log\frac{t}{y}-t+y \right ]$ is the Poisson deviance.
- Cholesky based Newton solver: PR #23314
- Newton-LSMR: PR #23507 (New in version 1.2.)
Regularization by Default!¶
PoissonRegressor(alpha=1.0)
RandomForestRegressor¶
class sklearn.ensemble.RandomForestRegressor(n_estimators=100, *, criterion='squared_error', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=1.0, max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, ccp_alpha=0.0, max_samples=None)
- criterion : {“squared_error”, “absolute_error”, “friedman_mse”, “poisson”}, default=”squared_error”
- criterion="poisson" (new in version 1.0)
Mathematical formulation of Random forest¶
- $\mathcal D = \{ (\mathbf x_n, t_n) \}_{n=1}^{N}$, $\mathbf x_n \in \mathbf R^d$, $t_n \in \mathbf R$
- A decision tree recursively partitions the input (feature) space so that samples with the same labels or similar target values are grouped together.
- $Q_m$: the data at the node $m$ with $n_m$ samples
- a candidate split $\theta=(j, s_m)$ consists of a feature index $j \in \{1,...,d\}$ and a threshold $s_m \in \mathbf R$
- $\theta$ splits $Q_m$ into $Q_m^{left}(\theta)$ and $Q_m^{right}(\theta)$ as
$$ Q_m^{left}(\theta):=\{ (\mathbf x, t) | x_j \leq s_m\},\, \text{where } \mathbf x = (x_1,x_2,...,x_d)^T $$
$$ Q_m^{right}(\theta):= Q_m \setminus Q_m^{left}(\theta) $$
$$ G(Q_m, \theta):= \frac{n_m^{left}}{n_m}H(Q_m^{left}(\theta)) + \frac{n_m^{right}}{n_m}H(Q_m^{right}(\theta)) $$
- where $H$ is called an impurity function.
- Select parameters $\theta$ that minimize impurity $G$
$$ \theta ^* = \text{argmin}_{\theta} G(Q_m, \theta) $$
Regression criteria of RandomForestRegressor¶
criterion="squared_error"
$$ \bar{t}_m:= \frac{1}{n_m} \sum_{t\in Q_m} t $$
$$ H(Q_m):= \frac{1}{n_m}\sum_{t\in Q_m} (t- \bar{t}_m)^2 $$
criterion="poisson"
$$ H(Q_m):=\frac{1}{n_m}\sum_{t\in Q_m}\left ( t \log\frac{t}{\bar{t}_m} - t + \bar{t}_m \right ) = \frac{1}{2n_m} \sum_{t\in Q_m} d(t, \bar{t}_m) $$
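A sketch of both impurities computed on one node's targets in plain NumPy (the values are arbitrary):
import numpy as np

t_node = np.array([0.0, 1.0, 1.0, 2.0, 4.0, 9.0])  # targets in node m
t_bar = t_node.mean()

h_mse = np.mean((t_node - t_bar) ** 2)  # criterion="squared_error"

# criterion="poisson": half the mean unit deviance around the node mean,
# with the convention t*log(t/t_bar) := 0 when t = 0
with np.errstate(divide="ignore", invalid="ignore"):
    term = np.where(t_node > 0, t_node * np.log(t_node / t_bar), 0.0)
h_poisson = np.mean(term - t_node + t_bar)

print(h_mse, h_poisson)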
HistGradientBoostingRegressor¶
class sklearn.ensemble.HistGradientBoostingRegressor(loss='squared_error', *, quantile=None, learning_rate=0.1, max_iter=100, max_leaf_nodes=31, max_depth=None, min_samples_leaf=20, l2_regularization=0.0, max_bins=255, categorical_features=None, monotonic_cst=None, interaction_cst=None, warm_start=False, early_stopping='auto', scoring='loss', validation_fraction=0.1, n_iter_no_change=10, tol=1e-07, verbose=0, random_state=None)
Histogram-based Gradient Boosting Regression Tree (new in version 0.21)
loss : {‘squared_error’, ‘absolute_error’, ‘gamma’, ‘poisson’, ‘quantile’}, default=’squared_error’
London Bike Sharing Dataset 🚲¶
cnt | t1 | hum | wind_speed | weather_code | is_holiday | is_weekend | season | hour |
---|---|---|---|---|---|---|---|---|
182 | 3 | 93 | 6 | 3 | 0 | 1 | 3 | 0 |
138 | 3 | 93 | 5 | 1 | 0 | 1 | 3 | 1 |
134 | 2.5 | 96.5 | 0 | 1 | 0 | 1 | 3 | 2 |
72 | 2 | 100 | 0 | 1 | 0 | 1 | 3 | 3 |
47 | 2 | 93 | 6.5 | 1 | 0 | 1 | 3 | 4 |
- Historical data for bike sharing in London, 'Powered by TfL Open Data' (from 1/1/2015 to 31/12/2016)
- About data
Metadata¶
- cnt - the count of new bike shares
- t1 - real temperature in C
- hum - humidity in percentage
- wind_speed - wind speed in km/h
- weather_code - category of the weather
- is_holiday - boolean field - 1 holiday / 0 non holiday
- is_weekend - boolean field - 1 if the day is weekend
- season - category field, meteorological seasons: 0-spring; 1-summer; 2-fall; 3-winter.
- weather_code category description: 1 = Clear; mostly clear but may have haze/fog/patches of fog/fog in vicinity; 2 = scattered clouds / few clouds; 3 = Broken clouds; 4 = Cloudy; 7 = Rain / light rain shower / light rain; 10 = rain with thunderstorm; 26 = snowfall; 94 = Freezing fog
Data preprocessing¶
SplineTransformer¶
from sklearn.preprocessing import SplineTransformer
SplineTransformer(n_knots=13, extrapolation="periodic")
- One way to add nonlinear terms, instead of pure polynomials of the features, is to generate spline basis functions for each feature with the SplineTransformer.
- Splines are piecewise polynomials, parametrized by their polynomial degree and the positions of the knots. The SplineTransformer implements a B-spline basis.
- Generate univariate B-spline bases for features.
- Generates a new feature matrix consisting of n_splines = n_knots + degree - 1 (n_knots - 1 for extrapolation="periodic") spline basis functions (B-splines) of polynomial order degree for each feature. Read more
Spline Transformer¶
Hour | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.17 | 0.67 | 0.17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0.02 | 0.48 | 0.48 | 0.02 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
$\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ |
22 | 0.67 | 0.17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.17 |
23 | 0.48 | 0.48 | 0.02 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.02 |
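The table above can be reproduced directly. With the default degree=3 and extrapolation="periodic", n_knots=13 yields n_knots - 1 = 12 basis functions:
import numpy as np
from sklearn.preprocessing import SplineTransformer

hours = np.arange(24).reshape(-1, 1)
spline = SplineTransformer(n_knots=13, extrapolation="periodic")
features = spline.fit_transform(hours)
print(features.shape)        # (24, 12): one row per hour, 12 B-spline bases
print(features[0].round(2))  # the row for hour 0, cf. the table above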
ColumnTransformer¶
- Applies transformers to columns of an array or pandas DataFrame.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

categorical = ["is_holiday", "weather_code", "is_weekend", "season"]
preprocessor = ColumnTransformer([
    ("cyclic_hour", SplineTransformer(n_knots=13, extrapolation="periodic"), ["hour"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical),
], remainder=MinMaxScaler())
preprocessor
ColumnTransformer(remainder=MinMaxScaler(), transformers=[('cyclic_hour', SplineTransformer(extrapolation='periodic', n_knots=13), ['hour']), ('categorical', OneHotEncoder(handle_unknown='ignore'), ['is_holiday', 'weather_code', 'is_weekend', 'season'])])
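A quick shape check of the fitted preprocessor, assuming X is the bike-sharing DataFrame with the columns listed in the metadata:
# X: bike-sharing DataFrame (hour, t1, hum, wind_speed, weather_code, ...)
Xt = preprocessor.fit_transform(X)
# 12 spline bases for "hour" + one-hot columns for the 4 categorical
# features + min-max scaled remainder (t1, hum, wind_speed)
print(Xt.shape)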
Evaluation: Cross validation¶
TimeSeriesSplit¶
from sklearn.model_selection import TimeSeriesSplit
cv = TimeSeriesSplit(n_splits=50, max_train_size=10_000)
- Time Series cross-validator
- Training set size: i * n_samples // (n_splits + 1) + n_samples % (n_splits + 1) in the i-th split
- Test set size: n_samples // (n_splits + 1) by default, where n_samples is the number of samples.
- Read more
TimeSeriesSplit¶
- $N$ can be uniquely decomposed as
$$ N = dq + r $$
- $d=$ n_splits + 1 (divisor)
- $q=N//d$ (quotient)
- $r=N\%d$ (remainder)
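A sketch making the decomposition concrete for a small $N$ (the numbers are illustrative):
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

N, n_splits = 17, 4  # d = 5, q = 17 // 5 = 3, r = 17 % 5 = 2
cv_demo = TimeSeriesSplit(n_splits=n_splits)
for i, (train, test) in enumerate(cv_demo.split(np.arange(N)), start=1):
    print(i, len(train), len(test))
# train sizes i*q + r = 5, 8, 11, 14; test size q = 3 in every split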
Cross validation¶
sklearn.model_selection.cross_validate(estimator, X, y=None, *, groups=None, scoring=None, cv=None, n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', return_train_score=False, return_estimator=False, return_indices=False, error_score=nan)
- Evaluate metric(s) by cross-validation and also record fit/score times.
- Read more
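A sketch of the evaluation loop, reusing the cv splitter from above and assuming X and y are the bike-sharing features and counts; the estimator here is the PoissonRegressor pipeline defined below, and both scorer strings are scikit-learn built-ins:
from sklearn.model_selection import cross_validate

scores = cross_validate(
    poisson, X, y, cv=cv,  # e.g. the pipeline from the PoissonRegressor slide
    scoring=["neg_root_mean_squared_error", "neg_mean_poisson_deviance"],
)
print(-scores["test_neg_root_mean_squared_error"].mean())  # RMSE
print(-scores["test_neg_mean_poisson_deviance"].mean())    # Poisson deviance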
Scikit-learn tools for Poisson Regression¶
Ridge¶
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
ridge = make_pipeline(preprocessor, Ridge())
$$ \text{min}_{\mathbf w} \frac{1}{2N}\sum_{n=1}^N d(t_n,y_n)+\frac{\alpha}{2} \| \mathbf w \|_2^2 $$
- where $d(t,y)=(t-y)^2$, $y_n = \mathbf w ^T \mathbf x_n$ (Ridge uses the identity link, not $\exp$) and $\alpha$ is the $L2$ regularization penalty.
- Test result
RMSE | Poisson Deviance | |
---|---|---|
Ridge | 594.25 | 272.48 |
PoissonRegressor¶
from sklearn.linear_model import PoissonRegressor
poisson = make_pipeline(preprocessor, PoissonRegressor(max_iter=300))
$$ \text{min}_{\mathbf w} \frac{1}{2N}\sum_{n=1}^N d(t_n,y_n)+\frac{\alpha}{2} \| \mathbf w \|_2^2 $$
- where $d(t,y)=2\left [ t\log\frac{t}{y}-t+y \right ]$, $y_n = \exp(\mathbf w ^T \mathbf x_n)$ and $\alpha$ is the $L2$ regularization penalty.
- Test result
RMSE | Poisson Deviance | |
---|---|---|
PoissonRegressor | 527.773 | 193.017 |
Ridge | 594.25 | 272.48 |
PoissonRegressor vs Ridge¶
- Results: errors
PoissonRegressor vs Ridge¶
- Distributions
Calibration¶
PoissonRegressor vs Ridge¶
- Calibration
RandomForestRegressor with Poisson¶
from sklearn.ensemble import RandomForestRegressor
rf_poisson = RandomForestRegressor(criterion="poisson", random_state=0, min_samples_leaf=40)
rf = RandomForestRegressor(random_state=0, min_samples_leaf=40)
- Test result
RMSE | Poisson Deviance | |
---|---|---|
RF (Poisson) | 306.235 | 67.788 |
RF | 401.012 | 114.321 |
PoissonRegressor | 527.773 | 193.017 |
Ridge | 594.25 | 272.48 |
Random Forest¶
- Results: errors
Random Forest¶
- Distributions
Random Forest¶
- Calibration
HistGradientBoosting with Poisson¶
from sklearn.ensemble import HistGradientBoostingRegressor
hist_poisson = HistGradientBoostingRegressor(random_state=0, loss="poisson", max_depth=4)
hist = HistGradientBoostingRegressor(random_state=0, max_depth=4)
- Test result
RMSE | Poisson Deviance | |
---|---|---|
HGRF (Poisson) | 292.367 | 60.666 |
HGRF | 309.892 | 74.369 |
RF (Poisson) | 306.235 | 67.788 |
RF | 401.012 | 114.321 |
PoissonRegressor | 527.773 | 193.017 |
Ridge | 594.25 | 272.48 |
Histogram-based Gradient Boosting Regression¶
- Results: errors
Histogram-based Gradient Boosting Regression¶
- Distributions
Histogram-based Gradient Boosting Regression¶
- Calibration
Prediction for a test set¶
- The last TimeSeriesSplit fold
- # of train data: 10_000
- # of test data: 341
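A sketch of extracting that final fold, assuming X and y are the pandas features/target and cv is the TimeSeriesSplit from above:
train_idx, test_idx = list(cv.split(X))[-1]  # the last of the 50 splits
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
print(len(train_idx), len(test_idx))  # 10_000 (max_train_size) and 341 here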
Zero-Inflated Poisson Regression¶
ZeroInflatedRegressor¶
from sklego.meta import ZeroInflatedRegressor
from sklearn.ensemble import HistGradientBoostingClassifier

poisson_zero = ZeroInflatedRegressor(
    classifier=HistGradientBoostingClassifier(random_state=0),
    regressor=HistGradientBoostingRegressor(loss="poisson"))
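- How it works: the classifier first predicts whether a sample's target is zero; the regressor is trained only on the samples with non-zero targets, and its output is used only where the classifier predicts a non-zero count. This suits targets with many exact zeros, such as hourly bike counts at night.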
- Test result
RMSE | Poisson Deviance | |
---|---|---|
ZHGRF (Poisson) | 283.730 | 54.449 |
HGRF (Poisson) | 292.367 | 60.666 |
HGRF | 309.892 | 74.369 |
RF (Poisson) | 306.235 | 67.788 |
RF | 401.012 | 114.321 |
PoissonRegressor | 527.773 | 193.017 |
Ridge | 594.25 | 272.48 |
ZeroInflatedRegressor¶
- Result: errors
ZeroInflatedRegressor¶
- Distributions