(Banaue rice terraces. Photo: Jon Rawlinson)

In a previous
+article, we showed the use of partial pooling, or
+hierarchical/multilevel models, for level coding high-cardinality
+categorical variables in vtreat. In
+this article, we will discuss a little more about the how and why of
+partial pooling in R.
+
We will use the lme4 package to fit the hierarchical
+models. The acronym “lme” stands for “linear mixed-effects” models:
+models that combine so-called “fixed effects” and “random effects” in a
+single (generalized) linear model. The lme4 documentation
+uses the random/fixed effects terminology, but we are going to follow
+Gelman and Hill, and avoid the use of the terms “fixed” and “random”
+effects.
+
+The varying coefficients [corresponding to the levels of a
+categorical variable] in a multilevel model are sometimes called
+random effects, a term that refers to the randomness in the
+probability model for the group-level coefficients….
+
+
+The term fixed effects is used in contrast to random effects
+– but not in a consistent way! … Because of the conflicting definitions
+and advice, we will avoid the terms “fixed” and “random” entirely, and
+focus on the description of the model itself…
+
+
– Gelman and Hill 2007, Chapter 11.4
+
We will also restrict ourselves to the case that vtreat
+considers: partially pooled estimates of conditional group expectations,
+with no other predictors considered.
+
+
The Data
+
Let’s assume that the data is generated from a mixture of \(M\) populations; each population is
normally distributed with its own (unknown) mean \(\mu_{gp}\), and all populations share the same
(unknown) standard deviation \(\sigma_w\):
+
\[
y_{gp} \sim N(\mu_{gp}, {\sigma_{w}}^2)
\]
+
The population means themselves are normally distributed, with
+unknown mean \(\mu_0\) and unknown
+standard deviation \(\sigma_b\):
+
\[
\mu_{gp} \sim N(\mu_0, {\sigma_{b}}^2)
\]
+
(The subscripts w and b stand for “within-group”
+and “between-group” standard deviations, respectively.)
+
We can generate a synthetic data set according to these assumptions,
+with distributions similar to the distributions observed in the radon
+data set that we used in our earlier post: 85 groups, sampled unevenly.
+We’ll use \(\mu_0 = 0, \sigma_w = 0.7,
\sigma_b = 0.5\).
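
For reference, here is a sketch of one way such a data set could be generated. The article’s actual generation code is not shown, so the random seed, the group-size scheme, and the helper names (mu_gp, sizes, true_means) below are assumptions, not the original recipe.

``` r
set.seed(2362)                        # arbitrary seed, for reproducibility of the sketch only

ngroup       <- 85
mu0          <- 0
sigma_b_true <- 0.5                   # between-group standard deviation
sigma_w_true <- 0.7                   # within-group standard deviation

# true (unobserved) group means
mu_gp <- rnorm(ngroup, mean = mu0, sd = sigma_b_true)

# skewed group sizes: a few heavily sampled groups, many small ones
sizes <- pmax(1, rpois(ngroup, lambda = exp(rnorm(ngroup, mean = 2, sd = 1))))

# observations: each y is drawn around its group's true mean
df <- data.frame(
  gp = rep(paste0("gp", seq_len(ngroup)), times = sizes),
  y  = rnorm(sum(sizes), mean = rep(mu_gp, times = sizes), sd = sigma_w_true),
  stringsAsFactors = FALSE
)

# lookup table of true group means, used for comparisons later
true_means <- data.frame(gp = paste0("gp", seq_len(ngroup)),
                         mu = mu_gp,
                         stringsAsFactors = FALSE)
```

Here, we take a peek at our data, df.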
+
head(df)
+
##     gp          y
## 1 gp75  1.1622536
## 2 gp26 -1.0026492
## 3 gp26 -0.4317629
## 4 gp43  0.3547021
## 5 gp19 -0.5028478
## 6 gp41  0.1239806
+
(Figure: number of observations per group, showing the uneven sampling.)
+
As the graph shows, some groups were heavily sampled, but most groups
have only a handful of samples in the data set. Since this is synthetic
data, we know the true population means (shown in red in the graph
below), and we can compare them to the observed means \(\bar{y}_i\) of each group \(i\)
(shown in black, with standard errors; the gray points are the
individual observations). We’ve sorted the groups by the number of
observations.
+
(Figure: observed group means with standard errors (black) and true population means (red), with individual observations in gray; groups are sorted by number of observations.)
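
Below is a sketch of how such a plot could be drawn with ggplot2, assuming the true_means lookup (and the generated df) from the sketch above; the column names are illustrative.

``` r
library(dplyr)
library(ggplot2)

# per-group summaries: count, observed mean, and standard error
gp_summary <- df %>%
  group_by(gp) %>%
  summarize(nobs = n(),
            ybar = mean(y),
            se   = sd(y) / sqrt(n())) %>%
  mutate(se = ifelse(is.na(se), 0, se)) %>%     # groups with a single observation
  left_join(true_means, by = "gp") %>%
  arrange(nobs) %>%
  mutate(gp = factor(gp, levels = gp))          # order the x-axis by sample size

ggplot(gp_summary, aes(x = gp)) +
  geom_point(aes(y = mu), color = "red") +                              # true group means
  geom_pointrange(aes(y = ybar, ymin = ybar - se, ymax = ybar + se)) +  # observed means +/- standard error
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
```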
+
For groups with many observations, the observed group mean is near
+the true mean. For groups with few observations, the estimates are
+uncertain, and the observed group mean can be far from the true
+population mean.
+
Can we get better estimates of the conditional mean for groups with
+only a few observations?
+
+
+
Partial Pooling
+
If the data is generated by the process described above, and if we
+knew \(\sigma_w\) and \(\sigma_b\), then a good estimate \(\hat{y}_i\) for the mean of group \(i\) is the weighted average of the grand
+mean over all the data, \(\bar{y}\),
+and the observed mean of all the observations in group \(i\), \(\bar{y}_i\).
+
\[
+\large
+\hat{y_i} \approx \frac{\frac{n_i} {\sigma_w^2} \cdot \bar{y}_i +
+\frac{1}{\sigma_b^2} \cdot \bar{y}}
+{\frac{n_i} {\sigma_w^2} + \frac{1}{\sigma_b^2}}
+\]
+
where \(n_i\) is the number of
+observations for group \(i\). In other
+words, for groups where you have a lot of observations, use an estimate
+close to the observed group mean. For groups where you have only a few
+observations, fall back to an estimate close to the grand mean.
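
For example, with the values used to generate our data (\(\sigma_w = 0.7\), \(\sigma_b = 0.5\)), a group with \(n_i = 2\) observations puts weight \((2/0.49)/(2/0.49 + 1/0.25) \approx 0.51\) on its own observed mean, while a group with \(n_i = 50\) observations puts a weight of about \(0.96\) on its own mean.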
+
Gelman and Hill call the grand mean the complete-pooling
+estimate, because the data from all the groups is pooled to create the
+estimate (which is the same for all \(i\)). The “raw” observed means are the
+no-pooling estimate, because no pooling occurs; only
+observations from group \(i\)
+contribute to \(\hat{y_i}\). The
+weighted sum of the complete-pooling and the no-pooling estimate is
+hence the partial-pooling estimate.
+
Of course, in practice we don’t know \(\sigma_w\) and \(\sigma_b\). The lmer function
+essentially solves for the restricted maximum likelihood (REML)
+estimates of the appropriate parameters in order to estimate \(\hat{y_i}\). You can express multilevel
models in lme4 using the term (1 | gp) in formulas, which designates
gp as the grouping variable for which you want conditional
estimates. The model that we are interested in
+is the simplest: outcome as a function of the grouping variable, with no
+other predictors.
+
library(lme4)

poolmod = lmer(y ~ (1 | gp), data=df)
+
See section 2.2 of this
+lmer vignette for more discussion on writing formulas
+for models with additional predictors. Printing poolmod
displays the REML estimates of the grand mean (the intercept), \(\sigma_b\) (the standard deviation of the gp effect), and \(\sigma_w\) (the residual standard deviation).
+
poolmod
+
## Linear mixed model fit by REML ['lmerMod']
## Formula: y ~ (1 | gp)
##    Data: df
## REML criterion at convergence: 2282.939
## Random effects:
##  Groups   Name        Std.Dev.
##  gp       (Intercept) 0.5348
##  Residual             0.7063
## Number of obs: 1002, groups:  gp, 85
## Fixed Effects:
## (Intercept)
##    -0.02761
+
To pull these values out explicitly:
+
# the estimated grand mean
+(grandmean_est= fixef(poolmod))
+
## (Intercept)
+## -0.02760728
+
library(dplyr)   # for %>%, filter(), and pull(), if not already loaded

# get the estimated between-group standard deviation
(sigma_b = as.data.frame(VarCorr(poolmod)) %>%
+ filter(grp=="gp") %>%
+ pull(sdcor))
+
## [1] 0.5348401
+
# get the estimated within-group standard deviation
+(sigma_w = as.data.frame(VarCorr(poolmod)) %>%
+ filter(grp=="Residual") %>%
+ pull(sdcor))
+
## [1] 0.7063342
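
With these estimates in hand, we can apply the weighted-average formula from the previous section by hand and check it against the per-group estimates that lmer reports. This is just a sketch; the helper names below (grand, by_hand, lmer_est) are ours, not from the original analysis.

``` r
# the REML estimate of the grand mean, as a plain number
grand <- as.numeric(grandmean_est)

# apply the partial-pooling formula directly, using the estimated standard deviations
by_hand <- df %>%
  group_by(gp) %>%
  summarize(nobs = n(), ybar = mean(y)) %>%
  mutate(w_gp    = nobs / sigma_w^2,    # precision of the observed group mean
         w_grand = 1 / sigma_b^2,       # precision assigned to the grand mean
         pooled  = (w_gp * ybar + w_grand * grand) / (w_gp + w_grand))

# lmer's per-group estimates: intercept plus conditional modes, matched by group name
lmer_est <- coef(poolmod)$gp[as.character(by_hand$gp), "(Intercept)"]

head(data.frame(gp = by_hand$gp, formula = by_hand$pooled, lmer = lmer_est))
```

For a Gaussian model like this one, the two sets of estimates should agree very closely.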
+
predict(poolmod) will return the partial pooling
+estimates of the group means. Below, we compare the partial pooling
+estimates to the raw group mean expectations. The gray lines represent
+the true group means, the dark blue horizontal line is the observed
+grand mean, and the black dots are the estimates. We have again sorted
+the groups by number of observations, and laid them out (with a slight
+jitter) on a log10 scale.
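
Below is a sketch of how the per-group quantities used in these comparisons could be assembled (the pred_frame name and its columns are illustrative, not from the original):

``` r
# per-group raw means, partial-pooling predictions, and the resulting shrinkage
pred_frame <- df %>%
  group_by(gp) %>%
  summarize(nobs = n(), raw = mean(y)) %>%
  mutate(pooled    = predict(poolmod, newdata = data.frame(gp = gp)),
         shrinkage = raw - pooled)
```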
+
(Figure: partial pooling estimates (black) compared to the observed grand mean (dark blue line) and the true group means (gray), groups sorted by number of observations.)
+
For groups with only a few observations, the partial pooling
+“shrinks” the estimates towards the grand mean, which often results
+in a better estimate of the true conditional population means. We can
+see the relationship between shrinkage (the raw estimate minus the
+partial pooling estimate) and the groups, ordered by sample size.
+
(Figure: shrinkage (raw estimate minus partial pooling estimate) by group, ordered by sample size.)
+
For this data set, the partial pooling estimates are on average
+closer to the true means than the raw estimates; we can see this by
+comparing the root mean squared errors of the two estimates.
+
| estimate_type   |      rmse |
|:----------------|----------:|
| raw             | 0.3261321 |
| partial pooling | 0.2484646 |

+
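Here is a sketch of how these numbers could be computed, reusing the pred_frame and true_means objects assumed in the sketches above:

``` r
rmse <- function(err) sqrt(mean(err^2))

est_frame <- pred_frame %>% left_join(true_means, by = "gp")

rmse(est_frame$raw    - est_frame$mu)   # raw (no-pooling) estimates
rmse(est_frame$pooled - est_frame$mu)   # partial-pooling estimates
```
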
The Discrete Case
+
For discrete (binary) outcomes or classification, use the function
+glmer() to fit multilevel logistic regression models.
+Suppose we want to predict \(\mbox{P}(y > 0
+\,|\, gp)\), the conditional probability that the outcome \(y\) is positive, as a function of \(gp\).
+
# define a binary outcome: is y positive?
df$ispos = df$y > 0
+
+# fit a logistic regression model
+mod_glm = glm(ispos ~ gp, data=df, family=binomial)
+
Again, the conditional probability estimates will be highly uncertain
+for groups with only a few observations. We can fit a multilevel model
+with glmer and compare the distributions of the resulting
+predictions in link space.
+
mod_glmer = glmer(ispos ~ (1|gp), data=df, family=binomial)
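
For the comparison, link-scale (log-odds) predictions can be pulled from both models; a small sketch:

``` r
link_glm   <- predict(mod_glm)     # predict.glm defaults to type = "link"
link_glmer <- predict(mod_glmer)   # predict.merMod also defaults to the link scale

summary(link_glm)
summary(link_glmer)
```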
+
(Figure: distributions of the logistic regression (glm) and multilevel (glmer) predictions, in link space.)
+
Note that the distribution of predictions for the standard logistic
+regression model is trimodal, and that for some groups, the logistic
+regression model predicts probabilities very close to 0 or to 1. In most
+cases, these predictions will correspond to groups with few
+observations, and are unlikely to be good estimates of the true
+conditional probability. The partial pooling model avoids making
+unjustified predictions near 0 or 1, instead “shrinking” the
+estimates to the estimated global probability that \(y > 0\), which in this case is about
+0.49.
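
That global estimate can be read off the intercept of the multilevel model; a small sketch:

``` r
# inverse logit of the fitted intercept: the estimated global probability that y > 0
plogis(as.numeric(fixef(mod_glmer)))   # roughly 0.49 for this data
```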
+
We can see how the number of observations corresponds to the
+shrinkage (the difference between the logistic regression and the
+partial pooling estimates) in the graph below (this time in probability
+space). Points in orange correspond to groups where the logistic
+regression estimated probabilities of 0 or 1 (the two outer lobes of the
+response distribution). Multimodal densities are often symptoms of model
+flaws such as omitted variables or un-modeled mixtures, so it is
+exciting to see the partially pooled estimator avoid the “wings” seen in
+the simpler logistic regression estimator.
+
(Figure: shrinkage (logistic regression estimate minus partial pooling estimate) versus number of observations, in probability space; orange points mark groups where the logistic regression predicted probabilities of essentially 0 or 1.)
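
Here is a sketch of how the quantities in this graph could be computed (the prob_frame name and the threshold for flagging extreme predictions are illustrative):

``` r
# per-group predictions in probability space, and the resulting shrinkage
prob_frame <- df %>%
  group_by(gp) %>%
  summarize(nobs = n()) %>%
  mutate(p_glm     = predict(mod_glm,   newdata = data.frame(gp = gp), type = "response"),
         p_glmer   = predict(mod_glmer, newdata = data.frame(gp = gp), type = "response"),
         shrinkage = p_glm - p_glmer,
         extreme   = p_glm < 1e-6 | p_glm > 1 - 1e-6)   # groups predicted at (essentially) 0 or 1
```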
+
+
+
+
Partial Pooling Degrades Gracefully
+
When there is enough data for each population to get a good estimate
+of the population means – for example, when the distribution of groups
+is fairly uniform, or at least not too skewed – the partial pooling
estimates will converge to the raw (no-pooling) estimates. When the
+variation between population means is very low, the partial pooling
+estimates will converge to the complete pooling estimate (the grand
+mean).
+
When there are only a few levels (Gelman and Hill suggest fewer than
about five), there will generally not be enough information to make a
+good estimate of \(\sigma_b\), so the
partially pooled estimates likely won’t be much better than the raw
+estimates.
+
So partial pooling is potentially most valuable when the number of
groups is large and there are many rare levels. With respect
+to vtreat, this is exactly the situation when level coding
+is most useful!
+
Multilevel modeling assumes the data was generated from the mixture
+process above: each population is normally distributed, with the same
+standard deviation, and the population means are also normally
+distributed. Obviously, this may not be the case, but as Gelman and Hill
+argue, the additional inductive bias can be useful for those populations
+where you have little information.
+
Thanks to Geoffrey
+Simmons, Principal Data Scientist at Echo Global Logistics, for
+suggesting partial pooling based level coding for vtreat,
+introducing us to the references, and reviewing our articles.
+
+
+
References
+
Gelman, Andrew and Jennifer Hill. Data Analysis Using Regression
+and Multilevel/Hierarchical Models. Cambridge University Press,
+2007.
+
+
diff --git a/PartialPooling_R/PartialPooling_R.md b/PartialPooling_R/PartialPooling_R.md
new file mode 100644
index 0000000..63601d1
--- /dev/null
+++ b/PartialPooling_R/PartialPooling_R.md
@@ -0,0 +1,231 @@
+---
+title: 'Partial Pooling for Lower Variance Variable Encoding'
+author: "Nina Zumel"
+output:
+  html_document:
+    keep_md: true
+---
+
+