Concrete Dropout

## At a Glance

- **Definition:** A dropout variant that learns the per-layer dropout probability $p$ by replacing the usual Bernoulli dropout mask with its Concrete relaxation, which makes the loss differentiable with respect to $p$. Dropout can then be kept on at inference to perform Monte Carlo (MC) inference for uncertainty estimation.
- **Formula:** $\mathcal{L}(\theta) = -\tfrac{1}{M}\sum_{i \in S}\log p(y_i \mid f^{\omega}(x_i))$ $+ \tfrac{1}{N}\,\mathrm{KL}(q_\theta(\omega)\,\|\,p(\omega))$
- **Range / output:** Returns a trained scalar $p \in (0, 1)$ per layer. At inference, repeated forward passes through the Concrete-masked network produce a sample ensemble whose spread (for example its predictive entropy) serves as an uncertainty estimate.
- **Also known as:** Learnable dropout; gradient-tuned Bayesian dropout. The relaxation used is the Concrete distribution, also called Gumbel-Softmax in the generative-modelling literature.[^2][^3]

## Why It Exists

The obvious way to set a dropout probability is a grid search: try $p \in \{0.1, 0.2, \ldots, 0.5\}$ and pick the value that maximises validation-set performance. But that assumes every layer should share the same dropout probability and that it should stay fixed throughout training. Concrete Dropout removes both assumptions by treating $p$ as a learnable parameter optimised during training. The only obstacle is that the standard Bernoulli mask used in dropout is discrete, so it has no gradient with respect to $p$. The paper therefore replaces it with the Concrete distribution, a continuous, sigmoid-like relaxation that concentrates near 0 and 1 but is differentiable everywhere, which lets the pathwise derivative estimator carry gradients through to $p$.[^1]

## Demo!

<div class="ml-widget" data-algo="concrete-dropout"></div>

**Tips:**

- Notice that dropout zeros the contribution of whole input neurons (columns), not individual weights.
- Note that the realised suppression rate equals $p$ on average but can be slightly off depending on the specific noise drawn. Try hitting re-sample to see fresh noise.
- Drag $t$ toward zero to recover regular dropout, with an undefined gradient.
- Note that the task-loss gradient with respect to logit($p$) is zero when $p=0$, peaks at $p=0.5$, and decreases back to zero at $p=1$. This is due to the logit parameterisation of $p$ and the gradient's path through the Gumbel-Softmax.

## Formalization

To understand where $p$ comes from and why it can be learned, we need dropout's variational interpretation.[^1] Treating the network's stochastic weight matrices $\omega = \{W_l\}_{l=1}^L$ as latent variables, dropout inference approximates the posterior $p(\omega \mid \mathcal{D})$ with a variational distribution $q_\theta(\omega)$ parameterised by mean weight matrices $M_l$ and dropout probabilities $p_l$. Minimising the KL divergence between the approximate and true posterior gives the ELBO objective:

$ \begin{align*} \mathcal{L}_{\mathrm{MC}}(\theta) = -\frac{1}{M}\sum_{i \in S}\log p(y_i \mid f^{\omega}(x_i)) \\ + \frac{1}{N}\,\mathrm{KL}(q_\theta(\omega)\,\|\,p(\omega)) \end{align*} $

The KL term decomposes per layer and, using the discretised Gaussian prior from Gal (2016),[^4] approximates to:

$\mathrm{KL}(q_M(W)\,\|\,p(W)) \propto \frac{l^2(1-p)}{2}\|M\|^2 - K\,H(p)$

where the Bernoulli entropy is:

$H(p) := -p\log p - (1-p)\log(1-p)$

The $(1-p)$ factor here is written in terms of the mean weights $M$. The implementation does not regularise $M$ directly; it regularises the layer kernel $W = (1-p)M$, which turns the $(1-p)$ factor into a $1/(1-p)$ (see *Why the code divides by $(1-p)$* below).
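
The change of variables $W = (1-p)M$ mentioned above can be checked numerically. The following is a minimal sketch in plain Python, not code from the paper: the function names and the numeric values (`p`, `length_scale`, `k`, `m_sq_norm`) are made up for illustration, and both functions compute the KL expression only up to its proportionality constant.

```python
import math


def bernoulli_entropy(p: float) -> float:
    """H(p) = -p log p - (1 - p) log(1 - p), in nats."""
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def kl_mean_weight_form(m_sq_norm: float, p: float, length_scale: float, k: int) -> float:
    """l^2 (1 - p) / 2 * ||M||^2 - K H(p), written in the mean weights M."""
    return 0.5 * length_scale**2 * (1.0 - p) * m_sq_norm - k * bernoulli_entropy(p)


def kl_kernel_form(w_sq_norm: float, p: float, length_scale: float, k: int) -> float:
    """l^2 / (2 (1 - p)) * ||W||^2 - K H(p), written in the kernel W = (1 - p) M."""
    return 0.5 * length_scale**2 * w_sq_norm / (1.0 - p) - k * bernoulli_entropy(p)


# Illustrative (hypothetical) values, chosen only to exercise the formulas.
p, length_scale, k = 0.2, 0.1, 64
m_sq_norm = 50.0
w_sq_norm = (1.0 - p) ** 2 * m_sq_norm  # because W = (1 - p) M

kl_in_m = kl_mean_weight_form(m_sq_norm, p, length_scale, k)
kl_in_w = kl_kernel_form(w_sq_norm, p, length_scale, k)
print(kl_in_m, kl_in_w)  # the two parametrisations give the same value
```

Both forms evaluate to the same number, so the choice between them only affects which quantity the code has to store, not the objective being minimised.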
To obtain $\tfrac{\partial \mathcal{L}}{\partial p}$, the paper replaces the discrete Bernoulli mask with the Concrete distribution.[^2][^3] For the binary case this reduces to a sigmoid transform of uniform noise:

$ \begin{align*} \tilde{z} = \sigma\!\left(\frac{1}{t}\!\left[\log p - \log(1-p) + \log u - \log(1-u)\right]\right), \\ u \sim \mathrm{Unif}(0,1) \end{align*} $

This $\tilde{z}$ sits in $(0, 1)$, concentrates near 0 and 1 for small temperature $t$, and is differentiable with respect to $p$ everywhere. This lets standard backpropagation update $p$ alongside the weights.

**Variables:**

- **$\theta = \{M_l, p_l\}_{l=1}^L$** `[scalar p per layer]`: variational parameters, namely the mean weight matrices and the dropout probabilities being learned.
- **$f^{\omega}(x_i)$** `[output_dim]`: network output on input $x_i$ under one sampled weight realisation $\omega$ (one Concrete-masked forward pass).
- **$N$** `[scalar]`: total dataset size. Scales the KL term so that $p \to 0$ as $N \to \infty$: more data reduces epistemic uncertainty.
- **$M$** `[scalar]`: mini-batch size used for the Monte Carlo estimate of the likelihood term.
- **$l$** `[scalar]`: prior length-scale. Sets the weight-regularisation strength: `weight_regularizer` $= l^2/(\tau N)$.
- **$K$** `[scalar]`: number of input units subject to dropout (input dimensionality of the layer). Scales the entropy term, so wider layers (typically the deeper ones) are pushed toward higher $p$.
- **$H(p)$** `[scalar]`: entropy of a $\mathrm{Bernoulli}(p)$ random variable; maximised at $p = 0.5$, which is why the KL term pulls $p$ toward 0.5 when data is scarce.
- **$t$** `[scalar]`: Concrete temperature. Lower $t$ makes $\tilde{z}$ more binary (closer to standard dropout); $t$ can itself be treated as a variational parameter.
- **$\tilde{z}$** `[input_dim]`: the Concrete-relaxed drop mask, valued in $(0, 1)$ and close to 1 with probability roughly $p$. The layer input is multiplied by $1-\tilde{z}$ and then rescaled by $1/(1-p)$ to preserve expected magnitude.
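
To make the relaxation formula above concrete, here is a minimal plain-Python sketch that draws samples of $\tilde{z}$ and summarises them. The helper names (`stable_sigmoid`, `concrete_drop_sample`, `sample_many`) and the clamping constant `eps` are illustrative assumptions rather than anything taken from the paper. Following the PyTorch code in this article, $\tilde{z}$ is read as a drop mask. Since $\tilde{z} > 0.5$ exactly when $u > 1-p$, the fraction of samples above 0.5 should be close to $p$ at any temperature.

```python
import math
import random


def stable_sigmoid(x: float) -> float:
    """Sigmoid that avoids overflow for large |x|."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def concrete_drop_sample(p: float, t: float, u: float, eps: float = 1e-12) -> float:
    """One relaxed drop-mask value: sigmoid((logit(p) + logit(u)) / t)."""
    u = min(max(u, eps), 1.0 - eps)  # keep log(u) and log(1 - u) finite
    logit_p = math.log(p) - math.log(1.0 - p)
    logit_u = math.log(u) - math.log(1.0 - u)
    return stable_sigmoid((logit_p + logit_u) / t)


def sample_many(p: float, t: float, n: int, seed: int = 0) -> list:
    """Draw n relaxed mask values with a seeded generator for repeatability."""
    rng = random.Random(seed)
    return [concrete_drop_sample(p, t, rng.random()) for _ in range(n)]


samples = sample_many(p=0.1, t=0.1, n=20000)
frac_above_half = sum(z > 0.5 for z in samples) / len(samples)
frac_near_binary = sum(z < 0.05 or z > 0.95 for z in samples) / len(samples)
print(f"fraction above 0.5: {frac_above_half:.3f}")
print(f"fraction near 0 or 1: {frac_near_binary:.3f}")
```

At $t = 0.1$ most samples land close to 0 or 1, which is the near-binary behaviour the temperature is meant to control; raising `t` in this sketch spreads the samples across the interior of $(0, 1)$.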
## Building the Intuition (Deep Dive)

### The tug-of-war inside $\mathcal{L}(p)$

The widget shows what the Concrete mask looks like; this section explains where the optimal $p$ comes from. The KL term in the ELBO has two additive forces pulling $p$ in opposite directions.

The **weight regularisation term** $\tfrac{l^2(1-p)}{2}\|M\|^2$ is minimised when $p = 1$: mask everything, drive the effective weight magnitude to zero. It pulls the minimum rightward. (This direction holds in the mean-weight coordinates $M$; in the kernel coordinates the code uses it looks reversed, as explained in *Why the code divides by $(1-p)$* below.)

The **entropy term** $-KH(p)$ is minimised (most negative) at $p = 0.5$, where entropy is highest. It pulls rightward from small $p$ and leftward from large $p$, so it centres.

The **likelihood term** (the first term of $\mathcal{L}_{\mathrm{MC}}$) penalises any masking that degrades predictions; it pulls leftward toward $p = 0$.

Together these produce a bowl-shaped objective whose minimum moves left as $N$ grows (data overwhelms the KL) and right as $K$ grows (the entropy term, scaled by $K$, strengthens).[^1]

### Why the code divides by $(1-p)$

Eq. (3) of the paper, the KL approximation given in the Formalization section, has $(1-p)$ in the *numerator*, but the implementation divides by it. This is not a discrepancy; it is the same term written in two parametrisations.

Eq. (3) is expressed in terms of $M$, the mean weight in the parametrisation where the realised weight is $M\cdot\mathrm{diag}(\mathrm{Bernoulli})$ with no rescaling. The code does not store $M$; it stores the layer kernel $W$ used with *inverted* dropout, where activations are scaled by $1/(1-p)$. For the realised weight to have mean equal to the stored kernel, the two relate by $W = (1-p)M$, so $\|M\|^2 = \|W\|^2/(1-p)^2$.
Substituting:

$\frac{l^2(1-p)}{2}\|M\|^2 = \frac{l^2(1-p)}{2}\cdot\frac{\|W\|^2}{(1-p)^2} = \frac{l^2}{2(1-p)}\|W\|^2$

The $1/(1-p)^2$ introduced by the reparametrisation cancels the explicit $(1-p)$ and leaves the denominator form the code uses. A corollary worth keeping: the *direction* the weight term appears to push $p$ flips between the two coordinate systems (toward $1$ in $M$, toward $0$ in $W$), which is why the tug-of-war above is narrated in $M$-coordinates. Same optimum, different decomposition. This exact question was raised and answered on the reference repository (yaringal/ConcreteDropout, issue #1).

### Why the Bernoulli gradient is blocked, and how Concrete fixes it

Standard gradient descent on $p$ fails because sampling from $\mathrm{Bernoulli}(p)$ is a non-differentiable step function. The score-function (REINFORCE) estimator does produce an unbiased gradient, but its variance is so high in practice that optimisation stalls.[^1]

The Concrete distribution sidesteps this by relaxing the hard 0/1 boundary: instead of $z \sim \mathrm{Bernoulli}(p)$, sample $\tilde{z}$ from the sigmoid-of-logistic-noise reparameterisation above. Because $\tilde{z}$ is a deterministic, differentiable function of the noise $u$ and the parameter $p$, the pathwise derivative $\partial \tilde{z}/\partial p$ exists everywhere and can be backpropagated directly.[^2][^3]

At temperature $t \to 0$, $\tilde{z}$ converges in distribution to the Bernoulli. So a small fixed temperature (the paper uses $t = 0.1$) gives the best of both worlds: near-binary behaviour at inference and differentiable gradients at training.[^1]

### The input-layer pattern

Across all experiments in the paper, the input layer's dropout probability converges to nearly zero while deeper layers retain moderate $p$.[^1] The reason is that the entropy term $-KH(p)$ scales with $K$, the layer's input width.
The input layer is typically narrow relative to hidden layers, so its entropy term is weakest (the likelihood pressure dominates and drives $p \to 0$). Deeper, wider layers have stronger entropy terms and settle at higher $p$. This gives a post-hoc justification for the practitioner rule of using small dropout early and larger dropout deep.

### Epistemic vs. aleatoric uncertainty

Concrete Dropout is a tool for modelling **epistemic uncertainty** (uncertainty that shrinks as data accumulates). The dropout probability $p$ directly controls the variance of the function ensemble: larger $p$ produces more varied forward passes and wider predictive intervals. As $N \to \infty$ the optimised $p \to 0$ and the epistemic uncertainty collapses to zero, leaving only **aleatoric uncertainty** (irreducible noise).[^1]

## Failure Modes & Gotchas

- **$p$ stuck near 0.5 with small data:** If the dataset is small relative to model size, the entropy term dominates and the optimiser parks $p$ near 0.5. This is mathematically correct: the model genuinely has high epistemic uncertainty, but it can feel like the network is refusing to train. The fix is not to lower $p$ by hand; it is to collect more data or reduce model capacity.
- **Gradient instability near $p = 0$ or $p = 1$:** The Concrete reparameterisation involves $\log p$ and $\log(1-p)$, both of which blow up at the boundaries. Practically, $p$ should be parameterised as a sigmoid of an unconstrained logit (as in the Keras implementation in the paper and the PyTorch example below), so it can never exactly reach 0 or 1. Clamp the logit range or add a small $\varepsilon$ if numerical issues appear.
- **Miscalibrated `weight_regularizer` / `dropout_regularizer`:** The two regulariser hyperparameters must satisfy `weight_regularizer` $= l^2/(\tau N)$ and `dropout_regularizer` $= 2/(\tau N)$, where $\tau$ is the model precision (inverse observation noise) and $N$ is the dataset size.
  Getting $N$ wrong (for example using a mini-batch size instead of the full dataset size) gives the wrong $p$ at convergence and breaks uncertainty calibration. This is the most common implementation error.[^1]
- **Setting $\tau$ depends on the task:** For regression with a Gaussian likelihood, $\tau$ is the inverse observation noise, a genuine model quantity, so either learn it jointly (e.g. predict a per-point log-variance for heteroscedastic noise) or tune it on validation. For classification the categorical likelihood has no noise scale, so $\tau$ loses its meaning: fix $\tau = 1$ and tune only the prior length-scale $l$. One catch goes with this: the factor of 2 in `dropout_regularizer` $= 2/(\tau N)$ comes from the Euclidean-loss derivation and must be **dropped for cross-entropy**, giving `dropout_regularizer` $= 1/(\tau N)$ while `weight_regularizer` $= l^2/(\tau N)$ keeps its form. The ratio `weight_regularizer / dropout_regularizer` $= l^2/2$ (regression) is a useful sanity check.[^1]
- **Applying to non-Dense layers without adjusting the regulariser:** The paper's regulariser derivation assumes the loss is summed over data points and the KL scales by $1/N$. For pixel-wise losses (e.g. semantic segmentation), the effective $N$ is $N_{\text{images}} \times H \times W$. The paper's computer-vision experiments use `dropout_regularizer` $= 0.01 \times N \times H \times W$ for exactly this reason. Forgetting the spatial scaling drives $p$ to near zero for all layers.[^1]
- **Temperature too high:** At large $t$ the Concrete samples are uniform in $(0, 1)$ rather than near-binary, so the network experiences smooth attenuation rather than dropout-like masking. Uncertainty estimates become poorly calibrated. Use $t \leq 0.1$ in practice; the paper uses $t = 0.1$ throughout.[^1]
- **Comparing to fixed-$p$ dropout without MC sampling:** Concrete Dropout's uncertainty estimates require MC sampling at inference (multiple stochastic forward passes).
  Evaluating with a single deterministic forward pass (standard dropout inference mode) produces neither calibrated uncertainty nor the accuracy gains the paper reports.

## Implementation Details & Code

- **Framework Equivalents:** PyTorch: wrap any `nn.Module` layer with the `ConcreteDropout` wrapper shown below and add each wrapped layer's `get_kl_loss()` to your training loss. Keras: the paper's Appendix C provides a ~20-line `Wrapper` subclass.[^1] No built-in equivalent exists in `torch.nn` or `torch.nn.functional`.
- **Complexity:** Training adds one scalar parameter $p_\text{logit}$ per wrapped layer (negligible). The KL computation is $O(d)$ in the layer's weight count. Inference cost for $T$ MC samples is $T \times$ the single-forward-pass cost, with $T = 10$–$50$ typical.
- **Practical Heuristics:** Initialise $p$ around 0.1–0.2 (the paper draws $\mathrm{logit}(p)$ uniformly from $[-2, 0]$, corresponding to $p$ of roughly 0.12–0.5). Set `weight_regularizer` $= l^2/(\tau N)$ and `dropout_regularizer` $= 2/(\tau N)$; the paper finds $l = 10^{-2}$ works well across UCI benchmarks. (See *Failure Modes & Gotchas* for how to set $\tau$ on classification vs. regression and the cross-entropy factor-of-2 caveat.) For pixel-wise losses multiply `dropout_regularizer` by $H \times W$. Temperature $t = 0.1$ is the recommended default. Concrete Dropout is tolerant to the initial value of $p$: the paper shows convergence to similar optima from any initialisation in $[0.05, 0.5]$.[^1]

```python
import torch
import torch.nn as nn
import numpy as np


class ConcreteDropout(nn.Module):
    """
    Concrete Dropout wrapper (Gal, Hron & Kendall 2017).

    Wraps any nn.Module layer; learns the dropout probability p by
    gradient descent via the Concrete relaxation.

    Usage:
        layer = ConcreteDropout(nn.Linear(in_dim, out_dim), input_dim=in_dim)
        # In training loop:
        out = layer(x)
        kl = layer.get_kl_loss()
        loss = nll_loss(out, y) + kl
    """

    def __init__(
        self,
        layer: nn.Module,
        input_dim: int,
        weight_regularizer: float = 1e-6,   # = l² / (τN)
        dropout_regularizer: float = 1e-5,  # = 2 / (τN)
        init_p: float = 0.1,
        temperature: float = 0.1,
    ):
        super().__init__()
        self.layer = layer
        self.input_dim = input_dim
        self.weight_regularizer = weight_regularizer
        self.dropout_regularizer = dropout_regularizer
        self.temperature = temperature
        # p is parameterised as sigmoid(p_logit) so it never hits 0 or 1.
        self.p_logit = nn.Parameter(
            torch.tensor(float(np.log(init_p / (1.0 - init_p))))
        )

    @property
    def p(self) -> torch.Tensor:
        return torch.sigmoid(self.p_logit)

    def get_kl_loss(self) -> torch.Tensor:
        """KL regularisation term; add this to your NLL loss at each step."""
        p = self.p.to(torch.float32)
        eps = 1e-7

        # Weight regularisation. Eq.(3) is (l²(1−p)/2)‖M‖² in mean-weight coords;
        # the kernel is W = (1−p)M under inverted dropout, so ‖M‖² = ‖W‖²/(1−p)²
        # and the term becomes weight_reg · ‖W‖² / (1−p).
        # See the note "Why the code divides by (1−p)".
        w_sum = sum(
            param.pow(2).sum()
            for name, param in self.layer.named_parameters()
            if "weight" in name
        )
        weight_reg = self.weight_regularizer * w_sum / (1.0 - p + eps)

        # Entropy regularisation: −(2/τN) · K · H(p).
        # Minimising the loss maximises H(p), which pushes p toward 0.5.
        H = -(p * torch.log(p + eps) + (1 - p) * torch.log(1 - p + eps))
        dropout_reg = -self.dropout_regularizer * self.input_dim * H

        return weight_reg + dropout_reg

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p = self.p
        # 1e-7 is large enough that 1 - eps is still below 1 in float32.
        eps = 1e-7

        # Concrete relaxation: sigmoid((log p − log(1−p) + log u − log(1−u)) / t),
        # where p_logit already equals log p − log(1−p).
        u = torch.rand_like(x).clamp(eps, 1.0 - eps)
        z = torch.sigmoid(
            (self.p_logit + torch.log(u) - torch.log(1.0 - u)) / self.temperature
        )

        # z is the relaxed drop mask, so keep (1 − z) of each input and
        # rescale by 1/(1 − p) to preserve the expected activation magnitude.
        return self.layer(x * (1.0 - z) / (1.0 - p + eps))
```

## Related Concepts

- [[MC Dropout]]
- [[Dropout]]
- [[Bayesian Neural Networks]]
- [[Epistemic vs. Aleatoric Uncertainty]]
- [[Variational Inference]]
- [[Gumbel-Softmax]]

---

## References

[^1]: Gal, Y., Hron, J. and Kendall, A. (2017) 'Concrete Dropout', *Advances in Neural Information Processing Systems*, 30. Available at: https://arxiv.org/abs/1705.07832

[^2]: Maddison, C.J., Mnih, A. and Teh, Y.W. (2017) 'The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables', *International Conference on Learning Representations*. Available at: https://arxiv.org/abs/1611.00712

[^3]: Jang, E., Gu, S. and Poole, B. (2017) 'Categorical Reparameterization with Gumbel-Softmax', *International Conference on Learning Representations*. Available at: https://arxiv.org/abs/1611.01144

[^4]: Gal, Y. (2016) 'Uncertainty in Deep Learning', PhD thesis, University of Cambridge. Available at: http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf