Lecture 2: Monte Carlo Integration
Rectangle Integration
Consider the rectangle rule for integration, which you may be familiar with from standard calculus. It estimates the area under a curve by dividing that area into many evenly sized rectangles. We evaluate the function at, say, the left end of each rectangle, multiply by the rectangle's width to compute its area, and then sum the areas of all the rectangles. In the limit where the number of rectangles approaches infinity, we recover the exact value of the integral.
We can express this operation mathematically as: $$ \int_a^b f(x) \mathrm{d}x = \lim_{N \rightarrow \infty} \frac{b-a}{N} \sum_{i=0}^{N-1} f \left( a + i \left(\frac{b-a}{N} \right) \right) $$
Implementing this in code is also fairly straightforward:
import numpy as np

def f(x):
    return 1 - np.sin(x)

def RectangleIntegrate(f: callable, a: float, b: float, N: int) -> float:
    # Left endpoints: a + i * (b - a) / N for i = 0 .. N-1
    x = np.linspace(a, b, N, endpoint=False)
    y = f(x)
    dx = (b - a) / N
    return np.sum(y) * dx

RectangleIntegrate(f, 0, 2*np.pi, 1000)
This method can work fairly well as long as you have enough samples:
However, there are a few major drawbacks with this method of integration. For one, if we do not have enough samples, we can run into aliasing issues, as can be seen in this (albeit contrived) example:
And, even worse, the number of samples needed to get an accurate estimate grows exponentially with the number of dimensions. So if you wanted an estimate for a function $f : \mathbb{R}^5 \rightarrow \mathbb{R}$, you would need $N^5$ samples compared to the $N$ samples necessary for a function $g : \mathbb{R} \rightarrow \mathbb{R}$; at just 100 rectangles per axis, that is already $100^5 = 10^{10}$ function evaluations. So, is there another way that we can numerically evaluate an integral with fewer aliasing issues and better scaling with the number of dimensions?
Monte Carlo Integration
The core principle behind Monte Carlo integration is to choose random samples over the domain of the integral. And, since we’re dealing with random values, we need to take a step back and review some probability and statistics first.
Notation
- $\mathrm{X}$: a stochastic (i.e., random) variable
- $x$: an observation of $\mathrm{X}$ or a normal variable
- $p(x)$: the probability density function (PDF), which describes the relative likelihood of observing a sample $x$
- The integral of $p(x)$ over the domain must be 1, i.e., $\int_D p(x) dx = 1$
- The density must be non-negative everywhere on the domain, i.e., $p(x) \geq 0$
- $\hat{I}$: our estimator (denoted as a capital letter with a hat)
Expected Value
The expected value of a random variable is simply the mean/average value you would expect after observing the random variable many times. It can be expressed as: \begin{align} \mathbb{E}[\mathrm{X}]_{\mathrm{X}\sim p} = \int x \ p(x) \ \mathrm{d}x \end{align}
which is saying that the expected value for a random variable $\mathrm{X}$ which is distributed with PDF $p$, is equal to the integral of $x \ p(x)$ (over the domain of $p$).
Properties of the Expected Value
- Linearity: \begin{align} \mathbb{E}[a \mathrm{X} + b \mathrm{Y}] = a \mathbb{E}[\mathrm{X}] + b \mathbb{E}[\mathrm{Y}] \end{align}
- Law of the unconscious statistician (LOTUS) – yes, that’s what it’s actually called! \begin{align} \mathbb{E}[f(\mathrm{X})]_{\mathrm{X}\sim p} = \int f(x) \ p(x) \ \mathrm{d}x \end{align}
- Law of large numbers (LLN) \begin{align} \mathbb{E}[\mathrm{X}] = \lim_{N \rightarrow \infty} \left[ \frac{1}{N} \sum_{i=1}^N x_i \right] \end{align} where the $x_i$ are independent observations of $\mathrm{X}$. A quick numerical check of LOTUS and the LLN is sketched after this list.
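As a quick sanity check, here is a minimal sketch (not part of the original lecture code) that verifies LOTUS and the LLN numerically for a uniform random variable on $[0, 2\pi]$; the function name check_lotus_lln is just an illustrative choice.

import numpy as np

def check_lotus_lln(N: int = 1_000_000) -> float:
    # X ~ Uniform(0, 2*pi), so p(x) = 1 / (2*pi) on that interval
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 2*np.pi, N)
    # LLN: the sample mean of f(X) approaches E[f(X)] as N grows
    # LOTUS: E[f(X)] = integral of (1 + sin(x)) * (1 / (2*pi)) dx = 1
    return np.mean(1 + np.sin(X))

check_lotus_lln()  # approximately 1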
The Monte Carlo Integrator
If we look back at the definition of expected value, you may be able to see how we can take advantage of the equation to compute our desired integral of $f(x)$ by combining it with LOTUS. All we have to do is compute the expected value of $f(\mathrm{X})/p(\mathrm{X})$! \begin{align} \mathbb{E}\left[ \frac{f(\mathrm{X})}{p(\mathrm{X})} \right]_{\mathrm{X}\sim p} = \int \frac{f(x)}{p(x)} \ p(x) \ \mathrm{d}x = \int f(x) \ \mathrm{d}x \end{align}
And, by the law of large numbers, we can approximate this expectation with a finite sum. That sum is our estimator $\hat{I}$, and its expected value is exactly the integral we want: \begin{align} \hat{I} &= \frac{1}{N} \sum_{i=1}^N \frac{f(x_i)}{p(x_i)} \\ \mathbb{E}[\hat{I}] = \mathbb{E}\left[ \frac{f(\mathrm{X})}{p(\mathrm{X})} \right]_{\mathrm{X}\sim p} &= \lim_{N \rightarrow \infty} \left[ \frac{1}{N} \sum_{i=1}^N \frac{f(x_i)}{p(x_i)} \right] \end{align}
This means that we can now approximate the integral by taking many random samples of $f(\mathrm{X})/p(\mathrm{X})$. However, there are a few caveats to consider. For one, we can’t let $p(x) = 0$ for any part of the domain where $f(x)$ is not $0$. The issue here is not that we will get a divide-by-zero error, since we will simply never generate a sample where $p(x) = 0$. Rather, if the PDF is zero at any point where $f(x)$ is not zero, then we will never be able to evaluate $f(x)$ at that input, leading to a bias, i.e., a systematic error, in our estimate.
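As a concrete special case (and the one the code below uses), take the uniform PDF on $[a, b]$. The estimator then reduces to $$ p(x) = \frac{1}{b-a} \quad \Rightarrow \quad \hat{I} = \frac{1}{N} \sum_{i=1}^N \frac{f(x_i)}{1/(b-a)} = \frac{b-a}{N} \sum_{i=1}^N f(x_i) $$ which looks just like the rectangle rule, except that the sample locations $x_i$ are drawn at random instead of being evenly spaced.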
This integrator is also simple to represent in code:
import numpy as np

def f(x):
    return 1 + np.sin(x)

def uniformPDF(x, a, b):
    return 1 / (b - a)

def MonteCarloIntegrate(f: callable, pdf: callable, a: float, b: float, N: int) -> float:
    # Draw N uniform random samples on [a, b]
    X = np.random.rand(N) * (b - a) + a
    # Evaluate the PDF and the integrand at each sample
    p = pdf(X, a, b)
    Y = f(X)
    # Average f(X) / p(X) over all samples
    samples = Y / p
    estimate = np.sum(samples) / N
    return estimate

MonteCarloIntegrate(f, uniformPDF, 0, 2*np.pi, 1000)
Errors for Estimators
- Bias, as we see in the rectangle integrator, is a systematic error in the estimate such that $\mathbb{E}[\hat{I}] \neq I$, where $I$ is our desired integral.
- Variance, as we see in the Monte Carlo integrator, is the average squared deviation of a random variable from its mean. More concretely, \begin{align} \mathrm{Var}(\mathrm{X}) = \mathbb{E}[(\mathrm{X} - \mathbb{E}[\mathrm{X}])^2] = \mathbb{E}[\mathrm{X}^2] - \mathbb{E}[\mathrm{X}]^2 \end{align}
Note that the error introduced by a small bias can be smaller than the error introduced by variance, so a biased estimate can still end up closer to the ground truth! That is to say, there are times when a biased result is preferable to an unbiased one with higher variance.
The rectangle integrator is biased but has zero variance – consecutive runs of the rectangle integrator always produce the same result. Consequently, averaging multiple runs of the integrator will not change the estimate at all. The Monte Carlo estimator, on the other hand, is unbiased but necessarily has some variance (consecutive runs only repeat the same result if you fix the seed of your random number generator). This in turn means the average of multiple runs can only improve the estimate, i.e., reduce the variance. More on that below.
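To make this concrete, here is a minimal sketch (assuming RectangleIntegrate, MonteCarloIntegrate, f, and uniformPDF from the code above are in scope) that runs each integrator several times: the rectangle estimates are identical across runs, while the Monte Carlo estimates scatter around the true value of $2\pi$.

import numpy as np

true_value = 2 * np.pi  # integral of 1 + sin(x) over [0, 2*pi]

# Rectangle rule: every run produces exactly the same estimate (zero variance)
rect_runs = [RectangleIntegrate(f, 0, 2*np.pi, 1000) for _ in range(5)]

# Monte Carlo: each run is unbiased but the results vary (nonzero variance)
mc_runs = [MonteCarloIntegrate(f, uniformPDF, 0, 2*np.pi, 1000) for _ in range(5)]

print(rect_runs)         # five identical values
print(mc_runs)           # five different values near 2*pi
print(np.mean(mc_runs))  # averaging runs reduces the variance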
Variance and Importance Sampling
The variance of a scaled sum of uncorrelated random variables $\mathrm{X}_i$ is \begin{align} \mathrm{Var}\left( \sum_{i=1}^N a \mathrm{X}_i \right) = a^2 \sum_{i=1}^N \mathrm{Var}(\mathrm{X}_i) \end{align}
So, the variance of our estimator $\hat{I}$ is \begin{align} \mathrm{Var}(\hat{I}) &= \mathrm{Var} \left( \frac{1}{N} \sum_{i=1}^N \frac{f(\mathrm{X}_i)}{p(\mathrm{X}_i)} \right)\\ &= \frac{1}{N^2} \sum_{i=1}^N \mathrm{Var}\left( \frac{f(\mathrm{X}_i)}{p(\mathrm{X}_i)} \right)\\ &= \frac{1}{N^2} N \, \mathrm{Var}\left( \frac{f(\mathrm{X})}{p(\mathrm{X})} \right)\\ &= \frac{1}{N} \mathrm{Var}\left( \frac{f(\mathrm{X})}{p(\mathrm{X})} \right) \end{align}
This tells us that the variance of our estimator is inversely proportional to the number of samples; equivalently, the standard deviation of the estimate (its typical error) shrinks as $1/\sqrt{N}$.
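Here is a minimal sketch (again assuming MonteCarloIntegrate, f, and uniformPDF from above) that measures the variance of the estimator empirically at a few sample counts; quadrupling $N$ should roughly quarter the variance.

import numpy as np

# Empirical variance of the Monte Carlo estimator for increasing sample counts N
for N in [100, 400, 1600]:
    estimates = [MonteCarloIntegrate(f, uniformPDF, 0, 2*np.pi, N) for _ in range(1000)]
    print(N, np.var(estimates))  # the variance should shrink roughly like 1 / N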
Reducing Variance with Importance Sampling
The idea behind importance sampling is to pick a distribution $p(x)$ that sends more samples to regions of the function $f(x)$ that contribute more to the integral. In practice, this entails making the distribution function approximate the function we are trying to integrate, up to some scaling factor. If $p(x)$ is close to $f(x)$ (up to that scaling factor), then the ratio $f(\mathrm{X})/p(\mathrm{X})$ will stay close to some constant value $c$. And, as a random variable approaches a constant, $\mathrm{X} \rightarrow c$, its variance approaches zero, $\mathrm{Var}(\mathrm{X}) \rightarrow 0$. In fact, if we set $p(x) = s f(x)$ where $s = 1 / \int f(x) \mathrm{d}x$, then $$ \mathrm{Var}\left( \frac{f(\mathrm{X})}{p(\mathrm{X})} \right) = \mathrm{Var}\left( \frac{1}{s} \right) = 0 $$
Though, as you may already be able to tell, this requires that we already know the integral we are trying to evaluate! We also need to be careful here. If we choose a $p(x)$ that is very different from our $f(x)$, we can end up massively increasing the variance! This is because, if our PDF $p(x)$ is very different from $f(x)$, we will send most of our samples to regions of $f(x)$ that make little contribution, so the few samples we send to the regions of $f(x)$ that make a big contribution will cause “spikes” whenever they are evaluated. When we evaluate such points, we are dividing an already relatively large value of $f(x)$ by a small probability of sampling that $x$. In rendering, this often manifests as “fireflies,” i.e., single pixels that appear much brighter because one pixel sample was weighted very highly. Note that even with a bad PDF, the estimate will still converge to the correct integral; the variance will just decrease more slowly.
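As an illustration (this example is not from the lecture), here is a minimal sketch of importance sampling for the hypothetical integrand $g(x) = x^2$ on $[0, 1]$, whose true integral is $1/3$. We compare a uniform PDF against $p(x) = 2x$, which roughly follows the shape of $g$ and can be sampled with the inverse-transform trick $x = \sqrt{u}$ for $u \sim \mathrm{Uniform}(0, 1)$.

import numpy as np

def g(x):
    return x**2  # hypothetical integrand; true integral over [0, 1] is 1/3

def ImportanceSampledIntegrate(N: int) -> float:
    u = np.random.rand(N)
    x = np.sqrt(u)           # inverse-transform sampling of p(x) = 2x on [0, 1]
    return np.mean(g(x) / (2 * x))

def UniformIntegrate(N: int) -> float:
    x = np.random.rand(N)    # uniform samples, p(x) = 1 on [0, 1]
    return np.mean(g(x))

# Both estimators converge to 1/3, but the importance-sampled one has lower
# variance because p(x) = 2x sends more samples where g(x) is large.
print(np.var([ImportanceSampledIntegrate(1000) for _ in range(1000)]))
print(np.var([UniformIntegrate(1000) for _ in range(1000)]))

Conversely, a PDF that concentrates samples where $g$ is small would have the opposite effect: the few samples that land where $g$ is large get divided by a tiny probability and spike the estimate, which is exactly the “firefly” behavior described above.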