Parzen Window Density Estimation - Kernel and Bandwidth
Imagine that you have some data x1, x2, x3, ..., xn originating from an unknown continuous distribution f. You'd like to estimate f.
There are several approaches to achieving this task. One of them is to estimate f with the empirical distribution. However, if the data is continuous, the empirical distribution is likely to degenerate into a uniform distribution: each data point would most likely appear only once and would therefore be assigned the same probability.
Another approach is to bucketize the data into intervals, as a histogram does. However, this yields some number of buckets (a discrete distribution) rather than a continuous distribution.
The next approach is to leverage what's called Parzen Window Density Estimation. It's a nonparametric method for estimating the probability density function of the data. In fact, it's just another name for Kernel Density Estimation (KDE).
Basically, KDE estimates the true distribution with a mixture of kernels that are centered at xi and have bandwidth (scale) equal to h. Here's the formula of KDE:
f_hat(x) = (1/(n*h)) * SUM(i=1 to n) K((x - xi) / h)
K: kernel
n: number of samples
h: bandwidth
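To make the formula concrete, here's a minimal NumPy sketch of it. The Gaussian kernel choice, the function names, and the sample data are illustrative assumptions on my part, not fixed by the formula:

```python
import numpy as np

def gaussian_kernel(u):
    # Standard normal PDF: non-negative, symmetric, integrates to one
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def f_hat(x, samples, h, kernel=gaussian_kernel):
    # f_hat(x) = (1/(n*h)) * SUM(i=1 to n) K((x - xi) / h)
    n = len(samples)
    return np.sum(kernel((x - samples) / h)) / (n * h)

samples = np.array([1.2, 1.9, 2.3, 3.1, 3.8])  # made-up data
print(f_hat(2.0, samples, h=0.5))              # density estimate at x = 2.0
```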
A kernel can be described as a probability density function used to estimate the density of data located close to a data point xi. It can be any function, as long as the required properties are fulfilled: the total probability (area under the PDF) is one, the function is symmetric (K(-u) = K(u)), and the function is non-negative. Several kernel examples are Gaussian, uniform, triangular, Epanechnikov, biweight, triweight, and so forth. If you browse them, you may notice that most of the kernel functions have a support of -1 to 1 (when the function is centered at 0).
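As a sketch, here are three of those kernels written out in NumPy, together with a quick numerical check of the three properties (the grid resolution and function names are my own choices):

```python
import numpy as np

def uniform_kernel(u):
    return np.where(np.abs(u) <= 1, 0.5, 0.0)

def triangular_kernel(u):
    return np.where(np.abs(u) <= 1, 1 - np.abs(u), 0.0)

def epanechnikov_kernel(u):
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

u = np.linspace(-1, 1, 100001)
for K in (uniform_kernel, triangular_kernel, epanechnikov_kernel):
    area = np.sum(K(u)) * (u[1] - u[0])   # Riemann sum; should be ~1
    symmetric = np.allclose(K(u), K(-u))  # K(-u) == K(u)
    nonneg = bool(np.all(K(u) >= 0))
    print(K.__name__, round(area, 4), symmetric, nonneg)
```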
So, why do we need a kernel and a bandwidth in the first place?
Since the data point xi is just a sample coming from a continuous distribution f, we may presume that there is some unknown (nonzero) density around xi. In other words, there might be some data points from the population that are very close to xi. These nearby data points are represented by a density function called the kernel.
However, how close to xi do the data points have to be (since the term "close" is relative)? This is determined by the bandwidth h. The following paragraphs describe this a bit more clearly.
If you look at the formula of f_hat(x), the term K((x - xi) / h) denotes that the kernel function is centered at xi and evaluates the PDF at point x, scaled by h. As mentioned in the previous paragraphs, most of the kernel functions (centered at 0) have a support from -1 to 1. If the kernel function is centered at 0, then the minimum and maximum data in its support are -1 and 1 respectively. Now, if the kernel function is centered at xi, then the minimum and maximum data are xi - 1 and xi + 1 respectively. The same concept applies when a bandwidth h is introduced. The calculation goes as follows:
- min value: (x - xi) / h = -1, which yields x_min = xi - h
- max value: (x - xi) / h = 1, which yields x_max = xi + h
In summary, the bandwidth h stretches the support of the kernel function to [xi - h, xi + h].
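A tiny sketch to illustrate this: using the triangular kernel from earlier (the particular xi, h, and evaluation points are made up), the kernel term is nonzero only inside [xi - h, xi + h]:

```python
import numpy as np

def triangular_kernel(u):
    return np.where(np.abs(u) <= 1, 1 - np.abs(u), 0.0)

xi, h = 2.0, 0.5
x = np.array([1.4, 1.6, 2.0, 2.4, 2.6])
# Nonzero only inside [xi - h, xi + h] = [1.5, 2.5]
print(triangular_kernel((x - xi) / h))  # [0.  0.2 1.  0.2 0. ]
```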
Using f_hat(x), we can estimate the probability density at any point x with the following steps:
- Determine the bandwidth h that will be used.
- For each sample data point xi, calculate the probability density for x using the kernel function centered at xi. Specifically, calculate K((x - xi) / h). Let's denote the result as PD(x, i).
- Sum up PD(x, i) for i = 1 to n (n is the number of sample data points).
- Divide the sum by n*h, per the formula of f_hat(x).
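Here's a sketch of those steps as a short script; the sample data, bandwidth, evaluation point, and the Gaussian kernel choice are illustrative assumptions:

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

samples = np.array([1.2, 1.9, 2.3, 3.1, 3.8])  # made-up sample data
h = 0.5   # step 1: choose the bandwidth
x = 2.0   # the point whose density we want to estimate

# Step 2: per-sample contributions PD(x, i) = K((x - xi) / h)
pd = gaussian_kernel((x - samples) / h)

# Step 3: sum the contributions for i = 1 to n
total = pd.sum()

# Step 4: divide by n*h, per the formula of f_hat(x)
f_hat = total / (len(samples) * h)
print(f_hat)
```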