Part 1: Kernel Density Estimation using Density Matrices

The math of quantum mechanics helps us calculate density estimates faster

December 22, 2021 · 8 mins read

Density estimation methods can be used to tackle a variety of statistical and machine learning problems, including anomaly detection, generative modeling, semi-supervised learning, compression, and text-to-speech. A popular technique for computing density estimates at new samples in a non-parametric setup is Kernel Density Estimation (KDE), a method that suffers from costly evaluations, especially for large data sets and higher dimensions.

In this post we discuss the math behind an efficient way to calculate density estimates using density matrices (a concept from quantum mechanics).

Introduction

In many applications we have a finite set of data and we would like to know which probability distribution generated it. This problem has played a central role in statistical inference and has inspired many methods that rely on the density function, such as non-parametric regression when non-linear patterns are observed. In machine learning, many approaches to anomaly detection also make use of the probability density function.

The parametric approach to density estimation, given some data \(\mathbf{x}_1,\dots,\mathbf{x}_N\), assumes that each \(\mathbf{x}_i\) is sampled independently from a random vector \(\mathbf{X}\sim f(\mathbf{x};\mathbf{\theta})\), and the theory is developed around building an estimator \(\hat{\mathbf{\theta}}\) with good statistical properties such as unbiasedness, consistency, efficiency and sufficiency. The probability density of a new sample \(\mathbf{x}\) is given by: \[ \hat{f}(\mathbf{x}) = f(\mathbf{x};\hat{\mathbf{\theta}}) \]
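As a minimal sketch of the parametric approach, assuming (for illustration only) a univariate Gaussian family fitted by maximum likelihood with NumPy; the names `mu_hat`, `sigma_hat`, and `f_hat` are placeholders:

```python
import numpy as np

# Hypothetical example: assume the data come from a univariate Gaussian
# and estimate its parameters by maximum likelihood.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=1000)   # observed sample x_1, ..., x_N

mu_hat = data.mean()      # MLE of the mean
sigma_hat = data.std()    # MLE of the standard deviation

def f_hat(x):
    """Density of a new sample x under the fitted Gaussian model."""
    return np.exp(-0.5 * ((x - mu_hat) / sigma_hat) ** 2) / (sigma_hat * np.sqrt(2 * np.pi))

print(f_hat(2.0))   # estimated density at the query point x = 2.0
```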

Another way to obtain density estimates at new samples in a non-parametric fashion is KDE, which can be understood as a weighted sum of density contributions centered at each data point. Formally, given a univariate random sample \(X_1,\dots,X_N\) from an unknown distribution with density \(f\), the KDE estimator of the density at a query point \(x\in\mathbb{R}\) is given by

\[ \hat{f}(x) = \frac{1}{Nh}\sum_{i=1}^N K\left(\frac{x-X_i}{h}\right), \]

where \(h\) is the bandwidth and \(K:\mathbb{R}\rightarrow \mathbb{R}_{\geq 0}\) is a positive-definite function called the kernel.

Notice that a naive direct evaluation of KDE at \(m\) query points for \(N\) samples requires \(O(mN)\) kernel evaluations and \(O(mN)\) additions and multiplications. If we take the \(N\) sample points themselves as query points, the computational complexity is \(O(N^2)\), which becomes very expensive for large data sets and higher dimensions.
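To make the cost concrete, here is a minimal sketch of the naive univariate estimator with a Gaussian kernel (the function name `kde_naive` and the NumPy vectorization are illustrative; evaluating every (query, sample) pair is what makes the cost \(O(mN)\)):

```python
import numpy as np

def kde_naive(x_query, samples, h):
    """Naive Gaussian KDE: one kernel evaluation per (query point, sample) pair,
    i.e. O(m * N) work for m query points and N samples."""
    x_query = np.atleast_1d(x_query)
    u = (x_query[:, None] - samples[None, :]) / h     # shape (m, N)
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)    # Gaussian kernel
    return K.sum(axis=1) / (len(samples) * h)

rng = np.random.default_rng(0)
samples = rng.normal(size=10_000)
print(kde_naive([0.0, 1.0, 2.0], samples, h=0.2))
```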

According to Siminelakis, one way the literature addresses the scalability problem of the naive KDE evaluation is to devise fast approximate evaluations of the kernel, with two primary lines of effort: space-partitioning methods and Monte Carlo random sampling.

The formalism of density operators and density matrices was developed by von Neumann as a foundation of quantum statistical mechanics. From the point of view of machine learning, density matrices have an interesting feature: they combine linear algebra and probability, two of the pillars of machine learning, in a very particular but powerful way.

The central idea of this post is to use density matrices to represent probability distributions, tackling the important drawback of scalability and yielding a competitive strategy for computing densities at new samples.

Now let’s define and discuss the mathematical background needed for the method.

Multivariate version of Kernel Density Estimation

The multivariate kernel density estimator at a query point \(\mathbf{x}\in\mathbb{R}^d\) for a given random sample \(\mathbf{X}_1,\dots,\mathbf{X}_N\) drawn from an unknown density \(f\) is given by

\[\hat{f}_\gamma(\mathbf{x})=\frac{1}{N(\pi / \gamma)^{\frac{d}{2}}} \sum_{i=1}^{N} \exp \left({-\gamma\left\|\mathbf{x}-\mathbf{X}_{i}\right\|^{2}}\right)\]

where we define \(\gamma = \frac{1}{2\sigma^2}\), with \(\sigma\) the bandwidth of the assumed Gaussian kernel.
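A direct translation of this estimator into code might look like the following sketch (the function name `kde_gaussian` and the array shapes are assumptions for illustration):

```python
import numpy as np

def kde_gaussian(x, X, gamma):
    """Multivariate Gaussian KDE at a single query point x (shape (d,))
    for a sample X (shape (N, d)) and spread parameter gamma."""
    N, d = X.shape
    sq_dists = np.sum((X - x) ** 2, axis=1)           # ||x - X_i||^2
    return np.exp(-gamma * sq_dists).sum() / (N * (np.pi / gamma) ** (d / 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
print(kde_gaussian(np.zeros(2), X, gamma=2.0))
```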

Random Fourier Features

The use of Random Fourier Features (RFF) was explicitly introduced by Rahimi and Recht in 2007. In this methodology, the authors approximate a shift-invariant kernel by an inner product in an explicit Hilbert space. Formally, given a shift-invariant kernel \(k: \mathbb{R}^d\times\mathbb{R}^d \rightarrow \mathbb{R}\), they build a map \(\phi_{\text{rff}} : \mathbb{R}^d \rightarrow \mathbb{R}^D\) such that

\[ k(\mathbf{x},\mathbf{y}) \approx \phi_{\text{rff}}(\mathbf{x})^T \phi_{\text{rff}}(\mathbf{y}) \]

for all \(\{\mathbf{x},\mathbf{y}\}\subseteq \mathbb{R}^d\). The main theoretical result supporting the last equation is an instance of Bochner’s theorem, which shows that a shift-invariant positive-definite kernel \(k\) is the Fourier transform of a probability measure \(p\). Specifically, to approximate the integral in the Fourier transform we sample \(\mathbf{\omega}_1,\dots,\mathbf{\omega}_D\) from \(p\) and \(b_1,\dots,b_D\) from a uniform distribution on the interval \((0,2\pi)\), and define the map \(\phi_{\text{rff}}:\mathbb{R}^d\rightarrow\mathbb{R}^D\) as

\[ \phi_{\text{rff}}(\mathbf{x})=\left( \frac{1}{\sqrt{D}} \sqrt{2} \cos \left(\boldsymbol{\omega}_1^{\top} \mathbf{x}+b_1\right), \dots, \frac{1}{\sqrt{D}} \sqrt{2} \cos \left(\boldsymbol{\omega}_D^{\top} \mathbf{x}+b_D\right) \right)^T \]

Rahimi and Recht also showed that the inner product \(\phi_{\text{rff}}(\mathbf{x})^T \phi_{\text{rff}}(\mathbf{y})\) has expected value \(k(\mathbf{x}, \mathbf{y})\) and converges to it uniformly. In general, this methodology lets us reduce the complexity of kernel methods because kernel evaluations become faster.
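Here is a minimal sketch of the construction for the Gaussian kernel \(k(\mathbf{x},\mathbf{y})=\exp(-\gamma\lVert\mathbf{x}-\mathbf{y}\rVert^2)\), for which Bochner’s theorem gives \(p = N(\mathbf{0}, 2\gamma\mathbf{I}_d)\) (the helper `make_rff_map` is a hypothetical name):

```python
import numpy as np

def make_rff_map(d, D, gamma, rng):
    """Sample the RFF parameters for k(x, y) = exp(-gamma * ||x - y||^2)
    and return the feature map phi_rff: R^d -> R^D."""
    omega = rng.normal(scale=np.sqrt(2 * gamma), size=(D, d))   # omega_1, ..., omega_D ~ p
    b = rng.uniform(0.0, 2 * np.pi, size=D)                     # b_1, ..., b_D ~ Uniform(0, 2*pi)

    def phi_rff(x):
        return np.sqrt(2.0 / D) * np.cos(x @ omega.T + b)

    return phi_rff

rng = np.random.default_rng(0)
phi = make_rff_map(d=3, D=2000, gamma=0.5, rng=rng)
x, y = rng.normal(size=3), rng.normal(size=3)
# The inner product of the features approximates the kernel value.
print(phi(x) @ phi(y), np.exp(-0.5 * np.sum((x - y) ** 2)))
```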

Density Matrices

The state of a quantum system is represented by a vector \(\varphi\in\mathcal{H}\), where \(\mathcal{H}\) is the Hilbert space of the possible states of the system; typically \(\mathcal{H}=\mathbb{C}^d\).

Let’s consider the example of the spin of an electron. The state vector is given by

\[ \varphi = (\alpha,\beta)^T, \quad \lVert \alpha \rVert^2+\lVert \beta \rVert^2=1 \]

In general, the quantum state vector is a combination of basis states and this is called superposition. In the example the basis states are

\[ \uparrow := (1,0)^T, \quad \downarrow :=(0,1)^T \]

with \(\lVert \alpha \rVert^2\) and \(\lVert \beta \rVert^2\) being the probabilities of obtaining the corresponding basis state. A density matrix is a representation of the state of a quantum system that can represent quantum uncertainty (superposition) and classical uncertainty. We can express the state of the quantum system in one of two ways, depending on whether or not there is classical uncertainty:

i) Pure : \(\rho = \varphi \varphi^{*}=\begin{pmatrix} \lVert \alpha \rVert^{2} & \alpha \beta^{*} \\ \beta \alpha^{*} & \lVert \beta \rVert^{2} \end{pmatrix}\)

ii) Mixed : \(\rho=\sum_{i=1}^{N} p_{i} \varphi_{i} \varphi_{i}^{*};\quad p_i\geq 0;\quad \sum_{i=1}^{N} p_{i}=1\)
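The two definitions can be illustrated with a small NumPy sketch (the particular amplitudes and probabilities are arbitrary):

```python
import numpy as np

# Pure state for the spin example: a superposition with |alpha|^2 + |beta|^2 = 1.
alpha, beta = 1 / np.sqrt(2), 1j / np.sqrt(2)
phi = np.array([alpha, beta])
rho_pure = np.outer(phi, phi.conj())               # rho = phi phi*

# Mixed state: a classical mixture of the basis states with probabilities p_i.
up, down = np.array([1.0, 0.0]), np.array([0.0, 1.0])
p = [0.3, 0.7]
rho_mixed = p[0] * np.outer(up, up) + p[1] * np.outer(down, down)

print(np.trace(rho_pure).real, np.trace(rho_mixed))   # both traces equal 1
```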

Method: KDE using Density Matrices

  • Input. An observation of a \(d\)-dimensional random sample \(\mathbf{x}_1,\dots,\mathbf{x}_N\), number of random Fourier features \(D\in\mathbb{N}\) and spread parameter \(\gamma\in\mathbb{R}_{> 0}\).

  • Generate an observation \(\mathbf{\omega}_1,\dots,\mathbf{\omega}_D\) of a random sample of \(\mathbf{\omega}\sim N(\mathbf{0},2\gamma\mathbf{I}_d)\) and an observation \(b_1,\dots,b_D\) of a random sample of \(b\sim\text{Uniform}(0,2\pi)\) to build the map \(\phi_{\text{rff}}\) from the random Fourier features method that approximates a Gaussian kernel with parameters \(\gamma\) and \(D\).

  • Apply \(\phi_{\text{rff}}\) to each element \(\mathbf{x}_i\):

\[\mathbf{z}_i = \phi_{\text{rff}}(\mathbf{x}_i)\]
  • Calculate mixed state density matrix: \[ \rho = \frac{1}{N}\sum_{i=1}^N \mathbf{z}_i \mathbf{z}_i^T \]

  • The density estimation of a query point \(\mathbf{x}\) is calculated using Born’s rule

\[\hat{f}_{\rho}(\mathbf{x})=\frac{\operatorname{Tr}\left(\rho \phi_{\mathrm{rff}}(\mathbf{x}) \phi_{\mathrm{rff}}(\mathbf{x})^{T}\right)}{\mathcal{Z}}=\frac{\phi_{\mathrm{rff}}(\mathbf{x})^{T} \rho \phi_{\mathrm{rff}}(\mathbf{x})}{\mathcal{Z}}\]

where the normalizing constant is given by:

\[ \mathcal{Z}=\left(\frac{\pi}{2 \gamma}\right)^{\frac{d}{2}} \]

We call \(\hat f_\rho(\mathbf{x})\) the DMKDE estimate at \(\mathbf{x}\).
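Putting the steps together, a minimal end-to-end sketch of the method could look as follows; the function names `dmkde_fit` and `dmkde_predict` are hypothetical, and the sampling of \(\mathbf{\omega}\) with scale \(\sqrt{2\gamma}\) follows the Gaussian-kernel assumption above:

```python
import numpy as np

def dmkde_fit(X, D, gamma, rng):
    """Build the map phi_rff and the mixed-state density matrix rho
    from training data X of shape (N, d)."""
    N, d = X.shape
    omega = rng.normal(scale=np.sqrt(2 * gamma), size=(D, d))   # omega_1, ..., omega_D
    b = rng.uniform(0.0, 2 * np.pi, size=D)                     # b_1, ..., b_D
    phi = lambda x: np.sqrt(2.0 / D) * np.cos(x @ omega.T + b)
    Z = phi(X)                        # z_i = phi_rff(x_i), shape (N, D)
    rho = Z.T @ Z / N                 # rho = (1/N) sum_i z_i z_i^T
    return phi, rho, d

def dmkde_predict(x, phi, rho, d, gamma):
    """DMKDE estimate at a query point x via Born's rule."""
    z = phi(x)
    normalizer = (np.pi / (2 * gamma)) ** (d / 2)
    return (z @ rho @ z) / normalizer

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))
phi, rho, d = dmkde_fit(X, D=500, gamma=2.0, rng=rng)
print(dmkde_predict(np.zeros(2), phi, rho, d, gamma=2.0))
```

Note that once \(\rho\) has been computed, each new query costs \(O(D^2)\) operations regardless of \(N\), which is the source of the scalability gain over the naive KDE evaluation.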

What is next?

See implementation in Part 2.