nLab kernel method

Contents

Idea

In the context of machine learning, a kernel method is effectively the analysis of a data set $D$ through a choice of a distance/metric function $d \colon D \times D \to \mathbb{R}$ on it. The corresponding integral kernels $\exp(-\lambda \cdot d(-,-))$ turn out to contain useful information, if chosen correctly.

As a special case, if the data set $D$ is a “choice of rankings”, hence the set of permutations of $n \in \mathbb{N}$ elements, distance functions come from the graph distance of Cayley graphs of the symmetric group, such as the Kendall distance (or the Cayley distance), with associated kernels such as the Mallows kernel. With these choices, the kernel method overlaps with geometric group theory.
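
For illustration, the following minimal Python sketch (function names and the value of $\lambda$ are illustrative choices, not taken from any reference here) computes the Kendall distance between two rankings and the associated Mallows kernel $\exp(-\lambda \cdot d(-,-))$:

```python
import numpy as np
from itertools import combinations

def kendall_distance(p, q):
    """Kendall distance: the number of pairs of items that the two
    rankings p and q order oppositely (discordant pairs)."""
    rank_q = {item: i for i, item in enumerate(q)}
    return sum(1 for a, b in combinations(p, 2) if rank_q[a] > rank_q[b])

def mallows_kernel(p, q, lam=1.0):
    """Mallows kernel exp(-lambda * d_Kendall(p, q))."""
    return np.exp(-lam * kendall_distance(p, q))

# two rankings of the items 0, 1, 2, 3
p = (0, 1, 2, 3)
q = (1, 0, 3, 2)
print(kendall_distance(p, q))        # 2 discordant pairs
print(mallows_kernel(p, q, lam=0.5)) # exp(-1.0) ≈ 0.368
```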

Details

A kernel on a set $X$ is a function $K \colon X \times X \to \mathbb{R}$ (although complex-valued kernels are also used) which is symmetric and positive definite, in the sense that for any $N \geq 1$ and any $x_1, \ldots, x_N \in X$, the matrix $K_{i j} = K(x_i, x_j)$ is positive definite, i.e. $\sum_{i, j} c_i c_j K_{i j} \geq 0$ for all $c_1, \ldots, c_N \in \mathbb{R}$.
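
For concreteness, here is a small numerical sketch (in Python, with an illustrative choice of kernel and sample points) of what this condition says: the Gram matrix of a kernel on any finite collection of points should have no negative eigenvalues.

```python
import numpy as np

def is_positive_semidefinite(K, tol=1e-10):
    """Check that a symmetric Gram matrix K satisfies
    sum_{i,j} c_i c_j K_ij >= 0 for all c, i.e. has no
    eigenvalue below -tol (numerical tolerance)."""
    K = np.asarray(K, dtype=float)
    assert np.allclose(K, K.T), "a kernel Gram matrix must be symmetric"
    return bool(np.linalg.eigvalsh(K).min() >= -tol)

# Gram matrix of the kernel K(x, y) = exp(-(x - y)^2) on a few sample points
x = np.array([0.0, 0.5, 2.0, 3.0])
K = np.exp(-(x[:, None] - x[None, :])**2)
print(is_positive_semidefinite(K))   # True
```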

This may be reformulated as a mapping $\phi \colon X \to H$, where $H$ is a reproducing kernel Hilbert space: a function space in which pointwise evaluation is a continuous linear functional.

The ‘kernel’ and ‘feature map to Hilbert space’ stories relate to each other as follows:

$$K(x, \cdot) = \phi(x).$$

The ‘reproducing’ aspect of $H$ means that

$$\langle f(\cdot), K(x, \cdot) \rangle = f(x),$$

and this is continuous. So we have

$$K(x, y) = \langle \phi(x), \phi(y) \rangle.$$

$H$ is the closure of the span of the set of functions $\{ K(x, \cdot) \,|\, x \in X \}$, and, under certain conditions, when we find the $f \in H$ for which a functional is minimised, it takes the form

$$f(x) = \sum_{i = 1}^m c_i K(x_i, x).$$
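
For example (a minimal Python sketch with an illustrative Gaussian kernel and illustrative parameter values; this particular functional is not singled out in the text): when the functional is a regularized least-squares objective, the coefficients $c_i$ solve the linear system $(K + \lambda I) c = y$, which is kernel ridge regression.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Gaussian kernel matrix K_ij = exp(-(x_i - y_j)^2 / sigma^2) for 1d inputs."""
    return np.exp(-(X[:, None] - Y[None, :])**2 / sigma**2)

# noisy 1d training data
rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, size=30)
y_train = np.sin(x_train) + 0.1 * rng.normal(size=30)

# the minimiser of  sum_i (f(x_i) - y_i)^2 + lam * |f|_H^2  has the form
# f(x) = sum_i c_i K(x_i, x)   with   c = (K + lam * I)^{-1} y
lam = 1e-2
K = rbf_kernel(x_train, x_train)
c = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

# evaluate f on new points
x_test = np.linspace(-3, 3, 5)
print(rbf_kernel(x_test, x_train) @ c)
```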

Many linear algebra techniques in $H$ just involve the inner product, and so can be conducted as a form of nonlinear algebra back in $X$, using the ‘kernel trick’.
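
A small example of the trick (again a sketch with illustrative values and a Gaussian kernel): distances between the feature vectors $\phi(x), \phi(y) \in H$ require only kernel evaluations, since $\|\phi(x) - \phi(y)\|^2 = K(x,x) - 2 K(x,y) + K(y,y)$.

```python
import numpy as np

def K(x, y, sigma=1.0):
    """Gaussian kernel on vectors x, y."""
    return np.exp(-np.sum((x - y)**2) / sigma**2)

x = np.array([0.0, 1.0])
y = np.array([1.0, 2.0])

# squared distance between phi(x) and phi(y) in H, from kernel values only
dist_sq_H = K(x, x) - 2 * K(x, y) + K(y, y)
print(dist_sq_H)
```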

Kernels characterise the similarity of elements of the input set. But how are they chosen? Often a Gaussian radial basis function (RBF) kernel is selected:

$$K(x, y) = e^{-\|x - y\|^2 / \sigma^2}.$$
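
The bandwidth $\sigma$ controls how quickly the similarity decays with distance; a brief sketch (the values are illustrative, and tuning $\sigma$ by cross-validation is common practice rather than something prescribed here):

```python
import numpy as np

def gaussian_rbf(x, y, sigma):
    """K(x, y) = exp(-|x - y|^2 / sigma^2)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.exp(-np.sum((x - y)**2) / sigma**2)

x, y = [0.0, 0.0], [1.0, 1.0]
for sigma in (0.5, 1.0, 2.0):
    print(sigma, gaussian_rbf(x, y, sigma))  # larger sigma -> slower decay
```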

Note that there’s a Bayesian version of kernel methods which uses Gaussian processes.

Example

Consider binary classification, where we look to form a classifier which accurately finds the labels in $\{\pm 1\}$ of samples from a distribution over $X \times \{\pm 1\}$ on the basis of their $X$ values, given a set of labelled training data. The support vector machine approach to this task looks to find the hyperplane in the associated Hilbert space $H$ which best separates the images of data points $\phi(x_i)$, so that those with the same label $(y_i)$ fall in the same half space. The control for overfitting comes from finding such a hyperplane that classifies the training data accurately with the largest perpendicular distance to the nearest training points (the support vectors).
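
An illustrative sketch of this setup using scikit-learn (a library choice not made in the text; the data and parameter values are arbitrary):

```python
import numpy as np
from sklearn.svm import SVC

# toy 2d data with labels in {+1, -1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.5, 0.7, size=(50, 2)),
               rng.normal(-1.5, 0.7, size=(50, 2))])
y = np.array([+1] * 50 + [-1] * 50)

# soft-margin SVM with a Gaussian RBF kernel;
# C trades off margin width against training errors, gamma = 1/sigma^2
clf = SVC(kernel="rbf", C=1.0, gamma=0.5)
clf.fit(X, y)

print(clf.predict([[1.0, 1.0], [-2.0, -1.0]]))       # expected: [ 1 -1]
print("number of support vectors:", clf.support_vectors_.shape[0])
```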

References

General

Survey and introduction:

See also:

On positive definiteness via Fourier transform/Bochner's theorem:

Proof that the Mallows kernel and Kendall kernel are positive definite:

Discussion of weighted variants:

On Hamming distance kernels via topological quantum computation:

  • Alessandra Di Pierro, Riccardo Mengoni, Rajagopal Nagarajan, David Windridge: Hamming Distance Kernelisation via Topological Quantum Computation, in: Theory and Practice of Natural Computing. TPNC 2017, Lecture Notes in Computer Science 10687, Springer (2017) 269-280 [doi:10.1007/978-3-319-71069-3_21]

See also

In this paper I propose a generative model of supervised learning that unifies two approaches to supervised learning, using a concept of a correct loss function. Addressing two measurability problems, which have been ignored in statistical learning theory, I propose to use convergence in outer probability to characterize the consistency of a learning algorithm. Building upon these results, I extend a result due to Cucker-Smale, which addresses the learnability of a regression model, to the setting of a conditional probability estimation problem. Additionally, I present a variant of Vapnik-Stefanyuk’s regularization method for solving stochastic ill-posed problems, and use it to prove the generalizability of overparameterized supervised learning models.

In topological machine learning

On kernel methods applicable to persistence diagrams/barcodes for making topological data analysis amenable to “topological machine learning”:

  • Jan Reininghaus, Stefan Huber, Ulrich Bauer, Roland Kwitt, A Stable Multi-Scale Kernel for Topological Machine Learning, NIPS’15: Proceedings of the 28th International Conference on Neural Information Processing Systems 2 (2015) 3070-3078 [arXiv:1412.6821]

  • Roland Kwitt, Stefan Huber, Marc Niethammer, Weili Lin, Ulrich Bauer, Statistical Topological Data Analysis – A Kernel Perspective, in: Advances in Neural Information Processing Systems (NIPS 2015) [ISBN:9781510825024, doi:10.5555/2969442.2969582]

  • Bastian Rieck, Filip Sadlo, Heike Leitte, Topological Machine Learning with Persistence Indicator Functions, in: Topological Methods in Data Analysis and Visualization V (TopoInVis 2017), Mathematics and Visualization, Springer, 87-101 [arXiv:1907.13496, doi:10.1007/978-3-030-43036-8_6]

  • Raphael Reinauer, Matteo Caorsi, Nicolas Berkouk, Persformer: A Transformer Architecture for Topological Machine Learning [arXiv:2112.15210]

On kernel methods in topological data analysis via quantum computation:
