RG inspired Machine Learning for lattice field theory

Machine learning has been a fast-growing field of research in several areas dealing with large datasets. We report recent attempts to use Renormalization Group (RG) ideas in the context of machine learning. We examine coarse graining procedures for perceptron models designed to identify the digits of the MNIST data. We discuss the correspondence between principal components analysis (PCA) and RG flows across the transition for worm configurations of the 2D Ising model. Preliminary results regarding the logarithmic divergence of the leading PCA eigenvalue were presented at the conference and have since been improved. More generally, we discuss the relationship between PCA and observables in Monte Carlo simulations, and the possibility of reducing the number of learning parameters in supervised learning by using RG-inspired hierarchical ansatzes.


Introduction
Machine learning has been a fast-growing field of research in several areas dealing with large datasets and should be useful in the context of Lattice Field Theory [1]. In these proceedings, we briefly introduce the concept of machine learning. We then report attempts to use Renormalization Group (RG) ideas for the identification of handwritten digits [2]. We review the multiple layer perceptron [3] as a simple method to identify digits with a high rate of success (typically 98 percent). We discuss Principal Components Analysis (PCA) as a method to identify relevant features. We consider the effects of PCA projections and coarse graining procedures on the success rate of the perceptron. The identification of the MNIST digits is not really a problem where some critical behavior can be reached. In contrast, the two-dimensional (2D) Ising model near the critical temperature $T_c$ offers the chance to sample the high temperature contours ("worms" [4]) at various temperatures near $T_c$. At the conference, we gave preliminary evidence that the leading PCA eigenvalue of the worm pictures has a logarithmic singularity related in a precise way to the singularity of the specific heat. We also discussed work in progress relating the coarse graining of the worm images to approximate procedures in the Tensor Renormalization Group (TRG) treatment of the Ising model [5][6][7]. Since the conference, much progress has been made on this question. We briefly summarize the content of an upcoming preprint on the subject [8].

What is machine learning?
If you open the web page of a Machine Learning (ML) course, you are likely to find the following definition: "machine learning is the science of getting computers to act without being explicitly programmed" (see e.g. Andrew Ng's course at Stanford [1]). You are also likely to find the statement that in the past decade, machine learning has been crucial for self-driving cars, speech recognition, effective web search, and the understanding of the human genome. We hope it will also be the case for Lattice Field Theory. From a pragmatic point of view, ML amounts to constructing functions that provide features (outputs) using data (inputs). These functions involve "trainable parameters" which can be determined using a "learning set" in the case of supervised learning. The input-output relation can be written in the generic form $y(v, W)$ with $v$ the inputs, $y$ the outputs and $W$ the trainable parameters. This is illustrated in Fig. 1 as a schematic representation of the so-called perceptron [3], where the output functions have the form $y_l = \sigma(\sum_j W_{lj} v_j)$ with $v_j$ the pixels, $W_{lj}$ the tunable parameters and $\sigma(x)$ the sigmoid function defined below. This simple parametrization allows you to recognize correctly 91 percent of the digits of the testing set of the MNIST data (see section 3).
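The input-output relation of the simple perceptron can be sketched in a few lines of NumPy. This is only an illustration: the weights are untrained random numbers, the input is a random stand-in for an MNIST image, and the names `perceptron_outputs`, `W` and `v` as well as the initialization scale are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def perceptron_outputs(v, W):
    """Return y_l = sigma(sum_j W[l, j] * v[j]) for one flattened image v."""
    return sigmoid(W @ v)

rng = np.random.default_rng(0)               # fixed seed, illustrative only
v = rng.random(784)                          # stand-in for 28 x 28 pixel values in [0, 1]
W = rng.normal(scale=0.01, size=(10, 784))   # untrained weights (assumed scale)
y = perceptron_outputs(v, W)                 # one output per digit
predicted_digit = int(np.argmax(y))          # digit with the largest output
```

With trained weights, `predicted_digit` would be compared to the known label of each test image to obtain the success rate.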

The MNIST data
A classic problem in ML is the identification of handwritten digits. There exists a standard learning set called the MNIST data [2]. It consists of 60,000 digits where the correct answer is known for each. Each image is a square with 28 × 28 grayscale pixels. There is a UV cutoff (pixels are uniform), an IR cutoff (the linear size is 28 lattice spacings) and a typical size (the width of lines is typically 4 or 5 pixels). Unless you are attempting to get a success rate better than 98 percent, you may use a black and white approximation and consider the images as Ising configurations: if the pixel has a value larger than some gray cutoff (0.5 in Fig. 3), the pixel is black, and white otherwise. Another simplification is to use a blocking, namely replacing groups of four pixels forming a 2 × 2 square by their average grayscale. The blocking process can only be repeated 5 times; after that, we obtain a uniform grayscale that makes the identification of the digit difficult.
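The two simplifications described above, the black and white approximation and the 2 × 2 blocking, can be sketched as follows. The function names and the test image (a hypothetical vertical stroke) are assumptions for illustration. The blocking function requires even side lengths, so as written it covers only the 28 → 14 → 7 steps.

```python
import numpy as np

def binarize(image, gray_cutoff=0.5):
    """Ising-like approximation: 1 (black) above the cutoff, 0 (white) otherwise."""
    return (image > gray_cutoff).astype(np.uint8)

def block_2x2(image):
    """Replace each 2 x 2 group of pixels by its average grayscale.

    Both side lengths must be even.
    """
    rows, cols = image.shape
    return image.reshape(rows // 2, 2, cols // 2, 2).mean(axis=(1, 3))

image = np.zeros((28, 28))
image[10:18, 12:16] = 1.0                 # hypothetical stroke, 4 pixels wide
black_and_white = binarize(image)         # 28 x 28 array of 0s and 1s
blocked_once = block_2x2(image)           # 14 x 14
blocked_twice = block_2x2(blocked_once)   # 7 x 7
```

Averaging preserves the mean grayscale of the image while erasing details below the block size.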
We now consider a simple model, called the perceptron, which generates 10 output variables, one associated with each digit, using functions of the pixels (the visible variables) with one intermediate set of variables called the hidden variables. The visible variables $v_i$ are the 28 × 28 = 784 pixel grayscale values between 0 and 1. We decided to take 196 = 784/4 hidden variables $h_k$. Later (see hierarchical approximations in section 5), we will "attach" them rigidly to 2 × 2 blocks of pixels. The hidden variables are defined by a linear mapping followed by an activation function $\sigma$: $h_k = \sigma(\sum_j W^{(1)}_{kj} v_j)$, with $k = 1, 2, \dots, 196$.
We choose the sigmoid function $\sigma(x) = 1/(1 + \exp(-x))$, a popular choice of activation function which tends to 0 at large negative input and to 1 at large positive input. Its derivative satisfies $\sigma'(x) = \sigma(x) - \sigma(x)^2$, which allows simple algebraic manipulations for the gradients. The output variables are defined in a similar way as functions of the hidden variables, $y_l = \sigma(\sum_k W^{(2)}_{lk} h_k)$, with $l = 0, 1, \dots, 9$, which we want to associate with the MNIST characters by having target values $y_l \simeq 1$ for $l$ = digit and $y_l \simeq 0$ for the 9 others. The trainable parameters $W^{(1)}_{kj}$ and $W^{(2)}_{lk}$ are determined by gradient search. Given the MNIST learning set $\{v^{(n)}_i\}$ with $n = 1, 2, \dots, N$ and $N \simeq 60{,}000$, together with the corresponding target vectors $\{t^{(n)}_l\}$, we minimize a loss function that measures the mismatch between the outputs and the targets. The weight matrices $W^{(1)}$ and $W^{(2)}$ are initialized with random numbers following a normal distribution. They are then optimized using a gradient method.
After one "learning cycle" (going through the entire MNIST training data), we get a performance of about 0.95 (number of correct identifications/number of attempts on an independent testing set of 10,000 digits). After 10 learning cycles, the performance saturates near 0.98 (with 196 hidden variables and learning parameters, which control the gradient changes, of 0.1 and 0.5). It is straightforward to introduce more hidden layers. With two hidden layers, the performance improves, but only very slightly. On the other hand, if we remove the hidden layer, the performance goes down to about 0.91, as mentioned above.
It is instructive to look at the outputs for the 2 percent of cases where the algorithm fails to identify the correct digits. These are often cases where humans would also hesitate. In the following, we will focus on getting comparable performance with fewer learning parameters rather than on attempting to reduce the number of failures.
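The gradient search for a perceptron with one hidden layer can be sketched as below. Several choices here are assumptions rather than the authors' setup: the loss is taken to be the squared error $\frac{1}{2}\sum_l (y_l - t_l)^2$, the update is plain gradient descent with a step of 0.1, and the sizes and data are toy stand-ins (the text uses 784 visible, 196 hidden and 10 output variables). The only ingredient taken directly from the text is the identity $\sigma' = \sigma - \sigma^2$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(v, W1, W2):
    """Hidden variables h = sigma(W1 v) and outputs y = sigma(W2 h)."""
    h = sigmoid(W1 @ v)
    y = sigmoid(W2 @ h)
    return h, y

def loss(v, t, W1, W2):
    """Assumed squared-error loss for a single sample."""
    _, y = forward(v, W1, W2)
    return 0.5 * np.sum((y - t) ** 2)

def gradients(v, t, W1, W2):
    """Gradients of the assumed loss, using sigma' = sigma - sigma^2."""
    h, y = forward(v, W1, W2)
    delta_out = (y - t) * (y - y**2)              # output layer
    delta_hid = (W2.T @ delta_out) * (h - h**2)   # back-propagated to hidden layer
    return np.outer(delta_hid, v), np.outer(delta_out, h)

rng = np.random.default_rng(1)
n_visible, n_hidden, n_out = 16, 4, 3    # toy sizes, not the MNIST ones
W1 = rng.normal(scale=0.5, size=(n_hidden, n_visible))
W2 = rng.normal(scale=0.5, size=(n_out, n_hidden))
v = rng.random(n_visible)                # synthetic input
t = np.array([0.0, 1.0, 0.0])            # target: second class

loss_before = loss(v, t, W1, W2)
learning_rate = 0.1                      # assumed step size
for _ in range(200):
    G1, G2 = gradients(v, t, W1, W2)
    W1 -= learning_rate * G1
    W2 -= learning_rate * G2
loss_after = loss(v, t, W1, W2)
```

In an actual training run the updates would loop over the whole learning set, once per learning cycle, instead of repeating a single sample.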
It has been suggested [9,10] that the hidden variables can be related to the RG "block variables" and that a hierarchical organization inspired by physics modeling could drastically reduce the number of learning parameters. This may be called "cheap" learning [11]. The notion of criticality or fixed point has not been identified precisely on the ML side. It is not clear that the technical meaning of "relevant", as used in a precise way in the RG context to describe unstable directions of the RG transformation linearized near a fixed point, can be used generically in an ML context. The MNIST data does not seem to have "critical" features, and we will reconsider this question for the more tunable Ising model near the critical temperature (see section 6).

Principal Component Analysis (PCA)
The PCA method has been used successfully for more than a century. It consists in identifying the directions with the largest variance (the most relevant directions). It may allow a drastic reduction of the information necessary to calculate observables. We call $v^{(n)}_i$ the grayscale value of the $i$-th pixel in the $n$-th MNIST sample. We first define $\bar{v}_i$ as the average grayscale value of the $i$-th pixel over the learning set. This average is shown in Fig. 3. We can now define the covariance matrix: $C_{ij} = \frac{1}{N} \sum_{n=1}^{N} (v^{(n)}_i - \bar{v}_i)(v^{(n)}_j - \bar{v}_j)$. We can project the original data onto a low-dimensional subspace corresponding to the largest eigenvalues of $C_{ij}$. The first nine eigenvectors are displayed in Fig. 4. The projections in subspaces of dimension 10, 20, ... 80 are shown in Fig. 5.
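The PCA steps above (pixel averages, covariance matrix, eigenvectors ordered by decreasing eigenvalue) can be sketched as follows. The data are a random stand-in for flattened images, and the $1/N$ normalization of the covariance is an assumption of this sketch.

```python
import numpy as np

def pca(samples):
    """PCA of samples with shape (N, D).

    Returns the mean vector, the eigenvalues in decreasing order and the
    matching eigenvectors as columns.
    """
    mean = samples.mean(axis=0)
    centered = samples - mean
    cov = centered.T @ centered / samples.shape[0]   # assumed 1/N normalization
    eigvals, eigvecs = np.linalg.eigh(cov)           # ascending order for symmetric input
    order = np.argsort(eigvals)[::-1]
    return mean, eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(2)
samples = rng.random((500, 49))      # synthetic stand-in for 7 x 7 images
mean, eigvals, eigvecs = pca(samples)
leading_direction = eigvecs[:, 0]    # direction of largest variance
```

The sum of the eigenvalues equals the total variance of the pixels, so the leading eigenvalues measure how much of that variance a low-dimensional projection retains.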

RG inspired approximations
We have reconstructed PCA projected training and learning sets either as images (we keep 784 pixels after the projection, as shown in Fig. 5, and proceed with the projected images following the usual procedure), or as abstract vectors (the coordinates of the image in the truncated eigenvector basis, without using the eigenvectors, which gives a much smaller dataset). The success rates of these PCA projections are shown in Fig. 5.
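The two representations of the projected data can be sketched as follows, assuming synthetic data and a truncation to 10 eigenvectors. The variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)
samples = rng.random((200, 64))            # synthetic stand-in for flattened images
mean = samples.mean(axis=0)
centered = samples - mean
_, eigvecs = np.linalg.eigh(centered.T @ centered / len(samples))
basis = eigvecs[:, ::-1][:, :10]           # 10 leading eigenvectors as columns

# "Abstract vectors": coordinates in the truncated eigenvector basis.
coords = centered @ basis                  # shape (200, 10)

# "Images": the same information mapped back to pixel space.
projected_images = mean + coords @ basis.T # shape (200, 64)
```

Both arrays carry the same information, but `coords` has far fewer entries per sample, which is what makes the abstract-vector dataset much smaller.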
We have used the one hidden layer perceptron with blocked images. First we replaced squares of four pixels by a single pixel carrying the average value of the four blocked pixels. Using the 14 × 14 blocked pictures with 49 hidden variables, the success rate goes down slightly (97 percent). Repeating once, we obtain 7 × 7 images. With 25 hidden variables, we get a success rate of 92 percent.
As mentioned before, we can also replace the grayscale pixels by black and white pixels. This barely affects the performance (97.6 percent) but reduces the configuration space from $256^{784}$ to $2^{784}$ and allows a Restricted Boltzmann Machine treatment.
We have considered the hierarchical approximation where each hidden variable is only connected to a single 2 × 2 block of visible variables (pixels): $h_l = \sigma(\sum_{\alpha=1}^{4} W^{(1)}_{l\alpha} v_{l,\alpha})$, where $\alpha = 1, 2, 3, 4$ is the position in the 2 × 2 block and $l = 1, \dots, 196$ labels the blocks. Even though the number of parameters that we need to determine with the gradient method is significantly smaller (by a factor 196), the performance remains 0.92. A generalization with 4 × 4 blocks leads to a 0.90 performance with 1/4 as many weights. This simplified version can be used as a starting point for a full gradient search (pretraining), but the hierarchical structure (sparsity of $W_{ij}$) is robust and remains visible during the training. This pretraining breaks the huge permutation symmetry of the hidden variables.
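The hierarchical connectivity, with each hidden variable attached to one 2 × 2 block, can be sketched as follows. The helper names `to_blocks` and `hierarchical_hidden`, the ordering of the four positions inside a block, and the random weights are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def to_blocks(image):
    """Rearrange a (2R, 2C) image into shape (R*C, 4): one row per 2 x 2 block."""
    rows, cols = image.shape
    return (image.reshape(rows // 2, 2, cols // 2, 2)
                 .transpose(0, 2, 1, 3)
                 .reshape(-1, 4))

def hierarchical_hidden(image, W_block):
    """h_l = sigma(sum_alpha W_block[l, alpha] * v[l, alpha]), one h per block."""
    return sigmoid(np.sum(W_block * to_blocks(image), axis=1))

rng = np.random.default_rng(4)
image = rng.random((28, 28))             # stand-in for an MNIST image
W_block = rng.normal(size=(196, 4))      # 4 weights for each of the 196 blocks
h = hierarchical_hidden(image, W_block)

n_full = 196 * 784                       # weights of the fully connected first layer
n_hier = W_block.size                    # weights of the hierarchical ansatz
reduction_factor = n_full // n_hier
```

The reduction factor computed here is 196, the factor quoted in the text for the first layer.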

Transition to the 2D Ising model
The MNIST data has a typical size built into the images, namely the width of the lines, and UV details can be erased without drastic effects until that size is reached. One can think of the various digits as separate "phases", but there is nothing like a critical point connected to all the phases. It might be possible to think of the average $\bar{v}_i$ as a fixed point. However, there are no images close to it. In order to get images that can be understood as close to a critical point, we will consider the images of worm configurations for the 2D Ising model at different $\beta = 1/T$, some close to the critical value. The graphs in this section have been made by Sam Foreman. The worm algorithm [4] allows us to sample the high temperature contributions, whose statistics are governed by the number of active bonds in a given configuration. An example of an equilibrium configuration is shown in Fig. 7. We implemented a "coarse-graining" procedure where the lattice is divided into blocks of 2 × 2 sites, reducing the size of each linear dimension by a factor of two. Each 2 × 2 square is then "blocked" into a single site, where the new external bonds in a given direction are determined by the number of active bonds exiting a given square. If a given block has exactly one external bond in a given direction, the blocked site retains this bond in the blocked configuration; otherwise it is ignored. This is illustrated on the right side of Fig. 7.
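The blocking rule just described can be sketched as follows. The storage convention is an assumption of this sketch: `horiz[x, y]` is 1 when the bond from site (x, y) to site (x+1, y) is active, `vert[x, y]` is 1 when the bond from (x, y) to (x, y+1) is active, and periodic boundary conditions are implied by the array layout.

```python
import numpy as np

def block_worm(horiz, vert):
    """Coarse-grain a bond configuration on an L x L periodic lattice (L even).

    A blocked bond is kept only when exactly one of the two original bonds
    exiting the 2 x 2 block in that direction is active.
    """
    exit_x = horiz[1::2, 0::2] + horiz[1::2, 1::2]   # bonds leaving each block in +x
    exit_y = vert[0::2, 1::2] + vert[1::2, 1::2]     # bonds leaving each block in +y
    return (exit_x == 1).astype(np.uint8), (exit_y == 1).astype(np.uint8)

L = 4
# A plaquette loop through sites (1,1), (2,1), (2,2), (1,2): it straddles four blocks.
horiz = np.zeros((L, L), dtype=np.uint8)
vert = np.zeros((L, L), dtype=np.uint8)
horiz[1, 1] = horiz[1, 2] = 1
vert[1, 1] = vert[2, 1] = 1
horiz_b, vert_b = block_worm(horiz, vert)

# A plaquette loop contained in a single block has no external bonds.
horiz_in = np.zeros((L, L), dtype=np.uint8)
vert_in = np.zeros((L, L), dtype=np.uint8)
horiz_in[0, 0] = horiz_in[0, 1] = 1
vert_in[0, 0] = vert_in[1, 0] = 1
horiz_in_b, vert_in_b = block_worm(horiz_in, vert_in)
```

In this example the loop that straddles four blocks survives as a closed loop on the 2 × 2 blocked lattice, while the loop contained in one block is erased.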
The worm algorithm allows statistically exact calculations of the specific heat. In order to observe finite size effects, we performed this analysis for lattice sizes L = 4, 8, 16, 32. Using arguments that will be presented elsewhere [8], we conjectured that near criticality, the largest PCA eigenvalue $\lambda_{max}$ is proportional to the specific heat per unit of volume, with a proportionality constant, expressed in terms of $\ln(1+\sqrt{2})$, of approximately 0.52. The good agreement is illustrated in Fig. 6.
Figure 7. Example of a legal high temperature contribution, also called a worm (left). All the paths close due to periodic boundary conditions. Example of "worm blocking" applied to the same configuration (right).
To cross check the accuracy of the worm algorithm for some observables, the TRG was used. We calculated $\langle N_b \rangle$ for a blocked and an unblocked lattice using the TRG, and calculated $\langle (N_b - \langle N_b \rangle)^2 \rangle$ for the unblocked case. Good agreement was found between the worm algorithm and the TRG. The TRG also provides a clear picture of which bond numbers are associated with which states, as opposed to editing pictures by setting pixels to chosen values, as happens in the blocked case for a block of four sites. This correspondence puts the RG picture for the configurations on firmer ground.
Results for the leading PCA eigenvalue and specific heat for blocked configurations were found to be qualitatively similar to the unblocked results. The blocking procedure is approximate. Systematically improvable approximations can be constructed with the TRG method [5][6][7]. Improvements to the blocking method used here were developed after the conference and will be presented elsewhere [8].