Regularized Convolutional Wasserstein Barycenters
==

> [name="Nicolas Bonneel"]

[TOC]

Optimal transport provides a way to interpolate between several probability distributions. This approach has advantages and disadvantages that we will explore. It can be used to efficiently interpolate between voxelized shapes, with the advantage that it can interpolate between shapes of arbitrary topologies (e.g., interpolating between an annulus and a disk does not pose any issue), and without the need to specify correspondences. It would also be tempting to interpolate between images. This is in general not advised (except perhaps in specific contexts), as the notion of "conservation of mass" between images is not very relevant, and more importantly, because many approaches work much better for this task (e.g., algorithms that match feature points, or deep learning techniques).

**Example:**

<center> <div style='width: 80%; display: inline-block; padding : 5px; vertical-align : top'> <center> <img width = 50% alt='pixel advection' src='https://i.imgur.com/5VcEFYh.gif'></img><br/><br/> Fig.1. The benefit of using optimal transport to interpolate shapes is that it does not require feature points to be matched, and is agnostic to changes in topology (here, interpolating between a holed triangle and two disks) </center> <br/> </div> </center>

The idea of this project is thus to compute interpolations of rasterized shapes in 2d (though the approach trivially extends to 3d voxels). The objective of this TP is to implement an *entropy regularized* optimal transport approach to solve this problem.

## Warm-up: Downloading the code

The skeleton of the code is available on [this Github Project](https://github.com/dcoeurjo/transport-optimal-gdrigrv-11-2020). This code contains all you need to load and save grayscale images and apply a Gaussian blur to an image. You can download the archive from the Github interface (*"download zip"* menu), or *clone* the git repository.

The project contains some `c++` and `python` code. To compile the C++ tools, you will need [cmake](http://cmake.org) (>=3.14) to generate the project. For instance, on Linux/macOS:

``` bash
git clone https://github.com/dcoeurjo/transport-optimal-gdrigrv-11-2020
cd transport-optimal-gdrigrv-11-2020
cd c++
mkdir build ; cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make
```

On Windows, use cmake to generate a Visual Studio project. We also provide some `python` code to load/display images and apply a Gaussian blur. The practical work described below can be done either in `c++` or `python`.

In the next section, we briefly introduce entropy regularized optimal transport (you may skip this section if you have followed the lecture).

## Recap: Entropy Regularized Optimal Transport and Barycenters

### Entropy Regularized Optimal Transport

A discrete version of the Kantorovich formulation of optimal transport between two histograms $f$ and $g$ discretized on $n$ bins reads:

$$W(f,g) = \min_{M\in \mathcal{B}(f,g)} \sum_{i=1..n,\ j=1..n}c_{ij}\, m_{ij}$$

where $M = [m_{ij}]_{ij}$ is a matrix called the *transport plan*, whose entries model the amount of mass traveling from location $x_i$ to location $x_j$ at a cost $c_{ij}$. We can similarly denote a cost matrix $C = [c_{ij}]_{ij}$. Here, $\mathcal{B}(f,g)$ denotes the set of matrices with non-negative elements whose rows sum to $f$ and columns sum to $g$: $\mathcal{B}(f,g) = \{ [b_{ij}]_{ij} ~ | ~b_{ij}\geq 0, \sum_i b_{ij} = g_j, \sum_j b_{ij} = f_i \}$. This minimization defines a linear program.
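Since it is a linear program, small instances can be solved exactly with a generic LP solver, which is handy as a ground truth to validate the regularized algorithm implemented below. Here is a minimal sanity-check sketch in Python (the helper name `exact_ot` and the 1d quadratic cost are illustrative choices, not part of the provided code):

```python
# Sanity check for small n: solve the Kantorovich linear program exactly.
import numpy as np
from scipy.optimize import linprog

def exact_ot(f, g, C):
    """Exact W(f,g) for histograms f, g (same total mass) and cost matrix C."""
    n = len(f)
    # Row-sum constraints: sum_j m_ij = f_i  (m is flattened row-major)
    A_rows = np.kron(np.eye(n), np.ones((1, n)))
    # Column-sum constraints: sum_i m_ij = g_j
    A_cols = np.kron(np.ones((1, n)), np.eye(n))
    A_eq = np.vstack([A_rows, A_cols])
    b_eq = np.concatenate([f, g])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq)  # default bounds give m_ij >= 0
    return res.fun, res.x.reshape(n, n)

# 1d example with quadratic cost c(x,y) = |x-y|^2
x = np.linspace(0, 1, 10)
C = (x[:, None] - x[None, :])**2
f = np.ones(10) / 10                          # uniform histogram
g = np.exp(-(x - 0.5)**2 / 0.02); g /= g.sum()  # Gaussian-ish histogram
W, M = exact_ot(f, g, C)
```

The memory cost of the transport plan grows as $n^2$, which is precisely why this exact approach does not scale to images and why the regularized, convolution-based approach below is attractive.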
For simpler notation, this minimization can be rewritten using a dot product between the cost matrix and the transport plan: $$W(f,g) =\min_{M\in \mathcal{B}(f,g)} ~\langle C, M\rangle$$ The idea behind entropy-regularized optimal transport[^b2] is to add a regularization term that makes this objective function smooth. This term is given by the entropy of $M$ (note that other choices are possible[^b4]): $$W_\gamma(f,g) =\min_{M\in \mathcal{B}(f,g)} ~\langle C, M\rangle - \gamma\, h(M)$$ where $h(M) = -\sum_{ij} M_{ij} (\log M_{ij}-1)$ is the entropy of the transport plan. This minimization can be rewritten as a projection: $$W_\gamma(f,g) = \gamma \min_{M\in \mathcal{B}(f,g)} ~KL(M, \xi)$$ where $\xi = \exp(-C/\gamma)$ (using the classical elementwise exponential) and $KL(M, \xi) = \sum_{ij} M_{ij} (\log \frac{M_{ij}}{\xi_{ij}}-1)$. Intuitively, this corresponds to finding the best projection, in the sense of the KL divergence, of the matrix $\xi$ onto the set of constraints $\mathcal{B}(f,g)$.

Since this constraint set is the intersection of two affine constraints (one on the sum of rows, one on the sum of columns), an idea to find this projection is to start with $M^0=\xi$ and alternately project onto these two constraints. Projecting onto these constraints is easy: it merely amounts to normalizing the rows and columns of $M^{i-1}$ (we denote by $M^{i-1}$ the estimate of $M$ at iteration $i-1$) so that they sum to $f$ or $g$. We will call the row and column scaling factors $u^i$ and $v^i$; they are simply vectors telling by how much the rows or columns should be multiplied to obtain $f$ or $g$. They may also be interpreted as diagonal matrices that left- and right-multiply $\xi$, since $M^i = \text{diag}(u^i)\,\,\xi\,\,\text{diag}(v^i)$. We obtain the following algorithm, roughly using matlab notations (for our quadratic cost $\xi$ is symmetric, so $\xi' = \xi$):

```bash=
function [M, W] = Sinkhorn(f, g, C)
  xi = exp(-C/gamma)
  v = ones(n, 1)
  for i=1..K
    u = f ./ (xi * v)
    v = g ./ (xi' * u)
  end
  return M = diag(u)*xi*diag(v), W = gamma*sum(f .* log(u) + g .* log(v))
end
```

Here, we have taken advantage of the fact that we do not need to fully store the iterates $\{M^i\}_i$: we may build $M^i$ from the scaling factors $u^i$ and $v^i$ and the matrix $M^{i-1}$... which itself need not be stored. Ultimately, after $K$ iterations, $M^K$ can be recovered from $u^1,\dots,u^K$, $v^1,\dots,v^K$ and $\xi$, and the variables $u$ and $v$ in the code respectively denote $\Pi_i u^i$ and $\Pi_i v^i$.

An important special case is when $c(x,y)=\|x-y\|^2$ and the domain we are working on is a regular grid. The matrix-vector multiplication $\xi u$ (and similarly $\xi v$) amounts to computing $\sum_i \exp(-c(x_i, x_j)/\gamma) \, u_i$, which, in this case, is exactly a convolution with a Gaussian kernel[^b1] of standard deviation $\sigma = \sqrt{\gamma/2}$. This formulation has huge computational benefits: Gaussian kernels are separable, so the convolution can be performed as two 1d convolutions, which is much more efficient than a dense matrix-vector product.

### Wasserstein Barycenters

Similarly to a Euclidean barycenter, a Wasserstein barycenter of histograms $\{h_i\}_{i=1..N}$ with weights $\{\lambda_i\}_{i=1..N}$ is a histogram $\tilde{b}$ minimizing: $$ \tilde{b} = \underset{b}{\text{argmin}} \sum_i \lambda_i W(b, h_i)$$ While exact Wasserstein barycenters can be very costly and difficult to obtain, an entropy regularized version can be obtained by replacing $W$ by $W_\gamma$ in the formula above.
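Before moving on to barycenters, it can help to see the convolution trick in isolation. Below is a minimal NumPy sketch of the plain Sinkhorn loop above on a 2d grid, where products with $\xi$ are replaced by a separable Gaussian blur (the function name `sinkhorn_conv` and the use of `scipy.ndimage.gaussian_filter` are our own choices; the provided skeleton ships its own blur helpers):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def sinkhorn_conv(f, g, gamma, K=100, eps=1e-12):
    """Entropy-regularized OT between 2d histograms f and g (same shape,
    each summing to 1), with the quadratic cost handled as a Gaussian blur."""
    sigma = np.sqrt(gamma / 2.0)   # kernel std deviation for c(x,y) = |x-y|^2
    # Separable convolution by xi; mode='constant' zero-pads outside the image.
    # Caveat: gaussian_filter normalizes its kernel (and truncates it at 4*sigma
    # by default), which rescales xi by a constant: the transport plan is
    # unchanged, but the returned W is offset accordingly.
    blur = lambda a: gaussian_filter(a, sigma, mode='constant')
    v = np.ones_like(f)
    for _ in range(K):
        u = f / (blur(v) + eps)    # project onto the row-sum constraint
        v = g / (blur(u) + eps)    # project onto the column-sum constraint
    W = gamma * np.sum(f * np.log(u + eps) + g * np.log(v + eps))
    return u, v, W
```

The two `blur` calls per iteration replace what would otherwise be two dense matrix-vector products of size $n \times n$, with $n$ the number of pixels.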
This time, by considering $N$ transport plans between the barycenter $\tilde{b}$ and each of the $N$ histograms $\{h_i\}_{i=1..N}$, the Sinkhorn algorithm can be rewritten to produce regularized Wasserstein barycenters as follows:

```bash=
function b = RegularizedBarycenter(h[N], lambdas[N], C)
  xi = exp(-C/gamma)
  v = ones(N, n)
  for i=1..K
    for j=1..N
      u[j] = h[j] ./ (xi * v[j])
    end
    b = product_over_j((xi * u[j]).^lambdas[j])
    for j=1..N
      v[j] = b ./ (xi * u[j])
    end
  end
  return b
end
```

As in the classical Sinkhorn algorithm, all `xi * u[j]` products can be implemented as convolutions when dealing with grids and a quadratic cost.

## Implementing regularized Wasserstein barycenters, the "TP"

The goal of this TP is to implement the `RegularizedBarycenter` function and experiment with its parameters. As you shall see, the algorithm is quite sensitive to the parameter $\gamma$: too low values yield numerical issues (slower convergence, computations blowing up) while too large values over-blur the resulting barycenter.

Using K = 100 iterations, and varying $\gamma = 2\sigma^2$:

<div style='width: 32%; display: inline-block; padding : 5px; vertical-align : top'> <img alt='gamma=50, K=100' src='https://i.imgur.com/eEQDeFV.png'></img> <center> γ = 50 </center> </div> <div style='width: 32%; display: inline-block; padding : 5px; vertical-align : top'> <img alt='gamma=18, K=100' src='https://i.imgur.com/mPmG1Kr.png'></img> <center> γ = 18<br/> white images resulting from numerical blowup </center> </div> <div style='width: 32%; display: inline-block; padding : 5px; vertical-align : top'> <img alt='gamma=200, K=100' src='https://i.imgur.com/3fg92DE.png'></img> <center> γ = 200<br/> results in overly blurred images </center> </div>

Using K = 5 iterations:

<div style='width: 32%; display: inline-block; padding : 5px; vertical-align : top'> <img alt='gamma=50, K=5' src='https://i.imgur.com/bGCyoiz.png'></img> <center> γ = 50 </center> </div> <div style='width: 32%; display: inline-block; padding : 5px; vertical-align : top'> <img alt='gamma=18, K=5' src='https://i.imgur.com/7Om39DK.png'></img> <center> γ = 18<br/> not enough iterations to blow up </center> </div> <div style='width: 32%; display: inline-block; padding : 5px; vertical-align : top'> <img alt='gamma=200, K=5' src='https://i.imgur.com/TBp3QzK.png'></img> <center> γ = 200<br/> almost no difference with K=100 since convergence is faster </center> </div>

### The core

We will implement the algorithm for Wasserstein barycenters described above.
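To fix ideas, a direct NumPy transcription of the pseudocode could look like the sketch below (the function name `regularized_barycenter` and the `gaussian_filter`-based blur are our own choices; it assumes inputs already normalized to sum to 1, and it anticipates some of the remarks that follow by reusing each blurred `u[j]` and computing the geometric mean through logarithms):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def regularized_barycenter(h, lambdas, gamma, K=100, eps=1e-12):
    """Entropy-regularized Wasserstein barycenter of 2d histograms h[0..N-1]
    (same shape, each summing to 1) with weights lambdas summing to 1."""
    sigma = np.sqrt(gamma / 2.0)            # quadratic cost -> Gaussian kernel
    blur = lambda a: gaussian_filter(a, sigma, mode='constant')
    N = len(h)
    v = [np.ones_like(h[0]) for _ in range(N)]
    for _ in range(K):
        u = [h[j] / (blur(v[j]) + eps) for j in range(N)]
        xiu = [blur(u[j]) for j in range(N)]  # computed once, reused twice below
        # weighted geometric mean, evaluated in the log domain for robustness
        b = np.exp(sum(lambdas[j] * np.log(xiu[j] + eps) for j in range(N)))
        v = [b / (xiu[j] + eps) for j in range(N)]
    return b
```

The loops over `j` are embarrassingly parallel; in the C++ version, `blur` would be replaced by the separable Gaussian blur provided in the skeleton.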
While the algorithm seems straightforward at first sight, there are in fact a few things to consider:

* You should make sure the images sum to one before interpolating! To do that, simply divide the image pixel values by their sum. However, doing so will result in overly dark images, and the brightness will be lost. You may store these scaling factors and linearly interpolate them (alongside the optimal transport interpolation of the normalized images). Before saving the images to disk, simply multiply each pixel value by this linearly interpolated scaling factor.
* In the pseudocode, the function `product_over_j` simply computes a geometric mean of the images $\xi * u[j]$, each weighted by $\lambda[j]$. A geometric mean of values that can be extremely small or large may not be robust. It is advised to replace $\Pi_j (\xi * u[j])^{\lambda[j]}$ by an arithmetic mean of logarithms: $\exp(\sum_j \lambda[j] \log (\xi * u[j]))$
* Before saving, you may need to clamp pixel values between 0 and 1.
* While the algorithm shows 3 convolutions per iteration, 2 of them are in fact the same, so the previous computation can be reused. Also, the loops over $N$ (the number of images we are taking the barycenter of) are independent and can thus be parallelized.
* It is tempting, for efficiency reasons, to truncate the Gaussian kernel, e.g., at $3\sigma$. It is important to realize that doing so prevents mass from being transported further than $3\sigma$ away, as this amounts to setting $c(x,y)=\infty$ in this region. This could make the optimal transport problem infeasible. Instead, for stability, one may extend the Gaussian to the entire image but clamp it to a very small minimum value $\epsilon$.

You will realize that reducing $\gamma$ too much results in numerical blow-up. There are solutions to this issue, notably performing the computations in the log domain[^b3]. Instead of storing the $u$ and $v$ values, one estimates $a=\log(u)$ and $b=\log(v)$. The convolutions $\xi*u$ and $\xi*v$ can then be performed via a log-sum-exp evaluation, which can be made stable by realizing that $a$ and $b$ are the dual variables of the Kantorovich linear program, and hence $a(x)+b(y) \leq c(x,y)$.

:::info
**Tip**: for all practical applications, you are strongly advised to use existing fast libraries that optimize these computations, such as [GeomLoss](https://www.kernel-operations.io/geomloss/) in Python.
:::

# References

[^b1]: Justin Solomon, Fernando de Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, Leonidas Guibas. 2015. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (SIGGRAPH 2015).
[^b2]: Marco Cuturi. 2013. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems (pp. 2292-2300).
[^b3]: Lenaic Chizat, Gabriel Peyré, Bernhard Schmitzer, François-Xavier Vialard. 2018. Scaling algorithms for unbalanced optimal transport problems. Mathematics of Computation, vol. 87, no 314, p. 2563-2609.
[^b4]: Montacer Essid, Justin Solomon. 2018. Quadratically regularized optimal transport on graphs. SIAM Journal on Scientific Computing, vol. 40, no 4, p. A1961-A1986.