Robust Frank-Wolfe Algorithm for Minimum Enclosing Bregman Balls

This post gathers theoretical convergence and complexity results on the Frank-Wolfe algorithm extended to Bregman divergences (instead of the Euclidean distance) and made robust to outliers. It is the result of a collaboration with Amaury Sabran, supervised by Prof. Frank Nielsen (LIX).

We propose a robust algorithm to approximately solve the slack minimum enclosing ball problem for a broad class of dissimilarity measures called Bregman divergences.
The slack minimum enclosing ball problem is a convex relaxation of the minimum enclosing ball problem that allows outliers. Bregman divergences include not only the traditional (squared) Euclidean distance but also various divergence measures based on entropic functions. We apply this algorithm to the Euclidean MEB, the MEB in a reproducing kernel Hilbert space, and the MEB on the probability simplex with the Kullback-Leibler divergence.

Introduction

The minimum enclosing ball of a set of points is the smallest ball that contains all these points. Such balls are used in clustering [1] and classification [2] problems.

The Standard Minimum Enclosing Ball Problem

For the standard minimum enclosing ball problem, we consider a Euclidean vector space $(E, <\cdot,\cdot>)$ and points $(z_1, \cdots, z_n)$ in $E$. The aim is to find the ball $B(c, R)$ with minimum radius that contains all the points. This can be written as the following convex optimization problem:

$$\operatorname{argmin}_{c \in E} R^2$$

$$\text{s.t.} \quad \Vert c-z_i \Vert^2 \leq R^2,\quad i=1..n$$

The radius of the enclosing ball is then $R = \max_i \Vert c-z_i \Vert$.
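For concreteness, here is a minimal sketch of this convex program solved with an off-the-shelf solver, assuming cvxpy and numpy are available; the data and variable names are purely illustrative.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 2))             # n = 50 points in dimension d = 2

c = cp.Variable(2)                        # candidate center
R2 = cp.Variable(nonneg=True)             # squared radius R^2

constraints = [cp.sum_squares(c - z) <= R2 for z in Z]
cp.Problem(cp.Minimize(R2), constraints).solve()

R = float(np.sqrt(R2.value))
# sanity check: the optimal radius equals the distance from c to the farthest point
assert np.isclose(R, np.linalg.norm(Z - c.value, axis=1).max(), atol=1e-4)
```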

The Slack-Bregman MEB Problem

We generalize this Euclidean problem to the case where the distance is derived from a Bregman divergence. Bregman divergences include the squared Euclidean distance, the squared Mahalanobis distance and entropy-based distances such as the Kullback-Leibler divergence.

In addition, we generalize the hard-ball problem to slack balls, i.e., enclosing balls allowing outliers. To do so, we add slack variables to the optimization problem that penalize the presence of outliers. We control the number of outliers with a slack parameter $C$. For $C \geq 1$, we recover the hard enclosing ball problem.

The slack Bregman MEB problem ($C$-SMEBB problem) is the following:
$$\min_{R,\xi,\theta_c}\ R + C\sum_{1\leq i \leq n}{\xi_i}$$

$$\text{s.t.} \quad A_{F,F^*}(\theta_i : \nabla F(\theta_c)) \leq R + \xi_i, \quad 1 \leq i \leq n,$$

$$\xi \geq 0$$

where $A_{F,F^*}$ is the canonical Bregman divergence defined in a later section.

Definition. A $C$-Slack Minimum Enclosing Bregman Ball of $P$, or $C$-SMEBB of $P$, is an optimal solution $(\theta^*, R^*)$ to this optimization problem.

The Frank-Wolfe algorithm

The first algorithm to iteratively solve a quadratic program over a simplex was described by Frank and Wolfe in 1956 [3]. At each step, the algorithm computes the gradient of the function to minimize and then takes a small step toward the point of the feasible set where this linear approximation is smallest. The convergence rate of this algorithm is $\Theta(1/k)$, where $k$ is the number of steps: the gap between the algorithm's value and the optimal value at step $k$ is smaller than $c/k$ for some constant $c$.
This algorithm can be applied directly to the minimum enclosing ball problem. At each step, the linear approximation is smallest in the direction of the point farthest from the current approximate ball center, which means that the center should take a small step toward the farthest point in the dataset (see the sketch below). The objective function of this minimization problem is not continuously differentiable; however, the geometric analysis in [4] shows a convergence rate similar to that of the general Frank-Wolfe algorithm.
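As an illustration, here is a minimal numpy sketch of this idea for the Euclidean MEB; the step-size schedule and iteration count are illustrative choices, not prescriptions from [4].

```python
import numpy as np

def meb_frank_wolfe(Z, K=1000):
    """Approximate the Euclidean MEB center of the rows of Z with Frank-Wolfe steps."""
    c = Z.mean(axis=0)                                      # start anywhere in the convex hull
    for k in range(K):
        far = Z[np.argmax(np.linalg.norm(Z - c, axis=1))]   # farthest point from the center
        gamma = 2.0 / (k + 2)                               # standard Frank-Wolfe step size
        c = (1 - gamma) * c + gamma * far                   # small step toward the farthest point
    return c, np.linalg.norm(Z - c, axis=1).max()

center, radius = meb_frank_wolfe(np.random.default_rng(0).normal(size=(200, 5)))
```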

There exist many variants of the Frank-Wolfe algorithm, notably with line search instead of fixed-size steps, away steps, and partial computation of the gradient to speed up the algorithm. In this post we focus on the original Frank-Wolfe algorithm, but many of these variants could also be adapted to the Bregman-divergence generalization.

Core-sets

Given a set of points $P \subset \mathbb{R}^d$ and $\epsilon>0$, an $\epsilon$-core-set $S$ is a subset of $P$ with the property that the smallest ball containing $S$ is within $\epsilon$ of the smallest ball containing $P$: if the smallest ball containing $S$ is expanded by a factor $1+\epsilon$, the expanded ball contains $P$.
Core-sets are relevant to the minimum enclosing ball problem because it is cheaper to compute an enclosing ball on a small subset of $P$.

In [4], the minimum enclosing ball is computed by representing the center as a convex combination of points of the dataset and iteratively adding new points to this representation. It is shown that after $\frac{1}{\epsilon^2}$ iterations, the center is a convex combination of points $\theta_1, \cdots, \theta_m$ that form an $\epsilon$-core-set of $P$. In [5], it is proven that the optimal size of an $\epsilon$-core-set of any point set $P$ is $\lceil \frac{1}{\epsilon}\rceil$ and that this bound is tight. This result is remarkable because it depends neither on the number of points in the dataset nor on the dimension of the data.

Therefore, in theory, an $\epsilon$-approximation of the MEB center can be expressed as a convex combination of at most $1/\epsilon$ points. When the dimension $d$ is fixed and finite, the MEB center can always be expressed as a convex combination of at most $d+1$ points.

Our approach

Frank-Wolfe Algorithm for the Slack Bregman MEB problem

Main Result

Let $E$ be a vector space, $<\cdot, \cdot>$ a scalar product over $E$ and $n$ a non-negative integer.
Let $P = \{\theta_1, \theta_2, \cdots, \theta_n\}$ be a finite subset of $E$.
Let $F$ be a convex, continuously differentiable function over a convex domain $T$ containing $P$.
For a function $f$ defined over a convex domain $D$, we define the curvature constant:
$$C_f = \sup_{x, s \in D,\ \gamma \in [0,1],\ y = x + \gamma (s-x)} \ \frac{2}{\gamma^2} \big( f(y) - f(x) - <y-x, \nabla f(x)>\big)$$

Let $D_F$ be the Bregman divergence associated to $F$ and $A_{F,F^*}$ its canonical divergence, as described in a following section.

We write:
$$\Delta_n = \{\alpha \in \mathbb{R}^n : \forall i \in 1..n,\ 0\leq \alpha_i \leq 1,\ \sum_i \alpha_i = 1\}$$

$$C\Delta_n = \{\alpha \in \mathbb{R}^n : \forall i \in 1..n,\ 0\leq \alpha_i \leq C,\ \sum_i \alpha_i = 1\}$$

(note that $C\Delta_n$ is non-empty only when $C \geq 1/n$, and coincides with $\Delta_n$ as soon as $C \geq 1$), and $(e_i)_{1\leq i\leq n}$ denotes the canonical basis of $\mathbb{R}^n$, whose elements are the vertices of $\Delta_n$.

We define the Slack Bregman Frank-Wolfe algorithm, which outputs $\theta, R$, the center and radius of the approximate $C$-SMEBB:

  • Let $\alpha^{(0)} \in C\Delta_n$

  • Let $m = \lceil\frac{1}{C}\rceil$

  • For $k = 0...K$ do:

    • $\theta^{(k)}:=\sum_{1\leq i \leq n}{\alpha_i^{(k)} \theta_i}$

    • $\eta^{(k)} := \nabla F (\theta^{(k)})$

    • $D^{(k)} :=(A_{F,F^*}(\theta_i : \eta^{(k)}))_{1\leq i \leq n}$

    • Let $i_1, \cdots , i_m$ be the indices of the $m$ largest entries of $D^{(k)}$ (unsorted), with $i_m$ the index of the $m$-th largest entry.

    • $s^{(k)} :=\sum_{j=1}^{m-1} C e_{i_j}+ (1-(m-1)C)\,e_{i_m}$

    • $R^{(k)} := A_{F,F^*}(\theta_{i_m} : \eta^{(k)})$

    • $gap^{(k)} := <s^{(k)}-\alpha^{(k)},D^{(k)}>$

    • $\alpha^{(k+1)}:=(1-\gamma_k)\alpha^{(k)} + \gamma_k s^{(k)}$ where $\gamma_k = 2/(k+2)$

  • Let $j := \operatorname{argmin}_{0 \leq k \leq K} gap^{(k)}$

  • Let $\theta,R := \theta^{(j)}, R^{(j)}$
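Below is a minimal numpy transcription of the loop above. The callables F, grad_F and F_star (the generator, its gradient and its convex conjugate) are assumptions of this sketch; the Euclidean instantiation at the end matches the generator used later in the post.

```python
import numpy as np

def slack_bregman_fw(thetas, C, F, grad_F, F_star, K=500):
    """Sketch of the Slack Bregman Frank-Wolfe algorithm (requires C >= 1/n)."""
    n = len(thetas)
    m = int(np.ceil(1.0 / C))
    alpha = np.full(n, 1.0 / n)                      # some alpha^(0) in C*Delta_n
    best_gap, best = np.inf, None
    for k in range(K + 1):
        theta = thetas.T @ alpha                     # theta^(k)
        eta = grad_F(theta)                          # eta^(k) = grad F(theta^(k))
        # D_i = A_{F,F*}(theta_i : eta) = F(theta_i) + F*(eta) - <theta_i, eta>
        D = np.array([F(t) for t in thetas]) + F_star(eta) - thetas @ eta
        top = np.argpartition(D, n - m)[n - m:]      # m largest entries (unsorted)
        i_m = top[np.argmin(D[top])]                 # index of the m-th largest
        s = np.zeros(n)
        s[top] = C
        s[i_m] = 1.0 - (m - 1) * C
        gap = (s - alpha) @ D
        if gap < best_gap:                           # keep the iterate with smallest gap
            best_gap, best = gap, (theta, D[i_m])
        alpha = (1 - 2.0 / (k + 2)) * alpha + (2.0 / (k + 2)) * s
    return best                                      # (center theta^(j), radius R^(j))

# Euclidean instantiation: F(x) = <x,x>/2, grad F = id, F* = F.
P = np.random.default_rng(0).normal(size=(100, 3))
sq = lambda x: 0.5 * float(x @ x)
center, radius = slack_bregman_fw(P, C=0.1, F=sq, grad_F=lambda x: x, F_star=sq)
```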

Theorem [Couairon, Sabran]. Let $(\theta^*, R^*)$ be a $C$-SMEBB of $P$. Let $(\theta, R)$ be the center and radius of the Bregman ball computed by the above algorithm run for $K$ steps.

Then,
$$D_F(\theta^*,\theta) \leq \frac{4 C_F}{K+2}$$

This algorithm therefore approximates the center of a slack Bregman minimum enclosing ball with a convergence rate of $\Theta(1/k)$.

Background Tools and results

In this post, we will not prove the above theorem, but we will present the tools we used from Bregman geometry and convex optimization.

Bregman Divergence

Let $\chi$ be a compact convex subset of $\mathbb{R}^d$. Let $D_F (\theta_1 : \theta_2)$ denote the Bregman divergence for a smooth, strictly convex, real-valued generator $F : \chi \to \mathbb{R}$:
$$D_F (\theta_1 : \theta_2) = F(\theta_1)-F(\theta_2)-<\theta_1-\theta_2, \nabla F(\theta_2)>$$

Informally speaking, the Bregman divergence $D_F$ is the tail of the first-order Taylor expansion of $F$: $D_F (\cdot : \theta_2)$ can be interpreted as the gap between $F$ and its linear approximation around $\theta_2$, as shown below. Since $F$ is strictly convex, $D_F(\theta_1 :\theta_2) \geq 0$ and $D_F(\theta_1 :\theta_2) = 0$ iff $\theta_1 = \theta_2$.

Figure: Visualizing the Bregman divergence.

Consider the convex conjugate of $F$, $F^*(\eta) = \sup_{\theta}\{<\theta, \eta>- F(\theta)\}$. We have $D_F(\theta_1:\theta_2)= D_{F^*}(\eta_2 :\eta_1)$ with $\eta = \nabla F (\theta)$ (and $\theta = \nabla F^*(\eta)$, so that $\nabla F \circ \nabla F^* = \operatorname{id}$). Define the canonical divergence $A_{F,F^*}(\theta_1 : \eta_2) = F(\theta_1) + F^*(\eta_2) -<\theta_1, \eta_2>$.

We have
$$D_F(\theta_1 :\theta_2) = D_{F^*}(\eta_2 : \eta_1) = A_{F,F^*}(\theta_1 : \eta_2)$$

$A_{F,F^*}$ is convex in each of its two arguments.

We define a right-centered Bregman ball $ball_F(\eta_c,r) = \{\theta \in \chi \mid A_{F,F^*}(\theta : \eta_c) \leq r\}$.

Given a finite set of points $P=\{\theta_1,\ldots,\theta_n\} \subset \chi$, the right-centered Minimum Enclosing Bregman Ball (rMEBB) is defined by the unique minimizer of
$$\min_{\eta_c}\max_{1\leq i \leq n}{A_{F,F^*}(\theta_i : \eta_c)}$$
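As a quick sanity check of the identity $D_F(\theta_1 :\theta_2) = A_{F,F^*}(\theta_1 : \eta_2)$, here is a small numpy example with the negative-entropy generator $F(\theta) = \sum_i \theta_i \log \theta_i$ on the positive orthant, whose gradient is $\nabla F(\theta) = \log\theta + 1$ and whose conjugate is $F^*(\eta) = \sum_i e^{\eta_i - 1}$; this particular generator is our illustrative choice.

```python
import numpy as np

F      = lambda t: float(np.sum(t * np.log(t)))          # negative entropy
grad_F = lambda t: np.log(t) + 1.0
F_star = lambda e: float(np.sum(np.exp(e - 1.0)))

def bregman(t1, t2):
    """D_F(t1 : t2) = F(t1) - F(t2) - <t1 - t2, grad F(t2)>."""
    return F(t1) - F(t2) - (t1 - t2) @ grad_F(t2)

def canonical(t1, e2):
    """A_{F,F*}(t1 : e2) = F(t1) + F*(e2) - <t1, e2>."""
    return F(t1) + F_star(e2) - t1 @ e2

rng = np.random.default_rng(0)
t1, t2 = rng.random(4) + 0.1, rng.random(4) + 0.1
assert np.isclose(bregman(t1, t2), canonical(t1, grad_F(t2)))
```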

Wolfe Duality

Consider a general optimization problem:
$$\min_x f(x)$$

$$\text{s.t.} \quad g_i(x) \leq 0,\quad i=1..m$$

The Lagrangian dual problem is
$$\max_{u} \ \inf_x \ f(x) + \sum_{j=1}^m u_j g_j(x)$$

$$\text{s.t.} \quad u_j \geq 0,\quad j=1..m$$

Provided that the functions $f$, $g_1, \ldots, g_m$ are convex and differentiable and that a solution exists, the infimum is attained where the gradient with respect to $x$ vanishes. The Wolfe dual problem uses this stationarity condition (one of the KKT conditions) as a constraint. The problem is therefore
$$\max_{x, u}\ f(x) + \sum_{j=1}^m u_j g_j(x)$$

$$\text{s.t.} \quad \nabla f(x) + \sum_{j=1}^m u_j \nabla g_j(x) = 0$$

$$u_j \geq 0,\quad j=1..m$$
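As a quick illustration (this toy example is ours, not part of the original derivation), consider the one-dimensional problem
$$\min_x x^2 \quad \text{s.t.}\quad 1-x \leq 0,$$
whose optimum is $x^*=1$ with value $1$. Its Wolfe dual is
$$\max_{x,u}\ x^2 + u(1-x) \quad \text{s.t.}\quad 2x - u = 0,\ u \geq 0.$$
Substituting $u = 2x$ (so $x \geq 0$) leaves $\max_{x\geq 0}\ 2x - x^2$, attained at $x=1$ with value $1$, which matches the primal optimum.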

Let us write the Wolfe dual to the SMEBB optimization problem.
By identification, we have
$$f(\eta_c, R, \xi) = R + C\sum_{1\leq i \leq n} \xi_i$$

$$g_i(\eta_c, R, \xi) = A_{F,F^*}(\theta_i : \eta_c) - R - \xi_i, \quad i=1..n$$

$$g_{n+i}(\eta_c, R, \xi) = - \xi_i, \quad i=1..n$$

$f, g_1, ..., g_{2n}$ are convex and differentiable with respect to the variables $R$, $\eta_c$, $\xi$. Therefore the Wolfe dual is

$$\max_{\eta_c, R, \xi, \alpha, \mu}\ R + \sum_{1\leq i \leq n} \alpha_i (A_{F,F^*}(\theta_i : \eta_c) - R - \xi_i) + \sum_{1\leq i \leq n} (C-\mu_i) \xi_i$$

$$\text{s.t.} \quad \alpha_i \geq 0,\ \mu_i \geq 0,\ \alpha_i + \mu_i = C, \quad i = 1..n$$

$$\eta_c = \nabla F \Big(\sum_{1\leq i \leq n} \alpha_i \theta_i\Big), \quad \sum_{1\leq i \leq n} \alpha_i= 1$$
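These constraints are just the stationarity condition $\nabla f + \sum_j u_j \nabla g_j = 0$ spelled out. Differentiating with respect to $R$, $\xi_i$ and $\eta_c$ respectively gives
$$1 - \sum_{1\leq i \leq n} \alpha_i = 0, \qquad C - \alpha_i - \mu_i = 0, \qquad \sum_{1\leq i \leq n} \alpha_i\big(\nabla F^*(\eta_c) - \theta_i\big) = 0,$$
and the last equation, combined with $\sum_i \alpha_i = 1$, reads $\nabla F^*(\eta_c) = \sum_i \alpha_i \theta_i$, i.e. $\eta_c = \nabla F(\sum_i \alpha_i \theta_i)$.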

From now on, we define the following maps:
$$\theta(\alpha) = \sum_{1\leq i \leq n} \alpha_i \theta_i \quad \text{and} \quad \eta(\alpha) = \nabla F (\theta(\alpha))$$

The Wolfe dual of the SMEBB problem can be simplified as
$$\max_{\alpha \in C\Delta_n} \sum_{1\leq i \leq n} \alpha_i A_{F,F^*}(\theta_i : \eta(\alpha))$$

We define the Wolfe dual function over CΔnC\Delta_n:
$$f(\alpha) = \sum_{1\leq i \leq n} \alpha_i A_{F,F^*}(\theta_i : \eta(\alpha))$$

We can also write
$$f(\alpha) = \sum_{1\leq i \leq n}{\alpha_i F(\theta_i)}-F(\theta(\alpha))$$
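Indeed, expanding the canonical divergence and using the Fenchel-Young equality $F(\theta(\alpha)) + F^*(\eta(\alpha)) = <\theta(\alpha), \eta(\alpha)>$ (which holds because $\eta(\alpha) = \nabla F(\theta(\alpha))$) together with $\sum_i \alpha_i = 1$:
$$f(\alpha) = \sum_{1\leq i \leq n} \alpha_i \big(F(\theta_i) + F^*(\eta(\alpha)) - <\theta_i, \eta(\alpha)>\big) = \sum_{1\leq i \leq n} \alpha_i F(\theta_i) + F^*(\eta(\alpha)) - <\theta(\alpha), \eta(\alpha)> = \sum_{1\leq i \leq n}{\alpha_i F(\theta_i)}-F(\theta(\alpha))$$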

This dual function is concave over $C\Delta_n$.

Jaggi's theorem

The Slack Bregman Frank-Wolfe algorithm is the Frank-Wolfe optimization algorithm applied to the Wolfe dual problem of the SMEBB problem. The convergence theorem is due to Jaggi (2013) [6].

Theorem [Jaggi, 2013]. Let $f$ be a convex, continuously differentiable function, and $D$ a compact convex subset of a vector space;
Let $x^* = \operatorname{argmin}_{x \in D} f(x)$ and $C_f$ be the curvature constant of $f$;
Let $g$ be the duality gap:
$$g(x) = \max_{s \in D} \ <x-s, \nabla f(x)>$$

Consider a sequence $(x^{(k)})$ computed by the Frank-Wolfe algorithm:

  • Let $x^{(0)} \in D$
  • For $k = 0...K$ do:
    • Compute $s:= \operatorname{argmin}_{s \in D}{<s, \nabla f(x^{(k)})>}$
    • Update $x^{(k+1)} = (1-\gamma_k) x^{(k)} + \gamma_k s$ where $\gamma_k = 2/(k+2)$

Then,
$$\min_{1\leq k\leq K} g(x^{(k)}) \leq \frac{4 C_f}{K+2}$$

The duality gap upper-bounds the difference between the values of the primal and dual functions.

Proof of the theorem

We only provide a sketch of the proof. First, we show that our algorithm is the Frank-Wolfe algorithm applied to $-f$, where $f$ is the dual function of the $C$-SMEBB problem. We then analyze the duality gap and show that $gap^{(k)} \geq D_F(\theta^*, \theta^{(k)})$.

Therefore,
$$D_F(\theta^*,\theta) \leq gap^{(j)} \leq \frac{4 C_F}{K+2}$$

Applications

The Frank-Wolfe algorithm lets us approximate the optimal solution to the SMEBB problem.

Definition. An $(\epsilon, C)$ slack Bregman MEB of a set of points $\theta_1, \cdots, \theta_n$ is a pair $(\theta, R)$ such that, if $\theta^*$ is an optimal solution to the slack Bregman MEB problem, then
$$D_F(\theta^*, \theta) \leq \epsilon$$

With this definition, we have the following lemma:
Lemma. The number of iterations needed to compute an $(\epsilon, C)$ slack Bregman MEB of a set of points is $\Theta(C_F/\epsilon)$.

Slack Minimum Enclosing Ball with the Euclidean Distance

In this section, $P = \{\theta_1, \cdots, \theta_n\}$ are points in a vector space $E$ of finite dimension $d$. We consider the Euclidean scalar product on $E$, which corresponds to the Bregman divergence $D_F$ associated with the generator $F(\theta) = \theta^T\theta/2$.

The optimization problem is therefore the Minimum enclosing ball problem with outliers:

$$\min_{c, R, \xi} R + C\sum_{1\leq i \leq n}{\xi_i}$$

$$\text{s.t.} \quad \Vert c-\theta_i \Vert^2 \leq R + \xi_i,\quad 1 \leq i \leq n, \qquad \xi \geq 0$$

Theorem. The Slack Bregman Frank-Wolfe algorithm computes an $(\epsilon, C)$-SMEBB of $P$ in time $\Theta(nd \Delta /\epsilon)$, where $\Delta = \operatorname{diam}(P)^2$.

Proof. We first notice that for this Bregman divergence, $C_F = \operatorname{diam}(P)^2 = \max_{\theta,\theta' \in P} \Vert \theta-\theta' \Vert^2$.
By the previous lemma, after $\Theta(C_F/\epsilon)$ iterations the algorithm returns an $(\epsilon, C)$-SMEBB of $P$. It therefore suffices to show that each iteration can be computed in time $\Theta(nd)$.
With $F(z) = z^Tz/2$, we have $\nabla F = \operatorname{Id}$ and
$$A_{F,F^*}(\theta_a: \eta_b) = \frac{\Vert \theta_a-\theta_b \Vert^2}{2}$$

At each iteration, $\theta^{(k)} = \sum_{1 \leq i \leq n} \alpha_i^{(k)} \theta_i$ is computed in time $\Theta(nd)$, $\eta^{(k)} = \theta^{(k)}$, and $D^{(k)}_i = \frac{ \Vert \theta_i-\theta^{(k)} \Vert ^2}{2}$, so each of the $n$ coordinates of $D^{(k)}$ takes time $\Theta(d)$ to compute.
Finding the $m$ largest entries of $D^{(k)}$ takes time $\Theta(n)$ with the introselect algorithm, and the remaining operations all take time $\Theta(nd)$; hence the execution time of the algorithm is $\Theta(nd\Delta/\epsilon)$.
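For reference, numpy's argpartition (which uses an introselect-style algorithm) provides this linear-time selection directly; the names below are illustrative.

```python
import numpy as np

D = np.random.default_rng(0).random(1_000_000)
m = 10
top = np.argpartition(D, len(D) - m)[len(D) - m:]   # indices of the m largest entries (unsorted)
i_m = top[np.argmin(D[top])]                        # index of the m-th largest entry
```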

Our definition of an $(\epsilon, C)$-SMEBB differs from the standard Euclidean MEB approximation. Usually, a ball $B(c, R)$ is a $(1+\epsilon)$-approximation of the minimum enclosing ball $B(c^*, R^*)$ if $B(c^*,R^*) \subset B(c,(1+\epsilon)R)$. A sufficient condition for $B(c, R)$ to be a $(1+\epsilon)$-approximation of the MEB is $\Vert c-c^* \Vert<\epsilon R^*$. With this definition, the number of iterations needed is $\Theta(1/\epsilon^2)$, and the total execution time is therefore $\Theta(nd/\epsilon^2)$.

Kernelized Slack MEB

In practice, data is often not linearly separable, and Euclidean enclosing balls do not always reflect the geometry of a dataset. That is why it is interesting to map the data into a feature space through a kernel and to consider enclosing balls in this feature space.

In this section, $Z = \{z_1, \cdots, z_n\}$ are points in a vector space $E$ of finite dimension $d$. We consider a mapping $\phi$ of $Z$ into an (implicit) feature space $\mathcal{F}$ such that
$$\forall (z,z') \in Z^2, \quad <\phi(z), \phi(z')> = k(z, z')$$
where $k$ is a positive definite kernel, which means that
$$\forall c \in \mathbb{R}^n, \quad \sum_{1 \leq i,j \leq n} c_i c_j k(z_i,z_j) \geq 0$$
For instance, we could use the Gaussian kernel
$$k(z, z') = \exp(-q\Vert z-z' \Vert^2)$$

We define $P = \{\theta_1, \cdots, \theta_n\} = \{\phi(z_1), \cdots, \phi(z_n)\}$. The representation of these points in the feature space is implicit, which means that the center $c$ of the SMEBB is described as a convex combination of $\{\theta_1, \cdots, \theta_n\}$: $c = \sum_{1\leq i \leq n} \alpha_i \theta_i$ with $\alpha \in C\Delta_n$.

Let $G$ be the Gram matrix of $P$, such that $G_{i,j} = <\theta_i, \theta_j> = k(z_i, z_j)$.

This scalar product corresponds to the Bregman divergence $D_F$ associated with the generator
$$F\Big(\sum_{1 \leq i \leq n} \alpha_i \theta_i\Big) = \alpha^TG\alpha/2 = \frac{1}{2} \sum_{1\leq i,j \leq n} \alpha_i \alpha_j G_{i,j} \quad \text{for} \quad \alpha \in \Delta_n$$

Theorem. The Slack Bregman Frank-Wolfe algorithm computes an $(\epsilon,C)$-SMEBB of $Z$ in time $\Theta(n \Delta /C\epsilon)$, where $\Delta = \operatorname{diam}(\phi(Z))^2$.

Proof. Since $C_F = \Delta$, the algorithm returns an $(\epsilon, C)$-SMEBB of $P$ after $\Theta(\Delta/\epsilon)$ iterations. We only need to show that each iteration can be computed in time $\Theta(n/C)$.
The only delicate point is to show that $D^{(k)}$ can be computed in time $\Theta(n/C)$. We have
$$D^{(k)}_i = (e_i - \alpha^{(k)})^TG(e_i - \alpha^{(k)})$$

Since adding a constant term to all entries of $D^{(k)}$ changes neither $s^{(k)}$ nor $gap^{(k)}$, we can work with
$$D^{(k)} = \operatorname{diag}(G) - 2 G\alpha^{(k)}$$

The update rule of $\alpha^{(k)}$ then gives
$$D^{(k+1)} = (1-\gamma_k)\, D^{(k)} + \gamma_k \big(\operatorname{diag}(G) - 2 G s^{(k)}\big)$$
and $s^{(k)}$ has at most $\lceil 1/C \rceil$ non-zero entries. Therefore the computation of $G s^{(k)}$ takes time $\Theta(n/C)$ (see the sketch below), hence the total running time of $\Theta(n\Delta/C\epsilon)$.
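A sketch of this update in numpy (the helper name and signature are ours): since $s^{(k)}$ has few non-zero entries, the product $Gs^{(k)}$ only touches the corresponding columns of the Gram matrix.

```python
import numpy as np

def update_D(D, G, s, gamma):
    """D^(k+1) = (1 - gamma) D^(k) + gamma (diag(G) - 2 G s^(k))."""
    idx = np.nonzero(s)[0]                           # at most ceil(1/C) indices
    Gs = G[:, idx] @ s[idx]                          # Theta(n/C) work
    return (1.0 - gamma) * D + gamma * (np.diag(G) - 2.0 * Gs)
```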

Kullback-Leibler divergence

When working with a dataset that consists of points in the probability simplex, it is more natural to work with the Kullback-Leibler divergence than with the Euclidean distance.

Let $P=(p_1,\ldots,p_{d+1})$ denote a point of the $d$-dimensional probability simplex.
For $1\leq i \leq d$, let
$$\theta(P) = \Big(\log\frac{p_i}{p_{d+1}}\Big)_{1 \leq i \leq d}, \qquad \eta(P) = (p_i)_{1 \leq i \leq d} = \Big(\frac{e^{\theta_i}}{1+\sum_{k=1}^d{e^{\theta_k}}}\Big)_{1 \leq i \leq d}$$
and
$$F(\theta) = \log\Big(1+ \sum_{j=1}^d e^{\theta_j}\Big), \qquad F^*(\eta) = \sum_{i=1}^d{\eta_i \log(\eta_i)}+ \Big(1-\sum_k \eta_k\Big) \log\Big(1-\sum_k \eta_k\Big)$$

The Kullback-Leibler divergence between two points $P$ and $Q$ of the $d$-dimensional probability simplex is then
$$KL(P:Q) = D_F(\theta(Q):\theta(P)) = A_{F,F^*}(\theta(Q):\eta(P))$$
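A small numerical check of this identity (the helper names are ours):

```python
import numpy as np

def to_theta(p):                          # natural parameters (first d coordinates)
    return np.log(p[:-1] / p[-1])

def F(theta):                             # log-normalizer
    return np.log1p(np.sum(np.exp(theta)))

def F_star(eta):                          # its convex conjugate
    last = 1.0 - np.sum(eta)
    return float(np.sum(eta * np.log(eta)) + last * np.log(last))

def canonical(theta1, eta2):              # A_{F,F*}(theta1 : eta2)
    return F(theta1) + F_star(eta2) - theta1 @ eta2

rng = np.random.default_rng(0)
P, Q = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))     # points of the 3-simplex
kl = float(np.sum(P * np.log(P / Q)))
assert np.isclose(kl, canonical(to_theta(Q), P[:-1]))           # KL(P:Q) = A_{F,F*}(theta(Q) : eta(P))
```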

Figure: Hard Bregman ball.

Figure: Bregman balls obtained with the FW Bregman MEB algorithm.

Figure: Center convergence rate and duality gap for the FW Bregman MEB algorithm.

Conclusion

We have considered the Frank-Wolfe algorithm for solving the minimum enclosing ball problem and generalized it to Bregman divergences. We also considered an optimization problem with slack variables, which allows for outliers. We showed that the Frank-Wolfe algorithm can be run within this new framework, and demonstrated a convergence rate of $\Theta(1/k)$, where $k$ is the number of iterations, which matches the rate proved for the standard Euclidean MEB problem.

We showed three applications of this result: the standard Euclidean slack MEB problem, the kernelized MEB problem where the enclosing ball is computed in a feature space, and the MEB on the probability simplex with the Kullback-Leibler divergence.

Further studies may focus on the Nesterov framework for non-smooth convex optimization, for which a running time of $\Theta(nd/\sqrt{\epsilon})$ is shown in [7].

The MEB problem with outliers on $Z = \{z_1, \cdots, z_n\}$ can be written as
$$\operatorname{argmin}_{c \in E} \quad \Vert c \Vert^2 + \max_{u \in C\Delta_n} (<Ac, u> + <b, u>)$$
with $b = (b_1, \cdots, b_n)$ such that $b_i = \Vert z_i \Vert^2$, and $A = -2 (z_1, \cdots, z_n)^T$.

Therefore, one can apply the same method as in that paper to the slack MEB problem, with $Q_2 = C\Delta_n$ instead of $Q_2 = \Delta_n$. The algorithm developed in [7] could perhaps also be generalized to Bregman divergences.

In another direction, when the approximate center is close enough to the true center, the vector $D^{(k)}$ in the algorithm stays roughly the same, so the computation of the largest entries of $D^{(k)}$ might become easier. It would be interesting to study which structure could be maintained on the entries of $D^{(k)}$ so that the time complexity of this step decreases over time.

Appendix

Lemma. Let $D = C\Delta_n$ with $1/n \leq C \leq 1$, $m=\lceil\frac{1}{C}\rceil$ and $v \in \mathbb{R}^n$.

The solution to the optimization problem $\operatorname{argmax}_{s \in D}{<s, v>}$ is
$$s = \sum_{1 \leq k \leq m-1} Ce_{i_k} + (1-(m-1)C)e_{i_m}$$
where $i_1, \cdots, i_m$ are the indices of the $m$ largest entries of $v$, $i_m$ being the index of the $m$-th largest entry; $s$ can be computed in time $\Theta(n)$.

Proof. Let $(I, \{i_m\}, J)$ be a partition of $[\![1;n]\!]$ such that $\vert J \vert = m-1$ and
$$\forall i \in I,\ \forall j \in J, \quad v_i \leq v_{i_m} \leq v_j$$
This partition can be computed in time $\Theta(n)$, and $J \cup \{i_m\}$ is the set of indices of the $m$ largest entries of $v$.

Let $\lambda = 1 - (m-1)C$, $s = \lambda e_{i_m} + \sum_{j \in J} C e_j$, and let $\alpha \in D$.

For all $i \in I$, $s_i = 0 \leq \alpha_i$, and for all $j \in J$, $\alpha_j \leq C = s_j$; therefore:
$$\begin{aligned} <s - \alpha, v> &= \sum_{i \in I} (s_i - \alpha_i) v_i + (s_{i_m} - \alpha_{i_m}) v_{i_m} + \sum_{j \in J} (s_j-\alpha_j) v_j \\ &\geq \sum_{i \in I} (s_i - \alpha_i) v_{i_m} + (s_{i_m} - \alpha_{i_m}) v_{i_m} + \sum_{j \in J} (s_j-\alpha_j) v_{i_m} \\ &= \sum_{i \in [\![1;n]\!]} (s_i - \alpha_i)\, v_{i_m} \\ &= \Big(\sum_{i \in [\![1;n]\!]} s_i - \sum_{i \in [\![1;n]\!]} \alpha_i\Big) v_{i_m} = 0 \end{aligned}$$

Therefore,
$$\forall \alpha \in D, \quad <s, v> \geq <\alpha, v>,$$
which shows that $s$ maximizes $<\cdot, v>$ over $D$.
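A quick randomized check of the lemma (rejection sampling of feasible points is an illustrative shortcut, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
n, C = 6, 0.3
m = int(np.ceil(1.0 / C))
v = rng.random(n)

top = np.argpartition(v, n - m)[n - m:]        # indices of the m largest entries of v
i_m = top[np.argmin(v[top])]                   # index of the m-th largest entry
s = np.zeros(n)
s[top] = C
s[i_m] = 1.0 - (m - 1) * C                     # the claimed maximizer

for _ in range(10_000):                        # rejection-sample alpha in C*Delta_n
    alpha = rng.dirichlet(np.ones(n))
    if alpha.max() <= C:
        assert s @ v >= alpha @ v - 1e-12
```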


  1. Ben-Hur, Asa, Horn, David, Siegelmann, Hava T., and Vapnik, Vladimir. Support vector clustering. Journal of Machine Learning Research, 2(Dec):125-137, 2001.

  2. Tax, David M. J. and Duin, Robert P. W. Support vector data description. Machine Learning, 54(1):45-66, 2004.

  3. Frank, Marguerite and Wolfe, Philip. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 1956.

  4. Badoiu, Mihai and Clarkson, Kenneth L. Smaller core-sets for balls. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 801-802. Society for Industrial and Applied Mathematics, 2003.

  5. Badoiu, Mihai and Clarkson, Kenneth L. Optimal core-sets for balls. Computational Geometry, 40(1):14-22, 2008.

  6. Jaggi, Martin. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, 2013. Available at http://proceedings.mlr.press/v28/jaggi13.pdf.

  7. Saha, Ankan and Vishwanathan, S. V. N. Efficient approximation algorithms for minimum enclosing convex shapes. CoRR, abs/0909.1062, 2009.