mjtsai1974's Dev Blog Welcome to mjt's AI world

Lagrange Multiplier

Why Lagrange Multiplier?

Suppose you are given a function f(x1, x2), and would like to find out its extreme, subject to a constraint function g(x1, x2) = 0;
where g(x1, x2) = ax1 + bx2 + c = d, with a, b, c, d to be any scalar, d might be 0.

The possible solution:
[1]Figure out from the constraint function g(x1, x2) = 0 and express x2 in terms of x1, like x2 = h(x1)
[2]Back to f(x1, x2) and replace x2 with h(x1), f(x1, x2) = f(x1, h(x1))
[3]Take partial derivative on x1 to zero, ∂f(x1, h(x1)) ∕ ∂x1 = 0, the x1 value thus obtained would then lead us to the extreme of f(x1, x2)

Here comes the problem that not all objective parameters could be expressed in terms of other objective parameters in the constraint function.

The Regularized Formula for Lagrange Multiplier

We thus step further into the lagrange multiplier, usually, we will see:

ℒ(x1, x2, λ) = f(x1, x2) + λ(x1, x2) ... (1), where λ is the lagrange multiplier, and ℒ(x1, x2, λ) is the maximum likelihood function for us to come out with the λ that can optimize the extreme value of ℒ

To get the most optimal solution, ∂ℒ ∕ ∂x1 = 0, ∂ℒ ∕ ∂x2 = 0, ∂ℒ ∕ ∂λ = 0 must be mandatory!

Before proceed any further, we’d like to realize why (1) is the regularized formula for Lagrange Multiplier.

[1]Expand f by means of Taylor Series

➀take f(x1, x2,…, xn) to be a continuous and differentiable function in Rn,
➁take x = [x1, x2,…, xn]t ∈ Rn,
➂then, begin from limx→af(x) = f(a), where x, a ∈ Rn,
express limx→af(x) in terms of Taylor Series:
f(a) = f(x) + f′(x)(x − a) + (1 ∕ (2!))f″(x)(x − a)2 + (1 ∕ (6!))f′″(x)(x − a)3 + …
f(a) ≈ f(x) + f′(x)(x − a); ignore the second derivative term,
then, f′(x) = ∂f(x) ∕ ∂x = [∂f(x) ∕ ∂x1 ∂f(x) ∕ ∂x2 … ∂f(x) ∕ ∂xn]t = ∇f,
f(a) ≈ f(x) + (∇f)t(x − a)n×1

[2]Next, we discuss single constraint condition

➀suppose we are asking for min/max f(x), subject to g(x) = c, cmight be 0, x ∈ Rn, where g(x) = c is a continuous, differentiable hypersurface on Rn
➁express limx→ag(x) = g(a) in terms of Taylor Series, then:
g(a) ≈ g(x) + g′(x)(x − a) = g(x) + (∇g)t ⋅(x − a) n×1
➂if a is also the point on the hypersurface, then, g(x) = g(a) = 0, we can treat (x − a) → 0, since x is very closed to a and conclude that (∇g)t(x − a)n×1 = 0
➃that is to say (∇g)t is the normal vector orthogonal to the hypersurface at the point a, where we can have:
(∇g)t ≈ limx→a[(g(x) − g(a)) ∕ (x − a)]t
➄cautions must be made that level curve of function f might hit onto the hypersurface of function g at the point x, with an infinitesimal distance to the point a, say it ε, where x + ε = a
➅then, ∇f might not be full parallel to ∇g, we should have:
Δ = ∇f − Proj∇g(∇f) … ≠ 0
     = ∇f − (((∇f)t ⋅ ∇g) ∕ ||∇g||2) ⋅ ∇g
∵ε→0, x→a, we thus have Δ ≈ 0
∴Δ = ∇f − λ∇g ≈ 0, where λ = (((∇f)t ⋅ ∇g) ∕ ||∇g||2)
or, we can directly say that ∇f ∈ span{∇g}
➆if your AI algorithm works well, then ε→0, such that ∇f = λ∇g could hold, we illustrate this concept of projecting ∇f onto ∇g, see below pic(the picture is just by intuition).

Project ∇f onto ∇g

[3]Lagrangian representation in most application design

In most application design, Lagrange Multiplier likelihood function is expressed as:
ℒ(x, λ) = f(x) + λg(x), x ∈ Rn,
∂ℒ ∕ ∂x = ∇f + λ∇g … (a)
∂ℒ ∕ ∂λ = 0 … (b)
=>Resolve above two equations (a), (b) to get λ:
➀if λ = 0, then, ∇f = 0 means that ∇f(x*) = 0 and g(x*) = 0 just satisfy the constraint function, where
g(x) = a1x1d1 + a2x2d2 + … + anxndn + C = 0, x ∈ Rn
➁if λ ≠ 0, then, ∇f = -λ∇g, it implies that ∇f and ∇g are inverse parallel, where f(x) has a extreme(min/max) at this point x*(so does g(x))
➂be noted that g(x) != 0 would not impact, since ∂ℒ ∕ ∂λ = 0 for a solve and g(x) equates to any scalar would be subject to the question domain.

[4]Why ∇f and ∇g are inverse parallel?

∵the regularization by using the lagrange multiplier would further reduce the magnitude of ∇f, the final ∇f would be normal vector with small magnitude(length), thus the small error when we project sample data onto ∇f.
This design expects a minimum magnitude of the gradient after regularization at the tangent point where the level curve and the hypersurface of the constraint function.

Inverse Parallel

What to choose in between ℒ(x, λ) = f(x) + λg(x) v.s. ℒ(x, λ) = f(x) − λg(x)

It depends on exactly your design of regularization algorithm and how good you believe your regularization can work.
➀if you aims at reducing the prediction error(Δ),then, ℒ(x, λ) = f(x) − λg(x) might be the most appropriate
➁inverse parallel version of ℒ(x, λ) = f(x) + λg(x) would make the point on minimizing the magnitude of ∇f which is orthogonal to the hyperplane.
Both guarantee the projecting sample data onto ∇f would we have the minimal error(see below pic for intuition).
One thing interest is that by choosing ℒ(x, λ) = f(x) + λg(x), the λ is negative; nevertheless, ℒ(x, λ) = f(x) − λg(x) would come out with positive λ, concluded from the 2 combination, we should have the lagrgarian in a more regularized expression:
ℒ(x, λ) = f(x) + sign(λ)λg(x), where sign(λ) is negative.

Just Works

The Lagrange Multiplier and Multiple Constraints

➀succeeding to conclusion in above paragraph, suppose we are asking for min/max f(x), subject to gk(x) = 0, x ∈ Rn, k = 1 to m, and gk(x) is continuous and differentiable,
where ∇f = -∑k=1mλk∇gk
⇔ ∇f ∈ span{∇g1, ∇g2,…, ∇gm}
➁refine Lagrange Multiplier likelihood function, we have:
ℒ(x, {λk}) = f(x) + ∑kλkgk(x), for k = 1 to m
➂to find the extreme value(min/max), we need:
xℒ = ∇f + ∑k=1mλk∇gk = 0 … (c)
λkℒ = gk(x) = 0 for k = 1 to m … (d)