
Lagrange Multiplier

Why Lagrange Multiplier?

Suppose you are given a function f(x1, x2) and would like to find its extreme value, subject to a constraint function g(x1, x2) = 0;
for instance, the constraint a·x1 + b·x2 + c = d, i.e. g(x1, x2) = a·x1 + b·x2 + c − d, with a, b, c, d any scalars (d might be 0).

The possible solution:
[1]From the constraint function g(x1, x2) = 0, express x2 in terms of x1, say x2 = h(x1)
[2]Go back to f(x1, x2) and replace x2 with h(x1), so f(x1, x2) = f(x1, h(x1))
[3]Set the derivative with respect to x1 to zero, ∂f(x1, h(x1)) ∕ ∂x1 = 0; the x1 value thus obtained leads us to the extreme of f(x1, x2)
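For intuition, here is a minimal sketch of this substitution approach in Python with sympy; the concrete objective f(x1, x2) = x1² + x2² and constraint x1 + x2 − 1 = 0 are hypothetical choices for illustration, not examples from the text above.

# A minimal sketch of the substitution approach, assuming the illustrative
# objective f(x1, x2) = x1**2 + x2**2 and constraint g(x1, x2) = x1 + x2 - 1 = 0.
import sympy as sp

x1, x2 = sp.symbols('x1 x2', real=True)
f = x1**2 + x2**2                    # objective function f(x1, x2)
g = x1 + x2 - 1                      # constraint g(x1, x2) = 0

# [1] solve g = 0 for x2 in terms of x1: x2 = h(x1)
h = sp.solve(g, x2)[0]               # h(x1) = 1 - x1

# [2] substitute back into f: f(x1, h(x1))
f_sub = f.subs(x2, h)

# [3] set the derivative of f(x1, h(x1)) with respect to x1 to zero
x1_star = sp.solve(sp.diff(f_sub, x1), x1)[0]
print(x1_star, h.subs(x1, x1_star))  # -> 1/2 1/2, the constrained extreme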

Here comes the problem: not every constraint function allows one objective parameter to be expressed in terms of the others.

The Regularized Formula for Lagrange Multiplier

We thus step further into the Lagrange multiplier; usually, we will see:

ℒ(x1, x2, λ) = f(x1, x2) + λ·g(x1, x2) … (1), where λ is the Lagrange multiplier, and ℒ(x1, x2, λ) is the Lagrangian from which we find the λ that locates the extreme value of ℒ

To get the optimal solution, ∂ℒ ∕ ∂x1 = 0, ∂ℒ ∕ ∂x2 = 0 and ∂ℒ ∕ ∂λ = 0 must all hold!
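As a sanity check, here is a small sympy sketch, reusing the same hypothetical f and g as in the substitution example above, that solves the three stationarity conditions of (1) directly:

# Solve dL/dx1 = dL/dx2 = dL/dlambda = 0 for the hypothetical
# f = x1**2 + x2**2 and g = x1 + x2 - 1 (not the author's original example).
import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lambda', real=True)
f = x1**2 + x2**2
g = x1 + x2 - 1
L = f + lam * g                      # the Lagrangian of equation (1)

stationary = sp.solve([sp.diff(L, v) for v in (x1, x2, lam)], (x1, x2, lam), dict=True)
print(stationary)                    # -> [{x1: 1/2, x2: 1/2, lambda: -1}]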

Before proceeding any further, we'd like to understand why (1) is the right formulation for the Lagrange multiplier.

[1]Expand f by means of Taylor Series

➀take f(x1, x2,…, xn) to be a continuous and differentiable function in Rn,
➁take x = [x1, x2, …, xn]ᵗ ∈ Rn,
➂then, begin from lim(x→a) f(x) = f(a), where x, a ∈ Rn,
and expand f(a) in a Taylor series about x:
f(a) = f(x) + f′(x)(a − x) + (1 ∕ 2!)·f″(x)(a − x)² + (1 ∕ 3!)·f‴(x)(a − x)³ + …
f(a) ≈ f(x) + f′(x)(a − x); ignore the second- and higher-order terms,
then, f′(x) = ∂f(x) ∕ ∂x = [∂f(x) ∕ ∂x1, ∂f(x) ∕ ∂x2, …, ∂f(x) ∕ ∂xn]ᵗ = ∇f,
f(a) ≈ f(x) + (∇f)ᵗ(a − x), with (a − x) an n×1 vector
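To see this first-order approximation numerically, here is a tiny numpy check; the particular function on R2 and the two points are illustrative assumptions:

# Numeric check of f(a) ~ f(x) + (grad f)^t (a - x) when a is close to x.
import numpy as np

def f(v):                            # hypothetical smooth function on R^2
    return v[0]**2 + 3.0 * v[1]**2

def grad_f(v):                       # its gradient
    return np.array([2.0 * v[0], 6.0 * v[1]])

x = np.array([1.0, 2.0])
a = x + np.array([1e-3, -2e-3])      # a point very close to x

print(f(a), f(x) + grad_f(x) @ (a - x))   # the two values agree to first order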

[2]Next, we discuss single constraint condition

➀suppose we are asking for the min/max of f(x), subject to g(x) = c, where c might be 0, x ∈ Rn, and g(x) = c is a continuous, differentiable hypersurface in Rn
➁express lim(x→a) g(x) = g(a) in terms of its Taylor series, then:
g(a) ≈ g(x) + g′(x)(a − x) = g(x) + (∇g)ᵗ·(a − x), with (a − x) an n×1 vector
➂if a is also a point on the hypersurface, then g(x) = g(a) = c; since x is very close to a, the first-order term must vanish, and we conclude that (∇g)ᵗ·(a − x) ≈ 0, i.e. ∇g is orthogonal to the displacement (a − x) lying in the hypersurface
➃that is to say, (∇g)ᵗ is the normal vector orthogonal to the hypersurface at the point a, since the first-order change g(x) − g(a) ≈ (∇g)ᵗ·(x − a) vanishes for every x on the hypersurface near a
➄caution must be taken: the level curve of the function f might hit the hypersurface of the function g at the point x, at an infinitesimal distance ε from the point a, i.e. x + ε = a
➅then, ∇f might not be fully parallel to ∇g; we should have:
Δ = ∇f − Proj_∇g(∇f) ≠ 0
   = ∇f − (((∇f)ᵗ·∇g) ∕ ‖∇g‖²)·∇g
∵ ε→0 and x→a, we thus have Δ ≈ 0
∴ Δ = ∇f − λ∇g ≈ 0, where λ = ((∇f)ᵗ·∇g) ∕ ‖∇g‖²
or, we can directly say that ∇f ∈ span{∇g}
➆if your AI algorithm works well, then ε→0, such that ∇f = λ∇g can hold; we illustrate this concept of projecting ∇f onto ∇g in the picture below (the picture is only for intuition).

Project ∇f onto ∇g
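Below is a minimal numeric sketch of step ➅: project ∇f onto ∇g and inspect the residual Δ. The two gradient vectors are hypothetical values chosen only for illustration:

# Project grad_f onto grad_g and compute the residual Delta = grad_f - Proj.
import numpy as np

grad_f = np.array([2.0, 2.0])        # hypothetical value of grad f at the tangent point
grad_g = np.array([1.0, 1.0])        # hypothetical value of grad g at the same point

lam = (grad_f @ grad_g) / (grad_g @ grad_g)   # lambda = (grad_f . grad_g) / ||grad_g||^2
proj = lam * grad_g                           # Proj_{grad_g}(grad_f)
delta = grad_f - proj                         # Delta ~ 0 when grad_f lies in span{grad_g}
print(lam, delta)                             # -> 2.0 [0. 0.]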

[3]Lagrangian representation in most application design

In most application designs, the Lagrangian (the Lagrange multiplier likelihood function) is expressed as:
ℒ(x, λ) = f(x) + λg(x), x ∈ Rn,
∂ℒ ∕ ∂x = ∇f + λ∇g = 0 … (a)
∂ℒ ∕ ∂λ = g(x) = 0 … (b)
⇒solve the above two equations (a), (b) to get λ:
➀if λ = 0, then ∇f = 0; this means ∇f(x*) = 0 and g(x*) = 0 hold at the same time, i.e. the unconstrained extreme of f already satisfies the constraint function, where
g(x) = a1·x1^d1 + a2·x2^d2 + … + an·xn^dn + C = 0, x ∈ Rn
➁if λ ≠ 0, then ∇f = −λ∇g; ∇f and ∇g are collinear (inverse parallel when λ > 0), and f(x) has an extreme (min/max) at this point x* (so does g(x))
➂be noted that a constraint level g(x) = c with c ≠ 0 makes no difference, since ∂ℒ ∕ ∂λ = 0 still enforces the constraint, and the scalar that g(x) equals is determined by the problem domain.
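Continuing the same hypothetical example (f = x1² + x2², g = x1 + x2 − 1), here is a quick numeric check that the relation ∇f = −λ∇g from ➁ holds at the stationary point found earlier:

# Verify grad_f = -lambda * grad_g at x* = (1/2, 1/2), lambda = -1 (hypothetical example).
import numpy as np

x_star = np.array([0.5, 0.5])
lam = -1.0
grad_f = 2.0 * x_star                # grad f = (2*x1, 2*x2)
grad_g = np.array([1.0, 1.0])        # grad g = (1, 1)
print(np.allclose(grad_f, -lam * grad_g))   # -> True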

[4]Why ∇f and ∇g are inverse parallel?

∵ the regularization by the Lagrange multiplier further reduces the magnitude of ∇f, the final ∇f is a normal vector with small magnitude (length), and hence the error is small when we project sample data onto ∇f.
This design expects the gradient, after regularization, to have minimum magnitude at the tangent point where the level curve touches the hypersurface of the constraint function.

Inverse Parallel

Choosing between ℒ(x, λ) = f(x) + λg(x) and ℒ(x, λ) = f(x) − λg(x)

It depends on the exact design of your regularization algorithm and on how well you believe your regularization can work.
➀if you aim at reducing the prediction error (Δ), then ℒ(x, λ) = f(x) − λg(x) might be the more appropriate choice
➁the inverse-parallel version, ℒ(x, λ) = f(x) + λg(x), focuses on minimizing the magnitude of ∇f, which is orthogonal to the hyperplane.
Both guarantee that projecting sample data onto ∇f yields the minimal error (see the picture below for intuition).
One interesting point is that by choosing ℒ(x, λ) = f(x) + λg(x), the resulting λ is negative; ℒ(x, λ) = f(x) − λg(x) instead comes out with a positive λ. Combining the two cases, we can write the Lagrangian in a more regularized expression:
ℒ(x, λ) = f(x) + sign(λ)·λ·g(x), where sign(λ) is negative.
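To make this sign behaviour concrete, here is a quick sympy sketch with the same hypothetical f and g, showing that both forms reach the same stationary point while λ flips sign:

# Compare L = f + lambda*g and L = f - lambda*g for the hypothetical
# f = x1**2 + x2**2, g = x1 + x2 - 1.
import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lambda', real=True)
f, g = x1**2 + x2**2, x1 + x2 - 1

for L in (f + lam * g, f - lam * g):
    sol = sp.solve([sp.diff(L, v) for v in (x1, x2, lam)], (x1, x2, lam), dict=True)
    print(sol)   # lambda = -1 for f + lambda*g, lambda = +1 for f - lambda*g; x* = (1/2, 1/2) in both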

Just Works

The Lagrange Multiplier and Multiple Constraints

➀following the conclusion of the paragraph above, suppose we are asking for the min/max of f(x), subject to gk(x) = 0, x ∈ Rn, k = 1 to m, where each gk(x) is continuous and differentiable,
and ∇f = −∑_{k=1}^{m} λk∇gk
⇔ ∇f ∈ span{∇g1, ∇g2, …, ∇gm}
➁refining the Lagrangian accordingly, we have:
ℒ(x, {λk}) = f(x) + ∑_{k=1}^{m} λk·gk(x)
➂to find the extreme value (min/max), we need:
∇_x ℒ = ∇f + ∑_{k=1}^{m} λk∇gk = 0 … (c)
∂ℒ ∕ ∂λk = gk(x) = 0, for k = 1 to m … (d)
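Here is a minimal sketch of (c) and (d) with m = 2 hypothetical constraints in R3; the objective and constraints below are illustrative assumptions:

# Minimize f = x1^2 + x2^2 + x3^2 subject to g1 = x1 + x2 - 1 = 0 and g2 = x2 + x3 - 1 = 0.
import sympy as sp

x1, x2, x3, l1, l2 = sp.symbols('x1 x2 x3 lambda1 lambda2', real=True)
f  = x1**2 + x2**2 + x3**2
g1 = x1 + x2 - 1
g2 = x2 + x3 - 1

L = f + l1 * g1 + l2 * g2            # L(x, {lambda_k}) = f(x) + sum_k lambda_k * g_k(x)
eqs = [sp.diff(L, v) for v in (x1, x2, x3, l1, l2)]   # equations (c) and (d)
print(sp.solve(eqs, (x1, x2, x3, l1, l2), dict=True))
# -> [{x1: 1/3, x2: 2/3, x3: 1/3, lambda1: -2/3, lambda2: -2/3}]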

Hoeffding Inequality versus Chebyshev's Inequality

Hoeffding Inequality

Suppose you are given an in-sample data set with a certain property ν, used to train your learning algorithm in the hope of predicting an out-sample data set whose property μ is unknown but believed to exist. Does the in-sample ν say anything about the out-sample μ? Here the in-sample and out-sample might consist of small balls in red, green and blue, both samples coming from the same random generator, and we treat the distribution of colours over the balls as the property.

[1]No!! Possibly not: the in-sample might contain only red balls, while the out-sample might contain only green balls.
[2]Yes!! Probably yes: the in-sample and out-sample might contain a similarly high proportion of blue balls, so the in-sample ν might be close to the out-sample μ.

Hoeffding's inequality guarantees that the in-sample ν is likely to be close to the out-sample μ, with their difference within a quite small ε, on the condition that the sample size is large:

P(|ν − μ| > ε) ≤ 2 × exp(−2 × ε² × N), where N is the sample size. Hoeffding's inequality claims that ν = μ is probably approximately correct.

[1]Valid for all N and ε.
[2]Doesn't depend on μ; no need to know μ.
[3]With a larger sample size N or a looser gap ε, we have a higher probability that ν ≈ μ.

The above paragraph is my summary of NTU H.T. Lin's Machine Learning Foundations.
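To see the bound in action, here is a small Monte-Carlo sketch in Python; treating each ball as a Bernoulli(μ) draw is a hypothetical modelling choice (any [0, 1]-bounded variable works for Hoeffding):

# Estimate P(|nu - mu| > eps) by simulation and compare with the Hoeffding bound.
import numpy as np

rng = np.random.default_rng(0)
mu, eps, N, trials = 0.4, 0.05, 1000, 20000

samples = rng.binomial(1, mu, size=(trials, N))   # many in-samples of size N
nu = samples.mean(axis=1)                         # in-sample proportion nu per trial

empirical = np.mean(np.abs(nu - mu) > eps)        # estimated P(|nu - mu| > eps)
bound = 2 * np.exp(-2 * eps**2 * N)               # Hoeffding's upper bound
print(empirical, bound)                           # empirical frequency stays below the bound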

Chebyshev's Inequality

Chebyshev's inequality claims that P(|Y − E(Y)| > a) ≤ (1 ∕ a²) × Var(Y), for any random variable Y and any a > 0.

Proof:
Var(Y) = ∫_{−∞}^{+∞} (y − μ)² f_Y(y) dy
           ≥ ∫_{|y − μ| ≥ a} (y − μ)² f_Y(y) dy
           ≥ ∫_{|y − μ| ≥ a} a² f_Y(y) dy
           = a² × P(|Y − μ| ≥ a) ≥ a² × P(|Y − E(Y)| > a), where μ = E(Y).

Suppose we have
(1) random variables X1, X2, …, Xn, each with expectation μ and variance δ²,
(2) avgXn = (∑_{i=1}^{n} Xi) ∕ n, so that Var(avgXn) = δ² ∕ n,
then, for any ε > 0, by using Chebyshev's inequality, we have
P(|avgXn − μ| > ε) = P(|avgXn − E(avgXn)| > ε)
            ≤ δ² ∕ (n × ε²), which → 0 as n → ∞.
∴ lim_{n→∞} P(|avgXn − μ| > ε) = 0 … the law of large numbers.
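Here is a small simulation sketch of this law-of-large-numbers argument, using hypothetical i.i.d. Uniform(0, 1) variables (so μ = 0.5 and δ² = 1/12):

# P(|avgXn - mu| > eps) shrinks with n and stays below the Chebyshev bound delta^2/(n*eps^2).
import numpy as np

rng = np.random.default_rng(1)
mu, var, eps, trials = 0.5, 1.0 / 12.0, 0.05, 20000

for n in (10, 100, 1000):
    avg = rng.uniform(0.0, 1.0, size=(trials, n)).mean(axis=1)
    empirical = np.mean(np.abs(avg - mu) > eps)
    chebyshev = var / (n * eps**2)
    print(n, empirical, chebyshev)                # both shrink as n grows; empirical <= bound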

Hoeffding Inequality ≈ Chebyshev's Inequality

Now, we compare Hoeffding Inequality with Chebyshev's Inequality

[1]P(|ν − μ| > ε) ≤ 2 × exp(−2 × ε² × N)
           v.s.
[2]P(|avgXn − μ| > ε) ≤ δ² ∕ (n × ε²)

This is equivalent to comparing 2 × exp(−2 × ε² × N) with δ² ∕ (n × ε²).
For [1], as N → ∞, we have 2 × exp(−2 × ε² × N) → 0.
For [2], as n → ∞, we have δ² ∕ (n × ε²) → 0.
∴ Hoeffding's inequality ≈ Chebyshev's inequality in this sense: the constant 2 in [1] and δ² in [2] only scale the bounds, and both bounds vanish as the sample size grows.
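For a concrete feel of the two bounds, here is a tiny numeric comparison using a hypothetical δ² = 1/12 and ε = 0.05:

# Evaluate both upper bounds as the sample size grows.
import numpy as np

eps, var = 0.05, 1.0 / 12.0
for n in (100, 1000, 10000):
    hoeffding = 2 * np.exp(-2 * eps**2 * n)
    chebyshev = var / (n * eps**2)
    print(n, hoeffding, chebyshev)   # both tend to 0; the Hoeffding bound decays exponentially faster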

Introducing Lanyon

Lanyon is an unassuming Jekyll theme that places content first by tucking away navigation in a hidden drawer. It’s based on Poole, the Jekyll butler.

Built on Poole

Poole is the Jekyll Butler, serving as an upstanding and effective foundation for Jekyll themes by @mdo. Poole, and every theme built on it (like Lanyon here) includes the following:

  • Complete Jekyll setup included (layouts, config, 404, RSS feed, posts, and example page)
  • Mobile friendly design and development
  • Easily scalable text and component sizing with rem units in the CSS
  • Support for a wide gamut of HTML elements
  • Related posts (time-based, because Jekyll) below each post
  • Syntax highlighting, courtesy Pygments (the Python-based code snippet highlighter)

Lanyon features

In addition to the features of Poole, Lanyon adds the following:

  • Toggleable sliding sidebar (built with only CSS) via link in top corner
  • Sidebar includes support for textual modules and a dynamically generated navigation with active link support
  • Two orientations for content and sidebar, default (left sidebar) and reverse (right sidebar), available via <body> classes
  • Eight optional color schemes, available via <body> classes

Head to the readme to learn more.

Browser support

Lanyon is by preference a forward-thinking project. In addition to the latest versions of Chrome, Safari (mobile and desktop), and Firefox, it is only compatible with Internet Explorer 9 and above.

Download

Lanyon is developed on and hosted with GitHub. Head to the GitHub repository for downloads, bug reports, and feature requests.

Thanks!

Example content

Howdy! This is an example blog post that shows several types of HTML content supported in this theme.

Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Aenean eu leo quam. Pellentesque ornare sem lacinia quam venenatis vestibulum. Sed posuere consectetur est at lobortis. Cras mattis consectetur purus sit amet fermentum.

Curabitur blandit tempus porttitor. Nullam quis risus eget urna mollis ornare vel eu leo. Nullam id dolor id nibh ultricies vehicula ut id elit.

Etiam porta sem malesuada magna mollis euismod. Cras mattis consectetur purus sit amet fermentum. Aenean lacinia bibendum nulla sed consectetur.

Inline HTML elements

HTML defines a long list of available inline tags, a complete list of which can be found on the Mozilla Developer Network.

  • To bold text, use <strong>.
  • To italicize text, use <em>.
  • Abbreviations, like HTML, should use <abbr>, with an optional title attribute for the full phrase.
  • Citations, like — Mark Otto, should use <cite>.
  • Deleted text should use <del> and inserted text should use <ins>.
  • Superscript text uses <sup> and subscript text uses <sub>.

Most of these elements are styled by browsers with few modifications on our part.

Heading

Vivamus sagittis lacus vel augue rutrum faucibus dolor auctor. Duis mollis, est non commodo luctus, nisi erat porttitor ligula, eget lacinia odio sem nec elit. Morbi leo risus, porta ac consectetur ac, vestibulum at eros.

Code

Cum sociis natoque penatibus et magnis dis code element montes, nascetur ridiculus mus.

// Example can be run directly in your JavaScript console

// Create a function that takes two arguments and returns the sum of those arguments
var adder = new Function("a", "b", "return a + b");

// Call the function
adder(2, 6);
// > 8

Aenean lacinia bibendum nulla sed consectetur. Etiam porta sem malesuada magna mollis euismod. Fusce dapibus, tellus ac cursus commodo, tortor mauris condimentum nibh, ut fermentum massa.

Lists

Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Aenean lacinia bibendum nulla sed consectetur. Etiam porta sem malesuada magna mollis euismod. Fusce dapibus, tellus ac cursus commodo, tortor mauris condimentum nibh, ut fermentum massa justo sit amet risus.

  • Praesent commodo cursus magna, vel scelerisque nisl consectetur et.
  • Donec id elit non mi porta gravida at eget metus.
  • Nulla vitae elit libero, a pharetra augue.

Donec ullamcorper nulla non metus auctor fringilla. Nulla vitae elit libero, a pharetra augue.

  1. Vestibulum id ligula porta felis euismod semper.
  2. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.
  3. Maecenas sed diam eget risus varius blandit sit amet non magna.

Cras mattis consectetur purus sit amet fermentum. Sed posuere consectetur est at lobortis.

HyperText Markup Language (HTML)
The language used to describe and define the content of a Web page
Cascading Style Sheets (CSS)
Used to describe the appearance of Web content
JavaScript (JS)
The programming language used to build advanced Web sites and applications

Integer posuere erat a ante venenatis dapibus posuere velit aliquet. Morbi leo risus, porta ac consectetur ac, vestibulum at eros. Nullam quis risus eget urna mollis ornare vel eu leo.

Tables

Aenean lacinia bibendum nulla sed consectetur. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Name    | Upvotes | Downvotes
Alice   | 10      | 11
Bob     | 4       | 3
Charlie | 7       | 9
Totals  | 21      | 23

Nullam id dolor id nibh ultricies vehicula ut id elit. Sed posuere consectetur est at lobortis. Nullam quis risus eget urna mollis ornare vel eu leo.


Want to see something else added? Open an issue.

What's Jekyll?

Jekyll is a static site generator, an open-source tool for creating simple yet powerful websites of all shapes and sizes. From the project’s readme:

Jekyll is a simple, blog aware, static site generator. It takes a template directory […] and spits out a complete, static website suitable for serving with Apache or your favorite web server. This is also the engine behind GitHub Pages, which you can use to host your project’s page or blog right here from GitHub.

It’s an immensely useful tool and one we encourage you to use here with Lanyon.

Find out more by visiting the project on GitHub.