Lagrangian formulation of backpropagation


Lecture 2 presented two forms of optimization: Constrained optimization and unconstrained optimization.

Is there any math background that I need to understand this lagrangian formulation of backpropagation?


I haven’t gone through the lectures but it makes sense to me.
Umm I think all you need to know is for this portion is : summations,Series, Function representation here.

All the representations seem cryptic when we begin. It’s good to have a math symbol cheatsheet around. At times, they just condense it, but sometimes it’s really just to make it look cryptic :slight_smile:


I don’t understand what the term represents. The squared L2 norm is obvious, but the second term’s role is unclear to me. Isn’t the term inside the parentheses zero since they’re the same thing?


Hi There, contrary to what often stated, I think knowing some math would help a lot. In particular, this is variational calculus, very useful to know since all machine learning is based on it!

The point is that you are trying to minimise a function subject to some constraints. For example, you would like to study the motion of a particle that is constrained to move on a circular trajectory. In the case presented in the slides, you want x_i(l)= f_f[…], that just means a general function of the learning parameters.

The most common way to do this is to use a Lagrange multiplier. In this case there are -i- multipliers (B(l) in the slides) that you need to find. The multipliers enforce the constraint x_i(l)= f_l[…]. You turned an initially constrained problem into an unconstrained one, although now you need to minimise also with respect to B_i(l).

This is something I am sure you have seen before, although in a less abstract notation. For example, when defining the objective function in a standard regression problem, the first term corresponds to the Euclidean distance (between prediction and ground truth) and the second would corresponds to the regularisation term you need to avoid overfitting. Then the Lagrangian is the familiar (in vector form):

L = || x(W) - y ||^2_2 + lambda/2 (W^T W)

Here “lambda” is your Lagrange multiplier enforcing the constraint: W^T W =0, i.e. it prevents the learning parameters to become too big.
In the machine learning community, one usually fixes this value “by hand” or through a grid search to save computational time, but what you really should do is to find the value of lambda by optimization, i.e. you need to solve the system:

d L/dW =0 , dL/d lambda =0 .

Finally, a more advanced note is that you can understand Lagrange multipliers coming from representations of the Dirac delta function: Prod_i Delta(x_i - sum_l f_l […]).

I hope this answers your question :slight_smile: If you want to know more, I suggest having a look to a book on functional calculus.


Thanks a lot :slightly_smiling_face: . That helped.


Tim Vieira wrote a great article covering the connection between backprop and the method of Lagrange multipliers.


Actually you can find all the details in some serious books on Machine learning, like

C. Bishop, Pattern Recognition and machine learning

There is a detailed explanation in one of the appendices. Usually most books on variational methods only cover the original Lagrange multiplier formulation, that is for equalities e.g. y= sum_i f(x_i). But you can have inequality constraints such as y >= f(x_i), that requires some additional conditions on the form of the multiplier.

I found the article of Tim Viera misleading. Backprop is really the “chain rule”, at least from a calculus point of view. Numerically, you can implement things in a different way but usually solving a constraint optimization problem takes more resources than an unconstrained one. This is because you are restricting the target space to satisfy a series of demands. The mapping to the constrained problem in the article is an unnecessary complication and has really nothing to do with the backdrop algorithm.

Moreover, I would like to point out that a constrained problem can have a large number of degenerate minima, see e.g. spin glasses. As Neural Networks are a particular instance of a spin glass model, you can see these kind of problems emerging as well. Note however that constraints in NN emerges from the randomness of the model.

Constrained optimization problems appear often in physics or in operation research. For example, many combinatorial problems can be solved in terms of constrained optimization problems on networks, like finding the shortest path between two points etc…


Thank you for the detailed reply. Firstly, my background is not in mathematics or academia or CS or what-so-ever; I majored in multimedia system. I am a self-trained software engineer who started came at the issue of programming, Computer Science, and Machine Learning from a perspective of “I’ve got questions I want to ask and techniques I want to apply that I’m currently under prepared to answer.”

I have heard about Bishop, Pattern Recognition and machine learning book when I first tried to get into this field of work where I can actually work on developing and applying Machine Learning algorithms every day. The book didn’t stick with me. I am thankfully for Andrew Ng’s Machine Learning MOOC which finally got me into this field. I will give Bishop’s book another chance now that I am more ready.

I understand what you have explained. It’s a very interesting point of view. I am glad that I learned something today from your reply. Keep sharing!


Hi Cedric,

Sure I took Adrew Ng’s classes as well, as I find them a nice “hands on” start to get on the topic. The fact that he overlooks most of the math, does not mean that their content is not useful, even for people that have a good mathematical background.

I believe the purpose of this forum is to share knowledge and opinions about AI. Being a hot topic, you can find a lot of people talking about it, not always with the adeguate knowledge.

It is common in science to have a debate about people’s research/claims, and to not accept someone’s point of view just because he/she is famous. In this case I just pointed out that the blog post is mathematically incorrect. Mine was not a point of view, but rather an explanation of the meaning and various applications of this methods.

I am glad it has been useful.

Best wishes,