A Neural ODE [1] expresses its output as the solution to a dynamical system whose evolution function is a learnable neural network. In other words, a Neural ODE models the transformation from input to output as a learnable ODE.

Since our model is a learnable ODE, we use an ODE solver to evolve the input to an output in the forward pass and calculate a loss. For the backward pass, we would like to simply store the function evaluations of the ODE solver and then backprop through them to calculate the loss gradient. Unfortunately, the forward solve might have taken a huge number of steps depending on the system dynamics, possibly making our computation graph too big to hold in memory for backprop.

Luckily, a very well-known technique called the Adjoint Sensitivity method [2] helps retrieve the gradient of the Neural ODE loss with respect to the network parameters by solving another ODE backwards in time, sidestepping the need to store intermediate function evaluations.

The adjoint method is used by the Neural ODE authors, who also provide a very intuitive proof for it in the paper’s Appendix (B.1), as an alternative to the one presented in [2].

Nevertheless, I couldn’t find a copy of [2], so here I take a stab at deriving the adjoint method using what I think is used in [2] – the more traditional approach of Lagrange Multipliers.

For kicks.⚽

(My attempt relies heavily on my readings of [3], [4], and [5].)

Problem

First let’s set the stage by formulating our problem and breaking it into subproblems.

PM: Minimization Problem

Here’s our main minimization problem.

$$\text{Find } \underset{\theta}{\operatorname{argmin}} \; L(z(t_1)) \quad \text{subject to} \tag{PM}$$

$$F(\dot{z(t)}, z(t), \theta, t) = \dot{z(t)} - f(z(t), \theta, t) = 0 \tag{1}$$

$$z(t_0) = z_{t_0} \tag{2}$$

$$t_0 < t_1$$

$\text{(PM)}$’s dramatis personae a.k.a. variables are as follows.

  • $f$ is our neural network with parameters $\theta$
  • $z(t_0)$ is our input, $z(t_1)$ is the output, and $z(t)$ is the state reached from $z(t_0)$ at time $t \in [t_0, t_1]$
  • $\dot{z(t)} = \frac{\mathrm{d}z(t)}{\mathrm{d}t}$ is the time derivative of our state $z(t)$
  • $L$ is our loss, and it’s a function of the output $z(t_1)$

In the forward pass, we solve the initial value problem given by $(1)$–$(2)$ to go from input to output, like so

$$z(t_1) = z_{t_0} + \int_{t_0}^{t_1} f(z(t), \theta, t)\, dt \tag{3}$$
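To make the forward pass concrete, here is a minimal sketch assuming a toy fixed-step Euler solver in PyTorch; the names `ODEFunc` and `euler_solve` are illustrative only, and a real implementation would use an adaptive solver such as the authors’ torchdiffeq package.

```python
# A hypothetical sketch of the forward solve (3) with a naive fixed-step Euler loop.
import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    """The learnable evolution function f(z, theta, t)."""
    def __init__(self, dim=2, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, t, z):
        return self.net(z)  # an autonomous f for simplicity (no explicit t-dependence)

def euler_solve(f, z_t0, t0=0.0, t1=1.0, steps=100):
    """Approximate z(t1) = z(t0) + integral of f dt with explicit Euler steps."""
    z, dt = z_t0, (t1 - t0) / steps
    for i in range(steps):
        t = torch.tensor(t0 + i * dt)
        z = z + dt * f(t, z)
    return z

f = ODEFunc()
z_t0 = torch.randn(8, 2)      # a batch of inputs z(t0)
z_t1 = euler_solve(f, z_t0)   # the "output layer" z(t1)
```

Note that naively backpropagating through `euler_solve` would keep all `steps` intermediate evaluations in the computation graph, which is exactly the memory problem described above.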

PG: Grad Sub-Problem

$\text{(PM)}$ is a constrained non-linear optimization problem. To tackle it with gradient descent, we need the gradient of the loss $L(z(t_1))$ with respect to the network params $\theta$. But we saw in the intro that simple backprop can be memory inefficient because of the potentially huge computation graph. So we want to

$$\text{Efficiently calculate } \frac{\mathrm{d} L(z(t_1))}{\mathrm{d} \theta} \tag{PG}$$

But how can we make the grad calculation more memory-efficient? We would need to somehow do a backward pass without storing all the function activations from the forward pass. Perhaps we can leverage the ODE constraint given by $(1)$?

PL: Lagrangian Sub-Sub-Problem

To incorporate our ODE constraint $(1)$ into $\text{(PM)}$ we will use the technique of Lagrange Multipliers. In particular, let’s entwine our ODE $F(\dot{z(t)}, z(t), \theta, t) = 0$ into our original objective $L(z(t_1))$ to make a new objective, like so

$$\psi = L(z(t_1)) - \int_{t_0}^{t_1} \lambda(t)\, F(\dot{z(t)}, z(t), \theta, t)\, dt \tag{4}$$

$\lambda(t)$ is called a Lagrange multiplier [6]. We know that $F = 0$, so the additional term doesn’t change our gradient. That is,

$$\frac{\mathrm{d} \psi}{\mathrm{d} \theta} = \frac{\mathrm{d} L(z(t_1))}{\mathrm{d} \theta} \tag{5}$$

So what did we gain? This: we can try to choose $\lambda(t)$ so that we can hopefully eliminate hard-to-compute derivatives in $\frac{\mathrm{d} L(z(t_1))}{\mathrm{d} \theta}$, such as the computation graph Jacobian $\frac{\mathrm{d} z(t_1)}{\mathrm{d} \theta}$. In fact, let’s state that as a problem.

$$\text{Choose } \lambda(t) \text{ to avoid hard derivatives} \tag{PL}$$

Solving $\text{(PL)}$ could help solve $\text{(PG)}$, which in turn helps with $\text{(PM)}$.

Simplify terms

Note: Rampant blatant argument and transpose suppression in what follows.😱

Let’s begin by investigating the integral at the end of $(4)$.

$$\begin{aligned} \int_{t_0}^{t_1} \lambda(t) F\, dt &= \int_{t_0}^{t_1} \lambda(t) \left( \dot{z(t)} - f \right) dt \\ &= \int_{t_0}^{t_1} \lambda(t) \dot{z(t)}\, dt \; - \int_{t_0}^{t_1} \lambda(t) f\, dt \end{aligned} \tag{6}$$

Using Integration by Parts for the first of the two terms, we see

$$\begin{aligned} \int_{t_0}^{t_1} \lambda(t) \dot{z(t)}\, dt &= \lambda(t)z(t) \Big\vert_{t_0}^{t_1} - \int_{t_0}^{t_1} z(t) \dot{\lambda(t)}\, dt \\ &= \lambda(t_1)z(t_1) - \lambda(t_0)z_{t_0} - \int_{t_0}^{t_1} z(t) \dot{\lambda(t)}\, dt \end{aligned} \tag{7}$$

Using $(7)$ in $(6)$, we get

$$\int_{t_0}^{t_1} \lambda(t) F\, dt = \lambda(t_1)z(t_1) - \lambda(t_0)z_{t_0} - \int_{t_0}^{t_1} \left( z \dot{\lambda} + \lambda f \right) dt$$

Bringing in the derivative with respect to $\theta$, we see

$$\begin{aligned} \frac{\mathrm{d}}{\mathrm{d} \theta} \left[ \int_{t_0}^{t_1} \lambda F\, dt \right] &= \lambda(t_1)\frac{\mathrm{d} z(t_1)}{\mathrm{d} \theta} - \lambda(t_0)\underbrace{\frac{\mathrm{d} z_{t_0}}{\mathrm{d} \theta}}_{=\,0 \text{ since } z_{t_0} \text{ is the input}} \\ &\quad - \int_{t_0}^{t_1} \left( \frac{\mathrm{d} z}{\mathrm{d} \theta} \dot{\lambda} + \lambda \frac{\mathrm{d} f}{\mathrm{d} \theta} \right) dt \end{aligned} \tag{8}$$

From the Chain Rule we have

$$\frac{\mathrm{d} f}{\mathrm{d} \theta} = \frac{\partial f}{\partial \theta} + \frac{\partial f}{\partial z} \frac{\mathrm{d} z}{\mathrm{d} \theta}$$

So

$$\begin{aligned} \frac{\mathrm{d}}{\mathrm{d} \theta} \left[ \int_{t_0}^{t_1} \lambda F\, dt \right] &= \lambda(t_1)\frac{\mathrm{d} z(t_1)}{\mathrm{d} \theta} \\ &\quad - \int_{t_0}^{t_1} \left( \dot{\lambda} + \lambda \frac{\partial f}{\partial z} \right) \frac{\mathrm{d} z}{\mathrm{d} \theta}\, dt - \int_{t_0}^{t_1} \lambda \frac{\partial f}{\partial \theta}\, dt \end{aligned} \tag{9}$$

We can now go all the way back to $(4)$ and differentiate with respect to $\theta$. Invoking $(5)$ and the Chain Rule, we get

$$\frac{\mathrm{d} L}{\mathrm{d} \theta} = \frac{\partial L}{\partial z(t_1)}\frac{\mathrm{d} z(t_1)}{\mathrm{d} \theta} \; - \; \frac{\mathrm{d}}{\mathrm{d} \theta} \left[ \int_{t_0}^{t_1} \lambda F\, dt \right]$$

Finally, let’s pull in $(9)$ and collect terms.

$$\begin{aligned} \frac{\mathrm{d} L}{\mathrm{d} \theta} &= \left[ \frac{\partial L}{\partial z(t_1)} - \lambda(t_1) \right] \frac{\mathrm{d} z(t_1)}{\mathrm{d} \theta} \\ &\quad + \int_{t_0}^{t_1} \left( \dot{\lambda(t)} + \lambda(t) \frac{\partial f}{\partial z} \right) \frac{\mathrm{d} z(t)}{\mathrm{d} \theta}\, dt + \int_{t_0}^{t_1} \lambda(t) \frac{\partial f}{\partial \theta}\, dt \end{aligned} \tag{10}$$

Solve PL, PG, PM with Good Lagrange Multiplier

We can now stop and think about $\text{(PL)}$ – avoid hard derivatives. Let’s list the derivatives in $(10)$.

| Derivative | Description | Ease |
|---|---|---|
| $\frac{\partial L}{\partial z(t_1)}$ | Gradient of loss with respect to output. | Easy |
| $\frac{\mathrm{d} z(t_1)}{\mathrm{d} \theta}$ | Jacobian of output with respect to params. | Not easy |
| $\dot{\lambda(t)}$ | Derivative of lambda, a vector, with respect to time. | Easy |
| $\lambda(t) \frac{\partial f}{\partial z}$ | Vector-Jacobian product. Can compute with reverse-mode autodiff without explicitly constructing the Jacobian $\frac{\partial f}{\partial z}$. | Easy |
| $\frac{\mathrm{d} z(t)}{\mathrm{d} \theta}$ | Jacobian of arbitrary layer with respect to params. | Not easy |
| $\lambda(t) \frac{\partial f}{\partial \theta}$ | Vector-Jacobian product. Can compute with reverse-mode autodiff without explicitly constructing the Jacobian $\frac{\partial f}{\partial \theta}$. | Easy |
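As an aside, here is a minimal sketch of what the “Easy” vector-Jacobian product rows mean in practice, assuming PyTorch; `f` below is just a stand-in expression, not our actual network.

```python
# A toy vector-Jacobian product: lambda * df/dz and lambda * df/dtheta in one
# reverse-mode call, with no Jacobian ever materialized.
import torch

z = torch.randn(4, requires_grad=True)
theta = torch.randn(4, requires_grad=True)
f = torch.tanh(theta * z)     # stand-in for f(z, theta, t)

lam = torch.randn(4)          # the row vector lambda(t)
vjp_z, vjp_theta = torch.autograd.grad(f, (z, theta), grad_outputs=lam)
```

Each such call costs on the order of one extra evaluation of f, which is why those rows are marked Easy.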

So we see that $(10)$ requires calculating $\frac{\mathrm{d} z(t)}{\mathrm{d} \theta}$ at the very last point in time $t_1$ (output layer) and possibly any general point in time $t$ (arbitrary hidden layer). That is exactly what we’re trying to avoid!

Can we get rid of $\frac{\mathrm{d} z(t)}{\mathrm{d} \theta}$ from $(10)$ by making a judicious choice of $\lambda(t)$?🤔

What if we define $\lambda(t)$ to be such that it satisfies the following ODE that runs backwards in time, starting at $t_1$ and ending at $t_0$?

$$\boxed{\begin{aligned} &\dot{\lambda(t)} = -\lambda(t)\frac{\partial f}{\partial z} \quad \text{s.t.} \quad \lambda(t_1) = \frac{\partial L}{\partial z(t_1)} \\ &\text{giving} \\ &\lambda(t_0) = \lambda(t_1) - \int_{t_1}^{t_0} \lambda(t) \frac{\partial f}{\partial z}\, dt \end{aligned}} \tag{11}$$

Like magic, the first two terms in $(10)$ – the ones containing $\frac{\mathrm{d} z}{\mathrm{d} \theta}$ – will simply zero out!!!😃🥳

$(10)$ then simplifies to

$$\boxed{\frac{\mathrm{d} L}{\mathrm{d} \theta} = -\int_{t_1}^{t_0} \lambda(t) \frac{\partial f}{\partial \theta}\, dt} \tag{12}$$

where I just flipped the integration limits [7].

Notice $\frac{\mathrm{d} L}{\mathrm{d} \theta}$ is precisely the loss gradient we’re after in $\text{(PG)}$!

But you’ll remember from $\text{(PG)}$ that just getting the gradient is not enough – we want the gradient computation to be memory efficient. Well, notice that $(11)$–$(12)$ form a new ODE system that does not require us to preserve activations from the forward pass. So our new method trades off computation for memory – in fact, the memory requirement for this gradient calculation is $\mathrm{O}(1)$ with respect to the number of layers!
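To make the backward pass concrete, here is a rough fixed-step Euler sketch of $(11)$–$(12)$, assuming PyTorch and the toy `ODEFunc`/`euler_solve` from the earlier forward-pass sketch; `adjoint_backward` is an illustrative name, and the paper’s actual implementation solves an augmented state with an adaptive solver.

```python
# A rough Euler sketch of the adjoint backward pass (11)-(12). The state z is
# re-integrated backwards alongside lambda, so no forward activations are kept.
import torch

def adjoint_backward(f, z_t1, dLdz1, t0=0.0, t1=1.0, steps=100):
    z, lam = z_t1.detach(), dLdz1.detach()
    dLdtheta = [torch.zeros_like(p) for p in f.parameters()]
    dt = (t1 - t0) / steps
    for i in range(steps):
        t = torch.tensor(t1 - i * dt)
        z = z.detach().requires_grad_(True)
        with torch.enable_grad():
            out = f(t, z)
            # One reverse-mode call gives lambda*df/dz and lambda*df/dtheta.
            vjps = torch.autograd.grad(out, (z, *f.parameters()), grad_outputs=lam)
        z = z - dt * out.detach()            # run the state backwards in time
        lam = lam + dt * vjps[0]             # Euler step of d(lambda)/dt = -lambda*df/dz
        dLdtheta = [g + dt * v for g, v in zip(dLdtheta, vjps[1:])]  # accumulate (12)
    return lam, dLdtheta                     # lambda(t0) and dL/dtheta

# Hypothetical wiring: lambda(t1) = dL/dz(t1) for a toy loss L = ||z(t1)||^2.
f = ODEFunc()
z_t0 = torch.randn(8, 2)
z_t1 = euler_solve(f, z_t0)
dLdz1 = 2 * z_t1.detach()
lam_t0, grads = adjoint_backward(f, z_t1, dLdz1)
```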

Once we have the gradient, we can readily use it in gradient descent to (locally) solve $\text{(PM)}$!

Summary of Steps

In summary, here are the steps in one iteration of solving $\text{(PM)}$ with gradient descent (for a batch of one):

  1. Forward pass: Solve the ODE $(1)$–$(2)$ from time $t_0$ to $t_1$ and get the output $z(t_1)$.
  2. Loss calculation: Calculate $L(z(t_1))$.
  3. Backward pass: Solve ODEs $(11)$ and $(12)$ backwards from time $t_1$ to $t_0$ to get the gradient of the loss, $\frac{\mathrm{d} L(z(t_1))}{\mathrm{d} \theta}$.
  4. Use the gradient to update the network parameters $\theta$.
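Putting the four steps together, here is a sketch of one training iteration using the authors’ torchdiffeq package, whose `odeint_adjoint` wrapper implements the adjoint backward pass (this assumes torchdiffeq is installed; the toy `ODEFunc`, loss, and data are placeholders).

```python
# One gradient-descent iteration (steps 1-4), sketched with torchdiffeq.
import torch
import torch.nn as nn
from torchdiffeq import odeint_adjoint as odeint

class ODEFunc(nn.Module):                         # same toy f as in the earlier sketch
    def __init__(self, dim=2, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))
    def forward(self, t, z):
        return self.net(z)

func = ODEFunc()
optimizer = torch.optim.Adam(func.parameters(), lr=1e-3)
t = torch.tensor([0.0, 1.0])                      # [t0, t1]

z_t0 = torch.randn(8, 2)                          # input batch
target = torch.randn(8, 2)                        # toy regression target

z_t1 = odeint(func, z_t0, t)[-1]                  # 1. forward pass: solve (1)-(2)
loss = ((z_t1 - target) ** 2).mean()              # 2. loss L(z(t1))

optimizer.zero_grad()
loss.backward()                                   # 3. backward pass via the adjoint ODE
optimizer.step()                                  # 4. update theta with the gradient
```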

What is the adjoint?

The $\lambda(t)$ we defined in $(11)$ is called the adjoint. I think one reason for the name is that adjoint refers to the transpose in linear algebra, and $\lambda(t)$ is a row vector, or equivalently a transposed column vector. $(11)$ itself is the adjoint system.

While the paper derives these equations using an alternate proof that proceeds by defining the adjoint and then using the Chain Rule, we have derived the expressions for the adjoint and the adjoint equations here using the method of Lagrange multipliers. For kicks.

If we compare $(11)$ and $(12)$ to equations $(46)$ and $(51)$ in the original Neural ODEs paper, we see that all the definitions of the adjoint, the adjoint system, and the gradient of the loss with respect to the parameters are in agreement. In fact, $\lambda(t)$ corresponds to the quantity $a(t)$ defined in the paper.. woot!!🎉 [8]

Comments ported from Wordpress on “Deriving the Adjoint Equation for Neural ODEs using Lagrange Multipliers”

Kacper said on 2020-02-05 09:47:43:

Hi!

Thank you for your post. I will jump into the derivation more in-depth (as I’m a newbie in these topics). What I lack at first glance is an explanation for the dot notation above a letter (I derived from the table that it is a derivative - I know it’s standard notation in physics but rather uncommon in the ML community). I think it would be worth mentioning what $\dot{z}$ stands for in the beginning.

Best regards

Vaibhav Patel said on 2020-02-05 10:24:34:

Hey, thanks for the feedback! Really appreciate it!

I agree, I tried unsuccessfully to not be all over the place with my notations. Part of it has to do with not being adept at knowing how to fit equations on smaller devices.. I just started blogging and I thought it would all just.. flow.. or something.

For starters, I’ll definitely try to add a line above explaining that the dot notation signifies derivative with respect to time.

Peter Gall said on 2020-02-19 22:27:01:

Hey, thanks a lot for your writeup.

I am a little confused about how you would actually compute the vector-Jacobian product with autodiff without explicitly constructing the Jacobian.

Do you have mock code that shows this? In e.g. pytorch?

Vaibhav Patel said on 2020-02-21 10:14:55:

Hey, that’s a great question! (I’ve edited my previous reply here to be hopefully more lucid. I also wrote a separate post to address this question and provided some code snippets there as well. Hope you find it useful!)

The vector-Jacobian product can be calculated using an autograd library like HIPS/autograd or pytorch.autograd. In such libraries the vjp is calculated using reverse-mode autodiff. The idea is that instead of instantiating full Jacobians, these libraries use precomputed simplified expressions of vjp’s for all functions one might encounter in deep learning. The computational cost of these simplified expressions is of the same order as their original functions owing to the sparsity of the Jacobians and the commonality in expressions of the functions and their derivatives. This is much cheaper than instantiating a full Jacobian matrix and then left multiplying it by a row vector.

Here are the exact lines (41-44) calling pytorch’s vjp in the torchdiffeq package by the authors.

```python
vjp_t, *vjp_y_and_params = torch.autograd.grad(
    func_eval, (t,) + y + f_params,
    tuple(-adj_y_ for adj_y_ in adj_y), allow_unused=True, retain_graph=True
)
```

This snippet requests the three efficient vector-Jacobian products. The left row vector in each case is the adjoint vector adj_y. The Jacobians are the derivatives of func_eval with respect to i) the time point t, ii) the state y at time point t, and iii) the neural net params f_params.

Actually, the above autodiff way of calculating the vjp is just the backpropagation of the gradient adj_y through f. The crucial point is that the neural net f is not our entire computation graph from the ODE solve - it’s an evolution function that just takes the state y to the “next” state in time.

Cheers!

Pere Diaz Lozano said on 2020-10-14 11:30:21:

Good afternoon,

Would you use the same technique to derive the expression to compute the gradient of the loss with respect to the initial time $t_0$?

I’ve been trying to do it but can not figure it out.

Thank you very much

Vaibhav Patel said on 2020-10-25 14:59:53:

Hi there, yes the gradient with respect to the initial time $t_0$ can also be determined using the same technique. To frame an initial value problem for getting $\frac{\mathrm{d} L}{\mathrm{d} t_0}$, we need

  1. The initial value $\frac{\mathrm{d} L}{\mathrm{d} t_n}$
  2. An ODE for $\frac{\mathrm{d}}{\mathrm{d}t}\left(\frac{\mathrm{d} L}{\mathrm{d} t}\right)$

For the initial value, we have

$$\frac{\mathrm{d} L}{\mathrm{d} t_n} = \frac{\mathrm{d} L}{\mathrm{d} z} \frac{\mathrm{d} z}{\mathrm{d} t_n} = \lambda(t_n) f(z(t_n), \theta, t_n)$$

For the ODE, we have

$$\frac{\mathrm{d}}{\mathrm{d}t}\left(\frac{\mathrm{d} L}{\mathrm{d} t}\right) = \frac{\mathrm{d} (\lambda f)}{\mathrm{d} t} = \dot{\lambda} f + \lambda \frac{\mathrm{d} f}{\mathrm{d} t} = -\lambda \frac{\partial f}{\partial z} f + \lambda \frac{\partial f}{\partial z}\frac{\mathrm{d} z}{\mathrm{d} t} + \lambda \frac{\partial f}{\partial t} = -\lambda \frac{\partial f}{\partial z} f + \lambda \frac{\partial f}{\partial z} f + \lambda \frac{\partial f}{\partial t} = \lambda \frac{\partial f}{\partial t}$$

So $\frac{\mathrm{d} L}{\mathrm{d} t_0} = \frac{\mathrm{d} L}{\mathrm{d} t_n} + \int_{t_1}^{t_0} \lambda(t) \frac{\partial f}{\partial t}\, dt$.

In the paper, it looks like they instead evaluate $\frac{\mathrm{d} L}{\mathrm{d} t_0} = -\left( -\frac{\mathrm{d} L}{\mathrm{d} t_n} - \int_{t_1}^{t_0} \lambda(t) \frac{\partial f}{\partial t}\, dt \right)$ to be able to get the dynamics for $\lambda(t)$, $\frac{\mathrm{d} L}{\mathrm{d} \theta}$ and $\frac{\mathrm{d} L}{\mathrm{d} t}$ by performing a single vector-Jacobian product between the adjoint $\lambda(t)$ and the Jacobian of $f(z, \theta, t)$.

Rishant said on 2022-05-20 23:05:57:

In the derivation, to obtain eq (8), λ(t) is considered independent of θ. Then we set a specific form of λ(t) in (11). Does it follow from (11) that indeed λ(t) as chosen is independent of θ?

Vaibhav Patel said on 2024-04-06 14:01:26:

Hi, apologies for the delayed reply. Great question. Loosely speaking, I comforted myself with the idea that the adjoint state $\lambda(t)$ was somewhat analogous to $z(t)$, in that they both may have a dependence on $\theta$ as “parameters” but not inputs.

For instance, if $\dot{z(t)} = f(z(t), \theta, t) = \theta z(t)$, then $z(t) = z_{t_0} e^{\theta (t - t_0)}$, and so $z(t)$ is more like $z_{\theta}(t)$.

Analogously, from (11) we have $\dot{\lambda(t)} = -\lambda(t) \frac{\partial f}{\partial z} = -\theta \lambda(t)$. So $\lambda(t) = \lambda(t_0) e^{-\theta (t - t_0)}$. There is definitely a parameter dependence on $\theta$, but I take it to similarly mean that we should write $\lambda_{\theta}(t)$.

Further, given the independence of $\theta$ from $t$, I think (11) is definitely “safer” because $\frac{\mathrm{d} \lambda}{\mathrm{d} t} = \frac{\partial \lambda}{\partial t} + \frac{\partial \lambda}{\partial \theta}\frac{\mathrm{d} \theta}{\mathrm{d} t} = \frac{\partial \lambda}{\partial t}$.

Anyway, I concede this isn’t very rigorous. Seemingly, https://arxiv.org/html/2402.15141v1 has a proof that clarifies the dependence of the adjoint state on the parameters.

Soumya said on 2022-11-01 20:12:28:

Hi,

Thank you very much for the detailed explanation. Could you please explain, in the neural ode paper equation (45), why $a_\theta(t_n) = 0$.

Vaibhav Patel said on 2024-04-06 16:28:17:

Hi there, sorry about the delay. It was admittedly a bit sneaky on my part - I actually swapped the limits of the integration and then stuck a minus sign in front. The form was influenced by the Neural ODEs paper, where they solve the ODEs for the adjoint and $\frac{\mathrm{d} L}{\mathrm{d} \theta}$ (which (11) and (12) should correspond to) in the same ODE pass from $t_1$ to $t_0$.

Jack said on 2023-08-13 04:37:54:

Thanks for your writeup and I wonder if there’s a typo: should the upper bound and the lower bound in your integral be t_0 and t_n?

Vaibhav Patel said on 2024-03-30 20:37:23:

Hi there, apologies for the very late reply. Great question. My interpretation is that since $a_\theta(t_n) = \frac{\mathrm{d} L}{\mathrm{d} \theta(t_n)}$, by setting $a_\theta(t_n) = 0$ and flowing the adjoint system backward, we’re expressing our desire to find the params that will land us in a local minimum of $L$ when the system is run in forward time. Basically, doing stochastic gradient descent to a local minimum of the loss for the current batch of training samples.

References


  1. Chen TQ, Rubanova Y, Bettencourt J, Duvenaud DK. Neural ordinary differential equations. In: Advances in Neural Information Processing Systems; 2018:6571–6583.

  2. Pontryagin LS, Mishchenko E, Boltyanskii V, Gamkrelidze R. The Mathematical Theory of Optimal Processes. 1962.

  3. Bradley AM. PDE-Constrained Optimization and the Adjoint Method. Technical Report, Stanford University. https://cs.stanford.edu/ambrad …; 2013.

  4. Cao Y, Li S, Petzold L, Serban R. Adjoint Sensitivity Analysis for Differential-Algebraic Equations: The Adjoint DAE System and Its Numerical Solution. SIAM Journal on Scientific Computing. 2003;24:1076-1089. doi:10.1137/S1064827501380630

  5. Biegler LT. Optimization of Differential-Algebraic Equation Systems. Chemical Engineering Department, Carnegie Mellon University, Pittsburgh. http://dynopt.cheme.cmu.edu. 2000.

  6. Actually, I believe in optimal control theory it is more appropriately called a costate variable.

  7. A reason is that this way, all the ODEs associated with our gradient calculation run backwards. This consistency can help make the calls to our ODE solver more concise.

  8. I think to say that $\lambda$ and $a$ are the same, we need $\lambda$ to be uniformly Lipschitz continuous, after which we can invoke Picard’s uniqueness theorem.