Synthetic Gradients - doesn't seem beneficial


I can see two motives to use Synthetic Gradients in RNNs:


  1. To speed up training, by immediately correcting each layer with a predicted gradient

  2. To be able to learn longer sequences




I see problems with both of them. Please note, I really like Synthetic Gradients and would like to implement them. But I need to understand where my train of thought is incorrect.


I will now show why Point 1 and Point 2 don't seem to be beneficial, and I need you to correct me if they are actually beneficial:





Synthetic Gradients tell us we can rely on another "mini-helper-network" (called DNI) to advise our current layer about what gradients will arrive from above, even during forward prop.


However, such gradients will only come several operations later. The same amount of backprop will have to be done as without the DNI, except that now we also need to train our DNI.


Adding this asynchronicity shouldn't make layers train faster than during the traditional "locked" full fwdprop -> full backprop sequence, because the same number of computations must be done by the device. It's just that the computations will be slid in time.




This makes me think Point 1) will not work. Simply adding SG between each pair of layers
shouldn't improve the training speed.




OK, how about adding SG only on the last layer, to predict the "gradient from the future", and only if it's the final timestep during forward prop?


This way, even though our LSTM has to stop predicting and must backpropagate, it can still predict the future gradient it would have received (with the help of the DNI sitting on the last timestep).





Consider several training sessions (session A, session B):




fwdprop timestep_1A ---> fwdprop timestep_2A ---> fwdprop timestep_3A
----> stop and bkprop!



fwdprop timestep_1B ---> fwdprop timestep_2B ---> fwdprop timestep_3B
----> stop and bkprop!




We've just forced our network to "parse" 6 timesteps in two halves: 3 timesteps, then the remaining 3 timesteps.





Notice, we have our DNI sitting at the very end of "Session A", predicting "what gradient I would get flowing in from the beginning of Session B (from the future)".
Because of that, timestep_3A will be equipped with the gradient "that would have come from timestep_1B", so indeed, the corrections done during A will be more reliable.
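
To make this concrete, here is a minimal sketch of the setup I am describing (PyTorch assumed; the toy sizes, random data, and module names are my own, not from the paper): every 3-timestep session is backpropagated as usual, except that the last hidden state also receives a gradient predicted by a small DNI, and that DNI is then regressed toward the gradient that actually arrives at the boundary one session later.

import torch
import torch.nn as nn

inp, hid, chunk = 8, 16, 3
rnn, readout = nn.RNNCell(inp, hid), nn.Linear(hid, 1)
dni = nn.Linear(hid, hid)                       # predicts the "gradient from the future" at the boundary

opt = torch.optim.SGD([*rnn.parameters(), *readout.parameters()], lr=1e-2)
opt_dni = torch.optim.SGD(dni.parameters(), lr=1e-3)

h_prev = torch.zeros(1, hid)

for session in range(10):                       # each iteration = one 3-timestep "session"
    x = torch.randn(chunk, 1, inp)              # toy random data, only to make the sketch runnable
    y = torch.randn(chunk, 1, 1)

    h0 = h_prev.detach().requires_grad_(True)   # truncation boundary (start of this session)
    h, loss = h0, 0.0
    for t in range(chunk):
        h = rnn(x[t], h)
        loss = loss + ((readout(h) - y[t]) ** 2).mean()

    opt.zero_grad()
    loss.backward(retain_graph=True)            # true gradients inside the session
    h.backward(dni(h.detach()).detach())        # plus the DNI's predicted gradient "from session B"
    opt.step()

    # The gradient that just arrived at h0 is what the DNI should have predicted
    # at the previous boundary, so regress the DNI toward it.
    opt_dni.zero_grad()
    ((dni(h_prev) - h0.grad.detach()) ** 2).mean().backward()
    opt_dni.step()

    h_prev = h.detach()                         # carry the hidden state into the next session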



But, hey! These predicted "synthetic gradients" will be very small (negligible) anyway - after all, that's why we start a new backprop session B. If they weren't so small, we would just parse all 6 timesteps in a single long bkprop "session A".




Therefore I think Point 2) shouldn't give a benefit either. Adding SG on the last
timestep of fwdprop appears to allow training longer sequences effectively, but
the vanishing gradients haven't gone anywhere.






OK. Maybe we can get a benefit from training "session A", "session B", etc. on separate machines? But then how is this different from simply training with the usual minibatches in parallel? Keep in mind what was mentioned in Point 2: things are worsened by session A predicting gradients which are vanishing anyway.



Question:
Please help me understand the benefit of Synthetic Gradients, because the two points above don't seem to be beneficial.










backpropagation

asked May 23 '18 at 21:09 by Kari

  • Why do you think this won't speed up training? The only justification I see is the bare assertion that this "shouldn't improve training speed", but you don't provide your reasoning. Also, it's not clear what you mean by "step 1)", as you haven't described any steps in the question. In any case, the paper demonstrates that it does provide speedups. Data beats theory any day. Have you read the paper?
    – D.W., May 24 '18 at 1:44

  • Data beats theory any day, I agree, but the best counterexample I can make is GPUs vs CPUs. People everywhere keep saying a GPU runs orders of magnitude faster than a CPU, and provide comparisons. However, a properly coded multithreaded CPU is only 2-3 times slower than a GPU of the same category, and is cheaper than the GPU. larsjuhljensen.wordpress.com/2011/01/28/… Once again, I am not going against Synthetic Gradients - they seem awesome - it's just that until I can get an answer to my post, I won't be able to rest :D
    – Kari, May 24 '18 at 5:47

  • I'm not sure that a 7-year-old blog post about BLAST is terribly relevant here.
    – D.W., May 24 '18 at 7:00

  • What I am trying to say is "there are ways to make parallelism seem better than it might actually be", in any scenario.
    – Kari, May 24 '18 at 11:36
















2 Answers

It's important to understand how to update any DNI module. To clear things up, consider an example of a network with several layers and 3 DNI modules:



 input
|
V
Layer_0 & DNI_0
Layer_1
Layer_2
Layer_3 & DNI_3
Layer_4
Layer_5 & DNI_5
Layer_6
Layer_7
|
V
output



DNI_0 is always trained with a synthetic gradient arriving from DNI_3 (flowing back through Layer_2 and Layer_1, of course), which sits several layers further up.


Likewise, DNI_3 is always trained with a synthetic gradient arriving from DNI_5.


Neither DNI_0 nor DNI_3 will ever get to see the true gradient, because the true gradient is only delivered to DNI_5, and not to the earlier modules.


For anyone still struggling to understand them, read part 3 of this awesome blog post.




Earlier layers will have to be content with synthetic gradients, because they or their DNI will never witness the "true gradient".
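
Here is a minimal sketch of that update rule (PyTorch assumed; toy sizes, two segments instead of eight layers, and all the names are my own, not the paper's code): each segment is corrected immediately from its own DNI, and each DNI is regressed toward the gradient produced by the next DNI, backpropagated through the segment in between - only the very last target is the true gradient.

import torch
import torch.nn as nn

d = 16
seg_A = nn.Sequential(nn.Linear(d, d), nn.Tanh())             # Layer_0..2 analogue, DNI_A at its output
seg_B = nn.Sequential(nn.Linear(d, d), nn.Tanh())             # Layer_3..5 analogue, DNI_B at its output
seg_C = nn.Sequential(nn.Linear(d, d), nn.Tanh(), nn.Linear(d, 1))  # Layer_6..7 analogue, produces the loss
dni_A, dni_B = nn.Linear(d, d), nn.Linear(d, d)

opts = {m: torch.optim.SGD(m.parameters(), lr=1e-2)
        for m in (seg_A, seg_B, seg_C, dni_A, dni_B)}

def decoupled_update(module, out, grad_out):
    # Backprop grad_out from `out` into `module`'s parameters and step immediately.
    opts[module].zero_grad()
    out.backward(grad_out)
    opts[module].step()

x, y = torch.randn(4, d), torch.randn(4, 1)                    # toy random data

# ---- forward passes with immediate (decoupled) weight updates ----
a_out = seg_A(x)
decoupled_update(seg_A, a_out, dni_A(a_out.detach()).detach()) # never waits for the true gradient

b_in = a_out.detach().requires_grad_(True)
b_out = seg_B(b_in)
decoupled_update(seg_B, b_out, dni_B(b_out.detach()).detach())
target_A = b_in.grad.detach()      # DNI_A's training target: produced by DNI_B, not the true gradient

c_in = b_out.detach().requires_grad_(True)
loss = ((seg_C(c_in) - y) ** 2).mean()
opts[seg_C].zero_grad()
loss.backward()
opts[seg_C].step()
target_B = c_in.grad.detach()      # only the last DNI is trained toward the true gradient

# ---- train the DNIs toward their (mostly synthetic) targets ----
for dni, inp, tgt in ((dni_A, a_out.detach(), target_A), (dni_B, b_out.detach(), target_B)):
    opts[dni].zero_grad()
    ((dni(inp) - tgt) ** 2).mean().backward()
    opts[dni].step()

Note how target_A is manufactured entirely out of DNI_B's prediction, which mirrors the DNI_0 <- DNI_3 <- DNI_5 chain above.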





Regarding training in parallel with minibatches instead of parallelizing via Synthetic Gradients:


Longer sequences are more precise than minibatches; however, minibatches add a regularization effect. But, given some technique to prevent gradients from exploding or vanishing, training on longer sequences can provide much better insight into the context of the problem. That's because the network infers its output after considering a longer sequence of input, so the outcome is more rational.


For a comparison of the benefits granted by SG, refer to the diagrams on page 6 of the paper; the main one is being able to solve longer sequences, which I feel is the most beneficial (we can already parallelize via minibatches anyway, and thus SG shouldn't speed up the process when performed on the same machine - even if we indeed only propagate up to the next DNI).





However, the more DNI modules we have, the noisier the signal should be. So it might be worthwhile to train the layers and the DNIs with plain legacy backprop at first, and only after some epochs have elapsed start using the DNI bootstrapping discussed above.


That way, the earliest DNI will acquire at least some sense of what to expect at the start of training. That's because the later DNIs are themselves unsure of what the true gradient actually looks like when training begins, so initially they will be advising a "garbage" gradient to anything sitting earlier than them.


Don't forget that the authors also experimented with predicting the actual inputs for every layer.





And finally, one of the largest benefits: once the DNIs are sufficiently well trained, it's possible to correct the network with the predicted gradient immediately after a forward prop has occurred. There is no need to keep running expensive backpropagation (with the chain rule etc.), because the trained DNI already has a good idea of what the gradient will be. We can begin trusting that DNI more and more.



Usually, we have:



fwdProp(), bkProp(), fwdProp(), bkProp()...



With synthetic gradients we can have:



fwdProp(), bkProp(), bkProp(), bkProp(), fwdProp()...
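
One way to realize that idea, as a self-contained toy sketch (PyTorch assumed; the sizes, the every-k-steps rule, and all names are my own): most steps correct the lower part of the network immediately after the forward pass using the DNI's prediction, and only every k-th step pays for a true backward pass, which is also when the DNI itself gets retrained. In this sketch the upper part is only updated on the true-backprop steps, purely to keep it short.

import torch
import torch.nn as nn

d, k = 32, 4
bottom, top = nn.Linear(d, d), nn.Linear(d, 1)     # the part below / above the DNI
dni = nn.Linear(d, d)                              # predicts dLoss/d(bottom's output)
opt_bottom = torch.optim.SGD(bottom.parameters(), lr=1e-2)
opt_top = torch.optim.SGD(top.parameters(), lr=1e-2)
opt_dni = torch.optim.SGD(dni.parameters(), lr=1e-3)

for step in range(20):
    x, y = torch.randn(8, d), torch.randn(8, 1)    # toy random data
    h = bottom(x)

    if step % k:                                   # most steps: fwdProp() + immediate DNI correction
        opt_bottom.zero_grad()
        h.backward(dni(h.detach()).detach())
        opt_bottom.step()
    else:                                          # every k-th step: pay for a true bkProp()
        h_in = h.detach().requires_grad_(True)
        loss = ((top(h_in) - y) ** 2).mean()
        opt_top.zero_grad()
        loss.backward()
        opt_top.step()

        opt_bottom.zero_grad()
        h.backward(h_in.grad.detach())             # true gradient for the bottom layer
        opt_bottom.step()

        opt_dni.zero_grad()                        # retrain the DNI toward that true gradient
        ((dni(h.detach()) - h_in.grad.detach()) ** 2).mean().backward()
        opt_dni.step()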





Minibatches give us a speedup (via parallelization) and also give us regularization.
Synthetic Gradients allow us to infer better by working with longer sequences, plus several consecutive bkprops occasionally. Altogether this is a very powerful system.






answered May 31 '18 at 1:01 by Kari

Synthetic gradients make training faster, not by reducing the number of epochs needed or by speeding up the convergence of gradient descent, but rather by making each epoch faster to compute. The synthetic gradient is faster to compute than the real gradient (computing the synthetic gradient is faster than the backpropagation), so each iteration of gradient descent can be computed more rapidly.
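
A rough toy illustration of that cost difference (PyTorch assumed; the sizes, the dummy loss, and the names are made up, and the timings are only indicative, not a benchmark): producing an update signal for a layer from a DNI is a single small forward pass, whereas the true gradient needs a forward and a backward pass through every layer above it.

import time
import torch
import torch.nn as nn

d, depth = 512, 40
first = nn.Linear(d, d)                                          # the layer we want to update right away
rest = nn.Sequential(*[nn.Linear(d, d) for _ in range(depth)])   # everything sitting above it
dni = nn.Linear(d, d)                                            # small helper that guesses dLoss/dh
x = torch.randn(64, d)

h = first(x)

t0 = time.time()
sg = dni(h.detach())                    # synthetic gradient: one cheap forward pass
t_synthetic = time.time() - t0

t0 = time.time()
loss = rest(h).pow(2).mean()            # true gradient: forward + backward through all `depth` layers above
loss.backward()
t_true = time.time() - t0

print(f"synthetic: {t_synthetic * 1e3:.2f} ms   true: {t_true * 1e3:.2f} ms")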






answered May 24 '18 at 7:01 by D.W.

  • From my understanding, time-wise the gradients shouldn't reach the DNI faster; it's just that they are now slid in time and computed asynchronously while the forward prop is occurring. The DNI will still have to get the true gradient to train itself. So Synthetic Gradients should require the same number of computations to be done in parallel as with standard BPTT. Is this correct?
    – Kari, May 24 '18 at 11:14

  • So there wouldn't be any speedup in simply introducing the SG between the layers. Yes, we do get the immediate predicted gradient from the DNI, but for each such prediction we will eventually have to pay the price with an asynchronous full backpropagation towards that DNI, a bit later.
    – Kari, May 24 '18 at 11:35

  • @Kari, no, that doesn't sound right to me. If you need the same number of iterations, but now each iteration takes, say, 50% less time on the GPU, then the resulting computation will be done earlier. Even if you need 10% more iterations/epochs (because the gradients are delayed or the synthetic gradients don't perfectly match the actual gradients), that's still a win: the speedup from being able to compute the synthetic gradient faster than the real gradient outweighs other effects. You seem confident that this can't help, but the data in the paper shows it does help.
    – D.W., May 24 '18 at 13:11

  • Hm, well, for example, we have 4 layers sitting after our DNI. In normal backprop, we would have 4 "forward" exchanges between the layers and then 4 "backward" exchanges, and while this occurs the system is locked. With DNI we can straight away correct our weights, but will need to get the true gradients later on. But now the system is not locked, allowing in the meantime for more forward passes to slide by. But we still owe the true gradient from before to our DNI... To obtain and deliver this gradient back to the DNI it will take 100% of the time (same 4 steps forward, same 4 steps backward).
    – Kari, May 24 '18 at 13:22

  • It's just that our DNI says "fine, give them when possible, later on", but we still have to pay the full price, so I don't see the performance increase. I agree, the papers show great results, but how come? We can already train minibatches in parallel anyway :/
    – Kari, May 24 '18 at 13:56











    Your Answer





    StackExchange.ifUsing("editor", function () {
    return StackExchange.using("mathjaxEditing", function () {
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
    });
    });
    }, "mathjax-editing");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "557"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f32072%2fsynthetic-gradients-doesnt-seem-beneficial%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    2












    $begingroup$

    It's important to understand how to update any DNI module. To clear things up, consider an example of network with several layers and 3 DNI modules:



     input
    |
    V
    Layer_0 & DNI_0
    Layer_1
    Layer_2
    Layer_3 & DNI_3
    Layer_4
    Layer_5 & DNI_5
    Layer_6
    Layer_7
    |
    V
    output



    The DNI_0 is always trained with a synthetic gradient arriving from DNI_3 (flowing through Layer_2, and Layer_1 of course), siting several layers further.



    Likewise, the DNI_3 is always trained with a synthetic gradient arriving from DNI_5



    DNI_0 or DNI_3 will never get to see the true gradient, because true grad is only delivered to DNI_5, and not earlier.



    For anyone still struggling to understand them, read this awesome blogpost, part 3




    Earlier layers will have to be content with synthetic gradients, because they or their DNI will never witness the "true gradient".





    Regarding training in parallel with minibatches instead of Parallelizing via Synthetic Grads:



    Longer sequences are more precise than minibatches, however minibatches add a regulization effect. But, given some technique to prevent gradient from exploding or vanishing, training longer sequences can provide a lot better insight into context of the problem. That's because network infers output after considering a longer sequence of input, so the outcome is more rational.



    For the comparison of benefits granted by SG refer to the diagrams page 6 of the Paper, mainly being able to solve longer sequences, which I feel is most beneficial (we can already parallelize via Minibatches anyway, and thus SG shouldn't speed up the process when performed on the same machine - even if we indeed only propagate up to the next DNI).





    However, the more DNI modules we have, the noisier the signal should
    be. So it might be worth-while to train the layers and DNI all by the legacy backprop, and only after some epochs have elapsed we start using the DNI-bootstrapping discussed above.



    That way, the earliest DNI will acquire at least some sense of what to expect at the start of the training. That's because the following DNIs are themselves unsure of what the true gradient actually looks like, when the training begins, so initially, they will be advising "garbage" gradient to anyone sitting earlier than them.



    Don't forget that authors also experimented with predicting the actual inputs for every layer.





    And finally, one of the largest benefits: Once DNI's are sufficiently well-trained, it's possible to correct the network by the predicted gradient immediately after a forward prop has occurred. There is no need to keep running expensive backpropagation (with chain rules etc), because the trained DNI already has a good idea of what the gradient will be. We can begin trusting that DNI more and more.



    Usually, we have:



    fwdProp(), bkProp(), fwdProp(), bkProp()...



    With synthetic gradients we can have:



    fwdProp(), bkProp(), bkProp(), bkProp(), fwdProp()...





    Minibatches gives us speedup (via parallelization) and also give us the regularization.
    Synthetic Gradients allow us to infer better by working with longer sequences + several consecutive bkprops ocasionally. All together this is a very powerful system.






    share|improve this answer











    $endgroup$


















      2












      $begingroup$

      It's important to understand how to update any DNI module. To clear things up, consider an example of network with several layers and 3 DNI modules:



       input
      |
      V
      Layer_0 & DNI_0
      Layer_1
      Layer_2
      Layer_3 & DNI_3
      Layer_4
      Layer_5 & DNI_5
      Layer_6
      Layer_7
      |
      V
      output



      The DNI_0 is always trained with a synthetic gradient arriving from DNI_3 (flowing through Layer_2, and Layer_1 of course), siting several layers further.



      Likewise, the DNI_3 is always trained with a synthetic gradient arriving from DNI_5



      DNI_0 or DNI_3 will never get to see the true gradient, because true grad is only delivered to DNI_5, and not earlier.



      For anyone still struggling to understand them, read this awesome blogpost, part 3




      Earlier layers will have to be content with synthetic gradients, because they or their DNI will never witness the "true gradient".





      Regarding training in parallel with minibatches instead of Parallelizing via Synthetic Grads:



      Longer sequences are more precise than minibatches, however minibatches add a regulization effect. But, given some technique to prevent gradient from exploding or vanishing, training longer sequences can provide a lot better insight into context of the problem. That's because network infers output after considering a longer sequence of input, so the outcome is more rational.



      For the comparison of benefits granted by SG refer to the diagrams page 6 of the Paper, mainly being able to solve longer sequences, which I feel is most beneficial (we can already parallelize via Minibatches anyway, and thus SG shouldn't speed up the process when performed on the same machine - even if we indeed only propagate up to the next DNI).





      However, the more DNI modules we have, the noisier the signal should
      be. So it might be worth-while to train the layers and DNI all by the legacy backprop, and only after some epochs have elapsed we start using the DNI-bootstrapping discussed above.



      That way, the earliest DNI will acquire at least some sense of what to expect at the start of the training. That's because the following DNIs are themselves unsure of what the true gradient actually looks like, when the training begins, so initially, they will be advising "garbage" gradient to anyone sitting earlier than them.



      Don't forget that authors also experimented with predicting the actual inputs for every layer.





      And finally, one of the largest benefits: Once DNI's are sufficiently well-trained, it's possible to correct the network by the predicted gradient immediately after a forward prop has occurred. There is no need to keep running expensive backpropagation (with chain rules etc), because the trained DNI already has a good idea of what the gradient will be. We can begin trusting that DNI more and more.



      Usually, we have:



      fwdProp(), bkProp(), fwdProp(), bkProp()...



      With synthetic gradients we can have:



      fwdProp(), bkProp(), bkProp(), bkProp(), fwdProp()...





      Minibatches gives us speedup (via parallelization) and also give us the regularization.
      Synthetic Gradients allow us to infer better by working with longer sequences + several consecutive bkprops ocasionally. All together this is a very powerful system.






      share|improve this answer











      $endgroup$
















        2












        2








        2





        $begingroup$

        It's important to understand how to update any DNI module. To clear things up, consider an example of network with several layers and 3 DNI modules:



         input
        |
        V
        Layer_0 & DNI_0
        Layer_1
        Layer_2
        Layer_3 & DNI_3
        Layer_4
        Layer_5 & DNI_5
        Layer_6
        Layer_7
        |
        V
        output



        The DNI_0 is always trained with a synthetic gradient arriving from DNI_3 (flowing through Layer_2, and Layer_1 of course), siting several layers further.



        Likewise, the DNI_3 is always trained with a synthetic gradient arriving from DNI_5



        DNI_0 or DNI_3 will never get to see the true gradient, because true grad is only delivered to DNI_5, and not earlier.



        For anyone still struggling to understand them, read this awesome blogpost, part 3




        Earlier layers will have to be content with synthetic gradients, because they or their DNI will never witness the "true gradient".





        Regarding training in parallel with minibatches instead of Parallelizing via Synthetic Grads:



        Longer sequences are more precise than minibatches, however minibatches add a regulization effect. But, given some technique to prevent gradient from exploding or vanishing, training longer sequences can provide a lot better insight into context of the problem. That's because network infers output after considering a longer sequence of input, so the outcome is more rational.



        For the comparison of benefits granted by SG refer to the diagrams page 6 of the Paper, mainly being able to solve longer sequences, which I feel is most beneficial (we can already parallelize via Minibatches anyway, and thus SG shouldn't speed up the process when performed on the same machine - even if we indeed only propagate up to the next DNI).





        However, the more DNI modules we have, the noisier the signal should
        be. So it might be worth-while to train the layers and DNI all by the legacy backprop, and only after some epochs have elapsed we start using the DNI-bootstrapping discussed above.



        That way, the earliest DNI will acquire at least some sense of what to expect at the start of the training. That's because the following DNIs are themselves unsure of what the true gradient actually looks like, when the training begins, so initially, they will be advising "garbage" gradient to anyone sitting earlier than them.



        Don't forget that authors also experimented with predicting the actual inputs for every layer.





        And finally, one of the largest benefits: Once DNI's are sufficiently well-trained, it's possible to correct the network by the predicted gradient immediately after a forward prop has occurred. There is no need to keep running expensive backpropagation (with chain rules etc), because the trained DNI already has a good idea of what the gradient will be. We can begin trusting that DNI more and more.



        Usually, we have:



        fwdProp(), bkProp(), fwdProp(), bkProp()...



        With synthetic gradients we can have:



        fwdProp(), bkProp(), bkProp(), bkProp(), fwdProp()...





        Minibatches gives us speedup (via parallelization) and also give us the regularization.
        Synthetic Gradients allow us to infer better by working with longer sequences + several consecutive bkprops ocasionally. All together this is a very powerful system.






        share|improve this answer











        $endgroup$



        It's important to understand how to update any DNI module. To clear things up, consider an example of network with several layers and 3 DNI modules:



         input
        |
        V
        Layer_0 & DNI_0
        Layer_1
        Layer_2
        Layer_3 & DNI_3
        Layer_4
        Layer_5 & DNI_5
        Layer_6
        Layer_7
        |
        V
        output



        The DNI_0 is always trained with a synthetic gradient arriving from DNI_3 (flowing through Layer_2, and Layer_1 of course), siting several layers further.



        Likewise, the DNI_3 is always trained with a synthetic gradient arriving from DNI_5



        DNI_0 or DNI_3 will never get to see the true gradient, because true grad is only delivered to DNI_5, and not earlier.



        For anyone still struggling to understand them, read this awesome blogpost, part 3




        Earlier layers will have to be content with synthetic gradients, because they or their DNI will never witness the "true gradient".





        Regarding training in parallel with minibatches instead of Parallelizing via Synthetic Grads:



        Longer sequences are more precise than minibatches, however minibatches add a regulization effect. But, given some technique to prevent gradient from exploding or vanishing, training longer sequences can provide a lot better insight into context of the problem. That's because network infers output after considering a longer sequence of input, so the outcome is more rational.



        For the comparison of benefits granted by SG refer to the diagrams page 6 of the Paper, mainly being able to solve longer sequences, which I feel is most beneficial (we can already parallelize via Minibatches anyway, and thus SG shouldn't speed up the process when performed on the same machine - even if we indeed only propagate up to the next DNI).





        However, the more DNI modules we have, the noisier the signal should
        be. So it might be worth-while to train the layers and DNI all by the legacy backprop, and only after some epochs have elapsed we start using the DNI-bootstrapping discussed above.



        That way, the earliest DNI will acquire at least some sense of what to expect at the start of the training. That's because the following DNIs are themselves unsure of what the true gradient actually looks like, when the training begins, so initially, they will be advising "garbage" gradient to anyone sitting earlier than them.



        Don't forget that authors also experimented with predicting the actual inputs for every layer.





        And finally, one of the largest benefits: Once DNI's are sufficiently well-trained, it's possible to correct the network by the predicted gradient immediately after a forward prop has occurred. There is no need to keep running expensive backpropagation (with chain rules etc), because the trained DNI already has a good idea of what the gradient will be. We can begin trusting that DNI more and more.



        Usually, we have:



        fwdProp(), bkProp(), fwdProp(), bkProp()...



        With synthetic gradients we can have:



        fwdProp(), bkProp(), bkProp(), bkProp(), fwdProp()...





        Minibatches gives us speedup (via parallelization) and also give us the regularization.
        Synthetic Gradients allow us to infer better by working with longer sequences + several consecutive bkprops ocasionally. All together this is a very powerful system.







        share|improve this answer














        share|improve this answer



        share|improve this answer








        edited 7 mins ago

























        answered May 31 '18 at 1:01









        KariKari

        614321




        614321























            0












            $begingroup$

            Synthetic gradients make training faster, not by reducing the number of epochs needed or by speeding up the convergence of gradient descent, but rather by making each epoch faster to compute. The synthetic gradient is faster to compute than the real gradient (computing the synthetic gradient is faster than the backpropagation), so each iteration of gradient descent can be computed more rapidly.






            share|improve this answer









            $endgroup$













            • $begingroup$
              From my understanding, time-wise the gradients shouldn't reach the DNI faster, it's just that they are now slid in time, and computed asynchronously while the forward prop is occurring. The DNI will still have to get the true gradient to train itself. So the Synthetic Gradients should require the same number of computations to be done in parallel as when with standard BPTT. Is this correct?
              $endgroup$
              – Kari
              May 24 '18 at 11:14












            • $begingroup$
              So there wouldn't be any speedup in simply introducing the SG between the layers. Yes, we do get the immediate predicted gradient from the DNI, but for each such prediction we will eventually have to pay the price by asynchronous full back propagation towards that DNI, a bit later
              $endgroup$
              – Kari
              May 24 '18 at 11:35










            • $begingroup$
              @Kari, no, that doesn't sound right to me. If you need the same number of iterations, but now each iteration takes the same 50% less time on the GPU, then the resulting computation will be done earlier. Even if you need 10% more iterations/epochs (because the gradients are delayed or the synthetic gradients don't perfectly match the actual gradients), that's still a win: the speedup from being able to compute the synthetic gradient faster than the real gradient outweighs other effects. You seem confident that this can't help, but the data in the paper shows it does help.
              $endgroup$
              – D.W.
              May 24 '18 at 13:11












            • $begingroup$
              Hm, well for example, we have 4 layers sitting after our DNI; In normal backprop, we would have 4 "forward" exchanges between the layers, and then 4 "backwards exchanges", and while this occurs the system is locked. With DNI we can straight away correct our weights, but will need to get true gradients later on.But now, the system is not locked, allowing in the meantime for more forward passes to slide by. But we still owe the true gradient from before, to our DNI ...To obtain and deliver this gradient back to DNI it will take 100% of the time (same 4 steps forward, same 4 steps backward).
              $endgroup$
              – Kari
              May 24 '18 at 13:22












            • $begingroup$
              It's just that our DNI says "fine, give them when possible, later on", but we still got to pay the full price, so I don't see the performance increase. I agree, the papers show great results, but how come? We already can train minibatches in parallel anyway :/
              $endgroup$
              – Kari
              May 24 '18 at 13:56
















            0












            $begingroup$

            Synthetic gradients make training faster, not by reducing the number of epochs needed or by speeding up the convergence of gradient descent, but rather by making each epoch faster to compute. The synthetic gradient is faster to compute than the real gradient (computing the synthetic gradient is faster than the backpropagation), so each iteration of gradient descent can be computed more rapidly.






            share|improve this answer









            $endgroup$













            • $begingroup$
              From my understanding, time-wise the gradients shouldn't reach the DNI faster, it's just that they are now slid in time, and computed asynchronously while the forward prop is occurring. The DNI will still have to get the true gradient to train itself. So the Synthetic Gradients should require the same number of computations to be done in parallel as when with standard BPTT. Is this correct?
              $endgroup$
              – Kari
              May 24 '18 at 11:14












            • $begingroup$
              So there wouldn't be any speedup in simply introducing the SG between the layers. Yes, we do get the immediate predicted gradient from the DNI, but for each such prediction we will eventually have to pay the price by asynchronous full back propagation towards that DNI, a bit later
              $endgroup$
              – Kari
              May 24 '18 at 11:35










            • $begingroup$
              @Kari, no, that doesn't sound right to me. If you need the same number of iterations, but now each iteration takes the same 50% less time on the GPU, then the resulting computation will be done earlier. Even if you need 10% more iterations/epochs (because the gradients are delayed or the synthetic gradients don't perfectly match the actual gradients), that's still a win: the speedup from being able to compute the synthetic gradient faster than the real gradient outweighs other effects. You seem confident that this can't help, but the data in the paper shows it does help.
              $endgroup$
              – D.W.
              May 24 '18 at 13:11












            • $begingroup$
              Hm, well for example, we have 4 layers sitting after our DNI; In normal backprop, we would have 4 "forward" exchanges between the layers, and then 4 "backwards exchanges", and while this occurs the system is locked. With DNI we can straight away correct our weights, but will need to get true gradients later on.But now, the system is not locked, allowing in the meantime for more forward passes to slide by. But we still owe the true gradient from before, to our DNI ...To obtain and deliver this gradient back to DNI it will take 100% of the time (same 4 steps forward, same 4 steps backward).
              $endgroup$
              – Kari
              May 24 '18 at 13:22












            • $begingroup$
              It's just that our DNI says "fine, give them when possible, later on", but we still got to pay the full price, so I don't see the performance increase. I agree, the papers show great results, but how come? We already can train minibatches in parallel anyway :/
              $endgroup$
              – Kari
              May 24 '18 at 13:56














            0












            0








            0





            $begingroup$

            Synthetic gradients make training faster, not by reducing the number of epochs needed or by speeding up the convergence of gradient descent, but rather by making each epoch faster to compute. The synthetic gradient is faster to compute than the real gradient (computing the synthetic gradient is faster than the backpropagation), so each iteration of gradient descent can be computed more rapidly.






            share|improve this answer









            $endgroup$



            Synthetic gradients make training faster, not by reducing the number of epochs needed or by speeding up the convergence of gradient descent, but rather by making each epoch faster to compute. The synthetic gradient is faster to compute than the real gradient (computing the synthetic gradient is faster than the backpropagation), so each iteration of gradient descent can be computed more rapidly.







            share|improve this answer












            share|improve this answer



            share|improve this answer










            answered May 24 '18 at 7:01









            D.W.D.W.

            2,103628




            2,103628












            • $begingroup$
              From my understanding, time-wise the gradients shouldn't reach the DNI faster, it's just that they are now slid in time, and computed asynchronously while the forward prop is occurring. The DNI will still have to get the true gradient to train itself. So the Synthetic Gradients should require the same number of computations to be done in parallel as when with standard BPTT. Is this correct?
              $endgroup$
              – Kari
              May 24 '18 at 11:14












            • $begingroup$
              So there wouldn't be any speedup in simply introducing the SG between the layers. Yes, we do get the immediate predicted gradient from the DNI, but for each such prediction we will eventually have to pay the price by asynchronous full back propagation towards that DNI, a bit later
              $endgroup$
              – Kari
              May 24 '18 at 11:35










            • $begingroup$
              @Kari, no, that doesn't sound right to me. If you need the same number of iterations, but now each iteration takes the same 50% less time on the GPU, then the resulting computation will be done earlier. Even if you need 10% more iterations/epochs (because the gradients are delayed or the synthetic gradients don't perfectly match the actual gradients), that's still a win: the speedup from being able to compute the synthetic gradient faster than the real gradient outweighs other effects. You seem confident that this can't help, but the data in the paper shows it does help.
              $endgroup$
              – D.W.
              May 24 '18 at 13:11












            • $begingroup$
              Hm, well for example, we have 4 layers sitting after our DNI; In normal backprop, we would have 4 "forward" exchanges between the layers, and then 4 "backwards exchanges", and while this occurs the system is locked. With DNI we can straight away correct our weights, but will need to get true gradients later on.But now, the system is not locked, allowing in the meantime for more forward passes to slide by. But we still owe the true gradient from before, to our DNI ...To obtain and deliver this gradient back to DNI it will take 100% of the time (same 4 steps forward, same 4 steps backward).
              $endgroup$
              – Kari
              May 24 '18 at 13:22












            • $begingroup$
              It's just that our DNI says "fine, give them to me when possible, later on", but we still have to pay the full price, so I don't see the performance increase. I agree the papers show great results, but how come? We can already train minibatches in parallel anyway :/
              $endgroup$
              – Kari
              May 24 '18 at 13:56
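
            Taking the illustrative numbers from D.W.'s comment above at face value (iterations 50% cheaper, 10% more of them needed; neither figure is from the paper), the back-of-the-envelope accounting is
            $$T_{\text{SG}} \approx 1.10 \times 0.50 \times T_{\text{baseline}} = 0.55\,T_{\text{baseline}},$$
            i.e. roughly a 45% reduction in wall-clock time despite the extra iterations.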
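
            To see where a speedup could come from in the 4-layer example above even though the true gradient still costs the same 4 forward and 4 backward steps, here is a toy timing model. It assumes the part below the DNI and the 4 layers above it can run concurrently (e.g. on separate devices or streams) once the update lock is removed; all costs are made-up illustrative numbers, not measurements.

                # Toy timing model for the 4-layer example (illustrative costs only).
                f_lo, b_lo = 1.0, 1.0   # forward/backward cost of the part below the DNI
                f_hi, b_hi = 4.0, 4.0   # forward/backward cost of the 4 layers above the DNI
                dni_cost   = 0.1        # evaluating/training the small DNI predictor

                # Locked (standard backprop): the lower part idles while the upper part
                # finishes its forward and backward passes, so the costs add up per batch.
                locked = f_lo + f_hi + b_hi + b_lo

                # Decoupled (synthetic gradients): the lower part updates immediately and
                # starts the next batch while the upper part works asynchronously, so the
                # steady-state time per batch is set by the slower of the two pipelines.
                decoupled = max(f_lo + b_lo + dni_cost, f_hi + b_hi + dni_cost)

                print(f"locked:    {locked:.1f} time units per batch")     # 10.0
                print(f"decoupled: {decoupled:.1f} time units per batch")  #  8.1

            The true gradient is indeed still paid for in full, but it no longer blocks the rest of the network, which is the "unlocking" the paper is after; the benefit grows with the amount of hardware available to run the decoupled pieces in parallel.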

















