
DNN Gates
In the following we heavily borrow from this text. The basic building block of vectorized gradients is the Jacobian Matrix. In the introductory section we dealt with functions . Suppose that we have a more complicated function that maps a vector of length to a vector of length : Then its Jacobian is: The Jacobian matrix will be useful for us because we can apply the chain rule to a vector-valued function just by multiplying Jacobians. As a little illustration of this, suppose we have a function taking a scalar to a vector of size 2 and a function taking a vector of size two to a vector of size two. Now let’s compose them to get . Using the regular chain rule, we can compute the derivative of as the Jacobian And we see this is the same as multiplying the two Jacobians: This is also another instructive summary that help us understand how to calculate the local gradients involved and the gate templates (identities) summarized below that are routinely found in neural network backpropagation calculations. Assume that with . Tables of Gates and Gradients used in the backpropagation of deep neural networks| Gate | Solution |
|---|---|
| |
| |
| |
element-wise | |
, | |
, | |
, , |
Backpropagation through non-linear units
As shown here, you need to be watchful of the effects of the various non-linear gates on the gradient flow. For sigmoid gate, if you are sloppy with the weight initialization or data preprocessing these non-linearities can “saturate” and entirely stop learning — your training loss will be flat and refuse to go down. If your weight matrix W is initialized too large, the output of the matrix multiply could have a very large range (e.g. numbers between -400 and 400), which will make all outputs in the vector z almost binary: either 1 or 0. But if that is the case, , which is local gradient of the sigmoid non-linearity, will in both cases become zero (“vanish”), making the gradient for both x and W be zero. The rest of the backward pass will come out all zero from this point on due to multiplication in the chain rule.


