Mishka -- Understanding Recurrent Identity Networks -- January 26, 2018


An overview of a remarkable recent Swiss paper which finds a simple solution to the vanishing gradients problem in recurrent networks:

https://arxiv.org/abs/1801.06105

It is a very simple scheme, and it is one of those cases when the question "how come this was not known for decades?" arises. (Other cases when this question arises include AlphaZero (both Go and Chess) and our own self-modifying neural nets based on vector flows.)

I don't think this is a particularly well-written paper - what the authors say is that if one writes the recurrent part H_next = ... + V*H_previous as H_next = ... + (U+I)*H_previous, where U and V are square matrices and I is the identity matrix, then it "encourages the network to stay close to the identity transformation", and then things work nicely, with the added remarkable benefit of making it possible to use ReLU activation functions in the recurrent setting without things blowing up. But they don't do a good job of explaining why this rewriting encourages the network to stay close to the identity transformation.
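
For concreteness, here is a minimal sketch (my own, not taken from the paper) of what such a reparameterized recurrent step might look like; the names, dimensions, and initialization are just for illustration:

    # A minimal sketch (not the authors' code) of the reparameterized recurrent step,
    # assuming a plain ReLU RNN cell; W, U, b and the dimensions are illustrative.
    import numpy as np

    def rin_step(h_prev, x, W, U, b):
        """One step of H_next = relu(W*x + (U + I)*H_prev + b)."""
        identity = np.eye(U.shape[0])
        return np.maximum(0.0, W @ x + (U + identity) @ h_prev + b)

    # With U initialized near zero, the recurrent transformation starts out
    # close to the identity, which is what keeps ReLU activations from blowing up
    # or dying early in training.
    hidden, inputs = 4, 3
    rng = np.random.default_rng(0)
    W = 0.1 * rng.standard_normal((hidden, inputs))
    U = 0.01 * rng.standard_normal((hidden, hidden))   # small, so U + I is close to I
    b = np.zeros(hidden)

    h = np.zeros(hidden)
    for _ in range(5):
        h = rin_step(h, rng.standard_normal(inputs), W, U, b)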

I think the answer is regularization, especially explicit regularization on weights such as L_2, but possibly also the implicit regularization present in some optimization methods. If a regularization encouraging small weights is applied to the elements of U, rather than to the elements of V, then this indeed encourages the network to stay close to the identity! (When one scales this kind of network to a large data set, one probably needs to make sure that the regularization (which is often associated with priors) does not become vanishingly small compared to the influence of the data set; otherwise this approach might stop working.)
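
Here is a tiny illustration of that point (again my own sketch, under this regularization hypothesis, not something stated in the paper): an L_2 penalty on U is literally an L_2 penalty on the deviation of the effective recurrent matrix V = U + I from the identity, so weight decay pulls V toward I rather than toward zero:

    # Illustration of the regularization hypothesis above (my own, hypothetical):
    # penalizing U is the same as penalizing (V - I) for the effective
    # recurrent matrix V = U + I.
    import numpy as np

    lam = 0.01
    rng = np.random.default_rng(1)
    U = rng.standard_normal((4, 4))
    V = U + np.eye(4)

    penalty_on_U = lam * np.sum(U ** 2)            # lam * ||V - I||^2
    penalty_on_V = lam * np.sum(V ** 2)            # lam * ||V||^2, the usual decay

    # Gradient of each penalty with respect to V:
    grad_when_decaying_U = 2 * lam * (V - np.eye(4))   # shrinks V toward the identity
    grad_when_decaying_V = 2 * lam * V                 # shrinks V toward zero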

(Other than leaving the reader with a sense of mystery about why it all works, the paper is quite interesting and remarkable, both in its results and in documenting how the authors discovered it. I certainly don't mean to diminish the value of their discovery here.)


Mishka --- January 26, 2018
