In 2014, before batch normalization was invented, training deep neural networks was hard.
For example, VGG was first trained with 11 layers, and then extra layers were inserted into the already-trained network so that the deeper model could still converge.
Another example: GoogLeNet used early outputs (auxiliary classifiers attached to intermediate layers) to help gradients reach the lower layers.
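
A rough sketch (assuming PyTorch, which is not part of the lecture itself) of the conv -> BatchNorm -> ReLU pattern that batch normalization enabled: with BN between the convolution and the nonlinearity, a deep stack of such blocks trains from scratch, without the staged training or auxiliary outputs described above.

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    # conv -> batch norm -> ReLU; BN normalizes activations per channel over the batch
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# Stack many such blocks; before BN, a network this deep was hard to get to converge.
net = nn.Sequential(*[conv_bn_relu(64, 64) for _ in range(16)])
x = torch.randn(8, 64, 32, 32)
print(net(x).shape)  # torch.Size([8, 64, 32, 32])
```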








similar to a mini-batch






675 Mass Ave -> Central Square (?)












soft attention -> weighted combination of all image locations -> differentiable, so it trains with standard backprop
hard attention -> force the model to select only one location to look at -> trickier, because the selection is not differentiable -> covered later in the RL lecture
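
A minimal sketch (assuming PyTorch; the tensor shapes and names are illustrative, not from the lecture) of the contrast: soft attention averages all locations with softmax weights, while hard attention samples a single location, which breaks differentiability.

```python
import torch
import torch.nn.functional as F

features = torch.randn(1, 196, 512)   # e.g. a 14x14 grid of 512-d image features
scores = torch.randn(1, 196)          # attention scores from some small network

# Soft attention: softmax weights over ALL locations, context = weighted combination.
weights = F.softmax(scores, dim=1)                              # (1, 196), sums to 1
soft_context = (weights.unsqueeze(-1) * features).sum(dim=1)    # (1, 512)

# Hard attention: sample a SINGLE location and take only its feature vector.
# The sampling step is not differentiable, hence REINFORCE-style training (RL lecture).
idx = torch.multinomial(weights, num_samples=1)                 # (1, 1)
hard_context = features[torch.arange(1), idx.squeeze(1)]        # (1, 512)

print(soft_context.shape, hard_context.shape)
```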









