How to train your Deep Neural Network

Jan 5, 2017

There are certain practices in Deep Learning that are highly recommended in order to train Deep Neural Networks efficiently. In this post, I will cover a few of the most commonly used practices, ranging from the importance of quality training data and the choice of hyperparameters to more general tips for faster prototyping of DNNs.

Most of these practices are validated by research in academia and industry, and are presented with mathematical and experimental proofs in research papers like Efficient BackProp (Yann LeCun et al.) and Practical Recommendations for Deep Architectures (Yoshua Bengio). As you'll notice, I haven't included any mathematical proofs in this post. All the points suggested here should be taken as a summarization of the best practices for training DNNs. For a more in-depth understanding, I highly recommend you go through the above-mentioned research papers and the references provided at the end.

Training data

A lot of ML practitioners are in the habit of throwing raw training data at any Deep Neural Net (DNN). And why not? Any DNN would (presumably) still give good results, right? But it's not completely old school to say that "given the right type of data, a fairly simple model will provide better and faster results than a complex DNN" (although this might have exceptions). So, whether you are working with Computer Vision, Natural Language Processing, Statistical Modelling, etc., it pays to take care of your raw data first. A few measures one can take to get better training data:

- Get your hands on as large a dataset as possible (DNNs are quite data-hungry: more is better).
- Remove any training sample with corrupted data (short texts, highly distorted images, spurious output labels, features with lots of null values, etc.).
- Data Augmentation - create new examples (in the case of images: rescale, add noise, etc.).

Activation functions

One of the vital components of any Neural Net is the activation function. Activations introduce the much-desired non-linearity into the model. For years, sigmoid activation functions have been the preferred choice. But a sigmoid function is inherently cursed by two drawbacks - 1. saturation of sigmoids at the tails (further causing the vanishing gradient problem), and 2. sigmoids are not zero-centered.

A better alternative is the tanh function - mathematically, tanh is just a rescaled and shifted sigmoid: tanh(x) = 2*sigmoid(2x) - 1. Although tanh can still suffer from the vanishing gradient problem, the good news is that tanh is zero-centered. Hence, using tanh as the activation function will result in faster convergence. I have found that using tanh as the activation generally works better than sigmoid. You can further explore other alternatives like ReLU, SoftSign, etc., depending on the specific task, which have been shown to ameliorate some of these issues.
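As a quick illustration of the points above, here is a minimal NumPy sketch (my addition, not code from the original post) that defines the three activations and numerically checks that tanh is just a rescaled and shifted sigmoid:

```python
import numpy as np

def sigmoid(x):
    # Squashes inputs into (0, 1); saturates for large |x| and is not zero-centered.
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Zero-centered, squashes inputs into (-1, 1); still saturates at the tails.
    return np.tanh(x)

def relu(x):
    # Does not saturate for positive inputs; a popular alternative to sigmoid/tanh.
    return np.maximum(0.0, x)

x = np.linspace(-5, 5, 101)
# tanh is a rescaled and shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
assert np.allclose(tanh(x), 2 * sigmoid(2 * x) - 1)
```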
Number of Hidden Units and Layers

Keeping a larger number of hidden units than the optimal number is generally a safe bet, since any regularization method will take care of superfluous units, at least to some extent. On the other hand, with fewer hidden units than the optimal number, there are higher chances of underfitting the model.

Also, while employing unsupervised pre-trained representations (described in later sections), the optimal number of hidden units is generally kept even larger, since a pre-trained representation might contain a lot of information that is irrelevant to the specific supervised task. By increasing the number of hidden units, the model will have the required flexibility to filter out the most appropriate information from these pre-trained representations.

Selecting the optimal number of layers is relatively straightforward. As mentioned on Quora - "You just keep on adding layers, until the test error doesn't improve anymore".

Weight Initialization

Always initialize the weights with small random numbers to break the symmetry between different units. Furthermore, if the weights are initialized to very large numbers while using sigmoid activation functions, the sigmoid will saturate (tail regions), resulting in dead neurons. But how small should the weights be? What is the recommended upper limit? Which probability distribution should be used for generating the random numbers?
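To make symmetry breaking concrete, here is a small sketch (my addition; the helper name, the 0.01 scale, and the 784/256 layer sizes are illustrative placeholders) that draws each weight from a small zero-mean Gaussian so that no two units start out identical:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(n_in, n_out, scale=0.01):
    # Small zero-mean Gaussian values: every unit gets different initial weights,
    # which breaks the symmetry between units in the same layer.
    # 'scale' is a placeholder - how small the weights should be, and which
    # distribution to draw them from, is exactly the question posed above.
    W = scale * rng.standard_normal((n_in, n_out))
    b = np.zeros(n_out)  # biases can safely start at zero
    return W, b

W, b = init_weights(784, 256)
```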