Yeah, the order seems to be swapped for some reason in the pytorch-transformers library. I’m 99% sure it’s a mistake, and switching it should be fine. That said, training still works even if you don’t switch it.
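Assuming the "order" in question is the usual `optimizer.step()` / `scheduler.step()` ordering (the context isn't shown here), a minimal sketch of the conventional order is below. The `StubOptimizer` and `StubScheduler` classes are hypothetical stand-ins, not the real pytorch-transformers objects, so the ordering itself is easy to see and check:

```python
class StubOptimizer:
    """Hypothetical stand-in for an optimizer; records the call order."""
    def __init__(self):
        self.calls = []

    def step(self):
        self.calls.append("optimizer.step")

    def zero_grad(self):
        self.calls.append("zero_grad")


class StubScheduler:
    """Hypothetical stand-in for a learning-rate scheduler."""
    def __init__(self, optimizer):
        self.optimizer = optimizer

    def step(self):
        self.optimizer.calls.append("scheduler.step")


def train_step(optimizer, scheduler):
    # loss.backward() would go here in a real loop
    optimizer.step()       # update the weights first...
    scheduler.step()       # ...then advance the learning-rate schedule
    optimizer.zero_grad()  # clear gradients for the next batch
```

Calling the scheduler first means the first weight update is taken at the already-advanced learning rate; with most schedules the difference over a full run is small, which is why the swapped order still trains fine.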

Gradient overflow warnings are normal when using dynamic loss scaling: the scaler deliberately probes larger scale values and backs off when the gradients overflow. As long as you aren’t getting NaN losses or unreasonable loss values, it’s expected behaviour.
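The mechanism can be sketched as follows. This is a toy illustration of the idea behind dynamic loss scaling (as in apex/amp), not the real implementation; the class name and parameter values are made up for the example:

```python
import math


class DynamicLossScaler:
    """Toy sketch of dynamic loss scaling: on overflow, halve the
    scale and skip the step; after enough clean steps, grow it."""

    def __init__(self, init_scale=2.0 ** 15, backoff=0.5, growth_interval=2000):
        self.scale = init_scale
        self.backoff = backoff
        self.growth_interval = growth_interval
        self.good_steps = 0

    def step(self, grads):
        """Return True if the optimizer step should be applied."""
        overflow = any(math.isinf(g) or math.isnan(g) for g in grads)
        if overflow:
            # This is where a "gradient overflow" warning would be
            # printed: shrink the scale and skip this step.
            self.scale *= self.backoff
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps % self.growth_interval == 0:
            # Periodically try a larger scale to keep precision high.
            self.scale *= 2.0
        return True
```

A skipped step now and then is harmless; the warnings only signal a problem if they appear on nearly every iteration or the loss itself turns NaN.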

AI researcher, avid reader, fantasy and Sci-Fi geek, and fan of the Oxford comma. www.linkedin.com/in/t-rajapakse/
