Yeah, the order seems to be swapped for some reason in the pytorch-transformers library. I'm 99% sure it's a mistake, and switching it should be fine; that said, it also works even if you don't switch it.
Gradient overflow warnings are normal when using dynamic loss scaling: the scaler periodically tries a larger loss scale, and when that makes the gradients overflow it skips that step and backs the scale off, which is exactly what the warning is reporting. As long as you aren't getting NaN losses or unreasonable loss values, it's expected behaviour.
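For reference, here's a minimal sketch of how dynamic loss scaling typically looks with PyTorch's built-in `torch.cuda.amp.GradScaler` (the model, data, and hyperparameters are placeholders; the apex-style scaler used with pytorch-transformers behaves the same way in principle):

```python
import torch
import torch.nn.functional as F

# Stand-in model and optimizer, just to make the loop runnable.
model = torch.nn.Linear(10, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaler

for step in range(100):
    inputs = torch.randn(8, 10, device="cuda")
    targets = torch.randint(0, 2, (8,), device="cuda")

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # forward pass in mixed precision
        loss = F.cross_entropy(model(inputs), targets)

    scaler.scale(loss).backward()  # backward on the scaled loss

    # If any gradient overflowed (inf/NaN), this step is skipped and the loss
    # scale is reduced -- that's what the "gradient overflow" warning reports.
    # Otherwise the optimizer steps normally.
    scaler.step(optimizer)
    scaler.update()  # periodically tries a larger scale again
```

So seeing the warning a few times, especially early in training, just means the scaler is probing for a workable loss scale.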