There was a bug in the ELECTRA pretraining code that prevented the discriminator from being trained (as discussed here).
I think that would explain the differences you point out. I’ll see if I can rerun this experiment and update the article accordingly.
Let me know if I can help with anything!