The best model might vary depending on the situation, but as of right now, XLNet seems to be holding the lead in most benchmarks.
All these models are based on the Transformer architecture. The vanilla Transformer is simply two stacks of encoders and decoders. The bottom-most (first) encoder performs the embedding task, i.e converting the sentence into a vector. The vector is then passed through each encoder and the output of the final encoder is sent to each of the decoders. The outputs from each decoder are also passed to the decoder on top of it in the stack. Bottom decoder sends its output to the next bottom and so forth.
BERT is based on this Transformer architecture but it only has the encoder stacks (albeit with more layers and more attention heads).
These two posts explain the concepts beautifully. It’s much easier to understand when presented visually, and these guides do it brilliantly.