The embedding portion of the task happens at the tokenization stage. The basic idea is that pre-trained BERT comes with a fixed vocabulary of word pieces, and the input text is tokenized (split into progressively smaller word pieces) until every part of the input is represented by a token in that vocabulary.
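To make the "smaller and smaller word pieces" idea concrete, here is a toy sketch of WordPiece-style greedy longest-match tokenization. The mini-vocabulary is made up for illustration (BERT's real vocabulary has ~30k entries), but the mechanism, including the `##` prefix BERT uses for word-internal pieces, is the same:

```python
# Hypothetical mini-vocabulary, not BERT's real one
VOCAB = {"em", "##bed", "##ding", "##s", "play", "##ing", "[UNK]"}

def wordpiece(word, vocab=VOCAB):
    """Greedily match the longest known piece, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no known piece covers this span
        pieces.append(piece)
        start = end
    return pieces

print(wordpiece("embeddings"))  # ['em', '##bed', '##ding', '##s']
print(wordpiece("playing"))    # ['play', '##ing']
```

The real tokenizer also handles punctuation splitting and casing, but the core loop is this longest-match search.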
The library we are using comes with classes representing the various types of BERT models. In our case, we are using `BertForSequenceClassification`. This class wraps a BERT model with a classification (linear) layer added at the end. You can see this layer printed in the output when you load the model onto the GPU/CPU.
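For intuition, the added classification head is just a single linear layer mapping BERT's pooled `[CLS]` representation to one score (logit) per class. A minimal numpy sketch, with made-up shapes and random weights standing in for the real model's:

```python
import numpy as np

hidden_size, num_labels = 8, 2  # real BERT-base uses hidden_size=768
rng = np.random.default_rng(0)

pooled_output = rng.standard_normal(hidden_size)    # stand-in for BERT's pooled [CLS] output
W = rng.standard_normal((hidden_size, num_labels))  # classifier weights
b = np.zeros(num_labels)                            # classifier bias

# The head produces raw, unnormalized scores — note there is no softmax here
logits = pooled_output @ W + b
print(logits.shape)  # (2,)
```

The point is that the model's final output is these raw logits, which matters for the loss discussion below.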
We are not adding an explicit softmax layer to the model itself. The criterion `CrossEntropyLoss()` combines a log-softmax with a negative log-likelihood loss (see the PyTorch docs), so we simply pass the raw model outputs (the logits) to `CrossEntropyLoss()`. For making predictions, we can simply take the `np.argmax()` of the logits; softmax is monotonic, so it doesn't change which class scores highest. If you want the probability for each class, send the logits through a softmax function.
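Here is a small numpy sketch of what that means, using made-up logits for one example: the loss is the negative log of the softmax probability of the true class, while predictions need no softmax at all:

```python
import numpy as np

logits = np.array([2.0, 0.5, -1.0])  # made-up model output for one example
label = 0                            # true class index

# softmax turns logits into probabilities
probs = np.exp(logits) / np.exp(logits).sum()

# cross-entropy loss = negative log-probability of the true class
# (this is what CrossEntropyLoss computes from raw logits)
loss = -np.log(probs[label])

# argmax of the logits equals argmax of the probabilities,
# since softmax is monotonic — so predictions skip the softmax
assert np.argmax(logits) == np.argmax(probs)
print(loss)
```

This is why the model can output bare logits: the softmax lives inside the loss during training, and is only needed at inference time if you want calibrated per-class probabilities rather than just the predicted label.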
Does this clear things up?