The embedding portion of the task starts at the tokenization stage. The basic idea is that pre-trained BERT has a fixed vocabulary of tokens, and all the input text is tokenized (words are split into progressively smaller word pieces) until every piece of the input is a token in that vocabulary. The model then looks up a learned embedding for each of those tokens.
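To make the splitting concrete, here is a minimal sketch of greedy longest-match-first word-piece splitting in plain Python. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are made up for illustration; the real BERT tokenizer uses a vocabulary of roughly 30,000 entries and also handles lowercasing, punctuation, and special tokens, which this sketch leaves out.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into vocabulary pieces, longest match first.

    Pieces that continue a word are prefixed with '##', following the
    convention used in BERT vocabularies.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece matched at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, just large enough for the examples below.
toy_vocab = {"em", "##bed", "##ding", "##s", "play", "##ing", "[UNK]"}

print(wordpiece_tokenize("embeddings", toy_vocab))  # ['em', '##bed', '##ding', '##s']
print(wordpiece_tokenize("playing", toy_vocab))     # ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```

Each resulting token maps to an integer ID in the vocabulary, and those IDs are what the model receives as input.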

The library we are using comes with classes representing the various types of BERT models. In our case, we are using `BertForSequenceClassification`. This class has a BERT model with a classification (linear) layer added at the end. You can see it in the output you get when you load the model onto the GPU/CPU.

We are not adding an explicit softmax layer to the model itself. The criterion `CrossEntropyLoss()` combines a log-softmax and a negative log-likelihood loss (see the PyTorch documentation for `CrossEntropyLoss`), so we simply pass the raw outputs of the model (the logits) to `CrossEntropyLoss()`. For making predictions, we can simply take the `np.argmax()` of the model outputs, since softmax does not change which class has the highest score. If you want the probabilities for each class, you can send the output through a softmax function.
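Here is a small sketch of that relationship in plain Python, so it runs without PyTorch. The helper names `softmax` and `cross_entropy_from_logits` and the example logits are made up for illustration; in the actual training code you would use `CrossEntropyLoss()` and `np.argmax()` as described above.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_from_logits(logits, target):
    """Log-softmax followed by negative log-likelihood for one example."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


# Hypothetical raw model outputs for a 3-class problem.
logits = [2.0, 0.5, -1.0]

# Prediction: index of the largest logit, no softmax needed.
predicted = max(range(len(logits)), key=lambda i: logits[i])

# Probabilities: only needed if you want per-class confidence.
probs = softmax(logits)

# The loss on the logits equals -log of the softmax probability of the target.
loss = cross_entropy_from_logits(logits, target=0)

print(predicted)
print(probs)
print(loss, -math.log(probs[0]))
```

The last line prints the same value twice (up to floating-point error), which is why the softmax can stay out of the model: the loss already applies it internally.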

Does this clear things up?

