There are two distinct stages involved in adapting a pre-trained Transformer model to a specific task.
- Pre-training (the terminology is a little confusing)
Pre-training is the training procedure performed on a Transformer prior to adapting it to a particular task. In the case of BERT, this consists of training the model on a very large corpus using two pre-training objectives: masked language modelling (predicting masked words) and next sentence prediction.
- Fine-tuning (on a particular task)
A suitable linear layer is added on top of the Transformer model, and the entire model is trained on the required task (classification, NER, QA, etc.).
For the vast majority of cases, you will only ever need to do the second part: take a model that someone else has pre-trained and fine-tune it on your task with your own data.
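To make the fine-tuning step concrete, here is a minimal PyTorch sketch. `TinyEncoder` is a hypothetical stand-in for a pre-trained Transformer body (in practice you would load something like BERT, e.g. via the `transformers` library); `Classifier` shows the pattern described above: a new linear head on top, with the whole model trained end-to-end.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for a pre-trained Transformer encoder (hypothetical, tiny)."""
    def __init__(self, vocab_size=100, hidden=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=2, batch_first=True
        )

    def forward(self, ids):
        h = self.layer(self.embed(ids))
        return h[:, 0]  # first-token vector as the sequence representation

class Classifier(nn.Module):
    """Pre-trained body + newly added linear head for the downstream task."""
    def __init__(self, encoder, hidden=16, num_labels=2):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden, num_labels)  # the added task layer

    def forward(self, ids):
        return self.head(self.encoder(ids))

encoder = TinyEncoder()  # pretend these weights came from pre-training
model = Classifier(encoder)

# Fine-tuning: the optimizer updates the ENTIRE model, not just the head.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
ids = torch.randint(0, 100, (4, 8))       # toy batch of token ids
labels = torch.tensor([0, 1, 0, 1])       # toy classification labels
loss = nn.functional.cross_entropy(model(ids), labels)
loss.backward()
opt.step()
```

The only task-specific piece is the `nn.Linear` head; everything else is reused from pre-training, which is why fine-tuning needs far less data and compute than pre-training from scratch.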