You got errors because of the evaluation metrics. For binary classification, you can calculate the F1 score, precision, recall, and so on. For a multi-class problem, however, you need to use the micro-averaged or macro-averaged F-score instead. In this new Colab notebook (https://colab.research.google.com/drive/14RJo9TnXM300BIlucXCrZE4BV3pFkKfe), I generated a three-class dataset and removed all the metrics that are not suitable for a multi-class model, so only accuracy is kept in the result list. Instead of using the bert-text package, I merely hid all the code in the first chunk, so you can examine it in detail. Could you please help me implement the micro-F score in it? Thanks!
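As a starting point for the micro-F score, here is a minimal sketch in plain Python. It is not taken from the notebook: the function name `f_scores` and the toy labels are hypothetical, and it assumes each example has exactly one true label and one predicted label. It pools true positives, false positives, and false negatives over all classes for the micro average, and averages the per-class F1 values for the macro average.

```python
from collections import Counter


def f_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions.

    Hypothetical helper for illustration; not part of the notebook or bert-text.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Per-class counts of true positives, false positives, false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed

    def f1(tp_, fp_, fn_):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there is nothing to count.
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    labels = set(y_true) | set(y_pred)

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels) if labels else 0.0
    return micro, macro


# Toy three-class example (made-up labels, only to show the call).
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]
micro, macro = f_scores(y_true, y_pred)
print(f"micro-F1 = {micro:.4f}, macro-F1 = {macro:.4f}")
```

Note that when every example has exactly one label, the pooled false positives equal the pooled false negatives, so the micro-F score comes out equal to plain accuracy; the macro-F score is the one that differs when classes are imbalanced. If scikit-learn is available in the notebook, `sklearn.metrics.f1_score(y_true, y_pred, average="micro")` (or `average="macro"`) should give the same numbers.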