The token set (vocabulary) is usually generated by applying Byte Pair Encoding (BPE) to a corpus that you think represents your training data well.
BPE starts with a vocabulary consisting of single-character tokens. The most frequent pair of adjacent tokens is then merged into a single token and added to the vocabulary, and all occurrences of that pair in the corpus are replaced with the new merged token. This process is repeated until the vocabulary reaches the desired size.
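The merge loop described above can be sketched in a few lines of Python. This is a minimal illustrative implementation, not an optimized tokenizer; the function name `train_bpe` and the word-list input format are assumptions for the example, and real implementations (e.g. the one described on the Wikipedia page below) operate on bytes and pre-tokenized text with many practical refinements.

```python
from collections import Counter

def train_bpe(corpus, vocab_size):
    """Learn BPE merges from a list of words (minimal sketch)."""
    # Represent each word as a tuple of single-character tokens.
    word_counts = Counter(tuple(word) for word in corpus)
    vocab = {ch for word in word_counts for ch in word}
    merges = []
    while len(vocab) < vocab_size:
        # Count all adjacent token pairs, weighted by word frequency.
        pairs = Counter()
        for word, count in word_counts.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab.add(best[0] + best[1])
        # Replace every occurrence of the best pair with the merged token.
        new_counts = Counter()
        for word, count in word_counts.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_counts[tuple(merged)] += count
        word_counts = new_counts
    return vocab, merges
```

For example, training on `["low", "low", "lower", "newest", "newest"]` starts from the 8 distinct characters and keeps merging the most frequent pair ("l"+"o" first, since "lo" occurs three times) until the vocabulary hits `vocab_size`.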
https://en.m.wikipedia.org/wiki/Byte_pair_encoding