Gotchas in Tokenizer Behavior Every Developer Should Know

Community Article Published April 18, 2025

This is an evolving blog post about things you need to know when developing with tokenizers. I've learned these lessons along the way, and I hope they help you avoid the mistakes I've made.

BOS token

1. Not all tokenizers have a BOS token

For example, Qwen/Qwen2.5-0.5B does not have a bos_token.

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
>>> tokenizer.bos_token is not None
False

while microsoft/Phi-3-mini-128k-instruct does.

>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
>>> tokenizer.bos_token is not None
True
>>> tokenizer.bos_token
'<s>'

2. A tokenizer can have a BOS token but not use it

For example, both microsoft/Phi-3-mini-128k-instruct and CohereLabs/aya-expanse-8b have a bos_token, but only CohereLabs/aya-expanse-8b actually uses it.

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
>>> tokenizer.bos_token, tokenizer.bos_token_id
('<s>', 1)
>>> input_ids = tokenizer("Beautiful is better than ugly")["input_ids"]
>>> input_ids
[25685, 338, 2253, 1135, 22769]
>>> tokenizer.bos_token_id in input_ids
False

>>> input_ids = tokenizer.apply_chat_template([{"role": "user", "content": "What is better than ugly?"}, {"role": "assistant", "content": "Beautiful."}])
>>> input_ids
[32010, 1724, 338, 2253, 1135, 22769, 29973, 32007, 32001, 25685, 29889, 32007, 32000]
>>> tokenizer.bos_token_id in input_ids
False
>>> tokenizer = AutoTokenizer.from_pretrained("CohereLabs/aya-expanse-8b")
>>> tokenizer.bos_token, tokenizer.bos_token_id
('<BOS_TOKEN>', 5)
>>> input_ids = tokenizer("Beautiful is better than ugly")["input_ids"]
>>> input_ids
[5, 82653, 1801, 5329, 2924, 82092]
>>> tokenizer.bos_token_id in input_ids
True

>>> input_ids = tokenizer.apply_chat_template([{"role": "user", "content": "What is better than ugly?"}, {"role": "assistant", "content": "Beautiful."}])
>>> input_ids
[5, 255000, 255006, 11214, 1801, 5329, 2924, 82092, 38, 255001, 255000, 255007, 82653, 21, 255001]
>>> tokenizer.bos_token_id in input_ids
True
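Because BOS behavior varies between models, portable preprocessing code shouldn't assume the token exists or is added automatically. A minimal defensive sketch (the helper name `maybe_prepend_bos` is mine, not a transformers API):

```python
def maybe_prepend_bos(input_ids, bos_token_id):
    """Prepend BOS only if the tokenizer defines one and it isn't already there."""
    if bos_token_id is not None and (not input_ids or input_ids[0] != bos_token_id):
        return [bos_token_id] + input_ids
    return input_ids

# Phi-3 ids from the example above: BOS (id 1) exists but wasn't added.
print(maybe_prepend_bos([25685, 338, 2253, 1135, 22769], 1))
# → [1, 25685, 338, 2253, 1135, 22769]
```

The `None` check covers tokenizers like Qwen2.5 that define no BOS at all, so the same code path works for both families.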

EOS token

3. Tokenizing doesn't add the EOS token

When you tokenize a string, it doesn't automatically add the EOS token.

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
>>> tokenizer.eos_token, tokenizer.eos_token_id
('<|endoftext|>', 151643)
>>> input_ids = tokenizer("Beautiful is better than ugly")["input_ids"]
>>> input_ids
[46518, 374, 2664, 1091, 27261]
>>> input_ids[-1] == tokenizer.eos_token_id
False
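If your training pipeline needs sequences to end with EOS (e.g. for language-model fine-tuning on plain text), you have to append it yourself. A minimal sketch, using an illustrative helper name:

```python
def ensure_eos(input_ids, eos_token_id):
    """Append the EOS token id unless the sequence already ends with it."""
    if not input_ids or input_ids[-1] != eos_token_id:
        return input_ids + [eos_token_id]
    return input_ids

# Qwen ids from the example above:
print(ensure_eos([46518, 374, 2664, 1091, 27261], 151643))
# → [46518, 374, 2664, 1091, 27261, 151643]
```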

4. Applying the chat template may add the EOS token, but not always, and not always at the end

Applying a chat template might add an EOS token—but not always, and not always at the end. Behavior varies across models:

  • Some templates append the EOS at the end, like meta-llama/Llama-3.2-1B-Instruct:

    >>> from transformers import AutoTokenizer
    >>> messages = [
    ...     {"role": "user", "content": "What is better than ugly?"},
    ...     {"role": "assistant", "content": "Beautiful."},
    ... ]
    >>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
    >>> tokenizer.eos_token, tokenizer.eos_token_id
    ('<|eot_id|>', 128009)
    >>> input_ids = tokenizer.apply_chat_template(messages)
    >>> input_ids
    [128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 972, 5186, 220, 2366, 20, 271, 128009, 128006, 882, 128007, 271, 3923, 374, 2731, 1109, 28360, 30, 128009, 128006, 78191, 128007, 271, 47618, 13, 128009]
    >>> input_ids[-1] == tokenizer.eos_token_id
    True
    
  • Some don’t add an EOS at all, like databricks/dbrx-instruct:

    >>> tokenizer = AutoTokenizer.from_pretrained("databricks/dbrx-instruct")
    >>> tokenizer.eos_token, tokenizer.eos_token_id
    ('<|endoftext|>', 100257)
    >>> input_ids = tokenizer.apply_chat_template(messages)
    >>> input_ids
    [100278, 9125, 198, 2675, 527, 6078, 46913, 11, 3549, 555, 423, 2143, 78889, 13, 1472, 1051, 1566, 6177, 304, 6790, 220, 2366, 18, 13, 1472, 4320, 4860, 3196, 389, 2038, 2561, 709, 311, 430, 1486, 627, 57489, 15843, 36, 66024, 77273, 50, 5257, 66024, 57828, 43486, 2794, 23233, 29863, 11, 719, 3493, 17879, 14847, 311, 810, 6485, 323, 1825, 84175, 4860, 627, 2675, 7945, 449, 5370, 9256, 11, 505, 4477, 311, 11058, 320, 985, 51594, 369, 2082, 10215, 2001, 6227, 311, 1005, 55375, 449, 2082, 11, 4823, 11, 323, 12920, 4390, 7, 2675, 656, 539, 617, 1972, 7394, 828, 2680, 477, 2082, 11572, 17357, 13, 1472, 5766, 23473, 67247, 323, 3493, 24770, 39555, 389, 20733, 13650, 13, 1472, 656, 539, 3493, 5609, 24142, 11, 45319, 11, 477, 3754, 9908, 323, 656, 539, 82791, 713, 3649, 315, 701, 4967, 828, 29275, 2028, 374, 701, 1887, 10137, 11, 51346, 701, 14847, 13, 3234, 539, 5905, 433, 11, 1120, 6013, 311, 279, 1217, 13, 1442, 499, 1505, 6261, 7556, 922, 420, 1984, 11, 3009, 13, 1472, 1288, 387, 30438, 36001, 323, 6118, 430, 3445, 539, 45391, 420, 627, 57489, 9503, 4276, 386, 72983, 4230, 3083, 10245, 45613, 52912, 21592, 66873, 6781, 38873, 3247, 45613, 3507, 20843, 9109, 393, 3481, 691, 1863, 5257, 3247, 14194, 13575, 68235, 13, 100279, 198, 100278, 882, 198, 3923, 374, 2731, 1109, 28360, 30, 100279, 198, 100278, 78191, 198, 47618, 13, 100279]
    >>> input_ids[-1] == tokenizer.eos_token_id
    False
    
  • Some add the EOS, but not at the very end, like Qwen/Qwen2.5-0.5B-Instruct:

    >>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
    >>> tokenizer.eos_token, tokenizer.eos_token_id
    ('<|im_end|>', 151645)
    >>> input_ids = tokenizer.apply_chat_template(messages)
    >>> input_ids
    [151644, 8948, 198, 2610, 525, 1207, 16948, 11, 3465, 553, 54364, 14817, 13, 1446, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 3838, 374, 2664, 1091, 27261, 30, 151645, 198, 151644, 77091, 198, 46518, 13, 151645, 198]
    >>> input_ids[-1] == tokenizer.eos_token_id
    False
    >>> input_ids[-2] == tokenizer.eos_token_id
    True
    
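In the Qwen case, the trailing token after EOS is the newline the template emits after `<|im_end|>`. If you need to know where the EOS actually sits, a small search helper (illustrative only, not a transformers API) does the job:

```python
def last_eos_index(input_ids, eos_token_id):
    """Return the index of the last occurrence of EOS, or -1 if absent."""
    for i in range(len(input_ids) - 1, -1, -1):
        if input_ids[i] == eos_token_id:
            return i
    return -1

# Tail of the Qwen chat-template ids above: EOS (151645) followed by '\n' (198).
ids = [151644, 77091, 198, 46518, 13, 151645, 198]
print(last_eos_index(ids, 151645))  # → 5, i.e. not the final position
```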

PAD token

5. When pad_token equals eos_token

It's common to set the pad_token to the same value as the eos_token, but this requires extra care when masking or preparing labels.

For example:

labels = input_ids.clone()
labels[input_ids == tokenizer.pad_token_id] = -100  # ⚠️ Not safe if PAD == EOS

If pad_token_id == eos_token_id, this will also mask out actual eos_tokens, which are typically meaningful and shouldn't be ignored. When using the same ID for both, make sure your masking logic doesn't unintentionally remove valid eos_tokens.
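One safer pattern is to mask by position using the attention mask rather than by token id: padding positions have attention_mask == 0, while a real EOS inside the sequence has attention_mask == 1. A sketch in plain Python (in practice you'd do the same thing on tensors):

```python
IGNORE_INDEX = -100  # the label value ignored by PyTorch's cross-entropy loss

def mask_labels(input_ids, attention_mask):
    """Mask padded positions only, so a real EOS survives even if PAD == EOS."""
    return [tok if keep else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]

# Suppose PAD == EOS == 2: the real EOS at position 2 is kept, padding is masked.
print(mask_labels([5, 7, 2, 2, 2], [1, 1, 1, 0, 0]))  # → [5, 7, 2, -100, -100]
```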

Chat template

6. Applying the chat template is not a homomorphism with respect to concatenation

In other words, you can't apply the template separately to the prompt and the completion, then concatenate them—it won't produce the correct result.

This implies you should never apply the chat template to a standalone completion:

completion = tokenizer.apply_chat_template(completion)  # ❌ No

And for the prompt, you should either use continue_final_message=True or add_generation_prompt=True.

prompt = tokenizer.apply_chat_template(prompt, continue_final_message=True)  # ✅ OK
prompt = tokenizer.apply_chat_template(prompt, add_generation_prompt=True)   # ✅ OK
prompt = tokenizer.apply_chat_template(prompt)                               # ❌ NO
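A toy template makes the non-homomorphism concrete: any per-call prefix (a BOS token, a default system prompt) gets duplicated when you template the pieces separately. This mock is illustrative only; real chat templates are Jinja strings stored on the tokenizer.

```python
def toy_template(messages):
    """Mimics a chat template that adds a BOS-like prefix once per call."""
    out = "<s>"
    for m in messages:
        out += f"<|{m['role']}|>{m['content']}<|end|>"
    return out

prompt = [{"role": "user", "content": "What is better than ugly?"}]
completion = [{"role": "assistant", "content": "Beautiful."}]

whole = toy_template(prompt + completion)
pieces = toy_template(prompt) + toy_template(completion)
print(whole == pieces)  # → False: "<s>" appears twice in `pieces`
```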

7. Chat template and tokenization don't compose due to special tokens

In other words, you can't apply the chat template and tokenize in sequence due to the special tokens (especially the BOS token).

If you try this:

>>> text = tokenizer.apply_chat_template(messages, tokenize=False)
>>> tokenizer(text)  # ❌ NO

It won’t give the expected result because both apply_chat_template and tokenizer add special tokens. Instead, disable special token addition during tokenization:

>>> text = tokenizer.apply_chat_template(messages, tokenize=False)
>>> tokenizer(text, add_special_tokens=False)  # ✅ OK

Example with CohereLabs/aya-expanse-8b:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("CohereLabs/aya-expanse-8b")
>>> messages = [
...     {"role": "user", "content": "What is better than ugly?"},
...     {"role": "assistant", "content": "Beautiful."},
... ]
>>> text = tokenizer.apply_chat_template(messages, tokenize=False)
>>> tokenizer(text)["input_ids"]  # ❌ No
[5, 5, 255000, 255006, 11214, 1801, 5329, 2924, 82092, 38, 255001, 255000, 255007, 82653, 21, 255001]
>>> tokenizer(text, add_special_tokens=False)["input_ids"]  # ✅ OK
[5, 255000, 255006, 11214, 1801, 5329, 2924, 82092, 38, 255001, 255000, 255007, 82653, 21, 255001]

8. Adding a chat template isn’t enough — you also need to update the EOS token

When fine-tuning a base model and adding a chat template, that template typically includes a special end-of-turn token — for example, <|im_end|> in the case of Qwen/Qwen2.5-0.5B-Instruct. This token is used to indicate the end of a message in a chat format.

Here’s a simplified version of a chat template using Jinja syntax:

{%- for message in messages %}
    {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
{%- endfor %}

It’s crucial to update the model’s EOS token to match the end-of-turn token used in the template. If they don’t match, it can lead to issues like infinite generation.

tokenizer.chat_template = """\
{%- for message in messages %}
    {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
{%- endfor %}
"""
tokenizer.eos_token = "<|im_end|>"  # ⚠️ Critical step
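Note that the EOS id should agree everywhere generation reads it: besides the tokenizer, the model's config and generation config each carry an eos_token_id. A sketch of keeping them in sync (the helper name is mine; it assumes a loaded `tokenizer` and `model`):

```python
def sync_eos(tokenizer, model, eos_token="<|im_end|>"):
    """Point tokenizer, model config, and generation config at the same EOS."""
    eos_id = tokenizer.convert_tokens_to_ids(eos_token)
    tokenizer.eos_token = eos_token
    model.config.eos_token_id = eos_id
    model.generation_config.eos_token_id = eos_id
    return eos_id
```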

Community

Very cool!

Really cool article, all these gotchas are fun and good to know.

Nice information

Thanks for sharing, probably worth having a script to check:

import warnings
from transformers import AutoTokenizer

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

def check_tokenizer_gotchas(model_id):
    print(f"\n{'='*60}")
    print(f"Analyzing Tokenizer for: {model_id}")
    print(f"{'='*60}\n")

    try:
        # Load tokenizer (trust_remote_code=True is often needed for newer/custom models)
        tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    except Exception as e:
        print(f"Error loading tokenizer: {e}")
        return

    # Standard test input
    test_text = "Beautiful is better than ugly"
    
    # Standard test messages for Chat Templates
    messages = [
        {"role": "user", "content": "What is better than ugly?"},
        {"role": "assistant", "content": "Beautiful."}
    ]

    # --- GOTCHA 1 & 2: BOS Token Existence and Usage ---
    print(f"--- 1 & 2. BOS Token Analysis ---")
    if tokenizer.bos_token is None:
        print(f"⚠️  Gotcha #1: This tokenizer has NO BOS token defined.")
    else:
        print(f"✅  BOS token exists: '{tokenizer.bos_token}' (ID: {tokenizer.bos_token_id})")
        
        # Check usage in standard encoding
        encoded = tokenizer(test_text)["input_ids"]
        if tokenizer.bos_token_id in encoded:
             print(f"✅  BOS token IS automatically added during standard tokenization.")
        else:
            print(f"⚠️  Gotcha #2: BOS exists but is NOT added automatically.")

    # --- GOTCHA 3: EOS Token in Standard Tokenization ---
    print(f"--- 3. Standard EOS Token Analysis ---")
    encoded = tokenizer(test_text)["input_ids"]
    if tokenizer.eos_token_id and encoded[-1] == tokenizer.eos_token_id:
        print(f"ℹ️  EOS token WAS added automatically (Uncommon behavior).")
    else:
        print(f"⚠️  Gotcha #3: Tokenization did NOT add the EOS token automatically.")

    # --- GOTCHA 4: EOS in Chat Templates ---
    print(f"--- 4. Chat Template EOS Analysis ---")
    if tokenizer.chat_template:
        # Generate IDs without adding the generation prompt yet
        chat_encoded = tokenizer.apply_chat_template(messages, add_generation_prompt=False)
        
        if tokenizer.eos_token_id is None:
             print("❌  No EOS token defined in tokenizer.")
        
        elif len(chat_encoded) > 0:
            last_id = chat_encoded[-1]
            # Check if the very last token is EOS
            if last_id == tokenizer.eos_token_id:
                print(f"✅  Chat template correctly appends EOS ({tokenizer.eos_token}) at the very end.")
            
            # Check if EOS is second to last (common issue)
            elif len(chat_encoded) > 1 and chat_encoded[-2] == tokenizer.eos_token_id:
                # Decode the actual last token to show the user
                trailing_token = tokenizer.decode([last_id])
                # Escape newlines for visibility in print output
                trailing_repr = repr(trailing_token) 
                
                print(f"⚠️  Gotcha #4: EOS is present but NOT at the end.")
                print(f"    The actual last token is ID {last_id} ({trailing_repr}).")
                print(f"    (This is likely a trailing newline from the Jinja template).")
            
            else:
                print(f"⚠️  Gotcha #4: Chat template does NOT append the EOS token.")
    else:
        print("ℹ️  No chat template defined for this tokenizer.")

    # --- GOTCHA 5: PAD == EOS ---
    print(f"--- 5. Pad Token Collision Check ---")
    if tokenizer.pad_token_id is not None and tokenizer.eos_token_id is not None:
        if tokenizer.pad_token_id == tokenizer.eos_token_id:
            print(f"⚠️  Gotcha #5: PAD token ID equals EOS token ID ({tokenizer.pad_token_id}).")
            print(f"    Warning: Masking logic `input_ids == pad_token_id` will unintentionally mask EOS tokens.")
        else:
            print(f"✅  PAD ({tokenizer.pad_token_id}) and EOS ({tokenizer.eos_token_id}) are distinct.")
    else:
        print("ℹ️  PAD or EOS token not defined for this tokenizer.")

    # --- GOTCHA 6 & 7: Composition and Double Special Tokens ---
    print(f"--- 6 & 7. Chat Template Composition ---")
    if tokenizer.chat_template:
        # Step 1: Apply template directly to IDs (Correct way)
        direct_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=False)
        
        # Step 2: Apply template to string, THEN tokenize (Incorrect way often used)
        str_template = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
        composed_ids = tokenizer(str_template)["input_ids"]

        if direct_ids != composed_ids:
            print(f"⚠️  Gotcha #7: Tokenizing the output of `apply_chat_template` ADDS extra special tokens.")
            print(f"    Direct ID length: {len(direct_ids)} vs Re-tokenized length: {len(composed_ids)}")
        else:
            print(f"✅  Tokenization of chat template string matches direct ID generation.")
    else:
        print("ℹ️  No chat template defined for this tokenizer.")

# Run for all models mentioned in the text
models = [
    "Qwen/Qwen2.5-0.5B",
    "microsoft/Phi-3-mini-128k-instruct",
    "CohereLabs/aya-expanse-8b",
    "meta-llama/Llama-3.2-1B-Instruct",
    "databricks/dbrx-instruct",
    "Qwen/Qwen2.5-0.5B-Instruct"
]

for model in models:
    check_tokenizer_gotchas(model)
