Stefan Schweter's picture

In a Training Loop 🔄

Stefan Schweter PRO

stefan-it

·

https://schweter.bayern

AI & ML interests

Flair Library 💕, NER & PoS Tagging, LM Pretraining (mostly encoder-only & encoder-decoder), Historical Language Models, German Language Models, Bavarian NLP 🥨

Recent Activity

liked a dataset about 6 hours ago

minilingua-ai/mcqa-minilingua-sft

liked a model about 6 hours ago

minilingua-ai/MiniLingua-1b

reacted to martinsu's post with 🔥 4 days ago

I wasted days on a GPU node on a bug that shouldn't exist So I was fine-tuning TildeOPEN-30B and the outputs were... weird. Token ID 179 (<0x00>) kept appearing between almost every token pair. Took me a bit to figure out what was going on. Turns out I used the fast tokenizer for training, but the model was trained on the slow one. Silent failure. Well... long story short—TGI uses (forces) the fast tokenizer, no questions asked. And you'll have agile's kryptonite: silent failure. If the model was trained on slow, it's a silent disaster. I got curious and wrote a quick script to check how common this is. Ran it on 6,014 LLM HF models overnight. Roughly 10% of HF model downloads have mismatched tokenizers. Not all mismatches are catastrophic, but some are brutal — like chat template markers inflating from 1 token to 3, silently wrecking context windows and causing model act weird. This wasn't rigorous research, but the drift is real. And the worst part? 968 models(out of 500+ downloads) have both fast and slow tokenizers present, but they still produce different outputs. No missing files, no errors — just silent degradation. TGI defaults to the fast tokenizer, as does AutoTokenizer.from_pretrained(). If a fast tokenizer doesn't exist, it auto-generates one. If your model was trained on slow, you get silent degradation. Output looks fine; the model just performs worse. Sometimes really worse. You'd never know. If model was trained on fast tokenizer, its fine, but how do You know? The root cause? Either model authors run HF conversion and upload both without verifying, or users run TGI, which always forces(converts to) fast . The result of this fight with tokenizers is https://huggingface.co/martinsu/tildeopen-30b-mu-instruct It's based on TildeOPEN-30B (a solid EU HPC multilingual base). Nothing fancy—just a proper instruction fine-tune where I didn't mess up the tokenizer this time. Full article: https://github.com/martins-u/tokenmagedon

View all activity

Organizations

liked a dataset about 6 hours ago

minilingua-ai/mcqa-minilingua-sft

Viewer • Updated Jul 27 • 17.2k • 9 • 1

liked a model about 6 hours ago

minilingua-ai/MiniLingua-1b

Updated Jul 27 • 5 • 1

liked 2 models 7 days ago

Cognitive-Lab/NetraEmbed

Visual Document Retrieval • 4B • Updated 7 days ago • 622 • 22

Cognitive-Lab/ColNetraEmbed

Visual Document Retrieval • Updated 7 days ago • 383 • 1

liked a model 17 days ago

QwQZh/gated_attention

Updated May 10 • 18

liked 2 datasets 26 days ago

uv-scripts/sam3

Updated 26 days ago • 1.23k • 9

nvidia/Nemotron-PII

Viewer • Updated about 4 hours ago • 200k • 1.93k • 44

liked a dataset 29 days ago

HuiHuang/NER-CoT

Viewer • Updated 29 days ago • 45.8k • 47 • 1

liked a model about 1 month ago

ZurichNLP/DETECT

Updated Oct 22 • 2

liked 2 datasets about 1 month ago

HuggingFaceFW/finepdfs-edu

Viewer • Updated Nov 11 • 49.5M • 12.5k • 56

PleIAs/SYNTH

Viewer • Updated Nov 11 • 68M • 50k • 199

liked a model about 1 month ago

moonshotai/Kimi-K2-Thinking

Text Generation • Updated Nov 8 • 398k • • 1.54k

liked 3 datasets about 1 month ago

Eurolingua/HPLT3-198-500k

Preview • Updated Nov 10 • 229 • 1

HPLT/HPLT3.0

Updated Nov 14 • 184 • 9

JZSG/gallica_heritage

Viewer • Updated Nov 2 • 1.55M • 320 • 1

liked a Space about 2 months ago

UGC Video Generator

Generate influencer-style UGC videos from a form

liked a model about 2 months ago

ByteDance/Ouro-1.4B

Text Generation • Updated about 1 month ago • 13.6k • 57

liked a dataset about 2 months ago

nvidia/Nemotron-Safety-Guard-Dataset-v3

Viewer • Updated Oct 22 • 387k • 1.36k • 15

liked 2 models about 2 months ago

openai/gpt-oss-safeguard-120b

Text Generation • 120B • Updated Oct 29 • 29.8k • 72

openai/gpt-oss-safeguard-20b

Text Generation • 22B • Updated Oct 29 • 44.2k • • 161