In a Training Loop 🔄
Stefan Schweter
PRO
stefan-it
3601 followers · 366 following
https://schweter.bayern
AI & ML interests
Flair Library 💕, NER & PoS Tagging, LM Pretraining (mostly encoder-only & encoder-decoder), Historical Language Models, German Language Models, Bavarian NLP 🥨
Recent Activity
liked a dataset about 6 hours ago
minilingua-ai/mcqa-minilingua-sft
liked a model about 6 hours ago
minilingua-ai/MiniLingua-1b
reacted to martinsu's post with 🔥 4 days ago
I wasted days on a GPU node on a bug that shouldn't exist

So I was fine-tuning TildeOPEN-30B and the outputs were... weird. Token ID 179 (<0x00>) kept appearing between almost every token pair. Took me a bit to figure out what was going on. Turns out I used the fast tokenizer for training, but the model was trained on the slow one. Silent failure.

Well... long story short: TGI uses (forces) the fast tokenizer, no questions asked. And you get agile's kryptonite: silent failure. If the model was trained on slow, it's a silent disaster.

I got curious and wrote a quick script to check how common this is. Ran it on 6,014 LLM HF models overnight. Roughly 10% of HF model downloads have mismatched tokenizers. Not all mismatches are catastrophic, but some are brutal, like chat template markers inflating from 1 token to 3, silently wrecking context windows and causing the model to act weird.

This wasn't rigorous research, but the drift is real. And the worst part? 968 models (of those with 500+ downloads) have both fast and slow tokenizers present, but they still produce different outputs. No missing files, no errors, just silent degradation.

TGI defaults to the fast tokenizer, as does AutoTokenizer.from_pretrained(). If a fast tokenizer doesn't exist, it auto-generates one. If your model was trained on slow, you get silent degradation. Output looks fine; the model just performs worse. Sometimes much worse. You'd never know. If the model was trained on the fast tokenizer, it's fine, but how do you know?

The root cause? Either model authors run the HF conversion and upload both without verifying, or users run TGI, which always forces (converts to) fast.

The result of this fight with tokenizers is https://huggingface.co/martinsu/tildeopen-30b-mu-instruct
It's based on TildeOPEN-30B (a solid EU HPC multilingual base). Nothing fancy, just a proper instruction fine-tune where I didn't mess up the tokenizer this time.

Full article: https://github.com/martins-u/tokenmagedon
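The check described in the post (do the slow and fast tokenizers agree?) can be sketched roughly as follows. This is not the author's script, only a minimal illustration: `find_mismatches`, `load_encoders`, `toy_slow`, and `toy_fast` are hypothetical names introduced here. The only library calls assumed are `AutoTokenizer.from_pretrained(..., use_fast=...)` and `tokenizer.encode(..., add_special_tokens=False)` from `transformers`, and the import is deferred so the comparison logic can be tried without downloading anything.

```python
from typing import Callable, Iterable, List, Tuple

Encoder = Callable[[str], List[int]]


def find_mismatches(
    encode_slow: Encoder, encode_fast: Encoder, texts: Iterable[str]
) -> List[Tuple[str, List[int], List[int]]]:
    """Return (text, slow_ids, fast_ids) for every text where the IDs differ."""
    mismatches = []
    for text in texts:
        slow_ids = list(encode_slow(text))
        fast_ids = list(encode_fast(text))
        if slow_ids != fast_ids:
            mismatches.append((text, slow_ids, fast_ids))
    return mismatches


def load_encoders(model_id: str) -> Tuple[Encoder, Encoder]:
    """Hypothetical helper: load both tokenizer variants for one model.

    Needs `transformers` installed and access to the model files, so it is
    imported lazily and not called in the toy example below.
    """
    from transformers import AutoTokenizer

    slow = AutoTokenizer.from_pretrained(model_id, use_fast=False)
    fast = AutoTokenizer.from_pretrained(model_id, use_fast=True)
    return (
        lambda t: slow.encode(t, add_special_tokens=False),
        lambda t: fast.encode(t, add_special_tokens=False),
    )


# Toy stand-ins so the comparison can be run offline. The "fast" one
# inserts an extra ID 0 between tokens, loosely imitating the stray
# token the post describes.
def toy_slow(text: str) -> List[int]:
    return [len(word) for word in text.split()]


def toy_fast(text: str) -> List[int]:
    ids: List[int] = []
    for word in text.split():
        if ids:
            ids.append(0)  # spurious separator token
        ids.append(len(word))
    return ids


if __name__ == "__main__":
    for text, slow_ids, fast_ids in find_mismatches(
        toy_slow, toy_fast, ["hello", "hello world"]
    ):
        print(repr(text), slow_ids, fast_ids)
```

With a real model, something like `find_mismatches(*load_encoders(model_id), probe_texts)` over a handful of probe strings (including the chat template markers) would be the obvious next step. Which probe strings are representative is a judgment call this sketch leaves open.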
stefan-it's activity
liked a dataset about 6 hours ago
minilingua-ai/mcqa-minilingua-sft
Viewer • Updated Jul 27 • 17.2k • 9 • 1
liked a model about 6 hours ago
minilingua-ai/MiniLingua-1b
Updated Jul 27 • 5 • 1
liked 2 models 7 days ago
Cognitive-Lab/NetraEmbed
Visual Document Retrieval • 4B • Updated 7 days ago • 622 • 22
Cognitive-Lab/ColNetraEmbed
Visual Document Retrieval • Updated 7 days ago • 383 • 1
liked a model 17 days ago
QwQZh/gated_attention
Updated May 10 • 18
liked 2 datasets 26 days ago
uv-scripts/sam3
Updated 26 days ago • 1.23k • 9
nvidia/Nemotron-PII
Viewer • Updated about 4 hours ago • 200k • 1.93k • 44
liked a dataset 29 days ago
HuiHuang/NER-CoT
Viewer • Updated 29 days ago • 45.8k • 47 • 1
liked a model about 1 month ago
ZurichNLP/DETECT
Updated Oct 22 • 2
liked 2 datasets about 1 month ago
HuggingFaceFW/finepdfs-edu
Viewer • Updated Nov 11 • 49.5M • 12.5k • 56
PleIAs/SYNTH
Viewer • Updated Nov 11 • 68M • 50k • 199
liked a model about 1 month ago
moonshotai/Kimi-K2-Thinking
Text Generation • Updated Nov 8 • 398k • 1.54k
liked 3 datasets about 1 month ago
Eurolingua/HPLT3-198-500k
Preview • Updated Nov 10 • 229 • 1
HPLT/HPLT3.0
Updated Nov 14 • 184 • 9
JZSG/gallica_heritage
Viewer • Updated Nov 2 • 1.55M • 320 • 1
liked a Space about 2 months ago
UGC Video Generator 🎥
Running • 1
Generate influencer-style UGC videos from a form
liked a model about 2 months ago
ByteDance/Ouro-1.4B
Text Generation • Updated about 1 month ago • 13.6k • 57
liked a dataset about 2 months ago
nvidia/Nemotron-Safety-Guard-Dataset-v3
Viewer • Updated Oct 22 • 387k • 1.36k • 15
liked 2 models about 2 months ago
openai/gpt-oss-safeguard-120b
Text Generation • 120B • Updated Oct 29 • 29.8k • 72
openai/gpt-oss-safeguard-20b
Text Generation • 22B • Updated Oct 29 • 44.2k • 161