Growing Transformers - Model UNICODE 1_9 (247.6M)
This repository contains growing-transformers-model-unicode-1-9-247m, a constructively grown (layer-wise) model from the paper Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate (arXiv:2507.07129).
It is part of the comparative-study collection:
https://huggingface.co/collections/Bochkov/growing-transformers-layer-wise-expansion-comparative-study
Code:
https://github.com/AVBochkov/PGT
What this model is (in one paragraph)
This is a 9-layer decoder-only Transformer (GPT-like) trained with a constructive, layer-wise growth procedure: layers are added progressively, and previously trained layers are frozen at each stage. The key feature is its fully frozen "visual UNICODE" embedding substrate: token embeddings are deterministic and non-semantic by design (precomputed from Unicode glyph renderings), and not trained. The paper studies how semantic structure and reasoning capabilities can still emerge in deeper Transformer layers even when the input embedding layer is fixed and non-semantic.
Relationship to the 16-bit constructive model (main comparison)
This model is intended to be compared to:
- Bochkov/growing-transformers-model-16-bit-1-9-181m (constructive growth + frozen 16-bit binary embedding)
Same Transformer stack (controlled study)
Both models share the same Transformer stack architecture in the controlled study:
- 9 layers, decoder-only
- d_model = 1024
- n_head = 32
- context length used in training: 1024
- vocab_size = 65,536 (BVV tokenizer family)
They differ only in the embedding substrate (visual UNICODE vs 16-bit binary).
Important: parameter count difference (why 247.6M vs 181.6M)
The total parameter count is larger here (≈247.6M) than in the 16-bit ablation (≈181.6M) because the frozen embedding matrix is much larger:
- This UNICODE model uses a full-size frozen embedding at d_model = 1024, i.e. an embedding table of size vocab_size × d_model = 65,536 × 1,024 ≈ 67.1M parameters (frozen).
- The 16-bit ablation uses an extremely small embedding (n_embed = 16) and a deterministic expansion to d_model, so the embedding-related parameter count is dramatically smaller.
So: same Transformer stack, but different embedding-table size → different total parameters.
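A quick back-of-the-envelope check of the two embedding tables (using only the sizes quoted above; the n_embed = 16 figure comes from the 16-bit ablation's description):

```python
# Rough embedding-parameter comparison for the two frozen substrates
# (sketch based on the sizes quoted in this card, not on the released checkpoints).
vocab_size = 65_536
d_model = 1024
n_embed_16bit = 16

unicode_embed_params = vocab_size * d_model        # 67,108,864 ≈ 67.1M (frozen)
binary_embed_params = vocab_size * n_embed_16bit   # 1,048,576  ≈ 1.05M (frozen)

print(f"UNICODE embedding table: {unicode_embed_params / 1e6:.1f}M params")
print(f"16-bit embedding table:  {binary_embed_params / 1e6:.2f}M params")
print(f"Difference:              {(unicode_embed_params - binary_embed_params) / 1e6:.1f}M params")
# ~66M difference, which accounts for the 247.6M vs 181.6M gap in total parameters.
```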
Frozen embedding definition (visual UNICODE)
- vocab_size = 65,536
- Each token ID maps to a deterministic visual Unicode-based vector (precomputed), frozen during training.
- The embedding layer is intentionally non-semantic (semantic organization emerges in deeper layers).
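A minimal sketch of what a deterministic, glyph-derived embedding could look like (illustrative only; the bitmap size, font, and normalization here are assumptions, not the exact procedure from the paper):

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

D_MODEL = 1024
GLYPH_SIZE = 32  # 32x32 bitmap -> 1024 values, matching d_model (assumed, for illustration)

def glyph_embedding(codepoint: int) -> np.ndarray:
    """Render a Unicode codepoint to a small bitmap and flatten it into a fixed vector.
    Deterministic and non-semantic: nothing here is learned or updated during training."""
    img = Image.new("L", (GLYPH_SIZE, GLYPH_SIZE), color=0)
    draw = ImageDraw.Draw(img)
    try:
        draw.text((0, 0), chr(codepoint), fill=255, font=ImageFont.load_default())
    except (ValueError, OSError):
        pass  # unprintable / unsupported codepoint -> all-zeros glyph
    return np.asarray(img, dtype=np.float32).flatten() / 255.0  # length 32*32 == 1024

# Precompute a frozen table once; the real model ships such a table as an artifact.
embedding_table = np.stack([glyph_embedding(cp) for cp in range(256)])  # first 256 IDs as a demo
print(embedding_table.shape)  # (256, 1024)
```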
Model architecture (controlled study)
- Type: decoder-only Transformer (GPT-like)
- Layers: 9
- Hidden size: d_model = 1024
- Heads: n_head = 32
- RoPE: rotary position embeddings
- Activation: GELU
- Token embedding: frozen visual UNICODE embedding at d_model = 1024
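For reference, the same hyperparameters collected as a plain config dictionary (the key names are illustrative; check the repo's config.json for the actual field names):

```python
# Illustrative summary of the architecture described above; key names are hypothetical.
config = {
    "model_type": "decoder-only transformer (GPT-like)",
    "n_layer": 9,
    "d_model": 1024,
    "n_head": 32,
    "block_size": 1024,          # context length used in training
    "vocab_size": 65_536,        # BVV tokenizer family
    "positional_encoding": "RoPE",
    "activation": "GELU",
    "token_embedding": "frozen visual UNICODE (not trained)",
}
```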
Training method (constructive growth)
Training was performed in staged growth (as in the controlled study):
- Train layers 1β3, then freeze them
- Add layers 4β6, train only new layers, then freeze them
- Add layers 7β9, train only new layers
This isolates the effect of layer stacking on top of a stable frozen substrate (the embedding + already-trained lower layers).
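A minimal sketch of such a staged-growth loop, assuming a model whose Transformer blocks are exposed as an nn.ModuleList (the names model.blocks, train_stage, data_loader, and the optimizer settings are hypothetical, not taken from the released training code):

```python
import torch

# Sketch of constructive, layer-wise growth with freezing (illustrative only).
# Assumes `model.blocks` is an nn.ModuleList of 9 Transformer blocks and the
# frozen embedding table already has requires_grad=False.
stages = [(0, 3), (3, 6), (6, 9)]  # layer ranges: 1-3, 4-6, 7-9

for start, end in stages:
    # Freeze everything trained so far (including the embedding substrate).
    for p in model.parameters():
        p.requires_grad_(False)

    # Unfreeze only the newly added layers for this stage.
    # (In practice the output head / final norm may also stay trainable; omitted for brevity.)
    new_params = []
    for block in model.blocks[start:end]:
        for p in block.parameters():
            p.requires_grad_(True)
            new_params.append(p)

    optimizer = torch.optim.AdamW(new_params, lr=3e-4)  # hypothetical hyperparameters
    train_stage(model, optimizer, data_loader)          # hypothetical training loop
```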
Tokenizer
This model uses the BVV tokenizer family.
Canonical tokenizer repo:
- https://huggingface.co/Bochkov/bvv241-2-3
(collection: https://huggingface.co/collections/Bochkov/tokenizers)
Note: for exact reproducibility, it is recommended to load the tokenizer from this model repo, because this repo also contains the embedding artifacts that match the tokenizer IDs used here.
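For example, both loading paths below are standard AutoTokenizer calls; the first follows the recommendation above:

```python
from transformers import AutoTokenizer

# Recommended: tokenizer bundled with this model repo
# (its token IDs match the frozen embedding artifacts shipped here).
tok = AutoTokenizer.from_pretrained("Bochkov/growing-transformers-model-unicode-1-9-247m")

# Canonical BVV tokenizer repo (same family; use only if you know the ID mapping matches).
tok_canonical = AutoTokenizer.from_pretrained("Bochkov/bvv241-2-3")
```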
Intended use
Research / analysis of:
- constructive (layer-wise) growth training for Transformers
- emergent semantics with frozen / non-semantic embedding substrates
- comparisons against monolithic baselines and minimal-embedding ablations
Not intended as a general-purpose assistant model. Outputs may be unreliable and the model may reflect biases present in the training data.
How to use (Transformers)
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Bochkov/growing-transformers-model-unicode-1-9-247m")
model = AutoModelForCausalLM.from_pretrained(
    "Bochkov/growing-transformers-model-unicode-1-9-247m",
    trust_remote_code=True
).to('cuda')

inputs = torch.tensor([tokenizer.encode("Write a short poem about the ocean. ")], dtype=torch.long, device='cuda')
outputs = model.generate(
    inputs,
    max_new_tokens=50,
    do_sample=False
)
print(tokenizer.decode(outputs[0].tolist()))
# Write a short poem about the ocean.
# The poem was published in 1993 by the French journalist and poet Jean Baptiste de La Mo

inputs = torch.tensor([tokenizer.encode("Question: What is the capital of India?\nAnswer:")], dtype=torch.long, device='cuda')
outputs = model.generate(
    inputs,
    max_new_tokens=10,
    do_sample=False
)
print(tokenizer.decode(outputs[0].tolist()))
# Question: What is the capital of India?
# Answer:Mumbai </s><
```
🧑‍🔬 Citation & Concept
If you use this model or the underlying concepts in your research, please cite our work:
```bibtex
@article{bochkov2025emergent,
  title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
  author={Andrey Bochkov},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2025},
  url={https://openreview.net/forum?id=Odh8IynO1o}
}

@misc{bochkov2025growingtransformersmodularcomposition,
  title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate},
  author={A. Bochkov},
  year={2025},
  eprint={2507.07129},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2507.07129}
}
```