Growing Transformers - Model UNICODE 1_9 (247.6M)

This repository contains growing-transformers-model-unicode-1-9-247m, a constructively grown (layer-wise) model from the papers:

📚 Paper (Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate) - https://arxiv.org/abs/2507.07129

📚 Paper (Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations) - https://openreview.net/forum?id=Odh8IynO1o

It is part of the comparative-study collection:
https://huggingface.co/collections/Bochkov/growing-transformers-layer-wise-expansion-comparative-study

Code:
https://github.com/AVBochkov/PGT


What this model is (in one paragraph)

This is a 9-layer decoder-only Transformer (GPT-like) trained with a constructive, layer-wise growth procedure: layers are added progressively, and previously trained layers are frozen at each stage. The key feature is its fully frozen “visual UNICODE” embedding substrate: token embeddings are deterministic and non-semantic by design (precomputed from Unicode glyph renderings), and not trained. The paper studies how semantic structure and reasoning capabilities can still emerge in deeper Transformer layers even when the input embedding layer is fixed and non-semantic.
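
To make the frozen substrate concrete, below is a minimal, hedged sketch of how a glyph-based embedding row could be produced. It is not the paper's exact pipeline: the real vocabulary is the 65,536-token BVV tokenizer (tokens may span several characters), and the 32×32 render size is only an assumption chosen so the flattened bitmap happens to match d_model = 1024.

import numpy as np
from PIL import Image, ImageDraw, ImageFont

def glyph_vector(codepoint: int, size: int = 32) -> np.ndarray:
    """Render one Unicode code point as a size x size bitmap and flatten it."""
    img = Image.new("L", (size, size), color=0)               # blank grayscale canvas
    ImageDraw.Draw(img).text((0, 0), chr(codepoint), fill=255,
                             font=ImageFont.load_default())   # draw the glyph
    return np.asarray(img, dtype=np.float32).flatten() / 255.0

vec = glyph_vector(ord("A"))
print(vec.shape)  # (1024,) -- a deterministic, non-semantic vector for this glyph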


Relationship to the 16-bit constructive model (main comparison)

This model is intended to be compared to:

  • Bochkov/growing-transformers-model-16-bit-1-9-181m (constructive growth + frozen 16-bit binary embedding)

Same Transformer stack (controlled study)

Both models share the same Transformer stack architecture in the controlled study:

  • 9 layers, decoder-only
  • d_model = 1024
  • n_head = 32
  • context length used in training: 1024
  • vocab_size = 65,536 (BVV tokenizer family)

They differ only in the embedding substrate (visual UNICODE vs 16-bit binary).

Important: parameter count difference (why 247.6M vs 181.6M)

The total parameter count is larger here (≈247.6M) than in the 16-bit ablation (≈181.6M) because the frozen embedding matrix is much larger:

  • This UNICODE model uses a full-size frozen embedding at d_model = 1024, i.e. an embedding table of size
    vocab_size × d_model = 65,536 × 1,024 ≈ 67.1M parameters (frozen).
  • The 16-bit ablation uses an extremely small embedding (n_embed = 16) and a deterministic expansion to d_model, so the embedding-related parameter count is dramatically smaller.

So: same Transformer stack, but different embedding-table size → different total parameters.
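
The gap can be checked with simple arithmetic. The sketch below compares only the input embedding tables (the rest of the stack is shared), which already accounts for essentially the whole 247.6M vs 181.6M difference; the exact totals of the two checkpoints also include the shared Transformer layers and output head.

vocab_size, d_model, n_embed_16bit = 65_536, 1_024, 16

unicode_emb = vocab_size * d_model        # 67,108,864 params (~67.1M, frozen)
binary_emb  = vocab_size * n_embed_16bit  #  1,048,576 params (~1.0M, before expansion)

print(f"UNICODE embedding table: {unicode_emb / 1e6:.1f}M")
print(f"16-bit embedding table:  {binary_emb / 1e6:.1f}M")
print(f"difference:              {(unicode_emb - binary_emb) / 1e6:.1f}M")
# ~66.1M, close to the 247.6M vs 181.6M gap between the two checkpoints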


Frozen embedding definition (visual UNICODE)

  • vocab_size = 65,536
  • Each token ID maps to a deterministic visual Unicode-based vector (precomputed), frozen during training.
  • The embedding layer is intentionally non-semantic (semantic organization emerges in deeper layers).
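
In code terms, "frozen" simply means the precomputed table is loaded and excluded from gradient updates. A minimal PyTorch sketch (random values stand in for the real precomputed visual-Unicode vectors shipped with this repo):

import torch
import torch.nn as nn

vocab_size, d_model = 65_536, 1_024

# Placeholder: in the actual model these rows are the precomputed
# visual-Unicode vectors, not random numbers.
precomputed = torch.randn(vocab_size, d_model)

emb = nn.Embedding.from_pretrained(precomputed, freeze=True)  # no gradient updates
token_ids = torch.tensor([[17, 42, 65_535]])                  # arbitrary token IDs
print(emb(token_ids).shape)      # torch.Size([1, 3, 1024])
print(emb.weight.requires_grad)  # False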

Model architecture (controlled study)

  • Type: decoder-only Transformer (GPT-like)
  • Layers: 9
  • Hidden size: d_model = 1024
  • Heads: n_head = 32
  • RoPE: rotary position embeddings
  • Activation: GELU
  • Token embedding: frozen visual UNICODE embedding at d_model = 1024
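
For orientation, the sizes above can be gathered into a small config object; the per-head dimension follows from d_model / n_head. This dataclass is illustrative only (it is not the configuration class shipped with the checkpoint, and it does not capture RoPE or GELU).

from dataclasses import dataclass

@dataclass
class UnicodeGrowthConfig:      # illustrative, not the repo's actual config class
    n_layer: int = 9
    d_model: int = 1024
    n_head: int = 32
    block_size: int = 1024      # training context length
    vocab_size: int = 65_536    # BVV tokenizer family

    @property
    def head_dim(self) -> int:
        return self.d_model // self.n_head  # 1024 / 32 = 32

print(UnicodeGrowthConfig().head_dim)  # 32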

Training method (constructive growth)

Training was performed in staged growth (as in the controlled study):

  1. Train layers 1–3, then freeze them
  2. Add layers 4–6, train only new layers, then freeze them
  3. Add layers 7–9, train only new layers

This isolates the effect of layer stacking on top of a stable frozen substrate (the embedding + already-trained lower layers).
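
A hedged sketch of one growth stage follows (the helper and the stock layers are illustrative stand-ins; the actual training code is in the PGT repo linked above). The pattern is: freeze everything trained so far, append the next block of layers, and give only the new parameters to the optimizer.

import torch
import torch.nn as nn

def grow_stage(blocks: nn.ModuleList, new_blocks: list) -> list:
    """Freeze already-trained blocks, append new ones, return trainable params."""
    for p in blocks.parameters():
        p.requires_grad = False   # earlier layers join the frozen substrate
    for b in new_blocks:
        blocks.append(b)
    return [p for p in blocks.parameters() if p.requires_grad]

def make_block():
    # Stand-in for a GPT-style decoder block with RoPE; a stock layer is used
    # here only to keep the sketch self-contained.
    return nn.TransformerEncoderLayer(d_model=1024, nhead=32, batch_first=True)

blocks = nn.ModuleList([make_block() for _ in range(3)])            # stage 1: layers 1-3
trainable = grow_stage(blocks, [make_block() for _ in range(3)])    # stage 2: add layers 4-6
optimizer = torch.optim.AdamW(trainable, lr=3e-4)                   # optimizes only layers 4-6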


Tokenizer

This model uses the BVV tokenizer family.

Canonical tokenizer repo:

Note: for exact reproducibility, it is recommended to load the tokenizer from this model repo, because this repo also contains the embedding artifacts that match the tokenizer IDs used here.


Intended use

Research / analysis of:

  • constructive (layer-wise) growth training for Transformers
  • emergent semantics with frozen / non-semantic embedding substrates
  • comparisons against monolithic baselines and minimal-embedding ablations

Not intended as a general-purpose assistant model. Outputs may be unreliable and the model may reflect biases present in the training data.


How to use (Transformers)


import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Bochkov/growing-transformers-model-unicode-1-9-247m")
model = AutoModelForCausalLM.from_pretrained("Bochkov/growing-transformers-model-unicode-1-9-247m", trust_remote_code=True).to('cuda')

inputs = torch.tensor([tokenizer.encode("Write a short poem about the ocean. ")], dtype=torch.long, device='cuda')

outputs = model.generate(
    inputs, 
    max_new_tokens=50,
    do_sample=False
)
print(tokenizer.decode(outputs[0].tolist()))
#Write a short poem about the ocean. 
#The poem was published in 1993 by the French journalist and poet Jean Baptiste de La Mo

inputs = torch.tensor([tokenizer.encode("Question: What is the capital of India?\nAnswer:")], dtype=torch.long, device='cuda')

outputs = model.generate(
    inputs, 
    max_new_tokens=10,
    do_sample=False
)
print(tokenizer.decode(outputs[0].tolist()))
#Question: What is the capital of India?
#Answer:Mumbai     </s><

πŸ§‘β€πŸ”¬ Citation & Concept

If you use this model or the underlying concepts in your research, please cite our work:

@article{bochkov2025emergent,
      title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
      author={Andrey Bochkov},
      journal={Transactions on Machine Learning Research},
      issn={2835-8856},
      year={2025},
      url={https://openreview.net/forum?id=Odh8IynO1o}
}

@misc{bochkov2025growingtransformersmodularcomposition,
      title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate}, 
      author={A. Bochkov},
      year={2025},
      eprint={2507.07129},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.07129}, 
}