Inference-Optimization - an ahmed-ali Collection

Inference-Optimization

updated 4 days ago

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Paper • 2305.13245 • Published May 22, 2023 • 6

Weighted Grouped Query Attention in Transformers

Paper • 2407.10855 • Published Jul 15, 2024

Fast Transformer Decoding: One Write-Head is All You Need

Paper • 1911.02150 • Published Nov 6, 2019 • 9

MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding

Paper • 2406.09297 • Published Jun 13, 2024 • 6

Reducing Transformer Key-Value Cache Size with Cross-Layer Attention

Paper • 2405.12981 • Published May 21, 2024 • 33

Effectively Compress KV Heads for LLM

Paper • 2406.07056 • Published Jun 11, 2024 • 1

The Nature of Mathematical Modeling and Probabilistic Optimization Engineering in Generative AI

Paper • 2410.18441 • Published Oct 24, 2024 • 7

TransMLA: Multi-head Latent Attention Is All You Need

Paper • 2502.07864 • Published Feb 11, 2025 • 57