Datasets - an igormolybog Collection

updated May 8, 2024

Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models
Paper • 2311.06783 • Published Nov 12, 2023 • 28

To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning
Paper • 2311.07574 • Published Nov 13, 2023 • 16

Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding
Paper • 2401.04575 • Published Jan 9, 2024 • 18

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Paper • 2402.00159 • Published Jan 31, 2024 • 65

Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning
Paper • 2402.06619 • Published Feb 9, 2024 • 57

AutoMathText: Autonomous Data Selection with Language Models for Mathematical Texts
Paper • 2402.07625 • Published Feb 12, 2024 • 16

OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset
Paper • 2402.10176 • Published Feb 15, 2024 • 38

StarCoder 2 and The Stack v2: The Next Generation
Paper • 2402.19173 • Published Feb 29, 2024 • 152

WildChat: 1M ChatGPT Interaction Logs in the Wild
Paper • 2405.01470 • Published May 2, 2024 • 64

NeMo-Aligner: Scalable Toolkit for Efficient Model Alignment
Paper • 2405.01481 • Published May 2, 2024 • 30