Think, Prune, Train, Improve: Scaling Reasoning Without Scaling Models

Preprint, 2025


Think, Prune, Train (TPT) is a scalable framework that enables smaller language models to achieve performance rivaling larger ones through iterative self-improvement on their own reasoning traces, with experimental results showing Gemma2-2B improving substantially on reasoning tasks and LLaMA-3.1-70B-Instruct surpassing GPT-4o.

Abstract

Large language models (LLMs) have demonstrated strong capabilities in programming and mathematical reasoning tasks, but are constrained by the scarcity of high-quality training data. Synthetic data can be leveraged to enhance fine-tuning outcomes, but several factors govern this process, including model size, synthetic data volume, pruning strategy, and number of fine-tuning rounds. We explore these axes and investigate the conditions under which models can self-improve. We introduce the Think, Prune, Train process, a scalable framework that iteratively fine-tunes models on their own reasoning traces, using ground-truth pruning to ensure high-quality training data. This approach yields clear gains on GSM8K: Gemma2-2B achieves a Pass@1 of 57.6% (up from 41.9%); Gemma2-9B reaches 82%, matching LLaMA-3.1-70B; and LLaMA-3.1-70B attains 91%, surpassing even GPT-4o. These results demonstrate the effectiveness of self-generated reasoning and systematic data selection for improving LLM capabilities.
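
The abstract describes a three-step loop: sample reasoning traces (Think), filter them against ground-truth answers (Prune), and fine-tune on the survivors (Train), repeated for several rounds. As a rough illustration only, the sketch below shows one way such a loop could be wired up; the function names, signatures, and hyperparameters are illustrative assumptions, not the paper's actual pipeline.

```python
from typing import Callable, List, Tuple

def think_prune_train(
    generate: Callable[[str], str],                     # samples one reasoning trace for a problem
    extract_answer: Callable[[str], str],               # parses the final answer out of a trace
    finetune: Callable[[List[Tuple[str, str]]], None],  # SFT step on (problem, trace) pairs
    problems: List[str],
    gold_answers: List[str],
    rounds: int = 3,
    samples_per_problem: int = 4,
) -> None:
    """Run the Think -> Prune -> Train loop for a fixed number of rounds.

    All callables are placeholders standing in for a concrete model and
    training stack; this is a sketch of the loop structure, not the
    authors' implementation.
    """
    for _ in range(rounds):
        kept: List[Tuple[str, str]] = []
        for problem, gold in zip(problems, gold_answers):
            for _ in range(samples_per_problem):
                trace = generate(problem)          # Think: sample a reasoning trace
                if extract_answer(trace) == gold:  # Prune: keep only ground-truth-correct traces
                    kept.append((problem, trace))
        finetune(kept)                             # Train: fine-tune on the surviving traces
```

Because pruning uses only final-answer correctness against ground truth, the loop needs no external teacher model or learned reward model, which is what makes the self-improvement recipe scalable across model sizes.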