DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

2402.03300

Published 4/30/2024 by Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu and 1 other

cs.CL cs.AI cs.LG

💬

Abstract

Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors: First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.

Get summaries of the top AI research delivered straight to your inbox:

Overview

The paper introduces DeepSeekMath 7B, a large language model trained on a vast amount of math-related data to improve its mathematical reasoning capabilities.
DeepSeekMath 7B achieves impressive performance on the competition-level MATH benchmark, approaching the level of state-of-the-art models like Gemini-Ultra and GPT-4.
The paper attributes the model's mathematical reasoning abilities to two key factors: leveraging publicly available web data and introducing a novel optimization technique called Group Relative Policy Optimization (GRPO).

Plain English Explanation

The paper presents a new large language model called DeepSeekMath 7B that is specifically designed to excel at mathematical reasoning. Mathematical reasoning is a significant challenge for language models due to the complex and structured nature of mathematics.

To address this challenge, the researchers behind DeepSeekMath 7B took two key steps. First, they gathered a massive amount of math-related data from the web, including 120B math-related tokens from Common Crawl. This allowed the model to learn a deep understanding of mathematical concepts and problem-solving strategies.

Second, the researchers introduced a new optimization technique called Group Relative Policy Optimization (GRPO), which is a variant of the well-known Proximal Policy Optimization (PPO) algorithm. GRPO helps the model develop stronger mathematical reasoning abilities while also improving its memory usage, making it more efficient.

The results are impressive: DeepSeekMath 7B achieves a score of 51.7% on the challenging MATH benchmark, approaching the performance of cutting-edge models like Gemini-Ultra and GPT-4. When the model's self-consistency is taken into account, the score rises to 60.9%, further demonstrating its mathematical prowess.

This research represents a significant step forward in the field of large language models for mathematical reasoning, and it has the potential to impact various domains that rely on advanced mathematical skills, such as scientific research, engineering, and education.

Technical Explanation

The paper introduces DeepSeekMath 7B, a large language model that has been pre-trained on a massive amount of math-related data from Common Crawl, totaling 120 billion tokens. This data, combined with natural language and code data, is used to continue the pre-training of the DeepSeek-Coder-Base-v1.5 7B model.

The key innovation in this work is the use of a novel optimization technique called Group Relative Policy Optimization (GRPO), which is a variant of the Proximal Policy Optimization (PPO) algorithm. GRPO is designed to enhance the model's mathematical reasoning abilities while also improving its memory usage, making it more efficient.

The researchers evaluate the performance of DeepSeekMath 7B on the competition-level MATH benchmark, and the model achieves an impressive score of 51.7% without relying on external toolkits or voting techniques. This performance level approaches that of state-of-the-art models like Gemini-Ultra and GPT-4.

Furthermore, the researchers demonstrate that leveraging the self-consistency of the model's outputs over 64 samples can further improve the performance, reaching a score of 60.9% on the MATH benchmark.

The paper attributes the strong mathematical reasoning capabilities of DeepSeekMath 7B to two key factors: the extensive math-related data used for pre-training and the introduction of the GRPO optimization technique.

Critical Analysis

The paper presents a compelling approach to improving the mathematical reasoning capabilities of large language models, and the results achieved by DeepSeekMath 7B are impressive. However, there are a few potential limitations and areas for further research that could be considered.

First, the paper does not provide a detailed analysis of the types of mathematical problems or concepts that DeepSeekMath 7B excels or struggles with. A more granular evaluation of the model's strengths and weaknesses could help identify areas for future improvements.

Additionally, the paper does not address the potential generalization of the GRPO technique to other types of reasoning tasks beyond mathematics. It would be interesting to explore the broader applicability of this optimization method and its impact on other domains.

Furthermore, the paper does not discuss the computational and resource requirements of training DeepSeekMath 7B, which could be a critical factor in the model's real-world deployability and scalability. Insights into the trade-offs between performance and efficiency would be valuable for the research community.

Despite these potential areas for further exploration, the overall approach and the results presented in the paper represent a significant step forward in the field of large language models for mathematical reasoning. The research has the potential to inspire future work and contribute to the development of more capable and accessible mathematical AI systems.

Conclusion

The paper introduces DeepSeekMath 7B, a large language model that has been specifically designed and trained to excel at mathematical reasoning. By leveraging a vast amount of math-related web data and introducing a novel optimization technique called Group Relative Policy Optimization (GRPO), the researchers have achieved impressive results on the challenging MATH benchmark.

DeepSeekMath 7B's performance, which approaches that of state-of-the-art models like Gemini-Ultra and GPT-4, demonstrates the significant potential of this approach and its broader implications for fields that rely on advanced mathematical skills. The research represents an important step forward in the ongoing efforts to develop large language models that can effectively tackle complex mathematical problems and reasoning tasks.

As the field of large language models for mathematical reasoning continues to evolve, the insights and techniques presented in this paper are likely to inspire further advancements and contribute to the development of even more capable and versatile mathematical AI systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, Weiyang Liu

Large language models (LLMs) have pushed the limits of natural language understanding and exhibited excellent problem-solving ability. Despite the great success, most existing open-source LLMs (e.g., LLaMA-2) are still far away from satisfactory for solving mathematical problem due to the complex reasoning procedures. To bridge this gap, we propose MetaMath, a fine-tuned language model that specializes in mathematical reasoning. Specifically, we start by bootstrapping mathematical questions by rewriting the question from multiple perspectives without extra knowledge, which results in a new dataset called MetaMathQA. Then we fine-tune the LLaMA-2 models on MetaMathQA. Experimental results on two popular benchmarks (i.e., GSM8K and MATH) for mathematical reasoning demonstrate that MetaMath outperforms a suite of open-source LLMs by a significant margin. Our MetaMath-7B model achieves 66.4% on GSM8K and 19.4% on MATH, exceeding the state-of-the-art models of the same size by 11.5% and 8.7%. Particularly, MetaMath-70B achieves an accuracy of 82.3% on GSM8K, slightly better than GPT-3.5-Turbo. We release all the MetaMathQA dataset, the MetaMath models with different model sizes and the training code for public use.

5/6/2024

cs.CL cs.AI

💬

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-AI

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.

5/9/2024

cs.CL cs.AI

Large Language Models for Mathematical Reasoning: Progresses and Challenges

Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, Wenpeng Yin

Mathematical reasoning serves as a cornerstone for assessing the fundamental cognitive capabilities of human intelligence. In recent times, there has been a notable surge in the development of Large Language Models (LLMs) geared towards the automated resolution of mathematical problems. However, the landscape of mathematical problem types is vast and varied, with LLM-oriented techniques undergoing evaluation across diverse datasets and settings. This diversity makes it challenging to discern the true advancements and obstacles within this burgeoning field. This survey endeavors to address four pivotal dimensions: i) a comprehensive exploration of the various mathematical problems and their corresponding datasets that have been investigated; ii) an examination of the spectrum of LLM-oriented techniques that have been proposed for mathematical problem-solving; iii) an overview of factors and concerns affecting LLMs in solving math; and iv) an elucidation of the persisting challenges within this domain. To the best of our knowledge, this survey stands as one of the first extensive examinations of the landscape of LLMs in the realm of mathematics, providing a holistic perspective on the current state, accomplishments, and future challenges in this rapidly evolving field.

4/8/2024

cs.CL

MATHSENSEI: A Tool-Augmented Large Language Model for Mathematical Reasoning

Debrup Das, Debopriyo Banerjee, Somak Aditya, Ashish Kulkarni

Tool-augmented Large Language Models (TALMs) are known to enhance the skillset of large language models (LLMs), thereby, leading to their improved reasoning abilities across many tasks. While, TALMs have been successfully employed in different question-answering benchmarks, their efficacy on complex mathematical reasoning benchmarks, and the potential complementary benefits offered by tools for knowledge retrieval and mathematical equation solving are open research questions. In this work, we present MathSensei, a tool-augmented large language model for mathematical reasoning. We study the complementary benefits of the tools - knowledge retriever (Bing Web Search), program generator + executor (Python), and symbolic equation solver (Wolfram-Alpha API) through evaluations on mathematical reasoning datasets. We perform exhaustive ablations on MATH, a popular dataset for evaluating mathematical reasoning on diverse mathematical disciplines. We also conduct experiments involving well-known tool planners to study the impact of tool sequencing on the model performance. MathSensei achieves 13.5% better accuracy over gpt-3.5-turbo with Chain-of-Thought on the MATH dataset. We further observe that TALMs are not as effective for simpler math word problems (in GSM-8K), and the benefit increases as the complexity and required knowledge increases (progressively over AQuA, MMLU-Math, and higher level complex questions in MATH). The code and data are available at https://github.com/Debrup-61/MathSensei.

4/4/2024

cs.CL