Excellence in Policy Research

APSDI

Publications

Explore our collection of research papers, reports, white papers, and guides

Promoting Digital Transformation & Innovation In Africa

Authors: Dr Bright Aregs

Published: November 10, 2025

Abstract

There is growing evidence that pretraining on high-quality, carefully thought-out tokens such as code or mathematics plays an important role in improving the reasoning abilities of large language models. For example, Minerva, a PaLM model finetuned on billions of tokens of mathematical documents from arXiv and the web, reported dramatically improved performance on problems that require quantitative reasoning. However, because all known publicly released web datasets employ preprocessing that does not faithfully preserve mathematical notation, the benefits of large-scale training on quantitative web documents are unavailable to the research community. We introduce OpenWebMath, an open dataset inspired by these works containing 14.7B tokens of mathematical webpages from Common Crawl. We describe in detail our method for extracting text and LaTeX content and removing boilerplate from HTML documents, as well as our methods for quality filtering and deduplication.

Publication Details

Keywords: Math, Research, DataSet

ISBN: 978-0-061-96436-7

DOI: 10.3352/jeehp.2013.10.3
