What Happened
Developer Michael Hla trained a language model from scratch exclusively on pre-1900 scientific and literary text, then tested whether it could independently derive 20th-century physics concepts. The model is small by modern standards, which limits deep reasoning, yet when prompted with descriptions of landmark historical experiments it produced outputs stating that 'light is made up of definite quantities of energy' and suggesting an equivalence between gravity and acceleration, core ideas behind quantum mechanics and general relativity respectively. The dataset, model weights, and training code are publicly released on GitHub under the project name gpt1900, and a live demo of an early instruction-tuned checkpoint is available at gpt1900.com.
Why It Matters
This project is a practical stress test of a fundamental question: does an LLM reason, or does it pattern-match? The result is nuanced: with no physics fine-tuning at all, the model shows emergent physical intuition drawn from corpus statistics alone, even though its small size rules out deep multi-step derivation. For indie developers and SMEs building domain-specific LLMs, this demonstrates that narrow, high-quality datasets can produce surprisingly capable niche models even at small parameter counts. It also validates training from scratch on curated vertical data as an alternative to always fine-tuning a large base model.
- Full dataset and model weights are open-source — usable for research or as a baseline
- Training from scratch on a narrow corpus is feasible for solo developers with a limited GPU budget (a minimal sketch follows this list)
- The project frames physics rediscovery as an open benchmark problem, inviting community contributions
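The feasibility claim in the second bullet is concrete enough to sketch. Below is a minimal, illustrative from-scratch training loop, not the gpt1900 code: it assumes a plain-text corpus saved as `corpus.txt`, borrows the stock GPT-2 tokenizer as a stand-in (a custom tokenizer is a better fit for historical text), and uses a deliberately tiny configuration so it fits on a single consumer GPU.

```python
# Minimal from-scratch GPT training sketch (illustrative, not the gpt1900 code).
# Assumes a plain-text file corpus.txt; scale config values to your GPU budget.
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # stand-in tokenizer

class TextChunks(Dataset):
    """Slices one long token stream into fixed-length training blocks."""
    def __init__(self, path, block_size=256):
        ids = tokenizer(open(path, encoding="utf-8").read())["input_ids"]
        self.blocks = [ids[i:i + block_size]
                       for i in range(0, len(ids) - block_size, block_size)]
    def __len__(self):
        return len(self.blocks)
    def __getitem__(self, i):
        x = torch.tensor(self.blocks[i])
        return {"input_ids": x, "labels": x}

config = GPT2Config(              # deliberately tiny: one consumer GPU suffices
    vocab_size=tokenizer.vocab_size,
    n_positions=256, n_embd=384, n_layer=6, n_head=6,
)
model = GPT2LMHeadModel(config)   # random init, i.e. training from scratch
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

loader = DataLoader(TextChunks("corpus.txt"), batch_size=16, shuffle=True)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

model.train()
for epoch in range(3):
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss   # causal LM loss; labels shifted internally
        loss.backward()
        opt.step()
        opt.zero_grad()
    print(f"epoch {epoch}: last-batch loss {loss.item():.3f}")
```

Scaling `n_layer`, `n_embd`, and the block size up or down is the main lever for trading output quality against GPU memory and training time.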
Asia-Pacific Angle
Chinese and Southeast Asian developers building vertical LLMs, for example models trained on classical Chinese literature, Traditional Chinese Medicine texts, or pre-modern legal documents, can directly apply this methodology. The gpt1900 codebase provides a reproducible pipeline for corpus curation, tokenizer training, and instruction tuning on historical or domain-specific text. Teams at Chinese AI startups experimenting with smaller, cheaper models for specialized knowledge domains, a common necessity given export controls on high-end GPUs, will find the small-model-from-scratch approach particularly relevant. The open dataset also serves as a clean English-language benchmark corpus for multilingual transfer experiments.
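To give a flavor of the tokenizer-training step in such a pipeline, here is a short sketch using the Hugging Face `tokenizers` library; the corpus file name is a hypothetical stand-in, and byte-level BPE has the advantage of handling non-Latin scripts such as classical Chinese without an explicit character inventory.

```python
# Train a byte-level BPE tokenizer on a curated domain corpus.
# Illustrative sketch; "classical_texts.txt" is a hypothetical file name,
# not a path from the gpt1900 repo.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["classical_texts.txt"],   # your curated domain corpus
    vocab_size=16000,                # a smaller vocabulary suits narrow corpora
    min_frequency=2,
    special_tokens=["<s>", "</s>", "<pad>", "<unk>"],
)
tokenizer.save_model("domain_tokenizer")  # writes vocab.json and merges.txt
```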
Action Item This Week
Clone the gpt1900 GitHub repository, review the dataset construction scripts, and identify one narrow text corpus from your own domain — legal filings, technical manuals, or historical records — that you could substitute to train a comparable domain-specific model using the same pipeline.
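As a starting point for that substitution, a corpus-preparation pass might look like the sketch below: a small, hypothetical cleaning-and-deduplication script (the `raw_docs/` directory and the filter thresholds are illustrative, not the repo's actual dataset-construction logic) that emits a single `corpus.txt` ready for tokenizer and model training.

```python
# Hypothetical corpus-preparation sketch for substituting your own domain text.
# Assumes raw .txt files under raw_docs/; cleaning rules are illustrative only.
import hashlib
import re
from pathlib import Path

seen = set()
with open("corpus.txt", "w", encoding="utf-8") as out:
    for path in sorted(Path("raw_docs").glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        text = re.sub(r"[ \t]+", " ", text).strip()   # collapse stray whitespace
        if len(text) < 200:                           # drop near-empty files
            continue
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen:                            # exact-duplicate filter
            continue
        seen.add(digest)
        out.write(text + "\n\n")
```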