Building neGPT: A Small Language Model on Northeast India Historical Data
I have been wanting to understand GPTs not just at a surface level, but from the inside out — how they are built, how they are trained, and what really happens between raw text and generated output.
After studying Andrej Karpathy’s material, I decided that the best way to truly learn the intricacies of GPT-style models was to build one around a domain that genuinely interests me. That is how neGPT started.
For now, neGPT is a small language model project built on a corpus of historical text about Northeast India. The idea is to work with domain-specific text and use the project as a practical way to understand the full pipeline behind language models: data collection, preprocessing, training, evaluation, inference, and eventually deployment.
At this stage, this post is an introduction to the project rather than the full technical write-up.
In the detailed write-up that I will publish later, I plan to cover the entire process step by step, including:
- how I came up with the project idea
- why I chose Northeast India historical data as the corpus
- how I collected and prepared the dataset
- preprocessing and tokenization
- model design and training setup
- experiments, challenges, and observations
- inference and output generation
- deployment and future improvements
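To make the "preprocessing and tokenization" item a little more concrete, here is a minimal sketch of one common starting point: a character-level tokenizer with a simple train/validation split. This is an illustration under stated assumptions, not necessarily the approach neGPT uses. The sample sentence, the function names, and the 90/10 split are all hypothetical placeholders.

```python
# Minimal character-level tokenizer sketch (illustrative only; an assumption,
# not a description of neGPT's actual tokenizer).

def build_vocab(text):
    """Map every distinct character in the text to an integer id and back."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos

def encode(text, stoi):
    """Turn a string into a list of integer token ids."""
    return [stoi[ch] for ch in text]

def decode(ids, itos):
    """Turn a list of integer token ids back into a string."""
    return "".join(itos[i] for i in ids)

def split_train_val(ids, train_fraction=0.9):
    """Hold out the tail of the token stream for validation (assumed 90/10)."""
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]

# Hypothetical one-sentence stand-in for the real corpus.
sample_text = "Assam is a state in Northeast India."

stoi, itos = build_vocab(sample_text)
ids = encode(sample_text, stoi)
train_ids, val_ids = split_train_val(ids)

print("vocabulary size:", len(stoi))
print("round trip ok:", decode(ids, itos) == sample_text)
```

A character-level vocabulary keeps the example tiny and guarantees that any text in the corpus can be encoded. A real project might well swap in a subword tokenizer instead.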
The goal of this project is not just to build a small domain-specific language model, but also to learn in depth how GPT-like systems work in practice by implementing and experimenting with them myself.
I will keep updating this post as the project progresses, and later publish a complete end-to-end account of the journey from concept to deployment.
Thank you for reading. If you found this post useful or enjoyed reading it, you can buy me a coffee.