AB Dispatches

Building neGPT: A Small Language Model on Northeast India Historical Data

I have been wanting to understand GPTs not just at a surface level, but from the inside out — how they are built, how they are trained, and what really happens between raw text and generated output.

After studying Andrej Karpathy’s material, I decided that the best way to truly learn the intricacies of GPT-style models was to build one around a domain that genuinely interests me. That is how neGPT started.

For now, neGPT is a small language model project built on a historical data corpus focused on Northeast India. The idea is to work with domain-specific text and use this project as a practical way to understand the full pipeline behind language models — from data collection and preprocessing to training, evaluation, inference, and eventually deployment.

At this stage, this post is an introduction to the project rather than the full technical write-up.

In the detailed write-up that I will publish later, I plan to cover the entire process step by step.

The goal of this project is not just to build a small domain-specific language model, but also to deeply learn how GPT-like systems work in practice by implementing and experimenting with them myself.

I will keep updating this as the project progresses, and later publish a complete end-to-end account of the journey from concept to deployment.

Thank you for reading. If you found this post useful or enjoyed reading it, you can buy me a coffee.

#gpt #llm #nlp #northeast india #projects #tech