Building neGPT: A Small Language Model on Northeast India Historical Data

14 Mar, 2026

I have wanted to understand GPTs not just at a surface level but from the inside out: how they are built, how they are trained, and what really happens between raw text and generated output.

After studying Andrej Karpathy’s material, I decided that the best way to learn the intricacies of GPT-style models was to build one around a domain that genuinely interests me. That is how neGPT started.

For now, neGPT is a small language model project built on a historical corpus focused on Northeast India. The idea is to work with domain-specific text and use the project as a practical way to understand the full pipeline behind language models, from data collection and preprocessing to training, evaluation, inference, and eventually deployment.

This post is an introduction to the project rather than the full technical write-up.

In the detailed write-up that I will publish later, I plan to cover the entire process step by step, including:

how I came up with the project idea
why I chose Northeast India historical data as the corpus
how I collected and prepared the dataset
preprocessing and tokenization
model design and training setup
experiments, challenges, and observations
inference and output generation
deployment and future improvements

The goal of this project is not just to build a small domain-specific language model, but also to deeply learn how GPT-like systems work in practice by implementing and experimenting with them myself.

I will keep updating this as the project progresses, and later publish a complete end-to-end account of the journey from concept to deployment.

Thank you for reading. If you found this post useful or enjoyed reading it, you can buy me a coffee.
