June 1, 2023
I'm making my own GPT...
Language models have transformed the way we interact with text, enabling applications like chatbots, text generation, and more. While pre-trained models like GPT have gained popularity, creating your own language model trained on custom data can be an exciting and rewarding endeavor. So I decided to make my own GPT and focus it on pulp magazine fiction from the last century.
- In the beginning I needed to define my project through the data I was going to gather. I wanted to focus on pulp magazine fiction, pulling data from: https://archive.org/details/pulpmagazinearchive. This involved a substantial amount of web scraping, which had a steep learning curve but was fun to do. In the end I created the cleanest version I could of about a century's worth of pulp magazine text. I'm still worried that a lot of bumf remains: these magazines were OCR'ed, and the OCR brings in everything, including text from adverts and the like.
- The next stage is data preprocessing. This involves cleaning the data by removing unnecessary symbols, punctuation, or irrelevant content. Tokenization will be a crucial step, where you split your text into smaller units like words, subwords, or characters. A subword vocabulary is pretty large, so my tokens will be individual characters, which takes the vocabulary size from roughly 50,000 tokens down to about 70. I'm just getting a feel for the process, so I'm not going to use a library such as NLTK or spaCy for tokenization, but that could be a ripe area for upgrading this project.
- Having preprocessed the data, I'm going to work through training the language model. I'm hoping to use TensorFlow, where I'll initialize my model with the appropriate hyperparameters, such as the number of layers, hidden units, and attention heads (I'm still researching what is required here). Then I'll feed my preprocessed data into the model and train it using techniques like unsupervised learning or fine-tuning. I'll use a 90/10 split of training data to test data.
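To make the character-level tokenization and the 90/10 split a bit more concrete, here's a minimal sketch in plain Python. It's only an illustration: the sample text is a made-up stand-in for the cleaned pulp corpus, and the names `stoi`, `itos`, `encode` and `decode` are placeholders I've picked for this sketch, not final project code.

```python
# Made-up stand-in for the cleaned pulp corpus (assumption: the real text
# would be read from the scraped files instead).
text = "The rocket roared over the red desert. The detective lit a cigarette."

# Character-level vocabulary: every distinct character is one token.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for ch, i in stoi.items()}      # integer id -> character


def encode(s):
    """Turn a string into a list of integer token ids.

    Raises KeyError for any character that is not in the vocabulary.
    """
    return [stoi[ch] for ch in s]


def decode(ids):
    """Turn a list of integer token ids back into a string."""
    return "".join(itos[i] for i in ids)


ids = encode(text)

# 90/10 split: the first 90% of the tokens for training, the rest for testing.
split = int(0.9 * len(ids))
train_ids, test_ids = ids[:split], ids[split:]

print("vocabulary size:", len(chars))
print("train tokens:", len(train_ids), "test tokens:", len(test_ids))
```

Because the vocabulary is built from the whole text before splitting, every character in the test portion already has an id. On the real corpus I'd expect the vocabulary to land at around 70 characters, depending on how much OCR noise survives the cleaning.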
Hopefully at the end of it all I'll have created the ability to generate an infinite amount of pulp magazine text. Wish me luck. Perhaps next month's article will be a pulp fiction story written by an AI model I've trained myself.