Generative Pre-trained Transformer


Since OpenAI introduced its NLP models, the field of Natural Language Processing has seen unprecedented development. The Generative Pre-trained Transformer (GPT) can solve NLP problems such as question answering, reading comprehension, machine translation and text summarization. Unlike previous techniques, which relied on supervised learning for these problems, the GPT models approach them through unsupervised learning. On top of that, these models perform at almost the same levels of accuracy as well-trained supervised models, and in many cases even better.

In a bid to develop a stronger and sharper language model, OpenAI came up with GPT-2. It was trained on an even larger dataset and had a much larger number of parameters. Another improvement was that GPT-2 could perform task-based processing on the same input data, whereas the input to the original GPT model was not task-specific and the output did not depend on a task given as an input parameter. This task-specific processing laid the foundation for what is called zero-shot task transfer: the model is not fed any examples and is given only a specification of the task; it must infer what is being asked and produce the answer. In the case of GPT-1, the input was re-organized for fine-tuning on each task. For example, to have the model translate a sentence, we would pass in the sentence, put a colon and write French (": French"). The model was expected to understand that this is a translation task and provide the French equivalent.
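The zero-shot setup described above amounts to expressing the task purely through the prompt text. A minimal sketch of such prompt construction is shown below; the template strings are illustrative assumptions, not the exact formats OpenAI used:

```python
def build_zero_shot_prompt(task, text):
    """Format an input for zero-shot task transfer.

    No labelled examples are supplied -- the task is conveyed only
    through the surrounding text, and the model must infer what is
    being asked. Templates here are hypothetical examples.
    """
    templates = {
        "translate_en_fr": "{text}\nFrench:",
        "summarize": "{text}\nTL;DR:",
        "question": "Q: {text}\nA:",
    }
    return templates[task].format(text=text)

print(build_zero_shot_prompt("translate_en_fr", "The cat sat on the mat."))
# The cat sat on the mat.
# French:
```

The model then simply continues the text, and a well-trained language model's most likely continuation is the answer to the implied task.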

The GPT-2 model was trained on a mammoth 40 GB of internet text drawn from over 8 million web pages; the sheer size of the dataset alone was enough to consider the model a direct upgrade over GPT-1. The model architecture was another area of direct advancement: GPT-2 has 1.5 billion parameters, 10 times more than GPT-1, with 48 layers and 1600-dimensional vectors for word embeddings. The model was released in several sizes: 117 million, 345 million, 762 million and 1.5 billion parameters. It was observed that as the number of parameters increases, so does the model's capability to synthesize information from a dataset. GPT-2's vocabulary was scaled up significantly to 50,257 tokens; a batch size of 512 was used, along with an enlarged context window of 1024 tokens.
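The figures above can be sanity-checked with a back-of-the-envelope parameter count. The sketch below uses the standard rough formula for a GPT-style decoder (attention plus feed-forward weights per layer, plus tied input/output embeddings); it is an approximation that ignores biases and layer norms:

```python
def estimate_transformer_params(n_layers, d_model, vocab_size):
    """Rough parameter count for a GPT-style decoder.

    Per layer: ~4*d_model^2 for attention (Q, K, V and output
    projections) plus ~8*d_model^2 for the feed-forward network
    (two matrices of shape d_model x 4*d_model), i.e. ~12*d_model^2.
    Embeddings add vocab_size * d_model (input and output weights
    are tied). Biases and layer-norm parameters are ignored.
    """
    per_layer = 12 * d_model ** 2
    embeddings = vocab_size * d_model
    return n_layers * per_layer + embeddings

# GPT-2's largest configuration: 48 layers, 1600-dim states, 50,257 tokens
print(estimate_transformer_params(48, 1600, 50257))  # ~1.55 billion
```

The estimate lands close to the quoted 1.5 billion, which is a useful check that the layer count, embedding width and vocabulary size are mutually consistent.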

The GPT model is largely based on the original transformer model, so to understand it we need an in-depth look at the transformer architecture. Like other neural sequence-transduction models, it follows an encoder-decoder structure: the encoder maps an input sequence of symbol representations into a sequence of continuous representations, and the decoder then generates the output sequence of symbols. The model is also auto-regressive: at each step of generating text, it takes the previously generated symbols as additional input.
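The auto-regressive loop described above can be sketched in a few lines. The "model" here is a toy stand-in for a trained transformer (any callable mapping a token sequence to next-token logits), used only to show the feedback structure:

```python
import numpy as np

def generate(next_token_logits, prompt, max_new_tokens):
    """Greedy auto-regressive decoding sketch.

    Each generated token is appended to the sequence and fed back
    in as input for the next step -- which is exactly what
    "auto-regressive" means.
    """
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)
        tokens.append(int(np.argmax(logits)))
    return tokens

# Toy "model": always predicts (last token + 1) mod vocab_size
vocab_size = 10
def toy_model(tokens):
    logits = np.zeros(vocab_size)
    logits[(tokens[-1] + 1) % vocab_size] = 1.0
    return logits

print(generate(toy_model, [3], 4))  # [3, 4, 5, 6, 7]
```

A real GPT model replaces `toy_model` with a stack of transformer decoder blocks, and sampling strategies (top-k, nucleus) often replace the greedy `argmax`.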

The transformer model architecture

As stated before, the GPT model largely follows the original transformer. It was trained with masked self-attention heads, 768-dimensional states and 12 attention heads; 3072-dimensional inner states were used for the position-wise feed-forward networks. The Adam optimization scheme was used with a maximum learning rate of 2.5e-4. The model was trained for 100 epochs on minibatches of 64 randomly sampled sequences of 512 tokens each. A byte-pair encoding (BPE) vocabulary with 40,000 merges was used, with a dropout rate of 0.1 applied to the residual, embedding and attention layers for regularization. The Gaussian Error Linear Unit (GELU) was used as the activation function. The ftfy library was used to clean the raw text of the dataset, and after cleaning up punctuation and whitespace, the spaCy library was used for tokenization.
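For reference, the GELU activation mentioned above is commonly computed with the tanh approximation below (the exact form uses the Gaussian error function; both are standard):

```python
import math

def gelu(x):
    """Tanh approximation of the Gaussian Error Linear Unit (GELU),
    the activation function used in GPT. It behaves like the
    identity for large positive inputs and smoothly gates small or
    negative inputs toward zero."""
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

print(gelu(0.0))  # 0.0
print(gelu(5.0))  # very close to 5.0
```

Unlike ReLU, GELU is smooth everywhere, which is often credited with slightly better training behavior in transformers.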

For fine-tuning, the hyperparameter setup used for unsupervised pre-training was reused, with a dropout rate of 0.1 added to the classifier. A learning rate of 6.25e-5 and a batch size of 32 were adopted for most tasks. According to the authors, training for 3 epochs was sufficient in most scenarios.

The GPT-2 model follows the GPT model above, which in turn is largely based on the transformer, with a few modifications. Layer normalization was moved to the input of each sub-block, and an additional layer normalization was added after the final self-attention block. The weights of the residual layers were scaled by a factor of 1/√N, where N is the number of residual layers. The vocabulary was expanded to 50,257 tokens, the context size was increased from GPT-1's 512 tokens to 1024 tokens, and a larger batch size of 512 was used.
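A minimal sketch of these two modifications is shown below, under the assumption that the 1/√N scaling is applied to the residual-path weights at initialization (the layer-norm function shown is the standard formulation, applied at the *input* of each sub-block in GPT-2):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_residual_projection(d_in, d_out, n_layers):
    """Initialise a residual-path weight matrix, scaling the usual
    N(0, 0.02) initialisation by 1/sqrt(N), where N is the number
    of residual layers, as GPT-2 does to keep activations from
    growing with depth."""
    w = rng.normal(0.0, 0.02, size=(d_in, d_out))
    return w / np.sqrt(n_layers)

def layer_norm(x, eps=1e-5):
    """Normalise activations to zero mean and unit variance over
    the feature dimension; GPT-2 applies this at the input of each
    sub-block (pre-LN) rather than after it."""
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(
        x.var(axis=-1, keepdims=True) + eps)
```

Pre-LN placement plus residual scaling makes very deep stacks (48 layers in the largest GPT-2) noticeably more stable to train than the original post-LN arrangement.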

Although the model was evaluated in a zero-shot setting, it performed better than models trained on genre-specific datasets when tested on those same datasets. The GPT-2 model was tested on a diverse range of datasets covering reading comprehension, text summarization, translation and question answering. On the Children's Book Test, which involves identifying (1) common nouns and (2) named entities, the model improved accuracy over the previous state-of-the-art models by 8% and 7% respectively. On the LAMBADA dataset, where the task is to predict the last word of a passage, GPT-2 reduced perplexity massively, from 99.8 down to 8.6.
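Perplexity, the metric quoted above, is just the exponential of the average negative log-likelihood the model assigns to the correct tokens; lower is better. A tiny sketch:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood). A model that
    is always certain (probability 1.0 on the right token) scores 1;
    one guessing uniformly over V tokens scores V."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

print(perplexity([1.0, 1.0]))   # 1.0
print(perplexity([0.1] * 5))    # 10.0 (uniform guessing over 10 tokens)
```

So dropping LAMBADA perplexity from 99.8 to 8.6 roughly means the model went from being about as unsure as a uniform choice over ~100 candidate words to one over fewer than 9.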

An example demonstrating the performance of the model is shown below. The figure presents the answers given by GPT-2 to natural questions, sorted by their probability.

GPT 2 Question answer

Training on a heavyweight dataset with a considerable number of parameters only strengthened the model's capability on language tasks, and GPT-2 was therefore able to overtake several prominent models, even in a zero-shot setting. Moreover, the decline in the model's perplexity never stagnated as parameters were added: it kept decreasing as the model grew. This suggested that model size was not yet a limiting factor, and that models with an even better understanding of language could be possible in the future.

A state-of-the-art model like GPT-2 finds use in several current problems the world is grappling with. Some of them are listed below:

  1. Better speech recognition systems.
  2. Advanced writing assistants.
  3. Better language translation agents.
GPT-2 Facts

But with great power comes even greater responsibility. In an era where fake news, hateful trolling and propaganda are rampant, some experts speculate that, used disproportionately and abusively, this technology could add more fuel to the fire. In other words, AI would not only be able to automate the generation of fake news but could even impersonate a person or an organization with unprecedented accuracy.

January 6, 2021

