Generative Pre-trained Transformer

Since OpenAI's NLP models arrived, the Natural Language Processing field has seen unprecedented development. The Generative Pre-trained Transformer (GPT) can tackle NLP problems such as question answering, reading comprehension, machine translation and text summarization. Unlike previous techniques, which relied on supervised learning to solve these problems, the GPT models approach them as an unsupervised learning problem. On top of that, these models perform at almost the same level of accuracy as well-trained supervised models, and in many cases even surpass them.

In a bid to develop a stronger and sharper language model, OpenAI came up with GPT-2. It was trained on an even larger dataset and had a much larger number of parameters. Another improvement lay in how tasks were handled: the original GPT model's input was not task-specific, so the output it generated did not depend on a task supplied as an input parameter. With GPT-2 came the ability to perform task-conditioned processing on the same input data. This task conditioning laid the foundation for what is called zero-shot task transfer: the model is not fed any examples and is given only a task description. The model is required to infer the intent of the task and produce the answer. With GPT-1, by contrast, the input had to be re-organized for each task during fine-tuning. For example, to have the model translate a given input, we would pass in the sentence, put a colon and write French (": French"), and the model was expected to understand that this is a translation task and produce the French equivalent.
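
As a concrete, hedged illustration of this prompt-style task conditioning, the sketch below uses the Hugging Face transformers library, which post-dates the original release; the "gpt2" checkpoint name and the exact prompt wording are assumptions for demonstration only, not the authors' setup.

```python
# Minimal sketch of prompt-conditioned, zero-shot generation.
# The prompt format is purely illustrative; the paper describes the idea,
# not this exact API or wording.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# The task is expressed only in the input text -- no task-specific head,
# no fine-tuning examples.
prompt = "Translate English to French: The weather is nice today. French:"
inputs = tokenizer(prompt, return_tensors="pt")

output_ids = model.generate(
    **inputs,
    max_new_tokens=20,   # generate a short continuation
    do_sample=False,     # greedy decoding keeps the sketch deterministic
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```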

The GPT-2 model was developed using a mammoth 40GB of internet text drawn from over 8 million web pages; the sheer size of the dataset alone was enough to consider the model a direct upgrade over GPT-1. The model architecture was another area of direct advancement. The largest variant had 1.5 billion parameters, roughly 10 times more than GPT-1, with 48 layers and 1600-dimensional vectors for the word embeddings. The model was released with a varying number of parameters: 117 million, 345 million, 762 million and 1.5 billion. It was observed that as the number of parameters increases, the model's capability to synthesize knowledge from a dataset increases. GPT-2's vocabulary consisted of 50,257 tokens, a significant scaling up. Moreover, a batch size of 512 and an even larger context window of 1024 tokens were used.
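
For reference, the four model sizes and the shared vocabulary, context and batch settings described above can be summarized in a small configuration sketch; the layer counts and embedding widths are the values reported for the GPT-2 family, while the dictionary itself is purely illustrative.

```python
# Reference sketch of the four GPT-2 model sizes discussed above.
# Layer counts and embedding widths follow the GPT-2 paper; parameter
# counts are the approximate figures quoted there.
GPT2_CONFIGS = {
    "117M":  {"n_layer": 12, "d_model": 768},
    "345M":  {"n_layer": 24, "d_model": 1024},
    "762M":  {"n_layer": 36, "d_model": 1280},
    "1.5B":  {"n_layer": 48, "d_model": 1600},
}

# Shared across all sizes (per the description above).
VOCAB_SIZE = 50_257   # byte-pair vocabulary
CONTEXT_SIZE = 1024   # tokens of context
BATCH_SIZE = 512      # sequences per batch
```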

The GPT model was largely based on the original transformer model, so to understand it we need to take an in-depth look at what the transformer model looks like. The model follows the encoder-decoder structure of other neural sequence transduction models: the encoder maps an input sequence of symbol representations into a sequence of continuous representations, and the decoder generates the resulting output sequence of symbols. The model is also auto-regressive: at each step of generating text, it takes the previously generated symbols as additional input.
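
A minimal sketch of what auto-regressive generation means in practice is given below; `model_step` is a hypothetical stand-in for one forward pass of a trained decoder, not an actual API.

```python
# Illustrative sketch of auto-regressive generation: at every step the
# previously generated symbols are fed back in as additional input.
# `model_step` is a hypothetical function that returns a probability
# distribution over the vocabulary given the tokens so far.
def generate(model_step, prompt_tokens, max_new_tokens, eos_token):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = model_step(tokens)                                   # condition on everything so far
        next_token = max(range(len(probs)), key=probs.__getitem__)   # greedy pick
        tokens.append(next_token)                                    # fed back in at the next step
        if next_token == eos_token:
            break
    return tokens
```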

As stated before, the GPT model largely follows the original transformer. The model was trained with masked self-attention heads, 768-dimensional states and 12 attention heads, and 3072-dimensional inner states were used for the position-wise feed-forward networks. The Adam optimization scheme was used with a maximum learning rate of 2.5e-4. The model was trained for 100 epochs on minibatches of 64 randomly sampled sequences of 512 tokens. A bytepair encoding (BPE) vocabulary with 40,000 merges was used, with a rate of 0.1 for the residual, embedding and attention dropouts used for regularization. The Gaussian Error Linear Unit (GELU) was used as the activation function. To clean the raw text in the dataset, the ftfy library was used, and after standardizing punctuation and whitespace, the spaCy library was used for tokenization.
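
The GELU activation mentioned above is commonly computed with a tanh approximation; the snippet below is a small NumPy sketch of that formula, not code taken from the original implementation.

```python
import numpy as np

def gelu(x):
    # Gaussian Error Linear Unit, using the common tanh approximation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))
```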

For fine-tuning, the hyperparameter settings from unsupervised pre-training were reused. A dropout rate of 0.1 was added to the classifier, and a learning rate of 6.25e-5 with a batch size of 32 was adopted for most tasks. According to the authors, training for 3 epochs was sufficient for most scenarios.
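
Summarized as a plain configuration sketch (the values are those reported by the authors; the dictionary itself is only illustrative):

```python
# Fine-tuning recipe described above, restated as a config sketch.
FINETUNE_CONFIG = {
    "classifier_dropout": 0.1,
    "learning_rate": 6.25e-5,
    "batch_size": 32,
    "epochs": 3,   # ~3 epochs sufficed for most tasks, per the authors
}
```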

The GPT-2 model follows the GPT model described above, which in turn is largely based on the transformer, with a few modifications. The layer normalization was moved to the input of each sub-block, and an additional layer normalization was added after the final self-attention block. The weights of the residual layers were scaled at initialization by a factor of 1/√N, where N is the number of residual layers. The vocabulary was also expanded to 50,257 tokens, the context size was increased from the 512 tokens used in GPT-1 to 1024 tokens, and a larger batch size of 512 was used.
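
A hedged PyTorch sketch of the first two changes, pre-sub-block layer normalization and 1/√N residual weight scaling, is shown below; the module name and exact layer choices are illustrative assumptions rather than the released GPT-2 code.

```python
import math
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Sketch of a GPT-2-style transformer block: layer normalization sits at
    the input of each sub-block, and residual projection weights are scaled
    by 1/sqrt(N) at initialization. At the model level, one extra LayerNorm
    is applied after the final block (not shown here)."""

    def __init__(self, d_model: int, n_heads: int, n_layers: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Scale the residual-path output projections by 1/sqrt(N),
        # where N is the total number of residual layers in the model.
        scale = 1.0 / math.sqrt(n_layers)
        with torch.no_grad():
            self.attn.out_proj.weight.mul_(scale)
            self.mlp[2].weight.mul_(scale)

    def forward(self, x, attn_mask=None):
        # Pre-norm: normalize the sub-block's input, then add the residual.
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.ln2(x))
        return x
```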

Even though the model was evaluated in a zero-shot setting, it performed better than models that had been trained on genre-specific datasets when tested on those same datasets. GPT-2 was tested on a diverse range of datasets covering reading comprehension, text summarization, translation and question answering. On the Children's Book Test dataset, which asks the model to identify 1) common nouns and 2) named entities, the model improved accuracy over the previous state-of-the-art models by roughly 8% and 7% respectively. On the LAMBADA dataset, which asks the model to predict the last word of a passage, GPT-2 reduced the perplexity massively, from roughly 99 down to 8.6.
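
Since perplexity drives that last comparison, a quick sketch of how it is computed may help: it is simply the exponential of the average per-token negative log-likelihood, so lower is better.

```python
import math

def perplexity(negative_log_likelihoods):
    """Exponential of the average per-token negative log-likelihood.
    A drop from roughly 99 to 8.6 on LAMBADA means the model is far less
    'surprised' by the held-out final words."""
    return math.exp(sum(negative_log_likelihoods) / len(negative_log_likelihoods))
```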

An example demonstrating the performance of the model is shown below: the figure lists the answers given by GPT-2 on natural questions, sorted by their probability.

Training the model on a heavyweight dataset with a considerable number of parameters only strengthened its capability to handle language tasks. GPT-2 was therefore able to overtake several prominent models, and to do so in a zero-shot setting. Moreover, the decline in the model's perplexity with an increase in parameters never stagnated; it kept decreasing as the number of parameters was increased further. This suggests that the size of the model was never the limiting factor, and that models with an even better understanding of language could be possible in the future.

A state-of-the-art model like GPT-2 finds uses in several current problems that the world is grappling with. Some of them are listed below:

  1. Better speech recognition systems.
  2. Advanced writing assistants.
  3. Better language translation agents.

But with great power comes even greater responsibility. In an era where fake news, hateful trolling and propaganda are the new crime areas, some experts speculate that, if used disproportionately and abusively, the technology will only add more fuel to the fire. In other words, AI will not only be able to automate the generation of fake news but also impersonate any individual or organization to unprecedented degrees of accuracy.
