Improving Language Understanding by Generative Pre-Training – Alec Radford et al. (OpenAI)
Recap of Transformer – Model for Translation
Autoregressive Transformer Decoder in GPT – Model for Language Generation
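The decoder generates text autoregressively: each new token is predicted from the tokens produced so far, then appended and fed back in. A minimal greedy-decoding sketch, where `logits_fn` is a hypothetical stand-in for the trained decoder stack:

```python
import numpy as np

def generate(logits_fn, prompt_ids, n_new):
    """Autoregressive decoding: repeatedly score the sequence so far
    and append the argmax next token. `logits_fn` is a placeholder
    for the trained Transformer decoder (hypothetical interface)."""
    ids = list(prompt_ids)
    for _ in range(n_new):
        logits = logits_fn(ids)      # scores over the vocabulary for the next token
        ids.append(int(np.argmax(logits)))
    return ids
```

Real decoding adds temperature or nucleus sampling instead of argmax, but the feed-back loop is the same.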
Loss Objective for Training – Maximum Likelihood
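Pre-training maximizes the likelihood of each token given its left context, i.e. minimizes the mean negative log-probability of the true next tokens. A minimal sketch of that objective over one sequence:

```python
import numpy as np

def language_modeling_loss(logits, targets):
    """Average negative log-likelihood of the target next tokens.

    logits:  (seq_len, vocab_size) unnormalized next-token scores
    targets: (seq_len,) integer ids of the actual next tokens
    """
    # Log-softmax with the max-subtraction trick for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Maximum likelihood = minimize the mean negative log-probability.
    return -log_probs[np.arange(len(targets)), targets].mean()
```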
Weight Initialization
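The released GPT code initializes embedding and projection weights from a small-variance Gaussian (stddev 0.02). A parameter-free sketch of that scheme:

```python
import numpy as np

def init_weights(shape, std=0.02, rng=None):
    """Gaussian init with a small standard deviation (0.02, the value
    used in OpenAI's finetune-transformer-lm code) so activations start
    near zero and training is stable from the first steps."""
    rng = rng or np.random.default_rng(0)
    return rng.normal(loc=0.0, scale=std, size=shape)
```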
GeLU Activation – Gaussian Error Linear Unit
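GeLU weights each input by the probability mass of a standard normal below it, gelu(x) = x * Phi(x). GPT uses the common tanh approximation, sketched here:

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the Gaussian Error Linear Unit,
    gelu(x) = x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * np.power(x, 3))))
```

Unlike ReLU, GeLU is smooth and lets small negative inputs pass through slightly attenuated rather than zeroed.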
What is Learned Positional Encoding
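Instead of the fixed sinusoidal encodings of the original Transformer, GPT learns a position-embedding table jointly with the token embeddings; the two are summed to form the input representation. A minimal sketch (sizes and the 0.02 init are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_positions, d_model = 100, 16, 8

# Both tables are trained parameters; positions are NOT sinusoidal.
token_emb = rng.normal(0, 0.02, (vocab_size, d_model))
pos_emb   = rng.normal(0, 0.02, (max_positions, d_model))

def embed(token_ids):
    """Input representation: token embedding + learned position embedding."""
    positions = np.arange(len(token_ids))
    return token_emb[token_ids] + pos_emb[positions]
```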
BPE – Byte Pair Encoding
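BPE builds a subword vocabulary by repeatedly merging the most frequent adjacent symbol pair in the corpus. A minimal sketch of one merge step, operating on a word-frequency dictionary:

```python
from collections import Counter

def most_frequent_pair(words):
    """words: dict mapping a tuple-of-symbols word to its corpus count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged
```

Running these two steps in a loop for N iterations yields the N learned merges; frequent words end up as single tokens while rare words decompose into subword pieces.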
Local Attention – Window-Based Attention
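Local attention restricts each position to attend only to a fixed-size window of recent positions, cutting the quadratic attention cost. A sketch of the combined causal-plus-window mask:

```python
import numpy as np

def local_causal_mask(seq_len, window):
    """Boolean mask allowing position i to attend only to positions
    in [i - window + 1, i]: causal AND restricted to a local window."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)
```

Positions where the mask is False would have their attention scores set to -inf before the softmax.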
Memory-Compressed Attention
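Memory-compressed attention shortens the key/value sequence before attention is computed, so each query attends over seq_len / stride entries instead of seq_len. The original formulation uses a learned strided convolution; mean pooling is used below as a parameter-free stand-in:

```python
import numpy as np

def compress_kv(keys, values, stride=3):
    """Shrink the key/value sequence by a factor of `stride`.
    Mean pooling stands in for the learned strided convolution
    of the original method (simplifying assumption)."""
    seq_len, d = keys.shape
    usable = (seq_len // stride) * stride   # drop the ragged tail
    k = keys[:usable].reshape(-1, stride, d).mean(axis=1)
    v = values[:usable].reshape(-1, stride, d).mean(axis=1)
    return k, v  # attention over these costs seq_len * (seq_len / stride)
```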
Supervised Fine-Tuning Loss Objective
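During fine-tuning, the paper adds the language-modeling loss as an auxiliary term to the supervised task loss, L3(C) = L2(C) + lambda * L1(C), which improves generalization and speeds convergence. The combination itself is a one-liner:

```python
def fine_tuning_loss(classification_loss, lm_loss, lm_coef=0.5):
    """Combined fine-tuning objective from the GPT paper:
    L3 = L2 + lambda * L1, with lambda = 0.5 in the paper."""
    return classification_loss + lm_coef * lm_loss
```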
Traversal-Style Input Transformations for Fine-Tuning
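Structured tasks are converted into a single ordered token sequence the pre-trained model can consume, using start, delimiter, and extract tokens. A sketch for textual entailment (the special-token ids here are hypothetical placeholders):

```python
def build_entailment_input(premise_ids, hypothesis_ids,
                           start=0, delim=1, extract=2):
    """Traversal-style transformation for textual entailment:
    <start> premise <delim> hypothesis <extract>.
    The classifier head reads the hidden state at <extract>."""
    return [start] + premise_ids + [delim] + hypothesis_ids + [extract]
```

Similarity and multiple-choice tasks follow the same pattern, producing one linearized sequence per ordering or per candidate answer.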
Code for GPT from OpenAI:
https://github.com/openai/finetune-transformer-lm/blob/master/train.py