Transformer Part II: The implementation & experiments
In this episode, we will go into some details of the causal transformer implementation, along with a few toy experiment results that analyze the transformer in an attempt to understand what drives its performance. We will not go through every single line of the implementation. The code used for illustration can be found at https://github.com/yuansen23aa/GPT-learning/blob/main/basic_gpt.ipynb, which largely follows Andrej Karpathy’s nanoGPT implementation with some modifications. So let’s dig in. We use these terms interchangeably: block size = sequence length, causal attention = masked attention, causal transformer = decoder-only transformer. ...
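To make the "causal attention = masked attention" equivalence concrete, here is a minimal, framework-free sketch (plain Python, not taken from the linked notebook) of the lower-triangular mask that gives causal attention its name: position i may attend only to positions j ≤ i, so no token can look at the future. The name `block_size` is chosen to match the post's terminology (block size = sequence length).

```python
block_size = 4  # sequence length, in the post's terminology

# Lower-triangular causal mask: mask[i][j] == 1 means position i
# is allowed to attend to position j (only j <= i, i.e. the past).
mask = [[1 if j <= i else 0 for j in range(block_size)]
        for i in range(block_size)]

for row in mask:
    print(row)
# → [1, 0, 0, 0]
#   [1, 1, 0, 0]
#   [1, 1, 1, 0]
#   [1, 1, 1, 1]
```

In a real implementation (e.g. nanoGPT), the same triangle is applied by setting the masked-out attention scores to -inf before the softmax, so they contribute zero weight.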