This is a miniaturized implementation of the Llama 3.1 model, trained on the TinyStories dataset.
This implementation includes Grouped Query Attention, Rotary Positional Embeddings (RoPE), and the AdamW optimizer (see src/).
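To illustrate how the first two components fit together, below is a minimal PyTorch sketch of RoPE applied inside grouped-query attention. It is a standalone example, not the code in src/: the function names, the half-split RoPE variant, and the shape conventions are assumptions made for clarity.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary positional embedding (half-split variant, an assumption here).

    x: (batch, seq_len, n_heads, head_dim) with head_dim even.
    """
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    # theta_i = base^(-2i/d): per-channel rotation frequencies
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos = angles.cos()[None, :, None, :]  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def grouped_query_attention(q, k, v):
    """Causal attention where several query heads share each key/value head.

    q: (batch, seq_len, n_heads, head_dim)
    k, v: (batch, seq_len, n_kv_heads, head_dim); n_kv_heads divides n_heads.
    """
    n_heads, n_kv_heads = q.shape[2], k.shape[2]
    # Expand each KV head so every query head has a matching key/value
    k = k.repeat_interleave(n_heads // n_kv_heads, dim=2)
    v = v.repeat_interleave(n_heads // n_kv_heads, dim=2)
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (batch, heads, seq, dim)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    seq_len = q.shape[-2]
    causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
    weights = (scores + causal).softmax(dim=-1)
    return (weights @ v).transpose(1, 2)  # back to (batch, seq, heads, dim)

# Toy usage: 8 query heads sharing 2 KV heads
q = apply_rope(torch.randn(1, 16, 8, 64))
k = apply_rope(torch.randn(1, 16, 2, 64))
v = torch.randn(1, 16, 2, 64)
out = grouped_query_attention(q, k, v)  # (1, 16, 8, 64)
```

Training a model like this would typically pair it with torch.optim.AdamW, the third component listed above.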
Outputs of the model for classification on the SST and CFIMDB datasets are included in outputs/.
This code was developed as part of the 11-711 Advanced NLP class at Carnegie Mellon University. Parts of the codebase were created by the course staff. This code is based on llama2.c by Andrej Karpathy. Parts of the code are also from the transformers library (Apache License 2.0).