Llama 3

Llama 3 is an openly available large language model from Meta AI, released in 8B and 70B parameter sizes. Each size comes in a base version and an instruction-tuned version designed for dialogue applications. Key features include an expanded 128K-token vocabulary for improved multilingual performance, CUDA graph acceleration for up to 4x faster inference, and compatibility with 4-bit quantization so that the models can run on consumer GPUs.

Hardware Requirements

Llama 3 8B: In 16-bit precision the weights alone take about 16GB, so this model is best run on a GPU with more than 16GB of VRAM, such as the 24GB NVIDIA GeForce RTX 3090 or RTX 4090. Quantizing it to 4-bit precision reduces the memory footprint to around 7GB, which lets it run on GPUs with as little as 8GB of VRAM.
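One possible way to load the 8B model with 4-bit weights is the bitsandbytes integration in the Hugging Face transformers library. The sketch below is an illustration under several assumptions, not a prescribed method: it assumes the packages torch, transformers, accelerate and bitsandbytes are installed, that a CUDA GPU is present, and that access to the gated repository named in MODEL_ID has been granted. The helper names four_bit_settings and load_quantized are hypothetical, and NF4 is just one common choice of 4-bit format (GPTQ builds are another route).

```python
# Sketch: load Llama 3 8B Instruct with 4-bit weights through bitsandbytes.
# Assumptions: torch, transformers, accelerate and bitsandbytes are installed,
# a CUDA GPU is available, and access to the gated repo below has been granted.

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed repository name


def four_bit_settings() -> dict:
    """Keyword arguments for transformers.BitsAndBytesConfig (hypothetical helper)."""
    return {
        "load_in_4bit": True,               # store the weights in 4-bit precision
        "bnb_4bit_quant_type": "nf4",       # NF4 is one common 4-bit format
        "bnb_4bit_use_double_quant": True,  # also quantize the quantization constants
    }


def load_quantized(model_id: str = MODEL_ID):
    """Return (tokenizer, model) with the weights quantized at load time."""
    # Heavy imports stay local so the settings can be inspected without them.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.float16,  # run the matmuls in fp16
        **four_bit_settings(),
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=config,
        device_map="auto",  # let accelerate place layers on the available GPU(s)
    )
    return tokenizer, model


# Usage (downloads the weights, so it is left commented out):
# tokenizer, model = load_quantized()
# print(f"{model.get_memory_footprint() / 1e9:.1f} GB")
```

Calling model.get_memory_footprint() after loading is a quick way to check how much memory the quantized weights take on your own hardware.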

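Llama 3 70B: This larger model requires considerably more powerful hardware. In 16-bit precision the weights alone take about 140GB, so the model has to be split across several high-end GPUs such as the NVIDIA A100 or H100. Quantized to 4-bit, the weights shrink to about 35GB, which still exceeds a 32GB card but can fit on a single 80GB GPU. For best performance, multiple high-end GPUs are recommended so that inference can be parallelized across them.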
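The hardware figures above follow from simple arithmetic: parameter count times bytes per parameter. The sketch below shows that calculation for both model sizes. It counts the weights only, and the 20% overhead factor used to estimate a GPU count is a hypothetical placeholder for the KV cache, activations and framework buffers, so real footprints will be higher and will vary with context length and runtime.

```python
import math

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Memory taken by the weights alone, in GB (10^9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def gpus_needed(params_billion: float, precision: str, vram_gb: float,
                overhead: float = 1.2) -> int:
    """Rough GPU count; `overhead` is an assumed allowance, not a measured value."""
    required = weight_memory_gb(params_billion, precision) * overhead
    return math.ceil(required / vram_gb)


for name, size in [("Llama 3 8B", 8), ("Llama 3 70B", 70)]:
    for precision in BYTES_PER_PARAM:
        gb = weight_memory_gb(size, precision)
        print(f"{name} {precision}: ~{gb:.0f} GB of weights")
```

For example, gpus_needed(70, "fp16", 80) suggests three 80GB cards for the unquantized 70B model under these assumptions, while gpus_needed(70, "int4", 80) suggests one.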

Installing the Llama 3 Model

All of the model files can be downloaded from the Hugging Face repositories. Once you have the model you want, you can deploy it with Hugging Face Inference Endpoints or run it locally in a compatible LLM manager such as LM Studio.
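As a rough illustration of the download step, the files can be fetched programmatically with the huggingface_hub package. The sketch below assumes that package is installed, that the repositories follow the meta-llama/Meta-Llama-3-... naming, and that you have accepted the license and hold an access token, since the repositories are gated. The helper names repo_id and download, and the models target directory, are hypothetical.

```python
# Sketch: fetch Llama 3 weights from the Hugging Face Hub.
# Assumptions: `pip install huggingface_hub`, the repo naming scheme below,
# and an access token for an account that has accepted the model license.

SIZES = {"8B", "70B"}


def repo_id(size: str, instruct: bool = False) -> str:
    """Build the assumed Hub repository name for a given size and variant."""
    if size not in SIZES:
        raise ValueError(f"size must be one of {sorted(SIZES)}, got {size!r}")
    suffix = "-Instruct" if instruct else ""
    return f"meta-llama/Meta-Llama-3-{size}{suffix}"


def download(size: str, instruct: bool = False,
             target_dir: str = "models", token=None) -> str:
    """Download every file in the repository and return the local folder path."""
    # Imported here so repo_id() works even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    rid = repo_id(size, instruct)
    local_dir = f"{target_dir}/{rid.split('/')[-1]}"
    return snapshot_download(repo_id=rid, local_dir=local_dir, token=token)


# Usage (tens of GB of traffic, so it is left commented out):
# path = download("8B", instruct=True, token="hf_...")
```

The returned folder can then be loaded with a library such as transformers. Desktop managers such as LM Studio usually provide their own model browser, so this step is mainly useful for scripted or server-side setups.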

Llama 3 Demo

A demo of the Llama 3 70B Instruct model is also available on HuggingChat at https://huggingface.co/chat/. Just make sure it is selected as the current model.