lhl 9 hours ago

For inference, if you have a supported card (or, on Linux, an architecture you can probably spoof with HSA_OVERRIDE_GFX_VERSION), then you can probably run anything with (upstream) PyTorch and transformers. Also, compiling llama.cpp has been pretty trouble-free for me for at least a year.
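(For anyone who hasn't used the override: it's just an environment variable. A minimal sketch, assuming an RDNA3-ish card being presented as gfx1100 and a ROCm build of PyTorch:)

    # illustrative only: present the GPU to ROCm as gfx1100
    export HSA_OVERRIDE_GFX_VERSION=11.0.0
    # ROCm builds of PyTorch expose the GPU through the torch.cuda API
    python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"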

(If you are on Windows, there is usually a win-hip binary of llama.cpp in the project's releases, or, if things totally refuse to work, you can use the Vulkan build as a (less performant) fallback.)

Having more options can't hurt, but ROCm 5.4.2 is almost 2 years old, and things have come a long way since then, so I'm curious about this being published freshly today, in October 2024.

BTW, I recently went through and updated my compatibility doc (focused on RDNA3) w/ ROCm 6.2 for those interested. A lot has changed just in the past few months (upstream bitsandbytes, upstream xformers, and Triton-based Flash Attention): https://llm-tracker.info/howto/AMD-GPUs

tcdent 3 hours ago

The rise of generated slop ml libraries is staggering.

This library is 50% print statements. And where it does branch, it doesn't even need to.

Defines two environment variables and sets two flags on torch.

rglullis 18 minutes ago

So, this is all I needed to add to NixOS workstation:

     hardware.graphics.enable = true;

     services.ollama = {
       enable = true;
       acceleration = "rocm";
       environmentVariables = {
         ROC_ENABLE_PRE_VEGA = "1";
         HSA_OVERRIDE_GFX_VERSION = "11.0.0";
       };
     };
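(Then it's the usual rebuild-and-run workflow; the model name below is just an example:)

     sudo nixos-rebuild switch
     ollama run llama3.1:8b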
slavik81 3 hours ago

On Ubuntu 24.04 (and Debian Unstable¹), the OS-provided packages should be able to get llama.cpp running on ROCm on just about any discrete AMD GPU from Vega onwards²³⁴. No docker or HSA_OVERRIDE_GFX_VERSION required. The performance might not be ideal in every case⁵, but I've tested a wide variety of cards:

    # install dependencies
    sudo apt -y update
    sudo apt -y upgrade
    sudo apt -y install git wget hipcc libhipblas-dev librocblas-dev cmake build-essential

    # ensure you have permissions by adding yourself to the video and render groups
    sudo usermod -aG video,render $USER
    # log out and then log back in to apply the group changes
    # you can run `rocminfo` and look for your GPU in the output to check everything is working thus far

    # download a model, build llama.cpp, and run it
    wget https://huggingface.co/TheBloke/dolphin-2.2.1-mistral-7B-GGUF/resolve/main/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf?download=true -O dolphin-2.2.1-mistral-7b.Q5_K_M.gguf
    git clone https://github.com/ggerganov/llama.cpp.git
    cd llama.cpp
    git checkout b3267
    HIPCXX=clang-17 cmake -H. -Bbuild -DGGML_HIPBLAS=ON -DCMAKE_HIP_ARCHITECTURES="gfx803;gfx900;gfx906;gfx908;gfx90a;gfx1010;gfx1030;gfx1100;gfx1101;gfx1102" -DCMAKE_BUILD_TYPE=Release
    make -j16 -C build
    build/bin/llama-cli -ngl 32 --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -m ../dolphin-2.2.1-mistral-7b.Q5_K_M.gguf --prompt "Once upon a time"
I'd suggest RDNA 3, MI200 and MI300 users should probably use the AMD-provided ROCm packages for improved performance. Users that need PyTorch should also use the AMD-provided ROCm packages, as PyTorch has some dependencies that are not available from the system packages. Still, you can't beat the ease of installation or the compatibility with older hardware provided by the OS packages.

¹ https://lists.debian.org/debian-ai/2024/07/msg00002.html
² Not including MI300 because that released too close to the Ubuntu 24.04 launch.
³ Pre-Vega architectures might work, but have known bugs for some applications.
⁴ Vega and RDNA 2 APUs might work with Linux 6.10+ installed. I'm in the process of testing that.
⁵ The version of rocBLAS that comes with Ubuntu 24.04 is a bit old and therefore lacks some optimizations for RDNA 3. It's also missing some MI200 optimizations.

  • mindcrime 42 minutes ago

    I was able to install (AMD provided) ROCm and Ollama on Ubuntu 22.04.5 with an RX 7900 XTX with no real problems to speak of, and I can execute LLMs using Ollama on ROCm just fine. Take that FWIW.

  • ekianjo 10 minutes ago

    are there AMD cards with more than 24GB VRAM on the market right now at consumer friendly prices?

a2128 18 hours ago

It seems to use a two-year-old version of ROCm (5.4.2), which I doubt would support my RX 7900 XTX. I personally found it easiest to just use the latest `rocm/pytorch` image and run what I need from there.
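(For reference, a minimal sketch of that workflow; the device and group flags are the usual ones for giving a container access to AMD GPUs, and the tag is whatever is current:)

    docker run -it --rm \
      --device=/dev/kfd --device=/dev/dri \
      --group-add video --ipc=host \
      rocm/pytorch:latest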

  • slavik81 4 hours ago

    The RX 7900 XTX (gfx1100) was first enabled in the math libraries (e.g. rocBLAS) for ROCm 5.4, but I don't think the AI libraries (e.g. MIOpen) had it enabled until ROCm 5.5. I believe the performance improved significantly in later releases, as well.

tomxor 9 hours ago

I almost tried to install AMD rocm a while ago after discovering the simplicity of llamafile.

  sudo apt install rocm

  Summary:
    Upgrading: 0, Installing: 203, Removing: 0, Not Upgrading: 0
    Download size: 2,369 MB / 2,371 MB
    Space needed: 35.7 GB / 822 GB available
I don't understand how 36 GB can be justified for what amounts to a GPU driver.
  • atq2119 7 hours ago

    So no doubt modern software is ridiculously bloated, but ROCm isn't just a GPU driver. It includes all sorts of tools and libraries as well.

    By comparison, if you go and download the CUDA toolkit as a single file, you get a download file that's over 4GB, so quite a bit larger than the download size you quoted. I haven't checked how much that expands to (it seems the ROCm install has a lot of redundancy given how well it compresses), but the point is, you get something that seems insanely large either way.

    • tomxor 6 hours ago

      I suspected that, but any binaries being that large just seems wrong. I mean, the whole thing is 35 times larger than my entire OS install.

      Do you know what is included in ROCm that could be so big? Does it include training datasets or something?

      • lhl 3 hours ago

        Here are the big files in my /opt/rocm/lib, which is most of it:

          4.8G hipblaslt
          1.6G libdevice_conv_operations.a
          2.0G libdevice_gemm_operations.a
          1.4G libMIOpen.so.1.0.60200
          1.1G librocblas.so.4.2.60200
          1.6G librocsolver.so.0.2.60200
          1.4G librocsparse.so.1.0.60200
          1.5G llvm
          3.5G rocblas
          2.0G rocfft
        
        To pick on the biggest one: hipblaslt is "a library that provides general matrix-matrix operations. It has a flexible API that extends functionalities beyond a traditional BLAS library, such as adding flexibility to matrix data layouts, input types, compute types, and algorithmic implementations and heuristics." https://github.com/ROCm/hipBLASLt

        These are mostly GPU kernels that by themselves aren't so big, but they get built for every single operation x every single supported graphics architecture, e.g.:

          304K TensileLibrary_SS_SS_UA_Type_SS_Contraction_l_Ailk_Bjlk_Cijk_Dijk_gfx942.co
          24K TensileLibrary_SS_SS_UA_Type_SS_Contraction_l_Ailk_Bjlk_Cijk_Dijk_gfx942.dat
          240K TensileLibrary_SS_SS_UA_Type_SS_Contraction_l_Ailk_Bljk_Cijk_Dijk_gfx942.co
          20K TensileLibrary_SS_SS_UA_Type_SS_Contraction_l_Ailk_Bljk_Cijk_Dijk_gfx942.dat
          344K TensileLibrary_SS_SS_UA_Type_SS_Contraction_l_Alik_Bljk_Cijk_Dijk_gfx942.co
          24K TensileLibrary_SS_SS_UA_Type_SS_Contraction_l_Alik_Bljk_Cijk_Dijk_gfx942.dat
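        (A listing like the one above is easy to reproduce yourself; something along these lines, assuming a default /opt/rocm install:)

          du -sh /opt/rocm/lib/* | sort -h | tail -20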
      • skirmish 5 hours ago

        My understanding is that ROCm contains all included kernels for each supported architecture, so it would have (made up):

          -- matrix multiply 2048x2048 for Navi 31,
          -- same for Navi 32,
          -- same for Navi 33,
          -- same for Navi 21,
          -- same for Navi 22,
          -- same for Navi 23,
          -- same for Navi 24, etc.
          -- matrix multiply 4096x4096 for Navi 31,
          -- ...
        • slavik81 2 hours ago

          Correct. Although, you wouldn't find Navi 22, 23 or 24 in the list because those particular architectures are not supported. Instead, you'd see Vega 10, Vega 20, Arcturus, Aldebaran, Aqua Vanjaram and sometimes Polaris.

          We're working on a few different strategies to reduce the binary size. It will get worse before it gets better, but I think you can expect significant improvements in the future. There are lots of ways to slim the libraries down.

  • steeve 5 hours ago

    You can look us up at https://github.com/zml/zml; we fix that.

    • andyferris 4 hours ago

      Wait, looking at that link I don't see how it avoids downloading CUDA or ROCM. Do you use MLIR to compile to GPU without using the vendor provided tooling at all?

  • burnte 5 hours ago

    GPU drivers are complete OSes that run on the GPU now.

  • greenavocado 8 hours ago

    It's not just you; AMD manages to completely shit-up the Linux kernel with their drivers: https://www.phoronix.com/news/AMD-5-Million-Lines

    • striking 8 hours ago

      > Of course, much of that is auto-generated header files... A large portion of it with AMD continuing to introduce new auto-generated header files with each new generation/version of a given block. These verbose header files has been AMD's alternative to creating exhaustive public documentation on their GPUs that they were once known for.

    • anthk 8 hours ago

      OpenBSD, too.

kn100 28 minutes ago

Sad that RDNA2 cards aren't supported. Not even that old!

ekianjo 11 minutes ago

Does it work with GGUF files?

freeqaz 9 hours ago

What's the best bang-for-your-buck AMD GPU these days? I just bought 2 used 3090s for $750ish refurb'd on eBay. Curious what others are using for running LLMs locally.

  • 3abiton 8 hours ago

    Personal experience: it's not even worth it. AMD (i)GPU support breaks with every PyTorch, ROCm, xformers, or ollama update. You'll sleep more comfortably at night.

    • fazkan 8 hours ago

      That's our observation, which is why we wrote the scripts ourselves; that way we can at least control the dependencies.

    • sangnoir 4 hours ago

      When dealing with ROCm, it's critical that once you have a working configuration, you freeze everything in place (except your application). Docker is one way to achieve this if your host machine is subject to kernel or package updates.
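      (A minimal sketch of that: pin the exact image you validated instead of tracking :latest; the tag below is hypothetical, substitute whatever you verified works:)

        # hypothetical pinned tag; stop tracking :latest once you have a known-good setup
        docker run -it --device=/dev/kfd --device=/dev/dri --group-add video \
          rocm/pytorch:rocm6.1_ubuntu22.04_py3.10_pytorch_2.1.2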

  • elorant 7 hours ago

    Probably the 7900 XTX. $1k for 24GB of VRAM.

    • freeqaz 4 hours ago

      That's about the same price as a 3090 and it's also 24GB. Are they faster at inference?

leonheld 9 hours ago

People use "Docker-based" all the time but what they mean is that they ship $SOFTWARE in a Docker image.

"Docker-based" reads, to me, as if you were doing Inference on AMD cards with Docker somehow, which doesn't make sense.

  • a_vanderbilt 9 hours ago

    You can do inference from a Docker container, just as you'd do it with NVidia. OpenAI runs a K8s cluster doing this. I have personally only worked with NVidia, but the docs are present for AMD too.

    Like anything AI and AMD, you need the right card(s) and ROCm version, along with sheer dumb luck, to get it working. AMD has Docker images with ROCm support, so you could merge your app in with that as the base layer. Just pass through the GPU to the container and you should get it working.

    It might just be the software in a Docker image, but it removes a variable I would otherwise have to worry about during deployment. It literally is inference on AMD with Docker, if that's what you meant.

  • mikepurvis 9 hours ago

    Docker became part of the standard toolkit for ML because deploying Python that links to underlying system libraries is a gong show unless you ship that layer too.

    • tannhaeuser 5 hours ago

      Even Docker doesn't guarantee reproducible results, due to sensitivity to host GPU drivers and to ML frontends/integrations bringing their own "helpful" newbie-friendly all-in-one dependency checks and updater services.

  • jeffhuys 9 hours ago

    Why doesn’t it make sense? You can talk to devices from a Docker container - you just have to attach it.

  • bongodongobob 4 hours ago

    Yeah, they're using docker to wrap up the software packages, which is what Docker is used for. I don't understand why that confuses you or what you think Docker is otherwise used for.

phkahler 8 hours ago

Does it work with an APU? I just put 64GB in my system and gonna drop in a 5700G. Will that be enough? SFF inference if so.

ashirviskas 9 hours ago

I'm all for having more open source projects, but I do not see how it can be useful in this ecosystem, especially for people with newer AMD GPUs (not supported in this project) which are already supported in most popular projects?

  • fazkan 8 hours ago

    Just something that we found helpful; support for new architectures is just a package update. This is more of a cookie cutter.

stefan_ 8 hours ago

This seems to be some AI generated wrapper around a wrapper of a wrapper.

> # Other AMD-specific optimizations can be added here

> # For example, you might want to set specific flags or use AMD-optimized libraries

What are we doing here, then?

  • fazkan 8 hours ago

    It's just a big requirements file and a Dockerfile :) The rest are mostly helper scripts.

white_waluigi 9 hours ago

Isn't this just a wrapper for huggingface-transformers?

  • fazkan 8 hours ago

    Yes, but it handles all the dependencies for AMD architectures. So technically it's just a requirements file :). Author of the repo above.

dhruvdh 9 hours ago

Why would you use this over vLLM?

  • fazkan 8 hours ago

    We have vLLM in certain production instances; it is a pain for most non-NVIDIA architectures. A bit of digging around and we realized that most of it is just a wrapper on top of PyTorch function calls. If we can do away with the batch processing that vLLM supports, we're good; that's what we did here.
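    (For what it's worth, a minimal sketch of what such a script can reduce to with plain transformers; the model name is only an example, and device=0 assumes a ROCm build of PyTorch:)

      python3 -c "from transformers import pipeline; pipe = pipeline('text-generation', model='gpt2', device=0); print(pipe('Once upon a time', max_new_tokens=64)[0]['generated_text'])"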

    • dhruvdh 6 hours ago

      Batching is how you get ~350 tokens/sec on Qwen 14b on vLLM (7900XTX). By running 15 requests at once.

      Also, there is a Dockerfile.rocm at the root of vLLM's repo. How is it a pain?

      • fazkan 5 hours ago

        Driver mismatch issues. We mostly use publicly available instances, so the drivers change as the instances change, according to their base image. Not saying it won't work, but it was more painful to figure out vLLM than to write a simple inference script and do it ourselves.

BaculumMeumEst 7 hours ago

how about they follow up 7900 XTX with a card that actually has some VRAM

  • skirmish 5 hours ago

    They prefer you pay $3,600 for AMD Radeon Pro W7900, 48GB VRAM.

    • anthonix1 an hour ago

      ... which also has a much lower power cap

khurdula 9 hours ago

Are we supposed to use AMD GPUs for this to work? Or does it work on any GPU?

  • karamanolev 9 hours ago

    > This project provides a Docker-based inference engine for running Large Language Models (LLMs) on AMD GPUs.

    First sentence of the README in the repo. Was it somehow unclear?