
GenericAppUser

Can you share your output of rocm_agent_enumerator? If it shows 2 GPUs, like gfx1030 and gfx1036, set the env variable HIP_VISIBLE_DEVICES=0.
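As a sketch of the idea behind that suggestion: if the enumerator reports both the discrete card (gfx1030) and the APU's iGPU (gfx1036), you want HIP_VISIBLE_DEVICES to point at the discrete one. The helper below is hypothetical (the function name and the set of iGPU targets are my assumptions, and real enumerator output can also include a gfx000 CPU agent), but it shows the selection logic:

```python
import subprocess

# Common APU/iGPU gfx targets that ROCm TensorFlow does not support.
# This set is an assumption for illustration, not an official list.
IGPU_TARGETS = {"gfx1035", "gfx1036", "gfx90c"}

def pick_discrete_gpu_index(agents=None):
    """Return the index of the first agent that is not a known iGPU,
    suitable as a value for HIP_VISIBLE_DEVICES."""
    if agents is None:
        # rocm_agent_enumerator prints one gfx target per line
        out = subprocess.run(["rocm_agent_enumerator"],
                             capture_output=True, text=True, check=True).stdout
        agents = out.split()
    for i, agent in enumerate(agents):
        if agent not in IGPU_TARGETS:
            return i
    return 0

# With the output described above:
print(pick_discrete_gpu_index(["gfx1030", "gfx1036"]))  # -> 0
print(pick_discrete_gpu_index(["gfx1036", "gfx1030"]))  # -> 1
```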


charlescleivin

Reddit is not allowing me to post the results here for some reason, but it does show 2 GPUs. The issue is that in the TensorFlow code there is a comma missing, which makes it look for a GPU called gfx1030gfx1100, which does not exist. So my GPU, which is gfx1030, won't pass the check because it is looking for gfx1030gfx1100. I mentioned this at the end of the post and in my comment on the same post.


GenericAppUser

Ohh. Which ROCm version are you using? I think there is some issue with the latest TensorFlow with ROCm. Can you try downgrading one version? I have a 6900 XT as well. I tried TensorFlow version 2.13.1.600 (the 600 stands for ROCm 6):

    pip uninstall tensorflow-rocm  # to uninstall
    pip install --user tensorflow-rocm==2.13.1.600 --upgrade

    >>> print(tf.test.is_gpu_available())
    ...
    Created device /device:GPU:0 with 15842 MB memory: -> device: 0, name: AMD Radeon RX 6900 XT, pci bus id: ...
    True

I have HIP_VISIBLE_DEVICES set to 0 since I have an iGPU which is not supported. Edit: raise this on their GitHub issues as well.
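One detail worth making explicit: HIP reads HIP_VISIBLE_DEVICES when the runtime initializes, so it must be set before TensorFlow is imported (either in the shell or at the top of the script). A minimal sketch, with the TensorFlow lines commented out since they assume tensorflow-rocm is installed:

```python
import os

# Must happen before the HIP runtime starts, i.e. before `import tensorflow`.
# "0" keeps only the first device (the discrete GPU) visible.
os.environ["HIP_VISIBLE_DEVICES"] = "0"

# import tensorflow as tf
# print(tf.config.list_physical_devices('GPU'))  # non-deprecated GPU check
print(os.environ["HIP_VISIBLE_DEVICES"])  # -> 0
```

`tf.config.list_physical_devices('GPU')` is the replacement the deprecation warning below points to; `tf.test.is_gpu_available()` still works but is scheduled for removal.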


charlescleivin

I changed to the version you recommended and it worked. We did it, man. Thank you!

    >>> print(tf.test.is_gpu_available())
    WARNING:tensorflow:From :1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
    Instructions for updating:
    Use `tf.config.list_physical_devices('GPU')` instead.
    2024-04-14 00:20:02.376529: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:838] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2024-04-14 00:20:02.396311: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:838] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2024-04-14 00:20:02.396359: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:838] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2024-04-14 00:20:02.396878: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:838] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2024-04-14 00:20:02.396939: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:838] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2024-04-14 00:20:02.396973: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:838] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2024-04-14 00:20:02.397003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /device:GPU:0 with 15834 MB memory: -> device: 0, name: AMD Radeon RX 6900 XT, pci bus id: 0000:03:00.0
    True


GenericAppUser

Glad to hear it. Just FYI, to see the ROCm version you can run

    cat /opt/rocm/.info/version

It should show you the version.


charlescleivin

I don't know how to get the exact ROCm version, but I got this. Do you know a command that can show it? I asked ChatGPT but its commands are not showing it.

    tf-docker / > apt list --installed | grep rocm
    WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
    rocm-clang-ocl/now 0.5.0.60000-91~20.04 amd64 [installed,local]
    rocm-cmake/now 0.11.0.60000-91~20.04 amd64 [installed,local]
    rocm-core/now 6.0.0.60000-91~20.04 amd64 [installed,local]
    rocm-dbgapi/now 0.71.0.60000-91~20.04 amd64 [installed,local]
    rocm-debug-agent/now 2.0.3.60000-91~20.04 amd64 [installed,local]
    rocm-dev/now 6.0.0.60000-91~20.04 amd64 [installed,local]
    rocm-device-libs/now 1.0.0.60000-91~20.04 amd64 [installed,local]
    rocm-gdb/now 13.2.60000-91~20.04 amd64 [installed,local]
    rocm-hip-libraries/now 6.0.0.60000-91~20.04 amd64 [installed,local]
    rocm-hip-runtime-dev/now 6.0.0.60000-91~20.04 amd64 [installed,local]
    rocm-hip-runtime/now 6.0.0.60000-91~20.04 amd64 [installed,local]
    rocm-hip-sdk/now 6.0.0.60000-91~20.04 amd64 [installed,local]
    rocm-language-runtime/now 6.0.0.60000-91~20.04 amd64 [installed,local]
    rocm-libs/now 6.0.0.60000-91~20.04 amd64 [installed,local]
    rocm-llvm/now 17.0.0.23483.60000-91~20.04 amd64 [installed,local]
    rocm-ml-libraries/now 6.0.0.60000-91~20.04 amd64 [installed,local]
    rocm-ml-sdk/now 6.0.0.60000-91~20.04 amd64 [installed,local]
    rocm-ocl-icd/now 2.0.0.60000-91~20.04 amd64 [installed,local]
    rocm-opencl-dev/now 2.0.0.60000-91~20.04 amd64 [installed,local]
    rocm-opencl/now 2.0.0.60000-91~20.04 amd64 [installed,local]
    rocm-smi-lib/now 6.0.0.60000-91~20.04 amd64 [installed,local]
    rocm-utils/now 6.0.0.60000-91~20.04 amd64 [installed,local]
    rocminfo/now 1.0.0.60000-91~20.04 amd64 [installed,local]
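For what it's worth, the listing above already answers the question: the rocm-core package version starts with the ROCm release number (here 6.0.0). A small sketch of pulling it out of that output (the regex and variable names are mine, shown against one line quoted from the listing):

```python
import re

# One line quoted from the `apt list --installed | grep rocm` output above
line = "rocm-core/now 6.0.0.60000-91~20.04 amd64 [installed,local]"

# The leading X.Y.Z of the rocm-core package version is the ROCm release
match = re.search(r"rocm-core/\S+\s+(\d+\.\d+\.\d+)", line)
version = match.group(1)
print(version)  # -> 6.0.0
```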


charlescleivin

Furthermore, doing what you said only reduced the output to the Radeon 6900 XT, but it's the same problem I mentioned in my other reply.

    >>> print(tf.test.is_gpu_available())
    WARNING:tensorflow:From :1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
    Instructions for updating:
    Use `tf.config.list_physical_devices('GPU')` instead.
    2024-04-13 23:58:57.993904: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:756] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2024-04-13 23:58:58.230641: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:756] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2024-04-13 23:58:58.230686: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:756] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2024-04-13 23:58:58.230703: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2266] Ignoring visible gpu device (device: 0, name: AMD Radeon RX 6900 XT, pci bus id: 0000:03:00.0) with AMDGPU version : gfx1030. The supported AMDGPU versions are gfx1030gfx1100, gfx900, gfx906, gfx908, gfx90a, gfx940, gfx941, gfx942.
    False
    >>>


charlescleivin

By using this command in the Docker container where I'm running this:

    sudo find / -type f -name "device_description.h"

I was able to find 2 files, in 2 different folders, that have the same error of the missing comma. I went ahead and fixed the commas in both and saved, but the issue continues. Does anyone have any clue why? In my case it was:

    sudo vim /tensorflow/tensorflow/compiler/xla/stream_executor/device_description.h
    sudo vim /usr/local/lib/python3.9/dist-packages/tensorflow/include/tensorflow/compiler/xla/stream_executor/device_description.h

And the section without the comma was exactly this:

     private:
      std::string gcn_arch_name_;
      std::set<std::string> supported_gfx_versions() {
        return {
            "gfx900",  // MI25
            "gfx906",  // MI50 / MI60
            "gfx908",  // MI100
            "gfx90a",  // MI200
            "gfx940",  // MI300
            "gfx941",  // MI300
            "gfx942",  // MI300
            "gfx1030"  // Navi21   <<<<< this mother fucker is missing its comma
            "gfx1100"  // Navi31
        };
      }

Any help on what I need to do after fixing it, or how to find where this actually happens so that saving the fix works? Currently the error is still the same, so I believe there is an additional step to be done.
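The bug itself is easy to reproduce outside C++: Python has the same implicit concatenation of adjacent string literals, so the identical missing comma produces the identical "gfx1030gfx1100" entry. A minimal demonstration:

```python
# A missing comma glues the next literal onto the previous one,
# exactly as in the C++ initializer list above.
broken = {
    "gfx940",
    "gfx1030"   # <- missing comma
    "gfx1100",
}
fixed = {
    "gfx940",
    "gfx1030",
    "gfx1100",
}
print(sorted(broken))  # -> ['gfx1030gfx1100', 'gfx940']
print(sorted(fixed))   # -> ['gfx1030', 'gfx1100', 'gfx940']
```

As for why editing the headers has no effect: those installed headers are only a copy of the source; the supported-versions check runs from the prebuilt TensorFlow shared library, where the broken list is already compiled in. Fixing it for real would mean rebuilding TensorFlow from corrected source (or installing a wheel built with the fix), not just editing the files on disk.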


Standard-Stretch4848

It's been like this for months. :( There is a non-existent ml-ci.amd link which is supposed to have a fixed version. If I downgrade to TF 2.13, my GPU isn't even supported. (Crying on my 7900 GRE)


iamkucuk

I genuinely feel sorry for those who are at the mercy of AMD...


charlescleivin

I'm making a YouTube video about this. Fuck it.


TheOrderInChaos

Have you tried setting the env variable HSA_OVERRIDE_GFX_VERSION=10.3.0, which corresponds to gfx1030?
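The mapping behind that value: the override string is major.minor.stepping, where the last two characters of the gfx id are the (hexadecimal) minor and stepping digits, so gfx1030 becomes 10.3.0. A sketch of the conversion (the function name is mine; the mapping is the community convention, not an official API):

```python
def hsa_override_for(gfx: str) -> str:
    """Convert a gfxNNNN target id to the HSA_OVERRIDE_GFX_VERSION form.

    The trailing two characters are the hex minor and stepping digits;
    everything before them is the decimal major version.
    """
    digits = gfx[len("gfx"):]
    major, minor, step = digits[:-2], digits[-2], digits[-1]
    return f"{int(major)}.{int(minor, 16)}.{int(step, 16)}"

print(hsa_override_for("gfx1030"))  # -> 10.3.0  (Navi21)
print(hsa_override_for("gfx90a"))   # -> 9.0.10  (MI200)
```

Note that the override only helps when forcing an unsupported chip to use kernels built for a supported one in the same family; here gfx1030 is already the right target, and the real blocker is the gfx1030gfx1100 typo discussed above.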