# harvester
r
Is the target pcidevice enabled and added to the VM? Do you have any screenshots or support bundle files we can investigate further?
r
Yes, it isn't a question of whether the pcidevice is enabled; both machines have it enabled, and we can run nvidia-smi in the VM and see the graphics card info. It's just that CUDA programs are unable to run on the machine without native graphics support. I have written a detailed description in the bundle.
r
I saw the issue description. Does the “native graph support from CPU” mean the CPU has an integrated graphics processing unit? Sorry I’m not familiar with the technical term.
r
yes
👌 1
I forgot I could just use the term "integrated graphics"
r
Could you show us the nvidia-smi output? And may I re-post your issue description here (it describes more details)?
p
Hi @rhythmic-painter-76998 So nvidia-smi in the VM returns the card, but the CUDA program can't use it? I have little experience with CUDA programming; do you see any errors when calling the functions that get or detect the cards?
r
Copy code
Fri Jul 28 08:34:50 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090         On | 00000000:08:00.0 Off |                  N/A |
|  0%   50C    P8               19W / 420W|      1MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
I am not seeing any errors from the program. I was running the gpu-burn process (https://github.com/wilicc/gpu-burn). This code only runs on machine B (no integrated graphics + external graphics card) when I install Ubuntu natively on it, but it can NOT work when I create a VM from Harvester (with PCI devices enabled).
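(For reference, a quick way to sanity-check the passthrough from inside the VM is to confirm which kernel driver owns the card and whether the NVIDIA driver logged any errors at init. The 08:00.0 address comes from the nvidia-smi output above; these are generic commands, not something from this thread.)
Copy code
# Inside the guest VM: confirm the kernel driver bound to the passed-through GPU
lspci -nnk -s 08:00.0
# Look for NVIDIA driver (NVRM) initialization errors in the guest kernel log
sudo dmesg | grep -iE 'nvrm|nvidia'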
p
So does running this command return the card?
Copy code
gpu_burn -l
r
Copy code
docker run --rm --gpus all gpu_burn -l

==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
<https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license>

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
This command is fine, but when running the burn test, no process info is shown in nvidia-smi.
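(One way to narrow this down, assuming the gpu_burn image is built from the repo above and takes a run length in seconds as its argument: start a short, bounded burn and watch the card from a second shell to see whether any process or utilization ever appears.)
Copy code
# Terminal 1: run a short, bounded burn (30 seconds)
docker run --rm --gpus all gpu_burn 30
# Terminal 2: refresh nvidia-smi every second and watch for the process / utilization
watch -n 1 nvidia-smi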
p
It seems gpu_burn -l doesn't return any card either
r
Let me do a cross check. I have one that works and one that doesn't. 😞
Interesting, both show the same result with the -l option
and running the burn test, the process hangs
Copy code
docker run --rm --gpus all gpu_burn

==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
<https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license>

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-c3865714-8aa0-f435-8ef3-9cde617bcb7b)



^C^C^C
👀 1
I wonder whether the integrated graphics plays any role here
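(Since the burn hangs without printing anything, it might help to surface the actual CUDA error code. A minimal sketch, assuming the CUDA toolkit / nvcc is available in the VM or in a CUDA devel container; not something from this thread.)
Copy code
# Minimal CUDA init check: prints the runtime error instead of hanging
cat > /tmp/check_cuda.cu <<'EOF'
#include <cstdio>
#include <cuda_runtime.h>
int main() {
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    printf("cudaGetDeviceCount: %s, devices: %d\n", cudaGetErrorString(err), n);
    return err == cudaSuccess ? 0 : 1;
}
EOF
nvcc /tmp/check_cuda.cu -o /tmp/check_cuda && /tmp/check_cuda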
r
Was the problematic VM running when you generated the support bundle file? I only saw one VM running with PCIDevice attached on harvester01 🤔
r
Oh, let me upload another bundle. harvester01 and harvester02-5900x are two nodes joined in the same cluster, and harvester02-5900x is the one that produces the problematic VM.
g
I don't have this GPU to reproduce the issue, but I am wondering if it could be related to this? https://www.reddit.com/r/VFIO/comments/pbgsg4/solved_rtx_3090_gpu_passthrough_just_displays_a/
r
I also found this article; the device is not enabled.
g
Any chance we can get the nvidia-smi -q output?
And when you run it in Ubuntu, as you mentioned, do you pass it through to a VM or run it natively on the host?
r
No, for this particular machine I did two tests: 1. running Ubuntu natively, where CUDA works; 2. running it as a Harvester node, creating a VM and passing the GPU into it, where CUDA did not work.
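(One more data point that might help: on the Harvester node itself, check whether vfio-pci actually owns the card and whether the kernel logged any VFIO or IOMMU errors around VM start. These are generic commands, not Harvester-specific ones.)
Copy code
# On the Harvester host (harvester02-5900x): which driver owns the GPU?
lspci -nnk | grep -A3 -i nvidia
# Any VFIO / IOMMU complaints when the device was handed to the VM?
dmesg | grep -iE 'vfio|iommu|dmar'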
g
any chance of running it via kvm in ubuntu?
r
Both scenarios are running Ubuntu, same version
g
One is running native Ubuntu, the other is running in a VM
r
ya
g
it may not be the same test then
r
What do you mean? Scenario 1 is for testing whether the GPU is working properly.
g
I am trying to check if it's QEMU doing something.
So we need to check on host Ubuntu, where it works fine,
then on this host run KVM and pass the GPU through to an Ubuntu VM,
and compare the difference.
also what version of ubuntu did you use?
r
20.04
My hunch is that two graphics devices are required if you want to pass one into a VM.
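(A rough sketch of the comparison test suggested above: on the native Ubuntu install, bind the card to vfio-pci and boot a throwaway QEMU/KVM guest with it passed through, then run the same gpu_burn there. The PCI address, memory size and image path below are placeholders, not values from this thread.)
Copy code
# Boot a plain QEMU/KVM guest with the GPU passed through via vfio-pci
# (the card must be bound to vfio-pci on the host first)
sudo qemu-system-x86_64 \
  -enable-kvm -machine q35 -cpu host \
  -m 8G -smp 4 \
  -device vfio-pci,host=0000:08:00.0 \
  -drive file=ubuntu-20.04.qcow2,format=qcow2 \
  -nographic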
g
I can't say.
I am trying to arrange a 30-series GPU in our lab to see for myself.
Any chance you could please create an issue for us to track?
r
sure
g
I ran the same gpu_burn on a 3070 GPU, and I can see the CPU usage pinned on one core.
What happens when you run it natively on an Ubuntu host? What would be the load expectation?
r
The load would be on the GPU
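(For comparison, nvidia-smi has a lightweight monitoring mode; on the working native install the GPU utilization column would be expected to sit near 100% within a few seconds of starting the burn.)
Copy code
# Sample GPU power, utilization, clocks and memory once per second during the burn
nvidia-smi dmon -s pucm -d 1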
g
sure
r
Is it possible that machine B has a monitor plugged in, so the graphics card is unable to be “released”?
g
I tried checking that too; I was noticing CUDA was not picking up the GPU in the VM.
But this is a great find
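(On the monitor question above: a passed-through GPU that also drives the host's console/boot framebuffer is a commonly reported VFIO trouble spot, so it may be worth checking on harvester02-5900x whether efifb/simplefb grabbed the card or a BAR reservation failed. Generic commands, not from this thread.)
Copy code
# On harvester02-5900x: did the boot framebuffer claim the GPU, or did a BAR
# reservation fail when vfio / the guest tried to take the device?
dmesg | grep -iE 'efifb|simplefb|BAR .* reserve'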