A quick update on ALTRAD8UD-1L2T (making it work with GPU)
My other articles on the same topic:
- First: A first impression of ASRock Rack ALTRAD8UD-1L2T (Ampere Altra)
- Second (this one): A quick update on ALTRAD8UD-1L2T (making it work with GPU)
Intro
This is a quick update on my first impressions of the Ampere Altra board. So this time there is no picture that I took somewhere in Switzerland, but there is one of my cats.
While I wait for everything to arrive so I can build a server (sourcing a chassis without paying half the price of the board takes some time…), I’m plugging in every random PCIe device I have in my apartment and trying to get it to work. Naturally, the first thing to try is a GPU.
I don’t have too many of them, to be honest. I have a few that are good as just an HDMI/DP or even DVI output, but if I want some chance of playing games on Altra, I have only a few candidates:
- RX 550: good ol’ Polaris, which has been supported by amdgpu for a long time; it also has nothing fancy like floating-point instructions in DCN, so it has relatively good compatibility.
- RX 5700: a cheap used card (likely pulled out of a mining rig) that is good enough to plug into a random system; it is in working condition, but I don’t know how long that will last. This card uses DCN.
- Intel Arc 750: I got it because it seems to be a decent GPU, mostly on par with the RX 5700 in performance, with relatively good ray tracing, and Intel has a good track record of working under Linux (x86/x86-64).
RX 550
The RX 550 is boring, but it seems to work fine:
Sorry for the quality of the screenshots; I’m using a cheap USB-HDMI capture card to get them. And as I wasn’t planning to write anything soon, I’m pulling these screenshots from various messages I’ve sent to people.
This runs on the stock kernel; I haven’t changed anything there.
RX 5700
Making it run was way more challenging. First, the 6.1 kernel doesn’t save FP registers on ARM (correct me if I misremember that), which is required for any card with DCN 1.0 or higher to work. What is more, Ampere Altra has a PCIe bug that prevents some devices from working. It is called “Ampere Altra erratum #82288 PCIE_65”, and a workaround for it is already integrated by some Linux distros. There are discussions about that bug on the community forum.
If you don’t apply that workaround, amdgpu will fail to initialize the card, even with a 6.9-rc6 kernel (the latest I’ve tried), with a message that looks like this:
[ 2.257810] [drm:amdgpu_gfx_enable_kcq [amdgpu]] *ERROR* KCQ enable failed
Sometimes, it boots, but you might have graphics glitches or lockups.
But after you apply the patch, it works:
Intel Arc 750
That one is the funniest. The reason is that the Intel i915 kernel driver doesn’t work on non-x86. They’ve started to upstream a new Xe kernel driver, which seems more focused on their recent GPUs. But if you try it on Altra, out of the box, you’ll get something like:
[ 57.555741] xe 0004:04:00.0: [drm] Using GuC firmware from i915/dg2_guc_70.bin version 70.20.0
[ 57.581832] xe 0004:04:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
[ 57.581848] Unable to handle kernel paging request at virtual address ffffffffc08003cc
[ 57.589768] Mem abort info:
<…>
[ 57.860552] logic_inb+0xa0/0xe0
[ 57.863772] hsw_power_well_enable+0x198/0x288 [xe]
[ 57.868900] intel_power_well_enable+0x74/0x98 [xe]
[ 57.874019] intel_power_well_get+0x2c/0x40 [xe]
After a bit of digging, I have an ugly hack that makes it initialize and output a picture (2D only, as there is a mesa part that might be way more tricky):
What did I effectively comment out? According to its description, intel_vga_reset_io_mem is a function that ensures compatibility with a module called vgacon (VGA Console), as it would lead to a lock-up if you don’t touch some registers that the module also uses. There is a comment there that describes in detail what it does. However, VGA Console doesn’t work on ARM (except on one of the old platforms), so it should be safe to comment it out for a test.
And if you do that, after a while (the driver takes some time to initialize, especially with drm debug logging enabled):
I got a picture! Out of HDMI on Intel GPU running on ARM!
I still don’t have 3D, for two reasons:
- I need a newer version of Mesa than the one included in Debian testing (I’ve upgraded to testing since the last article).
- The i915 gallium driver in mesa, which handles Xe as well, is marked as x86/x86-64 only. So it could potentially be very hard to make it work (especially because I’m not familiar with the mesa codebase).
Correction: It’s been a while since I’ve used an Intel GPU under Linux, so I had the wrong impression that Xe and Intel Arc are still handled by i915, while in fact they are handled by the newer iris driver. So please disregard what I’ve said here.
But a win is a win, even if it is a small one. Now I need to make the fix above upstreamable, and honestly I’m not sure what kind of patch would be accepted there. My current bet is that just gating it on CONFIG_VGA_CONSOLE should be fine, but we’ll see.
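To see whether a gate on CONFIG_VGA_CONSOLE would actually compile the call out on a given machine, one can check how that option is set in the kernel config. The following is only a rough sketch and not part of any patch: the helper names kconfig_state and read_running_config are made up for illustration, and the /boot/config-<release> path is an assumption based on where Debian installs kernel configs.

```python
import platform
import re


def kconfig_state(config_text, option):
    """Return the value of `option` (e.g. 'y' or 'm'), or None if it is unset or absent."""
    pattern = re.compile(r"^%s=(.*)$" % re.escape(option))
    for line in config_text.splitlines():
        match = pattern.match(line.strip())
        if match:
            return match.group(1)
    # Disabled options show up as "# CONFIG_FOO is not set" and end up here.
    return None


def read_running_config():
    """Read the running kernel's config (assumed location: /boot/config-<release>)."""
    path = "/boot/config-" + platform.release()
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

Calling kconfig_state(read_running_config(), "CONFIG_VGA_CONSOLE") and getting None back would suggest that code gated on that option is not built into the kernel in question.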
UPDATE from later the same day
After spending a few hours recompiling libdrm (because Debian doesn’t enable the Intel-specific library on ARM) and mesa (because debian-testing’s mesa is old and Xe there is gated on the x86/x86-64 architecture, while in 24.1.0-rc it should just work if you compile it for aarch64 with the default list of drivers, except for ray tracing, but that is another story), I actually got nothing working anymore.
Attempting to start gdm immediately triggered an error that looked like this:
[ 687.296338] xe 0004:04:00.0: [drm:guc_exec_queue_timedout_job [xe]] Timedout signaled job: seqno=4294967169, guc_id=3, flags=0x1
The guc_id values might be different, and eventually it locks up with something like this:
[ +0.000021] xe 0004:04:00.0: [drm:xe_devcoredump [xe]] Multiple hangs are occurring, but only the first snapshot was taken
[ +0.000492] xe 0004:04:00.0: [drm] Engine reset: guc_id=5
It was rendering a cursor, though.
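When comparing runs like these, it can help to pull the fields out of the timeout lines instead of eyeballing them. Below is a small sketch that does this for a saved kernel log; the regular expression is written against the single message quoted above, so it is an assumption that other kernel versions format the line the same way, and parse_xe_timeouts is a name invented for this example.

```python
import re

# Matches the "Timedout signaled job" line as quoted in this article.
TIMEOUT_RE = re.compile(
    r"xe (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]): "
    r".*Timedout signaled job: "
    r"seqno=(?P<seqno>\d+), guc_id=(?P<guc_id>\d+), flags=(?P<flags>0x[0-9a-fA-F]+)"
)


def parse_xe_timeouts(log_text):
    """Return one dict per timeout line found in log_text, in log order."""
    events = []
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match is None:
            continue  # not a timeout line, e.g. "Engine reset" or devcoredump notes
        events.append({
            "pci": match.group("pci"),
            "seqno": int(match.group("seqno")),
            "guc_id": int(match.group("guc_id")),
            "flags": int(match.group("flags"), 16),
        })
    return events
```

Feeding it the output of dmesg saved to a file gives a list that is easy to count or group by guc_id, which makes it simpler to tell whether a kernel or BIOS change altered the failure pattern.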
I was about to give up, but then I decided to search for those errors. I found that they had also been caught by the Xe developers' automation, and that their repository had some patches that made the driver more stable overall. So I decided to grab their branch (which I think will become drm-next for kernel 6.10) and see if it works.
After recompiling it (and reapplying my previous patch), it… didn’t work. At all. The driver failed with the same message: “Multiple hangs are occurring.”
After a bit of fiddling with the BIOS settings (some knobs turn workarounds on or off), I remembered that I hadn’t re-applied Ampere’s PCIe bugfix to that kernel.
After that, I started GDM and…
And then it hung. But remembering that the driver is experimental, I decided to try again. This time, I got further and have a nice screenshot confirming that OpenGL is running.
It has actually been running beside me the whole time I’ve been writing this update and still has not crashed. It is not fast, but that is expected from an experimental driver.
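Besides a screenshot, one way to check that OpenGL is running on the GPU rather than on a CPU fallback is to look at the renderer line printed by glxinfo. The sketch below assumes the output of glxinfo -B has already been captured into a string; both helper names are invented for this example, and the llvmpipe/softpipe check is only a heuristic based on the names of Mesa's software rasterizers.

```python
def renderer_from_glxinfo(glxinfo_text):
    """Return the text after 'OpenGL renderer string:', or None if the line is missing."""
    prefix = "OpenGL renderer string:"
    for line in glxinfo_text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return None


def looks_like_software_rendering(renderer):
    """Heuristic: Mesa's CPU rasterizers report names such as llvmpipe or softpipe."""
    lowered = renderer.lower()
    return any(name in lowered for name in ("llvmpipe", "softpipe"))
```

If renderer_from_glxinfo returns a name and looks_like_software_rendering returns False for it, the desktop is most likely using the hardware driver.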
Now I’ll probably need to upstream mesa build system changes and file a few more bug reports…
UPDATE: the only change that needs to be upstreamed is enabling intel-rt for aarch64; the other changes are related to the way Debian builds mesa and have nothing to do with upstream mesa.
Instead of a conclusion
Well, this is just a small note on the current state of things. I really hope that Ampere will upstream their workaround for the PCIe bug at some point so that people won’t need to maintain their own kernel builds. It does surprise me, though, that this has not been done yet.
And if you read that article only to see a picture of the cat, here you go: