Tell HN: GpuOwl/PRPLL, GPU software used to find the largest prime number
Hi, I'm Mihai Preda, the author of GpuOwl/PRPLL [1], the OpenCL software used by Luke Durant for his recent discovery of the largest known prime number, the 52nd Mersenne prime 2^136279841 - 1 [2].
Feel free to ask questions about technical aspects of the GpuOwl implementation, about optimizations, tricks, efficient FFT implementation on GPUs etc. Or anything else.
[1] GpuOwl: https://github.com/preda/gpuowl
[2] GIMPS: https://www.mersenne.org/
Hi, I've got a few questions:
1) What profiling tools do you use for GPU code?
2) Where would one start, in terms of learning resources, with coding in inline GPU assembly?
3) Do you check the GPU assembly that the compiler generates from C/C++ code for efficiency? If so, which tools do you use for that?
4) Is SIMD on GPUs a thing?
5) What are the primary factors you take into account (cache sizes, micro-optimizations, etc.) when you write code for a tool like gpuowl/prpll? Which factor is the most important?
Thanks!
1. My profiling is rudimentary but effective. I measure per-kernel execution time with OpenCL events (which record start/end times with high accuracy and practically no overhead), and I also continuously measure per-iteration time by dividing the wall time for blocks of 20,000 iterations by that number. These measurements are consistent and sensitive.
2. I'm not aware of good learning resources. Explore existing code that does this, e.g. OpenCL miners tend to use asm. Read the code under amdgpu/ in LLVM. Disassemble code compiled from OpenCL and read the ISA. Explore and experiment, but it's tedious. I would not recommend jumping into ISA initially. BTW, AMD does have good GCN ISA docs available online, which is useful!
3. Yes, I often read the compiled ISA, and over time I discover bugs and also understand the ISA better.
4. OpenCL is SIMD, and yes it matches the GPU HW.
5. Most important is to reduce the number of registers used (#VGPRs), as that heavily influences the occupancy of the kernel. Use fewer costly instructions such as FP64 mul/FMA. Access memory sequentially, and in general reduce global memory access as it's very slow. Merge small kernels into one (keep the data in the kernel). Never spill VGPRs.
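The per-iteration timing described in answer 1 can be sketched on the host side as below. This is only an illustration in Python, not GpuOwl's code: `time_per_iteration` and `step` are hypothetical names, and `step` stands in for whatever enqueues one iteration of work. On a real GPU the command queue would also have to be drained (for example with `clFinish`) before the clock is read at the end of each block.

```python
import time


def time_per_iteration(step, total_iterations, block_size=20_000):
    """Run `step` repeatedly; return mean wall time per iteration for each block.

    `step` is a hypothetical callable standing in for one iteration of work.
    """
    per_iteration = []
    done = 0
    while done < total_iterations:
        # The last block may be shorter than block_size.
        n = min(block_size, total_iterations - done)
        start = time.perf_counter()
        for _ in range(n):
            step()
        elapsed = time.perf_counter() - start
        # Dividing the block's wall time by its length averages out timer noise.
        per_iteration.append(elapsed / n)
        done += n
    return per_iteration
```

Averaging over a large block is what makes the number stable: a single iteration is too short to time reliably with a wall clock, while a block of 20,000 iterations is long enough that small regressions show up.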
And another, more general question: (6) gcc, clang, and nvcc have some OpenMP offloading capabilities which allow compiling code into binaries that can then run on GPUs. Is the code they produce through OpenMP anywhere close to what one gets directly with e.g. OpenCL?
I don't know, I haven't explored OpenMP myself. Maybe some day.
First, congrats! Awesome work and appreciate you sharing more.
Second: I'm confused by something in your readme. It says:
> For Mersenne primes search, the PRP test is by far preferred over LL, such that LL is not used anymore for search.
But later notes that PRP is computationally nearly identical to LL. Was that sentence supposed to say TF and P-1 instead of PRP or am I misunderstanding something about the actual computational cost of PRP?
The PRP test has the same computational cost as an LL test. The reason why GIMPS now prefers to do PRP tests instead of LL tests is because an efficiently verifiable proof-of-work certificate was developed for PRP tests [1].
[1] https://doi.org/10.4230/LIPIcs.ITCS.2019.60
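To make the "same computational cost" point concrete, here is a minimal sketch of both tests using Python big integers. It assumes an odd prime exponent p and uses one common formulation of the base-3 PRP test (checking 3^(2^p) ≡ 9 mod M_p); the function names are illustrative and this is not how GpuOwl implements either test. Both loops are dominated by repeated squaring modulo M_p: p - 2 squarings for LL, p for PRP.

```python
def lucas_lehmer(p):
    """LL test for M_p = 2^p - 1, assuming p is an odd prime."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):  # p - 2 modular squarings
        s = (s * s - 2) % m
    return s == 0


def prp_base3(p):
    """Base-3 Fermat PRP test for M_p, assuming p is an odd prime.

    Checks 3^(M_p + 1) == 9 (mod M_p); since M_p + 1 = 2^p, this is
    p repeated squarings of 3.
    """
    m = (1 << p) - 1
    x = 3
    for _ in range(p):  # p modular squarings
        x = (x * x) % m
    return x == 9 % m
```

An LL result of zero proves primality, while a passing PRP result only says "probable prime", so a PRP hit is normally confirmed separately.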
Ah, that's interesting and makes sense. Thank you!
Some topic ideas:
Wow, congrats!
Indeed, I’m curious why you’ve used OpenCL. And what was the hardware/general setup used for finding the prime?
What was your motivation behind building this software?
OpenCL works on both AMD and Nvidia GPUs with mostly the same source code. By supporting at-runtime compilation it allows a lot of code particularization/instantiation before compilation, which reduces the power (cost) of the generated code. In general OpenCL is close enough to the HW and the generated code is improving over time (LLVM).
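One way such particularization can look in practice is to pass run-specific constants as `-D` preprocessor definitions in the OpenCL build options, so the compiler sees them as literals. The sketch below only builds that options string; `build_options` and the constant names `EXP` and `WIDTH` are hypothetical and not taken from GpuOwl.

```python
def build_options(constants):
    """Turn a dict of compile-time constants into OpenCL '-D' build options.

    The names and values are whatever the caller chooses; the ones in the
    usage example are hypothetical.
    """
    return " ".join(f"-D{name}={value}" for name, value in constants.items())


# Example: constants known only at run time, fixed before the kernel is compiled.
options = build_options({"EXP": 136279841, "WIDTH": 1024})
```

The resulting string would be handed to the OpenCL program build call along with the kernel source, letting the compiler fold the constants and unroll loops whose bounds depend on them.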
Motivation: a long time ago I had an AMD GPU and no way to run an LL test on it, so I decided to write my own. And I was hooked by the power of the GPU and the quest for ever more efficient, faster implementations.
First of all, thank you for your work and congratulations on your achievements, both in the search for Mersenne primes and software development.
I am contributing to GIMPS with 2 Radeon Pro VII cards. I'm wondering what will happen when ROCm stops supporting these GPUs.
Do you have any plans to keep them working with GpuOwl/PRPLL when they are no longer supported by ROCm?