Tell HN: GpuOwl/PRPLL, GPU software used to find the largest prime number

20 points by mpreda 19 hours ago

Hi, I'm Mihai Preda the author of GpuOwl/PRPLL [1], an OpenCL software used by Luke Durant for his recent discovery of the largest prime number know, the 52nd Mersenne prime 2^136279841 - 1 [2].

Feel free to ask questions about technical aspects of the GpuOwl implementation, about optimizations, tricks, efficient FFT implementation on GPUs etc. Or anything else.

[1] GpuOwl: https://github.com/preda/gpuowl [2] GIMPS: https://www.mersenne.org/

motorolnik 17 hours ago

Hi, I've got few questions: 1). What profiling tools do you use for GPU code? 2). Where one would start, in terms of learning resources, about coding using inline GPU assembler? 3). Do you verify GPU assembler generated by a compiler from C/C++ code, in terms of effectiveness? If so, which tools do you use for that? 4). Is SIMD on GPUs a thing? 5). What are the primary factors being taken into account by you (cache sizes, microoptimizations, etc.) when you write code for a tool like gpuowl/prpll? Which factor is the most important? Thanks!

mpreda 15 hours ago

1. My profiling is rudimentary but effective. I measure per-kernel execution time with OpenCL events (which register with high accuracy start/end times w. practically no overhead), and also I continously measure per-iteration time by dividing wall-time for blocks of 20'000 iterations by that nb. These measuremens are consistent and sensitive.
2. I'm not aware of good learning resources. Explore existing such code, e.g. opencl miners tend to use asm. Read in amdgpu/ in LLVM. Disassemble code from OpenCL and read the ISA. Explore and experiment, but it's tedious. I would not recommend to jump into ISA initially. BTW AMD does have good GCN ISA docs available online, that is useful!
3. Yes I often read the compiled ISA, and over time I discover bugs and also better understand the ISA.
4. OpenCL is SIMD, and yes it matches the GPU HW.
5. most important is to reduce the number of registers used (#VGPRs), as that influences heavilly the occupancy of the kernel. Use fewer costly instructions such as FP64 mul/FMA. Sequential memory access, and in general reduce global memory access as it's very slow. Merge small kernels into one (keep the data in the kernel). Never spill VGPRs.
motorolnik 17 hours ago

And another more general question: (6) gcc, clang, and nvcc have some OpenMP offloading capabilities which allow to compile code into binaries which can then run on GPUs. Is the code they produce through OpenMP anywhere close to what one gets directly with i.e. opencl?
- mpreda 15 hours ago
  
  I don't know, I haven't eplored OpenMP myself.. maybe some day.

dgacmu an hour ago

First, congrats! Awesome work and appreciate you sharing more.

Second: I'm confused by something in your readme. It says:

> For Mersenne primes search, the PRP test is by far preferred over LL, such that LL is not used anymore for search.

But later notes that PRP is computationally nearly identical to LL. Was that sentence supposed to say TF and P-1 instead of PRP or am I misunderstanding something about the actual computational cost of PRP?

cbright an hour ago

The PRP test has the same computational cost as an LL test. The reason why GIMPS now prefers to do PRP tests instead of LL tests is because an efficiently verifiable proof-of-work certificate was developed for PRP tests [1].
[1] https://doi.org/10.4230/LIPIcs.ITCS.2019.60
- dgacmu 30 minutes ago
  
  Ah, that's interesting and makes sense. Thank you!

mpreda 19 hours ago

Some topic ideas:

  - Why use OpenCL when implementing GPU software
  - Does it run on AMD or on Nvidia GPUs?
  - How does the primality test implemented in GpuOwl work?
  - How fast is it to test a Mersenne candidate?
  - Why use FFTs? how large are the FFTs?
  - What do you use for sin/cos?

iyn 17 hours ago

Wow, congrats!

Indeed, I’m curious why you’ve used OpenCL. And what was the hardware/general setup used for finding the prime?

What was your motivation behind building this software?

mpreda 15 hours ago

OpenCL works on both AMD and Nvidia GPUs with mostly the same source code. By supporting at-runtime compilation it allows a lot of code particularization/instantiation before compilation, which reduces the power (cost) of the generated code. In general OpenCL is close enough to the HW and the generated code is improving over time (LLVM).
Motivation: a long time ago I had an AMD GPU and no way to run an LL test on it, so I decided to write my own. And I was hooked by the power of the GPU and the quest for ever more efficient, faster implem.

rigmonger 6 hours ago

First of all, thank you for your work and congratulations on your achievements, both in the search for Mersenne primes and software development.

I am contributing to GIMPS with 2 Radeon Pro VII cards. I'm wondering what will happen when ROCm stops supporting these GPUs.

Do you have any plans to keep them working with GPUOwl/Prpll when they are no longer supported by ROCm?