First, 36 TB of memory is extremely large. You have a serious server.
I looked up the Power10 memory bandwidth and found:
The IBM Power10 processor has a maximum memory bandwidth of 409 GB/sec per socket for 32 GB and 64 GB memory cards, and 375 GB/sec per socket for 128 GB and 256 GB memory cards. The Power10's memory bandwidth is 2.6 times higher than scalable x86 processors.
That works out at 3 TB of memory per Power10 socket, with up to 15 CPUs (cores) per socket.
I assume you are trying some sort of "burn in" test to check the memory is fully working - IBM will have already done that but I understand the need.
For 36 TB, you have at least 9 sockets and (guessing) at least 120 and possibly the full 240 CPUs.
You will have to use all of the CPUs to test the memory or you will be waiting days.
As nmem64 is single-threaded, a starting point would be one nmem64 process per CPU, and up to eight nmem64 processes per CPU due to SMT=8.
There is no getting around this. So you are into hundreds of copies of nmem64.
At 36 TB and, say, 100 processes you will need 36 x 1024 / 100 ≈ 368 GB per process.
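To put some numbers on that, here is a quick back-of-the-envelope sketch (Python purely as an illustration; the 240 cores and SMT=8 are my assumptions, so plug in the real figures from prtconf or lparstat on your machine):

    # Back-of-the-envelope sizing for the nmem64 runs.
    # ASSUMPTIONS: 36 TB of RAM, the full 240 cores and SMT=8 -
    # replace these with the real figures from prtconf / lparstat.
    TOTAL_MEM_GB = 36 * 1024   # 36 TB
    CORES        = 240
    SMT          = 8           # up to 8 logical CPUs per core

    for nprocs in (100, CORES, CORES * SMT):
        gb_each = TOTAL_MEM_GB / nprocs
        print(f"{nprocs:4d} nmem64 processes -> roughly {gb_each:.1f} GB each")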
Testing this much memory is not a trivial task.
I have never had such a huge machine to "play" with - please let me know how you get on.
Ask if you get stuck, I will try to help out.
With 100 to 800 processes, I would guess you have to send the output to log files and then run checks on the log files to find out if it is working as expected.
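Something along these lines would do it (Python purely as an illustration - a ksh loop would work just as well; the nmem64 flags here, -m for MBytes and -s for seconds, are from memory, so check nmem64 -? before trusting them):

    # Sketch only: start N copies of nmem64, each with its own log file,
    # then scan the logs for anything that looks like trouble.
    # ASSUMPTIONS to verify: the nmem64 flags (-m MBytes, -s seconds) and
    # the words it prints on an error - do a small test run first.
    import glob
    import os
    import subprocess

    NPROCS      = 100              # number of nmem64 copies
    MB_PER_PROC = 368 * 1024       # roughly 368 GB each (from the arithmetic above)
    RUN_SECS    = 3600             # how long each copy runs
    LOGDIR      = "/tmp/nmemlogs"  # somewhere with room for the log files

    os.makedirs(LOGDIR, exist_ok=True)
    procs = []
    for i in range(NPROCS):
        log = open(f"{LOGDIR}/nmem64.{i:03d}.log", "w")
        p = subprocess.Popen(
            ["nmem64", "-m", str(MB_PER_PROC), "-s", str(RUN_SECS)],
            stdout=log, stderr=subprocess.STDOUT)
        procs.append(p)

    for p in procs:
        p.wait()

    # Crude post-run check: flag any log containing suspicious words.
    for logfile in sorted(glob.glob(f"{LOGDIR}/*.log")):
        with open(logfile) as f:
            text = f.read().lower()
        status = "CHECK" if any(w in text for w in ("error", "fail", "killed")) else "ok"
        print(status, logfile)

I would do a first run with tiny -m values to prove the wrapper and the log checking work before asking for hundreds of GB per process.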
Even starting 800 processes, getting the kernel to allocate the space, and then writing to each memory page to force the allocation of real memory could take a few hours!
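If you want a feel for why the page touching dominates the time, here is the idea on its own (nothing to do with nmem64 itself, just the concept, on a small 1 GB buffer - scale it up in your head to 36 TB):

    # Concept demo: a virtual allocation is cheap; real memory only appears
    # when you write to each page. Assumes 4 KB pages (AIX also uses 64 KB pages).
    import mmap
    import time

    SIZE = 1024 * 1024 * 1024      # 1 GB for the demo
    PAGE = 4096

    buf = mmap.mmap(-1, SIZE)      # anonymous mapping: address space only
    start = time.time()
    for off in range(0, SIZE, PAGE):
        buf[off] = 0xAA            # one write per page faults in a real page
    print(f"Touched {SIZE >> 20} MB in {time.time() - start:.2f} seconds")
    buf.close()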
Good luck, cheers, Nigel