Approximating the value of Pi by the Monte Carlo method on the Nspire and an NVIDIA GPU

The value of pi can be approximated by the Monte Carlo method. This can easily be done on anything from a programmable calculator like the Nspire to, much more efficiently, a parallel computing device like a GPU.
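For reference, here is a sketch of the usual formulation. The post does not spell out the sampling scheme, so this assumes the common one: points (x, y) are drawn uniformly in the unit square and counted when they fall inside the quarter circle of radius 1. N and N_inside are names introduced here for the total and the in-circle counts.

```latex
\[
  \Pr\left[x^2 + y^2 \le 1\right] = \frac{\pi}{4}
  \quad\Longrightarrow\quad
  \pi \approx 4 \cdot \frac{N_{\text{inside}}}{N},
  \qquad
  \text{standard error} \approx
  4\sqrt{\frac{\tfrac{\pi}{4}\left(1 - \tfrac{\pi}{4}\right)}{N}}
  \approx \frac{1.64}{\sqrt{N}}
\]
```

Under that assumption, 100,000 samples give an expected error of about 0.005, and 10,000,000 samples about 0.0005.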

Writing the code on the Nspire is straightforward. Nspire Basic provides random number functions such as RandSeed(), Rand(), RandBin(), RandNorm(), etc.

To implement the Monte Carlo method on the GPU, the random numbers are generated not inside the CUDA kernel functions but in the main program, using the cuRAND host API. cuRAND is NVIDIA's random number generation library for the GPU, with 9 different types of random number generators available. At first I picked one from the Mersenne Twister family (CURAND_RNG_PSEUDO_MT19937) as the generator, but unfortunately the program crashed with a segmentation fault. Tracing the call to curandCreateGenerator revealed that the status value returned was 204, which means CURAND_STATUS_ARCH_MISMATCH. It turns out this particular generator is available only through the host API and only on architecture sm_35 or above. I eventually settled on CURAND_RNG_PSEUDO_MTGP32. The performance of this RNG is closely tied to the thread and block count, and it is used most efficiently when generating a multiple of 16384 samples (64 blocks × 256 threads).
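The multiple-of-16384 guideline can be applied by rounding the requested sample count up before asking the generator for numbers. The following is a minimal sketch in plain C. The names MTGP32_BATCH and round_up_to_batch are hypothetical helpers introduced here for illustration. They are not part of cuRAND and not taken from the original program.

```c
#include <assert.h>
#include <stddef.h>

/* 64 blocks x 256 threads: the batch size quoted above for MTGP32.
   Hypothetical constant name, not a cuRAND symbol. */
#define MTGP32_BATCH ((size_t)64 * 256)

/* Round n up to the next multiple of MTGP32_BATCH.
   Returns n unchanged if it is already a multiple. */
size_t round_up_to_batch(size_t n)
{
    /* Guard against overflow in the addition below. */
    assert(n <= (size_t)-1 - (MTGP32_BATCH - 1));
    return (n + MTGP32_BATCH - 1) / MTGP32_BATCH * MTGP32_BATCH;
}
```

With this helper, a request for 100,000 samples becomes 114,688 (7 batches) and a request for 10,000,000 becomes 10,010,624 (611 batches). The surplus numbers can simply be ignored.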

On a side note, to use the cuRAND API in Visual Studio, the cuRAND library must be added to the project dependencies manually. Otherwise the build fails at the linking stage, since cuRAND is a separate, dynamically linked library.


On the Nspire, with 100,000 samples, an overclocked CX CAS took 790 seconds to complete the simulation. The same Nspire program finished in 8 seconds in the PC version running on a Core i5.

On the GPU side, CUDA on a GeForce 610M finished in the blink of an eye for the same number of iterations.

To get a better measurement of the performance, the number of iterations was increased to 10,000,000, and the result was compared against a separately coded Monte Carlo program for the same purpose that runs serially rather than in parallel on CUDA. The GPU version took 296 ms, while the plain vanilla C version ran for 1014 ms, making the GPU roughly 3.4 times faster.
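For readers who want a serial baseline to experiment with, a version along these lines might look as follows. This is only a sketch and not the program that was timed above. It assumes the quarter-circle sampling scheme and uses the standard library's rand(), and estimate_pi is a name chosen here for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Estimate pi from n pseudo-random points in the unit square by
   counting how many satisfy x*x + y*y <= 1 (quarter circle).
   Assumption: rand() is good enough for a rough illustration. */
double estimate_pi(unsigned long n, unsigned int seed)
{
    unsigned long inside = 0;

    assert(n > 0);   /* avoid dividing by zero below */
    srand(seed);     /* fixed seed makes a run repeatable */

    for (unsigned long i = 0; i < n; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)
            inside++;
    }
    /* The hit ratio approximates pi/4. */
    return 4.0 * (double)inside / (double)n;
}
```

Calling estimate_pi(10000000UL, 12345u) from a main function would correspond to the 10,000,000-iteration run. The timing will depend on the machine and compiler settings.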