Tag Archives: monte carlo

Monte Carlo methods in TensorFlow

The Markov Chain Monte Carlo (MCMC) is a sampling method to sample from a probability distribution when direct sampling is not feasible.

The implementation of Monte Carlo in the TensorFlow Probability package included sample to run the Hamiltonian MCMC, which is a variation with input from the Hamiltonian dynamics to avoid slow exploration of state space.

While running the samples, my Python reported errors on importing tensorflow_probability:tfp1

The problem is resolved by uninstalling and re-installing the tensorflow-estimator package:

pip uninstall tensorflow_estimator
pip install tensorflow_estimator

Finally the samples run fine with expected results.


The results from the sample run.


Implementing parallel GPU function in CUDA for R

There are existing R packages for CUDA. But if there is a need to customize your own parallel code on NVIDIA GPU to be called from R, it is possible to do so with the CUDA Toolkit. This post demonstrates a sample function to approximate the value of Pi using Monte Carlo method which is accelerated by GPU. The sample is built using Visual Studio 2010 but the Toolkit is supported on linux platforms as well. It is assumed that the Visual Studio is integrated with the CUDA Toolkit.

The first thing to do is to create a New Project using the Win32 Console Application template, and specify DLL with Empty project option.


And then, some standard project environment customization including:

CUDA Build Customization:

CUDA Runtime, select Shared/dynamic CUDA runtime library:

Project Dependencies setting. Since the CUDA code in this example utilize curand for Monte Carlo, the corresponding library must be included or else the linking will fail.

Finally the time to code. Only a cu file is needed which resembles the standard directives. It is important to include the extern declaration as below for R to call.

After a successful compile, the DLL will be created with the CUDA code. This DLL will be registered in R for calling.

Finally, start R and issue the dyn.load command to load the DLL into the running environment. Shown below is a “wrapper” R function to make calling the CUDA code easier. Notice at the heart of this wrapper is the .C function.

Last but not least, the CUDA Toolkit comes with a visual profiler which is capable to be launched for profiling the performance of the NVIDIA GPU. It can be launched from the GUI, or using a command line like the example below. It should be noted that the command line profiler must be started before R or it might not be able to profile properly.RCuda11

The GUI profiler is equipped with a nice interface to show performance statistics.RCuda10

Approximating the value of Pi by Monte Carlo method in Nspire and NVIDIA GPU

The value of pi can be approximated by Monte Carlo method. This can easily be done with any thing from a programmable calculator like the Nspire, or much more efficiently in parallel computing devices like GPU.

Writing code in the Nspire is straight forward. Nspire Basic provided random number functions like RandSeed(), Rand(), RandBin(), RandNorm() etc.

To implement the Monte Carlo method in GPU, random numbers are not generated in the CUDA kernel functions, but in the main program using CuRand host APIs. On the NVIDIA GPU, CuRand is an API for random number generation, with 9 different types of random number generators available. At first I picked one from the Mersenne Twister family (CURAND_RNG_PSEUDO_MT19937) as the generator but unfortunately this returned a segmentation fault. Tracing the call to curandCreateGenerator revealed the status value returned is 204 which means CURAND_STATUS_ARCH_MISMATCH. It turns out this particular generator is supported only on Host API and architecture sm_35 or above. Eventually settled with CURAND_RNG_PSEUDO_MTGP32. The performance of this RNG is closely tied to the thread and block count, and the most efficient use is to generate a multiple of 16384 samples (64 blocks × 256 threads).

On a side note, to use the CuRand APIs in Visual Studio, the CuRand library must be added to the project dependencies manually. Otherwise there will be error in the linking stage since the CuRand is dynamically linked.


On the Nspire, using 100,000 samples, it took an overclocked CX CAS 790 seconds to complete the simulation. The same Nspire program finishes in 8 seconds in the PC version running on a Core i5.

On the GPU side, CUDA on a GeForce 610M finished in a blink of an eye for the same number of iterations.

To get a better measurement on the performance, the number of iteration is increased to 10,000,000 and the result on performance is compared to a separately coded Monte Carlo program for the same purpose that run serially instead of parallel CUDA. The GPU version took 296 ms, while the plain vanilla C ran for 1014 ms.