
Experiencing Deep Learning with Jupyter and Anaconda

Most of the time my deep learning work is done at the command line with Python and TensorFlow. The clean, efficient syntax of the Python language and the package design of TensorFlow have almost eliminated the need for a complex Integrated Development Environment (IDE). But after trying out the free Google Colab service, which provides a web-based Jupyter interface, I am going to set one up on my desktop, which sports an Nvidia RTX 2060 GPU.

Installation is easy, but be sure to run the Anaconda console as Administrator on Windows. To create an environment for running TensorFlow with the GPU:

conda create -n tensorflow_gpuenv tensorflow-gpu
conda activate tensorflow_gpuenv

Managing multiple packages is much easier with Anaconda, as it separates configurations into environments that can be customized. On my development machine, I can simply create a TensorFlow environment with GPU support and then install Jupyter into it to enjoy its graphical interface.
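Once the environment is active, a quick sanity check that TensorFlow actually sees the GPU can be run from Python. This is just a minimal sketch relying on the device_lib utility that ships with the TensorFlow build installed above:

from tensorflow.python.client import device_lib

# List the CPU and GPU devices known to TensorFlow; a GPU entry here
# confirms the CUDA-enabled build is picking up the RTX 2060.
for device in device_lib.list_local_devices():
    print(device.name, device.device_type)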

Finally, to launch Jupyter:

jupyter notebook

jupyterconsole.PNG

To see how flexible Anaconda with Jupyter is on the same machine, below is a comparison of a simple image pattern recognition program running under Jupyter with and without GPU support.

jupytergpu
jupytercpu
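For the CPU-only run, one simple approach (my assumption of how such a comparison can be produced, not necessarily how the screenshots above were made) is to hide the GPU from CUDA before TensorFlow is imported in the notebook:

import os

# Hide all CUDA devices so TensorFlow falls back to the CPU.
# This must be set before the first import of TensorFlow in the session.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf
print(tf.test.is_gpu_available())  # expected to print False for the CPU run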

Profiling machine learning applications in TensorFlow

TensorFlow provides the timeline package, which is imported from tensorflow.python.client:

from tensorflow.python.client import timeline

This is useful for profiling a TensorFlow application with graphical visualization similar to the graphs generated by the CUDA Visual Profiler. With a small tweak to the machine learning code, TensorFlow applications can store and report performance metrics of the learning process.
tfprofile3

The design of the timeline package makes it easy to add profiling by simply adding the code below.

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata() 

The model must also be instructed to compile with the profiling options:

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'],
              options=run_options,
              run_metadata=run_metadata)

With the sample MNIST digits classifier for TensorFlow, the output shows that the Keras history is saved and can later be retrieved to generate reports.
tfprofile2
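The step statistics collected in run_metadata can then be written out as a Chrome trace using the timeline package imported earlier; the file name timeline.json below is my own choice:

# After training, convert the collected step stats into the Chrome trace format.
tl = timeline.Timeline(run_metadata.step_stats)
trace = tl.generate_chrome_trace_format()

# Persist the trace so it can be loaded later in chrome://tracing/.
with open('timeline.json', 'w') as f:
    f.write(trace)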

Finally, using the Chrome tracing page ( chrome://tracing/ ), the performance metrics persisted on the file system can be opened for verification.
tfprofile1

 

More test driving of TensorFlow with the Nvidia RTX 2060

By following the TensorFlow guide, it is easy to see how TensorFlow harnesses the power of my new Nvidia RTX 2060.

The first is image recognition. Similar to the technology used in a previous installment on neural network training with captured CCTV traffic images, a sample data set of fashion object images is learnt with TensorFlow. In that previous installment, an Amazon Web Services cloud instance with a K520 GPU was used for model training; in this blog post, the training actually takes place on the Nvidia RTX 2060.

tfimg1
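A condensed version of that classifier, along the lines of the TensorFlow guide (the layer sizes below are illustrative rather than the exact ones used for the screenshot), looks like this:

import tensorflow as tf

# Load the Fashion-MNIST sample data set bundled with Keras.
(train_images, train_labels), (test_images, test_labels) = \
    tf.keras.datasets.fashion_mnist.load_data()

# Scale pixel values to the [0, 1] range.
train_images, test_images = train_images / 255.0, test_images / 255.0

# A small fully connected classifier; training runs on the GPU when available.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=5)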

Another classic sample is regression analysis, using the Auto MPG data set. With a few lines of code, the data set is cleaned up to remove unsuitable values and convert categorical values to numeric ones for the model.

tfimg2
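The clean-up boils down to a few pandas calls; a sketch along the lines of the guide, assuming the raw file has been downloaded locally as auto-mpg.data, is:

import pandas as pd

# Column names for the UCI Auto MPG data set.
columns = ['MPG', 'Cylinders', 'Displacement', 'Horsepower',
           'Weight', 'Acceleration', 'Model Year', 'Origin']
dataset = pd.read_csv('auto-mpg.data', names=columns, na_values='?',
                      comment='\t', sep=' ', skipinitialspace=True)

# Remove rows with unsuitable (missing) values.
dataset = dataset.dropna()

# Convert the categorical Origin column into one-hot numeric columns.
dataset = pd.get_dummies(dataset, columns=['Origin'], prefix='Origin')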

Binomial Tree methods for European options using GPU

Binomial methods are versatile for pricing options, being suitable for American, European, and Asian options. For a European call option with maturity t, strike price k, spot price S, volatility σ, and risk-free rate r:

binom1.1

For a put option, the last effective term inside the max function becomes:
binomput2

The stock's up factor, down factor, and the probability of moving up are given respectively by:

binom2
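For reference, in the standard Cox-Ross-Rubinstein parameterisation with N time steps of size Δt = t/N, the European option values and the tree parameters take the familiar form:

\[
V_{\text{call}} = e^{-rt}\sum_{j=0}^{N}\binom{N}{j}\,p^{j}(1-p)^{N-j}\,\max\!\left(S\,u^{j}d^{N-j}-k,\;0\right),
\qquad
V_{\text{put}} = e^{-rt}\sum_{j=0}^{N}\binom{N}{j}\,p^{j}(1-p)^{N-j}\,\max\!\left(k-S\,u^{j}d^{N-j},\;0\right)
\]

\[
u = e^{\sigma\sqrt{\Delta t}}, \qquad
d = e^{-\sigma\sqrt{\Delta t}} = \frac{1}{u}, \qquad
p = \frac{e^{r\Delta t}-d}{u-d}
\]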

One of the CUDA samples from Nvidia implements the binomial model on the GPU.

binom3

 

Analysis of traffic CCTV camera images from data.gov.hk with GPU deep learning on the Amazon cloud

In this installment, modern computing technologies including the Internet, artificial intelligence, deep learning, GPUs, and the cloud are utilized to solve a practical problem.

Traffic jams are very common in metropolitan cities, and authorities responsible for road traffic management often install CCTV cameras for monitoring. The data published on the Internet at the data.gov.hk site for big data analysis is one such example. The following are two sample images from the same camera, one showing heavy traffic and the other very light traffic.

For modern pattern recognition technology, determining these two distinct conditions from the images above is a piece of cake. The frontier in this field is no doubt deep learning networks running on GPU processors. Amazon AWS provides GPU-equipped computing resources for such tasks.

The first step is to fire up a GPU instance on Amazon AWS. The following AMI is selected. Note that HVM virtualization is required for GPU instances.
awsgpu1

The next step is to choose a GPU instance type. As the problem is simple enough, only g2.2xlarge is used.
awsgpu2

After logging in to the terminal, set up CUDA 7, cuDNN, Caffe, and DIGITS. The steps can be found in their respective official documentation. The deviceQuery test below confirmed successful installation of CUDA. The whole process may take an hour to complete if installed from scratch; there may be pre-built images out there.
awsgpu4

Note that an account with the NVIDIA Accelerated Computing Developer Program may be required to download some of these packages. A make test below confirmed the complete setup of Caffe.
awsgpu5

Finally, after installing DIGITS, the login page defaults to port 5000. At the AWS console, a network connection rule can easily be set up to open this port. Alternatively, for a more secure connection, tunneling can be used instead, as shown below running on port 8500.
awsgpu6

Now it is time to start training. A new image classification dataset is to be created. As stated above, the source image set is obtained from data.gov.hk. At this site, traffic cameras installed at strategic road network points feed JPG images to the web for public access, refreshed every 2 minutes. A simple shell script is prepared to fetch the images to build the data set (a sketch of that fetch loop follows the screenshot). Below is the screen where DIGITS configures the classification training.
awsgpu8
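The fetch script itself is nothing more than a periodic download; a Python equivalent of that loop is sketched below (the camera URL is a placeholder, not the actual data.gov.hk endpoint):

import time
import urllib.request

# Placeholder URL; substitute the actual camera image endpoint from data.gov.hk.
CAMERA_URL = 'https://example.org/traffic/camera.jpg'

# Grab a snapshot every 2 minutes, matching the refresh interval of the feed.
while True:
    filename = 'capture_%d.jpg' % int(time.time())
    urllib.request.urlretrieve(CAMERA_URL, filename)
    time.sleep(120)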

Since our sample data set size is small, the training completed in no time.
awsgpu9

awsgpu10

Next, a model is defined. GoogLeNet is selected in this example.
awsgpu12awsgpu13

Model training in progress. The charts update in real time.
awsgpu17

awsgpu15awsgpuB

When training is complete, some tests can be carried out. In this example, the model is trained to determine whether a camera image indicates a traffic jam or not.

A traffic jam sample. Prediction: free=-64.23%, jam=89.32%
awsgpuC

The opposite. Prediction: free=116.62%, jam=-133.61%
awsgpuE

With the Amazon cloud, the ability to deploy cutting-edge AI technology on GPUs is no longer limited to researchers or those rich in resources. The general public can now benefit from these easy-to-access computing resources to explore limitless possibilities in the era of big data.

Approximating the value of Pi by the Monte Carlo method on the Nspire and an NVIDIA GPU

montecarlopi8a
The value of pi can be approximated by the Monte Carlo method. This can be done with anything from a programmable calculator like the Nspire to, much more efficiently, parallel computing devices like GPUs.
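The idea is to sample random points in the unit square and count how many fall inside the quarter circle, since that fraction approximates π/4. A minimal NumPy sketch of the estimator:

import numpy as np

def estimate_pi(samples=100000):
    # Draw random points uniformly in the unit square.
    x = np.random.rand(samples)
    y = np.random.rand(samples)
    # The fraction of points inside the quarter circle approximates pi/4.
    inside = np.count_nonzero(x * x + y * y <= 1.0)
    return 4.0 * inside / samples

print(estimate_pi())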

Writing code on the Nspire is straightforward. Nspire Basic provides random number functions like RandSeed(), Rand(), RandBin(), RandNorm(), etc.

To implement the Monte Carlo method on the GPU, random numbers are not generated in the CUDA kernel functions but in the main program using the CuRand host API. On NVIDIA GPUs, CuRand is an API for random number generation, with 9 different types of random number generators available. At first I picked one from the Mersenne Twister family (CURAND_RNG_PSEUDO_MT19937) as the generator, but unfortunately this returned a segmentation fault. Tracing the call to curandCreateGenerator revealed a status value of 204, which means CURAND_STATUS_ARCH_MISMATCH. It turns out this particular generator is supported only through the host API and on architecture sm_35 or above. I eventually settled on CURAND_RNG_PSEUDO_MTGP32. The performance of this RNG is closely tied to the thread and block count, and the most efficient use is to generate a multiple of 16384 samples (64 blocks × 256 threads).
montecarlopi4

On a side note, to use the CuRand APIs in Visual Studio, the CuRand library must be added to the project dependencies manually. Otherwise there will be errors at the linking stage, since CuRand is dynamically linked.

montecarlopi2

On the Nspire, using 100,000 samples, it took an overclocked CX CAS 790 seconds to complete the simulation. The same Nspire program finishes in 8 seconds in the PC version running on a Core i5.
montecarlopi1

On the GPU side, CUDA on a GeForce 610M finished in the blink of an eye for the same number of iterations.
montecarlopi3

To get a better measurement of performance, the number of iterations is increased to 10,000,000 and the result is compared against a separately coded Monte Carlo program written for the same purpose that runs serially instead of in parallel with CUDA. The GPU version took 296 ms, while the plain vanilla C version ran for 1014 ms.