Tag Archives: CUDA

Logistic Regression – from Nspire to R to Theano

Logistic regression is a very powerful tool for classification and prediction. It works very well with linearly separable problems. This installment will attempt to recap its practical implementation, from the traditional perspective of maximum likelihood to the more machine learning approach of neural networks, and from a handheld calculator to GPU cores.

The heart of the logistic regression model is the logistic function. It takes in any real value and returns a value in the range from 0 to 1. This makes it ideal for a binary classifier system. The following is a graph of this function.
theanologistic1
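As a minimal sketch of this definition (plain Python with numpy; the function name is my own):

```python
import numpy as np

def logistic(x):
    """Logistic (sigmoid) function: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# The midpoint is at 0.5, and extreme inputs saturate toward 0 and 1.
print(logistic(0.0))  # prints 0.5
```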

TI Nspire

In the TI Nspire calculator, logistic regression is provided as a built-in function but is limited to a single variable. For multi-variable problems, custom programming is required to apply optimization techniques to determine the coefficients of the regression model. One such application, shown below, is the Nelder-Mead method on the TI Nspire calculator.

Suppose a data set from university admission records has four attributes (independent variables: SAT score, GPA, Interview score, Aptitude score) and one outcome (“Admission”) as the dependent variable.
theano-new1

Through the use of a Nelder-Mead program, the logistic function is first defined as l. It takes all regression coefficients (a1, a2, a3, a4, b), the dependent variable (s), and the independent variables (x1, x2, x3, x4), and simply returns the logistic probability. Next, the function to optimize in the Nelder-Mead program is defined as nmfunc. This is the likelihood function built on the logistic function. Since Nelder-Mead is a minimization algorithm, the negative of this function is taken. On completion of the program run, the regression coefficients in the result matrix are available for prediction, as in the following case of sample data with [SAT=1500, GPA=3, Interview=8, Aptitude=60].

theanologistic2(nspire1)
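The same l and nmfunc construction can be sketched outside the calculator (a Python/numpy rendering of the idea, with a coefficient vector standing in for a1…a4 and b; names mirror the Nspire program but the code is my own):

```python
import numpy as np

def l(a, b, x):
    """Logistic probability for coefficient vector a, intercept b, features x."""
    return 1.0 / (1.0 + np.exp(-(np.dot(x, a) + b)))

def nmfunc(params, X, y):
    """Negative log-likelihood of the logistic model -- negated because
    Nelder-Mead is a minimization algorithm."""
    a, b = params[:-1], params[-1]
    p = l(a, b, X)
    eps = 1e-12  # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```

A coefficient set that fits the data better yields a smaller nmfunc value, which is exactly what the Nelder-Mead iterations hunt for.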

R

In R, a sophisticated statistical package, the calculation is much simpler. For the sample case above, it takes just a few lines of commands to invoke the built-in logistic model.

theano-new2

Theano

Apart from the traditional methods, modern advances in computing have made it possible to solve these problems much more efficiently with neural networks coupled with specialized hardware such as GPUs, especially on huge volumes of data. The Python library Theano supports and enriches these calculations through optimization and symbolic expression evaluation. It also features compiler capabilities for CUDA and integrates a Computer Algebra System into Python.

One of the examples that comes with the Theano documentation depicts the application of logistic regression to showcase various Theano features. It first initializes a random set of data as the sample input and outcome using numpy.random. Then the regression model is created by defining the expressions required for the logistic model, including the logistic function and the likelihood function. Lastly, using the theano.function method, the symbolic expression graph coded for the regression model is compiled into callable objects for training the neural network and for subsequent prediction.

theanologistic5(theano1)
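In spirit, the example boils down to the following training loop, sketched here in plain numpy rather than Theano's symbolic graphs (data sizes, iteration count, and learning rate are arbitrary choices of mine):

```python
import numpy as np

# Random sample data as input and outcome, as in the documentation example.
rng = np.random.RandomState(0)
N, feats = 400, 10
X = rng.randn(N, feats)
y = rng.randint(0, 2, N)

w, b, lr = np.zeros(feats), 0.0, 0.1

for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X.dot(w) + b)))  # logistic function
    w -= lr * X.T.dot(p - y) / N               # gradient of cross-entropy loss
    b -= lr * np.mean(p - y)

# Final cross-entropy loss of the trained model.
p = 1.0 / (1.0 + np.exp(-(X.dot(w) + b)))
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```

Theano's contribution is to express the same model symbolically and compile the update step into a single callable, optionally targeting the GPU.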

A nice feature of Theano is the pretty printing of the expression model in a tree-like text format. It is a feel-like-home reminiscence of my days reading SQL query plans for tuning database queries.

theanologistic5(theano2).PNG

 

CUDA, Theano, and Antivirus

Most ubiquitous antivirus products monitor new processes launched from executables in real time and will attempt to terminate their execution if deemed a potential threat. Some of these products simply do a signature match, while others perform more sophisticated heuristic or intelligent scanning.

However, there are times when an antivirus product turns up a false positive. This is rare, but many software developers must have experienced the slow-down caused merely by the suspending and scanning of a new build from their favorite IDE. Recently my antivirus product let me know it has been very edgy about some Theano Python programs, with this scan alert.

theanoavast2

This rings a bell. I remember the same happening when working with CUDA on Visual Studio, and a test with some sample CUDA programs quickly confirmed it.

theanoavast3

The solution to get rid of the scan is quite simple. Most antivirus products have an option to whitelist certain programs from being scanned. On my Avast installation, simply adding the full file path of the nvcc output does the trick. Note that doing so may pose certain security risks, as it essentially neutralizes the protection for those files. To compensate for the increased risk, the whitelist path should be set as precisely as possible, for example by including a wildcard filename as shown below.

theanoavast1

One last interesting point: since one of Theano's great benefits is a high-level abstraction over the CUDA layer, it compiles parts of the model into GPU executables on the CUDA platform using nvcc. Comparing the profiling results obtained before and after antivirus whitelisting shows improvement not only in the overall speed but also in the compile time. With reference to the before-whitelisting profiling result

 Function profiling
==================
 Message: train
 Time in 10000 calls to Function.__call__: 1.122200e+01s
 Time in Function.fn.__call__: 1.089400e+01s (97.077%)
 Time in thunks: 1.069661e+01s (95.318%)
 Total compile time: 5.477000e+01s
 Number of Apply nodes: 17
 Theano Optimizer time: 3.589900e+01s
 Theano validate time: 4.000425e-03s
 Theano Linker time (includes C, CUDA code generation/compiling): 1.772000e+00s
 Import time 1.741000e+00s

and the whitelisted, optimized results:

Function profiling
==================
 Message: train
 Time in 10000 calls to Function.__call__: 9.727999e+00s
 Time in Function.fn.__call__: 9.469999e+00s (97.348%)
 Time in thunks: 9.293550e+00s (95.534%)
 Total compile time: 2.827000e+00s
 Number of Apply nodes: 17
 Theano Optimizer time: 1.935000e+00s
 Theano validate time: 1.999855e-03s
 Theano Linker time (includes C, CUDA code generation/compiling): 3.799987e-02s
 Import time 2.199984e-02s

If Avast only scanned the executable before it starts executing, there should be no improvement at all in compilation. From the profiling breakdowns, it seems more likely that Avast scans the GPU executable when Theano creates it on the file system. Turning off Avast’s “File Shield” without any whitelisting triggered no scan alert, which confirmed the suspicion.
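The scale of the difference can be checked directly from the figures in the two profiles:

```python
# Figures taken from the two profiling runs above.
compile_before, compile_after = 54.77, 2.827  # total compile time (s)
call_before, call_after = 11.222, 9.728       # 10000 calls to Function.__call__ (s)

compile_speedup = compile_before / compile_after  # roughly 19x faster compile
overall_speedup = call_before / call_after        # roughly 15% faster calls
print(round(compile_speedup, 1), round(overall_speedup, 2))  # prints: 19.4 1.15
```

Nearly all of the saving is in compilation, consistent with the scan happening as each freshly generated executable lands on disk.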

Experimenting with convergence time in neural network models

After setting up Keras and Theano and running some basic benchmarks on the Nvidia GPU, the next step in getting a taste of neural networks through these deep learning frameworks is a comparison: solving the same problem (an XOR classification) on a modern calculator, the TI Nspire, using the Nelder-Mead algorithm for convergence of the neural network weights.

A sample SGD configuration in Keras on Theano with 30000 iterations converged in around 84 seconds, while the TI Nspire completed with comparable results in 19 seconds. This is not a fair comparison, of course, as there are many parameters that can be tuned in each model.

keras-nspire4
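For reference, the XOR network being fitted is tiny. A minimal 2-2-1 sigmoid network trained by plain gradient descent can be sketched in numpy (my own stand-in for the Keras SGD setup, not the actual code used):

```python
import numpy as np

# XOR truth table.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.RandomState(1)
W1, b1 = rng.randn(2, 2), np.zeros(2)  # hidden layer, 2 units
W2, b2 = rng.randn(2, 1), np.zeros(1)  # output layer
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for _ in range(30000):                   # same iteration count as above
    h = sigmoid(X.dot(W1) + b1)          # forward pass
    out = sigmoid(h.dot(W2) + b2)
    d_out = out - y                      # sigmoid + cross-entropy gradient
    d_h = d_out.dot(W2.T) * h * (1 - h)  # backpropagate to hidden layer
    W2 -= lr * h.T.dot(d_out); b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T.dot(d_h);   b1 -= lr * d_h.sum(axis=0)
```

The calculator side runs the same forward pass but lets Nelder-Mead, rather than backpropagation, search the weight space.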

 

Exploring Theano with Keras

Theano needs no introduction in the field of deep learning. It is based on Python and supports CUDA. Keras is a library that wraps the complexity of Theano to provide a high-level abstraction for developing deep learning solutions.

Installing Theano and Keras is easy, and there are tons of resources available online. However, my primary CUDA platform is on Windows, so most standard guides, which are based on Linux, required some adaptation. Most notable are the proper setting of the PATH variable and the use of the Visual Studio command prompt.

The basic installation steps include setting up CUDA, a scientific Python environment, and then Theano and Keras. cuDNN is optional and requires Compute Capability greater than 3.0; unfortunately my GPU is a bit old and does not meet this requirement.

keras1

Some programs on the Windows platform encountered errors that turned out to be library-related issues. This one, for example, failed to compile in Spyder but was resolved using the Visual Studio Cross Tools Command Prompt.
keras4
keras3a

The Nvidia profiler checking the performance of the GPU while running the Keras MNIST digits MLP example.
keras2

Compiling CUDA in command line

To get acquainted with a more Unix-style CUDA development workflow, the following command lines and related environment settings were found to work on the current Visual Studio based platform.

cd /d "C:\ProgramData\NVIDIA Corporation\CUDA Samples\v6.5\1_Utilities\deviceQuery"

set path=C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin;%path%

nvcc -arch=sm_21 -I..\..\common\inc deviceQuery.cpp -o deviceQuery

cudawin3

The command line environment on a Microsoft Visual Studio platform with CUDA properly set up (like the one used for approximating the value of pi here) relies on native commands like msbuild, as below.

msbuild BlackScholes_vs2010.sln /t:rebuild

cudawin
cudawin2
cudawin4

Analysis of traffic CCTV camera image from data.gov.hk with GPU deep learning on Amazon cloud

In this installment, modern computing technologies including the Internet, artificial intelligence, deep learning, GPUs, and the cloud are utilized to solve a practical problem.

Traffic jams are very common in metropolitan cities. Authorities responsible for road traffic management often install CCTV cameras for monitoring. The data published on the Internet at the data.gov.hk site for big data analysis is one such example. The following are two sample images from the same camera, one showing heavy traffic and the other very light traffic.

For modern pattern recognition technology, determining these two distinct conditions from the images above is a piece of cake. The frontier in this field is no doubt deep learning networks running on GPU processors. Amazon AWS provides GPU-equipped computing resources for such tasks.

The first step is to fire up a GPU instance on Amazon AWS. The following AMI is selected. Note that HVM is required for GPU VMs.
awsgpu1

The next step is to choose a GPU instance. As the problem is simple enough, only a g2.2xlarge is used.
awsgpu2

After logging in to the terminal, set up CUDA 7, cuDNN, caffe, and DIGITS. The steps can be found in their respective official documentation. A device query test below confirmed successful installation of CUDA. The whole process may take an hour to complete if installed from scratch; there may be pre-built images out there.
awsgpu4

Note that an account with the NVIDIA Accelerated Computing Developer Program may be required to download some of these packages. A make test below confirmed the complete setup of caffe.
awsgpu5

Finally, after installing DIGITS, the login page defaults to port 5000. At the AWS console, a network connection rule can easily be set up to open this port. Alternatively, for a more secure connection, tunneling can be used instead, as shown below running at port 8500.
awsgpu6

Now it is time to start training. A new image classification dataset is created. As stated above, the source image set is obtained from data.gov.hk. At this site, traffic cameras installed at strategic road network points feed JPG images to the web for public access. The images are refreshed every 2 minutes. A simple shell script was prepared to fetch the images and build the data set. Below is the screen where DIGITS configures the classification training.
awsgpu8
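The fetch script mentioned above is trivial. A Python sketch of the idea (the camera URL here is a placeholder, not the actual data.gov.hk endpoint, and the filename scheme is my own):

```python
import time
import urllib.request

# Placeholder endpoint -- substitute the actual camera image URL from data.gov.hk.
CAMERA_URL = "https://example.org/traffic/camera.jpg"

def snapshot_name(prefix="cam", ts=None):
    """Build a timestamped filename so successive fetches do not overwrite."""
    ts = ts if ts is not None else time.time()
    return "%s-%d.jpg" % (prefix, int(ts))

def fetch_forever(interval=120):
    """Fetch one image per refresh cycle (the images update every 2 minutes)."""
    while True:
        urllib.request.urlretrieve(CAMERA_URL, snapshot_name())
        time.sleep(interval)
```

Left running for a while, this accumulates labeled-by-hand samples for the two traffic conditions.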

Since our sample data set size is small, the training completed in no time.
awsgpu9

awsgpu10

Next, a model is defined. GoogLeNet is selected in this example.
awsgpu12
awsgpu13

Model training in progress. The charts update in real time.
awsgpu17

awsgpu15
awsgpuB

When the model is complete, some tests can be carried out. In this example, the model is trained to determine whether the camera image indicates a traffic jam or not.

A traffic jam sample. Prediction: free=-64.23%, jam=89.32%
awsgpuC

The opposite. Prediction: free=116.62%, jam=-133.61%
awsgpuE

With the Amazon cloud, the ability to deploy cutting-edge AI technology on GPUs is no longer limited to researchers or those rich in resources. The general public can now benefit from these easy-to-access computing resources to explore limitless possibilities in the era of big data.

Implementing parallel GPU function in CUDA for R

There are existing R packages for CUDA, but if there is a need to call your own custom parallel code on an NVIDIA GPU from R, it is possible to do so with the CUDA Toolkit. This post demonstrates a sample function, accelerated by the GPU, that approximates the value of Pi using the Monte Carlo method. The sample is built using Visual Studio 2010, but the Toolkit is supported on Linux platforms as well. It is assumed that Visual Studio is integrated with the CUDA Toolkit.
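Before moving to the GPU, the Monte Carlo method itself can be sketched on the CPU (plain Python/numpy, my own illustration rather than the CUDA sample): draw uniform points in the unit square and count the fraction landing inside the quarter circle.

```python
import numpy as np

def monte_carlo_pi(n, seed=0):
    """Estimate pi as 4 times the fraction of uniform points in the
    unit square that fall inside the quarter circle x^2 + y^2 <= 1."""
    rng = np.random.RandomState(seed)
    x = rng.uniform(size=n)
    y = rng.uniform(size=n)
    inside = (x * x + y * y) <= 1.0
    return 4.0 * inside.mean()
```

On the GPU, each thread generates and tests its own batch of points with curand, and the per-thread counts are reduced into the final estimate.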

The first thing to do is to create a New Project using the Win32 Console Application template, specifying DLL with the Empty project option.

RCuda1
RCuda2

Then apply some standard project environment customizations, including:

CUDA Build Customization:
RCuda3

CUDA Runtime, select Shared/dynamic CUDA runtime library:
RCuda5

Project Dependencies setting. Since the CUDA code in this example utilizes curand for Monte Carlo sampling, the corresponding library must be included, or else linking will fail.
RCuda4

Finally, time to code. Only a .cu file is needed, which follows the standard structure. It is important to include the extern declaration as below for R to call.
RCuda6

After a successful compile, the DLL will be created with the CUDA code. This DLL will be registered in R for calling.
RCuda7
RCuda8

Finally, start R and issue the dyn.load command to load the DLL into the running environment. Shown below is a “wrapper” R function to make calling the CUDA code easier. Notice that at the heart of this wrapper is the .C function.
RCuda9

Last but not least, the CUDA Toolkit comes with a visual profiler capable of profiling the performance of the NVIDIA GPU. It can be launched from the GUI, or with a command line like the example below. It should be noted that the command line profiler must be started before R, or it might not be able to profile properly.
RCuda11

The GUI profiler is equipped with a nice interface to show performance statistics.
RCuda10