
Logistic Regression – from Nspire to R to Theano

Logistic regression is a powerful tool for classification and prediction. It works well when the classes are linearly separable or nearly so. This installment recaps its practical implementation, from the traditional maximum likelihood perspective to the machine learning approach with neural networks, and from a handheld calculator to GPU cores.

The heart of the logistic regression model is the logistic function. It takes any real value and returns a value in the range from 0 to 1, which is ideal for a binary classifier. The following is a graph of this function.
theanologistic1
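For readers who prefer code to a graph, here is a minimal Python sketch of the logistic function. The function name and the sample inputs are my own choices for illustration and are not taken from any of the tools discussed in this post.

```python
import math

def logistic(z):
    """Map any real number z to a value between 0 and 1."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Large negative inputs approach 0, large positive inputs approach 1,
# and an input of 0 sits exactly in the middle.
for z in (-5, 0, 5):
    print(z, round(logistic(z), 4))
```

The curve is symmetric around 0, so logistic(z) and logistic(-z) always add up to 1.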

TI Nspire

In the TI Nspire calculator, logistic regression is provided as a built-in function but is limited to a single independent variable. For problems with multiple variables, custom programming is required to apply optimization techniques that determine the coefficients of the regression model. One such technique, shown below, is the Nelder-Mead method implemented on the TI Nspire.

Suppose a data set of university admission records has four attributes as independent variables (SAT score, GPA, Interview score, Aptitude score) and one outcome (“Admission”) as the dependent variable.
theano-new1

With a Nelder-Mead program, the logistic function is first defined as l. It takes the regression coefficients (a1, a2, a3, a4, b), the dependent variable (s), and the independent variables (x1, x2, x3, x4), and simply returns the logistic probability. Next, the function to optimize in the Nelder-Mead program is defined as nmfunc. This is the likelihood function built on the logistic function. Since Nelder-Mead is a minimization algorithm, the negative of this function is taken. When the program run completes, the regression coefficients in the result matrix are available for prediction, as in the following case of a sample record with [SAT=1500, GPA=3, Interview=8, Aptitude=60].

theanologistic2(nspire1)
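The same idea can be sketched in plain Python. This is not the Nspire program. It is a simplified Nelder-Mead (reflection, expansion, inside contraction, shrink) written for illustration, and the data set with only two attributes, the function names, the step size and the iteration count are all hypothetical.

```python
import math

def logistic(z):
    # Sign-aware form that avoids overflow in math.exp.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def neg_log_likelihood(params, rows):
    """Negative log-likelihood; params = [a1, ..., ak, b]."""
    *coefs, intercept = params
    total = 0.0
    for *xs, admitted in rows:
        p = logistic(sum(a * x for a, x in zip(coefs, xs)) + intercept)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep log() finite
        total += math.log(p) if admitted else math.log(1.0 - p)
    return -total

def nelder_mead(f, x0, step=0.5, iters=500):
    """Simplified Nelder-Mead minimizer (illustrative, not tuned)."""
    n = len(x0)
    simplex = [list(x0)]
    for i in range(n):  # one extra vertex displaced along each axis
        p = list(x0)
        p[i] += step
        simplex.append(p)
    for _ in range(iters):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        centroid = [sum(p[i] for p in simplex[:-1]) / n for i in range(n)]

        def along(t):
            # Point on the line through the centroid, away from the worst vertex.
            return [c + t * (c - w) for c, w in zip(centroid, worst)]

        reflected = along(1.0)
        if f(reflected) < f(best):
            expanded = along(2.0)
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(simplex[-2]):
            simplex[-1] = reflected
        else:
            contracted = along(-0.5)
            if f(contracted) < f(worst):
                simplex[-1] = contracted
            else:  # shrink every vertex halfway toward the best one
                simplex = [best] + [
                    [b + 0.5 * (x - b) for b, x in zip(best, p)]
                    for p in simplex[1:]
                ]
    return min(simplex, key=f)

# Hypothetical records: (GPA, interview score, admitted). Not the post's data.
rows = [(2.0, 4, 0), (2.5, 6, 0), (2.8, 5, 1), (3.0, 3, 0),
        (3.2, 7, 1), (3.5, 5, 0), (3.6, 8, 1), (3.9, 6, 1)]

start = [0.0, 0.0, 0.0]
fit = nelder_mead(lambda p: neg_log_likelihood(p, rows), start)
prob = logistic(fit[0] * 3.4 + fit[1] * 7 + fit[2])  # predict a new applicant
print(fit, prob)
```

Minimizing the negative log-likelihood gives the same coefficients as maximizing the likelihood, and the logarithm keeps the numbers in a comfortable range.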

R

R is a sophisticated statistical package, so the calculation is much simpler. For the sample case above, it takes just a few lines of commands to invoke the built-in logistic model.

theano-new2

Theano

Apart from the traditional methods, modern advances in computing have made it possible to pair neural networks with specialized hardware such as GPUs, solving these problems much more efficiently, especially on huge volumes of data. The Python library Theano is a sophisticated library that supports these calculations through optimization and symbolic expression evaluation. It also features compiler capabilities for CUDA and brings aspects of a computer algebra system into Python.

One of the examples that comes with the Theano documentation applies logistic regression to showcase various Theano features. It first initializes a random set of data as the sample input and outcome using numpy.random. The regression model is then created by defining the expressions it requires, including the logistic function and the likelihood function. Lastly, the theano.function method compiles the symbolic expression graph for the model into callable objects used for training the network and for subsequent prediction.

theanologistic5(theano1)
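To make the flow easier to follow without a Theano installation, here is a rough plain-Python sketch of the same three stages: random data, model expressions, and a training step followed by prediction. Theano is not used here, the gradients are written out by hand instead of being derived symbolically, and the sizes, learning rate and regularization weight are assumptions of mine.

```python
import math
import random

rng = random.Random(0)
N, FEATS, STEPS, LR, L2 = 40, 5, 200, 0.1, 0.01  # assumed settings

# Stage 1: random sample input and random 0/1 outcomes.
X = [[rng.gauss(0, 1) for _ in range(FEATS)] for _ in range(N)]
y = [rng.randint(0, 1) for _ in range(N)]

# Stage 2: model parameters and expressions.
w = [rng.gauss(0, 1) for _ in range(FEATS)]
b = 0.0

def predict_proba(row, w, b):
    z = sum(wi * xi for wi, xi in zip(w, row)) + b
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, b):
    # Mean cross-entropy plus an L2 penalty on the weights.
    xent = 0.0
    for row, target in zip(X, y):
        p = predict_proba(row, w, b)
        xent -= target * math.log(p) + (1 - target) * math.log(1 - p)
    return xent / N + L2 * sum(wi * wi for wi in w)

def train_step(w, b):
    # Hand-written gradient of cost(); Theano would derive this symbolically.
    gw = [2 * L2 * wi for wi in w]
    gb = 0.0
    for row, target in zip(X, y):
        err = (predict_proba(row, w, b) - target) / N
        for j, xj in enumerate(row):
            gw[j] += err * xj
        gb += err
    return [wi - LR * g for wi, g in zip(w, gw)], b - LR * gb

# Stage 3: train, then predict.
initial_cost = cost(w, b)
for _ in range(STEPS):
    w, b = train_step(w, b)
final_cost = cost(w, b)
predictions = [int(predict_proba(row, w, b) > 0.5) for row in X]
print(initial_cost, final_cost)
```

Because the outcomes are random, the fitted model has nothing real to learn. The point is only to see the cost go down as the training step is applied repeatedly.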

A nice feature of Theano is the pretty printing of the expression model in a tree-like text format. It feels like home, reminiscent of my days reading SQL query plans to tune database queries.

theanologistic5(theano2).PNG

 

GARCH model in R

A much more practical approach than calculating GARCH parameters on a calculator is to do it in R. Not only are packages readily available, but retrieving financial data for experimenting is also a piece of cake, as the built-in facilities offer convenient access to historical data.

To use GARCH in R, the library must be installed first.

fgarch1

To test the library, data are imported using the tSeries package.

fgarch2

 

A plot of the log return.

fgarch3a

fgarch3
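For reference, the log return of a price series is the natural logarithm of each price divided by the previous one. A small Python sketch follows, with made-up prices rather than the data imported above.

```python
import math

def log_returns(prices):
    """Return ln(P_t / P_{t-1}) for each consecutive pair of prices."""
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]

prices = [100.0, 102.0, 101.0, 105.0]  # hypothetical closing prices
returns = log_returns(prices)
print(returns)
```

A handy property is that log returns add up: their sum over the whole period equals the log of the last price over the first.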

 

Before running the GARCH model, a QQ plot is reviewed.

fgarch4a
fgarch4

 

Finally, the GARCH model is created using the command below.

fgarch6
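To give a feel for what the fitted parameters mean, below is a Python sketch of the conditional variance recursion, assuming a GARCH(1,1) model with zero-mean returns. The model order, the parameter values and the function name are assumptions for illustration. They are not the output of the R run above.

```python
def garch11_variance(returns, omega, alpha, beta):
    """Conditional variances: s2[t] = omega + alpha * r[t-1]**2 + beta * s2[t-1]."""
    # Start from the long-run variance, which requires alpha + beta < 1.
    sigma2 = [omega / (1.0 - alpha - beta)]
    for r in returns[:-1]:
        sigma2.append(omega + alpha * r * r + beta * sigma2[-1])
    return sigma2

# Hypothetical returns and parameters.
sigma2 = garch11_variance([1.0, -2.0, 0.5], omega=0.1, alpha=0.1, beta=0.8)
print(sigma2)
```

A large return in one period, positive or negative, pushes up the variance for the next period. This is the volatility clustering that GARCH is meant to capture.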

 

Density plot.

fgarch5a

fgarch5

 

With the trace option turned off, a clean model summary can be printed after running the model.

fgarch7

Graphical visualization of data distribution in TI-84 and R

The TI-84 Stat plot is a quick and convenient tool for gaining some insight into a data distribution. The plots below use the same data set as the previous installment on the Shapiro-Wilk test.

shapiro84-graphplot2 shapiro84-graphplot1

In R, the command qqnorm() will show the following plot for the same data.

shapiro84-graphplot3
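The points of a normal QQ plot can also be computed by hand: sort the sample and pair each value with the standard normal quantile at its plotting position. The Python sketch below uses the plotting positions (i - a) / (n + 1 - 2a), with a = 3/8 for up to ten points and 1/2 otherwise, which as far as I know is the convention behind qqnorm(). The sample values and the function name are hypothetical.

```python
from statistics import NormalDist

def qq_points(sample):
    """Pair theoretical standard normal quantiles with the sorted sample."""
    n = len(sample)
    a = 3.0 / 8.0 if n <= 10 else 0.5  # assumed plotting-position constant
    std_normal = NormalDist()
    theoretical = [
        std_normal.inv_cdf((i - a) / (n + 1 - 2 * a)) for i in range(1, n + 1)
    ]
    return list(zip(theoretical, sorted(sample)))

points = qq_points([4.1, 5.3, 3.8, 6.0, 4.9])  # hypothetical sample
for q, x in points:
    print(round(q, 3), x)
```

If the sample is close to normal, these points fall roughly on a straight line.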

Shapiro-Wilk test for normality

The Shapiro-Wilk test is suitable for testing normality, and its result, presented as a p-value, is easy to interpret. In its original form the test is limited to 8 to 50 samples. A revised approximation method extends the limit to 5000. The calculation steps are quite complex for a handheld calculator like the TI-84, but R has a built-in function. Using the example from the original Shapiro-Wilk paper, calculating the statistic in R is simple:

shapiro

Stochastic Gradient Descent in R

Stochastic Gradient Descent (SGD) is an optimization method commonly used in machine learning, especially in neural networks. As the name implies, it minimizes a function by descending along its gradient, estimated from randomly chosen samples.
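As a quick illustration of the idea, independent of any R package, here is a small Python sketch that fits a straight line by SGD, updating the parameters after every single sample. The data, learning rate and epoch count are made up for the example.

```python
import random

def sgd_line_fit(xs, ys, lr=0.1, epochs=200, seed=0):
    """Fit y = w * x + b by minimizing squared error one sample at a time."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a random order each epoch
        for i in order:
            err = (w * xs[i] + b) - ys[i]
            # Gradient of 0.5 * err**2 with respect to w and b.
            w -= lr * err * xs[i]
            b -= lr * err
    return w, b

xs = [i / 10 - 1 for i in range(21)]  # 21 points from -1 to 1
ys = [2 * x + 1 for x in xs]          # noise-free line: slope 2, intercept 1
w, b = sgd_line_fit(xs, ys)
print(w, b)
```

Each update looks at one sample only, which is what makes the method cheap enough for very large data sets.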

In R, there is an SGD package for this purpose. As a warm-up for the newly upgraded R and RStudio, it is taken for a test drive.

R-sgd1

Running the documentation example.
R-sgd2

Running the included demo for logistic regression.
R-sgd3

Upgrading R and RStudio

Although the old versions served well, it is still nice to see these two brothers getting upgraded.

RStudio is upgraded first.

wget https://download2.rstudio.org/rstudio-server-0.99.891-amd64.deb
sudo apt-get install gdebi-core
sudo gdebi rstudio-server-0.99.891-amd64.deb

For reasons unknown to me, the R that came installed was a very old one (dating back to 2013), so a more recent R was installed manually. The following commands stop RStudio for the upgrade, and the last one restarts it.

sudo rstudio-server offline
sudo rstudio-server force-suspend-all
sudo rstudio-server online

However, that did not work for my installation. What worked was logging in to RStudio and then using “Session > Restart R”.

A fresh start of R / RStudio!
newR