Category Archives: Statistics

Logistic Regression – from Nspire to R to Theano

Logistic regression is a very powerful tool for classification and prediction. It works very well with linearly separable problem. This installment will attempt to recap on its practical implementation, from traditional perspective by maximum likelihood, to more machine learning approach by neural network, as well as from handheld calculator to GPU cores.

The heart of the logistic regression model is the logistic function. It takes in any real value and return value in the range from 0 to 1. This is ideal for binary classifier system. The following is a graph of this function.
theanologistic1

TI Nspire

In the TI Nspire calculator, logistic regression is provided as a built-in function but is limited to single variable. For multi-valued problems, custom programming is required to apply optimization techniques to determine the coefficients of the regression model. One such application as shown below is the Nelder-Mead method in TI Nspire calculator.

Suppose in a data set from university admission records, there are four attributes (independent variables: SAT score, GPA, Interview score, Aptitude score) and one outcome (“Admission“) as the dependent variable.
theano-new1

Through the use of a Nelder-Mead program, the logistic function is first defined as l. It takes all regression coefficients (a1, a2, a3, a4, b), dependent variable (s), independent variables (x1, x2, x3, x4), and then simply return the logistic probability. Next, the function to optimize in the Nelder-Mead program is defined as nmfunc. This is the likelihood function on the logistic function. Since Nelder-Mead is a minimization algorithm the negative of this function is taken. On completion of the program run, the regression coefficients in the result matrix are available for prediction, as in the following case of a sample data with [GPA=1500, SAT=3, Interview=8, Aptitude=60].

theanologistic2(nspire1)

R

In R, as a sophisticated statistical package, the calculation is much simpler. Consider the sample case above, it is just a few lines of commands to invoke its built-in logistic model.

theano-new2

Theano

Apart from the traditional methods, modern advances in computing paradigms made possible neural network coupled with specialized hardware, for example GPU, for solving these problem in a manner much more efficiently, especially on huge volume of data. The Python library Theano is a complex library supporting and enriching these calculations through optimization and symbolic expression evaluation. It also features compiler capabilities for CUDA and integrates Computer Algebra System into Python.

One of the examples come with the Theano documentation depicted the application of logistic regression to showcase various Theano features. It first initializes a random set of data as the sample input and outcome using numpy.random. And then the regression model is created by defining expressions required for the logistic model, including the logistic function and likelihood function. Lastly by using the theano.function method, the symbolic expression graph coded for the regression model is finally compiled into callable objects for the training of neural network and subsequent prediction application.

theanologistic5(theano1)

A nice feature from Theano is the pretty printing of the expression model in a tree like text format. This is such a feel-like-home reminiscence of my days reading SQL query plans for tuning database queries.

theanologistic5(theano2).PNG

 

Advertisements

Durbin-Watson statistic in TI-84

Unlike the more sophisticated TI-89 and Nspire, the Durbin-Watson statistic is not included in the TI-84. Yet, calculating it is fairly straight-forward using list functions.

This statistics of regression is given as

durbinwatson4

where e is the residual list of values. To obtain this list (using a previous multiple regression example), simply subtract the actual values from the regression formula (Y7 below):

durbinwatson1 durbinwatson2

Finally, run the formula below for answer.

durbinwatson3

Coefficient of determination for Multiple linear regression in TI-84 Plus

After determining the parameters of multiple linear regression in TI-84 (which do not have any direct built-in function support of this calculation), the coefficient of determination can also be easily calculated using the rich set of list functions supported by TI-84. Following the previous example, the dependent variable is in Sales list, the other two independent variables are Size and Dist lists.

The Yhat list is to be prepared first. This lists store the predicted values using the regression parameters determined in the previous installment.
mreg84rsq3amreg84rsq3

Next, the mean of Y and Yhat are calculated and stored to a handy list S.

mreg84rsq2

Furthermore, three lists SYY, SYhYh, SYYh are calculated respectively.

mreg84rsq4mreg84rsq3e

mreg84rsq6mreg84rsq5

mreg84rsq8mreg84rsq7

The result is obtained by the formula below.
mreg84rsq9

Graphical visualization of data distribution in TI-84 and R

For visualizing data distribution, the TI-84 Stat plot can provide some insights. Using the same data set as in the previous installment on Shapiro-Wilk test, TI-84 Stat plot is a quick and convenient tool.

shapiro84-graphplot2 shapiro84-graphplot1

In R, the command qqnorm() will show the following plot for the same data.

shapiro84-graphplot3

P value of Shapiro-Wilk test on TI-84

In previous installment, the Shapiro-Wilk test is performed step by step on the TI-84.

To calculate the p-value from this W statistic obtained, the following steps can also be done on the TI-84 using some standard statistics function. Note that the approximation below are for the case 4 ≤ n ≤ 11.

The mean and standard deviation are derived using the below equations.
shapiro84-pvalue1

The W statistic is assigned from the correlation calculation results, and another variable V is calculated for the transformed W.
shapiro84-pvalue2

Finally, the standardized Z statistic and p-value is calculated using the mean, standard deviation, and the transformed W value.
shapiro84-pvalue3 shapiro84-pvalue4

Performing Shapiro-Wilk test on TI-84

On TI-84, although the Shapiro-Wilk test is not available as built-in function, it is possible to calculate the W statistics by using the calculator’s rich set of list and statistics function. It is not trivial, but doable.

First set up a list of sample values. Using the example from the original Shapiro-Wilk paper, the samples are input into a list called W, while the N variable denotes the dimension and U stores inverse of its square root. The data sample in W must be sorted in ascending order.

shapiro84-2 shapiro84-1

Next is to generate the M list based on W. This list is to calculate the inverse normal distribution based on the index value and the dimension of W list. Doing this in TI-84 is easy with the help of the seq() function. The complete expression is:
seq(invNorm( (I-0.375) / (N+0.25), 0, 1), I, 1, N, 1)

shapiro84-3 shapiro84-4

Now that the List M is ready, it is time to derive the sum of square of it and we store it into variable m (not the list M).
shapiro84-5

Since the approximation algorithm is used instead of look up table as proposed in the original paper, the next step is to prepare List A. This list is arranged in such a way that the first two and last two elements have different calculation than all the others. The last two values (N-1, N) of List A are prepared in the Y5 and Y6 equations. These are approximating equations. The first two elements are the negation of these two in a particular order.
shapiro84-6 shapiro84-7

One more variable, ε, has to be calculated at this point before we generate List A.
(m - 2*M(N)² - 2*M(N-1)²) / (1 - 2*Y6² - 2*Y5²)
shapiro84-8

Next is to generate the List A. As explained, four elements namely the first, second, N-1, and N elements of this list is calculated unlike all other elements. The screen below is for all except that four special elements.
shapiro84-9

The remaining steps for the four elements of this List A are then carried out as below.
shapiro84-10

Finally, all three lists are ready for calculation.
shapiro84-11

To calculate the W statistics, use the Linear Regression function in the TI-84 as below for W and A. The correlation coefficient r² is the W statistics as shown in R. See the next installment for the Z statistic and P value calculations from this W statistic, also on the TI-84.

shapiro84-14 shapiro84-18