# Looking for insights from Fitbit data with R

With a month’s worth of Fitbit data, it is about time to harvest some insights from this technology-packed wristband.

```
library(lubridate)
library(dplyr)

# Daily data exported from the Fitbit app as CSV (filename is a placeholder)
fitbitdata <- read.csv("fitbit.csv")

# Derive the day of week, with the week starting on Monday
fitbitdata <- fitbitdata %>% mutate(dow = wday(Date, week_start = 1))
fitbitdata$dowlabel <- factor(fitbitdata$dow, levels = 1:7,
                              labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
                              ordered = TRUE)

c1 <- rainbow(7)
c2 <- rainbow(7, alpha = 0.4)
c3 <- rainbow(7, v = 0.8)
boxplot(steps ~ dowlabel, data = fitbitdata,
        col = c2, medcol = c3, whiskcol = c1, staplecol = c3,
        boxcol = c3, outcol = c3, pch = 23, cex = 2)
```

The number of steps and distance traveled per day are collected from the Fitbit phone app, converted into CSV format, and then loaded into R for analysis. With a few lines of R code to draw a box plot for a day-of-week analysis, this data set combined with a third-party statistical package fills the gap until the Fitbit app offers something more sophisticated.

# Logistic Regression – from Nspire to R to Theano

Logistic regression is a very powerful tool for classification and prediction. It works very well with linearly separable problems. This installment will recap its practical implementation, from the traditional maximum-likelihood perspective to the more machine-learning-oriented neural network approach, and from handheld calculator to GPU cores.

The heart of the logistic regression model is the logistic function. It takes in any real value and returns a value in the range from 0 to 1, which makes it ideal for a binary classifier. The following is a graph of this function.
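A minimal sketch of this function in R, with a plot over a symmetric range to show the S-shaped curve:

```
# Logistic (sigmoid) function: maps any real value into (0, 1)
logistic <- function(x) 1 / (1 + exp(-x))

# Plot over a symmetric range to show the S-shaped curve
curve(logistic, from = -6, to = 6, ylab = "logistic(x)")
```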

## TI Nspire

In the TI Nspire calculator, logistic regression is provided as a built-in function but is limited to a single variable. For multi-variable problems, custom programming is required to apply optimization techniques to determine the coefficients of the regression model. One such application, shown below, is the Nelder-Mead method on the TI Nspire.

Suppose in a data set from university admission records, there are four attributes (independent variables: SAT score, GPA, Interview score, Aptitude score) and one outcome (“Admission”) as the dependent variable.

Through the use of a Nelder-Mead program, the logistic function is first defined as l. It takes all regression coefficients (a1, a2, a3, a4, b), the dependent variable (s), and the independent variables (x1, x2, x3, x4), and simply returns the logistic probability. Next, the function to optimize in the Nelder-Mead program is defined as nmfunc. This is the likelihood function built on the logistic function. Since Nelder-Mead is a minimization algorithm, the negative of this function is taken. On completion of the program run, the regression coefficients in the result matrix are available for prediction, as in the following case of a sample record with [SAT=1500, GPA=3, Interview=8, Aptitude=60].
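The same approach can be sketched in R with `optim()`, whose default method is Nelder-Mead. The data below are randomly generated stand-ins for the admission records, and the coefficient values are arbitrary:

```
set.seed(1)
n <- 200
# Hypothetical standardized predictors standing in for the admission records
X <- cbind(sat = rnorm(n), gpa = rnorm(n),
           interview = rnorm(n), aptitude = rnorm(n))
y <- rbinom(n, 1, 1 / (1 + exp(-(X %*% c(1.2, 0.8, -0.5, 0.3) - 0.2))))

# Negative log-likelihood of the logistic model (the function to minimize)
nll <- function(theta) {
  eta <- X %*% theta[1:4] + theta[5]   # a1..a4 plus intercept b
  p   <- 1 / (1 + exp(-eta))
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

fit <- optim(rep(0, 5), nll, method = "Nelder-Mead",
             control = list(maxit = 2000))
fit$par   # regression coefficients a1..a4 and b
```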

## R

In R, a sophisticated statistical package, the calculation is much simpler. For the sample case above, it takes just a few lines of commands to invoke the built-in logistic model.
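A sketch of the built-in call (`glm` with `family = binomial` fits the logistic model by maximum likelihood). The data frame and its column names are hypothetical placeholders for the real admission records:

```
set.seed(2)
# Hypothetical admission records standing in for the real data set
admissions <- data.frame(sat = rnorm(50, 1400, 100), gpa = rnorm(50, 3, 0.4),
                         interview = rnorm(50, 7, 2), aptitude = rnorm(50, 60, 10))
admissions$admit <- rbinom(50, 1, 0.5)

# Built-in logistic regression via the generalized linear model
model <- glm(admit ~ sat + gpa + interview + aptitude,
             data = admissions, family = binomial)
summary(model)

# Predicted admission probability for the sample case in the text
predict(model,
        newdata = data.frame(sat = 1500, gpa = 3, interview = 8, aptitude = 60),
        type = "response")
```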

## Theano

Apart from the traditional methods, modern advances in computing have made it possible to solve these problems much more efficiently with neural networks coupled with specialized hardware such as GPUs, especially on huge volumes of data. The Python library Theano supports and enriches these calculations through optimization and symbolic expression evaluation. It also features compiler capabilities for CUDA and integrates a computer algebra system into Python.

One of the examples that comes with the Theano documentation depicts the application of logistic regression to showcase various Theano features. It first initializes a random set of data as the sample input and outcome using numpy.random. The regression model is then created by defining the expressions required for the logistic model, including the logistic function and the likelihood function. Lastly, using the theano.function method, the symbolic expression graph coding the regression model is compiled into callable objects for training the neural network and for subsequent prediction.

A nice feature of Theano is the pretty printing of the expression model in a tree-like text format. This is a feel-like-home reminiscence of my days reading SQL query plans for tuning database queries.

# Stochastic Gradient Descent in R

Stochastic Gradient Descent (SGD) is an optimization method commonly used in machine learning, especially for neural networks. As the name implies, it is aimed at minimizing a function.
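Before turning to the package, the idea can be sketched from scratch in a few lines of base R: for logistic regression, each step nudges the weights along the gradient of the log-loss for a single randomly chosen observation. The data and learning rate here are arbitrary choices for illustration:

```
set.seed(3)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, 1 / (1 + exp(-(2 * x - 1))))   # true model: w = 2, b = -1

w <- 0; b <- 0; lr <- 0.1
for (step in 1:20000) {
  i <- sample(n, 1)                       # one random observation per step
  p <- 1 / (1 + exp(-(w * x[i] + b)))     # current prediction
  w <- w - lr * (p - y[i]) * x[i]         # gradient of log-loss w.r.t. w
  b <- b - lr * (p - y[i])                # ... and w.r.t. b
}
c(w = w, b = b)   # should land near the true values, up to SGD noise
```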

In R, there is an sgd package for the purpose. As a warm-up for the newly upgraded R and RStudio, it is taken as the target of a test drive.

Running the documentation example.

Running the included demo for logistic regression.

Although the old version served well, it is still nice to see these two brothers getting upgraded.

```
wget https://download2.rstudio.org/rstudio-server-0.99.891-amd64.deb
sudo apt-get install gdebi-core
sudo gdebi rstudio-server-0.99.891-amd64.deb
```

For reasons unknown to me, the R that came installed was a very old one (dating back to 2013), so a more recent R was installed manually. The following commands stop RStudio for the upgrade (and the last one restarts it).

```
sudo rstudio-server offline
sudo rstudio-server force-suspend-all
sudo rstudio-server online
```

However, for my installation that did not work. What worked was logging in to RStudio and then using “Session > Restart R”.

A fresh start of R / RStudio!

# Implementing parallel GPU function in CUDA for R

There are existing R packages for CUDA, but if there is a need to call your own custom parallel code on an NVIDIA GPU from R, it is possible to do so with the CUDA Toolkit. This post demonstrates a sample function, accelerated by the GPU, that approximates the value of Pi using the Monte Carlo method. The sample is built using Visual Studio 2010, but the Toolkit is supported on Linux platforms as well. It is assumed that Visual Studio is integrated with the CUDA Toolkit.
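For reference, the Monte Carlo estimate being accelerated can be sketched in a few lines of plain R: throw random points into the unit square and count how many fall inside the quarter circle.

```
set.seed(4)
n <- 1e6
x <- runif(n); y <- runif(n)
inside <- x^2 + y^2 <= 1   # points inside the quarter circle
4 * mean(inside)           # estimate of pi
```

This is the baseline the GPU version parallelizes: each thread generates its own batch of points and the counts are reduced at the end.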

The first thing to do is to create a New Project using the Win32 Console Application template, and specify DLL with Empty project option.

And then, some standard project environment customization including:

CUDA Build Customization:

CUDA Runtime, select Shared/dynamic CUDA runtime library:

Project Dependencies setting: since the CUDA code in this example utilizes the cuRAND library for Monte Carlo sampling, the corresponding library must be included or else linking will fail.

Finally, time to code. Only a .cu file is needed, following the standard CUDA structure. It is important to include the extern declaration as below so that R can call the exported function.

After a successful compile, the DLL will be created with the CUDA code. This DLL will be registered in R for calling.

Finally, start R and issue the dyn.load command to load the DLL into the running environment. Shown below is a “wrapper” R function to make calling the CUDA code easier. Notice at the heart of this wrapper is the .C function.
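A sketch of what such a wrapper might look like; the DLL name, entry-point name, and argument layout are all hypothetical and depend on how the CUDA function was actually exported:

```
# Load the compiled CUDA DLL ("cudaPi.dll" is a placeholder name)
# dyn.load("cudaPi.dll")

# Wrapper around the exported C entry point; "cuda_pi" is a placeholder name.
# .C passes n in and reads the estimate back out of the result slot.
cudaPi <- function(n) {
  out <- .C("cuda_pi", as.integer(n), result = double(1))
  out$result
}
```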

Last but not least, the CUDA Toolkit comes with a visual profiler for profiling the performance of the NVIDIA GPU. It can be launched from the GUI, or from a command line like the example below. Note that the command-line profiler must be started before R, or it might not be able to profile properly.

The GUI profiler is equipped with a nice interface to show performance statistics.

# Data input for ANOVA in TI nspire and R

In the TI Nspire CX, the Lists & Spreadsheet application provides a convenient, Excel-like interface for data input.

The data can also be named by column and recalled from the Calculator application, after which statistical functions can be applied. Using a sample from the classic TI-89 statistics guidebook on determining the interaction between two factors using 2-way ANOVA, the same output is obtained from the TI Nspire CX.

In R, data are usually imported from a CSV file using the read.csv() command; other formats, including SPSS and Excel, are also supported. For more casual data entry where command-line input suffices, raw data are usually stored into a vector using the c() command. Working with ANOVA on data entered this way is not as straightforward, because the analysis needs a dimension structure on top of the values stored in the vector.

To accomplish the ANOVA, factor data types are used in conjunction with the data vector. Below is the same TI example completed in R. First we define the data vector ordered by club (c1 = driver, c2 = five iron) and then by brand (b1-, b2-, b3-, with the last digit as the sample number), i.e.
{c1,b1-1}; {c1,b1-2}; {c1,b1-3}; {c1,b1-4};
{c1,b2-1}; {c1,b2-2};…
{c2,b1-1}; {c2,b1-2};…

Two Factor variables are then created, one for club (with twelve 1’s followed by twelve 2’s), and another for brand (1 to 3 each repeating four times for each sample, and then completed by another identical sequence).

These two factor variables essentially represent the position (or index, in array terms) of the nth data value with respect to the factor level it belongs to, and can be better visualized in the following table.

Finally, the 2-way ANOVA can be performed using the following commands.
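A sketch of those commands, following the layout described above (twelve observations per club; brands 1 to 3 repeating four times each, twice over). The distance values below are made-up placeholders; substitute the 24 measurements from the TI example:

```
# Hypothetical distances; substitute the 24 values from the TI example
distance <- c(232, 248, 251, 245, 259, 255, 243, 239, 265, 262, 270, 258,
              171, 179, 186, 182, 191, 186, 179, 184, 196, 192, 201, 194)

club  <- factor(rep(1:2, each = 12), labels = c("driver", "five iron"))
brand <- factor(rep(rep(1:3, each = 4), times = 2))

summary(aov(distance ~ club * brand))     # 2-way ANOVA with interaction term
interaction.plot(brand, club, distance)   # interaction plot
```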

Interaction plot in R.

# R on BeagleBoard with LCD and Android UI

After the successful installation of R on the BeagleBoard-xM, it was natural to come up with the concept of a calculator running R. The ingredients are simple: a small form factor BeagleBoard, an LCD screen, and a simple keypad. R can run on Android, but for a keypad enthusiast, no touch screen beats the feel of a real key punch.

As a proof of concept and a weekend project, R on the BeagleBoard is “bridged” across an Apache web server (also on the BB) and accessed by an Android app that acts as the input device for R commands. The result returned from R is then forwarded to a 16×2 LCD screen from a previous IoT prototype built with a TI MSP430 MCU and a TP-Link portable router running OpenWRT.

A little delay, but works.

The Android app is built with MIT App Inventor, a very nice graphical, easy-to-use, completely web-based tool for building Android apps. No typing of code is required.

# k-means clustering using TI Nspire

k-means clustering is probably the simplest clustering algorithm. Using the built-in TI Basic, the algorithm can easily be implemented. Source data can be edited in the default spreadsheet editor. A sample data set with two attributes, length and weight, on three pet types (dogs, cats, and rabbits) is used for testing, with three centroids.

The following is a scatter plot using the default TI Nspire data & statistics application.

The same plot after defining weight and length.

Running the k-means program: the dist_to_cluster matrix contains the distance to each centroid for each row of data; column-wise, the first to third columns hold the distance to the corresponding centroid. The last column is the cluster assigned, determined by the minimum of the distances to the three centroids. The last command on the screen transposes the matrix to fetch the fourth column as a row and paste it back into the spreadsheet for comparison.

A comparison chart using R. The default plot by R looks better than the Nspire’s, and the labels are grouped automatically by the cluster found.
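The R side can be sketched with the built-in `kmeans()` function; the length and weight values below are made-up stand-ins for the actual pet data:

```
set.seed(5)
# Hypothetical pet measurements: dogs, cats, rabbits (10 of each)
pets <- data.frame(
  length = c(rnorm(10, 90, 5), rnorm(10, 45, 4), rnorm(10, 35, 3)),
  weight = c(rnorm(10, 30, 3), rnorm(10, 4, 0.5), rnorm(10, 2, 0.3))
)

fit <- kmeans(pets, centers = 3)            # 3-centroid clustering
plot(pets, col = fit$cluster, pch = 19)     # colour points by cluster found
points(fit$centers, pch = 4, cex = 2)       # mark the three centroids
```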

# RStudio

RStudio is a great addition after migrating from Amazon Linux to Ubuntu. Not only does it provide a nice modern web interface to R, linking to Dropbox is the icing on the cake.