Tag Archives: code performance

Profiling machine learning applications in TensorFlow

TensorFlow provides the timeline package, which is imported from tensorflow.python.client:

from tensorflow.python.client import timeline

This is useful for profiling the performance of a TensorFlow application, with a graphical visualization similar to the graphs generated by the CUDA Visual Profiler. With a small tweak to the machine learning code, a TensorFlow application can store and report performance metrics of the learning process.
[Image: tfprofile3]

The design of the timeline package makes it easy to add profiling: simply add the code below.

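import tensorflow as tf  # tf.RunOptions and tf.RunMetadata are TensorFlow 1.x names
# Request a full trace and create the object that will receive the collected statistics.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()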

The model must also be compiled with the profiling options:

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'],
              options=run_options,
              run_metadata=run_metadata)

With the sample MNIST digit classifier for TensorFlow, the output shown and the Keras history are saved and can be retrieved later to generate reports.
[Image: tfprofile2]

Finally, using the Chrome tracing page (chrome://tracing/), the performance metrics persisted to the file system can be opened for inspection.
[Image: tfprofile1]
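
The file opened in chrome://tracing/ is plain JSON, so it can also be summarized in a script. The sketch below assumes the file follows the Chrome trace event format: a top-level "traceEvents" list in which complete events have "ph" set to "X" and carry a "dur" field in microseconds. The function name summarize_trace and the sample trace are hypothetical, made up for illustration and not taken from a real run.

```python
import json
from collections import defaultdict


def summarize_trace(trace_text, top=5):
    """Return (event name, total microseconds) pairs, slowest first."""
    events = json.loads(trace_text)["traceEvents"]
    totals = defaultdict(int)
    for event in events:
        # Only complete events ("X") carry their own duration;
        # metadata events ("M") and others are skipped.
        if event.get("ph") == "X":
            totals[event["name"]] += event.get("dur", 0)
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]


# Hypothetical trace with made-up numbers, only to show the input shape.
sample_trace = json.dumps({"traceEvents": [
    {"ph": "M", "name": "process_name", "pid": 0,
     "args": {"name": "/device:CPU:0"}},
    {"ph": "X", "name": "MatMul", "pid": 0, "tid": 0, "ts": 10, "dur": 120},
    {"ph": "X", "name": "Relu", "pid": 0, "tid": 0, "ts": 140, "dur": 15},
    {"ph": "X", "name": "MatMul", "pid": 0, "tid": 0, "ts": 200, "dur": 80},
]})

for name, microseconds in summarize_trace(sample_trace):
    print(name, microseconds)
```

For a real run, the text of the saved trace file would be passed in place of sample_trace.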

 


Statistical test on C code performance

Out of curiosity about how a compiler optimizes C code for mathematical functions, two implementations of the standard normal distribution are compared in terms of performance. The aim is to gain insight into how the generated machine code performs without actually having to inspect it. The function is an approximation coded in standard C. Both implementations compute the same approximation, but their expressions differ in the number of multiplication operators. The first implementation uses fewer multiplication operators:
[Image: reverse1]
i.e.
[Image: reverse2]

The second is a modified version with more multiplication operators, e.g. expanding k^5 to k*k*k*k*k.
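
The original formulas were shown as images, so the exact expressions are not reproduced here. As a rough illustration of the two styles, the sketch below assumes the approximation is the widely used Abramowitz and Stegun polynomial for the cumulative standard normal distribution (formula 26.2.17). The coefficients, the function names cnd_compact and cnd_expanded, and the choice of Python instead of C are all assumptions made for this example, so it shows the shape of the two expressions and says nothing about how a C compiler treats them.

```python
import math

# Assumed coefficients (Abramowitz and Stegun 26.2.17), not taken from the post.
P = 0.2316419
B1, B2, B3, B4, B5 = (0.319381530, -0.356563782, 1.781477937,
                      -1.821255978, 1.330274429)


def _density(x):
    """Standard normal probability density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _finish(x, poly):
    """Turn the polynomial value into a cumulative probability."""
    upper = 1.0 - _density(x) * poly
    return upper if x >= 0 else 1.0 - upper


def cnd_compact(x):
    """Nested form: the polynomial in k needs 5 multiplications."""
    k = 1.0 / (1.0 + P * abs(x))
    poly = k * (B1 + k * (B2 + k * (B3 + k * (B4 + k * B5))))
    return _finish(x, poly)


def cnd_expanded(x):
    """Expanded form: every power is written out, 15 multiplications."""
    k = 1.0 / (1.0 + P * abs(x))
    poly = (B1 * k
            + B2 * k * k
            + B3 * k * k * k
            + B4 * k * k * k * k
            + B5 * k * k * k * k * k)
    return _finish(x, poly)


# Same input grid as the test rig: 0 to 1 in steps of 0.1.
for i in range(11):
    x = i / 10
    print(x, cnd_compact(x), cnd_expanded(x))
```

Both functions return the same value up to floating-point rounding; only the number of multiplications differs.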

A scaffolding test rig loops 10 million times; within each iteration, the approximation function is called for inputs from 0 to 1 in steps of 0.1. Visual Studio is used to compile the code, and 20 samples are collected for each of the two functions. For the analysis, the TI-84 Pocket SE is used to carry out a 2-sample t-test, following the procedure below:

[Image: cndcodettest1]
[Image: cndcodettest2]
[Image: cndcodettest3]
[Image: cndcodettest4]
[Image: cndcodettest5]
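
The same kind of test can be reproduced without a calculator. The sketch below assumes the unpooled (Welch) form of the 2-sample t-test; the post does not say whether the pooled option was used. It relies only on the Python standard library, so the p-value is obtained by numerically integrating the t density with Simpson's rule, which is adequate for illustration but loses relative accuracy for extremely small p-values. The function names and the timing values are hypothetical, and the timings are not the measurements from the post.

```python
import math
from statistics import mean, variance


def t_density(t, df):
    """Probability density of Student's t distribution."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value, integrating the density over [0, |t|] (Simpson's rule)."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    area = total * h / 3
    return max(0.0, 1.0 - 2.0 * area)


def welch_t_test(a, b):
    """Return (t statistic, degrees of freedom, two-sided p-value)."""
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df, two_sided_p(t, df)


# Hypothetical timings in seconds, made up for illustration only.
compact_times = [2.31, 2.29, 2.33, 2.30, 2.32, 2.28]
expanded_times = [2.36, 2.38, 2.35, 2.39, 2.37, 2.34]

t, df, p = welch_t_test(compact_times, expanded_times)
print("t =", t, "df =", df, "p =", p)
```

A small p-value suggests that the difference in mean run time between the two versions is unlikely to be due to chance alone.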

Judging by the p-value, it looks like the first, more compact version of the C code performed better, and perhaps there is little the compiler can do to help here.