Installing TensorFlow with GPU support on a RedHat supercluster

I am working on a deep learning model for text summarization, and I use TensorFlow as my main framework. It is a great framework with many built-in functions that ease implementation. However, training my model was too slow: it took about 7 seconds to train a batch of 10 documents. Since my dataset contains about 150,000 documents, one epoch over all documents would take around 29 hours, and the model generally needs more than 10 epochs to converge, so I had to resolve this. Besides optimizing the neural network itself, the preferred option is to train on a GPU and take advantage of its fast matrix computation.
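That estimate is simple arithmetic over the numbers above:

```shell
# 150,000 documents at 10 documents per batch, 7 seconds per batch:
batches_per_epoch=$(( 150000 / 10 ))            # 15000 batches
seconds_per_epoch=$(( batches_per_epoch * 7 ))  # 105000 seconds
hours_per_epoch=$(( seconds_per_epoch / 3600 ))
echo "~${hours_per_epoch} hours per epoch"      # ~29 hours
```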

Unfortunately, the TensorFlow package installed with pip on my school's supercluster did not support GPU, because it requires both CUDA 7.5 and cuDNN, neither of which the server provides. The server has CUDA, but only version 7.0, and no cuDNN at all. Hence, according to the TensorFlow tutorial, my best option was to build TensorFlow from source. Even though the guideline on the TensorFlow website looks simple, getting it to work took a lot of effort. But it was totally worth it, and I am happy with the result. In this post, I walk you through the steps I took to successfully compile TensorFlow with GPU support. It took me about a day to fix the issues that occurred during compilation; I hope it takes you less than an hour if you follow this tutorial.

1. Environment Information

Server: RedHat (CentOS release 6.5 (Final))
Python: 2.7.11 :: Anaconda custom (64-bit)
Default gcc (at /usr/bin/gcc): gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC)
Default ld (at /usr/bin/ld): GNU ld version 2.20.51.0.2-5.36.el6 20100205

2. Compiling Bazel

First, we need Bazel to compile TensorFlow. I could not use the Bazel installer because the version of glibc on the server was not compatible with it: mine was 2.12, while Bazel asked for glibc 2.14. Of course, we could not upgrade glibc, since I did not have root permission. So, as you might expect, we have to build Bazel from source as well. (I found great help in this link: https://github.com/bazelbuild/bazel/issues/418#issuecomment-156147911)

Because of an issue related to the /tmp folder, we have to build Bazel directly inside it.

cd /tmp
git clone https://github.com/bazelbuild/bazel.git
cd /tmp/bazel

Before compiling, I had to switch to another gcc. The default gcc (at /usr/bin/gcc) on the server was 4.4.7, and I could not compile Bazel with it. I activated the gnu_4.9.2 module instead:

module load gnu_4.9.2
which gcc
/opt/rh/devtoolset-3/root/usr/bin/gcc

Next, replace all references to the old gcc with the new one:

setenv GCC `which gcc`
sed -i -e "s=/usr/bin/gcc=$GCC=g" tools/cpp/CROSSTOOL

Note that my login shell was csh, which is why I used the setenv command. If you are using bash (or another Bourne-style shell), use export instead (e.g., export GCC=`which gcc`).

Then, prepare a bazelrc file in /tmp to pass options when compiling Bazel.

cd /tmp
cat >/tmp/bazelrc <<EOF
startup --batch
build --spawn_strategy=standalone --genrule_strategy=standalone
EOF

Let's build it

cd /tmp/bazel
BAZELRC=/tmp/bazelrc ./compile.sh

You should see the following output:

Build successful! Binary is here: /tmp/bazel/output/bazel

Copy the bazel binary to your own directory; from here on, I will refer to its location as BAZEL_PATH.
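For example, a minimal sketch assuming you keep personal tools under ~/software/bazel (a hypothetical location; use any directory you own):

```shell
# Hypothetical destination -- pick any directory you own
dest="${HOME}/software/bazel"
mkdir -p "$dest"
# Copy the binary out of /tmp so it survives cleanup (if the build produced it)
if [ -f /tmp/bazel/output/bazel ]; then
  cp /tmp/bazel/output/bazel "$dest/"
fi
# BAZEL_PATH is then $dest/bazel; adding it to PATH lets you run plain `bazel`
export PATH="$dest:$PATH"
```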

3. Compiling TensorFlow

Activate the CUDA 7.0 and Java 8 modules first:

module unload cuda/6.5
module load cuda/7.0
module load java/java8

Download cuDNN from the NVIDIA website (you might need to register to download it) and extract it into a folder, for example ~/software/cudnn. Then update the LD_LIBRARY_PATH environment variable (you should also put this line in ~/.cshrc, or ~/.bash_profile if you use bash):

setenv LD_LIBRARY_PATH ${HOME}/software/cudnn/lib64:${LD_LIBRARY_PATH}
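If your shell is bash rather than csh, the equivalent line (assuming the same ~/software/cudnn location) is:

```shell
# bash equivalent of the csh setenv line: prepend the cuDNN lib directory
export LD_LIBRARY_PATH="${HOME}/software/cudnn/lib64:${LD_LIBRARY_PATH}"
```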

Configure the TensorFlow build environment:

./configure

This script asks for several pieces of information, and it is important to enter them correctly. The answers may differ from system to system; an empty answer means using the default option. Here is my session:

Please specify the location of python. [Default is /home/s1510032/anaconda2/bin/python]:        
Do you wish to build TensorFlow with Google Cloud Platform support? [y/N]
No Google Cloud Platform support will be enabled for TensorFlow
Do you wish to build TensorFlow with GPU support? [y/N] y
GPU support will be enabled for TensorFlow
Please specify which gcc should be used by nvcc as the host compiler. [Default is /opt/rh/devtoolset-3/root/usr/bin/gcc]:
Please specify the Cuda SDK version you want to use, e.g. 7.0. [Leave empty to use system default]:
Please specify the location where CUDA  toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: /opt/cuda/7.0
Please specify the Cudnn version you want to use. [Leave empty to use system default]:
Please specify the location where cuDNN  library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: ~/software/cuda
libcudnn.so resolves to libcudnn.4
Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size.
[Default is: "3.5,5.2"]:

Now comes the hardest part. The next step is to build a pip package for TensorFlow using Bazel. On the TensorFlow website, it is written as a single simple command:

 bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

But things are not that easy: we are on a shared host, there is no root access, and the installed libraries can differ from what the build scripts expect. We need to update several files before running the command above.

3.1. Modify CROSSTOOL

By default, the script that builds TensorFlow assumes that our gcc is at /usr/bin/gcc and our libraries are at /usr/lib/gcc. This information is hard-coded in the file third_party/gpus/crosstool/CROSSTOOL.tpl. But remember that we have switched to gnu_4.9.2, so we need to update the CROSSTOOL file with the new gcc path. In addition, the ld command is no longer /usr/bin/ld, so we need to update that as well. Open the CROSSTOOL file in your favorite editor (vi, nano, ...) and do the following:

  • replace all references to /usr/bin/gcc with /opt/rh/devtoolset-3/root/usr/bin/gcc (use the command which gcc to find this path on your system)
  • replace all references to /usr/bin/ld with /opt/rh/devtoolset-3/root/usr/bin/ld (use which ld to find it)
  • replace linker_flag: "-B/usr/bin" with linker_flag: "-B/opt/rh/devtoolset-3/root/usr/bin"

In addition, we need to change the paths to the header files used when compiling TensorFlow. These paths are specified with the cxx_builtin_include_directory option in the CROSSTOOL file. Here are the old entries:

cxx_builtin_include_directory: "/usr/lib/gcc/"
cxx_builtin_include_directory: "/usr/local/include"
cxx_builtin_include_directory: "/usr/include"

After updating:

cxx_builtin_include_directory: "/opt/rh/devtoolset-3/root/usr/lib/gcc/x86_64-redhat-linux/4.9.2/include"
cxx_builtin_include_directory: "/opt/rh/devtoolset-3/root/usr/include/c++/4.9.2"
cxx_builtin_include_directory: "/opt/cuda/7.0/include"
cxx_builtin_include_directory: "/usr/include"

3.2. Modify crosstool_wrapper_driver_is_not_gcc.tpl

This file is located at third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc.tpl. Open it and replace LLVM_HOST_COMPILER_PATH with the new gcc path, like this:

LLVM_HOST_COMPILER_PATH = ('/opt/rh/devtoolset-3/root/usr/bin/gcc')

You should also pay attention to the first line of this file, which is something like #!/usr/bin/env python. By default, the file is run by the system's default Python interpreter. Since I was using Anaconda and my Python was different, I had to change this as well. Replace this line with your own Python path, like this:

#!/home/s1510032/anaconda2/bin/python

If you do not do this, you might get an error about the argparse module when building TensorFlow. That module might not be available in your default system Python interpreter, so it is better to use your own interpreter, into which you can install any package you like.

3.3. Create ~/.bazelrc

This file is different from the one we created to build Bazel; this one is for building TensorFlow. Create it in your home directory with content like this:

build --verbose_failures --linkopt=-Wl,-rpath,/opt/rh/devtoolset-3/root/usr/lib64 --linkopt=-Wl,-rpath,/u/drspeech/opt/jdks/jdk1.8.0_25/lib --linkopt=-lz --linkopt=-lrt --linkopt=-lm --genrule_strategy=standalone --spawn_strategy=standalone --linkopt=-Wl,-rpath,/opt/cuda/7.0/lib64

3.4. Build TensorFlow

Okay, it seems we are almost done. Try building tensorflow with GPU support:

bazel build -c opt --config=cuda --genrule_strategy=standalone --spawn_strategy=standalone //tensorflow/tools/pip_package:build_pip_package

It takes several minutes to compile, and then it might stop with an error like this:

Undefined reference to symbol 'ceil@@GLIBC_2.2.5' at build time

The fix is to modify the file bazel-tensorflow/external/protobuf/BUILD to add extra linker flags (-lrt and -lm):

LINK_OPTS = select({
    ":android": [],
    "//conditions:default": ["-lpthread","-lrt","-lm"],
})

(Some people say that the extra linker flags should be added in google/protobuf/BUILD instead, but I could not find that file.)

After applying the fix, run the build command again:

bazel build -c opt --config=cuda --genrule_strategy=standalone --spawn_strategy=standalone //tensorflow/tools/pip_package:build_pip_package

You should get no error messages this time. Once the build succeeds, run the following command to build the wheel (Python distribution package):

bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

Then install the new package with pip (make sure you have removed the old TensorFlow first):

pip install /tmp/tensorflow_pkg/tensorflow-0.10.0rc0-py2-none-any.whl

4. Testing

You can check whether your new TensorFlow build works as expected by running the following script in Python. If there are no errors and the device placement log shows the operations on gpu:0, TensorFlow has been installed correctly.

import tensorflow as tf
# Create a graph and pin its ops to the first GPU.
with tf.device('/gpu:0'):
  a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
  b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
  c = tf.matmul(a, b)
# Create a session with log_device_placement set to True,
# so the device chosen for each op is printed.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Run the op; the result should be [[ 22.  28.] [ 49.  64.]].
print(sess.run(c))

5. Additional notes

While building tensorflow, you might get this error:

ERROR: no such package '@local_config_cuda//crosstool': BUILD file not found on package path.

This seems to be caused by a stale Bazel server process. You need to kill the bazel process (find it with ps aux | grep bazel) and re-run the procedure from the ./configure step. It is a pain, but I have not found a better way to fix it.
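A small sketch of that cleanup (the [b] in the grep pattern is a common trick to keep grep itself out of the results):

```shell
# List any lingering bazel processes; fall back to a message if none exist
ps aux | grep '[b]azel' || echo "no bazel processes found"
# Then kill each listed PID by hand, e.g.:
#   kill <pid>
```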

Also, search the Issues page of TensorFlow's GitHub repository; I got a lot of help and guidance from people who had run into similar error messages.

Did you run into any other issues when building TensorFlow with GPU support? Did this post help you? Leave your comments and feedback. Good luck!