Distributed TensorFlow on Raspberry Pi’s Hadoop 3 Cluster
Raspberry Pi + Apache Hadoop 3.1.1 + TensorFlow 1.9 + LinkedIn TonY 0.1.0
Preparation
1. Raspbian OS Upgrade
Make sure you have Raspbian Stretch. A quick way to check your OS version is to check your Python 3 version: if you have Python 3.5, you’re good; if you have Python 3.4, you are still on Raspbian Jessie. Follow https://www.raspberrypi.org/documentation/raspbian/updating.md to upgrade to the latest Stretch version of Raspbian. The old Jessie version won’t work! Note that the upgrade takes more than half an hour.
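If you would rather not infer the release from the Python version, here is a minimal sketch that reads it from the contents of /etc/os-release instead. It assumes the file carries a VERSION_ID field, where Debian 9 is Stretch and Debian 8 is Jessie; the function names are made up for this example.

```python
# Sketch: map the VERSION_ID in /etc/os-release to a Raspbian codename.
# Assumption: VERSION_ID is present ("9" for Stretch, "8" for Jessie).

DEBIAN_CODENAMES = {"8": "jessie", "9": "stretch"}


def parse_os_release(text):
    """Parse KEY=value lines (values may be double-quoted) into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"')
    return info


def raspbian_codename(text):
    """Return 'stretch', 'jessie', or 'unknown' for the given file contents."""
    version_id = parse_os_release(text).get("VERSION_ID")
    return DEBIAN_CODENAMES.get(version_id, "unknown")
```

On a Pi you would call it as raspbian_codename(open("/etc/os-release").read()) and only continue if it returns "stretch".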
2. To make everyone’s life easier, install pdsh so you can run shell commands on all your Raspberry Pis at once.
sudo apt-get install pdsh
Add this to your ~/.bashrc file:
export PDSH_RCMD_TYPE=ssh
In the future, use
pdsh -R ssh -w pi@192.168.0.[7,8,9,13,17] 'YOUR_SHELL_COMMAND'
to run a shell command on all the nodes. It works like MAGIC!
To make your life even easier, add an alias for that in .bashrc:
alias runshell="pdsh -R ssh -w pi@192.168.0.[7,8,9,13,17]"
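The bracket expression in the -w argument is pdsh shorthand for a list of hosts. To see which nodes a pattern will hit before running anything, here is a rough sketch of that expansion. It is a simplification, not pdsh's real parser: it assumes a single bracket group holding comma-separated numbers or ranges like 1-3, and expand_hostlist is a hypothetical helper name.

```python
import re


def expand_hostlist(spec):
    """Expand 'prefix[a,b,c-d]suffix' into a list of host strings.

    Simplified: only one bracket group is supported.
    """
    match = re.fullmatch(r"(.*)\[([^\]]+)\](.*)", spec)
    if match is None:
        return [spec]  # no brackets, so this is a single host
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        if "-" in part:
            start, end = part.split("-", 1)
            numbers = range(int(start), int(end) + 1)
        else:
            numbers = [int(part)]
        for number in numbers:
            hosts.append("{}{}{}".format(prefix, number, suffix))
    return hosts
```

For the alias above, expand_hostlist("pi@192.168.0.[7,8,9,13,17]") gives the five pi@192.168.0.x targets, one per node.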
3. Install virtualenv on all nodes
runshell 'sudo apt-get install virtualenv'
4. Turn on swap
If you read through my old Raspberry Pi posts, you might have turned off swap because of Docker. Now you need to turn it back on with plenty of swap space, because the 1GB of physical memory on the Pi is far from enough for TensorFlow training. To enable it:
sudo vim /etc/dphys-swapfile
# Update
CONF_SWAPSIZE=1000
Then restart the swap service:
sudo /etc/init.d/dphys-swapfile restart
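If you want to script the edit instead of opening vim on every node, here is a small sketch that rewrites the CONF_SWAPSIZE line in the text of /etc/dphys-swapfile. It only transforms a string; reading and writing the real file (as root) is left to you. set_swap_size is a name invented for this example.

```python
import re


def set_swap_size(config_text, size_mb):
    """Return config_text with CONF_SWAPSIZE set to size_mb (in MB).

    Replaces every uncommented CONF_SWAPSIZE line if one exists,
    otherwise appends a new line at the end.
    """
    new_line = "CONF_SWAPSIZE={}".format(size_mb)
    pattern = re.compile(r"^CONF_SWAPSIZE=.*$", re.MULTILINE)
    if pattern.search(config_text):
        return pattern.sub(new_line, config_text)
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return config_text + new_line + "\n"
```

Commented-out lines such as #CONF_SWAPSIZE=... are left alone on purpose, so the original defaults stay visible in the file.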
Do the job!
1. Install Hadoop 3.1.1 on your Raspberry Pi cluster
Follow my old post to install Hadoop 3.1.1 on the Raspberry Pis (just replace version 3.0.0 with 3.1.1; everything else works the same).
Make sure you see this before continuing:
pi@master:~ $ jps
18912 NodeManager
2947 NameNode
3091 DataNode
31577 Jps
3257 SecondaryNameNode
18799 ResourceManager
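To avoid eyeballing the jps output on every restart, here is a sketch that parses it and reports which daemons are missing. The required set below is simply the five daemons from the listing above; adjust it if your master does not also run a DataNode and NodeManager. The function names are made up for this example.

```python
# Daemons expected on the master, taken from the jps listing above.
REQUIRED_DAEMONS = {
    "NameNode",
    "SecondaryNameNode",
    "DataNode",
    "ResourceManager",
    "NodeManager",
}


def running_daemons(jps_output):
    """Collect process names from lines shaped like '<pid> <name>'."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit():
            names.add(parts[1])
    return names


def missing_daemons(jps_output, required=REQUIRED_DAEMONS):
    """Return the sorted list of required daemons not present in the output."""
    return sorted(set(required) - running_daemons(jps_output))
```

An empty list means the master looks healthy; anything else names the daemon you still need to start.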
On your namenode page:


and resourcemanager page:


2. Install TensorFlow
Create a Python 3 virtual environment on every node and install TensorFlow into it:
runshell 'virtualenv -p python3 ~/p3 && source ~/p3/bin/activate && pip3 install tensorflow'
We use the Python 3 version of TensorFlow instead of the Python 2 one, because Python 2’s TensorFlow doesn’t work on Raspbian for some reason.
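Before moving on, it can be worth confirming that each node's virtualenv can actually find TensorFlow. Here is a tiny sketch of that check; it only looks the module up and does not import it, since importing TensorFlow on a Pi is slow. module_available is a hypothetical helper name.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if a top-level module can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Report which interpreter ran the check, so a wrong virtualenv is obvious.
    print(sys.executable, "tensorflow found:", module_available("tensorflow"))
```

Save it under any name you like and run it with ~/p3/bin/python on each node, so the check uses the virtualenv's interpreter rather than the system one.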
3. Get a copy of TonY jar
The TonY project is open source. Download a tony-cli jar from: link
4. Kick off your distributed TensorFlow job!
Now that the whole environment is set up, it is time to kick off your job. Download this file to your src/ directory.
pi@master:~/tf $ ls
src tony-cli-0.1.0-all.jar tony.xml
pi@master:~/tf $ cat tony.xml
<configuration>
<property><name>tony.application.insecure-mode</name><value>true</value></property>
</configuration>
pi@master:~/tf $ ls src/distributed.py
pi@master:~/tf $ CLASSPATH=$(${HADOOP_HDFS_HOME}/bin/hadoop classpath --glob):/home/pi/tf/:/home/pi/tf/* java com.linkedin.tony.cli.ClusterSubmitter --src_dir src --executes src/distributed.py --python_binary_path /home/pi/p3/bin/python
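The listing above packs two things into very little space: a one-property tony.xml and a long CLASSPATH plus java command line. Here is a sketch that builds both from Python so the pieces are easier to see. It uses only the class name and flags that appear in the command above; the helper names are invented, and the Hadoop classpath is passed in as a string (you would get it from the hadoop classpath --glob call yourself).

```python
import xml.etree.ElementTree as ET


def tony_config_xml(properties):
    """Render a dict of property names and values as a Hadoop-style XML config."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


def submit_command(hadoop_classpath, work_dir, src_dir, script, python_bin):
    """Return (CLASSPATH value, argv list) mirroring the command above."""
    # work_dir/ puts tony.xml on the classpath, work_dir/* picks up the jar.
    classpath = ":".join([hadoop_classpath, work_dir + "/", work_dir + "/*"])
    argv = [
        "java", "com.linkedin.tony.cli.ClusterSubmitter",
        "--src_dir", src_dir,
        "--executes", script,
        "--python_binary_path", python_bin,
    ]
    return classpath, argv
```

To actually submit, you could hand argv to subprocess.run with CLASSPATH set in the environment, from inside ~/tf so the relative src paths resolve the same way as in the shell version.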
5. Check your job on the ResourceManager web page

Worker log:

Now it is running!
The good news is that you can also view TensorBoard. The logs required by TensorBoard are generated on worker 0, so log in to that worker and run:
pi@slave-3:~ $ source p3/bin/activate
(p3) pi@slave-3:~ $ tensorboard --logdir /tmp/mnist/1
TensorBoard 1.9.0 at http://slave-3:6006 (Press CTRL+C to quit)
Now you can open http://slave-3:6006 and view TensorBoard in real time:

FAQ
- My pip is screwed up after the Raspbian Stretch upgrade:
pi@master:/usr/local/bin $ pip
Traceback (most recent call last):
  File "/usr/bin/pip", line 9, in <module>
    from pip import main
ImportError: cannot import name main
Solution:
sudo vim /usr/bin/pip
Change
from pip import main
if __name__ == '__main__':
    sys.exit(main())
To:
from pip import __main__
if __name__ == '__main__':
    sys.exit(__main__._main())