Enable Hadoop YARN 2.9/3.x to Launch Application Using Docker Containers

Oliver Hu
Nov 4, 2018

The hard way to understand Hadoop YARN.

We open sourced a project called TonY, which was initially used to launch distributed TensorFlow jobs on Hadoop. Later we extended it so it can also run other distributed jobs on Hadoop, like R on YARN or PyTorch on YARN. One thing we have been thinking about recently is: can we support launching training jobs inside Docker images? As I mentioned in a previous blog post, Docker is the de facto best tool for developer productivity at scale. It is also easy to use for people without a solid systems infrastructure background.

Before designing TonY on Docker, we first need to make our Hadoop cluster capable of launching applications with Docker images. The good news is that since Hadoop 2.9.1, the support for unprivileged Docker mode suffices to launch most tasks. The bad news is that there is not much tutorial or documentation about it; the official document is far from enough.

As for an experimentation environment, it is extremely inconvenient, if not impossible, to mess with company clusters for this, so I decided to play around with it on my own clusters first.

Unfortunately, it is not possible to make this part of my Raspberry Pi series, since Docker requires swap space to be off and Hadoop + Docker + a training job would go beyond 1GB of memory. So I had to use my Ubuntu 18.04 boxes as well as my Mac Mini boxes to play around with it.

Prerequisites

Hadoop source code

You need its source code; otherwise you can’t understand the error messages. The error messages from Hadoop YARN are definitely written for its developers, not for users. Hadoop YARN is a beast that is hard to tame. You don’t necessarily need an exceptional IDE like IntelliJ, since configuring an IDE for Hadoop is worth another blog post; VS Code or Sublime is enough.

Also, if you want to use macOS as the Node Manager host, you need the source code to compile a macOS-compatible container-executor binary; the one downloaded from the Hadoop website only works on Linux. This is because container-executor is written in a thousand-plus lines of C code, not Java.
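If you do need to rebuild it, the native profile of the Node Manager module is what produces the binary. A rough sketch (the conf.dir value is an assumption matching the /etc layout described later in this post; point it at wherever your container-executor.cfg lives):

# from the Hadoop source root; requires cmake and a C toolchain
cd hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
mvn package -Pnative -DskipTests -Dcontainer-executor.conf.dir=/etc/hadoop/etc/hadoop
# the compiled binary typically lands under target/native/target/usr/local/bin/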

A proper host OS

It is recommended to use the same OS for both the Docker image and the host (like Ubuntu 18.04 or another Linux distro). I tried a Docker-on-Docker setup and a Docker-on-macOS setup; both are fairly painful, since:

For Docker on Docker, this is an unexplored setup: if it doesn’t work, you have no idea whether the approach itself doesn’t fit or whether it is your configuration that is wrong. That was the feeling I got, and it was fairly frustrating. For what it’s worth, it should be a viable setup at this point, now that I have figured out a working common host + Docker combo.

The other setup I tried was macOS as the host and Ubuntu as the Docker image. As I mentioned earlier, you need to compile the Hadoop source code to fix the binary compatibility issue. Also, the macOS commands to set up users and groups are very different from those on Linux, and it’s a pain to remember two sets of commands; I once messed up the sudoers file and it took me quite a while to figure out how to recover. The start-dfs.sh and start-yarn.sh scripts also assume the Hadoop jars are installed in the same location on every host, and since I used to keep the Hadoop folder under my home folder, the different home directory conventions between Ubuntu and macOS were another problem.

If you use the same OS for the host and the Docker image, you can simply apply the same setup to both. So the suggestion is to stick with the same OS for your host and your Docker image.

A properly configured Hadoop

A few things that are not mentioned in the official doc:

1. yarn.nodemanager.linux-container-executor.group should be the same in both container-executor.cfg and yarn-site.xml (see the yarn-site.xml excerpt at the end of this section).

2. The Hadoop configuration folder /etc/hadoop must be owned by root (both this folder itself and any of its parent folders), which is why I suggest putting HADOOP_CONF_DIR under /etc. However, there is another problem with the downloaded Hadoop binary: it is compiled with the Hadoop conf expected at ../etc/hadoop relative to the bin/ folder. As a result, you need to put the whole Hadoop folder under /etc:

pi@pi-aw:/etc/hadoop$ ll
total 172
drwxr-xr-x 10 root hadoop 4096 Nov 3 22:23 ./
drwxr-xr-x 130 root root 12288 Nov 4 11:02 ../
drwxr-xr-x 2 root hadoop 4096 Nov 3 22:15 bin/
drwxr-xr-x 3 root hadoop 4096 Nov 3 22:15 etc/
drwxr-xr-x 2 root hadoop 4096 Nov 3 22:15 include/
drwxr-xr-x 3 root hadoop 4096 Nov 3 22:15 lib/
drwxr-xr-x 2 root hadoop 4096 Nov 3 22:15 libexec/
-rw-r--r-- 1 root hadoop 106210 Nov 3 22:15 LICENSE.txt
drwxrwxrwx 3 pi hadoop 4096 Nov 3 23:52 logs/
-rw-r--r-- 1 root hadoop 15915 Nov 3 22:15 NOTICE.txt
-rw-r--r-- 1 root hadoop 1366 Nov 3 22:15 README.txt
drwxr-xr-x 3 root hadoop 4096 Nov 3 22:15 sbin/
drwxr-xr-x 4 root hadoop 4096 Nov 3 22:15 share/
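To reproduce that layout, something along these lines should work (the paths and the pi user are from my setup; adjust the tarball version and locations to yours):

# move the unpacked Hadoop distribution to /etc/hadoop
sudo mv ~/hadoop-2.9.1 /etc/hadoop
# the whole tree, including /etc/hadoop itself, must be owned by root:hadoop
sudo chown -R root:hadoop /etc/hadoop
# the Node Manager user still needs to write logs, matching the drwxrwxrwx entry above
sudo mkdir -p /etc/hadoop/logs
sudo chown pi:hadoop /etc/hadoop/logs
sudo chmod 777 /etc/hadoop/logs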

Also note that you need to make everything here owned by root:hadoop. The trickiest thing I found was configuring container-executor.cfg; here is my version:

pi@pi-aw:/etc/hadoop/etc/hadoop$ sudo cat container-executor.cfg 
[sudo] password for pi:
min.user.id=100
yarn.nodemanager.linux-container-executor.group=hadoop
[docker]
module.enabled=true
docker.binary=/usr/bin/docker
docker.allowed.capabilities=SYS_CHROOT,MKNOD,SETFCAP,SETPCAP,FSETID,CHOWN,AUDIT_WRITE,SETGID,NET_RAW,FOWNER,SETUID,DAC_OVERRIDE,KILL,NET_BIND_SERVICE
docker.allowed.networks=bridge,host,none
docker.allowed.rw-mounts=/tmp,/etc/hadoop/logs/,/private/etc/hadoop-2.9.1/logs/

A couple of things you need to notice:

a. min.user.id should be lower than your user id.

b. any directory under docker.allowed.rw-mounts must be accessible, and you must provide a value for this field.

c. docker.binary defaults to /usr/bin/docker, so if you run the Node Manager on macOS (where Docker is usually installed elsewhere), you have to specify it.

The error messages are sometimes random: with the right configuration everything works fine, but with a wrong configuration YARN throws seemingly unrelated error messages at you.
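To close the loop on point 1 above, here are the Docker-related properties from my yarn-site.xml. The property names are the ones documented for Hadoop 2.9/3.x; the values are simply what I used, and the group must match container-executor.cfg:

<!-- delegate container launch to container-executor -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<!-- must match yarn.nodemanager.linux-container-executor.group in container-executor.cfg -->
<property>
  <name>yarn.nodemanager.linux-container-executor.group</name>
  <value>hadoop</value>
</property>
<!-- let applications opt into the docker runtime via YARN_CONTAINER_RUNTIME_TYPE -->
<property>
  <name>yarn.nodemanager.runtime.linux.allowed-runtimes</name>
  <value>default,docker</value>
</property>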

A properly configured Docker image

You can reference the sketch below as an example.
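It is only a minimal illustration under the assumptions made in this post (Ubuntu 18.04 to match the host, OpenJDK 8, and the pi/hadoop uid and gid values shown below), not the exact image I used:

# base OS matches the host, as recommended above
FROM ubuntu:18.04
# a JRE is required to run YARN containers
RUN apt-get update && \
    apt-get install -y openjdk-8-jdk-headless && \
    rm -rf /var/lib/apt/lists/*
# uid/gid must match the host (see the id output below): pi=1000, hadoop=5000
RUN groupadd -g 1000 pi && \
    groupadd -g 5000 hadoop && \
    useradd -m -u 1000 -g pi -G hadoop pi
# if your container command calls hdfs/yarn (as in the test below),
# also unpack a matching Hadoop client distribution into the image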

Something you must understand is that the host Node Manager follows the environment set up on the host OS and looks for the same paths inside the container. So if you have HADOOP_CONF_DIR set to /etc/hadoop/etc/hadoop on the host but set it to /etc/hadoop in the Docker image, it won’t work. A better way is to set HADOOP_CONF_DIR explicitly in your command; I’ll talk about that in the How to Test section.

Make sure your user’s uid and Hadoop’s gid are the same across both the host and the Docker image:

pi@pi-aw:/etc/hadoop/etc/hadoop$ id pi
uid=1000(pi) gid=1000(pi) groups=1000(pi),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),116(lpadmin),126(sambashare),999(docker),5000(hadoop)
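A quick sanity check on the image side (this assumes you tagged your image test1, as in the test command below):

pi@pi-aw:~$ docker run --rm test1 id pi

The output should report uid=1000(pi) and a hadoop group with gid 5000, matching the host output above.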

How to test?

The easiest way to test is to use the distributed shell:

pi@pi-aw:/etc/hadoop$ yarn jar ./share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.9.1.jar -jar ./share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.9.1.jar -shell_command 'HADOOP_CONF_DIR=/etc/hadoop hdfs dfs -ls hdfs://192.168.0.14:9000/' -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=test1

Results from the RM page should show the application finishing successfully.

If you’re a newbie, you might not find the container log at first; you can locate it by going to a log link like this:

http://pi-aw:8042/node/containerlogs/container_1541363490513_0002_01_000001/pi

Then change the container id from container_1541363490513_0002_01_000001 to container_1541363490513_0002_01_000002, and you should be able to find the output from your distributed shell.
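Alternatively, if log aggregation is enabled on your cluster, the yarn CLI can fetch all container logs for the application (the application id here is derived from the container id above):

pi@pi-aw:/etc/hadoop$ yarn logs -applicationId application_1541363490513_0002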

Pretty sure you’ll face different issues on your first attempt; it took me a couple of nights to get this right. Feel free to comment if you run into issues :-)
