Install Yarn In Docker Container

Editor’s Note, August 2020: CDP Data Center is now called CDP Private Cloud Base. You can learn more about it here.


Introduction

Motivation

Bringing your own libraries to run a Spark job on a shared YARN cluster can be a huge pain. In the past, you had to install the dependencies independently on each host or use different Python package management tools. Nowadays Docker provides a much simpler way of packaging and managing dependencies, so users can easily share a cluster without running into each other or waiting for central IT to install packages on every node. Today, we are excited to announce the preview of Spark on Docker on YARN, available in the CDP Data Center 1.0 release.

In this blog post we will:

  • Show some use cases of how users can vary Python versions and libraries for Spark applications.
  • Demonstrate the capabilities of the Docker on YARN feature by using Spark shells and Spark submit jobs.
  • Peek into the architecture of Spark and how YARN can run parts of Spark in Docker containers in an effective and flexible way.
  • Provide a guide for using this feature with the CDP Data Center release.

Potential benefits

The benefits of Docker are well known: it is lightweight, portable, flexible and fast. On top of that, Docker containers let you manage all the Python and R libraries in one place (getting rid of the dependency burden), so that the Spark Executors always have access to the same set of dependencies as the Spark Driver, for instance.

Configurations to run Spark-on-Docker-on-YARN

YARN

Marking nodes supporting Docker containers

For bigger clusters it may be a real challenge to install, upgrade and manage Docker binaries and daemons. At larger scale one can benefit from the Node Labels feature of YARN.

Labelling the nodes that support Docker ensures that Docker containers are only scheduled on hosts where the Docker daemon is running. This way Docker can be installed on the hosts in a rolling fashion, making it easier to benefit from this feature sooner.
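For reference, node labels are managed with the YARN admin CLI. A minimal sketch (the label and host names are placeholders) might look like this:

    # Create a non-exclusive "docker" label and assign it to a Docker-capable host
    yarn rmadmin -addToClusterNodeLabels "docker(exclusive=false)"
    yarn rmadmin -replaceLabelsOnNode "worker-node-1.example.com=docker"

Spark applications can then be steered to those hosts with the spark.yarn.am.nodeLabelExpression and spark.yarn.executor.nodeLabelExpression properties.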

Environment variables

The runtime used to launch the container is specified via an environment variable at application submission time. As you will see later, YARN_CONTAINER_RUNTIME_TYPE is set to “docker” – in this case, the container is started as a Docker container instead of a regular process.

Note that this environment variable only needs to be present during container start, so we can even set it through the SparkContext object in a Python script, for instance (see “When to provide configurations?” below) – that is, at runtime. This provides much more flexibility for the user.

Significant environment variables:

  • YARN_CONTAINER_RUNTIME_TYPE: responsible for selecting the runtime of the container – to run a regular container, set this to “default”.
  • YARN_CONTAINER_RUNTIME_DOCKER_IMAGE: the Docker image in which your container runs. With YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE you can control whether the Docker image’s default command is overridden.
  • YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS: specifying the mounts for the application.

Users can provide multiple mounts, separated by commas. Each mount should have the form “source path:destination path in the Docker container:type”, where type is either ro or rw, representing read-only and read-write mounts.

  • As a best practice you may want to separate your running Spark Docker containers from the host network by specifying the YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK environment variable.

More properties like privileged mode, ports mapping, delayed removal can be found in the upstream documentation [1].

Spark configurations

There are multiple ways to make these environment variables available at container start time. Using the YARN master, you can add environment variables for the Spark Application Master by specifying them like this:
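(The original snippet is not reproduced in this copy; a minimal sketch, with a placeholder image name, would add options such as:)

    --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
    --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=yourdockerid/python2:v1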

and also to the containers of the Spark Executors:
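(again a sketch, with a placeholder image name:)

    --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
    --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=yourdockerid/python2:v1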

Note that you can specify a different container for the executors than the one that the Driver runs in. Thus you can also play with the dependencies just like in Demo I. (where we didn’t have numpy locally installed).

When to provide configurations?

When a Spark shell is started or a Spark job is submitted, the client can specify extra configuration options with the --conf option (see Demos I., II. and III.).

After a Spark job (e.g. a Python script) has been submitted to the cluster, the client cannot change the environment variables of the Application Master’s container. The Spark Driver in this case runs in the same container as the Application Master, so providing the appropriate environment options is only possible at submission time.

In contrast, the spawning of Spark Executors can be modified at runtime. When initializing the SparkContext object, one can set the spark.executorEnv configuration options to add the Docker-related ones (see version.py or ols.py).

Timeout

Due to Docker image localization overhead you may have to increase the Spark network timeout:
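(for example – the exact value is an assumption; pick one that comfortably covers your image pull time:)

    --conf spark.network.timeout=600s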

We have experienced some extra latency while the Docker container gets ready, mainly due to the Docker image pull operation. The timeout needs to be increased so that we wait long enough for the AM and the executors to come alive and start communicating with each other.

Mount points

The NodeManager’s local directory is mounted to the Docker container by default. It is particularly important for the Spark shuffle service to leverage multiple disks (if configured) for performance.

You will probably need the /etc/passwd mount point for the Application Master to access UNIX user and group information. Be careful to mount it in read-only mode only!

Another mount point is /opt/cloudera/parcels, which gives access to the host machine’s jars so that the Spark jars don’t have to be built into the Docker image. While it is possible to replace them, Cloudera recommends mounting these tested jars from the hosts rather than including custom Spark jars of different versions in the Docker images, which may be unsupported or have integration problems.

In a Kerberized cluster the container also needs access to the Kerberos configuration file, usually stored in /etc/krb5.conf.
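Putting the pieces together, the executors’ mounts on a Kerberized cluster might be specified roughly like this (a sketch built from the paths above):

    --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS="/etc/passwd:/etc/passwd:ro,/etc/krb5.conf:/etc/krb5.conf:ro,/opt/cloudera/parcels:/opt/cloudera/parcels:ro"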

Spark architecture

Before we turn to the demos, let’s dig into the Spark architecture by examining the differences between client and cluster mode. We do that using the YARN master, so local mode is not our concern here. You can use local mode as well by simply starting a PySpark session in a Docker image, but that is not covered in this article, as it is unrelated to the Docker on YARN feature.

Client mode

In YARN client mode, the driver runs in the submission client’s JVM on the gateway machine. A typical use of Spark client mode is the Spark shell.

The YARN application is submitted as part of the initialization of the SparkContext object by the driver. In YARN Client mode the ApplicationMaster is a proxy for forwarding YARN allocation requests, container status, etc., from and to the driver.

In this mode, the Spark driver runs on the gateway machine as a Java process, and not in a YARN container. Hence, specifying any driver-specific YARN configuration to use Docker or Docker images will not take effect. Only Spark executors will run in Docker containers.

However, it is possible to run the Spark Driver on the gateway machine inside a Docker container – which may even be the same Docker image the Executors use. The drawback of this method is that users need direct access to Docker on the gateway machine, which is usually not the case.

Cluster mode

In YARN cluster mode a user submits a Spark job to be executed, which is scheduled and executed by YARN. The ApplicationMaster hosts the Spark driver, which is launched on the cluster in a Docker container.

Along with the executors’ Docker container configurations, the Driver/AppMaster’s Docker configurations can be set through environment variables at submission time. Note that the driver’s Docker image can be customized with settings that differ from the executors’ image.

One can also set the name of the Spark Executors’ Docker image at runtime by initializing the SparkContext object appropriately. This can be used, for instance, to write a script that handles its own dependencies by specifying the image at runtime depending on the executors’ needs. If the script runs workloads that need one specific dependency but not another, it is sufficient to specify a Docker image for the Spark Executors that contains only the library that is actually needed.

Demos

A data scientist can benefit from this feature during every step of the development lifecycle of a certain model: during experimentation (see Demo I.), preparing for official runs (see Demo II.) and deploying to production (see Demo III.).

For requirements, including supported OSes and Docker versions, visit the Cloudera docs.

Prerequisites

  1. Ensure Docker is installed on all hosts and the Docker daemon is running.
  2. Enable the Docker on YARN feature in Cloudera Manager.
    1. Use Linux Container Executor.
    2. Add Docker to the Allowed Runtimes.
    3. Enable Docker Containers.
  3. Save changes and restart YARN.

Demo I.

Running PySpark on the gateway machine with Dockerized Executors in a Kerberized cluster.

Steps

  1. Prepare the Docker image (python2:v1) with the dependencies installed.
    1. Save the python2:v1 Dockerfile (see Resources) to a file named Dockerfile and build it with the “docker build” command on a command line.
    2. Publish the built Docker image to a registry. If the Docker image is public (as these are), log into your Docker Hub account and use the “docker push” command to upload it to Docker Hub.
  2. Open the Cloudera Manager UI and navigate to YARN->Configuration.
    1. Add the registry of your image (the name of your Docker Hub account) to the list of trusted registries.
    2. Add the mounts /etc/passwd, /etc/krb5.conf and /opt/cloudera/parcels to the List of Allowed Read-Only Mounts.
    3. Save changes and restart YARN.
  3. Initialize Kerberos credentials, typically by running the kinit command on one of the cluster’s hosts.
  4. Start a PySpark session using the command below.
  5. Type in the Python commands from dependencies.py.

Command
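The exact command is not reproduced here; a minimal sketch (the Docker Hub account name is a placeholder, and Kerberos credentials are assumed to have been initialized already, e.g. with kinit) might look like this:

    pyspark --master yarn \
      --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
      --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=yourdockerid/python2:v1 \
      --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS="/etc/passwd:/etc/passwd:ro,/etc/krb5.conf:/etc/krb5.conf:ro,/opt/cloudera/parcels:/opt/cloudera/parcels:ro"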

Output

Demo II.

Running Dockerized PySpark on the gateway machine with Dockerized Executors in a non-kerberized cluster.


Steps

  1. Prepare the Docker image (python2:v2) with the dependencies installed.
    1. Save the python2:v2 Dockerfile (see Resources) to a file named Dockerfile and build it with the “docker build” command on a command line.
    2. Publish the built Docker image to a registry. If the Docker image is public (as these are), log into your Docker Hub account and use the “docker push” command to upload it to Docker Hub.
  2. Open the Cloudera Manager UI and navigate to YARN->Configuration.
    1. Add the registry of your image (the name of your Docker Hub account) to the list of trusted registries.
    2. Add the mounts /etc/passwd and /opt/cloudera/parcels to the List of Allowed Read-Only Mounts.
    3. Save changes and restart YARN.
  3. Start a PySpark session using the command below.

Command
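Only a rough sketch is given here – the image name, paths and Spark invocation are assumptions, not the exact command from the original post. The idea is that the driver itself runs in a local Docker container on the gateway machine:

    docker run --rm -it --network host \
      -v /etc/spark:/etc/spark:ro \
      -v /etc/hadoop:/etc/hadoop:ro \
      -v /etc/alternatives:/etc/alternatives:ro \
      -v /opt/cloudera/parcels:/opt/cloudera/parcels:ro \
      yourdockerid/python2:v2 \
      /opt/cloudera/parcels/CDH/bin/pyspark --master yarn \
        --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
        --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=yourdockerid/python2:v2 \
        --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS="/etc/passwd:/etc/passwd:ro,/opt/cloudera/parcels:/opt/cloudera/parcels:ro"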

Mounts

In the command above multiple mounts were specified – /etc/spark is needed to pick up the YARN-related configuration, /etc/hadoop contains the topology script that Spark uses, and /etc/alternatives contains the symlinks to the other two folders.

Demo III.

Submitting Spark application with different Python version using Docker containers.

Steps

  1. Prepare the Docker image (python3:v1) with the dependencies installed.
    1. Save the python3:v1 Dockerfile (see Resources) to a file named Dockerfile and build it with the “docker build” command on a command line.
    2. Publish the built Docker image to a registry. If the Docker image is public (as these are), log into your Docker Hub account and use the “docker push” command to upload it to Docker Hub.
  2. Open the Cloudera Manager UI and navigate to YARN->Configuration.
    1. Add the registry of your image (the name of your Docker Hub account) to the list of trusted registries.
    2. Add /etc/passwd, /etc/hadoop and /opt/cloudera/parcels to the Allowed Read-Only Mounts for Docker Containers.
    3. Save changes and restart YARN.
  3. Select an arbitrary Python 3 application (example: ols.py).
  4. Submit the script as a Spark application using the command below.
  5. Check the output of the application in the Spark History Server UI.

Command
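Again only a sketch (the image name and deploy mode are assumptions), showing that both the Application Master/Driver and the Executors get Docker-related settings:

    spark-submit --master yarn --deploy-mode cluster \
      --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
      --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=yourdockerid/python3:v1 \
      --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS="/etc/passwd:/etc/passwd:ro,/etc/hadoop:/etc/hadoop:ro,/opt/cloudera/parcels:/opt/cloudera/parcels:ro" \
      --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
      --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=yourdockerid/python3:v1 \
      --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS="/etc/passwd:/etc/passwd:ro,/etc/hadoop:/etc/hadoop:ro,/opt/cloudera/parcels:/opt/cloudera/parcels:ro" \
      ols.py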

Resources

Dockerfiles

Choosing the base image

Spark contains only jars and no native code. Hadoop, on the other hand, has native libraries: the HDFS native client and the Container Executor code of YARN are two significant ones. The former is important and recommended for performance reasons; the latter is only required by the NodeManager to start containers (Docker containers included). Other storage systems like HBase or Kudu may also have native code.

Mounting does not guard against linking potentially incompatible binaries from another OS, so it is the application developer’s responsibility to watch out for this. Mounting incompatible binaries can be catastrophic and may corrupt the application.

As an example, CDP’s parcel folder contains a /jar directory that is safe to mount since it contains only jars, but the /bin directory containing binaries is dangerous if a different OS is used as the base image. For this reason Cloudera recommends either using the same OS, perhaps in a different version (preferred), or including the binaries in the image and limiting the mounting of binaries from the host.

python2:v1
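The original Dockerfile is not reproduced in this copy; a minimal sketch along these lines (the base image and package versions are assumptions) would be:

    FROM centos:7
    # Python 2 with numpy pre-installed for the executors (package choice is an assumption)
    RUN yum install -y epel-release && \
        yum install -y python-pip && \
        pip install numpy==1.16.6
    # Make Spark use this interpreter inside the container
    ENV PYSPARK_PYTHON python
    ENV PYSPARK_DRIVER_PYTHON python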

Some things to note here:

  • We needed to add the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables so that Spark picks up the right Python interpreter.

python2:v2
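Again a sketch only, not the original file (base image, packages and the JRE path are assumptions):

    FROM centos:7
    RUN yum install -y epel-release && \
        yum install -y python-pip java-1.8.0-openjdk && \
        pip install numpy==1.16.6
    ENV PYSPARK_PYTHON python
    ENV PYSPARK_DRIVER_PYTHON python
    # Point JAVA_HOME at the JRE inside the image so the dockerized driver
    # does not pick up a conflicting java from the mounted host paths
    ENV JAVA_HOME /usr/lib/jvm/jre-1.8.0-openjdk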

Some things to note here:


  • We needed to add the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables so that Spark picks up the right Python interpreter.
  • We also had to explicitly define JAVA_HOME, because the PATH environment variable can conflict between the host and the Docker image.

python3:v1
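The original file is not shown here either; a hypothetical Python 3 variant (base image and library list are assumptions) might look like:

    FROM centos:7
    # Python 3 with the libraries used by the demo script (package list is an assumption)
    RUN yum install -y python3 python3-pip && \
        pip3 install numpy pandas
    ENV PYSPARK_PYTHON python3
    ENV PYSPARK_DRIVER_PYTHON python3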

Scripts


dependencies.py
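The script itself is not included in this copy; a hypothetical sketch of the kind of check it performs (typed into the PySpark shell, where sc already exists) could be:

    # Report the Python and numpy versions seen inside each executor's Docker container
    import sys

    def versions(_):
        import numpy
        return sys.version, numpy.__version__

    print(sc.parallelize(range(2), 2).map(versions).collect())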

version.py
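Again a hypothetical sketch in the spirit of version.py: the executors’ Docker image is chosen at runtime through spark.executorEnv (the image name and mounts are placeholders):

    import sys
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("version-check")
            .set("spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE", "docker")
            .set("spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE",
                 "yourdockerid/python2:v1")
            .set("spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS",
                 "/etc/passwd:/etc/passwd:ro,/opt/cloudera/parcels:/opt/cloudera/parcels:ro"))
    sc = SparkContext(conf=conf)

    # Print the Python version seen on the driver and on the executors
    print("driver:", sys.version)
    print("executors:", sc.parallelize(range(2), 2).map(lambda _: sys.version).collect())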

ols.py
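The real script’s contents are not shown here; a hypothetical sketch in the same spirit (a small ordinary least squares fit with numpy, executed on Python 3 executors) might be:

    import numpy as np
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ols").getOrCreate()
    sc = spark.sparkContext

    def fit(seed):
        # Generate a small synthetic dataset and fit it with ordinary least squares
        rng = np.random.RandomState(seed)
        x = rng.rand(100, 2)
        y = x @ np.array([2.0, -1.0]) + 0.1 * rng.randn(100)
        beta, *_ = np.linalg.lstsq(np.c_[np.ones(100), x], y, rcond=None)
        return beta.tolist()

    print(sc.parallelize(range(4), 4).map(fit).collect())
    spark.stop()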

Availability

Note that this feature is in beta stage in CDP Data Center 7.0 and will be GA in the near future.

Future improvements

Supporting Dockerization in a bit more user friendly way: https://issues.apache.org/jira/browse/SPARK-29474

Using Docker images has many benefits, but one potential drawback is the overhead of managing them. Users must make these images available at runtime, so a secure publishing method must be established. Users must also choose between a publicly available and a private Docker repository to share these images.

We’re currently working on supporting canonical Cloudera-created base Docker images for Spark. Using these base images, which have the necessary dependencies already installed, one would no longer have to mount the Cloudera parcel folder into the container. In short, this increases encapsulation.

Improving the health diagnostics of a node by detecting a Docker daemon that is not running: https://issues.apache.org/jira/browse/YARN-9923

Further reading

[1] Apache Hadoop Document: “Launching Applications Using Docker Containers”

[2] HDP Document about Spark on Docker on YARN

[3] Check out our brand new CDP documentation!

  • Check out our Troubleshooting Catalog for common errors

Author’s note: I would like to thank the following people for their meaningful reviews and conversations: Liliana Kadar, Imran Rashid, Tom Deane, Wangda Tan, Shane Kumpf, Marcelo Vanzin, Attila Zsolt Piros, Szilard Nemeth, Peter Bacsko.

Editor's Choice

NGINX is one of the most popular web servers in the world. Not only is NGINX a fast and reliable static web server, it is also used by a ton of developers as a reverse-proxy that sits in front of their APIs.

In this tutorial we will take a look at the NGINX Official Docker Image and how to use it. We’ll start by running a static web server locally then we’ll build a custom image to house our web server and the files it needs to serve. We’ll finish up by taking a look at creating a reverse-proxy server for a simple REST API and then how to share this image with your team.

Prerequisites

To complete this tutorial, you will need the following:



  • Free Docker Account
    • You can sign-up for a free Docker account and receive free unlimited public repositories
  • Docker running locally
  • An IDE or text editor to use for editing files. I would recommend VSCode

NGINX Official Image

The Docker Official Images are a curated set of Docker repositories hosted on Docker Hub that have been scanned for vulnerabilities and are maintained by Docker employees and upstream maintainers.

Official Images are a great place for new Docker users to start. These images have clear documentation, promote best practices, and are designed for the most common use cases.

Let’s take a look at the NGINX official image. Open your favorite browser and log into Docker Hub. If you do not have a Docker account yet, you can create one for free.

Once you have logged into Docker, enter “NGINX” into the top search bar and press enter. The official NGINX image should be the first image in the search results. You will see the “OFFICIAL IMAGE” label in the top right corner of the search entry.

Now click on the nginx result to view the image details.

On the image details screen, you can view the description of the image and its readme. You can also see all the tags that are available by clicking on the “Tags” tab.

Running a basic web server

Let’s run a basic web server using the official NGINX image. Run the following command to start the container.
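A minimal form of that command, matching the options described below (the container name and port choice are the defaults used in this walkthrough):

    docker run -d -p 8080:80 --name web nginx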

With the above command, you started running the container as a daemon (-d) and published container port 80 on port 8080 of the host network. You also named the container web using the --name option.

Open your favorite browser and navigate to http://localhost:8080. You should see the following NGINX welcome page.

This is great but the purpose of running a web server is to serve our own custom html files and not the default NGINX welcome page.

Let’s stop the container and take a look at serving our own HTML files.
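For example, assuming the container name used above:

    docker stop web
    docker rm web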

Adding Custom HTML

By default, NGINX looks in the /usr/share/nginx/html directory inside of the container for files to serve. We need to get our html files into this directory. A fairly simple way to do this is to use a mounted volume. With mounted volumes, we are able to link a directory on our local machine and map that directory into our running container.

Let’s create a custom html page and then serve that using the nginx image.

Create a directory named site-content. In this directory add an index.html file and add the following html to it:
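Any simple page will do; for example:

    <!DOCTYPE html>
    <html>
      <head>
        <title>Hello from NGINX</title>
      </head>
      <body>
        <h1>Hello, world! This page is served by NGINX running in Docker.</h1>
      </body>
    </html>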

Now run the following command, which is the same command as above but with the -v flag added to create a bind mount volume. This mounts our local directory ~/site-content into the running container at /usr/share/nginx/html:
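(for example:)

    docker run -d -p 8080:80 --name web -v ~/site-content:/usr/share/nginx/html nginx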

Open your favorite browser and navigate to http://localhost:8080 and you should see the above html rendered in your browser window.

Build Custom NGINX Image


Bind mounts are a great option for running locally and sharing files into a running container. But what if we want to move this image around and have our html files moved with it?


There are a couple of options available but one of the most portable and simplest ways to do this is to copy our html files into the image by building a custom image.

To build a custom image, we’ll need to create a Dockerfile and add our commands to it.

In the same directory, create a file named Dockerfile and paste the below commands.
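Based on the description below, the Dockerfile contains just two instructions:

    FROM nginx:latest
    COPY index.html /usr/share/nginx/html/index.html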

We start building our custom image by using a base image. On line 1, you can see we do this using the FROM command. This will pull the nginx:latest image to our local machine and then build our custom image on top of it.


Next, we COPY our index.html file into the /usr/share/nginx/html directory inside the container overwriting the default index.html file provided by nginx:latest image.

You’ll notice that we did not add an ENTRYPOINT or a CMD to our Dockerfile. We will use the underlying ENTRYPOINT and CMD provided by the base NGINX image.

To build our image, run the following command:
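(for example, tagging the image as "webserver" – the tag name is a placeholder:)

    docker build -t webserver .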

The build command tells Docker to execute the commands located in our Dockerfile. You will see output in your terminal similar to the below:

Now we can run our image in a container but this time we do not have to create a bind mount to include our html.
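Using the image tag from the build example above:

    docker run -d -p 8080:80 --name web webserver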

Open your browser and navigate to http://localhost:8080 to make sure our html page is being served correctly.

Setting up a reverse proxy server


A very common scenario for developers is to run their REST APIs behind a reverse proxy. There are many reasons why you would want to do this, but one of the main reasons is to run your API server on a different network or IP than your front-end application. You can then secure this network and only allow traffic from the reverse proxy server.

For the sake of simplicity and space, I’ve created a simple frontend application in React.js and a simple backend API written in Node.js. Run the following command to pull the code from GitHub.

Once you’ve cloned the repo, open the project in your favorite IDE. Take a look at Dockerfile in the frontend directory.

The Dockerfile sets up a multi-stage build. We first build our React.js application and then we copy the nginx.conf file from our local machine into the image along with our static html and javascript files that were built in the first phase.

We configure the reverse proxy in the frontend/nginx/nginx.conf file. You can learn more about configuring Nginx in their documentation.

As you can see in the second location section, all traffic targeted to /services/m is proxied (via proxy_pass) to http://backend:8080/services/m.
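The relevant part of frontend/nginx/nginx.conf looks roughly like this (a sketch of the shape, not the exact file):

    location /services/m {
        proxy_pass http://backend:8080/services/m;
    }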


In the root of the project is a Docker Compose file that will start both our frontend and backend services. Let’s start up our application and test if the reverse proxy is working correctly.
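For example, from the project root (assuming the Compose file is named docker-compose.yml):

    docker-compose up --build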

You can see that our nginx web server has started, and that our backend_1 service has also started and is listening on port 8080.

Open your browser and navigate to http://localhost. You should see the following web page:

Open the developer tools window and click on the “network” tab. Now back in the browser, enter an entity name. This can be anything. I’m going to use “widgets”. Then click the “Submit” button.

Over in the developer tools window, click on the network request for widgets and see that the request was made to http://localhost and not to http://localhost:8080.

Open your terminal and notice that the request made from the browser was proxied to the backend_1 service and handled correctly.

Shipping Our Image

Now let’s share our images on Docker Hub so others on our team can pull the images and run them locally. This is also a great way to share your application with others outside of your team, such as testers and business owners.

To push your images to Docker Hub, run the docker tag and then the docker push commands. You will first need to log in with your Docker ID. If you do not have a free account, you can create one here.
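For example (the repository name and tag are placeholders; webserver is the local image we built earlier):

    docker tag webserver yourdockerid/nginx-demo:1.0
    docker push yourdockerid/nginx-demo:1.0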

Awesome Compose

The Awesome compose project is a curated list of Docker Compose samples. These samples provide a starting point for how to integrate different services using a Compose file and to manage their deployment with Docker Compose.

In the Awesome Compose repository you can find project templates that use NGINX as a static web server or a reverse proxy. Please take a look, and if you do not find what you are looking for, please consider contributing to the project. Check out the Contribution Guide for more details.

Conclusion


In this article we walked through running the NGINX official image, adding our custom html files, building a custom image based on the official image, and configuring NGINX as a reverse proxy. We finished up by pushing our custom image to Docker Hub so we could share it with others on our team.


If you have any questions, please feel free to reach out on Twitter @pmckee and join us in our community slack.