How to Run Polynote on AWS EMR?

Netflix’s Jupyter-killer notebook meets real compute power with AWS EMR to fuel Apache Spark.

Deniz Parmaksız
4 min read · Nov 17, 2019
(Image: planet Jupiter with sunglasses. "Hello, Jupiter." Source: PopMech)

Netflix open-sourced its “Jupyter-killer” notebook Polynote last month, and the project has already passed 3k stars on GitHub; for comparison, Apache Zeppelin has 4.4k and Jupyter has 8.9k, if that is a real popularity metric for you :)

Leaving GitHub stars aside, Polynote offers some really cool features for data scientists and other ML practitioners. My favorites so far are IDE-class auto-completion and highlighting, a symbol table for defined values, detailed kernel status for running tasks, built-in data visualization, and true support for mixing languages (wow!). Detailed information can be found in Netflix’s own post.


Let’s go back to our topic: how to run Polynote on AWS EMR. Experimenting on your laptop is fine, but if you are going to run experiments on real-world data, 8 GB of RAM does not go far with Spark. We need a cluster of a few nodes with high CPU and memory specs. Thanks to cloud computing vendors, it is really easy to spin up big clusters in a few minutes.

AWS EMR comes with pre-installed software such as Hadoop, Hive, Spark, JupyterHub, and Zeppelin, and using those notebooks is pretty easy: you can just spin up a cluster and start working on it. However, if you want to leverage the cool features of Polynote, you need to get your hands dirty.

First of all, create a cluster on EMR and SSH into the master node. Then, ensure that python3 and pip are installed.
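If you prefer the command line to the console, a cluster like this can also be created with the AWS CLI. This is a minimal sketch; the cluster name, release label, key pair, and instance type and count are illustrative assumptions, so adjust them to your account and workload:

```shell
# Illustrative only: spin up a small EMR cluster with Spark installed.
# Replace the key pair name, region, and instance sizes with your own values.
aws emr create-cluster \
  --name "polynote-demo" \
  --release-label emr-5.28.0 \
  --applications Name=Hadoop Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair

# Once the cluster is up, SSH into the master node (EMR's default user is hadoop):
ssh -i ~/my-key-pair.pem hadoop@<dns-to-your-master-node>
```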

python3 -m ensurepip
# Optionally, create a soft link as pip3 for easy access.
sudo ln -s /usr/bin/pip-3.6 /usr/bin/pip3
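To verify that both are in place before moving on, a quick sanity check (exact version numbers will vary by EMR release):

```shell
# Sanity check: both commands should print a version string.
python3 --version
python3 -m pip --version
```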

After that, we need to define some environment variables so Polynote can find the JDK and Spark. Open ~/.bashrc in your favorite editor (e.g. vim) and add the lines below. The JDK and Spark are already installed thanks to Amazon; we just need to let Polynote know where to find them. JAVA_HOME is already set, but it points to the JRE, and we want it to point to the JDK. After saving your changes, do not forget to run source ~/.bashrc so they take effect in your current session.

export PATH=~/.local/bin:$PATH
export JAVA_HOME=/etc/alternatives/java_sdk_openjdk
export SPARK_HOME=/usr/lib/spark

The last requirement before installing Polynote is to install jep and jedi for polyglot support (as far as I can tell), pyspark as a bridge for sharing DataFrames between Scala/Spark and Python/Pandas, and finally virtualenv for isolated Python environments. Optionally, you can install the numpy and pandas packages too. I am using the --user flag so pip installs packages under the local user’s package location; this way we do not need root access to install dependencies.

pip3 install jep jedi pyspark virtualenv numpy pandas --user

Now we are ready to install Polynote itself! First, find the latest package on the releases page. Then download it to the master instance using wget and unpack it with tar.

wget https://github.com/polynote/polynote/releases/download/0.2.13/polynote-dist.tar.gz
tar -zxvpf polynote-dist.tar.gz

You should see a polynote folder when you list the contents of your directory. Switch into it with cd polynote and make a copy of the config template by running cp config-template.yml config.yml. This config file lets us change notebook server settings such as storage locations, dependencies, and the binding port.

Since we want to reach the server from a remote machine, and the default host Polynote binds to is 127.0.0.1 (localhost), we need to change it to 0.0.0.0 to allow remote access. To do that, uncomment lines 20–22 of the template and change the host value. Your final configuration should look like this.

listen:
  host: 0.0.0.0
  port: 8192

Finally, start the Polynote server by simply running python3 ~/polynote/polynote.py. Now you can access the Polynote UI at http://&lt;dns-to-your-master-node&gt;:8192 in Google Chrome; Polynote is not tested on other browsers yet and may not work in them. (This assumes your master node is reachable from your network, via a proxy or otherwise, and that inbound traffic on port 8192 is allowed.)
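Before opening the browser, you can check from your own machine that the port answers at all; this assumes the security group and any proxy are already set up as noted above:

```shell
# Replace the placeholder with your master node's DNS name.
# Any HTTP response (even an error page) means the server is reachable.
curl -I http://<dns-to-your-master-node>:8192
```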

To enable Spark, set the spark.master Spark configuration to yarn, so the application is submitted to the cluster through YARN. These configurations are at the notebook level and can be found at the top of the notebook, under the Configuration &amp; dependencies cell.
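For reference, the same keys can be set server-wide in config.yml so every new notebook defaults to YARN. This is a sketch assuming the commented spark section of the config template; the executor memory value is an illustrative assumption, so size it to your nodes:

```yaml
# Default Spark properties in config.yml; per-notebook settings in the
# Configuration & dependencies cell override these. Values are illustrative.
spark:
  spark.master: yarn
  spark.executor.memory: 4g
```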

There it is! A beautiful notebook.

That’s it! You have created a Polynote server and can leverage all its cool features with the compute power of an EMR cluster. See the official docs for basic usage and mixing programming languages.

Disclaimer: Please be aware that Polynote allows arbitrary remote code execution and can be used as an attack vector if the environment is not secure. If you are going to use it in your company environment, consult your security team before running Polynote. You are solely responsible for any breach, loss, or damage caused by running this software insecurely.

Thanks for reading, this was my first Medium post! I am planning to write more posts on Machine Learning and my side projects. If you are interested, see you in upcoming posts :)


Deniz Parmaksız

Sr. Machine Learning Engineer at Insider | AWS Ambassador