How to Run Polynote on AWS EMR?
Netflix’s Jupyter-killer notebook meets real compute power with AWS EMR to fuel Apache Spark.
Netflix open-sourced its “Jupyter-killer” notebook Polynote last month, and the project has already hit more than 3k stars on GitHub, whereas Apache Zeppelin has 4.4k and Jupyter has 8.9k stars (if that is a real popularity metric for you :)
Leaving GitHub stars aside, Polynote offers some really cool features for data scientists and other ML practitioners. My favorites so far are IDE-class auto-complete and highlighting, a symbol table for defined values, detailed kernel status for running tasks, built-in data visualization, and true support for mixing languages (wow!). Detailed information can be found in Netflix’s own post.
Let’s go back to our topic: How to run Polynote on AWS EMR? Experimenting on your laptop is fine, but if you are going to run experiments on real-world data, 8GB of RAM is not much for Spark. We need a cluster with a couple of nodes with high CPU and memory specs. Thanks to cloud computing vendors, it is really easy to spin up big clusters in a few minutes.
AWS EMR comes with some pre-installed software such as Hadoop, Hive, Spark, JupyterHub, and Zeppelin, and using those notebooks is pretty easy: you can just spin up a cluster and start working on it. However, if you want to leverage the cool features of Polynote, you need to get your hands dirty.
First of all, create a cluster on EMR and SSH into the master node. Then, ensure that python3 and pip are installed.
python3 -m ensurepip
# Optionally, create a soft link as pip3 for easy access.
sudo ln -s /usr/bin/pip-3.6 /usr/bin/pip3
After that, we need to define some environment variables so Polynote can find the JDK and Spark. Open ~/.bashrc in your favorite editor (e.g. vim) and add the lines below. The JDK and Spark are already installed thanks to Amazon, so we just need to let Polynote know where to find them. JAVA_HOME is already defined, but it points to the JRE; we want it to point to the JDK. After saving your changes, do not forget to run source ~/.bashrc so they take effect in your current session.
export PATH=~/.local/bin:$PATH
export JAVA_HOME=/etc/alternatives/java_sdk_openjdk
export SPARK_HOME=/usr/lib/spark
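After sourcing ~/.bashrc, a quick sanity check can confirm the variables actually resolve before you go further. The snippet below is just a sketch; the check_var helper is a name I made up, not part of any tool:

```shell
# Hypothetical helper: report whether a required environment variable is set.
check_var() {
  name="$1"
  eval "val=\${$name:-}"
  if [ -z "$val" ]; then
    echo "ERROR: $name is not set" >&2
    return 1
  fi
  echo "$name=$val"
}

# Check each variable; "|| true" keeps the loop going if one is missing.
for v in JAVA_HOME SPARK_HOME PATH; do
  check_var "$v" || true
done
```

Since javac only ships with a JDK, running `"$JAVA_HOME/bin/javac" -version` is also a quick way to verify JAVA_HOME points at the JDK rather than the JRE.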
The last requirement before installing Polynote is to install jep and jedi for polyglot support (as far as I can tell), pyspark as a bridge for sharing DataFrames between Scala/Spark and Python/pandas, and finally virtualenv for isolated Python environments. Optionally, you can install the numpy and pandas packages too. I am using the --user flag so pip installs packages under the local user’s package location, and we do not need root access to install dependencies.
pip3 install jep jedi pyspark virtualenv numpy pandas --user
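To confirm the packages landed where Python can see them, you can try importing each one. This is just a quick sketch; packages that are missing are reported rather than causing a failure:

```shell
# Report which of the packages installed above import cleanly under python3.
python3 - <<'EOF'
import importlib

for pkg in ["jep", "jedi", "pyspark", "virtualenv", "numpy", "pandas"]:
    try:
        importlib.import_module(pkg)
        print(f"{pkg}: OK")
    except ImportError as err:
        print(f"{pkg}: MISSING ({err})")
EOF
```

If anything shows up as MISSING, re-run the pip3 command above and check that ~/.local/bin is on your PATH.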
Now we are ready to install Polynote itself! First, find the latest package on the release page. Then download it to the master instance using wget and unpack it with tar.
wget https://github.com/polynote/polynote/releases/download/0.2.13/polynote-dist.tar.gz
tar -zxvpf polynote-dist.tar.gz
You should now see a polynote folder when you list the contents of your directory. Switch into that folder using cd polynote and make a copy of the config template by running the cp config-template.yml config.yml command. This config file lets us change notebook server settings such as storage locations, dependencies, the binding port, etc.
Since we want to access a remote machine and the default host Polynote binds to is 127.0.0.1 (localhost), we need to change it to 0.0.0.0 to allow access over the network. To do that, uncomment lines 20–22 of config.yml and change the host value. Your final configuration should look like this.
listen:
host: 0.0.0.0
port: 8192
Finally, start the Polynote server by simply running the python3 ~/polynote/polynote.py command. Now you can reach the Polynote UI at http://<dns-to-your-master-node>:8192 in Google Chrome; other browsers are not tested yet and may not work. (This assumes your master node is reachable from your network, e.g. via a proxy, and that inbound traffic on port 8192 is allowed.)
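If exposing the server on 0.0.0.0 is not an option in your network, an SSH tunnel to the master node is a common alternative: keep the default 127.0.0.1 host and forward a local port instead. A sketch, where the key path and DNS name are placeholders for your own values:

```shell
# Build the tunnel command to run from your laptop.
# Placeholders: replace the key path and DNS with your own values.
MASTER_DNS="<dns-to-your-master-node>"
TUNNEL_CMD="ssh -i ~/my-key.pem -N -L 8192:localhost:8192 hadoop@${MASTER_DNS}"
echo "Run this from your laptop: $TUNNEL_CMD"
# Once the tunnel is up, browse to http://localhost:8192 instead.
```

The -N flag keeps the connection open without starting a remote shell, and -L forwards your local port 8192 to the same port on the master node.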
To enable Spark, you need to set the spark.master Spark config to yarn so YARN runs it on the cluster. These configurations are at the notebook level and can be found at the top of the notebook, under the Configuration & dependencies cell.
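If you would rather set a server-wide default than configure each notebook, config.yml also appears to accept a spark section for default Spark properties. Treat the exact keys below as an assumption and check them against your own config-template.yml:

```yaml
# Default Spark properties applied to new notebooks (sketch; verify the
# section name against config-template.yml in your Polynote release).
spark:
  spark.master: yarn
```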
That’s it: you have created a Polynote server and can leverage all of its cool features with the compute power of an EMR cluster! See the official docs for basic usage and mixing programming languages.
Disclaimer: Please be aware that Polynote allows arbitrary remote code execution and can be used as an attack vector if the environment is not secure. If you are going to use it in your company environment, consult your security team before running Polynote. You are solely responsible for any breach, loss, or damage caused by running this software insecurely.
Thanks for reading, this was my first Medium post! I am planning to write more posts on Machine Learning and my side projects. If you are interested, see you in upcoming posts :)