Standalone Apache Spark cluster on a Mac M1
Introduction
Setting up a standalone Apache Spark cluster on a Mac machine armed with an M1 processor was a real pain. The issue was not with Spark itself but with creating the virtual machines on the MacBook.
Problems
Oracle VirtualBox does not work on the Apple M1 processor, so the alternative was to use UTM. Take a look at it; it worked quite well on my Mac M1.
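If you use Homebrew, one quick way to install UTM is via its cask (this is just an assumption about your setup; you can also grab the app from the UTM website or the Mac App Store):
brew install --cask utm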
Deployment Architecture
I used a fairly simple deployment architecture for this demonstration, and also for the sake of not burning up my machine.
A one-Master, one-Worker (Slave/Executor, or whatever you want to call it) configuration was enough for my demo cluster.
I first configured the Master node (Ubuntu 22.04.3 LTS) with Apache Spark Standalone and then simply cloned the VM to save time and brainpower.
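For reference, this is the layout used in the rest of this article (only the master's IP shows up later; the worker gets whatever address UTM assigns):
Mac M1 (host)
 ├─ spark-master    (Ubuntu 22.04.3 LTS VM, Spark Master, 192.168.5.220)
 └─ spark-worker-1  (Ubuntu 22.04.3 LTS VM, Spark Worker)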
Spark Standalone configuration
We will configure two nodes: 1) Master, 2) Worker.
Master node
You can either download the prepackaged version or build the source code from scratch. I opted for the latter because that's the fun of OSS.
To build the source code I used OpenJDK version 17.0.2 (2022-01-18).
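The build itself is the standard Maven build from the Spark repository root; something along these lines (the exact flags and profiles may vary depending on what you need):
cd /home/acme/github/spark
./build/mvn -DskipTests clean package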
Once the source code is built, it generates all the executables and libs. I configured only the PATH to point to the executables.
Below are my PATH settings:
export PATH=$PATH:/home/acme/github/spark/bin
export PATH=$PATH:/home/acme/github/spark/sbin
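To make these settings survive new shells and reboots, you can append them to ~/.bashrc (a small convenience; adjust the paths if your checkout lives elsewhere):
echo 'export PATH=$PATH:/home/acme/github/spark/bin' >> ~/.bashrc
echo 'export PATH=$PATH:/home/acme/github/spark/sbin' >> ~/.bashrc
source ~/.bashrc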
Cool! We have our one node ready. Just make one more change in the following file:
/etc/hostname
Write a nice name here, let's say spark-master, so that when we run the master node it gets a readable name (sometimes Spark does not pick up the hostname and uses the IP instead).
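If you prefer not to edit the file by hand, hostnamectl should do the same job on Ubuntu 22.04 (just a convenience; editing /etc/hostname and rebooting works fine too):
sudo hostnamectl set-hostname spark-master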
Worker node
Now, just clone the master node, and boom! We have one worker node ready. We will only need to do some basic configuration.
Edit the file below and give it a name that identifies the worker.
/etc/hostname
In my case, I put spark-worker-1. That's all.
Configure the Host machine
At this point, we have two VMs up and running, and we need a quick bit of configuration on our host, i.e. the MacBook.
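The quick step here, presumably, is mapping the VM hostnames to their IPs in the Mac's /etc/hosts so that URLs such as http://spark-master:8080/ resolve by name. A minimal sketch (the worker IP below is a placeholder; use whatever addresses UTM actually gave your VMs):
sudo sh -c 'echo "192.168.5.220 spark-master" >> /etc/hosts'
sudo sh -c 'echo "192.168.5.221 spark-worker-1" >> /etc/hosts'   # placeholder worker IP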
That's all!
Run the Master node
Make sure that your spark-master and spark-worker-1 nodes can communicate with each other. This is essential (of course it makes sense).
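A quick way to verify connectivity, for example from the worker towards the master's IP (the hostname works too if the VMs can resolve each other):
acme@spark-worker-1:~$ ping -c 3 192.168.5.220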
You can run the cluster in several different ways, but I will start with the simplest one.
Master node
Well, you can log into the spark-master Ubuntu VM directly or via SSH; it's up to you. (If you feel like a hacker, go for the latter option.)
Once you are in, just run the below command:
acme@spark-master:~$ start-master.sh
At this moment, we have our master node, i.e. spark-master, running. From our host machine, i.e. the Mac, we can even open the web UI at http://spark-master:8080/.
Oh crap!!!! It's working. Note this information:
Master URL: spark://192.168.5.220:7077
Sometimes it might show as spark://spark-master:7077
Now let's jump to the next step.
Worker node
Now, let's enter the spark-worker-1 node and start the worker.
acme@spark-worker-1:~$ start-worker.sh spark://192.168.5.220:7077
That's all you need to do. The master URL is the one you got while running the master node.
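To double-check that the worker process actually started, jps (it ships with the JDK) should list a Worker process; this is just an optional sanity check:
acme@spark-worker-1:~$ jps     # the output should include a process named "Worker"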
Now, refresh the master console, i.e. http://spark-master:8080/
Alleluia! We have a worker registered too, which means our cluster is up and running.
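As one last optional check that the cluster accepts applications, you can attach a spark-shell from any machine with Spark on the PATH, pointing it at the master URL noted earlier (just a sketch, not part of the original setup):
spark-shell --master spark://192.168.5.220:7077
While the shell stays connected, it should show up under Running Applications in the master web UI.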
Conclusion
In this brief demo, I explained how to configure a Spark cluster in a Master+Worker setup. In the next article, I will try to write a driver program, submit it to the cluster, and explore its behavior.
Until then, happy learning.