Standalone Apache Spark cluster on a Mac M1

ANUPAM GOGOI
4 min read · Jan 11, 2024

Introduction

I had a pain in the neck while creating a standalone Apache Spark cluster on a Mac machine armed with an M1 processor. The issue was not with Spark itself but with creating virtual machines on the MacBook.

Problems

Oracle VirtualBox does not work on the Apple M1 processor, so the alternative solution was to use UTM. You can take a look at it; it worked quite well on my Mac M1.

Deployment Architecture

I used quite a simple deployment architecture for this demonstration, and also for the sake of not burning my machine.

[Figure: deployment architecture, one master VM and one worker VM on the Mac host]

A configuration with one Master and one Worker (Slave/Executor, or whatever you want to call it) node was enough for my demo cluster.

Initially, I configured the Master node (Ubuntu 22.04.3 LTS) with Apache Spark Standalone and then I simply cloned the VM to save my time and brain.

Spark Standalone configuration

We will configure two nodes: 1) Master, 2) Worker.

Master node

You can either download the prepackaged version or build the source code from scratch. I opted for the second option because that's the fun of OSS.

For building the source code I used openjdk version "17.0.2" 2022-01-18.
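A minimal build sketch, assuming the source checkout lives at /home/acme/github/spark (matching the PATH settings below) and that JAVA_HOME points at that JDK:

cd /home/acme/github/spark
./build/mvn -DskipTests clean package

The build/mvn wrapper fetches an appropriate Maven for you, and skipping the tests keeps the build time sane.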

Once the source code is built, it generates all the executables and libs. I only configured the PATH to point to the executables.

Below are my PATH settings:

export PATH=$PATH:/home/acme/github/spark/bin
export PATH=$PATH:/home/acme/github/spark/sbin
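Assuming those two lines live in ~/.bashrc on the VM, reload it so the current shell picks them up:

source ~/.bashrc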

Cool! We have our one node ready. Just make one more change in the following file:

/etc/hostname

Write a nice name here, let's say spark-master, so that when we run the master node it gets a cool name (sometimes Spark does not pick up the hostname and uses the IP instead).
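On Ubuntu 22.04 you can also let hostnamectl write /etc/hostname for you instead of editing the file by hand:

sudo hostnamectl set-hostname spark-master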

Worker node

Now, just clone the master node, and boom! We have one worker node ready. We will only need to do some basic configuration.

Edit the file below and put in a nice name to identify the worker.

/etc/hostname

In my case, I put spark-worker-1. That's all.

Configure the Host machine

At this point, we have two VMs up and running, and we need to configure some quick stuff on our host, i.e. the MacBook.
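A sketch of that quick stuff, assuming name resolution is what's needed (the master console is opened as http://spark-master:8080/ later on): add /etc/hosts entries on the Mac mapping each VM's IP to its hostname. The master IP below is the one that shows up later in this post, but the worker IP is an assumption; substitute whatever addresses UTM actually assigned to your VMs.

# /etc/hosts on the Mac host
192.168.5.220   spark-master
192.168.5.221   spark-worker-1   # assumed IP, use your worker VM's address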

That's all!

Run the Master node

Make sure that your spark-master and spark-worker-1 nodes can communicate with each other. This is essential (of course it makes sense).

You can run the cluster in several different ways. But, being naive I will start with the simplest way.

Master node

Well, you can enter the spark-master Ubuntu VM directly or via SSH; it's up to you. (If you feel like a hacker, go for the latter option.)

Once you are in, just run the below command:

acme@spark-master:~$ start-master.sh

At this moment, we have our master node, i.e. spark-master, running. From our host machine, i.e. the Mac, we can even open the web UI.
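The master's web UI listens on port 8080 by default. On macOS, assuming the /etc/hosts entry from earlier, you can open it straight from the terminal:

open http://spark-master:8080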

[Figure: Spark master web UI]

Oh crap!!!! It's working. Note this information:

Master URL: spark://192.168.5.220:7077

Sometimes it might show as spark://spark-master:7077

Now let's jump to the next step.

Worker node

Now, let's enter the spark-worker-1 node and start the worker.

acme@spark-worker-1:~$ start-worker.sh spark://192.168.5.220:7077

That's all you need to do. The master URL is the one you got while running the master node.
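By default each worker also serves its own web UI, on port 8081, if you want to inspect the worker directly. Again assuming the /etc/hosts entries from earlier:

open http://spark-worker-1:8081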

Now, refresh the master console, i.e. http://spark-master:8080/

Alleluia! We have a worker registered too. So, it means that our cluster is up & running.
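As a quick sanity check, assuming the same master URL as above, you can attach an interactive shell to the cluster; a new application should appear under Running Applications in the master UI:

spark-shell --master spark://192.168.5.220:7077

And when you are done, the sbin scripts shut everything down cleanly:

acme@spark-worker-1:~$ stop-worker.sh
acme@spark-master:~$ stop-master.sh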

Conclusion

In this brief demo, I have explained how to configure a Spark cluster in a Master+Worker setup. In the next article, I will try to write a driver program, submit it to the cluster, and explore its behavior.

Until then, happy learning.
