How To Install Apache Spark On Ubuntu 20.04 LTS

Apache Spark is an open-source distributed processing system designed for big data workloads. It uses in-memory caching and optimized query execution to run fast analytical queries against large data sets. The framework offers high-level APIs in Java, Scala, Python, and R, and it can read from HDFS, Cassandra, HBase, Hive, Tachyon, and any other Hadoop data source. It runs under the Standalone, YARN, and Mesos cluster managers.

Throughout this article, you will learn how to install Apache Spark on Ubuntu 20.04.

Step 1: Configure your VPSie cloud server

  1. Log in to your VPSie account and create a new server, or select an existing one.
  2. Connect by SSH using the credentials we emailed you.
  3. Once you have logged into your Ubuntu 20.04 instance, update your system using the following command.

apt-get update && apt-get upgrade -y

Step 2: Install Java

Run the following command,

apt-get install openjdk-11-jdk

Verify the Java installation,

java -version
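
If more than one Java version is installed, you can check or change the system default with update-alternatives, a standard Debian/Ubuntu utility:

# lists the installed Java versions and lets you pick the default interactively
update-alternatives --config java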

Step 3: Install Scala

Run the following command,

apt-get install scala
 
Verify the Scala version,

scala -version
Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL

Enter “scala” on the command line to start the Scala REPL (type “:quit” to exit).
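
If you just want a quick sanity check without an interactive session, the scala runner can also evaluate a single expression passed with its -e flag:

# prints 42 if the Scala installation is working
scala -e 'println(21 * 2)'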

Step 4: Install Apache Spark

Download the Spark 3.2.0 release archive using the following command,

curl -O https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
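
It is good practice to verify the integrity of the download. Apache publishes a SHA-512 checksum next to each release archive; fetch it and compare it against the checksum of your local file:

# download the published checksum, then compute the local one and compare the two by eye
curl -O https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz.sha512
sha512sum spark-3.2.0-bin-hadoop3.2.tgz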

You will need to extract the downloaded file,

tar xvf spark-3.2.0-bin-hadoop3.2.tgz

Move the extracted directory to /opt/spark,

mv spark-3.2.0-bin-hadoop3.2/ /opt/spark

Open the bashrc configuration file,

vim ~/.bashrc

Add the following lines:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Apply the changes by reloading the bashrc file,

source ~/.bashrc
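
You can confirm the environment is set up correctly by checking that the variable resolves and that the Spark scripts are now on your PATH:

echo $SPARK_HOME
which start-master.sh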

Launch the master server,

start-master.sh
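
The master runs as a background Java process. A quick way to confirm it is up is jps, a JVM process lister that ships with the JDK installed earlier:

# the output should include a line ending in "Master"
jps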

If the UFW firewall is enabled, open port 8080 so the web interface is reachable,

ufw allow 8080/tcp
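
You can confirm the rule was added with,

ufw status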

You can now access Apache Spark’s web interface at the address below (replace server-ip with your server’s IP address),

http://server-ip:8080/

Launch the worker process, replacing ubuntu with your server’s hostname (the master URL is shown at the top of the web interface),

start-worker.sh spark://ubuntu:7077
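
Refresh the web interface and the worker should now appear under the Workers section. When you later need to shut the cluster down, the matching stop scripts live in the same sbin directory:

stop-worker.sh
stop-master.sh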

Use the Spark shell,

/opt/spark/bin/spark-shell
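
The shell starts with a preconfigured SparkContext available as sc. One quick way to test it is to pipe an expression straight into the shell, for example:

# distributes the numbers 1 to 100 across the cluster and sums them (prints 5050.0)
echo 'println(sc.parallelize(1 to 100).sum)' | /opt/spark/bin/spark-shell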

For Python, use PySpark,

/opt/spark/bin/pyspark
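
To run a non-interactive job against the standalone cluster, use spark-submit with the SparkPi example bundled in the release. The jar name below matches the 3.2.0 release installed in this guide, and the master URL reuses the hostname from the worker step; adjust both to your setup:

# estimates Pi with 100 tasks spread across the cluster's workers
/opt/spark/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://ubuntu:7077 \
  /opt/spark/examples/jars/spark-examples_2.12-3.2.0.jar 100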

I hope you have found this article useful and that you learned something from it.

Try VPSie for free today!