Apache Spark installation on CentOS 7 / RHEL 7


Apache Spark

Apache Spark is an open-source cluster computing framework. It is a fast, general-purpose engine for big data processing, with built-in modules for streaming, SQL, graph processing (GraphX), and machine learning (MLlib). This article covers the installation of Apache Spark on CentOS 7 and RHEL 7 in standalone mode.

First, make sure that the system is up to date by running the following command:

# yum -y update

Install Java

# yum -y install java-1.8.0-openjdk
# java -version
openjdk version "1.8.0_121"
OpenJDK Runtime Environment (build 1.8.0_121-b13)
OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)
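
The version check above can also be scripted, for example inside a provisioning script. The following is a minimal sketch, not part of the original procedure: the helper name `java_major` is hypothetical, and the minimum of 8 is an assumption that simply matches the package installed above.

```shell
# Hypothetical helper: reduce a Java version string to its major version.
java_major() {
  case "$1" in
    1.*) echo "$1" | cut -d. -f2 ;;  # legacy scheme, e.g. 1.8.0_121 -> 8
    *)   echo "$1" | cut -d. -f1 ;;  # newer scheme, e.g. 11.0.2 -> 11
  esac
}

# java -version writes to stderr; the version is the quoted field on line one.
if command -v java >/dev/null 2>&1; then
  version=$(java -version 2>&1 | awk -F'"' 'NR==1 {print $2}')
  major=$(java_major "$version")
  if [ "${major:-0}" -ge 8 ]; then
    echo "Java $major found"
  else
    echo "Java 8 or newer is expected, found: $version"
  fi
fi
```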

Install Scala

# wget http://downloads.lightbend.com/scala/2.12.1/scala-2.12.1.rpm
# rpm -ivh scala-2.12.1.rpm

Now download Apache Spark from the official site (this article uses version 2.1.0, prebuilt for Hadoop 2.7). Then extract the archive and copy it to the directory /usr/local/spark.

# wget http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz
--2017-02-08 03:48:57--  http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz
Resolving d3kbcqa49mib13.cloudfront.net (d3kbcqa49mib13.cloudfront.net)...
Connecting to d3kbcqa49mib13.cloudfront.net (d3kbcqa49mib13.cloudfront.net)... connected.
HTTP request sent, awaiting response... 200 OK
Length: 195636829 (187M) [application/x-tar]
Saving to: ‘spark-2.1.0-bin-hadoop2.7.tgz’

100%[============================================================================================>] 195,636,829  187KB/s   in 9m 19s

2017-02-08 03:58:16 (342 KB/s) - ‘spark-2.1.0-bin-hadoop2.7.tgz’ saved [195636829/195636829]
# tar -xzvf spark-2.1.0-bin-hadoop2.7.tgz
# mv spark-2.1.0-bin-hadoop2.7 spark
# cp -rf spark/ /usr/local/
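
As an alternative to the separate extract, rename, and copy commands above, the archive can be unpacked straight into its final directory. This is a minimal sketch, assuming a tar that supports `--strip-components` (GNU tar, the default on CentOS 7 and RHEL 7, does); the helper name `extract_to` is hypothetical.

```shell
# Hypothetical helper: unpack a .tgz into a target directory, dropping the
# archive's top-level folder (for example spark-2.1.0-bin-hadoop2.7/).
extract_to() {
  archive="$1"
  dest="$2"
  mkdir -p "$dest" &&
    tar -xzf "$archive" -C "$dest" --strip-components=1
}
```

For the download above, it would be called as root with `extract_to spark-2.1.0-bin-hadoop2.7.tgz /usr/local/spark`, which leaves bin/, sbin/, and conf/ directly under /usr/local/spark.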

Then set the environment variables and PATH, and reload the shell profile so that they take effect.
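
The variables themselves are not listed here. The following is a minimal sketch of what they might look like, assuming Spark was copied to /usr/local/spark as in the previous step. The helper name `add_spark_env` is hypothetical, and adding both bin and sbin to PATH is an assumption rather than something this article specifies.

```shell
# Hypothetical helper: append the Spark variables to a profile file, once.
# $1 = profile file, $2 = Spark directory (assumed default: /usr/local/spark)
add_spark_env() {
  profile="$1"
  spark_home="${2:-/usr/local/spark}"
  # Skip the append if SPARK_HOME is already exported in the file.
  if ! grep -q '^export SPARK_HOME=' "$profile" 2>/dev/null; then
    {
      echo "export SPARK_HOME=$spark_home"
      # Single quotes keep $PATH unexpanded until the profile is sourced.
      echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin'
    } >> "$profile"
  fi
}
```

Run `add_spark_env ~/.bash_profile` once, then reload the profile: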

# source ~/.bash_profile

Now start the Spark shell by using the following command: