The pandemic brought us a lot of opportunities to spend time with our loved ones, and also with our home computers. If you are interested in data and comfortable working in Linux, setting up a big-data infrastructure similar to what big tech companies use is easier than you think. Follow this tutorial to get your toes wet with a home-built data lake and data warehouse.
Overview
MySQL is often used to store structured data and serve real-time traffic (think: showing your Facebook profile). This is what we call “online data”. Companies usually take a snapshot of all the online data at the end of the day, often through a tool called Sqoop, and store it in the data warehouse after some processing. This is what we call “offline data”. It enables the company to gain insights into how data changes over time, and to access historical data when needed.
I have some random data saved in my local MySQL DB. I will show you how to export it into a Hive data warehouse, which manages large amounts of data stored in Hadoop, a distributed file system that interacts with the hard disk on your computer. Then we’ll run queries against the data using Spark, a data processing engine that retrieves the data through Hive and transforms it into the format we desire.
The following are the software versions I used. There may be compatibility issues if your versions differ too much.
- Java (1.8.0_291)
- Ubuntu (20.04.1)
- MySQL (8.0.23)
- Hadoop (3.2.2)
- Hive (3.1.2)
- Sqoop (1.4.7)
- Spark_2.12 (3.1.1)
Java
Download the Java JDK from the official site. For the best compatibility, I chose Java 8. After downloading and unzipping, set $JAVA_HOME to the folder where it’s saved.
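For example, a minimal sketch of what that looks like in ~/.bashrc (the install path below is just a placeholder; use wherever you actually unzipped the JDK):
# in ~/.bashrc; the path is a placeholder
export JAVA_HOME=/opt/jdk1.8.0_291
export PATH=$JAVA_HOME/bin:$PATH
# reload and verify
source ~/.bashrc
java -version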
MySQL
MySQL can be installed with a simple command. I also followed this guide to secure the server and set up users.
sudo apt update; sudo apt install mysql-server
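If you want a quick sanity check before moving on, something like the following should work (mysql_secure_installation is the hardening script the guide above walks through):
# check that the MySQL service is running
sudo systemctl status mysql
# optional hardening walkthrough
sudo mysql_secure_installation
# open a shell as root to confirm you can connect
sudo mysql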
Hadoop
1. Download Hadoop 3.2.2 from the official site.
2. Follow the official guide to set up a “Pseudo-Distributed Operation”, aka a single node (the Hadoop docs are so good that simply following them will suffice).
3. One thing missing is that, by default, Hadoop saves data into the /tmp folder, which is wiped at each system restart. Therefore, we need to persist it somewhere else. Edit $HADOOP_HOME/etc/hadoop/core-site.xml, and modify the key below (replace {username} with your own):
<property>
<name>hadoop.tmp.dir</name>
<value>/home/{username}/hadoop-data</value>
</property>
4. Make sure JAVA_HOME is set in $HADOOP_HOME/etc/hadoop/hadoop-env.sh
5. Make sure Hadoop is up at this point (the end of the guide lists ways to check it; see also the sketch below).
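A minimal sanity check, assuming the pseudo-distributed setup from the official guide (jps ships with the JDK; port 9870 is the default NameNode web UI in Hadoop 3.x):
# NameNode, DataNode and SecondaryNameNode should all show up
jps
# HDFS should answer basic commands
hdfs dfs -ls /
# the NameNode web UI should be reachable at http://localhost:9870
Note that if you changed hadoop.tmp.dir after the namenode was already formatted, the old metadata may still live under /tmp. In that case one option is to stop HDFS with sbin/stop-dfs.sh, re-run bin/hdfs namenode -format (this wipes any existing HDFS data), and start it again with sbin/start-dfs.sh.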
Hive
1. Download Hive 3.1.2 from the official site.
2. Pick a folder to unzip the file, and set $HIVE_HOME to point to that folder.
3. Create a folder in Hadoop to store Hive logs, where {username} is any string you choose. I used my Linux log-in name.
hdfs dfs -mkdir -p /tmp/{username}
hdfs dfs -chmod g+w /tmp/{username}
4. Create a folder in Hadoop to store actual Hive data. Hive works on top of Hadoop, and will save all its data into this folder.
hdfs dfs -mkdir -p /user/hive/warehouse
# change the permission
hdfs dfs -chmod g+w /user/hive/warehouse
# make sure it's created
hdfs dfs -ls /user/hive/warehouse
5. Create a Hive conf file from the template provided.
cd $HIVE_HOME/conf
cp hive-default.xml.template hive-site.xml
6. Look for the keys below in hive-site.xml and replace their values with the ones shown.
<property>
<name>hive.exec.local.scratchdir</name>
<value>/tmp/${user.name}</value>
<description>Local scratch space for Hive jobs</description>
</property>
<property>
<name>hive.downloaded.resources.dir</name>
<value>/tmp/${user.name}_resources</value>
<description>Temporary local directory for added resources in the remote file system.</description>
</property>
7. In the same file, also change the value of this key, to use the folder we created earlier.
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
8. There is an invalid character (&#8;) at line 3215 of this conf file. Remove it.
9. Another reason to use MySQL is to power Hive’s metastore with it, instead of the Derby DB that comes with Hive. The Hive metastore is a database that stores metadata about the data in Hive. Without it, Hive wouldn’t know where and how the data is stored in Hadoop.
(Optional) Create a special user in MySQL to access the metastore (replace username and password).
CREATE DATABASE metastore;
CREATE USER 'username'@'localhost' IDENTIFIED BY 'password';
GRANT ALL PRIVILEGES ON metastore.* TO 'username'@'localhost';
flush privileges;
10. Hive needs the JDBC driver to access MySQL data. Download the jar and save it into $HIVE_HOME/lib
11. Edit the same hive-site.xml conf file to specify how Hive should access MySQL.
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
<name>javax.jdo.option.ConnectionUserName</name>
<value>username</value>
<name>javax.jdo.option.ConnectionPassword</name>
<value>password</value>
12. At this point, we can initialize the metastore, and start using Hive.
cd $HIVE_HOME
bin/schematool --dbType mysql --initSchema
# Hive should be set up at this point
bin/hive
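To double-check that Hive is wired up to both Hadoop and the MySQL metastore, a quick smoke test like the following should work (the database and table names here are just placeholders):
# create a test database and table, then query it
bin/hive -e "CREATE DATABASE IF NOT EXISTS playground"
bin/hive -e "CREATE TABLE playground.greetings (msg STRING)"
bin/hive -e "INSERT INTO playground.greetings VALUES ('hello hive')"
bin/hive -e "SELECT * FROM playground.greetings"
# the data should land under the warehouse folder we created
hdfs dfs -ls /user/hive/warehouse/playground.db/greetings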
Sqoop
I mentioned earlier that companies often use Sqoop to export data from MySQL into Hive. We can give it a try. If you don’t already have data saved in MySQL, you can import data into Hive directly; here’s an example of importing CSV data into Hive.
1. Make sure YARN & Hadoop are running. Hadoop must be running in distributed mode (the pseudo-distributed mode we set up works as well).
2. Download Sqoop 1.4.7 from the official site.
3. Pick a folder to unzip the file, and set $SQOOP_HOME to point to that folder.
4. Set up env variables in $SQOOP_HOME/conf/sqoop-env.sh.
export HADOOP_HOME="..."
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
5. The commons-lang3-3.4.jar will cause incompatibility issues with Hive; replace it with commons-lang-2.6.jar instead.
cd $SQOOP_HOME/lib
rm commons-lang3-3.4.jar
cp $HIVE_HOME/lib/commons-lang-2.6.jar .
6. Make sure these values are set correctly in the Hive conf $HIVE_HOME/conf/hive-site.xml.
<name>datanucleus.schema.autoCreateAll</name>
<value>true</value>
<name>hive.metastore.schema.verification</name>
<value>false</value>
7. There are 2 jars that need to be copied over into $SQOOP_HOME/lib (see the copy commands below).
- $HADOOP_HOME/share/hadoop/tools/lib/hadoop-distcp-x.x.x.jar
- $HIVE_HOME/lib/hive-common-xxx.jar
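Something like the following should do it; the wildcards are there because the exact version suffixes depend on your Hadoop and Hive downloads:
# copy the two jars into Sqoop's lib folder
cp $HADOOP_HOME/share/hadoop/tools/lib/hadoop-distcp-*.jar $SQOOP_HOME/lib/
cp $HIVE_HOME/lib/hive-common-*.jar $SQOOP_HOME/lib/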
8. Pick a folder where you’ll run the command, and add a link to $HIVE_HOME/conf/hive-site.xml in that folder (see the symlink sketch below).
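Assuming you pick ~/sqoop-jobs as the working folder (any folder works; the name is just a placeholder), the link could look like this:
# link Hive's conf into the folder Sqoop runs from
mkdir -p ~/sqoop-jobs && cd ~/sqoop-jobs
ln -s $HIVE_HOME/conf/hive-site.xml hive-site.xml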
9. Now we should be ready to import the data. Here’s a sample command I used to import all tables from a MySQL database db_name into Hive. Before running it, I created a database db_name in Hive (replace db_name, username, password).
$SQOOP_HOME/bin/sqoop import-all-tables --num-mappers 1 --connect "jdbc:mysql://localhost:3306/db_name" --username=username --password=password --hive-import --hive-database db_name --outdir output_dir
You can find more parameters on the official site.
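Once the command finishes, a quick way to confirm the import actually landed in Hive (db_name is the same placeholder as above, and the path assumes the default warehouse folder we configured):
# the imported tables should be listed under the Hive database
$HIVE_HOME/bin/hive -e "SHOW TABLES IN db_name"
# and the files should be visible in the warehouse folder
hdfs dfs -ls /user/hive/warehouse/db_name.db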
It took me several tries to get it working. Here are some common errors I ran into.
1. Output directory hdfs://localhost:9000/xxx already exists
# simply remove the folder with hadoop fs -rm -r xxx
2. MissingTableException: Required table missing : "VERSION" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
# make sure Step 8 is executed correctly
3. Data imported but cannot find DB
# make sure the `javax.jdo.option.ConnectionURL` conf in hive-site.xml is set to an absolute path
4. Directory file:xxx could not be cleaned up.
# The metastore may not be consistent with Hadoop's conf fs. Drop the `metastore` database in MySQL, then re-init the metastore.
Spark
Spark is by far the hardest one to set up, because the driver we use to connect to Hive is compiled against Hive 2.3.7, the previous major version of Hive. We’ll need to recompile this jar to work around the incompatibility issues.
1. Download Spark_2.12 3.1.1 from the official site. Since we already set up Hadoop on our machine, we can pick the “pre-built with user-provided Apache Hadoop” type.
2. Pick a folder to unzip the file, and set $SPARK_HOME to point to that folder.
3. There’s a jar called jline2 that’s missing from Spark’s lib. Unfortunately I could only find a GitHub repo online, so we need to download the source and compile it into a jar, then add it to Spark’s classpath.
# install git, maven
git clone git@github.com:jline/jline2.git
cd jline2
mvn clean package
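# if the build succeeds, the packaged jar should end up under jline2/target/; use that path as jline2_jar_loc in the next step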
4. Set up these env vars in Spark’s conf file $SPARK_HOME/conf/spark-env.sh. Make sure the compiled jline2 jar is added to the classpath (replace jline2_jar_loc).
export HADOOP_HOME="..."
export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"
export HIVE_HOME="..."
export SPARK_DIST_CLASSPATH="{jline2_jar_loc}:"$(hadoop classpath)":$HIVE_HOME/lib/*"
5. In order to connect to Hive, we need a 3rd-party lib called spark-hive. Like I mentioned earlier, it’s unfortunately not compatible with Hive 3.1.2. We need to download the source, fix the compilation errors, and rebuild the jar against Hive 3.1.2. I made a quick-and-dirty fix that’ll only work with the Spark and Hive versions we are using. You can download the source from my GitHub repo, compile it with mvn clean package, then copy the jar into $SPARK_HOME/jars.
If you are interested in working on a proper fix, the sources jar can be downloaded here.
6. Finally, after all that work, we are ready to run some Hive queries in Spark. I used this (replace db_name and my_table):
$SPARK_HOME/bin/spark-shell
scala> import org.apache.spark.sql.{Row, SaveMode, SparkSession}
scala> val spark = SparkSession.builder().appName("Spark Hive Example").enableHiveSupport().getOrCreate()
scala> spark.sql("USE db_name")
scala> spark.sql("select * from my_table limit 10").show()
7. Voila! Think about the other things you can do with Spark, with data coming from Hive; a tiny example is sketched below.
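As a trivial example of where to go from here (still inside the same spark-shell session; my_table remains a placeholder), you can let Spark do an aggregation and pull the result back as a DataFrame:
scala> val counts = spark.sql("select count(*) as cnt from my_table")
scala> counts.show()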
Final Words
Congratulations on setting up a mini big-data infrastructure on your home computer. Some of the things I plan on doing next include streaming data out of Hive and ingesting it into Elasticsearch for quick search results.
Needless to say, this tutorial includes a lot of steps. You can of course try them out separately. For example, Spark comes with its own Hadoop package, so you can download it and run a standalone local instance. But I think it’s also interesting to set up the entire stack to get a sense of how things are done in big tech companies.
Thanks for reading. Let me know if anything should be changed, or suggestions on fun ideas to try out.