Data Analysis with Apache HDFS, Hive, Sqoop
On the way to becoming a Data Engineer. I like managing data, and that is why I want to be one.
Introduction
Apache Hadoop plays an efficient role in big data: it can both store and process data. This article briefly covers the core Hadoop concepts. After the definitions, a project shows how to use Apache HDFS, Hive, Sqoop, and PostgreSQL together. The project is not primarily about data analysis; it is about data engineering.
Hadoop Distributed File System (HDFS)
HDFS is written in Java and is a scalable, portable, distributed file system. A basic Hadoop setup consists of HDFS and MapReduce: Hadoop uses HDFS to store data and MapReduce to process it. HDFS has five services, listed below; a short Python sketch of storing and listing files on HDFS follows the list. Those are:
- Name Node
- Job Tracker
- Secondary Name Node
- Data Node
- Task Tracker
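As a small usage sketch (not part of the original project), the following Python snippet stores and lists a file on HDFS through pyarrow's HadoopFileSystem client. The NameNode host and port, the /tmp/demo path, and the availability of libhdfs on the machine are assumptions for a local test cluster.

# Store and list a file on HDFS with pyarrow (sketch; host, port, and paths are assumptions)
from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="localhost", port=8020)       # talk to the Name Node

hdfs.create_dir("/tmp/demo")                                   # directory metadata lives on the Name Node
with hdfs.open_output_stream("/tmp/demo/hello.txt") as f:      # file blocks are written to Data Nodes
    f.write(b"hello hdfs\n")

for info in hdfs.get_file_info(fs.FileSelector("/tmp/demo")):  # list the directory contents
    print(info.path, info.size)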
MapReduce
MapReduce is the programming model in Hadoop used to process the data stored in HDFS.
How does it work?
There are two functions in MapReduce: Map and Reduce. They run one after the other.
- Map takes its input as <key, value> pairs from disk, processes it, and produces output as <key, value> pairs.
- Reduce takes the Map output as <key, value> pairs and produces the final output as <key, value> pairs.
The keys and values differ depending on the use case. All inputs and outputs are stored in HDFS. Map is required for filtering and sorting the data; Reduce is optional. A minimal word-count sketch in the Hadoop Streaming style is given below.
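As a hedged illustration (not part of the original project), the classic word count shows both functions: the mapper emits <word, 1> pairs and the reducer sums them per key. The file names mapper.py and reducer.py are assumptions.

# mapper.py: word-count mapper sketch for Hadoop Streaming (illustrative)
import sys

for line in sys.stdin:                     # each input record arrives on stdin
    for word in line.strip().split():
        print(f"{word}\t1")                # emit <key, value> = <word, 1>

# reducer.py: sums the counts per key (Hadoop sorts map output by key before reduce)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")   # emit <word, total>
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

These two scripts could then be submitted with the hadoop-streaming jar (the exact jar path and input/output directories depend on the cluster).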
YARN
YARN is Hadoop's resource-management and job-scheduling technology. It is a core component of Apache Hadoop, responsible for allocating system resources to running applications and for scheduling their tasks on the cluster nodes.
Analyze Data
import pandas as pd

# Load the customer dataset; columns are separated by the pipe character
dataframe = pd.read_csv("flo100k.csv", sep="|")
dataframe.head()  # preview the first rows
There are many features in this data. By analyzing them, the company can find answers to questions such as the following (a short pandas sketch answering them is given after the list):
- Which platform type do customers prefer?
- Which channel do customers use on the platform?
- What is the customer’s budget for shopping?
- Which categories do customers shop in?
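A minimal pandas sketch of how these questions could be answered on the dataframe loaded above (the column names come from the schema shown later; the output is illustrative):

# A quick look at the business questions with pandas (illustrative, not exhaustive)
import pandas as pd

dataframe = pd.read_csv("flo100k.csv", sep="|")

# Which platform type do customers prefer?
print(dataframe["platform_type"].value_counts())

# Which channel do customers use on the platform?
print(dataframe["order_channel"].value_counts())

# What is the customer's budget for shopping (online + offline spend)?
total_value = (dataframe["customer_value_total_ever_online"]
               + dataframe["customer_value_total_ever_offline"])
print(total_value.describe())

# Which categories do customers shop in?
print(dataframe["interested_in_categories_12"].value_counts().head(10))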
Features
Dataframe Schemas for PostgreSQL & Apache Hive
For these schemas, look at the feature names and their data types. The data types change depending on which SQL engine is used; the schemas here are written for PostgreSQL and Apache Hive and mirror the dataframe structure (an illustrative type mapping is sketched below). From now on, we continue with the PostgreSQL schema.
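To make the point about engine-specific types concrete, a small, simplified mapping from pandas dtypes to PostgreSQL and Hive types might look like the sketch below. It is an illustration only; the actual schemas in this project were written by hand.

# Illustrative dtype mapping: a simplified assumption, not the project's real conversion code
import pandas as pd

dataframe = pd.read_csv("flo100k.csv", sep="|")

# object columns roughly map to varchar in PostgreSQL and string in Hive;
# float64 columns map to float in both engines
type_map = {"object": ("varchar(100)", "string"), "float64": ("float", "float")}

for column, dtype in dataframe.dtypes.items():
    pg_type, hive_type = type_map.get(str(dtype), ("varchar(100)", "string"))
    print(f"{column}: postgresql={pg_type}, hive={hive_type}")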
For the application, the steps are:
- Create the table for the PostgreSQL schema.
- Import the CSV file data into the table.
- Transfer the table from PostgreSQL to Apache HDFS with Apache Sqoop.
- Transfer the table from PostgreSQL to Apache Hive with Apache Sqoop.
- Create another Apache Hive table in ORC format.
- Analyze the data on Apache Hive.
Apache: HDFS, Hive, Sqoop and PostgreSQL Application
In this chapter we see how the Apache tools and PostgreSQL communicate with each other. First, a table is created for the dataframe, then the CSV file is imported into PostgreSQL. Next, the whole table with its data is transferred to Apache HDFS and Hive with Apache Sqoop. The transferred table is not efficient as it stands, so it is converted into an ORC-formatted table. Finally, a few data analyses are run on Apache Hive.
Creating Part
create table if not exists customers (
master_id varchar(100),
order_channel varchar(100),
platform_type varchar(25),
last_order_channel varchar(25),
first_order_date varchar(75),
last_order_date varchar(75),
last_order_date_online varchar(75),
last_order_date_offline varchar(75),
order_num_total_ever_online float,
order_num_total_ever_offline float,
customer_value_total_ever_offline float,
customer_value_total_ever_online float,
interested_in_categories_12 varchar(75),
online_product_group_amount_top_name_12 varchar(75),
offline_product_group_name_12 varchar(75),
last_order_date_new varchar(75),
store_type varchar(50)
);
The statement above creates the table without any data. That means the column names and their types are defined, and the table is ready for the incoming data.
Importing Part
psql -d traindb -c "\copy customers from 'flo100k.csv' DELIMITERS '|' CSV HEADER;"
This is a console command. PostgreSQL uses the copy command to import a CSV file into a table. The command first calls the psql client, then selects the database (-d traindb) and passes the SQL command to run (-c). A Python alternative is sketched below.
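The same import could also be done from Python; a minimal sketch with psycopg2 follows. The database, user, and table names match the traindb/train setup used in this article, but the host and password handling are assumptions.

# Import the CSV into the customers table from Python (sketch; credentials are assumptions)
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="traindb", user="train", password="changeme")
with conn, conn.cursor() as cur, open("flo100k.csv") as f:
    cur.copy_expert("COPY customers FROM STDIN WITH (FORMAT csv, HEADER true, DELIMITER '|')", f)
conn.close()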
Transportation Part
sqoop import --connect jdbc:postgresql://localhost:5432/traindb --driver org.postgresql.Driver --username train --password-file file:///home/train/.sqoop.password --table customers --split-by master_id --hive-import --create-hive-table --hive-table test1.customer --target-dir /tmp/c
In the command above, Sqoop connects to the PostgreSQL server. After that we pass the JDBC driver, the username, the password file, which PostgreSQL table to read, which column to split the import by, and the Hive options. Since the Hive table is new, --create-hive-table is used. Finally, a temporary target directory is set for the transfer.
SELECT * FROM customer limit 5;
As seen, the transfer was successful.
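The same check can also be run from Python through HiveServer2; a hedged sketch with PyHive is shown below, where the host, port, and username are assumptions for this local cluster.

# Query the imported Hive table over HiveServer2 (sketch; connection details are assumptions)
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="train", database="test1")
cursor = conn.cursor()
cursor.execute("SELECT * FROM customer LIMIT 5")   # same check as above
for row in cursor.fetchall():
    print(row)
conn.close()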
Optimization Part
Here we optimize the table for Apache Hive. Hive's performance depends heavily on the data format, and an ORC-formatted table typically gives the best performance. The command below creates a new table from the existing one with the ORC file format.
CREATE TABLE customer_orc stored as orc as SELECT * FROM test1.customer;
After the conversion, the table's storage footprint shrinks considerably.
hdfs dfs -du -h /user/hive/warehouse/test1.db | grep customer
Storage = 20.1 M → 4.3 M
Data Analyses Part
These are basic but important queries. The first gives the total count per store type, and the second gives the total count per order channel.
SELECT store_type , COUNT(*) as total_count FROM customer_orc
GROUP BY store_type
ORDER BY total_count DESC;
store_type|total_count|
----------|-----------|
A | 89225|
A,B | 8498|
B | 1491|
A,C | 702|
A,B,C | 75|
B,C | 10|
SELECT order_channel, COUNT(*) as total_count FROM customer_orc
GROUP BY order_channel
ORDER BY total_count DESC;
order_channel|total_count|
-------------|-----------|
Offline | 70785|
Android App | 11989|
Mobile | 8512|
Desktop | 4751|
Ios App | 3964|
Conclusion
This article has covered Apache HDFS, Hive, and Sqoop, and how to use them together. The project is meant as preparation for real-world work and is aimed at anyone who wants to become a data engineer.