Data Analysis with Apache HDFS, Hive, Sqoop
On the way to becoming a Data Engineer. I like managing data, and that is why I want to be one.
Introduction
Apache Hadoop plays an efficient role in big data: it can both store and process data. This article briefly covers the core Hadoop concepts. After the definitions, a project shows how to use Apache HDFS, Hive, Sqoop, and PostgreSQL together. The project is not primarily about data analysis; it is about data engineering.
Hadoop Distributed File System (HDFS)
HDFS is written in Java and is a scalable, portable, distributed file system. A basic Hadoop setup consists of HDFS and MapReduce: Hadoop uses HDFS to store data and MapReduce to process it. HDFS has five services, listed below; a short Python sketch of storing and listing files on HDFS follows the list. Those are:
- Name Node
- Job Tracker
- Secondary Name Node
- Data Node
- Task Tracker
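As a small usage sketch (not part of the original project), the following Python snippet stores and lists a file on HDFS through pyarrow's HadoopFileSystem client. The NameNode host and port, the /tmp/demo path, and the availability of libhdfs on the machine are assumptions for a local test cluster.

# Store and list a file on HDFS with pyarrow (sketch; host, port, and paths are assumptions)
from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="localhost", port=8020)       # talk to the Name Node

hdfs.create_dir("/tmp/demo")                                   # directory metadata lives on the Name Node
with hdfs.open_output_stream("/tmp/demo/hello.txt") as f:      # file blocks are written to Data Nodes
    f.write(b"hello hdfs\n")

for info in hdfs.get_file_info(fs.FileSelector("/tmp/demo")):  # list the directory contents
    print(info.path, info.size)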
MapReduce
MapReduce is the programming model in Hadoop used to process the data stored in HDFS.
How does it work?
There are two functions in MapReduce: Map and Reduce. They run one after the other.
- Map takes its input as <key, value> pairs from disk, processes it, and produces output as <key, value> pairs.
- Reduce takes the Map output as <key, value> pairs and produces the final output as <key, value> pairs.
The keys and values differ depending on the use case. All inputs and outputs are stored in HDFS. Map is required for filtering and sorting the data; Reduce is optional. A minimal word-count sketch in the Hadoop Streaming style is given below.
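As a hedged illustration (not part of the original project), the classic word count shows both functions: the mapper emits <word, 1> pairs and the reducer sums them per key. The file names mapper.py and reducer.py are assumptions.

# mapper.py: word-count mapper sketch for Hadoop Streaming (illustrative)
import sys

for line in sys.stdin:                     # each input record arrives on stdin
    for word in line.strip().split():
        print(f"{word}\t1")                # emit <key, value> = <word, 1>

# reducer.py: sums the counts per key (Hadoop sorts map output by key before reduce)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")   # emit <word, total>
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

These two scripts could then be submitted with the hadoop-streaming jar (the exact jar path and input/output directories depend on the cluster).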
YARN
YARN is Hadoop's resource-management and job-scheduling technology. It is a core component of Apache Hadoop, responsible for allocating system resources to running applications and for scheduling their tasks on the cluster nodes.
Analyze Data
import pandas as pd

# Load the customer dataset; columns are separated by the pipe character
dataframe = pd.read_csv("flo100k.csv", sep="|")
dataframe.head()  # preview the first rows
There are many features in this data. By analyzing them, the company can find answers to questions such as the following (a short pandas sketch answering them is given after the list):
- Which platform type do customers prefer?
- Which channel do customers use on the platform?
- What is the customer’s budget for shopping?
- Which categories do customers shop in?
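A minimal pandas sketch of how these questions could be answered on the dataframe loaded above (the column names come from the schema shown later; the output is illustrative):

# A quick look at the business questions with pandas (illustrative, not exhaustive)
import pandas as pd

dataframe = pd.read_csv("flo100k.csv", sep="|")

# Which platform type do customers prefer?
print(dataframe["platform_type"].value_counts())

# Which channel do customers use on the platform?
print(dataframe["order_channel"].value_counts())

# What is the customer's budget for shopping (online + offline spend)?
total_value = (dataframe["customer_value_total_ever_online"]
               + dataframe["customer_value_total_ever_offline"])
print(total_value.describe())

# Which categories do customers shop in?
print(dataframe["interested_in_categories_12"].value_counts().head(10))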
Features
Dataframe Schemas for PostgreSQL & Apache Hive
For these schemas, look at the feature names and their data types. The data types change depending on which SQL engine is used; the schemas here are written for PostgreSQL and Apache Hive and mirror the dataframe structure (an illustrative type mapping is sketched below). From now on, we continue with the PostgreSQL schema.
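To make the point about engine-specific types concrete, a small, simplified mapping from pandas dtypes to PostgreSQL and Hive types might look like the sketch below. It is an illustration only; the actual schemas in this project were written by hand.

# Illustrative dtype mapping: a simplified assumption, not the project's real conversion code
import pandas as pd

dataframe = pd.read_csv("flo100k.csv", sep="|")

# object columns roughly map to varchar in PostgreSQL and string in Hive;
# float64 columns map to float in both engines
type_map = {"object": ("varchar(100)", "string"), "float64": ("float", "float")}

for column, dtype in dataframe.dtypes.items():
    pg_type, hive_type = type_map.get(str(dtype), ("varchar(100)", "string"))
    print(f"{column}: postgresql={pg_type}, hive={hive_type}")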
For the application, the steps are:
- Create the table for the PostgreSQL schema.
- Import the CSV file data into the table.
- Transfer the table from PostgreSQL to Apache HDFS with Apache Sqoop.
- Transfer the table from PostgreSQL to Apache Hive with Apache Sqoop.
- Create another Apache Hive table in ORC format.
- Analyze the data on Apache Hive.
Apache: HDFS, Hive, Sqoop and PostgreSQL Application
In this chapter we see how the Apache tools and PostgreSQL communicate with each other. First, a table is created for the dataframe, then the CSV file is imported into PostgreSQL. Next, the whole table with its data is transferred to Apache HDFS and Hive with Apache Sqoop. The transferred table is not efficient as it stands, so it is converted into an ORC-formatted table. Finally, a few data analyses are run on Apache Hive.
Creating Part
create table if not exists customers (
master_id varchar(100),
order_channel varchar(100),
platform_type varchar(25),
last_order_channel varchar(25),
first_order_date varchar(75),
last_order_date varchar(75),
last_order_date_online varchar(75),
last_order_date_offline varchar(75),
order_num_total_ever_online float,
order_num_total_ever_offline float,
customer_value_total_ever_offline float,
customer_value_total_ever_online float,
interested_in_categories_12 varchar(75),
online_product_group_amount_top_name_12 varchar(75),
offline_product_group_name_12 varchar(75),
last_order_date_new varchar(75),
store_type varchar(50)
);
The statement above creates the table without any data. That means the column names and their types are defined, and the table is ready for the incoming data.
Importing Part
psql -d traindb -c "\copy customers from 'flo100k.csv' DELIMITERS '|' CSV HEADER;"
This is a console command. PostgreSQL uses the copy command to import a CSV file into a table. The command first calls the psql client, then selects the database (-d traindb) and passes the SQL command to run (-c). A Python alternative is sketched below.
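The same import could also be done from Python; a minimal sketch with psycopg2 follows. The database, user, and table names match the traindb/train setup used in this article, but the host and password handling are assumptions.

# Import the CSV into the customers table from Python (sketch; credentials are assumptions)
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="traindb", user="train", password="changeme")
with conn, conn.cursor() as cur, open("flo100k.csv") as f:
    cur.copy_expert("COPY customers FROM STDIN WITH (FORMAT csv, HEADER true, DELIMITER '|')", f)
conn.close()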
Transportation Part
sqoop import --connect jdbc:postgresql://localhost:5432/traindb --driver org.postgresql.Driver --username train --password-file file:///home/train/.sqoop.password --table customers --split-by master_id --hive-import --create-hive-table --hive-table test1.customer --target-dir /tmp/c
In the command above, Sqoop connects to the PostgreSQL server. After that we pass the JDBC driver, the username, the password file, which PostgreSQL table to read, which column to split the import by, and the Hive options. Since the Hive table is new, --create-hive-table is used. Finally, a temporary target directory is set for the transfer.
SELECT * FROM customer limit 5;
As seen, the transfer was successful.
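The same check can also be run from Python through HiveServer2; a hedged sketch with PyHive is shown below, where the host, port, and username are assumptions for this local cluster.

# Query the imported Hive table over HiveServer2 (sketch; connection details are assumptions)
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="train", database="test1")
cursor = conn.cursor()
cursor.execute("SELECT * FROM customer LIMIT 5")   # same check as above
for row in cursor.fetchall():
    print(row)
conn.close()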
Optimization Part
Here we optimize the table for Apache Hive. Hive's performance depends heavily on the data format, and an ORC-formatted table typically gives the best performance. The command below creates a new table from the existing one with the ORC file format.
CREATE TABLE customer_orc stored as orc as SELECT * FROM test1.customer;
After the conversion, the table's storage footprint shrinks considerably.
hdfs dfs -du -h /user/hive/warehouse/test1.db | grep customer
Storage = 20.1 M → 4.3 M
Data Analyses Part
These are basic but important queries. The first gives the total count per store type, and the second gives the total count per order channel.
SELECT store_type , COUNT(*) as total_count FROM customer_orc
GROUP BY store_type
ORDER BY total_count DESC;
store_type|total_count|
----------|-----------|
A | 89225|
A,B | 8498|
B | 1491|
A,C | 702|
A,B,C | 75|
B,C | 10|
SELECT order_channel, COUNT(*) as total_count FROM customer_orc
GROUP BY order_channel
ORDER BY total_count DESC;
order_channel|total_count|
-------------|-----------|
Offline | 70785|
Android App | 11989|
Mobile | 8512|
Desktop | 4751|
Ios App | 3964|
Conclusion
This article has covered Apache HDFS, Hive, and Sqoop, and how to use them together. The project is meant as preparation for real-world work and is aimed at anyone who wants to become a data engineer.