Data Analysis with Apache HDFS, Hive, Sqoop

Introduction

Apache Hadoop plays an efficient role in big data: it can both store data and process data. This article first touches briefly on the core Apache Hadoop concepts. After the definitions, a project shows how to use Apache HDFS, Hive, Sqoop, and PostgreSQL together. The project is less about data analysis and more about data engineering.

Hadoop Distributed File System (HDFS)

HDFS is the storage layer of Apache Hadoop. A classic Hadoop cluster runs the daemons below; a short Python sketch of reaching HDFS over its REST interface follows the list.

  • Name Node
  • Job Tracker
  • Secondary Name Node
  • Data Node
  • Task Tracker
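
The NameNode serves the WebHDFS REST interface, so HDFS can be browsed from plain Python. A minimal sketch, assuming a Hadoop 3.x NameNode on localhost (default HTTP port 9870; Hadoop 2.x clusters use 50070 instead):

import requests  # third-party HTTP client: pip install requests

# List the contents of /tmp through the WebHDFS REST gateway.
resp = requests.get("http://localhost:9870/webhdfs/v1/tmp?op=LISTSTATUS")
for status in resp.json()["FileStatuses"]["FileStatus"]:
    print(status["type"], status["pathSuffix"])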

MapReduce

MapReduce is the programming model in Hadoop that is used to process data stored in HDFS.

How does it work?

There are two functions in MapReduce: Map and Reduce. They run one after the other, as the sketch after the following list shows.

  • Map takes its input as <key, value> pairs from disk, processes it, and emits intermediate <key, value> pairs as output.
  • Reduce takes the intermediate <key, value> pairs produced by Map and produces the final <key, value> output.
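
To make the model concrete, here is a minimal word-count sketch in plain Python that mimics the two phases; the shuffle step that groups intermediate keys is simulated with a dictionary, whereas a real Hadoop job runs these functions in parallel across the cluster:

from collections import defaultdict

def map_phase(line):
    # Map: emit an intermediate <word, 1> pair for every word in the line.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # Reduce: collapse all values for one key into a final <word, total> pair.
    return (word, sum(counts))

lines = ["big data with hadoop", "hadoop stores big data"]

# Shuffle/sort: group intermediate values by key.
grouped = defaultdict(list)
for line in lines:
    for key, value in map_phase(line):
        grouped[key].append(value)

# Reduce phase: one call per distinct key.
for word, counts in sorted(grouped.items()):
    print(reduce_phase(word, counts))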

[Figure: MapReduce architecture]

YARN

YARN (Yet Another Resource Negotiator) is Hadoop's resource management and job scheduling technology. It is a core component of Apache Hadoop and is responsible for allocating system resources to applications and for scheduling tasks on the cluster's nodes.
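
YARN's ResourceManager exposes a REST API that reports this resource picture directly. A minimal sketch, assuming the ResourceManager web UI is reachable at its default address localhost:8088:

import requests  # third-party HTTP client: pip install requests

# Ask the ResourceManager for cluster-wide resource metrics.
resp = requests.get("http://localhost:8088/ws/v1/cluster/metrics")
metrics = resp.json()["clusterMetrics"]
print("Total memory (MB):", metrics["totalMB"])
print("Allocated memory (MB):", metrics["allocatedMB"])
print("Running applications:", metrics["appsRunning"])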

[Figure: YARN architecture]

Analyze Data

import pandas as pd

# Load the pipe-delimited customer dataset and peek at the first rows.
dataframe = pd.read_csv("flo100k.csv", sep="|")
dataframe.head()
[Figure: dataframe.head() output]
  • Which platform type do customers prefer?
  • Which channel do customers use on the platform?
  • What is the customer’s budget for shopping?
  • Which categories do customers shop in?

The aim is to find answers to these questions, as sketched below.
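
This first pass uses pandas, continuing from the dataframe loaded above; the column names come from the PostgreSQL schema further down and are assumed to match the csv headers:

# Preferred platform type and order channel.
print(dataframe["platform_type"].value_counts())
print(dataframe["order_channel"].value_counts())

# Shopping budget: total online + offline spend per customer.
total_value = (dataframe["customer_value_total_ever_online"]
               + dataframe["customer_value_total_ever_offline"])
print(total_value.describe())

# Categories shopped in during the last 12 months.
print(dataframe["interested_in_categories_12"].value_counts().head())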

Features

[Figure: dataset features]

Dataframe Schemas for PostgreSQL & Apache Hive

[Figure: dataframe schemas for PostgreSQL and Apache Hive]
  1. Create the table for the PostgreSQL schema.
  2. Import the csv file data into the table.
  3. Transfer the table from PostgreSQL to Apache HDFS with Apache Sqoop.
  4. Transfer the table from PostgreSQL to Apache Hive with Apache Sqoop.
  5. Create another Apache Hive table in .orc format.
  6. Analyze the data on Apache Hive.

Apache HDFS, Hive, Sqoop, and PostgreSQL Application

This chapter shows how the Apache tools and PostgreSQL communicate with each other. First, create a table for the dataframe. Next, import the csv file into PostgreSQL. Then transfer the whole table, with its data, to Apache HDFS and Hive using Apache Sqoop. The transferred table is not efficient to query, so convert it to an .orc-formatted table. Finally, run a few data analyses on Apache Hive.

Creating Part

create table if not exists customers (
master_id varchar(100),
order_channel varchar(100),
platform_type varchar(25),
last_order_channel varchar(25),
first_order_date varchar(75),
last_order_date varchar(75),
last_order_date_online varchar(75),
last_order_date_offline varchar(75),
order_num_total_ever_online float,
order_num_total_ever_offline float,
customer_value_total_ever_offline float,
customer_value_total_ever_online float,
interested_in_categories_12 varchar(75),
online_product_group_amount_top_name_12 varchar(75),
offline_product_group_name_12 varchar(75),
last_order_date_new varchar(75),
store_type varchar(50)
);

Importing Part

psql -d traindb -c "\copy customers from 'flo100k.csv' DELIMITER '|' CSV HEADER;"
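
To confirm the load, a quick row count can also be run from Python. A minimal sketch, assuming the psycopg2 driver and a local traindb reachable as user train (any PostgreSQL client works just as well):

import psycopg2  # third-party PostgreSQL driver: pip install psycopg2-binary

# Connect to the same database the \copy command targeted.
conn = psycopg2.connect(dbname="traindb", user="train", host="localhost")
with conn.cursor() as cursor:
    cursor.execute("SELECT count(*) FROM customers;")
    print("rows loaded:", cursor.fetchone()[0])
conn.close()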

Transportation Part

sqoop import --connect jdbc:postgresql://localhost:5432/traindb --driver org.postgresql.Driver --username train --password-file file:///home/train/.sqoop.password --table customers --split-by master_id --hive-import --create-hive-table --hive-table test1.customer --target-dir /tmp/c

After the import finishes, verify the new table from the Hive shell (in the test1 database):

SELECT * FROM customer LIMIT 5;

Optimization Part

Here, optimize the Apache Hive side. Hive’s query performance depends heavily on the file format, and an .orc-formatted table gives the best performance. The command below creates a new table, in ORC format, from the previously created table; the hdfs command after it compares the sizes on disk.

CREATE TABLE customer_orc stored as orc as SELECT * FROM test1.customer;
hdfs dfs -du -h /user/hive/warehouse/test1.db | grep customer

Data Analyses Part

These are basic but important queries. The first gives us the total count per store type, and the second gives us the total count per order channel.

SELECT store_type, COUNT(*) AS total_count FROM customer_orc
GROUP BY store_type
ORDER BY total_count DESC;

store_type | total_count
-----------|------------
A          |       89225
A,B        |        8498
B          |        1491
A,C        |         702
A,B,C      |          75
B,C        |          10
SELECT order_channel, COUNT(*) AS total_count FROM customer_orc
GROUP BY order_channel
ORDER BY total_count DESC;

order_channel | total_count
--------------|------------
Offline       |       70785
Android App   |       11989
Mobile        |        8512
Desktop       |        4751
Ios App       |        3964
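
The remaining two questions, shopping budget and categories, can be answered with the same kind of queries. As a sketch, here they are run from Python with PyHive, assuming HiveServer2 is listening on its default port 10000:

from pyhive import hive  # third-party Hive client: pip install pyhive

# Connect to the test1 database created during the Sqoop import.
conn = hive.Connection(host="localhost", port=10000, username="train", database="test1")
cursor = conn.cursor()

# Average shopping budget per customer (online + offline spend).
cursor.execute(
    "SELECT AVG(customer_value_total_ever_online + customer_value_total_ever_offline) "
    "FROM customer_orc"
)
print("average customer value:", cursor.fetchone()[0])

# Most common category combinations in the last 12 months.
cursor.execute(
    "SELECT interested_in_categories_12, COUNT(*) AS total_count "
    "FROM customer_orc GROUP BY interested_in_categories_12 "
    "ORDER BY total_count DESC LIMIT 5"
)
for row in cursor.fetchall():
    print(row)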

Conclusion

This article looked at Apache HDFS, Hive, and Sqoop, and at how they are used together. The project is meant as preparation for real-world projects and is made for anyone who wants to become a data engineer.
