Data engineers: let's build a data stack step by step
If you are planning to improve your skills and get involved in some heavy technical projects, I've got you covered: this article is for you, and so are the next ones. As I always say, good knowledge should be shared, and I'm going to highlight the best of it for you here.
In this article, we are going to break down how to build a data stack: orchestrating workflows with the workflow management tool Apache Airflow, building dashboards with Dash (Plotly), gathering metrics with StatsD, and visualizing performance with Prometheus and Grafana.
All the components listed above will be connected together using Docker Compose. Some of you may not have had the chance to work with it yet, so let's start with a brief definition.
Imagine that you are working with many containers at the same time, each one with its own specific job. Are you going to handle each of them separately? Of course not! Docker Compose is a tool that solves this problem and lets you easily handle multiple containers at once.
I can see you wondering why I started the article by talking about Docker Compose, so let me explain. I want you to understand that it is essential to think about the foundations of every IT project before digging deeper, and Docker Compose is what will allow us to start all the services needed for our project.
Let's take a look at our Docker Compose file (docker-compose-LocalExecutor.yml):
Every docker-compose file has attributes that we need to understand:
- version: the Docker Compose file format version
- image: the Docker image to pull from Docker Hub
- depends_on: the services the current service relies on; for example, Apache Airflow depends on Postgres (or MySQL) to store the metadata of the DAGs
- environment: the environment variables of the service; for example, POSTGRES_USER=airflow will be used when the service starts
- command: the command to run as soon as the service starts
- volumes: paths on your file system mounted inside the container, so that, for example, the transformed data produced by the pipelines is kept in persistent storage
- ports: the ports your containers use to communicate with other services; for example, metrics are ingested into StatsD from Airflow via port 8125 using the UDP protocol (see the compose sketch after the steps below)
1. Start the database services:
2. Start the Airflow web server:
3. Start StatsD:
4. Start Prometheus and Grafana:
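Since the original screenshots are easier to follow with something concrete, here is a minimal sketch of what a docker-compose-LocalExecutor.yml-style file can look like. The image names, environment variables, paths, and exporter options below are illustrative assumptions (loosely modelled on common Airflow + StatsD exporter setups), not the exact contents of the project's file:

```yaml
# Minimal sketch of a docker-compose file for this stack.
# Image tags, env vars, and paths are assumptions, not the project's exact file.
version: "3"
services:
  postgres:
    image: postgres:13
    environment:
      - POSTGRES_USER=airflow
      - POSTGRES_PASSWORD=airflow
      - POSTGRES_DB=airflow

  webserver:
    image: puckel/docker-airflow:latest
    depends_on:
      - postgres
      - statsd-exporter
    environment:
      - EXECUTOR=Local
      - AIRFLOW__SCHEDULER__STATSD_ON=True
      - AIRFLOW__SCHEDULER__STATSD_HOST=statsd-exporter
      - AIRFLOW__SCHEDULER__STATSD_PORT=8125
    volumes:
      - ./dags:/usr/local/airflow/dags
    ports:
      - "8080:8080"        # Airflow web UI
    command: webserver

  statsd-exporter:
    image: prom/statsd-exporter
    command: ["--statsd.listen-udp=:8125"]   # assumption: listen for StatsD packets on 8125
    ports:
      - "8125:8125/udp"    # metrics in from Airflow
      - "9102:9102"        # HTTP endpoint scraped by Prometheus

  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana
    depends_on:
      - prometheus
    ports:
      - "3000:3000"
```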
As we can see, all these definitions are written in the same file, and the services are started together with a single command (docker-compose -f docker-compose-LocalExecutor.yml up -d) in order to achieve what the project is aiming for.
The project is divided into three major steps:
- scraping the data and automating the process with Apache Airflow
- building dashboards with Dash (Plotly)
- monitoring the pipeline with three tools: StatsD, Prometheus, and Grafana
1. Data scraping:
- Data source: https://www.fahrrad.de/
We are going to scrape data from fahrrad.de, a website where you can find very good offers on bicycles. But what exactly are we going to scrape? Good question.
We are going to extract:
- the brand name
- the category
- the model name
- the price
- the picture of the bicycle
We have two types of data to extract: structured data, stored in CSV files, and unstructured data, namely the pictures of the bicycles.
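To make this concrete, here is a minimal sketch of the scraping idea using requests and BeautifulSoup. The listing URL and CSS selectors are pure assumptions for illustration; the real site structure (and the actual script used in the project) will differ:

```python
# Minimal sketch of the scraping step, assuming requests + BeautifulSoup.
# The listing URL and CSS selectors are illustrative assumptions.
import csv

import requests
from bs4 import BeautifulSoup

URL = "https://www.fahrrad.de/fahrraeder/"  # hypothetical listing page

response = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
soup = BeautifulSoup(response.text, "html.parser")


def text_of(card, selector):
    """Return the stripped text of the first match, or an empty string."""
    node = card.select_one(selector)
    return node.get_text(strip=True) if node else ""


rows = []
for card in soup.select(".product-tile"):  # assumed product card selector
    img = card.select_one("img")
    rows.append(
        {
            "brand": text_of(card, ".brand"),        # assumed selectors
            "category": text_of(card, ".category"),
            "model": text_of(card, ".name"),
            "price": text_of(card, ".price"),
            "image_url": img["src"] if img and img.has_attr("src") else "",
        }
    )

# Structured data goes to CSV; the bicycle pictures can be downloaded from image_url.
with open("bikes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["brand", "category", "model", "price", "image_url"])
    writer.writeheader()
    writer.writerows(rows)
```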
The scraping scripts will be automated using the powerful workflow tool Apache Airflow. If you have never used it before, here is a brief overview of the tool.
Apache Airflow is a workflow management tool with many operators that help data engineers design and orchestrate the jobs of data science projects, for example automating data collection scripts. The operators used in this project are:
- PythonOperator: for the web scraping script
- BashOperator: for defining Linux administration jobs
- EmailOperator: for sending emails when the pipeline is finished
The operators are defined in a Python file (****.py), and the resulting DAG is then displayed in the Airflow UI.
A DAG is a sequence of tasks, and each task is defined using one operator. Each task must have a definite place in the order of execution (t1 >> t2 means task 1 should be executed before task 2).
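To make this ordering concrete, here is a minimal sketch of what such a DAG file could look like. The DAG id, schedule, task ids, callable, and email address are illustrative assumptions, and the import paths assume Airflow 1.x (as shipped in the images commonly used with docker-compose-LocalExecutor.yml); Airflow 2.x moves the operators to airflow.operators.python, .bash, and .email:

```python
# Minimal sketch of a DAG wiring the three operators together (Airflow 1.x imports).
# DAG id, schedule, task ids, and email address are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator
from airflow.operators.email_operator import EmailOperator


def scrape_bikes():
    """Placeholder for the fahrrad.de scraping logic."""
    pass


with DAG(
    dag_id="bike_scraping",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    t1 = PythonOperator(task_id="scrape_data", python_callable=scrape_bikes)

    t2 = BashOperator(
        task_id="archive_csv",
        bash_command="echo 'move the CSV files and images to persistent storage here'",
    )

    t3 = EmailOperator(
        task_id="notify",
        to="me@example.com",                      # hypothetical recipient
        subject="Scraping pipeline finished",
        html_content="The bike scraping DAG completed successfully.",
    )

    # t1 runs before t2, which runs before t3
    t1 >> t2 >> t3
```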
After gathering the data from the source (the website, via scraping), it is time to get insights from it. That is why we built an interactive dashboard using Dash (Plotly).
Dash (Plotly) is a framework for writing interactive web applications, built on top of Flask, Plotly.js and React.js.
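As an example, here is a minimal sketch of a Dash app that could chart the scraped prices. It assumes Dash 2.x, pandas, and Plotly Express, and the CSV path and column names match the scraping sketch above; it is not the project's actual dashboard:

```python
# Minimal sketch of a Dash app plotting the scraped bike prices (assumes Dash 2.x).
# The CSV path and column names are illustrative assumptions.
import dash
from dash import dcc, html
import pandas as pd
import plotly.express as px

# hypothetical CSV with a 'brand' column and a numeric 'price' column
df = pd.read_csv("bikes.csv")

fig = px.bar(
    df.groupby("brand", as_index=False)["price"].mean(),
    x="brand",
    y="price",
    title="Average bicycle price per brand",
)

app = dash.Dash(__name__)
app.layout = html.Div(
    [
        html.H1("Bike offers dashboard"),
        dcc.Graph(figure=fig),
    ]
)

if __name__ == "__main__":
    app.run_server(debug=True)
```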
Now let's do some fancy stuff. While the job is running, it generates metrics that should be monitored. These metrics are pushed from Airflow to StatsD on port 8125 over UDP (you can check the docker-compose file). StatsD alone does not give us an organised interface for displaying the metrics, so we rely on Prometheus and Grafana to achieve what we are looking for.
The whole process works this way: Airflow pushes StatsD metrics, Prometheus collects them (via the StatsD exporter), and Grafana queries Prometheus to build the monitoring dashboards.
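For reference, a minimal prometheus.yml that scrapes the StatsD exporter could look like the sketch below; the job name and target match the compose sketch above and are assumptions, not the project's exact configuration:

```yaml
# Minimal sketch of a Prometheus scrape config for the StatsD exporter.
# Job name and target are assumptions matching the compose sketch above.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "airflow-statsd"
    static_configs:
      - targets: ["statsd-exporter:9102"]  # HTTP metrics endpoint of the exporter
```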
Video of the project:
Conclusion:
In this article, we built a whole data stack: starting with gathering data using a Python script, moving on to building a dashboard to extract useful insights from it, and finally monitoring the data pipeline to see how the different tasks perform. I hope you enjoyed reading my article. Feel free to ask me any questions, and I look forward to hearing your ideas for the next article.