Data engineers: Let's build a data stack step by step

Figure 1: architecture of the project
Figure 2: the Docker Compose file
  • version: the version of the Docker Compose file format
  • image: the Docker image to pull from Docker Hub
  • depends_on: the services the current service depends on; for example, Apache Airflow needs Postgres and MySQL to store the metadata of its DAGs
  • environment: the environment variables of the service; for example, POSTGRES_USER=airflow is used when the Postgres service starts
  • command: the first command to run whenever the service starts
  • volumes: locations inside the container that are mounted on your file system; for example, this is how the transformed data from the pipelines ends up in persistent storage
  • ports: the channels containers use to communicate with other services; for example, metrics are ingested into StatsD from Airflow on port 8125 over the UDP protocol (see the sketch after this list)
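Since the compose file itself appears only in the screenshots, here is a minimal sketch of how these keys fit together; the service names, image tags, and credentials below are assumptions, not the project's actual file:

```yaml
version: "3"
services:
  postgres:
    image: postgres:13            # pulled from Docker Hub
    environment:
      POSTGRES_USER: airflow      # used when the service starts
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - ./data/postgres:/var/lib/postgresql/data   # persistent storage on the host

  statsd-exporter:
    image: prom/statsd-exporter
    command: "--statsd.listen-udp=:8125"
    ports:
      - "8125:8125/udp"   # Airflow pushes its metrics here over UDP
      - "9102:9102"       # Prometheus scrapes the exporter here

  airflow-webserver:
    image: apache/airflow:2.3.0
    depends_on:           # started only after its dependencies
      - postgres
      - statsd-exporter
    command: webserver    # first command run when the container starts
    ports:
      - "8080:8080"       # Airflow UI
```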
  1. Start the database services:
Figure 3: the MySQL service
Figure 4: the Airflow webserver service
Figure 5: the StatsD service
Figure 6: the Prometheus service
Figure 7: the Grafana service
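One piece the screenshots cannot fully convey is how Prometheus finds the metrics: it scrapes the StatsD exporter over HTTP. Below is a minimal prometheus.yml sketch, assuming the service names used above; the job name and scrape interval are my own choices:

```yaml
global:
  scrape_interval: 15s            # how often Prometheus pulls metrics

scrape_configs:
  - job_name: "statsd-exporter"   # hypothetical job name
    static_configs:
      - targets: ["statsd-exporter:9102"]  # the exporter's metrics endpoint
```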
The project then has three parts:
  1. scraping the data and automating it with Apache Airflow;
  2. building dashboards with Dash (Plotly);
  3. monitoring the pipeline with three tools: StatsD, Prometheus, and Grafana.

1. Data scraping:

Figure 8: screenshot from the Fahrrad.de website
For each bicycle listed on the site, the script extracts (a sketch of the scraping logic follows this list):
  • the brand name
  • the category
  • the model name
  • the price
  • the picture of the bicycle
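The exact selectors depend on the page's HTML, so the following is only a minimal sketch of the scraping logic, assuming requests and BeautifulSoup; the CSS classes are hypothetical:

```python
import requests
from bs4 import BeautifulSoup

def scrape_bikes(url: str) -> list[dict]:
    """Return one record per bicycle listed on the page."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    bikes = []
    for card in soup.select(".product-card"):  # hypothetical CSS class
        bikes.append({
            "brand": card.select_one(".brand").get_text(strip=True),
            "category": card.select_one(".category").get_text(strip=True),
            "model": card.select_one(".model").get_text(strip=True),
            "price": card.select_one(".price").get_text(strip=True),
            "image_url": card.select_one("img")["src"],
        })
    return bikes
```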
The pipeline is built from three Airflow operators (a DAG sketch follows this list):
  • PythonOperator: runs the web-scraping script
  • BashOperator: runs Linux administration jobs
  • EmailOperator: sends an email when the pipeline is finished
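Here is a minimal sketch of how the three operators could be wired into a DAG; the task ids, schedule, email address, and the scraper module are assumptions:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator
from airflow.operators.email import EmailOperator

from scraper import scrape_bikes  # hypothetical module holding the function sketched above

with DAG(
    dag_id="bike_scraping_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    scrape = PythonOperator(
        task_id="scrape_fahrrad",
        python_callable=scrape_bikes,
        op_kwargs={"url": "https://www.fahrrad.de"},
    )
    cleanup = BashOperator(
        task_id="cleanup_tmp_files",
        bash_command="rm -f /tmp/bikes_*.csv",  # example admin job
    )
    notify = EmailOperator(
        task_id="notify_on_success",
        to="me@example.com",                    # placeholder address
        subject="Bike pipeline finished",
        html_content="The scraping pipeline completed successfully.",
    )
    scrape >> cleanup >> notify  # linear task order
```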
Figure 9: screenshot of the pipeline
Figure 10: dashboard to visualize the data
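The dashboard itself is shown only as a screenshot, so here is a minimal sketch of a Dash app over the scraped fields; the CSV path, and the assumption that prices were already parsed to numbers, are mine:

```python
import pandas as pd
import plotly.express as px
from dash import Dash, dcc, html

# Output of the scraping pipeline; assumes 'price' is already numeric.
df = pd.read_csv("data/bikes.csv")

app = Dash(__name__)
app.layout = html.Div([
    html.H1("Bicycle prices on Fahrrad.de"),
    dcc.Graph(figure=px.box(df, x="brand", y="price", color="category")),
])

if __name__ == "__main__":
    app.run_server(debug=True)
```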
Figure 11: metrics in StatsD
Figure 12: metrics displayed in Prometheus
Figure 13: Grafana dashboard
Figure 14: Airflow monitoring
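For the metrics in Figures 11-14 to appear at all, Airflow's StatsD client has to be switched on. In a compose setup this is typically done with environment variables; here is a minimal sketch for Airflow 2.x, with the host matching the exporter service name assumed earlier:

```yaml
services:
  airflow-webserver:
    environment:
      AIRFLOW__METRICS__STATSD_ON: "True"
      AIRFLOW__METRICS__STATSD_HOST: statsd-exporter  # the exporter service
      AIRFLOW__METRICS__STATSD_PORT: "8125"           # UDP port from the compose file
      AIRFLOW__METRICS__STATSD_PREFIX: airflow        # prefix on every metric name
```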

Conclusion:

In this article, we built a whole data stack: we gathered data with a Python script, built a dashboard to extract useful insights from it, and monitored the data pipeline to see how its different tasks perform. I hope you enjoyed reading this article; feel free to ask me any questions, and I look forward to hearing your ideas for the next one.

Git repo:

https://github.com/chiheb08/data_enginnering_project
