Google BigQuery Migration: CicekSepeti Journey


Along with the pandemic and growing sales in the e-commerce sector, reporting and data warehousing have become more important for data-driven companies. With that importance comes the need to change the data warehouse architecture of growing companies.

Since the volume of data was previously not that large, traditional methods were enough for most companies. It was the same at Ciceksepeti.

Previously, we were using MSSQL Server for our data warehousing purposes and SSIS for running tasks at scheduled times. MSSQL ran on a virtual server in the cloud, and as the volume of data grew, we had to increase the server capacity as well. With that configuration, there are a lot of things to keep in mind to run a data pipeline flawlessly: server CPU, memory, and disk on the virtual machine side; indexes, database backups, and database availability on the MSSQL Server side.

Also, we were getting data from source databases via Python scripts. These scripts had to run on a server, required lots of Python dependencies, and needed to be maintained as well. With these scripts, we pulled data snapshots incrementally and loaded them into staging tables (a rough sketch of such a script follows the list below). But this method has some flaws:
1. There are no Change Data Capture (CDC) events from the source database with that method.

2. There may be data loss when you implement an incremental load.

3. You need a trigger job to copy data from the source database with that method.
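
Here is a minimal sketch of that kind of incremental snapshot script, assuming a SQLAlchemy connection and a watermark column; the table, column, and connection names are hypothetical, not our actual code:

```python
import pandas as pd
import sqlalchemy

# Hypothetical connection strings; the real scripts used internal configuration.
SOURCE_URI = "mssql+pyodbc://user:pass@source-db/orders?driver=ODBC+Driver+17+for+SQL+Server"
DWH_URI = "mssql+pyodbc://user:pass@legacy-dwh/dwh?driver=ODBC+Driver+17+for+SQL+Server"

def load_incremental_snapshot(table: str, watermark_column: str) -> None:
    source = sqlalchemy.create_engine(SOURCE_URI)
    dwh = sqlalchemy.create_engine(DWH_URI)

    # Find the last loaded watermark on the staging side.
    last_loaded = pd.read_sql(
        f"SELECT MAX({watermark_column}) AS wm FROM staging.{table}", dwh
    )["wm"].iloc[0]

    # Pull only rows changed since the last load. Rows updated without touching
    # the watermark column (or hard-deleted rows) are silently missed, which is
    # exactly the data-loss risk listed above.
    query = f"SELECT * FROM dbo.{table} WHERE {watermark_column} > '{last_loaded}'"
    df = pd.read_sql(query, source)
    df.to_sql(table, dwh, schema="staging", if_exists="append", index=False)

if __name__ == "__main__":
    load_incremental_snapshot("Orders", "UpdatedAt")
```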

But with the recent growth in data volume, the architecture of the data warehouse needed to change.

Google Cloud Platform

We selected Google BigQuery as our data warehousing solution. BigQuery is simple to operate: there is no need to worry about indexes, disk space for tables, compute power, etc. BigQuery's fully managed, serverless approach is exactly what we were looking for.

BigQuery Architecture

Our data engineering team gets data from the source databases with Kafka. We have hundreds of tables and multiple source databases, and the sources are of different types: MSSQL, MySQL, and MongoDB. At regular intervals, all of this data is sent to Google Cloud Storage, our data lake. The data resides in the data lake and waits to be ingested or transformed. With the help of BigQuery's external tables, we read the Google Cloud Storage data written by Kafka, and this forms a staging layer for data ingestion. This data is then ingested by BigQuery and becomes the backbone of our data warehouse tables.
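
As an illustration, an external table over the data-lake files can be defined with the BigQuery Python client roughly like this; the project, bucket, dataset, and file format below are assumptions, and our actual Kafka sink layout differs in detail:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project id

# Point an external table at the files Kafka sinks into the data lake
# (Parquet is assumed here purely for illustration).
external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://cs-data-lake/orders/*.parquet"]  # hypothetical bucket/path

table = bigquery.Table("my-project.staging.orders_external")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)

# The external table now acts as the staging layer: BigQuery reads the
# Cloud Storage files on the fly whenever the table is queried.
rows = client.query("SELECT COUNT(*) AS cnt FROM `my-project.staging.orders_external`").result()
```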

Kafka lets us sink CDC events to our Cloud Storage data lake, so we don't miss any change events from the source tables. Since all events are captured by Kafka, there are none of the data losses we had previously.

Also, the transformation part is handled by BigQuery itself. For transformations, we use BigQuery's stored procedures, which BigQuery calls routines. We have different datasets, multiple calculations, and lots of routines.
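
A minimal sketch of such a routine, created and called through the Python client; the dataset, routine name, and the transformation inside it are purely illustrative placeholders for the real business logic:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project id

# Create (or replace) a routine holding a transformation step.
client.query("""
CREATE OR REPLACE PROCEDURE dwh.sp_refresh_daily_sales()
BEGIN
  CREATE OR REPLACE TABLE dwh.daily_sales AS
  SELECT DATE(order_ts) AS order_date, SUM(amount) AS total_amount
  FROM dwh.orders
  GROUP BY order_date;
END;
""").result()

# Call the routine; this is the unit of work our orchestration triggers.
client.query("CALL dwh.sp_refresh_daily_sales();").result()
```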

Apache Airflow on GCP as Cloud Composer

To orchestrate these complex calculations and dependencies, we chose Google Cloud Composer, Google's managed Airflow service. With Airflow's ease of use and Python capabilities, we replaced our existing SSIS jobs. Airflow has a nice feature: you can write your own custom operators, so we started building our in-house library of common operators for Airflow. An operator is a reusable task block for a repetitive piece of work that needs to be done in code (a minimal operator sketch follows the list below). With that library, we mostly covered the repetitive tasks below:

  • Sending data through APIs
  • Sending data through SFTP connections
  • Getting data from APIs
  • Loading data into different targets
  • Running BigQuery routines and their dependencies
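
The sketch below shows the general shape of such an in-house operator; the operator name and the routine-calling pattern are illustrative, not our exact code:

```python
from airflow.models.baseoperator import BaseOperator
from google.cloud import bigquery


class BigQueryRoutineOperator(BaseOperator):
    """Runs a BigQuery stored procedure (routine) as a single Airflow task."""

    def __init__(self, routine: str, project: str, **kwargs):
        super().__init__(**kwargs)
        self.routine = routine  # fully qualified routine name, e.g. project.dataset.sp_name
        self.project = project

    def execute(self, context):
        client = bigquery.Client(project=self.project)
        self.log.info("Calling routine %s", self.routine)
        # Block until the routine finishes so downstream tasks see its output.
        client.query(f"CALL `{self.routine}`();").result()
```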

Once we had written all of these, the migration from the old data warehouse to BigQuery accelerated.

We simply started rewriting the old data warehouse code to BigQuery standards (partitioning, clustering, etc.). After that, since all the building blocks for job runs and dependency handling were already in place in Airflow, we used those operators and that architecture to write the new Airflow DAGs.
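
A minimal DAG sketch showing how such operators and dependencies fit together; the module path, routine names, project id, and schedule are all hypothetical:

```python
from datetime import datetime

from airflow import DAG
# Hypothetical in-house package holding the operator sketched above.
from common_operators.bigquery import BigQueryRoutineOperator

with DAG(
    dag_id="dwh_daily_sales",
    start_date=datetime(2022, 1, 1),
    schedule_interval="0 4 * * *",  # illustrative schedule
    catchup=False,
) as dag:
    load_staging = BigQueryRoutineOperator(
        task_id="load_staging_orders",
        routine="my-project.dwh.sp_load_staging_orders",
        project="my-project",
    )
    refresh_sales = BigQueryRoutineOperator(
        task_id="refresh_daily_sales",
        routine="my-project.dwh.sp_refresh_daily_sales",
        project="my-project",
    )

    # Dependency: the staging load must finish before the transformation runs.
    load_staging >> refresh_sales
```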

Python Flask Endpoints

Also, to ease the job of migrating the old data warehouse code to BigQuery, we created some Python-scripted Flask endpoints:

  • Initial table load from MSSQL to BigQuery
  • MSSQL to BigQuery T-SQL converter
  • Merge script creator for a table
  • View creator from table columns

Since we need to move data from the legacy data warehouse to BigQuery, we wrote Python code that reads data from the MSSQL server and writes it into a .parquet file chunk by chunk to avoid putting load on the MSSQL server. After all read operations are complete, the script uploads the parquet file to Google Cloud Storage. Then, from that parquet file, the script loads the table with the specified partitioning and clustering options.
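
A rough sketch of that initial-load flow under the same chunked approach; the connection string, bucket, table names, and partition/cluster columns are hypothetical:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import sqlalchemy
from google.cloud import bigquery, storage

SOURCE_URI = "mssql+pyodbc://user:pass@legacy-dwh/dwh?driver=ODBC+Driver+17+for+SQL+Server"

def initial_load(table: str, partition_col: str, cluster_cols: list) -> None:
    engine = sqlalchemy.create_engine(SOURCE_URI)
    writer = None
    # Read in chunks so the export does not overload the MSSQL server.
    for chunk in pd.read_sql(f"SELECT * FROM dbo.{table}", engine, chunksize=100_000):
        batch = pa.Table.from_pandas(chunk)
        if writer is None:
            writer = pq.ParquetWriter(f"/tmp/{table}.parquet", batch.schema)
        writer.write_table(batch)
    if writer:
        writer.close()

    # Upload the parquet file to the data lake bucket (hypothetical bucket name).
    bucket = storage.Client().bucket("cs-dwh-migration")
    bucket.blob(f"initial/{table}.parquet").upload_from_filename(f"/tmp/{table}.parquet")

    # Load the file into BigQuery with partitioning and clustering options.
    bq = bigquery.Client(project="my-project")
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        time_partitioning=bigquery.TimePartitioning(field=partition_col),
        clustering_fields=cluster_cols,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    bq.load_table_from_uri(
        f"gs://cs-dwh-migration/initial/{table}.parquet",
        f"my-project.dwh.{table}",
        job_config=job_config,
    ).result()

if __name__ == "__main__":
    initial_load("Orders", partition_col="OrderDate", cluster_cols=["CustomerId"])
```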

We simply copy and paste the old SQL code into the T-SQL converter endpoint, and it removes MSSQL-specific column naming and converts the script to BigQuery standards. It also generates the staging table and staging view scripts.
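
For a flavor of what such a conversion involves, here is a deliberately simplified, illustrative subset of rewrites; the real endpoint handles far more cases, and the dataset mapping shown is hypothetical:

```python
import re

REWRITES = [
    (r"\[([^\]]+)\]", r"\1"),                   # [column] -> column (MSSQL bracket naming)
    (r"\bGETDATE\(\)", "CURRENT_TIMESTAMP()"),  # T-SQL date function -> BigQuery equivalent
    (r"\bISNULL\(", "IFNULL("),                 # null handling
    (r"\s*WITH\s*\(NOLOCK\)", ""),              # locking hints have no BigQuery equivalent
    (r"\bdbo\.", "dwh."),                       # hypothetical schema-to-dataset mapping
]

def convert_tsql(sql: str) -> str:
    for pattern, replacement in REWRITES:
        sql = re.sub(pattern, replacement, sql, flags=re.IGNORECASE)
    return sql

print(convert_tsql("SELECT [OrderId], ISNULL([Amount], 0) FROM dbo.Orders WITH (NOLOCK)"))
# -> SELECT OrderId, IFNULL(Amount, 0) FROM dwh.Orders
```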

Merge script creator: the merge script for an initially loaded table is created by this Flask endpoint. First, the script retrieves the table's columns with the Python BigQuery libraries. We have a merge template for tables, and the script applies this template to all of the columns.
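
A minimal sketch of that idea: fetch the table's columns with the BigQuery Python client and fill a MERGE template. The template, dataset naming, and key column below are illustrative, not our production template:

```python
from google.cloud import bigquery

MERGE_TEMPLATE = """
MERGE `{project}.{dataset}.{table}` T
USING `{project}.{dataset}_staging.{table}` S
ON T.{key} = S.{key}
WHEN MATCHED THEN UPDATE SET {updates}
WHEN NOT MATCHED THEN INSERT ({columns}) VALUES ({values})
"""

def build_merge(project: str, dataset: str, table: str, key: str) -> str:
    client = bigquery.Client(project=project)
    # Read the table's schema to get all column names.
    schema = client.get_table(f"{project}.{dataset}.{table}").schema
    cols = [field.name for field in schema]
    return MERGE_TEMPLATE.format(
        project=project,
        dataset=dataset,
        table=table,
        key=key,
        updates=", ".join(f"T.{c} = S.{c}" for c in cols if c != key),
        columns=", ".join(cols),
        values=", ".join(f"S.{c}" for c in cols),
    )
```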

View script creator: the initially loaded table's columns are retrieved via Python, and with the help of a template, a view script is generated by the Flask endpoint.
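
The view creator follows the same column-plus-template pattern; a minimal sketch with illustrative dataset and naming conventions:

```python
from google.cloud import bigquery

def build_view(project: str, dataset: str, table: str) -> str:
    client = bigquery.Client(project=project)
    # List the table's columns and render them into a simple view template.
    cols = [f.name for f in client.get_table(f"{project}.{dataset}.{table}").schema]
    column_list = ",\n  ".join(cols)
    return (
        f"CREATE OR REPLACE VIEW `{project}.{dataset}_reporting.{table}_v` AS\n"
        f"SELECT\n  {column_list}\nFROM `{project}.{dataset}.{table}`"
    )
```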

All of these scripts and new technologies helped us migrate the legacy data warehouse to the new BigQuery data warehouse. We have finished most of the migration, but it is still an ongoing process for us.

Thanks for reading and for your time; you can always reach me via LinkedIn. 👋

In my next Medium posts, I will explain and share some code blocks from our migration endpoints. Stay tuned… 😉

References

https://en.wikipedia.org/wiki/Change_data_capture

https://airflow.apache.org/

https://kafka.apache.org/

https://flask.palletsprojects.com/en/2.1.x/

https://footfalldata.com/technology/google-cloud-composer-dealing-with-short-living-tasks/
