8 Traits Of Highly Productive Data Engineers
Himanshu Godara, senior data engineer; Nehil Jain, junior principal data engineer at QuantumBlack, AI by McKinsey.
About five years ago, Maxime Beauchemin, creator of Apache Airflow and Apache Superset, talked about data engineers having the worst seat at the table. Thankfully, much effort has since been invested in improving how data engineers engage with the broader organisation. Through our work on projects building custom machine learning applications and operationalising data to optimise business functions, we have picked up ideas that can help data engineers drive autonomy and take ownership of critical business impact. These insights will hopefully help ensure that data engineers never again have the worst seat at the table.
We approach advanced analytics projects through three lenses: people, the interdisciplinary set of experts collaborating on a project; process, the best practices and protocols we use to reduce risk and ensure consistency in delivery; and technology, the tools, frameworks and platforms we use to deliver secure, reliable data for consumption.
This article is the first post in a three-part series in which we’ll examine the people, processes and technology for high-performing data teams. We’ll explore how practitioners can build themselves an ecosystem for success.
The inspiration for this piece came from our own experience. We found ourselves in a group that built SQL transformations without versioning the code and shared database objects directly with downstream consumers such as data scientists and SMEs. The team would often spend hours debugging problems in production. The traits we’ve identified are designed to help you avoid falling into the same pitfalls we did.
1. Validate your choice of no-code tools with business needs
No-code tools have low barriers to entry, and they make collaboration easier when there is a way to see the history of changes. Choose a tool that best suits your data team’s capabilities and culture and matches your business requirements. This will keep your team productive and reduce the chance of you becoming the bottleneck.
2. Learn to make the buy vs build tradeoff
Leveraging existing frameworks or tools available in the market can reduce your time to delivery. Focus on finding the most efficient solution to the business problem at hand.
On one of our projects, the data team was convinced it should build everything in-house and was developing its own data ingestion solutions. The team spent a lot of time and resources on this yet still hadn’t solved the problem of loading data from various sources.
In this instance, a hybrid SaaS data integration platform was the better option for solving the team’s ingestion problems. Engineering teams should be mindful of when to leverage existing tools and weigh carefully when to go down the path of in-house development.
In addition to being open to different tools, deciding when it is appropriate to build a custom solution and when to buy one is critical. Consider the cost of building and maintaining a solution, the degree of control the business needs, and how well the solution connects to the existing systems and components of your stack.
3. Leverage open-source to harness the power of community
We’ve worked with engineers who stick to the technologies and implementation patterns they already know. With the modern data stack evolving rapidly, we need to leverage the power of the open-source data tooling at our disposal.
By staying up to date with the ecosystem, you can benefit from the learnings of the open-source community and accelerate delivery by standing on the shoulders of giants. For example, when developing data quality solutions, you should be aware of the work done in the Great Expectations framework; you might be able to use it instead of building a custom data validation framework.
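To make this concrete, here’s a minimal sketch using the Great Expectations pandas-style API (method names vary between versions, and the data and column names are invented for this example):

```python
import great_expectations as ge
import pandas as pd

# Invented sample data standing in for a real feed.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [9.99, 120.00, 35.50],
})

# Wrap the DataFrame so expectation methods become available.
gdf = ge.from_pandas(orders)

# Declare checks instead of hand-rolling validation loops.
gdf.expect_column_values_to_not_be_null("order_id")
result = gdf.expect_column_values_to_be_between("amount", min_value=0, max_value=10_000)
print(result.success)  # True when every value satisfies the expectation
```

Each expectation doubles as documentation of the dataset’s contract, something that is much harder to achieve with one-off validation scripts.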
However, it’s important to scan these open-source solutions for security vulnerabilities, a practice we have seen increasingly adopted across enterprises.
4. Don’t get attached to one tool
Make sure you understand the effectiveness and limitations of various tools and programming languages. If the only tool you have is a hammer, you tend to see every problem as a nail.
We saw this on a project where the engineering team used PySpark heavily for data transformations. As a result, they reached for the same tool on every project, even when the datasets weren’t large enough to warrant it, and they never leveraged their data warehouse’s compute. We re-evaluated these choices and reduced operating costs by tuning the processing cluster size to different workload patterns and by pushing computation down to the warehouse where possible.
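As a hypothetical illustration of the tradeoff (table and column names are invented), the same daily aggregation can run as a Spark job on a dedicated cluster or as a single statement on the warehouse’s own compute:

```python
# Option A: PySpark. Justified for very large datasets or complex
# transformations, but it means provisioning and tuning a cluster.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily_revenue").getOrCreate()
daily = (
    spark.read.table("analytics.orders")
    .groupBy("order_date")
    .agg(F.sum("amount").alias("revenue"))
)
daily.write.mode("overwrite").saveAsTable("analytics.daily_revenue")

# Option B: push the same aggregation down to the warehouse, where a
# modest dataset needs no separate cluster at all.
WAREHOUSE_SQL = """
CREATE OR REPLACE TABLE analytics.daily_revenue AS
SELECT order_date, SUM(amount) AS revenue
FROM analytics.orders
GROUP BY order_date
"""
```

For small and medium datasets, the second option is often cheaper and simpler to operate; the point is to choose per workload rather than by habit.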
5. Invest time in learning
Building your knowledge of different software and data analytics tools will help you to be the bridge between various disciplines.
On a project where we helped a team optimise their supply chain, we coached them on a data transformation tool that was new to them. By introducing it in combination with a common visualisation tool, we increased the team’s iteration speed during development. This had a positive revenue impact within a quarter because the team could validate their work and reliably deliver actionable insights.
6. Share best practices and standardise code to prevent re-work
The best data practitioners share their learnings and best practices, leaving the whole team more empowered and autonomous.
Investing time in standardising code scripts into frameworks and templates can be a huge time saver for the team.
We were helping a marketing team run hypothesis-driven campaigns using analytics. After the first two successful iterations, we created cookie-cutter code templates and playbooks to drastically reduce the time to run future iterations. We often conducted office hours and demo sessions to help people contribute to the templates and playbooks.
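As a sketch of what such a template can look like (all names are invented for illustration), the shared skeleton stays fixed while each campaign iteration supplies only its own configuration:

```python
from dataclasses import dataclass
from typing import Callable

import pandas as pd

@dataclass
class CampaignConfig:
    """Everything that changes between campaign iterations lives here."""
    name: str
    source_table: str
    metric: Callable[[pd.DataFrame], float]

def run_campaign_analysis(cfg: CampaignConfig,
                          load: Callable[[str], pd.DataFrame]) -> float:
    """Shared skeleton: load the data, compute the hypothesis metric, report."""
    df = load(cfg.source_table)
    value = cfg.metric(df)
    print(f"{cfg.name}: metric = {value:.3f}")
    return value

def load_stub(table: str) -> pd.DataFrame:
    """Stand-in for a real warehouse reader."""
    return pd.DataFrame({"converted": [0, 1, 1, 0, 1]})

# A new iteration is just a new config, not a new pipeline.
cfg = CampaignConfig(
    name="discount_email_test",
    source_table="marketing.campaign_events",   # invented table name
    metric=lambda df: df["converted"].mean(),   # invented column name
)
run_campaign_analysis(cfg, load_stub)
```

The playbook then only needs to explain the configuration, which keeps the barrier for new contributors low.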
7. Keep your eyes on the product
Adopting a product mindset toward data assets is crucial. Treat data models as Application Programming Interfaces (APIs) for downstream consumers. Work with data product owners to actively seek feedback from those consumers; it will help you develop and improve the product. Also, document your data models and test the data they deliver.
It can be easy to focus on resolving individual requests and landing data in the requested tables and views rather than seeking continual feedback from downstream consumers.
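As a minimal sketch of what treating data models as APIs can mean in practice (table and column names are invented), the model’s contract, covering schema, keys and nullability, is asserted the way you would test a function’s interface:

```python
import pandas as pd

def test_daily_revenue_contract(df: pd.DataFrame) -> None:
    # Contract 1: the columns consumers depend on must exist.
    assert {"order_date", "revenue"} <= set(df.columns)
    # Contract 2: one row per date, so order_date acts as the primary key.
    assert df["order_date"].is_unique
    # Contract 3: revenue is never null or negative.
    assert df["revenue"].notna().all()
    assert (df["revenue"] >= 0).all()

# Invented sample of the published data model.
daily_revenue = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "revenue": [1034.50, 2210.00],
})
test_daily_revenue_contract(daily_revenue)
```

A broken assertion then becomes a visible, testable event rather than a surprise discovered by a downstream consumer.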
8. Balance the short-term with long-term needs of the business
While designing a pipeline, focus on critical business impact and consider how it will evolve.
Avoid tunnel vision: don’t focus only on delivering the immediate task at hand. Make sure you consider idempotency, stateful vs stateless processing, and batch vs streaming solutions. This will help you prevent production bugs and downtime caused by the ever-changing nature and scale of production data.
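To take idempotency as an example, one common pattern (sketched here with invented table and column names) is to replace a date’s data rather than append to it, so re-running a failed job cannot create duplicates:

```python
# Delete-then-insert for the run date. Executed inside a single
# transaction (e.g. via a DB-API connection), running it twice for the
# same date leaves the table in the same state, so retries are safe.
IDEMPOTENT_DAILY_LOAD = """
DELETE FROM analytics.daily_revenue WHERE order_date = %(run_date)s;
INSERT INTO analytics.daily_revenue (order_date, revenue)
SELECT order_date, SUM(amount)
FROM analytics.orders
WHERE order_date = %(run_date)s
GROUP BY order_date;
"""
```

An append-only version of the same load would silently double-count revenue on every retry.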
These are just some of the learnings we’ve taken away from elevating the role of data engineering and building high-performing data teams.
Please keep an eye on the QuantumBlack, AI by McKinsey Medium page: the next article in our series will be posted soon, and it will focus on the ideal platform.
Special thanks to Alex Arutyunyants, Michael P Hernandez, Piotr Roszkowski and Bruce Philp for their contributions.
Are you interested in being part of high-performance teams building data platforms and processes for enterprises? Want to rapidly launch and iterate products across vast industries — or even join us in implementing these use cases with leading organisations? At QuantumBlack, AI by McKinsey, we’re expanding our team of Technical Product Designers; check out our Careers page for more information and follow us on LinkedIn.