A deep dive into the TileDB data format & storage engine
Recently we hosted a webinar where I delivered a deep dive into TileDB Embedded, the open-source storage engine that powers the TileDB Cloud universal database. I demonstrated many hands-on examples on how to use TileDB from Python, R and SQL, which can serve as a great onboarding session for new and existing TileDB users. Finally, I showed two new powerful TileDB features, namely attribute filter condition push-down and schema evolution.
Here is the video recording of the webinar. Brace yourselves, it is 1.5 hours (with Q&A) as it covers a lot of ground.
But do not fret: in this blog post I provide the gist along with an index of sorts, so that you can quickly locate what interests you the most.
Specifically, I summarize what I covered in the webinar, and provide links to the Jupyter notebooks I used. The links take you to a preview of the notebooks on TileDB Cloud. From there you can download and run them locally; no TileDB Cloud account is needed. Alternatively, you can launch them directly in TileDB Cloud. For that you will need to sign up, but doing so is free, no credit card is required and you get $10 in free credits, more than enough to run all the webinar examples… hundreds of times. The webinar, as well as the rest of this blog post, is focused 100% on the open-source TileDB Embedded.
Introduction & Why Arrays
This webinar is for you if:
- You are interested in data storage fundamentals (e.g., layout, compression, IO)
- You are tired of using many different inefficient (often domain-specific) data formats
- You wish to efficiently store and access any kind of data from anywhere with any tool
TileDB Embedded is a fast C++ library that allows you to store and access any data as multi-dimensional arrays. I go to great lengths in the recording to explain how foundational arrays are for laying out data of any type and domain as a sequence of bytes in a 1-dimensional storage medium. Arrays allow you to manipulate the data layout and, therefore, optimize it for your access patterns, increasing the locality of the results on storage and thus maximizing IO performance.
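To make the layout argument concrete, here is a minimal sketch in plain Python. It is not the TileDB API; the helper names (`row_major_offset`, `col_major_offset`, `count_contiguous_runs`) are made up for illustration. It maps the cells of a small 2D array to positions in a 1-dimensional medium under two layouts and counts how many separate contiguous reads a row query would need in each.

```python
# Conceptual sketch only: hypothetical helpers, not TileDB functions.

def row_major_offset(row, col, n_rows, n_cols):
    """1D position of cell (row, col) when rows are stored one after another."""
    return row * n_cols + col

def col_major_offset(row, col, n_rows, n_cols):
    """1D position of cell (row, col) when columns are stored one after another."""
    return col * n_rows + row

def count_contiguous_runs(offsets):
    """Number of separate contiguous ranges needed to read all the offsets."""
    ordered = sorted(offsets)
    if not ordered:
        return 0
    runs = 1
    for prev, cur in zip(ordered, ordered[1:]):
        if cur != prev + 1:
            runs += 1
    return runs

n_rows, n_cols = 4, 4

# Query: every cell of row 1.
query = [(1, c) for c in range(n_cols)]

row_major = [row_major_offset(r, c, n_rows, n_cols) for r, c in query]
col_major = [col_major_offset(r, c, n_rows, n_cols) for r, c in query]

print("row-major offsets:", row_major, "runs:", count_contiguous_runs(row_major))
print("col-major offsets:", col_major, "runs:", count_contiguous_runs(col_major))
```

In this toy case the same query touches one contiguous range under the row-major layout and four scattered positions under the column-major one, which is the kind of locality difference that choosing the layout lets you control.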
The TileDB Embedded library is wrapped by numerous programming language APIs and is integrated with a broad set of SQL engines and data science tools. It also works very well on a large set of storage backends, being particularly optimized for object stores, such as AWS S3, Azure Blob Storage and Google Cloud Storage.
The Basics
This section covers the data format of dense and sparse arrays, their differences and their basic write/read functionality. It also discusses features like groups, array metadata and variable-length attributes and dimensions. Finally, it explains how arrays subsume dataframes and how easy it is to use arrays on cloud object stores or any other backend.
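As a rough illustration of the dense/sparse distinction, here is a small sketch in plain Python. It is a conceptual model and not the TileDB on-disk format or API; `to_dense`, `to_sparse` and the fill value are assumptions made for the example. The dense form stores a value for every cell and leaves coordinates implicit in the position, whereas the sparse form stores only the non-empty cells together with their coordinates.

```python
# Conceptual sketch only: hypothetical helpers, not TileDB functions.

def to_dense(cells, n_rows, n_cols, fill=0):
    """Store one value per cell in row-major order; coordinates are implicit."""
    values = [fill] * (n_rows * n_cols)
    for (r, c), v in cells.items():
        values[r * n_cols + c] = v
    return values

def to_sparse(cells):
    """Store only non-empty cells: explicit coordinates plus their values."""
    coords = sorted(cells)  # sorted (row, col) tuples, i.e. row-major order
    values = [cells[rc] for rc in coords]
    return coords, values

# Three non-empty cells in a 4x4 domain.
cells = {(0, 1): 10, (2, 3): 20, (3, 0): 30}

dense = to_dense(cells, 4, 4)
coords, values = to_sparse(cells)

print("dense stores", len(dense), "values")
print("sparse stores", len(coords), "coordinates and", len(values), "values")
```

Which representation is preferable depends on how much of the domain is actually populated: the dense form pays for empty cells, the sparse form pays for coordinates.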
Related notebooks:
Tiling & Layout
Tiling and cell layout are paramount to arrays. Here I discuss the main concepts, such as space tile extents, data tiles, tile capacity and the global cell order. Moreover, I cover tile filters such as compression, encryption and checksums.
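The interplay between space tile extents and the global cell order can be sketched in a few lines of plain Python. This is a simplified model under stated assumptions (a dense 2D array, row-major order both across tiles and within each tile); the function `global_order` is hypothetical and not part of the TileDB API.

```python
# Conceptual sketch only: assumes row-major tile order and row-major cell
# order within each tile. `global_order` is a hypothetical helper.

def global_order(n_rows, n_cols, tile_rows, tile_cols):
    """List cell coordinates in the order they would be laid out in 1D:
    tile by tile, and cell by cell inside each tile."""
    order = []
    for tr in range(0, n_rows, tile_rows):          # walk tiles row by row
        for tc in range(0, n_cols, tile_cols):
            for r in range(tr, min(tr + tile_rows, n_rows)):  # cells in tile
                for c in range(tc, min(tc + tile_cols, n_cols)):
                    order.append((r, c))
    return order

# A 4x4 array with 2x2 space tiles.
order = global_order(4, 4, 2, 2)
position = {cell: i for i, cell in enumerate(order)}

# Query: the 2x2 block in the top-left corner.
query = [(r, c) for r in range(2) for c in range(2)]
positions = sorted(position[cell] for cell in query)

print("global order:", order)
print("positions touched by the query:", positions)
```

With 2x2 tiles the block query lands on four adjacent positions, whereas a plain row-major layout of the same 4x4 array would place those cells at positions 0, 1, 4 and 5. Picking tile extents that match the shape of typical queries is what keeps results close together on storage.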
Related notebooks:
Advanced Internal Mechanics
This section delves into more advanced features of TileDB Embedded, such as versioning, time traveling, indexing, consolidation and vacuuming. It also introduces two new exciting features, namely attribute filtering push-down and schema evolution. Finally, it offers quick tips on writing and reading for boosting performance.
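To give a feel for how versioning, time traveling and consolidation relate, here is a toy model in plain Python. It assumes each write is kept as a separate timestamped batch of cells (called a fragment here) and that a read overlays the batches in timestamp order. The functions `read_at` and `consolidate_and_vacuum` are hypothetical names, and the model is deliberately much simpler than the real engine, which the recording and notebooks cover in detail.

```python
# Toy model only: hypothetical helpers, not TileDB functions.

def read_at(fragments, timestamp=None):
    """Overlay fragments in timestamp order; later writes win.
    If a timestamp is given, ignore fragments written after it."""
    view = {}
    for ts, cells in sorted(fragments, key=lambda f: f[0]):
        if timestamp is not None and ts > timestamp:
            break
        view.update(cells)
    return view

def consolidate_and_vacuum(fragments):
    """Merge all fragments into one holding the latest state.
    In this toy model the older fragments are dropped, so earlier
    timestamps can no longer be read afterwards."""
    latest = max(ts for ts, _ in fragments)
    return [(latest, read_at(fragments))]

# Three writes at timestamps 1, 2 and 3; the second overwrites cell (0, 1).
fragments = [
    (1, {(0, 0): "a", (0, 1): "b"}),
    (2, {(0, 1): "B"}),
    (3, {(1, 1): "c"}),
]

as_of_1 = read_at(fragments, timestamp=1)
latest = read_at(fragments)
merged = consolidate_and_vacuum(fragments)

print("as of t=1:", as_of_1)
print("latest:   ", latest)
print("fragments after merging:", len(merged))
```

The point of the sketch is the trade-off: keeping many small fragments preserves history but makes reads overlay more pieces, while merging them gives a single fragment to read at the cost of the intermediate versions.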
Related notebooks:
- Dense array versioning
- Sparse array versioning
- Dense array consolidation
- Sparse array consolidation
- Dense array schema evolution
- Sparse array schema evolution
- Sparse array attribute filtering push-down
Work In Progress
Our team is very hard at work and numerous new features are coming up. Here is a small taste of what will appear in the upcoming releases.
TileDB vs. Others
This is by no means a full-fledged comparison of TileDB to other storage engines; in this section I offer a quick qualitative comparison between TileDB and popular systems like HDF5, Zarr, Parquet and Delta Lake. We are always happy to benchmark TileDB against anything you have in mind, but please suggest data and queries, which we should always make public so that anyone can reproduce and/or rebut the results.
The Full Slide Deck
Here are the slides I used in the webinar.
A few final remarks:
- We are hiring! If you liked what you saw and you feel that you are a good fit, please apply today.
- Please follow us on Twitter, join our Slack community or participate in our forum. We would like to hear from you so that we can get better.
Last but not least, a huge thank you to the entire team for all the awesome work. I am merely a representative and the exclusive recipient of complaints. All the credit for this amazing library goes to our awesome team!