Real-world use cases from the trenches
This post is the second in a series of articles in anticipation of DuckCon #6. Join us as we delve deeper into the world of DuckDB and showcase practical use cases. Through our experiences and tips, we aim to help you unlock the full potential of your data. If you missed it, you can go back and read part 1.
Today's data world is abuzz with new tools, technologies, and self-doubt. The industry feels unique in that every time a new framework of thought arises, another meets it with challenges, drawbacks, and improvements. You could argue that this is a feature of a fanatical following always trying to improve their ways of operating, but at times it feels more like insecurity about using existing tools, lest you be 'left behind'.
At Crescent, we want to show the world that small data is great and can be useful for more businesses everywhere. Not every workflow needs to carry the burden of the cloud tax, and in many cases, the smarter choice is to optimize for simplicity and efficiency.
Enter DuckDB—a fast, embeddable, and highly effective database system that challenges the notion that big problems always require big solutions. In this post, we'll explore how DuckDB enables powerful analysis of large datasets with ease, using a real-world example we have worked on.
Today’s case revolves around analyzing the datasets for one of our clients, demonstrating a common story in the data world. In this case, the analysis of gigabytes of data is essential to their operation, but they have challenges relating to data quality and processing speeds.
The Challenge
Our client's existing workflow was built on Databricks, which they used to manage and analyze the datasets they produce. As Databricks is primarily a hosted Spark solution, it requires specialist knowledge to operate effectively, making it difficult for team members without a deep understanding of Spark concepts to contribute or adapt the code. The implemented workflows were spread across multiple notebooks, creating fragmentation that slowed collaboration and made maintaining or scaling the pipeline complex. The distributed computing power of Spark introduces overhead and cost, while the return on investment remains difficult to calculate.
Beneath the ease of use of Databricks lies a dark truth. The system works best when we push back against how easy it is to get going and use it for what it excels at: large-scale distributed analytical data processing. Not every solution benefits from distribution, and not every workflow should be built with a cloud-first mindset. Similarly, it is all too easy for engineers to succumb to the dreaded vendor lock-in, meaning that as time passes it becomes harder and harder to unravel your workflows should you want to move.
We, at Crescent, believe it is time to think small. While the Databricks implementation worked, we asked whether it was optimal for our client's needs, now and in the future. On paper, the existing setup could handle ultimate scale, but in reality they would never get there. Planning for scale comes at a cost, so we had to ask the hard questions:
Is a cloud-first approach required now?
Is the data really the right size for Databricks?
Can we reduce a future dependency on any one software provider?
The Data
For our specific use-case, the existing workflow was benchmarked at 2.5 hours to complete the analysis of 12GB of data. Built for ultimate scalability, but neither simple nor fast.
2.5 hours
to process 12GB of data
Avoiding vendor lock-in
To help our client avoid vendor lock-in where possible, we opted for evergreen tools, or Lindy technologies, that are unlikely to change and are used ubiquitously. We used the following:
SQL where we can
Python to fill in the gaps and orchestrate
Blob storage to hold on to the data
Each of these can be used with DuckDB, our favourite new technology that is making a splash in the data world. Fortunately, it is open source and exposes a friendlier SQL dialect, a Python API, and functions to talk to blob stores with ease. Given the requirements of the analysis and the breadth of functionality of these tools, we determined that they would suffice for this use-case, as the sketch below illustrates.
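To make that concrete, here is a minimal sketch of the stack in a recent DuckDB version: the Python API drives a session, the httpfs extension and an S3 secret point DuckDB at blob storage, and plain SQL does the analysis. The bucket, credentials, and column names are placeholders for illustration, not our client's actual setup.

```python
import duckdb

# In-memory DuckDB session; install and load the httpfs extension so that
# SQL can read directly from S3-compatible blob storage.
con = duckdb.connect()
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")

# Placeholder credentials and region; in practice these would come from a
# secrets manager or the environment, not the source code.
con.sql("""
    CREATE SECRET (
        TYPE s3,
        KEY_ID 'my-access-key-id',
        SECRET 'my-secret-access-key',
        REGION 'eu-west-1'
    )
""")

# Plain SQL over Parquet files sitting in a (hypothetical) bucket.
daily_totals = con.sql("""
    SELECT event_date, category, SUM(amount) AS total_amount
    FROM read_parquet('s3://example-bucket/events/*.parquet')
    GROUP BY ALL
    ORDER BY event_date
""")
print(daily_totals)
```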
Scale, Speed & Spending
While the datasets discussed above are on the order of 10GB, our client noted that occasional larger datasets can reach up to 200GB. This raised a question around scalability (and explained the initial choice of Databricks), though 200GB is not very large when on-demand compute instances can have over 32TB of memory. DuckDB can efficiently handle larger-than-memory datasets, meaning we could develop locally and happily process the test dataset of 12GB on a typical modern laptop. It is also known to be fast for analytical use-cases, and we were therefore confident that DuckDB would be a good choice.
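As a rough illustration of how that works, the sketch below configures a local DuckDB session with a memory cap and a spill directory so that large sorts, joins, and aggregations overflow to disk instead of failing. The paths, limits, and table names are hypothetical.

```python
import duckdb

# Persistent database file for the analysis session; the path is illustrative.
con = duckdb.connect("analysis.duckdb")

# Cap memory and give DuckDB a spill directory: operators that exceed the
# limit spill to disk rather than erroring out, which is how a 12GB test
# dataset fits comfortably on a laptop.
con.sql("SET memory_limit = '8GB'")
con.sql("SET temp_directory = '/tmp/duckdb_spill'")

# Summarise a large Parquet dataset without materialising it all in memory;
# the file glob and column names are made up for this example.
con.sql("""
    CREATE OR REPLACE TABLE customer_summary AS
    SELECT customer_id, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM read_parquet('data/orders/*.parquet')
    GROUP BY customer_id
""")
```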
Deploying the solution beyond a single local machine is also necessary. Fortunately, when such a time arises, we can use MotherDuck for hybrid query execution and store the results either in MotherDuck itself or in a cloud of our choice. So, how does a DuckDB implementation of the analysis workflow compare?
Ducking Fast
The goal was to run the workflow on typical modern hardware (a MacBook Pro M1 in this case) to show the capabilities of the local single-node systems available to developers today, without the cloud tax of a system like Databricks. Using DuckDB and Python, we rewrote the pipeline and reduced processing time from 2.5 hours on a Databricks cluster to 20 minutes on a local machine - a 7.5x improvement!
7.5x speed up
to process the same 12GB of data
The solution was written using Python and DuckDB's flavour of SQL. These two technologies mean a far wider audience can understand the logic than with the previous PySpark/distributed implementation, which results in better maintainability and quicker troubleshooting.
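For a flavour of why the SQL reads so well, the sketch below leans on two of DuckDB's friendly SQL features, EXCLUDE and QUALIFY; the dataset and column names are made up for illustration.

```python
import duckdb

con = duckdb.connect()

# Two of DuckDB's friendlier SQL touches:
#  * EXCLUDE drops a column without spelling out every other column,
#  * QUALIFY filters on a window function without a nested subquery.
# The equivalent PySpark needs noticeably more ceremony for both.
latest_per_device = con.sql("""
    SELECT * EXCLUDE (raw_payload)
    FROM read_parquet('data/readings/*.parquet')
    QUALIFY row_number() OVER (PARTITION BY device_id ORDER BY ts DESC) = 1
""")
print(latest_per_device)
```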
Cloud-second approach
The cloud revolution was not for nothing, however, and the benefits are clear for the right use-cases: accessibility, availability, reduced operational costs, and scalability (when it's needed), to name a few. So when we do need to scale our DuckDB workflows beyond a single node, MotherDuck is there to step in, as we mentioned above.
The seamless integration of MotherDuck with DuckDB means a small code change is all it takes to scale the above results beyond a local machine. Data volumes that seem large for a local machine (up to and beyond 200GB) remain trivial for the cloud, which can simply scale up. MotherDuck handles this process, both locally and remotely, with its innovative hybrid query engine, which is great news for cloud storage implementations. The sketch below shows how small that change is.
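In this sketch, the local connection is swapped for a MotherDuck one; the database name is hypothetical, a MOTHERDUCK_TOKEN is assumed to be in the environment, and the same S3 secret as in the earlier example is assumed to be configured.

```python
import duckdb

# The "small code change": connect with an md: connection string instead of
# a local database path. The database name is a placeholder.
con = duckdb.connect("md:analytics_db")

# The same SQL now runs through MotherDuck's hybrid query engine, and the
# result lands in a MotherDuck-managed table (it could just as well be
# written back to blob storage).
con.sql("""
    CREATE OR REPLACE TABLE daily_totals AS
    SELECT event_date, SUM(amount) AS total_amount
    FROM read_parquet('s3://example-bucket/events/*.parquet')
    GROUP BY ALL
""")
```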
From these results, our client was able to cut complexity, reduce costs, and improve their product delivery by focussing on just one workflow within their current landscape.
Key Takeaways
Consider the size of your data: Distributed systems such as Databricks may not be the best option for analyzing small to medium-sized datasets.
Rely on sustainable tech: Beware of vendor lock-in and lean more towards open-source systems that have withstood the test of time.
SQL and Python are powerful tools for data analysis: These languages can be used to write code that is both expressive and efficient.
Local-first, cloud-deployable solutions can provide the best of both worlds: DuckDB and MotherDuck allow you to develop and test your workflows locally, and then deploy them to the cloud when you need to scale.
At Crescent, we believe in building simpler and smarter, not bigger, data solutions. The transformation from a complex distributed system to a lean, efficient DuckDB workflow underscores the power of thinking small. By embracing simplicity, leveraging evergreen technologies, and focusing on real-world needs, we can unlock faster, more accessible insights for all.
Join Us at DuckCon #6
To learn more about DuckDB and its integrations, join us at DuckCon #6. This annual conference brings together data enthusiasts from around the world to share knowledge and best practices. See you there!