The Soda team recently participated in Subsurface LIVE Winter Edition, Dremio’s cloud data lake conference, which took place online.
It was fascinating to see how rapidly things have changed in data management over the last few years. Most notable is the rising importance of data monitoring and of the role of the data engineer. Coincidentally (and an unashamed plug here), that is exactly what we at Soda were showcasing at the conference with our new open source project, Soda SQL. Coincidence or perfect timing? Read on.
In the opening keynote, Tomer Shiran, Chief Product Officer at Dremio, touched upon some of the changes that have occurred in the data management space. Tomer spoke about the change in architecture from a monolithic, client-server approach often using proprietary software, to a more loosely connected, cloud-based approach that relies on open-source software.
The key driver for this is the need to manage much larger datasets. The modern data stack is increasingly massive and complex. Organizations are all too aware of the need to deliver the right data to the right people at the right time and, as Tomer highlighted, to make data available 24/7 to different users across the entire organization. Whilst that is the goal, we need to recognize that many businesses are struggling to keep on top of all the known (and unknown) data quality issues.
Through my virtual-quasi-real interactions and conversations at the Soda booth, I can confirm that the demand is real! Data engineering and infrastructure teams are under immense pressure to manage the relentless demand for supreme-quality, analytics-ready data from an ever-increasing number of data sources.
These challenges - and the possible solutions - are what so many attendees discussed throughout the conference. In fact, ‘on-demand data availability’ might be the next big thing that business analysts and the industry will be talking about in 2021 and beyond.
Another key takeaway from the conference was the need for data compatibility and consistency across the data pipeline, from the source to the user. For example, a change in the data type of a column in a source DB needs to be reflected in the schema of the data lake that a user accesses. Validating data quality consistently across the data pipeline is one of the reasons we built the Soda data monitoring platform. It was fun to introduce the platform to attendees during the event.
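To make the column-type example concrete, here is a minimal sketch in plain Python of what a schema consistency check could look like. It is an illustration only and not how Soda SQL or the Soda platform is implemented: the function name `find_schema_drift`, the column names and the types are all hypothetical, and the two schemas are hard-coded where a real check would read them from the source DB and the data lake.

```python
from typing import Dict, List


def find_schema_drift(source: Dict[str, str], lake: Dict[str, str]) -> List[str]:
    """Compare two column -> type mappings and describe every difference."""
    problems = []
    for column, source_type in source.items():
        if column not in lake:
            problems.append(f"{column}: missing from data lake")
        elif lake[column] != source_type:
            problems.append(
                f"{column}: source is {source_type}, data lake is {lake[column]}"
            )
    for column in lake:
        if column not in source:
            problems.append(f"{column}: not present in source")
    return problems


# Hypothetical schemas: the type of "amount" differs between the source DB
# and the data lake, so the lake schema is out of date.
source_schema = {"id": "bigint", "amount": "decimal(10,2)", "created_at": "timestamp"}
lake_schema = {"id": "bigint", "amount": "varchar", "created_at": "timestamp"}

drift = find_schema_drift(source_schema, lake_schema)
for problem in drift:
    print(problem)
```

Run on a schedule, or as a step in the pipeline, a check like this would flag the mismatch before a user queries the data lake.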
Undeniably, we are seeing the rise of data engineers, data product owners and data scientists. Hurrah! But with that comes the realization of how hard it is to keep on top of all of the known (and unknown) data quality issues. Resource-stricken data engineering and infrastructure teams, struggling to meet the growing demand for analytics-ready data from an ever-increasing number of data sources, are everywhere. That is why we at Soda have started bringing software engineering principles into the data engineering workflow.
Is it a hopeless situation? Absolutely not! This fast-growing community already uses a plethora of open-source developer tools, such as Spark or dbt, to facilitate modern-day data product management. Now, we need to bring additional software engineering principles into the data engineering workflow. Let’s keep exploring that, together.
And really the best takeaway was the unity in the understanding that data needs to be monitored, tested and validated as soon as possible and ultimately before it reaches the user.
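As a rough illustration of testing data before it reaches the user, the sketch below runs a few named checks against a batch of rows and blocks the batch if any check fails. It is plain Python with hypothetical check names and sample data, and it does not use or represent the Soda SQL API.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Check = Callable[[List[Row]], bool]


def row_count_is_positive(rows: List[Row]) -> bool:
    return len(rows) > 0


def no_missing_ids(rows: List[Row]) -> bool:
    return all(row.get("id") is not None for row in rows)


def amounts_non_negative(rows: List[Row]) -> bool:
    return all(
        row.get("amount") is not None and row["amount"] >= 0 for row in rows
    )


# Hypothetical checks, written up front in the spirit of test-driven development.
CHECKS: Dict[str, Check] = {
    "row_count_is_positive": row_count_is_positive,
    "no_missing_ids": no_missing_ids,
    "amounts_non_negative": amounts_non_negative,
}


def failed_checks(rows: List[Row], checks: Dict[str, Check]) -> List[str]:
    """Return the names of all checks that the batch does not pass."""
    return [name for name, check in checks.items() if not check(rows)]


# Made-up sample batch with one missing id and one negative amount.
batch = [
    {"id": 1, "amount": 19.99},
    {"id": None, "amount": 5.00},
    {"id": 3, "amount": -2.50},
]

failures = failed_checks(batch, CHECKS)
if failures:
    print("Batch blocked:", ", ".join(failures))
else:
    print("Batch published")
```

The design choice here is simply that the checks run as a gate in the pipeline, so a failing batch is stopped rather than discovered later by whoever consumes the data.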
If you were unable to attend the conference yourself, you can access the sessions on-demand, here. TL;DL: Of all the informative and thought-provoking talks at Subsurface, listen to Tomer Shiran’s keynote and Roy Hasson’s AWS presentation on data lakes. But, depending on where you are in the data pipeline, all are worthwhile.
I started by talking about change, and so I’ll end with it. Like many, I missed the in-person interaction that this community does so well and thrives on. It was still a great event, though, and the DJ session was unexpectedly fun! Soda was proud to sponsor, and I personally got a lot from the conference.
I of course would love for you to now go and explore Soda SQL, which appears to be so well-timed.
Soda SQL is our newly released open source project. Soda is championing the engineering principles of Test-Driven Development (TDD) in its data monitoring platform and we’d like for you to give it a try.
Go on, go and test yourself some good quality data.