A Hard Look at Data Workflows
Usually, after a data workflow is built in code, there's a whole separate sequence of operations that the engineering team never sees. Let's say that your sales data and your inventory data live in two separate systems. Because you're a good engineering team, you set up data pipelines so that live data is always available in your database. One ETL fills one table with sales data, and another ETL fills a different table with inventory data. And because you're particularly conscientious, you take the extra step of writing a SQL query that outputs the sales and inventory for each item, and you save it in whatever BI tool you use (like Metabase, Looker, or Tableau), labeling it clearly in the lexicon so that people can find it later.
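To make that concrete, here is a minimal sketch of the kind of query described above, run against an in-memory SQLite database. The table names, column names, and sample rows are all hypothetical stand-ins for whatever your two ETLs actually load.

```python
import sqlite3

# Hypothetical stand-ins for the two ETL-fed tables described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (item_id TEXT, units_sold INTEGER);
    CREATE TABLE inventory (item_id TEXT, units_on_hand INTEGER);
    INSERT INTO sales VALUES ('A', 5), ('A', 3), ('B', 2);
    INSERT INTO inventory VALUES ('A', 40), ('B', 15), ('C', 7);
""")

# One row per item: total units sold next to units on hand.
# LEFT JOIN keeps items that are in stock but have no sales yet.
SALES_AND_INVENTORY = """
    SELECT i.item_id,
           COALESCE(SUM(s.units_sold), 0) AS units_sold,
           i.units_on_hand
    FROM inventory AS i
    LEFT JOIN sales AS s ON s.item_id = i.item_id
    GROUP BY i.item_id, i.units_on_hand
    ORDER BY i.item_id
"""

rows = conn.execute(SALES_AND_INVENTORY).fetchall()
for item_id, units_sold, units_on_hand in rows:
    print(item_id, units_sold, units_on_hand)
```

Starting from the inventory table is a design choice in this sketch: it means an item with stock but no sales still shows up, with zero units sold.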
But even if you do all this work, there are still going to be a whole lot of data transformations that get done offline. For inventory planning, the inventory team wants to see weekly sales averaged over the last 4 and 8 weeks so they can observe trends. So they add that in Excel for a few weeks, and then bring it back to the data team to get it folded into the SQL query. That works for a few more weeks, until the VP of Operations decides there have still been too many surprises and they need the 12-week and 16-week figures on there too. Now productivity is starting to drag, because even though this kind of iterative process is probably the best thing for the organization, everybody is tired of making these little requests of everyone else. So people start to find homegrown solutions, like writing little single-use Excel macros, and by the end the whole workflow breaks down if one person is out of the office, because nobody else knows that the last two steps of the process happen in that person's local copy of Excel.
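The trailing averages the inventory team keeps asking for are simple to express once they live somewhere shared. Here is a minimal sketch, assuming weekly sales for one item arrive as a list ordered oldest to newest. The function name and the sample numbers are hypothetical, not taken from any real workflow.

```python
from statistics import mean

def trailing_weekly_averages(weekly_sales, windows=(4, 8, 12, 16)):
    """Average the most recent N weeks of sales for each window N.

    weekly_sales is ordered oldest to newest. A window longer than the
    available history maps to None instead of a misleading partial average.
    """
    result = {}
    for window in windows:
        if len(weekly_sales) >= window:
            result[window] = mean(weekly_sales[-window:])
        else:
            result[window] = None
    return result

# Hypothetical units sold per week for one item, oldest first (12 weeks).
weekly_units = [10, 12, 11, 13, 20, 22, 21, 25, 30, 28, 31, 35]
averages = trailing_weekly_averages(weekly_units)

for window, value in averages.items():
    print(f"{window}-week average: {value}")
```

Because the windows are a parameter, the VP's request for 12-week and 16-week figures becomes a one-line change rather than another round trip between teams.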
There's a huge difference between a data workflow that's 100% documented and public, and a data workflow where even one step is local to somebody's machine. You as the engineering and data science team are never going to know how often this is happening, but it’s probably a lot. Ask your ops team an open-ended question like “what do you use spreadsheets for?” to get a sense of how much of this “shadow ETL” activity is happening outside of your main data scheme.
This is why a lot of teams find it useful to build these flows in Parabola, where anybody with a web browser can see and understand what's going on. You can literally double-click on every step and look at exactly what is being done to the data. You can observe the state of the data before and after every transformation, and read the order in which the steps proceed. And the edits you make are not destructive, so you don't end up erasing rows or columns you may need later.
It's easy to imagine how giving everyone the power to create a data workflow will help people be more self-sufficient and productive. But what we also observe, which is probably even more exciting, is that once everyone can see how other people's data workflows work, the whole team starts to get better ideas.
We do have customers who decided to basically rip out all of their traditional ETLs from code and just use the Parabola connections instead. But it's far more common to leave the traditional data pipeline running and use Parabola as a kind of sandbox for experimentation and collaboration. The stuff that really works well and is popular can be folded back into enhancements to the code. And because you're only bringing over things that you already know will work, the load on the engineering team is much lighter. So everyone is free to focus on their core priorities.
Starting with Parabola
We think Parabola will make organizations more efficient by unlocking the collaborative potential of data workflows. Hopefully now you have a sense of exactly what that means. To learn more, email firstname.lastname@example.org or sign up for a 14-day trial at parabola.io.