Following on from the last blog post, about DataOps, I think it’s a great time to talk about the data pipeline, another hot topic that’s come to the fore in the last few years.
So, what is a data pipeline? We’ve (probably) all heard the term, but what does it mean?
Everyone has a data pipeline of sorts. Whether you simply pull your data into spreadsheets to work on it, or run the complex data pipelines of a large organisation, it all counts as your data pipeline.
At the base level, the data pipeline is the process your data goes through to take it from its raw state, into storage (data warehouse, data lake, etc.), and then into your analytics solution.
The modern data pipeline automates this process and can provide near real-time streaming of the data.
Isn’t that ETL?
No, ETL is a step in your data pipeline. It became popular in the 1970s and is often used in data warehousing. It’s one of the data pipeline’s first steps: you extract the data from its source (E), transform it into the form to be written to the destination, which can include blending in other sources of data (T), and finally load the data into its destination (L).
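To make the three steps concrete, here’s a minimal ETL sketch in Python. Everything in it is hypothetical: the orders.csv source, the field names, the exchange rates, and the SQLite file standing in for a real warehouse.

```python
import csv
import sqlite3

def extract(path):
    # (E) Extract: read raw rows from a source CSV file
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows, fx_rates):
    # (T) Transform: clean each row and enrich it with a second
    # source (a currency lookup) to produce warehouse-ready records
    records = []
    for row in rows:
        currency = row["currency"].upper()
        records.append({
            "order_id": row["order_id"],
            "amount_gbp": round(float(row["amount"]) * fx_rates.get(currency, 1.0), 2),
        })
    return records

def load(records, db_path):
    # (L) Load: write the transformed records to the destination
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount_gbp REAL)")
    con.executemany("INSERT INTO orders VALUES (:order_id, :amount_gbp)", records)
    con.commit()
    con.close()

load(transform(extract("orders.csv"), fx_rates={"USD": 0.78, "EUR": 0.85}), "warehouse.db")
```

Note that the data only lands at the destination after it has been transformed; that ordering is the key difference from ELT below.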
So what’s ELT? Has someone misspelled ETL?
Recently, ELT (Extract, Load, Transform) has been used when dealing with data lakes. In the ELT process the data is not transformed when it’s extracted to the data lake, but is stored in its original form. Because the data isn’t transformed on the way in, the queries and schema don’t need to be defined up front. That said, some sources are databases or other structured data systems, so that particular data will arrive with an associated schema.
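By contrast, here’s the same idea in ELT form: a minimal sketch in which the raw events are landed untouched and a schema is only applied when the data is read. The file and field names are again invented for illustration.

```python
import json

def extract_load(events, lake_path):
    # (E + L) Land the raw records in the "lake" exactly as received,
    # with no schema or transformation applied up front
    with open(lake_path, "a") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")

def transform_on_read(lake_path):
    # (T) Apply structure later, at query time, keeping only the
    # fields this particular analysis needs
    with open(lake_path) as f:
        for line in f:
            event = json.loads(line)
            if event.get("type") == "purchase":
                yield {"user": event["user"], "value": event["value"]}

extract_load(
    [{"type": "purchase", "user": "u1", "value": 9.99},
     {"type": "page_view", "user": "u2", "path": "/home"}],
    "events.jsonl",
)
print(list(transform_on_read("events.jsonl")))  # [{'user': 'u1', 'value': 9.99}]
```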
Why have data lakes become popular?
As has been previously discussed, the amount of data being created is increasing day by day, hour by hour, so a new way of storing and accessing data was needed.
Data lakes allow you to store high-velocity, high-volume data in a constant stream. This can include relational data (operational databases and data from line-of-business applications) and non-relational data (mobile apps, IoT devices, social media, etc.).
Because databases marry storage with compute, storing large volumes of data in them becomes very expensive. That leads to heavy data retention management (either stripping certain fields from the data or limiting how long the data is held) to reduce costs.
Data lakes, by contrast, are relatively inexpensive. This is mainly because storing data in its unstructured format is cheap, and you don’t incur the time-consuming, costly preparation work that a data warehouse demands before the data can be stored.
Do I need a data lake?
Well, the answer is: it depends. There’s no one-size-fits-all answer here. To understand if a data lake is right for you, ask yourself these four questions:
- How structured is your data?
If most of your data sits in structured tables (CRM records, financial balance sheets, etc.), then it’s probably easiest to stick with a data warehouse.
If you’re working with large volumes of event-based data (server logs or a click stream) then it might be easier to store that data in its raw form in a data lake.
- Is data retention an issue?
If you’re constantly trying to balance keeping hold of data (for analysis) against getting rid of it to manage costs, then it would make sense to investigate a data lake.
- Do you have a predictable or experimental use case?
If you’re looking to build a dashboard or report from your data, built from a fairly fixed set of queries against regularly updated tables, then a data warehouse would probably be your best option.
If, however, you have more experimental use cases (machine learning, predictive analytics), it’s harder to know in advance what data will be needed and how you’d like to query it, and a data lake might be a better option.
- Do you work with (or are you looking to work with) streaming data?
Streaming data is data generated continuously by a large number of sources, which typically send their records simultaneously. It can come from log files generated by mobile phones, web applications, geospatial services, ecommerce purchases, game player activity, social networks, financial trading floors; the list goes on.
If you are looking to do the above, then it would make sense to investigate a data lake.
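To give a feel for the kind of ingestion a data lake is built for, here’s a toy sketch of a continuous event stream being appended to raw, time-partitioned files. The simulated sensor source and lake layout are purely illustrative; in practice the events would arrive from a broker, app, or device fleet.

```python
import json
import time
from datetime import datetime, timezone
from pathlib import Path

def event_source():
    # Hypothetical stand-in for a continuous stream of records
    for i in range(5):
        yield {"device_id": f"sensor-{i % 2}", "reading": 20 + i,
               "ts": datetime.now(timezone.utc).isoformat()}
        time.sleep(0.1)

def ingest(stream, lake_root):
    for event in stream:
        # Partition the raw files by arrival hour so downstream jobs
        # can pick up new data incrementally
        hour = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H")
        partition = Path(lake_root) / hour
        partition.mkdir(parents=True, exist_ok=True)
        with open(partition / "events.jsonl", "a") as f:
            f.write(json.dumps(event) + "\n")

ingest(event_source(), "lake/raw")
```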
So, I’ve got all my data stored. What happens next?
The next step is to get your data into an analytics solution.
If you’re using a data warehouse, then you would probably look to connect your analytics solution to it, and then build your analytics dashboards.
You can’t really do that from a data lake. The next logical step on from a data lake is to build a data catalogue, which essentially gives you a unified view of all your data and its associated metadata.
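To give a feel for what a catalogue holds, here’s a toy sketch that walks the lake layout from the streaming example above and records one metadata entry per dataset. Real catalogues capture far more (lineage, owners, quality metrics); the fields below are just illustrative.

```python
import json
from pathlib import Path

def catalogue_lake(lake_root):
    # One metadata entry per raw dataset, so the data can be searched
    # and understood without opening every file
    catalogue = []
    for path in Path(lake_root).rglob("*.jsonl"):
        with open(path) as f:
            first = json.loads(f.readline())  # sample a record to infer fields
        catalogue.append({
            "dataset": str(path),
            "fields": sorted(first.keys()),
            "size_bytes": path.stat().st_size,
        })
    return catalogue

for entry in catalogue_lake("lake/raw"):
    print(entry)
```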
What vendors should I be looking at to help with my data pipeline?
Well that depends on if you’re using a data warehouse, or data lake.
If it’s a data warehouse, then it’s really business as usual. Power BI and Tableau need clean data, so they do require a data warehouse. Qlik can easily connect to your data warehouse, though I know some organisations that use Qlik itself as their data warehouse. Some aren’t small either, with up to 2,000 users.
If you’re looking at a data lake, there are numerous vendors that can assist with the different stages, though the vast majority only provide one or two pieces of the puzzle.
In my mind, Qlik’s offering, after its acquisition and integration of Attunity and Podium Data (now under the banner of Qlik Data Integration (QDI)), is the most comprehensive solution.
QDI is quite brilliant: it will take your data (from an industry-leading number of sources), automate the creation of your data warehouse/data lake, stream the data into platforms like Kafka, and output to your analytics solution of choice. Of course, I think this should be Qlik, but if you just want to do reporting you can connect any other solution to its outputs (Power BI, Tableau, etc.).
The automation is incredibly powerful too, allowing you to automate the mapping, target table creation and data instantiation; quickly create and deploy analytics-ready structures (not just for Qlik, remember); and, at scale, catalogue, inventory, search and retrieve data.
In addition to all this, QDI also lets you automate the movement of your data between on-premises data sources and cloud storage.
If you’re looking to investigate any of this, I can only strongly recommend looking at Qlik’s solution as part of your process.
Thanks for making it this far. Do you have a data pipeline? What does it look like? I’d love to hear in the comments.