A data lake is a centralized storage repository that holds all of an organization's data in its raw format: structured data such as tables from RDBMSs; semi-structured data such as CSV files, XML files, logs, and JSON; and unstructured data such as PDFs, Word documents, text files, and emails.
The term "data lake" was coined by James Dixon, then CTO of Pentaho, who used the following analogy:
“If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake and various users of the lake can come to examine, dive in, or take samples.”
The purpose of a data lake is to provide a resting place for data in its native format until it is needed. The data is transformed, and a schema is applied, only when it is read for analysis. This approach is called "schema on read", in contrast to "schema on write", where data must conform to a schema before it is stored.
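To make schema on read concrete, here is a minimal sketch using only the Python standard library. The event records, field names, and the `read_with_schema` helper are hypothetical and exist only for illustration. Real data lakes usually rely on a query engine for this step, but the idea is the same: the raw data is stored untouched, and each reader applies its own schema when it reads.

```python
import json
from datetime import datetime

# Hypothetical raw events as they might land in the lake:
# newline-delimited JSON, stored exactly as received (all values are strings).
RAW_EVENTS = """\
{"user": "alice", "ts": "2024-01-15T10:30:00", "amount": "19.99", "page": "/cart"}
{"user": "bob", "ts": "2024-01-15T10:31:30", "amount": "5.00"}
{"user": "alice", "ts": "2024-01-16T08:00:00", "amount": "7.50", "page": "/home"}
"""


def read_with_schema(raw_text, schema):
    """Parse raw lines and apply `schema` (field name -> converter) at read time."""
    rows = []
    for line in raw_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        # Keep only the fields this reader asks for; missing fields become None.
        rows.append({
            field: convert(record[field]) if field in record else None
            for field, convert in schema.items()
        })
    return rows


# Reader 1: a revenue analysis wants users and numeric amounts.
revenue_schema = {"user": str, "amount": float}
# Reader 2: a traffic analysis wants timestamps and pages from the same raw data.
traffic_schema = {"ts": datetime.fromisoformat, "page": str}

revenue_rows = read_with_schema(RAW_EVENTS, revenue_schema)
traffic_rows = read_with_schema(RAW_EVENTS, traffic_schema)

# Aggregate the revenue view: total amount spent per user.
total_by_user = {}
for row in revenue_rows:
    total_by_user[row["user"]] = total_by_user.get(row["user"], 0.0) + row["amount"]
```

Both readers work from the same stored text. Neither schema had to be decided when the events were written, and a third analysis could define its own schema later without reloading or restructuring anything.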
Enterprises turn to data lakes for a variety of reasons. Some common ones:
- They are suitable repositories for data sources that produce enormous volumes of information, such as website activity logs, IoT data, social media data, and logistics updates.
- They can be an ideal environment for training and developing machine learning or artificial intelligence tools.
- Because the data is stored raw, it can be processed multiple times in multiple ways as the needs of the business change.
- They provide a scalable, cost-efficient, single point of storage for a company’s data needs.