Best Practices for Deploying Data Lakes

By John Gray | August 9, 2019

Although the term is still emerging, data lakes have recently gained recognition among IT teams as data increasingly becomes a foundation of modern business. Conceived as a way to reduce data sprawl and data silos, data lakes emerged from the data warehousing industry, which targeted the frustrations IT encountered when trying to create an organized repository of strategic datasets on which to base key business decisions. Uses of such a repository range from data analytics that help a business better understand customer needs to artificial intelligence that addresses real-time challenges.

Data lakes, in many ways, are an evolution of data warehousing. Many data warehouse projects failed: They were too costly, took too long, and only achieved a small subset of the original goals. With data changing and growing so rapidly, the need for quickly getting value out of data has grown ever more pressing. Nobody can afford to spend months or years analyzing and modeling data for business use. By the time the data is usable in a data warehouse, the business needs have changed.

In a similar vein to data warehouses, data marts emerged to embrace data with a specific use or cataloged by a certain quality (marketing departmental data, for example). Data marts have been more successful because the usage of the data is better understood, and the results can be delivered more quickly. However, the compartmentalized nature of data marts has made them less useful to businesses that have massive amounts of data and that need to use that data cross-functionally and across several parties.

For this reason, data lakes have developed to meet business needs at scale. They are intended to speed things up, making data more readily usable for previously undefined needs. The emergence of truly large-scale cloud computing, with its massive cheap compute power and almost infinite storage, has made this data lake approach viable.

Since data lakes are still a rather new concept, the market hasn’t yet fully adapted to them. Therefore, early adopters will see the most value from data lakes at this time, perhaps in using them to empower artificial intelligence within daily business. Beyond those who have already embraced data lakes, many IT teams are assessing them to find the right solution for their business. What can be done to properly deploy a data lake? Here are my suggestions for three best practices to follow:

1. Put data into a data lake with a strategy

The core reason for keeping a data lake is to use its data for a purpose. Although in theory a data lake should serve many yet-to-be-defined uses, it is better to start out knowing something about how the data will be used. Consider how you will gain value from a data lake beyond storage. As with any IT initiative, it’s important to first match a data lake’s deployment to a concrete strategy that aligns not only with IT goals but with long-term business goals as well.

Ask yourself whether keeping a data lake will help the business leverage its data. Keeping data for use “down the road” is costly if down the road is years from now. If a business doesn’t intend to use its data for a specific purpose in the short term, the money spent storing that data is wasted.

2. Keep data at the lowest level of granularity — and tag it

Storing data at the most detailed level allows the data to be assembled, aggregated, and otherwise manipulated for a myriad of purposes. Don’t aggregate or summarize the data prior to storing it in the data lake. Because the value of having a data lake will not be realized until a business can make use of the data within it, it is better to put data into the lake with tagging and cataloging, so that when needed, IT can sift through the repository to pull out assets. The use of tagging, which is needed for reporting, can help to enable analytics projects. Also, machine learning (ML) and AI can aid in the tagging process by sifting through existing data and creating tags.

Additionally, companies can use these data analytics, ML, and AI projects to drive overall improved competitiveness for the business. One tool can empower another.
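As a rough illustration of this practice, the Python sketch below stores raw, un-aggregated records alongside tags and only aggregates when a question is asked. Everything in it is hypothetical: the `Asset` and `Catalog` names, the tags, and the sample orders are assumptions made for the example, and the in-memory index merely stands in for whatever catalog your data lake platform provides.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Asset:
    """One raw, un-aggregated record plus the tags used to find it later."""
    asset_id: str
    payload: dict
    tags: set = field(default_factory=set)


class Catalog:
    """Toy in-memory tag index (hypothetical stand-in for a real catalog)."""

    def __init__(self):
        self._assets = {}
        self._by_tag = defaultdict(set)

    def add(self, asset):
        # Keep the record at full detail and index it under each tag.
        self._assets[asset.asset_id] = asset
        for tag in asset.tags:
            self._by_tag[tag].add(asset.asset_id)

    def find(self, *tags):
        """Return assets carrying every requested tag, ordered by id."""
        if not tags:
            return []
        ids = set.intersection(*(self._by_tag.get(t, set()) for t in tags))
        return [self._assets[i] for i in sorted(ids)]


# Hypothetical sample data: individual orders, not pre-computed totals.
catalog = Catalog()
catalog.add(Asset("order-1", {"region": "EU", "amount": 120.0}, {"sales", "eu"}))
catalog.add(Asset("order-2", {"region": "EU", "amount": 80.0}, {"sales", "eu"}))
catalog.add(Asset("order-3", {"region": "US", "amount": 50.0}, {"sales", "us"}))

# Aggregation happens at query time, so new questions remain answerable.
eu_sales = catalog.find("sales", "eu")
eu_total = sum(a.payload["amount"] for a in eu_sales)
```

Because the individual orders are kept, the same data could later answer a different question, such as order counts per region, which a stored regional total alone could not.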

3. Have a data destruction plan

Too often, companies accumulate large amounts of data without any plan in place to get rid of unnecessary assets. When there’s a compliance obligation to destroy information after a certain period (as GDPR requires companies to do with EU citizen data), the lack of a destruction plan can be a roadblock to meeting that obligation.

Pairing a destruction plan with your data lake can help you identify what needs to be destroyed and when. It can also help in scenarios in which businesses are required to track where all client data resides: having a single location reduces cost and saves time.
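One way to picture a destruction plan is as a retention policy checked against what the lake holds. The sketch below is only an illustration under stated assumptions: the category names, the retention periods, and the sample assets are all made up, and real retention periods have to come from your legal and compliance requirements.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class StoredAsset:
    asset_id: str
    category: str   # hypothetical tag, e.g. "customer_pii"
    stored_on: date


# Hypothetical retention periods in days, for illustration only.
RETENTION_DAYS = {"customer_pii": 365, "web_logs": 90}


def due_for_destruction(assets, today, retention=RETENTION_DAYS):
    """Return assets whose retention period has elapsed as of `today`."""
    due = []
    for asset in assets:
        days = retention.get(asset.category)
        if days is not None and asset.stored_on + timedelta(days=days) <= today:
            due.append(asset)
    return due


def missing_policy(assets, retention=RETENTION_DAYS):
    """Return assets in categories that have no retention period defined."""
    return [a for a in assets if a.category not in retention]


# Hypothetical inventory of what the lake currently holds.
assets = [
    StoredAsset("a1", "customer_pii", date(2018, 1, 15)),
    StoredAsset("a2", "web_logs", date(2019, 7, 1)),
    StoredAsset("a3", "web_logs", date(2019, 3, 1)),
    StoredAsset("a4", "sensor_raw", date(2017, 5, 1)),
]

today = date(2019, 8, 9)
due_ids = [a.asset_id for a in due_for_destruction(assets, today)]
unplanned_ids = [a.asset_id for a in missing_policy(assets)]
```

Reporting assets that have no policy separately, rather than silently keeping or deleting them, is a deliberate choice in this sketch: it surfaces data that nobody has yet decided how long to keep.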

Preparing for the future

As growing amounts of data proliferate across the business landscape, there will continue to be a need to store and use that data in a strategic fashion. Data lakes are emerging as a strong way to unlock the value of data for the business. In considering a data lake solution, first determine how you think your organization will use the data, then where you’ll put it. For example, the cloud has great appeal for data lakes because of its lower storage costs. If the cloud makes sense for your company goals, evaluate third-party providers that can meet your unique infrastructure needs. How will the cloud services provider or your own DevOps team build a process into your data lake so that data can be loaded and lifted from the lake according to objectives?

Because gaining full value from a data lake will undoubtedly involve a lot of processing, consider which steps in the analytics process can be automated. You also need staff skilled in building the infrastructure to host a data lake, loading data into it, and transforming the data for use. Establishing regular, open communications between IT and business leaders is a good first step to enable any IT transformation, such as a data lake solution.

NOTE: This blog post originally appeared in August 2019 as an article on InformationWeek: 
