A data lake is a data storage model in which a business creates a single large repository from multiple sources and multiple types of data. The primary purpose of a data lake is to bring all the information in an environment together so that analytics can be run against it. Organizations can then look for trends in new ways, using more sources. It is one of the primary solutions created to address the challenge of Big Data and to promote data analytics. This paper covers what Big Data and a data lake are, how they interact, the value they bring, and the pros and cons they present for data analytics.
There is no question that data is among the most valuable assets of any business. Whether for Visa, Disney, or a local restaurant, data is imperative to the business. Traditionally the value was placed in simply having the data (customer lists, banking information, payroll, etc.), but recently there have been a few changes in the industry. The amount of data a company holds is constantly growing, which is leading to new opportunities. Businesses are now able to run analytics against their data to find trends and make predictions that add value. This allows them to be more efficient and competitive in the market by giving the business a new perspective. The amount of valuable insight data analytics delivers is roughly proportional to the amount of data analyzed.
This massive data growth, combined with data analytics, offers insight that was previously unavailable. The problem is managing and maintaining all of this ever-growing data, a problem that has earned the term "Big Data." The idea is that there is a wealth of untapped information already in a business's data center, but the business must find a way to bring all of its data into one place for analytics in order to solve its growing Big Data problem. Doing so leads to better customer insight, operational efficiency, and competitive advantage through analytics.
What is Big Data?
Big Data is, simply, data that is too "large" for a traditional application to handle. It does not have to be made up of hundreds of petabytes of information. It could be as simple as not being able to attach a file to an email because it is too large, or having a file system so large that it is impossible to run a search. People normally think of Big Data as a massive conglomerate of storage when, in reality, any amount of data can be considered Big Data. Big Data can be characterized by its volume, velocity, and variety.
Volume is the amount of data in an environment, and it is what people most commonly think of when they think of Big Data. Volume presents the most immediate challenge to conventional IT structures: it calls for scalable storage and a distributed approach to querying. Many companies already have large amounts of archived data, perhaps in the form of logs, but not the capacity to process it. This can very quickly create capacity constraints for an organization.
Velocity is how quickly the data needs to be accessed. An example of this is the stock market, where the speed at which data must be accessed is so important that milliseconds of latency can result in millions of dollars lost. If users cannot access the data fast enough to deliver value to the business, it becomes a wasted resource.
Variety refers to the many different forms data can take, from structured databases to unstructured media files and everything in between. Managing each type of data set reliably and effectively is a challenge. This is where a Data Lake can solve some of the storage silo problems produced by the variety of Big Data.
Big Data is forcing businesses to move from structured data warehouses to more accessible and scalable centrally managed repositories like Hadoop, which serve as the central repository for any and all data that might be valuable for future analysis. Data Lakes offer robust, intelligent management of this central repository. Big Data has also changed analytics from measuring the past to predicting the future. Predictive modeling was traditionally confined to data mining, but technologies like Hadoop now provide the ability to analyze much larger amounts of data quickly, letting data analysts rapidly develop more accurate predictive models that deliver unprecedented value to the business.
What is a Data Lake?
Before discussing what a data lake is, businesses need to understand the types of data that exist. First, there is structured data: data from relational databases like Oracle, SQL Server, and IBM DB2, where everything fits into rows and columns. This structure allows data to be entered, stored, queried, and analyzed easily. There is also unstructured data, which is primarily user-generated data such as documents, videos, images, web pages, PDFs, etc. These are not so easily classified and do not fit into a nice, neat box. Finally, there is semi-structured data, which is similar to structured data but lacks a strict data model. For example, it could be a Word document that is mostly free text but carries some metadata (like tags or markers).
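The distinction between the three types can be sketched in a few lines of code. This is an illustrative example only; the record values and tags below are invented for the sake of the sketch.

```python
import json

# Structured: fixed rows and columns, as in a relational table.
structured_row = {"id": 1, "name": "Alice", "balance": 250.00}

# Semi-structured: free text plus optional metadata tags,
# with no strict schema enforced across records.
semi_structured = json.loads('{"body": "Quarterly report...", "tags": ["finance", "Q3"]}')

# Unstructured: opaque bytes (an image, video, or PDF) with no inherent schema.
unstructured = b"\x89PNG\r\n..."

print(sorted(structured_row.keys()))    # every row shares the same columns
print(semi_structured.get("tags", []))  # metadata may or may not be present
print(type(unstructured).__name__)      # just bytes until an application interprets them
```

The point of the sketch is that only the first record can be queried by column name without extra work; the second needs its tags inspected, and the third is meaningless until an application interprets the bytes.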
Most of the data growth taking place today is in the unstructured segment. In fact, roughly 80 percent of data is unstructured, and it is almost doubling every year. This is one of the biggest drivers of Big Data. Unstructured data comes in so many types and sizes that it is hard to index and search. This has led to the development of yet another type of storage for unstructured data: object-based storage.
Object-based storage is a storage architecture that packs data together with all of its metadata into a single object to be managed. This is unlike traditional file systems, which manage data as a file hierarchy, and block storage, in which data is stored as blocks within sectors and tracks. Each object is given an ID, and an object is always retrieved by an application presenting that object ID to the object store.
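A minimal in-memory sketch can make the access pattern concrete. The `ObjectStore` class and the content-based ID scheme below are assumptions for illustration, not any vendor's actual API; real object stores expose the same put-by-data, get-by-ID pattern over a network protocol.

```python
import hashlib

class ObjectStore:
    """Minimal in-memory sketch of object-based storage: data and its
    metadata travel together as one object, addressed only by an ID."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes, metadata: dict) -> str:
        # Derive a content-based ID, as many object stores do.
        object_id = hashlib.sha256(data).hexdigest()
        self._objects[object_id] = {"data": data, "metadata": metadata}
        return object_id

    def get(self, object_id: str) -> dict:
        # Retrieval is always by ID -- there is no path hierarchy to walk.
        return self._objects[object_id]

store = ObjectStore()
oid = store.put(b"sensor reading 42", {"source": "plant-7", "type": "telemetry"})
print(store.get(oid)["metadata"]["source"])
```

Note that the caller never supplies a directory path, only the ID returned by `put`; that flat namespace is what lets object stores scale past the limits of hierarchical file systems.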
Between the variations of structured, semi-structured, and unstructured data, it is difficult for businesses to maintain each individually, never mind combine them into one central pool of storage. And that is before organizations even evaluate how to handle each data set's volume, velocity, and variety needs. To properly run analytics against these different data sets and gain insight, organizations need a way to bring the varying types of data together.
The Data Lake is a flexible infrastructure model made up of storage, compute, and software that can scale and evolve with the data requirements. It enables businesses to ingest, store, and finally analyze multiple different types of data. This allows data scientists and application developers to create applications and analytics that can act on that data in real-time. The Data Lake enables organizations to consolidate the departmental data silos into a single extensible repository that removes duplicate data and administration overheads.
How to create a Data Lake for Analytics?
Many vendors offer a Data Lake solution. The two most prominent approaches are EMC's Enterprise Data Lake solution and a home-grown option leveraging Apache Hadoop. The choice depends on whether a customer has the resources to build its own Data Lake from open source components or prefers a paid, supported solution. Both have their pros and cons and follow a similar concept; the difference is in their execution.
The idea of a Data Lake is closely tied to Apache Hadoop and the open source projects that come with it (e.g., Pig, Hive). Hadoop has become popular because it provides a cheap and flexible framework for meeting Big Data challenges, and organizations are discovering the data lake as an evolution of their existing data warehousing architectures.

Data in the data lake is often divided into categories. Some data, such as video, audio, images, and other media assets, is stored in a filesystem, and the power of Hadoop is then used to apply various analytic techniques to extract insights. Other data may include unstructured or partially structured text that is also stored in the filesystem but requires different forms of analytics. The number of categories and the types of analytics applied to each vary widely across industries: call detail records may be the focus in telecommunications, while sensor data is especially critical in manufacturing.

The goal of these analytics is ultimately to provide insights or to spur new forms of analytics. Structuring the data with metadata IDs makes it more manageable, allowing what was once a massive amount of varying data to be operationalized and reused more effectively. This process can then be repeated, with the help of machine learning, across data discovery platforms, in-memory databases, graph analytics engines, NoSQL repositories, and more, creating a repeatable, automated process for farming business intelligence from Big Data.
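Hadoop's core processing model is MapReduce: a map phase emits key-value pairs from each block of input, and a reduce phase aggregates them by key. The following is a minimal pure-Python sketch of that model applied to a word count, with no Hadoop cluster assumed; each "document" stands in for a file block that a mapper would process.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document: str):
    # Map: emit (key, value) pairs -- here, (word, 1) for every word.
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(pairs):
    # Shuffle/reduce: group values by key and aggregate them.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

documents = ["big data big lake", "data lake analytics"]
all_pairs = chain.from_iterable(map_phase(d) for d in documents)
print(reduce_phase(all_pairs))  # -> {'big': 2, 'data': 2, 'lake': 2, 'analytics': 1}
```

In a real cluster the map calls run in parallel on the nodes holding the data blocks, and the framework handles the shuffle between phases; the logic per record is the same.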
EMC’s Data Lake solution is a little different. EMC focuses more on the hardware side of the equation, leveraging either their Isilon or Elastic Cloud Storage (ECS) storage solutions. Both function as a central repository for all of the data to be accessed in a data lake model. Each one has their own unique features and functionality they bring to the table though.
ECS is a multi-purpose, shared global storage platform. Using software-defined storage on commodity servers, ECS can scale to exabyte sizes for both small and large files. Built from the ground up to support multiple protocols, ECS simplifies storage by becoming a global repository for many different data sets, supporting both block and object storage workloads on a single scalable storage cluster. The advantage here is a turnkey repository for a data lake that provides data replication, security, flexible access, and cloud scale at the cost of commodity hardware. ECS also offers HDFS services for Big Data applications and analytics natively, making it one of the ideal hardware platforms to serve as the repository for an organization's Data Lake.
EMC Isilon is a scale-out NAS platform with native integration of the Hadoop Distributed File System (HDFS). In fact, Isilon network-attached storage requires no separate data ingestion: users can run Big Data analytics in place, so there is no need to move data to a dedicated Hadoop infrastructure. End users can store unstructured data natively on Isilon using traditional protocols like SMB, NFS, and FTP, and Isilon makes those same data sets accessible via HDFS to any analytics engine. This speeds up data analysis and reduces costs.
These solutions can work either together or independently. The EMC and Hadoop solutions each solve unique problems in their areas of concentration, and together they provide the support, scalability, and protocols for a Data Lake solution. This flexibility of choice between EMC and Hadoop has led to their wide adoption in the market for Big Data solutions.
How is a Data Lake a primary driver for Big Data Analytics?
A Data Lake is simply a data management platform for analyzing many types of data from different sources in their native format. Instead of placing data in a purpose-built data store, you move it into the data lake in its original form, eliminating upfront ingestion costs such as transformation. Once data is placed into the lake, it is available for analysis by everyone in the organization. This gives businesses an easy way to use their pre-existing data sets in new ways with analytics: a user can take data from different sources and combine it into one large repository that is scalable and secure, opening the door to what is possible with analytics.
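Keeping data in its native format and deferring transformation is often called "schema-on-read." A minimal sketch of the idea, with invented payloads: raw JSON and CSV land in the lake untouched, and structure is applied only when a query runs.

```python
import csv
import io
import json

# The "lake": raw payloads kept exactly as they arrived, untransformed.
lake = [
    {"format": "json", "raw": '{"customer": "acme", "spend": 120}'},
    {"format": "csv",  "raw": "customer,spend\nglobex,80"},
]

def read_records(entry):
    # Schema-on-read: structure is applied at query time,
    # not at ingest time.
    if entry["format"] == "json":
        return [json.loads(entry["raw"])]
    if entry["format"] == "csv":
        return list(csv.DictReader(io.StringIO(entry["raw"])))
    return []

total = sum(int(rec["spend"]) for e in lake for rec in read_records(e))
print(total)  # -> 200
```

Contrast this with a data warehouse, where both payloads would have been transformed into one common schema before loading; the lake instead pays that cost only for the data actually queried.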
“The phenomenon is this: enterprises are finding new sources of data, new ways to analyze data, new ways to apply the analysis to the business, and new revenues for themselves as a result. They are using new approaches, moving from descriptive to predictive and prescriptive analytics and doing data analysis in real-time. They are also increasingly adopting self-service business intelligence and analytics, giving executives and frontline workers easy-to-use software tools for data discovery and timely decision-making.”– EMC
This allows businesses to use their own data to become more efficient and competitive. Over the past few years, data analytics has become common across many market segments, and it is increasingly difficult for a company to compete in an industry without doing any data analytics when its competition is. Soon, simply running data analytics will not be enough for a competitive edge; the edge will come from how much data an organization can analyze to find details the competition may have missed. This is where adopting a Data Lake lets organizations bring as many of their different data sets together as possible for a more detailed and holistic analytics strategy.
What are the road blocks to adoption?
There are a few detractors to moving to a Data Lake model. The first is that an organization needs highly skilled resources who understand data manipulation and analysis, which can be costly to initially hire, maintain, and support. While users who work with data every day, such as data scientists, may have this level of sophistication, the majority of business users lack it, along with support from operational information governance routines. Developing or acquiring these skills, or obtaining such support on an individual basis, is time-consuming and expensive, if possible at all. By definition, a data lake accepts any data, without oversight or governance. Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into a "data swamp," and without metadata, every subsequent use of the data means analysts start from scratch. It is therefore important to categorize and maintain the Data Lake so that its data can be accessed accurately and quickly.
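One common mitigation is to record descriptive metadata at ingest time in a catalog kept alongside the lake. The sketch below is a hypothetical minimal catalog (the object IDs, sources, and owners are invented); real deployments use tools such as Apache Atlas or Hive's metastore for the same purpose.

```python
from datetime import datetime, timezone

catalog = {}  # descriptive metadata kept alongside the lake

def ingest(object_id: str, source: str, data_type: str, owner: str):
    # Record provenance when data enters the lake, so later analysts
    # need not rediscover what each data set is and where it came from.
    catalog[object_id] = {
        "source": source,
        "type": data_type,
        "owner": owner,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }

def find(data_type: str):
    # Discovery happens through the metadata, not by scanning raw objects.
    return [oid for oid, meta in catalog.items() if meta["type"] == data_type]

ingest("obj-001", "web-logs", "clickstream", "marketing")
ingest("obj-002", "erp-export", "transactions", "finance")
print(find("clickstream"))  # -> ['obj-001']
```

Without such a record, the only way to answer "what clickstream data do we have?" is to open and inspect every object in the lake, which is exactly the data-swamp failure mode described above.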
Another concern is security and governance, since all of "your eggs will be in one basket." Organizations have to protect their data, ensure data loss prevention, and protect their public reputation. If data is to be used for analytics, then the validity of that data is key: if it has been tampered with or partially deleted, the results of any analytics will be skewed. If an organization cannot first protect its data in a single repository, it cannot move forward with a Data Lake strategy.
There are quite a few other roadblocks, such as not having enough data, lack of collaboration with the business, and lack of technical infrastructure. If an organization does not have enough historical or real-time data, analytics will not deliver the insight the business needs. If there is poor communication or support from the business, the first steps toward Big Data analytics may never take place; and even if the project gets off the ground, the insights from the analytics may not be in line with business needs, ultimately offering little to no value. Finally, if the proper technical resources are not in place, the migration to a Data Lake will be impossible. If a business can successfully maneuver past these challenges, the potential rewards are massive.
What is the potential for Analytics to be applied to Data Lakes?
Customers want to know how they can extract real value from these new sources of information to be more competitive and to find inefficiencies in their businesses. There is a great deal of potential insight a business can reap from the data it already has; some of the basic analytics being done today is not even scratching the surface of what is possible. IDC has estimated that only 0.5 percent of all data in the world is being analyzed, which shows the massive potential businesses have in front of them.
“The Big Data technology and services market represents a fast-growing multibillion-dollar worldwide opportunity. In fact, a recent IDC forecast shows that the Big Data technology and services market will grow at a 26.4% compound annual growth rate to $41.5 billion through 2018, or about six times the growth rate of the overall information technology market. Additionally, by 2020 IDC believes that line of business buyers will help drive analytics beyond its historical sweet spot of relational (performance management) to the double-digit growth rates of real-time intelligence and exploration/discovery of the unstructured worlds.” – IDC
For example, Netflix was able to create one of its most popular shows through relatively simple analytics. Researching viewing patterns across its roughly 30 million subscribers, Netflix found it had a large number of subscribers who liked films from director David Fincher, as well as a large portion who liked Kevin Spacey films. The results of those analytics led Netflix to go ahead with "House of Cards."
Big Data analytics is still in its infancy, with far more potential ahead. Despite their analytical prowess, less than half of Analytical Innovators report that they are very effective at capturing data, analyzing and integrating it, and using analytical insights to guide corporate strategy. Similarly, less than half of Analytical Innovators strongly agree that they share data with stakeholders and have an integrated approach to information management. Even so, the competitive advantage analytics provides is so large that it is worth the effort. The Data Lake is a primary enabler making Big Data analytics possible in the enterprise.
Big Data and the ocean covering the earth are very similar. There is an estimated $60 billion worth of treasure on the ocean floor, up for grabs, but no one has a way to reach it. Businesses face the same issue with Big Data: they hold vast amounts of data with potentially millions of dollars' worth of value, but are unable to harness it.