Hadoop is an open source framework for storing and processing Big Data on large clusters of commodity hardware. This would seem to make it a natural hub for all of an enterprise’s data. In this scenario, Hadoop could serve every group within the enterprise in one big, seamless architecture. This would sweep away all the familiar problems of data silos, including out-of-sync duplicate instances and security inconsistencies across an organization’s critical data.
However, that’s not the way most real-world Hadoop implementations are structured. By its very nature, Big Data serves a wide audience and supports a broad range of use cases. This is in contrast to traditional data warehousing systems and data marts, which are typically designed for more specific business purposes. Hadoop initiatives usually begin in small, isolated groups rather than as a centralized initiative. This often results in a proliferation of Hadoop clusters, with each team managing its own environment.
From an IT perspective, this can lead to a number of challenges, including infrastructure sprawl, lack of governance, lack of standardization, and significant maintenance burdens. And almost every group (or “tenant”) wants to keep at least some of its applications, data, and computing resources separate from the other groups or departments within the organization. To provide this separation and isolation, most organizations adopt one of the following approaches:
· Islands of Hadoop, managed independently
· Islands of Hadoop, centrally managed
· Multiple Hadoop tenants on shared infrastructure, centrally managed
Managing Islands of Hadoop
As noted above, many organizations have islands of Hadoop clusters that are managed independently. Each Hadoop cluster runs on its own dedicated physical servers, typically with direct attached storage, so each cluster stores its own data and is managed individually. This approach prevents different groups from sharing data sets across clusters, and it’s difficult for IT to gain any operational efficiencies or benefits from standardization. However, this model does provide certainty about the location of data as well as physical isolation. This is very important for some types of data, such as financial information or other highly sensitive data that may be subject to government regulations or compliance policies aimed at protecting security and privacy.
Another approach adopted by some organizations is to have islands of Hadoop clusters that are managed centrally. A centralized IT team may provide some policy controls and governance; it may also be able to achieve some operational and cost-saving benefits through standardization. In many cases, the team may use a management console to monitor and manage multiple Hadoop clusters. This approach minimizes some of the challenges and inefficiencies of managing each cluster independently. However, as in the other “islands of Hadoop” scenario, users don’t benefit from sharing data sets across multiple environments.
In both of these scenarios, the process of getting access to a new Hadoop cluster is often time-consuming. To paraphrase an IT executive at one Fortune 500 enterprise: “It’s like trying to use a restroom at a gas station, where you have to wait for the current occupant to finish and then get the key.” Each new Hadoop cluster is carefully planned for a specific set of workloads and requires provisioning of the physical servers, storage, and other infrastructure components. Users need to clearly justify their use case and provide details about their application needs. There is typically an IT gatekeeper who gathers this input, determines the system requirements, coordinates the deployment of the various components, and helps onboard users when the new system is available. Given the diverse demands and use cases associated with Big Data, the end-to-end process can take several days, if not weeks or even months.
Centrally Managed, Shared Infrastructure
Of the three options cited above, the “shared infrastructure” approach has the greatest potential benefits in terms of efficiency, cost savings, and governance. There is a compelling argument for deploying multiple Hadoop tenants on centrally managed, shared infrastructure. Multi-tenancy consolidates and simplifies management for greater efficiency; it enables the sharing of resources for cost savings; and it allows data to be shared, eliminating the hassles and security risks of duplicating and storing the same data for different user groups.
However, the traditional Hadoop reference architecture for multi-tenancy raises significant concerns. The Hadoop community has offered a multi-tenant reference architecture for a single physical cluster based on recent open source projects such as YARN. Organizations may create shared Hadoop clusters for specific use cases, specific applications, or specific lines of business. They may also set up a shared cluster strictly for development and testing, or dedicate a cluster to a specific version or distribution of Hadoop. But this approach is technically complex and not for the faint of heart.
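For concreteness, the core of that reference architecture is YARN, which can carve a single physical cluster into per-tenant queues through its CapacityScheduler. The following is a minimal sketch only: the queue names and capacity percentages are hypothetical, and in a real deployment these properties are defined in capacity-scheduler.xml rather than set programmatically.

```java
import org.apache.hadoop.conf.Configuration;

// A minimal sketch of dividing one physical YARN cluster among three tenants
// using CapacityScheduler queues. Queue names and percentages are hypothetical.
public class TenantQueueSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // One queue per tenant under the root queue.
        conf.set("yarn.scheduler.capacity.root.queues", "finance,marketing,sales");

        // Guaranteed share of cluster resources for each tenant (sums to 100).
        conf.set("yarn.scheduler.capacity.root.finance.capacity", "50");
        conf.set("yarn.scheduler.capacity.root.marketing.capacity", "30");
        conf.set("yarn.scheduler.capacity.root.sales.capacity", "20");

        // Only members of the finance-analysts group may submit applications
        // to the finance queue (ACL format: "user1,user2 group1,group2").
        conf.set("yarn.scheduler.capacity.root.finance.acl_submit_applications",
                " finance-analysts");
    }
}
```

Even with queues like these in place, all tenants still share a single HDFS namespace and a single set of cluster daemons, which is why the isolation and security concerns described below weigh so heavily.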
There are also business challenges associated with this model for Hadoop multi-tenancy: it conflicts with real-world enterprise and legal restrictions on where certain types of data can be stored. While multi-tenancy promises enterprise-wide leverage of data and resources, many organizations are concerned about data security and privacy within the organization itself. The idea of multi-tenancy on a single physical Hadoop cluster is a non-starter for most enterprises from an operational risk perspective.
Before companies will even consider this approach to Hadoop multi-tenancy, they need to be confident that these tenants can share a common set of physical infrastructure without negatively impacting service levels or violating any security and privacy constraints. Is there any risk of confidential data, competitive data, employee information, financial data, or sensitive customer information being compromised? Could one set of users access this information—whether mistakenly or maliciously—and either delete it or use it for themselves? For many organizations, these risks make this approach untenable.
What Does Secure Multi-tenancy for Hadoop Look Like?
What’s needed is a secure multi-tenant Hadoop architecture that authenticates each user, “knows” what each user is allowed to see or do, and tracks who did what and when.
A user should only be able to see the data sets that he or she is authorized to see. For example, some customer data may be exclusive to analysts on the finance team, while other customer data may be shared with the sales and marketing teams. A business analyst on the marketing team can analyze customer satisfaction data, responses to promotional campaigns, or customer sentiment on social media. The sales team can analyze data to uncover new sales opportunities or improve revenue per customer. These users may share the same Big Data applications, infrastructure, and some of the same customer data sets in a Hadoop cluster. But only finance users with specific authorization should have access to sensitive financial data or private customer information.
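To make this concrete, HDFS has supported POSIX-style access control lists since version 2.4, and they can express exactly this kind of policy. The following is a minimal sketch, assuming ACLs are enabled on the cluster (dfs.namenode.acls.enabled set to true); the path and group name are hypothetical placeholders.

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.AclEntry;
import org.apache.hadoop.fs.permission.AclEntryScope;
import org.apache.hadoop.fs.permission.AclEntryType;
import org.apache.hadoop.fs.permission.FsAction;

public class FinanceAclSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path sensitiveData = new Path("/data/customers/financials"); // hypothetical path

        // Let the finance analysts group read and traverse the directory.
        AclEntry financeRead = new AclEntry.Builder()
                .setScope(AclEntryScope.ACCESS)
                .setType(AclEntryType.GROUP)
                .setName("finance-analysts") // hypothetical group
                .setPermission(FsAction.READ_EXECUTE)
                .build();

        // Everyone outside the owning user and group gets no access at all.
        AclEntry othersNone = new AclEntry.Builder()
                .setScope(AclEntryScope.ACCESS)
                .setType(AclEntryType.OTHER)
                .setPermission(FsAction.NONE)
                .build();

        // Must be run as the directory owner or an HDFS superuser.
        fs.modifyAclEntries(sensitiveData, Arrays.asList(financeRead, othersNone));
    }
}
```

The sales and marketing teams would get analogous entries on the data sets they are allowed to share, while the financial directory stays invisible to them.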
To enable this secure multi-tenant architecture for Hadoop, administrators need to be able to manage users and grant access to resources based on each user’s unique needs. They also need to be able to audit and track usage across multiple tenants and multiple clusters.
The Real Risk of Secure Multi-Tenancy for Hadoop: Not Doing It
There are many important considerations for your Hadoop deployment, but secure multi-tenancy may be one of the most important. So how do you make this work?
It is possible to implement a secure multi-tenant architecture for Hadoop with the technology available today. In particular, these requirements are well suited to virtualization. The traditional approach to Hadoop infrastructure, as highlighted above, is to run each Hadoop cluster on its own dedicated physical servers. Hadoop virtualization was long avoided due to concerns about performance, but recent technology innovations have addressed these performance issues. It is also now possible for IT professionals to access and share existing data across several clusters without copying it to each cluster, a chore that causes headaches and wastes time. This is contrary to the traditional “islands of Hadoop” approach with direct attached storage, but it opens up exciting new opportunities. It enables logical separation of compute and storage, and provides a foundation for shared resources in a secure multi-tenancy model.
By leveraging virtualization and other technology innovations for Hadoop infrastructure, IT organizations can ensure security and privacy while gaining the benefits of shared resources. They can implement enterprise-grade authorization mechanisms based on user directories and authentication technologies such as Kerberos. Multiple groups can access the same data for their Hadoop analysis in shared storage, avoiding the cost of moving or duplicating data. Multiple users can run the same Big Data applications without the company having to buy additional licenses or install them in multiple places. Disparate stakeholders, such as marketing, sales, and finance, no longer have to wait in a queue to gain access to their own physical Hadoop cluster. Virtual Hadoop clusters can be provisioned in a matter of minutes instead of days or weeks.
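As one example of how the authentication piece fits together, Hadoop’s client libraries ship with Kerberos support through the UserGroupInformation class. The sketch below assumes a Kerberos-enabled cluster; the principal, keytab path, and data path are hypothetical placeholders.

```java
import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosAccessSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell the Hadoop client to use Kerberos rather than simple auth
        // (normally set in core-site.xml on a secured cluster).
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Authenticate as a specific tenant user; principal and keytab
        // are hypothetical placeholders.
        UserGroupInformation ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
                "analyst@EXAMPLE.COM", "/etc/security/analyst.keytab");

        // Every call inside doAs() carries the authenticated identity, so the
        // NameNode can enforce per-user permissions and audit the access.
        ugi.doAs((PrivilegedExceptionAction<Void>) () -> {
            FileSystem fs = FileSystem.get(conf);
            for (FileStatus status : fs.listStatus(new Path("/data/customers"))) {
                System.out.println(status.getPath());
            }
            return null;
        });
    }
}
```

Because the authenticated identity travels with every request, this same mechanism feeds the audit trail that answers “who did what and when.”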
There are also operational efficiencies and cost savings. For example, virtualizing and sharing infrastructure means an opportunity to reduce costs by consolidating server and storage hardware. Additional cost savings can result from improving resource utilization on virtualized infrastructure. And with this approach, IT professionals can be freed up from the mundane, tactical activities associated with maintaining multiple islands of Hadoop clusters; there’s a potential to repurpose this staff to higher-value activities.
With all these advantages, secure multi-tenancy for Hadoop is no longer just a “nice to have”. The status quo of most Hadoop implementations – with islands of Hadoop clusters – should become a thing of the past. Secure multi-tenancy should become a “must have” for organizations implementing Hadoop today. If not, they put their Big Data initiative at risk.