In modern data architectures, the need to manage and query vast datasets efficiently, consistently, and accurately is paramount. For organizations that deal with big data processing, managing metadata becomes a critical concern. This is where Hive Metastore (HMS) can serve as a central metadata store, playing a crucial role in these modern data architectures.
HMS is a central repository of metadata for Apache Hive tables and other data lake table formats (for example, Apache Iceberg), providing clients (such as Apache Hive, Apache Spark, and Trino) access to this information using the Metastore Service API. Over time, HMS has become a foundational component for data lakes, integrating with a diverse ecosystem of open source and proprietary tools.
In non-containerized environments, there was typically only one approach to implementing HMS—running it as a service in an Apache Hadoop cluster. With the advent of containerization in data lakes through technologies such as Docker and Kubernetes, multiple options for implementing HMS have emerged. These options offer greater flexibility, allowing organizations to tailor HMS deployment to their specific needs and infrastructure.
In this post, we will explore the architecture patterns and demonstrate their implementation using Amazon EMR on EKS with Spark Operator job submission type, guiding you through the complexities to help you choose the best approach for your use case.
Solution overview
Prior to Hive 3.0, HMS was tightly integrated with Hive and other Hadoop ecosystem components. Hive 3.0 introduced a Standalone Hive Metastore. This new version of HMS functions as an independent service, decoupled from other Hive and Hadoop components such as HiveServer2. This separation enables various applications, such as Apache Spark, to interact directly with HMS without requiring a full Hive and Hadoop environment installation. You can learn more about other components of Apache Hive on the Design page.
In this post, we will use a Standalone Hive Metastore to illustrate the architecture and implementation details of various design patterns. Any reference to HMS refers to a Standalone Hive Metastore.
The HMS broadly consists of two main components:
- Backend database: The database is a persistent data store that holds all the metadata, such as table schemas, partitions, and data locations.
- Metastore service API: The Metastore service API is a stateless service that manages the core functionality of the HMS. It handles read and write operations to the backend database.
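To make the split between these two components concrete, the following is a minimal, hedged sketch of a Kubernetes ConfigMap carrying the metastore-site.xml that points the Metastore service at its backend database. The ConfigMap name, namespace, database endpoint, and warehouse path are placeholder assumptions rather than values from this post's codebase, and the database credentials (omitted here) would typically come from a Kubernetes Secret.

```yaml
# Illustrative only: a ConfigMap carrying the metastore-site.xml that the
# stateless Metastore service uses to reach its backend database.
# The name, namespace, database endpoint, and warehouse path are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: hive-metastore-config      # hypothetical name
  namespace: hive-metastore        # hypothetical namespace
data:
  metastore-site.xml: |
    <configuration>
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:postgresql://<RDS_ENDPOINT>:5432/hivemetastore</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>org.postgresql.Driver</value>
      </property>
      <property>
        <name>metastore.thrift.port</name>
        <value>9083</value>
      </property>
      <property>
        <name>metastore.warehouse.dir</name>
        <value>s3a://<BUCKET_NAME>/warehouse/</value>
      </property>
    </configuration>
```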
Containerization and Kubernetes offer various architecture and implementation options for HMS, including running it:
- As a sidecar container within the data processing framework's pod
- As a dedicated deployment within the same data processing EKS cluster
- As an external service in a separate EKS cluster
In this post, we’ll use Apache Spark as the data processing framework to demonstrate these three architectural patterns. However, these patterns aren’t limited to Spark and can be applied to any data processing framework, such as Hive or Trino, that relies on HMS for managing metadata and accessing catalog information.
Note that in a Spark application, the driver is responsible for querying the metastore to fetch table schemas and locations, then distributes this information to the executors. Executors process the data using the locations provided by the driver, never needing to query the metastore directly. Hence, in the three patterns described in the following sections, only the driver communicates with the HMS, not the executors.
HMS as sidecar container
In this pattern, HMS runs as a sidecar container within the same pod as the data processing framework, such as Apache Spark. This approach uses Kubernetes multi-container pod functionality, allowing both HMS and the data processing framework to operate together in the same pod. The following figure illustrates this architecture, where the HMS container is part of the Spark driver pod.
This pattern is suited for small-scale deployments where simplicity is the priority. Because HMS is co-located with the Spark driver, it reduces network overhead and provides a straightforward setup. However, in this approach HMS operates exclusively within the scope of the parent application and isn't accessible by other applications. Additionally, row conflicts might arise when multiple jobs attempt to insert data into the same table at the same time. To address this, make sure that no two jobs write to the same table simultaneously.
Consider this approach if you prefer a basic architecture. It’s ideal for organizations where a single team manages both the data processing framework (for example, Apache Spark) and HMS, and there’s no need for other applications to use HMS.
Cluster dedicated HMS
In this pattern, HMS runs in multiple pods managed through a Kubernetes deployment, typically within a dedicated namespace in the same data processing EKS cluster. The following figure illustrates this setup, with HMS decoupled from Spark driver pods and other workloads.
This pattern works well for medium-scale deployments where moderate isolation is enough, and compute and data needs can be handled within a few clusters. It provides a balance between resource efficiency and isolation, making it ideal for use cases where scaling metadata services independently is important, but complete decoupling isn’t necessary. Additionally, this pattern works well when a single team manages both the data processing frameworks and HMS, ensuring streamlined operations and alignment with organizational responsibilities.
By decoupling HMS from Spark driver pods, it can serve multiple clients, such as Apache Spark and Trino, while sharing cluster resources. However, this approach might lead to resource contention during periods of high demand, which can be mitigated by enforcing tenant isolation on HMS pods.
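As a rough sketch of this pattern (the names, image, and replica count are assumptions, not the exact manifests used by this post's setup scripts), HMS can run as a Kubernetes Deployment in its own namespace, fronted by a ClusterIP Service on the Thrift port:

```yaml
# Illustrative only: HMS as a Deployment in a dedicated namespace, reachable
# inside the cluster through a ClusterIP Service on the Thrift port 9083.
# Image, replica count, and names are placeholders; database configuration
# (for example, the ConfigMap shown earlier) is omitted for brevity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hive-metastore
  namespace: hive-metastore
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hive-metastore
  template:
    metadata:
      labels:
        app: hive-metastore
    spec:
      containers:
        - name: hive-metastore
          image: <ACCOUNT_ID>.dkr.ecr.<AWS_REGION>.amazonaws.com/hive-metastore:latest  # placeholder image
          ports:
            - containerPort: 9083
---
apiVersion: v1
kind: Service
metadata:
  name: hive-metastore
  namespace: hive-metastore
spec:
  type: ClusterIP
  selector:
    app: hive-metastore
  ports:
    - port: 9083
      targetPort: 9083
```

Spark jobs in the same cluster then point their metastore URI at the Service DNS name, as shown later in the implementation section.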
External HMS
In this architecture pattern, HMS runs in its own EKS cluster as a Kubernetes deployment and is exposed as a Kubernetes Service using the AWS Load Balancer Controller. The following figure illustrates this setup, where HMS is configured as an external service, separate from the data processing clusters.
This pattern suits scenarios where you want a centralized metastore service shared across multiple data processing clusters. HMS allows different data teams to manage their own data processing clusters while relying on the shared metastore for metadata management. By deploying HMS in a dedicated EKS cluster, this pattern provides maximum isolation, independent scaling, and the flexibility to be operated and managed as its own independent service.
While this approach offers clear separation of concerns and the ability to scale independently, it also introduces higher operational complexity and potentially increased costs because of the need to manage an additional cluster. Consider this pattern if you have strict compliance requirements, need to ensure complete isolation for metadata services, or want to provide a unified metadata catalog service for multiple data teams. It works well in organizations where different teams manage their own data processing frameworks and rely on a shared metadata store for data processing needs. Additionally, the separation enables specialized teams to focus on their respective areas.
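The following is a hedged sketch of how the HMS Service on hivemetastore-cluster might be exposed to other clusters using the AWS Load Balancer Controller; the annotations provision an internal Network Load Balancer, and the names are placeholders rather than the manifests generated by this post's scripts:

```yaml
# Illustrative only: expose the HMS pods on hivemetastore-cluster through an
# internal Network Load Balancer provisioned by the AWS Load Balancer
# Controller, so that other EKS clusters can reach the Thrift port 9083.
apiVersion: v1
kind: Service
metadata:
  name: hive-metastore
  namespace: hive-metastore
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internal
spec:
  type: LoadBalancer
  selector:
    app: hive-metastore
  ports:
    - port: 9083
      targetPort: 9083
```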
Deploy the solution
In the remainder of this post, you will explore the implementation details for each of the three architecture patterns, using EMR on EKS with Spark Operator job submission type as an example to demonstrate their implementation. Note that this implementation hasn’t been tested with other EMR on EKS Spark job submission types. You will begin by deploying the common components that serve as the foundation for all the architecture patterns. Next, you’ll deploy the components specific to each pattern. Finally, you’ll execute Spark jobs to connect to the HMS implementation unique to each pattern and verify the successful execution and retrieval of data and metadata.
To streamline the setup process, we’ve automated the deployment of common infrastructure components so you can focus on the essential aspects of each HMS architecture. We’ll provide detailed information to help you understand each step, simplifying the setup while preserving the learning experience.
Scenario
To showcase the patterns, you will create three clusters:
- Two EMR on EKS clusters: analytics-cluster and datascience-cluster
- An EKS cluster: hivemetastore-cluster

Both analytics-cluster and datascience-cluster serve as data processing clusters that run Spark workloads, while hivemetastore-cluster hosts the HMS.
You will use analytics-cluster to illustrate the HMS as sidecar and cluster dedicated HMS patterns. You will use all three clusters to demonstrate the external HMS pattern.
Source code
You can find the codebase in the AWS Samples GitHub repository.
Prerequisites
Before you deploy this solution, make sure that the following prerequisites are in place:
Set up common infrastructure
Begin by setting up the infrastructure components that are common to all three architectures.
- Clone the repository to your local machine and set the two environment variables. Replace <AWS_REGION> with the AWS Region where you want to deploy these resources.
- Execute the following script to create the shared infrastructure.
- To verify successful infrastructure deployment, navigate to the AWS Management Console for AWS CloudFormation, select your stack, and check the Events, Resources, and Outputs tabs for completion status, details, and list of resources created.
You have completed the setup of the common components that serve as the foundation for all architectures. You will now deploy the components specific to each architecture and execute Apache Spark jobs to validate the successful implementation.
HMS in a sidecar container
To implement HMS using the sidecar container pattern, the Spark application requires setting both the sidecar container and the catalog properties in the job configuration file.
- Execute the following script to configure the analytics-cluster for the sidecar pattern. For this post, we stored the HMS database credentials in a Kubernetes Secret object. We recommend using the Kubernetes External Secrets Operator to fetch HMS database credentials from AWS Secrets Manager.
- Review the Spark job manifest file spark-hms-sidecar-job.yaml. This file was created by substituting variables in the spark-hms-sidecar-job.tpl template in the previous step. The following samples highlight key sections of the manifest file.
Spark job configuration
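The actual job configuration is generated from the repository's template; the excerpt below is only an illustrative approximation of how the sidecar wiring typically looks with the Spark Operator, with image names, the application file path, service account, and resource sizes as placeholder assumptions:

```yaml
# Illustrative excerpt only: a SparkApplication with an HMS sidecar attached
# to the driver pod. Image names, resource sizes, and the application file
# path are placeholders, not the values generated from the template.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-hms-sidecar-job
spec:
  type: Python
  mode: cluster
  image: <EMR_ON_EKS_SPARK_IMAGE>                                # placeholder
  mainApplicationFile: s3://<BUCKET_NAME>/scripts/sample-job.py  # placeholder
  sparkConf:
    # The driver reaches HMS over localhost because the sidecar shares its pod.
    "spark.hadoop.hive.metastore.uris": "thrift://localhost:9083"
    "spark.sql.catalogImplementation": "hive"
  driver:
    cores: 1
    memory: 2g
    serviceAccount: <SPARK_SERVICE_ACCOUNT>    # placeholder
    sidecars:
      - name: hive-metastore                   # HMS sidecar container
        image: <HMS_IMAGE>                     # placeholder
        ports:
          - containerPort: 9083
  executor:
    instances: 2
    cores: 1
    memory: 2g
```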
Submit the Spark job and verify the HMS as sidecar container setup
In this pattern, you will submit Spark jobs in analytics-cluster. The Spark jobs will connect to the HMS service running as a sidecar container in the driver pod.
- Run the Spark job to verify that the setup was successful.
- Describe the sparkapplication object.
- List the pods and observe the number of containers attached to the driver pod. Wait until the status changes from ContainerCreating to Running (should take just a few seconds).
- View the driver logs to validate the output.
- If you encounter the following error, wait for a few minutes and rerun the previous command.
- After successful completion of the job, you should see the following message in the logs. The tabular output successfully validates the setup of HMS as a sidecar container.
Cluster dedicated HMS
To implement the cluster dedicated HMS pattern, the Spark application requires setting the HMS URI and catalog properties in the job configuration file.
- Execute the following script to configure the analytics-cluster for the cluster dedicated pattern.
- Verify the HMS deployment by listing the pods and viewing the logs. The absence of Java exceptions in the logs confirms that the Hive Metastore service is running successfully.
- Review the Spark job manifest file spark-hms-cluster-dedicated-job.yaml. This file is created by substituting variables in the spark-hms-cluster-dedicated-job.tpl template in the previous step. The following sample highlights key sections of the manifest file.
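As an illustrative approximation (the Service and namespace names are assumptions, not the generated values), the key difference from the sidecar manifest is the metastore URI, which now targets the in-cluster HMS Service:

```yaml
# Illustrative excerpt only: with a cluster dedicated HMS, the driver resolves
# the metastore through the in-cluster Service DNS name instead of localhost.
# The Service and namespace names below are placeholders.
sparkConf:
  "spark.hadoop.hive.metastore.uris": "thrift://hive-metastore.hive-metastore.svc.cluster.local:9083"
  "spark.sql.catalogImplementation": "hive"
```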
Submit the Spark job and verify the cluster dedicated HMS setup
In this pattern, you will submit Spark jobs in analytics-cluster. The Spark jobs will connect to the HMS service in the same data processing EKS cluster.
- Submit the job.
- Verify the status.
- Describe the driver pod and observe the number of containers attached to it. Wait until the status changes from ContainerCreating to Running (should take just a few seconds).
- View the driver logs to validate the output.
- After successful completion of the job, you should see the following message in the logs. The tabular output successfully validates the setup of cluster dedicated HMS.
External HMS
To implement the external HMS pattern, the Spark application requires setting the HMS URI to the service endpoint exposed by hivemetastore-cluster.
- Execute the following script to configure hivemetastore-cluster for the external HMS pattern.
- Review the Spark job manifest file spark-hms-external-job.yaml. This file is created by substituting variables in the spark-hms-external-job.tpl template during the setup process. The following sample highlights key sections of the manifest file.
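As an illustrative approximation (the hostname is a placeholder, not the generated value), the metastore URI now points at the endpoint exposed by hivemetastore-cluster:

```yaml
# Illustrative excerpt only: with an external HMS, the driver reaches the
# metastore through the load balancer endpoint exposed by hivemetastore-cluster.
# The hostname below is a placeholder.
sparkConf:
  "spark.hadoop.hive.metastore.uris": "thrift://<HMS_LOAD_BALANCER_DNS>:9083"
  "spark.sql.catalogImplementation": "hive"
```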
Submit the Spark job and verify the HMS in a separate EKS cluster setup
To verify the setup, submit Spark jobs in analytics-cluster and datascience-cluster. The Spark jobs will connect to the HMS service in the hivemetastore-cluster.
Use the following steps for analytics-cluster and then for datascience-cluster to verify that both clusters can connect to the HMS on hivemetastore-cluster.
- Run the Spark job to test the successful setup. Replace <CONTEXT_NAME> with the Kubernetes context for analytics-cluster and then for datascience-cluster.
- Describe the sparkapplication object.
- List the pods and observe the number of containers attached to the driver pod. Wait until the status changes from ContainerCreating to Running (should take just a few seconds).
- View the driver logs to validate the output on the data processing cluster.
- The output should look like the following. The tabular output successfully validates the setup of HMS in a separate EKS cluster.
Clean up
To avoid incurring future charges from the resources created in this tutorial, clean up your environment after you've completed the steps. You can do this by running the cleanup.sh script, which will safely remove all the resources provisioned during the setup.
Conclusion
In this post, we’ve explored the design patterns for implementing the Hive Metastore (HMS) with EMR on EKS with Spark Operator, each offering distinct advantages depending on your requirements. Whether you choose to deploy HMS as a sidecar container within the Apache Spark Driver pod, or as a Kubernetes deployment in the data processing EKS cluster, or as an external HMS service in a separate EKS cluster, the key considerations revolve around communication efficiency, scalability, resource isolation, high availability, and security.
We encourage you to experiment with these patterns in your own setups, adapting them to fit your unique workloads and operational needs. By understanding and applying these design patterns, you can optimize your Hive Metastore deployments for performance, scalability, and security in your EMR on EKS environments. Explore further by deploying the solution in your AWS account and share your experiences and insights with the community.
About the Authors
Avinash Desireddy is a Cloud Infrastructure Architect at AWS, passionate about building secure applications and data platforms. He has extensive experience in Kubernetes, DevOps, and enterprise architecture, helping customers containerize applications, streamline deployments, and optimize cloud-native environments.
Suvojit Dasgupta is a Principal Data Architect at AWS. He leads a team of skilled engineers in designing and building scalable data solutions for AWS customers. He specializes in developing and implementing innovative data architectures to address complex business challenges.