Processing and migrating large data tables from Hive to GCS using Java and Dataproc Serverless

Dataproc is a fully managed cloud service for running Apache Spark workloads on Google Cloud Platform. It creates short-lived clusters on demand instead of requiring us to provision and maintain a cluster for all our jobs.

Dataproc Templates provide a flexible and easy-to-use mechanism for managing and executing common use cases on Dataproc Serverless without having to develop them from scratch.

These templates implement common Spark workloads, letting us customize and run them easily.

Objective

This blog post explains how to process and migrate large volumes of data from Apache Hive to Google Cloud Storage (GCS) using Dataproc Serverless.

Prerequisites

To run these templates, we will need:

  • The Google Cloud SDK installed and authenticated
  • A VPC subnet with Private Google Access enabled. The default subnet is suitable as long as Private Google Access is enabled (a command to enable it is shown after this list). You can review all the Dataproc Serverless networking requirements here.
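
If the chosen subnet does not yet have Private Google Access enabled, it can be switched on with the gcloud CLI; the subnet name and region below are placeholders:

gcloud compute networks subnets update <subnet-name> \
    --region=<region> \
    --enable-private-ip-google-access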

Key Benefits

  • Use Dataproc Serverless to run Spark batch workloads without provisioning and managing your own cluster.
  • The Hive to GCS template is open source, fully customizable, and ready to use for simple jobs.
  • You can ingest data from Hive to GCS in Avro, CSV, Parquet, ORC, and JSON formats.

Configuration Parameters

The following properties are used by the template to configure the execution:

  • spark.sql.warehouse.dir=<warehouse-path> : Path to the Spark SQL Hive warehouse where Spark SQL persists tables.
  • hive.gcs.output.path=<gcs-output-path> : GCS output path where the exported data is written.
  • hive.input.table=<hive-input-table> : Hive input table name.
  • hive.input.db=<hive-input-db> : Hive input database name.
  • hive.gcs.output.format=avro : GCS output file format. This can be avro, csv, parquet, json or orc; the default output format is avro.
  • hive.partition.col=<hive-partition-col> : Column name used to partition the Hive data on write.
  • hive.gcs.save.mode=overwrite : Save mode used when writing to GCS. The default value is overwrite. You can read about how each save mode behaves here.

There are two other optional properties in the "Hive to GCS" template for applying Spark SQL transformations before loading the data into GCS:

--templateProperty hive.gcs.temp.table='temporary_view_name'
--templateProperty hive.gcs.temp.query='select * from global_temp.temporary_view_name'
Note: When using the transformation properties, the name of the Spark temporary view and the table name referenced in the query must match exactly, otherwise the job will fail with a "Table or view not found" error.
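
For example, with a hypothetical view name and columns, a filtering transformation could be expressed as follows; note that temp_view appears identically in both properties:

--templateProperty hive.gcs.temp.table='temp_view' \
--templateProperty hive.gcs.temp.query='select id, name from global_temp.temp_view where id is not null'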

Usage

  1. Create a staging bucket in GCS to store the dependencies required to run the Dataproc Serverless job.
export GCS_STAGING_BUCKET="my-gcs-staging-bucket"
gsutil mb gs://$GCS_STAGING_BUCKET

2. Clone the Dataproc Templates repository and navigate to the Java template folder.

git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
cd dataproc-templates/java

3. Configure the Dataproc Serverless job by exporting the variables needed for submission.

We will use the provided bin/start.sh script and configure the following mandatory environment variables to submit the job to Dataproc Serverless:

  • GCP_PROJECT : GCP project id to run Dataproc Serverless on.
  • REGION : Region to run Dataproc Serverless in.
  • SUBNET : Subnet where the Hive warehouse exists, so that the job is launched in the same subnet.
  • GCS_STAGING_LOCATION : GCS staging location where Dataproc will store staging assets (this should be within the bucket created earlier).
# GCP project id to run the Dataproc Serverless job
export GCP_PROJECT=<gcp-project-id>
# GCP region where the job needs to be submitted
export REGION=<region>
# Subnet where the Hive warehouse exists
export SUBNET=<subnet>
# Staging storage location for Dataproc Serverless (bucket created in step 1)
export GCS_STAGING_LOCATION=<gcs-staging-bucket-folder>
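
For illustration, with purely hypothetical values the configuration could look like this:

export GCP_PROJECT=my-gcp-project
export REGION=us-central1
export SUBNET=projects/my-gcp-project/regions/us-central1/subnetworks/default
export GCS_STAGING_LOCATION=gs://my-gcs-staging-bucket/staging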

4. Execute the Hive to GCS Dataproc template

After configuring the job, we trigger bin/start.sh, specifying the template we want to run along with the parameter values for the execution.

Note: The Dataproc API must be enabled before submitting the job.
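
If the API is not already enabled for the project, it can be turned on with the gcloud CLI (assuming gcloud is authenticated against the target project):

gcloud services enable dataproc.googleapis.com --project=<gcp-project-id>
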
bin/start.sh \
--properties=spark.hadoop.hive.metastore.uris=thrift://<hostname-or-ip>:9083 \
-- --template HIVETOGCS \
--templateProperty hive.input.table=<table> \
--templateProperty hive.input.db=<database> \
--templateProperty hive.gcs.output.path=<gcs-output-path>
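
As a fuller sketch, the optional properties described earlier (output format, save mode, and partition column) can be appended in the same way; all bracketed values are placeholders:

bin/start.sh \
--properties=spark.hadoop.hive.metastore.uris=thrift://<hostname-or-ip>:9083 \
-- --template HIVETOGCS \
--templateProperty hive.input.table=<table> \
--templateProperty hive.input.db=<database> \
--templateProperty hive.gcs.output.path=<gcs-output-path> \
--templateProperty hive.gcs.output.format=csv \
--templateProperty hive.gcs.save.mode=overwrite \
--templateProperty hive.partition.col=<hive-partition-col>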

5. Monitor and view the Spark batch job

After submitting the job, you can monitor the logs and view the metrics in the Dataproc Batches UI.
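
The batch can also be inspected from the command line with the gcloud CLI; the batch id below is a placeholder reported in the submission output:

# List recent Dataproc Serverless batches in the region
gcloud dataproc batches list --region=<region>

# Show the details of a specific batch, including its state
gcloud dataproc batches describe <batch-id> --region=<region>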
