Processing and migrating large data tables from Hive to GCS using Java and Dataproc Serverless

Dataproc is a fully managed cloud service for running Apache Spark workloads on Google Cloud Platform. It creates short-lived clusters on demand instead of requiring us to provision and maintain a cluster for all our jobs.

Dataproc Templates provide a flexible and easy-to-use mechanism for managing and executing common use cases on Dataproc Serverless without having to develop them from scratch.

These templates implement common Spark workloads, letting us customize and run them easily.

Objective

This blog post explains how to process and migrate large volumes of data from Apache Hive to Google Cloud Storage (GCS) using Dataproc Serverless.

Prerequisites

To run these templates, we will need:

  • The Google Cloud SDK installed and authenticated
  • A VPC subnet with Private Google Access enabled. The default subnet is suitable as long as Private Google Access is enabled (a command to enable it is shown after this list). You can review all the Dataproc Serverless networking requirements here.
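
If the chosen subnet does not yet have Private Google Access enabled, it can be switched on with the gcloud CLI; the subnet name and region below are placeholders:

gcloud compute networks subnets update <subnet-name> \
    --region=<region> \
    --enable-private-ip-google-access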

Key Benefits

  • Use Dataproc Serverless to run Spark batch workloads without provisioning and managing your own cluster.
  • The Hive to GCS template is open source, fully customizable, and ready to use for simple jobs.
  • You can ingest data from Hive to GCS in Avro, CSV, Parquet, ORC, and JSON formats.

Configuration Parameters

The following properties are used by the template to configure the execution:

  • spark.sql.warehouse.dir=<warehouse-path> : Path to the Spark SQL Hive warehouse where Spark SQL persists tables.
  • hive.gcs.output.path=<gcs-output-path> : GCS output path where the exported data is written.
  • hive.input.table=<hive-input-table> : Hive input table name.
  • hive.input.db=<hive-input-db> : Hive input database name.
  • hive.gcs.output.format=avro : GCS output file format. This can be avro, csv, parquet, json or orc; the default output format is avro.
  • hive.partition.col=<hive-partition-col> : Column name used to partition the Hive data on write.
  • hive.gcs.save.mode=overwrite : Save mode used when writing to GCS. The default value is overwrite. You can read about how each save mode behaves here.

There are two other optional properties in the "Hive to GCS" template for applying Spark SQL transformations before loading the data into GCS:

--templateProperty hive.gcs.temp.table='temporary_view_name'
--templateProperty hive.gcs.temp.query='select * from global_temp.temporary_view_name'
Note: When using the transformation properties, the name of the Spark temporary view and the table name referenced in the query must match exactly, otherwise the job will fail with a "Table or view not found" error.
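
For example, with a hypothetical view name and columns, a filtering transformation could be expressed as follows; note that temp_view appears identically in both properties:

--templateProperty hive.gcs.temp.table='temp_view' \
--templateProperty hive.gcs.temp.query='select id, name from global_temp.temp_view where id is not null'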

Usage

  1. Create a staging bucket in GCS to store the dependencies required to run the Dataproc Serverless job.
export GCS_STAGING_BUCKET="my-gcs-staging-bucket"
gsutil mb gs://$GCS_STAGING_BUCKET

2. Clone the Dataproc Templates repository and navigate to the Java template folder.

git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
cd dataproc-templates/java

3. Configure the Dataproc Serverless job by exporting the variables needed for submission.

We will use the provided bin/start.sh script and configure the following mandatory environment variables to submit the job to Dataproc Serverless:

  • GCP_PROJECT : GCP project id to run Dataproc Serverless on.
  • REGION : Region to run Dataproc Serverless in.
  • SUBNET : Subnet where the Hive warehouse exists, so that the job is launched in the same subnet.
  • GCS_STAGING_LOCATION : GCS staging location where Dataproc will store staging assets (this should be within the bucket created earlier).
# GCP project id to run the Dataproc Serverless job
export GCP_PROJECT=<gcp-project-id>
# GCP region where the job needs to be submitted
export REGION=<region>
# Subnet where the Hive warehouse exists
export SUBNET=<subnet>
# Staging storage location for Dataproc Serverless (bucket created in step 1)
export GCS_STAGING_LOCATION=<gcs-staging-bucket-folder>
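
For illustration, with purely hypothetical values the configuration could look like this:

export GCP_PROJECT=my-gcp-project
export REGION=us-central1
export SUBNET=projects/my-gcp-project/regions/us-central1/subnetworks/default
export GCS_STAGING_LOCATION=gs://my-gcs-staging-bucket/staging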

4. Execute the Hive to GCS Dataproc template

After configuring the job, we trigger bin/start.sh, specifying the template we want to run along with the parameter values for the execution.

Note: The Dataproc API must be enabled before submitting the job.
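
If the API is not already enabled for the project, it can be turned on with the gcloud CLI (assuming gcloud is authenticated against the target project):

gcloud services enable dataproc.googleapis.com --project=<gcp-project-id>
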
bin/start.sh \
--properties=spark.hadoop.hive.metastore.uris=thrift://<hostname-or-ip>:9083 \
-- --template HIVETOGCS \
--templateProperty hive.input.table=<table> \
--templateProperty hive.input.db=<database> \
--templateProperty hive.gcs.output.path=<gcs-output-path>
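
As a fuller sketch, the optional properties described earlier (output format, save mode, and partition column) can be appended in the same way; all bracketed values are placeholders:

bin/start.sh \
--properties=spark.hadoop.hive.metastore.uris=thrift://<hostname-or-ip>:9083 \
-- --template HIVETOGCS \
--templateProperty hive.input.table=<table> \
--templateProperty hive.input.db=<database> \
--templateProperty hive.gcs.output.path=<gcs-output-path> \
--templateProperty hive.gcs.output.format=csv \
--templateProperty hive.gcs.save.mode=overwrite \
--templateProperty hive.partition.col=<hive-partition-col>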

5. Monitor and view the Spark batch job

After submitting the job, you can monitor the logs and view the metrics in the Dataproc Batches UI.
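
The batch can also be inspected from the command line with the gcloud CLI; the batch id below is a placeholder reported in the submission output:

# List recent Dataproc Serverless batches in the region
gcloud dataproc batches list --region=<region>

# Show the details of a specific batch, including its state
gcloud dataproc batches describe <batch-id> --region=<region>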
