[Java] Fast export large database tables using GCP Serverless Dataproc

Import large tables from any JDBC source (MySQL, PostgreSQL, MSSQL) into BigQuery

Dataproc Serverless is a great addition to the Google Cloud Dataproc platform. It lets users run Spark workloads without provisioning or managing clusters; Dataproc Serverless handles all the required infrastructure behind the scenes.

Dataproc Templates cover common use cases for these kinds of workloads, so we don't have to develop them ourselves. The templates also let us customize and run them quickly.

Introduction

If you need to import/export large tables (hundreds of GBs to TBs) quickly, move the data in multiple threads in parallel, and rely on a robust, proven, open-source and hardened mechanism, this post may help you.

Let's use the JDBCToBigQuery template to export tables in a fast, efficient and multi-threaded fashion.

Requirements

Use any machine with JDK 8+, Maven 3+, Git and the gcloud CLI pre-installed.
Alternatively, use Cloud Shell, which has those tools pre-installed.
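
If you are unsure whether a machine qualifies, a quick sanity check (purely illustrative) looks like this:

java -version    # expect 1.8 or newer
mvn -version     # expect Maven 3.x
git --version
gcloud --version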

Simple Usage

This approach is not multi-threaded, so it works fine for tables smaller than about 1 GB in size.

  1. [Recommended] Clone your active instance, or create a read replica. Pausing writes to the source database is recommended for consistency purposes.
  2. Make sure your database is reachable from the VPC network. If using a public database, make sure to enable Cloud NAT. Please refer to this for further information.
  3. Create a GCS bucket and a staging location for your jar files. Download the JDBC driver jar for the respective source database and the BigQuery connector for Spark, then upload those jar files into the GCS bucket (a sketch of these commands follows step 4 below).
  4. Clone Dataproc Templates git repo:
git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
cd dataproc-templates/java
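
For step 3, a minimal sketch of creating the staging bucket and uploading the jars might look like this (the bucket and paths match the export example further below; adjust names, regions and jar versions to your environment):

gsutil mb -l us-central1 gs://my-gcp-proj
# jars downloaded locally beforehand, e.g. from Maven Central or the vendor's site
gsutil cp mysql-connector-java-8.0.17.jar gs://my-gcp-proj/mysql-export/
gsutil cp spark-bigquery-with-dependencies_2.12-0.23.2.jar gs://my-gcp-proj/bigquery-jar/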

5. Obtain authentication credentials:

gcloud auth application-default login

6. Execute the template; refer to the JDBCToBigQuery documentation for more details. Replace the environment values to match your case (GCP project, region, JDBC URL, jar paths, etc.).

export GCP_PROJECT=my-gcp-proj
export REGION=us-central1
export SUBNET=projects/my-gcp-proj/regions/us-central1/subnetworks/default
export GCS_STAGING_LOCATION=gs://my-gcp-proj/mysql-export/staging
export JARS="gs://my-gcp-proj/mysql-export/mysql-connector-java-8.0.17.jar,gs://my-gcp-proj/bigquery-jar/spark-bigquery-with-dependencies_2.12-0.23.2.jar"

bin/start.sh \
-- --template JDBCTOBIGQUERY \
--templateProperty jdbctobq.bigquery.location=<bq-dataset>.<bq-table> \
--templateProperty jdbctobq.jdbc.url="jdbc:mysql://<hostname>:3306/<dbname>?user=<username>&password=<password>" \
--templateProperty jdbctobq.jdbc.driver.class.name=com.mysql.cj.jdbc.Driver \
--templateProperty jdbctobq.sql="SELECT * FROM MyDB.employee" \
--templateProperty jdbctobq.write.mode=Overwrite \
--templateProperty jdbctobq.temp.gcs.bucket=temp-bucket-name

NOTE: You will be asked to enable the Dataproc API if it is not already enabled.
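
If you prefer to enable the API up front, and to check on the serverless batch afterwards, the usual gcloud commands are (region and batch id are placeholders):

gcloud services enable dataproc.googleapis.com
gcloud dataproc batches list --region=us-central1
gcloud dataproc batches describe <batch-id> --region=us-central1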

Advanced Usage (multi-threaded export/import)

Assuming you have an Employee table schema in a MySQL database as below:

CREATE TABLE `employee` (
`id` bigint(20) unsigned NOT NULL,
`name` varchar(100) NOT NULL,
`email` varchar(100) NOT NULL,
`current_salary` int unsigned DEFAULT NULL,
`account_id` bigint(20) unsigned NOT NULL,
`department` varchar(100) DEFAULT NULL,
`created_at` datetime NOT NULL,
`updated_at` datetime NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

Assume the maximum employee id is 100 million (used for the upperBound parameter).
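
If you don't know the id range up front, a quick query against the source table (a sketch, using the schema above) gives sensible lowerBound/upperBound values:

SELECT MIN(id), MAX(id) FROM MyDB.employee;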

Perform steps 1–5 as described in the previous section.
In step 6, additionally specify the partition properties.

Execute the Spark job along with the partition parameters, for example:

export GCP_PROJECT=my-gcp-proj
export REGION=us-central1
export SUBNET=projects/my-gcp-proj/regions/us-central1/subnetworks/default
export GCS_STAGING_LOCATION=gs://my-gcp-proj/mysql-export/staging
export JARS="gs://my-gcp-proj/mysql-export/mysql-connector-java-8.0.17.jar,gs://my-gcp-proj/bigquery-jar/spark-bigquery-with-dependencies_2.12-0.23.2.jar"

bin/start.sh \
-- --template JDBCTOBIGQUERY \
--templateProperty jdbctobq.bigquery.location=<bq-dataset>.<bq-table> \
--templateProperty jdbctobq.jdbc.url="jdbc:mysql://<hostname>:3306/<dbname>?user=<username>&password=<password>" \
--templateProperty jdbctobq.jdbc.driver.class.name=com.mysql.cj.jdbc.Driver \
--templateProperty jdbctobq.sql="SELECT * FROM MyDB.employee" \
--templateProperty jdbctobq.write.mode=Overwrite \
--templateProperty jdbctobq.temp.gcs.bucket=temp-bucket-name \
--templateProperty jdbctobq.sql.partitionColumn=id \
--templateProperty jdbctobq.sql.lowerBound=0 \
--templateProperty jdbctobq.sql.upperBound=100000000 \
--templateProperty jdbctobq.sql.numPartitions=400
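
For context on what the four partition properties do: Spark's JDBC reader splits the range between lowerBound and upperBound on partitionColumn into numPartitions roughly equal strides and issues one query per partition, so with the values above each task reads about 250,000 ids (the bounds only control partitioning; rows outside the range still land in the first and last partitions). The generated queries look roughly like the following (illustrative, not literal):

SELECT * FROM (SELECT * FROM MyDB.employee) t WHERE id < 250000 OR id IS NULL
SELECT * FROM (SELECT * FROM MyDB.employee) t WHERE id >= 250000 AND id < 500000
...
SELECT * FROM (SELECT * FROM MyDB.employee) t WHERE id >= 99750000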

Other Targets

  1. Other databases
    Spark JDBC natively supports the following databases: MySQL / MariaDB, PostgreSQL, DB2 and Oracle. Using the GCSToJDBC template (blogpost) you can ingest data into any of them (example JDBC URL formats follow this list).
  2. Running JDBCToBigQuery from a Python Environment.
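
For reference, the JDBC URL and driver class pairs for other common sources/targets follow the usual patterns (hosts, ports and database names are placeholders):

jdbc:mysql://<host>:3306/<dbname>                      com.mysql.cj.jdbc.Driver
jdbc:postgresql://<host>:5432/<dbname>                 org.postgresql.Driver
jdbc:sqlserver://<host>:1433;databaseName=<dbname>     com.microsoft.sqlserver.jdbc.SQLServerDriver
jdbc:oracle:thin:@<host>:1521:<sid>                    oracle.jdbc.OracleDriver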

[Java] Fast export large database tables — using GCP Serverless Dataproc was originally published in Google Cloud - Community on Medium, where people are continuing the conversation by highlighting and responding to this story.
