[Java] Fast export large database tables using GCP Serverless Dataproc

Import large tables from any JDBC source (MySQL, PostgreSQL, MSSQL) into BigQuery

Dataproc Serverless is a great addition to the Google Cloud Dataproc platform. It lets users run Spark workloads without provisioning or managing clusters; Dataproc Serverless handles all the required infrastructure behind the scenes.

Dataproc Templates cover common use cases for these kinds of workloads, so we don't have to develop them ourselves. The templates also let us customize and run them quickly.

Introduction

If you need to import/export large tables (hundreds of GBs to TBs) quickly, move the data in multiple threads in parallel, and rely on a robust, proven, open-source and hardened mechanism, this post may help you.

Let's use the JDBCToBigQuery template to export tables in a fast, efficient and multi-threaded fashion.

Requirements

Use any machine with JDK 8+, Maven 3+, Git and the gcloud CLI pre-installed.
Alternatively, use Cloud Shell, which has those tools pre-installed.
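
If you are unsure whether a machine qualifies, a quick sanity check (purely illustrative) looks like this:

java -version    # expect 1.8 or newer
mvn -version     # expect Maven 3.x
git --version
gcloud --version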

Simple Usage

This approach is not multi-threaded, so it works fine for tables smaller than about 1 GB in size.

  1. [Recommended] Clone your active instance, or create a read replica. Pausing writes to the source database is recommended for consistency purposes.
  2. Make sure your database is reachable from the VPC network. If using a public database, make sure to enable Cloud NAT. Please refer to this for further information.
  3. Create a GCS bucket and a staging location for your jar files. Download the JDBC driver jar for the respective source database and the BigQuery connector for Spark, then upload those jar files into the GCS bucket (a sketch of these commands follows step 4 below).
  4. Clone Dataproc Templates git repo:
git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
cd dataproc-templates/java
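
For step 3, a minimal sketch of creating the staging bucket and uploading the jars might look like this (the bucket and paths match the export example further below; adjust names, regions and jar versions to your environment):

gsutil mb -l us-central1 gs://my-gcp-proj
# jars downloaded locally beforehand, e.g. from Maven Central or the vendor's site
gsutil cp mysql-connector-java-8.0.17.jar gs://my-gcp-proj/mysql-export/
gsutil cp spark-bigquery-with-dependencies_2.12-0.23.2.jar gs://my-gcp-proj/bigquery-jar/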

5. Obtain authentication credentials:

gcloud auth application-default login

6. Execute the template; refer to the JDBCToBigQuery documentation for more details. Replace the environment values to match your case (GCP project, region, JDBC URL, jar paths, etc.).

export GCP_PROJECT=my-gcp-proj
export REGION=us-central1
export SUBNET=projects/my-gcp-proj/regions/us-central1/subnetworks/default
export GCS_STAGING_LOCATION=gs://my-gcp-proj/mysql-export/staging
export JARS="gs://my-gcp-proj/mysql-export/mysql-connector-java-8.0.17.jar,gs://my-gcp-proj/bigquery-jar/spark-bigquery-with-dependencies_2.12-0.23.2.jar"

bin/start.sh \
-- --template JDBCTOBIGQUERY \
--templateProperty jdbctobq.bigquery.location=<bq-dataset>.<bq-table> \
--templateProperty jdbctobq.jdbc.url="jdbc:mysql://<hostname>:3306/<dbname>?user=<username>&password=<password>" \
--templateProperty jdbctobq.jdbc.driver.class.name=com.mysql.cj.jdbc.Driver \
--templateProperty jdbctobq.sql="SELECT * FROM MyDB.employee" \
--templateProperty jdbctobq.write.mode=Overwrite \
--templateProperty jdbctobq.temp.gcs.bucket=temp-bucket-name

NOTE: You will be asked to enable the Dataproc API if it is not already enabled.
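
If you prefer to enable the API up front, and to check on the serverless batch afterwards, the usual gcloud commands are (region and batch id are placeholders):

gcloud services enable dataproc.googleapis.com
gcloud dataproc batches list --region=us-central1
gcloud dataproc batches describe <batch-id> --region=us-central1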

Advanced Usage (multi-threaded export/import)

Assuming you have an Employee table schema in a MySQL database as below:

CREATE TABLE `employee` (
`id` bigint(20) unsigned NOT NULL,
`name` varchar(100) NOT NULL,
`email` varchar(100) NOT NULL,
`current_salary` int unsigned DEFAULT NULL,
`account_id` bigint(20) unsigned NOT NULL,
`department` varchar(100) DEFAULT NULL,
`created_at` datetime NOT NULL,
`updated_at` datetime NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

Assume the maximum employee id is 100 million (used for the upperBound parameter).
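
If you don't know the id range up front, a quick query against the source table (a sketch, using the schema above) gives sensible lowerBound/upperBound values:

SELECT MIN(id), MAX(id) FROM MyDB.employee;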

Perform steps 1–5 as described in the previous section.
In step 6, additionally specify the partition properties.

Execute the Spark job along with the partition parameters, for example:

export GCP_PROJECT=my-gcp-proj
export REGION=us-central1
export SUBNET=projects/my-gcp-proj/regions/us-central1/subnetworks/default
export GCS_STAGING_LOCATION=gs://my-gcp-proj/mysql-export/staging
export JARS="gs://my-gcp-proj/mysql-export/mysql-connector-java-8.0.17.jar,gs://my-gcp-proj/bigquery-jar/spark-bigquery-with-dependencies_2.12-0.23.2.jar"

bin/start.sh \
-- --template JDBCTOBIGQUERY \
--templateProperty jdbctobq.bigquery.location=<bq-dataset>.<bq-table> \
--templateProperty jdbctobq.jdbc.url="jdbc:mysql://<hostname>:3306/<dbname>?user=<username>&password=<password>" \
--templateProperty jdbctobq.jdbc.driver.class.name=com.mysql.cj.jdbc.Driver \
--templateProperty jdbctobq.sql="SELECT * FROM MyDB.employee" \
--templateProperty jdbctobq.write.mode=Overwrite \
--templateProperty jdbctobq.temp.gcs.bucket=temp-bucket-name \
--templateProperty jdbctobq.sql.partitionColumn=id \
--templateProperty jdbctobq.sql.lowerBound=0 \
--templateProperty jdbctobq.sql.upperBound=100000000 \
--templateProperty jdbctobq.sql.numPartitions=400
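
For context on what the four partition properties do: Spark's JDBC reader splits the range between lowerBound and upperBound on partitionColumn into numPartitions roughly equal strides and issues one query per partition, so with the values above each task reads about 250,000 ids (the bounds only control partitioning; rows outside the range still land in the first and last partitions). The generated queries look roughly like the following (illustrative, not literal):

SELECT * FROM (SELECT * FROM MyDB.employee) t WHERE id < 250000 OR id IS NULL
SELECT * FROM (SELECT * FROM MyDB.employee) t WHERE id >= 250000 AND id < 500000
...
SELECT * FROM (SELECT * FROM MyDB.employee) t WHERE id >= 99750000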

Other Targets

  1. Other databases
    Spark JDBC natively supports the following databases: MySQL / MariaDB, PostgreSQL, DB2 and Oracle. Using the GCSToJDBC template (blogpost) you can ingest data into any of them (example JDBC URL formats follow this list).
  2. Running JDBCToBigQuery from a Python Environment.
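
For reference, the JDBC URL and driver class pairs for other common sources/targets follow the usual patterns (hosts, ports and database names are placeholders):

jdbc:mysql://<host>:3306/<dbname>                      com.mysql.cj.jdbc.Driver
jdbc:postgresql://<host>:5432/<dbname>                 org.postgresql.Driver
jdbc:sqlserver://<host>:1433;databaseName=<dbname>     com.microsoft.sqlserver.jdbc.SQLServerDriver
jdbc:oracle:thin:@<host>:1521:<sid>                    oracle.jdbc.OracleDriver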

[Java] Fast export large database tables — using GCP Serverless Dataproc was originally published in Google Cloud - Community on Medium, where people are continuing the conversation by highlighting and responding to this story.
