Skip to content

Replicating data from SQL Server to BigQuery by DataFusion

Scenario

Diagram showing direct connection between CloudSQL and Data Fusion is not possible

Therefore, a connection must be made through a GCE Proxy

Architecture diagram showing GCE Proxy connecting CloudSQL and Data Fusion

Image source

Implementation Method

Prerequisites

Enable APIs

API activation link as follows:

Terminal window
1
https://console.cloud.google.com/flows/enableapi?apiid=datafusion.googleapis.com,bigquery.googleapis.com,storage.googleapis.com,sqladmin.googleapis.com

Create VPC

Terminal window
1
export PRJ_ID="PROJECT_ID"
2
export VPC_NAME="NAME"
3
4
gcloud compute networks create ${VPC_NAME} --project=${PRJ_ID} --subnet-mode=custom --mtu=1460 --bgp-routing-mode=regional
5
6
gcloud compute networks subnets create ${VPC_NAME}-us --project=${PRJ_ID} --range=10.1.1.0/24 --stack-type=IPV4_ONLY --network=${VPC_NAME} --region=us-central1 --enable-private-ip-google-access
7
8
gcloud compute --project=${PRJ_ID} firewall-rules create ${VPC_NAME}-allow-rdp --direction=INGRESS --priority=1000 --network=${VPC_NAME} --action=ALLOW --rules=tcp:3389 --source-ranges=0.0.0.0/0
9
10
gcloud compute --project=${PRJ_ID} firewall-rules create ${VPC_NAME}-allow-internal --direction=INGRESS --priority=1000 --network=${VPC_NAME} --action=ALLOW --rules=all --source-ranges=10.0.0.0/8

Create CloudSQL

Create CloudSQL for MSSQL, enable private IP, and then create test data

Create Test Data

  • Import test data
  • Need to use SSMS tool to connect to the MSSQL host; this example uses GCE
  • First, create a Windows Server 2022 host
  • Download and install SSMS tool
  • Enable CDC synchronization
Import Test Data

Download AdventureWorks2017.bak file and import it into the database. In CloudSQL, select import

CloudSQL import option

Click BROWSE to create a new Bucket

Creating a new bucket

Enter Bucket name

Entering bucket name

Select Region

Selecting region for bucket

Use default, click continue

Continuing with default settings

Again, select default

Selecting default settings again

Click create

Creating the bucket

Click confirm

Confirming bucket creation

After uploading to the Bucket, select the bak file

Selecting bak file from bucket

  1. Select the bak file uploaded to the bucket
  2. Enter the database name to be created after import
  3. Import

Importing database from bak file

Set up Authorized network

Setting up authorized network

Verify using SSMS

Enter CloudSQL IP for Server name, both internal and external IP are acceptable Select SQL Server Authentication for Authentication Default Login is sqlserver Password is the one set previously

SSMS connection settings

Check if import was successful

Verifying successful import in SSMS

Set up CDC Synchronization

Refer to Enable change data capture (CDC) for setup

Enable database CDC synchronization feature

1
EXEC msdb.dbo.gcloudsql_cdc_enable_db [DB_NAME]

Expected result

Expected result of enabling CDC

Enable CDC capture job

1
EXEC sys.sp_cdc_start_job @job_type = N'cleanup'

Enabling CDC capture job

Enable table synchronization

1
EXEC sys.sp_cdc_enable_table
2
@source_schema = N'dbo',
3
@source_name = N'age',
4
@role_name = N'CDC'

Verify if enabled successfully

1
EXECUTE sys.sp_cdc_help_change_data_capture
2
@source_schema = N'dbo',
3
@source_name = N'age'

Verifying CDC enablement

Create CloudSQL Proxy VM

Since DataFusion cannot directly connect to CloudSQL, we must create a GCE running CloudSQL Proxy for connection

First, create an Ubuntu 20.04 VM and install MSSQL Client packages following these steps

Install SQL Server command-line tools on Linux - SQL Server | Microsoft Learn

Terminal window
1
curl https://packages.microsoft.com/keys/microsoft.asc | sudo tee /etc/apt/trusted.gpg.d/microsoft.asc
2
curl https://packages.microsoft.com/config/ubuntu/20.04/prod.list | sudo tee /etc/apt/sources.list.d/mssql-release.list
3
sudo apt-get update
4
sudo apt-get install mssql-tools18 unixodbc-dev
5
echo 'export PATH="$PATH:/opt/mssql-tools18/bin"' >> ~/.bashrc
6
source ~/.bashrc

Download CloudSQL Proxy

Terminal window
1
cd /opt
2
curl -o cloud-sql-proxy https://storage.googleapis.com/cloud-sql-connectors/cloud-sql-proxy/v2.7.2/cloud-sql-proxy.linux.amd64
3
chmod +x cloud-sql-proxy

Set up startup, remember to replace CloudSQL Connection name

Terminal window
1
export INSTANCE_CONNECTION_NAME="YOUR_CONNECTION_NAME"
2
echo "/opt/cloud-sql-proxy --address 0.0.0.0 --port 1433 ${INSTANCE_CONNECTION_NAME} --private-ip &" > /opt/start_proxy.sh
3
sed -i '1i #!/bin/sh' /opt/start_proxy.sh
4
chmod +x /opt/start_proxy.sh
5
echo "[Unit]
6
Description=start cloudsql proxy
7
8
[Service]
9
ExecStart=/opt/start_proxy.sh
10
Type=oneshot
11
RemainAfterExit=yes
12
13
[Install]
14
WantedBy=multi-user.target" > /etc/systemd/system/cloudsqlproxy.service
15
systemctl daemon-reload
16
systemctl enable cloudsqlproxy.service

Test using sqlcmd

Terminal window
1
/opt/start_proxy.sh
2
sqlcmd -S tcp:127.0.0.1,1433 -U YOUR_USERNAME -P YOUR_PASSWORD -C

You can use the following command to query the database

1
select DB_NAME()
2
GO

Output example

Example output of database query

Create Cloud Data Fusion instance

Click here to create instance

Creating Cloud Data Fusion instance

Enter a name, then click Create

Entering name for Data Fusion instance

Click Add to edition

Adding to edition

Check Enable Private IP

Enabling Private IP for Data Fusion

Set up VPC Peering

First, copy necessary information from Data Fusion

Copying necessary information from Data Fusion

  1. Copy Project ID
  2. Copy Region
  3. Copy Instance ID

Copying Project ID, Region, and Instance ID

Go to VPC and add PEERING

Adding PEERING in VPC

Create PEERING

  1. Enter name
  2. Select other PROJECT
  3. Paste the ProjectID from the previous step
  4. Paste steps 2+3 from the previous step, in the format “Region”-“InstanceID”
  5. Check Export custom routes
  6. Create

Creating PEERING

After creation, PEERING will appear

PEERING created successfully

Configure instance

Download JDBC version 9.4, extract it, then click View Instance. Using the latest version may result in encryption issues preventing connection

Downloading JDBC driver

Click Continue

Continuing to instance setup

Click Control Center

Accessing Control Center

Click +, then upload Driver

Uploading JDBC driver

Upload the mssql-jdbc-9.4.1.jre8.jar file

Selecting mssql-jdbc-9.4.1.jre8.jar file

Enter sqlserver for Name Enter com.microsoft.sqlserver.jdbc.SQLServerDriver for Class name

Keep Version as default

Configuring JDBC driver details

Start Synchronization

Return to the main menu and click Replication

Selecting Replication from main menu

Create a replication job

Creating replication job

Enter a name

Entering replication job name

  1. Select SQL Server
  2. Enter the IP of the CloudSQL Proxy GCE
  3. Select the driver configured earlier
  4. Enter the name of the database to be synchronized
  5. Enter user credentials

Configuring SQL Server connection details

Click Next

Proceeding to next step

  1. You can set a pre-created Dataset name
  2. Select data location
  3. Next

Configuring BigQuery dataset details

Select the tables to synchronize, then click Next

Selecting tables for synchronization

Click Next

Proceeding to next configuration step

Use default values, then Next

Using default values and proceeding

Use default values, then Next

Continuing with default settings

Click Start

Starting the replication process

Starting will take some time

Replication job starting

Once started, you can click Logs to check the status

Viewing replication logs

Add new data in the SSMS database

Adding data in SSMS database

You should then be able to see the newly added records in BQ

Verifying new records in BigQuery

Troubleshooting

Table Without Primary Key

Tables without a Primary Key cannot synchronize normally. Please ensure that tables have a Primary Key

Error message for table without Primary Key

CloudSQL Proxy VM Unable to Connect to CloudSQL

When testing the connection from CloudSQL Proxy VM to CloudSQL, insufficient permissions were found

Terminal window
1
root@cdc-proxy:/opt# sqlcmd -S tcp:127.0.0.1,1433 -U YOUR_USERNAME -P YOUR_PASSWORD -C
2
3
2023/12/06 03:10:36 [timotion-bigquery:asia-east1:cdc-mssql] accepted connection from 127.0.0.1:56693
4
2023/12/06 03:10:36 [timotion-bigquery:asia-east1:cdc-mssql] failed to connect to instance: failed to get instance: Refresh error: failed to get instance metadata (connection name = "timotion-bigquery:asia-east1:cdc-mssql"): googleapi: Error 403: Request had insufficient authentication scopes.
5
Details:
6
[
7
{
8
"@type": "type.googleapis.com/google.rpc.ErrorInfo",
9
"domain": "googleapis.com",
10
"metadata": {
11
"method": "google.cloud.sql.v1beta4.SqlConnectService.GetConnectSettings",
12
"service": "sqladmin.googleapis.com"
13
},
14
"reason": "ACCESS_TOKEN_SCOPE_INSUFFICIENT"
15
}
16
]
17
More details:
18
Reason: insufficientPermissions, Message: Insufficient Permission
19
Sqlcmd: Error: Microsoft ODBC Driver 18 for SQL Server : Client unable to establish connection because an error was encountered during handshakes before login. Common causes include client attempting to connect to an unsupported version of SQL Server, server too busy to accept new connections or a resource limitation (memory or maximum allowed connections) on the server..
20
Sqlcmd: Error: Microsoft ODBC Driver 18 for SQL Server : TCP Provider: Error code 0x68.
21
Sqlcmd: Error: Microsoft ODBC Driver Server : Client unable to establish connection.
22
Sqlcmd: Error: Microsoft ODBC Driver 18 for SQL Server : Client unable to establish connection due to prelogin failure.

If using the default GCE service account, you’ll need to adjust the access scope. First, click on the VM

Selecting VM for access scope adjustment

Click Edit

Editing VM settings

Enable CloudSQL and save

Enabling CloudSQL access and saving changes

Also, confirm whether the sqladmin.googleapis.com API is enabled. If not, using sqlcmd will result in the following error:

Terminal window
1
Sqlcmd: Error: Microsoft ODBC Driver 18 for SQL Server : Client unable to establish connection because an error was encountered during handshakes before login. Common causes include client attempting to connect to an unsupported version of SQL Server, server too busy to accept new connections or a resource limitation (memory or maximum allowed connections) on the server..
2
Sqlcmd: Error: Microsoft ODBC Driver 18 for SQL Server : TCP Provider: Error code 0x68.
3
Sqlcmd: Error: Microsoft ODBC Driver 18 for SQL Server : Client unable to establish connection.
4
Sqlcmd: Error: Microsoft ODBC Driver 18 for SQL Server : Client unable to establish connection due to prelogin failure.

Click here to enable the API