跳到內容

Replicating data from SQL Server to BigQuery by DataFusion

Scenario

Diagram showing direct connection between CloudSQL and Data Fusion is not possible

因此需透過 GCE 的 Proxy 進行連線

Architecture diagram showing GCE Proxy connecting CloudSQL and Data Fusion

圖片來源

實做方式

前置需求

啟用 API

API 啟用連結如下

Terminal window
1
https://console.cloud.google.com/flows/enableapi?apiid=datafusion.googleapis.com,bigquery.googleapis.com,storage.googleapis.com,sqladmin.googleapis.com

建立 VPC

Terminal window
1
export PRJ_ID="PROJECT_ID"
2
export VPC_NAME="NAME"
3
4
gcloud compute networks create ${VPC_NAME} --project=${PRJ_ID} --subnet-mode=custom --mtu=1460 --bgp-routing-mode=regional
5
6
gcloud compute networks subnets create ${VPC_NAME}-us --project=${PRJ_ID} --range=10.1.1.0/24 --stack-type=IPV4_ONLY --network=${VPC_NAME} --region=us-central1 --enable-private-ip-google-access
7
8
gcloud compute --project=${PRJ_ID} firewall-rules create ${VPC_NAME}-allow-rdp --direction=INGRESS --priority=1000 --network=${VPC_NAME} --action=ALLOW --rules=tcp:3389 --source-ranges=0.0.0.0/0
9
10
gcloud compute --project=${PRJ_ID} firewall-rules create ${VPC_NAME}-allow-internal --direction=INGRESS --priority=1000 --network=${VPC_NAME} --action=ALLOW --rules=all --source-ranges=10.0.0.0/8

建立 CloudSQL

建立 CloudSQL for MSSQL 並啟用 private ip 後,即可建立測試資料

建立測試資料

  • 匯入測試資料
  • 需使用 SSMS 工具進行連線至 MSSQL 主機,本範例使用 GCE 進行
  • 先建立好一台 Windows Server 2022 主機
  • 下載並安裝 SSMS 工具
  • 開啟 CDC 同步
匯入測試資料

下載 AdventureWorks2017.bak 檔後匯入資料庫,於 CloudSQL 終點選 import

CloudSQL import option

點選 BROWSE 建立新的 Bucket

Creating a new bucket

輸入 Bucket 名稱

Entering bucket name

選擇 Region

Selecting region for bucket

使用預設,點選繼續

Continuing with default settings

同樣選擇預設

Selecting default settings again

點選建立

Creating the bucket

點選確認

Confirming bucket creation

上傳到 Bucket 後,選擇 bak 檔案

Selecting bak file from bucket

  1. 選擇上傳至 bucket 的 bak 檔案
  2. 輸入匯入後建立的資料庫名稱
  3. 匯入

Importing database from bak file

設定 Authorized network

Setting up authorized network

使用 SSMS 確認

Server name 輸入 CloudSQL ip,內外部 ip 均可 Authentication 選擇 SQL Server Authentication Login 預設帳號為 sqlserver Password 為之前建立密碼

SSMS connection settings

查看是否成功匯入

Verifying successful import in SSMS

設定 CDC 同步

參考 Enable change data capture (CDC) 進行設定

啟動資料庫 CDC 同步功能

1
EXEC msdb.dbo.gcloudsql_cdc_enable_db [DB_NAME]

預期結果

Expected result of enabling CDC

啟用 CDC capture job

1
EXEC sys.sp_cdc_start_job @job_type = N'cleanup'

Enabling CDC capture job

啟用資料表同步

1
EXEC sys.sp_cdc_enable_table
2
@source_schema = N'dbo',
3
@source_name = N'age',
4
@role_name = N'CDC'

確認是否啟用成功

1
EXECUTE sys.sp_cdc_help_change_data_capture
2
@source_schema = N'dbo',
3
@source_name = N'age'

Verifying CDC enablement

建立 CloudSQL Proxy VM

由於 DataFusion 無法直接連到 CloudSQL,因此必須透過建立一台 GCE 上面跑 CloudSQL Proxy 方式連線

首先建立一台 Ubuntu 20.04 VM 並依照以下流程安裝 MSSQL Client 套件

Install SQL Server command-line tools on Linux - SQL Server | Microsoft Learn

Terminal window
1
curl https://packages.microsoft.com/keys/microsoft.asc | sudo tee /etc/apt/trusted.gpg.d/microsoft.asc
2
curl https://packages.microsoft.com/config/ubuntu/20.04/prod.list | sudo tee /etc/apt/sources.list.d/mssql-release.list
3
sudo apt-get update
4
sudo apt-get install mssql-tools18 unixodbc-dev
5
echo 'export PATH="$PATH:/opt/mssql-tools18/bin"' >> ~/.bashrc
6
source ~/.bashrc

下載 CloudSQL Proxy

Terminal window
1
cd /opt
2
curl -o cloud-sql-proxy https://storage.googleapis.com/cloud-sql-connectors/cloud-sql-proxy/v2.7.2/cloud-sql-proxy.linux.amd64
3
chmod +x cloud-sql-proxy

設定開機啟動,記得置換 CloudSQL Connection name

Terminal window
1
export INSTANCE_CONNECTION_NAME="YOUR_CONNECTION_NAME"
2
echo "/opt/cloud-sql-proxy --address 0.0.0.0 --port 1433 ${INSTANCE_CONNECTION_NAME} --private-ip &" > /opt/start_proxy.sh
3
sed -i '1i #!/bin/sh' /opt/start_proxy.sh
4
chmod +x /opt/start_proxy.sh
5
echo "[Unit]
6
Description=start cloudsql proxy
7
8
[Service]
9
ExecStart=/opt/start_proxy.sh
10
Type=oneshot
11
RemainAfterExit=yes
12
13
[Install]
14
WantedBy=multi-user.target" > /etc/systemd/system/cloudsqlproxy.service
15
systemctl daemon-reload
16
systemctl enable cloudsqlproxy.service

使用 sqlcmd 測試

Terminal window
1
/opt/start_proxy.sh
2
sqlcmd -S tcp:127.0.0.1,1433 -U YOUR_USERNAME -P YOUR_PASSWORD -C

可使用以下指令查詢資料庫

1
select DB_NAME()
2
GO

輸出範例

Example output of database query

建立 Cloud Data Fusion instance

點我建立 instance

Creating Cloud Data Fusion instance

輸入名稱,直接點選 Create 即可

Entering name for Data Fusion instance

點選 Add to edition

Adding to edition

勾選 Enable Private IP

Enabling Private IP for Data Fusion

設定 VPC Peering

先至 Data Fusion 中複製必要資訊

Copying necessary information from Data Fusion

  1. 複製 Project ID
  2. 複製 Region
  3. 複製 Instance ID

Copying Project ID, Region, and Instance ID

至 VPC 中新增 PEERING

Adding PEERING in VPC

建立 PEERING

  1. 輸入名稱
  2. 選擇其他 PROJECT
  3. 貼上上一步的 ProjectID
  4. 貼上上一步步驟 2+3,組成方式為”Region”-“InstanceID”
  5. 勾選 Export custom routes
  6. 建立

Creating PEERING

建立完成後便會出現 PEERING

PEERING created successfully

設定 instance

下載 JDBC 9.4 版本,並解壓縮後,點選 View Instance,如果使用最新版會遇到加密問題而無法連線

Downloading JDBC driver

點選繼續

Continuing to instance setup

點選 Control Center

Accessing Control Center

點選 + 後,再上傳 Driver 上傳

Uploading JDBC driver

上傳 mssql-jdbc-9.4.1.jre8.jar

Selecting mssql-jdbc-9.4.1.jre8.jar file

Name 輸入 sqlserver Class name 輸入 com.microsoft.sqlserver.jdbc.SQLServerDriver

Version 維持預設即可

Configuring JDBC driver details

開始同步

回到主選單,點選 Replication

Selecting Replication from main menu

建立 replication job

Creating replication job

輸入名稱

Entering replication job name

  1. 選擇 SQL Server
  2. 輸入 CloudSQL Proxy GCE 之 ip
  3. 選擇前面設定的 driver
  4. 輸入欲同步的資料庫名稱
  5. 輸入使用者帳密

Configuring SQL Server connection details

點選下一步

Proceeding to next step

  1. 可設定事先建立好的 Dataset 名稱
  2. 選擇資料位置
  3. 下一步

Configuring BigQuery dataset details

選擇要同步的表格後,再點選下一步

Selecting tables for synchronization

點選下一步

Proceeding to next configuration step

使用預設值,下一步

Using default values and proceeding

使用預設值,下一步

Continuing with default settings

點選開始

Starting the replication process

啟動會需要一點時間

Replication job starting

啟動後便可點選 Logs 查看狀況

Viewing replication logs

於 SSMS 資料庫中新增資料

Adding data in SSMS database

便可於 BQ 中查到所新增之紀錄

Verifying new records in BigQuery

疑難排解

資料表無 Primary Key

沒有 Primary Key 會無法正常同步,請留意資料表需有 Primary Key

Error message for table without Primary Key

CloudSQL Proxy VM 無法連線到 CloudSQL

在 CloudSQL Proxy VM 測試連線到 CloudSQL 時發現權限不足

Terminal window
1
root@cdc-proxy:/opt# sqlcmd -S tcp:127.0.0.1,1433 -U YOUR_USERNAME -P YOUR_PASSWORD -C
2
3
2023/12/06 03:10:36 [timotion-bigquery:asia-east1:cdc-mssql] accepted connection from 127.0.0.1:56693
4
2023/12/06 03:10:36 [timotion-bigquery:asia-east1:cdc-mssql] failed to connect to instance: failed to get instance: Refresh error: failed to get instance metadata (connection name = "timotion-bigquery:asia-east1:cdc-mssql"): googleapi: Error 403: Request had insufficient authentication scopes.
5
Details:
6
[
7
{
8
"@type": "type.googleapis.com/google.rpc.ErrorInfo",
9
"domain": "googleapis.com",
10
"metadata": {
11
"method": "google.cloud.sql.v1beta4.SqlConnectService.GetConnectSettings",
12
"service": "sqladmin.googleapis.com"
13
},
14
"reason": "ACCESS_TOKEN_SCOPE_INSUFFICIENT"
15
}
16
]
17
More details:
18
Reason: insufficientPermissions, Message: Insufficient Permission
19
Sqlcmd: Error: Microsoft ODBC Driver 18 for SQL Server : Client unable to establish connection because an error was encountered during handshakes before login. Common causes include client attempting to connect to an unsupported version of SQL Server, server too busy to accept new connections or a resource limitation (memory or maximum allowed connections) on the server..
20
Sqlcmd: Error: Microsoft ODBC Driver 18 for SQL Server : TCP Provider: Error code 0x68.
21
Sqlcmd: Error: Microsoft ODBC Driver Server : Client unable to establish connection.
22
Sqlcmd: Error: Microsoft ODBC Driver 18 for SQL Server : Client unable to establish connection due to prelogin failure.

如果是使用預設的 GCE service account,會需要調整 access scope,先點選 VM

Selecting VM for access scope adjustment

點選編輯

Editing VM settings

開啟 CloudSQL 後儲存

Enabling CloudSQL access and saving changes

並確認是否開啟 sqladmin.googleapis.com 這個 API,如果沒開,使用 sqlcmd 會出現以下錯誤

Terminal window
1
Sqlcmd: Error: Microsoft ODBC Driver 18 for SQL Server : Client unable to establish connection because an error was encountered during handshakes before login. Common causes include client attempting to connect to an unsupported version of SQL Server, server too busy to accept new connections or a resource limitation (memory or maximum allowed connections) on the server..
2
Sqlcmd: Error: Microsoft ODBC Driver 18 for SQL Server : TCP Provider: Error code 0x68.
3
Sqlcmd: Error: Microsoft ODBC Driver 18 for SQL Server : Client unable to establish connection.
4
Sqlcmd: Error: Microsoft ODBC Driver 18 for SQL Server : Client unable to establish connection due to prelogin failure.

點我開啟 API