背景

其它数据中心的 SQLServer 定期备份并将备份文件上传 S3,再还原到 AWS RDS(SQLServer),采用每周末一次全量备份+工作日每天一次差异备份的执行计划。

我使用 pymssql 库访问 AWS RDS,并执行 AWS 提供的 stored procedure 来进行数据库还原,T-SQL 如下,详见 AWS 文档。

exec msdb.dbo.rds_restore_database
	@restore_db_name='database_name',
	@s3_arn_to_restore_from='arn:aws:s3:::bucket_name/file_name.extension',
	@with_norecovery=0|1,
	[@kms_master_key_arn='arn:aws:kms:region:account-id:key/key-id'],
	[@type='DIFFERENTIAL|FULL'];

Importing and exporting SQL Server databases using native backup and restore - Amazon Relational Database Service

创建一个还原 Task 后,轮询 Task 执行状态。

T-SQL:

exec msdb.dbo.rds_task_status
	[@db_name='database_name'],
	[@task_id=ID_number];

代码逻辑如下,这里借助 redo 在数据库连接丢失后进行重试,重新建立数据库连接。

@redo.retriable(attempts=5, sleeptime=60, retry_exceptions=(DisconnectError,))
def wait(self, creds: SqlCreds, delay=60):
    """
    SQL Server also has an "Idle Connection Timeout" setting, which determines how long a connection
    can be idle (not used) before it is closed. This setting is controlled by the "Remote Query Timeout"
    configuration option and is set to 600 seconds (10 minutes) by default.
    """
    conn = creds.connect()
    try:
        while True:
            status = self.status(conn=conn)
            lifecycle = status["lifecycle"]
            database = self.meta["db_name"]
            if lifecycle in (LifeCycle.IN_PROGRESS, LifeCycle.CREATED):
                LOG.info(f"Waiting for {database} to be restored...")
                time.sleep(delay)
            elif lifecycle == LifeCycle.SUCCESS:
                return
            elif lifecycle == LifeCycle.ERROR:
                raise pymssql.OperationalError(
                    f"Error restoring {database}\\t{json.dumps(status, default=str)}"
                )
            else:
                raise ValueError(
                    f"Unknown lifecycle: {lifecycle}\\t{json.dumps(status, default=str)}"
                )
    except pymssql.OperationalError as e:
        LOG.error(f"Error waiting: {e}")
        if creds.is_disconnect(e, conn):
            raise DisconnectError(f"PyMSSQL connection has gone away: {e}")
    finally:
        conn.close()

异常 1

这是 RDS rds_task_status 的执行异常日志:

[2023-02-17 03:28:23.083] Aborted the task because of a task failure or a concurrent RESTORE_DB_DIFFERENTIAL request.
[2023-02-17 03:28:23.643] Task has been aborted
[2023-02-17 03:28:23.910] Task was initiated on server name: EC2AMAZ-7POJKEI, current server name: EC2AMAZ-T85JIN4. Cannot complete the task, please try again.

异常 2

这是 Python 程序的执行异常日志:

2023-02-17 02:02:33,089: ERROR : Error waiting: (20047, b'DB-Lib error message 20047, severity 9:\\nDBPROCESS is dead or not enabled\\n')
2023-02-17 02:04:02,066: INFO : retry: calling wait with args: (FullBackup(id='MARTPH_20230217|2023-02-11|FULL|1|2023-02-17'), <bayer_cdp_common_utils.mssql_handler.SqlCreds object at 0x7ff9144bc4e0>), kwargs: 
{}
, attempt #3
2023-02-17 02:07:17,487: INFO : retry: calling wait with args: (FullBackup(id='MARTPH_20230217|2023-02-11|FULL|1|2023-02-17'), <bayer_cdp_common_utils.mssql_handler.SqlCreds object at 0x7ff9144bc4e0>), kwargs: 
{}
, attempt #4
2023-02-17 02:11:41,495: INFO : retry: calling wait with args: (FullBackup(id='MARTPH_20230217|2023-02-11|FULL|1|2023-02-17'), <bayer_cdp_common_utils.mssql_handler.SqlCreds object at 0x7ff9144bc4e0>), kwargs: 
{}
, attempt #5
2023-02-17 02:12:41,536: INFO : retry: Giving up on wait
Traceback (most recent call last):
  File "/tmp/runscript.py", line 230, in <module>
    raise e_type(e_value).with_traceback(new_stack)
  File "/tmp/glue-python-scripts-l7dpyz1k/ph-cdw-sqlserver-restore.py", line 90, in <module>
  File "/tmp/glue-python-scripts-l7dpyz1k/ph-cdw-sqlserver-restore.py", line 56, in _wait_for_restore
  File "/glue/lib/installation/bayer_cdp_common_utils/rds_restore_handler.py", line 837, in wait_for_restore
    bak.full_backup.wait(creds)
  File "/glue/lib/installation/redo/__init__.py", line 215, in _retriable_wrapper
    return retry(func, args=args, kwargs=kwargs, *retry_args, **retry_kwargs)
  File "/glue/lib/installation/redo/__init__.py", line 170, in retry
    return action(*args, **kwargs)
  File "/glue/lib/installation/bayer_cdp_common_utils/rds_restore_handler.py", line 408, in wait
    conn = creds.connect()
  File "/glue/lib/installation/bayer_cdp_common_utils/mssql_handler.py", line 36, in connect
    autocommit=autocommit,
  File "src/pymssql/_pymssql.pyx", line 653, in pymssql._pymssql.connect
pymssql._pymssql.OperationalError: (20009, b'DB-Lib error message 20009, severity 9:\\nUnable to connect: Adaptive Server is unavailable or does not exist (fusion-q-1.cpmknfgp0x9h.rds.cn-north-1.amazonaws.com.cn)\\nNet-Lib error during Connection timed out (110)\\nDB-Lib error message 20009, severity 9:\\nUnable to connect: Adaptive Server is unavailable or does not exist (fusion-q-1.cpmknfgp0x9h.rds.cn-north-1.amazonaws.com.cn)\\nNet-Lib error during Connection timed out (110)\\n')