Synchronize Amazon SageMaker Unified Studio
Synchronizing SageMaker Unified Studio is the process of integrating metadata from SageMaker Unified Studio and making the data available in Collibra Platform.
You can synchronize manually or automate the process by adding a synchronization schedule.
Prerequisites
In your Collibra environment:
- You have created an AWS connection.
- You have added the SageMaker Unified Studio data catalog capability to the AWS connection.
- You have a resource role with the Configure external system resource permission, for example, Owner.
- You have a global role with the Catalog global permission, for example, Catalog Author.
- You have a global role with the View Edge connections and capabilities global permission, for example, Edge integration engineer.
Steps
To synchronize manually:
- On the main toolbar, navigate to Catalog.
The Catalog homepage opens.
- On the main toolbar, click the Create button.
The Create dialog box appears.
- In the Register with Edge section of the Create dialog box, click Integration configuration.
The Integration configuration page opens.
- In the Connection name column, locate the AWS connection that you used when you added the SageMaker Unified Studio data catalog capability and click the link in the Capabilities column.
The Synchronization page opens.
- In the Synchronization configuration section, click Add configuration.
- Complete the fields as follows:
Field: Updated: <timestamp>
Required: No
Action: Click Updated: <timestamp> next to Synchronization configuration, where timestamp indicates the last time the data was loaded from SageMaker Unified Studio.
The SageMaker Unified Studio domain names and IDs are loaded into the dropdown list of the SageMaker Unified Studio domain IDs field. This can take some time.

Field: Synchronization source
Required: Yes
Action: Select the data source in SageMaker Unified Studio that you want to integrate:
- Amazon Redshift
- Amazon Glue

Field: System
Required: Yes
Action: Select the System asset in which you want to add the SageMaker Unified Studio assets.

Field: Default community
Required: Yes
Action: Select a Collibra community to ingest the metadata. Subdomains per schema and database are automatically created in this community.

Field: AWS regions
Required: Yes
Action: Select the region of the SageMaker Unified Studio assets. If no regions are selected, the integration searches all regions where SageMaker is available.

Field: SageMaker Unified Studio domain IDs
Required: Yes
Action: Enter the SageMaker Unified Studio domains to ingest metadata from. If no domains are selected, the integration ingests metadata from all domains in the account.
Important: Avoid integrating the same data into multiple domains because the SageMaker Unified Studio integration uses unique SageMaker IDs to identify assets. If you integrate the same data into multiple domains, the integration may locate existing assets and overwrite their values, even if those assets belong to a different domain than the one specified in the integration.

Field: JDBC connections
Required: No
Action: If you want to allow sampling, profiling, and classification of assets created via the SageMaker Unified Studio integration, add the JDBC connection information. This field is unavailable when you set the Synchronization source to Amazon Glue.
To do so, complete the following steps:
- Click Add Item.
- In Database full name, enter the Redshift database name in the following format:
redshift.{awsAccountId}.{awsRegion}.[serverless|cluster].[{workgroupName}|{clusterName}]>DbName
Example:
redshift.138268366.eu-west-1.cluster.ay-test-cluster>dev
redshift.138268366.eu-west-1.serverless.ay-redshift-serverless-workgroup>dev
- In JDBC connection, select the JDBC connection that you created for your Redshift database.
- Click Save.
Note: Make sure to add all JDBC connections for the Redshift databases that you want to integrate.
- Click Save.
- Click Synchronize.
A notification indicates the synchronization has started.
If you added JDBC connections, the synchronization automatically creates the required capabilities: Catalog JDBC ingestion, JDBC profiling, Catalog Data Classification, and Catalog JDBC sampling, allowing you to profile, classify, and retrieve sample data for the assets.
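The Database full name format used for JDBC connections is easy to mistype, so it can help to compose and check the value before pasting it into the field. The following is a minimal sketch, not part of any Collibra or AWS library: the helper names `redshift_full_name` and `parse_full_name` are hypothetical, and the character sets allowed by the pattern are assumptions based only on the documented examples.

```python
import re


def redshift_full_name(account_id: str, region: str, deployment: str,
                       name: str, database: str) -> str:
    """Compose a value in the documented format (hypothetical helper).

    `deployment` is "cluster" or "serverless"; `name` is the cluster name
    or the serverless workgroup name, respectively.
    """
    if deployment not in ("cluster", "serverless"):
        raise ValueError("deployment must be 'cluster' or 'serverless'")
    return f"redshift.{account_id}.{region}.{deployment}.{name}>{database}"


# Loose pattern mirroring the documented format. The allowed characters
# for each part are an assumption, not a documented rule.
FULL_NAME_PATTERN = re.compile(
    r"^redshift\.(?P<account>\d+)\.(?P<region>[a-z0-9-]+)\."
    r"(?P<deployment>serverless|cluster)\.(?P<name>[^.>]+)>(?P<database>.+)$"
)


def parse_full_name(value: str) -> dict:
    """Split a full name into its parts, or raise ValueError if malformed."""
    match = FULL_NAME_PATTERN.match(value)
    if match is None:
        raise ValueError(f"not a valid Redshift database full name: {value!r}")
    return match.groupdict()


if __name__ == "__main__":
    print(redshift_full_name("138268366", "eu-west-1", "cluster",
                             "ay-test-cluster", "dev"))
```

Run against the documented examples, the helper reproduces the cluster example string, and the parser splits the serverless example into its account, region, deployment type, workgroup name, and database.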
To synchronize automatically on a schedule:
- On the main toolbar, navigate to Catalog.
The Catalog homepage opens.
- On the main toolbar, click the Create button.
The Create dialog box appears.
- In the Register with Edge section of the Create dialog box, click Integration configuration.
The Integration configuration page opens.
- In the Connection name column, locate the AWS connection that you used when you added the SageMaker Unified Studio data catalog capability and click the link in the Capabilities column.
The Synchronization page opens.
- In the Synchronization configuration section, click Add configuration.
- Complete the fields as follows:
Field: Updated: <timestamp>
Required: No
Action: Click Updated: <timestamp> next to Synchronization configuration, where timestamp indicates the last time the data was loaded from SageMaker Unified Studio.
The SageMaker Unified Studio domain names and IDs are loaded into the dropdown list of the SageMaker Unified Studio domain IDs field. This can take some time.

Field: Synchronization source
Required: Yes
Action: Select the data source in SageMaker Unified Studio that you want to integrate:
- Amazon Redshift
- Amazon Glue

Field: System
Required: Yes
Action: Select the System asset in which you want to add the SageMaker Unified Studio assets.

Field: Default community
Required: Yes
Action: Select a Collibra community to ingest the metadata. Subdomains per schema and database are automatically created in this community.

Field: AWS regions
Required: Yes
Action: Select the region of the SageMaker Unified Studio assets. If no regions are selected, the integration searches all regions where SageMaker is available.

Field: SageMaker Unified Studio domain IDs
Required: Yes
Action: Enter the SageMaker Unified Studio domains to ingest metadata from. If no domains are selected, the integration ingests metadata from all domains in the account.
Important: Avoid integrating the same data into multiple domains because the SageMaker Unified Studio integration uses unique SageMaker IDs to identify assets. If you integrate the same data into multiple domains, the integration may locate existing assets and overwrite their values, even if those assets belong to a different domain than the one specified in the integration.

Field: JDBC connections
Required: No
Action: If you want to allow sampling, profiling, and classification of assets created via the SageMaker Unified Studio integration, add the JDBC connection information. This field is unavailable when you set the Synchronization source to Amazon Glue.
To do so, complete the following steps:
- Click Add Item.
- In Database full name, enter the Redshift database name in the following format:
redshift.{awsAccountId}.{awsRegion}.[serverless|cluster].[{workgroupName}|{clusterName}]>DbName
Example:
redshift.138268366.eu-west-1.cluster.ay-test-cluster>dev
redshift.138268366.eu-west-1.serverless.ay-redshift-serverless-workgroup>dev
- In JDBC connection, select the JDBC connection that you created for your Redshift database.
- Click Save.
Note: Make sure to add all JDBC connections for the Redshift databases that you want to integrate.
- Optionally, in AWS regions, select the region of the SageMaker Unified Studio assets. If no regions are selected, the integration searches all regions where SageMaker is available.
- Click Save.
- In the Synchronization Schedule section, click Add schedule.
- Enter the required information and click Save:
Field: Repeat
Description: The interval at which you want to synchronize automatically. The possible values are: Daily, Weekly, Monthly, and Cron expression.

Field: Cron
Description: The Quartz Cron expression that determines when the synchronization takes place.
This field is only visible if you select Cron expression in the Repeat field.

Field: Every
Description: The day on which you want to synchronize, for example, Sunday.
This field is only visible if you select Weekly in the Repeat field.

Field: Every first
Description: The day of the month on which you want to synchronize, for example, Tuesday.
This field is only visible if you select Monthly in the Repeat field.

Field: At
Description: The time at which you want to synchronize automatically, for example, 14:00.
- You can only schedule on the hour. For example, you can add a synchronization schedule at 8:00, but not at 8:45.
- This field is only visible if you select Daily, Weekly, or Monthly in the Repeat field.

Field: Time zone
Description: The time zone for the schedule.
Tip If you added JDBC connections, the synchronization automatically creates the required capabilities: Catalog JDBC ingestion, JDBC profiling, Catalog Data Classification, and Catalog JDBC sampling, allowing you to profile, classify, and retrieve sample data for the assets.
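If you choose Cron expression in the Repeat field, the value follows Quartz syntax, which orders the fields as seconds, minutes, hours, day of month, month, day of week. The sketch below shows Quartz expressions that roughly correspond to the Daily, Weekly, and Monthly options, all on the hour. The function names are hypothetical, and the mapping to the Repeat options is an illustration only; the product may generate different expressions internally.

```python
# Quartz day-of-week names; Quartz numbers them 1 (SUN) through 7 (SAT).
DAYS = ("SUN", "MON", "TUE", "WED", "THU", "FRI", "SAT")


def _check_hour(hour: int) -> None:
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")


def _check_day(day: str) -> None:
    if day not in DAYS:
        raise ValueError(f"day must be one of {DAYS}")


def daily_cron(hour: int) -> str:
    """Every day at hour:00:00."""
    _check_hour(hour)
    return f"0 0 {hour} * * ?"


def weekly_cron(day: str, hour: int) -> str:
    """Every week on the given day at hour:00:00."""
    _check_day(day)
    _check_hour(hour)
    # '?' in day-of-month is required when day-of-week is specified.
    return f"0 0 {hour} ? * {day}"


def first_weekday_of_month_cron(day: str, hour: int) -> str:
    """First occurrence of the given weekday in each month at hour:00:00."""
    _check_day(day)
    _check_hour(hour)
    # 'N#1' means the first weekday number N of the month.
    return f"0 0 {hour} ? * {DAYS.index(day) + 1}#1"


if __name__ == "__main__":
    print(daily_cron(8))
    print(weekly_cron("SUN", 14))
    print(first_weekday_of_month_cron("TUE", 14))
```

The seconds and minutes fields are fixed at 0 here to mirror the on-the-hour restriction of the At field. Whether the same restriction applies to values entered in the Cron field is not stated in this section.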
What's next
The synchronization job synchronizes the SageMaker Unified Studio data.
After the synchronization:
- You can view a summary of the results from the Activities list.
- The resulting assets get a relation to the Domain that you selected.
For information on the integrated data, go to Synchronized SageMaker Unified Studio data.
- You can profile, classify, and request sample data. For details, go to Steps: Integrate Amazon SageMaker Unified Studio.