Synchronize Amazon SageMaker Unified Studio

Synchronizing SageMaker Unified Studio is the process of integrating metadata from SageMaker Unified Studio and making the data available in Collibra Platform.

You can synchronize manually or automate the process by adding a synchronization schedule.

Prerequisites

In your Collibra environment:

Steps

To synchronize manually:

  1. On the main toolbar, click the Products icon, and then click Catalog.
    The Catalog homepage opens.
  2. On the main toolbar, click the Plus icon.
    The Create dialog box appears.
  3. In the Register with Edge section of the Create dialog box, click Integration configuration.
    The Integration configuration page opens.
  4. In the Connection name column, locate the AWS connection that you used when you added the SageMaker Unified Studio data catalog capability and click the link in the Capabilities column.
    The Synchronization page opens.
  5. In the Synchronization configuration section, click Add configuration.
  6. Complete the fields as follows:
    Field: Updated: <timestamp>
    Action: Click Updated: <timestamp> next to Synchronization configuration, where timestamp indicates the last time when the data was loaded from SageMaker Unified Studio. The SageMaker Unified Studio domain names and IDs are loaded to the dropdown list of the SageMaker Unified Studio domain IDs field. This can take some time.
    Required: No

    Field: Synchronization source
    Action: Select the data source in SageMaker Unified Studio that you want to integrate:
    • Amazon Redshift
    • Amazon Glue
    Required: Yes

    Field: System
    Action: In System, select the System asset in which you want to add the SageMaker Unified Studio assets.
    Required: Yes

    Field: Default community
    Action: Select a Collibra community to ingest the metadata. Subdomains per schema and database will be automatically created in this community.
    Required: Yes

    Field: AWS regions
    Action: Select the region of the SageMaker Unified Studio assets. If no regions are selected, the integration searches all regions where SageMaker is available.
    Required: Yes

    Field: SageMaker Unified Studio domain IDs
    Action: Enter the SageMaker Unified Studio domains to ingest metadata from. If no domains are selected, the integration will ingest metadata from all domains in the account.
    Important: Avoid integrating the same data into multiple domains because the SageMaker Unified Studio integration uses unique SageMaker IDs to identify assets. If you integrate the same data into multiple domains, the integration may locate existing assets and overwrite their values, even if those assets belong to a different domain than the one specified in the integration.
    Required: Yes

    Field: JDBC connections
    Action: If you want to allow sampling, profiling, and classification of assets created via the SageMaker Unified Studio integration, add the JDBC connection information. This field is unavailable when you set the Synchronization source to Amazon Glue. To do so, complete the following steps:
    1. Click Add Item.
    2. In Database full name, enter the Redshift database name in the following format:
      redshift.{awsAccountId}.{awsRegion}.[serverless|cluster].[{workgroupName}|{clusterName}]>DbName
      Example:
      • redshift.138268366.eu-west-1.cluster.ay-test-cluster>dev
      • redshift.138268366.eu-west-1.serverless.ay-redshift-serverless-workgroup>dev
    3. In JDBC connection, select the JDBC connection that you created for your Redshift database.
    4. Click Save.
    Note: Make sure to add all JDBC connections for the Redshift databases that you want to integrate.
    Required: No
  7. Click Save.
  8. Click Synchronize.
    A notification indicates the synchronization has started.
    If you added JDBC connections, the synchronization automatically creates the required capabilities: Catalog JDBC ingestion, JDBC profiling, Catalog Data Classification, and Catalog JDBC sampling, allowing you to profile, classify, and retrieve sample data for the assets.
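
The Database full name format used in the JDBC connections field can be assembled and checked before you paste it into the form. The following is a minimal sketch, not part of the Collibra product: the function names are hypothetical, and the character rules in the pattern (a numeric account ID, a lowercase region, no dots in the workgroup or cluster name) are assumptions inferred from the two examples in this procedure.

```python
import re

# Documented format:
# redshift.{awsAccountId}.{awsRegion}.[serverless|cluster].[{workgroupName}|{clusterName}]>DbName
# Assumption: the character classes below are inferred from the examples only.
_FULL_NAME = re.compile(
    r"redshift\.(?P<account>\d+)\.(?P<region>[a-z0-9-]+)\."
    r"(?P<kind>serverless|cluster)\.(?P<name>[^.>\s]+)>(?P<database>[^>\s]+)"
)


def build_full_name(account_id: str, region: str, kind: str,
                    name: str, database: str) -> str:
    """Assemble a Database full name from its parts (hypothetical helper)."""
    if kind not in ("serverless", "cluster"):
        raise ValueError("kind must be 'serverless' or 'cluster'")
    return f"redshift.{account_id}.{region}.{kind}.{name}>{database}"


def parse_full_name(full_name: str) -> dict:
    """Split a Database full name into its parts, or raise ValueError."""
    match = _FULL_NAME.fullmatch(full_name)
    if match is None:
        raise ValueError(f"not a valid Database full name: {full_name!r}")
    return match.groupdict()


if __name__ == "__main__":
    value = build_full_name("138268366", "eu-west-1", "cluster",
                            "ay-test-cluster", "dev")
    print(value)
    print(parse_full_name(value))
```

For a serverless workgroup, pass "serverless" as the kind and the workgroup name instead of the cluster name.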

To add a synchronization schedule:

  1. On the main toolbar, click the Products icon, and then click Catalog.
    The Catalog homepage opens.
  2. On the main toolbar, click the Plus icon.
    The Create dialog box appears.
  3. In the Register with Edge section of the Create dialog box, click Integration Configuration.
    The Integration Configuration page opens.
  4. In the Connection name column, locate the AWS connection that you used when you added the SageMaker Unified Studio data catalog capability and click the link in the Capabilities column.
    The Synchronization page opens.
  5. In the Synchronization configuration section, click Add configuration.
  6. Complete the fields as follows:
    Field: Updated: <timestamp>
    Action: Click Updated: <timestamp> next to Synchronization configuration, where timestamp indicates the last time when the data was loaded from SageMaker Unified Studio. The SageMaker Unified Studio domain names and IDs are loaded to the dropdown list of the SageMaker Unified Studio domain IDs field. This can take some time.
    Required: No

    Field: Synchronization source
    Action: Select the data source in SageMaker Unified Studio that you want to integrate:
    • Amazon Redshift
    • Amazon Glue
    Required: Yes

    Field: System
    Action: In System, select the System asset in which you want to add the SageMaker Unified Studio assets.
    Required: Yes

    Field: Default community
    Action: Select a Collibra community to ingest the metadata. Subdomains per schema and database will be automatically created in this community.
    Required: Yes

    Field: AWS regions
    Action: Select the region of the SageMaker Unified Studio assets. If no regions are selected, the integration searches all regions where SageMaker is available.
    Required: Yes

    Field: SageMaker Unified Studio domain IDs
    Action: Enter the SageMaker Unified Studio domains to ingest metadata from. If no domains are selected, the integration will ingest metadata from all domains in the account.
    Important: Avoid integrating the same data into multiple domains because the SageMaker Unified Studio integration uses unique SageMaker IDs to identify assets. If you integrate the same data into multiple domains, the integration may locate existing assets and overwrite their values, even if those assets belong to a different domain than the one specified in the integration.
    Required: Yes

    Field: JDBC connections
    Action: If you want to allow sampling, profiling, and classification of assets created via the SageMaker Unified Studio integration, add the JDBC connection information. This field is unavailable when you set the Synchronization source to Amazon Glue. To do so, complete the following steps:
    1. Click Add Item.
    2. In Database full name, enter the Redshift database name in the following format:
      redshift.{awsAccountId}.{awsRegion}.[serverless|cluster].[{workgroupName}|{clusterName}]>DbName
      Example:
      • redshift.138268366.eu-west-1.cluster.ay-test-cluster>dev
      • redshift.138268366.eu-west-1.serverless.ay-redshift-serverless-workgroup>dev
    3. In JDBC connection, select the JDBC connection that you created for your Redshift database.
    4. Click Save.
    Note: Make sure to add all JDBC connections for the Redshift databases that you want to integrate.
    Required: No
  7. Optionally, in AWS Regions, select the region of the SageMaker Unified Studio assets. If no regions are selected, the integration searches all regions where SageMaker is available.
  8. Click Save.
  9. In the Synchronization Schedule section, click Add schedule.
  10. Enter the required information and click Save:
    Field: Repeat
    Description: The interval at which you want to synchronize automatically. The possible values are: Daily, Weekly, Monthly, and Cron expression.

    Field: Cron
    Description: The Quartz Cron expression that determines when the synchronization takes place. This field is only visible if you select Cron expression in the Repeat field.

    Field: Every
    Description: The day of the week on which you want to synchronize, for example, Sunday. This field is only visible if you select Weekly in the Repeat field.

    Field: Every first
    Description: The day of the week on which you want to synchronize each month, for example, Tuesday for the first Tuesday of the month. This field is only visible if you select Monthly in the Repeat field.

    Field: At
    Description: The time at which you want to synchronize automatically, for example, 14:00.
    • You can only schedule on the hour. For example, you can add a synchronization schedule at 8:00, but not at 8:45.
    • This field is only visible if you select Daily, Weekly, or Monthly in the Repeat field.

    Field: Time zone
    Description: The time zone for the schedule.
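
If you select Cron expression in the Repeat field, the value must follow Quartz syntax, which has six required fields: seconds, minutes, hours, day of month, month, and day of week. The following sketch shows Quartz expressions that roughly correspond to the Daily, Weekly, and Monthly options. It is an illustration only: the function name is hypothetical, and the assumption that the built-in options behave exactly like these expressions is not confirmed by this procedure.

```python
from typing import Optional

# Quartz numbers the days of the week from 1 (Sunday) to 7 (Saturday).
QUARTZ_DAYS = {
    "Sunday": 1, "Monday": 2, "Tuesday": 3, "Wednesday": 4,
    "Thursday": 5, "Friday": 6, "Saturday": 7,
}


def quartz_cron(repeat: str, hour: int, day: Optional[str] = None) -> str:
    """Return a Quartz cron expression that fires on the hour.

    Field order: seconds minutes hours day-of-month month day-of-week.
    Quartz requires '?' in either day-of-month or day-of-week.
    """
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if repeat == "Daily":
        return f"0 0 {hour} * * ?"
    if day not in QUARTZ_DAYS:
        raise ValueError("day must be a weekday name such as 'Sunday'")
    if repeat == "Weekly":
        return f"0 0 {hour} ? * {QUARTZ_DAYS[day]}"
    if repeat == "Monthly":
        # '#1' selects the first occurrence of that weekday in the month.
        return f"0 0 {hour} ? * {QUARTZ_DAYS[day]}#1"
    raise ValueError("repeat must be 'Daily', 'Weekly', or 'Monthly'")


if __name__ == "__main__":
    print(quartz_cron("Daily", 14))
    print(quartz_cron("Weekly", 14, "Sunday"))
    print(quartz_cron("Monthly", 8, "Tuesday"))
```

The seconds and minutes fields are fixed at 0 here to mirror the on-the-hour restriction of the At field.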

Tip: If you added JDBC connections, the synchronization automatically creates the required capabilities: Catalog JDBC ingestion, JDBC profiling, Catalog Data Classification, and Catalog JDBC sampling, allowing you to profile, classify, and retrieve sample data for the assets.

What's next

The synchronization job synchronizes the SageMaker Unified Studio data.

After the synchronization: