Sample data limitations and guidelines

  • Sample data via Edge may require additional Edge site memory, CPU and disk space.
  • Currently, you can only request sample data via Edge for Table and Column assets.
  • Sample data for data sources registered via Edge is temporarily cached on the Edge site. The cached sample data is not encrypted, which means that it is available in clear text in the Edge cache for 24 to 48 hours. Only the key that identifies the sample data's origin is encrypted.
  • For performance reasons, the number of samples to display cannot exceed 1,000. You can configure the number in the Maximum number of samples setting, in the Data Profiling section. The default value is 100. The maximum value is 1,000.
    For more information, go to Configure the use of sample data via Edge or Configure the use of sample data via Jobserver.
  • For performance reasons, avoid sampling tables with more than 1,500 columns.
    This limit is currently not configurable.
  • The sampling feature always uses push-down sampling if push-down sampling is available for the data source. Push-down sampling increases the sample data extraction speed.
    We advise that you allow sampling only on data sources that support push-down sampling. To find out whether your data source supports push-down sampling (called partial scan in Edge), go to Data sources supported by Edge or Overview of Collibra-provided JDBC drivers (Jobserver).

    Note: If you try sampling on a data source that does not allow push-down sampling, the sample data extraction time is proportional to the database table size. The bigger the table, the longer it will take to retrieve the samples.
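
The guidelines above can be summarized as a simple pre-flight check. The following Python sketch is illustrative only and is not part of any Collibra API: the SamplingRequest class, its field names and the check_sampling_request function are hypothetical, and the lower bound of 1 for the number of samples is an assumption. The numeric limits (default 100, maximum 1,000 samples, 1,500 columns) are taken from the list above.

```python
from dataclasses import dataclass
from typing import List

# Values taken from the guidelines above.
DEFAULT_MAX_SAMPLES = 100        # default of "Maximum number of samples"
MAX_SAMPLES_LIMIT = 1_000        # highest value the setting accepts
MAX_RECOMMENDED_COLUMNS = 1_500  # avoid sampling wider tables


@dataclass
class SamplingRequest:
    """Hypothetical description of a sample data request (not a Collibra type)."""
    asset_type: str                      # for example "Table" or "Column"
    column_count: int                    # number of columns in the sampled table
    max_samples: int = DEFAULT_MAX_SAMPLES
    supports_push_down: bool = True      # "partial scan" in Edge terminology
    via_edge: bool = True


def check_sampling_request(req: SamplingRequest) -> List[str]:
    """Return a list of guideline violations; an empty list means none were found."""
    issues = []
    if req.via_edge and req.asset_type not in ("Table", "Column"):
        issues.append("Sample data via Edge is only available for Table and Column assets.")
    # Assumption: at least one sample must be requested.
    if not 1 <= req.max_samples <= MAX_SAMPLES_LIMIT:
        issues.append(f"Maximum number of samples must be between 1 and {MAX_SAMPLES_LIMIT}.")
    if req.column_count > MAX_RECOMMENDED_COLUMNS:
        issues.append(f"Avoid sampling tables with more than {MAX_RECOMMENDED_COLUMNS} columns.")
    if not req.supports_push_down:
        issues.append("No push-down sampling: extraction time grows with the table size.")
    return issues


if __name__ == "__main__":
    request = SamplingRequest(asset_type="Table", column_count=2_000, max_samples=5_000)
    for issue in check_sampling_request(request):
        print(issue)
```

A check of this kind only mirrors the documented limits. The actual limits are enforced by the Maximum number of samples setting and by the sampling feature itself.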