Calculation components for Automatic Data Classification

The following components are used to calculate data classes via the Cloud Data Classification Platform or via Edge:

  • Neural network: A machine learning tool that is continuously trained to identify linguistic patterns. Training data has been collected to provide an initial set of patterns.
  • Regex matcher: A wide range of regular expressions that identify matching patterns. When the matches in a column exceed a certain threshold, the result is used in the final calculation of the data class.
  • Dictionary search: Classification based on dictionary lookup. Several data classes have only a limited number of possible values, for example countries, country codes, currencies and days of the week. These values are stored in dictionaries, and the sample data is matched against them.
  • Aggregator: Gathers the responses from the neural network, regex matcher and dictionary search and creates a final response based on underlying algorithms.
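
To make the interplay of these components more concrete, here is a minimal Python sketch of a regex matcher, a dictionary search and an aggregator. It is not the platform's implementation: the patterns, the dictionaries, the 0.8 threshold and the "highest score wins" aggregation rule are all assumptions chosen for illustration, and the neural network is left out.

```python
import re

# Hypothetical patterns and dictionaries, for illustration only.
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "iso_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}
DICTIONARIES = {
    "currency": {"eur", "usd", "gbp", "jpy"},
    "day_of_week": {"monday", "tuesday", "wednesday", "thursday",
                    "friday", "saturday", "sunday"},
}
REGEX_THRESHOLD = 0.8  # assumed value; the real threshold is not documented here


def regex_scores(samples):
    """Return the share of fully matching samples per class, if above the threshold."""
    scores = {}
    for name, pattern in PATTERNS.items():
        share = sum(1 for s in samples if pattern.fullmatch(s)) / len(samples)
        if share > REGEX_THRESHOLD:
            scores[name] = share
    return scores


def dictionary_scores(samples):
    """Return the share of samples found in each dictionary (case-insensitive)."""
    scores = {}
    for name, values in DICTIONARIES.items():
        share = sum(1 for s in samples if s.strip().lower() in values) / len(samples)
        if share > 0:
            scores[name] = share
    return scores


def aggregate(*component_scores):
    """Combine component results; the best score per class wins (assumed rule)."""
    combined = {}
    for scores in component_scores:
        for name, score in scores.items():
            combined[name] = max(combined.get(name, 0.0), score)
    if not combined:
        return None
    best = max(combined, key=combined.get)
    return best, combined[best]


def classify(samples):
    """Run both matchers on the samples and return (data_class, score) or None."""
    if not samples:
        return None
    return aggregate(regex_scores(samples), dictionary_scores(samples))
```

In this sketch, `classify(["EUR", "USD", "usd", "GBP"])` resolves to the hypothetical `currency` class through the dictionary search, while a column in which only four of five values look like email addresses yields no result, because the share of matches does not exceed the assumed threshold.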

The classification process takes the following data into account:

  • Name of the column
  • Number of distinct values
  • Data type
  • Samples
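
As a rough illustration, the four inputs above could be gathered into a single record before classification. The `ColumnProfile` type, the `profile_column` helper and the `max_samples` limit below are hypothetical names used only for this sketch, not part of the product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnProfile:
    """Hypothetical container for the inputs of the classification process."""
    name: str            # name of the column
    distinct_count: int  # number of distinct values
    data_type: str       # data type of the column
    samples: tuple       # a few sample values


def profile_column(name, data_type, values, max_samples=5):
    """Build a ColumnProfile from raw values (max_samples is an assumed limit)."""
    non_null = [v for v in values if v is not None]
    distinct = list(dict.fromkeys(non_null))  # distinct values, original order kept
    return ColumnProfile(name, len(distinct), data_type, tuple(distinct[:max_samples]))
```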

How does retraining work?

Data classification on Edge does not retrain the classification model to improve future classification predictions. However, when you reject a data class, that data class is not suggested again.
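
The Edge behavior amounts to filtering rather than learning. A minimal sketch, assuming rejections are remembered per column (the scope of a rejection is not specified here) and using a hypothetical `RejectionFilter` class:

```python
from collections import defaultdict


class RejectionFilter:
    """Remembers rejected data classes and removes them from later suggestions."""

    def __init__(self):
        # Assumption: rejections are tracked per column name.
        self._rejected = defaultdict(set)

    def reject(self, column, data_class):
        """Record that data_class was rejected for the given column."""
        self._rejected[column].add(data_class)

    def apply(self, column, suggestions):
        """Return the (data_class, confidence) pairs that were not rejected."""
        blocked = self._rejected.get(column, set())
        return [(cls, conf) for cls, conf in suggestions if cls not in blocked]
```

The model's scores stay the same in this sketch; only the list of suggestions shown for the column changes.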

By default, the Cloud Data Classification Platform retrains every day, at a random time during the day. All of its calculations are based on the data samples it receives. Every time you accept a predicted data class, the sample data used to calculate that data class is added to the Cloud Data Classification Platform to improve future data class predictions. See also Feedback on Automatic Data Classification.

Example 
Assume you have a single column, C, containing sample data [a,b,c,d]. You classify this column, and the classification algorithm returns class x with confidence 70%.
If you accept this class, future columns containing the values [a,b,c,d] become slightly more likely to be classified as x. For example, a later column with the same sample data may be classified as x with confidence 71%. Conversely, if you reject this class, a later classification of the same sample data may return a lower confidence, for example 65%.

Note In reality, changes are more subtle, and it takes more than one accepted or rejected data class for them to take effect.
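
The effect of feedback in this example can be illustrated with a toy model. This is not the platform's retraining algorithm: it simply treats the initial confidence as if it were backed by a number of earlier observations (`prior_weight`, an assumed parameter) and blends in the accepted and rejected counts.

```python
def updated_confidence(initial, accepts=0, rejects=0, prior_weight=29):
    """Blend an initial confidence with feedback counts (toy model).

    prior_weight is a hypothetical number of pseudo-observations behind the
    initial confidence; a larger value makes each accept or reject matter less.
    """
    return (initial * prior_weight + accepts) / (prior_weight + accepts + rejects)


after_accept = updated_confidence(0.70, accepts=1)  # about 0.71
after_reject = updated_confidence(0.70, rejects=1)  # about 0.677
```

With the assumed weight of 29, one acceptance moves the confidence from 70% to roughly 71%, as in the example. A single rejection lowers it by a different amount than the 65% quoted above, which is itself only an illustrative figure.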