Security & Privacy
Our Journey to Efficiently and Accurately Detect Personal Information
In today's digital world, most governments and companies rely on cloud services to run their services and store user data. Under these circumstances, there is always a risk that customer information stored in the cloud will leak due to security attacks on these environments. Every enterprise should take measures to handle digital data responsibly and place extra emphasis on data protection. With the growing threat of personal information exposure, we consider all possible circumstances to defend against these threats thoroughly. This blog covers an area of research and development we did around personal information detection. We will share our approach and insights on building technology for personal information detection.
Personal information detection plays an important role in Data Discovery, which serves as a key function for enterprise data protection. Data Discovery deals with the function of finding data in the enterprise environment under the guidance of data policies and standards. Data Discovery helps populate the data inventory within the enterprise.
We researched and developed key technologies that automatically identify and protect personal information during Data Discovery. Our technology reflects some of the challenges we encountered in our environment.
We process and protect various data types in our enterprise environment. To build an accurate understanding of our data landscape, we need ways to capture an accurate snapshot of enterprise data without inspecting every record one by one.
We acknowledge that full data inspection does not scale in time or space, as data constantly evolves in the enterprise. Applying methods that are effective and reasonably accurate is more pragmatic. One could randomly sample data and then run personal information detection on the sample. However, the random sampling approach has several drawbacks. First, each sampling run can produce a different personal information detection outcome, especially if the data contains difficult-to-normalize types, such as names and addresses. Second, when a data type of low frequency happens to be sampled, it can end up being treated as the representative type of the large-scale target data.
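The first drawback can be illustrated with a small simulation. The column contents and sizes below are made up for illustration: a rare type that appears in only 5 of 10,000 rows is present in some random samples and absent from others, so independent runs disagree on whether the type exists at all.

```python
import random

# Hypothetical column: 10,000 values, only 5 of which are a rare PII type.
random.seed(0)
column = ["common"] * 9995 + ["rare_pii"] * 5
random.shuffle(column)

def sample_detects_rare(data, sample_size=100):
    """Return True if a random sample of the column contains the rare type."""
    return "rare_pii" in random.sample(data, sample_size)

# Twenty independent sampling runs over the same column:
outcomes = [sample_detects_rare(column) for _ in range(20)]
print(outcomes.count(True), "of 20 samples saw the rare type")
```

Most runs miss the rare type entirely, while a few detect it, which is exactly the inconsistency described above.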
One way to identify personal information is to build patterns, which are generally expressed as regular expressions and known format types. Some of these patterns are difficult to express using a dictionary. Furthermore, there are cases where it is unclear which personal information type certain data belongs to. For example, it is difficult to distinguish data in the "yyyy-mm-dd" format between birthdays and creation times. In addition, if patterns for certain personal information types, such as IMEI, are inadvertently matched against hashed data, a "False Positive Problem" in detection will occur.
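Both problems can be seen with a minimal sketch. The two patterns below are simplified illustrations, not our production rules: a pure pattern match cannot tell a birthday from a creation timestamp, and any 15-digit run (for instance inside a hash digest) matches the IMEI pattern.

```python
import re

# Illustrative patterns only -- real detectors use far richer rule sets.
PATTERNS = {
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),  # yyyy-mm-dd
    "imei": re.compile(r"^\d{15}$"),             # 15-digit IMEI shape
}

def match_types(value):
    """Return every pattern name the value matches."""
    return [name for name, pat in PATTERNS.items() if pat.match(value)]

# A birthday and a creation timestamp match the same pattern:
print(match_types("1990-04-12"))        # ['date'] -- birthday or created_at?
# A random 15-digit string (e.g. from a hash) also matches "imei":
print(match_types("123456789012345"))   # ['imei'] -- possible false positive
```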
With the rising popularity of NoSQL databases, semi-structured data, rather than primitive types, is widespread. Such data is hard to process collectively by judging an entire column as a single personal information type. In addition, if the table schema is recursive, accurate personal information detection is even more challenging for general queries, increasing the amount of data analysis needed to account for variability.
For data types with variable formats and patterns, it is a challenge to construct regular expressions and consistent patterns. While dictionary-based methods could address some of these challenges, additional measures must be taken for data types with extreme variability (names, addresses, etc.).
Figure 1. The workflow to find representative PII types in large-scale column data
Our approach to addressing scalability starts by examining the partitions and blocks with the largest capacity for each table. We check meta information, such as partitions and tables in the Data Inventory, to reduce the data scale while sampling the data with the broadest coverage. We then group and count the values in each column so that the top N (value, count) pairs can be queried dynamically in descending aggregation order, sampling the high-frequency data. This greatly improves analysis time, since only a fixed N values are analyzed regardless of the table's size, and it allows the maximum analysis time to be calculated before the data is analyzed.
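The per-column aggregation step can be sketched as follows. The column contents are invented for illustration; in practice the same result would come from a dynamically generated `GROUP BY ... ORDER BY count DESC LIMIT N` query against the warehouse.

```python
from collections import Counter

def top_n_pairs(column_values, n=5):
    """Return the n most frequent (value, count) pairs for a column.

    Analyzing only these N pairs bounds the detection cost per column,
    regardless of how many rows the underlying table holds.
    """
    return Counter(column_values).most_common(n)

# Toy column with a skewed value distribution (assumed data):
col = ["KR"] * 500 + ["US"] * 300 + ["JP"] * 150 + ["DE"] * 40 + ["FR"] * 10
print(top_n_pairs(col, n=3))
# [('KR', 500), ('US', 300), ('JP', 150)]
```

Because high-frequency values dominate a column, classifying only the top N pairs usually still identifies the column's representative personal information type.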
Figure 2. Several filters and validation logic to address false positive problems
To raise the certainty of the personal information detected, we built several filters and validation steps using regular expressions and dictionary searches. For example, it is difficult to know whether a piece of data is a date of birth or a creation time using only date-type values, so we apply filters that check for similarity to birthdays using the target data's context, such as the table name, column name, and column description. In addition, we apply statistical filters that combine context information with a personal information risk assessment to remove types with low frequencies, or to remove some values depending on the personal information type in cases of duplicate detections or biased distributions. For personal information types with verifiable sources (IMEIs, Personal Account Numbers, etc.), we resolve numerous false positives by applying validation filters that verify once more that the detected personal information type is correct.
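Two of these filters can be sketched in a few lines. The keyword hints and function names below are illustrative stand-ins, not our actual rules; the IMEI check, however, uses the standard Luhn check digit that real IMEIs carry, which is what lets a validation filter reject random digit runs.

```python
import re

def looks_like_birthdate(value, context):
    """Context filter: a yyyy-mm-dd value counts as a birthday only when
    the column context (table/column name, description) suggests it."""
    if not re.match(r"^\d{4}-\d{2}-\d{2}$", value):
        return False
    hints = ("birth", "dob", "birthday")  # assumed hint list
    return any(h in context.lower() for h in hints)

def valid_imei(value):
    """Validation filter: check the 15th digit of an IMEI via the Luhn
    algorithm to cut false positives from random digit runs."""
    if not re.match(r"^\d{15}$", value):
        return False
    total = 0
    for i, ch in enumerate(value):
        d = int(ch)
        if i % 2 == 1:   # double every second digit from the left
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(looks_like_birthdate("1990-04-12", "user.birth_date"))  # True
print(looks_like_birthdate("1990-04-12", "post.created_at"))  # False
print(valid_imei("490154203237518"))  # True: check digit is consistent
```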
Figure 3. An example of variability processing for unstructured data types
To address structural variability in data types, we analyze and normalize the data schema into a composite format structure, such as a record type, and store it in a data warehouse so that it can be queried in the same manner as an RDB. We found that the most common semi-structured data exists in JSON format, so we focused on analyzing and normalizing JSON. We add an extra step when we analyze the data: we generate a flattened query based on the type of each field in the column's composite structure and perform data aggregation for each data type. This step allows us to better detect personal information by identifying and classifying it in fine-grained sub-attribute units of semi-structured data.
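The flattening idea can be shown with a small sketch. The sample record is invented, and in our pipeline the equivalent flattening happens in generated warehouse queries rather than application code, but the principle is the same: each nested sub-attribute becomes its own addressable path that can be aggregated and classified independently.

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested JSON into dotted-path -> value pairs so each
    sub-attribute can be aggregated and classified on its own."""
    out = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            out.update(flatten(v, f"{prefix}.{k}" if prefix else k))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            out.update(flatten(v, f"{prefix}[{i}]"))
    else:
        out[prefix] = obj
    return out

record = json.loads(
    '{"user": {"name": "Kim", "contact": {"email": "a@b.com"}}, "tags": ["x", "y"]}'
)
print(flatten(record))
# {'user.name': 'Kim', 'user.contact.email': 'a@b.com', 'tags[0]': 'x', 'tags[1]': 'y'}
```

Detection then runs per path, so `user.contact.email` can be classified as an email address even though the column as a whole holds opaque JSON.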
To address semantic variability in data types with extreme levels of variance (names, addresses, etc.), it is difficult to configure a dictionary, and even when a dictionary is used, similar information that is not in it may go undetected. We found that these extreme levels of variance come from multiple sources that are deeply tied to the culture and language of different groups. To reduce these misses and improve detection performance, we used AI to infer new information from existing information.
For example, in the case of addresses, Korea offers a public open API that can be used to build detailed addresses, but in other countries open data alone cannot cover all addresses.
Therefore, we searched for ways to complement insufficient address data and capture the address notation patterns of each country. As a result, we enhanced detection performance by combining the data and applying an RNN model, a representative natural-language learning technique.
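At its core, such a model reads a value character by character and emits class scores. The sketch below is a minimal pure-Python Elman RNN forward pass with random, untrained weights; the vocabulary, hidden size, and two-class head are assumptions for illustration, not our production model, and real training would be done in a deep learning framework.

```python
import math
import random

random.seed(42)
VOCAB = "abcdefghijklmnopqrstuvwxyz -"  # assumed character set
H = 8                                   # assumed hidden size

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

Wxh = rand_matrix(H, len(VOCAB))  # input -> hidden weights
Whh = rand_matrix(H, H)           # hidden -> hidden (recurrent) weights
Who = rand_matrix(2, H)           # hidden -> {address, not-address} logits

def rnn_score(text):
    """Feed characters through the RNN cell; return two class logits."""
    h = [0.0] * H
    for ch in text.lower():
        if ch not in VOCAB:
            continue  # skip characters outside the toy vocabulary
        x = VOCAB.index(ch)
        h = [math.tanh(Wxh[i][x] + sum(Whh[i][j] * h[j] for j in range(H)))
             for i in range(H)]
    return [sum(Who[k][j] * h[j] for j in range(H)) for k in range(2)]

print(rnn_score("gangnam-daero seoul"))  # two logits; training gives them meaning
```

Because the recurrent state summarizes the whole character sequence, a trained model of this shape can score address-like strings that no dictionary entry matches exactly.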
Figure 4. Improved PII detection performance with AI
As we stated earlier, personal information detection plays an important role in Data Discovery. As data moves through its lifecycle, different types of data protection technologies play a role in the enterprise.
* PII: Personally Identifiable Information
Figure 5. Various data protection technologies within the enterprise data lifecycle
There are many research areas we are pursuing to advance technology for data protection. We are actively exploring areas spanning de-identification, privacy-preserving computing systems, and cryptography. Our technical efforts work toward an environment that safely processes data based on its type and purpose of use while harnessing the true value of data for all stakeholders. We continue to look for better ways to ensure both privacy and utility.