An important requirement for data privacy and protection is to find and catalog tables and columns that contain PII or PHI data in a data warehouse. Open source data catalogs like and enable cataloging of information in data warehouses. Moreover, tables and columns can be tagged including PII and type of PII tags. Datahub Amundsen The missing piece is to scan, detect and tag tables and columns with PII. This post describes two strategies to scan and detect PII as well as introduce an open source application that can be used to scan data warehouses. PIICatcher What is PII data? PII or is generally defined as any piece of information that can be used to identify an individual. Traditionally, information such as SSN, mailing, email, or phone numbers is considered PII. As technology has evolved, the scope of PII has increased to include login IDs, IP addresses, geolocation, and biometric data. Personally Identifiable Information There are different types of PII data: : is any data that can be used to directly link to an individual such as name, phone numbers, email, and mailing address. Sensitive : is any data that can be used to indirectly link to an individual such as location and race. Non-Sensitive Specifically, PII, as defined by Compliance laws, are: : PII is any data that can be used to clearly identify an individual. This also includes IP addresses, login ID details, social media posts, digital images, geolocation, and more. GDPR : Personal information is defined as information that identifies, relates to, describes, is reasonably capable of being associated with, or could reasonably be linked, directly or indirectly, with a particular consumer or household. CCPA : HIPAA also defines PII as any type of information that relates directly or indirectly to an individual. HIPAA Beyond the above definition, domains and businesses may have specific PII data collected by them. A simple example is PHI (Personal Health Information) collected by the health industry. Similarly, bank account or crypto-currency wallet IDs can also be used to identify individuals. The following list can be considered as basic or common PII information that all industries need to manage: Phone Email Credit Card Address Person/Name Location Date Gender Nationality IP Address SSN User Name Password Challenges An example record in the table in is: patients Synthetic Patient Records with COVID-19 Column Name Data Id f0f3bc8d-ef38-49ce-a2bd-dfdda982b271 BIRTHDATE 2017-08-24 SSN 999-68-6630 FIRST Jacinto644 LAST Kris249 RACE white ETHNICITY nonhispanic GENDER M BIRTHPLACE Beverly Massachusetts US ADDRESS 888 Hickle Ferry Suite 38 CITY Springfield STATE Massachusetts COUNTY Hampden County ZIP 01106 LAT 42.151961474963535 LON -72.59895940376188 HEALTHCARE_EXPENSES 8446.49 HEALTHCARE_COVERAGE 1499.08 Note that most of the columns store PII data. However, it can be confusing to detect if a column stores PII data and the type of PII data. For example, if the scanner only scans the data in then it may detect it as a phone number. Similarly, or in the GENDER column or in RACE column, do not provide enough context to detect if it is PII and the type of PII data. In both these cases, it is easier to scan the column names. SSN M F white Conversely, the table stores the name of health insurance companies in the column. In this case, the scanner has to check the data to detect that the column does contain PII data. payers NAME NAME not Techniques to scan and detect PII data Based on the previous section, the two main strategies to scan for PII data are: Scan column and table names Scan data stored in columns Scan Data Warehouse Metadata Data engineers use descriptive names for tables and columns to help users understand the data stored in them. Therefore, the names of tables and columns provide clues to the type of data stored. For example, , , or may be used to store the name of a person. first_name last_name full_name name or may be used to store US SSN numbers. ssn social_security or may be used to store phone numbers. phone phone_number All data warehouses provide an information schema to extract schema, table and column information. For example, the following query can be used to get metadata from Snowflake: SELECT lower(c.column_name) AS col_name, c.comment AS col_description, lower(c.data_type) AS col_type, lower(c.ordinal_position) AS col_sort_order, lower(c.table_catalog) AS database, lower({cluster_source}) AS cluster, lower(c.table_schema) AS schema, lower(c.table_name) AS name, t.comment AS description, decode(lower(t.table_type), 'view', 'true', 'false') AS is_view FROM {database}.{schema}.COLUMNS AS c LEFT JOIN {database}.{schema}.TABLES t ON c.TABLE_NAME = t.TABLE_NAME AND c.TABLE_SCHEMA = t.TABLE_SCHEMA Regular expressions can be used to match table or column names. For example, the regular expression below detects a column that stores social security numbers: ^.*(ssn|social).*$ Scan data stored in columns The second strategy is to scan the data stored in columns. Within this strategy the two sub-strategies are: Regular Expressions NLP libraries such as and Stanford NER Detector Spacy The major disadvantage of this strategy is that NLP libraries are compute-intensive. It can be prohibitively expensive to run NLP scanners even on moderately sized tables let alone tables of millions or billions of rows. Therefore, a sample of rows should be scanned. Choosing a random sample is harder than expected. Luckily, a few databases provide built-in functions to choose a random sample. For example, the Snowflake query below provides a random sample: random select {column_list} from {schema_name}.{table_name} TABLESAMPLE BERNOULLI (10 ROWS) Once the rows have been extracted, they can be processed using regular expressions or NLP libraries to detect PII content. Breaking Ties As explained in , both techniques are required to detect PII data. However, both techniques are susceptible to false positives and negatives. More often than not, different techniques suggest conflicting PII types. Detecting the right type is hard and the subject of a future blog post. challenges PIICatcher: Scan data warehouses for PII data implements both the strategies to scan and detect PII data in the data warehouses. PIICatcher Features A data warehouse can be scanned using either strategies. PIICatcher is with a growing set of regular expressions for scanning column names as well as data. It also include . batteries-included Spacy PIICatcher supports incremental scans and will only scan new or not-yet scanned columns. Incremental scans allow easy scheduling of scans. It also provides powerful options to include or exclude schema and tables to manage compute resources. There are ingestion functions for both Datahub and Amundsen which will tag columns and tables with PII and the type of PII tags. Check out for an example of how PIIcatcher is used in production. AWS Glue & Lake Formation Privilege Analyzer Conclusion Column names and data can be scanned to detect PII in databases. Both strategies are required to reliably detect PII data. PIICatcher is an open-source application that implements both of these strategies. It can tag datasets with PII and the type of PII to enable data admins to take more informed decisions on data privacy and security. This article was also published on Tokern.