IBM Security Global Forum

How (And Why) Businesses Should Automate PII Discovery

By Andrej Kovacevic posted Wed September 20, 2023 05:06 PM


These days, businesses of all sizes in every industry collect data, and lots of it. Experts estimate that businesses and consumers around the world will create, capture, and consume a staggering 120 zettabytes of data in 2023 alone. For the businesses dealing with this avalanche of data, there's quite a bit of risk involved. This is especially true whenever personally identifiable information (PII) is involved.

Depending on where a business operates, the collection, storage, and use of PII could fall under a variety of regulatory structures. There's the GDPR in the EU, HIPAA in the US, and the CCPA in California, to name a few. Critically, such regulations apply to all PII, even when businesses aren't aware that they possess it. This makes it mission-critical for all businesses to find ways to detect PII in their data stores and systems so that they may either purge it or protect it as needed.

At the scale at which modern businesses operate, however, doing so can be a gargantuan task unless the business finds ways to leverage automation. Fortunately, there are now multiple automation tools and techniques that help businesses detect, categorize, and protect PII. Here is why this is so important and a few of the most common approaches.

The Danger of Undiscovered PII

Businesses that collect customers' PII have a responsibility to use it according to applicable regulations and to store it safely. However, in the transition to the big data age, plenty of companies found themselves dealing with large volumes of unstructured data coming from legacy systems. And short of having human employees comb through all of that unstructured data to categorize everything, there weren't many ways to catch unknown PII reliably in the past.

That can create a significant risk, like a ticking time bomb, inside a business's data stores if anything slips through undetected. To understand that risk, just listen to Hari Ravichandran, CEO of Aura, a well-known identity privacy and security firm. He says, "Hackers and other cybercriminals specialize in turning even vague personal information into something exploitable. And as long as there's a trail of breadcrumbs that can link an identity theft back to a culpable business that allowed a data theft, they'll face liability."

To avoid the possibility of such liability, businesses can now avail themselves of two main approaches for automating PII detection.

Simple Regex and Basic NLP Matching

One of the simplest ways businesses can integrate PII detection into their data workflows is by using regular expressions and basic natural language processing. This is typically enough to catch obvious and less-than-obvious examples of PII in a given data set. Best of all, there are open-source libraries that provide this functionality, which are adaptable to a variety of data processing systems.
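To make the regex approach concrete, here is a minimal sketch using only Python's standard library. The pattern set, the labels, and the `find_pii` helper are assumptions made for illustration; they are not taken from any of the libraries discussed in this article, and the patterns cover only a few US-style formats.

```python
import re

# Illustrative patterns only: real-world formats vary, and these will
# miss some valid values and flag some harmless ones.
PII_PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def find_pii(text):
    """Return a list of (label, matched_text) pairs found in text."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits


if __name__ == "__main__":
    sample = "Contact jane.doe@example.com or 555-123-4567. SSN 123-45-6789."
    for label, value in find_pii(sample):
        print(label, value)
```

Pattern matching like this is cheap to run over large data sets, but it only finds PII that follows a predictable format. Names, addresses, and free-form identifiers generally need the NLP or AI techniques described elsewhere in this article.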

One of the most popular of these is the Python library Piicatcher. It's adaptable to suit a variety of data environments and has plugins that extend its functionality to scanning database column metadata, too. Critically, it's capable of incremental scans, so it's possible to implement Piicatcher as an ongoing PII scanning solution. That makes it useful both for businesses trying to get a handle on large volumes of unstructured data and for those screening for PII on an ongoing basis.
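The idea behind an incremental scan can be sketched in a few lines. This is not Piicatcher's API or its internal method; the `incremental_scan` function, the hash-based change detection, and the single email pattern are all assumptions chosen to illustrate the general technique of skipping records that have not changed since the last run.

```python
import hashlib
import re

# A single illustrative pattern; a real scanner would check many types.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def incremental_scan(records, seen_hashes):
    """Scan only records that are new or changed since earlier runs.

    records: dict mapping a record id to its text.
    seen_hashes: set of hex digests from earlier runs, updated in place.
    Returns a dict of record id -> list of matches for scanned records.
    """
    findings = {}
    for record_id, text in records.items():
        # Hash the id together with the content so that identical text
        # stored under two different ids is still scanned for both.
        key = f"{record_id}\x00{text}".encode("utf-8")
        digest = hashlib.sha256(key).hexdigest()
        if digest in seen_hashes:
            continue  # unchanged since a previous scan, so skip it
        seen_hashes.add(digest)
        matches = EMAIL.findall(text)
        if matches:
            findings[record_id] = matches
    return findings
```

In practice the set of seen hashes would be persisted between runs (in a file or a database table, for example), so that each scheduled scan only pays for the data that changed.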

Another similar open-source solution is Octopii, which includes self-correcting OCR capabilities. This gives businesses the ability to detect PII in scanned images, a source that other solutions often overlook.

AI-Enabled PII Detection

A more robust PII detection and categorization solution can and should make use of advanced AI for the task. It's a much more adaptable solution with a higher PII identification rate when done well. Plus, most major AI providers have models that can serve as a jumping-off point for a bespoke solution. For example, Microsoft's Azure AI features PII detection, as does Amazon Comprehend.
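As a rough sketch of how a hosted service might slot into a workflow, the example below calls Amazon Comprehend's `detect_pii_entities` operation through `boto3` and then redacts the spans it reports. The `redact` helper, the 0.8 confidence threshold, and the default region are assumptions made for illustration. Running `detect_with_comprehend` requires the `boto3` package and valid AWS credentials.

```python
def redact(text, entities, min_score=0.8):
    """Replace each detected span with its entity type in brackets.

    entities: list of dicts with "Type", "Score", "BeginOffset", and
    "EndOffset" keys, the shape Comprehend returns for PII entities.
    min_score: assumed confidence cutoff; tune it for your own data.
    """
    # Work right to left so earlier offsets stay valid as text changes.
    ordered = sorted(entities, key=lambda e: e["BeginOffset"], reverse=True)
    for ent in ordered:
        if ent["Score"] >= min_score:
            start, end = ent["BeginOffset"], ent["EndOffset"]
            text = text[:start] + f"[{ent['Type']}]" + text[end:]
    return text


def detect_with_comprehend(text, region="us-east-1"):
    """Ask Amazon Comprehend for PII entities in English text."""
    import boto3  # imported lazily so redact() works without AWS set up

    client = boto3.client("comprehend", region_name=region)
    response = client.detect_pii_entities(Text=text, LanguageCode="en")
    return response["Entities"]
```

A typical use would be `redact(text, detect_with_comprehend(text))`. Keeping the redaction step separate from the API call makes it possible to swap in a different detection service, as long as its results are converted to the same offset-based shape.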

Some companies even sell purpose-built AI solutions just for the task of helping businesses detect and deal with PII. One such solution is, whose API features continuous PII detection capabilities. Then, there are also platforms like Private AI, which can detect and flag PII in unstructured text, images, audio, and video data. The bottom line is that there are AI solutions to fit almost every conceivable PII detection need already on the market, simply waiting for businesses to integrate them.

The Takeaway

It's clear that unknown and unprotected PII lurking within a business's data infrastructure could become a major risk if not addressed. The good news is that doing so isn't as difficult as it once was. Businesses have plenty of options for creating a PII detection workflow that fits their needs. And given the potential consequences if they ignore it, business leaders would do well to see if their data teams are already working on a solution for this widespread—if underappreciated—potential problem.