Data Privacy
Jul 6, 2022

Redacting Information from Documents Automatically

Sina Youn
Privacy Tech Lead

Organizations are under increasing pressure to protect customer and business data. In 2021, ransomware attacks rose by 13%, an increase as big as the last five years combined. This is a risky and costly trend, considering that the average total cost of a data breach for U.S. companies is $8.19M.

However, external threats aren’t the only concern for businesses. Since Europe’s General Data Protection Regulation (GDPR) went into effect in May 2018, privacy regulations have become stricter, and their enforcement stronger, with fines in Q3 2021 nearing €1B, 20 times more than in Q1 and Q2 combined.

Consumers are also becoming more concerned about data privacy and are demanding more diligence from businesses that handle their data: in a KPMG survey, 86% of Americans reported that data privacy is a growing concern for them. At the same time, nearly two-thirds of business leaders say their company should do more to strengthen existing data protection measures.

Why does data redaction matter?

Data redaction, often used interchangeably with de-identification, anonymization, or sanitization, is the process by which personal information, and other sensitive or confidential information, is irreversibly removed from data so that a data subject can no longer be identified directly or indirectly. According to global privacy regulations, sensitive and personally identifiable information includes, but is not limited to:

  • Social Security Numbers
  • Professional license numbers and drivers’ license data
  • Protected health information (PHI) and other medical information
  • Financial documents and files
  • Proprietary information or trade secrets
  • Bank account numbers
  • Financial data
  • Medical and psychiatric records
  • Addresses, birth dates, and other unique identifiers

Redaction has proven to be a straightforward approach for complying with privacy regulations and satisfying the highest IT security and privacy standards, while also making data more accessible without compromising privacy (e.g., when sharing data with third parties, publishing data, or retaining data). Various free and commercial tools are available to help users streamline their redaction routine.

Unstructured data complicates redaction

However, redaction becomes much more complicated when it comes to unstructured data: information that doesn’t follow a predefined schema, such as documents, images, and videos. According to multiple analyst estimates, 80% to 90% of all enterprise data is unstructured. Invoices help illustrate the processing complexity that comes with unstructured data, as each business structures and formats its invoices differently. Most invoices contain the same kinds of data, such as company name, billing address, and invoice date. However, relevant details may appear in different locations (e.g., top, bottom, table) or formats (e.g., handwritten) depending on the issuer.

This variability leads to problems when attempting to automate the redaction of invoices (or any other unstructured data type). Legacy solutions, such as RPA-based redaction, struggle to consistently detect and redact information that doesn’t follow a predefined pattern or set of rules. Processing errors and other exceptions are costly and not always accurately identified. Because of this, many companies continue to rely on manual redaction.

Humans are simply more flexible than legacy data processing tools, and more adept at quickly detecting similar information in the absence of clear structure. Although solutions like Adobe Acrobat offer supportive features, such as rule-based redaction or “select and redact,” these tools involve a primarily manual process that is slow, expensive, and (still) error-prone.
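
To make the limitation of rule-based redaction concrete, here is a minimal sketch in Python. The rule set and the sample string are hypothetical and are not taken from any particular product. Fixed patterns catch well-formed identifiers such as Social Security numbers and email addresses, but a person's name follows no pattern a rule can rely on, so it passes through untouched.

```python
import re

# Hypothetical rule set: each rule pairs a label with one fixed pattern.
RULES = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def redact_with_rules(text: str) -> str:
    """Replace every rule match with a same-length run of X characters."""
    for pattern in RULES.values():
        text = pattern.sub(lambda m: "X" * len(m.group()), text)
    return text


sample = "SSN 123-45-6789, contact jane.doe@example.com, signed Jane Doe"
redacted = redact_with_rules(sample)
print(redacted)  # the SSN and email are masked, the name "Jane Doe" is not
```

Catching the name would require either an exhaustive list of names or a model that understands context, which is the gap the following section addresses.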

The AI and unstructured data processing (UDP) difference

Artificial intelligence (AI) offers an easy, cost-effective, and accurate way to overcome the limitations of existing manual and semi-automated redaction solutions. Unstructured data processing (UDP) platforms allow users to leverage different AI models and technologies depending on the task to accurately detect and redact any piece of information from any document type—regardless of formatting or structure.

The benefits of using AI-based solutions built for processing unstructured data include:

  1. Low touch setup: Modern, automated document redaction solutions streamline setup, model selection, training, ongoing maintenance, as well as creating and deploying new AI workers to continuously increase automation rates. Users simply upload their data, and let the system take care of redaction.
  2. AI model agnostic: AI models are constantly evolving and improving, quickly becoming commodified. Rather than investing in proprietary models in an attempt to compete with technology giants like Facebook and Amazon, newer automated redaction solutions allow users to leverage any AI/ML model, or combination of models, to ensure the highest quality results at all times.
  3. Comprehensive human resource management: Humans are critical for the success of AI-based document redaction, yet human resource management features are often an afterthought for developers. Solutions like Document.Redact offer access to crowdsourced workers adept at training AI models for deployment and validating results in production. Features like gamification keep workers engaged, and sophisticated escalation rules make sure a given validation task is completed in time to meet SLAs.
  4. Outcome guarantee: Newer solutions have moved beyond offering just a confidence level for the output of their AI solutions. They instead allow users to define trade-offs between quality, cost, and speed, then automatically allocate resources between AI, humans, and bots to guarantee the outcome. This is particularly important for redaction where accuracy is paramount to maintaining data privacy as well as satisfying legal and regulatory requirements.
  5. Any data type: Automated document redaction solutions built on top of a UDP platform make it possible to process any unstructured data type, including documents, emails, images, video, audio, and more. This helps users avoid purchasing new solutions as their needs evolve.

Meet Document.Redact

Document.Redact is super.AI's no-code solution built specifically for document redaction. It is fully compliant with privacy regulations (e.g., GDPR, CCPA) and industry standards (e.g., SOC, ISO, HIPAA), and lets users irreversibly remove sensitive or confidential information from documents quickly and reliably.

Document.Redact delivers the highest detection accuracy and near 100% anonymization quality at speed and scale. This means users can process and redact personally identifiable information (PII), such as names and social security numbers, from thousands of documents in a matter of seconds.

How does Document.Redact work?

Step 1: Upload documents

To get started, simply import machine-readable or scanned PDF documents via the UI or API.

Step 2: Choose what to redact

Next, select the information that needs redaction. For example, automatically redact sensitive personal information within documents and tables such as:

  • Person names
  • Organization names
  • Street addresses
  • Dates
  • Phone numbers
  • Email addresses
  • Passport / ID card numbers
  • Employer identification numbers
  • Social Security / National Register Numbers
  • Faces
  • License plates
  • Brand logos
  • Text in embedded images
Selecting entities for redaction in Document.Redact.

Step 3: AI-automated detection

Based on user-defined settings, super.AI’s platform then routes the redaction task to the optimal machine learning models for processing. These models run on our Meta AI infrastructure as well as on Google Cloud and AWS.

An example of Document.Redact detecting faces, person names, organization names, and street addresses for redaction.

Different AI models are used to precisely define the location of each piece of relevant information. Then, the outputs (bounding box locations) of multiple detection algorithms are compared, and combined if necessary, to return the best possible result. This assembly line approach to AI is what enables our quality guarantee.
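
As an illustration of comparing and combining bounding boxes from several detectors, the Python sketch below merges boxes that overlap strongly, measured by intersection over union (IoU), into their enclosing box. This is a generic technique shown under stated assumptions and not a description of super.AI's internal logic: the 0.5 threshold, the greedy single-pass grouping, and the function names are all hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def enclosing_box(a: Box, b: Box) -> Box:
    """Smallest box that covers both inputs."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_detections(boxes: List[Box], threshold: float = 0.5) -> List[Box]:
    """Greedy single pass: fold each box into the first kept box it overlaps."""
    merged: List[Box] = []
    for box in boxes:
        for i, kept in enumerate(merged):
            if iou(box, kept) >= threshold:
                merged[i] = enclosing_box(box, kept)
                break
        else:
            merged.append(box)  # no sufficiently overlapping box found
    return merged


detector_a = [(10, 10, 110, 30)]
detector_b = [(12, 8, 108, 32), (200, 50, 260, 70)]
combined = merge_detections(detector_a + detector_b)
print(combined)
```

Taking the enclosing box rather than the intersection is the conservative choice for redaction, since an area that is too large hides slightly more than needed while an area that is too small can leak part of an entity.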

Step 4: Non-reversible anonymization of areas/entities of interest

Information is then processed by a document post-processing engine to obfuscate the entities/areas of interest. Depending on the PDF type, there are two options:

  1. Machine-readable PDFs get redacted on a text level (each character is replaced with an X).
  2. Scanned documents and embedded images are redacted on the pixel level (entities are blurred using a Gaussian filter).

Both methods result in irreversible anonymization, meaning the redaction cannot be reverse-engineered.
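
The two options above can be sketched in plain Python. This is a simplified illustration and not the Document.Redact engine: the function names, the grayscale list-of-rows image format, and the default blur strength (`sigma`) are assumptions made for the example. The first function overwrites characters in given spans with X; the second applies a separable Gaussian filter to a rectangular region, sampling only pixels inside that region.

```python
import math
from typing import List, Sequence, Tuple


def mask_spans(text: str, spans: Sequence[Tuple[int, int]]) -> str:
    """Text-level redaction: overwrite each character in every (start, end) span with X."""
    chars = list(text)
    for start, end in spans:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = "X"
    return "".join(chars)


def gaussian_kernel(sigma: float) -> List[float]:
    """Normalized 1-D Gaussian weights covering about three standard deviations."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(d * d) / (2 * sigma * sigma)) for d in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_region(image: List[List[int]], box: Tuple[int, int, int, int],
                sigma: float = 2.0) -> List[List[int]]:
    """Pixel-level redaction: Gaussian-blur the box (x0, y0, x1, y1), end-exclusive."""
    x0, y0, x1, y1 = box
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    region = [row[x0:x1] for row in image[y0:y1]]
    height, width = len(region), len(region[0])

    def clamp(i: int, size: int) -> int:
        return min(max(i, 0), size - 1)  # repeat edge pixels at the region border

    # Separable filter: one horizontal pass, then one vertical pass.
    horizontal = [
        [sum(k * row[clamp(x + d, width)] for d, k in enumerate(kernel, start=-radius))
         for x in range(width)]
        for row in region
    ]
    blurred = [
        [sum(k * horizontal[clamp(y + d, height)][x] for d, k in enumerate(kernel, start=-radius))
         for x in range(width)]
        for y in range(height)
    ]

    out = [list(row) for row in image]  # copy so the input image is left unchanged
    for dy in range(height):
        for dx in range(width):
            out[y0 + dy][x0 + dx] = round(blurred[dy][dx])
    return out


masked = mask_spans("Name: Jane Doe", [(6, 14)])
print(masked)

checkerboard = [[255 if (x + y) % 2 else 0 for x in range(8)] for y in range(8)]
blurred_image = blur_region(checkerboard, (2, 2, 6, 6), sigma=1.0)
```

In practice the spans and boxes would come from the detection step, and a real pipeline would work on the PDF's text layer and rendered page images rather than on strings and nested lists.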

Left: Example depicting AI-automated redaction of a machine-readable PDF. Right: Redaction of a scanned PDF document.
Left: Redaction of a machine-readable PDF document and embedded image. Right: Redaction of a scanned PDF document.

Step 5: Quality assurance

The super.AI platform enables human-in-the-loop (HITL) post-processing for quality assurance (e.g., enlarging, adding, or deleting bounding boxes).
The system provides an audit copy so users can review findings quickly before redaction, as well as a markup copy to quickly review redaction results for quality assurance. Users can review the data themselves, or add review-as-a-service by super.AI for a high-performing end-to-end solution.
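
One common way to decide which detections a human should look at is a confidence threshold. The Python sketch below illustrates that idea only: the `Detection` fields, the 0.9 threshold, and the function name are hypothetical and do not describe how the super.AI platform actually routes work.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    label: str                       # e.g. "PERSON_NAME" (hypothetical label set)
    box: Tuple[int, int, int, int]   # (x0, y0, x1, y1)
    confidence: float                # model score between 0 and 1


def split_for_review(detections: List[Detection],
                     threshold: float = 0.9) -> Tuple[List[Detection], List[Detection]]:
    """Return (auto_accepted, needs_human_review) based on a confidence threshold."""
    auto = [d for d in detections if d.confidence >= threshold]
    review = [d for d in detections if d.confidence < threshold]
    return auto, review


detections = [
    Detection("PERSON_NAME", (72, 90, 210, 108), 0.98),
    Detection("FACE", (300, 40, 380, 140), 0.71),
]
auto, review = split_for_review(detections)
print(len(auto), "auto-accepted;", len(review), "sent to a reviewer")
```

Raising the threshold sends more detections to reviewers, which trades cost and speed for quality.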

Nationality ("DEUTSCH") and the country of issue (in logo and written text) have been added to the list of objects to redact.

Step 6: Download redacted document(s)

Each processed document is saved as a new document file within the app, ready for download via the UI or API. The original document can be saved or deleted.

Step 7: Extract more value from document(s)

In addition to the redacted document, the output includes metadata (in JSON format). This includes bounding box locations and object annotations, allowing users to leverage powerful analytics and further automate document processing.

This works for machine-readable PDFs, scanned documents, and embedded images. For the latter two, optical character recognition (OCR) is enabled so that character strings are not only detected but also correctly classified (e.g., as names).
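
As a rough idea of how such metadata could feed downstream analytics, the Python sketch below parses a JSON payload and counts redacted entities per label. The schema shown here (`document`, `redactions`, `label`, `page`, `bbox`) is invented for illustration and is not the actual Document.Redact output format.

```python
import json
from collections import Counter

# Hypothetical metadata payload; field names are assumptions, not the real schema.
metadata_json = """
{
  "document": "invoice_001.pdf",
  "redactions": [
    {"label": "PERSON_NAME", "page": 1, "bbox": [72, 90, 210, 108]},
    {"label": "STREET_ADDRESS", "page": 1, "bbox": [72, 112, 260, 130]},
    {"label": "PERSON_NAME", "page": 2, "bbox": [340, 700, 470, 718]}
  ]
}
"""


def count_by_label(raw: str) -> Counter:
    """Count how many redacted entities of each label the document contains."""
    data = json.loads(raw)
    return Counter(item["label"] for item in data["redactions"])


label_counts = count_by_label(metadata_json)
for label, count in label_counts.most_common():
    print(label, count)
```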

Example metadata output from Document.Redact.

Document.Redact at work

Document.Redact has full legal admissibility and is designed for business and technical users alike. It addresses the growing imperative for organizations to manage data privacy compliance effectively and to minimize IT and cybersecurity risks. Data privacy is fast becoming the rule rather than the exception. Applications of Document.Redact include:

  • All businesses in all industries: Fulfillment of Data Subject Requests (DSR) under the Freedom of Information Act (FOIA) and GDPR.
  • Finance and Banking: Redaction of credit card statements, purchase orders, merger and acquisition (M&A) documents, and more.
  • Law Enforcement: Anonymization of court records, land records, legal discoveries, Uniform Commercial Code (UCC) Filings, insurance forms, and more.
  • Medical and Healthcare: HIPAA-compliant de-identification of medical records and patient data in clinical research and clinical trials.

Additional AI-automated redaction resources

For more information about automating redaction with AI, check out the following resources:

Other Tags:
Data Privacy
Document Automation