Organizations are under increasingly high pressure to protect customer and business data. In 2021, ransomware attacks rose by 13%—an increase as big as the last five years combined. This is a risky and costly trend considering an average total data breach cost of $8.19M for U.S. companies.
However, external threats aren’t the only concern for businesses. Since Europe’s General Data Protection Regulation (GDPR) went into effect in May 2018, privacy regulations have become stricter, and their enforcement stronger, with fines in Q3 2021 nearing €1B, 20 times more than in Q1 and Q2 combined.
Consumers are also becoming more concerned about data privacy, and demand more diligence from businesses handling their data, with 86% of Americans reporting that data privacy is a growing concern for them in a KPMB survey. Simultaneously, nearly two thirds of business leaders say their company should do more to strengthen existing data protection measures.
Data redaction, often used interchangeably with de-identification, anonymization, or sanitization, is the process by which personal information, and other sensitive or confidential information, is irreversibly removed from data so that a data subject can no longer be identified directly or indirectly. According to global privacy regulations, sensitive and personally identifiable information includes, but is not limited to:
Redaction has proven to be a straightforward approach for complying with privacy regulations, satisfying the highest IT security and privacy standards, as well as making data more accessible while maintaining privacy (e.g., sharing data with third-parties, data publication, data retention). There are various free and commercial tools available that help users streamline their redaction routine.
However, redaction becomes much more complicated when it comes to unstructured data, which is information that doesn’t follow a predefined schema such as documents, images, and videos. According to multiple analyst estimates, 80% to 90% of all enterprise data is unstructured. Invoices help illustrate the processing complexity that comes with unstructured data, as each business structures and formats their invoices differently. Much of the data in each invoice is the same, such as company name, billing address, and invoice date. However, relevant details may appear in different locations (e.g., top, bottom, table) or formats (e.g., handwritten) depending on the issuer.
This variability leads to problems when attempting to automate invoice (or any unstructured data type) redaction . Legacy solutions, such as RPA-based redaction, struggle to consistently detect and redact information that doesn’t follow a predefined pattern or set of rules. Processing errors and other exceptions are costly and not always accurately identified. Because of this, many companies continue to rely on manual redaction.
Humans are simply more flexible and adept at quickly detecting similar information in the absence of clear structure than legacy data processing tools. Although solutions like Adobe Acrobat offer supportive features, such as rule-based redaction or “select and redact,” these tools involve a primarily manual process that is slow, expensive, and (still) error prone.
Artificial intelligence (AI) offers an easy, cost-effective, and accurate way to overcome the limitations of existing manual and semi-automated redaction solutions. Unstructured data processing (UDP) platforms allow users to leverage different AI models and technologies depending on the task to accurately detect and redact any piece of information from any document type—regardless of formatting or structure.
The benefits of using AI–based solutions built for processing unstructured data include:
Document.Redact is super.AI's no-code solution built specifically for document redaction. Fully compliant with privacy regulations (e.g., GDPR, CCPA) and industry standards (e.g., SOC, ISO, HIPAA), users are able to instantly and irreversibly remove sensitive or confidential information from documents quickly and reliably.
Document.Redact delivers the highest detection accuracy and near 100% anonymization quality at speed and scale. This means users can process and redact personally identifiable information (PII), such as names and social security numbers, from thousands of documents in a matter of seconds.
To get started, simply import machine-readable or scanned PDF documents via the UI or API.
Next, select the information that needs redaction. For example, automatically redact sensitive personal information within documents and tables such as:
Based on user defined settings, super.AI’s platform then routes the redaction task to the optimal machine learning models–running on our Meta AI infrastructure as well as on Google Cloud and AWS–for processing.
Different AI models are used to precisely define the location of each piece of relevant information. Then, the output (bounding box locations) of multiple detection algorithms are compared, and/or combined if necessary, to return the best possible result. This assembly line approach to AI is what enables our quality guarantee.
Information is then processed by a document post-processing engine to obfuscate the entities/areas of interest. Depending on the PDF type, there are two options:
Both methods result in irreversible anonymization, meaning the redaction cannot be reverse-engineered.
The super.AI platform enables human-in-the-loop (HITL) post-processing for quality insurance (e.g., enlarging, adding, or deleting bounding boxes).
The system provides an audit copy so users can review findings quickly before redaction, as well as a mark up copy to quickly review redaction results for quality assurance. Users can review the data themselves, or add review-as-a-service by super.AI for a high-performing end-to-end solution.
Each processed document is saved as a new document file within the app, ready for download via the UI or API. The original document can be saved or deleted
In addition to the redacted document, the output includes metadata (in JSON format). This includes bounding box locations and object annotations, allowing users to leverage powerful analytics and further automate document processing.
This works for machine-readable PDFs, scanned documents, and embedded images. For the latter two, optical character recognition (OCR) is enabled so that not only character strings will be detected, but also correctly classified (e.g. names).
Document.Redact has full legal admissibility and is designed for business users and technical users alike to address the growing imperative for organizations to effectively manage data privacy compliance and minimize IT- and cybersecurity risks. Data privacy is fast becoming the rule rather than the exception. Applications of Document.Redact include:
For more information about automating redaction with AI, check out the following resources: