“Data is the new oil” may be a literal description of what’s fueling the development of autonomous vehicles. AI models for autonomous driving require a lot of real-world data in order to learn, and even more real-world data to validate models. How much data? Oh, just around 4 TB per day, for every car.
Fully road-ready autonomous driving algorithms may not be right around the corner yet, but they’re also not far off, and governments are anticipating this near-future reality by paving the way with new permits and regulations.
To safely get from here to there, cars must learn how to respond to an infinite variety of real-world driving scenarios. AI models learn by ingesting thousands upon thousands of images and videos that depict myriad variations in road conditions and driver behavior. All of this training and validation data doesn’t appear out of thin air; it’s gathered by driving around cities, highways, countryside, and everything in between, and recording what happens around the car.
And that’s where the trouble is. Real-world data contains real people and their private information. With more stringent privacy regulations, in particular GDPR, companies and research teams face significant roadblocks to collecting, storing, using, and sharing the volumes of real-world data they need to develop and validate AI for autonomous driving.
The goal is clear: Ensure people’s privacy is protected. The question is, how do you achieve that goal when working with large real-world data sets?
There are many ways, both organizational and technical, to achieve privacy compliance. Organizations can comply with privacy laws by employing various measures (AKA privacy-preserving technologies): encrypting data and signing data protection agreements (DPAs), pseudonymization or anonymization, and even opting to use synthetic data.
Which approach to data protection should you use? Let’s take a closer look to determine the ideal approach to anonymizing large image datasets at scale.
If this is your data set, GDPR defines your role as data controller. The data controller determines the purposes and means of processing personal data. That means you are responsible for implementing the necessary safeguards to protect personal data, and you could be liable to data subjects for any non-compliance, including violations by external data processors.
So, not to scare you or anything, but there’s a lot riding on your choosing the right approach to ensure data protection compliance. Take into account where your vulnerabilities lie and what your ultimate goal is.
Encrypting data means converting it into an unintelligible code that can only be decoded with a matching decryption key. Encryption is highly effective for transferring data safely, as well as for storing it. However, once decrypted, the data still contains all of its personal/private information.
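To make that round trip concrete, here is a minimal sketch of symmetric encryption using only Python's standard library. The XOR keystream here is a toy illustration of the concept, not production-grade cryptography; a real pipeline would use a vetted scheme such as AES.

```python
import hashlib
from itertools import count

def keystream(key: bytes):
    # Toy keystream: SHA-256 of key + counter. Illustration only, NOT secure.
    for i in count():
        yield from hashlib.sha256(key + i.to_bytes(8, "big")).digest()

def xor_cipher(data: bytes, key: bytes) -> bytes:
    # XOR with the keystream; applying it twice with the same key decrypts.
    return bytes(b ^ k for b, k in zip(data, keystream(key)))

# Hypothetical driving-log record containing PII.
record = b"driver=Jane Doe;plate=B-XY 1234"
key = b"transfer-key"

ciphertext = xor_cipher(record, key)     # safe to transfer: unintelligible without the key
plaintext = xor_cipher(ciphertext, key)  # after decryption, the PII is back, untouched
```

The point of the example is the last line: encryption protects the data in transit and at rest, but the moment it is decrypted for labeling or training, every identifying detail is right back where it started.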
Cross-border transfer of data is a serious issue within GDPR. Consider an example: After recording driving data in the EU, you want to transfer it to India for cost-competitive labeling services. Encrypting the data for transfer is a must, but what happens after it’s decrypted? Your data contains personally identifiable information (PII) of European citizens, but GDPR does not list India as a third country having sufficiently equivalent data protection laws to allow use of the data set there.
Some organizations rely on qualifying their work as scientific research to get permission to use data containing personal information. But does developing Tesla Autopilot qualify as “scientific research”? Organizations can also opt to sign detailed data processing agreements (DPAs) with every single supplier or third party that may touch the data. In all cases, there’s a lot riding on compliance; GDPR fines have increased as much as seven times, year on year.
Homomorphic encryption is a cryptographic approach that attempts to resolve this issue by enabling the manipulation of encrypted data. However, what you can do with homomorphically encrypted data is limited and slow, making it poorly suited for computationally-intensive use cases like training autonomous vehicle algorithms.
Pseudonymization is a slightly slippery term, used with different meanings by nearly everyone. So what is pseudonymization? In GDPR terms, pseudonymization means de-associating the identities of data subjects from the personal data being processed, often by replacing personal identifiers in the data with pseudonyms or randomly generated values.
Pseudonymization does not fully remove identifying information from the data; rather, it reduces the ability to link a dataset back to the original identities of the data subjects. That means that with pseudonymization, the original data set can still be reconstructed.
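In code, the replace-identifiers-with-pseudonyms step might look like the following sketch, which uses a keyed hash (HMAC) so the controller, who holds the secret, can reproduce the mapping, while anyone without it cannot. The field names and secret are hypothetical.

```python
import hashlib
import hmac

# The "additional information" that must be kept separately under GDPR:
# whoever holds this secret can re-link pseudonyms to identifiers.
SECRET = b"controller-held-secret"

def pseudonymize(identifier: str) -> str:
    # Replace a direct identifier with a keyed pseudonym (HMAC-SHA256).
    # The same input always maps to the same pseudonym, so records stay
    # linkable across the dataset, but without SECRET the mapping
    # cannot be recomputed from the pseudonym alone.
    return hmac.new(SECRET, identifier.encode(), hashlib.sha256).hexdigest()[:16]

record = {"plate": "B-XY 1234", "speed_kmh": 87}
record["plate"] = pseudonymize(record["plate"])
```

Note how this captures both properties from the paragraph above: the dataset remains useful (records with the same plate still match), yet the identity is only recoverable by someone holding the secret, which is exactly why pseudonymized data is still personal data under GDPR.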
Pseudonymization is valuable because applying it to personal data can reduce the risks to the data subjects concerned and help controllers and processors meet their data-protection obligations.
At the same time, however, it may not always be enough. Following the “better safe than sorry” adage, experts recommend anonymizing data so you can steer clear of GDPR altogether.
Easier said than done. Anonymization means eliminating any information that relates to an identified or identifiable person, preventing re-identification by any reasonable means. In practice, though, many anonymization techniques in use today are ineffective, leaving personal information vulnerable to re-identification. Whereas pseudonymization allows the original data set to be reconstructed by design, anonymization should make re-identification impossible.
For large image and video datasets, AI-automated data redaction is a scalable, cost-effective solution that ensures the highest level of data privacy and regulatory compliance without compromising data integrity.
Image redaction is like a surgical strike: it removes PII or any other unwanted sensitive information from visual data by identifying and irreversibly redacting (e.g., blurring or pixelating) faces, license plates, text, logos, and more. AI-based redaction preserves all real-world aspects of the image, redacting only the identifying details.
For autonomous driving, it’s the best of both worlds: Image redaction delivers a high enough level of anonymization to free your data from GDPR regulations while preserving real-world data quality and variation. It can also be applied to massive unstructured datasets at scale, without requiring excessive amounts of time or money to achieve high-quality results.
Image.Redact is a no-code AI solution built specifically for image redaction, fully compliant with privacy regulations like GDPR and CCPA. Users upload data, define the objects that require redaction, then let Image.Redact do the rest.
Built on our Unstructured Data Processing (UDP) Platform that combines the best AI models with human-in-the-loop, Image.Redact delivers the highest detection accuracy and near 100% anonymization quality at speed and scale. That means you can process and redact PII, such as faces and license plate numbers, from thousands of images in just a few seconds.
Don’t get sidetracked on the road to data privacy compliance. Rely on AI-automated image redaction to reach your destination safely, and at any scale. Ready to start your redaction engine? Connect with us for an expert consultation and demo.