Corporations dream of utilizing highly effective AI knowledge processing to accumulate extra shoppers, present higher customer support, and way more. However they’re additionally cautious of AI-related knowledge privateness dangers and compliance necessities. Because of this, many withhold or restrict the scope of their AI initiatives. However what if we instructed you that you may have this cake and eat it, too? Our shopper protected its knowledge whereas chopping as much as 95% of doc processing time with AI.
It looks like all we hear about is AI. But, in line with Boston Consulting Group, 74% of firms battle with AI adoption.
Our expertise tells us that companies could restrict the scope of AI tasks as a result of they want full knowledge integrity.
Study:
- how one can safe your delicate knowledge as you faucet into the potential of Massive Language Fashions like OpenAI,
- how we applied full knowledge anonymization for a shopper who sped up doc processing by as much as 95% with an OpenAI-based OCR answer.
It began with an effort to assist admins enhance their productiveness throughout the buyer onboarding course of.
How far are you able to push productiveness with out AI?
For five years, we’d been working with a UK firm that develops pension dashboards. Every employed Brit might use a dashboard to view the entire retirement pension plans they paid into throughout their skilled profession.
To onboard every particular person, admins needed to manually enter dozens of paperwork, dropping time on analyzing data and typing.
However as soon as the shopper acquired a pension doc supplier with a buyer base of their very own (e.g., an insurance coverage supplier), they wanted to onboard hundreds of such people directly!
The shopper’s capacity to develop was tied to how briskly they may course of paperwork. Because of this, they searched for various methods to spice up their effectivity.
Throughout our cooperation, we helped the shopper lower the onboarding time of huge enterprise shoppers from 3 months to three days by:
- introducing new doc templates,
- bettering integration with third events via APIs to acquire some knowledge routinely.
However the drive for effectivity continued.
Reducing onboarding time with AI… after which what?
Quickly, we began speaking about how Synthetic Intelligence might assist course of doc knowledge even sooner to restrict handbook labor much more.
We created a Serverless utility powered by an LLM mannequin that makes use of Optical Character Recognition to extract particular fields from paperwork. However there was a catch – the LLM mannequin couldn’t have entry to customers’ private or delicate knowledge. A dealbreaker?
The MVP processed a doc in 1 minute and 40 seconds when it could take quarter-hour of handbook work.
But when we ever wished the answer to go dwell, we wanted to determine an environment friendly and scalable approach to defend all of the Personally Identifiable Data (PII).
Knowledge anonymization for our shopper
So-called PII is any kind of knowledge that can be utilized to determine a really particular particular person. There are lots of forms of PIIs, however among the commonest embody:
- date of start,
- dwelling handle,
- telephone quantity,
- bank card quantity,
- biometric knowledge (e.g., fingertips or palm prints),
- medical data.
While you anonymize a bit of knowledge, you take away all identifiers that can be utilized to affiliate an individual with the cash worth or an insurance coverage supplier’s identify.
To strengthen your anonymization effort, you may additionally encrypt particular characters or phrases by changing them with others.
After you full all of the steps to anonymize your knowledge, you possibly can ship it for processing to an LMM.
The fundamental concept just isn’t laborious, however when your app generates tons of data, knowledge anonymization requires cautious planning and testing. It will likely be totally different for every utility or characteristic you need to anonymize.

Knowledge anonymization applied sciences
These had been a few of our key know-how picks for the anonymization work:
Python & Serverless
The fundamental OCR answer was a Serverless app written in Python leveraging AWS Step Features & Lambdas.
GPT-4o mini
It’s one of many OpenAI LLMs. We selected it because the processing answer’s engine after we thought-about the velocity and price of processing.
AWS & REST microservice
The entire knowledge anonymization performance could possibly be organized as a separate devoted Python Flask microservice that might expose an endpoint for anonymization hosted on AWS and managed with the App Runner
spaCy
We additionally selected the sPaCy library written in Python for Pure Language Processing.
Let’s take a more in-depth have a look at the precise knowledge anonymization course of.
Implementing knowledge anonymization
By how we applied knowledge anonymization, you’ll see how guaranteeing knowledge safety suits into the bigger technique of constructing an AI characteristic.
- We recognized the PII knowledge that required anonymization
There are lots of forms of paperwork that want processing. They could share some doc fields but additionally have distinctive ones. Among the commonest knowledge varieties we selected included first identify, final identify, center identify, date of start, or nationwide insurance coverage quantity.

- We outlined and acknowledged knowledge patterns
To ensure that the OCR answer is aware of the place the PII knowledge was, we used the next steps:
- textual content identification to detect and isolate textual content areas inside a picture,
- picture processing to enhance the standard of scanned paperwork to spice up recognition functionality,
- character classification to map characters and phrases to their corresponding alphanumeric or symbolic values.
That’s already the bottom for an anonymization answer, however we wanted to enhance it additional.

- We constructed up the anonymization functionality for every knowledge kind individually
We developed a Named Entity Recognition (NER) mannequin to deal with every knowledge kind in a different way, thus bettering general knowledge processing high quality. Some instruments make this activity lots simpler. For instance, the aforementioned spaCy library helped us acknowledge numerous named entities or knowledge varieties, equivalent to an individual, a rustic, a nationality, or a e-book title.
Then, we created a generalized algorithm that distinguishes between knowledge varieties and a person anonymization module for every kind.
Our knowledge anonymization service was now full, however there have been nonetheless a few steps to clear earlier than it was able to serve the shopper and its customers.

- We built-in the anonymization service into your app
To permit the Serverless OCR utility to speak with the anonymization service, we used the REST API.
- We carried out thorough end-to-end testing of the anonymization course of
We carried out testing iteratively as we moved the information anonymization characteristic via the MVP section towards a production-ready answer. To facilitate testing and observability, we arrange monitoring.
- Deploy!
The anonymization answer went dwell.
So, what did we obtain right here?
Deliverables – know-how & enterprise
From a technological standpoint, the shopper acquired:
- An environment friendly and protected OCR answer
The doc processing utility was able to routinely parsing a doc in beneath a minute. The primary PoC extracted 15-20 doc fields in 40 seconds with out ever exposing delicate PII to the LLM.
Enterprise necessities might evolve and alter the construction and sheer amount of paperwork sooner or later. As a result of we constructed a generalized course of for figuring out totally different knowledge varieties, we had been in a position so as to add new knowledge varieties just by creating new anonymization modules.
These technological achievements allowed the shopper to:
- Enhance buyer onboarding velocity
The anonymization characteristic ensured the shopper might fast-track doc processing for shopper onboarding with out placing delicate PII knowledge in danger.
- Discover a optimistic perspective about AI
This was the shopper’s first AI challenge, and so they approached it with a way of duty for his or her shopper’s knowledge. Within the technique of implementing it, they didn’t must deny themselves the total potential of AI. They gained the fitting data and perspective to sort out much more superb AI-based tasks sooner or later.
Actually, the drive for effectivity by no means ends, however it could actually additionally profit the shoppers should you take safety precautions.

Don’t be the final to see the total potential of AI
You could be afraid of endangering your delicate info throughout AI improvement. It’s a significant problem to AI innovation.
In any group, you’ll discover individuals who will rightfully level out this hazard to you.
There are already firms who’ve finished the homework and realized that they will improve their greatest enterprise use instances with AI and by no means endanger knowledge. They’ve the data, info, and expertise to alleviate inner doubts and champion AI initiatives.
Our work on the information anonymization device helped the shopper validate an AI-driven product concept securely. If your organization doesn’t need to be among the many final ones to experiment with AI, you could need to purchase builders skilled with anonymized knowledge and knowledge anonymization methods.
If in case you have expert knowledge safety and AI specialists in your aspect who can safeguard a massively profitable AI initiative from knowledge integrity points, you possibly can develop your small business sooner. Your specialists will custom-build a safety mechanism as you play to your strengths with AI.
And in case your group desires to seek the advice of AI adoption take into account attempting our workshop
The GenAI Fast Prototyping Dash™ is a 2-day AI workshop that may enable you to shortly uncover the right way to use AI fashions to generate enterprise worth.