Today’s data and doc processing solutions have evolved beyond simplistic OCR to include new machine learning and search technologies. These state-of-the-art technologies significantly expand the number of documents that can be classified and read, and greatly increase the number of data elements that can be extracted automatically. Machine learning, sophisticated data extraction programs and rules-based automation make this possible. Automated Document Recognition and Automated Data Extraction are often abbreviated as ADR and ADE, respectively.
Even with a new wave of intelligent solutions available, technology providers still need to overcome challenges to ensure customers realize the maximum possible benefit. Below we share four common sources of error for ADR- and ADE-centric doc processing solutions and what technology vendors are doing to address them.
OCR: LoanLogics discussed OCR’s role in doc processing earlier this year in a webinar. OCR is the starting point that powers ADR and ADE technologies: it turns images (mortgage documents used in processing are often in PDF format) into readable text so that ADR and ADE can perform their tasks. As such, OCR can also be one of the initial sources of error, even in the most sophisticated systems. Typical errors include confusing letters and numbers, mistaking ‘noise’ on a document for characters, failing to decipher characters and mistaking vertical lines for the number 1.
Document Quality: The quality of PDF documents has a direct impact on how well the OCR engine can render text. Poor quality documents produce poor quality OCR, which in turn produces low quality ADE results. Documents can be poor quality if they have been re-scanned multiple times, printed on poor quality printers, photographed with phones or any combination of the above.
Insufficient document samples: Mortgage documents vary greatly even within a document type. There are multiple variations of the Closing Disclosure alone. To train the ADR or develop ADE solutions that handle these variations accurately, development staff for machine learning solutions need to work from as many variations as possible.
Highly variable documents: Certain documents, such as the Note, gift letters and the many kinds of notices (e.g., title insurance notice, notice of separate credit account, notice of legal representation, lender placed insurance notice), vary greatly in both structure and wording. This variability can negatively impact the accuracy of both ADR and ADE.
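To make the first source of error more concrete, here is a minimal sketch of how a letter-for-digit OCR confusion might be corrected in a field that is expected to hold a dollar amount. It is an illustration only, not a description of how LoanLogics or any particular vendor handles this. The confusion map, the `clean_amount` function and the amount pattern are all hypothetical names and assumptions made for the example.

```python
import re
from typing import Optional

# Assumed confusion map: letters an OCR engine may emit in place of digits.
# Only safe to apply to fields that are expected to be numeric.
OCR_DIGIT_FIXES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)

# Assumed shape of a dollar amount: optional "$", digits with or without
# thousands separators, optional two-digit cents.
AMOUNT_PATTERN = re.compile(r"\$?(\d{1,3}(,\d{3})*|\d+)(\.\d{2})?")


def clean_amount(raw: str) -> Optional[str]:
    """Return a corrected amount string, or None if it still looks wrong."""
    candidate = raw.strip().replace(" ", "").translate(OCR_DIGIT_FIXES)
    if AMOUNT_PATTERN.fullmatch(candidate):
        return candidate
    # Anything that does not validate is left for manual review.
    return None


if __name__ == "__main__":
    print(clean_amount("$l2,5O0.00"))
    print(clean_amount("N/A"))
```

A substitution like this only helps when a character was misread. Spurious characters produced by noise or by vertical lines cannot be recovered this way, which is why the sketch returns `None` for values that fail validation instead of guessing.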
While these sources of error can seem persistent and pervasive in mortgage, experienced doc processing vendors with a focus on continuous improvement and training will prevail in this crowded space.
Committed technology providers typically have a dedicated team in place to continually improve automation results on a weekly, even daily, basis. These teams focus on gathering large samples of diverse document types to train and improve ADR and ADE results. Access to a large pool of documents is key to successful automation, as are regular quality assurance reviews that spot document issues so that developers can adjust the software.
LoanLogics has processed over 510 million mortgage documents and 1.6 billion pages, performed 4.4 billion data extractions and currently maintains over 7,000 business rules. Because of that, we are confident in our accuracy. Nobody can leverage the power of machine learning and artificial intelligence in the mortgage industry like LoanLogics can!