Many of us tech heads are quick to give you an answer to your technical needs and propose a solution even if you did not ask. I’m no different: if you tell me you want your documents digital, I will explain OCR to you and then explain the best solution for your document types. To my dismay, if you work for a large company, your response will likely be, “but I’m not allowed to install anything.”
It’s very common for large organizations to lock down their employees’ computers to the point where each one becomes more of an appliance than a computer. This lockdown makes perfect sense, especially considering the amount of personal and private information these organizations encounter. The lockdown, however, makes it very difficult for a technical operator to increase their efficiency with new technology. While the option exists to approach an IT department with requests for new technology, the chance of approval, as we know, is very small, especially with the current situation of shrinking IT departments.
Most recently I was in a conversation with someone working for a bank. She had stacks of business cards that needed to be digitized, and of course, being the tech head that I am, I got excited and explained business card reading (BCR), and that perhaps it would be easier to get a document scanner that could scan the business cards and everything else. But to no avail: she could not install the software.
The real hurdle with computer lockdowns is not so much hardware installation, which can be overcome with a simple request. It’s the approval of new software, which requires many months of review. Because OCR is a software-driven process, this complicates things. Eventually, I hope that document automation becomes a part of the standard build for end users’ machines. Until then, the solution is a scanner and an OCR service, either web based or on an intranet.
If an organization can centrally deploy an OCR server that users send documents to and receive results from, it eliminates the risk of locally installed software. Alternatively, an end user with an attached scanner can leverage the web-based OCR services that exist, uploading documents via FTP or email and receiving results in return.
I hope soon we all have OCR as a standard so we can start removing the reliance on troublesome paper, but until then, the OCR services exist to get the job done, and may sometimes be the preference.
Chris Riley – Industry Expert
Just this morning, I was reminded of why market education is so important. I received an email from a customer who has been exposed to data capture technology for many years. This customer owns a semi-structured data capture solution that is capable of locating fields on forms that change from variation to variation. To help my understanding, we started a conversation about their expectations. Very wisely, the customer broke down their expectations into three categories: OCR accuracy (field level), field location accuracy, and amount of time to process per document. This is a step more advanced than a typical user, who will clump all of this into one category. In addition to these, there should be a minimum template matching accuracy. In any case, they expect an OCR accuracy of 90%, which is reasonable considering the documents they are working with are pixel perfect. They expect a 20-page document to be processed in 4 minutes, which is also reasonable and right on the line. Finally, they expect field location to be 100%. RED FLAG!
This is not the first time that there is an assumption that you can locate fields on a semi-structured form with 100% accuracy, 100% of the time. To my dismay, as people seem to be learning more about the technology, this is the next class of common fallacy. And because the organization did not specify template matching accuracy, it means they must also assume templates match 100% of the time to get 100% field location accuracy. Trouble.
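To see why the missing template matching figure matters, here is a rough back-of-the-envelope sketch. It assumes the three stages fail independently, which is a simplification, and the 98% and 95% figures are made-up illustrations; only the 90% OCR accuracy and the 20 pages in 4 minutes come from the conversation above.

```python
def end_to_end_field_accuracy(template_match, field_location, ocr_accuracy):
    """Chance that a field's template is matched, the field is located,
    and its text is read correctly.

    Assumes the three stages fail independently (a simplifying assumption).
    """
    return template_match * field_location * ocr_accuracy


# What the customer implicitly assumes: perfect matching and perfect location.
assumed = end_to_end_field_accuracy(1.00, 1.00, 0.90)

# Hypothetical, more realistic rates for the first two stages (illustrative only).
realistic = end_to_end_field_accuracy(0.98, 0.95, 0.90)

# Throughput expectation from the conversation: 20 pages in 4 minutes.
seconds_per_page = 4 * 60 / 20

print(f"assumed: {assumed:.1%}, realistic: {realistic:.1%}, "
      f"{seconds_per_page:.0f} seconds per page")
```

Even small losses in template matching and field location pull the overall result below the OCR accuracy alone, which is why each stage deserves its own stated expectation.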
It’s clear why 100% field location accuracy is important to them: basic QA processes are capable of checking only recognition results (OCR accuracy), not the locations of fields. Instead of modifying its QA processes, the organization’s first thought was how to eliminate the problems that QA might face. 100% accuracy is not possible no matter what is done, including straight text parsing. In this case, the reason it’s not possible is that even in a pixel-perfect document, there are situations where a field might be located partially, located in excess, or not located at all. The scenario that most often occurs in pixel-perfect documents is that text may sometimes be seen as a graphic because it’s so clean, and text that is too close to lines is ignored. So in these types of documents, a field error is usually a partially located field. Most QA systems can be set up so that rules check the data structure of fields, and if the data contained in a field is faulty, an operator can check the field and expand it if necessary. But this is only possible if the QA system is tied to data capture.
After further conversation, it became clear that the data capture solution is being forced to fit into an existing QA model. There are various reasons why this may happen: license cost, pre-existing QA, or misunderstanding of QA possibilities. This is very common for organizations and very often problematic. Quality assurance is a far more trivial process to implement than data capture. It would be more important to focus on the functionality of the data capture system and then develop a QA process that makes its output most efficient.
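As a minimal sketch of the kind of rule a QA step can apply, the snippet below checks the structure of each captured field and lists the ones an operator should look at. The field names and formats are hypothetical; real rules depend on the form being captured.

```python
import re

# Hypothetical field formats, for illustration only.
FIELD_RULES = {
    "invoice_date": re.compile(r"\d{2}/\d{2}/\d{4}"),
    "zip_code": re.compile(r"\d{5}(-\d{4})?"),
}


def fields_needing_review(captured):
    """Return the names of fields whose text does not fully match the
    expected structure, e.g. because the field was only partially located."""
    flagged = []
    for name, pattern in FIELD_RULES.items():
        value = captured.get(name, "").strip()
        if not pattern.fullmatch(value):
            flagged.append(name)
    return flagged


# A date cut short by a partially located field fails the structure check.
print(fields_needing_review({"invoice_date": "03/14/20", "zip_code": "94105"}))
```

A rule like this catches a partially located field without ever knowing where the field was on the page, which is the point of tying QA to the data rather than to perfect location.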
Again, a case of expectations and assumptions.
Chris Riley – Sr. Solutions Architect
I’ve faced unique projects in the last four years and in a few, the best approach even seemed to contradict my better logic. The projects I’m talking about are ones where the data we were working with was already in a digital format, namely a PDF file that was created digitally. What this meant was that all the text in the PDF was available and 100% accurate. So why then, to accomplish the project’s goals, did we use OCR to read the already digital files as images?
I had intended in all these projects to do a logical parsing of the already digital content so I could get what I wanted. The problem is that even though the internal structure of the PDF has a logical standard, most PDF-generating applications do not use it logically 90% of the time. PDF has a tolerance for mistakes that allows organizations to deviate quite drastically from the standard. What this means is that not only is the content in each PDF unique to the company that generates it, it’s also unique to each of the applications able to create it. Variations on top of variations make logical parsing very difficult. This becomes most obvious when the documents contain tables. Because of this, the only way to text parse the PDFs properly would be to flatten the internal logic so that they consist of nothing but text, but by doing so you lose some of the information pointing to where tables are and how they are structured.
You may have guessed by now that all my projects were to parse tables from PDFs. Not just any table, but specific tables in PDFs where each had a unique format. As I said before, my preference would have been to use the 100% accurate data already in the PDF. What I ended up doing was OCRing the PDFs, because they were what is called “pixel perfect,” so the accuracy was very high. Now that I was using OCR, I was able to first recognize an entire document and remove everything that my OCR document analysis determined was not a table. Then I was able to use keywords to find the specific table that I wanted. This took me about 3 weeks of work for each project, and the result was higher accuracy in table finding and only slightly lower accuracy in the text values than text parsing would have given.
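The two steps described above (keep only what document analysis calls a table, then pick the table by keywords) can be sketched roughly as follows. The block structure, a list of dictionaries with "type" and "text" keys, is a hypothetical stand-in for whatever an OCR engine's document analysis returns; it is not any particular product's output format.

```python
def find_table(blocks, keywords):
    """Pick the table block whose text contains the most keywords.

    `blocks` is a hypothetical list of {"type": ..., "text": ...} dicts
    standing in for document analysis output.
    """
    # Step 1: remove everything document analysis did not mark as a table.
    tables = [b for b in blocks if b["type"] == "table"]

    # Step 2: score each remaining table by how many keywords it contains.
    def score(block):
        text = block["text"].lower()
        return sum(1 for keyword in keywords if keyword.lower() in text)

    best = max(tables, key=score, default=None)
    if best is None or score(best) == 0:
        return None  # no table matched any keyword
    return best


page = [
    {"type": "text", "text": "Quarterly report for shareholders"},
    {"type": "table", "text": "Region Units Sold Revenue"},
    {"type": "table", "text": "Name Title Phone"},
]
print(find_table(page, ["revenue", "units sold"]))
```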
While it seemed most logical to do the parsing, in the end I saved over 5 man-months of work by using OCR.
Chris Riley – Sr. Solutions Architect
Oftentimes when I receive printed periodicals, my preference is to OCR them to a digital searchable format and read the articles I’m interested in on my computer, just like my online periodicals. One of these printed documents might be a magazine. Magazines are either very easy to OCR or very difficult, and usually both cases exist in a single magazine. It all has to do with the graphical elements that are often incorporated in magazines.
Text printed on graphics. Very often articles will have text printed over related graphics. If entire paragraphs are printed over a single graphic, it’s less challenging; but when text overlaps both a graphic and white space, it’s problematic because a single word can change from colored text to normal black text in order to contrast with the image.
Annotated images. Many magazines, including my favorite scientific one, include text as part of diagrams in the articles. To many this text may be irrelevant, but to me it provides important search words at the very least. These annotations tend to be in a small font and are often hard for the OCR engine to identify because of their close proximity to images.
The good news is that for the most part the purpose of OCRing any magazine is to make its text searchable. Anything more would probably be illegal. The other good news is that there are tricks to deal with each of these problems. First, a magazine that is being OCRed must be scanned in color. The additional information provided by the color scan will help the OCR engine distinguish graphics from text on graphics. Second, enable the engine’s full recognition mode and any settings geared to small fonts. Third, turn off document analysis or enable limited document analysis. This is the less obvious setting. By disabling document analysis, you don’t allow the OCR engine to get confused by strange structure, text printed on graphics, and annotated images. You are forcing it to read all possible text.
Since searchable text is the greatest benefit of OCRing my periodicals, I have opted for the OCR settings that produce the most text and the least structure. If you are converting similar documents, I recommend doing the same.
Chris Riley – Sr. Solutions Architect
Often the purpose of doing Optical Character Recognition (OCR) for individuals and companies is to get a digital version of a document that the individual intends to edit and/or repurpose. This is not the most common use of the technology, but it is a use that requires specific attention.
In order to convert a document so that it is printable later on, it’s important to not only get the text from the document but also the format of the text. This includes layout as well as things such as graphics, and font colors. To do this, the OCR product must be able to recognize colors (requires color scanning), recognize font styles, and very importantly, recognize document structure.
Engines that support advanced document analysis have this. Document analysis (DA) is the process that happens before any text is read on a page. Document analysis makes sense of a document in order to improve recognition as well as get the formatting required for a formatted export. First, document analysis finds the document structure, i.e. columns, tables, text, paragraphs, and lines. Once this is done, it identifies colors in text and graphics. After document analysis has done its job, recognition can begin. During recognition, the style of fonts is detected: bold, italic, underlined. All of this is put together into a result formatted as closely as possible to the input document.
For those individuals who are concerned about repurposing their documents, a straight text OCR engine will not work. Basic OCR engines get the text on the document in digital form and nothing more. For these individuals, it’s important to find a solution that has good document analysis.
Chris Riley – Sr. Solutions Architect
Check-mark processing ( OMR ) is one of the most accurate recognition technologies. Companies who properly utilize OMR are able to process documents quickly and accurately. But for the same reason OMR is accurate, it can also be very inaccurate, when not used properly.
For the most part, OMR is an all-or-nothing technology. Unlike the varying degrees of accuracy and uncertainty in OCR, with OMR a field is checked or not. Where accuracy and uncertainty come into play is when you deal with collections of check-marks, where the technology will compare the results of all to see which ones are most likely checked. The three areas where organizations make mistakes when using OMR are: improper OMR type, poor thresholds, and bad rules.
Many think of OMR fields as the traditional bubble on school tests. But there are several types of OMR fields: Rectangle, Round, Automatic, and White Field. Unlike text recognition, the wrong field type selection in OMR results in 100% incorrect results most of the time.
Rectangle and round are the traditional fields that come to mind when thinking of check-marks. The technology used to process these also includes a way to tell if a field has been corrected (slashed out, and answer changed). For these fields, the borders of the field are detected, and when a high enough amount of black pixels is found within the border, the field is considered checked. The only time this will not be the case is when a field has been detected as having a correction.
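The black-pixel test for rectangle and round fields can be sketched like this. It is a simplified illustration, not how any particular OMR package works: the field interior is assumed to be already cropped inside the detected border and given as rows of 0 (white) and 1 (black), the 20% default is an invented value, and correction detection is left out.

```python
def black_pixel_percent(field):
    """Percentage of black pixels in the field interior.

    `field` is a list of rows of 0 (white) / 1 (black), assumed to be
    cropped to the area inside the detected border.
    """
    total = sum(len(row) for row in field)
    black = sum(sum(row) for row in field)
    return 100.0 * black / total if total else 0.0


def is_checked(field, threshold_percent=20.0):
    """A field counts as checked once enough of its interior is black.

    The 20% default is a made-up illustration, not a vendor value.
    """
    return black_pixel_percent(field) >= threshold_percent


marked = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
empty = [[0, 0, 0, 0] for _ in range(4)]
print(is_checked(marked), is_checked(empty))
```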
Automatic field types are for those forms that have non-traditional border types for their fields, OR have some sort of text already existing in the field. For example, if you scan a Scantron form as a black and white image without dropout, you will get for each field a round circle with some letter or number printed in the middle. In this case you would have to use the automatic field type. What happens is that the software compares an EMPTY form to the form being processed. If, for example, a field has the letter “A” printed in the middle, the software will count how many pixels in the field the A consists of and use that as a baseline. For a field to be checked, it will have to contain some number of black pixels OVER the baseline. If in this case you used a rectangle or round check-mark type field, all fields would be considered checked because no baseline was established. Finally, there are white fields.
White fields are check-mark fields that have no border. They are most often found on forms scanned with dropout, or are sometimes used for unique and cool cases such as detecting signatures. These are a useful type of check-mark that simply expects there to be no border and no printed text in the field area. If there is even a small amount of black pixels in the field area, it’s considered checked. If you use a white field on a rectangle OMR field, it will always be considered checked because of the borders. The biggest challenge for white fields is that the size of the field directly impacts its accuracy, so proper sizes must be chosen. All check-marks have degrees of thresholds assigned to them.
A threshold is the setting that determines the amount of black pixels (as a percent) required before a field is considered checked. Organizations rarely need to change the default thresholds, yet changing them is one of the biggest mistakes that is made. Most OMR processing packages have default thresholds for all field types. These vendors have done the research to know what the optimum field threshold is for both accuracy and avoiding false positives. When companies pick the wrong threshold, fields are considered checked when they are not, and the other way around. The problem is that most of these errors are never reviewed, because with custom thresholds they never get flagged. That creates a false positive, the worst possible outcome of any exception.
As with all data capture and forms processing tools, there is usually a step of validation and rules. For whatever reason, organizations tend to over-think the rules associated with check-marks. The most common rule is that for any given collection of check-marks associated with a single question, only one, or only certain combinations, can be checked. So for example, for a multiple choice question that asks for one answer, if the software sees two checked it will flag both fields. These rules are very useful, but when improperly implemented they result in either too much verification of fields, which is OK but a time waster, or, as with bad thresholds, false positives. Sometimes the rules are applied during recognition and thus affect recognition results. For example, a question that has no answer, but where one is expected, is forced to have an answer. It’s easy to blame the software, but most of the time it’s just a bad rule.
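The baseline comparison for automatic fields can be sketched as below. This is an illustration under simplifying assumptions: both fields are already aligned and cropped, pixels are 0 (white) or 1 (black), and the `min_extra_pixels` margin is an invented parameter rather than a real product setting.

```python
def count_black(field):
    """Count black pixels in a field given as rows of 0 (white) / 1 (black)."""
    return sum(sum(row) for row in field)


def is_checked_against_baseline(field, empty_field, min_extra_pixels=3):
    """Compare a filled-in field against the same field on an EMPTY form.

    The printed letter and border in `empty_field` form the baseline; the
    field is checked only if it holds enough black pixels OVER that baseline.
    `min_extra_pixels` is a hypothetical margin chosen for this example.
    """
    baseline = count_black(empty_field)
    return count_black(field) - baseline >= min_extra_pixels


# Empty form: a printed character already puts 4 black pixels in the field.
printed_only = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
# Processed form: the respondent has filled in most of the field.
filled_in = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 1, 1, 0],
]
print(is_checked_against_baseline(printed_only, printed_only))
print(is_checked_against_baseline(filled_in, printed_only))
```

Without the baseline, the printed character alone would push the untouched field over a plain black-pixel threshold, which is the failure described above.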
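A single-answer rule that flags instead of forcing can be sketched as follows. The return structure and the choice to flag every option when nothing is checked are assumptions made for this example; the point is only that the rule sends doubtful questions to an operator rather than inventing an answer.

```python
def apply_single_answer_rule(checked_by_option):
    """Validate a question that expects exactly one checked option.

    `checked_by_option` maps an option label to True/False as reported by
    recognition. The result dict layout is hypothetical.
    """
    checked = [option for option, is_on in checked_by_option.items() if is_on]

    if len(checked) == 1:
        return {"answer": checked[0], "flagged": []}
    if not checked:
        # No answer found: send the whole question to an operator
        # instead of forcing one of the options.
        return {"answer": None, "flagged": list(checked_by_option)}
    # More than one option checked: flag only the conflicting fields.
    return {"answer": None, "flagged": checked}


print(apply_single_answer_rule({"A": False, "B": True, "C": False}))
print(apply_single_answer_rule({"A": True, "B": True, "C": False}))
print(apply_single_answer_rule({"A": False, "B": False, "C": False}))
```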
OMR is a great tool when used right because it’s extremely fast and accurate, but when it’s used wrong, it’s still fast but just extremely inaccurate.
Chris Riley – Sr. Solutions Architect
I am amused at the detail of thought and complexity people put into business card scanning (BCR). In my time in the enterprise content management industry, I’ve scanned hundreds of thousands of business cards. Yes, I know that is a lot, and no, I’m not that popular. The vast majority of the scanning was for testing business card scanners and the OCR/BCR technology that reads them. But the cards I do want to keep are scanned and stored for later retrieval. I DO NOT use a specific card scanner, nor do I use a special card scanning application. I truly keep it simple, and in my experience this is the right way to do it.
Card scanners are inexpensive and useful when all you are scanning is business cards, but why not scan all your documents? ADF document scanners today support feed trays that dial down to the size of business cards. They are also able to scan stacks of business cards, front and back, without a problem. So all I need is one scanner for all my documents. In my extensive testing, the image quality difference was negligible, and the top three performers were leading desktop document scanners, whose image quality was actually higher.
Now that you have the image, how do you get the data? Most people want to use a dedicated BCR technology to extract each data element from the card. I too am very amused by this and have had fun setting up systems to do it. But as a practical matter, it does not make much sense to me. BCR that extracts separate fields such as name, email, etc. can be very accurate. But when it’s not accurate, it’s a problem. It takes a lot of time to correct a problem if it occurs, but more often than not you don’t even know the problem occurred, so you get, for example, part of an address in the phone number field and miss the phone number completely. You will only know this is the case when it’s time to follow up with people. My second practical complaint is that you are adding one more program to operate and use regularly. We all have our favorite email client or CRM where we keep all our contacts. Most BCR applications promote their ability to export to the most popular email clients, but as soon as there is an update to your database application, you also have to buy a new version or you might be stuck. Some BCR applications do not even allow the export of data, so you have to manually copy and paste anyway. Is this really necessary?
OK, it’s only fair now for me to tell you how I do it. I scan my business cards with my ADF document scanner to a hot-folder. This hot-folder is watched by a full-page OCR system, and the cards are automatically converted to searchable PDFs. I am not getting field data and I’m not saving into a separate application. I’m making searchable PDFs just like all my other documents. This has worked very well for me for the past 4 years. When it comes time for me to find someone, all I really want is to see the card and the info. Most likely, if it’s really important, I’ve already emailed them and captured their information that way. With my system I can search for people, websites, company names, even topics to find the cards of the people I want. I don’t have to worry about searching in a special UI field by field. Nor do I have to worry about missing data, as full text is not at the mercy of field extraction; it gets everything that is readable. In the areas of business where card scanning is used for reading medical insurance cards and driver’s licenses, the technology is very useful and necessary. I’m speaking only of personal business card scanning.
In my experience most users of card scanners and BCR applications use them very actively for a month or two, and soon the use dies to nothing and they revert to manual entry. I’m not doing manual entry; I’m using the latest technology in one unified process with full-text search. I’m just keeping it simple and practical.
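To illustrate why full text is enough for finding a card, here is a minimal sketch of a search over the OCR text of each card. It assumes the text has already been extracted from the searchable PDFs into a dictionary keyed by file name; the file names and card contents are invented, and a real setup would simply use the desktop search that already indexes PDFs.

```python
def search_cards(card_texts, query):
    """Return the names of cards whose full OCR text contains every query term.

    `card_texts` maps a file name to the text recognized on that card
    (a hypothetical in-memory stand-in for indexed searchable PDFs).
    """
    terms = query.lower().split()
    return sorted(
        name
        for name, text in card_texts.items()
        if all(term in text.lower() for term in terms)
    )


# Invented sample cards, for illustration only.
cards = {
    "card_001.pdf": "Jane Doe Director of Imaging Acme Capture www.example.com",
    "card_002.pdf": "John Roe Sales Engineer Example Scanners 555-0100",
}
print(search_cards(cards, "acme imaging"))
print(search_cards(cards, "example"))
```

Note that the second query matches both cards, once through a website and once through a company name, without any notion of which field the word came from.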
Chris Riley – Sr. Solutions Architect
Many people inherit full-page Optical Character Recognition (OCR) technology by simply purchasing a scanner or a multi-function (MFP) device. All these pieces of hardware include various software packages and OCR is one of the most common. Often the software is never used or the use isn’t always clear. Other times, the bundle is a tight integration with the hardware and the OCR is a part of configuration of the scanner and is used during scanning unknown to the user.
Bundled OCR technology is the easiest way to learn through use and to get the technology for a low price. Bundled software has contributed a great deal to market education and understanding of advanced technologies. All the top OCR engines have a consumer product bundled with a document scanner or multi-function device. But because it’s already there, it leaves many wondering why you would ever purchase the software directly.
For many, the bundled OCR is sufficient. Their documents are clean, and advanced options are not required. But others just need more. This is why more advanced versions exist. Bundled OCR, even from the best vendors, is either limited or an older version of the product. Some vendors make a special “bundle only version”, while others choose to incorporate non-current versions. Buying the software directly gets you the latest technology with the best features, but the biggest driver to purchase is a greater, more specific need for OCR functionality. This could be because you are scanning old documents or degraded documents, or because you need special settings such as compression and PDF/A functionality that are simply not found in bundled versions.
Vendors don’t make any money on bundled OCR other than to cover costs. Because vendors for the most part use bundled versions as marketing, they don’t incorporate the latest, greatest, and most advanced features. For those to whom the document conversion process is very important, there is a clear benefit in quality OCR packages.
Chris Riley – Sr. Solutions Architect
There are a lot of technologists out there who believe that optical character recognition has its days numbered and is an aged technology. The belief is that soon paper will go away. This post is for those who believe OCR technology is going away.
The reality is that paper consumption has not really decreased. In some areas paper has been replaced with electronic data interchange (EDI), but in other areas it has actually increased. Studies have also shown that because documents are being scanned more often, there is also an increase in printing when the documents need to be shared or repurposed. But I’m not here to argue that paper is not going away and that document conversion technologies are required to convert it. I’m here to point out a few futuristic uses that technologists already like to talk about and that involve OCR.
Data Security
The first futuristic use of the technology that I would like to discuss is the use of OCR in data security. Text strings sent over the Internet are far easier to sniff and unlock than a compressed JPEG image. What if you were to convert the text into a JPEG for transmission and have the person on the receiving end OCR it to get the data? By doing so the data has been masked in a more efficient and secretive way. For added security, proprietary image formats could be devised.
File Compression
Storing ASCII text takes up far less space than an image or video file. As a part of the future of compression technologies, expect that OCR will be used to extract the text from an image and save it as an ASCII file. Viewers will convert the text back to an image during viewing. This removes the image portion of the text and significantly reduces file size.
Robots
How else do you expect future robots to read text? OCR, of course. The eyes of the robot are essentially a camera that rapidly takes pictures. When the robot needs to comprehend text, the image will be converted using OCR and fed through an engine to gain meaning from the text and act on it.
So there you have it, three really cool and cutting edge ways OCR is and will be used in the future. Paper is not going away, but even if it were, just look at the other cool uses of OCR technology.
Chris Riley – Sr. Solutions Architect
One of the biggest challenges in the IT space is migration from legacy systems, often mainframes, to modern-day operating systems and applications. Legacy systems still exist today in the form of classic green screen UNIX systems. Their life has been extended due to the critical nature of the data they contain. Modern-day standards have been put into place hoping to avoid this problem in the future. However, those applications for which it seems most critical to conform to standards, such as hospital medical records systems, airline systems, and government systems, still do not conform to any. The vendors who make these systems have every intention of making them very hard to migrate from. But there is a way, and it works very well. OCR.
You may have seen a previous post where I alluded to the possibilities of using OCR to scrape screen-shots. This is one of the best real examples of why the technology is so useful. When you don’t have XML and ODBC or any of the other great standards that allow the exchange of data from one system to another, you always have what you can see, and if you can see it you can OCR it. If you can view the data on the screen, you can move it to a new system.
Using OCR to either programmatically or manually read portions of a screen where the legacy system window is displaying data, copy the data to memory, and paste it into the new system is one of the most ingenious ways to ensure the neutrality of your data. Vendor lock-in attempts or old technology should not prevent you from getting to what you own: the information.
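For the programmatic route, the OCR step itself depends on whichever engine you use, so the sketch below covers only what comes after it: turning the text recognized from a screen-shot into a record ready for the new system. The "LABEL: value" screen layout and the sample screen are assumptions made for illustration; real legacy screens often need position-based rules instead.

```python
import re

# Matches lines of the assumed form "LABEL : value" on the legacy screen.
LABEL_VALUE = re.compile(r"^\s*([A-Za-z ]+?)\s*:\s*(.+?)\s*$")


def parse_screen_text(ocr_text):
    """Turn OCR text from a screen-shot into a dict of field name -> value.

    Lines that do not look like "LABEL: value" (rulers, titles) are skipped.
    """
    record = {}
    for line in ocr_text.splitlines():
        match = LABEL_VALUE.match(line)
        if match:
            key = match.group(1).strip().lower().replace(" ", "_")
            record[key] = match.group(2)
    return record


# Invented sample of what OCR might return for one legacy screen.
screen = """\
==== RECORD VIEW ====
CUSTOMER NAME : DOE, JANE
ACCOUNT: 0012345
OPENED: 01/02/1998
"""
print(parse_screen_text(screen))
```

Once the record is a plain dictionary, writing it to the new system is an ordinary import job, with an operator reviewing anything the OCR misread.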
Whether it’s a manual process or a programmatic one, the ability to OCR screen-shots and to migrate data is the hidden secret to crack any proprietary software safe.
Chris Riley – Sr. Solutions Architect
























