How to make tesseract recognize only numbers when they are mixed with letters?

Ocr, Tesseract

Ocr Problem Overview


I want to use tesseract to recognize only numbers. The problem is that I have a mixture of numbers and letters, and when I use SetVariable("tessedit_char_whitelist", "0123456789"), tesseract returns a digit for every symbol, so the letters come out as wrong digits.

Can I set a threshold value so that tesseract omits the symbols with low resemblance?

NOTE: I set tesseract to recognize only digits so there is no confusion between O and 0.

Ocr Solutions


Solution 1 - Ocr

Recognizing only numbers is actually answered on the tesseract FAQ page. See that page for more info, but if you have the version 3 package, the config files are already set up. You just specify the digits config on the command line:

tesseract image.tif outputbase nobatch digits

As for the threshold value, I'm not sure which you mean. If your input is an unusual font, perhaps you might retrain with a sample of your input. An alternative is to change tesseract's pruning threshold. Both options are also mentioned in the FAQ.
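If "threshold" means a confidence cutoff, one option is to filter after recognition instead of inside the engine. The sketch below is in Python and assumes the result has the shape that pytesseract's image_to_data(..., output_type=Output.DICT) returns, i.e. parallel "text" and "conf" lists. The function name keep_confident_digits, the 60.0 cutoff and the sample data are made up for illustration.

```python
def keep_confident_digits(data, min_conf=60.0):
    """Return digit-only tokens whose confidence is at least min_conf."""
    kept = []
    for text, conf in zip(data["text"], data["conf"]):
        token = text.strip()
        # conf may arrive as int, float or str depending on the wrapper version;
        # -1 marks layout rows that carry no recognized word
        if token.isascii() and token.isdigit() and float(conf) >= min_conf:
            kept.append(token)
    return kept


# Hypothetical sample shaped like an image_to_data dictionary
sample = {
    "text": ["", "123", "abc", "45", "7"],
    "conf": ["-1", 96, 91, 40.5, "88"],
}
print(keep_confident_digits(sample))
```

With a real image, the dictionary would come from pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT). The cutoff has to be tuned on your own samples.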

Solution 2 - Ocr

For tesseract 3, the command is simpler according to the FAQ: tesseract imagename outputbase digits. But it doesn't work very well for me.

I then tried different psm options and found that -psm 6 works best for my case.

See man tesseract for details.
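When trying several psm values from a script, it can help to build the command in one place. This is only a sketch: digits_command and its parameters are hypothetical names, and it relies on the flag being spelled -psm in tesseract 3 and --psm from version 4 on.

```python
def digits_command(image, outputbase="stdout", psm=6, major_version=3):
    """Build the argument list for a digits-only tesseract run."""
    # tesseract 3 spells the flag -psm; version 4 and later use --psm
    psm_flag = "-psm" if major_version < 4 else "--psm"
    # config file names such as "digits" go last on the command line
    return ["tesseract", image, outputbase, psm_flag, str(psm), "digits"]


print(" ".join(digits_command("image.tif", "outputbase")))
```

The list can be passed to subprocess.run(cmd, check=True) on a machine where tesseract is installed.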

Solution 3 - Ocr

For tesseract 3, I tried creating a config file according to the FAQ.

Either set the variable BEFORE calling an Init function, or put this in a text file called tessdata/configs/digits:

tessedit_char_whitelist 0123456789

Then it works with the command: tesseract imagename outputbase digits
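If the config file has to be created from a script, a small Python sketch could look like this. The helper name write_digits_config is made up, and the temporary directory only stands in for your real tessdata directory, whose location depends on the installation.

```python
import tempfile
from pathlib import Path


def write_digits_config(tessdata_dir, whitelist="0123456789"):
    """Write <tessdata_dir>/configs/digits containing the whitelist variable."""
    config_path = Path(tessdata_dir) / "configs" / "digits"
    config_path.parent.mkdir(parents=True, exist_ok=True)
    # config files hold one "name value" pair per line
    config_path.write_text(f"tessedit_char_whitelist {whitelist}\n")
    return config_path


# Demo only: a throwaway directory instead of the real tessdata path
demo_dir = tempfile.mkdtemp()
path = write_digits_config(demo_dir)
print(path.read_text(), end="")
```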

Solution 4 - Ocr

If one wants to match only 0-9:

tesseract myimage.png stdout -c tessedit_char_whitelist=0123456789

Or, if one wants to match almost 0-9, but with one or more different characters:

tesseract myimage.png stdout -c tessedit_char_whitelist=01234ABCDE

Solution 5 - Ocr

I did it a bit differently (with tess-two). Maybe it will be useful for somebody.

First you need to initialize the API.

TessBaseAPI baseApi = new TessBaseAPI();
baseApi.init(datapath, language, ocrEngineMode);

Then set the following variables

baseApi.setPageSegMode(TessBaseAPI.PageSegMode.PSM_SINGLE_LINE);
baseApi.setVariable(TessBaseAPI.VAR_CHAR_BLACKLIST, "!?@#$%&*()<>_-+=/:;'\"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, ".,0123456789");
baseApi.setVariable("classify_bln_numeric_mode", "1");

This way the engine will check only for numbers.

Solution 6 - Ocr

You can instruct tesseract to use only digits, and if that is not accurate enough, the best chance of getting better results is to go through the training process: http://www.resolveradiologic.com/blog/2013/01/15/training-tesseract/

Solution 7 - Ocr

This feature is not supported by the LSTM engine in version 4.0. You can still use it via -c tessedit_char_whitelist=0123456789 with "--oem 0", which reverts to the old (legacy) engine.

There is a bounty to fix this issue.

Possible workarounds:

As stated by @amitdo

Solution 8 - Ocr

Adding "--psm 7 -c tessedit_char_whitelist=0123456789" works for me when the image contains only one line.

Solution 9 - Ocr

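import pytesseract as tess

# digits config, LSTM engine (--oem 1), single text line (--psm 7), digit whitelist
custom_oem = r'digits --oem 1 --psm 7 -c tessedit_char_whitelist=0123456789'
# croped is the cropped input image prepared earlier
text = tess.image_to_string(croped, config=custom_oem)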

I am using tesseract 4.1.1.

For better results, you might want to consider image processing techniques.

Solution 10 - Ocr

What I do is recognize everything, and once I have the text, I take out all the characters except numbers:

// Replace every run of characters other than the digits 0-9 with a single space
recognizedText = recognizedText.replaceAll("[^0-9]+", " ");

This works pretty well for me.
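The same post-filtering idea can be sketched in Python with the standard re module. The function names and the sample string are made up for illustration; the input would be whatever text the OCR step returned.

```python
import re


def digits_only(recognized_text):
    """Replace every run of non-digit characters with one space, then trim."""
    return re.sub(r"[^0-9]+", " ", recognized_text).strip()


def digit_groups(recognized_text):
    """Return each run of consecutive digits as a separate string."""
    return re.findall(r"[0-9]+", recognized_text)


sample_text = "Order 66, item 7b: 1024 pcs"
print(digits_only(sample_text))
print(digit_groups(sample_text))
```

Note that this drops the letters but keeps the digits exactly as they were recognized, so a letter misread as a digit still gets through.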

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | zkunov | View Question on Stackoverflow
Solution 1 - Ocr | Jerry | View Answer on Stackoverflow
Solution 2 - Ocr | michaelliu | View Answer on Stackoverflow
Solution 3 - Ocr | user3852208 | View Answer on Stackoverflow
Solution 4 - Ocr | neoneye | View Answer on Stackoverflow
Solution 5 - Ocr | Blehi | View Answer on Stackoverflow
Solution 6 - Ocr | valentt | View Answer on Stackoverflow
Solution 7 - Ocr | user123959 | View Answer on Stackoverflow
Solution 8 - Ocr | Yerrick | View Answer on Stackoverflow
Solution 9 - Ocr | Elie Eid | View Answer on Stackoverflow
Solution 10 - Ocr | algarrobo | View Answer on Stackoverflow