Limit characters tesseract is looking for

Ocr, Tesseract

Ocr Problem Overview


Is it possible to limit the set of characters that tesseract is looking for (e.g. search only for letters a-z)? That would improve my results greatly.

Ocr Solutions


Solution 1 - Ocr

Create a config file (e.g. "letters") in the tessdata/configs directory. This is usually /usr/share/tesseract/tessdata/configs
or
/usr/share/tesseract-ocr/tessdata/configs

And add this line to the config file:

tessedit_char_whitelist abcdefghijklmnopqrstuvwxyz

A range such as [a-z] might also work; I have not tested it. Then call tesseract similar to this:

tesseract input.tif output nobatch letters  

This limits tesseract to recognizing only the characters you want.
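Since it is not certain that a range like [a-z] is accepted, one way around the question is to generate the explicit character list and paste it into the config file. A minimal Python sketch; `expand_ranges` and its "a-z" range convention are assumptions of this example, not part of tesseract:

```python
def expand_ranges(spec):
    """Expand 'a-z0-9' style ranges into an explicit character string.

    Hypothetical helper: the range syntax is this sketch's own convention.
    """
    chars = []
    i = 0
    while i < len(spec):
        # A range is three characters: start, '-', end.
        if i + 2 < len(spec) and spec[i + 1] == "-":
            start, end = ord(spec[i]), ord(spec[i + 2])
            chars.extend(chr(code) for code in range(start, end + 1))
            i += 3
        else:
            chars.append(spec[i])
            i += 1
    return "".join(chars)


# The line to put into the config file, built from a compact range spec.
config_line = "tessedit_char_whitelist " + expand_ranges("a-z")
print(config_line)
```

The printed line matches the one shown in the config file above, so it can be written there as-is.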

Solution 2 - Ocr

To use a whitelist, whether in a config file or via the -c tessedit_char_whitelist=... command-line switch, in version 4.0 you have to set the OCR Engine mode to "Original Tesseract only" (--oem 0). This is because the new "Neural nets LSTM" mode doesn't respect the whitelist setting. Example of a proper command line for version 4.0:

tesseract input_file output_file --oem 0 -c tessedit_char_whitelist=abc123
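When the command is launched from a script, building it as an argument list avoids shell quoting problems with the whitelist string. A rough Python sketch, assuming a tesseract binary may or may not be on PATH; the helper names `build_whitelist_command` and `run_if_available` are hypothetical:

```python
import shlex
import shutil
import subprocess


def build_whitelist_command(image, output_base, whitelist, oem=0):
    """Return the argv list for a whitelist-restricted tesseract run."""
    return [
        "tesseract", image, output_base,
        "--oem", str(oem),
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


def run_if_available(argv):
    """Run the command only when the tesseract binary is found on PATH."""
    if shutil.which(argv[0]) is None:
        return None
    return subprocess.run(argv, capture_output=True, text=True, check=False)


cmd = build_whitelist_command("input_file", "output_file", "abc123")
# Show the equivalent shell command without executing anything.
print(shlex.join(cmd))
```

Calling `run_if_available(cmd)` would execute it; the sketch only prints the command so that it stays harmless on machines without tesseract.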

UPDATE: In newer versions (4.0), the Windows installer and some Linux installers ship a corrupted eng.traineddata file by default. A temporary solution is to replace tessdata\eng.traineddata with the file from an older version; it should be about 30MB. Otherwise you'll get the error "Tesseract couldn't load any languages!" or similar.

Update for tesseract 4.1.1

  • In tesseract 4.1.1 the whitelist bug described above is fixed, and the following works like a charm:

    tesseract my_image.jpg stdout -l mylang configfile myconfig

Here "myconfig" is a plain-text file located in TESSDATA/configs, with the following contents:

load_system_dawg false
load_freq_dawg false
tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
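The config file format is one "name value" pair per line, so it is easy to generate. A small Python sketch, assuming the parameters are kept in a dict; `render_config` is a hypothetical helper, and the file is written to a temporary directory here rather than to the real TESSDATA/configs:

```python
import tempfile
from pathlib import Path


def render_config(params):
    """Render parameters as 'name value' lines, one per line."""
    lines = []
    for name, value in params.items():
        # Booleans are written as lowercase true/false, as in the file above.
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


params = {
    "load_system_dawg": False,
    "load_freq_dawg": False,
    "tessedit_char_whitelist": "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789",
}

# A temporary directory stands in for TESSDATA/configs in this sketch.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "myconfig"
    config_path.write_text(render_config(params), encoding="ascii")
    written = config_path.read_text(encoding="ascii")
```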

Solution 3 - Ocr

In addition to the config file, there is the -c flag:

tesseract stdin stdout -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyz -psm 6

Update: confirmed working on version 4.1.1.

Solution 4 - Ocr

Just adding this for anyone using tesseract on Android. In your readOCR function, where you set the language etc., add the following line:

tesseract.setVariable("tessedit_char_whitelist","ABCDEFGHIJKLMNOPQRSTUVWXYZ");

You can also set tessedit_char_blacklist to exclude characters.

Solution 5 - Ocr

In Tesseract version 4.00, this can't be done. You can only fine-tune your model or use a regex to remove extra characters from the prediction.
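The regex approach can be sketched in Python as follows. This is only an illustration: `keep_only` is a hypothetical helper and the input string is made-up OCR output, not a real tesseract result.

```python
import re


def keep_only(text, allowed):
    """Drop every character of `text` that is not in `allowed`."""
    if not allowed:
        return ""
    # re.escape makes characters such as '-' or ']' safe inside the class.
    pattern = re.compile(f"[^{re.escape(allowed)}]")
    return pattern.sub("", text)


raw = "he11o, w0rld!"  # hypothetical OCR output
cleaned = keep_only(raw, "abcdefghijklmnopqrstuvwxyz")
print(cleaned)  # heowrld
```

Note that filtering after the fact only deletes unwanted characters. A misread character disappears rather than being corrected, which is different from steering the recognizer with a whitelist.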

Solution 6 - Ocr

I am using Ubuntu 18.04.4 LTS, where the default tesseract is version 4. I could not use a whitelist with it, so I upgraded to version 5. The commands below show the output without and then with the whitelist, which worked.

tesseract sample.jpg stdout -l eng --oem 3 --psm 7
Warning: Invalid resolution 0 dpi. Using 70 instead.
LL £036 GL)

tesseract sample.jpg stdout -l eng --oem 3 --psm 7 -c tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
Warning: Invalid resolution 0 dpi. Using 70 instead.
L4036GL


Solution 7 - Ocr

My answer is derived wholly from the accepted answer and is added here for the benefit of .NET Windows developers using the Tesseract NuGet package. However, take note of step 2, which applies to anybody using any kind of Tesseract on Windows.

  1. Create a config folder inside your tessdata folder where the other training data is located.
  2. Add a letters file inside the config folder. Use an editor like TextPad that can save it in UNIX format with ANSI encoding (I had initially tried UTF-8 / IBM PC, and tesseract reported an error in my test output).
  3. Just like your training files, ensure the letters file has its Build Action set to Content in the Properties panel and is marked to be copied to the output directory.
  4. Invoke your tesseract engine class thusly:
 var ocrEng = new TesseractEngine("./tessdata", "eng", EngineMode.Default, "letters");
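If no suitable editor is at hand, the letters file from step 2 can also be produced by a small script that forces UNIX line endings and a single-byte encoding. A Python sketch under the assumption that the whitelist contains only ASCII characters (ASCII is a subset of the Windows ANSI code pages); `write_unix_ascii` is a hypothetical helper and the temporary directory stands in for the real config folder:

```python
import tempfile
from pathlib import Path


def write_unix_ascii(path, text):
    """Write `text` with LF line endings and ASCII encoding (no BOM)."""
    # newline="\n" stops Python from translating "\n" to "\r\n" on Windows.
    with open(path, "w", encoding="ascii", newline="\n") as handle:
        handle.write(text)


# A temporary directory stands in for the tessdata config folder here.
with tempfile.TemporaryDirectory() as tmp:
    letters = Path(tmp) / "letters"
    write_unix_ascii(letters, "tessedit_char_whitelist abcdefghijklmnopqrstuvwxyz\n")
    data = letters.read_bytes()
```

Opening the file with `encoding="ascii"` also makes the script fail loudly if a non-ASCII character slips into the whitelist, instead of silently writing a multi-byte sequence.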

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Danilo Bargen | View Question on Stackoverflow
Solution 1 - Ocr | Blomman | View Answer on Stackoverflow
Solution 2 - Ocr | Bartłomiej Uliasz | View Answer on Stackoverflow
Solution 3 - Ocr | jmunsch | View Answer on Stackoverflow
Solution 4 - Ocr | user3244591 | View Answer on Stackoverflow
Solution 5 - Ocr | Andrew Ravus | View Answer on Stackoverflow
Solution 6 - Ocr | us2018 | View Answer on Stackoverflow
Solution 7 - Ocr | bkwdesign | View Answer on Stackoverflow