Add support for upcoming tesseract v5

Created by: LiteracyFanatic

The first commit makes tesseract read its parameters from a config file. The following parameters are of particular interest for CJK languages, especially when the text is written vertically. Unfortunately, automatic script and orientation detection is only supported by the legacy engine. The newer LSTM models usually give better results, but the orientation has to be specified manually.

# Tells tesseract whether the script is horizontal (6) or vertical (5)
tessedit_pageseg_mode
# Don't add spaces between characters
preserve_interword_spaces
# Toggling these options on or off may improve the output in some circumstances
paragraph_text_based
textord_old_baselines
lstm_use_matrix
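
For illustration, a config file for vertical CJK text might look roughly like the sketch below. The values are my assumptions, not the ones shipped in this merge request: 5 selects the vertical block mode mentioned above, and the three toggles are shown switched off only as a starting point for experimenting.

```
# Hypothetical example values; adjust per language and layout
# 5 = single uniform block of vertically aligned text, 6 = single uniform block
tessedit_pageseg_mode 5
# Don't add spaces between characters
preserve_interword_spaces 1
# Try flipping these between 0 and 1 if the output looks wrong
paragraph_text_based 0
textord_old_baselines 0
lstm_use_matrix 0
```

Each line is a parameter name followed by its value, and lines starting with # are treated as comments.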

I've been experimenting with the in-development tesseract v5 on my system, so the second commit makes crow compile against it. The v5 API is still very much a moving target, so you might prefer to wait for a stable release first. That being said, it isn't too bad, since crow only uses a small subset of the API.

Thoughts?