A simple project that demonstrates how to use a Python OCR library to recognize text in PDF files.
- Install Tesseract (on Windows)
- wiki url
- download url
- Install the downloaded binary file.
- Install the Python library:
pip install pytesseract
- Set env path:
- Add "C:/Program Files/Tesseract-OCR" to PATH
- Test the installation:
tesseract --version
- Create a "tessdata" folder:
d:
md tessdata
- Set the environment variable:
variable name: TESSDATA_PREFIX
variable value: D:/tessdata
- Download the language data for Tesseract:
d:
cd tessdata
wget https://github.com/tesseract-ocr/tessdata/raw/main/chi_sim.traineddata
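The language-data step can also be checked, or repeated, from Python. The following is a minimal sketch using only the standard library. The helper names `traineddata_url`, `missing_languages`, and `download_language` are hypothetical and not part of any library, and the URL pattern is assumed to match the `chi_sim` link above for other languages as well.

```python
import os
import urllib.request
from pathlib import Path

# Base URL taken from the chi_sim download link in this README.
TESSDATA_BASE_URL = "https://github.com/tesseract-ocr/tessdata/raw/main"


def traineddata_url(lang):
    """Build the download URL for a language (assumes the same URL pattern)."""
    return f"{TESSDATA_BASE_URL}/{lang}.traineddata"


def missing_languages(tessdata_dir, langs):
    """Return the languages whose .traineddata file is not in tessdata_dir."""
    folder = Path(tessdata_dir)
    return [lang for lang in langs if not (folder / f"{lang}.traineddata").is_file()]


def download_language(lang, tessdata_dir=None):
    """Download one language file into tessdata_dir (default: TESSDATA_PREFIX)."""
    # Falls back to D:/tessdata, the folder created in the steps above.
    folder = Path(tessdata_dir or os.environ.get("TESSDATA_PREFIX", "D:/tessdata"))
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / f"{lang}.traineddata"
    urllib.request.urlretrieve(traineddata_url(lang), target)
    return target
```

For example, `missing_languages("D:/tessdata", ["chi_sim"])` should return an empty list once the download above has finished; if it does not, `download_language("chi_sim")` fetches the file again.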
- Install Poppler (on Windows)
- wiki url
- download url
- Extract the downloaded file to "D:/tools/poppler/"
- Set env path:
- Add "D:/tools/poppler/Release-24.07.0-0/poppler-24.07.0/Library/bin" to PATH
- Test the installation:
pdftoppm -v
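Both PATH entries can be verified in one go from Python. This is a small sketch using only the standard library; `missing_tools` is a hypothetical helper name, and it assumes the executables are named `tesseract` and `pdftoppm`, as in the test commands above.

```python
import shutil

# Command-line tools that the steps above add to PATH.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    # shutil.which also resolves Windows extensions such as .exe via PATHEXT.
    return [name for name in tools if shutil.which(name) is None]
```

Calling `missing_tools()` in a new terminal should return an empty list. Any name it does return points to a PATH entry that is missing or was added after the terminal was opened.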
- Install the required Python libraries:
pip install pdf2image pillow
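With everything installed, the pieces might be combined as in the sketch below. It is an illustration under assumptions, not necessarily how this project's own script works: `ocr_pdf` and `join_pages` are hypothetical names, and the `chi_sim` default and the 300 DPI setting are choices made for this example. It uses `convert_from_path` from pdf2image and `image_to_string` from pytesseract.

```python
def join_pages(page_texts):
    """Label each page's text so the output can be traced back to the PDF."""
    return "\n".join(
        f"--- page {number} ---\n{text.strip()}"
        for number, text in enumerate(page_texts, start=1)
    )


def ocr_pdf(pdf_path, lang="chi_sim", dpi=300):
    """Render each PDF page to an image and run Tesseract OCR on it."""
    # Imported here so this file still loads before the libraries are installed.
    import pytesseract
    from pdf2image import convert_from_path

    # pdf2image calls Poppler's tools, so Poppler must be on PATH.
    pages = convert_from_path(pdf_path, dpi=dpi)
    # lang must match a .traineddata file in the TESSDATA_PREFIX folder.
    return join_pages(pytesseract.image_to_string(page, lang=lang) for page in pages)
```

A call such as `print(ocr_pdf("sample.pdf"))` would print the recognized text page by page, where `sample.pdf` is a placeholder file name. A higher `dpi` tends to help with small print at the cost of slower processing.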