bumblezhou/ocr_study

ocr_study

A simple project demonstrating how to use a Python OCR library to recognize text in PDF files.

How to install:

  1. Install Tesseract (on Windows)
  • wiki url
  • download url
  • Install the downloaded binary file.
  • Install the python lib:
    pip install pytesseract
  • Set env path:
    • add "C:/Program Files/Tesseract-OCR" to PATH
  • Test the installation:
    tesseract --version
  • Create a "tessdata" folder:
    d:
    md tessdata
  • Set an environment variable:
    variable name: TESSDATA_PREFIX
    variable value: D:/tessdata
  • Download the Simplified Chinese language data (chi_sim) for Tesseract:
    d:
    cd tessdata
    wget https://github.com/tesseract-ocr/tessdata/raw/main/chi_sim.traineddata
  2. Install Poppler (on Windows)
  • wiki url
  • download url
  • extract the downloaded file to "D:/tools/poppler/"
  • Set env path:
    • add "D:/tools/poppler/Release-24.07.0-0/poppler-24.07.0/Library/bin" to PATH
  • Test the installation:
    pdftoppm -v
  3. Install the required Python libraries:
    pip install pdf2image pillow
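
Once the steps above are done, the setup can be checked and used from Python. The following is a minimal sketch, not the project's own code: the function names `check_setup` and `ocr_pdf` are hypothetical, as are the defaults of `chi_sim` for the language and 300 for the DPI. It assumes `tesseract` and `pdftoppm` are on PATH and that `TESSDATA_PREFIX` points at the folder containing the downloaded `.traineddata` file.

```python
import os
import shutil
from pathlib import Path


def check_setup(lang="chi_sim", env=None, which=shutil.which):
    """Return a list of problems found; an empty list means the setup looks fine.

    `env` and `which` are parameters only so the check can be exercised
    without touching the real environment (an assumption of this sketch).
    """
    env = os.environ if env is None else env
    problems = []
    # Both command-line tools must be reachable through PATH.
    for tool in ("tesseract", "pdftoppm"):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    # The language data file is expected directly inside TESSDATA_PREFIX.
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        problems.append("TESSDATA_PREFIX is not set")
    elif not (Path(prefix) / f"{lang}.traineddata").is_file():
        problems.append(f"{lang}.traineddata not found in {prefix}")
    return problems


def ocr_pdf(pdf_path, lang="chi_sim", dpi=300):
    """Render each PDF page to an image and return the recognized text per page."""
    # Imported lazily so check_setup() also works before the libraries are installed.
    import pytesseract
    from pdf2image import convert_from_path

    pages = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
    return [pytesseract.image_to_string(page, lang=lang) for page in pages]
```

A typical use would be to call `check_setup()` first and, if it returns an empty list, call `ocr_pdf("sample.pdf")`, where `sample.pdf` is a placeholder for your own file. A higher `dpi` generally gives the OCR engine more detail to work with, at the cost of slower page rendering.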
