This repository was archived by the owner on Mar 28, 2026. It is now read-only.

Add training and export functionality with Python bindings #105

Merged: mosuka merged 2 commits into main from training on Oct 2, 2025

Conversation

mosuka (Member) commented on Oct 2, 2025

This commit introduces model training and dictionary export capabilities
to lindera-python, enabling users to train custom morphological analysis
models from annotated corpus data.

Features:

  • Add train() function to train CRF-based models from corpus
    • Supports L1 regularization, configurable iterations, and multi-threading
    • Accepts seed lexicon, corpus, character/unknown word/feature definitions
  • Add export() function to export trained models to dictionary files
    • Generates lex.csv, matrix.def, unk.def, char.def
    • Optional metadata.json update support
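As a quick sanity check after an export, the sketch below verifies that an output directory contains the four files listed above. It does not call lindera-python itself. The helper name `missing_dictionary_files` and the idea of passing the output directory as a path are assumptions made for illustration, and `metadata.json` is left out because the description marks it as optional.

```python
from pathlib import Path

# File names taken from the PR description of export() output.
EXPECTED_FILES = ("lex.csv", "matrix.def", "unk.def", "char.def")


def missing_dictionary_files(output_dir):
    """Return the expected dictionary files that are absent from output_dir.

    Hypothetical helper, not part of lindera-python.
    """
    root = Path(output_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means all four files are present; anything else names the files that the export step did not produce.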

Implementation:

  • New src/trainer.rs module with PyO3 bindings for train/export
  • Add 'train' feature flag in Cargo.toml (requires lindera/train)
  • Use local lindera path (../lindera/lindera) for latest trainer API
  • Add num_cpus dependency for automatic thread detection

Documentation:

  • Update README.md with training/export usage examples
  • Add examples/train_and_export.py with complete workflow demonstration
  • Add tests/test_trainer.py with comprehensive test coverage
  • Corpus format follows lindera/resources/training conventions
    (tab-separated surface + features with EOS markers)
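To make the corpus format concrete, here is a minimal sketch of a reader for it, assuming one token per line as `surface<TAB>features`, a comma-separated feature string, and a line containing only `EOS` at the end of each sentence. The comma separator and the function name `parse_corpus` are assumptions; the authoritative format is whatever lindera/resources/training contains.

```python
def parse_corpus(text):
    """Split corpus text into sentences of (surface, features) pairs.

    Hypothetical helper for illustration, not part of lindera-python.
    """
    sentences, current = [], []
    for line in text.splitlines():
        if not line:
            continue  # skip blank lines
        if line == "EOS":
            # End-of-sentence marker: close the current sentence.
            if current:
                sentences.append(current)
            current = []
            continue
        surface, sep, features = line.partition("\t")
        if not sep:
            raise ValueError(f"expected a tab-separated line, got: {line!r}")
        # Assumption: features are comma-separated.
        current.append((surface, features.split(",")))
    if current:
        sentences.append(current)  # tolerate a missing final EOS
    return sentences
```

A reader like this can be handy for validating an annotated corpus before handing it to training, since a line without a tab is reported immediately instead of surfacing later as a training error.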

Changes:

  • Modified: Cargo.toml, src/lib.rs, README.md
  • Added: src/trainer.rs, examples/train_and_export.py, tests/test_trainer.py
  • Updated: Cargo.lock, poetry.lock, pyproject.toml, Makefile

mosuka merged commit f8a3652 into main on Oct 2, 2025
5 checks passed
mosuka deleted the training branch on October 2, 2025 at 23:09