[DRAFT] PEG Parser via rust-peg #4423
Conversation
Hey @qingshi163! Before you get going, we should probably decide how we want to go about this. I wanted to leave this comment early so we don't end up duplicating work unnecessarily.
I hope we can share a parser across multiple projects.
From what I can tell about rust-peg, it is not 100% compatible with the pegen grammar; it feels stricter about left-recursion protection. Left recursion needs a cache to prevent infinite loops; I'm not sure how pegen implements that, but it doesn't require an explicit memo mark. rust-peg also lacks the ability to handle a stream of tokens (iterators), but it works well with our lexer and tokens via Rust match patterns. So far I haven't hit any issue that can't be overcome.
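For illustration, here is a minimal sketch of what "matching on tokens" looks like in rust-peg: instead of a `&str`, the grammar consumes a slice of already-lexed tokens and matches them with Rust patterns. The token type and grammar below are made up for this example (they are not the types from this PR), and it assumes a recent version of the `peg` crate as a dependency.

```rust
// Hypothetical token type standing in for the real lexer's `Tok`.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Tok {
    Name(char),
    Comma,
    Newline,
}

peg::parser! {
    // `for [Tok]` means the input is a token slice rather than a string,
    // so the lexer runs first and rules match on token patterns.
    grammar toy() for [Tok] {
        pub rule names() -> Vec<char>
            = ns:(name() ** [Tok::Comma]) [Tok::Newline] { ns }

        rule name() -> char
            = [Tok::Name(c)] { c }
    }
}

fn main() {
    let toks = vec![Tok::Name('a'), Tok::Comma, Tok::Name('b'), Tok::Newline];
    let parsed = toy::names(&toks).unwrap();
    assert_eq!(parsed, vec!['a', 'b']);
    println!("{:?}", parsed);
}
```

Because the input is a slice rather than an iterator, the whole token vector has to be materialized up front, which is the "no streaming tokens" limitation mentioned above.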
The thing I'd love to understand sooner rather than later is whether moving to rust-peg (or a PEG parser more broadly) will improve or hurt parser performance.
IIRC the PEP on the PEG parser touched on this; I remember seeing that speed was along the same lines (here's the relevant section) but memory requirements jumped up. That involved pegen, which might be slightly faster since it's geared towards Python parsing; something like rust-peg may or may not introduce additional regressions, I'm personally not aware. Without an implementation of our own, of course, it's hard to have any concrete answer. Also, the baseline used in the PEP was an LL(1) parser, so I can't say how much we can draw from their results.
The PEG parser is now working with basic features. I also had a quick look at the option of writing a Rust backend for pegen; this is what I think:
@qingshi163 - In your opinion, is this in a state where I could run some benchmarks on it? (Are there unsupported language features?) Separately, do you have any enumeration of the outstanding work? I'd love to help out.
I'm pretty sure that's what it is. Either way, I say keep going with this approach.
Might try to get Ruff running with this today on some basic Python files, just to explore.
I got this running with Ruff. There are definitely limitations and I'm happy to file issues for them, but I assume it's too early for that -- just let me know. I created a 40,000 line Python file with only simple top-level statements (dictionary and list creations, a few imports) and benchmarked parsing it before and after. This is not meant to be discouraging of this work (which I find impressive), but rather to try to understand the performance characteristics early on so that they can factor appropriately into discussions and decisions.
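As a rough idea of how such an input could be produced (this is my own guess at the shape of the benchmark file, not the one actually used above), a small generator for a ~40,000-line Python source made only of simple top-level statements might look like this:

```rust
use std::fmt::Write;

// Generate a large Python file consisting only of simple top-level
// statements, to stress the parser on sheer input volume.
fn main() -> std::io::Result<()> {
    let mut src = String::new();
    writeln!(src, "import os\nimport sys").unwrap();
    for i in 0..10_000 {
        // Four simple statements per iteration => roughly 40,000 lines total.
        writeln!(src, "x_{i} = {{'a': {i}, 'b': [1, 2, 3]}}").unwrap();
        writeln!(src, "y_{i} = [{i}, {i} + 1]").unwrap();
        writeln!(src, "z_{i} = x_{i}['a']").unwrap();
        writeln!(src, "w_{i} = len(y_{i})").unwrap();
    }
    std::fs::write("bench_input.py", src)
}
```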
Sorry, I meant #[cache_left_rec], which is implicitly handled in pegen via the left_recursive flag.
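For reference, a minimal sketch of what that attribute looks like (a toy grammar, not from this PR): a left-recursive rust-peg rule only terminates if it is annotated with #[cache_left_rec], whereas pegen memoizes such rules automatically.

```rust
peg::parser! {
    grammar arith() for str {
        rule num() -> i64
            = n:$(['0'..='9']+) {? n.parse().or(Err("i64")) }

        // `expr` references itself in the first position (left recursion).
        // Without #[cache_left_rec] rust-peg would recurse forever here;
        // with it, the result is memoized and grown iteratively, yielding a
        // left-associative parse: 1 - 2 - 3 == (1 - 2) - 3.
        #[cache_left_rec]
        pub rule expr() -> i64
            = l:expr() "-" r:num() { l - r }
            / num()
    }
}

fn main() {
    assert_eq!(arith::expr("1-2-3").unwrap(), -4);
}
```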
It is maybe too early for performance work. Also, right now we are visiting tokens by reference because Tok is not copyable; this could easily be solved by putting the data in a separate place (see the sketch below). And our AST uses an empty Vec rather than NULL (which pegen uses), so we can probably simplify some of the grammar.
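To make the by-reference point concrete, here is a rough sketch (hypothetical names, not the actual RustPython types) of what "putting the data in a separate place" could look like: the heavy payloads (names, numbers) live in a side table, so the token the parser sees is a small Copy value it can consume by value.

```rust
// Hypothetical split of the lexer's token type: payload-carrying parts are
// stored once in `TokenData`, and the token stream only carries small Copy
// indices into that table.
#[derive(Clone, Copy, Debug, PartialEq)]
enum TokKind {
    Name { id: u32 }, // index into TokenData::names
    Int { id: u32 },  // index into TokenData::ints
    Plus,
    Newline,
}

#[derive(Default, Debug)]
struct TokenData {
    names: Vec<String>,
    ints: Vec<i64>,
}

impl TokenData {
    fn push_name(&mut self, s: &str) -> TokKind {
        self.names.push(s.to_owned());
        TokKind::Name { id: (self.names.len() - 1) as u32 }
    }

    fn push_int(&mut self, n: i64) -> TokKind {
        self.ints.push(n);
        TokKind::Int { id: (self.ints.len() - 1) as u32 }
    }
}

fn main() {
    let mut data = TokenData::default();
    // What a lexer might emit for `x + 1`: tokens are Copy, payloads live in `data`.
    let toks: Vec<TokKind> = vec![data.push_name("x"), TokKind::Plus, data.push_int(1)];
    println!("tokens: {:?}, payloads: {:?}", toks, data);
}
```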
You might want to consider removing the LALRPOP-related steps from the build setup.
Besides the match statement, everything else should parse. I'm stuck on interactive mode; I don't know what tricky thing we need to do for incomplete lines. @charliermarsh, will you be able to test the parser?
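On the interactive-mode question, one common approach (just a sketch of the general idea, not this PR's code) is to distinguish a parse error at end-of-input from one in the middle: if the parser fails exactly where the input ends, the REPL treats the statement as incomplete and prompts for another line instead of reporting a syntax error. The helper below is hypothetical; it only assumes the parser can report the offset at which it gave up (rust-peg's ParseError exposes a location, for example).

```rust
// Sketch: classify a parse failure for a REPL given the error's byte offset.
#[derive(Debug, PartialEq)]
enum ReplAction {
    NeedMoreInput, // error at EOF: the block isn't finished yet
    SyntaxError,   // error before EOF: a genuine syntax error
}

fn classify(source: &str, error_offset: usize) -> ReplAction {
    // Trailing whitespace/newlines don't count as "more input after the error".
    if error_offset >= source.trim_end().len() {
        ReplAction::NeedMoreInput
    } else {
        ReplAction::SyntaxError
    }
}

fn main() {
    // `if x:` with no body: the parser would fail at the end of the input.
    let src = "if x:\n";
    assert_eq!(classify(src, src.len()), ReplAction::NeedMoreInput);

    // `if x ==:` fails in the middle, so it is a real syntax error.
    let src2 = "if x ==:\n";
    assert_eq!(classify(src2, 7), ReplAction::SyntaxError);
}
```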
@qingshi163 - Sure, I can test it today. Would that be helpful at this point?
Ok, just ran this over Ruff. Out of the box, we have 84 test failures, 47 of which are fixture failures (i.e. the linter producing different results). 914 tests passed -- which is great! In a lot of cases, the row and column numbers have changed slightly; I'd really love these to be consistent with the old parser and/or with CPython, but I can also help fix these up once we get that far. A few files are failing to parse -- here they are: E999.py (this actually does have a syntax error, but the old parser returned an error rather than panicking):
Okay, I also got Ruff running over CPython using this parser (the above analysis was just Ruff's test suite). The parser failed on the following files as of CPython's
It's about 8-10x slower on the existing Ruff benchmark right now.
I have a few guesses as to where some of the parser failures could be coming from:
@charliermarsh, please can you run the test? Thanks.
@qingshi163 - Out of the box, we have 22 test failures (woo!) with 1050 tests passing. Last time, we were at 84 failures, so that's a big improvement! If I run over CPython, it throws a parse error on these files (haven't looked at them at all, just sharing the raw list): Performance looks similar to the last run, so ~8x or so slower.
The CPython lib2to3-related tests are failing because those files are Python 2 code.
@charliermarsh, will you run the test again?
@qingshi163, could you join our new Discord server? I'd like to set up a closer communication group for the parser.

No description provided.