[DRAFT] PEG Parser via rust-peg #4423
Conversation
Hey @qingshi163! Before you get going, we should probably decide how we want to go about this. I wanted to leave this comment early so we don't end up duplicating work unnecessarily.
I hope we can share a parser across multiple projects.
From what I can tell about rust-peg, it is not 100% compatible with the pegen grammar; it feels stricter about left-recursion protection. Left recursion needs a cache to prevent infinite loops; I'm not sure how pegen implements that, but it doesn't require an explicit memo mark. rust-peg also lacks the ability to handle a stream of tokens (iterators), but it works well with our lexer and tokens via Rust match patterns. So far I haven't hit any issue that can't be overcome.
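For illustration, here is a minimal sketch of what "matching on tokens" looks like in rust-peg: instead of a `&str`, the grammar consumes a slice of already-lexed tokens and matches them with Rust patterns. The token type and grammar below are made up for this example (they are not the types from this PR), and it assumes a recent version of the `peg` crate as a dependency.

```rust
// Hypothetical token type standing in for the real lexer's `Tok`.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Tok {
    Name(char),
    Comma,
    Newline,
}

peg::parser! {
    // `for [Tok]` means the input is a token slice rather than a string,
    // so the lexer runs first and rules match on token patterns.
    grammar toy() for [Tok] {
        pub rule names() -> Vec<char>
            = ns:(name() ** [Tok::Comma]) [Tok::Newline] { ns }

        rule name() -> char
            = [Tok::Name(c)] { c }
    }
}

fn main() {
    let toks = vec![Tok::Name('a'), Tok::Comma, Tok::Name('b'), Tok::Newline];
    let parsed = toy::names(&toks).unwrap();
    assert_eq!(parsed, vec!['a', 'b']);
    println!("{:?}", parsed);
}
```

Because the input is a slice rather than an iterator, the whole token vector has to be materialized up front, which is the "no streaming tokens" limitation mentioned above.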
The thing I'd love to understand sooner rather than later is whether moving to rust-peg (or a PEG parser more broadly) will improve or hurt parser performance.
IIRC the PEP on the PEG parser touched on this; I remember seeing that speed was along the same lines (here's the relevant section) but memory requirements jumped up. That involved pegen, which might be slightly faster since it's geared towards Python parsing; something like rust-peg may or may not introduce additional regressions, I'm personally not aware. Without an implementation of our own, of course, it's hard to have any concrete answer. Also, the baseline used in the PEP was an LL(1) parser, so I can't say how much we can draw from their results.
The PEG parser is now working with basic features. I also had a quick look at the option of writing a Rust backend for pegen; this is what I think:
@qingshi163 - In your opinion, is this in a state where I could run some benchmarks on it? (Are there unsupported language features?) Separately, do you have any enumeration of the outstanding work? I'd love to help out.
I'm pretty sure that's what it is. Either way, I say keep going with this approach.
Might try to get Ruff running with this today on some basic Python files, just to explore.
I got this running with Ruff. There are definitely limitations and I'm happy to file issues for them, but I assume it's too early for that -- just let me know. I created a 40,000 line Python file with only simple top-level statements (dictionary and list creations, a few imports) and benchmarked parsing it before and after. This is not meant to be discouraging of this work (which I find impressive), but rather to try to understand the performance characteristics early on so that they can factor appropriately into discussions and decisions.
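As a rough idea of how such an input could be produced (this is my own guess at the shape of the benchmark file, not the one actually used above), a small generator for a ~40,000-line Python source made only of simple top-level statements might look like this:

```rust
use std::fmt::Write;

// Generate a large Python file consisting only of simple top-level
// statements, to stress the parser on sheer input volume.
fn main() -> std::io::Result<()> {
    let mut src = String::new();
    writeln!(src, "import os\nimport sys").unwrap();
    for i in 0..10_000 {
        // Four simple statements per iteration => roughly 40,000 lines total.
        writeln!(src, "x_{i} = {{'a': {i}, 'b': [1, 2, 3]}}").unwrap();
        writeln!(src, "y_{i} = [{i}, {i} + 1]").unwrap();
        writeln!(src, "z_{i} = x_{i}['a']").unwrap();
        writeln!(src, "w_{i} = len(y_{i})").unwrap();
    }
    std::fs::write("bench_input.py", src)
}
```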
Sorry, I meant #[cache_left_rec], which is implicitly handled in pegen via the left_recursive flag.
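For reference, a minimal sketch of what that attribute looks like (a toy grammar, not from this PR): a left-recursive rust-peg rule only terminates if it is annotated with #[cache_left_rec], whereas pegen memoizes such rules automatically.

```rust
peg::parser! {
    grammar arith() for str {
        rule num() -> i64
            = n:$(['0'..='9']+) {? n.parse().or(Err("i64")) }

        // `expr` references itself in the first position (left recursion).
        // Without #[cache_left_rec] rust-peg would recurse forever here;
        // with it, the result is memoized and grown iteratively, yielding a
        // left-associative parse: 1 - 2 - 3 == (1 - 2) - 3.
        #[cache_left_rec]
        pub rule expr() -> i64
            = l:expr() "-" r:num() { l - r }
            / num()
    }
}

fn main() {
    assert_eq!(arith::expr("1-2-3").unwrap(), -4);
}
```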
It is maybe too early for performance work. Also, right now we are visiting tokens by reference because Tok is not copyable; this could easily be solved by putting the data in a separate place (see the sketch below). And our AST uses an empty Vec rather than NULL (which pegen uses), so we can probably simplify some of the grammar.
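To make the by-reference point concrete, here is a rough sketch (hypothetical names, not the actual RustPython types) of what "putting the data in a separate place" could look like: the heavy payloads (names, numbers) live in a side table, so the token the parser sees is a small Copy value it can consume by value.

```rust
// Hypothetical split of the lexer's token type: payload-carrying parts are
// stored once in `TokenData`, and the token stream only carries small Copy
// indices into that table.
#[derive(Clone, Copy, Debug, PartialEq)]
enum TokKind {
    Name { id: u32 }, // index into TokenData::names
    Int { id: u32 },  // index into TokenData::ints
    Plus,
    Newline,
}

#[derive(Default, Debug)]
struct TokenData {
    names: Vec<String>,
    ints: Vec<i64>,
}

impl TokenData {
    fn push_name(&mut self, s: &str) -> TokKind {
        self.names.push(s.to_owned());
        TokKind::Name { id: (self.names.len() - 1) as u32 }
    }

    fn push_int(&mut self, n: i64) -> TokKind {
        self.ints.push(n);
        TokKind::Int { id: (self.ints.len() - 1) as u32 }
    }
}

fn main() {
    let mut data = TokenData::default();
    // What a lexer might emit for `x + 1`: tokens are Copy, payloads live in `data`.
    let toks: Vec<TokKind> = vec![data.push_name("x"), TokKind::Plus, data.push_int(1)];
    println!("tokens: {:?}, payloads: {:?}", toks, data);
}
```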
You might want to consider removing the LALRPOP-related steps from the build setup.
Besides the match statement, everything else should parse. I'm stuck on interactive mode; I don't know what tricky thing we need to do for incomplete lines. @charliermarsh, will you be able to test the parser?
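On the interactive-mode question, one common approach (just a sketch of the general idea, not this PR's code) is to distinguish a parse error at end-of-input from one in the middle: if the parser fails exactly where the input ends, the REPL treats the statement as incomplete and prompts for another line instead of reporting a syntax error. The helper below is hypothetical; it only assumes the parser can report the offset at which it gave up (rust-peg's ParseError exposes a location, for example).

```rust
// Sketch: classify a parse failure for a REPL given the error's byte offset.
#[derive(Debug, PartialEq)]
enum ReplAction {
    NeedMoreInput, // error at EOF: the block isn't finished yet
    SyntaxError,   // error before EOF: a genuine syntax error
}

fn classify(source: &str, error_offset: usize) -> ReplAction {
    // Trailing whitespace/newlines don't count as "more input after the error".
    if error_offset >= source.trim_end().len() {
        ReplAction::NeedMoreInput
    } else {
        ReplAction::SyntaxError
    }
}

fn main() {
    // `if x:` with no body: the parser would fail at the end of the input.
    let src = "if x:\n";
    assert_eq!(classify(src, src.len()), ReplAction::NeedMoreInput);

    // `if x ==:` fails in the middle, so it is a real syntax error.
    let src2 = "if x ==:\n";
    assert_eq!(classify(src2, 7), ReplAction::SyntaxError);
}
```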
@qingshi163 - Sure, I can test it today. Would that be helpful at this point?
Ok, just ran this over Ruff. Out of the box, we have 84 test failures, 47 of which are fixture failures (i.e. the linter producing different results). 914 tests passed -- which is great! In a lot of cases, the row and column numbers have changed slightly; I'd really love these to be consistent with the old parser and/or with CPython, but I can also help fix these up once we get that far. A few files are failing to parse -- here they are: E999.py (this actually does have a syntax error, but the old parser returned an error rather than panicking):
Okay, I also got Ruff running over CPython using this parser (the above analysis was just Ruff's test suite). The parser failed on the following files as of CPython's
It's about 8-10x slower on the existing Ruff benchmark right now.
I have a few guesses as to where some of the parser failures could be coming from:
@charliermarsh, please can you run the test? Thanks.
@qingshi163 - Out of the box, we have 22 test failures (woo!) with 1050 tests passing. Last time, we were at 84 failures, so that's a big improvement! If I run over CPython, it throws a parse error on these files (haven't looked at them at all, just sharing the raw list): Performance looks similar to the last run, so ~8x or so slower.
The CPython lib2to3-related tests are failing because those files are Python 2 code.
@charliermarsh, will you run the test again?
@qingshi163, could you join our new Discord server? I'd like to set up a closer communication group for the parser.

No description provided.