Skip to content

Conversation

@qingshi163
Copy link
Contributor

No description provided.

@DimitrisJim
Copy link
Member

Hey @qingshi163! Before you get going, we should probably decide on if we want to go about this with rust-peg or if we'd like to port pegen over to rust. For the second option, there's already been some discussion. The benefit of going with pegen is that we can keep parity with CPython somewhat easier. The downside is that it might be more work.

I wanted to leave this comment early so we don't go about duplicating things unnecessarily.

@youknowone
Copy link
Member

I hope we can share a parser for multiple projects

@qingshi163
Copy link
Contributor Author

for what I can say with rust-peg,

it is not 100% compatible for pagen grammar, feels like more strictly for left-recursion protection.

the left-recursion need cache to prevent infinite loop which I don't know how implemented by pegen but do not need (memo) mark.

it is lack of ability to handle the stream tokens(Iterators).

it works well with our lexer and tokens via rust match statement.

I not yet have any issue that unable to overcome.

@charliermarsh
Copy link
Collaborator

The thing I'd love to understand sooner rather than later is whether moving to rust-peg (or a PEG parser more broadly) will improve or hurt parser performance.

@DimitrisJim
Copy link
Member

DimitrisJim commented Jan 14, 2023

IIRC the PEP on the PEG parser touched on this, I remember seeing that speed was along the same lines (here's the relevant section) but memory requirements jumped up.

This involved pegen which might be slightly faster since it's geared towards Python parsing, something like rust-peg may or may not introduce some additional regressions, I'm personally not aware.

Without having an implementation of our own, of course, it's hard to have any concrete answer. Also, the baseline used in the PEP was against an LL(1) parser so, can't say how much we can draw from their results.

@qingshi163
Copy link
Contributor Author

qingshi163 commented Jan 27, 2023

So now the peg-parser is working with basic features, also I had a quick look of the solution to write a rust backend for pegen, this is what I think:

  • rust-peg is stronger than pegen with rust type system
  • pegen cannot have all the features we can do in rust-peg that limited by the grammar
  • rust-peg explicitly require the #[cache] #[cache_left_rec] mark while pegen implicitly implementing
  • as we have features like Tuple and Vec we can have better parser writing than c version
  • we will have code compelation etc
  • AST and parser is not something update fast, also it mostly connected with vm to implement the feature
  • we still need to write parser manually with pegen

@charliermarsh
Copy link
Collaborator

@qingshi163 - In your opinion, is this in a state where I could run some benchmarks on it? (Are there unsupported language features?)

Separately, do you have any enumeration of the outstanding work? I'd love to help out.

@DimitrisJim
Copy link
Member

rust-peg explicitly require the #[cache] mark while pegen implicitly implementing

I'm pretty sure that's what (memo) is for in the CPython grammar file. (Unless I misunderstood your point).

Either way, I say keep going with rust-peg. Some point in the future (I'd like to do it when time allows) a pegen port could be implemented and we can just compare implementations and see what is more efficient. For now, having the parser support the newest Python syntax is probably more important.

@charliermarsh
Copy link
Collaborator

Might try to get Ruff running with this today on some basic Python files, just to explore.

@charliermarsh
Copy link
Collaborator

I got this running with Ruff. There are definitely limitations and I'm happy to file issues for them but I assume it's too early for that -- just let me know.

I created a 40,000 line Python file with only simple top-level statements (dictionary and list creations, a few imports, pass, etc.). And it looks like the new parser is about 10x slower than the existing parser.

Before:

Benchmark 1: ./target/release/ruff setup.py -n --silent --select ALL
  Time (mean ± σ):      80.3 ms ±   1.1 ms    [User: 73.4 ms, System: 7.0 ms]
  Range (min … max):    79.4 ms …  85.3 ms    36 runs

After:

Benchmark 1: ./target/release/ruff setup.py -n --silent --select ALL
  Time (mean ± σ):     702.9 ms ±  60.0 ms    [User: 617.1 ms, System: 69.3 ms]
  Range (min … max):   640.8 ms … 851.3 ms    10 runs

This is not to be discouraging of this work (which I find impressive), but rather, to try and understand the performance characteristics early on so that they can factor appropriately into discussion and decisions.

@qingshi163
Copy link
Contributor Author

rust-peg explicitly require the #[cache] mark while pegen implicitly implementing

I'm pretty sure that's what (memo) is for in the CPython grammar file. (Unless I misunderstood your point).

Either way, I say keep going with rust-peg. Some point in the future (I'd like to do it when time allows) a pegen port could be implemented and we can just compare implementations and see what is more efficient. For now, having the parser support the newest Python syntax is probably more important.

sorry I means #[cache_left_rec] whitch is implicitly implemented in pegen via left_recursive flag

@qingshi163
Copy link
Contributor Author

It is maybe too early for performance things, also now we are visiting tokens via reference which is because the Tok is not copyable, this can be easily solve by puting the data to the seperated place.

and our AST using empty Vec rather than NULL(witch used in pegen), so we probably can simplify some of the grammar

@fanninpm
Copy link
Contributor

fanninpm commented Feb 1, 2023

You might want to consider removing the LALRPOP-related steps from .github/workflows/ci.yaml.

@qingshi163
Copy link
Contributor Author

Beside the match statement the others should be able to parser, I stuck on interactive mode, don't know what tricky thing we need to do for the incompleted line.
also the importlib fail to initialize when we do cargo run.

@charliermarsh will you able to test the parser?

@charliermarsh
Copy link
Collaborator

@qingshi163 - Sure I can test it today, would that be helpful at this point?

@charliermarsh
Copy link
Collaborator

Ok, just ran this over Ruff. Out of the box, we have 84 test failures, 47 of which are fixture failures (so the linter producing different results). 914 tests passed -- which is great!

In a lot of cases, the row and column numbers have changed slightly -- I'd really love these to be consistent with the old parser and/or consistent with CPython, but I can also help fix these up once we get that far:

Screen Shot 2023-02-01 at 3 32 19 PM

A few files are failing to parse -- here they are:

E101.py:

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: LexicalError { error: TabsAfterSpaces, location: Location { row: 15, column: 1 } }', /Users/crmarsh/workspace/RustPython/compiler/parser/src/parser.rs:99:61

COM81.py:

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: ParseError { location: "p: 151", expected: ExpectedSet { expected: {"[Pass]", "EOF", "[False]", "[Tilde]", "[EndOfFile]", "[Name { name }]", "[Lsqb]", "[Complex { real, imag }]", "[Tok :: String { value, kind, triple_quoted }]", "[Lpar]", "[Int { value }]", "[Plus]", "[True]", "[Lambda]", "[Float { value }]", "[Await]", "[Ellipsis]", "[Tok :: None]", "[Star]", "[Minus]", "[Break]", "[Not]", "[Continue]"} } }', /Users/crmarsh/workspace/RustPython/compiler/parser/src/parser.rs:102:27

F722.py:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ParseError { location: "p: 0", expected: ExpectedSet { expected: {"[Tok :: None]", "[Name { name }]", "[True]", "[Int { value }]", "[Await]", "[Not]", "[False]", "[Complex { real, imag }]", "[Tilde]", "[Lambda]", "[Ellipsis]", "[Float { value }]", "[Minus]", "[Tok :: String { value, kind, triple_quoted }]", "[Plus]"} } }', /Users/crmarsh/workspace/RustPython/compiler/parser/src/parser.rs:102:27

E999.py (this actually does have a syntax error, but the old parser returned an error rather than panicking):

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: ParseError { location: "p: 6", expected: ExpectedSet { expected: {"[Indent]"} } }', /Users/crmarsh/workspace/RustPython/compiler/parser/src/parser.rs:102:27

@charliermarsh
Copy link
Collaborator

Okay, I also got Ruff running over CPython using this parser (the above analysis was just Ruff's test suite). The parser failed on the following files as of CPython's 442674e37eb84f9da5701412f8ad94e4eb2774fd:

  • cpython/Tools/test2to3/maintest.py
  • cpython/Lib/lib2to3/tests/data/crlf.py
  • cpython/Lib/lib2to3/tests/data/false_encoding.py
  • cpython/Lib/test/test_fstring.py
  • cpython/Lib/test/test_patma.py
  • cpython/Lib/test/test_functools.py
  • cpython/Lib/test/test_grammar.py
  • cpython/Tools/c-analyzer/c_parser/parser/_delim.py
  • cpython/Lib/lib2to3/tests/data/py2_test_grammar.py
  • cpython/Tools/test2to3/test2to3/hello.py
  • cpython/Lib/lib2to3/tests/data/bom.py
  • cpython/Lib/lib2to3/tests/data/different_encoding.py

It's about 8-10x slower than the existing Ruff benchmark right now.

@fanninpm
Copy link
Contributor

fanninpm commented Feb 1, 2023

I have a few guesses for where some of the parser failures could be:

  • Lib/test/test_fstring.py:
    • test_no_escapes_for_braces() — we're probably (erroneously) converting certain escape sequences before we parse the f-string expression
    • test_expressions_with_triple_quoted_strings() — we should look for three ' or " characters in a row in order to terminate that string
  • Lib/test/test_grammar.py:
    • test_var_annot_custom_maps() — does RustPython know how to parse code with annotations in the style of PEP 526?
    • test_var_annot_refleak() — does RustPython know how to parse code with annotations in the style of PEP 526?

@qingshi163
Copy link
Contributor Author

@charliermarsh please can you do the test?

Thanks

@charliermarsh
Copy link
Collaborator

@qingshi163 - Out of the box, we have 22 test failures (woo!) with 1050 tests passing. Last time, we were at 84 failures, so that's a big improvement!

If I run over CPython, it throws a parse error on these files (haven't looked at them at all, just sharing the raw list):

cpython/Lib/lib2to3/tests/data/false_encoding.py
cpython/Lib/lib2to3/tests/data/crlf.py
cpython/Lib/lib2to3/tests/data/py2_test_grammar.py
cpython/Lib/lib2to3/tests/data/bom.py
cpython/Lib/lib2to3/tests/data/different_encoding.py
cpython/Tools/test2to3/test2to3/hello.py
cpython/Tools/test2to3/maintest.py
cpython/Tools/test2to3/test/test_foo.py
cpython/Tools/c-analyzer/c_parser/parser/_delim.py
cpython/Lib/test/test_patma.py
cpython/Lib/test/test_grammar.py
cpython/Lib/test/test_fstring.py

Performance looks similar to last run, so ~8x or so slower than on main.

@qingshi163
Copy link
Contributor Author

@qingshi163 - Out of the box, we have 22 test failures (woo!) with 1050 tests passing. Last time, we were at 84 failures, so that's a big improvement!

If I run over CPython, it throws a parse error on these files (haven't looked at them at all, just sharing the raw list):

cpython/Lib/lib2to3/tests/data/false_encoding.py
cpython/Lib/lib2to3/tests/data/crlf.py
cpython/Lib/lib2to3/tests/data/py2_test_grammar.py
cpython/Lib/lib2to3/tests/data/bom.py
cpython/Lib/lib2to3/tests/data/different_encoding.py
cpython/Tools/test2to3/test2to3/hello.py
cpython/Tools/test2to3/maintest.py
cpython/Tools/test2to3/test/test_foo.py
cpython/Tools/c-analyzer/c_parser/parser/_delim.py
cpython/Lib/test/test_patma.py
cpython/Lib/test/test_grammar.py
cpython/Lib/test/test_fstring.py

Performance looks similar to last run, so ~8x or so slower than on main.

the cpython lib2to3 related tests are failing because these are python2 code.

@qingshi163
Copy link
Contributor Author

@charliermarsh will you do the test again?

@youknowone
Copy link
Member

@qingshi163 could you join our new discord server? I'd like to make a closer communication group for parser.

https://github.com/RustPython/RustPython#community

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants