The example given is:

    whatever / 25 ; # / ; die "this dies!";

This can be parsed into two variants without running any code or knowing the arity of whatever: if whatever takes no arguments, the / is a division and the # starts a comment that swallows the die; if whatever takes a single argument, the / opens a regexp match and the die is live code. The single-argument parse compiles to:

    b  <@> leave[1 ref] vKP/REFC ->(end)
    1     <0> enter ->2
    2     <;> nextstate(main 1 -e:1) v:{ ->3
    6     <1> entersub[t2] vKS/TARG,1 ->7
    -        <1> ex-list K ->6
    3           <0> pushmark s ->4
    4           </> match(/" 25 ; # "/) sM/RTIME ->5
    -           <1> ex-rv2cv sK/128 ->-
    5              <#> gv[*whatever] s ->6
    7     <;> nextstate(main 1 -e:1) v:{ ->8
    a     <@> die[t3] vK/1 ->b
    8        <0> pushmark s ->9
    9        <$> const[PV "this dies!"] s ->a
What does this mean practically? If you want to rename "whatever" to something else, you can do that, since the name of the function is captured no matter how many arguments it takes. If you want to lint all your regular expressions, you can do that too, at the risk of being wrong wherever you guessed a prototype wrong.
The only thing you can't do is figure out exactly how the code will run at runtime. But if you could do that, you wouldn't have a computer program, you'd just have a lookup table.
Another thing to keep in mind: if a static analysis tool can't understand your code, perhaps the future maintainer can't either. In that case, why not rewrite the code to not be ambiguous?
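For instance, here is a minimal sketch of that kind of best-effort static analysis, using the CPAN module PPI (which parses Perl heuristically, without running it); the sample code string is just the example from above:

    use strict;
    use warnings;
    use PPI;    # CPAN: a heuristic, purely static Perl parser

    my $code = 'whatever / 25 ; # / ; die "this dies!";';
    my $doc  = PPI::Document->new( \$code );

    # PPI has to commit to one reading of the ambiguous line; it
    # reports whatever tokens it decided were match-like regexps.
    my $regexps = $doc->find('PPI::Token::Regexp::Match') || [];
    print "found regexp: ", $_->content, "\n" for @$regexps;

Which reading PPI picks here is exactly the guess described above: usually right, occasionally wrong.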
> it proves that Perl can't be parsed unambiguously.
That's what "can't be parsed" means. They're pretty clear about what they mean by something being "parsable"; you're muddying that definition by extending it to "parsable: things that can be vaguely processed most of the time".
This perfectly encapsulates my feelings about the Perl parsing question!
An ambiguous parse is still a parse, and yeah, the 'whatever' example is pathologically unmaintainable code. As you say, if a static parser finds it ambiguous, you should probably fix it.
(Or mark it in some way, you know: "This is ambiguous, but it needs to be because _______" - because frankly, ambiguous parses are destined to be bug sources.)
As one of the comments on the original article points out, this is pretty obvious. You just need a BEGIN { } block containing an eval(STRING), where the string depends on evaluated code (e.g. on a random number).
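A minimal sketch of that construction, reusing the article's ambiguous line (the sub name whatever comes from there; everything else is arbitrary):

    BEGIN {
        # Whether whatever() gets a zero- or one-argument prototype
        # depends on a value known only at runtime:
        if ( rand() < 0.5 ) {
            eval 'sub whatever ()  { 42 }';       # takes no arguments
        }
        else {
            eval 'sub whatever ($) { $_[0] }';    # takes one argument
        }
    }
    whatever / 25 ; # / ; die "this dies!";

Half the time the last line is a division followed by a comment; the other half it is a function call on a regexp match followed by a live die. No static parser can tell you which.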
At least the article doesn't jump to big conclusions and make stupid declarations based on this result.
Unfortunately the commenters do: "This means that things like static code analysis, code transformation and syntax highlighting will never be reliable." So? Almost all code is still easily analysed, syntax highlighting works, etc. It's like stating that you can't do this stuff with C/C++ because the code depends upon #defines (whose values could be variable, or sourced from external processes).
Actually, syntax highlighting + warnings about ambiguity would be just the thing in the article's example. Suggesting that the developer choose between use of an explicit "m" for the pattern matching:
whatever m/ 25 ; # / ; die "this dies!";
...or the use of parentheses for division:
(whatever / 25) ; # / ; die "this dies!";
...would be a helpful suggestion a syntax parser could give. Also, in either event the code is syntactically valid.
I believe JITting (code generation) happens at a later stage, after the source code has been parsed (or perhaps "tokenized"/"lexed" is more appropriate?).
So even an "unreliable" syntax wouldn't matter much, since the JIT kicks in after the parser has delivered its results.
I don't think this result means much for compiler optimization in Perl. Perl is definitely a difficult language to parse, a difficult language to interpret, and its flexibility makes for very tricky compilation (have there been any usable compilers for it?).
BUT - you can't really draw a link between a language that is Turing-complete to parse and one that is difficult to parse. An example would be to take the simplest, most basic language you can think of and add a backtick-like operation: you can put `echo hello` into the source, and the string gets replaced by the output of the enclosed command. The source language instantly becomes Turing-complete to parse (you're having to execute code, after all), but your one-off parser hasn't become much more difficult: you just run the enclosed command and wait for its output.
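A sketch of that preprocessing pass in Perl (the function name is illustrative):

    # Replace each `command` in the source text with the command's
    # output before handing it to the real parser. Deciding what the
    # source "says" now requires running arbitrary programs, so the
    # language is Turing-complete to parse; yet the pass is trivial.
    sub preprocess_source {
        my ($source) = @_;
        $source =~ s{`([^`]*)`}{ my $out = qx($1); chomp $out; $out }ge;
        return $source;
    }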
I feel like the kind of work browsers do when JIT-compiling JavaScript would be really hard if you can't 100% reliably parse the language.
That's irrelevant in this case, because a hypothetical Perl 5 JIT wouldn't do much until the Perl 5 parser finished parsing its code. Remember, all this proof proves is that you cannot produce a single, unambiguous parse tree from every Perl 5 source file because of the syntactic construct that lets you change how the parser works for specific symbols.
Because that's what I took away from reading about C++ templates - the Turing-completeness happens at the parsing/preprocessing stage, before the compiler's code generator even starts translating. Was I wrong? Wouldn't a smart IDE that resolves types (say, for tab completion or type hints), on seeing the Factorial<N> thing, have to solve for the value of Factorial<4>?
No. (You need to parse all the includes to know the typedefs, but that's it.)
C++ is awkward for a different reason, but there's nothing unparseable about it. Resolving types for tab completion/hinting is getting closer to static analysis (at least, with C++ it is), but parsing is well and truly done by that stage.
There's nothing about Factorial<4> that can't be parsed, and nothing about templates or instantiations thereof causes the syntax to be interpreted differently; that kind of feedback is probably what it would take to make parsing require Turing completeness.
Perl, by contrast, is actually unparseable without execution: you can modify the allowed syntax mid-parse, based on some condition known only at runtime. Nothing about a template argument (for example) in some call causes a later template definition to be interpreted according to different syntax rules.
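For the curious, you can watch the parse shift without ever running the program, using the stock B::Deparse module (deparse output paraphrased in the comments; "x" is a stand-in message):

    # Ask perl to re-emit its parse of the same line under two
    # different prototype declarations:
    #
    #   perl -MO=Deparse -e 'sub whatever ();  whatever / 25 ; # / ; die "x"'
    #   perl -MO=Deparse -e 'sub whatever ($); whatever / 25 ; # / ; die "x"'
    #
    # The first deparses as a division, whatever() / 25, with the die
    # swallowed by the comment; the second as a call on a regexp match,
    # whatever(/ 25 ; # /), followed by a live die.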
Your comment lacks context. Do you mean parseability of Perl has never been a goal of Perl? If you read the article, you'll see it was the goal of the author, which is why he wrote this proof. So I'm not exactly clear what you're trying to say.
The point the grandparent was making seems rather simple to me. He says that the designers of Perl (not the author of the article) didn't consider parseability of Perl an objective, since they focused more on making a scripting language for humans.
Of course the author (and many others) would have liked it if Perl were statically parseable, since a lot of static tooling (documentation generators, for example) requires that to be complete.
So it just shows that maybe static parseability should have been an objective for the Perl designers, and should be one for any other language designer in the future.
It should be noted that this is from 2008, and last updated in 2009. I'm not a Perl user, so I don't know if the language has changed enough since then to affect the validity of the proof. We also don't know if it holds for Perl 6.
It also holds for Perl 6, to the same degree: completing a parse can require running some components of the code being compiled. (Perl 6 does make it easier for the compiler to know which code it must run to complete the parse, though.)
Er. This proof is seriously flawed, starting with "what is Perl"? How do we define a language that's not parseable? Does it only mean that you can't know what the program will do until you run it? That's true for Ruby, Python, pretty much everything that has any level of meta-programming. It also lacks any formal definition of what is to be proved. The statement "Perl is unparseable" is not very well defined: "Perl" is the first problem, and the second is "unparseable" - does it mean you can't write a grammar that accepts all valid Perl programs and rejects all invalid ones? Which brings us back to what counts as a "valid" or "invalid" program.
Obviously if the statement is "you can't write any useful linting tools", which seems to be what the author is after, then fine, but you have to define "useful".