Sure, it's much easier to read, especially when it comes to finding and understanding a two-character diff in a 50-char regex.
Sure, I get the benefits of type safety.
Sure, it'll save me time debugging when I accidentally create an invalid regex.
But what am I meant to do with all that time saved? Read a book? Write more code? I don't get it. Let me waste the time on a ludicrously arcane syntax where I spend half the time looking at every bracket trying to understand if it's a control character in that particular context, because the ego trip I get from mastering this ridiculousness is HUGE!
(Yes, I understand regex syntax. I've been able to explain the phrase "zero-width negative lookbehind assertion" for the past twenty years. I inhaled the Friedl book and got utterly high on the idea that the awesome power of regular expressions - which are genuinely great in how they ease flexibility in accepting input - is entwined with their completely inhuman syntax. But I was wrong.)
Verbose does not ‘easier to read’ make, especially when you don’t know whether ‘anythingBut’ means (?!...) or [^...].
Type safety is nice, sure, but it’s a rather small benefit in this case. It doesn’t mean abandoning commonly-understood syntax is worth it. Most regular expressions are short enough to make errors visible with the naked (or IDE-assisted) eye.
This library at best looks like a crutch for a deficient language (which Java admittedly is), and at worst an unnecessary obfuscation layer.
This tool has little to do with Java, except that the author decided to implement it in Java. It's a regular expression composer. You could implement it in any other general-purpose language.
They can be hard to read, but I don't think they are deficient; on the contrary, I think they are very elegant.
Stephen Cole Kleene was a brilliant mathematician, and when he invented regexes in the 1950s he anticipated a lot of concepts that became popular in computer science, such as recursion (a field he also founded, as a branch of mathematics and computer science, together with Alonzo Church, Kurt Gödel and Alan Turing).
Java, on the other hand, has some deficiencies here and there, and it's not really a modern language free from old cruft.
That's like complaining that you can write ugly code in any language. The problem isn't the regular expression, it's that email addresses, while they technically may form a regular language (not sure if they 100% do), form an insanely complicated language, and not a very nice one.
How would you write a specification for that language in any other way that was more elegant? Sure, you could make it more verbose, but that wouldn't make it easier to understand the whole of it, or why it is the way it is.
Almost every RFC writes their grammars in some form of BNF not in regular expressions. RFC are written to be understandable.
> Sure, you could make it more verbose, but that wouldn't make it easier to understand the whole of it, or why it is the way it is.
Absolutely it would. The way to understand a large thing is to understand the smaller components and then put them together. Regular expressions do not compose well.
That is patently untrue. Regular expressions compose under a number of important mathematical operations, such as union, intersection and concatenation. If your PL supports string interpolation, it's trivial to compose them in these ways (well ok, maybe not intersection). Nobody says that your regex needs to be written as a single string.
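To make that concrete in the language at hand, here's a quick sketch with plain java.util.regex; the fragment names are mine, and Java string concatenation stands in for interpolation:

import java.util.regex.Pattern;

public class ComposeDemo {
    public static void main(String[] args) {
        // Small, named fragments instead of one opaque string.
        String scheme = "https?";
        String host   = "[\\w.-]+";
        String port   = "(?::\\d+)?";

        // Concatenation of regexes is concatenation of strings;
        // union is just "|" between fragments.
        Pattern url    = Pattern.compile("^" + scheme + "://" + host + port + "$");
        Pattern either = Pattern.compile("^(?:" + scheme + "|ftp)$");

        System.out.println(url.matcher("https://example.org:8080").matches()); // true
        System.out.println(either.matcher("ftp").matches());                   // true
    }
}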
Haskell absolutely has waaaay too much cruft. Have you read the 30 page articles recommending which extensions to use? Have you ever seen MTL? Read any documentation written by Edward Kmett?
Modules and libraries are not part of the language
Java, for example, can't have proper generics because at the bytecode level (the "real" Java) they are not supported; it can't have static constructors because they would break inheritance, so they had to come up with static blocks; it has no support for a static method with the same signature as an instance method, because the call syntax doesn't differentiate a call to a static method from a call to an instance method, so the compiler can't tell which method is being called; etc., etc.
These are all consequences of the original choices made 25 years ago when they designed the bytecode and the syntactic sugar over the bytecode, and they still live with us today.
IMO, the prevalence of libraries like MTL, or with poor documentation, is a consequence of design decisions as well — maybe not of the abstract language, but at least of the primary implementations.
I think having an EBNF-like syntax would be nice. It's verbose, but (IMO) fairly unambiguous. A side effect would be the ability to define variables (i.e. terms) within the expression.
Yeah; I'd rather have proper multiline strings in Java and a regex documented with the COMMENTS flag set. What I don't need is a regex builder. Or SQL builder, for that matter.
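For anyone who hasn't seen the flag: with Pattern.COMMENTS (inline form (?x)), the engine ignores unescaped whitespace and treats # up to end-of-line as a comment. A sketch of the URL pattern quoted elsewhere in the thread written that way, with concatenation standing in for the multiline strings Java lacks:

import java.util.regex.Pattern;

public class CommentsFlagDemo {
    public static void main(String[] args) {
        Pattern url = Pattern.compile(
                "^ https?         # http or https      \n"
              + "  ://            # literal separator  \n"
              + "  (?: www\\. )?  # optional www.      \n"
              + "  [^\\ ]*        # no spaces allowed  \n"
              + "  $",
                Pattern.COMMENTS);

        System.out.println(url.matcher("https://www.example.org").matches()); // true
    }
}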
They don't want to bloat the language with a million rarely used and hard-to-read syntax features, just as the Go devs do. Not to repeat the failure of Scala (and I'm afraid Rust is going the same way).
Well that’s terribly confusing. So .anythingBut("a").then("a") matches "a"?? That’s so far off from what I’d expect that when I read the example for a URL:
(...).anythingBut(" ").endOfLine()
I defaulted to assuming endOfLine mapped to [^\n]*\n and was left confused about the whole thing. I’ll take simple rigorously defined grammars over someone’s attempt to embed English onto them any day.
I've never had any problems writing and debugging regular expressions after I came across this: https://regex101.com
And since regexes are usually write-once, adding this complexity on top of them serves no additional benefit. If anything, it'd probably make it harder for the next guy to understand your code.
I bought RegexBuddy years ago and have loved it for debugging. However it only runs on Windows. Found regex101 recently and I think it's a great alternative (though I almost didn't check it out because the domain has SEO abuse site vibes).
Have you seen https://regexr.com? I've only come across it recently but I really appreciated the visual description of complicated regex, especially for interpreting someone else's work.
I think that's actually the problem. I'm used to domains that are very precise being used to host junk content. Most popular sites aren't so narrowly focused and/or have some brand-name-like domain.
It is. For the most part, regexes like "\d+" are OK, but when there is something more complicated I pull Verbal Expressions into a project. To this day, reactions in code review have been mostly positive, or neutral at worst. If it were built into the standard library I would probably use it instead of regex, but adding a new dependency and interfacing it with libraries that expect Java regex objects has its cost.
It's as if your Java compiler stopped warning you about forgotten semicolons and instead errored out at runtime when it reached the statement with the missing semicolon.
It's not just your time saved. It's time saved not running the test suite, for example. An uncompilable regex is a category of error that you can ban completely from your program, just as Java bans syntax errors as a category of (runtime) error. It's time saved because no developer can break this in a way that isn't a semantic error. It's time and mind saved not thinking about a whole class of errors.
Thank you for the clarification! To be clear: my post is sarcastic, and I was trying to say that this library looks like a significant usability improvement over traditional regex syntax.
• not having to distinguish special characters in the pattern being matched from special characters part of regex syntax
• no ambiguity as to whether something is a digraph or not
• no escaping hell
• unambiguous human-readable names for all the regex features used
• the ability to use whitespace to clearly separate different parts of the regex
• the ability to comment parts of the regex
It sounds great to me. Have you ever tried making a regex matching something with backslashes in it, and then you have to put that regex inside a string literal? Have you ever had to switch between different regex environments and not known which symbols require escaping, or what is the correct way to write something in a particular environment? I've had all these problems.
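For the record, here is the backslash case in Java specifically; this is plain java.util.regex behaviour, independent of any library:

import java.util.regex.Pattern;

public class BackslashDemo {
    public static void main(String[] args) {
        // The regex engine needs the two characters \\ to match one literal
        // backslash, and each of those needs escaping again in the Java
        // string literal: four backslashes in source per one in the input.
        Pattern oneBackslash = Pattern.compile("\\\\");

        System.out.println(oneBackslash.matcher("\\").matches()); // true
    }
}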
Many of those gains can be had by using first-class and more full-featured regexes, like those that are available in other languages (Ruby, Perl):
- escaping hell isn't that much of a problem, since you're only ever escaping something once (not like a regex in a string)
- several languages support separating regexes across several lines
- regex commenting (including named groups) is a standard feature in many languages, and that's besides using first-class comments across multiple lines
I think you do have a point about digraphs (or homographs), but unless I misunderstand, those would be a problem whether or not the character(s) are part of a string vs. a first-class regex. As for unambiguous human-readable names for regex features used, tools like this (https://regexr.com/) are available and very effective.
I might prefer Java Verbal Expressions over java.util.regex, but to me that's more of a knock on Java and its lack of proper, first-class regexes than anything else.
Is it just me or does this seem like a very bad idea? I mean it seems nicer but the reality is, if you don't know how Regexes work, you won't understand the nuances of the "verbal" regex either... Also, some optimisation maybe?
^(?:http)(?:s)?(?:\:\/\/)(?:www\.)?(?:[^\ ]*)$
could be better written as
^https?://(?:www[.])?[^ ]*$
Or am I missing something? In that case, I'll readily admit this library is a good idea :)
I think it's a great idea... if you already know regex. Effectively it's just a different syntax for the same construct after all, it doesn't simplify anything, it just makes it more readable. Oh and it makes escaping a non-issue, which already almost sells me on the idea completely, since it seems that 50% of the time I spend writing regex is figuring out what needs escaping and how.
Writing regexes is not much of an issue usually (although the many dialects in common use are always a source of frustration) but reading them is always a pain, for me at least. For quick and dirty shell scripts or vim editing it's great, for stuff that's supposed to be long lived and actively maintained in a codebase I think this verbal approach is a great idea, at least in theory.
Regarding the optimization of the intermediate result it should only be a problem if you actually need to output these regexes for other uses or if you need to compile many of them at runtime with performance constraints. If your regexes are pre-compiled then the resulting DFA should look the same as far as I can tell.
If somebody makes a Rust crate with a similar concept I'll be sure to try it out next time I have to write regexes in a codebase.
> I think it's a great idea... if you already know regex
It's actually a bad idea in this case because regex is mostly the same in every modern language, so if you know it, you know it everywhere. What you don't know is this.
I agree with the common complaint that regex is effectively write-only, but this is only half due to its terse syntax. A pattern can be pretty complex on its own, and complex things are hard to understand. Imagine what code matching behavior of a complex regex would look like.
> It's actually a bad idea in this case because regex is mostly the same in every modern language, so if you know it, you know it everywhere. What you don't know is this.
I disagree; at least in my experience there are significant differences between the regex engines I use regularly. In no particular order: are parens and other operators treated literally by default, or do they need to be escaped? Are character classes like '[:alpha:]' understood, or do I need to write them out explicitly? Similarly, do I have access to \w, \W, \s and friends? Can I use + to mean {1,}? Can I use '?' to match 0 or 1 (common), or do I have to use = (vim)? Or maybe just {0,1}? But then, should I escape the braces? Do I have recursion? Do I have named captures?
Those are not theoretical concerns, that's stuff I routinely end up getting wrong because I forget that this one feature that works in pcre does not work in vim or works differently in sed etc...
> are parens and other operators treated literally by default or do they need to be escaped?
> Can I use + to mean {1,} ? Can I use '?' to match 0 or 1 (common) or do I have to use = (vim)? Or maybe just {0,1}? But then should I escape the braces?
I think that's just older tools like vi and sed. Perl, Python, Java, and Javascript use a similar modern version where + and ? work, and parentheses and braces don't need to be escaped.
> if you know it, you know it everywhere. What you don't know is this.
Right, one language might have anythingBut(" ").endofline() and the next language might have a different . operator like anythingBut(" ")->endofline() or it might even require nesting calls. None of these things are a significant hurdle and if we standardize the names (endofline, anythingBut, ...) then you can make the same argument. It's a chicken and egg argument: just use regex because that works everywhere -> it's not universally implemented -> it won't work everywhere.
And aside from that, I have a similar experience to the sibling comment: when using some command line tool that I forgot (is it sed? Vim?) the default is that \( is a capture group whereas in normal regex ( is a capture group. Grep offers you three regex variants to choose from. I have to look up regex syntax or do trial and error every time I don't use a language that I use daily. And I don't know all of regex to begin with, I just know everything I ever needed but people posted examples here with (?:x) which I don't know. I once read it and remembered it for a few days I think... so anyway, consistent and descriptive method names seems a lot easier especially when you consider autocompleting IDEs.
> Is it just me or does this seem like a very bad idea?
It's not just you. As you say this can only truly be used by people who understand regular expressions; and they would most likely prefer not to use this stuff.
It seems the whole IT industry is obsessed with helping us do all sorts of things, even simple things, which in the end often makes things more complex. Different query languages that translate to SQL to help us out, which often create super-complex SQL. All sorts of wrappers to avoid us having to deal with all sorts of formats (JSON/XML..). Hopefully those wrappers do something useful with those date-objects you know you have in there somewhere...
I don't think SQL builders are a good comparison because:
- SQL can already be made fairly readable by default, it's not just a long series of cryptic tokens. The main point of SQL builders is not to make SQL more readable, it's to make SQL approachable by people who don't know SQL.
- There can be several ways of achieving the same result in SQL, with sometimes deep performance implications, so it's really important to understand what is being executed and in what order. Regular languages are much simpler and while the string representation of the regex might end up longer than the handcrafted equivalent, the runtime performance should end up being the same since in the end it's all deterministic finite automatons.
- SQL builders have to be at least a little bit opinionated to be really useful, in general they make it easy to create simple queries but can quickly become limiting for complex queries, especially if you already know SQL. These "verbal expressions" on the other hand can easily map 1:1 with raw regex constructs, allowing somebody who already knows regex to express exactly the same logic, just in a more verbose and human readable way.
This verbose syntax operates at exactly the same level of abstraction as normal regex, it's just a syntactical transform effectively. It's like JSON vs. CBOR or something like that.
> These tools are great for letting someone build something they don't understand
Exactly, how can the same concept re-appear in all sorts of forms in this industry?
Another thing I thought of was Mule, which I used a few years ago (it's hopefully better now). A horrid mess of an Eclipse plugin that drew boxes with arrows between them, where "standard plugins" etc. could be plugged in to transform data and move it from here to there. The problem Mule solved (in our case) was comical, the complexity of the solution was also comical, or tragic; or maybe it was both.
>It's not just you. As you say this can only truly be used by people who understand regular expressions; and they would most likely prefer not to use this stuff.
I know regex and I hate writing it. It's unreadable and I need to spend time remembering/googling/checking the exact syntax. And, of course, the syntax differs from implementation to implementation in subtle but important ways (e.g. needing to double-escape in Python, etc.).
> It's not just you. As you say this can only truly be used by people who understand regular expressions; and they would most likely prefer not to use this stuff.
There's a niche where this might be useful, but by definition it's small. I understand regexes a moderate amount, and can construct arbitrarily complex ones when necessary. But I do it just infrequently enough that it can be painful and halting above a certain level of complexity, with lots of testing and reference-checking. It'd be nice to use something sane like this, and I think I fall squarely into the category of "people who understand regexes but would prefer to use stuff like this". Though as I said, this niche is almost by definition small, and on top of that I can't remember the last time I used Java.
Completely independently, in any non-trivial engineering system, readability is important, and this helps a lot there.
A lot of IT is the parsing and mapping of one kind of language (whether markup, DSL, or Turing-complete) onto another.
Doing it right is a delicate balancing act of being just powerful enough to express everything the user needs without devolving into an unreadable or repetitive mess. Some people manage to achieve neither.
Well, there’s at least one advantage: apparently this builder library automatically escapes literal strings passed to it, so you no longer need to worry about injection bugs if you construct patterns dynamically (cf. parametrised queries versus ‘come on, just use mysql_real_escape_string, it’s not that hard’).
I’m not sure this alone pulls its weight, though; most of the time, regular expressions are fixed at compile time. And I’d still prefer something that mostly preserves commonly-understood pattern syntax. Having to guess whether ‘anythingBut’ means (?!...) or [^...] is not encouraging.
(This was apparently ported from JavaScript, where it is even more pointless: template literals can take care of the escaping part without abandoning standard pattern syntax. But as far as I know, Java has no equivalent feature.)
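To be fair to Java, its standard library does ship the escaping half of this: Pattern.quote wraps a literal in \Q...\E. A minimal sketch of the dynamic-construction case; the input value is invented for illustration:

import java.util.regex.Pattern;

public class QuoteDemo {
    public static void main(String[] args) {
        String userInput = "1.5 * (cost)"; // arrives at runtime

        // Quoted, the . * ( ) in the data cannot act as regex syntax.
        Pattern p = Pattern.compile("^" + Pattern.quote(userInput) + "$");

        System.out.println(p.matcher("1.5 * (cost)").matches()); // true
        System.out.println(p.matcher("1x5 * (cost)").matches()); // false
    }
}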
I think there's a certain use case for this: a moderate regex user who's not an expert and not fully comfortable with regexes but knows the basics, and who is in a project where they need to heavily use regexes for a limited amount of time and will need to maintain that code going forward.
If you use regexes a lot, you are better off learning regexes, if you use regexes a little, this is a lot to learn to avoid learning a little about regexes. But there is a moderate user sweet spot where I could see this useful.
I like it as a simplistic builder. Much easier to read, autocompletes, and (I assume) handles escaping for you (because it knows you put only raw data inside).
Just escaping alone is a big selling point for me.
It's worse. Even if you do know how regexes work, you can still be tripped up by the counterintuitive function names. See this comment[0] and its parent for an example. I think English names with ambiguous or unexpected meanings are worse than symbols, which at least don't carry implied meaning.
No, you are not. These "verbal" expressions are nothing more than a builder for the actual expression. So you can't actually use it without understanding regular expressions.
"These "verbal" expressions are nothing more than a builder for the actual expression."
It may be under the hood, but there's no reason for it to be.
There's nothing inherent in our regexes that would imply they are 'the language' for that purpose; it just so happens we really only have one commonly used one.
Like most things invented forever ago, there might be opportunities for a 'cleaner, better way'.
But regular expressions seem quite well optimized from my point of view.
Regular expressions are used for the exact same task regardless of programming language -- using a single expression language regardless of programming environment seems like a huge advantage. It can be embedded in a configuration file, as a string in a database, on a web page or deep in backend code, and it will still work the same.
The "Java Verbal Expressions" already have "Java" in the name and so are a complete loss when it comes to portability.
Then comes the fact that "Java Verbal Expressions" are many times more code than actual regular expressions. That isn't easier to scan, it is much worse.
Regular expressions are very succinct and you can express a lot in a single line. Comparable JVEs would require many lines and wouldn't be more readable for anybody other than a person who doesn't know regexes at all.
'Using a single language to do everything' could be argued on the software side as well.
It might be possible that s-expression style would work really well for regex, but nobody has really gone through the effort to do it.
Where I think regexes don't do so well is with Unicode (grapheme clusters, true word boundaries), and the difference between match/capture/ignore etc. is also confusing.
I'll bet if you really put your mind to it creatively, you might be able to come up with a novel approach that hasn't really been tried before... but even if it was 'better' it wouldn't catch on for a while (or ever) unless there were some big institutional backers.
The intersection of debugging regexes and debugging code written by someone cycling through autocomplete looking for methods that sound right should not be real. It should be a myth, a region of programmer hell, a scary story to tell children about what will happen to them after they die if they don't document their code. May Dijkstra strike down anyone who succeeds in bringing this horrible idea to production.
Even though I still use regexes in rare circumstances (e.g. inside config files), parser combinators already do a much better job than this (or regexes) when you are writing maintainable code:
warcEntry = do
  header <- warcHeader
  crlf
  body <- do
    contentLength <- getContentLength header
    compressionMode <- getCompressionMode header
    warcbody contentLength compressionMode
  crlf
  crlf
  return (WarcEntry header body)
If you accept crlf as "carriage-return-line-feed", the rest basically reads as pseudocode. crlf could have just as easily been written (string "\r\n") I guess.
Parser combinators can:
* call out to other parsing functions (e.g. warcHeader) - so you can build your code out of testable units.
* bind results to variables and start using them during the parse, e.g. warcHeader returns data containing contentLength and compressionMode, which is then fed to the warcbody function so it knows what to expect.
Parser combinators showed me what an expressive language (in particular Haskell) can do to let you write code as you intend, so that maintainable code is enjoyable to read, write, and reason about.
I have heard that some people first “get” Haskell because of parser combinators.
Yes, regular expressions cannot do that, because regular languages cannot do that. That is an important limitation of regular languages; otherwise you'd be able to match the language {a^n b^n for any n} (or, more generally, balanced parentheses).
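To make the limitation concrete: recognizing {a^n b^n} needs an unbounded counter, which a finite automaton by definition lacks, while a few lines of ordinary code (Java, since that's the thread's context) do it trivially:

public class AnBnDemo {
    // Accepts exactly the strings a^n b^n, n >= 0.
    static boolean matchesAnBn(String s) {
        int as = 0;
        while (as < s.length() && s.charAt(as) == 'a') as++;
        int bs = 0;
        while (as + bs < s.length() && s.charAt(as + bs) == 'b') bs++;
        // Same number of b's as a's, and nothing else in the string.
        return as == bs && as + bs == s.length();
    }

    public static void main(String[] args) {
        System.out.println(matchesAnBn("aaabbb")); // true
        System.out.println(matchesAnBn("aaabb"));  // false
    }
}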
The thing you wrote you wouldn't want to parse with a regular expression because it fundamentally doesn't look like a regular language with its hierarchical structure.
While neat, I think that if you're a developer, you'd be better off learning basic regular expressions instead so you can use them in whatever language you'd like. Depending on this would probably just make moving to a new code base that doesn't use this a lot more confusing.
A normal regex with a comment above it explaining what it does (for complex cases) always worked well for me.
Even as a developer you may have to assemble a regular expression at runtime, at which point a library that can do it for you may be much more handy than having to assemble the string yourself.
And even if you know regex by heart - assembling it with function calls can still be better / safer just like you shouldn't insert SQL parameters by hand into your SQL query strings.
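A sketch of that kind of runtime assembly using nothing but the standard library; the keyword list is invented for illustration:

import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class RuntimeAssemblyDemo {
    public static void main(String[] args) {
        // Imagine these arrive at runtime, e.g. from configuration.
        List<String> keywords = List.of("c++", "c#", "f(x)");

        // Quote each literal before joining into an alternation, so the
        // metacharacters in the data can't change the pattern's meaning.
        String joined = keywords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|", "\\b(?:", ")"));
        Pattern p = Pattern.compile(joined);

        System.out.println(p.matcher("I like c++ a lot").find()); // true
    }
}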
I can't learn them. I've tried for over 20 years and every time I use them the knowledge is deleted from my brain immediately. A library like this would be very helpful if it worked.
One problem is that I'm more likely to need regex almost anywhere but Java code.
I think there is value in both cases. I've seen many developers who have struggled with regex even with all those hundreds of tools for learning and for building/testing regexes. This could be useful for them to start with, and they can learn regex according to their time/needs. I see solutions like this as a choice, and the fact that people are using them shows that there is value in having that choice, even if it is not obvious to us at first glance.
Looking at the example, my immediate reaction was that the main advantage would be the `anything_but` method, relieving me from the cumbersome construction of stuff like this:
(?:[^t]|t(?:[^r]|r(?:[^u]|u(?:[^m]|m[^p]))))
What a time-saver it would be to write
anything_but("trump")
Except, then you look at the source code and see this:
Except not all applications of regular expressions allow for that trick to be used. Case in point: grep (which of course has the -v option which together with pipes does the whole negative match thing much more neatly anyway).
How else would you write that you want to match all strings that don't contain string X? If you were matching at a specific position, you should use a negative lookahead (?!xyz), but I think in some cases you might need the mess above.
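For what it's worth, in engines with lookahead (java.util.regex included) the usual idiom for "string does not contain X" is to test a negative lookahead at every position instead of expanding character classes by hand:

import java.util.regex.Pattern;

public class NotContainingDemo {
    public static void main(String[] args) {
        // At every position, assert that "trump" does not start here,
        // then consume one character.
        Pattern p = Pattern.compile("^(?:(?!trump).)*$");

        System.out.println(p.matcher("triumph").matches());   // true
        System.out.println(p.matcher("a trump b").matches()); // false
    }
}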
If you concede that the standard regex construct [^x] is a useful construct at all -- match any character except x -- why should such a negative match be restricted to only a single character? Why not also allow for, say, matching anything except the following _two_ characters in a row?
Well, then, going by the method's name, it should be useful for any number of characters. They're not random -- they spell out a word you would like not to match.
I'm not sure what you mean with "predictable". Are you referring to readability of the expression? I totally give you that... which kinda was my whole argument to begin with.
It's pretty verbose, but it is useful in the sense that you have type-safety between character groups and the control characters. It's neat that it only allows you to create valid regexes (I hope it does). At least you have static safety that your parentheses for capture groups are properly closed.
This advantage is not explained. Not being able to construct invalid regular expressions is a good static safety guarantee that you don't get when you embed DSLs as strings.
Edit: This is the same reason why we would prefer jOOQ to embedded String-SQL, if speed/dependencies are of no concern. You're not allowed to construct invalid SQL as the java-type-system gives you these guarantees when using an embedded DSL instead of a String-DSL. This is very powerful, but of course only works if the type system of the host language is powerful enough.
Unlike many others, I actually like this idea. I know regular expression, but many of my colleagues do not. They often have a hard time understanding what a particular regex do, event though I often document them step by step.
Something like this would make it more readable.
I do agree with others here, that it seems a bit rough around the edges and some optimisation might be needed. But I think the idea itself is sound.
This is cool, but I'm disappointed to see the horrid builder pattern show up again. Imagine you had to use StringBuilder every time you wanted to manipulate a String?
Just make all fields final and combine the builder and 'working' class into a single immutable object. Like String.
`build()` everywhere is syntactic noise, and you either lose immutable safety (by passing around builders everywhere, as in the examples) or composability (by passing around the 'sealed' objects). Builders are an antipattern that should only be used in cases where extreme performance is required.
With immutable objects, every step in the fluent chain of calls is an independent fork of the full object state. There's no need to use java clone(); each method calls a private constructor that passes the object state (slightly altered, of course).
java.lang.String works exactly this way. You're already used to the pattern.
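A minimal sketch of what that could look like; the class and method names here are hypothetical, not the library's actual API:

import java.util.regex.Pattern;

public final class Re {
    private final String pattern; // immutable state, just like String

    private Re(String pattern) { this.pattern = pattern; }

    public static Re empty() { return new Re(""); }

    // Every method returns a fresh instance; no mutable builder, no build().
    public Re then(String literal) {
        return new Re(pattern + Pattern.quote(literal));
    }

    public Re maybe(String literal) {
        return new Re(pattern + "(?:" + Pattern.quote(literal) + ")?");
    }

    public Pattern compile() { return Pattern.compile(pattern); }
}

Something like Re.empty().then("http").maybe("s").then("://") is a complete, shareable value at every step in the chain, so you keep composability without giving up immutability.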
In the parent's suggestion, you still have immutable objects. "VerbalExpression()" is a valid regex, namely the empty one. Every subsequently called method concatenates some new regex onto the receiver and thus builds a new regex (since regular expressions are closed under concatenation).
Builders are used in Java* when you have an object which is invalid without passing in a bunch of parameters, but you don't want to have to remember the order of the parameters. But this is not one of these cases.
There is one drawback, however: you do have to compile the regex into an FSA at some point, and it wouldn't be good to do that for every intermediate regex. So I assume that the compilation happens in the "build" step. They could have just called it "compile", though.
* named parameters in constructors seem a better alternative in every language that supports them, e.g. Kotlin
I wish you were interviewing me. This is Java world you're talking about and if you can't squeeze a dozen Gamma Design Patterns into your code you aren't good enough.
This project would be better if it wasn't exactly a 1-to-1 mapping from words/methods to regular expressions. For example, the regex "\d+" maps to the code "digits().oneOrMore()". That doesn't read well in English because it's odd to have an adjective after the noun (i.e. we say "red bird" not "bird red").
Also, a serious weakness in regex is they are "write only", or hard to read. That's because they are compact and don't have discernible sections that are then assembled together.
You can do that yourself in Java by assigning chunks of regex to variables and then concatenating them together, but the regex engine doesn't let you do that itself. You can't name sections of the regex or insert comments into it.
The example
^(?:http)(?:s)?(?:\:\/\/)(?:www\.)?(?:[^\ ]*)$
could be better if it could be broken down into named pieces or commented like this:
^
(?:http)(?:s)?   # http or https
(?:\:\/\/)       # ://
(?:www\.)?       # optional www.
(?:[^\ ]*)       # rest of URL (no spaces)
$
Comments should explain the why, not the what. `\d # digit` yes that is what it says, but why a digit? What is your ^, what is (?:) what is $, why is (?:[^\ ]) "rest of URL" [I know the answers to these, the person the comments are intending to help probably doesn't]. Where's the explanation of what it's supposed to be doing with a limited subset of URLs?
# match basic URLs such as:
# http://example.org
# https://www.example.org
# Must have no surrounding spaces, no spaces in the text.
^(?:http)(?:s)?(?:\:\/\/)(?:www\.)?(?:[^\ ]*)$
It's much clearer to me from the one-line regex that it matches "http://////////" and that's probably an error. I don't know why it's clearer; maybe because I'm used to looking at regexes written that way.
Mostly we program in languages that read like a kind of shorthand English, and mature programmers are expected to be able to read the text fluently. WHAT comments tend to be superfluous.
Nobody sane reads 100+ character regexes fluently, so WHAT comments are totally appropriate. Any time I write a nontrivial regular expression, I always try to leave an example of what it would match in a comment. I've thanked last-year me for this more times than I can count.
Huh, I didn't know that. I've read through a fair amount of Java code with regexes and never seen anyone use comments. Maybe it's because Java doesn't have proper multi-line string support built into the language.
If you don't have multi-line support in the language then you're more likely to put the comments outside the string:
String regex =
    "^"
    + "(?:http)(?:s)?"   // http or https
    + "(?:\\:\\/\\/)"    // ://
    + "(?:www\\.)?"      // optional www.
    + "(?:[^\\ ]*)"      // rest of URL (no spaces)
    + "$";
I love your example of how it should be explained. This helps people correlate the verbal aspects to the regex parts they described. This ultimately reinforces and helps people learn regex more deeply.
Once you start thinking about it, it's mind boggling that we have thousands of languages and yet most of them don't have built-in facilities to construct and parse grammars (at least context-free ones). Every single designer seems to think that their language is finally good enough and will not be used as a starting point for another one.
Perhaps Raku would be of interest to you. It was specifically designed to make it a good starting point for its ongoing replacement, Ship of Theseus style.[1]
----
The first half of this comment paints the big picture of Raku's approach to enabling this.
The second shows actual code demonstrating the approach.
----
Raku has a built in grammar + actions formalism, and corresponding built in parsing/compiling machinery.
Raku's grammars/actions are used to declare, parse, and compile Raku -- it is entirely defined in terms of itself.
----
Raku isn't just one language, but a collection of sub-languages, aka slangs. Each of these is comprised of a grammar/action pair. These are mixed together to form a composed result: "Raku".
----
Any slang can include grammar rules that invoke rules in other slangs with which it's mixed. The slangs in standard Raku take advantage of this. Thus a function written in the Main (GPL) slang will accept regexes in some spots in the Main slang, with those regexes being parsed/compiled according to the Regex slang.
This ability for slangs to use each other can be (and is, for the standard slangs) mutually recursive. Thus, just as an ordinary function written in the Main slang can include a regex in its code, so too a regex can include an ordinary function.
This overall approach of weaving slangs together is called a "language braid"[2]. Multiple slangs are seamlessly woven together as if they were just one language -- because in fact they are, despite its declaration being broken out into slang modules.
----
User defined slangs can replace or modify the "official" standard slangs that ship together to comprise "Raku".
Thus, at the most trivial level comes slang modules like Slang::Tuxic.[3] This was one of the first slangs. It is just a few lines of code written by a core Raku dev; it overrides a few existing minor rules to tweak the parsing of the Main slang to make another dev called Tux stop complaining about some syntax decisions he didn't like.
More advanced slangs are internal DSLs such as Slang::SQL.[4] This weaves SQL and standard Raku code so that each can enrich the other.
----
Use of slangs is lexically scoped. That is to say, a slang can be invoked within a particular module, or function, or even just an `if` statement's True block, and at the end of that scope the slang's alteration of Raku vanishes.
And if that was done in an inner scope rather than at the top level, then at the end of that scope, pop, the old Raku would return. (Thus lending a new twist to the Ship of Theseus[1] thought experiment.)
----
This last section includes code showing the approach in action, after a bit of introduction of what I will and won't include.
Raku's grammars declare arbitrary parsers. One can arbitrarily override any part of an existing grammar by composing it into a new grammar with selective replacement of particular parts. But "them's big guns". Instead we'll stick with a simpler mechanism. It's still powerful, but it's particularly easy to use and demonstrate.
A key element underlying what I show below is an alternate rule construct that's available in any Raku grammar. This makes it trivial and practical to add any number of new alternatives for particular grammatical "slots".
The grammars used to declare Raku itself make use of this construct to declare much of its grammar. In particular, the Main slang does, and then builds on that in a principled manner to expose itself to user extension of those parts of its grammar. This is what we're going to see in action.
There are over a dozen of these grammatical "slots". For our example here we'll extend the `postfix` slot:
sub postfix:<!> ( \N ) {   # Declare factorial op
    N > 2
        ?? ( N * ( N - 1 )! )
        !! N
}
say 5!   # 120
The syntax aspect of this extension happens as soon as the opening brace of the operator's definition block is reached. This is why the use of `!` that's inside the block is correctly recognized, despite appearing before the block defining its semantics is even closed.
While the syntax is utterly trivial (literally just the token `!` where the Main slang accepts postfix ops), the semantics are defined by the arbitrary body of the declaration. This body compiles into corresponding AST that is added to the Main slang's semantic actions class.
To be crystal clear, this all happens at compile time, and this new op would be part of a module's compiled result if it were in that module, despite the code that declares the op being ordinary Raku code.
The fact that there are countless regex cheatsheets and pages like https://regex101.com/ or https://regexr.com/ is evidence that regexes are not intuitive or easy to remember. Composing plain-English functions can be easier to remember, and editors can provide auto-complete.
I'd say the bigger issue are the different regex implementations. If you use Java, Javascript and grep you already have to know the peculiarities of each implementation...
Except that all implementations are subtly and annoyingly different, and while you can transfer your general understanding, you can’t avoid the cheatsheet if you’re using multiple tools.
But of course, since this is basically a one-to-one mapping, you can also trivially transfer your understanding to any other regex tool anyways (with a cheatsheet, which you’ll need in either case).
I love seeing people build new things, but I also want to understand why people would find value in this. Is it because people learn things differently and find this easier to digest than regexes? Or than native substring tokenization/boolean primitives?
The new me is being less critical and positive... (smileyface.jpg)
But if you want better readability and comments, Python's "verbose" regex (?x) is a beautiful thing. You can usually also just construct regular expressions incrementally by concatenating strings or whatever your language supports.
Of course, as others pointed out, writing the raw expression directly might be optimal, or every dev should just learn it.
BUT
in my whole career, whenever I have to use regex, I spend a couple of hours learning and testing. This kind of library for Java opens doors for many other things (testability, a default library using default methods, integration with streaming, etc.). And as the community adds to it, it can be optimized internally; all the end user needs to do is upgrade versions. It could even be extended as part of the javax validation specs.
I might actually like something like this if it did not try to replace conventional regex syntax but aimed at making it more useful by enabling composition of smaller blocks of (conventional) regex into larger ones. An API like that could certainly also have an equivalent of Pattern.quote(..) for the few times you actually do want to enter text unescaped (ever tried reading regex with regex?), but that would just be a minor nice-to-have feature.
It wouldn't even have to contain all the small blocks (like [] single-char alternatives; those are best left as-is), just an option to describe the bigger round-brace blocks out in Java syntax, where nesting is validated at compile time. Lookahead, lookbehind, their negatives, capture, non-capture. Non-capture with a varargs (or builder) API of piped alternatives.
You could still provoke a PatternSyntaxException by writing invalid stuff in the conventional blocks, but if the composed pattern fails, checking the subpatterns should be quite helpful (if there are no backreferences).
I wrote one of these in 2002, back in college, after being inspired by Icon and SNOBOL. From Wikipedia:
s := "this is a string"
s ? {                              # Establish string scanning environment
    while not pos(0) do {          # Test for end of string
        tab(many(' '))             # Skip past any blanks
        word := tab(upto(' ') | 0) # the next word is up to the next blank -or- the end of the line
        write(word)                # write the word
    }
}
I really think we lost out when we went toward regular expressions rather than SNOBOL/Icon syntax, but I don't think a direct substitute is as much the issue.
That kind of thing works even better in Java because the static type system enforces it.
In particular, generic methods don't have the problem of type erasure that affects generic classes, so many things you would want to do with types "just work".
Almost everybody is afraid of it, but $ works just fine as an identifier and can be used to make a DSL that looks like jQuery in Java.
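A toy demonstration; the $ method here is made up purely to show that it compiles:

public class DollarDemo {
    // '$' is a legal Java identifier, merely discouraged by convention.
    static StringBuilder $(String s) { return new StringBuilder(s); }

    public static void main(String[] args) {
        // With a static import, call sites read like a jQuery-style chain.
        System.out.println($("hello").append(", world")); // hello, world
    }
}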
It's important to note that these are still regular expressions. They still conform to the same mathematical definition as the regular expressions we all love or hate, and in fact, the library uses regular expressions under the hood. It's basically just using new language constructs for the exact same capabilities.
Now, whether you like that new syntax or not is a matter of opinion. It is of course more type-safe and probably more approachable. That said, I find the verbosity a bit off-putting, although I chalk some of that up to Java being Java; other language implementations could be a bit nicer there. I also question the API design somewhat: regexes are inductively defined (there are a couple of primitives and a couple of operators that combine those primitives), and the API doesn't make this very explicit, using the builder pattern instead.
But ultimately, I'm more used to the way regular expressions are usually written, and for many tasks, that syntax is also more portable (e.g. regular expressions can be configuration which can be hugely useful for e.g. NLP tasks). For quick one-off text replacement tasks with very simple regexes I also think that this syntax would be way too cumbersome.
Lastly, there are some people in the comments suggesting the use of parser combinators, BNF and similar formalisms / tools. Those might be an option for your use case, but they don't give you regular languages (although, sadly: regular expressions in most languages nowadays are not regular anymore). Having a regular language might be an actual requirement for whatever reason (e.g. constant memory usage).
For everyone that doesn't see the point, take a look at the example of parsing a long string [0]. The verbal expression is _much_ easier to read than the regular expression.
I don't see it. The regex is mostly hard to read because they formatted it poorly and put in a bunch of unnecessary non-capturing groups. I find this to be just as easy (if not easier) to read as their first example:
And yes, I do frequently split up my regexes like that to make them more readable.
The only improvement I see is that you don't have messy escaping in the URL. That is genuinely nice. It motivates me to start using a regEsc() function instead of doing it by hand. However, I find "capt().endCapture()" and other verboseness to be a step backwards.
Edit: Actually, from what I can tell, all the escaping was unnecessary in this case as well. Updated examples without unneeded escape characters.
Looks a bit abandoned though, doesn't it? Otherwise I'd love it for the safety and readability (though I'd still need to re-learn everything I'd forgotten in the half a year since I last used a regex).
Regex suffers from the same problem that inlined code does.
For example, if you saw this in a code review, what would you say:
log_and_return(rank_by_time(compute_recommendations(get_data(client_id,date), find_nearest_neighbors(client_id))))
You'd tell them to create some intermediate variables. But when it's a regex, apparently we're all fine with this:
I'm not sure if this is great or crazy! I try not to be swayed by the handpicked examples, because this at least _feels_ like a design that could get messy once you try to do the particularly gnarly regexps that this library claims it was designed for. If it's great, it should already have been done long ago, hmm...
A buddy of mine made Remake (https://docs.rs/remake/0.1.0/remake/) with this kind of thing in mind. It's a DSL for composing regular expressions in a readable way.
I am all for developer ergonomics, and I'm a fan of Ruby... but the problems this library would add to a codebase/project seem too big to be worth the benefits:
- non-standard syntax requiring its own documentation, which developers would have to consult separately (even if they already know regular expressions) to modify generated regular expressions
- removing the ability to test and validate regular expressions independently of the codebase (say, in the terminal, a small shell script, or using an online tool)
- a new rabbit hole to traverse when debugging a problem
- assuming the security risks associated with handing over regex-building to a library built by someone else (even more so if the regex is parsing private or protected data)
- adding a new dependency that may or may not be maintained in the future
For those who would want to use this library, I would suggest using a separate tool to build and/or understand regular expressions. Here's one example, and I'm sure there are others: https://regexr.com/
Anyone looking for a non-Java implementation: This library has been ported to 30+ languages, and you can find a list of them at http://verbalexpressions.github.io/
That really depends on how complicated the regular expression is. For me this debate sounds like arguing assembly vs. C. We will need some sort of abstraction to develop higher-level stuff in case we need it.
I limit my brain-time on constructing a regex to 5 minutes max. If it takes me longer than that, I reach for a parser. Pick the right tool for the job.
Looks interesting. I find that all my regexes are pretty much write-only. When I come back to them a few months later, I can't make much of them, and it's easier for me to start from scratch. Tools such as https://regex101.com/ are amazing, though, for developing regexes and later trying to make sense of them.
Wow, pretty sure I played with something like this for Python in the 90s. People have been trying to replace regexps with something more readable for a long time.
This seems like a decent attempt, although the syntax for captures looks a little clumsy.
I built a little class like this with a fluent API in C# to generate regexes in a project that requires big ones. It makes working with regex super easy and super maintainable.
Trying to simplify something that doesn't inherently simplify isn't always a good idea. Regex is pretty close to the minimum level of abstraction necessary to get the job done. It could probably be improved on, but probably not by much.
Some commenters below mentioned that this Java syntax is a good idea, using the endless number of regex cheatsheets as a testament to why regex is not simple enough and should be replaced. It's almost silly that this is even an argument on HN. Take for example quantum physics: there are lots of videos and guides that try to explain how it works; in fact some of the smartest people have tried to explain it, even Richard Feynman. But he famously said that if you think you understand quantum mechanics, you don't understand quantum mechanics.
Some things cannot be reduced any further, this does not mean those things are always simple in nature or somehow were designed in a convoluted way on purpose.
At least when it comes to regex it's important to keep in mind what Einstein said, "everything should be as simple as possible but no simpler."
It's ironic that people apply reductionism to simplify regex, a thing that itself, one could argue, is a prime example of reductionist design, yet they complain it's too abstract while applying reductionism.
How do I learn regex? I get confused because it seems like maybe there's more than one kind of regex floating around out there, and since regex is made of lots of punctuation symbols, it's very hard to search for things about it on the web. Is there a single book I can read? A couple books? Does it depend on my runtime environment?
The example isn’t a correct URL test regex (far from correct actually - even though there are plenty of edge cases regular regex strings tend to miss also)