Sure, it's much easier to read, especially when it comes to finding and understanding a two-character diff in a 50-char regex.
Sure, I get the benefits of type safety.
Sure, it'll save me time debugging when I accidentally create an invalid regex.
But what am I meant to do with all that time saved? Read a book? Write more code? I don't get it. Let me waste the time on a ludicrously arcane syntax where I spend half the time looking at every bracket trying to understand if it's a control character in that particular context, because the ego trip I get from mastering this ridiculousness is HUGE!
(Yes, I understand regex syntax. I've been able to explain the phrase "zero-width negative lookbehind assertion" for the past twenty years. I inhaled the Friedl book and got utterly high on the idea that the awesome power of regular expressions - which are genuinely great in how they ease flexibility in accepting input - is entwined with their completely inhuman syntax. But I was wrong.)
Verbose does not ‘easier to read’ make, especially when you don’t know whether ‘anythingBut’ means (?!...) or [^...].
Type safety is nice, sure, but it’s a rather small benefit in this case. It doesn’t mean abandoning commonly-understood syntax is worth it. Most regular expressions are short enough to make errors visible with the naked (or IDE-assisted) eye.
This library at best looks like a crutch for a deficient language (which Java admittedly is), and at worst an unnecessary obfuscation layer.
This tool has little to do with Java, except that the author decided to implement it in Java. It's a regular expression composer. You could implement it in any other general-purpose language.
They can be hard to read, but I don't think they are deficient; on the contrary, I think they are very elegant.
Stephen Cole Kleene was a brilliant mathematician, and when he invented regexes in the 1950s he anticipated a lot of concepts that became popular in computer science, such as recursion (a field he also founded, as a branch of mathematics and computer science, together with Alonzo Church, Kurt Gödel and Alan Turing).
Java, on the other hand, has some deficiencies here and there, and it's not really a modern language free from old cruft.
That's like complaining that you can write ugly code in any language. The problem isn't the regular expression, it's that email addresses, while they technically may form a regular language (not sure if they 100% do), form an insanely complicated language, and not a very nice one.
How would you write a specification for that language in any other way that was more elegant? Sure, you could make it more verbose, but that wouldn't make it easier to understand the whole of it, or why it is the way it is.
Almost every RFC writes their grammars in some form of BNF not in regular expressions. RFC are written to be understandable.
> Sure, you could make it more verbose, but that wouldn't make it easier to understand the whole of it, or why it is the way it is.
Absolutely it would. The way to understand a large thing is to understand the smaller components and then put them together. Regular expressions do not compose well.
That is patently untrue. Regular expressions compose under a number of important mathematical operations, such as union, intersection and concatenation. If your PL supports string interpolation, it's trivial to compose them in these ways (well ok, maybe not intersection). Nobody says that your regex needs to be written as a single string.
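To make that concrete in the language at hand, here's a quick sketch with plain java.util.regex; the fragment names are mine, and Java string concatenation stands in for interpolation:

import java.util.regex.Pattern;

public class ComposeDemo {
    public static void main(String[] args) {
        // Small, named fragments instead of one opaque string.
        String scheme = "https?";
        String host   = "[\\w.-]+";
        String port   = "(?::\\d+)?";

        // Concatenation of regexes is concatenation of strings;
        // union is just "|" between fragments.
        Pattern url    = Pattern.compile("^" + scheme + "://" + host + port + "$");
        Pattern either = Pattern.compile("^(?:" + scheme + "|ftp)$");

        System.out.println(url.matcher("https://example.org:8080").matches()); // true
        System.out.println(either.matcher("ftp").matches());                   // true
    }
}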
Haskell absolutely has waaaay too much cruft. Have you read the 30 page articles recommending which extensions to use? Have you ever seen MTL? Read any documentation written by Edward Kmett?
Modules and libraries are not part of the language
Java, for example, can't have proper generics because at the bytecode level (the "real" Java) they are not supported; it can't have static constructors because they would break inheritance, so they had to come up with static blocks; it has no support for a static method with the same signature as an instance method, because the call syntax doesn't differentiate a call to a static method from a call to an instance method, so the compiler can't tell which method is being called; etc., etc.
These are all consequences of the original choices made 25 years ago when they designed the bytecode and the syntactic sugar over the bytecode, and they still live with us today.
IMO, the prevalence of libraries like MTL, or with poor documentation, is a consequence of design decisions as well — maybe not of the abstract language, but at least of the primary implementations.
I think having an EBNF-like syntax would be nice. It's verbose, but (IMO) fairly unambiguous. A side effect would be the ability to define variables (i.e. terms) within the expression.
Yeah; I'd rather have proper multiline strings in Java and a regex documented with the COMMENTS flag set. What I don't need is a regex builder. Or SQL builder, for that matter.
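For anyone who hasn't seen the flag: with Pattern.COMMENTS (inline form (?x)), the engine ignores unescaped whitespace and treats # up to end-of-line as a comment. A sketch of the URL pattern quoted elsewhere in the thread written that way, with concatenation standing in for the multiline strings Java lacks:

import java.util.regex.Pattern;

public class CommentsFlagDemo {
    public static void main(String[] args) {
        Pattern url = Pattern.compile(
                "^ https?         # http or https      \n"
              + "  ://            # literal separator  \n"
              + "  (?: www\\. )?  # optional www.      \n"
              + "  [^\\ ]*        # no spaces allowed  \n"
              + "  $",
                Pattern.COMMENTS);

        System.out.println(url.matcher("https://www.example.org").matches()); // true
    }
}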
They don't want to bloat the language with a million rarely used and hard-to-read syntax features, just as the Go devs do. Not to repeat the failure of Scala (and I'm afraid Rust is going the same way).
Well that’s terribly confusing. So .anythingBut("a").then("a") matches "a"?? That’s so far off from what I’d expect that when I read the example for a URL:
(...).anythingBut(" ").endOfLine()
I defaulted to assuming endOfLine mapped to [^\n]*\n and was left confused about the whole thing. I’ll take simple rigorously defined grammars over someone’s attempt to embed English onto them any day.
I've never had any problems writing and debugging regular expressions after I came across this: https://regex101.com
And since regexes are usually write-once, adding this complexity on top of them serves no additional benefit. If anything, it'd probably make it harder for the next guy to understand your code.
I bought RegexBuddy years ago and have loved it for debugging. However it only runs on Windows. Found regex101 recently and I think it's a great alternative (though I almost didn't check it out because the domain has SEO abuse site vibes).
Have you seen https://regexr.com? I've only come across it recently but I really appreciated the visual description of complicated regex, especially for interpreting someone else's work.
I think that's actually the problem. I'm used to domains that are very precise being used to host junk content. Most popular sites aren't so narrowly focused and/or have some brand-name-like domain.
It is. For the most part, regexes like "\d+" are OK, but when there is something more complicated I pull Verbal Expressions into a project. To this day, reactions in code review have been mostly positive, or neutral at worst. If it were built into the standard library I would probably use it instead of regex, but adding a new dependency and interfacing it with libraries that expect Java regex objects has its cost.
It's as if your Java compiler stopped warning you about forgotten semicolons and instead errored out at runtime when it reached the statement with the missing semicolon.
It's not just your time saved. It's time saved not running the test suite, for example. An uncompilable regex is a category of error that you can ban completely from your program, just as Java bans syntax errors as a category of (runtime) error. It's time saved because no developer can break this in a way that isn't a semantic error. It's time and mind saved not thinking about a whole class of errors.
Thank you for the clarification! To be clear: my post is sarcastic, and I was trying to say that this library looks like a significant usability improvement over traditional regex syntax.
• not having to distinguish special characters in the pattern being matched from special characters part of regex syntax
• no ambiguity as to whether something is a digraph or not
• no escaping hell
• unambiguous human-readable names for all the regex features used
• the ability to use whitespace to clearly separate different parts of the regex
• the ability to comment parts of the regex
It sounds great to me. Have you ever tried making a regex matching something with backslashes in it, and then you have to put that regex inside a string literal? Have you ever had to switch between different regex environments and not known which symbols require escaping, or what is the correct way to write something in a particular environment? I've had all these problems.
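For the record, here is the backslash case in Java specifically; this is plain java.util.regex behaviour, independent of any library:

import java.util.regex.Pattern;

public class BackslashDemo {
    public static void main(String[] args) {
        // The regex engine needs the two characters \\ to match one literal
        // backslash, and each of those needs escaping again in the Java
        // string literal: four backslashes in source per one in the input.
        Pattern oneBackslash = Pattern.compile("\\\\");

        System.out.println(oneBackslash.matcher("\\").matches()); // true
    }
}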
Many of those gains can be had by using first-class and more full-featured regexes, like those that are available in other languages (Ruby, Perl):
- escaping hell isn't that much of a problem, since you're only ever escaping something once (not like a regex in a string)
- several languages support separating regexes across several lines
- regex commenting (including named groups) is a standard feature in many languages, and that's besides using first-class comments across multiple lines
I think you do have a point about digraphs (or homographs), but unless I misunderstand, those would be a problem whether or not the character(s) are part of a string vs. a first-class regex. As for unambiguous human-readable names for regex features used, tools like this (https://regexr.com/) are available and very effective.
I might prefer Java Verbal Expressions over java.util.regex, but to me that's more of a knock on Java and its lack of proper, first-class regexes than anything else.
Is it just me or does this seem like a very bad idea? I mean it seems nicer but the reality is, if you don't know how Regexes work, you won't understand the nuances of the "verbal" regex either... Also, some optimisation maybe?
^(?:http)(?:s)?(?:\:\/\/)(?:www\.)?(?:[^\ ]*)$
could be better written as
^https?://(?:www[.])?[^ ]*$
Or am I missing something? In that case, I'll readily admit this library is a good idea :)
I think it's a great idea... if you already know regex. Effectively it's just a different syntax for the same construct after all, it doesn't simplify anything, it just makes it more readable. Oh and it makes escaping a non-issue, which already almost sells me on the idea completely, since it seems that 50% of the time I spend writing regex is figuring out what needs escaping and how.
Writing regexes is not much of an issue usually (although the many dialects in common use are always a source of frustration) but reading them is always a pain, for me at least. For quick and dirty shell scripts or vim editing it's great, for stuff that's supposed to be long lived and actively maintained in a codebase I think this verbal approach is a great idea, at least in theory.
Regarding the optimization of the intermediate result it should only be a problem if you actually need to output these regexes for other uses or if you need to compile many of them at runtime with performance constraints. If your regexes are pre-compiled then the resulting DFA should look the same as far as I can tell.
If somebody makes a Rust crate with a similar concept I'll be sure to try it out next time I have to write regexes in a codebase.
> I think it's a great idea... if you already know regex
It's actually a bad idea in this case because regex is mostly the same in every modern language, so if you know it, you know it everywhere. What you don't know is this.
I agree with the common complaint that regex is effectively write-only, but this is only half due to its terse syntax. A pattern can be pretty complex on its own, and complex things are hard to understand. Imagine what code matching behavior of a complex regex would look like.
> It's actually a bad idea in this case because regex is mostly the same in every modern language, so if you know it, you know it everywhere. What you don't know is this.
I disagree; at least in my experience there are significant differences between the regex engines I use regularly. In no particular order: are parens and other operators treated literally by default, or do they need to be escaped? Are character classes like '[:alpha:]' understood, or do I need to write them out explicitly? Similarly, do I have access to \w, \W, \s and friends? Can I use + to mean {1,}? Can I use '?' to match 0 or 1 (common), or do I have to use = (vim)? Or maybe just {0,1}? But then, should I escape the braces? Do I have recursion? Do I have named captures?
Those are not theoretical concerns, that's stuff I routinely end up getting wrong because I forget that this one feature that works in pcre does not work in vim or works differently in sed etc...
> are parens and other operators treated literally by default or do they need to be escaped?
> Can I use + to mean {1,} ? Can I use '?' to match 0 or 1 (common) or do I have to use = (vim)? Or maybe just {0,1}? But then should I escape the braces?
I think that's just older tools like vi and sed. Perl, Python, Java, and Javascript use a similar modern version where + and ? work, and parentheses and braces don't need to be escaped.
> if you know it, you know it everywhere. What you don't know is this.
Right, one language might have anythingBut(" ").endofline() and the next language might have a different . operator like anythingBut(" ")->endofline() or it might even require nesting calls. None of these things are a significant hurdle and if we standardize the names (endofline, anythingBut, ...) then you can make the same argument. It's a chicken and egg argument: just use regex because that works everywhere -> it's not universally implemented -> it won't work everywhere.
And aside from that, I have a similar experience to the sibling comment: when using some command line tool that I forgot (is it sed? Vim?) the default is that \( is a capture group whereas in normal regex ( is a capture group. Grep offers you three regex variants to choose from. I have to look up regex syntax or do trial and error every time I don't use a language that I use daily. And I don't know all of regex to begin with, I just know everything I ever needed but people posted examples here with (?:x) which I don't know. I once read it and remembered it for a few days I think... so anyway, consistent and descriptive method names seems a lot easier especially when you consider autocompleting IDEs.
> Is it just me or does this seem like a very bad idea?
It's not just you. As you say this can only truly be used by people who understand regular expressions; and they would most likely prefer not to use this stuff.
It seems the whole IT industry is obsessed with helping us do all sorts of things, even simple things, which in the end often makes things more complex. Different query languages that translate to SQL to help us out, which often create super-complex SQL. All sorts of wrappers to avoid us having to deal with all sorts of formats (JSON/XML..). Hopefully those wrappers do something useful with those date-objects you know you have in there somewhere...
I don't think SQL builders are a good comparison because:
- SQL can already be made fairly readable by default, it's not just a long series of cryptic tokens. The main point of SQL builders is not to make SQL more readable, it's to make SQL approachable by people who don't know SQL.
- There can be several ways of achieving the same result in SQL, with sometimes deep performance implications, so it's really important to understand what is being executed and in what order. Regular languages are much simpler and while the string representation of the regex might end up longer than the handcrafted equivalent, the runtime performance should end up being the same since in the end it's all deterministic finite automatons.
- SQL builders have to be at least a little bit opinionated to be really useful, in general they make it easy to create simple queries but can quickly become limiting for complex queries, especially if you already know SQL. These "verbal expressions" on the other hand can easily map 1:1 with raw regex constructs, allowing somebody who already knows regex to express exactly the same logic, just in a more verbose and human readable way.
This verbose syntax operates at exactly the same level of abstraction as normal regex, it's just a syntactical transform effectively. It's like JSON vs. CBOR or something like that.
> These tools are great for letting someone build something they don't understand
Exactly, how can the same concept re-appear in all sorts of forms in this industry?
Another thing I thought of was Mule, which I used a few years ago (it's hopefully better now). A horrid mess of an Eclipse plugin that drew boxes with arrows between them, where "standard plugins" etc. could be plugged in to transform data and move it from here to there. The problem Mule solved (in our case) was comical, the complexity of the solution was also comical, or tragic; or maybe it was both.
>It's not just you. As you say this can only truly be used by people who understand regular expressions; and they would most likely prefer not to use this stuff.
I know regex and I hate writing it. It's unreadable and I need to spend time remembering/googling/checking the exact syntax. And, of course, the syntax differs from implementation to implementation in subtle but important ways (e.g. needing to double-escape in Python, etc.).
> It's not just you. As you say this can only truly be used by people who understand regular expressions; and they would most likely prefer not to use this stuff.
There's a niche where this might be useful, but by definition it's small. I understand regexes a moderate amount, and can construct arbitrarily complex ones when necessary. But I do it just infrequently enough that it can be painful and halting above a certain level of complexity, with lots of testing and reference-checking. It'd be nice to use something sane like this, and I think I fall squarely into the category of "people who understand regexes but would prefer to use stuff like this". Though as I said, this niche is almost by definition small, and on top of that I can't remember the last time I used Java.
Completely independently, in any non-trivial engineering system, readability is important, and this helps a lot there.
A lot of IT is the parsing and mapping of one kind of language (whether markup, DSL, or Turing-complete) onto another.
Doing it right is a delicate balancing act of being just powerful enough to express everything the user needs without devolving into an unreadable or repetitive mess. Some people manage to achieve neither.
Well, there’s at least one advantage: apparently this builder library automatically escapes literal strings passed to it, so you no longer need to worry about injection bugs if you construct patterns dynamically (cf. parametrised queries versus ‘come on, just use mysql_real_escape_string, it’s not that hard’).
I’m not sure this alone pulls its weight, though; most of the time, regular expressions are fixed at compile time. And I’d still prefer something that mostly preserves commonly-understood pattern syntax. Having to guess whether ‘anythingBut’ means (?!...) or [^...] is not encouraging.
(This was apparently ported from JavaScript, where it is even more pointless: template literals can take care of the escaping part without abandoning standard pattern syntax. But as far as I know, Java has no equivalent feature.)
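To be fair to Java, its standard library does ship the escaping half of this: Pattern.quote wraps a literal in \Q...\E. A minimal sketch of the dynamic-construction case; the input value is invented for illustration:

import java.util.regex.Pattern;

public class QuoteDemo {
    public static void main(String[] args) {
        String userInput = "1.5 * (cost)"; // arrives at runtime

        // Quoted, the . * ( ) in the data cannot act as regex syntax.
        Pattern p = Pattern.compile("^" + Pattern.quote(userInput) + "$");

        System.out.println(p.matcher("1.5 * (cost)").matches()); // true
        System.out.println(p.matcher("1x5 * (cost)").matches()); // false
    }
}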
I think there's a certain use case for this: a moderate regex user who's not an expert and not fully comfortable with regexes but knows the basics, and who is in a project where they need to heavily use regexes for a limited amount of time and will need to maintain that code going forward.
If you use regexes a lot, you are better off learning regexes, if you use regexes a little, this is a lot to learn to avoid learning a little about regexes. But there is a moderate user sweet spot where I could see this useful.
I like it as a simplistic builder. Much easier to read, autocompletes, and (I assume) handles escaping for you (because it knows you put only raw data inside).
Just escaping alone is a big selling point for me.
It's worse. Even if you do know how regexes work, you can still be tripped up by the counterintuitive function names. See this comment[0] and its parent for an example. I think English names with ambiguous or unexpected meanings are worse than symbols, which at least don't carry implied meaning.
No, you are not. These "verbal" expressions are nothing more than a builder for the actual expression. So you can't actually use it without understanding regular expressions.
"These "verbal" expressions are nothing more than a builder for the actual expression."
It may be under the hood, but there's no reason for it to be.
There's nothing inherent in our regexes that would imply they are 'the language' for that purpose; it just so happens we really only have one commonly used one.
Like most things invented forever ago, there might be opportunities for a 'cleaner, better way'.
But regular expressions seem quite well optimized from my point of view.
Regular expressions are used for the exact same task regardless of programming language -- using a single expression language regardless of programming environment seems like a huge advantage. It can be embedded in a configuration file, as a string in a database, on a web page or deep in backend code, and it will still work the same.
The "Java Verbal Expressions" already have "Java" in the name and so are a complete loss when it comes to portability.
Then comes the fact that "Java Verbal Expressions" are many times more code than actual regular expressions. That isn't easier to scan, it is much worse.
Regular expressions are very succinct and you can express a lot in a single line. Comparable JVEs would require many lines and wouldn't be more readable for anybody other than a person who doesn't know regexes at all.
'Using a single language to do everything' could be argued on the software side as well.
It might be possible that s-expression style would work really well for regex, but nobody has really gone through the effort to do it.
Where I think regexes don't do so well is with Unicode (grapheme clusters, true word boundaries), and the difference between match/capture/ignore etc. is also confusing.
I'll bet if you really put your mind to it creatively, you might be able to come up with a novel approach that hasn't really been tried before... but even if it was 'better' it wouldn't catch on for a while (or ever) unless there were some big institutional backers.
The intersection of debugging regexes and debugging code written by someone cycling through autocomplete looking for methods that sound right should not be real. It should be a myth, a region of programmer hell, a scary story to tell children about what will happen to them after they die if they don't document their code. May Dijkstra strike down anyone who succeeds in bringing this horrible idea to production.
Even though I still use regexes in rare circumstances (e.g. inside config files), parser combinators already do a much better job than this (or regexes) when you are writing maintainable code:
warcEntry = do
  header <- warcHeader
  crlf
  body <- do
    contentLength <- getContentLength header
    compressionMode <- getCompressionMode header
    warcbody contentLength compressionMode
  crlf
  crlf
  return (WarcEntry header body)
If you accept crlf as "carriage-return-line-feed", the rest basically reads as pseudocode. crlf could have just as easily been written (string "\r\n") I guess.
Parser combinators can:
* call out to other parsing functions (e.g. warcHeader) - so you can build your code out of testable units.
* bind results to variables and start using them during the parse, e.g. warcHeader returns data containing contentLength and compressionMode, which is then fed to the warcbody function so it knows what to expect.
Parser combinators showed me what an expressive language (in particular Haskell) can do to let you write code as you intend, so that maintainable code is enjoyable to read, write, and reason about.
I have heard that some people first “get” Haskell because of parser combinators.
Yes, regular expressions cannot do that, because regular languages cannot do that. That is an important limitation of regular languages; otherwise you'd be able to match the language {a^n b^n for any n} (or, more generally, balanced parentheses).
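To make the limitation concrete: recognizing {a^n b^n} needs an unbounded counter, which a finite automaton by definition lacks, while a few lines of ordinary code (Java, since that's the thread's context) do it trivially:

public class AnBnDemo {
    // Accepts exactly the strings a^n b^n, n >= 0.
    static boolean matchesAnBn(String s) {
        int as = 0;
        while (as < s.length() && s.charAt(as) == 'a') as++;
        int bs = 0;
        while (as + bs < s.length() && s.charAt(as + bs) == 'b') bs++;
        // Same number of b's as a's, and nothing else in the string.
        return as == bs && as + bs == s.length();
    }

    public static void main(String[] args) {
        System.out.println(matchesAnBn("aaabbb")); // true
        System.out.println(matchesAnBn("aaabb"));  // false
    }
}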
The thing you wrote you wouldn't want to parse with a regular expression because it fundamentally doesn't look like a regular language with its hierarchical structure.
While neat, I think that if you're a developer, you'd be better off learning basic regular expressions instead so you can use them in whatever language you'd like. Depending on this would probably just make moving to a new code base that doesn't use this a lot more confusing.
A normal regex with a comment above it explaining what it does (for complex cases) always worked well for me.
Even as a developer you may have to assemble a regular expression at runtime, at which point a library that can do it for you may be much more handy than having to assemble the string yourself.
And even if you know regex by heart - assembling it with function calls can still be better / safer just like you shouldn't insert SQL parameters by hand into your SQL query strings.
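A sketch of that kind of runtime assembly using nothing but the standard library; the keyword list is invented for illustration:

import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class RuntimeAssemblyDemo {
    public static void main(String[] args) {
        // Imagine these arrive at runtime, e.g. from configuration.
        List<String> keywords = List.of("c++", "c#", "f(x)");

        // Quote each literal before joining into an alternation, so the
        // metacharacters in the data can't change the pattern's meaning.
        String joined = keywords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|", "\\b(?:", ")"));
        Pattern p = Pattern.compile(joined);

        System.out.println(p.matcher("I like c++ a lot").find()); // true
    }
}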
I can't learn them. I've tried for over 20 years and every time I use them the knowledge is deleted from my brain immediately. A library like this would be very helpful if it worked.
One problem is that I'm more likely to need regex almost anywhere but Java code.
I think there is value in both cases. I've seen many developers who have struggled with regex even with all those hundreds of tools for learning and for building/testing regexes. This could be useful for them to start with, and they can learn regex according to their time/needs. I see solutions like this as a choice, and the fact that people are using them shows that there is value in having that choice, even if it is not obvious to us at first glance.
Looking at the example, my immediate reaction was that the main advantage would be the `anything_but` method, relieving me from the cumbersome construction of stuff like this:
(?:[^t]|t(?:[^r]|r(?:[^u]|u(?:[^m]|m[^p]))))
What a time-saver it would be to write
anything_but("trump")
Except, then you look at the source code and see this:
Except not all applications of regular expressions allow for that trick to be used. Case in point: grep (which of course has the -v option which together with pipes does the whole negative match thing much more neatly anyway).
How else would you write that you want to match all strings that don't contain string X? If you were matching at a specific position, you should use a negative lookahead (?!xyz), but I think in some cases you might need the mess above.
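For what it's worth, in engines with lookahead (java.util.regex included) the usual idiom for "string does not contain X" is to test a negative lookahead at every position instead of expanding character classes by hand:

import java.util.regex.Pattern;

public class NotContainingDemo {
    public static void main(String[] args) {
        // At every position, assert that "trump" does not start here,
        // then consume one character.
        Pattern p = Pattern.compile("^(?:(?!trump).)*$");

        System.out.println(p.matcher("triumph").matches());   // true
        System.out.println(p.matcher("a trump b").matches()); // false
    }
}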
If you concede that the standard regex construct [^x] is a useful construct at all -- match any character except x -- why should such a negative match be restricted to only a single character? Why not also allow for, say, matching anything except the following _two_ characters in a row?
Well, then, going by the method's name, it should be useful for any number of characters. They're not random -- they spell out a word you would like not to match.
I'm not sure what you mean with "predictable". Are you referring to readability of the expression? I totally give you that... which kinda was my whole argument to begin with.
It's pretty verbose, but it is useful in the sense that you have type-safety between character groups and the control characters. It's neat that it only allows you to create valid regexes (I hope it does). At least you have static safety that your parentheses for capture groups are properly closed.
This advantage is not explained. Not being able to construct invalid regular expressions is a good static safety guarantee that you don't get when you embed DSLs as strings.
Edit: This is the same reason why we would prefer jOOQ to embedded String-SQL, if speed/dependencies are of no concern. You're not allowed to construct invalid SQL as the java-type-system gives you these guarantees when using an embedded DSL instead of a String-DSL. This is very powerful, but of course only works if the type system of the host language is powerful enough.
Unlike many others, I actually like this idea. I know regular expression, but many of my colleagues do not. They often have a hard time understanding what a particular regex do, event though I often document them step by step.
Something like this would make it more readable.
I do agree with others here, that it seems a bit rough around the edges and some optimisation might be needed. But I think the idea itself is sound.
This is cool, but I'm disappointed to see the horrid builder pattern show up again. Imagine you had to use StringBuilder every time you wanted to manipulate a String?
Just make all fields final and combine the builder and 'working' class into a single immutable object. Like String.
`build()` everywhere is syntactic noise, and you either lose immutable safety (by passing around builders everywhere, as in the examples) or composability (by passing around the 'sealed' objects). Builders are an antipattern that should only be used in cases where extreme performance is required.
With immutable objects, every step in the fluent chain of calls is an independent fork of the full object state. There's no need to use java clone(); each method calls a private constructor that passes the object state (slightly altered, of course).
java.lang.String works exactly this way. You're already used to the pattern.
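A minimal sketch of what that could look like; the class and method names here are hypothetical, not the library's actual API:

import java.util.regex.Pattern;

public final class Re {
    private final String pattern; // immutable state, just like String

    private Re(String pattern) { this.pattern = pattern; }

    public static Re empty() { return new Re(""); }

    // Every method returns a fresh instance; no mutable builder, no build().
    public Re then(String literal) {
        return new Re(pattern + Pattern.quote(literal));
    }

    public Re maybe(String literal) {
        return new Re(pattern + "(?:" + Pattern.quote(literal) + ")?");
    }

    public Pattern compile() { return Pattern.compile(pattern); }
}

Something like Re.empty().then("http").maybe("s").then("://") is a complete, shareable value at every step in the chain, so you keep composability without giving up immutability.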
In the parent's suggestion, you still have immutable objects. "VerbalExpression()" is a valid regex, namely the empty one. Every subsequently called method concatenates some new regex onto the receiver and thus builds a new regex (since regular expressions are closed under concatenation).
Builders are used in Java* when you have an object which is invalid without passing in a bunch of parameters, but you don't want to have to remember the order of the parameters. But this is not one of these cases.
There is one drawback, however: you do have to compile the regex into an FSA at some point, and it wouldn't be good to do that for every intermediate regex. So I assume that the compilation happens in the "build" step. They could have just called it "compile", though.
* named parameters in constructors seem a better alternative in every language that supports them, e.g. Kotlin
I wish you were interviewing me. This is Java world you're talking about and if you can't squeeze a dozen Gamma Design Patterns into your code you aren't good enough.
This project would be better if it wasn't exactly a 1-to-1 mapping from words/methods to regular expressions. For example, the regex "\d+" maps to the code "digits().oneOrMore()". That doesn't read well in English because it's odd to have an adjective after the noun (i.e. we say "red bird" not "bird red").
Also, a serious weakness in regex is they are "write only", or hard to read. That's because they are compact and don't have discernible sections that are then assembled together.
You can do that yourself in Java by assigning chunks of regex to variables and then concatenating them together, but the regex engine doesn't let you do that itself. You can't name sections of the regex or insert comments into it.
The example
^(?:http)(?:s)?(?:\:\/\/)(?:www\.)?(?:[^\ ]*)$
could be better if it could be broken down into named pieces or commented like this:
^
(?:http)(?:s)?   # http or https
(?:\:\/\/)       # ://
(?:www\.)?       # optional www.
(?:[^\ ]*)       # rest of URL (no spaces)
$
Comments should explain the why, not the what. `\d # digit` yes that is what it says, but why a digit? What is your ^, what is (?:) what is $, why is (?:[^\ ]) "rest of URL" [I know the answers to these, the person the comments are intending to help probably doesn't]. Where's the explanation of what it's supposed to be doing with a limited subset of URLs?
# match basic URLs such as:
# http://example.org
# https://www.example.org
# Must have no surrounding spaces, no spaces in the text.
^(?:http)(?:s)?(?:\:\/\/)(?:www\.)?(?:[^\ ]*)$
It's much clearer to me from the one-line regex that it matches "http://////////" and that's probably an error. I don't know why it's clearer; maybe because I'm used to looking at regexes written that way.
Mostly we program in languages that read like a kind of shorthand English, and mature programmers are expected to be able to read the text fluently. WHAT comments tend to be superfluous.
Nobody sane reads 100+ character regexes fluently, so WHAT comments are totally appropriate. Any time I write a nontrivial regular expression, I always try to leave an example of what it would match in a comment. I've thanked last-year me for this more times than I can count.
Huh, I didn't know that. I've read through a fair amount of Java code with regexes and never seen anyone use comments. Maybe it's because Java doesn't have proper multi-line string support built into the language.
If you don't have multi-line support in the language then you're more likely to put the comments outside the string:
String regex =
    "^"
    + "(?:http)(?:s)?"   // http or https
    + "(?:\\:\\/\\/)"    // ://
    + "(?:www\\.)?"      // optional www.
    + "(?:[^\\ ]*)"      // rest of URL (no spaces)
    + "$";
I love your example of how it should be explained. This helps people correlate the verbal aspects to the regex parts they described. This ultimately reinforces and helps people learn regex more deeply.
Once you start thinking about it, it's mind boggling that we have thousands of languages and yet most of them don't have built-in facilities to construct and parse grammars (at least context-free ones). Every single designer seems to think that their language is finally good enough and will not be used as a starting point for another one.
Perhaps Raku would be of interest to you. It was specifically designed to make it a good starting point for its ongoing replacement, Ship of Theseus style.[1]
----
The first half of this comment paints the big picture of Raku's approach to enabling this.
The second shows actual code demonstrating the approach.
----
Raku has a built in grammar + actions formalism, and corresponding built in parsing/compiling machinery.
Raku's grammars/actions are used to declare, parse, and compile Raku -- it is entirely defined in terms of itself.
----
Raku isn't just one language, but a collection of sub-languages, aka slangs. Each of these is comprised of a grammar/action pair. These are mixed together to form a composed result: "Raku".
----
Any slang can include grammar rules that invoke rules in other slangs with which it's mixed. The slangs in standard Raku take advantage of this. Thus a function written in the Main (GPL) slang will accept regexes in some spots in the Main slang, with those regexes being parsed/compiled according to the Regex slang.
This ability for slangs to use each other can be (and is, for the standard slangs) mutually recursive. Thus, just as an ordinary function written in the Main slang can include a regex in its code, so too a regex can include an ordinary function.
This overall approach of weaving slangs together is called a "language braid"[2]. Multiple slangs are seamlessly woven together as if they were just one language -- because in fact they are, despite its declaration being broken out into slang modules.
----
User defined slangs can replace or modify the "official" standard slangs that ship together to comprise "Raku".
Thus, at the most trivial level comes slang modules like Slang::Tuxic.[3] This was one of the first slangs. It is just a few lines of code written by a core Raku dev; it overrides a few existing minor rules to tweak the parsing of the Main slang to make another dev called Tux stop complaining about some syntax decisions he didn't like.
More advanced slangs are internal DSLs such as Slang::SQL.[4] This weaves SQL and standard Raku code so that each can enrich the other.
----
Use of slangs is lexically scoped. That is to say, a slang can be invoked within a particular module, or function, or even just an `if` statement's True block, and at the end of that scope the slang's alteration of Raku vanishes.
And if that was done in an inner scope rather than at the top level, then at the end of that scope, pop, the old Raku would return. (Thus lending a new twist to the Ship of Theseus[1] thought experiment.)
----
This last section includes code showing the approach in action, after a bit of introduction of what I will and won't include.
Raku's grammars declare arbitrary parsers. One can arbitrarily override any part of an existing grammar by composing it into a new grammar with selective replacement of particular parts. But "them's big guns". Instead we'll stick with a simpler mechanism. It's still powerful, but it's particularly easy to use and demonstrate.
A key element underlying what I show below is an alternate rule construct that's available in any Raku grammar. This makes it trivial and practical to add any number of new alternatives for particular grammatical "slots".
The grammars used to declare Raku itself make use of this construct to declare much of its grammar. In particular, the Main slang does, and then builds on that in a principled manner to expose itself to user extension of those parts of its grammar. This is what we're going to see in action.
There are over a dozen of these grammatical "slots". For our example here we'll extend the `postfix` slot:
sub postfix:<!> ( \N ) {   # Declare factorial op
    N > 2
        ?? ( N * ( N - 1 )! )
        !! N
}
say 5!   # 120
The syntax aspect of this extension happens as soon as the opening brace of the operator's definition block is reached. This is why the use of `!` that's inside the block is correctly recognized, despite appearing before the block defining its semantics is even closed.
While the syntax is utterly trivial (literally just the token `!` where the Main slang accepts postfix ops), the semantics are defined by the arbitrary body of the declaration. This body compiles into corresponding AST that is added to the Main slang's semantic actions class.
To be crystal clear, this all happens at compile time, and this new op would be part of a module's compiled result if it were in that module, despite the code that declares the op being ordinary Raku code.
The fact that there are countless regex cheatsheets and pages like https://regex101.com/ or https://regexr.com/ is evidence that regexes are not intuitive or easy to remember. Composing plain-English functions can be easier to remember, and editors can provide auto-complete.
I'd say the bigger issue are the different regex implementations. If you use Java, Javascript and grep you already have to know the peculiarities of each implementation...
Except that all implementations are subtly and annoyingly different, and while you can transfer your general understanding, you can’t avoid the cheatsheet if you’re using multiple tools.
But of course, since this is basically a one-to-one mapping, you can also trivially transfer your understanding to any other regex tool anyways (with a cheatsheet, which you’ll need in either case).
I love seeing people build new things, but I also want to understand why people would find value in this. Is it because people learn things differently and find this easier to digest than regexes? Or than native substring tokenization/boolean primitives?
The new me is being less critical and positive... (smileyface.jpg)
But if you want better readability and comments, Python's "verbose" regex (?x) is a beautiful thing. You can usually also just construct regular expressions incrementally by concatenating strings or whatever your language supports.
Of course, as others pointed out, writing the raw expression directly might be optimal, or every dev should just learn it.
BUT
in my whole career, whenever I have to use regex, I spend a couple of hours learning and testing. This kind of library for Java opens doors for many other things (testability, a default library using default methods, integration with streaming, etc.). And as the community adds to it, it can be optimized internally; all the end user needs to do is upgrade versions. It could even be extended as part of the javax validation specs.
I might actually like something like this if it did not try to replace conventional regex syntax but aimed at making it more useful by enabling composition of smaller blocks of (conventional) regex into larger ones. An API like that could certainly also have an equivalent of Pattern.quote(..) for the few times you actually do want to enter text unescaped (ever tried reading regex with regex?), but that would just be a minor nice-to-have feature.
It wouldn't even have to contain all the small blocks (like [] single-char alternatives; those are best left as-is), just an option to describe the bigger round-brace blocks out in Java syntax, where nesting is validated at compile time. Lookahead, lookbehind, their negatives, capture, non-capture. Non-capture with a varargs (or builder) API of piped alternatives.
You could still provoke a PatternSyntaxException by writing invalid stuff in the conventional blocks, but if the composed pattern fails, checking the subpatterns should be quite helpful (if there are no backreferences).
I wrote one of these in 2002, back in college, after being inspired by Icon and SNOBOL. From Wikipedia:
s := "this is a string"
s ? {                              # Establish string scanning environment
    while not pos(0) do {          # Test for end of string
        tab(many(' '))             # Skip past any blanks
        word := tab(upto(' ') | 0) # the next word is up to the next blank -or- the end of the line
        write(word)                # write the word
    }
}
I really think we lost out when we went toward regular expressions rather than SNOBOL/Icon syntax, but I don't think a direct substitute is as much the issue.
That kind of thing works even better in Java because the static type system enforces it.
In particular, generic methods don't have the problem of type erasure that affects generic classes, so many things you would want to do with types "just work".
Almost everybody is afraid of it, but $ works just fine as an identifier and can be used to make a DSL that looks like jQuery in Java.
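A toy demonstration; the $ method here is made up purely to show that it compiles:

public class DollarDemo {
    // '$' is a legal Java identifier, merely discouraged by convention.
    static StringBuilder $(String s) { return new StringBuilder(s); }

    public static void main(String[] args) {
        // With a static import, call sites read like a jQuery-style chain.
        System.out.println($("hello").append(", world")); // hello, world
    }
}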
It's important to note that these are still regular expressions. They still conform to the same mathematical definition as the regular expressions we all love or hate, and in fact, the library uses regular expressions under the hood. It's basically just using new language constructs for the exact same capabilities.
Now, whether you like that new syntax or not is a matter of opinion. It is of course more type-safe and probably more approachable. That said, I find the verbosity a bit off-putting, although I chalk some of that up to Java being Java; other language implementations could be a bit nicer there. I also question the API design somewhat: regexes are inductively defined (there are a couple of primitives and a couple of operators that combine those primitives), and the API doesn't make this very explicit, using the builder pattern instead.
But ultimately, I'm more used to the way regular expressions are usually written, and for many tasks, that syntax is also more portable (e.g. regular expressions can be configuration which can be hugely useful for e.g. NLP tasks). For quick one-off text replacement tasks with very simple regexes I also think that this syntax would be way too cumbersome.
Lastly, there are some people in the comments suggesting the use of parser combinators, BNF and similar formalisms / tools. Those might be an option for your use case, but they don't give you regular languages (although, sadly: regular expressions in most languages nowadays are not regular anymore). Having a regular language might be an actual requirement for whatever reason (e.g. constant memory usage).
For everyone that doesn't see the point, take a look at the example of parsing a long string [0]. The verbal expression is _much_ easier to read than the regular expression.
I don't see it. The regex is mostly hard to read because they formatted it poorly and put in a bunch of unnecessary non-capturing groups. I find this to be just as easy (if not easier) to read as their first example:
And yes, I do frequently split up my regexes like that to make them more readable.
The only improvement I see is that you don't have messy escaping in the URL. That is genuinely nice. It motivates me to start using a regEsc() function instead of doing it by hand. However, I find "capt().endCapture()" and other verboseness to be a step backwards.
Edit: Actually, from what I can tell, all the escaping was unnecessary in this case as well. Updated examples without unneeded escape characters.
Looks a bit abandoned though, doesn't it? Otherwise I'd love it for the safety and readability (though I'd still need to re-learn everything I'd forgotten in the half a year since I last used a regex).
Regex suffers from the same problem that inlined code does.
For example, if you saw this in a code review, what would you say:
log_and_return(rank_by_time(compute_recommendations(get_data(client_id,date), find_nearest_neighbors(client_id))))
You'd tell them to create some intermediate variables. But when it's a regex, apparently we're all fine with this:
I'm not sure if this is great or crazy! I try not to be swayed by the handpicked examples, because this at least _feels_ like a design that could get messy once you try to do the particularly gnarly regexps that this library claims it was designed for. If it's great, it should already have been done long ago, hmm...
A buddy of mine made Remake (https://docs.rs/remake/0.1.0/remake/) with this kind of thing in mind. It's a DSL for composing regular expressions in a readable way.
I am all for developer ergonomics, and I'm a fan of Ruby... but the problems this library would add to a codebase/project seem too big to be worth the benefits:
- non-standard syntax requiring its own documentation, which developers would have to consult separately (even if they already know regular expressions) to modify generated regular expressions
- removing the ability to test and validate regular expressions independently of the codebase (say, in the terminal, a small shell script, or using an online tool)
- a new rabbit hole to traverse when debugging a problem
- assuming the security risks associated with handing over regex-building to a library built by someone else (even more so if the regex is parsing private or protected data)
- adding a new dependency that may or may not be maintained in the future
For those who would want to use this library, I would suggest using a separate tool to build and/or understand regular expressions. Here's one example, and I'm sure there are others: https://regexr.com/
Anyone looking for a non-Java implementation: This library has been ported to 30+ languages, and you can find a list of them at http://verbalexpressions.github.io/
That really depends on how complicated the regular expression is. For me this debate sounds like arguing assembly vs. C. We will need some sort of abstraction to develop higher-level stuff in case we need it.
I limit my brain-time on constructing a regex to 5 minutes max. If it takes me longer than that, I reach for a parser. Pick the right tool for the job.
Looks interesting. I find that all my regexes are pretty much write-only. When I come back to them a few months later, I can't make much of them, and it's easier for me to start from scratch. Tools such as https://regex101.com/ are amazing, though, for developing regexes and later trying to make sense of them.
Wow, pretty sure I played with something like this for Python in the 90s. People have been trying to replace regexps with something more readable for a long time.
This seems like a decent attempt, although the syntax for captures looks a little clumsy.
I built a little class like this with a fluent API in C# to generate regexes in a project that requires big ones. It makes working with regex super easy and super maintainable.
Trying to simplify something that doesn't inherently simplify isn't always a good idea. Regex is pretty close to the minimum level of abstraction necessary to get the job done. It could probably be improved on, but probably not by much.
Some commenters below mentioned that this Java syntax is a good idea, using the endless number of regex cheatsheets as a testament to why regex is not simple enough and should be replaced. It's almost silly that this is even an argument on HN. Take for example quantum physics: there are lots of videos and guides that try to explain how it works; in fact some of the smartest people have tried to explain it, even Richard Feynman. But he famously said that if you think you understand quantum mechanics, you don't understand quantum mechanics.
Some things cannot be reduced any further, this does not mean those things are always simple in nature or somehow were designed in a convoluted way on purpose.
At least when it comes to regex it's important to keep in mind what Einstein said, "everything should be as simple as possible but no simpler."
It's ironic that people apply reductionism to simplify regex, a thing that itself, one could argue, is a prime example of reductionist design, yet they complain it's too abstract while applying reductionism.
How do I learn regex? I get confused because it seems like maybe there's more than one kind of regex floating around out there, and since regex is made of lots of punctuation symbols, it's very hard to search for things about it on the web. Is there a single book I can read? A couple books? Does it depend on my runtime environment?
The example isn’t a correct URL test regex (far from correct actually - even though there are plenty of edge cases regular regex strings tend to miss also)