It's always neat to see where one's ideas go! AFAIK, I was the first person to create dynamic railroad diagrams for regular expressions (maybe 12 or 13 years ago). I got the idea from json.org, which I think was Douglas Crockford's brainchild.
My initial implementation was strfriend.com (in Lisp: well under 1,000 lines, including views), and I think its main claim to fame was that Jeff Atwood made fun of it on Twitter. (I was truly clueless at promoting myself back then. Not only did I not have a Twitter account, but it didn't occur to me to submit it to HN.)
Every so often, I toy around with the idea of making it into a proper local/native application -- maybe someday. In the meantime, my obsession with regular expressions lives on in my current application (see bio) where I parse and rewrite the user's entered regex syntax (ICU) into whatever regex syntax the backend requires. I don't know of any other application that does this, but I predict that in 15 years it'll be commonplace (and I'll still be poor)!
Railroad diagrams for regular expressions were common long before 2005, and generating such images on demand isn't that unusual, so your work is unlikely to have been a major influence here. Similarly, automatically translating regular expressions from one engine to another is something that people have done before (out of necessity, for compatibility).
Do you have an example? I love reading original sources. I'd never seen a regex as a railroad diagram before that, though I admit it's entirely possible I'd seen it somewhere and forgotten.
I don't know of any software that translates regular expressions, either, though I'm sure I can't be the first.
You seem to be chasing the idea that your idea was somehow formative, even though pretty much nobody will have seen it, and even though you yourself don't know whether there were earlier instances.
Railroad diagrams are also called syntax diagrams [1] and were used in print as early as 1973 [2]. They are an obvious kind of diagram to use for regular expressions, which express a grammar/syntax.
Basically, many people will have had the idea, independently, because representing a regex as a railroad diagram is an obvious invention, as is converting them between variants. Doing these on demand is also obvious.
Trying to take some sort of credit (even as inspiration) for other people's work is not likely to make you many friends. The world is a vast place with many things going on independently. Keep on inventing.
I think the first place I saw such a diagram was in the Smalltalk blue book. It has railroad diagrams for the language grammar. So not regular expressions, but very similar.
According to Wikipedia's article on railroad diagrams, one of their first appearances was in the "Pascal User Manual", written by Niklaus Wirth in 1973. Hardly a new idea. It's been used in academia for ages when first teaching regular expressions and Extended Backus–Naur form to students.
Couldn't find Jeff Atwood's comment on Twitter, but he did have this to say on Stack Overflow back in 2009 [answered Jan 13 '09 at 10:15][https://meta.stackexchange.com/a/79880]:
It's even worse: those strfriend URLs in the form of
I was taught the algorithms to do this stuff in my Computer Science class over 20 years ago (RE is equivalent to DFA). You weren't the first person to implement regular expression visualizations.
It was pretty hard to do viz on the web 13 years ago. In 2006 I made a web tool to turn regular expressions into NFAs and DFAs and animate their states as you typed. It took a lot of code (drawing and animating along Bézier curves, AJAX to a server for Graphviz, and a regex compilation and minimization package I wrote for this). https://imgur.com/gallery/Yqqoh
These days there’s a lot more tooling and components that can snap together to make this kind of thing.
> A dynamic visualization on the web 13 years ago?
Depends on what you count as dynamic, but I think so, yes. There was one in use at my Uni ~2000. Obviously there were no fancy canvas/SVG options to work with client-side (though IIRC Flash was very much a thing by then, so that could have been used), so it produced an image server-side that was updated when you submitted a change.
Not entirely dynamic due to the manual post to the server for each update of the diagram, but it counts IMO. It could have been more automated via the JS/DOM methods available at the time, I'm sure; I can think of a couple of ways, but I don't remember it being so.
> He may have been the first.
I'd say not the first. He may well have come up with the combination of ideas independently. How many times have you thought "X would be brilliant, I'm a genius", only to find when describing the idea to others that several "geniuses" preempted you and it already exists (or worse: it has been tried and proven to be a terrible idea in practice)! It has certainly happened to me a fair few times, and back then it wouldn't have been as easy to search out similar ideas/implementations.
That first one doesn't seem very good. It seems like there are many places where states could be merged. E.g., there are 4 different "^0" states and 3 "^1" states. Or am I misreading something?
A regex doesn't need to be optimized in how it is written if the matcher is DFA-based, because the minimal DFA is unique. Regex engines, however, are more complex than that, and the structure this shows isn't going to be how the expression is actually recognized.
This should really only be used for understanding what a regex matches, not as an optimization tool. In that regard, it should display the simplest graph possible to aid understanding.
I actually started typing in random characters and tracing the matches through the graph for the regex from Stack Overflow mentioned earlier. I didn't know what the regex matches, so I played it like a game, trying to reach the finish by typing one character at a time.
When I finished, I saw that I had typed "01/31/1691" and then realised that it's a regex for dates.
I think regexper is more suited to complex structures, whereas regexr is better for checking the meaning of simple expressions because of its hover-over function.
Emacs also has a visual regexp builder mode (M-x re-builder) that shows the first 200 matches in the current buffer to validate that the regular expression does what you want. But AFAIR there are syntax differences between regexp flavors, and Emacs of course uses the Elisp flavor.
After 20 years of software development I've come to adopt a best practice:
Whenever I start writing a regular expression, I stop and write a "manual" domain-specific parse function instead.
Saved me a LOT of debugging time.
Since I can now use Kotlin pretty much anywhere (JVM, browser, shell scripts), this is easy because of the superb stdlib ("startsWith", "lastIndexOf", "substringBeforeLast(...)").
The time saved I invest in unit tests for the parser.
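As a tiny sketch of that approach (my own example, not the parent's code), say for pulling the key and value out of a "key = value" line:

    // A minimal "manual parse function" using only stdlib string helpers.
    // The key=value format is an invented example for illustration.
    fun parseAssignment(line: String): Pair<String, String>? {
        if ('=' !in line) return null
        val key = line.substringBefore('=').trim()
        val value = line.substringAfter('=').trim()
        return if (key.isEmpty()) null else key to value
    }

    fun main() {
        println(parseAssignment("Timeout = 30"))  // (Timeout, 30)
        println(parseAssignment("no equals"))     // null
    }

Every early return is an obvious spot for one of those unit tests.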
I can't shake the feeling that Regexp could be written just as efficiently as a fluent interface with a more human-friendly syntax.
I've been telling Jr devs bucking for promotion for years to explain what they're doing in plain English, then write code that looks like that. Basically telling them to skip right over the "gee, look what a clever fuck I am" stage and write good code instead of creating riddles.
The Regexp problem just screams this at me. What am I doing? I'm looking for a line that starts with a capital T, then has some quantity of alphanumeric characters greater than n (if n is not 0, 1, or infinity, this requires extra work in Regex), followed by an equals sign with or without whitespace characters around it.
Give me an API that does exactly that, instead of Regex. Something gets lost in translation every time.
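FWIW, here's a sketch of what such an API could look like in Kotlin. All the names here are invented for illustration; this isn't an existing library, just a builder that compiles to a regex underneath:

    // A hypothetical fluent builder that compiles down to a regex.
    // Method names are invented for this sketch.
    class PatternBuilder {
        private val sb = StringBuilder()
        fun literal(c: Char) = apply { sb.append(Regex.escape(c.toString())) }
        fun atLeast(n: Int, charClass: String) = apply { sb.append("$charClass{$n,}") }
        fun optionalWhitespace() = apply { sb.append("\\s*") }
        fun build() = Regex(sb.toString())
    }

    fun main() {
        // "starts with a capital T, then at least n alphanumerics,
        //  then an equals sign with or without whitespace around it"
        val n = 3
        val p = PatternBuilder()
            .literal('T')
            .atLeast(n, "[A-Za-z0-9]")
            .optionalWhitespace()
            .literal('=')
            .build()
        println(p.containsMatchIn("Tabc123 ="))  // true
    }

The call chain reads almost exactly like the plain-English description above, which is the whole appeal.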
I think the fact that the origin of Regex is the command line interface is pretty telling. We didn't and we don't have a convenient way to type in imperative code on a command line. So an arcane syntax was created so you could do the whole thing in a quarter line of text.
Speaking as someone who has had a Unix shell for 25 years, and routinely works on mini tools for their fellow developers, I don't think we actually type stuff into a shell that often anymore. The difference between documenting a one-liner in a README and just building a shell script that does the same thing is not that big. There's a difference in development effort but building a script can allow you access to a debugger. Personally, I'd be willing to pay that tax any day.
> I can't shake the feeling that Regexp could be written just as efficiently as a fluent interface with a more human-friendly syntax.
You can use SRL - Simple Regex Language (https://simple-regex.com/) for making readable regex/matching rules. It is supported in C++, Java, C#, PHP, JavaScript, and Python. Also, you can use the web version to generate the equivalent regex if your language is one of the above.
Here is an example from the website for matching an e-mail address:
begin with any of (digit, letter, one of "._%+-") once or more,
literally "@",
any of (digit, letter, one of ".-") once or more,
literally ".",
letter at least 2 times,
must end, case insensitive
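For reference, that SRL is roughly equivalent to the following pattern (my hand translation; SRL's actual generated output may differ cosmetically):

    // Approximately the regex described by the SRL snippet above.
    // Hand-written here, so it may not match SRL's output byte for byte.
    val email = Regex("""[0-9a-z._%+-]+@[0-9a-z.-]+\.[a-z]{2,}""", RegexOption.IGNORE_CASE)

    fun main() {
        // matches() anchors at both ends, mirroring "begin with ... must end"
        println(email.matches("user.name+tag@example-mail.co"))  // true
    }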
import Text.Parsec
import Text.Parsec.String (Parser)

regexReplaced :: Int -> Parser String
regexReplaced n = do
  char 'T'
  spaces
  x <- count n alphaNum   -- exactly n alphanumeric characters
  y <- many alphaNum      -- then any number more
  spaces
  char '='
  return (x ++ y)
The above is a Haskell function that does the parsing required above, returning the alphanumeric characters if the parse succeeds and an error if it does not. You may not speak Haskell, but this is probably still more readable than (n) => new RegExp(`T\\s(\\w{${n}}\\w*)\\s*=`), which is the JavaScript function that builds a similar matcher.
Yep. I primarily use Go, so I'm often forced down this path because it uses a simpler regex engine. I used to complain, but in hindsight I realized it was a blessing in disguise. Particularly in Go's case, it has excellent character-set library support, especially for Unicode, so those really tricky corner cases with Unicode characters are non-existent now as well. I will be happy if I never see a regex with a Unicode range again.
Here is a regexp to match an IPv4 address - the diagram looks quite nice and easy to understand compared to the regexp itself! In fact the visualisation makes it easy to spot the mistake.
That covers the most common format accepted by the BSD and POSIX inet_* functions, but misses the less common ones.
If anyone wants to have a go at a more complete one, here are some test cases for you that it misses, using Google's well-known public name server 8.8.4.4 (see the sketch after the list for why they're valid). These all work in the classic command-line tools like ping on macOS, Linux, and Windows:
134743044
8.525316
8.8.1028
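If it helps, here's a rough sketch of why those all resolve to the same host. This is my illustration of the classic inet_aton grouping rules, decimal forms only, not code from the parent comment:

    // Sketch of the classic BSD inet_aton address forms, decimal only
    // (the real function also accepts octal and hex per strtoul).
    fun inetAton(s: String): Long? {
        val parts = s.split(".").map { it.toLongOrNull() ?: return null }
        if (parts.any { it < 0 }) return null
        return when (parts.size) {
            1 -> parts[0].takeIf { it <= 0xFFFFFFFFL }                     // a (all 32 bits)
            2 -> if (parts[0] <= 0xFF && parts[1] <= 0xFFFFFF)             // a.b (8 + 24 bits)
                     (parts[0] shl 24) or parts[1] else null
            3 -> if (parts[0] <= 0xFF && parts[1] <= 0xFF && parts[2] <= 0xFFFF)
                     (parts[0] shl 24) or (parts[1] shl 16) or parts[2]    // a.b.c (8 + 8 + 16)
                 else null
            4 -> if (parts.all { it <= 0xFF })                             // a.b.c.d
                     (parts[0] shl 24) or (parts[1] shl 16) or (parts[2] shl 8) or parts[3]
                 else null
            else -> null
        }
    }

    fun main() {
        // All of these print 134743044, i.e. 8.8.4.4.
        listOf("8.8.4.4", "8.8.1028", "8.525316", "134743044").forEach {
            println("$it -> ${inetAton(it)}")
        }
    }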
POSIX and BSD also allow the numbers to be written in hex:
I don't see the problem. Though not a host address, that's known in the sockets networking API as INADDR_ANY, useful for binding a socket to listening on all networks, for instance.
The GNU C Library getaddrinfo accepts 000.000.000.000 with the leading zeros and all; I just tried.
It is important to support special addresses like 255.255.255.255 and 0.0.0.0 in the dot notation. For instance, in the configuration of some daemon, you may need to be able to specify that the bind address is 0.0.0.0. The value can't be rejected due to not being an address.
Also, you need to be able specify network as opposed to host addresses, and netmasks. You know, like 10.0.0.0 and so on.
Great explanation, thank you! The 24-bit and 16-bit variants are highly usable, but they are often overlooked by regexps; the biggest problem is web apps. Being able to use octal and hex is even less common.
I referred to the double colon as something I saw used with IPv6 to indicate a sequence of 0's.
I wasn't aware that inet_aton supported all these forms and the regex I provide won't parse them. It seems like inet_aton supports specifying the numbers in octal and hex too.
Looks neat, but after a quick look I think I still like https://www.debuggex.com/ better. Going step by step through the regex for a given string is really a killer feature for me.
Yep, and not having to click the "Display" button each time.
Also, partially highlighting the text you write is a pretty hard feature to implement; I did it once. Kudos to debuggex.com for working correctly even with browser zoom on.
Wow. Despite the utterly insane complexity of a regex of that size, it doesn't seem to do an insane amount of branching. Maximum depth of choices seems to be about 4, which is less than a lot of other regex examples I've seen here.
That being said... that regex is just noise. Nobody can tackle it all at once or by themselves, unless they specialise in just regex. It's 6599 characters, at least on my system. So, at a wild stab, it's the equivalent of 4,500 lines of obfuscated code.
You can't audit it, you just kinda have to trust it, and hope.
But with something like regexper, I can at least read it.
I can't remember where, but there's some version of that regex somewhere that uses variables for interpolation in subsequent regexes.
Viewed like that, it's really not that complex; most of it is repetition of previously used regex sequences. It's only when fully expanded that it becomes so humongous.
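Something like this style (the fragments here are simplified illustrations, not Friedl's actual variable names or definitions):

    // Building a big regex out of named pieces, in the interpolation
    // style described above. Fragments are simplified for illustration.
    val atom = """[A-Za-z0-9!#%&'*+/=?^_-]+"""
    val dotAtom = """$atom(?:\.$atom)*"""
    val addrSpec = Regex("$dotAtom@$dotAtom")

    fun main() {
        println(addrSpec.matches("user.name@example.com"))  // true
    }

Each named piece is small enough to review on its own; only the final expanded string is unreadable.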
The pattern is Jeffrey Friedl's, from his book Mastering Regular Expressions.
And as clear as the source could be made, I feel the fact that so many people have just copied and pasted it means that any understanding is lost, and they're just praying and hoping, because a regex of that size is actually difficult for them to comprehend.
This is a nifty tool, and to be honest I was unaware that regex visualization was a thing before now. I usually write comments in BNF next to my regex, so that I can make sense of them later. I'm going to keep doing that, but visualization is going to be great for debugging and for figuring out other people's less carefully commented regex.
I love the graphics. I would like the equivalent for Python and the GNU flex (lexer) REs.
No... wait.... I know what I want! I want a sphinx extension that allows me to include an RE in a Python docstring and have it render as a railroad track graphic in the generated documentation:
some_re_string = r'ab[0-9]*z'
""":regex: Any string starting with 'ab', followed by
digits, ending with 'z'.
"""
That should be able to go pick up the documented string and render a railroad track diagram along with the text in the generated documentation.
I'd love to have it the other way around: human speech to RegExp. I find it really hard to write these expressions, e.g. a colleague needed to write a RegExp for a nickname alias for a URL. It had to have only letters and numbers, or a specific number. So either "name34", "na34me", or "23" would be valid.
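Until such a tool exists, one hedged reading of that spec (it's a bit fuzzy as stated), in Kotlin:

    // A guess at the alias rule described above: letters and digits mixed
    // (at least one letter), or a purely numeric alias. This is my reading
    // of the spec, not the colleague's actual regex.
    val alias = Regex("(?=.*[A-Za-z])[A-Za-z0-9]+|[0-9]+")

    fun main() {
        for (s in listOf("name34", "na34me", "23", "bad alias!")) {
            println("$s -> ${alias.matches(s)}")
        }
    }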
I'm an rx geek and can often craft what I'm looking to get without a lot of help, but I have used this tool many times before -- it's very slick. It would be nice if it supported something other than JavaScript[0], but hey, it's on the web, it probably makes a lot of sense to be that way (and it's nice that it's all client-side and I don't have to wait for it to ship my regular expression back to a server for processing).
Regular expressions are simply awesome, and I'm continually surprised at how frequently I run into developers who have next-to-no understanding of them. Case in point: I ran into some code a few months ago that spanned two methods and 20 lines to do something that a 6-character regular expression could have solved (and would have done so more performantly[1]); the best part was that part of what I was responsible for handling was a bug that ended up residing right inside one of those methods. And then there's all of the things related to "dealing with strings" that many regular expression libraries just handle, such as "\d" vs "[0-9]" in a world with Unicode strings[3]. It feels cryptic[4] when you encounter it and you're not familiar with the syntax, but to learn the "80% most useful parts", you needn't study much more than content that would fit on a single printed sheet of paper (and to get the last 20%, you'd need maybe 2 or 3).
All of that said, there's also the other side of the coin; if ever the saying "If all you have is a hammer, everything looks like a nail" had application, it's with regular expressions. I'm not sure how many times the question "How do I write a regular expression to parse HTML?" has to be answered with "don't" before folks quit trying[2]. It tends to be the first thing I reach for when I have a need to process text, even when there are better tools; heck, all of my find/replace dialogues in every application that supports it have the "Regex" box checked by default (and it really throws me off when I hit up "Find" in the browser and need to search for something with a ( or ) in it, which I escape due to muscle memory).
[0] I have an occasional need for PCRE and .NET style; and I really miss named-groups when I have to do something complex in JavaScript.
[1] While it's easy to accidentally end up in hell, à la https://blog.codinghorror.com/regex-performance/, poorly written string-search code can be worse when the complexity of the pattern you're searching for reaches a certain point, and that's to say nothing of the errors per x lines of code and readability (not that rx is particularly readable under complexity).
[2] And hey, I've got a shell script that downloads a few status pages on my server at home that uses awk with regular expressions to extract values from a web page. I wouldn't say it necessarily qualifies as "parsing HTML" since it's really only concerned with looking for a small string which it filters a second time to get the value -- horribly inefficient, but it's worked for 5 years through page changes without requiring adjustment.
[4] While it's usually written cryptically, many (most?) implementations support flags to ignore whitespace and support comment features. I've had a few crazy-ugly rx's that I had to use to extract data from a ticketing system's "blob field" to insert into a structured format; were it not for that feature, it would have been impossible to write and support.
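As a small illustration of that flag (shown here in Kotlin, where the JVM's Pattern.COMMENTS is exposed as RegexOption.COMMENTS), using the date pattern from upthread:

    // Ignore-whitespace/comments mode: whitespace in the pattern is
    // skipped and '#' starts a comment, so the regex can be laid out.
    val date = Regex("""
        (\d{2}) / (\d{2}) / (\d{4})   # month / day / year
    """, RegexOption.COMMENTS)

    fun main() {
        println(date.matches("01/31/1691"))  // true
    }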