Writing a SQL database from scratch in Go (eatonphil.com)
430 points by v3gas on April 12, 2020 | 54 comments


Unexpected, but welcome, to see the first part in the series on here right now. I just published the second part [0] today, featuring binary expressions and WHERE filtering. It also updates the REPL to use a prettier table-printing library and a readline implementation.

The repo [1] has some additional bare notes on architecture and links to similar, more mature projects (primarily go-mysql-server and ramsql).

[0] https://notes.eatonphil.com/database-basics-expressions-and-...

[1] https://github.com/eatonphil/gosql


This is great. Do you plan on continuing this blog series? If so I hope you post it here. Cheers.


Definitely! I typically do post here. There are also links for various ways to subscribe on the site itself (e.g. twitter, email).


The idea is nice, but it seems like the text is not explaining much. If you don't have a background in parser writing you have no clue what types are defined there, and why, in the very first code block. If you're not an SQL expert you might still know what SELECT and WHERE are, but do you think everybody knows what the INTO keyword is used for? Where does the lex function come from? Why do I have a lexer and a lex()? Where does the cursor come from that I give into it? Why don't I give it a string? My user input SQL statement is a string, right? Why does the lex function have two for loops? Why can't it be done in one? Does this whole logic actually have a name, maybe? What other alternatives to parsing SQL are out there? Why choose this path?

All that knowledge can't be gathered by just staring at the source code. And for people who understand the source code well enough to gather all this info themselves maybe they don't need the blog post around it, right?

Now all this I don't say to demotivate you. I hope you're strong enough to get over negative feedback and think about rewriting it into something more useful. I see a lot of potential here. But every good writer needs one (or ten million Hacker News-reading) editors to get to the really good content.


Great blog post!

I'd just like to add for the curious, that usually you'd use goyacc for parsing SQL.

And most serious SQL projects in Go have started with the SQL parser from vitess and adapted it to their use case (which is just funny trivia, but for anything big I recommend it; I did the same for OctoSQL [0]).

[0]: https://github.com/cube2222/octosql


This is not the first educational project I've seen that insists on manually writing the parser.

Call me contrarian but I find the article's approach poor (at least, as it is now).

If one looks at the two articles of the series, 70-80% of the content (or possibly more) is parsing, which is arguably the least interesting, or at least the most boilerplate, part of a database system.


I like this project! I think I've asked you about it supporting parquet before... I'll add it to my list of similar projects.

Regarding goyacc and "usually", I myself was curious about this years ago and asked on Stack Exchange: Do modern languages still use parser generators? [0].

Most of the time, major languages _do_ use hand-written parsers.

Of course it may literally be the case that most other SQL projects in Go do use goyacc. My point is that in general, hand-written parsers are more common in real software than you might think.

[0] https://softwareengineering.stackexchange.com/questions/2502...


Indeed, I remember! :) The parquet support is mostly ready, as far as I know, here: https://github.com/cube2222/octosql/pull/153/ - but I haven't tested it myself yet. However, we'll surely have it ready for the "streaming" release which is planned for May/June.

Regarding the parsing part, that's interesting. Though the cockroachdb parser is _great_ reading on how to add nice error messages with goyacc. (You add a special error token which captures anything after a syntax error, so the parser doesn't break down. At least something along those lines; I haven't dived deep into it yet.)

Though I know Go itself uses a handwritten parser.


Why did all of the Golang SQL parsers come from Vitess? I would love to know more about this history.

Was it because it was the first and people just started using it, or is it the best for some reason?


SQL is a humongous spec. Any serious project would rather piggy-back off an existing parser. The most interesting part for most people is implementing backends against in-memory storage, disk, S3, HDFS, etc.


Also, a contributor to dolthub shared on Reddit last time this post came up that they originally wrote their own SQL frontend but gave up because it was so much to maintain and get correct. They ended up going with go-mysql-server which uses vitess.

https://www.reddit.com/r/golang/comments/fgwwlx/database_bas...


No idea about the history.

From my perspective (as the author of OctoSQL): I need a SQL parser. Vitess has a fully fledged, MySQL-compatible one. I'd probably want to copy the open source vitess one and add the necessary features myself, instead of writing it from scratch. It's just a big endeavor: a few thousand lines of goyacc code and even more Go boilerplate.

I'm not saying everybody is using the same library, because the code has been heavily modified after being copied. A look at the cockroachdb parser suffices to see how much it has changed.

If you look at OctoSQL, there too many changes have been made, to add stuff like table-valued functions.


Probably the first article I've ever read about lexical parsing with code that I have actually understood - and I don't even program in Go. Great job.


It's recursive-descent --- a pretty common and intuitive way to write a parser in general.
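
For anyone who hasn't seen the shape before, here's a toy recursive-descent sketch in Go (arithmetic over a pre-tokenized slice, not SQL, and nothing to do with the article's actual code): each grammar rule becomes one function, and rules that reference each other become functions that call each other.

    package main

    import (
        "fmt"
        "strconv"
    )

    // Toy grammar:
    //   expr := term { "+" term }
    //   term := NUMBER | "(" expr ")"
    type parser struct {
        toks []string // pre-tokenized input
        pos  int
    }

    func (p *parser) peek() string {
        if p.pos < len(p.toks) {
            return p.toks[p.pos]
        }
        return ""
    }

    // expr := term { "+" term }
    func (p *parser) expr() (int, error) {
        n, err := p.term()
        if err != nil {
            return 0, err
        }
        for p.peek() == "+" {
            p.pos++
            m, err := p.term()
            if err != nil {
                return 0, err
            }
            n += m
        }
        return n, nil
    }

    // term := NUMBER | "(" expr ")"
    func (p *parser) term() (int, error) {
        if p.peek() == "(" {
            p.pos++
            n, err := p.expr() // the mutual recursion that gives the technique its name
            if err != nil {
                return 0, err
            }
            if p.peek() != ")" {
                return 0, fmt.Errorf(`expected ")", got %q`, p.peek())
            }
            p.pos++
            return n, nil
        }
        n, err := strconv.Atoi(p.peek())
        if err != nil {
            return 0, fmt.Errorf("expected number, got %q", p.peek())
        }
        p.pos++
        return n, nil
    }

    func main() {
        p := &parser{toks: []string{"1", "+", "(", "2", "+", "3", ")"}}
        n, err := p.expr()
        fmt.Println(n, err) // 6 <nil>
    }

A SQL parser written this way has the same shape, just with one function per statement or clause instead of per arithmetic rule.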


Lexical analysis using these bespoke methods (writing the finite state machine) is so tedious and error-prone. I don't have that much experience, but I just went through Crafting Interpreters and replaced this same module with https://github.com/J-F-Liu/pom, which is a parser combinator library, and it was way easier.


> which is a parser combinator library, and it was way easier

I like parser combinators but a word of warning.

A lot of parser combinators (not all) have problems with recursive grammars such as JSON and SQL [1].

For instance, a JSON map can contain a JSON map and this can lead to stack overflows when defining the grammar.
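
To make that failure mode concrete: if the grammar is built eagerly, a rule like "a value is a literal or a bracketed value" refers to itself before any input is parsed. The usual workaround is to wrap the self-reference in a function so it's only resolved when the parser actually runs. Here's a toy Go sketch of that idea (hand-rolled combinators, not pom or any particular library):

    package main

    import (
        "fmt"
        "strings"
    )

    // Parser tries to match s starting at pos and reports the new position.
    type Parser func(s string, pos int) (next int, ok bool)

    // lit matches a literal token.
    func lit(tok string) Parser {
        return func(s string, pos int) (int, bool) {
            if strings.HasPrefix(s[pos:], tok) {
                return pos + len(tok), true
            }
            return pos, false
        }
    }

    // seq matches parsers one after another.
    func seq(ps ...Parser) Parser {
        return func(s string, pos int) (int, bool) {
            for _, p := range ps {
                var ok bool
                if pos, ok = p(s, pos); !ok {
                    return pos, false
                }
            }
            return pos, true
        }
    }

    // alt takes the first parser that succeeds.
    func alt(ps ...Parser) Parser {
        return func(s string, pos int) (int, bool) {
            for _, p := range ps {
                if next, ok := p(s, pos); ok {
                    return next, true
                }
            }
            return pos, false
        }
    }

    // lazy defers building a parser until it runs, so a grammar can
    // refer to itself without blowing up while being defined.
    func lazy(f func() Parser) Parser {
        return func(s string, pos int) (int, bool) {
            return f()(s, pos)
        }
    }

    // value := "1" | "[" value "]"   (a toy recursive grammar)
    var value Parser

    func init() {
        value = alt(
            lit("1"),
            seq(lit("["), lazy(func() Parser { return value }), lit("]")),
        )
    }

    func main() {
        _, ok := value("[[1]]", 0)
        fmt.Println(ok) // true
    }

In Go specifically the eager version tends to fail as an initialization cycle or a nil parser rather than a stack overflow, but the shape of the problem is the same.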

[1] https://fsharpforfunandprofit.com/posts/understanding-parser...


Thanks for the link, will take a look. Like I said, I don't have that much experience.


I guess it's relevant.

Here is a great set of videos from "Low Level Javascript" channel explaining how to write parser combinators from scratch. - https://www.youtube.com/watch?v=6oQLRhw5Ah0&list=PLP29wDx6Qm...


Parser combinators are neat. nom is my go-to in Rust now; it's nice not having to separate your lexer and parser in simpler grammars.


(it’s actually also straightforward to write a recursive descent parser directly on the char stream w/o a lexing step.)


how? it's not like you can magically skip the actual tokenization. you're basically saying you can do lexical analysis and semantic analysis in the same function. sure but that makes the code that much hairier - there's a reason why they're typically factored into a lexer and a parser.


> you're basically saying you can do lexical analysis and semantic analysis in the same function.

yes

> that makes the code that much hairier

true, but it does save you the rigamarole of the lexer.

    // s is the raw input and i a cursor into it; helpers like word and is
    // peek at s[i:], and skipSpace advances the cursor past whitespace.
    func parseDeclaration(s string, i0 Pos) (n Node, i Pos, err error) {
        i = i0
        i = skipSpace(s, i)
        if word(s, i, "var") {
            v := &ast.VarDecl{VarPos: i}
            i += 3
            i = skipSpace(s, i)
            if v.Ident, i, err = parseIdent(s, i); err != nil {
                return
            }
            i = skipSpace(s, i)
            if is(s, i, "=") {
                i += 1
                i = skipSpace(s, i)
                if v.Value, i, err = parseExpr(s, i); err != nil {
                    return
                }
            }
            if !is(s, i, ";") {
                return nil, i, fmt.Errorf(`unexpected input or EOF (expected ";")`)
            }
            ⋮


Possible sure, but without a combinator framework, you have to do a ton of the rote work yourself.


    selectKeyword keyword = "select"

I don't work with Go, so this may be a requirement of the language that I don't know, but whenever I see lines like this, it automatically raises the question: why? Do you really expect to need to rename the SELECT keyword? Especially when it's named "selectKeyword". Why not just use the string literal? Ditto for the others like "leftparenSymbol" --- I see there are explicit character constants in some of the other code too... it reminds me of the classic anti-pattern like "int five = 5;".

Also, you may find the full SQL grammars interesting to look through --- they are quite a bit more complex than the subset presented in the article: https://ronsavage.github.io/SQL/


The biggest reason is typo and type checking. Any typo is a compilation error. But the type checking is even more interesting: you basically create a subtype of string which is correspondingly type-checked by the compiler. Functions might return a keyword instead of a string.

One funny thing you can do in Go, for example, is to create a subtype of int and define its String() method to return the hexadecimal representation of that int. Very neat.
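
A minimal sketch of that trick (hexInt is just an illustrative name):

    package main

    import "fmt"

    // hexInt is a defined type with int as its underlying type; giving it
    // a String() method changes how fmt prints it.
    type hexInt int

    func (h hexInt) String() string {
        return fmt.Sprintf("0x%x", int(h))
    }

    func main() {
        fmt.Println(42)         // 42
        fmt.Println(hexInt(42)) // 0x2a
    }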


The IDE will autocomplete selectKeyword but not "select".

Other type systems, like TypeScript, support string literal types, and in TypeScript we generally would just define the keyword type to be a union of string literals and then use those string literals in the code.

But without string literal types, using a string constant gives the compiler more information it can use to make sure your code is correct. That's a useful thing to have.


Another reason is that by providing an identifier for this string literal, misspellings of it can be detected by the compiler whereas there is nothing tying separate string literals together.


I can hardly imagine a case where you would misspell "select" and not notice it at some point, nor use it in more than one place (the keyword detector) in the parser.


The pattern (which I employ only sometimes) is to have almost all literals defined in this way. Perhaps SELECT isn’t likely to be the word you misspell, but I’m a careless typist and make 10 typos a minute. Taking this added step helps your editor save you from yourself.


Having standards like that and keeping them helps a lot. Next time you have a different keyword, you don't have to think "does it deserve a constant?" - all of them do.

Similar to how linters stop you from overthinking indentation in specific cases or particular naming standards, and spare you later rename/reformat wars.


> all of them do.

That's what leads to dogmatic cargo-culting. Good software is written by thinking about the circumstances and doing what makes the most sense, not by mindlessly following rules that don't always make sense.

I don't know why someone would be so worried about typos and introduce more verbosity and redundancy in the process; but then again, I don't use an IDE and I've never had this problem.


I'm not sure you're really making the argument you think here...

Good software may mean thinking about circumstances like "we're dealing with lots of text parsing and have seen bugs from typos in common keywords" and doing what makes the most sense: "let's prevent those in the future by using constants for keywords". It's only cargo-culting if you don't know why you're doing something.

Setting a rule so you don't have to debate something later is a valid solution, and may well offset the redundancy introduced this way.


You'd be surprised...


It’s one of the more common bugs in our office. Before we integrated a Python linter in our CI for a prototype project, we would see several such typo errors each week with a small team (and that was with symbols, not string literals).


Python and a lot of the other dynamic languages are in a slightly different situation.


The point is that if you can make typos in identifiers in (unlinted) Python then you can make typos in string literals in Go. In both cases, there is no static analysis to help you.


The biggest reason to do this in Go is to approximate enums. Now anywhere the `keyword` type is used, it must be one of the ones specified there (or you'd have to cast a string).
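
Roughly like this, using the keyword type from the article (the extra constants and the expect function are made up for illustration):

    package main

    type keyword string

    const (
        selectKeyword keyword = "select"
        fromKeyword   keyword = "from"
        whereKeyword  keyword = "where"
    )

    // expect takes a keyword, not an arbitrary string.
    func expect(k keyword) bool { return k != "" }

    func main() {
        var userInput string = "select"

        expect(selectKeyword)      // ok
        expect(keyword(userInput)) // a string variable needs an explicit conversion
        // expect(userInput)       // does not compile: string is not keyword
        // expect(selectKeywrod)   // typo in the identifier: caught at compile time
    }

An untyped literal like expect("select") would still compile, so it's an approximation of enums rather than the real thing, but string variables need an explicit conversion and typos in the identifiers are caught at compile time.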

Yep, it would be a little harder to implement all of the SQL spec in a single post. Maybe over time though.


Aside from obviously preventing typos, I really like being able to very quickly find all references that use the string.

For example when storing session state into a Dictionary<string,object>, all access to the session state is through static strings, never a string literal. This makes it super easy to find everywhere that’s accessing that particular session variable.


Curious - would Rust be more appropriate than Go for such a task?


Golang works too. There are a few databases written in Golang, such as Prometheus, InfluxDB, CockroachDB, or TiDB.

https://github.com/topics/database?l=go&o=desc&s=stars


Note that neither CockroachDB nor TiDB uses Golang for its actual storage engine, which in both cases is RocksDB (written in C++). They do use Golang for SQL parsing though, which is what this post was mostly about.


However, cockroachdb does all other work (including query execution) in Go too.

There's also Dgraph, which uses Badger as its storage engine - an all-Go stack.

And Badger does compare favorably to RocksDB under certain workloads.


> Note that neither CockroachDB [...] uses Golang for its actual storage engine

We do, now. We're looking to move away from RocksDB to https://github.com/cockroachdb/pebble/.


Woah, that's big news. Thanks for sharing.


VictoriaMetrics [1] is written entirely in Go. By default it uses the canonical zstd library for compression (that library is written in C), but it supports a pure Go mode when built with `make victoria-metrics-pure`. In this mode it uses a zstd implementation written in Go [2].

[1] https://github.com/VictoriaMetrics/VictoriaMetrics

[2] https://github.com/klauspost/compress/tree/master/zstd#zstd


Rust would be more appropriate if your intention was to use this in production. Databases and GC don't mix well. (It's been done, but it makes tuning a nightmare.)

As much as I love Rust, its learning curve is high and I'm sure OP doesn't want to spend half his article teaching all the intricacies of types and the borrow checker. Go is easy to learn over a weekend, so it's probably a better medium for illustrating the concepts, as everything is laid out simply.


A garbage collector is not inherently incompatible with a low-latency database. I was happy enough with the performance of Go's garbage collector to have built Prometheus, the time series database, on it back in 2012, when the collector was considerably more naive.

https://blog.golang.org/ismmkeynote


Truth is I've written about how to parse in better languages before (JavaScript, Python, Standard ML, etc.) and I wanted to figure out a good approach for doing it in Go.

As for databases and GC, you may be right but there are still major databases out there written in Java for example.

https://www.quora.com/Which-databases-data-stores-are-writte...


Sometimes I wonder what Go would look like with a borrow checker...


Do you know of a tutorial or a book that helps you write a database from scratch, in either Java or C?



Thanks!


Are you going to add persistence in an upcoming part?


Probably! That's the most interesting and most difficult part so it may take me a while.



