Hacker News | new | past | comments | ask | show | jobs | submit | dbaupp's comments

UTF-8 encodes each character into a whole number of bytes (8, 16, 24, or 32 bits), and the 10 continuation marker only appears at the start of the extra continuation bytes; when that bit pattern occurs elsewhere within a byte, it is just data.

You are correct that it never occurs at the start of a byte that isn’t a continuation byte: the first byte of each encoded code point starts with either 0 (ASCII code points) or 11 (non-ASCII).
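
As a quick illustration (a throwaway Rust sketch, not from the thread), printing the bytes of a few characters in binary shows the lead-byte patterns (0, 110, 1110, 11110) and the 10 continuation bytes:

    fn main() {
        for ch in ['A', 'é', '€', '😀'] {
            let mut buf = [0u8; 4];
            let bytes = ch.encode_utf8(&mut buf).as_bytes();
            // The first byte starts with 0, 110, 1110 or 11110;
            // every continuation byte starts with 10.
            let bits: Vec<String> = bytes.iter().map(|b| format!("{b:08b}")).collect();
            println!("{ch} -> {}", bits.join(" "));
        }
    }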


People use stars differently.

https://arxiv.org/pdf/1811.07643 is an exploratory study describing, among other things, four clusters of reasons for starring: to show appreciation, to bookmark, due to usage, and due to third-party recommendation.


FWIW, pex now also has options to unzip the archive to a cache directory on startup (I believe this happens by default now, but I'm not at a computer to confirm), to sidestep the zipapp limitations that you reference.


I just checked, and there is indeed a `--pex-root` option, and even a `-c` option to specify a custom entry point.

Thanks for pointing it out.


As of a few months ago, pex supports including an interpreter, either directly inline or lazily downloaded (and cached): https://docs.pex-tool.org/scie.html


https://qntm.org/abolish explores this idea in a fair amount of detail.


Semgrep also has a CLI that can run offline and without a cloud account.

At work, we use it to enforce a bunch of custom lint rules configured in a YAML file committed directly to our repo, entirely cloud-free.

(I may be overreading your comment as suggesting that these were reasons to use ast-grep over semgrep.)


ast-grep is based on Tree-sitter. I found Semgrep great for simple things, but edge cases made it impossible for complicated things. ast-grep is more difficult for simple cases, but all the information you need is there for complex cases.


Semgrep is also based on Tree-sitter.


As the other sibling commenter said, both `ast-grep` and `gritql` are based on Tree-sitter, which means that you can in fact just look for a certain function call and it will be found no matter how it's formatted; plain grep can't do that, and I'm not sure semgrep always can.

I have used `ast-grep` to devise my own linters with crushing success.


In addition to what you say, it can also be easier for an (appropriately skilled) human to verify a small program than to verify voluminous parsing output; plus, as you say, there's the semi-automated "verification" of a very-wrong program failing to execute.


> That probably should have been the first thing to try.

The point of the post is not "how to make mmap work best with async/await" or "how to optimise mmap with async/await", but exploring the consequences of incorrect code (and thus explaining why one might need potential remedies like those). Sorry if that didn't come across!


I think it's harder for you to write "correct code" here because the crate is hiding most of the actual detail from you. I put that in quotes because there's absolutely nothing incorrect about the code; it's really just suboptimal, most probably because it can't even use the full syscall interface.

Seriously, I hate to be a curmudgeon, but that crate looks like a particularly bad and naive wrapper around mmap. It works very hard to provide you things you don't need when the basic interface is much more flexible. Aside from having to put `unsafe` around the call and re-import the kernel header constants, there's almost no reason to even have this in a crate.
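
To make the comparison concrete, here is a rough sketch of the direct call from Rust via the libc crate (the path is a placeholder and error handling is trimmed):

    use std::{fs::File, os::fd::AsRawFd};

    fn main() -> std::io::Result<()> {
        let file = File::open("data.bin")?; // placeholder path, assumed non-empty
        let len = file.metadata()?.len() as usize;

        // Read-only, shared mapping straight from the syscall wrapper;
        // any PROT_*/MAP_* combination is available here.
        let ptr = unsafe {
            libc::mmap(
                std::ptr::null_mut(),
                len,
                libc::PROT_READ,
                libc::MAP_SHARED,
                file.as_raw_fd(),
                0,
            )
        };
        assert_ne!(ptr, libc::MAP_FAILED);

        // The caller is responsible for aliasing and truncation hazards.
        let bytes = unsafe { std::slice::from_raw_parts(ptr as *const u8, len) };
        println!("first byte: {:?}", bytes.first());

        unsafe { libc::munmap(ptr, len) };
        Ok(())
    }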


I have a feeling we’re talking at cross purposes here: I was actively trying to write incorrect code. This post isn’t about the memmap2 crate specifically at all, it just happens to be a convenient way to get the exact (“incorrect”) syscall I wanted from Rust.
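
For concreteness, the usage amounts to roughly this (a sketch, with a made-up path):

    use std::fs::File;
    use memmap2::Mmap;

    fn main() -> std::io::Result<()> {
        let file = File::open("large-file.bin")?; // made-up path
        // This issues the mmap(2) syscall; touching the returned bytes can
        // then page-fault (i.e. block) at any point.
        let map = unsafe { Mmap::map(&file)? };
        println!("{} bytes mapped", map.len());
        Ok(())
    }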

I see where you’re coming from but… it feels like you’re trying to convince me of something about the post? If you feel like convincing a larger audience of the limitations of the memmap2 crate specifically, I suggest writing your own blog posts and/or getting involved with it. :)


Do you have info about current (production) implementations that increase the number of workers?

In https://tokio.rs/blog/2020-04-preemption#a-note-on-blocking (2020), there's reference to .NET doing this, and an explicit suggestion that Go, Erlang and Java do not, as well as discussion of why Tokio did not.


Yes, it is .NET, as the Tokio blog post references.

Unfortunately, it does not appear to look into .NET's implementation in sufficient detail, and as a result it gets the details somewhat wrong.

Starting with .NET 6, there are two mechanisms that determine the ThreadPool's active thread count: the hill-climbing algorithm and blocking detection.

Hill-climbing is the mechanism that both the Tokio blog post and the articles it references mention. I hope the blog's contents do not reflect the depth of research performed by the Tokio developers, because the coverage has a few obvious issues: it references an article written in 2006, covering .NET Framework, that talks about the heavier and more problematic use cases. As you can expect, the implementation received numerous changes since then, and 14 years later likely shared little with the original code. The performance of the then-available .NET Core 3.1 was also incomparably better, to put it mildly, including tiered compilation in the JIT, which reduced the impact of the startup-like cases that used to be more problematic. Thus, I don't think the observations made in the Tokio post are conclusive regarding the current implementation.

In fact, my interpretation of how various C# codebases evolved over the years is that hill-climbing worked a little too well, enabling ungodly heaps of exceedingly bad code that completely disregarded expected async/await usage and abused the thread pool to oblivion, with the most egregious cases handled by enterprise applications overriding the minimum thread count to a hundred or two and/or increasing the thread injection rate. Luckily, those days are long gone. The community is now in an over-adjustment phase where people would rather unnecessarily contort the code with async than block here and there and let the thread pool work its magic.

There are also other mistakes in the article regarding task granularity, execution time, and behavior, but those are out of scope for this comment.

Anyway, the second mechanism is active blocking detection. This was introduced in .NET 6 with the rewrite of the thread pool implementation in C#. The way it works is that the thread pool exposes a new API that lets all kinds of internal routines notify it that a worker is blocked or about to get blocked. This lets it immediately inject a new thread to avoid starvation, without a wind-up period. This works very well for the most problematic scenarios of abuse (or just unavoidable sync and async interaction around the edges) and further ensures that the "jitter" discussed in the articles does not happen. Later on, the thread pool reclaims idle threads after a delay once it sees they are not performing useful work, via hill-climbing or otherwise.

I've been meaning to put up a small demonstration of hill-climbing in the face of uncooperative blocking for a while, so your question was a good opportunity:

https://github.com/neon-sunset/InteropResilienceDemo (there are additional notes in the README explaining the output and its interpretation).

You can also observe almost-instant mitigation of cooperative (aka through managed means) blocking by running the code from here instead: https://devblogs.microsoft.com/dotnet/performance-improvemen... (second snippet in the section).


Thanks for the up-to-date info.

> .NET 6

(I’m under the impression that this was released in 2021, whereas the linked Tokio post is from 2020. Hopefully that frames the Tokio post’s observations more accurately.)


UPD: Ouch, messed up the Rust lib import path on Unix systems in the demo. Now fixed.


As the author, I don't think there's a clear definition of "blocking" in this space, other than some vibes about an async task not switching back to the executor for too long, for some context-dependent definition of "too long".

It's all fuzzy, and my understanding is that what one use case considers being blocked for too long might be fine for another. For instance, a web server trying to juggle many requests might use async/await for performance and find 0.1ms of blocking too much, whereas a local app that uses async/await for its programming model might be fine with 10ms of "blocking"!

https://tokio.rs/blog/2020-04-preemption#a-note-on-blocking discusses this in more detail.
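
For what it's worth, when a chunk of work is known to block for "too long" by whichever standard applies, one common remedy in Tokio is to hand it to the dedicated blocking pool. A minimal sketch (not from the post; the workload is a stand-in):

    // Requires the `tokio` crate with the "full" (or "rt-multi-thread" +
    // "macros") features enabled.
    #[tokio::main]
    async fn main() {
        let sum = tokio::task::spawn_blocking(|| {
            // Stand-in for CPU-heavy or otherwise blocking work.
            (0u64..10_000_000).fold(0u64, |acc, x| acc.wrapping_add(x))
        })
        .await
        .expect("blocking task panicked");

        println!("sum = {sum}");
    }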

