Hacker News | abcdabcd987's comments

Related discussion on serving finetuned LLMs: https://news.ycombinator.com/item?id=38196661


Thanks for your encouragement! We are working on quantization as well. We recently submitted a paper, Atom [1], which uses 4-bit quantization to deliver 7.73x higher throughput than FP16 and 2.53x higher than INT8. Atom maintains a perplexity (i.e., model accuracy) close to FP16, outperforming existing quantization approaches.
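As a rough illustration only (not Atom's actual scheme, which mixes INT4/INT8, quantizes activations too, and handles outliers separately), per-group symmetric 4-bit weight quantization looks something like this:

```python
import numpy as np

def quantize_4bit(w, group_size=128):
    # Per-group symmetric INT4: one floating-point scale per group of weights.
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0  # INT4 range [-7, 7]
    scale = np.maximum(scale, 1e-8)                          # avoid div-by-zero
    q = np.clip(np.round(groups / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale):
    # Recover an FP approximation; error is at most 0.5 * scale per element.
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, scale = quantize_4bit(w)
w_hat = dequantize_4bit(q, scale)  # close to w, within half a scale step
```

The throughput win comes from packing two INT4 weights per byte and using low-precision matmuls; the sketch above only shows the numerics.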

We are polishing the 4-bit code. It will be added to the Punica codebase soon. Please stay tuned :)

[1] https://arxiv.org/abs/2310.19102


Added to my reading list! The world of quantization is moving so fast that even TheBloke might not be able to keep up!

So Atom base models would be compatible with Punica?

I also wonder: many people already train LoRAs with the base model in 8-bit or even 4-bit. Would it make sense to match the quantization algorithm used during training and inference?


Certainly! We'd like our designs to be picked up by frameworks and serve all users. Currently, Punica is built on top of the PyTorch and HuggingFace Transformers ecosystems. Therefore, vLLM and LMDeploy, which are also in the PyTorch ecosystem, should be able to adopt it smoothly. As for NVIDIA Triton and TensorRT-LLM, since our kernels are written in CUDA, I believe they will also work seamlessly.

We call on the open source community to help us integrate Punica with all frameworks, so that everyone can benefit from the efficiency improvements!
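For intuition, the multi-LoRA serving scheme can be sketched in NumPy as reference semantics: one dense base-weight matmul shared by the whole batch, plus a cheap per-request low-rank correction selected by adapter index (Punica's CUDA kernels compute this correction efficiently; the shapes and names below are illustrative, not the actual API):

```python
import numpy as np

rng = np.random.default_rng(0)
h, r = 64, 8                                   # hidden size, LoRA rank
W = rng.standard_normal((h, h)) * 0.01         # base weight, shared by all requests
adapters = [(rng.standard_normal((h, r)) * 0.01,   # A_j (down-projection)
             rng.standard_normal((r, h)) * 0.01)   # B_j (up-projection)
            for _ in range(3)]                     # three fine-tuned adapters

x = rng.standard_normal((4, h))                # batch of 4 requests
idx = [0, 2, 1, 0]                             # which adapter each request uses

y = x @ W                                      # one big matmul for the whole batch
for i, j in enumerate(idx):                    # per-request rank-r correction
    A, B = adapters[j]
    y[i] += x[i] @ A @ B                       # y_i = x_i W + x_i A_j B_j
```

Because the correction is rank-r (tiny compared to the h-by-h base matmul), requests using different adapters can share one batch almost for free.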


Thank you! We are also very excited about combining fast fine-tuning and efficient serving. In fact, what you just said is closely related to one of our very first motivations. In my previous blog post [1], I call this scheme "just-in-time fine-tuning". Our earlier measurements show that, for a medium-sized webpage (~10K tokens), it takes around 30 seconds to 2 minutes to fine-tune a LoRA model. Another upside of this JIT fine-tuning scheme is that it can turn any model into a long-context model.

We'll keep doing more research on fine-tuning, and hopefully we'll see results soon.

[1] https://le.qun.ch/en/blog/2023/09/11/multi-lora-potentials/


Thanks for the question. Currently, Punica is built on the PyTorch and HuggingFace Transformers ecosystems, so PyTorch users can start using Punica now.

Looking forward to collaborating with TVM and MLC to reach more users :)


On a related note, Google also uses optical circuit switches in their datacenter network. See the paper from SIGCOMM '22 [1].

[1] Jupiter Evolving: Transforming Google’s Datacenter Network via Optical Circuit Switches and Software-Defined Networking. https://research.google/pubs/pub51587/


I'm using 27-inch 4K monitors with GNOME. I found it more practical to use no display scaling but 1.25x text scaling (a setting in gnome-tweaks). The problem for me with 2x or fractional scaling is that it scales the UI (icons, margins) as well. As someone who values functionality (i.e., displaying text) over overdesigned white space, scaling the whole UI just wastes workspace, whereas scaling only the text strikes a good balance between good-looking text and workspace size.
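For reference, the gnome-tweaks text-scaling slider writes a dconf key you can also set directly (assuming a stock GNOME install; the key name is current as of recent GNOME releases):

```shell
# Scale text to 1.25x while leaving icons, margins, and other UI at 1x.
# This is the setting gnome-tweaks exposes (under Fonts in recent versions).
gsettings set org.gnome.desktop.interface text-scaling-factor 1.25

# Check or undo it:
gsettings get org.gnome.desktop.interface text-scaling-factor
gsettings reset org.gnome.desktop.interface text-scaling-factor
```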

But anyway, it's good to see the Linux desktop gaining fractional scaling support!


There is a way to toggle experimental fractional scaling in GNOME.

https://wiki.archlinux.org/title/HiDPI -> 1.1.1 Fractional Scaling
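Concretely, the toggle is a mutter experimental feature for Wayland sessions (the key name is current as of recent GNOME releases; check the wiki if it has moved):

```shell
# Enable GNOME's experimental fractional scaling on Wayland, then pick a
# fractional factor (125%, 150%, ...) in Settings -> Displays.
gsettings set org.gnome.mutter experimental-features "['scale-monitor-framebuffer']"

# Xorg sessions use a different flag set; see the wiki section above.
```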


It's done by rendering at a higher integer scale and then downsampling, so it's not quite the same as true fractional scaling.
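A minimal sketch of that arithmetic (simplified; real compositors negotiate buffer scales per surface): to hit a 1.5x scale, the compositor renders at the next integer scale (2x) and downsamples, so some pixels are rendered only to be thrown away.

```python
from math import ceil

def fractional_scale_sizes(logical_w, logical_h, scale):
    """Integer-upscale-then-downsample fractional scaling, simplified."""
    render_scale = ceil(scale)                        # e.g. 1.5 -> render at 2x
    render = (logical_w * render_scale, logical_h * render_scale)
    target = (round(logical_w * scale), round(logical_h * scale))
    # Ratio of pixels rendered to pixels actually displayed (1.0 = no waste).
    overdraw = (render[0] * render[1]) / (target[0] * target[1])
    return render, target, overdraw

# A 4K panel at 1.5x has a 2560x1440 logical size: the compositor renders
# 5120x2880 and downsamples to the panel's 3840x2160.
render, target, overdraw = fractional_scale_sizes(2560, 1440, 1.5)
```

For this example the overdraw ratio is (4/3)^2, i.e. roughly 78% more pixels rendered than displayed, which is the "wasted performance" (and slight blur) people complain about; integer scales have an overdraw of exactly 1.0.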


It works ok; Apple does it the same way. However, I find Microsoft's approach better.


>It works ok

...if you're OK with blur and wasted performance. I like my text crisp, rendered at exactly the desired size, with subpixel antialiasing, and knowing my hardware isn't wasting cycles rendering at a higher-than-needed resolution only to throw some of it away in raster resampling.


What does overdesigned mean? I suppose you mean relying on scientifically proven design practices, such as improving readability through the use of whitespace and font sizes.

Do you have an example of this being overdone?



Can you receive SMS from apps that aren't the default SMS handler on modern versions of Android? I seem to recall Google putting a bunch of silly restrictions on it, to the point where there wasn't even a permission you could grant if you wanted to.


> If you find a really good tutorial by programmers who are both excellent teachers and experienced in that particular field, it beats JIT learning on a personal project in one important respect. You get exposure to the One True Way of doing things.

That was exactly my feeling watching Jon Gjengset's Rust tutorials. I like his genuine reactions to unexpected problems. I really learned a lot from these lengthy but realistic videos. https://www.youtube.com/c/JonGjengset/featured


Just mentioning two related programs that come to mind: Franz (https://meetfranz.com/) and nativefier (https://github.com/jiahaog/nativefier).


I personally use Rambox (https://rambox.pro/) instead of Franz these days, or just click the "install as app" button in my browser.

Really, I see little value over browsers with vertically stacked tabs. There are already free solutions out there; what's the added value? Workspaces are easy to replicate with Firefox's container tabs, and the vertical tab structure can be replicated with something like Tree Style Tab. The remaining minor improvements are nice to have but certainly not worth $20 in my opinion.

