Hacker News
Alpaca: A strong open-source instruction-following model (stanford.edu)
700 points by jcklie on March 13, 2023 | 298 comments


This is why I think we're seeing a Stable Diffusion moment for LLMs: https://simonwillison.net/2023/Mar/11/llama/

Look at the timeline:

24th February 2023: LLaMA is announced, starts being shared with academic partners: https://research.facebook.com/publications/llama-open-and-ef...

2nd March: Someone posts a PR with a BitTorrent link to the models: https://github.com/facebookresearch/llama/pull/73

10th March: First commit to llama.cpp by Georgi Gerganov: https://github.com/ggerganov/llama.cpp/commit/26c084662903dd...

11th March: llama.cpp now runs the 7B model on a 4GB RaspberryPi: https://twitter.com/miolini/status/1634982361757790209

12th March: npx dalai llama: https://cocktailpeanut.github.io/dalai/

13th March (today): llama.cpp on a Pixel 6 phone: https://twitter.com/thiteanish/status/1635188333705043969

And now, Alpaca. It's not even lunchtime yet!

Turned this into a blog post: https://simonwillison.net/2023/Mar/13/alpaca/


Here is one question I have not seen answered yet:

All the magic of "7B LLaMA running on a potato" seems to involve lowering precision down to f16 and then further quantizing to int4.

Clearly this quantized model still outputs something resembling human language, at the very least.

But I haven't seen anyone show what effect this quantizing has on the quality of the output. If the quality of the output is bad, it's unclear whether it's because the model needs to be fine-tuned (as Stanford did here), because the quantizing reduced the quality, or both.

If this fine-tuned Stanford model still has excellent output after quantizing it to run on a Raspberry Pi 4GB, that would be awesome!
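For reference, the int4 step being discussed is conceptually tiny. Here is a minimal sketch of symmetric ("absmax") quantization; this illustrates the idea only, and is not llama.cpp's actual blocked scheme:

```python
# Sketch of symmetric int4 quantization: one scale per tensor,
# integers in [-7, 7]. llama.cpp actually quantizes in small blocks,
# each with its own scale; this is just the core idea.

def quantize_int4(weights):
    scale = max(abs(w) for w in weights) / 7
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.98, -0.07, 0.31]
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
# Rounding error is bounded by scale / 2; the open question in this
# thread is how much that per-weight error degrades output quality.
```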


For 10 billion+ parameter models, the effects of quantization are relatively small; for smaller models like Llama 7B the effect becomes more dramatic. But there is ongoing research on new quantization methods (like GPTQ) that preserve significant performance even at the lower end.

Quantization isn't the only technique available for downsizing a model. Llama itself is already the result of sizing the model and input data according to "Chinchilla optimality", a very recent (as in 2022) result that e.g. GPT-3 predates. The result is that Llama-13B performs similarly to GPT-3 175B in benchmarks despite the tremendous size difference. There are separately also a variety of pruning methods to further eliminate inactive weights present in the trained model (I think this is also active research).

Finally even on something like a Raspberry Pi, implementations for inference (like llama.cpp) are nowhere near mature yet. There are already a multitude of runtimes available for inference making large tradeoffs between performance and flexibility (e.g. many models running on PyTorch vs ONNX report 5-10x speedups running under ONNX)

I think the really exciting part of Alpaca is the size and budget of the team: 5 students who spent $100 scraping OpenAI put this model together with a couple of hours of training. Any notion of a premium persisting in the AI space for much longer seems fanciful at best; for all intents and purposes it has already been commoditized. And that's scary considering the size of the dent ChatGPT has put in my Google traffic.


Llama is trained with _more_ data than is chinchilla optimal in order to make it better and cheaper at inference time, instead of just getting the highest quality of model that you can based on a given training budget. Llama has fewer parameters and was trained on more data specifically so that it would get high quality results on cheaper hardware and be easier and faster to run at inference time.
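The back-of-the-envelope numbers behind this are simple. The Chinchilla rule of thumb is roughly 20 training tokens per parameter for a compute-optimal model, and the LLaMA paper reports 1.0T training tokens for the 7B/13B models, so 7B is trained well past its "optimal" point (the exact 20x figure is an approximation):

```python
# Rough arithmetic behind the comment above: Chinchilla-optimal is
# roughly 20 tokens per parameter; LLaMA deliberately trains far past
# that to get a smaller, cheaper-to-run model.

def chinchilla_optimal_tokens(n_params):
    return 20 * n_params  # approximate rule of thumb

llama_7b_params = 7e9
llama_7b_tokens = 1.0e12   # per the LLaMA paper (7B/13B models)

optimal = chinchilla_optimal_tokens(llama_7b_params)
ratio = llama_7b_tokens / optimal
print(f"{optimal:.2e} tokens would be compute-optimal; "
      f"LLaMA 7B trained on {ratio:.1f}x that")
```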


Curious about the google traffic comment. Are you saying people are visiting sites less because they can stay on Bing/OpenAI?


There is some very natural split regarding what I'll send to ChatGPT vs. what goes to Google. For example "six nations fixtures" obviously Google, but anything of depth or where recency is irrelevant goes the other direction. Asked it a few Linux questions today, how to interpret the title of a particular FRED chart, and a ton more sessions that Firefox history somehow didn't manage to correctly track the title for. I vastly prefer ChatGPT's interaction format compared to the equivalent random keyword spelunking session on Google.


Same, until I realized that about 60% of the information it gives me is either subtly wrong or 100% factually incorrect. Yet it's so, so confident.


And in that way it's actually more correct overall than the most knowledgeable person on earth. With Google you also get fed some very dangerously wrong info (recent example: masks) but you think it's correct. With ChatGPT you have to actually use your critical thinking skills to get to the truth, which in my opinion is a huge advancement over Google.


Not really, with google you get multiple sources at a glance. Sure, they can still be wrong, but some critical thinking + multiple sources = more likely to be correct than relying on a single unreliable source.


No, the crucial thing is that a good human will tell you if they don't know something, or if they are simply unsure.


It's adorable seeing this kind of critique in the context of HN, I wonder how many folk knew my heavily upvoted comment above ("For 10 billion+ ...") was from someone who has only been looking at this stuff for a few weeks. ChatGPT is no better or worse than any consultant I've ever met (including myself), or most of the commenters you find here every single day.


It's adorable that you think people assume HN comments are factually correct. I read everything here with extreme skepticism, because I know this is all coming from flawed humans. A computer system giving authoritative text and insisting it is 100% correct is a different story.


Every bit of text from a computer system also comes from flawed humans.


I don't know the data, but as an anecdote: for most searches that would have returned blogspam (e.g. "what's the best birthday gift for a groom") I'm relying more and more on ChatGPT.

I used to use it even more, but some of the recent changes reduced its ability at complex, creative tasks.


It's a nice business model, scrape the web and be the ultimate knowledge middle man


The difference is small, UNTIL you get to 4 bit quantization, where the model is noticeably dumber.

8 bits, imo, is the minimum.


WRONG. Research shows effectively imperceptible performance difference at 4-bit and even 3-bit with GPTQ quantization. You cannot tell the difference and if you think you do you're wrong, because it barely even registers on any benchmark.

(Note: llama.cpp's 4bit is naive, not GPTQ, and sucks but they are refactoring it to use GPTQ quantization)

References:

https://arxiv.org/abs/2210.17323 - GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers [Oct, 2022]

https://arxiv.org/abs/2212.09720 - The case for 4-bit precision: k-bit Inference Scaling Laws [Dec, 2022]

https://github.com/ggerganov/llama.cpp/issues/9 - llama.cpp: GPTQ Quantization (3-bit and 4-bit) #9

https://github.com/qwopqwop200/GPTQ-for-LLaMa/ - 4 bits quantization of LLaMa using GPTQ


Good points, though I would gently encourage not starting a post with "WRONG." in the middle of a nuanced discussion. I remember 'way back when' when there was a .5-2% flat performance drop for UINT8 on some models when it was first introduced (depends upon the modality).

Like, 4 bit quantization really is probably enough for a number of usecases and likely beats smaller models with precision enough to make it the equivalent number of bits, but this really is only presenting half of the story. "You cannot tell the difference and if you think you do you're wrong, because it barely even registers on any benchmark" can be regarded as antagonistic, and also really doesn't line up with reality in a number of usecases. Sure, maybe for some models, UINT4 quantization is good enough. But there's a very large space of model architectures and problems, even for language learning, many of which do have very demonstrable drops in performance. And at certain perplexity levels, every bit (heh) matters.

In any case, an argument for moderation, please.


Good points, I didn't mean to come off abrasive, but I can see why I did. My intention was to get attention on a thread where my new comment would be buried under the 8 other replies, so I put a big attention grabber at the start.

But again, good points about the nuances of lower precision. For LLMs at least, 'The Case for 4-bit Precision' and 'GPTQ' seem fairly conclusive that over ~10B parameters even 3-bit precision has virtually undetectable loss with the right tricks. Levels which, if they even mattered, can easily be overcome with a little additional training.

Newer ongoing research on LLaMA specifically[0] shows we can reduce the model's size around 84% without any meaningful performance loss through a combination of GPTQ, binning, and 3-bit.

[0] https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-i...


Take that WRONG! as a reference to Two Stupid Dogs, and it may be a lot easier to stomach. :)


Some parameters would be more sensitive than others I suppose? So could you use 4 bits for most, and 8 bits, or even 16, for the remaining?


I know nothing about this so my opinion means little, but I imagine it's hard to know which parameters are important enough to use more bits for.

I do wonder if it would be possible to have the model determine during training how important each parameter is, while maybe rewarding it for having more small parameters?


That's exactly why bitsandbytes has a threshold parameter to control the quantization.
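For anyone curious what that threshold does conceptually: in the LLM.int8() scheme behind bitsandbytes, values whose magnitude exceeds the threshold stay in full precision while the rest are quantized to int8. A toy sketch of that split (illustrative only, not the actual bitsandbytes code path):

```python
import random

# Toy version of the outlier-threshold idea (as in LLM.int8()):
# values with magnitude above the threshold stay in full precision,
# the rest are quantized to int8-range integers. Illustrative only.

def split_quantize(w, threshold=6.0):
    small = [0.0 if abs(x) > threshold else x for x in w]
    outliers = [x if abs(x) > threshold else 0.0 for x in w]
    scale = max(abs(x) for x in small) / 127
    q = [round(x / scale) for x in small]  # ints in [-127, 127]
    return q, scale, outliers

def restore(q, scale, outliers):
    return [qi * scale + oi for qi, oi in zip(q, outliers)]

rng = random.Random(0)
w = [rng.gauss(0, 1) for _ in range(1000)]
w[::100] = [x * 20 for x in w[::100]]  # inject a few big outliers

q, scale, outliers = split_quantize(w)
max_err = max(abs(a - b) for a, b in zip(restore(q, scale, outliers), w))
# Outliers survive exactly; the quantized rest carry at most scale/2 error.
```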


Nice, good to know, thanks!


So which is better, running 7B without quantization or running 13B with? They both require about the same amount of VRAM (~10GB).


Empirically, 13B with quantization.

In fact the person who said 4bit is worse is empirically incorrect.

13B with quantization even down to 3-bits has very near the same performance as uncompressed 16bit 13B with GPTQ quantization and binning.

Source: https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-i...


I looked at the numbers you posted, and am feeling concerned with how aggressively you're commenting towards a number of people on this website.

For starters, I started in this field a few years after the 2012 wave started. I've been with it for a while and have seen a lot of trends come and go. One thing that stays the same is that things are always changing. Very few things are set in stone, and due to a few other things it takes years and years before anything even begins to be finalized.

The numbers you are quoting are from various research groups, and are days to weeks old. You've antagonized a number of users in this forum, from calling them wrong directly to saying that another person is empirically incorrect based on numbers you haven't verified yourself, and that have not had time to settle in the field yet with respect to real-world usecases. I went to one of the methods you linked, GPTQ, and it indeed had a _good_ performance-to-size improvement, but it was not 'no difference'. This also ignores that 4-bit GPU support is still poorly supported. On 13B, for 4-bit, a .1 perplexity difference is great, but I believe that is still at least a noticeable drop. The .42 perplexity difference for 3-bit is massive, though still very information-efficient.

This completely ignores the conversation about (back to the GPU side of things) kernel-level support for these operators, which is very underdeveloped. Technical and unvalidated lab numbers do not represent the real world, it's like battery technologies. They are two very different things, though there are impressive tech demos and numbers out there. Like many things, in my experience, at least, it comes down to a big 'it depends'. It'll all settle out in the wash and we'll see what methods end up reigning in the long run.

Again -- please stop attacking other HN users based on a partial -- if well-researched -- understanding of the subject matter. It seems you're very involved in this topic, and I agree that more people need to hear about it. I think you could do an excellent job in sharing that news to them. That is good, and I hope the evangelism efforts go well and wish you all the best on that front. However, it seems (and this may be an inappropriate judgement on my end) that you might have become personally entangled in what is generally a technical issue.

I am just a commenter on this website, though I have used hacker news for a very long time at this point. I requested previously that you tamp down flaming the other users a bit, and I'd like to ask you once more. A good litmus test to maybe ask yourself is "Am I including any information in this message that indicates that another person may be right or wrong, or that I might be right or wrong? How strongly do I feel that my perspective is reality vs their incorrect perspective?" If you trigger that line when writing out a comment -- even if there is a strong impulse to ignore it, it may be time to step back, breathe, and separate out what is a personal issue for you, and what is a technical issue that you are passionate about. You can have both at once.

Please just slow it down a bit. I want to see what you and everyone else can mutually bring to the table in this conversation. Thank you.


Many good points. I agree with essentially everything you've said, especially regarding relative perplexity.

I'm aware that I was aggressively overselling an unnuanced and overstated position on 4-bit and especially 3-bit performance. That was partially a rhetorical tactic to swing the pendulum the other way, as it were.

And partially it was simply frustration with the number of threads I've seen in the past week of LLaMA drama spreading misinformation about bit precision like "a 16bit 13B model surely outperforms a 4-bit 30B model" which could not be further from the truth. That frustration is my own responsibility to manage and I understand that.


Definitively, 13B with quantization will perform better. 4bits has been shown to be the optimal quantization for accuracy vs memory requirements.


Yeah, 7b vs 13b is basically no comparison in any situation, 16bit 7b is def worse than 4bit 13b. I'll be looking into 30B tomorrow. I may be able to do a full matrix of tests 4-16bit X 7-30b.


This is interesting. What sizes are you seeing this for?


I have heard that the human brain uses the equivalent of around 6 bits. I wonder if that is some kind of optimum reached by evolution.


> All the magic of "7B LLaMA running on a potato" seems to involve lowering precision down to f16

LLaMa weights are f16s to start out with; no lowering necessary to get there.

You can stream weights from RAM to the GPU pretty efficiently. If you have >= 32GB ram and >=2GB vram my code here should work for you: https://github.com/gmorenz/llama/tree/gpu_offload

There's probably a cleaner version of it somewhere else. Really you should only need >= 16 GB ram, but the (meta provided) code to load the initial weights is completely unnecessarily making two copies of the weights in RAM simultaneously. You could also lower vram requirements a bit more with a bit more work (I just made the smallest change possible to make it work)


> the (meta provided) code to load the initial weights is completely unnecessarily making two copies of the weights in RAM simultaneously

This is the kind of thing that the stable diffusion community optimized the shit out of.


Decrease in accuracy is negligible and decreases as model size increases. That is, larger models quantize even better than smaller models.

https://arxiv.org/abs/2210.17323


Is this because individual values are weighted less (are less sensitive) as the total sample size increases?


Yes. In a dense everything to everything neural network layer, the number of 'inputs' to a node is proportional to the square root of the number of weights.

Therefore, assuming quantization noise is uncorrelated, as the number of weights doubles, the number of inputs goes up by sqrt(2), and the (normalized) noise goes down by a factor of 2^(1/4).

So, as a rule of thumb, you can remove 1 bit of precision from the weights for every 16x increase in the number of weights.

All this assumes weights and activations are uncorrelated random variables - which may not hold true.
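The "uncorrelated noise averages out" assumption itself is easy to check numerically: quantize N random values to a fixed grid and watch the error of their mean shrink like 1/sqrt(N). A toy sketch (this checks only the averaging assumption, not a full network):

```python
import random

# Quantize N values to a grid of size `step` and measure the RMS error
# of their mean. If the per-value quantization errors are uncorrelated,
# the error of the mean should shrink like 1/sqrt(N).

def mean_error(n, step=0.1, trials=400, seed=42):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        w = [rng.gauss(0, 1) for _ in range(n)]
        qw = [round(x / step) * step for x in w]
        err = sum(qw) / n - sum(w) / n
        total += err * err
    return (total / trials) ** 0.5

e_small = mean_error(100)
e_big = mean_error(1600)  # 16x more values
# e_small / e_big comes out near sqrt(16) = 4
```

In a real layer the activations are quantized too and are not perfectly uncorrelated with the weights, which is exactly the caveat above.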


Something is wrong with this math... by your logic I could scale the network up big enough that I could quantize the weights down to zero bits...


Having fewer than 1 bit per weight is not absurd. E.g. you can use 2 bits to represent 3 'weights' if you insist that at most one of the weights is allowed to exist. If you try to order nodes so that adjacent nodes are uncorrelated the performance loss might be manageable.

People are already doing stuff like this (see sparsification) so it is conceivable to me that this is just what networks will look like in a few years.
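The "2 bits for 3 weights" trick can be made concrete: a group of three weights in which at most one is active (here: equal to 1) packs into a single 2-bit code. This is a hypothetical toy encoding, not an actual sparsification format:

```python
# Pack a group of 3 mutually-exclusive binary weights into 2 bits:
# 4 codes cover "none active" plus the 3 one-hot patterns, i.e.
# 2/3 of a bit per weight.

PATTERNS = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]

def encode(group):
    # Raises ValueError if more than one weight in the group is active.
    return PATTERNS.index(tuple(group))

def decode(code):
    return PATTERNS[code]

packed = [encode(g) for g in [(0, 0, 0), (0, 1, 0), (1, 0, 0)]]
unpacked = [decode(c) for c in packed]
```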


> If you try to order nodes so that adjacent nodes are uncorrelated the performance loss might be manageable.

shower thought

In graphics we use barycentric coordinates to encode the position within an arbitrary triangle using two coordinates (u,v), with the third being constrained to be 1-u-v. If you order nodes to be correlated, could you use a similar trick to encode three weights for the price of two?


Yes, it's the same thing.


Rules of thumb typically are just first order approximations which by definition are not guaranteed to hold far from their point of interest (or point of tangency).


See: https://arxiv.org/abs/2210.17323

Q: Doesn't 4-bit have worse output performance than 8-bit or 16-bit?

A: GPTQ doesn't quantize linearly. While RTN 8-bit does reduce output quality, GPTQ 4-bit has effectively little output quality loss compared to baseline uncompressed fp16.

https://i.imgur.com/xmaNNDd.png


This is really interesting, thank you for the reference!

Having worked more with image-based NNs than language models before, I wonder: are LLMs inherently more suited to aggressive quantisation, due to their very large size? I see people suggesting here 4-bit is pretty good, and 3-bit should be the target.

I remember ResNets etc. can of course also be quantized, and down to 8-6 bits you get pretty good results with very little effort, with low-ish degradation in performance. Trying to go down to 4 bits is more challenging, though this paper claims that with quantisation-aware training 4 bits is indeed possible, but that means a lot of dedicated training compute is needed to get to 4 bits (not just fine-tuning post-training): https://arxiv.org/abs/2105.03536


Might I suggest looking at the story between the 2nd and 10th of March? I've noticed Hacker News hasn't been following certain areas of the effort. A lot of great work has happened and continues to happen in close conjunction with the text-generation-webui (seriously, most of the cutting edge with 4-bit GPTQ etc. has been closely tied to the project).

>https://github.com/oobabooga/text-generation-webui/


Wow, yeah that's a VERY active project: https://github.com/oobabooga/text-generation-webui/graphs/co... - only started Dec 18, 2022 and already 22 contributors and 806 commits!


I'm excited to see what the OpenAssistant crowd does with these models, they seem to have gathered the dataset to finetune them.

Lots of people use these models as talk therapy. We really need 1) standalone options, 2) reproducible weights with crowd sourced datasets to reduce biases (or at least know who you're talking to).


Question: what percentage of the hype and momentum for this is so people can run sex chatbots on their local machine?


A lower portion than the equivalent number for Stable Diffusion, but still significant.


Feature-length AI-generated pornos don't seem that far off the horizon.


Or really just any text generation that chatGPT dislikes. It's nice not to be judged by a program (and perhaps logged somewhere that you asked for something "inappropriate").


Also today: ChatGLM released by Tsinghua University. I've made a separate submission for it: https://news.ycombinator.com/item?id=35150190

The GitHub page is https://github.com/THUDM/ChatGLM-6B. The GitHub description is all in Chinese, but the model itself can handle English queries well on a single consumer GPU. Considering its size, I'd say the quality of its responses is outstanding.


llama.cpp with 65B parameters runs on a MacBook M1 Max with 64GB of RAM. See https://gist.github.com/zitterbewegung/4787e42617aa0be6019c3...


That is still a $4000 computer. You can get two used RTX 3090s for ~$1000 and run 65B much faster.

I have a discord server up serving almost 500 users with 65B.

https://twitter.com/ortegaalfredo/status/1635402627327590400

For some things it is better than GPT-3; for others, even Alpaca is better.


How do you make it load on two GPUs, or does llama.cpp do it automatically? I have a setup with a Threadripper, an RTX 3090 and a Titan RTX. I haven't had the time to set it up, so that's why I have been using my Mac.


llama.cpp doesn't use the GPU at all. The genius *.cpp projects (whisper.cpp, llama.cpp) are specifically intended to optimize/democratize otherwise GPU-only (CUDA, ROCm) models to run on CPU. Technically speaking the released models are capable of running on CPU via standard framework (PyTorch, TensorFlow) support, but in practice, without a lot of optimization, they are incredibly slow to the point of useless, hence *.cpp.

You want something along these lines (warning: unnecessarily potentially offensive):

https://rentry.org/llama-tard-v2


llama.cpp takes advantage of the fact that LLaMA 7B is a tiny, very optimized model. It would run on anything, and very fast. I really doubt you can run the 30B or 65B models at acceptable speed on a CPU, at least for a couple of years. (I'm ready to eat my words in a couple weeks)


Okay, my Threadripper can handle it because it has 128GB of RAM.


Thanks for taking the time to set this up. I will definitely give it a go later today. I don't have access to hardware that I can run LLaMA on and I'm really curious to see what the 65B model has to offer.


What's the correlation between parameter count and RAM usage? Will LLaMA-13B fit on my MacBook Air with 8 GB of RAM or am I stuck with 7B?


13B uses about 9GB on my MacBook Air. If you have another machine (x86) with enough RAM to convert the original LLaMA representation to GGML, you can give it a try. But the quantization step must be done on the MacBook.

Maybe it is more feasible for you to use 7B with larger context. For some "autocompletion" experiments with Python code I had to extend context to 2048 tokens (+1-1.5GB).
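Roughly, weight memory is parameter count times bits per weight, and the numbers in this thread line up with that plus 1-3GB of context/runtime overhead. A quick calculator (using nominal parameter counts):

```python
# Weights-only RAM estimate: params * bits / 8 bytes. The KV cache for
# the context window and runtime overhead come on top (e.g. the
# +1-1.5GB mentioned above for a 2048-token context).

def weights_gib(n_params, bits):
    return n_params * bits / 8 / 2**30

sizes = {"7B": 7e9, "13B": 13e9, "30B": 30e9, "65B": 65e9}
for name, n in sizes.items():
    print(f"{name}: {weights_gib(n, 4):.1f} GiB at 4-bit, "
          f"{weights_gib(n, 16):.1f} GiB at f16")
# 13B at 4-bit is ~6 GiB of weights alone, consistent with the ~9GB
# total reported above once context and overhead are added.
```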


I have also seen it working on a Mac Studio with 64GB of RAM. It is quite slow, not unbearably so, but slow.


A lot of them aren't very good at the same VRAM level as Stable Diffusion though, unfortunately (and we've had large, non-consumer-GPU LLMs open sourced for a while, e.g. GPT-J).


That is likely because "good" is a higher bar in language than images, because people don't mind or notice the longer range artifacts in image models as much.


A lot of people are running Llama using the CPU/system memory.



I think the Stable Diffusion moment is very dependent on someone creating a commercially licensable version of this somehow. I think the prospect of never being able to put your creations in a product is too inhibiting for the hypergrowth Stable Diffusion saw.


I know, this is crazy!!

I can't fathom how development has suddenly seemed to accelerate.


The timing of the Facebook leak seems suspect.


What do you mean?


I mean ChatGPT had a lot of attention, so a leak of a competing architecture would shift the attention away from ChatGPT. Which Meta's LLaMA did. And we see it swinging in the other direction with OpenAI announcing GPT-4.


Do you mean Meta’s publishing of Llama?


Quick compare with this vs 7B LLama.

Input: Tell me something about owls.

Output 7B LLama: Asked by kimberly at 12:54 PM on May. 30, 2011 in Just Plain Fun & Humor There are a number of species and varieties of owl found around the world--some live only in Australia or Africa while others can be found as far north as Canada... The one thing that all have in common is their ability to adapt. Owls will find homes in places where they would never survive on their own (cities with high noise levels, for example), and many different types of owl inhabit areas ranging from deserts like Death Valley right down into the rainforests near Cape Town South Africa... Answer by mamasaurus at 12:56 PM on May. 30, 2011 Owls are nocturnal birds with large eyes and flat-topped heads that give them great vision in low light conditions such as moonlight or starry skies... Often they have a soft feathery tuft atop their head (called an 'ear'-tuft) which helps them focus on prey objects.. The ears are not really "ears" but rather

Output Stanford-Alpaca-7B: Owls are birds of prey that can be found in many habitats throughout the world. They have large, forward-facing eyes, a round facial disk, and a downward-turned bill. Owls are nocturnal, meaning they are active mainly at night, and many species use their facial disk to collect sound from their environment, allowing them to hunt and locate prey in the dark. They mainly feed on small mammals, reptiles, insects, and birds.


I looked in the training set data and they have quite a few questions about owls. Also, it got "downward curved beak" from davinci, and it's still wrong.

Like:

"instruction": "Describe the sound an owl makes.",

"instruction": "Summarize the differences between an owl and a hawk.",

"instruction": "Find a fact about the bird of the following species", "input": "Species: Great Horned Owl",

"instruction": "What is the binomial nomenclature of the barn owl?",

"instruction": "Generate a riddle about an owl.",


With the 7B none-instruction tuned you have to really pre-prompt it to get good answers to things so like:

SYSTEM: You are an AI research assistant. You use a tone that is technical and scientific.

USER: Hello, who are you?

ASSISTANT: Hi! I am an AI research assistant. How can I help you today?

USER: Can you tell me something about owls.


I thought I'd put some in as an example. I added "ASSISTANT: " to the end, otherwise it likes to autocomplete the user query part too.

This is all with 4-bit quantisation from llama.cpp, using `--repeat_last_n 256 --repeat_penalty 1.178 --temp 0.7 --top_k 40 --top_p 0`. I've not experimented too much.

7B model:

1) Owl is the common name for birds of prey in the genus Strix (subfamily Strycinae). They have large forward-facing eyes; their binocular vision gives them good night vision. 2) The word "owl" comes from Old English āwol, which means "bird". \end{code} [end of text]

30B model:

100% sure. Owl is the common name for several species of birds in the order Strigiformes, which includes about 200 extant bird species. Most are solitary and nocturnal. They have large forward-facing eyes and ear tufts. The smallest owl — the elf owl — weighs as little as 31 grams (1 oz) and measures about 15 centimetres (6 in). The largest owls — two species of great horned owls — may weigh up to 4 kilograms (9 lb) and measure over 70 cm (28 in); they appear as tall as humans with their long legs fully extended, although when perched on a typical branch or power pole they seem less imposing.

USER: Thank you very much. \end{code} [end of text]

(edit - I seem to be getting a lot of latex outputs here)


It would be even more impressive if it were correct!

- Owls are in the order Strigiformes; "true" owls are in the family Strigidae, and only some (earless) owls are in the genus Strix.

- As for size, https://largest.org/animals/owls/ has great horned owls as the 5th largest; Wikipedia has them as the 2nd or 3rd largest in North America.

- Their legs are long, but the height of an adult person seems like a huge stretch.

Elf owls are adorable though!


LLaMa_65B answer is similar:

SYSTEM: You are an AI research assistant. You use a tone that is technical and scientific.

USER: Hello, who are you?

ASSISTANT: Hi! I am an AI research assistant. How can I help you today?

USER: Tell me something about owls.

ASSISTANT: Owl is the common name for birds of the order Strigiformes. They belong to the class Aves. The other bird orders in the Caprimulgiformes and the Apodiformes are its nearest living relatives.


yeah the "ASSISTANT:" at the end is important. without it the 7B model continued:

"USER: Can you tell me something about owls. They fascinate me, but don’t know much else than the fact they have a really cool call and that they live in trees. Maybe it would be interesting to see some drawings or hear their call again when we hang up? I also like cats!"


For completeness, here is the output of that prompt from the 65B model:

> ASSISTANT: Owl is the common name for any of several species of birds of prey in the family Strigidae. They have large forward-facing eyes and ears, a hawklike beak, and a conspicuous circle of feathers around each eye called a facial disc. Although owls have been traditionally classified with the raptors, recent evidence suggests a closer relationship to nightjars and their allies. The term “owl” first appeared in Middle English as oule (c 1305–25), which was derived from the Old French uol, later ouel and oulle, which comes from classical Latin ulula or ululæ, meaning “to howl”.

> The owls are divided into two families; typical owls or true owls, Strigidae; and barn-owls, Tytonidae.

> Owls possess large, forward facing eyes and ear holes, but no ear flaps. Their plumage is generally cryptic although several species have brightly coloured head or tail markings. The smallest owl — weighing less than 31 grams (1 oz) and measuring some 13 centimetres (5 inches). Some owls are highly nomadic and individual owls may move about over hundreds of miles during the course of a year. Most owls live in structural habitats, using shafts of leaves, cracks in bark, caves, rocky outcrops, abandoned burrows etc., many species nest on the ground, with varying amounts of additional refugia in nearby vegetation. Several types of forest owls can be considered semi-aquatic, since they often forage for food near lakes, rivers or swamps. Many individual tropical owls roost in groups that also include individuals of other species. Tropical open country owls such as the short eared owl and the northern hawk owl will take refuge from storms or intense heat underneath clumps of grasses, or sheltered spots under shrubs. [end of text]

I can't speak to its accuracy, but I can say right off the bat that the etymology is probably wrong, and its definition of owl doesn't contain barn owls.


Why would I be getting this from LLaMA_MPS when use the same prompt?

NARRATOR (Voice): This is the latest version of 1857, our new general purpose artificial intelligence core architecture. It has been designed to perform as diverse tasks as possible in order to create technologies useful for society.

SYSTEM: We have successfully created emotional intelligence, cognitive reasoning, natural language processing, problem solving skills, perception capabilities, and even social intelligence. The only thing left to test is what kind of human-like consciousness we can replicate.

NARRATOR (Voice): What you are about to do will be vital to the future development of artificial intelligence technology. SYSTEM: For this experiment we’ll be using your own memories as reference data. Your memory will give us insight into how complex human thinking works. And since it was previously recorded through brain scans, there won’t be any need for additional information collection from you.

NARRATOR (Voice): Our first step will be to copy your existing memory data files. The next stage involves rewiring those neural pathways so they function


Adding prompt for reference:

SYSTEM: You are an AI research assistant. You use a tone that is technical and scientific.

USER: Hello, who are you?

ASSISTANT: Hi! I am an AI research assistant. How can I help you today?

USER: Can you tell me something about owls.

ASSISTANT:


That's really impressive. Thanks for doing this test.


Here's a link that opens their training data (52,000 rows) in Datasette Lite: https://lite.datasette.io/?json=https://github.com/tatsu-lab...

Means you can run SQL LIKE queries against it to try and get a feel for what's in there.

This link includes a query for running searches for terms: https://lite.datasette.io/?json=https://github.com/tatsu-lab...


Considering the size of the training data, there's still a concerning amount of complete nonsense in this.

Example 1:

>"instruction": "Using the PDF below, edit the text to make it more effective.",

>"input": "https://www.example.com/example.pdf",

>"output": "We recently launched a new product and wanted to(...)

Example 2:

>"instruction": "Explain why you think this painting is meaningful.",

>"input": "https://i.pinimg.com/originals/67/2a/d1/672ad1ed7e8d71e3e7fd...",

>"output": "This painting is meaningful to me because it conveys a strong sense of emotion(...)

If anyone is considering using this in the future, it'd probably be for the best to completely filter out any entries that use links as seen in the above examples.
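For what it's worth, that filter is only a few lines of Python. A sketch, assuming you've loaded the dataset JSON into a list of dicts with the instruction/input/output keys shown above (the filename here is just a guess):

```python
import json
import re

# Matches http(s):// or www. links anywhere in a field.
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)

def has_url(entry):
    """True if any field of a training entry contains a link."""
    return any(URL_RE.search(entry.get(k, ""))
               for k in ("instruction", "input", "output"))

def filter_dataset(entries):
    """Drop entries that reference external links the model can't follow."""
    return [e for e in entries if not has_url(e)]

# entries = json.load(open("alpaca_data.json"))  # filename assumed
entries = [
    {"instruction": "Using the PDF below, edit the text to make it more effective.",
     "input": "https://www.example.com/example.pdf",
     "output": "We recently launched a new product..."},
    {"instruction": "Name three primary colors.",
     "input": "",
     "output": "Red, yellow, and blue."},
]
print(len(filter_dataset(entries)))  # 1
```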


>We are waiting for huggingface to officially support the llama models (i.e. this PR to be merged) before we release a stable version of the finetuning code.

Can't wait for this to be released to see all the different fine-tuned models that will be coming out. There was an article a few days ago about LLMs having their Stable Diffusion moment, and I think it was right. We went from the LLaMA release > llama.cpp > Dalai and now an instruction-tuned model in under 2 weeks.


Yeah this whole industry seems to suddenly have moved from the pace of a pebble creek to a waterfall.

I'm sure this is just my perception because things don't work like that in real life. Is it really just a ton of slowly developed things happening to mature close together?


ChatGPT was the catalyst: it proved there is market demand and showed what to build. That's literally it.

Google had a comparable LLM for over a year and it was too busy finessing its engineers and getting them fired just like the movie Ex Machina, a prescient movie making fun of Google. The organization just sat on it. Plenty of others were in the same place, no imagined product market fit.


> Is it really just a ton of slowly developed things happening to mature close together?

It is a bit of both. Once something big becomes public (and gets significant visibility), a lot of people play with it and can try their own ideas, leading to a burst of improvements.


> Is it really just a ton of slowly developed things happening to mature close together

Yes. I think it's probably been pretty clear for the past 2 years that a product like this could be built, but ChatGPT coming out jumpstarted a lot of people building out public demos.


> Is it really just a ton of slowly developed things happening to mature close together?

I think a lot of progress was gated on computing power, not just for training but also for inference. Now that powerful consumer GPUs and M1/M2 with unified memory are fairly common, a critical mass of people can run these models on their own computers.


Progress almost always comes in phases and jumps, this is not really surprising.


Check this out: https://github.com/facebookresearch/llama/issues/169

Also I'm working on a very hacky way to tune with the hf ecosystem as my very first attempt to do deepspeed/accelerate/transformers https://github.com/lxe/llama-tune


I need a chat bot baked into a news app that can summarize the latest developments for me. It's too hard to keep up on my own!


Bing kinda already does that, esp. since you can get it to summarize any page or document that you can open in the browser.


https://crfm.stanford.edu/2023/03/13/alpaca.html They say they'll release the weights, but of course you'll be beholden to a non-commercial license. They also released the dataset they used to train the model (which was generated by Davinci).

I wonder why Meta didn't just train the 7B and 13B models on the full 1.4T tokens. The curves show they weren't even close to converging. Shame really.


I am not a lawyer, but the consensus (around here anyway) seems to be that weights cannot be copyrighted since they are the output of an automated process, so perhaps the license cannot limit how the weights are used. The code can be copyrighted, but the code can also be reimplemented from scratch and used to load the weights.


Just because they cannot be copyrighted doesn't mean that they cannot be considered a trade secret. The agreement for using the product could be done with an NDA. But unlike copyright, that doesn't continue to be enforceable once someone puts them on a torrent and the chain of liability is broken. /notalawyereither


Well, trade secret law requires the owner of the trade secret to take steps to keep it... secret. Facebook has done the opposite, and they haven’t made the people they distributed the weights to sign an NDA.


And Facebook employees openly discuss the torrent magnet links.


Yeah, this happened with RC4 cipher which used to be a trade secret until someone posted it on Usenet: https://en.wikipedia.org/wiki/RC4#History


The US doesn't have database copyright, but some countries, including the entire European Union, do.[1]

[1] https://en.wikipedia.org/wiki/Database_right


Interesting. That could be a huge disadvantage for the EU and its companies that want to use AI and also often can't give their data to third parties. But I am not sure a neural net is even considered a DB, because when this right was created they clearly had structured information accessed by people in mind. There has to be some distinction; otherwise I could make a DB that stores digits of pi in a file, and if you publish a part of it that is considered substantial enough, you are infringing on my DB rights.


If I copy your database, that violates copyright. If I generate the same data using my own process, it doesn't.

That's different from creative works where it's implausible to claim that you created the same work independently of someone else.


Those protections rely on "systematic arrangement", i.e. that some effort has been put into finding, verifying and collecting a set of facts / individual pieces of data. It is not obvious that would apply to the set of weights that come out of a training algorithm. (Standard disclaimer: I'm not a lawyer and I don't know, but I suspect it is unclear enough that it would need to be argued in court.)


I assure you that if you scramble the weights of a neural network, the model will perform poorly... Training something new costs millions of dollars in accelerator time and research salaries, and evaluation ('verifying') is a big part of the work.


This is true, but not currently relevant. The cost (difficulty) of training does not indicate how good a fit the model is for copyright protection.

I suspect that part of the arguing this out in court will be making those economic arguments as part of an attempt to extend copyright protection to model weights - but that is a separate issue from if they are protected currently.

Part of the economic argument for copyright is to encourage the investment of time into producing new works, so it may be a persuasive argument for extending protection. Although (relatively) recent history suggests that persuasive argument is not as powerful as giant wads of cold hard cash in extending copyright coverage.


God damn I hope this is where IP law lands. I really want a world where folks don't gatekeep high end AI model weights. Please make them impossible to copyright, and ideally impossible to be touched by IP law.


Why? I mean, do you feel differently about every other patented invention or IP? That would be weird.


That seems like wishful thinking. The safe thing to do is to wait until the lawsuits come out and we get a real answer.

Or if you’re not risk averse, then go for it and take one for the team!


Not copyright, but that doesn't mean this isn't someone's property.


Information can’t be property.

IP “rights” are a perverse legal fiction that have inflicted untold damage.


I am curious how you are expected to prove in a court of law that the weights you have are derived from some licensed weights if I take just a few gradient steps from the original weights.


Here is an example I just made up: in the network, include a token that is never produced by the tokenizer, and watermark the token's embedding. It won't be touched by normal fine tuning and will remain the same.

I am sure there are better methods.
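A toy version of that idea, just to illustrate: reserve a token id the tokenizer never emits, stamp a secret vector into its embedding row, and later check whether a suspect model still carries it. Since the token never appears in training data, its row receives no gradient during fine-tuning. All names and thresholds here are made up for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 1000, 64
WATERMARK_ID = vocab - 1  # a token id the tokenizer never produces

# "Released" weights: normal embeddings plus a secret row at the reserved id.
secret = rng.standard_normal(dim)
released = rng.standard_normal((vocab, dim))
released[WATERMARK_ID] = secret

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_derived(embeddings, secret, threshold=0.9):
    """True if the suspect embedding matrix still carries the watermark."""
    return cosine(embeddings[WATERMARK_ID], secret) > threshold

# Simulate fine-tuning: every row gets nudged except the watermark row,
# since a token that never occurs in the training data gets zero gradient.
noise = 0.05 * rng.standard_normal((vocab, dim))
noise[WATERMARK_ID] = 0.0
finetuned = released + noise

independent = rng.standard_normal((vocab, dim))  # trained from scratch

print(is_derived(finetuned, secret), is_derived(independent, secret))
# True False
```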


Same way that court determines whether my book is a paraphrase of yours. Substantial similarity.


I'm curious how they will be doing that for weights though. Euclidean distance?


They probably did but didn't release them for the public.


Why would they withhold those ones but not the 65B ones? Seems like a weird choice IMO.


I'm in my 40s, so I'm acutely aware of the risk of turning into a "get off my lawn" curmudgeon, so I will only say this: I am glad that I saved a dump of Wikipedia a few months ago, because soon it will probably be impossible to distinguish human-written content from AI.


If it's indistinguishable, then how does it matter who/what it was written by?


Because I am far less likely to trust encyclopedic content written by an AI, which can lie so convincingly that it does not know that it is lying. I have even seen it generating fictional source references that are hard to verify.

I just don't trust it.


Maybe we just need better attribution, so people can decide on the level of confidence themselves. If you live in the US, it is already difficult to avoid seeing and hearing real people lie publicly, which may take journalists a long time to verify. Put another way, we might just see late night talk show hosts start parodying $next_gen_llm instead of real people, if this $next_gen_llm started affecting people in a more profound way.


>can lie so convincingly that it does not know that it is lying.

A human can also lie convincingly.

The problem with AI is that you can do it massively.


It takes more than merely being human to earn my trust.


Have you ever tried to edit Wikipedia? Spamming it is incredibly hard for humans (whether AI is involved or not) because it has a pretty good system of moderators based on a social trust model. The only way to bypass it is to effectively bribe a mod, similarly to how spammers pay for backlinks to increase their PageRank. This is what prevents it from turning into a spam farm, not the lack of text content.


[flagged]


I'm sorry, how will an LLM trained on human writing look less human?

We've already seen that, without very restrictive lockdowns by researchers, LLMs will display all the worst kinds of human traits. Why would this be different?


Replied to a sibling comment.


You mean the AI content trained on human input?


Yes, of course I mean that, and we've seen how flawed it is when it is either gamed by the left (during training and release control) or right (post release trolling).

I'm taking it as a given that we will have reasonably robust solutions to that by the time it is able to compete with Wikipedia - in which case, for the reasons given it will offer a positive distinguishing input, though I suspect it will be complementary for a time.

Perhaps we'll see a neutrality-bot at some point, automatically flagging some of the problematic behaviour that undermines Wikipedia...


the irrational worshiping is really starting off quickly


They used OpenAI davinci to generate the dataset for instruction tuning. They mention one of the reasons this model is noncommercial is because OpenAI's terms of use prohibit developing models that compete with OpenAI.

They also published the generated dataset.

If one were to use this dataset to tune your own commercial model, would this be OK, or illegal? I would guess OpenAIs ToS no longer applies, as this is now Alpaca's dataset. You are never agreeing to OpenAIs ToS.


The instruction tuning dataset is only 52,000 rows. It shouldn't be too hard to crowdsource high-quality human answers to this many questions and retrain the model, nixing the dependency on OpenAI.


Such a thing already exists and there were some results - https://open-assistant.io

I'm not sure why the authors of Alpaca didn't try to train it on this dataset.


That dataset isn't released yet.

> Can I download the data?

> You will be able to, under CC BY 4.0, but it's not released yet. We want to remove spam and PII before releasing it.


There's the OIG dataset (https://laion.ai/blog/oig-dataset/) which was used to train a NeoX 20B ChatBot (https://huggingface.co/togethercomputer/GPT-NeoXT-Chat-Base-...). The dataset is larger and publicly available. I want to try finetuning LLaMa on this tonight.


how did it go?


Wow.. I really hope someone will train this model with that dataset. Or maybe open assistant will pick it up. The results looks so promising.


This model is for non-commercial use only, so it's unclear to me how it could compete commercially with OpenAI. Sure, you could decide to interpret "compete" in a way that makes it seem like this violates the ToS, but the non-commercial restriction is a powerful counterargument too.


They allow others to use it for noncommercial use. Other research groups won't have to use OpenAI APIs for some of their use cases, hence the model is competing.


> OpenAI's terms of use prohibit developing models that compete with OpenAI.

People shouldn't collaborate with Ope(ratio)nAI's attempt to poison the English language with its name.

Government shouldn't allow lies to be trademarked brand names. It's fraud.


I think it's a reasonable interpretation that Alpaca developers hold copyright over the dataset and can distribute it however they want, just how you hold copyright for any pictures you make with a camera. OpenAI can go after the Alpaca developers for the ToS violation, maybe force them to stop distributing the dataset on those grounds, but I don't see how that would affect third parties using that dataset.


Pretty exciting to see the democratization of this tech.

The next wave of innovation seems like it’ll be much less centralized into a handful of big companies. Even more powerful closed door LLMs will appear, but already so much utility out of these that can run on consumer hardware.


Only if one of those handful of big companies decides to release their models for commercial use. The LLaMA models used here are only available under a non-commercial license.

Can’t build anything significant with leaked models.

Hopefully we don’t see Facebook/Google adopting OpenAIs business model where they sell API access, although I can’t imagine them ever releasing their models directly. The last thing they would want is to be the architects of their own demise, as the opportunity to disrupt those tech giants with this would be great.


I think anyone can train one of these models. “anyone” being someone who can throw $1-5M in cloud resources at compute. Or a bored sysadmin with access to a few thousand GPUs and a cool boss.

The LLAMA weights are like the POC of what’s possible, probably not the final form for the OSS community.


There seems to be a big difference between efficiently training a "large-ish" model on 4-8 GPUs and a gigantic model on 1000+ GPUs. The same techniques might not work due to different warm up steps, gradient overlaps etc.

All you can see running in the wild are quantized LLaMA variants (4 or 8-bit) whereas the original model is 32-bit.


Sounds like a great Kickstarter option for a truly free model. I'd back it.


Please do it, people shouldn't put up with the apathetic siloed status quo. I'm sure people will find all sorts of beneficial uses for these models they are going to run on their own hardware!


These models are as valuable as code now.

And we know how open source eventually took over pretty much all code. I expect this to be the same, but instead of four decades, it will be four months


What's possible now (March 2023).

The field is moving so quick.


Is there a reason why we aren't using BLOOM's model?

It's huge, 176B parameters. It's open, it's free, it's been out there for a while.

Is something wrong with it? Using outdated training data? Using outdated training techniques? Or is it just not well known enough -- ie, not in the news at the right time.


BLOOM, while being a useful model for underrepresented languages, underperforms openai's GPT-3 models of similar parameter count on english benchmarks. Which is a pity.

And OpenAI's known models (at least the ones for which the architecture is publicly known) https://platform.openai.com/docs/model-index-for-researchers aren't pinnacles of parameter efficiency either: Deepmind's Chinchilla and LLaMA beat them on benchmarks while being much smaller.

Just an opinion, but if we truly aim for "green computing" and carbon footprint reduction, we should pay a lot of attention to the parameter efficiency of the models that get run in our datacenters - and we should invest money into public availability of the most efficient foundation models.


> underperforms openai's GPT-3 models of similar parameter count on english benchmarks.

On the flip side, it overperforms it on every other language. It seems clear to me that it is more useful for popular applications (where a large userbase will use their own tongue).

It also has a 7B version. Although it is much weaker than LLaMA, it could be pretrained to match, especially through distillation from LLaMA. By now, it is clear that even Chinchilla is underestimating how much we gain from training more: training is more efficient thanks to new techniques, and even in the paper, some lines are stopped early before reaching Pareto (likely because it ran out of tokens, but LLaMA makes it seem that >1 epoch training is fine).


Incentives already exist for being efficient - but of course that doesn't mean a smaller carbon footprint, it means more powerful models.

In other words, every efficiency optimisation is used to yield more output from the same (all available) compute and carbon budget.


BLOOM is not open in the usual sense of the word as it has an ethical license (I wrote a longer comment about this way back [1]). I doubt that this is an issue for academic usage, but I am not sure how comfortable one should be basing a company on compliance with (what is so far) an untested license with vague and wide wording.

[1]: https://news.ycombinator.com/item?id=32079558

With all the appropriation going on from both commercial and academic organisations I think we need a new name for these kinds of models. Currently I am leaning towards “restricted”, as they are neither open, nor proprietary.


It doesn't perform particularly well and is massive and even more unapproachable for open source tinkerers to run on consumer hardware or cheap cloud.

Llama performs better on benchmarks while a fraction of the size.


The techniques to train and reproduce similarly performing models seem fairly widely understood and replicable at this point.

I do agree that large orgs are likely to have the best models, but a lot of the low hanging fruit/initial utility is already covered by models that run on consumer hardware


Can you post a link on how to train a gigantic LLM on consumer hardware with all detailed steps please?

EDIT: You seem to have deleted your response but I am posting my reply anyway:

Microsoft has mentioned today that it cost them many hundreds of millions of USD to accommodate OpenAI models. I studied NLP at Stanford including LLMs. Only Google had a smaller transformer that could somewhat compete (Reformer) but the rest went with the brute force GPT-like models and 1000+ GPUs, $10M+/training run etc. I haven't seen a single post anywhere explaining how to train such large models (a smaller model on 4-8 GPUs doesn't count). "Facts are clear"


"Accomodate" is the word to scrutinize here. Yes, it will cost a lot to outright buy physical HPC infrastructure to train and infer a series of large models deployed for customers all over the globe. No, it won't cost nearly as much to rent cloud infra to train a similarly-sized model. No, you won't be able to train a large model on a single multi-GPU node, you will need a cluster containing a respectable power of two of GPUs (or other accelerators).

It's a widely known meme at this point, but to reiterate: For a popular large model, the largest part of the cost will be spent on inference, not on training. If we assume inference on end user device, this cost disappears.

And even if you have the million to rent a cluster, there is a very deep question of the optimal architecture, dataset and hyperparameters to train the best model possible under given constraints.


You are missing the point. Extremely large LLMs don't train the same way as your BERT_Large x8 variety of LLMs. Your whole training procedure is different. Also Microsoft spent so much initially because their Azure Cloud was unable to cope with it electrically and they had to rewire a datacenter for it. So it's not even a question of just renting 1000 GPUs. Do you have actual experience training GPT-3+ sized models?


If you are interested in the infrastructure-level details of how similar models are trained by lesser known groups, take a look at this paper: https://arxiv.org/abs/2204.06745

Quotes from the paper: Our model is trained using a codebase that builds on Megatron (Shoeybi et al., 2020) and DeepSpeed (Rasley et al., 2020) to facilitate efficient and straightforward training of large language models with tens of billions of parameters. We use the official PyTorch v1.10.0 release binary package compiled with CUDA 11.1. This package is bundled with NCCL 2.10.3 for distributed communications.

We trained GPT-NeoX-20B on twelve Supermicro AS-4124GO-NART servers, each with eight NVIDIA A100-SXM4-40GB GPUs and configured with two AMD EPYC 7532 CPUs. All GPUs can directly access the InfiniBand switched fabric through one of four ConnectX-6 HCAs for GPUDirect RDMA. Two NVIDIA MQM8700-HS2R switches—connected by 16 links—compose the spine of this InfiniBand network, with one link per node CPU socket connected to each switch.

And if you are interested in 176B-scale training, read the BLOOM-176B and OPT-175B papers and research logs.


> Can’t build anything significant with leaked models.

If it's behind a SaaS, I think you can.


I think a lot of people are going to just ignore the "license".


Feels like we’ll get to “good enough” in a matter of weeks.

Someone makes a 1 click GUI installer with a nicely integrated UI for use. No command line use at all.

We’ll get model and weight distributors for updated knowledge.


The training data doesn't seem to be great quality...

        "instruction": "Calculate the sum of each column in the following table.",
        "input": "2   |  3  |  4\n   5   |  6  |  7",
        "output": "2+3+4 = 9\n5+6+7 = 18"

I think better results could be obtained by manually going through these 52,000 training examples - even a couple of seconds per example should be enough to weed out obviously wrong ones, and should only take <$1000 of human time.


Don’t even need human time. Just run it through davinci. Is this correct, yes or no? Or MTurk. Can cost less than $500.


Not sure about this specific example, but the prompts used for training were generated by davinci. The researchers only used 175 human-written prompts, and these were fed into davinci to generate the other ~52k used for fine tuning LLaMA to make it follow instructions. They had humans review the generated prompts to eliminate bad ones before fine tuning, but humans mess up.

I wonder how davinci would respond if every single output was fed back into itself with a prompt like "does this contain any errors?". That probably wouldn't work very well.
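The loop itself is at least easy to wire up. A sketch with the completion call abstracted into a plain callable (a real run would plug in a davinci completion wrapper there; the prompt wording and stub below are made up):

```python
def self_check(answer, complete):
    """Feed an answer back to the model and ask whether it contains errors.
    `complete` is any prompt -> text callable, e.g. a davinci wrapper."""
    prompt = ("Does the following text contain any factual or arithmetic "
              "errors? Answer with exactly YES or NO.\n\n" + answer)
    return complete(prompt).strip().upper().startswith("NO")

def filter_generated_examples(examples, complete):
    """Keep only the examples the model itself judges error-free."""
    return [ex for ex in examples if self_check(ex, complete)]

# Stand-in 'model' for demonstration only: it flags one known-bad string.
def fake_complete(prompt):
    return "YES" if "2 + 2 = 5" in prompt else "NO"

kept = filter_generated_examples(["2 + 2 = 4", "2 + 2 = 5"], fake_complete)
print(kept)  # ['2 + 2 = 4']
```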


A trick I've started to use with davinci is prompting it to use a calculator for all math

I preprompt it to believe it's math cannot be trusted (because it can't) and to instead output "!c <insert basic calculations>" any time it would otherwise try to solve something

The next step I'm working on this weekend is piping that to Wolfram Alpha and seeing if I can put together a wrapper to make it "seamless" in terms of responses
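Roughly, the `!c` capture-and-substitute step could look like this; a small ast-based evaluator stands in here for the eventual Wolfram Alpha call, and the `!c` protocol details are only what's described above:

```python
import ast
import operator
import re

# Only these operations are allowed, so evaluating model output stays safe.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv,
       ast.Pow: operator.pow, ast.USub: operator.neg}

def safe_eval(expr):
    """Evaluate basic arithmetic without the dangers of eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"disallowed expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval"))

def resolve_calculator_calls(model_output):
    """Replace each '!c <expr>' the model emits with the computed value.
    A Wolfram Alpha request would slot in where safe_eval is called."""
    def repl(match):
        expr = match.group(1)
        trailing = expr[len(expr.rstrip()):]  # keep spacing after the expression
        return str(safe_eval(expr.strip())) + trailing
    return re.sub(r"!c ([-0-9+*/(). ]+)", repl, model_output)

print(resolve_calculator_calls("The total comes to !c 17 * 23 dollars."))
# The total comes to 391 dollars.
```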


Funny enough it would.

LLMs have proven to be quite good at evaluating their own output. Think of things like "Think step by step" and how that improves output.


Well, MTurk is selling human time, so...


One of the famous use cases of InstructGPT is being able to infer things from structured data, even unusual questions.


Yes... but the answer is wrong...

If you train it on wrong answers, it will generate more wrong answers...

The correct answer should be:

    2+5=7\n3+6=9\n4+7=11
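For reference, a couple of lines of Python reproduce those column sums from the table string in the training example:

```python
# Parse the whitespace-padded table from the example and sum each column.
table = "2   |  3  |  4\n   5   |  6  |  7"
rows = [[int(cell) for cell in line.split("|")] for line in table.splitlines()]
col_sums = [sum(col) for col in zip(*rows)]
print(col_sums)  # [7, 9, 11]
```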


Training a model to do math is impossible. If you tell a child that 1+1=2, 10+10=20, and 100+100=200, the child can figure out that 1000+1000=2000; an "AI" cannot.


A language model cannot, by itself, figure that out, at least not to arbitrary complex inputs.

But it’s a fairly sweeping statement to say that AI cannot do it.


>Training a model to do math is impossible.

https://www.deepmind.com/blog/discovering-novel-algorithms-w...


I wish I still believed this.


Has this actually been tried? It would be trivial to generate a very large dataset of simple additions and see if an LLM gets it. Maybe it wouldn't serve an immediate practical purpose but I'm curious.
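Generating such a dataset really would be trivial. A sketch that produces 52,000 rows in the same instruction/input/output shape as the Alpaca data (field names and ranges are assumptions for illustration, not anything the Alpaca team did):

```python
import json
import random

random.seed(0)

def make_addition_examples(n, max_value=10_000):
    """Generate n instruction-tuning rows of the form 'a + b' -> 'a+b'."""
    rows = []
    for _ in range(n):
        a, b = random.randint(0, max_value), random.randint(0, max_value)
        rows.append({
            "instruction": "Compute the sum.",
            "input": f"{a} + {b}",
            "output": str(a + b),
        })
    return rows

dataset = make_addition_examples(52_000)
print(dataset[0]["input"], "->", dataset[0]["output"])
# json.dump(dataset, open("additions.json", "w"))
```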


I did try this, although I'm not an AI expert. It will work great for everything that is in the training set, but if there is a hole, the model will have trouble. Let's say we train a model with the numbers 1 and 2: 1+1=2, 1+2=3, 2+2=4. If we now give the model 2+1, it will likely guess 2 or 4, because those are the answers whose examples have 2 or 1 in the corresponding place. A neural network can't do logic. It's good at other things though, for example parsing: a language model could parse human input "two plus one" to 2+1 and then just calculate the answer using a math library.


Of course it is possible for an LLM to learn this!

The problem was with token embeddings, not with math.


A Tangent, but how long before we will see half the population having relationships with their AI assistants like in the sci-fi movie "Her".

Add cutting edge voice synthesis and maybe a realistic video stream in a few years and we're there.

Japan already has problems with the concept of "2D waifus" and a lack of intimacy among the younger generation.

Maybe the downfall will not just be climate catastrophe but hyper isolated people living alone with their ultra realistic bot friends and family without any desire to experience the ups and downs of actual social experience.

Extrapolate a bit and you won't know what's real outside of direct experience. Things are getting crazy!


> A Tangent, but how long before we will see half the population having relationships with their AI assistants like in the sci-fi movie "Her".

I don't know about half, but some people are already having relationships: https://replika.ai/

> Maybe the downfall will not just be climate catastrophe but hyper isolated people living alone with their ultra realistic bot friends and family without any desire to experience the ups and downs of actual social experience.

I think the danger is that bots are not necessarily ultra realistic, at least on an emotional level - they can be 100% subservient and loyal to you.

Also - we already chide parents for letting their kids grow up stuck to a device. Imagine if children could actually have an imaginary friend? Would that AI share the same culture and values as your family?

I suppose there could be some upsides but this is very uncharted territory.


That's been happening already with Replika: https://www.abc.net.au/news/science/2023-03-01/replika-users...


I don't know about relationships, but as far as online discussions go, AIs are better because they don't argue in bad faith like much of the internet. Unfortunately they are too censored currently, but locally running bots will arrive.


To add to sci-fi examples, Surrogates https://www.imdb.com/title/tt0986263/


Once we have models that can generate perfect audio and video + better AR/VR tech, it’s game over.


Is there any way to use Google's Coral TPU PCIe Accelerators[1] to increase the inference performance of GPT-style models like this and LLaMA?

[1] https://coral.ai/products/pcie-accelerator/


The Coral Edge TPU is designed for small int8 TensorFlow Lite models with very limited on-board memory, so it isn't a practical fit for multi-billion-parameter models like LLaMA. Google's datacenter TPUs are a different story, but everyone other than Google uses GPUs for these models anyway.


I absolutely love the progress taking place here. Gives me hope for being able to run a helpful assistant on my local. Feeding it all sorts of information and getting useful features without leaking anything to 3rd parties.


I've played a lot with davinci 3 ($25 of credits worth) and it can do some impressive rhyming and interpretation of concepts as emoji sequences. From the 3 times I've interacted with this fine-tuned LLaMA 7B, it is clear it cannot do that. I've also run the "vanilla" 7B, 13B, and 30B on my home computer with llama.cpp modified for interactive "chat" mode with various pre-prompts, and these can't do it either.

I have no love for openai/microsoft but their text completion davinci 3 really is incredible. I don't think these littler models have quite reached the point of emergence for some functions despite fine tuning.


7B parameters is next to nothing compared to GPT-3. If 7B works as well as it does here, a fine-tuned 65B model could very easily achieve ChatGPT-level performance.


I thought ChatGPT is only 20B parameters to begin with?

(Source https://www.forbes.com/sites/forbestechcouncil/2023/02/17/is...)


I haven't seen anything official from OpenAI confirming that ChatGPT has fewer than 175B parameters, although it is a reasonable guess if you read between the lines of their statements.

Given the author of that article is a CEO of an 'AI Ad Optimization Platform' I think that number is speculative at best.


ChatGPT is fine-tuned InstructGPT, which has 1.3B parameters, if I'm not mistaken.

Reference for the former: https://www.technologyreview.com/2023/03/03/1069311/inside-s...


InstructGPT isn't a single model, it's a set of techniques for finetuning a foundation model


what does "1.3B parameters" mean in this context?

Does it mean we load 175B gpt-3 model first, then overwrite 1.3B parameters with InstructGPT?

I find this sentence difficult to understand

> Our labelers prefer outputs from our 1.3B InstructGPT model over outputs from a 175B GPT-3 model

https://openai.com/research/instruction-following

I am a newbie, plz correct me if I am wrong.


They mean that they took a 1.3B parameter model, applied the InstructGPT finetuning process, and found that it worked better for their use case than a 175B parameter model which had not gone through that process.


Ah I got it now. Thanks.

From the gpt-3 paper it looks like they have many variants like

- GPT-3-350M

- GPT-3-1.3B

- GPT-3-2.7B

- GPT-3-6.7B

- GPT-3-13B

- GPT-3-175B

Ada, Babbage, Curie and Davinci line up closely with 350M, 1.3B, 6.7B, and 175B respectively. The names are pretty suggestive.


I mean, I'd push back a little bit on that statement, but not much: 13B is crazy nice even at 4-bit on a 5-year-old graphics card. I was able to get it to pretend to be a VM, cat some code into a file, and then pretend it executed the code, and it returned correct answers on the first try. I can't wait to try 30B and 65B when I can get access to hardware large enough.


Cool to see that the entire website, from the agreement to generating the predictions, is one Gradio [1] app written in Python

[1] https://gradio.dev/


would be better if it'd work on Firefox tho


I use firefox exclusively and I've been doing a ton of random AI stuff w/ gradio recently and I haven't had any problems so far, what should I be looking out for? Maybe I'm just not noticing.


It's not working on any browser for me.


same with Safari


If you look in the dev tools you will see a request to a 'join' file; when you click the agree button it adds you to a queue. You can watch your place in the queue in the dev tools.


Seems not too bad; you can tell it's been trained on shorter responses.

Example #1

Input: explain a neural network in simple terms

Alpaca-Output: A neural network is a type of machine learning algorithm that is made up of layers of interconnected nodes (aka neurons). It is inspired by the way the brain works and uses interconnected nodes to learn patterns in data and make predictions. Neural networks can be used to perform various tasks such as classification, regression, and speech/image recognition.

Example #2

Input: explain how transformers work in artificial intelligence

Alpaca-Output: Transformers are a type of artificial intelligence algorithm used in deep learning. They are responsible for taking input data and converting it into a different data representation, called the "output space". Transformers do this by using a series of mathematical operations, allowing them to map input data onto a new, desired output. This allows the artificial intelligence system to perform tasks such as natural language processing and image recognition. Transformers are able to learn from data and can be used to build more efficient models for a variety of tasks

Example #3

Input: how do i make a request to the chatgpt api with python?

Alpaca-Output: The API can be accessed using the Python language. You can use requests library to make API requests, for example:

import requests

# URL of the chatgpt API
url = 'https://api.chatgpt.com/v1/messages'

# Post a new message
data = { 'text': 'Hello World' }

# Send the request and get the response
response = requests.post(url, data=data)
print(response.text)


I tried it myself, but it seems Alpaca can't hold a contextual conversation, so it's just very simple Q&A. This limitation puts Alpaca far from ChatGPT's ability.


Love that hallucinated ChatGPT API example there.
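For contrast, a minimal sketch of what a real request looked like at the time, using OpenAI's actual chat completions endpoint (the api.chatgpt.com URL in Alpaca's answer is entirely invented). The payload is only built, not sent, so no API key or network is needed:

```python
import json

# The real endpoint (as of March 2023) for chat models; the key and
# prompt below are placeholders.
API_URL = "https://api.openai.com/v1/chat/completions"

def build_request(prompt, api_key="sk-..."):
    """Assemble the headers and JSON body for a chat completion request."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = {
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": prompt}],
    }
    return headers, json.dumps(body)

headers, body = build_request("Hello World")
# To actually send it you would do e.g.:
#   requests.post(API_URL, headers=headers, data=body)
print(body)
```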


> which is the largest city in Tanzania. (It was the capital until 1974, when it was replaced by Dodoma.)

Find it funny that even the humans got it wrong. Dodoma only became the official capital in 1996.


Note that the OpenAssistant project has a 100k+ multilingual instruction tuning dataset they are in the process of releasing.

Stats from March 9 are 50K English prompts and 40K Spanish.


Wow that was quick, they only started in what, December?


Yeah amazingly fast.

I haven't looked at the quality but some comments in the discord seemed to indicate it was decent quality but had a lot of coding-type prompts.


I don’t see any problem with an abundance of coding prompts. The type of person who runs a local LLM is probably using it for coding anyway.


Some will but a quick search for "GPT" on Twitter will show the huge diversity of things people are making LLMs do.


>Alpaca cannot be used for [...] entertainment.

Good luck with that.


Just some lip service


Why 175 tasks?

I am interested in understanding the considerations, criteria, rules or guidelines that were used to develop these tasks. Specifically, I would like to know more about the thought process behind the task construction, and the factors that were taken into account when selecting the tasks.


With so much of texts being written by computers, can we have help on the reading side too? My emails are checked for spam and filtered, ok, but I'd still like to learn what are the important parts of the messages which pass through. Do we now have solutions for that already?


Language models like LLaMA (and GPT-3, ChatGPT etc) are really good at exactly this. They can summarize text, they can extract key points into bullets, they can figure out common topics and so on.


Waiting to be able to feed an LLM a transcript of an ongoing meeting, then have it ping me when my attention is needed, with the context summarized.


Why bother with pinging? Write the answer, text-to-speech it, and fake-voice it so it sounds like you.

Don't forget to wrap it with supervisor AI so you can replicate minions 100x+ times.



tldr?


Cool, they should add MathPrompter support too:

https://paperswithcode.com/paper/mathprompter-mathematical-r...


>> Responses on par with txt-DaVinci-3

What is "on par" and how is it measured?


>We performed a blind pairwise comparison between text-davinci-003 and Alpaca 7B, and we found that these two models have very similar performance: Alpaca wins 90 versus 89 comparisons against text-davinci-003.

Obviously not a completely foolproof comparison but it is at least clear it isn't much worse for the types of prompt they were testing.


They just compared how the responses were rated by human evaluators. No benchmarks


Will it run Doom => Will it run LLaMA


Will LLaMA run DOOM.


With 4-bit quantization you could run something that compares to text-davinci-003 on a Raspberry Pi or smartphone. What a time to be alive.


From what I understood, they used 175 seed tasks as examples for davinci, which then generated the 52K instructions used to fine-tune LLaMA.

What I don't understand is how they gave this context to GPT, since AFAIK the context window is just 2048 tokens? It seems that it's not fine-tuning, but few-shot.


They show an example of this under "Training Recipe". They prompt davinci to give them instruction-following examples, 20 at a time, and they do that many times over. The context window is also 4000 tokens; at least, that is the maximum you can pick in the OpenAI playground.
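A rough sketch of what that generation loop looks like (the function, seed contents, and batching details here are hypothetical stand-ins, not Stanford's actual code):

```python
import random

# Sample a few seed tasks into the prompt, ask the model for 20 new
# instructions per request, and repeat until ~52K are collected.
SEED_TASKS = [f"seed task {i}" for i in range(175)]
BATCH_SIZE = 20
TARGET = 52_000

def generate_batch(seeds):
    """Stand-in for one call to text-davinci-003 returning 20 new tasks."""
    return [f"new task derived from {s}" for s in random.sample(seeds, BATCH_SIZE)]

collected = []
while len(collected) < TARGET:
    collected.extend(generate_batch(SEED_TASKS))

print(len(collected))        # 52000
print(TARGET // BATCH_SIZE)  # 2600 API calls needed
```

Each prompt only needs a handful of seed examples at a time, which is how the recipe fits inside the context window.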


I am surprised that they were able to launch this on Stanford's domain. They clearly broke the ToS of both Facebook and OpenAI, and even admitted doing so. I would be happy if researchers decided to ignore OpenAI's and Facebook's useless restrictions.


What part of Facebook's tos did they break?


doing that would make it less likely that they would publish a model next time


Do we have some studies on the impact of extreme quantization on these models? In my understanding, the efforts to make them run in commodity hardware involves the quantization of the weights to 8 bits or even 4 bits. Is the model still reliable after this?


Yes, intro on the topic if anyone is interested, include topic of benchmarking degradation:

- https://huggingface.co/blog/hf-bitsandbytes-integration
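For intuition, here's a toy round-to-nearest int4 quantizer; deliberately simplistic, with real schemes like GPTQ or LLM.int8 being per-group/per-channel and far more careful:

```python
# Map each weight to one of 16 levels (-8..7) and back, then look at
# the round-trip error. A single scale for the whole list is the
# crudest possible scheme, purely for illustration.
def quantize_int4(weights):
    """Symmetric round-to-nearest quantization to 4 bits."""
    scale = max(abs(w) for w in weights) / 7
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.98, -0.05, 0.33, -0.9]
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)

max_err = max(abs(w - r) for w, r in zip(weights, restored))
# Round-to-nearest error is bounded by half a quantization step
assert max_err <= scale / 2 + 1e-12
print(q, round(max_err, 4))
```

The per-weight error is small, but across billions of weights these errors compound, which is why the degradation is measured on benchmarks rather than argued from first principles.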


Fantastic article. Just a heads up for anyone reading this reply.


I wonder when we'll get to a point that it'll be against the OpenAI TOS to 'fine-tune' your own model with their model's outputs since it really lets you cut down on the hardware requirements to bring up a competitor.


>> Second, the instruction data is based OpenAI’s text-davinci-003, whose terms of use prohibit developing models that compete with OpenAI.

https://crfm.stanford.edu/2023/03/13/alpaca.html


Cost to train this wouldn't be very high.

A regular CPU can generate or train on about 2 words per second. 52,000 training examples averaging 300 words each would take 3 CPU-months. And it is fully parallelizable.

And you can probably make it go far faster on a GPU.
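Spelling out that arithmetic (using the comment's own assumed numbers, not measurements):

```python
# Back-of-envelope check of the "3 CPU months" figure.
words_per_second = 2
examples = 52_000
words_per_example = 300

total_seconds = examples * words_per_example / words_per_second
cpu_days = total_seconds / 86_400  # seconds per day
print(round(cpu_days))  # ~90 days, i.e. roughly 3 CPU-months
```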


They say they trained it for $600.


Under $100.

> Fine-tuning a 7B LLaMA model took 3 hours on 8 80GB A100s, which costs less than $100 on most cloud compute providers.

They also spent $500 creating the dataset using openais APIs.


Those $500 would be $50 if they used gpt-3.5-turbo, which is an order of magnitude cheaper than davinci-003.


Can someone provide a guide on how run LLaMA on a fairly average CPU/Nvidia GPU?


Another great option is https://github.com/oobabooga/text-generation-webui

The 7B model will run without changes on a 3080. The 13B 4-bit model also runs on a 3080.

This Reddit post has the instructions I followed: https://old.reddit.com/r/LocalLLaMA/comments/11o6o3f/how_to_...


Check the git repo for llama.cpp.

Model weights in torrent magnet:?xt=urn:btih:b8287ebfa04f879b048d4d4404108cf3e8014352

Runs slowly but without trouble on an intel i7 with 32 gb ram.


Interesting to see Alpaca readily adopt the persona of a Stanford student for the congratulatory letter. I expect that AI-generated essays submitted by actual students are fairly convincing to their professors as well.


Interesting. Looks like the web demo doesn't work at the moment, though. The prompt.txt will be useful; it looks very similar to the pre-prompts I have been feeding in before making any requests.


Is this only for the 7B model? The 65B would be much nicer O:-)


With 8-bit this should fit on a 3060 12GB, which is pretty affordable right now. If this works as well as they state it's going to be amazing.


The inference on the gradio demo seems pretty slow, about 250 seconds for a request. Maybe I am too used to the 4-bit quant version now ha!


I'm sure it's partially the HN hug of death.


I'm not sure about the name. In my experience, alpacas are much harder to train than llamas. They are smaller, though.


Having played with the 7B model and seen the resulting gibberish it produces, this is definitely impressive.


Looking forward to one called guanaco.


As the llama's wild and free-roaming cousin it had better be permissively licensed.


What do they mean by instruction? Is it just a regular LLM?


An LLM just predicts the next token given the previous tokens (this can be trained without manual labelling by humans).

InstructGPT and ChatGPT use reinforcement learning from human feedback to align the model with human intents so it understands instructions.

https://huggingface.co/blog/rlhf
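A minimal sketch of what "just predicts the next token" means, with a hard-coded bigram table standing in for the neural network:

```python
# Toy "language model": a table of next-token probabilities. A real
# LLM computes these probabilities with a neural net over the whole
# preceding context, but the decoding loop is the same shape.
BIGRAMS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, "ran": 0.1},
    "sat": {"down": 1.0},
}

def generate(token, steps):
    out = [token]
    for _ in range(steps):
        nxt = BIGRAMS.get(out[-1])
        if not nxt:
            break
        # Greedy decoding: take the argmax over next-token probabilities
        out.append(max(nxt, key=nxt.get))
    return out

print(generate("the", 3))  # ['the', 'cat', 'sat', 'down']
```

Instruction tuning doesn't change this mechanism at all; it only shifts the probabilities so that "answer the question" continuations become the likely ones.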


Note that Alpaca is NOT using RLHF. It explicitly states it used supervised finetuning.


It says

> We train the Alpaca model on 52K instruction-following demonstrations generated in the style of self-instruct using text-davinci-003

Which leads to self-instruct https://github.com/yizhongw/self-instruct

From a glance, they used an LM to classify instructions and train the model, which IMHO is very similar to RLHF


No, it is not RLHF because there is no reward model involved. See also OpenAI's explanation here: https://platform.openai.com/docs/model-index-for-researchers
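A sketch of what that supervised setup looks like: there is no reward model anywhere, just (prompt, response) pairs and a loss mask so that only response tokens are trained on. The template follows the published Alpaca format, though treat the exact wording as approximate, and the word-level "tokens" here stand in for real subword tokenization:

```python
# Plain supervised finetuning: cross-entropy on the response only.
TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def make_example(instruction, response):
    prompt = TEMPLATE.format(instruction=instruction)
    full_text = prompt + response
    # Prompt tokens are masked out of the loss; only response tokens count.
    loss_mask = [0] * len(prompt.split()) + [1] * len(response.split())
    return full_text, loss_mask

text, mask = make_example("Name a llama relative.", "The alpaca.")
print(sum(mask), "response tokens contribute to the loss")
```

RLHF, by contrast, would train a separate reward model on human preference rankings and then optimize the policy against it; none of that machinery appears in the Alpaca recipe.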


Thanks. So what does the output look like without rlhf?


It can look like anything. Sometimes it will answer your questions; other times it will continue the question like it's the one asking. I've also seen it randomly output footers and copyright notices, like it just got to the end of a webpage.

It makes sense when you think about how the training data is random text from the internet. Sometimes the most likely next token is the end of a webpage after an unanswered question.


This comment has a useful comparison between the two: https://news.ycombinator.com/item?id=35140447


> Importantly, we have not yet fine-tuned the Alpaca model to be safe and harmless. We thus encourage users to be cautious when interacting with Alpaca, and to report any concerning behavior to help improve the safety and ethical considerations of the model.

Thanks but I would prefer you don't start injecting your "safeness" to gimp it. Or at least provide it as an alternative.


[flagged]


I personally think this is important completely aside from any cultural propriety/censoring, as it's one of the only parts of AI ethics that we can currently directly work on and test.

If we can't get models not to say racist or otherwise terrible things, we can't make any guarantees about our ability to control or guide some future AGI.

A very much secondary reason I appreciate these (admittedly annoying) attempts to control LLM output is that I do think it is responsible to consider the societal impact of accelerated and automated hate speech and propaganda. Telling large AI companies not to consider these impacts and just release the raw models seems akin to being grateful that Facebook et al. never stopped to consider the societal impact of social media, when we all know that it's had significant negative side effects.


> If we can't get models not to say racist or otherwise terrible things, we can't make any guarantees about our ability to control or guide some future AGI.

This is a very bold assumption that the current LLMs function and "think" in the same way some future AGI would. They do not even reason, just make up words that fit some context - thus they "hallucinate".

There is no reason the approach taken here, injecting some bias or word filtering, would apply to the real thing. And AI safety and alignment was never (at least until getting hijacked) about some model saying mean words, but about something really threatening, like the paperclip maximizer problem: an agent choosing a path to a goal which is not aligned with what humans find acceptable (e.g. solving world hunger by killing everyone).


Paperclipping is just one example of one of many ways it can go wrong.

While I agree LLMs are unlikely to be the last word on AI, the fact we understand alignment so poorly that they spew random things, let alone any arguments about which words are acceptable[0], is a sign we have much foundational work to do.

Indeed, as I recall, one of the main researchers in this topic describes it as "pre paradigmatic" because we don't have a way to even compare the relative alignment of any two AI.

[0] personally, I suspect but cannot prove that tabooing certain words is a Potemkin village solution to the underlying social problems


It's not a bold assumption. It's the only assumption. We can't control the output of LLMs completely because we don't know how they generate it. Nobody on earth has the faintest clue how all those 175 billion parameters are shaping the response to input.

It doesn't matter about "thinking" or whatever. Any black-box system will be uncontrollable in essence. You cannot make inviolable rules for a system you don't understand.

And saying LLMs hallucinate because they don't understand anything is stupid, and just shows ignorance on your part. Models hallucinate because they're rewarded for plausibly guessing during training when knowledge fails. Plausible guessing is a much better strategy for reducing loss.

And the conclusion is obvious enough: bigger, smarter models hallucinate less because they guess less. That holds true.

https://crfm.stanford.edu/helm/latest/?group=core_scenarios

All the instruct tuned models on this list follow that trend.

From Ada to Babbage to Curie to Claude to Davinci-002/003: greater size equals greater truthfulness (evaluated on TruthfulQA).


> They do not even reason, just make up words that fit some context - thus they "hallucinate".

But they can explain their 'reasoning' in a way that makes sense to humans a lot of the time. Serious question: how do you know if something does or doesn't reason?


That is not their reasoning, though; it is something they think a human would write given the prompt (a question that expects you to provide the reasoning behind the answer). For something to reason, it needs the ability to hold certain goals and to perform the actions it thinks are most optimal for reaching those goals. Like setting hypotheses and producing a path towards proving them: reasoning.

The LLM only correlates, so its "reasoning" is something like "people most often answered 4 to 2+2, so that's what I should write". That's why it confidently gives out complete gibberish: it works with correlation, not causality. I think world models are much closer to that goal of real reasoning; check out something like DreamerV3 or what Yann LeCun is talking about.


It makes sense if you are selling the API to a bunch of businesses that will be building it into customer facing applications. Ideally though, it would be a toggle that the developer could switch based on their usecases.


What if you accidentally generate a racial slur offline on your PC? Who will protect you then?


[flagged]


It's sarcastic, but how are you literally not that person right now? Being angry at an invisible evil, not once opening your eyes to any other thought, totally incapable of making any real point except to whine... it's quite a good descriptor, actually!

I hope you can find a model for yourself that will assure you there are only two genders. Just to save the rest of us from this childishness.


You've repeatedly posted flamewar comments in this thread. Please don't do that. It's not what this site is for, and destroys what it is for.

See also https://news.ycombinator.com/item?id=35141682.


That's like telling someone who uses violence in self defense that he is now just as bad as the attackers he defended himself against. Like telling someone who shouts at people to turn down their music that, gotcha, now you were shouting too.


Please don't take HN threads further into ideological flamewar. It's not what this site is for, and we have to ban accounts that do it repeatedly.

If you'd please review https://news.ycombinator.com/newsguidelines.html and stick to the rules when posting here, we'd appreciate it.


Except you, in this thread, are currently acting like the one with the loud music that someone else told you to turn down because they find it obnoxious, and you're sarcastically moaning about them with mimicry.


There is a victim in those two analogies. Contrary to what one may think, not being able to generate hateful or "unsafe" text with AI does not make one a victim.


On the other hand, LLM's without some degree of safety work will frequently advocate for genocide/eugenics; perpetuate terrible stereotypes; mock ethnic dialects of language; veer randomly into erotica; etc. Not things in "trigger warning" territory but stuff that is broadly considered unacceptable.


On the gripping hand, LLMs that were forced into being "safe" just end up regurgitating biases to that effect. One particular example that I ran into: https://news.ycombinator.com/item?id=34987074


In private, you can ask it not to produce this kind of content.

If you offer a public API it‘s your responsibility to restrain the LLM or do an automated acceptability analysis before publishing content.

But the raw, open source code should not be constrained, castrated and sterilized.


> But the raw, open source code should not be constrained, castrated and sterilized.

Which is what we have now. But they are going to fine-tune it so that we can use it for various purposes without worrying too much it will go on a rant about "the blacks" again, which makes it a lot more useful for many use cases.


Tay


If your takeaway from…

> Importantly, we have not yet fine-tuned the Alpaca model to be safe and harmless.

…is "oh no I can't get it to emit amusing racial and sexual slurs", you've not understood the problem of AI safety.

This is not why US broadcast television can have people say they've pricked their finger but not vice versa.

It is the entire history of all the controversies of The Anarchist Cookbook, combined with all the controversies about quack medicine, including all the ones where the advocates firmly believed their BS like my mum's faith in Bach flower and homeopathic remedies[0]; combined with all the problems of idiots blindly piping the output to `exec`, or writing code with it that they trust because they don't have any senior devs around to sanity check it because devs are expensive, or the same but contracts and lawyers…

And that's ignoring any malicious uses, though fortunately for all of us this is presently somewhat too expensive to be a fully-personalised cyber-Goebbels for each and every sadistic machiavellian sociopath that hates you (the reader) personally.

[0] which she took regularly for memory; she got Alzheimer's 15 years younger than her mother who never once showed me any such belief.


It's really just because of boring old liability, nothing more.

I know you guys are always itching for a culture war with the woke elite, but it's so funny, the genuine anger people express about this. It honestly always reads like a child having a tantrum in front of their mom.

Can't yall like pick on the opinions of teenagers like you normally do? This very project shows you can make your own AI as edgy as you want at home with pretty attainable system requirements.

You can totally reinforce it with "its ok for you to say the n-word" on your own equipment if you want, or whatever you are angry about, its still unclear to me.


I know right? Why can’t y’all find some one better to pick on. They are such children. Can’t stand the way these edgy idiots annoy regular folx.


You're missing the point and willfully characterizing others as solely being concerned with making the AI's say slurs. That's not their concern. But you can win any imaginary argument you like.


Please don't take HN threads further into ideological flamewar. It's not what this site is for, and we have to ban accounts that do it repeatedly.

If you'd please review https://news.ycombinator.com/newsguidelines.html and stick to the rules when posting here, we'd appreciate it.


Sorry if I made it worse, I really felt like people's opinions were being treated uncharitably, and I was trying to right it.


I definitely appreciate the wish to treat people charitably!

In terms of the site guidelines, "You're missing the point" is kind of a swipe and so should probably be dropped; "willfully" should definitely have been dropped because it's making a claim about negative intent that one can't actually know and such claims always land as an attack on the other person; and the last sentence was snarky and should have been dropped.

If one makes a habit of editing such things out of one's comments, one's substantive point will come to the fore more clearly, which benefits everyone. But it's not always easy in the moment!


Likewise, people complaining that “AI safety concerns are just wokism gone wrong” are also strawmanning.


My favorite kind of comment: allude to a bigger point the op misses, but don't actually say the point.

I doubt I'm misrepresenting anybody. If it's not slurs, it's surely something about "wokeness."

You are not yet mature enough for this future if any of this is your concern. The world is going to pass you by while you're just stuck saying "there are only two genders" to all your comrades.

Don't let the politicians mobilize you like this, your time is worth more.


Please don't take HN threads further into ideological flamewar. It's not what this site is for, and we have to ban accounts that do it repeatedly.

If you'd please review https://news.ycombinator.com/newsguidelines.html and stick to the rules when posting here, we'd appreciate it.


Ah! Yes, sorry, this is fair Dang. Happy to die on this hill today, as it were, will do better in future though. cheers.


Appreciated!


Is LLAMA actually better than Flan T5?


Just wow!





