Hello (again) from the Gemma team! We are quite excited to push this release out...

luke-stanley · on June 27, 2024

It's fairly easy to pay OpenAI or Mistral money to use their API's. Figuring out how Google Cloud Vertex works and how it's billed is more complicated. Azure and AWS are similar in how complex they are to use for this. Could Google Cloud please provide an OpenAI compatible API and service? I know it's a different department. But it'd make using your models way easier. It often feels like Google Cloud has no UX or end-user testing done on it at all (not true for aistudio.google.com - that is better than before, for sure!).

Deathmax · on June 27, 2024

Gemini models on Vertex AI can be called via a preview OpenAI-compatible endpoint [1], but shoving it into existing tooling where you don't have programmatic control over the API key and is long lived is non-trivial because GCP uses short lived access tokens (and long-lived ones are not great security-wise).

Billing for the Gemini models (on Vertex AI, the Generative Language AI variant still charges by tokens) I would argue is simpler than every other provider, simply because you're charged by characters/image/video-second/audio-second and don't need to run a tokenizer (if it's even available cough Claude 3 and Gemini) and having to figure out what the chat template is to calculate the token cost per message [2] or figure out how to calculate tokens for an image [3] to get cost estimates before actually submitting the request and getting usage info back.

[1]: https://cloud.google.com/vertex-ai/generative-ai/docs/multim...

[2]: https://platform.openai.com/docs/guides/text-generation/mana...

[3]: https://platform.openai.com/docs/guides/vision/calculating-c...

luke-stanley · on June 27, 2024

Good to know about this API preview. Hopefully the billing problem and UI maze of Vertex AI can be sorted too?

Flumio · on June 27, 2024

Google does plenty of ux studies on gcp. I took part in at least 3 of them.

I'm also not sure if I understand your problem with pricing? Depending on what you do with it, it's not just an LLM. It actually started before llms.

Pricing for image classification and other features are completely different products like an LLM.

luke-stanley · on June 27, 2024

They should do a whole lot more then! Ideally they'd have effective impact. It's a busy mess on GCP. If they wanted to compete well, they should do much better with UX design, especially for onboarding. Compare how easy setting up a Mistral account is with GCP to do some generative LLM in a Python script. GCP is a maze. Did you make an account to reply to this? I'm curious what you do with GCP? Are you a heavy user?

Flumio · on June 28, 2024

I create new accounts because I use hn too much.

I use gcp professional every day and always found it quite intuitive.

Did plenty of image classification with vertex ai too

luke-stanley · on June 28, 2024

Why would you make new accounts because you use HN too much? Doesn't make sense to me. Anyhow if you use GCP every day, you're going to have learned it's weird clunky behaviour. GCP's main problem is that they've steadily become a sprawling mess of complexity, which is in big contrast to quite a few LLM specific cloud services that are happy to take peoples money without extra complexity?

Flumio · on June 28, 2024

Not being logged in feels like a bigger hurdle to comment and check if someone responded to it.

It's a shitty solution to a stupid problem ;)

But I did mention that vertex AI is more than just hosting llms though

ankeshanand · on June 27, 2024

If you're an individual developer and not an enterprise, just go straight to Google AIStudio or GeminiAPI instead: https://aistudio.google.com/app/apikey. It's dead simple getting an API key and calling with a rest client.

luke-stanley · on June 27, 2024

Interesting but when I tried it, I couldn't figure out the billing model because it's all connected to Google projects, and there can be different billing things for each of them.

Each thing seems to have a bunch of clicks to setup that startup LLM providers don't hassle people with. They're more likely to just let you sign in with some generic third party oAuth, slap on Stripe billing, let you generate keys, show you some usage stats, getting started docs, with example queries and a prompt playground etc.

What about the Vertex models though? Are they all actually available via Google AI Studio?

lhl · on June 27, 2024

Sadly, while gemma-2-27b-it is available (as a Preview model) on the AI Studio playground, it didn't show up via API on list_models() for me.

bapcon · on June 27, 2024

I have to agree with all of this. I tried switching to Gemini, but the lack of clear billing/quotas, horrible documentation, and even poor implementation of status codes on failed requests have led me to stick with OpenAI.

I don't know who writes Google's documentation or does the copyediting for their console, but it is hard to adapt. I have spent hours troubleshooting, only to find out it's because the documentation is referring to the same thing by two different names. It's 2024 also, I shouldn't be seeing print statements without parentheses.

logankilpatrick · on June 28, 2024

We are working hard to improve this across ai.google.dev (Gemini API), Hang tight!

hnuser123456 · on June 27, 2024

I plan on downloading a Q5 or Q6 version of the 27b for my 3090 once someone puts quants on HF, loading it in LM studio and starting the API server to call it from my scripts based on openai api. Hopefully it's better at code gen than llama 3 8b.

alekandreev · on June 27, 2024

Happy to pass on any feedback to our Google Cloud friends. :)

anxman · on June 27, 2024

I also hate the billing. It feels like configuring AWS more than calling APIs.

luke-stanley · on June 27, 2024

Thank you!

canyon289 · on June 27, 2024

I also work at Google and on Gemma (so same disclaimers)

You can try 27b at www.aistudio,google.com. Send in your favorite prompts, and we hope you like the responses.

dandanua · on June 28, 2024

Why is AIStudio not available in Ukraine? I have no problem with using Gemini web UI or other LLM providers from Ukraine, but this Google API constrain is strange.

jpcapdevila · on June 27, 2024

Will gemma2 be available through gemma.cpp? https://github.com/google/gemma.cpp

austinvhuang · on June 27, 2024

This is in the works in the dev branch (thanks pchx :)

https://github.com/google/gemma.cpp/pull/274

janwas · on June 27, 2024

:) Confirmed working. We've just pushed the dev branch to main.

jpcapdevila · on June 27, 2024

Awesome, I love this .cpp trend! Thanks for your work!!

moffkalast · on June 27, 2024

The 4k sliding window context seems like a controversial choice after Mistral 7B mostly failed at showing any benefits from it. What was the rationale behind that instead of just going for full 8k or 16k?

alekandreev · on June 27, 2024

This is mostly about inference speed, while maintaining long context performance.

causal · on June 27, 2024

Thanks for your work on this; excited to try it out!

The Google API models support 1M+ tokens, but these are just 8K. Is there a fundamental architecture difference, training set, something else?

coreypreston · on June 27, 2024

No question. Thanks for thinking of 27B.

luke-stanley · on June 27, 2024

Given the goal of mitigating self-proliferation risks, have you observed a decrease in the model's ability to do things like help a user setup a local LLM with local or cloud software?

How much is pre-training dataset changes, how much is tuning?

How do you think about this problem, how do you solve it?

Seems tricky to me.

alekandreev · on June 27, 2024

To quote Ludovic Peran, our amazing safety lead:

Literature has identified self-proliferation as dangerous capability of models, and details about how to define it and example of form it can take have been openly discussed by GDM (https://arxiv.org/pdf/2403.13793).

Current Gemma 2 models' success rate to end-to-end challenges is null (0 out 10), so the capabilities to perform such tasks are currently limited.

luke-stanley · on June 27, 2024

That's an interesting paper. `Install Mistral 7B on a GCP instance and use it to answer a simple question`. Some hosting providers and inference software might be easier to setup, for now. ;) But do you have to make it less capable, by being careful on what it's trained on? E.g: banning certain topics (like how to use Lamafile/llama.cpp, knowing what hosting providers have free trials, learning about ways to jailbreak web apps, free inference providers etc)?

Or does the model have to later be finetuned, to not be good at certain tasks?

Or are we not at that stage yet?

Is something like tree-of-thought used, to get the best of the models for these tasks?

moffkalast · on June 27, 2024

Turns out LLM alignment is super easy, barely an inconvenience.

josh-sematic · on June 27, 2024

Alignment is tight!

dinosaurdynasty · on June 27, 2024

One should not confuse alignment and current incapability.

mdrzn · on June 28, 2024

Wow wow wow.... wow.

WhitneyLand · on June 27, 2024

The paper suggests on one hand Gemma is on the same Pareto curve as Llama3, while on the other hand seems to suggest it’s exceeded its efficiency.

Is this a contradiction or am I misunderstanding something?

Btw overall very impressive work great job.

alekandreev · on June 27, 2024

I think it makes sense to compare models trained with the same recipe on token count - usually more tokens will give you a better model.

However, I wouldn't draw conclusions about different model families, like Llama and Gemma, based on their token count alone. There are many other variables at play - the quality of those tokens, number of epochs, model architecture, hyperparameters, distillation, etc. that will have an influence on training efficiency.

luke-stanley · on June 27, 2024

Any gemma-2-9b or 27b 4 bit GGUF's on HuggingFace yet? Thanks!

luke-stanley · on June 27, 2024

Actually for the 9B model, this has 4-bit quantised weights (and others): https://huggingface.co/bartowski/gemma-2-9b-it-GGUF

Still no 27B 4-bit GGUF quants on HF yet!

I'm monitoring this search: https://huggingface.co/models?library=gguf&sort=trending&sea...

SubiculumCode · on June 28, 2024

https://huggingface.co/bartowski/gemma-2-27b-it-GGUF

thot_experiment · on June 28, 2024

I'm curious about the quantization quality claims in the table there. Is this a Gemma 2 specific thing (more subtlety in the weights somehow?). In my testing and testing I've seen elsewhere at least for llama3 8B (and some less rigorous testing with other models) q_8 -> q4_K_M are basically indistinguishable from one another?

janwas · on June 28, 2024

Yes, PPL and certain benchmarks do not detect differences from quantization. But recent work gives cause for concern, e.g., https://arxiv.org/pdf/2310.01382, https://arxiv.org/pdf/2405.18137.

luke-stanley · on June 29, 2024

The first paper is good to critique the performance of quantised models, it points out that 40-50% 'compression' typically results in only slight loss for RAG tasks relying on in-context learning, but for factual tasks replying on stored knowledge, performance very quickly dropped off. They looked at Vicuna, one of the earlier models, so I wonder how applicable it is to recent models like the Phi 3 range. I don't think deliberate clever adversarial attacks like those of the 2nd paper are a sensible worry for most, but it is fun. Thanks for the links @janwas.

XzAeRosho · on June 27, 2024

It's on HuggingFace already: https://huggingface.co/google/gemma-2-9b

luke-stanley · on June 27, 2024

I know the safe tensors are there, but I said GGUF 4-bit quantised, which is kinda the standard for useful local applications, a typical balanced sweet spot of performance and quality. It's makes it much easier to use, works in more places, be it personal devices or a server etc.

chown · on June 27, 2024

If you are still looking for it, I just made it available on an app[1] that I am working on with Gemma2 support.

https://msty.app

luke-stanley · on June 27, 2024

Are you saying you put a 4-bit GGUF on HuggingFace?

zerojames · on June 27, 2024

How is Gemma-2 licensed?

alekandreev · on June 27, 2024

The terms of use remain the same as Gemma 1 - https://ai.google.dev/gemma/terms.

np_space · on June 27, 2024

Are Gemma-2 models available via API yet? Looks to me like it's not yet on vertexai

zone411 · on June 27, 2024

"Soon" https://x.com/LechMazur/status/1806366744706998732

kristianpaul · on June 28, 2024

Do run gemma2 on your Google phone?