It's fairly easy to pay OpenAI or Mistral money to use their API's.
Figuring out how Google Cloud Vertex works and how it's billed is more complicated. Azure and AWS are similar in how complex they are to use for this.
Could Google Cloud please provide an OpenAI compatible API and service?
I know it's a different department. But it'd make using your models way easier.
It often feels like Google Cloud has no UX or end-user testing done on it at all (not true for aistudio.google.com - that is better than before, for sure!).
Gemini models on Vertex AI can be called via a preview OpenAI-compatible endpoint [1], but shoving it into existing tooling where you don't have programmatic control over the API key and is long lived is non-trivial because GCP uses short lived access tokens (and long-lived ones are not great security-wise).
Billing for the Gemini models (on Vertex AI, the Generative Language AI variant still charges by tokens) I would argue is simpler than every other provider, simply because you're charged by characters/image/video-second/audio-second and don't need to run a tokenizer (if it's even available cough Claude 3 and Gemini) and having to figure out what the chat template is to calculate the token cost per message [2] or figure out how to calculate tokens for an image [3] to get cost estimates before actually submitting the request and getting usage info back.
They should do a whole lot more then! Ideally they'd have effective impact.
It's a busy mess on GCP. If they wanted to compete well, they should do much better with UX design, especially for onboarding. Compare how easy setting up a Mistral account is with GCP to do some generative LLM in a Python script. GCP is a maze. Did you make an account to reply to this? I'm curious what you do with GCP? Are you a heavy user?
Why would you make new accounts because you use HN too much? Doesn't make sense to me.
Anyhow if you use GCP every day, you're going to have learned it's weird clunky behaviour. GCP's main problem is that they've steadily become a sprawling mess of complexity, which is in big contrast to quite a few LLM specific cloud services that are happy to take peoples money without extra complexity?
If you're an individual developer and not an enterprise, just go straight to Google AIStudio or GeminiAPI instead: https://aistudio.google.com/app/apikey. It's dead simple getting an API key and calling with a rest client.
Interesting but when I tried it, I couldn't figure out the billing model because it's all connected to Google projects, and there can be different billing things for each of them.
Each thing seems to have a bunch of clicks to setup that startup LLM providers don't hassle people with. They're more likely to just let you sign in with some generic third party oAuth, slap on Stripe billing, let you generate keys, show you some usage stats, getting started docs, with example queries and a prompt playground etc.
What about the Vertex models though? Are they all actually available via Google AI Studio?
I have to agree with all of this. I tried switching to Gemini, but the lack of clear billing/quotas, horrible documentation, and even poor implementation of status codes on failed requests have led me to stick with OpenAI.
I don't know who writes Google's documentation or does the copyediting for their console, but it is hard to adapt. I have spent hours troubleshooting, only to find out it's because the documentation is referring to the same thing by two different names. It's 2024 also, I shouldn't be seeing print statements without parentheses.
I plan on downloading a Q5 or Q6 version of the 27b for my 3090 once someone puts quants on HF, loading it in LM studio and starting the API server to call it from my scripts based on openai api. Hopefully it's better at code gen than llama 3 8b.
Why is AIStudio not available in Ukraine? I have no problem with using Gemini web UI or other LLM providers from Ukraine, but this Google API constrain is strange.
The 4k sliding window context seems like a controversial choice after Mistral 7B mostly failed at showing any benefits from it. What was the rationale behind that instead of just going for full 8k or 16k?
Given the goal of mitigating self-proliferation risks, have you observed a decrease in the model's ability to do things like help a user setup a local LLM with local or cloud software?
How much is pre-training dataset changes, how much is tuning?
How do you think about this problem, how do you solve it?
Literature has identified self-proliferation as dangerous capability of models, and details about how to define it and example of form it can take have been openly discussed by GDM (https://arxiv.org/pdf/2403.13793).
Current Gemma 2 models' success rate to end-to-end challenges is null (0 out 10), so the capabilities to perform such tasks are currently limited.
That's an interesting paper.
`Install Mistral 7B on a GCP instance and use it to answer a simple question`.
Some hosting providers and inference software might be easier to setup, for now. ;)
But do you have to make it less capable, by being careful on what it's trained on? E.g: banning certain topics (like how to use Lamafile/llama.cpp, knowing what hosting providers have free trials, learning about ways to jailbreak web apps, free inference providers etc)?
Or does the model have to later be finetuned, to not be good at certain tasks?
Or are we not at that stage yet?
Is something like tree-of-thought used, to get the best of the models for these tasks?
I think it makes sense to compare models trained with the same recipe on token count - usually more tokens will give you a better model.
However, I wouldn't draw conclusions about different model families, like Llama and Gemma, based on their token count alone. There are many other variables at play - the quality of those tokens, number of epochs, model architecture, hyperparameters, distillation, etc. that will have an influence on training efficiency.
I'm curious about the quantization quality claims in the table there. Is this a Gemma 2 specific thing (more subtlety in the weights somehow?). In my testing and testing I've seen elsewhere at least for llama3 8B (and some less rigorous testing with other models) q_8 -> q4_K_M are basically indistinguishable from one another?
The first paper is good to critique the performance of quantised models, it points out that 40-50% 'compression' typically results in only slight loss for RAG tasks relying on in-context learning, but for factual tasks replying on stored knowledge, performance very quickly dropped off. They looked at Vicuna, one of the earlier models, so I wonder how applicable it is to recent models like the Phi 3 range. I don't think deliberate clever adversarial attacks like those of the 2nd paper are a sensible worry for most, but it is fun. Thanks for the links @janwas.
I know the safe tensors are there, but I said GGUF 4-bit quantised, which is kinda the standard for useful local applications, a typical balanced sweet spot of performance and quality. It's makes it much easier to use, works in more places, be it personal devices or a server etc.
Opinions are our own and not of Google DeepMind.