Founder Conversations: A Week of LLM Insights in the Bay Area

Valtteri Karesto, Joni Juup, Arttu Laitinen

October 3, 2023

At gatherings like Ray Summit, a shift was evident: RAG is gaining traction in LLM deployments, but it is not without challenges. Fine-tuned smaller models, such as Llama 7B, offer cost-effective competition to giants like GPT-4. As LLM applications grow more intricate, the industry is pivoting from "chaining" to "orchestration" while grappling with performance evaluation and the trade-offs between open-source and proprietary models.

Introduction

We spent a week talking to founders and builders at Ray Summit, TechCrunch Disrupt, and various Bay Area GenAI meetups to understand the challenges they face when building LLM-based apps that deliver real value.

Overview

RAG is omnipresent. Retrieval-Augmented Generation (RAG) is what nearly everyone is working on right now. While it addresses many challenges of deploying LLMs in production, it introduces several others. We look at these in more detail later in this post.
Fine-tuned smaller models are making an impact. In certain use cases, a fine-tuned Llama 7B model can outperform GPT-4 at a fraction of the cost.
The term "orchestration" seems to be largely replacing "chaining." As LLM apps become more intricate, orchestration is a challenge that development teams are grappling with.
Another challenge is evaluating LLM performance, or "evals", which involves determining if your chain/agent is producing accurate answers, or whether the changes you implement improve or degrade the system.
The essence of RAG-based system performance lies in chunking strategies and the subsequent vectorization of data chunks. If the contextual data you retrieve to aid your LLM in answering questions is inaccurate or of poor quality, your LLM will not respond correctly.
If the retrieved context is correct, it works. Open-source LLMs (like the Llama 2 models) and proprietary models (like GPT-3 and GPT-4) seem to perform almost equally well at forming answers from it.
Anyscale released reasonably cheap endpoints for Llama models and plans to add fine-tuning endpoints later this year. Switching from OpenAI models is easy because the API is an almost 1-to-1 match with the OpenAI API.
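To make the point about chunking and vectorization concrete, here is a minimal sketch of the retrieval half of a RAG pipeline. Nobody we talked to described their system this way; it is an illustration under simplifying assumptions. The function names (chunk_text, vectorize, retrieve) are hypothetical, the chunks are fixed-size word windows with overlap, and the "embedding" is a toy bag-of-words count standing in for a real embedding model.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into windows of `chunk_size` words; neighbours share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def vectorize(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: a real system would call an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Return the `top_k` chunks most similar to the query."""
    query_vec = vectorize(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, vectorize(c)), reverse=True)
    return ranked[:top_k]
```

The chunks returned by retrieve would be pasted into the LLM prompt as context. If chunk_size is too small, the relevant passage gets split across chunks; if it is too large, the similarity score is diluted by unrelated text. That is one reason chunking strategy has such a large effect on answer quality.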

Ray Summit 2023 — Highlights

John Schulman, Co-founder, OpenAI

Ray Summit Day 1 Keynote

Paraphrasing John regarding whether they were surprised by ChatGPT's success: "We were very surprised. We had friends and family using it for a few months beforehand. There were some enthusiastic users, particularly those using it for coding, but overall, the excitement was muted. Not all users kept returning to it. I believe that once it became widely accessible, users taught each other how to use it effectively. The social element was crucial."

→ Testing with a small group of real users does not always tell the whole story, especially with new technology whose use cases may go beyond what you first imagined.

Goku Mohandas, ML & Product, Anyscale & Philipp Moritz, Co-founder and CTO, Anyscale

Ray Summit Day 1 Keynote

Discussing the development of a chat co-pilot for Ray documentation:

The assistant employs a hybrid routing solution based on a fine-tuned Llama 2 70B model. It sends some of the more intricate queries to OpenAI's GPT-4, combining cost efficiency with higher quality where it is needed.
To control or minimize hallucinations, they've implemented RAG and evaluations of queries. These evaluations in turn help route queries between Llama 2 and GPT-4.
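The routing idea can be sketched roughly as follows. This is not Anyscale's implementation: in the talk the routing decision comes from a fine-tuned model, whereas the scoring function below (heuristic_complexity) is a hypothetical keyword-and-length stand-in, and the model names and threshold are placeholder assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Route:
    model: str
    score: float


def heuristic_complexity(query: str) -> float:
    """Hypothetical stand-in for a learned router: returns a score in [0, 1]."""
    signals = ("why", "compare", "debug", "trade-off", "architecture")
    hits = sum(1 for s in signals if s in query.lower())
    length_score = min(len(query.split()) / 50, 1.0)
    return min(1.0, 0.5 * length_score + 0.25 * hits)


def route_query(
    query: str,
    score_fn: Callable[[str], float] = heuristic_complexity,
    threshold: float = 0.5,
    cheap_model: str = "llama-2-70b-finetuned",  # placeholder name
    strong_model: str = "gpt-4",  # placeholder name
) -> Route:
    """Send queries scored at or above `threshold` to the stronger, pricier model."""
    score = score_fn(query)
    model = strong_model if score >= threshold else cheap_model
    return Route(model=model, score=score)


if __name__ == "__main__":
    print(route_query("How do I install Ray?"))
    print(route_query("Compare actors and tasks and explain why my architecture is hard to debug"))
```

Because score_fn is injectable, the heuristic can be swapped for a classifier without touching the routing logic. The threshold is then the knob that trades cost against answer quality.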

Albert Greenberg, VP of Engineering, Uber & Wei Kong, Engineering Management, Uber

Ray Summit Day 2 Keynote

Uber introduced an AI-powered coding assistant into their development tools, trained on their unique codebase, to enhance development speed and the user experience.
Their AI-driven app testing tool, DragonCrawl, leverages generative AI to replace manual tests and improve app quality.
Uber integrates both broad and task-specific LLMs in their AI toolkit.
On Generative AI's impact, they noted: "Generative AI democratizes and benefits almost everyone in the company."
They underscored Generative AI's primary roles in Creation, Summarization, Discovery, and Automation.

M Waleed Kadous, Chief Scientist, Anyscale

Open Source LLMs: Viable for Production or a Low-Quality Toy?

M Waleed Kadous of Anyscale discussed the distinctions between proprietary and open LLMs, their applications, and current gaps.
Anyscale's RayAssistant uses both Llama 2 models and GPT-4, with fine-tuned Llama models directing requests to the most suitable model.
Specialized fine-tuned models can outperform proprietary ones like GPT-3.5 or GPT-4.
Using only GPT-4 would cost about $35,000 per year; smart use of open models reduces this to around $900 per year.
Anyscale offers endpoints for Llama 2 models at $1 per million tokens and plans to release fine-tuning endpoints later this year.
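For a sense of scale, the figures quoted above can be turned into a quick back-of-the-envelope calculation. The two yearly costs and the $1 per million tokens price come from the talk; the helper names and the 2.5 million token example are our own illustrative assumptions.

```python
GPT4_ONLY_YEARLY_USD = 35_000       # quoted yearly cost when using only GPT-4
HYBRID_YEARLY_USD = 900             # quoted yearly cost with smart use of open models
PRICE_PER_MILLION_TOKENS_USD = 1.0  # quoted Anyscale price for Llama 2 endpoints


def cost_ratio(baseline_usd: float, alternative_usd: float) -> float:
    """How many times more expensive the baseline is than the alternative."""
    return baseline_usd / alternative_usd


def savings_fraction(baseline_usd: float, alternative_usd: float) -> float:
    """Share of the baseline cost that is saved by switching."""
    return 1 - alternative_usd / baseline_usd


def token_cost_usd(tokens: int, price_per_million_usd: float) -> float:
    """Cost of processing `tokens` tokens at a flat per-million-token price."""
    return tokens / 1_000_000 * price_per_million_usd


if __name__ == "__main__":
    ratio = cost_ratio(GPT4_ONLY_YEARLY_USD, HYBRID_YEARLY_USD)
    saved = savings_fraction(GPT4_ONLY_YEARLY_USD, HYBRID_YEARLY_USD)
    print(f"GPT-4 only is about {ratio:.1f}x more expensive")  # about 38.9x
    print(f"Savings: {saved:.1%}")                             # 97.4%
    # Illustrative volume only: 2.5 million tokens at the quoted price.
    print(f"${token_cost_usd(2_500_000, PRICE_PER_MILLION_TOKENS_USD):.2f}")  # $2.50
```

In other words, the quoted numbers imply roughly a 39-fold cost reduction, or savings of about 97%.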

GenAI Collective Meetup — Highlights

We attended the GenAI Collective meetup on September 18th in San Francisco. The Werqwise co-working space was packed, and the event was sold out. The meetup also featured a few interesting product demos.

Matt Huang, Knowbl & Ryan Reede, MovieBot

Product demos

Knowbl is a platform that processes your knowledge base and then builds on-site search and an agent assistant on top of it. Their copilot keeps the brand message in check by providing only pre-approved answers.
MovieBot is an application that lets you prompt a conversation between characters you have created in its editor, and then uses a game engine to render a video from it.

Sophie McNaught, Vouch Insurance

Insuring GenAI Products

To drive adoption of their GenAI services, Microsoft and AWS already offer to defend users of their generative AI tools who are sued for copyright infringement.
However, there are many use cases that they don't cover.
Insuring the inherent risks of using generative AI in products is emerging as an area of interest, and it's something founders should monitor closely.