
Building a prompt engineering playground in 2 days with Retool

May 14, 2024
Kieran Taylor

At Dashworks, we answer users’ work-related questions with information retrieved from the tools they use daily in their jobs, such as Slack, Google Drive, and Notion. We use a mixture of self-hosted, open-source and proprietary SaaS language models to deliver these answers in natural language. How this disparate information is presented to the language models (i.e., how we construct our prompts) can massively impact response accuracy, so we wanted to build a playground that makes it easy to tinker with the prompts we use.

We first checked out tools available on the market today. We looked everywhere, from solutions by leading LLM providers to startups and OSS platforms. We ended up building our tool on the Retool platform and couldn’t be happier with the choice.

Goals and Requirements

Our main requirement when building this tool was deep integration with our existing internal tooling systems. We extensively dogfood our product and use issues we encounter as the driver for triaging projects. If we’re not happy with aspects of our product, there’s a good chance a portion of our user base won’t be happy either.

We previously built a Retool dashboard that lets us see debug logs for our own queries so we can quickly identify poorly performing parts of the pipeline. Our goal was to enable the following flow:

  • A teammate flags a poorly performing query they hit while dogfooding Dashworks
  • The on-call reviews their session and identifies a poor LLM interaction
  • The on-call experiments with prompt changes to see whether they resolve the issue
  • The tuned prompt is tested against our standard evaluations, and the change is incorporated into production

When we evaluated the current tooling, we found it lacking in a few ways, all involving ease of integration with our existing systems:

  • All required static data files for running against multiple parameterized inputs, e.g. we would have to upload a CSV of the session data via some UI
  • None could easily integrate with our downstream automated scoring systems, e.g. “Did this answer improve on the last answer?”
  • None integrated with our domain-specific parameter tweaking, e.g., most expose “model temperature,” but we couldn’t control “max document chunk length” (see the sketch after this list)
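
To make that gap concrete, here is a minimal sketch of the kind of run configuration we wanted to control. The field names are illustrative, not our actual schema:

    from dataclasses import dataclass

    # Hypothetical sketch of a prompt-run configuration; field names are
    # illustrative, not our actual schema.
    @dataclass
    class PromptRunConfig:
        session_id: str                        # pull live session data instead of uploading a CSV
        system_prompt: str                     # the prompt text under experimentation
        temperature: float = 0.2               # the knob most playgrounds expose...
        max_document_chunk_length: int = 1500  # ...and the domain-specific one they don't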

In the past, I have been highly skeptical of low-code/no-code solutions, but given our team’s prior experience working with Retool, we sketched out what our version would look like and set aside a few days to make it happen.

What we built

So what does a typical interaction with this tool look like?

Let’s start by asking a question that will help me write this blog post:

What were the considerations that went into building our prompt engineering toolkit?

I can then find this session in a connected Retool dashboard and open it in our prompt engineering toolkit, where we can start tweaking the prompt and re-running the answer generation. This was the original answer:

The considerations that went into building our prompt engineering toolkit were multifaceted, focusing on deep integration with existing internal systems, ease of iteration on prompts, and the ability to conduct thorough evaluations of prompt changes. Here are the key points:

Deep Integration: The toolkit was designed to integrate deeply with our internal tooling systems, allowing us to use issues we encounter as a primary driver for triaging projects [1].

Evaluation: The toolkit needed to support evaluations that could compute point-wise metrics for our answer pipeline, such as whether an answer satisfied a query or conformed to a style guide. This was to ensure that any prompt changes did not introduce regressions and improved the user experience.

Integration with Automated Scoring Systems: The toolkit had to integrate with automated scoring systems to assess whether an answer improved upon the last [1].

Domain-Specific Parameter Tweaking: The ability to tweak domain-specific parameters, such as "max document chunk length," was a requirement, as most available tools did not offer this level of control [1].

Golden Dataset: A curated golden dataset specific to the answer generation component was considered to test different scenarios and improve the tooling based on common failure modes identified over time [2].

Existing Tooling: The decision to build in-house tooling was driven by the need for secure integration with our retrieval pipeline and the ability to replicate the most useful features of existing applications without being tightly coupled to their tooling [2].

Future Evolution: Considerations were also made for future versions of the application, such as supporting agentic interactions and configurable styles, which would require the toolkit to handle a wider range of responses [2].

These considerations were aimed at creating a prompt engineering toolkit that was not only functional and integrated but also flexible and capable of evolving with the needs of our product and users [1 2].

The answer we generated was already pretty great, so for illustration purposes, I will tweak the prompt by adding “Make sure you answer as sarcastically as possible” to the end of the system prompt.
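
In code terms, this tweak is nothing more than appending the instruction to the end of the system prompt before re-running answer generation. A minimal sketch (names are hypothetical):

    SARCASM_INSTRUCTION = "Make sure you answer as sarcastically as possible"

    def tweak_system_prompt(base_prompt: str) -> str:
        # Append the stylistic instruction to the end of the system prompt,
        # leaving the rest of the pipeline untouched.
        return f"{base_prompt}\n{SARCASM_INSTRUCTION}"

Re-running answer generation with the tweaked prompt produced: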

Oh, let me put on my detective hat and dive into the sea of information to unearth the treasures of knowledge about our prompt engineering toolkit. Here's what I found:

Integration with Internal Systems: The toolkit was designed for deep integration with existing internal tooling systems, which was a primary driver for its development. The team wanted to ensure that any issues encountered during dogfooding could be quickly addressed, reflecting a commitment to product quality and user satisfaction [1].

Ease of Use: The goal was to create a playground for easy tinkering with prompts. This included the ability to quickly recreate input conditions, refine prompts, and test changes across a set of known queries. The toolkit was built to support non-technical stylistic guidance and convert it into effective prompts [2].

Automated Evaluations: The toolkit was designed to support automated evaluations, such as pointwise and pairwise comparisons, to ensure that changes in prompt construction did not lead to regressions in response quality. This was complemented by internal A/B testing for a comprehensive understanding of user experience impacts [2].

Customization and Control: The team wanted to avoid the limitations of existing tools, which often required static data files and did not integrate well with automated scoring systems or allow for domain-specific parameter tweaking. The toolkit was built to provide more control over these aspects [1].

Future-Proofing: Considerations were made for how the application might evolve, such as supporting agentic interactions and configurable styles, which would require the toolkit to handle a wider range of responses [2].

Efficiency: The toolkit aimed to streamline the process of prompt engineering, making it faster and more efficient to iterate on prompt changes and regression test against a variety of queries [2].

Collaboration: The toolkit was designed to facilitate collaboration among team members, allowing for quick identification and resolution of poorly performing queries.

These considerations reflect a thoughtful approach to building a tool that is deeply integrated, user-friendly, and adaptable to future needs, all while maintaining a high standard of response quality and efficiency [1 2 3].

This is better, but I wouldn’t call it exactly sarcastic. Let’s try again with some more prompt fiddling.

Oh, let me enlighten you with the grand saga of our prompt engineering toolkit's creation. It's not like you were involved or anything, right? Here's the lowdown:

Deep Integration: The toolkit was designed to play nice with our existing internal systems. We love our own product so much that we use it to find its own flaws - talk about being self-critical [1].

Ease of Tinkering:It's like a playground for prompts. You can mess around with them until they behave just the way we want, which is crucial for tweaking those pesky LLM interactions [1 2].

Automated Scoring Systems: We wanted to know if our new answers were actually better than the old ones. None of the existing tools could tell us that, so we made our own [1].

Domain-Specific Tweaking: Because who doesn't love a bit of customization? We wanted to control things like "max document chunk length," not just the "model temperature" [1].

Retool to the Rescue: We chose Retool for building our toolkit because it's like a Swiss Army knife for developers. Plus, we had a good time with it before [1].

Evaluation Flow: We've got a fancy flowchart and everything. It starts with identifying a subpar response and ends with a shiny new prompt going live. It's like a beauty pageant for prompts, where only the best make it to production [2].

Golden Dataset: We've got a special set of queries to test our prompts against, because using the same old dataset is just boring. This way, we keep things fresh and challenging [2]

In-House Tooling: We're not about that outsourcing life. We built our own tooling to keep things tight and secure with our retrieval pipeline [2].

So, in summary, we wanted a toolkit that was as flexible and integrated as our own yoga instructor, with all the bells and whistles to make prompt engineering less of a chore and more of a delight [1 2].

That’s more like it!

The next step is to take those changes and run a full evaluation to see how our new sarcastic prompt does against our benchmarks. Ideally, this would be one click from the tool, and it’s on our roadmap; for now, the handy prebuilt Retool copy-text components suffice.
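
As a sketch of what that one-click evaluation looks like, the loop below runs a candidate prompt over our golden dataset and asks a pairwise judge whether each new answer improved on the old one. Here, generate_answer and pairwise_judge stand in for our real answer pipeline and automated scoring system; both names are hypothetical:

    from typing import Callable

    def evaluate_prompt(
        system_prompt: str,
        golden_queries: list[str],
        generate_answer: Callable[..., str],
        pairwise_judge: Callable[[str, str, str], str],
    ) -> float:
        # Return the fraction of golden queries where the candidate prompt wins.
        wins = 0
        for query in golden_queries:
            baseline = generate_answer(query)  # current production prompt
            candidate = generate_answer(query, system_prompt=system_prompt)
            # Pairwise comparison: "did this answer improve on the last answer?"
            if pairwise_judge(query, baseline, candidate) == "candidate":
                wins += 1
        return wins / len(golden_queries)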

Extra goodies

Retool let us implement a couple of fun extras with minimal effort.

Keyboard shortcuts help to satiate my inner vim user, who always asks for a way to eliminate the mouse entirely. We currently have shortcuts for:

  • Navigation in and between each section
  • Basic vim motions like gg for “go to top” and G for “go to end.”
  • Quick actions like re-running the prompt
  • A shortcut (?) that opens a list of all available shortcuts

JSON explorer components expose a full view of semi-structured data. For example, if the user wants to look at the complete session data, adding a modal that directly exposes the data is really easy.

Consistent brand colors and typography. Everything in Retool conforms to a standard color swatch, so it was easy to plug in our brand colors and give the tool a familiar feel for everyone working with it.

Challenges

There were a couple of frustrations that, if fixed, would be a big quality-of-life improvement:

  • The Retool code editor is great for quickly wiring up your components, but it is still very basic: no syntax highlighting, some quirky state issues, and, worst of all, no vim bindings (semi-kidding on the last one).
  • The new list view components offer a virtualized view that significantly improves render time, but it makes interacting with list elements from outside the component challenging. For example, we wanted a “reset all text boxes” button that would restore the standard prompt. Because the virtual list view owned our prompt elements, that was almost impossible; instead, we added a slightly strange warning-symbol button inside each prompt message that resets its text input individually, since the reset action has to be emitted from within the virtual context.
  • Writing results directly back to BigQuery is harder than it should be, so we treated the BigQuery connector as read-only and instead built a lightweight endpoint on an API we host inside our VPC to proxy writes to BigQuery (a sketch of this pattern follows this list).
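
For the curious, the proxy pattern is straightforward. Here is a minimal sketch assuming FastAPI and the google-cloud-bigquery client; the route and table names are illustrative, not our actual setup:

    from fastapi import FastAPI, HTTPException
    from google.cloud import bigquery

    app = FastAPI()
    bq = bigquery.Client()

    @app.post("/prompt-results")
    def write_prompt_result(row: dict):
        # Retool calls this endpoint instead of writing to BigQuery directly,
        # so the BigQuery connector itself stays read-only.
        errors = bq.insert_rows_json("our_project.prompt_eval.results", [row])
        if errors:
            raise HTTPException(status_code=500, detail=str(errors))
        return {"status": "ok"}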

Summary

While not perfect, the experience of building with Retool was quite delightful. I have built many front-ends in the past with plain HTML/CSS/JS, React, Flutter, and a variety of other frameworks. While these still offer far more flexibility for user-facing applications, I wouldn’t hesitate to use Retool again for purpose-built, internal-facing applications where functionality and time to implementation are the biggest motivators.
