At Dashworks we provide a natural language interface that lets our users interact with the information living in the tools they use in their daily work, such as wikis, code repos, and messaging apps. I work on the accuracy of our re-ranker, the component that prioritizes the information most relevant to a user’s query so that they get the most accurate answer to their question. When a query comes in, we cast a wide net across all of their connected apps to find any relevant information, then re-rank the fetched results before synthesizing them into a natural language answer.
The types of questions our users ask our AI range from “How do I deploy frontend changes to staging?” to “Summarize the last email exchange with client X.”
Our approach is unique in that we do not index data at rest and instead make real-time search API calls to a user’s connected apps when they ask Dash AI a question.
We believe this approach enables us to:
1. Reduce setup time by plugging directly into a company's apps without heavyweight indexing.
2. Surface the most up-to-date information at any given point in time.
3. Leverage access controls the customer has already spent valuable time building.
On the flip side, we have fewer guarantees about the relevance of the data we collect from each app. As a result, our re-ranker is the key to surfacing the most relevant results and ensuring we can match or beat the response quality of alternative approaches.
Our re-ranker uses a cross-encoder to figure out how relevant each chunk of data is to the user’s query. In this post I explain how we distilled the capabilities of GPT-4 into our cross-encoder via fine-tuning for almost no cost. We were able to significantly improve the accuracy without sacrificing latency. The lessons from this post will be useful for anyone building a RAG application that needs to adapt a language model to their domain-specific data.
For a quick refresher: a bi-encoder converts chunks of text into vector representations intended to capture semantic meaning. To find the chunks most semantically similar to a query, you encode each chunk independently and then compute the cosine similarity between each chunk’s vector and the encoded query. Bi-encoders are very commonly used to produce the vector embeddings that get stored somewhere like Pinecone.
A cross-encoder on the other hand takes in <span id="code" class="code-text">(query, document)</span> pairs and produces a single score for how similar they are. The benefit over a bi-encoder is that the cross-encoder can apply its attention heads to the query and document at the same time. This in practice produces much higher performance.
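To make the distinction concrete, here is a minimal sketch of both scoring approaches using the sentence-transformers library. The checkpoints named below are public models picked for illustration, not the models we run in production.
[highlight]
<span id="code" class="code-text"># Minimal comparison of bi-encoder vs. cross-encoder scoring.</span><br />
<span id="code" class="code-text"># The checkpoints below are public models chosen for illustration only.</span><br />
<span id="code" class="code-text">from sentence_transformers import SentenceTransformer, CrossEncoder, util</span><br />
<span id="code" class="code-text"></span><br />
<span id="code" class="code-text">query = "How do I deploy frontend changes to staging?"</span><br />
<span id="code" class="code-text">docs = [</span><br />
<span id="code" class="code-text">    "Run make deploy-staging from the frontend repo to push to staging.",</span><br />
<span id="code" class="code-text">    "The holiday party is scheduled for December 15th.",</span><br />
<span id="code" class="code-text">]</span><br />
<span id="code" class="code-text"></span><br />
<span id="code" class="code-text"># Bi-encoder: encode query and documents independently, then compare vectors.</span><br />
<span id="code" class="code-text">bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")</span><br />
<span id="code" class="code-text">query_emb = bi_encoder.encode(query, convert_to_tensor=True)</span><br />
<span id="code" class="code-text">doc_embs = bi_encoder.encode(docs, convert_to_tensor=True)</span><br />
<span id="code" class="code-text">bi_scores = util.cos_sim(query_emb, doc_embs)[0]</span><br />
<span id="code" class="code-text"></span><br />
<span id="code" class="code-text"># Cross-encoder: score each (query, document) pair jointly in one forward pass.</span><br />
<span id="code" class="code-text">cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")</span><br />
<span id="code" class="code-text">ce_scores = cross_encoder.predict([(query, doc) for doc in docs])</span><br />
<span id="code" class="code-text"></span><br />
<span id="code" class="code-text">for doc, bi, ce in zip(docs, bi_scores, ce_scores):</span><br />
<span id="code" class="code-text">    print(f"bi={float(bi):.3f}  cross={float(ce):.3f}  {doc[:50]}")</span><br />
[/highlight]
The bi-encoder lets you pre-compute document vectors once, while the cross-encoder has to see every pair at query time, which is why it typically sits behind a first-stage retriever as a re-ranker.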
Given the criticality of this component in our pipeline, we knew that off-the-shelf open source models wouldn’t be up to par, and that we would need to invest considerable effort in fine-tuning.
We also knew that the biggest hurdle in the fine-tuning process would be collecting adequate data. Each query we process can collect hundreds of data chunks, so the amount of human effort required to label even just a few questions was going to be daunting. On top of that, it would be difficult for a human to actually pick the correct label. Dash AI connects to a wide variety of sources, so some questions are answered from chunks in GitHub, others from Slack, and still others from wikis. Parsing the raw text of code snippets and Slack messages promised to be painstaking.
After reading this article from our friends at Refuel, we realized that cutting-edge LLMs would likely be able to provide labeling accuracy equal to or better than human reviewers.
If we could get it to work, the cost of training and time required would be dramatically reduced. We would also get some nice secondary benefits – such as the ability to eventually train models for our customers without requiring a human to view the data. It would also give us the potential ability to set up fully automated daily training.
That said, there were risks involved, mainly from it being a “turtles all the way down” approach. Any bias in the LLM would be propagated to our cross-encoder. To mitigate this, we built a fun CLI game where we could pick our own label for each chunk before being shown the LLM’s choice. We also added some confidence scores to the labels (more on that later).
In our review sessions we rarely found ourselves disagreeing with the labels output by the final iteration of our LLM labeler. When we did disagree, we usually found that the mistake was on our end.
We tried a few different approaches with the labeler.
First, we abstracted out an interface for the LLM calls that needed to provide two functions: <span id="code" class="code-text">complete_next_message()</span> and <span id="code" class="code-text">call_function()</span>. The latter could be implemented with OpenAI GPT-style function calls, or by limiting the LLM to producing some structured output, such as a single <span id="code" class="code-text">true/false</span> token. This let us plug and play different LLMs and iterate quickly.
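In rough terms, the interface looked something like the sketch below. Only the two method names come from the real implementation; the class name, type hints, and docstrings are our paraphrase here.
[highlight]
<span id="code" class="code-text"># Rough sketch of the labeler's LLM abstraction. Only the two method names</span><br />
<span id="code" class="code-text"># (complete_next_message, call_function) mirror the real code; the rest is illustrative.</span><br />
<span id="code" class="code-text">from abc import ABC, abstractmethod</span><br />
<span id="code" class="code-text">from typing import Any</span><br />
<span id="code" class="code-text"></span><br />
<span id="code" class="code-text">class LabelerLLM(ABC):</span><br />
<span id="code" class="code-text">    @abstractmethod</span><br />
<span id="code" class="code-text">    def complete_next_message(self, messages: list[dict]) -> str:</span><br />
<span id="code" class="code-text">        """Return the model's free-form reply to a chat-style message list."""</span><br />
<span id="code" class="code-text"></span><br />
<span id="code" class="code-text">    @abstractmethod</span><br />
<span id="code" class="code-text">    def call_function(self, messages: list[dict], functions: list[dict]) -> dict[str, Any]:</span><br />
<span id="code" class="code-text">        """Force the model to produce structured output (e.g. a call like</span><br />
<span id="code" class="code-text">        is_relevant(True)) and return the parsed arguments as a dict."""</span><br />
[/highlight]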
Next we assessed the labeler by getting it to directly call the function <span id="code" class="code-text">is_relevant()</span>. In practice this looked something like:
[highlight]
<span id="code" class="code-text">"""</span><br />
<span id="code" class="code-text">Given the following query & document, decide whether or not the document</span><br />
<span id="code" class="code-text">is useful for answering the query.</span><br />
<span id="code" class="code-text">Your output should look like:</span><br />
<span id="code" class="code-text">is_relevant(True)</span><br />
<span id="code" class="code-text">"""</span><br />
[/highlight]
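Wired through an interface like the one above, the single-call version of the labeler might look roughly like the following. The function schema and prompt text are illustrative, not our production versions.
[highlight]
<span id="code" class="code-text"># Single-call labeling: the model must answer with is_relevant(...) directly.</span><br />
<span id="code" class="code-text"># Schema and prompt are illustrative; LabelerLLM is the sketch from earlier.</span><br />
<span id="code" class="code-text">IS_RELEVANT_FN = {</span><br />
<span id="code" class="code-text">    "name": "is_relevant",</span><br />
<span id="code" class="code-text">    "description": "Report whether the document is useful for answering the query.",</span><br />
<span id="code" class="code-text">    "parameters": {</span><br />
<span id="code" class="code-text">        "type": "object",</span><br />
<span id="code" class="code-text">        "properties": {"relevant": {"type": "boolean"}},</span><br />
<span id="code" class="code-text">        "required": ["relevant"],</span><br />
<span id="code" class="code-text">    },</span><br />
<span id="code" class="code-text">}</span><br />
<span id="code" class="code-text"></span><br />
<span id="code" class="code-text">def label_direct(llm: "LabelerLLM", query: str, chunk: str) -> bool:</span><br />
<span id="code" class="code-text">    messages = [{</span><br />
<span id="code" class="code-text">        "role": "user",</span><br />
<span id="code" class="code-text">        "content": (</span><br />
<span id="code" class="code-text">            "Given the following query & document, decide whether or not the "</span><br />
<span id="code" class="code-text">            f"document is useful for answering the query.\n\nQuery: {query}\n\nDocument: {chunk}"</span><br />
<span id="code" class="code-text">        ),</span><br />
<span id="code" class="code-text">    }]</span><br />
<span id="code" class="code-text">    args = llm.call_function(messages, functions=[IS_RELEVANT_FN])</span><br />
<span id="code" class="code-text">    return bool(args["relevant"])</span><br />
[/highlight]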
When compared to hand labels, this produced poor results. It was also often difficult to understand where the LLM had gone wrong.
To help with debugging, we tried adding a <span id="code" class="code-text">remarks</span> parameter to our function call to get the model to output its rationale. This improved performance and made debugging much easier. However, we soon realized that we were inadvertently performing in-context chain-of-thought computation, similar to the popular ReAct prompting technique. To make this explicit, we then broke the labeler out into two calls:
1. A free-form call asking the model to explain whether (and why) the document is useful for answering the query.
2. A structured call that, with that explanation in context, outputs the final <span id="code" class="code-text">is_relevant</span> label.
The bet was that this technique would produce better results due to the autoregressive nature of transformer-based models. Essentially, the model produces one token, adds that to the existing context (prompt) and then produces another token and keeps going until a stop token is produced. Therefore, in the function call approach we were getting the model to make a decision within a single token, whereas with the chained approach it was deciding the appropriate label by aggregating all of the tokens in the explanation prose.
Said another way, the next token an LLM produces is dependent on the tokens it has already produced. Therefore, by encouraging the LLM via prompting to produce a diversity of “thought,” you could ensure a single incorrectly produced token did not steer the entire output in the wrong direction.
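Concretely, the chained labeler ends up looking roughly like the sketch below. Again, the prompts are paraphrased, and <span id="code" class="code-text">IS_RELEVANT_FN</span> is the illustrative schema from the earlier sketch.
[highlight]
<span id="code" class="code-text"># Two-call "explain, then decide" labeling. Prompt wording is illustrative.</span><br />
<span id="code" class="code-text">def label_chained(llm: "LabelerLLM", query: str, chunk: str) -> bool:</span><br />
<span id="code" class="code-text">    # Call 1: free-form reasoning about whether the chunk helps answer the query.</span><br />
<span id="code" class="code-text">    explain_messages = [{</span><br />
<span id="code" class="code-text">        "role": "user",</span><br />
<span id="code" class="code-text">        "content": (</span><br />
<span id="code" class="code-text">            "Explain step by step whether the following document contains "</span><br />
<span id="code" class="code-text">            f"information useful for answering the query.\n\nQuery: {query}\n\nDocument: {chunk}"</span><br />
<span id="code" class="code-text">        ),</span><br />
<span id="code" class="code-text">    }]</span><br />
<span id="code" class="code-text">    explanation = llm.complete_next_message(explain_messages)</span><br />
<span id="code" class="code-text"></span><br />
<span id="code" class="code-text">    # Call 2: with the explanation in context, force a single structured label.</span><br />
<span id="code" class="code-text">    decide_messages = explain_messages + [</span><br />
<span id="code" class="code-text">        {"role": "assistant", "content": explanation},</span><br />
<span id="code" class="code-text">        {"role": "user", "content": "Based on your reasoning, call is_relevant with your final answer."},</span><br />
<span id="code" class="code-text">    ]</span><br />
<span id="code" class="code-text">    args = llm.call_function(decide_messages, functions=[IS_RELEVANT_FN])</span><br />
<span id="code" class="code-text">    return bool(args["relevant"])</span><br />
[/highlight]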
This approach produced the best results. It also had a couple of other nice properties: the intermediate explanation doubled as an audit trail we could read whenever a label looked suspicious, and the structured final call gave us a natural place to attach the confidence scores mentioned earlier.
We dogfood our product internally as much as possible, both because it’s the best way to expose its flaws and because it often produces some pretty magical moments when it pulls an answer from somewhere totally unexpected (and saves me an hour of furious searching).
As a result, we have a big repository of questions to pick from and a captive audience (aka employees) who can provide us the answer they were expecting to receive.
We split this question set into the typical train / test / eval split, pulled the relevant chunks in the same way we would internally, and then let the labeler rip.
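The split itself is nothing special; the one detail worth calling out is that we split at the question level, which keeps all of a query’s chunks on the same side of each boundary. A minimal sketch, with illustrative 80/10/10 proportions:
[highlight]
<span id="code" class="code-text"># Split by question (not by chunk) so a query's chunks never leak across splits.</span><br />
<span id="code" class="code-text"># The 80/10/10 proportions here are illustrative.</span><br />
<span id="code" class="code-text">import random</span><br />
<span id="code" class="code-text"></span><br />
<span id="code" class="code-text">def split_questions(questions: list[str], seed: int = 42):</span><br />
<span id="code" class="code-text">    rng = random.Random(seed)</span><br />
<span id="code" class="code-text">    shuffled = questions[:]</span><br />
<span id="code" class="code-text">    rng.shuffle(shuffled)</span><br />
<span id="code" class="code-text">    n_train = int(0.8 * len(shuffled))</span><br />
<span id="code" class="code-text">    n_test = int(0.9 * len(shuffled))</span><br />
<span id="code" class="code-text">    return shuffled[:n_train], shuffled[n_train:n_test], shuffled[n_test:]</span><br />
[/highlight]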
To fully label the ~100k <span id="code" class="code-text">(query, chunk)</span> pairs took ~one day and cost ~$100. (I know those numbers sound made up given how round they are. It’s because I’m being liberal in my rounding. In reality it was 85k chunks, cost $93 and actually took 30 hours. But that doesn’t sound as nice, does it?)
Once we had the labels produced, the fine-tuning part was fairly trivial. We followed an in-depth LlamaIndex tutorial to put together most of the code. Then, we just needed to tinker with how the data was loaded to get a working job up and running.
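Under the hood, this kind of cross-encoder fine-tune is typically driven by sentence-transformers; a stripped-down equivalent looks roughly like the sketch below. The base checkpoint and hyper-parameters are placeholders, not our production settings.
[highlight]
<span id="code" class="code-text"># Minimal cross-encoder fine-tune on LLM-produced labels.</span><br />
<span id="code" class="code-text"># Base checkpoint and hyper-parameters are placeholders.</span><br />
<span id="code" class="code-text">from torch.utils.data import DataLoader</span><br />
<span id="code" class="code-text">from sentence_transformers import CrossEncoder, InputExample</span><br />
<span id="code" class="code-text"></span><br />
<span id="code" class="code-text">def finetune(labeled_pairs, output_path="finetuned-reranker"):</span><br />
<span id="code" class="code-text">    # labeled_pairs: iterable of (query, chunk, label) with label in {0.0, 1.0}</span><br />
<span id="code" class="code-text">    train_samples = [</span><br />
<span id="code" class="code-text">        InputExample(texts=[query, chunk], label=label)</span><br />
<span id="code" class="code-text">        for query, chunk, label in labeled_pairs</span><br />
<span id="code" class="code-text">    ]</span><br />
<span id="code" class="code-text">    train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=16)</span><br />
<span id="code" class="code-text"></span><br />
<span id="code" class="code-text">    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", num_labels=1)</span><br />
<span id="code" class="code-text">    model.fit(</span><br />
<span id="code" class="code-text">        train_dataloader=train_dataloader,</span><br />
<span id="code" class="code-text">        epochs=2,</span><br />
<span id="code" class="code-text">        warmup_steps=100,</span><br />
<span id="code" class="code-text">        output_path=output_path,</span><br />
<span id="code" class="code-text">    )</span><br />
<span id="code" class="code-text">    return model</span><br />
[/highlight]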
To evaluate the model we applied two techniques.
First, we calculated chunk-level ranking metrics and produced the standard MAP (mean average precision), MAP@k (MAP within a top k window), and DCG (discounted cumulative gain).
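These can all be computed per query and then averaged. Here is a minimal sketch with scikit-learn, where MAP@k is approximated by restricting each query to its top-k ranked chunks:
[highlight]
<span id="code" class="code-text"># Per-query ranking metrics averaged across queries (a minimal sketch).</span><br />
<span id="code" class="code-text">import numpy as np</span><br />
<span id="code" class="code-text">from sklearn.metrics import average_precision_score, dcg_score</span><br />
<span id="code" class="code-text"></span><br />
<span id="code" class="code-text">def ranking_metrics(per_query, k=10):</span><br />
<span id="code" class="code-text">    # per_query: list of (labels, scores) pairs, one per query, where labels</span><br />
<span id="code" class="code-text">    # are 0/1 relevance judgments and scores are the re-ranker outputs.</span><br />
<span id="code" class="code-text">    ap, ap_at_k, dcg = [], [], []</span><br />
<span id="code" class="code-text">    for labels, scores in per_query:</span><br />
<span id="code" class="code-text">        labels, scores = np.asarray(labels), np.asarray(scores)</span><br />
<span id="code" class="code-text">        if labels.sum() == 0:</span><br />
<span id="code" class="code-text">            continue  # no relevant chunks for this query; skip it</span><br />
<span id="code" class="code-text">        ap.append(average_precision_score(labels, scores))</span><br />
<span id="code" class="code-text">        top_k = np.argsort(-scores)[:k]  # MAP@k: only consider the top-k ranked chunks</span><br />
<span id="code" class="code-text">        if labels[top_k].sum() > 0:</span><br />
<span id="code" class="code-text">            ap_at_k.append(average_precision_score(labels[top_k], scores[top_k]))</span><br />
<span id="code" class="code-text">        dcg.append(dcg_score([labels], [scores]))</span><br />
<span id="code" class="code-text">    return {"MAP": np.mean(ap), f"MAP@{k}": np.mean(ap_at_k), "DCG": np.mean(dcg)}</span><br />
[/highlight]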
Second, we kept a holdout set of questions with an expected answer. We ran our full answer pipeline and compared the results to produce what I coined an answer “coherence rate.” We used another call to an LLM to compare the answers (a sketch of that judge call follows the list below). To be coherent, the generated answer had to:
1. Contain all significant information from the expected answer
2. Not introduce any conflicting information
3. Not decrease the understandability of the answer.
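A minimal version of that judge call might look like the following; the prompt wording and model choice are illustrative, not our production setup.
[highlight]
<span id="code" class="code-text"># LLM-as-judge for the answer "coherence rate." Prompt and model are illustrative.</span><br />
<span id="code" class="code-text">from openai import OpenAI</span><br />
<span id="code" class="code-text"></span><br />
<span id="code" class="code-text">client = OpenAI()</span><br />
<span id="code" class="code-text"></span><br />
<span id="code" class="code-text">JUDGE_PROMPT = """Compare the generated answer to the expected answer.</span><br />
<span id="code" class="code-text">The generated answer is COHERENT only if it:</span><br />
<span id="code" class="code-text">1. contains all significant information from the expected answer,</span><br />
<span id="code" class="code-text">2. does not introduce any conflicting information, and</span><br />
<span id="code" class="code-text">3. does not make the answer harder to understand.</span><br />
<span id="code" class="code-text">Reply with exactly one word: COHERENT or INCOHERENT.</span><br />
<span id="code" class="code-text"></span><br />
<span id="code" class="code-text">Expected answer:</span><br />
<span id="code" class="code-text">{expected}</span><br />
<span id="code" class="code-text"></span><br />
<span id="code" class="code-text">Generated answer:</span><br />
<span id="code" class="code-text">{generated}"""</span><br />
<span id="code" class="code-text"></span><br />
<span id="code" class="code-text">def is_coherent(expected: str, generated: str) -> bool:</span><br />
<span id="code" class="code-text">    resp = client.chat.completions.create(</span><br />
<span id="code" class="code-text">        model="gpt-4o",</span><br />
<span id="code" class="code-text">        messages=[{"role": "user", "content": JUDGE_PROMPT.format(expected=expected, generated=generated)}],</span><br />
<span id="code" class="code-text">        temperature=0,</span><br />
<span id="code" class="code-text">    )</span><br />
<span id="code" class="code-text">    return resp.choices[0].message.content.strip().upper().startswith("COHERENT")</span><br />
[/highlight]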
Once we had an eval we were reasonably confident in, we performed a grid search over base models, hyper-parameters, and just about anything else that could have had an effect on the model’s accuracy.
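Conceptually the grid was just a cross-product of configurations, each trained and scored the same way. In the sketch below the parameter values are made up, and <span id="code" class="code-text">train_cross_encoder()</span> and <span id="code" class="code-text">score_eval_set()</span> are hypothetical helpers standing in for the fine-tuning and eval steps above.
[highlight]
<span id="code" class="code-text"># Exhaustive sweep over base models and hyper-parameters (values are made up).</span><br />
<span id="code" class="code-text"># train_cross_encoder() and score_eval_set() are hypothetical helpers.</span><br />
<span id="code" class="code-text">from itertools import product</span><br />
<span id="code" class="code-text"></span><br />
<span id="code" class="code-text">BASE_MODELS = ["cross-encoder/ms-marco-MiniLM-L-6-v2", "BAAI/bge-reranker-base"]</span><br />
<span id="code" class="code-text">EPOCHS = [1, 2, 4]</span><br />
<span id="code" class="code-text">LEARNING_RATES = [1e-5, 2e-5, 5e-5]</span><br />
<span id="code" class="code-text"></span><br />
<span id="code" class="code-text">results = []</span><br />
<span id="code" class="code-text">for base_model, epochs, lr in product(BASE_MODELS, EPOCHS, LEARNING_RATES):</span><br />
<span id="code" class="code-text">    model = train_cross_encoder(base_model, epochs=epochs, lr=lr)</span><br />
<span id="code" class="code-text">    metrics = ranking_metrics(score_eval_set(model))</span><br />
<span id="code" class="code-text">    results.append(((base_model, epochs, lr), metrics["MAP"]))</span><br />
<span id="code" class="code-text"></span><br />
<span id="code" class="code-text">best_config, best_map = max(results, key=lambda item: item[1])</span><br />
[/highlight]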
By the time we landed on our final iteration, we saw very nice gains in both evals: MAP jumped by 83% and our coherence rate jumped 10 percent on the e2e eval.
Of course, the fun doesn’t just end with eval numbers. We are currently monitoring the production deployment of the fine-tuned model and seeing equally nice improvements in the production quality metrics we track.
If you found this interesting and want to check out the real thing, head over to www.dashworks.ai. Interested in working with us? Check out our open roles.