Leveraging genAI to reduce redundant data in relational databases
With the widespread adoption of generative AI, textual content is increasingly being generated and stored in databases across various applications. From customer service responses to personalised content recommendations, AI-driven text generation is transforming how we interact with data. However, this increase in generated text presents a significant challenge: managing and maintaining database performance. As similar or near-duplicate strings accumulate, they can quickly fill up storage, leading to inefficiencies and performance bottlenecks. In this blog post, we explore how leveraging large language models (LLMs) can help detect and merge near-duplicate strings in your database, streamlining your data and enhancing overall application performance.
Approach
As a working example, let’s take a customer service chatbot that helps customers with various issues by retrieving common, previously LLM-generated responses from a relational database. Different users can describe similar issues in various ways, and as such, the database might contain a lot of responses that all mean the same thing (near-duplicates). The goal is to identify groups of near-duplicate responses in order to merge them into a single response.
Let’s start by defining ‘similarity’ of responses in a quantifiable way. This can be done by transforming the response text into an embedding, its numerical vector representation, using an embedding model (e.g. OpenAI’s text-embedding-ada-002). Embeddings enable us to compute the distance between their corresponding texts, giving us an indication of how similar those texts are. The Euclidean and cosine distances are most commonly used for this. Experimentally, we can determine the right distance thresholds for identifying near-duplicate texts and clustering them into groups to be merged, as sketched below.
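To make this concrete, here is a minimal sketch of such a deduplication pass in Python, using OpenAI’s embeddings API and a simple greedy grouping based on cosine distance. The example responses, the 0.1 distance threshold, and the helper names are illustrative assumptions rather than a definitive implementation; the threshold in particular should be tuned experimentally on your own data.

```python
# Minimal sketch: embed responses and group near-duplicates by cosine distance.
# The `responses` list, the 0.1 threshold and the helper names are illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    """Return one embedding vector per input text."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in resp.data])

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def group_near_duplicates(texts: list[str], threshold: float = 0.1) -> list[list[int]]:
    """Greedily assign each text to the first cluster whose representative
    is within `threshold` cosine distance; otherwise start a new cluster."""
    vectors = embed(texts)
    clusters: list[list[int]] = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine_distance(vec, vectors[cluster[0]]) < threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

responses = [
    "You can reset your password from the account settings page.",
    "To reset your password, go to the settings page of your account.",
    "Our support team is available 24/7 via live chat.",
]
print(group_near_duplicates(responses))  # e.g. [[0, 1], [2]]
```

A greedy single-pass grouping keeps the sketch short; in practice a hierarchical or density-based clustering over the full pairwise distance matrix gives more control over the merge threshold.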
Prior to merging, validating the clusters is recommended to avoid merging clusters whose texts are similar in wording but different in meaning. This can be done manually by a person for a small number of clusters. However, it is not uncommon to have thousands of clusters when working with a large database, which makes manual validation quite time-consuming. In that case, a more pragmatic validation approach can be taken by prompting an LLM to filter out incoherent clusters (e.g. those containing logically contradictory texts). The entire data flow is illustrated in the figure below.
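One possible shape for that automated validation step is to ask an LLM whether all texts in a cluster are semantically interchangeable and to discard clusters it flags as incoherent. The sketch below assumes the OpenAI chat completions API; the model name and prompt wording are illustrative assumptions, not the exact prompt used in practice.

```python
# Minimal sketch of LLM-based cluster validation: clusters whose texts are not
# interchangeable (e.g. contradictory) are filtered out before merging.
# The model name and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def cluster_is_coherent(texts: list[str]) -> bool:
    numbered = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(texts))
    prompt = (
        "The following customer service responses were flagged as near-duplicates.\n"
        "Answer YES if they all convey the same meaning and could be merged into one "
        "response, or NO if any of them differ in meaning or contradict each other.\n\n"
        f"{numbered}\n\nAnswer with YES or NO only."
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return reply.choices[0].message.content.strip().upper().startswith("YES")

# Usage, reusing `clusters` and `responses` from the previous sketch:
# valid_clusters = [c for c in clusters
#                   if cluster_is_coherent([responses[i] for i in c])]
```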
Closing statement
With an increasing number of AI-powered digital products in our portfolio, we have frequently encountered the challenge of managing and maintaining database performance amidst the surge of generated text. By leveraging embeddings to detect and merge database rows with near-duplicate texts, we have trimmed down a significant amount of redundant data, keeping our databases clean and performant and the user experience responsive. Finally, we are investigating further AI techniques to preemptively reduce database bloat by avoiding the addition of similar content in the first place.
See also
Are you looking for an entrepreneurial digital partner? Reach out to hello@panenco.com or schedule a call