The JavaScript Package Selection Task: A Comparative Experiment Using an LLM-based Approach

0:00 0:00

Download the Audio (Right-click, Save-As)

NPM is a fantastic package manager in a number of ways, but it's really bad at search. To be fair, every other package manager is bad at it too: PIP, Crates, Maven, Homebrew, RPM, etc. They all struggle to give meaningful search-results for a query.

When you have a problem that you need a library to solve, your chances of finding a relevant package might hinge on your ability to guess what that package might be named. The search-bar in NPM returns such irrelevant results, that you'll likely spend your day jumping from search engines, to Github, to blog posts, to youtube videos, to forum discussions. You'll go back and forth to NPM over and over again, trying out different packages to see what fits. You'll spend time reading their docs, checking their Github issues, checking if they're stable, well maintained and recently updated. It’s a pain, to say the least. I’ve always thought that non-programmers would be shocked to see how much of a Software Engineers day is spent trying to find and use new packages. Out of exasperation, many developers just choose the most popular package that seems like it might possibly do the job. Not the package best suited to the problem, not the package with highest test coverage, not the package with the best documentation, or the lowest number of open issues, just the one with the most impressive download graph.

There’s got to be a better way, no? The authors of today's paper say yes, there certainly is. RAG: Retrieval Augmented Generation. With a functioning RAG system a user could just describe their problem in natural language, and receive results that actually make sense. No more scouring package managers for hours; just describe your problem, and get matched with a library. In this paper they describe the process of building this kind of system, and then benchmark it against other options.

This paper is structured as a comparison between four methods of Package Discovery (the process of finding a package that fits your needs).

Manual search: They recruited 21 Software Engineers to participate. Each of them had at least 5 years of experience. These Engineers would attempt to manually find packages that matched specific requirements.
AIDT: In a prior paper the authors had created a search aggregator (aka a meta-search engine) called AIDT. When you enter a query into AIDT, it spawns out and performs your query in multiple places: Google, Bing, NPM etc, then gathers all the results, parses them and returns a ranked list. Internally, AIDT is two components: a retrieval-mechanism that queries the search engines, and a ranking-mechanism that consolidates the results. The ranking-mechanism is based on the Borda Fuse method, which in-essence ranks them by overall popularity.
Zero-Shot LLM: A zero-shot model has a specific definition, but in this paper they’re using that term to describe any LLM that isn’t being used with a RAG technique. Specifically they used ChatGPT, Llama2 and Cohere.
RAG: They curated thousands of Javascript packages from Github, and used them as the dataset for a RAG system.

Each of these four methods produced a set of search results. The authors then assessed the quality of those results (using a method we'll explore later), and then ranked the methods. Before we find out who the winner is, let’s go back and talk about RAG. Understanding RAG is critical to understanding this paper.

If you’ve been an active Software Engineer for the last couple years you’ve undoubtedly heard the term RAG (Retrieval Augmented Generation). But, you might not be clear on how it works. So let’s explore what makes it different from a normal LLM.

Let’s walk through a scenario. Let’s say you go to ChatGPT and type “Hey ChatGPT, tell me the best 5 pizza places within walking distance of me”. ChatGPT will probably respond with a generic list of national chains, or it’ll ask for more information about your location or your pizza tastes. In this situation you realize that ChatGPT doesn't have enough context for the question you're asking, so you go to Google Maps and find all the pizza places around you, and then you copy-paste the restaurant descriptions and all of the reviews into a document. Then you go back to ChatGPT and you type “Hey ChatGPT, here are all the pizza places around me, and all their Google reviews. Please sift through all this context and tell me the best 5 pizza places for New-York style thin crust pizza”. At that point, you’ll probably get back an informed answer that makes sense. The process you just walked through is broadly referred to as "prompt-engineering", but this specific technique is called an “enriched prompt”. To recap: you’re retrieving information from somewhere else, and it is information that is narrowly tailored to your prompt-question. Then you're stating your question and giving ChatGPT all that context at the same time. You retrieved relevant data, and used it to augment your prompt. This allowed the LLM to generate a more meaningful response. See what I did there? Retrieval Augmented Generation. RAG.

In practice, RAG isn't done by hand of course. It's done programmatically by writing a script that pairs an LLM with a database of relevant information. In the previous example, rather than using Google Maps to manually look for restaurants, you would have a database full of pizza places. You'd write a script that would hit your database, pull out results, craft a prompt that incorporates those results, and then send the prompt to an LLM.

But now, imagine you’re not answering a simple question. You need your script to respond to a natural language query instead. Each query describes a problem and your job is return relevant solutions. Luckily, you have a database full of solutions. In order to find the most relevant solution to a query it may be a good idea to store the data as embeddings in a vector database or something similar. If you remember from yesterday’s episode, vector embeddings store both the structure of data and its correlations to other data. A vector database is really just a collection of vector embeddings that you can query in order to receive a list of similar items from the database. These kinds of queries are called a cosine similarity search or a semantic search. So in this example, your query is the description of a problem. You send that query to the vector database and get back all the solutions that are similar to the query.

In the case of today’s research, they took 4,600 Github Repositories (including all of the code, and docs and comments), and ingested them into a storage engine similar to a vector database, as embeddings. Then they setup a RAG system. A query would come in, for example: "quick-sort algorithm", they’d perform a semantic-search, rank the results, then send all that context and the original query to one of the three LLMs. The LLM would generate the final response to the user.

Okay back to the research. Before they could compare the four methods (manual, AIDT, zero-shot, and RAG) they had to get a baseline of results to compare them to. So, for two weeks they had two hand-selected Senior Engineers write down every time they searched for an NPM package, and include the names of the packages they found, which packages were good matches, and which were not. Then those two Engineers teamed up with two of the authors to review and debate the selected packages. That process resulted in a consensus-list; a baseline of ground-truth. The baseline was a list of queries, and for each query they listed and ranked the good results, and the bad results.

Now that they had a baseline, they proceeded with each of the 4 methods. Each method attempted to find packages for 25 separate queries. These included :

Check valid email address
Send SMS
Sentiment Analysis
Captcha authentication
Credit Card validation

…and many more. The group of 21 engineers performed the manual tasks, and the authors ran the other three methods programatically.

Once the data was in, they used mAP (mean average position) and nDCG (Normalized Discounted Cumulative Gain) to see which method produced results most-closely aligned with the baseline. Surprisingly (to me at least), the winner was AIDT. That being said, the LLM-based methods, while producing results less-aligned with the baseline, were able to justify their decisions in much more detail.

What are we to make of this? Is RAG useless? No, in fact quite the opposite. Since AIDT is essentially a search aggregator, it is able to take advantage of processes in the underlying search-engines that may (or may not) be very similar to RAG. We don’t really know how Google or Bing produces its results, but we do know that they’re getting more and more sophisticated over time. So it’s quite possible that this "RAG vs AIDT" shootout was actually one small RAG vs the amalgamated results of several of the most sophisticated RAGs in the world. We don’t really know. It’s also worth noting that since the researchers defined quality as the quality of being similar-to-expert-results, the outcome of this paper may be overweighting the importance of the tools that the experts used. That all being said, it is still useful to know that a tool like AIDT most-faithfully recreates the slow, tedious process that an Engineer goes through when they're trying to find packages.

I’m excited to see where this research goes in the future. As it's trying to solve a problem that I've personally experienced, I would love to take a tool like this for a spin someday. Especially if it gets more sophisticated and is able to take more context into consideration. As tools like these grow and change, I'm sure more papers will be published on the subject. And when they do, you can bet that we'll cover it here.