Building a Multi-Source Search Engine with LangChain Agents

Overview

With so much information spread across different sites, a search engine that can query several sources at once is genuinely useful. This post shows how to build one using LangChain, Wikipedia, Arxiv, and a custom retriever tool. The system is orchestrated by an AI agent backed by the Llama3-8b-8192 model (Meta’s Llama 3 8B, served through Groq), which decides which source to query for each question.

1. Importing Necessary Libraries

To get started, we import the required libraries for integrating search functionalities from Wikipedia and Arxiv.

from langchain_community.tools import ArxivQueryRun, WikipediaQueryRun
from langchain_community.utilities import WikipediaAPIWrapper, ArxivAPIWrapper

2. Setting Up Wikipedia Tool

We create a Wikipedia API wrapper that returns only the top result (top_k_results=1) and caps the returned text at 250 characters (doc_content_chars_max=250).

api_wrapper_wiki = WikipediaAPIWrapper(top_k_results=1, doc_content_chars_max=250)
wiki = WikipediaQueryRun(api_wrapper=api_wrapper_wiki)
print(wiki.name)  # Outputs: 'wikipedia'

This tool enables quick access to summarized Wikipedia content.
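To make the two wrapper parameters concrete, here is a rough stand-in written in plain Python. It is not LangChain's implementation: the helper summarize_hits and the sample data are hypothetical, and the sketch only mimics the effect of keeping the first top_k_results hits and cutting the combined text at doc_content_chars_max characters.

```python
def summarize_hits(hits, top_k_results=1, doc_content_chars_max=250):
    """Hypothetical helper: keep the first k hits, then cap the total length."""
    kept = hits[:top_k_results]
    # Join each kept hit as a small "title + summary" block.
    text = "\n\n".join(f"Page: {title}\nSummary: {summary}" for title, summary in kept)
    # Hard cut at the character limit, which may end mid-sentence.
    return text[:doc_content_chars_max]


# Made-up sample data, used only for illustration.
sample_hits = [
    ("Machine learning", "Machine learning is a field of study in artificial intelligence. " * 10),
    ("Deep learning", "Deep learning is a subset of machine learning."),
]

result = summarize_hits(sample_hits)
print(len(result))
```

A limit this small keeps the agent's context short, at the cost of answers that may be cut off mid-sentence; raising doc_content_chars_max trades tokens for detail.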

3. Setting Up Arxiv Tool

Arxiv is a repository for research papers. We configure an Arxiv API wrapper to fetch concise results from scientific articles.

api_wrapper_arxiv = ArxivAPIWrapper(top_k_results=1, doc_content_chars_max=250)
arxiv = ArxivQueryRun(api_wrapper=api_wrapper_arxiv)
print(arxiv.name)  # Outputs: 'arxiv'

Both Wikipedia and Arxiv tools are combined into a list:

tools = [wiki, arxiv]

4. Creating a Custom Retriever Tool

To extend the search beyond public sources, we add a custom retriever tool backed by a FAISS vector store. The idea is that any query related to LangSmith should be routed to this tool.

from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

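We load the LangSmith documentation page, split it into chunks of up to 1,000 characters with a 200-character overlap, embed the chunks with OpenAIEmbeddings, and expose the resulting FAISS index as a retriever. OpenAIEmbeddings reads OPENAI_API_KEY from the environment, so if the key lives in a .env file, call load_dotenv() (shown in step 5) before running this code.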

loader = WebBaseLoader("https://docs.smith.langchain.com/")
docs = loader.load()
documents = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).split_documents(docs)
vectordb = FAISS.from_documents(documents, OpenAIEmbeddings())
retriever = vectordb.as_retriever()
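The effect of chunk_size and chunk_overlap can be sketched with a plain sliding window. This is a simplified illustration, not the algorithm RecursiveCharacterTextSplitter uses (the real splitter prefers to break on separators such as paragraphs and sentences); chunk_text is a hypothetical helper.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Simplified fixed-window splitter (hypothetical, not LangChain's algorithm)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# 2,500 characters of dummy text.
text = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(text)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

The last 200 characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still fully contained in at least one chunk.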

We then wrap the retriever as a tool and add it to the tool list. The tool's description is what tells the agent when to use it:

from langchain.tools.retriever import create_retriever_tool
retriever_tool = create_retriever_tool(retriever, "langsmith-search", "Search any information about Langsmith")
tools.append(retriever_tool)
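Under the hood, the retriever behind this tool ranks stored chunks by how close their embeddings are to the query embedding. The toy sketch below shows that idea with standard-library code only. Everything in it is an assumption for illustration: word counts stand in for OpenAIEmbeddings, a sorted list stands in for the FAISS index, and cosine similarity is used as the metric (the metric of a real FAISS index depends on how it is configured).

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Made-up chunks, used only for illustration.
chunks = [
    "LangSmith helps you trace and evaluate LLM applications",
    "FAISS is a library for efficient similarity search",
    "Wikipedia is a free online encyclopedia",
]

top = retrieve("how do I trace LLM applications", chunks)
print(top[0])
```

Real embeddings capture meaning rather than exact word overlap, which is why the actual retriever can match a question to a chunk even when they share few words.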

5. Setting Up the AI Model and Agent

We configure ChatGroq as the AI model and load environment variables for API keys.

from langchain_groq import ChatGroq
from dotenv import load_dotenv
import os

# Load GROQ_API_KEY and OPENAI_API_KEY from a local .env file into the environment.
load_dotenv()
groq_api_key = os.getenv("GROQ_API_KEY")
# OpenAIEmbeddings reads OPENAI_API_KEY from the environment, so no explicit assignment is needed.
llm = ChatGroq(groq_api_key=groq_api_key, model_name="Llama3-8b-8192")
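Both keys have to be present before the model and the embeddings are created, and a missing key usually surfaces later as a less obvious authentication error. A small check like the following can catch that early. It is only a sketch: missing_keys is a hypothetical helper, and the key names are the two used in this post.

```python
import os

REQUIRED_KEYS = ("GROQ_API_KEY", "OPENAI_API_KEY")


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]


missing = missing_keys(os.environ)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing the environment in as an argument keeps the helper easy to test with a plain dictionary.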

6. Creating the AI Agent

We pull a pre-defined prompt template from LangChain’s hub to guide the AI agent’s responses.

from langchain import hub
prompt = hub.pull("hwchase17/openai-functions-agent")

Next, we create an AI agent that integrates the tools, language model, and prompt.

from langchain.agents import create_openai_tools_agent
agent = create_openai_tools_agent(llm, tools, prompt)

To execute the agent, we set up an AgentExecutor.

from langchain.agents import AgentExecutor
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
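Conceptually, the executor runs a loop: ask the model what to do, run the chosen tool, feed the result back, and stop once the model produces a final answer. The sketch below imitates that loop with standard-library code only. It is not AgentExecutor's implementation; fake_model, toy_tools, and run_agent are hypothetical names, and the keyword rules stand in for the real model, which picks a tool from the tool descriptions via tool calling.

```python
def fake_model(question, observations):
    """Hypothetical stand-in for the LLM: pick a tool, or answer once it has an observation."""
    if observations:
        return {"final": f"Answer based on: {observations[-1]}"}
    if "langsmith" in question.lower():
        return {"tool": "langsmith-search", "tool_input": question}
    if "paper" in question.lower():
        return {"tool": "arxiv", "tool_input": question}
    return {"tool": "wikipedia", "tool_input": question}


# Toy tools that return canned strings instead of calling real services.
toy_tools = {
    "wikipedia": lambda q: f"[wikipedia result for {q!r}]",
    "arxiv": lambda q: f"[arxiv result for {q!r}]",
    "langsmith-search": lambda q: f"[langsmith docs for {q!r}]",
}


def run_agent(question, model, tools, max_steps=5):
    """Loop: ask the model, run the chosen tool, repeat until a final answer."""
    observations = []
    for _ in range(max_steps):
        decision = model(question, observations)
        if "final" in decision:
            return {"input": question, "output": decision["final"]}
        tool = tools[decision["tool"]]
        observations.append(tool(decision["tool_input"]))
    return {"input": question, "output": "Stopped: step limit reached"}


result = run_agent("Tell me about Langsmith", fake_model, toy_tools)
print(result["output"])
```

The step limit matters in practice: without it, a model that keeps requesting tools would never return.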

7. Executing Search Queries

The agent can now be invoked to fetch information from Wikipedia, Arxiv, or the custom retriever tool.

agent_executor.invoke({"input": "Tell me about Langsmith"})

Additional example queries:

agent_executor.invoke({"input": "What is machine learning"})
agent_executor.invoke({"input": "What's the paper 1706.03762 about?"})

Conclusion

This multi-source search engine retrieves information from Wikipedia, Arxiv, and a custom document retriever. LangChain supplies the tools and the agent, FAISS with OpenAI embeddings handles document retrieval, and the Llama3-8b-8192 model served through ChatGroq decides which tool to call. New sources can be added by appending more tools to the list.

mlTutor