In the evolving landscape of artificial intelligence, retrieving relevant information from vast sources has become a crucial part of building intelligent applications. This blog demonstrates how to combine LangChain, FAISS, and a Groq-hosted large language model to extract and process information from web sources such as the Times of India.
Introduction
Retrieving relevant, up-to-date information is essential for AI-driven applications, whether in journalism, research, or customer service. This guide shows how to scrape, process, and query web-based data using LangChain and FAISS, with Groq's LLM powering the responses.
Prerequisites
Before running the code, ensure you have the following installed:
langchain
langchain_community
langchain_groq
langchain_text_splitters
faiss-cpu
groq
python-dotenv
requests
beautifulsoup4
You can install them using pip:
pip install langchain langchain_community langchain-groq langchain-text-splitters faiss-cpu groq python-dotenv requests beautifulsoup4
Additionally, you need a Groq API key, which should be stored in a .env file, and a local Ollama installation with the mxbai-embed-large model pulled, since the embedding step below uses OllamaEmbeddings.
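A minimal .env file needs only one line; the value shown here is a placeholder for your own key from the Groq console:
GROQ_API_KEY=your_groq_api_key_here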
Code Breakdown
1. Setting Up Environment and LLM
We begin by loading environment variables and initializing Groq’s LLM for text generation.
import os
from dotenv import load_dotenv
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_groq import ChatGroq
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
# Load API keys and environment variables
load_dotenv()
if "GROQ_API_KEY" not in os.environ:
raise ValueError("GROQ_API_KEY not found in environment variables. Please add it to your .env file.")
# Initialize the language model
llm = ChatGroq(model="deepseek-r1-distill-llama-70b", groq_api_key=os.getenv("GROQ_API_KEY"))
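Before building the rest of the pipeline, it can be worth a quick sanity check that the model is reachable. This is an optional sketch; the prompt text is just an example:
# Optional: confirm the Groq model responds before wiring up the full pipeline
response = llm.invoke("Reply with a one-sentence greeting.")
print(response.content)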
2. Scraping and Processing Web Content
Using LangChain’s WebBaseLoader, we scrape content from Times of India.
# Load web content from Times of India
loader = WebBaseLoader("https://timesofindia.indiatimes.com/")
document = loader.load()
# Split content into manageable chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
documents = text_splitter.split_documents(document)
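If you want to confirm that the page loaded and split as expected, a quick optional check might look like this (the 200-character preview length is arbitrary):
# Optional: inspect how many chunks were produced and preview the first one
print(f"Loaded {len(document)} document(s), split into {len(documents)} chunks")
print(documents[0].page_content[:200])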
3. Embedding and Vector Store Creation
To efficiently store and retrieve relevant information, we use FAISS as the vector store and OllamaEmbeddings (served by a local Ollama instance) to embed the chunks.
# Create document embeddings and store in FAISS
embeddings = OllamaEmbeddings(model='mxbai-embed-large')
vectorstore = FAISS.from_documents(documents, embeddings)
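Embedding the whole page on every run is slow, so you may want to persist the index to disk. The sketch below uses FAISS's save_local and load_local helpers; the folder name is arbitrary, and recent langchain_community versions require allow_dangerous_deserialization=True when reloading:
# Optional: save the FAISS index to disk and reload it in a later run
vectorstore.save_local("toi_faiss_index")
# vectorstore = FAISS.load_local("toi_faiss_index", embeddings, allow_dangerous_deserialization=True)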
4. Creating Retrieval and Query Chains
We define the structure for document retrieval and querying.
# Define prompt structure
prompt = ChatPromptTemplate.from_template(
"""
Answer the following question based only on the provided context:
<context>
{context}
</context>
Question: {input}
"""
)
# Create document chain
document_chain = create_stuff_documents_chain(llm, prompt)
# Set up retrieval mechanism
retriever = vectorstore.as_retriever()
retrieval_chain = create_retrieval_chain(retriever, document_chain)
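Before invoking the full chain, you can also check what the retriever returns on its own; the query string below is just an example (older LangChain versions expose the same behaviour via get_relevant_documents):
# Optional: inspect the raw chunks the retriever returns for a query
docs = retriever.invoke("latest news headlines")
for doc in docs:
    print(doc.page_content[:150])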
5. Querying the Model
Finally, we can ask questions based on the retrieved knowledge.
# Example query
result = retrieval_chain.invoke({"input": "What are the latest news headlines?"})
print(result['answer'])
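The dictionary returned by the retrieval chain also includes the retrieved source chunks under the context key, which is handy for checking what the answer was grounded on:
# Inspect the chunks that were passed to the model as context
for doc in result["context"]:
    print(doc.page_content[:150])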
Conclusion
With this approach, we can effectively scrape, process, and query data from Times of India, allowing for real-time, AI-driven information retrieval. This method can be extended to other sources, making it valuable for various business applications, including journalism, market research, and automated content summarization.
By integrating LangChain, FAISS, and Groq’s LLM, we enhance the way AI interacts with real-world information, providing accurate and contextually relevant responses.