RAG with Citations for SEC 10-K Filings
Implementation of Retrieval-Augmented Generation (RAG) with source citations.
General Timeline
- An overview of the project’s timeline, detailing the tasks, tools, and methodologies used during development.
- Link to project demo video
- The file ms.png was the one used for Visual QnA with LLaVA; it has been uploaded here.
Tasks and Tools Overview
- Understanding 10-K Filings and company tickers:
- Watched a YouTube video to grasp the concept of 10-K filings.
- Learned about their contents and how to skim through them efficiently.
- What is a company ticker? A ticker symbol is an abbreviation of a company’s name that uniquely identifies its publicly traded shares on stock exchanges.
- Exploring SEC-EDGAR:
- Explored the Securities and Exchange Commission’s Electronic Data Gathering, Analysis, and Retrieval system (SEC-EDGAR).
- Downloading and Formatting Filings:
- Used the `sec_edgar_downloader` library to download 10-K filings.
- Developed scripts to add HTML tags for proper formatting.
- While reading the method’s documentation, I found a parameter for human-readable formats such as .html. Using it gave me .html documents directly, so I did not need to extract any HTML content myself.
- Organizing and Renaming Files:
- Developed a Python script to organize and rename folders and corresponding HTML files based on their year.
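The renaming step can be sketched as follows. This is a minimal sketch, assuming the folder layout produced by `sec_edgar_downloader`, where each filing sits in a folder named after its accession number whose middle segment encodes a two-digit filing year; `year_from_accession` is a hypothetical helper, not the project's actual script.

```python
import re
from pathlib import Path

def year_from_accession(folder_name):
    """Extract a four-digit year from an accession-number folder name
    such as '0000320193-23-106013' (the middle segment is a two-digit year)."""
    m = re.fullmatch(r"\d{10}-(\d{2})-\d{6}", folder_name)
    if not m:
        raise ValueError(f"unexpected folder name: {folder_name}")
    yy = int(m.group(1))
    # Heuristic century pivot: the filings here span 1995-2023.
    return 1900 + yy if yy > 50 else 2000 + yy

# Renaming sketch (the path is illustrative):
# for d in Path("sec-edgar-filings/AAPL/10-K").iterdir():
#     d.rename(d.with_name(str(year_from_accession(d.name))))
```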
- Introduction to RAG (Retrieval Augmented Generation):
- Getting insights from a document an LLM has never seen can be achieved with RAG (Retrieval Augmented Generation). I had prior experience with RAG, so that is what I tried.
- Planned to create a chatbot-like application interface.
- I had only worked with PDF and text files before, never with HTML files. I read the documentation to learn how to load HTML files, and then how to load multiple documents at once.
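The project loads HTML filings through LangChain's document loaders; as an illustration of the underlying idea, here is a stdlib-only sketch that extracts the visible text of an HTML filing (the class and function names are my own, not LangChain's API):

```python
from html.parser import HTMLParser
from pathlib import Path

class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def load_html_text(path):
    """Return the visible text of one HTML file as a single string."""
    parser = _TextExtractor()
    parser.feed(Path(path).read_text(errors="ignore"))
    return " ".join(parser.parts)

# Loading a whole folder of filings at once (path is illustrative):
# docs = [load_html_text(p) for p in Path("filings/AAPL").glob("*.html")]
```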
- Experimenting with Models:
- Explored various models, including OpenAI (no access tokens available), HuggingFace models, local models via Ollama (Llama 2, Llama 3, Phi-3, Mistral, and LLaVA), and Anthropic’s Claude 3 Opus.
- Evaluated models like Intel/dynamic_tinybert, deepset/roberta-base-squad2, claude-3-opus, etc.
- Front-end Development with Streamlit:
- Chose Streamlit for its easy integration with Python and multi-page application support.
- Learnt about session state and variables, since values are not stored across pages in a Streamlit application; some variables, such as the chosen ticker, had to be shared across pages.
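The session-state pattern can be sketched as below. This is a minimal sketch, not the project's actual code: `st.session_state` behaves like a dict that persists across reruns and pages, and `init_state`/`DEFAULTS` are hypothetical names; a plain dict stands in for `st.session_state` so the snippet runs outside a Streamlit app.

```python
# Seeding defaults once lets every page read shared values such as
# the chosen ticker, which Streamlit does not carry across pages otherwise.
DEFAULTS = {"ticker": None, "years": (1995, 2023)}

def init_state(state):
    """Seed missing keys without clobbering values set on another page."""
    for key, value in DEFAULTS.items():
        state.setdefault(key, value)
    return state

# In a Streamlit page you would call:  init_state(st.session_state)
# Here a plain dict stands in for st.session_state:
state = init_state({"ticker": "AAPL"})
```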
Task 1.1: Data Retrieval
- Selected companies: Apple, Microsoft, Visa, Nvidia.
- Retrieved 10-K filings using tickers and downloaded HTML files.
- Used the `download_details` argument to obtain HTML files.
- Combined the relevant HTML files into folders and loaded them using LangChain’s document loader.
- Split data into chunks and created a FAISS database for similarity search based on user queries.
- I found the JSON mapping a company name to its ticker from the EDGAR website, which was extremely helpful.
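EDGAR publishes that mapping at https://www.sec.gov/files/company_tickers.json. A hedged sketch of inverting it into a name → ticker lookup; the sample below mirrors the file's record shape, and `name_to_ticker` is a hypothetical helper:

```python
import json

# A small sample mirroring the record shape of
# https://www.sec.gov/files/company_tickers.json
SAMPLE = json.loads("""{
  "0": {"cik_str": 320193, "ticker": "AAPL", "title": "Apple Inc."},
  "1": {"cik_str": 789019, "ticker": "MSFT", "title": "MICROSOFT CORP"}
}""")

def name_to_ticker(records):
    """Invert EDGAR's records into a case-insensitive name -> ticker dict."""
    return {rec["title"].lower(): rec["ticker"] for rec in records.values()}

lookup = name_to_ticker(SAMPLE)
```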
```python
from sec_edgar_downloader import Downloader

ticker = input("Enter company ticker: ")
dl = Downloader("MyCompanyName", "my_mail@email_provider.com")
dl.get("10-K", ticker, after="1995-01-01", before="2023-12-31", download_details=True)
```
The `download_details` argument was of great help, since it produced HTML files instead of plain-text files.
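The split-into-chunks step from Task 1.1 can be sketched without any library; LangChain's text splitters implement a more careful version of the same idea, and `chunk_text` here is a hypothetical helper:

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size chunks that overlap, so a fact straddling
    a boundary still appears whole in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

Each chunk would then be embedded and stored in the FAISS index so that user queries retrieve the most similar passages.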
Task 1.2: Data Collection, Preprocessing, and Text Analysis from 10-K Filings
- 10-K filings are complex and lengthy, requiring insights extraction.
- Planned to use LLMs to quickly generate insights such as sales trends, risk factors, and product sales breakdown.
- Why would a user care?
- 10-K filings are extremely long and tedious for humans to read through and understand. With the amount of context modern-day LLMs can sift through and store, using them would be a great way to quickly generate insights such as:
- What are the sales of a particular financial year?
- How was the trend in the past few years?
- What are the risk factors of the company?
- What is the device based split up of sales in an electronics company such as Apple?
- These insights will help the user make informed decisions, after manual verification, about aspects such as investments, job opportunities, or a general understanding of the industry the company is in.
Task 2: Application Construction and Deployment
- Chose Python as the backend language for its compatibility with all the LLMs used and my proficiency in it.
- Selected Streamlit for its ease of integration with existing Python code.
- Utilized Streamlit’s capability to build multi-page applications that run quickly.
- Created a homepage for selecting company tickers and timeframes.
- Implemented separate pages for different LLMs, enabling users to chat with them.
- Introduced multi-modality with LLaVA for analyzing uploaded images.
- Deployed visualizations generated from Claude’s answers using QuickChart; the generated charts are shown as links below:
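Such a QuickChart link encodes a Chart.js config directly in the URL. A minimal sketch of building one; `quickchart_url` is a hypothetical helper and the revenue figures are placeholders, not data from the filings:

```python
import json
from urllib.parse import quote

def quickchart_url(labels, values, label="Revenue ($B)"):
    """Build a QuickChart link that renders a Chart.js bar chart."""
    config = {
        "type": "bar",
        "data": {"labels": labels,
                 "datasets": [{"label": label, "data": values}]},
    }
    return "https://quickchart.io/chart?c=" + quote(json.dumps(config))

url = quickchart_url(["2021", "2022", "2023"], [365.8, 394.3, 383.3])
```

In the app, the labels and values would be parsed out of Claude's answer before constructing the link.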