Infusing ChatGPT's Body of Knowledge With Your Custom Documents

Dirk Kemper

Mar 23, 2023 10 min read Language models, Enterprise AI

Summary

Large language models like ChatGPT are not trained on data about recent events or on internal documents living inside companies.
To have a conversation about such content, the information from recent events and from corporate knowledge bases first has to be infused into the body of knowledge of the GPT-based networks.
In this blog post I will demonstrate how to use LlamaIndex for extracting information from an internal document and feeding this to ChatGPT so it will be able to readily answer questions about its contents.

Introduction

The hype about GPT-based networks doesn’t seem to be going away soon. On the 14th of March OpenAI introduced GPT-4, an updated version of the capable GPT-3 model which appears to perform even better than its predecessor on language-oriented tasks. Still, even though GPT-4 appears to be more capable at first glance, one of the major drawbacks of these networks is their lack of up-to-date knowledge on recent events. This was true for GPT-3 and ChatGPT, and it remains true for the recently released GPT-4, which has a training data cutoff of September 2021. There will also be many occasions where you would like the model to take into account specific additional information it has not seen during training, such as documents which are not publicly accessible.

Just think of the huge number of documents stored in large corporations which are outside of the public scope of the GPT-based language models. Wouldn’t it be powerful if you could infuse the knowledge of the GPT models with the specific domain knowledge stored in these documents so you can have a conversation with ChatGPT about their contents? In this blog post I will demonstrate one way to set this up.

All companies I have come across use some form of knowledge base or a document repository like SharePoint where teams or business units place the documents they produce in folder-like structures. New joiners at those companies struggle to get a grip on the hundreds of documents they are assumed to have knowledge of, sometimes spanning thousands of pages. Of course document search functionality is helpful in this regard, but a bot assisting a new joiner with questions about these datasets would be extremely helpful.

There are multiple ways to have the GPT-based models work with your custom set of documents:

Finetune one of the GPT models on the custom dataset. This essentially takes the original model weights as trained by OpenAI, adds your example documents to the training phase of the model and updates the neural network weights accordingly. This however does not guarantee that ChatGPT will prioritize the knowledge from those documents when asked to do so, as the information is simply added to its vast body of existing knowledge. This means that it may still answer from other sources it has used during training instead of from your documents.
Add the information you require ChatGPT to know about directly into the prompt. You can paste in (parts of) the document you will be querying and directly ask it a question about the document, all in the same prompt. This is limited by the model's maximum of 4096 tokens, which is roughly 3000 English words as a token is about three quarters of a word, so larger documents cannot be sent at once. GPT-4 introduces a new default limit of about 8000 tokens and an increased limit of 32000 tokens in a larger (and probably more expensive) model, making the latest version of GPT a better choice for this scenario.
Make use of an indexing package such as LlamaIndex to build an index from the additional information and use that index for querying. LlamaIndex will handle the boilerplate such as splitting the documents into chunks, building an index from them and sending only as much text to GPT as fits within its token limits.
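
To make the second option more concrete, here is a minimal sketch of how a document could be split so that each piece fits in a prompt together with the question. The ratio of 4 characters per token is only a rule of thumb for English text and not the real tokenizer, and the function names and the 500 tokens reserved for the answer are hypothetical choices for illustration.

```python
# Rough sketch: split a document into pieces that fit a model's token limit.
# ASSUMPTION: ~4 characters per token, a rule of thumb for English text.
# A real implementation would count tokens with the model's own tokenizer.
from typing import List

CHARS_PER_TOKEN = 4  # heuristic, not exact


def estimate_tokens(text: str) -> int:
    """Approximate the number of tokens in a piece of text."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def split_to_fit(document: str, question: str,
                 token_limit: int = 4096, answer_reserve: int = 500) -> List[str]:
    """Split the document on paragraph boundaries so that each piece, plus the
    question and room for the answer, stays within the token limit.
    A single paragraph larger than the budget is kept whole in this sketch."""
    budget = token_limit - answer_reserve - estimate_tokens(question)
    pieces, current = [], ""
    for paragraph in document.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and estimate_tokens(candidate) > budget:
            pieces.append(current)   # current piece is full, start a new one
            current = paragraph
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces


def build_prompt(piece: str, question: str) -> str:
    """Combine one document piece and the question into a single prompt."""
    return f"Answer using only this document:\n\n{piece}\n\nQuestion: {question}"
```

Each piece would then be sent in its own prompt, which also shows the weakness of this option: the model only ever sees one piece at a time, and you pay for every piece you send.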

This blog post will focus on the LlamaIndex solution, as it is easy to set up, has a plug-in library for ingesting many sources of information and works with OpenAI’s cheaper gpt-3.5-turbo model. For more details please have a look at the documentation at https://gpt-index.readthedocs.io/en/latest/index.html and the GitHub repo at https://github.com/jerryjliu/llama_index.

High-level steps

The following high-level steps need to be taken when using LlamaIndex for answering questions about your custom documents:

Load your required document using a data connector, for instance one for reading files from a folder, a web page or Wikipedia. There currently is a substantial library of plug-in data connectors available at https://llamahub.ai/
Construct an index from the document. LlamaIndex splits the document into chunks and sends them to the text-embedding-ada-002-v2 embedding model, also offered by OpenAI as part of their API portfolio. The model converts each chunk into an embedding, which is basically a numerical vector representing the meaning of the text in that chunk.
Query the index with your specific question. LlamaIndex uses the embeddings to find the chunks most relevant to your question and then sends the text of those chunks together with your question in a prompt to the actual GPT language model in order to retrieve an answer.
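
The steps above can be illustrated without calling any API. The sketch below is a toy stand-in and not how LlamaIndex is implemented: the `embed` function uses plain word counts instead of an OpenAI embedding model, and the chunk texts and function names are made up for illustration. It only shows the general pattern of embedding each chunk, embedding the question, picking the most similar chunk and placing its text in the prompt.

```python
# Toy illustration of index-then-query. NOT the LlamaIndex implementation:
# word-count vectors stand in for the embeddings an OpenAI model would return.
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Step 2: store every chunk together with its vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, question: str) -> str:
    """Step 3a: return the text of the chunk most similar to the question."""
    query_vector = embed(question)
    best_chunk, _ = max(index, key=lambda item: cosine_similarity(item[1], query_vector))
    return best_chunk


def build_prompt(index, question: str) -> str:
    """Step 3b: put the retrieved chunk text (not its vector) into the prompt."""
    context = retrieve(index, question)
    return f"Context:\n{context}\n\nUsing only the context, answer: {question}"


# Step 1: in practice a data connector would load and split the document.
# These chunk texts are invented for the example.
chunks = [
    "To send a complaint, write to the complaints department or phone us.",
    "The policy covers accidental damage to buildings and contents.",
    "You can change your policy online at any time.",
]
index = build_index(chunks)
print(build_prompt(index, "Where can I send a complaint?"))
```

The important detail is in the last function: the vectors are only used to decide which chunk is relevant, while the language model itself receives ordinary text.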

There is a very important caveat to this solution, one which matters when working with internal or private documents that should not leave the internal network. The architecture of the solution is to send the document text to the OpenAI API, first to create the embeddings and later as part of the prompt given to ChatGPT. This means that your document will leave the network, is processed by software behind the OpenAI API and may possibly be cached or stored there. You should not use this for confidential information which needs to be kept within your company.

An alternative to this architecture is setting up a local language model like BLOOM or BERT yourself on privately managed hardware and finetuning that on your custom dataset. This will be outside of the scope of this blog post though.

Code

The Jupyter notebook, including a description of how to work with this yourself, is available at https://github.com/kemperd/llamaindex.

In this example I will be using the following document from AXA, a large and globally operating insurance firm: http://www.axainsurance.com/home/policy-wording/policywording_153.pdf. This document contains examples of various insurance coverages that AXA offers and is the type of corporate PDF document you may encounter in large numbers in these organizations.

First I will load and store the document in a local folder:

# Download the PDF file which will be indexed
import shutil

import requests

PDF_NAME = 'policywording_153.pdf'

file = requests.get('http://www.axainsurance.com/home/policy-wording/{}'.format(PDF_NAME), stream=True)
file.raise_for_status()  # stop here if the download failed
with open(PDF_NAME, 'wb') as location:
    shutil.copyfileobj(file.raw, location)

Now we can instantiate a language model. Note that I am using gpt-3.5-turbo in this setting, which is the model underlying ChatGPT. You can also select other models here, like the new GPT-4 model if you have beta access to it. I prefer gpt-3.5-turbo because of its low cost: usage is billed by the number of tokens, and quite a few tokens are sent across.

After the language model has been instantiated, the locally stored PDF is read and converted into an index. Creating the index will incur some costs, as at this point the embedding model is used to create the embeddings.

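from pathlib import Path

# Import paths assumed from the llama_index / langchain releases of March 2023
from langchain.llms import OpenAIChat
from llama_index import GPTSimpleVectorIndex, LLMPredictor, download_loader

llm_predictor = LLMPredictor(llm=OpenAIChat(temperature=0, model_name="gpt-3.5-turbo"))

PDFReader = download_loader("PDFReader")
loader = PDFReader()
documents = loader.load_data(file=Path(PDF_NAME))
index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor)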
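
To get a feel for the costs mentioned above, here is a back-of-the-envelope sketch. The prices are assumptions based on OpenAI's price list from March 2023 (per 1000 tokens), and the page and token counts are hypothetical numbers rather than measurements of the AXA document, so please check the current price list before relying on the outcome.

```python
# Back-of-the-envelope cost estimate. ASSUMPTION: prices per 1,000 tokens as
# listed by OpenAI in March 2023; check the current price list before use.
EMBEDDING_PRICE_PER_1K = 0.0004   # text-embedding-ada-002 (assumed)
CHAT_PRICE_PER_1K = 0.002         # gpt-3.5-turbo (assumed)


def estimate_cost(tokens: int, price_per_1k: float) -> float:
    """Return the cost in US dollars for processing the given number of tokens."""
    return tokens / 1000 * price_per_1k


# Hypothetical numbers: a 40-page PDF at ~500 tokens per page, and one query
# that sends a 1,000-token chunk plus a short question and answer.
index_tokens = 40 * 500
query_tokens = 1000 + 200

print(f"Building the index: ${estimate_cost(index_tokens, EMBEDDING_PRICE_PER_1K):.4f}")
print(f"One query:          ${estimate_cost(query_tokens, CHAT_PRICE_PER_1K):.4f}")
```

Under these assumptions both building the index and running a query cost a fraction of a cent, but the query costs recur with every question asked.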

Now that the index has been built, querying it is easy:

response = index.query("Where can I send a complaint? Please provide full address details and phone number.")
print(response)

This will give the following response:

You can send a complaint to the Financial Ombudsman Service at Exchange Tower, Harbour Exchange Square, London, E14 9SR. Their phone numbers are 0300 123 9123 or 0800 023 4567.

And indeed, this address and these phone numbers can be found on page 33 of the document.

Results

The first example indicates that the GPT model seems to be correctly using the information from the supplied document. In order to determine if the model actually took these answers from the document itself or whether it is using prior knowledge from training, I will compare a few of the answers from LlamaIndex against standard ChatGPT.

The table below shows a few of the questions I asked about the provided document. For the full overview please refer to the Jupyter notebook at https://github.com/kemperd/llamaindex/blob/main/llamaindex-insurance.ipynb and cross-check the ChatGPT responses yourself, as I had to truncate them a bit to keep the table readable.

Question	LlamaIndex response	ChatGPT response	Analysis
Where can I send a complaint? Please provide full address details and phone number.	You can send a complaint to the Financial Ombudsman Service at Exchange Tower, Harbour Exchange Square, London, E14 9SR. Their phone numbers are 0300 123 9123 or 0800 023 4567.	If you wish to make a complaint regarding AXA insurance, you can contact their customer service department in your respective country. Please find below the contact details for the AXA insurance customer service centers in some of the countries: AXA US: Address: AXA Equitable Life Insurance Company 1290 Avenue of the Americas New York, NY 10104 Phone: 1-888-292-4636 [answer truncated]	The LlamaIndex answer can be found on document page 33. ChatGPT provides 4 addresses across various countries, but not the address from the document. This indicates it uses pre-existing knowledge from other sources.
What promises does AXA make? Please provide them as a bulleted list.	- Acknowledge all complaints promptly. - Investigate quickly and thoroughly. - Keep the complainant informed of progress. - Do everything possible to resolve the complaint. - Use the information from complaints to continuously improve their service.	As a global insurance and financial services company, AXA has a range of products and services that may vary by region and market. However, some of the promises that AXA makes to its customers include: Protection: [truncated] Personalized solutions: [truncated] Peace of mind: [truncated]	The LlamaIndex answer can be found on document page 34. ChatGPT appears to provide generic answers: the terms “personalized solutions” and “peace of mind” do not appear in the document.
How can I make a change to my home insurance policy?	You can make changes to your home insurance policy online. It is your responsibility to make any changes to your policy online. If you request AXA to make any changes or corrections to your policy on your behalf, they reserve the right to charge up to £30. If you are in doubt, you can ask AXA.	To make a change to your AXA home insurance policy, you can contact the AXA customer service department in your respective country. Here are the general steps you can follow: Review your policy: [truncated] Contact AXA: [truncated] Provide the necessary information: [truncated] Receive confirmation: [truncated]	The LlamaIndex answer can be found on document page 21. ChatGPT appears to give a generic answer without references to the actual document.
What is AXA’s website?	The context information does not provide information about AXA’s website.	AXA is a global company that provides insurance and financial services in various countries and regions. Depending on your location, the website for AXA may vary. However, the global website for AXA can be found at www.axa.com. From there, you can select your country or region to access the local AXA website and find information about the specific products and services available in your area.	LlamaIndex is unable to provide just one website, even though websites have been provided in the document. ChatGPT appears to use its pre-existing knowledge.
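
Part of the analysis in the table, namely checking whether specific phrases from an answer occur in the source document, can be automated. The sketch below is one simple way to do that, assuming the PDF text has already been extracted into a string. The function names and the sample text are hypothetical, and the check is purely literal, so a paraphrased answer would be reported as not found.

```python
# Literal phrase check: which phrases from an answer occur in the document text?
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks in the PDF do not matter."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def phrases_in_document(phrases, document_text: str) -> dict:
    """Map each phrase to True if it appears literally in the document text."""
    haystack = normalize(document_text)
    return {phrase: normalize(phrase) in haystack for phrase in phrases}


# Hypothetical sample text standing in for the extracted PDF contents.
document_text = """We will acknowledge all complaints promptly and
investigate quickly and thoroughly."""

result = phrases_in_document(
    ["investigate quickly and thoroughly", "peace of mind"], document_text
)
for phrase, found in result.items():
    print(f"{phrase!r}: {'found' if found else 'not found'}")
```

A check like this only flags candidates for manual review; it cannot tell whether an answer that does appear in the document was actually taken from it.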

Wrapup

In this blog post I have shown how you can infuse ChatGPT’s large body of knowledge with your own documents using the LlamaIndex package. Comparing the answers from LlamaIndex with those from standard ChatGPT shows that by using LlamaIndex it is possible to have a conversation with ChatGPT about the custom content you require it to know about.
