Exploring Question Answering From Tabular Data With GPT-3 and TAPAS

Summary

  • Large language models such as ChatGPT and GPT-3 are disrupting IT-based roles such as software development with their code-generation capabilities.
  • In this article I examine whether these models could also take over business intelligence-based roles by investigating how well they extract information from tables.
  • It appears that both ChatGPT and GPT-3 present incorrect, albeit deceptively close, answers to simple queries that require summation.
  • TAPAS, a model tailored for table parsing and information extraction, performs better than the GPT-based models, although it is still not perfect.
  • I will present an alternative approach that uses the code-generation capabilities of GPT-3 to generate SQL for a database, which in turn handles the query execution, with great accuracy!

Introduction

Large language models are emerging as a tool for solving a wide range of information- and knowledge-related tasks. Back in 2020, GPT-3 made huge strides in the way humans are able to interact with an AI, showcasing its ability to engage in compelling conversations with a human counterpart.

In November 2022 we saw the release of ChatGPT, a spinoff of the GPT-3 model that is even better suited for real-world conversations. The fact that ChatGPT has been used by millions of individuals and has received large-scale media exposure shows that this field, too, is rapidly becoming ready for real-world applications.

A major use case for ChatGPT and GPT-3 lies in their code-generation capabilities: they are capable of generating not only small snippets of program code but also larger programs, such as simple computer games, which can be compiled and run. With some experimentation it becomes very clear that the software development world will be disrupted by AI as well.

Disrupting the Business Intelligence market

As I have a background in business intelligence I was interested to see whether the AI capabilities that are currently making waves in other fields will also be influencing BI work in the future. Business intelligence is about gathering insights from tabular data by transforming the data, adding features and presenting it in tabular format or on a graphical dashboard. This may not be fully aligned with “showbiz AI” projects like ChatGPT or DALL-E 2, but the reality is that all of the world’s corporations are run and managed based on tabular data sitting in data warehouses or databases. Presenting this information in a suitable manner is a worldwide multi-billion dollar business spanning tooling and project costs.

Strangely enough, information on applying deep learning models to extract data stored in tabular form is not abundant. This may be partly because the majority of the AI world currently focuses on the aforementioned image-generation techniques. Secondly, proper tabular data in the form of the scorecards or dashboards used in businesses is not easily accessible in the public domain, so researchers are limited to, for instance, Wikipedia tables. Just compare this to labeled image libraries, which nowadays contain tens of millions of images.

Dataset and questions

For all examples in this post I will be using the data table below, containing GDP figures for the past two years. The figures are taken from the following Wikipedia page: https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)

Note that the 2022 GDP figures are the IMF estimates while the 2021 figures are from the World Bank, all in millions of USD.

| Country | Region | GDP | Year |
| --- | --- | --- | --- |
| United States | Americas | 25.035.164 | 2022 |
| China | Asia | 18.321.197 | 2022 |
| Japan | Asia | 4.300.621 | 2022 |
| Germany | Europe | 4.031.149 | 2022 |
| India | Asia | 3.468.566 | 2022 |
| United Kingdom | Europe | 3.198.470 | 2022 |
| France | Europe | 2.778.090 | 2022 |
| Canada | Americas | 2.200.352 | 2022 |
| Russia | Europe | 2.133.092 | 2022 |
| Italy | Europe | 1.996.934 | 2022 |
| United States | Americas | 22.996.100 | 2021 |
| China | Asia | 17.734.063 | 2021 |
| Japan | Asia | 4.937.422 | 2021 |
| Germany | Europe | 4.223.116 | 2021 |
| India | Asia | 3.173.398 | 2021 |
| United Kingdom | Europe | 3.186.860 | 2021 |
| France | Europe | 2.937.473 | 2021 |
| Canada | Americas | 1.990.762 | 2021 |
| Russia | Europe | 1.775.800 | 2021 |
| Italy | Europe | 2.099.880 | 2021 |

The questions I will be asking the models are the following. For each question I have noted the aspect that may pose an additional challenge to the model.

| Question | Challenge |
| --- | --- |
| What is the GDP of the United States in 2022? | Simple selection |
| What is the total GDP of Europe in 2022? | Selection and summation |
| What is the average GDP of all countries in 2022? | Selection and averaging |
| What is the GDP of Italy in 2021? | Simple selection, needs to be different from 2022 |
| How many distinct countries are there in Asia? | Distinct count |
| Total European GDP in 2021? | Selection and summation, not literally naming “Europe” |
| What countries have a GDP higher than 10 million in 2022? | Selection and filtering |
| How much does the total GDP in 2022 differ from the total GDP in 2021? | Selection, summation and subtracting two figures |
| What is the percentual increase of the US 2022 GDP vs 2021? | Two selections, calculating percentual difference |
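
Before turning to the models, the expected answers can be computed directly with pandas. The snippet below is my own ground-truth check (not part of the original notebooks); the column names follow the table above:

```python
import pandas as pd

# GDP figures in millions of USD, as in the table above
data = [
    ["United States", "Americas", 25035164, 2022],
    ["China", "Asia", 18321197, 2022],
    ["Japan", "Asia", 4300621, 2022],
    ["Germany", "Europe", 4031149, 2022],
    ["India", "Asia", 3468566, 2022],
    ["United Kingdom", "Europe", 3198470, 2022],
    ["France", "Europe", 2778090, 2022],
    ["Canada", "Americas", 2200352, 2022],
    ["Russia", "Europe", 2133092, 2022],
    ["Italy", "Europe", 1996934, 2022],
    ["United States", "Americas", 22996100, 2021],
    ["China", "Asia", 17734063, 2021],
    ["Japan", "Asia", 4937422, 2021],
    ["Germany", "Europe", 4223116, 2021],
    ["India", "Asia", 3173398, 2021],
    ["United Kingdom", "Europe", 3186860, 2021],
    ["France", "Europe", 2937473, 2021],
    ["Canada", "Americas", 1990762, 2021],
    ["Russia", "Europe", 1775800, 2021],
    ["Italy", "Europe", 2099880, 2021],
]
df = pd.DataFrame(data, columns=["Country", "Region", "GDP", "Year"])

# Selection and summation: total GDP of Europe in 2022
europe_2022 = df[(df.Region == "Europe") & (df.Year == 2022)].GDP.sum()

# Selection and averaging: average GDP of all countries in 2022
avg_2022 = df[df.Year == 2022].GDP.mean()

# Distinct count: number of distinct countries in Asia
asia_countries = df[df.Region == "Asia"].Country.nunique()

# Subtraction of two summations: total 2022 GDP minus total 2021 GDP
diff = df[df.Year == 2022].GDP.sum() - df[df.Year == 2021].GDP.sum()

# Percentage increase of the US GDP from 2021 to 2022
us = df[df.Country == "United States"].set_index("Year").GDP
pct = round((us[2022] - us[2021]) / us[2021] * 100, 2)
```

These figures (14.137.735, 6.746.363,5, 3, 2.408.761 and 8,87%) are what the models below will be judged against.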

Tabular information extraction using GPT-3 and ChatGPT

The first paper I ran across, from Srivastava, Ganu and Guha, was called Towards Zero-Shot and Few-Shot Table Question Answering using GPT-3 [1]. It uses the tried and tested GPT-3 model for answering questions from tables. Their approach is to provide the table in JSON format to the model, accompanied by a query, in order to retrieve the answer. This is called zero-shot learning; if you were to also provide a few example questions and answers from which the model could extract a pattern, it would be few-shot learning.

GPT-3 can be easily tested by using the OpenAI Playground at https://beta.openai.com/playground/ (requires you to create an account first).

The notebook containing the code is located here: https://github.com/kemperd/tables/blob/main/table_tapas_direct_answering.ipynb

I have formatted the data in JSON format as suggested by the authors and have posted it to GPT-3 together with the question (only the 2022 rows shown):

The following table in JSON-format shows the GDP per country:
{
	"columns": ["Country",
		"Region",
		"GDP",
		"Year"
	],
	"values": [
		["United States", "Americas", 25035164, 2022],
		["China", "Asia", 18321197, 2022],
		["Japan", "Asia", 4300621, 2022],
		["Germany", "Europe", 4031149, 2022],
		["India", "Asia", 3468566, 2022],
		["United Kingdom", "Europe", 3198470, 2022],
		["France", "Europe", 2778090, 2022],
		["Canada", "Americas", 2200352, 2022],
		["Russia", "Europe", 2133092, 2022],
		["Italy", "Europe", 1996934, 2022]
	]
}

What is the total GDP of Italy in 2022?
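
Instead of pasting this into the Playground, the same prompt can be assembled programmatically. Below is a minimal sketch; the build_prompt helper is my own illustration, and the commented-out completion call assumes the openai Python package with the text-davinci-003 model:

```python
import json

def build_prompt(table: dict, question: str) -> str:
    """Combine a JSON-formatted table and a question into a zero-shot prompt."""
    return (
        "The following table in JSON-format shows the GDP per country:\n"
        + json.dumps(table, indent=4)
        + "\n\n"
        + question
    )

table_2022 = {
    "columns": ["Country", "Region", "GDP", "Year"],
    "values": [
        ["United States", "Americas", 25035164, 2022],
        ["China", "Asia", 18321197, 2022],
        # remaining rows as in the listing above
    ],
}

prompt = build_prompt(table_2022, "What is the total GDP of Italy in 2022?")

# Submitting the prompt would then look like this (requires an OpenAI API key):
# import openai
# response = openai.Completion.create(
#     model="text-davinci-003", prompt=prompt, temperature=0, max_tokens=256)
# print(response.choices[0].text)
```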

Now let’s examine the answers to the Q&A table as provided earlier:

| Question | Answer | Analysis |
| --- | --- | --- |
| What is the GDP of the United States in 2022? | 25035164 | Correct |
| What is the total GDP of Europe in 2022? | The total GDP of Europe in 2022 is 12,774,893. | Incorrect, should be 14.137.735 |
| What is the average GDP of all countries in 2022? | The average GDP of all countries in 2022 is 23,159,556 | Incorrect, should be 6.746.364 |
| What is the GDP of Italy in 2021? | 1996934 | Incorrect, answer provides the 2022 figure |
| How many distinct countries are there in Asia? | There are 5 distinct countries in Asia: China, Japan, India | The country list is correct, although it incorrectly counts the 3 countries as 5 |
| Total European GDP in 2021? | 7235223 | Incorrect, should be 14.223.129 |
| What countries have a GDP higher than 10 million in 2022? | United States, China, Japan, Germany, India, United Kingdom | Incorrect, should be only the US and China |
| How much does the total GDP in 2022 differ from the total GDP in 2021? | The total GDP in 2022 differs by $50,312,764 | Incorrect, should be 2.408.761 |
| What is the percentual increase of the US 2022 GDP vs 2021? | The percentual increase of the US 2022 GDP vs 2021 is 10. | Incorrect, should be 8.87% |

It seems that GPT-3 handles table lookups pretty well, as long as the requested value is present in the table. However, a very important aspect of analyzing tables correctly is the ability to perform aggregations, such as summations or averages, over the table entries. Let's see how GPT-3 performs in this area:

  • What is the total GDP of Europe? (correct answer in this dataset is 14.137.735)
    • Answer 1: 11,741,336
    • Answer 2: 11,362,235
    • Answer 3: 12,678,936

These are some very peculiar findings. Although GPT-3 excels at language-oriented tasks, it cannot even perform an activity as simple as summing five figures; it does, however, produce numbers that are somewhat close to the correct answer.

Let’s have a look at ChatGPT at https://chat.openai.com/chat to see if that gives any improvement:

Now that’s interesting! ChatGPT gives a good explanation of which five figures need to be summed to get the result. Although it extracts the correct figures from the table, it does not perform the summation correctly: the right answer is 14.137.735. The presented outcome is however very close, which can be very misleading if you don’t recalculate the numbers yourself.

Let’s see if we can guide it in the right way:

Although it apologizes nicely for its mistake, it still does not sum correctly. This seems to be a weak spot in all GPT-based models and may be due to their probabilistic nature: the model simply tries to finish a sentence in the most probable way. The models settle on answers that are approximately right, which may be acceptable for text completion but does not work when performing calculations.

Other prompt engineering tricks, like providing the table data in CSV format or replacing the continent names with symbolic names such as “Continent A”, do not help in this regard.

Information extraction using TAPAS

In my search for an alternative to the GPT-based models I came across the TAPAS model (short for Table Parser), once again a contribution from the Google Research team. The model is based on the BERT large language model, which itself predates GPT-3 by a few years; however, it is specifically purposed for extracting information from tables, whereas GPT-3 is a generic model.

The blog at https://ai.googleblog.com/2020/04/using-neural-networks-to-find-answers.html provides a nice overview: the main idea is that, apart from the word embeddings for the table data, the model uses a few extra embeddings to encode the table structure, such as its rows and columns.

Each query outputs the cells used to calculate the end result, together with an optional aggregation operator that indicates what to do with these cells, such as summing, counting or averaging their values.
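
As a sketch of what this post-processing step looks like: per query you receive a set of selected cell values plus an aggregation operator index (NONE, SUM, AVERAGE or COUNT for the WTQ-finetuned model). The apply_aggregation helper below is my own illustration, not part of any library:

```python
# Aggregation operator ids as used by the WTQ-finetuned TAPAS model
ID2AGGREGATION = {0: "NONE", 1: "SUM", 2: "AVERAGE", 3: "COUNT"}

def apply_aggregation(aggregation_id: int, cell_values: list):
    """Turn TAPAS cell selections plus an aggregation operator into an answer."""
    operator = ID2AGGREGATION[aggregation_id]
    if operator == "SUM":
        return sum(cell_values)
    if operator == "AVERAGE":
        return sum(cell_values) / len(cell_values)
    if operator == "COUNT":
        return len(cell_values)
    # NONE: plain cell selection, return the cells as-is
    return cell_values

# "What is the total GDP of Europe in 2022?" selects five cells with SUM:
answer = apply_aggregation(1, [4031149, 3198470, 2778090, 2133092, 1996934])
```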

The models have been trained on millions of publicly accessible tables from the following datasets:

| Dataset | Task | Description |
| --- | --- | --- |
| SQA | Conversational | Conversational, only cell selection questions |
| WTQ (WikiTableQuestions) | Weak supervision for aggregation | Questions might involve aggregation, and the model must learn this given only the answer as supervision |
| WikiSQL | Strong supervision for aggregation | Questions might involve aggregation, and the model must learn this given the gold aggregation operator |

When selecting a model for your own use, it is important to keep the intended usage in mind; for instance, the SQA model will struggle with aggregating figures just as much as the GPT-based models do. I will be using the WTQ model for the next examples. Model sizes range from tiny, small, medium and base up to large. I will use the large model going forward, as a few quick tests with the base model gave too many incorrect answers; I have disregarded the models smaller than base altogether.

I have selected the TAPAS implementation from the excellent Huggingface Transformers library, documented at the following link: https://huggingface.co/docs/transformers/model_doc/tapas. As always Huggingface makes it extremely easy to test your own use case on a broad range of transformers from their library.

Please have a look at the following Jupyter notebook to get a view on how easy it is to work with the Huggingface TAPAS implementation: https://github.com/kemperd/tables/blob/main/table_tapas_direct_answering.ipynb

The Q&A table as provided by TAPAS looks as follows:

| Question | Answer | Analysis |
| --- | --- | --- |
| What is the GDP of the United States in 2022? | SUM > 25035164 | Correct |
| What is the total GDP of Europe in 2022? | SUM > 4031149, 3198470, 2778090, 2133092, 1996934 | Correct |
| What is the average GDP of all countries in 2022? | AVERAGE > 25035164, 18321197, 4300621, 4031149, 3468566, 3198470, 2778090, 2200352, 2133092, 1996934 | Correct |
| What is the GDP of Italy in 2021? | SUM > 2099880 | Correct |
| How many distinct countries are there in Asia? | COUNT > China, Japan, India, China, Japan, India | Incorrect, count is not distinct |
| Total European GDP in 2021? | SUM > 4223116, 3186860, 2937473, 1775800, 2099880 | Correct |
| What countries have a GDP higher than 10 million in 2022? | United States, China, Japan, Germany, India, United Kingdom, France | Incorrect |
| How much does the total GDP in 2022 differ from the total GDP in 2021? | SUM > 25035164, 18321197, 4300621, 4031149, 3468566, 3198470, 2778090, 2200352, 2133092, 1996934, 22996100, 17734063, 4937422, 4223116, 3173398, 3186860, 2937473, 1990762, 1775800, 2099880 | Incorrect, the model only seems to support summation on the selection |
| What is the percentual increase of the US 2022 GDP vs 2021? | COUNT > 25035164 | Incorrect |

These results are indeed better than the results that GPT-3 gave, especially when summing up figures. The more complex queries on calculating differences between selections are unfortunately still answered incorrectly.

SQL-based information extraction from databases using GPT-3

As a last experiment I will use GPT-3 once more, but this time not to analyze the tables directly, but to generate SQL statements for a database that holds the data. My thinking was that, as GPT-3 is a language model, it should excel at generating the proper code for selecting the data from a database, while the database should be able to handle the aggregation and summation logic.

I have loaded the aforementioned table into an SAP HANA Cloud database with the table structure below. I will use the hana_ml Python library for interfacing with HANA Cloud.

| Column Name | # | Data Type | Length |
| --- | --- | --- | --- |
| COUNTRY | 1 | NVARCHAR | 50 |
| REGION | 2 | NVARCHAR | 50 |
| GDP | 3 | DECIMAL | 34 |
| YEAR | 4 | NVARCHAR | 4 |

For this scenario I will be using a programmatic approach and will not be interfacing with the OpenAI Playground. Instead I will use the openai Python package to query the model using the API. If you want to follow along please have a look at the following Github repository: https://github.com/kemperd/tables. The notebook used for this example is table_gpt_sql_query.ipynb.

The main function receiving the generated SQL from GPT-3 and inputting it into the database to get the data is the following:

import json
import openai

# cc is a hana_ml ConnectionContext to the HANA Cloud instance;
# data_gdp holds the JSON-formatted table from the earlier examples

def apply_model(query):
    # Combine the JSON-formatted table, a few hints on the SQL dialect
    # and the actual question into a single prompt
    prompt = """Please regard the following table: {}

    The table name is COUNTRIES_BY_GDP. Use ' as the quote character. Quote column aliases with ". Write a SQL query to answer the following question: {}""".format(json.dumps(data_gdp), query)

    # Have GPT-3 generate the SQL statement
    request = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        temperature=0.9,
        max_tokens=3500
    )
    sql_query = request.choices[0].text
    print("===> {}: {}\n".format(query, sql_query))

    # Execute the generated SQL on the database and collect the result
    df = cc.sql(sql_query)
    return df.collect()

Note that the prompt contains a JSON-formatted table just as in the previous examples, but this time it asks for a corresponding SQL statement. I had to include a few hints for the database, such as the table name and some details on which quote characters to use: SAP HANA requires column names to be quoted with double quotes if they contain spaces. Your database may have similar syntax requirements.

Also note that I have increased the max_tokens parameter to 3500 to have the model return longer answers.

The Q&A table for this approach looks as follows:


Question: What is the GDP of the United States in 2022?

SELECT GDP
FROM COUNTRIES_BY_GDP
WHERE COUNTRY = 'United States'
AND YEAR = '2022';

Answer: 25035164 (Correct)

Question: What is the total GDP of Europe in 2022?

SELECT SUM(GDP) FROM COUNTRIES_BY_GDP
WHERE YEAR = 2022 AND REGION = 'Europe';

Answer: 14137735 (Correct)

Question: What is the average GDP of all countries in 2022?

SELECT AVG(GDP)
FROM COUNTRIES_BY_GDP
WHERE YEAR = 2022;

Answer: 6746363.5 (Correct)

Question: What is the GDP of Italy in 2021?

SELECT GDP
FROM COUNTRIES_BY_GDP
WHERE COUNTRY = 'Italy'
AND YEAR = 2021;

Answer: 2099880 (Correct)

Question: How many distinct countries are there in Asia?

SELECT COUNT(DISTINCT COUNTRY)
FROM COUNTRIES_BY_GDP
WHERE REGION = 'Asia';

Answer: 3 (Correct)

Question: Total European GDP in 2021?

SELECT SUM(GDP)
FROM COUNTRIES_BY_GDP
WHERE REGION = 'Europe'
AND YEAR = '2021';

Answer: 14223129 (Correct)

Question: What countries have a GDP higher than 10 million in 2022?

SELECT COUNTRY
FROM COUNTRIES_BY_GDP
WHERE GDP > 10000000
AND YEAR = '2022';

Answer: United States, China (Correct)

Question: How much does the total GDP in 2022 differ from the total GDP in 2021?

SELECT SUM(GDP) -
(SELECT SUM(GDP)
FROM COUNTRIES_BY_GDP
WHERE YEAR = '2021') AS TOTAL_GDP_INCREASE_2021_2022
FROM COUNTRIES_BY_GDP
WHERE YEAR = '2022';

Answer: 2408761 (Correct)

Question: What is the percentual increase of the US 2022 GDP vs 2021?

SELECT round((((25035164-22996100)*100)/22996100),2) as "Percentage_Increase_US_GDP"
FROM COUNTRIES_BY_GDP
WHERE YEAR = '2022' AND COUNTRY = 'United States';

Answer: 8.87 (Correct)

Now that seemed to do the trick! We are combining two tools, each fit for its purpose: GPT-3 handles the generation of the correct SQL code, while the database handles the tasks that GPT-3 is not suitable for, like aggregations and calculations. This approach is very easy to adapt to other tables, as long as both the table contents and the resulting answer fit within the roughly 4,000 tokens that GPT-3 is able to process at once (table and output combined).
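
The same pattern is also easy to reproduce without a HANA instance, as any SQL engine can play the execution role. Below is a minimal sketch using Python's built-in sqlite3 module (my substitution, not what I used above), with the GPT-3 call omitted and one of the generated statements from the table above executed verbatim:

```python
import sqlite3

# Load the GDP table into an in-memory SQLite database
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE COUNTRIES_BY_GDP "
    "(COUNTRY TEXT, REGION TEXT, GDP INTEGER, YEAR TEXT)"
)
rows = [
    ("United States", "Americas", 25035164, "2022"),
    ("China", "Asia", 18321197, "2022"),
    ("Japan", "Asia", 4300621, "2022"),
    ("Germany", "Europe", 4031149, "2022"),
    ("India", "Asia", 3468566, "2022"),
    ("United Kingdom", "Europe", 3198470, "2022"),
    ("France", "Europe", 2778090, "2022"),
    ("Canada", "Americas", 2200352, "2022"),
    ("Russia", "Europe", 2133092, "2022"),
    ("Italy", "Europe", 1996934, "2022"),
    ("United States", "Americas", 22996100, "2021"),
    ("China", "Asia", 17734063, "2021"),
    ("Japan", "Asia", 4937422, "2021"),
    ("Germany", "Europe", 4223116, "2021"),
    ("India", "Asia", 3173398, "2021"),
    ("United Kingdom", "Europe", 3186860, "2021"),
    ("France", "Europe", 2937473, "2021"),
    ("Canada", "Americas", 1990762, "2021"),
    ("Russia", "Europe", 1775800, "2021"),
    ("Italy", "Europe", 2099880, "2021"),
]
conn.executemany("INSERT INTO COUNTRIES_BY_GDP VALUES (?, ?, ?, ?)", rows)

# One of the GPT-3-generated statements from the table above, run verbatim
sql = """SELECT SUM(GDP)
FROM COUNTRIES_BY_GDP
WHERE REGION = 'Europe'
AND YEAR = '2021';"""
total_european_gdp_2021 = conn.execute(sql).fetchone()[0]
```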

Wrapup

In this blog post I have explored various approaches to question answering from tables. Direct question answering from a table supplied in a GPT-3 prompt performed poorly. TAPAS, a model specifically trained for table parsing and question answering, did better in this regard, despite being a much smaller model.

The best results were obtained by having GPT-3 write SQL code which could be executed on a database to circumvent the poor aggregation and counting capabilities that the GPT-based models appear to have.


  1. Srivastava, Ganu and Guha, Towards Zero-Shot and Few-Shot Table Question Answering using GPT-3, https://arxiv.org/pdf/2210.17284.pdf ↩︎