足ることを知らず

Data Science, global business, management and MBA

Day 81 in MIT Sloan Fellows Class 2023, Advanced Data Analytics and Machine Learning in Finance 8, SQuAD

Question Answering Tasks in Machine Learning

Question answering is one of the major problem classes in NLP. In a nutshell, it is a simpler version of the reading comprehension questions found on the GMAT and similar tests.

 

There are many types of questions and texts, but the best-known modern dataset is probably SQuAD, which stands for the Stanford Question Answering Dataset.

The Stanford Question Answering Dataset

 

This dataset is remarkably strong on two dimensions: quality (variety of questions and contexts) and quantity (number of contexts and questions).

Earlier datasets such as MCTest, Algebra, and WikiQA could not resolve the trade-off between quality and quantity.

 

The structure of the dataset

It consists of 536 Wikipedia articles with 23,215 paragraphs. There are over 100K questions, and in SQuAD 2.0 roughly a third of the questions are unanswerable. These unanswerable questions can easily confuse an ML model.

 

The actual data is structured as follows.

Each entry has a context (a paragraph or several paragraphs), questions, and short answers taken from the context.
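
To make the structure concrete, here is a minimal sketch of one record laid out like the SQuAD 2.0 JSON files. The field names (`paragraphs`, `context`, `qas`, `answers`, `answer_start`, `is_impossible`) follow the published format as I understand it, but the article title, the context sentence, and both questions are made up for illustration, and `iter_examples` is a hypothetical helper.

```python
# Hypothetical record laid out like the SQuAD 2.0 JSON files.
# The text and questions below are invented for illustration.
CONTEXT = "SQuAD was built from Wikipedia articles by crowdworkers."
ANSWER = "Wikipedia articles"

record = {
    "title": "Example_Article",
    "paragraphs": [{
        "context": CONTEXT,
        "qas": [
            {"id": "q1", "question": "What was SQuAD built from?",
             "is_impossible": False,
             "answers": [{"text": ANSWER,
                          "answer_start": CONTEXT.index(ANSWER)}]},
            {"id": "q2", "question": "Who funded SQuAD?",
             "is_impossible": True, "answers": []},
        ],
    }],
}


def iter_examples(article):
    """Yield (question, answer_text or None) for every question in one article."""
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            if qa["is_impossible"]:
                # Unanswerable question: the context does not contain an answer.
                yield qa["question"], None
                continue
            first = qa["answers"][0]
            start = first["answer_start"]
            end = start + len(first["text"])
            # The answer is a character span of the context, not free text.
            assert context[start:end] == first["text"]
            yield qa["question"], context[start:end]


for question, answer in iter_examples(record):
    print(question, "->", answer)
```

The point to notice is that an answer is stored as a character offset into the context, so the model only has to point at a span instead of generating text.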

 

Methodology to resolve

In practice, transfer learning (fine-tuning a pretrained model) is the only realistic approach if you want to build a model that actually works.

 

Text Extraction with BERT

 

This one is a great exercise for anyone in how to structure the data, including tokenization, splitting, and padding.
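
The data preparation steps can be sketched roughly as below. This is not the code from the linked example: it uses a simple regex tokenizer as a stand-in for BERT's WordPiece tokenizer, and `tokenize`, `encode`, and `max_len` are names I made up. The idea is the same, though: tokenize question and context, map the answer's character span to token positions, and pad everything to a fixed length.

```python
import re


def tokenize(text):
    """Return (token, start_char, end_char) triples.

    Stand-in for WordPiece: splits into words and single punctuation marks.
    """
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def encode(question, context, answer_text, answer_start, max_len=24):
    """Build padded inputs plus start/end token labels for one example."""
    q_tokens = tokenize(question)
    c_tokens = tokenize(context)
    answer_end = answer_start + len(answer_text)

    # Context tokens that overlap the answer's character span.
    hit = [i for i, (_, s, e) in enumerate(c_tokens)
           if s < answer_end and e > answer_start]

    tokens = (["[CLS]"] + [t for t, _, _ in q_tokens] + ["[SEP]"]
              + [t for t, _, _ in c_tokens] + ["[SEP]"])
    if not hit or len(tokens) > max_len:
        return None  # this sketch simply drops examples it cannot encode

    offset = len(q_tokens) + 2  # [CLS] + question + [SEP]
    token_type_ids = [0] * offset + [1] * (len(c_tokens) + 1)
    attention_mask = [1] * len(tokens)

    # Pad every sequence to the same fixed length.
    pad = max_len - len(tokens)
    return {
        "tokens": tokens + ["[PAD]"] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "attention_mask": attention_mask + [0] * pad,
        "start_token": offset + hit[0],
        "end_token": offset + hit[-1],
    }


context = "SQuAD was built from Wikipedia articles."
answer = "Wikipedia articles"
enc = encode("What was the dataset built from?", context,
             answer, context.index(answer))
print(enc["tokens"][enc["start_token"]:enc["end_token"] + 1])
```

With a real tokenizer the tokens would be subword pieces mapped to integer IDs, but the labels are built the same way: the model is trained to predict `start_token` and `end_token`.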

 

If you are interested in state-of-the-art methods, look up IE-Net (ensemble). So far, it has recorded the top performance in the competition.

 

Potential business case

For example, if you have clear questions about finance, you can ask them using financial documents such as a 10-K as the context.

For instance, you could ask, "Which sector of company AAA is growing the fastest?" The AI would then automatically extract possible answers from the 10-K and annual reports.

The ability to handle unanswerable questions would be crucial in practical use.
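
One common way to handle this with a BERT-style span model is to compare the best answer span against a "no answer" score taken from the `[CLS]` position, and abstain when the "no answer" score wins by more than a threshold. The sketch below assumes the model already produced start and end logits per token with `[CLS]` at index 0; the function names, the default threshold, and the logit values are all made up for illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (score, start, end) of the highest-scoring non-empty span.

    Index 0 is assumed to be [CLS], so candidate spans start at index 1.
    """
    best = (float("-inf"), None, None)
    for i in range(1, len(start_logits)):
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = start_logits[i] + end_logits[j]
            if score > best[0]:
                best = (score, i, j)
    return best


def answer_or_abstain(start_logits, end_logits, null_threshold=0.0):
    """Return (start, end) token indices, or None if the model should abstain."""
    null_score = start_logits[0] + end_logits[0]  # "no answer" score at [CLS]
    score, start, end = best_span(start_logits, end_logits)
    if start is None or null_score - score > null_threshold:
        return None
    return start, end


# Made-up logits: the first case favors a span, the second favors "no answer".
print(answer_or_abstain([0.1, 0.2, 5.0, 0.1], [0.1, 0.1, 0.3, 4.0]))
print(answer_or_abstain([6.0, 0.2, 1.0, 0.1], [5.0, 0.1, 0.3, 1.0]))
```

For a finance use case, the threshold would normally be tuned on held-out questions: set it too low and the system refuses answerable questions, too high and it invents answers that are not in the 10-K.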