NLP in Finance

Financial Question Answering with Jina and BERT — Part 2

Tutorial on how to evaluate and improve your Financial QA search results with Jina

Bithiah Yuan
Towards Data Science
11 min read · Jan 15, 2021

(Image by the author)

Part 1 — Learn how to use the neural search framework, Jina, to build a Financial Question Answering (QA) search application with the FiQA dataset, PyTorch, and Hugging Face transformers.

Part 2 — Learn how to evaluate and improve your Financial QA search results with Jina

In the previous tutorial we learned how to build a production-ready Financial Question Answering search application with Jina and BERT. In order to improve our application and retrieve meaningful answers, evaluating the search results is essential for tuning the parameters of the system. For example, it can help us decide which pre-trained model to choose for the encoder, the maximum sequence length, and the type of Ranker we want to use.

Illustration from unDraw

Recall that Jina provides us with the building blocks to create a search application. Therefore, instead of implementing the evaluation metrics ourselves, we can use Jina’s Evaluator, a type of Executor.

(Image from Jina AI)

So far, we have already seen some of the Executors: the Encoder, Indexer, and Ranker. Each of these Executors is responsible for the logic of its corresponding functionality. As the name suggests, the Evaluator will contain the logic of our evaluation metrics. We have also learned how to design an Index and a Query Flow, which are pipelines for indexing and searching the answer passages.

To evaluate the search results, we will need to create an evaluation pipeline, the Evaluation Flow, for us to use in our Financial QA search application.

Tutorial

In this tutorial we will learn how to add the evaluation pipeline in our Financial QA system by designing a Flow to evaluate the search results with Precision and Mean Reciprocal Rank (MRR).

We will evaluate the search results before and after reranking with FinBERT-QA. Here is an overview of the Evaluation Flow:

Figure 1: Overview of the Evaluation Flow (Image by the author)

Setup

If you are coming from the previous tutorial, you will need to make some small changes to app.py, FinBertQARanker/__init__.py , and FinBertQARanker/tests/test_finbertqaranker.py. Joan Fontanals Martinez and I have added some helper functions and batching in the Ranker to help speed up the process.

Instead of pointing out the changes, I have made a new template to simplify the workflow and also show those of you who are already familiar with Jina how to implement the evaluation mode.

Clone project template:

git clone https://github.com/yuanbit/jina-financial-qa-evaluator-template.git

Make sure the requirements are installed and you have downloaded the data and model.

You can find the final code of this tutorial here.

Let us walk through the Evaluation Flow step-by-step.

Step 1. Define our test set data

Our working directory will be jina-financial-qa-evaluator-template/. In the dataset/ folder you should have the following files:

Figure 2: dataset structure (Image by the author)

For this tutorial we will need:

sample_test_set.pickle: a sample test set with 50 questions and ground truth answers

qid_to_text.pickle: a dictionary to map the question ids to question text

If you want to use the complete test set from FinBERT-QA, test_set.pickle, which has 333 questions and ground truth answers, you can simply change the path.

The test set that we will be working with in this tutorial is a pickle file, sample_test_set.pickle. It is a list of lists in the form [[question id, [ground truth answer ids]]], where each element contains the question id and a list of ground truth answer ids. Here is a slice from the test set:

[[14, [398960]],
[458, [263485, 218858]],
[502, [498631, 549435, 181678]],
[712, [212810, 580479, 527433, 28356, 97582, 129965, 273307]],...]

Next, similar to defining our Document for indexing the answer passages, we will create two Documents containing the data of the questions and the ground truth answers.

Figure 3: Evaluation Flow — Step 1 defining the Query and Ground truth Document (Image by the author)

Recall that in our Index Flow, when we defined our data in the index_generator function, we included the answer passage ids (docids) in the Documents. After indexing, these answer ids are stored in the index and become part of the search results at query time. Therefore, we only need to define the Ground truth Document with the ground truth answer ids for each query and compare them with the answer ids of the matches.

Let’s add a Python generator under the load_pickle function to define our test set for evaluation. For each Query Document, we will map the corresponding question id from the test set to the actual question text.
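Here is a minimal sketch of what that generator could look like, assuming the load_pickle helper from the template and the dataset paths shown above; the tag key used for the ground truth answer ids ('id' here) has to match whatever key the Index Flow stored the docids under:

from jina import Document

def evaluate_generator():
    # Load the test set and the mapping from question id to question text
    test_set = load_pickle('dataset/sample_test_set.pickle')
    qid_to_text = load_pickle('dataset/qid_to_text.pickle')

    for qid, answer_ids in test_set:
        # Query Document: holds the question text (qid kept in tags for printing later)
        query = Document()
        query.text = qid_to_text[qid]
        query.tags['qid'] = qid

        # Ground truth Document: one match per relevant answer id
        groundtruth = Document()
        for answer_id in answer_ids:
            match = Document()
            match.tags['id'] = answer_id
            groundtruth.matches.append(match)

        yield query, groundtruth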

Step 2. Encode the questions

Similar to the Query Flow, we will pass our two Documents into the Encoder Pod from pods/encode.yml. The Driver will pass the question text to the Encoder to transform it into an embedding and the same Driver will add the embedding to the Query Document. The only difference this time is that we are passing two Documents into the Encoder Pod and the Ground truth Document is immutable and stays unchanged through the Flow.

In flows/, let’s create a file called evaluate.yml to configure our Evaluation Flow and add the Encoder Pod as follows:
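The exact YAML is in the final code linked above; as a rough sketch, assuming the version-'1' Flow syntax used by Jina at the time, the file starts roughly like this:

!Flow
version: '1'
pods:
  - name: encoder
    uses: pods/encode.yml
    timeout_ready: 600000  # generous timeout while the BERT encoder loads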

The output of the Encoder will contain the Query Document with the embeddings of the questions and the Ground truth Document stays unchanged as shown in Figure 4.

Figure 4: Evaluation Flow — output of step 2: the question embeddings are added to the Query Document and the Ground truth Document stays unchanged. (Image by the author)

Step 3. Search Indexes

Next, the Indexer Pod from pods/doc.yml will search for the answers with the most similar embeddings and the Driver of the Indexer will add a list of top-k answer matches to the Query Document. The Ground truth Document remains unchanged.

Let’s add the doc_indexer to flows/evaluate.yml as follows:
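Sketched in the same way (the exact options are those in the final code), the Flow now gains a second Pod entry:

!Flow
version: '1'
pods:
  - name: encoder
    uses: pods/encode.yml
    timeout_ready: 600000
  - name: doc_indexer
    uses: pods/doc.yml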

The output of the Indexer will contain the Query Document with the answer matches and their corresponding information, as well as the unchanged Ground truth Document.

Figure 5: Evaluation Flow — output of step 3: the answer matches are added to the Query Document and the Ground truth Document stays unchanged. (Image by the author)

Step 4. Evaluation

Since I mentioned in the beginning that we will evaluate the search results both before and after reranking, you might think that now we will add the following sequence:

  1. Evaluator for the match results
  2. Ranker
  3. Evaluator for the reranked results

However, since evaluation serves to improve the results of our search system, it is not an actual component of the final application. You can think of it as a tool that provides us with information on which part of the system needs improvement.

(Illustration from unDraw)

Since our goal is to be able to inspect any part of the pipeline and to evaluate at arbitrary places in the Flow, we can use the Flow API’s inspect feature to attach the Evaluator Pods to the main pipeline so that the evaluations do not block messages to the other components of the pipeline.

For example, without the inspect mode, we would have the sequential design mentioned above. With the inspect mode, after retrieving the answer matches from the Indexer, the Documents will be sent to an Evaluator and the Ranker in parallel. Consequently, the Ranker won’t have to wait for the initial answer matches to be evaluated before it can output the reranked answer matches!

The benefit of this design in our QA system is that the Evaluator can perform evaluations without blocking the progress of the Flow, because it is independent from the other components of the pipeline. You can think of the Evaluator as a side task running in parallel with the Flow. As a result, we can evaluate with minimal impact on the performance of the Flow.

You can refer to this article to learn more about the design of the evaluation mode and the inspect feature.

Let’s take a closer look at the evaluation part of the Flow:

Figure 6: A closer look at the evaluation parts of the Flow (Image by author)

In Figure 6, we can see that the evaluate_matching Pod, the Evaluator responsible for evaluating the answer matches, works in parallel with the Ranker and the evaluate_ranking Pod, the Evaluator responsible for evaluating the reranked answer matches.

gather_inspect is used to accumulate the evaluation results for each query. Furthermore, the auxiliary Pods shown before and after the Ranker are constructions that give the Evaluation Flow the same Pod connections the Query Flow would have without the Evaluators, so that the performance of retrieving and reranking the answer matches is only minimally affected.

Previously, we used an Encoder and Indexer from Jina Hub, an open registry for hosting Executors via container images. We can again take advantage of Jina Hub and simply use the Precision and Reciprocal Rank Pods that are already available!

Now let’s take a look at the Matching and Ranking Evaluators:

Matching Evaluator

After the Indexer Pod outputs the Query Document with the matches, one workflow will involve the Matching Evaluator, which is responsible for computing the precision and reciprocal rank of the answer matches (without reranking).

The Driver of the Matching Evaluator Pod interprets both the Query and Ground truth Document and passes the answer match ids and the desired ground truth answer ids per query to the Matching Evaluator, which computes the precision and reciprocal rank values as shown in Figure 7.

Figure 7: The Driver of Matching Evaluator Pod passes the answer match id and desired answer id to the Matching Evaluator. (Image by the author)

Now, let’s create our Matching Evaluator. In the folder pods/, create a file called evaluate_matching.yml and add the following to it:
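As a rough sketch of the shape of that file (component names are illustrative; the exact Driver options are in the final code), it wires two Evaluators and a RankEvaluateDriver for each:

!CompoundExecutor
components:
  - !PrecisionEvaluator
    with:
      eval_at: 10
    metas:
      name: precision_matching
  - !ReciprocalRankEvaluator
    with:
      eval_at: 10
    metas:
      name: reciprocalrank_matching
requests:
  on:
    SearchRequest:
      - !RankEvaluateDriver
        metas:
          executor: precision_matching
      - !RankEvaluateDriver
        metas:
          executor: reciprocalrank_matching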

PrecisionEvaluator and ReciprocalRankEvaluator are the Evaluators for precision and reciprocal rank from Jina Hub. We specify eval_at: 10 to evaluate the top-10 answer matches. We also indicate the name for each component, which we will use later in the Evaluation Flow.

You might be wondering from the last tutorial why we didn’t need to specify the Driver in pods/encode.yml and pods/doc.yml, since the Peas in the Pods require both components. This is because the Drivers for these two Pods are commonly used and are already included by default. However, since we want to use two Evaluators of our choice (Precision and Reciprocal Rank) from Jina Hub, we need to specify the Driver for each Evaluator, namely RankEvaluateDriver.

Next, let us add this Pod to the Evaluation Flow. In flows/evaluate.yml add evaluate_matching as follows:
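A sketch of the new entry appended under pods: in flows/evaluate.yml (the earlier entries stay as they are):

pods:
  # ... encoder and doc_indexer as before ...
  - name: evaluate_matching
    uses: pods/evaluate_matching.yml
    method: inspect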

Here we indicate method: inspect because we are using the inspect feature from the Flow API to inspect the performance of our application in the middle of the Flow.

The Driver of the Matching Evaluator will add the evaluation results of the answer matches to the Query Document for each query as shown in Figure 8.

Figure 8: The evaluation results of the answer matches are added to the Query Document (Image by author)

Well done! We have successfully implemented the first Evaluator. Let us look at how to evaluate the reranked search results next.

Ranking Evaluator

Another workflow after the Indexer Pod involves the Ranking Evaluator, which is responsible for computing the precision and reciprocal rank of the answer matches reranked with FinBERT-QA. The construction of the Ranking Evaluator Pod is similar to that of the Matching Evaluator Pod; the only difference is that we pass the reordered match ids to the Ranking Evaluator, as shown in Figure 9.

Figure 9: The Driver of Ranking Evaluator Pod passes the reordered answer match id and desired answer id to the Ranking Evaluator. (Image by author)

Let’s create our Ranking Evaluator. In the folder pods/, create a file called evaluate_ranking.yml. We will add the following to the file:
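As a sketch, the file mirrors evaluate_matching.yml with only the names changed, for example:

!CompoundExecutor
components:
  - !PrecisionEvaluator
    with:
      eval_at: 10
    metas:
      name: precision_ranking
  - !ReciprocalRankEvaluator
    with:
      eval_at: 10
    metas:
      name: reciprocalrank_ranking
# requests: the same RankEvaluateDriver wiring as in evaluate_matching.yml,
# pointing at the ranking component names above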

Notice that this is almost identical to evaluate_matching.yml except for the naming conventions.

In the previous tutorial we learned how to build a custom Ranker. We need to build the docker image for this Ranker in order to use it as a Pod in our Flow.

Note: you will need to rebuild this image even if you have built it in the previous tutorial because batching has been added.

Make sure you have the Jina Hub extension installed:

pip install "jina[hub]"

In the working directory type:

jina hub build FinBertQARanker/ --pull --test-uses --timeout-ready 60000

You should get a message indicating that you have successfully built the image, along with its tag name. The tag depends on the current Jina release, so make sure to change the tag name accordingly when using the image as a Pod.

Next, let us add the Ranker and Evaluate Ranking Pod to the Evaluation Flow in flows/evaluate.yml:
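A sketch of the two new entries (the Docker image reference is a placeholder; use the image and tag reported by jina hub build):

pods:
  # ... encoder, doc_indexer and evaluate_matching as before ...
  - name: ranker
    uses: docker://<image-tag-from-hub-build>
    timeout_ready: 60000
  - name: evaluate_ranking
    uses: pods/evaluate_ranking.yml
    method: inspect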

The Indexer Pod will pass the Query Document with the answer matches, along with the Ground truth Document, to the Ranker containing FinBERT-QA. The Ranker Pod will then output the Query Document with the reordered answer match ids and the Ground truth Document, which are both passed to the Ranking Evaluator. The output, shown in Figure 10, is the same as that of the Matching Evaluator, except that the evaluation values are computed from the reordered answer matches.

Figure 10: After the Ranker and Ranking Evaluator, the evaluation values computed from the reordered answer matches would be added to the Query Document. (Image by the author)

Great job! We have just finished the design of the Evaluation Flow. Next let us see how to use it in our search application.

Step 5. Get Evaluation Results

Similar to the index function, let’s add an evaluate function in app.py after evaluate_generator, which will load the Evaluation Flow from flows/evaluate.yml and pass the input Query and Ground truth Documents from evaluate_generator to the Flow. We set top_k to 10 to evaluate precision@10 and reciprocal-rank@10.
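A sketch of such a function, assuming the Flow API of the Jina version pinned in the template and a response callback (here called print_evaluations, sketched further below) that prints the results:

from jina import Flow

def evaluate():
    # Load the Evaluation Flow and stream (query, ground truth) pairs through it
    flow = Flow.load_config('flows/evaluate.yml')
    with flow:
        flow.search(input_fn=evaluate_generator, top_k=10, on_done=print_evaluations)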

Since we want to compute the average precision@10 and mean-reciprocal-rank@10 across all queries in the test set, we will write a function, print_average_evaluations, to compute the average of the evaluation values.
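One possible shape for this helper, assuming the per-query values have been collected into a dictionary keyed by evaluation name (as done by the callback sketched below):

def print_average_evaluations(evaluation_values):
    # evaluation_values: dict mapping evaluation name -> list of per-query values
    print('Average Evaluation Results')
    for name, values in evaluation_values.items():
        print(f'{name}: {sum(values) / len(values)}')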

The final evaluation values will be stored in the Query Document. As with the print_resp function, we can write a function to print out the evaluation response by looping through the evaluations in our Query Document, d.evaluations, and printing out the values for each Evaluator:
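A sketch of that callback, assuming the response structure used by print_resp in the previous tutorial and that evaluate_generator stored the question id in the Document tags; it also collects the values so the averages can be printed at the end:

from collections import defaultdict

evaluation_values = defaultdict(list)

def print_evaluations(resp):
    for d in resp.search.docs:
        print(f"Evaluations for QID:{int(d.tags['qid'])}")
        for evaluation in d.evaluations:
            # evaluation.op_name is the metric label (e.g. Matching-Precision@10),
            # evaluation.value is its score for this query
            print(f'{evaluation.op_name}: {evaluation.value}')
            evaluation_values[evaluation.op_name].append(evaluation.value)

After the Flow finishes, print_average_evaluations(evaluation_values) prints the averages across all queries.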

Hooray! 🎉🎉🎉 We have just implemented an evaluation mode in our Financial QA search engine! We can now run:

python app.py evaluate

As this tutorial is for educational purposes, we only indexed a portion of the answer passages and used a small sample test set. Therefore, the results cannot be compared to the results from FinBERT-QA. Feel free to index the entire answer collection, evaluate on the complete test set, and share your results!

You will see the individual questions being evaluated and printed. Here is an example for question id 1282.


Evaluations for QID:1282
Matching-Precision@10: 0.10000000149011612
Matching-ReciprocalRank@10: 1.0
Ranking-Precision@10: 0.10000000149011612
Ranking-ReciprocalRank@10: 0.125

At the end you will see the average evaluation results:

Average Evaluation Results
Matching-Precision@10: 0.056000000834465026
Matching-ReciprocalRank@10: 0.225
Ranking-Precision@10: 0.056000000834465026
Ranking-ReciprocalRank@10: 0.118555556088686

Summary

In this tutorial, I introduced the evaluation feature in Jina and demonstrated how we can design an Evaluation Flow for our Financial QA search application. We learned how to use the inspect mode to create our Evaluator Pods and how it benefits our application by minimizing the impact of evaluation on the performance of the pipeline.

Be sure to check out Jina’s Github page to learn more and get started on building your own deep-learning-powered search applications!

Community

  • Slack channel — a communication platform for developers to discuss Jina
  • Community newsletter — subscribe to the latest updates, releases, and event news of Jina
  • LinkedIn — get to know Jina AI as a company and find job opportunities
  • Twitter — follow and interact with Jina AI using hashtag #JinaSearch
  • Company — learn more about Jina AI and their commitment to open source!
