Query Classifier
The Query Classifiers in Haystack distinguish between three different classes of queries:
- Keywords
- Questions
- Statements
Based on this classification, the Query Classifier can route the query to a specified branch of the Pipeline. By passing queries on to Nodes that are better suited to handle them, you get better search results.
For example, the Dense Passage Retriever is trained on full questions, so it works best if you only pass questions to it. By routing keyword queries to a BM25 Retriever, such as the ElasticsearchRetriever, you also reduce the load on the GPU-powered Dense Passage Retriever.
| | |
| --- | --- |
| Position in a Pipeline | At the beginning of a query Pipeline |
| Input | Query |
| Output | Query |
| Classes | TransformersQueryClassifier, SklearnQueryClassifier |
The Query Classifier populates the metadata fields of the query with its classification and can also route the query based on that classification.
Query Types
Keyword Queries
Such queries don't have sentence structure. They consist of keywords and the order of words does not matter:
- arya stark father
- jon snow country
- arya stark younger brothers
Questions
In such queries, users ask a question in a complete, grammatical sentence. A Query Classifier should be able to classify a query regardless of whether or not it ends with a question mark (see the sketch after the examples below):
- who is the father of arya stark?
- which country was jon snow filmed in
- who are the younger brothers of arya stark?
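A minimal sketch of what to expect from the default keyword vs. question/statement model (introduced in the Usage section below); both variants should be routed to the same output edge:
```python
from haystack.nodes import TransformersQueryClassifier

classifier = TransformersQueryClassifier(model_name_or_path="shahrukhx01/bert-mini-finetune-question-detection")

# With and without the trailing question mark, both queries should end up
# on the same output edge (output_1 => question/statement).
for query in ["which country was jon snow filmed in", "which country was jon snow filmed in?"]:
    _, edge = classifier.run(query=query)
    print(query, "->", edge)
```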
Statements
This type of query is a declarative sentence, such as:
- Arya Stark was a daughter of a lord.
- Show countries that Jon Snow was filmed in.
- List all brothers of Arya.
Usage
To use the Query Classifier as a stand-alone Node:
```python
from haystack.nodes import TransformersQueryClassifier

queries = [
    "Arya Stark father",
    "Jon Snow UK",
    "who is the father of arya stark?",
    "Which country was jon snow filmed in?",
]

question_classifier = TransformersQueryClassifier(model_name_or_path="shahrukhx01/bert-mini-finetune-question-detection")
# Or Sklearn based:

for query in queries:
    result = question_classifier.run(query=query)
    if result[1] == "output_1":
        category = "question"
    else:
        category = "keywords"
    print(f"Query: {query}, raw_output: {result}, class: {category}")

# Returns:
# Query: Arya Stark father, raw_output: ({'query': 'Arya Stark father'}, 'output_2'), class: keywords
# Query: Jon Snow UK, raw_output: ({'query': 'Jon Snow UK'}, 'output_2'), class: keywords
# Query: who is the father of arya stark?, raw_output: ({'query': 'who is the father of arya stark?'}, 'output_1'), class: question
# Query: Which country was jon snow filmed in?, raw_output: ({'query': 'Which country was jon snow filmed in?'}, 'output_1'), class: question
```
Note how the node returns two objects: the query (e.g. 'Arya Stark father') and the name of the output edge (e.g. 'output_2'). This information can be leveraged in a pipeline to route the query to the next node.
You can use a Query Classifier within a pipeline as a decision node. Depending on the output of the classifier, only one branch of the Pipeline is executed. For example, we can route keyword queries to an ElasticsearchRetriever and questions and statements to DPR.
Below, we define a pipeline with a TransformersQueryClassifier that routes questions and statements to the node's output_1 and keyword queries to output_2. We leverage this structure in the pipeline by connecting the DPRRetriever to QueryClassifier.output_1 and the BM25Retriever to QueryClassifier.output_2.
```python
from haystack import Pipeline
from haystack.nodes import TransformersQueryClassifier
from haystack.utils import print_answers

query_classifier = TransformersQueryClassifier(model_name_or_path="shahrukhx01/bert-mini-finetune-question-detection")

# dpr_retriever and bm25_retriever are assumed to have been initialized beforehand
pipe = Pipeline()
pipe.add_node(component=query_classifier, name="QueryClassifier", inputs=["Query"])
pipe.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["QueryClassifier.output_1"])
pipe.add_node(component=bm25_retriever, name="BM25Retriever", inputs=["QueryClassifier.output_2"])

# Pass a question -> run DPR
res_1 = pipe.run(query="Who is the father of Arya Stark?")

# Pass keywords -> run the BM25Retriever
res_2 = pipe.run(query="arya stark father")
```
One alternative setup is to route questions to a Question Answering branch and keywords to a Document Search branch:
```python
from haystack import Pipeline
from haystack.nodes import TransformersQueryClassifier
from haystack.utils import print_answers

query_classifier = TransformersQueryClassifier(model_name_or_path="shahrukhx01/question-vs-statement-classifier")

# dpr_retriever, bm25_retriever, and reader are assumed to have been initialized beforehand
pipe = Pipeline()
pipe.add_node(component=query_classifier, name="QueryClassifier", inputs=["Query"])
pipe.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["QueryClassifier.output_1"])
pipe.add_node(component=bm25_retriever, name="BM25", inputs=["QueryClassifier.output_2"])
pipe.add_node(component=reader, name="QAReader", inputs=["DPRRetriever"])

# Pass a question -> run DPR + QA -> return answers
res_1 = pipe.run(query="Who is the father of Arya Stark?")

# Pass keywords -> run only the BM25 retriever -> return documents
res_2 = pipe.run(query="arya stark father")
```
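The print_answers utility imported above can be used to inspect the output of the QA branch. A minimal sketch, assuming res_1 and res_2 from the example above and that the retriever's output is returned under the documents key:
```python
from haystack.utils import print_answers

# Answers from the Question Answering branch
print_answers(res_1, details="minimum")

# Documents from the Document Search branch
for doc in res_2["documents"]:
    print(doc.meta.get("name"), "-", doc.content[:100])
```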
Models
The TransformersQueryClassifier is more accurate than the SklearnQueryClassifier because it is sensitive to the syntax of a sentence. However, it requires more memory and a GPU in order to run quickly. You can mitigate these downsides by choosing a smaller transformer model. The default models that we trained use a mini BERT architecture, which is about 50 MB in size and allows relatively fast inference on CPU.
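For example, a sketch of loading the default mini BERT classifier for CPU-only inference; the use_gpu parameter is an assumption and may differ between Haystack versions:
```python
from haystack.nodes import TransformersQueryClassifier

# Load the default mini BERT classifier and keep inference on the CPU.
# The use_gpu flag is an assumption; check your Haystack version for the exact parameter.
cpu_classifier = TransformersQueryClassifier(
    model_name_or_path="shahrukhx01/bert-mini-finetune-question-detection",
    use_gpu=False,
)
```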
Transformers
Pass your own Transformer binary classification model from file or use one of the following pretrained models hosted on Hugging Face (a sketch for loading a local model follows the examples below):
Keywords vs. Questions/Statements (Default)
```python
TransformersQueryClassifier(model_name_or_path="shahrukhx01/bert-mini-finetune-question-detection")
# output_1 => question/statement
# output_2 => keyword query
```
Learn more about this model from its model card.
Questions vs. Statements
```python
TransformersQueryClassifier(model_name_or_path="shahrukhx01/question-vs-statement-classifier")
# output_1 => question
# output_2 => statement
```
Learn more about this model from its model card.
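A minimal sketch of passing a locally saved model instead of a Hugging Face model ID; the path below is a hypothetical example:
```python
from haystack.nodes import TransformersQueryClassifier

# Hypothetical local directory containing a binary classifier saved with save_pretrained()
custom_classifier = TransformersQueryClassifier(model_name_or_path="./my_query_classifier")
```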
Sklearn
Pass your own Sklearn binary classification model or use one of the following pretrained gradient boosting models:
Keywords vs. Questions/Statements (Default)
```python
SklearnQueryClassifier(
    query_classifier="https://ext-models-haystack.s3.eu-central-1.amazonaws.com/gradboost_query_classifier/model.pickle",
    query_vectorizer="https://ext-models-haystack.s3.eu-central-1.amazonaws.com/gradboost_query_classifier/vectorizer.pickle",
)
# output_1 => question/statement
# output_2 => keyword query
```
Learn more about this model from its readme.
Questions vs. Statements
```python
SklearnQueryClassifier(
    query_classifier="https://ext-models-haystack.s3.eu-central-1.amazonaws.com/gradboost_query_classifier_statements/model.pickle",
    query_vectorizer="https://ext-models-haystack.s3.eu-central-1.amazonaws.com/gradboost_query_classifier_statements/vectorizer.pickle",
)
# output_1 => question
# output_2 => statement
```
Learn more about this model from its readme.
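The Sklearn classifiers expose the same run interface and output edges as the TransformersQueryClassifier, so they can be dropped into the pipelines above. A minimal stand-alone sketch, reusing the default keyword vs. question/statement model and assuming the import path and parameter names shown above:
```python
from haystack.nodes import SklearnQueryClassifier

sklearn_classifier = SklearnQueryClassifier(
    query_classifier="https://ext-models-haystack.s3.eu-central-1.amazonaws.com/gradboost_query_classifier/model.pickle",
    query_vectorizer="https://ext-models-haystack.s3.eu-central-1.amazonaws.com/gradboost_query_classifier/vectorizer.pickle",
)

# Same return shape as the Transformers classifier: the query dict and the output edge name
result, edge = sklearn_classifier.run(query="arya stark father")
print(edge)  # expected: "output_2" (keyword query)
```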