Extractive Question Answering with Data from Dell Forums ⛏️
By Cian Prendergast
Hello there, data enthusiasts! Today, we’re diving deep into the world of Extractive Question Answering (QA). For those unfamiliar, extractive QA is all about pinpointing exact answers from large text sources. Imagine being able to instantly find the solution to a hardware issue from an ocean of forum posts. Sounds cool, right? Let’s get started!
Let’s Dive into Scraping!
Why Dell?
We’ve chosen the Dell support forums, specifically the PowerEdge-Hardware-General section, as our data source. These forums are a goldmine of real-world hardware issues and their solutions, making them perfect for our project.
The Challenge:
Web scraping isn’t always a walk in the park. We encountered some challenges:
- The pesky “Accept Cookies” button (a special shoutout to our EU-based networks!).
- The need to filter posts by “Solved” cases, ensuring we only get the cream of the crop.
- The “Load More” button that seemed to play hard-to-get by moving further down with each click.
But fear not! With a sprinkle of Python, a pinch of Selenium, and a dash of patience, we tackled these issues head-on.
The Solution:
To deal with the increasing page length, we used a mix of time.sleep() and the execute_script method. This allowed us to ensure that every time we reached the bottom of the webpage, we could successfully click the “Load More” button and grab more data. For the curious, our code looks something like this:
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def click_accept_cookies_button(driver):
    """Click the 'Accept All' button for cookies."""
    accept_button = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.XPATH, '//a[@aria-label="allow cookies"]'))
    )
    accept_button.click()


def select_solved_option(driver):
    """Select the 'Solved' option from the dropdown."""
    select_element = driver.find_element(By.ID, 'messages-loader-type')
    option_solved = select_element.find_element(By.XPATH, "//option[@value='solved']")
    option_solved.click()


def scroll_to_bottom(driver):
    """Scroll to the bottom of the page."""
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")


def click_load_more(driver, num_clicks):
    """Click the 'Load more' button the specified number of times."""
    for _ in range(num_clicks):
        load_more_button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.ID, 'btn-load-more'))
        )
        time.sleep(3)  # Wait before clicking
        scroll_to_bottom(driver)
        time.sleep(3)  # Wait after scrolling
        load_more_button.click()
        time.sleep(6)  # Wait after clicking


def extract_urls(driver):
    """Extract and return all href URLs on the page."""
    urls = driver.find_elements(By.XPATH, '//a[@href]')
    return [url.get_attribute('href') for url in urls]


def automate_dell_forum(num_clicks):
    # Instantiate the Selenium web driver and maximize the browser window
    driver = webdriver.Chrome()
    driver.maximize_window()

    # Navigate to the Dell community forum page
    driver.get('https://www.dell.com/community/PowerEdge-Hardware-General/bd-p/PowerEdge-General-HW')

    click_accept_cookies_button(driver)
    select_solved_option(driver)

    # Wait for the page to load after selecting the "Solved" option
    time.sleep(2)
    scroll_to_bottom(driver)
    time.sleep(2)  # Wait before clicking "Load more"

    click_load_more(driver, num_clicks)

    # Extract URLs
    url_list = extract_urls(driver)

    # Close the Selenium web driver
    driver.quit()
    return url_list


url_list = automate_dell_forum(189)  # To run with 189 clicks
After some clicks and scrolls, we saved the scraped URLs to a CSV file and a text file. We now have a treasure trove of forum posts ready for our next steps!
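For reference, saving the scraped URLs might have looked something like the sketch below; the filenames dell_urls.csv and dell_urls.txt are illustrative assumptions, not the exact names used in the project:

import pandas as pd

# Save to CSV (hypothetical filename)
pd.DataFrame({"url": url_list}).to_csv("dell_urls.csv", index=False)

# Save to a plain text file, one URL per line (hypothetical filename)
with open("dell_urls.txt", "w") as f:
    f.write("\n".join(url_list))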
If you’re wondering how I was able to figure out which HTML elements to target, it’s simply a matter of using a browser’s Developer Tools and walking down the tree to see what gets highlighted. I will go into more detail in a future blog post!
Filtering and Cleaning: Getting the Best Data
Once we scraped our data, we were left with a vast list of URLs. However, not every URL was relevant to our goals. To sift through the noise and get to the good stuff, we implemented a filtering function:
def filter_urls(url_list):
    """Keep only forum-post URLs, dropping navigation and landing-page links."""
    filtered = []
    exclude_urls = [
        'https://www.dell.com/community/PowerEdge-Hardware-General/bd-p/PowerEdge-General-HW#',
        'https://www.dell.com/community/PowerEdge-Hardware-General/bd-p/custom.dell.link.solutions.href',
        'https://www.dell.com/community/PowerEdge-Hardware-General/bd-p/custom.dell.link.careers.href',
        'https://www.dell.com/community/PowerEdge-Hardware-General/bd-p/custom.dell.link.about.href',
        'https://www.dell.com/community/PowerEdge-Hardware-General/bd-p/PowerEdge-General-HW'
    ]
    for url in url_list:
        if url.startswith('https://www.dell.com/community/PowerEdge-Hardware-General/') and url not in exclude_urls:
            filtered.append(url)
    return filtered


# Filter the urls
filtered_urls = filter_urls(url_list)
Extracting the Essence: Questions and Answers
With our refined list of URLs, we proceeded to extract the real treasures: the questions and answers from the forum posts.
Using the power of BeautifulSoup, we devised a function to navigate each URL and fetch the required data:
import requests
import pandas as pd
from bs4 import BeautifulSoup
from tqdm import tqdm


def extract_elements_with_class(urls, class_name):
    """Fetch each URL and grab the text of the first two elements with the given class."""
    elements_list = []
    session = requests.Session()
    for url in tqdm(urls, desc="Extracting elements"):
        try:
            response = session.get(url)
            soup = BeautifulSoup(response.content, 'html.parser')
            elements = soup.find_all(class_=class_name, limit=2)  # Limit to the first two elements
            for element in elements:
                elements_list.append(element.text.strip())
            for _ in range(2 - len(elements)):
                elements_list.append("")  # Append empty strings if elements are not found
        except requests.exceptions.RequestException:
            elements_list.extend(["", ""])  # Append empty strings if there's an error
    return elements_list


class_name = "lia-message-body-content"
extracted_elements = extract_elements_with_class(filtered_urls, class_name)

# Ensure extracted_elements has an even number of elements
if len(extracted_elements) % 2 != 0:
    extracted_elements.append("")  # Append an empty string to make it even

# Split the extracted elements into Questions and Answers lists
Questions = extracted_elements[::2]
Answers = extracted_elements[1::2]

# Create a dataframe called QA
QA = pd.DataFrame({"Questions": Questions, "Answers": Answers})
This function allowed us to neatly segment our data into questions and their corresponding answers, paving the way for the creation of our QA_large DataFrame.
Data is most useful when it’s clean and organized. To this end, we saved our carefully curated dataset into a CSV file:
QA_large.to_csv('QA_large_cleaned.csv', index=False)
SQuAD Format with Haystack Annotation
Models already trained on the SQuAD dataset will expect new, smaller closed-domain datasets to follow the same structure.
This is where the Haystack Annotation tool comes in: not only is it a GUI that lets us highlight answer spans, it also exports automatically in the expected SQuAD nested JSON format!
Please note: Deepset will shortly stop supporting the online tool; you will have to run it locally via Docker, please see details here.
For importing our data into the annotation tool, we must use comma-separated CSVs and include the header in the first line. You might also want to wrap text in quotation marks, e.g. if it contains commas. We use pandas’ to_csv method, which should format the file correctly.
Example:
docs.csv:
document_identifier,document_text
id1,bbbbb
id2,muh
the same holds for the questions.csv:
question,document_identifier,question_identifier
question1,id1,qid1
question2,id2,qid2
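For a concrete (if hypothetical) picture, the two files could be produced from our scraped QA DataFrame roughly like this; the id/qid numbering scheme here is an assumption for illustration, and to_csv takes care of quoting fields that contain commas:

import pandas as pd

# Assumes QA is the DataFrame of scraped Questions/Answers built earlier
doc_ids = [f"id{i}" for i in range(len(QA))]

docs = pd.DataFrame({
    "document_identifier": doc_ids,
    "document_text": QA["Answers"],  # the text we will highlight answer spans in
})
docs.to_csv("docs.csv", index=False)

questions = pd.DataFrame({
    "question": QA["Questions"],
    "document_identifier": doc_ids,
    "question_identifier": [f"qid{i}" for i in range(len(QA))],
})
questions.to_csv("questions.csv", index=False)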
We can now use the tool to highlight relevant passages from the community answers.
Once we have highlighted the answer spans, we can export and save them as a JSON file (Dell_QA_200.json). Our dataset is now structured like so:
import json

with open('Dell_QA_200.json', 'r') as file:
    data = json.load(file)

print(json.dumps(data, indent=4))  # A pretty print to visualize the dataset.
{
"data": [
{
"paragraphs": [
{
"qas": [
{
"question": "Does v9 license works with v8 platform?When I try to import the v9 license I am having this error. LIC017.\u00a0",
"id": 1082044,
"answers": [
{
"answer_id": 980739,
"document_id": 1619992,
"question_id": 1082044,
"text": "IDRAC v9 license is not compatible with iDRAC V8 platforms.",
"answer_start": 0,
"answer_end": 59,
"answer_category": null
}
],
"is_impossible": false
}
],
"context": "IDRAC v9 license is not compatible with iDRAC V8 platforms. IDRAC license is tied with each system and you will not be able to install on another system other than the server it is intended.Thanks,DELL-Shine K#IWork4DellView solution in original post",
"document_id": 1619992
}
Question and Answer Length
Later we came across an issue with the Haystack annotation tool and its restriction of 255 characters for questions. So let’s quickly check how many questions exceed that length limit:
# Get the length of each text entry in 'document_text'
df['text_length'] = df['document_text'].str.len()

# Calculate average, highest, and lowest length
average_length = df['text_length'].mean()
max_length = df['text_length'].max()
min_length = df['text_length'].min()

# Calculate percentage of entries with length greater than 255 characters
percentage_over_255 = (df[df['text_length'] > 255].shape[0] / df.shape[0]) * 100

print(f"Average Length: {average_length:.2f}")
print(f"Maximum Length: {max_length}")
print(f"Minimum Length: {min_length}")
print(f"Percentage of Texts over 255 characters: {percentage_over_255:.2f}%")
Average Length: 597.39
Maximum Length: 27039
Minimum Length: 0
Percentage of Texts over 255 characters: 79.01%
We can also plot these:
import matplotlib.pyplot as plt

# Assuming 'average_length' is already calculated and 'df' has a 'text_length' column
plt.figure(figsize=(10, 6))
plt.hist(df['text_length'], bins=50, range=(0, 3000), color='skyblue', edgecolor='black')  # Adjusted bins and range
plt.axvline(average_length, color='red', linestyle='dashed', linewidth=1, label=f"Average Length: {average_length:.2f}")
plt.axvline(255, color='green', linestyle='dotted', linewidth=1, label="Length: 255 characters")
plt.title("Distribution of Text Lengths")
plt.xlabel('Text Length')
plt.ylabel('Number of Entries')
plt.legend()
plt.grid(axis='y')
plt.show()
Figure: histogram of text lengths, with the average length and the 255-character limit marked.
We’ve observed that a significant portion of questions are longer than 255 characters. This isn’t surprising given the need for detailed explanations of technical challenges and prior troubleshooting. This unexpected length initially led us to truncate questions, unfortunately losing valuable information. Going forward, we could segment longer questions and later recombine them after annotation. Alternatively, we could narrow our focus to questions of 255 characters or fewer, but this would exclude a vast 79% of our dataset. Importantly, there’s no character limit for answers, a design influenced by the original SQuAD’s emphasis on concise questions derived from comprehensive Wikipedia content.
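For illustration, filtering or truncating to the 255-character limit could look something like the sketch below; QA_large and the Questions column are the names used above, but neither approach was applied wholesale to the final dataset:

# Option A: keep only questions within the annotation tool's 255-character limit
QA_short = QA_large[QA_large["Questions"].str.len() <= 255]

# Option B: truncate longer questions (at the cost of losing troubleshooting detail)
QA_truncated = QA_large.copy()
QA_truncated["Questions"] = QA_truncated["Questions"].str.slice(0, 255)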
My approach, while inspired by SQuAD and the TechQA project, focuses more on real-world, detailed questions and succinct forum answers. Solely highlighting articles referenced in answers, like how-to guides, would mean losing vital reasoning context. Consider the example below:
Using only the firmware landing page linked here from the above answer would deprive us of essential “step-by-step update” details.
Training the Dell QA Model: A Deep Dive
So our dataset is stored in a file named Dell_QA_200.json. We can transform this intricate JSON structure into a more palatable DataFrame:
rows = []
for dat in data['data']:
    for paragraph in dat['paragraphs']:
        context = paragraph['context']
        for qa in paragraph['qas']:
            question = qa['question']
            for answer in qa['answers']:
                answer_text = answer['text']
                answer_start = answer['answer_start']
                rows.append([question, answer_text, answer_start, context])

df = pd.DataFrame(rows, columns=['question', 'answers.text', 'answers.answer_start', 'context'])
Crafting the Train/Test Split
Before diving into training, it’s crucial to split our dataset. We’ve reserved 80% for training and set aside the remaining 20% for testing:
import random
"data"])
random.shuffle(squad_data[
= int(0.8 * len(squad_data["data"]))
train_size = len(squad_data["data"]) - train_size
test_size = squad_data["data"][:train_size]
train_data = squad_data["data"][train_size:]
test_data
# Saving them for posterity (and our model!)
with open('Dell_QA_200_train.json', 'w') as train_file:
"data": train_data}, train_file)
json.dump({
with open('Dell_QA_200_test.json', 'w') as test_file:
"data": test_data}, test_file) json.dump({
Tokenization: The First Step Towards Understanding
Welcome back! If you remember from our last post, we dived deep into preparing our data. This time, let’s talk about how we make sense of this data: Tokenization.
What’s Tokenization Anyway?
In simple terms, tokenization is the process of converting our text (questions and contexts) into smaller chunks, called tokens. This is an essential step because our models don’t understand text as we do; they understand numbers. Tokens are a bridge between human-readable text and machine-understandable numbers.
from transformers import AutoTokenizer

model_ckpt = "deepset/minilm-uncased-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
Using a sample question and context, we can see tokenization in action:
= "How much music can this hold?"
question = """An MP3 is about 1 MB/minute, so about 6000 hours depending on file size."""
context = tokenizer(question, context, return_tensors="pt")
inputs
= pd.DataFrame.from_dict(tokenizer(question, context), orient="index")
input_df input_df
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | … | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
input_ids | 101 | 2129 | 2172 | 2189 | 2064 | 2023 | 2907 | 1029 | 102 | 2019 | … | 2061 | 2055 | 25961 | 2847 | 5834 | 2006 | 5371 | 2946 | 1012 | 102 |
token_type_ids | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | … | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
attention_mask | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | … | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
3 rows × 28 columns
Simplifying with Pipelines
Transformers offers a nifty feature called pipelines, which simplifies our QA process:
from transformers import AutoModelForQuestionAnswering, pipeline

model = AutoModelForQuestionAnswering.from_pretrained(model_ckpt)  # load the QA model for the checkpoint above
pipe = pipeline("question-answering", model=model, tokenizer=tokenizer)
pipe(question=question, context=context, topk=3)
But what if our context doesn’t contain the answer? The pipeline gracefully handles such scenarios:
="Why is there no data?", context=context, handle_impossible_answer=True) pipe(question
{'score': 0.906841516494751, 'start': 0, 'end': 0, 'answer': ''}
Sliding Windows: Seeing the Bigger Picture
Remember the challenge of tokenizing long passages? The sliding window approach is our knight in shining armor. By using return_overflowing_tokens=True and specifying a stride, we ensure that our model gets overlapping views of a passage without missing any potential answers.
Let’s see it in action:
example = df.iloc[0][["question", "context"]]
tokenized_example = tokenizer(example["question"], example["context"],
                              return_overflowing_tokens=True, max_length=100,
                              stride=25)
Each window has a specific number of tokens:
for idx, window in enumerate(tokenized_example["input_ids"]):
    print(f"Window #{idx} has {len(window)} tokens")
Window #0 has 89 tokens
When we decode these tokens, we can visualize the overlapping windows:
for window in tokenized_example["input_ids"]:
    print(f"{tokenizer.decode(window)} \n")
[CLS] does v9 license works with v8 platform? when i try to import the v9 license i am having this error. lic017. [SEP] idrac v9 license is not compatible with idrac v8 platforms. idrac license is tied with each system and you will not be able to install on another system other than the server it is intended. thanks, dell - shine k # iwork4dellview solution in original post [SEP]
With this setup, our model gets a comprehensive view of the context, ensuring we don’t miss out on potential answers.
ElasticSearch: Turbocharging Our QA System
ElasticSearch is a powerhouse for search and analytics. It’s our chosen tool to index and retrieve relevant contexts efficiently when posing questions.
Setting it up involves:
- Downloading ElasticSearch:
= "https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.9.0-linux-x86_64.tar.gz"
url !wget -nc -q {url}
!tar -xzf elasticsearch-8.9.0-linux-x86_64.tar.gz
- Performing system configurations:
!sudo usermod -aG docker $USER
!sudo reboot
- Installing required packages:
!pip install --upgrade numexpr
With ElasticSearch in place, our QA system is turbocharged, ensuring rapid and accurate retrievals.
Launching ElasticSearch (The Docker Way)
For those using Docker, here’s a quick way to get ElasticSearch up and running:
from haystack.utils import launch_es
launch_es()
A quick check to ensure ElasticSearch is running:
!curl -X GET "localhost:9200/?pretty"
Populating ElasticSearch: From DataFrame to DocumentStore
To make our data retrievable, we need to format it appropriately and populate our ElasticSearch index:
docs = []

# Convert DataFrame rows to dictionary format for the DocumentStore
for _, row in df.iterrows():
    doc = {
        'content': row['context'],
        'meta': {}  # You can add additional metadata here if needed.
    }
    docs.append(doc)

# Populate the DocumentStore
document_store.write_documents(docs, index="document")
BM25: Simple Yet Powerful
BM25 is a classic technique in information retrieval. Let’s initialize it and see it in action:
from haystack.nodes.retriever import BM25Retriever

bm25_retriever = BM25Retriever(document_store=document_store)
A quick retrieval test:
= "How to fix my KVM?"
query = bm25_retriever.retrieve(query=query, top_k=3)
retrieved_docs
print(retrieved_docs[0])
<Document: id=1a93a3494d130482abb1d9341f3a4642, content='With a KVM you really will need to do some deductive troubleshooting.If port one is the only port w...'>
Dense Passage Retrieval (DPR)
While BM25 is great, there’s another retrieval technique on the horizon: DPR. It’s a more advanced method that uses deep learning to understand context and provide better results. You can learn more about DPR here.
DPR offers a more granular retrieval technique, diving deeper into our contexts to find the most relevant snippets:
from haystack.nodes import DensePassageRetriever

dpr_retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    embed_title=False
)
With DPR initialized, it’s time to update our document embeddings:
document_store.update_embeddings(retriever=dpr_retriever)
The Power of Pipelines
By combining our retriever and reader, we form a powerful QA pipeline:
from haystack.pipelines import ExtractiveQAPipeline

pipe = ExtractiveQAPipeline(reader=reader, retriever=dpr_retriever)
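The reader referenced here has to be created before building the pipeline. A minimal sketch, assuming Haystack’s FARMReader wrapped around the same SQuAD2-pretrained MiniLM checkpoint we tokenized with earlier:

from haystack.nodes import FARMReader

# Assumption: reuse deepset/minilm-uncased-squad2 as the extractive reader
reader = FARMReader(model_name_or_path="deepset/minilm-uncased-squad2", use_gpu=True)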
Let’s see this in action:
n_answers = 3
preds = pipe.run(query=query, params={"Retriever": {"top_k": 3},
                                      "Reader": {"top_k": n_answers}})

for idx in range(n_answers):
    print(f"Question {idx+1}: {preds['answers'][idx].answer}")
    print(f"Answer snippet: ...{preds['answers'][idx].context}...")
    print("\n\n")
Question 3: I want to buy DDR3L HPE 647883-B21 16Gb DIMM ECC Reg PC3-10600 CL9This modules may not be supported.Kingston for Dell (317-6142 370-20147) DDR3 DIMM 16GB (PC3-10600) 1333MHz ECC Registered Low Voltage ModuleThis module can be used.
Answer snippet: ....I want to buy DDR3L HPE 647883-B21 16Gb DIMM ECC Reg PC3-10600 CL9This modules may not be supported.Kingston for Dell (317-6142 370-20147) DDR3 DIMM 16GB (PC3-10600) 1333MHz ECC Registered Low Voltage ModuleThis module can be used....
Fine-Tuning & Evaluation
While our models are pretrained on diverse data, fine-tuning helps adapt them to our specific use case:
from pathlib import Path
import torch


def train_and_evaluate():
    # File names and directories for fine-tuning
    train_filename = "Dell_QA_200_train.json"
    dev_filename = "Dell_QA_200_test.json"
    test_filename = "Dell_QA_200_test.json"  # or another test file
    data_dir = Path(".")  # using pathlib's Path
    save_dir = "./model"

    # Fine-tune the reader
    reader.train(data_dir=data_dir, use_gpu=True, n_epochs=1, batch_size=16,
                 train_filename=train_filename, save_dir=save_dir)

    # Evaluate the model
    result = reader.eval_on_file(data_dir=data_dir, test_filename=test_filename,
                                 device="cuda" if torch.cuda.is_available() else "cpu")
    print(result)


# Call the function to start the process
train_and_evaluate()
The above snippet demonstrates how we fine-tune our reader on the Dell QA dataset, optimizing it for our domain.
Metrics: The Scorecards of AI
Evaluation is crucial in understanding how well our system performs. Some key metrics include:
- Exact Match (EM): Measures the percentage of predictions that match any one of the ground truth answers exactly.
- F1 Score: Considers both the precision and the recall of the test to compute the score.
- Top_n_accuracy: Evaluates how often the correct answer is within the top n retrieved results.
After fine-tuning with train_and_evaluate(), our evaluation showcased impressive results:
{'EM': 27.50, 'f1': 44.76, 'top_n_accuracy': 67.5,
'top_n': 4, 'EM_text_answer': 24.32, 'f1_text_answer': 42.98,
... }
These metrics offer a comprehensive view of our model’s performance, highlighting areas of excellence and potential improvement.
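To make EM and F1 concrete, here is a small self-contained sketch (not Haystack’s own evaluation code) of how the two scores can be computed for a single prediction against a gold answer:

from collections import Counter

def exact_match(prediction: str, ground_truth: str) -> int:
    # 1 if the normalized strings match exactly, else 0
    return int(prediction.strip().lower() == ground_truth.strip().lower())

def f1_score(prediction: str, ground_truth: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# A partially overlapping prediction scores EM = 0 but a non-zero F1
print(exact_match("iDRAC v9 license", "IDRAC v9 license is not compatible"))  # 0
print(f1_score("iDRAC v9 license", "IDRAC v9 license is not compatible"))     # ~0.67

In SQuAD-style evaluation, each prediction is compared against all ground-truth answers and the best score is kept, which is what the aggregate EM and f1 figures above report.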
Conclusion
So that is my attempt at using Extractive QA! With Deepset dropping its support for the online version of the Haystack Annotation tool, I suspect many people are moving towards Generative QA systems - I will be doing a blog post on this shortly!
My full code and datasets are here on my GitHub.
Happy coding!