
Introduction

In this project, we tackled the problem of identifying duplicate questions on popular question-and-answer platforms. Duplicate or near-duplicate questions are a problem for platforms such as Quora and Stack Overflow, and natural language processing offers an effective way to detect them. The goal was to determine whether two questions are semantically equivalent even when they are phrased differently. Using machine learning techniques, we explored Quora's question pair dataset and trained models to classify question pairs as duplicates or non-duplicates. This binary classification task required extracting features that capture the intent behind each question. By participating in Quora's challenge on Kaggle, we aimed to improve the overall user experience by enhancing the quality and quantity of content presented.
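To make the feature-extraction step concrete, here is a minimal sketch of how a question pair might be turned into numeric features. The feature names (`common_words`, `jaccard`, `length_diff`), the helper functions, and the 0.5 threshold are illustrative assumptions, not the features or model actually used in this project; a trained classifier would normally replace the simple threshold rule.

```python
import re


def tokenize(question):
    """Lowercase a question and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", question.lower()))


def pair_features(q1, q2):
    """Compute a few simple overlap features for a question pair.

    These features are hypothetical examples chosen for illustration.
    """
    t1, t2 = tokenize(q1), tokenize(q2)
    common = t1 & t2
    union = t1 | t2
    return {
        # Number of distinct words the two questions share
        "common_words": float(len(common)),
        # Jaccard similarity: shared words divided by all distinct words
        "jaccard": len(common) / len(union) if union else 0.0,
        # Absolute difference in character length
        "length_diff": float(abs(len(q1) - len(q2))),
    }


def is_duplicate(q1, q2, threshold=0.5):
    """Toy stand-in for a trained model: threshold on Jaccard similarity."""
    return pair_features(q1, q2)["jaccard"] >= threshold
```

For example, `is_duplicate("How do I learn Python quickly?", "How do I quickly learn Python?")` compares two questions that use the same words in a different order. Word overlap alone cannot capture meaning, which is why a learned model over richer features is the more realistic choice for this task.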

Future Scope

  1. Detecting duplicates saves the question asker time if their question has already been answered on the site. Instead of waiting minutes or hours for a response, they can get their answer immediately.
  2. Frequently repeated questions can frustrate highly engaged users whose feeds become polluted with redundant questions. Many users who answer questions on a particular topic see slight variations of the same question appearing many times in their feed, which creates a negative user experience for them.
  3. Q&A knowledge bases have more value to users and researchers when there is a single canonical question and collection of answers, instead of having that knowledge fragmented and spread throughout the site. This reduces the time it takes for users to find the best answers, and allows researchers to better understand the relationship between questions and their answers.
  4. Knowing alternative phrasings of the same question can improve search and discovery.

Technology Used:

  • Python
  • NLP
  • Machine Learning
  • Jupyter Notebook
  • Streamlit

Links To

GitHub

Website