StackOverflow Search Engine and Question Recommendation

Sumit Mishra
10 min read · Aug 20, 2020

1. Business/Real-world Problem

1.1 Introduction

  • Stack Overflow is a question and answer site for professional and enthusiast programmers as well as beginners.
  • It features questions and answers on a wide range of programming topics, and today it is no longer limited to programming alone but covers a broad spectrum of subjects.
  • The website serves as a platform for users to ask and answer questions and, through membership and active participation, to vote questions and answers up or down and to edit them. The better an answer, the more votes it gets, which also increases a user’s reputation.

1.2 Problem statement

  • Given its popularity, more than 19.8M questions have been asked on StackOverflow to date.
  • However, this huge amount of information also makes it difficult to search for the solution one is looking for.
  • It’s not that big of an issue for programming veterans and other experienced professionals, because they are aware of the correct keywords required to get an appropriate answer.
  • However, for a beginner programmer, this poses a great concern.

2. Machine Learning Problem

2.1 Data

2.1.1 Data Overview

Source: https://archive.org/download/stackexchange

There is a lot of data available for multiple topics. For this case study I’ve selected a few topics which are as follows:

ai.meta.stackexchange.com.7z
ai.stackexchange.com.7z
android.meta.stackexchange.com.7z
android.stackexchange.com.7z
arduino.meta.stackexchange.com.7z
arduino.stackexchange.com.7z
computergraphics.meta.stackexchange.com.7z
computergraphics.stackexchange.com.7z
cs.meta.stackexchange.com.7z
cs.stackexchange.com.7z
datascience.meta.stackexchange.com.7z
datascience.stackexchange.com.7z
iot.meta.stackexchange.com.7z
iot.stackexchange.com.7z
robotics.meta.stackexchange.com.7z
robotics.stackexchange.com.7z
softwareengineering.meta.stackexchange.com.7z
softwareengineering.stackexchange.com.7z
webapps.meta.stackexchange.com.7z
webapps.stackexchange.com.7z

The total size of these datasets is 690 MB in compressed format.

  • After uncompressing, the data is 3.92 GB in size.
  • Each topic contains several files in XML format: Badges.xml, Comments.xml, PostHistory.xml, PostLinks.xml, Posts.xml, Tags.xml, Users.xml, and Votes.xml.
  • Out of these, I’ve selected Posts.xml, as this file contains the textual information about each post, such as Title, Body, etc.
  • After collecting all the Posts.xml files, the combined dataset is 948 MB and has more than 650k data points.

2.1.2 Data Field Explanation

Id - Unique identifier for each question
Title - The question's title
Body - The body of the question
Tags - The tags associated with the question
Topic - The topic or category of the question

2.2 Mapping the real-world problem to an ML problem

2.2.1 Type of Machine Learning Problem

  • The problem is to build a search engine and related-question recommendation system based on StackOverflow questions.
  • The search results should capture the semantic meaning of the query, with a scalable architecture that returns results in very little time.
  • Natural Language Processing (NLP), a subfield of Artificial Intelligence, has proven to work very well in the past few years thanks to fast processors and sophisticated model architectures, and thus has immense potential for solving various language comprehension tasks.

2.2.2 Performance Metric

Pairwise distance:

  • This method provides a safe way to take either a vector array or a distance matrix as input, while preserving compatibility with many other algorithms that take a vector array.
  • We’ll be using pairwise distance as the metric to rank semantically similar results, as sketched below.
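For illustration, here is a minimal sketch of how such a ranking could be computed with scikit-learn's pairwise_distances; the embedding arrays below are hypothetical stand-ins for the vectorizers discussed later in this post:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical embeddings: 1000 question vectors and one query vector.
question_vectors = np.random.rand(1000, 300)
query_vector = np.random.rand(1, 300)

# Distance from the query to every question; smaller means more similar.
distances = pairwise_distances(question_vectors, query_vector).ravel()
top_10 = np.argsort(distances)[:10]  # indices of the 10 closest questions
```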

2.2.3 Real-world/Business objectives and constraints

  • Our objective is for the platform to actually understand the content of what the user is trying to search for, and then return the most similar results based on that.
  • Since we are building this as a search engine in addition to the semantic relevance of the predicted posts with respect to the query post or text, there are additional constraints that need to be satisfied.

Low latency — the time to return recommended results should be small.

Scalability — it should work even when the volume of data increases tremendously.

3. Exploratory Data Analysis

3.1 Data Loading

3.1.1 Parsing XML files and converting to a pandas DataFrame

Out[ ]:

(650561, 5)
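The parsing code itself was embedded as a gist in the original post; below is a minimal sketch of the idea, assuming the standard Posts.xml attribute layout of the Stack Exchange dumps (the file paths and topic list are illustrative):

```python
# Minimal sketch: stream one Posts.xml from a Stack Exchange dump into a
# DataFrame. Attribute names (Id, Title, Body, Tags) follow the standard
# dump schema; the file paths below are illustrative.
import xml.etree.ElementTree as ET
import pandas as pd

def posts_to_df(xml_path, topic):
    rows = []
    for _, elem in ET.iterparse(xml_path):
        if elem.tag == "row":
            rows.append({
                "Id": elem.get("Id"),
                "Title": elem.get("Title"),  # None for answer posts
                "Body": elem.get("Body"),
                "Tags": elem.get("Tags"),    # None for answer posts
                "Topic": topic,
            })
            elem.clear()  # free memory while streaming through large files
    return pd.DataFrame(rows)

topics = ["ai", "android", "arduino", "computergraphics", "cs",
          "datascience", "iot", "robotics", "softwareengineering", "webapps"]
df = pd.concat([posts_to_df(f"{t}/Posts.xml", t) for t in topics],
               ignore_index=True)
print(df.shape)  # (650561, 5) for the topics in this case study
```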

3.1.2 Basic statistics and analysis

3.1.2.1 Analysis on “Title” field

Out[ ]:

count                          243391
unique                         242988
top       2018: a year in moderation
freq                               10
Name: Title, dtype: object
Total number of unique words (vocab) in the title column is: 63235

Out[ ]:

Countplot of top 10 rare words of the title
Output of so_title_word_distribution.py

Observation:

  • The above graph shows the distribution of each unique word with its count in the title data corpus.
  • As we can observe, almost 60k unique words in the titles occur very rarely, whereas the other ~3k words occur frequently.
Output of so_last_3k_title_word_distribution.py

Observation:

  • The above graph shows the distribution of the last 3k+ words with their counts in the title data corpus.
  • As we can observe, the last 200+ words have the highest frequency.

Out[ ]:

Total number of rare words (occurred less than 3 times): 44221
69.93 % of words in total title data are rare.
Total number of most occurring words (occurred more than 100 times): 2318
3.665 % of words in total title data are most occurring.
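A sketch of how these vocabulary statistics can be computed, assuming simple whitespace tokenization of the titles:

```python
# Count word frequencies over all titles and measure how many words are
# rare (fewer than 3 occurrences) or very frequent (more than 100).
from collections import Counter

word_counts = Counter(
    word for title in df["Title"].dropna() for word in title.lower().split()
)
rare = [w for w, c in word_counts.items() if c < 3]
frequent = [w for w, c in word_counts.items() if c > 100]
print(f"Total number of rare words (occurred less than 3 times): {len(rare)}")
print(f"{100 * len(rare) / len(word_counts):.2f} % of words in total title data are rare.")
```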

3.1.2.2 Analysis on “Tags” field

Out[ ]:

Getting the count of all individual tags
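The counting gist is not reproduced here; below is a minimal sketch, assuming tags are stored in the dump format "&lt;python&gt;&lt;pandas&gt;":

```python
# Split each Tags string into individual tags and count them.
from collections import Counter

tag_counts = Counter(
    tag
    for tags in df["Tags"].dropna()
    for tag in tags.strip("<>").split("><")
)
print(tag_counts.most_common(10))  # the most frequent tags
```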

Out[ ]:

Most Frequent Tags

Observation:

  • More than 200 tags are used more than 1000 times.
  • Of these, 20 to 25 tags are used more than 4000 times.

3.1.2.3 Analysis on “Topic” field

Out[ ]:

count                  650561
unique                     10
top       softwareengineering
freq                   229127
Name: Topic, dtype: object
Total number of unique Topics: 10
Unique topics are: ['ai' 'arduino' 'computergraphics' 'cs' 'android' 'robotics' 'webapps'
 'datascience' 'iot' 'softwareengineering']
Topic softwareengineering has the highest frequency of 229127.
Distribution of data points by topics

Observation:

  • The above graph shows the topic-wise distribution of questions.
  • As we can observe, 35.2% of the questions are from the softwareengineering topic, which is the maximum.
  • The iot topic has the least number of questions, at 0.7%.

3.1.2.4 Analysis on “Body” field

Analysis of the “Body” column can be done in the same way as the “Title” field, as both are text features.
The main objective of analysing these text features is to understand the vocabulary of our corpus: for a search engine, it is crucial to understand properties such as the distribution of words, which words are rare, and the overall percentage of rare words in the corpus.

3.2 Data De-duplication and Cleaning

3.2.1 Removing data points with a null and duplicate title
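The original code was embedded as a gist; a plausible one-liner for this step, assuming pandas:

```python
# Drop posts without a title (answers) and keep only the first occurrence
# of each duplicate title.
df = df.dropna(subset=["Title"]).drop_duplicates(subset=["Title"], keep="first")
```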

3.2.2 Data preprocessing for feature Title and Body
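A minimal preprocessing sketch of typical steps (drop code blocks, strip HTML tags, lowercase, remove punctuation); the exact steps in the original gist may differ, and the new column names are illustrative:

```python
import re

def preprocess(text):
    text = re.sub(r"<code>.*?</code>", " ", text, flags=re.DOTALL)  # drop code
    text = re.sub(r"<[^>]+>", " ", text)                # strip HTML tags
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())    # keep letters/digits
    return " ".join(text.split())                       # normalize whitespace

df["preprocessed_title"] = df["Title"].apply(preprocess)
df["preprocessed_body"] = df["Body"].apply(preprocess)
```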

3.2.3 Data preprocessing for feature Tags

3.2.4 Merging all preprocessed data into a single data frame

Out[ ]:

(242987, 7)

4. Feature Engineering

4.1 Description

  • Feature engineering is one of the most important and crucial steps in solving any data science or machine learning problem.
  • Good feature engineering can help improve the performance of machine learning algorithms.
  • Here, I have come up with these new features:

Document embeddings of the textual features “Title”, “Body”, and “Tags”.

  • For vectorization of the text data, I’ve used the following embeddings:
    a. TF-IDF (Term Frequency — Inverse Document Frequency)
    b. TF-IDF weighted W2V using pre-trained GloVe vectors
    c. Word2Vec from scratch
    d. DistilBERT embedding
    e. Universal Sentence Encoder

4.2.1 TF-IDF (Term Frequency — Inverse Document Frequency)

  • TF-IDF transforms text into feature vectors that can be used as input to an estimator.
  • Its vocabulary is a dictionary that maps each token (word) to a feature index in the matrix; each unique token gets a feature index.
  • In each vector, the numbers (weights) represent the feature’s TF-IDF score, as in the sketch below.
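```python
# TF-IDF vectorization of the preprocessed titles with scikit-learn;
# the min_df value is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(min_df=3)  # ignore very rare words
title_vectors = tfidf.fit_transform(df["preprocessed_title"])
print(title_vectors.shape)  # (n_questions, vocabulary_size), a sparse matrix
```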

4.2.2 TF-IDF W2V using pre-defined glove_vectors
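The idea is to average the pre-trained GloVe vectors of a document's words, weighting each word by its TF-IDF score. A minimal sketch, assuming glove_vectors is a dict-like mapping from word to a 300-dimensional vector and reusing the fitted tfidf from above:

```python
import numpy as np

# Map each vocabulary word to its IDF score from the fitted TfidfVectorizer.
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def tfidf_w2v(sentence):
    words = sentence.split()
    vec, weight_sum = np.zeros(300), 0.0
    for word in words:
        if word in glove_vectors and word in idf:
            w = idf[word] * (words.count(word) / len(words))  # tf * idf
            vec += w * glove_vectors[word]
            weight_sum += w
    return vec / weight_sum if weight_sum else vec

title_tfidf_w2v = np.vstack([tfidf_w2v(s) for s in df["preprocessed_title"]])
```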

4.2.3 Word2Vec embedding from scratch

  • Bag of Words, bi-grams, and TF-IDF are very common approaches for vectorizing text.
  • StackOverflow posts are very technical and use a very specific vocabulary.
  • Pre-trained word embeddings like glove_vectors cover lots of good words, but they are trained on plain English text and would not capture the relations between the words in our vocabulary.
  • Hence, it is not a good idea to rely on pre-trained word embeddings here (although Google has a lot of good ones).
  • Thus, I’ve decided to train a word embedding model from scratch, as sketched below.
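```python
# Training Word2Vec from scratch with gensim so the embeddings reflect
# StackOverflow-specific vocabulary; hyperparameters are illustrative.
from gensim.models import Word2Vec

sentences = [s.split() for s in df["preprocessed_title"]]
w2v = Word2Vec(sentences, vector_size=300, window=5, min_count=3, workers=4)
# Nearest neighbours in the learned space (assuming "python" is in the vocab):
print(w2v.wv.most_similar("python", topn=5))
```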

4.2.4 DistilBERT embedding

  • DistilBERT is a small, fast, cheap, and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster while preserving over 95% of BERT’s performance.
  • Since we are building a search engine, low latency is an important constraint that needs to be considered.
  • Hence, I’m using DistilBERT, which preserves over 95% of BERT’s performance with 40% fewer parameters; a usage sketch follows.
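```python
# Sketch of DistilBERT sentence embeddings with Hugging Face transformers,
# using mean pooling over token embeddings. The model name and pooling
# choice are one common setup, not necessarily the exact one used here.
import torch
from transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

def distilbert_embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # (batch, tokens, 768)
    mask = batch["attention_mask"].unsqueeze(-1)    # zero out padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)     # mean-pooled vectors

vectors = distilbert_embed(["how to fix a segmentation fault in c"])
```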

4.2.5 Universal sentence encoder

  • DistilBERT is much faster than the BERT model, but it is still not an ideal choice for a search engine.
  • It not only takes considerable time to convert text to vectors, but it also doesn’t directly capture the semantic meaning of whole sentences.
  • BERT-style models are a good choice for language understanding tasks because they learn a word’s context from all surrounding words, rather than just the word that immediately precedes or follows it, but they are not trained on the basis of semantic similarity.
  • The Universal Sentence Encoder makes getting sentence-level embeddings as easy as it has historically been to look up the embeddings for individual words.
  • The sentence embeddings can then be trivially used to compute sentence-level semantic similarity, as well as to enable better performance on downstream classification tasks using less supervised training data.
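A short usage sketch with TensorFlow Hub:

```python
# Universal Sentence Encoder: one call maps a batch of sentences straight
# to 512-dimensional embeddings.
import tensorflow_hub as hub

use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
vectors = use(["how to fix a segmentation fault in c"]).numpy()  # shape (1, 512)
```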

5. Model for topic prediction

  • When recommending questions for a given query post, we can look up the topic or category of the input question using its index; but when raw text is given as input, we need to predict the topic or category of the searched question text.
  • To do so, I implemented a model that predicts the topic or category of the query question text.
  • Here, the class label is the ‘Topic’ feature, and the input is the ‘Title’ feature.
  • Using machine learning models, we can predict the topic of a question.
  • In this case, I’ve used Naive Bayes, Logistic Regression, and SVC models, of which the SVC model worked very well, giving high precision and recall for every category of the class label; a sketch of this classifier follows.
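```python
# Sketch of the topic classifier: TF-IDF features of the preprocessed
# title feeding a linear SVC. The exact features and hyperparameters in
# the original notebook may differ.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(
    df["preprocessed_title"], df["Topic"], stratify=df["Topic"], random_state=42
)
clf = make_pipeline(TfidfVectorizer(min_df=3), LinearSVC())
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```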

Out[ ]:

Observation

  • As we can observe in the classification report of our topic prediction model, the SVC model works very well compared to Naive Bayes and Logistic Regression.
  • Although there is not much difference between the SVC and Logistic Regression results, if we compare training time, the SVC model trains much faster than Logistic Regression while also giving slightly better results.

6. Spell correction for searched text

  • When the user enters text to search, there is a possibility that they make some spelling mistakes.
  • So, I created a function that corrects the spelling of words using its own dictionary of corpus words, as sketched below.
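```python
# A minimal Norvig-style corrector built from our own corpus words, so
# domain terms like "keras" are not "corrected" away.
import re
from collections import Counter

WORDS = Counter(re.findall(r"\w+", " ".join(df["preprocessed_title"])))

def edits1(word):
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correction(word):
    if word in WORDS:
        return word
    candidates = [w for w in edits1(word) if w in WORDS]
    return max(candidates, key=WORDS.get) if candidates else word

print(correction("grafient"))  # -> "gradient", if it appears in the corpus
```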

7. Search Engine and Question Recommendation

To implement this, I created the GUI below in a Jupyter notebook using ipywidgets, which lets the user select the search type, the text embedding technique, and the number of results to be displayed.

  • If the user selects search type ‘Text’, it will prompt the user to enter the query text.
  • The correction function then checks for any spelling mistakes in the input text, and the raw text is converted to a vector using the chosen text embedding technique.
  • Then it predicts the category of the searched question, whether it belongs to ai, ml, web apps, and so on.
  • After that, it finds the pairwise distance between the query text and all other question titles belonging to the same category as the input text, and filters only the ‘n’ most similar questions.

Here ‘n’ is the number of results to be displayed, selected by the user using a slider widget.
The most similar questions will have the smallest distance, so we keep the results with the least pairwise distance. A sketch of this text-search flow follows.
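```python
# Sketch of the text-search path, reusing names from the earlier sketches
# (preprocess, correction, clf, tfidf, title_vectors). Assumes df's rows
# align with the rows of title_vectors.
import numpy as np
from sklearn.metrics import pairwise_distances

def search(query_text, n=5):
    query = " ".join(correction(w) for w in preprocess(query_text).split())
    topic = clf.predict([query])[0]                   # predicted category
    idx = np.where(df["Topic"].values == topic)[0]    # same-topic questions
    query_vec = tfidf.transform([query])              # any embedding works here
    dists = pairwise_distances(title_vectors[idx], query_vec).ravel()
    return df.iloc[idx[np.argsort(dists)[:n]]][["Title", "Tags"]]
```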

  • If the user selects search type ‘Id’, it will prompt the user to enter a question id.
  • The question title, text, and tags are then fetched from the database using the input id.
  • It then performs the same steps as for search type ‘Text’, except that it finds the pairwise distance between only those questions which have the same tags as the input question.

Below is one demo output for a text search with the input ‘how to tackle vanishign grafient problem?’.

I’ve used simple IPython.display to view the results in HTML format in the IPython notebook itself.
As we can see, the input text has spelling mistakes, but similar questions are found after correcting the mistakes, and in very little time.

The below video shows the predicted result using both search types and all text embedding techniques.

Future Work

  • We can try other text embedding techniques to improve performance.
  • We can also use techniques like Locality Sensitive Hashing, NMSLIB, KD-Tree, FAISS, an inverted index, and many more to improve the performance of the search engine by making it much faster.
