My Quora Codesprint Submission
(this is x-posted on Quora)
I’ve had some experience in the past with machine learning, but I feel like I still don’t have a proper methodology. I’d like to hear what you guys think about what I’ve done here.
Regression/Classification Models
I tried SVC and SVR with RBF kernels but, as Alec said, they took too long to train within the time limit, so I couldn’t use them. Like him, I went with linear models, which train much faster.
One thing that I realised a little later was that performing a log-transform on some of the variables yielded better results. For example, take a look at the distribution of number of followers against interest (2nd challenge):
After a log transformation:
The r values are pretty close, but I also ran a regression using only the number of followers to predict the interest values, before and after the log transform, and compared the RMSLE scores. The transformed version scored higher. Before log: 47.864%, after: 51.822%.
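Here is a minimal sketch of that before/after comparison. The data is synthetic (a heavy-tailed follower count and a noisy power-law interest value, both assumptions made for illustration), and `rmsle` is the plain error, where lower is better, rather than the contest's percentage score.

```python
import numpy as np

def rmsle(pred, actual):
    """Root mean squared logarithmic error (lower is better)."""
    pred = np.maximum(pred, 0)  # log1p needs inputs above -1, so clip negatives
    return float(np.sqrt(np.mean((np.log1p(pred) - np.log1p(actual)) ** 2)))

rng = np.random.default_rng(0)
# Synthetic stand-ins for the real data: heavy-tailed followers, noisy interest.
followers = np.exp(rng.normal(5.0, 2.0, size=500))
interest = np.exp(0.5 * np.log1p(followers) + rng.normal(0.0, 0.3, size=500))

# Straight-line fit on the raw values.
slope, intercept = np.polyfit(followers, interest, 1)
raw_pred = slope * followers + intercept

# Straight-line fit after log1p on both sides, mapped back with expm1.
log_slope, log_intercept = np.polyfit(np.log1p(followers), np.log1p(interest), 1)
log_pred = np.expm1(log_slope * np.log1p(followers) + log_intercept)

raw_score = rmsle(raw_pred, interest)
log_score = rmsle(log_pred, interest)
print(f"raw fit RMSLE: {raw_score:.3f}, log fit RMSLE: {log_score:.3f}")
```

On data shaped like this, the raw fit is dominated by the few very large follower counts, which is one plausible reason the log-space fit comes out ahead.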
Naturally, I went crazy and started log-transforming everything I could find. But that obviously didn’t help as much.
(I’d like to know if this log-transformation thing is the right thing to do here and, if it is, why. I’ve seen some articles saying count data should not be log-transformed… but why not if it helps? Heh. I think this has to do with directly minimising the error function, since the metric used for comparison works in log space as well.)
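To make that hunch concrete, here is the usual definition of RMSLE, assuming the contest used the standard form with $\hat{y}_j$ the prediction and $y_j$ the true value for question $j$:

$$\text{RMSLE} = \sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(\log(\hat{y}_j + 1) - \log(y_j + 1)\right)^2}$$

If the target is transformed to $z_j = \log(y_j + 1)$ and a least-squares model predicts $\hat{z}_j$, the quantity it minimises on the training set is

$$\sum_{j=1}^{n}\left(\hat{z}_j - z_j\right)^2 = n \cdot \text{RMSLE}^2$$

so fitting in log space optimises the evaluation metric directly, whereas a least-squares fit on the raw values optimises a different error. This covers the transform of the target; transforming an input such as the follower count is a separate choice that helps when the relationship is closer to linear in log space.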
Features
I obviously went with total followers as one of my features. For the textual data, I experimented with raw counts, TF-IDF, and a binary present/absent value for each word. In most cases, the raw counts performed well.
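The three text representations can be sketched with scikit-learn's vectorizers. The post does not say which vectorizer classes or settings were used, so the defaults below are an assumption; the sample questions are taken from this post.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

questions = [
    "What is it like to be discreetly wealthy?",
    "How can we keep artificial intelligences from opposing us?",
    "Harry Potter, what should I read next?",
]

# Raw term counts per question.
counts = CountVectorizer().fit_transform(questions)
# 1 if the word occurs in the question, 0 otherwise.
binary = CountVectorizer(binary=True).fit_transform(questions)
# Counts reweighted by inverse document frequency.
tfidf = TfidfVectorizer().fit_transform(questions)

print(counts.shape, binary.shape, tfidf.shape)
```

All three produce sparse matrices with one row per question and one column per vocabulary word, so they can be swapped in and out of the same model for comparison.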
I also used the topics that each question falls under (the topics themselves, not the words in their names). The context topic was not helpful in every scenario, though. When I did use the context topic, I one-hot encoded it.
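As a quick illustration of one-hot encoding, here is a minimal sketch in plain Python. The topic names are hypothetical placeholders, not values from the contest data.

```python
def one_hot(topic, vocabulary):
    """Return a 0/1 vector with a single 1 at the topic's position."""
    return [1 if topic == known else 0 for known in vocabulary]

# Hypothetical context topics, one per question.
context_topics = ["Movies", "Politics", "Movies", "Books"]

# Fix a column order so every question maps to the same layout.
vocabulary = sorted(set(context_topics))
encoded = [one_hot(topic, vocabulary) for topic in context_topics]
```

Each distinct topic gets its own column, which is why this kind of feature adds to the width and sparsity of the feature vector.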
Doing all of the above made my feature vector insanely huge and sparse, and also created a lot of noise in the prediction. To cut it down, I used sklearn’s SelectKBest class with the f_regression score function to select among the topic and textual features.
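A minimal sketch of that selection step follows. The matrix is synthetic (random word counts where only the first two columns drive the target), and `k=5` is an arbitrary illustrative value, not the one used in the submission.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
# Synthetic sparse-ish count features: 200 questions, 50 words.
X = rng.poisson(0.3, size=(200, 50)).astype(float)
# Only columns 0 and 1 carry signal; the rest are noise.
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0.0, 0.1, size=200)

# Keep the k features with the highest univariate F-scores.
selector = SelectKBest(score_func=f_regression, k=5)
X_selected = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)
print("kept columns:", kept)
```

`f_regression` scores each feature on its own against the target, so it is cheap on wide matrices but cannot see interactions between features.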
I also added in features based on the formatting of the question: Was there a why, who, what, when, etc. at the start, and a question mark at the end of those sentences? Are there dictionary words all in caps?
I came up with all of these by looking through high scoring questions and low scoring questions.
Examples of low scoring questions:
Guys who love being sissies? DELETED
WHY ARE INDIA YOUTH NOT POLITICS INTEREST?
Harry Potter, what should I read next?
Examples of high scoring questions:
What are some truly epic Facebook conversations you have come across?
How can we keep artificial intelligences from opposing us?
What is it like to be discreetly wealthy?
What are some of the most mind-blowing facts that sound like “BS” but are actually true?
Why is there not a single Indian university within the world’s top 100 index list?
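The formatting features described above can be sketched roughly as follows. This is an approximation under stated assumptions: the list of question words beyond "why, who, what, when" is a guess, the check looks at the whole question rather than each sentence, and the dictionary lookup for all-caps words is left out, so any all-caps token of two or more letters counts.

```python
import re

# "why, who, what, when" come from the post; the rest are assumed.
QUESTION_WORDS = ("why", "who", "what", "when", "where", "which", "how")

def formatting_features(question):
    """Extract simple formatting cues from a question string."""
    tokens = re.findall(r"[A-Za-z']+", question)
    first = tokens[0].lower() if tokens else ""
    # Single letters such as "I" are skipped so they do not count as shouting.
    caps = [t for t in tokens if len(t) > 1 and t.isupper()]
    return {
        "starts_with_question_word": first in QUESTION_WORDS,
        "ends_with_question_mark": question.rstrip().endswith("?"),
        "all_caps_words": len(caps),
    }

print(formatting_features("WHY ARE INDIA YOUTH NOT POLITICS INTEREST?"))
print(formatting_features("What is it like to be discreetly wealthy?"))
```

On the examples above, the all-caps count is what separates the second low scoring question from the high scoring ones, since both start with a question word and end with a question mark.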
One interesting thing came up while working on “Interest”. While thinking about how to incorporate the position of words, I decided to give words earlier in the question a higher score. That made sense to me, since people might not bother reading past the first few words. So I came up with:
$$\text{score}(w_i) = \frac{1}{i + 1}$$
Where $w_i$ is the $i$-th word, counting from zero, after discarding stop words and words shorter than 3 characters. I saw some increase in accuracy, but I later realised that, because I had forgotten to cast the denominator to a float, integer division truncated every score except the first to zero, so all it was doing was scoring the first word! Fixing this actually caused a drop in accuracy, so I decided to extract only the first word from the questions.
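The bug is easy to reproduce. In the sketch below, `//` stands in for the truncating integer division that caused it, and the stop-word list is a tiny hypothetical one, since the original list is not given.

```python
# Tiny illustrative stop-word list; the original list is not given.
STOP_WORDS = {"what", "are", "some", "the", "you", "have", "how", "can"}

def content_words(question):
    """Lowercase, strip punctuation, drop stop words and words under 3 letters."""
    words = [w.strip("?,.!\"").lower() for w in question.split()]
    return [w for w in words if len(w) >= 3 and w not in STOP_WORDS]

def position_scores_truncated(words):
    # Integer division: 1 // 1 == 1, but 1 // 2, 1 // 3, ... are all 0.
    return [(w, 1 // (i + 1)) for i, w in enumerate(words)]

def position_scores(words):
    # Float division gives the intended 1, 1/2, 1/3, ... decay.
    return [(w, 1.0 / (i + 1)) for i, w in enumerate(words)]

words = content_words("How can we keep artificial intelligences from opposing us?")
print(position_scores_truncated(words))
print(position_scores(words))
```

With truncation, only the first remaining word gets a non-zero score, which is effectively a "first content word" feature.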
Optimising for Accuracy
What then remained for me was to tune the number of features carefully for optimal performance. Initially, for speed, I simply trained on the training data and scored the model on the test data. This got me pretty far, but after a while I felt I was overfitting to that test set, so I switched to 10-fold cross-validation with a grid search over different values of K for the feature selection. I think this helped later on when the hidden test set was used against my best-performing models.
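Here is a minimal sketch of that tuning loop. The data is synthetic, `Ridge` is a stand-in for whichever linear model was actually used, and the candidate values of K are arbitrary.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
# Synthetic data: 40 features, only the first 3 matter.
X = rng.normal(size=(300, 40))
y = X[:, :3] @ np.array([3.0, -2.0, 1.0]) + rng.normal(0.0, 0.5, size=300)

# Selection lives inside the pipeline so it is refit on each training fold.
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_regression)),
    ("model", Ridge()),
])

search = GridSearchCV(
    pipeline,
    param_grid={"select__k": [1, 3, 5, 10, 20, 40]},
    cv=10,  # 10-fold cross-validation
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
best_k = search.best_params_["select__k"]
print("best K:", best_k)
```

Keeping the feature selection inside the pipeline matters: selecting features on the full data set before cross-validating would leak information from the held-out folds into the choice of K.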
In summary, what helped me most in the end was adding good features and removing bad ones. Tuning hyperparameters or choosing better regression algorithms may help, but can only get you so far. Using cross-validation to make sure you are not overfitting is also very important.
@misc{tan2013-07-29,
title = {My Quora Codesprint Submission},
author = {Tan, Shawn},
howpublished = {\url{https://blog.wtf.sg/2013/07/29/my-quora-codesprint-submission/}},
year = {2013}
}