So recently I’ve been taking the Web Intelligence and Big Data course on Coursera (www.coursera.org), offered by IIT Delhi and taught by Professor Gautam Shroff. In this course, students were introduced to a basic machine learning algorithm, naive Bayes, and were asked to calculate likelihood ratios by hand. The algorithm was presented as a classification method that can discern sentiment from text by probabilistically estimating how likely certain words are to appear (or not appear) in a query, relative to the ‘trained’ model.
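To make the idea concrete, here’s a minimal sketch of the naive Bayes likelihood-ratio calculation on a toy corpus. The training sentences are made up purely for illustration, and I’m using add-one (Laplace) smoothing so unseen words don’t zero out a class:

```python
import math
from collections import Counter

# Toy training data, made up for illustration only.
good_docs = ["great food great service", "loved the food"]
bad_docs = ["terrible service", "awful food terrible place"]

def word_counts(docs):
    # Count word occurrences across all documents in a class.
    c = Counter()
    for doc in docs:
        c.update(doc.split())
    return c

good_counts, bad_counts = word_counts(good_docs), word_counts(bad_docs)
good_total, bad_total = sum(good_counts.values()), sum(bad_counts.values())
vocab = set(good_counts) | set(bad_counts)

def log_likelihood_ratio(review):
    # Sum per-word log likelihood ratios, with add-one smoothing.
    score = 0.0
    for w in review.split():
        p_good = (good_counts[w] + 1) / (good_total + len(vocab))
        p_bad = (bad_counts[w] + 1) / (bad_total + len(vocab))
        score += math.log(p_good / p_bad)
    return score  # > 0 suggests 'good', < 0 suggests 'bad'

print(log_likelihood_ratio("great service"))   # positive
print(log_likelihood_ratio("terrible food"))   # negative
```

This is the same calculation the course had us do by hand: for each word in the query, compare its probability under the ‘good’ model to its probability under the ‘bad’ model, and sum the log ratios.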
I tried to apply this to the Yelp competition dataset, found at http://www.yelp.com/dataset_challenge/.
Using the reviews dataset, I wanted to determine whether an arbitrary review was ‘good’ or ‘bad’. So I split the reviews into two groups: three stars and above counted as ‘good’, which left one- and two-star reviews as ‘bad’.
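The labeling step can be sketched as follows. This assumes the reviews come as newline-delimited JSON with `stars` and `text` fields (which is how the Yelp dataset ships); the file path is a placeholder:

```python
import json

def label_review(stars):
    # Three stars and above count as 'good'; one and two stars as 'bad'.
    return "good" if stars >= 3 else "bad"

def load_labeled_reviews(path):
    # Parse one JSON review per line and pair its text with a label.
    labeled = []
    with open(path) as f:
        for line in f:
            review = json.loads(line)
            labeled.append((review["text"], label_review(review["stars"])))
    return labeled

# Example (placeholder path):
# reviews = load_labeled_reviews("yelp_reviews.json")
```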
I trained on roughly 5k reviews and validated against a held-out set of roughly 1k reviews, yielding an accuracy of 70.07%.
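The validation itself is just a label-by-label comparison between predictions and the star-derived labels. A minimal sketch, assuming both are lists of ‘good’/‘bad’ strings:

```python
def accuracy(predictions, labels):
    # Fraction of held-out reviews where the predicted label
    # matches the star-derived label.
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

print(accuracy(["good", "bad", "good"], ["good", "bad", "bad"]))  # 2 of 3 correct
```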