For the purpose of this assignment, we have chosen to work on the data provided for SemEval-2018 Task 2 (Multilingual Emoji Prediction), more specifically for subtask 1, Emoji Prediction in English. The goal of the task is to design a system that, given a tweet in English, predicts the emoji that is most likely to be associated with it. The data that we will use consists of 500,000 tweets in English that were collected with the Twitter APIs from October 2015 to February 2017 and are geolocated in the United States. The emojis used in the tweets were removed and used as labels; it is important to point out that only tweets containing exactly one emoji were used. The emojis chosen as labels are the 20 most frequent in English tweets. The emojis and the numbers used to identify them in the data are shown in Figure 1, while Figure 2 summarizes the distribution of tweets across emojis: the x-axis represents the labels, and the y-axis is the total number of tweets in which a given emoji appeared. Finally, the data is split into training (90%) and test (10%) sets.
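The 90%/10% split and the per-label counts described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `split_train_test` and `label_distribution`, the fixed random seed, and the toy tweets are hypothetical, and the source does not say whether the original split was random or stratified.

```python
import random
from collections import Counter


def split_train_test(examples, test_fraction=0.1, seed=0):
    """Shuffle (tweet, label) pairs and split them into train and test lists."""
    shuffled = list(examples)
    # A fixed seed (an assumption) keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def label_distribution(examples):
    """Count how many tweets carry each emoji label."""
    return Counter(label for _, label in examples)


# Hypothetical toy data: label ids 0..19 stand in for the 20 emojis.
toy = [("tweet number %d" % i, i % 20) for i in range(200)]
train, test = split_train_test(toy)
counts = label_distribution(toy)
print(len(train), len(test))
```

On the real data, `label_distribution` would give the counts that a bar chart of tweets per emoji is drawn from.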
The methods we will use for emoji prediction are Multinomial Naive Bayes (MNB) and Support Vector Machines (SVM). These methods are used very frequently in the literature on Twitter sentiment analysis (for example, in [1, 25, 23]). MNB is a popular classification method because it is computationally efficient and has shown relatively good predictive performance [12], while the linear SVM classifier has been shown to perform better than other well-known machine learning techniques, such as the aforementioned Naive Bayes classifiers or k-Nearest Neighbour classifiers [12, 19].
We intend to test the performance of these algorithms on our data and thereby evaluate them on a different kind of sentiment analysis, one that uses emojis as labels, since few papers have focused on this kind of problem (for example, [5, 14, 17]). This project also provides a good opportunity to learn in detail and implement two techniques that are commonly used in natural language processing.
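To make the Multinomial Naive Bayes method mentioned above more concrete, here is a minimal from-scratch sketch. It is not the implementation used in this project. The class name `MultinomialNB`, the whitespace tokenizer, the add-one (Laplace) smoothing parameter `alpha`, and the toy tweets and label names are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Bag-of-words Multinomial Naive Bayes with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength (assumed value)

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per label
        self.word_counts = defaultdict(Counter)  # token counts per label
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = doc.lower().split()         # naive whitespace tokenizer
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(labels)
        return self

    def log_scores(self, doc):
        # Tokens never seen in training are ignored.
        tokens = [t for t in doc.lower().split() if t in self.vocab]
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_docs in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n_docs / self.total_docs)  # log prior
            for t in tokens:
                count = self.word_counts[label][t]
                score += math.log((count + self.alpha) / (total + self.alpha * vocab_size))
            scores[label] = score
        return scores

    def predict(self, doc):
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)


# Hypothetical toy tweets with the emoji already stripped and used as the label.
docs = ["i love you so much", "love this sunny day",
        "so funny i am laughing", "this joke is so funny"]
labels = ["red_heart", "red_heart", "tears_of_joy", "tears_of_joy"]
model = MultinomialNB().fit(docs, labels)
print(model.predict("love you"), model.predict("funny joke"))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied, and the smoothing term keeps a word that was never seen with a given label from forcing that label's probability to zero.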