Things to consider before using sentiment analysis
- How do you handle irony?
- What is the age group you are surveying? (Different datasets are targeted at different age groups)
- What slang words will the dataset not understand? And on the flip-side what technical language may be
- Do certain slang terms mean different things in different countries? (e.g. "fag" in Britain means
cigarette, but in America it is used as a derogatory term for homosexuals)
- What cultural differences exist among those being surveyed? (e.g. certain colours may have
different connotations in different regions)
- How will you deal with compound sentences? (e.g. "This problem was making me so angry, thanks for fixing
- How will it cope with emojis?
  - How will you cope with differences in culture? 🙏 can mean prayer in Western, Christian
    societies but simply a high-five in others.
  - Might some emojis be read differently depending on age group? 💀 can be used ironically by younger
    people but quite literally by those of an older generation.
  - Do you trust the emoji labels given by companies? 😥 is labelled "DISAPPOINTED BUT RELIEVED FACE"
    (its Unicode name), but it could easily be read as "disappointed and crying".
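
When weighing the emoji questions above, it can help to look up the official label a character carries before deciding whether to trust it. A minimal sketch, assuming Python and its standard `unicodedata` module; the helper name `emoji_label` is made up for illustration:

```python
import unicodedata


def emoji_label(char: str) -> str:
    """Return the official Unicode name of a single character.

    Falls back to "UNKNOWN" if the character has no name in the
    Unicode database shipped with this Python version.
    """
    return unicodedata.name(char, "UNKNOWN")


# Folded hands, skull, and the "disappointed but relieved" face.
for emoji in ["\U0001F64F", "\U0001F480", "\U0001F625"]:
    print(emoji, emoji_label(emoji))
```

The official name only tells you what the standard calls the character, not how your respondents actually use it, so it is a starting point rather than a sentiment label.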
If you are making a training dataset, who are you creating the dataset for? What kind of people do you want
to target in your survey?
Is it ethical to make decisions based on "mined" sentiment? (e.g. from Twitter or other social media)
Which affects sentiment polarity more: tone or content?
How might your system cope with idioms?
How might you deal with spelling errors? It is unlikely a sentiment system will notice that "hapy" is
a misspelling of "happy".
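
One possible way to handle misspellings is to map unknown words onto the closest word in the sentiment lexicon before looking up a score. The sketch below assumes Python's standard `difflib` module; the lexicon, its scores, the 0.8 similarity cut-off, and the function names are all hypothetical choices for illustration:

```python
from difflib import get_close_matches

# Hypothetical toy lexicon; a real lexicon would be far larger.
LEXICON = {"happy": 1.0, "great": 1.0, "sad": -1.0, "angry": -1.0}


def normalise(word: str, cutoff: float = 0.8) -> str:
    """Map a possibly misspelt word to the closest lexicon entry.

    Returns the word unchanged (lower-cased) if nothing in the
    lexicon is similar enough.
    """
    word = word.lower()
    if word in LEXICON:
        return word
    matches = get_close_matches(word, list(LEXICON), n=1, cutoff=cutoff)
    return matches[0] if matches else word


def word_polarity(word: str) -> float:
    """Look up the polarity of a word after spelling normalisation."""
    return LEXICON.get(normalise(word), 0.0)


print(normalise("hapy"))          # closest lexicon entry
print(word_polarity("computer"))  # not in the lexicon, so neutral
```

Fuzzy matching can also produce false matches between unrelated words, so the cut-off is worth tuning against your own data.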
How might it deal with phrases whose meaning comes down purely to tone of voice? "It's not amazing" can
mean that something is okay or quite good, or it can come across as incredibly rude.
What functions will you apply to your polarity scores? See 3.5.1 of [1] for some techniques you
might use.
At what cut-off point does polarity become positive or negative?
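
The cut-off question can be made concrete with a small function that turns a polarity score into a label. This is only a sketch: the score range of -1 to 1, the symmetric neutral band, and the default threshold of 0.05 are assumptions chosen for illustration, not recommended values.

```python
def classify(polarity: float, threshold: float = 0.05) -> str:
    """Label a polarity score as positive, negative, or neutral.

    Scores within `threshold` of zero are treated as neutral;
    the threshold value is an assumption to be tuned per dataset.
    """
    if polarity >= threshold:
        return "positive"
    if polarity <= -threshold:
        return "negative"
    return "neutral"


for score in (0.5, 0.01, -0.3):
    print(score, classify(score))
```

Widening the neutral band reduces false positives and negatives from weak scores, at the cost of labelling more text as neutral.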
What sentences are difficult for simple sentiment analysers to handle? For example (expected label
first, analyser output in brackets):
- "He is using you": negative (marked as neutral; the connotation is not understood)
- "He is using the computer": neutral
- "Sample mean.": neutral (marked as negative because of "mean")
Would analysing a Likert scale make more sense?
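
The difficult sentences listed earlier ("He is using you", "Sample mean.") can be reproduced with a toy word-level scorer. This is a sketch of a naive lexicon approach in general, not the analyser that produced the labels quoted in this document; the lexicon and the function name are hypothetical.

```python
import re

# Hypothetical toy lexicon: each word has one fixed score, regardless of context.
TOY_LEXICON = {"mean": -1.0, "angry": -1.0, "thanks": 1.0, "amazing": 1.0}


def naive_polarity(sentence: str) -> float:
    """Sum per-word lexicon scores, ignoring context entirely."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(TOY_LEXICON.get(word, 0.0) for word in words)


for text in ("He is using you", "He is using the computer", "Sample mean."):
    print(repr(text), naive_polarity(text))
```

Because every word is scored in isolation, "mean" is counted as negative even in a statistical phrase, while the negative connotation of "using you" is missed because no single word in it carries a score.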
Further Reading
On the Subjectivity of Emotions in Software Projects: How Reliable are Pre-Labeled Data Sets for Sentiment Analysis?