Sunday, February 5, 2017

How to detect clickbait?

For some time I've been working on a Chrome extension that helps me identify clickbait links.
I was reading the article The Pot-Belly of Ignorance and realized that clickbait is very addictive, but you don't get anything from it. So I decided to find a way to get rid of it in my reading ;-)

My first idea was to use a naive Bayes classifier to detect clickbait.
It works, but it isn't perfect.

So my second try was with logistic regression.

I'm currently taking online classes in Machine Learning, and what better practice is there than a real-world example? ;-)

This try was better ;-)

But let's go back to the beginning.

The first problem was finding some data to train the algorithms...
Happily, I found it on GitHub :-)

The idea was that it should be possible to detect clickbait using only the title of a link, without looking at the URL.
You are more likely to see "you" in a clickbait link than in a proper article, and more likely to see "washington" in a proper article than in clickbait.

The next assumption was that it doesn't matter whether words are in upper or lower case: YOU and you should be treated the same.
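A minimal sketch of those two assumptions in JavaScript. The helper name `tokenize` and the exact splitting rule are made up for illustration; this is not the code from the extension.

```javascript
// Hypothetical helper: turn a link title into lowercase word tokens.
function tokenize(title) {
  // Lowercase first, so "YOU" and "you" end up as the same token,
  // then keep runs of letters, digits and apostrophes.
  return title.toLowerCase().match(/[a-z0-9']+/g) || [];
}

console.log(tokenize("YOU Won't Believe What Happened in Washington"));
```

Only the title goes in; the URL is never looked at.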

With those assumptions I started to play ;-)

First I used a naive Bayes classifier.
Here I did the whole training by myself... with a little help from the Python code in "Machine Learning in Action" (I needed to translate it from Python with NumPy to JavaScript for node.js).

It worked, but even though the numbers on the test set looked good, it was sometimes too eager to classify something as clickbait.
I suppose this was because the model assumed from the start that a new title is more likely to be clickbait than a proper article.
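To make that concrete, here is a toy naive Bayes in JavaScript. It is only a sketch under my own assumptions (add-one smoothing, made-up titles, hypothetical function names), not the code from the extension or from the book. The prior is an explicit parameter, so its effect is easy to see.

```javascript
function tokenize(title) {
  return title.toLowerCase().match(/[a-z0-9']+/g) || [];
}

// examples: array of { title, clickbait }, where clickbait is true or false.
function trainNaiveBayes(examples) {
  const counts = { true: new Map(), false: new Map() }; // word counts per class
  const totals = { true: 0, false: 0 };                 // words seen per class
  const docs = { true: 0, false: 0 };                   // titles seen per class
  const vocab = new Set();
  for (const { title, clickbait } of examples) {
    docs[clickbait] += 1;
    for (const word of tokenize(title)) {
      counts[clickbait].set(word, (counts[clickbait].get(word) || 0) + 1);
      totals[clickbait] += 1;
      vocab.add(word);
    }
  }
  return { counts, totals, docs, vocabSize: vocab.size };
}

// Log prior plus the sum of log word likelihoods, with add-one smoothing.
function logScore(model, words, label, prior) {
  let score = Math.log(prior);
  for (const word of words) {
    const count = model.counts[label].get(word) || 0;
    score += Math.log((count + 1) / (model.totals[label] + model.vocabSize));
  }
  return score;
}

// priorClickbait defaults to the share of clickbait titles in the training data;
// pass 0.5 so that neither class is favored before any word is seen.
function isClickbait(model, title, priorClickbait) {
  const total = model.docs[true] + model.docs[false];
  const p = priorClickbait === undefined ? model.docs[true] / total : priorClickbait;
  const words = tokenize(title);
  return logScore(model, words, true, p) > logScore(model, words, false, 1 - p);
}

// Made-up training titles, two clickbait and one proper.
const model = trainNaiveBayes([
  { title: "You won't believe this trick", clickbait: true },
  { title: "10 things you need to know", clickbait: true },
  { title: "Washington votes on budget bill", clickbait: false },
]);

console.log(isClickbait(model, "Hello"));      // prior learned from the data
console.log(isClickbait(model, "Hello", 0.5)); // neutral prior
```

With this toy data the title "Hello", which contains no known word, is called clickbait under the learned prior and a proper article under the neutral one. That is the kind of eagerness described above.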

So I moved to logistic regression.

Here I decided to use GNU Octave to do the dirty work of calculating everything ;-)
JavaScript only prepares a matrix of vectors built from article titles; next, Octave uses the gradient to find the values of the vector that will later be used to classify new data.
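The whole pipeline can be sketched in JavaScript alone. This is an illustration, not the Octave code: it assumes binary word-presence vectors and plain batch gradient descent, which may differ from what actually ran in Octave, and all names, titles, the learning rate and the iteration count are made up.

```javascript
function tokenize(title) {
  return title.toLowerCase().match(/[a-z0-9']+/g) || [];
}

// Sorted list of every word seen in the training titles.
function buildVocabulary(titles) {
  const words = new Set();
  for (const title of titles) for (const word of tokenize(title)) words.add(word);
  return [...words].sort();
}

// One matrix row: a leading 1 for the bias term, then 1/0 per vocabulary word.
function toVector(title, vocabulary) {
  const present = new Set(tokenize(title));
  return [1, ...vocabulary.map((word) => (present.has(word) ? 1 : 0))];
}

const sigmoid = (z) => 1 / (1 + Math.exp(-z));
const dot = (a, b) => a.reduce((sum, value, i) => sum + value * b[i], 0);

// Batch gradient descent on the logistic regression cost.
// alpha (learning rate) and iterations are illustrative values.
function train(X, y, alpha = 0.5, iterations = 2000) {
  let theta = new Array(X[0].length).fill(0);
  for (let step = 0; step < iterations; step++) {
    const gradient = new Array(theta.length).fill(0);
    X.forEach((row, i) => {
      const error = sigmoid(dot(row, theta)) - y[i];
      row.forEach((x, j) => { gradient[j] += error * x; });
    });
    theta = theta.map((t, j) => t - (alpha / X.length) * gradient[j]);
  }
  return theta;
}

// A title is called clickbait when the predicted probability is at least 0.5.
function predict(theta, title, vocabulary) {
  return sigmoid(dot(toVector(title, vocabulary), theta)) >= 0.5;
}

// Made-up training data: 1 = clickbait, 0 = proper article.
const titles = [
  "You won't believe this trick",
  "10 things you need to know",
  "Washington votes on budget bill",
  "Senate passes new tax law",
];
const y = [1, 1, 0, 0];

const vocabulary = buildVocabulary(titles);
const X = titles.map((title) => toVector(title, vocabulary));
const theta = train(X, y);

titles.forEach((title) => console.log(title, "->", predict(theta, title, vocabulary)));
```

Words that were never seen in training are simply ignored by `toVector`, so a new title is judged only by the words the model already knows.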

It seems to work better. On the training set it has 100% accuracy ;-) and on the test set about 97%.
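Accuracy here means the share of titles that got the right label. A small sketch of that calculation, with a deliberately silly stand-in classifier and made-up titles (none of this is the real model or the real data):

```javascript
// Fraction of examples whose predicted label matches the true label.
function accuracy(classify, examples) {
  const correct = examples.filter((e) => classify(e.title) === e.clickbait).length;
  return correct / examples.length;
}

// Stand-in classifier for illustration: clickbait if the title contains "you".
const containsYou = (title) => /\byou\b/i.test(title);

const testSet = [
  { title: "You won't believe this trick", clickbait: true },
  { title: "10 things to know before lunch", clickbait: true },
  { title: "Washington votes on budget bill", clickbait: false },
  { title: "Senate passes new tax law", clickbait: false },
];

console.log(accuracy(containsYou, testSet)); // 0.75
```

The number on the training set says how well the model memorized what it saw; the number on the test set is the one that tells how it does on new titles.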

If you are interested in looking at some code, you can find it on my GitHub ;-)
You will find the extensions there, and some code for training.
