Category Archives: Data in Context

Spy or Not: week one data

In one week the Gamify project has gotten roughly 700 participants from around the world, including Kazakhstan!

New visitors (who we assume are coming to play for the first time) are averaging 3.4 pages per visit, and most are completing the experiment, which takes an average of 5 minutes. We won’t know for a few weeks how many of the participants have usable data.

Spy or Not has seen visitors from around the world, and most are viewing all three stages of the game.

Surprisingly, we had a few installs from the Android Market, and many of those users also went through all three stages of the game.

Spy or Not installs on Android also show activity from around the world, most importantly in the places where we need the most participants: Russia, the UK and South Africa.

Thanks everyone for playing and sharing our game. Our goal is 500 participants from Russia, the UK and South Africa. It takes only 5 minutes to spread the word, so challenge your friends to beat your score!

Word Edit Distance Web Widget

If you have a spell checker, you want it to suggest a number of words that are close to the misspelt word. For humans, it’s easy to look at ‘teh’ and know that it is close to ‘the’, but how does the computer know that? A really simple, language-independent way to do it, if you don’t have any gold standard data, is to assign costs to the various edits, substitution (2), deletion (1) and insertion (1), and pick the cheapest sequence of edits.

The table below applies a variant of Levenshtein’s algorithm (one in which substitution costs 2 rather than 1) letter by letter. The total distance between the two words, 4, is in the top right corner, because it costs 2 to substitute ‘u’ for ‘i’ and 2 to substitute ‘t’ for ‘k’.
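The cost scheme described above (substitution 2, insertion 1, deletion 1) can be sketched in a few lines of Python. This is only a minimal illustration, not the code behind the widget: the function names edit_distance_table and edit_distance are made up for this example, and the test words are ‘teh’ and ‘the’ from the spell checker example.

```python
def edit_distance_table(source, target, sub_cost=2, ins_cost=1, del_cost=1):
    """Fill in the table of edit costs between every prefix of the two words.

    table[i][j] is the cheapest way to turn the first i letters of
    `source` into the first j letters of `target`.
    """
    rows, cols = len(source) + 1, len(target) + 1
    table = [[0] * cols for _ in range(rows)]

    # Turning a prefix into the empty string means deleting every letter.
    for i in range(1, rows):
        table[i][0] = table[i - 1][0] + del_cost
    # Building a prefix from the empty string means inserting every letter.
    for j in range(1, cols):
        table[0][j] = table[0][j - 1] + ins_cost

    for i in range(1, rows):
        for j in range(1, cols):
            same_letter = source[i - 1] == target[j - 1]
            table[i][j] = min(
                table[i - 1][j] + del_cost,   # delete a letter of source
                table[i][j - 1] + ins_cost,   # insert a letter of target
                table[i - 1][j - 1] + (0 if same_letter else sub_cost),
            )
    return table


def edit_distance(source, target):
    """The total distance is the last cell that gets filled in."""
    return edit_distance_table(source, target)[-1][-1]


if __name__ == "__main__":
    for row in edit_distance_table("teh", "the"):
        print(row)
    print(edit_distance("teh", "the"))
```

In this list-of-lists layout the total ends up in the last cell of the last row; a table drawn with the first word running from bottom to top shows the same number in the top right corner. For ‘teh’ and ‘the’ the distance is 2 (delete the ‘e’, then insert it again after the ‘h’), which is cheaper than two substitutions at 2 each.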

At the Lab, we put together an interactive JavaScript widget so that you can input whatever words you like and find out their edit distance. Just enter the words you want to compare!

Word 1:

Word 2:

And if you really like it, you can download it from GitHub.
Click here to read more about edit distance.

The stuff people say

iLanguage is all around us. Each of us has a unique background, and we use language in unique settings that determine how we speak. This is exemplified in the latest internet meme* referred to under the technical term: “Shit _____ say”

*A meme acts as a unit for carrying cultural ideas, symbols or practices, which can be transmitted from one mind to another through writing, speech, gestures, rituals or other imitable phenomena. [Wikipedia]

In these memes, we see phrases associated with specific groups of people. The obvious candidates show up, such as gender, ethnicity, and location. However, perhaps more revealing is how specific some of these memes are. There is pretty much one for every subculture: gamers, hipsters, yogis, Republicans, atheists, etc.

In addition, people take context into account by making memes about what people say to specific groups of people, such as twins, tall girls, and pharmacists. How we speak is influenced not only by who we are, but also by who we are speaking to and what we are speaking about. This showcases the role of context in any Natural Language Processing task. Maybe your reaction to these videos is something like this.

What are some phrases, expressions or idioms that are unique to you? What would be included in your “Shit I say” meme?

Autocorrect vs. iLanguage

An obvious place where natural language must get filtered through technology is texting.  However, with the advent of autocorrect, texting has become a rather perilous endeavor with humorous results.

What makes them so funny?

1.  Although “sexting” has made headlines, I am pretty sure that the primary purpose of texting for most people is not to send dirty words, and especially not to one’s parents. This is something Apple appears not to have taken into account when creating their autocorrect algorithm. (Or did they?) For most people, dirty words are less frequent than other types of words, particularly in the domain of texting.

2.  On the other hand, one thing the autocorrect algorithm does appear to take into account is part of speech. If the autocorrect algorithm just returned the closest word, it might not be the same part of speech as the intended word, resulting in gibberish. But the mistaken texts are funny because they do make sense and do have semantic meaning, just not the intended meaning.

If autocorrect had provided a verb “broil” or an adjective “broiled” instead of a noun, the result would have been merely weird rather than funny.

It is precisely this mixture of getting some things right and some things wrong that permits these texts to occur.  While autocorrect does take the immediate linguistic context into account, it does not consider the context of the text itself, which is not something easily determined by an algorithm.
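The frequency point in item 1 can be made concrete with a toy sketch. It assumes candidates are ranked by edit cost first (substitution 2, insertion 1, deletion 1) and by how common the word is to break ties. Nothing here reflects how Apple’s autocorrect actually works: the names edit_cost, suggest and TOY_COUNTS are hypothetical, and the word counts are invented purely for illustration.

```python
def edit_cost(source, target, sub_cost=2):
    """Edit cost with insertion = deletion = 1 and substitution = sub_cost."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,        # delete s_char
                current[j - 1] + 1,     # insert t_char
                previous[j - 1] + (0 if s_char == t_char else sub_cost),
            ))
        previous = current
    return previous[-1]


# Hypothetical counts, invented for this example (not real corpus data).
TOY_COUNTS = {"the": 5000, "them": 900, "ten": 300, "tea": 200, "toe": 100}


def suggest(typed, counts, limit=3):
    """Rank known words by edit cost, preferring frequent words on ties."""
    ranked = sorted(counts, key=lambda word: (edit_cost(typed, word), -counts[word]))
    return ranked[:limit]


if __name__ == "__main__":
    print(suggest("teh", TOY_COUNTS))
```

With these made-up counts, ‘the’, ‘ten’, ‘tea’ and ‘toe’ are all equally cheap edits away from ‘teh’, so frequency alone decides their order, while ‘them’ ranks last despite being common because it costs one more edit. A rare word that happens to be a cheap edit away would still be suggested, which is the kind of mistake described above.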

What do you think? Do you see any patterns in the autocorrect mistakes? Do you have any observations about texting that you think might improve the algorithm?