Category Archives: NLP

Taking a look at iLanguageCloud user reviews

It’s been a few years since Josh originally released the iLanguageCloud project. The project uses Jason Davies’ D3.js word cloud library and some statistics to tokenize text and identify stopwords, so that it can support Unicode text in any language.
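The post does not say which statistics iLanguageCloud uses, so the following is only a minimal sketch of the general idea, assuming that stopwords are simply the most frequent tokens in a text. The function names `tokenize` and `guess_stopwords` and the `top_fraction` parameter are hypothetical, and the whitespace-style tokenization would not handle scripts written without spaces between words.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens.

    In Python 3, \\w matches letters and digits from any Unicode script,
    so no language-specific word list is needed.
    """
    return re.findall(r"\w+", text.lower())


def guess_stopwords(tokens, top_fraction=0.1):
    """Guess stopwords as the most frequent distinct tokens.

    Assumption: function words dominate the frequency counts, so the
    top `top_fraction` of distinct tokens (at least one) are treated
    as stopwords.
    """
    counts = Counter(tokens)
    n = max(1, int(len(counts) * top_fraction))
    return {word for word, _ in counts.most_common(n)}


tokens = tokenize("The cat and the dog and the bird")
stopwords = guess_stopwords(tokens, top_fraction=0.4)
cloud_words = [t for t in tokens if t not in stopwords]
```

With a longer text, the remaining `cloud_words` would be the candidates to draw in the cloud.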

Since the app was released, a surprising number of users have found it and started using it. Users have been requesting features and providing feedback on the Play Store and Chrome Store.

Using @iLanguageLab word cloud to collect & display words to describe the moon. One S uses Word Central for help!

Some teachers have even tweeted about the app!

This summer Veronica will be looking over iLanguageCloud user reviews in order to document what needs to be done in the next releases. First, she found that most of the reviews indicate that there are different user groups with different goals when they open the iLanguageCloud project. Some users want to paste a full text and see a cloud, but most users want to see all the words they paste.

She started by identifying the user types with a CouchDB map reduce and learning how to do statistical analysis in LibreOffice. Once she had identified stats to categorize user types, she added tests for these user types in the codebase using Jasmine.

Users are often creating tag clouds, not full text clouds. We attribute this to users being accustomed to pre-filtering their text down to only the words they want to show, with random text sizes rather than text sizes that depend on word frequency or other factors.
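To illustrate what frequency-dependent text size means, here is a minimal sketch. The function name `font_sizes`, the linear scaling, and the 10 to 60 size range are assumptions made for illustration, not the formula the app actually uses.

```python
from collections import Counter


def font_sizes(words, min_size=10, max_size=60):
    """Map each distinct word to a font size scaled linearly by its count."""
    counts = Counter(words)
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    span = high - low
    sizes = {}
    for word, count in counts.items():
        if span == 0:
            # Every word is equally frequent: use the middle of the range.
            sizes[word] = (min_size + max_size) / 2
        else:
            sizes[word] = min_size + (count - low) * (max_size - min_size) / span
    return sizes


sizes = font_sizes(["moon", "moon", "moon", "crater", "bright", "bright"])
```

In a tag cloud, by contrast, every pasted word appears once, so all counts are equal and a scheme like this gives every word the same size.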

 

While she is learning the tools (Angular.js, Travis) needed to make the modifications so that her user-type tests pass, Veronica created a video tutorial showing how to use the Chrome app, so that users have some instructions.


To help decide which features get done first, visit our GitHub feature list.

Week 7: Searching for court cases in Kartuli

Since Kartuli (Georgian) is an agglutinative language with very rich verb morphology, searching for appropriate results is very difficult. From what we have observed over the past few weeks, most Kartuli speakers seem to prefer searching with Russian search engines, using Russian vocabulary. Mari (who is a lawyer) and Gina decided to create a corpus of law cases in Kartuli, and see if the FieldDB glosser can help build a stemmer that might be used for searching in Georgian.

While Mari was teaching Gina and Esma how to use the Georgian court websites, she showed them how she modifies her search terms to get some results for Supreme Court cases. The Constitutional Court search page, by contrast, lets you search for an empty string and see all results. This was an illuminating experience of searching as a minority language speaker, so we decided to share it as an unlisted YouTube video despite the poor image quality.

Supreme Court

* Requires search to find documents
* Need to use very general search terms to get any results, and the results you get are not always relevant to the case you are working on
* Documents are .html, which is excellent for machines, but Mari didn’t seem too excited about it; we will ask her more later

vs

Constitutional Court
* Requires no search to find documents
* Documents are in .doc format which users are used to
* Easy to download documents so you can read them offline when you are in the village, or put them on a USB key if you are using someone else’s computer for the internet.


Making your apps smarter @Notman House

Next Wednesday our software engineering intern Bahar Sateli will be presenting her open source Named Entity Recognition library for Android. It is powered by the Semantic Software Lab’s Semantic Assistants web services platform, which in turn is powered by GATE, an open source General Architecture for Text Engineering developed at the University of Sheffield.

As part of her MITACS/NRC-IRAP funded project in collaboration with iLanguage Lab, she created an Android library that makes it possible to recognize people, locations, dates and other useful pieces of text on Android phones. The sky is the limit, as it can run any GATE-powered pipeline.

The current open source pipelines range from very specialized (recognizing Bacteria and Fungi entities in bio-medical texts) to very general (recognizing people, places and dates).

She will be presenting her app iForgot Who, which takes in some text and automatically creates new contacts for you, a handy app for all those party planners out there. It is a demo application to show new developers how they can use her open source system to make their own apps smarter and automate tasks for users.

The presentations start at 6:30, and we will be going out for drinks afterwards at around 8:30/9:30 at Pub Quartier Latin (next to La Distillerie, corner of Ontario and Sanguinet, a one-block walk from the talk).

Come one, come all, for the presentation and/or for drinks!

Code is open sourced on SourceForge.

 

The Google+ event

Bahar presents at Android Montreal

Bahar presents to a record-breaking crowd at Android Montreal

Word Edit Distance Web Widget

If you have a spell checker, you want it to suggest a number of words that are close to the misspelt word. For humans, it’s easy to look at ‘teh’ and know that it is close to ‘the’, but how does the computer know that? A really simple, language-independent way to do it if you don’t have any gold standard data is to assign costs to the various edits, substitution (2), deletion (1) and insertion (1), and pick the cheapest sequence of edits.

The table below applies Levenshtein’s algorithm (basically, substitution costs 2) letter by letter. The total distance between the two words, 4, is in the top right corner, because it costs 2 to substitute ‘u’ for ‘i’ and 2 to substitute ‘t’ for ‘k’.
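The cost scheme described above can be sketched as a standard dynamic programming table. This is a minimal illustration under the stated costs, not the code behind the widget; the function name `edit_distance` and its parameter names are assumptions.

```python
def edit_distance(source, target, sub_cost=2, ins_cost=1, del_cost=1):
    """Minimum total cost of edits that turn `source` into `target`."""
    rows, cols = len(source) + 1, len(target) + 1
    # table[i][j] holds the cost of turning source[:i] into target[:j].
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = table[i - 1][0] + del_cost  # delete every letter
    for j in range(1, cols):
        table[0][j] = table[0][j - 1] + ins_cost  # insert every letter
    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            table[i][j] = min(
                table[i - 1][j] + del_cost,
                table[i][j - 1] + ins_cost,
                table[i - 1][j - 1] + (0 if same else sub_cost),
            )
    return table[-1][-1]


print(edit_distance("teh", "the"))  # prints 2
```

With these costs, ‘teh’ is 2 away from ‘the’: deleting the ‘e’ and inserting it again after the ‘h’ costs 1 + 1, which is cheaper than two substitutions at 2 each.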

At the Lab, we put together an interactive JavaScript widget so that you can input whatever words you like and find out their edit distance. Just enter the words you want to compare!

And if you really like it, you can download it from GitHub.
Click here to read more about edit distance.

Stanford NLP Class registration ends soon!

If you’re interested in Natural Language Processing, or you have been scraping and have lots of text data, the Stanford NLP class is a great opportunity to brush up on your regular expressions and learn some tricks. The professors are Dan Jurafsky and Chris Manning. Dan Jurafsky is a leading researcher on the connection between prosody and written text, and the co-author of Speech and Language Processing.

Natural language processing is the technology for dealing with our most ubiquitous product: human language, as it appears in emails, web pages, tweets, product descriptions, newspaper stories, social media, and scientific articles, in thousands of languages and varieties. In the past decade, successful natural language processing applications have become part of our everyday experience, from spelling and grammar correction in word processors to machine translation on the web, from email spam detection to automatic question answering, from detecting people’s opinions about products or services to extracting appointments from your email. In this class, you’ll learn the fundamental algorithms and mathematical models for human language processing and how you can use them to solve practical problems in dealing with language data wherever you encounter it.

 

We are hosting a small bi-monthly NLP get-together to discuss the Stanford NLP class and apply it to some local Montreal data. If you’re interested in joining us, leave us a comment below and we will tell you about our meeting times.