Since its release, a surprising number of users have found the app and have been using it. They have been requesting features and providing feedback on the Play Store and Chrome Web Store.
This summer Veronica will be going through iLanguageCloud user reviews in order to document what needs to be done in the next releases. Her first finding is that most of the reviews point to distinct user groups with different goals when they open iLanguageCloud: some users want to paste a full text and see a summarizing cloud, but most users want to see all the words they paste.
She started by identifying the user types with a CouchDB map-reduce and by learning how to do statistical analysis in LibreOffice. Once she had identified statistics to categorize the user types, she added tests for these user types to the codebase using Jasmine.
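A map-reduce like the one above might look roughly like this (a minimal sketch with hypothetical document fields such as `pastedText` and a hypothetical 100-word cutoff; the real view and thresholds would come from Veronica's analysis):

```javascript
// Hypothetical classifier: bucket a session document by how much text
// the user pasted. Short pastes suggest a "show me every word" user,
// long pastes suggest a "summarize my text as a cloud" user.
function classifyUser(doc) {
  if (!doc.pastedText) return "unknown";
  var wordCount = doc.pastedText.split(/\s+/).length;
  return wordCount > 100 ? "full-text" : "word-list";
}

// The map half of a CouchDB view: emit one row per document, keyed by
// user type. Pairing this with CouchDB's built-in `_count` reduce
// yields a tally of how many sessions fall into each user group.
function map(doc) {
  emit(classifyUser(doc), 1);
}
```

The counts per key are exactly the kind of per-group statistics that can then be explored in LibreOffice and turned into Jasmine test cases.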
While she learns the tools (Angular.js, Travis) needed to make the modifications so that her user-type tests pass, Veronica has created a video tutorial showing how to use the Chrome app, so that users have some instructions in the meantime.
Since Kartuli is an agglutinative language with very rich verb morphology, searching for appropriate results is very difficult. Over the past few weeks of observation, it seems that most Kartuli speakers prefer to search with Russian search engines, using Russian vocabulary. Mari (who is a lawyer) and Gina decided to create a corpus of law cases in Kartuli and see whether the FieldDB glosser can help build a stemmer that might be used for searching in Georgian.
While Mari was teaching Gina and Esma how to use the Georgian court websites, she showed them how she modifies her search terms to get some results in Supreme Court cases, unlike the Constitutional Court search page, which lets you search for an empty string and see all results. This was an illuminating experience of searching as a minority-language speaker, so we decided to share it as an unlisted YouTube video despite the poor image quality.
The Supreme Court site:

* Requires search to find documents
* You need to use very general search terms to get any results, and the results you get are not always relevant to the case you are working on
* Documents are .html, which is excellent for machines, but Mari didn't seem too excited about it; we will ask her more later
The Constitutional Court site:

* Requires no search to find documents
* Documents are in .doc format, which users are used to
* It is easy to download documents so you can read them offline when you are in the village, or put them on a USB key if you are using someone else's computer for the internet
Next Wednesday our software engineering intern Bahar Sateli will be presenting her open-source Named Entity Recognition library for Android. It is powered by the Semantic Software Lab's Semantic Assistants web services platform, which in turn is powered by GATE, an open-source General Architecture for Text Engineering developed at the University of Sheffield.
As part of her MITACS/NRC-IRAP funded project in collaboration with iLanguage Lab, she created an Android library that makes it possible to recognize people, locations, dates and other useful pieces of text on Android phones. The sky is the limit, as it can run any GATE-powered pipeline.
The current open-source pipelines range from the very specialized (recognizing bacteria and fungi entities in biomedical texts) to the very general (recognizing people, places and dates).
She will be presenting her app, iForgot Who, which takes in some text and automatically creates new contacts for you: a handy app for all those party planners out there. It is a demo application to show new developers how they can use her open-source system to make their own apps smarter and automate tasks for users.
The presentations start at 6:30, and we will be going out for drinks afterwards at around 8:30/9:30 at Pub Quartier Latin (next to La Distillerie, corner of Ontario and Sanguinet, a one-block walk from the talk).
Come one, come all, for the presentation and/or for drinks!
If you have a spell checker, you want it to suggest a number of words that are close to the misspelt word. For humans, it's easy to look at 'teh' and know that it is close to 'the', but how does the computer know that? A really simple, language-independent way to do it, if you don't have any gold-standard data, is to assign costs to the various edits (substitution 2, deletion 1, insertion 1) and pick the cheapest sequence of edits.
The table below applies Levenshtein's algorithm (basically, substitution costs 2) letter by letter. The total distance between the two words, 4, is in the top-right corner, because it costs 2 to substitute 'u' for 'i' and 2 to substitute 't' for 'k'.
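The table-filling procedure described above can be sketched in a few lines of JavaScript (a minimal dynamic-programming implementation with the same costs: substitution 2, insertion 1, deletion 1; the function name is ours, not from the post's download):

```javascript
// Weighted edit distance: fill a table where cell [i][j] holds the
// cheapest cost of turning the first i letters of `source` into the
// first j letters of `target`, exactly like the table above.
function editDistance(source, target) {
  var INSERT = 1, DELETE = 1, SUBSTITUTE = 2;
  var m = source.length, n = target.length;
  var table = [];
  for (var i = 0; i <= m; i++) {
    table.push([]);
    for (var j = 0; j <= n; j++) {
      if (i === 0) {
        table[i][j] = j * INSERT;          // build target from nothing
      } else if (j === 0) {
        table[i][j] = i * DELETE;          // delete every source letter
      } else {
        var subCost = source[i - 1] === target[j - 1] ? 0 : SUBSTITUTE;
        table[i][j] = Math.min(
          table[i - 1][j] + DELETE,        // drop a source letter
          table[i][j - 1] + INSERT,        // add a target letter
          table[i - 1][j - 1] + subCost    // keep or substitute a letter
        );
      }
    }
  }
  return table[m][n]; // total distance ends up in the corner cell
}
```

Note that with these costs 'teh' is only distance 2 from 'the': deleting the out-of-place 'h' costs 1 and inserting it back after the 't' costs 1, which is cheaper than two substitutions. That is why a spell checker would rank 'the' as a close suggestion.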
And if you really like it, you can download it from GitHub. Click here to read more about edit distance.
Natural language processing is the technology for dealing with our most ubiquitous product: human language, as it appears in emails, web pages, tweets, product descriptions, newspaper stories, social media, and scientific articles, in thousands of languages and varieties. In the past decade, successful natural language processing applications have become part of our everyday experience, from spelling and grammar correction in word processors to machine translation on the web, from email spam detection to automatic question answering, from detecting people’s opinions about products or services to extracting appointments from your email. In this class, you’ll learn the fundamental algorithms and mathematical models for human language processing and how you can use them to solve practical problems in dealing with language data wherever you encounter it.
We are hosting a small bi-monthly NLP get-together to discuss and apply the Stanford NLP class to some local Montreal data. If you're interested in joining us, leave a comment below and we will tell you about our meeting times.