Every now and then, I get side-tracked from what I was (supposed to be) doing. This time, it was the combined result of preparing ICPC training problems, preparing a statistics tutorial for a postgraduate research methods course, and a conversation from last week about an isiZulu corpus with Langa Khumalo from UKZN’s ULPDO (and my co-author on several papers on isiZulu CNLs).
Starting from multilingual knowledge representation in ontologies, with an eye on linguistic linked data and controlled natural languages, we had developed a basic ontology for the Bantu noun class system to link with the…
When most people think of big data they think of numbers, but it turns out that a lot of big data — a lot of the output of our work and activity as humans in fact — is in the form of words.
Constructing Word-Based Text Compression Algorithms. R. Nigel Horspool and Gordon V. Cormack.
All algorithms in the text are based on “strict” alternating maximal strings of alphanumeric and non-alphanumeric characters.
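As a minimal sketch of that tokenization idea (my own illustration, not code from the paper): a text is split into maximal runs that alternate between alphanumeric and non-alphanumeric characters, so that concatenating the tokens reconstructs the original text exactly — the property a word-based compressor needs for lossless coding.

```python
import re

def alternating_tokens(text):
    """Split text into maximal alternating runs of alphanumeric
    ("word") and non-alphanumeric ("non-word") characters."""
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]+", text)

tokens = alternating_tokens("Hello, world! 42 times.")
print(tokens)
# ['Hello', ', ', 'world', '! ', '42', ' ', 'times', '.']

# Lossless: joining the tokens gives back the original text.
assert "".join(tokens) == "Hello, world! 42 times."
```

A compressor can then maintain two separate dictionaries — one for words, one for the separators between them — since adjacent tokens always alternate in type.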
Compressing Trigram Language Models With Golomb Coding. Ken Church, Ted Hart and Jianfeng Gao, Microsoft.
- Due to the sparsity of text — at test time we will inevitably see n-grams that did not appear in the training set — Katz proposed backing off from trigrams to bigrams (and from bigrams to unigrams) when we don’t have enough training data.
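The backoff idea can be sketched in a few lines. The version below is a deliberately simplified variant (a fixed backoff penalty `alpha`, as in “Stupid Backoff”, rather than Katz’s discounted counts and normalized backoff weights), just to show the fall-through from trigram to bigram to unigram:

```python
from collections import Counter

def backoff_prob(w3, w1, w2, tri, bi, uni, total, alpha=0.4):
    """Simplified backoff estimate for P(w3 | w1, w2).
    Uses the trigram relative frequency if the trigram was seen,
    otherwise backs off to the bigram, then the unigram,
    multiplying by a fixed penalty alpha at each step.
    (Not full Katz backoff: no discounting, no normalization.)"""
    if tri[(w1, w2, w3)] > 0:
        return tri[(w1, w2, w3)] / bi[(w1, w2)]
    if bi[(w2, w3)] > 0:
        return alpha * bi[(w2, w3)] / uni[w2]
    return alpha * alpha * uni[w3] / total

# Toy corpus for illustration.
words = "the cat sat on the mat the cat ate".split()
uni = Counter(words)
bi = Counter(zip(words, words[1:]))
tri = Counter(zip(words, words[1:], words[2:]))
total = len(words)

seen = backoff_prob("sat", "the", "cat", tri, bi, uni, total)
print(seen)    # trigram "the cat sat" was seen: count 1 / bigram count 2 = 0.5
unseen = backoff_prob("ate", "on", "the", tri, bi, uni, total)
print(unseen)  # falls all the way back to the unigram estimate for "ate"
```

Katz’s actual scheme replaces the fixed `alpha` with backoff weights computed so the conditional distribution still sums to one, using Good–Turing-style discounted counts.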