Tags » Natural Language

Quasi wordles of isiZulu online newspaper articles from this weekend

Every now and then, I get side-tracked from what I was (supposed to be) doing. This time, it was the result of a combination of preparing ICPC training problems, preparing a statistics tutorial for the postgraduate research methods course, and a conversation last week about an isiZulu corpus with Langa Khumalo from UKZN’s ULPDO (my co-author on several papers on isiZulu CNLs).


An orchestration of ontologies for linguistic knowledge

Starting from multilingual knowledge representation in ontologies, and with an eye on linguistic linked data and controlled natural languages, we developed a basic ontology for the Bantu noun class system [1] to link with the…


a16z Podcast: It's Not What You Say, It's How You Say It -- When Language Meets Big Data

When most people think of big data, they think of numbers, but it turns out that a lot of big data (a lot of the output of our work and activity as humans, in fact) is in the form of words.

constructing word based text compression algorithms


Constructing Word-Based Text Compression Algorithms. R. Nigel Horspool and Gordon V. Cormack.


  • All algorithms in the text are based on a “strict” alternation of maximal strings of alphanumeric and non-alphanumeric characters.

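A minimal sketch of the word/non-word parsing described above, assuming ASCII text where "alphanumeric" means `[A-Za-z0-9]`. The function name `alternating_tokens` is my own and not from the paper. Because each token is a maximal run, the two token types alternate automatically, and concatenating the tokens reproduces the input exactly.

```python
import re

# A "word" is a maximal run of alphanumeric characters; a "non-word" is a
# maximal run of anything else. The ASCII character class is an assumption.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9]+")


def alternating_tokens(text):
    """Split text into maximal word and non-word strings, in order."""
    return TOKEN_RE.findall(text)


tokens = alternating_tokens("Hello, world! 42")
print(tokens)
```

A word-based compressor would then encode the word tokens and the non-word tokens with two separate models, since their statistics differ a great deal.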

compressing trigram language models with golomb coding


Compressing Trigram Language Models With Golomb Coding. Ken Church, Ted Hart and Jianfeng Gao, Microsoft.


Katz Backoff
  • Because text is sparse, we will see n-grams at test time that did not appear in the training set. Katz therefore proposed backing off from trigrams to bigrams (and from bigrams to unigrams) when there is not enough training data.
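To make the backoff idea concrete, here is a simplified sketch that backs off from bigrams to unigrams only. It is not Katz's exact method: it uses a fixed absolute discount in place of Good-Turing discounting, and the names `train` and `backoff_prob` are hypothetical. The probability mass removed from seen bigrams is redistributed over the unseen words in proportion to their unigram probabilities, so the distribution for each context still sums to one.

```python
from collections import Counter


def train(tokens):
    """Count unigrams and adjacent bigrams in a token list."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))


def backoff_prob(word, prev, unigrams, bigrams, discount=0.5):
    """P(word | prev) with backoff to unigrams (fixed discount, an assumption)."""
    total = sum(unigrams.values())

    def p_uni(w):
        return unigrams[w] / total

    # Words seen after `prev` in training, with their bigram counts
    followers = {w: c for (p, w), c in bigrams.items() if p == prev}
    context_count = sum(followers.values())
    if context_count == 0:
        return p_uni(word)  # unseen context: use the unigram estimate
    if word in followers:
        return (followers[word] - discount) / context_count  # discounted bigram
    # Mass taken from the seen bigrams, shared among the unseen words
    reserved = discount * len(followers) / context_count
    unseen_mass = 1.0 - sum(p_uni(w) for w in followers)
    if unseen_mass <= 0:
        return 0.0
    return reserved * p_uni(word) / unseen_mass


unigrams, bigrams = train("the cat sat on the mat the cat ran".split())
print(backoff_prob("cat", "the", unigrams, bigrams))  # seen bigram
print(backoff_prob("sat", "the", unigrams, bigrams))  # backed-off estimate
```

Words that never occur in training still get probability zero here. A full model would also need an out-of-vocabulary treatment and the extra trigram level.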

development of a spelling list


Development of a Spelling List, M.D. McIlroy, AT&T Bell Labs.


On spell checking
  • All leading and trailing punctuation is removed.
  • Embedded punctuation is included.
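The two punctuation rules above can be sketched as follows. This assumes whitespace-separated tokens and the ASCII punctuation set in Python's `string.punctuation`. Both choices are mine, not necessarily McIlroy's, and the function names are hypothetical.

```python
import string


def normalize_token(token):
    """Strip leading and trailing punctuation; embedded punctuation stays."""
    return token.strip(string.punctuation)


def candidate_words(text):
    """Return the words of `text` that would be looked up in the spelling list."""
    stripped = (normalize_token(t) for t in text.split())
    return [w for w in stripped if w]  # drop tokens that were all punctuation


print(candidate_words('"Don\'t panic," she said (twice).'))
```

The apostrophe in "Don't" survives because it is embedded, while the surrounding quotes, comma, parentheses and full stop are removed.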