When most people think of big data they think of numbers, but a lot of big data, and in fact a lot of the output of our work and activity as humans, is in the form of words.
Constructing Word-Based Text Compression Algorithms. R. Nigel Horspool and Gordon V. Cormack.
All algorithms in the text are based on a “strict” alternation of maximal strings of alphanumeric and non-alphanumeric characters.
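As a rough illustration of that parsing step, here is a minimal sketch that splits a string into maximal runs that alternate between alphanumeric and non-alphanumeric characters. The function name `alternate_tokens` is hypothetical, and treating "alphanumeric" as ASCII letters and digits is an assumption; the paper's exact character classes and its handling of edge cases may differ.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits only.
# Each match is a maximal run of one class, so consecutive matches
# necessarily alternate between the two classes.
_RUN = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9]+")


def alternate_tokens(text):
    """Split text into alternating maximal alphanumeric / non-alphanumeric runs."""
    return _RUN.findall(text)


sample = "Hello, world! 42"
tokens = alternate_tokens(sample)
print(tokens)  # ['Hello', ', ', 'world', '! ', '42']

# The split is lossless: concatenating the runs restores the input.
assert "".join(tokens) == sample
```

Because the runs alternate, a word-based compressor can keep one dictionary for words and another for the separators between them, and the decoder always knows which dictionary the next token comes from.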
Compressing Trigram Language Models With Golomb Coding. Ken Church, Ted Hart and Jianfeng Gao, Microsoft.
- Because text is sparse, the test set will contain n-grams that did not appear in the training set, so Katz proposed backing off from trigrams to bigrams (and from bigrams to unigrams) when we don’t have enough training data.
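The backing-off idea can be sketched as follows. This is a deliberately simplified version, not Katz's estimator: it uses a fixed penalty `alpha` (an assumed value) in place of Katz's discounting and normalized backoff weights, so the scores are not true probabilities. The helper names `ngram_counts` and `backoff_score` are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def backoff_score(w1, w2, w3, uni, bi, tri, total, alpha=0.4):
    """Score w3 given (w1, w2), falling back to shorter contexts when unseen.

    Simplification: a fixed penalty alpha replaces Katz's discounting.
    """
    if tri[(w1, w2, w3)] > 0:
        # Trigram was seen in training: use its relative frequency.
        return tri[(w1, w2, w3)] / bi[(w1, w2)]
    if bi[(w2, w3)] > 0:
        # Back off to the bigram estimate, with a penalty.
        return alpha * bi[(w2, w3)] / uni[(w2,)]
    # Back off again to the unigram estimate.
    return alpha * alpha * uni[(w3,)] / total


train = "the cat sat on the mat the cat ran".split()
uni, bi, tri = (ngram_counts(train, n) for n in (1, 2, 3))
total = len(train)

seen = backoff_score("the", "cat", "sat", uni, bi, tri, total)     # trigram seen
unseen = backoff_score("on", "the", "cat", uni, bi, tri, total)    # falls to bigram
```

Here "the cat sat" occurs in training, so it is scored from trigram counts, while "on the cat" does not and is scored from the bigram "the cat" instead.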
Hash Table Sizes for Storing N-Grams for Text Processing. Zhong Gu and Daniel Berleant, Technical Report 10-00a, Oct 2000, Software Research Lab, Iowa State University.
Using LocalMaxs Algorithm for the Extraction of Contiguous and Non-contiguous Multiword Lexical Units. Joaquim Ferreira da Silva et al., Universidade Nova de Lisboa.