Tags » Natural Language

a16z Podcast: It's Not What You Say, It's How You Say It -- When Language Meets Big Data

When most people think of big data they think of numbers, but it turns out that a lot of big data (in fact, a lot of the output of our work and activity as humans) is in the form of words.

constructing word based text compression algorithms

Reference

Constructing Word-Based Text Compression Algorithms. R. Nigel Horspool and Gordon V. Cormack.

Notes

  • All algorithms in the paper assume a “strict” alternation of maximal strings of alphanumeric and non-alphanumeric characters.

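
The alternation described in the note above can be sketched as a small tokenizer. This is only an illustration, not the paper's code: it assumes "alphanumeric" means ASCII letters and digits, and the names `split_alternating` and `is_strictly_alternating` are hypothetical.

```python
import re
import string

# Assumption: "alphanumeric" = ASCII letters and digits. The paper's exact
# character classes may differ.
ALNUM_CHARS = set(string.ascii_letters + string.digits)
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9]+")


def split_alternating(text):
    """Split text into maximal runs of alphanumeric and non-alphanumeric
    characters. Because every run is maximal, the two kinds alternate."""
    return TOKEN_RE.findall(text)


def is_strictly_alternating(tokens):
    """Return True if no two adjacent tokens belong to the same class."""
    kinds = [tok[0] in ALNUM_CHARS for tok in tokens]
    return all(a != b for a, b in zip(kinds, kinds[1:]))


sample = "Hello, world! 42 times."
tokens = split_alternating(sample)
print(tokens)
print(is_strictly_alternating(tokens))
```

Joining the tokens reproduces the input exactly, which is the property a word-based compressor needs: words and the separators between them can be modeled as two interleaved streams without losing any characters.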

compressing trigram language models with golomb coding

Reference

Compressing Trigram Language Models With Golomb Coding. Ken Church, Ted Hart and Jianfeng Gao, Microsoft.

Notes

Katz Backoff
  • Text is sparse, so test data will contain n-grams that never appeared in the training set. Katz proposed backing off from trigrams to bigrams (and from bigrams to unigrams) when there is not enough training data for the longer n-gram.
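
The back-off idea can be sketched at the bigram-to-unigram level; the trigram-to-bigram step follows the same pattern. This is a simplified illustration rather than Katz's full method: it swaps the Good-Turing discounts Katz uses for a fixed absolute discount (the `discount` parameter is an assumption), and the function names are hypothetical.

```python
from collections import Counter


def train(tokens):
    """Count unigrams and bigrams in a token list."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))


def backoff_prob(word, prev, unigrams, bigrams, discount=0.5):
    """P(word | prev) with back-off from bigrams to unigrams.

    Assumption: a fixed absolute discount stands in for the Good-Turing
    discounting of the original Katz formulation.
    """
    total = sum(unigrams.values())

    def p_uni(w):
        return unigrams[w] / total

    # Words seen after `prev`, and how often `prev` occurs as a context.
    seen = {b: c for (a, b), c in bigrams.items() if a == prev}
    context_count = sum(seen.values())
    if context_count == 0:
        return p_uni(word)  # unseen context: use the unigram estimate
    if word in seen:
        return (seen[word] - discount) / context_count
    # Mass freed by discounting is shared among unseen words in
    # proportion to their unigram probabilities.
    reserved = discount * len(seen) / context_count
    unseen_mass = 1.0 - sum(p_uni(w) for w in seen)
    if unseen_mass <= 1e-12:
        return 0.0  # every vocabulary word was already seen after `prev`
    return reserved * p_uni(word) / unseen_mass


unigrams, bigrams = train("the cat sat on the mat the cat ate".split())
print(backoff_prob("cat", "the", unigrams, bigrams))  # seen bigram
print(backoff_prob("sat", "the", unigrams, bigrams))  # backed-off estimate
```

The normalization by `unseen_mass` keeps the conditional distribution summing to one over the vocabulary, which is the role the back-off weight plays in the original formulation.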

development of a spelling list

Reference

Development of a Spelling List, M.D. McIlroy, AT&T Bell Labs.

Notes

On spell checking
  • All leading and trailing punctuation is removed.
  • Embedded punctuation is included.
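
The two punctuation rules above can be sketched as follows. This is a minimal illustration, not McIlroy's implementation: it assumes whitespace-separated tokens and ASCII punctuation as defined by `string.punctuation`, and the function names are hypothetical.

```python
import string


def normalize_word(token):
    """Strip leading and trailing punctuation; embedded punctuation stays."""
    return token.strip(string.punctuation)


def words_for_spellcheck(text):
    """Split on whitespace (an assumption) and drop tokens that were
    nothing but punctuation."""
    normalized = (normalize_word(tok) for tok in text.split())
    return [word for word in normalized if word]


print(words_for_spellcheck('"Don\'t stop," she said (re-enter the AT&T lab).'))
```

Words such as `Don't`, `re-enter`, and `AT&T` keep their internal punctuation, so the lookup list has to contain them in that form.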

hash table sizes for storing n grams for text processing

Reference

Hash Table Sizes for Storing N-Grams for Text Processing. Zhong Gu and Daniel Berleant, Technical Report 10-00a, Oct 2000, Software Research Lab, Iowa State University.


using localmaxs algorithm for the extraction of contiguous and non contiguous multiword lexical units

Reference

Using LocalMaxs Algorithm for the Extraction of Contiguous and Non-contiguous Multiword Lexical Units. Joaquim Ferreira da Silva et al., Universidade Nova de Lisboa.

Notes…

Algorithm