Discover this podcast and so much more

Podcasts are free to enjoy without a subscription. We also offer ebooks, audiobooks, and so much more for just $11.99/month.

Very Large Corpora and Zipf's Law

Very Large Corpora and Zipf's Law

FromData Skeptic


Very Large Corpora and Zipf's Law

FromData Skeptic

ratings:
Length:
24 minutes
Released:
Jan 18, 2019
Format:
Podcast episode

Description

The earliest efforts to apply machine learning to natural language tended to convert every token (every word, more or less) into a unique feature. While techniques like stemming may have cut the number of unique tokens down, researchers always had to face a problem that was highly dimensional. Naive Bayes algorithm was celebrated in NLP applications because of its ability to efficiently process highly dimensional data. Of course, other algorithms were applied to natural language tasks as well. While different algorithms had different strengths and weaknesses to different NLP problems, an early paper titled Scaling to Very Very Large Corpora for Natural Language Disambiguation popularized one somewhat surprising idea. For many NLP tasks, simply providing a large corpus of examples not only improved accuracy, but it also showed that asymptotically, some algorithms yielded more improvement from working on very, very large corpora. Although not explicitly in about NLP, the noteworthy paper The Unreasonable Effectiveness of Data emphasizes this point further while paying homage to the classic treatise The Unreasonable Effectiveness of Mathematics in the Natural Sciences. In this episode, Kyle shares a few thoughts along these lines with Linh Da. The discussion winds up with a brief introduction to Zipf's law. When applied to natural language, Zipf's law states that the frequency of any given word in a corpus (regardless of language) will be proportional to its rank in the frequency table.
Released:
Jan 18, 2019
Format:
Podcast episode

Titles in the series (100)

Data Skeptic is a data science podcast exploring machine learning, statistics, artificial intelligence, and other data topics through short tutorials and interviews with domain experts.