RosettaCodeData/Task/Word-frequency/00-TASK.txt

;Task:
Given a text file and an integer &nbsp; '''n''', &nbsp; print/display the &nbsp; '''n''' &nbsp; most
common words in the file &nbsp; (and the number of their occurrences) &nbsp; in decreasing frequency.


For the purposes of this task:
* &nbsp; A word is a sequence of one or more contiguous letters.
* &nbsp; You are free to define what a &nbsp; ''letter'' &nbsp; is.
* &nbsp; Underscores, accented letters, apostrophes, hyphens, and other special characters can be handled at your discretion.
* &nbsp; You may treat a compound word like &nbsp; '''well-dressed''' &nbsp; as either one word or two.
* &nbsp; The word &nbsp; '''it's''' &nbsp; could also be one or two words as you see fit.
* &nbsp; You may also choose not to support non US-ASCII characters.
* &nbsp; Assume words will not span multiple lines.
* &nbsp; Don't worry about normalization of word spelling differences.
* &nbsp; Treat &nbsp; '''color''' &nbsp; and &nbsp; '''colour''' &nbsp; as two distinct words.
* &nbsp; Uppercase letters are considered equivalent to their lowercase counterparts.
* &nbsp; Words of equal frequency can be listed in any order.
* &nbsp; Feel free to explicitly state the thoughts behind the program decisions.


Show example output using [http://www.gutenberg.org/files/135/135-0.txt Les Misérables from Project Gutenberg] as the text file input and display the top &nbsp; '''10''' &nbsp; most used words.


;History:
This task was originally taken from programming pearls from [https://doi.org/10.1145/5948.315654 Communications of the ACM June 1986 Volume 29 Number 6]
where this problem is solved by Donald Knuth using literate programming and then critiqued by Doug McIlroy,
demonstrating solving the problem in a 6 line Unix shell script (provided as an example below).


;References:
*[http://franklinchen.com/blog/2011/12/08/revisiting-knuth-and-mcilroys-word-count-programs/ McIlroy's program]


{{Template:Strings}}
<br><br>