RosettaCodeData/Task/Word-frequency/00-TASK.txt

;Task:
Given a text file and an integer   '''n''',   print/display the   '''n'''   most
common words in the file   (and the number of their occurrences)   in decreasing frequency.
For the purposes of this task:
*   A word is a sequence of one or more contiguous letters.
*   You are free to define what a   ''letter''   is.
*   Underscores, accented letters, apostrophes, hyphens, and other special characters can be handled at your discretion.
*   You may treat a compound word like   '''well-dressed'''   as either one word or two.
*   The word   '''it's'''   could also be one or two words as you see fit.
*   You may also choose not to support non-US-ASCII characters.
*   Assume words will not span multiple lines.
*   Don't worry about normalization of word spelling differences.
*   Treat   '''color'''   and   '''colour'''   as two distinct words.
*   Uppercase letters are considered equivalent to their lowercase counterparts.
*   Words of equal frequency can be listed in any order.
*   Feel free to explicitly state the thoughts behind the program decisions.
Show example output using [http://www.gutenberg.org/files/135/135-0.txt Les Misérables from Project Gutenberg] as the text file input and display the top   '''10'''   most used words.
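The rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference solution: a ''letter'' is taken to be an ASCII letter only, so '''well-dressed''' and '''it's''' each split into two words and accented characters act as word separators. The names <code>WORD_RE</code>, <code>top_words</code>, and <code>top_words_in_file</code> are hypothetical and chosen only for this example.

```python
import re
from collections import Counter

# Assumption: a "letter" is an ASCII letter a-z (after lowercasing).
# Hyphens, apostrophes, digits and accented characters separate words.
WORD_RE = re.compile(r"[a-z]+")


def top_words(text, n):
    """Return the n most common words in text as (word, count) pairs."""
    # Lowercasing first makes uppercase and lowercase letters equivalent.
    words = WORD_RE.findall(text.lower())
    # most_common sorts by decreasing count; the order of ties is not specified by the task.
    return Counter(words).most_common(n)


def top_words_in_file(path, n, encoding="utf-8"):
    """Same as top_words, but reads the file line by line."""
    counts = Counter()
    with open(path, encoding=encoding) as handle:
        # Words are assumed not to span lines, so each line is counted independently.
        for line in handle:
            counts.update(WORD_RE.findall(line.lower()))
    return counts.most_common(n)


if __name__ == "__main__":
    sample = "The cat and the dog. It's the well-dressed CAT!"
    for word, count in top_words(sample, 2):
        print(f"{word:<10} {count}")
```

To treat '''well-dressed''' or '''it's''' as a single word instead, one option is to widen the pattern, for example to <code>[a-z]+(?:['-][a-z]+)*</code>. Either choice is permitted by the task.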
;History:
This task was originally taken from the ''Programming Pearls'' column in [https://doi.org/10.1145/5948.315654 Communications of the ACM, June 1986, Volume 29, Number 6],
where the problem is solved by Donald Knuth using literate programming and then critiqued by Doug McIlroy,
who demonstrates a solution in a six-line Unix shell script (provided as an example below).
;References:
*[http://franklinchen.com/blog/2011/12/08/revisiting-knuth-and-mcilroys-word-count-programs/ McIlroy's program]
{{Template:Strings}}
<br><br>