;Task:
Given a text file and an integer '''n''', print/display the '''n''' most
common words in the file (and the number of their occurrences) in decreasing frequency.

For the purposes of this task:
* A word is a sequence of one or more contiguous letters.
* You are free to define what a ''letter'' is.
* Underscores, accented letters, apostrophes, hyphens, and other special characters can be handled at your discretion.
* You may treat a compound word like '''well-dressed''' as either one word or two.
* The word '''it's''' could also be one or two words as you see fit.
* You may also choose not to support non-US-ASCII characters.
* Assume words will not span multiple lines.
* Don't worry about normalizing spelling differences between words.
* Treat '''color''' and '''colour''' as two distinct words.
* Uppercase letters are considered equivalent to their lowercase counterparts.
* Words of equal frequency can be listed in any order.
* Feel free to explain the reasoning behind your program's design decisions.

Show example output using [http://www.gutenberg.org/files/135/135-0.txt Les Misérables from Project Gutenberg] as the text file input and display the top '''10''' most used words.
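The rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference solution: a ''letter'' is taken to be one of the ASCII characters a to z, so accented letters, apostrophes and hyphens act as word separators (which the task permits), and the names <code>WORD_RE</code> and <code>top_words</code> are hypothetical, chosen only for this example.

```python
import re
from collections import Counter

# Assumption: a "letter" is an ASCII letter a-z. Everything else
# (digits, apostrophes, hyphens, accented letters) separates words.
WORD_RE = re.compile(r"[a-z]+")


def top_words(text, n):
    """Return the n most common words in text as (word, count) pairs,
    in order of decreasing frequency."""
    # Lowercase first so that uppercase and lowercase forms are counted together.
    words = WORD_RE.findall(text.lower())
    return Counter(words).most_common(n)


if __name__ == "__main__":
    sample = "The cat and the hat. It's the well-dressed cat!"
    for word, count in top_words(sample, 2):
        print(f"{word}\t{count}")
```

To apply this to the requested input, one could read the downloaded file (for instance with <code>open("135-0.txt", encoding="utf-8").read()</code>, where the local file name is an assumption) and call <code>top_words(text, 10)</code>. Under the ASCII-only assumption, a word such as "Misérables" is split into two words around the accented letter.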
;History:
This task was originally taken from the ''Programming Pearls'' column in [https://doi.org/10.1145/5948.315654 Communications of the ACM, June 1986, Volume 29, Number 6],
where the problem is solved by Donald Knuth using literate programming and then critiqued by Doug McIlroy,
who demonstrates solving it with a 6-line Unix shell script (provided as an example below).

;References:
* [http://franklinchen.com/blog/2011/12/08/revisiting-knuth-and-mcilroys-word-count-programs/ McIlroy's program]

{{Template:Strings}}
<br><br>