RosettaCodeData/Task/Word-frequency/APL/word-frequency.apl

31 lines
1.2 KiB
APL
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

⍝⍝ NOTE: input text is assumed to be encoded in ISO-8859-1
⍝⍝ (The suggested example '135-0.txt' of Les Miserables on
⍝⍝ Project Gutenberg is in UTF-8.)
⍝⍝
⍝⍝ Use Unix 'iconv' if required
⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝
r lowerAndStrip s;stripped;mixedCase
⍝⍝ Convert text to lowercase, punctuation and newlines to spaces
stripped ' abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz*'
mixedCase ⎕av[11],' ,.?!;:"''()[]-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
r stripped[mixedCase s]
⍝⍝ Return the _n_ most frequent words and a count of their occurrences
⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝
r n wordCount fname ;D;wl;sidx;swv;pv;wc;uw;sortOrder
D lowerAndStrip (⎕fio['read_file'] fname) ⍝ raw text with newlines
wl (~ D ' ') D
sidx wl
swv wl[sidx]
pv +\ 1,~2 / swv
wc ¨ pv pv
uw 1 ¨ pv swv
sortOrder wc
r n[2] uw[sortOrder],[0.5]wc[sortOrder]
5 wordCount '135-0.txt'
the of and a to
41042 19952 14938 14526 13942