6 August 2010
A sends:

If you run the following commands on a Windows machine (with Cygwin installed) you can produce a list of all words in the Wikileaks Afghan War Diary AFG.CSV file. You can also produce a list of words by frequency.

http://cryptome.org/0002/afg/afg_list.txt.zip (2.9MB)
http://cryptome.org/0002/afg/afg_freqlist.txt.zip (2.8MB)

__________

Commands

REM This is a Windows batch file that sequences Cygwin Unix utils.
REM This batch file makes a list of all words in the Afghan War Diary CSV file, with frequencies.

REM remove the formatting crap: turn whitespace and punctuation into newlines
c:\cygwin\bin\tr [:space:][:blank:][:punct:] \n < afg.csv > afg.tr

REM sort alphabetically, ignore case
c:\cygwin\bin\sort -f -b -d < afg.tr > afg.srt

REM filter out duplicates; the -c adds counts to the output file
c:\cygwin\bin\uniq -c < afg.srt > afglist.txt

REM list by frequency; -n compares the leading counts as numbers
c:\cygwin\bin\sort -n < afglist.txt > afgfreq.txt

Sample output:

     17 PAID
      4 Paid
    555 paid
      1 paided
      1 Paien
      2 Paienda
      2 paient
      1 PAIL
     23 pail
      1 PAILS
      1 Pails
      5 pails
      2 PAIMAKTHU
      2 Paiman
      1 PAIMONAR
    103 PAIN
     53 Pain
    469 pain
      1 PAINBAGH
      1 Painda
      2 Paindah
      1 Paindai
      1 Paindakhel
      2 PAINFUL
      1 Painful
      9 painful
     73 PAINKILLER
      7 Painkiller
      3 painkiller
      1 painkillers
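
For anyone working in a Cygwin or other Unix shell rather than from a batch file, the same steps can be chained into one pipeline with no intermediate files. This is only a sketch, not the submitter's method: the function name word_freq is made up for illustration, and it departs from the batch file in three ways (tr -s squeezes repeated separators, blank lines are dropped, and the result is sorted with the highest counts first).

```shell
#!/bin/sh
# word_freq FILE (hypothetical helper name, not part of the original batch file)
# Prints "count word" lines, most frequent first; case variants are counted
# separately, as in the sample output above.
word_freq() (
    # Force byte-wise collation so the ordering is the same on every machine.
    LC_ALL=C
    export LC_ALL

    # 1. Turn whitespace and punctuation into newlines; -s squeezes runs of
    #    separators into a single newline (assumption: blank "words" are unwanted).
    # 2. Drop the one empty line that can remain at the start of the stream.
    # 3. Sort so identical words are adjacent, then count them with uniq -c.
    # 4. Order by count (numeric, descending), breaking ties alphabetically.
    tr -s '[:space:][:punct:]' '\n' < "$1" |
        grep -v '^$' |
        sort |
        uniq -c |
        sort -k1,1nr -k2,2
)
```

Usage would look like: word_freq afg.csv > afgfreq.txt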