List all capitalized words in a file

A reader on the NoteTab list asked for a clip that generates a list of every word whose first letter is capitalized (the rest of the word can be upper or lower case, so long as the first letter is capitalized). I tried a couple of versions that worked pretty well on my test files. Because the sort function puts umlauted characters at the bottom of the alphabet and the reader wanted to preserve them, I decided that this method was better. The clip took 8 minutes to run for the reader on a 500K file. I didn’t disable screen updating, so you can watch the progress, but that also slows things down.

Here is the clip:

; by don at htmlfixit.com
; using a bunch of Hugo's ideas
; runs a text file and makes
; a list of all words that start
; with a capital letter
^!Menu Edit/Copy All
^!Toolbar Paste New
^!Replace "^P" >> " " ATIWS
^!Replace ")" >> " " ATIWS
^!Replace "(" >> " " ATIWS
^!Replace """ >> " " ATIWS
^!Replace "^T" >> " " ATIWS
^!Replace "," >> " " ATIWS
^!Replace "[" >> " " ATIWS
^!Replace "]" >> " " ATIWS
^!Replace "< " >> " " ATIWS
^!Replace ">" >> " " ATIWS
^!Replace "~" >> " " ATIWS
^!Replace "!" >> " " ATIWS
^!Replace "@" >> " " ATIWS
^!Replace "#" >> " " ATIWS
^!Replace "$" >> " " ATIWS
^!Replace "%" >> " " ATIWS
^!Replace "^" >> " " ATIWS
^!Replace "&" >> " " ATIWS
^!Replace "*" >> " " ATIWS
^!Replace "_" >> " " ATIWS
^!Replace "+" >> " " ATIWS
^!Replace "=" >> " " ATIWS
^!Replace "|" >> " " ATIWS
^!Replace "{" >> " " ATIWS
^!Replace "}" >> " " ATIWS
^!Replace "" >> " " ATIWS
^!Replace "/" >> " " ATIWS
^!Replace "?" >> " " ATIWS
^!Replace "." >> " " ATIWS
^!Replace ";" >> " " ATIWS
^!Replace ":" >> " " ATIWS
^!Replace "" >> " " ATIWS
^!Replace "•" >> " " ATIWS
^!Replace "– " >> " " ATIWS
^!Replace "´" >> " " ATIWS
^!Replace "’" >> " " ATIWS
^!Replace "“" >> " " ATIWS
^!Replace "‘" >> " " ATIWS
^!Replace "`" >> " " ATIWS
^!Replace "¡" >> " " ATIWS
^!Replace "¢" >> " " ATIWS
^!Replace "£" >> " " ATIWS
^!Replace "¤" >> " " ATIWS
^!Replace "¥" >> " " ATIWS
^!Replace "§" >> " " ATIWS
^!Replace "©" >> " " ATIWS
^!Replace "«" >> " " ATIWS

^!Menu Modify/Spaces/Single Space
^!Replace " " >> "^P" ATIWS
^!Replace "^P´" >> "^P" ATIWS
^!Replace "^P-" >> "^P" ATIWS
^!Replace "^P " >> "^P" ATIWS
^!Menu Edit/Copy All
^!SetClipboard ^$StrSort("^$GetClipboard$";1;1;1)$
^!Select All
^!Toolbar Paste
^!Jump 1

; following is to dump all lines whose first
; character is a number or lower case
:DumpBad
^!If ^$GetRow$ = ^$GetLinecount$ Sort2
^!Select +1
^!IfTrue ^$IsEmpty("^$GetLine$")$ NEXT ELSE SKIP_2
^!Keyboard DELETE
^!GoTo DumpBad

^!If "^$IsNumber("^$GetSelection$")$" = "1" SKIP
^!If "^$IsUppercase("^$GetSelection$")$" = "1" SKIP_4
^!Select Eol
^!Keyboard DELETE
^!Keyboard DELETE
^!GoTo DumpBad

:GoNext
^!Jump +1
^!GoTo DumpBad

; following is to eliminate single characters on one line
:Sort2
^!Jump 1

:Sort2a
^!Select Eol
^!IfError END
^!If ^$StrSize("^$GetSelection$")$ > 1 SKIP_2
^!Keyboard DELETE
^!Keyboard DELETE
^!Jump +1
^!GoTo Sort2a

This also removes any single-character lines (on the theory that those aren’t words).

Significant things done in this clip:
generating a list of words by replacing most non-alphanumeric characters with a space
collapsing runs of spaces into a single space, then replacing each space with a return so every word sits on its own line
sorting the words, now one per line, using the sort function in NoteTab
eliminating all lines that don’t have an uppercase letter as the first character (note that we needed to use the test clip info to be sure the character was alphabetic before testing whether it was upper case with ^$IsUppercase$; the clip does this by checking ^$IsNumber$ first)
removing lines with only one character on them
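
For anyone who wants to check the clip's output against another tool, here is a rough sketch of the same steps in Python rather than clip code. It is not a translation of the clip: the function name `capitalized_words` and the sample text are made up for illustration, it treats every non-letter, non-digit character (including hyphens and apostrophes) as a separator, and it assumes duplicates should be dropped. Python's `isupper()` is already false for digits, so the separate number test the clip needs is not required here.

```python
import re


def capitalized_words(text):
    """Return the unique words that start with a capital letter, sorted."""
    # Steps 1-2: pull out runs of letters and digits; everything else
    # (punctuation, whitespace, underscores) acts as a separator.
    words = re.findall(r"[^\W_]+", text)
    # Steps 4-5: keep words longer than one character whose first
    # character is an upper-case letter (digits fail isupper()).
    kept = {w for w in words if len(w) > 1 and w[0].isupper()}
    # Step 3: plain code-point sort, so umlauted capitals land after Z.
    return sorted(kept)


sample = "The quick Ärger of 3D printing, said Bob. A test; the end. Bob agreed."
print(capitalized_words(sample))  # ['Bob', 'The', 'Ärger']
```

The plain sort here shows the same quirk mentioned at the top: Ärger ends up after every unaccented word.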
