Strip HTML except Images and Links

A reader asked whether it is possible to strip all HTML except image and link tags, and then reformat the result as a “clean” HTML page with the links and images intact. I wrote this little script that does it as a NoteTab clip. It simply substitutes an unusual string for the < and > signs of the tags to keep, removes the HTML tags, and then substitutes the < and > back. The removal tool is fooled into leaving these tags alone because, without the opening < and closing >, they no longer look like HTML tags.
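For readers who do not use NoteTab, the same placeholder trick can be sketched in Python. This is only an illustration of the idea, not the clip itself: the function name and the regular expressions are my own assumptions, the placeholder strings are the ones the clip uses, and the sketch skips the "Document to HTML" step. It also assumes no attribute value contains a literal > character.

```python
import re

# Placeholder strings taken from the clip; anything unlikely to occur in the text works.
OPEN, CLOSE = "**|[|**", "**|]|**"

# Tags to preserve: <a ...>, </a> and <img ...>, in any letter case (assumed pattern).
KEEP = re.compile(r"<(/?a\b[^>]*|img\b[^>]*)>", re.IGNORECASE)
# Naive "any tag" pattern standing in for NoteTab's strip-HTML command.
ANY_TAG = re.compile(r"<[^>]*>")


def strip_except_links_and_images(html: str) -> str:
    """Hypothetical helper: remove every tag except A, /A and IMG."""
    # 1. Hide the brackets of the tags we want to keep.
    protected = KEEP.sub(lambda m: OPEN + m.group(1) + CLOSE, html)
    # 2. Strip everything that still looks like a tag.
    stripped = ANY_TAG.sub("", protected)
    # 3. Put the real brackets back.
    return stripped.replace(OPEN, "<").replace(CLOSE, ">")


sample = '<p>See <a href="x.html">this <b>page</b></a> and <IMG src="p.png"></p>'
print(strip_except_links_and_images(sample))
```

As in the clip, the order matters: the placeholders have to go in before the tags are stripped and come out afterwards.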

;effort by don at

;clip strips html except for image and a href
;tags, reformats to html using notetab
;and then restores fully the image and a href
;tags … note will not preserve tables etc.

;go to start of document
^!Jump Doc_Start
;turn off screen to hasten the job
^!SetScreenUpdate Off

;loop to check each tag to see if it is one to keep
;set counter to 0; unless it changes, this tag
;will not be processed
^!Set %processtag%="0"

:Loop
;find next tag start
^!Find "<" TIS
;quit when no more tags
^!IfError Clean
^!ClearVariables

;### %TAG% will be empty if cursor is not inside a tag.
;determine if tag and get the name of the tag
;if not a tag, cycle to next via NotTag subroutine
^!Set %TAG%="^$GetHtmlTag(TRUE)$"
^!IfTrue ^$IsEmpty(^%TAG%)$ NotTag
^!Set %TAGNAME%="^$GetHtmlTagName("^%TAG%";UPPERCASE)$"
^!Set %TAG%="^$GetSelection$"

;if tag is A, /A or IMG (no matter on case)
;then go to DOTAG where we will process it
^!If "A" = "^%TAGNAME%" DOTAG
^!If "/A" = "^%TAGNAME%" DOTAG
^!If "IMG" = "^%TAGNAME%" DOTAG

;get here and go to next tag if not a
;special tag to preserve
:NotTag
^!Jump Select_End
^!Goto Loop

;process three special tags by putting
;unusual character sequence before and
;after and then it won't be removed
:DOTAG
^!Set %TAG%=^$StrDeleteLeft("^%TAG%";1)$
^!Set %TAG%=^$StrDeleteRight("^%TAG%";1)$
^!Set %TAG%=**|[|**^%TAG%**|]|**
^!InsertText ^%TAG%
^!Goto NotTag

:Clean
;remove html (won't remove the replaced
;things as it doesn't think they are html)
^!Keyboard SHIFT+CTRL+T
;convert document to html document
^!ToolBar Document to HTML
;reinstate the < and > so they
;magically become html tags again
;for the special three A, /A, IMG
^!Replace "**|[|**" >> "<" ATIWS
^!Replace "**|]|**" >> ">" ATIWS
;turn screen back on
^!SetScreenUpdate On
;finish at the top
^!Jump Doc_Start
;tell em we are done
^!Info [C]finished with this file
;exit the clip
^!Goto EXIT
;line 77 including all blanks and comments
