Regex HTML link and text extractor martin@wardenerbros.com
LINK \b(((src|href|action|url) *(=|:) *(?<mh>"|'|))(?<url>[\w ~$!*'/.?=#&@:%+,();\-\[\]]+)\k<mh>|url *\( *(?<mc>"|'|)(?<url>[\w ~$!*'/.?=#&@:%+,();\-\[\]]+)\k<mc>\))

TEXT (<(?<tag>script|style)[\s\S]*?</\k<tag>>)|<!--[\s\S]*?-->|<[\s\S]*?>|(?<text>[^<>]*)


This application is a demo showcasing how efficiently regex patterns can extract either links or plain text from HTML pages. Enter the URL of the page whose HTML you want to retrieve. Clicking the Links button displays a list of all links extracted from that page, including their declaring attributes. The same regex also captures the URL of each link separately, in the same operation, and clicking URLs produces a result list without the declarations. Both links and URLs are detected and extracted during the same sweep, but highlighting the URLs within the link list during match iteration requires adding HTML-formatted color coding, which takes a relatively long time and so yields unrepresentative parse times. For that reason, Links and URLs have separate buttons. Clicking Link markup displays the complete HTML markup of the page, with the links highlighted.
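As a rough illustration of the single-sweep idea, here is a minimal sketch of the LINK pattern adapted to Python's `re` module. Python is an assumption made for this example; the language of the application itself is not stated. Python's `re` does not allow the group name `url` to be used twice, so the two captures are renamed `url1` and `url2` here (hypothetical names), and `extract_links` and `SAMPLE` are likewise invented for the example.

```python
import re

# URL character class, copied from the LINK pattern.
URL_CHARS = r"[\w ~$!*'/.?=#&@:%+,();\-\[\]]+"

# LINK pattern adapted to Python syntax: (?P<name>...) and (?P=name)
# replace (?<name>...) and \k<name>; url1/url2 are hypothetical names.
LINK = re.compile(
    r"\b(?:"
    r"(?:src|href|action|url) *[=:] *(?P<mh>\"|'|)(?P<url1>" + URL_CHARS + r")(?P=mh)"
    r"|"
    r"url *\( *(?P<mc>\"|'|)(?P<url2>" + URL_CHARS + r")(?P=mc)\)"
    r")"
)


def extract_links(html):
    """Return (link, url) pairs: the full declaration and the bare URL."""
    pairs = []
    for m in LINK.finditer(html):
        # Exactly one of the two URL groups takes part in each match.
        url = m.group("url1") if m.group("url1") is not None else m.group("url2")
        pairs.append((m.group(0), url))
    return pairs


# Hypothetical sample document for the example.
SAMPLE = (
    '<style>body { background: url(bg.gif) }</style>'
    '<a href="page.html">Home</a>'
    '<img src="img/logo.png">'
    '<form action="/submit"></form>'
)

for link, url in extract_links(SAMPLE):
    print(link, "->", url)
```

Each match yields both the full declaration (the Links view) and the bare URL (the URLs view), so no second pass over the document is needed.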


Similarly, clicking the Text button will detect and extract all textual content (text that is not contained within a tag) from the page and display it with all the HTML stripped away. Text markup shows the complete HTML markup, with the text strings highlighted in place.
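The text extraction can be sketched the same way. This is a minimal example of the TEXT pattern adapted to Python's `re` module (again an assumption; the application's own language is not stated). The function name `extract_text` and the sample document are invented for the example. The pattern consumes script and style blocks, comments and tags without capturing them, so only the `text` group carries content.

```python
import re

# TEXT pattern adapted to Python syntax: (?P<name>...) and (?P=name)
# replace (?<name>...) and \k<name>.
TEXT = re.compile(
    r"(<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)"  # whole script/style block
    r"|<!--[\s\S]*?-->"                             # comment
    r"|<[\s\S]*?>"                                  # any other tag
    r"|(?P<text>[^<>]*)"                            # text outside tags
)


def extract_text(html):
    """Concatenate the 'text' group of every match, in document order."""
    return "".join(m.group("text") or "" for m in TEXT.finditer(html))


# Hypothetical sample document for the example.
SAMPLE = (
    '<html><head><style>p { color: red }</style>'
    '<script>var x = "<b>";</script></head>'
    '<body><!-- note --><p>Hello <b>world</b>!</p></body></html>'
)

print(extract_text(SAMPLE))
```

Matches from the first three alternatives leave the `text` group unset, which is why the join substitutes an empty string for them.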

If Trim is checked, the HTML page is stripped of excess spaces, tabs, carriage returns and line feeds prior to parsing.
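What counts as "excess" whitespace is not spelled out here. One plausible reading, sketched below in Python as an assumption rather than as the application's actual behaviour, is to collapse every run of spaces, tabs, carriage returns and line feeds into a single space and drop leading and trailing whitespace. The function name `trim` is hypothetical.

```python
import re

# Runs of spaces, tabs, carriage returns and line feeds.
WHITESPACE_RUN = re.compile(r"[ \t\r\n]+")


def trim(html):
    """Collapse each whitespace run to one space (an assumed interpretation)."""
    return WHITESPACE_RUN.sub(" ", html).strip()


print(trim("<p>\r\n\tHello   world </p>\n"))
```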

Each of these operations is handled entirely by one specialized regex pattern, with no additional parsing, filtering or modification of the regex match results (this includes extracting the URL from the enveloping link syntax, which is also done by the regex). The source HTML document is retrieved afresh every time a button is clicked. The parse time reported for the Links, URLs and Text functions is the time it takes to parse the whole document after it has been retrieved from the server, including building the result string. This application runs on shared server hosting, so results can vary from click to click depending on the current system load. Also, since the HTML being parsed is downloaded again each time a button is clicked, websites with frequently changing content can be expected to serve differing HTML documents from click to click.

If the parse time shows as 0 ms / 0 ticks, the parsing completed in less time than the resolution of the regular system timer (about 15 milliseconds).
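A coarse clock reports zero for anything shorter than its tick. As a point of comparison, here is a minimal Python sketch of timing a call with a high-resolution counter. It is not how this application measures time, and the helper name `timed` is hypothetical.

```python
import time


def timed(func, *args):
    """Run func(*args) and return (result, elapsed nanoseconds)."""
    start = time.perf_counter_ns()  # high-resolution performance counter
    result = func(*args)
    elapsed_ns = time.perf_counter_ns() - start
    return result, elapsed_ns


result, elapsed_ns = timed(sum, range(1000))
print(result, f"{elapsed_ns / 1_000_000:.3f} ms")
```

To mirror the measurement described above, the timed function would cover both the regex sweep and the building of the result string, but not the download.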

Note: In this implementation, the obsolete HTML background attribute is not supported, nor are URLs in CSS @import rules that are written without opening and closing parentheses.