LINK \b(((src|href|action|url) *(=|:) *(?<mh>"|'|))(?<url>[\w ~$!*'/.?=#&@:%+,();\-\[\]]+)\k<mh>|url *\( *(?<mc>"|'|)(?<url>[\w ~$!*'/.?=#&@:%+,();\-\[\]]+)\k<mc>\))
This application is a demo that showcases the efficiency of using regex patterns
to extract either links or plain text from
HTML pages. Enter the URL of the page you want to retrieve the HTML from.
Clicking the Links button will then display a list of all the links extracted
from that page, including their declaring attributes. The regex also
captures the URL of each link separately, in the same operation, and clicking
URLs will produce a result list without the declarations. (Both links and
URLs are detected and extracted during the same sweep, but highlighting the URLs
within the link list during match iteration requires adding HTML-formatted color
coding, which takes a relatively long time and so yields unrepresentative parse times.
Links and URLs therefore have separate buttons.)
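
The single-sweep idea can be sketched in Python. This is a simplified adaptation, not the pattern shown at the top: Python's re module uses (?P<name>...) syntax and does not allow two groups to share a name, so the hypothetical groups url1 and url2 stand in for the single url group. The URL character class is also narrowed (no spaces, quotes or parentheses), and only the "=" separator and the src, href and action attributes are handled. The names LINK_RE and extract_links are illustrative only.

```python
import re

# Simplified URL character class (an assumption: narrower than the original).
URL = r"[\w~$!*/.?=#&@:%+,;\-\[\]]+"

LINK_RE = re.compile(
    r"\b(?:"
    # attribute form: href="..." / src='...' / action=...
    r"(?:src|href|action)\s*=\s*(?P<q1>[\"']?)(?P<url1>" + URL + r")(?P=q1)"
    r"|"
    # CSS form: url(...), optionally quoted
    r"url\s*\(\s*(?P<q2>[\"']?)(?P<url2>" + URL + r")(?P=q2)\s*\)"
    r")",
    re.IGNORECASE,
)


def extract_links(html):
    """Return (links, urls) collected in one pass over the HTML."""
    links, urls = [], []
    for m in LINK_RE.finditer(html):
        links.append(m.group(0))                          # full declaration
        urls.append(m.group("url1") or m.group("url2"))   # bare URL only
    return links, urls
```

Each match yields both the full declaration (such as href="page.html") and the bare URL (page.html), so the two result lists come from the same iteration.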
Clicking Link markup will display the complete HTML markup of the page,
with the links highlighted.
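
One way such a markup view could be produced is sketched below in Python. It assumes the page source is HTML-escaped for display and that each regex match is wrapped in a <mark> element; the function name highlight and the choice of <mark> are hypothetical, not taken from the application.

```python
import html
import re


def highlight(pattern, source):
    """Escape source for display and wrap every regex match in <mark> tags."""
    out, pos = [], 0
    for m in pattern.finditer(source):
        out.append(html.escape(source[pos:m.start()]))   # text before the match
        out.append("<mark>" + html.escape(m.group(0)) + "</mark>")
        pos = m.end()
    out.append(html.escape(source[pos:]))                # trailing text
    return "".join(out)
```

The extra escaping and string building per match is the kind of work that adds to the measured time when highlighting is done during match iteration.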
Similarly, clicking the Text button will detect and extract all textual content
(text that is not contained within a tag) from
the page and display it with all the HTML stripped away. Text markup
shows the complete HTML markup, with the text strings highlighted in place.
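
A rough Python sketch of extracting text that sits between tags follows. It is an assumption about how such a pattern might look, not the application's own regex, and it is deliberately naive: the contents of script and style elements and of comments would be treated as text too. TEXT_RE and extract_text are hypothetical names.

```python
import re

# A run of characters between ">" and "<" that contains at least one
# non-whitespace character; group 1 holds the run without outer whitespace.
TEXT_RE = re.compile(r"(?<=>)\s*([^<>\s](?:[^<>]*[^<>\s])?)\s*(?=<)")


def extract_text(html):
    """Return the text fragments found outside of tags, in document order."""
    return TEXT_RE.findall(html)
```

Whitespace-only gaps between tags produce no match, so no filtering of the result list is needed afterwards.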
If Trim is checked, the HTML page is stripped of excess spaces, tabs,
carriage returns and line feeds prior to parsing.
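
What counts as "excess" is not specified here. The Python sketch below assumes it means collapsing every run of spaces, tabs, carriage returns and line feeds into a single space and dropping leading and trailing whitespace; trim_html is a hypothetical name.

```python
import re

WHITESPACE_RUN = re.compile(r"[ \t\r\n]+")


def trim_html(html):
    """Collapse whitespace runs to one space and strip both ends."""
    return WHITESPACE_RUN.sub(" ", html).strip()
```

Note that a blanket collapse like this would also alter whitespace inside pre elements, which may or may not matter for a given page.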
Each of these operations is handled entirely by one
specialized regex pattern,
with no additional parsing, filtering or modification of the regex
match results (extracting the URL from the enveloping link syntax, for example,
is also done by the regex). The source HTML document is retrieved afresh
every time a button is clicked. The parse time recorded with the Links, URLs and
Text functions is measured as the time it takes to parse the
whole document, after it has been retrieved from the server. The parse times
include building the result string. This application runs on shared server
hosting, so results can vary from click to click, depending on the current
system load. Also, since the HTML being parsed is downloaded again each time
a button is clicked, websites with frequently changing content can be expected
to serve differing HTML documents from click to click.
If the parse time shows as 0 ms / 0 ticks, it is because the parsing completed
faster than the resolution of the regular system timer (which is about 15 ms).
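
The effect of timer resolution can be illustrated with a small Python sketch. It times a parse with a high-resolution counter and then shows, under a deliberately rough model, what a clock that only advances in whole ticks would report. The names timed and on_coarse_clock are hypothetical, and the 15 ms default tick is an assumption chosen to match the figure mentioned above.

```python
import time


def timed(parse, html):
    """Run parse(html) and return (result, elapsed nanoseconds)."""
    start = time.perf_counter_ns()
    result = parse(html)
    return result, time.perf_counter_ns() - start


def on_coarse_clock(elapsed_ns, tick_ns=15_000_000):
    """Round an elapsed time down to whole ticks; result is in milliseconds."""
    return (elapsed_ns // tick_ns) * tick_ns // 1_000_000
```

Under this model, any parse that finishes within one tick is reported as 0 ms even though it took a measurable amount of time.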
Note: In this implementation, the obsolete background
attribute in HTML is not
supported, nor are URLs used in the CSS @import rule without opening
and closing parentheses.