The Homepage of extract_url.pl

Version 1.6.1 can be downloaded here. This is a Perl script that extracts URLs from correctly-encoded MIME email messages or plain text. It can be used either as a pre-parser for urlview or as a complete replacement for it. The source repository used to be on Google Code, but has been moved to GitHub.

Why?

urlview is a great program, but it has some deficiencies. It isn't very configurable, and it cannot handle URLs that have been broken over several lines in format=flowed delsp=yes email messages. Nor can it handle quoted-printable email messages, and it doesn't eliminate duplicate URLs. This Perl script handles all of that. It also sanitizes URLs so that they can't break out of the command shell.
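To make the line-breaking problem concrete, here is a minimal Python sketch of rejoining a format=flowed delsp=yes body and dropping duplicate URLs. It is not the script's own Perl code: the function names `unflow` and `unique_urls` and the simple URL regex are assumptions for illustration, and quoted ("> ") lines are not handled.

```python
import re

def unflow(text, delsp=True):
    """Rejoin soft-wrapped lines of a format=flowed body (quoting is ignored)."""
    paragraphs, current = [], ""
    for line in text.split("\n"):
        if line.startswith(" "):
            line = line[1:]  # undo space-stuffing
        if line.endswith(" ") and line != "-- ":
            # A trailing space marks a soft break; with delsp=yes that
            # space is dropped before the next line is appended.
            current += line[:-1] if delsp else line
        else:
            paragraphs.append(current + line)
            current = ""
    if current:
        paragraphs.append(current)
    return "\n".join(paragraphs)

def unique_urls(text):
    """Return URLs in order of first appearance, without duplicates."""
    seen, urls = set(), []
    for url in re.findall(r"https?://[^\s<>\"']+", text):
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls

body = (
    "See http://example.com/a/very/ \n"
    "long/path?x=1 \n"
    "&y=2\n"
    "and http://example.com/a/very/long/path?x=1&y=2 again\n"
)
print(unique_urls(unflow(body)))
```

Without the rejoining step, a plain regex scan would find only the truncated `http://example.com/a/very/` on the first line.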

This is designed primarily for use with the mutt emailer. The idea is that if you want to access a URL in an email, you pipe the email to a URL extractor (like this one) which then lets you select a URL to view in some third program (such as Firefox). An alternative design is to access URLs from within mutt's pager by defining macros and tagging the URLs in the display to indicate which macro to use. A script you can use to do that is tagurl.pl.

Similar Software

In addition to urlview there is also a package known as urlscan. It is written in Python, and does not have a homepage that I'm aware of. urlscan improves on urlview's interface, but is not as good at gluing correctly-encoded MIME email URLs back together (as far as I know, it does not support format=flowed).

Dependencies

Mandatory (these usually come with Perl):

Optional:

MIME::QuotedPrint - decodes quoted-printable encoded messages per RFC 2045 (handles corner cases better than the builtin decoder)
URI::Find - recognizes more exotic URL variations in plain text (without HTML tags)
Curses::UI - allows it to fully replace urlview
Getopt::Long - if present, extract_url recognizes long options --version and --list
Term::ReadKey - if present, used to determine terminal width in a portable way

How to Use It

Simple Examples

This Perl script expects a valid email, either piped in via STDIN or named as a file argument. Its STDOUT can be a pipe into urlview (it will detect this). Here's how you can use it:

        cat message.txt | extract_url.pl

        cat message.txt | extract_url.pl | urlview

        extract_url.pl message.txt

        extract_url.pl message.txt | urlview

Arguments

The script has a few command-line options you can use:

-h, --help: Print helpful documentation and exit.
-m, --man: Display the full man page documentation.
-l, --list: Prevent the use of Ncurses, and simply output a list of extracted URLs.
-t, --text: Prevent MIME handling; treat the input as plain text.
-q, --quoted: Force a quoted-printable decode on plain text (only applicable if -t is used).
-c, --config: Specify a config file to read.
-V, --version: Output version information and exit.
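To illustrate what a quoted-printable decode (as forced by -q) undoes, here is a minimal Python sketch using the standard `quopri` module. It is not the script's own decoder, and the sample URL and the helper name `decode_qp` are made up for the example.

```python
import quopri

def decode_qp(text):
    """Decode quoted-printable text: =XX becomes one byte, and an
    '=' at the end of a line is a soft break that is removed."""
    raw = quopri.decodestring(text.encode("ascii"))
    return raw.decode("utf-8", errors="replace")

# "=3D" encodes a literal "=", and the trailing "=" joins the two lines.
sample = "http://example.com/view?id=3D42&page=3D=\n7"
print(decode_qp(sample))  # prints http://example.com/view?id=42&page=7
```

Left undecoded, the same text would yield a URL containing stray `3D` characters and cut off at the line break.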

Mutt Macros

For use with mutt 1.4.x, here's a macro you can use:

        macro index,pager \cb "<enter-command> unset pipe_decode<enter><pipe-message>extract_url.pl<enter>" "get URLs"

For use with mutt 1.5.x, here's a more complicated macro you can use:

        macro index,pager \cb "<enter-command> set my_pdsave=\$pipe_decode<enter>\
        <enter-command> unset pipe_decode<enter>\
        <pipe-message>extract_url.pl<enter>\
        <enter-command> set pipe_decode=\$my_pdsave<enter>" "get URLs"

Here's a suggestion for how to handle encrypted email:

        macro index,pager ,b "<enter-command> set my_pdsave=\$pipe_decode<enter>\
        <enter-command> unset pipe_decode<enter>\
        <pipe-message>extract_url.pl<enter>\
        <enter-command> set pipe_decode=\$my_pdsave<enter>" "get URLs"

        macro index,pager ,B "<enter-command> set my_pdsave=\$pipe_decode<enter>\
        <enter-command> set pipe_decode<enter>\
        <pipe-message>extract_url.pl<enter>\
        <enter-command> set pipe_decode=\$my_pdsave<enter>" "decrypt message, then get URLs"

        message-hook .  'macro index,pager \cb ,b "URL viewer"'
        message-hook ~G 'macro index,pager \cb ,B "URL viewer"'

It's not perfect, but it works for me.

Config File

If you're using it with Curses::UI (i.e. as a standalone URL selector), this Perl script will try to figure out what command to use based on the contents of your ~/.urlview file. However, it also has its own configuration file (~/.extract_urlview) that will be used instead, if it exists. So far, there are ten kinds of lines you can have in this file:

COMMAND ...: This line specifies the command that will be used to view URLs. This command can contain a %s, which will be replaced by the URL inside single-quotes. If it does not contain a %s, the URL will simply be appended to the command. If this line is not present, the command is assumed to be "open", which is the correct command for Mac OS X systems.
SHORTCUT: This line specifies that if an email contains only 1 URL, that URL will be opened without prompting. The default (without this line) is to always prompt.
NOREVIEW: Normally, if a URL is too long to display on screen in the menu, the user will be prompted with the full url before opening it, just to make sure it's correct. This line turns that behavior off.
PERSISTENT: By default, when a URL has been selected and viewed from the menu, extract_url.pl will exit. If you would like it to be ready to view another URL without re-parsing the email (i.e. much like standard urlview behavior), add this line to the config file.
IGNORE_EMPTY_TAGS: By default, the script collects all the URLs it can find. Sometimes, though, HTML messages contain links that don't correspond to any text (and aren't normally rendered or accessible). This tells the script to ignore these links.
RAW_RESERVED: By default, the script sanitizes URLs pretty thoroughly, eliminating all characters that are not part of the Unreserved class (per RFC 3986). Sometimes, though, this is not desirable. This tells the script to leave the Reserved Characters un-encoded (with the exception of the single quote).
HTML_TAGS ...: This line specifies which HTML tags will be examined for URLs. By default, the script is very generous, looking in a, applet, area, blockquote, embed, form, frame, iframe, input, ins, isindex, head, layer, link, object, q, script, and xmp tags for links. If you would like it to examine just a subset of these (e.g. you only want a tags to be examined), merely list the subset you want. The list is expected to be a comma-separated list. If there are multiple of these lines in the config file, the script will look for the minimum set of specified tags.
ALTSELECT ...: This line specifies a key for an alternate URL viewing behavior. By default, extract_url.pl will quit after the URL viewer has been launched for the selected URL. This key will instead make extract_url.pl launch the URL viewer without quitting. However, if PERSISTENT is specified in the config file, the opposite is true: normal selection of a URL will launch the URL viewer and will not cause extract_url.pl to exit, but this key will. This setting defaults to 'k'.
DEFAULT_VIEW {url|context}: This line specifies whether the program starts by showing the list of URLs or the list of URL contexts. By default, extract_url.pl shows the list of URLs.
DISPLAY_SANITIZED: This line specifies that URLs should be sanitized before they are displayed. This assists terminals that recognize URLs by transforming URLs into an easier-to-recognize format. For example, if a URL has spaces in it, those spaces will be turned into %20.
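As a rough illustration of what HTML_TAGS and IGNORE_EMPTY_TAGS select, here is a small Python sketch built on the standard `html.parser` module. It is only an approximation of the idea, not the script's Perl implementation: the class name `LinkCollector`, the attribute list, and the rule that only a tags are checked for emptiness are all assumptions.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    # Attributes assumed to carry URLs; the real script may look at others.
    URL_ATTRS = ("href", "src", "action", "cite")

    def __init__(self, tags=("a",), ignore_empty=False):
        super().__init__()
        self.tags = set(tags)             # analogous to HTML_TAGS
        self.ignore_empty = ignore_empty  # analogous to IGNORE_EMPTY_TAGS
        self.urls = []
        self._pending = None              # [url, text_seen] for an open <a>

    def handle_starttag(self, tag, attrs):
        if tag not in self.tags:
            return
        for name, value in attrs:
            if name in self.URL_ATTRS and value:
                if tag == "a" and self.ignore_empty:
                    self._pending = [value, False]  # decide at </a>
                else:
                    self.urls.append(value)

    def handle_data(self, data):
        if self._pending and data.strip():
            self._pending[1] = True  # the link has visible text

    def handle_endtag(self, tag):
        if tag == "a" and self._pending:
            url, has_text = self._pending
            if has_text:
                self.urls.append(url)
            self._pending = None

html = (
    '<a href="http://a.example/">text</a>'
    '<a href="http://b.example/"></a>'
    '<link href="http://c.example/style.css">'
    '<img src="http://d.example/x.png">'
)
collector = LinkCollector(tags=("a", "link"), ignore_empty=True)
collector.feed(html)
print(collector.urls)
```

Here the empty second link is skipped, and the img URL is never considered because img is not among the requested tags.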

Here is an example config file:

SHORTCUT
COMMAND firefox %s &
HTML_TAGS a,iframe,link
ALTSELECT Q
DEFAULT_VIEW context
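For readers who want to see how a file like this maps to settings, here is a hedged Python sketch of a parser for the line types listed above. It is not the script's own parser: the names `parse_config` and `build_command` are invented, treating repeated HTML_TAGS lines as an intersection is just one reading of "minimum set", and single-quoting an appended URL (when COMMAND has no %s) is an assumption.

```python
FLAGS = {"SHORTCUT", "NOREVIEW", "PERSISTENT",
         "IGNORE_EMPTY_TAGS", "RAW_RESERVED", "DISPLAY_SANITIZED"}

def parse_config(text):
    # Defaults taken from the descriptions above; html_tags=None means
    # "use the script's built-in tag list".
    cfg = {"command": "open", "altselect": "k", "default_view": "url",
           "html_tags": None, "flags": set()}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        value = value.strip()
        if key in FLAGS:
            cfg["flags"].add(key)
        elif key == "COMMAND":
            cfg["command"] = value
        elif key == "HTML_TAGS":
            tags = {t.strip().lower() for t in value.split(",") if t.strip()}
            # Assumption: several HTML_TAGS lines narrow the set.
            cfg["html_tags"] = tags if cfg["html_tags"] is None else cfg["html_tags"] & tags
        elif key == "ALTSELECT":
            cfg["altselect"] = value
        elif key == "DEFAULT_VIEW":
            cfg["default_view"] = value
    return cfg

def build_command(command, url):
    """Substitute an already-sanitized URL (no single quotes) into COMMAND."""
    quoted = "'" + url + "'"
    if "%s" in command:
        return command.replace("%s", quoted)
    return command + " " + quoted  # assumption: appended URL is quoted too

example = """SHORTCUT
COMMAND firefox %s &
HTML_TAGS a,iframe,link
ALTSELECT Q
DEFAULT_VIEW context
"""
cfg = parse_config(example)
print(build_command(cfg["command"], "http://example.com/"))
```

With the example file, the resulting command line is `firefox 'http://example.com/' &`.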

Security

All URLs have any potentially dangerous shell characters removed (transformed into percent-encoding) before they are used in a shell. This should eliminate the possibility of a bad URL breaking the shell. For reference, the permitted (non-transformed) characters are:

a-z A-Z 0-9 _ . ! * ( ) @ : = ? / % ~ + -
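The rule can be sketched in a few lines of Python: anything outside the permitted set above is replaced by its percent-encoded bytes. This is an illustration under assumptions, not the script's Perl code; the function name `sanitize` is made up, and encoding non-ASCII characters as UTF-8 bytes is a choice made for the example.

```python
import re

# Everything except the permitted characters listed above.
UNSAFE = re.compile(r"[^A-Za-z0-9_.!*()@:=?/%~+\-]")

def sanitize(url):
    """Percent-encode every character outside the permitted set."""
    def encode(match):
        return "".join("%%%02X" % byte for byte in match.group().encode("utf-8"))
    return UNSAFE.sub(encode, url)

print(sanitize("http://example.com/a b?x=1&y='$(rm)'"))
# prints http://example.com/a%20b?x=1%26y=%27%24(rm)%27
```

Because % is itself permitted, escapes that are already present in a URL pass through unchanged, and because the single quote is always encoded, the result is safe to place inside single quotes on a shell command line.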

Screenshots

Here's what it looks like for a standard email:
[Screenshot: standard list of URLs]

If a URL is too big for your terminal, when you select it, extract_url.pl will (by default) ask you to review it in a way that you can see the whole thing. Here's what that looks like:
[Screenshot: asking for confirmation on long URLs]

License

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY KYLE WHEELER ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL KYLE WHEELER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Changelog

1.6.1
- Expand displayed context to fill the width of the terminal.
1.6
- Add a way to force quoted-printable decoding of plain text
- Add a way to reduce the number of characters that are escaped (RAW_RESERVED)
- Add a way to further reduce the escape requirements when COMMAND does not require the shell (i.e. can be execvp'd directly).
1.5.8
- Enable/simplify dealing with plain-text inputs, and create an argument to force plain-text handling (i.e. skip MIME parsing)
- Allow named file parsing, rather than relying exclusively on stdin.
1.5.7
- Repackage to ease automatic distribution (e.g. Debian)
- Respect the BROWSER environment variable, in the absence of an explicit configuration setting
1.5.6
- Sanitize (percent-encode) ampersands (avoids certain problems with poor handling of strings in common versions of urlhandler.sh)
1.5.5
- Fix handling of trailing text in TEXT/HTML message parts, and TEXT/HTML message parts without any clickable links.
1.5.4
- Remove switch dependency
1.5.3
- Fix ALTSELECT
1.5.2
- URL sanitizing now applied to pipe output
- URL sanitizing may optionally be applied to display
1.5.1
- URL sanitizing turned out to be too strict
- Fixed a few perl warnings
- URLs found in HTML are rendered/de-obfuscated internally
1.5
- Fixed undefined variable errors when used without URI::Find
- Can toggle between showing the context of URLs in the main list and showing the URLs themselves.
- Can specify whether to show the list of URLs at first (default) or to show the list of URL contexts
1.4.1
- Better contextual text handling (uses word boundaries instead of explicit string lengths).
- Pulls URLs out of HTML text AS WELL as HTML tags (silly, but sometimes necessary). This text handling is not especially detailed, and so may be somewhat sensitive to formatting issues (unexpected line breaks, etc).
1.4
- Added support for a configurable alternative selection key (via ALTSELECT), allowing a person to, in effect, temporarily negate the PERSISTENT setting.
- Added conditional support for long options, if Getopt::Long is available.
1.3.3
- Sometimes, multipart/alternative parts don't actually have an alternative, which could fool this script. Now they're handled correctly (normal text/plain have 0 parts, according to the MIME parser).
1.3.2
- Discovered some really strange-looking emails (possibly with broken UTF-8 encoding) that interfere with maintaining context information for each URL. Added a workaround so that the script doesn't error out (there just might not be any context for some URLs in those messages).
1.3.1
- Added support for digest-style messages (i.e. message/rfc822 MIME parts)
1.3
- Added support for viewing link context
- Added HTML_TAGS
- Added IGNORE_EMPTY_TAGS