URL Checker

Tuesday, November 10th, 2009 12:51 am

I wrote a small Python script that searches a local file or a remote address for URLs and reports the HTTP status code for each one. It’s quick and dirty, and the regex needs some tweaking, but for the most part it works. The reason I didn’t just use a link checker is that I was actually testing RSS feeds, so this grabs URLs anywhere in the feed as opposed to just those in A tags. It prints a count for every status code it sees and lists the individual URLs for anything in the 4xx range.

Here’s the source code:

from httplib import HTTP, HTTPConnection
from urlparse import urlparse

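def get_page(url):
    # Fetch a remote page over HTTP and return its body as a string.
    parsed = urlparse(url)
    path = parsed[2] or '/'
    if parsed[4]:
        # Keep the query string so the right resource is requested.
        path += '?' + parsed[4]
    conn = HTTPConnection(parsed[1])
    conn.request("GET", path)
    response = conn.getresponse()
    data = response.read()
    conn.close()
    return data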

def get_urls(text):
    import re
    matches = re.findall(r'http://[^\s<>"\']+', text)
    return list(set(matches))

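def check_url(url):
    # Send a HEAD request and return the numeric HTTP status code.
    url = url.strip()
    parsed = urlparse(url)
    path = parsed[2] or '/'
    if parsed[4]:
        # Keep the query string; without it a different resource is checked.
        path += '?' + parsed[4]
    request = HTTP(parsed[1])
    request.putrequest('HEAD', path)
    request.endheaders()
    reply = request.getreply()
    return reply[0]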

if __name__ == '__main__':
    import sys
    import os
    source = sys.argv[1]
    data = ''
    if os.access(os.path.abspath(source), os.R_OK):
        print 'GETTING LOCAL FILE.'
        data = open(os.path.abspath(source), 'r').read()
    else:
        print 'GETTING REMOTE FILE.'
        data = get_page(source)
    print 'SEARCHING FOR URLS.'
    urls = get_urls(data)
    codes = {}
    print 'CHECKING %s URLS...' % len(urls)
    for url in urls:
        code = '%s' % check_url(url)
        if code not in codes.keys():
            codes[code] = []
        codes[code].append(url)
    print 'RESULTS:'
    print '========'
    for code, paths in codes.iteritems():
        print 'There were %s %ss.' % (len(paths), code)
        # Only list the individual URLs for client errors (4xx).
        if 399 < int(code) < 500:
            for path in paths:
                print '* %s' % path

...and here's an example of usage:

$ python checkurls.py http://www.google.com
GETTING REMOTE FILE.
SEARCHING FOR URLS.
CHECKING 21 URLS...
RESULTS:
========
There were 14 200s.
There were 2 302s.
There were 1 404s.
* http://www.google.com/ig%3Fhl%3Den%26source%3Diglk&amp;amp;amp;usg=AFQjCNFA18XPfgb7dKnXfKz7x7g1GDH1tg
There were 1 405s.
* http://www.google.com/reader/view/?hl=en&amp;amp;amp;tab=wy
There were 3 301s.
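
The two 4xx URLs in the output above still carry HTML-escaped ampersands, which suggests the regex picks up links exactly as they are written in the markup, entities included. One possible tweak, sketched here for Python 3, is to collapse escaped ampersands before checking each URL. The names `unescape_ampersands` and `get_clean_urls` are made up for this example and are not part of the script above, and the sketch assumes no URL is meant to contain a literal `&amp;`.

```python
import re

# Same idea as get_urls(): grab anything that looks like an http:// URL,
# stopping at whitespace, angle brackets, or quotes.
URL_PATTERN = re.compile(r'http://[^\s<>"\']+')


def unescape_ampersands(url):
    """Collapse '&amp;' (escaped one or more times) down to a plain '&'."""
    # Each pass removes one level of escaping and shortens the string,
    # so the loop always terminates.
    while '&amp;' in url:
        url = url.replace('&amp;', '&')
    return url


def get_clean_urls(text):
    """Return the unique URLs found in text, with ampersands unescaped."""
    return sorted(set(unescape_ampersands(m) for m in URL_PATTERN.findall(text)))


if __name__ == '__main__':
    sample = '<a href="http://example.com/view/?hl=en&amp;tab=wy">reader</a>'
    print(get_clean_urls(sample))
```

Only `&amp;` is handled on purpose: running a general entity decoder repeatedly over a URL can mangle ordinary query parameters whose names happen to look like entity names.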

You can also pass in a local file as the first parameter:

$ python checkurls.py file.htm
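
A note for anyone trying this on Python 3: `httplib` and `urlparse` were renamed to `http.client` and `urllib.parse`, and `print` became a function, so the script will not run there unchanged. Below is a rough sketch of how `check_url` might look on Python 3. It is not a port of the whole script; the helper `request_target`, the `timeout` parameter, and the HTTPS handling are additions made for illustration.

```python
from http.client import HTTPConnection, HTTPSConnection
from urllib.parse import urlparse


def request_target(url):
    """Build the path (plus query string) that goes on the request line."""
    parsed = urlparse(url.strip())
    target = parsed.path or '/'
    if parsed.query:
        target += '?' + parsed.query
    return target


def check_url(url, timeout=10):
    """Send a HEAD request and return the HTTP status code as an int."""
    parsed = urlparse(url.strip())
    # Assumption: use plain HTTP unless the URL explicitly says https.
    if parsed.scheme == 'https':
        conn = HTTPSConnection(parsed.netloc, timeout=timeout)
    else:
        conn = HTTPConnection(parsed.netloc, timeout=timeout)
    try:
        conn.request('HEAD', request_target(url))
        return conn.getresponse().status
    finally:
        conn.close()


if __name__ == '__main__':
    import sys
    for url in sys.argv[1:]:
        print(url, check_url(url))
```

Splitting out `request_target` keeps the URL handling separate from the network call, so that part can be checked without making any requests.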

If you have any thoughts, improvements, etc., just post them in the comments and I'll update the script. I may make a "recursive" one eventually so that it could actually function as a link checker, but I don't feel like adding that right now. :)

2 Responses to “URL Checker”

  1. yuce says:

    Here's a tip: instead of the following:

    if code not in codes.keys():
    codes[code] = []
    codes[code].append(url)

    You can simply write:

    codes.setdefault(code, []).append(url)

  2. yuce says:

    disqus ate my spaces, repeating; instead of:

    if code not in codes.keys():
        codes[code] = []
    codes[code].append(url)

    You can use this one.

    codes.setdefault(code, []).append(url)
