URL Checker

Tuesday, November 10th, 2009 12:51 am

I wrote a quick Python script that searches a local file or remote address for URLs and reports the HTTP status code for each one. It’s quick and dirty, and the regex needs some tweaking, but for the most part it works. I didn’t just use a link checker because I was actually testing RSS feeds, so the script grabs URLs anywhere in the feed rather than only those in A tags. It prints a count for every status code and lists the individual URLs for anything in the 4xx range.

Here’s the source code:

from httplib import HTTP, HTTPConnection
from urlparse import urlparse

def get_page(url):
    parsed = urlparse(url)
    conn = HTTPConnection('%s' % parsed[1])
    conn.request("GET", parsed[2])
    response = conn.getresponse()
    data = response.read()
    return data

def get_urls(text):
    import re
    # Match everything up to whitespace, an angle bracket, or a quote.
    matches = re.findall(r'http://[^\s<>"\']+', text)
    return list(set(matches))

def check_url(url):
    url = url.strip()
    parsed = urlparse(url)
    request = HTTP(parsed[1])
    request.putrequest('HEAD', parsed[2])
    request.endheaders()
    reply = request.getreply()
    return reply[0]

if __name__ == '__main__':
    import sys
    import os
    source = sys.argv[1]
    data = ''
    if os.access(os.path.abspath(source), os.R_OK):
        print 'GETTING LOCAL FILE.'
        data = open(os.path.abspath(source), 'r').read()
    else:
        print 'GETTING REMOTE FILE.'
        data = get_page(source)
    print 'SEARCHING FOR URLS.'
    urls = get_urls(data)
    codes = {}
    print 'CHECKING %s URLS...' % len(urls)
    for url in urls:
        code = '%s' % check_url(url)
        if code not in codes.keys():
            codes[code] = []
        codes[code].append(url)
    print 'RESULTS:'
    print '========'
    for code, paths in codes.iteritems():
        print 'There were %s %ss.' % (len(paths), code)
        # Only list the individual URLs for client errors (4xx).
        if 400 <= int(code) < 500:
            for path in paths:
                print '* %s' % path

...and here's an example of usage:

$ python checkurls.py http://www.google.com
GETTING REMOTE FILE.
SEARCHING FOR URLS.
CHECKING 21 URLS...
RESULTS:
========
There were 14 200s.
There were 2 302s.
There were 1 404s.
* http://www.google.com/ig%3Fhl%3Den%26source%3Diglk&amp;usg=AFQjCNFA18XPfgb7dKnXfKz7x7g1GDH1tg
There were 1 405s.
* http://www.google.com/reader/view/?hl=en&amp;tab=wy
There were 3 301s.
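
One possible explanation for the 404 and 405 above (a guess, not something I verified): the URLs are pulled straight out of the HTML, so their ampersands are still HTML-escaped and the server may receive an address it doesn't recognize. Here is a minimal Python 3 sketch of decoding the entities before checking. Note that the script above is Python 2, and `clean_url` is a hypothetical helper that is not part of it.

```python
from html import unescape

def clean_url(url):
    """Strip whitespace and decode HTML entities such as &amp; (hypothetical helper)."""
    return unescape(url.strip())

# The escaped ampersand becomes a plain '&' again.
print(clean_url('http://www.google.com/reader/view/?hl=en&amp;tab=wy'))
```

One caveat: `html.unescape` also decodes a few legacy entity names that lack a trailing semicolon, so a query parameter literally named something like `&copy=` could be mangled. Treat this as a starting point rather than a complete fix.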

You can also pass in a local file as the first parameter:

$ python checkurls.py file.htm

If you have any thoughts, improvements, etc. just post them in the comments and I'll update the script. I may make a "recursive" one eventually, so that it actually could function as a link checker, but I don't feel like adding that right now. :)
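
As a starting point for improvements, here is a rough Python 3 sketch of the same flow (the script above targets Python 2). It is a rewrite under my own assumptions, not the original code: it also matches https URLs, keeps the query string in the HEAD request, and takes the checker as a parameter so the grouping can be tried without touching the network. The names `request_target` and `group_by_status` are made up for illustration.

```python
import re
from http.client import HTTPConnection, HTTPSConnection
from urllib.parse import urlparse

# Match http/https URLs up to whitespace, an angle bracket, or a quote.
URL_RE = re.compile(r'https?://[^\s<>"\']+')

def get_urls(text):
    """Return the unique URLs found in text, sorted for stable output."""
    return sorted(set(URL_RE.findall(text)))

def request_target(url):
    """Build the path (plus query string, if any) for the request line."""
    parsed = urlparse(url)
    target = parsed.path or '/'
    if parsed.query:
        target += '?' + parsed.query
    return target

def check_url(url, timeout=10):
    """Send a HEAD request and return the numeric HTTP status code."""
    parsed = urlparse(url)
    conn_cls = HTTPSConnection if parsed.scheme == 'https' else HTTPConnection
    conn = conn_cls(parsed.netloc, timeout=timeout)
    try:
        conn.request('HEAD', request_target(url))
        return conn.getresponse().status
    finally:
        conn.close()

def group_by_status(urls, checker=check_url):
    """Map each status code to the list of URLs that returned it."""
    codes = {}
    for url in urls:
        codes.setdefault(checker(url), []).append(url)
    return codes
```

Calling `group_by_status(get_urls(text))` on the contents of a feed would return a dict such as `{200: [...], 404: [...]}`. The sketch does not handle connection errors or timeouts, so a single unreachable host would still raise an exception.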

5 Responses to “URL Checker”

  1. yuce says:

    Here's a tip: instead of the following:

    if code not in codes.keys():
    codes[code] = []
    codes[code].append(url)

    You can simply write:

    codes.setdefault(code, []).append(url)

  2. yuce says:

    Disqus ate my spaces, so repeating. Instead of:

    if code not in codes.keys():
        codes[code] = []
    codes[code].append(url)

    You can use this one:

    codes.setdefault(code, []).append(url)

  5. Andres says:

    I keep getting the following error message:

    File “checkurls.py”, line 14
    matches = re.findall(r’http://[^s"']+’, text)
    ^
    SyntaxError: invalid syntax
