I wrote a quick Python script designed to search a file / remote address for URLs and return the HTTP status codes for each one. It’s quick and dirty, and the regex needs some tweaking, but for the most part it works. The reason I didn’t just use a link checker is that I was actually testing RSS feeds, so this was designed to grab URLs throughout the feed as opposed to just A tags. It lists anything in the 40x range.
Here’s the source code:
from httplib import HTTP, HTTPConnection
from urlparse import urlparse
def get_page(url):
parsed = urlparse(url)
conn = HTTPConnection('%s' % parsed[1])
conn.request("GET", parsed[2])
response = conn.getresponse()
data = response.read()
return data
def get_urls(text):
import re
matches = re.findall(r'http://[^s<>"']+', text)
return list(set(matches))
def check_url(url):
url = url.strip()
parsed = urlparse(url)
request = HTTP(parsed[1])
request.putrequest('HEAD', parsed[2])
request.endheaders()
reply = request.getreply()
return reply[0]
if __name__ == '__main__':
import sys
import os
source = sys.argv[1]
data = ''
if os.access(os.path.abspath(source), os.R_OK):
print 'GETTING LOCAL FILE.'
data = open(os.path.abspath(source), 'r').read()
else:
print 'GETTING REMOTE FILE.'
data = get_page(source)
print 'SEARCHING FOR URLS.'
urls = get_urls(data)
codes = {}
print 'CHECKING %s URLS...' % len(urls)
for url in urls:
code = '%s' % check_url(url)
if code not in codes.keys():
codes[code] = []
codes[code].append(url)
print 'RESULTS:'
print '========'
for code, paths in codes.iteritems():
if 399 < int(code) < 500:
print 'There were %s %ss.' % (len(paths), code)
for path in paths:
print '* %s' % path
else:
print 'There were %s %ss.' % (len(paths), code)
...and here's an example of usage:
$ python checkurls.py http://www.google.com GETTING REMOTE FILE. SEARCHING FOR URLS. CHECKING 21 URLS... RESULTS: ======== There were 14 200s. There were 2 302s. There were 1 404s. * http://www.google.com/ig%3Fhl%3Den%26source%3Diglk&amp;amp;usg=AFQjCNFA18XPfgb7dKnXfKz7x7g1GDH1tg There were 1 405s. * http://www.google.com/reader/view/?hl=en&amp;amp;tab=wy There were 3 301s.
You can also pass in a local file as the first parameter:
$ python checkurls.py file.htm
If you have any thoughts, improvements, etc. just post them in the comments and I'll update the script. I may make a "recursive" one eventually, so that it actually could function as a link checker, but I don't feel like adding that right now.
Here's a tip: instead of the following:
if code not in codes.keys():
codes[code] = []
codes[code].append(url)
You can simply write:
codes.setdefault(code, []).append(url)
disqus ate my spaces, repeating; instead of:
1. if code not in codes.keys():
2. codes[code] = []
3. codes[code].append(url)
You can use this one.
codes.setdefault(code, []).append(url)