I wrote a quick Python script designed to search a file / remote address for URLs and return the HTTP status codes for each one. It’s quick and dirty, and the regex needs some tweaking, but for the most part it works. The reason I didn’t just use a link checker is that I was actually testing RSS feeds, so this was designed to grab URLs throughout the feed as opposed to just A tags. It lists anything in the 40x range.
Here’s the source code:
from httplib import HTTP, HTTPConnection
from urlparse import urlparse
def get_page(url):
parsed = urlparse(url)
conn = HTTPConnection('%s' % parsed[1])
conn.request("GET", parsed[2])
response = conn.getresponse()
data = response.read()
return data
def get_urls(text):
import re
matches = re.findall(r'http://[^s<>"']+', text)
return list(set(matches))
def check_url(url):
url = url.strip()
parsed = urlparse(url)
request = HTTP(parsed[1])
request.putrequest('HEAD', parsed[2])
request.endheaders()
reply = request.getreply()
return reply[0]
if __name__ == '__main__':
import sys
import os
source = sys.argv[1]
data = ''
if os.access(os.path.abspath(source), os.R_OK):
print 'GETTING LOCAL FILE.'
data = open(os.path.abspath(source), 'r').read()
else:
print 'GETTING REMOTE FILE.'
data = get_page(source)
print 'SEARCHING FOR URLS.'
urls = get_urls(data)
codes = {}
print 'CHECKING %s URLS...' % len(urls)
for url in urls:
code = '%s' % check_url(url)
if code not in codes.keys():
codes[code] = []
codes[code].append(url)
print 'RESULTS:'
print '========'
for code, paths in codes.iteritems():
if 399 < int(code) < 500:
print 'There were %s %ss.' % (len(paths), code)
for path in paths:
print '* %s' % path
else:
print 'There were %s %ss.' % (len(paths), code)
...and here's an example of usage:
$ python checkurls.py http://www.google.com GETTING REMOTE FILE. SEARCHING FOR URLS. CHECKING 21 URLS... RESULTS: ======== There were 14 200s. There were 2 302s. There were 1 404s. * http://www.google.com/ig%3Fhl%3Den%26source%3Diglk&amp;amp;usg=AFQjCNFA18XPfgb7dKnXfKz7x7g1GDH1tg There were 1 405s. * http://www.google.com/reader/view/?hl=en&amp;amp;tab=wy There were 3 301s.
You can also pass in a local file as the first parameter:
$ python checkurls.py file.htm
If you have any thoughts, improvements, etc. just post them in the comments and I'll update the script. I may make a "recursive" one eventually, so that it actually could function as a link checker, but I don't feel like adding that right now.
Here's a tip: instead of the following:
if code not in codes.keys():
codes[code] = []
codes[code].append(url)
You can simply write:
codes.setdefault(code, []).append(url)
disqus ate my spaces, repeating; instead of:
1. if code not in codes.keys():
2. codes[code] = []
3. codes[code].append(url)
You can use this one.
codes.setdefault(code, []).append(url)
We now have go through a few using the subject material articles in your internet now, and i genuinely like your trend of jogging a blog. I further it to my favorites web site site file and probably examining spine again quickly. Make certain to analyze out my world wide web in addition and let me know what assume.|Fantastic summary, actually genuinely similar to a internet site that We’ve got. Keep in mind test it out sometime and sense totally free to go away me a comenet on it and tell me what assume that. Im typically uncover feedback.2
I’m usually to running a blog and i really respect your content. The article has really peaks my interest. I’m going to bookmark your website and hold checking for new information.
I keep getting the following error message:
File “checkurls.py”, line 14
matches = re.findall(r’http://[^s"']+’, text)
^
SyntaxError: invalid syntax