Books by Jules J. Berman, covers

The software is provided "as is", without warranty of any kind,
express or implied, including but not limited to the warranties
of merchantability, fitness for a particular purpose and
noninfringement. In no event shall the authors or copyright
holders be liable for any claim, damages or other liability,
whether in an action of contract, tort or otherwise, arising
from, out of or in connection with the software or the use or
other dealings in the software.

© 2010 by Jules J. Berman and distributed under the GNU General 
Public License, available at http://www.gnu.org/licenses/gpl.html

Ruby, Perl and Python Scripts
Scripts for Grabbing Web Pages and Testing Links


Accessible web pages are files (usually in HTML format) that reside on servers that accept HTTP requests from clients connected to the Internet. Browsers are software applications that send HTTP requests and display the received web pages. Using Perl, Python, or Ruby, you can automate HTTP requests. For each language, the easiest way to make an HTTP request is to use a module that comes bundled as a standard component of the language.

HOW THE SCRIPT WORKS - GRABBING WEB PAGES

1. Import the module that makes HTTP requests.

2. Make the HTTP request.

3. If the request returns the web page, print the page. Otherwise, print a message indicating that the request was unsuccessful.

PERL SCRIPT - GRABBING WEB PAGES

   #!/usr/local/bin/perl
   use strict;
   use warnings;
   use LWP::Simple;   # provides get(), which returns undef on failure

   # Request a page that exists.
   my $good_url = qq|http://julesberman.info/factoids/batch.htm|;
   my $content = get($good_url);
   if (defined ($content))
     {
     print $content;
     }
   else
     {
     print "\nSorry, the get() call returned undef for $good_url"; 
     }

   # Request a page that does not exist.
   my $bad_url = qq|http://julesberman.info/factoids/xxxxx.htm|;
   $content = get($bad_url);
   if (defined ($content))
     {
     print $content;
     }
   else
     {
     print "\nSorry, the get() call returned undef for $bad_url"; 
     }
   exit;

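For Perl, the module is LWP::Simple, which belongs to the libwww-perl distribution. It is bundled with many Perl installations and can otherwise be installed from CPAN. A web page that explains the module syntax is available at: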

http://search.cpan.org/~gaas/libwww-perl-5.834/lib/LWP/Simple.pm

PYTHON SCRIPT - GRABBING WEB PAGES

   #!/usr/local/bin/python
   # Written for Python 2, where the urllib2 module is available.
   import urllib2

   # Request a page that exists.
   req = urllib2.Request('http://www.julesberman.info/factoids/batch.htm')
   try:
       response = urllib2.urlopen(req)
   except urllib2.HTTPError, e:
       # The server answered, but with an error status (for example, 404).
       print 'The server couldn\'t fulfill the request.'
       print 'Error code: ', e.code
   except urllib2.URLError, e:
       # No server could be reached.
       print 'We failed to reach a server.'
       print 'Reason: ', e.reason
   else:
       # Reuse the response already received; do not request the page again.
       print response.read()

   # Request a page that does not exist.
   req = urllib2.Request('http://www.julesberman.info/factoids/xxxxx.htm')
   try:
       response = urllib2.urlopen(req)
   except urllib2.HTTPError, e:
       print 'The server couldn\'t fulfill the request.'
       print 'Error code: ', e.code
   except urllib2.URLError, e:
       print 'We failed to reach a server.'
       print 'Reason: ', e.reason
   else:
       print response.read()


An excellent web tutorial explaining the urllib2 module is available at:

http://docs.python.org/dev/howto/urllib2.html
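
The Python script above is written for Python 2. In Python 3, the urllib2 module was split into urllib.request and urllib.error. The following is a minimal sketch of the same logic for Python 3, offered as an illustration and not as a drop-in replacement. The function name grab_page and its opener parameter are assumptions introduced here; opener defaults to urllib.request.urlopen and exists only so that the function can be tried without a network connection. Decoding the page as UTF-8 is also an assumption.

```python
import urllib.error
import urllib.request


def grab_page(url, opener=urllib.request.urlopen):
    """Return (True, page_text) on success or (False, explanation) on failure.

    `opener` is a hypothetical hook, not part of the original scripts.
    """
    try:
        with opener(url) as response:
            # Read the body once; assume UTF-8 and replace undecodable bytes.
            return True, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status such as 404.
        # HTTPError must be caught first because it is a subclass of URLError.
        return False, "The server couldn't fulfill the request. Error code: %d" % e.code
    except urllib.error.URLError as e:
        # No server could be reached (bad host name, no connection, and so on).
        return False, "We failed to reach a server. Reason: %s" % e.reason
```

Calling grab_page('http://www.julesberman.info/factoids/batch.htm') returns a pair: a True/False flag, followed by either the page text or a short explanation of the failure.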

RUBY SCRIPT - GRABBING WEB PAGES

   #!/usr/local/bin/ruby
   require 'net/http'
   Net::HTTP.start('www.julesberman.info') do
     |http|
     # Request a page that exists.
     response = http.get('/factoids/batch.htm')
     if response.is_a?(Net::HTTPSuccess)
       puts response.body
     else
       # Report the status line and the response headers.
       puts "Code = #{response.code}"
       puts "Message = #{response.message}"
       response.each{|key,value| puts key + " " + value}
     end
     # Request a page that does not exist.
     response = http.get('/factoids/xxxxx.htm')
     if response.is_a?(Net::HTTPSuccess)
       puts response.body
     else
       puts "Code = #{response.code}"
       puts "Message = #{response.message}"
       response.each{|key,value| puts key + " " + value}
     end
   end
   exit


For Ruby, the Net::HTTP module comes bundled with the Ruby interpreter, in the standard library. Another module, Net::FTP, requests files by FTP (File Transfer Protocol).

More information on Ruby's Net::HTTP module is available at:

http://ruby-doc.org/stdlib/libdoc/net/http/rdoc/index.html

ANALYSIS

Perl, Python and Ruby each handle HTTP transactions through a module rather than through built-in language syntax, and each module has its own peculiar syntax. Still, the basic operation is the same: your script initiates an HTTP request for a web file at a specific network address (the URL, or Uniform Resource Locator) and receives a response. If the page was retrieved, the script prints it to the monitor. Otherwise, the script reports the failure, using whatever information the module provides about why the page could not be retrieved.

In each example script, two web pages are requested. The first, http://www.julesberman.info/factoids/batch.htm, is a valid address. The second, http://www.julesberman.info/factoids/xxxxx.htm, is a well-formed URL for a page that does not exist, so the request fails. The Perl script requests the same two pages without the www prefix.
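
The same pattern extends to testing a list of links. The following Python 3 sketch is one possible approach, not a script from the original collection. The function name check_links, the status strings it returns, and the opener parameter are all assumptions made for illustration. It sends HEAD requests, which ask for headers only; some servers do not answer HEAD requests properly, so a GET request may be needed in practice.

```python
import urllib.error
import urllib.request


def check_links(urls, opener=urllib.request.urlopen):
    """Map each URL to a short status string (hypothetical format)."""
    results = {}
    for url in urls:
        # A HEAD request asks for the headers only, so no page body is downloaded.
        request = urllib.request.Request(url, method="HEAD")
        try:
            with opener(request) as response:
                results[url] = "OK %d" % response.status
        except urllib.error.HTTPError as e:
            # The server answered with an error status such as 404.
            results[url] = "HTTP error %d" % e.code
        except urllib.error.URLError as e:
            # No server could be reached at all.
            results[url] = "unreachable: %s" % e.reason
    return results
```

Passing the two example URLs to check_links() returns a dictionary with one status string per URL, which can then be printed or written to a report.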

You can see that, with a little effort, you can use this basic script to collect and examine a large number of web pages. With a little more effort, you can write your own spider software that searches for web addresses within web pages, and iteratively collects information from web pages within web pages.
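
The first step of such a spider, finding the web addresses within a page, might be sketched in Python 3 as follows. This is an illustration under stated assumptions, not a complete spider: the names LinkCollector and extract_links are hypothetical, only the href attributes of anchor tags are examined, and relative addresses are resolved against the address of the page that contained them.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href targets of <a> tags as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own address.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html_text, base_url):
    """Return the absolute URLs linked from html_text, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    return collector.links
```

A spider would fetch a page, pass its text to extract_links(), and then repeat the process for each address returned, keeping a record of addresses already visited so that no page is requested twice.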

© 2010 Jules J. Berman

key words: testing link, ruby programming, perl programming, python programming, bioinformatics, valid web page, web page is available, good http request, valid http request testing if web page exists, testing web links, jules berman, jules j berman, Ph.D., M.D.

Last modified: January 23, 2010
