Web Spidering with PHP


For this paper I will be using a typical LAMP server (Linux, Apache, MySQL, PHP) for the examples. Spidering requires the use of regular expressions, the cURL library (if POST data or cookies are involved), and cron (if we need to download information on a schedule).

Definition

Spidering, in its simplest form, is the act of transferring data from one database to another.

Well, why not just export the database from one site and import it into another? Usually this is not possible because you do not have access to the database you are spidering content from. Without database access, the usual way to get at the data is through the web front end.

One such example is a client of mine who sold products from another company. The company told my client that he could download all of the products from its website at his leisure; however, the company had no idea how to export such a database. As anyone can appreciate, 16,000 products is far too many to download manually and copy/paste descriptions for. So, the client approached me and I programmed a PHP/MySQL spider to crawl the website and store the information in a local osCommerce installation.

Spider Development Process

A programmer needs to go through the following steps to spider content:

  1. Determine database URL schema
  2. Determine data formatting
  3. Develop regular expressions to capture data
  4. Create database or other implementation for storing data
  5. Integrate regular expressions and database connectivity into a PHP script and execute it

Depending on the data being collected, the process may take a few days to run (one database I spidered had 400,000 records and took a week to download).

Website Analysis

For this example we’ll use a fictional website. This website stores people’s full names and phone numbers and can be accessed using the following URL schema:

http://www.example.com/phonebook/data.php?id=1

Obviously, by changing the ?id=X value, we can view the information of another person. Data is displayed in the following format:

<table>
<tr><td>Name:</td><td>Firstname Lastname</td></tr>
<tr><td>Phone:</td><td>###-###-####</td></tr>
</table>

Through the use of regular expressions, we can develop patterns to capture each user's information. For this example, we can use the following two regular expressions:

#Name:</td><td>(.+?)</td>#
#Phone:</td><td>(.+?)</td>#

For more information on developing regular expressions, I recommend www.regular-expressions.info. Regular expressions are processor-intensive, and depending on the complexity of your expression, you may want to call sleep(1) occasionally to keep from burning up your CPU.
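
As a minimal sketch (assuming $content already holds the downloaded HTML for a single record), the Name pattern would be applied with PHP's preg_match() like so:

<?php
// $content is assumed to hold the HTML for one record page.
if (preg_match("#Name:</td><td>(.+?)</td>#", $content, $matches)) {
  $name = $matches[1]; // the captured "Firstname Lastname" value
} else {
  $name = null; // pattern not found; possibly a deleted or empty record
}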

By viewing the website, we have determined that there are 5,287 users listed. So, in PHP we set up a for loop to iterate through the URLs:

<?php
// getPageContent(), regex(), and runQuery() are pseudocode helpers wrapping
// cURL, preg_match(), and our database layer respectively.
for ($i = 1; $i <= 5287; $i++) {
  $url = "http://www.example.com/phonebook/data.php?id=$i";
  $content = getPageContent($url);
  $name  = regex("#Name:</td><td>(.+?)</td>#", $content);
  $phone = regex("#Phone:</td><td>(.+?)</td>#", $content);
  runQuery("INSERT INTO phone SET name = '$name', phone = '$phone'");
  sleep(2); // pause between requests so we don't hammer the server
}

In the for loop, we download the webpage using cURL, parse the content using regular expressions, and store the information in our own database.
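
The getPageContent() call above is not a built-in PHP function; a minimal sketch of such a helper using cURL might look like this:

<?php
// Hypothetical helper: fetch a URL with cURL and return the response body.
function getPageContent($url) {
  $ch = curl_init($url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow simple redirects
  curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // give up on very slow responses
  $html = curl_exec($ch);
  curl_close($ch);
  return $html === false ? '' : $html;
}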

Common Pitfalls and Methods to Bypass Security

Some sites will set off an alert if page after page of data is downloaded in high volume. One way around this is to download records in a random order: build a queue of the website's URLs, download them in a random order, and mark off each downloaded item as you go.
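
A rough sketch of that approach, assuming the record IDs have already been collected and using a hypothetical markAsDownloaded() helper to track progress:

<?php
$ids = range(1, 5287); // every known record ID
shuffle($ids);         // randomize the crawl order

foreach ($ids as $id) {
  $url = "http://www.example.com/phonebook/data.php?id=$id";
  // ... download and parse as before ...
  markAsDownloaded($id); // hypothetical helper that flags this ID in our queue table
  sleep(rand(1, 5));     // a randomized pause also looks less mechanical
}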

Some sites will set off an alert if too many nonexistent pages are requested (for example, we might logically assume that records 1, 2, 3, 4, and 5 exist, but perhaps records 3 and 4 were deleted; downloading pages sequentially would then hit two missing records that are not linked anywhere else on the site). To get past this problem, we'll need to fill a database with a list of valid records (perhaps available in an index on the website) and then only visit the URLs of records that exist.
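
A minimal sketch, assuming the site exposes an index page that links to every record and reusing the helpers from earlier (the local queue table is also assumed):

<?php
// Hypothetical index page that links to every valid record.
$index = getPageContent("http://www.example.com/phonebook/index.php");

// Capture every ID that appears in a data.php?id=N link.
preg_match_all("#data\.php\?id=(\d+)#", $index, $matches);
$validIds = array_unique($matches[1]);

// Store the to-do list locally.
foreach ($validIds as $id) {
  runQuery("INSERT INTO queue SET record_id = '$id'");
}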

Some sites require a session cookie to be set. This can be handled by visiting the first page of the site and storing the cookie with cURL. However, sometimes a series of complex frames obscures where the cookie is set. In that case we view the site in our browser, copy the session cookie data into cURL, and then start spidering (session cookies can be tied to your IP address, in which case you will need to run the spider from that same machine). Cookies expire and need to be re-established occasionally during the spidering process (assuming the spider job takes longer to complete than the cookie's lifetime).
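
A minimal sketch of the cookie handling with cURL's cookie jar options (the jar path is arbitrary):

<?php
$jar = "/tmp/spider-cookies.txt"; // arbitrary location for the cookie jar

// Visit the front page once so the site sets its session cookie.
$ch = curl_init("http://www.example.com/phonebook/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, $jar);  // write any cookies we receive
curl_exec($ch);
curl_close($ch);

// Send the stored cookie back with every page we spider afterwards.
$ch = curl_init("http://www.example.com/phonebook/data.php?id=1");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEFILE, $jar); // read cookies from the jar
$content = curl_exec($ch);
curl_close($ch);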

Some sites use a CAPTCHA to protect data: an image of distorted text that a human can read but that is quite hard to decipher programmatically. Getting around this would involve running OCR against the generated CAPTCHA images. CAPTCHA solvers are CPU-intensive and far beyond the scope of this document.

Some websites update their content often, and we may need to keep our database current. When this happens, a cron job can be set up to download new data. We will need to record where the last crawl left off so that we can keep grabbing new data without re-downloading existing content.
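
As a sketch, the crawl position can be kept in a small state table (assumed here, along with a hypothetical fetchValue() helper), and the script scheduled with a crontab entry such as 0 2 * * * php /path/to/spider.php:

<?php
// Hypothetical one-row state table holding the last ID we finished.
$lastId = (int) fetchValue("SELECT last_id FROM crawl_state");

for ($i = $lastId + 1; $i <= 5287; $i++) { // upper bound grows as new records appear
  $content = getPageContent("http://www.example.com/phonebook/data.php?id=$i");
  // ... parse and store as before ...
  runQuery("UPDATE crawl_state SET last_id = '$i'"); // remember where we stopped
  sleep(2);
}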

Some websites require a valid user agent. To form one, you will need to dissect the user agent sent by your browser of choice. Sometimes you may even need to fake the user agent of a common search engine; however, if the website also checks the IP address of the downloading client, your spider won't have any luck. When this happens you may need to spider the search engine's cache of the site.
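
Setting the user agent in cURL is a single option; the string below is only an example copied from a desktop browser and should be swapped for whatever your own browser sends:

<?php
// Example desktop browser user agent string (replace with your browser's own).
$ua = "Mozilla/5.0 (Windows NT 10.0; rv:115.0) Gecko/20100101 Firefox/115.0";

$ch = curl_init("http://www.example.com/phonebook/data.php?id=1");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $ua); // send a browser-like User-Agent header
$content = curl_exec($ch);
curl_close($ch);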


Thomas has contributed to dozens of enterprise Node.js services and has worked for a company dedicated to securing Node.js. He has spoken at several conferences on Node.js and JavaScript and is an O'Reilly published author.