I use the following command to recursively download a bunch of files from a website to my local machine. It is great for working with open directories of files, e.g. those made available by the Apache web server. The following can be added to your .bash_profile or .bashrc script, depending on which your OS/distro recommends: …
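The exact command is behind the truncation, but a minimal sketch of such a helper might look like this (the function name and flag set are assumptions, not the post's exact command):

```shell
# Hypothetical ~/.bash_profile helper for mirroring an open directory.
#   -r              recurse into links
#   -np             never ascend to the parent directory
#   -nH             don't create a directory named after the host
#   -e robots=off   ignore robots.txt (use responsibly)
#   -R "index.html*"  skip Apache's auto-generated listing pages
wgetod() {
    wget -r -np -nH -e robots=off -R "index.html*" "$1"
}
```

After sourcing your profile, usage would be `wgetod http://example.com/files/`.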
Posts Tagged: Scraping
I just referred someone to this book on building PHP web scrapers after they emailed me asking about my personal PHP web scraper project. Unfortunately I don’t have time for a lot of freelance work these days, but I’m always willing to suggest a good book or point someone to a pro.
PHP Eve Crawler on GitHub. This is a project I started working on and abandoned in 2009. It is a spider built specifically for crawling websites containing EVE kill mails. In EVE Online, every time you kill a player, an in-game 'mail' is sent to you containing information about the kill. Players would…
Open Sourcing my PHP Web Scraper. PHP Web Scraping Engine: I started this project in January 2011. It was going to be an easy-to-use web scraper that anyone could configure. It has an attractive GUI built with jQuery UI elements. Selectors can be entered using three different methods; the first is the tried…
Requests for PHP. Here's a pretty cool PHP library for making HTTP requests. It handles all of the nasty cURL stuff behind the scenes and just leaves you with a clean "API" for making requests.

$headers = array('Accept' => 'application/json');
$options = array('auth' => array('user', 'pass'));
$request = Requests::get('https://api.github.com/gists', $headers, $options);
var_dump($request->status_code); // int(200)
var_dump($request->headers['content-type']); …
I was recently tasked with getting the pecl_http package installed on a server. I already had PECL all set up (which can be its own nightmare), and I had cURL installed. But there is a mystery package which needed to be installed first.

tlhunter@amalthea:~ $ sudo pecl install pecl_http
downloading pecl_http-1.7.4.tgz …
Starting to download pecl_http-1.7.4.tgz …
This tutorial will explain how to get wget (the command-line utility for downloading files) installed on your Mac.
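For a quick preview of where that tutorial ends up: on a Mac with Homebrew installed (an assumption; the tutorial may cover other routes), getting wget is a one-liner. This hypothetical helper, not taken from the post, installs a tool only if it is missing:

```shell
# Assumes Homebrew (https://brew.sh) is already installed.
# Installs the named tool via brew only if it isn't already on the PATH.
install_if_missing() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 already installed"
    else
        brew install "$1"
    fi
}
```

Usage would simply be `install_if_missing wget`.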
If you are a PHP developer who writes a lot of software that needs to run on many different shared hosts, it can often be frustrating when certain hosts don't offer all of the functionality your applications require, specifically the cURL libraries. I've seen these missing on several hosts, either for security reasons or…
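One common defensive pattern for this situation is to feature-detect the cURL extension and fall back to PHP's stream wrappers. This is a sketch of that idea, not the article's actual code:

```php
<?php
// Sketch: fetch a URL with cURL when the extension is available, otherwise
// fall back to file_get_contents (requires allow_url_fopen on the host).
function fetch_url($url) {
    if (function_exists('curl_init')) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        $body = curl_exec($ch);
        curl_close($ch);
        return $body;
    }
    // Host without the cURL extension: use PHP's HTTP stream wrapper.
    return file_get_contents($url);
}
```

Neither path is a full replacement for the other (timeouts, headers, and error handling differ), but it keeps simple GET requests working across hosts.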
Why cURL doesn't work well with relative paths that PHP handles fine, and a workaround for the issue.
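The excerpt doesn't show the workaround itself, but the general shape of such a fix (a hypothetical sketch, not necessarily the post's solution) is to collapse "." and ".." segments into a clean absolute path before handing the URL to cURL:

```php
<?php
// Hypothetical helper: normalize dot segments in a URL path. PHP's stream
// functions tolerate these, while cURL can be pickier about them.
function normalize_path($path) {
    $out = array();
    foreach (explode('/', $path) as $segment) {
        if ($segment === '' || $segment === '.') {
            continue; // skip empty and current-directory segments
        }
        if ($segment === '..') {
            array_pop($out); // step back up one directory
        } else {
            $out[] = $segment;
        }
    }
    return '/' . implode('/', $out);
}

// normalize_path('/a/b/../c/./d.html') returns '/a/c/d.html'
```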
Spidering, in its simplest form, is the act of transferring data from one database to another: pulling content out of someone else's website and into your own store. Spidering typically requires the use of Regular Expressions, the cURL library (if POST data or cookies are used), and cron (if we need to download information on a schedule).
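As a toy illustration of the regex half of that workflow (hypothetical markup and pattern, not from the post), extracting fields from fetched HTML might look like this:

```php
<?php
// Toy spider step: pull every link's href and text out of an HTML snippet
// with a regular expression. Real-world HTML usually deserves a proper parser.
function extract_links($html) {
    preg_match_all('/<a\s+href="([^"]*)"[^>]*>([^<]*)<\/a>/i', $html, $m, PREG_SET_ORDER);
    $links = array();
    foreach ($m as $match) {
        $links[] = array('href' => $match[1], 'text' => $match[2]);
    }
    return $links;
}
```

Feeding it `'<a href="/a.html">First</a>'` would yield one entry with `href` of `/a.html` and `text` of `First`; a real spider would then write rows like these into its own database.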