How to Build a PHP Link Scraper with cURL

Let’s build a robot that scrapes links from web pages and dumps them in a database, then reads those links back from the database, follows them, scrapes the links on those pages, and so on ad infinitum.

To begin, let’s have a look at the groundwork.

The cURL Component-

cURL (or “client for URLs”) is a command-line tool for getting or sending files using URL syntax. Daniel Stenberg first released it in the late 1990s as a way to transfer files via protocols such as HTTP, FTP, Gopher, and many others, from a command-line interface. Since then, many more contributors have participated in developing cURL, and the tool is widely used today.

Using cURL with PHP-

PHP is one of the languages that provide full support for cURL. (The PHP manual lists all of the cURL functions you can use.) Conveniently, PHP also enables you to use cURL without invoking the command line, making it much easier to run cURL requests while the server is executing your script. The example below demonstrates how to retrieve the homepage of example.com using cURL and PHP.

<?php
// Download the page and write it straight to a local file.
$ch = curl_init("http://www.example.com/");
$fp = fopen("example_homepage.txt", "w");

curl_setopt($ch, CURLOPT_FILE, $fp);  // write the response body to $fp
curl_setopt($ch, CURLOPT_HEADER, 0);  // leave the HTTP headers out

curl_exec($ch);
curl_close($ch);
fclose($fp);
?>


The Link Scraper-

For the link scraper, you will use cURL to get the content of the page you are looking for, and then use the DOM to grab the links and insert them into your database. You can infer the database structure from the queries below; it is really simple stuff.
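The schema itself never appears in this tutorial, so here is a minimal sketch of what the links table might look like, with the column names (url, gathered_from, visited) inferred from the queries used below; the connection credentials and database name are placeholders.

```php
<?php
// Hypothetical setup sketch -- credentials and database name are placeholders.
mysql_connect('localhost', 'user', 'password');
mysql_select_db('scraper');

// Columns inferred from the SELECT and INSERT queries used in this tutorial.
mysql_query("CREATE TABLE IF NOT EXISTS links (
    id INT AUTO_INCREMENT PRIMARY KEY,
    url VARCHAR(255) NOT NULL,
    gathered_from VARCHAR(255),
    visited TINYINT(1) NOT NULL DEFAULT 0
)") or die('Error, create table failed');
?>
```

Seeding the table with one starting URL (visited = 0) gives the robot its first page to crawl.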

$query = mysql_query("SELECT url FROM links WHERE visited != 1");
if ($query)
{

    while ($row = mysql_fetch_array($query))
    {

        $target_url = $row['url'];
        $userAgent = 'ScraperBot';

The code above grabs each unvisited URL from the database table inside a simple while loop.

$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);

After initializing cURL, you use curl_setopt() to set the User-Agent header of the HTTP request, and then tell cURL which page you want to retrieve.

curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 20);

You’ve set a few more options with curl_setopt(). This time, you made sure the request returns a failed result when an HTTP error occurs, told cURL to follow redirects, asked for the response back as a string, and set the timeout of each request to 20 seconds. Note that a standard PHP configuration stops the whole script after 30 seconds, so if you run this from your localhost you may want to raise or remove PHP's execution time limit.

$html = curl_exec($ch);
if (!$html)
{

    echo "ERROR NUMBER: " . curl_errno($ch);
    echo "ERROR: " . curl_error($ch);
    exit;

}

Grab the actual page by executing the cURL request with curl_exec(); the options you set are sent along with it. If an error occurs, its number and description are reported to PHP by curl_errno() and curl_error(), respectively. Obviously, if such an error occurs, you exit the script.

$dom = new DOMDocument();
@$dom->loadHTML($html); // @ silences warnings from malformed HTML

Next, you create a document model of your HTML (that you grabbed from the remote server) and set it up as a DOM object.

$xpath = new DOMXPath($dom);
$href = $xpath->evaluate("/html/body//a");

Use XPATH to grab all the links on the page.

for ($i = 0; $i < $href->length; $i++) {

    $data = $href->item($i);
    // Escape the values before building the query so a stray quote
    // in a harvested URL cannot break (or inject into) the statement.
    $url = mysql_real_escape_string($data->getAttribute('href'));
    $gathered_from = mysql_real_escape_string($target_url);
    $query = "INSERT INTO links (url, gathered_from) VALUES ('$url', '$gathered_from')";
    mysql_query($query) or die('Error, insert query failed');
    echo "Successful Link Harvest: " . $url;
}

    }
}

Dump all the links into the database, along with the URL they were gathered from, so you never go back there again. A more intelligent system might keep already-visited URLs in a separate table, with a normalized relationship between the two.
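One way to sketch that normalization, using hypothetical table and column names:

```php
<?php
// Hypothetical schema sketch: visited pages live in their own table,
// and each harvested link references the page it was gathered from by id.
mysql_query("CREATE TABLE IF NOT EXISTS pages (
    id INT AUTO_INCREMENT PRIMARY KEY,
    url VARCHAR(255) NOT NULL UNIQUE,
    visited_at DATETIME NULL
)") or die('Error, create table failed');

mysql_query("CREATE TABLE IF NOT EXISTS page_links (
    page_id INT NOT NULL,       -- the page the link was gathered from
    url VARCHAR(255) NOT NULL,  -- the harvested link itself
    FOREIGN KEY (page_id) REFERENCES pages(id)
)") or die('Error, create table failed');
?>
```

The UNIQUE constraint on pages.url is what actually prevents the robot from fetching the same page twice.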

Going a step further than just grabbing links, you could harvest images or entire HTML documents as well. This is roughly where you start when building a search engine. Creating your own search engine may seem naively ambitious, but this little bit of code may inspire you.
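For example, the same XPath approach can be pointed at img tags instead of anchors. This sketch assumes $html already holds a page fetched with cURL, as in the scraper above:

```php
<?php
// Harvest image URLs from an already-fetched page ($html is assumed
// to hold the page source, fetched earlier with curl_exec()).
$dom = new DOMDocument();
@$dom->loadHTML($html); // suppress warnings from malformed markup

$xpath = new DOMXPath($dom);
$images = $xpath->evaluate("/html/body//img");

for ($i = 0; $i < $images->length; $i++) {
    $src = $images->item($i)->getAttribute('src');
    echo "Found image: " . $src . "\n";
}
?>
```

From there it is a short step to downloading each image with a second cURL handle, exactly as the homepage example at the top of this article does for a whole page.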

Source: http://www.developer.com
