
Languages :: PHP :: php link checker


By: noobie U.S.A.  Date: 14/10/2004 00:00:00  English  Points: 500 Status: Answered
Quality : Excellent
I need to run a link checker, and I have about 2000 links to check. How would I check so many links?
I am trying to check whether the links are active (and not returning a 404 error or something like that). I think the fopen() function is used for that, but I'm not sure.

Any help is appreciated.
By: VGR Date: 14/10/2004 07:35:00 English  Type : Comment
Easy:
1) loop over your X links; say $links[$i] is the current one
2) access the URI $links[$i] and check the result for 404 or any other error
3) remember $i if the URI is bad, otherwise do nothing
4) loop

something like this :
<?php
// inits
$bad = array();
// loop through $links[] (filled in by you beforehand)
for ($i = 0; $i < count($links); $i++) {
    // try to access that link
    $isgood = CheckURI($links[$i]);
    // remember the index of each bad link
    if (!$isgood) $bad[] = $i;
}
// display bad links
$badlinks = count($bad);
for ($i = 0; $i < $badlinks; $i++) echo "bad link '".$links[$bad[$i]]."' (index=$i)\n";
// done

function CheckURI($parurl) {
    // inits
    $result = TRUE;
    // try to get URI
    $filename = "$parurl";
    $tobec = TRUE;
    $fd = @fopen($filename, "r");
    if ($fd) { // page found
        while ((!feof($fd)) and ($tobec)) {
            $ligne = fgets($fd, 4096);
            if (!(strpos($ligne, '[404] Not Found') === false)) $tobec = FALSE; // stop as soon as this is encountered
            $contents[] = $ligne;
        } // blocking read loop
        fclose($fd);
        if ($tobec) { // file entirely read OK (note that we could stop after the first X lines, since the '404' message is not at the 345th line)
            // nothing, result is TRUE already
            // this block is in case you want to log anything like "last date on which the URI was found OK"
        } else { // we stopped before the end: 404 found
            $result = FALSE;
        }
    } else { // page not found
        $result = FALSE;
    } // if page found or not
    return $result;
} // CheckURI Boolean Function
?>
By: noobie Date: 14/10/2004 07:59:00 English  Type : Comment
so how would this script work?
what do i have to do? create a data file?

By: Hatemben Date: 14/10/2004 08:09:00 English  Type : Comment
Are your links in a database or a text file?
By: noobie Date: 14/10/2004 08:36:00 English  Type : Comment
well the links are in this format:
filename.php?go=Download&id=1
........
filename.php?go=Download&id=9999

First, the IDs skip numbers.
Second, I want to generate the links (all of the IDs are in a database).
Third, I want to check whether they are active (or returning 404 errors).

Thanks a lot.
Anyone who helps me complete this gets 500 points.

By: VGR Date: 14/10/2004 08:38:00 English  Type : Comment
Just do this at the beginning of the script (not tested, by the way):

$links=array();
$links[]='http://www.netscape.com';
$links[]='http://www.badlink.zob';
$links[]='http://www.experts-exchange.com';

and you'll see...

you just have to get your links in an array called $links (how surprising :/ ) and test the script... :/
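
Since the links in this thread follow the pattern filename.php?go=Download&id=N, one way to fill $links is to build one URL per ID. This is only a minimal sketch: the function name buildDownloadLinks and the host www.example.com are made up for illustration, and the IDs are hard-coded here where a real script would read them from the database.

```php
<?php
// Hypothetical helper (not from the thread): build one download URL per ID.
function buildDownloadLinks($baseUrl, array $ids) {
    $links = array();
    foreach ($ids as $id) {
        // http_build_query() URL-encodes the values; '&' is passed explicitly
        // so the result does not depend on the arg_separator.output ini setting
        $query = http_build_query(array('go' => 'Download', 'id' => (int) $id), '', '&');
        $links[] = $baseUrl . '?' . $query;
    }
    return $links;
}

// Assumption: these IDs would normally come from a database query.
// Gaps in the numbering are fine, because only existing IDs are listed.
$ids = array(1, 4, 9999);
$links = buildDownloadLinks('http://www.example.com/filename.php', $ids);
```

The resulting $links array is 0-based, so it should be looped over from index 0 to count($links) - 1.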
By: noobie Date: 14/10/2004 08:40:00 English  Type : Comment
wait so i have to do:
$links=array();
$links[]='http://www.mydomain.com';

?
and it will list all of the links on the site? (there are many pages...for example filename.php?page=1-20)
By: Morph007x2b Date: 14/10/2004 08:54:00 English  Type : Comment
Check this post out.

http://www.experts-exchange.com/Web/Q_20145908.html
By: VGR Date: 14/10/2004 09:01:00 English  Type : Comment
Well noobie, you wrote "i need to run a link checker, and i have about 2000 links to check, how would i check so many links?" so I supposed that you had this list of links :/

Don't you ?

call this list $links[] and my code will become crystal clear ;-)

In a word : yes, do

<?php
$links = array();
$links[] = 'http://www.netscape.com';
$links[] = 'http://www.badlink.zob';
$links[] = 'http://www.europeanexperts.org';

// inits
$bad = array();
// loop through $links[] (filled in by you beforehand)
for ($i = 0; $i < count($links); $i++) {
    // try to access that link
    $isgood = CheckURI($links[$i]);
    // remember the index of each bad link
    if (!$isgood) $bad[] = $i;
}
// display bad links
$badlinks = count($bad);
for ($i = 0; $i < $badlinks; $i++) echo "bad link '".$links[$bad[$i]]."' (index=$i)\n";
// done

function CheckURI($parurl) {
    // inits
    $result = TRUE;
    // try to get URI
    $filename = "$parurl";
    $tobec = TRUE;
    $fd = @fopen($filename, "r");
    if ($fd) { // page found
        while ((!feof($fd)) and ($tobec)) {
            $ligne = fgets($fd, 4096);
            if (!(strpos($ligne, '[404] Not Found') === false)) $tobec = FALSE; // stop as soon as this is encountered
            $contents[] = $ligne;
        } // blocking read loop
        fclose($fd);
        if ($tobec) { // file entirely read OK (note that we could stop after the first X lines, since the '404' message is not at the 345th line)
            // nothing, result is TRUE already
            // this block is in case you want to log anything like "last date on which the URI was found OK"
        } else { // we stopped before the end: 404 found
            $result = FALSE;
        }
    } else { // page not found
        $result = FALSE;
    } // if page found or not
    return $result;
} // CheckURI Boolean Function
?>

I don't guarantee that it's typo-free or error-free, but it's at least 85% of what you'll need in the end.
By: VGR Date: 14/10/2004 09:10:00 English  Type : Answer
OK, I TESTED IT AND IT WORKS

I had some typos and minor errors (things forgotten).


So now the code is
<?php
$links = array();
$links[1] = 'http://www.netscape.com';
$links[2] = 'http://www.badlink.zob';
$links[3] = 'http://www.europeanexperts.org';

// test
$DEBUGTEST = 1;
if ($DEBUGTEST == 1) echo count($links)." links in input\n";
//
// inits
$badlinks = 0;
$bad = array();
// loop through $links[] (filled in by you beforehand; keys start at 1 here)
for ($i = 1; $i <= count($links); $i++) {
    // try to access that link
    $isgood = CheckURI($links[$i]);
    if ($DEBUGTEST == 1) echo "link $i '".$links[$i]."' is ".(($isgood) ? 'OK' : 'KO')."\n";
    // remember the index of each bad link
    if (!$isgood) $bad[] = $i;
}
// display bad links
$badlinks = count($bad);
// test
if ($DEBUGTEST == 1) echo "$badlinks bad links found\n";
//
for ($i = 0; $i < $badlinks; $i++) echo "bad link '".$links[$bad[$i]]."' (index=$i)\n";
// done

function CheckURI($parurl) {
    // inits
    $result = TRUE;
    // try to get URI
    $filename = "$parurl";
    $tobec = TRUE;
    $fd = @fopen($filename, "r");
    if ($fd) { // page found
        while ((!feof($fd)) and ($tobec)) {
            $ligne = fgets($fd, 4096);
            if (!(strpos($ligne, '[404] Not Found') === false)) $tobec = FALSE; // stop as soon as this is encountered
            $contents[] = $ligne;
        } // blocking read loop
        fclose($fd);
        if ($tobec) { // file entirely read OK (note that we could stop after the first X lines, since the '404' message is not at the 345th line)
            // nothing, result is TRUE already
            // this block is in case you want to log anything like "last date on which the URI was found OK"
        } else { // we stopped before the end: 404 found
            $result = FALSE;
        }
    } else { // page not found
        $result = FALSE;
    } // if page found or not
    return $result;
} // CheckURI Boolean Function
?>

and it produces (correctly) :
3 links in input
link 1 'http://www.netscape.com' is OK
link 2 'http://www.badlink.zob' is KO
link 3 'http://www.europeanexperts.org' is OK
1 bad links found
bad link 'http://www.badlink.zob' (index=0)

Just set $DEBUGTEST=0 and the code will print only the bad links.
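
CheckURI looks for the text '[404] Not Found' in the page body, which depends on how each server words its error page. As an alternative, here is a minimal sketch that reads the HTTP status line with PHP's get_headers() instead. The helper names lastStatusCode and isLinkAlive are made up for illustration, and treating every 2xx/3xx code as "alive" is an assumption you may want to tighten.

```php
<?php
// Hypothetical helper: return the last HTTP status code found in a list of
// header lines (after redirects, the last status line is the final response),
// or 0 if no status line is present.
function lastStatusCode(array $headers) {
    $code = 0;
    foreach ($headers as $line) {
        if (is_string($line) && preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

// Hypothetical replacement for CheckURI: TRUE when the final status is 2xx or 3xx.
function isLinkAlive($url) {
    $headers = @get_headers($url); // returns FALSE if the request fails
    if ($headers === false) {
        return false;
    }
    $code = lastStatusCode($headers);
    return $code >= 200 && $code < 400;
}
```

To try it in the loop above, call isLinkAlive($links[$i]) where CheckURI($links[$i]) is called. The whole page body no longer has to be downloaded and scanned line by line.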

By: noobie Date: 14/10/2004 09:29:00 English  Type : Comment
The script works, but I want to check all of the links that are associated with the site...
If I put in yahoo.com, I want it to check the entire site map: all of the links on the page, and all of the links on the pages it links to.

later.
By: VGR Date: 14/10/2004 10:39:00 English  Type : Comment
That's not at all what your original question was about...

... anyway, it's feasible with the same CheckURI calls: read the page, then run CheckURI on every link encountered in it.

I'll let you build this loop, given that it's a different question. I even suggest you ask a new question, because I fairly answered your original one.

I would do this:
- for each URL in the original list of sites
- check it using the technique above, BUT
- modify CheckURI so that it recursively checks all URIs encountered in the page currently being checked
- provide an external "maximum depth" constant to stop the recursion
- parse the $contents[] array for tags (A HREF, IMG, FORM ACTION=, etc.; it's a lot of work), build a local array of the URIs found, then loop through it and call the same function again recursively

Feasible, but time-consuming if you go deeper than the first level (i.e., verifying the sites and their immediate links, not the links of linked pages).
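
The steps above can be sketched roughly as follows. This is not tested against real sites and rests on several assumptions: extractLinks and crawl are made-up names, the regular expression only handles quoted attribute values, relative links are skipped instead of being resolved, and the page fetcher is passed in as a callable so the recursion can be tried without network access.

```php
<?php
// Hypothetical helper: collect the targets of A HREF, IMG SRC and FORM ACTION.
// Assumption: attribute values are quoted; unquoted values are not matched.
function extractLinks($html) {
    $pattern = '/<(?:a|img|form)\b[^>]*?\b(?:href|src|action)\s*=\s*(["\'])(.*?)\1/i';
    preg_match_all($pattern, $html, $matches);
    return array_values(array_unique($matches[2]));
}

// Hypothetical recursive checker.
// $fetch    : callable returning the page body, or FALSE if it cannot be read
// $maxDepth : 0 checks only the start URL, 1 also checks its immediate links, ...
// $seen     : filled with url => TRUE (readable) or FALSE (bad link)
function crawl($url, $fetch, $maxDepth, &$seen, $depth = 0) {
    if (isset($seen[$url])) {
        return; // already checked, which also avoids endless loops
    }
    $html = call_user_func($fetch, $url);
    $seen[$url] = ($html !== false);
    if ($html === false || $depth >= $maxDepth) {
        return;
    }
    foreach (extractLinks($html) as $link) {
        // Assumption: only absolute http(s) links are followed
        if (preg_match('#^https?://#i', $link)) {
            crawl($link, $fetch, $maxDepth, $seen, $depth + 1);
        }
    }
}
```

For real pages, a fetcher such as function ($u) { return @file_get_contents($u); } could be passed as $fetch; afterwards, the bad links are the keys of $seen whose value is FALSE.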
By: Morph007x2b Date: 14/10/2004 10:42:00 English  Type : Comment
You could try one of those free link harvesters :) Search Google: http://www.google.com/search?q=Link+Harvestor
