Languages :: PHP :: Search Web Sites for Specific Information


By: duerra Spain  Date: 03/10/2003 00:00:00  English  Points: 500 Status: Answered
Quality : Excellent
Here's my problem. I'm trying to write a "spider" of sorts that goes out and gathers lists of proxies and their port numbers, so that I can add a database list to my site and flag certain proxy events in order to monitor the site and keep it secure. There are a bunch of free proxy listing sites out there, and I need a PHP script that can go out, gather the proxies and their respective ports, and print them to a text file in the traditional format:
123.123.123.123:80

One example of such a site is here:
http://www.proxy4free.com/page1.html

Most of the time, these lists are in tables as shown. Can anybody help me out with the regex's to get the information that I'm looking for from these sites? 500 points for the best solution.

Thanks
By: VGR Date: 03/10/2003 01:25:00 English  Type : Comment
I did this to update another kind of database (not proxies), and I already posted some code (a whole bunch, in fact). Do a search in the PHP section.

Otherwise I can copy-paste it again in an hour or so.
By: VGR Date: 03/10/2003 01:27:00 English  Type : Comment
It's very basic, some would say "crude and brutal", but it works fine and... no regexps :D

It's all done with strpos(), substr(), and ready-made functions that extract the data found between given tags in the HTML page (exactly what you want).
By: duerra Date: 03/10/2003 01:28:00 English  Type : Comment
Yeah.... suppose I should figure out how to search on the new EE look, eh?

I'm on oldlook.ee right now. The very second I left this thread I went to the suggestions box and griped to bloody hell. VGR, I'll look up your code when I get caught up with my work here at work (probably after lunch - 3 hours).

Wow... I'm typing stuff, and I realize the anger in my tone. This new layout brings out the worst in me =(
By: duerra Date: 03/10/2003 01:42:00 English  Type : Comment
VGR, I just spent a minute trying to search for it. I couldn't come up with any results. Do you have a quicker means to the code? I'll wait for it if necessary.

As for it being crude and brutal... I'd still like it to be flexible. Cold-steel grab-and-go finder won't even work from site to site... it'd have to be customly configured for each one. I trust that your code is better than that, however.
By: VGR Date: 03/10/2003 02:46:00 English  Type : Comment
it's in there... the first one in the "development" section of the 3rd party tools, now translated into English...

http://fecj.org.hebergement-dynamique.org/edain/thirdparty.php/get?nom=outils/gestDB.php.txt
By: duerra Date: 03/10/2003 03:45:00 English  Type : Comment
VGR, I've been known to be a moron from time to time.... but I can't find it in there anywhere. In fact, I can't find the "development" section with my nifty little "find" tool that I like to utilize.
By: VGR Date: 03/10/2003 04:00:00 English  Type : Answer
???

you go to www.edainworks.com

you click French or English

you go to the development link in the "3rd party tools" section in the menu

the first link is gestDB.php.txt (the source)

which "find tool" ?
By: duerra Date: 03/10/2003 04:26:00 English  Type : Comment
Edit: Find

I was clicking the link that you posted. I thought it was a collection of php scripts, and one of them had what I was looking for. I didn't go directly to the site.

Anyway, I'm sorry, VGR, but this is 300 lines of globber to me. I don't know what this does, I have no need for a database for this particular issue - just writing to a text file, and I don't speak.... your native language.

For 500 points, could you either explain to me how to use this script to do what I'm looking for, or help me come up with something that does work? I'm sorry, but I don't see it in here.
By: VGR Date: 03/10/2003 17:43:00 English  Type : Comment
Anyway, I'm sorry, VGR, but this is 300 lines of globber to me.
>> thanks...

I don't know what this does,
>> well, read it or try it

I have no need for a database for this particular issue
>> well, to analyse a database, it's better to have one

and I don't speak.... your native language.
>> only the comments are in French, and that shouldn't be too hard: English contains roughly 50% French words

For 500 points, could you either explain to me how to use this script to do what I'm looking for, or help me come up with something that does work?
>> my script works

I'm sorry, but I don't see it in here.
>> that's true: I confused your question with another one about automatically handling databases, tables, fields, etc.

My apologies. But my script is no globber ;-)


this is more what you asked for :

$filename = "$urlsite/$lienannonces";
// traces
$debug=0;
if ($debug==1) echo "########################page 1 : $filename\n";
$k=0; // number of announcements found
$contents = array(); // lines of the fetched page
$fd = @fopen ($filename, "r");
if ($fd) { // page found
while (!feof ($fd)) {
$ligne= fgets($fd, 4096);
if ($ligne !== false) $contents []=$ligne; // fgets() returns false at end of file
} // while: blocking read
// $contents = fread ($fd, filesize ($filename)); non-blocking: unreliable
fclose ($fd);
// traces
//if ($debug==1) for ($i=0;$i<count($contents);$i++) echo htmlspecialchars($contents[$i])."\n";
//exit;
Analyse($locID,$contents,$annonces,$k,$liencatalogue,$lienseries,$lienannonces);
} // 1st page found
else { // 1st page not found
// leave the data empty
$k=0;
// log failure : LogAction('automate',$REMOTE_ADDR,"Problem accessing URL $filename",2);
// alert : mail("$globFAdmin", "BNniouzes : problem accessing page $urlsite", "Admin message :\n\nProblem accessing URL $filename\n\nThe service robot.","From: contact@$SERVER_NAME\nReply-To: contact@$SERVER_NAME\nX-Mailer: PHP/" . phpversion()); // with extra headers (to be reviewed)
} // if 1st page found



then Analyse() is something like :

function Analyse($locID,$contents,&$annonces,&$k,$liencatalogue,$lienseries,$lienauteurs,$isSortie=FALSE) {
GLOBAL $REQUEST_URI, $globEmail, $globFAdmin, $sess_pseudo, $REMOTE_ADDR, $SERVER_NAME;
$locTo=date('Y-m-d');
// traces
$debug=0;
if ($debug==1) echo "Analy---start---\n";
$i=0; // index of the current line in $contents[]
$j=count($contents);
while ($i<$j) { // until every line has been scanned
if ($debug==1) echo "initial scan $i $j\n";
// test $i<$j first so that $contents[$i] is never read past the end of the array
while (($i<$j) and (strpos($contents[$i],'">En')===false) and (strpos($contents[$i],'size="3">')===false)) $i++;
if ($i<>$j) { // data block found
// traces
if ($debug==1) echo "Analy---new block----\n";
$k++; // increment the number of blocks found
$yy=0;
// remember codesortie. NOTE: for Soleil the $lienauteurs argument holds the announcement link used (2 links)
// traces
if ($debug==1) echo "looking for codecatalogue via '?id=' in '".htmlspecialchars($contents[$i])."'\n";
$deb='?id='; $fin='"><'; $annonces[$k]["codecatalogue"]=GetChunk($i,$yy,$contents,$deb,$fin);
// traces
if ($debug==1) echo "codecatalogue found ".$annonces[$k]["codecatalogue"]."\n";
} // if block found
// else: finished with this block
$i++;
} // end of scan
// traces
if ($debug==1) echo "Analy---end---\n";
}

which uses the GetChunk() function which is :

function GetChunk(&$i,&$zz,$contents,$deb,$fin,$debug=0) {
$contents[$i]=substr($contents[$i],$zz); // the rest of the line, from the cursor on
// move to the first line containing the start marker; give up at the end of the array
while (($m=strpos($contents[$i],$deb))===false) { $i++; $zz=0; if (!isset($contents[$i])) return(''); }
$m=$m+strlen($deb);
$n=$m;
$locRes='';
$l=strlen($fin);
while (($n<strlen($contents[$i]))and((substr($contents[$i],$n,$l))<>$fin)) $n++;
if ($n==strlen($contents[$i])) { // end marker not on this line: keep the tail and look on the next line
$locRes=substr($contents[$i],$m);
$i++;
$zz=0;
$m=0;
$n=isset($contents[$i]) ? strpos($contents[$i],$fin) : false;
}
if (!($n===false)) {
$locRes.=substr($contents[$i],$m,$n-$m);
$zz=$zz+$n+1;
} else { $locRes.=''; $zz=0; } // so if the end marker is not found, nothing more is appended
return($locRes);
} // GetChunk String Function


beware, this version of this kind of function needs to find $deb (the start substring) and $fin (the ending substring) on the same line. I've more flexible versions, but this should be enough to give you ideas
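
To make the marker idea easier to follow, here is a minimal sketch of the same strpos()/substr() technique, reduced to a single string instead of an array of lines. The function name chunkBetween(), the $offset cursor and the sample HTML are hypothetical and not part of the script above; it is only meant to illustrate the principle.

```php
<?php
// Hypothetical simplification of the start/end marker technique.
// Returns the text between $start and $end, searching from $offset,
// and advances $offset past the end marker so repeated calls walk the page.
function chunkBetween(string $haystack, string $start, string $end, int &$offset): ?string
{
    $from = strpos($haystack, $start, $offset);
    if ($from === false) {
        return null;                 // start marker not found
    }
    $from += strlen($start);
    $to = strpos($haystack, $end, $from);
    if ($to === false) {
        return null;                 // end marker not found
    }
    $offset = $to + strlen($end);    // move the cursor for the next call
    return substr($haystack, $from, $to - $from);
}

// Made-up sample markup using the same markers as Analyse() ('?id=' and '"><').
$html = '<a href="page.php?id=42"><b>First</b></a> <a href="page.php?id=77"><b>Second</b></a>';
$offset = 0;
$ids = [];
while (($id = chunkBetween($html, '?id=', '"><', $offset)) !== null) {
    $ids[] = $id;
}
print_r($ids);
```

Because the cursor is passed by reference, the caller never has to track positions itself, which is the same convenience the $i and $zz counters provide in GetChunk().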

regards
By: duerra Date: 03/10/2003 18:03:00 English  Type : Comment
I don't doubt that it works, I was simply stating that I could not make out what I was looking for based on what I was provided. It was not meant as any way of degrading your work or scripts, but simply asking for more explanation rather than a link to a file that's written in French (wow, I've actually got to use that phrase and mean it literally) that I couldn't make out.

Regardless, I will plug it in and try it when I get home from work. To reiterate what I was looking for:

1. Pull out the IPs ( [1-3 digits].[1-3 digits].[1-3 digits].[1-3 digits] ) and the port numbers associated with those IPs (80, for example).

2. Print these IPs to a text file in the format IP:Port

3. Keep it easy to maintain so that it works across different sites.
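
The three requirements above can be sketched roughly as follows. This is only an illustration under assumptions: it supposes the IP and the port sit in two adjacent table cells, and the function name extractProxies() and the sample rows are made up for the example.

```php
<?php
// Hypothetical helper: pull "IP:port" pairs out of HTML where the IP and
// the port are in two adjacent <td> cells (an assumption about the markup).
function extractProxies(string $html): array
{
    $pattern = '~<td[^>]*>\s*((?:\d{1,3}\.){3}\d{1,3})\s*</td>\s*'
             . '<td[^>]*>\s*(\d{2,5})\s*</td>~i';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $proxies = [];
    foreach ($matches as $m) {
        $proxies[] = $m[1] . ':' . $m[2];   // group 1 = IP, group 2 = port
    }
    return $proxies;
}

// Made-up sample rows standing in for a downloaded page.
$sample = "<tr><td>123.123.123.123</td><td>80</td><td>anonymous</td></tr>\n"
        . "<tr><td> 10.0.0.5 </td>\n<td>8080</td></tr>";
$proxies = extractProxies($sample);
echo implode("\n", $proxies), "\n";
```

Writing the result to a text file could then be a single call such as file_put_contents('proxyList.txt', implode("\r\n", $proxies) . "\r\n", FILE_APPEND), and supporting another site would mean adjusting only the pattern.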
By: VGR Date: 03/10/2003 18:13:00 English  Type : Comment
1) activate the //traces lines in the first part, to display the contents fetched over HTTP
2) analyse that output to detect:
- where each "interesting" block starts
- the start and end markers around each IP address listed

3) then modify Analyse() accordingly; this means changing some literals and the $deb= and $fin= parts

4) set $debug=1

5) trace to make sure you get the correct information

6) turn all debugging off

7) it's ready


By: jayrod Date: 03/10/2003 18:28:00 English  Type : Comment
heh.. duerra I had the same reaction when I looked at VGR's script :P
By: duerra Date: 03/10/2003 18:35:00 English  Type : Comment
Sorry, VGR, I couldn't get yours to work how I wanted it to, though I'm sure it works. Instead, I created my own. My script is as follows:


<?php
set_time_limit(5600);

$proxyFile = fopen('proxyList.txt', 'a');

$array = array();
$pageArray[0] = array('page1.html', 'page2.html', 'page3.html','page4.html','page5.html', 'page6.html', 'page7.html', 'page8.html', 'page9.html', 'page10.html');
$pageArray[1] = array('page1.html', 'page2.html', 'page3.html','page4.html','page5.html');
$pageArray[2] = array('page1.html', 'page2.html', 'page3.html','page4.html','page5.html', 'page6.html', 'page7.html', 'page8.html', 'page9.html', 'page10.html');
$pageArray[3] = array('index.pl/proxy_list', 'index.pl/proxy_list?order=&offset=100',
'index.pl/proxy_list?order=&offset=100', 'index.pl/proxy_list?order=&offset=200',
'index.pl/proxy_list?order=&offset=300', 'index.pl/proxy_list?order=&offset=400',
'index.pl/proxy_list?order=&offset=500', 'index.pl/proxy_list?order=&offset=600',
'index.pl/proxy_list?order=&offset=700', 'index.pl/proxy_list?order=&offset=800');
$pageArray[4] = array('page1.html', 'page2.html', 'page3.html','page4.html','page5.html', 'page6.html', 'page7.html', 'page8.html', 'page9.html', 'page10.html');
$pageArray[5] = array('index.html', 'page2.html', 'page3.html','page4.html','page5.html', 'page6.html', 'page7.html', 'page8.html', 'page9.html', 'page10.html');
$pageArray[6] = array('page1.html', 'page2.html', 'page3.html','page4.html','page5.html');

$siteArray = array(array('url' => 'www.proxy4free.com', 'start' => '<td>', 'end' => '</td>', 'pages' => $pageArray[0]),
array('url' => 'www.freepublicproxies.com', 'start' => '<td>', 'end' => '</td>', 'pages' => $pageArray[1]),
array('url' => 'www.findproxy.com', 'start' => '<td>', 'end' => '</td>', 'pages' => $pageArray[2]),
array('url' => 'www.stayinvisible.com', 'start' => '<td>', 'end' => '</td>', 'pages' => $pageArray[3]),
array('url' => 'www.allproxies.com', 'start' => '<td>', 'end' => '</td>', 'pages' => $pageArray[4]),
array('url' => 'www.allproxies.com', 'start' => '<td>', 'end' => '</td>', 'pages' => $pageArray[4]),
array('url' => 'www.findproxy.com', 'start' => '<td>', 'end' => '</td>', 'pages' => $pageArray[5]),
array('url' => 'www.publicproxyservers.com', 'start' => '<td>', 'end' => '</td>', 'pages' => $pageArray[6])
);

$totalSites = count($siteArray);

$numProxies = 0;
for($site = 0; $site < $totalSites; $site++)
{
$totalSitePages = count($siteArray[$site]['pages']);
$st = $siteArray[$site]['start'];
$end = $siteArray[$site]['end'];
$base = $siteArray[$site]['url'];

for($page = 0; $page < $totalSitePages; $page++)
{
$extension = $siteArray[$site]['pages'][$page];

$fp = fsockopen($base, 80, $errno, $errstr, 30);
if (!$fp)
{
echo "$errstr ($errno)<br>\n";
continue; // skip this page so that no stale $string gets parsed
}
else
{
$string = "";
$urlString = "GET /$extension HTTP/1.0\r\n";
$urlString .= "Host: $base\r\n\r\n";

fputs($fp,$urlString);
while (!feof($fp))
{
$string .= fgets ($fp,128);
}
fclose ($fp);
}

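// "*" is the pattern delimiter, so {0,} stands in for the * quantifier; matches an IP cell followed by a port cell
preg_match_all("*".$st."(\s){0,}([0-9]{1,3}\.){3}[0-9]{1,3}(\s){0,}".$end."(\s){0,}".$st."(\s){0,}([0-9]{2,5})(\s){0,}".$end."*", $string, $matches);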

$numMatches = count($matches[0]);
for($i = 0; $i < $numMatches; $i++)
{
$matches[0][$i] = str_replace($st, " ", $matches[0][$i]);
$matches[0][$i] = str_replace($end, " ", $matches[0][$i]);
$matches[0][$i] = trim($matches[0][$i]);
// collapse the whitespace between IP and port into ":" (the ereg functions were removed in PHP 7)
$matches[0][$i] = preg_replace("/\s+/", ":", $matches[0][$i]);
echo htmlspecialchars($matches[0][$i])."<Br>";
fwrite($proxyFile, $matches[0][$i]."\r\n");
$numProxies++;
}
}
}

fclose($proxyFile);
echo "The Number of Proxies Documented: $numProxies";

?>
By: VGR Date: 03/10/2003 19:09:00 English  Type : Comment
well, you use a regexp to get the data between $st and $end [while I prefer the performance of strpos() and substr()]
then you str_replace() $st and $end [while GetChunk() strips them automatically]
then you trim() [that's normal]
then you have your data [me too]

It's the same; it's just that your condition for getting the data is much simpler than mine: I handle 17 sites with 17 entirely different HTML formats, from which I must extract the same set of data, sometimes by parsing multiple pages multiple times for a given site... So my GetChunk() and GetChunk2() functions are more flexible than your single preg_match_all() pattern.

Especially given that I can use them to get elements from lines sequentially (the $i and $zz counters increment automatically, so I don't have to track the current line number and cursor position...)

I just pass it the $contents[] array I build (while you build one big string), with $st=$deb and $end=$fin, and voilà.

I hope you recognize this :D
By: duerra Date: 04/10/2003 17:51:00 English  Type : Comment
The problem is that there was more than one "block" on the page with <td> tags. I still had to find and ensure the proxies, and ports. However, I will give you the points regardless.
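
One way to guard against numbers picked up from unrelated <td> blocks is to validate each candidate after extraction. The sketch below is an assumption about how that check could look, not the check actually used in the script above; the helper name isValidProxy() and the sample candidates are hypothetical.

```php
<?php
// Hypothetical filter: accept a candidate only if it is a valid IPv4
// address followed by a port in the range 1-65535.
function isValidProxy(string $candidate): bool
{
    $parts = explode(':', $candidate);
    if (count($parts) !== 2) {
        return false;
    }
    [$ip, $port] = $parts;
    if (filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) === false) {
        return false;                         // e.g. an octet above 255
    }
    if (preg_match('/^\d{1,5}$/', $port) !== 1) {
        return false;                         // port must be 1 to 5 digits
    }
    return (int) $port >= 1 && (int) $port <= 65535;
}

// Made-up candidates, two of which should be rejected.
$candidates = ['123.123.123.123:80', '999.1.1.1:80', '10.0.0.5:99999', '10.0.0.5:8080'];
$valid = array_values(array_filter($candidates, 'isValidProxy'));
print_r($valid);
```

Running such a filter before writing to the text file keeps out-of-range octets and ports out of the list, whichever extraction method produced the candidates.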
By: VGR Date: 04/10/2003 17:59:00 English  Type : Comment
In two hours, you can see my code working on the 17 publishers' sites I parse each day for new comics (bandes dessinées).
