Languages :: PHP :: Search Web Sites for Specific Information |
|||
| By: duerra |
Date: 03/10/2003 00:00:00 |
Points: 500 | Status: Answered Quality : Excellent |
|
Here's my problem. I'm trying to make a "spider" of sorts that goes out and gathers lists of proxies and their port numbers, so that I can add a database list to my site and flag certain proxy events for security monitoring. There are a bunch of free proxy listing sites out there, and I need a PHP script that can gather the proxies and their respective ports and print them to a text file in the traditional format: 123.123.123.123:80 One example of such a site is here: http://www.proxy4free.com/page1.html Most of the time, these lists are in tables as shown. Can anybody help me out with the regexes to get the information that I'm looking for from these sites? 500 points for the best solution. Thanks |
|||
| By: VGR | Date: 03/10/2003 01:25:00 | Type : Comment |
|
| I did this to update another kind of database (not proxies), and I already posted some code (a whole bunch, in fact). Do a search in the PHP section; otherwise I can copy-paste it again in an hour or so |
|||
| By: VGR | Date: 03/10/2003 01:27:00 | Type : Comment |
|
| It's very basic, some would say "crude and brutal", but it works fine and... no regexps :D It is all done with strpos(), substr(), and ready-made functions that extract the data found between tags in the HTML page (exactly what you want) |
|||
| By: duerra | Date: 03/10/2003 01:28:00 | Type : Comment |
|
| Yeah.... suppose I should figure out how to search on the new EE look, eh? I'm on oldlook.ee right now. The very second I left this thread I went to the suggestions box and griped to bloody hell. VGR, I'll look up your code when I get caught up with my work here at work (probably after lunch - 3 hours). Wow... I'm typing stuff, and I realize the anger in my tone. This new layout brings out the worst in me =( |
|||
| By: duerra | Date: 03/10/2003 01:42:00 | Type : Comment |
|
| VGR, I just spent a minute trying to search for it. I couldn't come up with any results. Do you have a quicker means to the code? I'll wait for it if necessary. As for it being crude and brutal... I'd still like it to be flexible. A cold-steel grab-and-go finder won't even work from site to site... it'd have to be custom-configured for each one. I trust that your code is better than that, however. |
|||
| By: VGR | Date: 03/10/2003 02:46:00 | Type : Comment |
|
| It's in there... the first one in the "development" section of the 3rd party tools, now translated into English... http://fecj.org.hebergement-dynamique.org/edain/thirdparty.php/get?nom=outils/gestDB.php.txt |
|||
| By: duerra | Date: 03/10/2003 03:45:00 | Type : Comment |
|
| VGR, I've been known to be a moron from time to time.... but I can't find it in there anywhere. In fact, I can't find the "development" section with my nifty little "find" tool that I like to utilize. |
|||
| By: VGR | Date: 03/10/2003 04:00:00 | Type : Answer |
|
| ??? You go to www.edainworks.com, you click French or English, you go to the development link in the "3rd party tools" section of the menu, and the first link is gestDB.php.txt (the source). Which "find tool"? |
|||
| By: duerra | Date: 03/10/2003 04:26:00 | Type : Comment |
|
| Edit > Find. I was clicking the link that you posted. I thought it was a collection of PHP scripts, and that one of them had what I was looking for. I didn't go directly to the site. Anyway, I'm sorry, VGR, but this is 300 lines of globber to me. I don't know what this does, I have no need for a database for this particular issue (just writing to a text file), and I don't speak.... your native language. For 500 points, could you either explain to me how to use this script to do what I'm looking for, or help me come up with something that does work? I'm sorry, but I don't see it in here. |
|||
| By: VGR | Date: 03/10/2003 17:43:00 | Type : Comment |
|
| Anyway, I'm sorry, VGR, but this is 300 lines of globber to me.
>> thanks...
I don't know what this does,
>> well, read it or try it
I have no need for a database for this particular issue
>> well, to analyse a database, it's better to have one
and I don't speak.... your native language.
>> only the comments are in French, and don't make it too hard: English contains roughly 50% of French words
For 500 points, could you either explain to me how to use this script to do what I'm looking for, or help me come up with something that does work?
>> my script works
I'm sorry, but I don't see it in here.
>> that's true: I mistook your question for another one about automatically handling databases, tables, fields, etc. My apologies. But my script is no globber ;-)

This is more what you asked for:

$filename = "$urlsite/$lienannonces";
// traces
$debug = 0;
if ($debug == 1) echo "########################page 1 : $filename ";
$k = 0; // number of listings found
$fd = @fopen($filename, "r");
if ($fd) { // page found
    while (!feof($fd)) {
        $ligne = fgets($fd, 4096);
        if ($ligne !== false) $contents[] = $ligne;
    } // blocking read loop
    // $contents = fread($fd, filesize($filename)); non-blocking: unreliable
    fclose($fd);
    // traces
    // if ($debug == 1) for ($i = 0; $i < count($contents); $i++) echo htmlspecialchars($contents[$i]).' ';
    // exit;
    Analyse($locID, $contents, $annonces, $k, $liencatalogue, $lienseries, $lienannonces);
} // first page found
else { // first page not found
    // no data
    $k = 0;
    // log failure:
    LogAction('automate', $REMOTE_ADDR, "Problem accessing URL $filename", 2);
    // alert:
    mail("$globFAdmin", "BNniouzes: problem accessing page $urlsite",
         "Admin message :\n\nProblem accessing URL $filename\n\nThe service robot.",
         "From: contact@$SERVER_NAME\nReply-To: contact@$SERVER_NAME\nX-Mailer: PHP/" . phpversion()); // with extra headers (to be reviewed)
} // if first page found

Then Analyse() is something like:

function Analyse($locID, $contents, &$annonces, &$k, $liencatalogue, $lienseries, $lienauteurs, $isSortie = FALSE) {
    GLOBAL $REQUEST_URI, $globEmail, $globFAdmin, $sess_pseudo, $REMOTE_ADDR, $SERVER_NAME;
    $locTo = date('Y-m-d');
    // traces
    $debug = 0;
    if ($debug == 1) echo 'Analy---start--- ';
    $i = 0; // index of the current line in $contents[]
    $j = count($contents);
    while ($i < $j) { // until the end of the page is reached
        if ($debug == 1) echo "initial scan $i $j ";
        // check the bound first so $contents[$i] is never read past the end
        while (($i < $j) and (strpos($contents[$i], '">En') === false) and (strpos($contents[$i], 'size="3">') === false)) $i++;
        if ($i <> $j) { // data block found
            // traces
            if ($debug == 1) echo "Analy---new block---- ";
            $k++; // increment the number of entries found
            $yy = 0; // codesortie memo. NOTE: for Soleil, the $lienauteurs argument holds the listing link used (2 links)
            // traces
            if ($debug == 1) echo "looking for codecatalogue via '?id=' in '".htmlspecialchars($contents[$i])."' ";
            $deb = '?id=';
            $fin = '"><';
            $annonces[$k]["codecatalogue"] = GetChunk($i, $yy, $contents, $deb, $fin);
            // traces
            if ($debug == 1) echo "codecatalogue found ".$annonces[$k]["codecatalogue"]." ";
        } // if block found
        // else: no more blocks
        $i++;
    } // this block is done
    // traces
    if ($debug == 1) echo 'Analy---end--- ';
}

It uses the GetChunk() function, which is:

function GetChunk(&$i, &$zz, $contents, $deb, $fin, $debug = 0) {
    $contents[$i] = substr($contents[$i], $zz); // the remainder of the line
    while (($i < count($contents)) and (($m = strpos($contents[$i], $deb)) === false)) { $i++; $zz = 0; }
    if ($i >= count($contents)) return(''); // start marker never found
    $m = $m + strlen($deb);
    $n = $m;
    $locRes = '';
    $l = strlen($fin);
    while (($n < strlen($contents[$i])) and ((substr($contents[$i], $n, $l)) <> $fin)) $n++;
    if ($n == strlen($contents[$i])) {
        $locRes = substr($contents[$i], $m);
        $i++;
        $zz = 0;
        $m = 0;
        $n = strpos($contents[$i], $fin);
    }
    if (!($n === false)) {
        $locRes .= substr($contents[$i], $m, $n - $m);
        $zz = $zz + $n + 1;
    } else {
        $locRes .= '';
        $zz = 0;
    } // so if nothing is found, it returns the empty string...
    return($locRes);
} // GetChunk String Function

Beware: this version of this kind of function needs to find $deb (the start substring) and $fin (the ending substring) on the same line. I have more flexible versions, but this should be enough to give you ideas.

Regards |
|||
| By: duerra | Date: 03/10/2003 18:03:00 | Type : Comment |
|
| I don't doubt that it works; I was simply stating that I could not make out what I was looking for based on what I was provided. It was not meant to degrade your work or scripts in any way. I was simply asking for more explanation rather than a link to a file that's written in French (wow, I've actually got to use that phrase and mean it literally) that I couldn't make out. Regardless, I will plug it in and try it when I get home from work. To reiterate what I was looking for:
1. Pull out the IPs ( [1-3 numbers].[1-3 numbers].[1-3 numbers].[1-3 numbers] ), and the port numbers associated with those IPs (80, for example).
2. Print these IPs to a text file in the format IP:Port
3. Easy to maintain so that it works across different sites. |
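The three requirements above could be sketched roughly as follows. This is only an illustration under stated assumptions, not anyone's posted solution: `extract_proxies()` is a hypothetical helper, the sample HTML is made up, the cell markers default to `<td>` and `</td>`, and the output goes to a `proxyList.txt` in the system temp directory so the sketch has no side effects elsewhere.

```php
<?php
// Hypothetical helper: finds a cell holding an IPv4-looking address that is
// immediately followed by a cell holding a 2 to 5 digit port, and returns the
// pairs as "IP:Port" strings. $start and $end are the per-site cell markers.
function extract_proxies(string $html, string $start = '<td>', string $end = '</td>'): array
{
    $s = preg_quote($start, '#');
    $e = preg_quote($end, '#');
    $pattern = '#' . $s . '\s*((?:\d{1,3}\.){3}\d{1,3})\s*' . $e
             . '\s*' . $s . '\s*(\d{2,5})\s*' . $e . '#i';
    preg_match_all($pattern, $html, $m, PREG_SET_ORDER);
    $out = array();
    foreach ($m as $hit) {
        $out[] = $hit[1] . ':' . $hit[2]; // requirement 2: IP:Port
    }
    return $out;
}

// Made-up sample rows, for illustration only.
$sample = "<tr><td>123.123.123.123</td>\n<td>80</td><td>anonymous</td></tr>"
        . "<tr><td> 10.0.0.7 </td><td> 8080 </td><td>transparent</td></tr>";
$list = extract_proxies($sample);

// Append one entry per line to a text file (temp directory is an assumption).
$file = sys_get_temp_dir() . '/proxyList.txt';
file_put_contents($file, implode("\r\n", $list) . "\r\n", FILE_APPEND);
print_r($list);
```

Passing the markers as arguments is one way to meet requirement 3: a site that lists proxies in, say, `<li>` elements only needs different `$start` and `$end` values, not a new pattern.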
|||
| By: VGR | Date: 03/10/2003 18:13:00 | Type : Comment |
|
| 1) activate the //traces part in the first part, to display the contents fetched over HTTP
2) analyse them to detect:
- the start of the "interesting" block
- the start and end markers for each IP address listed
3) then modify Analyse() accordingly; this means changing some literals and the $deb= and $fin= parts
4) set $debug=1
5) trace to make sure you get the correct information
6) turn all debugging off
7) it's ready |
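For steps 1 and 2, a small trace helper can make the markers easier to spot. The sketch below is an assumption, not part of VGR's script: `trace_lines()` is a hypothetical name, and the page fragment is made up to stand in for what the fgets() loop would collect.

```php
<?php
// Hypothetical debugging aid: returns each fetched line with its index,
// HTML-escaped so the raw tags stay visible when echoed into a browser.
function trace_lines(array $contents): string
{
    $out = '';
    foreach ($contents as $i => $line) {
        $out .= sprintf("%4d: %s\n", $i, htmlspecialchars(rtrim($line)));
    }
    return $out;
}

// Made-up page fragment, for illustration only.
$contents = array(
    "<table>\n",
    "<tr><td>123.123.123.123</td><td>80</td></tr>\n",
    "</table>\n",
);
echo trace_lines($contents);
```

Reading the numbered dump, one would pick the text just before and just after each value as the start and end markers.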
|||
| By: jayrod | Date: 03/10/2003 18:28:00 | Type : Comment |
|
| heh.. duerra I had the same reaction when I looked at VGR's script :P |
|||
| By: duerra | Date: 03/10/2003 18:35:00 | Type : Comment |
|
| Sorry, VGR, I couldn't get yours to work how I wanted it to, though I'm sure it works. Instead, I created my own. My script is as follows:

<?php
set_time_limit(5600);
$proxyFile = fopen('proxyList.txt', 'a');

// Page paths to fetch for each site.
$pageArray[0] = array('page1.html', 'page2.html', 'page3.html', 'page4.html', 'page5.html', 'page6.html', 'page7.html', 'page8.html', 'page9.html', 'page10.html');
$pageArray[1] = array('page1.html', 'page2.html', 'page3.html', 'page4.html', 'page5.html');
$pageArray[2] = array('page1.html', 'page2.html', 'page3.html', 'page4.html', 'page5.html', 'page6.html', 'page7.html', 'page8.html', 'page9.html', 'page10.html');
$pageArray[3] = array('index.pl/proxy_list', 'index.pl/proxy_list?order=&offset=100', 'index.pl/proxy_list?order=&offset=100', 'index.pl/proxy_list?order=&offset=200', 'index.pl/proxy_list?order=&offset=300', 'index.pl/proxy_list?order=&offset=400', 'index.pl/proxy_list?order=&offset=500', 'index.pl/proxy_list?order=&offset=600', 'index.pl/proxy_list?order=&offset=700', 'index.pl/proxy_list?order=&offset=800');
$pageArray[4] = array('page1.html', 'page2.html', 'page3.html', 'page4.html', 'page5.html', 'page6.html', 'page7.html', 'page8.html', 'page9.html', 'page10.html');
$pageArray[5] = array('index.html', 'page2.html', 'page3.html', 'page4.html', 'page5.html', 'page6.html', 'page7.html', 'page8.html', 'page9.html', 'page10.html');
$pageArray[6] = array('page1.html', 'page2.html', 'page3.html', 'page4.html', 'page5.html');

// Host name, cell markers, and page list for each site.
$siteArray = array(
    array('url' => 'www.proxy4free.com', 'start' => '<td>', 'end' => '</td>', 'pages' => $pageArray[0]),
    array('url' => 'www.freepublicproxies.com', 'start' => '<td>', 'end' => '</td>', 'pages' => $pageArray[1]),
    array('url' => 'www.findproxy.com', 'start' => '<td>', 'end' => '</td>', 'pages' => $pageArray[2]),
    array('url' => 'www.stayinvisible.com', 'start' => '<td>', 'end' => '</td>', 'pages' => $pageArray[3]),
    array('url' => 'www.allproxies.com', 'start' => '<td>', 'end' => '</td>', 'pages' => $pageArray[4]),
    array('url' => 'www.allproxies.com', 'start' => '<td>', 'end' => '</td>', 'pages' => $pageArray[4]),
    array('url' => 'www.findproxy.com', 'start' => '<td>', 'end' => '</td>', 'pages' => $pageArray[5]),
    array('url' => 'www.publicproxyservers.com', 'start' => '<td>', 'end' => '</td>', 'pages' => $pageArray[6])
);

$totalSites = count($siteArray);
$numProxies = 0;
for ($site = 0; $site < $totalSites; $site++) {
    $totalSitePages = count($siteArray[$site]['pages']);
    $st = $siteArray[$site]['start'];
    $end = $siteArray[$site]['end'];
    $base = $siteArray[$site]['url'];
    // Two adjacent cells: an IPv4 address, then a port of 2 to 5 digits.
    $pattern = '#' . preg_quote($st, '#') . '\s*([0-9]{1,3}\.){3}[0-9]{1,3}\s*' . preg_quote($end, '#')
             . '\s*' . preg_quote($st, '#') . '\s*[0-9]{2,5}\s*' . preg_quote($end, '#') . '#';
    for ($page = 0; $page < $totalSitePages; $page++) {
        $extension = $siteArray[$site]['pages'][$page];
        $fp = fsockopen($base, 80, $errno, $errstr, 30);
        if (!$fp) {
            echo "$errstr ($errno) \n";
            continue; // skip this page rather than scan stale data
        }
        $string = "";
        $urlString = "GET /$extension HTTP/1.0\r\n";
        $urlString .= "Host: $base\r\n\r\n";
        fputs($fp, $urlString);
        while (!feof($fp)) {
            $string .= fgets($fp, 128);
        }
        fclose($fp);
        preg_match_all($pattern, $string, $matches);
        $numMatches = count($matches[0]);
        for ($i = 0; $i < $numMatches; $i++) {
            // Strip the cell markers, then join IP and port with a colon.
            $matches[0][$i] = str_replace($st, " ", $matches[0][$i]);
            $matches[0][$i] = str_replace($end, " ", $matches[0][$i]);
            $matches[0][$i] = trim($matches[0][$i]);
            $matches[0][$i] = preg_replace('/\s+/', ':', $matches[0][$i]);
            echo htmlspecialchars($matches[0][$i]) . "<br>";
            fwrite($proxyFile, $matches[0][$i] . "\r\n");
            $numProxies++;
        }
    }
}
fclose($proxyFile);
echo "The Number of Proxies Documented: $numProxies";
?> |
|||
| By: VGR | Date: 03/10/2003 19:09:00 | Type : Comment |
|
| Well, you use a regexp to get the data between $st and $end [while I prefer the performance of strpos() and substr()], then you str_replace() $st and $end [while I have them stripped automatically by GetChunk()], then you trim() [that's normal], then you have your data [me too]. It's the same; it's just that your condition for getting the data is much simpler than mine: I handle 17 sites with 17 entirely different HTML formats, from which I must extract the same set of data, sometimes by parsing multiple pages multiple times for a given site... So my GetChunk() and GetChunk2() functions are more flexible than your preg_match_all() pattern, especially since I can use them to get elements from lines sequentially (the $i and $zz counters advance automatically, so I don't have to track the current line number and cursor position...). I just pass in the $contents[] array I build (while you build one big string), with $st=$deb and $end=$fin, and voilà. I hope you recognize this :D |
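The cursor idea described here can also be applied to one big string. The following is a minimal sketch under assumptions, not VGR's GetChunk(): `next_chunk()` is a hypothetical function, the sample HTML is made up, and a single `$pos` offset plays the role of the `$i`/`$zz` pair.

```php
<?php
// Hypothetical cursor-based reader: returns the next text found between
// $start and $end at or after $pos and advances $pos past the end marker.
// Returns null when no further pair of markers exists.
function next_chunk(string $text, int &$pos, string $start, string $end): ?string
{
    $from = strpos($text, $start, $pos);
    if ($from === false) {
        return null;
    }
    $from += strlen($start);
    $to = strpos($text, $end, $from);
    if ($to === false) {
        return null;
    }
    $pos = $to + strlen($end); // the caller never tracks the position itself
    return trim(substr($text, $from, $to - $from));
}

// Made-up sample rows, for illustration only.
$html = '<tr><td>123.123.123.123</td><td>80</td></tr>'
      . '<tr><td>10.0.0.7</td><td>8080</td></tr>';
$pos = 0;
$cells = array();
while (($cell = next_chunk($html, $pos, '<td>', '</td>')) !== null) {
    $cells[] = $cell;
}
// In this sample the cells alternate IP, port, so consecutive pairs belong together.
foreach (array_chunk($cells, 2) as $pair) {
    echo implode(':', $pair), "\n";
}
```

The trade-off matches the discussion: no regular expression is involved, but pairing cells by position only works when the page layout is known in advance.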
|||
| By: duerra | Date: 04/10/2003 17:51:00 | Type : Comment |
|
| The problem is that there was more than one "block" on the page with <td> tags. I still had to find and ensure the proxies, and ports. However, I will give you the points regardless. |
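When several `<td>` blocks on a page hold numbers, one way to "ensure" that a candidate really is a proxy entry is to validate it after extraction. This is a sketch under assumptions: `is_plausible_proxy()` is a hypothetical name, and the candidate strings are made up.

```php
<?php
// Hypothetical check: accepts "IP:Port" only when the IP part is a valid
// IPv4 address and the port is a number in the range 1..65535.
function is_plausible_proxy(string $entry): bool
{
    $parts = explode(':', $entry);
    if (count($parts) !== 2) {
        return false;
    }
    list($ip, $port) = $parts;
    if (filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) === false) {
        return false;
    }
    if (preg_match('/^[0-9]{1,5}$/', $port) !== 1) {
        return false;
    }
    $n = (int) $port;
    return $n >= 1 && $n <= 65535;
}

// Made-up candidates, for illustration only.
$candidates = array('123.123.123.123:80', '999.1.1.1:80', '10.0.0.7:99999', '2003.10.03:12');
foreach ($candidates as $c) {
    echo $c, ' => ', is_plausible_proxy($c) ? 'keep' : 'drop', "\n";
}
```

A pattern such as `[0-9]{1,3}` alone accepts octets like 999, so a separate check of this kind catches values that only look like addresses.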
|||
| By: VGR | Date: 04/10/2003 17:59:00 | Type : Comment |
|
| In two hours, you can see my code working on the 17 publishers' sites I parse each day for new comic strips (bandes dessinées) |
|||