[penguicon-general] anyone out there know....??

Rick Scott rick at shadowspar.dyndns.org
Mon Sep 8 12:26:42 EDT 2008


(Lady Sarah:)
> What he's saying is that he should be exempt from our rule of "no
> more than 250 results per set of criteria entered" because each
> new page is a new query to the database, and therefore a new
> search, and the data can NOT be scraped this way.
>
> Is he telling me the truth? Or has there just not been a hacker
> clever enough to pull the data from their site yet?

Is he saying that the data can't be scraped because it's paginated
and skipping to the next page requires JavaScript?  I can't say for
sure without looking at it, but if it's like most such sites, I could
work around it in a day.  Someone who knows what they are doing could
probably do it in an hour.

Most web-bots and other automatic page-fetching tools don't
implement JavaScript, so a site that requires it to get results out
is more difficult to scrape.  Usually the javascripty bits can be
worked around with a bit of cleverness: in the end the script just
sends an ordinary HTTP request for the next page, and a bot can send
that same request itself.  Alternatively, you can just use a tool
like Selenium RC, which lets you write an automated script that
drives a real web browser.
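
To make the "each new page is a new query" point concrete, here is a
minimal Python sketch of walking a paginated search one request at a
time.  Everything specific in it is an assumption for illustration:
the example.invalid URL, the "q" and "page" parameter names, and the
idea that an empty response marks the last page.  I haven't looked at
the site in question, so a real one would need its own details.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def page_url(base_url, query, page):
    """Build the URL for one page of results (parameter names are hypothetical)."""
    return base_url + "?" + urlencode({"q": query, "page": page})


def fetch(url):
    """Fetch a URL and return its body as text."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def fetch_all_pages(base_url, query, fetch_page=fetch, max_pages=50):
    """Request page 1, 2, 3, ... until an empty page comes back.

    Each page is its own request, which is exactly why pagination
    alone does not stop an automated client.
    """
    pages = []
    for page in range(1, max_pages + 1):
        body = fetch_page(page_url(base_url, query, page))
        if not body.strip():  # assumption: an empty body means no more results
            break
        pages.append(body)
    return pages
```

Calling fetch_all_pages("http://example.invalid/search", "smith")
would issue one plain HTTP request per page, with no browser or
JavaScript involved.  The fetch_page argument is only there so the
loop can be tried out against canned data instead of a live site.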

I'm not saying that the 250-hit limit per search is a great solution
either, but it probably makes it more difficult to scrape out your
entire database than whatever JavaScript this guy has implemented.
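
For comparison, a limit enforced on the server applies no matter how
the pages are requested.  The sketch below is a guess at what "no
more than 250 results per set of criteria" could look like in code,
not a description of anyone's actual system; the function name and
the page size of 25 are made up for the example.

```python
MAX_RESULTS = 250  # cap per set of criteria, per the rule quoted above


def page_of_results(matches, page, page_size=25, cap=MAX_RESULTS):
    """Return one page of matches, never exposing rows beyond the cap."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    start = (page - 1) * page_size
    end = min(start + page_size, cap)
    # Once start reaches the cap the slice is empty, so asking for
    # later pages yields nothing, however the request was made.
    return matches[start:end]
```

With a cap like this, paging through the results one query at a time
still stops at 250 rows for a given set of criteria.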




Cheers,
Rick
-- 
key CF8F8A75 / print C5C1 F87D 5056 D2C0 D5CE  D58F 970F 04D1 CF8F 8A75 
Try not!  Do, or do not.  There is no "try".
     :Yoda

