Old 01-04-2007, 03:15 PM   #1
JLHeidecker
Looking for new tools
 
Join Date: May 2006
Posts: 47
[RESOLVED] Difficult Curl, Javascript question

Imagine an image gallery, whose Page 1 has a URL like:
http://www.myserver.com/gallery.php

The sub pages also share the same URL. Pages 2 and 3 are accessed via Javascript links on the page that load new data without refreshing the entire page. (think AJAX)

Easy: curl into gallery.php and take whatever I want.

Hard: curl into gallery.php, redirect to Javascript links to Pages 2 and 3 and get data off those pages.


Anyone know how to "follow" the Javascript links with curl and get the data off Pages 2 and 3?

thanks!
Jason
Los Angeles, CA
Old 01-04-2007, 03:39 PM   #2
etully
Juñor Curmudgeon
 
Join Date: Apr 2006
Location: Connecticut
Posts: 1,250
It's really a hard problem because you have to essentially write a web browser.

If you want a quick hack, you could find the pattern of what new pages the JS calls and make curl call those pages manually - or you could even find a pattern of what the images are called and call those images through curl.

But if you are looking for a REAL, completely functional Javascript interpreter, you are going to need to build your own web browser.

Think about it this way: What if the Ajax does something really bizarre like count the number of seconds that you've been on the page and then it loads content based on how long you've been reading. 1-10 seconds, it pulls in content for version #2, 11-20 seconds and it pulls content for version #3, etc. Now imagine that it checks to see what images you moused over. If you moused over images 1,3,5,7, then it loads content for version #4, and if you clicked any of the radio buttons, then it pulls content for version #5.

So a quick hack is to assume that the Ajax code is ultra simple - essentially load the next page. The real solution (which you may not need) is to include a Javascript interpreter into your new PHP Curl web browser script which can follow the Ajax calls exactly as they would on a real site.

If you can examine the Ajax (Javascript) code that calls the next page, then you don't really have to "follow" the links, you can just calculate what they should be. Truly, the Ajax could be assembling the URL to the next page from wacky code like: URL = "h" + "t" +"tp://www" + ".mysite" + ".com/page" + "2.php" in which case, you can't simply "read" the contents of where the Javascript will bring you next... you have to let the Javascript execute so you can let it build the URL for you. And for that, you'd need to have a Javascript interpreter.

The hack is easier assuming that there is a pattern to what pages get called next.

If you are really going to pursue the complicated solution, I'd start by obtaining the source code for Firefox since it already has the "curl" and the Javascript parts in place - but that's what I would call a hard project.
Old 01-04-2007, 03:54 PM   #3
JLHeidecker
Looking for new tools
 
Join Date: May 2006
Posts: 47
etully,

I see you understand my dilemma. Let me clarify a bit further.

Yes, I can curl to gallery.php, and retrieve what the Javascript links to the "next" pages are. That is very structured and consistent. No problems there.

Unfortunately, the Javascript links look like
"javascript:__doLInk('xyz$Main$ImageListings','2')"
or
"javascript:__doLInk('xyz$Main$ImageListings','3')"

So those Javascript functions do not call a new HTML/php page. It is like an Ajax call, and simply replaces the gallery images on the page with new ones.

I have no way of predicting what the image URLs on pages 2 and 3 are. I am stuck with figuring out a way to "follow" the Javascript function calls to the next page, and then preg_match'ing my image srcs.
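
One way to get from those links to something curl can work with is to pull the two quoted arguments out of each href. This is only a minimal sketch, assuming every link has the two-argument shape quoted above; `parseJsLinkArgs` is a hypothetical helper name, not something taken from the site.

```php
<?php
// Hypothetical helper: extract the two single-quoted arguments from an href
// of the form javascript:someFunction('target','argument').
// Returns null when the href does not have that shape.
function parseJsLinkArgs(string $href): ?array
{
    $pattern = "/^javascript:\s*\w+\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)/";
    if (preg_match($pattern, $href, $m) !== 1) {
        return null;
    }
    return ['target' => $m[1], 'argument' => $m[2]];
}

// The "$" signs are escaped because this PHP string is double-quoted.
$args = parseJsLinkArgs("javascript:__doLInk('xyz\$Main\$ImageListings','2')");
```

The extracted pair is not a URL by itself; what the page does with it depends on the Javascript function behind the link.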

Does this make things more clear to you? Does this change the way you see the problem?

thanks,
Jason
Old 01-04-2007, 04:32 PM   #4
etully
Juñor Curmudgeon
 
Join Date: Apr 2006
Location: Connecticut
Posts: 1,250
If it's just replacing an image with a new image (like the age old mouse overs)... then yes, it needs to go out to the Internet to retrieve the new image... but that doesn't make it Ajax. When you change the property of an image so that it has a new source, you are forcing the browser to load a new image, but it's not necessarily using Ajax to retrieve that image.

If it's Ajax, then you should see a part of the code that calls a real live URL. If it's simply changing the source of the image, then it's 1996 technology.

So I guess I would need to understand what the function __doLInk looks like. XMLHttpRequest? Or .src =

I suspect that you could copy the javascript on their page to a page on your own web site and put some alert commands in it to see what URL it's calling (if Ajax) or what image it's calling (if not Ajax) and that would get you closer to understanding how the JS functions are finding the new images which gets you closer to building a curl app that can predict what images to call.

As you can tell, I'm grasping at straws here (and maybe not making complete sense) because I can't see the remote site.

But the root of your problem is that unless you embed a JS interpreter into your PHP based Curl script, you can't execute the JS to truly follow the links... so you'll have to figure out what the JS does and simulate that to predict the next "thing" to call whether it's an image or a URL in Ajax.
Old 01-04-2007, 04:50 PM   #5
JLHeidecker
Looking for new tools
 
Join Date: May 2006
Posts: 47
thank you etully!

Believe it or not, I think we're making progress.

Here's some of the javascript code:

<script type="text/javascript">
<!--
var theForm = document.forms['aspnetForm'];
if (!theForm) {
    theForm = document.aspnetForm;
}
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}
// -->
</script>

Unfortunately, my knowledge of javascript is somewhat limited (read maybe a book and a half on the subject).

But it's clear the page submits a form.
Old 01-04-2007, 04:57 PM   #6
JLHeidecker
Looking for new tools
 
Join Date: May 2006
Posts: 47
Looks like it's an asp script:

More code:
<form name="aspnetForm" method="post" action="script.aspx?%3ffuseaction=gallery" id="aspnetForm">
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />


So the javascript links perform a submit on this form, and supply the eventtarget and eventargument post variables.

I think I can do this!! Just use curl to send those arguments via post to the script identified in the action parameter of the form tag...
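
A rough sketch of that idea follows. It is not the code that was actually used: `buildPostbackBody` and `postForm` are hypothetical names, the target and argument come from the links quoted earlier in the thread, and the host in the commented-out call is a placeholder. Pages like this may also expect their other hidden form fields to be posted back, which is what `$extraFields` is for.

```php
<?php
// Build the form-encoded body that the Javascript submit would send.
// $extraFields can carry any other hidden inputs found in the form.
function buildPostbackBody(string $eventTarget, string $eventArgument, array $extraFields = []): string
{
    $fields = array_merge($extraFields, [
        '__EVENTTARGET'   => $eventTarget,
        '__EVENTARGUMENT' => $eventArgument,
    ]);
    // http_build_query() URL-encodes each key and value separately.
    return http_build_query($fields, '', '&');
}

// POST the body to the form's action URL and return the response HTML.
function postForm(string $actionUrl, string $body): string
{
    $ch = curl_init($actionUrl);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    return $html;
}

$body = buildPostbackBody('xyz$Main$ImageListings', '2');
// Placeholder URL; uncomment and adjust to the real form action:
// $html = postForm('http://www.myserver.com/script.aspx?%3ffuseaction=gallery', $body);
```

The returned HTML can then be searched for image src attributes in the same way as the first page.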
Old 01-04-2007, 07:40 PM   #7
JLHeidecker
Looking for new tools
 
Join Date: May 2006
Posts: 47
UPDATE

It worked! What I suggested in the above post is exactly what it took.

thanks for the helpful brainstorming etully!
Old 10-25-2007, 11:41 PM   #8
claudiotaunay
Junior Member
 
Join Date: Oct 2007
Posts: 2
Hi, I'm having the same problem. Can you help me?
Take a look at this web page:
http://www.zap.com.br/imoveis/result...oBusca=Simples

This page has pages 1, 2, 3 and 4.

code:
#######################################################
<a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$grdResultadoBusca','Page$1')">1</a></td><td><span>2</span></td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$grdResultadoBusca','Page$3')">3</a></td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$grdResultadoBusca','Page$4')">4</a>
#######################################################

Next, look at the code of javascript:__doPostBack:

#######################################################
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}
#######################################################

All right, I need to send the __VIEWSTATE via POST too:
#######################################################
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />

<input type="hidden" name="__LASTFOCUS" id="__LASTFOCUS" value="" />
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUJMzY4NDQzMjQxDxYQHgZhRGFkb3MWAB4IUXRkTGluaGEIAADOQh4Vc1RpdHVsb1Jlc3VsdGFkb0J1c2NhBXNGb3JhbSBlbmNvbnRyYWRvcyAxMDMgcmVzdWx0YWRvcyBjb20gb3Mgc2VndWludGVzIGNyaXTDqXJpb3M6IDxicj5BcGFydGFtZW50bywgVG9kb3MsIFJJTyBERSBKQU5FSVJPL1JJTyBERSBKQU5FSVJPHg1DZWx1bGFzVmF6aWFzBQVGYWxzZR4QUHJpbWVpcm9EZXN0YXF1ZQUEVHJ1ZR4JQ2FiZWNhbGhvBQRUcnVlHgppUGFnZUluZGV4AgEeDlByaW1laXJhT2ZlcnRhBQRUcnVlFgJmD2QWBmYPZBYIAgEPZBYCZg8WAh4EVGV4dGVkAgIPFgIfCAUkPG1ldGEgbmFtZT0iZGVzY3JpcHRpb24iIGNvbnRlbnQ9IiI+ZAIDDxYCHwgFITxtZXRhIG5hbWU9ImtleXdvcmRzIiBjb250ZW50PSIiPmQCIg8WAh8IBTU8bGluayByZWw9InN0eWxlc2hlZXQiIGhyZWY9Ii9jc3MvemFwX25vZnJhbWUuY3NzIiAvPmQCAQ8WAh8IBRFUb3AsUmlnaHQseDEwLHgwOWQCAg9kFg4CAg8WAh8IBTM8U0NSSVBUIExBTkdVQUdFPUphdmFTY3JpcHQ+T0FTX0FEKCdUb3AnKTs8L1NDUklQVD5kAgMPZBYIZg8WAh4HVmlzaWJsZWhkAgEPZBYEZg8PFgQfCAUNU0FPIFBBVUxPLCBTUB4HVG9vbFRpcAUNU0FPIFBBVUxPLCBTUGRkAgEPZBYCAgEPD2QWAh4Jb25rZXlkb3duBbMBaWYoZXZlbnQud2hpY2ggfHwgZXZlbnQua2V5Q29kZSl7aWYgKChldmVudC53aGljaCA9PSAxMykgfHwgKGV2ZW50LmtleUNvZGUgPT0gMTMpKSB7ZG9jdW1lbnQuZ2V0RWxlbWVudEJ5SWQoJ2N0bDAwJEJhcnJhX2xvZ2luJGJ0bk9LJykuY2xpY2soKTtyZXR1cm4gZmFsc2U7fX0gZWxzZSB7cmV0dXJuIHRydWV9OyBkAgMPZBYCZg8PZBYCHwsFvQFpZihldmVudC53aGljaCB8fCBldmVudC5rZXlDb2RlKXtpZiAoKGV2ZW50LndoaWNoID09IDEzKSB8fCAoZXZlbnQua2V5Q29kZSA9PSAxMykpIHtkb2N1bWVudC5nZXRFbGVtZW50QnlJZCgnY3RsMDAkQmFycmFfbG9naW4kYnRuT0tNaW5oYVNlbmhhJykuY2xpY2soKTtyZXR1cm4gZmFsc2U7fX0gZWxzZSB7cmV0dXJuIHRydWV9OyBkAgYPFgIfCWhkAgQPFgIfCAUzPFNDUklQVCBMQU5HVUFHRT1KYXZhU2NyaXB0Pk9BU19BRCgneDA5Jyk7PC9TQ1JJUFQ+ZAIFD2QWGGYPFgIfCAVUPGlucHV0IHR5cGU9J3RleHQnIGlkPSdoaWRDaGVja2VkJyBuYW1lPSdoaWRDaGVja2VkJyBzdHlsZT0nZGlzcGxheTpub25lOycgdmFsdWU9Jyc+ZAIBDxYCHglpbm5lcmh0bWwF8QE8YSBocmVmPScvZGVmYXVsdC5hc3B4Jz5Ib21lPC9hPiZuYnNwOyZndDsmbmJzcDs8YSBocmVmPScvaW1vdmVpcy9idXNjYS1kZS1pbW92ZWlzLXNpbXBsZXMuYXNweCc+SW0mb2FjdXRlO3ZlaXM8L2E+Jm5ic3A7Jmd0OyZuYnNwOzxhIGhyZWY9Jy9pbW92ZWlzL2J1c2NhLWRlLWltb3ZlaXMtc2ltcGxlcy5hc3B4Jz5CdXNjYSBwb3IgdG9kb3M8L2E+Jm5ic3A7Jmd0OyZuYnNwOzxi
PlJlc3VsdGFkb3MgZGEgYnVzY2E8L2I+ZAIDDxYEHgRocmVmBQEjHgdvbmNsaWNrBR5qYXZhc2NyaXB0OmFicmVNb2RhbCgnbG9naW4nKTtkAggPFgIfDAUGQmFpcnJvZAIXDxBkEBUJDk9yZGVuYXIgcG9yLi4uCkFudW5jaWFudGUFw4FyZWEERGF0YRFEaXN0cml0byAvIEJhaXJybwdRdWFydG9zB1N1w610ZXMFVmFnYXMFVmFsb3IVCQASb3JkZW0sTm9tZUZhbnRhc2lhCm9yZGVtLEFyZWEPb3JkZW0sRGF0YU9yZGVtDm9yZGVtLGRpc3RyaXRvFG9yZGVtLFF0ZERvcm1pdG9yaW9zD29yZGVtLFF0ZFN1aXRlcw5vcmRlbSxRdGRWYWdhcxBvcmRlbSxQcmVjb09yZGVtFCsDCWdnZ2dnZ2dnZxYBZmQCGA8WAh8JaBYCZg8WAh8JaGQCGQ8WAh8JaGQCGg8WAh8JaBYCZg8WAh8JaGQCGw8WAh8JaGQCHA8WAh8IZWQCHQ9kFgJmDxYEHgNzcmMFKi9pbWFnZW0vaW1vdmVpcy9yZXN1bHRhZG9fZGVzdGFxdWVfdGl0LmdpZh4DYWx0BSlJbSZvYWN1dGU7dmVpcyBOb3ZvcyAtIFByb250b3MgcGFyYSBNb3JhcmQCIg8WAh8JaGQCBg8WBB8IZR8JaGQCBw8WAh8JaBYKAgcPDxYCHghSZWFkT25seWhkZAIJDw8WAh8RaGRkAgsPEGRkFgFmZAINDxYCHgpvbmtleXByZXNzBTBqYXZhc2NyaXB0OnJldHVybiBNYXhMZW5ndGhUZXh0QXJlYSh0aGlzLCA0MDAwKTtkAg8PD2QWAh8OBSJqYXZhc2NyaXB0OnJldHVybiBWYWxpZGFaYXBFcnJvKCk7ZAIIDxYCHwgFNTxTQ1JJUFQgTEFOR1VBR0U9SmF2YVNjcmlwdD5PQVNfQUQoJ1JpZ2h0Jyk7PC9TQ1JJUFQ+ZBgEBR5fX0NvbnRyb2xzUmVxdWlyZVBvc3RCYWNrS2V5X18WCAUXY3RsMDAkQmFycmFfbG9naW4kYnRuT0sFIWN0bDAwJEJhcnJhX2xvZ2luJGJ0bk9LTWluaGFTZW5oYQUfY3RsMDAkQmFycmFfbG9naW4kTG5rTG9naW5Nb2RhbAUeY3RsMDAkQmFycmFfbG9naW4kY2hrVGVybW9zVXNvBSNjdGwwMCRCYXJyYV9sb2dpbiRjaGtSZWNlYmVyT2ZlcnRhcwUdY3RsMDAkQmFycmFfbG9naW4kYnRuQ29uZmlybWEFGmN0bDAwJEJhcnJhX2xvZ2luJEJ0bkxvZ2luBTBjdGwwMCRDb250ZW50UGxhY2VIb2xkZXIxJGNoa0RpdkNhcmFjdGVyaXN0aWNhJDAFK2N0bDAwJENvbnRlbnRQbGFjZUhvbGRlcjEkZ3JkUmVzdWx0YWRvQnVzY2EPPCsACAECAgFkBSFjdGwwMCRDb250ZW50UGxhY2VIb2xkZXIxJGdyZE5vdm8PZ2QFKGN0bDAwJENvbnRlbnRQbGFjZUhvbGRlcjEkZ3JkTGFuY2FtZW50b3MPZ2ShqbHM/VXXWUz5y7/voE3vAAAAAA==" />
#######################################################

The problem:
When I look at the headers in a Firefox browser plug-in, the __VIEWSTATE has a different value.
I think this is why I can't access the next page.
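
A likely explanation is that the server issues the __VIEWSTATE value with each response, so a value copied from one page load will not match another. One hedged way around that is to read every hidden field from the page just fetched and post those same values back. The sketch below assumes double-quoted attributes, as in the markup shown above; `extractHiddenFields` is a hypothetical helper and the sample HTML is made up for illustration.

```php
<?php
// Hypothetical helper: collect name => value for every hidden <input>.
// Assumes attributes are double-quoted, as in the form markup quoted above.
function extractHiddenFields(string $html): array
{
    $fields = [];
    preg_match_all('/<input\b[^>]*>/i', $html, $tags);
    foreach ($tags[0] as $tag) {
        if (!preg_match('/\btype\s*=\s*"hidden"/i', $tag)) {
            continue; // not a hidden field
        }
        if (!preg_match('/\bname\s*=\s*"([^"]*)"/i', $tag, $name)) {
            continue; // unnamed inputs are never submitted
        }
        $value = preg_match('/\bvalue\s*=\s*"([^"]*)"/i', $tag, $v) ? $v[1] : '';
        $fields[$name[1]] = html_entity_decode($value, ENT_QUOTES);
    }
    return $fields;
}

// Made-up sample standing in for the page returned by the first request.
$html = '<form>'
      . '<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />'
      . '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDw+abc==" />'
      . '<input type="text" name="q" value="ignored" />'
      . '</form>';

$fields = extractHiddenFields($html);
$fields['__EVENTTARGET']   = 'ctl00$ContentPlaceHolder1$grdResultadoBusca';
$fields['__EVENTARGUMENT'] = 'Page$3';
$body = http_build_query($fields, '', '&'); // encodes each value separately
```

Because the hidden values are taken from the response itself, each new page request should reuse the fields from the page that was fetched immediately before it.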

$postfields = "$IDTransacao%3D3%26amp%3BTransacao%3DComprar%2Bum%2Bim%25u00f3vel%26amp%3BIDUF%3D19%26amp%3BUF%3DRIO%2BDE%2BJANEIRO%26amp%3BIDLocalidade%3D63118%26amp%3BLocalidade%3DRIO%2BDE%2BJANEIRO%26amp%3BIDTipo%3D1%26amp%3BTipo%3DApartamento%26amp%3BZonaGrupo%3D9%26amp%3BIDDistrito%3D0%26amp%3BDistrito%3DTodos%26amp%3BTipoBusca%3DSimples%26__EVENTTARGET%3Dctl00%24ContentPlaceHolder1%24grdResultadoBusca%26__EVENTARGUMENT%3DPage%241%26__VIEWSTATE%3D";

###############################################################################

Thanks for any help.
Old 10-25-2007, 11:42 PM   #9
claudiotaunay
Junior Member
 
Join Date: Oct 2007
Posts: 2
My code:
#####################################################################
acessaPaginaClassificadosOGLOBO('http://www.zap.com.br/imoveis/rio-de-janeiro/venda/centro/apartamento/rio-de-janeiro-venda-centro-apartamento.html');

function acessaPaginaClassificadosOGLOBO($pagina_alvo){
#
# Fetch the target page
#

$sessao_curl = curl_init($pagina_alvo);

//
curl_setopt($sessao_curl, CURLOPT_HEADER, 1);
curl_setopt($sessao_curl, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
curl_setopt($sessao_curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($sessao_curl, CURLOPT_RETURNTRANSFER, 1);
$subject = curl_exec($sessao_curl);
curl_close($sessao_curl);


#acessaPaginaAlvoIndividualClassificadosOGLOBO($subject);
#/*
#
# Capture the variables for the page-change POST
#

# Page count calculation
// Find the total number of results
$match = '/(?<=encontrados )\d*(?= resultados)/';
preg_match_all($match, $subject, $result, PREG_PATTERN_ORDER);
$num_resultado = $result[0][0];

// Each page holds 30 results, so total pages = results / 30
// ceil() rounds fractions up
$num_total_paginas = ceil(($num_resultado / 30));


################################################################################
# Visit each page of the search results
################################################################################
if ($num_total_paginas){
for ($i=1; $i <= $num_total_paginas; $i++){

# Build the POST body for each page
// Find the default post action (the query string of the form's action URL)

$match = '%(?<=<form name="aspnetForm" method="post" action="/imoveis/resultado-busca-imoveis\.aspx\?).*(?=" )%';

preg_match_all($match, $subject, $result, PREG_PATTERN_ORDER);
$postfields = $result[0][0];
$EVENTTARGET = 'ctl00$ContentPlaceHolder1$grdResultadoBusca';
$postfields .= "&__EVENTTARGET=" . urlencode($EVENTTARGET);
$EVENTARGUMENT = 'Page$' . $i;
$postfields .= "&__EVENTARGUMENT=" . urlencode($EVENTARGUMENT);
//$postfields .= "&__LASTFOCUS=";

// Find the __VIEWSTATE value
$match = '/(?<=<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value=").*(?=")/';
preg_match_all($match, $subject, $result, PREG_PATTERN_ORDER);
$VIEWSTATE = $result[0][0];
$postfields .= "&__VIEWSTATE=" . urlencode($VIEWSTATE);

// Each value is encoded on its own above; calling urlencode() on the whole
// string would also escape the "&" and "=" separators.

echo $postfields . "<br/>";
/*

$url_pagina_alvo_individual = "http://www.zap.com.br/imoveis/resultado-busca-imoveis.aspx";
$reffer = 'http://www.zap.com.br/imoveis/resultado-busca-imoveis.aspx?IDTransacao=3&Transacao=Comprar+um+im%u00f3vel&IDUF=19&UF=RIO+DE+JANEIRO&IDLocalidade=63118&Localidade=RIO+DE+JANEIRO&IDTipo=1&Tipo=Apartamento&ZonaGrupo=9&IDDistrito=0&Distrito=Todos&TipoBusca=Simples';

// INIT CURL
$ch = curl_init();
//curl_setopt($ch, CURLOPT_HEADER, 1);
//curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);

// SET URL FOR THE POST
curl_setopt($ch, CURLOPT_URL, $url_pagina_alvo_individual);

// ENABLE HTTP POST
curl_setopt($ch, CURLOPT_POST, 1);


# Setting CURLOPT_RETURNTRANSFER variable to 1 will force cURL
# not to print out the results of its query.
# Instead, it will return the results as a string return value
# from curl_exec() instead of the usual true/false.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_REFERER, $reffer);

// SET POST PARAMETERS : FORM VALUES FOR EACH FIELD
curl_setopt($ch, CURLOPT_POSTFIELDS, $postfields);


// EXECUTE THE REQUEST, THEN CHECK FOR ERRORS

$pagina_alvo_individual = curl_exec($ch);
if (curl_errno($ch)) {
echo "cURL error: " . curl_error($ch);
}
curl_close($ch);

echo $pagina_alvo_individual;
#acessaPaginaAlvoClassificadosOGLOBO($pagina_alvo_individual);

*/

} # end of for
} # end of if
} # end of function


###########################

Thanks for the help.
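
When a POST body is assembled by hand, it matters what gets passed to urlencode(). Encoding the finished string also escapes the "&" and "=" separators, so the receiving side sees one long field name instead of separate fields. The comparison below is only an illustration with the event values from this thread; it uses PHP's own parse_str() to stand in for the server, which is an assumption about how the real server parses the body.

```php
<?php
$fields = [
    '__EVENTTARGET'   => 'ctl00$ContentPlaceHolder1$grdResultadoBusca',
    '__EVENTARGUMENT' => 'Page$2',
];

// Hand-built body, then urlencode() applied to the whole string.
$raw   = '__EVENTTARGET=' . $fields['__EVENTTARGET']
       . '&__EVENTARGUMENT=' . $fields['__EVENTARGUMENT'];
$wrong = urlencode($raw);

// Each key and value encoded separately, separators left intact.
$right = http_build_query($fields, '', '&');

// Parse both bodies back the way a form handler would.
parse_str($wrong, $parsedWrong);
parse_str($right, $parsedRight);

echo count($parsedWrong), " field(s) from the fully encoded body\n";
echo count($parsedRight), " field(s) from the per-value encoded body\n";
```

Only the second body round-trips to the original two fields, which is why the values are best encoded one at a time.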
Old 10-26-2007, 02:14 AM   #10
troybtj
correct 67% of the time..
 
Join Date: Sep 2007
Location: Asleep at the wheel
Posts: 127
Could you put your Extremely LONG URLs inside a code tag, and your code inside PHP tags?

I don't have a 50" wide monitor.