How can I emulate a get request exactly like a web browser?

PhpCurl

Php Problem Overview


There are websites that when I open specific ajax request on browser, I get the resulted page. But when I try to load them with curl, I receive an error from the server.

How can I properly emulate a get request to the server that will simulate a browser?

That is what I am doing:

$url="https://new.aol.com/productsweb/subflows/ScreenNameFlow/AjaxSNAction.do?s=username&f=firstname&l=lastname";
ini_set('user_agent', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT
5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)');
$ch = curl_init();

curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
print $result;

Php Solutions


Solution 1 - Php

Are you sure the curl module honors ini_set('user_agent',...)? There is an option CURLOPT_USERAGENT described at <http://docs.php.net/function.curl-setopt>;.
Could there also be a cookie tested by the server? That you can handle by using CURLOPT_COOKIE, CURLOPT_COOKIEFILE and/or CURLOPT_COOKIEJAR.

edit: Since the request uses https there might also be error in verifying the certificate, see CURLOPT_SSL_VERIFYPEER.

$url="https://new.aol.com/productsweb/subflows/ScreenNameFlow/AjaxSNAction.do?s=username&f=firstname&l=lastname";
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';

$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
var_dump($result);

Solution 2 - Php

i'll make an example, first decide what browser you want to emulate, in this case i chose Firefox 60.6.1esr (64-bit), and check what GET request it issues, this can be obtained with a simple netcat server (MacOS bundles netcat, most linux distributions bunles netcat, and Windows users can get netcat from.. Cygwin.org , among other places),

setting up the netcat server to listen on port 9999: nc -l 9999

now hitting http://127.0.0.1:9999 in firefox, i get:

$ nc -l 9999
GET / HTTP/1.1
Host: 127.0.0.1:9999
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Upgrade-Insecure-Requests: 1

now let us compare that with this simple script:

<?php
$ch=curl_init("http://127.0.0.1:9999");
curl_exec($ch);

i get:

$ nc -l 9999
GET / HTTP/1.1
Host: 127.0.0.1:9999
Accept: */*

there are several missing headers here, they can all be added with the CURLOPT_HTTPHEADER option of curl_setopt, but the User-Agent specifically should be set with CURLOPT_USERAGENT instead (it will be persistent across multiple calls to curl_exec() and if you use CURLOPT_FOLLOWLOCATION then it will persist across http redirections as well), and the Accept-Encoding header should be set with CURLOPT_ENCODING instead (if they're set with CURLOPT_ENCODING then curl will automatically decompress the response if the server choose to compress it, but if you set it via CURLOPT_HTTPHEADER then you must manually detect and decompress the content yourself, which is a pain in the ass and completely unnecessary, generally speaking) so adding those we get:

<?php
$ch=curl_init("http://127.0.0.1:9999");
curl_setopt_array($ch,array(
        CURLOPT_USERAGENT=>'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0',
        CURLOPT_ENCODING=>'gzip, deflate',
        CURLOPT_HTTPHEADER=>array(
                'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                'Accept-Language: en-US,en;q=0.5',
                'Connection: keep-alive',
                'Upgrade-Insecure-Requests: 1',
        ),
));
curl_exec($ch);

now running that code, our netcat server gets:

$ nc -l 9999
GET / HTTP/1.1
Host: 127.0.0.1:9999
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0
Accept-Encoding: gzip, deflate
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Connection: keep-alive
Upgrade-Insecure-Requests: 1

and voila! our php-emulated browser GET request should now be indistinguishable from the real firefox GET request :)

this next part is just nitpicking, but if you look very closely, you'll see that the headers are stacked in the wrong order, firefox put the Accept-Encoding header in line 6, and our emulated GET request puts it in line 3.. to fix this, we can manually put the Accept-Encoding header in the right line,

<?php
$ch=curl_init("http://127.0.0.1:9999");
curl_setopt_array($ch,array(
        CURLOPT_USERAGENT=>'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0',
        CURLOPT_ENCODING=>'gzip, deflate',
        CURLOPT_HTTPHEADER=>array(
                'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                'Accept-Language: en-US,en;q=0.5',
                'Accept-Encoding: gzip, deflate',
                'Connection: keep-alive',
                'Upgrade-Insecure-Requests: 1',
        ),
));
curl_exec($ch);

running that, our netcat server gets:

$ nc -l 9999
GET / HTTP/1.1
Host: 127.0.0.1:9999
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Upgrade-Insecure-Requests: 1

problem solved, now the headers is even in the correct order, and the request seems to be COMPLETELY INDISTINGUISHABLE from the real firefox request :) (i don't actually recommend this last step, it's a maintenance burden to keep CURLOPT_ENCODING in sync with the custom Accept-Encoding header, and i've never experienced a situation where the order of the headers are significant)

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionufkView Question on Stackoverflow
Solution 1 - PhpVolkerKView Answer on Stackoverflow
Solution 2 - PhphanshenrikView Answer on Stackoverflow