Session Variables
#1
Hi guys, I posted a request on GitHub, not realising you had a forum.

I've been using your Ultimate Scraper Toolkit. Nice code.

I've been trying to scrape the UK .gov website for DVLA information.

I've managed to scrape it with cURL, but your class is a lot more efficient, easier to traverse, and keeps code updates minimal.

Anyway, I've managed to send all the information by emulating the forms.

The problem I have is that the cookies seem to be 1 hour behind the current time: they expire 45 minutes before the current time, whereas the government site has a timeout of 15 minutes. That would make sense if the clock is an hour behind, since an hour minus the 15-minute timeout gives 45 minutes.

When I submit the form to get to the page with the information, it instantly says Session Expired.

Since the web browser emulator runs on my server, I was wondering whether it is creating the cookies/session using my server time without daylight saving applied, even though the server time is correct?

Or is it a problem with the session not being registered?

Are there any settings that allow session variables to be saved between the form submission page and the information display page?

Am I receiving the wrong date and time because the script does not have the correct folder permissions to write a cookie/session file?

Here is the raw debug output for the first page:
Code:
------- RAW SEND START -------
GET /expired HTTP/1.1
Host: www.viewdrivingrecord.service.gov.uk
Connection: close
Accept: text/html, application/xhtml+xml, */*
Accept-Language: en-us,en;q=0.5
Cache-Control: max-age=0
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0
Referer: https://www.viewdrivingrecord.service.gov.uk/driving-record/viewbydln
Cookie: PLAY_SESSION=%22cf36084d2d9b86334c3b73f77ec0e557af7e2065-5a709799be809fa47b22a601a8ce2ce9%3Dbcacf9f2514159a09f0294840e3b0dbd%266afa8be35b0579af85b9bf1c40aa1010%3Dae9fbb95553ddd3fa07f0d7fc4f1bbef%26704339978aae7d8b14e7cf85aeccc851%3Dc6301e505440d88349aa8b759bc3bae5%267ab7ba5d88d756e5d59dee4a70690167%3D878ae68b59a4888d228df348b914bd20b61d7079ba2bdc45489b79ec96a986b2%26205429c282335f82025c1e6e504f6039%3Dce171f98ea0c876caac2e3329e8c7ded%267d53e9c2d13e10a46cb71d7f345254f1%3D11e864af6a3f99e1a85724f8153f494f%22; PLAY_FLASH=%22searchTimestamp%3D1506523595795%22
------- RAW SEND END -------
------- RAW RECEIVE START -------
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Date: Wed, 27 Sep 2017 14:46:37 GMT
Server: nginx
Set-Cookie: PLAY_SESSION=; Expires=Tue, 26 Sep 2017 14:46:37 GMT; Path=/; Domain=www.viewdrivingrecord.service.gov.uk; HTTPOnly
Content-Length: 8988
Connection: Close

And here is the raw debug output for the second page:

Code:
------- RAW SEND START -------
GET /driving-record/validate/expired HTTP/1.1
Host: www.viewdrivingrecord.service.gov.uk
Connection: close
Accept: text/html, application/xhtml+xml, */*
Accept-Language: en-us,en;q=0.5
Cache-Control: max-age=0
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0
Referer: https://www.viewdrivingrecord.service.gov.uk/driving-record/validate/summary
Cookie: PLAY_SESSION=%225832560f6d88bcfb52ee14b0c7a6f93ad46cc600-c90b2f6b736987f36757cd827f7cabea89945a8cc09fca787565ac0a98fa328b%3D066ae9e0b2f789ca745dc23746c61f5e%26704339978aae7d8b14e7cf85aeccc851%3Dc6301e505440d88349aa8b759bc3bae5%26aef3328f34245f2521204701db15c296%3Dc1aa2f9db5d74b3925894b6b9b45e13f%26114415c68ac712f33918dc39c734a38b%3D878ae68b59a4888d228df348b914bd20b61d7079ba2bdc45489b79ec96a986b2%2685b9547f6f3c104bd2e733cbc55d74d1356abbb4676ee1c240f0b8de77b930c1%3D1fbebaf63b3693e3ae7d1560446f598a%22
------- RAW SEND END -------
------- RAW RECEIVE START -------
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Date: Wed, 27 Sep 2017 14:48:02 GMT
Server: nginx
Set-Cookie: PLAY_SESSION=; Expires=Tue, 26 Sep 2017 14:48:02 GMT; Path=/; Domain=www.viewdrivingrecord.service.gov.uk; HTTPOnly
Content-Length: 9035
Connection: Close
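
One way to read the two dumps is to compare each response's Date header with the Expires attribute on its Set-Cookie line. The sketch below does that with Python's standard library rather than the toolkit itself, and the function name is only illustrative. In both dumps the server returns an empty PLAY_SESSION value whose expiry is exactly 24 hours before its own Date header, not 45 minutes. That pattern usually means the server is asking the client to delete the cookie, although this is an inference from the headers alone.

```python
from email.utils import parsedate_to_datetime

# (Date header, Expires attribute) pairs copied from the two raw dumps.
responses = [
    ("Wed, 27 Sep 2017 14:46:37 GMT", "Tue, 26 Sep 2017 14:46:37 GMT"),
    ("Wed, 27 Sep 2017 14:48:02 GMT", "Tue, 26 Sep 2017 14:48:02 GMT"),
]

def expiry_offset_seconds(date_header, expires_attr):
    """Seconds from the response Date to the cookie Expires (negative = already expired)."""
    sent = parsedate_to_datetime(date_header)
    expires = parsedate_to_datetime(expires_attr)
    return (expires - sent).total_seconds()

offsets = [expiry_offset_seconds(d, e) for d, e in responses]
for offset in offsets:
    print(f"Expires is {offset / 3600:+.0f} hours relative to Date")
```

Because both timestamps come from the same response, the result does not depend on the clock or timezone of the machine running the scraper.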

Any help with this would be greatly appreciated.

Cheers

Andy
#2
Quote:I was wondering since the web browser emulator works on my server, is it creating the cookies/session using my server time without daylight savings even though the server time is correct?

If the cookie had timed out locally, it would not have been sent to the server at all. In addition, the remote server doesn't get information about the expiration timestamp of the cookie. You shouldn't confuse response cookies and session timeout messages from the server you are scraping with the cookies that you send.
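
To illustrate the point, here is a minimal sketch of how a client-side cookie jar typically behaves. It is generic Python, not the toolkit's code, and the jar layout, cookie names, and `build_cookie_header` function are made up for the example. Expired cookies are dropped before the request is built, and only `name=value` pairs go into the Cookie header, so the server never sees an expiry timestamp.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def build_cookie_header(cookies, now=None):
    """Return the Cookie header value a client would send, skipping expired cookies."""
    now = now or datetime.now(timezone.utc)
    live = []
    for c in cookies:
        expires = c.get("expires")
        # A cookie with no expiry is a session cookie and is always sent.
        if expires is not None and parsedate_to_datetime(expires) <= now:
            continue
        # Only the name and value are sent; Expires/Path/Domain stay local.
        live.append(f"{c['name']}={c['value']}")
    return "; ".join(live)

# Hypothetical jar contents for illustration.
jar = [
    {"name": "session", "value": "abc123", "expires": None},
    {"name": "stale", "value": "old", "expires": "Tue, 26 Sep 2017 14:46:37 GMT"},
    {"name": "fresh", "value": "new", "expires": "Wed, 27 Sep 2017 15:01:37 GMT"},
]
now = datetime(2017, 9, 27, 14, 46, 37, tzinfo=timezone.utc)
header = build_cookie_header(jar, now)
print(header)
```

The PLAY_SESSION cookie in the raw send output above was transmitted, which suggests it had not expired locally at the time of the request.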

Just so you know, GMT strings are stored in the cookie arrays. It shouldn't matter where the code is used.

The way you word your question implies you are getting cookies on one host and then moving the cookie(s) to another host (e.g. GetState() on one host and then SetState() to restore the state on another host) and then running the scraper on the second host using the cookie from before. Doing that probably means an IP address change. A lot of session storage code will check IP addresses and time out a session if the IP changes. Maybe I'm not understanding the particular scenario. I usually establish a session in the first place by making a request to retrieve a form. I have only rarely needed to manually set cookies due to Javascript logic on a page.

Another thing to try is to use Incognito/Private Browsing mode to guarantee a 100% fresh web browser session. Then watch Network traffic using Developer Tools. Compare that traffic against the raw data sent via the WebBrowser class. I've had weird scenarios where something rather subtle sent for the session early on affects later results.
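
A comparison like that can be partly automated. The sketch below is a rough helper in plain Python that parses two raw requests and lists headers that differ or are missing on either side. The function names are illustrative, and the "browser" request is invented sample data rather than real captured traffic.

```python
def parse_request_headers(raw):
    """Parse a raw request into (request_line, {lowercased header name: value})."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return lines[0], headers

def diff_headers(browser_raw, scraper_raw):
    """Return {header: (browser value, scraper value)} for every header that differs."""
    _, browser = parse_request_headers(browser_raw)
    _, scraper = parse_request_headers(scraper_raw)
    report = {}
    for name in sorted(set(browser) | set(scraper)):
        if browser.get(name) != scraper.get(name):
            report[name] = (browser.get(name), scraper.get(name))
    return report

# Invented browser capture (as copied from Developer Tools).
browser_raw = """GET /driving-record/viewbydln HTTP/1.1
Host: www.viewdrivingrecord.service.gov.uk
Accept-Language: en-GB,en;q=0.5
Connection: keep-alive
"""
# Request in the style of the scraper's raw send output.
scraper_raw = """GET /driving-record/viewbydln HTTP/1.1
Host: www.viewdrivingrecord.service.gov.uk
Accept-Language: en-us,en;q=0.5
Connection: close
"""
report = diff_headers(browser_raw, scraper_raw)
for name, (b, s) in report.items():
    print(f"{name}: browser={b!r} scraper={s!r}")
```

A value of `None` on one side means that side did not send the header at all, which is often the more interesting finding.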
Author of Barebones CMS

If you found my reply to be helpful, be sure to donate!
All funding goes toward future product development.
#3
Hi, thanks for the quick response. The way it works is that my server, where the script is running, makes all the calls to the remote server. There is only one connection to the remote server.

Basically, I have a file scraping the UK .gov website. The site states that all of its information is provided under the Open Government Licence v3.0, so basically any information on the website can be used for any purpose. I'll call this website/server the "GOV server". I'm connecting to it and scraping it using a PHP script on a web server, which I'll call MY server.

So when MY server connects, the GOV server seems to return a cookie that is already out of date. The GOV server requires the information to be submitted via a form on the website, so the script using the UST framework completes the form and submits it to the server. The server then responds with a Session Timed Out response, even though the form has only just been submitted. I'm guessing the Javascript cookie generator on the website is causing the issue? Does the built-in web scraper emulate Javascript as well? The GOV server runs a jQuery cookie creation tool.
#4
The scraper does not emulate Javascript. I've thought about it from time to time but Javascript is a rather complex language with lots of nuances. There is an interesting PECL extension called v8, which wraps the same V8 Javascript engine used in Blink-based browsers (Chrome and Opera). However, at this time, that extension presents a fairly unstable, frequently changing API to PHP userland.

When Javascript is involved, I end up looking for the one or two pieces of information that are inevitably somewhere in the HTML on the page that the code uses and piece together a similar sort of request (e.g. add a missing form variable). Watching browser network operations carefully and duplicating them precisely can be tricky - it's hard sometimes to know what is missing.
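
As a rough illustration of hunting for those pieces of information, the sketch below collects hidden form fields from a page using Python's standard `html.parser`. The HTML, the field names, and the `HiddenInputCollector` class are all invented for the example. A real page may keep the value somewhere else entirely, such as an inline script or a data attribute.

```python
from html.parser import HTMLParser

class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""

# Invented sample page; the field names are placeholders.
page = """
<form action="/submit" method="post">
  <input type="hidden" name="csrfToken" value="example-token">
  <input type="text" name="licenceNumber">
  <input type="hidden" name="step" value="2" />
</form>
"""
collector = HiddenInputCollector()
collector.feed(page)
hidden_fields = collector.fields
print(hidden_fields)
```

Comparing the collected fields against what the browser actually posts is one way to spot a variable that Javascript adds or changes before submission.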

If cookies are indeed expiring prematurely, you can try pulling out the cookies using GetState(), removing or modifying the expiration timestamp for each cookie, and then using SetState() to update the internal class state. I'd rather figure out why a cookie is expiring prematurely though - maybe a PHP configuration or timezone bug.
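
The get, patch, restore idea can be sketched in a language-neutral way. The Python below is only an illustration: the dictionary layout, the `expires` key, and the `strip_cookie_expiry` function are hypothetical and do not reflect the actual array that GetState() returns, which should be inspected before writing the real PHP equivalent.

```python
import copy

def strip_cookie_expiry(state):
    """Return a copy of the state with every cookie's expiry removed."""
    patched = copy.deepcopy(state)  # leave the original state untouched
    for cookie in patched.get("cookies", []):
        cookie.pop("expires", None)
    return patched

# Hypothetical state layout, standing in for whatever the class exports.
state = {
    "cookies": [
        {"name": "PLAY_SESSION", "value": "example",
         "expires": "Tue, 26 Sep 2017 14:46:37 GMT"},
    ]
}
patched = strip_cookie_expiry(state)
print(patched)
```

Working on a copy keeps the original state available for comparison if the patched version makes no difference.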


© CubicleSoft