Screen scraping scripts

Discussion in 'Mileage Runs/Travel Hacking' started by effseeoh, Jan 2, 2012.

  1. effseeoh

    effseeoh Gold Member

    Messages:
    740
    Likes Received:
    2,458
    Status Points:
    1,145
    Does anyone have any scripts they can share or point to that automate querying various travel sites? I have done a lot of Perl scripting in the past, including screen scraping, but instead of starting from nothing, it would be nice to start with some examples.

    Alternatively, if anyone would like to collaborate on a screen scraping project, please get in touch. I'm interested in querying fares, fare rules, computing awards, etc.
     
    LarryInNYC likes this.
  2. iolaire
    Original Member

    iolaire Gold Member

    Messages:
    3,510
    Likes Received:
    5,767
    Status Points:
    4,170
    I am available, but generally I don't have a lot of time. I usually work in Ruby (hobby), but most of the code libraries make it very easy to do similar projects in other languages. I'm OK coding backend websites in PHP and Ruby on Rails. I've done a tad of JS scripting for website automation. The daytime part of my job is doing automated PDF publishing in VB, Excel, Word, InDesign, SQL and older technologies.

    I tossed out the idea of having outsourced flight searches some time back (http://milepoint.com/forums/threads...age-run-and-good-price-searches-from-was.581/). There was some interest, but I did nothing because of the time commitment. My final concept was to set up some bookmarklets to report searches back to a master site where others could see your results. For example, if I'm looking for award availability on CO to BKK, my segments might be of interest to someone else even if I don't find what I want for BKK. Again, I did nothing because the project would be fairly large and I'm probably not going to see it through on my own.
     
    LarryInNYC likes this.
  3. m124

    m124 Silver Member

    Messages:
    43
    Likes Received:
    41
    Status Points:
    165
    What travel sites are easy to scrape?
     
  4. LarryInNYC

    LarryInNYC Gold Member

    Messages:
    1,384
    Likes Received:
    2,797
    Status Points:
    1,445
    I've looked into this a tiny little bit, since I'd like to be able to have a robot checking, for instance, for possible good fares from NYC to pretty much anywhere for the February and April school breaks. I can put in fare watches for particular routes in various systems, and I can get a sense of month-long low fares with the maps at Kayak and Farecompare, but I sort of want to combine those two functions.

    The problem is that plenty of people are thinking along the same lines you are, except that they want to make commercial scale sites. Access to this information is a valuable commodity which is sold to companies for a pretty high cost. Kayak used to have a free, public search API but that's gone away. I imagine that anyone who detects that their search engine is getting robo-harvested is going to take action to stop it. That said, it's actually quite easy to pull down the permitted information from sites like Kayak. See http://www.kayak.com/labs/rss/ for information about easy searches that return lovely XML for your use. I'm sure that their interactive pages are also well-structured.
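
    For anyone who wants to try the XML route, here is a minimal sketch of reading such a feed in Python. It assumes the feed follows the generic RSS 2.0 layout (channel/item with title, link and description), which I have not checked against the real Kayak output. The sample feed, the example.com links and the function name parse_rss_items are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in generic RSS 2.0 layout; a real fare feed's
# fields and values may differ.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example fare feed</title>
    <item>
      <title>NYC to LAX $198</title>
      <link>http://example.com/fares/nyc-lax</link>
      <description>Round trip, February departure</description>
    </item>
    <item>
      <title>NYC to MIA $142</title>
      <link>http://example.com/fares/nyc-mia</link>
      <description>Round trip, April departure</description>
    </item>
  </channel>
</rss>
"""


def parse_rss_items(xml_text):
    """Return one dict per <item> element found in an RSS document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns the default when the child element is absent
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


if __name__ == "__main__":
    for entry in parse_rss_items(SAMPLE_FEED):
        print(entry["title"], "->", entry["link"])
```

    To use it on a live feed you would download the XML text first and pass it to parse_rss_items, within whatever the site permits.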

    I'd love to talk about this stuff, but I have relatively little time these days for side projects, so I can't commit to anything.
     
    effseeoh, TAHKUCT, iolaire and 2 others like this.
  5. effseeoh

    effseeoh Gold Member

    Messages:
    740
    Likes Received:
    2,458
    Status Points:
    1,145
    Thanks for the feedback iolaire, that did look like an interesting project. However, I'd prefer to stay autonomous rather than rely on third parties to do the searching. The bookmarklet idea is interesting too; I'd thought of perhaps creating a Firefox plugin that saves search results to a server.

    LarryInNYC: that's really useful, thanks. So far the only thing I've been able to find is a Perl Travelocity scraper, but it stopped working a couple of weeks ago and I haven't looked into why.
    Since nobody (including me) has much spare time, perhaps we could just collaborate on this thread if we come across scripts, create them or have other ideas?
     
    LarryInNYC likes this.
  6. iolaire
    Original Member

    iolaire Gold Member

    Messages:
    3,510
    Likes Received:
    5,767
    Status Points:
    4,170
    effseeoh and LarryInNYC like this.
  7. LarryInNYC

    LarryInNYC Gold Member

    Messages:
    1,384
    Likes Received:
    2,797
    Status Points:
    1,445
    From a quick look at the link, it appears this script is a customization to the interactive search page, not an autonomous script that operates without user involvement. But still pretty cool.
     
  8. effseeoh

    effseeoh Gold Member

    Messages:
    740
    Likes Received:
    2,458
    Status Points:
    1,145
    FYI: as far as I can see, ITA (Google) doesn't have any objection to scripted access to Matrix. There is nothing in their Ts&Cs against it, and they also included a request to create a Java tool to use it in their hiring challenge: http://www.itasoftware.com/careers/work-at-ita/hiring-puzzles.html

    Although Greasemonkey scripts are intended mainly to customise interactive web access, I'm sure people have used them for other purposes too.
     
    LarryInNYC and iolaire like this.
  9. iolaire
    Original Member

    iolaire Gold Member

    Messages:
    3,510
    Likes Received:
    5,767
    Status Points:
    4,170
    I took a quick glance at (some dude's) script and it looks like a good start. His presentation mentions MilePoint, so I wonder if you can find him and talk to him. Regardless, you can learn a lot from it.

    I think the big issue will be that each website ends up paying for each search. Kayak states abuse is why they closed down their API. I think that if you start doing automated scraping (and especially if multiple people do it), any tool will be shut down fast, and it is abuse, since you're costing them direct cash outlays.

    Long term it might be best to collect names and set up some sort of private area to discuss a project.
     
    LarryInNYC likes this.
  10. Randy Petersen
    Original Member

    Randy Petersen Founder

    Messages:
    2,731
    Likes Received:
    15,136
    Status Points:
    10,520
    The milepoint "Conversations" feature is a perfect place. Many are using it now, since it also functions as a near real-time chat session in addition to a dynamic private messaging and conversation area, and it is easy to add members to by invite only.
     
    carsonheim, SC Flier and iolaire like this.
  11. rizwank

    rizwank Silver Member

    Messages:
    130
    Likes Received:
    149
    Status Points:
    350
    I'd be down to work on a scraping project (found some APIs and done some test runs already), but it'd be important for it to happen in a private place and not to publish too much about techniques...
     
  12. effseeoh

    effseeoh Gold Member

    Messages:
    740
    Likes Received:
    2,458
    Status Points:
    1,145
  13. LarryInNYC

    LarryInNYC Gold Member

    Messages:
    1,384
    Likes Received:
    2,797
    Status Points:
    1,445
    Is there anyone who can set up a private Milepoint area to look at this issue? I think I'm limited to inviting 5 people.
     
  14. effseeoh

    effseeoh Gold Member

    Messages:
    740
    Likes Received:
    2,458
    Status Points:
    1,145
    Me too, but it does have a checkbox to allow invitees to invite others, so perhaps it could be done that way?
     
  15. iolaire
    Original Member

    iolaire Gold Member

    Messages:
    3,510
    Likes Received:
    5,767
    Status Points:
    4,170
    I've started a conversation with LarryInNYC, rizwank, and effseeoh. I can invite up to 25, and others can invite other people.

    Please ask (in this thread) if you want to contribute your skills, knowledge, time and be added to the conversation.
     
  16. viguera
    Original Member

    viguera Gold Member

    Messages:
    4,737
    Likes Received:
    6,913
    Status Points:
    4,745
    Feel free to include me as well if you want. I, like many others, barely have the time, but I can swing a mean cat at Perl and PHP or even a native Windows app in VB.

    Thanks
     
    iolaire likes this.
  17. Randy Petersen
    Original Member

    Randy Petersen Founder

    Messages:
    2,731
    Likes Received:
    15,136
    Status Points:
    10,520
    We can alter that upon request if you like. We try to keep it reasonable by default to keep out the spammers who like to do evil.
     
    SC Flier likes this.
  18. HaveMilesWillTravel
    Original Member

    HaveMilesWillTravel Gold Member

    Messages:
    12,504
    Likes Received:
    20,199
    Status Points:
    16,520
    Couple things:

    I was looking at Scrapy (scraping framework for Python) yesterday for some unrelated project: http://scrapy.org/

    Might also want to have a look at Selenium RC for JS-heavy sites.
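
    To give a feel for what the scraping step looks like, here is a minimal sketch that uses only Python's standard library rather than Scrapy or Selenium. The HTML snippet, the "route" and "fare" class names and the names FareTableParser and scrape_fares are hypothetical; a real results page will have its own markup, and a JS-heavy page would need a browser-driving tool instead.

```python
from html.parser import HTMLParser

# Hypothetical results markup; real travel sites structure pages differently.
SAMPLE_HTML = """
<table>
  <tr><td class="route">JFK-BKK</td><td class="fare">$1,120</td></tr>
  <tr><td class="route">JFK-LHR</td><td class="fare">$645</td></tr>
</table>
"""


class FareTableParser(HTMLParser):
    """Collect the text of each <td>, keyed by its class attribute, per row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None     # dict for the row being read, or None
        self._field = None   # class of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {}
        elif tag == "td" and self._row is not None:
            self._field = dict(attrs).get("class")

    def handle_data(self, data):
        # Only keep text that sits inside a classed cell of an open row.
        if self._row is not None and self._field:
            previous = self._row.get(self._field, "")
            self._row[self._field] = previous + data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_fares(html_text):
    """Return (route, fare in whole dollars) pairs found in the markup."""
    parser = FareTableParser()
    parser.feed(html_text)
    parser.close()
    fares = []
    for row in parser.rows:
        if "route" in row and "fare" in row:
            amount = int(row["fare"].lstrip("$").replace(",", ""))
            fares.append((row["route"], amount))
    return fares


if __name__ == "__main__":
    for route, amount in scrape_fares(SAMPLE_HTML):
        print(route, amount)
```

    A framework like Scrapy adds the fetching, scheduling and selector layers on top of this basic idea.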
     
  19. effseeoh

    effseeoh Gold Member

    Messages:
    740
    Likes Received:
    2,458
    Status Points:
    1,145
    I do remember looking at Selenium a long time ago, but forgot about it. Good tip!

    Doing a search for "selenium vs greasemonkey" I came across this page: http://www.perlmonks.org/?node_id=720018 which looks pretty useful too
     
  20. m124

    m124 Silver Member

    Messages:
    43
    Likes Received:
    41
    Status Points:
    165
    I would like to be invited as well. I mainly know .NET, but am eager to contribute, if possible.
     
  21. effseeoh

    effseeoh Gold Member

    Messages:
    740
    Likes Received:
    2,458
    Status Points:
    1,145
  22. paulum

    paulum Active Member

    Messages:
    3
    Likes Received:
    2
    Status Points:
    65
    I'm doing some work on a certain Perl script to scrape travel-O
    The setup of this script is pretty good.

    Perhaps we can build on this. Please invite me to share your thoughts on this.
     
    iolaire likes this.
  23. effseeoh

    effseeoh Gold Member

    Messages:
    740
    Likes Received:
    2,458
    Status Points:
    1,145
    Hi paulum. It seems I could only invite one person to the conversation and I've used that one up. Hopefully one of the other guys will invite you shortly.
     
  24. iolaire
    Original Member

    iolaire Gold Member

    Messages:
    3,510
    Likes Received:
    5,767
    Status Points:
    4,170
    Done - please remove that guy's name from your post to keep things discreet.
     
    paulum likes this.
  25. okrogius

    okrogius Silver Member

    Messages:
    696
    Likes Received:
    853
    Status Points:
    795
    If you're attempting to automate price searching, the entire screen scraping discussion is somewhat irrelevant. ITA's Matrix is a joy to work with programmatically.

    If you're aiming to recreate farecompare's fares page, it's probably just a matter of finding a usable source. Say, http://www.farecompare.com/products/fare-display/index.html works reasonably well and is easy to use.
     
    TAHKUCT likes this.
