ANA Web Scraping

Discussion in 'Mileage Runs/Travel Hacking' started by roaming ryan, Nov 27, 2013.

  1. roaming ryan

    roaming ryan Active Member

    Messages:
    6
    Likes Received:
    2
    Status Points:
    65
    Has anyone else experimented with scraping *A availability from the ANA website using some sort of automated tool? I have had some success with this in recent months using some scripts I've hacked together; however, I've noticed recently that they have deployed some countermeasures in the form of CAPTCHAs.

    I've tried slowing down my scraping routines and I've also tried disabling cookies, but it still hits me with CAPTCHAs from time to time. Has anyone else experienced this issue? I'd be curious to know if anyone has figured out what triggers the CAPTCHA.
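
    For the curious, the rough shape of my setup is below. This is only a sketch: the endpoint URL, query parameters, and CAPTCHA marker check are placeholders I've invented for illustration, not ANA's actual interface.

        import random
        import time

        import requests

        SEARCH_URL = "https://example.com/award-search"  # placeholder endpoint

        def fetch_availability(session, params):
            # One availability query; treat a CAPTCHA page as a soft failure.
            resp = session.get(SEARCH_URL, params=params, timeout=30)
            if "captcha" in resp.text.lower():  # crude, invented marker check
                raise RuntimeError("CAPTCHA encountered; back off")
            return resp.text

        session = requests.Session()
        for query in [{"from": "SFO", "to": "NRT", "date": "2013-12-15"}]:
            html = fetch_availability(session, query)
            time.sleep(random.uniform(15, 45))  # pause between queries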
     
  2. Mrlasssen

    Mrlasssen Silver Member

    Messages:
    179
    Likes Received:
    151
    Status Points:
    375
    I don't understand the jargon terms, but lately trying to find US Airways availability has been more successful on the US Airways site. ANA shows no availability for flights that I then find on the US Airways site. Has anyone else seen the same results?
     
  3. viguera
    Original Member

    viguera Gold Member

    Messages:
    4,737
    Likes Received:
    6,913
    Status Points:
    4,745
    I've never run into it. Do you know what software they're using? I was just poking around with some sample routes and didn't see it pop up.

    I think @Wandering Aramean might be using something to scrape ANA, but I'm not sure if he is or how he's dealing with the CAPTCHA.

    Of course, worst case you can just throw a flag whenever you run into it, report back some kind of error, and try again later?
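
    Something like this, say (a sketch only; fetch_availability is the hypothetical query function from the first post's example, and the one-hour backoff is a guess):

        import time

        def fetch_with_retry(session, params, retries=3, backoff_s=3600):
            # Treat a CAPTCHA as a soft failure: log it, cool off, try again.
            for attempt in range(1, retries + 1):
                try:
                    return fetch_availability(session, params)
                except RuntimeError:  # the CAPTCHA flag from fetch_availability
                    print("CAPTCHA hit; sleeping before retry", attempt)
                    time.sleep(backoff_s)
            return None  # give up on this query for now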
     
  4. Wandering Aramean
    Original Member

    Wandering Aramean Gold Member

    Messages:
    28,216
    Likes Received:
    61,747
    Status Points:
    20,020
    It is all about timing. Add pauses to your queries or otherwise slow them down, or else your IP will be banned for a few hours.
     
    roaming ryan and viguera like this.
  5. roaming ryan

    roaming ryan Active Member

    Messages:
    6
    Likes Received:
    2
    Status Points:
    65
    Ok, that's great to hear. I did add some delay (a uniform random delay in the range [15 s, 45 s]), but I guess it was still too aggressive. I'll slow it down some more.

    Any idea if the rate limitation is tied to your IP or just your user/session? If it is IP, then routing requests through something like Tor would completely remove the rate limitation.
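
    (For concreteness, the Tor routing I have in mind is just pointing the HTTP client at a local Tor SOCKS proxy. The sketch below assumes a Tor daemon running on its default port 9050 and the requests[socks] extra installed; whether it actually sidesteps an IP-based limit depends on how often the exit node changes.)

        import requests

        # Route all traffic through a local Tor SOCKS proxy (default port 9050).
        # socks5h resolves DNS through the proxy as well.
        proxies = {
            "http": "socks5h://127.0.0.1:9050",
            "https": "socks5h://127.0.0.1:9050",
        }
        session = requests.Session()
        session.proxies.update(proxies)
        # session.get(...) now exits through Tor rather than your own IP.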
     
  6. briantoronto
    Original Member

    briantoronto Gold Member

    Messages:
    4,335
    Likes Received:
    1,988
    Status Points:
    1,420
    FWIW, enough requests by a real human are enough to activate the CAPTCHA. As a human who has been bugged because of this, all I can say is a sarcastic THANKS.
     
  7. Wandering Aramean
    Original Member

    Wandering Aramean Gold Member

    Messages:
    28,216
    Likes Received:
    61,747
    Status Points:
    20,020
    IP.

    Yes, pushing it through Tor/NSA would likely solve the issue, but that's arguably slower than just using the delays.
     
    viguera likes this.
  8. mherdeg
    Original Member

    mherdeg Silver Member

    Messages:
    137
    Likes Received:
    186
    Status Points:
    395
    Does anyone have an empirical measurement of the rate limit they're applying? Like, is one request per minute too many? Two? Five?
     
  9. roaming ryan

    roaming ryan Active Member

    Messages:
    6
    Likes Received:
    2
    Status Points:
    65
    My scraper uses a uniform random delay in the range [60 s, 180 s] between HTTP requests. My last run of this scraper executed for about 6 hours without any problems.

    It's slow, but it sure as hell beats manually performing the queries. Plus, it yields nice structured data that can be further analyzed with path-finding algorithms.
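
    (By path-finding I mean something like the toy below: treat each day's scraped results as a graph of airports with an edge wherever award space exists, then search it for a routing. The availability data here is invented for illustration.)

        from collections import deque

        # Invented sample: airport -> airports reachable with award space.
        avail = {
            "SFO": ["NRT", "YVR"],
            "YVR": ["NRT"],
            "NRT": ["SIN"],
        }

        def find_route(origin, dest):
            # Breadth-first search for the shortest chain of open segments.
            queue = deque([[origin]])
            seen = {origin}
            while queue:
                path = queue.popleft()
                if path[-1] == dest:
                    return path
                for nxt in avail.get(path[-1], []):
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(path + [nxt])
            return None

        print(find_route("SFO", "SIN"))  # ['SFO', 'NRT', 'SIN']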
     
    Wandering Aramean likes this.
  10. roaming ryan

    roaming ryan Active Member

    Messages:
    6
    Likes Received:
    2
    Status Points:
    65
    I'm surprised you've hit them when using the website manually. Before I wrote my scraping scripts, I used to conduct lengthy (many-hour) searches of the ANA website and never ran into the CAPTCHAs. I guess you are just faster than I am. :)
     
    Wandering Aramean likes this.
  11. mherdeg
    Original Member

    mherdeg Silver Member

    Messages:
    137
    Likes Received:
    186
    Status Points:
    395
    Just thinking out loud: at that rate (an average of one request every two minutes) you can get about 720 pieces of information per day from a single IP, where one piece is route availability for an outbound and a return on two days.

    This would let you get a complete snapshot of nonstop flights from, for example, LON for three weeks (there are 31 distinct *A routes from LON). You'd need to do a lot of work to capture the whole route network! (You can get some route data at http://routemap.staralliance.com/.)
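
    Spelling out that arithmetic (all figures from the posts above):

        # Mean delay of 120 s -> about 720 requests per day from one IP;
        # LON's 31 nonstop *A routes snapshotted daily for 3 weeks = 651.
        mean_delay_s = (60 + 180) / 2
        requests_per_day = 24 * 3600 / mean_delay_s
        print(requests_per_day)  # 720.0
        print(31 * 21)           # 651, which fits within one day's budget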
     
  12. dctanner

    dctanner New Member

    Messages:
    1
    Likes Received:
    1
    Status Points:
    15
    I've been writing something similar. It'd be great to combine efforts and work out a long-term solution (with distributed scraping capabilities). Anyone interested in creating a group effort where we share the results? As pointed out, getting the whole network is a mammoth task.
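
    (One way the distribution could work, purely as a sketch: deterministically shard the route list across participants so nobody duplicates queries. The route list here is a made-up sample.)

        import hashlib

        routes = ["LHR-JFK", "LHR-EWR", "FRA-ORD", "NRT-SIN"]  # made-up sample

        def is_mine(route, worker_id, num_workers):
            # Stable hash so every participant computes the same split.
            h = int(hashlib.sha1(route.encode()).hexdigest(), 16)
            return h % num_workers == worker_id

        mine = [r for r in routes if is_mine(r, worker_id=0, num_workers=3)]
        print(mine)  # the subset this participant scrapes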
     
    Orlan likes this.
  13. drdavidge

    drdavidge New Member

    Messages:
    3
    Likes Received:
    1
    Status Points:
    10
    I just started experimenting with this too. I just wish I could search a particular route for several months without sitting there clicking the "next 7 days" button for a half hour. Did you guys get anywhere?
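
    (What I have in mind is stepping the query date forward a week at a time, something like the sketch below; fetch_availability stands in for the hypothetical query function sketched earlier in the thread.)

        from datetime import date, timedelta

        start = date(2013, 12, 1)
        for week in range(13):  # roughly three months of 7-day windows
            query_date = start + timedelta(weeks=week)
            # fetch_availability(session, {"route": "JFK-NRT",
            #                              "date": query_date.isoformat()})
            print("would query availability for", query_date.isoformat())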
     
