Blogorola has ping – Apache rewriting with time

Had says that Blogorola got a ping interface at http://api.blogorola.com/ping.
I hope this means that it won’t be requesting the feed every 2 minutes anymore. It should be getting a 304 anyway, but still…

Update #1: I posted a post in Slovenian. I have no idea what I was thinking. I also figured out that it’s getting a 302, because my feed is at FeedBurner, a Google company.

Update #2: Hope floats. Blogorola’s “ItsyBitsy – spider” made 57 requests in the last 8 hours or so. My server doesn’t care much, because it only serves a 302 and redirects to the FeedBurner-hosted feed. What about yours? Are you willing to put up with this?

If you’re using Apache with mod_rewrite (chances are that you are), you can make sure these requests never reach your backend and database with something like this:

# Send the ItsyBitsy spider to a black hole during the day (06:00-23:59),
# so it only gets to fetch the feed at night, when other traffic is low.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^ItsyBitsy
RewriteCond %{TIME_HOUR} >05
RewriteRule . http://blackhole/ [R=307,L]

Add this to the .htaccess file in your WordPress folder (it should already be there) or basically anywhere on the server to disallow access to the spider at all times except at night when other traffic is low.

You could also use this to allow the spider to access the feed only when you're actually writing - e.g. you usually write your posts between 20 and 22, so you can allow access then and send it to the "blackhole" at other times, as in the sketch below. You can also use this to serve it different feeds at different times, for whatever reason...
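
A rough sketch of that variant, reusing the rule above with the 20-22 window from the example (adjust the hours to your own schedule):

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^ItsyBitsy
# outside the 20:00-21:59 writing window ([OR] combines the two
# hour conditions), send the spider to the black hole
RewriteCond %{TIME_HOUR} <20 [OR]
RewriteCond %{TIME_HOUR} >21
RewriteRule . http://blackhole/ [R=307,L]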

See more about rewriting with time at the Apache conf page or the rewrite guide.

7 Responses to “Blogorola has ping – Apache rewriting with time”

  1. you.go says:

    The pinging service is actually meant exactly for savvy webmasters who mind the traffic but would still like to be included in Blogorola’s index.

    We do not recognize frequent requests to your RSS feed as a problem, since your server is probably returning just a few bytes of code, saying “304 – Not modified”, so we’re not actually causing you a lot of traffic.

    But due to some problematic blogging platforms that are not returning 304 responses, and due to popular demand (that being you :) ), we’re starting to enable the pinging service as an alternative.

    Pinging Blogorola does not stop Itsy bitsy spider from crawling your site yet, but it will soon.

  2. Fry says:

    I guess another reason why you don’t recognize frequent requests to RSS feeds as a problem is that you’re not the one getting them. Even the 304 requests usually make something happen on the server – they have to check whether new content is available.

    Maybe the bytes don’t really matter on a single blog and maybe not even when multiplied by the number of blogs. That argument sounds like every argument ever made by a polluting company – a single drop of oil, a single tree, a ton of CO2…

  3. Marko says:

    As Fry said, it’s not simply the bandwidth spent that would be the problem. It’s the unnecessary load, which might not be a problem when there’s one jerk around, but can become quite annoying when there are lots of them and there’s more than one blog on the server.

    Besides, what do you think all those pinging services are for? If you just took the data that weblogs.com offers, you could quite easily check which blogs were updated and when. You could also discover new Slovenian blogs by checking that data against IP addresses. Gosh, you might even go so far as to use the frequency of posting as a guide to how often you poll a particular blog. You could also follow net etiquette, which considers polling more often than once every half hour impolite.

    Do you do any of that? No. You just keep hitting those servers.

    Btw:
    We do not recognize frequent requests to your RSS feed as a problem, since your server is probably returning just a few bytes of code, saying “304 – Not modified”, so we’re not actually causing you a lot of traffic.

    I can’t decide if the nuances of the English language are the problem or if this is just an amazingly arrogant statement. What you’re willing to recognize means zilch on my server. Either you play ball or leave my RSS the fuck alone.

    Unlike Fry, I wasn’t willing to wait for you to become reasonable. I removed my blog from Blogorola and will keep an eye out for traces of your spider in my logs. It had better not appear anymore.

  4. freakolowsky says:

    Well, once again … I seem to be doing this a lot lately … I’m really sorry for all the trouble you have or have had with Itsy.

    It was my top priority to make Itsy as fast an aggregator as possible while making it as small an annoyance as possible to blog and blog-engine owners. While we have tried, and still try, our best to achieve the optimum balance between those two goals, we are in most cases bound by “non-standard uses of standards”. The phrase “net etiquette” was used in this discussion; it describes a behavioural format, a standard, and if you expect other people to respect you by respecting your standards, it would be very nice of you to at least consider respecting theirs before flaming them.
    With Itsy I tried to embed such standards. While there were many, two are worth pointing out for this specific conversation.
    First, there is the If-Modified-Since request header. By using it, your server can reject the request before it even parses the page. If you use PHP to generate feeds, all it takes is a simple query for the latest publish date (on my own computer that query takes about 0.05 s on 50k records and an unindexed column) and a header command to set the “Last-Modified” field, after which you can lock the headers in by issuing a session_start or a simple blank echo (see the sketch at the end of this comment). If the parser identifies an If-Modified-Since request, it will not continue to parse the page if the condition fails. So all your server suffers here is (tested on a personal page where a login has to be processed before connecting) about 0.15 s of page compile time and approximately 30 bytes of data. Multiplying this by the once-every-two-minutes requests you stated (which I doubt), that makes about 21 KB of data sent and 108 seconds of processing per day if the condition is always false. If your feed address is redirected, it costs your server even less time to process but approximately the same amount of data; if you wish to avoid this as well, you can always send an email to our editor to change the feed address of your blog to a direct one. It might also interest you that using mod_rewrite still means that the header will be written, data will be sent and time will be spent … so using your mod_rewrite code or the redirection to FeedBurner in your case costs you about the same; all that changes is that you do not get checked and we get a DoS.
    I also noticed that someone suggested using a statistical mapping for time-targeted aggregation. That principle was one of the first ones to be implemented, and its complete and working code exists in Itsy, but it is currently deactivated because too many blogs have invalid dates in their feeds, which totally corrupts all the statistical data and therefore has the opposite effect. Such statistical information cannot be generated without relying on correct data from the server side. For instance, a certain unnamed blog engine sets the publish dates of all posts inside a feed to the build time of the PHP page which serves the feed.
    So for this post’s body and comments’ worth of complaints, I can simply say I did my best given the circumstances.
    But because I personally do not blog or do anything with, for or on blogs except this aggregator, I must say that up until a month ago I wasn’t even aware that blog-ping services existed; as soon as I was informed about them, we started modifying the system to use such services and reduce the annoyance factor of Itsy even further.

    If the described situation is still beyond your capacity to adjust to, you can still notify our editor at any time to remove you from our list. But if you have any other suggestions on how to make Itsy even less annoying for you while staying registered in our database, my mail is always open to your suggestions.
    Our success is still conditioned by the success AND feedback of the blog community we aggregate, so in plain words (imagine that U.S. Army poster with Uncle Sam): if you’re happy with us, we rule and will last; if you’re not happy with us, we rule a bit less but will probably go down soon.

    Thanks for your time … your friendly neighbourhood spider coder :D

    PS: if all goes well, ping goes online on Wednesday, so don’t give up on us just yet … we’re doing our best.
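
    A minimal sketch of the If-Modified-Since handling described above, assuming a PHP-generated feed; get_latest_post_date() is a hypothetical stand-in for the max-publish-date query:

    <?php
    // Hypothetical helper: runs the "max publish date" query mentioned
    // above and returns the result as a Unix timestamp.
    $lastModified = get_latest_post_date();

    // Always tell clients when the feed last changed.
    header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');

    // If the client sent If-Modified-Since and nothing newer exists,
    // answer 304 and skip building the feed entirely.
    if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])) {
        $since = strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']);
        if ($since !== false && $lastModified <= $since) {
            header('HTTP/1.1 304 Not Modified');
            exit;
        }
    }

    // Otherwise fall through and generate the full feed as usual.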

  5. you.go says:

    Whoa, did not expect that kind of bashing…

    What I’m saying is that we’re trying to give readers a good service, which means trying to publish a link as soon as a post is published. We’ve had our share of problems trying to do this while expecting blogging engines to respect the standards around RSS (like publishing a last-modified timestamp).

    We went through several stages in designing Itsy bitsy, which did in fact include anticipating posting frequencies, querying blogs with slow responses less frequently and similar “smart” ways to reduce the amount of work the spider has to do. The current version of the spider has some of those features and we’re testing some others.

    We’re launching the pinging service this week so those who feel strongly against spiders (or Itsy bitsy in particular) can choose an alternative and still be included in Blogorola’s index.

    But are you also blocking Google’s spiders since they’re hitting your site constantly? I’m guessing no, because we all kinda like the traffic it generates. Yeah yeah, I know it hits the site less frequently, but isn’t that why bloggers bitch about Google? It indexes the damn posts too slow (at least that’s what bloggers told us…).

    Some day, we hope, Blogorola will be a relevant source of visitors and we’ll all be happy to receive hits from its spider. Until then, we’re sorry for “polluting” your server and we’re putting in the effort to provide the pinging service by the end of the week.

  6. Marko says:

    I don’t block Google, because it doesn’t pester me. I didn’t block you either; I just sent a request to be removed from Blogorola to stop your spider. I do have to say the removal was done quickly and politely, and I appreciate that.

    You can insinuate as much as you want that Google gets treated differently. It doesn’t (by me). It might be difficult for you to believe, but some of us don’t write hoping to get maximum exposure.

    Btw, Google manages to index my pages faster than Blogorola did, without hitting them constantly. Go figure.

  7. you.go says:

    I’m not insinuating anything, what I’m trying to say is there, it’s in the lines, not between them.

    Obviously you are not the common blogger, the blogger next door; the ones who sent us mails wanted to be exposed as soon as possible, felt that Google didn’t index their posts fast enough and didn’t mind extra traffic to their RSS feed. They seem happy with our service. Furthermore, our primary aim is content, not technology; we’re trying to give bloggers a broader perspective, connecting bloggers across the borders of the former Yugoslavia. Google and Technorati are far ahead technologically.

    Like you said, we do respond to requests of any type, feature requests too. What we’re trying to prevent by commenting on this post is users performing a DoS without knowing why. Those of you opinionated enough are welcome to do so, but we’d still rather hear some feature requests, suggestions and (this being a non-profit project) perhaps even some help. No defensive tactics needed :)
