February 18, 2004

Powered by 1010 aggregator

The front page of this site is now driven by the 1010 aggregator. Hopefully it doesn't crash :). The aggregator still has some bugs, but hopefully putting it on this page will get them shaken out more quickly.


by kwc | February 18, 2004 11:42 PM
Comments

Neat stuff! Does running the feeds hit your machine very hard? If you stick expires headers in there, it could probably be piped through the same squid pass-through as the main site, and thus reduce the load on your machine to a less bursty point. Comments would be cool, though.

Posted by: bp at February 19, 2004 01:02 AM

Haven't noticed any appreciable hit on the machine, incoming or outgoing, but there haven't been any real users yet, and the comment feature is still on the sidelines. CPU mostly stays between 0 and 2%, peaking at about 12% if I hit reload as fast as I can in a browser (not a real test, I know). Memory is a steady 30MB on a machine with 1GB. The aggregator has been running for the past month or so without If-Modified-Since/If-None-Match headers, and I haven't noticed any bandwidth drain. The latest update to the codebase added those in, so it should be even better.
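
For anyone curious what the If-Modified-Since/If-None-Match part looks like in practice, here is a rough Python sketch of a conditional GET. It is not the aggregator's actual code: the function names and the idea of passing in the previously saved ETag and Last-Modified values are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def build_conditional_headers(etag=None, last_modified=None):
    """Build conditional-GET headers from validators saved on the last fetch.

    Hypothetical helper: only sends a header when we have a saved value.
    """
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None):
    """Fetch url; return (body, etag, last_modified).

    body is None when the server answers 304 Not Modified.
    """
    request = urllib.request.Request(
        url, headers=build_conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Nothing changed: keep the validators we already had.
            return None, etag, last_modified
        raise
```

A server that ignores both headers just returns the full feed every time, so the worst case is the same bandwidth as before.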

I'm currently worried about the concentration of hits on LiveJournal and Xanga. They might decide to get very angry with me, though I'm probably still below the radar. Xanga didn't seem to notice when I was writing my xanga2mt script, which hammered Xanga for several hours straight as I was debugging it :).

In the future, comment parsing will greatly outweigh anything else, as it potentially adds a factor of N to the bandwidth consumption, where N is the number of entries. Also, LiveJournal and Xanga do not issue Last-Modified or ETag headers for actual entries. I may not support scraping of those sites for that reason, even though I already have the code to do so.

Also, I may have to design the comment feature such that you explicitly have to mark that you want to subscribe to a particular entry.

Squid won't save too much right now, as the cache would have to be very short lived. My blog, for example, only gets 3-6 hits to the front page per hour, even if you look at peak times. The cache expiration would have to be set to about ten minutes, which would mean that usually no entries would be served out of the cache, and I would actually be doubling the bandwidth costs. For now, I'll have to see if I get any users first :).

Posted by: Ken at February 19, 2004 01:41 AM

BTW - you officially get credit for being "First Post!" on movabletypo.net. I guess I get "Second Post!" and "Third Post!" :)

Posted by: Ken at February 19, 2004 01:42 AM

Great script. Somehow I really doubt the Xanga/LJ people notice much... they've got so much other traffic going from other users (malicious or otherwise) that it probably barely registers as a blip in their usage.

Posted by: Mike at February 19, 2004 09:31 AM

That's what I'm hoping, but I know Slashdot, for example, shuts off its feed to any site that reloads the feed too frequently. /. is kinda on the leading edge of this, but I don't want to give other sites reason to do the same, even if I'm only a blip on the radar of a much larger problem.

I'm also a little worried about scraping MT blogs, but I'm hoping that I can optimize out any of the actual issues. I know that I would certainly notice my aggregator scraping my blog, but I'm also the type of person that looks at my statistics regularly to figure out where my site is being stressed. Say, for example, that your site has ten entries in the feed, which is about par for the course. If I just read your feed every thirty minutes, that's forty-eight hits per day, or 1440 per month. If I scrape the comments for each of those entries as well, that's 528 hits per day, or 15840 per month. Also, those hits are loading individual entries, which are much larger.
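
The numbers above are easy to re-run for other polling intervals or feed sizes. Here is a quick Python sketch of the arithmetic, assuming a 30-day month and one extra request per entry on every poll when comments are scraped; the function name is made up for illustration.

```python
DAYS_PER_MONTH = 30  # assumption: the monthly figures use a 30-day month


def hits_per_day(poll_minutes, entries_in_feed, scrape_comments):
    """Requests made to one site per day (hypothetical helper).

    Each poll fetches the feed once, plus one page per entry
    when comment scraping is turned on.
    """
    polls_per_day = (24 * 60) // poll_minutes
    requests_per_poll = 1 + (entries_in_feed if scrape_comments else 0)
    return polls_per_day * requests_per_poll


feed_only = hits_per_day(30, 10, scrape_comments=False)
with_comments = hits_per_day(30, 10, scrape_comments=True)

print(feed_only, feed_only * DAYS_PER_MONTH)          # 48 1440
print(with_comments, with_comments * DAYS_PER_MONTH)  # 528 15840
```

So scraping comments multiplies the request count by eleven here: the feed itself plus the ten entries.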

Currently, no site has registered more than 5000 hits on my site in a single month, so this would be pretty big. The If-Modified-Since stuff should make most of this go away, but I don't know how well .php sites, for example, support those headers.

Posted by: Ken at February 19, 2004 10:55 AM

Woo hoo! You fixed the boxes around the entry. Looks great!

Posted by: Cshell at February 19, 2004 11:43 PM