Scoop || Performance and Caching Ideas


Performance and Caching Ideas
By rusty, Section Project. Posted on Sat Jun 30, 2001 at 12:00:00 PM PST

Recent traffic on K5 has forced me to start thinking about performance again. I guess that's not a bad thing. :-) Essentially, as far as I can tell, performance of the current design is pretty much maxed out. It's as efficient as it can be without major changes. Also, I really like the current code design, so I'm reluctant to make major changes to it. Nevertheless, performance at high load is still really not as good as I want (and need) it to be. So, as of right now, there are two major areas I can think of where performance could be improved without major disruptions. The first is archiving stories to keep the database relatively small, and the second is creating static cached pages for anonymous viewers. I wanted to throw my thoughts out there and see if anyone could come up with major flaws in them or any suggestions about them. [Cross-posted on scoop-help as well. I wanted to hit as many people as possible.]

1) Archiving:
I've put this off for as long as possible, but I knew it was gonna have to come up some day. Scoop just can't let the database grow forever. K5 is starting to really slow down trying to manage a 400MB+ database, and shrinking that would really help in a lot of places. So, we need to be able, at some point, to get old stories, comments, and all the stuff related to them out of the database and stored elsewhere. Note that this is not necessarily a strategy to speed up the serving of archive pages by caching them in HTML! I don't like the Slashcode approach to archiving at all, because it means that once a page is archived, the design is frozen forever. My thought is just that we need to get the data out of the database.
So, what I have in mind right now is to, at some point when archiving is deemed necessary, fetch all the relevant data for a story, and save it in a plain-text format in a file somewhere. I'd like to have the archive store *only* the data, which could then be shown to a user by translating it into data structures like those used by the normal page rendering now, and filling those in to whatever templates you're using, just like a story from the database.
What I don't know, at this point, is what format they should be in. Someone's gonna say XML, which is a valid option, but not one I'm really inclined to go with. There's a lot of parsing overhead associated with XML, and for a language like Perl, which is all about text processing anyway, it strikes me as unnecessary. I've considered just using 'Storable' for this, since, from a coding perspective, it makes things real easy. To archive, we'd just freeze the comment thread structure and the story object. To render, there'd be an if just before the normal rendering happens, where one path (for stories in the database) would do the normal SQL stuff, and the other (for archived stories) would retrieve the hashrefs from the archive. Rendering would then proceed as normal, except with some checks to prevent new comment posting and what have you.
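A rough sketch of that freeze/retrieve branch, written in Python with `pickle` standing in for Perl's Storable. The function names, the one-file-per-story layout, and the dict shapes for the story and comment thread are all assumptions made for illustration; none of this is Scoop's actual code.

```python
import pickle
from pathlib import Path

def archive_path(archive_root: Path, sid: str) -> Path:
    # Hypothetical layout: one file per story, with the slashes in the sid flattened.
    return archive_root / (sid.replace("/", "_") + ".archive")

def archive_story(archive_root: Path, story: dict, comments: list) -> Path:
    """Freeze the story and its comment thread into a single file."""
    path = archive_path(archive_root, story["sid"])
    with open(path, "wb") as fh:
        pickle.dump({"story": story, "comments": comments}, fh)
    return path

def load_story(archive_root: Path, sid: str, fetch_from_db):
    """Return (story, comments, archived) ready for the normal renderer.

    fetch_from_db is a placeholder for the existing SQL path; it is only
    called when no archive file exists for this sid.
    """
    path = archive_path(archive_root, sid)
    if path.exists():
        with open(path, "rb") as fh:
            data = pickle.load(fh)
        return data["story"], data["comments"], True
    story, comments = fetch_from_db(sid)
    return story, comments, False
```

The `archived` flag is what the renderer would check to switch off comment posting. As with Storable, a format like this should only ever be read back from files the site itself wrote.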
The nifty thing about handling it this way would be that from a user's perspective, the difference between archived stories and normal stories would be minimal. You could still choose your preferred display mode for comments, and if you (the admin) had changed your site design, archives would always match the new design. The drawback would be that it'd be slower than static HTML. Considering that old stories aren't viewed very often, I think that's probably a worthwhile tradeoff.
From an administrative side, then, what you'd do is set a default "archive period", like, say, three months. You should be able to also specify whether that's a hard and fast time period, or whether you want to archive stuff that's gone X months without a new comment. When posting a story, there'd also be an option to exempt it from archiving.
Is it worth setting archive periods as a section property? I.e., you could have different time lengths for different sections? It may be. Another thing to note is that for hotlisting to work properly, we would need to leave a very basic entry in the story table, with just the sid and the title of a story. The hotlist display would need to distinguish between live and archived stories, and not try to count comments on archived stories. In fact, since we know that the count for archived stories will never change, that number could go in the bare-bones database record as well.
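The archive-period rule could look something like the following sketch. The function name, its parameters, and the 90-day stand-in for "three months" are assumptions, not existing Scoop settings.

```python
from datetime import datetime, timedelta
from typing import Optional

def should_archive(
    posted: datetime,
    last_comment: Optional[datetime],
    now: datetime,
    period: timedelta = timedelta(days=90),  # "three months", approximated
    use_last_comment: bool = False,          # False = hard cutoff from posting time
    exempt: bool = False,                    # per-story "never archive" option
) -> bool:
    """Decide whether a story is due for archiving."""
    if exempt:
        return False
    reference = posted
    if use_last_comment and last_comment is not None:
        # Inactivity mode: the clock restarts with every new comment.
        reference = max(posted, last_comment)
    return now - reference >= period
```

With the hard cutoff, a story posted four months ago is archived even if someone commented last week; in inactivity mode the same story stays live until the thread has been quiet for the full period.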
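As a small illustration of the bare-bones record idea, the hotlist could branch on an archived flag and use the stored count instead of querying comments. The field names (`archived`, `comment_count`) and the `count_live_comments` callback are hypothetical.

```python
def hotlist_entry(stub: dict, count_live_comments) -> str:
    """Build one hotlist line from a story-table row.

    stub holds sid and title for every story; archived stories also carry
    the comment count that was frozen when they were archived.
    """
    if stub["archived"]:
        count = stub["comment_count"]             # never changes again
    else:
        count = count_live_comments(stub["sid"])  # normal live query
    return "{} ({} comments)".format(stub["title"], count)
```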
Any other thoughts? Reasons why this wouldn't work? Things you really want, archiving-wise?

2) Disk caching for anonymous pages:
This just occurred to me today. It would help many sites if Scoop didn't have to regenerate pages over and over for anonymous visitors, since they're all identical anyway. This idea is less fleshed out than the above, so I'd like comments.
What I have in mind is something like this. Pages wouldn't be saved as fully-rendered HTML, but as the step just before Scoop calls page_out() and fills in the remaining template stuff. So, the process is like this:
A user requests a page. If the user is anonymous, then Scoop checks when the page was last cached, and compares that time to the saved cache_time for blocks, boxes, vars, and comment counts for any story related to this page (the related sid data would require a new table to store information relevant to page caching). If any of those things have been updated more recently than the page was cached, then render it as usual, and save it in the cache before sending it to the client.
If the cached page is newer than all the cache times for that stuff, then it's still up to date. Simply fetch the cached page, and send it off to page_out() for template processing.
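The check described above might be sketched like this. File modification time stands in for "the last time the page was cached", and `dependency_times` stands in for the saved cache_time values of blocks, boxes, vars, and comment counts; the function names and the use of plain epoch timestamps are assumptions.

```python
import os
from typing import Callable, Iterable

def is_fresh(cache_file: str, dependency_times: Iterable[float]) -> bool:
    """True if the cached page is newer than every dependency timestamp."""
    if not os.path.exists(cache_file):
        return False
    cached_at = os.path.getmtime(cache_file)
    return all(cached_at > t for t in dependency_times)

def serve_anonymous(cache_file: str,
                    dependency_times: Iterable[float],
                    render: Callable[[], str]) -> str:
    """Return the pre-template page, from cache if it is still up to date."""
    if is_fresh(cache_file, dependency_times):
        with open(cache_file, encoding="utf-8") as fh:
            return fh.read()
    page = render()  # worst case: full render from scratch
    with open(cache_file, "w", encoding="utf-8") as fh:
        fh.write(page)  # small extra cost of writing the cache
    return page
```

Whatever this returns would still be handed to page_out() for the final template pass, so the cached file is never the finished HTML.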
In effect, this would make the best-case speed for anonymous pages roughly equal to that of special pages (which are pretty fast). The worst-case would be the same as we have now (time to render a full page from scratch), plus a small overhead for writing the cached page to disk. I think overall, performance would improve, and also, this would be very good at helping with sudden traffic peaks from external linking (*cough*slashdot*cough*).
One other thought is that anonymous users can set comment display preferences, which are sticky with their session cookie. It would be easy to also check for these, and return the version of a story page that matches their preferences. So each story could have a cached version for each permutation of comment prefs that has been requested. The server path would look something like:
[htmlroot]/cache/2001/2/4/1205/4328_mixed_threaded_highest_oldest
[htmlroot]/cache/2001/2/4/1205/4328_topical_nested_dontcare_newest

...and so on. Only views that had been requested would be stored, and the majority, in the case of a sudden traffic spike, would be for the default view anyway.
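A minimal sketch of building those per-preference cache paths. The allowed values are only the ones that appear in the two example paths, so the real sets are surely larger; the meaning attached to each position and the choice of default view are guesses.

```python
import posixpath

# Values taken from the example paths only; the per-position labels are guesses.
ALLOWED = (
    {"mixed", "topical"},     # comment type
    {"threaded", "nested"},   # display mode
    {"highest", "dontcare"},  # rating order
    {"oldest", "newest"},     # date order
)
DEFAULT_PREFS = ("mixed", "threaded", "highest", "oldest")  # assumed default view

def cache_path(htmlroot: str, sid: str, prefs: tuple = DEFAULT_PREFS) -> str:
    """Map a story sid plus comment prefs to its cache file path."""
    if len(prefs) != len(ALLOWED):
        raise ValueError("expected %d comment preferences" % len(ALLOWED))
    for value, allowed in zip(prefs, ALLOWED):
        # Prefs come from a cookie, so reject anything unexpected
        # before it ends up in a filesystem path.
        if value not in allowed:
            raise ValueError("unknown comment preference: %r" % (value,))
    suffix = "_".join(prefs)
    return posixpath.join(htmlroot, "cache", sid + "_" + suffix)
```

Validating against a fixed set also bounds how many cached variants one story can ever have.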
Well. That's where I am right now. I think both of these strike a pretty good balance between our goals for Scoop to be flexible and easy to administer, and the need for some performance improvements. If you read all this, then thank you, and please let me know what you think! :-)


Story Views

66 Scoop users have viewed this story.

Performance and Caching Ideas | 20 comments (20 topical, 0 hidden)

Disk Caching (none / 0) (#1)
by janra on Sun Jul 01, 2001 at 12:28:02 PM PST

It would help many sites if Scoop didn't have to regenerate pages over and over for anonymous visitors, since they're all identical anyway

That would be nice... and not only for the sites that get lots of hits. I have Scoop running on a P133 (64MB RAM) that doubles as my desktop, and if I'm doing something like, say, compiling, or having Netscape load a particularly complex page, the time it takes for Apache to load itself back into memory, pull everything out of the database, assemble the page, and send it is too long.

"Too long" in this case is defined as "long enough for the connection to time out". I think the connection times out, because it puts the bytes sent as 0 in the access log, and I know it was loading Apache into memory and assembling the page because I heard my hard drive going nuts. Gotta love underpowered computers :-)

--
Discuss the art and craft of writing

Memory stuff by rusty, 07/01/2001 02:52:53 PM PST (none / 0)

PerlChildInitHandler by panner, 07/01/2001 04:24:17 PM PST (none / 0)
On second thought... by janra, 07/06/2001 09:47:53 PM PST (none / 0)

Why not leave it to squid? by Mathew Hennessy, 07/06/2001 02:32:11 PM PST (none / 0)

archive in another database? (none / 0) (#4)
by Delirium on Sun Jul 01, 2001 at 08:30:11 PM PST

Note: I know next to nothing about how MySQL works, so this may or may not be completely ridiculous.

Basically your idea about moving old stuff into textfiles, but as data, not static HTML, seems remarkably like you're moving the data out of one database into another database, only you're simulating the second DB yourself with textfiles (in which case why not just do that in the first place, like most weblog/discussion_board software, storing all data in textfiles by story and ditching SQL entirely?).

So as long as you're using a database to store stuff, why not move the 'archived' data out into another database instead of a simulated-by-textfiles pseudo-db? That way it'd still reduce the load on the main DB that handles 99% of the traffic.

Of course, either way has the problem of greatly complicating things like generating stats, and of making it impossible to restart discussion in old stories (occasionally spats of discussion do get restarted even in very old stories).

I join you by Mystic, 07/03/2001 08:44:16 PM PST (none / 0)
Same problem by panner, 07/04/2001 01:17:46 AM PST (none / 0)
Actual DB/system profiling? by Mathew Hennessy, 07/06/2001 02:30:38 PM PST (none / 0)

Farming (none / 0) (#5)
by bittondb on Mon Jul 02, 2001 at 12:21:55 PM PST

It has been my experience that the quickest way to performance is a web farm. Of course, a farm does bring along its own set of baggage. I realize that Scoop relies on sessions, but fortunately they're persisted in the DB, and therefore, as far as I know, if a user returns to a different server, their session data is not specific to the web server (unlike IIS). The session cookie that is sent down to the browser just needs to be a UUID that is persisted along with their data in the DB. I'm suggesting this because though a 1GHz/512MB/60GB machine is only $750, an F5 BigIP load balancer is ~$15,000.
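A small sketch of the session idea, with an in-memory SQLite database standing in for the shared DB that every web server in the farm would talk to. The `sessions` table, its columns, and the function names are hypothetical, not Scoop's schema.

```python
import sqlite3
import uuid
from typing import Optional

def open_shared_db() -> sqlite3.Connection:
    # Stand-in for the one database all web servers connect to.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sessions (id TEXT PRIMARY KEY, data TEXT)")
    return db

def create_session(db: sqlite3.Connection, data: str) -> str:
    """Store session data and return the UUID to send as the cookie value."""
    session_id = str(uuid.uuid4())
    db.execute("INSERT INTO sessions (id, data) VALUES (?, ?)", (session_id, data))
    db.commit()
    return session_id

def load_session(db: sqlite3.Connection, cookie_value: str) -> Optional[str]:
    """Any server can resolve the cookie, since the state lives in the DB."""
    row = db.execute(
        "SELECT data FROM sessions WHERE id = ?", (cookie_value,)
    ).fetchone()
    return row[0] if row else None
```

Because nothing is kept in the web server's own memory, it does not matter which machine handles the next request.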

instead of an F5 or Alteon/ArrowPoint/Cisco... by Mathew Hennessy, 07/06/2001 02:27:25 PM PST (none / 0)

Backup to a second database (none / 0) (#8)
by aquarius on Wed Jul 04, 2001 at 07:33:16 PM PST

Am I being particularly thick here? Why not just archive older data off to a second database with identical schema? If the data gets archived off to something other than an identical DB, then you've got two completely independent search routines. Having a second DB would mean that accesses to your primary DB are fast, because it's small (and I imagine that these accesses form the majority, as you say) and you can just keep all your search algorithms and apply them to both databases if someone runs some kind of a search that wants to see historical data. AFAICS, hiving stuff off to textfiles which are then available for searching means that you're essentially implementing your own textfile database, which will clearly not be as fast as a "real" DB (otherwise you'd be using it for everything already).
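To make the identical-schema idea concrete, here is a sketch using two SQLite databases in place of the live and archive MySQL databases. The single `stories` table and its columns are invented for the example; the point is only that one search routine can run unchanged against both.

```python
import sqlite3

# Invented schema, used for both the live and the archive database.
SCHEMA = "CREATE TABLE stories (sid TEXT PRIMARY KEY, title TEXT, posted TEXT)"

def move_to_archive(live: sqlite3.Connection,
                    archive: sqlite3.Connection,
                    cutoff: str) -> int:
    """Copy stories posted before cutoff (ISO date) to archive, then delete them."""
    rows = live.execute(
        "SELECT sid, title, posted FROM stories WHERE posted < ?", (cutoff,)
    ).fetchall()
    archive.executemany("INSERT INTO stories VALUES (?, ?, ?)", rows)
    archive.commit()
    live.execute("DELETE FROM stories WHERE posted < ?", (cutoff,))
    live.commit()
    return len(rows)

def search(dbs, term: str) -> list:
    """Run the same title search against each database, in the order given."""
    sql = "SELECT sid, title FROM stories WHERE title LIKE ? ORDER BY posted DESC"
    results = []
    for db in dbs:
        results.extend(db.execute(sql, ("%" + term + "%",)).fetchall())
    return results
```

A normal search would pass only the live connection; a search that wants historical data passes both.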

Aq.

ram by hurstdog, 07/06/2001 10:05:50 AM PST (none / 0)

Time Complexity v. Space Complexity by ramses0, 07/18/2001 12:43:52 PM PST (none / 0)

How about a mix? (3.00 / 1) (#14)
by tekk on Mon Jul 09, 2001 at 12:39:36 PM PST

How about keeping the stories in the database (for easy searching) and flushing the comments to the textfiles (for better storage)?

The stories are rather small in bytes when compared to the comments attached to them, so the database would shrink enormously -- and this way you've got _all_ the stories in the same place, and I think that above all, the stories are a bit more important than comments (not that the comments are _not_ important).

Well, anyway, it's my 0.02 of your local currency.

[tek.]

Hello **POSTGRES** (1.00 / 1) (#15)
by delmoi on Sat Jul 14, 2001 at 02:40:35 AM PST

Try running Scoop on a real database :P

port it by hurstdog, 07/14/2001 02:54:08 PM PST (5.00 / 1)

Learned the hard way (none / 0) (#18)
by nebby on Mon Jul 23, 2001 at 09:21:00 AM PST

On half-empty the stories are archived up to a month or so after you post them into static HTML files, one for each skin. This has really, really sucked. Any changes made to the templates are hence not seen in the archived posts, and if a new skin is added the older posts will not be able to be seen in that skin.

Whatever you do, don't archive to full HTML :) I noted that you don't plan on doing so, but I just wanted to make it clear that it's the wrong way to do things.

I'm currently writing new weblog software that builds upon my experiences with Glasscode. As far as solving the problem of older posts being slow to get to, basically moving them into a database which has data meant to be write once and read often is your best bet, I think. Building good indexes on that database would make it fly, I'd imagine. If you don't think this is the best option, XML would be best, for a few reasons. One notable advantage of saving as XML would be the potential for it to be hooked into an XML-RPC interface for retrieval or something. XML is just plain "good" and I'm betting you can get or write a fast little parser for your own XML format, since you can assume you only need basic parsing. If there are any other programs which need to access your data, they will be able to do so.

The anonymous thing is also very important. Unfortunately there are too many possibilities for an anonymous user to navigate to for me to cache all the pages, but I would bet that caching the front page would be a huge performance gain (especially when being /.'ed :))

Also, I might be doing this incorrectly, but my philosophy in designing the software is that I basically have infinite memory and processor speed and that the database and disk are the devil. Data is only accessed and written to disk when absolutely necessary. All data is cached in memory with timestamps, and when it is no longer being accessed it is discarded. The database also does its own thing like this, so that helps even more. The next time the data is requested, it will be reloaded from the DB if it has been cleared from the cache. By setting the timeout value, I can tweak to my memory requirements and minimize DB requests... theoretically :)
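One way to read that design is as an idle-timeout cache, sketched below. This is a guess at the general shape, not the actual weblog code: the class name, the `loader` callback standing in for a DB read, and the injectable clock are all assumptions.

```python
import time

class IdleCache:
    """In-memory cache that drops entries not accessed within `timeout` seconds."""

    def __init__(self, loader, timeout, clock=time.monotonic):
        self._loader = loader    # called on a miss, e.g. a database read
        self._timeout = timeout
        self._clock = clock      # injectable so the behaviour can be tested
        self._entries = {}       # key -> (value, last_access_time)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] <= self._timeout:
            value = entry[0]             # hit: no DB request
        else:
            value = self._loader(key)    # miss or expired: reload
        self._entries[key] = (value, now)  # refresh the access timestamp
        return value

    def purge(self):
        """Discard idle entries to free memory; returns how many were dropped."""
        now = self._clock()
        stale = [k for k, (_, t) in self._entries.items()
                 if now - t > self._timeout]
        for k in stale:
            del self._entries[k]
        return len(stale)
```

A longer timeout trades memory for fewer loads. The sketch does not cover invalidating an entry when the underlying data is written, which a real site would also need.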

front page (none / 0) (#19)
by janra on Tue Aug 14, 2001 at 08:09:51 PM PST

At the moment I would be so happy with a simple static front page HTML file that was dumped out when somebody without an account requested the front page. (2 tests, as far as I can tell, but I don't know what all needs to be loaded in before the test for UID can be done.) Updated... um, not sure about that.

I've noticed that the front page is the most complex, and takes the longest to load (except for stories with a silly number of comments, displayed in nested mode, but maybe that's just the sheer amount of text transferred that slows it down). Which makes sense to me; it has the most boxes, and it has the introtext and comment counts of a number of stories. The other pages are much simpler (e.g., one story, all comments for that story) and load faster, at least on my computer.

--
Discuss the art and craft of writing

performance and ideas (none / 0) (#20)
by Deanajohnson on Fri May 04, 2018 at 07:05:47 AM PST

I am working on a large-scale piece of software. It focuses on moving memory/data among huge numbers of complex models. At times the cache miss rate is extraordinarily high and the performance is poor, but the situation looks too complicated to me. I just want to get some dissertation help and general ideas on how to cut cache misses and improve memory performance. After reading this explanation, I realize performance and caching ideas are really important in every field.
