1) Archiving:
I've put this off for as long as possible, but I knew it was gonna have
to come up some day. Scoop just can't let the database grow forever. K5
is really starting to slow down trying to manage a 400MB+ database, and
shrinking that would really help in a lot of places. So, we need to be
able, at some point, to get old stories, comments, and all the stuff
related to them out of the database and stored elsewhere. Note that this
is not necessarily a strategy to speed up the serving of archive pages
by caching them in HTML! I don't like the Slashcode approach to
archiving at all, because it means that once a page is archived, the
design is frozen forever. My thought is just that we need to get the
data out of the database.
So, what I have in mind right now is to, at some point when archiving is
deemed necessary, fetch all the relevant data for a story, and save it
in a plain-text format in a file somewhere. I'd like to have the archive
store *only* the data, which could then be shown to a user by
translating it into data structures like those used by the normal page
rendering now, and filled in to whatever templates you're using just
like a story from the database.
What I don't know, at this point, is what format they should be in.
Someone's gonna say XML, which is a valid option, but not one I'm really
inclined to go with. There's a lot of parsing overhead associated with
XML, and for a language like Perl, which is all about text processing
anyway, it strikes me as unnecessary. I've considered just using
'Storable' for this, since, from a coding perspective, it makes things
real easy. To archive, we'd just freeze the comment thread structure and
the story object. To render, there'd be an if just before the normal
rendering happens, where one path (for stories in the database) would do
the normal SQL stuff, and the other (for archived stories) would
retrieve the hashrefs from the archive. Rendering would then proceed as
normal, except with some checks to prevent new comment posting and what
have you.
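To make the freeze/thaw idea concrete, here is a minimal sketch of the two paths. It is written in Python purely for illustration (Scoop itself is Perl, where 'Storable' would play the role that pickle plays here). The file layout, the function names, and the fetch_from_db callback are all assumptions, not existing Scoop code.

```python
import pickle
from pathlib import Path


def archive_path(archive_dir, sid):
    # Sids contain slashes (e.g. "2001/2/4/1205/4328"); flatten them so
    # each story maps to a single file. This layout is an assumption.
    return Path(archive_dir) / (sid.replace("/", "_") + ".archive")


def archive_story(archive_dir, sid, story, comments):
    """Serialize the story record and its comment thread to one file."""
    path = archive_path(archive_dir, sid)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as fh:
        pickle.dump({"story": story, "comments": comments}, fh)
    return path


def load_story(archive_dir, sid, fetch_from_db):
    """Return (story, comments, archived) for the renderer.

    fetch_from_db stands in for the normal SQL path; it should return
    (story, comments), or None when the story has left the database.
    """
    live = fetch_from_db(sid)
    if live is not None:
        story, comments = live
        return story, comments, False
    with open(archive_path(archive_dir, sid), "rb") as fh:
        data = pickle.load(fh)
    # archived=True lets the renderer disable comment posting and the like.
    return data["story"], data["comments"], True
```

The renderer gets the same data structures either way. The only thing it has to look at is the archived flag.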
The nifty thing about handling it this way would be that from a user's
perspective, the difference between archived stories and normal stories
would be minimal. You could still choose your preferred display mode for
comments, and if you (the admin) had changed your site design, archives
would always match the new design. The drawback would be that it'd be
slower than static HTML. Considering that old stories aren't viewed very
often, I think that's probably a worthwhile tradeoff.
From an administrative side, then, what you'd do is set a default
"archive period", say, three months. You should also be able to
specify whether that's a hard and fast time period, or whether you want
to archive stuff that's gone X months without a new comment. When
posting a story, there'd also be an option to exempt it from archiving.
Is it worth setting archive periods as a section property? I.e. you
could have different time lengths for different sections? It may be.
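The policy described above could be sketched roughly like this. It is Python for illustration only, and the dict keys ('posted', 'last_comment', 'section', 'exempt'), the parameter names, and the 90-day stand-in for "three months" are all assumptions.

```python
from datetime import datetime, timedelta


def is_due_for_archiving(story, now, default_days=90, section_days=None,
                         use_last_comment=False):
    """Decide whether a story should be archived.

    story is a dict with hypothetical keys: 'posted' (datetime),
    'last_comment' (datetime or None), 'section', and 'exempt'.
    """
    if story.get("exempt"):
        return False
    # A per-section period, if configured, overrides the site default.
    days = (section_days or {}).get(story.get("section"), default_days)
    reference = story["posted"]
    if use_last_comment and story.get("last_comment"):
        # "Idle" mode: the clock restarts with each new comment.
        reference = max(reference, story["last_comment"])
    return now - reference >= timedelta(days=days)
```

With use_last_comment off this is the hard and fast period. With it on, a story is only archived once it has gone that long without a new comment.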
Another thing to note is that for hotlisting to work properly, we would
need to leave a very basic entry in the story table, with just the sid
and the title of a story. The hotlist display would need to distinguish
between live and archived stories, and not try to count comments on
archived stories. In fact, since we know that the count for archived
stories will never change, that number could go in the bare-bones
database record as well.
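A rough sketch of the bare-bones record idea, using SQLite from Python just to have something runnable. The table and column names here are made up for the example and are not Scoop's real schema.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a made-up, simplified schema."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE stories (sid TEXT PRIMARY KEY, title TEXT, body TEXT, "
        "archived INTEGER DEFAULT 0, commentcount INTEGER)"
    )
    db.execute("CREATE TABLE comments (sid TEXT, cid INTEGER, body TEXT)")
    return db


def strip_to_stub(db, sid):
    """Once a story is safely archived, keep only sid, title, and count."""
    count = db.execute(
        "SELECT COUNT(*) FROM comments WHERE sid = ?", (sid,)
    ).fetchone()[0]
    db.execute(
        "UPDATE stories SET body = NULL, archived = 1, commentcount = ? "
        "WHERE sid = ?",
        (count, sid),
    )
    db.execute("DELETE FROM comments WHERE sid = ?", (sid,))
    db.commit()


def hotlist_entry(db, sid):
    """Return the data a hotlist line needs, without counting for archives."""
    row = db.execute(
        "SELECT title, archived, commentcount FROM stories WHERE sid = ?",
        (sid,),
    ).fetchone()
    if row is None:
        return None
    title, archived, stored_count = row
    if archived:
        count = stored_count  # frozen at archive time, never changes
    else:
        count = db.execute(
            "SELECT COUNT(*) FROM comments WHERE sid = ?", (sid,)
        ).fetchone()[0]
    return {"sid": sid, "title": title, "archived": bool(archived),
            "comments": count}
```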
Any other thoughts? Reasons why this wouldn't work? Things you really
want, archiving-wise?
2) Disk caching for anonymous pages:
This just occurred to me today. It would help many sites if Scoop didn't
have to regenerate pages over and over for anonymous visitors, since
they're all identical anyway. This idea is less fleshed out than the
above, so I'd like comments.
What I have in mind is something like this. Pages wouldn't be saved as
fully rendered HTML, but in the state they reach just before Scoop calls page_out()
and fills in the remaining template stuff. So, the process is like this:
A user requests a page. If the user is anonymous, Scoop checks when the
page was last cached and compares that time to the saved cache_time for
blocks, boxes, vars, and comment counts for any story related to this
page (tracking the related sids would require a new table to store
information relevant to page caching). If any of those things has been
updated more recently than the page was cached, or the page has never
been cached, Scoop renders it as usual and saves it in the cache before
sending it to the client.
If the cached page is newer than all the cache times for that stuff,
then it's still up to date. Simply fetch the cached page, and send it
off to page_out() for template processing.
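The freshness check above boils down to something like the following sketch. It is Python for illustration; the dict standing in for the disk cache, the render callback, and all the names are assumptions. Treating an exactly equal timestamp as stale is my own conservative choice.

```python
def serve_anonymous_page(page_key, dependency_times, cache, render, now):
    """Return (body, was_cache_hit) for an anonymous request.

    dependency_times: cache_time values for the blocks, boxes, vars, and
    comment counts this page depends on.
    cache: a dict of page_key -> (cached_at, body), standing in for disk.
    render: callable producing the pre-page_out() page from scratch.
    """
    newest_change = max(dependency_times, default=None)
    entry = cache.get(page_key)
    if entry is not None:
        cached_at, body = entry
        # Fresh only if cached strictly after every dependency changed.
        if newest_change is None or cached_at > newest_change:
            return body, True
    # Stale or never cached: render as usual, then save before sending.
    body = render()
    cache[page_key] = (now, body)
    return body, False
```

Either way, the returned body would still go through page_out() for the final template pass.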
In effect, this would make the best-case speed for anonymous pages
roughly equal to that of special pages (which are pretty fast). The
worst-case would be the same as we have now (time to render a full page
from scratch), plus a small overhead for writing the cached page to
disk. I think overall, performance would improve, and also, this would
be very good at helping with sudden traffic peaks from external linking
(*cough*slashdot*cough*).
One other thought is that anonymous users can set comment display
preferences which are sticky with their session cookie. It would be easy
to also check for these, and return the version of a story page that
matches their preferences. So each story could have a cached version for
each combination of comment prefs that has been requested. The server
path would look something like:
[htmlroot]/cache/2001/2/4/1205/4328_mixed_threaded_highest_oldest
[htmlroot]/cache/2001/2/4/1205/4328_topical_nested_dontcare_newest
...and so on. Only views that had been requested would be stored, and
the majority, in the case of a sudden traffic spike, would be for the
default view anyway.
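Building those paths could look something like this sketch (Python for illustration). The pref key names in PREF_ORDER are invented for the example. The check on pref values is my own addition: since they come from a cookie, they shouldn't be allowed to smuggle path characters into a filename.

```python
from pathlib import PurePosixPath

# Hypothetical pref names, in the order they appear in the filename.
PREF_ORDER = ("commenttype", "commentmode", "commentrating", "commentorder")


def cache_path(htmlroot, sid, prefs):
    """Map a story sid plus comment prefs to a cache file path."""
    parts = []
    for name in PREF_ORDER:
        value = prefs[name]
        # Pref values arrive via cookie; reject anything but letters/digits.
        if not value.isalnum():
            raise ValueError("bad value for %s: %r" % (name, value))
        parts.append(value)
    filename = sid + "_" + "_".join(parts)
    return str(PurePosixPath(htmlroot) / "cache" / filename)
```

Because the sid already contains slashes, the date portion becomes the directory structure for free.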
Well. That's where I am right now. I think both of these strike a pretty
good balance between our goals for Scoop to be flexible and easy to
administer, and the need for some performance improvements. If you read
all this, then thank you, and please let me know what you think! :-)