Let me say up front: I'm speaking from what I have understood from reading
about the API so far. I trust that the Gentle People on this mailing
list will politely discuss with me where I might be confused.
Tobias Weingartner wrote:
Friday, December 23, Ian Turner wrote:
>Friday 23 December 2005 02:21 pm, Tobias Weingartner wrote:
>>In both cases, user objects are one of: Regular files, directories, sockets,
>>pipes, named FIFOs, symlinks, or anything else the filesystem supports.
So it is an arbitrary "object". Why bother making the amanda server
aware of them? I mean, you can't seek to them anyways.
Actually, that isn't a given: there are some archive types, like tar
archives, which are seekable. If you examine the process
we currently use to generate the indexes (i.e. a copy of the
dump stream is sent to a "tar tvf -" to get a list of files),
you can also put a -R on there to get file offsets to include in the
index.
Now some formats (e.g. BSD "dump" format) don't have that 'seekable'
property, but that doesn't mean we can't support it for things that do.
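To make the -R idea concrete, here is a minimal sketch in Python of turning
such a listing into name/offset pairs. The line layout in the sample
("block N: " followed by the usual verbose listing fields) is my assumption
about what GNU tar prints with -R, and parse_listing is a made-up helper,
not anything that exists in Amanda.

```python
import re

# Assumed shape of one line of GNU tar "-tvR" output:
#   block 1: -rw-r--r-- user/group 1234 2005-12-23 14:21 ./etc/motd
LINE_RE = re.compile(
    r"^block (\d+):\s+(\S+)\s+(\S+)\s+(\d+)\s+(\S+)\s+(\S+)\s(.*)$"
)

TAR_BLOCK_SIZE = 512  # bytes per tar block


def parse_listing(text):
    """Map each member name to the byte offset of its header in the archive."""
    index = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip anything that does not look like a member entry
        block = int(m.group(1))
        name = m.group(7)
        index[name] = block * TAR_BLOCK_SIZE
    return index


# Hypothetical sample listing, not captured from a real dump.
sample = """\
block 0: drwxr-xr-x user/group         0 2005-12-23 14:21 ./etc/
block 1: -rw-r--r-- user/group      1234 2005-12-23 14:21 ./etc/motd
block 5: -rw-r--r-- user/group        10 2005-12-23 14:21 ./etc/issue
"""

for name, offset in parse_listing(sample).items():
    print(offset, name)
```

The offsets are just the block numbers multiplied by 512, so the index only
needs one extra integer per entry.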
The other reason to make a new set of terms is to consider the case of
a database dump format, where the "objects" are databases, tables, rows,
etc. You can use the "object" and "collection" terms for that sort of
backup as well, whereas calling them "files" and "directories" is just
wrong. If you design it generally enough up front, you won't have to
kluge around it so much later.
Either way, is a "collection" a bunch of "user objects" or not? What
relation do the two have?
>>
>>User objects are resident in a number of collections, but a collection does
>>not necessarily hold an entire user object. In the tar case, most collections
>>will only hold a fraction of an object.
So how do you envision the index to work then? Can you spell out in detail
what the index will look like?
The index will look very much like it does now, but there will be
important semantic differences. Where we now have pathnames that
map to files or directories, we will instead have object names that
will be interpreted by an appropriate Application via the Application
API, and we may add extra information like the offset in the archive
where the entry lives. The index currently lists files for incremental
dumps, where the dumpfile may only contain the changes relative to the
previous full dump image of the file.
The above makes no sense to me. It's nice to say that a "collection" is
one 512-byte block. But what does that mean? So in my backup I'm going
to have 200GB * 2 (there are 2 512KB blocks/MB) = 400'000 collections on
each tape?
>>
>>That's right. Though for convenience, we may group ~128 tar blocks into a
>>single 64K collection. Conceptually, there's no difference.
Actually I'm finding the tar block grouping distinction a little
confusing myself; but I'm hoping it's because I haven't really read up
on the listed-incremental format stuff that we really use to support
"partial" tar backups.
Who defines the collection? The client, or the server? And yes, there is a
huge difference between an indexable thing that is 512-bytes and one that is
64KB in size. It's 2 orders of magnitude.
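To put rough numbers on that difference, here is a back-of-the-envelope
calculation in Python. It assumes "200GB" means 200 x 2^30 bytes and that a
grouped collection is exactly 128 tar blocks of 512 bytes; both are my
assumptions for illustration. Note that the 400'000 figure quoted earlier
corresponds to 512KB units, not 512-byte tar blocks.

```python
GIB = 2**30          # assumed meaning of "GB" here
TAR_BLOCK = 512      # bytes per tar block
GROUP = 128          # tar blocks grouped into one 64K collection

dump_bytes = 200 * GIB

# One collection per 512-byte tar block.
per_block = dump_bytes // TAR_BLOCK

# One collection per group of 128 tar blocks (64 KiB).
per_group = dump_bytes // (TAR_BLOCK * GROUP)

# The same dump counted in 512 KiB units, for comparison.
per_512k = dump_bytes // (512 * 1024)

print(per_block)   # 419430400
print(per_group)   # 3276800
print(per_512k)    # 409600
```

So the grouping changes the number of indexable units by a factor of 128,
which is where the "2 orders of magnitude" comes from.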
Actually each Application supported via the API defines its own
collection type; that's kind of the point. The "tar" Application
defines a tarfile collection, an "advfsdump" Application defines a
SF1 ADVFS dumpfile type, etc. You could have several Application
types that are implemented with tar, but have different semantics
(e.g. a plain tar, or a tar with , or whatever).
Once the details are hashed out, this Application API interface may
actually distill down to a table with very little actual code for each
Application type; the point is to think about a suitably abstract model
of what it should be so that the interesting cases are all covered by it.
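As a rough illustration of the "table with very little code" idea, here is a
sketch in Python. Everything in it is hypothetical: the field names, the
ApplicationType record and the restore_plan helper are made up for this
message, and the only facts taken from the discussion are that tar archives
are seekable and BSD dump is not.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ApplicationType:
    name: str             # hypothetical Application name
    collection_type: str  # what this Application calls its collection
    seekable: bool        # can an index offset be used to skip ahead?


# The "table": one row per Application type, no per-type code.
APPLICATIONS = {
    app.name: app
    for app in (
        ApplicationType("tar", "tarfile", True),
        ApplicationType("dump", "dumpfile", False),
    )
}


def restore_plan(app_name: str, offset: Optional[int]) -> Tuple[str, int]:
    """Decide whether a restore can seek to an offset or must scan from 0."""
    app = APPLICATIONS[app_name]
    if app.seekable and offset is not None:
        return ("seek", offset)
    return ("scan", 0)


print(restore_plan("tar", 2560))
print(restore_plan("dump", 2560))
print(restore_plan("tar", None))
```

The generic code only ever asks the table questions like "is this seekable?",
so adding a new Application type would mean adding a row rather than writing
new logic.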
How do you know which tar blocks you need? Who knows this?
>>
>>The amanda server knows this; the client's Application API gathers the block
>>information from the output of gnutar -vR, and passes it to the server at
>>dump time.
Which is to say, you store block offset information in the index, so
just as you can look up whether a given file is present in the index, you
can look up its offset.
If you don't have an index, then you can't skip to the right offset.
Just like now, if you don't have an index, you can't tell if a given
file is on a given backup without reading the whole darned thing.
That doesn't mean the backup server can read the archive; it just means
it got suitable indexing information so it can send the client the parts
it wants.
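Here is a small sketch of that division of labour, using Python's standard
tarfile module on an in-memory, uncompressed archive. The function names are
hypothetical and none of this is Amanda code; it only assumes that the index
holds a header offset and a size for each member. The "server" cuts out a
byte range without interpreting it, and the "client" is the only side that
understands tar.

```python
import io
import tarfile

TAR_BLOCK_SIZE = 512


def build_archive(files):
    """Build an uncompressed tar archive in memory from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def build_index(archive):
    """Record (header offset, data size) per member, as an index might."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as tar:
        return {m.name: (m.offset, m.size) for m in tar}


def server_slice(archive, index, name):
    """Server side: cut out one member's bytes without parsing them."""
    offset, size = index[name]
    padded = -(-size // TAR_BLOCK_SIZE) * TAR_BLOCK_SIZE  # round up to a block
    return archive[offset:offset + TAR_BLOCK_SIZE + padded]


def client_restore(chunk):
    """Client side: interpret the slice as a tiny tar archive."""
    end_marker = bytes(2 * TAR_BLOCK_SIZE)  # two zero blocks end an archive
    with tarfile.open(fileobj=io.BytesIO(chunk + end_marker), mode="r:") as tar:
        member = tar.next()
        return member.name, tar.extractfile(member).read()


archive = build_archive({"etc/motd": b"hello\n", "etc/issue": b"Amanda\n"})
index = build_index(archive)
chunk = server_slice(archive, index, "etc/issue")
print(index["etc/issue"], len(chunk))
print(client_restore(chunk))
```

Without the index, the only way to find "etc/issue" would be to read the
archive from the start, which is the same situation we are in now.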