  • Application API proposal

    On Friday, December 23, Ian Turner wrote:
    >On Friday 23 December 2005 02:21 pm, Tobias Weingartner wrote:
    >>The "dump" version looks suspiciously like I said. User objects are the
    >>"things" within the "dump-file". Nothing new there. And the "gnu tar"
    >>version is slightly different. What is an "archive object"? Is that one
    >>of the "things" within the tar file? Is it the tar file itself?
    >
    >In both cases, user objects are one of: Regular files, directories, sockets,
    >pipes, named FIFOs, symlinks, or anything else the filesystem supports.

    OK, so it is an arbitrary "object". Why bother making the amanda server
    aware of them? I mean, you can't seek to them anyways.

    >>Either way, is a "collection" a bunch of "user objects" or not? What
    >>relation do the two have?
    >
    >User objects are resident in a number of collections, but a collection does
    >not necessarily hold an entire user object. In the tar case, most collections
    >will only hold a fraction of an object.

    So how do you envision the index to work then? Can you spell out in detail
    what the index will look like?

    >>The above makes no sense to me. It's nice to say that a "collection" is
    >>one 512-byte block. But what does that mean? So in my backup I'm going
    >>to have 200GB * 2 (there are 2 512KB blocks/MB) = 400'000 collections on
    >>each tape?
    >
    >That's right. Though for convenience, we may group ~128 tar blocks into a
    >single 64K collection. Conceptually, there's no difference.

    Who defines the collection? The client, or the server? And yes, there is a
    huge difference between an indexable thing that is 512-bytes and one that is
    64KB in size. It's 2 orders of magnitude.

    >>How do you know which tar blocks you need? Who knows this?
    >
    >The amanda server knows this; the client's Application API gathers the block
    >information from the output of gnutar -vR, and passes it to the server at
    >dump time.

    But, if the server knows which blocks a particular user object is contained
    in, what is preventing it from seeking to this point and only retrieving the
    needed information (modulo user-object formatting issues)? Your proposal states
    that seeking to the right place (knowing where the object is located) is not
    going to be possible for the server.

    >>How are they captured inside the API? Is the API implemented with a
    >>library? What does not require changes to manage tar options? Who invokes
    >>tar?
    >
    >The API will probably be implemented as a series of modules, with some common
    >library code to tie things together. In order to make changes to the tar
    >interface, you will only need to make changes to the tar module -- nowhere
    >else.

    OK, this is a start.

    >At present, changes must be made throughout the Amanda codebase, and are
    >generally fixed at configure time.

    So, what you're looking for, is an ABI for how the client and server talk
    to each other. What you really want is a different and more structured way
    to exchange information between the client and server.
    Once you have that, you can build an API on top to talk the ABI. And then
    you can start to convert the various backup programs to start using the API,
    such that things are more full of goodness :)

    >You don't really need a config file to deal with different versions of tar;
    >you can just detect it at runtime, and then invoke the proper Application API
    >module.

    You're going to "find" the "right" tool, tar, etc without hardcoding it
    at some place? That's usually a disaster waiting to happen.

    >>How will it exist on the client? As a library? How do the various
    >>programs (dump, tar, etc) get called? Do you envision a stub program to
    >>call non-api conforming backup/restore programs?
    >
    >There will be an Application API module for every interesting dumping
    >application. That module will be responsible for implementing each of the
    >operations defined on the wiki page. There are a lot of different ways to
    >take things from there, but certainly the Application API module will need to
    >reform the output from each of the different tools into a common format.

    Which wiki page defines these operations? I don't just mean the simple
    "outputs the backup" type of thing, but a detailed, format specific thing.
  • No.1 | | 5255 bytes | |

    On Friday 23 December 2005 03:21 pm, Tobias Weingartner wrote:
    OK, so it is an arbitrary "object".

    Yup. Hence the use of the word "object". ^_^

    Why bother making the amanda server
    aware of them? I mean, you can't seek to them anyways

    User objects are important because they are what the user is interested in.
    When you go to restore, you don't want to restore some blocks, you want to
    restore a file. So we want to have the Amanda infrastructure in place to
    recover a single user object, even though that may require retrieving several
    collections from media.

    In other words, think of the restore like this:

    1) User finds the objects they are interested in.
    2) Amrecover passes that object list to the Amanda server.
    3) Amanda server looks up the associated collections.
    4) Amanda server retrieves those collections from media, and passes them to
    amrecover.
    5) Amrecover uses the Application API to extract the user's objects from the
    collections.
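
    The five steps above can be sketched roughly as follows. This is only an
    illustration: the index contents, the function names and the fake media
    dictionary are all made up here and are not part of the proposal.

```python
# Hypothetical in-memory index; the proposal does not fix any of these names.
OBJECT_MAP = {  # user object -> IDs of the collections holding pieces of it
    "/etc/passwd": [3],
    "/var/log/big.log": [4, 5, 9],
}
# Stand-in for the media: collection ID -> raw bytes of that collection.
MEDIA = {cid: f"<data of collection {cid}>".encode() for cid in range(12)}


def lookup_collections(objects):
    """Step 3: the server maps requested objects to the collections to read."""
    needed = set()
    for name in objects:
        needed.update(OBJECT_MAP[name])
    return sorted(needed)


def retrieve(collection_ids):
    """Step 4: the server reads only those collections from media."""
    return {cid: MEDIA[cid] for cid in collection_ids}


def restore(objects):
    """Steps 2-5 end to end, as seen from amrecover."""
    ids = lookup_collections(objects)
    payload = retrieve(ids)
    # Step 5 would hand `payload` to the tool-specific Application API
    # extractor; this sketch only reports how many bytes were shipped.
    return ids, sum(len(chunk) for chunk in payload.values())
```

    Note that the server never interprets the collection contents; it only
    needs the object-to-collection map.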

    So how do you envision the index to work then? Can you spell out in detail
    what the index will look like?

    I can propose something, but it's certainly not fixed in stone.

    One possibility is to have several plain-text files for each job:
    -- User object list (consists of object ID and name)
    -- User object map (maps user objects to a set of collections)
    -- Collection-media map (tells you which collections are stored at what offset
    in which files on which media)

    Alternatively, there may be performance advantages to using something slightly
    more relational, such as Berkeley DB or SQLite.
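
    To make the relational variant concrete, here is a minimal sketch using
    Python's sqlite3 module. The table names, column names and sample rows
    are assumptions for illustration only; nothing in the proposal fixes a
    schema.

```python
import sqlite3

# Hypothetical schema mirroring the three maps listed above.
SCHEMA = """
CREATE TABLE user_object (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE object_collection (object_id INTEGER, collection_id INTEGER);
CREATE TABLE collection_media (
    collection_id INTEGER PRIMARY KEY,
    media_label TEXT, file_number INTEGER, byte_offset INTEGER
);
"""


def build_index():
    """Create an in-memory index with one object spread over two collections."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.execute("INSERT INTO user_object VALUES (1, '/home/ian/notes.txt')")
    db.executemany("INSERT INTO object_collection VALUES (?, ?)", [(1, 7), (1, 8)])
    db.executemany(
        "INSERT INTO collection_media VALUES (?, ?, ?, ?)",
        [(7, "DAILY-03", 2, 458752), (8, "DAILY-03", 2, 524288)],
    )
    return db


def locate(db, name):
    """Return (media_label, file_number, byte_offset) for each collection of an object."""
    rows = db.execute(
        """SELECT m.media_label, m.file_number, m.byte_offset
           FROM user_object o
           JOIN object_collection oc ON oc.object_id = o.id
           JOIN collection_media m ON m.collection_id = oc.collection_id
           WHERE o.name = ? ORDER BY m.byte_offset""",
        (name,),
    )
    return rows.fetchall()
```

    The same three lookups would work against plain-text files; the database
    mainly buys indexed joins once the number of collections gets large.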

    Who defines the collection? The client, or the server?

    The Application API plugin for the particular dump tool determines what
    constitutes a collection. That runs on the client.

    And yes, there is
    a huge difference between an indexable thing that is 512-bytes and one that
    is 64KB in size. It's 2 orders of magnitude

    There is a difference in terms of numbers, but not so much in terms of
    concepts. In general, we will support collections of any size, but recommend
    that the number of collections be on the order of the number of user objects.
    If collections are too big, restore will be slower. If they are too small,
    the index will take up a lot of space.
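
    A quick back-of-envelope calculation shows the trade-off. The 200 GB dump
    size is taken from the example earlier in the thread; treating it as
    decimal gigabytes and the helper name are my assumptions.

```python
def collection_count(dump_bytes, collection_bytes):
    """Number of collections needed to cover a dump, rounding up."""
    return -(-dump_bytes // collection_bytes)  # integer ceiling division


DUMP_BYTES = 200 * 10**9          # assumed: 200 GB, decimal
PER_BLOCK = collection_count(DUMP_BYTES, 512)          # one tar block each
PER_64K = collection_count(DUMP_BYTES, 128 * 512)      # 128 blocks grouped
```

    With 512-byte collections the count lands in the hundreds of millions,
    while 64K collections bring it down to about three million, which is why
    the grouping matters for index size even if it changes nothing
    conceptually.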

    But, if the server knows which blocks a particular user object is contained
    in, what is preventing it from seeking to this point and only retrieving
    the needed information (modulo user-object formatting issues)? Your
    proposal states that seeking to the right place (knowing where the object
    is located) is not going to be possible for the server.

    There are two answers to this question.

    First, it's actually impossible because the user object is not necessarily
    contiguous in the dump. Consider dump itself, which may very well dump data
    blocks as they occur in disk order, without regard to how that looks in the
    file itself. When you want to restore just one file, you may actually need
    data from all over the place.

    Second, even in cases where it is possible (like tar), you would need to know
    the details of the file format to accomplish this. Let me put it to you this
    way: Knowing nothing about the tar format, and without using the tar tool,
    are you prepared to extract a file from a tar archive with nothing but its
    name?

    At present, changes must be made throughout the Amanda codebase, and are
    generally fixed at configure time.

    So, what you're looking for, is an ABI for how the client and server talk
    to each other. What you really want is a different and more structured way
    to exchange information between the client and server.

    Once you have that, you can build an API on top to talk the ABI. And then
    you can start to convert the various backup programs to start using the
    API, such that things are more full of goodness :)

    If you want to think of things that way, be my guest. We will certainly want
    to settle on a protocol for how the server and client talk to each other, but
    I don't think there is any plan to support legacy dumper code and Application
    API dumper code at the same time, except possibly on different clients.

    You're going to "find" the "right" tool, tar, etc without hardcoding it
    at some place? That's usually a disaster waiting to happen.

    Yes. After all, that's what you do anytime you type "tar" into your shell. But
    if that's a problem, it should be very easy to change the search path in the
    driver.

    Look at it this way: Right now it is compiled in. If you want to stick with
    that (inflexible) behavior, it's not hard to do. Those who want more
    flexibility and less maintenance, can let it autodetect.
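
    Both behaviors fit in a few lines. The sketch below is hypothetical (the
    function name, the candidate list and the "pinned" parameter are mine);
    it only relies on Python's standard shutil.which for the search.

```python
import shutil


def find_tool(candidates=("gtar", "gnutar", "tar"), search_path=None, pinned=None):
    """Return the path of the dump tool to run, or None if nothing is found.

    pinned      -- a path fixed by the administrator (the "compiled in" case)
    search_path -- a PATH-style string to search instead of the environment's
    """
    if pinned is not None:
        return pinned  # inflexible but predictable: no detection at all
    for name in candidates:
        found = shutil.which(name, path=search_path)
        if found:
            return found  # first candidate found on the search path wins
    return None
```

    Restricting search_path to a directory the administrator controls keeps
    autodetection from picking up an unexpected binary.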

    Which wiki page defines these operations? I don't just mean the simple
    "outputs the backup" type of thing, but a detailed, format specific thing.

    I don't understand what you are looking for. But probably it doesn't exist.
    This is just a preliminary proposal. Maybe you would like to suggest
    something?

    Cheers,
  • No.2 | | 5288 bytes | |

    Let me say up front, I'm speaking from what I understood from reading
    about the API so far, I trust that the Gentle People on this mailing
    list will politely discuss with me where I might be confused.

    Tobias Weingartner wrote:
    On Friday, December 23, Ian Turner wrote:
    >On Friday 23 December 2005 02:21 pm, Tobias Weingartner wrote:
    >>In both cases, user objects are one of: Regular files, directories, sockets,
    >>pipes, named FIFOs, symlinks, or anything else the filesystem supports.

    OK, so it is an arbitrary "object". Why bother making the amanda server
    aware of them? I mean, you can't seek to them anyways

    Actually, that isn't a given: there are some archive types, like tar
    archives, which are seekable. Consider the process we currently use to
    generate the indexes: a copy of the dump stream is sent to a
    "tar tvf -" to get a list of files. You can also put a -R on there to
    get file offsets to include in the index.

    Now some formats (e.g. BSD "dump" format) don't have that 'seekable'
    property, but that doesn't mean we can't support it for things that do.

    The other reason to make a new set of terms is to consider the case of
    a database dump format, where the "objects" are databases, tables, rows,
    etc. You can use the "object" and "collection" terms for that sort of
    backup as well, whereas calling them "files" and "directories" is just
    wrong. If you design it generally enough up front, you won't have to
    kluge around it so much later.

    Either way, is a "collection" a bunch of "user objects" or not? What
    relation do the two have?
    >>
    >>User objects are resident in a number of collections, but a collection does
    >>not necessarily hold an entire user object. In the tar case, most collections
    >>will only hold a fraction of an object.


    So how do you envision the index to work then? Can you spell out in detail
    what the index will look like?

    The index will look very much like it does now, but there will be
    important semantic differences. Where we now have pathnames that
    map to files or directories, we will instead have object names that
    will be interpreted by an appropriate Application via the Application
    API, and we may add extra information like the offset in the archive
    where the entry lives. The index currently lists files for incremental
    dumps, where the dumpfile may only contain the changes relative to the
    previous full dump image of the file.

    The above makes no sense to me. It's nice to say that a "collection" is
    one 512-byte block. But what does that mean? So in my backup I'm going
    to have 200GB * 2 (there are 2 512KB blocks/MB) = 400'000 collections on
    each tape?
    >>
    >>That's right. Though for convenience, we may group ~128 tar blocks into a
    >>single 64K collection. Conceptually, there's no difference.


    Actually I'm finding the tar block grouping distinction a little
    confusing myself; but I'm hoping it's because I haven't really read up
    on the listed-incremental format stuff that we really use to support
    "partial" tar backups.

    Who defines the collection? The client, or the server? And yes, there is a
    huge difference between an indexable thing that is 512-bytes and one that is
    64KB in size. It's 2 orders of magnitude

    Actually each Application supported via the API defines its own
    collection type, that's kind of the point; the "tar" Application
    defines a tarfile collection, an "advfsdump" Application defines an
    OSF1 ADVFS dumpfile type, etc. One could have several Application
    types that are implemented with tar, but have different semantics.
    (i.e. a plain tar, or a tar with , or whatever).

    Once the details are hashed out, this Application API interface may
    actually distill down to a table with very little actual code for each
    Application type; the point is to think about a suitably abstract model
    of what it should be so that the interesting cases are all covered by it.

    How do you know which tar blocks you need? Who knows this?
    >>
    >>The amanda server knows this; the client's Application API gathers the block
    >>information from the output of gnutar -vR, and passes it to the server at
    >>dump time.


    Which is to say, you store block offset information in the index, so
    just as you can look up whether a given file is present in the index, you
    can look up its offset.
    If you don't have an index, then you can't skip to the right offset.

    Just like now, if you don't have an index, you can't tell if a given
    file is on a given backup without reading the whole darned thing.
    That doesn't mean the backup server can read the archive; it just means
    it got suitable indexing information so it can send the client the parts
    it wants.
  • No.3 | | 3472 bytes | |

    On Wed, Dec 28, 2005 at 10:31:04AM -0600, Marc W. Mengel wrote:

    Let me say up front, I'm speaking from what I understood from reading
    about the API so far, I trust that the Gentle People on this mailing
    list will politely discuss with me where I might be confused.

    I'm in complete agreement there, Marc. I suspect/feel
    that substantial discussion is still needed.

    Tobias Weingartner wrote:
    On Friday, December 23, Ian Turner wrote:
    >On Friday 23 December 2005 02:21 pm, Tobias Weingartner wrote:
    >>In both cases, user objects are one of: Regular files, directories,
    >>sockets, pipes, named FIFOs, symlinks, or anything else the filesystem
    >>supports.
    >
    >OK, so it is an arbitrary "object". Why bother making the amanda server
    >aware of them? I mean, you can't seek to them anyways


    Actually, that isn't a given: there are some archive types, like tar
    archives, which are seekable. Consider the process we currently use to
    generate the indexes: a copy of the dump stream is sent to a
    "tar tvf -" to get a list of files. You can also put a -R on there to
    get file offsets to include in the index.

    Now some formats (e.g. BSD "dump" format) don't have that 'seekable'
    property, but that doesn't mean we can't support it for things that do.

    FWIW, gnutar seems to call them "records" rather than blocks,
    hence the "-R" option. Two questions I have about this new
    capability the API envisions (the server extracting and
    returning individual files/objects to the client):

    As Marc notes, dump does not support a seekable archive.
    Are there other "backup programs" that are envisioned for
    use with the new API that are likely to support seekable
    archives or is this new capability, and all its attending
    changes needed to support it, being done just for gnutar?

    I played a bit with gnutar and the -R. Sure enough I was
    able to extract the record corresponding to a file by doing
    a dd command for the desired blocks. The resulting output
    was not just the file however. It seemed to be a tar header,
    some null padding, the file, more null padding. I'm assuming
    this "mini-gtar-archive" would be returned to the client and
    run through a gtar extraction on that host?
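
    The experiment can be reproduced without a tape. The sketch below uses
    Python's tarfile module in place of "gnutar -R" and dd (that substitution,
    and all function names, are mine): it builds a small archive in memory,
    cuts out the byte range of one member, and extracts from the slice alone.

```python
import io
import tarfile

BLOCK = 512  # tar block size in bytes


def build_archive(files):
    """Create an uncompressed GNU-format tar archive in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def member_span(archive, name):
    """Return (start, end) byte offsets covering the header and padded data."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as tar:
        member = tar.getmember(name)
        padded = -(-member.size // BLOCK) * BLOCK  # data is padded to a block
        return member.offset, member.offset_data + padded


def extract_from_slice(chunk, name):
    """Treat the slice as a mini archive and extract one member from it."""
    mini = chunk + b"\0" * (2 * BLOCK)  # append an end-of-archive marker
    with tarfile.open(fileobj=io.BytesIO(mini), mode="r:") as tar:
        return tar.extractfile(name).read()
```

    The slice is exactly what was observed with dd: a header block, the file
    data, and null padding up to the next block boundary, which an ordinary
    tar reader accepts as a small archive.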

    Actually each Application supported via the API defines its own
    collection type, that's kind of the point; the "tar" Application
    defines a tarfile collection, an "advfsdump" Application defines an
    OSF1 ADVFS dumpfile type, etc. One could have several Application
    types that are implemented with tar, but have different semantics.
    (i.e. a plain tar, or a tar with , or whatever).

    Once the details are hashed out, this Application API interface may
    actually distill down to a table with very little actual code for each
    Application type; the point is to think about a suitably abstract model
    of what it should be so that the interesting cases are all covered by it.

    Once you get away from gnutar I'm a little dubious about how much
    new capability the server could provide given the possibility of
    different server and client OSs. If I can't run advfs tools on
    the server, then isn't the best I can hope for to have amanda
    store and return archives from the client? Which it does now.
  • No.4 | | 1954 bytes | |

    On Wednesday 28 December 2005 02:03 pm, Jon LaBadie wrote:
    As Marc notes, dump does not support a seekable archive.
    Are there other "backup programs" that are envisioned for
    use with the new API that are likely to support seekable
    archives or is this new capability, and all its attending
    changes needed to support it, being done just for gnutar?

    The answer is "yes and no". If you think only in terms of seekable archives,
    then probably not, except maybe for other similar tools like star or zip. But
    don't think of seekable archives, think of collections.

    There are a lot of tools that will generate multiple collections for one job,
    the most obvious being any database dumper (one collection / one table) or
    whatever tool we end up using to handle Windows natively.

    I'm assuming
    this "mini-gtar-archive" would be returned to the client and
    run through a gtar extraction on that host?

    Right. You take the collections (a set of contiguous tar records) and give
    them to the Application API on the client to extract.

    Once you get away from gnutar I'm a little dubious about how much
    new capability the server could provide given the possibility of
    different server and client OSs. If I can't run advfs tools on
    the server, then isn't the best I can hope for to have amanda
    store and return archives from the client? Which it does now.

    The big change here is that even if we restore on the client, we don't want to
    ship the entire archive there, just the bits we're interested in. Just
    because you can't run the tools on the server, doesn't mean that you need to
    ship the entire archive to the client. I don't know anything about advfs, but
    it's perfectly plausible that the server knows nothing of pg_restore, but
    nonetheless can transmit only the table that the user is interested in.

    Cheers,
  • No.5 | | 3228 bytes | |

    On Wednesday 28 December 2005 11:31 am, Marc W. Mengel wrote:
    The other reason to make a new set of terms is to consider the case of
    a database dump format, where the "objects" are databases, tables, rows,
    etc. You can use the "object" and "collection" terms for that sort of
    backup as well, whereas calling them "files" and "directories" is just
    wrong. If you design it generally enough up front, you won't have to
    kluge around it so much later.

    Yup, that's exactly right. For the same reason, I like to avoid the word
    seekable in favor of thinking about collections, which may be nonadjacent.

    Actually I'm finding the tar block grouping distinction a little
    confusing myself; but I'm hoping it's because I haven't really read up
    on the listed-incremental format stuff that we really use to support
    "partial" tar backups.

    If it's confusing, then don't think about it; it's not an important point.

    The API proposal as written will work fine with 1 tar block = 1 collection.
    But since there has to be index information for each collection, it may be
    inefficient to keep track of each chunk of 512 bytes. So instead we can work
    with 128 blocks at a time, or 64K of data. On reading, we just group the
    first 128 blocks into the first collection, the next 128 blocks into the
    second collection, etc. On restore, if you give tar extra blocks it will
    never complain.
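
    In code, the grouping is just integer division. A small sketch, with
    function names of my own choosing:

```python
TAR_BLOCK = 512               # bytes per tar block
BLOCKS_PER_COLLECTION = 128   # 128 * 512 bytes = 64K per collection


def collection_of(block_number):
    """Collection index that holds a given tar block (both zero-based)."""
    return block_number // BLOCKS_PER_COLLECTION


def collections_for(first_block, block_count):
    """All collections needed to cover a run of consecutive tar blocks."""
    last_block = first_block + block_count - 1
    return list(range(collection_of(first_block), collection_of(last_block) + 1))
```

    An object occupying blocks 120 through 139 straddles the boundary, so
    restoring it means fetching collections 0 and 1 and letting tar skip the
    extra blocks.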

    In general, we say that we prefer to have a number of collections on the order
    of the number of objects. But it is only a preference, not a requirement.

    One could have several Application
    types that are implemented with tar, but have different semantics.
    (i.e. a plain tar, or a tar with , or whatever).

    Yes, but there's probably a better way. One likely possibility is providing a
    generic name/value configuration interface on the server, that is sent to the
    client at dump time.

    Once the details are hashed out, this Application API interface may
    actually distill down to a table with very little actual code for each
    Application type; the point is to think about a suitably abstract model
    of what it should be so that the interesting cases are all covered by it.

    Unfortunately, it's more difficult than that. See the Dumper API proposal for
    some details on why you need to special-case most tools.

    Which is to say, you store block offset information in the index, so
    just as you can look up whether a given file is present in the index, you
    can look up its offset.
    If you don't have an index, then you can't skip to the right offset.

    Sort of. Don't think of block offsets, think of sets of collections. It's
    entirely possible that the data you need to recover a particular object is
    not contiguously located in the archive.

    As for the indexes, you have two options:
    1) Rebuild the index from tape (hence the index-from-image operation); or
    2) Provide *all* the collections to the Application API tool to restore.

    While we should support #2 for some bare-metal cases, in general I view #1 as
    being the preferred method.

    Cheers,
  • No.6 | | 1637 bytes | |

    Jon LaBadie wrote:

    Once you get away from gnutar I'm a little dubious about how much
    new capability the server could provide given the possibility of
    different server and client OSs. If I can't run advfs tools on
    the server, then isn't the best I can hope for to have amanda
    store and return archives from the client? Which it does now.

    For small restores, the ability for the server to deliver just
    the chunk or chunks of the archive that are needed to the client
    is a big performance win. The other real win I see here is that
    the "right" shell script can be put on the tape headers to unpack
    the file.

    Let's take for example mysqldump. You could envision a mysqldump-based
    application that dumps database tables, and a table of contents filter
    that takes the mysqldump output, lists the tables and their file
    offsets, and now it also can do individual table restores efficiently.
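
    Such a table-of-contents filter could be quite small. The sketch below
    assumes the dump contains the "-- Table structure for table `name`"
    comment lines that mysqldump normally writes before each table; the
    function name and the sample data are hypothetical.

```python
import io
import re

# Assumed marker: the comment line mysqldump emits ahead of each table.
TABLE_MARKER = re.compile(rb"^-- Table structure for table `(?P<name>[^`]+)`")


def table_offsets(stream):
    """Scan a binary dump stream and map each table name to its byte offset."""
    offsets = {}
    position = 0
    for line in stream:  # lines keep their trailing newline
        match = TABLE_MARKER.match(line)
        if match:
            offsets[match.group("name").decode()] = position
        position += len(line)
    return offsets


SAMPLE_DUMP = (
    b"-- MySQL dump\n--\n"
    b"-- Table structure for table `users`\n--\n"
    b"CREATE TABLE `users` (id int);\n"
    b"-- Table structure for table `orders`\n--\n"
    b"CREATE TABLE `orders` (id int);\n"
)
SAMPLE_OFFSETS = table_offsets(io.BytesIO(SAMPLE_DUMP))
```

    The offsets would go into the index at dump time, so a single-table
    restore only needs the byte range from one table's marker up to the next.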

    And the tape could say something like
    dd if=/dev/rmt0 bs=32 skip=1 | mysql
    for the restore command, rather than
    dd if=/dev/rmt0 bs=32 skip=1 | gnutar_wrapper_does_mysql

    That to me is the big win of any sort of Application/Dumper API --
    making it generic enough to support really different sorts of
    utilities; aside from supporting more than just two -- gnutar or dump.

    And of course, even for advfsdump/ufsdump type utilities, with
    sufficiently clever coding you could still specify a couple of block
    ranges that would be required to restore a file. Most of the time
    it would be considerably smaller than the whole dump image.
