Development

NAVIGATION
CATEGORIES
REFERRENCE
LINKS
  • Any interest in "merge" and "by" implementationsspecifically for sorted data?

    12 answers - 3988 bytes - related search similar search Add To My Delicious Add To My Stumble Upon Add To My Google Mark Add To My Facebook Add To My Digg Add To My Reddit

    Which version of R are you looking at? R-devel has
    omerge() works more efficiently when there are relatively few
    matches between the data frames (for example, for 1-1
    matching). The order of the result is changed for 'sort = FALSE'.
    Thu, 27 Jul 2006, Kevin B. Hendricks wrote:
    Hi Developers,
    I am looking for another new project to help me get more up to speed
    on R and to learn something outside of R internals. recent R
    issue I have run into is finding a fast implementations of the
    equivalent to the following SAS code:
    /* MDPC is an integer sort key made from two integer columns */
    MDPC = (MD * 100000) + PC;
    /* sort the dataset by the key */
    PRC SRT;
    BY MDPC;
    /* print out count and sum for each unique sort key (subgroup) */
    /* use of BY MDPC requires sorting that data set by MDPC first in SAS */
    PRC UNIVARIATE NPRINT;
    VAR MVE;
    BY MDPC;
    UTPUT UT=TMP0 N=XN SUM=XSUM;
    Easy to do in R but the problem is the data set this is being run on
    has 1,742,201 lines in it and takes up 196,868,713 bytes to store as
    character data. The sort key has easily has over 200,000 unique keys
    (if not twice that).
    My first R attempt was a simple
    # sort the data.frame gd and the sort key
    sorder <- order(MDPC)
    gd <- gd[sorder,]
    MDPC <- MDPC[sorder]
    attach(gd)
    # find the length and sum for each unique sort key
    XN <- by(MVE, MDPC, length)
    XSUM <- by(MVE, MDPC, sum)
    GRPS <- levels(as.factor(MDPC))
    Well the ordering and sorting was reasonably fast but the first "by"
    statement was still running 4 hours later on my machine (a dual 2.6
    gig with 4 gig of main memory). This same snippet of code in
    SAS running on a slower machine takes about 5 minutes of system time.
    I tried various simple R implementations of a "by_sorted" that I
    thought might help
    # walk sorted array once and keep track of beginning
    # and ending points for each unique sort key value in p
    # and run function fcn on that sub sequence in vector v
    # store the results in a vector
    by_sorted <- function(v, p, fcn) {
    key <- p[[1]]
    bp <- 1
    r <- NULL
    for (i in 2:length(p)) {
    if (key != p[[i]]) {
    r <- c(r,fcn(v[bp:i-1]))
    bp <- i
    key <- p[[i]]
    }
    }
    r <- c(r,fcn(v[bp:i]))
    }
    but they took "forever" to run also (read that I killed those
    attempts at 15 minutes of cpu time).
    I literally had the same issue when trying with "tapply".
    So unless it already exists someplace, I need a really fast
    implementation of "by" for very large sorted data sets (probably
    written in fortran or c) that will do the equivalent of what SAS does
    with its "proc univariate by" approach with close to the same
    performance. The code should only have to walk the array once (ie.
    be linear in time with the number of rows in the vector). I have
    similar issues with "merge" as well since merging data frames already
    sorted by the same sort key should be fast as well and does not
    appear to be.
    Before I jump into this and create my own "sorted large data set"
    versions of "by" and "merge", I wanted to be sure it would be of
    interest to others. If they work well and are well implemented (a
    big if since I am really still just learning this - the whole point
    of the project!) would something like this be of any interest for
    internal use of R? is this something too specialized?
    Is there an R function implemented in c or fortran that would make a
    good "model" to follow for implementing something like this?
    Would/should they be extensions of current implementations of "merge"
    and "by" with an additions of a sorted=TRUE (defaulting to FALSE)
    extra parameter.
    am I simply barking up the wrong tree here?
    Thanks,
    Kevin
    R-devel (AT) r-project (DOT) org mailing list
  • No.1 | | 751 bytes | |

    Hi,

    I was using my installed R which is 2.3.1 for the first tests. I
    moved to the r-devel tree (I svn up and rebuild everyday) for my "by"
    tests to see if it would work any better. I neglected to retest
    "merge" with the devel version.

    So it appears "merge" is already fixed and I just need to worry about
    "by".

    Jul 28, 2006, at 3:06 AM, Brian D Ripley wrote:

    Which version of R are you looking at? R-devel has

    omerge() works more efficiently when there are relatively few
    matches between the data frames (for example, for 1-1
    matching). The order of the result is changed for 'sort = FALSE'.

    Thanks,

    Kevin

    R-devel (AT) r-project (DOT) org mailing list
  • No.2 | | 1310 bytes | |

    There was a performance comparison of several moving average
    approaches here:

    The author of that message ultimately wrote the caTools R package
    which contains some optimized versions.

    Not sure if these results suggest anything of interest here but it
    would be interesting if various base routines could be sped up
    to the point that a simple idiom is competitive with the caTools versions.

    7/28/06, Kevin B. Hendricks <kevin.hendricks (AT) sympatico (DOT) cawrote:
    Hi,

    I was using my installed R which is 2.3.1 for the first tests. I
    moved to the r-devel tree (I svn up and rebuild everyday) for my "by"
    tests to see if it would work any better. I neglected to retest
    "merge" with the devel version.

    So it appears "merge" is already fixed and I just need to worry about
    "by".

    Jul 28, 2006, at 3:06 AM, Brian D Ripley wrote:

    Which version of R are you looking at? R-devel has

    o merge() works more efficiently when there are relatively few
    matches between the data frames (for example, for 1-1
    matching). The order of the result is changed for 'sort = FALSE'.
    --
    Thanks,

    Kevin

    R-devel (AT) r-project (DOT) org mailing list

    R-devel (AT) r-project (DOT) org mailing list
  • No.3 | | 1480 bytes | |

    Hi,

    There was a performance comparison of several moving average
    approaches here:

    Thanks for that link. It is not quite the same thing but is very
    similar.

    The idea is to somehow make functions that work well over small sub-
    sequences of a much longer vector without resorting to splitting the
    vector into many smaller vectors.

    In my particular case, the problem was my data frame had over 1
    million lines had probably over 500,000 unique sort keys (ie. think
    of it as an R factor with over 500,000 levels). The implementation
    of "by" uses "tapply" which in turn uses "split". So "split" simply
    ate up all the time trying to create 500,000 vectors each of short
    length 1, 2, or 3; and the associated garbage collection.

    I simple loop that walked the short sequence of values (since the
    data frame was already sorted) calculating what it needed, would work
    much faster than splitting the original vector into so very many
    smaller vectors (and the associated copying of data).

    That problem is very similar problem to the calculation of basic
    stats on a short moving window over a very long vector.

    The author of that message ultimately wrote the caTools R package
    which contains some optimized versions.

    I will look into that package and maybe use it for a model for what I
    want to do.

    Thanks,

    Kevin

    R-devel (AT) r-project (DOT) org mailing list
  • No.4 | | 2354 bytes | |

    "Kevin" == Kevin B Hendricks <kevin.hendricks (AT) sympatico (DOT) ca>
    on Fri, 28 Jul 2006 14:53:57 -0400 writes:

    []

    KevinThe idea is to somehow make functions that work well
    Kevinover small sub- sequences of a much longer vector
    Kevinwithout resorting to splitting the vector into many
    Kevinsmaller vectors.

    KevinIn my particular case, the problem was my data frame
    Kevinhad over 1 million lines had probably over 500,000
    Kevinunique sort keys (ie. think of it as an R factor with
    Kevinover 500,000 levels). The implementation of "by"
    Kevinuses "tapply" which in turn uses "split". So "split"
    Kevinsimply ate up all the time trying to create 500,000
    Kevinvectors each of short length 1, 2, or 3; and the
    Kevinassociated garbage collection.

    Not that I have spent enough time thinking about this thread's
    topic, but I have seen more than one case where using tapply()
    unnecessarily slowed down computations.
    I don't remember the details, but know that in one case, replacing
    tapply() by a few lines of code {one of which using lapply() IIRC},
    sped up that computation by a factor (of 2 ? or more?).

    I also vaguely remember that I thought about making tapply()
    faster, but came to the conclusion it could not be
    sped up quickly, because it works in a quite more general
    context than it was used in that application (and maybe yours?).

    KevinI simple loop that walked the short sequence of
    Kevinvalues (since the data frame was already sorted)
    Kevincalculating what it needed, would work much faster
    Kevinthan splitting the original vector into so very many
    Kevinsmaller vectors (and the associated copying of data).

    KevinThat problem is very similar problem to the
    Kevincalculation of basic stats on a short moving window
    Kevinover a very long vector.

    >The author of that message ultimately wrote the caTools R
    >package which contains some optimized versions.


    KevinI will look into that package and maybe use it for a
    Kevinmodel for what I want to do.

    KevinThanks,

    KevinKevin

    Kevin
    KevinR-devel (AT) r-project (DOT) org mailing list
    Kevin

    R-devel (AT) r-project (DOT) org mailing list
  • No.5 | | 3222 bytes | |

    Fri, 28 Jul 2006, Kevin B. Hendricks wrote:

    In my particular case, the problem was my data frame had over 1
    million lines had probably over 500,000 unique sort keys (ie. think
    of it as an R factor with over 500,000 levels). The implementation
    of "by" uses "tapply" which in turn uses "split". So "split" simply
    ate up all the time trying to create 500,000 vectors each of short
    length 1, 2, or 3; and the associated garbage collection.

    I simple loop that walked the short sequence of values (since the
    data frame was already sorted) calculating what it needed, would work
    much faster than splitting the original vector into so very many
    smaller vectors (and the associated copying of data).

    That problem is very similar problem to the calculation of basic
    stats on a short moving window over a very long vector.

    I will look into that package and maybe use it for a model for what I
    want to do.

    Splus8.0 has something like what you are talking about
    that provides a fast way to compute
    sapply(split(xVector, integerGroupCode), summaryFunction)
    for some common summary functions. The 'integerGroupCode'
    is typically the codes from a factor, but you could compute
    it in other ways. It needs to be a "small" integer in
    the range 1:ngroups (like the 'bin' argument to tabulate).
    Like tabulate(), which is called from table(), these are
    meant to be called from other functions that can set up
    appropriate group codes. E.g., groupSums or rowSums or
    fancier things could be based on this.

    They do not insist you sort the input in any way. That
    would really only be useful for group medians and we haven't
    written that one yet.

    The typical prototype is

    igroupSums
    function(x, group = NULL, na.rm = F, weights = NULL, ngroups = if(is.null(
    group)) 1 else max(as.integer(group), na.rm = T))

    and the currently supported summary functions are
    mean : igroupMeans
    sum : igroupSums
    prod : igroupProds
    min : igroupMins
    max : igroupMaxs
    range : igroupRanges
    any : igroupAnys
    all : igroupAlls
    (The plurals are not all properly spelled and they
    are plural so match related functions like rowSums.)
    It might be useful to also have one to count the number
    of non-missing values in each group of x's.

    They are fast:

    x<-runif(2e6)
    i<-rep(1:1e6, 2)
    sys.time(sx <- igroupSums(x,i))
    [1] 0.66 0.67
    length(sx)
    [1] 1000000

    that machine R takes 44 seconds to go the lapply/split
    route:

    unix.time(unlist(lapply(split(x,i), sum)))
    [1] 43.24 0.78 44.11 0.00 0.00

    To save setup time in the S code, out-of-range values
    in the group argument (negatives, values greater than
    ngroup, and NA's), mean that the correponding element
    in x is ignored.

    Bill Dunlap
    Insightful Corporation
    bill at insightful dot com
    360-428-8146

    "All statements in this message represent the opinions of the author and do
    not necessarily reflect Insightful Corporation policy or position."

    R-devel (AT) r-project (DOT) org mailing list
  • No.6 | | 5067 bytes | |

    Hi Bill,

    Splus8.0 has something like what you are talking about
    that provides a fast way to compute
    sapply(split(xVector, integerGroupCode), summaryFunction)
    for some common summary functions. The 'integerGroupCode'
    is typically the codes from a factor, but you could compute
    it in other ways. It needs to be a "small" integer in
    the range 1:ngroups (like the 'bin' argument to tabulate).
    Like tabulate(), which is called from table(), these are
    meant to be called from other functions that can set up
    appropriate group codes. E.g., groupSums or rowSums or
    fancier things could be based on this.

    They do not insist you sort the input in any way. That
    would really only be useful for group medians and we haven't
    written that one yet.

    The sort is also useful for recoding each group into subgroups based
    on some other numeric vector. This is the problem I run into trying
    to build portfolios that can be used as benchmarks for long term
    stock returns.

    Another issue I have is that to recode a long character string that I
    use as a sort key for accessing a subgroup of the data in the
    data.frame to a set of small integers is not fast. I can make a fast
    implementation if the data is sorted by the key, but without the
    sort, just converting my sort keys to the required small integer
    codes would be expensive for very long vectors since my small integer
    codes would have to reflect the order of the data (ie. be increasing
    subportfolio numbers).

    More specifically, I am now converting all of my SAS code to R code
    and the problem is I have lots of snippets of SAS that do the
    following

    PRC SRT;
    BY MDSIZ FSIZ;

    /* WRITE UT THE MIN SIZE CUTFF VALUES */
    PRC UNIVARIATE NPRINT;
    VAR FSIZ;
    BY MDSIZ;
    UTPUT UT=TMPS1 MIN=XMIN;

    where my sort key MDSIZ is a character string that is the
    concatenation of the month ending date MD and the size portfolio of a
    particular firm (SIZ) and I want to find the cutoff points (the mins)
    for each of the portfolios for every month end date across all traded
    firms.

    The typical prototype is
    >
    >igroupSums

    function(x, group = NULL, na.rm = F, weights = NULL, ngroups = if
    (is.null(
    group)) 1 else max(as.integer(group), na.rm = T))

    and the currently supported summary functions are
    mean : igroupMeans
    sum : igroupSums
    prod : igroupProds
    min : igroupMins
    max : igroupMaxs
    range : igroupRanges
    any : igroupAnys
    all : igroupAlls

    SAS is similar in that is also has a specific list of functions you
    can request including all of the basic stats from a PRC univariate
    including higher moment stuff (skewness, kurtosis, robust
    statistics, and even statistical test results for each coded
    subgroup, and the nice thing is all combinations can be done with one
    call.

    But to do that SAS does require the presorting, but it does run
    really fast for even super long vectors with lots of sort keys.

    Similarly the next snippet of code, will take the file and resort it
    by the portfolio key and then the market to book ratio (MTB) for all
    trading firms for all monthly periods since 1980. It will then
    split each size portfolio for each month ending date into 5 equal
    portfolios based on market to book ratios (thus the need for the
    sort). SAS returns a coded integer vector PMTB (made up of 1s to 5
    with 1s's for the smallest MTB and 5 for the largest MTB) repeated
    for each subgroup of MDSIZ. PMTB matches the original vector in
    length and therefore fits right into the data frame.

    /* SPLIT INT Market to Book QUINTILES BY MDSIZ */
    PRC SRT;
    BY MDSIZ MTB;
    PRC RANK GRUPS=5 UT=TMPS0;
    VAR MTB;
    RANKS PMTB;
    BY MDSIZ;

    The problem of assigning elements of a long data vector to portfolios
    and sub portfolios based on the values of specific data columns which
    must be calculated at each step and are not fixed or hardcoded is one
    that finance can run into (and therefore I run into it).

    So by sorting I could handle the need for "small integer" recoding
    and the small integers would have meaning (i.e. higher values will
    represent larger MTB firms, etc).

    That just leaves the problem of calculating stats on short sequences
    of of a longer integer.

    They are fast:
    >
    >x<-runif(2e6)
    >i<-rep(1:1e6, 2)
    >sys.time(sx <- igroupSums(x,i))

    [1] 0.66 0.67
    >length(sx)

    [1] 1000000

    that machine R takes 44 seconds to go the lapply/split
    route:
    >
    >unix.time(unlist(lapply(split(x,i), sum)))

    [1] 43.24 0.78 44.11 0.00 0.00

    Yes! That is exactly what I need.

    Are there plans for adding something like that to R?

    Thanks,

    Kevin

    R-devel (AT) r-project (DOT) org mailing list
  • No.7 | | 6072 bytes | |

    Fri, 28 Jul 2006, Kevin B. Hendricks wrote:

    Hi Bill,

    Splus8.0 has something like what you are talking about
    that provides a fast way to compute
    sapply(split(xVector, integerGroupCode), summaryFunction)
    for some common summary functions. The 'integerGroupCode'
    is typically the codes from a factor, but you could compute
    it in other ways. It needs to be a "small" integer in
    the range 1:ngroups (like the 'bin' argument to tabulate).
    Like tabulate(), which is called from table(), these are
    meant to be called from other functions that can set up
    appropriate group codes. E.g., groupSums or rowSums or
    fancier things could be based on this.

    They do not insist you sort the input in any way. That
    would really only be useful for group medians and we haven't
    written that one yet.

    The sort is also useful for recoding each group into subgroups based
    on some other numeric vector. This is the problem I run into trying
    to build portfolios that can be used as benchmarks for long term
    stock returns.

    Another issue I have is that to recode a long character string that I
    use as a sort key for accessing a subgroup of the data in the
    data.frame to a set of small integers is not fast. I can make a fast
    implementation if the data is sorted by the key, but without the
    sort, just converting my sort keys to the required small integer
    codes would be expensive for very long vectors since my small integer
    codes would have to reflect the order of the data (ie. be increasing
    subportfolio numbers).

    True, but the underlying grouped summary code
    shouldn't require you to do the sorting. If
    codes <- match(char, sort(unique(char)))
    is too slow then you could try sorting the
    data set by th 'char' column and doing
    codes <- cumsum(char[-1] != char[-length(char)])
    (loop over columns before doing cumsum if there
    are several columns).

    If that isn't fast enough, then you could sort
    in the C code as well, but I think there could
    be lots of cases where that is slower.

    I've used this code for out of core applications,
    where I definitely do not want to sort the whole
    dataset.

    More specifically, I am now converting all of my SAS code to R code
    and the problem is I have lots of snippets of SAS that do the
    following

    PRC SRT;
    BY MDSIZ FSIZ;

    /* WRITE UT THE MIN SIZE CUTFF VALUES */
    PRC UNIVARIATE NPRINT;
    VAR FSIZ;
    BY MDSIZ;
    UTPUT UT=TMPS1 MIN=XMIN;

    where my sort key MDSIZ is a character string that is the
    concatenation of the month ending date MD and the size portfolio of a
    particular firm (SIZ) and I want to find the cutoff points (the mins)
    for each of the portfolios for every month end date across all traded
    firms.
    >
    >
    >

    The typical prototype is
    >
    >igroupSums

    function(x, group = NULL, na.rm = F, weights = NULL, ngroups = if
    (is.null(
    group)) 1 else max(as.integer(group), na.rm = T))

    and the currently supported summary functions are
    mean : igroupMeans
    sum : igroupSums
    prod : igroupProds
    min : igroupMins
    max : igroupMaxs
    range : igroupRanges
    any : igroupAnys
    all : igroupAlls

    SAS is similar in that is also has a specific list of functions you
    can request including all of the basic stats from a PRC univariate
    including higher moment stuff (skewness, kurtosis, robust
    statistics, and even statistical test results for each coded
    subgroup, and the nice thing is all combinations can be done with one
    call.

    But to do that SAS does require the presorting, but it does run
    really fast for even super long vectors with lots of sort keys.

    Similarly the next snippet of code, will take the file and resort it
    by the portfolio key and then the market to book ratio (MTB) for all
    trading firms for all monthly periods since 1980. It will then
    split each size portfolio for each month ending date into 5 equal
    portfolios based on market to book ratios (thus the need for the
    sort). SAS returns a coded integer vector PMTB (made up of 1s to 5
    with 1s's for the smallest MTB and 5 for the largest MTB) repeated
    for each subgroup of MDSIZ. PMTB matches the original vector in
    length and therefore fits right into the data frame.

    /* SPLIT INT Market to Book QUINTILES BY MDSIZ */
    PRC SRT;
    BY MDSIZ MTB;
    PRC RANK GRUPS=5 UT=TMPS0;
    VAR MTB;
    RANKS PMTB;
    BY MDSIZ;

    The problem of assigning elements of a long data vector to portfolios
    and sub portfolios based on the values of specific data columns which
    must be calculated at each step and are not fixed or hardcoded is one
    that finance can run into (and therefore I run into it).

    So by sorting I could handle the need for "small integer" recoding
    and the small integers would have meaning (i.e. higher values will
    represent larger MTB firms, etc).

    That just leaves the problem of calculating stats on short sequences
    of of a longer integer.

    They are fast:
    >
    >x<-runif(2e6)
    >i<-rep(1:1e6, 2)
    >sys.time(sx <- igroupSums(x,i))

    [1] 0.66 0.67
    >length(sx)

    [1] 1000000

    that machine R takes 44 seconds to go the lapply/split
    route:
    >
    >unix.time(unlist(lapply(split(x,i), sum)))

    [1] 43.24 0.78 44.11 0.00 0.00
    --
    Yes! That is exactly what I need.

    Are there plans for adding something like that to R?

    Thanks,

    Kevin
    --

    Bill Dunlap
    Insightful Corporation
    bill at insightful dot com
    360-428-8146

    "All statements in this message represent the opinions of the author and do
    not necessarily reflect Insightful Corporation policy or position."

    R-devel (AT) r-project (DOT) org mailing list
  • No.8 | | 1037 bytes | |

    Hi Bill,

    sum : igroupSums

    , after thinking about this

    # assumes i is the small integer factor with n levels
    # v is some long vector
    # no sorting required

    igroupSums <- function(v,i) {
    sums <- rep(0,max(i))
    for (j in 1:length(v)) {
    sums[[i[[j]]]] <- sums[[i[[j]]]] + v[[j]]
    }
    sums
    }

    if written in fortran or c might be faster than using split. It is
    at least just linear in time with the length of vector v. This
    approach could be easily made parallel to t threads simply by picking
    t starting points someplace along v and running this routine in
    parallel on each piece. You could even do it without thread locking
    if "sums" elements can be accessed atomically or by creating multiple
    copies of "sums" (one for each piece) and then doing a final addition.

    I still think I am missing some obvious way to do this but

    Am I thinking along the right lines?

    Kevin

    R-devel (AT) r-project (DOT) org mailing list
  • No.9 | | 6853 bytes | |

    Hi Bill,

    So you wrote one routine that can calculate any single of a variety
    of stats and allows weights, is that right? Can it return a data
    frame of any subset of requested stats as well (that is what I was
    thinking of doing anyway).

    I think someone can easily calculate all of those things in one pass
    through the array and then allow the user to select which of the new
    columns of stats should be added to a data.frame that is returned to
    the user.

    To test all of this, I simply wrote my own igroupSums and integrated
    it into r-devel based on the code in split.c. I can easily modify it
    to handle the case of calculating a variety of stats (even all at
    the same time if desired). I do not deal with "weights" at all and
    ignored that for now.

    Here is what your test case now shows on my machine with the latest R
    build
    with my added igroupSums routine (added internally to R).

    x <- rnorm(2e6)
    i <- rep(1:1e6,2)
    unix.time(Asums <- unlist(lapply(split(x,i),sum)))
    [1] 8.940 0.112 9.053 0.000 0.000
    names(Asums) <- NULL

    My version of igroupSums does not keep the names so I remove them to
    make the results comparable.

    Here is my my own internal function igroupSums

    unix.time(Bsums <- igroupSums(x,i))
    [1] 0.932 0.024 0.958 0.000 0.000
    >

    all.equal(Asums, Bsums)
    [1] TRUE

    So the speed up is quite significant (9.053 seconds vs 0.858 seconds).

    I will next modify my code to handle any single one of maxs, mins,
    sums, counts, anys, alls, means, prods, and ranges by user choice.
    Although I will leave the use of weights as unimplemented for now (I
    always get mixed up thinking about weights and basic stats and I
    never use them so )

    In case others want to play around with this too, here is the R
    wrapper in igroupSums.R to put in src/library/base/R/

    igroupSums <- function(x, f, drop = FALSE, ) UseMethod("igroupSums")

    igroupSums.default <- function(x, f, drop=FALSE, )
    {
    if(length(list())) .NotYetUsed(deparse(), error = FALSE)

    if (is.list(f)) f <- interaction(f, drop = drop)
    else if (drop || !is.factor(f)) # drop extraneous levels
    f <- factor(f)
    storage.mode(f) <- "integer" # some factors have double
    if (is.null(attr(x, "class")))
    return(.Internal(igroupSums(x, f)))
    ## else
    r <- by(x,f,sum)
    r
    }

    igroupSums.data.frame <- function(x, f, drop = FALSE, )
    lapply(split(seq(length=nrow(x)), f, drop = drop, ),
    function(ind) x[ind, , drop = FALSE])

    And here is a very simple igroupSums.c to put in src/main/

    It still needs a lot of work since it does not handle NAs in the
    vector x yet and still needs to be modified into a general routine to
    handle any single function of counts, sums, maxs, mins, means, prods,
    anys, alls, and ranges

    #ifdef HAVE_CNFIG_H
    #include <config.h>
    #endif

    #include "Defn.h"

    SEXP attribute_hidden do_igroupSums(SEXP call, SEXP op, SEXP args,
    SEXP env)
    {
    SEXP x, f, sums;
    int i, j, nobs, nlevs, nfac;

    checkArity(op, args);

    x = CAR(args);
    f = CADR(args);
    if (!isVector(x))
    errorcall(call, _("first argument must be a vector"));
    if (!isFactor(f))
    errorcall(call, _("second argument must be a factor"));
    nlevs = nlevels(f);
    nfac = LENGTH(CADR(args));
    nobs = LENGTH(CAR(args));
    if (nobs <= 0)
    return R_NilValue;
    if (nfac <= 0)
    errorcall(call, _("Group length is 0 but data length 0"));
    if (nobs % nfac != 0)
    warningcall(call, _("data length is not a multiple of split
    variable"));

    PRTECT(sums = allocVector(TYPEF(x), nlevs));
    switch (TYPEF(x)) {
    case INTSXP:
    for (i=0; i < nlevs; i++) INTEGER(sums)[i] = 0;
    break;
    case REALSXP:
    for (i=0; i < nlevs; i++) REAL(sums)[i] = 0.0;
    break;
    default:
    UNIMPLEMENTED_TYPE("igroupSums", x);
    }
    for (i = 0; i < nobs; i++) {
    j = INTEGER(f)[i % nfac];
    if (j != NA_INTEGER) {
    j--;
    switch (TYPEF(x)) {
    case INTSXP:
    INTEGER(sums)[j] = INTEGER(sums)[j] + INTEGER(x)[i];
    break;
    case REALSXP:
    REAL(sums)[j] = REAL(sums)[j] + REAL(x)[i];
    break;
    default:
    UNIMPLEMENTED_TYPE("igroupSums", x);
    }
    }
    }
    UNPRTECT(1);
    return sums;
    }

    If anyone is playing with this themselves, don't forget to update
    Internal.h and names.c to reflect the added routine before you make
    clean and then rebuild.

    I finish, I will post me patches here and then if someone would
    like to modify them to implement "weights", please let me know.

    Even if these never get added to R I can keep them in my own tree and
    use them for my own work.

    Thanks again for all of your hints and guidance. This alone will
    speed up my R code greatly!

    Kevin

    That is roughly what I did in C code for the Splus version.
    E.g., here is the integer x case for sums and means. It accumlates
    the weighted group sum and the group weight so that if we
    are doing the mean it has the right divisor. It would
    go a bit faster if I didn't bother with the group weight
    in the sum case.

    (I was mostly interested in seeing if this function's interface
    was sufficiently general for your uses. Computing the integer
    group codes can sometimes be a pain and there are cases where you
    might want to combine that with computing the grouped summaries.
    I am guessing that in most cases you want those two functions
    to be separate.)

    case S_SUM: /*FALLTHRUGH*/
    case S_MEAN:
    for(i=0;i<length;i++) {
    bin = binp ? *binp++ - 1 : 0 ; /* bin is now 0-based */
    weight = weighted ? *weightp++ : 1.0 ;
    x = *xp++ ;
    if (is_na_INT(&bin) || bin<0 || bin>=maxbin || weight==0.0)
    continue ;
    if (is_na_DUBLE(&groupWeightp[bin]))
    continue ;
    if (is_na_DUBLE(&x) || is_na_DUBLE(&weight)) {
    if (!na_rm) {
    na_set3(&valuep[bin], value->mode, To_NA);
    na_set3(&groupWeightp[bin], groupWeight->mode, To_NA);
    }
    continue ;
    }
    if (!is_na_DUBLE(&valuep[bin])) {
    valuep[bin] += x * weight ;
    groupWeightp[bin] += weight ; /* groupWeightp and
    valuep should both have same NA-ness */
    }
    }
    if (which==S_MEAN) {
    for(ibin=0;ibin<maxbin;ibin++) {
    if (is_na_DUBLE(&valuep[ibin])) {
    /* leave valuep[ibin] as NA */
    } else if (groupWeightp[ibin]==0.0) {
    /* valuep[ibin] must be 0.0 if groupWeightp[ibin]
    is */
    na_set3(&valuep[ibin], value->mode, To_NaN) ;
    } else {
    valuep[ibin] = valuep[ibin] / groupWeightp[ibin] ;
    }
    }
    }
    break;

    R-devel (AT) r-project (DOT) org mailing list
  • No.10 | | 1818 bytes | |

    Hi Bill,

    After playing with this some more and adding an implementation to
    handle NAs in the data vector, I have run into the problem of what to
    return when the only data values for a particular bin (or level) in
    the data vector were NAs and the user selected na.rm=T

    1. Should it return 0 for counts of that particular bin and NA for
    that bin for all of the other functions? If so, wouldn't that be
    strange to return a NA just since there is no valid data for that bin
    because the user asked for na.rm=T?

    2. do I have to literally rebuild the final result vector,
    removing all "unused" bins before returning the results? And
    wouldn't that cause problems in not all of the levels from 1:ngroups
    will be returned for some variables and not for others.

    I personally like the approach of 1. better since if I give an igroup
    function my groups and tell it to na.rm=T from my data vector, I
    would really want all group levels returned and not just the ones
    that had valid data in them and if a particular group had no data, I
    would want the count to be 0 for that bin and all of the other funs
    to return NA for that particular bin?

    Is that what you are returning in that case?

    Also, do you always return Sums, Maxs, and Mins as "numeric" or do
    you sometimes return "integer" values if an "integer" data vector is
    passed in?

    Are "Counts" always returned as "integer" or do you always set them
    to "numeric" or does that vary with the type of the data vector
    passed in?

    Do you handle "complex" data vectors in a similar fashion (ie. using
    the length of the complex vector as its value for Maxs, Mins, etc?)?

    Thanks,

    Kevin

    R-devel (AT) r-project (DOT) org mailing list
  • No.11 | | 1049 bytes | |

    Sat, 29 Jul 2006, Kevin B. Hendricks wrote:

    Hi Bill,

    sum : igroupSums

    , after thinking about this

    # assumes i is the small integer factor with n levels
    # v is some long vector
    # no sorting required

    igroupSums <- function(v,i) {
    sums <- rep(0,max(i))
    for (j in 1:length(v)) {
    sums[[i[[j]]]] <- sums[[i[[j]]]] + v[[j]]
    }
    sums
    }

    if written in fortran or c might be faster than using split. It is
    at least just linear in time with the length of vector v.

    For sums you should look at rowsum(). It uses a hash table in C and last
    time I looked was faster than using split(). It returns a vector of the
    same length as the input, but that would easily be fixed.

    The same approach would work for min, max, range, count, mean, but not for
    arbitrary functions.

    -thomas

    Thomas LumleyAssoc. Professor, Biostatistics
    tlumley@u.washington.eduUniversity of Washington, Seattle

    R-devel (AT) r-project (DOT) org mailing list
  • No.12 | | 1527 bytes | |

    Hi Thomas,

    Here is a comparison of performance times from my own igroupSums
    versus using split and rowsum:

    x <- rnorm(2e6)
    i <- rep(1:1e6,2)
    >

    unix.time(suma <- unlist(lapply(split(x,i),sum)))
    [1] 8.188 0.076 8.263 0.000 0.000
    >

    names(suma)<- NULL
    >

    unix.time(sumb <- igroupSums(x,i))
    [1] 0.036 0.000 0.035 0.000 0.000
    >

    all.equal(suma, sumb)
    [1] TRUE
    >

    unix.time(sumc <- rowsum(x,i))
    [1] 0.744 0.000 0.742 0.000 0.000
    >

    sumc <- sumc[,1]
    names(sumc)<-NULL
    all.equal(suma,sumc)
    [1] TRUE

    So my implementation of igroupSums is faster and already handles NA.
    I also have implemented igroupMins, igroupMaxs, igroupAnys,
    igroupAlls, igroupCounts, igroupMeans, and igroupRanges.

    The igroup functions I implemented do not handle weights yet but do
    handle NAs properly.

    Assuming I clean them up, is anyone in the R developer group interested?

    would you rather I instead extend the rowsum appropach to create
    rowcount, rowmax, rowmin, rowcount, etc using a hash function approach.

    All of these approaches simply use differently ways to map group
    codes to integers and then do the functions the same.

    Thanks,

    Kevin

    R-devel (AT) r-project (DOT) org mailing list

Re: Any interest in "merge" and "by" implementationsspecifically for sorted data?


max 4000 letters.
Your nickname that display:
In order to stop the spam: 8 + 7 =
QUESTION ON "Development"

EMSDN.COM