Sunday, April 01, 2012

Apache Cassandra: Iterate over all columns in a row

Recently I have been using Cassandra for one of my projects, and one of the needs is to iterate over all columns of a row. Each column represents an individual piece of data, whose type is identified by the row id, and the set of columns keeps changing, so I can’t simply use a set of known column names. Using the setRange call on a SliceQuery with a large count is also not an option, since Cassandra will try to load the entire set of columns into memory. Instead I’ve written this iterator, which takes a query on which the row key and column family have already been set, and loads columns as they are requested. By default it loads 100 columns at a time. You could make it take the count as a parameter and all, but this works for me for now.

The one ‘problem’ with this is that the last column of each page has to be removed: it is held back so that there are no duplicates, while still giving a start point for the next query. This is because each column is independent, so you cannot ask a column who its next neighbour is and start the next query from there. If anybody has a tip to make it more elegant, I’d love to hear it.
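A minimal sketch of the paging idea, kept independent of any Cassandra client so it can run on its own. The class name `PagedColumnIterator` and the `fetchSlice` function are assumptions made for illustration: `fetchSlice` stands in for whatever runs the real slice query, and is expected to return up to `count` column names in sorted order, beginning at the given start column (inclusive), with `null` meaning the beginning of the row.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.BiFunction;

// Hypothetical sketch: pages through the columns of one row, `count` at a time.
class PagedColumnIterator implements Iterator<String> {
    // Stand-in for the real slice query: (start column, count) -> sorted column names.
    private final BiFunction<String, Integer, List<String>> fetchSlice;
    private final int count;
    private String start = null;            // null = start from the first column
    private Iterator<String> page = Collections.emptyIterator();
    private boolean exhausted = false;      // true once a short page has been seen

    PagedColumnIterator(BiFunction<String, Integer, List<String>> fetchSlice, int count) {
        if (count < 2) {
            // With count == 1 the only column would always be held back,
            // so the iterator could never make progress.
            throw new IllegalArgumentException("count must be at least 2");
        }
        this.fetchSlice = fetchSlice;
        this.count = count;
    }

    private void fetchMore() {
        List<String> columns = new ArrayList<>(fetchSlice.apply(start, count));
        int origSize = columns.size();
        if (origSize >= count) {
            // Full page: there may be more columns. Hold back the last one and
            // use it as the (inclusive) start of the next query, so it is
            // returned exactly once, as the first column of the next page.
            start = columns.remove(origSize - 1);
        } else {
            // Short page: nothing left after this, so no further query is needed.
            exhausted = true;
        }
        page = columns.iterator();
    }

    @Override
    public boolean hasNext() {
        while (!page.hasNext() && !exhausted) {
            fetchMore();
        }
        return page.hasNext();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return page.next();
    }
}
```

The cost of holding back one column per page is that each query effectively delivers `count - 1` new columns, and a row whose size is an exact multiple of that ends with one extra, short query.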

2 comments:

  1. Nice post.

    Isn't there a little bug here? Why don't you update the "start" local variable in the fetchMore() method?

  2. I am updating it if 'origSize >= count'. This condition is true whenever a full page of count columns was returned. In that case we have to do another query, as there may be more columns. Otherwise we know that there are no more columns, so we don't need to run the query again.
