Wednesday, 2 November 2016

PouchDB read optimisation - reducing time of data queries

The recent ARC version has a so-called API assistant. It's a helper program which queries history data for relevant information whenever the URL or HTTP method changes. It reads the history of requests made to a particular method (meaning endpoint and method) and analyses statistical data about this method, like connection time or response size.

Recently I was asked how it would behave when the app stores a lot of data. I mean hundreds of thousands of records. Then it occurred to me that I had never performed a test with this amount of data. It started a series of changes in the code of the assistant which finally allowed me to shorten the database query time by 98.52%!
But let's start from the beginning.
The app uses the PouchDB wrapper for the IndexedDB database. Every time a user makes a request in the app, the request, response and some timing data are stored locally in the data store. So far this data was never really used. But now the assistant makes a lot of use of it.

This is what the computed assistant report looks like:

[Image: The API assistant]

The data is computed across all the history data. It can be a few records, but it can also be a lot of data gathered over months or years.

Initial tests showed that the query time did not influence app performance and it was still fast. But then I generated 300 thousand additional records and the app started to work a little bit slower. I measured that the average query and processing time for the data was ~35 seconds. That's a lot of time and it was unacceptable for the app.


The cause

As mentioned earlier, the app is using the PouchDB database wrapper for IndexedDB. It's a nice library but it turns out that it may also be slow, even when working directly on keys. When getting a record from the datastore using PouchDB, it looks for a key, then in the metadata values it checks whether the record is deleted or not (PouchDB doesn't really delete data because it needs it for replication; it only marks records as deleted). Then it needs to find the latest revision of the record (the _rev parameter) and finally read the record associated with the key. The revision search is required because PouchDB keeps a history of changes, which is needed for replication with a remote database.

The data structure

The first step of the optimisation was to use a proper data structure. Fortunately I make use of keys in IndexedDB. Many developers (including me in the past) don't really use keys to find the data. It appears, however, that keys are an excellent tool to query IndexedDB really fast. And to sort data without calling a sort function on the result list.

The app is using the following structure of the history data key:

URL-encoded request URL + "/" + HTTP method + "/" + generated UUID

Why this structure? With these keys I can open a cursor on the data store keys and iterate over the keys only, without even loading and reading the object associated with the key. Below I'll show it in the real code.
The UUID is attached to every request, or else the record would be overwritten when saving a new history object for the same URL and HTTP method, which would defeat the purpose of the store.
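
As a rough sketch, building such a key could look like the code below. The uuid function here is a simplified stand-in for the UUID generator the app actually uses.

// Simplified stand-in for a UUID v4 generator; the app uses a proper one.
function uuid() {
  return 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function(c) {
    var r = Math.random() * 16 | 0;
    var v = c === 'x' ? r : (r & 0x3 | 0x8);
    return v.toString(16);
  });
}
// Builds the history data key: encoded URL + "/" + method + "/" + UUID.
function createHistoryKey(url, method) {
  return encodeURIComponent(url) + '/' + method + '/' + uuid();
}
// createHistoryKey('https://api.example.com/users', 'GET')
// -> 'https%3A%2F%2Fapi.example.com%2Fusers/GET/3f2a...' (UUID shortened)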

The solution

I'd like to warn you that this will only work on data that isn't supposed to change over time. It doesn't care about the revision system of PouchDB.

The first step to optimise the query time was to use... raw IndexedDB instead of the PouchDB wrapper. It may sound silly, because we use wrappers precisely so we don't have to work with raw IndexedDB: it's not easy, it doesn't support promises, querying IDB is really hard and you may find 10 other reasons for not using it without a framework. In this case, though, it was the only thing I could do to get the data out really fast.
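
For comparison, a PouchDB-based version of such a query would look roughly like the code below. This is a sketch, not the code ARC shipped; the original implementation isn't shown here, so the exact options are my assumption.

// Hypothetical PouchDB version of the same query.
function queryHistoryWithPouch(url, method) {
  var db = new PouchDB('history-data');
  var key = encodeURIComponent(url) + '/' + method + '/';
  // allDocs with a key range matches the same records, but every document
  // read goes through the metadata and revision lookups described above.
  return db.allDocs({
    startkey: key,
    endkey: key + '\uffff',
    include_docs: true
  })
  .then(function(result) {
    return result.rows.map(function(row) {
      return row.doc;
    });
  });
}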

The algorithm for the database query is quite straightforward:

  1. Compute the "queryKey", which is the beginning of the key, consisting of the URL and HTTP method we are looking for (the encoded URL + "/" + HTTP method part)
  2. Open a cursor (with value) with a key range built from the queryKey on the document-store store
  3. In the success callback function check if the value's deletedOrLocal property equals zero
    1. If yes, append the result's seq property to the results list
  4. When all the keys are ready, query the by-sequence store and get the data for each key
  5. Compute the stats data
You may notice that step 4 could be folded into step 3. Then, using the same transaction, you could read the data from the other store. I tried this, measured the impact on performance, and in that scenario the read time doubled! Somehow, using two separate transactions, each with a single store opened within it, made the read twice as fast.

Let's jump to the code.
// The namespace
var processor = {};
// Open the database
processor.getDatabase = function() {
  return new Promise((resolve, reject) => {
    // In ARC the PouchDB database is called history-data.
    // PouchDB prefixes the IndexedDB database name with "_pouch_".
    let request = indexedDB.open('_pouch_history-data');
    request.onerror = function(event) {
      reject(event);
    };
    request.onsuccess = function(event) {
      resolve(event.target.result);
    };
  });
};
// Query keys for related data
processor.queryKeys = function(db, url, method) {
  return new Promise((resolve, reject) => {
    // This is where PouchDB stores keys defined in the app and related metadata. 
    // It also stores the sequence number which is a key to the data stored by the app for given key
    let transaction = db.transaction(['document-store'], 'readonly');
    let objectStore = transaction.objectStore('document-store');
    // Creating beginning of the key to search for.
    let key = encodeURIComponent(url) + '/' + method + '/';
    var keyRange = IDBKeyRange.bound(key, key + '\uffff', true, true);
    let request = objectStore.openCursor(keyRange);
    let keys = [];
    request.onsuccess = function(event) {
      let cursor = event.target.result;
      if(cursor) {
        if (cursor.value.deletedOrLocal !== '0') {
          // This record is useless
          cursor.continue();
          return;
        }
        // The `seq` keeps a reference to the by-sequence store where the final data are kept
        keys[keys.length] = cursor.value.seq; 
        cursor.continue();
      } else {
        // ready to continue
        resolve(keys);
      }
    };
    request.onerror = function(event) {
      // Data consistency is not critical here, so resolve with whatever
      // was collected so far instead of leaving the promise pending.
      // Other apps could actually report an error.
      console.error(event);
      resolve(keys);
    };
  });
};
Querying by key is very efficient. This part of the app, no matter the database size (at least the sizes I checked so far), always took a couple of milliseconds to finish! Notice that in this time I've completed a query for a particular URL and HTTP method!
It is possible because I constructed the key with the formula described above. The rest is just a native key range and a cursor.
var keyRange = IDBKeyRange.bound(key, key + '\uffff', true, true);
The whole magic is happening here. The \uffff is added to the end of the upper bound of the key range because IndexedDB is indexed in lexicographical order. It means that if you want to query for data from some string value up to any character after this string, you have to use the highest character value (in the UTF table), which is the \uffff character. IndexedDB is optimised to run queries like this. If I had used auto-generated keys or any other key structure this wouldn't be possible, and I would need to read all the data in the database and perform a manual search. And that would be very expensive.
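
To illustrate with a few hypothetical keys, the range matches every key that starts with the computed prefix and nothing else:

// Hypothetical store contents (UUIDs shortened for readability):
//   'https%3A%2F%2Fapi.test%2Fusers/GET/uuid-1'     <- in range
//   'https%3A%2F%2Fapi.test%2Fusers/GET/uuid-2'     <- in range
//   'https%3A%2F%2Fapi.test%2Fusers/POST/uuid-3'    <- out of range
//   'https%3A%2F%2Fapi.test%2Fuserslist/GET/uuid-4' <- out of range
var key = encodeURIComponent('https://api.test/users') + '/GET/';
// Everything that starts with the prefix sorts between the prefix itself
// and prefix + '\uffff'; keys with a different prefix fall outside.
var keyRange = IDBKeyRange.bound(key, key + '\uffff', true, true);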

Finally, the queryKeys function returns a list of sequence numbers, which in PouchDB are keys in the by-sequence store, where the final data I'm looking for is kept.
// Reads a single history object from the by-sequence store.
// The store (and its transaction) is created once by the caller,
// because reusing one transaction is faster than opening one per read.
processor.readHistoryData = function(store, seq) {
  return new Promise((resolve) => {
    let request = store.get(seq);
    request.onsuccess = function() {
      resolve(request.result);
    };
    request.onerror = function() {
      // Resolve with null so a single failed read doesn't break Promise.all.
      resolve(null);
    };
  });
};
processor.getHistoryData = function(url, method) {
  var startTime = performance.now();
  var db;
  return processor.getDatabase()
  .then((_db) => {
    db = _db;
    return processor.queryKeys(_db, url, method);
  })
  .then((list) => {
    if (!list || !list.length) {
      return [];
    }
    // Second transaction I mentioned earlier
    let transaction = db.transaction(['by-sequence'], 'readonly');
    let store = transaction.objectStore('by-sequence');
    return Promise.all(list.map((i) => processor.readHistoryData(store, i)))
    // Clean up possible errors. Your app can actually collect information about the errors
    .then((list) => list.filter((i) => i !== null))
    .then((list) => {
      list.sort((a, b) => {
        if (a.created > b.created) {
          return 1;
        }
        if (a.created < b.created) {
          return -1;
        }
        return 0;
      });
      return list;
    });
  })
  .then((data) => {
    let execTime = performance.now() - startTime;
    return {
      data: data,
      time: execTime
    };
  });
};
The readHistoryData function just reads the data associated with the given key. It's the most natural way to do it in IndexedDB. What's interesting is that I don't create a transaction for each call of this function. Instead I use one transaction and pass its store as an argument to the read function, because it's faster.

The getHistoryData function is the one to call to get the data. It contains the whole sequence of events to query and read the data. The by-sequence store is where the data the app needs is kept (in PouchDB's data model).

One more thing to consider. I use the sort function to sort the data from oldest to newest. There's actually room for improvement here. If the key contained a timestamp of the request instead of the UUID, I wouldn't need to sort the data at all. Remember the lexicographical order? The database would just return the data already sorted by time. Right now it's a few milliseconds added to the overall processing time, so it's not an important issue, but in the future I would change it.
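
A sketch of what such a time-sortable key could look like (my suggestion, not what ARC currently uses). The timestamp is zero-padded so that string comparison matches numeric order, and the UUID stays at the end to keep keys unique:

// Reuses the uuid() helper from the earlier sketch.
function createTimeSortedKey(url, method) {
  // 13 digits cover every Date.now() value until the year 2286.
  var time = ('0000000000000' + Date.now()).slice(-13);
  return encodeURIComponent(url) + '/' + method + '/' + time + '/' + uuid();
}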

The improvements 

Like I said in the beginning, I was able to reduce the query time by 98.52% just by bypassing PouchDB and using raw IndexedDB to query this particular data. These are the results of measuring both implementations.

For 30,934 records in the datastore:


PouchDB query (ms)    Raw IDB query (ms)
9984                  170
9500                  204
9715                  199
9537                  195
9665                  172
9433                  154
9501                  207
9739                  167
9719                  212
9685                  181
9599                  176
9665                  181

The last row is the median of the values. In this case the improvement was a 98.13% reduction in query time.

For 160,934 objects in the datastore:


PouchDB query (ms)    Raw IDB query (ms)
88479.58              662.49
53449.78              790.735
53165.36              741.625
54344.375             877.285
53601.495             815.605
53140.32              810.6
54249.165             841.805
53752.095             728.825
53533.885             712.17
53565.16              691.475
53838.635             833.09
53601.495             790.735

The improvement is a 98.52% reduction in query time. That's a lot!

I hope you've found this article helpful and that you'll use this technique to improve the performance of your app.
The most important thing to remember is that you can use IDB keys and make them work for you. IDB doesn't care whether the key is a string, a number, a date or an array. But you should care.
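
For example, an array key could encode the same structure without string concatenation (a hypothetical alternative, not what ARC uses):

// Example values for the queried endpoint (hypothetical).
var url = 'https://api.test/users';
var method = 'GET';
// IndexedDB keys may be strings, numbers, dates, binary data or arrays.
// An array key sorts component by component, so a range query still works.
var key = [url, method, Date.now(), uuid()]; // uuid() from the earlier sketch
// All records for one URL + method: an empty array sorts above any number,
// string or date, so it serves as the "highest" possible third component.
var keyRange = IDBKeyRange.bound([url, method], [url, method, []]);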

The full code is available on my GitHub: https://github.com/jarrodek/ChromeRestClient/blob/develop/app/elements/request-stats-analyser/request-stats-analyser.html
It's embedded in a web component which creates a web worker instance from the template content and runs it in a separate thread, so the main UI is not affected by the database query and, later, by the data analysis.
