Squeezing better performance out of Datomic datoms – datomic performance

With Datomic, for most purposes, using datalog is recommended.

However this is not always satisfactory, especially if you have a memory constrained system or billions of datoms to choose from, with a need for lazy processing.

In these situations you are sometimes forced to use the lower level datoms function.

What is a datom?

A datom is the basic unit of storage in datomic. It is a tuple of 4 parts

EntityId

just a number

Attribute

describes the datatype of the value and its cardinality. Has a keyword that hopefully describes the purpose of the association

Value

can be any of the scalar datatypes described in https://docs.datomic.com/on-prem/schema.html#required-schema-attributes

Transaction

reference to the transaction that asserted this datom.

What is the d/datoms function call?

Datomic keeps 4 internal indexes with datoms sorted by increasing value

EAVT

holds all datoms, sorted by entityId then Attribute EntityId then Value then Transaction Entity ID

AEVT

holds all datoms, sorted by Attribute EntityId then EntityId then Value then Transaction Entity ID

AVET

holds datoms whose Value is configured as unique and those configured with Index=true, sorted by Atttribute then Value then Entity then Transaction

VAET

holds only datoms whose scalar type is Reference. Sorted by Reference then Attribute then Entity then Transaction

The d/datoms function call requires you to specify one of these indexes and optionally some matching params.

It will return an iterable that lazily supplies (matching) datoms sorted by increasing value

How does this help?

Since we know that the datoms are sorted we can write a transducer to

  • keep only those entityIds that appear in both datom iterables

  • reject only those entityIds that appear in both iterables.

Traversing lazily-realised iterables drastically reduces memory overhead thereby increasing performance.

Since these are transducers we can daisy-chain them together into arbitrary complexity.

For example, we had a problem running this Datalog query in reasonable time;

Give me the number of entities with an :asset/type that are not referred to by a :sync/asset

 (d/q '[:find (count ?asset) :where [?asset :asset/type]
                                    [(missing? $ ?asset :asset/disabled-at)]
                                    (not [_ :sync/asset ?a])]
       datomic-db)

The insight is that we can re-express this as:

  1. Get me all entities with an :asset/type

    (d/datoms db :aevt :asset/type)

  2. Throw away the ones that have an :asset/disabled-at (d/datoms db :aevt :asset/disabled-at)

  3. Throw away the ones referenced by :sync/asset (d/datoms db :avet :sync/asset)

  4. Count the number that remain.

The code for this is as follows:

(transduce
   (comp (remove-from-orderedcollection-xform #(- (.v ^Datom %1) (.e ^Datom %2))
                                               (d/datoms db :avet :sync/asset))
         (remove-from-orderedcollection-xform #(- (.e ^Datom %1) (.e ^Datom %2))
                                               (d/datoms db :aevt :asset/disabled-at)))
   (completing (fn [acc _] (inc acc)))
   0
   (d/datoms db :aevt :asset/type))

In tests, taking this approach brought the delinquent datalog query down from ~300 seconds to ~900ms.

What did you put in that remove-from-orderedcollection-xform function?

This part gets somewhat involved. Luckily you only have to write it once and then it is reusable.

You pass in a comparator and a (probably lazy) Iterator called collR. It returns a transducer that will filter out data items that can be found in collR. As long as the items are sorted in a way that is compatible with the cmp function

(defn remove-from-orderedcollection-xform
  "takes two collections values ASCENDING ACCORDING TO cmp FUNCTION
  eg Iterables of datomic entity Ids
  and will permit the transducer values that are not present in collR
  cmp will always be called with collR items as first arg, collVal as second arg
  only collVal items are returned.
  The two colls can have different kinds of data as long as cmp can work on them"
  [cmp ^Iterable collR]
  (fn [rf]
    (let [iterR (.iterator collR)
          nextR (volatile! nil)]
      (fn
        ([] (rf))
        ([acc] (rf acc))
        ([acc trans-val]
         (transduce
          (fn [rf]
            (fn
              ([] (rf))
              ([acc]
               ;; if nextR is set, then we are at the end of a sequence of matches
               ;; and test-val was also a match. Therefore we don't want it.
               ;; if nextR is NOT set then we have run out of collR values
               ;; and we want EVERYTHING from NOW ON
               (if (some? @nextR)
                 acc
                 (rf acc trans-val)))
              ([acc remove-val]
               (let [c (cmp remove-val trans-val)]
                 (cond
                   ;; test-val is ahead of remove-val; we have not yet caught up
                   (neg? c) acc

                   ;; test-val equals remove-val; we have caught up
                   ;; we do not want this test-val
                   (zero? c) (do (vreset! nextR remove-val)
                                 (reduced acc))

                   ;; remove-val has raced ahead of test-val.
                   ;; we want this test-val
                   ;; We want to keep this remove-val whilst test-val catches up
                   (pos? c) (do (vreset! nextR remove-val)
                                (reduced (rf acc trans-val))))))))
          rf
          acc
          (let [r @nextR]
            (vreset! nextR nil)
            (iterator-reducible r iterR))))))))

You can try this out by running something like:

(deftest test-remove-from-orderedcollection-easy
  (testing "easy non-trivial ordered removal"
    (is (= [0 2 4 6 8]
           (into []
                 (remove-from-orderedcollection-xform compare (filter odd? (range 12)))
                 (range 10))))))

What if I want to keep only the values that appear in both?

Oh that’s a bit easier. This does the trick:

(defn keep-orderedcoll-xform
  [cmp ^Iterable collK]
  (fn [rf]
    (let [iterK (.iterator collK)
          nextK (volatile! nil)]
      (fn
        ([] (rf))
        ([acc] (rf acc))
        ([acc trans-val]
         (transduce
          (fn [rf]
            (fn
              ([] (rf))
              ([acc] acc)
              ([acc keep-val]
               (let [c (cmp keep-val trans-val)]
                 (cond
                   ;; trans-val is ahead of keep-val;
                   ;; keep-val needs to catch up so stay in this inner loop
                   (neg? c) acc

                   ;;test-val equals keep-val; we want this one
                   (zero? c) (do (vreset! nextK nil)
                                 (reduced (rf acc trans-val)))

                   ;; keep-val has overtaken trans-val
                   ;; stash this keep-val and break out of inner loop
                   (pos? c) (do (vreset! nextK keep-val)
                                (reduced acc))
                   )))))
          rf
          acc
          (let [k @nextK]
            (vreset! nextK nil)
            (iterator-reducible k iterK))))))))

Isn’t there something missing?

Yes, I forget to specify one of the utility functions; shim an Iterable onto a Reducible

I have avoided iterator-seq because I am worried about the following:

  1. Sequence chunking could drain the Iterator prematurely and cause data loss.

  2. Head retention could cause unacceptable memory usage; by definition, datomic iterables can be extremely long.

  3. Sequences construct temporary Cons Objects, creating more memory churn than a simple loop in a reducible.

Our goal here is to be as performant as possible.

(defn iterator-reducible
  "expresses an iterator through the medium of IReduceInit
   if first-val is nil it will be ignored"
  [first-val ^java.util.Iterator it]
  (reify IReduceInit
    (reduce [_ f start]
      (loop [acc (if first-val
                   (f start first-val)
                   start)]
        (if (or (reduced? acc)
                (not (.hasNext it)))
          (unreduced acc)
          (recur (f acc (.next it))))))))

What about inequalities?

datoms only gives you the contiguous datoms that match the arguments you pass. If you pass it a value that does not exist, you will get an empty Iterable back.

This is awkward when you want to discover the datoms whose value is a date that is after a pivot date.

Datomic offers a seek-datoms function to cover this; it returns an Iterable that starts at the passed value, or the next value if it did not exist and will then keep going and going.

seek-datoms returns an Iterable over the entire index that lies beyond the starting point Since Datomic only has 4 indexes, this will likely be a sizeable chunk of the entire database, which is probably not what you want. You must terminate the iteration when the datoms are no longer of interest or you will find your system develops a surprisingly large memory churn.

Published: 2019-02-27

Privacy policy

crux   clojure  
Oct 24, 2019
Crux Development Diary Getting into Beta
jon
by Jon Pither
clojure  
Oct 22, 2019
On-the-fly collections with Reducibles digest large datasets with ease
bnh
by Ben Hammond
conference   clojutre  
Sep 30, 2019
ClojuTRE 2019 An amazing conference with amazing speakers
and
by Andrew Jackson