A friend of mine shared this JSConf video with me last week, about interactive visualisations on massive data sets using Palantir:
Watching Tim Slatcher present reminded me of some thoughts I had a few years ago about the problem of interacting with very large data sets. Tim talks about how the UX for exploring data needs to be intuitive as well as responsive, scaling with data volumes. I think Tim and I share a lot of common ground on these points, but we approach them from two different angles.
You see, the problem with computers can be traced back to when they were mechanical devices. Some inputs were given, a handle was turned, some complex calculation performed, and the result was output.
Data --> Compute --> Result
and depending on the volume of data and the complexity of the calculation:
Data --> … [crickets]… --> Result
In some cases the crickets can sound for minutes or hours (days at times) before the result is presented. But what happens during that wait?
Well, the human mind thrives on inadequate information. Aside from the finely tuned rational aspects of our brain, we have a set of other psychological functions which kick in when there is limited data, and help us short-cut to the answer. What I might call ‘gut’ instinct, you might call ‘prejudice’, and vice versa. Minority groups throughout history will have experienced this problem. A crime is perpetrated and, in the absence of evidence, the finger gets pointed at one of their number. It then takes an overwhelming barrage of facts to convince people against this initial judgement if, as is often the case, it has been made in error.
While these evolutionary functions are very useful for surviving in the wild, they sometimes get in the way, as in the example above of administering justice, or indeed in simple office-based decisions.
I remember a story from a client who had invested a significant amount of money in a decision support system. The system was able to aggregate their sales data from stores around the world, and management were able to slice and dice this data by dimensions such as product line, country, time of day, etc. One day, they had a new product to launch, but they did not have the resources to do a simultaneous launch in every country they operated in. They ran a complex query on their data to work out the optimal territory to launch in first. Interestingly, and despite their investment in the technology, they ignored what the data was telling them to do; the product tanked, and they pulled the launch. A competitor launched a competing product six months later, and went on to corner the market. What went wrong? A data failure? Missing dimensions from the query? No. Just a bit too much pride and emotion.
You see, while the query was running, the exec in charge of this particular initiative didn’t bother waiting for the results. In fact, the moment he or she knew that they had to make a decision about which country to launch in, their brain was already busy computing an answer. By the time they asked their business analysts to do the research, they already had a hunch. When the data came back three days later, it was too late.
“The data is wrong”
“The system never gets it right”
“There are just some variables the computer hasn’t taken into account”
The problem with a gut instinct is that once it’s formed, we start to build an emotional attachment to it. This is very useful for surviving in the wild. No one escaped a woolly mammoth by being indecisive. The odd few who made poor decisions ran very decisively directly into the path of the beast, or indeed headlong into other sources of danger; but enough made a decent snap decision, stuck to their guns, and tens of thousands of generations later – here we are.
For humans to finally accept computers in the decision support process, they need to be able to guide us towards the answer faster than the speed of thought. Unfortunately, if you explain this to a computer engineer, you often end up being led down the wrong path.
“We need to increase the server RAM to 128GB”
“We need to re-write the database into a columnar datastore, but even then we will be limited to a couple of hundred billion rows if you need that response time”
“Sorry, we just need to reduce the amount of data. Let’s aggregate the values somewhere – the users will never notice”
Oh dear [sigh].
* * *
I think what Tim Slatcher has stumbled across is a universal truth of computing:
“You are only ever able to maximise two of these three factors:
1. Volume of Data
2. Speed of Response
3. Level of Accuracy
All data systems must compromise one factor in order to maximise the remaining two”.
The problem with data processing design is that we all take #3 – Level of Accuracy for granted. It’s 100% in all cases. Therefore, as #1 – Volume of Data rises, #2 – Speed of Response falls. Our solution? Jack up the computing power; double the number of CPUs in the array.
Tim is one of very few Computer Scientists who are advocating a new way – sacrifice the level of accuracy for speed of response. Imagine a system where response time was a constant. For any given question, it must give a result in under half a second:
“Where shall we launch our new product?”, the user asks; “Not sure, just yet. I’ll probably need a few hours to give you a fully accurate answer. I'll keep you updated on my thoughts as I go, though:
What I can say already from skimming the data, is that it’s very unlikely to be a successful launch in Europe; Australia and Canada seem like the two stand-out choices”.
Suddenly, our system begins to sound more human. The big difference is that its answers are based on data, not on gut.
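To make the idea a little more concrete, here is a minimal Python sketch of a system that gives an early answer and keeps refining it as it reads more data. It is not Tim's or Palantir's implementation; the names `progressive_mean` and `Estimate`, and the made-up sales figures, are purely illustrative assumptions. The margin of error is a rough 95% figure, and it only means something if the rows are read in random order, which a real system would have to arrange for itself.

```python
import math
import random
from typing import Iterable, Iterator, NamedTuple


class Estimate(NamedTuple):
    rows_seen: int
    mean: float
    margin: float  # rough 95% margin of error on the mean


def progressive_mean(values: Iterable[float],
                     report_every: int = 1000) -> Iterator[Estimate]:
    """Yield a running estimate of the mean, refined as more rows are read.

    Hypothetical helper for illustration. Assumes `values` arrive in
    random order, so the rows read so far act as a random sample.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's algorithm)
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n % report_every == 0:
            variance = m2 / (n - 1) if n > 1 else 0.0
            yield Estimate(n, mean, 1.96 * math.sqrt(variance / n))
    # Report the final answer if the last rows did not trigger a report.
    if n > 0 and n % report_every != 0:
        variance = m2 / (n - 1) if n > 1 else 0.0
        yield Estimate(n, mean, 1.96 * math.sqrt(variance / n))


if __name__ == "__main__":
    rng = random.Random(42)
    # Made-up sales figures, shuffled so early rows are a fair sample.
    sales = [rng.gauss(100.0, 20.0) for _ in range(50_000)]
    rng.shuffle(sales)
    for est in progressive_mean(sales, report_every=10_000):
        print(f"{est.rows_seen:>6} rows: mean ~ {est.mean:.2f} "
              f"+/- {est.margin:.2f}")
```

The point of the design is that the caller gets something to look at after the first batch of rows, rather than after the last. The first answer is less certain than the final one, but it arrives while the question is still fresh.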