On the difficulty of estimating on-chain transaction volume

We are happy to report that this site is getting quite a bit of attention these days. We never anticipated this when we first decided to put together a public repository of cryptocurrency data. With lots of attention comes lots of trouble. One thing we’ve always worried about is how to carefully present data which is very noisy by its very nature.

Extracting information from public blockchains is a delicate thing. There are valuable insights there, but also a huge amount of misdirection and obfuscation. Since we don’t feel that we’ve done a good job of caveating our data, and we’re getting a bit of exposure now, we’ll try to make our constraints absolutely clear in this post. First things first:

Transaction volume in USD terms is highly unreliable and may be overstated by a factor of 5-10 or more

We recommend you use our “transaction volume (USD)” figure with caution. It is by no means a canonical estimate of the actual economic activity transacted on a network in a given day. For UTXO chains in particular, it is probably a ludicrous over-estimate. Why is this the case? There are two chief reasons, one structural, and the other practical.

It’s hard to tell which outputs are genuine, and which are change.
A large portion of bitcoin transaction volume comes from exchanges moving money around, or from mixers, and they’re hard to identify.

What we do is add up all the outputs in a block and subtract out the ones we know for sure are change outputs. But several things still artificially inflate the estimate: mixer activity, self-churn, and exchange activity, among others.
Bitcoin is based on a UTXO model, which means that it operates like a typical cash transaction at the convenience store. Bob wants a $2 packet of cigarettes. He hands over a $20 and gets $18 back in change. Bitcoin works the same way. A transaction where the input isn’t quite equivalent to the transaction amount records two outputs – one to the intended recipient, another in which the remainder is cycled back to the originator as change. So naively adding up the value of outputs in the cigarette transaction gives you a “value transmitted” of $2 + $18 = $20, when in reality only $2 actually changed hands. Problematic.
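
To make the double-counting concrete, here is a minimal sketch of the cigarette transaction in Python. The `Output` class and the `is_change` flag are hypothetical and exist only for illustration; on a real chain nobody labels change for you, which is the whole problem.

```python
from dataclasses import dataclass


@dataclass
class Output:
    address: str            # hypothetical label, not a real address
    value: float            # value in USD for this toy example
    is_change: bool = False # assumption: we somehow know this


def nominal_volume(outputs):
    """Naive estimate: add up every output."""
    return sum(o.value for o in outputs)


def adjusted_volume(outputs):
    """Estimate after dropping outputs known to be change."""
    return sum(o.value for o in outputs if not o.is_change)


# Bob pays $2 with a $20 note and gets $18 back.
cigarette_tx = [
    Output("shop", 2.0),
    Output("bob_change", 18.0, is_change=True),
]

print(nominal_volume(cigarette_tx))   # 20.0
print(adjusted_volume(cigarette_tx))  # 2.0
```

The gap between the two numbers is a factor of ten here, and it grows the larger the input is relative to the amount actually paid.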

It’s worth noting that subtracting out mixer and exchange outputs is controversial. That involves deeming them “not economical” transactions. But the blockchain doesn’t care – they paid their fee and have the same right to be mined into the blockchain as any other. From Bitcoin’s perspective, they’re all equally valid transactions. They still subsidize the miners and contribute to the security of the chain. Many would argue that there is no such thing as a “spam” transaction.

We don’t take a position on this. It is useful, if you’re peering into the blockchain to determine the shape of that economy, to segment the various actors. Exchanges are fairly simple to identify, so they are easy targets. But this is an open question. It complicates the task which we have set ourselves, which is ‘find a plausible estimate for on-chain economic volume.’
The difference between nominal and adjusted volume is fairly stark.

If you look at smaller UTXO chains like Verge, the problem can be quite severe. In late December 2017, Verge USD volume spiked, even briefly surpassing that of Bitcoin. But it wasn’t a case of an increase in natural demand – the spikes are attributable to the behavior of a single address.

Around that time, a large Verge address with about $23m in XVG began peeling, sending small amounts (~1000 XVG) repeatedly from one large parent address. Since our change heuristics don’t filter these sorts of transactions out, this behavior fabricated a lot of volume. We don’t know how much economic volume was actually occurring on Verge around that time, but it certainly wasn’t $22b worth in a day.
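
A rough sketch of why peeling inflates nominal volume so badly. The numbers below are made up for illustration and are not the actual Verge figures. The sketch assumes that each transaction pays a small amount out and sends the remainder to an output that the change filter does not remove, so the full balance gets counted again on every hop.

```python
def peel_chain_volume(balance, peel_amount, n_transactions):
    """Return (nominal, transferred) for a simple peel chain.

    nominal     -- sum of all outputs, as a naive volume metric sees it
    transferred -- sum of the small amounts that actually left the wallet
    """
    nominal = 0
    transferred = 0
    for _ in range(n_transactions):
        if balance < peel_amount:
            break
        remainder = balance - peel_amount
        # Both the peeled amount and the unfiltered remainder are counted.
        nominal += peel_amount + remainder
        transferred += peel_amount
        balance = remainder
    return nominal, transferred


# Hypothetical wallet: 1,000,000 coins, peeling 1,000 coins 100 times.
nominal, transferred = peel_chain_volume(1_000_000, 1_000, 100)
print(nominal)      # 95050000
print(transferred)  # 100000
```

In this toy case the naive figure is roughly 950 times the amount that really moved. A single large wallet doing this a few hundred times a day is enough to dwarf everything else on a small chain.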

This isn’t a UTXO-only problem. There is some fairly credible research finding that a significant fraction of Ethereum volume flows through temporary addresses which are only used once. This may be due to a mixer, but more likely, we believe it may be due to the way that exchanges interact with their users. The end result is the same – a significant portion of volume, even in an account-based system like Ethereum (no UTXOs or change addresses to worry about), comes from temporary addresses.

So where does Blockchain.info get their estimates from?

The other approach to estimating transaction volume is to subtract out as many change outputs, exchange churn, and mixing outputs as possible. Of course, sometimes it’s impossible to know which outputs are change and which are original. Blockchain.info takes a stab at this. They have a unique advantage – they run a wallet service. This means they are probably able to obtain decent estimates about user behavior and extrapolate that to the whole set of users.
How do they get their estimates? We’re actually not sure. They don’t make it clear, as far as we can tell. However, we have been able to roughly reverse-engineer them using two heuristics:

Subtracting out outputs which are spent within k blocks, where k = 4
Identifying change outputs as defined in the Meiklejohn paper
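
As an illustration of the first heuristic, here is a minimal sketch that drops outputs spent again within k blocks. This is our guess at the idea, not Blockchain.info’s actual method. The `Output` fields are hypothetical, and treating “within k blocks” as “at most k blocks later” is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Output:
    value: float                 # output value in coins
    created_height: int          # block in which the output was created
    spent_height: Optional[int]  # block in which it was spent, None if unspent


def is_quickly_respent(output, k=4):
    """True if the output was spent at most k blocks after creation."""
    if output.spent_height is None:
        return False
    return output.spent_height - output.created_height <= k


def adjusted_volume(outputs, k=4):
    """Sum of outputs that were not respent within k blocks."""
    return sum(o.value for o in outputs if not is_quickly_respent(o, k))


outputs = [
    Output(5.0, 100, 101),   # respent one block later: dropped
    Output(1.5, 100, 250),   # held for 150 blocks: kept
    Output(2.0, 100, None),  # still unspent: kept
    Output(3.0, 100, 104),   # respent exactly k blocks later: dropped
]

print(sum(o.value for o in outputs))  # 11.5 (nominal)
print(adjusted_volume(outputs, k=4))  # 3.5
```

Note that this needs to know when each output is later spent, so the figure for a given day can only be finalized k blocks after the fact.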

We haven’t operationalized these yet, as we would pass from the realm of “presenting data” to “curating and modifying data.” We want to be extremely careful that we can make reliable claims about the information we’re sharing here. Of course, part of that involves informing you about how unreliable data can be. We’re focused on building out our offering now and adding support for lots of blockchains. Our longer-term plan involves letting users apply chosen heuristics to transaction data to hopefully get a better estimate of economic volume.

What solutions exist?

Well, we have a few options. We can keep using nominal (unadjusted) transaction volume and just stay vigilant about the constraints of that measure. It’s still fairly useful in quite a few ways. We could try to produce a more robust analysis of economic volume by getting better at guessing change outputs and subtracting known mixer/exchange volume. Or we could default to a simpler measure like Bitcoin “days destroyed.”

Days destroyed works like this: you take the number of bitcoins in a given transaction and multiply them by the number of days since they were last spent. Coins that last moved minutes or hours ago contribute almost nothing, which largely removes the double-counting from mixer volume or from exchanges that are just shuffling coins around. For some reason, it has fallen out of fashion now, and it’s not hosted on blockchain.info any more. It’s kind of obscure and a bit tricky to interpret. So let’s ignore that for now.
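
For the curious, the calculation above can be sketched in a few lines. The function name and the input format (amount, time last moved) are our own choices for illustration, and the dates and amounts are invented.

```python
from datetime import datetime, timedelta


def coin_days_destroyed(inputs, spent_at):
    """Sum of amount * age in days over the inputs of a transaction.

    inputs   -- iterable of (amount_in_coins, last_moved_datetime)
    spent_at -- datetime at which the coins are being spent
    """
    one_day = timedelta(days=1)
    total = 0.0
    for amount, last_moved in inputs:
        age_days = (spent_at - last_moved) / one_day
        total += amount * age_days
    return total


now = datetime(2018, 4, 11)

# 2 coins that sat untouched for 100 days.
long_held = [(2.0, datetime(2018, 1, 1))]

# 48 coins that an exchange moved just one hour ago.
churn = [(48.0, now - timedelta(hours=1))]

print(coin_days_destroyed(long_held, now))  # 200.0
print(coin_days_destroyed(churn, now))      # about 2
```

The churned coins are 24 times larger in nominal terms but register about a hundredth of the days destroyed, which is why the measure is so resistant to shuffling.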

Let’s call our current USD figure nominal or raw volume, and the adjusted figure “heuristically reduced” volume. We’ve already looked into blockchain.info-style reductions for Bitcoin. We don’t at present have the tools to do this analysis for all of the 15+ public blockchains we support. So this will have to wait for now. Don’t despair though – nominal volume is still a pretty good figure. Just be aware that a single large wallet making lots of small transactions can appear as a huge spike in volume, especially on smaller chains.

As long as you are vigilant about the constraints of the data, you will be empowered to use it appropriately. We are improving our method and our estimates all the time. If you would like to reconstruct our analysis, consider looking at the tools in our GitHub. If you are interested in more sophisticated blockchain analysis, we suggest looking into the Meiklejohn et al. paper linked in useful resources. They provide an excellent explanation of various heuristics used to infer transaction data. The BlockSci tool is another good starting point.

Lastly, be advised that this is an evolving science. Wallet footprints are changing all the time, altering the viability of some of these heuristics that exploit them. Some privacy enhancements, especially on Bitcoin, render these analyses more difficult. And more transactions are being held off-chain, whether internally on exchanges, off-chain via Opendimes, or near-chain through improvements like the Lightning Network.

So the unfortunate reality is that there is no objective truth to be discovered. Instead, plausible guesses can be made. Our guesses are of one type; the more precise estimates found on blockchain.info and in academia are another. We’re optimizing for scale, honesty, and coverage of a lot of different public blockchains.

Please note that this isn’t a mea culpa. We laid out these problems in a previous post. However, there are many new users who may be interpreting our data with insufficient skepticism. Be advised that extracting human-readable data from blockchains is an imprecise affair. Don’t rely on us exclusively. We are just one tool in a buffet of many: refer to our useful resources page for more excellent data services.