This is a repost of a network analysis I performed on historical Bitcoin blockchain data. Kaggle kernels introduced seamless access to this massive raw dataset via Google BigQuery. All of my original code can be found & forked here. The code chunks in this post are helpful sections for using BigQuery and for performing network analysis.

Note: The original notebook has trouble running because the large SQL queries often timeout in the kernel. I have found repeated interactive attempts are necessary to return large queries.


# import libraries into Python 3 env, under kernels docker image
import numpy as np # linear algebra
import pandas as pd # data processing
import networkx as nx # network analysis & viz
import bq_helper # Google BigQuery API helper
import matplotlib.pyplot as plt # plotting libraries
import seaborn as sns

Kaggle has put together a useful bq_helper package which unpacks SQL objects into pandas dataframes. It is immediately available in the kernels docker image.

# create helper object for the  the bigQuery data set
blockchain_helper = bq_helper.BigQueryHelper(active_project="bigquery-public-data",
# to inspect available tables 

To begin, I performed exploratory data analysis on historical data trends within the Bitcoin ledger. Below, I define a SQL query and pass it to the BigQueryHelper object. I grab the block ID and corresponding block creation timestamp from the database.

# parse the timestamp data into readable date times
query = """ 
      WITH time AS 
          SELECT TIMESTAMP_MILLIS(timestamp) AS trans_time,
          FROM `bigquery-public-data.bitcoin_blockchain.blocks`
      SELECT COUNT(block_id) AS blocks,
          EXTRACT(DATE FROM trans_time) AS date
      FROM time
      GROUP BY date
      ORDER BY date
# estimate the query size prior to requesting
# assign query to dataframe
q1_df = blockchain_helper.query_to_pandas(query)

This graph is a visualization of the volume of blocks created each day since the inception of Bitcoin.

The spike in block creation volume around mid-2010 is an obvious outlier. I did a little bit of research and it turns out July 2010 was an exciting month for Bitcoin. Per Bitcoin Wiki:

The huge number of blocks created reflects new user influx into the budding technology, and the new demand resulted in a 10x value increase.

Transaction Network

A point of interest in this unregulated space is the influence that "whales" (holders of large amounts of currency) have on price fluctuation. Hedge funds, early adopters, governments and businesses (illicit or honest) can sway value by injecting or withdrawing funds into the market, thereby influencing currency momentum. I looked at the transaction networks of some of these big players to better understand financial behavior in this new arena.

I first queried for wallets with the highest volume of transactions. Below is a query for wallets with the largest volume of buyer-side (in) transactions. A separate but similar query for seller-side (out) transactions revealed the same wallets operating at the highest transaction volume. My initial idea was these high-frequency wallets represent financial institutions operating algorithmic trade strategies.

# addresses has the most number of bitcoins sent out
query = """
          inputs.input_pubkey_base58 AS input_key, count(*)
      FROM `bigquery-public-data.bitcoin_blockchain.transactions`
      JOIN UNNEST (inputs) AS inputs
      WHERE inputs.input_pubkey_base58 IS NOT NULL
      GROUP BY inputs.input_pubkey_base58 order by count(*) desc limit 1000
wallet in_transactions out_transactions
1NxaBCFQwejSZbQfWcYNwgqML5wWoE3rK4 1893834 1893867
1dice8EMZmqKvrGE4Qc9bUFf9PX3xaYDp 1600304 1601173
1LuckyR1fFHEsXYyx5QK4UFzv3PEAepPMK 1203839 1203881
1dice97ECuByXAvqXpaYzSaQuPVvrtmz6 1105022 1105388

To investigate further, I pulled out the trade history of the wallet with the highest transaction volume.

# query all transactions this wallet was involved in
q_input = """
        WITH time AS 
                SELECT TIMESTAMP_MILLIS(timestamp) AS trans_time,
                    inputs.input_pubkey_base58 AS input_key,
                    outputs.output_pubkey_base58 AS output_key,
                    outputs.output_satoshis AS satoshis,
                    transaction_id AS trans_id
                FROM `bigquery-public-data.bitcoin_blockchain.transactions`
                    JOIN UNNEST (inputs) AS inputs
                    JOIN UNNEST (outputs) AS outputs
                WHERE inputs.input_pubkey_base58 = '1NxaBCFQwejSZbQfWcYNwgqML5wWoE3rK4'
                    OR outputs.output_pubkey_base58 = '1NxaBCFQwejSZbQfWcYNwgqML5wWoE3rK4'
        SELECT input_key, output_key, satoshis, trans_id,
            EXTRACT(DATE FROM trans_time) AS date
        FROM time
        --ORDER BY date
q3 = blockchain_helper.query_to_pandas(q_input)
# make a datatime type transformation
q3['date'] = pd.to_datetime(
q3 = q3.sort_values('date')
# convert satoshis to bitcoin
q3['bitcoin'] = q3['satoshis'].apply(lambda x: float(x/100000000))
# make any sending of bitcoin a negative value representing 'leaving' this wallet
q3['bitcoin_mod'] = q3['bitcoin']
q3.loc[q3['input_key'] == '1NxaBCFQwejSZbQfWcYNwgqML5wWoE3rK4', 'bitcoin_mod'] = -q3['bitcoin_mod']

With the preprocessed data, I examined some of the transaction trends. This wallet had a peak of trading activity around May 2015, and the net volume of Bitcoin exchanged was consistently negative. This indicates the wallet was trading more Bitcoin away than it was receiving. My primary interpretation of this phenomenon was the wallet holder mined Bitcoin and traded away the generated currency.

Network Analysis

Moving onto network analysis and visualization. Below is code to visualize the network of the top 40 wallets the wallet of interest received Bitcoin from and sent Bitcoin to.

# top 20 wallets bitcoins sent out to
nx_plot_asc = q3.groupby(['input_key', 'output_key'], as_index=False)['bitcoin_mod'].sum().sort_values(by='bitcoin_mod', ascending=True)[0:20]
# top 20 wallets bitcoins received from
nx_plot_desc = q3.groupby(['input_key', 'output_key'], as_index=False)['bitcoin_mod'].sum().sort_values(by='bitcoin_mod', ascending=False)[0:20]
# concatenate
nx_plot = pd.concat([nx_plot_asc, nx_plot_desc], ignore_index=True)
# networkx graph of net volumne of bitcoins transacted (most into and most out of the wallet in question)
G = nx.from_pandas_edgelist(nx_plot, 'input_key', 'output_key', 'bitcoin_mod', nx.DiGraph())
# specify node color
color_list = []
for node in G.nodes():
    if node == '1NxaBCFQwejSZbQfWcYNwgqML5wWoE3rK4':
# plot params
nx.draw_networkx_nodes(G,pos, node_list= G.nodes(),node_color='g',node_size=100)
nx.draw_networkx_edges(G, pos)
plt.title('Net Volume Bitcoins Exchanged', fontsize=22)

Each node represents a wallet, and the the distance and direction along the edge between each node represents net exchange volume and flow. The closer a node is to any other node, the larger amount of currency transacted between the wallets. Think of the black boxes along each edge as arrows, representing the direction of currency travel to the node in contact. This is known as a trivial graph, as we are visualizing a graph with a single vertex, 1NxaBCFQwejSZbQfWcYNwgqML5wWoE3rK4, the wallet of interest.

To investigate a more complex network, I queried the transactions data of the top 5 wallets 1NxaBCFQwejSZbQfWcYNwgqML5wWoE3rK4 sent Bitcoin to. The SQL query and Python code for graph generation can be found in my original notebook.

The few nodes that harbor a majority of the transaction flow stand out in this visualization. Different centrality metrics describe the relative “importance”" of vertices in the graph. Degree centrality defines “the number of links incident upon a node.” Indegree (incoming ties) and outdegree (outgoing ties) cetrality help to specifically define directed networks. Betweenness centrality “quantifies the number of times a node acts as a bridge along the shortest path between two other nodes.”

The 3 wallets at the top of the chart are all very important in terms of degree centrality. The highest ranking wallets connect to many nodes in the network and represent hubs of transaction. The high betweenness centrality of 1BT4DYrt3ZoSz6WeGEZzrj4tdidUcpCfQ6 signifies its use as a bridge of connection between nodes.

The centrality metrics provide an interesting conclusion in regards to the currency dispersion pattern of the original wallet, 1NxaBCFQwejSZbQfWcYNwgqML5wWoE3rK4. I believe that the most important nodes in this network are most likely wallets under the same ownership and represent central nodes of a currency "cleaning"" service. Due to the unregulated nature of Bitcoin, some individuals use the market to exchange illicit goods and services. To increase anonymity and privacy cyprtocurrency tumblers have evolved as a tool to help launder digital currency. From Wikipedia:

In traditional financial systems, the equivalent would be moving funds through banks located in countries with strict bank secrecy laws, such the Cayman Islands, the Bahamas, or Panama. Tumblers take a percentage transaction fee of the total coins mixed to turn a profit, typically 1-3%. Mixing helps protect privacy and can also be used for money laundering by mixing illegally obtained funds.

By pinging transactions around a network of wallets, these services provide a way for individuals to input ‘dirty’ Bitcoin and receive ‘clean’ Bitcoin via creation of a convoluted paper trail.


Simple network analysis reveals interesting trends within the Bitcoin blockchain ledger, and I have shown an important use case for this type of analysis in detecting sub-networks that may accomodate fradualent/illicit activity. Using data science in the context of publicly distributed technology will be instrumental in promoting the safe adoption of these amazingly usefal exchange platforms. Quantitative insight into blockchain networks can provide developers, as well as legislators, with appropriate information to spur global adoption of this technology.