This is a repost of a network analysis I performed on historical Bitcoin blockchain data. Kaggle kernels introduced seamless access to this massive raw dataset via Google BigQuery. All of my original code can be found & forked here. The code chunks in this post are essential sections that may be helpful for querying via BigQuery in Kaggle kernels and performing network analysis/visualization.Note: The original notebook has trouble running because the large SQL queries often timeout in the kernel. I have found interactively repeated attempts are neccessary to successfully return large queries.
# import libraries into Python 3 env, under kernels docker image import numpy as np # linear algebra import pandas as pd # data processing import networkx as nx # network analysis & viz import bq_helper # Google BigQuery API helper import matplotlib.pyplot as plt # plotting libraries import seaborn as sns
Kaggle has put together a useful
bq_helper package which unpacks SQL object into pandas dataframes. It is immediately available in the kernels docker image.
# create helper object for the the bigQuery data set blockchain_helper = bq_helper.BigQueryHelper(active_project="bigquery-public-data", dataset_name="bitcoin_blockchain") # to inspect available tables blockchain_helper.list_tables()
To begin, I performed exploratory data analysis to look at overall historic data trends within the Bitcoin ledger. Below, I define a SQL query and pass it to the BigQueryHelper object to grab the block ID and corresponding block creation timestamp from the database.
# parse the timestamp data into readable date times query = """ WITH time AS ( SELECT TIMESTAMP_MILLIS(timestamp) AS trans_time, block_id FROM `bigquery-public-data.bitcoin_blockchain.blocks` ) SELECT COUNT(block_id) AS blocks, EXTRACT(DATE FROM trans_time) AS date FROM time GROUP BY date ORDER BY date """ # estimate the query size prior to requesting blockchain_helper.estimate_query_size(query) # assign query to dataframe q1_df = blockchain_helper.query_to_pandas(query)
This graph is a visualization of the volume of blocks created each day since the inception of Bitcoin.
Interestingly there is a spike in block creation volume around mid-2010. I did a little bit of research and it turns out July 2010 was an exciting month for Bitcoin. Per Bitcoin Wiki:
The huge number of blocks created reflects new user influx into the budding technology, and the new demand resulted in a 10x value increase.
A point of interest in this unregulated space is the influence that "whales" (holders of relatively large amounts of currency) have on price fluctuation. Hedge funds, early adopters, governments and businesses (illicit or honest) holding large amounts of Bitcoin can sway value by temporally injecting or withdrawing funds into the market, thereby influencing currency momentum. I looked at the transaction networks of some of these big players to better understand financial behavior in this unregulated space.
I first queried for wallets with the highest volume of tranactions. Below is a query for wallets with the largest volume of buyer-side (in) transactions. A separate but similar query for seller-side (out) transactions revealed virtually the exact same wallets operating at the highest volume. My first idea was these high-frequency wallets may represent financial institutions operating algorithmic trade strategies.
# addresses has the most number of bitcoins sent out query = """ SELECT inputs.input_pubkey_base58 AS input_key, count(*) FROM `bigquery-public-data.bitcoin_blockchain.transactions` JOIN UNNEST (inputs) AS inputs WHERE inputs.input_pubkey_base58 IS NOT NULL GROUP BY inputs.input_pubkey_base58 order by count(*) desc limit 1000 """
Investigating futher, I pulled down all of the data that the wallet with the highest transaction volume was involved in.
# query all transactions this wallet was involved in q_input = """ WITH time AS ( SELECT TIMESTAMP_MILLIS(timestamp) AS trans_time, inputs.input_pubkey_base58 AS input_key, outputs.output_pubkey_base58 AS output_key, outputs.output_satoshis AS satoshis, transaction_id AS trans_id FROM `bigquery-public-data.bitcoin_blockchain.transactions` JOIN UNNEST (inputs) AS inputs JOIN UNNEST (outputs) AS outputs WHERE inputs.input_pubkey_base58 = '1NxaBCFQwejSZbQfWcYNwgqML5wWoE3rK4' OR outputs.output_pubkey_base58 = '1NxaBCFQwejSZbQfWcYNwgqML5wWoE3rK4' ) SELECT input_key, output_key, satoshis, trans_id, EXTRACT(DATE FROM trans_time) AS date FROM time --ORDER BY date """ q3 = blockchain_helper.query_to_pandas(q_input) # make a datatime type transformation q3['date'] = pd.to_datetime(q3.date) q3 = q3.sort_values('date') # convert satoshis to bitcoin q3['bitcoin'] = q3['satoshis'].apply(lambda x: float(x/100000000)) # make any sending of bitcoin a negative value representing 'leaving' this wallet q3['bitcoin_mod'] = q3['bitcoin'] q3.loc[q3['input_key'] == '1NxaBCFQwejSZbQfWcYNwgqML5wWoE3rK4', 'bitcoin_mod'] = -q3['bitcoin_mod']
With the preprocessed data, I examined some of the transaction trends. This wallet had a peak of trading activity around May 2015, and the net volume of Bitcoin exchanged was consitently negative. This indicates overall the wallet was trading more Bitcoin away than it was receiving. My initial interpretation of this phenomenom was the holder was involved in mining activities; this would create an inflow of coins that could be traded away on the currency market.
Moving onto network analysis and visualization: below is code to visualize the network of the top 20 wallets the wallet of interest received Bitcoin from, and the top 20 wallets the wallet of interest sent Bitcoin to.
# top 20 wallets bitcoins sent out to nx_plot_asc = q3.groupby(['input_key', 'output_key'], as_index=False)['bitcoin_mod'].sum().sort_values(by='bitcoin_mod', ascending=True)[0:20] # top 20 wallets bitcoins received from nx_plot_desc = q3.groupby(['input_key', 'output_key'], as_index=False)['bitcoin_mod'].sum().sort_values(by='bitcoin_mod', ascending=False)[0:20] # concatenate nx_plot = pd.concat([nx_plot_asc, nx_plot_desc], ignore_index=True) # networkx graph of net volumne of bitcoins transacted (most into and most out of the wallet in question) G = nx.from_pandas_edgelist(nx_plot, 'input_key', 'output_key', 'bitcoin_mod', nx.DiGraph()) # specify node color color_list =  for node in G.nodes(): if node == '1NxaBCFQwejSZbQfWcYNwgqML5wWoE3rK4': color_list.append('red') else: color_list.append('green') # plot params pos=nx.random_layout(G) nx.draw_networkx_nodes(G,pos, node_list= G.nodes(),node_color='g',node_size=100) nx.draw_networkx_edges(G, pos) plt.title('Net Volume Bitcoins Exchanged', fontsize=22)
Each node represents a wallet, and the the distance and direction along the edge between each node represents net exchange volume and flow. The closer a node is to any other node, the larger amount of currency transacted between the wallets. The black boxes along each edge can be thought of as arrows, representing the direction of currency travel. This is known as a trivial graph, as we are visualizing a graph with a single vertex, 1NxaBCFQwejSZbQfWcYNwgqML5wWoE3rK4, the wallet of interest.
To investigate a more complex network, I queried the transactions data of the top 5 wallets 1NxaBCFQwejSZbQfWcYNwgqML5wWoE3rK4 sent Bitcoin to. The SQL query code and Python code for graph generation can be found in my original notebook.
A few things come to mind from this visualization, mainly that there are a few nodes which harbor a majority of the transaction flow. The relative "importance"" of vertices in this graph can be computed with different centrality metrics. Degree centrality defines "the number of links incident upon a node." Indegree (incoming ties) and outdegree (outgoing ties) cetrality help to more specifically define directed networks. Betweenness centrality "quantifies the number of times a node acts as a bridge along the shortest path between two other nodes."
The 3 wallets at the top of the chart are all very important in terms of degree centrality, meaning each are connected to many nodes in the network and consequently represent hubs of transaction. 1BT4DYrt3ZoSz6WeGEZzrj4tdidUcpCfQ6 stands out in terms of betweenness centrality, signifying its use as a bridge of connection between nodes.
The centrality metrics provide an interesting conclusion in regards to the currency dispersion pattern of the original wallet, 1NxaBCFQwejSZbQfWcYNwgqML5wWoE3rK4. I believe that the most important nodes in this network are most likely wallets under the same ownership, representing central nodes of a currency "cleaning"" service. Due to the unregulated nature of Bitcoin, some individuals use the market to exchange illicit goods and services. To increase anonymity and privacy cyprtocurrency tumblers have evolved as a tool to help launder digital currency. From Wikipedia:
In traditional financial systems, the equivalent would be moving funds through banks located in countries with strict bank secrecy laws, such the Cayman Islands, the Bahamas, or Panama. Tumblers take a percentage transaction fee of the total coins mixed to turn a profit, typically 1-3%. Mixing helps protect privacy and can also be used for money laundering by mixing illegally obtained funds.
By pinging transactions around a network of wallets, these services provide a way for individuals to input 'dirty' Bitcoin and receive 'clean' Bitcoin via creation of a convoluted paper trail.
Simple network analysis reveals interesting trends within the Bitcoin blockchain ledger, and I have shown an important use case for this type of analysis in detecting sub-networks that may accomodate fraudulent/illicit activity. Using data science in the context of publicly distributed technology will be instrumental in promoting the safe adoption of these amazingly useful exchange platforms. Quantitative insight into blockchain networks can provide developers, as well as legislators, with appropriate information to spur global adoption of this technology.