What I discovered by analyzing almost 100,000 ERC-20 Tokens

I’ve always found Blockchain Data somewhat paradoxical.

On one hand, you can view information about any transaction, whether it’s from seconds ago or years past, at any time, on platforms like Etherscan.

However, this doesn’t mean that accessing any type of blockchain data is easy, especially if you lack the technical knowledge for it.

This research aims to demonstrate how I extracted, transformed, and analyzed data from nearly 100,000 ERC-20 tokens launched on Uniswap V2 over the past few years, seeking to answer questions such as:

  • Are the majority of these tokens truly scams?
  • What is the average number of holders for these tokens?
  • How big can a token of this kind grow?

I accomplished this using the following tech stack:

  • Python (for libraries like requests, pandas, numpy, matplotlib, and others)
  • BlockSync (to fetch initial token information within a specific timeframe)
  • GoPlus (for security analysis)
  • Bitquery (for querying blockchain data)
  • MongoDB (NoSQL database storage)

The Signal and the Noise

Hundreds of thousands of new ERC-20 tokens are launched every day, especially when considering all the DEXs (Decentralized Exchanges) that exist across various blockchains.

Here, the first decision I made was to narrow down the scope of analysis:

I decided to focus on one of the most famous and liquid DEXs in the market (Uniswap V2) and only on pairs pegged to WETH.

Decentralized Exchanges sorted by their daily volume. Source: DeFi Llama
WETH is by far the most used quote currency on Uniswap V2. Source: BitDegree

Okay, now we have a defined scope.

The single piece of information we need to get started is simple: the pair address. This is a unique number that will identify that pair, related to its token. For example, this is Pepe’s WETH pair address for Uniswap V2: 0xa43fe16908251ee70ef74718545e4fe6c5ccec9f.

With a pair address in our hands we are in our way to get all other information we need.

Now, how in the world can we get the pair address of 100k tokens? Getting that manually would be impossible (or at least, very time-consuming). Thank God, there’s BlockSync, a platform that allows you to query blockchain data using SQL.

BlockSync

Then, we just need to connect with the BlockSync API and run the right query. Here is a simple draft on how you could do it:

Here is an example of what running that code will return:

{'cols': ['tx_hash',
          'block_number',
          'log_index',
          'tx_index',
          'address',
          'chain_id',
          'ts',
          'token0',
          'token1',
          'pair',
          'arg3'],
 'rows': [{'address': '0x5C69bEe701ef814a2B6a3EDD4B1652CB9cc5aA6f',
           'arg3': 274432,
           'block_number': 18435720,
           'chain_id': 1,
           'log_index': 1,
           'pair': '0xC12f0e00F8e1ABE487f08cc5E05d1c3aC127cA29',
           'token0': '0x9CB156fF063B6dCE4c38d14830a69a8Cc08002C0',
           'token1': '0xC02aaA39b223FE8D0A0e5C4F27eAD9083C756Cc2',
           'ts': '2023-10-26T16:56:35',
           'tx_hash': '0xbad6994d30eb49fb668e5146706976c797a74e336a79e649af52f2185f6674d6',
           'tx_index': 0},
          {'address': '0x5C69bEe701ef814a2B6a3EDD4B1652CB9cc5aA6f',
           'arg3': 274431,
           'block_number': 18435716,
           'chain_id': 1,
           'log_index': 331,
           'pair': '0xcebD0DB8fcCD1C9f3D4Edd20c4B8a77DB537e635',
           'token0': '0x0444F5bfe7F75063Fa9e56B32522fF1F6e5064AE',
           'token1': '0xC02aaA39b223FE8D0A0e5C4F27eAD9083C756Cc2',
           'ts': '2023-10-26T16:55:47',
           'tx_hash': '0xd167409eeba7750d43e6b0e2e5a082294e5beb528dc0ca06c6d83107f8216bb2',
           'tx_index': 90},
          {'address': '0x5C69bEe701ef814a2B6a3EDD4B1652CB9cc5aA6f',
           'arg3': 274430,
           'block_number': 18435714,
           'chain_id': 1,
           'log_index': 1,
           'pair': '0x90fa779ad7464E9b2805819eef2E3c52c50ee282',
           'token0': '0x4c538bE977902F8946f255BB2f60d9e5C3b3a504',
           'token1': '0xC02aaA39b223FE8D0A0e5C4F27eAD9083C756Cc2',
           'ts': '2023-10-26T16:55:23',
           'tx_hash': '0xac5d9b652912ad7b98b41d5fa0f0af2481c6c8010e99353b3571aa1e168ef2d0',
           'tx_index': 0},
          {'address': '0x5C69bEe701ef814a2B6a3EDD4B1652CB9cc5aA6f',
           'arg3': 274428,
           'block_number': 18435664,
           'chain_id': 1,
           'log_index': 38,
           'pair': '0x5369f9CfC1Dd6d32eAc079E9b6d9Fe978B654302',
           'token0': '0xA32d3E483bB293F7b615fE5616b60f4e33773571',
           'token1': '0xC02aaA39b223FE8D0A0e5C4F27eAD9083C756Cc2',
           'ts': '2023-10-26T16:45:23',
           'tx_hash': '0x115f203f376738b4cd9e356e31ce5b62520f0879ca4359511f7c95c2dae75c3a',
           'tx_index': 18},
          {'address': '0x5C69bEe701ef814a2B6a3EDD4B1652CB9cc5aA6f',
           'arg3': 274427,
           'block_number': 18435633,
           'chain_id': 1,
           'log_index': 35,
           'pair': '0xcC9d6C33efe4f26B8d7994740037c8Ad5828407a',
           'token0': '0x5780a0701847c4434F0efc6F68C7f0F872359F00',
           'token1': '0xC02aaA39b223FE8D0A0e5C4F27eAD9083C756Cc2',
           'ts': '2023-10-26T16:39:11',
           'tx_hash': '0xd8920696476e9843af37d029f5f000e2406036fc2fd779cebc61c1ad1e4df483',
           'tx_index': 7},
          {'address': '0x5C69bEe701ef814a2B6a3EDD4B1652CB9cc5aA6f',
           'arg3': 274424,
           'block_number': 18435593,
           'chain_id': 1,
           'log_index': 1,
           'pair': '0x5c3ACF9aF95dcb01E45F699E340Dc7622136f658',
           'token0': '0xA9Bd591529cea30FC3815fc130F67E38073E175A',
           'token1': '0xC02aaA39b223FE8D0A0e5C4F27eAD9083C756Cc2',
           'ts': '2023-10-26T16:31:11',
           'tx_hash': '0x27917179b625cbbdd777faddf6f9c69f3b239ae1a2035ed9ccf1f91e1e3366b5',
           'tx_index': 0},
          {'address': '0x5C69bEe701ef814a2B6a3EDD4B1652CB9cc5aA6f',
           'arg3': 274423,
           'block_number': 18435576,
           'chain_id': 1,
           'log_index': 160,
           'pair': '0x0E2370D742Ad12f8137c7359bEAD27cBd5BC4272',
           'token0': '0x6286A7A8094A011503e0D6cddA2Cb1eEA557B00b',
           'token1': '0xC02aaA39b223FE8D0A0e5C4F27eAD9083C756Cc2',
           'ts': '2023-10-26T16:27:47',
           'tx_hash': '0xef8bb241932941315ebee5f1271c32217165a7831ebd21093ee1720a7e89078b',
           'tx_index': 62},
          {'address': '0x5C69bEe701ef814a2B6a3EDD4B1652CB9cc5aA6f',
           'arg3': 274422,
           'block_number': 18435565,
           'chain_id': 1,
           'log_index': 348,
           'pair': '0xe4DE2CabcCab3132e01e907A89Cf1C1abE1166F1',
           'token0': '0x3cBA4AC389D774D1B5f1391CeAcb4A1717b6fc60',
           'token1': '0xC02aaA39b223FE8D0A0e5C4F27eAD9083C756Cc2',
           'ts': '2023-10-26T16:25:35',
           'tx_hash': '0xf5f1cc667aa1320c9a0bccd6764a5f6ddf2cde7f66265d1a76490d5ba750c536',
           'tx_index': 159},
          {'address': '0x5C69bEe701ef814a2B6a3EDD4B1652CB9cc5aA6f',
           'arg3': 274421,
           'block_number': 18435514,
           'chain_id': 1,
           'log_index': 177,
           'pair': '0x979250B9eE2c2f7ba75A85291E868dA92743844B',
           'token0': '0xBB45b1811D31ACE065a22D93292cCD90FBF9aCCD',
           'token1': '0xC02aaA39b223FE8D0A0e5C4F27eAD9083C756Cc2',
           'ts': '2023-10-26T16:15:23',
           'tx_hash': '0x375717a5ca09b8d0436bd30f60bb8c191940dde2cb348207a4fa962d89410c19',
           'tx_index': 71},
          {'address': '0x5C69bEe701ef814a2B6a3EDD4B1652CB9cc5aA6f',
           'arg3': 274420,
           'block_number': 18435513,
           'chain_id': 1,
           'log_index': 152,
           'pair': '0xa201D1F21fc63fCbbFf32D7636270eddAe1c2e98',
           'token0': '0x190A98422C3Ac27AcCb25268eE5b0091A0bCb7B1',
           'token1': '0xC02aaA39b223FE8D0A0e5C4F27eAD9083C756Cc2',
           'ts': '2023-10-26T16:15:11',
           'tx_hash': '0xa2a3acc35f19bd886ec835f1aee1e35a05d6398e11d2ca4f87d2420efd40f58a',
           'tx_index': 101}]}

Cool, right?

Notice how “token1” is always the same: 0xC02aaA39b223FE8D0A0e5C4F27eAD9083C756Cc2. That’s exactly what we need, since this is the address of WETH.

On that specific query, we got the logs related to pair creations. So the query is looking on Uniswap V2 factory to find the last 10 WETH pairs created, bringing the transactions hash, pair creation timestamp, token address and, of course, pair address.

But there is another important part we are missing before we get 100k tokens and start analyzing them: Security.

Addressing Security

There is a general sentiment, an intuition, that almost all ERC-20 tokens are scams. They can be scams in 2 senses:

  1. Security: There is something in the smart contract, like a malicious function, that can drain the funds of those who interact with it, for example.
  2. Lack of Liquidity and Market: Despite being safe, the token simply doesn’t have enough movement, which causes it not to form a price and die out in a short time.

We need to address that somehow, right? But how to measure the security of a given smart contract?

Here, we don’t need to reinvent the wheel. We can rely on what other talented developers already built. I’m talking about the platform GoPlus.

GoPlus is a platform that, among other things, analyzes smart contracts and returns a security diagnosis with various elements. Of course, it is not perfect – and there is no perfect platform – but it is an intuitive tool that allows us to save a lot of analysis time.

Here is an example on how to integrate witht he GoPlus API to run your first analysis:

In this example, we are specifically analyzing the security of the token 0x6d4e673a2aa644eb50708a7ca098fb9b0c45b1b5.

But what parameters should we use to run this analysis? Here, we need to study GoPlus Token Security API Docs to understand all the parameters it returns.

There are 42 parameters returned by the GoPlus diagnosis. Some may be much more important than others, depending on your goals. For example, this is was is returned when we run the analysis for the token I showed you above:

{
  "code": 1,
  "message": "OK",
  "result": {
    "0x6d4e673a2aa644eb50708a7ca098fb9b0c45b1b5": {
      "anti_whale_modifiable": "0",
      "buy_tax": "0",
      "can_take_back_ownership": "0",
      "cannot_buy": "0",
      "cannot_sell_all": "0",
      "creator_address": "0x8a1fb46d2638bc18bed1db7741606e9c7abbfab6",
      "creator_balance": "0",
      "creator_percent": "0.000000",
      "dex": [
        {
          "name": "UniswapV2",
          "liquidity": "4070.29258864",
          "pair": "0x8860255d305c5e37d79cee41e6d6a51524ede36d"
        }
      ],
      "external_call": "0",
      "hidden_owner": "0",
      "holder_count": "116",
      "holders": [
        {
          "address": "0x8860255d305c5e37d79cee41e6d6a51524ede36d",
          "tag": "UniswapV2",
          "is_contract": 1,
          "balance": "362353474.6135118",
          "percent": "0.362353474613511800",
          "is_locked": 0
        },
        {
          "address": "0xb0f7e21fefafa7dfe4a5d9fc7458dadb308328bd",
          "tag": "",
          "is_contract": 0,
          "balance": "46080216.93181042",
          "percent": "0.046080216931810420",
          "is_locked": 0
        },
        {
          "address": "0x7d9b0bf93ff420c19a5f4841cde170d13d9c5297",
          "tag": "",
          "is_contract": 0,
          "balance": "45285355.30053709",
          "percent": "0.045285355300537090",
          "is_locked": 0
        },
        {
          "address": "0x059e0945d0432810b1a67805779945fda7bd6c9e",
          "tag": "",
          "is_contract": 0,
          "balance": "33512718.73669074",
          "percent": "0.033512718736690740",
          "is_locked": 0
        },
        {
          "address": "0xc5837956daea6806b1812b187793ef07bb54dc37",
          "tag": "",
          "is_contract": 0,
          "balance": "30475916.71658086",
          "percent": "0.030475916716580860",
          "is_locked": 0
        },
        {
          "address": "0xd5fe0cd51d5891714ebe8410b4cd0290480d6900",
          "tag": "",
          "is_contract": 0,
          "balance": "29584592.32886121",
          "percent": "0.029584592328861210",
          "is_locked": 0
        },
        {
          "address": "0x01ed6703e18b764717d1eacf76387dd256f69b08",
          "tag": "",
          "is_contract": 0,
          "balance": "28564347.75565186",
          "percent": "0.028564347755651860",
          "is_locked": 0
        },
        {
          "address": "0x4daff7a283b874228b3e3a7688539e7a94301497",
          "tag": "",
          "is_contract": 0,
          "balance": "27069699.05131456",
          "percent": "0.027069699051314560",
          "is_locked": 0
        },
        {
          "address": "0x370e93e03cf6d18aec158989d8b8adcfd6045e78",
          "tag": "",
          "is_contract": 0,
          "balance": "26112526.39410326",
          "percent": "0.026112526394103260",
          "is_locked": 0
        },
        {
          "address": "0x03c65fdbff50d9032b7161cf65e6b10dad736692",
          "tag": "",
          "is_contract": 0,
          "balance": "22920819.8898935",
          "percent": "0.022920819889893500",
          "is_locked": 0
        }
      ],
      "honeypot_with_same_creator": "0",
      "is_anti_whale": "1",
      "is_blacklisted": "0",
      "is_honeypot": "0",
      "is_in_dex": "1",
      "is_mintable": "0",
      "is_open_source": "1",
      "is_proxy": "0",
      "is_whitelisted": "1",
      "lp_holder_count": "2",
      "lp_holders": [
        {
          "address": "0x000000000000000000000000000000000000dead",
          "tag": "",
          "is_contract": 0,
          "balance": "0.281424945589404773",
          "percent": "0.999999999999996446",
          "is_locked": 1
        },
        {
          "address": "0x0000000000000000000000000000000000000000",
          "tag": "Null Address",
          "is_contract": 0,
          "balance": "0.000000000000001",
          "percent": "0.000000000000003553",
          "is_locked": 1
        }
      ],
      "lp_total_supply": "0.281424945589405773",
      "owner_address": "0x0000000000000000000000000000000000000000",
      "owner_balance": "0",
      "owner_change_balance": "0",
      "owner_percent": "0.000000",
      "personal_slippage_modifiable": "0",
      "selfdestruct": "0",
      "sell_tax": "0",
      "slippage_modifiable": "0",
      "token_name": "TokenFi Classic",
      "token_symbol": "TokenFiC",
      "total_supply": "1000000000",
      "trading_cooldown": "1",
      "transfer_pausable": "0"
    }
  }
}

Okay! We are making some beautiful progress here.

Now, some choices need to be made:

  • I don’t necessarily need to analyze everything that GoPlus returns.
  • For that, I had to study each of the parameters it returns and decide what, in my view, were non-negotiable security precepts.

‘is_open_source’, for example, will return 1 if the contract is open_source. Not being open_source can be very dangerous, so that’s non-negotiable. ‘sell_tax’ and ‘buy_tax’ are important as well. Some scammers insert a huge amount of sell_tax in their contracts so users can buy them, but it will be severly penalized if they try to sell it, with a huge sell_tax. It’s not uncommon, for example, to see 99% sell_tax in a lot of tokens. A clear scam. These are important as well. Based on the market, I saw that good limits for sell_tax and buy_tax had a limit of 2%, so we will consider this as our threshold. slippage_modifiable is also important, because it says if a contract owner can change buy and sell tax. So even if buy and sell tax are not high, the owner can later change them, harming users.

Based on everything I studied on GoPlus documentation and in the market, I decided that I would take 16 parameters into account. Again, they are not the same: some are non-negotiable, like is_proxy or is_opensource. Others are okay if GoPlus don’t find them, returning NaN or “”.

Putting it more clearly: some parameters will not be returned sometimes, because GoPlus was not able to identify them. Some of them, for us, are non-negotiable, so it they are not returned, we will tag that token as not secure. Others, however, are potentially less harmful, so if GoPlus don’t find them, that’s a risk we can take (‘cannot_sell_all’ is an example).

We don’t want our security check to be like this, right?

via GIPHY

Great! We have defined our parameters for security. The next step is to put all the parameters together and apply a logic that, if the token passes in all parameters, the token is secure, so we will tag it as “can_buy = True”. If any of the parameters fail, the token is risky (remember, in my own judgment, some people may thing that 5% sell_tax is fine), so we will tag it as “can_buy = False”.

Here is part of the code:

Perfect!

Now what we can do is to run both BlockSync and GoPlus together. Remember:
1) We will use BlockSync to get pair listings and getting their information, like pair address, token address, timestamp and so on.

2) Then, we will use that information in GoPlus to run a security analysis for each token.

We can do that by running 2 functions together and then, we store the results in MongoDB. Do you want to see some numbers? I got some numbers for you:

89,039.

This is the total number of ERC-20 tokens we got using BlockSync and run the analysis on GoPlus. In securityInfo we have all those fields we got from GoPlus. Then, we got 16 of those fields to tag each token as can_buy True or False.

Now, let’s see, among all those tokens analyzed, how many of them are secure, based in GoPlus standards and our own judgment as well:

Out of 89,039, only 18,069 are considered secure for buyers. That means 20.29%. I gave them the name “viable_tokens”.

Not a bad ratio, I would say, but for sure it’s a point that reinforces that a lot of tokens out there are scams. I like some visualizations, so let’s use Matplotlib to plot a chart demonstrating this difference:

Awesome! We discovered that only 1 out of every 5 ERC-20 tokens launched is secure for holders. What else can we discover?

What can Blockchain tell us?

Our next step is to analyze viable_tokens and try to answer questions such as:

  • What is the average number of holders?
  • How much ETH is pooled in these tokens?
  • What is the wallet profile of the top 10 holders?

Again, we need to make some decisions. The first one is: we need to define a timeframe for analysis. Due to API limits and also to obtain quicker results, I decided that I would analyze only the first hour of each pair, starting from the time it was created.

Great, time to move on!

We have almost 20k tokens remaining for our analysis. Now, the important question is: Which metrics will we get and how can we get those metrics?

One of the best ways to do that is to simply query all the transactions of a given token in a given timeframe. From the transactions, we can get thousands of different metrics, we just need to know how to interpret them and what to do with all that data.

And now, another important point: How can we get that type of information?

We have famous platforms, like Infura, Alchemy, and Etherscan. I’ve used all those for different purposes, but for this research, I’ve used Bitquery to get historical data regarding the pairs.

Using Bitquery GraphQL APIs we can build a pipeline to get the data we need.

Due to privacy reasons, I will not open the whole code here, but here you can have an idea on what it looks like:

Each iteration will get information from some place, insert it into the pipeline and use the results to get and calculate new metrics.

This is how a Bitquery query looks like. On this case, we are fetching all the first hour transactions from a given token. We will also need other queries to get all the metrics we want, but this one is responsible for the majority of them:

            {
                EVM(network: eth, dataset: combined) {
                    buyside: DEXTrades(
                    orderBy: {descending: Block_Time}
                    where: {Trade: {Buy: {Currency: {SmartContract: {is: "0xe0f63a424a4439cbe457d80e4f4b51ad25b2c56c"}}}}, Block: {Time: {since: "2023-08-16T02:52:11Z", till: "2023-08-16T03:52:11Z"}}, TransactionStatus: {Success: true}}
                    ) {
                    Block {
                        Number
                        Time
                    }
                    Transaction {
                        From
                        To
                        Hash
                    }
                    Trade {
                        Buy {
                        Amount
                        Buyer
                        Currency {
                            Name
                            Symbol
                            SmartContract
                        }
                        Seller
                        Price
                        }
                        Sell {
                        Amount
                        Buyer
                        Currency {
                            Name
                            SmartContract
                            Symbol
                        }
                        Seller
                        Price
                        }
                        Dex {
                        SmartContract
                        ProtocolVersion
                        }
                    }
                    }
                    sellside: DEXTrades(
                    orderBy: {descending: Block_Time}
                    where: {Trade: {Sell: {Currency: {SmartContract: {is: "0xe0f63a424a4439cbe457d80e4f4b51ad25b2c56c"}}}}, Block: {Time: {since: "2023-08-16T02:52:11Z", till: "2023-08-16T03:52:11Z"}}, TransactionStatus: {Success: true}}
                    ) {
                    Block {
                        Number
                        Time
                    }
                    Transaction {
                        From
                        To
                        Hash
                    }
                    Trade {
                        Buy {
                        Amount
                        Buyer
                        Currency {
                            Name
                            Symbol
                            SmartContract
                        }
                        Seller
                        Price
                        }
                        Sell {
                        Amount
                        Buyer
                        Currency {
                            Name
                            SmartContract
                            Symbol
                        }
                        Seller
                        Price
                        }
                        Dex {
                        SmartContract
                        ProtocolVersion
                        }
                    }
                    }
                }
                }

Let’s talk more about what we are doing here:

  • We are querying ‘buyside’ and ‘sellside’ of a given token in a specific timeframe. In the case of our project, these values are all variables. The token address and the since timestamp we get from our MongoDB. The till timestamp is just the result of the since timestamp + 1hr. We achieve that by doing a simple Python formula with timedelta.
  • I also included a filter ‘TransactionStatus: {Success: true}}’, so we will only retrieve succesfull transactions, since sometimes transactions are initalized and not concluded and they could mess up our metrics.

This is what we get when we run that query. Notice that I included just one buying and one selling transaction because of the size of the results.

{
  "EVM": {
    "buyside": [
      {
        "Block": {
          "Number": "17924813",
          "Time": "2023-08-16T03:49:11Z"
        },
        "Trade": {
          "Buy": {
            "Amount": "16999999.99999999",
            "Buyer": "0x52c77b0cb827afbad022e6d6caf2c44452edbc39",
            "Currency": {
              "Name": "SPX6900",
              "SmartContract": "0xe0f63a424a4439cbe457d80e4f4b51ad25b2c56c",
              "Symbol": "SPX"
            },
            "Price": 1.9260538072124636e-9,
            "Seller": "0x7a250d5630b4cf539739df2c5dacb4c659f2488d"
          },
          "Dex": {
            "ProtocolVersion": "2",
            "SmartContract": "0x52c77b0cb827afbad022e6d6caf2c44452edbc39"
          },
          "Sell": {
            "Amount": "0.032742914722611862",
            "Buyer": "0x7a250d5630b4cf539739df2c5dacb4c659f2488d",
            "Currency": {
              "Name": "Wrapped Ether",
              "SmartContract": "0xc02aaa39b223fe8d0a0e5c4f27ead9083c756cc2",
              "Symbol": "WETH"
            },
            "Price": 519196294.649358,
            "Seller": "0x52c77b0cb827afbad022e6d6caf2c44452edbc39"
          }
        },
        "Transaction": {
          "From": "0x0efa075398b9c2430e46bc47309f4da26fc2f51a",
          "Hash": "0xf386056db6916c9b8e963dbe030710add5439d0f452819a629015131cefbe1c9",
          "To": "0x7a250d5630b4cf539739df2c5dacb4c659f2488d"
        }
      }
    "sellside": [
      {
        "Block": {
          "Number": "17924783",
          "Time": "2023-08-16T03:43:11Z"
        },
        "Trade": {
          "Buy": {
            "Amount": "0.050000000000000000",
            "Buyer": "0x52c77b0cb827afbad022e6d6caf2c44452edbc39",
            "Currency": {
              "Name": "Wrapped Ether",
              "SmartContract": "0xc02aaa39b223fe8d0a0e5c4f27ead9083c756cc2",
              "Symbol": "WETH"
            },
            "Price": 475144296.1438466,
            "Seller": "0x4cb3c4feeabc84289be38a42d456e8b4ac0086dc"
          },
          "Dex": {
            "ProtocolVersion": "2",
            "SmartContract": "0x52c77b0cb827afbad022e6d6caf2c44452edbc39"
          },
          "Sell": {
            "Amount": "23757214.80719233",
            "Buyer": "0x4cb3c4feeabc84289be38a42d456e8b4ac0086dc",
            "Currency": {
              "Name": "SPX6900",
              "SmartContract": "0xe0f63a424a4439cbe457d80e4f4b51ad25b2c56c",
              "Symbol": "SPX"
            },
            "Price": 2.104623812420253e-9,
            "Seller": "0x52c77b0cb827afbad022e6d6caf2c44452edbc39"
          }
        },
        "Transaction": {
          "From": "0x4cb3c4feeabc84289be38a42d456e8b4ac0086dc",
          "Hash": "0x24f7d8fd8e8c9c74129026e2efb1d7d56438f71a25f8d6ba2b50c3d07e77d7d0",
          "To": "0x7a250d5630b4cf539739df2c5dacb4c659f2488d"
        }
      }

‘buyside’ and ‘sellside’ refer to the different sides of transactions. Buyside will bring us all the transactions that sold the given token in exchange for WETH. Sellside will bring us all the transactions that sold WETH in exchange for the given token.

Inside each one of them we will have the block and the time, and also the trades, buy and sell, with amounts and price of each token. We also have the ‘Transaction’ value, with transaction hash, from and to address.

When we run that for a token, depending on the number of transactions during its first hour, we can receive a small or a giant JSON as response. Our mission now is to iteration through that JSON to get the metrics we need.

Building the metrics

I also built a function to divide the JSON with all the transactions in chunks of data. For example, if we want to analyze how a token evolves every 5 minutes, we can insert ‘5’ in ‘chunks_of_time_in_minutes’. For this analysis, I chose 10 minutes, so we should have 6 chunks of data.

And this is how the function looks like:

Since ‘buyside’ and ‘sellside’ are different blocks, I joined them into a new dataset called all_transactions and them sorted it by ‘block’, ‘time’.

Quick Recap on what we did until now:

  • We used BlockSync to automatically get important information, like token address, pair address, and pair creation timestamp, about almost 100,000 ERC-20 tokens.
  • Then, we run that together with GoPlus, with the parameters we studied and defined to label tokens as secure (can_buy: True) or non-secure (can_buy: False).
  • We stored all that data on MongoDB and then, we saw that 20% of tokens were labeled as secure.
  • The next step was to build the Bitquery query to get the transactions related to that token and the quote currency we chose (WETH).

Here, we need to address an important point: I said before that the transactions query was very important, but this one was not the only one we had. We can only get some metrics with specific queries, since their information is not contained in the transactions query.

Here is an example:

In that query, we are getting the Pooled WETH amount on that first hour. This is not an information available in the transactions query, so we need an specific query just for that.

Another query is related to what I called “Fishes, Tunas, Sharks, and Whales”.

Depending on the source, people will refer to Whales as those who own more than $1 million or $10 million of a particular token. However, since we are talking about quite small tokens and not so much volume, I do not believe that this definition would be the best for our use case.

Thus, I created my own definition of these terms. The following query checks the balance of the top 10 holders of that token. I get the top 10 holders doing some math with the transactions JSON. This balance is checked in WETH, USDC, USDT, and ETH. I check only in those because they are the biggest tokens usually used in trading, so we would be covering a lot of cases.

I only check the top 10 holders because checking all the holders would be an extremely costly task in terms of computing and API consumption, depending on the number of holders.

I also run another query with those top 10 holders addresses to get the timestamp of their first ever transaction:

Then, I get all that information we extracted and we can perform some math with it.

With the balances, multiply ETH and WETH balances by WETH current value. We get that information using CryptoCompare API. With USDC and USDT we don’t need to multiply since they are stablecoins. Although the calculation will not be perfect, it’s a very good approximation.

Then, we store the balances. If a holders has more than $ 100k in the sum of the 4 coins or in at least one of them, it’s labeled as a Whale. If it’s between $ 10k and $ 99k, it’s a Shark. If it’s between $ 1k and $ 9.9k, it’s a Tuna. And if it’s less than $ 1k, it’s a Fish.

Of course, that’s not a perfect labeling, but it’s good enough for our specific use case.

We also label them based in their first transaction. Remember that we got their first transaction timestamp in one of our queries?

Here, we get that information to know if that one is a Fresh or Non-Fresh Wallet. If the first transaction of that wallet occured in the last 3 days before the pair creation, that address is labeled as a “Fresh Wallet”. Pretty cool, right?

This is important because Fresh Wallets can, in a lot of cases, suggest insider trading. Let me know if you wanna know more about that.

Of course, this is just a tiny part of all the equations I use to get all the metrics I want. In total, when performing this whole code for a given token, and later transforming it into a dataframe, it will have almost 300 columns/features.

I also needed to address some issues, for example, involving missing values. Sometimes, when a specific action is performed on the token, it can cause the ‘buyside’ and ‘sellside’ metrics to fluctuate wildly, which in turn alters the logic behind the ‘Price’ calculation. I addressed this issue using the Interquartile Range (IQR) method. Below is an example of this phenomenon, which can significantly distort our metrics for that particular token. Fortunately, this scenario is quite rare, occurring only in about 3% to 4% of the tokens.

Show me the Data

Okay, so we got the code to perform all the metrics we want. Next step was to run all the tokens we collected in the previous steps into it and then, preprocess that JSON and transform it into a pandas dataframe so we can have some cool visualizations.

Here is an example of the preprocessing code:

This is longer than it looks like, since our metrics were loaded into a very nested JSON, I needed to perform some transformations here. However, at the end of the day, we got it. Here is an example of our .head():

Then, we can start playing a bit with all the metrics we have. Let’s see, for example, some stats related to number of holders of all those 18k tokens at the end of their first hour:

As you can see, even though from a safety standpoint these tokens are secure, a good portion of them have an almost insignificant number of holders in their first hour, although outliers exist.

Now let’s see the same visualizations, but for the total number of trades during the first hour.

As expected, data distribution is pretty the same with holders count. We can see that some outliers exist, but almost all tokens end up the first hour with a very small number of trades.

How about fishes, tunas, sharks, and whales?

We can clearly see that, as one could expect, wallets with low balances are much more present in this kind of token, at least in their first hours. Important: these numbers may be that low because, as we saw that a lot of tokens end up with 0 holders, of course, their number of fishes, tunas, and so on, will be 0.

So, let’s try something different: We will check those same numbers, but only for pairs that had at least 100 holders at the end of their first hour:

Okay, that’s different from the last one. You can see that here, the sum of the averages is 10, the number of top holders we check to build the labeling.

Now, let’s see how prices of these tokens change on their first hour. We have a feature called ‘price_growth_rate’, that measures the growth of the price between the first chunk (10 minutes) and the last one (60 minutes).

As we can see, we are dealing with pretty big (or small) numbers here. Since this kind of token usually have values with a loooot of 0s, their changes will be big, impacting our metrics.

Lastly, let’s analyze the total volume of these tokens during its first hour:

As we can see, the majority of tokens have small values of volume during their first hour, although we have outliers that make the average to be huge (more than 25,000 WETH).

We can also try some clustering. I will use ‘1hr_total_trades’ as our main variable. First, we need to use the Elbow Method to discover the optimal number of clusters for our ‘1hr_total_trades’ data.

As we can see, the biggest drop is between 1 and 3 clusters, and then the WCSS value starts to level-off. This suggests that 3 is the optimal number of clusters to divide our data based on ‘1hr_total_trades.’. Based on that, we can now build our K-Means Clustering with 3 clusters. This is what we get:

Pretty cool, right? Visually, it seems to be a different clusterization between tokens with 0 and > 300 trades, between 300 and 1,000, and more than 1,000.

Let’s try another one. This time, we will be clustering our data based on ‘holders_count_chunk_6’ and ‘1hr_total_buying_trades’. The Elbow Method for this new data will bring us this:

As we can see, the optimal number seems to be between 2 and 3 clusters. Let’s go with 3. Plotting the clustering taking into account ‘holders_count_chunk_6’ and ‘1hr_total_buying_trades’ will give us this:

Here, we can see a distiction with tokens falling into these 3 clusters based on their number of holders and number of buying trades.

Lastly, we can get the correlation heatmap for some of our metrics. Here are the ones we will be checking: ‘1hr_total_buying_trades’, ‘1hr_total_selling_trades’, ‘1hr_pooled_weth’, ‘number_of_whales’, ‘holders_count_chunk_1’, ‘holders_count_chunk_6’.

Pretty interesting things here!

We can see that both holders count for chunk 1 and chunk 6 are correlated with buying trades and sellings trades. However, chunk 6 is stronger. We can also see that the number of whales and pooled WETH are not correlated with the other metrics. Also, holders count for chunk 1 and chunk 6 are correlated, but not perfectly correlated. This implies that, although a large number of holders in the first chunk may suggest a high number of holders in chunk 6, it’s not guaranteed.

Of course, these are just some of our first impressions. We have hundreds of features in our dataset that we could explore.

Conclusions

I have to say that this was a pretty fun research.

I worked with different APIs and platforms, and I had to deal with various problems such as division by zero and missing values. I also applied various mathematical skills to extract and calculate all the metrics I needed.

Based on all the data I gathered and processed, here are some quick conclusions:

  • We can indeed say that the ERC-20 token market is extremely risky. As we saw, nearly 80% of the tokens failed security requirements. Of the remaining 20%, the majority end up dying in the first hour, not attracting enough holders and transactions for a powerful network effect.
  • The crypto market is fascinating and still resembles the wild west. For those looking to invest in this type of token, a lot of diligence and caution are required.
  • This doesn’t mean that this market isn’t fascinating. I continue to strongly believe that it is one of the most significant technological revolutions of this century and will change many aspects of our lives.

I got more conclusions than that, naturally, but as I stated before, due to privacy reasons, I can’t talk about all of them in this article.


If you are interested in talking about data, writing, blockchain or this research, you can contact me at tiago@inevitable.education.

Leave a Reply

Your email address will not be published. Required fields are marked *