Hey guys! @micheleicebergnodes sorry I missed your message.
Quick update. I am very happy with the status of the indexer as it is now, so I am moving on to more standard stuff like UI, accounts, etc. A quick explanation of why the indexer has taken so long:
The goal of the indexer is to extract any new value (whether stored on chain or inferred from event values) for any block. This is akin to taking an entirely new snapshot of the blockchain state for every block, which is obviously computationally impossible within the block-time constraints. To make it feasible, we instead take a snapshot of only the values that have changed in that block. We assume that if something has changed (for example, the balance of an account), then the corresponding input to its function (the account, in this case) will appear inside an event or extrinsic, or in the output values of other methods.
For each block, we then call every pallet method with all possible combinations of its inputs, drawn from the event values and from the values returned by other pallet methods. Most of the results are null or don't carry new information; only a small percentage of the returned values do.
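To make the "call everything with everything" idea a bit more concrete, here is a minimal sketch of the per-block loop, assuming a Polkadot.js-style API. `argCombinations` and `hasNewInformation` are hypothetical stand-ins for the real machinery, which is metadata-driven and much more involved:

```ts
import type { ApiPromise } from '@polkadot/api';

// Hypothetical helpers, trivially stubbed for the sketch; the real versions use the
// storage item's metadata to build argument tuples of the right type and arity.
const argCombinations = (_storageFn: unknown, candidates: unknown[]): unknown[][] =>
  [[], ...candidates.map((c) => [c])];
const hasNewInformation = (value: any): boolean =>
  value != null && !(value.isEmpty ?? false);

// For one block: call every storage item of every pallet with every candidate input
// and keep whatever decodes into something new. Outputs feed back as new candidates.
async function indexBlock(api: ApiPromise, blockHash: string, candidateValues: unknown[]) {
  const at = await api.at(blockHash); // pin all queries to this block's state
  const snapshot: { pallet: string; method: string; args: unknown[]; value: unknown }[] = [];

  for (const [palletName, pallet] of Object.entries(at.query as Record<string, Record<string, any>>)) {
    for (const [methodName, storageFn] of Object.entries(pallet)) {
      for (const args of argCombinations(storageFn, candidateValues)) {
        try {
          const value = await storageFn(...args);
          if (hasNewInformation(value)) {
            snapshot.push({ pallet: palletName, method: methodName, args, value });
            candidateValues.push(value); // outputs become candidate inputs for other methods
          }
        } catch {
          // Most combinations are invalid or empty; that is expected and ignored.
        }
      }
    }
  }
  return snapshot;
}
```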
Calling every method over its potential argument space is very hard for a number of reasons:
- Inputs depend on outputs: returned results create more input combinations, which then return more results, and so on. A method may produce an output that is used as an input to another method, which produces a new output that feeds yet another method, etc. This has to work concurrently (see the worklist sketch after this list).
- Determining when all possible results have been produced is challenging due to the concurrent and recursive nature of the problem.
- Inputs can be objects too (not just primitives), so constructing all possible objects is not trivial.
- The metadata is dynamic and can change from block to block.
- Objects can have recursive structures, so manual bounds are needed.
- We allow custom output-to-input matching as an option to speed up indexing, although it's not strictly required.
- Calling .entries on some methods helps, but it does not work everywhere and has its own limitations (unfortunately, entries may occasionally return nothing without erroring, even though there is data; see the guarded sketch after this list).
- The load the indexer puts on the Substrate node is so high that the client crashes about once every 10 minutes (we have reported the two errors on GitHub and both are known Substrate issues). To manage this, we maintain more than one full node on each machine, which also helps speed things up. CPU utilization is around 80% most of the time.
- Extracting block information takes 20-40 seconds, but this can be parallelized, so it's OK. Nevertheless, we have to use several Node workers to get past the single-thread limit and reach a practical processing speed.
- Even with all the optimizations, the indexer requires some heuristics to avoid processing useless and chunky data (e.g. runtime binaries).
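On the first two bullets (outputs feeding back as inputs, and knowing when you are done), conceptually it is a concurrent worklist/fixpoint computation. A minimal sketch, with `callAllMethodsWith` as a hypothetical stand-in for "call every pallet method with this value" (the real version also bounds concurrency):

```ts
// Newly discovered values are fed back as candidate inputs until no call produces
// anything new. Counting in-flight calls is a simple way to detect that the
// concurrent expansion has fully drained.
type Value = string; // simplified; real values are decoded objects

async function expandToFixpoint(
  seeds: Value[],
  callAllMethodsWith: (input: Value) => Promise<Value[]>, // hypothetical
): Promise<Set<Value>> {
  const seen = new Set<Value>(seeds);
  const queue: Value[] = [...seeds];
  let inFlight = 0;

  while (queue.length > 0 || inFlight > 0) {
    if (queue.length === 0) {
      await new Promise((r) => setTimeout(r, 10)); // wait for in-flight calls to land
      continue;
    }
    const input = queue.shift()!;
    inFlight++;
    callAllMethodsWith(input)
      .then((outputs) => {
        for (const out of outputs) {
          if (!seen.has(out)) { // only unseen values create more work
            seen.add(out);
            queue.push(out);
          }
        }
      })
      .finally(() => { inFlight--; });
  }
  return seen; // fixpoint reached: nothing new was produced and nothing is pending
}
```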
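And for the .entries caveat, this is roughly the guard we mean, sketched with Polkadot.js against system.account as an example storage item (`knownKeys` is a hypothetical list of keys already discovered from events/extrinsics):

```ts
import type { ApiPromise } from '@polkadot/api';

// entries() can come back empty without throwing, so when that happens we fall
// back to explicit lookups of the keys we already know about.
async function readAccounts(api: ApiPromise, knownKeys: string[]) {
  const entries = await api.query.system.account.entries();
  if (entries.length > 0) {
    return entries.map(([key, value]) => ({ key: key.toHuman(), value: value.toHuman() }));
  }
  // Empty result with no error: query each known key individually instead.
  return Promise.all(
    knownKeys.map(async (addr) => ({
      key: addr,
      value: (await api.query.system.account(addr)).toHuman(),
    })),
  );
}
```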
All this happens in the second step of a 3-step cloud-based pipeline. In the final step, all the data is processed to remove repeated values and is ingested into a custom graph/timeseries database. Calculation of averages, sums, etc. happens with triggers inside the database as the values come in.
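The de-duplication before ingestion is conceptually simple; a sketch of the idea, where the record shape and the series key are assumptions for illustration only:

```ts
// For each (pallet, method, args) series, keep a value only if it differs from the
// last value already ingested for that series.
interface IndexedValue { pallet: string; method: string; args: unknown[]; value: unknown; block: number }

function dropRepeatedValues(records: IndexedValue[], lastIngested: Map<string, string>): IndexedValue[] {
  const kept: IndexedValue[] = [];
  for (const r of records) {
    const series = `${r.pallet}.${r.method}(${JSON.stringify(r.args)})`;
    const encoded = JSON.stringify(r.value);
    if (lastIngested.get(series) !== encoded) { // value actually changed
      lastIngested.set(series, encoded);
      kept.push(r);
    }
  }
  return kept;
}
```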
Surprisingly enough, all this works. Failures are inevitable due to system overload, inter-process communication problems, IO throttling, memory limits, etc., but everything is fully recoverable at the Node.js worker level (per block) and at the system level (per block batch). So, if a 100-block batch fails, the system is designed to durably repeat that batch (on the same machine or another machine) using the Temporal durable-execution engine.
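Conceptually the recovery side looks like this; a minimal sketch with Temporal's TypeScript SDK, where `indexBlock` is a hypothetical activity of ours and the timeout/retry numbers are illustrative, not our exact settings:

```ts
import { proxyActivities } from '@temporalio/workflow';

// The batch is a workflow, each block is an activity. If a worker dies, Temporal
// re-dispatches the pending activity on the same or another machine; blocks already
// completed are not re-executed, thanks to workflow history replay.
const { indexBlock } = proxyActivities<{ indexBlock(blockNumber: number): Promise<void> }>({
  startToCloseTimeout: '5 minutes',
  retry: { maximumAttempts: 10 }, // illustrative retry policy
});

export async function indexBatchWorkflow(startBlock: number, batchSize = 100): Promise<void> {
  for (let n = startBlock; n < startBlock + batchSize; n++) {
    await indexBlock(n); // durably retried on failure
  }
}
```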
Our current indexing time is 2.7 seconds per block, which means it will still take some time to catch up to the tail, but it is doable. The bottleneck is actually the ingestion part (3rd step), because that step cannot be parallelized across multiple blocks. I can improve it further, but it's not a priority for now.
Looking forward to delivering a demo before year end!