[Proposal: MB11 & 30 / MR8 & 26] StakeGlmr.com and StakeMovr.com V2 - Treasury Proposal

Hey guys! @micheleicebergnodes sorry I missed your message.

Quick update. I am very happy with the status of the indexer as it is now, so I am moving on with more standard stuff like UI, accounts, etc. A quick explanation of why the indexer has taken so long:

The goal of the indexer is to extract any new value (whether stored on chain or inferred from event values) for any block. This is akin to taking an entire new snapshot of the blockchain state for every block, which is obviously computationally impossible to achieve within the block time constraints. To make it feasible, we instead take a snapshot of only the values that have changed in that block. We assume that, if something has changed (for example, the balance of an account), then the corresponding input to its query function (the account in this case) will appear inside an event or extrinsic, or in the output values of other methods.

For each block, we then call every pallet method with all possible combinations of inputs drawn from the event values and from the values returned by other pallet methods. Most of the results are null or carry no new information; only a small percentage of the returned values do.
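To make the idea concrete, here is a minimal sketch using @polkadot/api: pull candidate argument values from a block's events and probe every storage query with them. Names are illustrative, only single-argument queries are tried, and this is not the actual implementation.

```ts
import { ApiPromise, WsProvider } from '@polkadot/api';

// Sketch only: probe storage queries with values seen in the block's events.
async function snapshotChangedValues(api: ApiPromise, blockHash: string): Promise<Record<string, unknown>> {
  const at = await api.at(blockHash); // API decorated with the runtime metadata of that block

  // 1. Collect candidate argument values (account ids, indices, ...) from the block's events.
  const events = await at.query.system.events();
  const candidates = new Set<string>();
  for (const { event } of events) {
    for (const arg of event.data) candidates.add(arg.toString());
  }

  // 2. Try every storage query against every candidate value. Most combinations are
  //    invalid or empty; only the few that return new information are kept.
  const results: Record<string, unknown> = {};
  for (const [palletName, pallet] of Object.entries(at.query)) {
    for (const [methodName, method] of Object.entries(pallet)) {
      for (const candidate of candidates) {
        try {
          const value = await (method as any)(candidate);
          if (value && !value.isEmpty) results[`${palletName}.${methodName}(${candidate})`] = value.toJSON();
        } catch {
          // Invalid argument shape for this storage item: expected and ignored.
        }
      }
    }
  }
  return results;
}

// Usage (sketch): const api = await ApiPromise.create({ provider: new WsProvider('ws://127.0.0.1:9944') });
```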

Calling every method over its potential argument space is very hard for a number of reasons.

  • Inputs depend on outputs (the returned results create more input combinations, which then return more results, etc.). So, a method may produce an output that is used as an input to another method to produce a new output, which is used as an input to yet another method, and so on (see the sketch after this list). This has to work concurrently.

  • Determining when all possible results have been produced is challenging due to the concurrent and recursive nature of the problem.

  • Inputs can be objects too (not just primitives), so the construction of all possible objects is not trivial.

  • The metadata is dynamic and can change from block to block.

  • Objects can have recursive structures, so manual bounds are needed.

  • We allow custom output-input matching as an option to speed up indexing, although it's not strictly required.

  • Calling .entries on some methods helps, but it does not work everywhere, and it has its own limitations (unfortunately, entries may occasionally return nothing without erroring, even though there is data)

  • The load the indexer puts on the Substrate node is so heavy that the client crashes about once every 10 minutes (we have reported the 2 errors on GitHub and they are both known Substrate issues). To manage this, we maintain more than one full node on each machine, which also helps to speed things up. CPU utilization is around 80% most of the time.

  • Extracting block information takes 20-40 seconds, but this can be parallelized, so it's OK. Nevertheless, we have to use several Node workers to get past the single-thread limits and achieve a practical processing speed.

  • Even with all the optimizations, the indexer requires some heuristics to avoid processing useless and bulky data (e.g. runtime binaries).
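For the first bullet, the output→input feedback is essentially a bounded fixed-point loop. A rough sketch, where `callAllMethods` is a stand-in for "call every pallet method with these candidate inputs" and the round limit is one of the manual bounds mentioned above:

```ts
// Sketch of the output→input fixed-point loop (illustrative, not the actual implementation).
async function discoverBlockState(
  seedValues: Set<string>,                                   // values pulled from events/extrinsics
  callAllMethods: (inputs: string[]) => Promise<string[]>,   // returns newly produced output values
  maxRounds = 5                                              // manual bound against recursive structures
): Promise<Set<string>> {
  const known = new Set(seedValues);
  let frontier = [...seedValues];

  for (let round = 0; round < maxRounds && frontier.length > 0; round++) {
    const produced = await callAllMethods(frontier);   // can itself fan out concurrently
    // Only values we have not seen before become inputs for the next round;
    // when no new values appear, all reachable results have been produced.
    frontier = produced.filter((v) => !known.has(v));
    frontier.forEach((v) => known.add(v));
  }
  return known;
}
```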

All this happens in the second step of a 3-step cloud-based pipeline. In the final step, all the data is processed to remove repeated values and ingested into a custom graph/timeseries database. Calculation of averages, sums, etc. happens with triggers inside the database as the values come in.

Surprisingly enough, all this works. Failures are inevitable due to system overload, inter-process communication problems, IO throttling, memory limits, etc., but everything is fully recoverable at the Node.js worker level (per block) and at the system level (per block batch). So, if a 100-block batch fails, the system repeats that batch (on the same or another machine) using Temporal's durable execution engine.
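For illustration, this is roughly what the per-batch durable retry looks like as a Temporal workflow. The sketch uses Temporal's TypeScript SDK with a hypothetical activity name; it is not the project's actual orchestrator code.

```ts
import { proxyActivities } from '@temporalio/workflow';

// Hypothetical activity interface; the real activity does the per-block extraction.
interface Activities {
  indexBlockBatch(from: number, to: number): Promise<void>;
}

const { indexBlockBatch } = proxyActivities<Activities>({
  startToCloseTimeout: '30 minutes',
  retry: { maximumAttempts: 5, backoffCoefficient: 2 },
});

// If an activity (or the whole worker machine) dies, Temporal retries the batch,
// possibly on a different worker, and the workflow itself is replayed durably.
export async function indexChainWorkflow(startBlock: number, endBlock: number): Promise<void> {
  const BATCH = 100;
  for (let from = startBlock; from <= endBlock; from += BATCH) {
    await indexBlockBatch(from, Math.min(from + BATCH - 1, endBlock));
  }
}
```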

Our current indexing time is 2.7 seconds per block, which means it will still take some time to catch up to the tail, but it is doable. The bound on this is actually due to the ingestion part (3rd step), because that step cannot be parallelized across multiple blocks. I can improve that further, but it's not a priority for now.

Looking forward to delivering a demo before year end!

9 Likes

Video presentation / update

@turrizt @jose.crypto @dev0_sik @blackk_magiik

5 Likes

Will be giving weekly or biweekly updates going forward. Should have been doing that since the beginning to increase transparency and visibility.

I've spent the last week deploying and testing the orchestrator in a real-world scenario. It became apparent that, because "archive" and "ingest-1" are much faster than the "index" process, there is a block-gap buildup that gets larger over time. The initial implementation used an inefficient data structure of out-of-order blocks to account for gaps caused by parallel processes finishing out of order. Temporal stores the entire state of the workflow as an event history on the cloud, so long gaps (large out-of-order sets) would become problematic.

I initially experimented with a compressed representation of ranges. This worked but made the code too hard to understand and introduced bugs.

In the end, I changed the design by introducing a constraint on the ETL job queue: it must always receive target blocks in sequence, AND it must be guaranteed that its parent ETL has completely processed the blocks up to that target block. This was not previously the case, i.e. a parallel ETL worker could finish processing the block span 2700-2999 and trigger its child to process that span while the parent had not yet finished the earlier span 2000-2299.

The above constraint allowed for a much slicker design at a small efficiency loss.
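A rough sketch of that gating logic (illustrative, not the actual workflow code): the child only receives a new target once the parent's completed spans form a contiguous range up to that target, even though the spans themselves may finish out of order.

```ts
// Parent spans may complete out of order; the child is only notified in sequence.
interface SpanDone { from: number; to: number; }

class SequentialGate {
  private doneUpTo: number;                       // highest block fully covered so far
  private pending = new Map<number, number>();    // span start → span end, waiting for the gap to close

  constructor(startBlock: number, private notifyChild: (target: number) => void) {
    this.doneUpTo = startBlock - 1;
  }

  parentFinished({ from, to }: SpanDone): void {
    this.pending.set(from, to);
    // Advance the contiguous frontier and hand targets to the child only for in-order progress.
    while (this.pending.has(this.doneUpTo + 1)) {
      const end = this.pending.get(this.doneUpTo + 1)!;
      this.pending.delete(this.doneUpTo + 1);
      this.doneUpTo = end;
      this.notifyChild(this.doneUpTo);
    }
  }
}
```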

4 Likes

Ran into some issues with aggregate tables in SurrealDB. Aggregate tables allow the real-time computation of "derivative" values like sums, rolling averages, counts, etc. It looks like aggregate tables are not production-ready yet, and are causing more problems than they are solving. I have decided to introduce a 4th step in the ETL process and avoid using aggregate tables altogether. This is also a cleaner design and will keep storage and computation more clearly separated.

Identified a bug in the ContinueAsNew of the Temporal workflow. Workflows have limited memory and a limited event count; to have an everlasting workflow, you need to continue-as-new. Since we cannot serialize goroutines, we have to bundle the state of all ETLs, serialize it, and unbundle it again. The bug causes a block range to be skipped for ETLs with parallelism > 1.
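For context, this is the general continue-as-new shape (shown with the TypeScript SDK for brevity, although the orchestrator uses the Go SDK; the state type is hypothetical):

```ts
import { continueAsNew } from '@temporalio/workflow';

// Hypothetical serializable snapshot of each ETL's progress.
interface EtlState { name: string; nextBlock: number; parallelism: number; }

export async function orchestrator(states: EtlState[]): Promise<void> {
  let iterations = 0;
  while (iterations < 10_000) {            // keep the event history bounded
    // ...advance each ETL by one batch of blocks and update its nextBlock...
    iterations++;
  }
  // Bundle the (updated) state of all ETLs and restart with a fresh event history.
  // The bug described above was in how parallel ETLs' block ranges were re-bundled here.
  await continueAsNew<typeof orchestrator>(states);
}
```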

Spent some time trying to manage the constant growth of TiKV memory.

Other than that, it was mostly tax week… yiiiikess

3 Likes

Worked on the handover process from batch block processing to few-block processing (block processing is always concurrent because it takes longer to index a block than for the chain to produce it, so cloud-worker and/or thread concurrency is a must). The idea is that the workflow can orchestrate a switch from batch mode to few-blocks mode, and the reverse, should we run into a scenario where we have to catch up to the chain height again.
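The mode-switching decision itself is simple; a sketch with illustrative thresholds:

```ts
// Sketch of the batch ↔ few-blocks handover decision. Thresholds are illustrative.
const BATCH_MODE_GAP = 1_000;   // fall back to batch mode when this far behind the chain
const FEW_BLOCK_GAP = 50;       // switch to few-blocks mode when nearly caught up

type Mode = 'batch' | 'few-blocks';

function nextMode(current: Mode, chainHeight: number, indexedHeight: number): Mode {
  const gap = chainHeight - indexedHeight;
  if (current === 'batch' && gap <= FEW_BLOCK_GAP) return 'few-blocks';
  if (current === 'few-blocks' && gap >= BATCH_MODE_GAP) return 'batch';
  return current;
}
```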

Also, worked on a custom worker for producing the real-time future unbonding charts for the collators. Producing these charts requires knowing the scheduled undelegations of all delegators, and it takes a long time to fetch all this data from the chain. To make the view real-time, we need to keep the state in memory and update it based on events. This is out of the scope of the generic indexer, so I have to make a custom worker for it.
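A minimal sketch of what that worker keeps in memory (event names follow the parachainStaking pallet but should be treated as illustrative here):

```ts
// In-memory view of all scheduled undelegations, updated from staking events.
interface ScheduledExit { delegator: string; collator: string; amount: bigint; executeRound: number; }

const scheduledExits = new Map<string, ScheduledExit>(); // key: `${delegator}:${collator}`

function applyStakingEvent(name: string, data: ScheduledExit): void {
  const key = `${data.delegator}:${data.collator}`;
  if (name === 'DelegationRevocationScheduled') scheduledExits.set(key, data);
  if (name === 'DelegationRevoked' || name === 'CancelledDelegationRequest') scheduledExits.delete(key);
}

// The future unbonding chart for one collator is then just an aggregation over the map.
function unbondingByRound(collator: string): Map<number, bigint> {
  const byRound = new Map<number, bigint>();
  for (const exit of scheduledExits.values()) {
    if (exit.collator !== collator) continue;
    byRound.set(exit.executeRound, (byRound.get(exit.executeRound) ?? 0n) + exit.amount);
  }
  return byRound;
}
```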

2 Likes

Thanks for this update dear Ioannis!!!

Finished the custom indexing for the parachainStaking pallet. This was essentially the third (from-scratch) iteration of the backend. The first one was done in Go and didn't work very well. The second evolved into a generalized version that indexes everything (see video above) and will allow querying the chain with a chat interface. So, writing the staking indexer was a walk in the park and most of the code was boilerplate stuff (11K lines in 3 weeks → thank you Copilot).

I chose not to update some round-related statistics until a round is completed. In the beginning I was doing everything block by block, but for some information that is overkill from both a backend and UI perspective. However, for things that are important or time-sensitive, like rewards, the unbonding curve, and future rankings, updates are near real-time, block by block.

Happy to say that I also discovered a pretty critical bug in the Temporal SDK that was resolved within a couple hours!

5 Likes

The indexer had trouble catching up in the blocks immediately after the round change because they include all the delegation reward distributions. So, I had to apply a few efficiency improvements to ensure that the unbonding charts are near real-time. For example, I added some custom compression on the serialized JSONs, and added some in-memory hashing to avoid DB updates when there are no changes.
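Roughly, those two optimizations look like this (sketch only; the hash and compression codecs actually used may differ):

```ts
import { createHash } from 'node:crypto';
import { gzipSync } from 'node:zlib';

// Remember the hash of the last value written per datapoint key.
const lastHash = new Map<string, string>();

function maybeWrite(key: string, value: unknown, write: (key: string, payload: Buffer) => void): void {
  const json = JSON.stringify(value);
  const hash = createHash('sha1').update(json).digest('hex');
  if (lastHash.get(key) === hash) return;   // nothing changed → skip the DB update entirely
  lastHash.set(key, hash);
  write(key, gzipSync(json));               // compressed payload goes to the DB
}
```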

Kick-started integration with the frontend. Sorted out the frontend model definition to match the backend (the frontend was developed by Abdullah, so I had to bridge various structures and types). Also had to make some revisions to the backend model to reduce the data volume of push updates sent to the client. In the updated version, only one collator (the one selected/expanded) can receive push updates on its unbonding data.

Moved state management to 100% store-based in Svelte with no state in components or pages. The goal is to combine SSR (for SEO and good user experience) with websocket push updates. To make this work, I use the latest state to render the page on the server, but also pass the state to the client side to hydrate the store. The store then takes care of subscribing to updates to keep the data live.
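A simplified sketch of the pattern (store shape and message format are illustrative):

```ts
import { writable } from 'svelte/store';

export interface CollatorState { collators: Record<string, unknown>; blockHeight: number; }

// Single source of truth: no state lives in components or pages.
export const collatorStore = writable<CollatorState>({ collators: {}, blockHeight: 0 });

export function hydrateAndSubscribe(ssrState: CollatorState, wsUrl: string): void {
  collatorStore.set(ssrState);                       // hydrate the store from the server-rendered state
  const ws = new WebSocket(wsUrl);
  ws.onmessage = (msg) => {
    const update = JSON.parse(msg.data) as Partial<CollatorState>;
    collatorStore.update((state) => ({ ...state, ...update }));  // push updates keep the view live
  };
}
```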

5 Likes

Thanks a lot Ioannis for these regular updates, amazing job ongoing here, please keep on this way!!!

1 Like

Hey guys, I need some advice on a database infrastructure dilemma.

Setup 1
SurrealDB with a 3+ node TiKV cluster. This is a typical distributed setup with 2x redundancy.

Cons

  • A bit slow
  • Not very stable. Both Surreal and TiKV crash under heavy load due to timeouts, region leader switching, and other issues.

Pros

  • Storage is infinitely scalable
  • Can host ALL data in one database (both chain data and off-chain data like collator thumbnails, etc.)
  • Infinitely scalable CPU/IO
  • 2x (or whatever) redundancy

Setup 2
(SurrealDB and RocksDB in one box) x 2
Since blockchain data is immutable, we can simply launch independent database nodes without worrying about data syncing between them. They will all sync to the chain.

Cons

  • Max storage of 45 TB (NVMe)
  • Non-chain data must be stored in another database because it is not immutable.

Pros

  • SurrealDB has been tested much more with RocksDB; they are also building their own SurrealKV storage engine, which will be single-node and a RocksDB replacement. They have openly said that they will prioritize stability and support for SurrealKV.
  • Muuuuuch faster (compare a 44-core Hetzner box with no network sync requirements to three 6-core boxes that need to sync between them)
  • Infinitely scalable CPU/IO
  • 2x (or whatever) redundancy

The most common setup is #1. However, since we are only syncing to one central source of truth, I think we can do without the benefits (and burdens) of a distributed database. Perhaps there is something I am missing?

I should add that this is the database that powers the AI chat, so we are talking a few TBs to store all data points + graph relations.

2 Likes

Hey @stakebaby, what's the operating cost difference between the 2 setup options?

For #1, how long is the outage when it crashes? What's the crash frequency?

Roughly speaking…

Setup 1
3 servers at $100 each

  • horizontal scaling → more servers → more expensive
  • some extra time for maintenance

Setup 2
1 server at $200 and a backup at $100

  • vertical scaling → more NVMe drives → a bit cheaper

So, the price is roughly the same between the two setups and much, much lower compared to the old DynamoDB setup. These servers also run the workflow and activity workers which is the bulk of the processing.

1 Like

For me at least, the less asset management the better. Regarding the 45 TB limitation… how long do you think you have until you reach that? What challenges does that impose… deploying another stack in parallel?

It should be several years. By that time, SurrealDB should have their own production-ready distributed SurrealKV, so we can switch to that for infinite storage capacity scaling. An alternative solution would be pruning.

TiKV crashes every 20 min when using the default 3 sec timeout, and every couple of days when increasing that to 6 sec. Unfortunately, SurrealDB does not handle the crash gracefully, and I don't think the team has plans to fix that, as they are focusing on their in-house SurrealKV solution.

I have been testing setup 2 on the 48-core Hetzner machine and I am blown away by the performance. This will allow much more flexibility in the chat interface, where users can request pretty much anything and the only limit will be a query timeout env variable. In other words, if the query timeout limit is 5 seconds, setup 2 will be able to serve much larger data requests than setup 1.

2 Likes

I have pushed the current version of the sveltekit app to vercel. You can preview it at:

https://moonbeam-seven.vercel.app

You can search for this (randomly selected) address
0x0000500321a7a9c61a332049726e01156b5872c0
to see some account-specific data

The chat, extrinsics, and collator management are WIP.

I have only indexed a few thousand blocks just to populate it with some data. The aggregate values (daily, weekly, monthly) are wrong because of this:

so I need to wait for that fix to be included in a production version to start indexing.

3 Likes

this is awesome, great job! curious to see the polished production ready version

2 Likes

Quick update. Had to catch up with some other work so I did not do much, but I am currently trying to finish up testing/debugging for all the indexing stuff on the backend so I can start the indexing process. Everything seems to be working OK, but I am still a bit nervous about a scenario where I run the indexer for 4 months only to find out the data is corrupt and I have to start from scratch.

The issue we had with SurrealDB was resolved and went into the production version, but it looks like their fix introduced a new problem, which seems trivial to solve.

I decided to simplify the indexing workflow from a flexible and efficient Go implementation based on channels to a simpler, less flexible, and slightly less efficient (due to the design) TypeScript implementation based on promises. So, I spent the last couple of days reprogramming that, and I am happy with the result. I will spend the next couple of weeks running more tests across various regions of the chain.
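The promise-based shape is roughly this (a sketch, not the actual code): a fixed number of in-flight workers pull batches from a shared cursor instead of goroutines reading from channels.

```ts
// Process block batches with bounded concurrency using plain promises.
async function runPipeline(
  batches: Array<[number, number]>,
  indexBatch: (from: number, to: number) => Promise<void>,
  concurrency = 4
): Promise<void> {
  let next = 0;
  async function worker(): Promise<void> {
    while (next < batches.length) {
      const [from, to] = batches[next++];   // single-threaded JS: no race on the cursor
      await indexBatch(from, to);           // failures propagate and fail the whole run
    }
  }
  await Promise.all(Array.from({ length: concurrency }, worker));
}
```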

Here is a cool screenshot of the indexer at work.

4 Likes

Had to set up the server again from scratch, resync, etc. because something went wrong in our previous RAID 0 setup. I opted for no RAID in the new system (everything else is the same).

Found 3 more bugs in SurrealDB and submitted them. I have worked around 2 of them, but the 3rd results in wrong min values across all aggregate tables. Will have to wait for a fix before starting the final indexing run.

Tested the indexer across a few points in the first 1M blocks and found/fixed two bugs. Also, I noticed some really data-heavy regions of the chain that resulted in 40K+ DB transactions per block (each tx is 9 conditional writes). Unfortunately, each block has to be processed in sequence, and high concurrency (for a specific block) results in lots of "Resource busy" errors in the DB. Shuffling the tx array and using 3 processes seems to work, as long as these data-heavy blocks are sparse.
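The workaround looks roughly like this (sketch; function names and retry/backoff numbers are illustrative):

```ts
// Shuffle a block's transactions, apply them with limited concurrency,
// and retry the ones the DB rejects with "Resource busy".
async function applyBlockTxs(txs: Array<() => Promise<void>>, concurrency = 3): Promise<void> {
  const shuffled = [...txs].sort(() => Math.random() - 0.5);   // good enough for load spreading
  let next = 0;
  async function worker(): Promise<void> {
    while (next < shuffled.length) {
      const tx = shuffled[next++];
      for (let attempt = 0; ; attempt++) {
        try { await tx(); break; }
        catch (err) {
          if (attempt >= 5 || !String(err).includes('Resource busy')) throw err;
          await new Promise((r) => setTimeout(r, 50 * (attempt + 1)));  // brief backoff before retrying
        }
      }
    }
  }
  await Promise.all(Array.from({ length: concurrency }, worker));
}
```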

Ran into a few cases of "Error: Data corruption detected" and "Error: Stream ended in middle of compressed data frame" when decompressing files with zstd-napi. Looks like the library is not robust enough when pressed under heavy IO, which results in the activity being retried. It still works (after the retry), but I will replace it because the compressed files are archived on S3 and I don't want to have to change the format in the future.

I will continue testing past the 1M blocks and will probably start writing the chatbot data retrieval code.

1 Like

Ran the indexer for a few batches here and there to sample the number of datapoints. Looks like we have around 1M unique datapoints tracked for each block.

That's a lot of data :slight_smile:

Of course, the majority of these are constant from one block to another, and we only write a record when a datapoint changes.

Think of a datapoint as a unique parameter identified by the pallet, method, query arguments, and result object property path.
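In code terms, a datapoint key is something like this (field names are illustrative):

```ts
// A datapoint identity: pallet + method + query arguments + result property path.
interface DatapointKey {
  pallet: string;        // e.g. "parachainStaking"
  method: string;        // e.g. "delegatorState"
  args: string[];        // query arguments, serialized
  resultPath: string;    // e.g. "total" or "delegations.0.amount"
}

const datapointId = (k: DatapointKey) =>
  `${k.pallet}.${k.method}(${k.args.join(',')}).${k.resultPath}`;
```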

3 Likes