008. Refactoring the Neo4J Cache Layer

Posted on 2026-03-29

Last week was almost entirely firefighting. We use Neo4J as our graph query backend, and to speed things up we had been caching a large volume of relationship data in Redis. The problem: the cache kept growing, and so did the AWS bill.

When I first proposed migrating the cache to DocumentDB (MongoDB-compatible), I estimated it would take a week. In the end, initialization alone took two weeks, and the full migration took about three. We hit a lot of unexpected issues along the way. This post documents the approach, the problems, and the outcome.

Background and Goals

We have a Cache Service that syncs relationship and node data from Neo4J into Redis for fast reads of user relationship data. Performance was never the issue — cost was:

A large portion of the cached data was rarely accessed, resulting in low cache utilization.
Memory-optimized instances were holding cold data long-term, which is terrible value for money.

The goal of this refactor was to introduce a hot/cold data separation:

Hot data stays in Redis.
Infrequently accessed cache moves to AWS DocumentDB.

State of the Code (Before)

Reading the code before starting, I had a bad feeling.

Many files across apps/ shared the same names.
The code itself was nearly identical across them.
ElasticSearch logic was mixed in with cache initialization code.
Incremental cache update logic was scattered across multiple apps.
Lots of lint errors, lots of commented-out code blocks.

All classic red flags — any change risked a cascade of side effects, making estimation impossible. Before touching the migration, I did two things first:

Fixed all related lint issues and deleted dead code.
Read the entire call chain end-to-end, then cross-checked it against Codex’s analysis.

Migration Strategy

The approach was low-risk: dual-write first, then shift reads, then shrink Redis.

Dual-write phase: writes go to both Redis and DocumentDB simultaneously.
Validation phase: monitor consistency and production behavior.
Convergence phase: remove unnecessary cold-data writes from Redis.
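A minimal sketch of the dual-write and validation phases, assuming plain dictionaries as stand-ins for Redis and DocumentDB. The class and method names here are hypothetical and are not taken from our Cache Service; the two writes also happen one after the other rather than truly simultaneously.

```python
class DualWriteCache:
    """Writes every entry to both stores; reads still come from the primary."""

    def __init__(self, primary, secondary):
        self.primary = primary      # stand-in for Redis
        self.secondary = secondary  # stand-in for DocumentDB

    def set(self, key, value):
        # Dual-write phase: every write lands in both stores.
        self.primary[key] = value
        self.secondary[key] = value

    def get(self, key):
        # Reads stay on the primary until validation has passed.
        return self.primary.get(key)

    def mismatched_keys(self):
        """Validation phase: keys whose values differ between the stores."""
        keys = set(self.primary) | set(self.secondary)
        return sorted(
            k for k in keys if self.primary.get(k) != self.secondary.get(k)
        )
```

Once mismatched_keys() stays empty under production traffic, the convergence phase amounts to no longer writing cold keys to the primary.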

Challenge 1: Query and Write Pressure

Indexing

This is a common problem when writing large volumes of data — usually adding an index is enough. However, DocumentDB has poor support for indexing arrays of embedded documents: $in queries on such fields do not use indexes. For example:

{
  ...
  identities: [
    {
      platform: "x",
      id: 1234,
    },
  ],
}

The following query will not hit an index:

db.relation.find({
    identities: {
        $in: [
            {
                platform: "x",
                id: 1234,
            },
        ],
    },
});

To work around this, we concatenated each identity's platform and id into a flat string such as x_1234 and stored those strings in identities instead of embedded documents. A query on the flat strings can hit the index:

db.relation.find({
    identities: {
        $in: ["x_1234"],
    },
});
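The flattening step could look roughly like this on the application side. The function names are hypothetical; only the platform_id format and the identities field come from the queries above.

```python
def identity_key(platform, identity_id):
    """Concatenate platform and id into the flat form, e.g. 'x_1234'."""
    return f"{platform}_{identity_id}"


def flatten_identities(identities):
    """Turn a list of {platform, id} documents into a list of flat strings."""
    return [identity_key(i["platform"], i["id"]) for i in identities]


def build_lookup_filter(identities):
    """Build the $in filter over flat strings, matching the query shown above."""
    return {"identities": {"$in": flatten_identities(identities)}}
```

One caveat with this kind of key: if a platform name can itself contain an underscore, the flat string can no longer be split back unambiguously, so a different separator may be safer.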

Database I/O

The original code was fetching 10,000 records at a time. I had already reduced this to 1,000, but DocumentDB Insights was still flagging many queries with I/O warnings. The batch size had to be reduced further to 100 records per query or update.
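As an illustration of the batch-size change, here is a small sketch that slices any stream of records into groups of at most 100. The helper name is hypothetical, and the actual read or write call per batch is left out.

```python
from itertools import islice


def batched(records, batch_size=100):
    """Yield successive lists of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Because this is a generator, only one batch is materialized at a time, which also keeps memory usage flat.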

Challenge 2: Controlling Runtime Memory

The original concurrency was set to 100 with a taskGroup size of 1,000, which meant the EC2 instance had to hold up to 100 × 1,000 = 100,000 records in memory at once, enough to OOM the instance. I went down several dead ends trying to fix this: deduplicating documents in memory, splitting requests into batches of 100. In the end, setting concurrency to 1 and groupSize to 100 solved it completely, capping memory at 100 records at a time. That made all the earlier optimizations unnecessary, so I deleted that code.
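The arithmetic behind the fix, as a tiny sketch. It assumes each concurrent worker holds exactly one full group in memory; the function name is hypothetical.

```python
def peak_records_in_memory(concurrency, group_size):
    """Upper bound on records held at once: one full group per worker."""
    return concurrency * group_size


before = peak_records_in_memory(concurrency=100, group_size=1000)
after = peak_records_in_memory(concurrency=1, group_size=100)
print(before, after, before // after)  # 100000 100 1000
```

Under that assumption, the new settings cut the worst-case number of resident records by a factor of 1,000.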

Challenge 3: Full Cache Initialization Was Too Slow

We had long avoided doing a full cache rebuild for exactly this reason: it’s slow. This refactor made it unavoidable, and we needed it to finish fast. Investigating the root cause revealed the main bottleneck: every time a record was written to Redis, it triggered a synchronous HTTP request to update ElasticSearch data and indexes.

The fix was to decouple ES from the write path entirely. All ES operations were pushed onto an asynq queue, and a dedicated ES-Init Worker was started to process those updates asynchronously.
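The shape of that decoupling, sketched with Python's standard queue and a single worker thread. This is only a stand-in: asynq is a Go library backed by Redis, and every name below is hypothetical.

```python
import queue
import threading


class AsyncIndexer:
    """Takes index updates off the write path and applies them in a worker."""

    def __init__(self, index_fn):
        self._tasks = queue.Queue()
        self._index_fn = index_fn  # stand-in for the ES update call
        self.errors = []
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def enqueue(self, doc_id, payload):
        # Returns immediately; the cache write never waits on the index.
        self._tasks.put((doc_id, payload))

    def _run(self):
        while True:
            doc_id, payload = self._tasks.get()
            try:
                self._index_fn(doc_id, payload)
            except Exception as exc:  # keep the worker alive on bad tasks
                self.errors.append((doc_id, exc))
            finally:
                self._tasks.task_done()

    def wait_until_drained(self):
        self._tasks.join()


def write_cache_entry(cache, indexer, key, value):
    """Write to the cache, then queue the index update instead of calling ES."""
    cache[key] = value
    indexer.enqueue(key, value)
```

The write path now costs one in-memory enqueue per record instead of one synchronous HTTP round trip.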

AWS OpenSearch Default Request Body Limit: 1 MB

To improve throughput, we batched tasks and submitted data for 1,000 nodes in a single request. This quickly produced a wave of errors. Checking the logs revealed the request body was exceeding the size limit. We extracted the ES code into a shared package at pkg/es and updated the document assembly logic to ensure each request stays under 1 MB.
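One way to keep each request under the limit is to pack serialized documents greedily until the next one would not fit. The sketch below is not the code in pkg/es. It assumes the limit is 1,000,000 bytes and treats the body as newline-delimited JSON, which simplifies the real bulk format.

```python
import json

MAX_BODY_BYTES = 1_000_000  # assumed value for the 1 MB limit


def split_by_body_size(docs, max_bytes=MAX_BODY_BYTES):
    """Pack documents into request bodies that each stay within max_bytes."""
    bodies, current, current_size = [], [], 0
    for doc in docs:
        line = json.dumps(doc, separators=(",", ":")).encode("utf-8") + b"\n"
        if len(line) > max_bytes:
            raise ValueError("a single document exceeds the request body limit")
        # Start a new body when adding this line would cross the limit.
        if current and current_size + len(line) > max_bytes:
            bodies.append(b"".join(current))
            current, current_size = [], 0
        current.append(line)
        current_size += len(line)
    if current:
        bodies.append(b"".join(current))
    return bodies
```

Sizing by bytes rather than by node count matters here, because 1,000 nodes can serialize to very different sizes depending on how many relationships each one carries.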

Concurrent ES Update Conflicts

After fixing the request size, things worked locally but still errored in production. The errors turned out to be document version conflicts: the same document was being updated by more than one concurrent request. Without deduplication in place, the only safe option for now was to run the updates single-threaded. To improve throughput later, we’ll need to deduplicate updates before dispatching requests, which is tricky because the dispatch layer currently works with raw bytes. We left this as a known issue for future optimization; the priority was getting the new cache live.
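For reference, the deduplication itself is simple once updates are available as (document ID, payload) pairs. That is an assumption: it is exactly what the raw-bytes dispatch layer does not give us today. A hypothetical sketch that keeps only the last update per document:

```python
def dedupe_latest(updates):
    """Keep the last payload per document ID, ordered by last occurrence."""
    latest = {}
    for doc_id, payload in updates:
        # Remove and re-insert so a repeated ID moves to the end.
        latest.pop(doc_id, None)
        latest[doc_id] = payload
    return list(latest.items())
```

With at most one update per document in each dispatch round, concurrent requests would no longer touch the same document.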

Results

The refactor delivered three concrete wins:

Redis instance downgraded from cache.r7g.4xlarge to cache.r7g.2xlarge, cutting roughly 60 GB of memory usage and reducing instance cost by about half — saving approximately $1,000/month.
Cache utilization improved: DocumentDB now handles infrequently accessed data, Redis is focused on hot data, and the overall cost-efficiency is much better.
Full cache initialization time cut from ~8 hours to ~4 hours.

Closing Thoughts

Hold your code to a standard. Don’t copy-paste. Delete dead code instead of commenting it out. Keep things decoupled.

Most importantly: read the code fully before estimating the work. Otherwise, be prepared to work late every night for a week.