008. Refactoring the Neo4J Cache Layer

Last week was almost entirely firefighting. We use Neo4J as our graph query backend, and to speed things up we had been caching a large volume of relationship data in Redis. The problem: the cache kept growing, and so did the AWS bill.

When I first proposed migrating the cache to DocumentDB (MongoDB-compatible), I estimated a week of work. In the end, cache initialization alone took two weeks, and the full migration took about three weeks. We hit a lot of unexpected issues along the way. This post documents the approach, the problems, and the outcome.

Background and Goals

We have a Cache Service that syncs relationship and node data from Neo4J into Redis for fast reads of user relationship data. Performance was never the issue — cost was:

  • A large portion of the cached data was rarely accessed, resulting in low cache utilization.
  • Memory-optimized instances were holding cold data long-term, which is terrible value for money.

The goal of this refactor was to introduce a hot/cold data separation:

  • Hot data stays in Redis.
  • Infrequently accessed (cold) data moves to AWS DocumentDB.

State of the Code (Before)

Reading the code before starting, I had a bad feeling.

  • Many files across apps/ shared the same names.
  • The code itself was nearly identical across them.
  • ElasticSearch logic was mixed in with cache initialization code.
  • Incremental cache update logic was scattered across multiple apps.
  • Lots of lint errors, lots of commented-out code blocks.

All classic red flags — any change risked a cascade of side effects, making estimation impossible. Before touching the migration, I did two things first:

  1. Fixed all related lint issues and deleted dead code.
  2. Read the entire call chain end-to-end, then cross-checked it against Codex’s analysis.

Migration Strategy

The approach was low-risk: dual-write first, then shift reads, then shrink Redis.

  1. Dual-write phase: writes go to both Redis and DocumentDB simultaneously.
  2. Validation phase: monitor consistency and production behavior.
  3. Convergence phase: remove unnecessary cold-data writes from Redis.
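
The first two phases can be sketched roughly as follows. This is a minimal illustration, not the actual Cache Service code: the two Map objects are in-memory stand-ins for Redis and DocumentDB, and the names dualWrite and findMismatches are hypothetical.

```javascript
// In-memory stand-ins for the two stores (hypothetical; real code would use
// a Redis client and a DocumentDB/MongoDB driver).
const redisStore = new Map();
const docStore = new Map();

// Dual-write phase: every cache write goes to both stores.
function dualWrite(key, value) {
  const serialized = JSON.stringify(value);
  redisStore.set(key, serialized);
  docStore.set(key, serialized);
}

// Validation phase: list the keys whose values differ between the stores.
function findMismatches() {
  const keys = new Set([...redisStore.keys(), ...docStore.keys()]);
  return [...keys].filter((k) => redisStore.get(k) !== docStore.get(k));
}

dualWrite("node:1", { followers: 3 });
dualWrite("node:2", { followers: 7 });
docStore.delete("node:2"); // simulate a write that was lost on one side
console.log(findMismatches()); // [ 'node:2' ]
```

Once a check like this stays empty in production for long enough, the cold-data writes to Redis can be removed.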

[Figure: cache architecture (1774827639-cache-architecture.png)]

Challenge 1: Query and Write Pressure

Indexing

Slow queries are a common problem when writing large volumes of data, and adding an index is usually enough. However, DocumentDB has poor support for indexing arrays of embedded documents: $in queries on such fields do not use indexes. For example, given a document shaped like this:

{
  ...
  identities: [
    {
      platform: "x",
      id: 1234,
    },
  ],
}

The following query will not hit an index:

db.relation.find({
    identities: {
        $in: [
            {
                platform: "x",
                id: 1234,
            },
        ],
    },
});

To work around this, we manually concatenated each identity's platform and id fields and stored the result as a flat string such as "x_1234". This allows the query to hit the index:

db.relation.find({
    identities: {
        $in: ["x_1234"],
    },
});
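
A small helper can keep the flattening consistent between writes and queries. The sketch below is an assumption about how that might look; identityKey and identityFilter are hypothetical names, and the returned object is only the filter that would be passed to db.relation.find().

```javascript
// Build the flat key stored in the `identities` array, e.g. "x_1234".
function identityKey(identity) {
  return `${identity.platform}_${identity.id}`;
}

// Build the filter for db.relation.find() from a list of identities.
function identityFilter(identities) {
  return { identities: { $in: identities.map(identityKey) } };
}

console.log(JSON.stringify(identityFilter([{ platform: "x", id: 1234 }])));
// {"identities":{"$in":["x_1234"]}}
```

One thing to watch with this scheme: if a platform name or an id can itself contain "_", two different identities could flatten to the same string, so a separator that cannot appear in either field is safer.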

Database I/O

The original code fetched 10,000 records at a time. I had already reduced this to 1,000, but DocumentDB Insights was still flagging many queries with I/O warnings, so the batch size had to come down further, to 100 records per query or update.
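
Capping the batch size amounts to chunking the work list before each query or update. A minimal sketch, assuming the record IDs are already available as an array (the chunk helper is a hypothetical name, not something from our codebase):

```javascript
// Split a list into fixed-size batches; the last batch may be smaller.
function chunk(items, size) {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const ids = Array.from({ length: 250 }, (_, i) => i);
const batches = chunk(ids, 100);
console.log(batches.map((b) => b.length)); // [ 100, 100, 50 ]
```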

Challenge 2: Controlling Runtime Memory

The original concurrency was set to 100 with a taskGroup size of 1,000, which meant the EC2 instance could be holding up to 100 × 1,000 = 100,000 records in memory at once, enough to OOM the instance. I went down several dead ends trying to fix this, including deduplicating documents in memory and splitting requests into batches of 100. In the end, setting concurrency to 1 and groupSize to 100 solved it completely and made those earlier optimizations unnecessary, so I deleted that code.
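
The arithmetic behind this is simple: the number of records in flight is bounded by concurrency times group size. The helper below is only an illustration of that bound (peakRecords is a hypothetical name), not code from the service.

```javascript
// Upper bound on the number of records held in memory at the same time:
// each concurrent worker holds one full group.
function peakRecords(concurrency, groupSize) {
  return concurrency * groupSize;
}

const before = peakRecords(100, 1000);
const after = peakRecords(1, 100);
console.log(before); // 100000
console.log(after); // 100
console.log(before / after); // 1000
```

Going from 100,000 to 100 in-flight records is a 1,000-fold reduction, which is why the in-memory deduplication and request splitting stopped mattering.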

Challenge 3: Full Cache Initialization Was Too Slow

We had long avoided doing a full cache rebuild for exactly this reason: it’s slow. This refactor made it unavoidable, and we needed it to finish fast. Investigating the root cause revealed the main bottleneck: every time a record was written to Redis, it triggered a synchronous HTTP request to update ElasticSearch data and indexes.

The fix was to decouple ES from the write path entirely. All ES operations were pushed onto an asynq queue, and a dedicated ES-Init Worker was started to process those updates asynchronously.
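
The shape of the change looks roughly like this. The sketch uses a plain array as a stand-in for the queue and Map objects for the cache and the ES index; it does not model asynq's API, and writeRecord and runEsWorker are hypothetical names.

```javascript
// In-memory stand-ins (assumptions for illustration only).
const esQueue = [];
const cache = new Map();
const esIndex = new Map();

// Write path: update the cache and enqueue the ES work instead of
// making a synchronous HTTP request inline.
function writeRecord(id, record) {
  cache.set(id, record);
  esQueue.push({ type: "es:update", id, record });
}

// Worker: drains the queue independently of the write path.
function runEsWorker() {
  let processed = 0;
  while (esQueue.length > 0) {
    const task = esQueue.shift();
    esIndex.set(task.id, task.record); // a real worker would call ES here
    processed += 1;
  }
  return processed;
}

writeRecord("n1", { name: "a" });
writeRecord("n2", { name: "b" });
console.log(esIndex.size); // 0 (cache is written, ES not yet touched)
console.log(runEsWorker()); // 2
console.log(esIndex.size); // 2
```

The cache write no longer waits on ES, so initialization speed is bounded by the cache stores rather than by per-record HTTP round trips.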

AWS OpenSearch Default Request Body Limit: 1 MB

To improve throughput, we batched tasks and submitted data for 1,000 nodes in a single request. This quickly produced a wave of errors. Checking the logs revealed the request body was exceeding the size limit. We extracted the ES code into a shared package at pkg/es and updated the document assembly logic to ensure each request stays under 1 MB.
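
One way to stay under a body limit is to size batches by serialized bytes rather than by document count. This is a sketch under assumptions, not the logic in pkg/es: splitBySize is a hypothetical name, it only counts the document lines, and a real bulk body would also need to budget for the action metadata lines.

```javascript
const MAX_BODY_BYTES = 1024 * 1024; // the 1 MB limit described above

// Group documents so that each group's serialized size stays within maxBytes.
function splitBySize(docs, maxBytes = MAX_BODY_BYTES) {
  const batches = [];
  let current = [];
  let currentBytes = 0;
  for (const doc of docs) {
    // +1 for the newline that terminates each line of a bulk body
    const size = Buffer.byteLength(JSON.stringify(doc), "utf8") + 1;
    if (size > maxBytes) {
      throw new RangeError("single document exceeds the request size limit");
    }
    if (currentBytes + size > maxBytes && current.length > 0) {
      batches.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(doc);
    currentBytes += size;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}

// Each document serializes to 421 bytes, so two fit under a 1,000-byte cap.
const docs = Array.from({ length: 5 }, (_, i) => ({ id: i, payload: "x".repeat(400) }));
console.log(splitBySize(docs, 1000).map((b) => b.length)); // [ 2, 2, 1 ]
```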

Concurrent ES Update Conflicts

After fixing the request size, things worked locally but still errored in production. The errors turned out to be document version conflicts: the same document was being updated by more than one concurrent request. Without deduplication in place, the only option was to run single-threaded for now. Improving throughput later will require deduplicating updates before dispatching requests, which is tricky because the dispatch layer currently works with already-serialized raw bytes. I left this as a known issue for future optimization; the priority was getting the new cache live.
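
For reference, the deduplication we would eventually need could look something like the sketch below. It is not implemented in our code: dedupeById is a hypothetical name, and it assumes updates are still structured objects with an id field, meaning it would have to run before the payloads are serialized to bytes.

```javascript
// Keep only the most recent update per document ID.
function dedupeById(updates) {
  const latest = new Map();
  for (const update of updates) {
    latest.set(update.id, update); // later updates overwrite earlier ones
  }
  return [...latest.values()];
}

const deduped = dedupeById([
  { id: "a", v: 1 },
  { id: "b", v: 1 },
  { id: "a", v: 2 },
]);
console.log(deduped); // [ { id: 'a', v: 2 }, { id: 'b', v: 1 } ]
```

With at most one update per document in each dispatch round, concurrent workers would no longer race on the same document version.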

Results

The refactor delivered three concrete wins:

  1. Redis instance downgraded from cache.r7g.4xlarge to cache.r7g.2xlarge, cutting roughly 60 GB of memory usage and reducing instance cost by about half — saving approximately $1,000/month.
  2. Cache utilization improved: DocumentDB now handles infrequently accessed data, Redis is focused on hot data, and the overall cost-efficiency is much better.
  3. Full cache initialization time cut from ~8 hours to ~4 hours.

Closing Thoughts

Hold your code to a standard. Don’t copy-paste. Delete dead code instead of commenting it out. Keep things decoupled.

Most importantly: read the code fully before estimating the work. Otherwise, be prepared to work late every night for a week.