
Tutorials 13 mins

How to Modernize a Legacy Monolith Step by Step Without Taking the Whole System Down

Nben M. 25 May, 2026 13 mins

The most dangerous moment in any legacy modernization project is when someone proposes a full rewrite. The argument is always compelling at the time. The existing system is hard to change, expensive to operate, and understood by fewer people every year as the engineers who built it move on. A clean slate feels like the rational solution.

Rewrites fail at a predictable rate not because engineers are incompetent but because they underestimate what the existing system does. The monolith running in production has survived years of edge cases, regulatory changes, data anomalies and operational incidents. It carries that knowledge implicitly, embedded in conditionals, in compensating logic, in database triggers that nobody documents. A rewrite discards all of that and asks a team to reconstruct it from incomplete specifications while simultaneously delivering new features on a schedule.

I have worked on legacy modernization at a major banking institution and across several SaaS products. The approach that works consistently is not a rewrite. It is incremental extraction with continuous delivery, where the monolith keeps running in production throughout, and each extraction is a controlled, reversible change that can be validated before the next one begins.

This article is the step-by-step version of that approach.

Step One: Understand Before You Move Anything

The instinct when inheriting a legacy system is to start changing it. Resist that instinct for the first several weeks. The most valuable thing you can do before moving any code is build a complete picture of what the system actually does, not what the documentation says it does.

Legacy systems and their documentation diverge from each other over time. The documentation reflects intent. The code reflects reality. Read the code.

Start with the entry points: the routes in a web framework, the message consumers, the cron jobs, the batch processes. Map every input the system receives and trace where it goes. Then read the database schema and map every table to the domain it belongs to. Tables that are read by modules in multiple domains are coupling points that will be expensive to separate. Identify them early.

markdown

Audit checklist before touching anything:

- All HTTP routes documented with their handlers
- All background jobs and their schedules
- All message queue consumers and the events they handle
- All external service integrations (payment processors, email, SMS)
- All database tables grouped by owning domain
- Tables read by more than one domain flagged as coupling points
- Any database triggers or stored procedures documented
- Any scheduled SQL jobs documented

Database triggers and stored procedures are the most common source of undocumented behavior in legacy systems. They execute outside the application layer and are invisible to anyone reading the application code alone. Query information_schema.triggers and information_schema.routines in Postgres, or the equivalent in your database, before you conclude that you understand the full data layer.

The output of this step is a domain map: a list of the discrete business capabilities the system provides, which tables and code belong to each one, and which capabilities share data in ways that complicate separation.
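One way to keep the domain map honest is to hold it as data and derive the coupling points from it, rather than maintaining the list by hand. The sketch below is a minimal illustration in Go; the `TableAccess` type and the table and domain names are hypothetical and would come from your own audit.

```go
package main

import (
	"fmt"
	"sort"
)

// TableAccess records which domains read or write a table.
// The type and the names used in main are hypothetical.
type TableAccess struct {
	Table   string
	Readers []string
	Writers []string
}

// CouplingPoints returns the tables touched by more than one domain,
// sorted by table name.
func CouplingPoints(tables []TableAccess) []string {
	var shared []string
	for _, t := range tables {
		domains := map[string]struct{}{}
		for _, d := range t.Readers {
			domains[d] = struct{}{}
		}
		for _, d := range t.Writers {
			domains[d] = struct{}{}
		}
		if len(domains) > 1 {
			shared = append(shared, t.Table)
		}
	}
	sort.Strings(shared)
	return shared
}

func main() {
	tables := []TableAccess{
		{Table: "stock_levels", Readers: []string{"inventory"}, Writers: []string{"inventory"}},
		{Table: "orders", Readers: []string{"orders", "billing"}, Writers: []string{"orders"}},
	}
	fmt.Println(CouplingPoints(tables)) // prints [orders]
}
```

Tables that appear in the result are the ones to plan around when sequencing extractions.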

Step Two: Establish a Deployment Pipeline Before Changing the System

The second most common mistake in legacy modernization, after attempting a full rewrite, is making changes to a system that does not have a reliable, automated deployment pipeline. Without one, every change you make carries the risk of a manual, error-prone deployment. Rollbacks require coordination. Incidents are harder to resolve. The modernization work stalls because engineers lose confidence that changes can be safely deployed.

Before extracting a single service or refactoring a single module, establish the pipeline.

A minimum viable pipeline for a legacy system has four stages: automated tests run on every commit, a build produces a deployable artifact, the artifact is deployed to a staging environment, and promotion to production requires a manual gate or an automated check. The tests do not need to be comprehensive at the start. Even a small number of integration tests against the most critical paths are more valuable than none.

yaml

# .github/workflows/deploy.yml
name: deploy

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: make test

  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build image
        run: |
          TAG=$(git rev-parse --short HEAD)
          docker build -t ${{ secrets.REGISTRY }}/monolith:$TAG .
          docker push ${{ secrets.REGISTRY }}/monolith:$TAG

  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    environment: staging
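    steps:
      # Check out the repo: the deploy script and the git history live there
      - uses: actions/checkout@v4
      - name: Deploy to staging
        run: |
          TAG=$(git rev-parse --short HEAD)
          ./scripts/deploy.sh staging $TAG

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    # Manual gate: assumes required reviewers are configured on this environment
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to production
        run: |
          TAG=$(git rev-parse --short HEAD)
          ./scripts/deploy.sh production $TAG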

Git SHA tagging on every image gives an unambiguous link between what is running in production and the commit that produced it. When an incident occurs, you know exactly which commit is running, and you can diff it against the previously deployed commit to find the change that caused the incident.

If the legacy system has no automated tests at all, write characterisation tests before the pipeline goes live. A characterisation test does not verify that the system is correct. It verifies that the system behaves the same way it did when you ran the test the first time. It captures existing behavior as a baseline so that changes can be detected, even if that behavior is not yet fully understood.
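A characterisation test can be as small as a recorded response compared on every later run. The sketch below is one possible shape in Go, assuming the legacy behavior is reachable through an `http.Handler`; the `Characterise` helper, the golden file name and the stand-in handlers are hypothetical. In a real project the same logic would normally live in a `go test` function.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"os"
	"path/filepath"
)

// Characterise sends a GET for path to h and compares status and body with
// the contents of goldenFile. If the golden file does not exist yet, it is
// created from the current response and the call reports a match.
func Characterise(h http.Handler, path, goldenFile string) (bool, error) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	snapshot := []byte(fmt.Sprintf("%d\n%s", rec.Code, rec.Body.String()))

	want, err := os.ReadFile(goldenFile)
	if errors.Is(err, os.ErrNotExist) {
		return true, os.WriteFile(goldenFile, snapshot, 0o644)
	}
	if err != nil {
		return false, err
	}
	return bytes.Equal(want, snapshot), nil
}

// demo records a baseline from a stand-in legacy handler, replays it, and
// then checks a handler whose behavior has changed.
func demo() (recorded, replayMatches, modifiedMatches bool, err error) {
	dir, err := os.MkdirTemp("", "golden")
	if err != nil {
		return false, false, false, err
	}
	defer os.RemoveAll(dir)
	golden := filepath.Join(dir, "stock_42.golden")

	legacy := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `{"item":"42","stock":7}`)
	})
	modified := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `{"item":"42","stock":8}`)
	})

	if recorded, err = Characterise(legacy, "/api/v1/inventory/42", golden); err != nil {
		return
	}
	if replayMatches, err = Characterise(legacy, "/api/v1/inventory/42", golden); err != nil {
		return
	}
	modifiedMatches, err = Characterise(modified, "/api/v1/inventory/42", golden)
	return
}

func main() {
	fmt.Println(demo())
}
```

The test says nothing about whether a stock of 7 is right. It only reports that the answer changed, which is the signal you need while refactoring.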

Step Three: Introduce the Proxy Layer

With a deployment pipeline in place, the next step is to introduce a routing proxy in front of the monolith. This is the infrastructure that makes incremental extraction possible. The proxy routes requests to the monolith by default and can be reconfigured to route specific paths to new services as they are extracted.

The monolith continues to handle all traffic until a service is ready. Extraction is a configuration change in the proxy, not a cutover of the entire system.

nginx

# nginx.conf: initial state, all traffic to the monolith
server {
    listen 80;
    server_name api.yourproduct.com;

    # Default: all traffic to the monolith
    location / {
        proxy_pass         http://monolith:8080;
        proxy_set_header   Host $host;
        proxy_set_header   X-Request-ID $request_id;
        proxy_set_header   X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_read_timeout 30s;
    }
}

nginx

# nginx.conf: after the inventory service is extracted
server {
    listen 80;
    server_name api.yourproduct.com;

    # Extracted service: inventory
    location /api/v1/inventory {
        proxy_pass         http://inventory-service:8080;
        proxy_set_header   Host $host;
        proxy_set_header   X-Request-ID $request_id;
        proxy_set_header   X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_read_timeout 10s;
    }

    # All other traffic still goes to the monolith
    location / {
        proxy_pass         http://monolith:8080;
        proxy_set_header   Host $host;
        proxy_set_header   X-Request-ID $request_id;
        proxy_set_header   X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_read_timeout 30s;
    }
}

X-Request-ID is propagated from the proxy to both the monolith and the extracted services. A single user-facing request that touches multiple services can be traced through all of them using the same identifier. This is not optional in a system where some requests still go to the monolith and others go to new services. Without correlation IDs, debugging a request that crosses the boundary is very difficult.

Step Four: Extract the Least Coupled Domain First

The domain map from step one tells you which domains are heavily coupled to others and which are relatively independent. Extract the least coupled domain first. A successful first extraction builds confidence, establishes the pattern for subsequent extractions, and produces a running new service in production that the team can learn from.

A domain is ready for extraction when it meets three conditions. Its data is written by only one domain. Its API surface with the rest of the system is small enough to define as an explicit contract. Its behavior is understood well enough to reproduce exactly.

For the extraction itself, the process is: build the new service, run it in shadow mode alongside the monolith, validate that its behavior matches the monolith's, then switch traffic at the proxy.

Shadow mode means the new service receives the same requests as the monolith but its responses are discarded. The monolith continues to serve all responses. Differences between the monolith's responses and the new service's responses are logged for comparison.

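// Shadow mode request forwarder: runs in the monolith during transition.
// body is the request body, buffered by the caller so the monolith's own
// handler can still read it. monolithStatus is the status the monolith
// returned for the same request.
func shadowForward(req *http.Request, body []byte, monolithStatus int, newServiceURL string) {
    // Use a fresh context so the shadow call outlives the original request,
    // with its own deadline (5s is an illustrative value).
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    target := newServiceURL + req.URL.RequestURI() // path plus query string
    clone, err := http.NewRequestWithContext(ctx, req.Method, target, bytes.NewReader(body))
    if err != nil {
        slog.Warn("shadow request not built", "url", target, "err", err)
        return
    }
    clone.Header = req.Header.Clone()

    resp, err := http.DefaultClient.Do(clone)
    if err != nil {
        slog.Warn("shadow request failed",
            "url", target,
            "err", err,
        )
        return
    }
    defer resp.Body.Close()

    // Log both statuses so divergence is visible; the shadow response is never returned
    slog.Info("shadow response",
        "path",            req.URL.Path,
        "monolith_status", monolithStatus,
        "shadow_status",   resp.StatusCode,
        "diverged",        resp.StatusCode != monolithStatus,
    )
}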

Running shadow traffic for two weeks before switching gives enough data to identify divergences under real production load patterns. If the new service handles all shadow requests correctly and produces responses that match the monolith's, the switch at the proxy is a configuration change with a tested rollback path.
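Deciding whether two responses "match" needs a definition. The sketch below is one possible comparison in Go, assuming JSON response bodies: status codes must be equal, and bodies are compared structurally so that key order and whitespace do not count as divergence. The `Response` type and `Diverges` function are hypothetical names.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"reflect"
)

// Response is a captured status code and body from one side of the comparison.
type Response struct {
	Status int
	Body   []byte
}

// Diverges reports whether the shadow response differs from the primary one.
// If both bodies parse as JSON they are compared structurally; otherwise they
// are compared byte for byte.
func Diverges(primary, shadow Response) bool {
	if primary.Status != shadow.Status {
		return true
	}
	var a, b any
	errA := json.Unmarshal(primary.Body, &a)
	errB := json.Unmarshal(shadow.Body, &b)
	if errA != nil || errB != nil {
		return !bytes.Equal(primary.Body, shadow.Body)
	}
	return !reflect.DeepEqual(a, b)
}

func main() {
	monolith := Response{Status: 200, Body: []byte(`{"item":"42","stock":7}`)}
	reordered := Response{Status: 200, Body: []byte(`{"stock": 7, "item": "42"}`)}
	wrongStock := Response{Status: 200, Body: []byte(`{"item":"42","stock":8}`)}

	fmt.Println(Diverges(monolith, reordered))  // prints false
	fmt.Println(Diverges(monolith, wrongStock)) // prints true
}
```

Fields that legitimately differ between the two systems, such as generated timestamps, would need to be stripped before comparison; that step is left out here.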

Step Five: Migrate Data Ownership Cleanly

Each extracted service must own its data. The monolith and the new service cannot share a database table. Data sharing through a common database is the most common way incremental extractions collapse back into a distributed monolith.

The data migration process for each extraction runs in three phases: dual write, backfill, and cutover.

In the dual write phase, the monolith writes to its own database and also publishes an event that the new service consumes to build its own copy of the data. The new service's database is populated in real time from events, while the monolith's database remains the source of truth.

// Monolith publishes an event on every inventory write during dual write phase
func (s *InventoryService) UpdateStock(ctx context.Context, itemID string, delta int) error {
    if err := s.repo.UpdateStock(ctx, itemID, delta); err != nil {
        return err
    }

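    // Publish to the new service's event stream. The monolith's database is
    // still the source of truth, so a publish failure is logged rather than
    // returned; the validation queries in the backfill phase catch any gap.
    // Assumes Publish returns an error.
    if err := s.publisher.Publish(ctx, StockUpdatedEvent{
        ItemID:    itemID,
        Delta:     delta,
        UpdatedAt: time.Now(),
    }); err != nil {
        slog.Error("stock event publish failed", "item_id", itemID, "err", err)
    }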

    return nil
}
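On the consuming side, the event carries a delta rather than an absolute value, so applying the same event twice would corrupt the copy. The sketch below shows one way the new service might guard against that, assuming the event stream can redeliver messages. The `EventID` field is a hypothetical addition to the event shown above, and `StockProjection` is an in-memory stand-in for the new service's database.

```go
package main

import "fmt"

// StockUpdatedEvent mirrors the event published by the monolith, plus a
// hypothetical EventID used for deduplication.
type StockUpdatedEvent struct {
	EventID string
	ItemID  string
	Delta   int
}

// StockProjection is an in-memory stand-in for the new service's database.
type StockProjection struct {
	stock map[string]int
	seen  map[string]struct{}
}

// NewStockProjection returns an empty projection.
func NewStockProjection() *StockProjection {
	return &StockProjection{
		stock: map[string]int{},
		seen:  map[string]struct{}{},
	}
}

// Apply applies an event at most once. It returns false when the event was
// already applied and has been skipped.
func (p *StockProjection) Apply(e StockUpdatedEvent) bool {
	if _, dup := p.seen[e.EventID]; dup {
		return false
	}
	p.seen[e.EventID] = struct{}{}
	p.stock[e.ItemID] += e.Delta
	return true
}

// Stock returns the current quantity for an item.
func (p *StockProjection) Stock(itemID string) int {
	return p.stock[itemID]
}

func main() {
	p := NewStockProjection()
	e := StockUpdatedEvent{EventID: "evt-1", ItemID: "item-42", Delta: -3}
	p.Apply(StockUpdatedEvent{EventID: "evt-0", ItemID: "item-42", Delta: 10})
	p.Apply(e)
	p.Apply(e) // redelivery of evt-1 is ignored
	fmt.Println(p.Stock("item-42")) // prints 7
}
```

In a real service the seen-set and the stock update would be written in the same database transaction, so that a crash between the two cannot cause a double apply.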

In the backfill phase, existing data is copied from the monolith's database to the new service's database using a one-time script. After the backfill completes, both databases should reflect the same state. Run validation queries that compare row counts and key values between the two until they agree consistently.
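The validation step can be sketched as a comparison of two snapshots. The example below assumes each database has been queried into a map of item ID to quantity; the `CompareStock` function and `Mismatch` type are hypothetical, and a production version would page through the tables rather than load them whole.

```go
package main

import (
	"fmt"
	"sort"
)

// Mismatch describes one key whose value differs between the two stores,
// or which exists in only one of them.
type Mismatch struct {
	Key      string
	Source   int
	Target   int
	InSource bool
	InTarget bool
}

// CompareStock compares a snapshot from the monolith (source) with one from
// the new service (target) and returns every difference, sorted by key.
func CompareStock(source, target map[string]int) []Mismatch {
	var out []Mismatch
	for k, sv := range source {
		tv, ok := target[k]
		if !ok || tv != sv {
			out = append(out, Mismatch{Key: k, Source: sv, Target: tv, InSource: true, InTarget: ok})
		}
	}
	for k, tv := range target {
		if _, ok := source[k]; !ok {
			out = append(out, Mismatch{Key: k, Target: tv, InTarget: true})
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Key < out[j].Key })
	return out
}

func main() {
	monolith := map[string]int{"item-1": 5, "item-2": 9, "item-3": 0}
	service := map[string]int{"item-1": 5, "item-2": 8, "item-4": 2}

	for _, m := range CompareStock(monolith, service) {
		fmt.Printf("%s source=%d target=%d inSource=%t inTarget=%t\n",
			m.Key, m.Source, m.Target, m.InSource, m.InTarget)
	}
}
```

An empty result on several consecutive runs, taken while dual writes are active, is the signal that the two databases agree.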

In the cutover phase, the proxy is updated to route the domain's traffic to the new service, and the monolith stops writing to those tables. The monolith's tables for that domain are kept in read-only mode for a defined period, typically two to four weeks, before they are removed. This provides a rollback path without requiring the dual write to continue, with one caveat: writes the new service accepts after cutover exist only in its database, so rolling back means copying them back to the monolith first.

Step Six: Decouple Internal Calls

As domains are extracted, the internal method calls that previously happened in the same process become network calls. This is expected. It is also where most extraction projects accumulate hidden coupling that causes problems later.

Every cross-service call introduces a latency cost and a failure mode that did not exist in the monolith. A method call that took microseconds now takes milliseconds. A call that never failed now fails with network timeouts and service unavailability. Both need to be designed for explicitly.

// Before extraction: direct internal call, no failure handling needed
func (s *OrderService) PlaceOrder(ctx context.Context, req OrderRequest) (*Order, error) {
    order := NewOrder(req)
    if err := s.inventory.Reserve(ctx, req.Items); err != nil {
        return nil, err
    }
    return s.repo.Save(ctx, order)
}

// After extraction: HTTP call with an explicit timeout and a distinct error
// for the timeout case. Retries and a circuit breaker would wrap
// inventoryClient; they are not shown here.
func (s *OrderService) PlaceOrder(ctx context.Context, req OrderRequest) (*Order, error) {
    order := NewOrder(req)

    // Bound only the inventory call; the save below keeps the caller's context
    invCtx, cancel := context.WithTimeout(ctx, 3*time.Second)
    defer cancel()

    if err := s.inventoryClient.Reserve(invCtx, req.Items); err != nil {
        var netErr *url.Error
        if errors.As(err, &netErr) && netErr.Timeout() {
            return nil, apierr.ServiceUnavailable("inventory service unavailable, try again", err)
        }
        return nil, err
    }

    return s.repo.Save(ctx, order)
}

The failure modes are now explicit in the code. The caller handles inventory unavailability as a distinct case rather than a generic error. The timeout is set to 3 seconds, well within the overall request budget, so an inventory service slowdown does not cause the order service to exhaust its own timeout.

The monolith's internal call had no such design. It could not fail in isolation. Every extracted service call requires this explicit design work, and it should be done before the service receives production traffic, not after the first timeout incident.
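A timeout limits how long one call can hang, but it does not stop a struggling service from being called again on every request. A circuit breaker covers that case. The sketch below is a deliberately small one in Go, written for illustration only: the `Breaker` type, its threshold and its cooldown are hypothetical, and a production system would more likely use an established library.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrOpen is returned when the breaker rejects a call without running it.
var ErrOpen = errors.New("circuit open")

// errDown is a stand-in failure used in the example.
var errDown = errors.New("inventory down")

// Breaker counts consecutive failures and rejects calls for a cooldown
// period once the threshold is reached.
type Breaker struct {
	mu        sync.Mutex
	failures  int
	threshold int
	cooldown  time.Duration
	openedAt  time.Time
	now       func() time.Time
}

// NewBreaker returns a closed breaker.
func NewBreaker(threshold int, cooldown time.Duration) *Breaker {
	return &Breaker{threshold: threshold, cooldown: cooldown, now: time.Now}
}

// Call runs fn unless the breaker is open. After threshold consecutive
// failures, calls are rejected with ErrOpen until the cooldown has elapsed.
// After that, calls are let through again; a success resets the count and a
// failure reopens the breaker.
func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if b.failures >= b.threshold && b.now().Sub(b.openedAt) < b.cooldown {
		b.mu.Unlock()
		return ErrOpen
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.threshold {
			b.openedAt = b.now()
		}
		return err
	}
	b.failures = 0
	return nil
}

func main() {
	b := NewBreaker(2, time.Minute)
	calls := 0
	reserve := func() error { calls++; return errDown }

	fmt.Println(b.Call(reserve)) // prints inventory down
	fmt.Println(b.Call(reserve)) // prints inventory down
	fmt.Println(b.Call(reserve)) // prints circuit open
	fmt.Println(calls)           // prints 2
}
```

The third call never reaches the inventory service, which is the point: the caller fails fast and the struggling service gets room to recover.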

Step Seven: Decommission Incrementally

Extraction is not complete when the new service is running. It is complete when the monolith no longer contains the code and data for the extracted domain. Incomplete decommissioning is how modernization projects stall: services accumulate without the monolith shrinking, operational complexity doubles, and the team is maintaining two copies of each domain indefinitely.

After a new service has been running in production for two to four weeks without incidents, and after the monolith's tables for that domain have been in read-only mode for the same period with no read traffic observed, remove the code from the monolith and drop the tables.
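The rule above is simple enough to encode, which makes it harder to skip under schedule pressure. The sketch below is one possible reading of it in Go: the soak period must have passed since cutover, and no incident or read against the monolith's tables may have occurred within that period. The `DomainStatus` type and its field names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// DomainStatus holds the facts the decommission rule depends on.
// A zero time means the event has never been observed.
type DomainStatus struct {
	CutoverAt        time.Time // when traffic switched to the new service
	LastIncidentAt   time.Time // last incident in the new service
	LastLegacyReadAt time.Time // last read seen on the monolith's tables
}

// ReadyToDecommission reports whether cutover, the last incident and the
// last legacy read are all at least soak old at time now.
func ReadyToDecommission(s DomainStatus, soak time.Duration, now time.Time) bool {
	quiet := func(t time.Time) bool { return now.Sub(t) >= soak }
	return quiet(s.CutoverAt) && quiet(s.LastIncidentAt) && quiet(s.LastLegacyReadAt)
}

// demo evaluates the rule one week after cutover, three weeks after cutover,
// and three weeks after cutover with a recent read on the legacy tables.
func demo() (early, ready, recentRead bool) {
	cutover := time.Date(2025, 3, 12, 0, 0, 0, 0, time.UTC)
	soak := 21 * 24 * time.Hour
	day := 24 * time.Hour

	clean := DomainStatus{CutoverAt: cutover}
	early = ReadyToDecommission(clean, soak, cutover.Add(7*day))
	ready = ReadyToDecommission(clean, soak, cutover.Add(21*day))

	withRead := DomainStatus{CutoverAt: cutover, LastLegacyReadAt: cutover.Add(19 * day)}
	recentRead = ReadyToDecommission(withRead, soak, cutover.Add(21*day))
	return
}

func main() {
	fmt.Println(demo()) // prints false true false
}
```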

Keep a record of what has been extracted:

markdown

Extraction Log

[x] Inventory Service       - Extracted 2025-03-12, monolith code removed 2025-04-02
[x] Notification Service    - Extracted 2025-04-01, monolith code removed 2025-04-22
[ ] Billing Service         - Shadow mode since 2025-04-15, cutover planned 2025-05-06
[ ] Customer Service        - Domain mapping complete, extraction not yet started
[ ] Order Service           - Blocked: shares tables with Billing, requires Billing first
[ ] Payment Processing      - Deferred: highest coupling, extract last

The extraction log is visible to the whole team and to stakeholders. It communicates progress, blockers and sequencing decisions. It prevents the modernization effort from becoming invisible background work that loses momentum.

Payment processing and any domain with the highest coupling should be extracted last. By the time you reach those domains, you will have developed the patterns, the tooling and the team confidence from earlier extractions. The hardest domains should be tackled with the most experience, not the least.

Conclusion

Legacy modernization without downtime is not a special technique. It is disciplined application of principles that apply to any production system change: understand before you act, automate the deployment path before you add to it, move traffic incrementally with tested rollbacks, and clean up what you have extracted before moving to the next domain.

The systems that get modernized successfully are not the ones that had the best architecture plans. They are the ones where the team resisted the rewrite temptation, maintained a working production system throughout, and extracted one domain at a time until the monolith contained only what had not yet been replaced.

The monolith shrinks. The new services accumulate. At some point the monolith is small enough that it can be reasoned about completely, and the decision about what to do with it last is made from a position of understanding rather than urgency. That is where you want to be. The step-by-step approach is what gets you there.
