What FinTech Engineering Taught Me About Writing Code That Cannot Fail

Nben M. 04 Nov, 2025 11 mins

Most software fails quietly. A bug in a content recommendation system surfaces as slightly worse engagement metrics. A broken search filter returns fewer results. A misconfigured notification service sends emails twice. These are real problems. None of them move money in the wrong direction, lock a customer out of their account, or trigger a regulatory investigation.

FinTech engineering operates on a different failure model. A rounding error in a fee calculation, applied at scale, is a material misstatement. A race condition in a payment processor is not a flaky test. It is a double charge, a compliance event, and a customer support escalation simultaneously. A timeout that is handled incorrectly does not produce a slow page. It produces an ambiguous transaction state that might require manual reconciliation to resolve.

I spent years writing and reviewing code in this environment, first on payment processing infrastructure, then on core banking systems at Standard Chartered Ireland. The instincts that environment forces on you do not arrive through training. They arrive through seeing what happens when code that seemed correct meets real money at real scale. What follows is the honest version of what I learned.

Precision Is Not Optional

General software engineering treats floating point as a reasonable default for numbers. FinTech engineering treats it as a liability. Floating point arithmetic does not represent decimal fractions exactly, and the rounding errors that accumulate across thousands of operations are not theoretical.

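import "github.com/shopspring/decimal"

// This is wrong for money. The operands are float64 variables on purpose:
// Go evaluates the constant expression 0.1 + 0.2 exactly at compile time,
// so the error only shows up in runtime arithmetic.
a, b := 0.1, 0.2
fmt.Println(a + b) // 0.30000000000000004

// This is correct. Build decimals from strings so the value never
// passes through a float.
price := decimal.RequireFromString("0.1").Add(decimal.RequireFromString("0.2"))
fmt.Println(price) // 0.3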

The difference is not visible in a unit test that checks for approximate equality. It is visible when you sum a million transactions and the total is off by a non-trivial amount that cannot be explained by any individual operation.
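To make the accumulation concrete, here is a minimal sketch that adds the same ten-cent amount a million times, once as a float64 and once as integer cents. The function names are illustrative, not taken from any production system. In this toy case the drift is far smaller than a cent, but the float total no longer equals the value a reconciliation would expect, while the integer total does.

```go
package main

import "fmt"

// sumFloat adds the same float64 amount n times, the way a naive
// ledger total might.
func sumFloat(amount float64, n int) float64 {
	total := 0.0
	for i := 0; i < n; i++ {
		total += amount
	}
	return total
}

// sumCents adds the same amount, held as integer cents, n times.
func sumCents(amountCents int64, n int) int64 {
	var total int64
	for i := 0; i < n; i++ {
		total += amountCents
	}
	return total
}

func main() {
	const n = 1_000_000
	floatTotal := sumFloat(0.10, n)
	centsTotal := sumCents(10, n)

	// An exact-equality check is what a reconciliation job effectively does.
	fmt.Println("float total equals 100000.00:", floatTotal == 100000.0)
	fmt.Println("cents total equals 10000000: ", centsTotal == 10_000_000)
}
```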

Store monetary values as integers in the smallest denomination: pence, cents, fils. A value stored as 1099 representing £10.99 is exact. A value stored as 10.99 as a float is not. Apply this at the database layer as well. In Postgres, use BIGINT for integer minor units, or NUMERIC(19, 4) where a column has to hold fractional amounts such as rates or sub-unit prices, and never REAL or DOUBLE PRECISION. The precision is part of the contract between your application and your data, and it needs to hold at every layer.

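// Store as integer cents, convert at presentation only
type Money struct {
    AmountCents int64  `db:"amount_cents"`
    Currency    string `db:"currency"`
}

// Display assumes a two-decimal currency and handles the sign separately,
// so that -1050 renders as "-10.50" rather than "-10.-50".
func (m Money) Display() string {
    sign, cents := "", m.AmountCents
    if cents < 0 {
        sign, cents = "-", -cents
    }
    return fmt.Sprintf("%s%d.%02d %s", sign, cents/100, cents%100, m.Currency)
}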

The arithmetic never involves decimals. The only place a decimal appears is in the string returned for display. You cannot introduce a rounding error in code that does not perform rounding.
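Integer cents still meet division eventually, for example when a charge is split three ways. A minimal sketch of one common approach, using a hypothetical helper called Allocate: take the integer quotient, then hand out the leftover cents one at a time, so the parts always sum back to the original amount and nothing is rounded away.

```go
package main

import "fmt"

// Allocate splits totalCents into `parts` shares that differ by at most one
// cent and always sum back to totalCents. Leftover cents go to the first
// shares. It assumes totalCents >= 0 and parts > 0.
func Allocate(totalCents int64, parts int) []int64 {
	base := totalCents / int64(parts)
	remainder := totalCents % int64(parts)

	shares := make([]int64, parts)
	for i := range shares {
		shares[i] = base
		if int64(i) < remainder {
			shares[i]++ // distribute the remainder explicitly
		}
	}
	return shares
}

func main() {
	fmt.Println(Allocate(1000, 3)) // [334 333 333]
}
```

Which share receives the extra cent is a business rule, not an arithmetic one. The point of the sketch is only that the rule is written down in code rather than left to a rounding mode.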

Idempotency Is a Correctness Requirement

In most systems, idempotency is a nice property to have. In financial systems, it is a correctness requirement. A payment request that can be safely retried without producing a duplicate charge is not over-engineered. It is the minimum viable behavior for any production payment endpoint.

Networks fail. Load balancers time out. Clients retry. If your payment endpoint is not idempotent, a client that retries a timed-out request creates a duplicate charge. The user sees two debits. Your support team sees a complaint. Your reconciliation team sees an anomaly. All of this is preventable.

The implementation requires two things: a client-supplied idempotency key and server-side storage of processed requests.

func (h *PaymentHandler) Process(w http.ResponseWriter, r *http.Request) {
    idempotencyKey := r.Header.Get("X-Idempotency-Key")
    if idempotencyKey == "" {
        apierr.WriteError(w, r, apierr.InvalidInput("X-Idempotency-Key header is required", nil), h.logger)
        return
    }

    // Check if we have already processed this request
    existing, err := h.idempotencyStore.Get(r.Context(), idempotencyKey)
    if err != nil && !errors.Is(err, store.ErrNotFound) {
        apierr.WriteError(w, r, apierr.Internal(err), h.logger)
        return
    }

    if existing != nil {
        // Return the original response without re-processing
        writeJSON(w, existing.StatusCode, existing.Body)
        return
    }

    // Process the payment
    result, err := h.paymentService.Charge(r.Context(), parseRequest(r))
    if err != nil {
        apierr.WriteError(w, r, err, h.logger)
        return
    }

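    // Store the result before responding. If this write fails the charge has
    // still happened, so record it loudly: a retry would not be deduplicated.
    // (Assumes Set returns an error and h.logger is a *slog.Logger.)
    if err := h.idempotencyStore.Set(r.Context(), idempotencyKey, result, 24*time.Hour); err != nil {
        h.logger.ErrorContext(r.Context(), "idempotency store write failed", "idempotency_key", idempotencyKey, "error", err)
    }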

    writeJSON(w, http.StatusCreated, result)
}

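The idempotency store is checked before any processing occurs. If the key exists, the original response is returned. The payment is not processed a second time. The client receives the same response it would have received on the first successful attempt. One caveat: the lookup and the later write are separate steps, so two concurrent requests carrying the same key can both miss the lookup and both charge. A production version has to reserve the key atomically before charging, for example by inserting it under a unique constraint.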

Store the serialised response, not just a flag that the request was processed. A flag tells you the request happened. The stored response tells you exactly what you returned, which is what the client needs if it actually missed the first response.
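A handler that checks the store and writes to it in separate steps leaves a window in which two concurrent requests with the same key can both miss the lookup. A minimal in-memory sketch of closing that window follows. MemoryStore and its Do method are hypothetical names, and a real store would live in a database or cache with a TTL. The key is reserved under a lock before the charge runs, and any caller that arrives later waits for the stored response and gets it back verbatim.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// entry holds the stored outcome for one idempotency key.
type entry struct {
	done   chan struct{} // closed once the response has been stored
	status int
	body   string
}

// MemoryStore is a hypothetical in-memory stand-in for the idempotency store.
type MemoryStore struct {
	mu      sync.Mutex
	entries map[string]*entry
}

func NewMemoryStore() *MemoryStore {
	return &MemoryStore{entries: make(map[string]*entry)}
}

// Do runs fn at most once per key. Concurrent and later callers with the
// same key wait for the first call to finish, then receive its response.
func (s *MemoryStore) Do(key string, fn func() (int, string)) (int, string) {
	s.mu.Lock()
	if e, ok := s.entries[key]; ok {
		s.mu.Unlock()
		<-e.done // wait until the original response is available
		return e.status, e.body
	}
	e := &entry{done: make(chan struct{})}
	s.entries[key] = e // reserve the key before doing any work
	s.mu.Unlock()

	e.status, e.body = fn()
	close(e.done)
	return e.status, e.body
}

func main() {
	store := NewMemoryStore()
	var charges int64
	charge := func() (int, string) {
		atomic.AddInt64(&charges, 1)
		return 201, `{"payment_id":"pay_1"}`
	}

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			store.Do("key-123", charge)
		}()
	}
	wg.Wait()
	fmt.Println("charges executed:", atomic.LoadInt64(&charges)) // charges executed: 1
}
```

The sketch leaves out expiry and what to do if the charge function panics. Both matter in production, and neither changes the core idea of reserving first and processing second.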

Every State Transition Must Be Explicit

General software engineering often treats state as implicit: an order is considered complete when all its items are fulfilled, derived from the data rather than stored directly. Financial systems cannot afford implicit state. Every transition in a financial workflow must be stored explicitly, with a timestamp, a cause, and enough context to reconstruct why it happened.

The reason is auditing. Regulators do not ask whether your system produced the correct result. They ask how you know it produced the correct result, and they expect a clear, documented answer. An implicit state that is derived on read cannot be audited. An explicit state transition that is recorded on write can be.

type PaymentStatus string

const (
    PaymentPending    PaymentStatus = "pending"
    PaymentAuthorised PaymentStatus = "authorised"
    PaymentSettled    PaymentStatus = "settled"
    PaymentFailed     PaymentStatus = "failed"
    PaymentRefunded   PaymentStatus = "refunded"
)

type PaymentEvent struct {
    ID          string        `db:"id"`
    PaymentID   string        `db:"payment_id"`
    FromStatus  PaymentStatus `db:"from_status"`
    ToStatus    PaymentStatus `db:"to_status"`
    Reason      string        `db:"reason"`
    ActorID     string        `db:"actor_id"`
    OccurredAt  time.Time     `db:"occurred_at"`
}

Every state change is an event written to an append-only table. The current state is derived from the event log, but each transition is recorded when it happens. You can answer any question about any payment's history by reading the event table. You never need to guess.
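As a rough illustration of reading state back out of such a log, the sketch below replays a payment's events in order and returns the final status. CurrentStatus is a hypothetical helper. It assumes the events are already sorted by occurrence time and that the first event's from-status is the starting state. It also refuses a log whose transitions do not chain, since a gap in the chain means a transition was written without being recorded properly.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type PaymentStatus string

const (
	PaymentPending    PaymentStatus = "pending"
	PaymentAuthorised PaymentStatus = "authorised"
	PaymentSettled    PaymentStatus = "settled"
)

// PaymentEvent is a trimmed version of the event record, for illustration.
type PaymentEvent struct {
	FromStatus PaymentStatus
	ToStatus   PaymentStatus
	Reason     string
	OccurredAt time.Time
}

// CurrentStatus replays events in order and returns the final status.
// Each event must start from the status the previous event ended on.
func CurrentStatus(events []PaymentEvent) (PaymentStatus, error) {
	if len(events) == 0 {
		return "", errors.New("no events recorded")
	}
	status := events[0].FromStatus
	for i, e := range events {
		if e.FromStatus != status {
			return "", fmt.Errorf("event %d: expected from_status %q, got %q", i, status, e.FromStatus)
		}
		status = e.ToStatus
	}
	return status, nil
}

func main() {
	now := time.Now()
	events := []PaymentEvent{
		{PaymentPending, PaymentAuthorised, "card approved", now},
		{PaymentAuthorised, PaymentSettled, "batch settlement", now.Add(time.Hour)},
	}
	status, err := CurrentStatus(events)
	fmt.Println(status, err) // settled <nil>
}
```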

State machine validation belongs in application code, enforced explicitly rather than left to convention:

var validTransitions = map[PaymentStatus][]PaymentStatus{
    PaymentPending:    {PaymentAuthorised, PaymentFailed},
    PaymentAuthorised: {PaymentSettled, PaymentRefunded},
    PaymentSettled:    {PaymentRefunded},
    PaymentFailed:     {},
    PaymentRefunded:   {},
}

func (s *PaymentService) Transition(ctx context.Context, payment *Payment, to PaymentStatus, reason string) error {
    allowed := validTransitions[payment.Status]
    for _, status := range allowed {
        if status == to {
            return s.store.RecordTransition(ctx, payment, to, reason)
        }
    }
    return apierr.Conflict(fmt.Sprintf("cannot transition payment from %s to %s", payment.Status, to))
}

An invalid transition returns an error. It does not silently write bad state. The application layer enforces the rules before any write occurs, not after.

Handle Timeouts as Ambiguity, Not Failure

A timeout in a general web application means the request failed. Retry it. A timeout in a financial system means something different: the request may have succeeded, may have failed, or may still be processing. Treating a timeout as a definitive failure and retrying unconditionally is how double charges happen.

The correct model treats timeouts as producing an ambiguous state. The correct response is to query the status of the operation before deciding whether to retry.

func (c *PaymentClient) Charge(ctx context.Context, req ChargeRequest) (*ChargeResult, error) {
    resp, err := c.httpClient.Post(ctx, "/payments", req)
    if err != nil {
        if isTimeout(err) {
            // Do not retry immediately. Query the status first.
            return nil, &AmbiguousError{
                Operation:      "charge",
                IdempotencyKey: req.IdempotencyKey,
                Message:        "request timed out: query payment status before retrying",
            }
        }
        return nil, err
    }
    return parseResponse(resp)
}

// AmbiguousError signals that the outcome is unknown, not that the operation failed.
type AmbiguousError struct {
    Operation      string
    IdempotencyKey string
    Message        string
}

func (e *AmbiguousError) Error() string { return e.Message }

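The AmbiguousError type signals to the caller that the operation's outcome is unknown, not that it definitively failed. The caller queries the payment status using the idempotency key before deciding whether to retry. If a payment exists for that key, the first attempt reached the server, and its recorded status is the answer. If none exists, retrying with the same idempotency key is safe, provided the original request is no longer in flight.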

This distinction requires a different error type. A generic error says "it failed." An ambiguous error says "we do not know." Treating those as the same thing is the root cause of a large class of financial data inconsistencies.
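The caller's side of that contract might look like the sketch below. Everything here is illustrative: the Gateway interface, ChargeOnce, ErrPaymentNotFound, and the fake gateway are assumptions made for the example, not an existing API. The fake simulates the awkward case where the charge lands on the server but the response is lost.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// AmbiguousError signals that the outcome is unknown.
type AmbiguousError struct{ IdempotencyKey, Message string }

func (e *AmbiguousError) Error() string { return e.Message }

var ErrPaymentNotFound = errors.New("payment not found")

// Gateway is a hypothetical client interface used only for this sketch.
type Gateway interface {
	Charge(ctx context.Context, key string, amountCents int64) (string, error)
	Lookup(ctx context.Context, key string) (string, error)
}

// ChargeOnce charges, and on an ambiguous outcome looks the payment up
// before deciding whether a retry with the same key is warranted.
func ChargeOnce(ctx context.Context, g Gateway, key string, amountCents int64) (string, error) {
	id, err := g.Charge(ctx, key, amountCents)
	var amb *AmbiguousError
	if !errors.As(err, &amb) {
		return id, err // definite success or definite failure
	}
	id, err = g.Lookup(ctx, amb.IdempotencyKey)
	switch {
	case err == nil:
		return id, nil // the first attempt went through; do not charge again
	case errors.Is(err, ErrPaymentNotFound):
		return g.Charge(ctx, key, amountCents) // retry with the same key
	default:
		return "", fmt.Errorf("outcome still unknown for key %s: %w", key, err)
	}
}

// fakeGateway records the payment, then reports a timeout on the first call.
type fakeGateway struct {
	payments map[string]string
	charges  int
}

func (f *fakeGateway) Charge(_ context.Context, key string, _ int64) (string, error) {
	f.charges++
	f.payments[key] = "pay_" + key
	if f.charges == 1 {
		return "", &AmbiguousError{IdempotencyKey: key, Message: "request timed out"}
	}
	return f.payments[key], nil
}

func (f *fakeGateway) Lookup(_ context.Context, key string) (string, error) {
	if id, ok := f.payments[key]; ok {
		return id, nil
	}
	return "", ErrPaymentNotFound
}

// runDemo returns the payment ID, the number of charge attempts, and any error.
func runDemo() (string, int, error) {
	g := &fakeGateway{payments: map[string]string{}}
	id, err := ChargeOnce(context.Background(), g, "k1", 1099)
	return id, g.charges, err
}

func main() {
	fmt.Println(runDemo()) // pay_k1 1 <nil>
}
```

The charge count stays at one even though the first call returned an error, which is the behaviour an unconditional retry would have broken.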

Locks and Concurrency Are Business Logic

Race conditions in most applications produce incorrect UI state. Race conditions in financial applications produce incorrect balances. The correction for a UI glitch is a page refresh. The correction for an incorrect balance is a manual reconciliation process, a customer call, and possibly a regulatory report.

Concurrent updates to financial balances must use database-level locking, not application-level optimistic concurrency. Optimistic concurrency assumes conflicts are rare and handles them by retrying. Financial operations cannot always be retried safely, and the window for conflict is often exactly the high-load window where conflicts are most likely.

// Pessimistic lock: no other transaction can update, delete, or lock this
// row until this transaction commits or rolls back. Plain SELECTs are not
// blocked and still see the last committed version.
func (s *AccountStore) Debit(ctx context.Context, tx pgx.Tx, accountID string, amount decimal.Decimal) error {
    var balance decimal.Decimal

    err := tx.QueryRow(ctx,
        `SELECT balance FROM accounts WHERE id = $1 FOR UPDATE`,
        accountID,
    ).Scan(&balance)
    if err != nil {
        return fmt.Errorf("lock account %s: %w", accountID, err)
    }

    if balance.LessThan(amount) {
        return apierr.Conflict("insufficient funds")
    }

    _, err = tx.Exec(ctx,
        `UPDATE accounts SET balance = balance - $1, updated_at = NOW() WHERE id = $2`,
        amount, accountID,
    )
    return err
}

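FOR UPDATE acquires a row-level exclusive lock when the row is read. No other transaction can modify that row, or lock it with its own FOR UPDATE, until this transaction completes. In Postgres, plain SELECTs are not blocked and continue to see the last committed balance. Because every debit goes through the same locking read, the balance check and the debit cannot interleave with another debit. A concurrent request attempting to debit the same account will wait for this transaction to finish before it can read the balance, not race against it.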

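This is slower than optimistic concurrency under low contention. Under heavy contention on the same rows, an optimistic scheme spends its time detecting conflicts and retrying, and every retry path is another place for a mistake. Taking the lock up front gives a predictable result. Financial systems routinely see exactly that kind of contention.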
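The same check-then-write shape can be shown without a database. The sketch below is only an in-process analogy, with a sync.Mutex standing in for the row lock and hypothetical names throughout. Fifty concurrent debits of 100 cents hit an account holding 1000 cents. Because the lock is held across both the balance check and the write, exactly ten succeed and the balance never goes negative.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var ErrInsufficientFunds = errors.New("insufficient funds")

// Account guards its balance with a mutex, which plays the role the
// row lock plays in the database version.
type Account struct {
	mu           sync.Mutex
	balanceCents int64
}

// Debit holds the lock across both the check and the write, so no other
// debit can slip in between them.
func (a *Account) Debit(amountCents int64) error {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.balanceCents < amountCents {
		return ErrInsufficientFunds
	}
	a.balanceCents -= amountCents
	return nil
}

// Balance returns the current balance under the same lock.
func (a *Account) Balance() int64 {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.balanceCents
}

// debitConcurrently fires n debits at once and reports how many succeeded.
func debitConcurrently(a *Account, n int, amountCents int64) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	succeeded := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := a.Debit(amountCents); err == nil {
				mu.Lock()
				succeeded++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return succeeded
}

func main() {
	acct := &Account{balanceCents: 1000}
	ok := debitConcurrently(acct, 50, 100)
	fmt.Println(ok, acct.Balance()) // 10 0
}
```

A mutex only protects a single process. With more than one application instance, the lock has to live where the data lives, which is why the database version uses FOR UPDATE.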

Observability Is a Correctness Property

In most systems, observability is an operational concern. You add it so you can debug problems faster. In financial systems, observability is closer to a correctness property. If you cannot trace exactly what happened to a payment, you cannot verify that it was processed correctly, which means you cannot attest to its correctness, which means you have a compliance problem regardless of whether the underlying behavior was right.

Every financial operation must produce a log entry that contains enough information to reconstruct exactly what happened: the input values, the computed output, the state transition that occurred, and the identifiers that link the log entry to the database records.

func (s *PaymentService) Settle(ctx context.Context, payment *Payment) (*Payment, error) {
    result, err := s.processor.Settle(ctx, payment)
    if err != nil {
        s.logger.ErrorContext(ctx, "settlement failed",
            "payment_id", payment.ID,
            "amount_cents", payment.AmountCents,
            "currency", payment.Currency,
            "processor_ref", payment.ProcessorRef,
            "error", err,
        )
        return nil, apierr.Internal(fmt.Errorf("settle payment %s: %w", payment.ID, err))
    }

    s.logger.InfoContext(ctx, "payment settled",
        "payment_id", payment.ID,
        "amount_cents", payment.AmountCents,
        "currency", payment.Currency,
        "processor_ref", payment.ProcessorRef,
        "settlement_ref", result.SettlementRef,
        "settled_at", result.SettledAt,
    )

    return result, nil
}

Both the success and failure paths log the same identifying fields. An auditor querying logs for a specific payment ID gets the full history: what was attempted, what the processor returned, what state transition occurred. The log is not a debugging aid. It is part of the audit trail.
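One way to keep the two paths from drifting apart is to attach the identifying fields once and let every entry inherit them. The sketch below uses the standard library's log/slog with a JSON handler. The helper names (auditLogger, logSettlement, entriesForPayment) and the sample identifiers are made up for illustration. The last helper mimics an auditor's query by counting the log entries that carry a given payment ID.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"log/slog"
	"strings"
)

type Payment struct {
	ID           string
	AmountCents  int64
	Currency     string
	ProcessorRef string
}

// auditLogger stamps every entry with the payment's identifying fields,
// so success and failure paths cannot log different sets of identifiers.
func auditLogger(base *slog.Logger, p Payment) *slog.Logger {
	return base.With(
		"payment_id", p.ID,
		"amount_cents", p.AmountCents,
		"currency", p.Currency,
		"processor_ref", p.ProcessorRef,
	)
}

// logSettlement writes one audit entry for a settlement attempt.
func logSettlement(ctx context.Context, base *slog.Logger, p Payment, settlementRef string, err error) {
	log := auditLogger(base, p)
	if err != nil {
		log.ErrorContext(ctx, "settlement failed", "error", err)
		return
	}
	log.InfoContext(ctx, "payment settled", "settlement_ref", settlementRef)
}

// entriesForPayment counts JSON log lines whose payment_id matches.
func entriesForPayment(logs, paymentID string) int {
	count := 0
	for _, line := range strings.Split(logs, "\n") {
		var entry map[string]any
		if json.Unmarshal([]byte(line), &entry) != nil {
			continue // skip blank or malformed lines
		}
		if entry["payment_id"] == paymentID {
			count++
		}
	}
	return count
}

// captureLogs logs one failed and one successful attempt into a buffer.
func captureLogs() string {
	var buf bytes.Buffer
	base := slog.New(slog.NewJSONHandler(&buf, nil))
	p := Payment{ID: "pay_123", AmountCents: 1099, Currency: "GBP", ProcessorRef: "proc_9"}
	ctx := context.Background()
	logSettlement(ctx, base, p, "", errors.New("processor unavailable"))
	logSettlement(ctx, base, p, "set_456", nil)
	return buf.String()
}

func main() {
	logs := captureLogs()
	fmt.Print(logs)
	fmt.Println("entries for pay_123:", entriesForPayment(logs, "pay_123"))
}
```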

Conclusion

FinTech engineering does not use fundamentally different tools or techniques. The language features, the database primitives, the network protocols are all the same. What differs is the consequence model. When a failure costs money, affects a regulatory record, or produces an incorrect balance that has to be manually corrected, the tolerance for shortcuts drops to near zero.

The instincts this environment develops transfer directly to any system where correctness matters: integer or decimal arithmetic over floating point, idempotency at every state-changing endpoint, explicit state machines with recorded transitions, ambiguity-aware timeout handling, pessimistic locking for concurrent writes, and observability that produces a complete audit trail, not just operational metrics.

These are not advanced techniques. They are disciplined applications of fundamentals, applied consistently, with the full understanding of what it costs when they are skipped. That understanding is what FinTech engineering provides that general software engineering rarely does.
