How a short-video app stores your media, from draft to delivery

You open the camera, shoot a 20-second clip, trim the dead air off the front, drop a song under it, and tap Save draft. Then you close the app. The next morning it's still sitting there, exactly how you left it. Tap Post, and a few seconds later it's playing back smoothly for someone on the other side of the world.

That whole sequence feels like nothing. Under it is a pile of decisions about where bytes live, when they move, and who pays for the move. I've built parts of this for [the app or project you can name here], and most of the interesting work isn't the upload everyone fixates on. It's everything around it.

I'm describing the common shape of these systems here, not any one company's internal design. The patterns are well understood, and you'll find variations of them in most apps that push around a lot of video. Let me walk through it in the order a Reel actually experiences it, starting with the part people underrate: the draft.

The draft is where the real plumbing hides

When you record a clip, the raw video doesn't sit in memory waiting for you to decide what to do with it. It can't. A short clip at 1080p is already [around ___ MB for a 30-second take, fill in what you saw], and 4K is several times that. Hold a few of those in RAM and the OS will kill your app without apology. So the camera stack (AVFoundation on iOS, CameraX/MediaCodec on Android) writes the encoded video straight to a file in the app's own storage as you record.

Editing doesn't overwrite that file in place either. Trims, filters, and the audio track are usually kept as a set of instructions plus references to the source media. The actual re-encode happens later, often only at export time, because re-encoding video is expensive and you don't want to do it every time the user nudges a trim handle by half a second.

A draft, then, is a small record in a local database that points at one or more media files on disk, plus the edit instructions and a bit of metadata (caption, chosen audio, cover frame, timestamps). The video itself is just a file. The draft is the bookkeeping that says "these bytes, arranged this way, are an unfinished post."

That database matters more than it sounds. If you stash draft state in something volatile, it won't survive the app being swapped out of memory or the phone rebooting. You want it in real persistent storage: SQLite directly, or something built on it like Core Data on iOS or Room on Android. [Say which you used and why.] When the app cold-starts, you read those rows back, re-point them at the files on disk, and the draft is "restored." There's no magic to it. The files never went anywhere; you just reopened the index that describes them.
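
Here's a minimal sketch of that bookkeeping, using Python's standard sqlite3 module as a stand-in for whatever the client is actually written in. The table layout, the column names, and the open_db, save_draft, and load_drafts helpers are all made up for illustration. One assumption baked in: the media path is stored relative to the app's data directory, so restoring a draft means joining it onto wherever that directory lives at launch.

```python
import json
import sqlite3
import time

# Hypothetical schema: one row per draft, pointing at a media file on disk.
SCHEMA = """
CREATE TABLE IF NOT EXISTS drafts (
    id          INTEGER PRIMARY KEY,
    media_path  TEXT NOT NULL,            -- relative to the app's data directory
    edits_json  TEXT NOT NULL,            -- trims, filters, audio: instructions only
    caption     TEXT NOT NULL DEFAULT '',
    updated_at  REAL NOT NULL
)
"""


def open_db(path):
    """Open (or create) the draft database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_draft(conn, media_path, edits, caption=""):
    """Insert a draft row; the video bytes themselves are never stored here."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO drafts (media_path, edits_json, caption, updated_at) "
            "VALUES (?, ?, ?, ?)",
            (media_path, json.dumps(edits), caption, time.time()),
        )
    return cur.lastrowid


def load_drafts(conn):
    """Read every draft back, newest first, as plain dictionaries."""
    rows = conn.execute(
        "SELECT id, media_path, edits_json, caption "
        "FROM drafts ORDER BY updated_at DESC"
    ).fetchall()
    return [
        {"id": r[0], "media_path": r[1], "edits": json.loads(r[2]), "caption": r[3]}
        for r in rows
    ]
```

Closing the connection and calling open_db on the same path again is the whole "restore" step: the rows come back and the media file they point at was never touched.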

One gotcha that's bitten people: on iOS, anything you put in the Caches directory can be purged by the system when storage gets tight, and it happens silently. Drafts do not belong there. They go in a directory the OS treats as user data the app is responsible for, such as Application Support or Documents. Get this wrong and users lose drafts seemingly at random, and you'll burn a week trying to reproduce a bug that only shows up on phones that are nearly full.

flowchart TD
    A["Record clip:<br/>media written to app storage"] --> B[Trim, filter, add audio]
    B --> C["Keep edits as instructions<br/>referencing the source media"]
    C --> D{Save as draft?}
    D -->|Yes| E["Write draft row in local DB,<br/>referencing the media file"]
    E --> F["App closed and reopened:<br/>draft restored from disk"]
    F --> G[User taps Post]
    D -->|Post now| G
    G --> H[Upload media to backend]
    H --> I[Server-side processing]

Why everything starts on the device

There's a reason media lives locally first instead of streaming straight to the cloud as you shoot.

Local storage is fast, it's effectively free (you already own the phone's flash), and it works with no signal. A user on the subway can record, edit, and save five drafts with the radio off, and nothing in that flow needs a server. Recording is a local operation. Making something public is the part that needs the network.

Cloud storage flips all of that. It's durable, so your post survives a lost phone. It's reachable from any device. And it can be served to a huge audience at once. The catch is that it costs money per gigabyte stored and per gigabyte served, it needs a working connection, and every round-trip is slow next to a local file read. You don't want either side doing the other's job. Local handles capture and drafts; cloud handles durability, sharing, and distribution.

The rule I follow in practice: keep the source of truth local until the user commits to publishing, then make the cloud the source of truth once the upload succeeds. The handoff between those two states is where most of the bugs live.
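
One way to keep that handoff honest is to make the states explicit and refuse any transition you didn't plan for. The sketch below is an assumption about how you might model it, not a description of any particular app: the PostState names, the ALLOWED table, and the helper functions are all hypothetical.

```python
from enum import Enum


class PostState(Enum):
    DRAFT = "draft"                  # lives only on the device
    UPLOADING = "uploading"          # transfer in progress
    PENDING_RETRY = "pending_retry"  # upload failed, user can retry
    PUBLISHED = "published"          # upload confirmed by the backend


# Hypothetical transition table: anything not listed here is a bug.
ALLOWED = {
    PostState.DRAFT: {PostState.UPLOADING},
    PostState.UPLOADING: {PostState.PUBLISHED, PostState.PENDING_RETRY},
    PostState.PENDING_RETRY: {PostState.UPLOADING, PostState.DRAFT},
    PostState.PUBLISHED: set(),
}


def transition(current, target):
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


def source_of_truth(state):
    """Local stays authoritative until the upload has actually succeeded."""
    return "cloud" if state is PostState.PUBLISHED else "local"


def can_delete_local_media(state):
    """Only drop the local copy once the cloud copy is the source of truth."""
    return state is PostState.PUBLISHED
```

The useful property is that there is no path from DRAFT straight to PUBLISHED, so nothing can look posted without having passed through a successful upload.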

flowchart LR
    subgraph Device
      Cam[Camera capture] --> FS[("App sandbox<br/>filesystem")]
      DB[("Local DB<br/>SQLite / Core Data / Room")]
      FS <--> DB
      MC[("On-device media cache")]
    end
    subgraph Cloud
      OBJ[("Object storage<br/>S3 / GCS")]
      META[("Metadata DB")]
      CDN["CDN edge nodes"]
    end
    FS -->|upload on Post| OBJ
    DB -->|sync metadata| META
    OBJ --> CDN
    CDN --> MC

Getting a big file off the phone without it falling apart

Now the user taps Post. You've got a [___ MB] file and a mobile connection that might drop in a tunnel. Uploading the whole thing in one PUT and hoping is how you end up with a 90%-complete upload that fails and restarts from zero. Users hate that, and on metered data it's genuinely costing them money.

So you don't do that. A handful of things make uploads survivable.

Chunked, resumable uploads are the big one. Split the file into pieces (a few MB each is reasonable) and upload them one at a time. If the connection drops, you only re-send the chunk that was in flight, not the whole file, and when the user comes back into range the upload picks up where it left off. This is the single biggest reliability win, and it's worth the extra complexity.
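
A rough sketch of the resume logic, with the network abstracted away. The upload_resumable function and the send_chunk callback are hypothetical names; send_chunk stands in for whatever request actually ships one chunk, and is assumed to raise ConnectionError when the link drops. The 4 MB default is just an example value.

```python
CHUNK_SIZE = 4 * 1024 * 1024  # illustrative; "a few MB" per chunk


def upload_resumable(stream, total_size, send_chunk, start_offset=0,
                     chunk_size=CHUNK_SIZE):
    """Upload from start_offset onward.

    Returns the offset of the first byte that still needs sending. If that
    equals total_size the upload is complete; otherwise the caller should
    persist the value and pass it back in as start_offset later.
    """
    offset = start_offset
    stream.seek(offset)
    while offset < total_size:
        chunk = stream.read(min(chunk_size, total_size - offset))
        if not chunk:
            break  # file is shorter than expected; stop rather than spin
        try:
            send_chunk(offset, chunk)
        except ConnectionError:
            # Only the chunk in flight is lost; everything before it is done.
            return offset
        offset += len(chunk)
    return offset
```

In a real protocol you would usually ask the server which offset it has confirmed before resuming, rather than trusting the client's own bookkeeping, but the shape is the same.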

Pre-signed URLs keep your servers out of the data path. You generally don't want the file flowing through your application servers at all. The client asks your backend "where do I put this," the backend hands back a short-lived signed URL that authorizes a direct upload to object storage, and the bytes go straight there. Your servers just deal with small JSON.
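
To show the idea rather than any vendor's API, here's a toy version of signing and verifying an upload URL with an HMAC. This is not how S3 or GCS sign URLs; they have their own schemes and you'd use their SDKs. The SECRET value, the expires and signature query parameters, and both function names are invented for this example.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret-not-for-production"  # hypothetical shared secret


def _signature(object_key, expires):
    """HMAC over the method, the object key, and the expiry timestamp."""
    message = f"PUT\n{object_key}\n{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def sign_upload_url(base_url, object_key, ttl_seconds, now=None):
    """Backend side: hand the client a short-lived URL for one object."""
    issued_at = time.time() if now is None else now
    expires = int(issued_at + ttl_seconds)
    query = urlencode({"expires": expires,
                       "signature": _signature(object_key, expires)})
    return f"{base_url}/{object_key}?{query}"


def verify_upload_url(url, now=None):
    """Storage side: accept the upload only if the URL is intact and fresh."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    try:
        expires = int(query["expires"][0])
        signature = query["signature"][0]
    except (KeyError, ValueError):
        return False
    expected = _signature(parsed.path.lstrip("/"), expires)
    current = time.time() if now is None else now
    # compare_digest avoids leaking how many leading characters matched.
    return (hmac.compare_digest(signature.encode(), expected.encode())
            and current < expires)
```

The point of the exercise: the application server only ever produces a short string, and the storage layer can check it without calling back.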

Background upload sessions let the transfer survive the user leaving the app. Hand it to the OS's background upload mechanism (a background URLSession on iOS, WorkManager on Android) and it keeps going, within limits, after the app is backgrounded. It'll get suspended and resumed, so this only works cleanly if the upload is resumable in the first place. The two features are a package deal.

Dedup with content hashes saves you sending bytes the backend already has. Hash the file on the device and ask the backend whether an object with that hash already exists before uploading anything. People repost the same clip constantly. If the bytes are already there, you skip the transfer and create a new reference. [If you measured how often this hits, that number is great to drop in.]
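
A small sketch of that check, assuming SHA-256 as the content hash (the choice of hash is mine, not something the pattern dictates). MediaIndex is a hypothetical in-memory stand-in for the backend's "do you already have this?" lookup, and upload is whatever function actually moves the bytes.

```python
import hashlib


def file_sha256(path, block_size=1024 * 1024):
    """Hash a file in blocks so a large video never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


class MediaIndex:
    """Hypothetical stand-in for the backend's hash-to-object lookup."""

    def __init__(self):
        self._objects = {}  # content hash -> storage key
        self.uploads = 0    # how many times bytes were actually transferred

    def post(self, path, upload):
        """Return a storage key, uploading only if the bytes are new."""
        content_hash = file_sha256(path)
        key = self._objects.get(content_hash)
        if key is None:
            key = upload(path, content_hash)
            self._objects[content_hash] = key
            self.uploads += 1
        return key
```

Two posts of the same clip end up with the same storage key and a single transfer; each post is just a new reference to it.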

For us, moving from a single-shot upload to resumable chunks took the failure rate on flaky connections from [before] to [after]. The median upload time barely moved. It's the long tail on bad networks that gets fixed, and that tail is where the angry support tickets come from.

What the server does to your video

The file that lands in object storage is not the file you serve. Serving the user's original is a mistake. A clip from a recent flagship might be HEVC at a bitrate that's overkill for a phone screen, and it won't even play on some older devices. So the raw upload kicks off a processing job, usually asynchronously off a queue, and that job is where most of the compute (and most of your processing bill) goes.

The core of it is transcoding: take the source and produce several renditions at different resolutions and bitrates. That's what makes adaptive playback work later. ffmpeg is the workhorse here for most teams. You're choosing a codec and a bitrate ladder, and the choice is a real tradeoff. H.264 plays on basically everything but compresses worse. HEVC and AV1 give you smaller files at the same quality but cost more CPU to encode and don't decode everywhere. [Say which ladder you landed on and why; this is exactly the kind of opinion readers want.]
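
To make "a ladder" concrete, here's a sketch that builds one ffmpeg command per rendition without running anything. The three rungs and their bitrates are placeholder numbers, not a recommendation, and H.264 via libx264 is used only because it is the most widely playable option mentioned above. LADDER, ffmpeg_args, and all_commands are names invented for this example.

```python
# Illustrative ladder: (name, output height in pixels, video bitrate, audio bitrate).
LADDER = [
    ("1080p", 1080, "5000k", "128k"),
    ("720p", 720, "2500k", "128k"),
    ("480p", 480, "1000k", "96k"),
]


def ffmpeg_args(source, out_dir, name, height, video_bitrate, audio_bitrate):
    """Build the argument list for one H.264/AAC rendition."""
    return [
        "ffmpeg", "-y",
        "-i", source,
        "-vf", f"scale=-2:{height}",  # keep aspect ratio, force an even width
        "-c:v", "libx264",
        "-b:v", video_bitrate,
        "-c:a", "aac",
        "-b:a", audio_bitrate,
        "-movflags", "+faststart",    # put the index up front for quicker starts
        f"{out_dir}/{name}.mp4",
    ]


def all_commands(source, out_dir):
    """One command per rung of the ladder, highest quality first."""
    return [ffmpeg_args(source, out_dir, *rung) for rung in LADDER]
```

Each list can be handed to subprocess.run by the worker that picks the job off the queue. Packaging the outputs into HLS or DASH segments is a separate step not shown here.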

You also normalize things people never think about: honoring or stripping rotation metadata so videos aren't sideways, fixing audio levels, clamping absurd durations. The sideways-video bug is a rite of passage. Rotation flags lie often enough that you learn to test with footage from a pile of different phones.

flowchart TD
    U[Client requests upload URL] --> P[Backend returns pre-signed URL]
    P --> CH["Client uploads in chunks (resumable)"]
    CH --> RAW[("Raw object in storage")]
    RAW --> Q[Enqueue processing job]
    Q --> T["Transcode to multiple<br/>resolutions and bitrates"]
    T --> TH["Generate thumbnails<br/>and blur placeholder"]
    T --> HLS["Package HLS / DASH<br/>with manifests"]
    TH --> READY[("Processed assets stored")]
    HLS --> READY
    READY --> PUB["Mark post ready,<br/>fan out to feeds"]

Thumbnails and that blurry placeholder

While the video's open for transcoding, you pull frames for thumbnails. Usually a representative frame (not always the first one, which is often a black or blurry mess from the camera warming up) at a couple of sizes: a small one for grids, a larger one for the cover.

The other trick is the tiny blurry placeholder you see for a split second before media loads. That's not the real image downscaled on the fly. It's a precomputed, very small encoding of the colors. BlurHash is a common approach: a short string, typically a few dozen characters, that the client expands into a soft gradient roughly matching the real image. It ships inside the feed JSON, costs almost nothing, and means the layout never pops in with empty gray boxes. Cheap, and it makes the whole app feel faster even though nothing about the actual download got quicker.

Serving it back fast with a CDN

Once the renditions and manifests exist, they sit in object storage, which is durable but not built to be read by a global audience at low latency. That's the CDN's job.

A CDN is a network of edge servers spread around the world. The first time someone in a region asks for a piece of your media, the nearest edge fetches it from origin (your object storage) and keeps a copy. Later requests from that region are served from the edge instead of crossing the planet to your bucket, for as long as that copy stays cached. The viewer gets lower latency, and your origin gets shielded from the bulk of the traffic, which is also where the real cost savings show up. [Name your CDN and, if you have it, the cache hit ratio you see; a high hit ratio is the whole game.]

Video specifically is served as a set of small segments plus a manifest (HLS or DASH). The player reads the manifest, sees the available bitrates, and switches between them based on the viewer's current bandwidth. Good signal, it pulls higher-quality segments; signal drops, it steps down so playback doesn't stall. All of those segments are static files the CDN is happy to cache.
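
The switching decision can be sketched in a few lines. This is a deliberately naive version: real players weigh buffer level and recent throughput history too. The pick_rendition name and the 0.8 safety factor are assumptions for the example.

```python
def pick_rendition(renditions_kbps, measured_kbps, safety=0.8):
    """Pick the highest bitrate that fits inside a margin of measured bandwidth.

    Falls back to the lowest rendition when nothing fits, because a
    low-quality picture beats a stalled one.
    """
    budget = measured_kbps * safety
    affordable = [r for r in sorted(renditions_kbps) if r <= budget]
    return affordable[-1] if affordable else min(renditions_kbps)
```

The player would run something like this before each segment request, which is why quality can change mid-video without a rebuffer.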

flowchart LR
    V[Viewer opens feed] --> APP[App requests manifest]
    APP --> EDGE{Edge has it cached?}
    EDGE -->|Hit| FAST[Serve from edge]
    EDGE -->|Miss| ORIGIN[Fetch from origin storage]
    ORIGIN --> FILL[Edge stores a copy]
    FILL --> FAST
    FAST --> PLAYER["Player streams,<br/>adapts bitrate"]

Caching on the device, and when to throw things away

The CDN handles the network side. The phone has its own caching problem. You don't re-download a video every time it scrolls past, and you definitely don't keep every video you've ever seen forever.

Most clients run a layered cache. There's a small in-memory cache for what's on screen and just off it, backed by a larger on-disk cache for recently seen media. A request checks memory first, then disk, then finally the network. Anything fetched from the network gets written back so the next look is free.

The part people get wrong is eviction. Storage is finite, and users notice when your app is eating [several GB] of their phone. So you set a budget and evict on a least-recently-used basis: when you're over the limit, the stuff nobody's touched in a while gets deleted first. The same idea applies on the server and at the CDN, just with longer time horizons and bigger numbers.
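
Here's a compact sketch of the two ideas together: a byte-budgeted LRU, and a memory tier layered over a disk tier layered over the network. To keep it self-contained the "disk" tier is just a second in-memory LRU with a bigger budget; on a real client it would be files plus an index. LruByteCache, LayeredCache, and the fetch callback are hypothetical names.

```python
from collections import OrderedDict


class LruByteCache:
    """Holds blobs up to a byte budget, evicting least-recently-used first."""

    def __init__(self, budget_bytes):
        self.budget_bytes = budget_bytes
        self.used_bytes = 0
        self._items = OrderedDict()  # key -> bytes, least recently used first

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, data):
        if key in self._items:
            self.used_bytes -= len(self._items.pop(key))
        self._items[key] = data
        self.used_bytes += len(data)
        while self.used_bytes > self.budget_bytes and self._items:
            _, evicted = self._items.popitem(last=False)  # drop the oldest
            self.used_bytes -= len(evicted)


class LayeredCache:
    """Memory first, then disk, then the network via the fetch callback."""

    def __init__(self, memory, disk, fetch):
        self.memory = memory
        self.disk = disk
        self.fetch = fetch

    def get(self, key):
        data = self.memory.get(key)
        if data is not None:
            return data
        data = self.disk.get(key)
        if data is None:
            data = self.fetch(key)     # slowest path: go to the CDN
            self.disk.put(key, data)   # write back so the next look is cheap
        self.memory.put(key, data)     # promote into the fast tier
        return data
```

The budgets are the knobs: a small one for memory, a larger one for disk, both far below what the user would notice in their storage settings.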

flowchart TD
    REQ[Request media] --> MEM{In memory?}
    MEM -->|Yes| SERVE[Serve instantly]
    MEM -->|No| DISK{On disk cache?}
    DISK -->|Yes| PROMOTE[Load into memory] --> SERVE
    DISK -->|No| NET[Fetch from CDN] --> STORE[Write to disk and memory] --> SERVE
    SERVE --> TOUCH[Update last-accessed time]
    TOUCH --> EVICT{Over size budget?}
    EVICT -->|Yes| LRU[Evict least-recently-used]
    EVICT -->|No| KEEP[Keep]

Prefetching is the flip side of caching, and it's what makes a feed feel instant. While you're watching one Reel, the app is quietly pulling the first segment or two of the next one so it starts the moment you swipe. Overdo it and you're burning the user's data on videos they'll never watch. Do too little and every swipe stalls for a beat. Tuning that balance is mostly judgment and watching real usage, not a formula. [If you tuned how many items ahead you prefetch, that story is worth telling.]
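
The two dials in that tradeoff, how far ahead to look and how much data you're willing to spend, can be written down as a tiny planner. Everything here is illustrative: plan_prefetch, its parameters, and the idea of knowing each first segment's size up front are assumptions for the sketch.

```python
def plan_prefetch(upcoming, items_ahead, data_budget_bytes):
    """Decide which upcoming items get their first segment prefetched.

    upcoming is a list of (item_id, first_segment_bytes) in feed order.
    Stops at the first item that would blow the budget, so items are
    always prefetched in the order the user will reach them.
    """
    plan = []
    spent = 0
    for item_id, first_segment_bytes in upcoming[:items_ahead]:
        if spent + first_segment_bytes > data_budget_bytes:
            break
        plan.append(item_id)
        spent += first_segment_bytes
    return plan
```

A sensible client would shrink items_ahead or the budget on cellular or in a low-data mode and loosen them on Wi-Fi, which is where the judgment comes in.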

Keeping it all from quietly falling over

A few things you only learn once this is running at any real size.

Drafts and temp files accumulate. Every abandoned recording and half-finished export leaves files behind, and without cleanup the app's storage footprint creeps up until users start uninstalling. You need a janitor: clear temp files on a schedule, expire drafts the user clearly walked away from, and make sure deleting a draft actually deletes its media and not just the database row. Orphaned files are easy to create and annoying to find.
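
The orphan hunt is simple to sketch: list what's on disk, subtract what the database still references, delete the rest. The function names and the flat single-directory layout are assumptions for this example; referenced_names would come from a query over the draft table.

```python
import os


def find_orphans(media_dir, referenced_names):
    """Return files in media_dir that no draft row points at any more."""
    referenced = set(referenced_names)
    return sorted(
        name
        for name in os.listdir(media_dir)
        if os.path.isfile(os.path.join(media_dir, name))
        and name not in referenced
    )


def delete_orphans(media_dir, referenced_names):
    """Remove orphaned files and report which ones were deleted."""
    removed = find_orphans(media_dir, referenced_names)
    for name in removed:
        os.remove(os.path.join(media_dir, name))
    return removed
```

Run it when no recording or export is in progress, otherwise a file that is about to get its database row looks exactly like an orphan.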

Failed uploads need somewhere to go. A post that didn't finish uploading shouldn't vanish, and it definitely shouldn't look published when it isn't. Keep it as a pending item the user can retry, and be honest in the UI about its state.

And the whole thing is a constant negotiation between three forces that pull against each other: how much storage you use, how fast everything feels, and how much it costs to run. Higher quality means bigger files, more storage, more bandwidth, more money. More caching means a faster app and a fatter footprint on the device. There's no setting that wins on all of them at once. You're picking a point on that surface that fits your users and your budget, and you revisit it as both change.

None of this is exotic. It's mostly the same small set of ideas (keep it local until you have to go remote, move big files carefully, transcode once and serve many times, cache hard and evict honestly) applied at each layer. The craft is in the boring details: which directory the draft lives in, what happens when the upload dies at 80%, whether your placeholder shows up before the gray box does.
