Tonight’s problems – an explanation

[Flickr is now back up, but this is still probably a useful explanation for many people.]

While the site is still down and everyone else is working on it, I thought it’d be a good time to give a more thorough explanation of what is going on. Earlier tonight, people started seeing strange photos in place of their own about 1/7th of the time.

This was the result of our caching servers returning random photos each time they got asked. The caching servers (called "photocaches") are a thin layer of servers which sit between your browser and our primary storage. They store the most recently requested photos in a way that’s quick to access in order to speed up serving the photos you see on the site.

To explain the problem, a little background on how Flickr works is required:

Flickr serves hundreds of millions of photos each day (on the highest-traffic days, just over a billion photos are served). Reading from disks is slow relative to other computer components like memory (RAM) or processors (CPUs), and randomly accessing hundreds of terabytes of storage is both slow and a strain on the primary storage servers, so it wouldn’t be possible to run Flickr without this caching layer.
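For readers who think in code, here is a minimal sketch of what a caching layer like this does. It is an illustration only, not the actual photocache software: the class name `PhotoCache`, the example URLs and the "evict the least recently requested photo" policy are all assumptions made for the example.

```python
from collections import OrderedDict


class PhotoCache:
    """Toy read-through cache that keeps the most recently requested photos.

    Hypothetical illustration; not how the real photocaches are written.
    """

    def __init__(self, storage, capacity):
        self.storage = storage        # stand-in for primary storage: URL -> bytes
        self.capacity = capacity      # how many photos fit in the cache
        self.entries = OrderedDict()  # URL -> bytes, least recently used first
        self.hits = 0
        self.misses = 0

    def get(self, url):
        if url in self.entries:
            # Fast path: the photo is already cached.
            self.entries.move_to_end(url)
            self.hits += 1
            return self.entries[url]
        # Slow path: read from primary storage, then remember the result.
        self.misses += 1
        data = self.storage[url]
        self.entries[url] = data
        if len(self.entries) > self.capacity:
            # Drop the photo that was requested longest ago.
            self.entries.popitem(last=False)
        return data


# Made-up URLs and contents, purely for demonstration.
storage = {"/photos/1.jpg": b"cat", "/photos/2.jpg": b"dog", "/photos/3.jpg": b"bird"}
cache = PhotoCache(storage, capacity=2)
```

Only the first request for a photo touches the slow primary storage. Repeat requests are answered from the cache until the photo is pushed out by newer ones.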

Each photo has a unique address (or URL). This is what your browser uses to request a particular photo. It knows the address from the web page, which is produced by the "application layer" (the "program" or software that runs Flickr) based on data stored in the database.

The database knows whose photos are whose, what permissions everyone has, what comments have been left and by whom, etc. In contrast, the storage and the caches are "dumb": they just store the 1s and 0s that represent your photos.

Tonight’s problem was the result of a few of the photocaches going berserk: instead of returning the correct image file when a particular photo was requested, they just returned some random image that happened to be in the cache. The result was web pages which had some correct photos, and some random ones. And the random ones would change when you reloaded the page.

This is not a permanent problem: the primary storage, the database and the software that runs Flickr are all fine. The problem was with the internal directory of a few photocaching servers – the bit that keeps track of which image files correspond with which photo URLs (and therefore items in the database).
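To make that distinction concrete, here is a toy model of a cache with an internal directory. It is a sketch under loud assumptions: the `directory_broken` flag, the slot layout and the `restart` method are invented for illustration, and the real failure inside the photocaches may have looked quite different.

```python
import random


class Photocache:
    """Toy cache with an internal directory mapping URLs to stored image files.

    Hypothetical illustration of the failure described above.
    """

    def __init__(self, storage, rng=None):
        self.storage = storage           # primary storage: URL -> bytes (never modified here)
        self.slots = []                  # cached image files
        self.directory = {}              # URL -> index into self.slots
        self.directory_broken = False    # assumption: models the malfunction
        self.rng = rng or random.Random()

    def _lookup(self, url):
        if self.directory_broken and self.slots:
            # Malfunction: the directory points at an arbitrary cached image.
            return self.rng.randrange(len(self.slots))
        return self.directory.get(url)

    def get(self, url):
        slot = self._lookup(url)
        if slot is None:
            # Not cached yet: fetch from primary storage and record it.
            data = self.storage[url]
            self.directory[url] = len(self.slots)
            self.slots.append(data)
            return data
        return self.slots[slot]

    def restart(self):
        # A restart throws away the cache contents and the bad directory.
        self.slots = []
        self.directory = {}
        self.directory_broken = False


# Made-up URLs and contents, purely for demonstration.
storage = {"/photos/1.jpg": b"cat", "/photos/2.jpg": b"dog", "/photos/3.jpg": b"bird"}
photocache = Photocache(storage, rng=random.Random(0))
for url in storage:
    photocache.get(url)                  # warm the cache with correct entries
```

While `directory_broken` is set, a request for one photo can come back with any cached image, and a different one on each reload. Nothing in `storage` is touched at any point, which is why a restart is enough to bring the right photos back.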

To be clear, we regard this as a serious problem, but it is something that goes away as soon as we restart the malfunctioning servers (tonight we found that the servers were going insane again shortly after restarting, but we have isolated the problem and believe we have a permanent fix).

We want everyone to understand that there are no permanent problems with any data, we have not been "hacked" and you don’t need to do anything in order to have your photos return to normal (though you might need to do a "hard refresh" in order to clear your web browser’s internal image cache where the wrong photos might still be stored). In particular, you do NOT need to delete, replace or reupload any photos.

We shamefacedly apologize for the inconvenience and the scare. We understand that it probably seems very, very strange and we know that many people got the impression that their photos were lost forever. But they should all be back now, safe and sound. And everyone who works on Flickr’s engineering and technical operations teams is working double time to ensure that it never happens again. Thanks for your understanding and patience!