Synthetic training data
The images below are 100% computer-generated
Automated Testing
While interviewing for a startup that specialized in reading photos of receipts, I decided to explore some of their computer vision challenges by implementing a system for running automated OCR tests. For a given receipt text, could their system read the text accurately in any random condition? I wanted to know! So I built a Docker-based application that generated photorealistic images of receipts in different visual conditions—blurred, overexposed, crinkled—anything that made reading the receipt more difficult.
Supervised Learning
After I accepted an offer to work there, I decided to expand the code to serve as a source of synthetic training data for their custom OCR system. Most OCR systems perform poorly on wrinkled paper, so improving accuracy in this specific area would have a big payoff for our use case.
But I needed lots of labeled training data. I decided to tackle this with synthetic data as well, leveraging a key fact: because I knew the exact geometry, object, and camera transformations behind each CGI receipt image, I could also compute, with some matrix multiplications and spatial tree searches, the exact location of every letter on the receipt's surface. As a result, I could harvest hundreds of thousands of individually labeled letters, in a variety of conditions, to serve as labeled training data.
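The core of that trick is ordinary projective geometry: push each letter's 3D anchor point through the camera's view and projection matrices, then map the result to pixel coordinates. Below is a minimal NumPy sketch of that step, not the production code; the function name and the column-vector 4x4 matrix convention are my assumptions.

```python
import numpy as np

def project_to_pixels(points_world, view, proj, width, height):
    """Project 3D world-space points to 2D pixel coordinates.

    points_world: (N, 3) array of letter anchor positions on the mesh.
    view, proj:   4x4 view and projection matrices (in Blender these
                  would come from the camera object; hypothetical here).
    """
    n = points_world.shape[0]
    homo = np.hstack([points_world, np.ones((n, 1))])  # homogeneous coords
    clip = (proj @ view @ homo.T).T                    # to clip space
    ndc = clip[:, :3] / clip[:, 3:4]                   # perspective divide
    # Map normalized device coordinates [-1, 1] to pixels (origin top-left).
    px = (ndc[:, 0] + 1.0) * 0.5 * width
    py = (1.0 - (ndc[:, 1] + 1.0) * 0.5) * height
    return np.stack([px, py], axis=1)
```

With the projected corners of each glyph in hand, a spatial index over the rendered pixels can then confirm which letters are actually visible and where, yielding one bounding-box label per letter.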
Technology
Blender and Docker served as the basis for the system. I constructed a code-driven, procedurally generated 3D receipt mesh whose various physical parameters could be randomized at runtime.
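Randomizing the scene at runtime amounts to sampling a small bundle of scene parameters per render. The sketch below illustrates that idea; the specific parameter names and ranges are illustrative, not the actual ones used.

```python
import random
from dataclasses import dataclass

@dataclass
class ReceiptParams:
    """One randomized receipt scene (illustrative parameters)."""
    crinkle_strength: float  # how wrinkled the paper mesh is
    blur_radius: float       # simulated camera defocus
    exposure: float          # over/underexposure offset
    camera_tilt_deg: float   # camera angle relative to the receipt

def random_params(seed=None):
    """Sample one scene; a fixed seed makes renders reproducible."""
    rng = random.Random(seed)
    return ReceiptParams(
        crinkle_strength=rng.uniform(0.0, 1.0),
        blur_radius=rng.uniform(0.0, 3.0),
        exposure=rng.uniform(-1.5, 1.5),
        camera_tilt_deg=rng.uniform(-20.0, 20.0),
    )
```

Seeding each render means any individual training image can be regenerated exactly, which is handy when a label looks wrong and you want to inspect the scene that produced it.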
Docker encapsulated all of the dependencies, and a series of automated render scripts let me generate thousands of synthetic training images.
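Inside the container, a batch run like this can be driven by invoking Blender headlessly once per seed. This is a hedged sketch using Blender's standard CLI flags (`--background` suppresses the UI, `--python` runs a script, and arguments after `--` are passed through to that script); the file names are placeholders, not the project's actual ones.

```python
import subprocess

def render_command(blend_file, script, seed, out_dir):
    """Build one headless Blender invocation for a given random seed."""
    return [
        "blender", "--background", blend_file,
        "--python", script, "--",
        "--seed", str(seed), "--out", out_dir,
    ]

def render_batch(n, blend_file="receipt.blend",
                 script="render_receipt.py", out_dir="renders"):
    """Render n receipt images, one Blender process per seed."""
    for seed in range(n):
        subprocess.run(
            render_command(blend_file, script, seed, out_dir),
            check=True,  # fail loudly if any render errors out
        )
```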