FeatSharp: Your Vision Model Features, Sharper
Abstract
Lay Summary
Modern computer vision models, while powerful, often lack the ability to process high resolution images, or are only able to produce representations in low resolution. This makes them challenging to use for tasks that require high-res, such as detecting small objects in an image (e.g. find the bird flying in the sky), or labeling every pixel within the scene as some category (e.g. "bird", "sky", "tree", etc.). We present a method for enabling low-res-only vision models to produce hi-res representations by carefully upsampling them, combined with additional passes through the model (called tiling) to get details for small objects that are otherwise not large enough to be encoded properly. In doing so, we demonstrate that we can improve on various dense task benchmarks for numerous base vision models.