Poster in Workshop: Neural Compression: From Information Theory to Applications
Making Text-Image Connection Formal and Practical
Carlos-Gustavo Salas-Flores · Dongmian Zou · Luyao Zhang
Text and image feature extraction is at the core of several state-of-the-art artificial intelligence algorithms, including DALLE-2, Stable Diffusion, and Segment Anything. However, models that connect images and texts are usually trained using hundreds of GPUs and millions of data points, making it infeasible for most agents to perform the training from scratch. Furthermore, these groundbreaking works necessitate more formally defined algorithms to enable easier adoption and implementation. To address these issues, this paper elaborates on a formal and intuitive algorithm for text-image connection and proposes an alternative way to train CLIP, a neural network model that learns joint representations from text and images, with low computing resources. Our focus is on improving training speed and using only a fraction of the data. In our experiments, two models were trained on a third of WKIT-24M, a dataset of text-image pairs, in a setting constrained to a single GPU, by using mixed precision in back-propagation and by reducing both the input image resolution and the maximum query length relative to the original CLIP. Our results show that it is feasible to train image-text connection models from scratch in this simplified setting and that the resulting models recognize related image concepts.
Virtual talk: https://drive.google.com/file/d/1tdjchYTMkeOVnveCiT1d8JVhNBIwU-cz/view?usp=drive_link
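The abstract describes a CLIP-style contrastive image-text objective trained on a single GPU with mixed-precision back-propagation, reduced input image resolution, and a shorter maximum query length. The sketch below is a minimal illustration of how those ingredients can be combined in PyTorch; it is not the authors' code, and the encoder sizes, image resolution (128 px), maximum token length (32), and other hyperparameters are assumptions made for the example rather than values reported in the paper.

    # Illustrative sketch (not the authors' implementation): a minimal
    # CLIP-style contrastive training step with mixed precision.
    # All hyperparameters here are assumptions for illustration.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyCLIP(nn.Module):
        def __init__(self, vocab_size=10000, embed_dim=256):
            super().__init__()
            # Small image encoder: a few conv layers instead of a full ViT/ResNet.
            self.image_encoder = nn.Sequential(
                nn.Conv2d(3, 32, 4, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=4), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, embed_dim),
            )
            # Small text encoder: token embeddings mean-pooled over the query.
            self.token_embed = nn.Embedding(vocab_size, embed_dim)
            self.text_proj = nn.Linear(embed_dim, embed_dim)
            self.logit_scale = nn.Parameter(torch.tensor(2.659))  # log(1/0.07)

        def forward(self, images, tokens):
            img = F.normalize(self.image_encoder(images), dim=-1)
            txt = F.normalize(self.text_proj(self.token_embed(tokens).mean(1)), dim=-1)
            return img, txt

    def clip_loss(img, txt, logit_scale):
        # Symmetric contrastive loss: matched image-text pairs sit on the diagonal.
        logits = logit_scale.exp() * img @ txt.t()
        labels = torch.arange(img.size(0), device=img.device)
        return (F.cross_entropy(logits, labels) +
                F.cross_entropy(logits.t(), labels)) / 2

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = TinyCLIP().to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

    # Dummy batch: 128x128 images (reduced resolution) and 32-token queries
    # (reduced maximum length), standing in for a real text-image data loader.
    images = torch.randn(8, 3, 128, 128, device=device)
    tokens = torch.randint(0, 10000, (8, 32), device=device)

    # Mixed-precision forward pass and scaled back-propagation.
    with torch.autocast(device_type=device,
                        dtype=torch.float16 if device == "cuda" else torch.bfloat16):
        img_emb, txt_emb = model(images, tokens)
        loss = clip_loss(img_emb, txt_emb, model.logit_scale)

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()
    print(f"contrastive loss: {loss.item():.4f}")

In practice the dummy tensors would be replaced by batches from the text-image dataset, and the loop would be repeated over the training split; the mixed-precision autocast and gradient scaler are what keep the memory footprint small enough for a single-GPU setting.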