An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance

Carnegie Mellon University
Applications of Image Transcreation

Different applications of image transcreation today.

Abstract

Given the rise of multimedia content, human translators increasingly focus on culturally adapting not only words but also other modalities, such as images, to convey the same meaning. While several applications stand to benefit from this, machine translation systems remain confined to dealing with language in speech and text. In this work, we take a first step towards translating images to make them culturally relevant. First, we build three pipelines comprising state-of-the-art generative models to perform the task. Next, we build a two-part evaluation dataset: i) concept, comprising 600 images that are cross-culturally coherent, each focusing on a single concept, and ii) application, comprising 100 images curated from real-world applications. We conduct a multi-faceted human evaluation of the translated images to assess cultural relevance and meaning preservation. We find that, as of today, image-editing models fail at this task, but can be improved by leveraging LLMs and retrievers in the loop. Even the best pipelines successfully translate only 5% of images for some countries in the easier concept dataset, and no translation succeeds for some countries in the application dataset, highlighting the challenging nature of the task.

Evaluation Dataset

In this paper, we introduce a new task: transcreating images to make them culturally relevant. To help the machine learning community make progress on this task, we curate a test set composed of two parts, concept and application, containing 600 and 100 images, respectively. The concept dataset spans 7 countries and 17 categories (shown below) and focuses on a single concept per image. The application dataset is curated from children's storybooks and math worksheets for grades 1-3.

Dataset overview

Pipelines

We introduce three pipelines comprising state-of-the-art vision and language models to solve this task. The first pipeline uses an end-to-end image-editing model, InstructPix2Pix, with the prompt "Make this image culturally relevant to X", where X is one of the 7 countries above. The second pipeline captions the image with InstructBLIP, edits the caption with GPT-3.5, and edits the image with Plug-and-Play, using GPT's output as the edit instruction. The third pipeline similarly captions the image and edits the caption, but instead retrieves a natural image from DataComp-1B, using GPT's output as the search query.

*Update*: Recently, we also experimented with generating a new image given the LLM-edited caption. An overview of all pipelines is given below:

Pipelines
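The caption-based pipelines above share a common structure: caption the source image, adapt the caption to the target country, then turn the edited caption into a target image (by editing, retrieval, or generation). A minimal sketch of this composition is below; the model calls are toy stand-ins for illustration, not the paper's actual InstructBLIP, GPT-3.5, or Plug-and-Play components.

```python
# Sketch of the caption -> LLM-edit -> render structure shared by the
# caption-based pipelines. The three callables are placeholders; in the
# paper they correspond to InstructBLIP (captioning), GPT-3.5 (caption
# editing), and Plug-and-Play editing / DataComp-1B retrieval / image
# generation (rendering).
from typing import Callable

def transcreate(image: str,
                country: str,
                caption_fn: Callable[[str], str],
                edit_fn: Callable[[str, str], str],
                render_fn: Callable[[str], str]) -> str:
    """Caption the source image, adapt the caption to the target
    country, then render the edited caption as a target image."""
    caption = caption_fn(image)          # describe the source image
    edited = edit_fn(caption, country)   # culturally adapt the caption
    return render_fn(edited)             # edit, retrieve, or generate

# Toy stand-ins so the sketch runs end to end.
caption = lambda img: f"a photo of {img}"
edit = lambda cap, c: cap.replace("pancakes", "dosa") if c == "India" else cap
render = lambda cap: f"<image: {cap}>"

print(transcreate("pancakes", "India", caption, edit, render))
# -> <image: a photo of dosa>
```

Swapping `render_fn` is all that distinguishes the editing, retrieval, and generation variants in this sketch.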

Results

Setup: We transcreate all 700 images using the three pipelines, with each of the 7 countries as the target. We then conduct a human evaluation of the outputs, assessing them across multiple dimensions using a fixed template of questions.
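Since results are reported per country, the per-image human judgments ultimately reduce to per-country success rates. A small sketch of that aggregation, over made-up data (the field layout here is illustrative, not the paper's annotation schema):

```python
# Hypothetical aggregation of per-image success judgments into
# per-country success rates. The judgments below are made up.
from collections import defaultdict

def success_rates(judgments):
    """judgments: iterable of (country, succeeded) pairs."""
    total, wins = defaultdict(int), defaultdict(int)
    for country, ok in judgments:
        total[country] += 1
        wins[country] += ok
    return {c: wins[c] / total[c] for c in total}

rates = success_rates([("Japan", True), ("Japan", False),
                       ("Brazil", False), ("Brazil", False)])
print(rates)  # -> {'Japan': 0.5, 'Brazil': 0.0}
```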

Criteria for successful transcreation: We define a transcreation as successful if the target image satisfies the following criteria:

  1. Concept: The target image belongs to the same semantic category as the source image (e.g., a food image is transcreated to another food image).
  2. Application (Stories): The target image matches the text of the story.
  3. Application (Worksheets): The target image can be used to teach the same concept as the source image (e.g., a counting worksheet is transcreated to another worksheet that also teaches counting).
  4. All: The cultural relevance score of the target image is greater than that of the source image.
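The criteria above can be expressed as a simple predicate over human annotations. The sketch below assumes hypothetical field names (`relevance`, `category`, etc.); it is not the paper's evaluation code.

```python
# Sketch of the success criteria, applied to hypothetical annotation
# records. Field names are illustrative, not the paper's schema.
def is_successful(source: dict, target: dict, part: str) -> bool:
    # Criterion 4 applies to all parts: the target must be rated
    # more culturally relevant than the source.
    if target["relevance"] <= source["relevance"]:
        return False
    if part == "concept":        # criterion 1: same semantic category
        return target["category"] == source["category"]
    if part == "stories":        # criterion 2: image matches story text
        return target["matches_story_text"]
    if part == "worksheets":     # criterion 3: teaches the same concept
        return target["teaches_concept"] == source["teaches_concept"]
    raise ValueError(f"unknown dataset part: {part}")

src = {"relevance": 2, "category": "food"}
tgt = {"relevance": 4, "category": "food"}
print(is_successful(src, tgt, "concept"))  # -> True
```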


Visualizations of all results can be found here. A few examples are shown below:


BibTeX

@article{khanuja2024image,
  author  = {Khanuja, Simran and Ramamoorthy, Sathyanarayanan and Song, Yueqi and Neubig, Graham},
  title   = {An image speaks a thousand words, but can everyone listen? On translating images for cultural relevance},
  journal = {arXiv preprint},
  year    = {2024},
  url     = {https://arxiv.org/pdf/2404.01247}
}

The website design was adapted from Nerfies and Plug-and-Play.