Demo NADA Catalog
Double Jeopardy and Climate Impact in the Use of Large Language Models: Socio-economic Disparities and Reduced Utility for Non-English Speakers

World, 2024
Reference ID: JD_SCR_001
Producer(s): Aivin V. Solatorio, Gabriel Stefanini Vicente, Holly Krambeck, Olivier Dupriez
Metadata: JSON
Created on: Mar 24, 2025
Last modified: Mar 24, 2025
Page views: 105
Other Materials
GitHub repository: double-jeopardy-in-llms
External link
Author(s): Aivin Solatorio
Language: English
Description: GitHub repository for the research project.
Abstract: This work investigates the socio-economic disparities and reduced utility for non-English speakers in the use of large language models (LLMs). We use the FLORES-200 dataset and Ethnologue to analyze the socio-economic disparities in the use of LLMs. We also use OpenAI's GPT-4 API to assess the reduced utility of LLMs for non-English speakers.
Download: https://github.com/worldbank/double-jeopardy-in-llms/tree/main
Double Jeopardy and Climate Impact in the Use of Large Language Models: Socio-economic Disparities and Reduced Utility for Non-English Speakers
External link
Author(s): Aivin V. Solatorio, Gabriel Stefanini Vicente, Holly Krambeck, Olivier Dupriez
Date: 2024-10-14
Language: English
Abstract: Artificial Intelligence (AI), particularly large language models (LLMs), holds the potential to bridge language and information gaps, which can benefit the economies of developing nations. However, our analysis of FLORES-200, FLORES+, Ethnologue, and World Development Indicators data reveals that these benefits largely favor English speakers. Speakers of languages in low-income and lower-middle-income countries face higher costs when using OpenAI's GPT models via APIs because of how the system processes the input: tokenization. Around 1.5 billion people, speaking languages primarily from lower-middle-income countries, could incur costs that are 4 to 6 times higher than those faced by English speakers. Disparities in LLM performance are significant, and tokenization in models priced per token amplifies inequalities in access, cost, and utility. Moreover, using the quality of translation tasks as a proxy measure, we show that LLMs perform poorly in low-resource languages, presenting a "double jeopardy" of higher costs and poor performance for these users. We also discuss the direct impact of fragmentation in tokenizing low-resource languages on climate. This underscores the need for fairer algorithm development to benefit all linguistic groups.
Download: https://arxiv.org/abs/2410.10665
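The 4-to-6x cost gap described in the abstract follows directly from per-token API pricing: when the same sentence fragments into more tokens in one language than another, the per-request cost scales by that token-count ratio. The sketch below illustrates the arithmetic only; the token counts, language names, and price are made-up example values, not figures from the paper.

```python
# Illustration of the "double jeopardy" cost effect: per-token pricing
# means languages whose text tokenizes into more pieces pay more for
# the same content. All numbers below are hypothetical.

def api_cost(token_count: int, price_per_1k_tokens: float) -> float:
    """Cost of processing `token_count` tokens at a per-1k-token price."""
    return token_count / 1000 * price_per_1k_tokens

def cost_ratio(tokens_lang: int, tokens_english: int) -> float:
    """How many times more a language pays than English for the same text."""
    return tokens_lang / tokens_english

PRICE = 0.03  # hypothetical USD per 1k tokens

# Hypothetical token counts for the *same* sentence in three languages.
tokens = {"English": 20, "LanguageA": 80, "LanguageB": 120}

for lang, n in tokens.items():
    print(f"{lang}: {n} tokens, "
          f"cost ${api_cost(n, PRICE):.4f}, "
          f"{cost_ratio(n, tokens['English']):.1f}x English")
```

Because cost is linear in token count, a language that fragments into 4x as many tokens pays exactly 4x as much per request, which is how the paper's reported multipliers arise from tokenizer behavior alone.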