Demo NADA Catalog

Double Jeopardy and Climate Impact in the Use of Large Language Models: Socio-economic Disparities and Reduced Utility for Non-English Speakers

World, 2024
Reference ID
JD_SCR_001
Producer(s)
Aivin V. Solatorio, Gabriel Stefanini Vicente, Holly Krambeck, Olivier Dupriez
Metadata
JSON
Created on
Mar 24, 2025
Last modified
Mar 24, 2025
  • Overview

    Abstract

    This work investigates socio-economic disparities and reduced utility for non-English speakers in the use of large language models (LLMs). We use the FLORES-200, Ethnologue, and World Development Indicators datasets to analyze the socio-economic disparities, and OpenAI's GPT-4 API to assess the reduced utility of LLMs for non-English speakers.

    Reproducibility Package

    Scripts
    Tokenization of FLORES dataset
    File name
    compute-premium-costs.ipynb
    Title
    Tokenization of FLORES dataset
    Format
    Jupyter Python Notebook
    Description
    Computes the tokenization premium for the FLORES dataset. The calculation of the population-weighted GDP for each language is also done in this notebook.
    License
    Name
    Mozilla Public License
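A minimal sketch of how a tokenization premium relative to English could be computed over parallel sentences. It assumes the premium is the ratio of a language's total token count to the English total for the same sentences; the function names, the `eng` key, and the toy whitespace tokenizer are illustrative and are not taken from the notebook. In practice `count_tokens` could wrap a real tokenizer, for example `len(tiktoken.get_encoding("o200k_base").encode(text))`.

```python
from typing import Callable, Dict, List


def tokenization_premium(
    parallel: Dict[str, List[str]],
    count_tokens: Callable[[str], int],
    reference: str = "eng",
) -> Dict[str, float]:
    """Ratio of each language's total token count to the reference total."""
    ref_total = sum(count_tokens(s) for s in parallel[reference])
    if ref_total == 0:
        raise ValueError("reference language produced zero tokens")
    return {
        lang: sum(count_tokens(s) for s in sentences) / ref_total
        for lang, sentences in parallel.items()
    }


def toy_count(text: str) -> int:
    # Stand-in tokenizer (assumption): one token per whitespace-separated word.
    return len(text.split())


# Made-up parallel corpus; "xxx" is a hypothetical language code.
corpus = {
    "eng": ["the cat sleeps", "it rains"],  # 5 toy tokens
    "xxx": ["a b c d e f", "g h i j"],      # 10 toy tokens
}
premiums = tokenization_premium(corpus, toy_count)
```

With the toy tokenizer, the hypothetical language needs twice as many tokens as English, so its premium is 2.0 and the English premium is 1.0 by construction.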
    Back-translation task for the FLORES dataset
    File name
    back-translation-task.ipynb
    Title
    Back-translation task for the FLORES dataset
    Format
    Jupyter Python Notebook
    Description
    Generates the back-translation task for the FLORES dataset. The notebook implements the batched translation strategy for the translation task and uses the OpenAI GPT-4o API.
    License
    Name
    Mozilla Public License
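One way a batched translation strategy could be organized is sketched below: sentences are grouped into batches, each batch is sent as one numbered prompt, and the numbered reply is parsed back into a list. The prompt wording, batch size, and all function names are assumptions for illustration, not the notebook's implementation. The model call is injected as `call_model`, so the sketch runs without network access; a real version could wrap the `openai` package's chat completions call.

```python
import re
from typing import Callable, Iterator, List


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_prompt(sentences: List[str], target: str = "English") -> str:
    """Number the sentences so the reply can be matched back to its inputs."""
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(sentences, start=1))
    return (
        f"Translate each numbered sentence into {target}. "
        "Reply with the same numbering, one translation per line.\n" + numbered
    )


def parse_reply(reply: str, expected: int) -> List[str]:
    """Extract 'N. text' lines; fail loudly if any number is missing."""
    found = dict(re.findall(r"^\s*(\d+)\.\s*(.+?)\s*$", reply, flags=re.MULTILINE))
    try:
        return [found[str(i)] for i in range(1, expected + 1)]
    except KeyError as exc:
        raise ValueError(f"missing translation number {exc}") from None


def translate_all(
    sentences: List[str],
    call_model: Callable[[str], str],
    batch_size: int = 10,
) -> List[str]:
    """Translate all sentences, one model call per batch."""
    translations: List[str] = []
    for batch in batched(sentences, batch_size):
        reply = call_model(build_prompt(batch))
        translations.extend(parse_reply(reply, len(batch)))
    return translations


def fake_model(prompt: str) -> str:
    # Stand-in for an API call: upper-cases every numbered line it receives.
    lines = [ln for ln in prompt.splitlines() if re.match(r"^\d+\. ", ln)]
    return "\n".join(ln.upper() for ln in lines)


result = translate_all(["uno", "dos", "tres"], fake_model, batch_size=2)
```

Batching reduces the number of requests, but a reply that drops or merges a line would misalign the outputs, which is why the parser raises an error instead of silently padding.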
    Additional analysis of the results
    File name
    analysis.ipynb
    Title
    Additional analysis of the results
    Format
    Jupyter Python Notebook
    Description
    Notebook for additional analysis of the results. Key visualizations are generated in this notebook, including the comparison of the tokenization premiums between two different tokenizers (GPT-4o vs. GPT-4 Turbo).
    License
    Name
    Mozilla Public License
    Source code repository
    Repository name
    double-jeopardy-in-llms
    Type
    GitHub
    URI
    https://github.com/worldbank/double-jeopardy-in-llms/tree/main
    Software
    Python
    Name
    Python
    Version
    3.10
    Libraries
    • requests; pandas; docutils; jupyter-book; datasets; tiktoken; fire; openpyxl; tokenizers; ipykernel; transformers; torch; plotly; httpx; joblib; nbformat; openai; ipywidgets; groq; matplotlib; kaleido; scipy; statsmodels

    Reproducibility

    Technology environment

    This work was developed on a MacBook Pro with an M1 Pro processor and 64 GB of RAM. No GPU is needed for the computations.

    Technology requirements

    Access to the OpenAI API is required.

    Reproduction instructions

    Some of the notebooks are not publicly available because they handle proprietary data from Ethnologue. One of the notebooks computes the adjusted population from Ethnologue's historical figures and annual population growth rates.
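Since that notebook is not public, the following is only a sketch of the arithmetic such an adjustment might use: compounding annual growth rates from the year a speaker count was recorded up to a target year. It assumes rates are given in percent and that the rate for year y describes the change from year y-1 to year y; the function name and figures are made up.

```python
from typing import Dict


def project_population(
    base_population: float,
    base_year: int,
    target_year: int,
    growth_pct: Dict[int, float],
) -> float:
    """Compound annual growth rates (in percent) from base_year to target_year.

    Assumes growth_pct[y] is the percentage change from year y-1 to year y.
    """
    if target_year < base_year:
        raise ValueError("target_year must not precede base_year")
    population = base_population
    for year in range(base_year + 1, target_year + 1):
        population *= 1.0 + growth_pct[year] / 100.0
    return population


# Made-up figures: 1,000,000 speakers recorded in 2019, projected to 2022.
rates = {2020: 2.0, 2021: 2.0, 2022: 1.0}
estimate = project_population(1_000_000, 2019, 2022, rates)
```

With these illustrative rates, the projected figure is 1,000,000 × 1.02 × 1.02 × 1.01, or about 1,050,804 speakers.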

    This repository uses poetry to manage dependencies. To install the dependencies, run the following command:
    `poetry install`

    To review the list of dependencies, please refer to the pyproject.toml file.

    VS Code / Cursor users can use the Python extension to run the notebooks.

    Use the following command to spin up a local Jupyter server:

    `poetry run jupyter notebook`

    It is recommended to use a virtual environment to run the code.

    Additionally, the notebooks/compute-premium-costs.ipynb notebook uses the OpenAI API. To use the API, you need to set the OPENAI_API_KEY environment variable. You can create a .env file in the root of the repository and add the following:

    `OPENAI_API_KEY=<your-openai-api-key>`
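A minimal sketch of reading such a .env file with only the standard library and checking that the key is present before any API call. How the repository itself loads the file is not described here, so the helper names and parsing rules below are assumptions.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse KEY=VALUE lines and export them unless already set in the environment."""
    loaded = {}
    env_path = Path(path)
    if not env_path.is_file():
        return loaded
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip("'\"")
        os.environ.setdefault(key, value)  # existing variables take precedence
        loaded[key] = value
    return loaded


def require_api_key() -> str:
    """Return OPENAI_API_KEY or raise a clear error if it is missing."""
    key = os.environ.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to .env or export it")
    return key
```

Calling `load_env_file()` followed by `require_api_key()` at the top of a notebook would fail early with a readable message instead of partway through a long run.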

    Data

    Datasets
    FLORES-200 and FLORES+
    Name
    FLORES-200 and FLORES+
    Note
    A multilingual dataset covering about 200 languages, with roughly 1,000 sentences per language. Used for evaluating translation quality and computing the tokenization premium relative to English.
    Data URL
    https://github.com/facebookresearch/flores
    Ethnologue
    Name
    Ethnologue
    Note
    Provides linguistic data, including the number of speakers, geographic distribution, and writing systems. We use Ethnologue to estimate the number of speakers for each language.
    Data URL
    https://www.ethnologue.com/
    World Bank, World Development Indicators (WDI)
    Name
    World Bank, World Development Indicators (WDI)
    Note
    Contains socio-economic data at the country level. Specifically, we use the GDP per capita in current US$ (NY.GDP.PCAP.CD) and annual population growth rate (SP.POP.GROW) indicators to compute the population-weighted GDP for each language and to align population estimates to 2022 based on historical figures from Ethnologue.
    License
    CC BY 4.0
    Data URL
    https://datacatalog.worldbank.org/dataset/world-development-indicators
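As an illustration of what a population-weighted GDP for a language could look like, the sketch below weights each country's GDP per capita by the number of speakers of the language in that country. The exact formula used in the notebooks is not given in this record, so the weighting scheme, the handling of missing values, and all names and numbers are assumptions.

```python
from typing import Dict


def population_weighted_gdp(
    speakers_by_country: Dict[str, float],
    gdp_per_capita: Dict[str, float],
) -> float:
    """Speaker-weighted mean of country-level GDP per capita.

    Countries without a GDP value are skipped, which is an assumption of
    this sketch rather than a documented rule.
    """
    weighted_sum = 0.0
    total_speakers = 0.0
    for country, speakers in speakers_by_country.items():
        gdp = gdp_per_capita.get(country)
        if gdp is None:
            continue
        weighted_sum += speakers * gdp
        total_speakers += speakers
    if total_speakers == 0:
        raise ValueError("no speakers with matching GDP data")
    return weighted_sum / total_speakers


# Made-up numbers for a hypothetical language spoken in two countries.
speakers = {"AAA": 3_000_000, "BBB": 1_000_000}
gdp = {"AAA": 2_000.0, "BBB": 6_000.0}
value = population_weighted_gdp(speakers, gdp)
```

In this example three quarters of the speakers live in the lower-income country, so the weighted figure (3,000) sits closer to 2,000 than the unweighted mean of 4,000 would.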
    OpenAI GPT-4o and GPT-4 Turbo APIs
    Name
    OpenAI GPT-4o and GPT-4 Turbo APIs
    Note
    Used to assess the reduced utility of LLMs for non-English speakers. We applied translation with different prompting methods to generate reference translations for FLORES sentences. The LLM translated non-English sentences into English, with the original English sentences serving as a benchmark for evaluating translation quality.
    Data URL
    https://openai.com/api/
    Citation requirements

    Please cite our paper as follows when referencing this work.

    @misc{solatorio2024doublejeopardyclimateimpact,
    title={Double Jeopardy and Climate Impact in the Use of Large Language Models: Socio-economic Disparities and Reduced Utility for Non-English Speakers},
    author={Aivin V. Solatorio and Gabriel Stefanini Vicente and Holly Krambeck and Olivier Dupriez},
    year={2024},
    eprint={2410.10665},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2410.10665},
    }

    Description

    Output
    Double Jeopardy and Climate Impact in the Use of Large Language Models: Socio-economic Disparities and Reduced Utility for Non-English Speakers
    Type
    Working paper
    Title
    Double Jeopardy and Climate Impact in the Use of Large Language Models: Socio-economic Disparities and Reduced Utility for Non-English Speakers
    Authors
    Aivin V. Solatorio, Gabriel Stefanini Vicente, Holly Krambeck, Olivier Dupriez
    Abstract
    Artificial Intelligence (AI), particularly large language models (LLMs), holds the potential to bridge language and information gaps, which can benefit the economies of developing nations. However, our analysis of FLORES-200, FLORES+, Ethnologue, and World Development Indicators data reveals that these benefits largely favor English speakers. Speakers of languages in low-income and lower-middle-income countries face higher costs when using OpenAI's GPT models via APIs because of how the system processes the input: tokenization. Around 1.5 billion people, speaking languages primarily from lower-middle-income countries, could incur costs that are 4 to 6 times higher than those faced by English speakers. Disparities in LLM performance are significant, and tokenization in models priced per token amplifies inequalities in access, cost, and utility. Moreover, using the quality of translation tasks as a proxy measure, we show that LLMs perform poorly in low-resource languages, presenting a "double jeopardy" of higher costs and poor performance for these users. We also discuss the direct impact of fragmentation in tokenizing low-resource languages on climate. This underscores the need for fairer algorithm development to benefit all linguistic groups.
    URL
    https://arxiv.org/abs/2410.10665
    DOI
    https://doi.org/10.48550/arXiv.2410.10665
    Project website
    • https://github.com/worldbank/double-jeopardy-in-llms
    Authoring entity
    Agency Name Affiliation
    Aivin V. Solatorio World Bank
    Gabriel Stefanini Vicente World Bank
    Holly Krambeck World Bank
    Olivier Dupriez World Bank
    Date of production

    2024-10

    Scope and coverage

    Geographic locations
    Location Code
    World WLD

    Access and rights

    License
    Name URI
    Mozilla Public License https://www.mozilla.org/en-US/MPL/

    Information on metadata

    Producers
    Name
    John Doe
    Date of Production

    2025-03-14
