This package contains code and data for a statistical forecasting approach to predict the outbreak of food crises.
Name |
---|
Bo P.J. Andree |
>> INSTALLATION
The user will need to follow standard installation instructions for R.
* To avoid unexpected issues, it is recommended to run this code on a similar R installation and OS, i.e. Microsoft R Open 3.5.1 on Ubuntu 16.04.5 and r-studio-server 1.2.5001.
Install the required R packages (lines 5 - 34 in predicting_food_crises_dependencies.R).
* Note that many R packages require the user to install dependencies on the Ubuntu OS itself.
* The user will need to install packages manually, since there is currently no good way to automate this. This is due to the large number of direct and indirect dependencies inside and outside R.
- At the end of this readme file, a print out of sessionInfo() is provided such that versions of all packages can be viewed.
* Note that the main R code (predicting_food_crises.R) sources the dependencies, the balanced learners, and reads the data.
- The user needs to specify the folder that contains these files in line 8. The default value is '/home/predicting_food_crisis_package/', which assumes this package is unzipped in the home folder of Ubuntu.
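Before running the main script, it can help to check which packages and sourced files are still missing. The following is a minimal base-R sketch, not part of the package: the function name `check_setup` and the example package list are hypothetical, and the real package list should be taken from predicting_food_crises_dependencies.R.

```r
# Hypothetical helper: report packages that are not installed and sourced
# files that are not found in the chosen folder.
check_setup <- function(package_dir, required_packages) {
  # installed.packages() returns a matrix whose row names are package names.
  installed <- rownames(installed.packages())
  # The two files that the README says are sourced by the main script.
  sourced_files <- c("predicting_food_crises_dependencies.R",
                     "predicting_food_crises_balanced_learners.R")
  present <- file.exists(file.path(package_dir, sourced_files))
  list(
    missing_packages = setdiff(required_packages, installed),
    missing_files = sourced_files[!present]
  )
}

# Example call; "caret" is mentioned in this README, the folder is a placeholder.
print(check_setup(".", c("caret")))
```

An empty `missing_packages` and an empty `missing_files` entry suggest that the folder in line 8 and the R library are set up as expected.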
The code can be run in a terminal, in which case the data plots will not be visible to the user.
* One solution is to run the code on RStudio Server. When set up correctly, one can access the RStudio IDE from anywhere via a web browser and use plot functionality. The code was developed on r-studio-server 1.2.5001. This can be installed by following standard installation procedures.
>> CONFIGURATION
There are a number of choices that the user can make to control the behavior of the main program:
* Lines 15-26 are options to control the definition of the dependent variable and the treatment of independent variables.
The default settings run a model on all countries, use IPC 3 and above as the positive class, use only exogenous covariates as predictors, add synthetic cases to the training data, calculate additional features, and restrict linear correlation to 0.75. These are the settings that correspond to the paper.
* Lines 31-32 control the type of learner used. The default settings correspond to a simplified RF algorithm that delivers good results (nearly identical to the paper) but runs much faster.
See also the comments in the code.
* Lines 35-37 control an imputation strategy in case a missing value is encountered. These settings should not matter when the supplied data is used.
* Lines 40-43 control the cross-validation. Note that the number of repetitions has been reduced to make the runtime and RAM requirements more manageable.
* Lines 46-55 control the compute environment.
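To illustrate what restricting linear correlation to 0.75 can mean in practice, here is a minimal base-R sketch of one common approach: a greedy filter that keeps a feature only if its absolute Pearson correlation with every already-kept feature is at most the cutoff. This is an assumption about the general technique, not the package's actual implementation; the function name `drop_correlated` and the toy data are hypothetical.

```r
# Hypothetical greedy correlation filter (not taken from the package code).
drop_correlated <- function(x, cutoff = 0.75) {
  cor_abs <- abs(cor(x))  # absolute pairwise Pearson correlations
  keep <- character(0)
  for (name in colnames(x)) {
    # Keep the column if it is not too correlated with any kept column.
    if (length(keep) == 0 || all(cor_abs[name, keep] <= cutoff)) {
      keep <- c(keep, name)
    }
  }
  x[, keep, drop = FALSE]
}

# Toy features: b is an exact multiple of a, so it is dropped.
features <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, 6, 8, 10),
  c = c(2, -1, 0, 1, -2)
)
filtered <- drop_correlated(features, cutoff = 0.75)
print(colnames(filtered))
```

The result depends on column order, since earlier columns are given priority; other filters make a different choice about which member of a correlated pair to drop.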
Default settings:
* Note that parallel processing works differently on Ubuntu than on other operating systems, but it generally involves creating copies of dependencies or compute environments, so memory requirements can be extremely high even when the initial data set seems manageable. For this reason the following simplifications have been made to the default settings:
- The number of validation samples has been reduced from 50 to 10.
- The tuning parameters of the default RF model have been fixed at recommended values. To run full tuning or use one of the alternative balanced classifiers, change MODEL_METHOD to one of the classifiers from predicting_food_crises_balanced_learners.R
- When an alternative model is used, the length of the tuning grid has been reduced to 5; the paper uses 10.
* These settings produce results similar to those presented in the main paper, but the runtime and RAM requirements have been drastically reduced (depending of course on the number of CPUs available).
- The final code at (recommended) default settings was last run on a D32s_v3 VM with 32 CPUs and 128 GiB RAM, reaching 100% CPU utilization and approx. 60% RAM utilization, and took just below 2.5 hours to complete.
- By default, the code runs on the entire data set that is provided. Note that the paper only trains and cross-validates on data up to February 2019. With the current settings, it is thus straightforward to update the data set and make real forecasts.
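To mimic the paper's sample rather than the full data set, the data could be split at the end of February 2019 before training. The sketch below shows the idea on a toy data frame; the column names `date` and `value` and the object names are hypothetical, since the structure of the supplied data is not described here.

```r
# Toy data standing in for the real data set; values are made up.
panel <- data.frame(
  date = as.Date(c("2018-12-01", "2019-02-01", "2019-03-01", "2020-01-01")),
  value = c(1, 2, 3, 4)
)

# Observations up to February 2019 form the paper-style sample.
cutoff <- as.Date("2019-02-28")
training_data <- panel[panel$date <= cutoff, ]

# Later observations are the period for which real forecasts could be made.
new_data <- panel[panel$date > cutoff, ]

print(nrow(training_data))
print(nrow(new_data))
```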
>> EXECUTION
Running code:
* After installation, simply unpack the folder, point the code (line 8) to the correct folder and run predicting_food_crises.R.
* The code is currently not set up to write results to disk. As always, complex R objects can be saved for re-use using saveRDS() and text can be written using write.csv().
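As a minimal sketch of how results could be written to disk with the two functions mentioned above: the object `results`, its contents, and the file names are placeholders, not objects produced by the package.

```r
# Placeholder object standing in for a fitted model or results list;
# the numbers are made up for illustration only.
results <- list(
  method = "rf",
  metrics = data.frame(fold = 1:3, score = c(0.5, 0.6, 0.7))
)

# Write to a temporary folder here; point this at a real folder in practice.
out_dir <- tempdir()
rds_path <- file.path(out_dir, "results.rds")
csv_path <- file.path(out_dir, "metrics.csv")

# saveRDS() keeps the full R object; write.csv() writes a plain-text table.
saveRDS(results, rds_path)
write.csv(results$metrics, csv_path, row.names = FALSE)

# The object can be restored in a later session with readRDS().
restored <- readRDS(rds_path)
```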
>> TROUBLESHOOTING
Dependencies:
* Make sure all OS dependencies are installed such that all libraries can be installed. Then make sure that all R libraries are installed and that also their dependencies are installed.
* See the sessionInfo() readout at the end of this file.
* Make sure the predicting_food_crises_dependencies.R and predicting_food_crises_balanced_learners.R files are correctly sourced.
Unexpected crash with different compute settings:
* If a different VM is used or the settings are changed and the program crashes halfway, keep an eye on the RAM usage. On Ubuntu this can be monitored with htop.
If RAM usage is too high, reduce the number of cores used in lines 46-55.
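One rough way to pick a lower core count is to cap it by available memory, since each parallel worker holds its own copy of the data and dependencies. The sketch below is an illustration under stated assumptions: the helper `choose_cores` is hypothetical, and the memory needed per worker is an estimate the user has to supply, for example from watching htop during a short run.

```r
library(parallel)

# Hypothetical helper: use as many cores as fit in RAM, and at least one.
choose_cores <- function(available_cores, ram_gib, gib_per_worker) {
  by_memory <- floor(ram_gib / gib_per_worker)
  as.integer(max(1, min(available_cores, by_memory)))
}

# detectCores() may return NA on some platforms, so fall back to 1.
available_cores <- detectCores()
if (is.na(available_cores)) available_cores <- 1L

# Example: 128 GiB of RAM and an assumed 8 GiB per worker.
n_cores <- choose_cores(available_cores, ram_gib = 128, gib_per_worker = 8)
print(n_cores)
```

The resulting value would then be entered in the compute settings in lines 46-55 of the main script.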
NA values in validation metrics:
* A common issue with caret is that validation metrics are returned as NA. This is likely the result of a missing dependency in the worker environment, which may occur because different operating systems handle parallelization differently. Check whether the issue persists when MODEL_METHOD is set to another value, for example 'multinom'.
The published materials do not contain any confidential information.
The citation of this work is Andree, Bo Pieter Johannes; Chamorro, Andres; Kraay, Aart; Spencer, Phoebe; Wang, Dieter. 2020. Predicting Food Crises. Policy Research Working Paper; No. 9412. World Bank, Washington, DC.
Agency Name | Affiliation |
---|---|
Bo Pieter Johannes Andree | World Bank |
This work was prepared as background for the Famine Action Mechanism (FAM).
The authors would like to thank Nadia Piffaretti, Zacharey Carmichael, Harun Dogo, Arif Hussain, Luca Russo, Jose Lopez, Colin Bruce, Nick Haan, Frank Davenport, Dan Maxwell, Joanna Macrae, Soomin Park, Marco Zambotti, Sardar Azari, Therese Norman-Monroe, Jacob LaRiviere, and the IPC, WFP mVAM, and FAO teams for invaluable contributions in the initial phase of this work.
In particular, we'd like to thank the participants of the FAM Workshop in Geneva in February 2018 hosted by ICRC, the Artemis Working Days in Rome in April 2018 hosted by WFP, the FAM Data and Analytics meetings with global tech partners in Rome and New York in September 2018, and the participants in the Predictive Analytics workshop hosted by UN OCHA at the Centre for Humanitarian Data in The Hague in April 2019.
2020-10-07
Location | Code |
---|---|
Afghanistan | AFG |
Burkina Faso | BFA |
Chad | TCD |
Congo, Dem. Rep. | COD |
Ethiopia | ETH |
Guatemala | GTM |
Haiti | HTI |
Kenya | KEN |
Malawi | MWI |
Mali | MLI |
Mauritania | MRT |
Mozambique | MOZ |
Niger | NER |
Nigeria | NGA |
Somalia | SOM |
South Sudan | SSD |
Sudan | SDN |
Uganda | UGA |
Yemen, Rep. | YEM |
Zambia | ZMB |
Zimbabwe | ZWE |
Name |
---|
Health |
Nutrition |
These results and the related working paper reflect the views of the authors, and do not reflect the official views of the World Bank, its Executive Directors, or the countries they represent.
1.0
September 2020
Name | URI |
---|---|
CCA 4.0 | https://creativecommons.org/licenses/by/4.0 |
Name | Affiliation |
---|---|
Bo P.J. Andree | World Bank |
Andres Chamorro | World Bank |
Nadia Piffaretti | World Bank |