Hosted by: DataKind & Producers Direct
Role: Volunteer Data Scientist
This project analyzes 21 million SMS messages from smallholder farmers in East Africa (Kenya, Uganda, and Tanzania). By integrating ERA5 Satellite Climate Reanalysis data with a multilingual Natural Language Processing (NLP) pipeline, I investigated how climate patterns and seasonality drive farmer inquiries.
This research was conducted as part of a DataKind DataKit.
"A DataKit™ is a work-ready set of data, software, and innovation questions, curated by DataKind... All learnings, ideas, and insights resulting from this DataKit will be aggregated and used to expand impact in the financial inclusion and economic opportunity sector."
The Dataset Source (WeFarm / Producers Direct): Producers Direct acquired this dataset from WeFarm, a peer-to-peer SMS platform. The archive represents 7 years of agricultural activity, including:
- 7.6 Million+ farmer questions across 4 languages (English, Swahili, Luganda, Runyankole).
- 17.2 Million+ responses and 200,000+ farming tips.
- Demographic Data covering key agricultural regions in East Africa.
The Challenge: Producers Direct needed to extract actionable patterns from this unstructured text. My research focused on determining if external climate factors (Rainfall/Temperature) could be used to forecast spikes in specific topics like "Pest & Disease."
- The Model: Random Forest Regressor using ERA5 weather features.
- The Finding: While Seasonality (Month) is the primary driver of question volume, 1-Month Lagged Rainfall emerged as one of the top environmental predictors ($R^2=0.18$ for 'Pests & Disease' Question Topics).
- Future Application: With further refinement (weekly aggregation), this 4-week lag signal could trigger automated SMS alerts to warn farmers of impending pest incubation periods.
- Read the Full Analysis
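As a rough illustration of the modeling approach, the sketch below builds a 1-month rainfall lag and scores a Random Forest with time-ordered cross-validation. It runs on synthetic data: the column names (`rainfall_mm`, `temperature_c`, `pest_questions`) and the hyperparameters are assumptions for illustration, not the project's actual schema or settings, and the scores it produces say nothing about the real result.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import TimeSeriesSplit

# Synthetic monthly series standing in for ERA5 features (assumed names).
rng = np.random.default_rng(0)
months = pd.date_range("2015-01-01", periods=84, freq="MS")
month_num = months.month.to_numpy()
rain = rng.gamma(shape=2.0, scale=40.0, size=len(months))

df = pd.DataFrame(
    {
        "month": month_num,
        "rainfall_mm": rain,
        "temperature_c": 22 + 3 * np.sin(2 * np.pi * month_num / 12),
    },
    index=months,
)

# 1-month lag: each row sees the previous month's rainfall.
df["rainfall_lag1"] = df["rainfall_mm"].shift(1)
df = df.dropna().copy()  # the first month has no lagged value

# Synthetic target: question volume driven by season and lagged rain.
df["pest_questions"] = (
    200
    + 0.8 * df["rainfall_lag1"]
    + 30 * np.cos(2 * np.pi * df["month"] / 12)
    + rng.normal(0, 20, len(df))
)

features = ["month", "rainfall_mm", "rainfall_lag1", "temperature_c"]
X, y = df[features], df["pest_questions"]

# TimeSeriesSplit always trains on the past and tests on the future,
# which avoids leaking later months into the training folds.
scores = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    preds = model.predict(X.iloc[test_idx])
    scores.append(r2_score(y.iloc[test_idx], preds))

# Feature importances from the model fitted on the last fold.
importances = pd.Series(
    model.feature_importances_, index=features
).sort_values(ascending=False)

print("R^2 per fold:", [round(s, 2) for s in scores])
print(importances)
```

Using `shift(1)` on a monthly index is what makes the feature a strict lag; moving to weekly aggregation would mean resampling first and shifting by four periods instead.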
- The Finding: Farmer engagement does not peak during the primary planting season ("Long Rains"). Instead, it spikes during the Short Rains/Post-Harvest window (Aug–Nov).
- The Baseline: This insight offers a data-driven hypothesis for resource allocation.
- Future Application: With further refinement, including analysis of the full dataset (beyond stratified sampling) and stronger translation models validated by native speakers, these insights could guide services to target farmers' needs during the post-harvest handling and second-season preparation period.
- Read the Full Analysis
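The seasonality check behind this finding can be sketched as a count of questions per calendar month and the share that falls in the Aug–Nov window. The tiny sample and the `sent_at` column name are hypothetical, used only to show the shape of the computation.

```python
import pandas as pd

# Hypothetical sample: one row per farmer question (assumed column name).
questions = pd.DataFrame(
    {
        "sent_at": pd.to_datetime(
            [
                "2018-03-05", "2018-04-11", "2018-09-02", "2018-09-20",
                "2018-10-14", "2018-10-30", "2018-11-08", "2019-04-01",
                "2019-09-15", "2019-10-03",
            ]
        )
    }
)

# Count questions per calendar month, keeping months with zero questions.
monthly_counts = (
    questions["sent_at"].dt.month
    .value_counts()
    .reindex(list(range(1, 13)), fill_value=0)
    .sort_index()
)

# Share of all questions that fall in the Aug-Nov window (months 8 to 11).
share = monthly_counts / monthly_counts.sum()
window_share = share.loc[8:11].sum()

print(monthly_counts)
print(f"Aug-Nov share: {window_share:.0%}")
```

On the real data, the same aggregation would need to run on the full archive rather than a stratified sample before the window share is treated as more than a hypothesis.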
- Data Processing: pandas, duckdb, Google Colab (handling 7.6M+ rows).
- Machine Learning: scikit-learn (Random Forest, TimeSeriesSplit).
- NLP: nltk, Google Gemini (batch translation for local languages).
- ERA5 Reanalysis Data
Email: allisonnsibrian@gmail.com

