To see the code used in this post, visit my kernel on Kaggle in R Markdown format.
- Objectives: To visualise a dataset and understand its main trends.
- Challenge: Largest dataset worked with so far.
- Data points: 13,424,100
- Language: R
2 Purpose of this post
Muhammad Yunus and the Grameen Bank won the Nobel Peace Prize in 2006 for “their efforts through microcredit to create economic and social development from below.”
Back in 1976, Yunus, at the time a professor at the University of Chittagong (Bangladesh), noticed that small amounts of money could make a substantial difference to people living in poverty. He started to loan money to people who didn’t meet the requirements of the mainstream banking system. These types of loans were reported to help borrowers “emerge” from poverty, and to have default rates lower than those of commercial banks, reported at 2%. Eventually, in October 1983, Muhammad Yunus founded Grameen Bank, considered to be the first microfinance institution.
Founded in 2005, Kiva has the same mission as Grameen Bank, except that anyone can become a Kiva banker. This online platform enables microcredit lending to help low-income entrepreneurs around the world with a couple of clicks. Pretty neat, huh? In this post, I unpack a large dataset published by Kiva on the Kaggle platform and explore these microloans.
3 The data
The dataset was published during the first months of 2018 on the Kaggle platform.
The complete dataset was a zip file with size 232.7 MB containing four files:
loan_themes_by_region.csv
After I looked at the contents, I chose to work with the first one.
The dataset had 671,205 observations and 20 variables.
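As a quick sanity check, the "data points" figure quoted at the top of the post is just rows times columns. A minimal sketch, using the dimensions reported above (the variable names are my own, chosen for illustration):

```r
# Dimensions reported for the loans file: observations (rows) and variables (columns)
n_obs  <- 671205
n_vars <- 20

# Total number of cells in the table
data_points <- n_obs * n_vars

# Print with thousands separators for readability
format(data_points, big.mark = ",")
```

With the data loaded into a data frame, `prod(dim(df))` would give the same number, where `df` is whatever name the loaded data frame has.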
Part of the codebook came with the dataset; for the rest, I did some research and made assumptions:
funded_amount: “The amount disbursed by Kiva to the field agent(USD)”
loan_amount: “The amount disbursed by the field agent to the borrower(USD)”
activity: “More granular category”
sector: “High level category”
use: “Exact usage of loan amount”
country_code: “ISO country code of country in which loan was disbursed”
country: “Full country name of country in which loan was disbursed”
region: “Full region name within the country”
currency: “The currency in which the loan was disbursed”
partner_id: “ID of partner organization”
This variable has a lot of missing values and the Kiva explanation on Kaggle doesn’t go much further. For now, I will exclude it.
posted_time: “The time at which the loan is posted on Kiva by the field agent”
disbursed_time: “The time at which the loan is disbursed by the field agent to the borrower”
funded_time: “The time at which the loan posted to Kiva gets funded by lenders completely”
term_in_months: “The duration for which the loan was disbursed in months”
lender_count: “The total number of lenders that contributed to this loan”
borrower_genders: “Comma separated M,F letters, where each instance represents a single male/female in the group”
repayment_interval: Not specified, so we’ll assume the standard definition: the schedule on which the loan is repaid to the lender.
4 Data cleaning
Looking at the structure of our data, we can quickly see there are a few bits that don’t make sense and need to be fixed.
a. Missing values
We will leave missing values in for now.
From the variable descriptions, we expected borrower_genders to have only two levels, male and female. Here we see many more levels, 11,298 to be precise. This isn’t very clear, so we’ll fix that first by creating five levels:
 "mixed_genders" "mult_females" "mult_males" "single_female"  "single_male"
Now, since we’re trying to make sense of all loans, it’s better if loan_amounts is in the same currency. Let’s convert it to USD.
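The conversion step itself boils down to a lookup of one rate per currency. The sketch below is only an illustration of that idea: the column `amount_local`, the vector `rates_per_usd`, and the rates in it are placeholders I made up (they are not real exchange rates and not taken from the kernel). A currency with no rate available simply ends up as a missing value.

```r
# Toy loans table; amounts and currencies are illustrative only
loans <- data.frame(
  amount_local = c(1000, 5000, 250),
  currency     = c("KES", "PHP", "XYZ"),
  stringsAsFactors = FALSE
)

# Placeholder rates (units of local currency per 1 USD), NOT real exchange rates
rates_per_usd <- c(KES = 100, PHP = 50)

# Look up each loan's rate by currency code; unknown codes give NA
rate <- unname(rates_per_usd[loans$currency])
loans$amount_usd <- loans$amount_local / rate

loans
```

In practice the rates would come from a data source rather than being typed in by hand, and the rows left as `NA` show which currencies still need attention.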
Finally, with 86 countries, we have 86 levels. It would be interesting to create another, coarser category for the region of the world, to produce fewer levels and get a better understanding of how loans are distributed across regions.
Our country codes are in the ISO 3166 format, so we will use the associated region code found here and make five regions: Africa, Americas, Asia, Europe, and Oceania.
 "Africa" "Americas" "Asia" "Europe" "Oceania"
Let’s calculate two lengths of time that I think are interesting. First, how much time passes from the moment the loan is posted to the moment it’s disbursed (total_time). Second, how long a loan takes to get fully funded after it is posted (giving_time).
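Both durations can be computed with `difftime()` once the timestamps are parsed. The sketch below uses made-up timestamps and assumes a plain "YYYY-MM-DD HH:MM:SS" format in UTC; the helper `to_time` is hypothetical. Here `total_time` is disbursement minus posting, and `giving_time` is full funding minus posting (as defined in the questions at the end of this post).

```r
# Toy rows with illustrative timestamps, not taken from the dataset
loans <- data.frame(
  posted_time    = c("2014-01-01 06:12:39", "2014-01-02 09:17:23"),
  disbursed_time = c("2013-12-17 08:00:00", "2014-01-10 08:00:00"),
  funded_time    = c("2014-01-02 10:06:32", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: parse a timestamp string as UTC
to_time <- function(x) as.POSIXct(x, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Days from posting to disbursement (negative if disbursed before posting)
loans$total_time <- as.numeric(
  difftime(to_time(loans$disbursed_time), to_time(loans$posted_time), units = "days")
)

# Days from posting to being fully funded (NA if never fully funded)
loans$giving_time <- as.numeric(
  difftime(to_time(loans$funded_time), to_time(loans$posted_time), units = "days")
)

loans[, c("total_time", "giving_time")]
```

Fixing `units = "days"` matters: left to itself, `difftime()` picks a unit based on the size of the differences, which makes values hard to compare across runs.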
5 Exploratory data analysis
Now, let’s describe the data starting with some plots and tables to understand it.
5.1 Borrower gender by region
In all regions, single females request the most loans followed by single males. In Asia and South America, the third category is multiple females while in Africa it’s mixed genders.
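A comparison like this comes down to a two-way table of region against gender group. A minimal sketch on toy data (the column names `continent` and `gender_group` are assumptions, and the rows are invented for illustration):

```r
# Toy data: one row per loan
loans <- data.frame(
  continent    = c("Africa", "Africa", "Asia", "Asia", "Asia", "Americas"),
  gender_group = c("single_female", "mixed_genders", "single_female",
                   "single_male", "single_female", "mult_females"),
  stringsAsFactors = FALSE
)

# Number of loans for each region / gender group combination
counts <- table(loans$continent, loans$gender_group)

# Share of each gender group within a region (each row sums to 1)
shares <- prop.table(counts, margin = 1)

round(shares, 2)
```

Row-wise shares make regions with very different loan volumes comparable, which raw counts do not.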
5.2 Repayment interval by region
5.3 Loan use by region
Notice that most loans are requested by single females; the fewest loans are given in Europe; weekly repayment is an unpopular way of paying loans back; and entertainment, wholesale, manufacturing, and construction amount to less than 2.2% of sectors.
Now I’m curious to look at the top 10 countries requesting Kiva loans, this time per capita and per number of internet users.
Now let’s look at the distributions of our numeric variables.
6 Techniques used
I used the quantmod package to convert all loans into a single currency (US dollars) for comparison. Two currencies were unavailable.
Taking all the different levels that came in the borrower_genders variable and creating five neat categories to better understand who the borrowers are was good practice with lists and the
7 Questions from this analysis
Food (as in other regions) the second most common use for loans in Asia.
Who are the givers? Where are they? Does proximity of the lender to the borrower have anything to do with funding times?
Does the Kiva website have anything to do with funding times? For example, giving_time, the time between posting the loan and the loan being fully funded, has two peaks, at around one week and one month. Is this due to the platform and the promotion of loans that have been posted for a certain amount of time?