# 1 Summary

To see the code used in this post, visit my kernel on Kaggle in R Markdown format.

• Objectives: To visualise a dataset and understand its main trends.
• Challenge: Largest dataset worked with so far.
• Data points: 13,424,100 (671,205 observations × 20 variables)
• Language: R

# 2 Purpose of this post

Muhammad Yunus and the Grameen Bank won the Nobel Peace Prize in 2006 for “their efforts through microcredit to create economic and social development from below.”

Back in 1976, Yunus, at the time a professor at the University of Chittagong (Bangladesh), noticed that small amounts of money could make a substantial difference to people living in poverty. He started to loan money to people who didn’t meet the requirements set by the mainstream banking system. These types of loans were reported to be effective in helping people “emerge” from poverty, with default rates lower than those of commercial banks, reported at 2%. Eventually, in October 1983, Muhammad Yunus founded Grameen Bank, considered to be the first microfinance institution.

Founded in 2005, Kiva has the same mission as Grameen Bank, except that anyone can become a Kiva banker. This online platform enables microcredit lending to help low-income entrepreneurs around the world with a couple of clicks. Pretty neat, huh? In this post, I unpack a large dataset published by Kiva on the Kaggle platform and explore these microloans.

# 3 The data

The dataset was published during the first months of 2018 on the Kaggle platform.

The complete dataset was a 232.7 MB zip file containing four files: kiva_loans.csv, kiva_mpi_region_locations.csv, loan_theme_ids.csv, and loan_themes_by_region.csv. After looking at the contents, I chose to work with the first one, kiva_loans.csv.

The dataset had 671,205 observations and 20 variables.

Part of the codebook came with the dataset; for the rest, I did some research and made assumptions:

• funded_amount: “The amount disbursed by Kiva to the field agent(USD)”
• loan_amount: “The amount disbursed by the field agent to the borrower(USD)”
• activity: “More granular category”
• sector: “High level category”
• use: “Exact usage of loan amount”
• country_code: “ISO country code of country in which loan was disbursed”
• country: “Full country name of country in which loan was disbursed”
• region: “Full region name within the country”
• currency: “The currency in which the loan was disbursed”
• partner_id: “ID of partner organization”

This variable has a lot of missing values and the Kiva explanation on Kaggle doesn’t go much further. For now, I will exclude partner_id.

• posted_time: “The time at which the loan is posted on Kiva by the field agent”
• disbursed_time: “The time at which the loan is disbursed by the field agent to the borrower”
• funded_time: “The time at which the loan posted to Kiva gets funded by lenders completely”
• term_in_months: “The duration for which the loan was disbursed in months”
• lender_count: “The total number of lenders that contributed to this loan”
• borrower_genders: “Comma separated M,F letters, where each instance represents a single male/female in the group”
• repayment_interval: Not specified, so we’ll assume the standard meaning: the schedule on which the loan is repaid to the lender.

# 4 Data cleaning

Looking at the structure of our data, we can soon see there are a few bits that don’t make sense and need to be fixed.

a. Missing values

We will leave missing values in for now.

b. borrower_genders

From the variable descriptions, we expected borrower_genders to have only two levels, male or female. Here we see many more levels, 11,298 to be precise, because each loan lists the gender of every borrower in the group. This isn’t very clear, so we’ll fix that first by creating five levels:

[1] "mixed_genders" "mult_females"  "mult_males"    "single_female"
[5] "single_male"  
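The recoding above could be sketched roughly as follows. This is a minimal base R version rather than the stringr code I actually used, and it assumes each entry is a comma-separated run of the words "female" and "male"; the function name `classify_borrowers` and the toy vector are made up for illustration.

```r
# Collapse a comma-separated gender string into one of five categories.
# Assumption: entries are built from the words "female" and "male".
classify_borrowers <- function(x) {
  vapply(as.character(x), function(entry) {
    # Missing or empty entries stay missing.
    if (is.na(entry) || !nzchar(trimws(entry))) {
      return(NA_character_)
    }
    parts <- trimws(strsplit(entry, ",", fixed = TRUE)[[1]])
    n_female <- sum(parts == "female")
    n_male <- sum(parts == "male")
    if (n_female > 0 && n_male > 0) {
      "mixed_genders"
    } else if (n_female > 1) {
      "mult_females"
    } else if (n_male > 1) {
      "mult_males"
    } else if (n_female == 1) {
      "single_female"
    } else if (n_male == 1) {
      "single_male"
    } else {
      NA_character_  # anything unrecognised is treated as missing
    }
  }, character(1), USE.NAMES = FALSE)
}

# Toy input (hypothetical), one entry per category.
toy <- c("female", "male", "female, female", "male, male", "female, male, female")
borrower_type <- factor(classify_borrowers(toy))
levels(borrower_type)
```

Counting males and females per entry, instead of matching whole strings, is what lets thousands of distinct combinations fall into the same five buckets.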

c. loan_amounts

Now, since we’re trying to make sense of all loans, it’s better if all loan amounts are in the same currency. Let’s convert them to USD.
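As a rough sketch of the conversion step, the snippet below multiplies each amount by a rate looked up from a named vector. The currency codes `AAA`, `BBB`, `ZZZ` and the rates are placeholders invented for illustration, not real quotes, and the column names `amount_local` and `amount_usd` are hypothetical. In the real analysis the rates came from a package rather than a hand-written table.

```r
# Placeholder rates: USD per one unit of local currency (NOT real quotes).
usd_per_unit <- c(USD = 1, AAA = 0.5, BBB = 0.01)

# Convert amounts to USD; currencies without a known rate become NA
# instead of silently keeping the local value.
to_usd <- function(amount, currency, rates) {
  amount * unname(rates[as.character(currency)])
}

# Toy data (hypothetical); "ZZZ" stands in for a currency with no rate.
loans <- data.frame(
  amount_local = c(100, 200, 5000, 300),
  currency = c("USD", "AAA", "BBB", "ZZZ"),
  stringsAsFactors = FALSE
)
loans$amount_usd <- to_usd(loans$amount_local, loans$currency, usd_per_unit)
loans
```

Returning `NA` for an unknown currency keeps unconverted amounts from mixing with converted ones in later summaries.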

d. country_codes

Finally, with 86 countries, we have 86 levels. It would be interesting to create a broader category for world regions, which gives us fewer levels and a clearer picture of how loans are distributed geographically.

Our country codes are in the ISO 3166 format, so we will use the associated region code found here and make five regions: Africa, Americas, Asia, Europe, and Oceania.

[1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania" 
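A minimal sketch of the mapping, assuming two-letter ISO 3166-1 codes in `country_code`. The lookup below covers only five countries for illustration (a full table would list every country in the data), and the new column name `continent` is a hypothetical choice that avoids overwriting the existing region variable.

```r
# Tiny illustrative lookup: ISO 3166-1 alpha-2 code -> world region.
# KE = Kenya, PE = Peru, PH = Philippines, AL = Albania, WS = Samoa.
region_lookup <- c(KE = "Africa", PE = "Americas", PH = "Asia",
                   AL = "Europe", WS = "Oceania")

# Toy data (hypothetical).
loans <- data.frame(
  country_code = c("PH", "KE", "PE", "WS", "AL", "PH"),
  stringsAsFactors = FALSE
)

# Index the named vector by code; unmatched codes would become NA.
loans$continent <- factor(unname(region_lookup[loans$country_code]))
levels(loans$continent)
```

One thing to watch when reading the CSV: Namibia's ISO code is `NA`, which `read.csv` treats as a missing value by default.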

e. Dates

Let’s calculate two lengths of time that I think are interesting. First, how much time passes from the moment the loan is posted to the moment it’s disbursed (total_time). Second, how long a loan takes to get fully funded after it’s posted (giving_time).
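These two durations might be computed along the following lines. This is a sketch that assumes the timestamps are strings in `YYYY-MM-DD HH:MM:SS` form and measures both durations in days; the unit and the toy rows are my assumptions for illustration.

```r
# Parse timestamp strings as UTC date-times (assumed format).
parse_time <- function(x) {
  as.POSIXct(x, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
}

# Toy data (hypothetical); the second loan was never fully funded.
loans <- data.frame(
  posted_time = c("2014-01-01 00:00:00", "2014-01-10 00:00:00"),
  disbursed_time = c("2014-01-08 00:00:00", "2014-01-20 00:00:00"),
  funded_time = c("2014-01-02 12:00:00", NA),
  stringsAsFactors = FALSE
)

posted <- parse_time(loans$posted_time)
disbursed <- parse_time(loans$disbursed_time)
funded <- parse_time(loans$funded_time)

# total_time: posted -> disbursed; giving_time: posted -> fully funded.
loans$total_time <- as.numeric(difftime(disbursed, posted, units = "days"))
loans$giving_time <- as.numeric(difftime(funded, posted, units = "days"))
loans[, c("total_time", "giving_time")]
```

Fixing `units = "days"` explicitly matters because `difftime` otherwise picks a unit based on the size of the differences, which makes results harder to compare.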

# 5 Exploratory data analysis

Now, let’s describe the data starting with some plots and tables to understand it.

## 5.1 Borrower gender by region

Table 5.1: A single female is the most common type of borrower gender with over half of all Kiva loans requested.

| Gender | Percentage |
|---|---|
| single_female | 66.04 |
| single_male | 17.79 |
| mult_females | 9.58 |
| mixed_genders | 6.08 |
| mult_males | 0.52 |
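Percentage tables like this one can be produced with a small helper around `table()` and `prop.table()`. The sketch below is one way to do it; the function name `share_table` and the toy vector are hypothetical.

```r
# Share of each level of a categorical vector, as a percentage,
# sorted from most to least common. Missing values are dropped by table().
share_table <- function(x, digits = 2) {
  counts <- table(x)
  pct <- round(100 * prop.table(counts), digits)
  out <- data.frame(
    level = names(pct),
    percentage = as.numeric(pct),
    stringsAsFactors = FALSE
  )
  out <- out[order(-out$percentage), ]
  rownames(out) <- NULL
  out
}

# Toy input (hypothetical): 6 single females, 3 single males, 1 female group.
toy <- c(rep("single_female", 6), rep("single_male", 3), "mult_females")
share_table(toy)
```

The same helper applies unchanged to repayment interval, sector, or region.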

In all regions, single females request the most loans, followed by single males. In Asia and the Americas, the third category is multiple females, while in Africa it’s mixed genders.

## 5.2 Repayment interval by region

Table 5.2: The most popular type of repayment interval is monthly.

| Repayment interval | Percentage |
|---|---|
| monthly | 49.7 |
| irregular | 40.6 |
| bullet | 9.7 |

## 5.3 Loan use by region

Table 5.3: Generally, the most frequent use of loans is agriculture, followed by food and retail.

| Sector | Percentage |
|---|---|
| Agriculture | 27.3 |
| Food | 21.1 |
| Retail | 18.8 |
| Services | 6.7 |
| Personal Use | 5.8 |
| Education | 5.0 |
| Clothing | 4.9 |
| Housing | 4.9 |
| Transportation | 2.3 |
| Arts | 1.9 |
| Health | 1.4 |

Table 5.4: Half of all Kiva loans are requested in Asia, followed by Africa, the Americas, Oceania and finally Europe.

| Region | Percentage |
|---|---|
| Asia | 49.74 |
| Africa | 25.83 |
| Americas | 22.77 |
| Oceania | 1.18 |
| Europe | 0.47 |

Notice that most loans are requested by single females, the fewest loans are requested in Europe, weekly repayment is an unpopular way of paying loans back, and entertainment, wholesale, manufacturing, and construction amount to less than 2.2% of sectors.

Now I’m curious to look at the top 10 countries requesting Kiva loans, not by raw count but per capita and per number of internet users.

Let’s see how the top 10 changes when it comes to internet users. I used this ranking, which is based on numbers published by the International Telecommunications Union.
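A normalised ranking of this kind can be sketched by joining loan counts to a table of denominators and sorting by the ratio. Everything in the snippet is a toy stand-in: the countries `A`, `B`, `C`, the counts, and the population figures are invented, and the per-100,000 scaling is my choice for readability. Swapping the population column for internet users gives the second ranking.

```r
# Toy inputs (hypothetical): loans per country and a denominator table.
loan_counts <- data.frame(
  country = c("A", "B", "C"),
  loans = c(500, 200, 50),
  stringsAsFactors = FALSE
)
population <- data.frame(
  country = c("A", "B", "C"),
  population = c(1e6, 1e5, 1e4),
  stringsAsFactors = FALSE
)

# Join on country, compute loans per 100,000 people, rank descending.
ranked <- merge(loan_counts, population, by = "country")
ranked$loans_per_100k <- 1e5 * ranked$loans / ranked$population
ranked <- ranked[order(-ranked$loans_per_100k), ]
rownames(ranked) <- NULL
head(ranked, 10)
```

In this toy example the country with the fewest loans in absolute terms comes out on top once population is taken into account, which is exactly the kind of reshuffling the per capita view is meant to reveal.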

Now let’s look at the distributions of our numeric variables.

# 6 Techniques used

• I used the quantmod package to convert all loans into a single currency (US dollars) for comparison. Two currencies were unavailable.

• Collapsing the many levels of the borrower_genders variable into five neat categories, to better understand who the borrowers are, was good practice with lists and the stringr package.

# 7 Questions from this analysis

• Why is Retail, and not Food (as in other regions), the second most common use for loans in Asia?

• Who are the givers? Where are they? Does proximity of the lender to the borrower have anything to do with funding times?

• Does the Kiva website have anything to do with funding times? For example, giving_time, the time between posting the loan and the loan being fully funded, has two peaks, at around one week and one month. Is this due to the platform and the promotion of loans that have been posted for a certain amount of time?