This is the code for my best submission in the “Pump it Up: Data Mining the Water Table” cometition, hosted by DrivenData. This competition really interested me because DrivenData hosts competitions for non-profits. Having the ability to help out an organization in need, especially when the goal is determining if wells are functioning in Tanzania, is a very gratifying knowning that I might provide some assistance.
The sections of the analyis are: -Inspecting the data -Clean the data -Feature Engineering -Building the model -Summary
Let’s start with a look at how much data we have.
## [1] 59400 40
So we have a fair amount of data here, 40 features with 59,400 rows of data. Now for a closer look at the data.
## 'data.frame': 59400 obs. of 40 variables:
## $ id : int 69572 8776 34310 67743 19728 9944 19816 54551 53934 46144 ...
## $ amount_tsh : num 6000 0 25 0 0 20 0 0 0 0 ...
## $ date_recorded : chr "2011-03-14" "2013-03-06" "2013-02-25" "2013-01-28" ...
## $ funder : chr "Roman" "Grumeti" "Lottery Club" "Unicef" ...
## $ gps_height : int 1390 1399 686 263 0 0 0 0 0 0 ...
## $ installer : chr "Roman" "GRUMETI" "World vision" "UNICEF" ...
## $ longitude : num 34.9 34.7 37.5 38.5 31.1 ...
## $ latitude : num -9.86 -2.15 -3.82 -11.16 -1.83 ...
## $ wpt_name : chr "none" "Zahanati" "Kwa Mahundi" "Zahanati Ya Nanyumbu" ...
## $ num_private : int 0 0 0 0 0 0 0 0 0 0 ...
## $ basin : chr "Lake Nyasa" "Lake Victoria" "Pangani" "Ruvuma / Southern Coast" ...
## $ subvillage : chr "Mnyusi B" "Nyamara" "Majengo" "Mahakamani" ...
## $ region : chr "Iringa" "Mara" "Manyara" "Mtwara" ...
## $ region_code : int 11 20 21 90 18 4 17 17 14 18 ...
## $ district_code : int 5 2 4 63 1 8 3 3 6 1 ...
## $ lga : chr "Ludewa" "Serengeti" "Simanjiro" "Nanyumbu" ...
## $ ward : chr "Mundindi" "Natta" "Ngorika" "Nanyumbu" ...
## $ population : int 109 280 250 58 0 1 0 0 0 0 ...
## $ public_meeting : chr "True" "" "True" "True" ...
## $ recorded_by : chr "GeoData Consultants Ltd" "GeoData Consultants Ltd" "GeoData Consultants Ltd" "GeoData Consultants Ltd" ...
## $ scheme_management : chr "VWC" "Other" "VWC" "VWC" ...
## $ scheme_name : chr "Roman" "" "Nyumba ya mungu pipe scheme" "" ...
## $ permit : chr "False" "True" "True" "True" ...
## $ construction_year : int 1999 2010 2009 1986 0 2009 0 0 0 0 ...
## $ extraction_type : chr "gravity" "gravity" "gravity" "submersible" ...
## $ extraction_type_group: chr "gravity" "gravity" "gravity" "submersible" ...
## $ extraction_type_class: chr "gravity" "gravity" "gravity" "submersible" ...
## $ management : chr "vwc" "wug" "vwc" "vwc" ...
## $ management_group : chr "user-group" "user-group" "user-group" "user-group" ...
## $ payment : chr "pay annually" "never pay" "pay per bucket" "never pay" ...
## $ payment_type : chr "annually" "never pay" "per bucket" "never pay" ...
## $ water_quality : chr "soft" "soft" "soft" "soft" ...
## $ quality_group : chr "good" "good" "good" "good" ...
## $ quantity : chr "enough" "insufficient" "enough" "dry" ...
## $ quantity_group : chr "enough" "insufficient" "enough" "dry" ...
## $ source : chr "spring" "rainwater harvesting" "dam" "machine dbh" ...
## $ source_type : chr "spring" "rainwater harvesting" "dam" "borehole" ...
## $ source_class : chr "groundwater" "surface" "surface" "groundwater" ...
## $ waterpoint_type : chr "communal standpipe" "communal standpipe" "communal standpipe multiple" "communal standpipe multiple" ...
## $ waterpoint_type_group: chr "communal standpipe" "communal standpipe" "communal standpipe" "communal standpipe" ...
The top 10 rows of the data:
## id amount_tsh date_recorded funder gps_height
## 1 69572 6000 2011-03-14 Roman 1390
## 2 8776 0 2013-03-06 Grumeti 1399
## 3 34310 25 2013-02-25 Lottery Club 686
## 4 67743 0 2013-01-28 Unicef 263
## 5 19728 0 2011-07-13 Action In A 0
## 6 9944 20 2011-03-13 Mkinga Distric Coun 0
## 7 19816 0 2012-10-01 Dwsp 0
## 8 54551 0 2012-10-09 Rwssp 0
## 9 53934 0 2012-11-03 Wateraid 0
## 10 46144 0 2011-08-03 Isingiro Ho 0
## installer longitude latitude wpt_name num_private
## 1 Roman 34.93809 -9.856322 none 0
## 2 GRUMETI 34.69877 -2.147466 Zahanati 0
## 3 World vision 37.46066 -3.821329 Kwa Mahundi 0
## 4 UNICEF 38.48616 -11.155298 Zahanati Ya Nanyumbu 0
## 5 Artisan 31.13085 -1.825359 Shuleni 0
## 6 DWE 39.17280 -4.765587 Tajiri 0
## 7 DWSP 33.36241 -3.766365 Kwa Ngomho 0
## 8 DWE 32.62062 -4.226198 Tushirikiane 0
## 9 Water Aid 32.71110 -5.146712 Kwa Ramadhan Musa 0
## 10 Artisan 30.62699 -1.257051 Kwapeto 0
## basin subvillage region region_code
## 1 Lake Nyasa Mnyusi B Iringa 11
## 2 Lake Victoria Nyamara Mara 20
## 3 Pangani Majengo Manyara 21
## 4 Ruvuma / Southern Coast Mahakamani Mtwara 90
## 5 Lake Victoria Kyanyamisa Kagera 18
## 6 Pangani Moa/Mwereme Tanga 4
## 7 Internal Ishinabulandi Shinyanga 17
## 8 Lake Tanganyika Nyawishi Center Shinyanga 17
## 9 Lake Tanganyika Imalauduki Tabora 14
## 10 Lake Victoria Mkonomre Kagera 18
## district_code lga ward population public_meeting
## 1 5 Ludewa Mundindi 109 True
## 2 2 Serengeti Natta 280
## 3 4 Simanjiro Ngorika 250 True
## 4 63 Nanyumbu Nanyumbu 58 True
## 5 1 Karagwe Nyakasimbi 0 True
## 6 8 Mkinga Moa 1 True
## 7 3 Shinyanga Rural Samuye 0 True
## 8 3 Kahama Chambo 0 True
## 9 6 Tabora Urban Itetemia 0 True
## 10 1 Karagwe Kaisho 0 True
## recorded_by scheme_management scheme_name
## 1 GeoData Consultants Ltd VWC Roman
## 2 GeoData Consultants Ltd Other
## 3 GeoData Consultants Ltd VWC Nyumba ya mungu pipe scheme
## 4 GeoData Consultants Ltd VWC
## 5 GeoData Consultants Ltd
## 6 GeoData Consultants Ltd VWC Zingibali
## 7 GeoData Consultants Ltd VWC
## 8 GeoData Consultants Ltd
## 9 GeoData Consultants Ltd VWC
## 10 GeoData Consultants Ltd
## permit construction_year extraction_type extraction_type_group
## 1 False 1999 gravity gravity
## 2 True 2010 gravity gravity
## 3 True 2009 gravity gravity
## 4 True 1986 submersible submersible
## 5 True 0 gravity gravity
## 6 True 2009 submersible submersible
## 7 True 0 swn 80 swn 80
## 8 True 0 nira/tanira nira/tanira
## 9 True 0 india mark ii india mark ii
## 10 True 0 nira/tanira nira/tanira
## extraction_type_class management management_group payment
## 1 gravity vwc user-group pay annually
## 2 gravity wug user-group never pay
## 3 gravity vwc user-group pay per bucket
## 4 submersible vwc user-group never pay
## 5 gravity other other never pay
## 6 submersible vwc user-group pay per bucket
## 7 handpump vwc user-group never pay
## 8 handpump wug user-group unknown
## 9 handpump vwc user-group never pay
## 10 handpump vwc user-group never pay
## payment_type water_quality quality_group quantity quantity_group
## 1 annually soft good enough enough
## 2 never pay soft good insufficient insufficient
## 3 per bucket soft good enough enough
## 4 never pay soft good dry dry
## 5 never pay soft good seasonal seasonal
## 6 per bucket salty salty enough enough
## 7 never pay soft good enough enough
## 8 unknown milky milky enough enough
## 9 never pay salty salty seasonal seasonal
## 10 never pay soft good enough enough
## source source_type source_class
## 1 spring spring groundwater
## 2 rainwater harvesting rainwater harvesting surface
## 3 dam dam surface
## 4 machine dbh borehole groundwater
## 5 rainwater harvesting rainwater harvesting surface
## 6 other other unknown
## 7 machine dbh borehole groundwater
## 8 shallow well shallow well groundwater
## 9 machine dbh borehole groundwater
## 10 shallow well shallow well groundwater
## waterpoint_type waterpoint_type_group
## 1 communal standpipe communal standpipe
## 2 communal standpipe communal standpipe
## 3 communal standpipe multiple communal standpipe
## 4 communal standpipe multiple communal standpipe
## 5 communal standpipe communal standpipe
## 6 communal standpipe multiple communal standpipe
## 7 hand pump hand pump
## 8 hand pump hand pump
## 9 hand pump hand pump
## 10 hand pump hand pump
Hmm, we have some missing data here, such as in “scheme_name” and “scheme_management.” Let’s take a look at the amount of missing data in each feature.
## id amount_tsh date_recorded
## 0 0 0
## funder gps_height installer
## 0 0 0
## longitude latitude wpt_name
## 0 0 0
## num_private basin subvillage
## 0 0 0
## region region_code district_code
## 0 0 0
## lga ward population
## 0 0 0
## public_meeting recorded_by scheme_management
## 0 0 0
## scheme_name permit construction_year
## 0 0 0
## extraction_type extraction_type_group extraction_type_class
## 0 0 0
## management management_group payment
## 0 0 0
## payment_type water_quality quality_group
## 0 0 0
## quantity quantity_group source
## 0 0 0
## source_type source_class waterpoint_type
## 0 0 0
## waterpoint_type_group
## 0
Okay…we don’t have any null values, but we know there are some missing values. We’ll have to look at the features one-by-one to find what’s not quite right.
First, let’s see what the status of the water pumps are, since that’s our main feature.
##
## functional functional needs repair non functional
## 32259 4317 22824
It’s good to see that most of the wells are functioning, but the data is very unequally distributed.
Next, what are the top 20 subvillages.
## feature
## Shuleni Majengo Madukani Kati Mtakuja
## 646 631 629 470 467 322
## Sokoni M Muungano Mbuyuni Songambele Mlimani
## 294 243 215 210 194 190
## Miembeni Msikitini 1 Kanisani Kibaoni Mjini
## 168 164 157 146 143 131
## Mjimwema Mapinduzi
## 129 126
Hmm, looks like we have some interesting values here, including capital letters, empty strings, and 1. There are a few things we could do to try to correct these values, such as assign them the nearest subvillages based on latitude or longitude, however these features and many others are also missing values, so I am going to assign the value of ‘None’ to all of the subvillages.
## [1] "Number of subvillages with the value 'None'"
##
## FALSE TRUE
## 72954 1296
Below are the region codes, but I am going to remove this feature because we will use dummy variables to represent the regions. You might be thinking, why not keep this feature and remove the regions’ names to save a step. To help to ensure that the data is correct and make feature engineering easier, I think that it is best to keep the regions’ names instead.
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 2779 3709 5494 3145 5079 2032 1020 375 499 3306 6608 5759 2536 2488 2242
## 16 17 18 19 20 21 24 40 60 80 90 99
## 3533 6334 4183 3824 2451 1972 402 1 1298 1536 1133 512
Below we have the district codes. There are 21 districts and 20 regions. I’m not sure how accurate the district codes are or how they work because when I try to find the average district code for subvillages, lgas, wards, or regions, there are never still 21 districts. I’m going to drop this feature because of its unreliability and there should be enough geographic features already.
## [1] "The district codes:"
##
## 0 1 2 3 4 5 6 7 8 13 23 30
## 27 15299 13929 12521 11253 5428 5108 4166 1282 496 360 1256
## 33 43 53 60 62 63 67 80
## 1063 653 921 76 127 264 8 13
Below we have the 20 most frequent wards, everything looks good here.
## feature
## Igosi Imalinyi Siha Kati Mdandu Nduruma
## 386 318 297 292 261
## Kitunda Mishamo Msindo Chalinze Maji ya Chai
## 260 251 243 238 227
## Usuka Chanika Ngarenanyuki Vikindu Mtwango
## 222 221 212 199 189
## Zinga/Ikerege Itete Matola Magomeni Maramba
## 185 179 177 170 167
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 318.7 20.0 350000.0
##
## 0 0.2 0.25 0.5 1 2 3 5 6 7
## 52049 4 1 1 3 18 1 471 231 87
## 9 10 12 15 20 25 26 30 33 35
## 1 995 1 17 1812 445 1 929 25 10
## 40 50 53 59 60 70 100 150 200 220
## 74 3103 1 1 19 30 1034 25 1516 1
## 250 300 306 350 400 450 500 520 550 590
## 731 688 1 4 55 59 3874 2 4 2
## 600 700 750 800 900 1000 1200 1300 1400 1500
## 219 31 71 2 1 1858 322 10 1 244
## 2000 2200 2400 2500 2550 2800 3000 3500 3600 4000
## 882 39 186 169 1 15 425 6 56 195
## 4500 4700 5000 5400 5500 6000 6300 6500 7000 7200
## 2 31 559 1 5 150 1 11 8 15
## 7500 8000 8500 9000 10000 11000 12000 13000 14000 15000
## 13 87 1 3 74 2 64 2 5 21
## 16000 16300 18000 20000 25000 26000 30000 35000 38000 40000
## 2 2 6 53 16 2 12 3 1 8
## 45000 50000 60000 70000 1e+05 117000 120000 138000 170000 2e+05
## 3 6 2 2 4 7 1 1 1 2
## 250000 350000
## 1 1
We have some very skewed data here. Let’s tranform amount_tsh by log10 to help make this plot a little clearer.
That’s much better. Now we can see the range of values better. I am going to create a new feature for this tranformation of amount_tsh.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.689 1.322 5.544
The dates are impossible to read, but they vary from 2002-10-14 to 2013-12-03, so we have quite the range here.
##
## Government Of Tanzania Danida
## 11299 4504 3907
## Hesawa Rwssp World Bank
## 2782 1703 1701
## Kkkt World Vision Unicef
## 1623 1562 1324
## Tasaf Dhv District Council
## 1136 1065 1037
## Private Individual Dwsp 0
## 1034 1012 980
## Norad Germany Republi Water
## 949 765 739
## Tcrs Ministry Of Water
## 735 728
Here we have the 20 most common funders, except the second key is missing and another is “0”. I’m going to set these to Unknown because I don’t think there will be an easy way to find their true values.
0 is a very common value for gps_height. I expect that this means we have many missing values. I am going to take an average of the subvillage values, where gps_height does not equal 0, and assign these values to the subvillages where the gps_height does equal 0.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -90 424 1178 1025 1495 2777
That looks better. A few subvillages have gps_heights very close to 0 or even below, but this is possible because some locations can be below sea level.
## feature
## DWE Government
## 21751 4532 2282
## RWE Commu DANIDA
## 1498 1347 1305
## KKKT Hesawa 0
## 1120 1070 980
## TCRS CES Central government
## 887 765 764
## DANID Community HESAWA
## 690 687 679
## District Council World vision TASAF
## 663 517 504
## LGA WEDECO
## 501 496
Much like with funder, the installer feature has keys with values of 0 and nothing. Let’s also change these to “Unknown”.
##
## 0 31.61952953 32.91986139 32.92057868 32.92488627 32.92601185
## 2269 2 2 2 2 2
## 32.93668943 32.95559708 32.95652279 32.96573445 32.96700926 32.9727187
## 2 2 2 2 2 2
## 32.9771906 32.9780624 32.98269806 32.98393225 32.98478963 32.98724945
## 2 2 2 2 2 2
## 32.98751118 32.98767048
## 2 2
Quite a few values are set to 0 here. Just like with gps_height, I am going to find, then apply, the non-zero average of the subvillages, which will hence forth be referred to as the ‘non-zero average function.’
## [1] "Summary of longitude"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.61 33.24 35.00 35.13 37.22 40.35
As expected, latitude has the same problem as longitude and the non-zero average function will be applied here.
##
## -2e-08 -9.2893492 -8.90539462 -7.17908174 -7.1772029 -7.17715478
## 2269 2 2 2 2 2
## -7.17517443 -7.17160463 -7.17083546 -7.17066586 -7.16249333 -7.10591905
## 2 2 2 2 2 2
## -7.10492257 -7.10462503 -7.10374232 -7.10200368 -7.10013213 -7.09516233
## 2 2 2 2 2 2
## -7.09513079 -7.0939143
## 2 2
## [1] "Summary of latitude"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -11.6500 -8.5800 -5.1190 -5.8520 -3.3670 -0.9985
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 73299 94 31 38 30 60 92 31 58 4 5 18
## 12 13 14 15 16 17 20 22 23 24 25 26
## 7 1 3 47 13 24 14 11 2 5 14 3
## 27 30 32 34 35 38 39 41 42 45 47 50
## 2 1 45 13 1 4 34 12 1 41 27 7
## 55 58 60 61 62 65 80 87 93 94 102 111
## 1 4 1 3 1 27 17 1 37 1 28 1
## 120 131 136 141 150 160 180 213 240 280 300 420
## 9 1 1 2 6 1 3 1 1 1 1 1
## 450 668 669 672 698 755 1402 1776
## 1 1 1 1 1 1 1 1
I am not sure what num_private is, but most of the values are 0 and there are a few extreme outliers. I’m going to keep this feature as it should still be be useful in the feature engineering stage.
## feature
## Lake Victoria Pangani Rufiji
## 12871 11143 9987
## Internal Lake Tanganyika Wami / Ruvu
## 9642 8052 7577
## Lake Nyasa Ruvuma / Southern Coast Lake Rukwa
## 6332 5587 3059
Everything looks good with the values of basin.
Looks like we have some big outliers and many of the values are very close to 0. Let’s take a closer look.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 25.0 180.8 215.0 30500.0
## [1] "Number of values that are <= 1:"
##
## FALSE TRUE
## 38634 35616
Many of the values are <= 1. I don’t believe this is the case, so we are going to use the non-zero average function, but using 1 instead of 0.
## [1] "public_meeting values"
##
## False True
## 4155 6346 63749
Since the number of True values outnumber False 10:1, I’m going to assign all of the empty strings to True.
## [1] "recorded_by values:"
##
## GeoData Consultants Ltd
## 74250
Since it’s always the same values, I’m going to drop this feature.
## [1] "scheme_management values"
##
## Company None Other
## 4846 1341 1 996
## Parastatal Private operator SWC Trust
## 2124 1326 123 92
## VWC Water authority Water Board WUA
## 45917 3975 3462 3551
## WUG
## 6496
Let’s set all of the empty strings to None.
## [1] "Top 20 most frequent scheme_name values"
## feature
##
## 35258
## K
## 858
## None
## 794
## Borehole
## 704
## Chalinze wate
## 501
## M
## 490
## DANIDA
## 483
## Government
## 395
## Ngana water supplied scheme
## 335
## wanging'ombe water supply s
## 323
## Bagamoyo wate
## 296
## wanging'ombe supply scheme
## 284
## I
## 281
## Uroki-Bomang'ombe water sup
## 266
## N
## 258
## Kirua kahe gravity water supply trust
## 237
## Machumba estate pipe line
## 225
## Makwale water supplied sche
## 209
## Kijiji
## 205
## Handeni Trunk Main(H
## 204
Looks like we have the same problem as with subvillage values. We’ll set all of these unlikely values to None.
## [1] "permit values"
##
## False True
## 3793 21851 48606
I’m going to change the value of True to 1 and False to 0, then find the average value for each subvillage (very similar to the non-zero average function). I will then apply these values to the empty strings.
##
## 0 1
## 22971 51279
Some very obvious incorrect values here. I’m going to assume that there is a strong correlation between proximity to another well and year built, i.e. wells in the same subvillage have the same (or very similar) construction year. Therefore, we are going to assign the missing values the average construction year in the same subvillage.
All of the tables below have values that look reasonable.
## [1] "extraction_type"
##
## afridev cemo
## 2208 108
## climax gravity
## 41 33263
## india mark ii india mark iii
## 3029 135
## ksb mono
## 1790 3628
## nira/tanira other
## 10205 8102
## other - mkulima/shinyanga other - play pump
## 2 101
## other - rope pump other - swn 81
## 572 284
## submersible swn 80
## 5982 4588
## walimi windmill
## 60 152
## [1] "extraction_type_group"
##
## afridev gravity india mark ii india mark iii
## 2208 33263 3029 135
## mono nira/tanira other other handpump
## 3628 10205 8102 447
## other motorpump rope pump submersible swn 80
## 149 572 7772 4588
## wind-powered
## 152
## [1] "extraction_type_class"
##
## gravity handpump motorpump other rope pump
## 33263 20612 3777 8102 572
## submersible wind-powered
## 7772 152
## [1] "management"
##
## company other other - school parastatal
## 859 1083 126 2229
## private operator trust unknown vwc
## 2504 105 683 50624
## water authority water board wua wug
## 1123 3688 3118 8108
## [1] "management_group"
##
## commercial other parastatal unknown user-group
## 4591 1209 2229 683 65538
## [1] "payment"
##
## never pay other pay annually
## 31712 1314 4570
## pay monthly pay per bucket pay when scheme fails
## 10397 11266 4842
## unknown
## 10149
## [1] "payment_type"
##
## annually monthly never pay on failure other per bucket
## 4570 10397 31712 4842 1314 11266
## unknown
## 10149
## [1] "water_quality"
##
## coloured fluoride fluoride abandoned
## 623 244 23
## milky salty salty abandoned
## 1005 6082 423
## soft unknown
## 63505 2345
## [1] "quality_group"
##
## colored fluoride good milky salty unknown
## 623 267 63505 1005 6505 2345
## [1] "quantity"
##
## dry enough insufficient seasonal unknown
## 7782 41522 18896 5075 975
## [1] "quantity_group"
##
## dry enough insufficient seasonal unknown
## 7782 41522 18896 5075 975
## [1] "Quantity group is the same as quantity, so we're going to drop it."
## [1] "source"
##
## dam hand dtw lake
## 840 1108 950
## machine dbh other rainwater harvesting
## 13822 261 2863
## river shallow well spring
## 11964 21140 21216
## unknown
## 86
## [1] "source_type"
##
## borehole dam other
## 14930 840 347
## rainwater harvesting river/lake shallow well
## 2863 12914 21140
## spring
## 21216
## [1] "source_class"
##
## groundwater surface unknown
## 57286 16617 347
## [1] "waterpoint_type"
##
## cattle trough communal standpipe
## 150 35628
## communal standpipe multiple dam
## 7611 8
## hand pump improved spring
## 21884 959
## other
## 8010
## [1] "waterpoint_type_group"
##
## cattle trough communal standpipe dam
## 150 43239 8
## hand pump improved spring other
## 21884 959 8010
Alright, the features have all been cleaned, now we can seperate the train and test sets to see how the features compare to the status of the wells.
We have some outliers, let’s take a closer look by transforming amount_tsh by log10.
## training$status: functional
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 461.8 50.0 350000.0
## --------------------------------------------------------
## training$status: functional needs repair
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 267.1 50.0 100000.0
## --------------------------------------------------------
## training$status: non functional
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 123.5 0.0 117000.0
We can see that there is typically more water at functioning wells, than nonfunctional. There are also quite a few extreme outliers for functional wells.
## training$status: functional
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -90.0 496.7 1235.0 1082.0 1546.0 2628.0
## --------------------------------------------------------
## training$status: functional needs repair
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -51 572 1170 1049 1450 2770
## --------------------------------------------------------
## training$status: non functional
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -59.0 355.4 1099.0 953.7 1432.0 2774.0
(Note: The three horizontal lines per graph represent the 75th, 50th, and 25th quantiles.) The values looks to be rather evenly distributed, even in terms of outliers.
## training$status: functional
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.539 0.000 1402.000
## --------------------------------------------------------
## training$status: functional needs repair
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3076 0.0000 160.0000
## --------------------------------------------------------
## training$status: non functional
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.414 0.000 1776.000
##
## FALSE TRUE
## 757 58643
Most of the values are 0 (58643 out of 59400), so it’s not too surprising to see the violin plots we have here.
## training$status: functional
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 100.0 216.2 365.9 400.0 30500.0
## --------------------------------------------------------
## training$status: functional needs repair
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.0 130.0 250.0 377.1 450.0 15300.0
## --------------------------------------------------------
## training$status: non functional
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 130.0 250.0 383.8 417.2 15300.0
The distributions look very similar here as well. The main differences that I see are the outliers in the functional wells and the higher number of 0 values in the non functional wells.
##
## FALSE
## 59400
## training$status: functional
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1960 1993 2002 1999 2008 2013
## --------------------------------------------------------
## training$status: functional needs repair
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1960 1986 1999 1996 2006 2013
## --------------------------------------------------------
## training$status: non functional
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1960 1984 1996 1994 2005 2013
It looks pretty clear that newer wells are more likely to be functional than older wells and that many wells were built during 2007/2008.
## training$status: functional
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.61 33.48 35.05 35.16 37.08 40.35
## --------------------------------------------------------
## training$status: functional needs repair
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.64 31.81 34.11 34.29 36.80 40.28
## --------------------------------------------------------
## training$status: non functional
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.61 33.06 35.01 35.21 37.58 40.34
Functional wells look to have a more central longitude. Extreme longitudes seem to be more common with wells that need repair or are non functional.
## training$status: functional
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -11.560 -8.681 -5.010 -5.836 -3.352 -0.999
## --------------------------------------------------------
## training$status: functional needs repair
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -11.650 -8.332 -4.730 -5.577 -3.369 -1.003
## --------------------------------------------------------
## training$status: non functional
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -11.5900 -8.5410 -5.5110 -5.9250 -3.4230 -0.9985
The main thing I see here is that there are disproportionately more well just below -3 and -9 degrees latitude.
Let’s rejoin the training and testing test to do some feature engineering (unfortunately this can only be seen if you look at my .RMD file).
To run the model properly, use cross validation and 10 folds, but to knit the file quickly to HTML, I changed these values.
## + Resample1: mtry=15
## Growing trees.. Progress: 33%. Estimated remaining time: 1 minute, 1 seconds.
## Growing trees.. Progress: 66%. Estimated remaining time: 31 seconds.
## - Resample1: mtry=15
## Aggregating results
## Fitting final model on full training set
## Growing trees.. Progress: 34%. Estimated remaining time: 1 minute, 1 seconds.
## Growing trees.. Progress: 65%. Estimated remaining time: 33 seconds.
## Growing trees.. Progress: 97%. Estimated remaining time: 2 seconds.
Predicting with the first validation set:
## Confusion Matrix and Statistics
##
## Reference
## Prediction functional functional needs repair
## functional 3903 319
## functional needs repair 76 174
## non functional 344 86
## Reference
## Prediction non functional
## functional 670
## functional needs repair 44
## non functional 2344
##
## Overall Statistics
##
## Accuracy : 0.8067
## 95% CI : (0.7978, 0.8153)
## No Information Rate : 0.5431
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.635
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: functional Class: functional needs repair
## Sensitivity 0.9028 0.30052
## Specificity 0.7281 0.98374
## Pos Pred Value 0.7978 0.59184
## Neg Pred Value 0.8631 0.94717
## Prevalence 0.5431 0.07274
## Detection Rate 0.4903 0.02186
## Detection Prevalence 0.6146 0.03693
## Balanced Accuracy 0.8155 0.64213
## Class: non functional
## Sensitivity 0.7665
## Specificity 0.9123
## Pos Pred Value 0.8450
## Neg Pred Value 0.8623
## Prevalence 0.3842
## Detection Rate 0.2945
## Detection Prevalence 0.3485
## Balanced Accuracy 0.8394
Predicting with the final validating set:
## Confusion Matrix and Statistics
##
## Reference
## Prediction functional functional needs repair
## functional 1968 154
## functional needs repair 24 92
## non functional 136 38
## Reference
## Prediction non functional
## functional 345
## functional needs repair 18
## non functional 1143
##
## Overall Statistics
##
## Accuracy : 0.8175
## 95% CI : (0.8051, 0.8295)
## No Information Rate : 0.5431
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6533
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: functional Class: functional needs repair
## Sensitivity 0.9248 0.32394
## Specificity 0.7212 0.98844
## Pos Pred Value 0.7977 0.68657
## Neg Pred Value 0.8897 0.94926
## Prevalence 0.5431 0.07249
## Detection Rate 0.5023 0.02348
## Detection Prevalence 0.6297 0.03420
## Balanced Accuracy 0.8230 0.65619
## Class: non functional
## Sensitivity 0.7590
## Specificity 0.9279
## Pos Pred Value 0.8679
## Neg Pred Value 0.8604
## Prevalence 0.3844
## Detection Rate 0.2917
## Detection Prevalence 0.3361
## Balanced Accuracy 0.8434
The goal of this model is to predict which wells are functioning, need repair, or are not working. The data was provided by DrivenData as part of their ‘Pump it Up: Data Mining the Water Table’ competition. I believe that my model does as rather good job of performing this task as it currently rank 65th out of 2567 users. My score (based on classification accuracy) is 0.8213, and the top score is 0.8285, so I’m not off the mark by too much. There are a few ways for how my score could be improved, including finding different values for the missing values when cleaning my data, engineering new features, using a different algorithm, using different features to train the model, and using an ensemble of models to make a prediction. After putting quite a few hours into build my model, I chose to stop fine-tuning my dataset and model because my time would be better spent working on different datasets/projects.
The five most important features for my model are: log_amount_tsh, construction_year, hasWater, extraction_type_groupsubmersible, and extraction_type_groupnira/tanira. From just these features we can learn that functional wells typically have more water at the source and are built more recently.
## [1] "Comparing status of wells with log_amount_tsh"
## finalTraining$status: functional
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.9017 1.7080 5.5440
## --------------------------------------------------------
## finalTraining$status: functional needs repair
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.7238 1.7080 5.0000
## --------------------------------------------------------
## finalTraining$status: non functional
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3796 0.0000 5.0680
## [1] "Comparing status of wells with construction_year"
## finalTraining$status: functional
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1960 1993 2002 1999 2008 2013
## --------------------------------------------------------
## finalTraining$status: functional needs repair
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1960 1986 1999 1996 2006 2013
## --------------------------------------------------------
## finalTraining$status: non functional
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1960 1984 1996 1994 2005 2013
## [1] "Comparing status of wells with hasWater"
## finalTraining$status: functional
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3898 1.0000 1.0000
## --------------------------------------------------------
## finalTraining$status: functional needs repair
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.293 1.000 1.000
## --------------------------------------------------------
## finalTraining$status: non functional
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1714 0.0000 1.0000
My predictions followed a similar distribution, with respect to the status of the wells, to the training dataset, but functioning wells were overrepresented and functional-needs-repair were underrepresented. I tried upsampling to counter the unequal distrbution, but it did not improve my score. This issue expresses how machine learning competition models differ from real world models. Rather than overall accuracy being most important, it would be more useful to know which wells are not working or need repair. Catering the model to predict these types of wells correctly could decrease the overall accuracy, but would improve the usefullness of the model in the real world.