This is the code for my best submission in the “Pump it Up: Data Mining the Water Table” cometition, hosted by DrivenData. This competition really interested me because DrivenData hosts competitions for non-profits. Having the ability to help out an organization in need, especially when the goal is determining if wells are functioning in Tanzania, is a very gratifying knowning that I might provide some assistance.

The sections of the analyis are: -Inspecting the data -Clean the data -Feature Engineering -Building the model -Summary

Inspecting the Data

Let’s start with a look at how much data we have.

## [1] 59400    40

So we have a fair amount of data here, 40 features with 59,400 rows of data. Now for a closer look at the data.

## 'data.frame':    59400 obs. of  40 variables:
##  $ id                   : int  69572 8776 34310 67743 19728 9944 19816 54551 53934 46144 ...
##  $ amount_tsh           : num  6000 0 25 0 0 20 0 0 0 0 ...
##  $ date_recorded        : chr  "2011-03-14" "2013-03-06" "2013-02-25" "2013-01-28" ...
##  $ funder               : chr  "Roman" "Grumeti" "Lottery Club" "Unicef" ...
##  $ gps_height           : int  1390 1399 686 263 0 0 0 0 0 0 ...
##  $ installer            : chr  "Roman" "GRUMETI" "World vision" "UNICEF" ...
##  $ longitude            : num  34.9 34.7 37.5 38.5 31.1 ...
##  $ latitude             : num  -9.86 -2.15 -3.82 -11.16 -1.83 ...
##  $ wpt_name             : chr  "none" "Zahanati" "Kwa Mahundi" "Zahanati Ya Nanyumbu" ...
##  $ num_private          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ basin                : chr  "Lake Nyasa" "Lake Victoria" "Pangani" "Ruvuma / Southern Coast" ...
##  $ subvillage           : chr  "Mnyusi B" "Nyamara" "Majengo" "Mahakamani" ...
##  $ region               : chr  "Iringa" "Mara" "Manyara" "Mtwara" ...
##  $ region_code          : int  11 20 21 90 18 4 17 17 14 18 ...
##  $ district_code        : int  5 2 4 63 1 8 3 3 6 1 ...
##  $ lga                  : chr  "Ludewa" "Serengeti" "Simanjiro" "Nanyumbu" ...
##  $ ward                 : chr  "Mundindi" "Natta" "Ngorika" "Nanyumbu" ...
##  $ population           : int  109 280 250 58 0 1 0 0 0 0 ...
##  $ public_meeting       : chr  "True" "" "True" "True" ...
##  $ recorded_by          : chr  "GeoData Consultants Ltd" "GeoData Consultants Ltd" "GeoData Consultants Ltd" "GeoData Consultants Ltd" ...
##  $ scheme_management    : chr  "VWC" "Other" "VWC" "VWC" ...
##  $ scheme_name          : chr  "Roman" "" "Nyumba ya mungu pipe scheme" "" ...
##  $ permit               : chr  "False" "True" "True" "True" ...
##  $ construction_year    : int  1999 2010 2009 1986 0 2009 0 0 0 0 ...
##  $ extraction_type      : chr  "gravity" "gravity" "gravity" "submersible" ...
##  $ extraction_type_group: chr  "gravity" "gravity" "gravity" "submersible" ...
##  $ extraction_type_class: chr  "gravity" "gravity" "gravity" "submersible" ...
##  $ management           : chr  "vwc" "wug" "vwc" "vwc" ...
##  $ management_group     : chr  "user-group" "user-group" "user-group" "user-group" ...
##  $ payment              : chr  "pay annually" "never pay" "pay per bucket" "never pay" ...
##  $ payment_type         : chr  "annually" "never pay" "per bucket" "never pay" ...
##  $ water_quality        : chr  "soft" "soft" "soft" "soft" ...
##  $ quality_group        : chr  "good" "good" "good" "good" ...
##  $ quantity             : chr  "enough" "insufficient" "enough" "dry" ...
##  $ quantity_group       : chr  "enough" "insufficient" "enough" "dry" ...
##  $ source               : chr  "spring" "rainwater harvesting" "dam" "machine dbh" ...
##  $ source_type          : chr  "spring" "rainwater harvesting" "dam" "borehole" ...
##  $ source_class         : chr  "groundwater" "surface" "surface" "groundwater" ...
##  $ waterpoint_type      : chr  "communal standpipe" "communal standpipe" "communal standpipe multiple" "communal standpipe multiple" ...
##  $ waterpoint_type_group: chr  "communal standpipe" "communal standpipe" "communal standpipe" "communal standpipe" ...

The top 10 rows of the data:

##       id amount_tsh date_recorded              funder gps_height
## 1  69572       6000    2011-03-14               Roman       1390
## 2   8776          0    2013-03-06             Grumeti       1399
## 3  34310         25    2013-02-25        Lottery Club        686
## 4  67743          0    2013-01-28              Unicef        263
## 5  19728          0    2011-07-13         Action In A          0
## 6   9944         20    2011-03-13 Mkinga Distric Coun          0
## 7  19816          0    2012-10-01                Dwsp          0
## 8  54551          0    2012-10-09               Rwssp          0
## 9  53934          0    2012-11-03            Wateraid          0
## 10 46144          0    2011-08-03         Isingiro Ho          0
##       installer longitude   latitude             wpt_name num_private
## 1         Roman  34.93809  -9.856322                 none           0
## 2       GRUMETI  34.69877  -2.147466             Zahanati           0
## 3  World vision  37.46066  -3.821329          Kwa Mahundi           0
## 4        UNICEF  38.48616 -11.155298 Zahanati Ya Nanyumbu           0
## 5       Artisan  31.13085  -1.825359              Shuleni           0
## 6           DWE  39.17280  -4.765587               Tajiri           0
## 7          DWSP  33.36241  -3.766365           Kwa Ngomho           0
## 8           DWE  32.62062  -4.226198         Tushirikiane           0
## 9     Water Aid  32.71110  -5.146712    Kwa Ramadhan Musa           0
## 10      Artisan  30.62699  -1.257051              Kwapeto           0
##                      basin      subvillage    region region_code
## 1               Lake Nyasa        Mnyusi B    Iringa          11
## 2            Lake Victoria         Nyamara      Mara          20
## 3                  Pangani         Majengo   Manyara          21
## 4  Ruvuma / Southern Coast      Mahakamani    Mtwara          90
## 5            Lake Victoria      Kyanyamisa    Kagera          18
## 6                  Pangani     Moa/Mwereme     Tanga           4
## 7                 Internal   Ishinabulandi Shinyanga          17
## 8          Lake Tanganyika Nyawishi Center Shinyanga          17
## 9          Lake Tanganyika      Imalauduki    Tabora          14
## 10           Lake Victoria        Mkonomre    Kagera          18
##    district_code             lga       ward population public_meeting
## 1              5          Ludewa   Mundindi        109           True
## 2              2       Serengeti      Natta        280               
## 3              4       Simanjiro    Ngorika        250           True
## 4             63        Nanyumbu   Nanyumbu         58           True
## 5              1         Karagwe Nyakasimbi          0           True
## 6              8          Mkinga        Moa          1           True
## 7              3 Shinyanga Rural     Samuye          0           True
## 8              3          Kahama     Chambo          0           True
## 9              6    Tabora Urban   Itetemia          0           True
## 10             1         Karagwe     Kaisho          0           True
##                recorded_by scheme_management                 scheme_name
## 1  GeoData Consultants Ltd               VWC                       Roman
## 2  GeoData Consultants Ltd             Other                            
## 3  GeoData Consultants Ltd               VWC Nyumba ya mungu pipe scheme
## 4  GeoData Consultants Ltd               VWC                            
## 5  GeoData Consultants Ltd                                              
## 6  GeoData Consultants Ltd               VWC                   Zingibali
## 7  GeoData Consultants Ltd               VWC                            
## 8  GeoData Consultants Ltd                                              
## 9  GeoData Consultants Ltd               VWC                            
## 10 GeoData Consultants Ltd                                              
##    permit construction_year extraction_type extraction_type_group
## 1   False              1999         gravity               gravity
## 2    True              2010         gravity               gravity
## 3    True              2009         gravity               gravity
## 4    True              1986     submersible           submersible
## 5    True                 0         gravity               gravity
## 6    True              2009     submersible           submersible
## 7    True                 0          swn 80                swn 80
## 8    True                 0     nira/tanira           nira/tanira
## 9    True                 0   india mark ii         india mark ii
## 10   True                 0     nira/tanira           nira/tanira
##    extraction_type_class management management_group        payment
## 1                gravity        vwc       user-group   pay annually
## 2                gravity        wug       user-group      never pay
## 3                gravity        vwc       user-group pay per bucket
## 4            submersible        vwc       user-group      never pay
## 5                gravity      other            other      never pay
## 6            submersible        vwc       user-group pay per bucket
## 7               handpump        vwc       user-group      never pay
## 8               handpump        wug       user-group        unknown
## 9               handpump        vwc       user-group      never pay
## 10              handpump        vwc       user-group      never pay
##    payment_type water_quality quality_group     quantity quantity_group
## 1      annually          soft          good       enough         enough
## 2     never pay          soft          good insufficient   insufficient
## 3    per bucket          soft          good       enough         enough
## 4     never pay          soft          good          dry            dry
## 5     never pay          soft          good     seasonal       seasonal
## 6    per bucket         salty         salty       enough         enough
## 7     never pay          soft          good       enough         enough
## 8       unknown         milky         milky       enough         enough
## 9     never pay         salty         salty     seasonal       seasonal
## 10    never pay          soft          good       enough         enough
##                  source          source_type source_class
## 1                spring               spring  groundwater
## 2  rainwater harvesting rainwater harvesting      surface
## 3                   dam                  dam      surface
## 4           machine dbh             borehole  groundwater
## 5  rainwater harvesting rainwater harvesting      surface
## 6                 other                other      unknown
## 7           machine dbh             borehole  groundwater
## 8          shallow well         shallow well  groundwater
## 9           machine dbh             borehole  groundwater
## 10         shallow well         shallow well  groundwater
##                waterpoint_type waterpoint_type_group
## 1           communal standpipe    communal standpipe
## 2           communal standpipe    communal standpipe
## 3  communal standpipe multiple    communal standpipe
## 4  communal standpipe multiple    communal standpipe
## 5           communal standpipe    communal standpipe
## 6  communal standpipe multiple    communal standpipe
## 7                    hand pump             hand pump
## 8                    hand pump             hand pump
## 9                    hand pump             hand pump
## 10                   hand pump             hand pump

Hmm, we have some missing data here, such as in “scheme_name” and “scheme_management.” Let’s take a look at the amount of missing data in each feature.

Cleaning the Data

##                    id            amount_tsh         date_recorded 
##                     0                     0                     0 
##                funder            gps_height             installer 
##                     0                     0                     0 
##             longitude              latitude              wpt_name 
##                     0                     0                     0 
##           num_private                 basin            subvillage 
##                     0                     0                     0 
##                region           region_code         district_code 
##                     0                     0                     0 
##                   lga                  ward            population 
##                     0                     0                     0 
##        public_meeting           recorded_by     scheme_management 
##                     0                     0                     0 
##           scheme_name                permit     construction_year 
##                     0                     0                     0 
##       extraction_type extraction_type_group extraction_type_class 
##                     0                     0                     0 
##            management      management_group               payment 
##                     0                     0                     0 
##          payment_type         water_quality         quality_group 
##                     0                     0                     0 
##              quantity        quantity_group                source 
##                     0                     0                     0 
##           source_type          source_class       waterpoint_type 
##                     0                     0                     0 
## waterpoint_type_group 
##                     0

Okay…we don’t have any null values, but we know there are some missing values. We’ll have to look at the features one-by-one to find what’s not quite right.

First, let’s see what the status of the water pumps are, since that’s our main feature.

## 
##              functional functional needs repair          non functional 
##                   32259                    4317                   22824

It’s good to see that most of the wells are functioning, but the data is very unequally distributed.

Next, what are the top 20 subvillages.

## feature
##    Shuleni    Majengo   Madukani                  Kati    Mtakuja 
##        646        631        629        470        467        322 
##     Sokoni          M   Muungano    Mbuyuni Songambele    Mlimani 
##        294        243        215        210        194        190 
##   Miembeni  Msikitini          1   Kanisani    Kibaoni      Mjini 
##        168        164        157        146        143        131 
##   Mjimwema  Mapinduzi 
##        129        126

Hmm, looks like we have some interesting values here, including capital letters, empty strings, and 1. There are a few things we could do to try to correct these values, such as assign them the nearest subvillages based on latitude or longitude, however these features and many others are also missing values, so I am going to assign the value of ‘None’ to all of the subvillages.

## [1] "Number of subvillages with the value 'None'"
## 
## FALSE  TRUE 
## 72954  1296

Below are the region codes, but I am going to remove this feature because we will use dummy variables to represent the regions. You might be thinking, why not keep this feature and remove the regions’ names to save a step. To help to ensure that the data is correct and make feature engineering easier, I think that it is best to keep the regions’ names instead.

## 
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 2779 3709 5494 3145 5079 2032 1020  375  499 3306 6608 5759 2536 2488 2242 
##   16   17   18   19   20   21   24   40   60   80   90   99 
## 3533 6334 4183 3824 2451 1972  402    1 1298 1536 1133  512

Below we have the district codes. There are 21 districts and 20 regions. I’m not sure how accurate the district codes are or how they work because when I try to find the average district code for subvillages, lgas, wards, or regions, there are never still 21 districts. I’m going to drop this feature because of its unreliability and there should be enough geographic features already.

## [1] "The district codes:"
## 
##     0     1     2     3     4     5     6     7     8    13    23    30 
##    27 15299 13929 12521 11253  5428  5108  4166  1282   496   360  1256 
##    33    43    53    60    62    63    67    80 
##  1063   653   921    76   127   264     8    13

Below we have the 20 most frequent wards, everything looks good here.

## feature
##         Igosi      Imalinyi     Siha Kati        Mdandu       Nduruma 
##           386           318           297           292           261 
##       Kitunda       Mishamo        Msindo      Chalinze  Maji ya Chai 
##           260           251           243           238           227 
##         Usuka       Chanika  Ngarenanyuki       Vikindu       Mtwango 
##           222           221           212           199           189 
## Zinga/Ikerege         Itete        Matola      Magomeni       Maramba 
##           185           179           177           170           167

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      0.0      0.0      0.0    318.7     20.0 350000.0
## 
##      0    0.2   0.25    0.5      1      2      3      5      6      7 
##  52049      4      1      1      3     18      1    471    231     87 
##      9     10     12     15     20     25     26     30     33     35 
##      1    995      1     17   1812    445      1    929     25     10 
##     40     50     53     59     60     70    100    150    200    220 
##     74   3103      1      1     19     30   1034     25   1516      1 
##    250    300    306    350    400    450    500    520    550    590 
##    731    688      1      4     55     59   3874      2      4      2 
##    600    700    750    800    900   1000   1200   1300   1400   1500 
##    219     31     71      2      1   1858    322     10      1    244 
##   2000   2200   2400   2500   2550   2800   3000   3500   3600   4000 
##    882     39    186    169      1     15    425      6     56    195 
##   4500   4700   5000   5400   5500   6000   6300   6500   7000   7200 
##      2     31    559      1      5    150      1     11      8     15 
##   7500   8000   8500   9000  10000  11000  12000  13000  14000  15000 
##     13     87      1      3     74      2     64      2      5     21 
##  16000  16300  18000  20000  25000  26000  30000  35000  38000  40000 
##      2      2      6     53     16      2     12      3      1      8 
##  45000  50000  60000  70000  1e+05 117000 120000 138000 170000  2e+05 
##      3      6      2      2      4      7      1      1      1      2 
## 250000 350000 
##      1      1

We have some very skewed data here. Let’s tranform amount_tsh by log10 to help make this plot a little clearer.

That’s much better. Now we can see the range of values better. I am going to create a new feature for this tranformation of amount_tsh.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.689   1.322   5.544

The dates are impossible to read, but they vary from 2002-10-14 to 2013-12-03, so we have quite the range here.

## 
## Government Of Tanzania                                        Danida 
##                  11299                   4504                   3907 
##                 Hesawa                  Rwssp             World Bank 
##                   2782                   1703                   1701 
##                   Kkkt           World Vision                 Unicef 
##                   1623                   1562                   1324 
##                  Tasaf                    Dhv       District Council 
##                   1136                   1065                   1037 
##     Private Individual                   Dwsp                      0 
##                   1034                   1012                    980 
##                  Norad        Germany Republi                  Water 
##                    949                    765                    739 
##                   Tcrs      Ministry Of Water 
##                    735                    728

Here we have the 20 most common funders, except the second key is missing and another is “0”. I’m going to set these to Unknown because I don’t think there will be an easy way to find their true values.

0 is a very common value for gps_height. I expect that this means we have many missing values. I am going to take an average of the subvillage values, where gps_height does not equal 0, and assign these values to the subvillages where the gps_height does equal 0.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     -90     424    1178    1025    1495    2777

That looks better. A few subvillages have gps_heights very close to 0 or even below, but this is possible because some locations can be below sea level.

## feature
##                DWE                            Government 
##              21751               4532               2282 
##                RWE              Commu             DANIDA 
##               1498               1347               1305 
##               KKKT             Hesawa                  0 
##               1120               1070                980 
##               TCRS                CES Central government 
##                887                765                764 
##              DANID          Community             HESAWA 
##                690                687                679 
##   District Council       World vision              TASAF 
##                663                517                504 
##                LGA             WEDECO 
##                501                496

Much like with funder, the installer feature has keys with values of 0 and nothing. Let’s also change these to “Unknown”.

## 
##           0 31.61952953 32.91986139 32.92057868 32.92488627 32.92601185 
##        2269           2           2           2           2           2 
## 32.93668943 32.95559708 32.95652279 32.96573445 32.96700926  32.9727187 
##           2           2           2           2           2           2 
##  32.9771906  32.9780624 32.98269806 32.98393225 32.98478963 32.98724945 
##           2           2           2           2           2           2 
## 32.98751118 32.98767048 
##           2           2

Quite a few values are set to 0 here. Just like with gps_height, I am going to find, then apply, the non-zero average of the subvillages, which will hence forth be referred to as the ‘non-zero average function.’

## [1] "Summary of longitude"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.61   33.24   35.00   35.13   37.22   40.35

As expected, latitude has the same problem as longitude and the non-zero average function will be applied here.

## 
##      -2e-08  -9.2893492 -8.90539462 -7.17908174  -7.1772029 -7.17715478 
##        2269           2           2           2           2           2 
## -7.17517443 -7.17160463 -7.17083546 -7.17066586 -7.16249333 -7.10591905 
##           2           2           2           2           2           2 
## -7.10492257 -7.10462503 -7.10374232 -7.10200368 -7.10013213 -7.09516233 
##           2           2           2           2           2           2 
## -7.09513079  -7.0939143 
##           2           2
## [1] "Summary of latitude"
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -11.6500  -8.5800  -5.1190  -5.8520  -3.3670  -0.9985

## 
##     0     1     2     3     4     5     6     7     8     9    10    11 
## 73299    94    31    38    30    60    92    31    58     4     5    18 
##    12    13    14    15    16    17    20    22    23    24    25    26 
##     7     1     3    47    13    24    14    11     2     5    14     3 
##    27    30    32    34    35    38    39    41    42    45    47    50 
##     2     1    45    13     1     4    34    12     1    41    27     7 
##    55    58    60    61    62    65    80    87    93    94   102   111 
##     1     4     1     3     1    27    17     1    37     1    28     1 
##   120   131   136   141   150   160   180   213   240   280   300   420 
##     9     1     1     2     6     1     3     1     1     1     1     1 
##   450   668   669   672   698   755  1402  1776 
##     1     1     1     1     1     1     1     1

I am not sure what num_private is, but most of the values are 0 and there are a few extreme outliers. I’m going to keep this feature as it should still be be useful in the feature engineering stage.

## feature
##           Lake Victoria                 Pangani                  Rufiji 
##                   12871                   11143                    9987 
##                Internal         Lake Tanganyika             Wami / Ruvu 
##                    9642                    8052                    7577 
##              Lake Nyasa Ruvuma / Southern Coast              Lake Rukwa 
##                    6332                    5587                    3059

Everything looks good with the values of basin.

Looks like we have some big outliers and many of the values are very close to 0. Let’s take a closer look.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0    25.0   180.8   215.0 30500.0
## [1] "Number of values that are <= 1:"
## 
## FALSE  TRUE 
## 38634 35616

Many of the values are <= 1. I don’t believe this is the case, so we are going to use the non-zero average function, but using 1 instead of 0.

## [1] "public_meeting values"
## 
##       False  True 
##  4155  6346 63749

Since the number of True values outnumber False 10:1, I’m going to assign all of the empty strings to True.

## [1] "recorded_by values:"
## 
## GeoData Consultants Ltd 
##                   74250

Since it’s always the same values, I’m going to drop this feature.

## [1] "scheme_management values"
## 
##                           Company             None            Other 
##             4846             1341                1              996 
##       Parastatal Private operator              SWC            Trust 
##             2124             1326              123               92 
##              VWC  Water authority      Water Board              WUA 
##            45917             3975             3462             3551 
##              WUG 
##             6496

Let’s set all of the empty strings to None.

## [1] "Top 20 most frequent scheme_name values"
## feature
##                                       
##                                 35258 
##                                     K 
##                                   858 
##                                  None 
##                                   794 
##                              Borehole 
##                                   704 
##                         Chalinze wate 
##                                   501 
##                                     M 
##                                   490 
##                                DANIDA 
##                                   483 
##                            Government 
##                                   395 
##           Ngana water supplied scheme 
##                                   335 
##           wanging'ombe water supply s 
##                                   323 
##                         Bagamoyo wate 
##                                   296 
##            wanging'ombe supply scheme 
##                                   284 
##                                     I 
##                                   281 
##           Uroki-Bomang'ombe water sup 
##                                   266 
##                                     N 
##                                   258 
## Kirua kahe gravity water supply trust 
##                                   237 
##             Machumba estate pipe line 
##                                   225 
##           Makwale water supplied sche 
##                                   209 
##                                Kijiji 
##                                   205 
##                  Handeni Trunk Main(H 
##                                   204

Looks like we have the same problem as with subvillage values. We’ll set all of these unlikely values to None.

## [1] "permit values"
## 
##       False  True 
##  3793 21851 48606

I’m going to change the value of True to 1 and False to 0, then find the average value for each subvillage (very similar to the non-zero average function). I will then apply these values to the empty strings.

## 
##     0     1 
## 22971 51279

Some very obvious incorrect values here. I’m going to assume that there is a strong correlation between proximity to another well and year built, i.e. wells in the same subvillage have the same (or very similar) construction year. Therefore, we are going to assign the missing values the average construction year in the same subvillage.

All of the tables below have values that look reasonable.

## [1] "extraction_type"
## 
##                   afridev                      cemo 
##                      2208                       108 
##                    climax                   gravity 
##                        41                     33263 
##             india mark ii            india mark iii 
##                      3029                       135 
##                       ksb                      mono 
##                      1790                      3628 
##               nira/tanira                     other 
##                     10205                      8102 
## other - mkulima/shinyanga         other - play pump 
##                         2                       101 
##         other - rope pump            other - swn 81 
##                       572                       284 
##               submersible                    swn 80 
##                      5982                      4588 
##                    walimi                  windmill 
##                        60                       152
## [1] "extraction_type_group"
## 
##         afridev         gravity   india mark ii  india mark iii 
##            2208           33263            3029             135 
##            mono     nira/tanira           other  other handpump 
##            3628           10205            8102             447 
## other motorpump       rope pump     submersible          swn 80 
##             149             572            7772            4588 
##    wind-powered 
##             152
## [1] "extraction_type_class"
## 
##      gravity     handpump    motorpump        other    rope pump 
##        33263        20612         3777         8102          572 
##  submersible wind-powered 
##         7772          152
## [1] "management"
## 
##          company            other   other - school       parastatal 
##              859             1083              126             2229 
## private operator            trust          unknown              vwc 
##             2504              105              683            50624 
##  water authority      water board              wua              wug 
##             1123             3688             3118             8108
## [1] "management_group"
## 
## commercial      other parastatal    unknown user-group 
##       4591       1209       2229        683      65538
## [1] "payment"
## 
##             never pay                 other          pay annually 
##                 31712                  1314                  4570 
##           pay monthly        pay per bucket pay when scheme fails 
##                 10397                 11266                  4842 
##               unknown 
##                 10149
## [1] "payment_type"
## 
##   annually    monthly  never pay on failure      other per bucket 
##       4570      10397      31712       4842       1314      11266 
##    unknown 
##      10149
## [1] "water_quality"
## 
##           coloured           fluoride fluoride abandoned 
##                623                244                 23 
##              milky              salty    salty abandoned 
##               1005               6082                423 
##               soft            unknown 
##              63505               2345
## [1] "quality_group"
## 
##  colored fluoride     good    milky    salty  unknown 
##      623      267    63505     1005     6505     2345
## [1] "quantity"
## 
##          dry       enough insufficient     seasonal      unknown 
##         7782        41522        18896         5075          975
## [1] "quantity_group"
## 
##          dry       enough insufficient     seasonal      unknown 
##         7782        41522        18896         5075          975
## [1] "Quantity group is the same as quantity, so we're going to drop it."
## [1] "source"
## 
##                  dam             hand dtw                 lake 
##                  840                 1108                  950 
##          machine dbh                other rainwater harvesting 
##                13822                  261                 2863 
##                river         shallow well               spring 
##                11964                21140                21216 
##              unknown 
##                   86
## [1] "source_type"
## 
##             borehole                  dam                other 
##                14930                  840                  347 
## rainwater harvesting           river/lake         shallow well 
##                 2863                12914                21140 
##               spring 
##                21216
## [1] "source_class"
## 
## groundwater     surface     unknown 
##       57286       16617         347
## [1] "waterpoint_type"
## 
##               cattle trough          communal standpipe 
##                         150                       35628 
## communal standpipe multiple                         dam 
##                        7611                           8 
##                   hand pump             improved spring 
##                       21884                         959 
##                       other 
##                        8010
## [1] "waterpoint_type_group"
## 
##      cattle trough communal standpipe                dam 
##                150              43239                  8 
##          hand pump    improved spring              other 
##              21884                959               8010

Alright, the features have all been cleaned, now we can seperate the train and test sets to see how the features compare to the status of the wells.

We have some outliers, let’s take a closer look by transforming amount_tsh by log10.

## training$status: functional
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      0.0      0.0      0.0    461.8     50.0 350000.0 
## -------------------------------------------------------- 
## training$status: functional needs repair
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      0.0      0.0      0.0    267.1     50.0 100000.0 
## -------------------------------------------------------- 
## training$status: non functional
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      0.0      0.0      0.0    123.5      0.0 117000.0

We can see that there is typically more water at functioning wells, than nonfunctional. There are also quite a few extreme outliers for functional wells.

## training$status: functional
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -90.0   496.7  1235.0  1082.0  1546.0  2628.0 
## -------------------------------------------------------- 
## training$status: functional needs repair
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     -51     572    1170    1049    1450    2770 
## -------------------------------------------------------- 
## training$status: non functional
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -59.0   355.4  1099.0   953.7  1432.0  2774.0

(Note: The three horizontal lines per graph represent the 75th, 50th, and 25th quantiles.) The values looks to be rather evenly distributed, even in terms of outliers.

## training$status: functional
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    0.000    0.000    0.000    0.539    0.000 1402.000 
## -------------------------------------------------------- 
## training$status: functional needs repair
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.0000   0.0000   0.3076   0.0000 160.0000 
## -------------------------------------------------------- 
## training$status: non functional
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    0.000    0.000    0.000    0.414    0.000 1776.000
## 
## FALSE  TRUE 
##   757 58643

Most of the values are 0 (58643 out of 59400), so it’s not too surprising to see the violin plots we have here.

## training$status: functional
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0   100.0   216.2   365.9   400.0 30500.0 
## -------------------------------------------------------- 
## training$status: functional needs repair
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     5.0   130.0   250.0   377.1   450.0 15300.0 
## -------------------------------------------------------- 
## training$status: non functional
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0   130.0   250.0   383.8   417.2 15300.0

The distributions look very similar here as well. The main differences that I see are the outliers in the functional wells and the higher number of 0 values in the non functional wells.

## 
## FALSE 
## 59400

## training$status: functional
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1960    1993    2002    1999    2008    2013 
## -------------------------------------------------------- 
## training$status: functional needs repair
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1960    1986    1999    1996    2006    2013 
## -------------------------------------------------------- 
## training$status: non functional
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1960    1984    1996    1994    2005    2013

It looks pretty clear that newer wells are more likely to be functional than older wells and that many wells were built during 2007/2008.

## training$status: functional
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.61   33.48   35.05   35.16   37.08   40.35 
## -------------------------------------------------------- 
## training$status: functional needs repair
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.64   31.81   34.11   34.29   36.80   40.28 
## -------------------------------------------------------- 
## training$status: non functional
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.61   33.06   35.01   35.21   37.58   40.34

Functional wells look to have a more central longitude. Extreme longitudes seem to be more common with wells that need repair or are non functional.

## training$status: functional
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -11.560  -8.681  -5.010  -5.836  -3.352  -0.999 
## -------------------------------------------------------- 
## training$status: functional needs repair
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -11.650  -8.332  -4.730  -5.577  -3.369  -1.003 
## -------------------------------------------------------- 
## training$status: non functional
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -11.5900  -8.5410  -5.5110  -5.9250  -3.4230  -0.9985

The main thing I see here is that there are disproportionately more well just below -3 and -9 degrees latitude.

Feature Engineering

Let’s rejoin the training and testing test to do some feature engineering (unfortunately this can only be seen if you look at my .RMD file).

Build the Model

To run the model properly, use cross validation and 10 folds, but to knit the file quickly to HTML, I changed these values.

## + Resample1: mtry=15 
## Growing trees.. Progress: 33%. Estimated remaining time: 1 minute, 1 seconds.
## Growing trees.. Progress: 66%. Estimated remaining time: 31 seconds.
## - Resample1: mtry=15 
## Aggregating results
## Fitting final model on full training set
## Growing trees.. Progress: 34%. Estimated remaining time: 1 minute, 1 seconds.
## Growing trees.. Progress: 65%. Estimated remaining time: 33 seconds.
## Growing trees.. Progress: 97%. Estimated remaining time: 2 seconds.

Predicting with the first validation set:

## Confusion Matrix and Statistics
## 
##                          Reference
## Prediction                functional functional needs repair
##   functional                    3903                     319
##   functional needs repair         76                     174
##   non functional                 344                      86
##                          Reference
## Prediction                non functional
##   functional                         670
##   functional needs repair             44
##   non functional                    2344
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8067          
##                  95% CI : (0.7978, 0.8153)
##     No Information Rate : 0.5431          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.635           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: functional Class: functional needs repair
## Sensitivity                     0.9028                        0.30052
## Specificity                     0.7281                        0.98374
## Pos Pred Value                  0.7978                        0.59184
## Neg Pred Value                  0.8631                        0.94717
## Prevalence                      0.5431                        0.07274
## Detection Rate                  0.4903                        0.02186
## Detection Prevalence            0.6146                        0.03693
## Balanced Accuracy               0.8155                        0.64213
##                      Class: non functional
## Sensitivity                         0.7665
## Specificity                         0.9123
## Pos Pred Value                      0.8450
## Neg Pred Value                      0.8623
## Prevalence                          0.3842
## Detection Rate                      0.2945
## Detection Prevalence                0.3485
## Balanced Accuracy                   0.8394

Predicting with the final validating set:

## Confusion Matrix and Statistics
## 
##                          Reference
## Prediction                functional functional needs repair
##   functional                    1968                     154
##   functional needs repair         24                      92
##   non functional                 136                      38
##                          Reference
## Prediction                non functional
##   functional                         345
##   functional needs repair             18
##   non functional                    1143
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8175          
##                  95% CI : (0.8051, 0.8295)
##     No Information Rate : 0.5431          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6533          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: functional Class: functional needs repair
## Sensitivity                     0.9248                        0.32394
## Specificity                     0.7212                        0.98844
## Pos Pred Value                  0.7977                        0.68657
## Neg Pred Value                  0.8897                        0.94926
## Prevalence                      0.5431                        0.07249
## Detection Rate                  0.5023                        0.02348
## Detection Prevalence            0.6297                        0.03420
## Balanced Accuracy               0.8230                        0.65619
##                      Class: non functional
## Sensitivity                         0.7590
## Specificity                         0.9279
## Pos Pred Value                      0.8679
## Neg Pred Value                      0.8604
## Prevalence                          0.3844
## Detection Rate                      0.2917
## Detection Prevalence                0.3361
## Balanced Accuracy                   0.8434

Summary

The goal of this model is to predict which wells are functioning, need repair, or are not working. The data was provided by DrivenData as part of their ‘Pump it Up: Data Mining the Water Table’ competition. I believe that my model does as rather good job of performing this task as it currently rank 65th out of 2567 users. My score (based on classification accuracy) is 0.8213, and the top score is 0.8285, so I’m not off the mark by too much. There are a few ways for how my score could be improved, including finding different values for the missing values when cleaning my data, engineering new features, using a different algorithm, using different features to train the model, and using an ensemble of models to make a prediction. After putting quite a few hours into build my model, I chose to stop fine-tuning my dataset and model because my time would be better spent working on different datasets/projects.

The five most important features for my model are: log_amount_tsh, construction_year, hasWater, extraction_type_groupsubmersible, and extraction_type_groupnira/tanira. From just these features we can learn that functional wells typically have more water at the source and are built more recently.

## [1] "Comparing status of wells with log_amount_tsh"
## finalTraining$status: functional
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.9017  1.7080  5.5440 
## -------------------------------------------------------- 
## finalTraining$status: functional needs repair
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.7238  1.7080  5.0000 
## -------------------------------------------------------- 
## finalTraining$status: non functional
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3796  0.0000  5.0680
## [1] "Comparing status of wells with construction_year"
## finalTraining$status: functional
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1960    1993    2002    1999    2008    2013 
## -------------------------------------------------------- 
## finalTraining$status: functional needs repair
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1960    1986    1999    1996    2006    2013 
## -------------------------------------------------------- 
## finalTraining$status: non functional
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1960    1984    1996    1994    2005    2013
## [1] "Comparing status of wells with hasWater"
## finalTraining$status: functional
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3898  1.0000  1.0000 
## -------------------------------------------------------- 
## finalTraining$status: functional needs repair
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.293   1.000   1.000 
## -------------------------------------------------------- 
## finalTraining$status: non functional
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1714  0.0000  1.0000

My predictions followed a similar distribution, with respect to the status of the wells, to the training dataset, but functioning wells were overrepresented and functional-needs-repair were underrepresented. I tried upsampling to counter the unequal distrbution, but it did not improve my score. This issue expresses how machine learning competition models differ from real world models. Rather than overall accuracy being most important, it would be more useful to know which wells are not working or need repair. Catering the model to predict these types of wells correctly could decrease the overall accuracy, but would improve the usefullness of the model in the real world.