This report is about the bike sharing services in San Francisco and Seattle. We will examine how these cities compare on features such as bike usage rates, weather, duration of trips, traffic at stations, types of users, and a few others. The data we are going to use is from Kaggle: https://www.kaggle.com/benhamner/sf-bay-area-bike-share for San Francisco, and https://www.kaggle.com/pronto/cycle-share-dataset for Seattle. I thought we could learn much more by looking at these two datasets side-by-side than separately. So here we go!

Size of Services

The number of stations is somewhat similar between the two cities; SF has 16 more stations than Seattle, or 30% more.

There is a bit more of a difference here: SF has 185 more bikes than Seattle, or 39% more. This equates to 9.48 bikes/station for SF and 8.87 bikes/station for Seattle.

Note: Due to the difference in size of these services, most of the metrics used will be relative rather than absolute, to allow for a more equal comparison.
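
To make the relative comparison concrete, here is a minimal sketch in Python of the arithmetic behind the size figures. The station and bike counts are not printed in this report; the values below are assumptions, back-calculated from the trips/station and trips/bike figures in the Types of Users section, so the results may differ from the reported numbers in the last decimal place.

```python
# Assumed counts (NOT stated in the report): back-calculated from the
# trips/station and trips/bike figures, so treat them as approximate.
sf = {"stations": 70, "bikes": 664}
seattle = {"stations": 54, "bikes": 479}


def pct_more(a, b):
    """Return how much larger a is than b, as a percentage of b."""
    return 100 * (a - b) / b


for key in ("stations", "bikes"):
    diff = sf[key] - seattle[key]
    pct = pct_more(sf[key], seattle[key])
    print(f"SF has {diff} more {key} than Seattle ({pct:.0f}% more)")

for name, city in (("SF", sf), ("Seattle", seattle)):
    print(f"{name}: {city['bikes'] / city['stations']:.2f} bikes/station")
```

Dividing by the Seattle value makes Seattle the baseline, which is what "30% more stations" implies.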

Bike Usage Rates

## [1] "Bike usage rates in San Francisco:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1114  0.3215  0.5346  0.4597  0.5557  0.6114
## [1] "Bike usage rate in Seattle:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.05428 0.32150 0.41130 0.40770 0.52190 0.62000

Here we are looking at what percentage of all bikes are used, at least once, in a given day. You could think of this as an efficiency metric.
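
As a rough illustration of this metric, the sketch below (in Python) computes a daily usage rate from trip records. The `(trip_date, bike_id)` layout, the function name, and the toy data are all hypothetical; the actual datasets use their own column names.

```python
from collections import defaultdict
from datetime import date


def daily_usage_rate(trips, fleet_size):
    """Map each date to the share of the fleet ridden at least once that day.

    trips: iterable of (trip_date, bike_id) pairs (hypothetical layout).
    fleet_size: total number of bikes in the service.
    """
    bikes_seen = defaultdict(set)
    for trip_date, bike_id in trips:
        # A set counts each bike once per day, however many trips it made.
        bikes_seen[trip_date].add(bike_id)
    return {d: len(ids) / fleet_size for d, ids in bikes_seen.items()}


# Toy data: a fleet of 4 bikes; bike 1 is ridden twice on the first day.
trips = [
    (date(2015, 6, 1), 1),
    (date(2015, 6, 1), 1),
    (date(2015, 6, 1), 2),
    (date(2015, 6, 2), 3),
]
rates = daily_usage_rate(trips, fleet_size=4)
```

Because repeat trips on the same bike do not raise the rate, this measures how much of the fleet is in use, not how many trips are taken.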

Bike usage rates are typically higher in SF: the median rate is 12 points higher (53% vs 41%), but the means are closer, with a difference of only 5 points (46% vs 41%). It’s neat to see that the highest rates are very close (61-62%), but the lowest rates are further apart (11% vs 5%). SF has a large gap in data points between 40% and 50% bike usage; hopefully we can find what is causing this split.

## [1] "Summary of temperatures in SF (in Fahrenheit):"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   41.00   56.00   59.00   59.64   65.00   77.00
## [1] "Summary of temperatures in Seattle (in Fahrenheit):"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   33.00   49.00   56.00   57.33   66.00   83.00

Bike usage increases with a rise in temperature, but as mean temperatures increase past 70 degrees, I am not certain whether the maximum bike usage rate has been reached. Perhaps the weather is becoming too warm for riding a bike, or we do not have enough data points to see the true trend line.

## [1] "Summary of bike usage rates in SF, by weather type:"
## bike_rateSF$Weather[bike_rateSF$Var2 == "TRUE"]: Cold
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.14   30.87   52.48   43.90   54.71   60.39 
## -------------------------------------------------------- 
## bike_rateSF$Weather[bike_rateSF$Var2 == "TRUE"]: Warm
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20.63   34.19   54.37   47.81   56.40   61.14
## [1] "Summary of bike usage rates in Seattle, by weather type:"
## bike_rateS$Weather[bike_rateS$Var2 == "TRUE"]: Cold
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.428  26.770  33.190  32.070  38.360  51.980 
## -------------------------------------------------------- 
## bike_rateS$Weather[bike_rateS$Var2 == "TRUE"]: Warm
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.75   44.26   51.36   48.70   55.32   62.00

I was wondering if simplifying mean temperature into two categories (warm [above the median mean temperature] and cold [below the median mean temperature]) would help us to further understand the relationship between mean temperature and bike usage. The relationship is weaker in SF, and we have not explained the gap (between 40% and 50%) in data points. In Seattle, the relationship is stronger. The median bike usage rate for warm weather is 18 points higher than for cold weather (51% vs 33%).
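
The warm/cold split can be sketched as follows in Python. The `(mean_temp_f, usage_rate)` layout and the toy values are hypothetical, and placing days exactly at the median in the "Cold" group is my assumption, since the text above does not say how ties were handled.

```python
from statistics import median


def median_rate_by_weather(days):
    """Split days at the median mean temperature and return group medians.

    days: list of (mean_temp_f, usage_rate) pairs, one per day
    (hypothetical layout).
    """
    cutoff = median(temp for temp, _ in days)
    groups = {"Warm": [], "Cold": []}
    for temp, rate in days:
        # Assumption: a day exactly at the cutoff counts as "Cold".
        label = "Warm" if temp > cutoff else "Cold"
        groups[label].append(rate)
    medians = {label: median(r) for label, r in groups.items() if r}
    return cutoff, medians


# Toy data only; these are not values from the datasets.
days = [(45, 30.0), (50, 34.0), (56, 40.0), (64, 50.0), (70, 54.0)]
cutoff, medians = median_rate_by_weather(days)
gap_in_points = medians["Warm"] - medians["Cold"]
```

The gap between the two group medians is the "points" difference quoted above for each city.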

## [1] "Summary of bike usage rates in SF, by amount of rain:"
## bike_rateSF$Rain[bike_rateSF$Var2 == "TRUE"]: No Rain
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.81   32.23   53.92   46.67   55.57   61.14 
## -------------------------------------------------------- 
## bike_rateSF$Rain[bike_rateSF$Var2 == "TRUE"]: Rain
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.14   31.29   50.30   43.60   54.67   59.49
## [1] "Summary of bike usage rates in Seattle, by amount of rain:"
## bike_rateS$Rain[bike_rateS$Var2 == "TRUE"]: No Rain
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.58   37.58   48.33   45.77   54.49   62.00 
## -------------------------------------------------------- 
## bike_rateS$Rain[bike_rateS$Var2 == "TRUE"]: Rain
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.428  26.300  33.190  33.060  40.290  55.950

Looking at the mean and median values, I’d say that rain is a better indicator of bike usage in SF than mean temperature, but we still cannot explain the gap in data points. Although the relationship is stronger in Seattle, relative to SF, rain does not affect bike usage as much as mean temperature. The difference in median values is only 15 points, compared to 18 for mean temperature.

## [1] "Summary of bike usage rates in SF, by type of day:"
## bike_rateSF$Is_Weekend[bike_rateSF$Var2 == "TRUE"]: Weekday
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.14   53.01   54.67   52.60   56.33   61.14 
## -------------------------------------------------------- 
## bike_rateSF$Is_Weekend[bike_rateSF$Var2 == "TRUE"]: Weekend
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.35   26.20   30.27   29.30   31.81   38.86
## [1] "Summary of bike usage rates in Seattle, by type of day:"
## bike_rateS$Is_Weekend[bike_rateS$Var2 == "TRUE"]: Weekday
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.02   33.40   41.34   41.92   52.71   61.59 
## -------------------------------------------------------- 
## bike_rateS$Is_Weekend[bike_rateS$Var2 == "TRUE"]: Weekend
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.428  27.920  41.020  37.900  49.530  62.000

We have found the answer! On weekends in SF, bike usage drops significantly. In addition, some of the lower weekday values are the result of holidays (Christmas, New Year’s Day). Interestingly, the relationship is very weak in Seattle. This could suggest that more tourists are using the service on weekends in Seattle, but not as many in SF.
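
A minimal Python sketch of the weekday/weekend labelling shows why holidays end up among the weekday values. The function name is hypothetical, and the labels simply mirror the ones in the output above.

```python
from datetime import date


def day_type(d):
    """Return 'Weekend' for Saturday or Sunday, otherwise 'Weekday'."""
    # date.weekday() numbers Monday as 0 and Sunday as 6.
    return "Weekend" if d.weekday() >= 5 else "Weekday"


christmas_2014 = day_type(date(2014, 12, 25))  # fell on a Thursday
a_saturday = day_type(date(2015, 6, 6))
```

A purely calendar-based label like this treats Christmas 2014 as an ordinary weekday, so its low usage shows up as a weekday outlier unless holidays are handled separately.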

The SF plots will be slightly different than the Seattle plots. I am comparing the features that have the greatest effect on bike usage to really see the variations. As expected, the type of day had a greater effect than rain, but there are still noticeable differences in all four boxes.

A warm, dry weekday in Seattle has consistently high bike usage, close to 60%. I also feel confident that if I visit Seattle in the summer I will have some warm weather. If the weather is either cold or wet, bike usage drops anywhere from 10-20%, but due to their correlation with each other, a wet and cold day in Seattle will have bike usage drop between 15-25% (still rather significant).

Using a bike sharing service as a method of commuting is much more common in SF than in Seattle. 33% of all rides happen between 8:00-9:00am and 5:00-6:00pm in SF, but only 20% in Seattle. Midday ridership is about 50% more popular in Seattle than in SF; this might have something to do with the weekday/weekend trends that we saw earlier.
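
The commute share can be sketched like this in Python, assuming each trip is bucketed by its start hour (0-23), so that hour 8 covers 8:00-8:59am and hour 17 covers 5:00-5:59pm. The function name and the toy hours are hypothetical.

```python
def commute_share(start_hours, commute_hours=(8, 17)):
    """Return the fraction of trips that start in one of the commute hours.

    start_hours: list of trip start hours, 0-23 (hypothetical layout).
    """
    if not start_hours:
        return 0.0
    in_window = sum(1 for hour in start_hours if hour in commute_hours)
    return in_window / len(start_hours)


# Toy data only: 6 of these 10 trips start at 8am or 5pm.
toy_hours = [8, 8, 9, 12, 17, 17, 20, 8, 13, 17]
share = commute_share(toy_hours)
```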

Hmm, that doesn’t explain the difference in usage patterns between SF and Seattle. The weekend habits are very similar except, in SF, it is slightly more common to use the bikes earlier in the day.

Breaking things down by the day and hour, we can see that usage is very similar between the two cities during the week. In both cities, Tuesday is the most common day for riding a bike to work, while Monday is the most common for riding a bike home. I’m guessing that Friday is a reasonably common day to either work from home or take off from work, because of the decrease in usage. Interestingly, we can see that it is much more common to use the bikes on weekends in Seattle during the afternoon. A Saturday afternoon’s ridership is very similar to that of a weekday morning commute. This helps to explain the difference in usage patterns between the two cities, but I bet we will have an even clearer picture when we look at the type of users (subscriber vs customer) later.

## [1] "Summary of trip durations in SF:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.617   8.367   9.805  11.930  63.880
## [1] "Summary of trip durations in Seattle:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   6.322  10.050  13.340  16.900  63.880

The top five percent of data points were removed from these plots to better see the data. According to the data, someone once used a bike for 200 straight days in SF; I can’t imagine what the bill was at the end of that period. Nonetheless, shorter rides are more common in SF, as the median and mean values are a few minutes less than in Seattle.
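
One simple way to drop the top five percent is sketched below in Python. This is only one reading of the trimming step, not necessarily how it was done for the plots; the function name and toy durations are hypothetical.

```python
def trim_top(values, percent=5):
    """Return the values sorted, with the largest `percent` percent removed."""
    ordered = sorted(values)
    # Integer arithmetic avoids floating point surprises in the cutoff.
    n_drop = len(ordered) * percent // 100
    return ordered[: len(ordered) - n_drop]


# Toy durations in minutes: 19 ordinary trips plus one extreme outlier
# (200 days is 288,000 minutes).
durations = list(range(1, 20)) + [200 * 24 * 60]
trimmed = trim_top(durations)
```

With 20 toy values, exactly one (the 200-day trip) is removed, which keeps the remaining durations on a readable scale.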

I was wondering if there would be a time of day that would be most common for longer or shorter rides, but I cannot see any patterns. I expect that times with fewer data points are being affected by outliers, such as Wednesdays in SF at 9:00pm. Note: The tooltip is not providing the correct information; often it says the hour is 0 and the day is Sunday. If you’re reading this on Kaggle and know what’s up, please post a comment. Thanks for the help!

Stations

## [1] "Summary of trips to stations in SF:"
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0002385 0.0020870 0.0088670 0.0142900 0.0218100 0.0740500
## [1] "Summary of trips to stations in Seattle:"
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000909 0.0083990 0.0147200 0.0185200 0.0267100 0.0490900

Trips are more evenly distributed among the stations in Seattle than in SF. The most popular station in Seattle receives 4.9% of trips, compared to 7.4% in SF. The median value in Seattle is also closer to its mean, which indicates less skewed data.

## [1] "Summary of trips from stations in SF:"
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000274 0.001946 0.008872 0.014290 0.020190 0.097370
## [1] "Summary of trips from stations in Seattle:"
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 5.680e-06 1.127e-02 1.862e-02 1.852e-02 2.410e-02 4.326e-02

Rather unsurprisingly, trips from a station tell a similar story. The distribution is again more even in Seattle, and even more so than for trips to a station. The median and mean values for Seattle are nearly identical, which signals a roughly symmetric distribution with little skew.
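
For readers who want to reproduce this kind of summary, here is a small Python sketch that turns a list of trip stations into per-station shares and compares the mean and median. The station labels and counts are toy values, and the function name is hypothetical.

```python
from collections import Counter
from statistics import mean, median


def station_shares(stations):
    """Return each station's share of all trips.

    stations: list with one station name per trip (hypothetical layout).
    """
    counts = Counter(stations)
    total = sum(counts.values())
    return {station: n / total for station, n in counts.items()}


# Toy data only: station "A" dominates, so the shares are right-skewed.
toy_trips = ["A"] * 6 + ["B"] * 2 + ["C"] + ["D"]
shares = station_shares(toy_trips)
values = list(shares.values())
is_right_skewed = mean(values) > median(values)
```

Since the shares always sum to 1, their mean is simply 1 divided by the number of stations. The median is the informative part: the further it sits below the mean, the more a few busy stations dominate.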

This should help to give a better view of the popularity of stations. Note: You can ignore the ‘0’ value in the tooltip; that is the point’s location on the x-axis, which is used to jitter the points.

## [1] "The five most common destination stations in SF:"
## [1] "San Francisco Caltrain (Townsend at 4th)"     
## [2] "San Francisco Caltrain 2 (330 Townsend)"      
## [3] "Harry Bridges Plaza (Ferry Building)"         
## [4] "Temporary Transbay Terminal (Howard at Beale)"
## [5] "Embarcadero at Sansome"
## [1] "The five most common origin stations in SF:"
## [1] "San Francisco Caltrain (Townsend at 4th)"
## [2] "San Francisco Caltrain 2 (330 Townsend)" 
## [3] "Harry Bridges Plaza (Ferry Building)"    
## [4] "Townsend at 7th"                         
## [5] "Embarcadero at Sansome"
## [1] "The five most common destination stations in Seattle:"
## [1] "2nd Ave & Pine St"               "Pier 69 / Alaskan Way & Clay St"
## [3] "PATH / 9th Ave & Westlake Ave"   "Westlake Ave & 6th Ave"         
## [5] "3rd Ave & Broad St"
## [1] "The five most common origin stations in Seattle:"
## [1] "Pier 69 / Alaskan Way & Clay St" "2nd Ave & Pine St"              
## [3] "3rd Ave & Broad St"              "E Pine St & 16th Ave"           
## [5] "Westlake Ave & 6th Ave"

I thought there would have been more variation in the names of the main destination and origin stations. Perhaps when we compare stations with weekdays and weekends, we will find some disparity.

## [1] "Summary of hourly trips to stations in SF:"
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.001934 0.010960 0.059520 0.068730 1.869000
## [1] "Summary of hourly trips to stations in Seattle:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00909 0.04091 0.07716 0.10750 0.59200

For both cities, the trips are more concentrated in a select few stations for the morning commute, compared to the evening commute. This makes some sense because many people can work in the same neighbourhood, without needing to live in the same neighbourhood. It’s interesting to see the decrease in the percent of total trips among the main commuting stations in SF, compared to Seattle. In the morning in SF, the busiest station receives 1.87% of all traffic, but in the afternoon, the rate drops to 0.72% for the busiest station. In Seattle, the busiest station in the morning receives 0.59% of all trips, but then drops to 0.48% of all trips in the afternoon. Again, we can see that trips are more evenly distributed in Seattle.

## [1] "Summary of hourly trips from stations in SF:"
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.001934 0.010960 0.059520 0.062850 1.970000
## [1] "Summary of hourly trips from stations in Seattle:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00909 0.04403 0.07716 0.11590 0.60910

I was wondering how much of an inverse these plots would be compared to the previous set. It seems that this is very much the case. In terms of percent of total trips and the specific stations, there are strong similarities.

## [1] "Summary of trips to stations in SF, by type of day:"
## weekday_to_stationSF$Var1: Weekday
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0174  0.1583  0.7439  1.2790  1.9630  7.0430 
## -------------------------------------------------------- 
## weekday_to_stationSF$Var1: Weekend
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.002579 0.029730 0.088150 0.150000 0.202500 0.841900
## [1] "Summary of trips to stations in Seattle, by type of day:"
## weekday_to_stationS$Var1: Weekday
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00909 0.58410 1.09800 1.36800 1.93000 3.66600 
## -------------------------------------------------------- 
## weekday_to_stationS$Var1: Weekend
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2444  0.4471  0.4841  0.6349  1.8240

There is a serious weekend effect in SF. During the week, the busiest station receives just over 7% of all trips, but on weekends, the busiest station receives less than 1% of all trips. This clearly shows that there is far less traffic on weekends in SF, and users are going to a wider variety of stations. In Seattle, the story is similar, but not as strong. As we know from before, there is more traffic in Seattle on weekends, and the trips are more concentrated to a smaller set of stations.

## [1] "Top five destination stations on weekdays in SF:"
## [1] "San Francisco Caltrain (Townsend at 4th)"     
## [2] "San Francisco Caltrain 2 (330 Townsend)"      
## [3] "Harry Bridges Plaza (Ferry Building)"         
## [4] "Temporary Transbay Terminal (Howard at Beale)"
## [5] "Townsend at 7th"
## [1] "Top five destination stations on weekends in SF:"
## [1] "Embarcadero at Sansome"              
## [2] "Harry Bridges Plaza (Ferry Building)"
## [3] "Market at 4th"                       
## [4] "Powell Street BART"                  
## [5] "2nd at Townsend"
## [1] "Top five destination stations on weekdays in Seattle:"
## [1] "2nd Ave & Pine St"               "PATH / 9th Ave & Westlake Ave"  
## [3] "Westlake Ave & 6th Ave"          "Pier 69 / Alaskan Way & Clay St"
## [5] "Pine St & 9th Ave"
## [1] "Top five destination stations on weekends in Seattle:"
## [1] "Pier 69 / Alaskan Way & Clay St"                     
## [2] "3rd Ave & Broad St"                                  
## [3] "2nd Ave & Pine St"                                   
## [4] "Seattle Aquarium / Alaskan Way S & Elliott Bay Trail"
## [5] "Occidental Park / Occidental Ave S & S Washington St"

Note: We could continue the analysis of stations to include ‘Ridership vs Origin Station vs Type of Day’, comparing weekday and weekend traffic by hour, etc. but to avoid this analysis becoming excessively long, I’m going to stop here. If you are viewing this report on Kaggle, feel free to fork this report, continue on with the analysis, and post it on the forum…I’d be happy to read it!

Types of Users

We already know that SF has a larger bike network, so it’s not very surprising to see that so many more trips are taken there than in Seattle: 310,256 vs 176,008 in this dataset. We also know that usage patterns are different in the two cities, so we could expect to see different user ratios. What has surprised me is how extreme these differences are. Nearly 88% of all trips are taken by subscribers in SF, compared to only 63% in Seattle. In terms of trips per type of asset, SF has 4,432 trips/station and 467 trips/bike; Seattle has 3,259 trips/station and 367 trips/bike.

Here the difference might be clearer. Subscribers in SF, during the week, account for 82% of all trips, and customers also use the service slightly more often during the week than on weekends. On weekends, customers and subscribers take about the same number of trips. In Seattle, weekday trips are more common amongst both types of users, but subscribers only take 53% of all trips during these five days. It’s neat to see that customers in Seattle use the service almost twice as often as subscribers on weekends.

## tripSF$User[tripSF$Duration <= 60]: Customer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   9.867  15.420  18.570  23.980  59.980 
## -------------------------------------------------------- 
## tripSF$User[tripSF$Duration <= 60]: Subscriber
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.400   7.950   8.747  11.120  59.980
## tripS$User[tripS$Duration <= 60]: Customer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.002  11.190  17.950  20.440  26.600  59.990 
## -------------------------------------------------------- 
## tripS$User[tripS$Duration <= 60]: Subscriber
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.477   8.064   9.515  11.870  59.960

I subsetted the data to only include trips lasting up to one hour, otherwise some very large outliers would have been included. Compared to the earlier duration stats we saw, we can now see that subscribers in both cities use the bikes for about the same duration, typically 8 minutes per trip. Customers use the bikes for longer, especially in Seattle, where the median time is 18 minutes, versus 15.4 in SF.
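
The subsetting step can be sketched in Python as follows. The `(user_type, duration_minutes)` layout, the function name, and the toy values are hypothetical; only the 60-minute cutoff comes from the output above.

```python
from statistics import median


def median_duration_by_user(trips, max_minutes=60):
    """Return the median trip duration per user type, ignoring long trips.

    trips: iterable of (user_type, duration_minutes) pairs
    (hypothetical layout).
    """
    by_user = {}
    for user, minutes in trips:
        if minutes <= max_minutes:  # keep trips of up to one hour
            by_user.setdefault(user, []).append(minutes)
    return {user: median(vals) for user, vals in by_user.items()}


# Toy data only; the 4,000 minute trip stands in for a large outlier.
toy_trips = [
    ("Subscriber", 7), ("Subscriber", 8), ("Subscriber", 9),
    ("Subscriber", 4000),
    ("Customer", 15), ("Customer", 18), ("Customer", 21),
]
medians_by_user = median_duration_by_user(toy_trips)
```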

Customers use the bikes at about the same times in each city, but midday trips are a little less common in SF amongst subscribers.

## [1] "Summary of trips to a station in SF, by type of user:"
## stations_typeSF$Var1: Customer
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.004512 0.036340 0.111800 0.174700 0.222300 1.228000 
## -------------------------------------------------------- 
## stations_typeSF$Var1: Subscriber
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01515 0.14630 0.70670 1.25400 1.97700 7.00000
## [1] "Summary of trips to a station in Seattle, by type of user:"
## stations_typeS$Var1: Customer
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.001704 0.297100 0.572400 0.679300 0.863700 3.222000 
## -------------------------------------------------------- 
## stations_typeS$Var1: Subscriber
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.007386 0.505200 0.916700 1.173000 1.641000 3.411000
## [1] "Top five destination stations for Customers in SF:"
## [1] "Embarcadero at Sansome"              
## [2] "Harry Bridges Plaza (Ferry Building)"
## [3] "Market at 4th"                       
## [4] "Powell Street BART"                  
## [5] "Powell at Post (Union Square)"
## [1] "Top five destination stations for Subscribers in SF:"
## [1] "San Francisco Caltrain (Townsend at 4th)"     
## [2] "San Francisco Caltrain 2 (330 Townsend)"      
## [3] "Temporary Transbay Terminal (Howard at Beale)"
## [4] "Townsend at 7th"                              
## [5] "Harry Bridges Plaza (Ferry Building)"
## [1] "Top five destination stations for Customers in Seattle:"
## [1] "Pier 69 / Alaskan Way & Clay St"                     
## [2] "Seattle Aquarium / Alaskan Way S & Elliott Bay Trail"
## [3] "3rd Ave & Broad St"                                  
## [4] "2nd Ave & Pine St"                                   
## [5] "Lake Union Park / Valley St & Boren Ave N"
## [1] "Top five destination stations for Subscribers in Seattle:"
## [1] "PATH / 9th Ave & Westlake Ave"  "2nd Ave & Pine St"             
## [3] "Pine St & 9th Ave"              "REI / Yale Ave N & John St"    
## [5] "Republican St & Westlake Ave N"

The key question here is whether the most common destination stations differ between the types of users. The answer is yes: in both cities, four of the top five destination stations are different between the two types of users.
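
The overlap can be checked directly from the lists printed above. This is a small Python sketch using those station names; the dictionary layout and the function name are my own.

```python
# Top five destination stations, copied from the output above.
top5 = {
    "SF": {
        "Customer": [
            "Embarcadero at Sansome",
            "Harry Bridges Plaza (Ferry Building)",
            "Market at 4th",
            "Powell Street BART",
            "Powell at Post (Union Square)",
        ],
        "Subscriber": [
            "San Francisco Caltrain (Townsend at 4th)",
            "San Francisco Caltrain 2 (330 Townsend)",
            "Temporary Transbay Terminal (Howard at Beale)",
            "Townsend at 7th",
            "Harry Bridges Plaza (Ferry Building)",
        ],
    },
    "Seattle": {
        "Customer": [
            "Pier 69 / Alaskan Way & Clay St",
            "Seattle Aquarium / Alaskan Way S & Elliott Bay Trail",
            "3rd Ave & Broad St",
            "2nd Ave & Pine St",
            "Lake Union Park / Valley St & Boren Ave N",
        ],
        "Subscriber": [
            "PATH / 9th Ave & Westlake Ave",
            "2nd Ave & Pine St",
            "Pine St & 9th Ave",
            "REI / Yale Ave N & John St",
            "Republican St & Westlake Ave N",
        ],
    },
}


def compare(city):
    """Return the shared stations and how many customer stations differ."""
    customer = set(top5[city]["Customer"])
    subscriber = set(top5[city]["Subscriber"])
    shared = customer & subscriber
    return shared, len(customer - shared)


for city in top5:
    shared, n_different = compare(city)
    print(f"{city}: {n_different} of 5 differ; shared: {sorted(shared)}")
```

In each city only one station appears in both lists.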

Summary

This analysis could be continued in many ways, such as how much the weather affects each type of customer, how a station’s usage rate changes over its lifetime, etc. I think we covered quite a bit of ground here. We now know that SF has a larger service than Seattle; Seattle uses the bikes relatively more on weekends; SF’s bike usage is less affected by the weather; Seattle’s traffic is more evenly distributed across the stations; and customers have similar usage patterns in both cities. I hope you enjoyed this analysis, learned something neat, and even want to continue it yourself! Thanks for reading!