The Goals: (1) To see if winners of 'Rookie of the Year' (RY) have above average careers and (2) To see if there is a trend of each winner of RY being better than the previous winner.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Step 1: Load the datasets.

#1.1
#A dataset which includes players who have won RY and other awards
player_awards_df = pd.read_csv("/Users/Dave/Desktop/Programming/Personal Projects/Baseball Analysis/baseballdatabank-master/core/AwardsPlayers.csv")

#1.2
#contains the career statisticss of batters in the MLB
batting_df = pd.read_csv("/Users/Dave/Desktop/Programming/Personal Projects/Baseball Analysis/baseballdatabank-master/core/batting.csv")

#1.3
#contains the career statistics of pitchers in the MLB
pitching_df = pd.read_csv("/Users/Dave/Desktop/Programming/Personal Projects/Baseball Analysis/baseballdatabank-master/core/Pitching.csv")

Step 2: Adjust batting_df and pitching_df to the analysis' timeframe.

#2.1
#The Rookie of the Year award was first awarded in 1947, 
#therefore only values from 1947 onwards should be included in the analysis.
batting_df = batting_df[batting_df.yearID >= 1947]
pitching_df = pitching_df[pitching_df.yearID >= 1947]

Step 3: Find who the RY winners are.

#3.1
#ry_winners: Create a dataframe of only the players who have won RY
ry_winners = player_awards_df.loc[player_awards_df['awardID'] == 'Rookie of the Year']

#3.2
#batting_rookies: Create a dataframe of batters who have won RY, 
#                (this will contain the batting statistics of some pitchers who have won RY)
batting_rookies = ry_winners.merge(batting_df, 
                                     how = 'left',
                                     on = 'playerID',
                                    )

#3.3
#pitching_rookies: Create a dataframe of pitchers who have won RY
pitching_rookies = ry_winners.merge(pitching_df,
                                     how = 'inner',
                                     on = 'playerID')

Step 4: Format the datasets to only include the important statistics.

#4.0
'''
batting_rookies: Players' IDs, Games Played, At Bats, Hits, Home Runs, Runs, Runs Batted In.
batting_df: same statistics as batting_rookies.
pitching_rookies: Players' IDs, Games Played, Wins, Outs Pitched, Batters Faced, Hits, Home Runs, 
                  Strike Outs, and Earned Runs.
pitching_df: same statistics as pitching_rookies.
'''
batting_rookies = batting_rookies[['playerID', 'G', 'AB', 'H', 'HR', 'R', 'RBI']]
batting_df = batting_df[['playerID', 'G', 'AB', 'H', 'HR', 'R', 'RBI']]
pitching_rookies = pitching_rookies[['playerID', 'G', 'W', 'IPouts', 'BFP', 'H', 'HR', 'SO', 'ER']]
pitching_df = pitching_df[['playerID', 'G', 'W', 'IPouts', 'BFP', 'H', 'HR', 'SO', 'ER']]

Step 5: Find any missing values in the dataframes.

#5.1
#Search through dataframes to find any missing values.
dataframes = [batting_rookies, batting_df, pitching_rookies, pitching_df]
for df in dataframes:
    print df.isnull().sum()

playerID     0
G            0
AB          51
H           51
HR          51
R           51
RBI         51
dtype: int64
playerID       0
G              0
AB          5149
H           5149
HR          5149
R           5149
RBI         5149
dtype: int64
playerID    0
G           0
W           0
IPouts      0
BFP         0
H           0
HR          0
SO          0
ER          0
dtype: int64
playerID    0
G           0
W           0
IPouts      1
BFP         0
H           0
HR          0
SO          0
ER          0
dtype: int64

#5.2.1
'''
Since there are so many values missing in most of the categories, I am going to remove them from the dataframes
as they would overweight the 'G' statistic or underweight the 'AB','H','HR','R','RBI' statistics. The analysis 
should be more accurate without these values.

Because all players that have null values in 'AB' also have null values in 'H','HR','R', and 'RBI', I only need
to select AB.notnull() to remove all the players with missing values.
'''
batting_rookies = batting_rookies[batting_rookies.AB.notnull()]
batting_df = batting_df[batting_df.AB.notnull()]

#5.2.2
#This player's statistics minimally affect the analysis, so I will remove it from the pitching dataframe to have a 
#complete set of data.
inds = pd.isnull(pitching_df).any(1).nonzero()[0]
print pitching_df.iloc[inds]
pitching_df = pitching_df[pitching_df.IPouts.notnull()]

        playerID  G  W  IPouts   BFP  H  HR  SO  ER
42959  leblawa01  1  0     NaN  28.0  6   2   4   4

#5.3
#Check to make sure that all the missing values are removed.
dataframes = [batting_rookies, batting_df, pitching_rookies, pitching_df]
for df in dataframes:
    print df.isnull().sum()

playerID    0
G           0
AB          0
H           0
HR          0
R           0
RBI         0
dtype: int64
playerID    0
G           0
AB          0
H           0
HR          0
R           0
RBI         0
dtype: int64
playerID    0
G           0
W           0
IPouts      0
BFP         0
H           0
HR          0
SO          0
ER          0
dtype: int64
playerID    0
G           0
W           0
IPouts      0
BFP         0
H           0
HR          0
SO          0
ER          0
dtype: int64

Step 6: Sum the career statistics of all players.

#6.0
#Don't set playerID to index, as it will need to be removed from dataframe later; 
#that is easier to do when playerID is not the index.
batting_rookies_career_totals = batting_rookies.groupby('playerID', as_index = False).sum()
pitching_rookies_career_totals = pitching_rookies.groupby('playerID', as_index = False).sum()
batting_df_career_totals = batting_df.groupby('playerID', as_index = False).sum()
pitching_df_career_totals = pitching_df.groupby('playerID', as_index = False).sum()

Step 7: Include statistics that cannot be summed.

#7.1
#Add Batting Average (AVG) to batters by dividing their Hits (H) by At Bats (AB).
batting_rookies_career_totals['AVG'] = (batting_rookies_career_totals['H'] / 
                                        batting_rookies_career_totals['AB'])

batting_df_career_totals['AVG'] = (batting_df_career_totals['H'] / 
                                   batting_df_career_totals['AB'])

#7.2
#Add Oppenent Batting Average (BAopp) to pitchers by dividing their Hits (H) by Batters Faced (BFP).
#Add Earned Run Average (ERA) to pitchers by dividing their Earned Runs (ER) by Outs Pitched (IPouts), 
#    divided by 27 (the number of outs in a full baseball game)
pitching_rookies_career_totals['BAopp'] = (pitching_rookies_career_totals['H'] /
                                          pitching_rookies_career_totals['BFP'])

pitching_rookies_career_totals['ERA'] = (pitching_rookies_career_totals['ER'] /
                                        (pitching_rookies_career_totals['IPouts'] / 27))

pitching_df_career_totals['BAopp'] = (pitching_df_career_totals['H'] /
                                          pitching_df_career_totals['BFP'])

pitching_df_career_totals['ERA'] = (pitching_df_career_totals['ER'] /
                                        (pitching_df_career_totals['IPouts'] / 27))

Step 8: Create new series for the overall means of each statistic in each dataframe.

#8.0
batting_rookies_overall_means = batting_rookies_career_totals.mean()
batting_df_overall_means = batting_df_career_totals.mean()

pitching_rookies_overall_means = pitching_rookies_career_totals.mean()
pitching_df_overall_means = pitching_df_career_totals.mean()

Step 9: Find the percentile that each RY winner statistic falls under.

#9.1
#Create the function 'percentile' that finds which percentile the rookie_mean is at in the league_totals.
def percentile(league_totals, rookie_mean):
    sorted_league = league_totals.sort_values() #sort league stats to get them in ascending order
    for value in sorted_league: 
        if value >= rookie_mean: #find the first row that is greater than the rookie mean
            i = 0.0
            for row in sorted_league:
                if row == value:
                    return i / league_totals.size #the percentile is the row/player that the rookie mean equals,
                i += 1                            #divided by the total number of rows/players.

#9.2
#Apply the percentile function on the batters' statistics via a loop through each batter statisitic.
batting_statistics = ['G','AB','H','HR','R','RBI','AVG']
for statistic in batting_statistics:
    print statistic + ' = ' + str(percentile(batting_df_career_totals[statistic], batting_rookies_overall_means[statistic]))

G = 0.925491644354
AB = 0.932139229988
H = 0.936940264057
HR = 0.95522112455
R = 0.942664573908
RBI = 0.946173021882
AVG = 0.583602622103

Follow up to Step 9.2: Compared to all other batters in the MLB, it appears that RY winners are excellent performers in all areas, but AVG, which they are slightly above average.

9.2.1: Plot the means of the statisitcs, to give a visual representation of the data and difference in values between RY winners and all other batters.

#9.2.1
'''
At Bats were intentionally left out, as its values would be far greater than those of the other statisitics and would
make the plot more difficult to view. Batting Average was also left out as its value would not be shown clearly, and
I will plot it below.
'''
batting_statistics_1 = pd.DataFrame({'RY Winners': batting_rookies_overall_means[['G', 'H','HR','R', 'RBI']],
                                     'All Batters': batting_df_overall_means[['G', 'H','HR','R', 'RBI']]},
                                     columns=['RY Winners','All Batters'])

batting_statistics_1.plot(kind='bar')
plt.show()

#9.3
#Apply the percentile function on the pitchers' statistics via a loop through each pitcher statistic.
pitching_statistics = ['G','W','IPouts','BFP','H','HR','SO','ER','BAopp','ERA']
for statistic in pitching_statistics:
    print statistic + ' = ' + str(percentile(pitching_df_career_totals[statistic], pitching_rookies_overall_means[statistic]))

G = 0.847070844687
W = 0.909400544959
IPouts = 0.898671662125
BFP = 0.895776566757
H = 0.890667574932
HR = 0.882152588556
SO = 0.91757493188
ER = 0.886069482289
BAopp = 0.26924386921
ERA = 0.541723433243

Follow up to step 9.3: I believe some of these figures are misleading. Since RY winners typically faced so many more batters than an average pitcher, it makes sense that they have also given up many more Hits, Home Runs, and Earned Runs, on average. In terms of positive statistics for RY winners, they have played more games, have more wins, and more strikeouts than most other pitchers.

9.3.1: Plot the means of the statisitcs, to give a visual representation of the data and difference in values between RY winners and all other pitchers.

#9.3.1.1
# Statistics are grouped by relatively similar values - first set.
pitching_stats_1 = pd.DataFrame({'RY Winners': pitching_rookies_overall_means[['G','W','HR','ER']],
                    'All Pitchers': pitching_df_overall_means[['G','W','HR','ER']]},
                     columns=['RY Winners','All Pitchers'])

pitching_stats_1.plot(kind='bar')
plt.show()

#9.3.1.2
# Statistics are grouped by relatively similar values - second set.
pitching_stats_2 = pd.DataFrame({'RY Winners': pitching_rookies_overall_means[['IPouts','BFP','H','SO']],
                    'All Pitchers': pitching_df_overall_means[['IPouts','BFP','H','SO']]},
                     columns=['RY Winners','All Pitchers'])

pitching_stats_2.plot(kind='bar')
plt.show()

Step 10: Inquiry about unexpected results.

#10.1
#Print the batting_rookies_career_totals to see if there are any unexpected players included.
#Set max_rows and max_columns to 999 to ensure all the data is visible.
with pd.option_context('display.max_rows', 999, 'display.max_columns', 999): 
    print batting_rookies_career_totals

      playerID     G       AB       H     HR       R     RBI       AVG
0    abreujo02   299   1169.0   354.0   66.0   168.0   208.0  0.302823
1     ageeto01  1129   3912.0   999.0  130.0   558.0   433.0  0.255368
2    allendi01  1749   6332.0  1848.0  351.0  1099.0  1119.0  0.291851
3    allisbo01  1541   5032.0  1281.0  256.0   811.0   796.0  0.254571
4    alomasa02  1377   4530.0  1236.0  112.0   520.0   588.0  0.272848
5    aparilu01  2601  10230.0  2677.0   83.0  1335.0   791.0  0.261681
6    bagweje01  2150   7797.0  2314.0  449.0  1517.0  1529.0  0.296781
7    bahnsst01   519    479.0    56.0    1.0    22.0    19.0  0.116910
8    bailean01   216      0.0     0.0    0.0     0.0     0.0       NaN
9      bayja01  1278   4505.0  1200.0  222.0   737.0   754.0  0.266371
10   beltrca01  2306   8749.0  2454.0  392.0  1449.0  1443.0  0.280489
11   benchjo01  2158   7658.0  2048.0  389.0  1091.0  1376.0  0.267433
12   berroan01   746   2575.0   665.0   46.0   329.0   254.0  0.258252
13   blackjo02   175     96.0    13.0    0.0     5.0     9.0  0.135417
14   blefacu01   974   2947.0   699.0  112.0   394.0   382.0  0.237190
15   braunry02  1219   4738.0  1441.0  255.0   799.0   846.0  0.304137
16   bryankr01   151    559.0   154.0   26.0    87.0    99.0  0.275492
17   bumbral01  1496   5053.0  1422.0   54.0   778.0   402.0  0.281417
18    byrdha01   187    262.0    42.0    0.0    16.0    14.0  0.160305
19   cansejo01  1887   7057.0  1877.0  462.0  1186.0  1407.0  0.265977
20   carewro01  2469   9315.0  3053.0   92.0  1424.0  1015.0  0.327751
21   castijo01   666   2320.0   646.0   41.0   293.0   249.0  0.278448
22   cepedor01  2124   7927.0  2351.0  379.0  1131.0  1365.0  0.296581
23   chambch01  2175   7571.0  2109.0  185.0   912.0   972.0  0.278563
24   charbjo01   201    647.0   172.0   29.0    97.0   114.0  0.265842
25   coghlch01   666   2244.0   602.0   46.0   311.0   199.0  0.268271
26   colemvi01  1371   5406.0  1425.0   28.0   849.0   346.0  0.263596
27   cordoma01   952   3419.0   938.0  122.0   480.0   540.0  0.274349
28   correca01    99    387.0   108.0   22.0    52.0    68.0  0.279070
29   crosbbo01   747   2559.0   605.0   62.0   329.0   276.0  0.236420
30    darkal01  1813   7206.0  2086.0  126.0  1064.0   756.0  0.289481
31   davisal01  1206   4240.0  1189.0  160.0   568.0   683.0  0.280425
32   dawsoan01  2627   9927.0  2774.0  438.0  1373.0  1591.0  0.279440
33   degroja01    54    105.0    21.0    0.0     6.0     6.0  0.200000
34   dropowa01  1288   4124.0  1113.0  152.0   478.0   704.0  0.269884
35   felizne01   246      2.0     0.0    0.0     0.0     0.0  0.000000
36   fernajo02    47     84.0    16.0    2.0     8.0     8.0  0.190476
37   fidryma01    10      0.0     0.0    0.0     0.0     0.0       NaN
38    fiskca01  2499   8756.0  2356.0  376.0  1276.0  1330.0  0.269073
39   furcara01  1614   6477.0  1817.0  113.0  1063.0   587.0  0.280531
40   garcino01  1434   5586.0  1747.0  229.0   927.0   936.0  0.312746
41   gilliji01  1956   7119.0  1889.0   65.0  1163.0   558.0  0.265346
42   goodedw01   405    741.0   145.0    8.0    60.0    67.0  0.195682
43   grievbe01   976   3215.0   864.0  118.0   471.0   492.0  0.268740
44   griffal01  1962   6780.0  1688.0   24.0   759.0   527.0  0.248968
45    grimbo01   268    189.0    24.0    3.0    12.0    21.0  0.126984
46   guilloz01  1993   6686.0  1764.0   28.0   773.0   619.0  0.263835
47   hamelbo01   497   1272.0   313.0   67.0   179.0   209.0  0.246069
48   hansero02  1384   4311.0  1007.0  106.0   446.0   501.0  0.233588
49   hargrmi01  1666   5564.0  1614.0   80.0   783.0   686.0  0.290079
50   harpebr03   510   1830.0   528.0   97.0   328.0   248.0  0.288525
51   hellije01   144     47.0    10.0    0.0     3.0     7.0  0.212766
52   helmsto01  1435   4997.0  1342.0   34.0   414.0   477.0  0.268561
53   hinsker01  1387   3797.0   947.0  137.0   549.0   522.0  0.249407
54   hollato01  1118   3191.0   871.0   98.0   451.0   401.0  0.272955
55   hornebo01  1020   3777.0  1047.0  218.0   560.0   685.0  0.277204
56   howarfr01  1895   6488.0  1774.0  382.0   864.0  1119.0  0.273428
57   howarry01  1460   5376.0  1410.0  357.0   813.0  1135.0  0.262277
58    howest01   282     27.0     2.0    0.0     2.0     0.0  0.074074
59   hubbske01   324   1255.0   310.0   14.0   148.0    98.0  0.247012
60   jennija01   234    329.0    68.0    2.0    18.0    26.0  0.206687
61   jeterde01  2747  11195.0  3465.0  260.0  1923.0  1311.0  0.309513
62   jethrsa01   442   1763.0   460.0   49.0   280.0   181.0  0.260919
63   justida01  1610   5625.0  1571.0  305.0   929.0  1017.0  0.279289
64   karroer01  1755   6441.0  1724.0  284.0   797.0  1027.0  0.267660
65   kimbrcr01   355      1.0     0.0    0.0     0.0     0.0  0.000000
66   kittlro01   843   2708.0   648.0  176.0   356.0   460.0  0.239291
67   knoblch01  1632   6366.0  1839.0   98.0  1132.0   615.0  0.288878
68   kubekto01  1092   4167.0  1109.0   57.0   522.0   373.0  0.266139
69   kuennha01  1833   6913.0  2092.0   87.0   951.0   671.0  0.302618
70   lefebji01   922   3014.0   756.0   74.0   313.0   404.0  0.250829
71   listapa01   503   1772.0   444.0    5.0   250.0   143.0  0.250564
72   longoev01  1119   4204.0  1138.0  205.0   628.0   708.0  0.270695
73    lynnfr01  1969   6925.0  1960.0  306.0  1063.0  1111.0  0.283032
74   matlajo01   204    441.0    57.0    0.0    32.0    23.0  0.129252
75   matthga01  2033   7147.0  2011.0  234.0  1083.0   978.0  0.281377
76    mayswi01  2992  10881.0  3283.0  660.0  2062.0  1903.0  0.301719
77   mcbriba01  1071   3853.0  1153.0   63.0   548.0   430.0  0.299247
78   mccovwi01  2588   8197.0  2211.0  521.0  1229.0  1555.0  0.269733
79   mcdougi01  1336   4676.0  1291.0  112.0   697.0   576.0  0.276091
80   mcgwima01  1874   6187.0  1626.0  583.0  1167.0  1414.0  0.262809
81   metzgbu01   191     15.0     0.0    0.0     1.0     0.0  0.000000
82   mondera01  1525   5814.0  1589.0  271.0   909.0   860.0  0.273306
83   montejo01   278    455.0    44.0    4.0    23.0    18.0  0.096703
84    moonwa01  1457   4843.0  1399.0  142.0   737.0   661.0  0.288871
85   mortoca01   266    551.0    86.0    7.0    44.0    42.0  0.156080
86   munsoth01  1423   5344.0  1558.0  113.0   696.0   701.0  0.291542
87   murraed02  3026  11336.0  3255.0  504.0  1627.0  1917.0  0.287138
88   myerswi01   235    885.0   227.0   27.0   127.0   117.0  0.256497
89   newcodo01   452    878.0   238.0   15.0    94.0   108.0  0.271071
90    nomohi01   324    485.0    65.0    4.0    22.0    26.0  0.134021
91   olivato01  1676   6301.0  1917.0  220.0   870.0   947.0  0.304237
92   olsongr01   491      4.0     1.0    1.0     1.0     2.0  0.250000
93   pearsal02   988   3077.0   831.0   28.0   485.0   214.0  0.270068
94   pedrodu01  1244   4961.0  1482.0  118.0   769.0   588.0  0.298730
95   peterga01   450    807.0   179.0   19.0    86.0   102.0  0.221809
96   piazzmi01  1912   6911.0  2127.0  427.0  1048.0  1335.0  0.307770
97   pinielo01  1747   5867.0  1705.0  102.0   651.0   766.0  0.290608
98   poseybu01   753   2739.0   850.0  102.0   361.0   447.0  0.310332
99   pujolal01  2274   8545.0  2666.0  560.0  1599.0  1698.0  0.311995
100  ramirha01  1328   5078.0  1503.0  210.0   881.0   707.0  0.295983
101  righeda01   281     11.0     2.0    0.0     0.0     1.0  0.181818
102  ripkeca01  3001  11551.0  3184.0  431.0  1647.0  1695.0  0.275647
103  robinfr02  2808  10006.0  2943.0  586.0  1829.0  1812.0  0.294124
104  robinja02  1382   4877.0  1518.0  137.0   947.0   734.0  0.311257
105  rolensc01  2038   7398.0  2077.0  316.0  1211.0  1287.0  0.280752
106   rosepe01  3562  14053.0  4256.0  160.0  2165.0  1314.0  0.302853
107   saboch01   911   3354.0   898.0  116.0   494.0   426.0  0.267740
108  salmoti01  1672   5934.0  1674.0  299.0   986.0  1016.0  0.282103
109  sanfoja02   393    665.0   105.0    3.0    51.0    37.0  0.157895
110  santibe01  1978   6951.0  1830.0  217.0   755.0   920.0  0.263271
111  sasakka01   228      0.0     0.0    0.0     0.0     0.0       NaN
112    saxst01  1769   6940.0  1949.0   54.0   913.0   550.0  0.280836
113  schwado01   176    234.0    34.0    0.0    16.0    12.0  0.145299
114  scorehe01   150    281.0    36.0    1.0    21.0    12.0  0.128114
115  seaveto01   649   1315.0   202.0   12.0    99.0    86.0  0.153612
116  sievero01  1887   6387.0  1703.0  318.0   945.0  1147.0  0.266635
117  sizemte01  1411   5011.0  1311.0   23.0   577.0   430.0  0.261624
118   sotoge01   758   2402.0   590.0  101.0   292.0   343.0  0.245629
119  strawda01  1583   5418.0  1401.0  335.0   898.0  1000.0  0.258583
120  streehu01   638      2.0     0.0    0.0     0.0     0.0  0.000000
121  sutclri01   316    562.0   102.0    4.0    42.0    55.0  0.181495
122  suzukic01  2357   9362.0  2935.0  113.0  1348.0   738.0  0.313501
123  treshto01  1192   4251.0  1041.0  153.0   595.0   530.0  0.244884
124  troutmi01   652   2448.0   744.0  139.0   477.0   397.0  0.303922
125  valenfe01   440    936.0   187.0   10.0    56.0    84.0  0.199786
126  verlaju01   318     32.0     2.0    0.0     1.0     0.0  0.062500
127  virdobi01  1583   5980.0  1596.0   91.0   735.0   502.0  0.266890
128  waltoje01   598   1573.0   423.0   25.0   241.0   132.0  0.268913
129  weisswa01  1495   4686.0  1207.0   25.0   623.0   386.0  0.257576
130  whitalo01  2390   8570.0  2369.0  244.0  1386.0  1084.0  0.276429
131  willibi01  2488   9350.0  2711.0  426.0  1410.0  1475.0  0.289947
132  willido03   229    389.0    95.0    9.0    43.0    39.0  0.244216
133  williea02   889   3058.0   756.0  138.0   361.0   457.0  0.247220
134  willisc01   344     23.0     1.0    0.0     1.0     0.0  0.043478
135   woodke02   449    346.0    59.0    7.0    23.0    32.0  0.170520
136  worreto01   617     27.0     2.0    0.0     1.0     0.0  0.074074
137  zachrpa01   293    318.0    36.0    0.0     9.0     6.0  0.113208

Followup to Step 10.1: It appears that there are some AVG equal NaN. I will change these values to 0, because they have 0 hits.

Step 10.1.1: Change NaN values to 0.

#10.1.1
batting_rookies_career_totals = batting_rookies_career_totals.fillna(0)

#10.2
#Print the pitching_rookies_careet_totals to see if there are any unexpected players included.
#Set max_rows and max_columns to 999 to ensure all the data is visible.
with pd.option_context('display.max_rows', 999, 'display.max_columns', 999):
    print pitching_rookies_career_totals

     playerID    G    W   IPouts      BFP     H   HR    SO    ER     BAopp  \
0   bahnsst01  574  146   7587.0  10701.0  2440  223  1359  1013  0.228016   
1   bailean01  216   11    680.0    913.0   170   22   233    69  0.186199   
2   blackjo02  172   30   1242.0   1737.0   391   53   222   180  0.225101   
3    byrdha01  187   46   2483.0   3675.0   890   71   381   400  0.242177   
4   cansejo01    1    0      3.0      8.0     2    0     0     3  0.250000   
5    darkal01    1    0      3.0      5.0     1    1     0     2  0.200000   
6   degroja01   52   23    994.0   1316.0   266   23   349    96  0.202128   
7   felizne01  246   15    869.0   1168.0   208   26   265   102  0.178082   
8   fernajo02   47   22    867.0   1151.0   208   18   336    77  0.180712   
9   fidryma01   58   29   1237.0   1695.0   397   23   170   142  0.234218   
10  goodedw01  430  194   8402.0  11705.0  2564  210  2293  1091  0.219052   
11   grimbo01  268   61   2279.0   3234.0   708   50   443   305  0.218924   
12  hellije01  142   49   2358.0   3318.0   748  105   584   344  0.225437   
13   howest01  497   47   1818.0   2523.0   586   32   328   204  0.232263   
14  jennija01  225   62   3385.0   5032.0  1253  137   749   621  0.249006   
15  kimbrcr01  355   19   1045.0   1366.0   193   18   563    63  0.141288   
16  matlajo01  361  125   7089.0   9789.0  2276  161  1516   835  0.232506   
17  metzgbu01  191   18    880.0   1286.0   289   23   175   122  0.224728   
18  montejo01  298   90   4957.0   6983.0  1604  135  1081   650  0.229701   
19  mortoca01  255   87   4946.0   7113.0  1753  120   650   684  0.246450   
20  newcodo01  344  149   6464.0   8891.0  2102  252  1129   852  0.236419   
21   nomohi01  323  123   5929.0   8462.0  1768  251  1918   932  0.208934   
22  olsongr01  622   40   2016.0   2889.0   598   46   588   258  0.206992   
23  peterga01  359  124   6243.0   8707.0  1894  157  1420   751  0.217526   
24  righeda01  718   82   4211.0   5988.0  1287   95  1112   540  0.214930   
25  sanfoja02  388  137   6148.0   8659.0  1907  174  1182   840  0.220233   
26  sasakka01  228    7    670.0    925.0   165   24   242    78  0.178378   
27  schwado01  172   49   2229.0   3251.0   710   50   408   307  0.218394   
28  scorehe01  150   55   2575.0   3696.0   609   79   837   320  0.164773   
29  seaveto01  656  311  14348.0  19369.0  3971  380  3640  1521  0.205018   
30  streehu01  638   39   1961.0   2598.0   509   65   648   207  0.195920   
31  sutclri01  457  171   8093.0  11548.0  2662  236  1679  1223  0.230516   
32  suzukic01    1    0      3.0      5.0     2    0     0     1  0.400000   
33  valenfe01  453  173   8790.0  12398.0  2718  226  2074  1154  0.219229   
34  verlaju01  318  157   6334.0   8761.0  1901  187  1943   825  0.216984   
35  willido03  205   72   3665.0   5325.0  1251  114   896   566  0.234930   
36  willisc01  344   28   1318.0   1864.0   326   35   510   164  0.174893   
37   woodke02  446   86   4140.0   5863.0  1083  148  1582   563  0.184718   
38  worreto01  617   50   2081.0   2893.0   608   65   628   238  0.210162   
39  zachrpa01  293   69   3532.0   5050.0  1147   88   669   461  0.227129   

          ERA  
0    3.604982  
1    2.739706  
2    3.913043  
3    4.349577  
4   27.000000  
5   18.000000  
6    2.607646  
7    3.169160  
8    2.397924  
9    3.099434  
10   3.505951  
11   3.613427  
12   3.938931  
13   3.029703  
14   4.953323  
15   1.627751  
16   3.180279  
17   3.743182  
18   3.540448  
19   3.733926  
20   3.558787  
21   4.244223  
22   3.455357  
23   3.247958  
24   3.462360  
25   3.689005  
26   3.143284  
27   3.718708  
28   3.355340  
29   2.862211  
30   2.850076  
31   4.080193  
32   9.000000  
33   3.544710  
34   3.516735  
35   4.169714  
36   3.359636  
37   3.671739  
38   3.087938  
39   3.524066

Followup to Step 10.2: It appears that there are three fielders included in the dataframe. I will remove them from the dataframe to calculate more accurate values of how well RY winning pitchers pitch. Fielders rarely pitch so it does not make a great deal of sense to include them in the dataframe.

#10.2.1
#Create a new dataframe without the fielders.
only_pitching_rookies_career_totals = pitching_rookies_career_totals[pitching_rookies_career_totals.G > 1]

#10.2.2
#Calculate the new pitching_rookies_overall_means.
only_pitching_rookies_overall_means = only_pitching_rookies_career_totals.mean()

#10.2.3
#Calculate the new percentiles.
pitching_statistics = ['G','W','IPouts','BFP','H','HR','SO','ER','BAopp','ERA']
for statistic in pitching_statistics:
    print statistic + ' = ' + str(percentile(pitching_df_career_totals[statistic], only_pitching_rookies_overall_means[statistic]))

G = 0.865803814714
W = 0.918085831063
IPouts = 0.910762942779
BFP = 0.908719346049
H = 0.901566757493
HR = 0.891178474114
SO = 0.927963215259
ER = 0.895265667575
BAopp = 0.215429155313
ERA = 0.165701634877

Followup to 10.2.3: These values, especially ERA, are much closer to those that I expected. I believe removing the three fielders from the dataframe have provided percentiles that are much more reflective of the skills of RY winning pitchers.

#10.2.4
#Subtract new values from the old ones to show the difference.
pitching_statistics = ['G','W','IPouts','BFP','H','HR','SO','ER','BAopp','ERA']
for statistic in pitching_statistics:
    print statistic + ' = ' + str(percentile(pitching_df_career_totals[statistic], 
                                             only_pitching_rookies_overall_means[statistic]) -
                                  percentile(pitching_df_career_totals[statistic], 
                                             pitching_rookies_overall_means[statistic]))

G = 0.0187329700272
W = 0.00868528610354
IPouts = 0.012091280654
BFP = 0.0129427792916
H = 0.0108991825613
HR = 0.00902588555858
SO = 0.0103882833787
ER = 0.0091961852861
BAopp = -0.0538147138965
ERA = -0.376021798365

Followup to 10.2.4: These values are similar to those prior, therefore I do not feel the need to make another set of plots. BAopp and ERA will be plotted later on.

Step 11: Adjust measurements: The RY winners' Batting Average seems surprisingly low compared to the the other percentiles. I believe this is due to finding the mean of the players' batting averages, instead of finding the batting average of all of the At Bats (AB) of RY winners. Finding this new percentile will dilute the weight of pitchers who have batted, because they play less often, therefore bat less, and typically have a lower batting average. In short, this new measurement should provide a more accurate representation of how well RY winners bat.

#11.1
#Find the new value for AVG, by using the sum of all players' Hits (H) and At Bats (AB) statistics.
#Then divide the value of Hits by At Bats.
batting_rookies_new_AVG = batting_rookies_career_totals['H'].sum() / batting_rookies_career_totals['AB'].sum()

#11.2
#Find the new percentile for AVG.
print 'AVG = ' + str(percentile(batting_df_career_totals['AVG'], batting_rookies_new_AVG))

AVG = 0.801218724033

Followup to 11.2: This value is much closer to the other percentiles of the other batting statistics, and one that I believe is more accurate of the skill of RY winning batters.

#11.2.1 
#Subtract previous value from new value to show the difference, followed by their relative differences.
print "AVG: Difference = " + str(batting_rookies_new_AVG - batting_rookies_overall_means['AVG'])
print "AVG: % Difference = " + str((batting_rookies_new_AVG - batting_rookies_overall_means['AVG'])/ 
                              batting_rookies_overall_means['AVG']*100)

AVG: Difference = 0.036527506919
AVG: % Difference = 15.1165840091

Step 11.3 Compare RY winners' AVG, BAopp, and ERA to all other players via plots. I am using box plots to show the spread of these values, as I believe they are the most important in determining the quality of the players, thus a more accurate depiction of their values should be worthwhile.

#11.3.1
#Compare RY winners' AVG to all other batters.
batting_statisitc_AVG = pd.DataFrame({'RY Winners': batting_rookies_career_totals['AVG'],
                                      'All Batters': batting_df_career_totals['AVG']},
                                      columns = ['RY Winners', 'All Batters'])

batting_statisitc_AVG.plot(kind='box', ylim = [-0.1,0.5])
plt.show()

#10.3.2
#Compare RY winners' BAopp to all other pitchers.
pitching_stats_BAopp = pd.DataFrame({'RY Winners': only_pitching_rookies_career_totals['BAopp'],
                                     'All Pitchers': pitching_df_career_totals['BAopp']},
                                     columns=['RY Winners','All Pitchers'])

pitching_stats_BAopp.plot(kind='box', ylim = [0,0.4])
plt.show()

#10.3.3
#Compare RY winners' ERA to all other pitchers.
pitching_stats_ERA = pd.DataFrame({'RY Winners': only_pitching_rookies_career_totals['ERA'],
                                   'All Pitchers': pitching_df_career_totals['ERA']},
                                   columns=['RY Winners','All Pitchers'])

pitching_stats_ERA.plot(kind='box', ylim = [0,10])
plt.show()

The Goal (1) Summary: I believe that it is clear to say that RY winners have above average careers in the MLB. In all the measurements that were used to compare RY winners with all other MLB players, they performed well above average, regularly in the 90th+ percentile.

Step 11: Calculate if RY winners are improving with time.

#11.1
#Create new dataframes for the yearly calculations.
yearly_batting_rookies = ry_winners.merge(batting_df,
                                         how = 'left',
                                         on = 'playerID')

yearly_pitching_rookies = ry_winners.merge(pitching_df,
                                           how = 'inner',
                                           on = 'playerID')

#11.2
#Include only the relevant statistics in the dataframes.
yearly_batting_rookies = yearly_batting_rookies[['yearID','G','AB','H','HR','R','RBI']]
yearly_pitching_rookies = yearly_pitching_rookies[['yearID','G','W','IPouts','BFP','H','HR','SO','ER']]

#11.3
#Group statistics by year to calculate the yearly changes.
yearly_batting_rookies_career_totals = yearly_batting_rookies.groupby('yearID', as_index = False).sum()
yearly_pitching_rookies_career_totals = yearly_pitching_rookies.groupby('yearID', as_index = False).sum()

#11.4 
#Create empty dataframes to be filled with standardized values.
yearly_batting_rookies_career_totals_standardized = pd.DataFrame()
yearly_pitching_rookies_career_totals_standardized = pd.DataFrame()

#11.5.1.1
'''
Add the standardized values to the empty dataframes. 
Batters are standardized by At Bats, because this will show how successful a batter is at each At Bat and 
if there are different amount of At Bats each season, the yearly differences will be factored out.
'''
yearly_batting_statistics = ['G','AB','H','HR','R','RBI']
for statistic in yearly_batting_statistics:
    yearly_batting_rookies_career_totals_standardized[statistic] = (yearly_batting_rookies_career_totals[statistic] /
                                                                    yearly_batting_rookies_career_totals['AB'])

#11.5.1.2
#Calculate and add AVG to the batting standardized values dataframe.
yearly_batting_rookies_career_totals_standardized['AVG'] = (yearly_batting_rookies_career_totals['H'] /
                                               yearly_batting_rookies_career_totals['AB'])

#11.5.2.1
'''
Add the standardized values to the empty dataframes.
Pitchers are standardized by Batters Faced, 
because this will show how successful a pitcher typically is against a batter, 
and if pitchers face a different amount of batters each year, this will be factored out.
'''
yearly_pitching_statistics = ['G','W','IPouts','BFP','H','HR','SO','ER']
for statistic in yearly_pitching_statistics:
    yearly_pitching_rookies_career_totals_standardized[statistic] = (yearly_pitching_rookies_career_totals[statistic] / 
                                                                     yearly_pitching_rookies_career_totals['BFP'])

#11.5.2.2
#Calculate and add BAopp and ERA to the pitching standardized values dataframe.
yearly_pitching_rookies_career_totals_standardized['BAopp'] = (yearly_pitching_rookies_career_totals['H'] /
                                                  yearly_pitching_rookies_career_totals['BFP'])
yearly_pitching_rookies_career_totals_standardized['ERA'] = (yearly_pitching_rookies_career_totals['ER'] /
                                                (yearly_pitching_rookies_career_totals['IPouts'] / 27))

#11.6
#Add the years (yearID) to the dataframes.
yearly_batting_rookies_career_totals_standardized['yearID'] = yearly_batting_rookies_career_totals['yearID']
yearly_pitching_rookies_career_totals_standardized['yearID'] = yearly_pitching_rookies_career_totals['yearID']

#11.7
#Calculate the correlation for batters.
yearly_batting_rookies_career_totals_standardized.corr()

Follow up to 11.7: By looking at the yearID column/row, we can see that RY winning batters, on average, improve in all areas with each successive year.

#11.8
#Calculate the correlation for pitchers.
yearly_pitching_rookies_career_totals_standardized.corr()

Follow up to 11.8: By looking at the yearID column/row, we can see that RY winning pitchers also, on average, improve with each successive year. H, HR, ER, BAopp, and ERA all have negative values, which means that pitchers are giving up fewer hits, home runs, and earned runs, and they are playing more games, winning more games, and striking out more players, on average, with each passing year.

Summary of Goal 2: It appears that RY winners are becoming better batters and pitchers, on average, with each passing year, based on the correlation between years and each statistic.

Overall Summary: I am quite pleased to have answered both of my questions and received the conclusive evidence that I did. Based on these statisitcs, it appears that RY winners are excellent baseball players and are improving over time.

Limitations to Analysis: 1: There is a possibility that I overweighted the better batters in terms of their Batting Averages. By summing the values of the RY winners before calculating AVG, this could overrepresent the better batters, as they bat more often. The main reason I did this, was to underweight the pitchers who have batted. Another option would have been to caluculate the median value for the Batting Average (see below), however, this does not significantly alter the percentile, so I do not think that I have made a misleading analysis with my calculations.

2: It could have been worth comparing RY winners to players that have played at least 200 games, for example. These players could be considered 'regular' MLB players who are more 'worthy' of a comparison. By comparing RY winners to these 'regular' MLB players, it could be a more accurate comparison for the true relative skill of RY winners. I thought this analysis was not essential because I found the percentile for each statistic comparing RY winners to all other players. I believe these percentiles are sufficient in answering the original question: To see if winners of 'Rookie of the Year' (RY) have above average careers.

3: I did not calculate if there was a trend in the MLB of the average player becoming better overtime. If the average player is becoming relatively better than the average RY winner, then one could argue that the average RY winner is becoming relatively worse overtime. I do not think that this calculation is necessary because it is not essential to answer part 2 of the original question: To see if there is a trend of each winner of RY being better than the previous winner. The correlations found show that RY winners are becoming better over time, which answers the question. Plus, I cannot think of a reason why RY winners would become relatively worse over time.

4: I did not remove fielders from pitching_df_career_totals. This could underscore the quality of pitchers in the MLB, but because fielders so rarely pitch in the MLB, I am not worried that I have created a misleading analysis and over-valued the skills of RY winning pitchers.

#Followup to limitation 1.
print percentile(batting_df_career_totals['AVG'], batting_rookies_career_totals['AVG'].median())

0.748499676853

Reason for these questions: During the 2003-2004 NHL season, Andrew Raycroft, a goalie for the Boston Bruins, played a fantastic season and won the Calder Memorial Trophy (aka Rookie of the Year). During the summer of 2006, the Toronto Maple Leafs (my favourite team) traded for Raycroft. Raycroft never played another particularly good season for the remaineder of his NHL career. I was wondering if this was a trend, RY winners being average players, or if Raycroft was more of an exception. I can't say anything about the NHL, but I can safely say that there is a trend in RY winners being exceptional players in the MLB.

List of websites used to help perform analysis: http://stackoverflow.com/ http://pandas.pydata.org/pandas-docs/stable/index.html https://www.quora.com/ https://en.wikipedia.org/wiki/Main_Page http://www.baseball-reference.com/

	G	AB	H	HR	R	RBI	AVG	yearID
G	1.000000	NaN	-0.348781	-0.288255	-0.429037	-0.032460	-0.348781	0.172644
AB	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
H	-0.348781	NaN	1.000000	0.361156	0.832639	0.436682	1.000000	0.061390
HR	-0.288255	NaN	0.361156	1.000000	0.570076	0.900936	0.361156	0.227543
R	-0.429037	NaN	0.832639	0.570076	1.000000	0.518495	0.832639	0.108133
RBI	-0.032460	NaN	0.436682	0.900936	0.518495	1.000000	0.436682	0.225669
AVG	-0.348781	NaN	1.000000	0.361156	0.832639	0.436682	1.000000	0.061390
yearID	0.172644	NaN	0.061390	0.227543	0.108133	0.225669	0.061390	1.000000

	G	W	IPouts	BFP	H	HR	SO	ER	BAopp	ERA	yearID
G	1.000000	-0.398919	-0.077348	NaN	-0.069148	0.212708	0.096844	0.216759	-0.069148	0.223468	0.320711
W	-0.398919	1.000000	0.751863	NaN	-0.441512	-0.533458	0.540934	-0.765019	-0.441512	-0.774742	0.034162
IPouts	-0.077348	0.751863	1.000000	NaN	-0.590532	-0.485380	0.767301	-0.801075	-0.590532	-0.813054	0.322662
BFP	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
H	-0.069148	-0.441512	-0.590532	NaN	1.000000	-0.181902	-0.709215	0.261430	1.000000	0.279431	-0.083578
HR	0.212708	-0.533458	-0.485380	NaN	-0.181902	1.000000	-0.353812	0.882490	-0.181902	0.870750	-0.277637
SO	0.096844	0.540934	0.767301	NaN	-0.709215	-0.353812	1.000000	-0.633312	-0.709215	-0.640417	0.615613
ER	0.216759	-0.765019	-0.801075	NaN	0.261430	0.882490	-0.633312	1.000000	0.261430	0.999432	-0.272141
BAopp	-0.069148	-0.441512	-0.590532	NaN	1.000000	-0.181902	-0.709215	0.261430	1.000000	0.279431	-0.083578
ERA	0.223468	-0.774742	-0.813054	NaN	0.279431	0.870750	-0.640417	0.999432	0.279431	1.000000	-0.269192
yearID	0.320711	0.034162	0.322662	NaN	-0.083578	-0.277637	0.615613	-0.272141	-0.083578	-0.269192	1.000000