Predicting Destination - AirBnB Customers

The goal for this analysis is to build a neural network, using Tensorflow, that can predict which country a new user of AirBnB will make his/her first trip to. More information about the data can be found at: https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings. This data was originally from a 2016 Kaggle competition, sponsored by AirBnB. I believe that it is still an excellent learning reasource because of the opportunity to clean data, engineering features, and build a model, which will all significantly impact the quality of the predictions.

In [ ]:
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.cross_validation import train_test_split
from scipy import stats
pd.set_option("display.max_columns", 1000)
import math
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
from sklearn.cross_validation import train_test_split
import tensorflow as tf
from tqdm import tqdm
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing
from sklearn.metrics import classification_report
//anaconda/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
In [ ]:
# Load the data
countries = pd.read_csv("countries.csv")
test = pd.read_csv("test_users.csv")
train = pd.read_csv("train_users_2.csv")
sessions = pd.read_csv("sessions.csv")

First, let's have a look at the data we are working with.

In [ ]:
test.head()
Out[ ]:
id date_account_created timestamp_first_active date_first_booking gender age signup_method signup_flow language affiliate_channel affiliate_provider first_affiliate_tracked signup_app first_device_type first_browser
0 5uwns89zht 2014-07-01 20140701000006 NaN FEMALE 35.0 facebook 0 en direct direct untracked Moweb iPhone Mobile Safari
1 jtl0dijy2j 2014-07-01 20140701000051 NaN -unknown- NaN basic 0 en direct direct untracked Moweb iPhone Mobile Safari
2 xx0ulgorjt 2014-07-01 20140701000148 NaN -unknown- NaN basic 0 en direct direct linked Web Windows Desktop Chrome
3 6c6puo6ix0 2014-07-01 20140701000215 NaN -unknown- NaN basic 0 en direct direct linked Web Windows Desktop IE
4 czqhjk3yfe 2014-07-01 20140701000305 NaN -unknown- NaN basic 0 en direct direct untracked Web Mac Desktop Safari
In [ ]:
train.head()
Out[ ]:
id date_account_created timestamp_first_active date_first_booking gender age signup_method signup_flow language affiliate_channel affiliate_provider first_affiliate_tracked signup_app first_device_type first_browser country_destination
0 gxn3p5htnn 2010-06-28 20090319043255 NaN -unknown- NaN facebook 0 en direct direct untracked Web Mac Desktop Chrome NDF
1 820tgsjxq7 2011-05-25 20090523174809 NaN MALE 38.0 facebook 0 en seo google untracked Web Mac Desktop Chrome NDF
2 4ft3gnwmtx 2010-09-28 20090609231247 2010-08-02 FEMALE 56.0 basic 3 en direct direct untracked Web Windows Desktop IE US
3 bjjt8pjhuk 2011-12-05 20091031060129 2012-09-08 FEMALE 42.0 facebook 0 en direct direct untracked Web Mac Desktop Firefox other
4 87mebub9p4 2010-09-14 20091208061105 2010-02-18 -unknown- 41.0 basic 0 en direct direct untracked Web Mac Desktop Chrome US
In [ ]:
sessions.head(10)
Out[ ]:
user_id action action_type action_detail device_type secs_elapsed
0 d1mm9tcy42 lookup NaN NaN Windows Desktop 319.0
1 d1mm9tcy42 search_results click view_search_results Windows Desktop 67753.0
2 d1mm9tcy42 lookup NaN NaN Windows Desktop 301.0
3 d1mm9tcy42 search_results click view_search_results Windows Desktop 22141.0
4 d1mm9tcy42 lookup NaN NaN Windows Desktop 435.0
5 d1mm9tcy42 search_results click view_search_results Windows Desktop 7703.0
6 d1mm9tcy42 lookup NaN NaN Windows Desktop 115.0
7 d1mm9tcy42 personalize data wishlist_content_update Windows Desktop 831.0
8 d1mm9tcy42 index view view_search_results Windows Desktop 20842.0
9 d1mm9tcy42 lookup NaN NaN Windows Desktop 683.0
In [ ]:
print(test.shape)
print(train.shape)
print(sessions.shape)
(62096, 15)
(213451, 16)
(10567737, 6)

Our target feature is 'country_destination,' which can be found in the 'train' dataframe. Given this, let's first explore the sessions data, then merge it with the train dataframe (on 'user_id'), after we are done transforming it.

Sessions

In [ ]:
sessions.isnull().sum()
Out[ ]:
user_id            34496
action             79626
action_type      1126204
action_detail    1126204
device_type            0
secs_elapsed      136031
dtype: int64
In [ ]:
#Drop rows where user_id is null because we want to tie everything back to a user.
sessions = sessions[sessions.user_id.notnull()]
In [ ]:
sessions.isnull().sum()
Out[ ]:
user_id                0
action             79480
action_type      1122957
action_detail    1122957
device_type            0
secs_elapsed      135483
dtype: int64
In [ ]:
# How do nulls in action relate to action_type
sessions[sessions.action.isnull()].action_type.value_counts()
Out[ ]:
message_post    79480
Name: action_type, dtype: int64
In [ ]:
# Every action with a null value, has action_type equal to 'message_post'.
# Let's change all the null values to 'message'
sessions.loc[sessions.action.isnull(), 'action'] = 'message'
In [ ]:
sessions.isnull().sum()
Out[ ]:
user_id                0
action                 0
action_type      1122957
action_detail    1122957
device_type            0
secs_elapsed      135483
dtype: int64
In [ ]:
# action_type and action_detail are missing values in the same rows, this simplifies things a little.
print(sessions[sessions.action_type.isnull()].action.value_counts())
print()
print(sessions[sessions.action_detail.isnull()].action.value_counts())
show                      580485
similar_listings_v2       168457
lookup                    161422
campaigns                 104331
track_page_view            80949
index                      16682
localization_settings       5380
uptodate                    3329
signed_out_modal            1054
currencies                   292
update                       225
braintree_client_token       120
check                        119
widget                        75
phone_verification            16
satisfy                        9
disaster_action                6
track_activity                 6
Name: action, dtype: int64

show                      580485
similar_listings_v2       168457
lookup                    161422
campaigns                 104331
track_page_view            80949
index                      16682
localization_settings       5380
uptodate                    3329
signed_out_modal            1054
currencies                   292
update                       225
braintree_client_token       120
check                        119
widget                        75
phone_verification            16
satisfy                        9
disaster_action                6
track_activity                 6
Name: action, dtype: int64

To fill in the null values for action_type and action_detail, we will perform these steps:

  1. Use the most common value relative to each user and action
  2. Use the most common value relative to each action
  3. Use the value 'missing'
In [ ]:
# function that finds the most common value of a feature, specific to each user and action.
def most_common_value_by_user(merge_df, feature): 
    # Find the value counts for a feature, for each user and action.
    new_df = pd.DataFrame(merge_df.groupby(['user_id','action'])[feature].value_counts())
    # Set the index to a new feature so that it can be transformed.
    new_df['index_tuple'] = new_df.index 
    # Change the feature name to count, since it is the value count of the feature.
    new_df['count'] = new_df[feature]
    
    new_columns = ['user_id','action',feature]
    # separate the features of index_tuple (a list), into their own columns
    for n,col in enumerate(new_columns):
        new_df[col] = new_df.index_tuple.apply(lambda index_tuple: index_tuple[n])
    
    # reset index and drop index_tuple
    new_df = new_df.reset_index(drop = True)
    new_df = new_df.drop(['index_tuple'], axis = 1) 
    
    # Create a new dataframe for each user, action, and the count of the most common feature
    new_df_max = pd.DataFrame(new_df.groupby(['user_id','action'], as_index = False)['count'].max())
    # Merge dataframes to include the name of the most common feature
    new_df_max = new_df_max.merge(new_df, on = ['user_id','action','count'])
    # Drop count as it is not needed for the next step
    new_df_max = new_df_max.drop('count', axis = 1)
    
    # Merge with main dataframe (sessions)
    merge_df = merge_df.merge(new_df_max, left_on = ['user_id','action'], right_on = ['user_id','action'], how = 'left')
    
    return merge_df
In [ ]:
sessions = most_common_value_by_user(sessions, 'action_type')
print("action_type is complete.")

sessions = most_common_value_by_user(sessions, 'action_detail')
print("action_detail is complete.")
action_type is complete.
action_detail is complete.
In [ ]:
# Replace the nulls with the values from the features created by 'most_common_value_by_user' function.
sessions.loc[sessions.action_type_x.isnull(), 'action_type_x'] = sessions.action_type_y
sessions.loc[sessions.action_detail_x.isnull(), 'action_detail_x'] = sessions.action_detail_y

# Change the features' names to their originals and drop unnecessary features.
sessions['action_type'] = sessions.action_type_x
sessions['action_detail'] = sessions.action_detail_x
sessions = sessions.drop(['action_type_x','action_type_y','action_detail_x','action_detail_y'], axis = 1)

That helped to remove some of the nulls values. Now let's try the more general function.

In [ ]:
sessions.isnull().sum()
Out[ ]:
user_id               0
action                0
device_type           0
secs_elapsed     167233
action_type      531386
action_detail    531386
dtype: int64
In [ ]:
# function that finds the most common value of a feature, specific to each action.
def most_common_value_by_all_users(merge_df, feature):
    # Group by action, then find the value counts of the feature
    new_df = pd.DataFrame(merge_df.groupby('action')[feature].value_counts())
    # Set the index to a new feature so that it can be transformed.
    new_df['index_tuple'] = new_df.index 
    # Change the feature name to count, since it is the value count of the feature.
    new_df['count'] = new_df[feature]
    
    new_columns = ['action',feature]
    # separate the features of index_tuple (a list), into their own columns
    for n,col in enumerate(new_columns):
        new_df[col] = new_df.index_tuple.apply(lambda index_tuple: index_tuple[n])
    
    # reset index and drop index_tuple
    new_df = new_df.reset_index(drop = True)
    new_df = new_df.drop(['index_tuple'], axis = 1) 
    
    # Create a new dataframe for each action, and the count of the most common feature
    new_df_max = pd.DataFrame(new_df.groupby('action', as_index = False)['count'].max())
    # Merge dataframes to include the name of the most common feature
    new_df_max = new_df_max.merge(new_df, on = ['action','count'])
    # Drop count as it is not needed for next step
    new_df_max = new_df_max.drop('count', axis = 1)
    
    # Merge dataframe with main dataframe (sessions)
    merge_df = merge_df.merge(new_df_max, left_on = 'action', right_on = 'action', how = 'left')
    
    return merge_df
In [ ]:
sessions = most_common_value_by_all_users(sessions, 'action_type')
print("action_type is complete.")
sessions = most_common_value_by_all_users(sessions, 'action_detail')
print("action_detail is complete.")
action_type is complete.
action_detail is complete.
In [ ]:
# Replace the nulls with the values from the features created by 'most_common_value_by_all_users' function.
sessions.loc[sessions.action_type_x.isnull(), 'action_type_x'] = sessions.action_type_y
sessions.loc[sessions.action_detail_x.isnull(), 'action_detail_x'] = sessions.action_detail_y

# Change the features' names to their originals and drop the unnecessary features.
sessions['action_type'] = sessions.action_type_x
sessions['action_detail'] = sessions.action_detail_x
sessions = sessions.drop(['action_type_x','action_type_y','action_detail_x','action_detail_y'], axis = 1)

There are still some null values remaining. Let's take a look at what actions these null values are related to.

In [ ]:
sessions.isnull().sum()
Out[ ]:
user_id               0
action                0
device_type           0
secs_elapsed     167233
action_type      415562
action_detail    415562
dtype: int64
In [ ]:
sessions[sessions.action_type.isnull()].action.value_counts()
Out[ ]:
similar_listings_v2       168457
lookup                    161422
track_page_view            80949
uptodate                    3329
signed_out_modal            1054
braintree_client_token       120
check                        119
widget                        75
phone_verification            16
satisfy                        9
disaster_action                6
track_activity                 6
Name: action, dtype: int64

Let's take a look at the value counts for all of the actions to see how their frequency compares to others.

In [ ]:
sessions.action.value_counts()
Out[ ]:
show                           2866444
index                           893606
search_results                  723124
personalize                     704782
search                          533887
ajax_refresh_subtotal           486414
update                          370379
similar_listings                363423
social_connections              337764
reviews                         324825
create                          225961
active                          187370
similar_listings_v2             168457
lookup                          161422
dashboard                       152515
header_userpic                  141315
collections                     124067
edit                            108999
campaigns                       104647
track_page_view                  80949
message                          79484
unavailabilities                 77985
qt2                              64585
notifications                    61946
confirm_email                    58565
requested                        57068
identity                         53550
ajax_check_dates                 52426
show_personalize                 50353
authenticate                     44323
                                ...   
reset_calendar                       2
envoy_bank_details_redirect          2
recommendation_page                  2
unsubscribe                          2
views_campaign                       2
sandy                                2
stpcv                                2
rest-of-world                        2
accept_decline                       2
tos_2014                             2
special_offer                        2
views_campaign_rules                 2
use_mobile_site                      2
preapproval                          2
confirmation                         2
desks                                1
deactivate                           1
nyan                                 1
revert_to_admin                      1
set_minimum_payout_amount            1
plaxo_cb                             1
reactivate                           1
deauthorize                          1
host_cancel                          1
wishlists                            1
acculynk_bin_check_failed            1
sldf                                 1
events                               1
update_message                       1
deactivated                          1
Name: action, dtype: int64

'similar_listings_v2', 'lookup', and 'track_page_view' are the three main features with null values. I will give each of them specific values for action_type and action_detail, otherwise I will set the value to 'missing'.

In [ ]:
# Use these values for 'similar_listings_v2' since they are similar actions.
print(sessions[sessions.action == 'similar_listings'].action_type.value_counts())
print(sessions[sessions.action == 'similar_listings'].action_detail.value_counts())
data    363423
Name: action_type, dtype: int64
similar_listings    363423
Name: action_detail, dtype: int64
In [ ]:
sessions.loc[sessions.action == 'similar_listings_v2', 'action_type'] = "data"
sessions.loc[sessions.action == 'similar_listings_v2', 'action_detail'] = "similar_listings"

# No other action is similar, so we'll use the same work for all three features.
sessions.loc[sessions.action == 'lookup', 'action_type'] = "lookup"
sessions.loc[sessions.action == 'lookup', 'action_detail'] = "lookup"

sessions.loc[sessions.action == 'track_page_view', 'action_type'] = "track_page_view"
sessions.loc[sessions.action == 'track_page_view', 'action_detail'] = "track_page_view"

sessions.action_type = sessions.action_type.fillna("missing")
sessions.action_detail = sessions.action_detail.fillna("missing")

All good. Now just secs_elapsed is left.

In [ ]:
sessions.isnull().sum()
Out[ ]:
user_id               0
action                0
device_type           0
secs_elapsed     167233
action_type           0
action_detail         0
dtype: int64

To keep things simple, let's fill the nulls with the median value for each action.

In [ ]:
# Find the median secs_elapsed for each action
median_duration = pd.DataFrame(sessions.groupby('action', as_index = False)['secs_elapsed'].median())
median_duration.head()
Out[ ]:
action secs_elapsed
0 10 47320.5
1 11 72764.0
2 12 180407.0
3 15 54223.0
4 about_us 18627.5
In [ ]:
# Merge dataframes on action
sessions = sessions.merge(median_duration, left_on = 'action', right_on = 'action', how = 'left')
print("Merge complete.")
# if secs_elapsed is null, fill it with the median value
sessions.loc[sessions.secs_elapsed_x.isnull(), 'secs_elapsed_x'] = sessions.secs_elapsed_y
print("Nulls are filled.")
# Change column name
sessions['secs_elapsed'] = sessions.secs_elapsed_x
print("Column is created.")
# Drop unneeded columns
sessions = sessions.drop(['secs_elapsed_x','secs_elapsed_y'], axis = 1)
print("Columns are dropped.")
Merge complete.
Nulls are filled.
Column is created.
Columns are dropped.

All clean!

In [ ]:
sessions.isnull().sum()
Out[ ]:
user_id          0
action           0
device_type      0
action_type      0
action_detail    0
secs_elapsed     0
dtype: int64

I think the best next step would be to take the information from sessions and summarize it. We will create a new dataframe, add the most important features, then join it with the train dataframe.

Sessions' Summary

In [ ]:
sessions.head()
Out[ ]:
user_id action device_type action_type action_detail secs_elapsed
0 d1mm9tcy42 lookup Windows Desktop lookup lookup 319.0
1 d1mm9tcy42 search_results Windows Desktop click view_search_results 67753.0
2 d1mm9tcy42 lookup Windows Desktop lookup lookup 301.0
3 d1mm9tcy42 search_results Windows Desktop click view_search_results 22141.0
4 d1mm9tcy42 lookup Windows Desktop lookup lookup 435.0
In [ ]:
# sessions_summary is set to the number of times a user_id appears in sessions
sessions_summary = pd.DataFrame(sessions.user_id.value_counts(sort = False))
# Set action_count equal to user_id
sessions_summary['action_count'] = sessions_summary.user_id
# Set user_id equal to the index
sessions_summary['user_id'] = sessions_summary.index
# Rest the index
sessions_summary = sessions_summary.reset_index(drop = True)

Looks good, now let's add some features!

In [ ]:
sessions_summary.head()
Out[ ]:
user_id action_count
0 lvs98g7ggz 12
1 9hue70lsfi 22
2 8wqf53khcc 468
3 q7jew74rm9 50
4 f2t2nbphmv 43
In [ ]:
# user_duration is the sums of secs_elapsed for each user
user_duration = pd.DataFrame(sessions.groupby('user_id').secs_elapsed.sum())
user_duration['user_id'] = user_duration.index
# Merge dataframes
sessions_summary = sessions_summary.merge(user_duration)
# Create new feature, 'duration', to equal secs_elapsed
sessions_summary['duration'] = sessions_summary.secs_elapsed
sessions_summary = sessions_summary.drop("secs_elapsed", axis = 1)
In [ ]:
sessions_summary.head()
Out[ ]:
user_id action_count duration
0 lvs98g7ggz 12 113525.5
1 9hue70lsfi 22 439490.0
2 8wqf53khcc 468 7468072.0
3 q7jew74rm9 50 363874.0
4 f2t2nbphmv 43 3340242.0
In [ ]:
# This function finds the most common value, for a specific feature, for each user.
def most_frequent_value(merge_df, feature):
    # Group by the users and find the value counts of the feature
    new_df = pd.DataFrame(sessions.groupby('user_id')[feature].value_counts())
    # The index is a tuple, and we need to seperate it, so let's create a new feature from it.
    new_df['index_tuple'] = new_df.index
    # The new columns are the features created from the tuple.
    new_columns = ['user_id',feature]
    for n,col in enumerate(new_columns):
        new_df[col] = new_df.index_tuple.apply(lambda index_tuple: index_tuple[n])
    
    # Drop the old index (the tuple index)
    new_df = new_df.reset_index(drop = True)
    # Drop the unneeded feature
    new_df = new_df.drop('index_tuple', axis = 1)
    # Select the first value for each user, its most common
    new_df = new_df.groupby('user_id').first()
    
    # Set user_id equal to the index, then reset the index
    new_df['user_id'] = new_df.index
    new_df = new_df.reset_index(drop = True)
    
    merge_df = merge_df.merge(new_df)
    
    return merge_df
In [ ]:
# For each categorical feature in sessions, find the most common value for each user.
sessions_feature = ['action','action_type','action_detail','device_type']

for feature in sessions_feature:
    sessions_summary = most_frequent_value(sessions_summary, feature)
    print("{} is complete.".format(feature))
action is complete.
action_type is complete.
action_detail is complete.
device_type is complete.
In [ ]:
sessions_summary.head()
Out[ ]:
user_id action_count duration action action_type action_detail device_type
0 lvs98g7ggz 12 113525.5 search_results click view_search_results Windows Desktop
1 9hue70lsfi 22 439490.0 show view -unknown- -unknown-
2 8wqf53khcc 468 7468072.0 show view p3 Mac Desktop
3 q7jew74rm9 50 363874.0 show view user_profile iPhone
4 f2t2nbphmv 43 3340242.0 update submit update_listing Windows Desktop
In [ ]:
# This function finds the number of unique values of a feature for each user.
def unique_features(feature, feature_name, merge_df):
    # Create a dataframe by grouping the users and the feature
    unique_feature = pd.DataFrame(sessions.groupby('user_id')[feature].unique())
    unique_feature['user_id'] = unique_feature.index
    unique_feature = unique_feature.reset_index(drop = True)
    # Create a new feature equal to the number of unique features for each user
    unique_feature[feature_name] = unique_feature[feature].map(lambda x: len(x))
    # Drop the needed feature
    unique_feature = unique_feature.drop(feature, axis = 1)
    merge_df = merge_df.merge(unique_feature, on = 'user_id')
    return merge_df
In [ ]:
# Apply unique_features to each of the categorical features in sessions
sessions_summary = unique_features('action', 'unique_actions', sessions_summary)
print("action is complete.")
sessions_summary = unique_features('action_type', 'unique_action_types', sessions_summary)
print("action_type is complete.")
sessions_summary = unique_features('action_detail', 'unique_action_details', sessions_summary)
print("action_detail is complete.")
sessions_summary = unique_features('device_type', 'unique_device_types', sessions_summary)
print("device_type is complete.")
action is complete.
action_type is complete.
action_detail is complete.
device_type is complete.
In [ ]:
sessions_summary.head()
Out[ ]:
user_id action_count duration action action_type action_detail device_type unique_actions unique_action_types unique_action_details unique_device_types
0 lvs98g7ggz 12 113525.5 search_results click view_search_results Windows Desktop 3 3 3 1
1 9hue70lsfi 22 439490.0 show view -unknown- -unknown- 7 5 8 2
2 8wqf53khcc 468 7468072.0 show view p3 Mac Desktop 41 7 22 3
3 q7jew74rm9 50 363874.0 show view user_profile iPhone 8 5 8 2
4 f2t2nbphmv 43 3340242.0 update submit update_listing Windows Desktop 14 5 16 1
In [ ]:
# Find the maximum and minimum secs_elapsed/duration for each user in sessions.
max_durations = pd.DataFrame(sessions.groupby(['user_id'], as_index = False)['secs_elapsed'].max())
sessions_summary = sessions_summary.merge(max_durations, on = 'user_id')
sessions_summary['max_duration'] = sessions_summary.secs_elapsed
sessions_summary = sessions_summary.drop('secs_elapsed', axis = 1)

print("max_durations is complete.")

min_durations = pd.DataFrame(sessions.groupby(['user_id'], as_index = False)['secs_elapsed'].min())
sessions_summary = sessions_summary.merge(min_durations, on = 'user_id')
sessions_summary['min_duration'] = sessions_summary.secs_elapsed
sessions_summary = sessions_summary.drop('secs_elapsed', axis = 1)

print("min_durations is complete.")
max_durations is complete.
min_durations is complete.
In [ ]:
# Find the average duration for each user
sessions_summary['avg_duration'] = sessions_summary.duration / sessions_summary.action_count
In [ ]:
sessions_summary.head(5)
Out[ ]:
user_id action_count duration action action_type action_detail device_type unique_actions unique_action_types unique_action_details unique_device_types max_duration min_duration avg_duration
0 lvs98g7ggz 12 113525.5 search_results click view_search_results Windows Desktop 3 3 3 1 31091.0 117.0 9460.458333
1 9hue70lsfi 22 439490.0 show view -unknown- -unknown- 7 5 8 2 244093.0 286.0 19976.818182
2 8wqf53khcc 468 7468072.0 show view p3 Mac Desktop 41 7 22 3 814514.0 0.0 15957.418803
3 q7jew74rm9 50 363874.0 show view user_profile iPhone 8 5 8 2 64106.0 12.0 7277.480000
4 f2t2nbphmv 43 3340242.0 update submit update_listing Windows Desktop 14 5 16 1 1107772.0 68.0 77680.046512
In [ ]:
# Add new features based on the type of device that the user used most frequently.
apple_device = ['Mac Desktop','iPhone','iPdad Tablet','iPodtouch']
desktop_device = ['Mac Desktop','Windows Desktop','Chromebook','Linux Desktop']
tablet_device = ['Android App Unknown Phone/Tablet','iPad Tablet','Tablet']
mobile_device = ['Android Phone','iPhone','Windows Phone','Blackberry','Opera Phone']

device_types = {'apple_device': apple_device, 
                'desktop_device': desktop_device,
                'tablet_device': tablet_device,
                'mobile_device': mobile_device}

for device in device_types:
    sessions_summary[device] = 0
    sessions_summary.loc[sessions_summary.device_type.isin(device_types[device]), device] = 1
In [ ]:
sessions_summary.head()
Out[ ]:
user_id action_count duration action action_type action_detail device_type unique_actions unique_action_types unique_action_details unique_device_types max_duration min_duration avg_duration tablet_device mobile_device desktop_device apple_device
0 lvs98g7ggz 12 113525.5 search_results click view_search_results Windows Desktop 3 3 3 1 31091.0 117.0 9460.458333 0 0 1 0
1 9hue70lsfi 22 439490.0 show view -unknown- -unknown- 7 5 8 2 244093.0 286.0 19976.818182 0 0 0 0
2 8wqf53khcc 468 7468072.0 show view p3 Mac Desktop 41 7 22 3 814514.0 0.0 15957.418803 0 0 1 1
3 q7jew74rm9 50 363874.0 show view user_profile iPhone 8 5 8 2 64106.0 12.0 7277.480000 0 1 0 1
4 f2t2nbphmv 43 3340242.0 update submit update_listing Windows Desktop 14 5 16 1 1107772.0 68.0 77680.046512 0 0 1 0
In [ ]:
# Check if there are any null values before merging with train.
sessions_summary.isnull().sum()
Out[ ]:
user_id                  0
action_count             0
duration                 0
action                   0
action_type              0
action_detail            0
device_type              0
unique_actions           0
unique_action_types      0
unique_action_details    0
unique_device_types      0
max_duration             0
min_duration             0
avg_duration             0
tablet_device            0
mobile_device            0
desktop_device           0
apple_device             0
dtype: int64
In [ ]:
print(sessions_summary.shape)
print(train.shape)
print(test.shape)
(135483, 18)
(213451, 16)
(62096, 15)
In [ ]:
# Merge train and test with sessions_summary
train1 = train.merge(sessions_summary, left_on = train['id'], right_on = sessions_summary['user_id'], how = 'inner')
# train2 is equal to the users that are not in train1
train2 = train[~train.id.isin(train1.id)]
train = pd.concat([train1, train2])

test1 = test.merge(sessions_summary, left_on = test['id'], right_on = sessions_summary['user_id'], how = 'inner')
# test2 is equal to the users that are not in test1
test2 = test[~test.id.isin(test1.id)]
test = pd.concat([test1, test2])

The next step is to transform the features so that they are ready for training the neural network.

Feature Engineering

In [ ]:
# Concatenate train and test because all transformations need to happen to both dataframes.
df = pd.concat([train,test])
In [ ]:
df.head()
Out[ ]:
action action_count action_detail action_type affiliate_channel affiliate_provider age apple_device avg_duration country_destination date_account_created date_first_booking desktop_device device_type duration first_affiliate_tracked first_browser first_device_type gender id language max_duration min_duration mobile_device signup_app signup_flow signup_method tablet_device timestamp_first_active unique_action_details unique_action_types unique_actions unique_device_types user_id
0 show 127.0 p3 view sem-non-brand google 62.0 0.0 27283.503937 other 2014-01-01 2014-01-04 1.0 Windows Desktop 3465005.0 omg Chrome Windows Desktop MALE d1mm9tcy42 en 606881.0 2.0 0.0 Web 0 basic 0.0 20140101000936 10.0 7.0 17.0 2.0 d1mm9tcy42
1 show 12.0 p3 view direct direct NaN 1.0 24522.125000 NDF 2014-01-01 NaN 1.0 Mac Desktop 294265.5 untracked Firefox Mac Desktop -unknown- yo8nz8bqcq en 115983.0 36.0 0.0 Web 0 basic 0.0 20140101001558 8.0 4.0 7.0 1.0 yo8nz8bqcq
2 create 16.0 -unknown- -unknown- sem-brand google NaN 0.0 71887.156250 NDF 2014-01-01 NaN 1.0 Windows Desktop 1150194.5 omg Firefox Windows Desktop -unknown- 4grx6yxeby en 336801.0 53.0 0.0 Web 0 basic 0.0 20140101001639 8.0 6.0 13.0 2.0 4grx6yxeby
3 ajax_refresh_subtotal 160.0 view_search_results click direct direct NaN 0.0 24384.262500 NDF 2014-01-01 NaN 1.0 Windows Desktop 3901482.0 linked Chrome Windows Desktop -unknown- ncf87guaf0 en 732296.0 0.0 0.0 Web 0 basic 0.0 20140101002146 13.0 7.0 19.0 3.0 ncf87guaf0
4 index 8.0 -unknown- -unknown- direct direct NaN 1.0 2163.187500 GB 2014-01-01 2014-01-02 0.0 iPhone 17305.5 untracked -unknown- iPhone -unknown- 4rvqpxoh3h en 14750.5 21.0 1.0 iOS 25 basic 0.0 20140101002619 1.0 1.0 7.0 1.0 4rvqpxoh3h
In [ ]:
df.shape
Out[ ]:
(275547, 34)
In [ ]:
df.isnull().sum()
Out[ ]:
action                     140064
action_count               140064
action_detail              140064
action_type                140064
affiliate_channel               0
affiliate_provider              0
age                        116866
apple_device               140064
avg_duration               140064
country_destination         62096
date_account_created            0
date_first_booking         186639
desktop_device             140064
device_type                140064
duration                   140064
first_affiliate_tracked      6085
first_browser                   0
first_device_type               0
gender                          0
id                              0
language                        0
max_duration               140064
min_duration               140064
mobile_device              140064
signup_app                      0
signup_flow                     0
signup_method                   0
tablet_device              140064
timestamp_first_active          0
unique_action_details      140064
unique_action_types        140064
unique_actions             140064
unique_device_types        140064
user_id                    140064
dtype: int64
In [ ]:
# We don't need this because we already have id and it has 0 null values.
df = df.drop('user_id', axis = 1)

Since there are many users that do not appear in the sessions dataframe, all of their values are NaN. Let's sort out those nulls values first.

In [ ]:
def missing_session_data_cat(feature):
    return df[feature].fillna("missing")

def missing_session_data_cont(feature):
    return df[feature].fillna(0)
In [ ]:
session_features_cat = ['action','action_detail','action_type','device_type']
session_features_cont = ['action_count','apple_device','desktop_device','mobile_device','tablet_device',
                         'duration','avg_duration','max_duration','min_duration','unique_action_details',
                         'unique_action_types','unique_actions','unique_device_types']

for feature in session_features_cat:
    df[feature] = missing_session_data_cat(feature)
    
for feature in session_features_cont:
    df[feature] = missing_session_data_cont(feature)

That removed most of the null values!

In [ ]:
df.isnull().sum()
Out[ ]:
action                          0
action_count                    0
action_detail                   0
action_type                     0
affiliate_channel               0
affiliate_provider              0
age                        116866
apple_device                    0
avg_duration                    0
country_destination         62096
date_account_created            0
date_first_booking         186639
desktop_device                  0
device_type                     0
duration                        0
first_affiliate_tracked      6085
first_browser                   0
first_device_type               0
gender                          0
id                              0
language                        0
max_duration                    0
min_duration                    0
mobile_device                   0
signup_app                      0
signup_flow                     0
signup_method                   0
tablet_device                   0
timestamp_first_active          0
unique_action_details           0
unique_action_types             0
unique_actions                  0
unique_device_types             0
dtype: int64
In [ ]:
df.action_count.describe()
Out[ ]:
count    275547.000000
mean         39.140927
std          88.944077
min           0.000000
25%           0.000000
50%           0.000000
75%          42.000000
max        2724.000000
Name: action_count, dtype: float64
In [ ]:
df[df.action_count > 0].action_count.describe()
Out[ ]:
count    135483.000000
mean         79.605301
std         113.439177
min           1.000000
25%          17.000000
50%          43.000000
75%          97.000000
max        2724.000000
Name: action_count, dtype: float64
In [ ]:
plt.hist(df[df.action_count > 0].action_count)
plt.yscale('log')
plt.show()
In [ ]:
# Group action_count into quartiles.
df['action_count_quartile'] = df.action_count.map(lambda x: 0 if x == 0 else (
                                                            1 if x <= 17 else (
                                                            2 if x <= 43 else (
                                                            3 if x <= 97 else 4))))
In [ ]:
df[df.age.notnull()].age.describe()
Out[ ]:
count    158681.000000
mean         47.145310
std         142.629468
min           1.000000
25%          28.000000
50%          33.000000
75%          42.000000
max        2014.000000
Name: age, dtype: float64
In [ ]:
plt.hist(df[df.age <= 100].age, bins = 80)
plt.show()

No one is 2014 years old. If anyone is older than 80, let's bring their age down to 80...I'm sure some of them wouldn't mind that.

In [ ]:
df.loc[df.age > 80, 'age'] = 80
In [ ]:
df[df.age.notnull()].age.describe()
Out[ ]:
count    158681.000000
mean         36.756436
std          12.770211
min           1.000000
25%          28.000000
50%          33.000000
75%          42.000000
max          80.000000
Name: age, dtype: float64

Let's see if there is a feature that is correlated with age, to help find good values for the nulls

In [ ]:
for feature in df.columns:
    if(df[feature].dtype == float or df[feature].dtype == int):
        correlation = stats.pearsonr(df[df.age.notnull()].age, df[df.age.notnull()][feature])
        print("Correlation with {} = {}".format(feature, correlation)) 
Correlation with action_count = (-0.064026876831618271, 8.935019331436594e-144)
Correlation with age = (1.0, 0.0)
Correlation with apple_device = (-0.11877768421316583, 0.0)
Correlation with avg_duration = (-0.015890959356089224, 2.4440110140227963e-10)
Correlation with desktop_device = (-0.04078523014721834, 2.1151097257467199e-59)
Correlation with duration = (-0.031522514371148634, 3.5056510866694804e-36)
Correlation with max_duration = (-0.046902158494160351, 5.5907076372683492e-78)
Correlation with min_duration = (-0.0061751882510133168, 0.013898465412299924)
Correlation with mobile_device = (-0.10935875994089284, 0.0)
Correlation with signup_flow = (-0.10945378061441423, 0.0)
Correlation with tablet_device = (0.032457538742075048, 2.94315571171229e-38)
Correlation with timestamp_first_active = (-0.10000080826874851, 0.0)
Correlation with unique_action_details = (-0.080635864696282103, 4.1759774725532945e-227)
Correlation with unique_action_types = (-0.087255839917087102, 1.0418634090552499e-265)
Correlation with unique_actions = (-0.073144851623982723, 3.9161188539179863e-187)
Correlation with unique_device_types = (-0.087245472405669544, 1.2041418624458218e-265)
Correlation with action_count_quartile = (-0.095963056084712631, 3.4732814902639632e-321)

Unfortunately, nothing is really correlated with age. Since there are too many missing values for age, I'm going to set the missing values equal to the median, 33.

In [ ]:
# Create age_group before filling in the nulls, so that the distribution is not altered.
df['age_group'] = df.age.map(lambda x: 0 if math.isnan(x) else (
                                       1 if x < 18 else (
                                       2 if x <= 33 else (
                                       3 if x <= 42 else 4))))
In [ ]:
df.age = df.age.fillna(33)
In [ ]:
df.age.isnull().sum()
Out[ ]:
0
In [ ]:
plt.figure(figsize=(8,4))
plt.hist(df.duration, bins = 100)
plt.title("Duration")
plt.xlabel("Duration")
plt.ylabel("Number of Users")
plt.yscale('log')
plt.show()
In [ ]:
plt.figure(figsize=(8,4))
plt.hist(df.avg_duration, bins = 100)
plt.title("Average Duration")
plt.xlabel("Average Duration")
plt.ylabel("Number of Users")
plt.yscale('log')
plt.show()
In [ ]:
print(df.duration.describe())
print()
print(df.avg_duration.describe())
count    2.755470e+05
mean     7.671002e+05
std      1.570985e+06
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      8.771662e+05
max      3.851321e+07
Name: duration, dtype: float64

count    275547.000000
mean      13423.996460
std       29962.460601
min           0.000000
25%           0.000000
50%           0.000000
75%       16553.788947
max      933452.333333
Name: avg_duration, dtype: float64
In [ ]:
print(np.percentile(df.duration, 50))
print(np.percentile(df.duration, 51))
print(np.percentile(df.duration, 75))
print()
print(np.percentile(df.avg_duration, 50))
print(np.percentile(df.avg_duration, 51))
print(np.percentile(df.avg_duration, 75))
0.0
678.46
877166.25

0.0
346.23
16553.7889474
In [ ]:
# Divide users into 3 equal-ish groups
df['duration_group'] = df.duration.map(lambda x: 0 if x == 0 else (
                                                 1 if x <= 877166.25 else 2))

df['avg_duration_group'] = df.avg_duration.map(lambda x: 0 if x == 0 else (
                                                         1 if x <= 16553.7889474 else 2))
In [ ]:
print(df.duration_group.value_counts())
print()
print(df.avg_duration_group.value_counts())
0    140064
2     68887
1     66596
Name: duration_group, dtype: int64

0    140064
2     68887
1     66596
Name: avg_duration_group, dtype: int64

There are too many unknowns to try to group them with a gender. I'm going to leave OTHER because it could/probably represent(s) people who do not identify as either male/female.

In [ ]:
df.gender.value_counts()
Out[ ]:
-unknown-    129480
FEMALE        77524
MALE          68209
OTHER           334
Name: gender, dtype: int64
In [ ]:
df.mobile_device.value_counts()
Out[ ]:
0.0    241225
1.0     34322
Name: mobile_device, dtype: int64
In [ ]:
df.signup_flow.value_counts()
Out[ ]:
0     206092
25     29834
12     11244
3       8822
2       6881
23      6408
24      4328
1       1047
8        315
6        301
21       197
5         36
20        14
16        11
15        10
14         4
10         2
4          1
Name: signup_flow, dtype: int64
In [ ]:
# If signup_flow == 0, signup_flow_simple == 0
# If signup_flow > 0, signup_flow_simple == 1
df['signup_flow_simple'] = df.signup_flow.map(lambda x: 0 if x == 0 else 1)
In [ ]:
df['signup_flow_simple'].value_counts()
Out[ ]:
0    206092
1     69455
Name: signup_flow_simple, dtype: int64
In [ ]:
df.tablet_device.value_counts()
Out[ ]:
0.0    262566
1.0     12981
Name: tablet_device, dtype: int64
In [ ]:
# Convert dates to datetime for manipulation
df.date_account_created = pd.to_datetime(df.date_account_created, format='%Y-%m-%d')
df.date_first_booking = pd.to_datetime(df.date_first_booking, format='%Y-%m-%d')
In [ ]:
# Check to make sure the date range makes sense.
print(df.date_account_created.min())
print(df.date_account_created.max())
print()
print(df.date_first_booking.min())
print(df.date_first_booking.max())
2010-01-01 00:00:00
2014-09-30 00:00:00

2010-01-02 00:00:00
2015-06-29 00:00:00
In [ ]:
# calendar contains more years of information than we need.
calendar = USFederalHolidayCalendar()
# Set holidays equal to the holidays in our date range.
holidays = calendar.holidays(start = df.date_account_created.min(), 
                             end = df.date_first_booking.max())

# us_bd contains more years of information than we need.
us_bd = CustomBusinessDay(calendar = USFederalHolidayCalendar())
# Set business_days equal to the work days in our date range.
business_days = pd.DatetimeIndex(start = df.date_account_created.min(), 
                                 end = df.date_first_booking.max(), 
                                 freq = us_bd)
In [ ]:
# Create date features
df['year_account_created'] = df.date_account_created.dt.year
df['month_account_created'] = df.date_account_created.dt.month
df['weekday_account_created'] = df.date_account_created.dt.weekday
df['business_day_account_created'] = df.date_account_created.isin(business_days)
df['business_day_account_created'] = df.business_day_account_created.map(lambda x: 1 if x == True else 0)
df['holiday_account_created'] = df.date_account_created.isin(holidays)
df['holiday_account_created'] = df.holiday_account_created.map(lambda x: 1 if x == True else 0)

df['year_first_booking'] = df.date_first_booking.dt.year
df['month_first_booking'] = df.date_first_booking.dt.month
df['weekday_first_booking'] = df.date_first_booking.dt.weekday
df['business_day_first_booking'] = df.date_first_booking.isin(business_days)
df['business_day_first_booking'] = df.business_day_first_booking.map(lambda x: 1 if x == True else 0)
df['holiday_first_booking'] = df.date_first_booking.isin(holidays)
df['holiday_first_booking'] = df.holiday_first_booking.map(lambda x: 1 if x == True else 0)

# Drop unneeded features
df = df.drop(["date_first_booking","date_account_created"], axis = 1)
In [ ]:
df.isnull().sum()
Out[ ]:
action                               0
action_count                         0
action_detail                        0
action_type                          0
affiliate_channel                    0
affiliate_provider                   0
age                                  0
apple_device                         0
avg_duration                         0
country_destination              62096
desktop_device                       0
device_type                          0
duration                             0
first_affiliate_tracked           6085
first_browser                        0
first_device_type                    0
gender                               0
id                                   0
language                             0
max_duration                         0
min_duration                         0
mobile_device                        0
signup_app                           0
signup_flow                          0
signup_method                        0
tablet_device                        0
timestamp_first_active               0
unique_action_details                0
unique_action_types                  0
unique_actions                       0
unique_device_types                  0
action_count_quartile                0
age_group                            0
duration_group                       0
avg_duration_group                   0
signup_flow_simple                   0
year_account_created                 0
month_account_created                0
weekday_account_created              0
business_day_account_created         0
holiday_account_created              0
year_first_booking              186639
month_first_booking             186639
weekday_first_booking           186639
business_day_first_booking           0
holiday_first_booking                0
dtype: int64
In [ ]:
# Set nulls values equal to one less than the minimum.
# I could set the nulls to 0, but the scale would be ugly when we normalize the features.
df.year_first_booking = df.year_first_booking.fillna(min(df.year_first_booking) - 1)
df.month_first_booking = df.month_first_booking.fillna(min(df.month_first_booking) - 1)
df.weekday_first_booking += 1
df.weekday_first_booking = df.weekday_first_booking.fillna(0)
In [ ]:
df.isnull().sum()
Out[ ]:
action                              0
action_count                        0
action_detail                       0
action_type                         0
affiliate_channel                   0
affiliate_provider                  0
age                                 0
apple_device                        0
avg_duration                        0
country_destination             62096
desktop_device                      0
device_type                         0
duration                            0
first_affiliate_tracked          6085
first_browser                       0
first_device_type                   0
gender                              0
id                                  0
language                            0
max_duration                        0
min_duration                        0
mobile_device                       0
signup_app                          0
signup_flow                         0
signup_method                       0
tablet_device                       0
timestamp_first_active              0
unique_action_details               0
unique_action_types                 0
unique_actions                      0
unique_device_types                 0
action_count_quartile               0
age_group                           0
duration_group                      0
avg_duration_group                  0
signup_flow_simple                  0
year_account_created                0
month_account_created               0
weekday_account_created             0
business_day_account_created        0
holiday_account_created             0
year_first_booking                  0
month_first_booking                 0
weekday_first_booking               0
business_day_first_booking          0
holiday_first_booking               0
dtype: int64
In [ ]:
df.first_affiliate_tracked.value_counts()
Out[ ]:
untracked        143181
linked            62064
omg               54859
tracked-other      6655
product            2353
marketing           281
local ops            69
Name: first_affiliate_tracked, dtype: int64

For the missing values for "first_affiliate_tracked" I am going to set these equal to "untracked". Not only is this the most common value, but it makes sense that if we are missing data on these people that they would not have been tracked.

In [ ]:
df.first_affiliate_tracked = df.first_affiliate_tracked.fillna("untracked")
In [ ]:
df.isnull().sum()
Out[ ]:
action                              0
action_count                        0
action_detail                       0
action_type                         0
affiliate_channel                   0
affiliate_provider                  0
age                                 0
apple_device                        0
avg_duration                        0
country_destination             62096
desktop_device                      0
device_type                         0
duration                            0
first_affiliate_tracked             0
first_browser                       0
first_device_type                   0
gender                              0
id                                  0
language                            0
max_duration                        0
min_duration                        0
mobile_device                       0
signup_app                          0
signup_flow                         0
signup_method                       0
tablet_device                       0
timestamp_first_active              0
unique_action_details               0
unique_action_types                 0
unique_actions                      0
unique_device_types                 0
action_count_quartile               0
age_group                           0
duration_group                      0
avg_duration_group                  0
signup_flow_simple                  0
year_account_created                0
month_account_created               0
weekday_account_created             0
business_day_account_created        0
holiday_account_created             0
year_first_booking                  0
month_first_booking                 0
weekday_first_booking               0
business_day_first_booking          0
holiday_first_booking               0
dtype: int64

Everything is all clean (the null values in 'country_destination' belong to the testing data). Now let's explore the categorical features that might have too many values and reduce that number before we do one-hot encoding.

In [ ]:
df.head()
Out[ ]:
action action_count action_detail action_type affiliate_channel affiliate_provider age apple_device avg_duration country_destination desktop_device device_type duration first_affiliate_tracked first_browser first_device_type gender id language max_duration min_duration mobile_device signup_app signup_flow signup_method tablet_device timestamp_first_active unique_action_details unique_action_types unique_actions unique_device_types action_count_quartile age_group duration_group avg_duration_group signup_flow_simple year_account_created month_account_created weekday_account_created business_day_account_created holiday_account_created year_first_booking month_first_booking weekday_first_booking business_day_first_booking holiday_first_booking
0 show 127.0 p3 view sem-non-brand google 62.0 0.0 27283.503937 other 1.0 Windows Desktop 3465005.0 omg Chrome Windows Desktop MALE d1mm9tcy42 en 606881.0 2.0 0.0 Web 0 basic 0.0 20140101000936 10.0 7.0 17.0 2.0 4 4 2 2 0 2014 1 2 0 1 2014.0 1.0 6.0 0 0
1 show 12.0 p3 view direct direct 33.0 1.0 24522.125000 NDF 1.0 Mac Desktop 294265.5 untracked Firefox Mac Desktop -unknown- yo8nz8bqcq en 115983.0 36.0 0.0 Web 0 basic 0.0 20140101001558 8.0 4.0 7.0 1.0 1 0 1 2 0 2014 1 2 0 1 2009.0 0.0 0.0 0 0
2 create 16.0 -unknown- -unknown- sem-brand google 33.0 0.0 71887.156250 NDF 1.0 Windows Desktop 1150194.5 omg Firefox Windows Desktop -unknown- 4grx6yxeby en 336801.0 53.0 0.0 Web 0 basic 0.0 20140101001639 8.0 6.0 13.0 2.0 1 0 2 2 0 2014 1 2 0 1 2009.0 0.0 0.0 0 0
3 ajax_refresh_subtotal 160.0 view_search_results click direct direct 33.0 0.0 24384.262500 NDF 1.0 Windows Desktop 3901482.0 linked Chrome Windows Desktop -unknown- ncf87guaf0 en 732296.0 0.0 0.0 Web 0 basic 0.0 20140101002146 13.0 7.0 19.0 3.0 4 0 2 2 0 2014 1 2 0 1 2009.0 0.0 0.0 0 0
4 index 8.0 -unknown- -unknown- direct direct 33.0 1.0 2163.187500 GB 0.0 iPhone 17305.5 untracked -unknown- iPhone -unknown- 4rvqpxoh3h en 14750.5 21.0 1.0 iOS 25 basic 0.0 20140101002619 1.0 1.0 7.0 1.0 1 0 1 1 1 2014 1 2 0 1 2014.0 1.0 4.0 1 0
In [ ]:
df.first_browser.value_counts()
Out[ ]:
Chrome                  78671
Safari                  53302
-unknown-               44394
Firefox                 38665
Mobile Safari           29636
IE                      24744
Chrome Mobile            3186
Android Browser          1577
AOL Explorer              254
Opera                     228
Silk                      172
IE Mobile                 118
BlackBerry Browser         89
Chromium                   83
Mobile Firefox             64
Maxthon                    60
Apple Mail                 45
Sogou Explorer             43
SiteKiosk                  27
RockMelt                   24
Iron                       24
Yandex.Browser             14
IceWeasel                  14
Pale Moon                  13
CometBird                  12
SeaMonkey                  12
Camino                      9
TenFourFox                  8
Opera Mini                  8
wOSBrowser                  7
CoolNovo                    6
Avant Browser               4
Opera Mobile                4
Mozilla                     3
Flock                       2
Comodo Dragon               2
SlimBrowser                 2
OmniWeb                     2
Crazy Browser               2
TheWorld Browser            2
IceDragon                   1
Conkeror                    1
Googlebot                   1
Kindle Browser              1
IBrowse                     1
Nintendo Browser            1
Outlook 2007                1
NetNewsWire                 1
Epic                        1
PS Vita browser             1
Google Earth                1
Palm Pre web browser        1
UC Browser                  1
Arora                       1
Stainless                   1
Name: first_browser, dtype: int64
In [ ]:
# Create a new feature for those using mobile browsers
mobile_browsers = ['Mobile Safari','Chrome Mobile','IE Mobile','Mobile Firefox','Android Browser']
df.loc[df.first_browser.isin(mobile_browsers), "first_browser"] = "Mobile"
In [ ]:
# The cut_off is set at 0.5% of the data. If a value is not common enough, it will be grouped into something generic.
cut_off = 1378

other_browsers = []
for browser, count in df.first_browser.value_counts().iteritems():
    if count < cut_off:
        other_browsers.append(browser)
   
df.loc[df.first_browser.isin(other_browsers), "first_browser"] = "Other"

print(other_browsers)
['AOL Explorer', 'Opera', 'Silk', 'BlackBerry Browser', 'Chromium', 'Maxthon', 'Apple Mail', 'Sogou Explorer', 'SiteKiosk', 'RockMelt', 'Iron', 'Yandex.Browser', 'IceWeasel', 'Pale Moon', 'CometBird', 'SeaMonkey', 'Camino', 'TenFourFox', 'Opera Mini', 'wOSBrowser', 'CoolNovo', 'Opera Mobile', 'Avant Browser', 'Mozilla', 'Crazy Browser', 'SlimBrowser', 'TheWorld Browser', 'Flock', 'Comodo Dragon', 'OmniWeb', 'Conkeror', 'IBrowse', 'Nintendo Browser', 'Arora', 'Stainless', 'IceDragon', 'Epic', 'Googlebot', 'Outlook 2007', 'NetNewsWire', 'Google Earth', 'Palm Pre web browser', 'PS Vita browser', 'Kindle Browser', 'UC Browser']
In [ ]:
df.first_browser.value_counts()
Out[ ]:
Chrome       78671
Safari       53302
-unknown-    44394
Firefox      38665
Mobile       34581
IE           24744
Other         1190
Name: first_browser, dtype: int64
In [ ]:
df.language.value_counts()
Out[ ]:
en           265538
zh             2634
fr             1508
es             1174
ko             1116
de              977
it              633
ru              508
ja              345
pt              322
sv              176
nl              134
tr               92
da               75
pl               75
no               51
cs               49
el               30
th               28
hu               25
id               23
fi               20
ca                6
is                5
hr                2
-unknown-         1
Name: language, dtype: int64

I think that language might be a more important feature than some others, so I will decrease the cut off to 275, or 0.1% of the data.

In [ ]:
other_languages = []
for language, count in df.language.value_counts().iteritems():
    if count < 275:
        other_languages.append(language)
    
print(other_languages)

df.loc[df.language.isin(other_languages), "language"] = "Other"
['sv', 'nl', 'tr', 'da', 'pl', 'no', 'cs', 'el', 'th', 'hu', 'id', 'fi', 'ca', 'is', 'hr', '-unknown-']
In [ ]:
df.language.value_counts()
Out[ ]:
en       265538
zh         2634
fr         1508
es         1174
ko         1116
de          977
Other       792
it          633
ru          508
ja          345
pt          322
Name: language, dtype: int64
In [ ]:
# New feature for languages that are not English.
df['not_English'] = df.language.map(lambda x: 0 if x == 'en' else 1)
In [ ]:
df.action.value_counts()
Out[ ]:
missing                                         140064
show                                             62664
search_results                                    9979
index                                             8054
create                                            6464
dashboard                                         5549
active                                            5198
update                                            5087
search                                            4962
requested                                         3156
authenticate                                      2294
edit                                              2284
personalize                                       2119
header_userpic                                    1861
ask_question                                      1753
ajax_refresh_subtotal                             1704
lookup                                            1158
identity                                          1012
message                                            709
cancellation_policies                              704
click                                              605
track_page_view                                    561
confirm_email                                      538
qt2                                                469
ajax_photo_widget_form_iframe                      445
reviews                                            425
ajax_check_dates                                   374
notifications                                      342
calendar_tab_inner2                                286
callback                                           283
                                                 ...  
recent_reservations                                  2
airbnb_picks                                         2
11                                                   2
apply                                                2
phone_verification_number_submitted_for_sms          2
my                                                   1
phone_verification_number_submitted_for_call         1
confirmation                                         1
book                                                 1
travel_plans_previous                                1
rate                                                 1
badge                                                1
top_destinations                                     1
show_personalize                                     1
spoken_languages                                     1
concierge                                            1
new_session                                          1
place_worth                                          1
other_hosting_reviews_first                          1
acculynk_pin_pad_inactive                            1
view                                                 1
ajax_price_and_availability                          1
clickthrough                                         1
google_importer                                      1
ajax_referral_banner_type                            1
photography                                          1
review_page                                          1
salute                                               1
home_safety_landing                                  1
requirements                                         1
Name: action, dtype: int64
In [ ]:
other_actions = []
for action, count in df.action.value_counts().iteritems():
    if count < cut_off:
        other_actions.append(action)
    
print(other_actions)

df.loc[df.action.isin(other_actions), "action"] = "Other"
['lookup', 'identity', 'message', 'cancellation_policies', 'click', 'track_page_view', 'confirm_email', 'qt2', 'ajax_photo_widget_form_iframe', 'reviews', 'ajax_check_dates', 'notifications', 'calendar_tab_inner2', 'callback', 'message_to_host_focus', 'similar_listings', 'edit_verification', 'apply_reservation', 'ajax_get_referrals_amt', 'manage_listing', 'unavailabilities', 'payment_methods', 'impressions', 'collections', 'campaigns', 'tos_confirm', 'coupon_field_focus', 'faq_category', 'travel_plans_current', 'faq', 'similar_listings_v2', 'pending', 'complete_status', 'new', 'references', 'populate_help_dropdown', 'endpoint_error', 'available', 'set_password', 'agree_terms_check', 'apply_coupon_click', 'account', 'custom_recommended_destinations', 'status', 'kba_update', 'message_to_host_change', '10', 'reviews_new', 'login', 'referrer_status', 'at_checkpoint', 'populate_from_facebook', 'signup_login', 'decision_tree', 'tell_a_friend', 'hosting_social_proof', 'position', 'create_multiple', 'listings', 'settings', 'contact_new', 'this_hosting_reviews', 'jumio_token', 'ajax_image_upload', 'terms', 'kba', 'profile_pic', 'delete', 'ajax_lwlb_contact', 'coupon_code_click', 'facebook_auto_login', 'phone_verification_error', '12', 'department', 'issue', 'itinerary', 'ajax_statsd', 'glob', 'open_graph_setting', 'forgot_password', 'authorize', 'about_us', 'connect', 'privacy', 'payout_preferences', 'social_connections', 'patch', 'signup_modal', 'localization_settings', 'read_policy_click', 'apply_code', 'this_hosting_reviews_3000', 'request_new_confirm_email', 'signed_out_modal', 'payment_instruments', 'pending_tickets', 'update_cached', 'host_summary', 'reputation', 'login_modal', 'ajax_google_translate_description', 'verify', 'other_hosting_reviews', 'office_location', 'departments', 'set_user', '15', 'recommend', 'invalid_action', 'hospitality', 'remove_dashboard_alert', 'change_currency', 'cancellation_policy_click', 'signature', 'pay', 'my_reservations', 'recommendations', 'mobile_landing_page', 'ajax_referral_banner_experiment_type', 'handle_vanity_url', 'ajax_google_translate', 'change', 'recent_reservations', 'airbnb_picks', '11', 'apply', 'phone_verification_number_submitted_for_sms', 'my', 'phone_verification_number_submitted_for_call', 'confirmation', 'book', 'travel_plans_previous', 'rate', 'badge', 'top_destinations', 'show_personalize', 'spoken_languages', 'concierge', 'new_session', 'place_worth', 'other_hosting_reviews_first', 'acculynk_pin_pad_inactive', 'view', 'ajax_price_and_availability', 'clickthrough', 'google_importer', 'ajax_referral_banner_type', 'photography', 'review_page', 'salute', 'home_safety_landing', 'requirements']
In [ ]:
df.action.value_counts()
Out[ ]:
missing                  140064
show                      62664
Other                     12355
search_results             9979
index                      8054
create                     6464
dashboard                  5549
active                     5198
update                     5087
search                     4962
requested                  3156
authenticate               2294
edit                       2284
personalize                2119
header_userpic             1861
ask_question               1753
ajax_refresh_subtotal      1704
Name: action, dtype: int64
In [ ]:
df.action_detail.value_counts()
Out[ ]:
missing                        140069
-unknown-                       34464
p3                              29932
view_search_results             27361
user_profile                    11608
dashboard                        4652
update_listing                   4343
header_userpic                   2460
p5                               2392
create_user                      2008
message_thread                   1777
change_trip_characteristics      1514
contact_host                     1391
edit_profile                     1135
confirm_email_link                963
wishlist_content_update           952
message_post                      756
cancellation_policies             731
track_page_view                   588
login                             560
signup                            495
create_phone_numbers              459
lookup                            405
similar_listings                  372
list_your_space                   359
p1                                307
change_contact_host_dates         255
book_it                           253
unavailable_dates                 220
listing_reviews                   213
                                ...  
view_reservations                   7
view_listing                        7
set_password_page                   7
forgot_password                     7
read_policy_click                   6
signup_modal                        5
listing_recommendations             5
listing_descriptions                5
apply_coupon_click                  4
account_privacy_settings            4
previous_trips                      4
user_tax_forms                      3
your_reservations                   3
login_modal                         3
user_listings                       3
modify_reservations                 3
cancellation_policy_click           3
admin_templates                     2
profile_reviews                     2
translations                        2
change_or_alter                     2
oauth_login                         1
user_profile_content_update         1
complete_booking                    1
modify_users                        1
friends_wishlists                   1
airbnb_picks_wishlists              1
alteration_field                    1
host_home                           1
guest_receipt                       1
Name: action_detail, dtype: int64
In [ ]:
other_action_details = []
for action_detail, count in df.action_detail.value_counts().iteritems():
    if count < cut_off:
        other_action_details.append(action_detail)
    
print(other_action_details)

df.loc[df.action_detail.isin(other_action_details), "action_detail"] = "Other"
['edit_profile', 'confirm_email_link', 'wishlist_content_update', 'message_post', 'cancellation_policies', 'track_page_view', 'login', 'signup', 'create_phone_numbers', 'lookup', 'similar_listings', 'list_your_space', 'p1', 'change_contact_host_dates', 'book_it', 'unavailable_dates', 'listing_reviews', 'message_to_host_focus', 'manage_listing', 'user_wishlists', 'oauth_response', 'account_notification_settings', 'apply_coupon', 'login_page', 'p4', 'update_listing_description', 'message_inbox', 'profile_verifications', 'your_trips', 'your_listings', 'trip_availability', 'update_user_profile', 'notifications', 'profile_references', 'create_listing', 'signup_login_page', 'wishlist', 'user_reviews', 'pending', 'message_to_host_change', 'reservations', 'instant_book', 'request_to_book', 'set_password', 'at_checkpoint', 'listing_reviews_page', 'coupon_field_focus', 'user_social_connections', 'update_user', 'terms_and_privacy', 'account_payment_methods', 'coupon_code_click', 'account_payout_preferences', 'guest_itinerary', 'view_reservations', 'view_listing', 'set_password_page', 'forgot_password', 'read_policy_click', 'signup_modal', 'listing_recommendations', 'listing_descriptions', 'apply_coupon_click', 'account_privacy_settings', 'previous_trips', 'user_tax_forms', 'your_reservations', 'login_modal', 'user_listings', 'modify_reservations', 'cancellation_policy_click', 'admin_templates', 'profile_reviews', 'translations', 'change_or_alter', 'oauth_login', 'user_profile_content_update', 'complete_booking', 'modify_users', 'friends_wishlists', 'airbnb_picks_wishlists', 'alteration_field', 'host_home', 'guest_receipt']
In [ ]:
df.action_detail.value_counts()
Out[ ]:
missing                        140069
-unknown-                       34464
p3                              29932
view_search_results             27361
user_profile                    11608
Other                           11576
dashboard                        4652
update_listing                   4343
header_userpic                   2460
p5                               2392
create_user                      2008
message_thread                   1777
change_trip_characteristics      1514
contact_host                     1391
Name: action_detail, dtype: int64
In [ ]:
df.action_type.value_counts()
Out[ ]:
missing             140070
view                 77695
-unknown-            17573
click                17307
data                 14198
submit                7056
message_post           990
track_page_view        293
partner_callback       132
booking_request        122
lookup                 109
modify                   2
Name: action_type, dtype: int64
In [ ]:
other_action_types = []
for action_type, count in df.action_type.value_counts().iteritems():
    if count < 1378:
        other_action_types.append(action_type)
    
print(other_action_types)

df.loc[df.action_type.isin(other_action_types), "action_type"] = "Other"
['message_post', 'track_page_view', 'partner_callback', 'booking_request', 'lookup', 'modify']
In [ ]:
df.action_type.value_counts()
Out[ ]:
missing      140070
view          77695
-unknown-     17573
click         17307
data          14198
submit         7056
Other          1648
Name: action_type, dtype: int64
In [ ]:
df.affiliate_provider.value_counts()
Out[ ]:
direct                 181270
google                  65956
other                   13036
facebook                 3996
bing                     3719
craigslist               3475
padmapper                 836
vast                      830
yahoo                     653
facebook-open-graph       566
gsp                       455
meetup                    358
email-marketing           270
naver                      66
baidu                      32
yandex                     18
wayn                        8
daum                        3
Name: affiliate_provider, dtype: int64
In [ ]:
other_affiliate_providers = []
for affiliate_provider, count in df.affiliate_provider.value_counts().iteritems():
    if count < cut_off:
        other_affiliate_providers.append(affiliate_provider)
    
print(other_affiliate_providers)

df.loc[df.affiliate_provider.isin(other_affiliate_providers), "affiliate_provider"] = "other"
['padmapper', 'vast', 'yahoo', 'facebook-open-graph', 'gsp', 'meetup', 'email-marketing', 'naver', 'baidu', 'yandex', 'wayn', 'daum']
In [ ]:
df.affiliate_provider.value_counts()
Out[ ]:
direct        181270
google         65956
other          17131
facebook        3996
bing            3719
craigslist      3475
Name: affiliate_provider, dtype: int64
In [ ]:
df.device_type.value_counts()
Out[ ]:
missing                             140064
Mac Desktop                          44271
Windows Desktop                      37221
iPhone                               26571
iPad Tablet                           8879
Android Phone                         7666
-unknown-                             5801
Android App Unknown Phone/Tablet      2634
Tablet                                1468
Linux Desktop                          428
Chromebook                             374
iPodtouch                               85
Windows Phone                           56
Blackberry                              27
Opera Phone                              2
Name: device_type, dtype: int64
In [ ]:
other_device_types = []
for device_type, count in df.device_type.value_counts().iteritems():
    if count < 1378:
        other_device_types.append(device_type)
    
print(other_device_types)

df.loc[df.device_type.isin(other_device_types), "device_type"] = "Other"
['Linux Desktop', 'Chromebook', 'iPodtouch', 'Windows Phone', 'Blackberry', 'Opera Phone']
In [ ]:
df.device_type.value_counts()
Out[ ]:
missing                             140064
Mac Desktop                          44271
Windows Desktop                      37221
iPhone                               26571
iPad Tablet                           8879
Android Phone                         7666
-unknown-                             5801
Android App Unknown Phone/Tablet      2634
Tablet                                1468
Other                                  972
Name: device_type, dtype: int64
In [ ]:
df.signup_method.value_counts()
Out[ ]:
basic       198222
facebook     74864
google        2438
weibo           23
Name: signup_method, dtype: int64
In [ ]:
# Create a new dataframe for the labels
labels = pd.DataFrame(df.country_destination)
df = df.drop("country_destination", axis = 1)
In [ ]:
labels.head()
Out[ ]:
country_destination
0 other
1 NDF
2 NDF
3 NDF
4 GB
In [ ]:
# Drop id since it is no longer needed.
df = df.drop('id', axis = 1)
In [ ]:
# Group all features as either continuous (cont) or categorical (cat)
cont_features = []
cat_features = []

for feature in df.columns:
    if df[feature].dtype == float or df[feature].dtype == int:
        cont_features.append(feature)
    elif df[feature].dtype == object:
        cat_features.append(feature)
In [ ]:
# Check to ensure that we have all of the features
print(cat_features)
print()
print(cont_features)
print()
print(len(cat_features) + len(cont_features))
print(df.shape[1])
['action', 'action_detail', 'action_type', 'affiliate_channel', 'affiliate_provider', 'device_type', 'first_affiliate_tracked', 'first_browser', 'first_device_type', 'gender', 'language', 'signup_app', 'signup_method']

['action_count', 'age', 'apple_device', 'avg_duration', 'desktop_device', 'duration', 'max_duration', 'min_duration', 'mobile_device', 'signup_flow', 'tablet_device', 'timestamp_first_active', 'unique_action_details', 'unique_action_types', 'unique_actions', 'unique_device_types', 'action_count_quartile', 'age_group', 'duration_group', 'avg_duration_group', 'signup_flow_simple', 'year_account_created', 'month_account_created', 'weekday_account_created', 'business_day_account_created', 'holiday_account_created', 'year_first_booking', 'month_first_booking', 'weekday_first_booking', 'business_day_first_booking', 'holiday_first_booking', 'not_English']

45
45
In [ ]:
# Although dates have continuous values, they should be treated as categorical features.
date_features = ['year_account_created','month_account_created','weekday_account_created',
                      'year_first_booking','month_first_booking','weekday_first_booking']
for feature in date_features:
    cont_features.remove(feature)
    cat_features.append(feature)
In [ ]:
for feature in cat_features:
    # Create dummies of each value of a categorical feature
    dummies = pd.get_dummies(df[feature], prefix = feature, drop_first = False)
    # Drop the unneeded feature
    df = df.drop(feature, axis = 1)
    df = pd.concat([df, dummies], axis=1)
    print("{} is complete".format(feature))
action is complete
action_detail is complete
action_type is complete
affiliate_channel is complete
affiliate_provider is complete
device_type is complete
first_affiliate_tracked is complete
first_browser is complete
first_device_type is complete
gender is complete
language is complete
signup_app is complete
signup_method is complete
year_account_created is complete
month_account_created is complete
weekday_account_created is complete
year_first_booking is complete
month_first_booking is complete
weekday_first_booking is complete
In [ ]:
min_max_scaler = preprocessing.MinMaxScaler()
# Normalize the continuous features
for feature in cont_features:
    df.loc[:,feature] = min_max_scaler.fit_transform(df[feature])
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:321: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:356: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:321: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:356: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:321: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:356: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:321: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:356: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:321: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:356: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:321: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:356: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:321: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:356: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:321: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:356: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:321: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:356: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/utils/validation.py:429: DataConversionWarning: Data with input dtype int64 was converted to float64 by MinMaxScaler.
  warnings.warn(msg, _DataConversionWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:321: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:356: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:321: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:356: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:321: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:356: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:321: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:356: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:321: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:356: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:321: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:356: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:321: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:356: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:321: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:356: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:321: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:356: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:321: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:356: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:321: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:356: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:321: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:356: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:321: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:356: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:321: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:356: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:321: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:356: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:321: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:356: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:321: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
//anaconda/lib/python3.5/site-packages/sklearn/preprocessing/data.py:356: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
In [ ]:
df.head()
Out[ ]:
action_count age apple_device avg_duration desktop_device duration max_duration min_duration mobile_device signup_flow tablet_device timestamp_first_active unique_action_details unique_action_types unique_actions unique_device_types action_count_quartile age_group duration_group avg_duration_group signup_flow_simple business_day_account_created holiday_account_created business_day_first_booking holiday_first_booking not_English action_Other action_active action_ajax_refresh_subtotal action_ask_question action_authenticate action_create action_dashboard action_edit action_header_userpic action_index action_missing action_personalize action_requested action_search action_search_results action_show action_update action_detail_-unknown- action_detail_Other action_detail_change_trip_characteristics action_detail_contact_host action_detail_create_user action_detail_dashboard action_detail_header_userpic action_detail_message_thread action_detail_missing action_detail_p3 action_detail_p5 action_detail_update_listing action_detail_user_profile action_detail_view_search_results action_type_-unknown- action_type_Other action_type_click action_type_data action_type_missing action_type_submit action_type_view affiliate_channel_api affiliate_channel_content affiliate_channel_direct affiliate_channel_other affiliate_channel_remarketing affiliate_channel_sem-brand affiliate_channel_sem-non-brand affiliate_channel_seo affiliate_provider_bing affiliate_provider_craigslist affiliate_provider_direct affiliate_provider_facebook affiliate_provider_google affiliate_provider_other device_type_-unknown- device_type_Android App Unknown Phone/Tablet device_type_Android Phone device_type_Mac Desktop device_type_Other device_type_Tablet device_type_Windows Desktop device_type_iPad Tablet device_type_iPhone device_type_missing first_affiliate_tracked_linked first_affiliate_tracked_local ops first_affiliate_tracked_marketing first_affiliate_tracked_omg first_affiliate_tracked_product first_affiliate_tracked_tracked-other first_affiliate_tracked_untracked first_browser_-unknown- first_browser_Chrome first_browser_Firefox first_browser_IE first_browser_Mobile first_browser_Other first_browser_Safari first_device_type_Android Phone first_device_type_Android Tablet first_device_type_Desktop (Other) first_device_type_Mac Desktop first_device_type_Other/Unknown first_device_type_SmartPhone (Other) first_device_type_Windows Desktop first_device_type_iPad first_device_type_iPhone gender_-unknown- gender_FEMALE gender_MALE gender_OTHER language_Other language_de language_en language_es language_fr language_it language_ja language_ko language_pt language_ru language_zh signup_app_Android signup_app_Moweb signup_app_Web signup_app_iOS signup_method_basic signup_method_facebook signup_method_google signup_method_weibo year_account_created_2010 year_account_created_2011 year_account_created_2012 year_account_created_2013 year_account_created_2014 month_account_created_1 month_account_created_2 month_account_created_3 month_account_created_4 month_account_created_5 month_account_created_6 month_account_created_7 month_account_created_8 month_account_created_9 month_account_created_10 month_account_created_11 month_account_created_12 weekday_account_created_0 weekday_account_created_1 weekday_account_created_2 weekday_account_created_3 weekday_account_created_4 weekday_account_created_5 weekday_account_created_6 year_first_booking_2009.0 year_first_booking_2010.0 year_first_booking_2011.0 year_first_booking_2012.0 year_first_booking_2013.0 year_first_booking_2014.0 year_first_booking_2015.0 month_first_booking_0.0 month_first_booking_1.0 month_first_booking_2.0 month_first_booking_3.0 month_first_booking_4.0 month_first_booking_5.0 month_first_booking_6.0 month_first_booking_7.0 month_first_booking_8.0 month_first_booking_9.0 month_first_booking_10.0 month_first_booking_11.0 month_first_booking_12.0 weekday_first_booking_0.0 weekday_first_booking_1.0 weekday_first_booking_2.0 weekday_first_booking_3.0 weekday_first_booking_4.0 weekday_first_booking_5.0 weekday_first_booking_6.0 weekday_first_booking_7.0
0 0.046623 0.772152 0.0 0.029229 1.0 0.089969 0.337160 0.000039 0.0 0.0 0.0 0.983616 0.181818 0.636364 0.232877 0.333333 1.00 1.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1 0.004405 0.405063 1.0 0.026270 1.0 0.007641 0.064436 0.000704 0.0 0.0 0.0 0.983616 0.145455 0.363636 0.095890 0.166667 0.25 0.0 0.5 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.005874 0.405063 0.0 0.077012 1.0 0.029865 0.187114 0.001037 0.0 0.0 0.0 0.983616 0.145455 0.545455 0.178082 0.333333 0.25 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.058737 0.405063 0.0 0.026123 1.0 0.101302 0.406836 0.000000 0.0 0.0 0.0 0.983616 0.236364 0.636364 0.260274 0.500000 1.00 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.002937 0.405063 1.0 0.002317 0.0 0.000449 0.008195 0.000411 1.0 1.0 0.0 0.983616 0.018182 0.090909 0.095890 0.166667 0.25 0.0 0.5 0.5 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
In [ ]:
# Split df into training and testing data
df_train = df[:len(train)]
df_test = df[len(train):]

# Shorten labels to length of the training data
y = labels[:len(train)]
In [ ]:
# Create dummy features for each country
y_dummies = pd.get_dummies(y, drop_first = False)
y = pd.concat([y, y_dummies], axis=1)
y = y.drop("country_destination", axis = 1)
y.head()
Out[ ]:
country_destination_AU country_destination_CA country_destination_DE country_destination_ES country_destination_FR country_destination_GB country_destination_IT country_destination_NDF country_destination_NL country_destination_PT country_destination_US country_destination_other
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
In [ ]:
print(df_train.shape)
print(df_test.shape)
print(y.shape)
(213451, 186)
(62096, 186)
(213451, 12)
In [ ]:
# Take a look to see how common each country is.
train.country_destination.value_counts() 
Out[ ]:
NDF      124543
US        62376
other     10094
FR         5023
IT         2835
GB         2324
ES         2249
CA         1428
DE         1061
NL          762
AU          539
PT          217
Name: country_destination, dtype: int64
In [ ]:
# Find the order of the features
y.columns
Out[ ]:
Index(['country_destination_AU', 'country_destination_CA',
       'country_destination_DE', 'country_destination_ES',
       'country_destination_FR', 'country_destination_GB',
       'country_destination_IT', 'country_destination_NDF',
       'country_destination_NL', 'country_destination_PT',
       'country_destination_US', 'country_destination_other'],
      dtype='object')

Due to the imbalance in the data, we are going to set the sum of each feature equal to each other. This will help the neural network to train because it won't be biased to reducing the NDF errors since that would have the greatest effect in the cost function.

In [ ]:
y[y.columns[0]] *= len(y)/539
y[y.columns[1]] *= len(y)/1428
y[y.columns[2]] *= len(y)/1061
y[y.columns[3]] *= len(y)/2249
y[y.columns[4]] *= len(y)/5023
y[y.columns[5]] *= len(y)/2324
y[y.columns[6]] *= len(y)/2835
y[y.columns[7]] *= len(y)/124543
y[y.columns[8]] *= len(y)/762
y[y.columns[9]] *= len(y)/217
y[y.columns[10]] *= len(y)/62376
y[y.columns[11]] *= len(y)/10094
In [ ]:
# Check the sum of each feature
totals = []
for i in range(12):
    totals.append(sum(y[y.columns[i]]))
totals
Out[ ]:
[213450.99999999805,
 213451.00000000515,
 213451.00000000218,
 213451.00000000844,
 213450.99999998632,
 213450.99999999162,
 213451.00000000955,
 213450.99999959889,
 213450.99999999884,
 213451.00000000067,
 213451.00000015536,
 213451.00000003129]
In [ ]:
x_train, x_test, y_train, y_test = train_test_split(df_train, y, test_size = 0.2, random_state = 2)
In [ ]:
# Tensorflow needs the data in matrices
inputX = x_train.as_matrix()
inputY = y_train.as_matrix()
inputX_test = x_test.as_matrix()
inputY_test = y_test.as_matrix()
In [ ]:
# Number of input nodes/number of features.
input_nodes = 186

# Multiplier maintains a fixed ratio of nodes between each layer.
mulitplier = 1.33

# Number of nodes in each hidden layer
hidden_nodes1 = 50
hidden_nodes2 = round(hidden_nodes1 * mulitplier)

# Percent of nodes to keep during dropout.
pkeep = tf.placeholder(tf.float32)
In [ ]:
# The standard deviation when setting the values for the weights.
std = 1

#input
features = tf.placeholder(tf.float32, [None, input_nodes])

#layer 1
W1 = tf.Variable(tf.truncated_normal([input_nodes, hidden_nodes1], stddev = std))
b1 = tf.Variable(tf.zeros([hidden_nodes1]))
y1 = tf.nn.sigmoid(tf.matmul(features, W1) + b1)

#layer 2
W2 = tf.Variable(tf.truncated_normal([hidden_nodes1, hidden_nodes2], stddev = std))
b2 = tf.Variable(tf.zeros([hidden_nodes2]))
y2 = tf.nn.sigmoid(tf.matmul(y1, W2) + b2)
#y2 = tf.nn.dropout(y2, pkeep)

#layer 3
W3 = tf.Variable(tf.truncated_normal([hidden_nodes2, 12], stddev = std)) 
b3 = tf.Variable(tf.zeros([12]))
y3 = tf.nn.softmax(tf.matmul(y2, W3) + b3)

#output
predictions = y3
labels = tf.placeholder(tf.float32, [None, 12])
In [ ]:
#Parameters
training_epochs = 3000
training_dropout = 0.6 # Not using dropout led to the best results.
display_step = 10
n_samples = inputY.shape[1]
batch = tf.Variable(0)

learning_rate = tf.train.exponential_decay(
  0.05,              #Base learning rate.
  batch,             #Current index into the dataset.
  len(inputX),       #Decay step.
  0.95,              #Decay rate.
  staircase=False)

Based on the evaluation method of the Kaggle competition, we are going to check the accuracy of the top prediction and the top 5 predictions for each user.

In [ ]:
# Determine if the predictions are correct
correct_prediction = tf.equal(tf.argmax(predictions,1), tf.argmax(labels,1))
correct_top5 = tf.nn.in_top_k(predictions, tf.argmax(labels, 1), k = 5)

# Calculate the accuracy of the predictions
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
accuracy_top5 = tf.reduce_mean(tf.cast(correct_top5, tf.float32))

print('Accuracy function created.')

# Cross entropy
cross_entropy = -tf.reduce_sum(labels * tf.log(tf.clip_by_value(predictions,1e-10,1.0)))

# Training loss
loss = tf.reduce_mean(cross_entropy)

#We will optimize our model via AdamOptimizer
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)
Accuracy function created.
In [ ]:
#Initialize variables and tensorflow session
init = tf.initialize_all_variables()
session = tf.Session()
session.run(init)
WARNING:tensorflow:From <ipython-input-138-f19d41dba435>:2 in <module>.: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use `tf.global_variables_initializer` instead.
In [ ]:
accuracy_summary = [] #Record accuracy values for plot
accuracy_top5_summary = [] #Record accuracy values for plot
loss_summary = [] #Record cost values for plot

test_accuracy_summary = [] #Record accuracy values for plot
test_accuracy_top5_summary = [] #Record accuracy values for plot
test_loss_summary = [] #Record cost values for plot

init = tf.initialize_all_variables()

for i in range(training_epochs):  
    session.run([optimizer], 
                feed_dict={features: inputX, 
                           labels: inputY,
                           pkeep: training_dropout})

    # Display logs per epoch step
    if (i) % display_step == 0:
        train_accuracy, train_accuracy_top5, newLoss = session.run([accuracy,accuracy_top5,loss], 
                                                                   feed_dict={features: inputX, 
                                                                              labels: inputY,
                                                                              pkeep: training_dropout})
        print ("Epoch:", i,
               "Accuracy =", "{:.6f}".format(train_accuracy), 
               "Top 5 Accuracy =", "{:.6f}".format(train_accuracy_top5),
               "Loss = ", "{:.6f}".format(newLoss))
        accuracy_summary.append(train_accuracy)
        accuracy_top5_summary.append(train_accuracy_top5)
        loss_summary.append(newLoss)
        
        test_accuracy,test_accuracy_top5,test_newLoss = session.run([accuracy,accuracy_top5,loss], 
                                                              feed_dict={features: inputX_test, 
                                                                         labels: inputY_test,
                                                                         pkeep: 1})
        print ("Epoch:", i,
               "Test-Accuracy =", "{:.6f}".format(test_accuracy), 
               "Test-Top 5 Accuracy =", "{:.6f}".format(test_accuracy_top5),
               "Test-Loss = ", "{:.6f}".format(test_newLoss))
        test_accuracy_summary.append(test_accuracy)
        test_accuracy_top5_summary.append(test_accuracy_top5)
        test_loss_summary.append(test_newLoss)

print()
print ("Optimization Finished!")
training_accuracy, training_top5_accuracy = session.run([accuracy,accuracy_top5], 
                                feed_dict={features: inputX, labels: inputY, pkeep: training_dropout})
print ("Training Accuracy=", training_accuracy)
print ("Training Top 5 Accuracy=", training_top5_accuracy)
print()
testing_accuracy, testing_top5_accuracy = session.run([accuracy,accuracy_top5], 
                                                       feed_dict={features: inputX_test, 
                                                                  labels: inputY_test,
                                                                  pkeep: 1})
print ("Testing Accuracy=", testing_accuracy)
print ("Testing Top 5 Accuracy=", testing_top5_accuracy)
WARNING:tensorflow:From <ipython-input-139-6246d5175f43>:9 in <module>.: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use `tf.global_variables_initializer` instead.
Epoch: 0 Accuracy = 0.165900 Top 5 Accuracy = 0.509434 Loss =  9022430.000000
Epoch: 0 Test-Accuracy = 0.170059 Test-Top 5 Accuracy = 0.512403 Test-Loss =  2246018.000000
Epoch: 10 Accuracy = 0.521393 Top 5 Accuracy = 0.650404 Loss =  5307889.000000
Epoch: 10 Test-Accuracy = 0.521538 Test-Top 5 Accuracy = 0.651964 Test-Loss =  1341002.750000
Epoch: 20 Accuracy = 0.596656 Top 5 Accuracy = 0.671984 Loss =  4771457.000000
Epoch: 20 Test-Accuracy = 0.596917 Test-Top 5 Accuracy = 0.674334 Test-Loss =  1207018.000000
Epoch: 30 Accuracy = 0.682016 Top 5 Accuracy = 0.849947 Loss =  4566464.500000
Epoch: 30 Test-Accuracy = 0.682064 Test-Top 5 Accuracy = 0.850507 Test-Loss =  1164330.750000
Epoch: 40 Accuracy = 0.718014 Top 5 Accuracy = 0.891918 Loss =  4473507.500000
Epoch: 40 Test-Accuracy = 0.718067 Test-Top 5 Accuracy = 0.889579 Test-Loss =  1148240.000000
Epoch: 50 Accuracy = 0.637122 Top 5 Accuracy = 0.859428 Loss =  4422973.000000
Epoch: 50 Test-Accuracy = 0.636973 Test-Top 5 Accuracy = 0.857347 Test-Loss =  1146496.500000
Epoch: 60 Accuracy = 0.624139 Top 5 Accuracy = 0.827108 Loss =  4371764.000000
Epoch: 60 Test-Accuracy = 0.624160 Test-Top 5 Accuracy = 0.824553 Test-Loss =  1149996.125000
Epoch: 70 Accuracy = 0.622283 Top 5 Accuracy = 0.819747 Loss =  4315948.500000
Epoch: 70 Test-Accuracy = 0.621958 Test-Top 5 Accuracy = 0.816027 Test-Loss =  1153743.000000
Epoch: 80 Accuracy = 0.616708 Top 5 Accuracy = 0.802003 Loss =  4252883.000000
Epoch: 80 Test-Accuracy = 0.616313 Test-Top 5 Accuracy = 0.796960 Test-Loss =  1160309.500000
Epoch: 90 Accuracy = 0.618828 Top 5 Accuracy = 0.809944 Loss =  4182342.250000
Epoch: 90 Test-Accuracy = 0.618514 Test-Top 5 Accuracy = 0.802933 Test-Loss =  1168843.250000
Epoch: 100 Accuracy = 0.621639 Top 5 Accuracy = 0.821270 Loss =  4105650.500000
Epoch: 100 Test-Accuracy = 0.620646 Test-Top 5 Accuracy = 0.813380 Test-Loss =  1178433.125000
Epoch: 110 Accuracy = 0.618072 Top 5 Accuracy = 0.815905 Loss =  4027599.500000
Epoch: 110 Test-Accuracy = 0.615774 Test-Top 5 Accuracy = 0.808109 Test-Loss =  1189457.125000
Epoch: 120 Accuracy = 0.618728 Top 5 Accuracy = 0.820116 Loss =  3952207.500000
Epoch: 120 Test-Accuracy = 0.616242 Test-Top 5 Accuracy = 0.811178 Test-Loss =  1201650.625000
Epoch: 130 Accuracy = 0.619079 Top 5 Accuracy = 0.828162 Loss =  3886280.000000
Epoch: 130 Test-Accuracy = 0.616968 Test-Top 5 Accuracy = 0.818931 Test-Loss =  1216675.000000
Epoch: 140 Accuracy = 0.617164 Top 5 Accuracy = 0.826780 Loss =  3831829.250000
Epoch: 140 Test-Accuracy = 0.614603 Test-Top 5 Accuracy = 0.815793 Test-Loss =  1230044.000000
Epoch: 150 Accuracy = 0.618945 Top 5 Accuracy = 0.835073 Loss =  3784665.000000
Epoch: 150 Test-Accuracy = 0.614884 Test-Top 5 Accuracy = 0.823382 Test-Loss =  1247714.000000
Epoch: 160 Accuracy = 0.618570 Top 5 Accuracy = 0.835758 Loss =  3740164.250000
Epoch: 160 Test-Accuracy = 0.613806 Test-Top 5 Accuracy = 0.823733 Test-Loss =  1259344.750000
Epoch: 170 Accuracy = 0.619548 Top 5 Accuracy = 0.841427 Loss =  3696753.000000
Epoch: 170 Test-Accuracy = 0.614064 Test-Top 5 Accuracy = 0.829847 Test-Loss =  1275993.500000
Epoch: 180 Accuracy = 0.619319 Top 5 Accuracy = 0.842211 Loss =  3655125.000000
Epoch: 180 Test-Accuracy = 0.613127 Test-Top 5 Accuracy = 0.829730 Test-Loss =  1289048.625000
Epoch: 190 Accuracy = 0.618939 Top 5 Accuracy = 0.845379 Loss =  3613836.500000
Epoch: 190 Test-Accuracy = 0.612518 Test-Top 5 Accuracy = 0.832049 Test-Loss =  1302832.875000
Epoch: 200 Accuracy = 0.619565 Top 5 Accuracy = 0.852149 Loss =  3573714.000000
Epoch: 200 Test-Accuracy = 0.613408 Test-Top 5 Accuracy = 0.838748 Test-Loss =  1317959.250000
Epoch: 210 Accuracy = 0.618295 Top 5 Accuracy = 0.849701 Loss =  3533467.500000
Epoch: 210 Test-Accuracy = 0.611112 Test-Top 5 Accuracy = 0.835797 Test-Loss =  1332102.000000
Epoch: 220 Accuracy = 0.619940 Top 5 Accuracy = 0.859575 Loss =  3494618.250000
Epoch: 220 Test-Accuracy = 0.613244 Test-Top 5 Accuracy = 0.846338 Test-Loss =  1346006.500000
Epoch: 230 Accuracy = 0.619232 Top 5 Accuracy = 0.855686 Loss =  3456234.000000
Epoch: 230 Test-Accuracy = 0.611347 Test-Top 5 Accuracy = 0.843175 Test-Loss =  1356915.125000
Epoch: 240 Accuracy = 0.619337 Top 5 Accuracy = 0.855967 Loss =  3419809.000000
Epoch: 240 Test-Accuracy = 0.611112 Test-Top 5 Accuracy = 0.843316 Test-Loss =  1371355.750000
Epoch: 250 Accuracy = 0.620186 Top 5 Accuracy = 0.861554 Loss =  3388884.500000
Epoch: 250 Test-Accuracy = 0.611511 Test-Top 5 Accuracy = 0.849336 Test-Loss =  1388860.750000
Epoch: 260 Accuracy = 0.605821 Top 5 Accuracy = 0.819120 Loss =  3545866.000000
Epoch: 260 Test-Accuracy = 0.599260 Test-Top 5 Accuracy = 0.804877 Test-Loss =  1530484.250000
Epoch: 270 Accuracy = 0.668781 Top 5 Accuracy = 0.948870 Loss =  3421295.000000
Epoch: 270 Test-Accuracy = 0.660748 Test-Top 5 Accuracy = 0.935420 Test-Loss =  1489959.000000
Epoch: 280 Accuracy = 0.609399 Top 5 Accuracy = 0.813065 Loss =  3357680.250000
Epoch: 280 Test-Accuracy = 0.600899 Test-Top 5 Accuracy = 0.798974 Test-Loss =  1430165.750000
Epoch: 290 Accuracy = 0.635617 Top 5 Accuracy = 0.907361 Loss =  3300976.500000
Epoch: 290 Test-Accuracy = 0.625893 Test-Top 5 Accuracy = 0.892553 Test-Loss =  1440139.250000
Epoch: 300 Accuracy = 0.616286 Top 5 Accuracy = 0.846791 Loss =  3264129.750000
Epoch: 300 Test-Accuracy = 0.605420 Test-Top 5 Accuracy = 0.831955 Test-Loss =  1438067.625000
Epoch: 310 Accuracy = 0.626446 Top 5 Accuracy = 0.886712 Loss =  3236772.500000
Epoch: 310 Test-Accuracy = 0.614931 Test-Top 5 Accuracy = 0.871074 Test-Loss =  1444352.750000
Epoch: 320 Accuracy = 0.622944 Top 5 Accuracy = 0.874268 Loss =  3211510.000000
Epoch: 320 Test-Accuracy = 0.611300 Test-Top 5 Accuracy = 0.857862 Test-Loss =  1459722.125000
Epoch: 330 Accuracy = 0.623729 Top 5 Accuracy = 0.877196 Loss =  3188301.750000
Epoch: 330 Test-Accuracy = 0.611394 Test-Top 5 Accuracy = 0.861423 Test-Loss =  1469061.375000
Epoch: 340 Accuracy = 0.625363 Top 5 Accuracy = 0.883855 Loss =  3166312.000000
Epoch: 340 Test-Accuracy = 0.612776 Test-Top 5 Accuracy = 0.867771 Test-Loss =  1480894.250000
Epoch: 350 Accuracy = 0.624391 Top 5 Accuracy = 0.881776 Loss =  3145008.500000
Epoch: 350 Test-Accuracy = 0.612237 Test-Top 5 Accuracy = 0.864304 Test-Loss =  1491645.125000
Epoch: 360 Accuracy = 0.625281 Top 5 Accuracy = 0.885254 Loss =  3124510.000000
Epoch: 360 Test-Accuracy = 0.612588 Test-Top 5 Accuracy = 0.867700 Test-Loss =  1503555.000000
Epoch: 370 Accuracy = 0.625796 Top 5 Accuracy = 0.887520 Loss =  3104588.500000
Epoch: 370 Test-Accuracy = 0.613103 Test-Top 5 Accuracy = 0.869668 Test-Loss =  1516310.125000
Epoch: 380 Accuracy = 0.625873 Top 5 Accuracy = 0.887743 Loss =  3085142.000000
Epoch: 380 Test-Accuracy = 0.612658 Test-Top 5 Accuracy = 0.869223 Test-Loss =  1529080.250000
Epoch: 390 Accuracy = 0.626312 Top 5 Accuracy = 0.890337 Loss =  3065810.750000
Epoch: 390 Test-Accuracy = 0.613010 Test-Top 5 Accuracy = 0.871542 Test-Loss =  1541329.000000
Epoch: 400 Accuracy = 0.626763 Top 5 Accuracy = 0.891825 Loss =  3046702.250000
Epoch: 400 Test-Accuracy = 0.612752 Test-Top 5 Accuracy = 0.872713 Test-Loss =  1553015.000000
Epoch: 410 Accuracy = 0.627044 Top 5 Accuracy = 0.893172 Loss =  3027404.000000
Epoch: 410 Test-Accuracy = 0.613455 Test-Top 5 Accuracy = 0.873814 Test-Loss =  1564915.250000
Epoch: 420 Accuracy = 0.627506 Top 5 Accuracy = 0.894460 Loss =  3008180.000000
Epoch: 420 Test-Accuracy = 0.613174 Test-Top 5 Accuracy = 0.875524 Test-Loss =  1576596.500000
Epoch: 430 Accuracy = 0.628104 Top 5 Accuracy = 0.895953 Loss =  2989394.000000
Epoch: 430 Test-Accuracy = 0.613127 Test-Top 5 Accuracy = 0.877047 Test-Loss =  1588426.000000
Epoch: 440 Accuracy = 0.628397 Top 5 Accuracy = 0.897488 Loss =  2971131.000000
Epoch: 440 Test-Accuracy = 0.613572 Test-Top 5 Accuracy = 0.878452 Test-Loss =  1599828.250000
Epoch: 450 Accuracy = 0.628455 Top 5 Accuracy = 0.900023 Loss =  2979848.000000
Epoch: 450 Test-Accuracy = 0.612916 Test-Top 5 Accuracy = 0.881052 Test-Loss =  1611988.500000
Epoch: 460 Accuracy = 0.630212 Top 5 Accuracy = 0.904287 Loss =  2936936.750000
Epoch: 460 Test-Accuracy = 0.614204 Test-Top 5 Accuracy = 0.884425 Test-Loss =  1620108.375000
Epoch: 470 Accuracy = 0.630259 Top 5 Accuracy = 0.903244 Loss =  2924608.750000
Epoch: 470 Test-Accuracy = 0.613408 Test-Top 5 Accuracy = 0.883980 Test-Loss =  1630292.500000
Epoch: 480 Accuracy = 0.629416 Top 5 Accuracy = 0.902477 Loss =  2905937.000000
Epoch: 480 Test-Accuracy = 0.612752 Test-Top 5 Accuracy = 0.882622 Test-Loss =  1641145.000000
Epoch: 490 Accuracy = 0.629913 Top 5 Accuracy = 0.903496 Loss =  2888264.750000
Epoch: 490 Test-Accuracy = 0.612822 Test-Top 5 Accuracy = 0.883114 Test-Loss =  1652784.000000
Epoch: 500 Accuracy = 0.630130 Top 5 Accuracy = 0.905101 Loss =  2872441.500000
Epoch: 500 Test-Accuracy = 0.613267 Test-Top 5 Accuracy = 0.884941 Test-Loss =  1664442.250000
Epoch: 510 Accuracy = 0.630423 Top 5 Accuracy = 0.906453 Loss =  2857422.500000
Epoch: 510 Test-Accuracy = 0.613642 Test-Top 5 Accuracy = 0.886135 Test-Loss =  1675863.750000
Epoch: 520 Accuracy = 0.630780 Top 5 Accuracy = 0.907437 Loss =  2842949.250000
Epoch: 520 Test-Accuracy = 0.613759 Test-Top 5 Accuracy = 0.887049 Test-Loss =  1687370.625000
Epoch: 530 Accuracy = 0.631008 Top 5 Accuracy = 0.908158 Loss =  2828546.500000
Epoch: 530 Test-Accuracy = 0.613876 Test-Top 5 Accuracy = 0.887634 Test-Loss =  1698580.375000
Epoch: 540 Accuracy = 0.631155 Top 5 Accuracy = 0.909809 Loss =  2814355.000000
Epoch: 540 Test-Accuracy = 0.613549 Test-Top 5 Accuracy = 0.888829 Test-Loss =  1709835.875000
Epoch: 550 Accuracy = 0.631518 Top 5 Accuracy = 0.911267 Loss =  2800691.750000
Epoch: 550 Test-Accuracy = 0.613712 Test-Top 5 Accuracy = 0.890445 Test-Loss =  1720924.250000
Epoch: 560 Accuracy = 0.631940 Top 5 Accuracy = 0.912122 Loss =  2787335.750000
Epoch: 560 Test-Accuracy = 0.614181 Test-Top 5 Accuracy = 0.891898 Test-Loss =  1731755.875000
Epoch: 570 Accuracy = 0.632197 Top 5 Accuracy = 0.913797 Loss =  2774265.750000
Epoch: 570 Test-Accuracy = 0.614368 Test-Top 5 Accuracy = 0.893045 Test-Loss =  1742361.250000
Epoch: 580 Accuracy = 0.632420 Top 5 Accuracy = 0.914775 Loss =  2761413.500000
Epoch: 580 Test-Accuracy = 0.614415 Test-Top 5 Accuracy = 0.893467 Test-Loss =  1753143.250000
Epoch: 590 Accuracy = 0.632754 Top 5 Accuracy = 0.915923 Loss =  2751627.500000
Epoch: 590 Test-Accuracy = 0.614603 Test-Top 5 Accuracy = 0.895060 Test-Loss =  1765984.125000
Epoch: 600 Accuracy = 0.632894 Top 5 Accuracy = 0.916837 Loss =  2748743.500000
Epoch: 600 Test-Accuracy = 0.614696 Test-Top 5 Accuracy = 0.894568 Test-Loss =  1775142.500000
Epoch: 610 Accuracy = 0.633445 Top 5 Accuracy = 0.918189 Loss =  2729975.000000
Epoch: 610 Test-Accuracy = 0.614649 Test-Top 5 Accuracy = 0.896044 Test-Loss =  1782635.500000
Epoch: 620 Accuracy = 0.633632 Top 5 Accuracy = 0.918839 Loss =  2715619.500000
Epoch: 620 Test-Accuracy = 0.614860 Test-Top 5 Accuracy = 0.896840 Test-Loss =  1796384.750000
Epoch: 630 Accuracy = 0.634264 Top 5 Accuracy = 0.920110 Loss =  2702567.500000
Epoch: 630 Test-Accuracy = 0.614837 Test-Top 5 Accuracy = 0.897309 Test-Loss =  1808805.250000
Epoch: 640 Accuracy = 0.634505 Top 5 Accuracy = 0.921287 Loss =  2691456.000000
Epoch: 640 Test-Accuracy = 0.615165 Test-Top 5 Accuracy = 0.898246 Test-Loss =  1822867.625000
Epoch: 650 Accuracy = 0.634598 Top 5 Accuracy = 0.922072 Loss =  2681144.750000
Epoch: 650 Test-Accuracy = 0.615235 Test-Top 5 Accuracy = 0.898925 Test-Loss =  1836016.250000
Epoch: 660 Accuracy = 0.634791 Top 5 Accuracy = 0.923208 Loss =  2670648.750000
Epoch: 660 Test-Accuracy = 0.615118 Test-Top 5 Accuracy = 0.899417 Test-Loss =  1848357.250000
Epoch: 670 Accuracy = 0.635078 Top 5 Accuracy = 0.924350 Loss =  2660500.000000
Epoch: 670 Test-Accuracy = 0.615048 Test-Top 5 Accuracy = 0.900424 Test-Loss =  1860843.750000
Epoch: 680 Accuracy = 0.635529 Top 5 Accuracy = 0.925346 Loss =  2650398.250000
Epoch: 680 Test-Accuracy = 0.614977 Test-Top 5 Accuracy = 0.901010 Test-Loss =  1874202.125000
Epoch: 690 Accuracy = 0.635623 Top 5 Accuracy = 0.926306 Loss =  2640422.250000
Epoch: 690 Test-Accuracy = 0.615048 Test-Top 5 Accuracy = 0.902743 Test-Loss =  1887367.750000
Epoch: 700 Accuracy = 0.635676 Top 5 Accuracy = 0.927266 Loss =  2630415.250000
Epoch: 700 Test-Accuracy = 0.614486 Test-Top 5 Accuracy = 0.903516 Test-Loss =  1900982.250000
Epoch: 710 Accuracy = 0.635775 Top 5 Accuracy = 0.928145 Loss =  2620514.500000
Epoch: 710 Test-Accuracy = 0.614439 Test-Top 5 Accuracy = 0.904851 Test-Loss =  1914457.500000
Epoch: 720 Accuracy = 0.636045 Top 5 Accuracy = 0.929140 Loss =  2610700.500000
Epoch: 720 Test-Accuracy = 0.613712 Test-Top 5 Accuracy = 0.906327 Test-Loss =  1928652.500000
Epoch: 730 Accuracy = 0.636121 Top 5 Accuracy = 0.929913 Loss =  2601018.000000
Epoch: 730 Test-Accuracy = 0.613619 Test-Top 5 Accuracy = 0.907803 Test-Loss =  1943249.000000
Epoch: 740 Accuracy = 0.636285 Top 5 Accuracy = 0.931225 Loss =  2597511.500000
Epoch: 740 Test-Accuracy = 0.613057 Test-Top 5 Accuracy = 0.908786 Test-Loss =  1958998.500000
Epoch: 750 Accuracy = 0.636086 Top 5 Accuracy = 0.932748 Loss =  2589274.000000
Epoch: 750 Test-Accuracy = 0.614111 Test-Top 5 Accuracy = 0.911129 Test-Loss =  1975408.000000
Epoch: 760 Accuracy = 0.636437 Top 5 Accuracy = 0.932449 Loss =  2576093.500000
Epoch: 760 Test-Accuracy = 0.612916 Test-Top 5 Accuracy = 0.911433 Test-Loss =  1986345.750000
Epoch: 770 Accuracy = 0.636706 Top 5 Accuracy = 0.933796 Loss =  2566251.750000
Epoch: 770 Test-Accuracy = 0.612963 Test-Top 5 Accuracy = 0.912253 Test-Loss =  1999974.250000
Epoch: 780 Accuracy = 0.636747 Top 5 Accuracy = 0.935026 Loss =  2556964.500000
Epoch: 780 Test-Accuracy = 0.613338 Test-Top 5 Accuracy = 0.913963 Test-Loss =  2014101.000000
Epoch: 790 Accuracy = 0.636906 Top 5 Accuracy = 0.935793 Loss =  2548164.250000
Epoch: 790 Test-Accuracy = 0.612893 Test-Top 5 Accuracy = 0.914502 Test-Loss =  2030396.875000
Epoch: 800 Accuracy = 0.636841 Top 5 Accuracy = 0.936373 Loss =  2539894.500000
Epoch: 800 Test-Accuracy = 0.612846 Test-Top 5 Accuracy = 0.915041 Test-Loss =  2041358.750000
Epoch: 810 Accuracy = 0.636964 Top 5 Accuracy = 0.937245 Loss =  2531572.500000
Epoch: 810 Test-Accuracy = 0.612822 Test-Top 5 Accuracy = 0.916071 Test-Loss =  2055304.375000
Epoch: 820 Accuracy = 0.636952 Top 5 Accuracy = 0.938253 Loss =  2523969.500000
Epoch: 820 Test-Accuracy = 0.612776 Test-Top 5 Accuracy = 0.917406 Test-Loss =  2068028.250000
Epoch: 830 Accuracy = 0.637023 Top 5 Accuracy = 0.939359 Loss =  2520860.000000
Epoch: 830 Test-Accuracy = 0.612939 Test-Top 5 Accuracy = 0.918976 Test-Loss =  2083729.500000
Epoch: 840 Accuracy = 0.637239 Top 5 Accuracy = 0.939547 Loss =  2510291.750000
Epoch: 840 Test-Accuracy = 0.613174 Test-Top 5 Accuracy = 0.918156 Test-Loss =  2093131.375000
Epoch: 850 Accuracy = 0.637403 Top 5 Accuracy = 0.940882 Loss =  2501326.250000
Epoch: 850 Test-Accuracy = 0.613221 Test-Top 5 Accuracy = 0.919796 Test-Loss =  2105081.750000
Epoch: 860 Accuracy = 0.637450 Top 5 Accuracy = 0.942041 Loss =  2493963.750000
Epoch: 860 Test-Accuracy = 0.613150 Test-Top 5 Accuracy = 0.920522 Test-Loss =  2120399.750000
Epoch: 870 Accuracy = 0.637362 Top 5 Accuracy = 0.942885 Loss =  2486478.000000
Epoch: 870 Test-Accuracy = 0.612776 Test-Top 5 Accuracy = 0.920873 Test-Loss =  2134351.750000
Epoch: 880 Accuracy = 0.637614 Top 5 Accuracy = 0.943623 Loss =  2479670.750000
Epoch: 880 Test-Accuracy = 0.612893 Test-Top 5 Accuracy = 0.921904 Test-Loss =  2146379.750000
Epoch: 890 Accuracy = 0.637749 Top 5 Accuracy = 0.943775 Loss =  2475710.500000
Epoch: 890 Test-Accuracy = 0.612494 Test-Top 5 Accuracy = 0.921904 Test-Loss =  2160966.000000
Epoch: 900 Accuracy = 0.637854 Top 5 Accuracy = 0.944934 Loss =  2464939.500000
Epoch: 900 Test-Accuracy = 0.612776 Test-Top 5 Accuracy = 0.923145 Test-Loss =  2169376.500000
Epoch: 910 Accuracy = 0.638153 Top 5 Accuracy = 0.946381 Loss =  2459572.500000
Epoch: 910 Test-Accuracy = 0.613127 Test-Top 5 Accuracy = 0.924645 Test-Loss =  2182095.500000
Epoch: 920 Accuracy = 0.638358 Top 5 Accuracy = 0.947084 Loss =  2450774.250000
Epoch: 920 Test-Accuracy = 0.612705 Test-Top 5 Accuracy = 0.924363 Test-Loss =  2194508.750000
Epoch: 930 Accuracy = 0.638575 Top 5 Accuracy = 0.947740 Loss =  2443714.500000
Epoch: 930 Test-Accuracy = 0.612541 Test-Top 5 Accuracy = 0.924949 Test-Loss =  2206397.000000
Epoch: 940 Accuracy = 0.638785 Top 5 Accuracy = 0.948091 Loss =  2439277.750000
Epoch: 940 Test-Accuracy = 0.612635 Test-Top 5 Accuracy = 0.925066 Test-Loss =  2218437.000000
Epoch: 950 Accuracy = 0.638797 Top 5 Accuracy = 0.948975 Loss =  2431477.000000
Epoch: 950 Test-Accuracy = 0.612939 Test-Top 5 Accuracy = 0.926144 Test-Loss =  2228784.000000
Epoch: 960 Accuracy = 0.638920 Top 5 Accuracy = 0.949695 Loss =  2424252.500000
Epoch: 960 Test-Accuracy = 0.612939 Test-Top 5 Accuracy = 0.926472 Test-Loss =  2247752.250000
Epoch: 970 Accuracy = 0.639189 Top 5 Accuracy = 0.950814 Loss =  2417390.000000
Epoch: 970 Test-Accuracy = 0.613314 Test-Top 5 Accuracy = 0.927198 Test-Loss =  2246546.250000
Epoch: 980 Accuracy = 0.639189 Top 5 Accuracy = 0.951042 Loss =  2409612.500000
Epoch: 980 Test-Accuracy = 0.612893 Test-Top 5 Accuracy = 0.927174 Test-Loss =  2267358.000000
Epoch: 990 Accuracy = 0.639189 Top 5 Accuracy = 0.951534 Loss =  2402521.750000
Epoch: 990 Test-Accuracy = 0.612869 Test-Top 5 Accuracy = 0.927596 Test-Loss =  2278736.250000
Epoch: 1000 Accuracy = 0.639418 Top 5 Accuracy = 0.952097 Loss =  2395399.000000
Epoch: 1000 Test-Accuracy = 0.612822 Test-Top 5 Accuracy = 0.927854 Test-Loss =  2290171.500000
Epoch: 1010 Accuracy = 0.639635 Top 5 Accuracy = 0.952770 Loss =  2389387.750000
Epoch: 1010 Test-Accuracy = 0.613291 Test-Top 5 Accuracy = 0.927877 Test-Loss =  2303861.500000
Epoch: 1020 Accuracy = 0.639582 Top 5 Accuracy = 0.952840 Loss =  2389624.000000
Epoch: 1020 Test-Accuracy = 0.613712 Test-Top 5 Accuracy = 0.928837 Test-Loss =  2322570.000000
Epoch: 1030 Accuracy = 0.639430 Top 5 Accuracy = 0.952460 Loss =  2379928.500000
Epoch: 1030 Test-Accuracy = 0.613197 Test-Top 5 Accuracy = 0.928205 Test-Loss =  2327114.000000
Epoch: 1040 Accuracy = 0.640103 Top 5 Accuracy = 0.954170 Loss =  2371105.750000
Epoch: 1040 Test-Accuracy = 0.613783 Test-Top 5 Accuracy = 0.928791 Test-Loss =  2338195.000000
Epoch: 1050 Accuracy = 0.639998 Top 5 Accuracy = 0.953578 Loss =  2362986.250000
Epoch: 1050 Test-Accuracy = 0.613150 Test-Top 5 Accuracy = 0.928322 Test-Loss =  2357343.750000
Epoch: 1060 Accuracy = 0.640326 Top 5 Accuracy = 0.954498 Loss =  2356899.250000
Epoch: 1060 Test-Accuracy = 0.613361 Test-Top 5 Accuracy = 0.929470 Test-Loss =  2363334.000000
Epoch: 1070 Accuracy = 0.640607 Top 5 Accuracy = 0.955071 Loss =  2351281.750000
Epoch: 1070 Test-Accuracy = 0.613549 Test-Top 5 Accuracy = 0.930337 Test-Loss =  2374273.500000
Epoch: 1080 Accuracy = 0.640976 Top 5 Accuracy = 0.955030 Loss =  2346180.000000
Epoch: 1080 Test-Accuracy = 0.613385 Test-Top 5 Accuracy = 0.929985 Test-Loss =  2390890.000000
Epoch: 1090 Accuracy = 0.641544 Top 5 Accuracy = 0.956126 Loss =  2338536.000000
Epoch: 1090 Test-Accuracy = 0.613923 Test-Top 5 Accuracy = 0.931648 Test-Loss =  2400878.000000
Epoch: 1100 Accuracy = 0.641725 Top 5 Accuracy = 0.956190 Loss =  2332444.000000
Epoch: 1100 Test-Accuracy = 0.614181 Test-Top 5 Accuracy = 0.931250 Test-Loss =  2420190.250000
Epoch: 1110 Accuracy = 0.641503 Top 5 Accuracy = 0.956366 Loss =  2335543.000000
Epoch: 1110 Test-Accuracy = 0.614064 Test-Top 5 Accuracy = 0.931274 Test-Loss =  2441918.500000
Epoch: 1120 Accuracy = 0.641930 Top 5 Accuracy = 0.956723 Loss =  2326694.750000
Epoch: 1120 Test-Accuracy = 0.614275 Test-Top 5 Accuracy = 0.931883 Test-Loss =  2441634.000000
Epoch: 1130 Accuracy = 0.642293 Top 5 Accuracy = 0.957484 Loss =  2318511.000000
Epoch: 1130 Test-Accuracy = 0.614626 Test-Top 5 Accuracy = 0.932492 Test-Loss =  2460612.000000
Epoch: 1140 Accuracy = 0.642340 Top 5 Accuracy = 0.957525 Loss =  2311091.750000
Epoch: 1140 Test-Accuracy = 0.614603 Test-Top 5 Accuracy = 0.931812 Test-Loss =  2470264.250000
Epoch: 1150 Accuracy = 0.642814 Top 5 Accuracy = 0.958082 Loss =  2304447.750000
Epoch: 1150 Test-Accuracy = 0.615024 Test-Top 5 Accuracy = 0.932820 Test-Loss =  2483530.250000
Epoch: 1160 Accuracy = 0.642885 Top 5 Accuracy = 0.958263 Loss =  2298872.000000
Epoch: 1160 Test-Accuracy = 0.614907 Test-Top 5 Accuracy = 0.933101 Test-Loss =  2501695.500000
Epoch: 1170 Accuracy = 0.643265 Top 5 Accuracy = 0.958837 Loss =  2293050.000000
Epoch: 1170 Test-Accuracy = 0.615259 Test-Top 5 Accuracy = 0.933944 Test-Loss =  2508715.750000
Epoch: 1180 Accuracy = 0.643576 Top 5 Accuracy = 0.959112 Loss =  2287656.750000
Epoch: 1180 Test-Accuracy = 0.615352 Test-Top 5 Accuracy = 0.934530 Test-Loss =  2521155.000000
Epoch: 1190 Accuracy = 0.643359 Top 5 Accuracy = 0.959510 Loss =  2288258.750000
Epoch: 1190 Test-Accuracy = 0.615376 Test-Top 5 Accuracy = 0.934998 Test-Loss =  2528600.000000
Epoch: 1200 Accuracy = 0.643892 Top 5 Accuracy = 0.960928 Loss =  2280958.500000
Epoch: 1200 Test-Accuracy = 0.615750 Test-Top 5 Accuracy = 0.936427 Test-Loss =  2541103.750000
Epoch: 1210 Accuracy = 0.643822 Top 5 Accuracy = 0.959792 Loss =  2273768.000000
Epoch: 1210 Test-Accuracy = 0.615469 Test-Top 5 Accuracy = 0.934951 Test-Loss =  2559193.500000
Epoch: 1220 Accuracy = 0.644179 Top 5 Accuracy = 0.960781 Loss =  2266861.000000
Epoch: 1220 Test-Accuracy = 0.615118 Test-Top 5 Accuracy = 0.936450 Test-Loss =  2569362.750000
Epoch: 1230 Accuracy = 0.644296 Top 5 Accuracy = 0.961302 Loss =  2261465.000000
Epoch: 1230 Test-Accuracy = 0.615446 Test-Top 5 Accuracy = 0.936966 Test-Loss =  2579270.250000
Epoch: 1240 Accuracy = 0.644302 Top 5 Accuracy = 0.961279 Loss =  2257407.750000
Epoch: 1240 Test-Accuracy = 0.615118 Test-Top 5 Accuracy = 0.936849 Test-Loss =  2592246.250000
Epoch: 1250 Accuracy = 0.644782 Top 5 Accuracy = 0.962292 Loss =  2251328.000000
Epoch: 1250 Test-Accuracy = 0.615165 Test-Top 5 Accuracy = 0.937879 Test-Loss =  2601963.500000
Epoch: 1260 Accuracy = 0.644823 Top 5 Accuracy = 0.962204 Loss =  2246036.500000
Epoch: 1260 Test-Accuracy = 0.615141 Test-Top 5 Accuracy = 0.938184 Test-Loss =  2619557.500000
Epoch: 1270 Accuracy = 0.644823 Top 5 Accuracy = 0.962866 Loss =  2243877.500000
Epoch: 1270 Test-Accuracy = 0.615118 Test-Top 5 Accuracy = 0.938394 Test-Loss =  2634694.500000
Epoch: 1280 Accuracy = 0.644688 Top 5 Accuracy = 0.962485 Loss =  2237495.500000
Epoch: 1280 Test-Accuracy = 0.614696 Test-Top 5 Accuracy = 0.938137 Test-Loss =  2652082.000000
Epoch: 1290 Accuracy = 0.645672 Top 5 Accuracy = 0.964582 Loss =  2232758.500000
Epoch: 1290 Test-Accuracy = 0.615376 Test-Top 5 Accuracy = 0.940432 Test-Loss =  2647461.250000
Epoch: 1300 Accuracy = 0.644624 Top 5 Accuracy = 0.963358 Loss =  2227615.500000
Epoch: 1300 Test-Accuracy = 0.614298 Test-Top 5 Accuracy = 0.939004 Test-Loss =  2678091.500000
Epoch: 1310 Accuracy = 0.645491 Top 5 Accuracy = 0.964681 Loss =  2223090.000000
Epoch: 1310 Test-Accuracy = 0.614884 Test-Top 5 Accuracy = 0.940643 Test-Loss =  2674678.000000
Epoch: 1320 Accuracy = 0.645303 Top 5 Accuracy = 0.964611 Loss =  2215818.750000
Epoch: 1320 Test-Accuracy = 0.614486 Test-Top 5 Accuracy = 0.940643 Test-Loss =  2693712.000000
Epoch: 1330 Accuracy = 0.645286 Top 5 Accuracy = 0.965162 Loss =  2213908.250000
Epoch: 1330 Test-Accuracy = 0.614720 Test-Top 5 Accuracy = 0.940643 Test-Loss =  2709291.000000
Epoch: 1340 Accuracy = 0.645789 Top 5 Accuracy = 0.965349 Loss =  2206027.250000
Epoch: 1340 Test-Accuracy = 0.614486 Test-Top 5 Accuracy = 0.940643 Test-Loss =  2718647.500000
Epoch: 1350 Accuracy = 0.645819 Top 5 Accuracy = 0.965607 Loss =  2207134.500000
Epoch: 1350 Test-Accuracy = 0.614743 Test-Top 5 Accuracy = 0.941369 Test-Loss =  2725136.250000
Epoch: 1360 Accuracy = 0.645965 Top 5 Accuracy = 0.966462 Loss =  2199840.000000
Epoch: 1360 Test-Accuracy = 0.615095 Test-Top 5 Accuracy = 0.941838 Test-Loss =  2742745.500000
Epoch: 1370 Accuracy = 0.646223 Top 5 Accuracy = 0.965812 Loss =  2193124.500000
Epoch: 1370 Test-Accuracy = 0.614509 Test-Top 5 Accuracy = 0.940924 Test-Loss =  2754396.500000
Epoch: 1380 Accuracy = 0.646691 Top 5 Accuracy = 0.967299 Loss =  2188037.250000
Epoch: 1380 Test-Accuracy = 0.615446 Test-Top 5 Accuracy = 0.942541 Test-Loss =  2756704.750000
Epoch: 1390 Accuracy = 0.646188 Top 5 Accuracy = 0.966667 Loss =  2185089.250000
Epoch: 1390 Test-Accuracy = 0.615024 Test-Top 5 Accuracy = 0.941768 Test-Loss =  2778436.250000
Epoch: 1400 Accuracy = 0.646498 Top 5 Accuracy = 0.966772 Loss =  2178445.500000
Epoch: 1400 Test-Accuracy = 0.614977 Test-Top 5 Accuracy = 0.941486 Test-Loss =  2783100.000000
Epoch: 1410 Accuracy = 0.646808 Top 5 Accuracy = 0.967065 Loss =  2177806.250000
Epoch: 1410 Test-Accuracy = 0.615422 Test-Top 5 Accuracy = 0.942658 Test-Loss =  2786615.500000
Epoch: 1420 Accuracy = 0.646756 Top 5 Accuracy = 0.967311 Loss =  2169892.500000
Epoch: 1420 Test-Accuracy = 0.615001 Test-Top 5 Accuracy = 0.942259 Test-Loss =  2803569.750000
Epoch: 1430 Accuracy = 0.646504 Top 5 Accuracy = 0.966842 Loss =  2166717.500000
Epoch: 1430 Test-Accuracy = 0.614158 Test-Top 5 Accuracy = 0.941299 Test-Loss =  2818313.500000
Epoch: 1440 Accuracy = 0.646680 Top 5 Accuracy = 0.967557 Loss =  2164130.500000
Epoch: 1440 Test-Accuracy = 0.614486 Test-Top 5 Accuracy = 0.942775 Test-Loss =  2818750.000000
Epoch: 1450 Accuracy = 0.647505 Top 5 Accuracy = 0.968945 Loss =  2158136.500000
Epoch: 1450 Test-Accuracy = 0.615305 Test-Top 5 Accuracy = 0.944040 Test-Loss =  2826777.250000
Epoch: 1460 Accuracy = 0.646791 Top 5 Accuracy = 0.967381 Loss =  2152875.250000
Epoch: 1460 Test-Accuracy = 0.614181 Test-Top 5 Accuracy = 0.941065 Test-Loss =  2849334.500000
Epoch: 1470 Accuracy = 0.647909 Top 5 Accuracy = 0.968992 Loss =  2147634.000000
Epoch: 1470 Test-Accuracy = 0.615493 Test-Top 5 Accuracy = 0.943852 Test-Loss =  2845585.500000
Epoch: 1480 Accuracy = 0.647734 Top 5 Accuracy = 0.968822 Loss =  2147202.000000
Epoch: 1480 Test-Accuracy = 0.615540 Test-Top 5 Accuracy = 0.943876 Test-Loss =  2854938.000000
Epoch: 1490 Accuracy = 0.647710 Top 5 Accuracy = 0.968851 Loss =  2140588.000000
Epoch: 1490 Test-Accuracy = 0.615376 Test-Top 5 Accuracy = 0.943407 Test-Loss =  2872957.500000
Epoch: 1500 Accuracy = 0.647775 Top 5 Accuracy = 0.968722 Loss =  2134431.250000
Epoch: 1500 Test-Accuracy = 0.615399 Test-Top 5 Accuracy = 0.943056 Test-Loss =  2882432.000000
Epoch: 1510 Accuracy = 0.647494 Top 5 Accuracy = 0.968910 Loss =  2130041.500000
Epoch: 1510 Test-Accuracy = 0.615352 Test-Top 5 Accuracy = 0.943009 Test-Loss =  2893115.750000
Epoch: 1520 Accuracy = 0.647394 Top 5 Accuracy = 0.969038 Loss =  2127296.500000
Epoch: 1520 Test-Accuracy = 0.614860 Test-Top 5 Accuracy = 0.943407 Test-Loss =  2906832.750000
Epoch: 1530 Accuracy = 0.647458 Top 5 Accuracy = 0.969302 Loss =  2123920.000000
Epoch: 1530 Test-Accuracy = 0.615118 Test-Top 5 Accuracy = 0.944180 Test-Loss =  2913661.250000
Epoch: 1540 Accuracy = 0.648589 Top 5 Accuracy = 0.971726 Loss =  2118354.000000
Epoch: 1540 Test-Accuracy = 0.616195 Test-Top 5 Accuracy = 0.946288 Test-Loss =  2914184.500000
Epoch: 1550 Accuracy = 0.649016 Top 5 Accuracy = 0.972306 Loss =  2115613.750000
Epoch: 1550 Test-Accuracy = 0.616523 Test-Top 5 Accuracy = 0.947834 Test-Loss =  2920663.000000
Epoch: 1560 Accuracy = 0.648542 Top 5 Accuracy = 0.971486 Loss =  2110375.000000
Epoch: 1560 Test-Accuracy = 0.616102 Test-Top 5 Accuracy = 0.946288 Test-Loss =  2942552.250000
Epoch: 1570 Accuracy = 0.648190 Top 5 Accuracy = 0.971082 Loss =  2105796.750000
Epoch: 1570 Test-Accuracy = 0.615095 Test-Top 5 Accuracy = 0.945328 Test-Loss =  2953548.500000
Epoch: 1580 Accuracy = 0.647212 Top 5 Accuracy = 0.970046 Loss =  2107247.750000
Epoch: 1580 Test-Accuracy = 0.614181 Test-Top 5 Accuracy = 0.943923 Test-Loss =  2982694.500000
Epoch: 1590 Accuracy = 0.650135 Top 5 Accuracy = 0.974186 Loss =  2100167.750000
Epoch: 1590 Test-Accuracy = 0.617250 Test-Top 5 Accuracy = 0.948397 Test-Loss =  2962164.000000
Epoch: 1600 Accuracy = 0.648021 Top 5 Accuracy = 0.970608 Loss =  2092607.250000
Epoch: 1600 Test-Accuracy = 0.614368 Test-Top 5 Accuracy = 0.944227 Test-Loss =  2994607.000000
Epoch: 1610 Accuracy = 0.650029 Top 5 Accuracy = 0.973343 Loss =  2088638.250000
Epoch: 1610 Test-Accuracy = 0.616242 Test-Top 5 Accuracy = 0.947483 Test-Loss =  2989836.500000
Epoch: 1620 Accuracy = 0.649151 Top 5 Accuracy = 0.971937 Loss =  2082992.000000
Epoch: 1620 Test-Accuracy = 0.614813 Test-Top 5 Accuracy = 0.945445 Test-Loss =  3008547.000000
Epoch: 1630 Accuracy = 0.649848 Top 5 Accuracy = 0.973226 Loss =  2080199.375000
Epoch: 1630 Test-Accuracy = 0.615821 Test-Top 5 Accuracy = 0.946616 Test-Loss =  3010231.250000
Epoch: 1640 Accuracy = 0.649906 Top 5 Accuracy = 0.973062 Loss =  2078709.500000
Epoch: 1640 Test-Accuracy = 0.615422 Test-Top 5 Accuracy = 0.946827 Test-Loss =  3027298.250000
Epoch: 1650 Accuracy = 0.650867 Top 5 Accuracy = 0.974543 Loss =  2075473.875000
Epoch: 1650 Test-Accuracy = 0.616922 Test-Top 5 Accuracy = 0.948397 Test-Loss =  3023039.750000
Epoch: 1660 Accuracy = 0.650205 Top 5 Accuracy = 0.972921 Loss =  2069283.750000
Epoch: 1660 Test-Accuracy = 0.615586 Test-Top 5 Accuracy = 0.946593 Test-Loss =  3046530.250000
Epoch: 1670 Accuracy = 0.650761 Top 5 Accuracy = 0.973940 Loss =  2064156.125000
Epoch: 1670 Test-Accuracy = 0.616125 Test-Top 5 Accuracy = 0.947670 Test-Loss =  3048980.250000
Epoch: 1680 Accuracy = 0.650919 Top 5 Accuracy = 0.973934 Loss =  2059496.625000
Epoch: 1680 Test-Accuracy = 0.616055 Test-Top 5 Accuracy = 0.947577 Test-Loss =  3061358.250000
Epoch: 1690 Accuracy = 0.651728 Top 5 Accuracy = 0.975281 Loss =  2064817.250000
Epoch: 1690 Test-Accuracy = 0.617460 Test-Top 5 Accuracy = 0.948912 Test-Loss =  3065214.500000
Epoch: 1700 Accuracy = 0.649818 Top 5 Accuracy = 0.973079 Loss =  2058106.250000
Epoch: 1700 Test-Accuracy = 0.615399 Test-Top 5 Accuracy = 0.946242 Test-Loss =  3085262.500000
Epoch: 1710 Accuracy = 0.651927 Top 5 Accuracy = 0.975398 Loss =  2053624.500000
Epoch: 1710 Test-Accuracy = 0.617109 Test-Top 5 Accuracy = 0.948842 Test-Loss =  3089841.750000
Epoch: 1720 Accuracy = 0.650170 Top 5 Accuracy = 0.973313 Loss =  2047960.875000
Epoch: 1720 Test-Accuracy = 0.615493 Test-Top 5 Accuracy = 0.946242 Test-Loss =  3105028.750000
Epoch: 1730 Accuracy = 0.651540 Top 5 Accuracy = 0.975182 Loss =  2042486.625000
Epoch: 1730 Test-Accuracy = 0.616523 Test-Top 5 Accuracy = 0.948280 Test-Loss =  3109925.000000
Epoch: 1740 Accuracy = 0.651054 Top 5 Accuracy = 0.974959 Loss =  2039258.500000
Epoch: 1740 Test-Accuracy = 0.616219 Test-Top 5 Accuracy = 0.947647 Test-Loss =  3123861.250000
Epoch: 1750 Accuracy = 0.650656 Top 5 Accuracy = 0.974754 Loss =  2037568.875000
Epoch: 1750 Test-Accuracy = 0.616008 Test-Top 5 Accuracy = 0.947647 Test-Loss =  3131712.750000
Epoch: 1760 Accuracy = 0.651476 Top 5 Accuracy = 0.975240 Loss =  2032484.750000
Epoch: 1760 Test-Accuracy = 0.616570 Test-Top 5 Accuracy = 0.948490 Test-Loss =  3140497.000000
Epoch: 1770 Accuracy = 0.651365 Top 5 Accuracy = 0.975627 Loss =  2029305.375000
Epoch: 1770 Test-Accuracy = 0.616477 Test-Top 5 Accuracy = 0.949053 Test-Loss =  3150976.000000
Epoch: 1780 Accuracy = 0.652120 Top 5 Accuracy = 0.977091 Loss =  2028626.250000
Epoch: 1780 Test-Accuracy = 0.616687 Test-Top 5 Accuracy = 0.950669 Test-Loss =  3153050.000000
Epoch: 1790 Accuracy = 0.651728 Top 5 Accuracy = 0.976323 Loss =  2024737.875000
Epoch: 1790 Test-Accuracy = 0.616500 Test-Top 5 Accuracy = 0.949825 Test-Loss =  3170407.250000
Epoch: 1800 Accuracy = 0.651552 Top 5 Accuracy = 0.976739 Loss =  2021782.500000
Epoch: 1800 Test-Accuracy = 0.616078 Test-Top 5 Accuracy = 0.949989 Test-Loss =  3171258.750000
Epoch: 1810 Accuracy = 0.652483 Top 5 Accuracy = 0.977706 Loss =  2017960.000000
Epoch: 1810 Test-Accuracy = 0.617132 Test-Top 5 Accuracy = 0.950903 Test-Loss =  3185601.000000
Epoch: 1820 Accuracy = 0.652026 Top 5 Accuracy = 0.977342 Loss =  2014126.750000
Epoch: 1820 Test-Accuracy = 0.616687 Test-Top 5 Accuracy = 0.950247 Test-Loss =  3195146.250000
Epoch: 1830 Accuracy = 0.650644 Top 5 Accuracy = 0.975357 Loss =  2014717.500000
Epoch: 1830 Test-Accuracy = 0.614931 Test-Top 5 Accuracy = 0.948725 Test-Loss =  3208618.000000
Epoch: 1840 Accuracy = 0.650826 Top 5 Accuracy = 0.975960 Loss =  2009982.625000
Epoch: 1840 Test-Accuracy = 0.615188 Test-Top 5 Accuracy = 0.949521 Test-Loss =  3224227.250000
Epoch: 1850 Accuracy = 0.653274 Top 5 Accuracy = 0.979550 Loss =  2013813.500000
Epoch: 1850 Test-Accuracy = 0.617999 Test-Top 5 Accuracy = 0.953035 Test-Loss =  3207965.500000
Epoch: 1860 Accuracy = 0.650984 Top 5 Accuracy = 0.976136 Loss =  2006054.125000
Epoch: 1860 Test-Accuracy = 0.615259 Test-Top 5 Accuracy = 0.949685 Test-Loss =  3237213.750000
Epoch: 1870 Accuracy = 0.652290 Top 5 Accuracy = 0.977770 Loss =  2000678.250000
Epoch: 1870 Test-Accuracy = 0.616617 Test-Top 5 Accuracy = 0.950341 Test-Loss =  3242853.000000
Epoch: 1880 Accuracy = 0.652553 Top 5 Accuracy = 0.978631 Loss =  1998126.625000
Epoch: 1880 Test-Accuracy = 0.616594 Test-Top 5 Accuracy = 0.951559 Test-Loss =  3239581.500000
Epoch: 1890 Accuracy = 0.651376 Top 5 Accuracy = 0.976950 Loss =  1994687.000000
Epoch: 1890 Test-Accuracy = 0.615657 Test-Top 5 Accuracy = 0.950130 Test-Loss =  3261173.750000
Epoch: 1900 Accuracy = 0.651160 Top 5 Accuracy = 0.976827 Loss =  1992578.500000
Epoch: 1900 Test-Accuracy = 0.615282 Test-Top 5 Accuracy = 0.949966 Test-Loss =  3263632.000000
Epoch: 1910 Accuracy = 0.652319 Top 5 Accuracy = 0.978736 Loss =  1990503.625000
Epoch: 1910 Test-Accuracy = 0.616430 Test-Top 5 Accuracy = 0.952051 Test-Loss =  3271236.500000
Epoch: 1920 Accuracy = 0.653174 Top 5 Accuracy = 0.979082 Loss =  1986442.250000
Epoch: 1920 Test-Accuracy = 0.617109 Test-Top 5 Accuracy = 0.952238 Test-Loss =  3285592.750000
Epoch: 1930 Accuracy = 0.652922 Top 5 Accuracy = 0.979029 Loss =  1985008.250000
Epoch: 1930 Test-Accuracy = 0.616781 Test-Top 5 Accuracy = 0.952519 Test-Loss =  3282521.000000
Epoch: 1940 Accuracy = 0.653525 Top 5 Accuracy = 0.979638 Loss =  1981577.750000
Epoch: 1940 Test-Accuracy = 0.617624 Test-Top 5 Accuracy = 0.953058 Test-Loss =  3295340.500000
Epoch: 1950 Accuracy = 0.653935 Top 5 Accuracy = 0.980915 Loss =  1978082.750000
Epoch: 1950 Test-Accuracy = 0.617320 Test-Top 5 Accuracy = 0.953948 Test-Loss =  3299154.000000
Epoch: 1960 Accuracy = 0.652313 Top 5 Accuracy = 0.978818 Loss =  1974552.500000
Epoch: 1960 Test-Accuracy = 0.615727 Test-Top 5 Accuracy = 0.952308 Test-Loss =  3312952.500000
Epoch: 1970 Accuracy = 0.654257 Top 5 Accuracy = 0.980973 Loss =  1976741.000000
Epoch: 1970 Test-Accuracy = 0.618163 Test-Top 5 Accuracy = 0.954112 Test-Loss =  3316082.500000
Epoch: 1980 Accuracy = 0.653028 Top 5 Accuracy = 0.979386 Loss =  1972220.500000
Epoch: 1980 Test-Accuracy = 0.617156 Test-Top 5 Accuracy = 0.952590 Test-Loss =  3329828.750000
Epoch: 1990 Accuracy = 0.654170 Top 5 Accuracy = 0.981032 Loss =  1972091.500000
Epoch: 1990 Test-Accuracy = 0.617695 Test-Top 5 Accuracy = 0.954065 Test-Loss =  3328398.500000
Epoch: 2000 Accuracy = 0.653408 Top 5 Accuracy = 0.980101 Loss =  1966708.125000
Epoch: 2000 Test-Accuracy = 0.618023 Test-Top 5 Accuracy = 0.953433 Test-Loss =  3347232.750000
Epoch: 2010 Accuracy = 0.653151 Top 5 Accuracy = 0.979539 Loss =  1961945.125000
Epoch: 2010 Test-Accuracy = 0.616922 Test-Top 5 Accuracy = 0.952402 Test-Loss =  3346944.000000
Epoch: 2020 Accuracy = 0.653356 Top 5 Accuracy = 0.980294 Loss =  1959715.500000
Epoch: 2020 Test-Accuracy = 0.617437 Test-Top 5 Accuracy = 0.952988 Test-Loss =  3361398.000000
Epoch: 2030 Accuracy = 0.653719 Top 5 Accuracy = 0.980692 Loss =  1954679.750000
Epoch: 2030 Test-Accuracy = 0.617507 Test-Top 5 Accuracy = 0.953386 Test-Loss =  3365930.500000
Epoch: 2040 Accuracy = 0.653918 Top 5 Accuracy = 0.980985 Loss =  1953237.250000
Epoch: 2040 Test-Accuracy = 0.618069 Test-Top 5 Accuracy = 0.953573 Test-Loss =  3374416.500000
Epoch: 2050 Accuracy = 0.654802 Top 5 Accuracy = 0.981963 Loss =  1949512.000000
Epoch: 2050 Test-Accuracy = 0.618561 Test-Top 5 Accuracy = 0.954862 Test-Loss =  3375240.500000
Epoch: 2060 Accuracy = 0.652992 Top 5 Accuracy = 0.979738 Loss =  1947692.750000
Epoch: 2060 Test-Accuracy = 0.617039 Test-Top 5 Accuracy = 0.952308 Test-Loss =  3400486.500000
Epoch: 2070 Accuracy = 0.654919 Top 5 Accuracy = 0.982244 Loss =  1945114.875000
Epoch: 2070 Test-Accuracy = 0.619006 Test-Top 5 Accuracy = 0.954651 Test-Loss =  3385847.500000
Epoch: 2080 Accuracy = 0.653426 Top 5 Accuracy = 0.980505 Loss =  1942137.250000
Epoch: 2080 Test-Accuracy = 0.617671 Test-Top 5 Accuracy = 0.952683 Test-Loss =  3411603.500000
Epoch: 2090 Accuracy = 0.654357 Top 5 Accuracy = 0.981658 Loss =  1938651.875000
Epoch: 2090 Test-Accuracy = 0.618116 Test-Top 5 Accuracy = 0.954159 Test-Loss =  3407766.500000
Epoch: 2100 Accuracy = 0.654146 Top 5 Accuracy = 0.981424 Loss =  1937035.000000
Epoch: 2100 Test-Accuracy = 0.617812 Test-Top 5 Accuracy = 0.953995 Test-Loss =  3423393.500000
Epoch: 2110 Accuracy = 0.654246 Top 5 Accuracy = 0.981682 Loss =  1933944.625000
Epoch: 2110 Test-Accuracy = 0.618140 Test-Top 5 Accuracy = 0.953901 Test-Loss =  3424594.750000
Epoch: 2120 Accuracy = 0.652899 Top 5 Accuracy = 0.980036 Loss =  1931825.625000
Epoch: 2120 Test-Accuracy = 0.616031 Test-Top 5 Accuracy = 0.951863 Test-Loss =  3445383.750000
Epoch: 2130 Accuracy = 0.654773 Top 5 Accuracy = 0.982642 Loss =  1931266.750000
Epoch: 2130 Test-Accuracy = 0.617765 Test-Top 5 Accuracy = 0.955611 Test-Loss =  3435830.500000
Epoch: 2140 Accuracy = 0.654363 Top 5 Accuracy = 0.981331 Loss =  1928146.750000
Epoch: 2140 Test-Accuracy = 0.617624 Test-Top 5 Accuracy = 0.953316 Test-Loss =  3453455.500000
Epoch: 2150 Accuracy = 0.654943 Top 5 Accuracy = 0.982543 Loss =  1925880.500000
Epoch: 2150 Test-Accuracy = 0.617929 Test-Top 5 Accuracy = 0.955026 Test-Loss =  3458761.500000
Epoch: 2160 Accuracy = 0.654591 Top 5 Accuracy = 0.982373 Loss =  1922320.750000
Epoch: 2160 Test-Accuracy = 0.617695 Test-Top 5 Accuracy = 0.954627 Test-Loss =  3457967.000000
Epoch: 2170 Accuracy = 0.654978 Top 5 Accuracy = 0.982262 Loss =  1918719.625000
Epoch: 2170 Test-Accuracy = 0.617905 Test-Top 5 Accuracy = 0.954370 Test-Loss =  3478985.500000
Epoch: 2180 Accuracy = 0.656571 Top 5 Accuracy = 0.983919 Loss =  1917736.500000
Epoch: 2180 Test-Accuracy = 0.619873 Test-Top 5 Accuracy = 0.956642 Test-Loss =  3475361.500000
Epoch: 2190 Accuracy = 0.654480 Top 5 Accuracy = 0.982145 Loss =  1916299.375000
Epoch: 2190 Test-Accuracy = 0.617695 Test-Top 5 Accuracy = 0.954534 Test-Loss =  3479088.500000
Epoch: 2200 Accuracy = 0.655130 Top 5 Accuracy = 0.982572 Loss =  1913632.250000
Epoch: 2200 Test-Accuracy = 0.617929 Test-Top 5 Accuracy = 0.954979 Test-Loss =  3500575.000000
Epoch: 2210 Accuracy = 0.655663 Top 5 Accuracy = 0.983123 Loss =  1914395.500000
Epoch: 2210 Test-Accuracy = 0.618655 Test-Top 5 Accuracy = 0.955752 Test-Loss =  3484969.250000
Epoch: 2220 Accuracy = 0.655774 Top 5 Accuracy = 0.983123 Loss =  1909353.875000
Epoch: 2220 Test-Accuracy = 0.618842 Test-Top 5 Accuracy = 0.955354 Test-Loss =  3513091.250000
Epoch: 2230 Accuracy = 0.656055 Top 5 Accuracy = 0.983257 Loss =  1904775.500000
Epoch: 2230 Test-Accuracy = 0.618491 Test-Top 5 Accuracy = 0.955822 Test-Loss =  3521884.500000
Epoch: 2240 Accuracy = 0.655932 Top 5 Accuracy = 0.983163 Loss =  1903761.875000
Epoch: 2240 Test-Accuracy = 0.618514 Test-Top 5 Accuracy = 0.955845 Test-Loss =  3518323.500000
Epoch: 2250 Accuracy = 0.655628 Top 5 Accuracy = 0.982584 Loss =  1901637.375000
Epoch: 2250 Test-Accuracy = 0.617976 Test-Top 5 Accuracy = 0.954721 Test-Loss =  3540861.750000
Epoch: 2260 Accuracy = 0.655552 Top 5 Accuracy = 0.982654 Loss =  1899311.125000
Epoch: 2260 Test-Accuracy = 0.617952 Test-Top 5 Accuracy = 0.954979 Test-Loss =  3534831.500000
Epoch: 2270 Accuracy = 0.655183 Top 5 Accuracy = 0.982150 Loss =  1899078.625000
Epoch: 2270 Test-Accuracy = 0.617226 Test-Top 5 Accuracy = 0.954323 Test-Loss =  3556326.500000
Epoch: 2280 Accuracy = 0.655768 Top 5 Accuracy = 0.983310 Loss =  1893951.750000
Epoch: 2280 Test-Accuracy = 0.617976 Test-Top 5 Accuracy = 0.955845 Test-Loss =  3549348.000000
Epoch: 2290 Accuracy = 0.658322 Top 5 Accuracy = 0.985160 Loss =  1893266.500000
Epoch: 2290 Test-Accuracy = 0.620295 Test-Top 5 Accuracy = 0.957181 Test-Loss =  3549977.750000
Epoch: 2300 Accuracy = 0.656401 Top 5 Accuracy = 0.983609 Loss =  1894619.375000
Epoch: 2300 Test-Accuracy = 0.618397 Test-Top 5 Accuracy = 0.955518 Test-Loss =  3584580.000000
Epoch: 2310 Accuracy = 0.655733 Top 5 Accuracy = 0.983076 Loss =  1888310.000000
Epoch: 2310 Test-Accuracy = 0.617554 Test-Top 5 Accuracy = 0.955424 Test-Loss =  3569114.000000
Epoch: 2320 Accuracy = 0.657601 Top 5 Accuracy = 0.984557 Loss =  1886903.000000
Epoch: 2320 Test-Accuracy = 0.619639 Test-Top 5 Accuracy = 0.956900 Test-Loss =  3575298.000000
Epoch: 2330 Accuracy = 0.657554 Top 5 Accuracy = 0.984458 Loss =  1886466.000000
Epoch: 2330 Test-Accuracy = 0.619873 Test-Top 5 Accuracy = 0.956337 Test-Loss =  3595119.500000
Epoch: 2340 Accuracy = 0.656418 Top 5 Accuracy = 0.983544 Loss =  1883548.875000
Epoch: 2340 Test-Accuracy = 0.618210 Test-Top 5 Accuracy = 0.955377 Test-Loss =  3594672.000000
Epoch: 2350 Accuracy = 0.655780 Top 5 Accuracy = 0.982783 Loss =  1882102.250000
Epoch: 2350 Test-Accuracy = 0.617414 Test-Top 5 Accuracy = 0.954674 Test-Loss =  3609957.750000
Epoch: 2360 Accuracy = 0.656278 Top 5 Accuracy = 0.983732 Loss =  1888402.875000
Epoch: 2360 Test-Accuracy = 0.618796 Test-Top 5 Accuracy = 0.955658 Test-Loss =  3611172.500000
Epoch: 2370 Accuracy = 0.655985 Top 5 Accuracy = 0.983562 Loss =  1877432.500000
Epoch: 2370 Test-Accuracy = 0.618163 Test-Top 5 Accuracy = 0.955330 Test-Loss =  3623707.750000
Epoch: 2380 Accuracy = 0.656395 Top 5 Accuracy = 0.983855 Loss =  1876925.250000
Epoch: 2380 Test-Accuracy = 0.618163 Test-Top 5 Accuracy = 0.955728 Test-Loss =  3617167.000000
Epoch: 2390 Accuracy = 0.658298 Top 5 Accuracy = 0.985254 Loss =  1873197.625000
Epoch: 2390 Test-Accuracy = 0.619686 Test-Top 5 Accuracy = 0.956923 Test-Loss =  3628210.750000
Epoch: 2400 Accuracy = 0.658749 Top 5 Accuracy = 0.985389 Loss =  1870515.625000
Epoch: 2400 Test-Accuracy = 0.619826 Test-Top 5 Accuracy = 0.957157 Test-Loss =  3647437.000000
Epoch: 2410 Accuracy = 0.659639 Top 5 Accuracy = 0.986010 Loss =  1868539.750000
Epoch: 2410 Test-Accuracy = 0.620529 Test-Top 5 Accuracy = 0.957626 Test-Loss =  3645456.000000
Epoch: 2420 Accuracy = 0.657150 Top 5 Accuracy = 0.984159 Loss =  1876191.500000
Epoch: 2420 Test-Accuracy = 0.618796 Test-Top 5 Accuracy = 0.956127 Test-Loss =  3644602.000000
Epoch: 2430 Accuracy = 0.655932 Top 5 Accuracy = 0.983134 Loss =  1875856.625000
Epoch: 2430 Test-Accuracy = 0.616453 Test-Top 5 Accuracy = 0.954745 Test-Loss =  3683556.500000
Epoch: 2440 Accuracy = 0.660348 Top 5 Accuracy = 0.986531 Loss =  1865729.500000
Epoch: 2440 Test-Accuracy = 0.620997 Test-Top 5 Accuracy = 0.958001 Test-Loss =  3676840.500000
Epoch: 2450 Accuracy = 0.656828 Top 5 Accuracy = 0.983480 Loss =  1860971.125000
Epoch: 2450 Test-Accuracy = 0.617460 Test-Top 5 Accuracy = 0.955166 Test-Loss =  3686473.500000
Epoch: 2460 Accuracy = 0.658304 Top 5 Accuracy = 0.985096 Loss =  1860178.000000
Epoch: 2460 Test-Accuracy = 0.618538 Test-Top 5 Accuracy = 0.956361 Test-Loss =  3677436.500000
Epoch: 2470 Accuracy = 0.657215 Top 5 Accuracy = 0.984089 Loss =  1855072.000000
Epoch: 2470 Test-Accuracy = 0.617648 Test-Top 5 Accuracy = 0.955916 Test-Loss =  3693729.250000
Epoch: 2480 Accuracy = 0.658005 Top 5 Accuracy = 0.984616 Loss =  1853965.750000
Epoch: 2480 Test-Accuracy = 0.618257 Test-Top 5 Accuracy = 0.956548 Test-Loss =  3707164.250000
Epoch: 2490 Accuracy = 0.657882 Top 5 Accuracy = 0.984551 Loss =  1851519.375000
Epoch: 2490 Test-Accuracy = 0.617835 Test-Top 5 Accuracy = 0.956150 Test-Loss =  3711633.250000
Epoch: 2500 Accuracy = 0.657168 Top 5 Accuracy = 0.984358 Loss =  1856212.250000
Epoch: 2500 Test-Accuracy = 0.617905 Test-Top 5 Accuracy = 0.955822 Test-Loss =  3701866.000000
Epoch: 2510 Accuracy = 0.653695 Top 5 Accuracy = 0.980926 Loss =  1851312.750000
Epoch: 2510 Test-Accuracy = 0.614532 Test-Top 5 Accuracy = 0.952262 Test-Loss =  3732493.000000
Epoch: 2520 Accuracy = 0.659610 Top 5 Accuracy = 0.986086 Loss =  1848748.750000
Epoch: 2520 Test-Accuracy = 0.620084 Test-Top 5 Accuracy = 0.957930 Test-Loss =  3724539.000000
Epoch: 2530 Accuracy = 0.659019 Top 5 Accuracy = 0.985717 Loss =  1846959.500000
Epoch: 2530 Test-Accuracy = 0.619686 Test-Top 5 Accuracy = 0.957743 Test-Loss =  3738445.500000
Epoch: 2540 Accuracy = 0.656032 Top 5 Accuracy = 0.982836 Loss =  1843560.625000
Epoch: 2540 Test-Accuracy = 0.616922 Test-Top 5 Accuracy = 0.954932 Test-Loss =  3743749.000000
Epoch: 2550 Accuracy = 0.663967 Top 5 Accuracy = 0.988364 Loss =  1840385.375000
Epoch: 2550 Test-Accuracy = 0.623855 Test-Top 5 Accuracy = 0.960366 Test-Loss =  3750802.000000
Epoch: 2560 Accuracy = 0.661484 Top 5 Accuracy = 0.987380 Loss =  1838597.500000
Epoch: 2560 Test-Accuracy = 0.621114 Test-Top 5 Accuracy = 0.959429 Test-Loss =  3760095.250000
Epoch: 2570 Accuracy = 0.656576 Top 5 Accuracy = 0.982970 Loss =  1837828.500000
Epoch: 2570 Test-Accuracy = 0.616430 Test-Top 5 Accuracy = 0.954815 Test-Loss =  3772278.250000
Epoch: 2580 Accuracy = 0.660330 Top 5 Accuracy = 0.986712 Loss =  1834041.125000
Epoch: 2580 Test-Accuracy = 0.619826 Test-Top 5 Accuracy = 0.958399 Test-Loss =  3771557.500000
Epoch: 2590 Accuracy = 0.660295 Top 5 Accuracy = 0.986601 Loss =  1838473.250000
Epoch: 2590 Test-Accuracy = 0.620014 Test-Top 5 Accuracy = 0.958328 Test-Loss =  3784936.000000
Epoch: 2600 Accuracy = 0.660348 Top 5 Accuracy = 0.987052 Loss =  1841798.000000
Epoch: 2600 Test-Accuracy = 0.620716 Test-Top 5 Accuracy = 0.958633 Test-Loss =  3776836.000000
Epoch: 2610 Accuracy = 0.659868 Top 5 Accuracy = 0.986425 Loss =  1834760.625000
Epoch: 2610 Test-Accuracy = 0.620037 Test-Top 5 Accuracy = 0.958352 Test-Loss =  3801035.000000
Epoch: 2620 Accuracy = 0.656325 Top 5 Accuracy = 0.982847 Loss =  1826926.250000
Epoch: 2620 Test-Accuracy = 0.617109 Test-Top 5 Accuracy = 0.955190 Test-Loss =  3795124.000000
Epoch: 2630 Accuracy = 0.655610 Top 5 Accuracy = 0.982513 Loss =  1824609.000000
Epoch: 2630 Test-Accuracy = 0.616734 Test-Top 5 Accuracy = 0.954440 Test-Loss =  3806889.500000
Epoch: 2640 Accuracy = 0.653748 Top 5 Accuracy = 0.980897 Loss =  1826868.000000
Epoch: 2640 Test-Accuracy = 0.614696 Test-Top 5 Accuracy = 0.952566 Test-Loss =  3821515.750000
Epoch: 2650 Accuracy = 0.662468 Top 5 Accuracy = 0.988007 Loss =  1821247.625000
Epoch: 2650 Test-Accuracy = 0.622098 Test-Top 5 Accuracy = 0.959898 Test-Loss =  3815637.000000
Epoch: 2660 Accuracy = 0.658099 Top 5 Accuracy = 0.985570 Loss =  1820030.750000
Epoch: 2660 Test-Accuracy = 0.618819 Test-Top 5 Accuracy = 0.957110 Test-Loss =  3817045.000000
Epoch: 2670 Accuracy = 0.657496 Top 5 Accuracy = 0.984786 Loss =  1816869.750000
Epoch: 2670 Test-Accuracy = 0.618327 Test-Top 5 Accuracy = 0.956455 Test-Loss =  3838001.500000
Epoch: 2680 Accuracy = 0.662966 Top 5 Accuracy = 0.988674 Loss =  1817508.250000
Epoch: 2680 Test-Accuracy = 0.623410 Test-Top 5 Accuracy = 0.960202 Test-Loss =  3834675.500000
Epoch: 2690 Accuracy = 0.655212 Top 5 Accuracy = 0.982332 Loss =  1814466.000000
Epoch: 2690 Test-Accuracy = 0.615399 Test-Top 5 Accuracy = 0.953831 Test-Loss =  3850601.500000
Epoch: 2700 Accuracy = 0.655487 Top 5 Accuracy = 0.982912 Loss =  1814248.250000
Epoch: 2700 Test-Accuracy = 0.616453 Test-Top 5 Accuracy = 0.954674 Test-Loss =  3845103.000000
Epoch: 2710 Accuracy = 0.662456 Top 5 Accuracy = 0.987807 Loss =  1811359.500000
Epoch: 2710 Test-Accuracy = 0.622192 Test-Top 5 Accuracy = 0.959874 Test-Loss =  3868683.250000
Epoch: 2720 Accuracy = 0.661788 Top 5 Accuracy = 0.987796 Loss =  1812505.000000
Epoch: 2720 Test-Accuracy = 0.621958 Test-Top 5 Accuracy = 0.959453 Test-Loss =  3861417.500000
Epoch: 2730 Accuracy = 0.664640 Top 5 Accuracy = 0.988393 Loss =  1805953.875000
Epoch: 2730 Test-Accuracy = 0.624019 Test-Top 5 Accuracy = 0.960811 Test-Loss =  3874309.000000
Epoch: 2740 Accuracy = 0.655944 Top 5 Accuracy = 0.983240 Loss =  1814995.500000
Epoch: 2740 Test-Accuracy = 0.616641 Test-Top 5 Accuracy = 0.955119 Test-Loss =  3865395.500000
Epoch: 2750 Accuracy = 0.660248 Top 5 Accuracy = 0.987228 Loss =  1809850.000000
Epoch: 2750 Test-Accuracy = 0.620435 Test-Top 5 Accuracy = 0.958727 Test-Loss =  3890314.500000
Epoch: 2760 Accuracy = 0.659106 Top 5 Accuracy = 0.986478 Loss =  1802453.250000
Epoch: 2760 Test-Accuracy = 0.619545 Test-Top 5 Accuracy = 0.957555 Test-Loss =  3889498.000000
Epoch: 2770 Accuracy = 0.657642 Top 5 Accuracy = 0.984874 Loss =  1799563.500000
Epoch: 2770 Test-Accuracy = 0.618093 Test-Top 5 Accuracy = 0.955822 Test-Loss =  3895853.000000
Epoch: 2780 Accuracy = 0.659669 Top 5 Accuracy = 0.986466 Loss =  1798764.125000
Epoch: 2780 Test-Accuracy = 0.619873 Test-Top 5 Accuracy = 0.957555 Test-Loss =  3902038.250000
Epoch: 2790 Accuracy = 0.657666 Top 5 Accuracy = 0.984967 Loss =  1794555.875000
Epoch: 2790 Test-Accuracy = 0.618046 Test-Top 5 Accuracy = 0.956009 Test-Loss =  3910401.250000
Epoch: 2800 Accuracy = 0.663563 Top 5 Accuracy = 0.988633 Loss =  1805250.250000
Epoch: 2800 Test-Accuracy = 0.623480 Test-Top 5 Accuracy = 0.960109 Test-Loss =  3921772.500000
Epoch: 2810 Accuracy = 0.664219 Top 5 Accuracy = 0.988165 Loss =  1793099.375000
Epoch: 2810 Test-Accuracy = 0.623504 Test-Top 5 Accuracy = 0.959898 Test-Loss =  3919615.250000
Epoch: 2820 Accuracy = 0.660459 Top 5 Accuracy = 0.987339 Loss =  1791986.500000
Epoch: 2820 Test-Accuracy = 0.620295 Test-Top 5 Accuracy = 0.958914 Test-Loss =  3931441.000000
Epoch: 2830 Accuracy = 0.658708 Top 5 Accuracy = 0.985787 Loss =  1790927.375000
Epoch: 2830 Test-Accuracy = 0.619264 Test-Top 5 Accuracy = 0.956736 Test-Loss =  3928774.750000
Epoch: 2840 Accuracy = 0.660313 Top 5 Accuracy = 0.986706 Loss =  1786639.500000
Epoch: 2840 Test-Accuracy = 0.620060 Test-Top 5 Accuracy = 0.957860 Test-Loss =  3946427.750000
Epoch: 2850 Accuracy = 0.665495 Top 5 Accuracy = 0.988393 Loss =  1790617.250000
Epoch: 2850 Test-Accuracy = 0.624769 Test-Top 5 Accuracy = 0.959453 Test-Loss =  3953658.750000
Epoch: 2860 Accuracy = 0.659106 Top 5 Accuracy = 0.985377 Loss =  1795613.125000
Epoch: 2860 Test-Accuracy = 0.619545 Test-Top 5 Accuracy = 0.956876 Test-Loss =  3938747.750000
Epoch: 2870 Accuracy = 0.649596 Top 5 Accuracy = 0.971797 Loss =  1801798.750000
Epoch: 2870 Test-Accuracy = 0.611300 Test-Top 5 Accuracy = 0.942939 Test-Loss =  3958055.000000
Epoch: 2880 Accuracy = 0.660717 Top 5 Accuracy = 0.987878 Loss =  1788744.250000
Epoch: 2880 Test-Accuracy = 0.621419 Test-Top 5 Accuracy = 0.958703 Test-Loss =  3965945.500000
Epoch: 2890 Accuracy = 0.661718 Top 5 Accuracy = 0.986607 Loss =  1781995.250000
Epoch: 2890 Test-Accuracy = 0.621700 Test-Top 5 Accuracy = 0.958118 Test-Loss =  3968165.000000
Epoch: 2900 Accuracy = 0.662737 Top 5 Accuracy = 0.987128 Loss =  1777899.000000
Epoch: 2900 Test-Accuracy = 0.622590 Test-Top 5 Accuracy = 0.958328 Test-Loss =  3974060.250000
Epoch: 2910 Accuracy = 0.661215 Top 5 Accuracy = 0.986911 Loss =  1775375.000000
Epoch: 2910 Test-Accuracy = 0.621021 Test-Top 5 Accuracy = 0.958094 Test-Loss =  3978348.500000
Epoch: 2920 Accuracy = 0.661425 Top 5 Accuracy = 0.986654 Loss =  1773545.625000
Epoch: 2920 Test-Accuracy = 0.620857 Test-Top 5 Accuracy = 0.957930 Test-Loss =  3986303.500000
Epoch: 2930 Accuracy = 0.661683 Top 5 Accuracy = 0.986759 Loss =  1771951.000000
Epoch: 2930 Test-Accuracy = 0.621068 Test-Top 5 Accuracy = 0.958047 Test-Loss =  3991109.500000
Epoch: 2940 Accuracy = 0.661215 Top 5 Accuracy = 0.986525 Loss =  1771377.125000
Epoch: 2940 Test-Accuracy = 0.620482 Test-Top 5 Accuracy = 0.957883 Test-Loss =  3999385.500000
Epoch: 2950 Accuracy = 0.662433 Top 5 Accuracy = 0.987128 Loss =  1769342.250000
Epoch: 2950 Test-Accuracy = 0.621817 Test-Top 5 Accuracy = 0.958422 Test-Loss =  4005133.750000
Epoch: 2960 Accuracy = 0.664570 Top 5 Accuracy = 0.988001 Loss =  1778581.250000
Epoch: 2960 Test-Accuracy = 0.623785 Test-Top 5 Accuracy = 0.960202 Test-Loss =  3992227.000000
Epoch: 2970 Accuracy = 0.659956 Top 5 Accuracy = 0.986367 Loss =  1768702.625000
Epoch: 2970 Test-Accuracy = 0.619662 Test-Top 5 Accuracy = 0.957532 Test-Loss =  4018618.500000
Epoch: 2980 Accuracy = 0.659060 Top 5 Accuracy = 0.985389 Loss =  1771573.250000
Epoch: 2980 Test-Accuracy = 0.619428 Test-Top 5 Accuracy = 0.957345 Test-Loss =  4012916.000000
Epoch: 2990 Accuracy = 0.652536 Top 5 Accuracy = 0.978004 Loss =  1767169.375000
Epoch: 2990 Test-Accuracy = 0.613174 Test-Top 5 Accuracy = 0.949404 Test-Loss =  4021825.000000

Optimization Finished!
Training Accuracy= 0.664746
Training Top 5 Accuracy= 0.988352

Testing Accuracy= 0.623668
Testing Top 5 Accuracy= 0.959898
In [ ]:
testing_predictions, testing_labels = session.run([tf.argmax(predictions,1), tf.argmax(labels,1)], 
                                                  feed_dict={features: inputX_test,
                                                             labels: inputY_test,
                                                             pkeep: 1})

print(classification_report(testing_labels, testing_predictions, target_names=y.columns))
                           precision    recall  f1-score   support

   country_destination_AU       0.01      0.04      0.01       110
   country_destination_CA       0.02      0.10      0.03       290
   country_destination_DE       0.02      0.11      0.03       219
   country_destination_ES       0.03      0.17      0.05       456
   country_destination_FR       0.06      0.08      0.07       998
   country_destination_GB       0.02      0.15      0.04       449
   country_destination_IT       0.04      0.18      0.06       565
  country_destination_NDF       1.00      1.00      1.00     24984
   country_destination_NL       0.01      0.06      0.02       155
   country_destination_PT       0.00      0.02      0.01        44
   country_destination_US       0.72      0.08      0.15     12427
country_destination_other       0.12      0.11      0.12      1994

              avg / total       0.80      0.62      0.64     42691

In [ ]:
#Plot accuracies and cost summary
f, (ax1, ax2, ax3) = plt.subplots(3, 1, sharex=True, figsize=(10,8))

ax1.plot(accuracy_summary)
ax1.plot(test_accuracy_summary)
ax1.set_title('Top 1 Accuracy')

ax2.plot(accuracy_top5_summary)
ax2.plot(test_accuracy_top5_summary)
ax2.set_title('Top 5 Accuracy')

ax3.plot(loss_summary)
ax3.plot(test_loss_summary)
ax3.set_title('Loss')

plt.xlabel('Epochs (x10)')
plt.show()
In [ ]:
# Find the probabilities for each prediction
test_final = df_test.as_matrix()
final_probabilities = session.run(predictions, feed_dict={features: test_final,
                                                          pkeep: 1})
In [ ]:
# Explore some of the predictions
final_probabilities[0]
Out[ ]:
array([  7.34245255e-21,   5.02517741e-30,   1.42874395e-25,
         2.40796284e-15,   4.88082264e-07,   2.60900578e-07,
         1.31230571e-09,   9.99996901e-01,   3.10756943e-27,
         3.81869185e-14,   1.90081550e-06,   4.43023225e-07], dtype=float32)
In [ ]:
# Encode the labels for the countries
le = LabelEncoder()
fit_labels = le.fit_transform(train.country_destination) 

# Get the ids for the test data
test_getIDs = pd.read_csv("test_users.csv")
testIDs = test_getIDs['id']

ids = []  #list of ids
countries = []  #list of countries
for i in range(len(testIDs)):
    # Select the 5 countries with highest probabilities
    idx = testIDs[i]
    ids += [idx] * 5
    countries += le.inverse_transform(np.argsort(final_probabilities[i])[::-1])[:5].tolist()
    if i % 10000 == 0:
        print ("Percent complete: {}%".format(round(i / len(test),4)*100))

#Generate submission
submission = pd.DataFrame(np.column_stack((ids, countries)), columns=['id', 'country'])
submission.to_csv('submission.csv',index=False)
Percent complete: 0.0%
Percent complete: 16.1%
Percent complete: 32.21%
Percent complete: 48.309999999999995%
Percent complete: 64.42%
Percent complete: 80.52%
Percent complete: 96.61999999999999%
In [ ]:
# Check some of the submissions
submission.head(25)
Out[ ]:
id country
0 5uwns89zht NDF
1 5uwns89zht US
2 5uwns89zht FR
3 5uwns89zht other
4 5uwns89zht GB
5 jtl0dijy2j NDF
6 jtl0dijy2j US
7 jtl0dijy2j IT
8 jtl0dijy2j other
9 jtl0dijy2j FR
10 xx0ulgorjt NDF
11 xx0ulgorjt FR
12 xx0ulgorjt US
13 xx0ulgorjt other
14 xx0ulgorjt IT
15 6c6puo6ix0 NDF
16 6c6puo6ix0 FR
17 6c6puo6ix0 US
18 6c6puo6ix0 other
19 6c6puo6ix0 IT
20 czqhjk3yfe NDF
21 czqhjk3yfe other
22 czqhjk3yfe US
23 czqhjk3yfe FR
24 czqhjk3yfe IT
In [ ]:
# Compare the submission's distribution to the training data's distribution.
# Given that the data was randomly split, 
# a more equal distribution should lead to better scores in the Kaggle competition.
submission.country.value_counts()
Out[ ]:
US       62096
NDF      62074
FR       61792
other    61527
IT       45328
GB       15275
ES        1363
NL         679
AU         280
DE          60
CA           3
PT           3
Name: country, dtype: int64
In [ ]:
train.country_destination.value_counts()
Out[ ]:
NDF      124543
US        62376
other     10094
FR         5023
IT         2835
GB         2324
ES         2249
CA         1428
DE         1061
NL          762
AU          539
PT          217
Name: country_destination, dtype: int64

Summary

Based on Kaggle's evaluation method (https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings#evaluation), the neural network scores just under 0.86 when all of the training data is used. The winning submission scored 0.88697 and the sample submission scored 0.68411. Looking at the leaderboard and some kernels, the most common algorithm for the better scores was XGBoost. Given that the purpose of this analyis was to further my knowledge of TensorFlow (in addition to the other aspects of machine learning - i.e. feature engineering, cleaning data, etc.), I do not feel the need to use XGBoost to try to make a better prediction. I am rather pleased with this model on a whole, given its ability to accurately predict which country a user will make his/her first trip in. The 'lazy' prediction method would be to use the top and top 5 most common countries for the predictions. This would equal an accuracy score of 58.35% for the top predictions and 95.98% for the top 5 predictions. For the testing data, my top predictions scored a higher accuracy of 62.37%, as well as for the top 5 predictions, at 95.99%. My predictions are also more useful since they make use of all twelve countries, instead of just the five most common.

In [ ]: