KaggleチュートリアルTitanicで上位3%以内に入るには。(0.82297)

わかりますでしょうか？姑息なことにFareが一つ欠損しております。これを埋めなかったが為に何度エラーが起きたことでしょう。気をつけてください。こういうところでもチュートリアルなのかもしれません。

test["Age"].fillna(train.Age.mean(), inplace=True)
test["Fare"].fillna(train.Fare.mean(), inplace=True)

combine = [test]
for test in combine:
    test['Salutation'] = test.Name.str.extract(' ([A-Za-z]+)\.', expand=False)
for test in combine:
    test['Salutation'] = test['Salutation'].replace(['Lady', 'Countess','Capt', 'Col',\
         'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    test['Salutation'] = test['Salutation'].replace('Mlle', 'Miss')
    test['Salutation'] = test['Salutation'].replace('Ms', 'Miss')
    test['Salutation'] = test['Salutation'].replace('Mme', 'Mrs')
    del test['Name']
Salutation_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}

for test in combine:
    test['Salutation'] = test['Salutation'].map(Salutation_mapping)
    test['Salutation'] = test['Salutation'].fillna(0)

for test in combine:
        test['Ticket_Lett'] = test['Ticket'].apply(lambda x: str(x)[0])
        test['Ticket_Lett'] = test['Ticket_Lett'].apply(lambda x: str(x))
        test['Ticket_Lett'] = np.where((test['Ticket_Lett']).isin(['1', '2', '3', 'S', 'P', 'C', 'A']), test['Ticket_Lett'],
                                   np.where((test['Ticket_Lett']).isin(['W', '4', '7', '6', 'L', '5', '8']),
                                            '0', '0'))
        test['Ticket_Len'] = test['Ticket'].apply(lambda x: len(x))
        del test['Ticket']
test['Ticket_Lett']=test['Ticket_Lett'].replace("1",1).replace("2",2).replace("3",3).replace("0",0).replace("S",3).replace("P",0).replace("C",3).replace("A",3) 

for test in combine:
        test['Cabin_Lett'] = test['Cabin'].apply(lambda x: str(x)[0])
        test['Cabin_Lett'] = test['Cabin_Lett'].apply(lambda x: str(x))
        test['Cabin_Lett'] = np.where((test['Cabin_Lett']).isin(['T', 'H', 'G', 'F', 'E', 'D', 'C', 'B', 'A']),test['Cabin_Lett'],
                                   np.where((test['Cabin_Lett']).isin(['W', '4', '7', '6', 'L', '5', '8']),
                                            '0','0'))        
        del test['Cabin']
test['Cabin_Lett']=test['Cabin_Lett'].replace("A",1).replace("B",2).replace("C",1).replace("0",0).replace("D",2).replace("E",2).replace("F",1).replace("G",1) 

test["FamilySize"] = train["SibSp"] + train["Parch"] + 1

for test in combine:
    test['IsAlone'] = 0
    test.loc[test['FamilySize'] == 1, 'IsAlone'] = 1
    
test_data = test.values
xs_test = test_data[:, 1:]

.py

さて、ここまでで面倒だった前処理が終わりました。もちろんもっといい処理の方法があったと思いますが、初心者ではこんなもんです。ここからがメインの学習タイムです。様々な学習方法を試した結果RandomForestClassifierを使うのが一番結果が私的には良かったです。ではやってみましょう。

from sklearn.ensemble import RandomForestClassifier

random_forest=RandomForestClassifier()
random_forest.fit(xs, y)
Y_pred = random_forest.predict(xs_test)

import csv
with open("predict_result_data.csv", "w") as f:
    writer = csv.writer(f, lineterminator='\n')
    writer.writerow(["PassengerId", "Survived"])
    for pid, survived in zip(test_data[:,0].astype(int), Y_pred.astype(int)):
        writer.writerow([pid, survived])

.py

from sklearn.ensemble import RandomForestClassifier

random_forest=RandomForestClassifier() random_forest.fit(xs, y) Y_pred = random_forest.predict(xs_test)

import csv with open("predict_result_data.csv", "w") as f: writer = csv.writer(f, lineterminator=’\n’) writer.writerow(["PassengerId", "Survived"]) for pid, survived in zip(test_data[:,0].astype(int), Y_pred.astype(int)): writer.writerow([pid, survived])

提出してみたでしょうか？どういう結果になったでしょうか？0.8を超えられたでしょうか？きっと超えていないと思います。0.82297なんてなっていないと思います。タイトル詐欺？いえ、違います。色々な学習方法を試しているうちに気がつきました。「あれ？毎回結果変わってないか？」そうです。RandomForestClassifierには自由に変えられるパラメータがたくさんあります。これをいじっていくことでどんどん結果が変わります。最初の壁は0.8を超えられるかです。その後は1つ結果を更新するのもしんどくなっていきます。私は、0.82297で止まりました。これ以上はもっとデータ処理を変えるか、もっといいパラメータを見つけるかしかないと思います。パラメータの探索方法をお教えしましょう。

from sklearn.ensemble import RandomForestClassifier
from sklearn import grid_search
from sklearn.grid_search import GridSearchCV

"""
parameters = {
        'n_estimators'      : [10,25,50,75,100],
        'random_state'      : [0],
        'n_jobs'            : [4],
        'min_samples_split' : [5,10, 15, 20,25, 30],
        'max_depth'         : [5, 10, 15,20,25,30]
}

clf = grid_search.GridSearchCV(RandomForestClassifier(), parameters)
clf.fit(xs, y)
 
print(clf.best_estimator_)
"""

.py

今回いじるのはn_estimators、min_samples_splitとmax_depthです。ちなみに上のコードを走らせると時間がかなりかかります。random_stateは初期状態を固定するためのものなのでとりあえず0にしました。n_jobsは計算に使うCPUの数なので適切のものにしてください。パラメータをいじりながら最適なものを探し、それを提出し、結果を確認する。これを何度も繰り返していいものを見つけてください。今現在私の最高を出したパラメータを下に示します。ただ、これが再現性があるのかどうかは知りません。

random_forest=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=25, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=15,
            min_weight_fraction_leaf=0.0, n_estimators=51, n_jobs=4,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

.py

4.データ分析

　ここからはデータの分析をして行きたいと思います。タイタニックでは、約1500人が犠牲となり、生存者は約700人であるそうです。なぜこれほどまで死者が出たのか？その原因の一つには救命ボートが船に乗っていた人の半分を載せられるぐらいしかなかったことでしょう。
それでは実際にデータをみていきましょう。はじめに、性別によってどれほど生存に違いが出たかをみてみましょう。下の図をみてください。0が男性、1が女性です。明らかに女性の生存率が高いですね。そして男性の人数がかなり多いですね。これは推測ですが、乗組員に男性が多かったからではないでしょうか。

%matplotlib inline 
import matplotlib.pyplot as plt
import seaborn as sns
g = sns.factorplot(x="Sex", y="Survived",  data=train,
                   size=6, kind="bar", palette="muted")
g.despine(left=True)
g = g.set_ylabels("survival probability")