Data Cleaning¶
We load the relevant files, then carry out a series of manipulations, visualisations and checks to understand and analyse the data before running the clustering algorithm. We also consolidate the data into a single source for the clustering algorithm.
Importing required libraries¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
Data Import¶
Our data is split across 12 files, grouped by trial length (95, 100 and 150 trials). For example, wi_95 contains the win on each of the 95 trials for every subject in the 95-trial studies, lo_95 contains the corresponding losses, index_95 records which study each subject belongs to, and choice_95 records which deck each subject chose on each trial.
These files contain data from 617 healthy subjects who participated in the Iowa Gambling Task. The participants are spread across 10 independent studies with varying numbers of participants and trial lengths:
Study | Participants | Trials
---|---|---
Fridberg et al. | 15 | 95
Horstmann et al. | 162 | 100
Kjome et al. | 19 | 100
Maia & McClelland | 40 | 100
Premkumar et al. | 25 | 100
Steingroever et al. | 70 | 100
Steingroever et al. | 57 | 150
Wetzels et al. | 41 | 150
Wood et al. | 153 | 100
Worthy et al. | 35 | 100
#import raw data into dataframes
wi_95 = pd.read_csv('../data/wi_95.csv')
wi_100 = pd.read_csv('../data/wi_100.csv')
wi_150 = pd.read_csv('../data/wi_150.csv')
lo_95 = pd.read_csv('../data/lo_95.csv')
lo_100 = pd.read_csv('../data/lo_100.csv')
lo_150 = pd.read_csv('../data/lo_150.csv')
index_95 = pd.read_csv('../data/index_95.csv')
index_100 = pd.read_csv('../data/index_100.csv')
index_150 = pd.read_csv('../data/index_150.csv')
choice_95 = pd.read_csv('../data/choice_95.csv')
choice_100 = pd.read_csv('../data/choice_100.csv')
choice_150 = pd.read_csv('../data/choice_150.csv')
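As a quick check against the participant counts in the table above, a small sketch tallying the index files we have just loaded:
#tally subjects per study in each index file to compare with the table above
print(index_95['Study'].value_counts())
print(index_100['Study'].value_counts())
print(index_150['Study'].value_counts())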
Inspection of the raw data¶
We display the wins, losses, index and choices for the trials of length 95 to understand the current format of the data.
#Display wins
wi_95.head()
Wins_1 | Wins_2 | Wins_3 | Wins_4 | Wins_5 | Wins_6 | Wins_7 | Wins_8 | Wins_9 | Wins_10 | ... | Wins_86 | Wins_87 | Wins_88 | Wins_89 | Wins_90 | Wins_91 | Wins_92 | Wins_93 | Wins_94 | Wins_95 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Subj_1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | ... | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 |
Subj_2 | 100 | 100 | 50 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | ... | 50 | 100 | 100 | 100 | 100 | 100 | 50 | 50 | 50 | 50 |
Subj_3 | 50 | 50 | 50 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | ... | 100 | 100 | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 |
Subj_4 | 50 | 50 | 100 | 100 | 100 | 100 | 100 | 50 | 100 | 100 | ... | 100 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 |
Subj_5 | 100 | 100 | 50 | 50 | 50 | 100 | 100 | 100 | 100 | 100 | ... | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 |
5 rows × 95 columns
#Display losses
lo_95.head()
Losses_1 | Losses_2 | Losses_3 | Losses_4 | Losses_5 | Losses_6 | Losses_7 | Losses_8 | Losses_9 | Losses_10 | ... | Losses_86 | Losses_87 | Losses_88 | Losses_89 | Losses_90 | Losses_91 | Losses_92 | Losses_93 | Losses_94 | Losses_95 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Subj_1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1250 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -250 | 0 | 0 |
Subj_2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | -50 | -300 | 0 | -350 | 0 | 0 | 0 | 0 | 0 | -25 |
Subj_3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -150 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | -250 | 0 | 0 | 0 |
Subj_4 | 0 | 0 | 0 | 0 | -150 | 0 | 0 | 0 | 0 | 0 | ... | 0 | -50 | 0 | -50 | -50 | 0 | -25 | 0 | 0 | 0 |
Subj_5 | 0 | 0 | 0 | 0 | 0 | 0 | -150 | 0 | 0 | 0 | ... | -75 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 95 columns
#Display index
index_95.head()
Subj | Study | |
---|---|---|
0 | 1 | Fridberg |
1 | 2 | Fridberg |
2 | 3 | Fridberg |
3 | 4 | Fridberg |
4 | 5 | Fridberg |
#Display choices
choice_95.head()
Choice_1 | Choice_2 | Choice_3 | Choice_4 | Choice_5 | Choice_6 | Choice_7 | Choice_8 | Choice_9 | Choice_10 | ... | Choice_86 | Choice_87 | Choice_88 | Choice_89 | Choice_90 | Choice_91 | Choice_92 | Choice_93 | Choice_94 | Choice_95 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Subj_1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | ... | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
Subj_2 | 1 | 2 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | ... | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 4 | 4 | 3 |
Subj_3 | 3 | 4 | 3 | 2 | 2 | 1 | 1 | 1 | 1 | 2 | ... | 2 | 2 | 2 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
Subj_4 | 4 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 2 | 2 | ... | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 4 | 4 | 4 |
Subj_5 | 1 | 2 | 3 | 4 | 3 | 1 | 1 | 2 | 2 | 2 | ... | 3 | 3 | 4 | 4 | 3 | 4 | 4 | 4 | 4 | 4 |
5 rows × 95 columns
Data manipulations¶
We want to consolidate the data into a single dataframe with certain features. The first step is to aggregate the choices for all subjects into single columns for each card deck.
#Count values for each choice by deck
agg_choice_95 = choice_95.apply(pd.Series.value_counts, axis=1)
agg_choice_100 = choice_100.apply(pd.Series.value_counts, axis=1)
agg_choice_150 = choice_150.apply(pd.Series.value_counts, axis=1)
#Display aggregated choice data
agg_choice_95.head()
1 | 2 | 3 | 4 | |
---|---|---|---|---|
Subj_1 | 12 | 9 | 3 | 71 |
Subj_2 | 24 | 26 | 12 | 33 |
Subj_3 | 12 | 35 | 10 | 38 |
Subj_4 | 11 | 34 | 12 | 38 |
Subj_5 | 10 | 24 | 15 | 46 |
We add columns with the total wins and total losses for each subject, calculated from the raw wins and losses data imported earlier.
#calculate total wins and total losses for each subject
agg_choice_95["tot_win"] = wi_95.sum(axis=1)
agg_choice_95["tot_los"] = lo_95.sum(axis=1)
agg_choice_100["tot_win"] = wi_100.sum(axis=1)
agg_choice_100["tot_los"] = lo_100.sum(axis=1)
agg_choice_150["tot_win"] = wi_150.sum(axis=1)
agg_choice_150["tot_los"] = lo_150.sum(axis=1)
#resetting index for concatenation in the next cell
agg_choice_95.reset_index(inplace=True)
agg_choice_100.reset_index(inplace=True)
agg_choice_150.reset_index(inplace=True)
We then attach the index dataframe so that we know which study each subject came from; this will aid our analysis and visualisations later on.
agg_choice_95.head()
index | 1 | 2 | 3 | 4 | tot_win | tot_los | |
---|---|---|---|---|---|---|---|
0 | Subj_1 | 12 | 9 | 3 | 71 | 5800 | -4650 |
1 | Subj_2 | 24 | 26 | 12 | 33 | 7250 | -7925 |
2 | Subj_3 | 12 | 35 | 10 | 38 | 7100 | -7850 |
3 | Subj_4 | 11 | 34 | 12 | 38 | 7000 | -7525 |
4 | Subj_5 | 10 | 24 | 15 | 46 | 6450 | -6350 |
#Concatenate index to identify the study
final_95 = pd.concat([agg_choice_95, index_95], axis=1)
final_100 = pd.concat([agg_choice_100, index_100], axis=1)
final_150 = pd.concat([agg_choice_150, index_150], axis=1)
final_95.head()
index | 1 | 2 | 3 | 4 | tot_win | tot_los | Subj | Study | |
---|---|---|---|---|---|---|---|---|---|
0 | Subj_1 | 12 | 9 | 3 | 71 | 5800 | -4650 | 1 | Fridberg |
1 | Subj_2 | 24 | 26 | 12 | 33 | 7250 | -7925 | 2 | Fridberg |
2 | Subj_3 | 12 | 35 | 10 | 38 | 7100 | -7850 | 3 | Fridberg |
3 | Subj_4 | 11 | 34 | 12 | 38 | 7000 | -7525 | 4 | Fridberg |
4 | Subj_5 | 10 | 24 | 15 | 46 | 6450 | -6350 | 5 | Fridberg |
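Because this concatenation pairs rows purely by position, a quick sketch (shown here for the 95-trial data) confirms that the choice labels and the Subj column still line up:
#check positional alignment: the 'Subj_n' label should match Subj == n in every row
aligned = (final_95['index'] == 'Subj_' + final_95['Subj'].astype(str)).all()
print(aligned)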
We inspect the new dataframes to see whether our aggregation has created any null values.
#Display summary info for our data types across the three dataframes we have created
final_95.info()
final_100.info()
final_150.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 index 15 non-null object
1 1 15 non-null int64
2 2 15 non-null int64
3 3 15 non-null int64
4 4 15 non-null int64
5 tot_win 15 non-null int64
6 tot_los 15 non-null int64
7 Subj 15 non-null int64
8 Study 15 non-null object
dtypes: int64(7), object(2)
memory usage: 1.2+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 504 entries, 0 to 503
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 index 504 non-null object
1 1 504 non-null int64
2 2 504 non-null int64
3 3 504 non-null int64
4 4 504 non-null int64
5 tot_win 504 non-null int64
6 tot_los 504 non-null int64
7 Subj 504 non-null int64
8 Study 504 non-null object
dtypes: int64(7), object(2)
memory usage: 35.6+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98 entries, 0 to 97
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 index 98 non-null object
1 1 96 non-null float64
2 2 96 non-null float64
3 3 98 non-null float64
4 4 96 non-null float64
5 tot_win 98 non-null int64
6 tot_los 98 non-null int64
7 Subj 98 non-null int64
8 Study 98 non-null object
dtypes: float64(4), int64(3), object(2)
memory usage: 7.0+ KB
This reveals six null values within the final_150 dataframe. Further inspection of these null values may give us an insight into some participants' choices in this task.
sample = final_150[final_150[1].isnull()]
sample.head(2)
index | 1 | 2 | 3 | 4 | tot_win | tot_los | Subj | Study | |
---|---|---|---|---|---|---|---|---|---|
7 | Subj_8 | NaN | NaN | 150.0 | NaN | 7500 | -3750 | 8 | Steingroever2011 |
56 | Subj_57 | NaN | NaN | 150.0 | NaN | 7500 | -3750 | 57 | Steingroever2011 |
Note¶
Interestingly, this step has shown that two participants, both from the Steingroever2011 study, chose deck 3 for all 150 trials. It would be interesting to know whether this was a tactical choice made with prior knowledge of how the Iowa Gambling Task is set up, i.e. with decks 3 and 4 giving the best returns. It may also have been a tactic to finish the game quickly by a participant who was uninterested in the task itself. Either way, this deviates slightly from the expected results of the Iowa Gambling Task, in which subjects sample a variety of decks but ultimately learn which decks are more rewarding.
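As a quick check, a short sketch using the already-loaded choice_150 dataframe lists every subject in the 150-trial data who picked the same deck on every trial:
#list subjects in the 150-trial data who picked the same deck on every trial
single_deck = choice_150[choice_150.nunique(axis=1) == 1]
print(single_deck.index.tolist())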
We replace these null cells with 0 across the data.
#replace all NaN values with '0'
final_150[1] = final_150[1].fillna(0)
final_150[2] = final_150[2].fillna(0)
final_150[4] = final_150[4].fillna(0)
We convert the deck-count columns from type float back to type int.
#convert float to int
final_150[1] = final_150[1].astype(int)
final_150[2] = final_150[2].astype(int)
final_150[3] = final_150[3].astype(int)
final_150[4] = final_150[4].astype(int)
We then bring all three dataframes together into a single dataframe named ‘final’. This is comprised of:
Trials of length 95
Trials of length 100
Trials of length 150
We can then apply manipulations to all of this data at once.
#creation of a single dataframe with all 617 subjects across all studies
final = pd.concat([final_95, final_100, final_150])
final
index | 1 | 2 | 3 | 4 | tot_win | tot_los | Subj | Study | |
---|---|---|---|---|---|---|---|---|---|
0 | Subj_1 | 12 | 9 | 3 | 71 | 5800 | -4650 | 1 | Fridberg |
1 | Subj_2 | 24 | 26 | 12 | 33 | 7250 | -7925 | 2 | Fridberg |
2 | Subj_3 | 12 | 35 | 10 | 38 | 7100 | -7850 | 3 | Fridberg |
3 | Subj_4 | 11 | 34 | 12 | 38 | 7000 | -7525 | 4 | Fridberg |
4 | Subj_5 | 10 | 24 | 15 | 46 | 6450 | -6350 | 5 | Fridberg |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
93 | Subj_94 | 24 | 69 | 13 | 44 | 12150 | -11850 | 94 | Wetzels |
94 | Subj_95 | 5 | 31 | 46 | 68 | 9300 | -7150 | 95 | Wetzels |
95 | Subj_96 | 18 | 19 | 37 | 76 | 9350 | -7900 | 96 | Wetzels |
96 | Subj_97 | 25 | 30 | 44 | 51 | 10250 | -9050 | 97 | Wetzels |
97 | Subj_98 | 11 | 104 | 6 | 29 | 13250 | -15050 | 98 | Wetzels |
617 rows × 9 columns
As can be seen from the above output, the dataframe contains 617 rows (one for each participant). However, we don't yet have a unique identifier for each subject; there is a Subject 1 in each of the three trial-length dataframes. We add a column with a unique ID for each of the 617 participants to overcome this. We also calculate the balance for each participant at the end of their trials, which is the total wins plus the total losses.
#calculate balance and add Unique ID for participants
final['Unique_ID'] = (np.arange(len(final))+1)
final["balance"] = final["tot_win"] + final["tot_los"]
final = final.rename(columns={1:'Deck_A', 2:'Deck_B', 3:'Deck_C', 4:'Deck_D'})
final
index | Deck_A | Deck_B | Deck_C | Deck_D | tot_win | tot_los | Subj | Study | Unique_ID | balance | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Subj_1 | 12 | 9 | 3 | 71 | 5800 | -4650 | 1 | Fridberg | 1 | 1150 |
1 | Subj_2 | 24 | 26 | 12 | 33 | 7250 | -7925 | 2 | Fridberg | 2 | -675 |
2 | Subj_3 | 12 | 35 | 10 | 38 | 7100 | -7850 | 3 | Fridberg | 3 | -750 |
3 | Subj_4 | 11 | 34 | 12 | 38 | 7000 | -7525 | 4 | Fridberg | 4 | -525 |
4 | Subj_5 | 10 | 24 | 15 | 46 | 6450 | -6350 | 5 | Fridberg | 5 | 100 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
93 | Subj_94 | 24 | 69 | 13 | 44 | 12150 | -11850 | 94 | Wetzels | 613 | 300 |
94 | Subj_95 | 5 | 31 | 46 | 68 | 9300 | -7150 | 95 | Wetzels | 614 | 2150 |
95 | Subj_96 | 18 | 19 | 37 | 76 | 9350 | -7900 | 96 | Wetzels | 615 | 1450 |
96 | Subj_97 | 25 | 30 | 44 | 51 | 10250 | -9050 | 97 | Wetzels | 616 | 1200 |
97 | Subj_98 | 11 | 104 | 6 | 29 | 13250 | -15050 | 98 | Wetzels | 617 | -1800 |
617 rows × 11 columns
Add Payoff to Dataframe¶
We are interested in the payoff schemes used across the different Iowa Gambling Task studies. Three different payoff schemes were used across the studies.
Payoff 1¶
This scheme gives a net result of +250 for every 10 choices from the good decks (C & D) and a net result of -250 for every 10 choices from the bad decks (A & B). Decks A & C produce frequent losses while decks B & D produce infrequent losses. Deck C has a variable loss of -25, -50 or -75, and the wins and losses follow a fixed sequence.
Payoff 2¶
Payoff 2 is a variant of the first scheme with two changes: the loss from deck C is constant at -50, and the sequence of wins and losses is random across the trials.
Payoff 3¶
Payoff 3 keeps the base structure of bad decks A & B and good decks C & D, with A & C producing frequent losses and B & D producing infrequent losses, but the decks change every 10 trials. The bad decks (A & B) start by producing a net -250 over 10 picks, and this decreases by 150 with every further block of 10 trials, finishing at -1000 for 10 cards chosen from the bad decks. Similarly, the good decks start at +250 per 10 picks and increase by 25 with every block of 10 trials, finishing at +375. Both outcomes are therefore magnified as the trials progress, but the punishment for choosing a bad deck grows far faster than the reward for choosing a good deck. Payoff 3 also has a difference in wins between decks, and its sequence of wins and losses is fixed.
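To make the payoff 3 arithmetic concrete, a minimal sketch of the net result per block of 10 picks implied by the description above (the six-block count is our reading of "-250 decreasing to -1000"; these numbers restate the description and are not taken from the raw files):
#net result per block of 10 picks under payoff scheme 3, as described above
blocks = range(6)                                # six blocks take the bad decks from -250 to -1000
bad_per_10 = [-250 - 150 * b for b in blocks]    # decks A & B: -250, -400, ..., -1000
good_per_10 = [250 + 25 * b for b in blocks]     # decks C & D: +250, +275, ..., +375
print(bad_per_10)
print(good_per_10)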
This variation in reward and punishment from scheme to scheme is an interesting factor in the outcomes of our project, so we will explore how much of an impact it has within our clustering analysis.
#payoff data
data = [['Fridberg', 1],['Horstmann', 2],['Kjome', 3],['Maia', 1],['SteingroverInPrep', 2],['Premkumar', 3],['Wood', 3],['Worthy', 1],['Steingroever2011', 2],['Wetzels', 2]]
# Create the pandas DataFrame
payoff = pd.DataFrame(data, columns = ['Study', 'Payoff'])
# print dataframe.
payoff
Study | Payoff | |
---|---|---|
0 | Fridberg | 1 |
1 | Horstmann | 2 |
2 | Kjome | 3 |
3 | Maia | 1 |
4 | SteingroverInPrep | 2 |
5 | Premkumar | 3 |
6 | Wood | 3 |
7 | Worthy | 1 |
8 | Steingroever2011 | 2 |
9 | Wetzels | 2 |
We join the final dataframe with the dataframe containing the payoff values
final = final.join(payoff.set_index('Study'), on='Study')
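A quick sanity check (assuming the study names in the payoff dataframe exactly match those in final) that every row received a payoff value after the join:
#a study-name mismatch in the join would leave NaN in the Payoff column
print(final['Payoff'].isnull().sum())
print(final.groupby('Payoff')['Unique_ID'].count())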
Export data¶
This final dataframe is in the appropriate format for clustering and for some initial visualisations and analysis. We export the data for use in the clustering notebook and then proceed with our exploration.
final.to_csv('../data/cleaned_data.csv')
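Note that to_csv writes the row index by default; a minimal sketch of how the clustering notebook might re-import the file without picking up an extra column (the reloaded name is only for illustration):
#re-import check: the unnamed first column is the exported row index
reloaded = pd.read_csv('../data/cleaned_data.csv', index_col=0)
reloaded.head()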
Exploration of the data¶
We want to further understand the data set and see whether there are any insights or trends we can explore during our clustering. A number of summaries and graphs can help us understand the make-up of this data and which questions we want to answer within this project.
final.describe()
Deck_A | Deck_B | Deck_C | Deck_D | tot_win | tot_los | Subj | Unique_ID | balance | Payoff | |
---|---|---|---|---|---|---|---|---|---|---|
count | 617.000000 | 617.000000 | 617.000000 | 617.000000 | 617.000000 | 617.000000 | 617.000000 | 617.000000 | 617.000000 | 617.000000 |
mean | 15.813614 | 33.388979 | 26.320908 | 32.296596 | 8087.252836 | -8244.084279 | 214.312804 | 309.000000 | -156.831442 | 2.173420 |
std | 7.999550 | 17.599469 | 21.428470 | 18.170402 | 1555.720966 | 2357.633329 | 154.912903 | 178.256837 | 1251.585443 | 0.660141 |
min | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 5300.000000 | -18800.000000 | 1.000000 | 1.000000 | -4250.000000 | 1.000000 |
25% | 10.000000 | 22.000000 | 14.000000 | 21.000000 | 7150.000000 | -9550.000000 | 70.000000 | 155.000000 | -1000.000000 | 2.000000 |
50% | 16.000000 | 31.000000 | 21.000000 | 29.000000 | 7750.000000 | -8100.000000 | 196.000000 | 309.000000 | -170.000000 | 2.000000 |
75% | 21.000000 | 41.000000 | 30.000000 | 40.000000 | 8480.000000 | -6650.000000 | 350.000000 | 463.000000 | 650.000000 | 3.000000 |
max | 48.000000 | 143.000000 | 150.000000 | 135.000000 | 14750.000000 | -2725.000000 | 504.000000 | 617.000000 | 3750.000000 | 3.000000 |
We can see that the mean choices for decks B, C & D are higher than for deck A, showing a clear trend of participants away from A and towards the good decks C & D. Interestingly, the mean total wins is smaller in magnitude than the mean total losses, and hence the participants' mean balance was -156.83.
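One way to quantify that pull towards the good decks is the share of each subject's picks that came from decks C and D; a small exploratory sketch (not part of the cleaning pipeline):
#proportion of each subject's picks coming from the good decks (C & D)
deck_cols = ['Deck_A', 'Deck_B', 'Deck_C', 'Deck_D']
good_share = (final['Deck_C'] + final['Deck_D']) / final[deck_cols].sum(axis=1)
print(round(good_share.mean(), 3))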
Analysis by Individual¶
#create histogram for wins
plt.hist(final["tot_win"], bins=100, color="green")
plt.title("Distribution of Wins")
plt.xlabel("Wins Value")
plt.ylabel("Count")
plt.show()
#create histogram for losses
plt.hist(final["tot_los"], bins=100, color="red")
plt.title("Distribution of Losses")
plt.xlabel("Losses Value")
plt.ylabel("Count")
plt.show()


We can see that the wins have a right-skewed distribution, with a small number of subjects winning more than 10,000. The losses are similarly distributed but with a left skew. The presence of trials of length 150 alongside trials of length 95/100 is likely part of the cause.
#filter to trials of length 95/100 and create histogram for wins
to_keep= ['Fridberg','Horstmann','Kjome','Maia','SteingroverInPrep','Premkumar','Wood','Worthy']
filtered_df = final[final['Study'].isin(to_keep)]
plt.hist(filtered_df["tot_win"], bins=100, color="green")
plt.title("Distribution of Wins for trials of length 95/100")
plt.xlabel("Wins Value")
plt.ylabel("Count")
plt.show()
#create histogram for losses
plt.hist(filtered_df["tot_los"], bins=100, color="red")
plt.title("Distribution of Losses for trials of length 95/100")
plt.xlabel("Losses Value")
plt.ylabel("Count")
plt.show()
filtered_df.describe()


Deck_A | Deck_B | Deck_C | Deck_D | tot_win | tot_los | Subj | Unique_ID | balance | Payoff | |
---|---|---|---|---|---|---|---|---|---|---|
count | 519.000000 | 519.000000 | 519.000000 | 519.000000 | 519.000000 | 519.000000 | 519.000000 | 519.000000 | 519.000000 | 519.000000 |
mean | 15.132948 | 30.901734 | 23.354528 | 30.466281 | 7575.211946 | -7832.080925 | 245.433526 | 260.000000 | -256.868979 | 2.206166 |
std | 7.125167 | 13.933238 | 14.957389 | 14.430446 | 881.667949 | 1960.716200 | 149.256183 | 149.966663 | 1169.048170 | 0.715170 |
min | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 5300.000000 | -14800.000000 | 1.000000 | 1.000000 | -4250.000000 | 1.000000 |
25% | 10.000000 | 22.000000 | 13.000000 | 21.000000 | 7045.000000 | -9012.500000 | 115.500000 | 130.500000 | -1032.500000 | 2.000000 |
50% | 15.000000 | 30.000000 | 20.000000 | 28.000000 | 7600.000000 | -7800.000000 | 245.000000 | 260.000000 | -300.000000 | 2.000000 |
75% | 20.000000 | 39.000000 | 28.000000 | 38.000000 | 8100.000000 | -6512.500000 | 374.500000 | 389.500000 | 550.000000 | 3.000000 |
max | 38.000000 | 79.000000 | 84.000000 | 92.000000 | 10750.000000 | -2725.000000 | 504.000000 | 519.000000 | 3570.000000 | 3.000000 |
We can see that once we remove the trials of length 150, the right skew in the wins histogram reduces, as does the left skew in the losses histogram.
Analysis by Payoff¶
We now look at the distribution of the balance across the different payoff schemes.
#plot the balance distribution for each payoff scheme
i = 1
while i < 4:
    pay1 = final['Payoff'] == i
    temp = final[pay1]
    plt.hist(temp['balance'], bins=100)
    plt.title("Payoff " + str(i) + " Balance Distribution")
    plt.xlabel("Balance")
    plt.ylabel("Count")
    plt.show()
    i += 1
temp.describe()



We can see that payoff 2 has a distribution centred just above 0, while payoffs 1 and 3 have negative centres, with payoff 3 showing the worst results. This may be due to the sharp increase in penalty that occurs with every 10 cards chosen under payoff 3.
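The same comparison in table form, as a quick sketch using the final dataframe built above:
#summary of balance by payoff scheme
final.groupby('Payoff')['balance'].describe()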
#Show wins and losses by Payoff
plt.scatter(final["Payoff"], final["tot_win"], label = "Total Wins", color="g")
plt.scatter(final["Payoff"], final["tot_los"], label = "Total Losses", color="r")
plt.legend(ncol=1, loc='upper left')
plt.title("Wins/Losses by Payoff")
plt.xlabel("Payoff")
plt.ylabel("Value")
plt.xticks([1, 2, 3])
plt.show()

Payoff 2 shows the widest spread across both wins and losses, while payoff 3 shows a very tight distribution of wins but a wide spread of losses, owing to its reward/punishment structure evolving every 10 cards.
Analysis by Study¶
Another theory is that the study a subject took part in may affect their results. We will explore the data with each study in mind to see whether any obvious trends emerge. It may also be a useful clustering exercise to see whether subjects cluster with others from their study.
#Show wins and losses by Study
plt.scatter(final["Study"], final["tot_win"], label = "Total Wins", color="g")
plt.scatter(final["Study"], final["tot_los"], label = "Total Losses", color="r")
plt.xticks(rotation=90)
plt.legend(ncol=2)
plt.title("Wins/Losses by Study")
plt.xlabel("Study")
plt.ylabel("Value")
plt.show()

Steingroever2011 and Wetzels, as the only studies with trials of length 150, show the greatest variety in wins and losses across their participants. The Fridberg group shows limited wins but also limited losses. This group has only 15 participants, which may be a factor, although Kjome shows a wider variety of wins and losses with 19 participants.
#show balance by study
plt.scatter(final["Study"], final["balance"], label = "Balance")
plt.xticks(rotation=90)
plt.legend()
plt.title("Balance by Study")
plt.show()

Steingroever2011, Wood and Premkumar show the widest spread of positive and negative balances among their participants. Fridberg and Kjome again differ despite their similarly small group sizes. We can also see that the two participants from Steingroever2011 who chose deck 3 on all 150 trials achieved the greatest profit, 3,750, amongst all participants.
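To back this up numerically, a brief sketch summarising balance per study and pulling out the row with the largest balance:
#balance summary per study and the single most profitable participant
print(final.groupby('Study')['balance'].agg(['min', 'mean', 'max']))
print(final.loc[final['balance'].idxmax()])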
Analysis of deck choices¶
One of the features we have included in our dataframe is the number of choices of each deck. We know there are good decks and bad decks with varying rewards and punishments so this will definitely be a feature that will affect our clustering algorithm.
plt.scatter(final["Unique_ID"], final['Deck_D'], label = "Deck D", s=20)
plt.scatter(final["Unique_ID"], final['Deck_C'], label = "Deck C", s=20)
plt.scatter(final["Unique_ID"], final['Deck_B'], label = "Deck B", s=20)
plt.scatter(final["Unique_ID"], final['Deck_A'], label = "Deck A", s=20)
plt.title("Deck choices by Subject")
plt.xlabel("Subjects")
plt.ylabel("Count")
plt.legend()
plt.show()

We can see above the tendency of participants to choose decks 2/3/4 more than 60 times, whereas deck 1 was never chosen that many times by any participant. We can also see that a number of participants chose the same deck in close to 100% of their trials. The large increase in choice counts from subject 500 onwards is due to the trials of length 150.
There appears to be an unnatural pattern of choices between subjects 300-480, with counts repeatedly landing at exactly 60. We will explore this data in more detail:
filter1 = final["Unique_ID"]>300
filter2 = final["Unique_ID"]<480
final_zoom = final.where(filter1&filter2, inplace=False)
plt.scatter(final_zoom["Unique_ID"], final_zoom['Deck_D'], label = "Deck D", s=30)
plt.scatter(final_zoom["Unique_ID"], final_zoom['Deck_C'], label = "Deck C", s=30)
plt.scatter(final_zoom["Unique_ID"], final_zoom['Deck_B'], label = "Deck B", s=30)
plt.scatter(final_zoom["Unique_ID"], final_zoom['Deck_A'], label = "Deck A", s=30)
plt.title("Deck choices by Subject")
plt.xlabel("Subjects")
plt.ylabel("Choice")
plt.legend()
plt.show()

Closer inspection of this subset of participants shows an unusual tendency to choose a specific deck exactly 60 times. This occurs across decks 2/3/4, but again deck 1 is neglected. It may have been a stipulation of the study that no deck could be chosen more than 60 times.
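One quick way to probe the "capped at 60" idea is to count, per study, how many subjects picked some deck exactly 60 times; an exploratory sketch:
#count subjects with exactly 60 picks of some deck, broken down by study
deck_cols = ['Deck_A', 'Deck_B', 'Deck_C', 'Deck_D']
hit_60 = (final[deck_cols] == 60).any(axis=1)
print(final.loc[hit_60, 'Study'].value_counts())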
plt.scatter(final["Study"], final['Deck_D'], label = "Deck D")
plt.scatter(final["Study"], final['Deck_C'], label = "Deck C")
plt.scatter(final["Study"], final['Deck_B'], label = "Deck B")
plt.scatter(final["Study"], final['Deck_A'], label = "Deck A")
plt.xticks(rotation=90)
plt.legend()
plt.title("Deck Choice by Study")
plt.xlabel("Study")
plt.ylabel("No. of Choices")
plt.show()

Steingroever2011 had the two most successful participants and also the two highest single-deck counts, with two members selecting deck C on 150 of 150 trials. A number of other participants in that study also chose a single deck more than 100 times.
plt.scatter(final["Payoff"], final['Deck_D'], label = "Deck D")
plt.scatter(final["Payoff"], final['Deck_C'], label = "Deck C")
plt.scatter(final["Payoff"], final['Deck_B'], label = "Deck B")
plt.scatter(final["Payoff"], final['Deck_A'], label = "Deck A")
plt.legend()
plt.xticks([1, 2, 3])
plt.title("Deck Choice by Payoff")
plt.xlabel("Payoff")
plt.ylabel("No. of Choices")
plt.show()

Interestingly, payoff 2 shows a much wider spread of deck-choice counts than payoff 1 or payoff 3. The distributions across these three schemes are clearly quite different, so it will be interesting to see whether the clustering reflects this.