# The SPHERE Challenge: Activity Recognition with Multimodal Sensor Data¶

This is a guest post by Niall Twomey of the SPHERE project's machine learning team. You can [download the notebook](https://gist.github.com/drivendata/d44783726edad76d4eaa07f9c0b10481) here to follow along. Enjoy!

Welcome to the SPHERE Challenge! We are very excited to be working with DrivenData, the ECML-PKDD conference and the AARP foundation with this challenge!

My name is Niall Twomey, and I am a postdoctoral researcher on the machine learning with the SPHERE project. In this post I'm going to first walk you through how to work with the data we provide, and then I'm going to provide an implmentation of a baseline classification algorithm. This baseline isn't by any means going to win the challenge :-) but it will show you how you can work with the data, as there is a bit of work needed to transform the raw data in the tabular formats that we are most familiar with in maching learning problems.

There are four main sections in this post:

1. I will start by visualising of all of the data modalities (a crucial initial step in all prediction problems!);
2. Then I will extract very simple features from all of the sensor modalities.
3. Due to the data collection system, we will need to deal with missing data with data imputation; and
4. Finally after we get the data into a nice tabular format, we will test a k-nearest neighbour based classifier on the test data.

Time to start! :-)

In [1]:
from __future__ import print_function

# For number crunching
import numpy as np
import pandas as pd

# For visualisation
import matplotlib.pyplot as pl
import seaborn as sns

# For prediction
import sklearn

# Misc
from itertools import cycle
import json
import os

# Magic and formatting
% matplotlib inline

sns.set_context('poster')
sns.set_style('darkgrid')

current_palette = cycle(sns.color_palette())


# Module versions¶

These are the library versions we worked with to produce our results. 
In [2]:
print('  numpy version: {}'.format(np.__version__))
print(' pandas version: {}'.format(pd.__version__))
print('seaborn version: {}'.format(sns.__version__))
print('   json version: {}'.format(json.__version__))
print('sklearn version: {}'.format(sklearn.__version__))

  numpy version: 1.10.4
pandas version: 0.18.0
seaborn version: 0.7.0
json version: 2.0.9
sklearn version: 0.17.1


# Visualisation¶

Before doing any prediction, I think it will be helpful to do visualisation on the data itself. To do this, we will use a set of classes that we wrote for easy access to the data. These are stored in a Python file called visualise_data.py (available here) and are quite interpretable.

We will visualise the training sequence identified by 00001. This is an arbitrary selection, and the cells below will work on all records (though if visualising the test instances the ground truth won't be shown, of course).

├── public_data
│   ├── accelerometer_axes.json
│   ├── access_point_names.json
│   ├── annotations.json
│   ├── pir_locations.json
│   ├── rooms.json
│   ├── sample_submission.csv
│   ├── train
│   │   ├── 00001
│   │   ├── ...
│   │   └── 00010
│   ├── test
│   │   ├── 00011
│   │   ├── ...
│   │   └── 00882
│   ├── video_feature_names.json
│   └── video_locations.json

We use a helper class called SequenceVisualisation that can be located in the visualise_data.py file. This class will load in all of the available data, let you slice training data by activity, will plot some of the sensor data and annotations (when they are available), amongst other super useful things!

We use three main sensor types in this challenge: PIR sensors, accelerometers, and video. We will show you what those data look like in this first, then show how you might perform feature extraction on the data, and finally we will then show how you might classify data!

In [3]:
from visualise_data import SequenceVisualisation

plotter = SequenceVisualisation('../public_data', '../public_data/train/00001')
sequence_window = (plotter.meta['start'], plotter.meta['end'])

# We can extract the time range of the activity 'jump' with the 'times_of_activity' function.
# This function returns all of the times that jump was annotated, and it is indexed first
# by the annotator, and then by the time at which it occurred.
times_of_jump = plotter.times_of_activity('a_jump')

# You can get the times at which these.
for ai, annotator_jumps in enumerate(times_of_jump):
print ('Annotator {}'.format(ai))

# The annotator_jumps is a list of tuples. The length of this list specifies the number of segments as
# annotated as 'a_jump' by this annotator. Each element of this list is a tuple that holds the start and
# end time of the t-th annotation in that order, ie:
for ti, (start, end) in enumerate(annotator_jumps, start=1):
print ('  Annotation {}'.format(ti))
print ('    Start time: {}'.format(start))
print ('    End time:   {}'.format(end))
print ('    Duration:   {}'.format(end - start))
print ()

# The sequence object also holds metadata regarding the length of the sequence.
sequence_window = (plotter.meta['start'], plotter.meta['end'])
print (sequence_window)

Annotator 0
Annotation 1
Start time: 1774.195
End time:   1776.739
Duration:   2.5440000000000964

Annotator 1
Annotation 1
Start time: 1775.6870000000001
End time:   1776.348
Duration:   0.6609999999998308

Annotation 2
Start time: 1776.901
End time:   1777.87
Duration:   0.9689999999998236

Annotator 2
Annotation 1
Start time: 1774.974
End time:   1778.844
Duration:   3.8700000000001182

(0, 1823.9170000000001)


# Plotting location.¶

The black lines in the image below are the times during which the PIR sensors were activated. The red, green, and blue lines here are then the annotations provided by the three annotators.

Note PIR sensors can elicit false negative and false positive activations - ie they do not always trigger when someone is present, and they can trigger when someone is not present due to the infrared radiation from the sun in warm weather.

In this challenge, evaluation does not consider prediction of location, but knowing the location will help in classification performance, eg someone on the staircase is more likely to be ascending/descending the stairs rather than walking flat, and they would be very unlikely to be sitting. Location information is provided in the training data sets only.

In [4]:
plotter.plot_pir(sequence_window, sharey=True)


This function plots the received signal strength indicator (RSSI) of the wireless transmission. RSSI is a measure of how strongly the accelerometer's signal was received. As a simple rule, the closer you are to the receiver (which we call an access point), the more power there is in the received signal (but this rule doesn't necessarily apply if your body gets in the way of the accelerometer and the access point!).

There are four access points located in the SPHERE house, and these are located in:

1. the kitchen;
2. the loung;
3. upstairs; and
4. in the study.

RSSI is a useful indication of location (together with PIR data). In the image below, the RSSI data is plotted together with the locations. There is a very strong correspondance, for example, between the lounge trace (green) and the annotation of presence in the lounge. And in locations without an access point (eg bed2) you can see patterns in the RSSI signal that should be indicative of your location.

In [5]:
plotter.plot_rssi(sequence_window)


# Plotting acceleration.¶

This plots the acceleration data (in the continuous line traces) and annotated activities.

Here we can see another interesting aspect of the data, namely that while the annotators agree quite a lot of the time, often there is disagreement regarding the annotated activities. There are various different types of disagreement, and one of the most interesting is the disagreement of the specific start and end times of the activities.

In [6]:
plotter.plot_acceleration((sequence_window[0] + 180, sequence_window[0] + 300))


# Plotting video data.¶

The visualisation below shows the video data that we give with this challenge. We do not give the raw video data (in order to procet the anonymity of the participants), so we give coordinate of 2D and 3D bounding boxes, and the centre of these boxes. These are quite coarse as features go, but there is certainly valuable information in the features that you might extract from these bounding boxes (including volumn, height, aspect ratios, etc). To understand what all of the columns are, see §2.2 from the following paper:

Niall Twomey, Tom Diethe, Meelis Kull, Hao Song, Massimo Camplani, Sion Hannuna, Xenofon Fafoutis, Ni Zhu, Pete Woznowski, Peter Flach, Ian Craddock: “The SPHERE Challenge: Activity Recognition with Multimodal Sensor Data”, 2016;arXiv:1603.00797.

In the images below, we plot the the centre of the 2D and 3D bounding boxes. The cameras are located in the hallway, living_room and the kitchen respectively. The red, blue and green horozontal lines in the images correspond to the ground truth location annotations that are provided, and these correspond to the right hand axis labels. eg, at 500 seconds, the annotations state that the participant was in the kitchen. The other lines indicate the values of the features as the participant moves throughout the room.

In [7]:
plotter.plot_video(plotter.centre_2d, sequence_window)
pl.gcf().suptitle('2D bounding box')

plotter.plot_video(plotter.centre_3d, sequence_window)
pl.gcf().suptitle('3D bounding box')

Out[7]:
<matplotlib.text.Text at 0x11acba320>

# Visualising the classification targets¶

We now take a look at the ground truth targets that accompany the training sequences. The labels that are annotated (20 in total) are described in the main page. In the image below, we plot the first two minutes of the target data. In the image, the x-axis is over time, and the y-axis is over probabilities. Each subplot represents a particular activity, and we only plot the activities which have been annotated over this time window so as to save vertical space.

One of the most important things to take away from this image is that the targets are not 0/1, but they are probabilities, and the probilities represent the average label assigned by the annotators.

In [8]:
annotation_names = plotter.targets.columns.difference(['start', 'end'])

# Select only the first minute of data
sub_df = plotter.targets.ix[:60 * 2]

# Select only the columns that are non-empty
sub_df_cols = [col for col in annotation_names if sub_df[col].sum() > 0]

# Plot a bar-plot w
current_palette = cycle(sns.color_palette())
sub_df[sub_df_cols].plot(
kind='bar',
subplots=True,
sharex=True,
sharey=True,
figsize=(20, 3 * len(sub_df_cols)),
width=1.0,
color=[next(current_palette) for _ in annotation_names]
);


# Extracting Features¶

Now that we understand the data and the classification targets better, it's time to do some feature extraction. We have intentionally given the data in a very raw state so that the participants of the challenge are free to do as much feature engineering (or feature modelling!) as possible. So, there is a bit of work that needs to be done in order to get the data into a standard tabular format that you would expect to see in a classification task.

One useful member function in the Sequence class from the visualise_data.py file is iterate. This is a function that iterates through all of the data in the sequence one second at a time, and it returns pandas dataframes for each sensor modality (environmental, accelerometer, video), and we extract features from these dataframes.

In this cell, we will extract very simple features from each modalitiy, namely the mean, standard deviation, minimum, median, and maxiumum values of all of the sensors. These are very simple features, and not all of them really make sense for all modalities, but this is just a baseline afterall!

In [9]:
from visualise_data import Sequence

import warnings
warnings.filterwarnings('ignore')

"""
For every data modality, we will extract some very simple features: the mean, min, max, median, and standard
deviation of one second windows. We will put function pointers in a list called 'feature_functions' so that we
can easily call these on our data for later
"""
feature_functions = [np.mean, np.std, np.min, np.median, np.max]
feature_names = ['mean', 'std', 'min', 'median', 'max']

# We will keep the number of extracted feature functions as a parameter
num_ff = len(feature_functions)

# We will want to keep track of the feature names for later, so we will collect these in the following list:
column_names = []

# These are the modalities that are available in the dataset, and the .iterate() function returns the data
# in this order
modality_names = ['acceleration', 'rssi', 'pir', 'video_living_room', 'video_kitchen', 'video_hallway']

"""
Iterate over all training directories
"""
for train_test in ('train', 'test', ):
#     if train_test is 'train':
#         print ('Extracting features from training data.\n')
#      else:
#         print ('\n\n\nExtracting features from testing data.\n')

for fi, file_id in enumerate(os.listdir('../public_data/{}/'.format(train_test))):
stub_name = str(file_id).zfill(5)

#         if train_test is 'train' or np.mod(fi, 50) == 0:
#             print ("Starting feature extraction for {}/{}".format(train_test, stub_name))

# Use the sequence loader to load the data from the directory.
data = Sequence('../public_data', '../public_data/{}/{}'.format(train_test, stub_name))

"""
Populate the column_name list here. This needs to only be done on the first iteration
because the column names will be the same between datasets.
"""
if len(column_names) == 0:
for lu, modalities in data.iterate():
for modality, modality_name in zip(modalities, modality_names):
for column_name, column_data in modality.transpose().iterrows():
for feature_name in feature_names:
column_names.append('{0}_{1}_{2}'.format(modality_name, column_name, feature_name))

# Break here
break

"""
Here, we will extract some features from the data. We will use the Sequence.iterate function.

This function partitions the data into one-second dataframes for the full data sequence. The first element
(which we call lu) represents the lower and upper times of the sliding window, and the second is a list of
dataframes, one for each modality. The dataframes may be empty (due to missing data), and feature extraction
has to be able to cope with this!

The list rows will store the features extracted for this dataset
"""
rows = []

for ri, (lu, modalities) in enumerate(data.iterate()):
row = []

"""
Iterate over the sensing modalities. The order is given in modality_names.

Note: If you want to treat each modality differently, you can do the following:

for ri, (lu, (accel, rssi, pir, vid_lr, vid_k, vid_h)) in enumerate(data.iterate()):
row.extend(extract_accel(accel))
row.extend(extract_pir(pir))
row.extend(extract_video(vid_lr, vid_k, vid_h))

where the extract_accel/extract_rssi/extract_pir/extract_video functions are designed by you.
In the case here, we extract the same features from each modality, but optimal performance will
probably be achieved by considering different features for each modality.
"""

for modality in modalities:
"""
The accelerometer dataframe, for example, has three columns: x, y, and z. We want to extract features
from all of these, and so we iterate over the columns here.
"""
for name, column_data in modality.transpose().iterrows():
if len(column_data) > 3:
"""
Extract the features stored in feature_functions on the column data if there is sufficient
data in the dataframe.
"""
row.extend(map(lambda ff: ff(column_data), feature_functions))

else:
"""
If no data is available, put nan placeholders to keep the column widths consistent
"""
row.extend([np.nan] * num_ff)

# Do a quick sanity check to ensure that the feature names and number of extracted features match
assert len(row) == len(column_names)

# Append the row to the full set of features
rows.append(row)

# Report progress
#             if train_test is 'train':
#                 if np.mod(ri + 1, 50) == 0:
#                     print ("{:5}".format(str(ri + 1))),

#                 if np.mod(ri + 1, 500) == 0:
#                     print

"""
At this stage we have extracted a bunch of simple features from the data. In real implementation,
it would be advisable to look at more interesting features, eg

We will save these features to a new file called 'columns.csv' for use later. This file will be located
in the name of the training sequence.
"""
df = pd.DataFrame(rows)
df.columns = column_names
df.to_csv('../public_data/{}/{}/columns.csv'.format(train_test, stub_name), index=False)

#         if train_test is 'train' or np.mod(fi, 50) == 0:
#             if train_test is 'train': print
#             print ("Finished feature extraction for {}/{}\n".format(train_test, stub_name))


# Feature Extraction¶

Now that we have a simple tabular set of features on hand, we can do some machine learning!

## Things to consider¶

There are a number of things that we will need to keep in mind:

1. we have missing data: we will handle this problem by imputing the mean column values with sklearn.preprocessing.Imputer class. In general you might want to be cleaverer than this.
2. the labels are probabilistic: this means that the targets are not just 0 and 1 because the labels are averaged over multiple annotators, and as we saw from the data visualisations, they do not always agree!

## The selected model¶

To make predictions we will use a very simple model. First, we will partition the data into training and testing sets. Secondly, we will learn a k-nearest neighbours (KNN) model on the training features. To predict with new test instances, then, we will select the neighbours from the training set that are nearest to the new instance, and the label we predict is the average label from those training instances.

# Preliminaries: data imputation¶

In [10]:
"""
We will define two convenience function to load the extracted features and their
"""

filename = str(file_id).zfill(5)

data = df.values

return data, target

x_es = []
y_es = []

for file_id in file_ids:

x_es.append(data)
y_es.append(target)

return np.row_stack(x_es), np.row_stack(y_es)

# Load the training and testing data
train_x, train_y = load_sequences([1, 2, 3, 4, 5, 6, 7, 8])

print ("Check whether the train/test features are all finite (before imputation)")
print ('All training data finite:', np.all(np.isfinite(train_x)))
print ('All testing data finite:', np.all(np.isfinite(test_x)))
print

# We will want to impute the missing data
from sklearn.preprocessing import Imputer
imputer = Imputer()
imputer.fit(train_x)

train_x = imputer.transform(train_x)
test_x = imputer.transform(test_x)

print ("Check whether the train/test features are all finite (after imputation)")
print ('All training data finite:', np.all(np.isfinite(train_x)))
print ('All testing data finite:', np.all(np.isfinite(test_x)))
print

n_classes = len(labels)

"""
Note, not all data is annotated, so we select only the annotated rows
"""
train_y_has_annotation = np.isfinite(train_y.sum(1))
train_x = train_x[train_y_has_annotation]
train_y = train_y[train_y_has_annotation]

test_y_has_annotation = np.isfinite(test_y.sum(1))
test_x = test_x[test_y_has_annotation]
test_y = test_y[test_y_has_annotation]

"""
Print simple statistics regarding the number of instances
"""
print ("Training data shapes:")
print ("train_x.shape: {}".format(train_x.shape))
print ("train_y.shape: {}".format(train_y.shape))
print

print ("Testing data shapes")
print ("test_x.shape: {}".format(test_x.shape))
print ("test_y.shape: {}".format(test_y.shape))
print

Check whether the train/test features are all finite (before imputation)
All training data finite: False
All testing data finite: False
Check whether the train/test features are all finite (after imputation)
All training data finite: True
All testing data finite: True
Training data shapes:
train_x.shape: (12520, 305)
train_y.shape: (12520, 20)
Testing data shapes
test_x.shape: (3604, 305)
test_y.shape: (3604, 20)

Out[10]:
<function print>

# Class weights¶

In this work, we have uneven class distributions. This means that some of the activities were performed more frequently than other activities. As an example, if 99% of the time the activity was 'sitting', we could use a very simple 'prior' model that always predicts 'sitting', and we would get 99% accuracy even without learning anything. Clearly this isn't a very useful classifier, so in order to encourage the exploration of more sophisticated models that have to predict all classes, we put more emphasis on the classes that are harder to classify.

In the cell below we load the activity names and their associated weights. We also plot the prior distribution of the training data. We can see that p_sit occurrs nearly 20% of the time, and that p_kneel occurrs about 2% of the time. These two activities are given weights of 0.2 and 1 respectively, meaning that participants are rewarded approximately 5 times more for correctly predicting a jump activity over a sit activity.

In [11]:
activity_names = json.load(open('../public_data/annotations.json', 'r'))

class_prior = train_y.mean(0)

df = pd.DataFrame({
'Activity': activity_names,
'Class Weight': class_weights,
'Prior Class Distribution': class_prior
})

df.set_index('Activity', inplace=True)
# reset colour palette
current_palette = cycle(sns.color_palette())
df.plot(
kind='bar',
width=1.0,
subplots=True,
color=[next(current_palette), next(current_palette)],
)

df

Out[11]:
Class Weight Prior Class Distribution
Activity
a_ascend 1.352985 0.008286
a_descend 1.386846 0.007952
a_jump 1.595874 0.002636
a_walk 0.347784 0.100441
p_bent 0.661082 0.043648
p_kneel 1.047236 0.018865
p_lie 0.398865 0.083713
p_sit 0.207586 0.185152
p_squat 1.505783 0.003379
p_stand 0.110181 0.369017
t_bend 1.078033 0.016238
t_kneel_stand 1.365604 0.008440
t_lie_sit 1.170241 0.013070
t_sit_lie 1.193364 0.012110
t_sit_stand 1.180370 0.012330
t_stand_kneel 1.344149 0.008851
t_stand_sit 1.116838 0.014649
t_straighten 1.080839 0.015714
t_turn 0.503152 0.070158