The Human Activity Language
We empirically
discovered that the space of human actions has a linguistic structure. This
is a sensory-motor space consisting of the evolution of joint angles of the
human body in movement. The space of human activity has its own phonemes,
morphemes, and sentences. We present a Human Activity Language (HAL) for
symbolic non-arbitrary representation of sensory and motor information of
human activity. This language was learned from large amounts of motion
capture data. Kinetology,
the phonology of human movement, finds basic primitives for human motion
(segmentation) and associates them with symbols (symbolization). This way,
kinetology provides a symbolic representation for human movement that allows
synthesis, analysis, and symbolic manipulation. We introduce a kinetological
system and propose five basic principles on which such a system should be
based: compactness, view-invariance, reproducibility, selectivity, and
reconstructivity. We demonstrate the kinetological properties of our
sensory-motor primitives. Further evaluation is accomplished with experiments
on compression and decompression of motion data. The
morphology of a human action
relates to the inference of essential parts of movement (morpho-kinetology)
and its structure (morpho-syntax). To learn morphemes and their structure, we
present a grammatical inference methodology and introduce a parallel learning
algorithm to induce a grammar system representing a single action. The
algorithm infers components of the grammar system as a subset of essential
actuators, a context-free grammar (CFG) for the language of each component representing the
motion pattern performed in a single actuator, and synchronization rules
modeling coordination among actuators. The syntax of human activities involves the construction of sentences
using action morphemes. A sentence may range from a single action morpheme
(nuclear syntax) to a sequence of sets of morphemes. A single morpheme is
decomposed into analogs of lexical categories: nouns, adjectives, verbs, and
adverbs. The sets of morphemes represent simultaneous actions (parallel
syntax) and a sequence of movements is related to the concatenation of
activities (sequential syntax). We demonstrate this linguistic
framework on real motion capture data from a large scale database containing
around 200 different actions corresponding to English verbs associated with
voluntary meaningful observable movement.
Kinetology
Human
movement is a biological phenomenon which consists in the voluntary motion of
the human body. The understanding of the biomechanical bases and description
of human movement contributes to the improvement of this capacity in humans.
On the other hand, these motor aspects of human movement have important
applications to artificial systems in the synthesis and analysis of movement. Motion synthesis is the generation of movement for
animation characters with a realistic appearance which aims to avoid
unnatural and mechanical artifacts. Mostly, realistic motion synthesis is
based on real examples coming from motion capture. Human motion capture
usually corresponds to a very large amount of data. Segmentation, the
extraction of key postures or motion primitives, summarizes the motion
content and results in the compression of motion data. The precise exemplar movements are constrained only
to the ones stored in a motion database library. Novel realistic movement
either needs to be captured or adapted from previously recorded motion.
Adaptation involves the reuse of motion segments, manipulation of motion
attributes, and sequencing (concatenation) of movement according to physics
laws. This way, any representation should assist in those tasks and be able
to reconstruct the original and adapted movement. On the other hand, motion analysis relates to
perception and involves the parsing of visual information into action representations
ranging from optical flow to stick figure models. These representations are
used to uniquely identify the action performed in a video. Therefore, an
action representation should be able to select among different activities and
to reproduce the same structure for different performances of the same
action. Furthermore, an action representation should be based on primitives
robust to variations of the image formation process. In this sense, camera
view-invariance is a desired property for representations dealing with motion
analysis. Adequate primitives and segmentation must consider
both generation and perception of movement. One reason is that motion
synthesis involves the generation of animation satisfying a realism
criterion based ultimately on perception. On the other hand, motion
analysis should map the parsed structure from video into a representation
which should regenerate the original observed motion. Furthermore, an
integrated approach would allow imitation, an important component of an artificial
cognitive system [Matarić, 2002]. Therefore, the research problems of
motion synthesis and motion analysis should be combined and based on common
representations. Action units in behavior are all organized within a
clearly definable narrow time window or temporal segment. This temporal
segmentation appears to represent a basic property of the neuronal mechanisms
underlying the integration and organization of successive events [Kien, 1992:
19]. If words are formed from simultaneous combinations of gestures, the
perception somehow finds these elements in the movement signal. The signal
cannot be divided into a neat sequence of units and the patterns associated
with a particular segment vary with the phonetic context. The lack of
invariant segments in the signal matching the invariant segments of
perception constitutes the anisomorphism
paradox [Studdert-Kennedy, 1985: 142]. An initial step in our linguistic framework is to
find basic primitives for human movement. These motion primitives are
analogous to phonemes in spoken language. While phonemes are units of phonic
origin (sounds), the motion primitives are units of kinetic origin (movement)
that we refer to as kinetemes. These
atomic units are the building blocks of a foundational system for human movement
denoted as a kinetological system.
The problem addressed in this chapter concerns the representation of human
movement in terms of atomic sensory-motor primitives. In
this sense, kinetology is dedicated
to the study of systems of movement as the foundations for a kinetic
language. In addition to a geometric representation for 3D human movement, a
kinetological system consists of segmentation, symbolization, and principles.
We introduce a kinetological system with five principles on which such a
system should be based: compactness, view-invariance, reproducibility,
selectivity, and reconstructivity. We propose sensory-motor primitives and demonstrate
their kinetological properties. Further evaluation is accomplished with
experiments on compression and decompression of motion data. To represent
human movement satisfying the above requirements, we consider whole body
movement associated with general human actions. Although we consider whole
body movement, each DOF is treated independently. An initial 3D geometric
representation for human movement is assumed as input towards the computation
of our sensory-motor representation. Actual movement data is analyzed in the
process of evaluating the proposed kinetological system according to its
principles.
Geometric Representation
A model for the human body which considers only rigid
articulated movement consists of a skeleton. A skeleton is defined as a set
of rigid body parts connected through joints. Formally, the topology of a
skeleton is modeled as a graph where vertices correspond to body parts and
edges are associated with joints. A posture is the geometric configuration of
the skeleton at one instant. Human movement consists in the continuous time
variation of postures. There are two basic 3D geometric representations for
whole body movement: external and internal. The external representation consists of a set P of points in the human body, as
shown in Figure 3.1a. At an instant t, a point pi ∈ P is associated with the corresponding 3D Cartesian coordinates [Xi(t), Yi(t), Zi(t)]. The
whole human movement is fully determined if at least three non-collinear points in each
rigid body part are included in the representation. This way, a local
coordinate system can be defined for each body part. The degrees of freedom for a joint can be recovered
from the transformation relating the two local coordinate systems
corresponding to the adjacent body parts. The internal representation of human movement may use Euler angles to
specify the rotational degrees of freedom of each joint. The internal system
describes human movement with a set Q of joints, where a joint qj ∈ Q is associated with Euler angles φj(t), θj(t), and ψj(t) at instant t, as shown in Figure 3.1b.
The
internal representation makes explicit use of embodiment through the
topological specification of a skeleton. The topological graph of a skeleton
is defined as a tree where the root resembles the human vestibular system.
This system provides measurements about global movement and orientation in
space for humans. The internal representation is analogous to the proprioceptive system which monitors
movement and is responsible for kinesthesia:
the sense of body position awareness.
Segmentation
The input for our kinetological system is real human
motion obtained with a motion capture system. Each DOF i in a model for the articulated human body, referred to as an actuator, corresponds to a
time-varying function Ji.
The value Ji(t) represents the joint angle of a
specific actuator i at a particular
instant t. In kinetology, our goal
is to identify the motor primitives (segmentation) and to associate them with
symbols (symbolization). This way, kinetology provides a non-arbitrary
grounded symbolic representation for human movement. While motion synthesis
is performed by translating symbols into motion signal, motion analysis uses
this symbolic representation to transform the original signal into a string
of symbols used in the next steps of our linguistic framework. Automatic
segmentation is the decomposition of action sequences into movement
primitives. These primitives are atomic elements with characteristic properties
that stay constant within a segment. This concept of motion primitives differs
from the one associated with behavioral bases [Matarić, 2002], which are
used for composition of movement through linear combination. To segment human
movement, we consider each actuator independently. An actuator is a 3D point
or a joint angle describing the motion in an external or internal
representation, respectively. Each joint angle is represented as a
one-dimensional function over time. We associate an actuator with a joint
angle specifying the actuator’s original 3D motion according to an internal
geometric representation as shown in Figure 3.2a. The segmentation process
assigns one state to each instant of the movement for the actuator in
consideration. Contiguous instants assigned to the same state belong to the
same segment, as Figure 3.2b shows.
We
define a state according to the sign of derivatives of a joint angle
function. In our segmentation method, we use angular velocity J' (first derivative) and angular
acceleration J'' (second derivative), as shown in Figure 3.3. This leads to a four-state system:
positive velocity/positive acceleration (J'i(t) ≥ 0 and J''i(t) ≥ 0), positive velocity/negative
acceleration (J'i(t) ≥ 0 and J''i(t) < 0), negative velocity/positive acceleration (J'i(t) < 0 and J''i(t) ≥ 0), and negative velocity/negative
acceleration (J'i(t) < 0 and J''i(t) < 0). It is worth noting that a kinetological system can be defined in both
complex (considering higher order derivatives such as jerk) and simple ways.
A simpler system could have used only the first derivative. In that case, we
would have only two states: positive velocity (J'i(t) ≥ 0) and negative
velocity (J'i(t) < 0). Higher order derivatives
increase the amount of segmentation, adding complexity to the description of
the movement. The number 2^h of possible states depends on the order h
of the highest derivative used.
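The derivative-sign labeling described above can be sketched as follows. This is only an illustration under stated assumptions: the derivatives are estimated with plain finite differences (the source does not say how they are computed or smoothed), and the names `derivative`, `kinetological_states`, and `STATE_NAMES` are hypothetical. The input here is a synthetic sine wave, not motion capture data.

```python
import math

def derivative(values, dt):
    """Central finite differences, one-sided at the two ends (needs >= 2 samples)."""
    n = len(values)
    out = []
    for k in range(n):
        lo, hi = max(k - 1, 0), min(k + 1, n - 1)
        out.append((values[hi] - values[lo]) / ((hi - lo) * dt))
    return out

def kinetological_states(angles, dt=1.0, order=2):
    """Return one integer state per sample of a joint-angle signal.

    Bit k of the state is set when the (k+1)-th derivative is negative,
    so order=h yields up to 2**h distinct states.
    """
    states = [0] * len(angles)
    deriv = list(angles)
    for k in range(order):
        deriv = derivative(deriv, dt)  # assumption: unsmoothed finite differences
        states = [s | (int(d < 0) << k) for s, d in zip(states, deriv)]
    return states

# Hypothetical labels for the four-state system (order=2).
STATE_NAMES = {0: "v>=0, a>=0", 1: "v<0, a>=0", 2: "v>=0, a<0", 3: "v<0, a<0"}

# Synthetic test signal (not motion capture data): one period of a sine wave.
dt = 2 * math.pi / 400
angles = [math.sin(k * dt) for k in range(401)]
states = kinetological_states(angles, dt)
```

Calling `kinetological_states(angles, dt, order=1)` gives the simpler two-state system based on velocity alone. Real joint-angle data would likely need smoothing before differentiation, which this sketch omits.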
Figure 3.3. Angular derivatives used in our segmentation method.
The representation has a qualitative aspect, the state of each segment, and a
quantitative aspect corresponding to the time length and angular displacement
(i.e., the absolute difference between initial joint angle and final joint
angle) of each segment. Once the segments are identified, we keep these three
attribute values for each segment: the state, the time length, and the
angular displacement. Each segment is graphically displayed as a filled
rectangle, where the color represents its state, the vertical height
corresponds to angular displacement, and the horizontal length denotes the
time length, as Figure 3.2b shows. The four colors used to depict a
four-state kinetological system are blue for positive velocity/positive
acceleration segments, green for positive velocity/negative acceleration
segments, yellow for negative velocity/positive acceleration segments, and
red for negative velocity/negative acceleration segments. In a two-state
kinetological system, the two colors used are blue for positive velocity
segments and red for negative velocity segments. Given a compact
representation, the attributes are used in the reconstruction of an
approximation for the original motion signal and in the symbolization
process.
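A minimal sketch of how the three attributes per segment could be collected from per-sample states is given below. The function name `extract_segments` is hypothetical, and the conventions are assumptions: the time length is the number of samples in the run times the sampling interval, and the displacement is measured between the first and last sample of the run.

```python
from itertools import groupby

def extract_segments(angles, states, dt=1.0):
    """Collapse per-sample states into (state, time_length, displacement) tuples."""
    segments = []
    start = 0
    for state, run in groupby(states):
        count = len(list(run))                 # samples in this contiguous run
        end = start + count - 1                # index of the run's last sample
        displacement = abs(angles[end] - angles[start])
        segments.append((state, count * dt, displacement))
        start = end + 1
    return segments

# Toy signal with hand-written states (hypothetical labels "up" and "down").
angles = [0, 1, 2, 3, 2, 1, 0]
states = ["up", "up", "up", "up", "down", "down", "down"]
segments = extract_segments(angles, states)
```

Other boundary conventions are possible, for example sharing the border sample between neighboring segments so that the displacements add up to the total angular travel.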
Symbolization
The
kinetological segmentation process results in atoms that exhibit some natural
variability. Our goal is to identify the same kineteme amidst this
variability. The symbolization process consists in associating each segment
with a symbol such that segments with the same state corresponding to
different performances of the same motion are associated with the same
symbol. Symbolization amounts to classifying motion segments such that each
class contains variations of the same motion. This way, each segment is associated
with a symbol representing the cluster that contains motion primitives with a
similar spatiotemporal structure, as Figure 3.2c shows. Hierarchical
clustering, using an appropriate similarity distance for segments with the
same atomic state, offers a simple way to perform symbolization. Another
way to perform symbolization is to compute a graph, where the set of vertices
corresponds to all segments with the same atomic state. There exists an edge
between two vertices in the graph if the similarity distance between the two
corresponding segments is less than a threshold value. The similarity
distance is the absolute difference between the time-normalized versions of
the joint angle functions associated with the segments. The symbolization
clusters are the connected components of the similarity graph.
A probabilistic method to achieve symbolization is model-based probabilistic clustering. As an alternative to model-based clustering, we also used a generalized probabilistic clustering algorithm to classify segments for each joint angle independently. A segment is represented as a tuple (a, d, t), where a denotes the atomic state, d corresponds to the angular displacement, and t is the time length. The movement corresponding to a specific joint angle is segmented into a sequence of m atoms (aj, dj, tj) for j = 1, …, m. Our algorithm partitions the 2D parametric space of the quantitative attributes (d, t) into regions of any shape. Initially,
we compute probability distributions over the 2D parametric space, as shown
in Figure 3.4. We find one distribution Pa
for each of the possible states by considering only the atoms where aj
= a.
Each atom (aj,
dj, tj) contributes with the probability
modeled as a Gaussian filter h(k1, k2)
centered at (dj, tj) with size
Figure 3.4. A generalized probabilistic clustering method for symbolization.
Each sample atom (aj, dj, tj) is assigned to the cluster c which maximizes the expected probability
Figure 3.5. Segmentation of human motion.
Given the segmentation for motion data, as shown in Figure 3.5, the symbolization
output is a string of symbols for each actuator in the body. This set of
strings for the whole body defines a single structure, denoted as an actiongram,
shown in Figure 3.6. An actiongram A
has n strings A1, …, An.
Each string Ai
corresponds to an actuator of the human body model and contains a possibly
different number of mi
symbols. Each symbol Ai(j) is associated with a segment and
its attributes.
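The graph-based symbolization described earlier in this section (an edge between two segments whose similarity distance is below a threshold, clusters as connected components) can be sketched as follows. Several details are assumptions made for illustration: segments are given as lists of joint-angle samples, time normalization is done by linear resampling to a fixed number of points, the distance is the mean absolute difference, and all names (`resample`, `distance`, `symbolize`) are hypothetical. The function is meant to be called once per atomic state.

```python
def resample(curve, n=16):
    """Time-normalize a sampled curve to n points by linear interpolation."""
    if len(curve) == 1:
        return [float(curve[0])] * n
    out = []
    for k in range(n):
        pos = k * (len(curve) - 1) / (n - 1)
        i = min(int(pos), len(curve) - 2)
        frac = pos - i
        out.append(curve[i] * (1 - frac) + curve[i + 1] * frac)
    return out

def distance(a, b, n=16):
    """Mean absolute difference between time-normalized curves (assumed metric)."""
    ra, rb = resample(a, n), resample(b, n)
    return sum(abs(x - y) for x, y in zip(ra, rb)) / n

def symbolize(curves, threshold):
    """Return one cluster index per curve: connected components of the similarity graph."""
    parent = list(range(len(curves)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(curves)):
        for j in range(i + 1, len(curves)):
            if distance(curves[i], curves[j]) < threshold:
                parent[find(i)] = find(j)   # edge: merge the two components

    symbols = {}
    return [symbols.setdefault(find(i), len(symbols)) for i in range(len(curves))]

# Toy segments: the first two trace the same ramp at different speeds.
curves = [[0, 1, 2, 3], [0, 0.5, 1, 1.5, 2, 2.5, 3], [0, 5, 10]]
symbols = symbolize(curves, threshold=0.5)
```

The threshold value is arbitrary here; in practice it would have to be tuned per joint angle and state.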
Principles
Besides
sensory-motor primitives, we suggest five kinetological properties to evaluate
our approach and any other: compactness, view-invariance, reproducibility,
selectivity, and reconstructivity. We describe these principles in detail and
demonstrate that our segmentation method and primitives possess these
properties.
Compactness
The
compactness principle relates to describing a human activity with the least
possible number of atoms to decrease complexity, improve efficiency, and
allow compression. We achieve compactness through segmentation, which reduces
the representation’s number of parameters. We implemented our segmentation
approach as a compression method for motion data, tested the efficiency of our
compression algorithm on several different actions, and recorded a median
compression rate of 3.698 percent of the original file size for all motion
files. We achieved the best compression for actions with smooth movement.
Further compression could be achieved through symbolization.
Figure 3.6. Actiongram.
View-invariance
An
action representation should be based on primitives robust to variations of
the image formation process. View-invariance regards the effect of projecting
a 3D representation of human movement into a 2D representation according to a
vision system. A view-invariant representation provides the same 2D projected
description of an intrinsically 3D action captured from different viewpoints.
View-invariance is desired to allow visual perception and motor generation
under any geometric configuration in the environment space.
The
view-invariance evaluation requires a 2D-projected version of the initial
representative function according to varying viewpoints. For an internal
geometric representation, the 3D joint angle is projected according to the
two angle sides corresponding to the adjacent body parts, as shown in Figure
3.7. For example, the knee joint 2D angle is formed by the axes of the thigh
and shank. These axes are determined by the segments from the hip to the knee
joint and from the knee to the ankle joint. These 3D joints are projected and
the 2D joint angle is computed in the projection plane. To
evaluate the view-invariance of our representations, a circular surrounding
configuration of viewpoints is used, as shown in Figure 3.8. A viewpoint
consists of the camera position (specified by the camera center) and the
camera orientation (described by a look-at vector and an upward vector). In
our viewpoint configuration, the camera center trajectory corresponds to a
circle in 3D space centered at the target point. The look-at vector is
oriented from the camera center towards the target point, which is the center
of the axis-aligned parallelepiped containing the trajectories of the
movement in 3D space. The upward vector has the same orientation as the z-axis
vector. The camera center circle is defined as a parametric curve
where
Figure 3.8. A circular configuration of viewpoints.
A view-invariance graph shows, for each time instant (horizontal axis) and for
each viewpoint in the configuration of viewpoints (vertical axis), the state
associated with the movement, as Figure 3.9 shows. A view-invariance
measurement concerns the fraction of the most frequent state among all states
for all viewpoints at a single instant in time. Let s be a state in
our kinetological system and let vs(t) be the fraction of viewpoints in our
circular configuration that are assigned the state s at the
time instant t. The view-invariance measurement is the maximum value
of vs(t) over all possible states. A
four-state system has a view-invariance measure between 0.25 and 1.0, since the most frequent of four states must cover at least a quarter of the viewpoints. For
each time instant t, the
view-invariance measure is computed and plotted on the top of the
view-invariance graph. For any joint and any action in our database, the
graph demonstrates a high view-invariance measure for our segmentation
process, with exceptions only at segment borders and at two degenerate
viewpoints.
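The view-invariance measure defined above can be computed directly from a table of states, as in the following sketch. The function name `view_invariance` and the input layout (one row of states per viewpoint, one column per time instant) are assumptions, and the state letters in the example are made up.

```python
from collections import Counter

def view_invariance(states_by_view):
    """states_by_view[v][t] is the state observed from viewpoint v at instant t.

    Returns, for each instant, the fraction of viewpoints that agree on the
    most frequent state.
    """
    n_views = len(states_by_view)
    measure = []
    for column in zip(*states_by_view):        # all viewpoints at one instant
        top_count = Counter(column).most_common(1)[0][1]
        measure.append(top_count / n_views)
    return measure

# Made-up states for four viewpoints and three time instants.
example = [
    ["B", "B", "G"],
    ["B", "G", "G"],
    ["B", "Y", "G"],
    ["B", "R", "R"],
]
measure = view_invariance(example)
```

In this toy table the three instants show full agreement, the four-state worst case of one quarter, and agreement among three of the four viewpoints.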
Figure 3.9. View-invariance of the left knee flexion-extension
angle during walk.
Note
that the view-invariance measure has some uncertainty at degenerate
viewpoints and at the borders of segments. In these special cases, the movement
states are not fully consistent, which degrades the view-invariance measure.
The degenerate viewpoints are special cases of frontal views where the sides
of a joint angle tend to be aligned. Regarding view-invariance, the
border effect shows that movement segments lose stability only
during the temporal transition between segments. This is analogous to
coarticulation in speech, with similar implications for action recognition
tasks.
Reproducibility
Reproducibility
requires an action to have the same description even when a different
performance of this action is considered. Intrapersonal invariance deals
with the same subject performing the same action repeatedly.
Interpersonal invariance concerns different subjects executing the same
action several times. A kinetological system is reproducible when the same
symbolic representation is associated with the same action performed on
different occasions (intrapersonal) or by different subjects (interpersonal).
To
evaluate the reproducibility of our kinetological system, we used human gait
data for 16 subjects covering males and females of several ages. For each person,
we considered only 12 DOFs associated with the joint angles of the lower
limbs: pelvic tilt, pelvic obliquity, pelvic rotation, hip flexion-extension,
hip abduction-adduction, hip rotation, knee flexion-extension, knee
valgus-varus, knee rotation, ankle dorsi-plantar flexion, foot rotation, and
foot progression. A reproducibility measure is computed for each joint angle.
The reproducibility measure of a joint angle is the fraction of the most
representative symbolic description among all descriptions for the 16
individuals. A very high reproducibility measure means that symbolic
descriptions match among different gait performances and the kinetological
system is reproducible. The reproducibility measure is very high for the
joint angles which play a primary role in an action, as Figure 3.10 shows for
a walking action. The identification of the intrinsic and essential variables
of an action is a byproduct of the reproducibility requirement of a
kinetological system.
(a) Knee flexion-extension.
(b) Pelvic obliquity.
Figure 3.10. Reproducibility during gait.
Using our kinetological system, six joint angles
obtained very high reproducibility: pelvic obliquity, hip flexion-extension,
hip abduction-adduction, knee flexion-extension, foot rotation, and foot
progression, as shown in Figure 3.11. These variables seem to be the most
related to the movement of walking forward. Other joint angles obtained only
a high reproducibility measure, which is interpreted as a secondary role in
the action: pelvic tilt and ankle dorsi-plantar flexion. The remaining joint
angles had a poor reproducibility rate and seem not to be correlated with the
action but probably with its stability instead: pelvic rotation, hip rotation,
knee valgus-varus, and knee rotation. The performance of our kinetological system on the
reproducibility measure for all the joint angles shows that the system is
reproducible for the DOFs intrinsically related to the action.
Figure 3.11. Reproducibility measure for 12 DOFs during gait.
Selectivity
The
selectivity principle concerns the ability to discern between distinct
actions. In terms of representation, this principle requires a different
structure to represent different actions. To evaluate our kinetological
system according to the selectivity principle, we compare the compact
representation of several different actions and verify whether their
structures are dissimilar. The selectivity property is demonstrated using a
set of actions performed by the same individual. Four joint angles are
considered: left and right hip flexion-extension, left and right knee
flexion-extension, as shown in Figure 3.12.
Figure 3.12. Selectivity: Different representations for three distinct actions.
The different actions are clearly represented by
different structures. However, manner variations of an action are only
different in the quantitative aspect. We investigate the quantitative aspect
of four manner variations of the walk action performed by a single subject,
as shown in Figure 3.13.
Figure 3.13. Compact representations of four manner variations of the walk action.
Each manner variation has a total of 24 segments for
the four joint angles considered. For each pair of manner variations, we
compute a dissimilarity vector, where each element corresponds to the
difference between the quantitative aspects of the associated segments in the
two variations, as shown in Figure 3.14.
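A dissimilarity vector of this kind might be computed as in the sketch below. The source reports dissimilarities as percentages but does not state the normalization, so the relative difference used here (absolute difference divided by the larger of the two values) is an assumption, as are the function name `dissimilarity_vector` and the made-up numbers. The sketch also assumes both variations have the same number of segments in the same order, as is the case for the 24-segment walk variations.

```python
from statistics import median

def dissimilarity_vector(values_a, values_b):
    """Relative difference for each pair of corresponding segments."""
    vector = []
    for a, b in zip(values_a, values_b):
        scale = max(abs(a), abs(b))
        # Assumed normalization; identical zero-valued attributes count as 0.
        vector.append(abs(a - b) / scale if scale > 0 else 0.0)
    return vector

# Made-up time lengths (in frames) for three corresponding segments.
walk = [10, 20, 40]
walk_with_stride = [8, 20, 50]
vector = dissimilarity_vector(walk, walk_with_stride)
typical = median(vector)
```

The same function applies to angular displacements; the median then summarizes each vector with a single number per pair of variations.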
Figure 3.14. Dissimilarity vectors between manner variations of walk: time length (blue) and angular displacement (red).
From these vectors, we can verify the dissimilarity
of the manner variations. The closest variations according to time length are
“Walk with stride” and “Walk with exaggerated stride” (median dissimilarity
12.0%), and according to angular displacement are “Walk” and “Walk with
stride” (median dissimilarity 12.2%). This way, even for the same action, the
representation has enough dissimilarity to select between different manner
variations.
Reconstructivity
Reconstructivity
is associated with the ability to reconstruct the original movement signal up
to an approximation factor from a compact representation. We propose a
reconstruction method that consists in a novel interpolation algorithm based
on the kinetological structure. We consider one segment at a time and
concentrate on the state transitions between consecutive segments. Based on a
transition, we determine constraints about the derivatives at border points
of a segment. Derivatives will have zero value (equation) or a known sign
(inequality) at these points.
Figure 3.15. Possible state transitions between segments.
For this discussion about reconstructivity, we
consider a four-state kinetological system. We investigate the possible state
transitions that are feasible in our kinetological system. Each segment can
have only two possible states for a next neighbor segment. However, the
transition B → Y (R → G) is impossible, since velocity cannot become
negative (positive) with positive (negative) acceleration. The kinetological
rules of our system are represented by a finite automaton, as shown in Figure
3.15. From these kinetological rules, each of the four segment states has
only two possible state configurations for previous and next segments and,
consequently, there are eight possible state sequences for three consecutive
segments, as shown in Table 3.1. Each possible sequence of three segments
corresponds to two equations and two inequality constraints associated with
first and second derivatives at border points t1 and t2
of the center segment. Two other inequalities come from the derivatives at
interior points (t1 <
t < t2) of the segment.
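The transition rules and the count of eight three-segment sequences can be checked with a short enumeration. This sketch encodes each state as a pair of signs (velocity, acceleration) using the letters B, G, Y, and R from the color scheme, and it assumes that exactly one sign changes at a segment border; the names `STATES`, `feasible`, `TRANSITIONS`, and `TRIPLES` are hypothetical.

```python
from itertools import product

# (sign of velocity, sign of acceleration) for each state.
STATES = {"B": (+1, +1), "G": (+1, -1), "Y": (-1, +1), "R": (-1, -1)}

def feasible(src, dst):
    """True when dst may directly follow src."""
    v0, a0 = STATES[src]
    v1, a1 = STATES[dst]
    if src == dst or (v0 != v1 and a0 != a1):
        return False          # assumption: exactly one sign changes at a border
    if v0 != v1:
        # Velocity can only cross zero in the direction of the acceleration.
        return a0 == v1
    return True               # acceleration changes sign, velocity sign is kept

TRANSITIONS = [(s, d) for s, d in product(STATES, repeat=2) if feasible(s, d)]

# Sequences (previous, center, next) built by chaining two feasible transitions.
TRIPLES = [(p, c, n) for p, c in TRANSITIONS for c2, n in TRANSITIONS if c == c2]
```

Under these assumptions the enumeration yields six transitions, excludes B → Y and R → G, and produces eight triples, two for each center state.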