The Human Activity Language

 

Gutemberg Guerra-Filho

Department of Computer Science and Engineering

University of Texas at Arlington

guerra@cse.uta.edu

Yiannis Aloimonos

Department of Computer Science

University of Maryland, College Park

yiannis@cfar.umd.edu

Complete pdf version of this document.

Further bibliography from the same authors.

The support of the National Science Foundation (NSF)

under the Human and Social Dynamics (HSD) program and of the

Defense Advanced Research Projects Agency (DARPA)

under the Video Analysis and Content Exploitation (VACE) project

is gratefully acknowledged.

 

We empirically discovered that the space of human actions has a linguistic structure. This is a sensory-motor space consisting of the evolution of joint angles of the human body in movement. The space of human activity has its own phonemes, morphemes, and sentences. We present a Human Activity Language (HAL) for symbolic non-arbitrary representation of sensory and motor information of human activity. This language was learned from large amounts of motion capture data.

Kinetology, the phonology of human movement, finds basic primitives for human motion (segmentation) and associates them with symbols (symbolization). This way, kinetology provides a symbolic representation for human movement that allows synthesis, analysis, and symbolic manipulation. We introduce a kinetological system and propose five basic principles on which such a system should be based: compactness, view-invariance, reproducibility, selectivity, and reconstructivity. We demonstrate the kinetological properties of our sensory-motor primitives. Further evaluation is accomplished with experiments on compression and decompression of motion data.

The morphology of a human action relates to the inference of essential parts of movement (morpho-kinetology) and its structure (morpho-syntax). To learn morphemes and their structure, we present a grammatical inference methodology and introduce a parallel learning algorithm to induce a grammar system representing a single action. The algorithm infers three components of the grammar system: a subset of essential actuators, a context-free grammar (CFG) for the language of each component representing the motion pattern performed in a single actuator, and synchronization rules modeling coordination among actuators.

The syntax of human activities involves the construction of sentences using action morphemes. A sentence may range from a single action morpheme (nuclear syntax) to a sequence of sets of morphemes. A single morpheme is decomposed into analogs of lexical categories: nouns, adjectives, verbs, and adverbs. The sets of morphemes represent simultaneous actions (parallel syntax) and a sequence of movements is related to the concatenation of activities (sequential syntax).

We demonstrate this linguistic framework on real motion capture data from a large-scale database containing around 200 different actions corresponding to English verbs associated with voluntary, meaningful, observable movement.

 

 

 

Kinetology

Human movement is a biological phenomenon which consists in the voluntary motion of the human body. The understanding of the biomechanical bases and description of human movement contributes to the improvement of this capacity in humans. On the other hand, these motor aspects of human movement have important applications to artificial systems in the synthesis and analysis of movement.

Motion synthesis is the generation of movement for animation characters with a realistic appearance which aims to avoid unnatural and mechanical artifacts. Mostly, realistic motion synthesis is based on real examples coming from motion capture. Human motion capture usually corresponds to a very large amount of data. Segmentation, the extraction of key postures or motion primitives, summarizes the motion content and results in the compression of motion data.

The precise exemplar movements are constrained only to the ones stored in a motion database library. Novel realistic movement either needs to be captured or adapted from previously recorded motion. Adaptation involves the reuse of motion segments, manipulation of motion attributes, and sequencing (concatenation) of movement according to physics laws. This way, any representation should assist in those tasks and be able to reconstruct the original and adapted movement.

On the other hand, motion analysis relates to perception and involves the parsing of visual information into action representations ranging from optical flow to stick figure models. These representations are used to uniquely identify the action performed in a video. Therefore, an action representation should be able to select among different activities and to reproduce the same structure for different performances of the same action. Furthermore, an action representation should be based on primitives robust to variations of the image formation process. In this sense, camera view-invariance is a desired property for representations dealing with motion analysis.

Adequate primitives and segmentation must consider both generation and perception of movement. One reason is that motion synthesis involves the generation of animation satisfying realism criteria that are ultimately based on perception. On the other hand, motion analysis should map the parsed structure from video into a representation from which the original observed motion can be regenerated. Furthermore, an integrated approach would allow imitation, an important component of an artificial cognitive system [Matarić, 2002]. Therefore, the research problems of motion synthesis and motion analysis should be combined and based on common representations.

Action units in behavior are all organized within a clearly definable narrow time window or temporal segment. This temporal segmentation appears to represent a basic property of the neuronal mechanisms underlying the integration and organization of successive events [Kien, 1992: 19]. If words are formed from simultaneous combinations of gestures, the perception somehow finds these elements in the movement signal. The signal cannot be divided into a neat sequence of units and the patterns associated with a particular segment vary with the phonetic context. The lack of invariant segments in the signal matching the invariant segments of perception constitutes the anisomorphism paradox [Studdert-Kennedy, 1985: 142].

An initial step in our linguistic framework is to find basic primitives for human movement. These motion primitives are analogous to phonemes in spoken language. While phonemes are units of phonic origin (sounds), the motion primitives are units of kinetic origin (movement) that we refer to as kinetemes. These atomic units are the building blocks of a foundational system for human movement, denoted as a kinetological system. The problem addressed in this chapter concerns the representation of human movement in terms of atomic sensory-motor primitives.

In this sense, kinetology is dedicated to the study of systems of movement as the foundations for a kinetic language. In addition to a geometric representation for 3D human movement, a kinetological system consists of segmentation, symbolization, and principles. We introduce a kinetological system with five principles on which such a system should be based: compactness, view-invariance, reproducibility, selectivity, and reconstructivity.

We propose sensory-motor primitives and demonstrate their kinetological properties. Further evaluation is accomplished with experiments on compression and decompression of motion data. To represent human movement satisfying the above requirements, we consider whole body movement associated with general human actions. Although we consider whole body movement, each DOF is treated independently. An initial 3D geometric representation for human movement is assumed as input towards the computation of our sensory-motor representation. Actual movement data is analyzed in the process of evaluating the proposed kinetological system according to its principles.

Geometric Representation

A model for the human body which considers only rigid articulated movement consists of a skeleton. A skeleton is defined as a set of rigid body parts connected through joints. Formally, the topology of a skeleton is modeled as a graph where vertices correspond to body parts and edges are associated with joints. A posture is the geometric configuration of the skeleton at one instant. Human movement consists in the continuous time variation of postures. There are two basic 3D geometric representations for whole body movement: external and internal.

The external representation consists of a set P of points in the human body, as shown in Figure 3.1a. At an instant t, a point p_i ∈ P is associated with the corresponding 3D Cartesian coordinates [X_i(t), Y_i(t), Z_i(t)]. The whole human movement is fully determined if at least three points in each rigid body part are included in the representation. This way, a local coordinate system can be defined for each body part.

The degrees of freedom for a joint can be recovered from the transformation relating the two local coordinate systems corresponding to the adjacent body parts. The internal representation of human movement may use Euler angles to specify the rotational degrees of freedom of each joint. The internal system describes human movement with a set Q of joints, where a joint q_j ∈ Q is associated with Euler angles φ_j(t), θ_j(t), and ψ_j(t) at instant t, as shown in Figure 3.1b.

(a) External representation.

(b) Internal representation.

Figure 3.1. Three-dimensional representations of human movement.

The internal representation makes explicit use of embodiment through the topological specification of a skeleton. The topological graph of a skeleton is defined as a tree where the root resembles the human vestibular system. This system provides measurements about global movement and orientation in space for humans. The internal representation is analogous to the proprioceptive system which monitors movement and is responsible for kinesthesia: the sense of body position awareness.

Segmentation

The input for our kinetological system is real human motion obtained with a motion capture system. Each DOF i in a model for the articulated human body, referred to as an actuator, corresponds to a time-varying function J_i. The value J_i(t) represents the joint angle of a specific actuator i at a particular instant t. In kinetology, our goal is to identify the motor primitives (segmentation) and to associate them with symbols (symbolization). This way, kinetology provides a non-arbitrary grounded symbolic representation for human movement. While motion synthesis is performed by translating symbols into a motion signal, motion analysis uses this symbolic representation to transform the original signal into a string of symbols used in the next steps of our linguistic framework.

Automatic segmentation is the decomposition of action sequences into movement primitives. These primitives are atomic elements with characteristic properties that stay constant within a segment. This concept of motion primitives differs from the one associated with behavioral bases [Matarić, 2002], which are used for composition of movement through linear combination. To segment human movement, we consider each actuator independently. An actuator is a 3D point or a joint angle describing the motion in an external or internal representation, respectively. Each joint angle is represented as a one-dimensional function over time. We associate an actuator with a joint angle specifying the actuator’s original 3D motion according to an internal geometric representation, as shown in Figure 3.2a. The segmentation process assigns one state to each instant of the movement for the actuator in consideration. Contiguous instants assigned to the same state belong to the same segment, as Figure 3.2b shows.

(a) Geometric representation.

(b) Segmentation.

(c) Symbolization.

Figure 3.2. Kinetological system.

We define a state according to the sign of derivatives of a joint angle function. In our segmentation method, we use angular velocity J′ (first derivative) and angular acceleration J″ (second derivative), as shown in Figure 3.3. This leads to a four-state system: positive velocity/positive acceleration (J′_i(t) ≥ 0 and J″_i(t) ≥ 0), positive velocity/negative acceleration (J′_i(t) ≥ 0 and J″_i(t) < 0), negative velocity/positive acceleration (J′_i(t) < 0 and J″_i(t) ≥ 0), and negative velocity/negative acceleration (J′_i(t) < 0 and J″_i(t) < 0). It is worth noting that a kinetological system can be defined in both complex (considering higher order derivatives such as jerk) and simple ways. A simpler system could have used only the first derivative. In that case, we would have only two states: positive velocity (J′_i(t) ≥ 0) and negative velocity (J′_i(t) < 0). Higher order derivatives increase the amount of segmentation, adding complexity to the description of the movement. The number 2^h of possible states depends on the order h of the highest derivative used.
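The four-state labelling described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the derivatives are estimated with finite differences via `numpy.gradient` (the text does not say how they are computed), and the function name `label_states` and the single-letter codes B, G, Y, R are hypothetical labels for the four sign combinations.

```python
import numpy as np

def label_states(angles, dt=1.0):
    """Assign one of four states to each sample of a joint angle signal.

    Codes (illustrative): B = +velocity/+acceleration, G = +velocity/-acceleration,
    Y = -velocity/+acceleration, R = -velocity/-acceleration.
    """
    angles = np.asarray(angles, dtype=float)
    velocity = np.gradient(angles, dt)        # finite-difference estimate of J'
    acceleration = np.gradient(velocity, dt)  # finite-difference estimate of J''
    codes = np.array(["R", "Y", "G", "B"])
    # Index 0..3 from the two sign bits; ">= 0" counts as positive, as in the text.
    index = 2 * (velocity >= 0).astype(int) + (acceleration >= 0).astype(int)
    return codes[index]
```

A two-state system would keep only the velocity bit. Real motion capture data would likely need smoothing before differentiation, which this sketch omits.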

Figure 3.3. Angular derivatives used in our segmentation method.

The representation has a qualitative aspect, the state of each segment, and a quantitative aspect corresponding to the time length and angular displacement (i.e., the absolute difference between initial joint angle and final joint angle) of each segment. Once the segments are identified, we keep these three attribute values for each segment: the state, the time length, and the angular displacement. Each segment is graphically displayed as a filled rectangle, where the color represents its state, the vertical width corresponds to angular displacement, and the horizontal length denotes the time length, as Figure 3.2b shows. The four colors used to depict a four-state kinetological system are blue for positive velocity/positive acceleration segments, green for positive velocity/negative acceleration segments, yellow for negative velocity/positive acceleration segments, and red for negative velocity/negative acceleration segments. In a two-state kinetological system, the two colors used are blue for positive velocity segments and red for negative velocity segments. Given a compact representation, the attributes are used in the reconstruction of an approximation for the original motion signal and in the symbolization process.
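As a rough sketch of how the three attributes could be collected, the following Python function groups contiguous samples with the same state into segments. The function name `build_segments` is hypothetical, and two simplifications are assumed: the time length is the number of samples times the sampling period `dt`, and the angular displacement is measured between the first and last sample of the run.

```python
from itertools import groupby

def build_segments(states, angles, dt=1.0):
    """Return a list of (state, time_length, angular_displacement) tuples."""
    segments = []
    start = 0
    for state, run in groupby(states):
        length = len(list(run))            # number of contiguous samples in this state
        end = start + length - 1
        displacement = abs(angles[end] - angles[start])
        segments.append((state, length * dt, displacement))
        start += length
    return segments

# Example: two segments, one accelerating and one decelerating.
example = build_segments(["B", "B", "G", "G", "G"], [0, 1, 3, 4, 4.5])
```

Storing only these tuples instead of every sample is what makes the representation compact.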

These videos show the segmentation of the run action according to a two-state kinetological system that accounts only for positive and negative angular velocity states. Instead of considering the whole body, the segmentation only considers a single joint actuator: left hip flexion-extension (top video) performed by the left upper leg motion (green blob) and left knee flexion-extension (bottom video) performed by the left lower leg motion (red blob).

Symbolization

The kinetological segmentation process results in atoms that exhibit some natural variability. Our goal is to identify the same kineteme amidst this variability. The symbolization process consists in associating each segment with a symbol such that segments with the same state corresponding to different performances of the same motion are associated with the same symbol. Symbolization amounts to classifying motion segments such that each class contains variations of the same motion. This way, each segment is associated with a symbol representing the cluster that contains motion primitives with a similar spatiotemporal structure, as Figure 3.2c shows. Hierarchical clustering, using an appropriate similarity distance for segments with the same atomic state, offers a simple way to perform symbolization.

Another way to perform symbolization is to compute a graph, where the set of vertices corresponds to all segments with the same atomic state. There exists an edge between two vertices in the graph if the similarity distance between the two corresponding segments is less than a threshold value. The similarity distance is the absolute difference between the time normalized versions of the joint angle functions associated with the segments. The symbolization clusters are the connected components of the similarity graph.
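The graph-based symbolization can be illustrated with a small union-find sketch in Python. Several details are assumptions rather than part of the original description: segments are time-normalized by linear resampling to `n` samples, the distance is the mean absolute difference of the resampled curves, and the names `resample` and `symbolize` are hypothetical. All segments passed in are assumed to share the same atomic state.

```python
import numpy as np

def resample(values, n=16):
    """Time-normalize a segment by linear resampling to n samples."""
    values = np.asarray(values, dtype=float)
    src = np.linspace(0.0, 1.0, len(values))
    dst = np.linspace(0.0, 1.0, n)
    return np.interp(dst, src, values)

def symbolize(segments, threshold, n=16):
    """Label segments by the connected components of the similarity graph."""
    curves = [resample(s, n) for s in segments]
    parent = list(range(len(curves)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i in range(len(curves)):
        for j in range(i + 1, len(curves)):
            distance = np.mean(np.abs(curves[i] - curves[j]))
            if distance < threshold:        # edge in the similarity graph
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(len(curves))]
    symbols = {}
    return [symbols.setdefault(r, len(symbols)) for r in roots]

labels = symbolize([[0, 1, 2], [0, 1.1, 2.1], [0, 5, 10], [0, 0.5, 1, 1.5, 2]], 0.5)
```

Because components are transitive, two segments can share a symbol without being directly within the threshold of each other, which is a property of this approach worth keeping in mind when choosing the threshold.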

A probabilistic method to achieve symbolization is model-based probabilistic clustering. As an alternative to model-based clustering, we also used a generalized probabilistic clustering algorithm to classify segments for each joint angle independently. A segment is represented as a tuple (a, d, t), where a denotes the atomic state, d corresponds to the angular displacement, and t is the time length. The movement corresponding to a specific joint angle is segmented into a sequence of m atoms (a_j, d_j, t_j) for j = 1, …, m. Our algorithm partitions the 2D parametric space of the quantitative attributes (d, t) into regions of any shape.

Initially, we compute probability distributions over the 2D parametric space, as shown in Figure 3.4. We find one distribution P_a for each of the possible states by considering only the atoms where a_j = a. Each atom (a_j, d_j, t_j) contributes with the probability modeled as a Gaussian filter h(k_1, k_2) centered at (d_j, t_j) with size (2W_D + 1) × (2W_T + 1) and standard deviation σ. This way, the probability distribution is defined, up to a normalization constant, as

P_a(d, t) = Σ_{j : a_j = a} h(d − d_j, t − t_j).

Once the probability distribution P_a is computed, each local maximum is associated with a class. This way, the number of clusters is selected automatically. The partitioning of the parametric space is performed by selecting a connected region for each cluster c associated with a local maximum p_c. For a cluster c, we find the minimum value v_c such that the region r_c in the parametric space satisfying P_a(d, t) > v_c contains the peak p_c and no other local maximum.
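The first two steps of this clustering method, accumulating the distribution and locating its peaks, can be sketched in Python as below. This is only an illustration under stated assumptions: the quantitative attributes are taken to be already quantized to integer grid indices, the distribution is normalized to sum to one, a peak is a cell strictly greater than its eight neighbours, and the names `gaussian_kernel`, `density`, and `local_maxima` are hypothetical. The region-growing step that finds v_c is not covered.

```python
import numpy as np

def gaussian_kernel(wd, wt, sigma):
    """Gaussian filter h of size (2*wd + 1) x (2*wt + 1), normalized to sum to 1."""
    k1 = np.arange(-wd, wd + 1)[:, None]
    k2 = np.arange(-wt, wt + 1)[None, :]
    h = np.exp(-(k1 ** 2 + k2 ** 2) / (2.0 * sigma ** 2))
    return h / h.sum()

def density(atoms, shape, wd=2, wt=2, sigma=1.0):
    """Accumulate one Gaussian per atom (d, t) on a grid of the given shape."""
    h = gaussian_kernel(wd, wt, sigma)
    padded = np.zeros((shape[0] + 2 * wd, shape[1] + 2 * wt))
    for d, t in atoms:
        # In padded coordinates the kernel centred at (d, t) starts at (d, t).
        padded[d:d + 2 * wd + 1, t:t + 2 * wt + 1] += h
    p = padded[wd:wd + shape[0], wt:wt + shape[1]]
    return p / p.sum()

def local_maxima(p):
    """Grid cells with positive mass that are strictly above their 8 neighbours."""
    padded = np.pad(p, 1, constant_values=-np.inf)
    peaks = []
    for i in range(p.shape[0]):
        for j in range(p.shape[1]):
            window = padded[i:i + 3, j:j + 3].copy()
            window[1, 1] = -np.inf          # exclude the cell itself
            if p[i, j] > 0 and p[i, j] > window.max():
                peaks.append((i, j))
    return peaks

p = density([(3, 3), (3, 3), (10, 10)], shape=(15, 15))
peaks = local_maxima(p)
```

Each entry of `peaks` would seed one cluster, so the number of clusters follows from the data rather than being fixed in advance.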

Figure 3.4. A generalized probabilistic clustering method for symbolization.

Each sample atom (a_j, d_j, t_j) is assigned to the cluster c which maximizes the expected probability

E_c(j) = Σ_{k_1 = −W_D}^{W_D} Σ_{k_2 = −W_T}^{W_T} h(k_1, k_2) R_c(d_j + k_1, t_j + k_2),

where R_c is a binary matrix specifying the connected region r_c corresponding to the cluster c. This probabilistic clustering algorithm uses a more general model than standard probabilistic clustering techniques.

Figure 3.5. Segmentation of human motion.

Given the segmentation of motion data, as shown in Figure 3.5, the symbolization output is a string of symbols for each actuator in the body. This set of strings for the whole body defines a single structure, denoted as actiongram, shown in Figure 3.6. An actiongram A has n strings A_1, …, A_n. Each string A_i corresponds to an actuator of the human body model and contains a possibly different number m_i of symbols. Each symbol A_i(j) is associated with a segment and its attributes.

These videos display the motor primitives extracted through symbolization for the run action according to a four-state kinetological system that accounts for angular velocity and angular acceleration. The symbolization is performed in each single joint actuator separately. In the top video, we show the primitives for the left hip flexion-extension performed by the left upper leg motion (green blob) and, in the bottom video, we show the primitives for the left knee flexion-extension performed by the left lower leg motion (red blob).

Principles

Besides sensory-motor primitives, we suggest five kinetological properties to evaluate our approach and any other: compactness, view-invariance, reproducibility, selectivity, and reconstructivity. We describe these principles in detail and demonstrate that our segmentation method and primitives possess these properties.

Compactness

The compactness principle relates to describing a human activity with the least possible number of atoms to decrease complexity, improve efficiency, and allow compression. We achieve compactness through segmentation, which reduces the representation’s number of parameters. We implemented our segmentation approach as a compression method for motion data and tested its compression efficiency on several different actions, recording a median compression rate of 3.698 percent of the original file size over all motion files. We achieved the best compression for actions with smooth movement. Further compression could be achieved through symbolization.

Figure 3.6. Actiongram.

View-invariance

An action representation should be based on primitives robust to variations of the image formation process. View-invariance regards the effect of projecting a 3D representation of human movement into a 2D representation according to a vision system. A view-invariant representation provides the same 2D projected description of an intrinsically 3D action captured from different viewpoints. View-invariance is desired to allow visual perception and motor generation under any geometric configuration in the environment space.

This video illustrates the concept of view-invariance by showing the same action from a spherically uniform 8 by 8 grid of viewpoints. Given the center C of the rectangular box containing all the 3D trajectories of the joint points, the camera viewpoints are located at the same distance from C on a spherical surface centered at this point C and looking at it.

(a) 2D Trajectory.

(b) 2D Joint Angle.

Figure 3.7. 2D projected version of the knee joint angle trajectory from a single viewpoint during a walk action.

The view-invariance evaluation requires a 2D-projected version of the initial representative function according to varying viewpoints. For an internal geometric representation, the 3D joint angle is projected according to the two angle sides corresponding to the adjacent body parts, as shown in Figure 3.7. For example, the knee joint 2D angle is formed by the axes of the thigh and shank. These axes are determined by the segments from the hip to the knee joint and from the knee to the ankle joint. These 3D joints are projected and the 2D joint angle is computed in the projection plane.
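The 2D joint angle computation can be illustrated with a short Python sketch. It assumes the three joints have already been projected into the image plane (the projection model itself is not specified here), and the function name `joint_angle_2d` is hypothetical.

```python
import numpy as np

def joint_angle_2d(hip, knee, ankle):
    """Angle in degrees at the knee between the projected thigh and shank axes."""
    thigh = np.asarray(hip, dtype=float) - np.asarray(knee, dtype=float)
    shank = np.asarray(ankle, dtype=float) - np.asarray(knee, dtype=float)
    cosine = np.dot(thigh, shank) / (np.linalg.norm(thigh) * np.linalg.norm(shank))
    # Clip guards against rounding slightly outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0)))

straight_leg = joint_angle_2d((0, 1), (0, 0), (0, -1))
bent_leg = joint_angle_2d((0, 1), (0, 0), (1, 0))
```

When the projected thigh or shank collapses to a point, the norm is zero and the angle is undefined, which corresponds to a degenerate viewpoint.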

To evaluate the view-invariance of our representations, a circular surrounding configuration of viewpoints is used, as shown in Figure 3.8. A viewpoint consists of the camera position (specified by the camera center) and the camera orientation (described by a look-at vector and an upward vector). In our viewpoint configuration, the camera center trajectory corresponds to a circle in 3D space centered at the target point. The look-at vector is oriented from the camera center towards the target point, which is the center of the axis-aligned parallelepiped containing the trajectories of the movement in 3D space. The upward vector has the same orientation as the z-axis vector. The camera center circle is defined as a parametric curve

c(θ) = p + r (cos θ, sin θ, 0),

where θ is a parameter representing a direction in degrees from 0° to 360°, r is the radius of the circle, and p is the target point.
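A circular viewpoint configuration of this kind could be generated as in the following Python sketch. It assumes the circle lies in the plane perpendicular to the z-axis (consistent with the upward vector being the z-axis, but not stated explicitly), samples the direction uniformly, and uses the hypothetical names `camera_centers` and `look_at_vectors`.

```python
import numpy as np

def camera_centers(target, radius, n_views=36):
    """Camera centers on a horizontal circle of the given radius around the target."""
    target = np.asarray(target, dtype=float)
    theta = np.radians(np.arange(n_views) * 360.0 / n_views)
    offsets = np.stack([np.cos(theta), np.sin(theta), np.zeros_like(theta)], axis=1)
    return target + radius * offsets

def look_at_vectors(centers, target):
    """Unit vectors pointing from each camera center towards the target point."""
    directions = np.asarray(target, dtype=float) - centers
    return directions / np.linalg.norm(directions, axis=1, keepdims=True)

centers = camera_centers([1.0, 2.0, 3.0], radius=5.0, n_views=4)
look_at = look_at_vectors(centers, [1.0, 2.0, 3.0])
```

The target point would be the center of the axis-aligned box containing the 3D trajectories, and the upward vector is the same for every viewpoint.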

Figure 3.8. A circular configuration of viewpoints.

A view-invariance graph shows, for each time instant (horizontal axis) and for each viewpoint in the configuration of viewpoints (vertical axis), the state associated with the movement, as Figure 3.9 shows. A view-invariance measurement concerns the fraction of the most frequent state among all states for all viewpoints at a single instant in time. Let s be a state in our kinetological system and let v_s(t) be the fraction of viewpoints in our circular configuration that are assigned state s at the time instant t. The view-invariance measurement is the maximum value of v_s(t) over all possible states. A four-state system has a view-invariance measure between 0.25 and 1.0. For each time instant t, the view-invariance measure is computed and plotted on the top of the view-invariance graph. For any joint and any action in our database, the graph demonstrates a high view-invariance measure for our segmentation process, with the only exceptions at the segments’ borders and at two degenerate viewpoints.
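The view-invariance measure defined above is straightforward to compute once every viewpoint has been segmented. The Python sketch below assumes the states are stored in a 2D array indexed by viewpoint and time; the function name `view_invariance` and the letter codes are illustrative.

```python
import numpy as np

def view_invariance(states):
    """states[v][t] is the state seen from viewpoint v at time t.

    Returns, for each time instant, the fraction of viewpoints that agree
    on the most frequent state.
    """
    states = np.asarray(states)
    measures = []
    for column in states.T:                      # one column per time instant
        _, counts = np.unique(column, return_counts=True)
        measures.append(counts.max() / column.size)
    return np.array(measures)

measure = view_invariance([
    ["B", "B", "G"],
    ["B", "G", "G"],
    ["B", "Y", "G"],
    ["B", "R", "R"],
])
```

In this toy example the second time instant reaches the lower bound of 0.25 for a four-state system, since every viewpoint reports a different state.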

Figure 3.9. View-invariance of the left knee flexion-extension angle during walk.

This video displays, on the left side, the view-invariance graph of the right knee flexion-extension during a walk action. The view-invariance graph shows the motion state at each time (horizontal axis) and camera location (vertical axis). A horizontal bar is plotted over the current camera location in the circular surrounding configuration shown on the right side of the video. The camera location varies from 0° to 360°. In the center, the two-dimensional joint angle function is plotted. This angle function is obtained from the angle formed by the projection of the bones connected to the joint onto the camera plane. The state colors are also superimposed on this joint angle function and maxima-minima are highlighted.

Note that the view-invariance measure has some uncertainty at degenerate viewpoints and at the borders of segments. In these special cases, the movement states are not fully consistent, which degrades the view-invariance measure. The degenerate viewpoints are special cases of frontal views where the sides of a joint angle tend to be aligned. Regarding view-invariance, the border effect shows that movement segments are unstable only during the temporal transition between segments. This is analogous to coarticulation in speech, with similar implications for action recognition tasks.

Reproducibility

Reproducibility requires an action to have the same description even when a different performance of this action is considered. Intra-personal invariance deals with the same subject performing the same action repeatedly. Inter-personal invariance concerns different subjects executing the same action several times. A kinetological system is reproducible when the same symbolic representation is associated with the same action performed on different occasions (intra-personal) or by different subjects (inter-personal).

These videos display different subjects performing one step cycle of the same action: walk (top video) and run (bottom video). For the walk action we have 36 different subjects and, for the run action, we have 12 different subjects. This motion data was extracted from the CMU Motion Capture database. For each action, the performance of each subject was segmented according to our kinetological system and the same motor primitives were obtained for the (right and left) hip flexion-extension and (right and left) knee flexion-extension joint actuators.

To evaluate the reproducibility of our kinetological system, we used human gait data for 16 subjects covering males and females at several ages. For each person, we considered only 12 DOFs associated with the joint angles of the lower limbs: pelvic tilt, pelvic obliquity, pelvic rotation, hip flexion-extension, hip abduction-adduction, hip rotation, knee flexion-extension, knee valgus-varus, knee rotation, ankle dorsi-plantar flexion, foot rotation, and foot progression. A reproducibility measure is computed for each joint angle. The reproducibility measure of a joint angle is the fraction of the most representative symbolic description among all descriptions for the 16 individuals. A very high reproducibility measure means that symbolic descriptions match among different gait performances and the kinetological system is reproducible. The reproducibility measure is very high for the joint angles which play a primary role in an action, as Figure 3.10 shows for a walking action. The identification of the intrinsic and essential variables of an action is a byproduct of the reproducibility requirement of a kinetological system.
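The reproducibility measure for one joint angle can be sketched in a few lines of Python. The sketch assumes each subject's symbolic description is available as a string of state codes; the function name `reproducibility` and the example strings are hypothetical.

```python
from collections import Counter

def reproducibility(descriptions):
    """Fraction of subjects sharing the most frequent symbolic description."""
    counts = Counter(descriptions)
    return counts.most_common(1)[0][1] / len(descriptions)

# Four hypothetical subjects; three share the same description.
score = reproducibility(["BGRY", "BGRY", "BGRY", "BGYR"])
```

A score close to 1 means nearly all performances map to the same symbolic description for that joint angle.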

(a) Knee flexion-extension.

(b) Pelvic obliquity.

Figure 3.10. Reproducibility during gait.

Using our kinetological system, six joint angles obtained very high reproducibility: pelvic obliquity, hip flexion-extension, hip abduction-adduction, knee flexion-extension, foot rotation, and foot progression, as shown in Figure 3.11. These variables seem to be the most related to the movement of walking forward. Other joint angles obtained only a high reproducibility measure which is interpreted as a secondary role in the action: pelvic tilt and ankle dorsi-plantar flexion. The remaining joint angles had a poor reproducibility rate and seem not to be correlated to the action but probably to its stability instead: pelvic rotation, hip rotation, knee valgus-varus, and knee rotation. Our kinetological system performance on the reproducibility measure for all the joint angles shows that the system is reproducible for the DOFs intrinsically related to the action.

Figure 3.11. Reproducibility measure for 12 DOFs during gait.

Selectivity

The selectivity principle concerns the ability to discern between distinct actions. In terms of representation, this principle requires a different structure to represent different actions. To evaluate our kinetological system according to the selectivity principle, we compare the compact representation of several different actions and verify whether their structures are dissimilar. The selectivity property is demonstrated using a set of actions performed by the same individual. Four joint angles are considered: left and right hip flexion-extension, left and right knee flexion-extension, as shown in Figure 3.12.

(a) Walk.

(b) Run.

(c) Jump.

Figure 3.12. Selectivity: Different representations for three distinct actions.

The different actions are clearly represented by different structures. However, manner variations of an action are only different in the quantitative aspect. We investigate the quantitative aspect of four manner variations of the walk action performed by a single subject, as shown in Figure 3.13.

(a) Slow walk.

(b) Walk.

(c) Walk with stride.

(d) Walk with exaggerated stride.

Figure 3.13. Compact representations of four manner variations of the walk action.

Each manner variation has a total of 24 segments for the four joint angles considered. For each pair of manner variations, we compute a dissimilarity vector, where each element corresponds to the difference between the quantitative aspects of the associated segments in the two variations, as shown in Figure 3.14.

Figure 3.14. Dissimilarity vectors between manner variations of walk: time length (blue) and angular displacement (red).

From these vectors, we can verify the dissimilarity of the manner variations. The closest variations according to time length are “Walk with stride” and “Walk with exaggerated stride” (median dissimilarity 12.0%), and according to angular displacement are “Walk” and “Walk with stride” (median dissimilarity 12.2%). This way, even for the same action, the representation has enough dissimilarity to select between different manner variations.
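A dissimilarity vector of this kind could be computed as in the Python sketch below. The exact normalization is not given in the text, so this sketch assumes a relative difference expressed as a percentage of the larger of the two values; the function name `dissimilarity` and the sample numbers are hypothetical.

```python
import numpy as np

def dissimilarity(values_a, values_b):
    """Per-segment relative difference (in percent) between two manner variations."""
    a = np.asarray(values_a, dtype=float)
    b = np.asarray(values_b, dtype=float)
    scale = np.maximum(np.abs(a), np.abs(b))
    safe = np.where(scale > 0, scale, 1.0)   # avoid division by zero when both are 0
    return 100.0 * np.abs(a - b) / safe

# Hypothetical time lengths of three corresponding segments.
vector = dissimilarity([10, 20, 30], [10, 25, 15])
median_dissimilarity = float(np.median(vector))
```

The same function applies to time lengths and to angular displacements, giving the two vectors per pair of variations.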

Reconstructivity

Reconstructivity is associated with the ability to reconstruct the original movement signal up to an approximation factor from a compact representation. We propose a reconstruction method that consists in a novel interpolation algorithm based on the kinetological structure. We consider one segment at a time and concentrate on the state transitions between consecutive segments. Based on a transition, we determine constraints about the derivatives at border points of a segment. Derivatives will have zero value (equation) or a known sign (inequality) at these points.

Figure 3.15. Possible state transitions between segments.

For this discussion about reconstructivity, we consider a four-state kinetological system, with states abbreviated by their colors: B (positive velocity/positive acceleration), G (positive velocity/negative acceleration), Y (negative velocity/positive acceleration), and R (negative velocity/negative acceleration). We investigate the possible state transitions that are feasible in our kinetological system. Each segment can have only two possible states for a next neighbor segment. However, the transition B → Y (R → G) is impossible, since velocity cannot become negative (positive) with positive (negative) acceleration. The kinetological rules of our system are represented by a finite automaton, as shown in Figure 3.15. From these kinetological rules, each of the four segment states has only two possible state configurations for previous and next segments and, consequently, there are eight possible state sequences for three consecutive segments, as shown in Table 3.1. Each possible sequence of three segments corresponds to two equations and two inequality constraints associated with first and second derivatives at border points t_1 and t_2 of the center segment. Two further inequalities come from the derivatives at interior points (t_1 < t < t_2) of the segment.
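The count of eight feasible three-segment sequences can be checked with a small Python sketch. It encodes each state by the signs of velocity and acceleration and assumes that exactly one derivative changes sign at a segment border, with velocity only able to cross zero in the direction of the acceleration. The names `SIGNS`, `allowed`, `TRANSITIONS`, and `TRIPLES` are hypothetical, and the resulting automaton is derived from the rules in the text rather than copied from Figure 3.15.

```python
from itertools import product

# State -> (sign of velocity, sign of acceleration)
SIGNS = {"B": (+1, +1), "G": (+1, -1), "Y": (-1, +1), "R": (-1, -1)}

def allowed(current, nxt):
    """True if a segment in state `current` may be followed by state `nxt`."""
    (v0, a0), (v1, a1) = SIGNS[current], SIGNS[nxt]
    if v0 == v1 and a0 != a1:
        return True        # acceleration changes sign, velocity keeps its sign
    if v0 != v1 and a0 == a1:
        return a0 == v1    # velocity crosses zero only towards the acceleration sign
    return False           # no change, or both signs changing at once

TRANSITIONS = {s: [n for n in SIGNS if allowed(s, n)] for s in SIGNS}
TRIPLES = [(p, c, n) for p, c, n in product(SIGNS, repeat=3)
           if allowed(p, c) and allowed(c, n)]
```

Under these assumptions each state appears as the center of exactly two triples, which matches the eight sequences mentioned in the text.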

Kinetemes

Border Point t1

Border Point t2

Interior Points

Previous

Current

Next