{ "id": "1807.00612", "version": "v1", "published": "2018-07-02T12:04:24.000Z", "updated": "2018-07-02T12:04:24.000Z", "title": "Multi-modal Egocentric Activity Recognition using Audio-Visual Features", "authors": [ "Mehmet Ali Arabacı", "Fatih Özkan", "Elif Surer", "Peter Jančovič", "Alptekin Temizel" ], "categories": [ "cs.CV" ], "abstract": "Egocentric activity recognition in first-person videos has an increasing importance with a variety of applications such as lifelogging, summarization, assisted-living and activity tracking. Existing methods for this task are based on interpretation of various sensor information using pre-determined weights for each feature. In this work, we propose a new framework for egocentric activity recognition problem based on combining audio-visual features with multi-kernel learning (MKL) and multi-kernel boosting (MKBoost). For that purpose, firstly grid optical-flow, virtual-inertia feature, log-covariance, cuboid are extracted from the video. The audio signal is characterized using a \"supervector\", obtained based on Gaussian mixture modelling of frame-level features, followed by a maximum a-posteriori adaptation. Then, the extracted multi-modal features are adaptively fused by MKL classifiers in which both the feature and kernel selection/weighing and recognition tasks are performed together. The proposed framework was evaluated on a number of egocentric datasets. The results showed that using multi-modal features with MKL outperforms the existing methods.", "revisions": [ { "version": "v1", "updated": "2018-07-02T12:04:24.000Z" } ], "analyses": { "keywords": [ "multi-modal egocentric activity recognition", "egocentric activity recognition problem", "multi-modal features", "existing methods", "maximum a-posteriori adaptation" ], "note": { "typesetting": "TeX", "pages": 0, "language": "en", "license": "arXiv", "status": "editable" } } }