Egocentric Human Activity Recognition (ego-HAR) has received attention in fields that require estimating human intentions from video. The performance of existing methods, however, is limited by insufficient information about the subject's motion in egocentric videos. We argue that a dataset pairing egocentric videos with data from two inertial sensors attached to the subject's wrists, which provide richer information about the subject's motion, would be useful for studying this problem in depth. This paper therefore provides a publicly available dataset, EvIs-Kitchen, which contains well-synchronized egocentric videos, two-hand inertial sensor data, and interaction-highlighted annotations. We also present a baseline multimodal activity recognition method with a two-stream architecture and score fusion, to validate that multimodal learning over egocentric videos and inertial sensor data is more effective for this task. Experiments show that our multimodal method outperforms single-modal methods on EvIs-Kitchen.
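
To make the two-stream idea concrete, the following minimal PyTorch sketch illustrates late (score-level) fusion of a video stream and a two-wrist inertial stream. The backbone choices, channel counts, class count, and fusion weight here are illustrative assumptions for exposition, not the implementation described in this paper.

```python
# Minimal sketch (not the paper's implementation): late score fusion of a
# video stream and a wrist-IMU stream for activity recognition.
# All module choices, tensor shapes, and the fusion weight are assumptions.
import torch
import torch.nn as nn

NUM_CLASSES = 10  # hypothetical number of activity classes


class VideoStream(nn.Module):
    """Toy video branch: 3D convolution pooled over space and time."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(16, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, 3, frames, height, width)
        feats = self.backbone(clips).flatten(1)
        return self.fc(feats)  # per-class scores (logits)


class InertialStream(nn.Module):
    """Toy IMU branch: 1D convolutions over two-wrist sensor channels."""

    def __init__(self, num_classes: int, channels: int = 12):
        super().__init__()
        # channels: e.g. 2 wrists x (3-axis accel + 3-axis gyro) = 12 (assumed)
        self.backbone = nn.Sequential(
            nn.Conv1d(channels, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(32, num_classes)

    def forward(self, imu: torch.Tensor) -> torch.Tensor:
        # imu: (batch, channels, timesteps)
        feats = self.backbone(imu).flatten(1)
        return self.fc(feats)


def fuse_scores(video_logits, imu_logits, alpha: float = 0.5):
    """Late (score-level) fusion: weighted sum of per-stream class probabilities."""
    video_prob = torch.softmax(video_logits, dim=-1)
    imu_prob = torch.softmax(imu_logits, dim=-1)
    return alpha * video_prob + (1.0 - alpha) * imu_prob


if __name__ == "__main__":
    video_net, imu_net = VideoStream(NUM_CLASSES), InertialStream(NUM_CLASSES)
    clips = torch.randn(2, 3, 8, 64, 64)  # dummy egocentric video clips
    imu = torch.randn(2, 12, 100)         # dummy two-wrist IMU windows
    fused = fuse_scores(video_net(clips), imu_net(imu))
    print(fused.argmax(dim=-1))           # predicted activity per sample
```

Each stream is trained on its own modality, and their class scores are combined only at the end; the fusion weight `alpha` would typically be tuned on a validation split.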