Egocentric human activity recognition (ego-HAR) has attracted attention in fields where human intentions must be estimated from video. However, the performance of existing methods is limited by the insufficient information about the subject's own motion available in egocentric videos. To overcome this problem, we propose using inertial sensor data from both hands as a supplement to egocentric videos for the ego-HAR task. For this purpose, we construct a publicly available dataset, Egocentric Video and Inertial Sensor data Kitchen (EvIs-Kitchen), which contains well-synchronized egocentric videos and two-hand inertial sensor data, with interaction-focused actions as recognition targets. We further determine suitable input combinations and component variants through experiments with a two-branch late-fusion architecture. The results show that our multimodal setup outperforms all single-modal methods on EvIs-Kitchen.
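To make the two-branch late-fusion idea concrete, the following is a minimal, hypothetical PyTorch sketch: one branch scores a video clip, the other scores a two-hand inertial sequence, and the class scores are combined at the decision level. The encoder choices, feature dimensions, channel layout, and equal-weight averaging are illustrative assumptions, not the architecture reported in the paper.

```python
# Minimal two-branch late-fusion sketch (assumed encoders and dimensions,
# NOT the paper's exact design).
import torch
import torch.nn as nn


class VideoBranch(nn.Module):
    """Stand-in video branch: maps pre-extracted clip features to class logits."""
    def __init__(self, feat_dim=2048, num_classes=20):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(), nn.Linear(512, num_classes)
        )

    def forward(self, x):  # x: (B, feat_dim)
        return self.head(x)


class InertialBranch(nn.Module):
    """Stand-in inertial branch: encodes two-hand IMU sequences with a GRU."""
    def __init__(self, channels=12, hidden=128, num_classes=20):
        super().__init__()
        self.gru = nn.GRU(channels, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):  # x: (B, T, channels), e.g. 2 hands x (3 acc + 3 gyro)
        _, h = self.gru(x)
        return self.fc(h[-1])


class LateFusion(nn.Module):
    """Decision-level fusion: average the per-branch class scores."""
    def __init__(self, num_classes=20):
        super().__init__()
        self.video = VideoBranch(num_classes=num_classes)
        self.imu = InertialBranch(num_classes=num_classes)

    def forward(self, clip_feat, imu_seq):
        return 0.5 * self.video(clip_feat) + 0.5 * self.imu(imu_seq)


if __name__ == "__main__":
    model = LateFusion(num_classes=20)
    clip_feat = torch.randn(4, 2048)        # batch of 4 pre-extracted video features
    imu_seq = torch.randn(4, 100, 12)       # batch of 4 IMU windows, 100 timesteps
    print(model(clip_feat, imu_seq).shape)  # torch.Size([4, 20])
```

Because fusion happens at the score level, each branch can be trained or swapped independently; a weighted or learned fusion could replace the fixed 0.5/0.5 averaging used here for simplicity.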