Feature extraction is an important process in human activity recognition (HAR) with wearable sensors. Recent studies have shown that learned features are more effective than manually engineered features in related fields. However, the scarcity and high cost of labeled data limit the development of representation learning for sensor data. Our work addresses this issue by introducing a self-supervised learning method that uses unlabeled data to improve the quality of learned sensor representations. We hypothesize that unlabeled wearable sensor data from human activities exhibit both long-term and short-term temporal contextual correlations, and we exploit these correlations with a Transformer and the Contrastive Predictive Coding (CPC) framework. The learned representations are evaluated on human activity recognition and detection tasks in real-life scenarios. The experiments show that our method outperforms previous state-of-the-art methods on the MotionSense and MobiAct datasets for the HAR task and achieves strong performance on the EVARS dataset for the action detection task.
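To make the architecture described above concrete, the sketch below shows one way a Transformer-based CPC objective over sensor frames could be set up in PyTorch. It is a minimal illustration, not the authors' implementation: the class name CPCTransformer, the layer sizes, the prediction horizon, and the negative-sampling scheme (negatives drawn from other time steps of the same sequence) are all illustrative assumptions.

```python
# Hedged sketch of Transformer + CPC for wearable sensor windows (assumed PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CPCTransformer(nn.Module):
    def __init__(self, in_channels=6, dim=64, horizon=4):
        super().__init__()
        # Frame-level encoder: raw accelerometer/gyroscope frames -> latent z_t
        self.encoder = nn.Sequential(
            nn.Linear(in_channels, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        # Autoregressive context network: Transformer over past latents -> context c_t
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.context = nn.TransformerEncoder(layer, num_layers=2)
        # One linear predictor per future step k (predicts z_{t+k} from c_t)
        self.predictors = nn.ModuleList([nn.Linear(dim, dim) for _ in range(horizon)])
        self.horizon = horizon

    def forward(self, x):
        # x: (batch, time, channels) raw sensor frames
        z = self.encoder(x)                                   # (B, T, dim)
        T = z.size(1)
        # Causal mask so c_t attends only to z_1..z_t
        mask = torch.triu(
            torch.full((T, T), float("-inf"), device=z.device), diagonal=1
        )
        c = self.context(z, mask=mask)                        # (B, T, dim)
        return z, c


def info_nce_loss(z, c, predictors):
    # InfoNCE: for each horizon step k, the context c_t must identify the true
    # future latent z_{t+k} among negatives taken from other time steps.
    B, T, D = z.shape
    loss = 0.0
    for k, w in enumerate(predictors, start=1):
        pred = w(c[:, : T - k])                               # predictions of z_{t+k}
        target = z[:, k:]                                     # true future latents
        logits = torch.einsum("btd,bsd->bts", pred, target)   # scores vs. all steps
        labels = torch.arange(T - k, device=z.device).expand(B, -1)
        loss = loss + F.cross_entropy(logits.reshape(-1, T - k), labels.reshape(-1))
    return loss / len(predictors)


# Illustrative usage on a batch of unlabeled 6-channel sensor windows.
model = CPCTransformer()
x = torch.randn(8, 128, 6)        # (batch, time, channels)
z, c = model(x)
loss = info_nce_loss(z, c, model.predictors)
loss.backward()
```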