Scientific simulations often require solving extremely large sparse systems of linear equations, whose dominant kernel is sparse matrix-vector multiplication (SpMV). On modern many-core processors such as GPUs, this operation is known to be a significant bottleneck and to run at extremely poor efficiency, because of limited processor-to-memory bandwidth and a low cache hit ratio caused by random accesses to the input vector. Our family of new sparse matrix formats for many-core processors significantly increases the cache hit ratio, and thus performance, by segmenting the matrix along its columns, dividing the work among the many cores so that each segment fits within the internal cache capacity, and aggregating the partial results afterward. Compared to the best vendor libraries and recently proposed formats such as SELL-C-σ, our formats achieved speedups of up to 2.0x in SpMV for real datasets taken from the Florida Sparse Matrix Collection, up to 3.0x in SpMV for synthetic matrices, and up to 1.68x for multi-node CG.
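The column-segmentation idea can be illustrated with a minimal sketch: nonzeros are grouped by column segment so that each segment reads only a cache-sized slice of the input vector, and the per-segment partial results are aggregated at the end. All names, the COO-style layout, and the fixed segment width here are illustrative assumptions, not the paper's actual format.

```python
import numpy as np

def segmented_spmv(rows, cols, vals, x, n_rows, seg_width):
    """Illustrative column-segmented SpMV (assumed layout, not the
    paper's format): each column segment touches only a seg_width
    slice of x, so vector accesses stay cache-resident; partial
    per-segment results are summed afterward."""
    n_segs = (len(x) + seg_width - 1) // seg_width
    partial = np.zeros((n_segs, n_rows))    # one partial result per segment
    for r, c, v in zip(rows, cols, vals):
        s = c // seg_width                  # segment this nonzero's column falls in
        partial[s, r] += v * x[c]           # x[c] is within the segment's slice
    return partial.sum(axis=0)              # aggregation step

# Small usage example: a 3x4 matrix with 4 nonzeros in COO form.
rows = [0, 0, 1, 2]
cols = [0, 3, 1, 2]
vals = [1.0, 2.0, 3.0, 4.0]
x = np.array([1.0, 2.0, 3.0, 4.0])
y = segmented_spmv(rows, cols, vals, x, n_rows=3, seg_width=2)
```

On a real GPU each segment's slice of `x` would be staged in fast on-chip memory and segments processed in parallel; the sequential loop above only shows the data-access pattern.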