HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly