arXiv CodeScore: Evaluating Code Generation by Learning Code Execution