arXiv: Autonomous Evaluation and Refinement of Digital Agents