arXiv: Multi-modal preference alignment remedies regression of visual instruction tuning on language model