Multimodal dialog systems have attracted increasing attention from both academia and industry in recent years. Although existing methods have made progress, they still struggle with question understanding (i.e., comprehending the user's intention). In this paper, we present a relational graph-based context-aware question understanding scheme, which enhances user intention comprehension from the local to the global level. Specifically, we first use multiple attribute matrices as guidance to extract the product-related keywords from each textual sentence, strengthening the local representation of user intentions. We then design a sparse graph attention network that adaptively aggregates effective context information for each utterance, so that user intentions are understood from a global perspective. Extensive experiments on a benchmark dataset demonstrate the superiority of our model over several state-of-the-art baselines.
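
To make the global aggregation step concrete, the following is a minimal sketch of sparse attention over utterance representations. It is not the model described above. The function name `sparse_attention_aggregate`, the scaled dot-product scoring, the top-k rule used for sparsification, and the absence of learned parameters are all illustrative assumptions, since the abstract does not specify these details.

```python
import math


def sparse_attention_aggregate(utterances, top_k=2):
    """Aggregate context for each utterance from its top_k highest-scoring neighbours.

    Assumptions (hypothetical, not taken from the paper): scaled dot-product
    scores, top-k sparsification, softmax over the kept neighbours only,
    and no learned projection matrices.
    """
    n = len(utterances)
    dim = len(utterances[0])
    aggregated = []
    for i in range(n):
        # Score every other utterance j against utterance i.
        scores = [
            (sum(a * b for a, b in zip(utterances[i], utterances[j])) / math.sqrt(dim), j)
            for j in range(n)
            if j != i
        ]
        # Sparsify: keep only the top_k highest-scoring neighbours.
        kept = sorted(scores, reverse=True)[:top_k]
        if not kept:
            # A single-utterance dialog has no context to aggregate.
            aggregated.append([0.0] * dim)
            continue
        # Numerically stable softmax restricted to the kept neighbours.
        max_score = max(score for score, _ in kept)
        exps = [math.exp(score - max_score) for score, _ in kept]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the kept neighbours' vectors.
        context = [
            sum(w * utterances[j][d] for w, (_, j) in zip(weights, kept))
            for d in range(dim)
        ]
        aggregated.append(context)
    return aggregated
```

Calling `sparse_attention_aggregate(vectors, top_k=1)` on a list of equal-length utterance vectors returns one context vector per utterance. Neighbours outside the top-k receive exactly zero weight, which is what makes the attention sparse in this sketch.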