Linguistic compositionality -- the use of simpler expressions to form the meanings of more complex ones -- underlies people's ability to use natural languages creatively, and to generalize their use of language correctly in novel settings. While traditional theories of compositionality focus on the rules for computing the meaning of more complex expressions from simpler ones, this does not provide a complete theory of linguistic meaning. In particular, it does not provide an account of what is being composed together, i.e. how meanings are grounded in the external world. In the current work, we provide a machine learning framework for jointly learning a language's compositional semantics and its external groundings. We evaluate this system on visual question answering benchmarks, and demonstrate that linguistic data can be used to learn both canonical and non-canonical representations of objects.