I agree that ``interpretation,'' in the sense of rule-governed physical pattern matching (as in a mechanical loom or digital computer), is not the same as the conscious interpretation of syntactic symbol-manipulation rules by a person. But it's the execution of the manipulations that we are equating here, not the ``interpretation'' in either of these senses. And the sense of ``interpretation'' that we are actually aiming for is yet a third one: the sense in which thoughts are meaningful (and ungrounded symbols, undergoing manipulation, no matter by whom or what, are not).
Never mind. Let us concede that if Hayes can ever give a nonarbitrary criterion for what does and does not count as an implementation of the same software among otherwise Turing-indistinguishable, Turing-equivalent, and even strongly equivalent ``implementations'' (``virtual'' ones, shall we call them?) of the same symbol system, then the Chinese Room Argument will have to be reconsidered (but probably so will a lot of the computationalism and functionalism that currently depends on the older, looser criterion).
I do have to point out, though, that there is a difference between a computer's being connected to peripheral transducers (cameras, say) and the computer's being those transducers (which it is not: a computer certainly consists of transducers too, but not the transducers that would be a robot's sensorimotor surfaces; those are the kinds of transducers I am talking about). This is not just a terminological point. My own grounding hypothesis is that, to a great extent, we are (sensorimotor) transducers (and their analog extensions): our mental states are the activity of sensorimotor transducers (which are part of an overall TTT-capable system), and that activity is an essential component of thinking states. No transducer activity: no thinking state. There is no way to ``reconfigure'' an all-purpose computer, one that can implement just about any program you like, into a sensorimotor transducer -- except by adding a sensorimotor transducer to it. That, I take it, is bad news for the hypothesis that thinking is just computation (if my transduction hypothesis is right).
Because I'm interested in mind-modelling and not just in machine virtuosity, I have singled out TTT-scale grounding as the empirical goal. One can speak of a digital camera as ``grounded'' in a trivial sense: the internal computational states in such a ``dedicated'' computer are indeed ``bound'' to certain external energy configurations falling on its transducer surface, and not just as a matter of our interpretations. But such trivial grounding does not justify talking about the camera's having ``beliefs''! Only the TTT has the power to match the complexity, and to narrow the degrees of freedom for the interpretation of its internal states, to something commensurate with our own (and I agree with Hayes that the expressive power of natural language, a subset of the TTT, may well loom large in such a system). Otherwise we are indeed talking metaphor (or hermeneutics) rather than reality.