We're supposed to stick to particular articles, as the OP did, but this discussion has veered off into the general. I'm hoping to turn it back with this post by pointing out some history (which most folks, luckily, are too young to have lived through) and some new behavioral theory.
In the early days of audio, the 1950s, '60s, and even the '70s, measuring gear was difficult, and results were unreliable for lack of standards. Lots of gear measured terribly. Lots of gear didn't meet its maker's specs, or even come close. Some of it sounded bad, and some sounded good anyway.
The objectivist mags arose to combat this problem. Their first priority, correctly, was to measure gear themselves and force mfrs to publish real numbers. During this period, the industry struggled to agree on measurement standards: even something as simple as amp power could be stated as IHF, RMS, peak, or any given mfr's own standard. There was no obvious way to discuss how things sounded, but since much of the gear that measured badly also sounded bad, it was valid to focus on measurements.
Harry Pearson and J. Gordon Holt started with the luxury of decent measurements, and focused on the fact that measurements didn't appear to fully correlate with SQ. The debate over whether and why this is true still rages, but it's interesting to note that most industry professionals simply accept that once the measurements are within an acceptable range, listening tests are crucial in the design process.
Ken Kantor pointed out in an interview that we haven't yet defined the goal. Oversimplified: is the goal for the playback itself to measure like the original, or to create a human perception of being like the original? Could it be that certain types of distortion of the original cause humans to perceive the playback as more like the original? Little work on this has reached the audio public's attention yet, although our own RichPA has pointed to studies showing that humans can hold repeatable preferences even where they cannot reliably tell two things apart. The point is that this whole area of study is still in its infancy, and HP and JGH were applying practical problem-solving years ahead of the behavioral science.
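That "repeatable preference without reliable discrimination" idea sounds paradoxical, so here's a minimal toy simulation to show how it can happen. To be clear: this is my own sketch, not taken from any study RichPA or anyone else cited, and the model and all numbers are assumptions. Each listener carries a weak internal bias toward one unit, buried under much larger perceptual noise, so almost nobody picks consistently enough to pass an individual discrimination-style test, yet the aggregate preference across many listeners is plainly measurable.

```python
import random

random.seed(42)

# Assumed toy parameters, chosen only for illustration.
TRUE_BIAS = 0.1    # weak internal signal favoring unit A
NOISE = 1.0        # perceptual noise, much larger than the signal
LISTENERS = 1000
TRIALS_EACH = 10

def one_trial():
    """One forced choice: pick A if noisy internal evidence favors it."""
    evidence = TRUE_BIAS + random.gauss(0, NOISE)
    return evidence > 0

# Per-listener view: fraction of trials on which each listener picked A.
per_listener = [
    sum(one_trial() for _ in range(TRIALS_EACH)) / TRIALS_EACH
    for _ in range(LISTENERS)
]

# Discrimination-style criterion: did an individual pick A 9 of 10 times?
reliable = sum(1 for p in per_listener if p >= 0.9)
print(f"listeners individually 'reliable' (>= 9/10 for A): {reliable}/{LISTENERS}")

# Preference-style view: pool everyone's choices.
overall = sum(per_listener) / LISTENERS
print(f"aggregate preference for unit A: {overall:.1%}")
```

With these made-up numbers, each listener picks A only about 54% of the time, so only a handful ever look "reliable" on their own; but across thousands of pooled trials, that 54% sits well clear of the 50% coin flip. Small per-person effects, invisible in small double-blind panels, show up clearly at population scale.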
If you start with the idea that it's difficult to define exactly how a system should perform (so you can't really even describe "a perfect copy machine"), and add the idea that measurements don't always indicate whether the perceived performance will be acceptable, then subjective listening tests become valid data points. In fact, rather than double-blind testing a small number of folks for their ability to differentiate, a better data set would come from testing large numbers of people on their preferences. That data set is gathered indirectly by sales: the more people prefer something (even though they can't reliably differentiate it), the more of them buy it. The role of a professional reviewer thus becomes to act as a stand-in for mass preference testing. To the extent that a reviewer accurately describes what he's hearing, a consumer can use the review as a reference, once the consumer gets a sense of how the reviewer's perceptions usually differ from his own. This, of course, is exactly the point Erik was making.
Art Dudley is good at describing what he's hearing, and many people have found their perceptions agree with his, so he's a valid data point for those consumers, even if neither he nor they can reliably differentiate the things they prefer. The name-calling could more happily be left out, but I suspect it arises from long-term frustration with folks who haven't read, or can't accept, the behavioral science validating the existence of reliable preferences without reliable discrimination.
OK... sorry for the long post!