Our methodology was computational at the beginning and subjective at the end.
We started with about 7,000 Major League Baseball interview transcripts that were compiled by ASAP Sports, mostly from news conferences at playoffs and All-Star Games. We transformed the text into a database containing questions, answers and metadata about the answers, then extracted four- and five-word phrases and calculated a PMI (pointwise mutual information) score for each. (PMI measures how much more often words appear together than they would if they occurred independently; the higher the score, the more likely the phrase is a cliché.) We eliminated phrases that showed up fewer than seven times and had PMI scores of less than 25. We used the Python library NLTK for the text analysis.
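For readers who want to see the scoring step in concrete terms, here is a minimal sketch of PMI for multiword phrases, written in plain Python rather than with NLTK. It is an illustration under assumptions, not the code used for this analysis: the function names, the whitespace tokenizer and the extension of PMI to four- and five-word phrases (joint probability divided by the product of the individual word probabilities) are all our own choices. The filter follows the sentence above literally, and the comment notes the stricter alternative reading.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def pmi_scores(answers, n, min_count=7, min_pmi=25.0):
    """Score each n-word phrase by PMI and filter out weak candidates.

    Assumption: tokens are lowercased and split on whitespace, and
    probabilities are estimated as counts divided by the total token count.
    """
    word_counts = Counter()
    phrase_counts = Counter()
    for answer in answers:
        tokens = answer.lower().split()
        word_counts.update(tokens)
        phrase_counts.update(ngrams(tokens, n))

    total = sum(word_counts.values())
    scores = {}
    for phrase, count in phrase_counts.items():
        # PMI = log2( P(phrase) / product of P(word) for each word )
        log_joint = math.log2(count / total)
        log_independent = sum(math.log2(word_counts[w] / total) for w in phrase)
        score = log_joint - log_independent
        # Literal reading of the filter: drop a phrase only if it is both
        # rare and low-scoring. Replacing "and" with "or" gives the stricter
        # reading, where a phrase must pass both thresholds to survive.
        if count < min_count and score < min_pmi:
            continue
        scores[phrase] = score
    return scores
```

Calling `pmi_scores(answers, n=4)` and `pmi_scores(answers, n=5)` on a list of answer strings would give candidate phrases of both lengths. The thresholds are exposed as parameters because suitable values depend on the size of the corpus.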
We grouped phrases that were variations of one another (differing by one or two words), which left us with a list of roughly 20,000 possible clichés. Then came the subjective part. From that list, we chose the phrases that were the most interesting, then grouped those with similar meanings. And voilà: the phrases we considered to be the cream of the cliché crop.
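The grouping of near-identical phrases can be sketched in a similar way. The text does not say how the one- or two-word difference was measured, so this example assumes a word-level edit distance (insertions, deletions and substitutions of whole words) and a simple greedy pass that compares each phrase with the first member of each existing group. Both function names are hypothetical.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two phrases, counted in whole words."""
    a_words, b_words = a.split(), b.split()
    previous = list(range(len(b_words) + 1))
    for i, a_word in enumerate(a_words, 1):
        current = [i]
        for j, b_word in enumerate(b_words, 1):
            cost = 0 if a_word == b_word else 1
            current.append(min(
                previous[j] + 1,         # delete a word from a
                current[j - 1] + 1,      # insert a word from b
                previous[j - 1] + cost,  # keep or substitute a word
            ))
        previous = current
    return previous[-1]


def group_variations(phrases, max_diff=2):
    """Greedily place each phrase in the first group whose first member
    is within max_diff word edits; otherwise start a new group."""
    groups = []
    for phrase in phrases:
        for group in groups:
            if word_edit_distance(phrase, group[0]) <= max_diff:
                group.append(phrase)
                break
        else:
            groups.append([phrase])
    return groups
```

A greedy pass like this depends on the order of the input, so sorting phrases by frequency first would make the most common wording the anchor of each group. Grouping by similar meaning, the later subjective step, is not something this sketch attempts.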