Calgary Herald - SPORTS

Our methodology was computational at the beginning and subjective at the end.

We started with about 7,000 Major League Baseball interview transcripts compiled by ASAP Sports, mostly from news conferences at playoffs and All-Star Games. We transformed the text into a database containing questions, answers and metadata about the answers, then extracted four- and five-word phrases and calculated a PMI (pointwise mutual information) score for each. (The higher the PMI score, the more likely the phrase is a cliché.) We eliminated phrases that showed up fewer than seven times and had PMI scores of less than 25. We used the Python library NLTK for the text analysis.
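For readers curious about the scoring step, here is a minimal sketch of how a PMI score for a multi-word phrase might be computed. It is not the code used for this analysis: it relies only on the Python standard library rather than NLTK, it splits text on whitespace without handling punctuation, and it assumes one common generalization of PMI to longer phrases (the log, base 2, of the phrase's probability divided by the product of its words' probabilities). The function names `ngrams` and `pmi_scores` and the `min_count` parameter are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def pmi_scores(answers, n=4, min_count=1):
    """Score each n-word phrase with a generalized PMI (an assumed formula).

    PMI = log2( P(phrase) / product of P(word) for each word in the phrase )
    Phrases seen fewer than min_count times are skipped.
    """
    word_counts = Counter()
    phrase_counts = Counter()
    for answer in answers:
        tokens = answer.lower().split()  # simplification: whitespace tokens only
        word_counts.update(tokens)
        phrase_counts.update(ngrams(tokens, n))

    total_words = sum(word_counts.values())
    total_phrases = sum(phrase_counts.values())

    scores = {}
    for phrase, count in phrase_counts.items():
        if count < min_count:
            continue
        p_phrase = count / total_phrases
        # Probability of the phrase if its words appeared independently.
        p_independent = math.prod(word_counts[w] / total_words for w in phrase)
        scores[" ".join(phrase)] = math.log2(p_phrase / p_independent)
    return scores


# Toy data, invented for illustration only.
answers = [
    "we take it one game at a time",
    "we take it one game at a time",
    "a game is a game",
]
scores = pmi_scores(answers, n=4, min_count=2)
```

A phrase scores high when its words appear together far more often than their individual frequencies would predict, which is why a high score points toward a stock expression. A PMI cutoff can then be applied to the returned dictionary.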

We grouped together phrases that were variations of one another (differing by one or two words), producing a list of roughly 20,000 possible clichés. Then came the subjective part. From that list, we chose the phrases that were the most interesting, then grouped those with similar meanings. And voilà: the phrases we considered to be the cream of the cliché crop.
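The grouping of near-identical phrases can be illustrated with a short sketch. This is an assumption about how such grouping could work, not a description of the code behind this analysis: it measures the difference between two phrases as a word-level edit distance and greedily assigns each phrase to the first group whose lead phrase is within two words of it. The names `word_distance`, `group_variations` and `max_diff` are hypothetical.

```python
def word_distance(a, b):
    """Word-level Levenshtein distance: the fewest word insertions,
    deletions or substitutions needed to turn phrase a into phrase b."""
    x, y = a.split(), b.split()
    prev = list(range(len(y) + 1))
    for i, wx in enumerate(x, start=1):
        curr = [i]
        for j, wy in enumerate(y, start=1):
            cost = 0 if wx == wy else 1
            curr.append(min(prev[j] + 1,          # delete a word from a
                            curr[j - 1] + 1,      # insert a word into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def group_variations(phrases, max_diff=2):
    """Greedily group phrases within max_diff words of a group's first phrase."""
    groups = []
    for phrase in phrases:
        for group in groups:
            if word_distance(phrase, group[0]) <= max_diff:
                group.append(phrase)
                break
        else:
            # No existing group is close enough, so start a new one.
            groups.append([phrase])
    return groups


# Example phrases, invented for illustration only.
phrases = [
    "one game at a time",
    "one day at a time",
    "give a lot of credit",
    "give them a lot of credit",
]
groups = group_variations(phrases)
```

Greedy grouping depends on the order in which phrases are seen, so a real pipeline might sort phrases by frequency first or use a proper clustering method.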
