Up until 8:22 pm Eastern Time in the US on November 8, 2016, The New York Times' online polling had given Hillary Clinton an 82% chance of becoming the next US president. The rest, of course, is history.

Online polls failed to predict the outcomes of other recent elections too, including the Brexit vote in Britain. Such verifiable evidence provides enough reason to doubt the results of any online poll, for they grossly violate the basic principles of statistical surveys.

In the 19th- and 20th-century US, newspapers and magazines featured clip-out coupons for 'straw polls' that readers sent in to cast ballots for their preferred candidate. Today's online polls descend from that practice. Random sampling is the core of any statistical survey, and it happens to be impossible to ensure on the internet.

Forget about randomness, there is no way of even selecting samples on the internet. To compensate, online pollsters sometimes use extensive statistical modelling. However, statisticians doubt the usefulness of such modelling as a substitute for randomness.

There is a well-known 'non-response bias' in online surveys. If I'm interested in cricket, the likelihood is high that I would find an online survey on, say, 'Best T20 batsmen' or 'Best ODI bowlers'. Or rather, these online polls would find me on the basis of my internet search history. But for unbiasedness, every member of the population must have an equal probability of selection. Moreover, the demographic pattern of internet users grossly mismatches that of the population.

For example, in the US, while 97% of people in the age group between 18 and 29 use the internet, they made up just 13% of the 2014 electorate, according to the exit poll conducted by Edison Research. Some 40% of those 65 years old and older do not use the internet. But they made up 22% of those who voted.

In addition, internet users tend to be urban dwellers and have above-average incomes. Therefore, they don't necessarily represent the population as a whole. The demographic background and location of the respondents to an internet-based survey remain unverified. Also, online polling can be manipulated by enlisting friends or employees to serve vested interests and produce 'desirable results'.

The Web Unravels

Usually, there is no way to prevent people from voting more than once. So, how are the data weighted? What is the sampling error and how is that measured? According to the National Council on Public Polls (NCPP) in the US, “many web-based surveys are completely unreliable. Indeed, to describe them as ‘polls’ is to misuse that term.”

In 1998, 52% of more than 100,000 respondents to an AOL online poll opined that US President Bill Clinton should have resigned because of his relationship with White House intern Monica Lewinsky. Telephone polls conducted at the same time with much smaller but representative samples showed far fewer people seeking Clinton's resignation: 21% in a CBS poll, 23% in a Gallup poll, and 36% in an ABC poll.

Quite often, people holding a particular view are more likely to respond to an online poll. But online poll results usually do not mention that the individuals chose to participate, and that they are unlikely to be representative of the general population. Without this caveat, readers are left with the incorrect impression that the results apply to the general population.

In 2009, Responsive Management and the South Carolina Department of Natural Resources (SCDNR) in the US conducted a survey on saltwater fishing and shell-fishing in the state of South Carolina. A scientific survey conducted by telephone and a survey conducted via the internet were both administered to a closed population: people who had obtained a South Carolina Saltwater Recreational Fisheries Licence. Every licence-holder had an equal chance of being contacted by telephone.

However, the online survey used a sample consisting of licensees who had provided an email address while purchasing their licences. Thus, the online survey systematically eliminated approximately 88% of the possible sample, yielding a severe bias. In addition, only 20.5% of the email address holders responded to the online survey. The online survey respondents were, in general, a more educated and affluent group, and also disproportionately male. Only 5.7% of the online survey sample was female, while 19.9% of the telephone sample was female.

In the licence-holder database, however, 18.5% were female. The online poll sample was a disastrous mismatch with the demography of the population. So, no data is certainly better than bad data.
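The arithmetic behind the South Carolina comparison can be laid out in a few lines. This is only a rough sketch using the figures quoted above (the 88% is approximate, and the variable names are illustrative): it suggests that the online respondents amounted to roughly 2.5% of all licence-holders, and it sets each sample's female share against the licence database.

```python
# Figures as quoted in the article; the 88% exclusion figure is approximate.
excluded_share = 0.88   # licensees with no email address on file
response_rate = 0.205   # share of email address holders who responded

# Share of all licence-holders who ended up in the online sample.
covered_share = 1 - excluded_share
responding_share = covered_share * response_rate

# Female share in each sample versus the licence-holder database.
database_female = 0.185
sample_female = {"online": 0.057, "telephone": 0.199}

# Gap in percentage points between each sample and the database.
gap_points = {
    mode: (share - database_female) * 100
    for mode, share in sample_female.items()
}

print(f"Online respondents as a share of all licensees: {responding_share:.1%}")
for mode, gap in gap_points.items():
    print(f"{mode}: {gap:+.1f} points from the database share")
```

On these figures the telephone sample sits within a couple of points of the database, while the online sample is off by more than twelve.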

Net-Net, A Question Mark

Of late, there has been a huge surge in online polling as the number of internet users has grown. If its scope were kept to mere entertainment, that would not be much of an issue. However, it may be nearly impossible to ignore the easy, cheap, quick and ever-expanding scope of the internet to gauge public opinion on any issue.

But even then, there should be a combination of online and in-person surveys, where the online results can be validated, or adjusted, based on the in-person survey results. Of course, a lot of theoretical work still needs to be done before putting this into practice.
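One simple form such an adjustment could take is post-stratification: reweighting the online respondents so that each group counts in proportion to its share in a trusted reference. The sketch below is an illustration under stated assumptions, not a recommended method. The function name `poststratify` is made up for this example, the group shares are the female and male shares from the South Carolina case, and the per-group answers are entirely hypothetical. Reweighting of this kind corrects only the group mix; it cannot undo the self-selection of respondents within each group.

```python
def poststratify(sample_share, reference_share, group_means):
    """Reweight group means so each group counts in its reference proportion.

    Hypothetical helper for illustration only.
    """
    # Weight per respondent in each group: reference share / sample share.
    weights = {g: reference_share[g] / sample_share[g] for g in sample_share}
    # Unadjusted estimate: groups count as they appear in the online sample.
    raw = sum(sample_share[g] * group_means[g] for g in sample_share)
    # Adjusted estimate: groups count as they appear in the reference.
    adjusted = sum(
        sample_share[g] * weights[g] * group_means[g] for g in sample_share
    )
    return weights, raw, adjusted


# Group shares taken from the South Carolina example in the article.
sample_share = {"female": 0.057, "male": 0.943}     # online sample
reference_share = {"female": 0.185, "male": 0.815}  # licence database

# HYPOTHETICAL share answering "yes" to some question in each group.
group_means = {"female": 0.40, "male": 0.60}

weights, raw, adjusted = poststratify(sample_share, reference_share, group_means)
print(f"unadjusted: {raw:.1%}, adjusted: {adjusted:.1%}")
```

Each under-represented female respondent here carries a weight above 3, which also shows the cost: a handful of respondents end up speaking for a large part of the population.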

The writer is professor of statistics, Indian Statistical Institute, Kolkata

Cherry-pickpocketing results

