Chance News 75
July 6, 2011 to August 9, 2011
"I suspect the amygdala did not evolve to store odds ratios and heterogeneity P scores, but when an adverse event has prompted me to review the literature, I come away with a clearer understanding. There’s nothing like a baby free-floating in the abdomen to drive home the lessons from a prospective study of risk factors for uterine rupture. And that clarity of understanding will serve the next at-risk patient I encounter."
Submitted by Steve Simon.
"On the nature of common errors. ...[M]ost of these mixups involve simple switches or oﬀsets. These mistakes are easy to make, particularly if working with Excel or if working with 0/1 labels instead of names... [A] mixup in annotation aﬀecting roughly a third of the data which was driven by a simple one-cell deletion from an Excel ﬁle coupled with an inappropriate shifting up of all values in the affected column only."
Forensic bioinformatics and reproducible research in high-throughput biology .
Submitted by Paul Alper
“As every student of statistics quickly learns, statistical significance is not the same as substantive significance. …. The latter is often addressed by going beyond the question of whether an association exists to ask how strong the relationship is. But there is an even more basic problem in this case – statistical significance tests are inappropriate for this data analysis. They can determine whether two values derived from a random sample are different enough that they cannot be attributed to chance (sampling error). But the FARS [Fatality Analysis Reporting System] data are not a sample of some larger data set; they represent the entire data set of fatal auto accidents. The authors [of this study] note this problem but choose to use this approach anyway, following ‘official recommendations and our previous practice.’”
STATS, July 14, 2011
Submitted by Margaret Cibes
"Coefficients for physical traits are on the average +0.2--not so high as for personality traits (+0.4) or religion (+0.9), but still significantly higher than zero. For a few physical traits the correlation is even higher that 0.2--e.g., an astonishing 0.61 for length of middle finger. At least unconsciously, people care more about their spouse's middle-finger length than about his or her hair color and intelligence!"
Submitted by Paul Alper
(Note: See 2D:4D in Chance News 44 for some other musings about the mysteries of finger length distributions.)
Freshman composition, for example, “does not demand that faculty ask existential questions.” Ditto for courses in “Security and Protective Services,” and “Business Statistics.” These are, she says, “fields of study with fairly definitive answers” and it would be hard to argue that they are “essential to civilization.” Those who teach these and similarly vocational subjects “don’t really need the freedom to ask controversial questions in discussing them.”
Naomi Schaefer Riley, The Faculty Lounges: and Other Reasons Why You Won't Get the College Education You Paid For
as quoted in Vocationalism, academic freedom and tenure, by Stanley Fish, New York Times, 11 July 2011
Submitted by Paul Alper
“Each of [spacecraft] Juno’s three [solar-powered] wings is 29 feet long and 9 feet wide, necessary given that Jupiter receives 25 times less sunlight than Earth.”
Marcia Dunn (Associated Press), Solar-powered Jupiter probe brings NASA into ‘new era’
published in The Tennessean, August 2, 2011, page 3A.
Submitted by Emil Posavac
"The Bureau of Labor Statistics (BLS) projects that the U.S. will add 15.3 million new jobs between 2008 and 2018, and a whopping 15 out of 30 jobs with the most projected openings and vacancies will pay wages that are above the national median wage for all workers in the United States."
Christian Hudspeth, "Employees Wanted: 10 Middle-Class Jobs That Are Actually Growing", August 3, 2011 
Submitted by Chris Andrews
Discussion of Ariely
A post in Chance News 74 described Dan Ariely's 2008 book Predictably Irrational: The Hidden Forces That Shape Our Decisions as a great summer read , while pointing out that it was not written as an academic work. Paul Alper wrote to say that he had occasion to review the book in the context of some related work, and had identified some statistical concerns. As Paul writes:
Ariely enjoys concocting experiments to demonstrate the irrationality. For example, he finds that satisfaction with a product depends on the price paid for the product--for example Bayer aspirin vs. the identical generic. Or, the enticing but utterly misleading “Free gift” will alter a decision. Reviewers loved his book. Nonetheless, there are some serious shortcomings.
- He invariably gives the average value of one group (e.g., satisfaction of Bayer aspirin users) compared to the other group (e.g., satisfaction of generic aspirin users) but he almost never indicates the variability. Averages alone are meaningless.
- Almost never does he state how many subjects are involved in each arm of a study.
- Almost all of his samples are convenience ones, rather than random samples.
- Almost all of his samples are MIT students, but his implicit inference is to the world at large.
- His examples of predictable irrationality appear unfailingly successful leading me to suspect a “file-drawer” issue--experiments which showed nothing in particular or the negative of what he theorizes, are put aside and not counted.
The earlier post also noted that Ariely has a new book, The Upside of Irrationality: The Unexpected Benefits of Defying Logic at Work and at Home. This was reviewed by the New York Times; you can find a link to the review and read Ariely's reaction on his blog here. He notes that that he had consciously adopted a more conversational style in the book, and that this had drawn some criticism from the Times. He invited readers to submit their own opinions on this. Readers come down on both sides, and it is interesting to read the comments. One statistically minded reader wrote:
Of course your [sic] irrationally asking for personal thoughts in comments instead of a (slightly) more accurate poll or a (very) accurate scientific survey.
Submitted by Bill Peterson, based on a message from Paul Alper
The perils of genetic testing
How Bright Promise in Cancer Testing Fell Apart by Gina Kolata, New York Times, July 7, 2011.
We have seen a lot of advances in genetics recently, and there has been the hope that these would translate into better clinical care. But making the bridge from the laboratory to clinical practice has been much more difficult than expected. A program at Duke, for example, was supposed to identify weak spots in a cancer genome so that drugs could be targeted to that weak spot rather than just trying a range of different drugs in sequence.
But the research at Duke turned out to be wrong. Its gene-based tests proved worthless, and the research behind them was discredited. Ms. Jacobs died a few months after treatment, and her husband and other patients’ relatives are suing Duke.
The problems at Duke are not an isolated problem.
The Duke case came right after two other claims that gave medical researchers pause. Like the Duke case, they used complex analyses to detect patterns of genes or cell proteins. But these were tests that were supposed to find ovarian cancer in patients’ blood. One, OvaSure, was developed by a Yale scientist, Dr. Gil G. Mor, licensed by the university and sold to patients before it was found to be useless.
The other, OvaCheck, was developed by a company, Correlogic, with contributions from scientists from the National Cancer Institute and the Food and Drug Administration. Major commercial labs licensed it and were about to start using it before two statisticians from M. D. Anderson discovered and publicized its faults.
The two statisticians, Keith Baggerly and Kevin Coombes, have made a career of debunking medical claims. In 2004, they (along with another M.D. Anderson statistician, Jeffrey Morris, published a paper that demonstrated serious flaws in the use of proteomic mass spectra to identify early ovarian cancer from normal tissue. The complex method proposed by Petrocoin et al in 2002 was apparently an artefact of equipment drift that would have been prevented if the original researchers had taken simple steps like randomizing the order of analysis of cancer and normal tissues.
Baggerly and Coombes had also found problems with the data supporting the Duke test.
Dr. Baggerly and Dr. Coombes found errors almost immediately. Some seemed careless — moving a row or a column over by one in a giant spreadsheet — while others seemed inexplicable. The Duke team shrugged them off as "clerical errors."
Even though Baggerly and Coombes published a critique in a statistics journal, the Duke team continued to promote their genetic test. It was something else entirely that caused the problems with the Duke test to be treated seriously by the broader research community.
The situation finally grabbed the cancer world’s attention last July, not because of the efforts of Dr. Baggerly and Dr. Coombes, but because a trade publication, The Cancer Letter, reported that the lead researcher, Dr. Potti, had falsified parts of his résumé. He claimed, among other things, that he had been a Rhodes scholar.
Researchers in this area have a new-found sense of humility.
With such huge data sets and complicated analyses, researchers can no longer trust their hunches that a result does — or does not — make sense.
Submitted by Steve Simon
In measuring hunger, quality may be more important than quantity
A Taste Test for Hunger Robert Jensen and Nolan Miller, New York Times, July 9, 2011.
The traditional measure of global hunger is the number of calories consumed. If you consume less calories than you need, then you are classified as hungry. But this had led to some paradoxical results. There is an alternative way of measuring hunger. You need to
start with a baseline, namely the share of calories people get from the cheapest foods available to them: typically staples like rice, wheat or cassava. We call this the “staple calorie share.” We measure how many calories people get from these low-cost foods and how much they get from more expensive foods like meat. The greater the share of calories they receive from the former, the hungrier they are.
The rationale for this is simple.
Imagine you are a poor consumer in a developing country. You have very little money in your pocket, not enough to afford all the calories you need. And suppose you have only two foods to choose from, rice and meat. Rice is cheap and has a lot of calories, but you prefer the taste of meat. If you spent all your money on meat, you would get too few calories. You might do this every so often, but usually you would spend almost all of your money on rice; when faced with true hunger, taste is a luxury you can’t afford.
But suppose you had a bit more money. You would probably add some meat to your diet, because now you can afford to do so while still getting the calories you need. You might even like meat so much that you start to switch away from rice even if you haven’t quite met your complete calorie needs, as long as you aren’t too far below.
The authors argue that this approach removes some of the variations associated with the traditional calorie count method: some people need more calories than others, for example. They illustrate how the new measure of hunger performs better than the traditional measure in explaining trends in hunger in China from 1991 to 2001.
Submitted by Steve Simon
1. What are some other ways that you might assess hunger on a global scale?
2. What are some possible pitfalls to the use of this new measure of hunger.
More on US visa lottery
"Plaintiffs Lose Fight Over Visa Lottery"
by Miriam Jordan, The Wall Street Journal, July 15, 2010
Applicants who had been notified that they had won a US visa in the May 2011 lottery drawing have lost a federal court case in which they tried to stop a new drawing of the 2011 lottery. (See Chance News 74.)
It turns out that the May sampling process had violated a legal requirement of randomness: "a computer error caused 90% of the 50,000 winners to be selected from the first two days of applications instead of from the entire 30-day registration period."
The court hearing earlier this week focused on the meaning of "random." Lawyers for the plaintiffs argued that they had been randomly chosen and that they didn't know that by filling their applications in the first two days that they would gain any advantage. They also contended that the "outcome was indeed not uniform, but nevertheless still random as required by law."
The State Department argued that the results didn't represent a fair, random selection.
This example reminds me of the 1970 Vietnam War draft lottery issues related to sampling in an inadequately stirred-up jar of birthday capsules. For this first draft lottery since 1940, 366 balls were dumped from a box into a glass jar, from which balls were drawn. The result was that higher numbers (fall birthdays) were somewhat more likely to have been chosen first and, consequently, their holders somewhat more likely to have been drafted into military service. See a 10-minute video of the actual Fall 1969 drawing for the 1970 draft, “CBS news draft lottery nov 1969”, and/or Norton Starr's 1997 article, "Nonrandom Risk: The 1970 Draft Lottery", for the raw data and a statistical discussion of it.
Submitted by Margaret Cibes
1. The article does not specify the daily distribution of applications over the 30-day application period.
(a) If the daily distribution of applications had been uniform over the 30 days, what percent of the 50,000 winners would you have expected from a random selection process?
(b) If the distribution of applications had been skewed so that an overwhelming majority of them had been received during the first two days, could the resulting 22,000 "winners" have been expected?
2. Would you consider the May result an "error"? Would you attribute the "error" to a "computer"?
3. Might you have agreed with the plaintiffs that the May outcome was "still random"?
Note: See Diversity Visa Lottery 2011 Results for official figures and other information from the U.S. State Department.
More cancer risk for taller women?
“Cancer Risk Increases With Height”
by Sten Stovall, The Wall Street Journal, July 21, 2011
British researchers studied 1.3 million middle-aged women in Britain during the period 1996-2001 and reported their results in The Lancet Oncology, August 2011. (See an abstract of the article, “Height and cancer incidence in the Million Women Study”.) Their finding – that height was positively correlated with cancer risk – is apparently consistent with the results of earlier studies around the world.
Stovall notes in the article that height depends directly upon factors such as diet and infections in childhood, growth hormone levels, and genetic factors, and hence, in a commentary accompanying the article, another researcher noted:
In the future, researchers need to explore the predictive capacities of direct measures of nutrition, psychosocial stress, and illness during childhood, rather than final adult height.
A blogger suggested that “a possible reason for any correlation between height and cancer risk could be simply that the taller you are, the bigger your organs are and the more cells they have; hence, the greater the probability that one of those cells will become cancerous.”
Submitted by Margaret Cibes
1. Stovall uses the term “risk,” while the journal article abstract contains the term “relative risk.” What’s the difference, if any? Does it change his interpretation of the research results?
2. What statistical term would you use to refer to the other “direct measures,” with respect to inferring causation in the relationship between height and cancer risk, or relative risk?
3. Responding to the blogger, one of the authors of the study responded that they did discuss the issue of body mass in their Lancet article. If you can access the full text of the article, look it up and see if you can tell whether the blogger’s point was valid.
Airline’s random boarding strategy
“Airlines Go Back to Boarding School to Move Fliers Onto Planes Faster”
by Scott McCartney, The Wall Street Journal, July 21, 2011
… American Airlines undertook a two-year study to try and speed up boarding. The result: The airline has recently rolled out a new strategy—randomized boarding. Travelers without elite status get assigned randomly to boarding groups instead of filing onto planes from back to front.
The article discusses problems with other boarding strategies and reports that AA expects to be able to “shave three to four minutes off the average boarding time of 20 to 25 minutes.”
Submitted by Margaret Cibes
1. For a mathematical discussion of the problem from a few years back, see the 2006 post Aircraft boarding by the numbers in Ivars Peterson's MathTrek column for the MAA. It summarizes an intriguing mathematical analysis of the problem, Analysis of airplane boarding via space-time geometry and random matrix theory. As described by Peterson
The model demonstrates that boarding from back to front is less efficient than letting everyone board at the same time. The problem with the back-to-front approach is that everyone else has to wait until the passengers in the designated row get settled, so, in effect, the boarding time is proportional to the number of passengers.
"Among row-dependent policies which do not severely constrain passengers, random boarding (no policy) is almost optimal," Bachmat and his colleagues report. For random boarding, boarding time is roughly proportional to the square root of the number of passengers.
2. Paul Alper wondered about outliers in the boarding time distribution, and wrote to point out that, as described the article, capitilism had offered the following solution:
For passengers who find American's random boarding too risky, there are options to board early. "Express Seating" offers Group 1 boarding and a seat at the front of the airplane for 14 to 54 dollars, depending on demand and distance of the flight; a "Boarding and Flexibility" package at time of booking costs 19 dollars and gets some added benefits in making reservation changes, or there's early boarding at check-in, which costs a 9 dollar fee.
Date Lab: By the numbers
Washington Post, 28 July 2011
These data go with a story in the Lifestyle section. One of the opening paragraphs tells it all:
Spark, chemistry, x-factor, zing — curse you, whatever you are. For five years, the editors of Date Lab have been sending single Washingtonians out on blind dates and recounting the highs (and lows) for our readers. And after wining and dining 250 couples — 20-somethings, senior citizens, straight, gay, rich, not-so-rich, and nearly every race on the Census form — the one thing we know is this: No matter how perfect a pair seems on paper, no matter how sure we are that the spark will ignite, there’s no predicting when that indefinable romantic catalyst will show itself.The graphic summarizes responses to a survey about the Datelab experience
Click here for full-size image from Washington Post.
Then comes the disclaimer: "And just like our matching process, it was unscientific. Results reflect responses from about 35 percent of past participants."
Apropos of online dating, here is an intriguing quote from p. 220 Dan Ariely's book (discussed above):
Talk about market failures. A ratio [5.2 hours per week searching profiles and 6.7 hours per week e-mailing potential partners vs. 1.8 hours per week spent actually meeting any prospective partners] worse than 6:1 speaks for itself. Imagine driving six hours in order to spend one hour at the beach with a friend (or, even worse, with someone you don't really know and are not sure you will like). Given these odds, it seems hard to explain why anyone in their right mind would intentionally spend time on online dating.
Submitted by Paul Alper
A Wimbledon coincidence
John Isner vs. Nicolas Mahut, again
ESPN.com News Services, 18 June 2011
At last year's Wimbledon tournament, the first-round pairing of Isner and Mahut set a record for the longest tennis match in history. Their fifth set went to 138 games before Isner prevailed, 70-68. The match concluded on its third day, after being twice delayed by nightfall.
As fate would have it, the two were paired again in the first round this year! Such coincidences always provoke hyperbole, as in this quote from the Guardian on June 17 : "Isaac Newton would struggle to calculate the odds of John Isner and Nicolas Mahut being thrown together in the first round at Wimbledon a year after their historic match on Court 18." Therefore, it was refreshing to see ESPN article calmly state that "[t]he probability that Isner and Mahut would draw each other again in this year's Wimbledon first round -- given that they were both unseeded -- was 0.7 percent, which is 7 in 1,000 or about 1 in 143, according to ESPN Stats & Information."
To understand this calculation, one needs to know about the how the pairings are made. An explanation can be found in a June 20 letter Game, set, match … and the odds at Wimbledon, responding to the original Guardian report:
If there were no seeds, the calculation would be simplicity itself: there are 127 players whom Isner could face in the first round, one of whom is Mahut, so the probability is 1 in 127. The presence of seeds complicates the calculation, but only slightly: given that there are 32 seeds, each of whom faces a non-seed in the first round, and that Isner and Mahut are both unseeded, the probability is 64/96 x 1/95, which works out as 1 in 142.5. (The first fraction represents the probability that Isner is drawn in a position where he faces another non-seed; the second is the probability that Mahut is drawn as his opponent.) Thus, although this rematch is an improbable event, the odds against it are by no means astronomical.
Do you understand the above calculation? What assumptions are being made?
Submitted by Bill Peterson